<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#File-formats" data-toc-modified-id="File-formats-1">File formats</a></span><ul class="toc-item"><li><span><a href="#Text" data-toc-modified-id="Text-1.1">Text</a></span><ul class="toc-item"><li><span><a href="#Reading" data-toc-modified-id="Reading-1.1.1">Reading</a></span></li></ul></li></ul></li></ul></div>

In [1]:
import os

os.chdir('/home/lt/GitHub/introduction-to-programming/topics/examples')

# Files

Thus far, our programs have interacted with the wider world only via the `input()` function, for getting information from a human being, and via the `print()` function, for displaying text results on the screen. By now you are probably fairly tired of this pattern. Maybe more than once you have run your program, forgotten that it requires you to type some input into the console, and then spent several minutes idly wondering why nothing is happening. Or you have repeatedly hit the wrong keys by accident and been treated to some infuriating error messages.

So you will be glad to hear that we are (mostly) going to leave the `input()` function behind from now on. We have mainly been using `input()` as a crutch, a simple form of interactivity to get us started with programming. Most real programs use `input()` only sparingly, if at all. Instead, they have other methods for getting hold of external data. The first of these methods that we will learn about is reading in data from a file stored on our computer. We will also learn a bit about how to write the results of our programs into new files.

## File formats

There are lots of different file formats, each of which stores information in different ways. For example, Microsoft Word *docx* files store text along with various additional pieces of information about how that text is to be displayed, *jpg* image files store information about the colors of the pixels of an image, and some files, such as *exe* files, store entire programs in a form that it is not feasible for a human being to read and interpret. When working with files, we will need to be aware of the nature of the particular file format we are dealing with, and instruct Python to read, or write, it appropriately.

### Text

Let's start with a file format that is fairly easy to deal with: plain text. Plain text files just store text characters (although as we will see, there are some tricky subtleties to consider even here). We can open plain text files and view their contents in a normal text editor such as Notepad.

If you want to follow along with the examples in this lesson, make sure you have first downloaded the [example programs and data files](examples/data/intro_prog_examples.zip) for the class and that you have unzipped this file in your working directory so that 'data' is a subdirectory of your working directory. Like this:

![](images/data_subdirectory.png)

Now find the example text file [melville-moby_dick.txt](examples/data/melville-moby_dick.txt). The *txt* file extension indicates that this file should contain plain text. Note that one of the many profoundly stupid default options in Windows is that file extensions such as *.py* and *.txt* are not displayed, so the file may appear only as 'melville-moby_dick' in your file explorer. Its name is still 'melville-moby_dick.txt', and this is how Python and other programs will want you to refer to it. See [here](https://fileinfo.com/help/windows_10_show_file_extensions) for how to change this option if you would prefer to see file extensions.

You can open a plain text file in your preferred text editor. Since the editor window in Spyder is just a fancy text editor, you can open it there if you like. This might even be the most convenient way to view it, since you will have it open just next to the Spyder console and will be able to see its contents as you try out the example commands below for opening and reading the file.

#### Reading

Let's open the file from Python. There is a [built-in](extras/glossary.md#builtin) function for this, called simply `open()`. Its input argument is a string giving the [path](extras/glossary.md#path) to the file. If the file is located in our current working directory, then the path is simply the name of the file. But in this case, the file is located in a subdirectory called 'data', so we need to put this together with the file name to build the path. (Take a look back at the use of `os.path` [in the previous lesson](standard_library.ipynb#Paths) if you need to remind yourself how this works.)

As with most things, we should [assign](extras/glossary.md#assignment) the result of the `open()` function to a new variable, so that we can then work with it in the rest of our program. If we are working with just a single file, `f` is a convenient choice of variable name. We will use that. But note that if we were working with multiple files it would be better for the clarity of our program if we chose a variable name that says something about which particular file we have opened.

In [2]:
import os

filepath = os.path.join('data', 'melville-moby_dick.txt')
print(filepath)

f = open(filepath)

data/melville-moby_dick.txt
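
If you are curious about what `os.path.join()` has built for us, here is a small sketch that takes the path apart again. It does not open the file, so it runs whether or not you have downloaded the data. The use of `os.path.dirname()`, `os.path.basename()` and `os.path.abspath()` is our own illustration here, not something the rest of the lesson relies on.

```python
import os

filepath = os.path.join('data', 'melville-moby_dick.txt')

# The separator placed between 'data' and the file name depends on the
# operating system: a forward slash on Linux and macOS, a backslash on Windows.
print(os.sep)

# Split the path back into its directory part and its file name part.
directory_part = os.path.dirname(filepath)
filename_part = os.path.basename(filepath)
print(directory_part)
print(filename_part)

# A relative path like this one is interpreted starting from the current
# working directory. os.path.abspath() shows the full path that results.
full_path = os.path.abspath(filepath)
print(full_path)
```

The last line printed is where Python will actually look for the file, which is worth knowing if `open()` ever complains that it cannot find it.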


Note that if `open()` cannot find the requested file, it [raises](extras/glossary.md#raise) a `FileNotFoundError`, which looks like this:

In [3]:
open('nonexistent_file.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'nonexistent_file.txt'

If you see this error message, check the following:

* Did you first follow the steps for downloading the file?
* Have you got the name of the file right?
* Is the file in a subdirectory called 'data'?
* Is the 'data' subdirectory in your current working directory? (Try `os.getcwd()` at the Spyder console to see the path to your current working directory.)
* Did you use `os.path.join()` correctly? (Check the output of `print(filepath)` after the `os.path.join()` command above.)
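
Some of the checks in the list above can be carried out from Python itself. The following is only a rough sketch of how that might look, using `os.path.isdir()`, `os.path.isfile()` and `os.listdir()` from the standard library. It does not open the file, so it will run (and report what it finds) even if the file is missing.

```python
import os

# The same relative path as in the lesson.
filepath = os.path.join('data', 'melville-moby_dick.txt')

# Where does Python currently start from when it interprets relative paths?
print('Working directory:', os.getcwd())

# Is there a 'data' subdirectory in the working directory?
data_dir_found = os.path.isdir('data')
print("'data' directory found:", data_dir_found)

# Is the file itself at the path we built?
file_found = os.path.isfile(filepath)
print('File found:', file_found)

# If the directory is there but the file is not, list what the directory
# actually contains. This helps to spot a misspelled file name.
if data_dir_found and not file_found:
    print(os.listdir('data'))
```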

So what [type](extras/glossary.md#type) does the `open()` function [return](extras/glossary.md#return)?

In [4]:
type(f)

_io.TextIOWrapper

If you were hoping it would just be a [string](extras/glossary.md#string) containing the contents of the file, you will be disappointed. As is sometimes the case, an intermediate step lies between us and our seemingly simple goal. First we open the file, *then* we read in its contents.

The curiously-named `TextIOWrapper` is a data type specifically for connecting to text files, then reading from and writing to them. The 'IO' part of its name stands for [Input/Output](extras/glossary.md#IO), an abbreviation used quite broadly in computing to refer to any process that involves getting information from, or sending information to, some resource that is external to the computer program, such as a human being, the internet, or a file.
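
To get a feel for what this 'connection' to a file is, we can ask it a few things about itself. The sketch below is an illustration under some assumptions: so that it can run anywhere, it first creates a small throwaway file (the name 'sample.txt' and the variable names are made up for this example), and it passes the optional `encoding` argument to `open()`, which the lesson's own example leaves out.

```python
import os
import tempfile

# Set-up for the demonstration only: create a small throwaway text file.
# 'sample.txt' is a made-up name, not one of the class data files.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, 'sample.txt')
sample_out = open(sample_path, 'w', encoding='utf-8')
sample_out.write('Call me Ishmael.')
sample_out.close()

# Open the sample file for reading, as in the lesson.
sample_f = open(sample_path, encoding='utf-8')

print(sample_f.name)      # the path that we passed to open()
print(sample_f.mode)      # 'r', meaning opened for reading (the default)
print(sample_f.encoding)  # the text encoding used to interpret the file
print(sample_f.closed)    # False, because the connection is still open

# Close the connection once we are finished with it.
sample_f.close()
```

Notice that none of these tell us anything about the *contents* of the file. We have a connection to the file, but have not yet read anything from it.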

This new type has [methods](extras/glossary.md#method) for reading and writing, as we can see if we apply the `dir()` function:

In [5]:
dir(f)

['_CHUNK_SIZE',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__next__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_checkClosed',
 '_checkReadable',
 '_checkSeekable',
 '_checkWritable',
 '_finalizing',
 'buffer',
 'close',
 'closed',
 'detach',
 'encoding',
 'errors',
 'fileno',
 'flush',
 'isatty',
 'line_buffering',
 'mode',
 'name',
 'newlines',
 'read',
 'readable',
 'readline',
 'readlines',
 'seek',
 'seekable',
 'tell',
 'truncate',
 'writable',
 'write',
 'writelines']
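
That is a long list, and many of the names in it begin with an underscore, which by convention marks them as being for internal use. As a small sketch of how we might trim the list down, the loop below keeps only the names that do not begin with an underscore. It applies `dir()` to the type `io.TextIOWrapper` itself rather than to an open file, which is an assumption made here so that the example runs without needing any file.

```python
import io

# Collect only the names that do not start with an underscore.
public_names = []
for name in dir(io.TextIOWrapper):
    if not name.startswith('_'):
        public_names.append(name)

print(public_names)
```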

The `read()` method gets us the contents of the text file as a string. (Remember [how to use methods](types.ipynb#Methods).)

In [6]:
text = f.read()

type(text)

str
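
One thing that can catch you out: the file object keeps track of how far through the file it has read. After `read()` has reached the end of the file, calling `read()` again gets only an empty string. The sketch below demonstrates this on a small throwaway file (the file name and variable names are made up for this example), and uses the `seek()` method, which also appears in the `dir()` output above, to go back to the start.

```python
import os
import tempfile

# Set-up for the demonstration only: create a small throwaway text file.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, 'sample.txt')
sample_out = open(sample_path, 'w', encoding='utf-8')
sample_out.write('Call me Ishmael.')
sample_out.close()

sample_f = open(sample_path, encoding='utf-8')

# The first read gets the whole contents of the file.
first_read = sample_f.read()
print(repr(first_read))

# We are now at the end of the file, so a second read gets nothing.
second_read = sample_f.read()
print(repr(second_read))

# seek(0) moves back to the start, after which we can read again.
sample_f.seek(0)
third_read = sample_f.read()
print(repr(third_read))

sample_f.close()
```

So if `read()` ever unexpectedly gives you an empty string, check whether you have already read the file once.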

In [7]:
print(text[:433])

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.



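Now we can do all the various fun things that we know how to do with strings, such as count the number of words, count the occurrences of a particular word, and so on. For example, the following counts how many times the characters 'whale' appear in the text, ignoring case. (Note that this count includes longer words that contain 'whale', such as 'whales'.)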

In [8]:
target_word = 'whale'
n_occurrences = text.lower().count(target_word)

print(n_occurrences)

1685
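
The `count()` method counts occurrences of a sequence of characters, wherever they appear. If we wanted to count 'whale' only where it stands as a whole word, one possible approach (a sketch, not the only way, and using the standard library's `re` module, which this lesson has not introduced) is to search for the word with a word boundary on either side. The example uses a short made-up sentence rather than the novel so that the counts are easy to check by eye.

```python
import re

# A short made-up sentence for illustration.
sample_text = 'The whale! Whales and whalemen pursued the great whale.'
target_word = 'whale'

# count() finds the characters 'whale' anywhere, including inside
# 'whales' and 'whalemen'.
n_substring = sample_text.lower().count(target_word)
print(n_substring)

# In a regular expression, \b marks a word boundary, so this pattern
# matches 'whale' only when it is not part of a longer word.
pattern = r'\b' + re.escape(target_word) + r'\b'
n_whole_word = len(re.findall(pattern, sample_text.lower()))
print(n_whole_word)
```

Here the first count is 4 and the second is 2, since only 'whale!' and 'whale.' contain the word on its own.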


But processing and analyzing natural language texts is the topic of a future lesson. We will for now stick to the drudgery of handling files.
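
One piece of that drudgery is tidying up after ourselves. A file that has been opened should be closed again once we are finished with it, using the `close()` method that appears in the `dir()` output above. A common way of making sure this happens is Python's `with` statement, which closes the file automatically at the end of the indented block. The following is a minimal sketch of that pattern on a throwaway file (the file name and variable names are made up for this example).

```python
import os
import tempfile

# Set-up for the demonstration only: create a small throwaway text file.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as sample_out:
    sample_out.write('Call me Ishmael.')

# The file is open only inside the indented block.
with open(sample_path, encoding='utf-8') as sample_f:
    sample_text = sample_f.read()

# The block has ended, so the file has been closed for us,
# but the string that we read from it is still available.
print(sample_f.closed)
print(sample_text)
```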

In [9]:
os.chdir('/home/lt/GitHub/introduction-to-programming/topics')