<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#File-formats" data-toc-modified-id="File-formats-1">File formats</a></span><ul class="toc-item"><li><span><a href="#Plain-text" data-toc-modified-id="Plain-text-1.1">Plain text</a></span><ul class="toc-item"><li><span><a href="#Reading" data-toc-modified-id="Reading-1.1.1">Reading</a></span></li><li><span><a href="#Writing" data-toc-modified-id="Writing-1.1.2">Writing</a></span></li></ul></li></ul></li></ul></div>

In [1]:
import os

def cleanup():
    try:
        os.remove('my_new_file.txt')
    except FileNotFoundError:
        pass

os.chdir('/home/lt/GitHub/introduction-to-programming/topics/examples')
cleanup()

# Files

Thus far, our programs have interacted with the wider world only via the `input()` function, for getting information from a human being, and via the `print()` function, for displaying text results on the screen. By now you will probably be fairly tired of this pattern. Maybe more than once you have run your program, then forgotten that it requires you to type some input into the console, and so spent several minutes idly wondering why nothing is happening. Or you have repeatedly hit the wrong keys on your keyboard by accident and been treated to some infuriating error messages. So you will be glad to hear that we are (mostly) going to leave the `input()` function behind from now on. We have mainly been using `input()` as a crutch, a simple form of interactivity to get us started with programming. Most real programs use `input()` only sparingly, if at all. Instead, they have other methods for getting hold of external data. The first of these methods that we will learn about is reading in data from a file stored on our computer. We will also learn a bit about how to write the results of our programs into new files.

## File formats

There are lots of different file formats, each of which stores information in different ways. For example, Microsoft Word *docx* files store text along with various additional pieces of information about how that text is to be displayed, *jpg* image files store information about the colors of the pixels of an image, and some files, such as *exe* files, store entire programs in a form that it is not feasible for a human being to read and interpret. When working with files, we will need to be aware of the nature of the particular file format we are dealing with, and instruct Python to read, or write, it appropriately.

### Plain text

Let's start with a file format that is fairly easy to deal with: plain text. Plain text files just store text characters (although as we will see, there are some subtleties to consider even here). We can open plain text files and view their contents in a normal text editor.

If you want to follow along with the examples in this lesson, make sure you have first downloaded the [example programs and data files](examples/data/intro_prog_examples.zip) for the class and that you have unzipped this file in your working directory so that 'data' is a subdirectory of your working directory. Like this:

![](images/data_subdirectory.png)

Now find the example text file [melville-moby_dick.txt](examples/data/melville-moby_dick.txt). The *txt* extension indicates that this file should contain plain text. It contains the full text of the novel *Moby Dick* by Herman Melville.

(Note that one of the many profoundly stupid default options in Windows is that file extensions such as *.py*, *.txt*, etc. are not displayed, so the file may appear only as *melville-moby_dick* in your file explorer, but its name is still *melville-moby_dick.txt*, and this is how Python and other programs will want you to refer to it. See [here](https://fileinfo.com/help/windows_10_show_file_extensions) for how to change this option if you would prefer to be able to see file extensions.)

You can open a plain text file in your preferred text editor. Since the editor in Spyder is just a fancy text editor, you can open it there if you like. This might even be the most convenient way to view it, since you will have it open just next to the Spyder console and will be able to see its contents as you try out the example commands below for opening and reading the file.

#### Reading

Let's now open the file from within a Python program. There is a [built-in](extras/glossary.md#builtin) function for this, called simply `open()`. The input argument is a string containing the [path](extras/glossary.md#path) to the file we want to open. If the file is located in the same directory as our program, then the path is simply the name of the file. But in this case, the file is located in a directory called 'data', so we need to put this together with the file name to build the full path. (Take a look back at the use of `os.path` [in the previous lesson](standard_library.ipynb#Paths) if you need to remind yourself how this works.)

We should [assign](extras/glossary.md#assignment) the result of calling the `open()` function to a new variable, so that we can then work with it in the rest of our program. If we are working with just a single file, `f` is a convenient choice of variable name. We will use that. But note that if we were working with multiple files it would be better for the clarity of our program if we chose a variable name that says something about which particular file we have opened.

In [2]:
import os

filepath = os.path.join('data', 'melville-moby_dick.txt')
print(filepath)

data/melville-moby_dick.txt


In [3]:
f = open(filepath)

Note that if `open()` cannot find the requested file, it [raises](extras/glossary.md#raise) a `FileNotFoundError`, which looks like this:

In [4]:
open('nonexistent_file.txt')

FileNotFoundError: [Errno 2] No such file or directory: 'nonexistent_file.txt'

If you see this error message, check the following:

* Did you first follow the steps for downloading the file?
* Have you got the name of the file right?
* Is the file in a subdirectory called 'data'?
* Is the 'data' subdirectory in your current working directory? (Try `os.getcwd()` at the Spyder console to see the path to your current working directory.)
* Did you use `os.path.join()` correctly? (Check the output of `print(filepath)` after the `os.path.join()` command above.)
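Some of the checks in this list can also be carried out from within Python. The following is only a rough sketch: the helper function `describe_path()` is a hypothetical name invented here for illustration, and is not part of the lesson's example programs. It uses `os.path.isfile()` and `os.path.isdir()` to work out which part of a path is missing, and the final few lines show how a `FileNotFoundError` can be caught with `try` and `except` rather than stopping the program.

```python
import os

# Hypothetical helper for illustration only (not part of the lesson's examples).
def describe_path(path):
    """Return a short message saying what is found (or missing) at 'path'."""
    if os.path.isfile(path):
        return 'found: ' + path
    # An empty directory part means 'the current working directory'.
    folder = os.path.dirname(path) or os.getcwd()
    if not os.path.isdir(folder):
        return 'missing directory: ' + folder
    return 'directory exists, but no such file: ' + path

filepath = os.path.join('data', 'melville-moby_dick.txt')

print(os.getcwd())
print(describe_path(filepath))

# Catching the error instead of letting it stop the program.
try:
    f = open('nonexistent_file.txt')
except FileNotFoundError as error:
    print('Could not open the file:', error)
```

What the second `print()` shows depends on your own working directory, so no output is given here.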

So what [type](extras/glossary.md#type) does the `open()` function [return](extras/glossary.md#return)?

In [5]:
type(f)

_io.TextIOWrapper

If you were hoping it would just be a [string](extras/glossary.md#string) containing the contents of the file, you will be disappointed. As is sometimes the case, an intermediate step lies between us and our seemingly simple goal. First we open the file, *then* we read in its contents.

`type()` tells us that `open()` has [returned](extras/glossary.md#return) a `TextIOWrapper`. This is a data type specifically for connecting to text files, then reading from and writing to them. The 'IO' part stands for [Input/Output](extras/glossary.md#IO). This abbreviation is used quite broadly in computing to refer to any process that involves getting or sending information from or to some resource that is external to the computer program, such as a human being, the internet, or a file.

This entity (which we have now stored in our `f` variable) is more commonly and more simply referred to as a 'file object'. A file object has [methods](extras/glossary.md#method) for reading and writing, as we can see if we apply the `dir()` function:

In [6]:
dir(f)

['_CHUNK_SIZE',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__next__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_checkClosed',
 '_checkReadable',
 '_checkSeekable',
 '_checkWritable',
 '_finalizing',
 'buffer',
 'close',
 'closed',
 'detach',
 'encoding',
 'errors',
 'fileno',
 'flush',
 'isatty',
 'line_buffering',
 'mode',
 'name',
 'newlines',
 'read',
 'readable',
 'readline',
 'readlines',
 'seek',
 'seekable',
 'tell',
 'truncate',
 'writable',
 'write',
 'writelines']

The `read()` method returns the contents of the text file as a string (remember [how to use methods](types.ipynb#Methods)).

In [7]:
text = f.read()

type(text)

str
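One detail of `read()` that can cause confusion is that a file object keeps track of how far it has already read. A second call to `read()` therefore carries on from the end of the file and returns an empty string. Here is a small sketch of this behavior. To avoid depending on any file on disk, it assumes that `io.StringIO` (an in-memory object from the standard library that behaves like an opened text file) is an acceptable stand-in for a real file object.

```python
import io

# io.StringIO is an in-memory stand-in for an opened text file (an assumption
# made so that this sketch does not depend on any file on disk).
f = io.StringIO('Call me Ishmael.\nSome years ago...\n')

first = f.read()    # reads everything, up to the end
second = f.read()   # nothing is left to read, so this is an empty string

print(repr(first))
print(repr(second))

f.seek(0)           # move back to the very beginning
again = f.read()    # now the full contents can be read once more

print(again == first)
```

The `seek()` method appears in the output of `dir(f)` above. Calling it with `0` moves the reading position back to the start.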

Famously, the opening line of *Moby Dick* is 'Call me Ishmael.' But as we can see from printing out the first few hundred characters, this isn't quite true:

In [8]:
print(text[:433])

[Moby Dick by Herman Melville 1851]


ETYMOLOGY.

(Supplied by a Late Consumptive Usher to a Grammar School)

The pale Usher--threadbare in coat, heart, body, and brain; I see him
now.  He was ever dusting his old lexicons and grammars, with a queer
handkerchief, mockingly embellished with all the gay flags of all the
known nations of the world.  He loved to dust his old grammars; it
somehow mildly reminded him of his mortality.



Chapter 1 of the novel proper does begin with the famous opener, but it is preceded by a preamble in which two other narrators, the 'pale Usher' and the 'sub-librarian', discuss the etymology of the word 'whale' and various extracts from other books on the subject of whales. This piece of trivia, along with your Python programming skills, is something that you can seriously impress your friends and colleagues with.

So, now that we have the string text of the novel, we can do all the various things that we know how to do with strings, such as count the number of words, count the occurrences of a particular word, and so on. For example:

In [9]:
target_word = 'whale'
n_occurrences = text.lower().count(target_word)

print(n_occurrences)

1685
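Note that `count()` counts every place where the characters `'whale'` appear, including inside longer words such as 'whales' or 'whaleman'. The sketch below contrasts this with counting whole words only. It uses a short made-up sentence rather than the novel itself, and stripping punctuation with `string.punctuation` is just one simple approach among several.

```python
import string

# A short made-up sample, standing in for the full text of the novel.
sample = 'The whale! The Whales were whaling, said the whaleman. A whale.'

# count() finds 'whale' anywhere, even inside longer words.
n_substring = sample.lower().count('whale')

# Alternative: split into words, strip punctuation from the ends of each
# word, and count only exact matches.
n_whole_word = 0
for word in sample.lower().split():
    if word.strip(string.punctuation) == 'whale':
        n_whole_word = n_whole_word + 1

print(n_substring)   # 4 ('whale!', 'whales', 'whaleman.', 'whale.')
print(n_whole_word)  # 2 ('whale!', 'whale.')
```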


If you are unable to rein in your excitement at this point, take a moment to go to the Spyder console and try out lots of fun things with the text of *Moby Dick* until you have calmed down. Processing and analyzing natural language texts is the topic of a future lesson; for now we will stick to the technical drudgery of handling files.

The first question we may reasonably ask is: Why did we have to first use a function to open the file and get a 'file object', and only then read in the contents of the file? If the only thing we want to do with the file is to read in its entire contents, then the intermediate step is superfluous, and we can simplify our program a little by applying the `read()` method directly to the result of [calling](extras/glossary.md#call) the `open()` function.

The syntax for this perhaps looks a little strange, but has its logic:

In [10]:
text = open(filepath).read()

When all we want to do is to read in the entire contents of a text file, we can stick to this one-liner combination of `open()` and `read()`.

When would we not want to read in the entire contents of the file? One such situation is if we want to search in the file until we find something that we are looking for, and then stop. For example, imagine that we want to find the first line in *Moby Dick* that contains the word 'whale'.

One way to do this would be to just read in the entire file and then search in the resulting string. We could for example split the entire string into lines, go through them in a [loop](extras/glossary.md#loop), then `break` out of the loop when we find a line containing `'whale'`. Like this:

In [11]:
lines = text.split('\n')

for line in lines:
    if target_word in line.lower():
        break

msg = "The first line containing '{}' is: [...] {}"
print(msg.format(target_word, line))

The first line containing 'whale' is: [...] name a whale-fish is to be called in our tongue leaving out, through


(If you are wondering about `split('\n')`, here we are overriding the default value of the [argument](extras/glossary.md#argument) to the `split()` method. The default is to split at any run of whitespace (spaces, tabs, and newlines alike), but we want to split only at every [newline character](extras/glossary.md#newline). We encountered the newline character briefly [in the lesson on functions](functions.ipynb#Keyword-arguments) when we looked at the `end` argument to the `print()` function.)
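To make the difference concrete, here is a small sketch comparing three ways of splitting a short made-up string. The variable names are chosen just for this illustration. The string method `splitlines()` is shown as well, since it is a common alternative to `split('\n')`.

```python
sample = 'Call me Ishmael.\nSome years ago\n'

by_whitespace = sample.split()       # default: split at any run of whitespace
by_newline = sample.split('\n')      # split at newline characters only
by_lines = sample.splitlines()       # split into lines

print(by_whitespace)  # ['Call', 'me', 'Ishmael.', 'Some', 'years', 'ago']
print(by_newline)     # ['Call me Ishmael.', 'Some years ago', '']
print(by_lines)       # ['Call me Ishmael.', 'Some years ago']
```

Notice the empty string at the end of the second result. It is there because the sample ends with a newline character, and `split('\n')` reports the (empty) text that follows it.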

But another way to do this would be to read in the contents of the file line-by-line, then stop reading when we find the line we are searching for. File objects are [iterable](extras/glossary.md#iterable). If we loop through a file object, each run of the loop gives us the next line from the file. (Take a look back at the [lesson on iteration](iteration.ipynb#Iterables) if you need to remind yourself about iterable types.)

So we could also find the first line containing `'whale'` like this:

In [12]:
f = open(filepath)

for line in f:
    if target_word in line.lower():
        break

print(msg.format(target_word, line))

The first line containing 'whale' is: [...] name a whale-fish is to be called in our tongue leaving out, through



Whether we first read in the entire file and then search in its contents, or whether we read the file line-by-line, the result is the same. The difference behind the scenes is that the former method loads the entire contents of the file into our computer's temporary memory, whereas the latter only needs to hold the current line (plus a small internal buffer) in memory at any one time, and therefore uses less of the computer's memory. In almost all cases, this difference will not be important, but here are some instances in which it might matter:

* The file we are reading is absolutely colossal (multiple squigabytes) and wouldn't all fit in our computer's temporary memory at once. In this case we have no choice but to read it in parts.
* The file is still very large (a squigabyte or two), so although it can be read in its entirety, the process of doing so slows down our program unacceptably. In this case, reading line-by-line may be faster if the line we are searching for could occur near the beginning of the file.
* We intend our program to be run on a system with very limited memory, such as a miniature device.
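The following sketch illustrates the 'stop reading early' idea on a tiny example. It again assumes that `io.StringIO` is an acceptable in-memory stand-in for a file object opened with `open()`. The variables `n_lines_read` and `found` are names invented for this illustration. Starting `found` at `None` also lets us tell apart the case in which no line contains the word.

```python
import io

# In-memory stand-in for an opened text file (an assumption; a file object
# returned by open() can be looped over in the same way).
f = io.StringIO('first line\nsecond line has a whale\nthird line\nfourth line\n')

target_word = 'whale'
n_lines_read = 0
found = None   # stays None if no line contains the target word

for line in f:
    n_lines_read = n_lines_read + 1
    if target_word in line.lower():
        found = line.rstrip('\n')
        break

print(n_lines_read)  # 2
print(found)         # second line has a whale

# The lines after the match were never processed by the loop.
remaining = f.read()
print(repr(remaining))  # 'third line\nfourth line\n'
```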

#### Writing

Next we might ask: Can we write new text to the file? After all, when we `dir()` the file object, we see a `write()` method.

Let's just try it. What's the worst that could happen (other than accidentally overwriting the entire file and having to download it again)?

In [13]:
f = open(filepath)

f.write('And then I woke up and it was all a dream.')

UnsupportedOperation: not writable

We see an error message informing us that our file is not writable. This is in fact fortunate here, as otherwise the result of the `write()` command above would have been to overwrite the entire contents of the file.
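File objects can tell us in advance whether they allow reading or writing, via the `readable()` and `writable()` methods that appear in the `dir()` output above. The sketch below checks these, then catches the error from a failed `write()`. It assumes that `tempfile.mkstemp()` from the standard library is a reasonable way of getting an empty throwaway file to experiment with, so that no real file is put at risk.

```python
import io
import os
import tempfile

# Create an empty throwaway file (mkstemp() returns an open low-level file
# handle and the path; we only need the path, so we close the handle).
handle, path = tempfile.mkstemp(suffix='.txt')
os.close(handle)

f = open(path)   # no mode given, so the file is opened for reading only

can_read = f.readable()
can_write = f.writable()
print(can_read)   # True
print(can_write)  # False

caught = False
try:
    f.write('And then I woke up and it was all a dream.')
except io.UnsupportedOperation as error:
    caught = True
    print('Cannot write:', error)

f.close()
os.remove(path)   # tidy up the throwaway file
```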

There is an additional [keyword argument](functions.ipynb#Keyword-arguments) to `open()` called `mode`, which specifies what 'mode' we want to open the file in. You can see a list of the possible modes at the [Python documentation page for `open()`](https://docs.python.org/3/library/functions.html#open), and you will also see there what the default value of `mode` is.

Above, we used `open()` without specifying anything for `mode`, so the default value was applied, and the default value happens to be `'r'`, meaning 'read only'. If we instead ask for `'w'` mode, we get a writable file object. We won't do this for the *Moby Dick* file, since we don't want to overwrite it. Instead, we will create our own new file and write some text into it. For example:

In [14]:
f = open('my_new_file.txt', mode='w')

f.write('Writing your first text file is a truly exhilarating experience.')

64

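The `64` that is displayed is the value returned by `write()`: the number of characters that were written. If you enter the two commands above into the Spyder console and then you go and look at your working directory, you should see that the new file has appeared there.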

However, if you click on the new file to open it in a text editor, you might not see the text that you wrote with the `write()` method. Python does not necessarily write to a file immediately. Instead, it usually collects the text in a temporary holding area in memory (a 'buffer'), and exactly when the contents of this buffer are passed on to the file can vary. Writing data to the computer's permanent memory is by computer standards a slow operation, and can be done more efficiently by waiting until there is plenty to write, then writing everything in one go. So Python waits until we write a lot of data or until we definitely stop writing.

We can signal that we have finished writing by closing the file with the `close()` method.

In [15]:
f.close()

Now you should be able to open the file in a text editor and see the finished result of any `write()` commands that you have run.
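As a recap, here is a sketch of the whole cycle: write, close, then open again and read back. It also contrasts `'w'` mode, which discards any existing contents as soon as the file is opened, with `'a'` mode ('append'), which is one of the other modes listed in the documentation for `open()`. The file name `scratch.txt` is made up for this illustration, and the sketch assumes that a temporary directory created with `tempfile.mkdtemp()` is a reasonable place to experiment safely.

```python
import os
import shutil
import tempfile

# Work in a throwaway directory so that no real files are touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'scratch.txt')

f = open(path, mode='w')
f.write('first version\n')
f.close()
print(f.closed)  # True

f = open(path, mode='w')   # 'w' again: the old contents are discarded
f.write('second version\n')
f.close()

f = open(path, mode='a')   # 'a': new text is added after the existing text
f.write('an appended line\n')
f.close()

f = open(path)             # default mode 'r', to read the result back
contents = f.read()
f.close()

print(contents)

shutil.rmtree(folder)      # tidy up the throwaway directory
```

The final `print()` shows 'second version' followed by 'an appended line', each on its own line. The text 'first version' is gone, because the file was opened in `'w'` mode a second time.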



In [16]:
cleanup()

os.chdir('/home/lt/GitHub/introduction-to-programming/topics')