# Chapter 5: File I/O

In this chapter you will learn how to read data from and write data to files. This is an essential part of programming, as it is the first step for your program to communicate with the outside world. In most cases you will write programs that take data from some source, manipulate it in some way, and write it out somewhere. For example, if you were running a survey, you could take input from participants and save their answers in files. When the survey is over, you would read these files back in, analyze the data you have collected, and save your results. In this chapter we will read in texts, analyze them a bit, and save our analysis to files.

## File Input

Input for your programs often comes from files on your disk, such as texts or data in CSV format. Likewise, you often want output to be written back to files on your disk: for example, you collect tweets about a certain topic and write them to a file for later analysis. Reading and writing files is thus an essential part of programming and, luckily for us, it is really simple in Python. The following example reads a file from disk:

In [17]:
f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8') # open the file
text = f.read() # read in its content as a string
f.close() # close the file
print(text) # print the string

Emma by Jane Austen 1816

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.


The `open()` function does not return the actual text that is saved in the text file. It only returns a 'file object' from which we can read the content using the `.read()` method. We passed three arguments to the `open()` function:

 * the name of the file that you wish to open
 * the mode, a combination of characters: 'r' represents read-mode and 't' represents plain text-mode. Together they indicate that we are reading a plain text file.
 * the last argument, a named argument (`encoding`), specifies the encoding of the text file.
 
The most important mode arguments the `open()` function can take are:

* r: Opens a file for reading only. The file pointer is placed at the beginning of the file.
* w: Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
* a: Opens a file for appending. The file pointer is at the end of the file if the file exists. If the file does not exist, it creates a new file for writing. Use it if you would like to add something to the end of a file.
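
To see the three modes side by side, here is a small sketch. It assumes a throwaway file called `modes-demo.txt` in a temporary directory (both made up for this illustration), so nothing in your `data` folder is touched.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes-demo.txt')  # hypothetical file name

# 'w' creates the file (or empties it if it already exists).
f = open(path, 'w', encoding='utf-8')
f.write('first version\n')
f.close()

# Opening with 'w' again throws away the old content.
f = open(path, 'w', encoding='utf-8')
f.write('second version\n')
f.close()

# 'a' keeps the old content and adds to the end.
f = open(path, 'a', encoding='utf-8')
f.write('appended line\n')
f.close()

# 'r' reads the result back.
f = open(path, 'r', encoding='utf-8')
content = f.read()
f.close()
print(content)
```

The text "first version" is gone after the second `'w'`, while the line written in `'a'` mode ends up after "second version".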



>UTF-8

>You may wonder what an encoding is and what *utf-8* is. For anyone working with texts and computers this is vital to know. Internally, a computer knows no characters whatsoever: every piece of information is represented as numbers (which in turn are represented in a binary format, as zeroes and ones). An encoding specifies which numbers represent which characters. A famous and long-standing encoding scheme is ASCII, in which for example the letter 'A' is encoded using the number 65. ASCII, however, has only a very limited alphabet and cannot encode many writing systems. A modern standard that supports a vast range of writing systems is *Unicode*, and *utf-8* is one of the encodings of Unicode. This is the type of encoding that you will want to use for your data whenever possible. Whenever you have a choice, you should use Unicode!
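
The idea that characters are really numbers can be checked directly in Python. The sketch below uses the built-in functions `ord()` and `chr()` and the string method `.encode()`; the word 'café' is just an example chosen for this illustration.

```python
# ord() gives the number behind a character, chr() goes the other way.
print(ord('A'))   # the number that ASCII (and Unicode) use for 'A'
print(chr(65))

# encode() turns a string into the raw bytes that end up on disk.
word = 'café'
as_utf8 = word.encode('utf-8')
print(as_utf8)
print(len(word), len(as_utf8))  # 4 characters, but 5 bytes: 'é' needs two

# decode() turns the bytes back into a string,
# provided we name the encoding they were written in.
print(as_utf8.decode('utf-8'))
```

This is why `open()` wants to know the encoding: it needs it to translate between the bytes in the file and the characters in your string.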

Reading an entire file into one string is not always desirable, especially not with huge files. The following example reads up to the next newline every time, and returns one line at a time.


In [2]:
f = open('data/austen-emma-excerpt.txt','rt', encoding='utf-8') # open the file
for line in f: # iterate over the file object
    print(line)   # the file object yields one line at a time
    print("n")    # after every line print "n"
f.close() # close the file

Emma by Jane Austen 1816

n


n
VOLUME I

n


n
CHAPTER I

n


n


n
Emma Woodhouse, handsome, clever, and rich, with a comfortable home

n
and happy disposition, seemed to unite some of the best blessings

n
of existence; and had lived nearly twenty-one years in the world

n
with very little to distress or vex her.

n


n
She was the youngest of the two daughters of a most affectionate,

n
indulgent father; and had, in consequence of her sister's marriage,

n
been mistress of his house from a very early period.  Her mother

n
had died too long ago for her to have more than an indistinct

n
remembrance of her caresses; and her place had been supplied

n
by an excellent woman as governess, who had fallen little short

n
of a mother in affection.
n


The 'newline' character is probably something new to you. If you are dealing with plain text files (typically files whose name ends in the '.txt' extension), your machine uses a special character internally to signal that a new line should begin. Internally, such newlines are represented as `"\n"`. Normally, this character is shown on your screen as if the enter key were pressed. Each line we read from the file still ends in this character, and `print()` adds a newline of its own, which is why an empty line appears after every line below:

In [9]:
f = open('data/austen-emma-excerpt.txt','rt', encoding='utf-8')
for line in f:
    print(line)
f.close()

Emma by Jane Austen 1816



VOLUME I



CHAPTER I





Emma Woodhouse, handsome, clever, and rich, with a comfortable home

and happy disposition, seemed to unite some of the best blessings

of existence; and had lived nearly twenty-one years in the world

with very little to distress or vex her.



She was the youngest of the two daughters of a most affectionate,

indulgent father; and had, in consequence of her sister's marriage,

been mistress of his house from a very early period.  Her mother

had died too long ago for her to have more than an indistinct

remembrance of her caresses; and her place had been supplied

by an excellent woman as governess, who had fallen little short

of a mother in affection.
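
If the extra empty lines get in the way, there are two common ways around them. The sketch below is only an illustration: it first writes a tiny stand-in file (the file name and its two lines are made up) so that it runs anywhere.

```python
import os
import tempfile

# A small stand-in file, so the example does not depend on the data folder.
path = os.path.join(tempfile.mkdtemp(), 'two-lines.txt')  # hypothetical name
f = open(path, 'wt', encoding='utf-8')
f.write('VOLUME I\nCHAPTER I\n')
f.close()

stripped = []
f = open(path, 'rt', encoding='utf-8')
for line in f:
    print(repr(line))                   # repr() makes the "\n" visible
    print(line, end='')                 # option 1: print() adds no newline of its own
    stripped.append(line.rstrip('\n'))  # option 2: cut the newline off the line
f.close()
print(stripped)
```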


Rather than just printing, we can of course do whatever we want with this file's content. Let's count the number of lines (but note that a line does not necessarily correspond to a sentence).

In [5]:
count = 0
f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8')
for line in f:
    count += 1
f.close()
print(count)

19

### Useful tips on file reading

The last thing I would like to show you is how to store the contents of a file in a list, which I find useful in some cases. File objects provide the `.readlines()` method, which creates a list in which each element is one line from the file. As you can see in the example below, this keeps the annoying trailing newline characters `"\n"` at the end of the lines. So in the second example I read in the file as one string and split it on the newline characters `"\n"`.

In [12]:
lines = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8').readlines()
print("Number of lines", len(lines))
print(lines)
print()
print(open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8').read().split('\n'))


Number of lines 19
['Emma by Jane Austen 1816\n', '\n', 'VOLUME I\n', '\n', 'CHAPTER I\n', '\n', '\n', 'Emma Woodhouse, handsome, clever, and rich, with a comfortable home\n', 'and happy disposition, seemed to unite some of the best blessings\n', 'of existence; and had lived nearly twenty-one years in the world\n', 'with very little to distress or vex her.\n', '\n', 'She was the youngest of the two daughters of a most affectionate,\n', "indulgent father; and had, in consequence of her sister's marriage,\n", 'been mistress of his house from a very early period.  Her mother\n', 'had died too long ago for her to have more than an indistinct\n', 'remembrance of her caresses; and her place had been supplied\n', 'by an excellent woman as governess, who had fallen little short\n', 'of a mother in affection.']

['Emma by Jane Austen 1816', '', 'VOLUME I', '', 'CHAPTER I', '', '', 'Emma Woodhouse, handsome, clever, and rich, with a comfortable home', 'and happy disposition, seemed to unite some
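
One detail to keep in mind when splitting on `"\n"`: if the text itself ends in a newline, `.split('\n')` leaves an empty string at the end of the list. The string method `.splitlines()` avoids this. A minimal sketch, using a short made-up string in place of a real file's content:

```python
text = 'VOLUME I\n\nCHAPTER I\n'   # hypothetical stand-in for f.read()

with_split = text.split('\n')
with_splitlines = text.splitlines()

print(with_split)        # ends with an extra '' because the text ends in "\n"
print(with_splitlines)   # no extra element at the end
```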

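Lastly, below I show a more "pythonic" way of opening a file. The `with` syntax is preferable because the file is closed automatically when the indented block ends, even if an error occurs inside it. You can read up on the details, but for now just remember that it's safer.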

In [13]:
with open('data/austen-emma-excerpt.txt','rt', encoding='utf-8') as txt:
    for line in txt:
        print(line)

Emma by Jane Austen 1816



VOLUME I



CHAPTER I





Emma Woodhouse, handsome, clever, and rich, with a comfortable home

and happy disposition, seemed to unite some of the best blessings

of existence; and had lived nearly twenty-one years in the world

with very little to distress or vex her.



She was the youngest of the two daughters of a most affectionate,

indulgent father; and had, in consequence of her sister's marriage,

been mistress of his house from a very early period.  Her mother

had died too long ago for her to have more than an indistinct

remembrance of her caresses; and her place had been supplied

by an excellent woman as governess, who had fallen little short

of a mother in affection.
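
You can check for yourself that `with` closes the file. File objects have a `.closed` attribute that tells you whether the file is still open. The sketch below assumes a throwaway file in a temporary directory (the name `with-demo.txt` is made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with-demo.txt')  # hypothetical file

with open(path, 'wt', encoding='utf-8') as out:
    out.write('Hello\n')
    print(out.closed)   # False: the file is still open inside the block

print(out.closed)       # True: it was closed when the block ended

with open(path, 'rt', encoding='utf-8') as txt:
    content = txt.read()
print(txt.closed, repr(content))
```

Note that there is no call to `.close()` anywhere: leaving the indented block is enough.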


## Quiz

Read the file `data/austen-emma-excerpt.txt` and compute the average length of the lines:
* In characters
* In words
* Re-calculate both measures when not counting empty lines

In [125]:
f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8')
# insert your code here
# important: always remember to properly close your files again!

- - -

## File Output


Now that we have mastered the art of reading files, let's move on to writing files, which follows a similar logic:

In [None]:
f = open('data/testoutput.txt', 'wt', encoding='utf-8')
f.write("Hello world!")
f.close()

In this code block, we have created a new file called `testoutput.txt` in the `data` directory. We then wrote a single line to this file and closed it. Note that the `w` in `wt` is a crucial addition: if you had left it out, Python would have opened the file in read-only mode and you wouldn't have been able to write to it! The 't' in the argument, again, signifies that we will be writing to this file in plain text mode.

If you want your data to be written on multiple lines, you need to take care to explicitly encode the newlines. Instead of:
    

In [None]:
f = open('data/testoutput.txt','wt', encoding='utf-8')
f.write("Hello world on the first line!")
f.write("Hello world on the second line!")
f.close()

You need to write:

In [None]:
f = open('data/testoutput.txt','wt', encoding='utf-8')
f.write("Hello world on the first line!\n")
f.write("Hello world on the second line!")
f.close()

Otherwise your file would contain `Hello world on the first line!Hello world on the second line!`, i.e. without the newline.
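
When the lines you want to write are already in a list, one way to take care of the newlines is to glue the lines together with `"\n".join()` before writing. This is only a sketch of one option; it writes to a temporary directory rather than to your `data` folder.

```python
import os
import tempfile

lines = ['Hello world on the first line!',
         'Hello world on the second line!']

path = os.path.join(tempfile.mkdtemp(), 'testoutput.txt')  # throwaway location

f = open(path, 'wt', encoding='utf-8')
f.write('\n'.join(lines))   # "\n" goes between the lines, not after the last one
f.close()

# Read the file back to check what ended up in it.
f = open(path, 'rt', encoding='utf-8')
written = f.read()
f.close()
print(written)
```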

Besides 'read-mode' and 'write-mode' when dealing with text files, there is also the 'append-mode' in Python. Watch out: in 'write-mode', you will always *overwrite* the existing content of the file. However, if you've opened a file in 'append-mode', everything you write to the file will be added at the end of the file, without deleting any of the existing content. In order to enable append mode, you need to specify `'at'` as your second parameter when you open the file ('a' for append mode; 't' for text mode).

#### Exercise

Open the file we just created and check whether the writing was successful.

In [11]:
# insert your code here

Read the file `data/austen-emma-excerpt-tokenised.txt`, and write to a file `words.txt` all words occurring in this text (without duplicates!), alphabetically ordered, one word per line. That way, you are really creating a lexicon or word list of the text. (Tip: you should use sets in this exercise!)

Check your output by viewing the `words.txt` file in a text editor such as Sublime Text 2. (Windows users: do not use Notepad!)

## Pickle

Another very common way of saving data to disk in Python is to just simply "dump" it in a pickle file. This section is going to walk you through this idea. 

Let's say you have read in some document and created a frequency dictionary from your text file:

In [15]:
freq_dict = {'word1': 210, 'word2': 50}
freq_dict

{'word1': 210, 'word2': 50}

You would like to keep this for later use. This is where you can use the `pickle` module. This module lets you write Python objects to disk and read them back later. `pickle` has two main functions: the first one is `dump`, which writes an object to a file object, and the second one is `load`, which reads an object back from a file object. Because pickle files are binary rather than plain text, the file has to be opened in binary mode (`'wb'` for writing, `'rb'` for reading).

In [16]:
import pickle

In [20]:
with open('freqdict.pkl', 'wb') as f: # 'wb': write in binary mode
    pickle.dump(freq_dict, f) # pass the object I want to write out and a file object to pickle

In [21]:
pickle.load(open('freqdict.pkl', 'rb')) # 'rb': pickle files must be read in binary mode

{'word1': 210, 'word2': 50}
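
Put together, a complete round trip (dump, then load) could look like the sketch below. It assumes the same small `freq_dict` as above and stores the pickle in a temporary directory, so it does not touch the `freqdict.pkl` in your working directory.

```python
import os
import pickle
import tempfile

freq_dict = {'word1': 210, 'word2': 50}
path = os.path.join(tempfile.mkdtemp(), 'freqdict.pkl')  # throwaway location

# 'wb' = write, binary: pickle files are not plain text
with open(path, 'wb') as f:
    pickle.dump(freq_dict, f)

# 'rb' = read, binary
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded)
print(loaded == freq_dict)  # same content, although it is a new object
```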

---

## Working with Directories

Now that we have started to work with files, we need some insight into how to navigate the folder/directory structure. Most people use some sort of graphical user interface (GUI) to navigate to files, such as the Finder on Mac OS or the My Computer icon on Windows. Now we are going to interact with these folder structures programmatically. The workhorse of this section is Python's `os` module. The GUI you are using wraps the commands of your operating system in clickable icons for easier use. Python's `os` module is similar to the GUI in that it provides an interface that lets you navigate between folders, create new folders, rename files, etc.

In [76]:
import os

Let's get started by checking which directory we are currently in.

In [77]:
print(os.getcwd())

/home/akadar/pymodules/python-course


`getcwd` stands for "get current working directory". As you can see, the name of the current directory is `python-course`. The names to its left are the higher-level directories that contain it. On Linux and Mac these are delimited by "/", while on Windows they are delimited by "\". This distinction is rather unnecessary, I know, but what can you do.
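
Because the separator differs between operating systems, it is safer not to type it yourself when you build a path in code. The `os.path` module has helpers for this. A minimal sketch, reusing the file name from earlier in this chapter:

```python
import os

# os.sep is the separator used by the operating system you are on
print(repr(os.sep))

# os.path.join glues the parts together with the right separator
path = os.path.join('data', 'austen-emma-excerpt.txt')
print(path)

# os.path.basename and os.path.dirname take a path apart again
print(os.path.basename(path))
print(os.path.dirname(path))
```

The same code then works unchanged on Linux, Mac and Windows.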

OK, now let's check which files and folders we have in this directory:

In [59]:
print(os.listdir('.')) # The '.' refers to 'current directory'

['Chapter 7 - More on Loops.ipynb', 'ExamEx20132014.pdf', 'Chapter 1 - Variables.ipynb', 'Chapter 4 - Loops.ipynb', 'data', '.ipynb_checkpoints', 'FilesIO.ipynb', 'Chapter 6 - Regular Expressions.ipynb', 'start-unix.sh', 'start-windows.bat', 'ExerciseKeys', 'README.md', 'Chapter 2 - Collections.ipynb', '.gitignore', 'Chapter 3 - Conditions.ipynb', 'Chapter 5 - Functions and Files.ipynb', 'freqdict.pkl', 'learnutils.py', 'styles', 'images', '.git', 'start-osx.command', 'learnutils.pyc']


Let's see which of these are files and which are directories. We are going to use `os.path.isdir`, which returns True if the string in question refers to a directory and False otherwise. Since we can have either a directory or a file and there are no other options, we only ask if the current element is a directory and if not, we infer that it is a file.

In [124]:
file_list = os.listdir('.') # list current working directory
files = [] # collect the filenames here
directories = [] # collect the directory names here
for element in file_list:
    if os.path.isdir(element):
        print(element, " \t --> is a directory")
        directories.append(element)
    else:
        print(element, " \t --> is a file")
        files.append(element)

Chapter 7 - More on Loops.ipynb  	 --> is a file
ExamEx20132014.pdf  	 --> is a file
Chapter 1 - Variables.ipynb  	 --> is a file
Chapter 4 - Loops.ipynb  	 --> is a file
data  	 --> is a directory
.ipynb_checkpoints  	 --> is a directory
FilesIO.ipynb  	 --> is a file
Chapter 6 - Regular Expressions.ipynb  	 --> is a file
start-unix.sh  	 --> is a file
start-windows.bat  	 --> is a file
ExerciseKeys  	 --> is a directory
README.md  	 --> is a file
Chapter 2 - Collections.ipynb  	 --> is a file
.gitignore  	 --> is a file
Chapter 3 - Conditions.ipynb  	 --> is a file
Chapter 5 - Functions and Files.ipynb  	 --> is a file
freqdict.pkl  	 --> is a file
learnutils.py  	 --> is a file
styles  	 --> is a directory
images  	 --> is a directory
.git  	 --> is a directory
start-osx.command  	 --> is a file
learnutils.pyc  	 --> is a file


Alright, so far we have used the `os` module to show where we are and what kind of files and directories we are surrounded by. In `learnutils` I implemented a small function that shows the whole directory structure below the current directory. Take a look at it.

In [67]:
from learnutils import print_tree

In [75]:
print_tree()

├── .gitignore
├── Chapter 1 - Variables.ipynb
├── Chapter 2 - Collections.ipynb
├── Chapter 3 - Conditions.ipynb
├── Chapter 4 - Loops.ipynb
├── Chapter 5 - Functions and Files.ipynb
├── Chapter 6 - Regular Expressions.ipynb
├── Chapter 7 - More on Loops.ipynb
├── ExamEx20132014.pdf
├── ExerciseKeys
│   ├── AnswersChap1.py
│   ├── AnswersChap2.py
│   ├── AnswersChap3.py
│   ├── AnswersChap4.py
│   ├── AnswersChap5.py
│   └── AnswersChap6.py
├── FilesIO.ipynb
├── README.md
├── data
│   ├── austen-emma-excerpt-tokenised.txt
│   ├── austen-emma-excerpt.txt
│   └── austen-emma.txt
├── freqdict.pkl
├── images
│   ├── Python-Programming-Language.png
│   ├── grade.png
│   ├── python-logo-generic.svg
│   └── string_index.png
├── learnutils.py
├── learnutils.pyc
├── start-osx.command
├── start-unix.sh
├── start-windows.bat
└── styles
    ├── custom.css
    ├── matplotlibrc
    └── screen.css


The `os` module also allows us to change to a different directory:

In [80]:
print("Directories:", directories)

Directories: ['data', '.ipynb_checkpoints', 'ExerciseKeys', 'styles', 'images', '.git']


In [90]:
os.chdir('data') # descending to the folder "data"
print(os.getcwd()) # where are we now?
print(os.listdir('.')) # what do we have here?
os.chdir('..') # going back up
print(os.getcwd()) # are we back?


/home/akadar/pymodules/python-course/data
['austen-emma-excerpt-tokenised.txt', 'austen-emma-excerpt.txt', 'austen-emma.txt']
/home/akadar/pymodules/python-course


The following code snippet:
 + goes to the data directory
 + creates a new directory "test" inside it
 + creates a new file "test.txt" inside "test" and reads it back
 + removes the file "test.txt"
 + removes the directory "test"

In [121]:
print("We are here:", os.getcwd())
os.chdir('data') # chdir --> change directory
print("We are here:", os.getcwd())
print(os.listdir('.'))
os.mkdir('test') # mkdir --> make directory
print(os.listdir('.'))
os.chdir('test') # chdir --> change directory
print(os.listdir('.'))
with open("test.txt", 'wt') as f: # create the file and write to it
    f.write('Testing')
print(os.listdir('.'))
with open("test.txt", 'rt') as f: # read the file back
    print(f.read())
os.remove("test.txt") # remove the file
os.chdir('..')
print("We are here", os.getcwd())
os.rmdir('test') # rmdir --> remove (empty) directory
print(os.listdir('.'))
os.chdir('..')
print("And we're back to:", os.getcwd())

We are here: /home/akadar/pymodules/python-course
We are here: /home/akadar/pymodules/python-course/data
['austen-emma-excerpt-tokenised.txt', 'austen-emma-excerpt.txt', 'austen-emma.txt']
['austen-emma-excerpt-tokenised.txt', 'austen-emma-excerpt.txt', 'austen-emma.txt', 'test']
[]
['test.txt']
Testing
We are here /home/akadar/pymodules/python-course/data
['austen-emma-excerpt-tokenised.txt', 'austen-emma-excerpt.txt', 'austen-emma.txt']
And we're back to: /home/akadar/pymodules/python-course
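
Changing directories back and forth is easy to lose track of. As an alternative, you can stay where you are and pass full paths to the same `os` functions. The sketch below does the same create-and-remove steps that way; it assumes a temporary directory standing in for `data`, so it is safe to run.

```python
import os
import tempfile

base = tempfile.mkdtemp()                  # stands in for the "data" directory
test_dir = os.path.join(base, 'test')
test_file = os.path.join(test_dir, 'test.txt')

os.mkdir(test_dir)                         # make the directory
with open(test_file, 'wt', encoding='utf-8') as f:
    f.write('Testing')                     # make the file inside it
print(os.listdir(test_dir))

os.remove(test_file)                       # remove the file first ...
os.rmdir(test_dir)                         # ... because rmdir needs an empty directory
print(os.path.exists(test_dir))
```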



## Scripts

Up until now, we have been using the interactive IPython software to write our Python code. However, we have only been writing really small bits and pieces of code, rather than longer scripts that provide a more significant batch of functionality. Now, let us make our first independent Python script together. (Note that this way of working will also more closely resemble your future day-to-day coding practice.)

Open 'Sublime Text 2', a popular text editor which we will use in this course (http://www.sublimetext.com/). Create a new file -- this might have happened automatically when you opened the editor -- and save it as "script.py" in a convenient location (here, we will assume that you have saved it in your Desktop folder). Note that files containing Python code typically take the ".py" extension.

If you are working in a UNIX-like environment (Mac or Linux), you should now add the following code on the very first line of your script:  

In [None]:
#!/usr/bin/env python3

This line will tell your computer which language you want to use to run the script -- in this case, our default installation of Python 3 will be used. In technical terms, the "#!" is called a "shebang" indication. If you are working in MS Windows, you can add this 'shebang line' as well, but it will have no effect. 

Now, let us add a simple Python function to this file. The `fib()` function will print the first numbers in the famous Fibonacci series. The function will only print the items in the series that are smaller than `upper`, i.e. the parameter we pass to this function:




In [3]:
def fib(upper):
    "Print a Fibonacci series up to upper"
    a, b = 0, 1
    while b < upper:
        print(b)
        # update both at once, so that a + b still uses the old value of a
        a, b = b, a + b

Next, add a line that actually calls the `fib()` function for `upper=2000`. (Don't forget to take care of the correct indentation!)

In [None]:
fib(2000)
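
A detail that is easy to get wrong in a function like `fib()` is the update of `a` and `b`: the new `b` must be computed from the *old* `a`. Writing both assignments on one line takes care of this. The sketch below shows this in a variant that returns the numbers as a list instead of printing them, which also makes the result easier to check. The name `fib_list` is made up for this illustration and is not needed in your `script.py`.

```python
def fib_list(upper):
    """Return the Fibonacci numbers smaller than upper as a list.
    (A hypothetical variant of fib() that returns instead of prints.)"""
    numbers = []
    a, b = 0, 1
    while b < upper:
        numbers.append(b)
        # The right-hand side is computed first, so the old value of a
        # is still used in a + b.
        a, b = b, a + b
    return numbers

print(fib_list(50))
```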

Instead of executing this code by hitting ctrl+enter as we have done in the IPython notebook so far, we will now learn how to execute our code differently. We have two ways for doing this: an easy one, and a difficult one. When you work with a code editor like Sublime Text, there is often an easy way to execute your code. In Sublime, for instance, you can first save your file with a ".py" extension, and then 'build' your code by hitting Ctrl+b (Windows, Unix) or Command+b (Mac OS X). You will now see that the output of your script will be written to your screen in Sublime.

For the second option, you can use a command line interface or prompt. Watch out: this can be pretty scary at first... In general, you should always watch out when you use a command line interface to your machine: only execute commands that you (more or less) understand! You typically have complete control over your machine via such an interface, so you need to watch out not to remove any important files. (You could e.g. unintentionally delete your entire operating system from your hard drive with a single command...). 

First we will deal with instructions for doing this in Mac OS X and Linux-distributions such as Ubuntu. Mac OS X and Linux tend to behave similarly because they are both 'Unix-based' operating systems. 

- On a Mac,  you should open your 'Terminal'-application by clicking the relevant icon in the folder Applications > Utilities. Alternatively, click the magnifying glass in the top-right of your screen (shortcut: Command key+Spacebar). Next, type 'Terminal' in the box that appears and hit enter when your Mac has found the Terminal app.
- On a Linux installation, first open a command line window by navigating to the Terminal via Applications Menu > Accessories > Terminal (Gnome) or Dash > More Apps > Accessories > Terminal (Unity). Under both Gnome and Unity, you should also be able to use the keyboard shortcut `Ctrl+Alt+T` to open a console window.

Now 'cd' (= Change Directory) into the directory that contains your `script.py` file (in our case that would mean: `cd ~/Desktop`). Next, execute our script by typing: `python3 script.py`. With this command, we explicitly tell the machine to execute this script using `python3` (that is, the default version of Python 3 installed on your machine). Normally, the output of the `print()` calls should now have been sent to your console window. Has it?

However, because we added the 'shebang line' at the top of `script.py`, we could have also used the plainer command `./script.py` (which simply means: 'run this program'). To make the program executable you might have to "chmod" it first (CHange file MODes), using the command `chmod +x script.py`. With this command you tell your machine that it is safe to execute this script. For additional info on these commands and the options they take, you can always run the `man` command (e.g. `man chmod`). A good tutorial that covers the basics of the bash command line interface on Unix-like operating systems is: http://praxis.scholarslab.org/tutorials/bash/.

Under a Windows operating system, you can simply double-click the `script.py` file: because of the `.py` extension your OS will automatically run the script via the Python interpreter (note that the 'shebang line' in your `script.py` is in reality ignored). Alternatively, go to Start > All programs > Accessories and click on Command Prompt. In the Search or Run line, you can also type `cmd` and press enter. This will open a DOS console window.

Navigate to the folder that holds your `script.py`: in our case you could do this via the command `cd C:\documents and settings\your_username\desktop` or simply `cd desktop`. Next, execute our script by typing: `python3 script.py`. With this command, we explicitly tell the machine to execute this script using `python3` (that is, its default version on your machine). Normally, the output of the `print()` calls should now have been sent to your console window. Has it? A good tutorial that covers the basics of the command line interface in DOS-based operating systems is: http://www.computerhope.com/issues/chusedos.htm

You can now also import the functionality from `script.py` in other scripts! Remove (or comment out with a hash sign, `#`) the following line, which contains the actual function call, from `script.py`:


In [None]:
fib(2000)

Create a new file called `main.py` in the same directory, namely your Desktop folder. Add the shebang line on top, as well as the following statement which will import the functionality from the `script.py` module. Note that the syntax is entirely the same as for importing one of the 'official' functions from the Python Standard Library! Instead of running `script.py`, now try to run `main.py` which will import the `script.fib()` function. Does this work out? You don't have to add the '.py' by the way: your computer will figure out this extension itself.

In [None]:
#!/usr/bin/env python3
import script
script.fib(upper=2000)

Note that we have to be explicit about where our Python interpreter should look for the `fib()` function using the syntax with a dot (`module.function()`). If we want to be able to use the shorter version of the function call, we should have used the following import statement:

In [None]:
from script import *

Can you now try to run `main.py` again, but now with the shorter call `fib(upper=2000)`, without explicitly mentioning the module from which the function originates? Does that work out for you?
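
Instead of removing the function call from `script.py`, many Python programmers keep it but guard it with a check on the special variable `__name__`. This is offered here only as an optional alternative to the approach described above. Python sets `__name__` to `"__main__"` when a file is run directly, and to the module's name when the file is imported.

```python
def fib(upper):
    "Print a Fibonacci series up to upper"
    a, b = 0, 1
    while b < upper:
        print(b)
        a, b = b, a + b

# This block runs when the file is executed directly
# (e.g. python3 script.py), but not when it is imported from main.py.
if __name__ == "__main__":
    fib(2000)
```

With this in place, `script.py` still prints the series when you run it on its own, while `import script` in `main.py` only makes `script.fib()` available without printing anything.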

Now check out the files in your Desktop folder: you will notice that something new has been created, namely a folder called `__pycache__` holding a file whose name starts with `script` and ends in `.pyc`. (If you can't see it, note that you can explicitly list all files in the current directory using the `ls` command on Unix-like operating systems, or `dir` on Windows.) The extension of this new file stands for 'Compiled Python File'. Don't worry about this file -- you won't be able to inspect its contents using a text editor anyway. This file contains the numerical 'bytecode' that the Python interpreter executes: it is this machine-readable version of your code that has actually been imported into the `main.py` module. You can safely ignore these files, but now you know what they are for. (By the way: note that no `.pyc` file has been created for `main.py`, because no functionality from this file has been imported into another module.)

It's always a good idea to distribute your code over a set of modules. In technical terms, your code should be as 'modular' as possible, meaning that similar functions should be grouped into the same module. This will help you keep your code organized, especially when you are working on a larger project. If you have a set of functions that you use for loading and parsing files in Python, why not group them under the same module? This way you can organize your own coding more efficiently, as well as share and document your code more easily.

Now you know how to store and organize your code in separate files and modules. Still, a lot of programmers continue to explore their data using 'interactive Python' via a so-called 'interactive Python interpreter' that more or less resembles the IPython environment you have been working in so far. To launch such an interpreter, just type in `python3` in your console and hit enter. This will launch a 'live' Python session in your command line console where you can experiment with your data by typing in commands, much like you have done in the IPython environment so far. Try this out! Just type in lines of Python code after the `>>>` prompt and hit enter to execute them immediately.

---

# Exercises

```When you do the exercises below, don't write your code in the IPython notebook anymore but write it in a separate file and run it from the command line! ```

Inspired by *Think Python* by Allen B. Downey (http://thinkpython.com), *Introduction to Programming Using Python* by Y. Liang (Pearson, 2013). Some exercises below have been taken from: http://www.ling.gu.se/~lager/python_exercises.html.

-  Two words are anagrams if you can rearrange the letters from one to spell the other. Write a function called is_anagram that takes two strings and returns True if they are anagrams.

-  Go to Project Gutenberg (http://www.gutenberg.org) and download your favorite out-of-copyright book in plain text format. Make a frequency dictionary of the words in the novel. Sort the words in the dictionary by frequency and write them to a text file called `frequencies.txt`. Make sure your program ignores capitalization as well as punctuation (hint: check out `string.punctuation` online!). Search the web in order to find out how you can sort a dictionary -- this is not easy, because you might have to import another module.

- Rewrite the novel in the previous exercise, by replacing the name of the principal character in the novel by your own name. (Use the `replace()` function for this.) Write the new version of the novel to a file called `starring_me.txt`.

-  Define a function sum() and a function multiply() that sums and multiplies (respectively) all the numbers in a list of numbers. For example, sum([1, 2, 3, 4]) should return 10, and multiply([1, 2, 3, 4]) should return 24.


-  A *hapax legomenon* (often abbreviated to hapax) is a word which occurs only once in either the written record of a language, the works of an author, or in a single text. Define a function that given the file name of a text will return all its hapaxes. Make sure your program ignores capitalization as well as punctuation (hint: check out `string.punctuation` online!). Try out the function on your Gutenberg book.

- Inside the same module as the previous exercise (i.e. a file that ends in `.py`), create two additional functions: one that spots 'hapaxes dislegomena' (words occurring only twice) and one that spots 'hapaxes trislegomena' (words occurring only three times) in a text file. Now import these functions in another, standalone script and call all three functions from there. Again, try them out on your Gutenberg-file.

- Write a program that given a text file will create a new text file in which all the lines from the original file are numbered from 1 to n (where n is the number of lines in the file).

- Write a script that rolls a die every time you run it by generating a random integer between 1 and 6! You can import functionality for doing this via `random.randint()`.

- [Advanced exercise, possibly involving regular expressions: optional] A *sentence splitter* is a program capable of splitting a text into sentences. The standard set of heuristics for sentence splitting includes (but isn't limited to) the following rules: Sentence boundaries occur at one of "." (periods), "?" or "!", except that:

> - Periods followed by whitespace followed by a lowercase letter are not sentence boundaries.
> - Periods followed by a digit with no intervening whitespace are not sentence boundaries.
> - Periods followed by whitespace and then an uppercase letter, but preceded by any of a short list of titles are not sentence boundaries. Sample titles include Mr., Mrs., Dr., and so on.
> - Periods internal to a sequence of letters with no adjacent whitespace are not sentence boundaries, such as in www.aptex.com or e.g.
> - Periods followed by certain kinds of punctuation (notably comma and more periods) are probably not sentence boundaries.

You might want to check out string functions, like `.islower()` and `.isalpha()` in the official Python documentation online. Your task here is to write a function that, given the name of a text file, is able to write its content with each sentence on a separate line to a new file whose name is also passed as an argument to the function. The function itself should return a list of sentences. Test your program with the following short text: "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't." The result written to the new file should be:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.

Did he mind?

Adam Jones Jr. thinks he didn't.

In any case, this isn't true...

Well, with a probability of .9 it isn't.


------------------------------

You've reached the end of Chapter 5! You can safely ignore the code below, it's only there to make the page pretty:

In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("styles/custom.css", "r").read()
    return HTML(styles)
css_styling()