# Chapter 7: File Handling

In this chapter you will learn how to read data from and write data to files. This is an essential part of programming, as it is the first step for your program to communicate with the outside world. In most cases you will write programs that take data from some source, manipulate it in some way and write it out somewhere. For example, if you ran a survey, you could take input from participants and save their answers in files. When the survey is over, you would read these files in, analyze the data you have collected and save your results. In this chapter we will read in text files, analyze them a bit, and write our analysis out to files.

## File Input

Input for your programs often comes from files on your disk, such as texts or data in CSV format. Likewise, you often want output to be written back to files on your disk, e.g. you collect tweets about a certain topic and write them to a file for later analysis. Reading and writing files is therefore an essential part of programming and, luckily for us, it is really simple in Python. The following example reads a file from disk:

In [125]:
f = open('./data/austen-emma-excerpt.txt', 'r', encoding='utf-8') # open the file 
text = f.read() # read in its content as a string
f.close() # close the file
print(text) # print the string

<_io.TextIOWrapper name='./data/austen-emma-excerpt.txt' mode='r' encoding='utf-8'>
Emma by Jane Austen 1816

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.


The `open()` function does not return the actual text that is saved in the text file. It returns a 'file object', from which we can read the content using the `.read()` method. We passed three arguments to the `open()` function:

 * the name of the file that you wish to open
 * the mode, a combination of characters: 'r' represents read mode and 't' represents plain-text mode. Text mode is the default, so 'r' and 'rt' both indicate that we are reading a plain text file.
 * the last argument, a named argument (`encoding`), specifies the encoding of the text file.
 
The most important mode arguments the open() function can take are:

* r: Opens a file for reading only. The file pointer is placed at the beginning of the file.
* w: Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
* a: Opens a file for appending. The file pointer is at the end of the file if the file exists. If the file does not exist, it creates a new file for writing. Use it if you would like to add something to the end of a file.
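
The three modes listed above can be tried out side by side. The following is only a small sketch: it works in a temporary directory, and the file name `modes_demo.txt` is made up for the illustration.

```python
import os
import tempfile

# Work in a throwaway directory so no real data is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    f = open(path, 'w', encoding='utf-8')  # 'w' creates the file
    f.write('first')
    f.close()

    f = open(path, 'w', encoding='utf-8')  # 'w' again: the old content is gone
    f.write('second')
    f.close()

    f = open(path, 'a', encoding='utf-8')  # 'a' adds to the end of the file
    f.write(' third')
    f.close()

    f = open(path, 'r', encoding='utf-8')  # 'r' reads from the beginning
    modes_result = f.read()
    f.close()

print(modes_result)
```

The word `first` does not survive, because the second `open()` in 'w' mode starts with an empty file again.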



>UTF-8

>You may wonder what an encoding is and what *UTF-8* is. For anyone working with texts and computers this is vital to know. Internally, a computer knows no characters whatsoever: every piece of information is represented as numbers (which in turn are represented in a binary format, as zeroes and ones). An encoding specifies which numbers represent which characters. A famous and long-standing encoding scheme is ASCII, in which for example the letter 'A' is encoded using the number 65. ASCII, however, only has a very limited alphabet and cannot encode a lot of writing systems. A modern-day standard supporting countless writing systems is *Unicode*, and *UTF-8* is a widely used way of encoding Unicode text. This is the type of encoding that you will want to use for your data whenever possible. Whenever you have a choice, you should use Unicode!
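
To make the idea of "numbers behind characters" a bit more concrete, here is a minimal sketch using Python's built-in `ord()`, `chr()`, `str.encode()` and `bytes.decode()`. The example word is made up for the illustration.

```python
# ASCII and Unicode agree on the numbers for plain English letters.
print(ord('A'))   # the number behind the character 'A'
print(chr(65))    # and back from the number to the character

# The letter e with an acute accent is not part of ASCII.
# It is written here with an escape, so the example does not
# depend on how this text itself is encoded.
word = 'caf\u00e9'
utf8_bytes = word.encode('utf-8')

# Four characters, but five bytes: UTF-8 needs two bytes for the accented e.
print(len(word), len(utf8_bytes))

# Decoding with the same encoding gives back the original text.
print(utf8_bytes.decode('utf-8') == word)
```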

Reading an entire file into one string is not always desirable, especially not with huge files. The following example reads up to the next newline each time, so it returns one line at a time.


In [121]:
f = open('data/austen-emma-excerpt.txt','rt', encoding='utf-8') # open the file
for line in f: # iterate over the file object
    print(line)   # the file object yields one line at a time 
f.close() # close the file

Emma by Jane Austen 1816



VOLUME I



CHAPTER I





Emma Woodhouse, handsome, clever, and rich, with a comfortable home

and happy disposition, seemed to unite some of the best blessings

of existence; and had lived nearly twenty-one years in the world

with very little to distress or vex her.



She was the youngest of the two daughters of a most affectionate,

indulgent father; and had, in consequence of her sister's marriage,

been mistress of his house from a very early period.  Her mother

had died too long ago for her to have more than an indistinct

remembrance of her caresses; and her place had been supplied

by an excellent woman as governess, who had fallen little short

of a mother in affection.


The 'newline' character is probably something new to you. If you are dealing with plain text files (typically files whose name ends in the '.txt' extension), your machine uses a special character internally to signal that a new line should begin. Internally, such newlines are represented as `"\n"`. Normally, this character is visualized on your screen as if the enter key were pressed. This is also why the output above shows an empty line after every line: each line read from the file still ends in `"\n"`, and `print()` adds a newline of its own.
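
A small sketch to make the newline character visible. It uses a made-up string rather than the file, but a line read from a file behaves the same way.

```python
line = "CHAPTER I\n"           # a line as it comes out of a file

print(repr(line))              # repr() makes the hidden "\n" visible
print(len(line))               # the newline counts as one character

stripped = line.rstrip("\n")   # remove the trailing newline
print(repr(stripped))

print(line, end="")            # tell print() not to add a second newline
```

Either stripping the newline or passing `end=""` to `print()` gets rid of the extra empty lines when printing a file line by line.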

Rather than just printing, we can of course do whatever we want with this file's content. Let's count the number of lines (but note that a line does not necessarily correspond to a sentence).

In [122]:
count = 0
f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8')
for line in f:
    count += 1
f.close()
print(count)

19


### Reading lines

The last thing I would like to show you is how to store the contents of a file in a list, which I find useful in some cases. Python provides the `readlines()` method of file objects, which creates a list in which each element is one line from the file. As you can see in the example below, this keeps the annoying trailing newline characters "\n" at the end of the lines. So in the second example I read in the file as one string and split it on the newline characters "\n".

In [123]:
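f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8')
lines = f.readlines() # a list with one element per line
f.close()
print("Number of lines", len(lines), '\n')
print(lines)
print()
f = open('data/austen-emma-excerpt.txt', 'rt', encoding='utf-8')
print(f.read().split('\n')) # read everything as one string, then split on newlines
f.close()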

Number of lines 19 

['Emma by Jane Austen 1816\n', '\n', 'VOLUME I\n', '\n', 'CHAPTER I\n', '\n', '\n', 'Emma Woodhouse, handsome, clever, and rich, with a comfortable home\n', 'and happy disposition, seemed to unite some of the best blessings\n', 'of existence; and had lived nearly twenty-one years in the world\n', 'with very little to distress or vex her.\n', '\n', 'She was the youngest of the two daughters of a most affectionate,\n', "indulgent father; and had, in consequence of her sister's marriage,\n", 'been mistress of his house from a very early period.  Her mother\n', 'had died too long ago for her to have more than an indistinct\n', 'remembrance of her caresses; and her place had been supplied\n', 'by an excellent woman as governess, who had fallen little short\n', 'of a mother in affection.']

['Emma by Jane Austen 1816', '', 'VOLUME I', '', 'CHAPTER I', '', '', 'Emma Woodhouse, handsome, clever, and rich, with a comfortable home', 'and happy disposition, seemed to unite so

Lastly, below I show a more "pythonic" way of opening a file. It is preferable to use the `with` syntax: the file is closed automatically when the indented block ends, even if an error occurs inside it, so you cannot forget to close it.

In [124]:
with open('data/austen-emma-excerpt.txt','rt', encoding='utf-8') as txt:
    for line in txt:
        print(line)

Emma by Jane Austen 1816



VOLUME I



CHAPTER I





Emma Woodhouse, handsome, clever, and rich, with a comfortable home

and happy disposition, seemed to unite some of the best blessings

of existence; and had lived nearly twenty-one years in the world

with very little to distress or vex her.



She was the youngest of the two daughters of a most affectionate,

indulgent father; and had, in consequence of her sister's marriage,

been mistress of his house from a very early period.  Her mother

had died too long ago for her to have more than an indistinct

remembrance of her caresses; and her place had been supplied

by an excellent woman as governess, who had fallen little short

of a mother in affection.
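
To check that `with` really closes the file, we can look at the `closed` attribute of the file object. This is only a sketch: it writes a small made-up file (`with_demo.txt`, a hypothetical name) into a temporary directory instead of using the Austen excerpt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'with_demo.txt')  # hypothetical file name

    with open(path, 'wt', encoding='utf-8') as out:
        out.write('first line\nsecond line\n')

    lines_read = []
    with open(path, 'rt', encoding='utf-8') as txt:
        for line in txt:
            lines_read.append(line.rstrip('\n'))
        closed_inside = txt.closed   # still open inside the block

    closed_after = txt.closed        # closed as soon as the block ends

print(lines_read)
print(closed_inside, closed_after)
```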


**Exercise:**

Read the file `data/austen-emma-excerpt.txt` and compute the average length of the lines:
* In characters
* In words

Do not count empty lines!

In [129]:
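# insert your code here
# important: always remember to properly close your files again!
# note: this first attempt only computes the average in characters, and it
# still counts empty lines and the newline character at the end of each line
txt = open('data/austen-emma-excerpt.txt', 'r', encoding='utf-8')
char_average = 0
num_lines = 0
for line in txt:
    char_average += len(line) # add up the length of every line
    num_lines += 1.0

print(char_average/num_lines)
txt.close()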

36.8421052631579
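
The cell above counts every line, including the empty ones, and it also counts the newline character at the end of each line. One possible way to follow the exercise more closely is sketched below. The function name `average_line_lengths` and the sample lines are made up for the illustration; this is one of several reasonable solutions, not the only one.

```python
def average_line_lengths(lines):
    """Return (average characters, average words) over the non-empty lines."""
    num_lines = 0
    num_chars = 0
    num_words = 0
    for line in lines:
        line = line.rstrip('\n')   # do not count the newline character
        if line.strip() == '':     # skip empty (or whitespace-only) lines
            continue
        num_lines += 1
        num_chars += len(line)
        num_words += len(line.split())
    if num_lines == 0:             # avoid dividing by zero
        return 0.0, 0.0
    return num_chars / num_lines, num_words / num_lines


# Made-up sample lines, standing in for the lines of a file.
sample = ['VOLUME I\n', '\n', 'CHAPTER I\n']
print(average_line_lengths(sample))
```

Because a file object yields its lines one at a time, an open file can be passed to the function in place of the list.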


## Sentiment Data Sets

Our goal next week is to automatically identify the sentiment of movie reviews. To this end we are going to implement simple, but efficient and effective, sentiment analysis techniques. Sentiment analysis or opinion mining refers to the use of natural language processing/text mining techniques to identify and extract subjective information in source materials. Sentiment analysis is widely applied to reviews and social media for a variety of applications, ranging from marketing to customer service.

Today we are going to learn how to navigate directory structures with the **os** module and take the opportunity to look at the movie review data sets we are going to use for our sentiment analysis lecture.
http://www.cse.iitb.ac.in/~pb/cs626-449-2009/prev-years-other-things-nlp/sentiment-analysis-opinion-mining-pang-lee-omsa-published.pdf

In [131]:
import os
print(os.listdir('.'))
        

['.profile', '.bash_logout', '.bashrc', '.bash_history', '.ipython', '.ipynb_checkpoints', 'Chapter 1 - Variables.ipynb', 'Untitled Folder 1', 'data', 'Untitled0.ipynb', 'grading', 'Assig6.ipynb', 'stoplist.txt', 'Chapter 7 - File handling.ipynb', 'ngram.ipynb', 'freqdict.pkl']


In [141]:
positives = os.listdir('./data/sentiment/txt_sentoken/pos/') # file names of the positive reviews
print(len(positives))
print('./data/sentiment/txt_sentoken/pos/'+positives[0])
file_in = open('./data/sentiment/txt_sentoken/pos/'+positives[0], 'r', encoding="utf-8")
print(file_in.read())
file_in.close() # close the file again

1000
./data/sentiment/txt_sentoken/pos/cv000_29590.txt
films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
gett

In [89]:
for element in os.listdir('.'):
    if os.path.isdir(element):
        print(element)

.ipython
.ipynb_checkpoints
Untitled Folder 1
data
grading


In [142]:
print(len(os.listdir('./data/sentiment/txt_sentoken/neg')))
print(len(os.listdir('./data/sentiment/txt_sentoken/pos')))

1000
1000


In [143]:
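neg_data_path = './data/sentiment/txt_sentoken/neg/'
negatives = os.listdir(neg_data_path) # file names of the negative reviews
file_in = open(neg_data_path+negatives[100], 'r', encoding='utf-8')
print(file_in.read().split()) # split the review into a list of tokens
file_in.close()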



## File Output


Now that we have mastered the art of reading files, let's move on to writing files, which follows a similar logic:

In [150]:
f = open('data/testoutput.txt', 'wt', encoding='utf-8')
f.write("Hello world!")
f.write("Hello world!")

f.close()

In [151]:
f = open('data/testoutput.txt', 'a', encoding='utf-8')
f.write("Hey!")
f.close()


In these two code blocks, we first created a new file called `testoutput.txt` in the `data` directory, wrote the string `Hello world!` to it twice and closed it. We then reopened the file in append mode, added `Hey!` at the end and closed it again. Note that the `w` in `wt` is crucial: if you had left it out, Python would have opened the file in read-only mode and you would not have been able to write to it! The 't' in the argument, again, signifies that we are writing to this file in plain text mode.

If you want your data to be written on multiple lines, you need to take care to explicitly encode the newlines. Instead of:
    

In [94]:
f = open('data/testouput.txt','wt', encoding='utf-8')
f.write("Hello world on the first line!")
f.write("Hello world on the second line!")
f.close()

You need to write:

In [95]:
f = open('data/testoutput.txt','wt', encoding='utf-8')
f.write("Hello world on the first line!\n")
f.write("Hello world on the second line!")
f.close()

Otherwise your file would contain `Hello world on the first line!Hello world on the second line!`, i.e. both sentences on a single line, without the newline.
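
When the lines are already stored in a list, one common way to get the newlines in the right places is to join the list with `"\n"` before writing. The sketch below assumes a made-up file name (`lines_demo.txt`) in a temporary directory and reads the file back to show what was written.

```python
import os
import tempfile

lines = ["Hello world on the first line!",
         "Hello world on the second line!"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'lines_demo.txt')  # hypothetical file name

    f = open(path, 'wt', encoding='utf-8')
    f.write("\n".join(lines))   # put a newline between the lines
    f.close()

    f = open(path, 'rt', encoding='utf-8')
    written = f.read()
    f.close()

print(repr(written))
```

Note that `join` only puts the separator *between* the elements, so the last line does not end in a newline here.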

Besides 'read-mode' and 'write-mode' when dealing with text files, there is also the 'append-mode' in Python. Watch out: in 'write-mode', you will always *overwrite* the existing content of the file. However, if you have opened a file in 'append-mode', everything you write to the file will be added at the end of the file, without deleting any of its existing content. In order to enable the append mode, you need to specify `'at'` as your second parameter when you open files ('a' for append mode; 't' for text mode).

**Exercise:**

Read in 'data/austen-emma.txt' and create a word count dictionary. Write out the results to a text file named "emma_freqs.txt" in the following format:

    word1,count
    word2,count
    ...
    wordN,count

In [96]:
# insert your code here

**Exercise:**

Read the file `data/austen-emma-excerpt-tokenised.txt`, and write to a file `words.txt` all words occurring in this text (without duplicates!), alphabetically ordered, one word per line. That way, you are really creating a lexicon or word list of the text. (Tip: you should use sets in this exercise!)

Check your output by viewing the `words.txt` file in a text editor such as Sublime Text 2. 

## Pickle

In [1]:
import pickle

Another very common way of saving data to disk in Python is to just simply "dump" it in a pickle file. This section is going to walk you through this idea. 

Let's say you have read in some document and created a frequency dictionary from your text file:

In [4]:
print(freq_dict)

{'word2': 50, 'word1': 210}


You would like to remember this for later use. This is where you can use the pickle module. This module lets you write out arbitrary Python objects to disk and read them back later. pickle has two main functions: the first one is `dump`, which dumps an object to a file object, and the second one is `load`, which loads an object from a file object.

In [8]:
freq_dict = {'word1': 210, 'word2': 50}
with open('freqdict.pkl', 'wb') as out: # 'wb': pickle files are written in binary mode
    pickle.dump(freq_dict, out) # passing the thing that I want to write out and a file object to pickle


In [12]:
with open('freqdict.pkl', 'rb') as inp:
    a = pickle.load(inp)
print(a)
a['word3'] = 340
print(a)
with open('freqdict.pkl', 'wb') as out: # closing the file makes sure everything is written to disk
    pickle.dump(a, out)

{'word2': 50, 'word1': 210}
{'word2': 50, 'word1': 210, 'word3': 340}


In [13]:
with open('freqdict.pkl', 'rb') as inp:
    print(pickle.load(inp)) # the updated dictionary, including 'word3'

{'word2': 50, 'word1': 210, 'word3': 340}
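
Pickle is not limited to flat dictionaries. The following sketch does a full round trip with a made-up nested object and a hypothetical file name (`demo.pkl`) in a temporary directory. It uses `with` blocks so that the files are closed, and the data is really on disk, before we try to load it again.

```python
import os
import pickle
import tempfile

# A made-up object mixing several built-in types.
data = {'counts': {'word1': 210, 'word2': 50},
        'seen': {'word1', 'word2'},
        'pairs': [('word1', 1), ('word2', 2)]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.pkl')  # hypothetical file name

    with open(path, 'wb') as out:      # pickle files are binary: note the 'b'
        pickle.dump(data, out)

    with open(path, 'rb') as inp:
        restored = pickle.load(inp)

print(restored == data)
print(type(restored['seen']))
```

The set comes back as a set and the tuples as tuples, which is what makes pickle convenient compared to writing plain text. One word of caution: only load pickle files that you created yourself or that come from a source you trust, since loading a pickle can run arbitrary code.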

---

### Working with Directories

Now that we have started to work with files, we need some insight into how to navigate the folder/directory structure. Most people use some sort of graphical user interface (GUI) to navigate to files, such as the Finder on Mac OS or the My Computer icon on Windows. Now we are going to interact with these folder structures programmatically. The workhorse of this section is Python's `os` module. The GUI you are using translates your clicks on icons into commands for your operating system. Python's `os` module is very similar to the GUI in that it provides an interface that lets you navigate between folders, create new folders, rename files, etc.

In [112]:
import os

Let's get started by checking which directory we are currently in.

In [113]:
print(os.getcwd())

/home/akadar


`getcwd` stands for "get current working directory". As you can see, the name of the current directory is `akadar`. The names to its left are the higher-level directories that contain it. On Linux and Mac these are delimited by "/", while on Windows they are delimited by "\". This distinction is extremely unnecessary, I know, but what can you do.
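
Because the separator differs between operating systems, it is usually safer not to type it yourself. A minimal sketch using `os.path.join`, `os.sep` and `os.path.split` from the standard library; the path parts are taken from the movie review data used earlier in this chapter.

```python
import os

# os.path.join glues the parts together with the right separator
# for the operating system the code runs on.
review_dir = os.path.join('data', 'sentiment', 'txt_sentoken', 'pos')
print(review_dir)

print(os.sep)   # the separator itself: "/" on Linux and Mac, "\" on Windows

# Splitting a path back into its directory and its file name.
folder, name = os.path.split(os.path.join(review_dir, 'cv000_29590.txt'))
print(folder)
print(name)
```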

OK, now let's check which files and folders we have in this directory:

In [114]:
print(os.listdir('.')) # The '.' refers to 'current directory'

['.profile', '.bash_logout', '.bashrc', '.bash_history', '.ipython', '.ipynb_checkpoints', 'Chapter 1 - Variables.ipynb', 'Untitled Folder 1', 'data', 'Untitled0.ipynb', 'grading', 'Assig6.ipynb', 'stoplist.txt', 'Chapter 7 - File handling.ipynb', 'ngram.ipynb', 'freqdict.pkl']


Let's see which of these are files and which are directories. We are going to use `os.path.isdir`, which returns True if the string in question refers to a directory and False otherwise. Since we assume that everything here is either a directory or a file, we only ask whether the current element is a directory; if it is not, we infer that it is a file.

In [115]:
file_list = os.listdir('.') # list current working directory
files = [] # collect the filenames here
directories = [] # collect the directory names here
for element in file_list:
    if os.path.isdir(element):
        print(element, " \t --> is a directory")
        directories.append(element)
    else:
        print(element, " \t --> is a file")
        files.append(element)

.profile  	 --> is a file
.bash_logout  	 --> is a file
.bashrc  	 --> is a file
.bash_history  	 --> is a file
.ipython  	 --> is a directory
.ipynb_checkpoints  	 --> is a directory
Chapter 1 - Variables.ipynb  	 --> is a file
Untitled Folder 1  	 --> is a directory
data  	 --> is a directory
Untitled0.ipynb  	 --> is a file
grading  	 --> is a directory
Assig6.ipynb  	 --> is a file
stoplist.txt  	 --> is a file
Chapter 7 - File handling.ipynb  	 --> is a file
ngram.ipynb  	 --> is a file
freqdict.pkl  	 --> is a file
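
One detail to keep in mind: `os.listdir` returns bare names, and `os.path.isdir` interprets a bare name relative to the current working directory. The loop above works because we are listing the current directory itself. For any other directory, the name has to be joined with the directory's path first. The sketch below shows this with a hypothetical helper function, `split_entries`, tried out on a small temporary directory.

```python
import os
import tempfile

def split_entries(path):
    """Return (files, directories) found directly inside `path`."""
    files = []
    directories = []
    for name in sorted(os.listdir(path)):
        full_path = os.path.join(path, name)   # the bare name is not enough
        if os.path.isdir(full_path):
            directories.append(name)
        else:
            files.append(name)
    return files, directories


# Build a small throwaway directory to try the function on.
with tempfile.TemporaryDirectory() as tmp_dir:
    os.mkdir(os.path.join(tmp_dir, 'subfolder'))
    with open(os.path.join(tmp_dir, 'notes.txt'), 'wt') as f:
        f.write('hello')
    found_files, found_dirs = split_entries(tmp_dir)

print(found_files, found_dirs)
```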


The os module also allows us to change to different directories:

In [116]:
print("Directories:", directories)

Directories: ['.ipython', '.ipynb_checkpoints', 'Untitled Folder 1', 'data', 'grading']


In [117]:
os.chdir('data') # descending to the folder "data"
print(os.getcwd()) # where are we now?
print(os.listdir('.')) # what do we have here?
os.chdir('..') # going back up
print(os.getcwd()) # are we back?


/home/akadar/data
['sentiment', 'austen-emma.txt', 'austen-emma-excerpt.txt', 'testoutput.txt', 'testouput.txt']
/home/akadar


The following code snippet:
 + goes to the data directory
 + creates a new directory called "test" inside it
 + creates a new file "test.txt" in that directory
 + removes the file "test.txt"
 + removes the directory "test"

In [118]:
print("We are here:", os.getcwd())
os.chdir('data') # chdir --> change directory
print("We are here:", os.getcwd())
print(os.listdir('.'))
os.mkdir('test') # mkdir --> make directory
print(os.listdir('.'))
os.chdir('test') # chdir --> change directory
print(os.listdir('.'))
with open("test.txt", 'wt') as f: # 'with' closes the file for us
    f.write('Testing')
print(os.listdir('.'))
with open("test.txt", 'rt') as f:
    print(f.read())
os.remove("test.txt")
os.chdir('..')
print("We are here", os.getcwd())
os.rmdir('test')
print(os.listdir('.'))
os.chdir('..')
print("And we're back to:", os.getcwd())

We are here: /home/akadar
We are here: /home/akadar/data
['sentiment', 'austen-emma.txt', 'austen-emma-excerpt.txt', 'testoutput.txt', 'testouput.txt']
['sentiment', 'austen-emma.txt', 'austen-emma-excerpt.txt', 'testoutput.txt', 'testouput.txt', 'test']
[]
['test.txt']
Testing
We are here /home/akadar/data
['sentiment', 'austen-emma.txt', 'austen-emma-excerpt.txt', 'testoutput.txt', 'testouput.txt']
And we're back to: /home/akadar
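
The same steps can also be carried out without changing the working directory at all, by building full paths with `os.path.join`. This is just an alternative sketch, not a replacement for the cell above: a temporary directory stands in for the `data` folder so that nothing on your disk is touched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:      # stands in for "data"
    test_dir = os.path.join(base, 'test')
    test_file = os.path.join(test_dir, 'test.txt')

    os.mkdir(test_dir)                           # create the directory
    with open(test_file, 'wt') as f:             # create the file
        f.write('Testing')
    after_create = os.listdir(test_dir)

    with open(test_file, 'rt') as f:
        content = f.read()

    os.remove(test_file)                         # remove the file
    os.rmdir(test_dir)                           # remove the (now empty) directory
    after_cleanup = os.listdir(base)

print(after_create, content, after_cleanup)
```

Note that `os.rmdir` only removes empty directories, which is why the file is removed first.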


### Exercises

- **Exercise 1**: Go to Project Gutenberg (http://www.gutenberg.org) and download your favorite out-of-copyright book in plain text format, then upload it to the Jupyter server. Make a frequency dictionary of the words in the novel. Sort the words in the dictionary by frequency and write them to a text file called `frequencies.txt`. Make sure your program ignores capitalization as well as punctuation (hint: check out `string.punctuation` online!). Search the web in order to find out how you can sort a dictionary -- this is not easy, because you might have to import another module.

- **Exercise 2**: Rewrite the novel from the previous exercise by replacing the name of the principal character with your own name. (Use the `replace()` function for this.) Write the new version of the novel to a file called `starring_me.txt`.

- **Exercise 3:** A *hapax legomenon* (often abbreviated to hapax) is a word which occurs only once in either the written record of a language, the works of an author, or in a single text. Define a function that given the file name of a text will return all its hapaxes. Make sure your program ignores capitalization as well as punctuation (hint: check out `string.punctuation` online!). Try out the function on your Gutenberg book.

- **Exercise 4:** Write a program that given a text file will create a new text file in which all the lines from the original file are numbered from 1 to n (where n is the number of lines in the file).