https://github.com/jvns/pandas-cookbook

# Chapter 9: Data Analysis

In this chapter you will learn how to read data from files, do some analysis, and write the results to disk. Reading and writing files is an essential part of programming, as it is the first step for your program to communicate with the outside world. In most cases you will write programs that take data from some source, manipulate it in some way, and write some results out somewhere. For example, if you were running a survey, you could take input from participants on a web server and save their answers in files or in a database. When the survey is over, you would read these results in, do some analysis on the data you have collected, maybe create some visualizations, and save your results.

## File In

Let's start by using the `open()` function to read some text from a file. The `open()` function does not return the actual text that is saved in the text file. It returns a 'file object' from which we can read the content using the `.read()` method. We can pass up to three arguments to the `open()` function:

 * the name of the file that you wish to open
 * the mode, a combination of characters: 'r' represents read mode and 't' represents plain-text mode, so 'rt' indicates that we are reading a plain text file
 * an optional named argument (`encoding`), which specifies the encoding of the text file, for example `encoding='utf-8'` (this keyword is available in Python 3)
 
The most important mode arguments the open() function can take are:

* r: Opens a file for reading only. The file pointer is placed at the beginning of the file.
* w: Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
* a: Opens a file for appending. The file pointer is at the end of the file if the file exists. If the file does not exist, it creates a new file for writing. Use it if you would like to add something to the end of a file.
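The three modes listed above can be tried out in a small sketch. It assumes Python 3, and the scratch file `modes-demo.txt` is a made-up name placed in a temporary directory so that nothing real gets overwritten.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not part of the chapter's data).
path = os.path.join(tempfile.mkdtemp(), 'modes-demo.txt')

# 'w' creates the file (or empties an existing one) and writes from the start.
f = open(path, 'w', encoding='utf-8')
f.write('first line\n')
f.close()

# 'a' keeps the existing content and adds the new text at the end.
f = open(path, 'a', encoding='utf-8')
f.write('second line\n')
f.close()

# 'r' opens the file for reading only.
f = open(path, 'r', encoding='utf-8')
content = f.read()
f.close()

print(content)
```

Opening the same file with 'w' a second time would have thrown away the first line, which is why 'a' is the mode to reach for when adding to an existing file.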



The following example reads a file from disk. 

In [14]:
f = open('data/austen-emma-excerpt.txt', 'rt') # open the file 
text = f.read() # read in its content as a string
f.close() # close the file
print(text) # print the string

Emma by Jane Austen 1816

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.


Reading an entire file into one string is not always desirable, especially with huge files. If you use `open('filename').read()`, Python stores the whole resulting string in memory. If you have a computer with 8GB of memory and want to read a file with 16GB of data, you are going to run into trouble! The following example instead reads up to the next newline each time and returns one line at a time. Because each line already ends in a newline character and `print()` adds another one, the output is double-spaced.


In [15]:
f = open('data/austen-emma-excerpt.txt','rt') # open the file
for line in f: # iterate over the file object
    print(line)   # the file object yields one line at a time
f.close() # close the file

Emma by Jane Austen 1816



VOLUME I



CHAPTER I





Emma Woodhouse, handsome, clever, and rich, with a comfortable home

and happy disposition, seemed to unite some of the best blessings

of existence; and had lived nearly twenty-one years in the world

with very little to distress or vex her.



She was the youngest of the two daughters of a most affectionate,

indulgent father; and had, in consequence of her sister's marriage,

been mistress of his house from a very early period.  Her mother

had died too long ago for her to have more than an indistinct

remembrance of her caresses; and her place had been supplied

by an excellent woman as governess, who had fallen little short

of a mother in affection.


 * ----- *
 

Rather than just printing, we can of course do whatever we want with this file's content. Let's count the number of lines (note that a line does not necessarily correspond to a sentence).

In [16]:
count = 0
f = open('data/austen-emma-excerpt.txt', 'rt')
for line in f:
    count += 1
f.close()
print(count)

19


The following is a more "pythonic" way of opening a file. It is preferable to use this `with` syntax because the file is closed automatically when the block ends, even if an error occurs inside it, so it is safer than calling `close()` yourself.

In [17]:
with open('data/austen-emma-excerpt.txt','rt') as txt:
    for line in txt:
        print(line)

Emma by Jane Austen 1816



VOLUME I



CHAPTER I





Emma Woodhouse, handsome, clever, and rich, with a comfortable home

and happy disposition, seemed to unite some of the best blessings

of existence; and had lived nearly twenty-one years in the world

with very little to distress or vex her.



She was the youngest of the two daughters of a most affectionate,

indulgent father; and had, in consequence of her sister's marriage,

been mistress of his house from a very early period.  Her mother

had died too long ago for her to have more than an indistinct

remembrance of her caresses; and her place had been supplied

by an excellent woman as governess, who had fallen little short

of a mother in affection.
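To see what "safer" means for the `with` syntax, here is a minimal sketch, assuming Python 3. The scratch file `with-demo.txt` and the deliberately raised `ValueError` are made up for illustration; the point is only that the file ends up closed even though the block is left through an error.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'with-demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('some text\n')

try:
    with open(path, 'r', encoding='utf-8') as txt:
        first_line = txt.readline()
        # Simulate a failure in the middle of processing the file.
        raise ValueError('something went wrong while processing')
except ValueError:
    pass

# The with block closed the file on the way out, despite the error.
print(txt.closed)
```

With a plain `open()` and `close()` pair, the `close()` call would have been skipped by the error unless it were wrapped in a `try`/`finally` block.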


## Exercise

Read the file `data/austen-emma-excerpt.txt` and compute the average length of the lines:
* In characters
* In words
* Re-calculate both measures when not counting empty lines

In [18]:
f = open('data/austen-emma-excerpt.txt', 'rt')
# insert your code here
# important: always remember to properly close your files again!

## File Out

Now that we have mastered the art of reading files, let's move on to writing files, which follows a similar logic:

In [19]:
f = open('data/testoutput.txt', 'wt')
f.write("Hello world!")
f.close()

If you want your data to be written on multiple lines, you need to write the newline characters (`\n`) explicitly.

In [20]:
f = open('data/testoutput.txt','wt', encoding='utf-8') # the encoding keyword requires Python 3
f.write("Hello world on the first line!\n")
f.write("Hello world on the second line!")
f.close()
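When the text to write is already stored in a list, the newline handling can be sketched as a loop. This is a minimal sketch assuming Python 3; the temporary directory is only there so the example does not touch the `data/` folder.

```python
import os
import tempfile

lines = ['Hello world on the first line!', 'Hello world on the second line!']

# Write to a temporary location instead of the chapter's data directory.
path = os.path.join(tempfile.mkdtemp(), 'testoutput.txt')

with open(path, 'w', encoding='utf-8') as f:
    for line in lines:
        f.write(line + '\n')  # write() never adds a newline by itself

# Read the file back to check what ended up on disk.
with open(path, 'r', encoding='utf-8') as f:
    written = f.read()

print(written)
```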

### Pickle

Another very common way of saving data to disk in Python is to simply "dump" it in a pickle file. This section walks you through this idea.

Let's say you have read in some document and created a frequency dictionary from your text file:

In [21]:
freq_dict = {'word1': 210, 'word2': 50}
freq_dict

{'word1': 210, 'word2': 50}

You would like to remember this for later use. This is where you can use the pickle module. This module lets you write most Python objects to disk and read them back later. pickle has two main functions: the first one is `dump`, which dumps an object to a file object, and the second one is `load`, which loads an object from a file object.

In [22]:
import pickle

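In [23]:
with open('freqdict.pkl', 'wb') as f: # pickle files are binary, so open in 'wb' mode
    pickle.dump(freq_dict, f) # pass the object to write out and a file object to pickle

In [24]:
with open('freqdict.pkl', 'rb') as f: # read the pickle back in binary mode
    loaded = pickle.load(f)
loaded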

{'word1': 210, 'word2': 50}
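The whole round trip can be put together in one self-contained sketch, assuming Python 3. The temporary directory is an assumption made for the example so that no file is left behind in the working directory.

```python
import os
import pickle
import tempfile

freq_dict = {'word1': 210, 'word2': 50}

# Store the pickle in a temporary directory for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'freqdict.pkl')

# Pickle data is binary, so the file is opened with 'wb' for writing ...
with open(path, 'wb') as f:
    pickle.dump(freq_dict, f)

# ... and with 'rb' for reading.
with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored object is an equal copy, not the very same object.
print(restored == freq_dict, restored is freq_dict)
```

Keep in mind that loading a pickle can execute code, so only unpickle files that come from a source you trust.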

## From Reading CSV to Data Visualization

Let's read the CSV file using the simple `open("file").readlines()` approach.
This dataset lists how many people were on 7 different bike paths in Montreal each day.

In [306]:
# the file is Latin-1 encoded (note the accented characters), so we say so explicitly
with open('./data/bikes.csv', encoding='latin-1') as f:
    csvfile = f.readlines() # read the CSV into a list of lines
print(csvfile[:5])

# print the first 10 lines, one at a time
for row in csvfile[:10]:
    print(row)

['Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)\n', '01/01/2012;35;;0;38;51;26;10;16;\n', '02/01/2012;83;;1;68;153;53;6;43;\n', '03/01/2012;135;;2;104;248;89;3;58;\n', '04/01/2012;144;;1;116;318;111;8;61;\n']
Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles)

01/01/2012;35;;0;38;51;26;10;16;

02/01/2012;83;;1;68;153;53;6;43;

03/01/2012;135;;2;104;248;89;3;58;

04/01/2012;144;;1;116;318;111;8;61;

05/01/2012;197;;2;124;330;97;13;95;

06/01/2012;146;;0;98;244;86;4;75;

07/01/2012;98;;2;80;108;53;6;54;

08/01/2012;95;;1;62;98;64;11;63;

09/01/2012;244;;2;165;432;198;12;173;



It might not be immediately apparent, but CSV files are nothing more than simple tables of data. The first __row__ is the header and all the other __rows__ hold the data. Just from printing out the file we see that the first __column__ contains dates; we will exclude those from our further analysis. We can also see that the 3rd column (or 2nd if you index from 0) is __empty__, so we will need to deal with that as well. Let's check whether at least all the rows have the same number of elements.

In [350]:
for i in csvfile:
    print(len(i.split(';')), end=' ') # print the counts on one line, separated by spaces

10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9


No, they don't ... You will encounter these annoying things a lot in data analysis. The first couple of rows have 10 elements, but all the others have 9. This is pretty crappy, but luckily we know for a fact (from the data collectors) that the last column is missing from all the rows whose length is 9. We will just exclude that column from our analysis.

In summary:

- Get rid of the dates
- Get rid of the empty column (the 3rd column of the file)
- Get rid of the last column from the longer rows

In the following script we skip the first entry (the dates) of every line, drop the last element from the lines of length 10 so that every row ends up with 8 values, and store the header accordingly.

In [351]:
data = []
for counter, row in enumerate(csvfile):
    fields = row.split(';') # split the line once and reuse the result
    if counter == 0:
        header = fields[1:-1] # skip the date column and the last column
    elif len(fields) == 9:
        data.append(fields[1:]) # the last column is already missing: only skip the date
    elif len(fields) == 10:
        data.append(fields[1:-1]) # skip the date and drop the last element
print(len(header))
print(len(data[0]))

8
8


Now we end up with a representation of our data like this.

In [352]:
data[:10]

[['35', '', '0', '38', '51', '26', '10', '16'],
 ['83', '', '1', '68', '153', '53', '6', '43'],
 ['135', '', '2', '104', '248', '89', '3', '58'],
 ['144', '', '1', '116', '318', '111', '8', '61'],
 ['197', '', '2', '124', '330', '97', '13', '95'],
 ['146', '', '0', '98', '244', '86', '4', '75'],
 ['98', '', '2', '80', '108', '53', '6', '54'],
 ['95', '', '1', '62', '98', '64', '11', '63'],
 ['244', '', '2', '165', '432', '198', '12', '173'],
 ['397', '', '3', '238', '563', '275', '18', '241']]
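The splitting done by hand above can also be sketched with Python's standard `csv` module, which takes care of the delimiter and the line endings. This is only an alternative sketch, assuming Python 3: `io.StringIO` stands in for the real file, and the two sample rows are copied from the printout earlier in this section.

```python
import csv
import io

# Two data rows copied from the printout above; StringIO stands in for the open file.
sample = io.StringIO(
    '01/01/2012;35;;0;38;51;26;10;16;\n'
    '02/01/2012;83;;1;68;153;53;6;43;\n'
)

rows = []
for fields in csv.reader(sample, delimiter=';'):
    # Keep the 8 values after the date; the slice gives the same result
    # whether the line has 9 or 10 fields.
    rows.append(fields[1:9])

print(rows)
```

To use this on the real file, the `StringIO` object would be replaced by the file object returned by `open()`.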

We are almost done with this, but we still need to get rid of the second __column__ (index 1), which is empty, and we also need to convert all the strings to floats.

In [353]:
header.pop(1) # remove the name of the empty column
for i in range(len(data)):
    data[i].pop(1) # remove the second element (the empty string)
    data[i] = [float(value) for value in data[i]] # convert every remaining string to a float


In [354]:
print(len(header))
print(len(data[0]))
data[:10]

7
7


[[35.0, 0.0, 38.0, 51.0, 26.0, 10.0, 16.0],
 [83.0, 1.0, 68.0, 153.0, 53.0, 6.0, 43.0],
 [135.0, 2.0, 104.0, 248.0, 89.0, 3.0, 58.0],
 [144.0, 1.0, 116.0, 318.0, 111.0, 8.0, 61.0],
 [197.0, 2.0, 124.0, 330.0, 97.0, 13.0, 95.0],
 [146.0, 0.0, 98.0, 244.0, 86.0, 4.0, 75.0],
 [98.0, 2.0, 80.0, 108.0, 53.0, 6.0, 54.0],
 [95.0, 1.0, 62.0, 98.0, 64.0, 11.0, 63.0],
 [244.0, 2.0, 165.0, 432.0, 198.0, 12.0, 173.0],
 [397.0, 3.0, 238.0, 563.0, 275.0, 18.0, 241.0]]

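Finally, we compute the mean of each column, that is, the average daily count for each bike path.

In [1]:
# this cell assumes the cells above have been run, so that `data` is defined
means = []
for i in range(len(data[0])): # loop over the columns
    container = 0
    for j in range(len(data)): # sum column i over all rows
        container += data[j][i]
    means.append(container / len(data)) # divide the sum by the number of rows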
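The column-wise mean can also be sketched more compactly with `zip(*rows)`, which turns a list of rows into a sequence of columns. The three rows below are copied from the cleaned data shown above, and the name `sample` is chosen only for this example; the sketch assumes Python 3.

```python
# Three rows copied from the cleaned data shown above.
sample = [
    [35.0, 0.0, 38.0, 51.0, 26.0, 10.0, 16.0],
    [83.0, 1.0, 68.0, 153.0, 53.0, 6.0, 43.0],
    [135.0, 2.0, 104.0, 248.0, 89.0, 3.0, 58.0],
]

# zip(*sample) yields one tuple per column; each mean is the column sum
# divided by the number of rows.
means = [sum(column) / len(column) for column in zip(*sample)]

print(means)
```

The nested loop and this version compute the same thing; the `zip` form simply avoids the explicit index bookkeeping.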

## Additional Material

### Useful tips on file reading

The last thing I would like to show you is how to store the contents of a file in a list, which I find useful in some cases. Python provides the `fileobject.readlines()` method, which creates a list in which each element is one line from the file. As you can see in the example below, this keeps the annoying trailing newline characters "\n" at the end of the lines. So in the second example I read in the file as one string and split it on the newline characters "\n".

In [2]:
lines = open('data/austen-emma-excerpt.txt', 'rt').readlines()
print("Number of lines", len(lines))
print(lines)
print()
print(open('data/austen-emma-excerpt.txt', 'rt').read().split('\n'))

Number of lines 19
['Emma by Jane Austen 1816\n', '\n', 'VOLUME I\n', '\n', 'CHAPTER I\n', '\n', '\n', 'Emma Woodhouse, handsome, clever, and rich, with a comfortable home\n', 'and happy disposition, seemed to unite some of the best blessings\n', 'of existence; and had lived nearly twenty-one years in the world\n', 'with very little to distress or vex her.\n', '\n', 'She was the youngest of the two daughters of a most affectionate,\n', "indulgent father; and had, in consequence of her sister's marriage,\n", 'been mistress of his house from a very early period.  Her mother\n', 'had died too long ago for her to have more than an indistinct\n', 'remembrance of her caresses; and her place had been supplied\n', 'by an excellent woman as governess, who had fallen little short\n', 'of a mother in affection.']

['Emma by Jane Austen 1816', '', 'VOLUME I', '', 'CHAPTER I', '', '', 'Emma Woodhouse, handsome, clever, and rich, with a comfortable home', 'and happy disposition, seemed to unite some
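Two other ways of getting rid of the trailing newlines can be sketched as follows, assuming Python 3. Here `io.StringIO` stands in for an open text file, and the short text is taken from the first lines of the excerpt.

```python
import io

text = 'Emma by Jane Austen 1816\n\nVOLUME I\n'

# Option 1: iterate over the file object and strip the newline from each line.
fake_file = io.StringIO(text)  # stands in for an open text file
stripped = [line.rstrip('\n') for line in fake_file]

# Option 2: str.splitlines() splits on line boundaries and drops them.
split_lines = text.splitlines()

# For comparison: split('\n') leaves an extra empty string at the end
# when the text itself ends with a newline.
split_on_newline = text.split('\n')

print(stripped)
print(split_lines)
print(split_on_newline)
```

The first option still reads the file one line at a time, so it also works for files that are too large to hold in memory as a single string.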

Lastly, below I show a more "pythonic" way of opening a file. It is preferable to use this `with` syntax because the file is closed automatically when the block ends, even if an error occurs inside it, so it is safer. Here we also pass the encoding explicitly, which requires Python 3.

In [3]:
with open('data/austen-emma-excerpt.txt','rt', encoding='utf-8') as txt:
    for line in txt:
        print(line)