https://github.com/jvns/pandas-cookbook

# Chapter 9:  Data Analysis

In this chapter you will learn how to read data from files, do some analysis and write the results to disk. Reading and writing files is an essential part of programming, as it is the first step for your program to communicate with the outside world. In most cases you will write programs that take data from some source, manipulate it in some way and write the results out somewhere. For example, if you were to run a survey, you could take input from participants on a web server and save their answers in files or in a database. When the survey is over, you would read these results back in, analyze the data you have collected, perhaps create some visualizations, and save your results.

## File In

Let's start by using the `open()` function to read some text from a file. The `open()` function does not return the actual text that is saved in the text file. It returns a 'file object' from which we can read the content using the `.read()` method. We can pass up to three arguments to the `open()` function:

 * the name of the file that you wish to open
 * the mode, a combination of characters: 'r' represents read mode and 't' represents plain-text mode. Together, 'rt' indicates that we are reading a plain text file.
 * the last argument, a named argument (encoding), specifies the encoding of the text file.
 
The most important mode arguments the open() function can take are:

* r: Opens a file for reading only. The file pointer is placed at the beginning of the file.
* w: Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
* a: Opens a file for appending. The file pointer is at the end of the file if the file exists. If the file does not exist, it creates a new file for writing. Use it if you would like to add something to the end of a file.
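
The three modes listed above can be tried out in a small sketch. The file name `modes-demo.txt` and the temporary directory are assumptions made for illustration, so that nothing in the `data` folder is touched.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not part of the chapter's data).
path = os.path.join(tempfile.mkdtemp(), 'modes-demo.txt')

f = open(path, 'w', encoding='utf-8')  # 'w' creates the file (or overwrites it)
f.write('first line\n')
f.close()

f = open(path, 'a', encoding='utf-8')  # 'a' adds to the end of the existing file
f.write('second line\n')
f.close()

f = open(path, 'r', encoding='utf-8')  # 'r' reads, starting at the beginning
content = f.read()
f.close()

print(content)
```

Opening the same file with 'w' a second time would have thrown away the first line, which is why 'a' is used for the second write.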



The following example reads a file from disk. 

In [None]:
f = open('data/austen-emma-excerpt.txt', 'rt') # open the file 
text = f.read() # read in its content as a string
f.close() # close the file
print(text) # print the string

Reading an entire file into one string is not always desirable, especially with huge files. If you use `open('filename').read()`, Python will store the entire resulting string in memory. If you have a computer with 8GB of memory and want to read a file with 16GB of data, you are going to run into trouble! The following example instead iterates over the file object, which reads up to the next newline each time and returns one line at a time.


In [None]:
f = open('data/austen-emma-excerpt.txt','rt') # open the file
for line in f: # iterate over the file object
    print(line)   # the file object yields one line at a time
f.close() # close the file


Rather than just printing, we can of course do whatever we want with this file's content. Let's count the number of lines (but note that a line does not necessarily correspond to a sentence).

In [None]:
count = 0
f = open('data/austen-emma-excerpt.txt', 'rt')
for line in f:
    count += 1
f.close()
print(count)

The following is a more "pythonic" way of opening a file. It is preferable to use this `with` syntax because the file is closed automatically when the indented block ends, even if an error occurs inside it. For now, just remember that it's safer.

In [None]:
with open('data/austen-emma-excerpt.txt','rt') as txt:
    for line in txt:
        print(line)
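
To see what "safer" means in practice, here is a minimal sketch that checks the `closed` attribute of the file object. The scratch file `with-demo.txt` and the deliberately raised `ValueError` are assumptions made up for this demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'with-demo.txt')

with open(path, 'wt', encoding='utf-8') as out:
    out.write('line one\nline two\n')
    closed_inside = out.closed   # False: the file is still open inside the block

closed_after = out.closed        # True: leaving the block closed the file

# The file is also closed when the block is left because of an error.
try:
    with open(path, 'rt', encoding='utf-8') as txt:
        first_line = txt.readline()
        raise ValueError('something went wrong while processing')
except ValueError:
    closed_after_error = txt.closed

print(closed_inside, closed_after, closed_after_error)
```

With a plain `open()` and `close()` pair, the error in the second part would have skipped the `close()` call.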

## Exercise

Read the file `data/austen-emma-excerpt.txt` and compute the average length of the lines:
* In characters
* In words
* Re-calculate both measures when not counting empty lines

In [None]:
f = open('data/austen-emma-excerpt.txt', 'rt')
# insert your code here
# important: always remember to properly close your files again!

## File Out

Now that we have mastered the art of reading files, let's move on to writing files, which follows a similar logic:

In [None]:
f = open('data/testoutput.txt', 'wt')
f.write("Hello world!")
f.close()

If you want your data to be written on multiple lines, you need to take care to explicitly encode the newlines. 

In [None]:
f = open('data/testoutput.txt','wt', encoding='utf-8')
f.write("Hello world on the first line!\n")
f.write("Hello world on the second line!")
f.close()
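
When the lines are already stored in a list, the newlines can be added in one go by joining the list. The sketch below assumes a scratch file in a temporary directory rather than `data/testoutput.txt`, and reads the file back only to show what was written.

```python
import os
import tempfile

lines = ['Hello world on the first line!', 'Hello world on the second line!']

# Hypothetical scratch location, so the chapter's data folder is left alone.
path = os.path.join(tempfile.mkdtemp(), 'testoutput.txt')

f = open(path, 'wt', encoding='utf-8')
f.write('\n'.join(lines) + '\n')  # put a newline between the lines and one at the end
f.close()

f = open(path, 'rt', encoding='utf-8')
written = f.read()
f.close()

print(repr(written))  # repr() makes the "\n" characters visible
```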

### Pickle

Another very common way of saving data to disk in Python is to simply "dump" it into a pickle file. This section walks you through this idea.

Let's say you have read in some document and created a frequency dictionary from your text file:

In [None]:
freq_dict = {'word1': 210, 'word2': 50}
freq_dict

You would like to keep this for later use. This is where the pickle module comes in. This module lets you write out most Python objects to disk and read them back later. pickle has two main functions: the first is dump, which writes an object to a file object, and the second is load, which reads an object back from a file object. Both need a file that was opened in binary mode ('wb' or 'rb').

In [None]:
import pickle

In [None]:
with open('freqdict.pkl', 'wb') as f: # pickle files must be opened in binary mode
    pickle.dump(freq_dict, f) # pass the object to write out and a file object to pickle

In [None]:
with open('freqdict.pkl', 'rb') as f: # binary mode is needed for reading as well
    print(pickle.load(f))
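
A complete round trip might look like the sketch below. It assumes a scratch file in a temporary directory instead of `freqdict.pkl` in the working directory; the dictionary is the same toy example as above.

```python
import os
import pickle
import tempfile

freq_dict = {'word1': 210, 'word2': 50}

# Hypothetical scratch location for the pickle file.
path = os.path.join(tempfile.mkdtemp(), 'freqdict.pkl')

with open(path, 'wb') as f:      # 'wb': write in binary mode
    pickle.dump(freq_dict, f)

with open(path, 'rb') as f:      # 'rb': read in binary mode
    restored = pickle.load(f)

print(restored == freq_dict)     # True: same content
print(restored is freq_dict)     # False: a new object was built from the file
```

Keep in mind that loading a pickle file can execute code, so only load pickle files that come from a source you trust.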

## From Reading CSV to Data Visualization

Let's read the csv file using the simple open('file').readlines() method.
This dataset lists how many people were on 7 different bike paths in Montreal each day.

In [None]:
csvfile = open('./data/bikes.csv').readlines() # reading csv in a list 
print(csvfile[:5])

# Printing it one line at a time until 10th line
for counter, row in enumerate(csvfile):
    if counter == 10:
        break
    else:
        print(row)

It might not be immediately apparent, but csv files are nothing more than simple tables of data. The first __row__ is the header and all the other __rows__ represent the data. Just from printing out the file we can see that the first __column__ contains dates. We will exclude those from our further analysis. We can also see that the values in the 3rd column (or 2nd if you index from 0) are __missing__. We will need to deal with that as well. Let's see if at least all the rows have the same number of elements.

In [None]:
for i in csvfile:
    print(len(i.split(';')), end=' ') # print all the lengths on one line, separated by spaces

No, they don't... You will encounter annoying things like this a lot in data analysis. The first couple of rows have 10 elements, but all the others have 9. This is pretty crappy, but luckily we know for a fact (from the data collectors) that it is the last column that is missing from all the rows of length 9. We will just exclude that column from our analysis.

In summary:

- Get rid of the dates
- Get rid of the column with the missing values (the 2nd column once the dates are gone)
- Get rid of the last column from the longer rows
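
The row-length check can also be sketched with the standard library's `csv` module and `collections.Counter`. The sample below is made up to mimic the shape of the problem (semicolon-separated, an empty column, a trailing `;` on some rows); it is not the real content of `bikes.csv`.

```python
import csv
import io
from collections import Counter

# Made-up sample in the same semicolon-separated shape; NOT the real bike data.
sample = io.StringIO(
    "Date;Path A;Path B;Path C;\n"
    "01/01/2012;35;;0;\n"
    "02/01/2012;83;;1\n"
    "03/01/2012;135;;2\n"
)

rows = list(csv.reader(sample, delimiter=';'))     # each row becomes a list of strings
length_counts = Counter(len(row) for row in rows)  # how many rows have each length?

print(length_counts)
```

A count like this tells you at a glance how many rows deviate, which is easier to read than a long line of numbers.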

In the following script we skip the first entry (the date) of every row, drop the extra last element from the rows of length 10, and store the header accordingly.

In [None]:
data = []
for counter, row in enumerate(csvfile):
    fields = row.split(';') # split the row only once
    if counter == 0:
        header = fields[1:-1] # skip the date column and the extra last element
    elif len(fields) == 9:
        data.append(fields[1:]) # skip only the date column
    elif len(fields) == 10:
        data.append(fields[1:-1]) # skip the date column and the extra last element
print(len(header))
print(len(data[0]))

Now we end up with a representation of our data like this.

In [None]:
data[:10]

We are almost done, but we still need to get rid of the second __column__ (index 1) and convert all the strings to floats.

In [None]:
header.pop(1)
for i in range(len(data)):
    data[i].pop(1) # removing second element
    data[i] = list(map(float, data[i])) # call "float" on every element and store the result as a list


In [None]:
print(len(header))
print(len(data[0]))
data[:10]

In [None]:
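means = []
for i in range(len(data[0])): # loop over the columns
    container = 0
    for j in range(len(data)): # sum the values of column i over all rows
        container += data[j][i]
    means.append(container / len(data)) # divide the sum by the number of rows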
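
Column averages can also be computed more compactly with `zip()` and `statistics.mean` from the standard library. The small table below is made up for illustration and only assumes the same layout as `data`: one list of floats per row.

```python
from statistics import mean

# Made-up numeric table (rows = days, columns = bike paths), not the real data.
table = [
    [10.0, 20.0, 30.0],
    [20.0, 40.0, 60.0],
    [30.0, 60.0, 90.0],
]

# zip(*table) regroups the values column by column
column_means = [mean(column) for column in zip(*table)]

print(column_means)  # [20.0, 40.0, 60.0]
```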


## Additional Material

### Useful tips on file reading

The last thing I would like to show you is how to store the contents of a file in a list, which I find useful in some cases. Python provides the fileobject.readlines() method, which creates a list where each element is one line from the file. As you can see in the example below, this keeps the annoying trailing newline characters "\n" at the end of the lines. So in the second example I read in the file as one string and split it on the newline characters "\n".

In [None]:
lines = open('data/austen-emma-excerpt.txt', 'rt').readlines()
print("Number of lines", len(lines))
print(lines)
print() # print an empty line between the two outputs
print(open('data/austen-emma-excerpt.txt', 'rt').read().split('\n'))
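
Two further ways of getting rid of the trailing newlines are sketched below on a short made-up string that stands in for a file's content. Note that splitting on "\n" leaves an empty string at the end when the text ends with a newline, while `splitlines()` does not.

```python
# Made-up text standing in for the content of a file.
text = "first line\nsecond line\n\nlast line\n"

split_result = text.split('\n')        # extra '' at the end
splitlines_result = text.splitlines()  # no extra '' at the end

# Stripping the newline from each line, as you would with the output of readlines()
kept = text.splitlines(keepends=True)  # the same shape that readlines() returns
stripped = [line.rstrip('\n') for line in kept]

print(split_result)       # ['first line', 'second line', '', 'last line', '']
print(splitlines_result)  # ['first line', 'second line', '', 'last line']
print(stripped)           # ['first line', 'second line', '', 'last line']
```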

Lastly, below I show a more "pythonic" way of opening a file. It is preferable to use this `with` syntax because the file is closed automatically when the indented block ends, even if an error occurs inside it. For now, just remember that it's safer.

In [None]:
with open('data/austen-emma-excerpt.txt','rt', encoding='utf-8') as txt:
    for line in txt:
        print(line)

### Working with Directories

Now that we have started to work with files, we need some insight into how to navigate the folder/directory structure. Most people use some sort of graphical user interface (GUI) to navigate to files, such as the Finder on Mac OS or the My Computer icon on Windows. Now we are going to interact with these folder structures programmatically. The workhorse of this section is Python's os module. The GUI you are using translates the commands of your operating system into clicks on icons for easier use. Python's os module is very similar to the GUI in that it provides an interface that lets you navigate between folders, create new folders, rename files, etc.

In [None]:
import os

Let's get started by checking which directory we are currently in.

In [None]:
print(os.getcwd())

getcwd stands for "get current working directory". As you can see the name of the current directory is XXXXXXXXXXX. The directories to its left are the names of the higher-level directories. On Linux and Mac these are delimited by "/", while on Windows they are delimited by "\". This distinction is extremely unnecessary, I know, but what can you do.
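
One thing you can do is let Python pick the right delimiter. The sketch below uses `os.path.join`, `os.sep` and `os.path.split`; the path is only assembled from the folder and file name used earlier in this chapter and is not opened.

```python
import os

# Build a path from its parts; Python inserts the delimiter of the current OS.
path = os.path.join('data', 'austen-emma-excerpt.txt')
print(path)

print(os.sep)  # the delimiter itself: "/" on Linux and Mac, "\" on Windows

# Split the path back into the folder part and the file name
folder, filename = os.path.split(path)
print(folder, filename)
```

Code written this way runs unchanged on Linux, Mac and Windows.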

OK, now let's check which files and folders we have in this directory.

In [None]:
print(os.listdir('.')) # The '.' refers to 'current directory'

Let's see which of these are files and which are directories. We are going to use os.path.isdir, which returns True if the string in question refers to a directory and False otherwise. Since an element is either a directory or a file and there are no other options, we only ask whether the current element is a directory, and if not, we infer that it is a file.

In [None]:
file_list = os.listdir('.') # list current working directory
files = [] # collect the filenames here
directories = [] # collect the directory names here
for element in file_list:
    if os.path.isdir(element):
        print(element, " \t --> is a directory")
        directories.append(element)
    else:
        print(element, " \t --> is a file")
        files.append(element)

In [None]:
print("Directories:", directories)
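
As an alternative sketch, `os.scandir` yields entries that already know whether they are directories, so no separate `os.path.isdir` call is needed. The scratch directory with one sub-folder (`subfolder`) and one file (`notes.txt`) is made up so that the result is predictable.

```python
import os
import tempfile

# Hypothetical scratch directory with one sub-folder and one file in it.
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, 'subfolder'))
with open(os.path.join(base, 'notes.txt'), 'wt', encoding='utf-8') as f:
    f.write('Testing')

files = []        # collect the filenames here
directories = []  # collect the directory names here
with os.scandir(base) as entries:
    for entry in entries:
        if entry.is_dir():
            directories.append(entry.name)
        else:
            files.append(entry.name)

print("Directories:", directories)
print("Files:", files)
```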

In [None]:
os.chdir('data') # descending to the folder "data"
print(os.getcwd()) # where are we now?
print(os.listdir('.')) # what do we have here?
os.chdir('..') # going back up
print(os.getcwd()) # are we back?


The following code snippet:
 + goes to the data directory
 + creates a new directory inside it "test"
 + creates a new file "test.txt"
 + removes the file "test.txt"
 + removes the directory "test"

In [None]:
print("We are here:", os.getcwd())
os.chdir('data') # chdir --> change directory
print("We are here:", os.getcwd())
print(os.listdir('.'))
os.mkdir('test') # mkdir --> make directory
print(os.listdir('.'))
os.chdir('test') # chdir --> change directory
print(os.listdir('.'))
with open("test.txt", 'wt') as f: # create the file; it is closed again after the block
    f.write('Testing')
print(os.listdir('.'))
with open("test.txt", 'rt') as f:
    print(f.read())
os.remove("test.txt") # remove --> delete the file
os.chdir('..')
print("We are here", os.getcwd())
os.rmdir('test') # rmdir --> remove the (empty) directory
print(os.listdir('.'))
os.chdir('..')
print("And we're back to:", os.getcwd())
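
The same steps can be sketched without changing the working directory at all, by building full paths with `os.path.join`. A temporary directory stands in for the `data` folder here, which is an assumption made so the sketch cannot leave anything behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:   # stand-in for the "data" folder
    test_dir = os.path.join(base, 'test')
    os.mkdir(test_dir)                        # create the directory "test"

    test_file = os.path.join(test_dir, 'test.txt')
    with open(test_file, 'wt', encoding='utf-8') as f:
        f.write('Testing')                    # create the file "test.txt"
    listing_with_file = os.listdir(test_dir)

    os.remove(test_file)                      # remove the file again
    os.rmdir(test_dir)                        # remove the now empty directory
    listing_after_cleanup = os.listdir(base)

print(listing_with_file)       # ['test.txt']
print(listing_after_cleanup)   # []
```

Because the working directory never changes, a failure halfway through cannot leave your program stranded in the wrong folder.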