# FILE IO 

We're going to load a CSV file and convert it to a usable data frame.

Specifically, we're using the Iris flower data, found here: https://archive.ics.uci.edu/ml/datasets/iris

In [1]:
import pandas as pd

This "import as" alias is a convention.

You will encounter many of these; it is best to stick to them if you want your code to stay readable to others.

As an example, SciPy recommends the following imports:

In [2]:
# We won't actually use these imports now; they are just for reference
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

SciPy reference: https://docs.scipy.org/doc/scipy/reference/tutorial/general.html

In [3]:
filename = "iris.data" 
df = pd.read_csv(filename, header=None)

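Despite its .data extension, the file we load is plain comma-separated text, so read_csv can parse it. Passing header=None tells pandas that the file has no header row.

"df" is another convention; it stands for data frame.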

But did it work?

In [4]:
print(df.head())

     0    1    2    3            4
0  5.1  3.5  1.4  0.2  Iris-setosa
1  4.9  3.0  1.4  0.2  Iris-setosa
2  4.7  3.2  1.3  0.2  Iris-setosa
3  4.6  3.1  1.5  0.2  Iris-setosa
4  5.0  3.6  1.4  0.2  Iris-setosa
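
To see why header=None matters, here is a minimal sketch (assuming Python 3) that parses a small in-memory sample instead of the real file. The two-row sample string and the variable names are made up for illustration; only pd.read_csv and its header parameter come from the cells above.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as iris.data (no header row)
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Default behaviour: the first line is consumed as column names,
# so one row of real data is silently lost
with_default = pd.read_csv(io.StringIO(sample))

# header=None: every line is treated as data and columns are numbered from 0
with_none = pd.read_csv(io.StringIO(sample), header=None)

print(len(with_default))        # number of data rows with the default header handling
print(len(with_none))           # number of data rows with header=None
print(list(with_none.columns))  # the automatically assigned column labels
```

With the default settings the first flower would end up as the column labels, which is why the real file is loaded with header=None.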


Now let's reload the data, this time supplying column names.

In [5]:
column_names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
df = pd.read_csv(filename, names=column_names)
print(df.head())

   SepalLength  SepalWidth  PetalLength  PetalWidth      Species
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
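
As a quick sanity check, the sketch below (assuming Python 3) loads a tiny in-memory sample with the same column names and inspects the result. The sample rows and the name "small" are illustrative only. It relies on the documented pandas behaviour that passing names without header makes read_csv treat every line as data.

```python
import io
import pandas as pd

columns = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']

# Hypothetical sample rows in the iris.data layout (no header line)
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n"

# Supplying names labels the columns; no row is consumed as a header
small = pd.read_csv(io.StringIO(sample), names=columns)

print(small.shape)   # (rows, columns)
print(small.dtypes)  # the measurement columns should be parsed as floats
```

Checking shape and dtypes right after loading is a cheap way to catch a wrong separator or a header row that was read as data.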


We can now access these columns directly.

In [6]:
print("First 5 rows of the Petal Length column\n")
print(df.PetalLength[0:5])
print("\n\nVersicolor with Sepal Length over 6.5 cm\n")
print(df[(df.Species == "Iris-versicolor") & (df.SepalLength > 6.5)][0:10])

First 5 rows of the Petal Length column

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: PetalLength, dtype: float64


Versicolor with Sepal Length over 6.5 cm

    SepalLength  SepalWidth  PetalLength  PetalWidth          Species
50          7.0         3.2          4.7         1.4  Iris-versicolor
52          6.9         3.1          4.9         1.5  Iris-versicolor
58          6.6         2.9          4.6         1.3  Iris-versicolor
65          6.7         3.1          4.4         1.4  Iris-versicolor
75          6.6         3.0          4.4         1.4  Iris-versicolor
76          6.8         2.8          4.8         1.4  Iris-versicolor
77          6.7         3.0          5.0         1.7  Iris-versicolor
86          6.7         3.1          4.7         1.5  Iris-versicolor
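
The filter above can be broken into steps to show what is going on. This is a sketch on a small made-up data frame (the rows and the names small, is_versicolor, is_long and mask are illustrative, not part of the notebook), assuming Python 3.

```python
import pandas as pd

# Hypothetical four-row frame using two of the column names from above
small = pd.DataFrame({
    'SepalLength': [5.1, 7.0, 6.4, 6.9],
    'Species': ['Iris-setosa', 'Iris-versicolor',
                'Iris-versicolor', 'Iris-versicolor'],
})

# Each comparison yields a boolean Series with one True/False per row
is_versicolor = small['Species'] == 'Iris-versicolor'
is_long = small['SepalLength'] > 6.5

# & combines the two Series element by element; in the one-line form each
# comparison needs its own parentheses because & binds tighter than == and >
mask = is_versicolor & is_long

# Indexing with the mask keeps only rows where it is True,
# and those rows keep their original index labels
selected = small[mask]

print(mask.tolist())
print(selected.index.tolist())
```

Because the selected rows keep their original labels, the filtered output above starts at index 50 rather than 0.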


If we were to load from an Excel file, we could use a similar function, also found in pandas.

(We use the sample xlsx data file found here: http://www.contextures.com/xlSampleData01.html)

In [7]:
filename = "sampledata.xlsx"  # This code will work with .xls or .xlsx files
xl = pd.ExcelFile(filename)
df = xl.parse("SalesOrders")  # The name of the sheet we want to open
print(df.head())

   OrderDate   Region      Rep    Item  Units  Unit Cost   Total
0 2016-01-06     East    Jones  Pencil     95       1.99  189.05
1 2016-01-23  Central   Kivell  Binder     50      19.99  999.50
2 2016-02-09  Central  Jardine  Pencil     36       4.99  179.64
3 2016-02-26  Central     Gill     Pen     27      19.99  539.73
4 2016-03-15     West  Sorvino  Pencil     56       2.99  167.44
