# Pandas
---

Pandas will introduce us to a powerful new data structure: the dataframe. A dataframe is a two-dimensional, column-oriented data structure in which each variable (typically) occupies a column and each record (usually) occupies a row. You can think of it as an Excel-like or SQL-like table. We will begin by looking at the basic operations we can perform with a dataframe. 

We will cover:

Topic | Method
-|-
[reading data into a dataframe](#Reading-csv-into-dataframe) | `.read_csv()`
[checking the metadata](#Info)| `.info()`
[viewing the first few rows](#Head)|`.head()`
[viewing the last few rows](#Tail)|`.tail()`
[viewing some statistical properties](#Describe)|`.describe()`

## *Resources*
[The pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)

[Tidy data by Hadley](http://vita.had.co.nz/papers/tidy-data.pdf)

[Ten minute tour of Pandas by creator Wes McKinney](https://vimeo.com/59324550)

---

In [1]:
%load_ext watermark
%watermark -a 'Alexander C Booth' -mvd -p pandas

Alexander C Booth 2016-10-02 

CPython 3.5.1
IPython 4.2.0

pandas 0.18.1

compiler   : GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)
system     : Darwin
release    : 16.0.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit


In [2]:
import pandas as pd

### Reading in the Data
We read the data into RAM (memory) and store it directly into a DataFrame object. Pandas has many built-in functions for reading data into a dataframe. 

Here is a list:
- pd.read_clipboard
- pd.read_csv
- pd.read_excel
- pd.read_fwf
- pd.read_gbq
- pd.read_hdf
- pd.read_html
- pd.read_json
- pd.read_msgpack
- pd.read_pickle
- pd.read_sas
- pd.read_sql
- pd.read_sql_query
- pd.read_sql_table
- pd.read_stata
- pd.read_table

We will read in a csv below.

### Reading csv into dataframe
We should note here that pandas will infer the data types of the columns in your data. It does a good job at this, but we will have to keep in mind that it cannot always infer the proper type. 

*For example, the data below has a "class" column. Since the class values are numerical, pandas assumes we want these to be numbers instead of labels.*

In [3]:
df = pd.read_csv('../../data/iris.csv')
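To make the inference concrete, here is a minimal sketch that uses a small in-memory CSV instead of a file on disk. The `ID` and `height` columns are made up for illustration and are not part of the iris data. It shows the default inference and one way to override it with the `dtype` argument of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "ID,height\n001,1.5\n002,1.7\n003,1.6\n"

# Default behaviour: pandas infers ID as an integer column,
# so the leading zeros are lost
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)
print(inferred["ID"].tolist())

# Explicitly asking for strings keeps the IDs as labels
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"ID": str})
print(explicit["ID"].tolist())
```

Passing `dtype` up front is only one option; converting the column after loading would also work.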

Since pandas is built on top of NumPy, a dataframe exposes NumPy-style attributes such as `.shape`, which returns a (rows, columns) tuple.

In [4]:
df.shape

(150, 5)
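As a small illustration of how `.shape` can be used, the sketch below unpacks the tuple into separate row and column counts. The frame `small` is a made-up example, not the iris data.

```python
import pandas as pd

# Small made-up frame: 3 rows, 2 columns
small = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = small.shape
print(n_rows, n_cols)

# The underlying NumPy array reports the same shape
print(small.values.shape)
```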

### Viewing the data
First, we will view some metadata about the data, using the `.info()` method. Next, we will use the `.head()` and `.tail()` methods to view small pieces of the data at a time.

### Info

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
class                150 non-null int64
dtypes: float64(4), int64(1)
memory usage: 5.9 KB


The info method allowed us to see:
- the number of rows (150) and columns (5)
- that there are 4 columns of type float64 and 1 column of type int64
- the name of each column, along with its number of non-null entries
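The same facts can also be read off individually, which helps when we want to use them in code rather than just look at them. The sketch below uses a made-up frame with one missing value; the column names are illustrative only.

```python
import pandas as pd

# Made-up frame with one missing value in "length"
small = pd.DataFrame({
    "length": [5.1, 4.9, None],
    "label": [0, 0, 1],
})

print(small.dtypes)                     # data type of each column
print(small.count())                    # non-null entries per column
print(len(small), len(small.columns))   # number of rows, number of columns
```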

Note that, in pandas, a column of type 'object' usually means pandas has inferred its values to be strings, which often indicates categorical data. In our iris dataframe, although pandas has inferred class to be numerical, its values are actually labels for the classes of flower. We will change this to object type in the future.
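One possible way to make that change is sketched here on a made-up stand-in for the iris frame. `astype` is a standard pandas method, but using it for this conversion is our assumption and may not be the approach taken later in this series.

```python
import pandas as pd

# Made-up stand-in for two of the iris columns
small = pd.DataFrame({
    "petal length (cm)": [1.4, 4.7, 6.0],
    "class": [0, 1, 2],
})

# Treat the class values as labels rather than numbers
small["class"] = small["class"].astype(object)
print(small.dtypes)
```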

### Head
Now let's view a few rows. The head method will show us the first 5 rows by default, but we can pass in an integer N to see the first N rows instead. 

In [6]:
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [7]:
df.head(10) # Viewing the first 10 rows

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
5,5.4,3.9,1.7,0.4,0
6,4.6,3.4,1.4,0.3,0
7,5.0,3.4,1.5,0.2,0
8,4.4,2.9,1.4,0.2,0
9,4.9,3.1,1.5,0.1,0


### Tail
Similarly, we can view the last 5 rows by calling the tail method. We can also choose to pass in an integer N to see the last N rows instead. 

In [8]:
df.tail()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2
149,5.9,3.0,5.1,1.8,2


In [9]:
df.tail(10) # Last 10 Rows

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
140,6.7,3.1,5.6,2.4,2
141,6.9,3.1,5.1,2.3,2
142,5.8,2.7,5.1,1.9,2
143,6.8,3.2,5.9,2.3,2
144,6.7,3.3,5.7,2.5,2
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2
149,5.9,3.0,5.1,1.8,2
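Both methods return a new dataframe, so the result can be stored and reused. A minimal sketch on a made-up eight-row frame, including what happens when N is larger than the number of rows:

```python
import pandas as pd

# Made-up frame with a single column holding 0..7
small = pd.DataFrame({"x": range(8)})

first_two = small.head(2)    # rows with index 0 and 1
last_three = small.tail(3)   # rows with index 5, 6 and 7
print(first_two)
print(last_three)

# Asking for more rows than exist simply returns the whole frame
print(len(small.head(100)))
```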


### Describe
Lastly, we can examine the descriptive statistics of the dataframe by calling the describe method. When we call this method, pandas will calculate the following properties of the dataframe for the numerical columns only:
- count
- mean
- standard deviation
- minimum 
- quantiles (25%, 50% and 75% by default)
- maximum

In [10]:
df.describe()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),class
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667,1.0
std,0.828066,0.433594,1.76442,0.763161,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0
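To see where these numbers come from, the sketch below compares a few entries of the `describe` output with the corresponding column methods on a made-up four-value column. The result of `describe` is itself a dataframe, so individual statistics can be looked up with `.loc`.

```python
import pandas as pd

# Made-up single-column frame
small = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
summary = small.describe()

# Each row of the summary matches a direct computation on the column
print(summary.loc["mean", "x"], small["x"].mean())
print(summary.loc["std", "x"], small["x"].std())         # sample standard deviation
print(summary.loc["50%", "x"], small["x"].quantile(0.5))  # the median
```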


It is important to note here that since the column 'class' represents a label and not actually a number, its descriptive statistics are not meaningful. 
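One way to keep the label out of the summary, sketched on a made-up stand-in for the iris frame, is to drop that column before calling `describe`. This is an illustrative option rather than a step taken in this notebook.

```python
import pandas as pd

# Made-up stand-in for two of the iris columns
small = pd.DataFrame({
    "petal length (cm)": [1.4, 4.7, 6.0],
    "class": [0, 1, 2],
})

# Drop the label column so only real measurements are summarised;
# the original frame is left unchanged
measurements = small.drop("class", axis=1)
print(measurements.describe())
```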