# intro to pandas

This file is a [Jupyter](https://jupyter.org/) notebook. The output shown here was produced by a Python kernel when the page was generated. You can type the commands from a notebook like this one into your Python shell (or run them in a Python script) and expect to see the same results, assuming you have the dependencies installed.

We'll be taking a look at a **library** called [pandas](http://pandas.pydata.org/) which gives us some important basic functionality for handling datasets in Python.

If you're not sure what commands are available to you, note that (like the Unix command line) IPython supports tab-completion.

Let's import pandas and load our DataFrame again.

In [12]:
import pandas as pd
input_file = '~/gits/gads_26/datasets/state_hts.tsv'
data = pd.read_csv(input_file, sep='\t')

## 3. editing data

We can create a new column by assigning values to it. Note that creating a column requires element (bracket) syntax such as `data['elev_m'] = ...`; attribute syntax such as `data.elev_m = ...` does not create a column.

In [13]:
feet_to_meters = 0.3048
data['elev_m'] = feet_to_meters * data.elev_ft
data.head()

| | state | peak | elev_ft | elev_m |
|---|---|---|---|---|
| 0 | Alabama | Cheaha Mountain | 2405 | 733.044 |
| 1 | Alaska | Denali | 20320 | 6193.536 |
| 2 | Arizona | Humphreys Peak | 12633 | 3850.5384 |
| 3 | Arkansas | Magazine Mountain | 2753 | 839.1144 |
| 4 | California | Mount Whitney | 14495 | 4418.076 |
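To make the difference between the two syntaxes concrete, here is a minimal sketch on a toy DataFrame built from two rows of the table above. The name `elev_km` is hypothetical and used only for illustration.

```python
import warnings

import pandas as pd

# Toy frame using two rows from the table above.
df = pd.DataFrame({'state': ['Alabama', 'Alaska'], 'elev_ft': [2405, 20320]})

# Element (bracket) syntax creates a real column.
df['elev_m'] = 0.3048 * df['elev_ft']

# Attribute syntax only sets a plain Python attribute on the object.
# pandas emits a UserWarning here (silenced below) and no column is created.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df.elev_km = df['elev_m'] / 1000  # hypothetical name, for illustration

print(list(df.columns))  # ['state', 'elev_ft', 'elev_m']
```

The attribute assignment does not raise an error, which is why this mistake is easy to miss.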


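Another way to derive one column from another is the Series `apply` method, which calls an arbitrary Python function on each element. This is more flexible than arithmetic on whole columns; here we use it to truncate each result to an integer: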

In [14]:
data['elev_m'] = data.elev_ft.apply(lambda k: int(feet_to_meters * k))
data.head()

| | state | peak | elev_ft | elev_m |
|---|---|---|---|---|
| 0 | Alabama | Cheaha Mountain | 2405 | 733 |
| 1 | Alaska | Denali | 20320 | 6193 |
| 2 | Arizona | Humphreys Peak | 12633 | 3850 |
| 3 | Arkansas | Magazine Mountain | 2753 | 839 |
| 4 | California | Mount Whitney | 14495 | 4418 |


Note the use of the anonymous function (denoted by the keyword `lambda`) passed to the `apply` method.
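As a small sketch of what `apply` is doing, the anonymous function can be swapped for a named one with the same result. The helper name `to_whole_meters` is hypothetical. The example also shows that `int` truncates rather than rounds, which matters for Denali's elevation.

```python
import pandas as pd

feet_to_meters = 0.3048
elev_ft = pd.Series([2405, 20320], name='elev_ft')  # two values from the table above


def to_whole_meters(k):
    """Convert feet to meters, dropping the fractional part (hypothetical helper)."""
    return int(feet_to_meters * k)


with_lambda = elev_ft.apply(lambda k: int(feet_to_meters * k))
with_named = elev_ft.apply(to_whole_meters)

# Rounding to the nearest integer instead of truncating gives a different answer.
rounded = (feet_to_meters * elev_ft).round().astype(int)

print(with_lambda.tolist())  # [733, 6193]
print(with_named.tolist())   # [733, 6193]
print(rounded.tolist())      # [733, 6194]
```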

We can also create a column as a function of other columns:

In [15]:
data['scale_factor'] = data.elev_m / data.elev_ft
data.head()

| | state | peak | elev_ft | elev_m | scale_factor |
|---|---|---|---|---|---|
| 0 | Alabama | Cheaha Mountain | 2405 | 733 | 0.304782 |
| 1 | Alaska | Denali | 20320 | 6193 | 0.304774 |
| 2 | Arizona | Humphreys Peak | 12633 | 3850 | 0.304757 |
| 3 | Arkansas | Magazine Mountain | 2753 | 839 | 0.304758 |
| 4 | California | Mount Whitney | 14495 | 4418 | 0.304795 |


## 4. manipulating data

The `shape` attribute is a tuple that contains the dimensions (rows, columns) of the DataFrame. Note that the syntax is `shape` and not `shape()`, since it's an attribute of the DataFrame object and not a method.
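A minimal sketch of `shape` on a toy DataFrame (three rows taken from the tables above, not the full dataset):

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'elev_ft': [2405, 20320, 12633],
})

# shape is an attribute, so there are no parentheses after it.
print(df.shape)  # (3, 2)

# Because it is an ordinary tuple, it can be unpacked directly.
n_rows, n_cols = df.shape
```

Calling `df.shape()` would raise a `TypeError`, since a tuple is not callable.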

Another important tool is the `describe` method, which gives a summary of the numeric features in our dataset:

In [16]:
data.describe()

| | elev_ft | elev_m | scale_factor |
|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 |
| mean | 6161.78 | 1877.68 | 0.30463 |
| std | 5086.229574 | 1550.309547 | 0.000271 |
| min | 345.0 | 105.0 | 0.303167 |
| 25% | 2058.75 | 626.75 | 0.304642 |
| 50% | 4588.5 | 1398.0 | 0.304717 |
| 75% | 10616.5 | 3235.25 | 0.304767 |
| max | 20320.0 | 6193.0 | 0.304798 |


The output covers only the numeric columns: `elev_ft`, `elev_m`, and `scale_factor`. In addition to the count, mean, and standard deviation of each column, we also get five important percentiles (min = 0%, 25% = first quartile, 50% = median, 75% = third quartile, max = 100%).

These percentiles make up a **five-number summary** of the distribution of each column, for example `elev_ft`. The five-number summary is a useful first approximation to the shape of the distribution of the data. It gives us a rough picture of central tendency, spread, skew, and tail behavior.

For `elev_ft`, the mean (6161.78) is well above the median (4588.5), and the maximum (20320) lies much farther above the third quartile (10616.5) than the minimum (345) lies below the first quartile (2058.75). This suggests that the distribution of `elev_ft` is right-skewed with a long upper tail.
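The five-number summary can also be computed directly with the Series `quantile` method. The sketch below uses only the five elevations visible in the tables above, so its numbers illustrate the mechanics and say nothing about the full 50-row distribution.

```python
import pandas as pd

# The five elev_ft values shown by data.head(); not the full dataset.
elev_ft = pd.Series([2405, 20320, 12633, 2753, 14495], name='elev_ft')

# min, first quartile, median, third quartile, max
five_number = elev_ft.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_number)

# describe() reports the same percentiles, plus count, mean, and std.
summary = elev_ft.describe()
print(summary['mean'], summary['50%'])
```

Comparing `summary['mean']` with `summary['50%']` is the same mean-versus-median check used above to judge skew.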

## 2. selecting data

Sometimes we'll want to use only a subset of our data at once. There are [several ways](http://pandas.pydata.org/pandas-docs/stable/indexing.html) to perform these kinds of selection operations on a DataFrame.

We can access a single column using the same syntax we use to access elements in a `dict`: 

In [17]:
data['state'].head()

0       Alabama
1        Alaska
2       Arizona
3      Arkansas
4    California
Name: state, dtype: object

We can also read columns using attribute notation (note that this doesn't work for creating a new column):

In [18]:
data.state.head()

0       Alabama
1        Alaska
2       Arizona
3      Arkansas
4    California
Name: state, dtype: object
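As a quick sketch, the two notations return the same Series, but attribute notation breaks down when a column name is not a valid Python identifier or collides with an existing DataFrame attribute. The `count` and `elev ft` columns below are hypothetical and exist only to show those two cases.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['Alabama', 'Alaska'],
    'count': [1, 2],           # hypothetical: same name as the DataFrame.count method
    'elev ft': [2405, 20320],  # hypothetical: column name containing a space
})

# Both notations give the same Series for an ordinary column name.
same = df['state'].equals(df.state)
print(same)  # True

# df.count is the DataFrame method, not the column.
print(callable(df.count))  # True

# Bracket notation always reaches the column.
print(df['count'].tolist())    # [1, 2]
print(df['elev ft'].tolist())  # [2405, 20320]
```

For this reason bracket notation is the safer default when column names are not known in advance.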

Let's check the type of the object we get back when we select a column:

In [19]:
type(data.state)

pandas.core.series.Series

The column is stored as a Series, another fundamental data storage object in pandas. For our purposes, we will mostly see Series objects as constituent parts of a DataFrame.

Series objects have methods too. For example, we can find the average elevation of the 50 state high points:

In [20]:
data.elev_ft.mean()

6161.7799999999997

This agrees with the output from `describe`. 

Another useful way to select data is with a **boolean mask**. This is just a fancy term for an array of boolean (T/F) values that indicates which values to return:

In [21]:
data[data.state == 'California']

| | state | peak | elev_ft | elev_m | scale_factor |
|---|---|---|---|---|---|
| 4 | California | Mount Whitney | 14495 | 4418 | 0.304795 |


Under the hood, the boolean condition we use here is an array of 50 T/F values, where the only T occurs at index 4.
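To see the mask itself, here is a minimal sketch on a five-row toy DataFrame built from the rows shown earlier. It also shows one way to combine two conditions, which needs `&` and parentheses rather than Python's `and`.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'elev_ft': [2405, 20320, 12633, 2753, 14495],
})

# The comparison produces a boolean Series, one value per row.
mask = df.state == 'California'
print(mask.tolist())  # [False, False, False, False, True]

# Indexing with the mask keeps only the rows where it is True.
print(df[mask])

# Conditions are combined element-wise with & (and) or | (or);
# each condition needs its own parentheses.
high_not_alaska = df[(df.elev_ft > 10000) & (df.state != 'Alaska')]
print(high_not_alaska.state.tolist())  # ['Arizona', 'California']
```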

We can access specific cells in the DataFrame using the `iloc` syntax. The `iloc` syntax is flexible and can take many different types of inputs (ints, arrays of ints, slice objects, [among others](http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position)).

Keep in mind when using `iloc` that the first argument specifies rows, the second argument specifies columns, and the arguments are separated by a comma. For example, the following command returns the values in column `0` for rows `10` through `14` (like elsewhere in Python, [slice objects](https://docs.python.org/3.5/tutorial/introduction.html) include the lower index and exclude the upper index):

In [22]:
data.iloc[10:15, 0]

10      Hawaii
11       Idaho
12    Illinois
13     Indiana
14        Iowa
Name: state, dtype: object

Passing `:` on its own as the second argument, as in `data.iloc[10:15, :]`, acts as a wildcard that returns all the columns. You could also omit the columns argument and get the same result:

In [23]:
data.iloc[10:15]

| | state | peak | elev_ft | elev_m | scale_factor |
|---|---|---|---|---|---|
| 10 | Hawaii | Mauna Kea | 13796 | 4205 | 0.304798 |
| 11 | Idaho | Borah Peak | 12662 | 3859 | 0.30477 |
| 12 | Illinois | Charles Mound | 1235 | 376 | 0.304453 |
| 13 | Indiana | Hoosier Hill | 1257 | 383 | 0.304694 |
| 14 | Iowa | Hawkeye Point | 1670 | 509 | 0.30479 |
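The sketch below runs through a few of the input types `iloc` accepts, using a five-row toy DataFrame built from the rows shown earlier rather than the full dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'peak': ['Cheaha Mountain', 'Denali', 'Humphreys Peak',
             'Magazine Mountain', 'Mount Whitney'],
    'elev_ft': [2405, 20320, 12633, 2753, 14495],
})

# Two ints: a single cell (row 1, column 0).
print(df.iloc[1, 0])  # Alaska

# A list of ints for the rows, one int for the column.
print(df.iloc[[0, 2], 1].tolist())  # ['Cheaha Mountain', 'Humphreys Peak']

# A slice for the rows and ':' for all columns.
print(df.iloc[1:3, :].shape)  # (2, 3)

# Omitting the column argument is equivalent to passing ':'.
print(df.iloc[1:3].equals(df.iloc[1:3, :]))  # True

# Negative positions count from the end, as with Python lists.
print(df.iloc[-1, 2])  # 14495
```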
