# intro to pandas

This file is a [Jupyter](https://jupyter.org/) notebook. The output that appears here was created by a Python kernel when this page was created. You can type the commands that appear in a notebook file like this one into your Python shell (or run them in a Python script) and expect to see the same results, assuming you have the dependencies installed.

We'll be taking a look at a **library** called [pandas](http://pandas.pydata.org/) which gives us some important basic functionality for handling datasets in Python.

## 1. pandas basics

In order to use code in a library, we need to **import** it. This makes the library code accessible to Python by bringing it into [scope](https://docs.python.org/3.5/tutorial/classes.html#python-scopes-and-namespaces). The conventional way to import pandas is like this:

In [27]:
import pandas as pd

Once it's in scope we can access functions and objects that live in pandas by referring to them with the `pd` prefix. This prefix specifies the pandas **namespace**, which is a mapping from names to the pandas objects they refer to.

For example, we can access the pandas DataFrame object like this:

In [28]:
test_df = pd.DataFrame({'animal': ['cow', 'pig', 'chicken'], 'count': [3, 6, 12]})
test_df

Unnamed: 0,animal,count
0,cow,3
1,pig,6
2,chicken,12


The **DataFrame** is the fundamental object we'll use to manipulate datasets in pandas. A DataFrame can be constructed from an instance of `dict`. It stores values (of possibly different types) in rows and columns, like a relational table. It uses an **index** to keep track of its records; this is the left-most column of integers you see above, without a heading.
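As a minimal sketch of the points above, the following rebuilds the same toy DataFrame and inspects its column types, index, and column names. The variable names `index_labels` and `column_names` are illustrative choices, not part of pandas.

```python
import pandas as pd

# Same toy data as in the cell above.
test_df = pd.DataFrame({'animal': ['cow', 'pig', 'chicken'], 'count': [3, 6, 12]})

# Each column carries its own dtype: text for 'animal', integers for 'count'.
print(test_df.dtypes)

# With no index given, pandas assigns the integer labels 0, 1, 2.
index_labels = list(test_df.index)
print(index_labels)

# Column names come from the dict keys, in insertion order.
column_names = list(test_df.columns)
print(column_names)
```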

Creating a DataFrame explicitly from a `dict` as above can be useful, but frequently we'll use another function that allows us to easily create a DataFrame from an input file:

In [29]:
input_file = '~/gits/gads_26/datasets/state_hts.tsv'
data = pd.read_csv(input_file, sep='\t')
data.head()

Unnamed: 0,state,peak,elev_ft
0,Alabama,Cheaha Mountain,2405
1,Alaska,Denali,20320
2,Arizona,Humphreys Peak,12633
3,Arkansas,Magazine Mountain,2753
4,California,Mount Whitney,14495


The `read_csv` function is the workhorse for loading data into pandas. Its first argument is a path to an input file, and its second ([keyword](http://sys-exit.blogspot.com/2013/07/python-positional-arguments-and-keyword.html)) argument specifies that our data is tab-delimited. The default behavior of `read_csv` is to use the first row of the input file as a header row.
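To try this without the input file, here is a small sketch that reads tab-delimited text from an in-memory buffer. The string `tsv_text` is a stand-in built from two rows of the output above, and `header=None` is shown only to contrast with the default header behavior.

```python
import io

import pandas as pd

# In-memory stand-in for a tab-delimited file (two rows copied from the output above).
tsv_text = "state\tpeak\telev_ft\nAlabama\tCheaha Mountain\t2405\nAlaska\tDenali\t20320\n"

# sep='\t' tells read_csv that fields are separated by tabs.
# By default the first row becomes the column names.
with_header = pd.read_csv(io.StringIO(tsv_text), sep='\t')
print(list(with_header.columns), with_header.shape)

# header=None treats the first row as data and numbers the columns instead.
without_header = pd.read_csv(io.StringIO(tsv_text), sep='\t', header=None)
print(list(without_header.columns), without_header.shape)
```

`read_csv` accepts file-like objects as well as paths, which is why `io.StringIO` works here.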

The `head` method works like its Unix counterpart, returning the first few records of `data`. The default number of rows is 5, but we can change this by passing a number explicitly:

In [30]:
data.head(10)

Unnamed: 0,state,peak,elev_ft
0,Alabama,Cheaha Mountain,2405
1,Alaska,Denali,20320
2,Arizona,Humphreys Peak,12633
3,Arkansas,Magazine Mountain,2753
4,California,Mount Whitney,14495
5,Colorado,Mount Elbert,14433
6,Connecticut,Mount Frissell-South Slope,2372
7,Delaware,Ebright Azimuth,442
8,Florida,Britton Hill,345
9,Georgia,Brasstown Bald,4784


Let's look at some of pandas' basic exploratory tools. We've already seen the `head` method. Here's another useful one:

In [31]:
data.shape

(50, 3)

The `shape` attribute is a tuple that contains the dimensions (rows, columns) of the DataFrame. Note that the syntax is `shape` and not `shape()`, since it's an attribute of the DataFrame object and not a method.
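The attribute-versus-method distinction can be checked directly. This sketch uses the small toy DataFrame from earlier (named `df` here for illustration) rather than the state data, and the flag `called_ok` is only there to record what happened.

```python
import pandas as pd

df = pd.DataFrame({'animal': ['cow', 'pig', 'chicken'], 'count': [3, 6, 12]})

# shape is a plain tuple, so it can be unpacked into rows and columns.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# Adding parentheses tries to call the tuple, which raises a TypeError.
try:
    df.shape()
except TypeError as err:
    called_ok = False
    print('TypeError:', err)
else:
    called_ok = True
```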

Another important tool is the `describe` method, which gives a summary of the numeric features in our dataset:

In [32]:
data.describe()

Unnamed: 0,elev_ft
count,50.0
mean,6161.78
std,5086.229574
min,345.0
25%,2058.75
50%,4588.5
75%,10616.5
max,20320.0


The output is limited to the `elev_ft` column, since this is our only numeric feature. In addition to the count, mean, and standard deviation of the data we also get five important percentiles (0% = min, 25% = first quartile, 50% = median, 75% = third quartile, 100% = max).
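As a rough check on what `describe` reports, the percentile rows can be compared against direct calls to `quantile`. This sketch uses only the ten elevations visible in the `head(10)` output above, so its numbers differ from the full 50-row summary.

```python
import pandas as pd

# The ten elevations shown in the head(10) output; not the full dataset.
elev = pd.Series(
    [2405, 20320, 12633, 2753, 14495, 14433, 2372, 442, 345, 4784],
    name='elev_ft',
)

summary = elev.describe()
print(summary)

# The '25%', '50%' and '75%' rows agree with quantile(), and '50%' is the median.
for label, q in [('25%', 0.25), ('50%', 0.50), ('75%', 0.75)]:
    print(label, summary[label], elev.quantile(q))
```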

These percentiles comprise a **five-number summary** of the distribution of `elev_ft`. The five-number summary is a useful first approximation to the shape of the distribution of the data. It gives us a rough picture of central tendency, central variation, skew, and tail behavior.

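This five-number summary suggests that the distribution of `elev_ft` is skewed to the right and fat-tailed on the high side: the median (4588.5) sits well below the mean (6161.78), and the maximum (20320) lies much farther above the third quartile (10616.5) than the minimum (345) lies below the first quartile (2058.75).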
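One way to put numbers on that reading is to work directly from the values in the `describe` output. The quartile-based (Bowley) skewness used below is one common choice among several and is an assumption of this sketch, not something the notebook itself computes.

```python
# Values copied from the describe() output above.
minimum, q1, median, q3, maximum = 345.0, 2058.75, 4588.5, 10616.5, 20320.0
mean = 6161.78

# Spread of the middle half of the data.
iqr = q3 - q1

# Distances from the median to each quartile: a larger upper distance
# means the upper half of the middle 50% is more stretched out.
lower_half = median - q1
upper_half = q3 - median

# Quartile (Bowley) skewness: 0 for a symmetric middle, positive for right skew.
bowley_skew = (q3 + q1 - 2 * median) / iqr

# Distances from each quartile to the nearest extreme, a crude look at the tails.
lower_tail = q1 - minimum
upper_tail = maximum - q3

print(f"mean - median = {mean - median:.2f}")
print(f"lower half = {lower_half}, upper half = {upper_half}")
print(f"Bowley skewness = {bowley_skew:.3f}")
print(f"lower tail = {lower_tail}, upper tail = {upper_tail}")
```

A positive skewness together with an upper tail several times longer than the lower one is consistent with a handful of very high peaks pulling the mean above the median.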

## 2. selecting data

Sometimes we'll want to use only a subset of our data at once. There are [several ways](http://pandas.pydata.org/pandas-docs/stable/indexing.html) to perform these kinds of selection operations on a DataFrame.

We can access a single column using the same syntax we use to access elements in a `dict`: 

In [33]:
data['elev_ft'].sum()

308089
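To see what the bracket syntax returns, here is a small sketch on a three-row stand-in for `data` (rows copied from the output above, named `df` for illustration). Selecting one column gives a pandas `Series`, which is what provides methods such as `sum`.

```python
import pandas as pd

# Three-row stand-in for the state data.
df = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'elev_ft': [2405, 20320, 12633],
})

# Bracket access with a column name returns a Series, not a DataFrame.
col = df['elev_ft']
print(type(col).__name__)

# The Series keeps the column name and the DataFrame's index.
print(col.name, list(col.index))

# Aggregations such as sum() are Series methods.
total = col.sum()
print(total)
```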
