# A quick 10-minute introduction to Pandas

Credit: this is a notebook cribbed from [the official "10 Minutes to Pandas"](http://pandas.pydata.org/pandas-docs/stable/10min.html).

## Pandas: tools for data analysis in Python

* emphasis on tabular data (csv and the like)
* database/spreadsheet-like functionality
* rich support for mixed data (numpy is for homogeneous arrays)
* integrates cleanly with numpy and matplotlib

<center>
<a href="http://pandas.pydata.org" target="_blank"> 
<img src="files/pandas-book.jpg"></a>
</center>

There is also a wonderful [set of free tutorials](https://bitbucket.org/hrojas/learn-pandas) on Pandas.

In [None]:
import pandas as pd

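# Let's also load the plotting/numerical tools
# (%matplotlib inline replaces the discouraged %pylab inline, which
# floods the namespace and is redundant with the explicit imports below)
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np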

# make figures a bit larger by default
plt.rcParams['figure.figsize'] = (10,6)

## A quick taste of Pandas

Consider a standard stock price file downloaded from [Yahoo Finance](http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices) (note that older Pandas versions shipped with a utility to fetch this data; that functionality now lives in the separate `pandas-datareader` package):

In [None]:
!head AAPL.csv

In [None]:
stock = pd.read_csv('AAPL.csv')
stock.head()

In [None]:
stock.index = stock.pop('Date')
stock.head()

In [None]:
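# DataFrame.sort() was removed from pandas; sort_index() sorts by the index (here, the dates)
stock.sort_index(inplace=True)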
stock.head()

In [None]:
stock['Adj Close'].plot();
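The load, index, and sort steps above can also be sketched on a tiny inline table, which is handy when `AAPL.csv` is not at hand. The rows below are invented placeholders (not real quotes), and using `parse_dates` with `index_col` is an assumed alternative to popping the `Date` column by hand:

```python
import io
import pandas as pd

# Hypothetical rows laid out like a stock price export (values are made up).
csv_text = """Date,Open,Close,Adj Close
2013-01-04,10.5,10.7,10.6
2013-01-02,10.0,10.2,10.1
2013-01-03,10.2,10.4,10.3
"""

# parse_dates converts the Date strings to timestamps;
# index_col makes that column the row index in one step.
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# Sort chronologically by the index.
demo = demo.sort_index()
print(demo)
```

With a real datetime index, plots get a proper time axis and date-based slicing becomes available.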

## Object creation

See the [Data Structure Intro](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro) section for more details.

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [None]:
s = pd.Series([1,3,5,np.nan,6,8])
s
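A short sketch of what the `np.nan` entry does to such a Series: the values are stored as floats, and summary methods skip the missing entry by default. The name `s_demo` is just an illustrative choice:

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1, 3, 5, np.nan, 6, 8])

# NaN is a float, so the whole Series is stored as float64.
print(s_demo.dtype)

# count() reports only the non-missing values.
print(s_demo.count())

# mean() skips NaN by default: (1 + 3 + 5 + 6 + 8) / 5
print(s_demo.mean())
```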

Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns.

In [None]:
dates = pd.date_range('20130101',periods=6)
dates

In [None]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df

Creating a DataFrame by passing a dict of objects that can be converted to series-like objects.

In [None]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=range(4),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : 'foo' })

df2

The columns of the resulting DataFrame have different dtypes:

In [None]:
df2.dtypes

## Viewing Data

In [None]:
df.head()

In [None]:
df.tail(3)

Display the index, the columns, and the underlying numpy data

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

`describe` shows a quick statistical summary of your data

In [None]:
df.describe()

Transposing your data

In [None]:
df.T

Sorting by an axis

In [None]:
df.sort_index(axis=1, ascending=False)

Sorting by values

In [None]:
# DataFrame.sort(columns=...) was removed from pandas; use sort_values(by=...)
df.sort_values(by='B')
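Sorting by values can also run in descending order or use several columns as keys. A minimal sketch on a small hand-made table (the `scores` frame and its column names are invented for illustration):

```python
import pandas as pd

scores = pd.DataFrame({'team': ['b', 'a', 'b', 'a'],
                       'points': [3, 7, 9, 1]})

# Largest values first.
by_points = scores.sort_values(by='points', ascending=False)

# Sort by team, then by points within each team.
by_team_then_points = scores.sort_values(by=['team', 'points'])

print(by_points)
print(by_team_then_points)
```

Both calls return a new DataFrame and leave `scores` untouched.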

## Selection

**Note** While standard Python/NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods: `.at`, `.iat`, `.loc` and `.iloc`. (The older `.ix` accessor has been removed from pandas.)

See the [Indexing section](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing) and below.
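As a rough orientation, here is how the label-based and position-based accessors relate on a small frame. The `frame` and `idx` names are illustrative, and the data is chosen so the results are easy to check by eye:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20130101', periods=3)
frame = pd.DataFrame(np.arange(6).reshape(3, 2), index=idx, columns=['A', 'B'])

# .loc and .at select by label; .at is restricted to a single scalar.
by_loc = frame.loc[idx[1], 'B']
by_at = frame.at[idx[1], 'B']

# .iloc and .iat select by integer position; .iat is the scalar-only variant.
by_iloc = frame.iloc[1, 1]
by_iat = frame.iat[1, 1]

print(by_loc, by_at, by_iloc, by_iat)
```

All four expressions address the same cell here (second row, column `B`).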

### Getting

Selecting a single column, which yields a `Series`, equivalent to `df.A`

In [None]:
df['A']

Selecting via `[]`, which slices the rows.

In [None]:
df[:3]

In [None]:
df['20130102':'20130104']
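One detail worth checking when slicing rows: positional slices exclude the end point, as in plain Python, while label slices include it. A small sketch (using `.iloc` and `.loc` to make the two modes explicit; the `frame` name is illustrative):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20130101', periods=5)
frame = pd.DataFrame({'A': np.arange(5)}, index=idx)

# Positional slice: rows 1 and 2 only, the end position 3 is excluded.
by_position = frame.iloc[1:3]

# Label slice: 2013-01-02 through 2013-01-04, both end points included.
by_label = frame.loc['20130102':'20130104']

print(len(by_position), len(by_label))
```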

### Selection by Label

See more in [Selection by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label).

For getting a cross section using a label

In [None]:
df.loc[dates[0]]

Selecting on a multi-axis by label

In [None]:
df.loc[:,['A','B']]

### Selection by Position

Select via the position of the passed integers

In [None]:
df.iloc[3]

By integer slices, acting similarly to numpy/python

In [None]:
df.iloc[3:5,0:2]

By lists of integer position locations, similar to the numpy/python style

In [None]:
df.iloc[[1,2,4],[0,2]]

For slicing rows or columns explicitly

In [None]:
df.iloc[1:3,:]

### Boolean Indexing

In [None]:
df

In [None]:
df[df.A > 0]

What happens if I select on the whole table?

In [None]:
df[df > 0]
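To make the difference between the two kinds of mask concrete, here is a sketch on fixed (non-random) data; the `grid` frame is made up for illustration. A whole-table mask keeps the shape and fills non-matching cells with `NaN`, whereas a row mask drops entire rows:

```python
import pandas as pd

grid = pd.DataFrame({'A': [1, -2, 3], 'B': [-4, 5, 6]})

# Whole-table mask: same shape, NaN wherever the condition is False.
masked = grid[grid > 0]

# Row mask: combine conditions with & (or |) and parenthesize each one.
rows = grid[(grid.A > 0) & (grid.B > 0)]

print(masked)
print(rows)
```

Note that Python's `and`/`or` do not work element-wise on Series; the `&` and `|` operators are needed.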

## Basic operations

Simple statistics

In [None]:
df.mean()

And the same along a different axis:

In [None]:
df.mean(axis=1)

Applying a function to the data

In [None]:
df.apply(np.cumsum)

In [None]:
df.apply(lambda x: x.max() - x.min())
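By default `apply` passes each column to the function; with `axis=1` it passes each row instead. A small sketch with hand-picked numbers (the `table` frame is invented for illustration):

```python
import pandas as pd

table = pd.DataFrame({'A': [1, 4, 2], 'B': [10, 30, 20]})

# Column-wise (default): one result per column.
col_range = table.apply(lambda x: x.max() - x.min())

# Row-wise: one result per row.
row_range = table.apply(lambda x: x.max() - x.min(), axis=1)

print(col_range)
print(row_range)
```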

SQL-style operations like merging data

In [None]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [None]:
left

In [None]:
right

In [None]:
pd.merge(left, right, on='key')
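Because every `foo` row on one side matches every `foo` row on the other, the merge above yields 2 x 2 = 4 rows. The sketch below reproduces that and contrasts it with keys that only partly overlap; the `*_u` frames are hypothetical additions, and `how='outer'` is shown as one of the join types `pd.merge` accepts:

```python
import pandas as pd

left_dup = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right_dup = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

# Duplicate keys on both sides: every pairing appears in the result.
merged = pd.merge(left_dup, right_dup, on='key')

left_u = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right_u = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

# Inner join (the default): only keys present on both sides survive.
inner = pd.merge(left_u, right_u, on='key')

# Outer join: all keys are kept, with NaN where one side has no match.
outer = pd.merge(left_u, right_u, on='key', how='outer')

print(len(merged), len(inner), len(outer))
```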

Append rows to a DataFrame.

In [None]:
df = pd.DataFrame(np.arange(4*3).reshape((4, 3)), columns=['A','B','C'])
df

In [None]:
s = df.iloc[2]
s

In [None]:
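# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
# s.to_frame().T turns the Series into a one-row DataFrame.
pd.concat([df, s.to_frame().T], ignore_index=True)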
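Adding rows also works for whole frames at once via `pd.concat`. A minimal sketch, with `top` and `bottom` as made-up example frames:

```python
import pandas as pd

top = pd.DataFrame({'A': [0, 3], 'B': [1, 4]})
bottom = pd.DataFrame({'A': [6], 'B': [7]})

# Stack the frames vertically; ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([top, bottom], ignore_index=True)
print(stacked)
```

When many rows need to be added, collecting them in a list and concatenating once is generally cheaper than growing a frame row by row.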

## Grouping

By “group by” we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure

See the [Grouping docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby) for more.
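The three steps can be spelled out by hand to see what `groupby` does in one call. This is only an illustrative sketch on a tiny invented frame, not how pandas implements it internally:

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1, 2, 3]})

pieces = {}
for name, group in frame.groupby('A'):   # splitting: one sub-frame per key
    pieces[name] = group['C'].sum()      # applying: reduce each group
manual = pd.Series(pieces)               # combining: collect the results

# The same computation in a single expression.
builtin = frame.groupby('A')['C'].sum()

print(manual)
print(builtin)
```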

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : range(8), 'D' : range(10,18)})
df

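Grouping and then applying the `sum` function to the resulting groups.

In [None]:
# Select the numeric columns explicitly: recent pandas versions would
# otherwise also "sum" (concatenate) the strings in column B.
df.groupby('A')[['C', 'D']].sum()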

Grouping by multiple columns forms a hierarchical index, and the function is then applied to each group.

In [None]:
df.groupby(['A','B']).sum()
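The hierarchical index produced this way can be addressed with tuples, or pivoted into columns with `unstack`. A brief sketch on a small invented frame:

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'one'],
                      'C': [1, 2, 3, 4]})

# One total per (A, B) combination that occurs in the data.
totals = frame.groupby(['A', 'B'])['C'].sum()

# Select a single group by passing the index levels as a tuple.
foo_two = totals.loc[('foo', 'two')]

# Move the inner level (B) into columns; missing combinations become NaN.
wide = totals.unstack()

print(totals)
print(wide)
```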