## IPython Notebooks

* You can run a cell by pressing ``[shift] + [Enter]`` or by pressing the "play" button in the menu.
* You can get help on a function or object by pressing ``[shift] + [tab]`` after the opening parenthesis ``function(``
* You can also get help by executing: ``function?``

We'll use the following standard imports.  Execute this cell first:

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns

## Numpy refresher

NumPy arrays form the underlying data structure for most computational work in Python, and are used extensively by Pandas under the hood.  Here, we give only a very brief overview.  See http://scipy-lectures.github.io/ for a more detailed lesson.

In [None]:
import numpy as np

# Generating a random array
X = np.random.random((3, 5))  # a 3 x 5 array

print(X)

### Pure Python lists are slow: let's square a vector

In [None]:
import random

vector_py = [random.random() for i in range(50000)]
vector_np = np.random.random(50000)

In [None]:
%timeit [e**2 for e in vector_py]

In [None]:
%timeit vector_np**2

In [None]:
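# speed-up factor: list-comprehension time divided by NumPy time
# (timings from one run; substitute the values %timeit printed for you)
7.97e-3 / 35e-6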
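The ratio in the cell above uses timings from one particular run, so it will differ on your machine. As a rough sketch, the same comparison can be made outside IPython with the standard-library `timeit` module. The variable names follow the cells above; the repeat count of 20 is an arbitrary choice.

```python
import random
import timeit

import numpy as np

vector_py = [random.random() for _ in range(50000)]
vector_np = np.array(vector_py)  # same values, stored as a NumPy array

# Both approaches compute the same elementwise squares
squares_py = [e**2 for e in vector_py]
squares_np = vector_np**2
same_result = np.allclose(squares_py, squares_np)

# Average seconds per call over 20 runs (20 is an arbitrary choice)
t_py = timeit.timeit(lambda: [e**2 for e in vector_py], number=20) / 20
t_np = timeit.timeit(lambda: vector_np**2, number=20) / 20

print(same_result)
print(f"list comprehension: {t_py:.2e} s, NumPy: {t_np:.2e} s, ratio: {t_py / t_np:.0f}x")
```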

### Accessing elements

In [None]:
# get a single element
X[0, 0]

In [None]:
# get a row
X[1]

In [None]:
# get a column
X[:, 1]

In [None]:
# Transposing an array
X.T

In [None]:
# Turning a row vector into a column vector
y = np.linspace(0, 12, 5)
y

In [None]:
y.shape

In [None]:
# make into a column vector
y[:, np.newaxis]

In [None]:
y[:, np.newaxis].shape
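For comparison, here is a small sketch showing that indexing with `np.newaxis` gives the same result as `reshape(-1, 1)`, and that transposing a one-dimensional array does not create a column. The names `col_newaxis` and `col_reshape` are only illustrative.

```python
import numpy as np

y = np.linspace(0, 12, 5)          # shape (5,)
col_newaxis = y[:, np.newaxis]     # shape (5, 1)
col_reshape = y.reshape(-1, 1)     # -1 lets NumPy infer the row count

print(col_newaxis.shape, col_reshape.shape)
print(np.array_equal(col_newaxis, col_reshape))

# A 1-D array is its own transpose, so y.T does not produce a column
print(y.T.shape)
```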

In [None]:
X

In [None]:
print(X.shape)
print(X.reshape(5, 3))

In [None]:
# indexing by an array of integers (fancy indexing)
indices = np.array([3, 1, 0])
print(indices)
X[:, indices]

### Operations along an axis

In [None]:
X

In [None]:
X.shape

In [None]:
np.sum(X, axis=1)

In [None]:
np.max(X, axis=0)
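Because `X` is random, it can be hard to see which axis is being collapsed. The sketch below repeats the same calls on a small fixed array (the array `A` is made up for illustration): `axis=0` reduces down the rows and leaves one value per column, while `axis=1` reduces across the columns and leaves one value per row.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

col_sums = np.sum(A, axis=0)  # collapses the rows: one value per column
row_sums = np.sum(A, axis=1)  # collapses the columns: one value per row
col_max = np.max(A, axis=0)

print(col_sums)  # [5 7 9]
print(row_sums)  # [ 6 15]
print(col_max)   # [4 5 6]
```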

# A quick-ish introduction to Pandas

Credit: this is a notebook cribbed from [the official "10 Minutes to Pandas"](http://pandas.pydata.org/pandas-docs/stable/10min.html).

<a href="http://pandas.pydata.org" target="_blank"> 
<img src="figures/pandas-book.jpg" style="float: left; width: 25%; margin-right: 2em;"></a>

* emphasis on tabular data (csv and the like)
* database/spreadsheet-like functionality
* rich support for mixed data (numpy is for homogeneous arrays)
* integrates cleanly with numpy and matplotlib

There is also a wonderful [set of free tutorials](https://bitbucket.org/hrojas/learn-pandas) on Pandas.

### Unlike NumPy, pandas is a *columnar data store*.
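One way to see what "columnar" means in practice: each column of a DataFrame carries its own dtype, whereas a NumPy array has a single dtype for every element. The sketch below uses a made-up three-column frame (the names and values are purely illustrative).

```python
import numpy as np
import pandas as pd

# Illustrative values only
frame = pd.DataFrame({'city': ['Oslo', 'Lima'],
                      'temp': [3.5, 19.0],
                      'days': [10, 12]})

print(frame.dtypes)  # one dtype per column

# A NumPy array has a single dtype; mixed content falls back to `object`
as_array = frame.to_numpy()
print(as_array.dtype)
```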

In [None]:
import pandas as pd
import seaborn as sns

# Let's also load the plotting/numerical tools
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# make figures a bit larger by default
plt.rcParams['figure.figsize'] = (10,6)

# make sure your and my random numbers are the same
np.random.seed(12345)

## A quick taste of Pandas

Consider a standard stocks file you downloaded from [Yahoo Finance](http://finance.yahoo.com/q/hp?s=AAPL+Historical+Prices) (note that Pandas originally [shipped with a utility to get that data for you](http://pandas.pydata.org/pandas-docs/stable/remote_data.html#remote-data-yahoo); that functionality has since moved to the separate `pandas-datareader` package):

In [None]:
!head data/AAPL.csv

In [None]:
stock = pd.read_csv('data/AAPL.csv')

In [None]:
type(stock)

In [None]:
stock.head()

In [None]:
stock.index = stock.pop('Date')
stock.head()

In [None]:
stock.sort_index(inplace=True)
stock.head()
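By default `pd.read_csv` reads the `Date` column as plain strings. As a hedged alternative to the pop-and-sort steps above, the dates can be parsed and set as the index while reading. The sketch below uses a few made-up rows in place of `data/AAPL.csv` (the values are placeholders, not real prices) so that it runs without the file.

```python
import io
import pandas as pd

# Made-up rows standing in for data/AAPL.csv
csv_text = """Date,Adj Close
2015-01-06,3.0
2015-01-02,1.0
2015-01-05,2.0
"""

stock = pd.read_csv(io.StringIO(csv_text),
                    parse_dates=['Date'],  # convert strings to timestamps
                    index_col='Date')      # use that column as the index
stock = stock.sort_index()

print(type(stock.index).__name__)
print(stock)
```

With a `DatetimeIndex`, plots get a proper time axis and rows can be selected by date strings.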

In [None]:
stock['Adj Close'].plot();

In [None]:
ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))  # 1000 days, about 2.7 years

In [None]:
ts

In [None]:
rdf = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                   columns=['A', 'B', 'C', 'D'])
rdf.head()

In [None]:
rdf.cumsum().head()

In [None]:
rdf.cumsum().plot();

## Object creation

See the [Data Structure Intro](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro) section for more details.

Creating a Series by passing a list of values, letting pandas create a default integer index

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

### Creating a DataFrame by passing a numpy array, with a datetime index and labeled columns

In [None]:
dates = pd.date_range('20130101', periods=6)
dates

In [None]:
rng = np.random.RandomState()

df = pd.DataFrame(rng.normal(size=(6, 4)),
                  index=dates,
                  columns=('rainfall', 'temp', 'humidity', 'windchill'))
df

## Viewing Data

In [None]:
df.head()

In [None]:
df.tail(3)

Display the index, columns, and the underlying NumPy data

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.values

`describe` shows a quick statistical summary of your data

In [None]:
df.describe()

Transposing your data

In [None]:
df

In [None]:
df.T

Sorting by index

In [None]:
df.sort_index(ascending=False)

Sorting by values

In [None]:
df.sort_values(by='temp')

## Selection

**Note:** While many of the NumPy access methods work on DataFrames, prefer the pandas-specific data access methods `.at`, `.iat`, `.loc` and `.iloc`. (The older `.ix` indexer was deprecated and then removed in pandas 1.0.)

See the [Indexing section](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing) and below.
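As a quick orientation, the sketch below reads the same cell with each of the four accessors. `.loc` and `.at` take labels, `.iloc` and `.iat` take integer positions, and `.at`/`.iat` only return single scalars. The frame `small` is a made-up stand-in for `df`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
small = pd.DataFrame(np.arange(6.0).reshape(3, 2),
                     index=dates, columns=['rainfall', 'temp'])

by_label = small.loc[dates[1], 'temp']     # row label, column label
by_position = small.iloc[1, 1]             # row position, column position
scalar_label = small.at[dates[1], 'temp']  # scalar access by label
scalar_position = small.iat[1, 1]          # scalar access by position

print(by_label, by_position, scalar_label, scalar_position)
```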

### Getting

Selecting a single column, which yields a `Series`; `df.rainfall` is equivalent to `df['rainfall']`

In [None]:
df.rainfall

Selecting via `[]`, which slices the rows.

In [None]:
df[:3]

### Selection by Label

In [None]:
df.loc['20130101':'20130103']

See more in [Selection by Label](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-label).

For getting a cross section using a label

In [None]:
df.loc[dates[0]]

Selecting on a multi-axis by label

In [None]:
df.loc[:, ['rainfall', 'temp']]

In [None]:
mask = df.rainfall < 0

In [None]:
df.loc[mask, 'rainfall'] = 0

In [None]:
df

### Selection by Position

Select via the position of the passed integers, essentially equivalent to NumPy indexing.

In [None]:
df.iloc[3]

By integer slices, acting similarly to NumPy/Python

In [None]:
df.iloc[3:5, 0:2]

By lists of integer position locations, similar to the numpy/python style

In [None]:
df.iloc[[1,2,4], [0,2]]

For slicing rows or columns explicitly

In [None]:
df.iloc[1:3, :]

### Boolean Indexing

In [None]:
df

In [None]:
df[df.rainfall > 0]

What happens if I select on the whole table?

In [None]:
df[df > 0]
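The two forms behave differently. A boolean `Series` mask filters whole rows, while a boolean `DataFrame` mask keeps the table's shape and replaces every cell that fails the condition with `NaN`. A minimal sketch on a made-up two-row frame:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'rainfall': [1.0, -2.0],
                      'temp': [-3.0, 4.0]})

row_filtered = small[small.rainfall > 0]  # Series mask: drops whole rows
cell_masked = small[small > 0]            # DataFrame mask: keeps shape, fills NaN

print(row_filtered)
print(cell_masked)
```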

## Basic operations

Simple statistics

In [None]:
df.head(1)

In [None]:
df.mean()

And the same along a different axis:

In [None]:
df.mean(axis=1)

Applying a function to the data

In [None]:
df.cumsum()

In [None]:
print(df)
df.apply(np.cumsum)

In [None]:
def my_function(column):
    return column.max() - column.min()

df.apply(my_function)

## SQL-style joins (merging data)

In [None]:
left = pd.DataFrame({'subject': ['history', 'literature'], 'papers': [10, 20]})
right = pd.DataFrame({'subject': ['history', 'literature', 'science'], 'books': [4, 5, 9]})

In [None]:
left

In [None]:
right

In [None]:
pd.merge(left, right, on='subject', how='outer')

Above, `how` can be:

  * left: use only keys from left frame (SQL: left outer join)
  * right: use only keys from right frame (SQL: right outer join)
  * outer: use union of keys from both frames (SQL: full outer join)
  * inner: use intersection of keys from both frames (SQL: inner join)
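To make the four options concrete, the sketch below runs each join on the `left` and `right` frames defined above and prints which `subject` keys survive. The dictionary name `keys` is only illustrative.

```python
import pandas as pd

left = pd.DataFrame({'subject': ['history', 'literature'], 'papers': [10, 20]})
right = pd.DataFrame({'subject': ['history', 'literature', 'science'], 'books': [4, 5, 9]})

# Collect the keys that survive each kind of join
keys = {how: list(pd.merge(left, right, on='subject', how=how)['subject'])
        for how in ('left', 'right', 'outer', 'inner')}

for how, subjects in keys.items():
    print(how, subjects)

# 'science' has no match in `left`, so its `papers` value is missing
outer = pd.merge(left, right, on='subject', how='outer')
print(outer)
```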

## Append rows to a dataframe

In [None]:
#df = pd.DataFrame(np.arange(4 * 3).reshape((4, 3)), columns=['A','B','C'])
df

In [None]:
#s = df.iloc[2]
#s

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
pd.concat([df, df.iloc[[0]]], ignore_index=True)

In [None]:
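row = pd.DataFrame([[0, 1, 2, 3]],
                   columns=df.columns,
                   index=pd.to_datetime(['2015-08-19']))  # match df's datetime index
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
pd.concat([df, row])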

## Grouping

By “group by” we are referring to a process involving one or more of the following steps

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure

See the [Grouping docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby) for more.
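The three steps can be spelled out by hand, which may help clarify what `groupby` does in a single call. This is only a sketch on a made-up four-row frame; in practice you would use the one-line `groupby` form shown at the end.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [0, 1, 2, 3]})

# Split: one sub-frame per distinct value of A
pieces = {key: group for key, group in df.groupby('A')}

# Apply: reduce each piece independently
sums = {key: group['C'].sum() for key, group in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(sums, name='C').sort_index()

print(manual)
print(df.groupby('A')['C'].sum())
```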

In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : range(8),
                   'D' : range(10, 18)})
df

Grouping and then applying the function `sum` to the resulting groups.

In [None]:
df.groupby('A').sum()

Grouping by multiple columns forms a hierarchical index, and the function is then applied to each group.

In [None]:
df.groupby(['A', 'B']).sum()
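The result of a multi-column groupby is indexed by tuples of labels. The sketch below, which rebuilds the same frame so that it runs on its own, shows a few ways to work with that hierarchical index. The names `summed` and `flat` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': range(8),
                   'D': range(10, 18)})

summed = df.groupby(['A', 'B']).sum()

print(summed.index.nlevels)        # number of index levels
print(summed.loc[('foo', 'one')])  # one row, addressed by a tuple of labels
print(summed.loc['foo'])           # all rows under the outer label 'foo'

flat = summed.reset_index()        # back to ordinary columns
print(flat.columns.tolist())
```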

In [None]:
g = df.groupby('A')

In [None]:
g.*?

```
g.A            g.B            g.C            g.D            g.agg          g.aggregate    
g.all          g.any          g.apply        g.bfill        g.boxplot      g.corr         
g.corrwith     g.count        g.cov          g.cumcount     g.cummax       g.cummin       
g.cumprod      g.cumsum       g.describe     g.diff         g.dtypes       g.ffill        
g.fillna       g.filter       g.first        g.get_group    g.groups       g.head         
g.hist         g.idxmax       g.idxmin       g.indices      g.irow         g.last         
g.mad          g.max          g.mean         g.median       g.min          g.name         
g.ngroups      g.nth          g.ohlc         g.pct_change   g.plot         g.prod         
g.quantile     g.rank         g.resample     g.sem          g.shift        g.size         
g.skew         g.std          g.sum          g.tail         g.take         g.transform    
g.tshift       g.var          
```

## Histogramming

In [None]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

In [None]:
s.value_counts()
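`value_counts` orders its result by frequency, most common value first. A small sketch on fixed, made-up data shows two common variations: sorting by the value itself, and returning fractions rather than counts.

```python
import pandas as pd

s = pd.Series([2, 0, 2, 5, 2, 0])

counts = s.value_counts()                 # most frequent value first
by_value = s.value_counts().sort_index()  # ordered by the value itself
shares = s.value_counts(normalize=True)   # fractions instead of counts

print(counts)
print(by_value)
print(shares)
```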