## Pandas: versatile tool for data wrangling in Python

* emphasis on tabular data (csv and the like)
* database/spreadsheet-like functionality
* rich support for mixed data (numpy is for homogeneous arrays)
* integrates cleanly with numpy and matplotlib
* really shines with time-series data

<img src="http://akamaicovers.oreilly.com/images/0636920023784/lrg.jpg" width="30%">

  - Definitive Book: http://shop.oreilly.com/product/0636920023784.do#
  - Quick Ref: http://pandas.pydata.org/pandas-docs/stable/10min.html
  

In [None]:
import numpy as np
import pandas as pd  ## this is by convention
pd.options.display.width = 1000

In [None]:
s = pd.Series([-1, 20, -30, 40, -50])
s

In [None]:
s.index

In [None]:
s.index[2]

In [None]:
s.values

In [None]:
s2 = pd.Series([1, 2, np.nan, 4, 5],
              index=['one', 'two', 'three', 'four', 'five'])
s2

In [None]:
s2.index[0]

In [None]:
s2

In [None]:
s2['three']

In [None]:
s2.iloc[2]  # positional access; plain s2[2] is ambiguous on a label-indexed Series

In [None]:
s2[['one', 'three', 'two']]

In [None]:
s2

In [None]:
s3 = s2[:3]

In [None]:
s2

In [None]:
s3

In [None]:
s3.to_dict()

In [None]:
pd.Series(s3.to_dict())

In [None]:
df = pd.DataFrame({'A': s2, 'B': s3})
df

In [None]:
df['A']

In [None]:
df['B']

### Boolean indexing

In [None]:
df[df["A"] > 0]

**Note:** While many of the NumPy access methods work on DataFrames, prefer the pandas-specific data access methods `.at`, `.iat`, `.loc` and `.iloc`. The older `.ix` accessor has been removed from pandas.

See the [Indexing section](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing) and below.
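
As a rough sketch of how the four accessors differ, the cell below builds a small stand-in frame (the name `demo` and its values are made up for illustration, not taken from the data used in this notebook) and reads from it by label and by position.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like `df` above; labels and values are illustrative.
demo = pd.DataFrame(
    {"A": [1.0, 2.0, np.nan, 4.0, 5.0], "B": [1.0, 2.0, np.nan, np.nan, np.nan]},
    index=["one", "two", "three", "four", "five"],
)

row_by_label = demo.loc["four"]      # label-based row selection, returns a Series
rows_by_pos = demo.iloc[1:3, :]      # position-based slice, end is exclusive
cell_by_label = demo.at["two", "A"]  # scalar access by row/column label
cell_by_pos = demo.iat[0, 1]         # scalar access by row/column position

print(row_by_label)
print(rows_by_pos)
print(cell_by_label, cell_by_pos)
```

`.loc` and `.iloc` accept slices, lists and boolean masks, while `.at` and `.iat` are meant for single cells only.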

In [None]:
df.loc['four']

`.iloc` selects by integer position, much like NumPy indexing.

In [None]:
df.iloc[1:3, :]

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.values

In [None]:
df

In [None]:
df.values

In [None]:
df.sort_index(ascending=True)

In [None]:
df2 = pd.DataFrame(df.values, columns=['A', 'B'])

In [None]:
df2['A'] = 7.5
df2

In [None]:
df2.dropna?

In [None]:
df3 = df2.dropna()
df3

In [None]:
del df3["A"]
df3

## Reading in Data, Time Series

In [None]:
!head -n 10 multiTimeline.csv

In [None]:
fred = pd.read_csv('multiTimeline.csv')
fred

In [None]:
fred.head(10)

In [None]:
fred['ice cream']

In [None]:
fred["ice cream"].index[:10]

In [None]:
fred["Week"][:5]

In [None]:
fred.Week[0]

In [None]:
# Several ways to convert to date
from datetime import datetime
import dateutil.parser as parser

print(parser.parse(fred.Week[0]))

In [None]:
datetime.strptime(fred.Week[0], '%Y-%m-%d')

In [None]:
dates = [datetime.strptime(x, '%Y-%m-%d') for x in fred.Week]
dates

In [None]:
pd.DatetimeIndex(dates)

In [None]:
# Series in, Series out
pd.to_datetime(fred.Week)

In [None]:
# NumPy array in, DatetimeIndex out
pd.to_datetime(fred.Week.values)

In [None]:
pd.to_datetime(fred.Week, format='%Y-%m-%d')

In [None]:
pd.to_datetime(['96/21/05'], format='%y/%d/%m')

In [None]:
fred.info()

In [None]:
print(fred.to_string())

In [None]:
from IPython.display import HTML
HTML(fred.to_html())

In [None]:
fred['Week'] = pd.to_datetime(fred['Week'])
fred.dtypes

In [None]:
fred.index[:50]

In [None]:
# Without inplace=True, set_index returns a new DataFrame and leaves fred unchanged
fred.set_index('Week', inplace=True)

In [None]:
fred.info()

In [None]:
fred.head()

In [None]:
fred["ice cream"]

In [None]:
type(fred["ice cream"])

In [None]:
fred.index

In [None]:
fred.index[5]

In [None]:
stamp = fred.index[5]
stamp

In [None]:
stamp.year, stamp.month, stamp.day

In [None]:
stamp.weekday()

In [None]:
fred.index.year

In [None]:
fred.index.weekday

In [None]:
fred.info()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
fred["ice cream"].plot()

In [None]:
fred.plot()

Typically, you'll try to format your data as you read it in. Here the date parsing and index setup from above collapse into a single `read_csv` call.

In [None]:
fred = pd.read_csv('multiTimeline.csv',index_col=0,parse_dates=[0])
fred.info()
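
To try the same `index_col` / `parse_dates` pattern without the data file, here is a minimal sketch that reads an inline CSV string. The column names and numbers are invented for illustration and only mimic the layout of `multiTimeline.csv`.

```python
import io
import pandas as pd

# Made-up CSV text standing in for multiTimeline.csv; values are illustrative only.
csv_text = """Week,ice cream,tennis
2013-01-06,20,30
2013-01-13,22,28
2013-01-20,21,35
"""

# index_col=0 makes the first column the index; parse_dates=[0] parses it as dates.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])

print(sample.index)
print(sample.dtypes)
```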

## Merging data frames

We often want to combine dataframes by index, joining columns together on the same index.

In [None]:
!head full_moon.csv

In [None]:
moon = pd.read_csv('full_moon.csv',index_col=0,parse_dates=[0])
moon.info()

Below, the merge `how` can be:

  * left: use only keys from left frame (SQL: left outer join)
  * right: use only keys from right frame (SQL: right outer join)
  * outer: use union of keys from both frames (SQL: full outer join)
  * inner: use intersection of keys from both frames (SQL: inner join)
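
The four options can be compared on two tiny frames with partly overlapping indexes. This is only a sketch: `left`, `right` and their values are hypothetical and unrelated to the search-trend data.

```python
import pandas as pd

# Two toy frames whose indexes overlap on "b" and "c" only.
left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"y": [10, 20, 30]}, index=["b", "c", "d"])

# Run the same index-on-index merge once per join type.
merged = {
    how: left.merge(right, left_index=True, right_index=True, how=how)
    for how in ["left", "right", "outer", "inner"]
}

for how, frame in merged.items():
    print(how, sorted(frame.index))
```

Rows that exist on only one side are kept or dropped according to `how`, and kept rows get `NaN` in the columns of the frame that lacks them.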

In [None]:
df = fred.merge(moon,left_index=True, right_index=True, how = 'right')
df

Let's save that for later:

In [None]:
df.to_csv("merged_data.csv")

In [None]:
!head merged_data.csv

In [None]:
# Type df.to_ and press <TAB> to list the available export methods

In [None]:
import io
a = io.StringIO()
df.to_latex(buf=a)
a.seek(0)
print(a.read())
a.close()

In [None]:
df[["ice cream","full moon"]]['2013':'2014'].head()

In [None]:
df[["ice cream","full moon"]]['2013-06':'2013-09'].head()

In [None]:
stamp

In [None]:
df["ice cream"][stamp]

In [None]:
df.loc[stamp]

In [None]:
years = df.index.year
years

### Grouping 

By “group by” we mean a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure

See the [Grouping docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby) for more.
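
The three steps can be seen in isolation on a toy frame. The sketch below assumes a made-up `toy` frame with a datetime index; the numbers are not real search-trend values.

```python
import pandas as pd

# Hypothetical data: two observations in each of two years.
idx = pd.to_datetime(["2013-01-06", "2013-06-02", "2014-01-05", "2014-06-01"])
toy = pd.DataFrame({"ice cream": [20, 60, 25, 70]}, index=idx)

# Split: one group per calendar year of the index.
groups = toy.groupby(toy.index.year)

# Apply and combine: each group is reduced, and the results come back
# as a single object indexed by the group key.
yearly_max = groups.max()
yearly_range = groups["ice cream"].agg(lambda s: s.max() - s.min())

print(yearly_max)
print(yearly_range)
```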

In [None]:
annual_min = df.groupby(years).min()
annual_max = df.groupby(years).max()

Another operation to combine dataframes: `pd.concat`

<pre>
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)

Concatenate pandas objects along a particular axis with optional set logic
along the other axes. Can also add a layer of hierarchical indexing on the
concatenation axis, which may be useful if the labels are the same (or
overlapping) on the passed axis number. (Older pandas releases also accepted
a join_axes argument, which has since been removed.)
</pre>
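
A small sketch of the "set logic" and the extra index layer, using two invented frames `a` and `b` that share only the column `x`:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"x": [5, 6], "z": [7, 8]})

# Stack rows; keys adds an outer index level naming the source frame.
# join='outer' (the default) keeps the union of columns and fills gaps with NaN.
stacked = pd.concat([a, b], axis=0, keys=["a", "b"])

# join='inner' keeps only the columns present in every frame.
shared = pd.concat([a, b], axis=0, join="inner", ignore_index=True)

print(stacked)
print(shared)
```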

In [None]:
annual_min_and_max = pd.concat([annual_min, annual_max], 
                               axis=1, keys=['min', 'max'])
annual_min_and_max

### Resampling
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling

<pre>
B       business day frequency
C       custom business day frequency (experimental)
D       calendar day frequency
W       weekly frequency
M       month end frequency
BM      business month end frequency
CBM     custom business month end frequency
MS      month start frequency
BMS     business month start frequency
CBMS    custom business month start frequency
Q       quarter end frequency
BQ      business quarter end frequency
QS      quarter start frequency
BQS     business quarter start frequency
A       year end frequency
BA      business year end frequency
AS      year start frequency
BAS     business year start frequency
BH      business hour frequency
H       hourly frequency
T       minutely frequency
S       secondly frequency
L       milliseconds
U       microseconds
N       nanoseconds
</pre>

above from: http://stackoverflow.com/questions/17001389/pandas-resample-documentation
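
As a quick illustration of how an alias from the table is used, the sketch below resamples a made-up daily series with `W` and `MS`. The series `daily` is hypothetical, and which alias spellings are accepted varies between pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: the values 0..27 on 28 consecutive days.
daily = pd.Series(
    np.arange(28),
    index=pd.date_range("2013-01-01", periods=28, freq="D"),
)

# Downsample: weekly bins (labelled by the Sunday that ends each week).
weekly_mean = daily.resample("W").mean()

# Downsample: monthly bins labelled by the first day of the month.
monthly_sum = daily.resample("MS").sum()

print(weekly_mean)
print(monthly_sum)
```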

In [None]:
df["ice cream"].resample('A-JUN').min()  # year end June

In [None]:
df["ice cream"].resample('A-JUN').agg(['min', 'max'])

In [None]:
annual_minmax = df.resample('A-JUN').agg(['min', 'max'])
annual_minmax

In [None]:
annual_minmax.columns

In [None]:
annual_minmax['tennis', 'max']

In [None]:
annual_minmax[('ice cream', 'max')].index

In [None]:
# Your own aggregation function
def mad(x):
    return np.abs(x - x.mean()).mean()
fred["ice cream"].resample('A-JUN').apply(mad)

Shifting, correlation, date arithmetic
-------

In [None]:
df.shift(3).head(10)

In [None]:
df_1diff = df - df.shift(1)
df_1diff.head(10)

In [None]:
df_1diff.corr()

In [None]:
(df - df.shift(6)).corr()

In [None]:
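def lag_corr(table, periods):
    # Correlate the differences between each row and the row `periods` steps earlier.
    # The first `periods` rows have no lagged counterpart, so they are skipped.
    return (table[periods:] - table.shift(periods)).corr()

def pctchg_corr(table, periods):
    # Same idea, using the fractional change over `periods` steps instead of the difference.
    return (table[periods:] / table.shift(periods) - 1).corr()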

all_lags = [lag_corr(df, i) for i in range(1, 20)]

lags_onetable = pd.concat(all_lags, keys=range(1, 20))
lags_onetable

In [None]:
unstacked = lags_onetable.unstack(1)
unstacked

In [None]:
unstacked['ice cream', 'tennis'].plot(label='IC-TN')
unstacked['full moon', 'Volleyball'].plot(label='FM-VB')
unstacked['tennis', 'Volleyball'].plot(label='TN-VB')
plt.legend()  # show the labels set above

In [None]:
df_1diff.info()

In [None]:
df_1diff["ice cream"].corr(df_1diff["ice cream"].shift(1))

In [None]:
df_1diff.shift(1).head()

In [None]:
df_1diff.corrwith(df_1diff.shift(1))

In [None]:
pd.DataFrame({'Lag1': df_1diff.corrwith(df_1diff.shift(1)),
     'Lag2': df_1diff.corrwith(df_1diff.shift(2))})
 

In [None]:
lag_acorr_table = pd.DataFrame({'Lag%d' % i: 
                                df_1diff.corrwith(df_1diff.shift(i))
         for i in range(1, 7)})
lag_acorr_table

In [None]:
lag_acorr_table.T

Date arithmetic
----

In [None]:
df.head(10)

In [None]:
df.shift(10, freq='H').head(10)

In [None]:
df.shift(2, freq='M').head(10)

In [None]:
pdf = df.to_period('M')
pdf.head(10).index

In [None]:
pdf.index[0]

In [None]:
pdf.index[0].asfreq('S', 'end')

In [None]:
pdf.index[0].asfreq('S', 'start')

In [None]:
pdf.index[0].asfreq('H', 'end') - 5

In [None]:
# 7th business day
(pdf.index[0].asfreq('B', 'start') + 6).to_timestamp()

In [None]:
df.head()

In [None]:
df.shift(4, freq='D').head()

In [None]:
df.shift(4, freq='D').resample('D').asfreq().head(50)

In [None]:
df.shift(4, freq='D').resample('D').interpolate().head(50)

In [None]:
df.shift(4, freq='D').resample('D').bfill().head(50)

Time zone handling
----

In [None]:
stamp = pd.Timestamp(datetime.now())
stamp

In [None]:
print(stamp.tz)

In [None]:
stamp.tz_localize('US/Pacific')

In [None]:
stamp_pac = stamp.tz_localize('US/Pacific')
stamp_pac

In [None]:
stamp_pac.tz_convert('Asia/Tokyo')

In [None]:
stamp_pac.tz_convert('Asia/Tokyo').hour

In [None]:
stamp_pac.tz_convert('Asia/Tokyo').day

In [None]:
stamp_pac.tz_convert('Asia/Tokyo').value

In [None]:
stamp_pac.tz_convert('Asia/Tokyo').tz_convert('utc').value

In [None]:
df_shifted = df.shift(1, freq='4D9H30T')
df_shang = df_shifted\
    .tz_localize('US/Eastern')\
    .tz_convert('Asia/Shanghai')
df_shang

In [None]:
df_shang.index

In [None]:
df_shifted = df.shift(1, freq='4D9H30T')
df_shifted\
    .tz_localize('US/Eastern')\
    .tz_convert('US/Pacific').resample('A-DEC').mean()