# Agenda
* Numpy
* Pandas
* Lab


# Introduction


## Create a new notebook for your code-along:

From our submission directory, type:
    
    jupyter notebook

From the IPython Dashboard, open a new notebook.
Change the title to: "Numpy and Pandas"

# Introduction to Numpy

* Overview
* ndarray
* Indexing and Slicing

More info: [http://wiki.scipy.org/Tentative_NumPy_Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)


## Numpy Overview

* Why Python for data? Numpy brings *decades* of C math into Python!
* Numpy wraps extensive C/C++/Fortran codebases that provide data analysis functionality
* `ndarray` allows easy vectorized math and broadcasting (i.e. element-wise operations between arrays of different but compatible shapes)
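
To make "vectorized math" and "broadcasting" concrete, here is a minimal sketch. The array names and values (`prices`, `matrix`, `row`) are made up for illustration and are not part of the lesson data.

```python
import numpy as np

# Vectorized arithmetic: the operation is applied to every element,
# with no explicit Python loop
prices = np.array([10.0, 20.0, 30.0])
with_tax = prices * 1.1            # the scalar is broadcast to each element

# Broadcasting a (2, 3) matrix against a (3,) row vector:
# the row is (virtually) repeated for each of the 2 rows
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])
shifted = matrix + row             # result has shape (2, 3)

print(with_tax)
print(shifted)
```

Shapes are compatible when, compared from the last dimension backwards, each pair of sizes is equal or one of them is 1.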

In [1]:
!pip install package_name

Collecting package_name
  Downloading package_name-0.1.tar.gz
Building wheels for collected packages: package-name
  Running setup.py bdist_wheel for package-name ... done
  Stored in directory: /Users/justinmcelderry/Library/Caches/pip/wheels/97/25/75/79d4ad8fbbcea368670af12dd8d1f2ccbe49a7ce7deb1fb0ab
Successfully built package-name
Installing collected packages: package-name
Successfully installed package-name-0.1
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [3]:
import numpy as np

### Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items. 
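
"Homogeneous" means every element shares one dtype. The sketch below, using made-up values, shows how Numpy picks a single dtype when the inputs are mixed, and how to request one explicitly.

```python
import numpy as np

ints = np.array([1, 2, 3])
print(ints.shape, ints.ndim)        # (3,) 1

# Mixing ints and floats upcasts every element to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                  # float64

# The dtype can also be requested explicitly
small = np.array([1, 2, 3], dtype='float32')
print(small.dtype, small.itemsize)  # float32 4  (4 bytes per item)
```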

In [4]:
# Creating arrays
a = np.zeros((3))
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4))  # two 3x4 matrices of random integers from 1 to 9 (the upper bound is exclusive)
d = np.arange(0,11,1)

In [5]:
print(a)
print(b)
print(c)
print(d)

[ 0.  0.  0.]
[[ 1.  1.  1.]
 [ 1.  1.  1.]]
[[[9 9 7 7]
  [8 3 9 2]
  [1 8 3 9]]

 [[2 1 2 6]
  [8 4 3 2]
  [9 2 3 4]]]
[ 0  1  2  3  4  5  6  7  8  9 10]


What are these functions?

    np.arange?

In [None]:
# Note the way each array is printed:
a,b,c,d

In [None]:
## Arithmetic in arrays is element wise

In [6]:
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
b

array([0, 1, 2, 3])

In [7]:
c = a-b
c

array([20, 29, 38, 47])

In [8]:
b**2

array([0, 1, 4, 9])

In [9]:
list(map(lambda x: x**2, [0,1,2,3]))  # the pure-Python equivalent of b**2

[0, 1, 4, 9]

In [10]:
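# This fails: the second list is read as the dtype argument.
# Rows must be wrapped in ONE outer list.
np.array([1,1,2], [1,2,3])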

TypeError: data type not understood
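
The error above occurs because the second positional argument of `np.array` is `dtype`. A small sketch of the working form, where all rows go inside one outer list:

```python
import numpy as np

# One outer list containing the two rows
m = np.array([[1, 1, 2],
              [1, 2, 3]])
print(m.shape)          # (2, 3)

# dtype can still be passed as the second argument
m_float = np.array([[1, 1, 2], [1, 2, 3]], dtype=float)
print(m_float.dtype)    # float64
```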

## Indexing, Slicing and Iterating

In [12]:
# one-dimensional arrays work like lists:
a = np.arange(10)**2
print(a)

#map ( lambda x: x**2, range(10))

[ 0  1  4  9 16 25 36 49 64 81]


In [13]:
someList = [1,2,3]

someList[0:2]

[1, 2]

In [14]:
a[2:5]

array([ 4,  9, 16])

In [None]:
# Multidimensional arrays use tuples with commas for indexing
# with (row,column) conventions beginning, as always in Python, from 0

In [None]:
b = np.random.randint(1,100,(4,4))

In [15]:
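# Note: the output below is still the 1-D b from In [6]; run the randint cell first to get the 4x4 array
b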

array([0, 1, 2, 3])

In [None]:
# Guess the output
print(b[2,3])
print(b[0,0])


In [None]:
b[0:3,1],b[:,1]

In [None]:
b[1:3,:]
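
Because `b` above is random, the results differ on every run. As a sketch, the same index expressions applied to a fixed 4x4 array (the name `grid` is made up here) give results that are easy to verify by eye:

```python
import numpy as np

grid = np.arange(16).reshape(4, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]
#  [12 13 14 15]]

print(grid[2, 3])      # row 2, column 3 -> 11
print(grid[0, 0])      # 0
print(grid[0:3, 1])    # column 1 of rows 0..2 -> [1 5 9]
print(grid[:, 1])      # all of column 1 -> [ 1  5  9 13]
print(grid[1:3, :])    # rows 1 and 2, all columns
```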

# Introduction to Pandas

* Object Creation
* Viewing data
* Selection
* Missing data
* Grouping
* Reshaping
* Time series
* Plotting
* I/O
 

_pandas.pydata.org_

## Pandas Overview

_Source: [pandas.pydata.org](http://pandas.pydata.org/pandas-docs/stable/10min.html)_

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [26]:
dates = pd.date_range('20140101',periods=6)
dates[0]

Timestamp('2014-01-01 00:00:00', freq='D')

In [19]:
date1 = dates[0]

In [22]:
import time

In [25]:
time.localtime()

time.struct_time(tm_year=2016, tm_mon=11, tm_mday=14, tm_hour=19, tm_min=34, tm_sec=53, tm_wday=0, tm_yday=319, tm_isdst=0)

In [20]:
print(date1.day)
print(date1.month)
print(date1.year)

1
1
2014


In [29]:
np.random.randn(6,4)

array([[-0.64281941, -1.37218057,  0.85279359,  0.55617113],
       [ 0.62633316, -0.84130433,  0.98288055, -1.12221772],
       [-0.02894087, -1.45695165,  0.13753885,  0.71605809],
       [ 2.31619574, -1.52941724,  0.88324096,  0.67841141],
       [-0.03817505,  0.74176407, -0.20497828,  0.60577548],
       [-0.72497532,  1.2002741 , -1.23280259,  0.96282906]])

In [30]:
list('AB')

['A', 'B']

In [27]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
z = pd.DataFrame(index = df.index, columns = df.columns)
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [28]:
df.head()

Unnamed: 0,A,B,C,D
2014-01-01,2.044567,-0.366702,0.815522,-0.691441
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624
2014-01-03,0.492711,0.030423,0.468784,-0.616932
2014-01-04,1.098077,0.650648,1.110797,1.730815
2014-01-05,0.05871,-0.971509,1.128806,-0.526605


In [31]:
# Index, columns, underlying numpy data
df.T

Unnamed: 0,2014-01-01 00:00:00,2014-01-02 00:00:00,2014-01-03 00:00:00,2014-01-04 00:00:00,2014-01-05 00:00:00,2014-01-06 00:00:00
A,2.044567,1.15821,0.492711,1.098077,0.05871,-0.320459
B,-0.366702,-0.378174,0.030423,0.650648,-0.971509,-0.988367
C,0.815522,-1.098081,0.468784,1.110797,1.128806,-1.965143
D,-0.691441,-2.152624,-0.616932,1.730815,-0.526605,-0.060955


In [32]:
# You can build a DataFrame from a dictionary:
# the keys (A-E) become the column names, and each value fills that column.
# Scalars (1., a Timestamp, 'foo') are repeated on every row.

df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                         'C' : pd.Series(1,index=list(range(4)),dtype='float32'),  # this Series sets the 4-row index
                         'D' : np.array([3] * 4,dtype='int32'),
                         'E' : 'foo' })
    

df2

Unnamed: 0,A,B,C,D,E
0,1.0,2013-01-02,1.0,3,foo
1,1.0,2013-01-02,1.0,3,foo
2,1.0,2013-01-02,1.0,3,foo
3,1.0,2013-01-02,1.0,3,foo


In [33]:
# With specific dtypes

df2.dtypes  # dtypes is an attribute (no parentheses); it works like a schema, listing each column's type



A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

#### Viewing Data

In [34]:
df.head()

Unnamed: 0,A,B,C,D
2014-01-01,2.044567,-0.366702,0.815522,-0.691441
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624
2014-01-03,0.492711,0.030423,0.468784,-0.616932
2014-01-04,1.098077,0.650648,1.110797,1.730815
2014-01-05,0.05871,-0.971509,1.128806,-0.526605


In [35]:
df.tail()

Unnamed: 0,A,B,C,D
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624
2014-01-03,0.492711,0.030423,0.468784,-0.616932
2014-01-04,1.098077,0.650648,1.110797,1.730815
2014-01-05,0.05871,-0.971509,1.128806,-0.526605
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955


In [36]:
df.index

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [37]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.755303,-0.33728,0.076781,-0.386291
std,0.854809,0.622881,1.298083,1.254904
min,-0.320459,-0.988367,-1.965143,-2.152624
25%,0.16721,-0.823176,-0.706364,-0.672814
50%,0.795394,-0.372438,0.642153,-0.571769
75%,1.143177,-0.068858,1.036978,-0.177367
max,2.044567,0.650648,1.128806,1.730815


In [89]:
#df.sort_values(by='B', ascending = False, inplace = True)
df.sort_values(by='B', inplace = True)

In [90]:
df.head()

Unnamed: 0,A,B,C,D,E
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864


### Selection

In [91]:
# Give me the first two columns
df3 = df[['A','B']]

In [92]:
##Give me the first 3 rows
df[0:3]

Unnamed: 0,A,B,C,D,E
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249


In [93]:
# By label
print(dates)

df.loc[dates[0]]

# Give me the row whose index matches this value -- in this case it's a date
df[df.index == pd.Timestamp('2014-01-01')]

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')


Unnamed: 0,A,B,C,D,E
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882


In [94]:
# multi-axis by label:
# rows are selected by index label, columns are restricted to 'A' and 'B'

# df.loc[:,['A','B']]

df.loc['2014-01-01':'2014-01-02', ['A','B']]

# df[(df.index >= '2014-01-01') & (df.index <= '2014-01-03')][['A','B']]

# df[['A','B']]['2014-01-01':'2014-01-02']

Unnamed: 0,A,B
2014-01-02,1.15821,-0.378174
2014-01-01,2.044567,-0.366702


In [95]:
# Date range
# Rows whose index falls between these two dates; only return column B
# This is the nicest option: it is still readable when your DataFrame is sorted in some unusual order
df.loc['20140102':'20140104',['B']]

Unnamed: 0,B
2014-01-02,-0.378174
2014-01-03,0.030423
2014-01-04,0.650648


In [96]:
# Fast access to scalar
df.at[dates[1],'B']

-0.37817389965292375

In [97]:
df.head()

Unnamed: 0,A,B,C,D,E
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864


In [98]:
# iloc provides integer locations similar to np style
# if it's sorted, go with this one
df.iloc[3:]

Unnamed: 0,A,B,C,D,E
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864
2014-01-04,1.098077,0.650648,1.110797,1.730815,3.46163
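
The difference between label-based and position-based selection is easiest to see on a frame whose rows are not in index order. A minimal sketch with made-up data (`demo` is a hypothetical frame, not the lesson's `df`):

```python
import pandas as pd

idx = pd.date_range('2014-01-01', periods=4)
demo = pd.DataFrame({'A': [4, 3, 2, 1], 'B': [10, 20, 30, 40]}, index=idx)
demo = demo.sort_values(by='A')   # rows now run 01-04, 01-03, 01-02, 01-01

first_day = pd.Timestamp('2014-01-01')

by_label = demo.loc[first_day, 'B']    # looks up the index label -> 10
by_scalar = demo.at[first_day, 'B']    # same lookup, optimized for one cell -> 10
by_position = demo.iloc[0, 1]          # first row as currently ordered -> 40

print(by_label, by_scalar, by_position)
```

`loc` and `at` follow the labels wherever the rows have moved, while `iloc` only cares about the current row order.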


### Boolean Indexing

In [99]:
df[df.A < 0] # Basically a 'where' operation

Unnamed: 0,A,B,C,D,E
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191
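
Several conditions can be combined in one boolean mask. The sketch below uses a small made-up frame; note that each condition needs its own parentheses and that `&` / `|` are used instead of `and` / `or`.

```python
import pandas as pd

demo = pd.DataFrame({'A': [-1.0, 0.5, 2.0, -3.0],
                     'B': [10, 20, 30, 40]})

negative_a = demo[demo.A < 0]                    # rows 0 and 3
both = demo[(demo.A < 0) & (demo.B > 15)]        # row 3 only
either = demo[(demo.A > 1) | (demo.B < 15)]      # rows 0 and 2

print(negative_a)
print(both)
print(either)
```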


### Setting

In [100]:
# Create a copy of the DataFrame
df_posA = df.copy() # Without copy(), df_posA would be another name for df, and changes would modify df itself
df_posA.head()
# df_posA[df_posA.A < 0] = -1*df_posA

Unnamed: 0,A,B,C,D,E
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864


In [101]:
# Flip the sign of every value in the rows where A is negative
df_posA[df_posA.A < 0] = -1*df_posA

In [102]:
df_posA.head()

Unnamed: 0,A,B,C,D,E
2014-01-06,0.320459,0.988367,1.965143,0.060955,0.12191
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864
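
Why the `.copy()` matters can be sketched with a tiny made-up frame: plain assignment only creates a second name for the same object, while `copy()` creates independent data.

```python
import pandas as pd

original = pd.DataFrame({'A': [-1.0, 2.0], 'B': [3.0, -4.0]})

alias = original              # same object, just a second name
independent = original.copy() # separate data

independent.loc[independent.A < 0, 'A'] = 0.0   # original is NOT affected
alias.loc[alias.B < 0, 'B'] = 0.0               # original IS affected

print(original)
```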


In [103]:
#Setting new column aligns data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))

In [104]:
s1

2014-01-02    1
2014-01-03    2
2014-01-04    3
2014-01-05    4
2014-01-06    5
2014-01-07    6
Freq: D, dtype: int64

In [105]:
df['F'] = s1

In [106]:
df

Unnamed: 0,A,B,C,D,E,F
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191,5.0
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211,4.0
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249,1.0
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882,
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864,2.0
2014-01-04,1.098077,0.650648,1.110797,1.730815,3.46163,3.0
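
The alignment rule behind the `NaN` above can be shown on a smaller, made-up example: values are matched by index label, labels missing from the Series become `NaN`, and Series labels that are not in the frame are dropped.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
s = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

frame['F'] = s     # aligned on the index labels, not on position

print(frame)
#    A     F
# x  1   NaN
# y  2  10.0
# z  3  20.0
```

The column becomes float because `NaN` cannot be stored in an integer column by default.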


### Missing Data

In [107]:
# Add a column with missing data
# (df already has an 'E' column at this point, so this adds a second 'E', shown as E.1 below)
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

In [108]:
df1.loc[dates[0]:dates[1],'E'] = 1

In [109]:
df1

Unnamed: 0,A,B,C,D,E,F,E.1
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,1.0,,1.0
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,1.0,1.0,1.0
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864,2.0,-1.233864
2014-01-04,1.098077,0.650648,1.110797,1.730815,3.46163,3.0,3.46163


In [110]:
# find where values are null
pd.isnull(df1)

Unnamed: 0,A,B,C,D,E,F,E.1
2014-01-01,False,False,False,False,False,True,False
2014-01-02,False,False,False,False,False,False,False
2014-01-03,False,False,False,False,False,False,False
2014-01-04,False,False,False,False,False,False,False
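
Once missing values are located, the usual next steps are dropping or filling them. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [np.nan, 5.0, 6.0]})

n_missing = demo.isnull().sum()      # count of missing values per column
dropped = demo.dropna(how='any')     # keep only rows with no missing values
filled = demo.fillna(value=0)        # replace every missing value with 0

print(n_missing)
print(dropped)
print(filled)
```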


### Operations

In [111]:
df.describe()

Unnamed: 0,A,B,C,D,E,F
count,6.0,6.0,6.0,6.0,6.0,5.0
mean,0.755303,-0.33728,0.076781,-0.386291,-0.772581,3.0
std,0.854809,0.622881,1.298083,1.254904,2.509807,1.581139
min,-0.320459,-0.988367,-1.965143,-2.152624,-4.305249,1.0
25%,0.16721,-0.823176,-0.706364,-0.672814,-1.345628,2.0
50%,0.795394,-0.372438,0.642153,-0.571769,-1.143538,3.0
75%,1.143177,-0.068858,1.036978,-0.177367,-0.354735,4.0
max,2.044567,0.650648,1.128806,1.730815,3.46163,5.0


In [112]:
df.mean(), df.mean(axis=1) # Column means (axis=0, the default) and row means (axis=1)

(A    0.755303
 B   -0.337280
 C    0.076781
 D   -0.386291
 E   -0.772581
 F    3.000000
 dtype: float64, 2014-01-06    0.257195
 2014-01-05    0.439365
 2014-01-02   -0.962653
 2014-01-01    0.083813
 2014-01-03    0.190187
 2014-01-04    1.841994
 dtype: float64)
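
The two axes are easier to follow on numbers you can check by hand. A small sketch with made-up values:

```python
import pandas as pd

demo = pd.DataFrame({'A': [1.0, 3.0], 'B': [5.0, 7.0]})

col_means = demo.mean(axis=0)   # one value per column: A 2.0, B 6.0
row_means = demo.mean(axis=1)   # one value per row:    0 3.0, 1 5.0

print(col_means)
print(row_means)
```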

### Applying functions

In [113]:
df

Unnamed: 0,A,B,C,D,E,F
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191,5.0
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211,4.0
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249,1.0
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882,
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864,2.0
2014-01-04,1.098077,0.650648,1.110797,1.730815,3.46163,3.0


In [114]:
df.head()

Unnamed: 0,A,B,C,D,E,F
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191,5.0
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211,4.0
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249,1.0
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882,
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864,2.0


In [115]:
# Create a new column 'E' derived from column 'D'
# df.D is a Series; apply() calls the function on each of its values
# lambda defines a small throwaway function that we won't reuse

df['E'] = df.D.apply(lambda x: x*2)

In [116]:
df.head()

Unnamed: 0,A,B,C,D,E,F
2014-01-06,-0.320459,-0.988367,-1.965143,-0.060955,-0.12191,5.0
2014-01-05,0.05871,-0.971509,1.128806,-0.526605,-1.053211,4.0
2014-01-02,1.15821,-0.378174,-1.098081,-2.152624,-4.305249,1.0
2014-01-01,2.044567,-0.366702,0.815522,-0.691441,-1.382882,
2014-01-03,0.492711,0.030423,0.468784,-0.616932,-1.233864,2.0


In [117]:
# Running (cumulative) sum across each row
# axis=1 applies the function row-wise; axis=0 (column-wise) is the default

df.apply(np.cumsum, axis = 1).head()

Unnamed: 0,A,B,C,D,E,F
2014-01-06,-0.320459,-1.308825,-3.273968,-3.334923,-3.456833,1.543167
2014-01-05,0.05871,-0.912799,0.216007,-0.310599,-1.363809,2.636191
2014-01-02,1.15821,0.780036,-0.318045,-2.470669,-6.775918,-5.775918
2014-01-01,2.044567,1.677865,2.493387,1.801946,0.419064,
2014-01-03,0.492711,0.523133,0.991918,0.374985,-0.858879,1.141121


In [118]:
# The result is a Series: one value (max minus min) per column

df.apply(lambda x: x.max() - x.min())

A    2.365026
B    1.639015
C    3.093949
D    3.883439
E    7.766879
F    4.000000
dtype: float64
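
As a sketch of how `apply` behaves, the made-up frame below compares `apply` on a Series with the equivalent vectorized expression, and `apply` on a DataFrame along each axis.

```python
import pandas as pd

demo = pd.DataFrame({'D': [1.0, 2.0, 3.0], 'F': [10.0, 20.0, 40.0]})

# On a Series, apply() calls the function once per value...
via_apply = demo.D.apply(lambda x: x * 2)
# ...but simple arithmetic can be written directly on the column
vectorized = demo.D * 2

# On a DataFrame, the function receives a whole column (axis=0) or row (axis=1)
col_range = demo.apply(lambda col: col.max() - col.min())
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)   # D 2.0, F 30.0
print(row_range)   # 9.0, 18.0, 37.0
```

For plain arithmetic the vectorized form is usually shorter and faster; `apply` is most useful when the logic does not map onto a built-in operation.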

In [119]:
import math 
df.apply(lambda x: math.exp(5)*x).head()

Unnamed: 0,A,B,C,D,E,F
2014-01-06,-47.560286,-146.686626,-291.653035,-9.046501,-18.093002,742.065796
2014-01-05,8.713334,-144.184778,167.529689,-78.155173,-156.310346,593.652636
2014-01-02,171.893572,-56.125983,-162.969599,-319.477796,-638.955593,148.413159
2014-01-01,303.440679,-54.423422,121.034216,-102.618948,-205.237896,
2014-01-03,73.124749,4.515145,69.57374,-91.56086,-183.121721,296.826318


In [120]:
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

### Merge

In [121]:
np.random.randn(10,4)

array([[ 1.9821421 ,  0.91124706,  1.39106822, -1.82716888],
       [-0.44258514, -0.24826526, -1.74853791,  0.08506902],
       [ 1.48949242,  0.07094412,  0.80927999,  0.18558925],
       [-0.8033083 , -0.86489178,  0.20873824, -0.25693934],
       [ 1.61067371,  0.51639142, -0.47832027,  2.00233843],
       [ 0.35312331,  0.24419948,  0.02327831,  0.25419574],
       [-0.72192966,  1.17256383, -1.01904496,  0.5204198 ],
       [ 0.64426788,  0.10435815, -0.44941484,  0.15487953],
       [-0.08008954, -1.40162233,  0.1150285 , -0.33042034],
       [-1.13627141,  0.32724735,  0.25351131,  0.01046619]])

In [122]:
#Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df

Unnamed: 0,0,1,2,3
0,-0.818414,1.115018,-0.621323,-0.000138
1,0.185227,-1.073613,1.700874,0.412214
2,0.219503,-0.160042,0.028628,-0.695308
3,-0.345039,0.600435,0.199876,0.382168
4,-0.468478,0.800001,-2.235198,-0.711001
5,0.243511,-0.621685,1.844279,-0.219206
6,1.33926,-1.433828,1.209406,1.907346
7,-1.130592,1.693187,-0.779568,-2.236433
8,0.057506,-0.601456,-0.691363,0.715181
9,1.430333,0.050118,-0.08488,0.292155


In [123]:
# Break it into pieces
pieces = [df[:3], df[3:7],df[7:]]
pieces

[          0         1         2         3
 0 -0.818414  1.115018 -0.621323 -0.000138
 1  0.185227 -1.073613  1.700874  0.412214
 2  0.219503 -0.160042  0.028628 -0.695308,
           0         1         2         3
 3 -0.345039  0.600435  0.199876  0.382168
 4 -0.468478  0.800001 -2.235198 -0.711001
 5  0.243511 -0.621685  1.844279 -0.219206
 6  1.339260 -1.433828  1.209406  1.907346,
           0         1         2         3
 7 -1.130592  1.693187 -0.779568 -2.236433
 8  0.057506 -0.601456 -0.691363  0.715181
 9  1.430333  0.050118 -0.084880  0.292155]

In [124]:
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,-0.818414,1.115018,-0.621323,-0.000138
1,0.185227,-1.073613,1.700874,0.412214
2,0.219503,-0.160042,0.028628,-0.695308
3,-0.345039,0.600435,0.199876,0.382168
4,-0.468478,0.800001,-2.235198,-0.711001
5,0.243511,-0.621685,1.844279,-0.219206
6,1.33926,-1.433828,1.209406,1.907346
7,-1.130592,1.693187,-0.779568,-2.236433
8,0.057506,-0.601456,-0.691363,0.715181
9,1.430333,0.050118,-0.08488,0.292155


In [125]:
# You can also combine frames with "join" and "merge" (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
df

Unnamed: 0,0,1,2,3
0,-0.818414,1.115018,-0.621323,-0.000138
1,0.185227,-1.073613,1.700874,0.412214
2,0.219503,-0.160042,0.028628,-0.695308
3,-0.345039,0.600435,0.199876,0.382168
4,-0.468478,0.800001,-2.235198,-0.711001
5,0.243511,-0.621685,1.844279,-0.219206
6,1.33926,-1.433828,1.209406,1.907346
7,-1.130592,1.693187,-0.779568,-2.236433
8,0.057506,-0.601456,-0.691363,0.715181
9,1.430333,0.050118,-0.08488,0.292155
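
Database-style joins are done with `pd.merge`. The sketch below uses two made-up frames (`left`, `right`) that share a `key` column, and also shows `pd.concat` used to add rows.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [20, 30, 40]})

# Inner join (the default): only keys present in both frames
inner = pd.merge(left, right, on='key')

# Outer join: every key from either frame, NaN where a side is missing
outer = pd.merge(left, right, on='key', how='outer')

# Adding rows: concatenate and renumber the index
stacked = pd.concat([left, left.iloc[:1]], ignore_index=True)

print(inner)
print(outer)
print(stacked)
```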


### Grouping


In [126]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})

In [127]:
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.017739,-2.249875
1,bar,one,-0.45824,-0.094698
2,foo,two,-1.885797,-1.762716
3,bar,three,0.361035,-1.362694
4,foo,two,0.378369,0.188051
5,bar,two,0.123711,0.186098
6,foo,one,1.284181,0.823603
7,foo,three,0.061257,1.4344


In [128]:
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.45824,-0.094698
bar,three,0.361035,-1.362694
bar,two,0.123711,0.186098
foo,one,1.266442,-1.426272
foo,three,0.061257,1.4344
foo,two,-1.507428,-1.574665
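
Since the columns above are random, here is a sketch of the same split-apply-combine pattern on made-up integers, where the group totals can be checked by hand.

```python
import pandas as pd

demo = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                     'B': ['one', 'one', 'two', 'one'],
                     'C': [1, 2, 3, 4]})

by_a = demo.groupby('A')['C'].sum()            # bar 6, foo 4
by_ab = demo.groupby(['A', 'B'])['C'].sum()    # indexed by (A, B) pairs

# Several aggregations at once
summary = demo.groupby('A')['C'].agg(['sum', 'mean', 'count'])

print(by_a)
print(by_ab)
print(summary)
```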


### Reshaping

In [129]:
# You can also stack or unstack levels

In [130]:
a = df.groupby(['A','B']).sum()
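
A grouped result like this has a two-level row index, which is what `stack` and `unstack` operate on. A minimal sketch on made-up data:

```python
import pandas as pd

demo = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                     'B': ['one', 'two', 'one', 'two'],
                     'C': [1, 2, 3, 4]})

grouped = demo.groupby(['A', 'B']).sum()   # rows indexed by (A, B)
wide = grouped.unstack('B')                # level B moves into the columns
long_again = wide.stack('B')               # and back into the rows

print(wide)
print(long_again)
```

`unstack` produces the same wide layout that a pivot table gives, with one row per `A` and one column per value of `B`.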

In [131]:
# Pivot Tables
pd.pivot_table(df,values=['C','D'],index=['A'],columns=['B'])

Unnamed: 0_level_0,C,C,C,D,D,D
B,one,three,two,one,three,two
A,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
bar,-0.45824,0.361035,0.123711,-0.094698,-1.362694,0.186098
foo,0.633221,0.061257,-0.753714,-0.713136,1.4344,-0.787333


### Time Series


In [132]:
import pandas as pd
import numpy as np

In [133]:
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')

In [134]:
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [135]:
ts

2014-01-01 00:00:00    124
2014-01-01 00:00:01     65
2014-01-01 00:00:02    382
2014-01-01 00:00:03    313
2014-01-01 00:00:04     82
2014-01-01 00:00:05    492
2014-01-01 00:00:06     60
2014-01-01 00:00:07    469
2014-01-01 00:00:08    482
2014-01-01 00:00:09    230
2014-01-01 00:00:10    436
2014-01-01 00:00:11    319
2014-01-01 00:00:12     47
2014-01-01 00:00:13    194
2014-01-01 00:00:14    440
2014-01-01 00:00:15     62
2014-01-01 00:00:16    345
2014-01-01 00:00:17     31
2014-01-01 00:00:18    154
2014-01-01 00:00:19     47
2014-01-01 00:00:20     50
2014-01-01 00:00:21    267
2014-01-01 00:00:22    290
2014-01-01 00:00:23    122
2014-01-01 00:00:24    456
2014-01-01 00:00:25    270
2014-01-01 00:00:26    440
2014-01-01 00:00:27    227
2014-01-01 00:00:28    375
2014-01-01 00:00:29     38
                      ... 
2014-01-01 00:01:10    445
2014-01-01 00:01:11    412
2014-01-01 00:01:12    153
2014-01-01 00:01:13    449
2014-01-01 00:01:14    168
2014-01-01 00:01:15    111
2

In [136]:
# Built in resampling
ts.resample('1Min').mean() # Group the per-second data into 1-minute bins and average each bin

2014-01-01 00:00:00    240.166667
2014-01-01 00:01:00    240.475000
Freq: T, dtype: float64
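
To see exactly what resampling computes, the sketch below replaces the random values with the integers 0..119, so each one-minute bin has a mean that can be checked by hand. It uses the lowercase aliases `'s'` and `'1min'`, which recent pandas releases prefer over `'S'` and `'1Min'`; the series name `demo_ts` is made up.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2014-01-01', periods=120, freq='s')   # 120 seconds
demo_ts = pd.Series(np.arange(120), index=rng)

per_minute_mean = demo_ts.resample('1min').mean()   # 29.5 and 89.5
per_minute_sum = demo_ts.resample('1min').sum()     # 1770 and 5370

print(per_minute_mean)
print(per_minute_sum)
```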

In [137]:
# Many additional time series features:
# type `ts.` and press Tab to browse the available methods

### Plotting


In [None]:
ts.plot()

In [None]:
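def randwalk(startdate, points):
    """Plot and return a random walk of `points` daily steps starting at `startdate`."""
    # Independent standard-normal steps, one per day
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts = ts.cumsum()  # the running total of the steps is the walk
    ts.plot()
    return ts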

In [None]:
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)

In [None]:
# Pandas plot function will print with labels as default

In [None]:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()               # DataFrame.plot creates its own figure, with one line per column
plt.legend(loc='best')

### I/O
I/O is straightforward with, for example, pd.read_csv or df.to_csv
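
A minimal round-trip sketch follows. To stay self-contained it writes to an in-memory `io.StringIO` buffer instead of a file on disk; with a real file you would pass a path such as `'data.csv'` (a hypothetical name) to both calls.

```python
import io
import pandas as pd

demo = pd.DataFrame({'name': ['a', 'b'], 'value': [1.5, 2.5]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # index=False: don't write the row labels
buffer.seek(0)                     # rewind before reading back

restored = pd.read_csv(buffer)
print(restored)
```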

#### The benefits of open source:

Because the code is open source, we can look under the hood of the `plt` modules and read how they work.

# Next Steps

**Recommended Resources**

Name | Description
--- | ---
[Official Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/10min.html) | Wes & Company's selection of tutorials and lectures
[Julia Evans Pandas Cookbook](https://github.com/jvns/pandas-cookbook) | Great resource with examples from weather, bikes and 311 calls
[Learn Pandas Tutorials](https://bitbucket.org/hrojas/learn-pandas) | A great series of Pandas tutorials from Hernan Rojas
[Research Computing Python Data PYNBs](https://github.com/ResearchComputing/Meetup-Fall-2013/tree/master/python) | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas