# Agenda
* Numpy
* Pandas
* Lab


# Introduction


## Create a new notebook for your code-along:

From our submission directory, type:
    
    jupyter notebook

From the IPython Dashboard, open a new notebook.
Change the title to: "Numpy and Pandas"

# Introduction to Numpy

* Overview
* ndarray
* Indexing and Slicing

More info: [http://wiki.scipy.org/Tentative_NumPy_Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)


## Numpy Overview

* Why Python for data? NumPy brings *decades* of C math into Python!
* NumPy provides a wrapper for extensive C/C++/Fortran codebases that supply data analysis functionality
* The ndarray allows easy vectorized math and broadcasting (i.e. element-wise operations between arrays of different but compatible shapes)
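
A minimal sketch of what vectorized math and broadcasting look like in practice. The array names (prices, quantities, grid, row) are made up for illustration.

```python
import numpy as np

# Vectorized math: the operation is applied to every element, no Python loop
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])
totals = prices * quantities      # element-by-element product

# Broadcasting: a (2, 3) array combined with a (3,) array;
# the smaller array is reused for each row of the larger one
grid = np.ones((2, 3))
row = np.array([0, 1, 2])
shifted = grid + row

print(totals)
print(shifted)
```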

In [4]:
import numpy as np

### Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items. 
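
To make "multidimensional" and "homogeneous" concrete, here is a short sketch that inspects an array's attributes. The variable names are illustrative only.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

print(m.ndim)    # number of dimensions
print(m.shape)   # size along each dimension
print(m.dtype)   # the single element type shared by every item

# Homogeneous: mixing ints and a float still yields one dtype,
# because every value is stored as a float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```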

In [2]:
# Creating arrays
import numpy as np
a = np.zeros((3))
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4))
d = np.arange(0,11,1)

# In IPython/Jupyter, appending ? to a name (e.g. np.arange?) displays its documentation

What are these functions?

    np.arange?

In [3]:
# Note the way each array is printed:
import numpy as np
a = np.zeros((3))
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4))
d = np.arange(0,11,1)
a,b,c,d


(array([ 0.,  0.,  0.]), array([[ 1.,  1.,  1.],
        [ 1.,  1.,  1.]]), array([[[4, 7, 7, 3],
         [5, 2, 4, 2],
         [3, 7, 1, 9]],
 
        [[6, 2, 1, 1],
         [6, 7, 5, 3],
         [8, 6, 3, 2]]]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]))

In [None]:
## Arithmetic on arrays is element-wise

In [4]:
a = np.array( [20,30,40,50] )
b = np.arange( 4 )

In [5]:
c = a-b
c,b

(array([20, 29, 38, 47]), array([0, 1, 2, 3]))

In [6]:
b = b**2
b


array([0, 1, 4, 9])
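
Because arithmetic is element-wise, the * operator is not a matrix product. A small sketch of the difference, using two made-up 2x2 arrays:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[10, 20], [30, 40]])

elementwise = x * y   # multiplies matching positions
matrix = x @ y        # matrix product (same result as np.dot(x, y))

print(elementwise)
print(matrix)
```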

## Indexing, Slicing and Iterating

In [7]:
# one-dimensional arrays work like lists:
a = np.arange(10)**2

In [13]:
a

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [8]:
a[2:5]

array([ 4,  9, 16])

In [None]:
# Multidimensional arrays are indexed with comma-separated indices, one per axis,
# in (row, column) order; as always in Python, counting starts at 0

In [9]:
b = np.random.randint(1,100,(4,4))

In [10]:
b

array([[ 3, 19, 13, 17],
       [20, 62, 64,  8],
       [59, 66, 83, 82],
       [48, 38, 52, 84]])

In [11]:
# Guess the output
print(b[2,3])
print(b[0,0])


82
3


In [12]:
b[0:3,1],b[:,1]

(array([19, 62, 66]), array([19, 62, 66, 38]))

In [13]:
b[1:3,:]

array([[20, 62, 64,  8],
       [59, 66, 83, 82]])
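
The section title also mentions iterating. As a brief sketch (the array grid is made up for illustration), looping over a 2-D array yields rows, while the flat attribute visits every element:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Iterating over a 2-D array yields one row at a time
row_sums = [int(row.sum()) for row in grid]

# .flat iterates over every element in row-major order
elements = [int(v) for v in grid.flat]

print(row_sums)
print(elements)
```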

# Introduction to Pandas

* Object Creation
* Viewing data
* Selection
* Missing data
* Grouping
* Reshaping
* Time series
* Plotting
* i/o
 

_pandas.pydata.org_

## Pandas Overview

_Source: [pandas.pydata.org](http://pandas.pydata.org/pandas-docs/stable/10min.html)_

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = pd.date_range('20140101',periods=6)
dates

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [16]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
z = pd.DataFrame(index = df.index, columns = df.columns)
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [9]:
# Index, columns, underlying numpy data
df.T
df

Unnamed: 0,A,B,C,D
2014-01-01,-0.387205,0.968981,0.855309,-1.032127
2014-01-02,0.649269,-1.605991,0.011695,-0.70015
2014-01-03,-1.800025,1.455706,0.82162,1.352032
2014-01-04,-1.757586,-0.22096,0.302449,0.457321
2014-01-05,0.434676,0.077812,-0.807126,-1.600972
2014-01-06,-0.717765,-0.663976,0.190241,-0.542402


In [17]:
df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                         'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                         'D' : np.array([3] * 4,dtype='int32'),
                         'E' : 'foo' })
    

df2

Unnamed: 0,A,B,C,D,E
0,1.0,2013-01-02,1.0,3,foo
1,1.0,2013-01-02,1.0,3,foo
2,1.0,2013-01-02,1.0,3,foo
3,1.0,2013-01-02,1.0,3,foo


In [18]:
# Each column has a single dtype (df2 above was built with a different dtype per column; try df2.dtypes)
df.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

### Viewing Data

In [19]:
df.head()

Unnamed: 0,A,B,C,D
2014-01-01,-0.798288,0.010488,1.459971,-1.208762
2014-01-02,0.496185,-0.65208,0.489994,-1.52843
2014-01-03,-0.135091,-1.552844,0.154443,-0.689701
2014-01-04,0.577989,-0.674331,0.077933,0.100684
2014-01-05,0.889567,-0.933193,0.686272,2.091707


In [20]:
df.tail()

Unnamed: 0,A,B,C,D
2014-01-02,0.496185,-0.65208,0.489994,-1.52843
2014-01-03,-0.135091,-1.552844,0.154443,-0.689701
2014-01-04,0.577989,-0.674331,0.077933,0.100684
2014-01-05,0.889567,-0.933193,0.686272,2.091707
2014-01-06,-0.489662,-1.90477,1.122316,-0.527688


In [18]:
df.index

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [21]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.090117,-0.951122,0.665155,-0.293699
std,0.666057,0.687396,0.543601,1.297626
min,-0.798288,-1.90477,0.077933,-1.52843
25%,-0.401019,-1.397931,0.238331,-1.078997
50%,0.180547,-0.803762,0.588133,-0.608695
75%,0.557538,-0.657643,1.013305,-0.056409
max,0.889567,0.010488,1.459971,2.091707


In [22]:
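df.sort_values(by='B')
df

# What does this do? sort_values returns a new, sorted DataFrame and leaves df unchanged,
# so the second line still displays df in its original date order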

Unnamed: 0,A,B,C,D
2014-01-01,-0.798288,0.010488,1.459971,-1.208762
2014-01-02,0.496185,-0.65208,0.489994,-1.52843
2014-01-03,-0.135091,-1.552844,0.154443,-0.689701
2014-01-04,0.577989,-0.674331,0.077933,0.100684
2014-01-05,0.889567,-0.933193,0.686272,2.091707
2014-01-06,-0.489662,-1.90477,1.122316,-0.527688
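
A short sketch of why the output above is still in date order: sort_values returns a sorted copy rather than sorting in place. The frame scores is a made-up example.

```python
import pandas as pd

scores = pd.DataFrame({'B': [3, 1, 2]}, index=['x', 'y', 'z'])

by_b = scores.sort_values(by='B')   # new DataFrame, sorted by column B

print(by_b)
print(scores)                        # the original keeps its order
```

To keep the sorted version, assign the result to a name, as done with by_b here.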


### Selection

In [23]:
df[['A','B']]
# double brackets return a DataFrame; single brackets (df['A']) return a Series

Unnamed: 0,A,B
2014-01-01,-0.798288,0.010488
2014-01-02,0.496185,-0.65208
2014-01-03,-0.135091,-1.552844
2014-01-04,0.577989,-0.674331
2014-01-05,0.889567,-0.933193
2014-01-06,-0.489662,-1.90477


In [24]:
df[0:3]

Unnamed: 0,A,B,C,D
2014-01-01,-0.387205,0.968981,0.855309,-1.032127
2014-01-02,0.649269,-1.605991,0.011695,-0.70015
2014-01-03,-1.800025,1.455706,0.82162,1.352032


In [25]:
# By label
df.loc[dates[0]]

# Label-based slicing with .loc includes both endpoints, so df.loc[dates[0]:dates[1]]
# returns 2 rows, whereas a positional slice such as df[0:1] returns only 1


A   -0.387205
B    0.968981
C    0.855309
D   -1.032127
Name: 2014-01-01 00:00:00, dtype: float64
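
A minimal sketch of the endpoint difference between label slicing and positional slicing, on a made-up four-row frame:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20140101', periods=4)
frame = pd.DataFrame({'A': np.arange(4)}, index=idx)

by_label = frame.loc[idx[0]:idx[1]]   # label slice: both endpoints included
by_position = frame.iloc[0:1]          # positional slice: end excluded

print(len(by_label), len(by_position))
```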

In [28]:
# multi-axis by label

print(df.A[0:2])
df.loc[:,['A','B']]

2014-01-01   -0.387205
2014-01-02    0.649269
Freq: D, Name: A, dtype: float64


Unnamed: 0,A,B
2014-01-01,-0.387205,0.968981
2014-01-02,0.649269,-1.605991
2014-01-03,-1.800025,1.455706
2014-01-04,-1.757586,-0.22096
2014-01-05,0.434676,0.077812
2014-01-06,-0.717765,-0.663976


In [30]:
# Date Range
df.loc['20140102':'20140104',['B']]

Unnamed: 0,B
2014-01-02,-1.605991
2014-01-03,1.455706
2014-01-04,-0.22096


In [31]:
# Fast access to scalar
df.at[dates[1],'B']

-1.6059911311305894

In [29]:
# iloc selects by integer position, similar to NumPy indexing:
# rows are addressed by their position rather than by their index label

df.iloc[3:]

Unnamed: 0,A,B,C,D
2014-01-04,-1.757586,-0.22096,0.302449,0.457321
2014-01-05,0.434676,0.077812,-0.807126,-1.600972
2014-01-06,-0.717765,-0.663976,0.190241,-0.542402


### Boolean Indexing

In [32]:
df[df.A < 0] # Basically a 'where' operation

# other example df[(df.A < 0) & (df.C>0)]

Unnamed: 0,A,B,C,D
2014-01-01,-0.387205,0.968981,0.855309,-1.032127
2014-01-03,-1.800025,1.455706,0.82162,1.352032
2014-01-04,-1.757586,-0.22096,0.302449,0.457321
2014-01-06,-0.717765,-0.663976,0.190241,-0.542402
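
Conditions can be combined with & (and), | (or) and ~ (not), with each condition wrapped in parentheses. A small sketch on made-up data:

```python
import pandas as pd

t = pd.DataFrame({'A': [-1, 2, -3, 4], 'C': [5, 6, -7, -8]})

both = t[(t.A < 0) & (t.C > 0)]      # rows where both conditions hold
either = t[(t.A < 0) | (t.C > 0)]    # rows where at least one holds
not_negative = t[~(t.A < 0)]         # rows where the condition is false

print(both)
print(either)
print(not_negative)
```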


### Setting

In [34]:
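# Work on a copy: without .copy(), df_posA would be another name for df
# and the assignment below would modify df itself
df_posA = df.copy()

# Flip the sign of every column in the rows where A is negative
df_posA[df_posA.A < 0] = -1*df_posA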

In [35]:
df_posA

Unnamed: 0,A,B,C,D
2014-01-01,0.387205,-0.968981,-0.855309,1.032127
2014-01-02,0.649269,-1.605991,0.011695,-0.70015
2014-01-03,1.800025,-1.455706,-0.82162,-1.352032
2014-01-04,1.757586,0.22096,-0.302449,-0.457321
2014-01-05,0.434676,0.077812,-0.807126,-1.600972
2014-01-06,0.717765,0.663976,-0.190241,0.542402


In [37]:
#Setting new column aligns data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))


In [38]:
s1

2014-01-02    1
2014-01-03    2
2014-01-04    3
2014-01-05    4
2014-01-06    5
2014-01-07    6
Freq: D, dtype: int64

In [39]:
df['F'] = s1
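
A small sketch of what "aligns data by index" means, using made-up labels: values are matched on index labels, labels missing from the Series become NaN, and labels missing from the frame are dropped.

```python
import pandas as pd

frame = pd.DataFrame({'A': [10, 20, 30]}, index=['a', 'b', 'c'])
extra = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

frame['F'] = extra   # matched on index labels, not on position

print(frame)
```

This is why column F in df has only 5 non-missing values: s1 starts one day later than df and ends one day after it.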

In [43]:
df.describe()

Unnamed: 0,A,B,C,D,F
count,6.0,6.0,6.0,6.0,5.0
mean,-0.596439,0.001929,0.229031,-0.344383,3.0
std,1.04598,1.107514,0.611741,1.071293,1.581139
min,-1.800025,-1.605991,-0.807126,-1.600972,1.0
25%,-1.49763,-0.553222,0.056331,-0.949132,2.0
50%,-0.552485,-0.071574,0.246345,-0.621276,3.0
75%,0.229206,0.746189,0.691827,0.207391,4.0
max,0.649269,1.455706,0.855309,1.352032,5.0


### Missing Data

In [42]:
# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

In [None]:
df1.loc[dates[0]:dates[1],'E'] = 1

In [44]:
df1

Unnamed: 0,A,B,C,D,F,E
2014-01-01,-0.387205,0.968981,0.855309,-1.032127,,
2014-01-02,0.649269,-1.605991,0.011695,-0.70015,1.0,
2014-01-03,-1.800025,1.455706,0.82162,1.352032,2.0,
2014-01-04,-1.757586,-0.22096,0.302449,0.457321,3.0,


In [48]:
# find where values are null
pd.isnull(df1).sum()

A    0
B    0
C    0
D    0
F    1
E    4
dtype: int64
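
Once missing values are located, two common follow-ups are dropping or filling them. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'F': [np.nan, 1.0, 2.0], 'E': [np.nan, np.nan, 1.0]})

complete_rows = gaps.dropna(how='any')   # keep only rows with no missing values
filled = gaps.fillna(value=0)            # replace every NaN with 0

print(complete_rows)
print(filled)
```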

### Operations

In [49]:
df.describe()

Unnamed: 0,A,B,C,D,F
count,6.0,6.0,6.0,6.0,5.0
mean,-0.596439,0.001929,0.229031,-0.344383,3.0
std,1.04598,1.107514,0.611741,1.071293,1.581139
min,-1.800025,-1.605991,-0.807126,-1.600972,1.0
25%,-1.49763,-0.553222,0.056331,-0.949132,2.0
50%,-0.552485,-0.071574,0.246345,-0.621276,3.0
75%,0.229206,0.746189,0.691827,0.207391,4.0
max,0.649269,1.455706,0.855309,1.352032,5.0


In [50]:
df.mean(), df.mean(axis=1) # Column means (axis=0, the default), then row means (axis=1)

(A   -0.596439
 B    0.001929
 C    0.229031
 D   -0.344383
 F    3.000000
 dtype: float64, 2014-01-01    0.101239
 2014-01-02   -0.129035
 2014-01-03    0.765866
 2014-01-04    0.356245
 2014-01-05    0.420878
 2014-01-06    0.653220
 Freq: D, dtype: float64)

### Applying functions

In [17]:
df

Unnamed: 0,A,B,C,D
2014-01-01,0.038396,-0.339802,0.069078,0.141378
2014-01-02,0.446013,-0.218487,-2.269963,-0.240023
2014-01-03,-0.055671,-1.74426,0.665871,-0.135691
2014-01-04,-0.55865,-0.160196,-1.447581,-0.327606
2014-01-05,-0.310404,-1.608105,2.205788,-0.796481
2014-01-06,0.510712,-0.608694,-1.095705,-0.702242


In [52]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,F
2014-01-01,-0.387205,0.968981,0.855309,-1.032127,
2014-01-02,0.262064,-0.63701,0.867004,-1.732277,1.0
2014-01-03,-1.537962,0.818695,1.688623,-0.380245,3.0
2014-01-04,-3.295547,0.597736,1.991072,0.077076,6.0
2014-01-05,-2.860871,0.675548,1.183946,-1.523895,10.0
2014-01-06,-3.578636,0.011572,1.374186,-2.066297,15.0


In [54]:
df.apply(lambda x: x.max() - x.min())

# Another way: pass a named function instead of a lambda
#def custom_func(x):
#    return x.max() - x.min()
#
#df.apply(custom_func)

A    2.449294
B    3.061697
C    1.662435
D    2.953003
F    4.000000
dtype: float64

In [55]:
# Built-in string methods are available through .str; this example
# converts every string to lower case (missing values stay NaN)
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

### Merge

In [56]:
np.random.randn(10,4)

array([[-0.53225244, -0.66299725, -0.3485465 , -0.13795054],
       [-0.2508792 , -1.5713097 , -0.04168368, -0.23700279],
       [ 1.36974684, -0.19418975, -1.87129622,  0.09591815],
       [ 0.05663418,  0.29594672,  0.20782892, -0.3306086 ],
       [-1.4802021 ,  0.13243684, -0.31426406,  1.5773732 ],
       [ 1.55383327, -0.53160718, -0.07561462, -0.68543868],
       [-2.41727376, -1.61016599,  0.74736401, -1.49860012],
       [-0.82130272, -0.21764046,  2.68294145, -0.72781233],
       [-1.5409474 ,  1.10070365, -0.21653812, -0.01389488],
       [ 0.17945507,  0.32112606,  0.10100904, -0.95772043]])

In [None]:
#Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df

In [57]:
# Break it into pieces: pieces is a list of DataFrames
pieces = [df[:3], df[3:7],df[7:]]
pieces

[                   A         B         C         D    F
 2014-01-01 -0.387205  0.968981  0.855309 -1.032127  NaN
 2014-01-02  0.649269 -1.605991  0.011695 -0.700150  1.0
 2014-01-03 -1.800025  1.455706  0.821620  1.352032  2.0,
                    A         B         C         D    F
 2014-01-04 -1.757586 -0.220960  0.302449  0.457321  3.0
 2014-01-05  0.434676  0.077812 -0.807126 -1.600972  4.0
 2014-01-06 -0.717765 -0.663976  0.190241 -0.542402  5.0,
 Empty DataFrame
 Columns: [A, B, C, D, F]
 Index: []]

In [58]:
pd.concat(pieces)

Unnamed: 0,A,B,C,D,F
2014-01-01,-0.387205,0.968981,0.855309,-1.032127,
2014-01-02,0.649269,-1.605991,0.011695,-0.70015,1.0
2014-01-03,-1.800025,1.455706,0.82162,1.352032,2.0
2014-01-04,-1.757586,-0.22096,0.302449,0.457321,3.0
2014-01-05,0.434676,0.077812,-0.807126,-1.600972,4.0
2014-01-06,-0.717765,-0.663976,0.190241,-0.542402,5.0


In [59]:
# DataFrames can also be combined with "join" and "merge" (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
df

Unnamed: 0,A,B,C,D,F
2014-01-01,-0.387205,0.968981,0.855309,-1.032127,
2014-01-02,0.649269,-1.605991,0.011695,-0.70015,1.0
2014-01-03,-1.800025,1.455706,0.82162,1.352032,2.0
2014-01-04,-1.757586,-0.22096,0.302449,0.457321,3.0
2014-01-05,0.434676,0.077812,-0.807126,-1.600972,4.0
2014-01-06,-0.717765,-0.663976,0.190241,-0.542402,5.0
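
A short sketch of the two join-style operations on made-up frames: pd.merge joins on a column, while DataFrame.join joins on the index.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# SQL-style join on a shared column
merged = pd.merge(left, right, on='key')

# Index-based join: move the key into the index first
joined = left.set_index('key').join(right.set_index('key'))

print(merged)
print(joined)
```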


### Grouping


In [64]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})

In [65]:
df

Unnamed: 0,A,B,C,D
0,foo,one,-1.294183,-0.528049
1,bar,one,2.018171,-3.234526
2,foo,two,0.055212,-0.820947
3,bar,three,-1.636121,-1.38157
4,foo,two,-0.926993,-1.35848
5,bar,two,0.560762,1.164018
6,foo,one,0.482521,-0.317259
7,foo,three,-0.757768,0.457547


In [66]:
df.groupby(['A','B']).sum()

A | B | C | D
--- | --- | --- | ---
bar | one | 2.018171 | -3.234526
bar | three | -1.636121 | -1.38157
bar | two | 0.560762 | 1.164018
foo | one | -0.811662 | -0.845308
foo | three | -0.757768 | 0.457547
foo | two | -0.871781 | -2.179427
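
Since the data above is random, here is the same grouping on a small made-up frame whose sums can be checked by hand:

```python
import pandas as pd

sales = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'one', 'two'],
                      'C': [1, 2, 3, 4]})

# One row per distinct (A, B) pair; C is summed within each group
totals = sales.groupby(['A', 'B']).sum()

print(totals)
```

The two (foo, one) rows collapse into a single row whose C value is 1 + 3.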


### Reshaping

In [None]:
# You can also stack or unstack levels

In [67]:
a = df.groupby(['A','B']).sum()
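
A minimal sketch of stacking and unstacking on a made-up grouped frame. Recent pandas releases may print a FutureWarning for stack() about an upcoming implementation change; the result here should be the same either way.

```python
import pandas as pd

sales = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                      'B': ['one', 'two', 'one', 'two'],
                      'C': [1, 2, 3, 4]})

grouped = sales.groupby(['A', 'B']).sum()   # rows indexed by (A, B)
wide = grouped.unstack()                     # moves index level B into the columns
tall = wide.stack()                          # moves it back into the rows

print(wide)
print(tall)
```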

In [68]:
# Pivot Tables
pd.pivot_table(df,values=['C','D'],index=['A'],columns=['B'])

A | C (B=one) | C (B=three) | C (B=two) | D (B=one) | D (B=three) | D (B=two)
--- | --- | --- | --- | --- | --- | ---
bar | 2.018171 | -1.636121 | 0.560762 | -3.234526 | -1.38157 | 1.164018
foo | -0.405831 | -0.757768 | -0.43589 | -0.422654 | 0.457547 | -1.089713


### Time Series


In [None]:
import pandas as pd
import numpy as np

In [None]:
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')

In [None]:
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [None]:
ts

In [None]:
# Built in resampling
ts.resample('1Min').mean() # Downsample from one value per second to one per minute, averaging within each minute
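
A deterministic sketch of resampling, so the per-minute means can be checked by hand. The names idx and ticks are made up, and the lowercase aliases 's' and 'min' are used here on the assumption that newer pandas releases prefer them over 'S' and 'Min'.

```python
import numpy as np
import pandas as pd

# 120 seconds of data holding the values 0..119
idx = pd.date_range('2014-01-01', periods=120, freq='s')
ticks = pd.Series(np.arange(120), index=idx)

# Group the seconds into 1-minute bins and average each bin
per_minute = ticks.resample('1min').mean()

print(per_minute)
```

The first bin averages 0..59 and the second averages 60..119.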

In [None]:
# Many additional time series features are available:
# type "ts." and press Tab to list them

### Plotting


In [69]:
# ts is the Series built in the Time Series cells above; run those first,
# otherwise this raises NameError: name 'ts' is not defined
ts.plot()

In [None]:
def randwalk(startdate, points):
    # Random walk: cumulative sum of standard-normal daily steps
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts = ts.cumsum()
    ts.plot()
    return ts

In [None]:
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)

In [None]:
# By default, the pandas plot function labels each line with its column name

In [None]:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

### I/O
I/O is straightforward: for example, pd.read_csv reads a CSV file into a DataFrame and df.to_csv writes one back out.
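
A minimal round-trip sketch. It writes to an in-memory buffer so that it runs without touching the disk; a file path such as 'data.csv' (a hypothetical name) would be used the same way.

```python
import io
import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False leaves the row labels out of the file
buffer.seek(0)                      # rewind before reading back

restored = pd.read_csv(buffer)
print(restored)
```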

#### The benefits of open source:

Let's look under x's in plt modules

# Next Steps

**Recommended Resources**

Name | Description
--- | ---
[Official Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/10min.html) | Wes & Company's selection of tutorials and lectures
[Julia Evans Pandas Cookbook](https://github.com/jvns/pandas-cookbook) | Great resource with examples from weather, bikes and 311 calls
[Learn Pandas Tutorials](https://bitbucket.org/hrojas/learn-pandas) | A great series of Pandas tutorials from Dave Rojas
[Research Computing Python Data PYNBs](https://github.com/ResearchComputing/Meetup-Fall-2013/tree/master/python) | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas