# Agenda
* Numpy
* Pandas
* Lab


# Introduction


## Create a new notebook for your code-along:

From our submission directory, type:
    
    jupyter notebook

From the IPython Dashboard, open a new notebook.
Change the title to: "Numpy and Pandas"

# Introduction to Numpy

* Overview
* ndarray
* Indexing and Slicing

More info: [http://wiki.scipy.org/Tentative_NumPy_Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)


## Numpy Overview

* Why Python for Data? Numpy brings *decades* of C math into Python!
* Numpy provides a wrapper for extensive C/C++/Fortran codebases, used for data analysis functionality
* ndarray allows easy vectorized math and broadcasting (i.e. element-wise operations between arrays of different but compatible shapes)
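
To make vectorized math and broadcasting concrete, here is a minimal sketch. The names `matrix` and `row` are illustrative and not part of the lesson.

```python
import numpy as np

# A (2, 3) matrix and a length-3 vector: different shapes, but compatible
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])

# Vectorized math: the multiplication is applied to every element, no Python loop
doubled = matrix * 2

# Broadcasting: `row` is stretched across both rows of `matrix` before adding
shifted = matrix + row

print(doubled)
print(shifted)
```

Broadcasting only works when the trailing dimensions match or one of them is 1; adding a length-2 vector to `matrix` would raise a `ValueError`.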

In [None]:
import numpy as np

### Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items. 

In [10]:
# Creating arrays
a = np.zeros((3))
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4)) #two matrices, dimensions 3 by 4, values are randomly selected from 1 to 9
d = np.arange(0,11)

In [12]:
print(a)
print(b)
print(c)
print(d)

[ 0.  0.  0.]
[[ 1.  1.  1.]
 [ 1.  1.  1.]]
[[[7 3 4 5]
  [9 5 4 4]
  [6 2 4 1]]

 [[3 6 8 5]
  [5 2 8 7]
  [4 2 6 8]]]
[ 0  1  2  3  4  5  6  7  8  9 10]


What do these functions do? In IPython, append `?` to a name to see its docstring:

    np.arange?

In [None]:
# Note the way each array is printed:
a,b,c,d

## Arithmetic in arrays is element-wise

In [13]:
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
b

array([0, 1, 2, 3])

In [14]:
c = a-b
c

array([20, 29, 38, 47])

In [18]:
a = [0,1,2,3]
# Squaring a plain list needs an explicit map over its items
list(map(lambda x: x**2, a))

[0, 1, 4, 9]

In [15]:
b**2

array([0, 1, 4, 9])

## Indexing, Slicing and Iterating

In [19]:
# one-dimensional arrays work like lists:
a = np.arange(10)**2



In [20]:
a

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [21]:
a[2:5]

array([ 4,  9, 16])

In [None]:
# Multidimensional arrays use tuples with commas for indexing
# with (row,column) conventions beginning, as always in Python, from 0

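In [30]:
# (row, column) indexing needs a 2-D array; the shape (4,5) and value range are assumptions
b = np.random.randint(1,100,(4,5))

In [31]:
b.shape

(4, 5)

In [24]:
# Guess the output (the values are random, so yours will differ)
print(b[2,3])
print(b[0,0])


85
25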


In [None]:
b[0:3,1],b[:,1]

In [None]:
b[1:3,:]
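
Because the array above is random, the slices are hard to check by eye. The sketch below repeats the same indexing patterns on a small deterministic array; `grid` is a hypothetical name chosen for this example.

```python
import numpy as np

# A 4x5 grid holding 0..19, so each value equals 5*row + column
grid = np.arange(20).reshape(4, 5)

element = grid[2, 3]      # row 2, column 3
column = grid[:, 1]       # every row, column 1
rows = grid[1:3, :]       # rows 1 and 2 (the stop index 3 is excluded), all columns

print(element)
print(column)
print(rows)
```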

# Introduction to Pandas

* Object Creation
* Viewing data
* Selection
* Missing data
* Grouping
* Reshaping
* Time series
* Plotting
* i/o
 

_pandas.pydata.org_

## Pandas Overview

_Source: [pandas.pydata.org](http://pandas.pydata.org/pandas-docs/stable/10min.html)_

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [46]:
dates = pd.date_range('20140101',periods=6)
print(dates)



DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')


In [43]:
import time
time.localtime()

time.struct_time(tm_year=2016, tm_mon=11, tm_mday=14, tm_hour=19, tm_min=34, tm_sec=54, tm_wday=0, tm_yday=319, tm_isdst=0)

In [38]:
date1 = dates[3]
print(date1.day)
print(date1.month)
print(date1.year)

4
1
2014


In [99]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
z = pd.DataFrame(index = df.index, columns = df.columns)
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [100]:
df

Unnamed: 0,A,B,C,D
2014-01-01,0.257154,0.303645,-0.25743,-0.026443
2014-01-02,1.087703,-1.788603,-1.468859,1.142011
2014-01-03,-1.400846,-1.686589,1.096114,-0.009619
2014-01-04,-0.370989,-0.243499,-0.930729,-1.896124
2014-01-05,0.049599,-0.129597,-0.895803,-0.139211
2014-01-06,-0.027578,-0.381481,0.846597,-0.34391


In [58]:
# Transpose: swap the index and the columns
df.T

Unnamed: 0,2014-01-01 00:00:00,2014-01-02 00:00:00,2014-01-03 00:00:00,2014-01-04 00:00:00,2014-01-05 00:00:00,2014-01-06 00:00:00
A,-0.913759,0.44882,0.077652,-1.483918,1.439604,-0.653853
B,1.152379,-1.041156,-1.606416,-0.0836,-1.705013,-0.892301
C,-0.50948,-0.178627,-0.068104,2.077766,-0.882017,-0.090013
D,-0.501243,-0.187753,0.254724,1.165495,-0.761129,0.231299
E,0.732694,0.053386,1.609158,1.078464,-0.014215,0.033273


In [67]:
df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                         'C' : pd.Series(1,index=list(range(6)),dtype='float32'),
                         'D' : np.array([3] * 6,dtype='int32'),
                         'E' : 'foo' })
    

df2

Unnamed: 0,A,B,C,D,E
0,1.0,2013-01-02,1.0,3,foo
1,1.0,2013-01-02,1.0,3,foo
2,1.0,2013-01-02,1.0,3,foo
3,1.0,2013-01-02,1.0,3,foo
4,1.0,2013-01-02,1.0,3,foo
5,1.0,2013-01-02,1.0,3,foo


In [70]:
# With specific dtypes
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

### Viewing Data

In [97]:
df.head()

Unnamed: 0,A,B,C,D,E
2014-01-05,1.439604,-1.705013,-0.882017,-0.761129,-0.014215
2014-01-03,0.077652,-1.606416,-0.068104,0.254724,1.609158
2014-01-02,0.44882,-1.041156,-0.178627,-0.187753,0.053386
2014-01-06,-0.653853,-0.892301,-0.090013,0.231299,0.033273
2014-01-04,-1.483918,-0.0836,2.077766,1.165495,1.078464


In [None]:
df.tail()

In [71]:
df.index

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [6]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.240503,-0.335092,0.293088,0.150452
std,0.811299,0.822329,0.926139,0.752499
min,-1.044456,-0.983128,-1.041232,-0.685814
25%,-0.972093,-0.940601,-0.28387,-0.423482
50%,-0.255418,-0.739244,0.359495,0.002338
75%,0.368393,0.286086,0.969948,0.822934
max,0.746932,0.83092,1.405663,1.052807


In [76]:
df.sort_values(by='B', inplace=True)  # inplace expects the boolean True, not the string 'True'
df


Unnamed: 0,A,B,C,D,E
2014-01-05,1.439604,-1.705013,-0.882017,-0.761129,-0.014215
2014-01-03,0.077652,-1.606416,-0.068104,0.254724,1.609158
2014-01-02,0.44882,-1.041156,-0.178627,-0.187753,0.053386
2014-01-06,-0.653853,-0.892301,-0.090013,0.231299,0.033273
2014-01-04,-1.483918,-0.0836,2.077766,1.165495,1.078464
2014-01-01,-0.913759,1.152379,-0.50948,-0.501243,0.732694


### Selection

In [77]:
# Select the first and third columns by integer position
df.iloc[:, [0,2]]

Unnamed: 0,A,C
2014-01-05,1.439604,-0.882017
2014-01-03,0.077652,-0.068104
2014-01-02,0.44882,-0.178627
2014-01-06,-0.653853,-0.090013
2014-01-04,-1.483918,2.077766
2014-01-01,-0.913759,-0.50948


In [9]:
df[0:3]

Unnamed: 0,A,B,C,D
2014-01-01,0.382109,-0.983128,-1.041232,-0.467352
2014-01-02,-1.044456,0.584138,1.089119,-0.291872
2014-01-03,-1.016764,-0.963995,0.106556,0.296547


In [79]:
# By label
type(df.loc[dates[0]])



pandas.core.series.Series

In [80]:
# multi-axis by label
df.loc[:,['A','B']]

Unnamed: 0,A,B
2014-01-05,1.439604,-1.705013
2014-01-03,0.077652,-1.606416
2014-01-02,0.44882,-1.041156
2014-01-06,-0.653853,-0.892301
2014-01-04,-1.483918,-0.0836
2014-01-01,-0.913759,1.152379


In [81]:
# Date Range
df.loc['20140102':'20140104',['A','B']]


Unnamed: 0,A,B
2014-01-03,0.077652,-1.606416
2014-01-02,0.44882,-1.041156
2014-01-04,-1.483918,-0.0836


In [82]:
# Fast access to scalar
df.at[dates[1],'B']

-1.0411558186389573

In [89]:
df[1:4]

Unnamed: 0,A,B,C,D,E
2014-01-03,0.077652,-1.606416,-0.068104,0.254724,1.609158
2014-01-02,0.44882,-1.041156,-0.178627,-0.187753,0.053386
2014-01-06,-0.653853,-0.892301,-0.090013,0.231299,0.033273


In [90]:
# iloc provides integer locations similar to np style
df.iloc[1:5]

Unnamed: 0,A,B,C,D,E
2014-01-03,0.077652,-1.606416,-0.068104,0.254724,1.609158
2014-01-02,0.44882,-1.041156,-0.178627,-0.187753,0.053386
2014-01-06,-0.653853,-0.892301,-0.090013,0.231299,0.033273
2014-01-04,-1.483918,-0.0836,2.077766,1.165495,1.078464
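
The difference between label-based and position-based slicing is easy to miss, so here is a minimal sketch on a hypothetical frame (`frame` and its values are made up for illustration). A `.loc` slice includes both endpoints, while an `.iloc` slice excludes the stop position, as in NumPy.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20140101', periods=6)
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label slice: 2014-01-02 through 2014-01-04, both endpoints included
by_label = frame.loc['20140102':'20140104']

# Position slice: rows 1, 2 and 3; position 4 is excluded
by_position = frame.iloc[1:4]

print(by_label.equals(by_position))
```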


### Boolean Indexing

In [96]:
df[df.A < 0] # Basically a 'where' operation

Unnamed: 0,A,B,C,D,E
2014-01-06,-0.653853,-0.892301,-0.090013,0.231299,0.033273
2014-01-04,-1.483918,-0.0836,2.077766,1.165495,1.078464
2014-01-01,-0.913759,1.152379,-0.50948,-0.501243,0.732694
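
Boolean masks can also be combined. The sketch below assumes a small hypothetical frame; note that `&` and `|` are used instead of `and` and `or`, and each condition needs its own parentheses.

```python
import pandas as pd

frame = pd.DataFrame({'A': [-2, -1, 0, 1, 2],
                      'B': [5, 4, 3, 2, 1]})

both = frame[(frame.A < 0) & (frame.B > 4)]     # rows where both conditions hold
either = frame[(frame.A < 0) | (frame.B < 2)]   # rows where at least one holds
picked = frame[frame.A.isin([0, 2])]            # rows whose A is in a given set

print(both)
print(either)
print(picked)
```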


### Setting

In [93]:
df_posA = df.copy() # Without .copy(), df_posA would be a second name for df and the assignment below would change df too

df_posA[df_posA.A < 0] = -1*df_posA

In [94]:
df_posA

Unnamed: 0,A,B,C,D,E
2014-01-05,1.439604,-1.705013,-0.882017,-0.761129,-0.014215
2014-01-03,0.077652,-1.606416,-0.068104,0.254724,1.609158
2014-01-02,0.44882,-1.041156,-0.178627,-0.187753,0.053386
2014-01-06,0.653853,0.892301,0.090013,-0.231299,-0.033273
2014-01-04,1.483918,0.0836,-2.077766,-1.165495,-1.078464
2014-01-01,0.913759,-1.152379,0.50948,0.501243,-0.732694


In [None]:
#Setting new column aligns data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))

In [None]:
s1

In [None]:
df['F'] = s1

In [None]:
df
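
Alignment by index is easier to see on a tiny example. In this sketch (the names `frame` and `shifted` are illustrative), the Series starts one day later than the frame, so the first row gets `NaN` and the Series value for the extra date is dropped.

```python
import pandas as pd

frame = pd.DataFrame({'A': [10, 20, 30]},
                     index=pd.date_range('20140101', periods=3))
shifted = pd.Series([1, 2, 3], index=pd.date_range('20140102', periods=3))

# Values are matched on the date labels, not on their position
frame['F'] = shifted
print(frame)
```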

### Missing Data

In [None]:
# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

In [None]:
df1.loc[dates[0]:dates[1],'E'] = 1

In [None]:
df1

In [None]:
# find where values are null
pd.isnull(df1)
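
Once missing values are located, they are usually dropped or filled. A minimal sketch, assuming a hypothetical frame with gaps in column `E` and a fill value of 5 chosen only for illustration:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'E': [1.0, np.nan, np.nan]})

complete_rows = frame.dropna(how='any')    # keep only rows with no missing values
filled = frame.fillna(value=5)             # replace every missing value with 5
missing_per_column = frame.isnull().sum()  # count of missing values in each column

print(complete_rows)
print(filled)
print(missing_per_column)
```

Neither `dropna` nor `fillna` changes `frame` itself; both return a new object.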

### Operations

In [None]:
df.describe()

In [None]:
df.mean(), df.mean(axis=1) # Column means (axis 0), then row means (axis 1)
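
The two axes can be checked by hand on a small frame. This is a sketch with made-up values; `frame` is not part of the lesson data.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3],
                      'B': [10, 20, 30]})

col_means = frame.mean(axis=0)   # one value per column
row_means = frame.mean(axis=1)   # one value per row

print(col_means)
print(row_means)
```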

### Applying functions

In [14]:
import pandas as pd
df3 = pd.DataFrame({ "A" : range(5), "B" : range(5)})
df3.head()

df3["C"] = df3["A"]*2
df3.head()

Unnamed: 0,A,B,C
0,0,0,0
1,1,1,2
2,2,2,4
3,3,3,6
4,4,4,8


In [45]:
df3["new_column"] = df3.A.apply(lambda x: x*2)


In [46]:
df3.head(5)

Unnamed: 0,A,B,C,new_column
0,0,0,0,0
1,1,1,2,2
2,2,2,4,4
3,3,3,6,6
4,4,4,8,8


In [47]:
df5 = pd.DataFrame({"firstName" : ["Alex" , "Tom"], "lastName": ["Henry", "Smith"]})
df5.head()



Unnamed: 0,firstName,lastName
0,Alex,Henry
1,Tom,Smith


In [51]:
df3.apply(lambda x: x.max() - x.min())

A             4
B             4
C             8
new_column    8
dtype: int64

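In [None]:
def makeFullName(obj):
    # With axis=1, obj is one row of df5, passed in as a Series
    firstName = obj["firstName"]
    lastName = obj["lastName"]
    return firstName + " " + lastName

df5["newName"] = df5.apply(makeFullName, axis=1)

In [40]:
# The vectorized equivalent: add the two string columns directly
df5["fullName"] = df5.firstName + " " + df5.lastName
print(df5)

  firstName lastName     newName    fullName
0      Alex    Henry  Alex Henry  Alex Henry
1       Tom    Smith   Tom Smith   Tom Smith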


In [None]:
df['E'] = df.D.apply(lambda x: x*2)

In [124]:
df.head()

Unnamed: 0,A,B,C,D,E
2014-01-01,0.257154,0.303645,-0.25743,-0.026443,-0.052885
2014-01-02,1.087703,-1.788603,-1.468859,1.142011,2.284022
2014-01-03,-1.400846,-1.686589,1.096114,-0.009619,-0.019238
2014-01-04,-0.370989,-0.243499,-0.930729,-1.896124,-3.792248
2014-01-05,0.049599,-0.129597,-0.895803,-0.139211,-0.278422


In [137]:
df['sum'] = df.apply(lambda x: sum(x), axis = 1)
#df['total'] = df.apply(lambda x: sum(x), axis = 0)

In [138]:
df.head()

Unnamed: 0,A,B,C,D,E,F,sum,total
2014-01-01,0.257154,0.303645,-0.25743,-0.026443,-0.052885,0.224041,,
2014-01-02,1.087703,-1.788603,-1.468859,1.142011,2.284022,1.256275,,
2014-01-03,-1.400846,-1.686589,1.096114,-0.009619,-0.019238,-2.020177,,
2014-01-04,-0.370989,-0.243499,-0.930729,-1.896124,-3.792248,-7.233589,,
2014-01-05,0.049599,-0.129597,-0.895803,-0.139211,-0.278422,-1.393435,,


In [125]:
import math
#df.apply(np.cumsum)
#df.apply(np.cumsum ,axis =1)
# Row-wise sum of A..E; the math module has no sum(), so math.fsum is used here
df['F'] = df[['A','B','C','D','E']].apply(lambda x: math.fsum(x), axis=1)

In [None]:
df.apply(lambda x: x.max() - x.min())

In [None]:
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

### Merge

In [53]:
import numpy as np

np.random.randn(10,4)

array([[ 0.03207084, -1.2906335 , -0.28263048,  1.14651962],
       [-1.10780411, -1.5632996 ,  0.25252779,  0.60007849],
       [-0.30746662, -0.1201996 ,  1.16670518, -0.23471177],
       [ 0.7245662 , -1.58627752, -0.47679201,  0.11016519],
       [-0.25475413, -0.94954274,  2.35636807, -0.68466271],
       [-0.03827279, -0.73337229, -1.68137166, -0.06745603],
       [ 1.38202383,  0.22972144, -1.67650353, -0.93906664],
       [ 1.54101228, -1.89487617,  0.1794592 ,  0.16536892],
       [-2.01895572,  3.31007602,  1.58074939, -2.14873164],
       [ 0.24433982,  0.18356098,  0.10702461, -1.02469645]])

In [54]:
#Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df

Unnamed: 0,0,1,2,3
0,0.139688,0.648012,-0.678933,-1.058519
1,0.561435,0.292803,0.053886,0.001643
2,-1.1721,0.323318,-0.997208,-0.735723
3,2.17923,1.081292,-0.317193,-0.481662
4,-1.758127,-2.545318,-1.673364,0.383562
5,0.384023,-1.084177,-0.120094,-0.83588
6,0.971732,-1.004158,0.624792,-0.743611
7,1.378634,0.4123,-0.324388,-0.783513
8,-0.033243,-0.012977,-0.513599,-1.637293
9,0.137691,1.809824,0.22083,0.320937



In [58]:
# Break it into pieces
pieces = [df[:3], df[3:7],df[7:]]
pieces

[          0         1         2         3
 0  0.139688  0.648012 -0.678933 -1.058519
 1  0.561435  0.292803  0.053886  0.001643
 2 -1.172100  0.323318 -0.997208 -0.735723,
           0         1         2         3
 3  2.179230  1.081292 -0.317193 -0.481662
 4 -1.758127 -2.545318 -1.673364  0.383562
 5  0.384023 -1.084177 -0.120094 -0.835880
 6  0.971732 -1.004158  0.624792 -0.743611,
           0         1         2         3
 7  1.378634  0.412300 -0.324388 -0.783513
 8 -0.033243 -0.012977 -0.513599 -1.637293
 9  0.137691  1.809824  0.220830  0.320937]

In [None]:
pd.concat(pieces)


In [None]:
# DataFrames can also be combined with merge and join (append was removed in pandas 2.0; use pd.concat instead)
df
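
For a SQL-style join on a shared column, `pd.merge` is the usual tool. A minimal sketch, assuming two hypothetical frames `left` and `right` that share a `key` column:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# Inner join on 'key': each row of `left` is paired with the matching row of `right`
merged = pd.merge(left, right, on='key')
print(merged)
```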

### Grouping


In [60]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})

In [61]:
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.452588,0.391075
1,bar,one,0.533355,1.736969
2,foo,two,-0.371477,0.077584
3,bar,three,-0.487145,-0.885094
4,foo,two,-0.175396,0.212291
5,bar,two,1.050277,-0.160328
6,foo,one,-0.972618,-0.986352
7,foo,three,0.651518,-1.236872


In [70]:
df.groupby(['A','B']).sum().reset_index()

Unnamed: 0,A,B,C,D
0,bar,one,0.533355,1.736969
1,bar,three,-0.487145,-0.885094
2,bar,two,1.050277,-0.160328
3,foo,one,-1.425206,-0.595278
4,foo,three,0.651518,-1.236872
5,foo,two,-0.546873,0.289875


In [69]:
df3 = df.groupby(["A","B"]).apply(lambda x: sum(x["C"])).reset_index()
print(df3)

     A      B         0
0  bar    one  0.533355
1  bar  three -0.487145
2  bar    two  1.050277
3  foo    one -1.425206
4  foo  three  0.651518
5  foo    two -0.546873
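
The `apply` call above leaves the summed column named `0`. One alternative, sketched here on a small hypothetical frame, selects the column before aggregating and names the result explicitly; `C_sum` is a made-up column name.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'one'],
                      'C': [1, 2, 3, 4]})

# Sum C within each (A, B) group, then turn the group keys back into columns
totals = frame.groupby(['A', 'B'])['C'].sum().reset_index(name='C_sum')
print(totals)
```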


### Reshaping

In [None]:
# You can also stack or unstack levels

In [None]:
a = df.groupby(['A','B']).sum()

In [None]:
# Pivot Tables
pd.pivot_table(df,values=['C','D'],index=['A'],columns=['B'])
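
Unstacking a grouped result and building a pivot table lead to similar wide layouts. The sketch below uses a small hypothetical frame and passes `aggfunc='sum'` so both routes aggregate the same way (the default for `pivot_table` is the mean).

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                      'B': ['one', 'two', 'one', 'two'],
                      'C': [1, 2, 3, 4]})

grouped = frame.groupby(['A', 'B']).sum()   # rows indexed by the (A, B) pairs
wide = grouped.unstack('B')                 # move level B from the rows to the columns

# The same layout built directly from the original frame
pivoted = pd.pivot_table(frame, values='C', index='A', columns='B', aggfunc='sum')

print(wide)
print(pivoted)
```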

### Time Series


In [None]:
import pandas as pd
import numpy as np

In [None]:
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')

In [None]:
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [None]:
ts

In [None]:
# Built in resampling
ts.resample('1Min').mean() # Downsample from per-second data to 1-minute bins, taking the mean of each bin
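
With random values the resampled means cannot be checked by hand, so this sketch uses a deterministic series instead. The name `counts` is illustrative, and the lowercase aliases `'s'` and `'1min'` are an assumption made here because recent pandas releases prefer them.

```python
import numpy as np
import pandas as pd

# 120 seconds of data holding the values 0..119
rng = pd.date_range('1/1/2014', periods=120, freq='s')
counts = pd.Series(np.arange(120), index=rng)

# Two 1-minute bins: the mean of 0..59 and the mean of 60..119
per_minute = counts.resample('1min').mean()
print(per_minute)
```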

In [None]:
# Many additional time series features
# Type "ts." and press Tab to browse the available methods

### Plotting


In [None]:
ts.plot()

In [None]:
def randwalk(startdate,points):
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts=ts.cumsum()
    ts.plot()
    return(ts)

In [None]:
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)

In [None]:
# Pandas plot function will print with labels as default

In [None]:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')

### I/O
I/O is straightforward with, for example, pd.read_csv or df.to_csv
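
A minimal round-trip sketch: the frame is written to an in-memory buffer rather than a file so the example runs anywhere. `frame`, `buffer` and `restored` are hypothetical names; with a real file you would pass a path to `to_csv` and `read_csv` instead.

```python
import io
import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False keeps the row labels out of the CSV
buffer.seek(0)                      # rewind so the buffer can be read from the start

restored = pd.read_csv(buffer)
print(restored)
```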

#### The benefits of open source:

Let's look under the hood at the source of the `plt` modules.

# Next Steps

**Recommended Resources**

Name | Description
--- | ---
[Official Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/10min.html) | Wes & Company's selection of tutorials and lectures
[Julia Evans Pandas Cookbook](https://github.com/jvns/pandas-cookbook) | Great resource with examples from weather, bikes and 311 calls
[Learn Pandas Tutorials](https://bitbucket.org/hrojas/learn-pandas) | A great series of Pandas tutorials from Dave Rojas
[Research Computing Python Data PYNBs](https://github.com/ResearchComputing/Meetup-Fall-2013/tree/master/python) | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas