# Pandas

Reference: http://pandas.pydata.org/pandas-docs/stable/pandas.pdf

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Let's first look at how powerful pandas is.

In [2]:
import numpy as np
import pandas as pd

Read the earthquake catalog data

Previously, we wrote code to read it by hand:

In [4]:
year=[];month=[];day=[];time=[];mag=[];lon=[];lat=[];depth=[];region=[]
# the with-block closes the file even if reading fails
with open("./data/earthquake.csv","r") as fp:
    lines=fp.readlines()
for line in lines[1:]:   # skip the header line
    temp=line.strip().split(",")
    year.append(temp[0]);month.append(temp[1])
    day.append(temp[2]);time.append(temp[3])
    # column order in the file is Mag, Lat, Lon
    mag.append(temp[4]);lat.append(temp[5]);lon.append(temp[6])
    depth.append(temp[7])
    if "\"" in line:
        # a quoted region such as "SALTA PROVINCE, ARGENTINA" was split at its comma
        region.append(temp[8]+","+temp[9])
    else:
        region.append(temp[8])

With pandas, reading this CSV file into a labeled table takes a single line of code.

In [6]:
eq=pd.read_csv('./data/earthquake.csv')
eq

Unnamed: 0,Year,Month,Day,Time UTC,Mag,Lat,Lon,Depth km,Region,IRIS ID,Timestamp
0,2019,2,11,3:59:47,2.0,59.8971,-152.7478,93.5,SOUTHERN ALASKA,11004556,1549857587
1,2019,2,11,3:36:08,1.1,33.5512,-116.9202,10.7,SOUTHERN CALIFORNIA,11004553,1549856168
2,2019,2,11,3:20:42,1.9,59.4979,-152.9018,73.5,SOUTHERN ALASKA,11004552,1549855242
3,2019,2,11,3:04:42,1.5,63.1436,-152.1082,4.2,CENTRAL ALASKA,11004550,1549854282
4,2019,2,11,2:41:44,2.7,59.6321,-146.3146,14.3,GULF OF ALASKA,11004548,1549852904
...,...,...,...,...,...,...,...,...,...,...,...
995,2019,2,7,1:30:04,0.9,33.4998,-116.7922,4.2,SOUTHERN CALIFORNIA,11003337,1549503004
996,2019,2,7,1:29:22,1.3,61.4188,-149.9530,35.0,SOUTHERN ALASKA,11003339,1549502962
997,2019,2,7,1:18:34,2.2,17.9675,-67.1656,14.0,MONA PASSAGE,11003336,1549502314
998,2019,2,7,1:13:36,2.0,47.8600,-122.0438,27.8,WASHINGTON,11003335,1549502016
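
The hand-written parser needed a special case for quoted regions, while `read_csv` handles them itself. The sketch below illustrates this on three rows copied from the tables in this notebook. It assumes the file has the column layout shown above, and it uses `io.StringIO` as a stand-in for `./data/earthquake.csv` so that it runs without the data file.

```python
import io
import pandas as pd

# Three rows copied from the catalog shown in this notebook.
# Assumption: the real file has this header and column order.
csv_text = '''Year,Month,Day,Time UTC,Mag,Lat,Lon,Depth km,Region,IRIS ID,Timestamp
2019,2,11,3:59:47,2.0,59.8971,-152.7478,93.5,SOUTHERN ALASKA,11004556,1549857587
2019,2,10,13:22:57,4.4,-24.0878,-66.7418,196.5,"SALTA PROVINCE, ARGENTINA",11004451,1549804977
2019,2,7,1:13:36,2.0,47.8600,-122.0438,27.8,WASHINGTON,11003335,1549502016
'''

# The in-memory text stands in for the file on disk.
sample = pd.read_csv(io.StringIO(csv_text))

# The quoted field keeps its comma and loses the surrounding quotes.
print(sample.loc[1, 'Region'])

# Numeric columns are parsed as numbers rather than strings.
print(sample.dtypes)
```

With the manual parser, every value stays a string until it is converted explicitly; here `Mag`, `Lat`, `Lon`, and the other numeric columns can be used in arithmetic right away.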


Get summary statistics for the numeric columns

In [10]:
eq.describe()

Unnamed: 0,Year,Month,Day,Mag,Lat,Lon,Depth km,IRIS ID,Timestamp
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,2019.0,2.0,8.336,1.783,39.622172,-112.980629,24.3531,11003970.0,1549656000.0
std,0.0,0.0,1.149097,1.101092,18.279838,61.241554,45.350977,363.937,99342.98
min,2019.0,2.0,7.0,0.1,-52.9761,-179.5437,-3.0,11003330.0,1549502000.0
25%,2019.0,2.0,7.0,1.1,33.467,-149.90835,2.9,11003630.0,1549569000.0
50%,2019.0,2.0,8.0,1.6,36.953,-118.92475,9.6,11003990.0,1549645000.0
75%,2019.0,2.0,9.0,2.1,60.338525,-116.2888,32.35,11004290.0,1549743000.0
max,2019.0,2.0,11.0,5.9,69.572,178.1605,573.9,11004560.0,1549858000.0


Sort the rows by the values in a column

In [11]:
eq.sort_values(by='Lat',ascending=True)

Unnamed: 0,Year,Month,Day,Time UTC,Mag,Lat,Lon,Depth km,Region,IRIS ID,Timestamp
382,2019,2,9,2:37:43,4.4,-52.9761,-71.3270,16.4,SOUTHERN CHILE,11004164,1549679863
765,2019,2,7,18:42:19,4.3,-37.1278,-72.7738,36.8,CENTRAL CHILE,11003590,1549564939
657,2019,2,8,3:11:05,5.4,-32.8313,57.1405,10.0,SOUTHWEST INDIAN RIDGE,11003763,1549595465
39,2019,2,10,21:00:36,4.3,-28.0239,-70.7217,94.6,CENTRAL CHILE,11004518,1549832436
102,2019,2,10,13:22:57,4.4,-24.0878,-66.7418,196.5,"SALTA PROVINCE, ARGENTINA",11004451,1549804977
...,...,...,...,...,...,...,...,...,...,...,...
577,2019,2,8,10:16:51,3.4,69.1470,-144.7127,0.1,NORTHERN ALASKA,11003831,1549621011
986,2019,2,7,1:57:16,1.7,69.4863,-144.2146,6.9,NORTHERN ALASKA,11003529,1549504636
955,2019,2,7,4:56:26,1.8,69.5077,-143.9237,1.2,NORTHERN ALASKA,11003371,1549515386
636,2019,2,8,4:27:57,2.5,69.5444,-144.4315,10.0,NORTHERN ALASKA,11004058,1549600077


# Let's learn Pandas more systematically

* The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.
* The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars.
* For example, with tabular data (2D data or DataFrame) it is more helpful to think of the index (the rows) and the columns rather than axis 0 and axis 1.
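
The container idea in the list above can be sketched in a few lines. The table and its labels (`mag`, `depth_km`, `ev1` ...) are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical small table, used only for illustration.
table = pd.DataFrame(
    {'mag': [2.0, 1.1, 1.9], 'depth_km': [93.5, 10.7, 73.5]},
    index=['ev1', 'ev2', 'ev3'],
)

# A DataFrame is a container for Series: selecting one column gives a Series.
col = table['mag']
print(type(col).__name__)

# A Series is a container for scalars: selecting one label gives a single value.
print(col['ev2'])

# Rows are addressed through the index, columns through the column labels.
print(list(table.index))
print(list(table.columns))
```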

# Create data

## Create a series

In [13]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

## Create a 2D DataFrame

In [15]:
data = np.zeros((2,3))
data

array([[0., 0., 0.],
       [0., 0., 0.]])

In [65]:
pd.DataFrame(data)

Unnamed: 0,0,1,2
0,0.0,0.0,0.0
1,0.0,0.0,0.0


In [66]:
data2 = [{'a': 1.0, 'b': 2,'c':3}, {'a': 5, 'b': 10, 'c': 20}]

In [67]:
data2

[{'a': 1.0, 'b': 2, 'c': 3}, {'a': 5, 'b': 10, 'c': 20}]

In [71]:
test=pd.DataFrame(data2)
test

Unnamed: 0,a,b,c
0,1.0,2,3
1,5.0,10,20


In [73]:
pd.DataFrame(test, index=['first', 'second'],columns=['d','e','f'])

Unnamed: 0,d,e,f
first,,,
second,,,


In [74]:
df = pd.DataFrame({
    'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

In [77]:
df

The result has rows `a` to `d` and columns `one`, `two`, `three`. The values are random; an entry is NaN wherever a Series has no value for that row label (`one` at `d`, `three` at `a`).


In [83]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.72614,0.685097,1.553664,0.577291
2013-01-02,-0.87569,-2.487474,-1.839067,0.184432
2013-01-03,-0.562712,-1.167613,-0.379558,0.585288
2013-01-04,0.470829,-0.889744,-0.363516,0.565676
2013-01-05,-0.988784,-2.245261,-1.259012,1.224473
2013-01-06,1.297132,-0.035248,0.096089,-0.212984


## Read from a file

In [84]:
eq=pd.read_csv('./data/earthquake.csv')
eq

Unnamed: 0,Year,Month,Day,Time UTC,Mag,Lat,Lon,Depth km,Region,IRIS ID,Timestamp
0,2019,2,11,3:59:47,2.0,59.8971,-152.7478,93.5,SOUTHERN ALASKA,11004556,1549857587
1,2019,2,11,3:36:08,1.1,33.5512,-116.9202,10.7,SOUTHERN CALIFORNIA,11004553,1549856168
2,2019,2,11,3:20:42,1.9,59.4979,-152.9018,73.5,SOUTHERN ALASKA,11004552,1549855242
3,2019,2,11,3:04:42,1.5,63.1436,-152.1082,4.2,CENTRAL ALASKA,11004550,1549854282
4,2019,2,11,2:41:44,2.7,59.6321,-146.3146,14.3,GULF OF ALASKA,11004548,1549852904
...,...,...,...,...,...,...,...,...,...,...,...
995,2019,2,7,1:30:04,0.9,33.4998,-116.7922,4.2,SOUTHERN CALIFORNIA,11003337,1549503004
996,2019,2,7,1:29:22,1.3,61.4188,-149.9530,35.0,SOUTHERN ALASKA,11003339,1549502962
997,2019,2,7,1:18:34,2.2,17.9675,-67.1656,14.0,MONA PASSAGE,11003336,1549502314
998,2019,2,7,1:13:36,2.0,47.8600,-122.0438,27.8,WASHINGTON,11003335,1549502016


# Selection

In [95]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.72614,0.685097,1.553664,0.577291
2013-01-02,-0.87569,-2.487474,-1.839067,0.184432
2013-01-03,-0.562712,-1.167613,-0.379558,0.585288
2013-01-04,0.470829,-0.889744,-0.363516,0.565676
2013-01-05,-0.988784,-2.245261,-1.259012,1.224473
2013-01-06,1.297132,-0.035248,0.096089,-0.212984


In [105]:
df.loc['2013-01-06',:]

A    1.297132
B   -0.035248
C    0.096089
D   -0.212984
Name: 2013-01-06 00:00:00, dtype: float64

In [106]:
df.head(2)

Unnamed: 0,A,B,C,D
2013-01-01,0.72614,0.685097,1.553664,0.577291
2013-01-02,-0.87569,-2.487474,-1.839067,0.184432


In [107]:
df.tail(2)

Unnamed: 0,A,B,C,D
2013-01-05,-0.988784,-2.245261,-1.259012,1.224473
2013-01-06,1.297132,-0.035248,0.096089,-0.212984


In [108]:
df['A']

2013-01-01    0.726140
2013-01-02   -0.875690
2013-01-03   -0.562712
2013-01-04    0.470829
2013-01-05   -0.988784
2013-01-06    1.297132
Freq: D, Name: A, dtype: float64

In [109]:
df.A

2013-01-01    0.726140
2013-01-02   -0.875690
2013-01-03   -0.562712
2013-01-04    0.470829
2013-01-05   -0.988784
2013-01-06    1.297132
Freq: D, Name: A, dtype: float64

In [110]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,0.72614,0.685097,1.553664,0.577291
2013-01-02,-0.87569,-2.487474,-1.839067,0.184432
2013-01-03,-0.562712,-1.167613,-0.379558,0.585288


In [111]:
print(dates[0])
df.loc[dates[0]]

2013-01-01 00:00:00


A    0.726140
B    0.685097
C    1.553664
D    0.577291
Name: 2013-01-01 00:00:00, dtype: float64

In [112]:
df.loc[dates[0]:dates[1], ['A', 'B']]

Unnamed: 0,A,B
2013-01-01,0.72614,0.685097
2013-01-02,-0.87569,-2.487474


In [113]:
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2013-01-01,0.72614,0.685097
2013-01-02,-0.87569,-2.487474
2013-01-03,-0.562712,-1.167613
2013-01-04,0.470829,-0.889744
2013-01-05,-0.988784,-2.245261
2013-01-06,1.297132,-0.035248


In [114]:
df.loc['20130102':'20130104', ['A', 'B']]

Unnamed: 0,A,B
2013-01-02,-0.87569,-2.487474
2013-01-03,-0.562712,-1.167613
2013-01-04,0.470829,-0.889744


In [115]:
df.loc['20130101', ['A','B']]

A    0.726140
B    0.685097
Name: 2013-01-01 00:00:00, dtype: float64

In [116]:
df.loc['20130101', 'A']

0.7261400353001666

In [117]:
df.at[dates[0], 'A']

0.7261400353001666

In [118]:
df

Unnamed: 0,A,B,C,D
2013-01-01,0.72614,0.685097,1.553664,0.577291
2013-01-02,-0.87569,-2.487474,-1.839067,0.184432
2013-01-03,-0.562712,-1.167613,-0.379558,0.585288
2013-01-04,0.470829,-0.889744,-0.363516,0.565676
2013-01-05,-0.988784,-2.245261,-1.259012,1.224473
2013-01-06,1.297132,-0.035248,0.096089,-0.212984


In [123]:
df.iloc[:,1:3]

Unnamed: 0,B,C
2013-01-01,0.685097,1.553664
2013-01-02,-2.487474,-1.839067
2013-01-03,-1.167613,-0.379558
2013-01-04,-0.889744,-0.363516
2013-01-05,-2.245261,-1.259012
2013-01-06,-0.035248,0.096089

Note: `df` was regenerated with new random values before the following cells were run, so the numbers below differ from the tables above. They match the `df` printed at the start of the Boolean Indexing section.


In [138]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2013-01-04,-0.471498,0.255575
2013-01-05,-1.482652,-1.11222


In [139]:
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2013-01-02,-0.841441,-0.450379
2013-01-03,-1.777894,-0.27224
2013-01-05,-1.482652,-1.745098


In [140]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847


In [141]:
df.iloc[:, 1:3]

Unnamed: 0,B,C
2013-01-01,0.268292,-1.439403
2013-01-02,-0.021507,-0.450379
2013-01-03,-1.099365,-0.27224
2013-01-04,0.255575,-0.151032
2013-01-05,-1.11222,-1.745098
2013-01-06,0.148531,-0.932294


In [142]:
df.iloc[1, 1]

-0.021506995539202333

In [126]:
df.iat[1, 1]

-0.021506995539202333

### Summary
- 'loc', 'iloc', 'at' and 'iat' all select values from a DataFrame, but they differ in how you address the data
- 'loc' and 'at' are used when __the exact name__ (label) of the index and/or column is provided, whereas 'iloc' and 'iat' are used when the integer position is provided
- 'loc' and 'iloc' can return multiple rows/columns, while 'at' and 'iat' only give a single value at a time
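
The summary above can be checked on a small table. This is a sketch with made-up integer values so that the selected cells are easy to verify by eye.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3)
small = pd.DataFrame(np.arange(12).reshape(3, 4), index=dates, columns=list('ABCD'))

# Label-based and position-based access to the same cell (second row, column B).
by_label = small.loc['2013-01-02', 'B']
by_position = small.iloc[1, 1]
print(by_label, by_position)

# at / iat return exactly one value and accept no slices or lists.
print(small.at[dates[1], 'B'], small.iat[1, 1])

# Label slices include the end point; positional slices exclude it.
print(len(small.loc[dates[0]:dates[1]]))
print(len(small.iloc[0:1]))
```

The last two lines show a difference that is easy to miss: a label slice with `loc` includes its end label, while a positional slice with `iloc` follows the usual Python rule and stops before the end.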

# [Exercise 14](EX14-Pandas-1.ipynb)

# Boolean Indexing

In [144]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.481804,0.268292,-1.439403,1.262896
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847
2013-01-04,-0.471498,0.255575,-0.151032,-1.247897
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985
2013-01-06,0.071856,0.148531,-0.932294,-0.627366


In [145]:
df[df.A > 0]

Unnamed: 0,A,B,C,D
2013-01-06,0.071856,0.148531,-0.932294,-0.627366


In [146]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,0.268292,,1.262896
2013-01-02,,,,0.72462
2013-01-03,,,,0.546847
2013-01-04,,0.255575,,
2013-01-05,,,,
2013-01-06,0.071856,0.148531,,


In [147]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.481804,0.268292,-1.439403,1.262896
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847
2013-01-04,-0.471498,0.255575,-0.151032,-1.247897
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985
2013-01-06,0.071856,0.148531,-0.932294,-0.627366


## Add a column

In [148]:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.481804,0.268292,-1.439403,1.262896,one
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462,one
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847,two
2013-01-04,-0.471498,0.255575,-0.151032,-1.247897,three
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985,four
2013-01-06,0.071856,0.148531,-0.932294,-0.627366,three


In [149]:
df2['E'].isin(['two', 'four'])

2013-01-01    False
2013-01-02    False
2013-01-03     True
2013-01-04    False
2013-01-05     True
2013-01-06    False
Freq: D, Name: E, dtype: bool

In [150]:
df2[df2['E'].isin(['two', 'four'])]

Unnamed: 0,A,B,C,D,E
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847,two
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985,four


In [151]:
df2.iloc[1,1]

-0.021506995539202333

# Drop a column or a row

In [152]:
df2.drop(columns=['E','D'])

Unnamed: 0,A,B,C
2013-01-01,-0.481804,0.268292,-1.439403
2013-01-02,-0.841441,-0.021507,-0.450379
2013-01-03,-1.777894,-1.099365,-0.27224
2013-01-04,-0.471498,0.255575,-0.151032
2013-01-05,-1.482652,-1.11222,-1.745098
2013-01-06,0.071856,0.148531,-0.932294


In [153]:
# Were the columns really removed from df2? No: drop returns a new DataFrame
df2

Unnamed: 0,A,B,C,D,E
2013-01-01,-0.481804,0.268292,-1.439403,1.262896,one
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462,one
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847,two
2013-01-04,-0.471498,0.255575,-0.151032,-1.247897,three
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985,four
2013-01-06,0.071856,0.148531,-0.932294,-0.627366,three
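
As the output above shows, `drop` leaves the original untouched. Here is a minimal sketch of dropping columns and rows and of keeping the result; the table `frame` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical table, used only for illustration.
frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
                     index=['x', 'y', 'z'])

# drop returns a new DataFrame; frame itself is unchanged.
without_c = frame.drop(columns=['C'])
without_y = frame.drop(index=['y'])   # rows are dropped by index label

print(list(frame.columns))
print(list(without_c.columns))
print(list(without_y.index))

# To keep the change, assign the result back to the same name.
frame = frame.drop(columns=['C'])
print(list(frame.columns))
```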


# Setting

In [154]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102',periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [155]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.481804,0.268292,-1.439403,1.262896
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847
2013-01-04,-0.471498,0.255575,-0.151032,-1.247897
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985
2013-01-06,0.071856,0.148531,-0.932294,-0.627366


In [156]:
df['F'] = s1

In [157]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,-0.481804,0.268292,-1.439403,1.262896,
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847,2.0
2013-01-04,-0.471498,0.255575,-0.151032,-1.247897,3.0
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985,4.0
2013-01-06,0.071856,0.148531,-0.932294,-0.627366,5.0
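
The new column `F` was filled by matching index labels, not by position: 2013-01-01 has no entry in `s1`, and the value for 2013-01-07 has no row in `df`. A small sketch of the same behavior with made-up values:

```python
import pandas as pd

dates = pd.date_range('20130101', periods=3)
frame = pd.DataFrame({'A': [10, 20, 30]}, index=dates)

# This Series starts one day later than the DataFrame index.
shifted = pd.Series([1, 2, 3], index=pd.date_range('20130102', periods=3))

# Assignment aligns on the index labels.
frame['F'] = shifted
print(frame)
```

The first row gets NaN, the value for 2013-01-04 is silently discarded, and the column becomes float because NaN cannot be stored in an integer column.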


In [158]:
df.at[dates[0], 'A'] = 0
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.268292,-1.439403,1.262896,
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847,2.0
2013-01-04,-0.471498,0.255575,-0.151032,-1.247897,3.0
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985,4.0
2013-01-06,0.071856,0.148531,-0.932294,-0.627366,5.0


In [159]:
df.iat[0, 1] = 0
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-1.439403,1.262896,
2013-01-02,-0.841441,-0.021507,-0.450379,0.72462,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,0.546847,2.0
2013-01-04,-0.471498,0.255575,-0.151032,-1.247897,3.0
2013-01-05,-1.482652,-1.11222,-1.745098,-0.318985,4.0
2013-01-06,0.071856,0.148531,-0.932294,-0.627366,5.0


In [160]:
df.loc[:, 'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-1.439403,5,
2013-01-02,-0.841441,-0.021507,-0.450379,5,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,5,2.0
2013-01-04,-0.471498,0.255575,-0.151032,5,3.0
2013-01-05,-1.482652,-1.11222,-1.745098,5,4.0
2013-01-06,0.071856,0.148531,-0.932294,5,5.0


In [161]:
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-1.439403,-5,
2013-01-02,-0.841441,-0.021507,-0.450379,-5,-1.0
2013-01-03,-1.777894,-1.099365,-0.27224,-5,-2.0
2013-01-04,-0.471498,-0.255575,-0.151032,-5,-3.0
2013-01-05,-1.482652,-1.11222,-1.745098,-5,-4.0
2013-01-06,-0.071856,-0.148531,-0.932294,-5,-5.0


pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations such as `mean()` and `sum()`.

In [162]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-1.439403,5,
2013-01-02,-0.841441,-0.021507,-0.450379,5,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,5,2.0
2013-01-04,-0.471498,0.255575,-0.151032,5,3.0
2013-01-05,-1.482652,-1.11222,-1.745098,5,4.0
2013-01-06,0.071856,0.148531,-0.932294,5,5.0


In [163]:
df1 = df.reindex(index=dates[0:4])

df1

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-1.439403,5,
2013-01-02,-0.841441,-0.021507,-0.450379,5,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,5,2.0
2013-01-04,-0.471498,0.255575,-0.151032,5,3.0


In [164]:
df1.dropna(how='any')

Unnamed: 0,A,B,C,D,F
2013-01-02,-0.841441,-0.021507,-0.450379,5,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,5,2.0
2013-01-04,-0.471498,0.255575,-0.151032,5,3.0


In [165]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-1.439403,5,5.0
2013-01-02,-0.841441,-0.021507,-0.450379,5,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,5,2.0
2013-01-04,-0.471498,0.255575,-0.151032,5,3.0


In [166]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F
2013-01-01,False,False,False,False,True
2013-01-02,False,False,False,False,False
2013-01-03,False,False,False,False,False
2013-01-04,False,False,False,False,False
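
To see what "not included in computations" means in practice, here is a short sketch on a three-element Series with made-up values.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# Default: the NaN is skipped, so the mean is (1 + 3) / 2.
print(values.mean())

# With skipna=False the NaN propagates into the result.
print(values.mean(skipna=False))

# count() reports only the non-missing entries.
print(values.count())

# Filling first changes the answer: (1 + 0 + 3) / 3.
print(values.fillna(0).mean())
```

Whether to drop, fill, or keep missing values depends on the analysis, so it is worth checking `count()` before trusting an average.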


# Operation

In [167]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-1.439403,5,
2013-01-02,-0.841441,-0.021507,-0.450379,5,1.0
2013-01-03,-1.777894,-1.099365,-0.27224,5,2.0
2013-01-04,-0.471498,0.255575,-0.151032,5,3.0
2013-01-05,-1.482652,-1.11222,-1.745098,5,4.0
2013-01-06,0.071856,0.148531,-0.932294,5,5.0


In [168]:
df.mean()

A   -0.750271
B   -0.304831
C   -0.831741
D    5.000000
F    3.000000
dtype: float64

In [169]:
df.apply(lambda x: x.max() - x.min())

A    1.849750
B    1.367795
C    1.594066
D    0.000000
F    4.000000
dtype: float64

In [170]:
s = pd.Series(np.random.randint(0, 21, size=100))
s

0     12
1     11
2      8
3     18
4      6
      ..
95     5
96    20
97    19
98    10
99     1
Length: 100, dtype: int32

In [171]:
s.value_counts()

1     9
9     8
8     7
14    7
0     6
6     6
19    5
4     5
10    5
13    4
7     4
17    4
3     4
2     4
18    4
11    3
12    3
15    3
16    3
5     3
20    3
dtype: int64

In [172]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s

0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object

In [173]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

# [Exercise 15](EX15-Pandas-2.ipynb)

# Merge

In [174]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,-0.279958,-0.896732,-1.379143,2.197539
1,-1.267632,-1.307763,0.586267,-0.150677
2,-0.225223,0.301101,0.286612,-1.888344
3,-1.736239,-0.893966,-2.580451,-0.838404
4,-0.653415,-0.453655,0.916241,-0.273899
5,-0.538474,0.626896,0.395905,0.534472
6,0.483607,-0.708502,1.001696,-0.259525
7,0.615964,-0.576825,-1.270123,-0.642015


In [175]:
s = df.iloc[3]
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([df, s.to_frame().T], ignore_index=True)

Unnamed: 0,A,B,C,D
0,-0.279958,-0.896732,-1.379143,2.197539
1,-1.267632,-1.307763,0.586267,-0.150677
2,-0.225223,0.301101,0.286612,-1.888344
3,-1.736239,-0.893966,-2.580451,-0.838404
4,-0.653415,-0.453655,0.916241,-0.273899
5,-0.538474,0.626896,0.395905,0.534472
6,0.483607,-0.708502,1.001696,-0.259525
7,0.615964,-0.576825,-1.270123,-0.642015
8,-1.736239,-0.893966,-2.580451,-0.838404


In [176]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
      'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
      'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})

In [177]:
df

Unnamed: 0,A,B,C,D
0,foo,one,0.219533,-0.338711
1,bar,one,1.791045,0.196771
2,foo,two,0.540678,0.207881
3,bar,three,-0.032123,1.708158
4,foo,two,-0.09986,-0.649935
5,bar,two,0.200163,1.742948
6,foo,one,0.147994,-0.966823
7,foo,three,-0.380269,-0.19816


In [178]:
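# select the numeric columns explicitly; recent pandas versions would otherwise also try to sum the strings in B
df.groupby('A')[['C', 'D']].sum()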

A,C,D
bar,1.959085,3.647876
foo,0.428076,-1.945748
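
Grouping follows a split, apply, combine pattern: rows are split by key, a function is applied to each group, and the results are combined into one table. A minimal sketch with made-up numbers that are easy to verify by hand:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
frame = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'two'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

# One key: one result row per distinct value of A.
sums = frame.groupby('A')['C'].sum()
print(sums)

# Two keys: one result row per (A, B) combination.
by_two = frame.groupby(['A', 'B'])['C'].sum()
print(by_two)
```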


# IO

In [179]:
df

Unnamed: 0,A,B,C,D
0,foo,one,0.219533,-0.338711
1,bar,one,1.791045,0.196771
2,foo,two,0.540678,0.207881
3,bar,three,-0.032123,1.708158
4,foo,two,-0.09986,-0.649935
5,bar,two,0.200163,1.742948
6,foo,one,0.147994,-0.966823
7,foo,three,-0.380269,-0.19816


In [180]:
df.to_csv('foo.csv')

In [181]:
pd.read_csv('foo.csv')

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,0,foo,one,0.219533,-0.338711
1,1,bar,one,1.791045,0.196771
2,2,foo,two,0.540678,0.207881
3,3,bar,three,-0.032123,1.708158
4,4,foo,two,-0.09986,-0.649935
5,5,bar,two,0.200163,1.742948
6,6,foo,one,0.147994,-0.966823
7,7,foo,three,-0.380269,-0.19816
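
The extra `Unnamed: 0` column above is the row index that `to_csv` wrote to the file. Below is a sketch of two common ways to avoid it, using an in-memory buffer (`io.StringIO`) in place of `foo.csv` so that the example leaves no file behind; the small table is made up.

```python
import io
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'bar'], 'C': [0.5, 1.5]})

# Option 1: do not write the index at all.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
no_index = pd.read_csv(buffer)
print(list(no_index.columns))

# Option 2: write the index, then tell read_csv that the first column is the index.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)
print(list(with_index.columns))
```

Option 1 suits a default 0, 1, 2, ... index that carries no information; Option 2 is the better fit when the index holds meaningful labels such as dates.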


In [182]:
df

Unnamed: 0,A,B,C,D
0,foo,one,0.219533,-0.338711
1,bar,one,1.791045,0.196771
2,foo,two,0.540678,0.207881
3,bar,three,-0.032123,1.708158
4,foo,two,-0.09986,-0.649935
5,bar,two,0.200163,1.742948
6,foo,one,0.147994,-0.966823
7,foo,three,-0.380269,-0.19816


In [183]:
np.asarray(df)

array([['foo', 'one', 0.21953273286582506, -0.33871059850582835],
       ['bar', 'one', 1.791044615853483, 0.1967705264198109],
       ['foo', 'two', 0.5406776945702085, 0.20788107266560396],
       ['bar', 'three', -0.03212257562960204, 1.708157643149783],
       ['foo', 'two', -0.09986016245392518, -0.6499352040447881],
       ['bar', 'two', 0.20016298557732096, 1.7429479117664142],
       ['foo', 'one', 0.14799426003894506, -0.9668233807158495],
       ['foo', 'three', -0.3802689567948069, -0.1981598640055589]],
      dtype=object)

In [184]:
df = pd.DataFrame({
    'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

# Do math on data

Note: the outputs in this section were captured after its cells had already been run once, so the computed `three` column and the `flag` column already appear in the first display of `df`.

In [208]:
df

Unnamed: 0,one,two,three,flag
a,-0.763326,-0.267947,0.204531,False
b,-0.753912,-1.013104,0.763791,False
c,0.347577,-0.867326,-0.301462,False
d,,0.221087,,False


In [209]:
df.rename(columns={'one': 'foo', 'two': 'bar'},
          index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

Unnamed: 0,foo,bar,three,flag
apple,-0.763326,-0.267947,0.204531,False
banana,-0.753912,-1.013104,0.763791,False
c,0.347577,-0.867326,-0.301462,False
durian,,0.221087,,False


In [210]:
df

Unnamed: 0,one,two,three,flag
a,-0.763326,-0.267947,0.204531,False
b,-0.753912,-1.013104,0.763791,False
c,0.347577,-0.867326,-0.301462,False
d,,0.221087,,False


In [211]:
df['three'] = df['one'] * df['two']
df

Unnamed: 0,one,two,three,flag
a,-0.763326,-0.267947,0.204531,False
b,-0.753912,-1.013104,0.763791,False
c,0.347577,-0.867326,-0.301462,False
d,,0.221087,,False


In [212]:
df['flag'] = df['one'] > 2

In [213]:
df

Unnamed: 0,one,two,three,flag
a,-0.763326,-0.267947,0.204531,False
b,-0.753912,-1.013104,0.763791,False
c,0.347577,-0.867326,-0.301462,False
d,,0.221087,,False


# Can't remember it all? Don't worry. Print this pandas cheat sheet.
http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf