# Pandas

https://pandas.pydata.org

Pandas is Python's counterpart to Excel: a library for working with tabular data.

* Intro
* Series
* Data Frame
* Handling of Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations, Indexing, Slicing of dataframes
* Data input and output
    - Creating dataframes by:
        - Reading files (csv, excel etc.)
        - Reading Databases
        - Reading HTML

## Series


The first main data type in pandas is the Series. A Series is like a NumPy array, with two differences:

* A Series can hold values of any Python type, not only numbers (non-numeric or mixed data gets the `object` dtype)
* A Series can be indexed by a label and not just by a numeric position

In [1]:
import numpy as np
import pandas as pd

### Creating a series

A Series can be created from:
* A list
* A NumPy array
* A dictionary

In [2]:
label = ['a','b','c']
my_list = [1,2,3]
arr = np.array([10,20,30])
d = {'a':100,'b':200,'c':300}

In [3]:
# Creating series using list
pd.Series(data=my_list)

0    1
1    2
2    3
dtype: int64

In [4]:
pd.Series(data=my_list,index=label)

a    1
b    2
c    3
dtype: int64

In [5]:
pd.Series(d)

a    100
b    200
c    300
dtype: int64

### Data in series

In [6]:
pd.Series(data=label)

0    a
1    b
2    c
dtype: object

In [7]:
pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

## Using an Index

The key to using a pandas Series is understanding its index. Much like a hash table or dictionary,
a Series lets you look up values by name (label) rather than only by position.
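
The dictionary analogy can be made concrete with a small sketch. The Series `population` and its values are made up for illustration; `in`, `.get`, and `.index` are standard Series features.

```python
import pandas as pd

# Hypothetical data: the labels act like dictionary keys
population = pd.Series([331, 83, 126], index=['USA', 'Germany', 'Japan'])

# Membership tests check the index labels, not the values
has_usa = 'USA' in population
has_italy = 'Italy' in population

# .get returns a default instead of raising KeyError for a missing label
italy = population.get('Italy', default=0)

# The labels themselves are available through .index
labels = list(population.index)

print(has_usa, has_italy, italy)  # True False 0
print(labels)                     # ['USA', 'Germany', 'Japan']
```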

In [8]:
ser1 = pd.Series([1,2,3,4],index='USA Germany USSR Japan'.split())
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [9]:
ser2 = pd.Series([1,2,5,4],index='USA Germany Italy Japan'.split())
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [10]:
ser1['USA']

1

### Operations on Series

In [11]:
ser1+ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
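
The result above follows from index alignment: pandas matches values by label, and a label present in only one Series yields `NaN`, which also turns the result into `float64`. As a sketch of one way to avoid the `NaN` values, `Series.add` accepts a `fill_value` that stands in for the missing side. The names `left` and `right` are illustrative.

```python
import pandas as pd

left = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
right = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Plain addition: labels found in only one Series become NaN
plain = left + right

# add() with fill_value=0 treats a missing label as 0 instead
filled = left.add(right, fill_value=0)

print(plain.isna().sum())   # 2
print(filled['Italy'])      # 5.0
print(filled['USSR'])       # 3.0
```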

# Dataframes

DataFrames are the workhorse of pandas, clearly inspired by R. Spark has DataFrames as well.

A DataFrame is essentially a set of Series that share the same index, placed side by side as columns.
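
To illustrate that description, here is a minimal sketch that builds a DataFrame from two Series and pulls one column back out. The names `height`, `weight`, and `people` and their values are made up for the example.

```python
import pandas as pd

# Two hypothetical Series sharing the same index labels
height = pd.Series([1.70, 1.82, 1.65], index=['a', 'b', 'c'])
weight = pd.Series([65, 80, 58], index=['a', 'b', 'c'])

# Each dictionary key becomes a column name
people = pd.DataFrame({'height': height, 'weight': weight})

# Pulling a column back out gives a Series with the shared index
col = people['height']

print(type(col).__name__)   # Series
print(list(col.index))      # ['a', 'b', 'c']
```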

In [12]:
import pandas as pd
import numpy as np

In [13]:
df = pd.DataFrame(np.random.randn(5,4),index = 'A B C D E'.split())

In [14]:
df

|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| A | -1.145168 | -0.600787 | 0.159918 | -0.796332 |
| B | -0.314291 | -0.432884 | -0.330731 | -0.811644 |
| C | 0.945742 | -2.29798 | -1.33624 | -0.219459 |
| D | -0.0653 | -0.964928 | 0.639702 | 0.225961 |
| E | 0.738099 | -0.012933 | -0.00138 | -0.424736 |


### Selection and Indexing in Dataframe

In [15]:
df = pd.DataFrame(np.random.randn(5,4),index = 'A B C D E'.split(),columns='sum X Y Z'.split())

In [16]:
df

|  | sum | X | Y | Z |
|---|---|---|---|---|
| A | -0.934549 | -1.017813 | -0.528777 | 1.004001 |
| B | -1.476136 | 0.330583 | -0.752967 | 1.390423 |
| C | -0.575462 | 0.694732 | 1.419126 | -0.559564 |
| D | -0.337184 | -0.041436 | 1.691055 | -0.409294 |
| E | 1.070901 | -1.141137 | -0.863777 | 1.618827 |


In [17]:
df['sum']

A   -0.934549
B   -1.476136
C   -0.575462
D   -0.337184
E    1.070901
Name: sum, dtype: float64

In [18]:
df.sum  # Not recommended: attribute access returns the DataFrame.sum method here, not the 'sum' column

<bound method DataFrame.sum of         sum         X         Y         Z
A -0.934549 -1.017813 -0.528777  1.004001
B -1.476136  0.330583 -0.752967  1.390423
C -0.575462  0.694732  1.419126 -0.559564
D -0.337184 -0.041436  1.691055 -0.409294
E  1.070901 -1.141137 -0.863777  1.618827>
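
Attribute access goes wrong here because `sum` is already the name of a DataFrame method, so `df.sum` returns that method rather than the column. A small sketch of the difference, using a made-up frame named `scores`:

```python
import pandas as pd

scores = pd.DataFrame({'sum': [1, 2, 3], 'X': [10, 20, 30]})

# Bracket access always returns the column
col = scores['sum']

# Attribute access works only when the name does not clash with
# an existing DataFrame attribute or method
x_col = scores.X        # the column 'X'
clash = scores.sum      # the DataFrame.sum method, not the column

print(col.tolist())     # [1, 2, 3]
print(callable(clash))  # True
```

Bracket access is the safer habit, since it behaves the same for every column name.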

In [19]:
df = pd.DataFrame(np.random.randn(5,4),index = 'A B C D E'.split(),columns='W X Y Z'.split())

In [20]:
type(df['W'])

pandas.core.series.Series

In [21]:
df

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 |
| E | -1.383383 | -0.734641 | 0.498523 | -1.553762 |


### Create new columns in a dataframe

In [22]:
df['new'] = df['W'] + df['X']

In [23]:
df

|  | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 | -0.288753 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 | 0.095568 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 | 2.179461 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 | -1.906066 |
| E | -1.383383 | -0.734641 | 0.498523 | -1.553762 | -2.118024 |


### Remove columns from dataframe

In [24]:
df.drop('new',axis=1)

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 |
| E | -1.383383 | -0.734641 | 0.498523 | -1.553762 |


In [25]:
df

|  | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 | -0.288753 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 | 0.095568 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 | 2.179461 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 | -1.906066 |
| E | -1.383383 | -0.734641 | 0.498523 | -1.553762 | -2.118024 |


In [26]:
df.drop('new',axis=1,inplace=True)

In [27]:
df

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 |
| E | -1.383383 | -0.734641 | 0.498523 | -1.553762 |


In [28]:
df['new'] = df['W'] + df['X']

In [29]:
df

|  | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 | -0.288753 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 | 0.095568 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 | 2.179461 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 | -1.906066 |
| E | -1.383383 | -0.734641 | 0.498523 | -1.553762 | -2.118024 |


In [30]:
df = df.drop('new',axis=1)

In [31]:
df

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 |
| E | -1.383383 | -0.734641 | 0.498523 | -1.553762 |


In [32]:
df.drop('E')

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 |
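
The cells in this section show that `drop` returns a new DataFrame unless `inplace=True` is passed or the result is assigned back. As a sketch, the `columns=` and `index=` keywords express the same operations without an `axis` argument. The frame named `frame` is made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [4, 6]},
                     index=['A', 'B'])

# Equivalent to frame.drop('new', axis=1); frame itself is unchanged
no_new = frame.drop(columns='new')

# Equivalent to frame.drop('B'), since axis=0 (rows) is the default
no_b = frame.drop(index='B')

print(list(frame.columns))   # ['W', 'X', 'new']
print(list(no_new.columns))  # ['W', 'X']
print(list(no_b.index))      # ['A']
```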


### Selecting Rows

In [33]:
df.loc['A'] # If we need to select by index name then use 'loc'

W    0.507074
X   -0.795827
Y    0.153282
Z    1.968588
Name: A, dtype: float64

In [34]:
df.iloc[0] # If we need to select by number then use 'iloc'

W    0.507074
X   -0.795827
Y    0.153282
Z    1.968588
Name: A, dtype: float64
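
One difference worth noting between the two accessors is how they slice: `loc` includes the end label, while `iloc` follows the usual Python convention and excludes the end position. A short sketch with a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'W': [10, 20, 30, 40]}, index=['A', 'B', 'C', 'D'])

# Label-based slice: both 'A' and 'C' are included
by_label = frame.loc['A':'C']

# Position-based slice: position 2 is excluded
by_position = frame.iloc[0:2]

print(list(by_label.index))     # ['A', 'B', 'C']
print(list(by_position.index))  # ['A', 'B']
```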

### Slicing dataframe

In [35]:
df[['W','X']]

|  | W | X |
|---|---|---|
| A | 0.507074 | -0.795827 |
| B | 0.016054 | 0.079514 |
| C | 1.081331 | 1.09813 |
| D | -0.697083 | -1.208983 |
| E | -1.383383 | -0.734641 |


In [36]:
df.loc['B','X']

0.07951424801703172

In [37]:
df

|  | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.507074 | -0.795827 | 0.153282 | 1.968588 |
| B | 0.016054 | 0.079514 | -0.612943 | -0.726977 |
| C | 1.081331 | 1.09813 | 0.007463 | 0.331976 |
| D | -0.697083 | -1.208983 | 0.86804 | -1.294925 |
| E | -1.383383 | -0.734641 | 0.498523 | -1.553762 |


In [38]:
df.loc[['A','B'],['W','Y']]

|  | W | Y |
|---|---|---|
| A | 0.507074 | 0.153282 |
| B | 0.016054 | -0.612943 |


## Conditional Selection
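
A minimal sketch of conditional selection, assuming the usual boolean-mask approach (the frame and thresholds are made up): a comparison on a column yields a Series of booleans, and indexing with that Series keeps only the rows where it is `True`. Conditions are combined with `&` and `|`, each wrapped in parentheses.

```python
import pandas as pd

frame = pd.DataFrame({'W': [0.5, -0.7, 1.1], 'X': [-0.8, 0.1, 1.0]},
                     index=['A', 'B', 'C'])

# Boolean Series: True where column W is positive
mask = frame['W'] > 0

# Keep only the rows where the mask is True
positive_w = frame[mask]

# Combine conditions with & (and) or | (or); parentheses are required
both_positive = frame[(frame['W'] > 0) & (frame['X'] > 0)]

print(list(positive_w.index))     # ['A', 'C']
print(list(both_positive.index))  # ['C']
```
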
## Multi Index and index hierarchy

In [39]:
outer_index = 'Europe Europe Europe Asia Asia Asia'.split()
inner_index = 'Greenland Norway Belgium India Japan SouthKorea'.split()

In [41]:
hier_index = list(zip(outer_index,inner_index))
hier_index = pd.MultiIndex.from_tuples(hier_index)


In [42]:
hier_index

MultiIndex([('Europe',  'Greenland'),
            ('Europe',     'Norway'),
            ('Europe',    'Belgium'),
            (  'Asia',      'India'),
            (  'Asia',      'Japan'),
            (  'Asia', 'SouthKorea')],
           )

In [43]:
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns='A B'.split())

In [44]:
df

|  |  | A | B |
|---|---|---|---|
| Europe | Greenland | 1.748782 | -0.650737 |
| Europe | Norway | -0.148234 | -1.371866 |
| Europe | Belgium | 0.674133 | -1.280629 |
| Asia | India | 0.734143 | -1.17612 |
| Asia | Japan | -0.606411 | 1.055696 |
| Asia | SouthKorea | 1.198897 | 1.786396 |


In [45]:
df.loc['Europe']

|  | A | B |
|---|---|---|
| Greenland | 1.748782 | -0.650737 |
| Norway | -0.148234 | -1.371866 |
| Belgium | 0.674133 | -1.280629 |


In [46]:
df.loc['Europe'].loc['Greenland']

A    1.748782
B   -0.650737
Name: Greenland, dtype: float64
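
Chained `.loc` calls work, but a single `.loc` with a tuple reaches the same row in one step, and `xs` selects on the inner level. A sketch with a smaller made-up index; the level names `Continent` and `Country` are assumptions added for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Asia', 'India'), ('Europe', 'Belgium'), ('Europe', 'Norway')],
    names=['Continent', 'Country'],  # hypothetical level names
)
frame = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=index)

# One lookup with a tuple instead of two chained .loc calls
norway = frame.loc[('Europe', 'Norway')]

# Cross-section on the inner level; the matched level is dropped
india = frame.xs('India', level='Country')

print(norway['A'])         # 3
print(list(india.index))   # ['Asia']
```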

In [47]:
hier_index = list(zip(outer_index,inner_index))

In [48]:
hier_index

[('Europe', 'Greenland'),
 ('Europe', 'Norway'),
 ('Europe', 'Belgium'),
 ('Asia', 'India'),
 ('Asia', 'Japan'),
 ('Asia', 'SouthKorea')]

# Missing Data

In [50]:
dic = {'A':[1,2,np.nan],
        'B':[5,np.nan,np.nan],
        'C':[10,20,30]}

df = pd.DataFrame(dic)

In [51]:
df

|  | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 10 |
| 1 | 2.0 | NaN | 20 |
| 2 | NaN | NaN | 30 |


### Dropping missing values

In [52]:
df.dropna()

|  | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 10 |


In [53]:
df

|  | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 10 |
| 1 | 2.0 | NaN | 20 |
| 2 | NaN | NaN | 30 |


In [54]:
df.dropna(axis=1)

|  | C |
|---|---|
| 0 | 10 |
| 1 | 20 |
| 2 | 30 |


In [55]:
df.dropna(thresh=2)

|  | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 10 |
| 1 | 2.0 | NaN | 20 |
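
The `thresh` argument sets the minimum number of non-missing values a row must have in order to be kept, which is why row 2 (only one valid value) is dropped. A sketch with the same data, adding `how='all'` and `subset` for comparison:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [5, np.nan, np.nan],
                      'C': [10, 20, 30]})

# Keep rows with at least 2 non-missing values
at_least_two = frame.dropna(thresh=2)

# Drop a row only if every value in it is missing (none here)
all_missing = frame.dropna(how='all')

# Consider only column A when deciding which rows to drop
valid_a = frame.dropna(subset=['A'])

print(list(at_least_two.index))  # [0, 1]
print(list(all_missing.index))   # [0, 1, 2]
print(list(valid_a.index))       # [0, 1]
```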


### Imputation

In [56]:
df.fillna('FILL NULL VALUES')

|  | A | B | C |
|---|---|---|---|
| 0 | 1 | 5 | 10 |
| 1 | 2 | FILL NULL VALUES | 20 |
| 2 | FILL NULL VALUES | FILL NULL VALUES | 30 |


In [58]:
df.fillna(value=df['A'].mean())

|  | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 10 |
| 1 | 2.0 | 1.5 | 20 |
| 2 | 1.5 | 1.5 | 30 |
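
Note that the call above fills the gaps in column `B` with the mean of column `A`. If each column should be filled with its own mean instead, one option (a sketch, not the only approach) is to pass the Series returned by `mean()`, which `fillna` matches to columns by name:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, np.nan],
                      'B': [5, np.nan, np.nan],
                      'C': [10, 20, 30]})

# frame.mean() holds one mean per column (missing values are skipped);
# fillna applies each mean to the column with the matching name
filled = frame.fillna(frame.mean())

print(filled['A'].tolist())  # [1.0, 2.0, 1.5]
print(filled['B'].tolist())  # [5.0, 5.0, 5.0]
```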


# Groupby

In [59]:
data = {'Company':['GOOG','MSFT','ORCL','AMEX','FB','FB','AMEX','ORCL','GOOG'],
        'Person':['Anusha','Shashank','Santosh','Prakash','Harish','Vipul','Joydeep','Chinmay','Fahad'],
        'Sales':[100,300,232,542,6534,1223,7653,9230,728]}

In [60]:
data

{'Company': ['GOOG',
  'MSFT',
  'ORCL',
  'AMEX',
  'FB',
  'FB',
  'AMEX',
  'ORCL',
  'GOOG'],
 'Person': ['Anusha',
  'Shashank',
  'Santosh',
  'Prakash',
  'Harish',
  'Vipul',
  'Joydeep',
  'Chinmay',
  'Fahad'],
 'Sales': [100, 300, 232, 542, 6534, 1223, 7653, 9230, 728]}

In [61]:
df = pd.DataFrame(data)

In [62]:
byCompany = df.groupby('Company')

In [63]:
# Restrict to the numeric column; 'Person' holds strings
byCompany['Sales'].std()
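
A minimal sketch of the groupby workflow: build a DataFrame, group its rows by a key column, then aggregate a numeric column per group. The frame `sales` and its figures are made up for illustration and are not the data from the cells above.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'Company': ['GOOG', 'MSFT', 'GOOG', 'MSFT'],
    'Person': ['P1', 'P2', 'P3', 'P4'],
    'Sales': [100, 300, 700, 500],
})

# Group rows by company, then select the numeric Sales column
by_company = sales.groupby('Company')['Sales']

totals = by_company.sum()
averages = by_company.mean()

print(totals['GOOG'])     # 800
print(averages['MSFT'])   # 400.0
```

Selecting the `Sales` column before aggregating keeps text columns such as `Person` out of numeric aggregations like `mean` and `std`.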

In [66]:
data = {'A':['foo','foo','foo','bar','bar','bar'],
     'B':['one','one','two','two','one','one'],
       'C':['x','y','x','y','x','y'],
       'D':[1,3,2,5,4,1]}


In [67]:
df = pd.DataFrame(data)

In [68]:
df

|  | A | B | C | D |
|---|---|---|---|---|
| 0 | foo | one | x | 1 |
| 1 | foo | one | y | 3 |
| 2 | foo | two | x | 2 |
| 3 | bar | two | y | 5 |
| 4 | bar | one | x | 4 |
| 5 | bar | one | y | 1 |


In [69]:
import os

In [70]:
os.getcwd()

'/Users/kavyasree/Downloads'

In [72]:
ls /Users/kavyasree/Downloads/Data

911.csv*              df1*                  dumpdata.xlsx*
Ecommerce Purchases*  df2*                  example.csv*
Excel_Sample.xlsx*    df3*
Salaries.csv*         dump.csv*


In [73]:
ls /Data

ls: /Data: No such file or directory


In [77]:
df = pd.read_csv('/Users/kavyasree/Downloads/Data/example.csv', index_col=0)

In [78]:
df

|  | a | b | c |
|---|---|---|---|
| 0 | 0 | 4 | 8 |
| 1 | 1 | 5 | 9 |
| 2 | 2 | 6 | 10 |
| 3 | 3 | 7 | 11 |
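
The `index_col=0` argument matters when a CSV was written together with its index. The sketch below shows the effect using an in-memory buffer in place of a file on disk, so it runs without the `example.csv` file; the frame contents are made up.

```python
import io
import pandas as pd

frame = pd.DataFrame({'a': [0, 1], 'b': [4, 5]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer)

# Without index_col, the saved index comes back as an extra column
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# index_col=0 tells read_csv to use the first column as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(with_extra.columns))  # ['Unnamed: 0', 'a', 'b']
print(list(restored.columns))    # ['a', 'b']
```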


In [79]:
# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')

In [80]:
df[0]

|  | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date |
|---|---|---|---|---|---|---|
| 0 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | April 3, 2020 |
| 1 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | February 14, 2020 |
| 2 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | November 1, 2019 |
| 3 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | October 25, 2019 |
| 4 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | October 25, 2019 |
| ... | ... | ... | ... | ... | ... | ... |
| 556 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | July 27, 2001 |
| 557 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | May 3, 2001 |
| 558 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | February 2, 2001 |
| 559 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | December 14, 2000 |


In [81]:
df

[                             Bank Name           City  ST   CERT  \
 0                 The First State Bank  Barboursville  WV  14361   
 1                   Ericson State Bank        Ericson  NE  18265   
 2     City National Bank of New Jersey         Newark  NJ  21111   
 3                        Resolute Bank         Maumee  OH  58317   
 4                Louisa Community Bank         Louisa  KY  58112   
 ..                                 ...            ...  ..    ...   
 556                 Superior Bank, FSB       Hinsdale  IL  32646   
 557                Malta National Bank          Malta  OH   6629   
 558    First Alliance Bank & Trust Co.     Manchester  NH  34264   
 559  National State Bank of Metropolis     Metropolis  IL   3815   
 560                   Bank of Honolulu       Honolulu  HI  21029   
 
                    Acquiring Institution       Closing Date  
 0                         MVB Bank, Inc.      April 3, 2020  
 1             Farmers and Merchants Bank  F