# Pandas Package

## Pandas consists of

### A set of labeled array data structures, the primary of which are Series and DataFrame.

### Index objects enabling both simple axis indexing and multi-level / hierarchical axis indexing.

### An integrated group by engine for aggregating and transforming data sets.

### Date range generation (date_range) and custom date offsets enabling the implementation of customized frequencies.

### Input/Output tools: loading tabular data from flat files (CSV, delimited, Excel 2003), and saving and loading pandas objects from the fast and efficient PyTables/HDF5 format.

### Memory-efficient “sparse” versions of the standard data structures for storing data that is mostly missing or mostly constant (some fixed value).

### Moving window statistics (rolling mean, rolling standard deviation, etc.)

![Data Series and Data Frame](image1.png)
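
A few of the components listed above can be tried together in a short sketch. The shop names and unit counts are made-up illustration data, not part of the original notebook.

```python
import pandas as pd

# Date range generation: six consecutive calendar days.
dates = pd.date_range("2013-01-01", periods=6, freq="D")

# Hypothetical data: two shops alternating, one sales figure per day.
sales = pd.DataFrame(
    {"shop": ["north", "south"] * 3, "units": [10, 20, 30, 40, 50, 60]},
    index=dates,
)

# Group-by engine: total units per shop.
totals = sales.groupby("shop")["units"].sum()

# Moving window statistics: 3-day rolling mean of the units column.
# The first two entries are NaN because the window is not yet full.
rolling_mean = sales["units"].rolling(window=3).mean()

print(totals)
print(rolling_mean)
```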

In [None]:
#Key Features of Pandas

#Fast and efficient DataFrame object with default and customized indexing.

#Tools for loading data into in-memory data objects from different file formats.

#Data alignment and integrated handling of missing data.

#Reshaping and pivoting of data sets.

#Label-based slicing, indexing and subsetting of large data sets.

#Columns from a data structure can be deleted or inserted.

#Group by data for aggregation and transformations.

#High performance merging and joining of data.

#Time Series functionality.


In [None]:
## Pandas consists of 3 Data Structures

Series - One-dimensional (1D) labeled homogeneous array, size immutable.

DataFrame - 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.

Panel - General 3D labeled, size-mutable array (deprecated in pandas 0.20 and removed in 0.25).


In a DataFrame, the index refers to axis 0 (rows) and the columns refer to axis 1.

In [None]:
Mutability
All pandas data structures are value mutable (their contents can be changed), and all except Series are size mutable.
Series is size immutable.

Note − DataFrame is widely used and one of the most important data structures. Panel is used much less.
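
As a minimal sketch of value and size mutability on a DataFrame (the names and ages are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alex", "Bob"], "Age": [10, 12]})

# Value mutability: overwrite a single cell in place.
df.loc[0, "Age"] = 11

# Size mutability: add a new column, then delete an existing one.
df["Grade"] = ["A", "B"]
del df["Name"]

print(df)
```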

#### Series

In [4]:
import pandas as pd
import numpy as np


Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

### s = pd.Series(data, index=index) ## data can be dict,ndarray or scalar value
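
The three kinds of `data` can be compared side by side. This is a small sketch; the labels `x`, `y`, `z` are arbitrary.

```python
import numpy as np
import pandas as pd

labels = ["x", "y", "z"]

# From an ndarray: the index must have the same length as the data.
from_array = pd.Series(np.array([10, 20, 30]), index=labels)

# From a dict: the keys become the index labels.
from_dict = pd.Series({"x": 10, "y": 20, "z": 30})

# From a scalar: the value is repeated once for every index label.
from_scalar = pd.Series(5, index=labels)

print(from_array, from_dict, from_scalar, sep="\n\n")
```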

In [174]:
s = pd.Series([1.0,2,3])  ## A Series can be created from an array, a dictionary or a scalar value

In [175]:
print(s)

0    1.0
1    2.0
2    3.0
dtype: float64


In [176]:
s[0]   ### Accessing the element by index

1.0

In [None]:
### We did not pass any index, so by default the index values are assigned starting from 0

In [177]:
 
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])    #We can change the index

In [178]:
s

100    a
101    b
102    c
103    d
dtype: object

In [212]:
s[0]   ### Raises a KeyError because the index labels start at 100

KeyError: 0

In [213]:
s[100]

'a'

In [182]:
s = pd.Series([1.0,2,3])

In [18]:
print(s)  ### One float among the inputs makes the whole Series float64

0    1.0
1    2.0
2    3.0
dtype: float64


In [183]:
s = pd.Series([1,2,3,"Machine", 3.0])
print(s)

0          1
1          2
2          3
3    Machine
4          3
dtype: object


In [184]:
s = pd.Series([1,3,5,np.nan,6,8]) ## np.nan (Not a Number) marks a missing value

In [9]:
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


In [188]:
Dict = {'A': '1', 'B' : '2', 'C' : '3'}    ### Series from a dictionary
s = pd.Series(Dict)

In [189]:
s

A    1
B    2
C    3
dtype: object

In [190]:
s[0]   ### The dictionary keys are used as the index, so s[0] falls back to position. s['A'] selects by label

'1'

In [191]:
s[A]   ### Without quotes, A is read as an undefined variable name

NameError: name 'A' is not defined

In [192]:
s['A']  # Quoting the label fixes the NameError

'1'

In [193]:
s

A    1
B    2
C    3
dtype: object

In [194]:
s.isnull()

A    False
B    False
C    False
dtype: bool

In [86]:
s

0    1
1    2
2    3
3    4
4    5
5    4
dtype: int64

In [195]:
s = pd.Series()   ## Empty Series

In [196]:
s

Series([], dtype: float64)

In [197]:
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve the first element
 

In [200]:
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [201]:
s[0]

1

In [202]:
s['a']

1

In [203]:
s['a','c','d']  ## Raises a KeyError: the three labels are read as one tuple key

KeyError: ('a', 'c', 'd')

In [204]:
s[['a','c','d']] ## Pass a list of labels to select several elements

a    1
c    3
d    4
dtype: int64
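
The lookups above mix positions and labels through plain `[]`. A hedged sketch of the explicit alternatives, `.iloc` for positions and `.loc` for labels, which avoid the ambiguity (recent pandas versions warn when `s[0]` is used as a position on a labeled index):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

first_by_position = s.iloc[0]      # first element, whatever its label is
first_by_label = s.loc["a"]        # element whose label is 'a'
subset = s.loc[["a", "c", "d"]]    # several labels at once

# Label slices include both end points; position slices exclude the stop.
label_slice = s.loc["b":"d"]
position_slice = s.iloc[1:3]

print(subset)
print(label_slice)
print(position_slice)
```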

# Dataframes

![title](dataframe1.jpg)

## Creating Dataframes

In [247]:
##pd.DataFrame( data, index, columns, dtype, copy)

In [1]:
# import the pandas library, aliased as pd
import pandas as pd
df = pd.DataFrame()
 

In [2]:
print(df)

Empty DataFrame
Columns: []
Index: []


In [208]:
## Create a dataframe from lists

In [3]:
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)

In [4]:
df

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


In [None]:
###Example 2

In [5]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])

In [6]:
df

Unnamed: 0,Name,Age
0,Alex,10
1,Bob,12
2,Clarke,13


In [None]:
###Example 3

In [66]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
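# Cast only the numeric column: dtype=float on the whole frame cannot convert the names and may raise an error in recent pandas versions
df = pd.DataFrame(data,columns=['Name','Age']).astype({'Age': float})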
 

In [67]:
df

Unnamed: 0,Name,Age
0,Alex,10.0
1,Bob,12.0
2,Clarke,13.0


In [None]:
## Example 4 Create a DataFrame from a dictionary of lists

In [68]:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42], 'Sub':["Maths","Phys","Chem","English"]}   
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
 

In [70]:
df[['Name','Age','Sub']]

Unnamed: 0,Name,Age,Sub
rank1,Tom,28,Maths
rank2,Jack,34,Phys
rank3,Steve,29,Chem
rank4,Ricky,42,English


In [2]:
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)

In [3]:
df   ### Missing values are filled with NaN

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [4]:
import pandas as pd
data = [{'a': 1, 'b': 2,'c':3},{'a': 5, 'b': 10}]
df1 = pd.DataFrame(data)

In [5]:
df1

Unnamed: 0,a,b,c
0,1,2,3.0
1,5,10,


#### Pass a list of dictionary objects

In [None]:
## Example 6
#The following example shows how to create a DataFrame with a list of dictionaries,
#row indices, and column indices.

In [13]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

#With four column labels; 'd' is not a dictionary key, so that column would be all NaN
#df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b','c','d'])

#With two column labels, one of which ('b1') is not a dictionary key
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
#With two column labels that are both dictionary keys
df3 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'c'])
 

In [15]:
df2

Unnamed: 0,a,b1
first,1,
second,5,


In [90]:
df2  # no dictionary has the key 'b1', so that column is all NaN

Unnamed: 0,a,b1
first,1,
second,5,


In [275]:
df3

Unnamed: 0,a,c
first,1,
second,5,20.0


In [16]:
data = [{'a': 1, 'b': 2,'c' : 4},{'a': 5, 'b': 10}]

#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b','c'])

In [108]:
df1    ## c value is missing for second row

Unnamed: 0,a,b,c
first,1,2,4.0
second,5,10,
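
To check where the constructor filled in NaN, the missing values can be counted per column. A minimal sketch, reusing the data from the cell above:

```python
import pandas as pd

data = [{"a": 1, "b": 2, "c": 4}, {"a": 5, "b": 10}]
df1 = pd.DataFrame(data, index=["first", "second"])

# isna() marks each missing cell with True; sum() counts them per column.
missing_per_column = df1.isna().sum()

# Column 'c' holds a NaN, so pandas stores it as float64 rather than int64.
print(missing_per_column)
print(df1.dtypes)
```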


In [20]:
import numpy as np
dict1 = {'A' : 1.,'B' : pd.Timestamp('20130102'),  'C' : pd.Series(1,index=list(range(4)),dtype='float32'),     
         'D' : np.array([3] * 4,dtype='int32'),
         'E' : pd.Categorical(["Tvm","Bngalore","Delhi","Chennai"]),
                   'F' : 'Daily' }

In [21]:
df = pd.DataFrame(dict1)
print(df)

     A          B    C  D         E      F
0  1.0 2013-01-02  1.0  3       Tvm  Daily
1  1.0 2013-01-02  1.0  3  Bngalore  Daily
2  1.0 2013-01-02  1.0  3     Delhi  Daily
3  1.0 2013-01-02  1.0  3   Chennai  Daily


In [22]:
df.tail()

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,Tvm,Daily
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily


In [40]:
df.head()

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,Tvm,Daily
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily


In [23]:
df.index

Int64Index([0, 1, 2, 3], dtype='int64')

In [24]:
df.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [25]:
df.values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'Tvm', 'Daily'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'Bngalore',
        'Daily'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'Delhi', 'Daily'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'Chennai',
        'Daily']], dtype=object)

In [44]:
df.describe()   ## Quick statistic summary

Unnamed: 0,A,C,D
count,4.0,4.0,4.0
mean,1.0,1.0,3.0
std,0.0,0.0,0.0
min,1.0,1.0,3.0
25%,1.0,1.0,3.0
50%,1.0,1.0,3.0
75%,1.0,1.0,3.0
max,1.0,1.0,3.0


In [45]:
df.T  ## Transpose of the data

Unnamed: 0,0,1,2,3
A,1,1,1,1
B,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00,2013-01-02 00:00:00
C,1,1,1,1
D,3,3,3,3
E,Tvm,Bngalore,Delhi,Chennai
F,Daily,Daily,Daily,Daily


In [26]:
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,Tvm,Daily
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily


In [27]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,F,E,D,C,B,A
0,Daily,Tvm,3,1.0,2013-01-02,1.0
1,Daily,Bngalore,3,1.0,2013-01-02,1.0
2,Daily,Delhi,3,1.0,2013-01-02,1.0
3,Daily,Chennai,3,1.0,2013-01-02,1.0


In [28]:
df.sort_values(by='E')   ## Sort by values

Unnamed: 0,A,B,C,D,E,F
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily
0,1.0,2013-01-02,1.0,3,Tvm,Daily
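
`sort_values` also accepts several sort keys, each with its own direction. The frame below is a made-up example, not the one from the notebook:

```python
import pandas as pd

df_scores = pd.DataFrame(
    {"city": ["Delhi", "Chennai", "Delhi", "Chennai"], "score": [3, 1, 2, 4]}
)

# Sort by city ascending, then by score descending within each city.
ordered = df_scores.sort_values(by=["city", "score"], ascending=[True, False])

print(ordered)
```

The original row labels travel with the rows, so the index of `ordered` shows where each row came from.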


## Selecting Data

In [29]:
dict1 = {'A' : 1.,'B' : pd.Timestamp('20130102'),
         'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
         'D' : np.array([3] * 4,dtype='int32'),
         'E' : pd.Categorical(["Tvm","Bngalore","Delhi","Chennai"]),
         'F' : 'Daily' }

In [30]:
df = pd.DataFrame(dict1)
print(df)


     A          B    C  D         E      F
0  1.0 2013-01-02  1.0  3       Tvm  Daily
1  1.0 2013-01-02  1.0  3  Bngalore  Daily
2  1.0 2013-01-02  1.0  3     Delhi  Daily
3  1.0 2013-01-02  1.0  3   Chennai  Daily


In [130]:
df['A']

0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

In [31]:
df['E']

0         Tvm
1    Bngalore
2       Delhi
3     Chennai
Name: E, dtype: category
Categories (4, object): [Bngalore, Chennai, Delhi, Tvm]

In [None]:
## Slicing

In [32]:
df[0:3]  ## Slices the rows

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,Tvm,Daily
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily


In [133]:
df[2:]

Unnamed: 0,A,B,C,D,E,F
2,1.0,2013-01-02,1.0,3,Delhi,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily


In [134]:
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,Tvm,Daily
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily


# Selection by Label

In [33]:
df.loc[:,['A','B']]

Unnamed: 0,A,B
0,1.0,2013-01-02
1,1.0,2013-01-02
2,1.0,2013-01-02
3,1.0,2013-01-02


In [136]:
df.loc[:,['B','E']]

Unnamed: 0,B,E
0,2013-01-02,Tvm
1,2013-01-02,Bngalore
2,2013-01-02,Delhi
3,2013-01-02,Chennai


In [34]:
df.loc[1:2,['A','B']]

Unnamed: 0,A,B
1,1.0,2013-01-02
2,1.0,2013-01-02


In [138]:
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,Tvm,Daily
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily


In [35]:
df.iloc[3]   ## Selection by position: returns the 4th row

A                      1
B    2013-01-02 00:00:00
C                      1
D                      3
E                Chennai
F                  Daily
Name: 3, dtype: object

In [140]:
df.iloc[0] ## Selects the first row

A                      1
B    2013-01-02 00:00:00
C                      1
D                      3
E                    Tvm
F                  Daily
Name: 0, dtype: object

In [141]:
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,Tvm,Daily
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily


In [36]:
df.iloc[[1,3],[0,2]]

Unnamed: 0,A,C
1,1.0,1.0
3,1.0,1.0


In [143]:
df.iloc[[1,3],[0,4]]

Unnamed: 0,A,E
1,1.0,Bngalore
3,1.0,Chennai


In [37]:
df.iloc[1:3,:] ## Explicit slicing of rows

Unnamed: 0,A,B,C,D,E,F
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily


In [144]:
df.iloc[1,1]

Timestamp('2013-01-02 00:00:00')

In [145]:
df.iat[1,1]

Timestamp('2013-01-02 00:00:00')
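
One difference between `.loc` and `.iloc` is easy to miss when the index is the default 0, 1, 2, ...: label slices include the stop value, position slices do not. A small sketch on an illustrative frame:

```python
import pandas as pd

df_small = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "E": ["Tvm", "Bangalore", "Delhi", "Chennai"]}
)

label_rows = df_small.loc[1:2]      # labels 1 and 2: two rows
position_rows = df_small.iloc[1:2]  # position 1 only: one row

# .at and .iat are the scalar counterparts of .loc and .iloc.
by_label = df_small.at[3, "E"]
by_position = df_small.iat[3, 1]

print(len(label_rows), len(position_rows), by_label, by_position)
```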

## Boolean Indexing

In [45]:
df[df.A > 2.0]

Unnamed: 0,A,B,C,D,E,F


In [44]:
df[df > 1]

Unnamed: 0,A,B,C,D,E,F
0,,2013-01-02,,3,Tvm,Daily
1,,2013-01-02,,3,Bngalore,Daily
2,,2013-01-02,,3,Delhi,Daily
3,,2013-01-02,,3,Chennai,Daily


In [46]:
df.isin('Tvm')  ## Raises a TypeError: isin() needs a list-like or dict-like argument, not a bare string

TypeError: only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'str'

In [48]:
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,Tvm,Daily
1,1.0,2013-01-02,1.0,3,Bngalore,Daily
2,1.0,2013-01-02,1.0,3,Delhi,Daily
3,1.0,2013-01-02,1.0,3,Chennai,Daily


In [47]:
df.isin(['Tvm'])  ## Wrapping the value in a list fixes the TypeError

Unnamed: 0,A,B,C,D,E,F
0,False,False,False,False,True,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False


In [49]:
df.isin(['Tvm','Chennai'])

Unnamed: 0,A,B,C,D,E,F
0,False,False,False,False,True,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,True,False
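
`isin` on a whole frame returns a table of booleans. To keep only the matching rows, a common pattern is to build the mask from a single column and index the frame with it. A sketch on a reduced, illustrative frame:

```python
import pandas as pd

cities = pd.DataFrame(
    {"D": [3, 3, 3, 3], "E": ["Tvm", "Bangalore", "Delhi", "Chennai"]}
)

# Boolean mask: True where column E is one of the listed values.
mask = cities["E"].isin(["Tvm", "Chennai"])

# Index with the mask to keep only the matching rows.
selected = cities[mask]

# Masks combine with & (and), | (or) and ~ (not); each condition needs parentheses.
combined = cities[(cities["D"] > 2) & ~mask]

print(selected)
print(combined)
```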


In [102]:
import pandas as pd
import numpy as np
#s1 = pd.Series([0,1,2,3], index=pd.date_range('20130102', periods=4))
s1 = pd.Series([0,4.0,2,3])

In [100]:
s1  

0    0.0
1    4.0
2    2.0
3    3.0
dtype: float64

In [103]:
dict1 = {'A' : 1.,'B' : pd.Timestamp('20130102'),
         'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
         'D' : np.array([3] * 4,dtype='int32'),
         'E' : pd.Categorical(["Tvm","Bngalore","Delhi","Chennai"]),
         'F' : 'Daily' }

In [104]:
df = pd.DataFrame(dict1)
print(df)


     A          B    C  D         E      F
0  1.0 2013-01-02  1.0  3       Tvm  Daily
1  1.0 2013-01-02  1.0  3  Bngalore  Daily
2  1.0 2013-01-02  1.0  3     Delhi  Daily
3  1.0 2013-01-02  1.0  3   Chennai  Daily


In [105]:
df['C']

0    1.0
1    1.0
2    1.0
3    1.0
Name: C, dtype: float32

In [106]:
df['C'] = s1

In [107]:
df['C']       

0    0.0
1    4.0
2    2.0
3    3.0
Name: C, dtype: float64

In [None]:
#### **** Important: assigning a Series to a column aligns on the index, and the column takes the dtype of the new Series (float32 became float64 above) ****

In [108]:
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,0.0,3,Tvm,Daily
1,1.0,2013-01-02,4.0,3,Bngalore,Daily
2,1.0,2013-01-02,2.0,3,Delhi,Daily
3,1.0,2013-01-02,3.0,3,Chennai,Daily
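
A short sketch of what happens when a Series is assigned to a column: pandas matches rows by index label, and any row without a matching label gets NaN. The column name `G` is a hypothetical addition for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"C": np.ones(4, dtype="float32")})

# Same index labels (0..3): the values are replaced and the dtype follows the new Series.
frame["C"] = pd.Series([0, 4.0, 2, 3])

# Partial index: only labels 2 and 3 exist, so rows 0 and 1 become NaN.
frame["G"] = pd.Series([10.0, 20.0], index=[2, 3])

print(frame)
print(frame.dtypes)
```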


In [17]:
df.dropna()  ## Drops every row that contains at least one missing value

Unnamed: 0,A,B,C,D,E,F


In [162]:
df

Unnamed: 0,A,B,C,D,E,F
0,1.0,,1.0,3,Tvm,Daily
1,1.0,,1.0,3,Bngalore,Daily
2,1.0,,1.0,3,Delhi,Daily
3,1.0,,1.0,3,Chennai,Daily
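
When one column is missing in every row, as column B is in the frame above, a plain `dropna()` removes all rows. A hedged sketch of some alternatives on a small made-up frame:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, np.nan], "C": ["x", "y", "z"]}
)

# Default: drop rows with any missing value. B is always missing, so nothing is left.
rows_dropped = df_missing.dropna()

# Drop only the columns in which every value is missing.
cols_dropped = df_missing.dropna(axis=1, how="all")

# Or fill the gaps in selected columns instead of dropping anything.
filled = df_missing.fillna({"A": 0.0})

print(len(rows_dropped))
print(cols_dropped)
print(filled)
```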


### STATISTICS

In [18]:
df = pd.DataFrame({'Student' : ['Priya', 'Norah', 'Elsa', 'Daniel']})

In [19]:
df

Unnamed: 0,Student
0,Priya
1,Norah
2,Elsa
3,Daniel


In [20]:
df = pd.DataFrame({'Student' : ['Priya', 'Norah', 'Elsa', 'Daniel'], 'Maths' : [100,90,88,72],'Science':[100,98,88,40]})

In [167]:
df

Unnamed: 0,Maths,Science,Student
0,100,100,Priya
1,90,98,Norah
2,88,88,Elsa
3,72,40,Daniel


In [22]:
df.mean(axis=0, numeric_only=True)  ### Mean of each numeric column

Maths      87.5
Science    81.5
dtype: float64

In [21]:
df.mean(numeric_only=True) ## axis=0 is the default

Maths      87.5
Science    81.5
dtype: float64

In [23]:
df.mean(axis=1, numeric_only=True)  ## average for each student

0    100.0
1     94.0
2     88.0
3     56.0
dtype: float64

In [24]:
df.mean(1, numeric_only=True)

0    100.0
1     94.0
2     88.0
3     56.0
dtype: float64
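
One way to keep the text column out of the arithmetic is to select the mark columns first. The sketch below stores the per-student average in a new column (the name `Average` is an illustrative choice) and computes several statistics at once with `agg`:

```python
import pandas as pd

df_marks = pd.DataFrame(
    {
        "Student": ["Priya", "Norah", "Elsa", "Daniel"],
        "Maths": [100, 90, 88, 72],
        "Science": [100, 98, 88, 40],
    }
)

# Work on the numeric columns only.
marks = df_marks[["Maths", "Science"]]

# Row-wise mean (axis=1), stored next to the student names.
df_marks["Average"] = marks.mean(axis=1)

# Several column-wise statistics in one call.
summary = marks.agg(["mean", "min", "max"])

print(df_marks)
print(summary)
```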

In [26]:
df

Unnamed: 0,Maths,Science,Student
0,100,100,Priya
1,90,98,Norah
2,88,88,Elsa
3,72,40,Daniel


 ## Applying Functions

In [25]:
df.apply(np.cumsum)  ## Cumulative sum of each column (strings are concatenated)

Unnamed: 0,Maths,Science,Student
0,100,100,Priya
1,190,198,PriyaNorah
2,278,286,PriyaNorahElsa
3,350,326,PriyaNorahElsaDaniel
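
`apply` is not limited to NumPy functions: any function that takes a column (a Series) works. A minimal sketch on the numeric columns only, using a lambda to compute the spread of each subject:

```python
import pandas as pd

marks = pd.DataFrame({"Maths": [100, 90, 88, 72], "Science": [100, 98, 88, 40]})

# The lambda receives one column at a time and returns a single number.
spread = marks.apply(lambda col: col.max() - col.min())

# For cumulative sums, the built-in method gives the same numbers as apply(np.cumsum).
running_total = marks.cumsum()

print(spread)
print(running_total)
```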


In [27]:
s = pd.Series([1,2,3,4,5,4])
print(s.pct_change())


0         NaN
1    1.000000
2    0.500000
3    0.333333
4    0.250000
5   -0.200000
dtype: float64
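
`pct_change` compares each element with the one before it. As a sketch, the same numbers can be reproduced by hand with `shift`, which makes the leading NaN easier to understand (the first element has no predecessor):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 4])

# shift(1) moves every value down one position, leaving NaN at the top.
manual = s / s.shift(1) - 1
builtin = s.pct_change()

print(manual)
```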


In [28]:
s

0    1
1    2
2    3
3    4
4    5
5    4
dtype: int64

In [None]:
s = pd.DataFrame([1,2,3,4,5,4])

In [None]:
#df = pd.DataFrame({'Student' : ['Priya', 'Norah', 'Elsa', 'Daniel'], 'Maths' : [100,90,88,72],'Science':[100,98,88,40]})

In [None]:
## Covariance

In [29]:
import pandas as pd
import numpy as np
s1 = pd.Series(np.arange(10))
print(s1)
s2 = pd.Series(np.random.randn(10))
print(s1.cov(s2))  # s2 is random, so the value differs on every run

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
-0.561336452185573
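
As a check on what `cov` computes, the sample covariance can be written out by hand. This sketch fixes the random seed (the seed value 0 is an arbitrary choice) so that repeated runs give the same numbers:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator for reproducible values
x = pd.Series(np.arange(10))
y = pd.Series(rng.standard_normal(10))

# Sample covariance: sum of products of deviations, divided by n - 1.
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(manual_cov, x.cov(y))
```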


In [None]:
## Correlation

In [30]:
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

print(frame['a'].corr(frame['b']))



0.2374071500257198


In [31]:
frame

Unnamed: 0,a,b,c,d,e
0,-1.448839,-0.770723,0.192389,1.051327,0.293079
1,0.46086,-0.949988,-0.783928,1.947596,-0.791537
2,-0.96194,0.584179,-0.413749,0.337782,0.419248
3,0.195606,-1.645685,-0.088582,0.210991,0.039451
4,1.202047,0.363043,-0.696261,1.034878,0.651567
5,-0.876625,-1.05971,0.87564,0.159399,0.050102
6,0.756447,0.692849,-0.043657,-0.7936,2.617648
7,-0.763689,-1.062693,2.404561,-0.640498,1.001797
8,1.258731,0.491314,-0.78678,0.009016,0.176402
9,1.883314,-0.938164,-0.618368,-0.113824,1.205474


In [32]:
print(frame.corr())  

          a         b         c         d         e
a  1.000000  0.237407 -0.584194 -0.079011  0.248187
b  0.237407  1.000000 -0.396678 -0.153325  0.428887
c -0.584194 -0.396678  1.000000 -0.475395  0.181748
d -0.079011 -0.153325 -0.475395  1.000000 -0.747168
e  0.248187  0.428887  0.181748 -0.747168  1.000000
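
By default `corr` computes the Pearson correlation, which is the covariance divided by the product of the two standard deviations. A sketch that verifies this on seeded random data (the seed is an arbitrary choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((10, 5)), columns=["a", "b", "c", "d", "e"])

a, b = frame["a"], frame["b"]

# Pearson correlation written out from covariance and standard deviations.
manual_corr = a.cov(b) / (a.std() * b.std())

# The full matrix is symmetric with ones on the diagonal.
corr_matrix = frame.corr()

print(manual_corr, a.corr(b))
```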


## String Methods

In [33]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [None]:
s

In [34]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
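
The `.str` accessor offers many more element-wise methods, and they all pass missing values through. A short sketch with three of them:

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

upper = s.str.upper()   # NaN stays NaN
lengths = s.str.len()   # number of characters per element

# Case-insensitive search for the letter 'a'; na=False treats missing values as no match.
has_a = s.str.contains("a", case=False, na=False)

print(upper)
print(lengths)
print(s[has_a])
```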