## What is Pandas?

* Pandas is a Python package widely used for data analysis. It simplifies loading data from external sources such as text files and databases, and provides ways of analysing and manipulating the data once it is loaded. Its features automate many common tasks that would otherwise take many lines of code in plain Python.

## Pandas can work with the following:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
- Any other form of observational / statistical data sets. The data actually need not be labelled at all to be placed into a pandas data structure.

## Importing Pandas

In [2]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

## Introduction to Pandas Data Structures
* Series
* Data Frame

## Series
* A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data; pandas then supplies a default integer index starting at 0.

In [3]:
obj = Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [3]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [4]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [6]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

* Creating a Series from a Python dictionary

In [10]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [11]:
print(obj3.values) #  It returns a one dimensional array
print(obj3.index)

[35000 71000 16000  5000]
Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object')


In [14]:
states = ['California', 'Texas', 'Oregon', 'Utah', 'Ohio']
obj4 = Series(sdata, index = states)
obj4

California        NaN
Texas         71000.0
Oregon        16000.0
Utah           5000.0
Ohio          35000.0
dtype: float64
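
The NaN for California appears because that label has no entry in `sdata`, and the integer values are upcast to float64 so that NaN can be stored. A small sketch that checks both points (the names `aligned` and `missing` are illustrative and not part of the original notebook):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Texas', 'Oregon', 'Utah', 'Ohio']

# Labels absent from the dict become NaN; the ints are upcast to float64.
aligned = pd.Series(sdata, index=states)

# Boolean mask of missing entries, then the labels where it is True.
missing = aligned[aligned.isna()].index.tolist()
print(missing)
print(aligned.dtype)
```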

* Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality

In [17]:
obj4.name = 'population'
obj4.index.name = 'states'
obj4

states
California        NaN
Texas         71000.0
Oregon        16000.0
Utah           5000.0
Ohio          35000.0
Name: population, dtype: float64

* A Series's index can be altered in place by assignment

In [19]:
obj = Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

## Data Frame
* A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 
* The DataFrame has both a row and a column index; it can be thought of as a dict of Series that all share the same row index.
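
To make the "dict of Series" picture concrete, here is a minimal sketch. The two Series and their values are made up for illustration; the point is only that each dict key becomes a column and the row index is the union of the Series indexes.

```python
import pandas as pd

# Two Series with overlapping but not identical indexes (illustrative data).
first = pd.Series({'Ohio': 1.5, 'Nevada': 2.4})
second = pd.Series({'Ohio': 10.0, 'Utah': 20.0})

# Each dict key becomes a column; rows are aligned on the index labels.
df = pd.DataFrame({'first': first, 'second': second})
print(df)

# Selecting a column hands back a Series on the shared row index.
col = df['first']
print(type(col).__name__, list(col.index))
```

Labels that appear in only one of the Series end up as NaN in the other column.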

In [21]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'population': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

    state  year  population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2    Ohio  2002         3.6
3  Nevada  2001         2.4
4  Nevada  2002         2.9


In [44]:
frame2 = DataFrame(data, columns=['year', 'state', 'population', 'debt'],
  index=['one', 'two', 'three', 'four', 'five'])
frame2

       year   state  population  debt
one    2000    Ohio         1.5   NaN
two    2001    Ohio         1.7   NaN
three  2002    Ohio         3.6   NaN
four   2001  Nevada         2.4   NaN
five   2002  Nevada         2.9   NaN


In [45]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [46]:
type(frame2['state'])

pandas.core.series.Series

In [47]:
frame2[['state','year']]

        state  year
one      Ohio  2000
two      Ohio  2001
three    Ohio  2002
four   Nevada  2001
five   Nevada  2002


In [48]:
type(frame2[['state','year']])

pandas.core.frame.DataFrame

In [49]:
frame2.debt

one      NaN
two      NaN
three    NaN
four     NaN
five     NaN
Name: debt, dtype: object

In [54]:
frame2.debt = frame2.population * 2
frame2

       year   state  population  debt
one    2000    Ohio         1.5   3.0
two    2001    Ohio         1.7   3.4
three  2002    Ohio         3.6   7.2
four   2001  Nevada         2.4   4.8
five   2002  Nevada         2.9   5.8


In [57]:
# Adding a new column
frame2['credit'] = frame2.population * 0.5
frame2

       year   state  population  debt  credit
one    2000    Ohio         1.5   3.0    0.75
two    2001    Ohio         1.7   3.4    0.85
three  2002    Ohio         3.6   7.2    1.80
four   2001  Nevada         2.4   4.8    1.20
five   2002  Nevada         2.9   5.8    1.45


In [60]:
print(type(frame2.values)) # This returns an n-dim array
print("Shape of returned data : ", frame2.values.shape)

<class 'numpy.ndarray'>
Shape of returned data :  (5, 5)


## Inputs that can be used by Data Frame constructor
![title](datainput.PNG)
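
The figure itself is not reproduced in this text. As a rough sketch, here are four inputs the constructor is known to accept; the variable names are illustrative, and the list is not claimed to match the figure exactly.

```python
import numpy as np
import pandas as pd

# 1. dict of equal-length lists: keys become columns
from_lists = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# 2. list of dicts: each dict is a row, keys become columns
from_records = pd.DataFrame([{'a': 1, 'b': 3}, {'a': 2, 'b': 4}])

# 3. 2D NumPy array with explicit column labels
from_array = pd.DataFrame(np.array([[1, 3], [2, 4]]), columns=['a', 'b'])

# 4. dict of Series: the Series indexes are aligned into the row index
from_series = pd.DataFrame({'a': pd.Series([1, 2]), 'b': pd.Series([3, 4])})

# The dict-based and record-based frames hold identical data.
print(from_lists.equals(from_records), from_lists.equals(from_series))
```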

## Reindexing
* A critical method on pandas objects is reindex, which means to create a new object with the data conformed to a new index.

In [61]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

* Calling reindex on this Series rearranges the data according to the new index, introducing missing values (NaN) for any index labels that were not already present.

In [62]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [63]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [64]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [65]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [70]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
        columns=['Ohio', 'Texas', 'California'])
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [71]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])  # Reindexing the rows; the new label 'b' is filled with NaN
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


In [74]:
states = ['Texas', 'Utah', 'California'] # Reindexing columns
frame3 = frame.reindex(columns=states)
frame3

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


## Reindex Function Arguments
![title](reindex.PNG)
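
The figure is not reproduced in this text. The sketch below exercises a few reindex arguments that are known to exist (`index`, `columns`, `method`, `limit`, `fill_value`); the data is illustrative and the selection is not claimed to match the figure exactly.

```python
import pandas as pd

s = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' carries the last valid value forward;
# limit caps how many consecutive new labels may be filled.
filled = s.reindex(range(6), method='ffill', limit=1)

frame = pd.DataFrame({'Ohio': [0, 3], 'Texas': [1, 4]}, index=['a', 'c'])

# index= and columns= can be given together;
# fill_value is used instead of NaN for labels that are new.
both = frame.reindex(index=['a', 'b', 'c'], columns=['Texas', 'Utah'], fill_value=0)

print(filled)
print(both)
```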

## Dropping entries from an axis
* Series

In [82]:
# For Series
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
new_obj = obj.drop(['c','d'])
print(new_obj)

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
e    4.0
dtype: float64


* Data Frame

In [83]:
# For Dataframe
data = DataFrame(np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [84]:
data = data.drop(['Colorado','Utah'])  # Delete rows
data = data.drop('two', axis=1) # Delete a column; axis=1 is required because drop looks up row labels by default and would raise a KeyError for 'two'
data

          one  three  four
Ohio        0      2     3
New York   12     14    15
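
As an alternative to passing axis, drop also accepts `index=` and `columns=` keywords. The sketch below uses them and also shows that drop returns a new object rather than modifying the original (the name `trimmed` is illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# index= and columns= name the axis explicitly, so no axis argument is needed.
trimmed = data.drop(index=['Colorado', 'Utah'], columns=['two'])

# drop returns a new object; the original frame is left untouched.
print(trimmed)
print(data.shape)
```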


## Indexing, selection, and filtering 

* Series

In [91]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [92]:
obj['a']

0.0

In [94]:
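obj.iloc[0] # Positional access; plain obj[0] on a label-indexed Series is deprecated in recent pandas versions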

0.0

In [95]:
obj[1:3]

b    1.0
c    2.0
dtype: float64

In [96]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [97]:
obj.iloc[[1, 3]] # Positional access with a list of positions

b    1.0
d    3.0
dtype: float64

In [98]:
obj[obj < 2] # Boolean indexing

a    0.0
b    1.0
dtype: float64

In [99]:
obj['b':'c'] # Slicing by label includes both endpoints, unlike slicing by position, which excludes the stop

b    1.0
c    2.0
dtype: float64

* Dataframe

In [100]:
data = DataFrame(np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [101]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [102]:
data[['three','one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


In [118]:
data[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [140]:
data.loc['Ohio':'Utah', ['one','three','four']]  # Selecting specific rows and columns

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
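
The same selection can be written by position with iloc. This sketch contrasts the two accessors: loc slices by label and includes both endpoints, while iloc slices by position and excludes the stop, as in plain Python.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc is label-based: the slice 'Ohio':'Utah' includes both endpoints.
by_label = data.loc['Ohio':'Utah', ['one', 'three', 'four']]

# iloc is position-based: 0:3 stops before position 3.
by_position = data.iloc[0:3, [0, 2, 3]]

print(by_label.equals(by_position))
```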


## Arithmetic and Data Alignment
* One of the most important pandas features is how arithmetic behaves between objects with different indexes. When two objects are added and their index labels do not all match, the index of the result is the union of the two indexes.

* The internal data alignment introduces NA values in the indices that don’t overlap.

In [142]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

* In the case of DataFrame, alignment is performed on both the rows and the columns

In [144]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
     index=['Ohio', 'Texas', 'Colorado'])

df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1 + df2

            b    c     d    e
Colorado  NaN  NaN   NaN  NaN
Ohio      3.0  NaN   6.0  NaN
Oregon    NaN  NaN   NaN  NaN
Texas     9.0  NaN  12.0  NaN
Utah      NaN  NaN   NaN  NaN


In [149]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
print(df1)
print(df2)
df1 + df2

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


      a     b     c     d    e
0   0.0   2.0   4.0   6.0  NaN
1   9.0  11.0  13.0  15.0  NaN
2  18.0  20.0  22.0  24.0  NaN
3   NaN   NaN   NaN   NaN  NaN


In [150]:
# add with fill_value=0 treats a value missing from one frame as 0 before adding
df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
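
The other arithmetic methods take the same fill_value argument. A brief sketch with sub and mul on the same two frames (the names `diff` and `prod` are illustrative); the fill value should be chosen to suit the operation, for example 1 for multiplication.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# sub and mul accept fill_value just like add.
diff = df1.sub(df2, fill_value=0)
prod = df1.mul(df2, fill_value=1)

print(diff.loc[3, 'a'])  # row 3 is missing from df1, so this is 0 - 15
print(prod.loc[0, 'e'])  # column e is missing from df1, so this is 1 * 4
```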


## Function Apply on Data Frames

In [152]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

               b         d         e
Utah    0.158653  0.833525  1.339129
Ohio    1.761187 -0.424433  0.628023
Texas  -0.604017 -0.693928  0.865849
Oregon -1.704597 -1.286342  1.498110


In [155]:
f = lambda x: x.max() - x.min()
frame.apply(f, axis=1)

Utah      1.180475
Ohio      2.185620
Texas     1.559776
Oregon    3.202707
dtype: float64

In [157]:
frame.apply(f, axis=0)

b    3.465784
d    2.119867
e    0.870087
dtype: float64

In [161]:
series = Series(np.arange(10))
print(series)
f = lambda x: x * 2
series.apply(f)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int32


0     0
1     2
2     4
3     6
4     8
5    10
6    12
7    14
8    16
9    18
dtype: int64
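
The function passed to apply does not have to return a scalar. As a sketch, returning a Series from the function yields a DataFrame with one row per returned label; the helper name `min_max` and the deterministic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def min_max(col):
    # Returning a Series makes apply build a DataFrame with one row per label.
    return pd.Series([col.min(), col.max()], index=['min', 'max'])

# Default axis=0: min_max receives each column in turn.
summary = frame.apply(min_max)
print(summary)
```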

## Handling missing data
* pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It serves as a sentinel that can be easily detected:

In [163]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

* isnull returns a boolean mask that is True where a value is missing

In [165]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

* notnull is the inverse of isnull: True where a value is present

In [167]:
string_data.notnull()

0     True
1     True
2    False
3     True
dtype: bool

In [178]:
df = DataFrame(np.arange(100).reshape(10,10))
df

    0   1   2   3   4   5   6   7   8   9
0   0   1   2   3   4   5   6   7   8   9
1  10  11  12  13  14  15  16  17  18  19
2  20  21  22  23  24  25  26  27  28  29
3  30  31  32  33  34  35  36  37  38  39
4  40  41  42  43  44  45  46  47  48  49
5  50  51  52  53  54  55  56  57  58  59
6  60  61  62  63  64  65  66  67  68  69
7  70  71  72  73  74  75  76  77  78  79
8  80  81  82  83  84  85  86  87  88  89
9  90  91  92  93  94  95  96  97  98  99


In [180]:
df.loc[4:5,4:5] = None
df.notnull()

      0     1     2     3      4      5     6     7     8     9
0  True  True  True  True   True   True  True  True  True  True
1  True  True  True  True   True   True  True  True  True  True
2  True  True  True  True   True   True  True  True  True  True
3  True  True  True  True   True   True  True  True  True  True
4  True  True  True  True  False  False  True  True  True  True
5  True  True  True  True  False  False  True  True  True  True
6  True  True  True  True   True   True  True  True  True  True
7  True  True  True  True   True   True  True  True  True  True
8  True  True  True  True   True   True  True  True  True  True
9  True  True  True  True   True   True  True  True  True  True


In [182]:
df.dropna()

    0   1   2   3     4     5   6   7   8   9
0   0   1   2   3   4.0   5.0   6   7   8   9
1  10  11  12  13  14.0  15.0  16  17  18  19
2  20  21  22  23  24.0  25.0  26  27  28  29
3  30  31  32  33  34.0  35.0  36  37  38  39
6  60  61  62  63  64.0  65.0  66  67  68  69
7  70  71  72  73  74.0  75.0  76  77  78  79
8  80  81  82  83  84.0  85.0  86  87  88  89
9  90  91  92  93  94.0  95.0  96  97  98  99


In [184]:
df.fillna(10000)

    0   1   2   3        4        5   6   7   8   9
0   0   1   2   3      4.0      5.0   6   7   8   9
1  10  11  12  13     14.0     15.0  16  17  18  19
2  20  21  22  23     24.0     25.0  26  27  28  29
3  30  31  32  33     34.0     35.0  36  37  38  39
4  40  41  42  43  10000.0  10000.0  46  47  48  49
5  50  51  52  53  10000.0  10000.0  56  57  58  59
6  60  61  62  63     64.0     65.0  66  67  68  69
7  70  71  72  73     74.0     75.0  76  77  78  79
8  80  81  82  83     84.0     85.0  86  87  88  89
9  90  91  92  93     94.0     95.0  96  97  98  99
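
Both dropna and fillna take arguments that control how missing values are handled. A short sketch of a few of them on a similar frame; it is built as float from the start so that NaN can be assigned without a dtype change, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100, dtype=float).reshape(10, 10))
df.loc[4:5, 4:5] = np.nan  # label-based slices, so both 4 and 5 are included

# axis=1 drops the columns that contain NaN instead of the rows.
no_bad_cols = df.dropna(axis=1)

# how='all' only drops rows in which every value is missing (none here).
all_missing = df.dropna(how='all')

# fillna accepts a dict mapping column label to replacement value.
patched = df.fillna({4: -1, 5: -2})

print(no_bad_cols.shape, all_missing.shape)
```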


## Sorting in Pandas
* There are two kinds of sorting available for DataFrame in Pandas.
    * By label
    * By actual value

In [187]:
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
unsorted_df

       col2      col1
1 -0.470004  0.341214
4 -0.340269  1.361560
6 -0.133254 -0.924482
2  0.178383 -0.102982
3 -0.094583 -0.313308
5 -2.222339  0.185028
9  0.978152 -0.076374
8 -0.340791  0.048023
0 -0.932065  0.426714
7 -1.363654  0.135429


* By label: sort_index sorts a DataFrame by its row or column labels. The axis argument selects which labels to sort on, and ascending sets the order.

In [201]:
print(unsorted_df.sort_index(axis=1, ascending=True)) # Sorting based on column index 
print(unsorted_df.sort_index(axis=0, ascending=False)) # Sorting based on row index 

       col1      col2
1  0.341214 -0.470004
4  1.361560 -0.340269
6 -0.924482 -0.133254
2 -0.102982  0.178383
3 -0.313308 -0.094583
5  0.185028 -2.222339
9 -0.076374  0.978152
8  0.048023 -0.340791
0  0.426714 -0.932065
7  0.135429 -1.363654
       col2      col1
9  0.978152 -0.076374
8 -0.340791  0.048023
7 -1.363654  0.135429
6 -0.133254 -0.924482
5 -2.222339  0.185028
4 -0.340269  1.361560
3 -0.094583 -0.313308
2  0.178383 -0.102982
1 -0.470004  0.341214
0 -0.932065  0.426714


* By value: sort_values takes a 'by' argument naming the column (or list of columns) whose values determine the order.

In [206]:
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df_one_column = unsorted_df.sort_values(by='col1') 
sorted_df_one_column

   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1


In [207]:
sorted_df_multiple_columns = unsorted_df.sort_values(by=['col2','col1'], ascending=False) 
sorted_df_multiple_columns

   col1  col2
3     1     4
1     1     3
2     1     2
0     2     1
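
The ascending argument can also be a list with one entry per column in 'by', which allows a different direction for each sort key. A minimal sketch (the name `mixed` is illustrative):

```python
import pandas as pd

unsorted_df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# col1 ascending; ties in col1 are broken by col2 descending.
mixed = unsorted_df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(mixed)
```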


## Grouping in Pandas

In [3]:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
df

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015
11     690     2  Riders  2017


In [6]:
# What are the current groups?
df.groupby('Team').groups

{'Devils': Int64Index([2, 3], dtype='int64'),
 'Kings': Int64Index([4, 6, 7], dtype='int64'),
 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
 'Royals': Int64Index([9, 10], dtype='int64'),
 'kings': Int64Index([5], dtype='int64')}
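
Note that 'kings' and 'Kings' form separate groups because grouping compares labels exactly. If the lower-case spelling is a data-entry variant (an assumption; the notebook does not say), one possible approach is to group on a normalised copy of the column. The sketch below uses a cut-down frame with a few of the rows:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Kings', 'kings', 'Kings'],
                   'Points': [876, 741, 812, 756]})

# Group on a normalised copy of the labels instead of the raw column.
normalised = df['Team'].str.title()
totals = df.groupby(normalised)['Points'].sum()
print(totals)
```

The original column is left unchanged; only the grouping key is normalised.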

In [9]:
df.groupby(['Team','Year']).groups

{('Devils', 2014): Int64Index([2], dtype='int64'),
 ('Devils', 2015): Int64Index([3], dtype='int64'),
 ('Kings', 2014): Int64Index([4], dtype='int64'),
 ('Kings', 2016): Int64Index([6], dtype='int64'),
 ('Kings', 2017): Int64Index([7], dtype='int64'),
 ('Riders', 2014): Int64Index([0], dtype='int64'),
 ('Riders', 2015): Int64Index([1], dtype='int64'),
 ('Riders', 2016): Int64Index([8], dtype='int64'),
 ('Riders', 2017): Int64Index([11], dtype='int64'),
 ('Royals', 2014): Int64Index([9], dtype='int64'),
 ('Royals', 2015): Int64Index([10], dtype='int64'),
 ('kings', 2015): Int64Index([5], dtype='int64')}

In [27]:
grouped = df.groupby(['Year', 'Team'])
for name,group in grouped:
    print("Name :", name)
    print("Group: ", group)

Name : (2014, 'Devils')
Group:     Points  Rank    Team  Year
2     863     2  Devils  2014
Name : (2014, 'Kings')
Group:     Points  Rank   Team  Year
4     741     3  Kings  2014
Name : (2014, 'Riders')
Group:     Points  Rank    Team  Year
0     876     1  Riders  2014
Name : (2014, 'Royals')
Group:     Points  Rank    Team  Year
9     701     4  Royals  2014
Name : (2015, 'Devils')
Group:     Points  Rank    Team  Year
3     673     3  Devils  2015
Name : (2015, 'Riders')
Group:     Points  Rank    Team  Year
1     789     2  Riders  2015
Name : (2015, 'Royals')
Group:      Points  Rank    Team  Year
10     804     1  Royals  2015
Name : (2015, 'kings')
Group:     Points  Rank   Team  Year
5     812     4  kings  2015
Name : (2016, 'Kings')
Group:     Points  Rank   Team  Year
6     756     1  Kings  2016
Name : (2016, 'Riders')
Group:     Points  Rank    Team  Year
8     694     2  Riders  2016
Name : (2017, 'Kings')
Group:     Points  Rank   Team  Year
7     788     1  Kings  2017
Name : (2017, 'Riders')
Group:      Points  Rank    Team  Year
11     690     2  Riders  2017

In [28]:
grouped.groups

{(2014, 'Devils'): Int64Index([2], dtype='int64'),
 (2014, 'Kings'): Int64Index([4], dtype='int64'),
 (2014, 'Riders'): Int64Index([0], dtype='int64'),
 (2014, 'Royals'): Int64Index([9], dtype='int64'),
 (2015, 'Devils'): Int64Index([3], dtype='int64'),
 (2015, 'Riders'): Int64Index([1], dtype='int64'),
 (2015, 'Royals'): Int64Index([10], dtype='int64'),
 (2015, 'kings'): Int64Index([5], dtype='int64'),
 (2016, 'Kings'): Int64Index([6], dtype='int64'),
 (2016, 'Riders'): Int64Index([8], dtype='int64'),
 (2017, 'Kings'): Int64Index([7], dtype='int64'),
 (2017, 'Riders'): Int64Index([11], dtype='int64')}
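
Once the groups are defined, each one can be reduced to summary values with agg. As a sketch on a cut-down frame, named aggregation lets each output column get its own label; the output names `total`, `average` and `spread` are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
                   'Year': [2014, 2015, 2014, 2015],
                   'Points': [876, 789, 863, 673]})

# Named aggregation: keyword = (column, function name), one output column each.
summary = df.groupby('Team').agg(total=('Points', 'sum'),
                                 average=('Points', 'mean'),
                                 spread=('Points', 'std'))
print(summary)
```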

In [29]:
grouped['Points'].agg(['sum', 'mean', 'std']) # String names avoid the deprecation warning that recent pandas raises for np.sum, np.mean, np.std

             sum  mean  std
Year Team
2014 Devils  863   863  NaN
     Kings   741   741  NaN
     Riders  876   876  NaN
     Royals  701   701  NaN
2015 Devils  673   673  NaN
     Riders  789   789  NaN
     Royals  804   804  NaN
     kings   812   812  NaN
2016 Kings   756   756  NaN
     Riders  694   694  NaN
