# Pandas
- A library is a collection of functions and methods that lets you perform a wide range of tasks without writing the code yourself.
- Pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and tools for working with tabular and labeled data.

In [2]:
import pandas as pd
import numpy as np

__Important steps in `Machine learning`:__

1) Problem Statement

2) Data Gathering

3) Exploratory Data Analysis (EDA)

4) Feature Engineering

5) Feature Selection

6) Model Training

7) Model Evaluation

8) Deployment (Amazon Web Services, Google Cloud Platform, Azure)

__Pandas supports two types of Data Structures:__

    1) Series (1-D labeled array >> a single column of a DataFrame)
    2) DataFrame (2-D labeled structure >> a table with rows and columns)
    
__Use Cases of Pandas:__

    1) Reading csv/excel/json file
    2) EDA
    3) Data Cleaning
    4) Encoding(OneHot, Label)
    5) Join Tables(left, right, inner, outer)
    6) Feature selection
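
As a rough illustration of two of the use cases above (joining tables and one-hot encoding), the sketch below uses `pd.merge` and `pd.get_dummies`. The tables `employees` and `salaries` and their column names are invented for this example and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only
employees = pd.DataFrame({"emp_id": [1, 2, 3], "city": ["Pune", "Mumbai", "Pune"]})
salaries = pd.DataFrame({"emp_id": [2, 3, 4], "salary": [50000, 60000, 70000]})

# Inner join keeps only the emp_id values present in both tables
inner = pd.merge(employees, salaries, on="emp_id", how="inner")

# Left join keeps every row of `employees`; missing salaries become NaN
left = pd.merge(employees, salaries, on="emp_id", how="left")

# Outer join keeps the emp_id values found in either table
outer = pd.merge(employees, salaries, on="emp_id", how="outer")

# One-hot encoding: one indicator column per distinct city
encoded = pd.get_dummies(employees, columns=["city"])

print(inner)
print(left)
print(outer)
print(encoded)
```

A right join (`how="right"`) works like the left join with the roles of the two tables swapped.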

### A] Series Data structure
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)

#### Creating a Series
- `Syntax: pd.Series(data, index=[index labels], dtype=None)`
- `data` can be a dict, an ndarray (or list), or a scalar value.

#### 1) Creating a series data structure using dictionary

In [4]:
pd.Series({"a": 0, "b": 1, "c": 2})

a    0
b    1
c    2
dtype: int64

In [5]:
pd.Series({1: 'pune', 2: 'mumbai', 3: 'kolkata'})

1       pune
2     mumbai
3    kolkata
dtype: object

#### 2) Creating a series data structure using ndarray

In [7]:
pd.Series(np.random.randint(1,15,6),index=['a','b','c','d','e','f'])

a    8
b    8
c    8
d    7
e    5
f    5
dtype: int32

####  3) Creating a series data structure using scalar value

In [10]:
pd.Series(5,index=['a','b','c','d','e','f'])

a    5
b    5
c    5
d    5
e    5
f    5
dtype: int64

In [11]:
pd.Series(5,index=['a','b','c','d','e','f'],dtype=float)

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
f    5.0
dtype: float64

#### Some important series functions
- pandas.Series.dtype
- pandas.Series.to_list
- pandas.Series.add
- pandas.Series.sub
- pandas.Series.mul
- pandas.Series.div
- pandas.Series.floordiv
- pandas.Series.mod
- pandas.Series.pow
- pandas.Series.aggregate
- pandas.Series.max
- pandas.Series.min
- pandas.Series.mean
- pandas.Series.median
- pandas.Series.sum
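
A minimal sketch of how some of the functions listed above behave. The two Series `s1` and `s2` are made-up values chosen so that the results are easy to check by hand.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([3, 4, 5], index=["a", "b", "c"])

# Element-wise arithmetic; values are matched by index label
total = s1.add(s2)          # same as s1 + s2
difference = s1.sub(s2)     # same as s1 - s2
product = s1.mul(s2)        # same as s1 * s2
quotient = s1.floordiv(s2)  # same as s1 // s2
remainder = s1.mod(s2)      # same as s1 % s2
squared = s2.pow(2)         # same as s2 ** 2

# Several summary statistics in one call
summary = s1.aggregate(["min", "max", "mean", "median", "sum"])

print(total.to_list())
print(summary)
print(s1.dtype)
```

The method forms (`add`, `sub`, ...) are mainly useful when you need their extra arguments, such as `fill_value` for labels that are missing on one side.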

__Useful pandas Series functions applicable to strings__

- pandas.Series.str.contains: tests whether a pattern or regex is contained in each string of a Series or Index.
- pandas.Series.str.count: counts occurrences of a pattern in each string of the Series.

In [98]:
df = pd.read_csv('Movies.csv')
df.head(3)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."


In [100]:
df[df['title'].str.contains('Godfather')]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
711,7.6,The Godfather: Part III,R,Crime,162,"[u'Al Pacino', u'Diane Keaton', u'Andy Garcia']"


In [106]:
df['title'].str.count('Godfather').sum()

3

In [113]:
df[df['title'].str.contains('^T.....e$',regex=True)]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
974,7.4,Tootsie,PG,Comedy,116,"[u'Dustin Hoffman', u'Jessica Lange', u'Teri G..."
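
The cells above depend on `Movies.csv`. The sketch below repeats the same string operations on a small hand-written Series so that it runs without the file; the `titles` values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical titles, used so that the example runs without Movies.csv
titles = pd.Series(["The Godfather", "Tootsie", "The Godfather: Part II", "Jaws"])

# Boolean mask: True where the title contains the substring
mask = titles.str.contains("Godfather")

# Number of occurrences of the pattern in each title
counts = titles.str.count("o")

# Regex: starts with T, ends with e, exactly seven characters in total
seven_letters = titles[titles.str.contains(r"^T.....e$", regex=True)]

print(titles[mask].tolist())
print(counts.tolist())
print(seven_letters.tolist())
```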


### B] DataFrame Data structure
A DataFrame is a 2-dimensional labeled data structure whose columns can have different data types. The `pd.DataFrame` constructor accepts many kinds of input: a dict of 1-D ndarrays, lists, dicts, or Series; a 2-D np.ndarray; a Series; or another DataFrame. Data stored in CSV, Excel, or JSON files is loaded with reader functions such as `pd.read_csv`, `pd.read_excel`, and `pd.read_json`. Along with the data, you can optionally pass `index` (row labels) and `columns` (column labels) arguments.

#### Creating a DataFrame
`Syntax: pd.DataFrame(data,index=[index names],columns=[column names],dtype=None)`

#### 1) Creating dataframe using series

In [15]:
d = {"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])}


df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


#### 2) Creating dataframe using dictionary

In [16]:
dict1 = {'columnA' : [2,3,4,5,6],
        'columnB' : [20,10,30,40,50]}

df = pd.DataFrame(dict1)
df

Unnamed: 0,columnA,columnB
0,2,20
1,3,10
2,4,30
3,5,40
4,6,50


__Note:__ We can add a new column to an existing DataFrame.

In [17]:
df['columnC'] = [100,200,300,400,500]
df

Unnamed: 0,columnA,columnB,columnC
0,2,20,100
1,3,10,200
2,4,30,300
3,5,40,400
4,6,50,500


In [18]:
df['columnD'] = np.array([9,8,7,6,5])
df['columnE'] = pd.Series([4,5,6,7,8])
df

Unnamed: 0,columnA,columnB,columnC,columnD,columnE
0,2,20,100,9,4
1,3,10,200,8,5
2,4,30,300,7,6
3,5,40,400,6,7
4,6,50,500,5,8


#### 3) Creating dataframe using ndarray

In [19]:
array1 = np.random.randint(10,20, size = (5,4))

df = pd.DataFrame(array1)
df

Unnamed: 0,0,1,2,3
0,15,11,14,16
1,13,15,11,19
2,14,15,11,14
3,14,17,18,13
4,18,15,18,10


__Note:__ We can set or change the column names of a DataFrame with the `columns` argument.

In [21]:
# Row labels can be set in the same way with the `index` argument

df = pd.DataFrame(array1, columns = list('xyzw'), index = list('abcde'))
df

Unnamed: 0,x,y,z,w
a,15,11,14,16
b,13,15,11,19
c,14,15,11,14
d,14,17,18,13
e,18,15,18,10


### Different actions on a DataFrame

In [162]:
# Creating a dataframe

df = pd.DataFrame(np.random.rand(5,4),index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,0.740552,0.069665,0.090467,0.842638
B,0.088703,0.26159,0.424412,0.115564
C,0.834176,0.702328,0.901605,0.529934
D,0.437789,0.135471,0.358883,0.942487
E,0.389404,0.106656,0.913278,0.141947


#### 1) Accessing columns from a dataframe
- `Syntax: df[col_name]`
- `Syntax: df[[col_names]]`

In [18]:
df['W']

A    0.210532
B    0.620571
C    0.749588
D    0.763666
E    0.068134
Name: W, dtype: float64

In [163]:
df[['W','X','Y']] # Accessing more than one column

Unnamed: 0,W,X,Y
A,0.740552,0.069665,0.090467
B,0.088703,0.26159,0.424412
C,0.834176,0.702328,0.901605
D,0.437789,0.135471,0.358883
E,0.389404,0.106656,0.913278


__Note:__ 
Each column of a DataFrame is a Series, so a DataFrame can be thought of as a collection of Series that share the same index.

In [19]:
type(df) # Data type of dataframe

pandas.core.frame.DataFrame

In [20]:
type(df['W']) # Data type of a column from a dataframe

pandas.core.series.Series

#### 2) Accessing particular value from a dataframe

In [22]:
df['X']['A']

0.5851830605768354
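
`df['X']['A']` is chained indexing: it first selects the column and then the row label. That is fine for reading a value, but assigning through a chain may not update the original DataFrame. A small sketch of the single-step alternatives `.loc` and `.at`, using made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"W": [0.1, 0.2], "X": [0.3, 0.4]}, index=["A", "B"])

chained = df["X"]["A"]         # column first, then row label
by_label = df.loc["A", "X"]    # row label and column label in one step
fast_scalar = df.at["A", "X"]  # accessor meant for single values

# Assignment through .loc updates the DataFrame itself
df.loc["A", "X"] = 9.9

print(chained, by_label, fast_scalar, df.loc["A", "X"])
```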

#### 3) Accessing particular rows and columns (Slicing of dataframe)
- See the `df.loc` and `df.iloc` indexers, covered in section 9 below. Plain slicing such as `df[2:5]` selects rows by position.

In [164]:
df = pd.read_csv('Emp_Records.csv')

In [165]:
df[2:5]

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
2,428945,Joe,54.15,68,0.98,50155,Michigantown
3,408351,Diane,39.67,51,18.3,180294,Hydetown
4,193819,Benjamin,40.31,58,4.01,117642,Fremont


#### 4) Adding a new column to the existing Dataframe

In [23]:
# There are several ways to add a new column to an existing DataFrame.
# A list or ndarray must have the same length as the DataFrame, otherwise pandas raises a ValueError.
# A Series is aligned on the index instead: labels missing from the Series are filled with NaN (Not a Number).
# The same alignment applies when a DataFrame is built from a dict of Series or dicts of different lengths.
# df.insert() can also add a new column at a chosen position.

df['U'] = df['W'] + df['X']
df

Unnamed: 0,W,X,Y,Z,U
A,0.413474,0.585183,0.551758,0.595063,0.998657
B,0.692924,0.657976,0.723581,0.556687,1.3509
C,0.165744,0.366202,0.002618,0.261684,0.531946
D,0.126109,0.288785,0.356306,0.870329,0.414894
E,0.795726,0.703135,0.990024,0.252183,1.498861
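
The rules for column lengths can be checked with a short sketch. The frame and the column names `bad` and `partial` are assumptions made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"W": [1, 2, 3], "X": [10, 20, 30]}, index=["A", "B", "C"])

# A list or ndarray must match the number of rows
try:
    df["bad"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

# A Series is aligned on the index instead; missing labels become NaN
df["partial"] = pd.Series([100, 200], index=["A", "B"])

# df.insert(loc, column, value) places the new column at a given position
df.insert(0, "U", df["W"] + df["X"])

print(length_error)
print(df)
```

Plain assignment always appends the new column at the end, while `df.insert` lets you choose where it goes.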


In [31]:
# Example of columns with different lengths (dict of dicts):

df = pd.DataFrame({'col1':{'idx1':1, 'idx2':2, 'idx3':3}, 'col2':{'idx1':10, 'idx2':20, 'idx3':30, 'idx4':40}})
df

Unnamed: 0,col1,col2
idx1,1.0,10
idx2,2.0,20
idx3,3.0,30
idx4,,40


In [26]:
df['T'] = np.random.rand(5)
df

Unnamed: 0,W,X,Y,Z,U,T
A,0.413474,0.585183,0.551758,0.394888,0.998657,0.898964
B,0.692924,0.657976,0.723581,0.469942,1.3509,0.737078
C,0.165744,0.366202,0.002618,0.371572,0.531946,0.562436
D,0.126109,0.288785,0.356306,0.965962,0.414894,0.477934
E,0.795726,0.703135,0.990024,0.834909,1.498861,0.757445


In [268]:
# Creating a dataframe

df = pd.DataFrame(np.random.rand(5,4),index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,0.583216,0.637113,0.171992,0.392844
B,0.147656,0.469057,0.801657,0.204338
C,0.523303,0.86926,0.708504,0.178729
D,0.372295,0.313768,0.039861,0.031052
E,0.063425,0.918718,0.824541,0.734311


In [269]:
df['square'] = df['W']**2

In [270]:
df

Unnamed: 0,W,X,Y,Z,square
A,0.583216,0.637113,0.171992,0.392844,0.340141
B,0.147656,0.469057,0.801657,0.204338,0.021802
C,0.523303,0.86926,0.708504,0.178729,0.273846
D,0.372295,0.313768,0.039861,0.031052,0.138603
E,0.063425,0.918718,0.824541,0.734311,0.004023


#### 5) Converting dataframe to dictionary

In [32]:
# Creating dataframe

df = pd.DataFrame(np.random.rand(5,4),index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,0.420911,0.233213,0.200323,0.225828
B,0.214293,0.075275,0.487459,0.205354
C,0.618041,0.562916,0.055505,0.846818
D,0.192769,0.129582,0.151384,0.418846
E,0.314463,0.678283,0.198544,0.543618


In [33]:
# Converting dataframe to dictionary

df.to_dict()

{'W': {'A': 0.42091109951538486,
  'B': 0.2142930212533838,
  'C': 0.6180414265131713,
  'D': 0.19276949850772307,
  'E': 0.31446336997699964},
 'X': {'A': 0.23321279977024478,
  'B': 0.07527483863233442,
  'C': 0.5629161429446643,
  'D': 0.129581806991363,
  'E': 0.6782832366140314},
 'Y': {'A': 0.2003230724994517,
  'B': 0.4874586518561832,
  'C': 0.05550529129880377,
  'D': 0.15138357856132123,
  'E': 0.19854436369983752},
 'Z': {'A': 0.22582798304466467,
  'B': 0.2053537577241875,
  'C': 0.8468181771241226,
  'D': 0.4188456669068483,
  'E': 0.5436183411837239}}
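
`to_dict()` also takes an `orient` argument that changes the shape of the result. A minimal sketch on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"W": [1, 2], "X": [3, 4]}, index=["A", "B"])

as_columns = df.to_dict()                  # {column: {index: value}} (the default)
as_lists = df.to_dict(orient="list")       # {column: [values]}
as_records = df.to_dict(orient="records")  # one dict per row, index dropped
as_rows = df.to_dict(orient="index")       # {index: {column: value}}

print(as_columns)
print(as_lists)
print(as_records)
print(as_rows)
```

The `records` form is often convenient when the rows need to be serialized, for example to JSON.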

#### 6) Converting a column (Series) of a DataFrame to a list or dictionary

In [56]:
df = pd.read_csv(r'Emp_Records.csv')
df.head(5)

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,677509,Lois,36.36,60,13.68,168251,Denver
1,940761,Brenda,47.02,60,9.01,51063,Stonewall
2,428945,Joe,54.15,68,0.98,50155,Michigantown
3,408351,Diane,39.67,51,18.3,180294,Hydetown
4,193819,Benjamin,40.31,58,4.01,117642,Fremont


In [57]:
df['City'].tolist()

['Denver',
 'Stonewall',
 'Michigantown',
 'Hydetown',
 'Fremont',
 'Macksburg',
 'Atlanta',
 'Blanchester',
 'Delmita',
 'Eureka Springs',
 'Sabetha',
 'Las Vegas',
 'New Matamoras',
 'Maida',
 'Quecreek',
 'Beulaville',
 'New Douglas',
 'Toeterville',
 'Primm Springs',
 'Dutchtown',
 'Shreveport',
 'Heathsville',
 'Middleport',
 'Woodbury',
 'Saint Cloud',
 'Stockholm',
 'Manning',
 'Mount Vernon',
 'Lawrenceburg',
 'Mesa',
 'Panacea',
 'Kline',
 'Bonanza',
 'Liberty',
 'Ohatchee',
 'Nashville',
 'Eckerty',
 'Lima',
 'Wright',
 'Ellsworth',
 'Conroy',
 'Lake Charles',
 'Kalvesta',
 'Knoxville',
 'Rochester',
 'Bowling Green',
 'Uniontown',
 'Topeka',
 'New York City',
 'Banner',
 'East Saint Louis',
 'Hancock',
 'Eatontown',
 'Portage',
 'Oneida',
 'Bartley',
 'Arlee',
 'Cookeville',
 'Saxe',
 'Mc Calla',
 'Trego',
 'Blair',
 'Riverside',
 'Maxwell',
 'Randallstown',
 'Willow Beach',
 'Granger',
 'Alcoa',
 'Baton Rouge',
 'Browning',
 'Hudson',
 'Kansas City',
 'Haswell',
 'Eckert',
 ...]


In [58]:
df['City'].to_dict()

{0: 'Denver',
 1: 'Stonewall',
 2: 'Michigantown',
 3: 'Hydetown',
 4: 'Fremont',
 5: 'Macksburg',
 6: 'Atlanta',
 7: 'Blanchester',
 8: 'Delmita',
 9: 'Eureka Springs',
 10: 'Sabetha',
 11: 'Las Vegas',
 12: 'New Matamoras',
 13: 'Maida',
 14: 'Quecreek',
 15: 'Beulaville',
 16: 'New Douglas',
 17: 'Toeterville',
 18: 'Primm Springs',
 19: 'Dutchtown',
 20: 'Shreveport',
 21: 'Heathsville',
 22: 'Middleport',
 23: 'Woodbury',
 24: 'Saint Cloud',
 25: 'Stockholm',
 26: 'Manning',
 27: 'Mount Vernon',
 28: 'Lawrenceburg',
 29: 'Mesa',
 30: 'Panacea',
 31: 'Kline',
 32: 'Bonanza',
 33: 'Liberty',
 34: 'Ohatchee',
 35: 'Nashville',
 36: 'Eckerty',
 37: 'Lima',
 38: 'Wright',
 39: 'Ellsworth',
 40: 'Conroy',
 41: 'Lake Charles',
 42: 'Kalvesta',
 43: 'Knoxville',
 44: 'Rochester',
 45: 'Bowling Green',
 46: 'Uniontown',
 47: 'Topeka',
 48: 'New York City',
 49: 'Banner',
 50: 'East Saint Louis',
 51: 'Hancock',
 52: 'Eatontown',
 53: 'Portage',
 54: 'Oneida',
 55: 'Bartley',
 56: 'Arlee',
 ...}


#### 7) Removing a column from a dataframe

In [34]:
# Creating dataframe

df = pd.DataFrame(np.random.rand(5,4),index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,0.196368,0.323805,0.444712,0.396279
B,0.563008,0.239274,0.535924,0.693353
C,0.74666,0.219069,0.969917,0.478697
D,0.168447,0.758104,0.049699,0.699971
E,0.508039,0.169921,0.340891,0.488396


In [37]:
df.drop(['Z'],axis=1)

Unnamed: 0,W,X,Y
A,0.196368,0.323805,0.444712
B,0.563008,0.239274,0.535924
C,0.74666,0.219069,0.969917
D,0.168447,0.758104,0.049699
E,0.508039,0.169921,0.340891


#### 8) Removing a row from a dataframe
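`df.drop()` returns a new DataFrame and leaves the original unchanged. To modify the original, pass `inplace=True` or assign the result back to `df`. This applies to dropping columns (`axis=1`) as well as rows (`axis=0`).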

In [39]:
df = pd.DataFrame(np.random.rand(5,4),index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,0.944594,0.322801,0.615792,0.67898
B,0.008019,0.491135,0.326887,0.89342
C,0.096418,0.13345,0.461565,0.331983
D,0.079746,0.675512,0.755037,0.849787
E,0.178858,0.296672,0.484824,0.236795


In [40]:
df.drop(['E'],axis=0)

Unnamed: 0,W,X,Y,Z
A,0.944594,0.322801,0.615792,0.67898
B,0.008019,0.491135,0.326887,0.89342
C,0.096418,0.13345,0.461565,0.331983
D,0.079746,0.675512,0.755037,0.849787


In [41]:
df # The row is still there: drop() returned a new DataFrame

Unnamed: 0,W,X,Y,Z
A,0.944594,0.322801,0.615792,0.67898
B,0.008019,0.491135,0.326887,0.89342
C,0.096418,0.13345,0.461565,0.331983
D,0.079746,0.675512,0.755037,0.849787
E,0.178858,0.296672,0.484824,0.236795


In [42]:
df.drop(['E'],axis=0,inplace=True)
df # Now the original DataFrame itself has been modified

Unnamed: 0,W,X,Y,Z
A,0.944594,0.322801,0.615792,0.67898
B,0.008019,0.491135,0.326887,0.89342
C,0.096418,0.13345,0.461565,0.331983
D,0.079746,0.675512,0.755037,0.849787
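
A compact sketch of the three ways of calling `drop`, on a small made-up frame. It uses the `index=` and `columns=` keywords, which are equivalent to passing labels with `axis=0` and `axis=1`.

```python
import pandas as pd

df = pd.DataFrame({"W": [1, 2, 3], "Z": [7, 8, 9]}, index=["A", "B", "C"])

# Without inplace=True, drop() returns a new DataFrame and leaves df unchanged
without_c = df.drop(index=["C"])
without_z = df.drop(columns=["Z"])
print(df.shape, without_c.shape, without_z.shape)

# Reassigning the result is an alternative to inplace=True
df = df.drop(index=["C"])

# With inplace=True the call returns None and modifies df directly
result = df.drop(columns=["Z"], inplace=True)
print(result, list(df.columns), list(df.index))
```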


#### 9) Selecting a part from a dataframe

##### df.loc & df.iloc
`df.loc` and `df.iloc` are indexers used to slice a DataFrame and select data from it. They are used with square brackets and can also filter data according to conditions.
- `df.loc[]` is a __label-based__ selector: you pass the names (labels) of the rows or columns you want to select. A label slice includes the last element of the range. It also accepts a boolean array or boolean Series as a mask.
- `df.iloc[]` is an __integer-position-based__ selector: you pass integer positions to select specific rows or columns. A position slice does not include the last element of the range. It accepts a boolean array as a mask, but not a boolean Series.
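
The difference in slice endpoints and in boolean handling can be seen in a small sketch. The string row labels `r0` to `r3` are an assumption, chosen so that labels and positions cannot be confused.

```python
import pandas as pd

df = pd.DataFrame({"Brand": ["Maruti", "Hyundai", "Tata", "Mahindra"],
                   "Mileage": [28, 27, 25, 26]},
                  index=["r0", "r1", "r2", "r3"])

by_label = df.loc["r1":"r3"]  # label slice: both endpoints included
by_position = df.iloc[1:3]    # position slice: end position excluded

mask = df["Mileage"] > 25     # boolean Series
loc_filtered = df.loc[mask]
# .iloc takes a plain boolean array; a boolean Series is rejected
iloc_filtered = df.iloc[mask.to_numpy()]

print(by_label)
print(by_position)
print(loc_filtered)
```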

##### i) df.loc()
- `Syntax: df.loc[row_labels, column_labels]`

In [43]:
# Creating Dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})
df

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
1,Hyundai,2014,30000,Delhi,27
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,24
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


In [51]:
# Selecting particular rows and columns only

df.loc[[1,2],['Brand','City']]

Unnamed: 0,Brand,City
1,Hyundai,Delhi
2,Tata,Mumbai


In [45]:
# Selecting a range of rows from the DataFrame

df.loc[2 : 5] # These are row labels, not positions; both endpoints are included

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29


In [46]:
# Selecting data according to some conditions

df.loc[(df.Brand == 'Maruti') & (df.Mileage > 25)]

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
4,Maruti,2012,10000,Mumbai,28


In [48]:
# Updating the value of any column

df.loc[(df.Year < 2015), ['Mileage']] = 22
df

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,22
1,Hyundai,2014,30000,Delhi,22
2,Tata,2011,60000,Mumbai,22
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,22
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,22
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


##### ii) df.iloc()
- `Syntax: df.iloc[row_index, column_index]`
- Positions are sliced the same way as in NumPy ndarray slicing (the end position is excluded).

In [3]:
# Creating Dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})
df

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
1,Hyundai,2014,30000,Delhi,27
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,24
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


In [75]:
# Selecting rows using integer indices

df.iloc[[0, 2, 4, 7]]

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
2,Tata,2011,60000,Mumbai,25
4,Maruti,2012,10000,Mumbai,28
7,Tata,2018,15000,Chennai,21


In [50]:
# Selecting a range of columns and rows simultaneously

df.iloc[1 : 5, 2 : 5] # row,column

Unnamed: 0,Kms Driven,City,Mileage
1,30000,Delhi,27
2,60000,Mumbai,25
3,25000,Delhi,26
4,10000,Mumbai,28


In [168]:
df.iloc[1:5,:]

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
1,Hyundai,2014,30000,Delhi,27
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28


In [171]:
df.iloc[:, :]

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
1,Hyundai,2014,30000,Delhi,27
2,Tata,2011,60000,Mumbai,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,24
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


In [7]:
# Any value can be replaced with a missing value (NaN)

df.iloc[2, 3] = np.nan
df

Unnamed: 0,Brand,Year,Kms Driven,City,Mileage
0,Maruti,2012,50000,Gurgaon,28
1,Hyundai,2014,30000,Delhi,27
2,Tata,2011,60000,,25
3,Mahindra,2015,25000,Delhi,26
4,Maruti,2012,10000,Mumbai,28
5,Hyundai,2016,46000,Delhi,29
6,Renault,2014,31000,Mumbai,24
7,Tata,2018,15000,Chennai,21
8,Maruti,2019,12000,Ghaziabad,24


##### iii) Conditional selection

In [52]:
# Creating dataframe

df = pd.DataFrame(np.random.rand(5,4),index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,0.294056,0.103348,0.859628,0.390841
B,0.928865,0.3557,0.217769,0.805113
C,0.631647,0.659225,0.674658,0.621321
D,0.453323,0.926354,0.918153,0.957451
E,0.91923,0.863915,0.865744,0.103732


In [54]:
# Single condition

df > 0.5 # We get a boolean DataFrame

Unnamed: 0,W,X,Y,Z
A,False,False,True,False
B,True,False,False,True
C,True,True,True,True
D,False,True,True,True
E,True,True,True,False


In [55]:
df[df>0.5]

Unnamed: 0,W,X,Y,Z
A,,,0.859628,
B,0.928865,,,0.805113
C,0.631647,0.659225,0.674658,0.621321
D,,0.926354,0.918153,0.957451
E,0.91923,0.863915,0.865744,


In [61]:
# A row condition followed by a column selection

df[df['Y']>0.5][['W','X']]

Unnamed: 0,W,X
A,0.294056,0.103348
C,0.631647,0.659225
D,0.453323,0.926354
E,0.91923,0.863915


In [72]:
# Multiple conditions using `OR`

df[(df['Z']>0.7) | (df['Y']>0.7)]

Unnamed: 0,W,X,Y,Z
A,0.294056,0.103348,0.859628,0.390841
B,0.928865,0.3557,0.217769,0.805113
D,0.453323,0.926354,0.918153,0.957451
E,0.91923,0.863915,0.865744,0.103732


In [73]:
# Multiple conditions using `AND`

df[(df['Z']>0.7) & (df['Y']>0.6)]

Unnamed: 0,W,X,Y,Z
D,0.453323,0.926354,0.918153,0.957451
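
Because the cells above use random data, here is the same idea on fixed, made-up values. The sketch also shows negation with `~` and why Python's `and` cannot be used in place of `&`.

```python
import pandas as pd

df = pd.DataFrame({"Y": [0.9, 0.2, 0.7, 0.1],
                   "Z": [0.4, 0.8, 0.9, 0.2]},
                  index=["A", "B", "C", "D"])

either = df[(df["Z"] > 0.7) | (df["Y"] > 0.7)]      # OR
both = df[(df["Z"] > 0.7) & (df["Y"] > 0.6)]        # AND
neither = df[~((df["Z"] > 0.7) | (df["Y"] > 0.7))]  # NOT

# Python's `and` needs a single True/False, so it fails on a whole Series
try:
    df[(df["Z"] > 0.7) and (df["Y"] > 0.6)]
    and_failed = False
except ValueError:
    and_failed = True

print(list(either.index), list(both.index), list(neither.index), and_failed)
```

Each condition needs its own parentheses because `&` and `|` bind more tightly than comparison operators such as `>`.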


### Sorting Dataframe

- `df.sort_index()`
- `df.sort_values()`
- `df.reset_index()`
- `df.set_index()`

#### 1) df.sort_index()
- `Syntax: df.sort_index(axis=0, ascending=True, inplace=False, ignore_index=False)`
- returns a new DataFrame sorted by label if inplace argument is False, otherwise updates the original DataFrame and returns None.
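
A minimal sketch of these behaviours on a small frame with deliberately unordered labels (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({"b": [1, 2, 3], "a": [4, 5, 6]}, index=[2, 0, 1])

rows_sorted = df.sort_index()               # row labels in order: 0, 1, 2
cols_sorted = df.sort_index(axis=1)         # column labels in order: a, b
rows_desc = df.sort_index(ascending=False)  # row labels in order: 2, 1, 0

# inplace=True modifies df itself and returns None
returned = df.sort_index(inplace=True)

print(rows_sorted)
print(cols_sorted)
print(returned, list(df.index))
```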

In [15]:
df = pd.read_csv('Emp_Records.csv') # Creating a dataframe
df

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,677509,Lois,36.36,60,13.68,168251,Denver
1,940761,Brenda,47.02,60,9.01,51063,Stonewall
2,428945,Joe,54.15,68,0.98,50155,Michigantown
3,408351,Diane,39.67,51,18.30,180294,Hydetown
4,193819,Benjamin,40.31,58,4.01,117642,Fremont
...,...,...,...,...,...,...,...
95,639892,Jose,22.82,89,1.05,129774,Biloxi
96,704709,Harold,32.61,77,5.93,156194,Carol Stream
97,461593,Nicole,52.66,60,28.53,95673,Detroit
98,392491,Theresa,29.60,57,6.99,51015,Mc Grath


In [14]:
df.sort_index(axis = 0) # Sort rows

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,677509,Lois,36.36,60,13.68,168251,Denver
1,940761,Brenda,47.02,60,9.01,51063,Stonewall
2,428945,Joe,54.15,68,0.98,50155,Michigantown
3,408351,Diane,39.67,51,18.30,180294,Hydetown
4,193819,Benjamin,40.31,58,4.01,117642,Fremont
...,...,...,...,...,...,...,...
95,639892,Jose,22.82,89,1.05,129774,Biloxi
96,704709,Harold,32.61,77,5.93,156194,Carol Stream
97,461593,Nicole,52.66,60,28.53,95673,Detroit
98,392491,Theresa,29.60,57,6.99,51015,Mc Grath


In [298]:
df.sort_index(axis = 1) # Sort columns

Unnamed: 0,Age in Company,Age in Yrs,City,Emp ID,First Name,Salary,Weight in Kgs
0,13.68,36.36,Denver,677509,Lois,168251,60
1,9.01,47.02,Stonewall,940761,Brenda,51063,60
2,0.98,54.15,Michigantown,428945,Joe,50155,68
3,18.30,39.67,Hydetown,408351,Diane,180294,51
4,4.01,40.31,Fremont,193819,Benjamin,117642,58
...,...,...,...,...,...,...,...
95,1.05,22.82,Biloxi,639892,Jose,129774,89
96,5.93,32.61,Carol Stream,704709,Harold,156194,77
97,28.53,52.66,Detroit,461593,Nicole,95673,60
98,6.99,29.60,Mc Grath,392491,Theresa,51015,57


In [299]:
df.sort_index(axis = 1, inplace = True) # Changes are applied to the original DataFrame
df

Unnamed: 0,Age in Company,Age in Yrs,City,Emp ID,First Name,Salary,Weight in Kgs
0,13.68,36.36,Denver,677509,Lois,168251,60
1,9.01,47.02,Stonewall,940761,Brenda,51063,60
2,0.98,54.15,Michigantown,428945,Joe,50155,68
3,18.30,39.67,Hydetown,408351,Diane,180294,51
4,4.01,40.31,Fremont,193819,Benjamin,117642,58
...,...,...,...,...,...,...,...
95,1.05,22.82,Biloxi,639892,Jose,129774,89
96,5.93,32.61,Carol Stream,704709,Harold,156194,77
97,28.53,52.66,Detroit,461593,Nicole,95673,60
98,6.99,29.60,Mc Grath,392491,Theresa,51015,57


#### 2) df.sort_values()
- `Syntax: df.sort_values(by, axis=0, ascending=True, inplace=False, ignore_index=False)`
- Sorts by the values of the column(s) named in `by` (or, with `axis=1`, by the values of the given row label(s)).
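
A short sketch of `sort_values` that does not need `Emp_Records.csv`. The small frame below is an assumption that only loosely mirrors the employee data.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Lois", "Joe", "Diane", "Ann"],
                   "weight": [60, 68, 60, 51],
                   "salary": [168251, 50155, 180294, 130014]})

# Sort by a single column
by_salary = df.sort_values(by="salary")

# Sort by weight ascending; ties are broken by salary descending
mixed = df.sort_values(by=["weight", "salary"], ascending=[True, False])

# ignore_index=True renumbers the rows 0..n-1 after sorting
renumbered = df.sort_values(by="name", ignore_index=True)

print(by_salary)
print(mixed)
print(renumbered)
```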

In [22]:
df = pd.read_csv('Emp_Records.csv') # Creating a dataframe
df

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,677509,Lois,36.36,60,13.68,168251,Denver
1,940761,Brenda,47.02,60,9.01,51063,Stonewall
2,428945,Joe,54.15,68,0.98,50155,Michigantown
3,408351,Diane,39.67,51,18.30,180294,Hydetown
4,193819,Benjamin,40.31,58,4.01,117642,Fremont
...,...,...,...,...,...,...,...
95,639892,Jose,22.82,89,1.05,129774,Biloxi
96,704709,Harold,32.61,77,5.93,156194,Carol Stream
97,461593,Nicole,52.66,60,28.53,95673,Detroit
98,392491,Theresa,29.60,57,6.99,51015,Mc Grath


In [301]:
df.sort_values('Age in Yrs') # Sorting values of a specified column

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
13,301576,Wayne,21.10,87,0.02,92758,Maida
49,879753,Pamela,21.30,47,0.13,149262,Banner
59,750173,Antonio,21.93,82,0.24,181646,Mc Calla
6,539712,Nancy,22.14,50,0.87,98189,Atlanta
11,153989,Jack,22.21,61,0.56,82965,Las Vegas
...,...,...,...,...,...,...,...
74,528673,Paul,58.43,60,22.10,145235,Blue River
7,380086,Carol,59.12,40,34.52,60918,Blanchester
57,515103,Anne,59.27,48,14.01,114426,Cookeville
24,560455,Carolyn,59.42,53,16.08,42005,Saint Cloud


In [302]:
df.sort_values(['Weight in Kgs','Age in Company']) # Sorting by more than one column

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
75,765850,Linda,25.96,40,0.20,113256,Albion
47,524896,Judy,56.38,40,5.59,133332,Topeka
41,227922,Amanda,35.02,40,10.28,114257,Lake Charles
7,380086,Carol,59.12,40,34.52,60918,Blanchester
51,447813,Ann,28.23,41,3.69,130014,Hancock
...,...,...,...,...,...,...,...
13,301576,Wayne,21.10,87,0.02,92758,Maida
82,761821,Ernest,32.77,87,2.49,176675,Saranac Lake
87,623929,Jimmy,50.70,87,9.63,120631,Oriskany
95,639892,Jose,22.82,89,1.05,129774,Biloxi


In [303]:
df.sort_values('City',ascending = False) # Sorting in a descending order

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
38,726264,Carl,43.63,90,10.14,162159,Wright
23,388642,Ruby,37.27,59,3.91,160623,Woodbury
65,917395,Christopher,57.37,62,19.73,190765,Willow Beach
85,476433,Lillian,42.79,55,17.65,149878,Wichita
92,969964,Janice,37.57,56,0.93,147641,Whiteman Air Force Base
...,...,...,...,...,...,...,...
6,539712,Nancy,22.14,50,0.87,98189,Atlanta
56,904898,Ann,24.61,44,0.45,182521,Arlee
99,495141,Tammy,38.38,55,2.26,93650,Alma
67,316110,Jeremy,52.51,52,11.33,178847,Alcoa


In [25]:
df.sort_values('First Name',ascending = True, ignore_index= True,inplace = True)
df

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,218791,Aaron,48.49,56,20.08,54402,Eckerty
1,247137,Alan,38.65,80,11.15,154810,Knoxville
2,227922,Amanda,35.02,40,10.28,114257,Lake Charles
3,893212,Amy,36.14,52,14.17,112715,Kline
4,183071,Andrea,32.08,52,7.28,54179,Granger
...,...,...,...,...,...,...,...
95,214352,Theresa,24.66,59,2.52,197537,Toeterville
96,622406,Thomas,49.85,73,19.15,73862,Dutchtown
97,917937,Todd,25.93,74,1.22,163560,Randallstown
98,301576,Wayne,21.10,87,0.02,92758,Maida


#### 3) df.reset_index()
Resets the index to the default 0, 1, 2, 3, ... By default the previous index is kept as a new column, which you can drop later; pass `drop=True` to discard it right away.

`What is the use?`
- Suppose three different modules [module1, module2, module3] produce [df1, df2, df3] with the same column headers but different indexes. Resetting the indexes gives the frames a common default index, which makes them easier to combine and handle together.
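
A compact sketch of that situation with two tiny made-up frames: it contrasts keeping and dropping the old index, then combines the frames.

```python
import pandas as pd

df1 = pd.DataFrame({"W": [1, 2]}, index=["A", "B"])
df2 = pd.DataFrame({"W": [3, 4]}, index=["F", "G"])

kept = df1.reset_index()              # old labels move to a column named "index"
dropped = df1.reset_index(drop=True)  # old labels are discarded

# After resetting, both frames share the default index;
# ignore_index=True renumbers the combined rows 0..n-1
combined = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)],
                     ignore_index=True)

print(kept)
print(dropped)
print(combined)
```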

In [26]:
df = pd.read_csv('Emp_Records.csv')
df.sort_values('City', inplace = True)
df

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
75,765850,Linda,25.96,40,0.20,113256,Albion
67,316110,Jeremy,52.51,52,11.33,178847,Alcoa
99,495141,Tammy,38.38,55,2.26,93650,Alma
56,904898,Ann,24.61,44,0.45,182521,Arlee
6,539712,Nancy,22.14,50,0.87,98189,Atlanta
...,...,...,...,...,...,...,...
92,969964,Janice,37.57,56,0.93,147641,Whiteman Air Force Base
85,476433,Lillian,42.79,55,17.65,149878,Wichita
65,917395,Christopher,57.37,62,19.73,190765,Willow Beach
23,388642,Ruby,37.27,59,3.91,160623,Woodbury


In [28]:
df.reset_index(drop=True, inplace = True)
df

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,765850,Linda,25.96,40,0.20,113256,Albion
1,316110,Jeremy,52.51,52,11.33,178847,Alcoa
2,495141,Tammy,38.38,55,2.26,93650,Alma
3,904898,Ann,24.61,44,0.45,182521,Arlee
4,539712,Nancy,22.14,50,0.87,98189,Atlanta
...,...,...,...,...,...,...,...
95,969964,Janice,37.57,56,0.93,147641,Whiteman Air Force Base
96,476433,Lillian,42.79,55,17.65,149878,Wichita
97,917395,Christopher,57.37,62,19.73,190765,Willow Beach
98,388642,Ruby,37.27,59,3.91,160623,Woodbury


In [96]:
df1 = pd.DataFrame(np.random.rand(5,4),index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
df2 = pd.DataFrame(np.random.rand(5,4),index=['F','G','H','I','J'], columns=['W','X','Y','Z'])
df3 = pd.DataFrame(np.random.rand(5,4),index=['K','L','M','N','O'], columns=['W','X','Y','Z'])

In [97]:
df1

Unnamed: 0,W,X,Y,Z
A,0.56477,0.010157,0.287851,0.072021
B,0.799363,0.018298,0.852233,0.319509
C,0.141853,0.599876,0.553709,0.65261
D,0.397452,0.166599,0.41042,0.107775
E,0.272057,0.343694,0.601674,0.341927


In [98]:
df1 = df1.reset_index()
df1

Unnamed: 0,index,W,X,Y,Z
0,A,0.56477,0.010157,0.287851,0.072021
1,B,0.799363,0.018298,0.852233,0.319509
2,C,0.141853,0.599876,0.553709,0.65261
3,D,0.397452,0.166599,0.41042,0.107775
4,E,0.272057,0.343694,0.601674,0.341927


In [99]:
df2

Unnamed: 0,W,X,Y,Z
F,0.346012,0.769198,0.772386,0.830127
G,0.987749,0.380816,0.448147,0.495955
H,0.813551,0.18722,0.31044,0.325886
I,0.920826,0.319994,0.253002,0.951507
J,0.429303,0.058789,0.681002,0.376428


In [100]:
df2 = df2.reset_index()
df2

Unnamed: 0,index,W,X,Y,Z
0,F,0.346012,0.769198,0.772386,0.830127
1,G,0.987749,0.380816,0.448147,0.495955
2,H,0.813551,0.18722,0.31044,0.325886
3,I,0.920826,0.319994,0.253002,0.951507
4,J,0.429303,0.058789,0.681002,0.376428


In [101]:
df3

Unnamed: 0,W,X,Y,Z
K,0.988927,0.559227,0.663069,0.423315
L,0.784795,0.11459,0.879515,0.066794
M,0.562109,0.465305,0.021467,0.603401
N,0.72388,0.46492,0.358805,0.04253
O,0.61522,0.529327,0.946885,0.332672


In [102]:
df3 = df3.reset_index()
df3

Unnamed: 0,index,W,X,Y,Z
0,K,0.988927,0.559227,0.663069,0.423315
1,L,0.784795,0.11459,0.879515,0.066794
2,M,0.562109,0.465305,0.021467,0.603401
3,N,0.72388,0.46492,0.358805,0.04253
4,O,0.61522,0.529327,0.946885,0.332672


In [114]:
# Using concat

df4 = pd.concat([df1,df2,df3],ignore_index=True,axis=0)
df4

Unnamed: 0,index,W,X,Y,Z
0,A,0.56477,0.010157,0.287851,0.072021
1,B,0.799363,0.018298,0.852233,0.319509
2,C,0.141853,0.599876,0.553709,0.65261
3,D,0.397452,0.166599,0.41042,0.107775
4,E,0.272057,0.343694,0.601674,0.341927
5,F,0.346012,0.769198,0.772386,0.830127
6,G,0.987749,0.380816,0.448147,0.495955
7,H,0.813551,0.18722,0.31044,0.325886
8,I,0.920826,0.319994,0.253002,0.951507
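9,J,0.429303,0.058789,0.681002,0.376428
10,K,0.988927,0.559227,0.663069,0.423315
11,L,0.784795,0.11459,0.879515,0.066794
12,M,0.562109,0.465305,0.021467,0.603401
13,N,0.72388,0.46492,0.358805,0.04253
14,O,0.61522,0.529327,0.946885,0.332672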


In [111]:
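# Using append (pandas < 2.0 only)
# DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0.
# On older versions: df4 = df1.append([df2, df3], ignore_index=True)
# pd.concat() gives the same result on all versions:

df4 = pd.concat([df1, df2, df3], ignore_index=True)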
df4

Unnamed: 0,index,W,X,Y,Z
0,A,0.56477,0.010157,0.287851,0.072021
1,B,0.799363,0.018298,0.852233,0.319509
2,C,0.141853,0.599876,0.553709,0.65261
3,D,0.397452,0.166599,0.41042,0.107775
4,E,0.272057,0.343694,0.601674,0.341927
5,F,0.346012,0.769198,0.772386,0.830127
6,G,0.987749,0.380816,0.448147,0.495955
7,H,0.813551,0.18722,0.31044,0.325886
8,I,0.920826,0.319994,0.253002,0.951507
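9,J,0.429303,0.058789,0.681002,0.376428
10,K,0.988927,0.559227,0.663069,0.423315
11,L,0.784795,0.11459,0.879515,0.066794
12,M,0.562109,0.465305,0.021467,0.603401
13,N,0.72388,0.46492,0.358805,0.04253
14,O,0.61522,0.529327,0.946885,0.332672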


In [112]:
df4.drop(['index'],axis=1)

Unnamed: 0,W,X,Y,Z
0,0.56477,0.010157,0.287851,0.072021
1,0.799363,0.018298,0.852233,0.319509
2,0.141853,0.599876,0.553709,0.65261
3,0.397452,0.166599,0.41042,0.107775
4,0.272057,0.343694,0.601674,0.341927
5,0.346012,0.769198,0.772386,0.830127
6,0.987749,0.380816,0.448147,0.495955
7,0.813551,0.18722,0.31044,0.325886
8,0.920826,0.319994,0.253002,0.951507
9,0.429303,0.058789,0.681002,0.376428


In [12]:
df1 = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'math': [60,89,82,70],
    'physics': [66,95,83,66],
    'chemistry': [61,91,77,70]})

df2 = pd.DataFrame({
    'name': ['E', 'F', 'G', 'H'],
    'math': [66,95,83,66],
    'physics': [60,89,82,70],
    'chemistry': [90,81,78,90]})

df2

Unnamed: 0,name,math,physics,chemistry
0,E,66,60,90
1,F,95,89,81
2,G,83,82,78
3,H,66,70,90


In [10]:
df3 = pd.concat([df1,df2],ignore_index=True)
df3

Unnamed: 0,name,math,physics,chemistry
0,A,60,66,61
1,B,89,95,91
2,C,82,83,77
3,D,70,66,70
4,E,66,60,90
5,F,95,89,81
6,G,83,82,78
7,H,66,70,90


#### 4) df.set_index()
- `Syntax: df.set_index(keys, drop=True, inplace=False)`
- Sets the DataFrame index using one or more existing columns (`keys`).

In [59]:
df1 = pd.DataFrame(np.random.rand(5,4), columns=['W','X','Y','Z'])
df1

Unnamed: 0,W,X,Y,Z
0,0.77079,0.575293,0.868616,0.428374
1,0.082035,0.336775,0.512704,0.273871
2,0.509946,0.720662,0.268876,0.139185
3,0.065676,0.198348,0.673913,0.364173
4,0.811707,0.724651,0.069799,0.187297


In [60]:
df1['Sr_no'] = ['idx1','indx2','indx3','indx4','indx5']
df1

Unnamed: 0,W,X,Y,Z,Sr_no
0,0.77079,0.575293,0.868616,0.428374,idx1
1,0.082035,0.336775,0.512704,0.273871,indx2
2,0.509946,0.720662,0.268876,0.139185,indx3
3,0.065676,0.198348,0.673913,0.364173,indx4
4,0.811707,0.724651,0.069799,0.187297,indx5


In [61]:
df1.set_index(['Sr_no'],inplace=True) # 'Sr_no' column has been assigned as index of the dataframe
df1

Unnamed: 0_level_0,W,X,Y,Z
Sr_no,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
idx1,0.77079,0.575293,0.868616,0.428374
indx2,0.082035,0.336775,0.512704,0.273871
indx3,0.509946,0.720662,0.268876,0.139185
indx4,0.065676,0.198348,0.673913,0.364173
indx5,0.811707,0.724651,0.069799,0.187297


In [62]:
df1.index

Index(['idx1', 'indx2', 'indx3', 'indx4', 'indx5'], dtype='object', name='Sr_no')
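
The effect of the `drop` and `inplace` flags can be sketched on a small frame. This is only an illustration; the frame `scores` and its values are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame({"Sr_no": ["id1", "id2", "id3"], "W": [0.1, 0.2, 0.3]})

# Without inplace=True, set_index returns a new frame and leaves `scores` unchanged
indexed = scores.set_index("Sr_no")

# drop=False keeps 'Sr_no' as a regular column in addition to using it as the index
kept = scores.set_index("Sr_no", drop=False)

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()

print(indexed)
print(kept)
```

Returning a new frame (the default) is usually easier to reason about than `inplace=True`, because the original frame stays available for comparison.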

In [36]:
df = pd.read_csv('Emp_Records.csv')
df

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,677509,Lois,36.36,60,13.68,168251,Denver
1,940761,Brenda,47.02,60,9.01,51063,Stonewall
2,428945,Joe,54.15,68,0.98,50155,Michigantown
3,408351,Diane,39.67,51,18.30,180294,Hydetown
4,193819,Benjamin,40.31,58,4.01,117642,Fremont
...,...,...,...,...,...,...,...
95,639892,Jose,22.82,89,1.05,129774,Biloxi
96,704709,Harold,32.61,77,5.93,156194,Carol Stream
97,461593,Nicole,52.66,60,28.53,95673,Detroit
98,392491,Theresa,29.60,57,6.99,51015,Mc Grath


In [37]:
df.set_index('Emp ID',inplace = True)
df

Unnamed: 0_level_0,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
Emp ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
677509,Lois,36.36,60,13.68,168251,Denver
940761,Brenda,47.02,60,9.01,51063,Stonewall
428945,Joe,54.15,68,0.98,50155,Michigantown
408351,Diane,39.67,51,18.30,180294,Hydetown
193819,Benjamin,40.31,58,4.01,117642,Fremont
...,...,...,...,...,...,...
639892,Jose,22.82,89,1.05,129774,Biloxi
704709,Harold,32.61,77,5.93,156194,Carol Stream
461593,Nicole,52.66,60,28.53,95673,Detroit
392491,Theresa,29.60,57,6.99,51015,Mc Grath


### Pandas DataFrame functions

#### 1) df.columns
Returns the column labels of the DataFrame.

In [130]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [132]:
df.columns

Index(['Brand', 'Year', 'Kms Driven', 'City', 'Mileage'], dtype='object')

#### 2) df.index
Returns the index (row labels) of the DataFrame.

In [133]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [134]:
df.index

RangeIndex(start=0, stop=9, step=1)

#### 3) df.axes
Returns a list holding the row index followed by the column index.

In [135]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [136]:
df.axes

[RangeIndex(start=0, stop=9, step=1),
 Index(['Brand', 'Year', 'Kms Driven', 'City', 'Mileage'], dtype='object')]

#### 4) df.shape
It will return the shape of the dataframe (rows,columns)

In [137]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [138]:
df.shape

(9, 5)

#### 5) df.info()
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

In [139]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [140]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Brand       9 non-null      object
 1   Year        9 non-null      int64 
 2   Kms Driven  9 non-null      int64 
 3   City        9 non-null      object
 4   Mileage     9 non-null      int64 
dtypes: int64(3), object(2)
memory usage: 488.0+ bytes


#### 6) df.describe()
Returns descriptive statistics (count, mean, std deviation, min, quartiles, max), excluding NaN values. By default only numeric columns are summarized.

In [141]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [143]:
df.describe()

Unnamed: 0,Year,Kms Driven,Mileage
count,9.0,9.0,9.0
mean,2014.555556,31000.0,25.777778
std,2.74368,17755.280905,2.538591
min,2011.0,10000.0,21.0
25%,2012.0,15000.0,24.0
50%,2014.0,30000.0,26.0
75%,2016.0,46000.0,28.0
max,2019.0,60000.0,29.0
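
`describe()` skipped the `Brand` and `City` columns above because they are not numeric. A minimal sketch of how text columns can be summarized as well, using a hypothetical four-row frame `cars`:

```python
import pandas as pd

cars = pd.DataFrame({
    "Brand": ["Maruti", "Hyundai", "Tata", "Maruti"],
    "Mileage": [28, 27, 25, 28],
})

# Default: only the numeric column 'Mileage' is summarized
numeric_summary = cars.describe()

# Describing a text-only selection reports count, unique, top and freq
text_summary = cars[["Brand"]].describe()

# include="all" puts both kinds of columns in one table (NaN where a statistic does not apply)
full_summary = cars.describe(include="all")

print(numeric_summary)
print(text_summary)
```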


#### 7) df[column_name].unique()
Returns the distinct (unique) values of the specified column, in order of first appearance.

In [149]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [150]:
df['Brand'].unique()

array(['Maruti', 'Hyundai', 'Tata', 'Mahindra', 'Renault'], dtype=object)

#### 8) df[column_name].nunique()
Counts the number of distinct (unique) values in the specified column.

In [144]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [146]:
df['Brand'].nunique()

5

#### 9) df['column_name'].value_counts()
Returns a Series containing the count of each unique value in the specified column, sorted from most to least frequent.

In [152]:
# Creating dataframe

df = pd.DataFrame({'Brand' : ['Maruti', 'Hyundai', 'Tata',
                                'Mahindra', 'Maruti', 'Hyundai',
                                'Renault', 'Tata', 'Maruti'],
                     'Year' : [2012, 2014, 2011, 2015, 2012, 
                               2016, 2014, 2018, 2019],
                     'Kms Driven' : [50000, 30000, 60000, 
                                     25000, 10000, 46000, 
                                     31000, 15000, 12000],
                     'City' : ['Gurgaon', 'Delhi', 'Mumbai', 
                               'Delhi', 'Mumbai', 'Delhi', 
                               'Mumbai','Chennai',  'Ghaziabad'],
                     'Mileage' :  [28, 27, 25, 26, 28, 
                                   29, 24, 21, 24]})

In [151]:
df['Year'].value_counts()

2012    2
2014    2
2016    1
2018    1
2019    1
2011    1
2015    1
Name: Year, dtype: int64
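
Two optional flags of `value_counts()` are often useful during EDA: `normalize=True` returns proportions instead of counts, and `dropna=False` also counts missing values. A small sketch, reusing the `Year` values from above and a hypothetical `ports` Series:

```python
import pandas as pd

years = pd.Series([2012, 2014, 2011, 2015, 2012, 2016, 2014, 2018, 2019], name="Year")

counts = years.value_counts()                # absolute counts
shares = years.value_counts(normalize=True)  # fractions that sum to 1

# Hypothetical column with one missing value
ports = pd.Series(["S", "C", None, "S"])
with_missing = ports.value_counts(dropna=False)  # NaN gets its own row

print(shares)
print(with_missing)
```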

#### 10) df.head()
- Returns the first `n` rows of the dataframe. The default is `n=5`; pass a different number to get more or fewer rows.

In [153]:
df = pd.read_csv('Emp_Records.csv') # Creating dataframe by importing csv

In [154]:
df.head()

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,677509,Lois,36.36,60,13.68,168251,Denver
1,940761,Brenda,47.02,60,9.01,51063,Stonewall
2,428945,Joe,54.15,68,0.98,50155,Michigantown
3,408351,Diane,39.67,51,18.3,180294,Hydetown
4,193819,Benjamin,40.31,58,4.01,117642,Fremont


In [155]:
df.head(10)

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
0,677509,Lois,36.36,60,13.68,168251,Denver
1,940761,Brenda,47.02,60,9.01,51063,Stonewall
2,428945,Joe,54.15,68,0.98,50155,Michigantown
3,408351,Diane,39.67,51,18.3,180294,Hydetown
4,193819,Benjamin,40.31,58,4.01,117642,Fremont
5,499687,Patrick,34.86,58,12.02,72305,Macksburg
6,539712,Nancy,22.14,50,0.87,98189,Atlanta
7,380086,Carol,59.12,40,34.52,60918,Blanchester
8,477616,Frances,58.18,42,23.27,121587,Delmita
9,162402,Diana,29.73,60,3.44,43010,Eureka Springs


#### 11) df.tail()
- Returns the last `n` rows of the dataframe. The default is `n=5`; pass a different number to get more or fewer rows.

In [156]:
df.tail()

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
95,639892,Jose,22.82,89,1.05,129774,Biloxi
96,704709,Harold,32.61,77,5.93,156194,Carol Stream
97,461593,Nicole,52.66,60,28.53,95673,Detroit
98,392491,Theresa,29.6,57,6.99,51015,Mc Grath
99,495141,Tammy,38.38,55,2.26,93650,Alma


In [157]:
df.tail(10)

Unnamed: 0,Emp ID,First Name,Age in Yrs,Weight in Kgs,Age in Company,Salary,City
90,185032,Eugene,37.84,59,2.64,122950,Racine
91,867084,Deborah,39.77,54,15.41,52767,Atqasuk
92,969964,Janice,37.57,56,0.93,147641,Whiteman Air Force Base
93,158666,Rebecca,56.0,55,34.32,160043,Independence
94,489424,Phillip,39.43,82,1.56,181774,Mapleton
95,639892,Jose,22.82,89,1.05,129774,Biloxi
96,704709,Harold,32.61,77,5.93,156194,Carol Stream
97,461593,Nicole,52.66,60,28.53,95673,Detroit
98,392491,Theresa,29.6,57,6.99,51015,Mc Grath
99,495141,Tammy,38.38,55,2.26,93650,Alma
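
Both methods also accept a negative `n`, which means "everything except": `head(-n)` drops the last `n` rows and `tail(-n)` drops the first `n` rows. A minimal sketch on a hypothetical five-row frame:

```python
import pandas as pd

staff = pd.DataFrame({"Emp ID": [101, 102, 103, 104, 105]})

first_two = staff.head(2)        # rows 0 and 1
all_but_last_two = staff.head(-2)   # rows 0, 1, 2
all_but_first_two = staff.tail(-2)  # rows 2, 3, 4

print(all_but_last_two)
print(all_but_first_two)
```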


### Missing values manipulation
__I) Detecting missing values(NaN [not a number])__

#### 1) df.isnull() / df.isna()
- These functions are used to find the null values in the dataset.
- There is no difference between these two functions; `isnull()` is an alias of `isna()`.
- Both return a dataframe of boolean values: `True` where a value is missing, `False` otherwise.

In [178]:
df = pd.read_csv('titanic.csv')
df.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [179]:
df.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [180]:
# To calculate the number of null values in each column of the dataset

df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [181]:
df.isnull().sum() # See there are 177, 687, 2 null values in Age, Cabin and Embarked columns

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
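
Raw counts are easier to judge as a share of all rows. Because `True` counts as 1, the mean of the boolean mask gives the fraction of missing values per column. The sketch below uses a hypothetical four-row frame `sample` rather than the Titanic file:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_counts = sample.isna().sum()        # nulls per column
missing_pct = sample.isna().mean() * 100    # percentage of nulls per column
total_missing = int(sample.isna().sum().sum())  # nulls in the whole frame

print(missing_pct)
```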

 ___II) Handling missing values(NaN [not a number])___
- There are basically two ways to handle missing values in a dataframe: replace the null values, or drop the rows/columns that contain them.
- Dropping null values every time is not good practice, since data is the key ingredient of a machine learning algorithm.
- Drop a column only if it consists largely of null values.
- Null values can be replaced with a static value, or with the mean, mode, median, min or max of the column.

- Replace null values with the mean if the column does not contain any outliers.
- If the column contains outliers, replace null values with the median.
- If the column contains categorical values, replace null values with the mode.
- Replace null values with a static value, the min or the max only if the problem specifically requires it; otherwise use the mean, median or mode.
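
The rules of thumb above can be sketched as a small helper. This is an illustration, not a standard pandas function: the name `choose_fill_value` is hypothetical, and the 1.5 × IQR rule used to detect outliers is an assumption chosen for the example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def choose_fill_value(col):
    """Pick a fill value for a column: mode, median or mean."""
    if not is_numeric_dtype(col):
        # Categorical column: use the most frequent value
        return col.mode().iloc[0]
    # Assumed outlier test: values beyond 1.5 * IQR from the quartiles
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    has_outliers = ((col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)).any()
    # Median is robust to outliers; otherwise the mean is used
    return col.median() if has_outliers else col.mean()


print(choose_fill_value(pd.Series([1.0, 2.0, 3.0, None])))
print(choose_fill_value(pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, None])))
print(choose_fill_value(pd.Series(["S", "C", "S", None])))
```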

In [183]:
df.shape # 891 rows and 12 columns

(891, 12)

In [182]:
df = pd.read_csv('titanic.csv')
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [186]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Gender       891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [187]:
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

- The __`Age`__ column has 177 null values. We can fill them with the mean or the median, depending on whether the column contains outliers.
- The __`Cabin`__ column has 687 null values out of 891 rows. It is better to drop this column, since filling that many values would add a large amount of made-up data.
- The __`Embarked`__ column has only three unique values (S, C and Q) and its datatype is object. For categorical data like this, null values are filled with the mode.

In [188]:
# Calculating mean

df['Age'].mean()

29.69911764705882

In [189]:
# Calculating median

df['Age'].median()

28.0

In [196]:
# Calculating mode

from scipy import stats
print(stats.mode(df['Embarked']))
print('*'*40)

print(stats.mode(df['Embarked'])[0])
print('*'*40)

print(stats.mode(df['Embarked'])[0][0])

ModeResult(mode=array(['S'], dtype=object), count=array([644]))
****************************************
['S']
****************************************
S
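
The mode can also be computed with pandas alone, which avoids the SciPy dependency. Note, as a caveat that depends on the installed version, that newer SciPy releases restrict `stats.mode` to numeric data, so the call above may not work there. A minimal sketch with a hypothetical `embarked` Series:

```python
import pandas as pd

embarked = pd.Series(["S", "C", "S", "Q", None, "S"], name="Embarked")

# Series.mode() ignores NaN by default and returns a Series,
# because several values can be tied for most frequent
modes = embarked.mode()
most_common = modes.iloc[0]

filled = embarked.fillna(most_common)
print(most_common)
```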


In [197]:
# Calculating min

df['Age'].min() # smallest value is an infant under one year old; inspect it before treating it as an outlier

0.42

In [198]:
# Calculating max

df['Age'].max()

80.0

#### 1) df.fillna()
- We can fill null values with mean, median, mode, max, min or with static values

In [200]:
df = pd.read_csv('titanic.csv')
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [201]:
df['Age'].fillna(df['Age'].mean(),inplace=True)

In [202]:
df.isnull().sum() # see we have replaced null values of the 'Age' column with its mean

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [210]:
from scipy import stats

df['Embarked'].fillna(stats.mode(df['Embarked'])[0][0],inplace=True)

In [212]:
df.isnull().sum() # see null values of the 'Embarked' column have been filled

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
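
Several columns can be filled in one call by passing a dictionary of `column: value` pairs and assigning the result back. This form also avoids `inplace=True` on a selected column, which newer pandas releases may warn about. The frame `passengers` below is a hypothetical three-row stand-in for the Titanic data:

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Embarked": ["S", None, "S"],
})

# One fill value per column: median for the numeric column, mode for the categorical one
fill_values = {
    "Age": passengers["Age"].median(),
    "Embarked": passengers["Embarked"].mode().iloc[0],
}

# fillna returns a new frame; assign it back instead of using inplace=True
passengers = passengers.fillna(fill_values)
print(passengers)
```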

#### 2) df.dropna()
- Drops null values.
- We can either drop the rows that have null values or drop the columns that have null values.

In [213]:
df = pd.read_csv('titanic.csv')
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [216]:
df.dropna() # By default axis=0; It will drop rows from the dataframe having null values
# You can see that some rows have been dropped

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [217]:
df.dropna(axis=1) # See columns having null values have been dropped

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,SibSp,Parch,Ticket,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.0500
...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,0,0,211536,13.0000
887,888,1,1,"Graham, Miss. Margaret Edith",female,0,0,112053,30.0000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,W./C. 6607,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,0,0,111369,30.0000


___thresh flag___
- `Syntax: df.dropna(axis=0, thresh=None)`
- `thresh` is the minimum number of non-null values required to keep a row or column. With `thresh=5`, a row or column must have at least 5 non-null values.
- A row or column with fewer non-null values than `thresh` is dropped.
- It works the same way for rows (`axis=0`) and columns (`axis=1`).
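
On the Titanic data every row and column passes the thresholds used below, so nothing is visibly dropped. A tiny hypothetical frame `grid` makes the behaviour of `thresh` easier to see:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 5, np.nan],
    "c": [3, 6, np.nan],
})

# Rows need at least 2 non-null values: row 2 (all NaN) is dropped
rows_kept = grid.dropna(thresh=2)

# Columns need at least 2 non-null values: column 'a' (only 1) is dropped
cols_kept = grid.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```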

In [218]:
# for rows

df.dropna(thresh=5) # axis=0 by default; rows with fewer than 5 non-null values are dropped

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [221]:
# For columns 

df.dropna(axis=1,thresh=2) # The Embarked column is kept because it has at least 2 non-null values

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


#### 3) Dropping entire column
- In some cases we have to drop an entire column. In this dataframe we drop the `Cabin` column, since it consists largely of null values.

In [228]:
df = pd.read_csv('titanic.csv')
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [229]:
df.drop(['Cabin'],axis=1,inplace=True)
df # see cabin column has been dropped

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C


### Replacing Categorical Values
#### 1) df.replace()
- Replaces the given values with other values. This concept is covered in more detail under categorical encoding.

In [230]:
df = pd.read_csv('titanic.csv') # Creating dataframe
df.isnull().sum() # Checking for null values

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [231]:
from scipy import stats 
df['Embarked'].fillna(stats.mode(df['Embarked'])[0][0],inplace=True) # filling null values with the mode of the column

In [232]:
df.isnull().sum() # Filled null values for column 'Embarked'

PassengerId      0
Survived         0
Pclass           0
Name             0
Gender           0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

In [238]:
df['Embarked'].unique() # So we have three unique values

array(['S', 'C', 'Q'], dtype=object)

In [239]:
# replacing unique values from 'Embarked' column of the dataframe with numeric values

df['Embarked'].replace({'S' : 0, 'C' : 1, 'Q' : 2}, inplace = True)

In [241]:
df['Embarked'].unique() # Replaced

array([0, 1, 2], dtype=int64)
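
A common alternative for this kind of encoding is `Series.map()`. The two differ in how they treat values that are missing from the dictionary: `replace()` leaves them unchanged, while `map()` turns them into NaN, which makes unexpected categories easy to spot. A minimal sketch with hypothetical Series:

```python
import pandas as pd

codes = {"S": 0, "C": 1, "Q": 2}

ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")
encoded = ports.map(codes)  # every value is in the dictionary

# 'X' is not in the dictionary, so map() yields NaN for it
unexpected = pd.Series(["S", "X"]).map(codes)

print(encoded.tolist())
print(unexpected.isna().tolist())
```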

#### 2) df.astype()
- Used to change the datatype of a column (or of the whole dataframe).

In [260]:
df = pd.read_csv('titanic.csv') # Creating dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Gender       891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [245]:
# We can see datatype of 'Age' is float, So we will change the datatype of 'Age' column to int
# But first we have to fill null values with its mean

In [261]:
df['Age'].fillna(df['Age'].mean(),inplace=True)

In [263]:
df['Age'] = df['Age'].astype('int')

In [265]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


In [267]:
df.info() # See data type of 'Age' column changed to int datatype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Gender       891 non-null    object 
 5   Age          891 non-null    int32  
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(1), int32(1), int64(5), object(5)
memory usage: 80.2+ KB
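
The order of the two steps matters: casting a float column that still contains `NaN` to a NumPy integer dtype raises an error, so the missing values have to be filled first. The sketch below repeats the idea on a tiny made-up frame (`toy` and its values are assumptions for illustration, not the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the 'Age' column
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})

# Step 1: fill missing values with the column mean and assign the result back
mean_age = toy["Age"].mean()          # mean of 22, 26, 35 (NaN is skipped)
toy["Age"] = toy["Age"].fillna(mean_age)

# Step 2: with no NaN left, the cast to an integer dtype succeeds.
# astype truncates the fractional part, it does not round.
toy["Age"] = toy["Age"].astype("int64")

print(toy["Age"].tolist())  # [22, 27, 26, 35]
```

Note that the filled value 27.67 becomes 27. If rounding is preferred, one option is to call `.round()` before `.astype(...)`.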


#### 3) df.insert()
- Inserts a column at a given position (0-based column index). The DataFrame is modified in place.
- `Syntax: df.insert(loc, column, value)`
- By default, inserting a column whose name already exists raises a `ValueError`. To allow it, pass `allow_duplicates=True`.

In [275]:
# Creating dataframe

df = pd.DataFrame(np.random.rand(5,4),index=['A','B','C','D','E'], columns=['W','X','Y','Z'])
df

Unnamed: 0,W,X,Y,Z
A,0.20205,0.139777,0.709592,0.503881
B,0.582411,0.92149,0.095278,0.789795
C,0.014007,0.880595,0.772293,0.864699
D,0.350504,0.602581,0.073167,0.015123
E,0.267894,0.600701,0.527621,0.215488


In [276]:
df.insert(1,'V',1.2)
df # New column 'V' inserted at index 1

Unnamed: 0,W,V,X,Y,Z
A,0.20205,1.2,0.139777,0.709592,0.503881
B,0.582411,1.2,0.92149,0.095278,0.789795
C,0.014007,1.2,0.880595,0.772293,0.864699
D,0.350504,1.2,0.602581,0.073167,0.015123
E,0.267894,1.2,0.600701,0.527621,0.215488


In [277]:
df.insert(0,'T',np.random.rand(5))
df # New column 'T' inserted at index 0

Unnamed: 0,T,W,V,X,Y,Z
A,0.353782,0.20205,1.2,0.139777,0.709592,0.503881
B,0.89032,0.582411,1.2,0.92149,0.095278,0.789795
C,0.85132,0.014007,1.2,0.880595,0.772293,0.864699
D,0.823372,0.350504,1.2,0.602581,0.073167,0.015123
E,0.052279,0.267894,1.2,0.600701,0.527621,0.215488


In [278]:
df.insert(0,'T',np.random.rand(5),allow_duplicates=True) # column having same name inserted
df

Unnamed: 0,T,T,W,V,X,Y,Z
A,0.107557,0.353782,0.20205,1.2,0.139777,0.709592,0.503881
B,0.76613,0.89032,0.582411,1.2,0.92149,0.095278,0.789795
C,0.626449,0.85132,0.014007,1.2,0.880595,0.772293,0.864699
D,0.00016,0.823372,0.350504,1.2,0.602581,0.073167,0.015123
E,0.467102,0.052279,0.267894,1.2,0.600701,0.527621,0.215488


In [279]:
df.insert(1, 'C', np.array([100,200,300,400,500]))
df

Unnamed: 0,T,C,T,W,V,X,Y,Z
A,0.107557,100,0.353782,0.20205,1.2,0.139777,0.709592,0.503881
B,0.76613,200,0.89032,0.582411,1.2,0.92149,0.095278,0.789795
C,0.626449,300,0.85132,0.014007,1.2,0.880595,0.772293,0.864699
D,0.00016,400,0.823372,0.350504,1.2,0.602581,0.073167,0.015123
E,0.467102,500,0.052279,0.267894,1.2,0.600701,0.527621,0.215488
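
The duplicate-name rule can be checked directly. This is a minimal sketch on a small made-up frame (`demo` is an assumed name): it shows a scalar being broadcast to every row, the error raised for a repeated name, and the effect of `allow_duplicates=True`.

```python
import pandas as pd

demo = pd.DataFrame({"W": [1, 2, 3], "X": [4, 5, 6]})

# A scalar value is broadcast to every row
demo.insert(1, "V", 1.2)

# Inserting the same name again fails unless duplicates are allowed
duplicate_error = None
try:
    demo.insert(0, "V", 0)
except ValueError as exc:
    duplicate_error = exc

# With the flag set, a second column labelled 'V' is accepted
demo.insert(0, "V", [7, 8, 9], allow_duplicates=True)

print(list(demo.columns))  # ['V', 'W', 'V', 'X']
```

Duplicate labels make `demo["V"]` return a DataFrame with both columns rather than a single Series, which is one reason to avoid them in practice.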


#### 4) df.rename()
- Used to rename existing column (or index) labels.

In [287]:
# Creating a dataframe

df = pd.DataFrame({'Principal_Amount' : [100000, 200000, 300000, 500000, 700000]})
df["Interest_Rate(R)"] = [12,11,13,5,8]
df["Tenure(T)"] = [2,3,4,5,7]
df["Amount"] = df['Principal_Amount'] * ((1 + df["Interest_Rate(R)"]/100)) ** df["Tenure(T)"]
df['Amount'] = np.around(df['Amount'],3)
df['Componud Interest'] = df['Amount'] - df['Principal_Amount']
df.sort_values('Componud Interest', ascending = False, inplace = True)
df.reset_index(drop=True,inplace=True)
df

Unnamed: 0,Principal_Amount,Interest_Rate(R),Tenure(T),Amount,Componud Interest
0,700000,8,7,1199676.988,499676.988
1,300000,13,4,489142.083,189142.083
2,500000,5,5,638140.781,138140.781
3,200000,11,3,273526.2,73526.2
4,100000,12,2,125440.0,25440.0


In [288]:
# renaming column names

df.rename({'Principal_Amount':'P', 'Interest_Rate(R)':'R', 'Tenure(T)':'T', 'Amount':'A', 'Componud Interest':'CI'}, axis =1 , inplace = True)
df

Unnamed: 0,P,R,T,A,CI
0,700000,8,7,1199676.988,499676.988
1,300000,13,4,489142.083,189142.083
2,500000,5,5,638140.781,138140.781
3,200000,11,3,273526.2,73526.2
4,100000,12,2,125440.0,25440.0


In [289]:
df.columns

Index(['P', 'R', 'T', 'A', 'CI'], dtype='object')

In [290]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [291]:
df.columns = ['Principal_Amount','Interest_Rate','Tenure','Amount','Componud Interest'] # Another way to rename columns: assign a complete list of names
df

Unnamed: 0,Principal_Amount,Interest_Rate,Tenure,Amount,Componud Interest
0,700000,8,7,1199676.988,499676.988
1,300000,13,4,489142.083,189142.083
2,500000,5,5,638140.781,138140.781
3,200000,11,3,273526.2,73526.2
4,100000,12,2,125440.0,25440.0
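
The two approaches behave differently: `rename` only touches the labels named in the mapping, while assigning to `df.columns` needs a name for every column, in order. A small sketch with an assumed frame `loans` illustrates `rename` without `inplace`, using the `columns=` keyword instead of `axis=1`.

```python
import pandas as pd

loans = pd.DataFrame({"Principal_Amount": [100000, 200000], "Tenure(T)": [2, 3]})

# rename returns a new DataFrame; keys that match no column are ignored by default
renamed = loans.rename(
    columns={"Principal_Amount": "P", "Tenure(T)": "T", "Not_A_Column": "Z"}
)

# A function can be used as the mapper as well
lowered = loans.rename(columns=str.lower)

print(list(renamed.columns))  # ['P', 'T']
print(list(lowered.columns))  # ['principal_amount', 'tenure(t)']
print(list(loans.columns))    # original is unchanged
```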


### Some important Pandas functions

#### 1) df.groupby()
- `Syntax: df.groupby(by, sort=True, dropna=True)`
- Splits the rows into groups that share the same value(s) in the `by` column(s), so that an operation can be computed on each group.

In [64]:
movies_df = pd.read_csv('Movies.csv')
movies_df

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
...,...,...,...,...,...,...
974,7.4,Tootsie,PG,Comedy,116,"[u'Dustin Hoffman', u'Jessica Lange', u'Teri G..."
975,7.4,Back to the Future Part III,PG,Adventure,118,"[u'Michael J. Fox', u'Christopher Lloyd', u'Ma..."
976,7.4,Master and Commander: The Far Side of the World,PG-13,Action,138,"[u'Russell Crowe', u'Paul Bettany', u'Billy Bo..."
977,7.4,Poltergeist,PG,Horror,114,"[u'JoBeth Williams', u""Heather O'Rourke"", u'Cr..."


In [66]:
movies_df.groupby('genre').first() # returns the first non-null entry of each column within each 'genre' group

genre,star_rating,title,content_rating,duration,actors_list
Action,9.0,The Dark Knight,PG-13,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
Adventure,8.9,The Lord of the Rings: The Return of the King,PG-13,201,"[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK..."
Animation,8.6,Spirited Away,PG,125,"[u'Daveigh Chase', u'Suzanne Pleshette', u'Miy..."
Biography,8.9,Schindler's List,R,195,"[u'Liam Neeson', u'Ralph Fiennes', u'Ben Kings..."
Comedy,8.6,Life Is Beautiful,PG-13,116,"[u'Roberto Benigni', u'Nicoletta Braschi', u'G..."
Crime,9.3,The Shawshank Redemption,R,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
Drama,8.9,12 Angry Men,NOT RATED,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."
Family,7.9,E.T. the Extra-Terrestrial,PG,115,"[u'Henry Thomas', u'Drew Barrymore', u'Peter C..."
Fantasy,7.7,The City of Lost Children,R,112,"[u'Ron Perlman', u'Daniel Emilfork', u'Judith ..."
Film-Noir,8.3,The Third Man,NOT RATED,93,"[u'Orson Welles', u'Joseph Cotten', u'Alida Va..."


In [67]:
action_df = movies_df.groupby('genre').get_group('Action') # returns a group of specified genre
action_df

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
11,8.8,Inception,PG-13,Action,148,"[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'..."
12,8.8,Star Wars: Episode V - The Empire Strikes Back,PG,Action,124,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."
19,8.7,Star Wars,PG,Action,121,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."
20,8.7,The Matrix,R,Action,136,"[u'Keanu Reeves', u'Laurence Fishburne', u'Car..."
...,...,...,...,...,...,...
918,7.5,Running Scared,R,Action,122,"[u'Paul Walker', u'Cameron Bright', u'Chazz Pa..."
954,7.4,X-Men,PG-13,Action,104,"[u'Patrick Stewart', u'Hugh Jackman', u'Ian Mc..."
963,7.4,La Femme Nikita,R,Action,118,"[u'Anne Parillaud', u'Marc Duret', u'Patrick F..."
967,7.4,The Rock,R,Action,136,"[u'Sean Connery', u'Nicolas Cage', u'Ed Harris']"


In [68]:
action_df = action_df.reset_index(drop=True) # renumber the rows from 0
action_df

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
1,8.8,Inception,PG-13,Action,148,"[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'..."
2,8.8,Star Wars: Episode V - The Empire Strikes Back,PG,Action,124,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."
3,8.7,Star Wars,PG,Action,121,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."
4,8.7,The Matrix,R,Action,136,"[u'Keanu Reeves', u'Laurence Fishburne', u'Car..."
...,...,...,...,...,...,...
131,7.5,Running Scared,R,Action,122,"[u'Paul Walker', u'Cameron Bright', u'Chazz Pa..."
132,7.4,X-Men,PG-13,Action,104,"[u'Patrick Stewart', u'Hugh Jackman', u'Ian Mc..."
133,7.4,La Femme Nikita,R,Action,118,"[u'Anne Parillaud', u'Marc Duret', u'Patrick F..."
134,7.4,The Rock,R,Action,136,"[u'Sean Connery', u'Nicolas Cage', u'Ed Harris']"


In [69]:
movies_df.groupby('genre').get_group('Crime')

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
21,8.7,City of God,R,Crime,130,"[u'Alexandre Rodrigues', u'Matheus Nachtergael..."
...,...,...,...,...,...,...
927,7.5,Brick,R,Crime,110,"[u'Joseph Gordon-Levitt', u'Lukas Haas', u'Emi..."
931,7.4,Mean Streets,R,Crime,112,"[u'Robert De Niro', u'Harvey Keitel', u'David ..."
950,7.4,Bound,R,Crime,108,"[u'Jennifer Tilly', u'Gina Gershon', u'Joe Pan..."
969,7.4,Law Abiding Citizen,R,Crime,109,"[u'Gerard Butler', u'Jamie Foxx', u'Leslie Bibb']"


In [70]:
movies_df.groupby('content_rating').get_group('PG-13')

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
7,8.9,The Lord of the Rings: The Return of the King,PG-13,Adventure,201,"[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK..."
10,8.8,The Lord of the Rings: The Fellowship of the Ring,PG-13,Adventure,178,"[u'Elijah Wood', u'Ian McKellen', u'Orlando Bl..."
11,8.8,Inception,PG-13,Action,148,"[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'..."
13,8.8,Forrest Gump,PG-13,Drama,142,"[u'Tom Hanks', u'Robin Wright', u'Gary Sinise']"
...,...,...,...,...,...,...
964,7.4,Lincoln,PG-13,Biography,150,"[u'Daniel Day-Lewis', u'Sally Field', u'David ..."
965,7.4,Limitless,PG-13,Mystery,105,"[u'Bradley Cooper', u'Anna Friel', u'Abbie Cor..."
966,7.4,The Simpsons Movie,PG-13,Animation,87,"[u'Dan Castellaneta', u'Julie Kavner', u'Nancy..."
973,7.4,The Cider House Rules,PG-13,Drama,126,"[u'Tobey Maguire', u'Charlize Theron', u'Micha..."


In [71]:
movies_df.groupby(['content_rating','genre']).get_group(('PG-13', 'Action')) # grouping by two columns: the group key is a tuple

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
11,8.8,Inception,PG-13,Action,148,"[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'..."
43,8.5,The Dark Knight Rises,PG-13,Action,165,"[u'Christian Bale', u'Tom Hardy', u'Anne Hatha..."
113,8.3,Batman Begins,PG-13,Action,140,"[u'Christian Bale', u'Michael Caine', u'Ken Wa..."
118,8.3,Indiana Jones and the Last Crusade,PG-13,Action,127,"[u'Harrison Ford', u'Sean Connery', u'Alison D..."
177,8.2,The Avengers,PG-13,Action,143,"[u'Robert Downey Jr.', u'Chris Evans', u'Scarl..."
196,8.1,Guardians of the Galaxy,PG-13,Action,121,"[u'Chris Pratt', u'Vin Diesel', u'Bradley Coop..."
240,8.1,The Bourne Ultimatum,PG-13,Action,115,"[u'Matt Damon', u'\xc9dgar Ram\xedrez', u'Joan..."
248,8.1,X-Men: Days of Future Past,PG-13,Action,131,"[u'Patrick Stewart', u'Ian McKellen', u'Hugh J..."
301,8.0,Furious 7,PG-13,Action,137,"[u'Vin Diesel', u'Paul Walker', u'Dwayne Johns..."


In [72]:
movies_df.groupby(['content_rating','genre']).get_group(('PG-13', 'Adventure')).sort_values('duration')

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
510,7.8,Moonrise Kingdom,PG-13,Adventure,94,"[u'Jared Gilman', u'Kara Hayward', u'Bruce Wil..."
943,7.4,The Bucket List,PG-13,Adventure,97,"[u'Jack Nicholson', u'Morgan Freeman', u'Sean ..."
522,7.8,"O Brother, Where Art Thou?",PG-13,Adventure,106,"[u'George Clooney', u'John Turturro', u'Tim Bl..."
662,7.7,True Grit,PG-13,Adventure,110,"[u'Jeff Bridges', u'Matt Damon', u'Hailee Stei..."
310,8.0,Big Fish,PG-13,Adventure,125,"[u'Ewan McGregor', u'Albert Finney', u'Billy C..."
605,7.7,Stardust,PG-13,Adventure,127,"[u'Charlie Cox', u'Claire Danes', u'Sienna Mil..."
299,8.0,Jurassic Park,PG-13,Adventure,127,"[u'Sam Neill', u'Laura Dern', u'Jeff Goldblum']"
222,8.1,Harry Potter and the Deathly Hallows: Part 2,PG-13,Adventure,130,"[u'Daniel Radcliffe', u'Emma Watson', u'Rupert..."
932,7.4,Harry Potter and the Order of the Phoenix,PG-13,Adventure,138,"[u'Daniel Radcliffe', u'Emma Watson', u'Rupert..."
758,7.6,The Abyss,PG-13,Adventure,139,"[u'Ed Harris', u'Mary Elizabeth Mastrantonio',..."
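
The cells above use `groupby` to pull out individual groups. Its more typical use is to compute a summary per group. The following is a minimal sketch on a small hand-made frame (`films` and its values are assumptions, since `Movies.csv` is not available here), with column names chosen to mirror the ones above.

```python
import pandas as pd

films = pd.DataFrame({
    "genre": ["Crime", "Crime", "Action", "Action", "Drama"],
    "star_rating": [9.3, 9.2, 9.0, 8.8, 8.9],
    "duration": [142, 175, 152, 148, 96],
})

# One statistic for one column, per group
mean_duration = films.groupby("genre")["duration"].mean()

# Several named statistics at once (named aggregation)
summary = films.groupby("genre").agg(
    n_titles=("duration", "count"),
    best_rating=("star_rating", "max"),
)

print(mean_duration)
print(summary)
```

The result is indexed by `genre`. Calling `.reset_index()` on it turns the group labels back into an ordinary column.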


In [73]:
# Another way to select the same rows: boolean indexing with .loc

movies_df.loc[movies_df['genre'] == 'Crime']

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
21,8.7,City of God,R,Crime,130,"[u'Alexandre Rodrigues', u'Matheus Nachtergael..."
...,...,...,...,...,...,...
927,7.5,Brick,R,Crime,110,"[u'Joseph Gordon-Levitt', u'Lukas Haas', u'Emi..."
931,7.4,Mean Streets,R,Crime,112,"[u'Robert De Niro', u'Harvey Keitel', u'David ..."
950,7.4,Bound,R,Crime,108,"[u'Jennifer Tilly', u'Gina Gershon', u'Joe Pan..."
969,7.4,Law Abiding Citizen,R,Crime,109,"[u'Gerard Butler', u'Jamie Foxx', u'Leslie Bibb']"


In [74]:
movies_df.loc[movies_df['genre'] == 'Action']

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
11,8.8,Inception,PG-13,Action,148,"[u'Leonardo DiCaprio', u'Joseph Gordon-Levitt'..."
12,8.8,Star Wars: Episode V - The Empire Strikes Back,PG,Action,124,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."
19,8.7,Star Wars,PG,Action,121,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."
20,8.7,The Matrix,R,Action,136,"[u'Keanu Reeves', u'Laurence Fishburne', u'Car..."
...,...,...,...,...,...,...
918,7.5,Running Scared,R,Action,122,"[u'Paul Walker', u'Cameron Bright', u'Chazz Pa..."
954,7.4,X-Men,PG-13,Action,104,"[u'Patrick Stewart', u'Hugh Jackman', u'Ian Mc..."
963,7.4,La Femme Nikita,R,Action,118,"[u'Anne Parillaud', u'Marc Duret', u'Patrick F..."
967,7.4,The Rock,R,Action,136,"[u'Sean Connery', u'Nicolas Cage', u'Ed Harris']"


In [75]:
# Conditions can be combined with | (or) and & (and); wrap each condition in parentheses

movies_df.loc[(movies_df['genre'] == 'Action') | (movies_df['genre'] == 'Crime')]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
...,...,...,...,...,...,...
963,7.4,La Femme Nikita,R,Action,118,"[u'Anne Parillaud', u'Marc Duret', u'Patrick F..."
967,7.4,The Rock,R,Action,136,"[u'Sean Connery', u'Nicolas Cage', u'Ed Harris']"
969,7.4,Law Abiding Citizen,R,Crime,109,"[u'Gerard Butler', u'Jamie Foxx', u'Leslie Bibb']"
976,7.4,Master and Commander: The Far Side of the World,PG-13,Action,138,"[u'Russell Crowe', u'Paul Bettany', u'Billy Bo..."


In [76]:
movies_df.loc[(movies_df['genre'] == 'Action') & 
              (movies_df['duration'] > 120)].sort_values('duration', ascending = False)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
767,7.6,"It's a Mad, Mad, Mad, Mad World",APPROVED,Action,205,"[u'Spencer Tracy', u'Milton Berle', u'Ethel Me..."
385,8.0,Spartacus,PG-13,Action,197,"[u'Kirk Douglas', u'Laurence Olivier', u'Jean ..."
671,7.7,Grindhouse,R,Action,191,"[u'Kurt Russell', u'Rose McGowan', u'Danny Tre..."
534,7.8,The Longest Day,G,Action,178,"[u'John Wayne', u'Robert Ryan', u'Richard Burt..."
82,8.4,Braveheart,R,Action,177,"[u'Mel Gibson', u'Sophie Marceau', u'Patrick M..."
...,...,...,...,...,...,...
163,8.2,Rush,R,Action,123,"[u'Daniel Br\xfchl', u'Chris Hemsworth', u'Oli..."
918,7.5,Running Scared,R,Action,122,"[u'Paul Walker', u'Cameron Bright', u'Chazz Pa..."
749,7.6,Lone Survivor,R,Action,121,"[u'Mark Wahlberg', u'Taylor Kitsch', u'Emile H..."
19,8.7,Star Wars,PG,Action,121,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."


In [77]:
movies_df.loc[(movies_df['genre'] == 'Action') & 
              (movies_df['duration'] > 120) & (movies_df['duration'] < 180)].sort_values('duration', ascending = False)

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
534,7.8,The Longest Day,G,Action,178,"[u'John Wayne', u'Robert Ryan', u'Richard Burt..."
82,8.4,Braveheart,R,Action,177,"[u'Mel Gibson', u'Sophie Marceau', u'Patrick M..."
135,8.3,Heat,R,Action,170,"[u'Al Pacino', u'Robert De Niro', u'Val Kilmer']"
36,8.6,Saving Private Ryan,R,Action,169,"[u'Tom Hanks', u'Matt Damon', u'Tom Sizemore']"
684,7.7,The Big Blue,PG-13,Action,168,"[u'Jean-Marc Barr', u'Jean Reno', u'Rosanna Ar..."
...,...,...,...,...,...,...
163,8.2,Rush,R,Action,123,"[u'Daniel Br\xfchl', u'Chris Hemsworth', u'Oli..."
918,7.5,Running Scared,R,Action,122,"[u'Paul Walker', u'Cameron Bright', u'Chazz Pa..."
19,8.7,Star Wars,PG,Action,121,"[u'Mark Hamill', u'Harrison Ford', u'Carrie Fi..."
749,7.6,Lone Survivor,R,Action,121,"[u'Mark Wahlberg', u'Taylor Kitsch', u'Emile H..."
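
When several values of one column are compared, or a numeric range is tested, `Series.isin` and `Series.between` can replace chains of `|` and `&`. This is a sketch on an assumed toy frame (`films`). The `inclusive="neither"` string form is an assumption about the installed version (it needs pandas 1.3 or later).

```python
import pandas as pd

films = pd.DataFrame({
    "genre": ["Crime", "Action", "Action", "Drama", "Action", "Action"],
    "duration": [142, 152, 205, 96, 121, 120],
})

# Same as (genre == 'Action') | (genre == 'Crime')
either = films[films["genre"].isin(["Action", "Crime"])]

# Same as (genre == 'Action') & (duration > 120) & (duration < 180)
mid_length = films[
    (films["genre"] == "Action")
    & films["duration"].between(120, 180, inclusive="neither")
]

# The explicit operator version, for comparison
explicit = films[
    (films["genre"] == "Action")
    & (films["duration"] > 120)
    & (films["duration"] < 180)
]

print(mid_length["duration"].tolist())  # [152, 121]
```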


#### 2) df.apply()
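- Lets you pass a function and apply it along an axis of the DataFrame: to each column by default (`axis=0`) or to each row (`axis=1`). On a Series, `apply` calls the function on every single value.
- `Syntax: df.apply(func, axis=0)`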

In [79]:
df = pd.DataFrame(np.random.randint(10,30, size = (6,5)), columns = list('abcde'))
df

Unnamed: 0,a,b,c,d,e
0,20,25,18,24,23
1,21,20,23,13,18
2,18,27,18,15,23
3,12,15,11,12,15
4,11,20,22,10,21
5,28,15,15,16,26


In [80]:
# Using numpy function

df.apply(np.square)

Unnamed: 0,a,b,c,d,e
0,400,625,324,576,529
1,441,400,529,169,324
2,324,729,324,225,529
3,144,225,121,144,225
4,121,400,484,100,441
5,784,225,225,256,676


In [82]:
# Using lambda

df.apply(lambda x : x ** 2)

Unnamed: 0,a,b,c,d,e
0,400,625,324,576,529
1,441,400,529,169,324
2,324,729,324,225,529
3,144,225,121,144,225
4,121,400,484,100,441
5,784,225,225,256,676


In [85]:
# Using a user-defined function (UDF)

def square(x):
    return x ** 2
df.apply(square)

Unnamed: 0,a,b,c,d,e
0,400,625,324,576,529
1,441,400,529,169,324
2,324,729,324,225,529
3,144,225,121,144,225
4,121,400,484,100,441
5,784,225,225,256,676
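
Squaring gives the same result whether the function receives a whole column or a single value, so it hides what `apply` actually passes in. The sketch below (with an assumed frame `scores`) uses functions whose result depends on it.

```python
import pandas as pd

scores = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# DataFrame.apply, axis=0 (default): the function receives each column as a Series
col_ranges = scores.apply(lambda col: col.max() - col.min())

# DataFrame.apply, axis=1: the function receives each row as a Series
row_sums = scores.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.apply: the function receives one value at a time
labels = scores["a"].apply(lambda v: "odd" if v % 2 else "even")

print(col_ranges.tolist())  # [2, 20]
print(row_sums.tolist())    # [11, 22, 33]
print(labels.tolist())      # ['odd', 'even', 'odd']
```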


In [86]:
df = pd.DataFrame({'Country': ['India', 'UK', 'USA'], 
                  'Capital' : ['Delhi', 'London', 'Washington DC']})
df

Unnamed: 0,Country,Capital
0,India,Delhi
1,UK,London
2,USA,Washington DC


In [87]:
df['Capital'] = df['Capital'].apply(lambda x : x.lower())
df

Unnamed: 0,Country,Capital
0,India,delhi
1,UK,london
2,USA,washington dc


In [89]:
df['Date'] = ['14-05-2022', '20-02-2022', '10-01-2022']
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  3 non-null      object
 1   Capital  3 non-null      object
 2   Date     3 non-null      object
dtypes: object(3)
memory usage: 200.0+ bytes


In [90]:
from datetime import datetime

def string_to_date(date):
    date = datetime.strptime(date, '%d-%m-%Y')
    return date

df['New_Date'] = df['Date'].apply(string_to_date)
df

Unnamed: 0,Country,Capital,Date,New_Date
0,India,delhi,14-05-2022,2022-05-14
1,UK,london,20-02-2022,2022-02-20
2,USA,washington dc,10-01-2022,2022-01-10


In [92]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Country   3 non-null      object        
 1   Capital   3 non-null      object        
 2   Date      3 non-null      object        
 3   New_Date  3 non-null      datetime64[ns]
dtypes: datetime64[ns](1), object(3)
memory usage: 224.0+ bytes
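
For common string and date conversions, pandas also offers vectorized alternatives to `apply`: the `.str` accessor and `pd.to_datetime`. This is a sketch of the same two transformations on an assumed copy of the data (`places`), not a replacement for the approach shown above.

```python
import pandas as pd

places = pd.DataFrame({
    "Capital": ["Delhi", "London", "Washington DC"],
    "Date": ["14-05-2022", "20-02-2022", "10-01-2022"],
})

# Lower-case every string without a lambda
places["Capital"] = places["Capital"].str.lower()

# Parse day-month-year strings; an explicit format avoids day/month ambiguity
places["New_Date"] = pd.to_datetime(places["Date"], format="%d-%m-%Y")

print(places["Capital"].tolist())            # ['delhi', 'london', 'washington dc']
print(places["New_Date"].dt.month.tolist())  # [5, 2, 1]
```

Once the column has a datetime dtype, the `.dt` accessor exposes parts such as `.dt.year`, `.dt.month` and `.dt.day`.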


In [93]:
# End of Pandas..