## EDA and Feature Engineering: Flight Price Prediction

Check the dataset info below:
https://drive.google.com/drive/folders/1vCROauTDZlPX7LOVPL27HFp1DRUyh84N

### FEATURES
The various features of the cleaned dataset are explained below:

1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.

2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.

3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.

4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.

5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.

6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.

7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.

8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.

9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.

10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.

11) Price: The target variable; it stores the ticket price.
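
The two derived features described above (a binned departure time and days left) could be built roughly as follows. This is only a sketch on made-up data: the column names `dep_hour`, `booking_date` and `trip_date`, the 4-hour bin edges and the six label names are all assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; column names, dates and bin labels are illustrative assumptions.
sample = pd.DataFrame({
    "dep_hour": [2, 5, 9, 14, 19, 23],
    "booking_date": pd.to_datetime(["2022-02-01"] * 6),
    "trip_date": pd.to_datetime(["2022-02-11", "2022-02-15", "2022-02-02",
                                 "2022-03-01", "2022-02-21", "2022-02-08"]),
})

# Six 4-hour bins; right=False makes each bin [start, end), so hour 4 falls in the second bin.
bins = [0, 4, 8, 12, 16, 20, 24]
labels = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]
sample["departure_time"] = pd.cut(sample["dep_hour"], bins=bins, labels=labels, right=False)

# Days left = trip date minus booking date, expressed in whole days.
sample["days_left"] = (sample["trip_date"] - sample["booking_date"]).dt.days

print(sample[["dep_hour", "departure_time", "days_left"]])
```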

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Reading the data

df=pd.read_excel("C:\\Users\\Nethajimahendra K\\flight_price.xlsx")

In [2]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [3]:
df.columns

Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')

In [4]:
# Basic info about the data
df.info()

# All features are object dtype except Price

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


In [5]:
# Convert Date_of_Journey from object to datetime so that the day, month and year can be extracted separately

df["new_column"]=pd.to_datetime(df["Date_of_Journey"],dayfirst=True)

In [6]:
df

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,new_column
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,2019-03-24
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019-05-01
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,2019-06-09
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019-05-12
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019-03-01
...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019-04-09
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019-04-27
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019-04-27
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019-03-01


In [7]:
df1=df.rename({"new_column":"date_of_journey"},axis=1)

In [8]:
df1

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,date_of_journey
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,2019-03-24
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019-05-01
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,2019-06-09
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019-05-12
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019-03-01
...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019-04-09
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019-04-27
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019-04-27
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019-03-01


In [9]:
df2=df1.drop("Date_of_Journey",axis=1)

In [10]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,date_of_journey
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,2019-03-24
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019-05-01
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,2019-06-09
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019-05-12
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019-03-01
...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019-04-09
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019-04-27
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019-04-27
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019-03-01


In [11]:
df2["year"]=df2["date_of_journey"].apply(lambda x:x.year)
df2["month"]=df2["date_of_journey"].apply(lambda x:x.month)
df2["day"]=df2["date_of_journey"].apply(lambda x:x.day)

# The code above first converts the feature to datetime format and then extracts the day, month and year.

# <--- Prefer the approach above: convert the feature to datetime, then extract month, year and day --->
# Syntax for converting the string feature to datetime:

# df["date_of_journey"]=pd.to_datetime(<column>,dayfirst=True), where <column> is df["Date_of_Journey"]

## The PW Skills team used the code below to split out the day, month and year

df1["year"]=df1["Date_of_Journey"].str.split("/").str[2]
df1["month"]=df1["Date_of_Journey"].str.split("/").str[1]
df1["day"]=df1["Date_of_Journey"].str.split("/").str[0]

# The code above does not convert the feature to datetime; it extracts the day, month and year
# by splitting the string, so the resulting values are stored as strings (object dtype). The code
# below converts them to integers.

df1["year"]=df1["year"].astype(int)
df1["month"]=df1["month"].astype(int)
df1["day"]=df1["day"].astype(int)
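
As an alternative to the `apply(lambda ...)` calls and the string split, the `.dt` accessor extracts the date parts directly from a datetime column. The sketch below uses a small stand-in frame named `demo` (an assumption, so it runs on its own) with three dates copied from the rows shown earlier.

```python
import pandas as pd

# Small stand-in for the Date_of_Journey column (values copied from the rows shown above).
demo = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})

# An explicit day-first format ensures 1/05/2019 is read as 1 May, not 5 January.
parsed = pd.to_datetime(demo["Date_of_Journey"], format="%d/%m/%Y")

# The .dt accessor returns integer columns, so no astype(int) step is needed afterwards.
demo["Day"] = parsed.dt.day
demo["Month"] = parsed.dt.month
demo["Year"] = parsed.dt.year

print(demo)
```

The integer width of these columns (int32 or int64) depends on the pandas version, which is one plausible reason the two approaches in this notebook report different integer dtypes.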

In [12]:
df1

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,date_of_journey,year,month,day
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,2019-03-24,2019,3,24
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019-05-01,2019,5,1
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,2019-06-09,2019,6,9
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019-05-12,2019,5,12
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019-03-01,2019,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,9/04/2019,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019-04-09,2019,4,9
10679,Air India,27/04/2019,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019-04-27,2019,4,27
10680,Jet Airways,27/04/2019,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019-04-27,2019,4,27
10681,Vistara,01/03/2019,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019-03-01,2019,3,1


In [13]:
df1["day"].dtypes

dtype('int32')

In [14]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,date_of_journey,year,month,day
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,2019-03-24,2019,3,24
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019-05-01,2019,5,1
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,2019-06-09,2019,6,9
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019-05-12,2019,5,12
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019-03-01,2019,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019-04-09,2019,4,9
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019-04-27,2019,4,27
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019-04-27,2019,4,27
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019-03-01,2019,3,1


In [15]:
df2["day"].dtypes

dtype('int64')

In [16]:
df2.info()

#  10  date_of_journey  10683 non-null  datetime64[ns]
#  11  year             10683 non-null  int64         
#  12  month            10683 non-null  int64         
#  13  day              10683 non-null  int64 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Airline          10683 non-null  object        
 1   Source           10683 non-null  object        
 2   Destination      10683 non-null  object        
 3   Route            10682 non-null  object        
 4   Dep_Time         10683 non-null  object        
 5   Arrival_Time     10683 non-null  object        
 6   Duration         10683 non-null  object        
 7   Total_Stops      10682 non-null  object        
 8   Additional_Info  10683 non-null  object        
 9   Price            10683 non-null  int64         
 10  date_of_journey  10683 non-null  datetime64[ns]
 11  year             10683 non-null  int64         
 12  month            10683 non-null  int64         
 13  day              10683 non-null  int64         
dtypes: datetime64[ns](1), int64(4), object

In [17]:
# You can drop date_of_journey feature if you want

df2.drop("date_of_journey",axis=1,inplace=True)

# or

# df2=df2.drop("date_of_journey",axis=1)

In [18]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,year,month,day
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,2019,3,24
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019,5,1
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,2019,6,9
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019,5,12
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019,4,9
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019,4,27
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019,4,27
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019,3,1


In [19]:
df2.rename({"year":"Year","day":"Day","month":"Month"},axis=1,inplace=True)

# or

# df2=df2.rename({"year":"Year","day":"Day","month":"Month"},axis=1)

In [20]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897,2019,3,24
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019,5,1
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882,2019,6,9
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019,5,12
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019,4,9
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019,4,27
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019,4,27
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019,3,1


In [21]:
# Next, process the Arrival_Time column

df2.Arrival_Time.dtypes

# It is object dtype (strings). Let's extract the arrival hour and minute

dtype('O')

In [22]:
df2["Arrival_Time"]=df2["Arrival_Time"].str.split(" ").str[0]

# or

# we can use lambda function

# df2["Arrival_Time"]=df2["Arrival_Time"].apply(lambda x: x.split(" ")[0])

In [23]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10,2h 50m,non-stop,No info,3897,2019,3,24
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019,5,1
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19h,2 stops,No info,13882,2019,6,9
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019,5,12
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019,4,9
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019,4,27
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019,4,27
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019,3,1


In [24]:
# Add the Arrival_hour column (extract the hour from Arrival_Time)

df2["Arrival_hour"]=df2["Arrival_Time"].str.split(":").str[0]

In [25]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10,2h 50m,non-stop,No info,3897,2019,3,24,01
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019,5,1,13
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19h,2 stops,No info,13882,2019,6,9,04
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019,5,12,23
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019,3,1,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019,4,9,22
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019,4,27,23
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019,4,27,11
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019,3,1,14


In [26]:
# Add the Arrival_min column (extract the minute from Arrival_Time)

df2["Arrival_min"]=df2["Arrival_Time"].str.split(":").str[1]

In [27]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,01:10,2h 50m,non-stop,No info,3897,2019,3,24,01,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662,2019,5,1,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25,19h,2 stops,No info,13882,2019,6,9,04,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218,2019,5,12,23,30
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302,2019,3,1,21,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,22:25,2h 30m,non-stop,No info,4107,2019,4,9,22,25
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,23:20,2h 35m,non-stop,No info,4145,2019,4,27,23,20
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,11:20,3h,non-stop,No info,7229,2019,4,27,11,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,14:10,2h 40m,non-stop,No info,12648,2019,3,1,14,10


In [28]:
df2.Arrival_hour.dtypes

# It is in object type

dtype('O')

In [29]:
# Convert the feature into integer datatype

df2["Arrival_hour"]=df2["Arrival_hour"].astype(int)
df2["Arrival_min"]=df2["Arrival_min"].astype(int)
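
The strip, split and cast steps for the arrival time can also be combined into one regular-expression extraction. This is a minimal sketch on a stand-in frame named `demo` (an assumption), using raw values of the form shown in the first `df.head()` output.

```python
import pandas as pd

# Stand-in for the raw Arrival_Time column, including values with a trailing date.
demo = pd.DataFrame({"Arrival_Time": ["01:10 22 Mar", "13:15", "04:25 10 Jun"]})

# Capture the leading HH:MM and ignore any trailing text such as "22 Mar".
# Named groups become the column names of the extracted frame.
parts = demo["Arrival_Time"].str.extract(r"^(?P<Arrival_hour>\d{1,2}):(?P<Arrival_min>\d{2})")

# Cast the captured strings to integers and attach them to the original frame.
demo = demo.join(parts.astype(int))

print(demo)
```

The same pattern would apply to `Dep_Time` with the group names changed.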

In [30]:
df2.Arrival_hour.dtypes

dtype('int32')

In [31]:
# You can drop arrival time feature if you want

df2.drop("Arrival_Time",axis=1,inplace=True)

In [32]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,2019,3,24,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,2019,5,1,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,2019,6,9,4,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,5h 25m,1 stop,No info,6218,2019,5,12,23,30
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,4h 45m,1 stop,No info,13302,2019,3,1,21,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,2h 30m,non-stop,No info,4107,2019,4,9,22,25
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,2h 35m,non-stop,No info,4145,2019,4,27,23,20
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,3h,non-stop,No info,7229,2019,4,27,11,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,2h 40m,non-stop,No info,12648,2019,3,1,14,10


In [33]:
df2.head()

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,2019,3,24,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,2019,5,1,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,2019,6,9,4,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,5h 25m,1 stop,No info,6218,2019,5,12,23,30
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,4h 45m,1 stop,No info,13302,2019,3,1,21,35


In [34]:
df2.sample(4)

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min
1556,Multiple carriers,Delhi,Cochin,DEL → BOM → COK,10:20,15h 10m,1 stop,No info,7670,2019,5,18,1,30
1857,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,2h 30m,non-stop,No info,4107,2019,6,21,22,25
235,Air Asia,Kolkata,Banglore,CCU → BLR,10:20,2h 35m,non-stop,No info,4409,2019,5,1,12,55
1618,SpiceJet,Kolkata,Banglore,CCU → BLR,17:10,2h 30m,non-stop,No check-in baggage included,3841,2019,4,24,19,40


In [35]:
# Apply the same method to Dep_Time as well

df2["Dep_Time"]=df2["Dep_Time"].apply(lambda x: x.split(" ")[0])

In [36]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,2019,3,24,1,10
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,2019,5,1,13,15
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,2019,6,9,4,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,5h 25m,1 stop,No info,6218,2019,5,12,23,30
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,4h 45m,1 stop,No info,13302,2019,3,1,21,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,2h 30m,non-stop,No info,4107,2019,4,9,22,25
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,2h 35m,non-stop,No info,4145,2019,4,27,23,20
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,3h,non-stop,No info,7229,2019,4,27,11,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,2h 40m,non-stop,No info,12648,2019,3,1,14,10


In [37]:
df2["Dep_hour"]= df2["Dep_Time"].str.split(":").str[0]
df2["Dep_min"]= df2["Dep_Time"].str.split(":").str[1]

In [38]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Dep_Time,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,22:20,2h 50m,non-stop,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,7h 25m,2 stops,No info,7662,2019,5,1,13,15,05,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,19h,2 stops,No info,13882,2019,6,9,4,25,09,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,18:05,5h 25m,1 stop,No info,6218,2019,5,12,23,30,18,05
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,16:50,4h 45m,1 stop,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,19:55,2h 30m,non-stop,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,20:45,2h 35m,non-stop,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,08:20,3h,non-stop,No info,7229,2019,4,27,11,20,08,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,11:30,2h 40m,non-stop,No info,12648,2019,3,1,14,10,11,30


In [39]:
# dropping the Dep_Time column

df2.drop("Dep_Time",axis=1,inplace=True)

In [40]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,2019,5,1,13,15,05,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2 stops,No info,13882,2019,6,9,4,25,09,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1 stop,No info,6218,2019,5,12,23,30,18,05
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1 stop,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,non-stop,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,non-stop,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,non-stop,No info,7229,2019,4,27,11,20,08,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,non-stop,No info,12648,2019,3,1,14,10,11,30


In [41]:
df2["Dep_hour"].dtypes

dtype('O')

In [42]:
df2["Dep_hour"]= df2["Dep_hour"].astype(int)
df2["Dep_min"]=df2["Dep_min"].astype(int)

In [43]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2 stops,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1 stop,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1 stop,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,non-stop,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,non-stop,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,non-stop,No info,7229,2019,4,27,11,20,8,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,non-stop,No info,12648,2019,3,1,14,10,11,30


In [44]:
df2["Dep_hour"].dtypes

dtype('int32')

In [45]:
df2["Dep_min"].dtypes

dtype('int32')

In [46]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Source           10683 non-null  object
 2   Destination      10683 non-null  object
 3   Route            10682 non-null  object
 4   Duration         10683 non-null  object
 5   Total_Stops      10682 non-null  object
 6   Additional_Info  10683 non-null  object
 7   Price            10683 non-null  int64 
 8   Year             10683 non-null  int64 
 9   Month            10683 non-null  int64 
 10  Day              10683 non-null  int64 
 11  Arrival_hour     10683 non-null  int32 
 12  Arrival_min      10683 non-null  int32 
 13  Dep_hour         10683 non-null  int32 
 14  Dep_min          10683 non-null  int32 
dtypes: int32(4), int64(4), object(7)
memory usage: 1.1+ MB


In [47]:
# The columns below are now integer dtype

#  7   Price            10683 non-null  int64 
#  8   Year             10683 non-null  int64 
#  9   Month            10683 non-null  int64 
#  10  Day              10683 non-null  int64 
#  11  Arrival_hour     10683 non-null  int32 
#  12  Arrival_min      10683 non-null  int32 
#  13  Dep_hour         10683 non-null  int32 
#  14  Dep_min          10683 non-null  int32

In [48]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2 stops,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1 stop,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1 stop,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,non-stop,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,non-stop,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,non-stop,No info,7229,2019,4,27,11,20,8,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,non-stop,No info,12648,2019,3,1,14,10,11,30


In [50]:
# df2["Duration_hour"].astype(int)
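
The commented-out line above refers to a `Duration_hour` column that has not been created yet. One possible way to derive it from strings such as `2h 50m` or `19h` is sketched below; the names `Duration_min` and `Duration_total_min` and the stand-in frame `demo` are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the Duration column, including values with no minute part.
demo = pd.DataFrame({"Duration": ["2h 50m", "7h 25m", "19h", "3h"]})

# Extract the optional hour and minute parts; a missing part is NaN and is treated as "0".
hours = demo["Duration"].str.extract(r"(\d+)\s*h", expand=False).fillna("0").astype(int)
minutes = demo["Duration"].str.extract(r"(\d+)\s*m", expand=False).fillna("0").astype(int)

demo["Duration_hour"] = hours
demo["Duration_min"] = minutes
# A single numeric duration is often easier for a model to use than two columns.
demo["Duration_total_min"] = hours * 60 + minutes

print(demo)
```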

In [51]:

df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,non-stop,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2 stops,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2 stops,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1 stop,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1 stop,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,non-stop,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,non-stop,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,non-stop,No info,7229,2019,4,27,11,20,8,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,non-stop,No info,12648,2019,3,1,14,10,11,30


In [52]:
# Replace the Total_Stops labels with numeric values

df2["Total_Stops"].unique()

# Ignore the NaN values for now; they are handled later

array(['non-stop', '2 stops', '1 stop', '3 stops', nan, '4 stops'],
      dtype=object)

In [53]:
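# Map the Total_Stops labels to numeric values
# Note: '3 stops' is absent from this dictionary, so map() turns those rows into NaN
# alongside the one original NaN; adding '3 stops':3 would avoid this. The outputs
# below reflect the mapping as written.

df2["Total_Stops"]=df2["Total_Stops"].map({'non-stop':0,'2 stops':2,'1 stop':1,'4 stops':4})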

In [54]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2.0,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,0.0,No info,7229,2019,4,27,11,20,8,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30


In [55]:
# Find the NaN values in the Total_Stops column

df2[df2["Total_Stops"].isnull()]

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
402,Air India,Delhi,Cochin,DEL → RPR → NAG → BOM → COK,26h 25m,,No info,10493,2019,6,15,7,40,5,15
919,Air India,Kolkata,Banglore,CCU → BBI → IXR → DEL → BLR,35h 15m,,No info,10991,2019,5,12,23,15,12,0
1218,Air India,Delhi,Cochin,DEL → RPR → NAG → BOM → COK,26h 25m,,No info,11543,2019,6,27,7,40,5,15
1665,Air India,Banglore,New Delhi,BLR → CCU → BBI → HYD → DEL,30h 25m,,No info,12346,2019,3,1,12,15,5,50
2172,Air India,Delhi,Cochin,DEL → RPR → NAG → BOM → COK,38h,,No info,10703,2019,5,18,19,15,5,15
2623,Air India,Mumbai,Hyderabad,BOM → JDH → JAI → DEL → HYD,29h 35m,,No info,18293,2019,3,12,15,15,9,40
2633,Multiple carriers,Delhi,Cochin,DEL → GWL → IDR → BOM → COK,9h 25m,,No info,21829,2019,3,6,21,0,11,35
2718,Air India,Delhi,Cochin,DEL → RPR → NAG → BOM → COK,38h,,No info,15586,2019,3,9,19,15,5,15
2814,Air India,Banglore,New Delhi,BLR → BOM → IDR → GWL → DEL,24h 40m,,No info,13387,2019,3,12,18,5,17,25
2822,Air India,Kolkata,Banglore,CCU → DEL → COK → TRV → BLR,24h 30m,,No info,13007,2019,5,24,10,30,10,0


In [56]:
# Replace the NaN values with meaningful values

# The number of stops can be derived from the Route feature

# e.g. for index 402:

# DEL → RPR → NAG → BOM → COK ---> 3 stops

# i.e. (number of airports in Route) - 2 ==> 5 - 2 = 3 stops

# <---------------- IMPLEMENTED LATER IN THIS NOTEBOOK -------------------------------->
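The rule described above can be sketched as a small helper. The function name `stops_from_route` is hypothetical and used only for illustration; it assumes airports in `Route` are separated by the `→` character, as in the rows shown.

```python
def stops_from_route(route):
    """Return the number of intermediate stops, or None if the route is missing."""
    if not isinstance(route, str):
        # A missing Route is stored as NaN (a float), so there is nothing to count.
        return None
    airports = [code.strip() for code in route.split("→")]
    # Source and destination are not stops, hence the "- 2".
    return len(airports) - 2


print(stops_from_route("DEL → RPR → NAG → BOM → COK"))  # 3
print(stops_from_route("BLR → DEL"))                    # 0
```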

In [57]:
df2[df2["Total_Stops"].isnull()].shape

(46, 15)

In [58]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2.0,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,0.0,No info,7229,2019,4,27,11,20,8,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30


In [59]:
df2[df2["Route"].isnull()]

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
9039,Air India,Delhi,Cochin,,23h 40m,,No info,7480,2019,5,6,9,25,9,45


In [60]:
# # Dropping the row that has NaN values for the Route and Total_Stops features

# df2.drop(9039,axis=0,inplace=True)

In [61]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2.0,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,0.0,No info,7229,2019,4,27,11,20,8,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30


In [62]:
# Derive the stop count from Route for every row; a missing Route stays NaN
df2["Total_Stops"]=df2["Route"].apply(lambda x: len(x.split("→"))-2 if isinstance(x,str) else np.nan)

In [63]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2.0,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10678,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55
10679,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45
10680,Jet Airways,Banglore,Delhi,BLR → DEL,3h,0.0,No info,7229,2019,4,27,11,20,8,20
10681,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30


In [64]:
# NaN values of the Total_Stops feature are now replaced with the number of stops derived from the Route feature

df2.iloc[[402,919]]

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
402,Air India,Delhi,Cochin,DEL → RPR → NAG → BOM → COK,26h 25m,3.0,No info,10493,2019,6,15,7,40,5,15
919,Air India,Kolkata,Banglore,CCU → BBI → IXR → DEL → BLR,35h 15m,3.0,No info,10991,2019,5,12,23,15,12,0


In [65]:
df2["Total_Stops"].isnull().sum()

1

In [66]:
df2[df2["Route"].isnull()]

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
9039,Air India,Delhi,Cochin,,23h 40m,,No info,7480,2019,5,6,9,25,9,45


In [67]:
# Null value record dropped from df2 dataframe

df2.drop(9039,axis=0,inplace=True)

In [68]:
df2.iloc[9037:9040] # Index label 9039 is missing because that row was dropped. This leaves a gap in the labels,
# so the row index needs to be reset.

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
9037,Air Asia,Kolkata,Banglore,CCU → BLR,2h 35m,0.0,No info,5620,2019,3,24,12,55,10,20
9038,IndiGo,Delhi,Cochin,DEL → COK,3h 15m,0.0,No info,5000,2019,6,18,8,50,5,35
9040,Air India,Delhi,Cochin,DEL → LKO → BOM → COK,30h 55m,2.0,No info,10703,2019,3,21,19,15,12,20


In [69]:
df2.reset_index(drop=True,inplace=True) # Important step: renumber the row index so it is contiguous again

In [70]:
df2.iloc[9038:]

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
9038,IndiGo,Delhi,Cochin,DEL → COK,3h 15m,0.0,No info,5000,2019,6,18,8,50,5,35
9039,Air India,Delhi,Cochin,DEL → LKO → BOM → COK,30h 55m,2.0,No info,10703,2019,3,21,19,15,12,20
9040,Multiple carriers,Delhi,Cochin,DEL → BOM → COK,9h 30m,1.0,No info,13727,2019,5,18,21,0,11,30
9041,Jet Airways,Kolkata,Banglore,CCU → BOM → BLR,13h 20m,1.0,No info,14388,2019,5,12,9,20,20,0
9042,Air India,Kolkata,Banglore,CCU → DEL → COK → BLR,15h 20m,2.0,No info,13033,2019,6,6,1,20,10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55
10678,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45
10679,Jet Airways,Banglore,Delhi,BLR → DEL,3h,0.0,No info,7229,2019,4,27,11,20,8,20
10680,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30


In [71]:
df2

Unnamed: 0,Airline,Source,Destination,Route,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,BLR → DEL,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,CCU → IXR → BBI → BLR,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,DEL → LKO → BOM → COK,19h,2.0,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,CCU → NAG → BLR,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,BLR → NAG → DEL,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,CCU → BLR,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55
10678,Air India,Kolkata,Banglore,CCU → BLR,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45
10679,Jet Airways,Banglore,Delhi,BLR → DEL,3h,0.0,No info,7229,2019,4,27,11,20,8,20
10680,Vistara,Banglore,New Delhi,BLR → DEL,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30


In [72]:
df2[["Route","Total_Stops"]].isnull().sum()

Route          0
Total_Stops    0
dtype: int64

In [73]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10682 entries, 0 to 10681
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Airline          10682 non-null  object 
 1   Source           10682 non-null  object 
 2   Destination      10682 non-null  object 
 3   Route            10682 non-null  object 
 4   Duration         10682 non-null  object 
 5   Total_Stops      10682 non-null  float64
 6   Additional_Info  10682 non-null  object 
 7   Price            10682 non-null  int64  
 8   Year             10682 non-null  int64  
 9   Month            10682 non-null  int64  
 10  Day              10682 non-null  int64  
 11  Arrival_hour     10682 non-null  int32  
 12  Arrival_min      10682 non-null  int32  
 13  Dep_hour         10682 non-null  int32  
 14  Dep_min          10682 non-null  int32  
dtypes: float64(1), int32(4), int64(4), object(6)
memory usage: 1.1+ MB


In [74]:
# Let's drop the route column

df2.drop("Route",axis=1,inplace=True)

In [75]:
df2

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,19h,2.0,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55
10678,Air India,Kolkata,Banglore,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45
10679,Jet Airways,Banglore,Delhi,3h,0.0,No info,7229,2019,4,27,11,20,8,20
10680,Vistara,Banglore,New Delhi,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30


In [76]:
df2.Destination.unique()

array(['New Delhi', 'Banglore', 'Cochin', 'Kolkata', 'Delhi', 'Hyderabad'],
      dtype=object)

In [77]:
df2.Source.unique()

array(['Banglore', 'Kolkata', 'Delhi', 'Chennai', 'Mumbai'], dtype=object)

In [78]:
df2.Airline.unique()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [79]:
# ## Let's separate the hours and minutes from the Duration feature

# df2["Duration_hours"]=df2["Duration"].apply(lambda x: x.split("h")[0])

# # or

# df2["Duration"]=df2["Duration"].str.split().str[0].str.split("h").str[0] # Inserting under the same duration feature

In [80]:
df2

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min
0,IndiGo,Banglore,New Delhi,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20
1,Air India,Kolkata,Banglore,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50
2,Jet Airways,Delhi,Cochin,19h,2.0,No info,13882,2019,6,9,4,25,9,25
3,IndiGo,Kolkata,Banglore,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5
4,IndiGo,Banglore,New Delhi,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55
10678,Air India,Kolkata,Banglore,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45
10679,Jet Airways,Banglore,Delhi,3h,0.0,No info,7229,2019,4,27,11,20,8,20
10680,Vistara,Banglore,New Delhi,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30


In [81]:
# Convert the Duration strings to a single value in hours (hours + minutes/60)

def replacing_duration_to_hours(x):
    b=x.split()
    if len(b)==2:
        # Both parts present, e.g. "2h 50m"
        hours=int(b[0].split("h")[0])
        minutes=int(b[1].split("m")[0])
    elif b[0][-1]=="m":
        # Minutes only
        hours=0
        minutes=int(b[0].split("m")[0])
    else:
        # Hours only, e.g. "19h"
        hours=int(b[0].split("h")[0])
        minutes=0
    return round(hours+(minutes/60),3)
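As an alternative to branching on the string parts, the same conversion can be sketched with a regular expression. This is not the notebook's method; the name `duration_to_hours` is hypothetical, and the sketch assumes every value has the form `"<H>h"`, `"<M>m"` or `"<H>h <M>m"`.

```python
import re

# Optional hours part, optional whitespace, optional minutes part.
_DURATION = re.compile(r"^(?:(\d+)h)?\s*(?:(\d+)m)?$")


def duration_to_hours(text):
    """Convert a string such as '2h 50m' to hours, rounded to 3 decimals."""
    match = _DURATION.match(text.strip())
    if match is None or not any(match.groups()):
        # Fail loudly on unexpected input instead of guessing.
        raise ValueError(f"unrecognised duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return round(hours + minutes / 60, 3)


print(duration_to_hours("2h 50m"))  # 2.833
print(duration_to_hours("19h"))     # 19.0
```

Raising on unrecognised input is a design choice: a malformed duration then surfaces immediately rather than becoming a wrong number in the feature column.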

In [82]:
df2["Duration_Hours"]=df2["Duration"].apply(replacing_duration_to_hours)

In [83]:
# df2.rename({"Duration":"Duration_Hours"},axis=1,inplace=True)

In [84]:
df2

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min,Duration_Hours
0,IndiGo,Banglore,New Delhi,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20,2.833
1,Air India,Kolkata,Banglore,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50,7.417
2,Jet Airways,Delhi,Cochin,19h,2.0,No info,13882,2019,6,9,4,25,9,25,19.000
3,IndiGo,Kolkata,Banglore,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5,5.417
4,IndiGo,Banglore,New Delhi,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50,4.750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55,2.500
10678,Air India,Kolkata,Banglore,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45,2.583
10679,Jet Airways,Banglore,Delhi,3h,0.0,No info,7229,2019,4,27,11,20,8,20,3.000
10680,Vistara,Banglore,New Delhi,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30,2.667


In [85]:
df2["Duration_Hours"].dtypes

dtype('float64')

In [86]:
## Let's look at the PW Skills method for extracting the hours and minutes

## Extracting Hours

print(df2["Duration"].str.split().str[0].str.split("h").str[0])
df2["Duration_hours_pw"]=df2["Duration"].str.split().str[0].str.split("h").str[0]

0         2
1         7
2        19
3         5
4         4
         ..
10677     2
10678     2
10679     3
10680     2
10681     8
Name: Duration, Length: 10682, dtype: object


In [87]:
# Extracting minutes

print(df2["Duration"].str.split().str[1].str.split("m").str[0])

df2["Duration_minutes_pw"]=df2["Duration"].str.split().str[1].str.split("m").str[0]

0         50
1         25
2        NaN
3         25
4         45
        ... 
10677     30
10678     35
10679    NaN
10680     40
10681     20
Name: Duration, Length: 10682, dtype: object


In [88]:
df2

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min,Duration_Hours,Duration_hours_pw,Duration_minutes_pw
0,IndiGo,Banglore,New Delhi,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20,2.833,2,50
1,Air India,Kolkata,Banglore,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50,7.417,7,25
2,Jet Airways,Delhi,Cochin,19h,2.0,No info,13882,2019,6,9,4,25,9,25,19.000,19,
3,IndiGo,Kolkata,Banglore,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5,5.417,5,25
4,IndiGo,Banglore,New Delhi,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50,4.750,4,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55,2.500,2,30
10678,Air India,Kolkata,Banglore,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45,2.583,2,35
10679,Jet Airways,Banglore,Delhi,3h,0.0,No info,7229,2019,4,27,11,20,8,20,3.000,3,
10680,Vistara,Banglore,New Delhi,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30,2.667,2,40


In [89]:
# Durations without a minutes part (e.g. "19h") give NaN; treat them as 0 minutes
df2["Duration_minutes_pw"]=df2["Duration_minutes_pw"].fillna(0)

In [90]:
df2

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min,Duration_Hours,Duration_hours_pw,Duration_minutes_pw
0,IndiGo,Banglore,New Delhi,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20,2.833,2,50
1,Air India,Kolkata,Banglore,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50,7.417,7,25
2,Jet Airways,Delhi,Cochin,19h,2.0,No info,13882,2019,6,9,4,25,9,25,19.000,19,0
3,IndiGo,Kolkata,Banglore,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5,5.417,5,25
4,IndiGo,Banglore,New Delhi,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50,4.750,4,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55,2.500,2,30
10678,Air India,Kolkata,Banglore,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45,2.583,2,35
10679,Jet Airways,Banglore,Delhi,3h,0.0,No info,7229,2019,4,27,11,20,8,20,3.000,3,0
10680,Vistara,Banglore,New Delhi,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30,2.667,2,40


In [91]:
df2["Duration_minutes_pw"].isnull().sum()

0

In [92]:
df2["Duration_hours_pw"].isnull().sum()

0
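Note that the two `_pw` columns produced by the split chain hold strings (object dtype), not numbers. The sketch below, on illustrative values only, shows one way to convert them with `pd.to_numeric`. It also shows a limitation of the split chain: a minutes-only value such as `"5m"` would end up with neither hours nor minutes.

```python
import pandas as pd

# Illustrative values, not rows from the dataset.
durations = pd.Series(["2h 50m", "19h", "5m"])

# Same split chain as in the cells above; the pieces are still strings.
hours_raw = durations.str.split().str[0].str.split("h").str[0]
minutes_raw = durations.str.split().str[1].str.split("m").str[0]

# errors="coerce" turns anything non-numeric (such as "5m") into NaN.
hours = pd.to_numeric(hours_raw, errors="coerce").fillna(0).astype(int)
minutes = pd.to_numeric(minutes_raw, errors="coerce").fillna(0).astype(int)

print(hours.tolist())    # [2, 19, 0]
print(minutes.tolist())  # [50, 0, 0]  <- the 5 minutes of "5m" are lost
```

If the data contains minutes-only durations, they need separate handling, as the `replacing_duration_to_hours` function does.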

In [93]:
# Encoding the categorical features

df2["Airline"].unique()

array(['IndiGo', 'Air India', 'Jet Airways', 'SpiceJet',
       'Multiple carriers', 'GoAir', 'Vistara', 'Air Asia',
       'Vistara Premium economy', 'Jet Airways Business',
       'Multiple carriers Premium economy', 'Trujet'], dtype=object)

In [94]:
df2["Source"].unique()

array(['Banglore', 'Kolkata', 'Delhi', 'Chennai', 'Mumbai'], dtype=object)

In [95]:
df2["Destination"].unique()

array(['New Delhi', 'Banglore', 'Cochin', 'Kolkata', 'Delhi', 'Hyderabad'],
      dtype=object)

In [96]:
df2["Additional_Info"].unique()

array(['No info', 'In-flight meal not included',
       'No check-in baggage included', '1 Short layover', 'No Info',
       '1 Long layover', 'Change airports', 'Business class',
       'Red-eye flight', '2 Long layover'], dtype=object)

In [97]:
## Let's use one-hot encoding to encode the categorical features

# As these categorical variables are nominal (unordered), one-hot encoding is an appropriate technique.

from sklearn.preprocessing import OneHotEncoder

ohe= OneHotEncoder()

In [98]:
ohe.fit_transform(df2[["Airline","Source","Destination"]])

<10682x23 sparse matrix of type '<class 'numpy.float64'>'
	with 32046 stored elements in Compressed Sparse Row format>

In [99]:
ohe.fit_transform(df2[["Source","Destination"]]).toarray()

array([[1., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [100]:
encoded_features=pd.DataFrame(ohe.fit_transform(df2[["Source","Destination","Airline"]]).toarray(),columns=ohe.get_feature_names_out())

# The Additional_Info feature could be added to the encoded columns in the same way
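For a quick cross-check of the encoded column names, `pd.get_dummies` produces a comparable one-hot layout using pandas alone. This is a different technique from the `OneHotEncoder` used above, shown here only as a sketch on a tiny illustrative frame.

```python
import pandas as pd

# Tiny illustrative frame; the values mirror the kind of labels in the dataset.
sample = pd.DataFrame({
    "Source": ["Banglore", "Kolkata", "Delhi"],
    "Destination": ["New Delhi", "Banglore", "Cochin"],
})

# One indicator column per category, named "<column>_<category>".
dummies = pd.get_dummies(sample, columns=["Source", "Destination"], dtype=float)

print(list(dummies.columns))
print(dummies.sum(axis=1).tolist())  # each row has one 1.0 per encoded feature
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is less suited to transforming new data consistently.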

In [101]:
## Check the shapes before concatenating df2 with the encoded_features dataframe

encoded_features.shape

(10682, 23)

In [102]:
df2.shape

(10682, 17)

In [103]:
# Keep only rows where Airline is present (a comparison with np.nan cannot detect missing values)
df2[df2["Airline"].notna()]

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,Arrival_hour,Arrival_min,Dep_hour,Dep_min,Duration_Hours,Duration_hours_pw,Duration_minutes_pw
0,IndiGo,Banglore,New Delhi,2h 50m,0.0,No info,3897,2019,3,24,1,10,22,20,2.833,2,50
1,Air India,Kolkata,Banglore,7h 25m,2.0,No info,7662,2019,5,1,13,15,5,50,7.417,7,25
2,Jet Airways,Delhi,Cochin,19h,2.0,No info,13882,2019,6,9,4,25,9,25,19.000,19,0
3,IndiGo,Kolkata,Banglore,5h 25m,1.0,No info,6218,2019,5,12,23,30,18,5,5.417,5,25
4,IndiGo,Banglore,New Delhi,4h 45m,1.0,No info,13302,2019,3,1,21,35,16,50,4.750,4,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,2h 30m,0.0,No info,4107,2019,4,9,22,25,19,55,2.500,2,30
10678,Air India,Kolkata,Banglore,2h 35m,0.0,No info,4145,2019,4,27,23,20,20,45,2.583,2,35
10679,Jet Airways,Banglore,Delhi,3h,0.0,No info,7229,2019,4,27,11,20,8,20,3.000,3,0
10680,Vistara,Banglore,New Delhi,2h 40m,0.0,No info,12648,2019,3,1,14,10,11,30,2.667,2,40
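A short sketch of why missing values should be checked with `notna()` or `isna()` rather than a comparison: NaN is not equal to anything, including itself, so `!= np.nan` is True for every row. The sample series is illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative values with one missing entry.
airlines = pd.Series(["IndiGo", np.nan, "Vistara"], dtype=object)

naive = airlines[airlines != np.nan]   # keeps every row, including the NaN
proper = airlines[airlines.notna()]    # drops the missing row

print(len(naive), len(proper))  # 3 2
```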


In [104]:
encoded_features.isnull().sum()

Source_Banglore                              0
Source_Chennai                               0
Source_Delhi                                 0
Source_Kolkata                               0
Source_Mumbai                                0
Destination_Banglore                         0
Destination_Cochin                           0
Destination_Delhi                            0
Destination_Hyderabad                        0
Destination_Kolkata                          0
Destination_New Delhi                        0
Airline_Air Asia                             0
Airline_Air India                            0
Airline_GoAir                                0
Airline_IndiGo                               0
Airline_Jet Airways                          0
Airline_Jet Airways Business                 0
Airline_Multiple carriers                    0
Airline_Multiple carriers Premium economy    0
Airline_SpiceJet                             0
Airline_Trujet                               0
Airline_Vista

In [105]:
# concatenate the dataframes

df_final=pd.concat([df2,encoded_features],axis=1)
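`pd.concat` with `axis=1` aligns rows by index label, which is why the earlier `reset_index(drop=True)` matters. The sketch below uses two small illustrative frames to show what happens when one frame has a gap in its index after a row was dropped.

```python
import pandas as pd

# "left" mimics a frame after a row was dropped: label 2 is missing.
left = pd.DataFrame({"Price": [100, 200, 300]}, index=[0, 1, 3])
# "right" mimics freshly encoded features with a default 0..2 index.
right = pd.DataFrame({"flag": [1.0, 0.0, 1.0]})

# Without a reset, labels 2 and 3 each match only one frame, so NaN rows appear.
misaligned = pd.concat([left, right], axis=1)

# After resetting, both frames share labels 0..2 and line up row by row.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))  # (4, 2) 2
print(aligned.shape, int(aligned.isna().sum().sum()))        # (3, 2) 0
```

Comparing the row count of the result with the inputs is a simple check that the concatenation aligned as intended.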

In [106]:
df_final

Unnamed: 0,Airline,Source,Destination,Duration,Total_Stops,Additional_Info,Price,Year,Month,Day,...,Airline_GoAir,Airline_IndiGo,Airline_Jet Airways,Airline_Jet Airways Business,Airline_Multiple carriers,Airline_Multiple carriers Premium economy,Airline_SpiceJet,Airline_Trujet,Airline_Vistara,Airline_Vistara Premium economy
0,IndiGo,Banglore,New Delhi,2h 50m,0.0,No info,3897,2019,3,24,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Air India,Kolkata,Banglore,7h 25m,2.0,No info,7662,2019,5,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Jet Airways,Delhi,Cochin,19h,2.0,No info,13882,2019,6,9,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,IndiGo,Kolkata,Banglore,5h 25m,1.0,No info,6218,2019,5,12,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,IndiGo,Banglore,New Delhi,4h 45m,1.0,No info,13302,2019,3,1,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10677,Air Asia,Kolkata,Banglore,2h 30m,0.0,No info,4107,2019,4,9,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10678,Air India,Kolkata,Banglore,2h 35m,0.0,No info,4145,2019,4,27,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10679,Jet Airways,Banglore,Delhi,3h,0.0,No info,7229,2019,4,27,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10680,Vistara,Banglore,New Delhi,2h 40m,0.0,No info,12648,2019,3,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [107]:
## The original Airline, Source, Destination and Duration columns can be dropped if needed.

## Let's move on to a different dataset