## Column description
flight_date: The date of the flight in DD-MM-YYYY format (e.g., 26-06-2023).

airline: Name of the airline operating the flight (e.g., Air India, Indigo, Vistara).

flight_num: Unique flight number assigned by the airline (e.g., AI-473, G8-2403).

class: Travel class of the ticket (e.g., Economy, Business).

from: Departure city or airport (e.g., Delhi).

dep_time: Departure time in HH:MM format.

to: Destination city or airport (e.g., Mumbai).

arr_time: Arrival time in HH:MM format.

duration: Total flight duration in hours and minutes (e.g., 2h 30m).

price: Ticket price in local currency (e.g., 7,581).

stops: Number of stops (e.g., non-stop, 1-stop, 2+-stop).

# Order of Severity in Data Cleaning

1️⃣ Completeness (Highest Severity)

2️⃣ Validity

3️⃣ Accuracy

4️⃣ Consistency (Lowest Severity)

Explanation:
Completeness → Missing data can make analysis impossible or misleading. If key values are missing, the dataset becomes unreliable.

Validity → Data should conform to predefined formats, types, and rules. Invalid data (e.g., negative ticket prices) can introduce serious errors.

Accuracy → Data should reflect real-world values. Even complete and valid data can lead to incorrect conclusions if it is inaccurate (e.g., a duplicated entry).

Consistency → Representation should be uniform across the dataset (e.g., date formats, naming conventions). While important, inconsistent data can often be corrected without catastrophic impact.
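
The four checks can be sketched on a small example. This is only an illustration: the `toy` frame and its values are made up (not taken from the flights file), and the specific rules (positive prices, DD-MM-YYYY dates) are assumptions chosen to mirror the explanation above.

```python
import pandas as pd

# Hypothetical toy data, not taken from the flights file.
toy = pd.DataFrame({
    "flight_date": ["26-06-2023", "26-06-2023", "2023/06/27", "26-06-2023"],
    "airline": ["SpiceJet", "SpiceJet", "Vistara", None],
    "price": [6013, 6013, -50, 4200],
})

# 1. Completeness: count missing values per column.
missing_per_column = toy.isna().sum()

# 2. Validity: a ticket price should be positive (assumed rule).
invalid_prices = int((toy["price"] <= 0).sum())

# 3. Accuracy: rows that are exact copies of an earlier row.
duplicate_rows = int(toy.duplicated().sum())

# 4. Consistency: dates that do not follow the DD-MM-YYYY pattern.
date_ok = toy["flight_date"].str.fullmatch(r"\d{2}-\d{2}-\d{4}")
inconsistent_dates = int((~date_ok).sum())

print(missing_per_column["airline"], invalid_prices, duplicate_rows, inconsistent_dates)
```

Each check flags exactly one problem in this toy frame, so the four counts can be reviewed in severity order before deciding what to fix.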

In [1]:
import pandas as pd
pd.set_option('display.max_columns',None)

In [2]:
df=pd.read_csv('/content/goibibo_flights_data.csv')
df.head()

Unnamed: 0,flight date,airline,flight_num,class,from,dep_time,to,arr_time,duration,price,stops,Unnamed: 11,Unnamed: 12
0,26-06-2023,SpiceJet,SG-8709,economy,Delhi,18:55,Mumbai,21:05,02h 10m,6013,non-stop,,
1,26-06-2023,SpiceJet,SG-8157,economy,Delhi,06:20,Mumbai,08:40,02h 20m,6013,non-stop,,
2,26-06-2023,AirAsia,I5-764,economy,Delhi,04:25,Mumbai,06:35,02h 10m,6016,non-stop,,
3,26-06-2023,Vistara,UK-995,economy,Delhi,10:20,Mumbai,12:35,02h 15m,6015,non-stop,,
4,26-06-2023,Vistara,UK-963,economy,Delhi,08:50,Mumbai,11:10,02h 20m,6015,non-stop,,


In [4]:
df=df.drop(columns=['Unnamed: 11','Unnamed: 12'])

In [5]:
# Rename all columns in a single call
df.rename(columns={
    'flight date':'flight_date',
    'flight_num':'flight_id',
    'from':'departure_city',
    'dep_time':'departure_time',
    'to':'arrival_city',
    'arr_time':'arrival_time',
},inplace=True)

In [6]:
df.shape

(300261, 11)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300261 entries, 0 to 300260
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   flight_date     300261 non-null  object
 1   airline         300261 non-null  object
 2   flight_id       300261 non-null  object
 3   class           300261 non-null  object
 4   departure_city  300261 non-null  object
 5   departure_time  300261 non-null  object
 6   arrival_city    300261 non-null  object
 7   arrival_time    300261 non-null  object
 8   duration        300261 non-null  object
 9   price           300261 non-null  object
 10  stops           300261 non-null  object
dtypes: object(11)
memory usage: 25.2+ MB


In [8]:
df.isna().sum()

Unnamed: 0,0
flight_date,0
airline,0
flight_id,0
class,0
departure_city,0
departure_time,0
arrival_city,0
arrival_time,0
duration,0
price,0


In [9]:
df.duplicated().sum()

np.int64(2)

In [10]:
df[df.duplicated(keep=False)]

Unnamed: 0,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops
516,29-06-2023,Air India,AI-807,economy,Delhi,17:20,Mumbai,08:35,15h 15m,12272,1-stop
563,29-06-2023,Air India,AI-807,economy,Delhi,17:20,Mumbai,08:35,15h 15m,12272,1-stop
6080,26-07-2023,Air India,AI-475,economy,Delhi,13:00,Mumbai,13:35,24h 35m,4828,1-stop
6181,26-07-2023,Air India,AI-475,economy,Delhi,13:00,Mumbai,13:35,24h 35m,4828,1-stop


In [11]:
df.drop_duplicates(keep='first',inplace=True)

In [12]:
df.shape

(300259, 11)
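
The cells above report 2 duplicates while displaying 4 rows. A minimal sketch on made-up rows shows why: `duplicated()` by default marks only the later copies, whereas `keep=False` marks every member of a duplicated group. The `toy` frame is hypothetical and only borrows two flight numbers for readability.

```python
import pandas as pd

# Hypothetical rows: two duplicated pairs plus one unique row.
toy = pd.DataFrame({
    "flight_id": ["AI-807", "AI-807", "AI-475", "AI-475", "UK-995"],
    "price": [12272, 12272, 4828, 4828, 6015],
})

# Default keep='first': only the second copy of each pair is flagged.
extra_copies = int(toy.duplicated().sum())

# keep=False: both members of each pair are flagged.
all_members = int(toy.duplicated(keep=False).sum())

# Dropping with keep='first' removes exactly the extra copies.
deduped = toy.drop_duplicates(keep="first")

print(extra_copies, all_members, len(deduped))
```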

In [13]:
df['flight_date']=pd.to_datetime(df['flight_date'],dayfirst=True)
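
As an alternative sketch, the date format can be stated explicitly so that anything that does not match becomes `NaT` instead of being guessed. The sample strings below are made up, and `errors="coerce"` is a choice made for this illustration, not something the notebook uses.

```python
import pandas as pd

# Hypothetical date strings; the last one is not a real calendar date.
raw_dates = pd.Series(["26-06-2023", "03-07-2023", "31-02-2023"])

# An explicit day-first format avoids ambiguity between 03-07 and 07-03.
parsed = pd.to_datetime(raw_dates, format="%d-%m-%Y", errors="coerce")

# Invalid dates become NaT and can be counted before going further.
n_invalid = int(parsed.isna().sum())

print(parsed.dt.month.tolist()[:2], n_invalid)
```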

In [14]:
df.sample(20)

Unnamed: 0,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops
298610,2023-07-08,Air India,AI-430,business,Chennai,09:55,Hyderabad,21:35,11h 40m,67724,1-stop
52362,2023-08-11,Air India,AI-663,economy,Mumbai,13:20,Delhi,03:40,14h 20m,5568,1-stop
195240,2023-07-14,Vistara,UK-824,economy,Chennai,20:30,Bangalore,19:15,22h 45m,5121,1-stop
83681,2023-08-10,Air India,AI-888,economy,Mumbai,19:00,Chennai,12:50,17h 50m,7670,1-stop
209278,2023-07-19,Vistara,UK-747,business,Delhi,06:30,Mumbai,13:00,06h 30m,71855,1-stop
206156,2023-08-06,Vistara,UK-836,economy,Chennai,10:45,Hyderabad,22:55,12h 10m,4127,1-stop
156817,2023-07-31,Vistara,UK-876,economy,Hyderabad,21:35,Delhi,08:35,11h 00m,5713,1-stop
49727,2023-07-30,Indigo,6E-416,economy,Mumbai,17:15,Delhi,19:20,02h 05m,3139,non-stop
233927,2023-08-09,Vistara,UK-902,business,Mumbai,15:45,Bangalore,23:20,07h 35m,67932,1-stop
153278,2023-07-04,Vistara,UK-874,economy,Hyderabad,08:30,Delhi,14:05,05h 35m,10118,1-stop


In [15]:
df.insert(0, 'year', df['flight_date'].dt.year)  # Insert 'year' at position 0
df.insert(1, 'month', df['flight_date'].dt.month)  # Insert 'month' at position 1
df.insert(2, 'day', df['flight_date'].dt.day)  # Insert 'day' at position 2

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 300259 entries, 0 to 300260
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   year            300259 non-null  int32         
 1   month           300259 non-null  int32         
 2   day             300259 non-null  int32         
 3   flight_date     300259 non-null  datetime64[ns]
 4   airline         300259 non-null  object        
 5   flight_id       300259 non-null  object        
 6   class           300259 non-null  object        
 7   departure_city  300259 non-null  object        
 8   departure_time  300259 non-null  object        
 9   arrival_city    300259 non-null  object        
 10  arrival_time    300259 non-null  object        
 11  duration        300259 non-null  object        
 12  price           300259 non-null  object        
 13  stops           300259 non-null  object        
dtypes: datetime64[ns](1), int32(3), object(10)

In [17]:
df['stops'].value_counts()

Unnamed: 0_level_0,count
stops,Unnamed: 1_level_1
1-stop,243601
non-stop,36044
2+-stop,13288
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia IXU\n\t\t\t\t\t\t\t\t\t\t\t\t,1839
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia IDR\n\t\t\t\t\t\t\t\t\t\t\t\t,1398
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Patna\n\t\t\t\t\t\t\t\t\t\t\t\t,674
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Indore\n\t\t\t\t\t\t\t\t\t\t\t\t,381
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia PAT\n\t\t\t\t\t\t\t\t\t\t\t\t,354
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia MYQ\n\t\t\t\t\t\t\t\t\t\t\t\t,321
1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Bhubaneswar\n\t\t\t\t\t\t\t\t\t\t\t\t,301


In [18]:
df['stops'].unique()

array(['non-stop', '1-stop',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia IXU\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Chennai\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Indore\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia RPR\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '2+-stop',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Lucknow\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia GOP\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Raipur\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Nagpur\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Surat\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia Hyderabad\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia STV\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia IDR\n\t\t\t\t\t\t\t\t\t\t\t\t',
       '1-stop\n

In [19]:
df['Stops']=df['stops'].str.split(' ').str.get(0).str.split('\n').str.get(0)
df.sample(10)

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops
191626,2023,7,27,2023-07-27,Indigo,6E-6137,economy,Chennai,09:00,Mumbai,14:30,05h 30m,2820,1-stop,1-stop
127364,2023,8,12,2023-08-12,GO FIRST,G8-393,economy,Kolkata,08:55,Delhi,19:00,10h 05m,7323,1-stop,1-stop
198372,2023,7,2,2023-07-02,Vistara,UK-836,economy,Chennai,10:45,Kolkata,09:35,22h 50m,14154,1-stop,1-stop
64240,2023,7,12,2023-07-12,GO FIRST,G8-341,economy,Mumbai,20:15,Kolkata,08:00,11h 45m,5875,1-stop,1-stop
144378,2023,7,13,2023-07-13,Indigo,6E-512,economy,Kolkata,12:55,Hyderabad,17:25,04h 30m,6554,1-stop,1-stop
204859,2023,7,23,2023-07-23,Indigo,6E-6137,economy,Chennai,09:00,Hyderabad,14:15,05h 15m,2891,1-stop,1-stop
238602,2023,7,4,2023-07-04,Air India,AI-635,business,Mumbai,07:05,Hyderabad,23:35,16h 30m,46378,1-stop\n\t\t\t\t\t\t\t\t\t\t\t\tVia IDR\n\t\t\...,1-stop
28506,2023,8,12,2023-08-12,Air India,AI-439,economy,Delhi,06:00,Kolkata,13:55,07h 55m,4747,1-stop,1-stop
154335,2023,7,13,2023-07-13,Air India,AI-559,economy,Hyderabad,06:30,Delhi,09:00,02h 30m,2723,non-stop,non-stop
25088,2023,7,24,2023-07-24,Vistara,UK-899,economy,Delhi,14:45,Kolkata,09:40,18h 55m,9707,2+-stop,2+-stop


In [20]:
df['Stops'].value_counts()

Unnamed: 0_level_0,count
Stops,Unnamed: 1_level_1
1-stop,250927
non-stop,36044
2+-stop,13288


In [21]:
df['stops'].str.split(' ').str.get(1).str.split('\n').str.get(0).value_counts(dropna=False)

Unnamed: 0_level_0,count
stops,Unnamed: 1_level_1
,292933
IXU,1839
IDR,1398
Patna,674
Indore,381
PAT,354
MYQ,321
Bhubaneswar,301
KLH,284
JGB,193


In [22]:
df.head()

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops
0,2023,6,26,2023-06-26,SpiceJet,SG-8709,economy,Delhi,18:55,Mumbai,21:05,02h 10m,6013,non-stop,non-stop
1,2023,6,26,2023-06-26,SpiceJet,SG-8157,economy,Delhi,06:20,Mumbai,08:40,02h 20m,6013,non-stop,non-stop
2,2023,6,26,2023-06-26,AirAsia,I5-764,economy,Delhi,04:25,Mumbai,06:35,02h 10m,6016,non-stop,non-stop
3,2023,6,26,2023-06-26,Vistara,UK-995,economy,Delhi,10:20,Mumbai,12:35,02h 15m,6015,non-stop,non-stop
4,2023,6,26,2023-06-26,Vistara,UK-963,economy,Delhi,08:50,Mumbai,11:10,02h 20m,6015,non-stop,non-stop


`flight_id` has more than 1,500 unique categories, so one-hot encoding it would add more than 1,500 columns and be computationally expensive. We therefore drop it instead of encoding it.

In [23]:
df['flight_id'].nunique()

1569
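
To see how the width of a one-hot encoding follows the number of unique values, here is a small sketch on made-up flight numbers. With `drop_first=True`, a column with k categories yields k - 1 indicator columns, so 1569 categories would yield 1568.

```python
import pandas as pd

# Hypothetical sample with 4 distinct flight numbers in 5 rows.
toy = pd.DataFrame({"flight_id": ["SG-8709", "SG-8157", "I5-764", "UK-995", "UK-995"]})

n_unique = toy["flight_id"].nunique()

# One indicator column per category, minus the dropped first one.
dummies = pd.get_dummies(toy["flight_id"], drop_first=True)

print(n_unique, dummies.shape)
```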

In [24]:
df['airline'].unique()

array(['SpiceJet', 'AirAsia', 'Vistara', 'GO FIRST', 'Indigo',
       'Air India', 'Trujet', 'StarAir'], dtype=object)

In [25]:
df['departure_city'].unique()

array(['Delhi', 'Mumbai', 'Bangalore', 'Kolkata', 'Hyderabad', 'Chennai'],
      dtype=object)

In [26]:
df['arrival_city'].unique()

array(['Mumbai', 'Bangalore', 'Kolkata', 'Hyderabad', 'Chennai', 'Delhi'],
      dtype=object)

In [27]:
df[['hr_duration','min_duration']]=df['duration'].str.extract(r'(?:(\d+)h)?\s?(?:(\d+)m)?').fillna(0).astype(int)
df.head()

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops,hr_duration,min_duration
0,2023,6,26,2023-06-26,SpiceJet,SG-8709,economy,Delhi,18:55,Mumbai,21:05,02h 10m,6013,non-stop,non-stop,2,10
1,2023,6,26,2023-06-26,SpiceJet,SG-8157,economy,Delhi,06:20,Mumbai,08:40,02h 20m,6013,non-stop,non-stop,2,20
2,2023,6,26,2023-06-26,AirAsia,I5-764,economy,Delhi,04:25,Mumbai,06:35,02h 10m,6016,non-stop,non-stop,2,10
3,2023,6,26,2023-06-26,Vistara,UK-995,economy,Delhi,10:20,Mumbai,12:35,02h 15m,6015,non-stop,non-stop,2,15
4,2023,6,26,2023-06-26,Vistara,UK-963,economy,Delhi,08:50,Mumbai,11:10,02h 20m,6015,non-stop,non-stop,2,20
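
The same pattern can be checked on a few strings to see how it behaves when the hour or minute part is missing. The inputs `45m` and `3h` are hypothetical edge cases that may not occur in this dataset, and the column names `hours` and `minutes` are made up for the sketch.

```python
import pandas as pd

# Two values in the dataset's format plus two hypothetical edge cases.
durations = pd.Series(["02h 10m", "24h 35m", "45m", "3h"])

# Both groups are optional, so a missing part comes back as a missing value.
parts = durations.str.extract(r"(?:(\d+)h)?\s?(?:(\d+)m)?")
parts.columns = ["hours", "minutes"]

# Treat a missing part as zero, then convert the digit strings to integers.
parts = parts.fillna("0").astype(int)
total_minutes = parts["hours"] * 60 + parts["minutes"]

print(total_minutes.tolist())
```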


In [28]:
df['duration'].unique()

array(['02h 10m', '02h 20m', '02h 15m', '02h 05m', '12h 15m', '16h 20m',
       '11h 45m', '14h 30m', '15h 40m', '03h 45m', '02h 30m', '05h 50m',
       '08h 00m', '06h 00m', '14h 40m', '16h 10m', '18h 00m', '23h 10m',
       '24h 10m', '08h 50m', '04h 30m', '15h 15m', '11h 00m', '19h 05m',
       '22h 50m', '26h 25m', '17h 45m', '19h 35m', '26h 40m', '15h 10m',
       '20h 50m', '11h 25m', '22h 15m', '26h 00m', '21h 45m', '03h 50m',
       '04h 25m', '07h 40m', '08h 20m', '10h 25m', '23h 45m', '19h 30m',
       '06h 30m', '12h 25m', '21h 05m', '28h 10m', '28h 15m', '09h 15m',
       '17h 55m', '07h 05m', '13h 50m', '07h 35m', '15h 50m', '24h 25m',
       '04h 10m', '04h 15m', '05h 05m', '29h 20m', '17h 00m', '27h 10m',
       '24h 45m', '05h 45m', '12h 45m', '13h 45m', '17h 50m', '05h 30m',
       '23h 50m', '05h 00m', '26h 30m', '12h 50m', '08h 55m', '11h 10m',
       '12h 10m', '15h 35m', '15h 45m', '07h 55m', '13h 15m', '16h 00m',
       '22h 45m', '06h 20m', '07h 15m', '30h 05m', 

In [29]:
df['hr_duration'].unique()

array([ 2, 12, 16, 11, 14, 15,  3,  5,  8,  6, 18, 23, 24,  4, 19, 22, 26,
       17, 20, 21,  7, 10, 28,  9, 13, 29, 27, 30, 25, 31, 33, 36, 35, 34,
       39,  1, 37, 40, 32, 41, 38,  0, 47, 42, 49, 45, 44])

In [30]:
df['hr_duration'].max()

49

In [31]:
df[df.index==193995]

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops,hr_duration,min_duration
193995,2023,6,27,2023-06-27,Air India,AI-672,economy,Chennai,16:05,Bangalore,14:20,22h 15m,23652,2+-stop,2+-stop,22,15


In [32]:
df['duration']=df['hr_duration']*60+df['min_duration']

In [33]:
df[df.index==193995]

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops,hr_duration,min_duration
193995,2023,6,27,2023-06-27,Air India,AI-672,economy,Chennai,16:05,Bangalore,14:20,1335,23652,2+-stop,2+-stop,22,15


In [34]:
df[df.index==194465]

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops,hr_duration,min_duration
194465,2023,7,4,2023-07-04,Air India,AI-569,economy,Chennai,06:20,Bangalore,22:30,970,16335,1-stop,1-stop,16,10


In [35]:
df.head()

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops,hr_duration,min_duration
0,2023,6,26,2023-06-26,SpiceJet,SG-8709,economy,Delhi,18:55,Mumbai,21:05,130,6013,non-stop,non-stop,2,10
1,2023,6,26,2023-06-26,SpiceJet,SG-8157,economy,Delhi,06:20,Mumbai,08:40,140,6013,non-stop,non-stop,2,20
2,2023,6,26,2023-06-26,AirAsia,I5-764,economy,Delhi,04:25,Mumbai,06:35,130,6016,non-stop,non-stop,2,10
3,2023,6,26,2023-06-26,Vistara,UK-995,economy,Delhi,10:20,Mumbai,12:35,135,6015,non-stop,non-stop,2,15
4,2023,6,26,2023-06-26,Vistara,UK-963,economy,Delhi,08:50,Mumbai,11:10,140,6015,non-stop,non-stop,2,20


In [36]:
df['duration'].dtype

dtype('int64')

In [37]:
df['arrival_time']=df['arrival_time'].str.split(':').apply(lambda x:int(x[0])*60+int(x[1]))

In [38]:
df['departure_time']=df['departure_time'].str.split(':').apply(lambda x:int(x[0])*60+int(x[1]))

In [39]:
df.head()

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops,hr_duration,min_duration
0,2023,6,26,2023-06-26,SpiceJet,SG-8709,economy,Delhi,1135,Mumbai,1265,130,6013,non-stop,non-stop,2,10
1,2023,6,26,2023-06-26,SpiceJet,SG-8157,economy,Delhi,380,Mumbai,520,140,6013,non-stop,non-stop,2,20
2,2023,6,26,2023-06-26,AirAsia,I5-764,economy,Delhi,265,Mumbai,395,130,6016,non-stop,non-stop,2,10
3,2023,6,26,2023-06-26,Vistara,UK-995,economy,Delhi,620,Mumbai,755,135,6015,non-stop,non-stop,2,15
4,2023,6,26,2023-06-26,Vistara,UK-963,economy,Delhi,530,Mumbai,670,140,6015,non-stop,non-stop,2,20
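
Once times are in minutes since midnight, arrival can be smaller than departure when a flight lands on a later day. A possible sanity check is that departure plus duration, taken modulo 1440 minutes, equals the arrival time. The sketch below uses three rows shown earlier in this notebook; whether the relation holds for every row in the dataset has not been verified here, and `to_minutes` is a hypothetical helper.

```python
import pandas as pd

# Three rows copied from earlier outputs; duration is already in minutes.
toy = pd.DataFrame({
    "departure_time": ["18:55", "17:20", "13:00"],
    "arrival_time": ["21:05", "08:35", "13:35"],
    "duration": [130, 915, 1475],
})

def to_minutes(col):
    """Convert an 'HH:MM' string column to minutes since midnight."""
    parts = col.str.split(":", expand=True).astype(int)
    return parts[0] * 60 + parts[1]

dep = to_minutes(toy["departure_time"])
arr = to_minutes(toy["arrival_time"])

# Wrap around midnight: 1440 minutes per day.
consistent = (dep + toy["duration"]) % 1440 == arr

print(dep.tolist(), arr.tolist(), consistent.tolist())
```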


In [40]:
df['price']=df['price'].str.replace(',','').str.strip().astype(int)
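
`astype(int)` raises an error if any value is not numeric after cleaning. A more defensive sketch uses `pd.to_numeric` with `errors="coerce"`, so unexpected strings become missing values that can be inspected. The sample values, including `N/A`, are hypothetical.

```python
import pandas as pd

# Hypothetical raw prices: thousands separators, stray spaces, one bad value.
raw_price = pd.Series(["6,013", " 12,272 ", "N/A"])

cleaned = pd.to_numeric(
    raw_price.str.replace(",", "", regex=False).str.strip(),
    errors="coerce",  # non-numeric strings become NaN instead of raising
)

n_bad = int(cleaned.isna().sum())
print(cleaned.tolist(), n_bad)
```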

In [41]:
df.head()

Unnamed: 0,year,month,day,flight_date,airline,flight_id,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,stops,Stops,hr_duration,min_duration
0,2023,6,26,2023-06-26,SpiceJet,SG-8709,economy,Delhi,1135,Mumbai,1265,130,6013,non-stop,non-stop,2,10
1,2023,6,26,2023-06-26,SpiceJet,SG-8157,economy,Delhi,380,Mumbai,520,140,6013,non-stop,non-stop,2,20
2,2023,6,26,2023-06-26,AirAsia,I5-764,economy,Delhi,265,Mumbai,395,130,6016,non-stop,non-stop,2,10
3,2023,6,26,2023-06-26,Vistara,UK-995,economy,Delhi,620,Mumbai,755,135,6015,non-stop,non-stop,2,15
4,2023,6,26,2023-06-26,Vistara,UK-963,economy,Delhi,530,Mumbai,670,140,6015,non-stop,non-stop,2,20


In [42]:
df=df.drop(columns=['flight_id','hr_duration','min_duration','stops'])
df.sample(10)

Unnamed: 0,year,month,day,flight_date,airline,class,departure_city,departure_time,arrival_city,arrival_time,duration,price,Stops
93667,2023,8,12,2023-08-12,GO FIRST,economy,Bangalore,330,Delhi,1140,810,3501,1-stop
138175,2023,7,15,2023-07-15,Vistara,economy,Kolkata,615,Bangalore,1020,405,8192,2+-stop
129048,2023,7,7,2023-07-07,Vistara,economy,Kolkata,615,Mumbai,1275,660,10529,1-stop
158339,2023,8,11,2023-08-11,Vistara,economy,Hyderabad,510,Delhi,1085,575,5951,1-stop
35312,2023,8,13,2023-08-13,Vistara,economy,Delhi,375,Hyderabad,1255,880,5819,1-stop
221877,2023,8,12,2023-08-12,Air India,business,Delhi,905,Hyderabad,435,970,39911,1-stop
291995,2023,7,21,2023-07-21,Vistara,business,Chennai,1030,Mumbai,1370,340,50296,1-stop
244946,2023,8,13,2023-08-13,Vistara,business,Mumbai,735,Chennai,1365,630,65852,2+-stop
276890,2023,8,5,2023-08-05,Air India,business,Hyderabad,680,Delhi,830,150,24484,non-stop
15948,2023,7,25,2023-07-25,GO FIRST,economy,Delhi,540,Bangalore,970,430,5666,1-stop


In [43]:
df['Stops'].unique()

array(['non-stop', '1-stop', '2+-stop'], dtype=object)

In [44]:
df['Stops']=df['Stops'].map({'non-stop':0,'1-stop':1, '2+-stop':2})
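
`map` with a dictionary returns a missing value for any label that is not a key, so it is worth confirming that nothing was left unmapped. A minimal sketch, with a hypothetical `3-stop` label added to trigger that case:

```python
import pandas as pd

labels = pd.Series(["non-stop", "1-stop", "2+-stop", "3-stop"])  # '3-stop' is made up
mapping = {"non-stop": 0, "1-stop": 1, "2+-stop": 2}

mapped = labels.map(mapping)

# Labels missing from the mapping show up as NaN.
unmapped = labels[mapped.isna()].tolist()
print(unmapped)
```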

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 300259 entries, 0 to 300260
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   year            300259 non-null  int32         
 1   month           300259 non-null  int32         
 2   day             300259 non-null  int32         
 3   flight_date     300259 non-null  datetime64[ns]
 4   airline         300259 non-null  object        
 5   class           300259 non-null  object        
 6   departure_city  300259 non-null  object        
 7   departure_time  300259 non-null  int64         
 8   arrival_city    300259 non-null  object        
 9   arrival_time    300259 non-null  int64         
 10  duration        300259 non-null  int64         
 11  price           300259 non-null  int64         
 12  Stops           300259 non-null  int64         
dtypes: datetime64[ns](1), int32(3), int64(5), object(4)
memory usage: 28.6+ MB


In [46]:
df.rename(columns={
    'year':'flight_year',
    'month':'flight_month',
    'day':'flight_day',
},inplace=True)

In [51]:
df=pd.get_dummies(df,drop_first=True).astype(int)

In [52]:
df.head()

Unnamed: 0,flight_year,flight_month,flight_day,flight_date,departure_time,arrival_time,duration,price,Stops,airline_AirAsia,airline_GO FIRST,airline_Indigo,airline_SpiceJet,airline_StarAir,airline_Trujet,airline_Vistara,class_economy,departure_city_Chennai,departure_city_Delhi,departure_city_Hyderabad,departure_city_Kolkata,departure_city_Mumbai,arrival_city_Chennai,arrival_city_Delhi,arrival_city_Hyderabad,arrival_city_Kolkata,arrival_city_Mumbai
0,2023,6,26,1687737600000000000,1135,1265,130,6013,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,1
1,2023,6,26,1687737600000000000,380,520,140,6013,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,1
2,2023,6,26,1687737600000000000,265,395,130,6016,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1
3,2023,6,26,1687737600000000000,620,755,135,6015,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,1
4,2023,6,26,1687737600000000000,530,670,140,6015,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,1


In [53]:
df.shape

(300259, 27)
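
Note that `flight_date` appears in the output above as `1687737600000000000`: `.astype(int)` converts the datetime column to nanoseconds since the Unix epoch. Since year, month, and day are already separate columns, one option is to drop `flight_date` before encoding. The sketch below shows this on made-up rows; it is a suggested variation, not what the notebook ran.

```python
import pandas as pd

# Hypothetical two-row frame with a datetime column and one categorical column.
toy = pd.DataFrame({
    "flight_date": pd.to_datetime(["26-06-2023", "27-06-2023"], format="%d-%m-%Y"),
    "airline": ["SpiceJet", "Vistara"],
    "price": [6013, 6015],
})

# Drop the raw datetime first so it is not cast to an epoch integer.
encoded = pd.get_dummies(
    toy.drop(columns=["flight_date"]), drop_first=True
).astype(int)

print(encoded)
```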

In [54]:
df.to_csv('cleaned_processed_flights_data.csv',index=False)