Predict The Flight Ticket Price 

Flight ticket prices are hard to guess: the price we see today for a flight may be quite different when we check the same flight again tomorrow. Travellers often say that flight ticket prices are unpredictable. Here you are given the prices of flight tickets for various airlines and city pairs, for journeys between March and June 2019, and the task is to predict the ticket price.

Size of training set: 10683 records

Size of test set: 2671 records

FEATURES:
Airline: The name of the airline.

Date_of_Journey: The date of the journey.

Source: The source from which the service begins.

Destination: The destination where the service ends.

Route: The route taken by the flight to reach the destination.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

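Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight.

Price: The price of the ticket (the target; present only in the training set).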

In [1]:
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import rcParams
%matplotlib inline
import warnings 
warnings.filterwarnings('ignore')

In [2]:
ff_train = pd.read_excel('Data_Train.xlsx')
ff_test = pd.read_excel('Test_set.xlsx')

In [3]:
ff_train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [4]:
ff_test.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info
0,Jet Airways,6/06/2019,Delhi,Cochin,DEL → BOM → COK,17:30,04:25 07 Jun,10h 55m,1 stop,No info
1,IndiGo,12/05/2019,Kolkata,Banglore,CCU → MAA → BLR,06:20,10:20,4h,1 stop,No info
2,Jet Airways,21/05/2019,Delhi,Cochin,DEL → BOM → COK,19:15,19:00 22 May,23h 45m,1 stop,In-flight meal not included
3,Multiple carriers,21/05/2019,Delhi,Cochin,DEL → BOM → COK,08:00,21:00,13h,1 stop,No info
4,Air Asia,24/06/2019,Banglore,Delhi,BLR → DEL,23:55,02:45 25 Jun,2h 50m,non-stop,No info


In [5]:
# See the shape 
print(ff_train.shape)
print(ff_test.shape)

(10683, 11)
(2671, 10)


In [6]:
# Get a count of unique values in train dataset.
for col in ff_train.columns:
    print("Count of unique values in", col, ff_train[col].nunique())

Count of unique values in Airline 12
Count of unique values in Date_of_Journey 44
Count of unique values in Source 5
Count of unique values in Destination 6
Count of unique values in Route 128
Count of unique values in Dep_Time 222
Count of unique values in Arrival_Time 1343
Count of unique values in Duration 368
Count of unique values in Total_Stops 5
Count of unique values in Additional_Info 10
Count of unique values in Price 1870


In [7]:
# Get a count of unique values in test dataset.
for col in ff_test.columns:
    print("Count of unique values in", col, ff_test[col].nunique())

Count of unique values in Airline 11
Count of unique values in Date_of_Journey 44
Count of unique values in Source 5
Count of unique values in Destination 6
Count of unique values in Route 100
Count of unique values in Dep_Time 199
Count of unique values in Arrival_Time 704
Count of unique values in Duration 320
Count of unique values in Total_Stops 5
Count of unique values in Additional_Info 6


In [8]:
# Keep only the time part of Arrival_Time; next-day arrivals carry a date suffix, e.g. '04:25 10 Jun'
ff_train['Arrival_Time'] = ff_train['Arrival_Time'].str.split(' ').str[0]
ff_test['Arrival_Time'] = ff_test['Arrival_Time'].str.split(' ').str[0]

In [9]:
def get_departuretimeofday(depart):
    # Bucket an 'HH:MM' string by its hour; also reused below for arrival times.
    depart = depart.split(':')
    depart = int(depart[0])
    if (depart >= 6 and depart < 12):
        return 'Morning'
    elif (depart >= 12 and depart < 17):
        return 'Noon'
    elif (depart >= 17 and depart < 20):
        return 'Evening'
    else:
        return 'Night'
    
ff_train['Departure_timeofday'] = ff_train['Dep_Time'].apply(get_departuretimeofday)   
ff_test['Departure_timeofday'] = ff_test['Dep_Time'].apply(get_departuretimeofday) 

ff_train['Arrival_timeofday'] = ff_train['Arrival_Time'].apply(get_departuretimeofday)   
ff_test['Arrival_timeofday'] = ff_test['Arrival_Time'].apply(get_departuretimeofday)
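
As a quick check of the two steps above (dropping the date suffix, then bucketing by hour), here is a small standalone sketch. The helper `time_of_day` is a made-up restatement of the same rules for illustration, and the sample strings are the arrival times from the first rows of the test set shown earlier.

```python
def time_of_day(clock):
    """Bucket an 'HH:MM' string (optionally followed by a date) by its hour."""
    hour = int(clock.split(' ')[0].split(':')[0])
    if 6 <= hour < 12:
        return 'Morning'
    if 12 <= hour < 17:
        return 'Noon'
    if 17 <= hour < 20:
        return 'Evening'
    return 'Night'

# Arrival_Time values from the first five test rows.
samples = ['04:25 07 Jun', '10:20', '19:00 22 May', '21:00', '02:45 25 Jun']
buckets = [time_of_day(s) for s in samples]
for s, b in zip(samples, buckets):
    print(s, '->', b)
```

Note that everything from 20:00 up to 05:59 falls into 'Night', so a late-evening and an early-morning arrival share one bucket.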

In [10]:
ff_train['Additional_Info'] = ff_train['Additional_Info'].str.replace('No info', 'No Info')
ff_test['Additional_Info'] = ff_test['Additional_Info'].str.replace('No info', 'No Info')

In [11]:
ff_train['Total_Stops'] = ff_train['Total_Stops'].str.replace('non-stop', '0')
ff_train['Total_Stops'] = ff_train['Total_Stops'].str.replace('stops', ' ')
ff_train['Total_Stops'] = ff_train['Total_Stops'].str.replace('stop', ' ')

ff_test['Total_Stops'] = ff_test['Total_Stops'].str.replace('non-stop', '0')
ff_test['Total_Stops'] = ff_test['Total_Stops'].str.replace('stops', ' ')
ff_test['Total_Stops'] = ff_test['Total_Stops'].str.replace('stop', ' ')

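# Treat a missing stop count as 0; assign the result back rather than relying on inplace=True on a column
ff_train['Total_Stops'] = ff_train['Total_Stops'].fillna(0)
ff_test['Total_Stops'] = ff_test['Total_Stops'].fillna(0)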

ff_train['Total_Stops'] = ff_train['Total_Stops'].astype(float)
ff_test['Total_Stops'] = ff_test['Total_Stops'].astype(float)
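
The replacements above turn labels such as 'non-stop' and '2 stops' into numbers. As a rough standalone illustration of the same conversion in plain Python (the helper name `stops_to_number` is made up for this sketch, and `None` stands in for a missing value):

```python
def stops_to_number(label):
    """Convert a Total_Stops label to a float, treating missing values as 0."""
    if label is None:           # stands in for NaN in the DataFrame
        return 0.0
    if label == 'non-stop':
        return 0.0
    # '1 stop' -> '1', '2 stops' -> '2'
    return float(label.split()[0])

labels = ['non-stop', '1 stop', '2 stops', None]
converted = [stops_to_number(v) for v in labels]
print(converted)
```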

In [12]:
ff_train.shape, ff_test.shape

((10683, 13), (2671, 12))

In [13]:
ff_test.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Departure_timeofday,Arrival_timeofday
0,Jet Airways,6/06/2019,Delhi,Cochin,DEL → BOM → COK,17:30,04:25,10h 55m,1.0,No Info,Evening,Night
1,IndiGo,12/05/2019,Kolkata,Banglore,CCU → MAA → BLR,06:20,10:20,4h,1.0,No Info,Morning,Morning
2,Jet Airways,21/05/2019,Delhi,Cochin,DEL → BOM → COK,19:15,19:00,23h 45m,1.0,In-flight meal not included,Evening,Evening
3,Multiple carriers,21/05/2019,Delhi,Cochin,DEL → BOM → COK,08:00,21:00,13h,1.0,No Info,Morning,Night
4,Air Asia,24/06/2019,Banglore,Delhi,BLR → DEL,23:55,02:45,2h 50m,0.0,No Info,Night,Night


In [14]:
ff_train = pd.get_dummies(ff_train, columns=['Airline', 'Source', 'Destination', 'Additional_Info', 'Date_of_Journey',
                                             'Dep_Time', 'Arrival_Time', 'Departure_timeofday',
                                              'Arrival_timeofday'],drop_first=True)

ff_test = pd.get_dummies(ff_test, columns=['Airline', 'Source', 'Destination', 'Additional_Info', 'Date_of_Journey',
                                             'Dep_Time', 'Arrival_Time', 'Departure_timeofday',
                                              'Arrival_timeofday'],drop_first=True)
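
Because `pd.get_dummies` is applied to the training and test frames separately, each frame only gets columns for the categories it actually contains, so the two sets of columns can differ. The sketch below shows the effect and one way to align the frames with `DataFrame.reindex`. The frames `toy_train` and `toy_test` are made up for illustration, and `drop_first` is left out to keep the output easy to read.

```python
import pandas as pd

toy_train = pd.DataFrame({'Airline': ['IndiGo', 'Air India', 'Jet Airways']})
toy_test = pd.DataFrame({'Airline': ['IndiGo', 'Air Asia']})

train_d = pd.get_dummies(toy_train, columns=['Airline'])
test_d = pd.get_dummies(toy_test, columns=['Airline'])

# Each frame only has columns for the categories it saw.
print(list(train_d.columns))
print(list(test_d.columns))

# Align the test frame to the training columns: columns missing from the
# test frame are filled with 0, and columns unknown to training are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))
```

After the `reindex`, the test frame has the same columns in the same order as the training frame, which is what a fitted model expects at prediction time.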

In [15]:
# Remove → from the route column 
def remove_unwanted(route):
    route = str(route)
    route = route.split(' → ')
    return ' '.join(route)

ff_train['Route'] = ff_train['Route'].apply(remove_unwanted)
ff_test['Route'] = ff_test['Route'].apply(remove_unwanted)

from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer - Transforms text to feature vectors that can be used as input to model.
tf = TfidfVectorizer(ngram_range=(1, 1), lowercase=False)
ff_train_route = tf.fit_transform(ff_train['Route'])
ff_test_route = tf.transform(ff_test['Route'])

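# get_feature_names_out() needs scikit-learn 1.0+; older releases call it get_feature_names()
ff_train_route = pd.DataFrame(data=ff_train_route.toarray(), columns=tf.get_feature_names_out())
ff_test_route = pd.DataFrame(data=ff_test_route.toarray(), columns=tf.get_feature_names_out())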
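
To make the route features less opaque, here is a minimal pure-Python sketch of what a TF-IDF vectorizer computes for space-separated airport codes. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The three routes and the helper name `tfidf_rows` are made up for illustration, and the sketch simply splits on spaces instead of using the vectorizer's token pattern.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of routes that contain each airport code.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

routes = ['BLR DEL', 'CCU NAG BLR', 'DEL BOM COK']
vocab, rows = tfidf_rows(routes)
print(vocab)
for route, row in zip(routes, rows):
    print(route, [round(v, 3) for v in row])
```

Within a row, codes that occur in fewer routes receive a larger weight than codes shared by many routes.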

In [16]:
ff_train = pd.concat([ff_train, ff_train_route], axis=1)
ff_test = pd.concat([ff_test, ff_test_route], axis=1)
ff_train.drop(['Route','Duration'], axis=1, inplace=True)
ff_test.drop(['Route','Duration'], axis=1, inplace=True)
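
The cell above drops `Duration` without using it. If one wanted to keep that information instead, a string such as `2h 50m` or `19h` could first be turned into a number of minutes. A hedged sketch of such a parser follows; the function `duration_to_minutes` is hypothetical and is not part of this notebook.

```python
import re

def duration_to_minutes(text):
    """Parse strings like '2h 50m', '19h' or '45m' into total minutes."""
    hours = re.search(r'(\d+)\s*h', text)
    minutes = re.search(r'(\d+)\s*m', text)
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

# Duration values taken from the sample rows shown earlier, plus '45m'.
for sample in ['2h 50m', '19h', '7h 25m', '45m']:
    print(sample, '->', duration_to_minutes(sample))
```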

In [17]:
ff_train.shape, ff_test.shape

((10683, 565), (2671, 518))

In [18]:
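# Hard-coded: add this dummy column to train as all zeros (presumably it occurs
# only in the test set) so that both sets end up with the same columns.
ff_train['Dep_Time_22:30'] = 0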

In [19]:
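# Give ff_test exactly the training feature columns, in the same order.
# Columns missing from the test set are added as zeros and 'Price' is left out.
# The order matters because the models are fit on features built from ff_train.
feature_cols = ff_train.columns.drop('Price')
ff_test = ff_test.reindex(columns=feature_cols, fill_value=0)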

In [20]:
x = ff_train.drop(labels=['Price'], axis=1)
y = ff_train['Price'].values

from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=1)

In [21]:
x_train.shape, y_train.shape, x_val.shape, y_val.shape

((8012, 565), (8012,), (2671, 565), (2671,))
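
The shapes above follow from `test_size=0.25`. As a quick arithmetic check, assuming the validation count is rounded up and the remainder goes to training (which is how `train_test_split` is understood to treat a fractional `test_size`):

```python
from math import ceil

n_samples = 10683
test_size = 0.25

# Assumption: the validation count is rounded up; the rest is used for training.
n_val = ceil(n_samples * test_size)
n_train = n_samples - n_val
print(n_train, n_val)
```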

In [22]:
ff_test.shape

(2671, 565)

In [23]:
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import BaggingRegressor
from math import sqrt 
from sklearn.metrics import mean_squared_log_error

In [24]:
rf = RandomForestRegressor(n_estimators=29,
                           criterion='squared_error',  # named 'mse' before scikit-learn 1.0
                           max_depth=58, 
                           min_samples_split=5, 
                           min_samples_leaf=2, 
                           min_weight_fraction_leaf=0.0, 
                           max_features=1.0,  # use all features, as 'auto' did for regressors
                           max_leaf_nodes=None, 
                           min_impurity_decrease=0.20,  
                           bootstrap=True, 
                           oob_score=True, 
                           n_jobs=-1, 
                           random_state=11) 
rf.fit(x_train, y_train)
y_pred_rf = rf.predict(x_val)
print('RMSLE:', sqrt(mean_squared_log_error(y_val, y_pred_rf)))

RMSLE: 0.1261974607773459
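
The metric printed above is the root mean squared logarithmic error. Below is a plain-Python sketch of the formula, assuming the `log(1 + x)` form that `mean_squared_log_error` is documented to use. The actual prices are taken from the sample rows shown earlier; the predictions are made up for illustration.

```python
from math import log1p, sqrt

def rmsle(y_true, y_pred):
    """Root mean squared log error: sqrt(mean((log1p(pred) - log1p(true))^2))."""
    squared = [(log1p(p) - log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))

actual = [3897, 7662, 13882]      # prices from the first training rows
predicted = [4100, 7000, 15000]   # made-up predictions
print('RMSLE:', rmsle(actual, predicted))

# The error depends on ratios: predicting 10% too high costs about the same
# for a cheap ticket as for an expensive one.
print(rmsle([1000], [1100]), rmsle([10000], [11000]))
```

Because the metric works on ratios, it penalises relative errors rather than absolute rupee differences.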


In [33]:
xgb = XGBRegressor(learning_rate=0.02, 
                   gamma=100, 
                   max_depth=25,  
                   min_child_weight=1, 
                   max_delta_step=0, 
                   subsample=0.75,  
                   colsample_bylevel=0.95,  
                   colsample_bytree=0.70,  
                   reg_lambda=1)
xgb.fit(x_train, y_train)
y_pred_xgb = xgb.predict(x_val)
print('RMSLE:', sqrt(mean_squared_log_error(y_val, y_pred_xgb)))

RMSLE: 0.18784732302320706


In [34]:
gb = GradientBoostingRegressor(loss='absolute_error',  # named 'lad' before scikit-learn 1.0
                               learning_rate=0.2,  
                               random_state=10, 
                               n_estimators=92,   
                               max_depth=11,  
                               subsample=1.0, 
                               min_samples_split=40, 
                               min_samples_leaf=1,
                               max_features=None)  # use all features, as 'auto' did here
gb.fit(x_train, y_train)
y_pred_gb = gb.predict(x_val)
print('RMSLE:', sqrt(mean_squared_log_error(y_val, y_pred_gb)))

RMSLE: 0.1550385805722753


In [35]:
# The base estimator is left at its default, a decision tree regressor.
bgr = BaggingRegressor(n_estimators=80,  
                       max_samples=1.0, 
                       max_features=1.0, 
                       bootstrap=True, 
                       bootstrap_features=True,
                       oob_score=True,
                       n_jobs=None, 
                       random_state=13, 
                       verbose=0)
bgr.fit(x_train, y_train)
y_pred_bgr = bgr.predict(x_val)
print('RMSLE:', sqrt(mean_squared_log_error(y_val, y_pred_bgr)))

RMSLE: 0.14862916507663826


In [36]:
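# Note: the weights sum to 0.58 rather than 1, so y_pred is scaled down, which inflates the RMSLE below.
y_pred = y_pred_rf*0.12 + y_pred_xgb*0.18 + y_pred_gb*0.14 + y_pred_bgr*0.14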
print('RMSLE:', sqrt(mean_squared_log_error(y_val, y_pred)))

RMSLE: 0.5929660936333876
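
A weighted blend only stays on the price scale when its weights sum to 1. The sketch below uses made-up predictions from four models for two tickets to show how dividing by the sum of the weights turns the same expression into a proper weighted average. The helper `blend` is hypothetical and not part of this notebook.

```python
def blend(predictions, weights):
    """Weighted average of per-model predictions; weights are normalised to sum to 1."""
    total = sum(weights)
    n_items = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_items)
    ]

# Made-up predictions from four models for two tickets.
model_preds = [
    [4000.0, 10000.0],
    [4200.0, 9800.0],
    [3900.0, 10400.0],
    [4100.0, 10100.0],
]
weights = [0.12, 0.18, 0.14, 0.14]

# The same sum without normalisation, for comparison.
unnormalised = [
    sum(w * preds[i] for w, preds in zip(weights, model_preds))
    for i in range(2)
]
blended = blend(model_preds, weights)
print('sum of weights:', round(sum(weights), 2))
print('unnormalised:', [round(v, 1) for v in unnormalised])
print('normalised:  ', [round(v, 1) for v in blended])
```

The normalised blend always lies between the smallest and largest individual prediction, while the unnormalised sum falls below all of them here.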


Predict on the test dataset using the Bagging Regressor and the Gradient Boosting Regressor.

In [37]:
# The base estimator is left at its default, a decision tree regressor.
bgr = BaggingRegressor(n_estimators=80,  
                       max_samples=1.0, 
                       max_features=1.0, 
                       bootstrap=True, 
                       bootstrap_features=True,
                       oob_score=True,
                       n_jobs=None, 
                       random_state=13, 
                       verbose=0)
bgr.fit(x_train, y_train)
y_pred_price_bgr = bgr.predict(ff_test)

In [38]:
gb = GradientBoostingRegressor(loss='absolute_error',  # named 'lad' before scikit-learn 1.0
                               learning_rate=0.2,  
                               random_state=10, 
                               n_estimators=92,   
                               max_depth=11,  
                               subsample=1.0, 
                               min_samples_split=40, 
                               min_samples_leaf=1,
                               max_features=None)  # use all features, as 'auto' did here
gb.fit(x_train, y_train)
y_pred_price_gb = gb.predict(ff_test)

In [39]:
# Let's print the predictions 
print("Price predictions of  Gradient Boosting Regressor : \n",y_pred_price_gb)
print('\n')
print("Price predictions of Bagging Regressor: \n",y_pred_price_bgr)

Price predictions of  Gradient Boosting Regressor : 
 [17219.22835316 10486.04749956 15291.50238366 ... 18432.96675216
 17459.05119883 12199.69394216]


Price predictions of Bagging Regressor: 
 [15171.30282853 16332.45541667  9874.09947917 ... 15525.49522321
 20617.64333333 18392.44958448]


In [40]:
Price_ff = pd.DataFrame(data=y_pred_price_gb, columns=['Price'])
# The context manager saves and closes the workbook on exit.
with pd.ExcelWriter('output_Price.xlsx', engine='xlsxwriter') as writer:
    Price_ff.to_excel(writer, sheet_name='Sheet1', index=False)