# Predicting the Flight Ticket Price

Flight ticket prices can be hard to guess: the price we see today may be quite different when we check the same flight tomorrow. Travellers often say that flight ticket prices are unpredictable. Here we take on that challenge. As data scientists, we set out to show that, given the right data, the price can be predicted. The data set contains flight ticket prices for various airlines and city pairs between March and June 2019.

FEATURES:

Airline: The name of the airline.

Date_of_Journey: The date of the journey.

Source: The source from which the service begins.

Destination: The destination where the service ends.

Route: The route taken by the flight to reach the destination.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight.

Price: The price of the ticket. This is the value to be predicted.

In [82]:
# importing libraries
import pandas as pd
import numpy as np

In [83]:
# importing data
train=pd.read_excel("Data_Train.xlsx")
test=pd.read_excel("Test_set.xlsx")

In [84]:
train.head()

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897 |
| 1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662 |
| 2 | Jet Airways | 9/06/2019 | Delhi | Cochin | DEL → LKO → BOM → COK | 09:25 | 04:25 10 Jun | 19h | 2 stops | No info | 13882 |
| 3 | IndiGo | 12/05/2019 | Kolkata | Banglore | CCU → NAG → BLR | 18:05 | 23:30 | 5h 25m | 1 stop | No info | 6218 |
| 4 | IndiGo | 01/03/2019 | Banglore | New Delhi | BLR → NAG → DEL | 16:50 | 21:35 | 4h 45m | 1 stop | No info | 13302 |


In [85]:
train.columns

Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')

In [86]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
Airline            10683 non-null object
Date_of_Journey    10683 non-null object
Source             10683 non-null object
Destination        10683 non-null object
Route              10682 non-null object
Dep_Time           10683 non-null object
Arrival_Time       10683 non-null object
Duration           10683 non-null object
Total_Stops        10682 non-null object
Additional_Info    10683 non-null object
Price              10683 non-null int64
dtypes: int64(1), object(10)
memory usage: 918.1+ KB


In [87]:
# test info
test.info()  # The test data set does not have any null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2671 entries, 0 to 2670
Data columns (total 10 columns):
Airline            2671 non-null object
Date_of_Journey    2671 non-null object
Source             2671 non-null object
Destination        2671 non-null object
Route              2671 non-null object
Dep_Time           2671 non-null object
Arrival_Time       2671 non-null object
Duration           2671 non-null object
Total_Stops        2671 non-null object
Additional_Info    2671 non-null object
dtypes: object(10)
memory usage: 208.8+ KB


In [88]:
# Finding the null values
train.isna().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

### Cleaning and Preprocessing of the data

In [89]:
# Dealing with missing values
print("Length of Training Data:", len(train))
train=train.dropna()
print("Length of Training Data after dropping NA:", len(train))

Length of Training Data: 10683
Length of Training Data after dropping NA: 10682


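Route and Total_Stops each have one missing value, and dropping NAs removes exactly one row, so both missing values sit in the same row. Losing a single row out of 10683 is negligible, so we removed it.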
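Before dropping rows it can help to look at which rows actually contain the missing values. The following is a minimal sketch on a small made-up frame (the `demo` values are illustrative assumptions, not rows from the real data set):

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up values, not the real data set).
demo = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "Jet Airways"],
    "Route": ["BLR → DEL", np.nan, "DEL → BOM"],
    "Total_Stops": ["non-stop", np.nan, "1 stop"],
})

# Rows that contain at least one missing value.
rows_with_na = demo[demo.isna().any(axis=1)]
print(rows_with_na)

# Drop those rows and compare the lengths before and after.
cleaned = demo.dropna()
print(len(demo), len(cleaned))
```

The same `isna().any(axis=1)` mask could be applied to `train` to confirm that the missing Route and Total_Stops values belong to one row.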

### Data Cleaning and Feature Engineering

#### Date of Journey

In [90]:
# Train data

train['Journey_Date']=pd.to_datetime(train.Date_of_Journey,format='%d/%m/%Y').dt.day
train['Journey_Month']=pd.to_datetime(train.Date_of_Journey,format='%d/%m/%Y').dt.month

In [91]:
# Test data

test['Journey_Date']=pd.to_datetime(test.Date_of_Journey,format='%d/%m/%Y').dt.day
test['Journey_Month']=pd.to_datetime(test.Date_of_Journey,format='%d/%m/%Y').dt.month

In [92]:
# Deleting the original date feature
train.drop(labels ='Date_of_Journey',axis=1, inplace=True)


In [93]:
test.drop(labels='Date_of_Journey', axis=1, inplace=True)
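The same three steps (parse, extract day and month, drop the original column) are applied to both data sets above, so they could also be wrapped in one function. This is only a sketch: the helper name `add_journey_parts` is hypothetical, and the `demo` frame reuses three dates from the sample rows shown earlier.

```python
import pandas as pd

def add_journey_parts(df, col="Date_of_Journey"):
    """Return a copy of df with day and month columns instead of the date string."""
    # Parse the column once and reuse the result for both parts.
    parsed = pd.to_datetime(df[col], format="%d/%m/%Y")
    out = df.drop(columns=col)
    out["Journey_Date"] = parsed.dt.day
    out["Journey_Month"] = parsed.dt.month
    return out

demo = pd.DataFrame({"Date_of_Journey": ["24/03/2019", "1/05/2019", "9/06/2019"]})
result = add_journey_parts(demo)
print(result)
```

Using one function for both frames keeps the train and test transformations identical by construction.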

#### Duration

In [94]:
# Train data

duration = list(train['Duration'])

for i in range(len(duration)) :
    if len(duration[i].split()) != 2:
        if 'h' in duration[i] :
            duration[i] = duration[i].strip() + ' 0m'
        elif 'm' in duration[i] :
            duration[i] = '0h {}'.format(duration[i].strip())

dur_hours = []
dur_minutes = []  

for i in range(len(duration)) :
    dur_hours.append(int(duration[i].split()[0][:-1]))
    dur_minutes.append(int(duration[i].split()[1][:-1]))
    
train['Duration_hours'] = dur_hours
train['Duration_minutes'] =dur_minutes

train.drop(labels = 'Duration', axis = 1, inplace = True)


In [95]:
# Test data

durationT = list(test['Duration'])

for i in range(len(durationT)) :
    if len(durationT[i].split()) != 2:
        if 'h' in durationT[i] :
            durationT[i] = durationT[i].strip() + ' 0m'
        elif 'm' in durationT[i] :
            durationT[i] = '0h {}'.format(durationT[i].strip())
            
dur_hours = []
dur_minutes = []  

for i in range(len(durationT)) :
    dur_hours.append(int(durationT[i].split()[0][:-1]))
    dur_minutes.append(int(durationT[i].split()[1][:-1]))
  
    
test['Duration_hours'] = dur_hours
test['Duration_minutes'] = dur_minutes

test.drop(labels = 'Duration', axis = 1, inplace = True)
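The loops above first pad strings such as `19h` or `45m` and then split them. An alternative sketch does the same with a regular expression and a single helper that can be shared by both data sets. The name `split_duration` is hypothetical, and the sketch assumes every duration is written with `h` and/or `m` suffixes as in the sample rows.

```python
import re
import pandas as pd

def split_duration(text):
    """Split a string such as '2h 50m', '19h' or '45m' into (hours, minutes)."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    # A missing part counts as zero, like the padding step in the loops above.
    return (int(hours.group(1)) if hours else 0,
            int(minutes.group(1)) if minutes else 0)

demo = pd.DataFrame({"Duration": ["2h 50m", "19h", "45m"]})
hours, minutes = zip(*demo["Duration"].map(split_duration))
demo["Duration_hours"] = list(hours)
demo["Duration_minutes"] = list(minutes)
print(demo)
```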

#### Departure and Arrival Time

In [96]:
# Train
train['Depart_Time_Hour'] = pd.to_datetime(train.Dep_Time).dt.hour
train['Depart_Time_Minutes']= pd.to_datetime(train.Dep_Time).dt.minute

train.drop(labels= 'Dep_Time', axis=1, inplace= True)

train['Arrival_Time_Hour'] = pd.to_datetime(train.Arrival_Time).dt.hour
train['Arrival_Time_Minutes']= pd.to_datetime(train.Arrival_Time).dt.minute

train.drop(labels = 'Arrival_Time', axis = 1, inplace = True)


In [97]:
# Test
test['Depart_Time_Hour'] = pd.to_datetime(test.Dep_Time).dt.hour
test['Depart_Time_Minutes']= pd.to_datetime(test.Dep_Time).dt.minute

test.drop(labels= 'Dep_Time', axis=1, inplace= True)

test['Arrival_Time_Hour'] = pd.to_datetime(test.Arrival_Time).dt.hour
test['Arrival_Time_Minutes']= pd.to_datetime(test.Arrival_Time).dt.minute

test.drop(labels = 'Arrival_Time', axis = 1, inplace = True)
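Some Arrival_Time values carry a trailing date (for example `01:10 22 Mar`) while others are a bare `13:15`, and parsing such mixed strings without an explicit format depends on how the installed pandas version infers formats. A hedged alternative is to keep only the leading `HH:MM` token and parse it with a fixed format. The helper name `hour_and_minute` is hypothetical, and the sketch assumes the time is always the first whitespace-separated token.

```python
import pandas as pd

def hour_and_minute(series):
    """Extract hour and minute from strings like '13:15' or '01:10 22 Mar'."""
    # Keep only the leading 'HH:MM' token and ignore any trailing date.
    clock = series.str.split().str[0]
    parsed = pd.to_datetime(clock, format="%H:%M")
    return parsed.dt.hour, parsed.dt.minute

demo = pd.Series(["01:10 22 Mar", "13:15", "04:25 10 Jun"])
hours, minutes = hour_and_minute(demo)
print(hours.tolist(), minutes.tolist())
```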


In [98]:
train.head()

| | Airline | Source | Destination | Route | Total_Stops | Additional_Info | Price | Journey_Date | Journey_Month | Duration_hours | Duration_minutes | Depart_Time_Hour | Depart_Time_Minutes | Arrival_Time_Hour | Arrival_Time_Minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | Banglore | New Delhi | BLR → DEL | non-stop | No info | 3897 | 24 | 3 | 2 | 50 | 22 | 20 | 1 | 10 |
| 1 | Air India | Kolkata | Banglore | CCU → IXR → BBI → BLR | 2 stops | No info | 7662 | 1 | 5 | 7 | 25 | 5 | 50 | 13 | 15 |
| 2 | Jet Airways | Delhi | Cochin | DEL → LKO → BOM → COK | 2 stops | No info | 13882 | 9 | 6 | 19 | 0 | 9 | 25 | 4 | 25 |
| 3 | IndiGo | Kolkata | Banglore | CCU → NAG → BLR | 1 stop | No info | 6218 | 12 | 5 | 5 | 25 | 18 | 5 | 23 | 30 |
| 4 | IndiGo | Banglore | New Delhi | BLR → NAG → DEL | 1 stop | No info | 13302 | 1 | 3 | 4 | 45 | 16 | 50 | 21 | 35 |


In [99]:
# Data Preprocessing
# Separating the dependent variable (Price) from the independent variables.
# Selecting Price by name avoids relying on its column position.

Y_train=train['Price'].values
X_train=train.drop(columns='Price').values


In [100]:
Y_train

array([ 3897,  7662, 13882, ...,  7229, 12648, 11753], dtype=int64)

In [101]:
X_test= test.iloc[:,:].values

### Encoding Categorical Variables

In [102]:
from sklearn.preprocessing import LabelEncoder

In [103]:
le1=LabelEncoder()
le2=LabelEncoder()


In [104]:
# Train
# Columns 0-5 are Airline, Source, Destination, Route, Total_Stops and
# Additional_Info. Each fit_transform call refits the encoder on that column.

for col in range(6):
    X_train[:, col] = le1.fit_transform(X_train[:, col])

# Test
# Note: le2 is fitted on the test columns independently of the training data,
# so a category is only guaranteed the same integer code in both sets when
# both sets contain exactly the same categories in that column.

for col in range(6):
    X_test[:, col] = le2.fit_transform(X_test[:, col])
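Because the train and test columns above are encoded with separately fitted encoders, the same category can receive different integer codes in the two sets. One possible way around this, shown here only as a sketch on made-up arrays, is to fit a single encoder per column on the union of the train and test values and then transform each set with it. This is a swapped-in variant, not what the cell above does.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up example column; the test set lacks one of the training categories.
train_col = np.array(["IndiGo", "Air India", "Jet Airways", "IndiGo"], dtype=object)
test_col = np.array(["Jet Airways", "IndiGo"], dtype=object)

# Fit once on all values, then transform both sets with the same mapping.
encoder = LabelEncoder()
encoder.fit(np.concatenate([train_col, test_col]))
train_codes = encoder.transform(train_col)
test_codes = encoder.transform(test_col)

# For comparison: fitting a separate encoder on the test column alone.
separate_codes = LabelEncoder().fit_transform(test_col)

print(train_codes, test_codes, separate_codes)
```

With the shared encoder, `IndiGo` has the same code in both sets. With the separately fitted encoder its code changes because `Air India` is absent from the test column.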



In [105]:
# Data after encoding

print(pd.DataFrame(X_train).head())

  0  1  2    3  4  5   6  7   8   9   10  11  12  13
0  3  0  5   18  4  8  24  3   2  50  22  20   1  10
1  1  3  0   84  1  8   1  5   7  25   5  50  13  15
2  4  2  1  118  1  8   9  6  19   0   9  25   4  25
3  3  3  0   91  0  8  12  5   5  25  18   5  23  30
4  3  0  5   29  0  8   1  3   4  45  16  50  21  35


### Feature Scaling


In [106]:
from sklearn.preprocessing import StandardScaler

In [107]:
sc_X=StandardScaler()

In [108]:
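X_train=sc_X.fit_transform(X_train)
# Note: the scaler is refitted on the test set here, so the test features are
# standardised with their own mean and variance rather than the training ones.
X_test= sc_X.fit_transform(X_test)

sc_y=StandardScaler()  # created but not used below

# sc_X is refitted once more, this time on the target. From here on sc_X holds
# the mean and variance of Price, which the prediction cell relies on.
Y_train= Y_train.reshape((len(Y_train),1))
Y_train=sc_X.fit_transform(Y_train)

Y_train= Y_train.ravel()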
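The cell above reuses one scaler object for the training features, the test features and the target, refitting it each time. A more conventional arrangement, sketched below on made-up numbers, keeps one scaler for the features (fitted on the training set only and then applied to the test set) and a second one for the target. The names `x_scaler` and `y_scaler` are assumptions for illustration, and switching to this scheme would change the scaled test features and therefore the predictions reported below.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers standing in for the encoded features and the prices.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])
y_tr = np.array([100.0, 200.0, 300.0])

x_scaler = StandardScaler()
y_scaler = StandardScaler()

# Learn the feature statistics from the training set only ...
X_tr_scaled = x_scaler.fit_transform(X_tr)
# ... and reuse them, without refitting, for the test set.
X_te_scaled = x_scaler.transform(X_te)

# The target gets its own scaler, which expects a 2-D array.
y_tr_scaled = y_scaler.fit_transform(y_tr.reshape(-1, 1)).ravel()

# inverse_transform maps scaled values back to the original price scale.
y_back = y_scaler.inverse_transform(y_tr_scaled.reshape(-1, 1)).ravel()
print(X_te_scaled, y_back)
```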



In [109]:
# Data after scaling

print(pd.DataFrame(X_train).head())

         0         1         2         3         4         5         6   \
0 -0.410805 -1.658359  2.416534 -1.547082  1.407210  0.499921  1.237288   
1 -1.261152  0.890014 -0.973812  0.249946 -0.253703  0.499921 -1.475307   
2  0.014369  0.040556 -0.295743  1.175687 -0.253703  0.499921 -0.531796   
3 -0.410805  0.890014 -0.973812  0.440539 -0.807341  0.499921 -0.177979   
4 -0.410805 -1.658359  2.416534 -1.247577 -0.807341  0.499921 -1.475307   

         7         8         9         10        11        12        13  
0 -1.467402 -0.970614  1.279041  1.654154 -0.234950 -1.800436 -0.890014  
1  0.250289 -0.381999 -0.196319 -1.303113  1.363607 -0.050909 -0.587094  
2  1.109135  1.030677 -1.671678 -0.607286  0.031476 -1.363054  0.018745  
3  0.250289 -0.617445 -0.196319  0.958326 -1.034229  1.407030  0.321664  
4 -1.467402 -0.735168  0.983969  0.610412  1.363607  1.115442  0.624584  


In [110]:
print(pd.DataFrame(Y_train).head())

          0
0 -1.125535
1 -0.309068
2  1.039783
3 -0.622209
4  0.914006


### Modelling with Support Vector Regression

In [111]:
from sklearn.svm import SVR

In [112]:
svr= SVR(kernel= "rbf")
svr.fit(X_train, Y_train)


SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False)

In [113]:
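# sc_X was last fitted on Y_train, so inverse_transform maps the predictions
# back to prices. Recent scikit-learn versions require a 2-D input here.
Y_pred= sc_X.inverse_transform(svr.predict(X_test).reshape(-1, 1)).ravel()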
pd.DataFrame(Y_pred, columns = ['Price']).to_excel("Final_pred.xlsx", index= False)
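The scale, fit, predict and inverse-transform steps above can also be bundled so that the bookkeeping happens inside scikit-learn. The sketch below uses `make_pipeline` for the feature scaling and `TransformedTargetRegressor` for the target scaling, on synthetic data generated purely for illustration (`X_demo` and `y_demo` are assumptions, not the flight data). It is one possible arrangement, not the approach used in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: three numeric features and a price-like target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5000 + 2000 * X_demo[:, 0] + rng.normal(scale=100, size=200)

model = TransformedTargetRegressor(
    # Features are standardised inside the pipeline before reaching the SVR.
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    # The target is standardised for fitting and mapped back on predict.
    transformer=StandardScaler(),
)

model.fit(X_demo[:150], y_demo[:150])
pred = model.predict(X_demo[150:])  # already on the original price scale
print(pred[:5])
```

With this arrangement the feature scaler is fitted only on the data passed to `fit`, and `predict` returns values on the original scale without a separate `inverse_transform` call.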

In [114]:
print(pd.DataFrame(Y_pred).head())

              0
0  10562.422245
1   4651.334644
2  12541.631054
3  10171.471601
4   3543.827258
