# ✈ Flight Price Prediction using Regression

Dataset : https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction

Analyst : Titan Bagus Bramantyo (https://linkendin.com/in/titanbr)

#### ✅ Import library

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### ✅ Load dataset

In [3]:
dataset = pd.read_csv('flight_price.csv',sep=',')
dataset.tail()

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
300148,Vistara,UK-822,Chennai,Morning,one,Evening,Hyderabad,Business,10.08,49,69265
300149,Vistara,UK-826,Chennai,Afternoon,one,Night,Hyderabad,Business,10.42,49,77105
300150,Vistara,UK-832,Chennai,Early_Morning,one,Night,Hyderabad,Business,13.83,49,79099
300151,Vistara,UK-828,Chennai,Early_Morning,one,Evening,Hyderabad,Business,10.0,49,81585
300152,Vistara,UK-822,Chennai,Morning,one,Evening,Hyderabad,Business,10.08,49,81585


**FEATURE**.
1. Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2. Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3. Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4. Departure Time: This is a derived categorical feature obtained created by grouping time periods into bins. It stores information about the departure time and have 6 unique time labels.
5. Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6. Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7. Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8. Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9. Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10. Days Left: This is a derived characteristic that is calculated by subtracting the trip date by the booking date.
11. Price: Target variable stores information of the ticket price.

#### ✅ Delete unnecessary column

In [4]:
dataset = dataset.drop(columns=['flight','days_left'])
dataset.tail()

Unnamed: 0,airline,source_city,departure_time,stops,arrival_time,destination_city,class,duration,price
300148,Vistara,Chennai,Morning,one,Evening,Hyderabad,Business,10.08,69265
300149,Vistara,Chennai,Afternoon,one,Night,Hyderabad,Business,10.42,77105
300150,Vistara,Chennai,Early_Morning,one,Night,Hyderabad,Business,13.83,79099
300151,Vistara,Chennai,Early_Morning,one,Evening,Hyderabad,Business,10.0,81585
300152,Vistara,Chennai,Morning,one,Evening,Hyderabad,Business,10.08,81585


#### ✅ Reduce records number

#### ✅ Label encoding

We need to do label encoding because we want to do PCA process which is not supports string datatype

In [5]:
from sklearn import preprocessing as pre

# Airline Encoding
le = pre.LabelEncoder()
le.fit(dataset['airline'])
dataset['airline'] = le.transform(dataset['airline'])
airline_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(airline_labels)

{'AirAsia': 0, 'Air_India': 1, 'GO_FIRST': 2, 'Indigo': 3, 'SpiceJet': 4, 'Vistara': 5}


In [6]:
# source_city
le.fit(dataset['source_city'])
dataset['source_city'] = le.transform(dataset['source_city'])
source_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(source_labels)

{'Bangalore': 0, 'Chennai': 1, 'Delhi': 2, 'Hyderabad': 3, 'Kolkata': 4, 'Mumbai': 5}


In [7]:
# stops
le.fit(dataset['stops'])
dataset['stops'] = le.transform(dataset['stops'])
stops_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(stops_labels)

{'one': 0, 'two_or_more': 1, 'zero': 2}


In [8]:
# departure time
le.fit(dataset['departure_time'])
dataset['departure_time'] = le.transform(dataset['departure_time'])
deptime_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(deptime_labels)

{'Afternoon': 0, 'Early_Morning': 1, 'Evening': 2, 'Late_Night': 3, 'Morning': 4, 'Night': 5}


In [9]:
# arrival time
le.fit(dataset['arrival_time'])
dataset['arrival_time'] = le.transform(dataset['arrival_time'])
arrtime_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(arrtime_labels)

{'Afternoon': 0, 'Early_Morning': 1, 'Evening': 2, 'Late_Night': 3, 'Morning': 4, 'Night': 5}


In [10]:
# destination city
le.fit(dataset['destination_city'])
dataset['destination_city'] = le.transform(dataset['destination_city'])
des_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(des_labels)

{'Bangalore': 0, 'Chennai': 1, 'Delhi': 2, 'Hyderabad': 3, 'Kolkata': 4, 'Mumbai': 5}


In [11]:
# class seat
le.fit(dataset['class'])
dataset['class'] = le.transform(dataset['class'])
class_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(class_labels)

{'Business': 0, 'Economy': 1}


#### ✅ Separating variable and target

In [52]:
X = dataset.iloc[:,:-1]
Y = dataset.iloc[:,1]
print(X)
print(Y)

dataframe = pd.DataFrame(dataset)
dataframe.head

        airline  source_city  departure_time  stops  arrival_time  \
0             4            2               2      2             5   
1             4            2               1      2             4   
2             0            2               1      2             1   
3             5            2               4      2             0   
4             5            2               4      2             4   
...         ...          ...             ...    ...           ...   
300148        5            1               4      0             2   
300149        5            1               0      0             5   
300150        5            1               1      0             5   
300151        5            1               1      0             2   
300152        5            1               4      0             2   

        destination_city  class  duration  
0                      5      1      2.17  
1                      5      1      2.33  
2                      5      1      2.

<bound method NDFrame.head of         airline  source_city  departure_time  stops  arrival_time  \
0             4            2               2      2             5   
1             4            2               1      2             4   
2             0            2               1      2             1   
3             5            2               4      2             0   
4             5            2               4      2             4   
...         ...          ...             ...    ...           ...   
300148        5            1               4      0             2   
300149        5            1               0      0             5   
300150        5            1               1      0             5   
300151        5            1               1      0             2   
300152        5            1               4      0             2   

        destination_city  class  duration  price  
0                      5      1      2.17   5953  
1                      5      1      2.

#### ✅ PCA process

In [53]:
from sklearn.decomposition import PCA

In [54]:
pca = PCA(n_components=2)

fit_pca = pca.fit_transform(X)
new_df = pd.DataFrame(data = fit_pca, columns = ['x', 'y'])
new_df.head()

Unnamed: 0,x,y
0,-10.122329,1.481031
1,-9.986488,1.682313
2,-10.151544,2.651835
3,-10.009926,2.732258
4,-9.921393,1.753761


#### ✅ Splitting data for training and test

In [55]:
from sklearn.model_selection import train_test_split as trates
X_train, X_test, Y_train, Y_test = trates(X, Y, test_size=0.2, random_state=7)

#### ✅ Using Linear Regression

In [62]:
from sklearn.linear_model import LinearRegression

regressor=LinearRegression()
regressor.fit(X_train, Y_train)

LinearRegression()

#### ✅ Prediction data test result

In [63]:
y_pred = regressor.predict(X_test)
accuracy = regressor.score(X_test,Y_test)*100
print("ML model Accuracy is",accuracy,'%')

ML model Accuracy is 100.0 %


#### ✅ Evaluate linear regression

In [65]:
from sklearn.metrics import mean_squared_error,r2_score
from math import sqrt

print(Y_test)
print(y_pred)
print("RMSE = ", sqrt(mean_squared_error(Y_test,y_pred)))
print("r2_score: ",r2_score(Y_test,y_pred))

37442     2
26706     2
172802    3
240001    5
82508     5
         ..
210007    2
284192    3
104379    0
256618    0
84990     0
Name: source_city, Length: 60031, dtype: int32
[ 2.00000000e+00  2.00000000e+00  3.00000000e+00 ... -6.56967120e-13
 -6.49318578e-13 -6.39267985e-13]
RMSE =  4.3411651916857195e-13
r2_score:  1.0


#### ✅ Using Support Vector Regression

In [66]:
from sklearn import svm

regr = svm.SVR(kernel="linear")
regr.fit(X_train, Y_train)

SVR(kernel='linear')

#### ✅ Prediction data test result

In [68]:
y_pred = regr.predict(X_test)

print(Y_test)
print(y_pred)

37442     2
26706     2
172802    3
240001    5
82508     5
         ..
210007    2
284192    3
104379    0
256618    0
84990     0
Name: source_city, Length: 60031, dtype: int32
[2.02001017 2.020068   2.98015891 ... 0.09944401 0.09968661 0.10000427]


#### ✅ Evaluate SVR

In [71]:
print(Y_test)
print(y_pred)

print("RMSE SVR = ", sqrt(mean_squared_error(Y_test, y_pred)))
print("r2_score: ",r2_score(Y_test,y_pred))

37442     2
26706     2
172802    3
240001    5
82508     5
         ..
210007    2
284192    3
104379    0
256618    0
84990     0
Name: source_city, Length: 60031, dtype: int32
[2.02001017 2.020068   2.98015891 ... 0.09944401 0.09968661 0.10000427]
RMSE SVR =  0.06972792687222241
r2_score:  0.9984030719186624


#### ✅ Using KNN Regression

In [74]:
from sklearn import neighbors
n_neighbors = 3

knn = neighbors.KNeighborsRegressor(n_neighbors, weights="uniform")
y_pred = knn.fit(X_train, Y_train).predict(X_test)

#### ✅ Evaluate KNN

In [76]:
from sklearn.metrics import mean_squared_error
from math import sqrt

print(Y_test)
print(y_pred)

print("RMSE KNN Reg. = ",sqrt(mean_squared_error(Y_test, y_pred)))
print("r2_score: ",r2_score(Y_test,y_pred))

37442     2
26706     2
172802    3
240001    5
82508     5
         ..
210007    2
284192    3
104379    0
256618    0
84990     0
Name: source_city, Length: 60031, dtype: int32
[2. 2. 3. ... 0. 0. 0.]
RMSE KNN Reg. =  0.03164386979646677
r2_score:  0.9996711097269335


#### ✅ Using MLP Regression

In [77]:
from sklearn.neural_network import MLPRegressor
regr = MLPRegressor(random_state=1, max_iter=5000).fit(X_train, Y_train)
y_pred = regr.predict(X_test)

#### ✅ Evaluate MLP

In [79]:
from sklearn.metrics import mean_squared_error
from math import sqrt

print(Y_test)
print(y_pred)

print("RMSE MLP Reg. = ",sqrt(mean_squared_error(Y_test, y_pred)))
print("r2_score: ",r2_score(Y_test,y_pred))

37442     2
26706     2
172802    3
240001    5
82508     5
         ..
210007    2
284192    3
104379    0
256618    0
84990     0
Name: source_city, Length: 60031, dtype: int32
[2.0108242  2.02186187 3.02422002 ... 0.00533524 0.01043447 0.01832694]
RMSE MLP Reg. =  0.015440855799478504
r2_score:  0.9999216904756688


## 😎 Conclusion

_Dari hasil percobaan beberapa algoritma dalam regresi, dihasilkan nilai RMSE sebagai berikut._
1. Linear Regression : **4.34**
2. SVR Regression : **0.069**
3. KNN Regression : **0.031**
4. MLP Regression : **0.015**

_Apabila nilai RMSE semakin kecil maka semakin baik. Dengan begitu algoritma **MLP Regression** adalah algoritma yang paling baik di antara ke-empat algoritma yang diuji_