<a href="https://colab.research.google.com/github/makhmudov-khondamir/Machine-Learning-Projects/blob/main/Airfare%20price%20prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Airfare price prediction**
Predicting future airline ticket prices from flight details such as airline, route, class, duration, and days left before departure.

In [122]:
# extract the zip file and prepare the dataset

import zipfile
import os

# path to the zip file
zip_path = 'aviachipta-narxini-bashorat-qilish.zip'

# directory for extraction
extract_dir = '/content/extracted_files'

# create the directory if it does not exist
os.makedirs(extract_dir, exist_ok=True)

# and extract the zip file
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)


In [123]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import plotly.express as px

In [124]:
df=pd.read_csv("/content/extracted_files/train_data.csv")
df

Unnamed: 0,id,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,1,Vistara,UK-810,Bangalore,Early_Morning,one,Night,Mumbai,Economy,14.25,21,7212
1,2,SpiceJet,SG-5094,Hyderabad,Evening,zero,Night,Kolkata,Economy,1.75,7,5292
2,3,Vistara,UK-846,Bangalore,Morning,one,Evening,Delhi,Business,9.58,5,60553
3,4,Vistara,UK-706,Kolkata,Morning,one,Evening,Hyderabad,Economy,6.75,28,5760
4,5,Indigo,6E-5394,Chennai,Early_Morning,zero,Morning,Mumbai,Economy,2.00,4,10712
...,...,...,...,...,...,...,...,...,...,...,...,...
19995,19996,Indigo,6E-6178,Bangalore,Night,one,Early_Morning,Mumbai,Economy,7.92,45,3153
19996,19997,AirAsia,I5-582,Kolkata,Morning,one,Afternoon,Delhi,Economy,5.83,24,3911
19997,19998,Vistara,UK-832,Chennai,Early_Morning,two_or_more,Evening,Bangalore,Economy,35.33,17,14822
19998,19999,Vistara,UK-996,Mumbai,Evening,one,Morning,Bangalore,Economy,16.33,21,6450


In [125]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                20000 non-null  int64  
 1   airline           20000 non-null  object 
 2   flight            20000 non-null  object 
 3   source_city       20000 non-null  object 
 4   departure_time    20000 non-null  object 
 5   stops             20000 non-null  object 
 6   arrival_time      20000 non-null  object 
 7   destination_city  20000 non-null  object 
 8   class             20000 non-null  object 
 9   duration          20000 non-null  float64
 10  days_left         20000 non-null  int64  
 11  price             20000 non-null  int64  
dtypes: float64(1), int64(3), object(8)
memory usage: 1.8+ MB


### **Based on the dataset information above, I can conclude that:**
-------------------
**Columns that we don't need for building our models and that should be dropped:**
- flight
- id

**Categorical columns that need to be encoded with OneHotEncoder:**
- airline (6 unique values)
- source_city (6 unique values)
- departure_time (6 unique values)
- stops (3 unique values)
- arrival_time (6 unique values)
- destination_city (6 unique values)
- class (2 unique values)
-----------------
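
The unique-value counts listed above can be checked directly with pandas. The sketch below uses a small toy frame (its rows are illustrative, not the real training data), so the counts it prints differ from the ones in the list; on the real data, the same call would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the training data (illustrative values only)
toy = pd.DataFrame({
    "airline": ["Vistara", "SpiceJet", "Vistara", "Indigo"],
    "stops": ["one", "zero", "one", "zero"],
    "class": ["Economy", "Economy", "Business", "Economy"],
    "duration": [14.25, 1.75, 9.58, 2.00],
})

# Columns treated as categorical in this sketch
categorical_cols = ["airline", "stops", "class"]

# nunique() counts the distinct values in each selected column
unique_counts = toy[categorical_cols].nunique()
print(unique_counts)
```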

In [126]:
df.describe()

Unnamed: 0,id,duration,days_left,price
count,20000.0,20000.0,20000.0,20000.0
mean,10000.5,12.177627,25.92415,20960.2817
std,5773.647028,7.157944,13.624874,22775.459535
min,1.0,0.83,1.0,1105.0
25%,5000.75,6.83,14.0,4783.0
50%,10000.5,11.25,26.0,7425.0
75%,15000.25,16.08,38.0,42521.0
max,20000.0,38.58,49.0,114523.0


In [127]:
fig = px.histogram(df, x='price')
fig.show()

In [128]:
df.drop(df[df['price'] > 70000].index, inplace=True)

**I decided to drop rows with a price above 70000 based on the histogram above, since these outliers may make our models perform worse.**
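
As a minimal sketch of the same filtering step, the threshold can also be applied with a boolean mask, which makes it easy to report how many rows are removed. The toy prices and the name `PRICE_CAP` are assumptions for illustration; only the 70000 threshold comes from the notebook.

```python
import pandas as pd

# Toy prices (illustrative only); the cap is the threshold used in the notebook
toy = pd.DataFrame({"price": [7212, 5292, 60553, 114523, 70000, 98000]})
PRICE_CAP = 70000

# Keep rows at or below the cap; equivalent to dropping rows with price > cap
filtered = toy[toy["price"] <= PRICE_CAP].reset_index(drop=True)

removed = len(toy) - len(filtered)
print(f"removed {removed} of {len(toy)} rows")
```

Unlike `drop(..., inplace=True)`, this leaves the original frame untouched, so the cell can be re-run safely.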

In [129]:
x=df.drop('price',axis=1)
y=df['price']

In [130]:
x.drop(['flight','id'],axis=1,inplace=True)

In [131]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [132]:
categorical=['airline','source_city','departure_time','stops','arrival_time','destination_city','class']
numerical=['duration','days_left']

In [133]:
pipelineCat=Pipeline([
    ('encoder',OneHotEncoder())
])
pipelineNum=Pipeline([
    ('scaler',StandardScaler())
])

fullpipeline=ColumnTransformer([
    ('categorical',pipelineCat,categorical),
    ('numerical',pipelineNum,numerical)]
)

In [134]:
Xtrain=fullpipeline.fit_transform(x_train)
Xtest=fullpipeline.transform(x_test)
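
The preprocessing here is fit on the training split and only applied to the test split. One possible variation, sketched below on toy data, is to chain the preprocessing and the regressor into a single `Pipeline` so both steps are fit and applied together. The toy frames and the `handle_unknown="ignore"` setting are assumptions of this sketch, not part of the notebook; that setting encodes categories unseen during fitting as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (illustrative values only)
train = pd.DataFrame({
    "airline": ["Vistara", "Indigo", "Vistara", "Indigo"],
    "duration": [2.0, 4.0, 6.0, 8.0],
    "price": [5000, 4000, 9000, 8000],
})
# "AirAsia" never appears in the toy training frame
new_rows = pd.DataFrame({"airline": ["AirAsia", "Vistara"], "duration": [3.0, 5.0]})

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["airline"]),
    ("numerical", StandardScaler(), ["duration"]),
])

# Preprocessing and model are fit together and reused as one object
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
model.fit(train[["airline", "duration"]], train["price"])

predictions = model.predict(new_rows)
print(predictions)
```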

### **Building Models**
I decided to build LinearRegression and RandomForestRegressor models.

In [135]:
linearregression=LinearRegression()
LR_model=linearregression.fit(Xtrain,y_train)

tree=RandomForestRegressor()
RF_model=tree.fit(Xtrain,y_train)

In [None]:
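# to save a trained model so it can be downloaded, uncomment the lines below;
# mavjud_model_nomi ("existing model name") and yangi_model_nomi ("new model name")
# are placeholders for the file name and the model variable
# import pickle
# with open('/content/mavjud_model_nomi.pkl','wb') as file:
#     pickle.dump(yangi_model_nomi,file)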
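
A minimal round-trip sketch of saving and reloading a fitted model with `pickle` follows. The tiny model, the temporary directory, and the file name `example_model.pkl` are hypothetical stand-ins; in the notebook, the object passed to `pickle.dump` would be one of the trained models.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LinearRegression

# Small stand-in model (hypothetical data) so the sketch runs on its own
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 2.0, 4.0])

# Hypothetical output location inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), "example_model.pkl")

# Save the fitted model to disk
with open(path, "wb") as file:
    pickle.dump(model, file)

# Load it back and confirm it predicts the same value as the original
with open(path, "rb") as file:
    restored = pickle.load(file)

same = bool((restored.predict([[3.0]]) == model.predict([[3.0]])).all())
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.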

### **Applying the models to the held-out split for testing and evaluation**

In [136]:
predictionLR=LR_model.predict(Xtest)
predictionRF=RF_model.predict(Xtest)

In [137]:
#Mean Absolute Error (MAE)
maeLR=mean_absolute_error(y_test, predictionLR)
maeRF=mean_absolute_error(y_test, predictionRF)
print('Mean Absolute Error (MAE):')
print(f'maeLR: {maeLR}')
print(f'maeRF: {maeRF}')

#Root Mean Squared Error (RMSE)
rmseLR=np.sqrt(mean_squared_error(y_test, predictionLR))
rmseRF=np.sqrt(mean_squared_error(y_test, predictionRF))
print('\nRoot Mean Squared Error (RMSE):')
print(f'rmseLR: {rmseLR}')
print(f'rmseRF: {rmseRF}')

Mean Absolute Error (MAE):
maeLR: 4114.378434329924
maeRF: 1697.3051326604423

Root Mean Squared Error (RMSE):
rmseLR: 5869.539549388632
rmseRF: 3104.224905603567


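Based on this evaluation, RandomForest is clearly superior: its MAE (about 1697) and RMSE (about 3104) are both well below those of LinearRegression (about 4114 and 5870). This is not surprising, since tree ensembles often capture non-linear patterns in tabular data better than a linear model. So we continue with RF_model to make predictions on the separate test dataset.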
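
To make the two metrics concrete, the sketch below computes MAE and RMSE by hand with NumPy on three made-up values (not taken from the dataset). RMSE squares each error before averaging, so a single large miss raises it more than it raises MAE.

```python
import numpy as np

# Made-up targets and predictions (illustrative only)
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_pred - y_true

# MAE: average size of the errors, ignoring their sign
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```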


# **Testing**

In [138]:
test_set=pd.read_csv("/content/extracted_files/test_data.csv")
test_set

Unnamed: 0,id,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left
0,1,Air_India,AI-765,Kolkata,Evening,one,Night,Delhi,Business,28.25,2
1,2,Vistara,UK-747,Delhi,Early_Morning,one,Night,Mumbai,Business,13.83,34
2,3,Air_India,AI-570,Mumbai,Early_Morning,zero,Early_Morning,Chennai,Business,2.00,30
3,4,AirAsia,I5-974,Hyderabad,Night,one,Late_Night,Delhi,Economy,5.17,26
4,5,Air_India,AI-770,Kolkata,Night,one,Afternoon,Mumbai,Economy,16.33,35
...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,Air_India,AI-768,Kolkata,Afternoon,one,Morning,Bangalore,Business,17.42,15
4996,4997,Indigo,6E-6214,Kolkata,Morning,zero,Afternoon,Mumbai,Economy,3.00,40
4997,4998,Air_India,AI-402,Kolkata,Morning,one,Night,Mumbai,Business,11.17,37
4998,4999,Air_India,AI-673,Mumbai,Early_Morning,one,Night,Hyderabad,Business,13.33,38


In [139]:
preparedX=fullpipeline.transform(test_set)
prediction=RF_model.predict(preparedX)

In [142]:
solution=pd.DataFrame({'id':test_set['id'],'price':prediction})

In [None]:
solution.to_csv('solution.csv',index=False)