# Airline fare price prediction

Using the dataset provided by [lalit_joshi](https://www.kaggle.com/datasets/lalitjoshi89/airlinepriceprediction)

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import unicodedata

In [2]:
# Loading "raw" data from csv for a quick glance
data = pd.read_csv(r'dataset/airlines_data.csv')
data.info

<bound method DataFrame.info of      ;Airline_Name;Date_of_Journey;Source;Destination;Dept_Time;Total_Stops;Duration_of_Flight;Arr_Time;Fare
0     0;AirAsia;26/02/2022;Kolkata;Mumbai;13:30;1 st...                                                     
1     1;AirAsia;26/02/2022;Kolkata;Mumbai;9:05;2 sto...                                                     
2     2;AirAsia;26/02/2022;Kolkata;Mumbai;16:15;1 st...                                                     
3     3;AirAsia;26/02/2022;Kolkata;Mumbai;23:40;1 st...                                                     
4     4;AirAsia;26/02/2022;Kolkata;Mumbai;20:00;1 st...                                                     
...                                                 ...                                                     
2016  2019;Vistara;10/7/22;Mumbai;Chennai;6:20;2 Sto...                                                     
2017  2020;Vistara;10/7/22;Mumbai;Chennai;11:25;2 St...                                         

### The Dataset

The data comes as an Excel file, which I re-exported as a CSV directly from Excel (this could also have been done in Python).

The data has 10 columns (features) and about 2,020 rows (observations), but everything lands in a single column because the file is delimited by semicolons. Let's fix that!

In [3]:
# Loading the data with the correct delimiter so it is parsed into separate columns
data = pd.read_csv(r'dataset/airlines_data.csv', delimiter=';', thousands=' ')
data

Unnamed: 0.1,Unnamed: 0,Airline_Name,Date_of_Journey,Source,Destination,Dept_Time,Total_Stops,Duration_of_Flight,Arr_Time,Fare
0,0,AirAsia,26/02/2022,Kolkata,Mumbai,13:30,1 stop,07 h 05 m,20:35,3 379
1,1,AirAsia,26/02/2022,Kolkata,Mumbai,9:05,2 stop,13 h 10 m,22:15,3 379
2,2,AirAsia,26/02/2022,Kolkata,Mumbai,16:15,1 stop,08 h 20 m,0:35,3 379
3,3,AirAsia,26/02/2022,Kolkata,Mumbai,23:40,1 stop,06 h 55 m,6:35,3 379
4,4,AirAsia,26/02/2022,Kolkata,Mumbai,20:00,1 stop,10 h 35 m,6:35,3 379
...,...,...,...,...,...,...,...,...,...,...
2016,2019,Vistara,10/7/22,Mumbai,Chennai,6:20,2 Stop,13h 55m,20:15,15 192
2017,2020,Vistara,10/7/22,Mumbai,Chennai,11:25,2 Stop,11h 20m,22:45,16 442
2018,2021,Vistara,10/7/22,Mumbai,Chennai,6:45,2 Stop,13h 30m,20:15,16 442
2019,2022,Vistara,10/7/22,Mumbai,Chennai,9:05,2 Stop,11h 10m,20:15,17 282


In [4]:
# Observing data types
data.dtypes

Unnamed: 0             int64
Airline_Name          object
Date_of_Journey       object
Source                object
Destination           object
Dept_Time             object
Total_Stops           object
Duration_of_Flight    object
Arr_Time              object
Fare                  object
dtype: object

In [5]:
# Checking for missing values
data.isnull().sum()

Unnamed: 0            0
Airline_Name          0
Date_of_Journey       0
Source                0
Destination           0
Dept_Time             0
Total_Stops           0
Duration_of_Flight    0
Arr_Time              0
Fare                  0
dtype: int64

### **Good!** and **Bad!**

Now there are some obvious issues that need to be addressed:

- The first `Unnamed` feature is redundant and needs to be dropped, since it just duplicates the index.
- The date format in the `Date_of_Journey` feature is inconsistent.
- The `Total_Stops` feature should be converted to numerical values.

Much of our data does not have the proper type:
- `Date_of_Journey` should be `datetime`
//- `Dept_Time`, `Arr_time`, `Duration_of_Flight` should be `datetime` or `timedelta`//
- `Total_Stops` should be `int`
- `Fare` should be `int`

It would be a good idea to rename the features to shorter names.

I found that `Airline_Name`, `Source` and `Destination` contain different spellings of the same value, and these need to be unified, e.g. "Air Asia" instead of "AirAsia" or "MAA" instead of "Chennai".

In [6]:
# Dropping the "Unnamed" feature
data.drop(data.columns[0], axis=1, inplace=True)

In [7]:
# Renaming and reordering features.
new_cols = ["airline_name", "flight_date", "flight_dep", "flight_arr", "dep_time", "total_stops", "flight_time", "arr_time", "flight_fare"]
data.columns = new_cols

new_cols_ord = ["airline_name", "flight_date", "flight_dep", "dep_time", "flight_arr", "arr_time", "total_stops","flight_time", "flight_fare"]
data = data.reindex(columns=new_cols_ord)

In [8]:
# Changing data type of flight_date to datetime (displayed as Y-M-D).
data['flight_date'] = pd.to_datetime(data['flight_date'])
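Because `Date_of_Journey` mixes spellings such as `26/02/2022` and `10/7/22`, letting pandas guess the format can silently swap day and month. The sketch below parses each spelling with an explicit format instead. It assumes every date in the file is day-first, which the source does not confirm; `parse_day_first` and the sample values are hypothetical names used only for illustration.

```python
import pandas as pd

def parse_day_first(dates: pd.Series) -> pd.Series:
    """Parse 'dd/mm/YYYY' and 'dd/mm/yy' strings, assuming day-first order."""
    # Rows that do not match the four-digit-year format become NaT...
    long_year = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
    # ...and are then filled from the two-digit-year format.
    short_year = pd.to_datetime(dates, format="%d/%m/%y", errors="coerce")
    return long_year.fillna(short_year)

# Hypothetical sample using the two spellings seen in the raw data
sample = pd.Series(["26/02/2022", "10/7/22"])
parsed = parse_day_first(sample)
print(parsed)
```

If the short-year dates turn out to be month-first, only the second format string would need to change to `"%m/%d/%y"`.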

In [9]:
# Changing dep_time and arr_time data type.

"""Each value would become a datetime.time object, although data.dtypes would still report the column as object"""

# data['dep_time'] = pd.to_datetime(data.dep_time, format='%H:%M').dt.time
# data['arr_time'] = pd.to_datetime(data.arr_time, format='%H:%M').dt.time

'Each value would become a datetime.time object, although data.dtypes would still report the column as object'

In [10]:
# Converting strings to numerical values in the total_stops feature.

"""numpy.where is a vectorized version of if/else, with the
condition constructed by str.contains"""

data['total_stops'] = np.where(data.total_stops.str.contains("1"), 1,
                      np.where(data.total_stops.str.contains("2"), 2,
                      np.where(data.total_stops.str.contains("3"), 3, 0)))
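The nested `np.where` calls work, but they need one extra branch per stop count. As a possible alternative, the sketch below pulls the first number out of each string with a regular expression and treats values without a digit as zero stops. The helper name `stops_to_int` and the sample labels (including `"non-stop"`) are assumptions made for illustration, not values confirmed by the dataset.

```python
import pandas as pd

def stops_to_int(stops: pd.Series) -> pd.Series:
    """Extract the leading stop count; strings without a digit count as 0."""
    digits = stops.str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

# Hypothetical sample labels
sample = pd.Series(["1 stop", "2 Stop", "non-stop", "3 stops"])
stop_counts = stops_to_int(sample)
print(stop_counts.tolist())
```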

In [11]:
# Removing space from strings on flight_time feature
# data['flight_time'] = data['flight_time'].apply(lambda x:  x.replace(' ',''));
# data.head()

In [12]:
# Changing "flight_fare" to int data type

"""The feature came with a non-breaking space '\xa0' as thousands separator; unicodedata.normalize() turns it into a regular space, which is then removed"""

data['flight_fare'] = data['flight_fare'].apply(lambda x: unicodedata.normalize("NFKD", x).replace(' ', ''))
data['flight_fare'] = pd.to_numeric(data['flight_fare'])
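The same cleanup can probably be done without `unicodedata`, because the regex class `\s` also matches the non-breaking space. This is a minimal sketch under that assumption; `fare_to_int` and the sample fares are hypothetical.

```python
import pandas as pd

def fare_to_int(fares: pd.Series) -> pd.Series:
    """Remove any whitespace (including '\xa0') and convert to numbers."""
    return pd.to_numeric(fares.str.replace(r"\s+", "", regex=True))

# Hypothetical sample: non-breaking space, regular space, no separator
sample = pd.Series(["3\xa0379", "15 192", "980"])
fare_values = fare_to_int(sample)
print(fare_values.tolist())
```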


In [13]:
# Unifying duplicate spellings in the 'airline_name' feature
airline_values = {
    'Air Asia':'AirAsia',
    'Spicejet':'SpiceJet'
}

for key, value in airline_values.items():
    # Replace every entry equal to key with value (whole-value match)
    data['airline_name'] = data['airline_name'].replace(key, value)

In [14]:
# Unifying airport codes and duplicate spellings in the 'flight_dep' feature
dest_values = {'DEL':'New Delhi', 'GAU': 'Guwahati',
                'MAA': 'Chennai', 'BLR':'Bangalore',
                'CCU':'Kolkata', 'BOM':'Mumbai',
                'Bengaluru':'Bangalore'}

for key, value in dest_values.items():
    # Replace every entry equal to key with value (whole-value match)
    data['flight_dep'] = data['flight_dep'].replace(key, value)

In [15]:
# Unifying airport codes and duplicate spellings in the 'flight_arr' feature
for key, value in dest_values.items():
    # Replace every entry equal to key with value (whole-value match)
    data['flight_arr'] = data['flight_arr'].replace(key, value)
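The two loops above apply the same mapping to two columns. `DataFrame.replace` also accepts a dictionary directly, so both columns could be cleaned in one step. The sketch below shows the idea on a small hypothetical frame; `routes`, `city_names` and the sample rows are illustrative only.

```python
import pandas as pd

# Subset of the mapping used above
city_names = {"DEL": "New Delhi", "MAA": "Chennai", "Bengaluru": "Bangalore"}

# Hypothetical sample rows
routes = pd.DataFrame({
    "flight_dep": ["DEL", "Mumbai", "Bengaluru"],
    "flight_arr": ["MAA", "Chennai", "Kolkata"],
})

# Apply the same whole-value mapping to both city columns at once
city_cols = ["flight_dep", "flight_arr"]
routes[city_cols] = routes[city_cols].replace(city_names)
print(routes)
```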

In [16]:
data.tail(20)

Unnamed: 0,airline_name,flight_date,flight_dep,dep_time,flight_arr,arr_time,total_stops,flight_time,flight_fare
2001,Vistara,2022-10-07,Mumbai,19:45,Chennai,9:55,1,14h 10m,11129
2002,Vistara,2022-10-07,Mumbai,18:30,Chennai,9:55,1,15h 25m,11129
2003,Vistara,2022-10-07,Mumbai,17:35,Chennai,9:55,1,16h 20m,11129
2004,Vistara,2022-10-07,Mumbai,15:45,Chennai,9:55,1,18h 10m,11129
2005,Vistara,2022-10-07,Mumbai,14:40,Chennai,9:55,1,19h 15m,11129
2006,Vistara,2022-10-07,Mumbai,15:45,Chennai,22:45,1,7h 00m,11307
2007,Vistara,2022-10-07,Mumbai,14:40,Chennai,22:45,1,8h 05m,11307
2008,Vistara,2022-10-07,Mumbai,11:55,Chennai,20:15,1,8h 20m,11307
2009,Vistara,2022-10-07,Mumbai,11:55,Chennai,22:45,1,10h 50m,11307
2010,Vistara,2022-10-07,Mumbai,8:45,Chennai,20:15,1,11h 30m,11307


In [17]:
# Creating a new feature with only the flight day
data['flight_day'] = pd.to_datetime(data.flight_date, format = "%Y-%m-%d").dt.day

In [18]:
# Creating a new feature with only the flight month
data['flight_month'] = pd.to_datetime(data.flight_date, format = "%Y-%m-%d").dt.month

In [19]:
# After the flight date has been used to create the new features (flight_day, flight_month) we can drop it
data.drop(['flight_date'], axis = 1, inplace = True)

In [20]:
# Extracting the values from dep_time into new features

# Extracting the hours from the departure time
data['dep_hour'] = pd.to_datetime(data['dep_time']).dt.hour

# Extracting the minutes from the departure time
data['dep_min'] = pd.to_datetime(data['dep_time']).dt.minute

# Dropping the dep_time feature
data.drop(['dep_time'], axis = 1, inplace = True)

In [21]:
data.head()

Unnamed: 0,airline_name,flight_dep,flight_arr,arr_time,total_stops,flight_time,flight_fare,flight_day,flight_month,dep_hour,dep_min
0,AirAsia,Kolkata,Mumbai,20:35,1,07 h 05 m,3379,26,2,13,30
1,AirAsia,Kolkata,Mumbai,22:15,2,13 h 10 m,3379,26,2,9,5
2,AirAsia,Kolkata,Mumbai,0:35,1,08 h 20 m,3379,26,2,16,15
3,AirAsia,Kolkata,Mumbai,6:35,1,06 h 55 m,3379,26,2,23,40
4,AirAsia,Kolkata,Mumbai,6:35,1,10 h 35 m,3379,26,2,20,0


In [22]:
# Same as with departure time now extracting the values from arr_time into new features

# Extracting the hours from the arrival time
data['arr_hour'] = pd.to_datetime(data['arr_time']).dt.hour

# Extracting the minutes from the arrival time
data['arr_min'] = pd.to_datetime(data['arr_time']).dt.minute

# Dropping the arr_time feature
data.drop(['arr_time'], axis = 1, inplace = True)
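Departure and arrival times go through the same steps, so the hour/minute extraction could be wrapped in one helper. The sketch below splits the `H:MM` strings on the colon instead of parsing them as datetimes, which avoids attaching today's date to each value. `split_clock_time` and the sample times are hypothetical.

```python
import pandas as pd

def split_clock_time(times: pd.Series, prefix: str) -> pd.DataFrame:
    """Split 'H:MM' strings into integer '<prefix>_hour' and '<prefix>_min' columns."""
    parts = times.str.split(":", expand=True).astype(int)
    parts.columns = [f"{prefix}_hour", f"{prefix}_min"]
    return parts

# Hypothetical sample in the format used by dep_time and arr_time
sample = pd.Series(["13:30", "9:05", "0:35"])
dep_parts = split_clock_time(sample, "dep")
print(dep_parts)
```

The result could then be joined back with `pd.concat([data, dep_parts], axis=1)` before dropping the original column.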

In [23]:
data.tail(20)

Unnamed: 0,airline_name,flight_dep,flight_arr,total_stops,flight_time,flight_fare,flight_day,flight_month,dep_hour,dep_min,arr_hour,arr_min
2001,Vistara,Mumbai,Chennai,1,14h 10m,11129,7,10,19,45,9,55
2002,Vistara,Mumbai,Chennai,1,15h 25m,11129,7,10,18,30,9,55
2003,Vistara,Mumbai,Chennai,1,16h 20m,11129,7,10,17,35,9,55
2004,Vistara,Mumbai,Chennai,1,18h 10m,11129,7,10,15,45,9,55
2005,Vistara,Mumbai,Chennai,1,19h 15m,11129,7,10,14,40,9,55
2006,Vistara,Mumbai,Chennai,1,7h 00m,11307,7,10,15,45,22,45
2007,Vistara,Mumbai,Chennai,1,8h 05m,11307,7,10,14,40,22,45
2008,Vistara,Mumbai,Chennai,1,8h 20m,11307,7,10,11,55,20,15
2009,Vistara,Mumbai,Chennai,1,10h 50m,11307,7,10,11,55,22,45
2010,Vistara,Mumbai,Chennai,1,11h 30m,11307,7,10,8,45,20,15


In [24]:
len(data['flight_time'][0].split())

4

In [25]:
data['flight_time'].value_counts()

2h 10m       57
2h 15m       47
2h 05m       31
2h 20m       30
2h 55m       29
             ..
15h 55m       1
14h 40m       1
08 h 55 m     1
12 h 25 m     1
23h 20m       1
Name: flight_time, Length: 439, dtype: int64

In [26]:
data['flight_time'][0].split((", "))

['07 h 05 m']

In [32]:
# Extracting the flight time into hours and minutes from the flight_time feature

# The feature mixes two spellings ("07 h 05 m" and "2h 10m"), so counting
# whitespace-separated tokens is unreliable. Removing all whitespace first
# leaves one canonical form: "07h05m", "2h" or "45m".
flight_time = [''.join(ft.split()) for ft in data['flight_time']]

for i in range(len(flight_time)):
    if 'h' not in flight_time[i]:
        flight_time[i] = '0h' + flight_time[i] # Adds 0 hours
    if 'm' not in flight_time[i]:
        flight_time[i] = flight_time[i] + '0m' # Adds 0 minutes


In [40]:

flight_time_hours = []
flight_time_mins = []
for ft in flight_time:
    hours, mins = ft.rstrip('m').split('h') # "07h05m" -> ("07", "05")
    flight_time_hours.append(int(hours)) # Extracts the hours from flight_time
    flight_time_mins.append(int(mins)) # Extracts only minutes from flight_time

In [None]:
# Adding flight_time_hours and flight_time_mins list to dataframe

data['flight_time_hours'] = flight_time_hours
data['flight_time_mins'] = flight_time_mins

# Dropping the flight_time feature after the new features have been created
data.drop(['flight_time'], axis = 1, inplace = True)
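The hour and minute split can also be written without an explicit loop. The sketch below uses two regular expressions, one per unit, so it tolerates both spellings seen in `value_counts()` (`"07 h 05 m"` and `"2h 10m"`) as well as values with only hours or only minutes. `split_duration` and the sample values are hypothetical, and the hour-only and minute-only inputs are assumptions about what the data may contain.

```python
import pandas as pd

def split_duration(durations: pd.Series) -> pd.DataFrame:
    """Return integer 'hours' and 'mins' columns; a missing unit counts as 0."""
    hours = pd.to_numeric(durations.str.extract(r"(\d+)\s*h", expand=False), errors="coerce")
    mins = pd.to_numeric(durations.str.extract(r"(\d+)\s*m", expand=False), errors="coerce")
    return pd.DataFrame({
        "hours": hours.fillna(0).astype(int),
        "mins": mins.fillna(0).astype(int),
    })

# Hypothetical sample covering both spellings and single-unit values
sample = pd.Series(["07 h 05 m", "2h 10m", "5 h", "45m"])
duration_parts = split_duration(sample)
print(duration_parts)
```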

In [None]:
data.head()

In [None]:
# Listing the unique airline names to check that duplicate spellings are gone
unique_vals = data['airline_name'].unique()
print(f"There are {len(unique_vals)} unique elements in this feature : {unique_vals} ")

### It's Exploratory Data Analysis o'clock

- Highest and lowest fare (most expensive and cheapest airline to fly with)
- Longest and shortest flight
- Cheapest
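One way to approach the fare questions in this list is a per-airline summary with `groupby`. The sketch below runs on a small hypothetical frame that only borrows the column names `airline_name` and `flight_fare` from the cleaned data; the rows themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the real dataset
fares = pd.DataFrame({
    "airline_name": ["AirAsia", "AirAsia", "Vistara", "Vistara"],
    "flight_fare": [3379, 4100, 11129, 17282],
})

# Minimum, maximum and mean fare for each airline
fare_summary = fares.groupby("airline_name")["flight_fare"].agg(["min", "max", "mean"])

# Airlines with the lowest and highest average fare
cheapest_airline = fare_summary["mean"].idxmin()
priciest_airline = fare_summary["mean"].idxmax()
print(fare_summary)
print(cheapest_airline, priciest_airline)
```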

In [None]:
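# Correlation is only defined for numeric columns, so the text features are excluded
sns.heatmap(data.select_dtypes(include='number').corr(), cmap="YlGnBu")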
plt.show()