---
# Flight Price Prediction
### Fitting model using Random Forest
---

## 1. Introduction 
---

Flight price prediction is a challenging and important problem in the travel industry: accurate price predictions help travel agencies, airlines and customers make informed decisions about flight bookings.

### 1.1 Problem Statement

The problem is to develop a model that can accurately predict the price of flights for different routes and dates. This is challenging because flight prices are affected by many factors, including airline, route, departure time, seasonality and competition.

### 1.2 Goal

The goal is to build a machine learning model that can take these factors into account and accurately predict the price of flights for new routes and dates. This model can be used by travel agencies, airlines and customers to make informed decisions about flight bookings.

### 1.3 Approach

To solve this problem, we use flight data that includes historical prices, flight routes, airline information and other relevant factors. This data is used to train machine learning models that predict flight prices from the input features, and to validate those models to check their accuracy and effectiveness. Here, we will be using **Random Forest**.
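
The train-and-validate workflow described above can be sketched as follows. This is a minimal illustration on synthetic data, not the model fitted later in this notebook: the two features (`n_stops`, `duration_min`), the price formula, the 80/20 split and the hyperparameters are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): two numeric features and a noisy price.
rng = np.random.default_rng(0)
n_stops = rng.integers(0, 3, size=500)         # 0, 1 or 2 stops
duration_min = rng.uniform(60, 600, size=500)  # flight duration in minutes
price = 3000 + 2500 * n_stops + 5 * duration_min + rng.normal(0, 300, size=500)

X = np.column_stack([n_stops, duration_min])

# Hold out 20% of the rows so the model is validated on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# Fit a Random Forest regressor on the training split.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score the fitted model on the held-out split.
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on held-out data: {test_r2:.3f}")
```

Scoring on a held-out split rather than on the training rows is what makes the reported figure an estimate of how the model behaves on new flights.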


### 1.4 Use case

This problem is an important one for the travel industry and has many real-world applications such as dynamic pricing, demand forecasting and revenue management.

## 2. Importing Essential Libraries
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 3. Importing and Exploring Dataset
---
1. The dataset is an Excel file.
2. It is loaded using the pandas `read_excel()` function.
3. The dataset is checked for completeness to surface hidden issues such as null values in a column or a row.
4. Any null values found need to be addressed, either with the imputation methods in sklearn or by filling NaN values with the mean, median or mode using the `fillna()` method.
5. Describing the dataset gives summary statistics that offer insight into its features and distributions.
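
Steps 3 to 5 can be sketched on a small toy frame. The values below are hypothetical stand-ins for the real Excel data, and filling with the mode (categorical) and median (numeric) is just one of the options listed in step 4.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the real Excel data.
toy = pd.DataFrame({
    "Airline": ["IndiGo", "Air India", "IndiGo", "Jet Airways"],
    "Total_Stops": ["non-stop", None, "1 stop", "1 stop"],
    "Price": [3897.0, 7662.0, np.nan, 13882.0],
})

# Step 3: count missing values per column.
null_counts = toy.isnull().sum()

# Step 4: fill a categorical column with its mode and a numeric one with its median.
filled = toy.copy()
filled["Total_Stops"] = filled["Total_Stops"].fillna(filled["Total_Stops"].mode()[0])
filled["Price"] = filled["Price"].fillna(filled["Price"].median())

# Step 5: summary statistics for the numeric columns.
summary = filled.describe()

print(null_counts)
print(summary)
```

Working on a copy keeps the original frame available for comparison with the imputed one.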

In [2]:
train_data = pd.read_excel("Data_Train.xlsx")

### 3.1 Data Eyeballing

In [3]:
train_data.head()

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897 |
| 1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662 |
| 2 | Jet Airways | 9/06/2019 | Delhi | Cochin | DEL → LKO → BOM → COK | 09:25 | 04:25 10 Jun | 19h | 2 stops | No info | 13882 |
| 3 | IndiGo | 12/05/2019 | Kolkata | Banglore | CCU → NAG → BLR | 18:05 | 23:30 | 5h 25m | 1 stop | No info | 6218 |
| 4 | IndiGo | 01/03/2019 | Banglore | New Delhi | BLR → NAG → DEL | 16:50 | 21:35 | 4h 45m | 1 stop | No info | 13302 |


In [4]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10683 entries, 0 to 10682
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          10683 non-null  object
 1   Date_of_Journey  10683 non-null  object
 2   Source           10683 non-null  object
 3   Destination      10683 non-null  object
 4   Route            10682 non-null  object
 5   Dep_Time         10683 non-null  object
 6   Arrival_Time     10683 non-null  object
 7   Duration         10683 non-null  object
 8   Total_Stops      10682 non-null  object
 9   Additional_Info  10683 non-null  object
 10  Price            10683 non-null  int64 
dtypes: int64(1), object(10)
memory usage: 918.2+ KB


In [5]:
# checking frequency of duration
train_data["Duration"].value_counts()

2h 50m     550
1h 30m     386
2h 45m     337
2h 55m     337
2h 35m     329
          ... 
31h 30m      1
30h 25m      1
42h 5m       1
4h 10m       1
47h 40m      1
Name: Duration, Length: 368, dtype: int64

In [6]:
# checking for null values
train_data.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

### 3.2 Data Description

This dataset contains flight details: Airline, Date_of_Journey, Source, Destination, Route, Dep_Time, Arrival_Time, Duration, Total_Stops, Additional_Info, and Price. It has 10683 entries and 11 columns. Every column is of type object except 'Price', which is of type int64.

The 'Duration' column shows the duration of the flight in hours and minutes, and has 368 unique values.
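
Because 'Duration' is stored as text, it cannot be used numerically as-is. The sketch below shows one possible way to convert such strings to total minutes. The helper `duration_to_minutes` is a hypothetical name introduced here, and the sample strings are illustrative; it assumes every token ends in `h` or `m`.

```python
import pandas as pd

def duration_to_minutes(text: str) -> int:
    """Convert strings such as '2h 50m', '19h' or '5m' to total minutes."""
    total = 0
    for part in text.split():
        if part.endswith("h"):
            total += int(part[:-1]) * 60  # hours component
        elif part.endswith("m"):
            total += int(part[:-1])       # minutes component
    return total

# Illustrative samples covering hours-and-minutes, hours-only and minutes-only.
durations = pd.Series(["2h 50m", "19h", "7h 25m", "5m"])
minutes = durations.map(duration_to_minutes)
print(minutes.tolist())  # [170, 1140, 445, 5]
```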

The dataset has 2 missing values, one in the 'Route' column and the other in the 'Total_Stops' column.
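
Before deciding how to handle missing values, it can help to look at the affected rows, for example to see whether the nulls fall in the same row. A minimal sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame (hypothetical values) with one incomplete row.
toy = pd.DataFrame({
    "Route": ["BLR → DEL", None, "CCU → NAG → BLR"],
    "Total_Stops": ["non-stop", None, "1 stop"],
    "Price": [3897, 7480, 6218],
})

# Select every row that has at least one null value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# Dropping removes whole rows, so compare the row counts before and after.
cleaned = toy.dropna()

print(rows_with_nulls)
print(len(toy), "->", len(cleaned))
```

When only a handful of rows out of thousands are incomplete, dropping them loses very little data, which is why it is often preferred over imputation in that situation.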

In [9]:
# dropping rows that contain null values
train_data.dropna(inplace=True)

In [10]:
# checking if null values are dropped
train_data.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              0
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        0
Additional_Info    0
Price              0
dtype: int64

We are now ready to move on to Exploratory Data Analysis (EDA).

## 4. Exploratory Data Analysis (EDA)
---