# **Flight Price Prediction Notebook**

# **Outline**
1. Introduction
2. Data Preparation
    - 2.1 Importing necessary libraries
    - 2.2 Load the data
    - 2.3 Check for data cleanliness
3. Exploratory Data Analysis
    - 3.1 Understanding the data
    - 3.2 Visualizations
4. Data Preprocessing
5. Modeling


# **1. Introduction**

> We are given a dataset of flight prices obtained from Kaggle. The main objective of this notebook is to extract meaningful information from the data and, based on that, build a model that predicts the price of a flight.

## Data Parameters

- `airline`: the airline brand
- `flight`: the flight id
- `source_city`: the departure location of the flight
- `departure_time`: the departure time of the flight (period of the day, not the exact time)
- `stops`: the number of stops between the source and destination cities
- `arrival_time`: the period of the day in which the flight arrives
- `destination_city`: the destination of the flight
- `class`: seat class, either Economy or Business
- `duration`: the flight duration
- `days_left`: the number of days between the booking date and the flight date
- `price`: the price of the ticket

# **2. Data Preparation**

## 2.1. Importing necessary libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt  # plotting functions live in the pyplot submodule
import numpy as np
import seaborn as sb

## 2.2. Load the data

In [2]:
df = pd.read_csv("data/Clean_Dataset.csv")
df.head()

|  | Unnamed: 0 | airline | flight | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | SpiceJet | SG-8709 | Delhi | Evening | zero | Night | Mumbai | Economy | 2.17 | 1 | 5953 |
| 1 | 1 | SpiceJet | SG-8157 | Delhi | Early_Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5953 |
| 2 | 2 | AirAsia | I5-764 | Delhi | Early_Morning | zero | Early_Morning | Mumbai | Economy | 2.17 | 1 | 5956 |
| 3 | 3 | Vistara | UK-995 | Delhi | Morning | zero | Afternoon | Mumbai | Economy | 2.25 | 1 | 5955 |
| 4 | 4 | Vistara | UK-963 | Delhi | Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5955 |


In [4]:
df.columns

Index(['Unnamed: 0', 'airline', 'flight', 'source_city', 'departure_time',
       'stops', 'arrival_time', 'destination_city', 'class', 'duration',
       'days_left', 'price'],
      dtype='object')

## 2.3. Check for data cleanliness

The data contains one unnecessary column (`Unnamed: 0`), which only duplicates the row index, so let's remove it.

In [5]:
df = df.drop('Unnamed: 0', axis=1)
df.head()

|  | airline | flight | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SpiceJet | SG-8709 | Delhi | Evening | zero | Night | Mumbai | Economy | 2.17 | 1 | 5953 |
| 1 | SpiceJet | SG-8157 | Delhi | Early_Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5953 |
| 2 | AirAsia | I5-764 | Delhi | Early_Morning | zero | Early_Morning | Mumbai | Economy | 2.17 | 1 | 5956 |
| 3 | Vistara | UK-995 | Delhi | Morning | zero | Afternoon | Mumbai | Economy | 2.25 | 1 | 5955 |
| 4 | Vistara | UK-963 | Delhi | Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5955 |
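As an alternative to dropping the column after loading, the saved index can be consumed at read time. The sketch below is only an illustration under assumptions: it uses a small in-memory CSV (`sample_csv`, a hypothetical stand-in with a few of the real columns) rather than `data/Clean_Dataset.csv`, and it assumes the extra column is simply the row index written out when the file was saved.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: the header starts with an empty
# field, which pandas would otherwise load as a column named "Unnamed: 0".
sample_csv = io.StringIO(
    ",airline,flight,price\n"
    "0,SpiceJet,SG-8709,5953\n"
    "1,SpiceJet,SG-8157,5953\n"
    "2,AirAsia,I5-764,5956\n"
)

# index_col=0 uses the first CSV column as the row index instead of
# loading it as a regular data column.
sample_df = pd.read_csv(sample_csv, index_col=0)

print(sample_df.columns.tolist())
```

With this approach no separate `drop` call is needed, although dropping the column explicitly, as done above, gives the same set of columns.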


Check for null values:

In [6]:
df.isna().sum()

airline             0
flight              0
source_city         0
departure_time      0
stops               0
arrival_time        0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64

The result shows that no column contains null values, so no imputation or row removal is needed for missing data.
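Null counts cover only one aspect of cleanliness. As a minimal sketch of two further checks that could be run in the same way (duplicate rows and non-positive prices), the example below uses a small hypothetical frame, `sample_df`, built by hand rather than the real dataset; the values are illustrative and say nothing about what the actual data contains.

```python
import pandas as pd

# Hypothetical miniature frame with a few of the dataset's columns.
# The last two rows are deliberately identical to show duplicate detection.
sample_df = pd.DataFrame(
    {
        "airline": ["SpiceJet", "SpiceJet", "AirAsia", "AirAsia"],
        "stops": ["zero", "zero", "zero", "zero"],
        "duration": [2.17, 2.33, 2.17, 2.17],
        "price": [5953, 5953, 5956, 5956],
    }
)

# duplicated() marks every row that repeats an earlier row exactly.
n_duplicates = int(sample_df.duplicated().sum())

# A ticket price of zero or below would indicate a data-entry problem.
n_non_positive = int((sample_df["price"] <= 0).sum())

print(n_duplicates, n_non_positive)
print(sample_df.dtypes)
```

On the real `df`, the same calls would report whether any rows repeat and whether each column has the expected type.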

# **3. Exploratory Data Analysis**

## 3.1. Understanding the data