# Introduction

In this project I analyse a [dataset](https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction) derived from a flight booking website.

The following tasks will be performed:
1. Data cleaning using Python
2. Exploratory data analysis (EDA) using Python libraries (Pandas, Matplotlib)
3. Data visualisation using Matplotlib and Seaborn

## Importing Required Libraries

In [1]:
# import the required Python libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## Data Loading and Initial Inspection

In [2]:
# import the dataset using the first column as the index

df = pd.read_csv("C:\\Users\\Jana\\Desktop\\Flight Analysis\\flight_dataset.csv", index_col = 0)
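The effect of `index_col = 0` can be seen on a small in-memory sample. The sketch below is only an illustration: the three rows are copied from the preview of this dataset and reduced to a few columns, and the assumption is that the unnamed first column of the file is a plain row counter.

```python
import io

import pandas as pd

# Reduced sample (assumption: the unnamed first column is a row counter).
sample_csv = io.StringIO(
    ",airline,flight,price\n"
    "0,SpiceJet,SG-8709,5953\n"
    "1,SpiceJet,SG-8157,5953\n"
    "2,AirAsia,I5-764,5956\n"
)

# With index_col=0 the first column becomes the row index.
with_index = pd.read_csv(sample_csv, index_col=0)

# Rewind the buffer so the same text can be parsed a second time.
sample_csv.seek(0)

# Without index_col the first column is kept as a regular column.
without_index = pd.read_csv(sample_csv)

print(list(with_index.columns))     # ['airline', 'flight', 'price']
print(list(without_index.columns))  # ['Unnamed: 0', 'airline', 'flight', 'price']
```

Reading the counter as the index keeps it out of the feature columns, which is why `df.shape` later reports 11 columns rather than 12.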

In [3]:
# display first 5 rows

df.head()

Unnamed: 0,airline,flight,source_city,departure_time,stops,arrival_time,destination_city,class,duration,days_left,price
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955


In [4]:
# view the dataset's dimensionality (rows, columns)

df.shape

(300153, 11)

In [5]:
# get information about the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 300153 entries, 0 to 300152
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   airline           300153 non-null  object 
 1   flight            300153 non-null  object 
 2   source_city       300153 non-null  object 
 3   departure_time    300153 non-null  object 
 4   stops             300153 non-null  object 
 5   arrival_time      300153 non-null  object 
 6   destination_city  300153 non-null  object 
 7   class             300153 non-null  object 
 8   duration          300153 non-null  float64
 9   days_left         300153 non-null  int64  
 10  price             300153 non-null  int64  
dtypes: float64(1), int64(2), object(8)
memory usage: 27.5+ MB


### About the Dataset

The dataset contains the following attributes:

1. **Airline:** The name of the airline company.
2. **Flight Code:** The flight code, which consists of a two-character airline designator and a one- to four-digit number.
3. **Source City:** The city where the journey begins.
4. **Departure Period:** The departure time of the flight, grouped into time-of-day bins.
5. **Stops:** The number of stops between the source and destination cities.
6. **Arrival Period:** The arrival time of the flight, grouped into time-of-day bins.
7. **Destination City:** The city where the journey ends.
8. **Class:** The cabin or level of service of the seat.
9. **Duration:** The overall travel time between the two cities, in hours.
10. **Days Left:** The number of days between the booking date and the flight date.
11. **Price:** The ticket price; all prices are given in the same currency.
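The flight code format described above can be checked with a regular expression. This is a minimal sketch, not part of the original analysis: the pattern `FLIGHT_CODE_PATTERN` is a hypothetical reading of "two characters, a hyphen, one to four digits", and the last two codes in the sample are invented to show what a failure looks like.

```python
import pandas as pd

# Hypothetical pattern: two-character designator (uppercase letters or digits),
# a hyphen, then one to four digits.
FLIGHT_CODE_PATTERN = r"[A-Z0-9]{2}-\d{1,4}"

# The first three codes come from the dataset preview; the last two are
# made-up invalid examples.
codes = pd.Series(["SG-8709", "I5-764", "UK-995", "BAD", "UK-12345"])

# fullmatch requires the whole string to match, not just a prefix.
is_valid = codes.str.fullmatch(FLIGHT_CODE_PATTERN)

print(codes[~is_valid].tolist())  # ['BAD', 'UK-12345']
```

On the full dataset the same check would be applied to the flight code column; an empty result would indicate that every code follows the assumed pattern.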

## Data Cleaning

In [6]:
# rename some columns so that their names describe their contents

df.rename(columns={'flight': 'flight_code', 'departure_time': 'departure_period', 'arrival_time': 'arrival_period'}, inplace = True)
df

Unnamed: 0,airline,flight_code,source_city,departure_period,stops,arrival_period,destination_city,class,duration,days_left,price
0,SpiceJet,SG-8709,Delhi,Evening,zero,Night,Mumbai,Economy,2.17,1,5953
1,SpiceJet,SG-8157,Delhi,Early_Morning,zero,Morning,Mumbai,Economy,2.33,1,5953
2,AirAsia,I5-764,Delhi,Early_Morning,zero,Early_Morning,Mumbai,Economy,2.17,1,5956
3,Vistara,UK-995,Delhi,Morning,zero,Afternoon,Mumbai,Economy,2.25,1,5955
4,Vistara,UK-963,Delhi,Morning,zero,Morning,Mumbai,Economy,2.33,1,5955
...,...,...,...,...,...,...,...,...,...,...,...
300148,Vistara,UK-822,Chennai,Morning,one,Evening,Hyderabad,Business,10.08,49,69265
300149,Vistara,UK-826,Chennai,Afternoon,one,Night,Hyderabad,Business,10.42,49,77105
300150,Vistara,UK-832,Chennai,Early_Morning,one,Night,Hyderabad,Business,13.83,49,79099
300151,Vistara,UK-828,Chennai,Early_Morning,one,Evening,Hyderabad,Business,10.00,49,81585


In [7]:
# count missing values

print(df.isnull().sum())

airline             0
flight_code         0
source_city         0
departure_period    0
stops               0
arrival_period      0
destination_city    0
class               0
duration            0
days_left           0
price               0
dtype: int64


**This shows that there are no _null_ or missing values in the dataframe.**

In [8]:
# detect duplicate rows

has_duplicates = df.duplicated().any()
print(f"Does the dataframe have any duplicate rows?\n{has_duplicates}")

Does the dataframe have any duplicate rows?
False
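No duplicates were found here, so nothing has to be removed. For reference, the sketch below shows how duplicates could be counted and dropped if the check had returned `True`. The toy frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy data (illustrative only): the second row repeats the first.
toy = pd.DataFrame(
    {
        "airline": ["SpiceJet", "SpiceJet", "AirAsia"],
        "flight_code": ["SG-8709", "SG-8709", "I5-764"],
        "price": [5953, 5953, 5956],
    }
)

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence; the index is rebuilt afterwards.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))  # 1 2
```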


In [9]:
# remove leading and trailing whitespace from all string values
# (DataFrame.map requires pandas 2.1+; earlier versions use applymap)

df_stripped = df.map(lambda x: x.strip() if isinstance(x, str) else x)
print(df_stripped)

         airline flight_code source_city departure_period stops  \
0       SpiceJet     SG-8709       Delhi          Evening  zero   
1       SpiceJet     SG-8157       Delhi    Early_Morning  zero   
2        AirAsia      I5-764       Delhi    Early_Morning  zero   
3        Vistara      UK-995       Delhi          Morning  zero   
4        Vistara      UK-963       Delhi          Morning  zero   
...          ...         ...         ...              ...   ...   
300148   Vistara      UK-822     Chennai          Morning   one   
300149   Vistara      UK-826     Chennai        Afternoon   one   
300150   Vistara      UK-832     Chennai    Early_Morning   one   
300151   Vistara      UK-828     Chennai    Early_Morning   one   
300152   Vistara      UK-822     Chennai          Morning   one   

       arrival_period destination_city     class  duration  days_left  price  
0               Night           Mumbai   Economy      2.17          1   5953  
1             Morning           Mumba
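Printing the stripped frame does not show whether any value actually changed. One possible way to check this, sketched on made-up toy data, is to strip only the text columns with the vectorised `.str.strip()` and count the cells that differ from the original. The helper names `text_columns` and `changed_cells` are illustrative.

```python
import pandas as pd

# Toy data (illustrative only): two values carry stray spaces.
toy = pd.DataFrame(
    {
        "airline": [" SpiceJet", "Vistara ", "AirAsia"],
        "duration": [2.17, 2.25, 2.17],
    }
)

# Select the columns that hold strings; numeric columns are left alone.
text_columns = [c for c in toy.columns if pd.api.types.is_string_dtype(toy[c])]

# Strip whitespace column by column using the vectorised string accessor.
stripped = toy.copy()
stripped[text_columns] = stripped[text_columns].apply(lambda col: col.str.strip())

# Count how many cells were changed by stripping.
changed_cells = int((stripped[text_columns] != toy[text_columns]).sum().sum())

print(changed_cells)  # 2
```

A count of zero on the real data would mean the stripping step was a no-op and the original values were already clean.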

In [10]:
# save the cleaned data (the whitespace-stripped copy, without the index column)

df_stripped.to_csv('cleaned_data.csv', index = False)
print("Cleaned data saved to 'cleaned_data.csv'")

Cleaned data saved to 'cleaned_data.csv'
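A quick way to confirm that a saved file can be read back unchanged is a round-trip check. The following is a sketch under assumptions: it uses a small made-up frame and a temporary directory instead of the real `cleaned_data.csv`, so it can be run without touching the project files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned data (illustrative only).
cleaned = pd.DataFrame(
    {
        "airline": ["SpiceJet", "AirAsia"],
        "duration": [2.17, 2.17],
        "price": [5953, 5956],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_data.csv"

    # index=False leaves the row index out of the file.
    cleaned.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if columns, dtypes, or values differ.
pd.testing.assert_frame_equal(cleaned, reloaded)
print("Round trip OK:", reloaded.shape)
```

Because the file is written with `index = False`, it should be reloaded without `index_col = 0`; otherwise the first data column would be consumed as the index.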
