# Flight Price Prediction
* The goal of this project is to generate flight prices from the information in the `flights.csv` dataset. We will wrangle the dataset and then build a model that generates flight prices.

In [13]:
!pip install pandas



In [14]:
import numpy as np
import pandas as pd
import datetime

In [15]:
data = pd.read_csv('dataset/flights.csv')
data.head()



Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,...,408.0,-22.0,0,0,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,...,741.0,-9.0,0,0,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,...,811.0,5.0,0,0,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,...,756.0,-9.0,0,0,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,...,259.0,-21.0,0,0,,,,,,


In [16]:
data['YEAR'].unique()

array([2015], dtype=int64)

## Data Assessment
* Here we will check for null, duplicate and other invalid values
* We will also get a global feel of the dataset by getting the type of the different columns.

In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5819079 entries, 0 to 5819078
Data columns (total 31 columns):
 #   Column               Dtype  
---  ------               -----  
 0   YEAR                 int64  
 1   MONTH                int64  
 2   DAY                  int64  
 3   DAY_OF_WEEK          int64  
 4   AIRLINE              object 
 5   FLIGHT_NUMBER        int64  
 6   TAIL_NUMBER          object 
 7   ORIGIN_AIRPORT       object 
 8   DESTINATION_AIRPORT  object 
 9   SCHEDULED_DEPARTURE  int64  
 10  DEPARTURE_TIME       float64
 11  DEPARTURE_DELAY      float64
 12  TAXI_OUT             float64
 13  WHEELS_OFF           float64
 14  SCHEDULED_TIME       float64
 15  ELAPSED_TIME         float64
 16  AIR_TIME             float64
 17  DISTANCE             int64  
 18  WHEELS_ON            float64
 19  TAXI_IN              float64
 20  SCHEDULED_ARRIVAL    int64  
 21  ARRIVAL_TIME         float64
 22  ARRIVAL_DELAY        float64
 23  DIVERTED             int64  
 24

In [18]:
def check_data_completeness(dataframe):
    print("Number of NULL rows per column --> \n",dataframe.isna().sum())
    print("##########################################")
    print("Number of duplicate rows: -->", dataframe.duplicated().sum())

In [19]:
check_data_completeness(data)

Number of NULL rows per column --> 
 YEAR                         0
MONTH                        0
DAY                          0
DAY_OF_WEEK                  0
AIRLINE                      0
FLIGHT_NUMBER                0
TAIL_NUMBER              14721
ORIGIN_AIRPORT               0
DESTINATION_AIRPORT          0
SCHEDULED_DEPARTURE          0
DEPARTURE_TIME           86153
DEPARTURE_DELAY          86153
TAXI_OUT                 89047
WHEELS_OFF               89047
SCHEDULED_TIME               6
ELAPSED_TIME            105071
AIR_TIME                105071
DISTANCE                     0
WHEELS_ON                92513
TAXI_IN                  92513
SCHEDULED_ARRIVAL            0
ARRIVAL_TIME             92513
ARRIVAL_DELAY           105071
DIVERTED                     0
CANCELLED                    0
CANCELLATION_REASON    5729195
AIR_SYSTEM_DELAY       4755640
SECURITY_DELAY         4755640
AIRLINE_DELAY          4755640
LATE_AIRCRAFT_DELAY    4755640
WEATHER_DELAY          4755640
dt

<h5><font color='orange'>Observations</font></h5>

* Our assessment shows that the data contains only flights from 2015, so we do not need to truncate it.
* After assessing the dataset, we decided to drop the columns that are not needed to generate flight prices. We will keep the following columns:
1. Airline: airline identifier code (for example `AS` or `AA`)
2. Date: Date of flight
3. Origin_Airport, Destination_airport
4. Distance
5. Scheduled departure and scheduled arrival
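
The columns listed above could also be selected when the file is loaded, which reduces memory use on a dataset of this size. The following is a minimal sketch, not the approach used in this notebook: `KEEP_COLUMNS` and `load_needed_columns` are hypothetical names, the in-memory sample stands in for `dataset/flights.csv`, and reading the airport columns as strings is an assumption.

```python
import io
import pandas as pd

# Columns needed later in the notebook (names as they appear in flights.csv)
KEEP_COLUMNS = [
    "YEAR", "MONTH", "DAY", "AIRLINE",
    "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
    "SCHEDULED_DEPARTURE", "SCHEDULED_ARRIVAL", "DISTANCE",
]

def load_needed_columns(source):
    """Read only KEEP_COLUMNS from a CSV path or file-like object."""
    return pd.read_csv(
        source,
        usecols=KEEP_COLUMNS,
        # Assumption: airport codes are best treated as strings
        dtype={"ORIGIN_AIRPORT": str, "DESTINATION_AIRPORT": str},
    )

# Tiny in-memory stand-in built from the first two rows shown earlier
sample_csv = io.StringIO(
    "YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,"
    "SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,DISTANCE\n"
    "2015,1,1,4,AS,ANC,SEA,5,430,1448\n"
    "2015,1,1,4,AA,LAX,PBI,10,750,2330\n"
)
small = load_needed_columns(sample_csv)
print(small.columns.tolist())
```

With the real file, the call would be `load_needed_columns('dataset/flights.csv')`; columns outside `KEEP_COLUMNS`, such as `DAY_OF_WEEK`, are never loaded.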

## Data Cleaning
* To clean this dataset, we start by making a copy of the original data frame. All cleaning operations are performed on that copy.

In [20]:
# The data stores day, month and year in separate columns. We combine this date information into a single column.
df = data.copy()
df['DATE'] = pd.to_datetime(df[['YEAR', 'MONTH', 'DAY']])
df.DATE

0         2015-01-01
1         2015-01-01
2         2015-01-01
3         2015-01-01
4         2015-01-01
             ...    
5819074   2015-12-31
5819075   2015-12-31
5819076   2015-12-31
5819077   2015-12-31
5819078   2015-12-31
Name: DATE, Length: 5819079, dtype: datetime64[ns]

### Formatting the scheduled departure time and scheduled arrival time

* The times are stored as integers in HHMM form: the last two digits are the minutes and the leading digits are the hours (for example, `5` means 00:05 and `430` means 04:30). We will convert them to time and datetime values accordingly.

In [21]:
# Convert an integer HHMM value to a datetime.time object
def convert_to_hours(time_value):
    if time_value == 2400:
        time_value = 0
    time_str = f"{int(time_value):04d}"
    hours = int(time_str[:2])
    minutes = int(time_str[2:])
    return datetime.time(hours, minutes)


# Combine a date and a time into a new datetime object
def mix_date_time(date, time):
    return datetime.datetime.combine(date, time)

def flight_time_format(df, col):
    # Identify null time values
    null_mask = df[col].isnull()

    # if time is 2400 increment the date and set the time to midnight
    time_2400_mask = df[col] == 2400
    df.loc[time_2400_mask, 'DATE'] += pd.Timedelta(days=1)
    df.loc[time_2400_mask, col] = datetime.time(0, 0)

    # Format the other valid time values
    not_null_not_2400_mask = ~null_mask & ~time_2400_mask
    df.loc[not_null_not_2400_mask, col] = df.loc[not_null_not_2400_mask, col].apply(convert_to_hours)

    # combine date and time
    combined = df.apply(lambda row: mix_date_time(row['DATE'], row[col]) if not pd.isnull(row[col]) else np.nan, axis=1)
    return combined


In [22]:
df['SCHEDULED_DEPARTURE'] = flight_time_format(df, 'SCHEDULED_DEPARTURE')
df['SCHEDULED_ARRIVAL'] =  df['SCHEDULED_ARRIVAL'].apply(convert_to_hours)
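
The row-wise `apply` in `flight_time_format` can be slow on several million rows. As an alternative, here is a minimal vectorized sketch of the same HHMM rule, assuming the date column is already a datetime Series. The helper name `hhmm_to_datetime` is hypothetical and is not part of the notebook. A value of 2400 becomes 24 hours, which rolls over to midnight of the next day without a special case.

```python
import pandas as pd

def hhmm_to_datetime(dates, hhmm):
    """Combine a datetime Series with integer HHMM values.

    Missing times stay missing (NaT); 2400 maps to 00:00 on the next day.
    """
    hhmm = hhmm.astype("float64")   # floats so that missing values become NaN
    hours = hhmm // 100             # leading digits are the hour
    minutes = hhmm % 100            # last two digits are the minutes
    offset = pd.to_timedelta(hours * 60 + minutes, unit="m")
    return dates + offset

# Small example: 00:05, the 2400 edge case, an afternoon time, a missing value
dates = pd.Series(pd.to_datetime(["2015-01-01"] * 4))
times = pd.Series([5, 2400, 1430, None])
combined = hhmm_to_datetime(dates, times)
print(combined)
```

This sketch returns a new Series and leaves the `DATE` column untouched, so it can be called for the departure and arrival columns independently.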



### Removing all the unnecessary columns

In [23]:
df = df[["AIRLINE","SCHEDULED_DEPARTURE", "SCHEDULED_ARRIVAL", "ORIGIN_AIRPORT","DESTINATION_AIRPORT","DISTANCE"]].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5819079 entries, 0 to 5819078
Data columns (total 6 columns):
 #   Column               Dtype         
---  ------               -----         
 0   AIRLINE              object        
 1   SCHEDULED_DEPARTURE  datetime64[ns]
 2   SCHEDULED_ARRIVAL    object        
 3   ORIGIN_AIRPORT       object        
 4   DESTINATION_AIRPORT  object        
 5   DISTANCE             int64         
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 266.4+ MB


### The distance is in miles; for readability, we will convert it to km (1 mile = 1.60934 km)

In [24]:
df['DISTANCE'] *= 1.60934
df = df.rename(columns={"DISTANCE": "DISTANCE_KM"})


In [25]:
# Check the values of the new dataset
check_data_completeness(df)

Number of NULL rows per column --> 
 AIRLINE                0
SCHEDULED_DEPARTURE    0
SCHEDULED_ARRIVAL      0
ORIGIN_AIRPORT         0
DESTINATION_AIRPORT    0
DISTANCE_KM            0
dtype: int64
##########################################
Number of duplicate rows: --> 211


In [26]:
# Drop duplicates; drop_duplicates returns a new DataFrame, so assign the result
df = df.drop_duplicates()
df.shape

(5818868, 6)

In [27]:
df.head()

Unnamed: 0,AIRLINE,SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,ORIGIN_AIRPORT,DESTINATION_AIRPORT,DISTANCE_KM
0,AS,2015-01-01 00:05:00,04:30:00,ANC,SEA,2330.32432
1,AA,2015-01-01 00:10:00,07:50:00,LAX,PBI,3749.7622
2,US,2015-01-01 00:20:00,08:06:00,SFO,CLT,3695.04464
3,AA,2015-01-01 00:20:00,08:05:00,LAX,MIA,3769.07428
4,AS,2015-01-01 00:25:00,03:20:00,SEA,ANC,2330.32432


To calculate the cost of each flight, we will take the average price of the origin airport and multiply it by the price per km for the length of the flight. The price per km is the mean of the lowest and highest prices for domestic flights, 0.15 and 0.40, which gives 0.275. The flight price will also vary according to the time of year.
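
The distance part of this calculation can be sketched as follows. This is only an illustration under stated assumptions: the function name `estimate_price` and the `airport_factor` and `season_factor` parameters are hypothetical placeholders for the airport and time-of-year adjustments, which are not defined in this section, so both default to 1.0.

```python
# Rates quoted in the text: lowest and highest price per km for domestic flights
LOW_RATE_PER_KM = 0.15
HIGH_RATE_PER_KM = 0.40

# Mean of the two rates
PRICE_PER_KM = (LOW_RATE_PER_KM + HIGH_RATE_PER_KM) / 2

def estimate_price(distance_km, airport_factor=1.0, season_factor=1.0):
    """Distance-based price scaled by hypothetical airport and season factors."""
    return distance_km * PRICE_PER_KM * airport_factor * season_factor

# Example with the ANC to SEA distance from the table above
anc_sea_price = estimate_price(2330.32432)
print(round(anc_sea_price, 2))
```

Because the function uses plain arithmetic, it also works on a whole column, for example `estimate_price(df['DISTANCE_KM'])`.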