# FDS Mini Project


**WARNING: Before making any git commit to this notebook please clear all output in this notebook**

## 1. Cleaning the data

### Invalid Columns: 
- delete the unnamed column 'Unnamed: 0', which served as an index (an index already exists, so the column is a duplicate)
- delete the last column, 'Unnamed: 21' (contains only NaN values)

### NaN values:
- check the number and location of NaN values
- keep NaN values that are required in order not to lose data (for example: a cancelled flight will always have NaN values for DEP_TIME, ARR_TIME, ARR_DEL15, DEP_DEL15, as the flight did not happen)
- delete rows with NaN values that would hinder analysis and plotting later on (for example, flight timings that are simply missing without the flight having been cancelled)
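
The two cases above can be told apart with boolean masks. The following is a minimal sketch on a made-up toy frame (all values are invented for illustration; only the column names follow the dataset), assuming the CANCELLED flag is what explains a legitimate NaN:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'DEP_TIME':  [615.0, np.nan, 1230.0, 2359.0],
    'ARR_TIME':  [745.0, np.nan, np.nan, 130.0],
    'CANCELLED': [0.0,   1.0,    0.0,    0.0],
})

# Number of NaN values per column
nan_counts = toy.isna().sum()

# NaN timings explained by a cancellation are kept
cancelled = toy['CANCELLED'] == 1
# Timings that are missing although the flight took place are dropped
missing_timing = ~cancelled & toy[['DEP_TIME', 'ARR_TIME']].isna().any(axis=1)
cleaned = toy[~missing_timing]

print(nan_counts)
print(cleaned)
```

Here the cancelled flight (row 1) survives with its NaN values, while the flight with a missing arrival time (row 2) is removed.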

### Times conversion (Note: 00:00 timings all represent cancelled flights)
- observation: no flight leaves at 00:00; all *00:00 time values belong to flights that have been cancelled*
- converted DEP_TIME and ARR_TIME to 4-character strings of the format hhmm (an attempt to convert to date/time raised an error)
- added two extra columns, DEP_TIME_MINS and ARR_TIME_MINS, representing the departure and arrival time in minutes since midnight for easier calculations
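
The hhmm conversion can also be written with vectorised pandas string methods. This is a sketch on invented sample values, not the code used in the notebook cell below; it assumes the raw timings are floats such as `615.0` for 06:15:

```python
import pandas as pd

# Toy hhmm values stored as floats, as in the raw CSV (invented for illustration)
dep_time = pd.Series([5.0, 45.0, 615.0, 2359.0, None])

# NaN -> 0, then left-pad to a 4-character 'hhmm' string
hhmm = dep_time.fillna(0).astype(int).astype(str).str.zfill(4)

# Minutes since midnight = hh * 60 + mm
minutes = hhmm.str[:2].astype(int) * 60 + hhmm.str[2:].astype(int)

print(hhmm.tolist())
print(minutes.tolist())
```

With this encoding a missing timing becomes `'0000'` and 0 minutes, which is why 00:00 values stand for cancelled flights.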

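### Irrelevant columns (to this project) / duplicated data to be removed:
- Remove both OP_CARRIER_AIRLINE_ID and OP_CARRIER (same number of unique values, 17, as OP_UNIQUE_CARRIER)
- Remove ORIGIN_AIRPORT_SEQ_ID and ORIGIN_AIRPORT_ID
- Remove DEST_AIRPORT_SEQ_ID and DEST_AIRPORT_ID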
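
Equal counts of unique values suggest, but do not prove, that two columns carry the same information. A stricter check is to test whether the values map one-to-one. The sketch below uses a hypothetical helper `is_one_to_one` and an invented column `OTHER_CODE`; neither is part of the notebook or the dataset:

```python
import pandas as pd

def is_one_to_one(df, a, b):
    """Return True if each value of column a pairs with exactly one value of b, and vice versa."""
    pairs = df[[a, b]].drop_duplicates()
    return bool(pairs[a].is_unique and pairs[b].is_unique)

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'OP_UNIQUE_CARRIER': ['AA', 'AA', 'DL', 'UA'],
    'OP_CARRIER':        ['AA', 'AA', 'DL', 'UA'],
    'OTHER_CODE':        ['P',  'Q',  'Q',  'R'],   # hypothetical column
})

# Both pairs of columns have 3 unique values each ...
same_count = toy['OP_UNIQUE_CARRIER'].nunique() == toy['OTHER_CODE'].nunique()
# ... but only the first pair is truly redundant
redundant = is_one_to_one(toy, 'OP_UNIQUE_CARRIER', 'OP_CARRIER')
not_redundant = is_one_to_one(toy, 'OP_UNIQUE_CARRIER', 'OTHER_CODE')

print(same_count, redundant, not_redundant)
```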





In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#Importing sklearn functions
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.cluster import KMeans

In [None]:
#--------------------------------------- Load dataset ------------------------------------------#
flight_data_path = os.path.join(os.getcwd(), 'datasets', 'flight_jan_2019.csv.gz')
flight_data = pd.read_csv(flight_data_path, compression = 'gzip')

# Delete 'Unnamed: 0' and 'Unnamed: 21'
del flight_data['Unnamed: 0']
del flight_data['Unnamed: 21']
flight_data

#---------------------------------------- Check for 'NaN' values ------------------------------#

# for col in flight_data.columns: 
#    print(col, ' :',flight_data[col].isna().sum())
    
    # NA VALUES: TAIL_NUM  : 2543
    #            DEP_TIME  : 16352
    #            DEP_DEL15  : 16355
    #            ARR_TIME  : 17061
    #            ARR_DEL15  : 18022
    #            Unnamed: 21  : 583985

# Dealing with DEP_TIME and ARR_TIME NaN values
flight_data[np.isnan(flight_data.DEP_TIME)] # Observation: cancelled flights have NaN values for DEP_TIME, ARR_TIME, DEP_DEL15, ARR_DEL15
# NaN values therefore make sense in this case; rows with NaN values can be excluded when plotting by filtering:
#                       flight_data[~np.isnan(flight_data['DEP_TIME'])]['DEP_TIME'].isna().sum()

# Eliminate rows where DEP_TIME is registered but DEP_DEL15, ARR_TIME or ARR_DEL15 is NaN (timings simply missing)
# A single combined mask avoids chained boolean indexing with a misaligned mask
has_dep_time = flight_data['DEP_TIME'].notna()
missing_other = flight_data[['DEP_DEL15', 'ARR_TIME', 'ARR_DEL15']].isna().any(axis=1)
flight_data = flight_data[~(has_dep_time & missing_other)]

#--------------------------------------Modifying data types----------------------------------#
flight_data.dtypes
# CANCELLED/DIVERTED to integer value
flight_data['CANCELLED'] = flight_data['CANCELLED'].astype(int)
flight_data['DIVERTED'] = flight_data['DIVERTED'].astype(int)
flight_data.dtypes
flight_data
# Modifying timings date/time format
#flight_data['DEP_TIME'] = pd.to_datetime(flight_data['DEP_TIME'], format='%H%M').dt.time

# OBSERVATION: keeping in mind that timings are currently floats, the rows that end up with the value 0.0 are all NaN values, so no flight leaves at 00:00 (those are simply cancelled flights)
len(flight_data[(flight_data['DEP_TIME'] == 0.0) & (flight_data['CANCELLED'] == 1)]) - flight_data.loc[flight_data['DEP_TIME'] == 0.0, 'DEP_TIME'].isna().sum()

# Convert DEP_TIME and ARR_TIME to 4-character hhmm strings and add new columns DEP_TIME_MINS and ARR_TIME_MINS for easy calculations
def convert_minutes(x):
    # x is a 4-character 'hhmm' string; return minutes since midnight
    return int(x[:2]) * 60 + int(x[2:])

def fill_in(x):
    # Left-pad with zeros to the 4-character hhmm format; anything longer is treated as invalid
    if len(x) > 4:
        return '0000'
    return x.zfill(4)
    
flight_data['DEP_TIME'] = flight_data['DEP_TIME'].fillna(0)
flight_data['DEP_TIME'] = flight_data['DEP_TIME'].astype(int)
flight_data['DEP_TIME'] = flight_data['DEP_TIME'].astype(str)
flight_data['DEP_TIME'] = flight_data['DEP_TIME'].apply(fill_in)
flight_data['DEP_TIME_MINS'] = flight_data['DEP_TIME'].apply(convert_minutes)
flight_data['ARR_TIME'] = flight_data['ARR_TIME'].fillna(0)
flight_data['ARR_TIME'] = flight_data['ARR_TIME'].astype(int)
flight_data['ARR_TIME'] = flight_data['ARR_TIME'].astype(str)
flight_data['ARR_TIME'] = flight_data['ARR_TIME'].apply(fill_in)
flight_data['ARR_TIME_MINS'] = flight_data['ARR_TIME'].apply(convert_minutes)

#-------------------------------ATTEMPT AT CONVERTING TO DATE/TIME-----------------#
#def convert_time(x):
#    return datetime.datetime.strptime(x,'%H%M' )
    
#flight_data['DEP_TIME'] = flight_data['DEP_TIME'].apply(fill_in)
#flight_data['ARR_TIME'] = flight_data['ARR_TIME'].apply(fill_in)
#flight_data['DEP_TIME'] = flight_data['DEP_TIME'].apply(convert_time)
#flight_data['DEP_TIME'] = flight_data['DEP_TIME'].apply(check)
#flight_data['DEP_TIME'] = pd.to_datetime(flight_data['DEP_TIME'], format=)


#------------------------------------Eliminating extra columns------------------------------#

flight_data['OP_UNIQUE_CARRIER'].nunique()  # 17
flight_data['OP_CARRIER_AIRLINE_ID'].nunique()  # 17
flight_data['OP_CARRIER'].nunique() # 17
# Remove both OP_CARRIER_AIRLINE_ID and OP_CARRIER
del flight_data['OP_CARRIER_AIRLINE_ID']
del flight_data['OP_CARRIER']

flight_data['TAIL_NUM'].nunique() # 5445
flight_data['ORIGIN_AIRPORT_ID'].nunique() # 346
flight_data['ORIGIN_AIRPORT_SEQ_ID'].nunique() # 346
# Remove ORIGIN_AIRPORT_SEQ_ID
del flight_data['ORIGIN_AIRPORT_SEQ_ID']

flight_data['DEST_AIRPORT_ID'].nunique() # 346
flight_data['DEST_AIRPORT_SEQ_ID'].nunique() # 346
# Remove DEST_AIRPORT_SEQ_ID
del flight_data['DEST_AIRPORT_SEQ_ID']

del flight_data['ORIGIN_AIRPORT_ID']
del flight_data['DEST_AIRPORT_ID']

flight_data.head()


## 2. Data Analysis Preparation / Pre-processing

### Data selection:

* As we only need reliable data from flights that were not cancelled, `normal_flight` is filtered from the original dataset
* Combining or processing some of the columns makes the data more concise

In [None]:
normal_flight = flight_data[flight_data['CANCELLED'] == 0].drop(columns=['CANCELLED','DEP_TIME','DEP_TIME_BLK','ARR_TIME','DIVERTED','TAIL_NUM'])
# Vectorised column operations instead of a row-wise apply
normal_flight['FLIGHT_NUM'] = normal_flight['OP_UNIQUE_CARRIER'] + normal_flight['OP_CARRIER_FL_NUM'].astype(str)
# Note: the difference is negative when the arrival clock time is earlier than the departure (e.g. overnight flights)
normal_flight['TRAVEL_TIME'] = normal_flight['ARR_TIME_MINS'] - normal_flight['DEP_TIME_MINS']
normal_flight.drop(columns=['OP_CARRIER_FL_NUM','ARR_TIME_MINS'],inplace=True)

In [None]:
normal_flight.head()

### Data Transformation

* Transforming the categorical data into the corresponding delay rates (easier to observe)
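
As an illustration of this encoding, the sketch below replaces a categorical column with the mean of the delay flag within each category. The data is invented; only the column names and the `DEP_DR_` prefix follow the notebook:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'ORIGIN':    ['JFK', 'JFK', 'LAX', 'LAX', 'LAX', 'SFO'],
    'DEP_DEL15': [1.0,   0.0,   0.0,   0.0,   1.0,   0.0],
})

# Delay rate per origin, broadcast back to every row of that origin
toy['DEP_DR_ORIGIN'] = toy.groupby('ORIGIN')['DEP_DEL15'].transform('mean')

print(toy)
```

Because the rate is derived from the target column, computing it on the training split only would avoid leaking validation and test labels into the features. That is a design choice to consider, not something the notebook does.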

In [None]:
# group by each column and calculate delay rate (DEP_DR and ARR_DR) of each attribute
cols = ['DAY_OF_MONTH','DAY_OF_WEEK','OP_UNIQUE_CARRIER','FLIGHT_NUM','ORIGIN','DEST']
for col in cols:
    dep_name, arr_name = 'DEP_DR_'+col, 'ARR_DR_'+col
    stat = normal_flight[[col, 'DEP_DEL15', 'ARR_DEL15']].groupby(col).transform('mean')
    normal_flight[dep_name] = stat['DEP_DEL15']
    normal_flight[arr_name] = stat['ARR_DEL15']
normal_flight.drop(columns=cols,inplace=True)
    
    

In [None]:
normal_flight.head()

In [None]:
dep_ontime_cnt, dep_delay_cnt = np.sum(normal_flight['DEP_DEL15'] == 0.0), np.sum(normal_flight['DEP_DEL15'] == 1.0)
arr_ontime_cnt, arr_delay_cnt = np.sum(normal_flight['ARR_DEL15'] == 0.0), np.sum(normal_flight['ARR_DEL15'] == 1.0)
# plt.title('Delay Statistics')
stat_df = pd.DataFrame({'Type': ['Departure','Departure','Arrival','Arrival'], 'Status': ['Ontime','Delayed','Ontime','Delayed'], 'Flight Count': [dep_ontime_cnt, dep_delay_cnt, arr_ontime_cnt, arr_delay_cnt]})
plt.ylim((0,500000))
plt.title('Delay Statistics Jan 2019')
sns.barplot(data=stat_df, x='Type', y='Flight Count', hue='Status')

## 3. PCA Analysis

### Training preparation

* Cancelled flights are removed from the original dataset as they are not relevant to delay prediction
* The dataset is split into training data (60%), validation data (20%) and test data (20%)
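
The 60/20/20 proportions come from two successive splits: 80% is kept first, then 75% of that 80% (0.8 × 0.75 = 0.6) becomes the training set. A minimal sketch on dummy data, using the same `train_size` values and `random_state` as the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 dummy samples
y = np.arange(100) % 2              # dummy binary labels

# First split: 80% train + validation, 20% test
X_rest, X_test, y_rest, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
# Second split: 75% of the remaining 80% -> 60% train, 20% validation
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, train_size=0.75, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Passing features and labels to the same `train_test_split` call keeps the rows aligned without relying on repeated calls with an identical seed.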


### Implementation of PCA

* Choose n = 6 components; PCA performs the dimensionality reduction
* (This reduces the computational overhead of the algorithm while retaining most of the variance in the data: about 80%)
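
Instead of fixing the number of components by hand, scikit-learn's `PCA` can also pick the smallest number of components that reaches a target fraction of explained variance. The sketch below runs on synthetic data (invented for illustration) and is an alternative to, not a description of, the notebook's `n_components=6` choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 15 correlated features built from 4 latent factors plus noise
latent = rng.normal(size=(500, 4))
X = latent @ rng.normal(size=(4, 15)) + 0.1 * rng.normal(size=(500, 15))

X_std = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA for the smallest number of components whose
# cumulative explained variance exceeds that fraction (requires the full SVD solver)
pca = PCA(n_components=0.8, svd_solver='full').fit(X_std)
retained = pca.explained_variance_ratio_.sum()

print(pca.n_components_, retained)
```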

In [None]:
# split up train, validation and test data for normal flights
all_dep = normal_flight['DEP_DEL15']
all_arr = normal_flight['ARR_DEL15']
all_data = normal_flight.drop(columns=['DEP_DEL15','ARR_DEL15'])
all_data = StandardScaler().fit_transform(all_data)
pca = PCA(n_components=6).fit(all_data)
all_data = pca.transform(all_data)
# Split features and both targets in one call so the rows stay aligned: 80% / 20%, then 75% / 25% of the 80%
train_data, test_data, train_dep, test_dep, train_arr, test_arr = train_test_split(all_data, all_dep, all_arr, train_size=0.8, random_state=42)
train_data, val_data, train_dep, val_dep, train_arr, val_arr = train_test_split(train_data, train_dep, train_arr, train_size=0.75, random_state=42)
print('Fraction of variance retained by PCA:', np.sum(pca.explained_variance_ratio_))


In [None]:
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Ontime Departure Flights')
sns.scatterplot(x=train_data[train_dep == 0.0][:,0], y=train_data[train_dep == 0.0][:,1], color='b')

In [None]:
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Delayed Departure Flights')
sns.scatterplot(x=train_data[train_dep == 1.0][:,0], y=train_data[train_dep == 1.0][:,1], color='red')

## 4. Flight Delay Prediction

### Implementation of KNN:

* Compare the results for different values of k and choose the best one by observation.
* We use false positives and false negatives to assess the correctness of the result

* False Positive: the prediction is True (delayed), but the truth is False (on time)
* False Negative: the prediction is False (on time), but the truth is True (delayed)
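
The definitions above can be checked on a small hand-made example. Note that the cells below divide the counts by the total number of samples, which gives the share of all predictions that are false positives or false negatives; the conventional false positive and false negative rates divide by the number of actual negatives and actual positives instead. The labels here are invented for illustration:

```python
import numpy as np

truth      = np.array([0, 0, 0, 0, 1, 1, 1, 0])   # 1 = delayed
prediction = np.array([0, 1, 0, 0, 1, 0, 0, 0])

fp = np.sum((prediction == 1) & (truth == 0))   # predicted delayed, was on time
fn = np.sum((prediction == 0) & (truth == 1))   # predicted on time, was delayed
tp = np.sum((prediction == 1) & (truth == 1))
tn = np.sum((prediction == 0) & (truth == 0))

# Shares of all samples, as computed in the notebook
fp_share = fp / len(truth)
fn_share = fn / len(truth)

# Conventional rates, normalised by the actual class sizes
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)

print(fp_share, fn_share, fpr, fnr)
```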


In [None]:
# Calculate the accuracy of prediction against validation data given k-value
def test_accuracy(k, mode='DEP'):
    print('Running KNN with k =',k)
    train_target = train_dep if mode == 'DEP' else train_arr
    val_target = val_dep if mode == 'DEP' else val_arr
    knn = KNeighborsClassifier(n_neighbors=k, weights='distance', n_jobs=-1).fit(train_data, train_target)
    prediction = knn.predict(val_data)
    # false positive, prediction > target
    fp = np.sum(prediction > val_target) / len(val_data)
    # false negative, prediction < target
    fn = np.sum(prediction < val_target) / len(val_data)
    return [k,fp,fn]

In [None]:
run_result = np.array([test_accuracy(k, mode='DEP') for k in range(1,50,4)])
plt.plot(run_result[:,0], run_result[:,1], 'r') # False positive
plt.plot(run_result[:,0], run_result[:,2], 'b') # False negative
plt.show()

In [None]:
dep_knn = KNeighborsClassifier(n_neighbors=9, weights='distance', n_jobs=-1).fit(train_data, train_dep)
dep_prediction = dep_knn.predict(test_data)
dep_fp = np.sum(dep_prediction > test_dep) / len(test_data)
dep_fn = np.sum(dep_prediction < test_dep) / len(test_data)
dep_fp, dep_fn # (false positive rate, false negative rate) for departure delay


In [None]:
run_result = np.array([test_accuracy(k, mode='ARR') for k in range(1,50,4)])
plt.plot(run_result[:,0], run_result[:,1], 'r') # False positive
plt.plot(run_result[:,0], run_result[:,2], 'b') # False negative
plt.show()

In [None]:
arr_knn = KNeighborsClassifier(n_neighbors=9, weights='distance', n_jobs=-1).fit(train_data, train_arr)
arr_prediction = arr_knn.predict(test_data)
arr_fp = np.sum(arr_prediction > test_arr) / len(test_data)
arr_fn = np.sum(arr_prediction < test_arr) / len(test_data)
arr_fp, arr_fn # (false positive rate, false negative rate) for arrival delay

## 5. Analysis

### Dataset URL:
* https://www.kaggle.com/divyansh22/flight-delay-prediction
* The dataset contains flight details, which we can use to predict whether a flight will be delayed.


### QUESTIONS:

* Flight delays are annoying and usually cause a series of scheduling conflicts. We therefore ask: if we could predict whether a flight will be delayed, could we plan our schedule in advance and use our time better?


### TOOL:

* It is a prediction problem: we are trying to predict whether or not a flight will be delayed.
* We use the distance between the origin and destination, the total travel time in minutes, the departure time, etc. to predict whether a flight will be delayed.

* We chose these features because they are related to delay (e.g. the distance from one city to another and the travel time can serve as a reference for predicting whether the flight will be delayed)
* We used standardised data, which puts all features on a comparable scale and can make the prediction more accurate.
* As described in section 4 (KNN), false positives and false negatives are used to evaluate the result.
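
Standardisation matters for KNN because the algorithm relies on distances between samples. The sketch below, on three invented rows with a distance in miles and a delay rate between 0 and 1, shows how the large-scale feature dominates the raw Euclidean distance and how scaling evens this out:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: DISTANCE (miles), a delay rate in [0, 1]; values invented for illustration
X = np.array([[2500.0, 0.10],
              [2450.0, 0.90],
              [ 300.0, 0.12]])

# Euclidean distance from row 0 to rows 1 and 2 on the raw features
raw = np.linalg.norm(X[1:] - X[0], axis=1)

# The same distances after standardising each column to zero mean and unit variance
Z = StandardScaler().fit_transform(X)
scaled = np.linalg.norm(Z[1:] - Z[0], axis=1)

print(raw)
print(scaled)
```

On the raw data, row 1 looks far closer to row 0 than row 2 does, purely because of the mileage column; after scaling, the large difference in delay rate counts just as much.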


### ANALYSIS & FINDINGS:

* Even though the KNN model gives a fairly good prediction, it still has some problems. From the original data, we know that the number of flights that were not delayed is much greater than the number of flights that were delayed, which is one reason why the share of false positives is smaller than that of false negatives.
* The graphs can be seen at the bottom of section 4 (Flight Delay Prediction)
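
To see why class imbalance pushes the errors towards false negatives, consider a trivial classifier that always predicts "on time". The class counts below are invented for illustration and are not taken from the dataset:

```python
import numpy as np

# Invented class balance: 85 on-time flights (0) and 15 delayed flights (1)
labels = np.array([0] * 85 + [1] * 15)

# Trivial baseline: always predict the majority class (on time)
majority_pred = np.zeros_like(labels)

accuracy = np.mean(majority_pred == labels)
fp_share = np.mean((majority_pred == 1) & (labels == 0))
fn_share = np.mean((majority_pred == 0) & (labels == 1))

print(accuracy, fp_share, fn_share)
```

Such a baseline reaches high accuracy with no false positives at all, yet misses every delayed flight. A model trained on imbalanced data tends to lean the same way, so its accuracy should be compared against this baseline.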

### FUTURE DIRECTIONS:

* There are still many events that might happen before the departure of a flight and have not been considered (the number and weight of luggage, missing passengers, etc.).
