# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the arrival delay of flights.
- **(Stretch) Multiclass Classification**: If the plane is delayed, we will predict what type of delay it is.
- **(Stretch) Binary Classification**: The goal is to predict if the flight will be cancelled.

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use and which we don't. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it: in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the flight we are predicting.

Likewise, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors, because they are only known after the flight. However, we can create various transformations from their earlier values.
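The idea of using only earlier delays can be sketched as follows. This is a minimal illustration on made-up data, not the method used in this notebook: the frame `history` and the column `carrier_mean_delay_before` are hypothetical names. For each carrier and day it computes the mean ARR_DELAY over strictly earlier days, so a flight never sees its own delay. For the one-week-ahead evaluation the cutoff would have to move back further still.

```python
import pandas as pd

# Toy history with one row per flight (values are made up).
history = pd.DataFrame({
    "fl_date": pd.to_datetime(
        ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-03"]),
    "mkt_unique_carrier": ["AA", "AA", "AA", "AA", "DL"],
    "arr_delay": [10.0, 20.0, 30.0, 0.0, 5.0],
})

# 1) Collapse to one row per carrier and day.
daily = (history.groupby(["mkt_unique_carrier", "fl_date"])["arr_delay"]
                .agg(["sum", "count"])
                .reset_index()
                .sort_values(["mkt_unique_carrier", "fl_date"]))

# 2) Running totals per carrier, minus the current day, give the totals
#    over strictly earlier days only.
grouped = daily.groupby("mkt_unique_carrier")
prev_sum = grouped["sum"].cumsum() - daily["sum"]
prev_count = grouped["count"].cumsum() - daily["count"]

# NaN where a carrier has no earlier history yet.
daily["carrier_mean_delay_before"] = prev_sum / prev_count
print(daily)
```

The resulting `daily` frame could then be merged back onto the flights by carrier and date.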

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

## Feature Engineering

Feature engineering will play a crucial role in this problem. We have only a few attributes, so we need to create features that carry some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
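As a small illustration of the "statistics" item above, the sketch below computes several descriptive statistics of past delays per origin airport in one pass. The frame `past` and the `origin_delay_` prefix are hypothetical, and the numbers are made up.

```python
import pandas as pd

# Toy table of past flights (values are made up).
past = pd.DataFrame({
    "origin_airport_id": [1, 1, 1, 2, 2],
    "arr_delay": [0.0, 10.0, 20.0, 5.0, 15.0],
})

# One row per airport with several statistics of the historical delay.
delay_stats = (past.groupby("origin_airport_id")["arr_delay"]
                   .agg(["mean", "median", "std", "min", "max"])
                   .add_prefix("origin_delay_")
                   .reset_index())
print(delay_stats)
```

A table like `delay_stats` can be left-merged onto the flights on `origin_airport_id`, in the same way the per-airport means are merged later in this notebook.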

In [6]:
import pandas as pd
import numpy as np

In [7]:
flights = pd.read_csv('flights_big.csv')

In [8]:
flights_test = pd.read_csv('flights_test.csv')

* dropping massive null columns

In [9]:
#flights.isnull().sum()

In [10]:
flights_test.columns

Index(['fl_date', 'mkt_unique_carrier', 'branded_code_share', 'mkt_carrier',
       'mkt_carrier_fl_num', 'op_unique_carrier', 'tail_num',
       'op_carrier_fl_num', 'origin_airport_id', 'origin', 'origin_city_name',
       'dest_airport_id', 'dest', 'dest_city_name', 'crs_dep_time',
       'crs_arr_time', 'dup', 'crs_elapsed_time', 'flights', 'distance'],
      dtype='object')

* dropping redundant columns and columns that carry no useful information

In [11]:
flights_test1= flights_test.drop(['mkt_carrier', 'op_unique_carrier', 'flights', 
                          'tail_num', 'branded_code_share', 'op_carrier_fl_num'], axis = 1)

* adding month, day of the week, day of the month

In [12]:
flights_test1['fl_date'] = pd.to_datetime(flights_test1['fl_date'], errors='coerce')
flights_test1['month'] = flights_test1['fl_date'].dt.month
flights_test1['day_of_week'] = flights_test1['fl_date'].dt.dayofweek
flights_test1['day_of_month'] = flights_test1['fl_date'].dt.day
flights_test1['year'] = flights_test1['fl_date'].dt.year

* splitting origin_city_name and dest_city_name into the city name and the short suffix after the comma

In [13]:
flights_test1[['origin_city_name_only', 'origin_city_name_short']] = flights_test1['origin_city_name'].str.split(',', expand = True)

In [14]:
flights_test1[['dest_city_name_only', 'dest_city_name_short']] = flights_test1['dest_city_name'].str.split(',', expand = True)

In [15]:
#dropping short hand version of city names
flights_test1 = flights_test1.drop(['origin_city_name_short', 'dest_city_name_short', 'origin_city_name', 'dest_city_name', 'dest', 'origin'], axis = 1)

In [16]:
flights_test1['origin_city_name_only'] = flights_test1['origin_city_name_only'].str.strip().str.lower()
flights_test1['dest_city_name_only'] = flights_test1['dest_city_name_only'].str.strip().str.lower()

* make hour departure and arrival feature

In [17]:
# crs_dep_time is HHMM stored as a number (e.g. 930 -> 9, 1530 -> 15), so integer-divide by 100
flights_test1['hour_departure'] = flights_test1['crs_dep_time'].astype(int) // 100

In [18]:
# crs_arr_time is HHMM stored as a number (e.g. 930 -> 9, 1530 -> 15), so integer-divide by 100
flights_test1['hour_arrival'] = flights_test1['crs_arr_time'].astype(int) // 100

* categorize flights as long, medium or short by air time (merged onto the test flights by mkt_carrier_fl_num further below)

In [19]:
# Bucketing flights by air_time (minutes); a NaN air_time matches no branch and returns None
def flight_duration(x):
    if x <= 180:
        return 'Short'
    elif x < 360:
        return 'Medium'
    elif x >= 360:
        return 'Long'

flights['flight_duration_type'] = flights['air_time'].apply(flight_duration)
flights['flight_duration_type'].value_counts()

Short     424766
Medium     63415
Long        1840
Name: flight_duration_type, dtype: int64
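The same bucketing can be sketched with `pd.cut`, which keeps missing values as NaN instead of returning `None`. This is an alternative shown on made-up values, not what the notebook runs; it assumes air time is recorded in whole minutes, so the left-closed bins `[181, 360)` and `[360, inf)` reproduce the thresholds above.

```python
import numpy as np
import pandas as pd

# Toy air times in whole minutes, including a missing value.
air_time = pd.Series([45, 180, 181, 359, 360, 600, np.nan])

# right=False makes every bin closed on the left:
# (-inf, 181) -> Short, [181, 360) -> Medium, [360, inf) -> Long
duration_type = pd.cut(
    air_time,
    bins=[-np.inf, 181, 360, np.inf],
    labels=["Short", "Medium", "Long"],
    right=False,
)
print(duration_type)
```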

In [20]:
# dropping air_time and actual_elapsed_time, which are only known after the flight
flights = flights.drop(['air_time', 'actual_elapsed_time'], axis = 1)

* taxi-in categories

In [21]:
def taxi_in_duration(x):
    # taxi_in is in minutes; note that the else branch also catches NaN values
    if x <= 15:
        return 'short_taxi_in'
    elif x < 50:
        return 'medium_taxi_in'
    else:
        return 'long_taxi_in'

flights['taxi_in_duration']=flights['taxi_in'].apply(lambda x: taxi_in_duration(x))
flights['taxi_in_duration'].value_counts()

short_taxi_in     455478
medium_taxi_in     34623
long_taxi_in        9899
Name: taxi_in_duration, dtype: int64

* taxi-out categories

In [22]:
def taxi_out_duration(x):
    # taxi_out is in minutes; note that the else branch also catches NaN values
    if x <= 25:
        return 'short_taxi_out'
    elif x < 70:
        return 'medium_taxi_out'
    else:
        return 'long_taxi_out'

flights['taxi_out_duration']=flights['taxi_out'].apply(lambda x: taxi_out_duration(x))
flights['taxi_out_duration'].value_counts()

short_taxi_out     423700
medium_taxi_out     65567
long_taxi_out       10733
Name: taxi_out_duration, dtype: int64

* making df for taxis in and out and flight duration type

In [23]:
taxi_time_df = flights[['mkt_carrier_fl_num', 'taxi_out_duration', 'taxi_in_duration', 'flight_duration_type']].drop_duplicates(subset = 'mkt_carrier_fl_num')

In [24]:
taxi_time_df.shape

(7010, 4)
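`drop_duplicates` keeps whichever row happens to come first for each flight number. One possible alternative, sketched here on made-up data, is to keep the most frequent category per flight number instead. The helper `most_common` and the frame `flights_demo` are hypothetical names.

```python
import pandas as pd

# Toy data: flight 100 has one long taxi-out first, then two short ones.
flights_demo = pd.DataFrame({
    "mkt_carrier_fl_num": [100, 100, 100, 200, 200],
    "taxi_out_duration": ["long_taxi_out", "short_taxi_out", "short_taxi_out",
                          "medium_taxi_out", "medium_taxi_out"],
})

def most_common(values: pd.Series):
    """Return the most frequent non-null value (ties resolved by sort order)."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else None

# One row per flight number holding its typical category.
typical = (flights_demo.groupby("mkt_carrier_fl_num")
                       .agg(most_common)
                       .reset_index())
print(typical)
```

Here the first-row approach would label flight 100 as `long_taxi_out`, while the mode gives `short_taxi_out`.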

* merging with test flights df

In [25]:
flights_test1 = pd.merge(taxi_time_df, flights_test1, on = 'mkt_carrier_fl_num', how = 'inner',validate = 'one_to_many') 

In [26]:
flights_test1 = flights_test1.drop('mkt_carrier_fl_num', axis = 1)
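An inner merge silently drops every test flight whose flight number never appears in the training data, and those flights then get no prediction. A minimal sketch of how this could be checked, using a left merge with `indicator=True` on made-up frames (`lookup` and `test_demo` are hypothetical names):

```python
import pandas as pd

# Toy lookup built from training data, and toy test flights.
lookup = pd.DataFrame({"mkt_carrier_fl_num": [100, 200],
                       "flight_duration_type": ["Short", "Long"]})
test_demo = pd.DataFrame({"mkt_carrier_fl_num": [100, 100, 300],
                          "distance": [500, 500, 2500]})

# A left merge keeps every test row; the _merge column records
# whether a match was found in the lookup.
checked = test_demo.merge(lookup, on="mkt_carrier_fl_num", how="left",
                          validate="many_to_one", indicator=True)
n_unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{n_unmatched} of {len(checked)} test rows have no match")
```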

* filling duration type nans with mode

In [27]:
# mode() returns a Series, so take its first entry to get the label itself
mode_duration_type = flights_test1['flight_duration_type'].mode()[0]

In [28]:
flights_test1['flight_duration_type'] = flights_test1['flight_duration_type'].fillna(mode_duration_type)

* arr_delay per unique_carrier

In [29]:
mean_arrdelay_carrier = flights.groupby('mkt_unique_carrier')['arr_delay'].mean()

In [30]:
mean_arrdelay_carrier.name = 'mean_arrdelay_carrier'

In [31]:
flights_test1 = pd.merge(flights_test1, mean_arrdelay_carrier, how = 'left', on = ['mkt_unique_carrier'])

* arrival delay per dest_airport id

In [32]:
mean_arrdelay_dest_air = flights.groupby('dest_airport_id')['arr_delay'].mean()

In [33]:
mean_arrdelay_dest_air.name = 'mean_arrdelay_dest_air'

In [34]:
flights_test1 = pd.merge(flights_test1, mean_arrdelay_dest_air, how = 'left', on = ['dest_airport_id'])

* arrival delay per origin_airport id

In [35]:
mean_arrdelay_origin_air = flights.groupby('origin_airport_id')['arr_delay'].mean()

In [36]:
mean_arrdelay_origin_air.name = 'mean_arrdelay_origin_air'

In [37]:
flights_test1 = pd.merge(flights_test1, mean_arrdelay_origin_air, how = 'left', on = ['origin_airport_id'])

* filling NaN in the per-airport mean delays with the average of those means

In [38]:
flights_test1['mean_arrdelay_dest_air'] = flights_test1['mean_arrdelay_dest_air'].fillna(flights_test1['mean_arrdelay_dest_air'].mean())
flights_test1['mean_arrdelay_origin_air'] = flights_test1['mean_arrdelay_origin_air'].fillna(flights_test1['mean_arrdelay_origin_air'].mean())

## make the types categories from feature engineering

In [39]:
flights_test1["mkt_unique_carrier"] = flights_test1["mkt_unique_carrier"].astype("category")
flights_test1["origin_airport_id"] = flights_test1["origin_airport_id"].astype("category")
flights_test1["dest_airport_id"] = flights_test1["dest_airport_id"].astype("category")
flights_test1["flight_duration_type"] = flights_test1["flight_duration_type"].astype("category")
flights_test1["taxi_in_duration"] = flights_test1["taxi_in_duration"].astype("category")
flights_test1["taxi_out_duration"] = flights_test1["taxi_out_duration"].astype("category")

* encoding categorical data

In [40]:
# Encode carrier, airports and the engineered duration categories as ordinal codes
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
flights_test1['mkt_unique_carrier'] = encoder.fit_transform(flights_test1[['mkt_unique_carrier']])
flights_test1['origin_airport_id'] = encoder.fit_transform(flights_test1[['origin_airport_id']])
flights_test1['dest_airport_id'] = encoder.fit_transform(flights_test1[['dest_airport_id']])
flights_test1['taxi_in_duration'] = encoder.fit_transform(flights_test1[['taxi_in_duration']])
flights_test1['taxi_out_duration'] = encoder.fit_transform(flights_test1[['taxi_out_duration']])
flights_test1['flight_duration_type'] = encoder.fit_transform(flights_test1[['flight_duration_type']])
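Fitting the encoder on the test data means a category can receive a different code than it had when the model was trained. A possible alternative, sketched below on made-up frames, is to fit once on the training data and only `transform` the test data. It assumes scikit-learn 0.24 or later for `handle_unknown="use_encoded_value"`; `train_demo`, `test_demo` and `carrier_encoder` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy training and test frames; "B6" never appears in training.
train_demo = pd.DataFrame({"mkt_unique_carrier": ["AA", "DL", "UA"]})
test_demo = pd.DataFrame({"mkt_unique_carrier": ["UA", "AA", "B6"]})

# Unseen categories are mapped to -1 instead of raising an error.
carrier_encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                                 unknown_value=-1)
carrier_encoder.fit(train_demo[["mkt_unique_carrier"]])

# Reuse the training-time mapping on the test data.
test_codes = carrier_encoder.transform(test_demo[["mkt_unique_carrier"]]).ravel()
print(test_codes)
```

In practice the fitted encoder would be saved alongside the model so that both notebooks share one mapping.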

* join on weather data

In [41]:
weather = pd.read_csv('weather.csv')

In [42]:
weather['City'] = weather['City'].str.strip().str.lower()

In [43]:
weather['date'] = pd.to_datetime(weather['StartTime(UTC)']).dt.date

In [44]:
weather['EndTime(UTC)'] = pd.to_datetime(weather['EndTime(UTC)'])

In [45]:
temp = weather.groupby(['City', 'date'])['EndTime(UTC)'].max()

In [46]:
temp = temp.reset_index()

In [47]:
weather_merged = weather.merge(temp, on = ['City', 'date', 'EndTime(UTC)'], how = 'inner')
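The "latest event per city and day" selection above can also be written with `idxmax`, which avoids the intermediate merge. This is a sketch on made-up rows, with `weather_demo` as a hypothetical stand-in for the real weather table. Unlike the merge, it keeps exactly one row when two events share the same end time.

```python
import pandas as pd

# Toy weather events (values are made up).
weather_demo = pd.DataFrame({
    "City": ["boston", "boston", "denver"],
    "Type": ["Rain", "Snow", "Fog"],
    "StartTime(UTC)": pd.to_datetime(
        ["2019-01-01 02:00", "2019-01-01 10:00", "2019-01-01 05:00"]),
    "EndTime(UTC)": pd.to_datetime(
        ["2019-01-01 04:00", "2019-01-01 12:00", "2019-01-01 06:00"]),
})
weather_demo["date"] = weather_demo["StartTime(UTC)"].dt.date

# Row label of the latest-ending event within each (City, date) group.
last_idx = weather_demo.groupby(["City", "date"])["EndTime(UTC)"].idxmax()
latest = weather_demo.loc[last_idx].reset_index(drop=True)
print(latest[["City", "date", "Type"]])
```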

In [48]:
weather_merged = weather_merged.rename(columns = {'date': 'fl_date', 'City': 'origin_city_name_only'})

In [49]:
weather_merged = weather_merged.drop(['StartTime(UTC)', 'EndTime(UTC)'], axis = 1)

In [50]:
weather_merged['fl_date'] = pd.to_datetime(weather_merged['fl_date'])

* keeping weather from 2018-12-02 through 2019-01-30 (the filter below uses strict inequalities, so 2018-12-01 itself is excluded)

In [51]:
weather_merged = weather_merged[(weather_merged['fl_date'] > '2018-12-01') & (weather_merged['fl_date'] < '2019-01-31')]

In [52]:
# most common value of each weather column per city within the date window above
weather_merged1 = weather_merged.groupby('origin_city_name_only').agg(lambda x:x.value_counts().index[0]).reset_index()

In [53]:
weather_merged1 = weather_merged1.drop('fl_date', axis = 1)

* merging weather with flights_test using origin city name as key

In [54]:
flights_test2 = pd.merge(flights_test1, weather_merged1, on = 'origin_city_name_only', how = 'left') 

In [55]:
#dropping nulls
flights_test2 = flights_test2.dropna()
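Because this is the test set, every row removed by `dropna()` is a flight that receives no prediction. A minimal sketch, on made-up data, of counting the loss and of one possible alternative: filling missing weather with an explicit placeholder. The label `"none"` is an assumption, and it would only be usable if the model had been trained with the same placeholder.

```python
import pandas as pd

# Toy merged frame: the middle flight has no weather match.
demo = pd.DataFrame({
    "distance": [500, 900, 1200],
    "Type": ["Rain", None, "Snow"],
    "Severity": ["Light", None, "Severe"],
})

# How many rows would dropna() remove?
rows_lost = len(demo) - len(demo.dropna())

# Alternative: keep every flight and mark missing weather explicitly.
filled = demo.fillna({"Type": "none", "Severity": "none"})
print(rows_lost, len(filled))
```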

In [56]:
flights_test2.columns

Index(['taxi_out_duration', 'taxi_in_duration', 'flight_duration_type',
       'fl_date', 'mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id',
       'crs_dep_time', 'crs_arr_time', 'dup', 'crs_elapsed_time', 'distance',
       'month', 'day_of_week', 'day_of_month', 'year', 'origin_city_name_only',
       'dest_city_name_only', 'hour_departure', 'hour_arrival',
       'mean_arrdelay_carrier', 'mean_arrdelay_dest_air',
       'mean_arrdelay_origin_air', 'Type', 'Severity'],
      dtype='object')

* dropping origin_city_name_only, dest_city_name_only and fl_date, which were only needed as merge keys

In [57]:
flights_test2 = flights_test2.drop(['origin_city_name_only', 'dest_city_name_only'], axis = 1)

In [58]:
flights_test2 = flights_test2.drop(['fl_date'], axis = 1)

* Categorize and encode weather and severity

In [59]:
flights_test2["Type"] = flights_test2["Type"].astype("category")
flights_test2["Severity"] = flights_test2["Severity"].astype("category")

In [60]:
flights_test2['Type'] = encoder.fit_transform(flights_test2[['Type']])
flights_test2['Severity'] = encoder.fit_transform(flights_test2[['Severity']])

In [62]:
flights_test2 = flights_test2.drop('dup', axis = 1)

In [63]:
flights_test2.shape

(561393, 21)

In [64]:
flights_test2.columns

Index(['taxi_out_duration', 'taxi_in_duration', 'flight_duration_type',
       'mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id',
       'crs_dep_time', 'crs_arr_time', 'crs_elapsed_time', 'distance', 'month',
       'day_of_week', 'day_of_month', 'year', 'hour_departure', 'hour_arrival',
       'mean_arrdelay_carrier', 'mean_arrdelay_dest_air',
       'mean_arrdelay_origin_air', 'Type', 'Severity'],
      dtype='object')

### Modeling

#### pickle module to load the saved models

In [65]:
import pickle
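For reference, a pickle round trip looks roughly like the sketch below. It fits a toy `LinearRegression` on made-up data and writes to a temporary directory; the file name `demo_model_pickle` is hypothetical and unrelated to the model files loaded later in this notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on y = 2x + 1.
X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model_pickle")
    # Save the fitted model ...
    with open(path, "wb") as out_file:
        pickle.dump(model, out_file)
    # ... and load it back, as done for the real models below.
    with open(path, "rb") as in_file:
        restored = pickle.load(in_file)

print(restored.predict(np.array([[3.0]])))
```

A pickled scikit-learn model should be loaded with the same scikit-learn version that created it, and pickle files from untrusted sources should never be loaded.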

### predicting using test data

* converting the transformed flights_test to an array

In [74]:
X = np.array(flights_test2)
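Since the array carries no column names, the model silently relies on the columns arriving in the same number and order as during training. A small sanity check could look like the sketch below; it uses a toy model, assumes scikit-learn 0.24 or later for `n_features_in_`, and `check_feature_count` is a hypothetical helper. It cannot detect columns that are present but in a different order.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model trained on three features (values are random).
rng = np.random.default_rng(0)
trained = LinearRegression().fit(rng.normal(size=(10, 3)), rng.normal(size=10))
X_new = rng.normal(size=(5, 3))

def check_feature_count(model, X):
    """Return X as a float array, or raise if the column count is wrong."""
    X = np.asarray(X, dtype=float)  # also converts numeric strings to floats
    expected = model.n_features_in_
    if X.shape[1] != expected:
        raise ValueError(f"model expects {expected} features, got {X.shape[1]}")
    return X

predictions = trained.predict(check_feature_count(trained, X_new))
print(predictions.shape)
```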

* loading the pickled model from `model_linear_Ela_pickle` (the elastic net model, judging by the file and variable names)

In [81]:
with open('model_linear_Ela_pickle', 'rb') as linear_Elas_file:
    model_elastic_linear = pickle.load(linear_Elas_file)

In [82]:
y_predict_linear_elastic =  model_elastic_linear.predict(X)

* loading the pickled model from `model_linear_pickle` (the linear regression model, judging by the file and variable names)

In [78]:
from sklearn.linear_model import ElasticNet
elasnet_model = ElasticNet()

In [79]:
with open('model_linear_pickle', 'rb') as linear_file:
    model_linear = pickle.load(linear_file)

In [80]:
y_predict_linear = model_linear.predict(X)