# Challenge Data Scientist - Mauricio Abarca

The problem consists of predicting the probability of delay for flights that land at or take off from the airport of Santiago de Chile
(SCL). For that you will have a dataset built from public, real data where each row corresponds to a flight that landed at or took off
from SCL during 2017. The following information is available for each flight:
- **Fecha-I:** Scheduled date and time of the flight.
- **Vlo-I:** Scheduled flight number.
- **Ori-I:** Programmed origin city code.
- **Des-I:** Programmed destination city code.
- **Emp-I:** Scheduled flight airline code.
- **Fecha-O:** Date and time of flight operation.
- **Vlo-O:** Flight operation number of the flight.
- **Ori-O:** Operation origin city code.
- **Des-O:** Operation destination city code.
- **Emp-O:** Airline code of the operated flight.
- **DIA:** Day of the month of flight operation.
- **MES:** Number of the month of operation of flight.
- **AÑO:** Year of flight operation.
- **DIANOM:** Day of the week of flight operation.
- **TIPOVUELO:** Type of flight, I = International, N = National.
- **OPERA:** Name of the airline that operates.
- **SIGLAORI:** Origin city name.
- **SIGLADES:** Destination city name. 


## Challenges
1) How is the data distributed? Did you find any noteworthy insight to share? What can you conclude about this?
   > *[See Solution](#Challenge-1)*
   
2) Generate the following additional columns. Please export them to a CSV file named synthetic_features.csv:
   - **high_season** : 1 if Fecha-I is between Dec-15 and Mar-3, or Jul-15 and Jul-31, or Sep-11 and Sep-30, 0 otherwise.
   - **min_diff** : difference in minutes between Fecha-O and Fecha-I.
   - **delay_15** : 1 if min_diff > 15, 0 if not.
   - **period_day** : morning (between 5:00 and 11:59), afternoon (between 12:00 and 18:59) and night (between 19:00 and 4:59), based on Fecha-I.
  
   > *[See Solution](#Challenge-2)*
     
  
3) What is the behavior of the delay rate across destination, airline, month of the year, day of the week, season, and type of flight? What variables would you expect to have the most influence in predicting delays?
   > *[See Solution](#Challenge-3)*
   
4) Train one or several models (using the algorithm(s) of your choice) to estimate the likelihood of a flight delay. Feel free to generate additional variables and/or supplement with external variables.
   > *[See Solution](#Challenge-4)*
   
5) Evaluate model performance in the predictive task across each model that you trained. Define and justify what metrics you used to assess model performance. Pick the best trained model and evaluate the following:
   - What variables were the most influential in the prediction task? 
   - How could you improve the performance?
   > *[See Solution](#Challenge-5)*


## Virtual Environment

Before anything else, we need to make sure the proper virtual environment is ready to work with. Since the scope of the challenge is not how to use venv on your machine, I will skip those steps and assume that the venv you'll use is created and ready to be used (in other words, activated). In my case, the venv is called latamds (and will be referred to by that name later on). To see how to create the venv and give it a name, refer to the [venv docs](https://docs.python.org/3/library/venv.html).

Once latamds is created and activated, we need to ensure that the notebook is using it. In VS Code this is as simple as checking the right corner for the kernel named (in my case) *latamds (Python 3.11.2)*.

At this point you should have the latamds venv created, activated, and selected as the kernel of this notebook. Now install all the packages listed in requirements.txt. After that you are ready to work with this notebook.
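
As a quick sanity check, the following sketch can be run in a notebook cell to see which interpreter the kernel is using. The helper name `in_virtualenv` is made up for illustration; it relies on the fact that inside a venv `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


# The executable path should live inside the venv folder (latamds in my case).
print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the notebook is probably running on the system interpreter rather than on the venv kernel.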

In [10]:
## Base Libraries 

import numpy as np
import pandas as pd

print('Base libraries loaded!')

base_df = pd.read_csv('Data/dataset_SCL.csv')
print('Dataset Loaded!')

Base libraries loaded!
Dataset Loaded!




## Challenge 1

First let's print the first five rows from the dataset, and after that let's print the information about it.

In [11]:
base_df.head() # we can see delayed flights

| | Fecha-I | Vlo-I | Ori-I | Des-I | Emp-I | Fecha-O | Vlo-O | Ori-O | Des-O | Emp-O | DIA | MES | AÑO | DIANOM | TIPOVUELO | OPERA | SIGLAORI | SIGLADES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01 23:30:00 | 226 | SCEL | KMIA | AAL | 2017-01-01 23:33:00 | 226 | SCEL | KMIA | AAL | 1 | 1 | 2017 | Domingo | I | American Airlines | Santiago | Miami |
| 1 | 2017-01-02 23:30:00 | 226 | SCEL | KMIA | AAL | 2017-01-02 23:39:00 | 226 | SCEL | KMIA | AAL | 2 | 1 | 2017 | Lunes | I | American Airlines | Santiago | Miami |
| 2 | 2017-01-03 23:30:00 | 226 | SCEL | KMIA | AAL | 2017-01-03 23:39:00 | 226 | SCEL | KMIA | AAL | 3 | 1 | 2017 | Martes | I | American Airlines | Santiago | Miami |
| 3 | 2017-01-04 23:30:00 | 226 | SCEL | KMIA | AAL | 2017-01-04 23:33:00 | 226 | SCEL | KMIA | AAL | 4 | 1 | 2017 | Miercoles | I | American Airlines | Santiago | Miami |
| 4 | 2017-01-05 23:30:00 | 226 | SCEL | KMIA | AAL | 2017-01-05 23:28:00 | 226 | SCEL | KMIA | AAL | 5 | 1 | 2017 | Jueves | I | American Airlines | Santiago | Miami |


In [12]:
base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68206 entries, 0 to 68205
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Fecha-I    68206 non-null  object
 1   Vlo-I      68206 non-null  object
 2   Ori-I      68206 non-null  object
 3   Des-I      68206 non-null  object
 4   Emp-I      68206 non-null  object
 5   Fecha-O    68206 non-null  object
 6   Vlo-O      68205 non-null  object
 7   Ori-O      68206 non-null  object
 8   Des-O      68206 non-null  object
 9   Emp-O      68206 non-null  object
 10  DIA        68206 non-null  int64 
 11  MES        68206 non-null  int64 
 12  AÑO        68206 non-null  int64 
 13  DIANOM     68206 non-null  object
 14  TIPOVUELO  68206 non-null  object
 15  OPERA      68206 non-null  object
 16  SIGLAORI   68206 non-null  object
 17  SIGLADES   68206 non-null  object
dtypes: int64(3), object(15)
memory usage: 9.4+ MB
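
The `info()` output shows 68205 non-null values for **Vlo-O** against 68206 rows, so one entry is missing. A minimal sketch for locating such gaps is shown here on a toy frame (the values are made up; only the column names come from the dataset). The same calls should work on `base_df`.

```python
import pandas as pd

# Toy frame standing in for base_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Vlo-I": ["226", "912", "940"],
    "Vlo-O": ["226", None, "940"],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy_df.isna().sum()
print(missing_counts[missing_counts > 0])

# Select the rows where at least one column is missing.
rows_with_missing = toy_df[toy_df.isna().any(axis=1)]
print(rows_with_missing)
```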


You can see that **Fecha-I** and **Fecha-O** differ, at least in the first five rows. This is expected, because the challenge consists of predicting the probability that a flight landing at or taking off from SCL is delayed. With that in mind, and seeing that some of those flights were delayed while others ran ahead of schedule, I will create a new column called '*delayed*' that is true when the difference between Fecha-O and Fecha-I is positive. After that I will show you a summary of the data.

But first, if you look at the information table above, you can see that the dates are stored as objects, so in order to work with them properly I will convert them to the datetime dtype.

In [16]:
# For now, only these two columns will be converted to datetime
base_df['Fecha-I'] = pd.to_datetime(base_df['Fecha-I'])
base_df['Fecha-O'] = pd.to_datetime(base_df['Fecha-O'])
base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68206 entries, 0 to 68205
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Fecha-I    68206 non-null  datetime64[ns]
 1   Vlo-I      68206 non-null  object        
 2   Ori-I      68206 non-null  object        
 3   Des-I      68206 non-null  object        
 4   Emp-I      68206 non-null  object        
 5   Fecha-O    68206 non-null  datetime64[ns]
 6   Vlo-O      68205 non-null  object        
 7   Ori-O      68206 non-null  object        
 8   Des-O      68206 non-null  object        
 9   Emp-O      68206 non-null  object        
 10  DIA        68206 non-null  int64         
 11  MES        68206 non-null  int64         
 12  AÑO        68206 non-null  int64         
 13  DIANOM     68206 non-null  object        
 14  TIPOVUELO  68206 non-null  object        
 15  OPERA      68206 non-null  object        
 16  SIGLAORI   68206 non-null  object       

In [30]:
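# Columns to display when checking the result
check_tables = ['Fecha-O', 'Fecha-I', 'delayed_datetime', 'delayed_minutes', 'delayed']

delayed_df = base_df.copy()  # work on a copy from here on so base_df stays untouched
# Difference between operated and scheduled time, as a timedelta
delayed_df['delayed_datetime'] = delayed_df['Fecha-O'] - delayed_df['Fecha-I']
# Same difference in minutes; negative values mean the flight operated ahead of schedule
delayed_df['delayed_minutes'] = pd.to_numeric(delayed_df['delayed_datetime'].dt.total_seconds() / 60, downcast='integer')
# A flight counts as delayed when the difference is positive
delayed_df['delayed'] = delayed_df['delayed_minutes'] > 0
delayed_df[check_tables].head()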

| | Fecha-O | Fecha-I | delayed_datetime | delayed_minutes | delayed |
|---|---|---|---|---|---|
| 0 | 2017-01-01 23:33:00 | 2017-01-01 23:30:00 | 0 days 00:03:00 | 3 | True |
| 1 | 2017-01-02 23:39:00 | 2017-01-02 23:30:00 | 0 days 00:09:00 | 9 | True |
| 2 | 2017-01-03 23:39:00 | 2017-01-03 23:30:00 | 0 days 00:09:00 | 9 | True |
| 3 | 2017-01-04 23:33:00 | 2017-01-04 23:30:00 | 0 days 00:03:00 | 3 | True |
| 4 | 2017-01-05 23:28:00 | 2017-01-05 23:30:00 | -1 days +23:58:00 | -2 | False |
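
One possible way to summarize these columns is sketched below on three rows copied from the output above (indices 0, 1 and 4). It is only an illustration of the calls; on the full `delayed_df` the numbers will differ.

```python
import pandas as pd

# Three rows copied from the head() output above (indices 0, 1 and 4).
toy = pd.DataFrame({
    "Fecha-I": pd.to_datetime(
        ["2017-01-01 23:30:00", "2017-01-02 23:30:00", "2017-01-05 23:30:00"]
    ),
    "Fecha-O": pd.to_datetime(
        ["2017-01-01 23:33:00", "2017-01-02 23:39:00", "2017-01-05 23:28:00"]
    ),
})
toy["delayed_minutes"] = (toy["Fecha-O"] - toy["Fecha-I"]).dt.total_seconds() / 60
toy["delayed"] = toy["delayed_minutes"] > 0

# Count, mean, spread and quartiles of the delay in minutes.
delay_summary = toy["delayed_minutes"].describe()
print(delay_summary)

# The mean of a boolean column is the share of rows that are True.
delayed_share = toy["delayed"].mean()
print(f"Share of delayed flights: {delayed_share:.2%}")
```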


## Challenge 2
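
A minimal sketch of the four requested columns follows. It assumes the scheduled and operated timestamps are the `Fecha-I` and `Fecha-O` columns, already converted to datetime, and that every date range in the definition is inclusive at both ends. The helper names (`is_high_season`, `period_of_day`, `add_synthetic_features`) and the toy rows are made up for illustration.

```python
import pandas as pd


def is_high_season(ts: pd.Timestamp) -> int:
    """1 if the date falls in Dec-15..Mar-3, Jul-15..Jul-31 or Sep-11..Sep-30."""
    month_day = (ts.month, ts.day)
    return int(
        month_day >= (12, 15)            # Dec-15 to Dec-31
        or month_day <= (3, 3)           # Jan-1 to Mar-3
        or (7, 15) <= month_day <= (7, 31)
        or (9, 11) <= month_day <= (9, 30)
    )


def period_of_day(ts: pd.Timestamp) -> str:
    """morning = 5:00-11:59, afternoon = 12:00-18:59, night = 19:00-4:59."""
    if 5 <= ts.hour < 12:
        return "morning"
    if 12 <= ts.hour < 19:
        return "afternoon"
    return "night"


def add_synthetic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build the four columns from the Fecha-I / Fecha-O datetime columns."""
    out = pd.DataFrame(index=df.index)
    out["high_season"] = df["Fecha-I"].map(is_high_season)
    out["min_diff"] = (df["Fecha-O"] - df["Fecha-I"]).dt.total_seconds() / 60
    out["delay_15"] = (out["min_diff"] > 15).astype(int)
    out["period_day"] = df["Fecha-I"].map(period_of_day)
    return out


# Toy rows invented for illustration (not taken from the dataset).
toy = pd.DataFrame({
    "Fecha-I": pd.to_datetime(
        ["2017-01-01 23:30:00", "2017-07-20 08:00:00", "2017-05-10 13:00:00"]
    ),
    "Fecha-O": pd.to_datetime(
        ["2017-01-01 23:33:00", "2017-07-20 08:20:00", "2017-05-10 13:00:00"]
    ),
})
features = add_synthetic_features(toy)
print(features)
```

Calling `features.to_csv('synthetic_features.csv', index=False)` on the result built from the full dataset would then produce the requested file.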

## Challenge 3
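
One way to look at the delay rate across a categorical variable is a grouped mean, sketched below. It assumes a 0/1 `delay_15` column as defined in the challenge list; the helper name `delay_rate_by` and the toy rows are hypothetical.

```python
import pandas as pd


def delay_rate_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Delay rate and number of flights for each value of `column`."""
    return (
        df.groupby(column)["delay_15"]
        .agg(delay_rate="mean", flights="size")
        .sort_values("delay_rate", ascending=False)
    )


# Toy rows invented for illustration: two international and two national flights.
toy = pd.DataFrame({
    "TIPOVUELO": ["I", "I", "N", "N"],
    "delay_15": [1, 0, 0, 0],
})
rates = delay_rate_by(toy, "TIPOVUELO")
print(rates)
```

The same helper could be called with `SIGLADES`, `OPERA`, `MES`, `DIANOM` or `high_season` to cover the other dimensions named in the question. Keeping the flight count next to the rate helps avoid over-reading groups with very few flights.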

## Challenge 4

## Challenge 5