# Challenge Data Scientist - Mauricio Abarca

The problem is to predict the probability of delay for flights that land at or take off from the airport of Santiago de Chile
(SCL). The dataset is built from public, real data, and each row corresponds to a flight that landed at or took off
from SCL during 2017. The following information is available for each flight:
- **Fecha-I:** Scheduled date and time of the flight.
- **Vlo-I:** Scheduled flight number.
- **Ori-I:** Programmed origin city code.
- **Des-I:** Programmed destination city code.
- **Emp-I:** Scheduled flight airline code.
- **Fecha-O:** Date and time of flight operation.
- **Vlo-O:** Operated flight number.
- **Ori-O:** Operation origin city code.
- **Des-O:** Operation destination city code.
- **Emp-O:** Airline code of the operated flight.
- **DIA:** Day of the month of flight operation.
- **MES:** Month number of flight operation.
- **AÑO:** Year of flight operation.
- **DIANOM:** Day of the week of flight operation.
- **TIPOVUELO:** Type of flight, I = International, N = National.
- **OPERA:** Name of the airline that operates.
- **SIGLAORI:** Origin city name.
- **SIGLADES:** Destination city name. 


## Challenges
1) How is the data distributed? Did you find any noteworthy insight to share? What can you conclude about this?
   > *[See Solution](##-Challenge-1)*
   
2) Generate the following additional columns. Please export them to a CSV file named synthetic_features.csv:
   - **high_season** : 1 if Fecha-I is between Dec-15 and Mar-3, or Jul-15 and Jul-31, or Sep-11 and Sep-30, 0 otherwise.
   - **min_diff** : difference in minutes between Fecha-O and Fecha-I.
   - **delay_15** : 1 if min_diff > 15, 0 if not.
   - **period_day** : morning (between 5:00 and 11:59), afternoon (between 12:00 and 18:59) and night (between 19:00 and 4:59), based on Fecha-I.
  
   > *[See Solution](##-Challenge-2)*
     
  
3) What is the behavior of the delay rate across destination, airline, month of the year, day of the week, season, type of flight? What variables would you expect to have the most influence in predicting delays?
   > *[See Solution](##-Challenge-3)*
   
4) Train one or several models (using the algorithm(s) of your choice) to estimate the likelihood of a flight delay. Feel free to generate additional variables and/or supplement with external variables.
   > *[See Solution](##-Challenge-4)*
   
5) Evaluate model performance in the predictive task across each model that you trained. Define and justify what metrics you used to assess model performance. Pick the best trained model and evaluate the following:
   - What variables were the most influential in the prediction task? 
   - How could you improve the performance?
   > *[See Solution](##-Challenge-5)*


## Virtual Environment

Before anything else, we need to make sure the proper virtual environment is ready. Since the scope of the challenge is not how to use venv on your machine, I will skip those steps and assume that the venv you'll use is created and ready to be used (in other words, activated). In my case, the venv is called latamds (and will be referred to by that name later on). To see how to create the venv and give it a name, refer to the [venv docs](https://docs.python.org/3/library/venv.html).

Once latamds is created and activated, we need to make sure that the notebook is using it. In VS Code this is as simple as checking the right corner for the kernel named (in my case) *latamds (Python 3.11.2)*.

At this point you should have the latamds venv created, activated, and selected as the kernel for this Notebook. Now install all the packages listed in requirements.txt. After that you are ready to work with this Notebook.
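
As a quick sanity check, the cell below is a minimal sketch for confirming that the kernel really runs inside a virtual environment. The helper name `in_virtualenv` is hypothetical and not part of the original notebook; it only relies on the standard `sys` module.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the notebook is most likely attached to the system interpreter rather than to latamds.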

In [1]:
## Base Libraries 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# This ensures plots are displayed inline in the Jupyter notebook
%matplotlib inline

print('Base libraries loaded!')

base_df = pd.read_csv('data/dataset_SCL.csv')
print('Dataset Loaded!')

Base libraries loaded!
Dataset Loaded!




## Challenge 1

First, let's print the first five and the last five rows of the dataset, and then print its summary information.

In [2]:
base_df.head() # we can see delayed flights

Unnamed: 0,Fecha-I,Vlo-I,Ori-I,Des-I,Emp-I,Fecha-O,Vlo-O,Ori-O,Des-O,Emp-O,DIA,MES,AÑO,DIANOM,TIPOVUELO,OPERA,SIGLAORI,SIGLADES
0,2017-01-01 23:30:00,226,SCEL,KMIA,AAL,2017-01-01 23:33:00,226,SCEL,KMIA,AAL,1,1,2017,Domingo,I,American Airlines,Santiago,Miami
1,2017-01-02 23:30:00,226,SCEL,KMIA,AAL,2017-01-02 23:39:00,226,SCEL,KMIA,AAL,2,1,2017,Lunes,I,American Airlines,Santiago,Miami
2,2017-01-03 23:30:00,226,SCEL,KMIA,AAL,2017-01-03 23:39:00,226,SCEL,KMIA,AAL,3,1,2017,Martes,I,American Airlines,Santiago,Miami
3,2017-01-04 23:30:00,226,SCEL,KMIA,AAL,2017-01-04 23:33:00,226,SCEL,KMIA,AAL,4,1,2017,Miercoles,I,American Airlines,Santiago,Miami
4,2017-01-05 23:30:00,226,SCEL,KMIA,AAL,2017-01-05 23:28:00,226,SCEL,KMIA,AAL,5,1,2017,Jueves,I,American Airlines,Santiago,Miami


In [3]:
base_df.tail() # we can see some format changes on Vlo-O

Unnamed: 0,Fecha-I,Vlo-I,Ori-I,Des-I,Emp-I,Fecha-O,Vlo-O,Ori-O,Des-O,Emp-O,DIA,MES,AÑO,DIANOM,TIPOVUELO,OPERA,SIGLAORI,SIGLADES
68201,2017-12-22 14:55:00,400,SCEL,SPJC,JAT,2017-12-22 15:41:00,400.0,SCEL,SPJC,JAT,22,12,2017,Viernes,I,JetSmart SPA,Santiago,Lima
68202,2017-12-25 14:55:00,400,SCEL,SPJC,JAT,2017-12-25 15:11:00,400.0,SCEL,SPJC,JAT,25,12,2017,Lunes,I,JetSmart SPA,Santiago,Lima
68203,2017-12-27 14:55:00,400,SCEL,SPJC,JAT,2017-12-27 15:35:00,400.0,SCEL,SPJC,JAT,27,12,2017,Miercoles,I,JetSmart SPA,Santiago,Lima
68204,2017-12-29 14:55:00,400,SCEL,SPJC,JAT,2017-12-29 15:08:00,400.0,SCEL,SPJC,JAT,29,12,2017,Viernes,I,JetSmart SPA,Santiago,Lima
68205,2017-12-31 14:55:00,400,SCEL,SPJC,JAT,2017-12-31 15:04:00,400.0,SCEL,SPJC,JAT,31,12,2017,Domingo,I,JetSmart SPA,Santiago,Lima


In [4]:
base_df.info() # Beware of dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68206 entries, 0 to 68205
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Fecha-I    68206 non-null  object
 1   Vlo-I      68206 non-null  object
 2   Ori-I      68206 non-null  object
 3   Des-I      68206 non-null  object
 4   Emp-I      68206 non-null  object
 5   Fecha-O    68206 non-null  object
 6   Vlo-O      68205 non-null  object
 7   Ori-O      68206 non-null  object
 8   Des-O      68206 non-null  object
 9   Emp-O      68206 non-null  object
 10  DIA        68206 non-null  int64 
 11  MES        68206 non-null  int64 
 12  AÑO        68206 non-null  int64 
 13  DIANOM     68206 non-null  object
 14  TIPOVUELO  68206 non-null  object
 15  OPERA      68206 non-null  object
 16  SIGLAORI   68206 non-null  object
 17  SIGLADES   68206 non-null  object
dtypes: int64(3), object(15)
memory usage: 9.4+ MB


Notice that, apart from **DIA**, **MES** and **AÑO**, all the fields in the data are objects (strings). So, before looking at the distribution of the data, let's transform **Fecha-I** and **Fecha-O** to datetime and convert all the other fields to categories (since they are mostly categorical), but not before checking whether **Vlo-I** and **Vlo-O** are just numbers stored as strings or contain something else. After that, let's make a decision about the null value in **Vlo-O** and convert the decimal values to integers.
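
The conversion described above could look roughly like the sketch below. It runs on a tiny stand-in frame (`sample_df`, an illustrative name with values copied from the rows printed earlier) rather than on `base_df`, and the choice of which columns to treat as categorical is an assumption.

```python
import pandas as pd

# Tiny stand-in frame using the dataset's column names (illustrative values only).
sample_df = pd.DataFrame({
    "Fecha-I": ["2017-01-01 23:30:00", "2017-12-22 14:55:00"],
    "Fecha-O": ["2017-01-01 23:33:00", "2017-12-22 15:41:00"],
    "Emp-I": ["AAL", "JAT"],
    "TIPOVUELO": ["I", "I"],
    "OPERA": ["American Airlines", "JetSmart SPA"],
})

# Parse the two timestamp columns with an explicit format.
for col in ["Fecha-I", "Fecha-O"]:
    sample_df[col] = pd.to_datetime(sample_df[col], format="%Y-%m-%d %H:%M:%S")

# Store the remaining text columns as pandas categoricals.
categorical_cols = ["Emp-I", "TIPOVUELO", "OPERA"]
sample_df = sample_df.astype({col: "category" for col in categorical_cols})

print(sample_df.dtypes)
```

The same pattern should carry over to the full dataset once the flight-number columns have been cleaned.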

In [31]:
# Checking if the field Vlo-I is just a numeric string or has some letters in it
vlo_i = base_df[base_df['Vlo-I'].str.contains('[a-zA-Z]', na=False)]
print(vlo_i)
print('\nTOTAL MATCHING ROWS (Vlo-I Letters): {}\n'.format(vlo_i.shape[0]))

# Let's check if has decimals too
vlo_i_d = base_df[base_df['Vlo-I'].str.contains(r'\.', na=False)]  # raw string avoids an invalid escape sequence
print(vlo_i_d)
print('\nTOTAL MATCHING ROWS (Vlo-I Decimals): {}\n'.format(vlo_i_d.shape[0]))


                   Fecha-I  Vlo-I Ori-I Des-I Emp-I              Fecha-O  \
22232  2017-05-13 21:50:00   989P  SCEL  SUMU   AAL  2017-05-13 21:52:00   
27464  2017-06-16 20:30:00   940P  SCEL  KDFW   AAL  2017-06-16 20:50:00   
39225  2017-08-07 19:00:00   591P  SCEL  LFPG   PUE  2017-08-07 20:35:00   
39266  2017-08-24 23:00:00   846A  SCEL  KIAH   UAL  2017-08-24 23:00:00   
58126  2017-11-05 17:00:00  1104A  SCEL  SCSE   SKU  2017-11-05 17:42:00   

      Vlo-O Ori-O Des-O Emp-O  DIA  MES   AÑO   DIANOM TIPOVUELO  \
22232   989  SCEL  SUMU   AAL   13    5  2017   Sabado         I   
27464   940  SCEL  KDFW   AAL   16    6  2017  Viernes         I   
39225  591P  SCEL  LFPG   PUE    7    8  2017    Lunes         I   
39266  2804  SCEL  KIAH   UAL   24    8  2017   Jueves         I   
58126  1104  SCEL  SCSE   SKU    5   11  2017  Domingo         N   

                          OPERA  SIGLAORI    SIGLADES  
22232         American Airlines  Santiago  Montevideo  
27464         American

In [26]:
# Checking if the field Vlo-O is just a numeric string or has some letters in it
vlo_o = base_df[base_df['Vlo-O'].str.contains('[a-zA-Z]', na=False)]
print(vlo_o)
print('\nTOTAL MATCHING ROWS (Vlo-O Letters): {}\n'.format(vlo_o.shape[0]))

                   Fecha-I Vlo-I Ori-I Des-I Emp-I              Fecha-O Vlo-O  \
13906  2017-03-30 10:30:00    71  SCEL  SCIE   SKU  2017-03-30 10:50:00   71R   
13907  2017-03-22 11:00:00  1071  SCEL  SCIE   SKU  2017-03-22 11:00:00   71R   
17055  2017-03-22 10:00:00   201  SCEL  SCIE   LXP  2017-03-22 11:50:00  201R   
19207  2017-04-25 09:00:00    71  SCEL  SCIE   SKU  2017-04-25 09:29:00   71R   
22167  2017-04-06 21:10:00    43  SCEL  SCIE   LAW  2017-04-06 21:24:00   43R   
22301  2017-05-26 09:40:00   401  SCEL  LFPG   AFR  2017-05-26 10:03:00  401A   
22302  2017-05-28 09:40:00   401  SCEL  LFPG   AFR  2017-05-28 09:43:00  401B   
24304  2017-05-11 10:00:00   802  SCEL  SPJC   SKU  2017-05-11 10:36:00  802R   
27246  2017-05-15 12:15:00   114  SCEL  SCAT   LAN  2017-05-15 13:10:00  114R   
27247  2017-05-25 13:15:00   622  SCEL  MMMX   LAN  2017-05-25 13:23:00  622R   
32196  2017-06-29 12:40:00   492  SCEL  SACO   LAN  2017-06-29 12:27:00  492R   
38150  2017-07-10 23:30:00  

In [30]:
# Let's seek decimals
vlo_o_d = base_df[base_df['Vlo-O'].str.contains(r'\.', na=False)]  # raw string avoids an invalid escape sequence
print(vlo_o_d)
print('\nTOTAL MATCHING ROWS (Vlo-O Decimals): {}\n'.format(vlo_o_d.shape[0]))

                   Fecha-I Vlo-I Ori-I Des-I Emp-I              Fecha-O  \
63806  2017-12-15 11:00:00   150  SCEL  SCFA   SKU  2017-12-15 11:19:00   
63807  2017-12-16 11:20:00   150  SCEL  SCFA   SKU  2017-12-16 11:34:00   
63808  2017-12-17 11:00:00   150  SCEL  SCFA   SKU  2017-12-17 11:35:00   
63809  2017-12-18 11:00:00   150  SCEL  SCFA   SKU  2017-12-18 11:21:00   
63810  2017-12-19 11:00:00   150  SCEL  SCFA   SKU  2017-12-19 11:19:00   
...                    ...   ...   ...   ...   ...                  ...   
65531  2017-12-12 15:30:00   265  SCEL  SCTE   LAN  2017-12-12 15:32:00   
65532  2017-12-16 13:41:00   265  SCEL  SCTE   LAN  2017-12-16 13:49:00   
65533  2017-12-18 15:59:00   265  SCEL  SCTE   LAN  2017-12-18 16:07:00   
65534  2017-12-19 15:11:00   265  SCEL  SCTE   LAN  2017-12-19 15:10:00   
65535  2017-12-25 16:19:00   265  SCEL  SCTE   LAN  2017-12-25 16:24:00   

       Vlo-O Ori-O Des-O Emp-O  DIA  MES   AÑO   DIANOM TIPOVUELO  \
63806  150.0  SCEL  SCFA   SKU

As you can see, Vlo-I has 5 rows containing letters and 0 containing decimals. Vlo-O, on the other hand, has 1730 rows containing letters and the same number containing decimals. Let's solve the decimal problem before anything else.
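
One possible way to approach the decimal problem is sketched below. The helper `normalize_flight_number` is a hypothetical name, and the sketch only strips a trailing `.0`; it deliberately leaves letter suffixes and the missing value untouched, since those decisions are still open at this point.

```python
import pandas as pd


def normalize_flight_number(value):
    """Return the flight number as a string without a trailing '.0'."""
    if pd.isna(value):
        return value  # leave the missing entry as it is
    text = str(value).strip()
    if text.endswith(".0"):
        text = text[:-2]
    return text


# Mixed representations similar to the ones seen in Vlo-O (illustrative values).
sample = pd.Series(["150.0", "71R", "400", 226, 400.0, None], dtype="object")
cleaned = sample.map(normalize_flight_number)
print(cleaned.tolist())
```

On the real data this would be applied as `base_df['Vlo-O'].map(normalize_flight_number)`, assuming the column is still of object dtype.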

## Challenge 2
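
The four columns requested in the challenge statement can be sketched as follows. This is a minimal illustration, not the final solution: the helper names are hypothetical, the date ranges are assumed to be inclusive on both ends, and the code runs on a small stand-in frame instead of `base_df`.

```python
import pandas as pd


def is_high_season(ts: pd.Timestamp) -> int:
    """1 if the date falls in Dec-15..Mar-3, Jul-15..Jul-31 or Sep-11..Sep-30 (inclusive)."""
    md = (ts.month, ts.day)
    in_summer = md >= (12, 15) or md <= (3, 3)  # range wraps around New Year
    in_july = (7, 15) <= md <= (7, 31)
    in_september = (9, 11) <= md <= (9, 30)
    return int(in_summer or in_july or in_september)


def period_of_day(ts: pd.Timestamp) -> str:
    """morning = 05:00-11:59, afternoon = 12:00-18:59, night = 19:00-04:59."""
    if 5 <= ts.hour < 12:
        return "morning"
    if 12 <= ts.hour < 19:
        return "afternoon"
    return "night"


def add_synthetic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Build the four synthetic columns from Fecha-I and Fecha-O."""
    scheduled = pd.to_datetime(df["Fecha-I"])
    operated = pd.to_datetime(df["Fecha-O"])
    out = pd.DataFrame(index=df.index)
    out["high_season"] = scheduled.map(is_high_season)
    out["min_diff"] = (operated - scheduled).dt.total_seconds() / 60
    out["delay_15"] = (out["min_diff"] > 15).astype(int)
    out["period_day"] = scheduled.map(period_of_day)
    return out


# Stand-in rows copied from the outputs shown earlier in the notebook.
sample_df = pd.DataFrame({
    "Fecha-I": ["2017-01-01 23:30:00", "2017-12-22 14:55:00", "2017-05-13 21:50:00"],
    "Fecha-O": ["2017-01-01 23:33:00", "2017-12-22 15:41:00", "2017-05-13 21:52:00"],
})
features = add_synthetic_features(sample_df)
print(features)

# Export step requested by the challenge (left commented out in this sketch):
# features.to_csv("synthetic_features.csv", index=False)
```

Comparing `(month, day)` tuples keeps the season check independent of the year, which matters because the first range crosses the year boundary.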

## Challenge 3
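
A simple starting point for this challenge is the share of delayed flights per category. The sketch below assumes a `delay_15` column as defined in the challenge statement; the helper `delay_rate_by` is a hypothetical name and the data is a tiny illustrative frame.

```python
import pandas as pd


def delay_rate_by(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Delay rate (mean of delay_15) and number of flights for each value of `column`."""
    return (
        df.groupby(column)["delay_15"]
        .agg(delay_rate="mean", flights="count")
        .sort_values("delay_rate", ascending=False)
    )


# Illustrative data only; the real analysis would use the full dataset.
sample_df = pd.DataFrame({
    "OPERA": ["American Airlines", "American Airlines", "JetSmart SPA", "JetSmart SPA"],
    "delay_15": [0, 0, 1, 0],
})
rates = delay_rate_by(sample_df, "OPERA")
print(rates)
```

Keeping the flight count next to the rate helps avoid over-reading categories with very few flights. The same call could be repeated for SIGLADES, MES, DIANOM, TIPOVUELO and high_season.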

## Challenge 4

## Challenge 5