# Challenge Data Scientist - Wen Li

The goal of this project is to predict the probability of delay for flights that land at or take off from Santiago de Chile airport (SCL). The dataset is built from real, public data; each row corresponds to a flight that landed at or took off from SCL during 2017.

## 0.0 Package Import and Global variables assignment

In [26]:
import pandas as pd
import numpy as np
from datetime import datetime, date
source_df = "dataset_SCL.csv"
develop_flag = False

## 1.0 Read Data

In [2]:
raw_df = pd.read_csv(source_df)
if develop_flag:
    print("The shape of raw dataset is {}".format(raw_df.shape))



## 2.0 Data Pre-processing

### 2.1 Data cleaning

In [98]:
## 2.1.1 Remove duplicate data
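# Copy so that later column assignments modify an independent frame (avoids SettingWithCopyWarning)
df = raw_df.drop_duplicates().copy()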
if develop_flag:
    print("Drop {} duplicated rows".format(raw_df.shape[0]-df.shape[0]))
## 2.1.2 Check and fix structural errors if there are any

## TODO: Remove this part and Add these part in README
if develop_flag:
    print(set(df["Ori-I"]), set(df["Des-I"]), set(df["Emp-I"]), set(df["Des-O"]), set(df["Ori-O"]),
          set(df["Emp-O"]), set(df["DIANOM"]), set(df["OPERA"]), set(df["SIGLAORI"]), set(df["SIGLADES"]))
### Standardize capitalization: OK, no changes needed
### Clear formatting

### Convert data type
df["Fecha-I"] = pd.to_datetime(df["Fecha-I"])
df["Fecha-O"] = pd.to_datetime(df["Fecha-O"])

## 2.1.3 Handle missing data
## TODO: Remove this part and Add these part in README
if develop_flag:
    missing_col = []
    for column in df.columns:
        if df[column].isnull().values.any():
            missing_col.append(column)
    for column in missing_col:
        print(df[df[column].isna()])
        
# Fill missing operated flight numbers (Vlo-O) with the scheduled flight number (Vlo-I)
df["Vlo-O"] = df["Vlo-O"].fillna(df["Vlo-I"])
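
The missing-data step above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual output: the sample values are invented, and only the column names `Vlo-I` and `Vlo-O` come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) with one missing operated flight number.
toy = pd.DataFrame({
    "Vlo-I": ["226", "912", "200"],
    "Vlo-O": ["226", None, "200"],
})

# Count missing values per column and keep only the affected columns.
missing_counts = toy.isna().sum()
missing_cols = missing_counts[missing_counts > 0].index.tolist()

# Fill the missing operated flight number from the scheduled one.
toy["Vlo-O"] = toy["Vlo-O"].fillna(toy["Vlo-I"])
```

`fillna` with a Series aligns on the index, so each missing `Vlo-O` takes the `Vlo-I` value from the same row.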

### 2.2 Additional columns

In [100]:
## 2.2.0 Date-I & Time-I (helper columns as zero-padded strings, so string comparison follows chronological order)
df["Date-I"] = df['Fecha-I'].dt.strftime("%m-%d")
df["Time-I"] = df['Fecha-I'].dt.strftime("%H:%M")

## 2.2.1 high_season: 1 if Date-I is between Dec-15 and Mar-3, or Jul-15 and Jul-31, or Sep-11 and Sep-30, 0 otherwise.
df["high_season"] = np.where((df["Date-I"] <= "03-03")
                             | (("07-15" <= df["Date-I"]) & (df["Date-I"] <= "07-31"))
                             | (("09-11" <= df["Date-I"]) & (df["Date-I"] <= "09-30"))
                             | ("12-15" <= df["Date-I"]), 1, 0)

## 2.2.2 min_diff : difference in minutes between Fecha-O and Fecha-I
df["min_diff"] = (df['Fecha-O'] - df['Fecha-I'])/np.timedelta64(1,'m')

## 2.2.3 delay_15 : 1 if min_diff > 15, 0 if not.
df["delay_15"] = np.where(df["min_diff"] > 15, 1, 0)

## 2.2.4 period_day : morning (between 5:00 and 11:59), afternoon (between 12:00 and 18:59) and night (between 19:00 and 4:59)
df["period_day"] = np.where((df["Time-I"] <= "04:59") | (df["Time-I"] >= "19:00"), "night", "")
df.loc[(df["Time-I"] <= "11:59") & ("05:00" <= df["Time-I"]), "period_day"] = "morning"
df.loc[(df["Time-I"] <= "18:59") & ("12:00" <= df["Time-I"]), "period_day"] = "afternoon"

## 2.2.5 Drop the helper columns Date-I & Time-I
df = df.drop(["Date-I", "Time-I"], axis=1)

## 2.2.6 Save the data with the new columns to a CSV file
df.to_csv("synthetic_features.csv")
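
As a cross-check of the derived columns, the same rules can be sketched with datetime accessors instead of string comparisons. This is an alternative formulation under stated assumptions, not the notebook's method: the helper names `is_high_season` and `period_of_day` are hypothetical, and the toy timestamps are invented.

```python
import numpy as np
import pandas as pd

def is_high_season(ts: pd.Series) -> pd.Series:
    """Return 1 for Dec-15..Mar-3, Jul-15..Jul-31 or Sep-11..Sep-30, else 0."""
    md = ts.dt.month * 100 + ts.dt.day  # encode month and day, e.g. Jul-15 -> 715
    mask = (
        (md >= 1215) | (md <= 303)
        | md.between(715, 731)
        | md.between(911, 930)
    )
    return mask.astype(int)

def period_of_day(ts: pd.Series) -> pd.Series:
    """Label each timestamp as morning (5-11h), afternoon (12-18h) or night."""
    hour = ts.dt.hour
    labels = np.select(
        [hour.between(5, 11), hour.between(12, 18)],
        ["morning", "afternoon"],
        default="night",
    )
    return pd.Series(labels, index=ts.index)

# Toy flights (hypothetical values): scheduled vs. operated timestamps.
toy = pd.DataFrame({
    "Fecha-I": pd.to_datetime(
        ["2017-01-01 23:30:00", "2017-07-20 06:00:00", "2017-10-05 12:00:00"]),
    "Fecha-O": pd.to_datetime(
        ["2017-01-01 23:33:00", "2017-07-20 06:20:00", "2017-10-05 11:55:00"]),
})
toy["high_season"] = is_high_season(toy["Fecha-I"])
toy["min_diff"] = (toy["Fecha-O"] - toy["Fecha-I"]).dt.total_seconds() / 60
toy["delay_15"] = (toy["min_diff"] > 15).astype(int)
toy["period_day"] = period_of_day(toy["Fecha-I"])
```

Working on integer month, day, and hour values avoids the temporary string columns, and a negative `min_diff` (a flight operated earlier than scheduled) is simply treated as not delayed.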


## 3.0 Exploratory Data Analysis
### 3.1 Data Distribution

### 3.2 Relationship between delay rate and other variables

## 4.0 Models
### 4.1 Training

### 4.2 Testing and Evaluation

## 5.0 Conclusion