# Demo: public usability of the API launched by Luis Mazabuel

**URL = https://acid-mle-challenge-3skroqvvwa-ue.a.run.app**

In the following cells, I will:
* 1. Generate files in the empty repository.
* 2. Replicate JUAN's best model:
    a. Reproduce the same configuration
    b. Predict with it
    c. Evaluate it to reach the same metrics he reported
* 3. Propose my own model configuration:
    a. Predict with it
    b. Evaluate it to see whether it is better than JUAN's

**All of these steps use only the API launched for this purpose.**

In [1]:
import requests
import json

In [2]:
host = 'https://acid-mle-challenge-3skroqvvwa-ue.a.run.app/'
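
The base URL appears in this notebook both with and without a trailing slash. As a minimal sketch (the helper name `api_url` is hypothetical and not part of the API), the host and an endpoint path can be joined in a way that tolerates either form:

```python
from urllib.parse import urljoin


def api_url(host: str, path: str) -> str:
    """Join the base host and an endpoint path, tolerating missing or extra slashes."""
    return urljoin(host.rstrip("/") + "/", path.lstrip("/"))


# Both spellings of the host resolve to the same endpoint URL.
with_slash = api_url("https://acid-mle-challenge-3skroqvvwa-ue.a.run.app/", "files/all_files")
without_slash = api_url("https://acid-mle-challenge-3skroqvvwa-ue.a.run.app", "/files/all_files")
print(with_slash)
```

With such a helper, `requests.get(api_url(host, "files/all_files"))` would behave the same whichever form of `host` is used.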

# 1. GENERATE INITIAL FILES

## a) Check whether 'src/files/' is empty

In [3]:
r = requests.get(f"{host}files/all_files")
r.content

b'[{"input":["dataset_SCL.csv"]},{"output":[]}]'
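
The listing comes back as a JSON list of single-key objects. A small sketch, assuming that shape stays the same (the function name `parse_file_listing` is hypothetical), merges it into one dictionary so that the `input` and `output` folders can be checked directly:

```python
import json


def parse_file_listing(raw: bytes) -> dict:
    """Merge the list of single-key dicts returned by files/all_files into one dict."""
    listing = {}
    for entry in json.loads(raw):
        listing.update(entry)
    return listing


# The raw bytes below are copied from the response shown above.
raw = b'[{"input":["dataset_SCL.csv"]},{"output":[]}]'
listing = parse_file_listing(raw)
print(listing["input"], listing["output"])
```

In the notebook this would be called as `parse_file_listing(r.content)`.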

## b) Upload 'dataset_SCL.csv' to 'src/files/'

If you prefer, you can upload the file from your local machine through the Swagger UI (open the URL in a browser).

In [11]:
# Open the file in a context manager so the handle is closed after the upload
with open('dataset_SCL.csv', 'rb') as f:
    r = requests.post(url=f"{host}files/upload_file", files={'file': f})
r.content

b'"Successfully uploaded dataset_SCL.csv"'

## c) Generate 'dataset_SCL_complete.csv' and 'synthetic_features.csv' files in 'src/files/output'

In [25]:
body = {
    "generate_both_files": True,
    "generate_files": "string",
    "test_mode": False,
    "test_size": 0,
    "test_random_state": 0
}
r = requests.post(url=f"{host}files/create_additional_features", json=body) 
r.json()

{'values with nan': [{'Fecha-I': '2017-01-19 11:00:00',
   'Vlo-I': '200',
   'Ori-I': 'SCEL',
   'Des-I': 'SPJC',
   'Emp-I': 'LAW',
   'Fecha-O': '2017-01-19 11:03:00',
   'Vlo-O': '',
   'Ori-O': 'SCEL',
   'Des-O': 'SPJC',
   'Emp-O': '56R',
   'DIA': 19,
   'MES': 1,
   'AÑO': 2017,
   'DIANOM': 'Jueves',
   'TIPOVUELO': 'I',
   'OPERA': 'Latin American Wings',
   'SIGLAORI': 'Santiago',
   'SIGLADES': 'Lima',
   'temporada_alta': 1,
   'dif_min': 3.0,
   'atraso_15': 0,
   'periodo_dia': 'mañana'}],
 'values without nan': [{'Fecha-I': '2017-01-01 23:30:00',
   'Vlo-I': '226',
   'Ori-I': 'SCEL',
   'Des-I': 'KMIA',
   'Emp-I': 'AAL',
   'Fecha-O': '2017-01-01 23:33:00',
   'Vlo-O': '226',
   'Ori-O': 'SCEL',
   'Des-O': 'KMIA',
   'Emp-O': 'AAL',
   'DIA': 1,
   'MES': 1,
   'AÑO': 2017,
   'DIANOM': 'Domingo',
   'TIPOVUELO': 'I',
   'OPERA': 'American Airlines',
   'SIGLAORI': 'Santiago',
   'SIGLADES': 'Miami',
   'temporada_alta': 1,
   'dif_min': 3.0,
   'atraso_15': 0,
   '

# 2. REPLICATE JUAN'S BEST MODEL

Section: **'Métricas XGBoost dejando Features más importantes'** (XGBoost metrics keeping the most important features)
Model Config:

    - DATA_SPLIT-RANDOM_STATE: 42
    - MODEL: XGBClassifier
    - MODEL-RANDOM_STATE: 1
    - MODEL_PARAMS: {"learning_rate":0.01, "subsample": 1, "max_depth": 10}
    - X_train (features): ['MES_7', 'TIPOVUELO_I', 'OPERA_Copa Air', 'OPERA_Latin American Wings', 'MES_12', 'OPERA_Grupo LATAM', 'MES_10', 'OPERA_JetSmart SPA', 'OPERA_Air Canada', 'MES_9', 'OPERA_American Airlines']
    - y balanced?: NO
    - GridSearchCV: NO
    
Accuracy: 0.82
ROC AUC: 0.5092
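
The feature names in this configuration (`MES_7`, `OPERA_Copa Air`, and so on) look like one-hot columns named `<column>_<value>`. The sketch below illustrates that naming in plain Python, assuming the API follows the same convention as `pandas.get_dummies`; the function `one_hot` and the two toy rows are hypothetical and only serve as an illustration.

```python
def one_hot(records, categorical):
    """One-hot encode the given categorical columns of a list of dict records."""
    # Collect every (column, value) pair seen in the data, in sorted order.
    columns = sorted({(c, str(r[c])) for r in records for c in categorical})
    names = [f"{c}_{v}" for c, v in columns]
    rows = [
        {f"{c}_{v}": int(str(r[c]) == v) for c, v in columns}
        for r in records
    ]
    return names, rows


# Toy rows (not taken from the dataset) with three categorical columns.
flights = [
    {"MES": 7, "TIPOVUELO": "I", "OPERA": "Copa Air"},
    {"MES": 12, "TIPOVUELO": "N", "OPERA": "Grupo LATAM"},
]
names, rows = one_hot(flights, ["MES", "TIPOVUELO", "OPERA"])

# Keep only a subset of the encoded columns, as a features filter would.
wanted = ["MES_7", "TIPOVUELO_I", "OPERA_Copa Air"]
filtered = [{k: row[k] for k in wanted} for row in rows]
print(filtered)
```

Read this way, JUAN's 11 features are a hand-picked subset of the one-hot columns produced from `OPERA`, `MES` and `TIPOVUELO`.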

## 1. Generate 'X_train', 'X_test', 'y_train' and 'y_test' files into 'src/files/output/' from 'dataset_SCL_complete.csv'.
(or from the concatenation of the 'values with nan' and 'values without nan' results in the previous response)
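
As a minimal sketch of that alternative, the two record lists from the previous response can be joined into a single list of rows. The function name `concat_records` is hypothetical, and the toy response below keeps only two of the real fields:

```python
def concat_records(response: dict) -> list:
    """Concatenate the two record lists returned by files/create_additional_features."""
    return response.get("values with nan", []) + response.get("values without nan", [])


# Toy response with the same two top-level keys as the real one (fields abbreviated).
response = {
    "values with nan": [{"Vlo-I": "200", "Vlo-O": ""}],
    "values without nan": [{"Vlo-I": "226", "Vlo-O": "226"}],
}
all_rows = concat_records(response)
print(len(all_rows))
```

The resulting list of dictionaries could then be loaded into a table, for example with `pandas.DataFrame(all_rows)`.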

In [27]:
body = {
    "data_filename": "output/dataset_SCL_complete.csv",
    "features_filter": [
        "MES_7",
        "TIPOVUELO_I",
        "OPERA_Copa Air",
        "OPERA_Latin American Wings",
        "MES_12",
        "OPERA_Grupo LATAM",
        "MES_10",
        "OPERA_JetSmart SPA",
        "OPERA_Air Canada",
        "MES_9",
        "OPERA_American Airlines"
    ],
    "categorical_features": [
        "OPERA",
        "MES",
        "TIPOVUELO"
    ],
    "numerical_features": [],
    "minmax_scaler_numerical_f": False,
    "label": "atraso_15",
    "shuffle_data": True,
    "shuffle_features": [
        "OPERA",
        "MES",
        "TIPOVUELO",
        "SIGLADES",
        "DIANOM",
        "atraso_15"
    ],
    "random_state": 42
}
r = requests.post(url=f"{host}files/train_test_split", json=body) 
r.json()

{'X_train': [{'MES_7': 0,
   'TIPOVUELO_I': 1,
   'OPERA_Copa Air': 0,
   'OPERA_Latin American Wings': 0,
   'MES_12': 0,
   'OPERA_Grupo LATAM': 1,
   'MES_10': 0,
   'OPERA_JetSmart SPA': 0,
   'OPERA_Air Canada': 0,
   'MES_9': 1,
   'OPERA_American Airlines': 0},
  {'MES_7': 0,
   'TIPOVUELO_I': 1,
   'OPERA_Copa Air': 0,
   'OPERA_Latin American Wings': 0,
   'MES_12': 0,
   'OPERA_Grupo LATAM': 1,
   'MES_10': 0,
   'OPERA_JetSmart SPA': 0,
   'OPERA_Air Canada': 0,
   'MES_9': 0,
   'OPERA_American Airlines': 0},
  {'MES_7': 0,
   'TIPOVUELO_I': 1,
   'OPERA_Copa Air': 0,
   'OPERA_Latin American Wings': 0,
   'MES_12': 0,
   'OPERA_Grupo LATAM': 1,
   'MES_10': 0,
   'OPERA_JetSmart SPA': 0,
   'OPERA_Air Canada': 0,
   'MES_9': 0,
   'OPERA_American Airlines': 0},
  {'MES_7': 0,
   'TIPOVUELO_I': 0,
   'OPERA_Copa Air': 0,
   'OPERA_Latin American Wings': 0,
   'MES_12': 0,
   'OPERA_Grupo LATAM': 1,
   'MES_10': 0,
   'OPERA_JetSmart SPA': 0,
   'OPERA_Air Canada': 0,
   'ME

## 2. Train the model with X_train and y_train and save it in 'src/models/'

In [30]:
body = {
    "X_train_filename": "dataset_SCL_complete-X_train.csv",
    "y_train_filename": "dataset_SCL_complete-y_train.csv",
    "model_name": "xgb",
    "destination_model_name": "JUAN_best_model",
    "model_custom_params": {"learning_rate":0.01, "subsample": 1, "max_depth": 10},
    "grid_search_cv": False,
    "grid_search_cv_params": {
        "param_grid": {
            "learning_rate": [],
            "n_estimators": [],
            "subsample": []
        },
        "cv": 0,
        "n_jobs": 0,
        "verbose": 0
    },
    "random_state":1,
    "balancing_method":None
}
r = requests.post(url=f"{host}models/train_binary_classification_model", json=body) 
r.json()

"Success! Model 'JUAN_best_model.pkl' created in 'src/models/'."

## 3. Predict with JUAN's best model

In [31]:
body = {
    "model_filename": "JUAN_best_model",
    "X_test_filename": "dataset_SCL_complete-X_test.csv"
}
r = requests.post(url=f"{host}models/predict", json=body) 
r.json()

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,


## 4. Evaluate JUAN's best model

In [35]:
body = {
    "y_real_filename": "dataset_SCL_complete-y_test.csv",
    "y_predicted_filename": "JUAN_best_model-predictions.csv"
}
r = requests.get(url=f"{host}models/classification_report", params=body) 
r.json()

{'accuracy': 0.8193975475386529,
 'recall': 0.020423048869438368,
 'f1_score': 0.039688164422395464,
 'roc_auc_score': 0.5092329976611394,
 'data': {'real': {'0': 0,
   '1': 0,
   '2': 1,
   '3': 0,
   '4': 0,
   '5': 0,
   '6': 0,
   '7': 0,
   '8': 0,
   '9': 0,
   '10': 0,
   '11': 0,
   '12': 0,
   '13': 0,
   '14': 0,
   '15': 1,
   '16': 0,
   '17': 0,
   '18': 0,
   '19': 0,
   '20': 0,
   '21': 0,
   '22': 0,
   '23': 0,
   '24': 1,
   '25': 0,
   '26': 1,
   '27': 0,
   '28': 0,
   '29': 0,
   '30': 0,
   '31': 0,
   '32': 0,
   '33': 0,
   '34': 0,
   '35': 0,
   '36': 1,
   '37': 0,
   '38': 0,
   '39': 0,
   '40': 0,
   '41': 0,
   '42': 1,
   '43': 0,
   '44': 0,
   '45': 0,
   '46': 0,
   '47': 0,
   '48': 0,
   '49': 0,
   '50': 0,
   '51': 0,
   '52': 0,
   '53': 0,
   '54': 0,
   '55': 0,
   '56': 0,
   '57': 0,
   '58': 1,
   '59': 0,
   '60': 0,
   '61': 0,
   '62': 0,
   '63': 1,
   '64': 0,
   '65': 0,
   '66': 1,
   '67': 0,
   '68': 0,
   '69': 0,
   '70': 0,
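
The `real` labels in this report are keyed by string indices. A small sketch, assuming the keys are always integer-like strings (the helper name `indexed_dict_to_list` is hypothetical), turns such a mapping back into an ordered list:

```python
def indexed_dict_to_list(indexed: dict) -> list:
    """Turn {'0': a, '1': b, ...} into [a, b, ...], ordered by integer index."""
    return [indexed[key] for key in sorted(indexed, key=int)]


# Toy excerpt, deliberately out of order; sorting by int keeps '10' after '3'.
real = {"0": 0, "2": 1, "1": 0, "10": 1, "3": 0}
labels = indexed_dict_to_list(real)
print(labels)
```

Sorting with `key=int` matters because a plain string sort would place `'10'` before `'2'`.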
   

# 3. MODEL PROPOSED BY LUIS TO IMPROVE PREDICTIONS

Section: **'Métricas XGBoost dejando Features más importantes'** (XGBoost metrics keeping the most important features)
Model Config:

    - DATA_SPLIT-RANDOM_STATE: 8
    - MODEL: XGBClassifier
    - MODEL-RANDOM_STATE: 8
    - MODEL_PARAMS: {
        "objective": 'binary:logistic',
        "nthread": 4
        }
    - X TRAIN (FEATURES): [
        ]
    - Y BALANCED?: YES
    - BALANCING_METHODOLOGY: WEIGHTED ('balanced')
    - GridSearchCV: YES
    - GridSearchCV_PARAMS: {
        "param_grid": {
            'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
            'n_estimators': [60, 100, 140, 180, 220],
            'learning_rate': [0.1, 0.01, 0.05]
            },
        "scoring": 'roc_auc',
        "cv": 10,
        "n_jobs": -1,
        "verbose": 1
        }

## 1. Generate 'X_train', 'X_test', 'y_train' and 'y_test' files into 'src/files/output/' from 'dataset_SCL_complete.csv'.
(or from the concatenation of the 'values with nan' and 'values without nan' results in the previous response)

In [54]:
body = {
    "data_filename": "output/dataset_SCL_complete.csv",
    "features_filter": [],
    "categorical_features": [
        "OPERA",
        "TIPOVUELO",
        "DIANOM",
        "MES",
        "temporada_alta",
        "periodo_dia",
    ],
    "numerical_features": [],
    "minmax_scaler_numerical_f": False,
    "label": "atraso_15",
    "shuffle_data": True,
    "shuffle_features": [
        "OPERA",
        "TIPOVUELO",
        "DIANOM",
        "MES",
        "temporada_alta",
        "periodo_dia",
        "atraso_15"
    ],
    "random_state": 8
}
r = requests.post(url=f"{host}files/train_test_split", json=body) 
r.json()

{'X_train': [{'OPERA_Aerolineas Argentinas': 0,
   'OPERA_Aeromexico': 0,
   'OPERA_Air Canada': 0,
   'OPERA_Air France': 0,
   'OPERA_Alitalia': 0,
   'OPERA_American Airlines': 0,
   'OPERA_Austral': 0,
   'OPERA_Avianca': 0,
   'OPERA_British Airways': 0,
   'OPERA_Copa Air': 0,
   'OPERA_Delta Air': 0,
   'OPERA_Gol Trans': 0,
   'OPERA_Grupo LATAM': 1,
   'OPERA_Iberia': 0,
   'OPERA_JetSmart SPA': 0,
   'OPERA_K.L.M.': 0,
   'OPERA_Lacsa': 0,
   'OPERA_Latin American Wings': 0,
   'OPERA_Oceanair Linhas Aereas': 0,
   'OPERA_Plus Ultra Lineas Aereas': 0,
   'OPERA_Qantas Airways': 0,
   'OPERA_Sky Airline': 0,
   'OPERA_United Airlines': 0,
   'TIPOVUELO_I': 0,
   'TIPOVUELO_N': 1,
   'DIANOM_Domingo': 0,
   'DIANOM_Jueves': 1,
   'DIANOM_Lunes': 0,
   'DIANOM_Martes': 0,
   'DIANOM_Miercoles': 0,
   'DIANOM_Sabado': 0,
   'DIANOM_Viernes': 0,
   'MES_1': 0,
   'MES_2': 0,
   'MES_3': 0,
   'MES_4': 0,
   'MES_5': 0,
   'MES_6': 0,
   'MES_7': 0,
   'MES_8': 0,
   'MES_9': 0,
  

## 2. Train the model with X_train and y_train and save it in 'src/models/'
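
Before launching the training call, it helps to estimate how much work the grid search implies. A minimal sketch, assuming the API forwards `param_grid` and `cv` to an exhaustive grid search (as scikit-learn's `GridSearchCV` performs), counts the candidate configurations and the cross-validation fits for the grid proposed for this model:

```python
from itertools import product

# Grid and fold count as proposed for LUIS's model.
param_grid = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10],
    "n_estimators": [60, 100, 140, 180, 220],
    "learning_rate": [0.1, 0.01, 0.05],
}
cv = 10

# An exhaustive search evaluates every combination of the listed values.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv  # one fit per candidate per fold
print(n_candidates, n_fits)  # 135 candidates, 1350 cross-validation fits
```

That is 9 × 5 × 3 = 135 candidates and 1350 fits, which explains why this request takes noticeably longer than training a single fixed configuration.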

In [55]:
body = {
    "X_train_filename": "dataset_SCL_complete-X_train.csv",
    "y_train_filename": "dataset_SCL_complete-y_train.csv",
    "model_name": "xgb",
    "destination_model_name": "LUIS_model",
    "model_custom_params": {
        "objective": 'binary:logistic',
        "nthread": 4
    },
    "grid_search_cv": True,
    "grid_search_cv_params": {
        "param_grid": {
            'max_depth': [2, 3, 4, 5, 6, 7, 8 ,9, 10],
            'n_estimators': [60, 100, 140, 180, 220],
            'learning_rate': [0.1, 0.01, 0.05]
        },
        "scoring": 'roc_auc',
        "cv": 10,
        "n_jobs": 10,
        "verbose": 1
    },
    "random_state":8,
    "balancing_method": 'balanced'
}
r = requests.post(url=f"{host}models/train_binary_classification_model", json=body) 
r.json()

"Success! Model 'LUIS_model.pkl' created in 'src/models/'."

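The `'balanced'` value of `balancing_method` presumably refers to class weights that are inversely proportional to class frequency, which is the heuristic scikit-learn uses for `class_weight="balanced"`: `n_samples / (n_classes * class_count)`. Whether the API applies exactly this formula is an assumption; the sketch below only illustrates the heuristic on toy labels, and `balanced_class_weights` is a hypothetical name.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels: 80% on time (0), 20% delayed (1).
y_train = [0, 0, 0, 0, 1]
weights = balanced_class_weights(y_train)

# Per-sample weights that a classifier could receive at fit time.
sample_weight = [weights[label] for label in y_train]
print(weights)
```

The minority class receives the larger weight, so errors on delayed flights cost more during training while the total weight still equals the number of samples.
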
## 3. Predict with LUIS's model

In [56]:
body = {
    "model_filename": "LUIS_model",
    "X_test_filename": "dataset_SCL_complete-X_test.csv"
}
r = requests.post(url=f"{host}models/predict", json=body) 
r.json()

[1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,


## 4. Evaluate LUIS's model

In [58]:
body = {
    "y_real_filename": "dataset_SCL_complete-y_test.csv",
    "y_predicted_filename": "LUIS_model-predictions.csv"
}
r = requests.get(url=f"{host}models/classification_report", params=body) 
r.json()

{'accuracy': 0.6407055269237605,
 'recall': 0.6244897959183674,
 'f1_score': 0.39145157649183543,
 'roc_auc_score': 0.6344386503442898,
 'data': {'real': {'0': 0,
   '1': 0,
   '2': 1,
   '3': 0,
   '4': 0,
   '5': 0,
   '6': 0,
   '7': 0,
   '8': 1,
   '9': 0,
   '10': 0,
   '11': 0,
   '12': 0,
   '13': 0,
   '14': 0,
   '15': 1,
   '16': 0,
   '17': 0,
   '18': 0,
   '19': 0,
   '20': 0,
   '21': 1,
   '22': 0,
   '23': 0,
   '24': 0,
   '25': 0,
   '26': 0,
   '27': 0,
   '28': 0,
   '29': 0,
   '30': 1,
   '31': 0,
   '32': 0,
   '33': 1,
   '34': 0,
   '35': 0,
   '36': 0,
   '37': 0,
   '38': 0,
   '39': 0,
   '40': 0,
   '41': 0,
   '42': 0,
   '43': 0,
   '44': 0,
   '45': 1,
   '46': 1,
   '47': 0,
   '48': 0,
   '49': 0,
   '50': 1,
   '51': 1,
   '52': 0,
   '53': 0,
   '54': 0,
   '55': 0,
   '56': 0,
   '57': 0,
   '58': 0,
   '59': 0,
   '60': 0,
   '61': 0,
   '62': 0,
   '63': 0,
   '64': 0,
   '65': 0,
   '66': 0,
   '67': 0,
   '68': 1,
   '69': 0,
   '70': 0,
   '71

# Conclusion

Although at first glance the proposed model has *worse* accuracy (64.07% versus 81.94%), I believe significant progress has been made:
* 1) Recall goes from 2.04% to 62.45%.

* 2) Even more important to me: the ROC AUC goes from 50.92% to 63.44%. This means that the trade-off between the True Positive Rate (TPR) and the True Negative Rate (TNR) improves substantially, reducing the bias toward the majority class caused by the unbalanced data.
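
The link between these metrics can be made explicit. Assuming the report is computed from hard 0/1 predictions (as the prediction lists returned by the API suggest), the ROC curve has a single operating point and its AUC equals the mean of TPR and TNR. The sketch below computes the four reported metrics from confusion counts and inverts that identity to recover the TNR implied by the reported recall and ROC AUC; both function names are hypothetical.

```python
def report_from_counts(tp, fp, tn, fn):
    """Compute accuracy, recall, F1 and hard-label ROC AUC from confusion counts."""
    tpr = tp / (tp + fn)        # recall, or True Positive Rate
    tnr = tn / (tn + fp)        # True Negative Rate (1 - False Positive Rate)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tpr,
        "f1_score": 2 * precision * tpr / (precision + tpr),
        "roc_auc_score": (tpr + tnr) / 2,  # AUC of the single-threshold ROC curve
    }


def implied_tnr(recall, roc_auc):
    """Invert roc_auc = (recall + tnr) / 2 to recover the True Negative Rate."""
    return 2 * roc_auc - recall


# Reported values copied from the two classification reports above.
tnr_juan = implied_tnr(0.020423048869438368, 0.5092329976611394)
tnr_luis = implied_tnr(0.6244897959183674, 0.6344386503442898)
print(round(tnr_juan, 4), round(tnr_luis, 4))
```

Under that assumption, JUAN's model keeps almost every on-time flight correctly classified but detects very few delays, while LUIS's model gives up part of the TNR in exchange for a much higher TPR.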