# **Water QUALITY Prediction**
![img](https://agqlabs.es/tienda/wp-content/uploads/2020/09/análisis-calidad-del-agua.jpg)
## *TABLE OF CONTENTS:*
---
- [0. INTRODUCTION](#0)
- [1. EXPLORATORY DATA ANALYSIS (EDA):](#1)
    - [1.1. Dataset preparation](#12)
    - [1.2. Visual analysis](#13)
    - [1.3. Statistical analysis](#14)
- [2. MACHINE LEARNING](#2)
    - [2.1. Data preparation and cleaning](#21)
    - [2.2. Feature Engineering](#22)
    - [2.3. Modeling](#23)
- [3. RESULTS](#3)
    - [3.1. Visualization and reporting of results](#31)
    - [3.2. Building a pipeline for the automated workflow](#32)
---
### *Contact:*
___
* Email: ***carla.glezz@gmail.com***
* Linkedin: ***https://www.linkedin.com/in/mariacarlagonzalezgonzalez/***
---
---


# **0. Introduction**<a id='0'></a>

In [7]:
import pandas as pd
import os 


The introduction is in the notebook <a href='src/0_Introduccion.ipynb'>Intro</a>.

# **1. Exploratory data analysis**<a id='1'></a>

The exploratory analysis of the data is carried out in the notebook <a href='src/1_EDA.ipynb'>EDA</a>.

# **2. Machine Learning**<a id='2'></a>

The <a href='src/2a_ML_Baseline.ipynb'>Machine Learning (baseline)</a> notebook covers:

- A first pass over the supervised classification models on the untreated dataset. (<a href='src/model/model_metrics/baseline_metrics.csv'>Metrics obtained from the baseline models</a>)

- Hyperparameter tuning for all the classification models.
    - The models with the best hyperparameters found are saved.
    - Metrics are computed for each model and stored alongside the path to its saved `.pkl` file.
    (<a href='src/model/model_metrics/metrics.csv'>Metrics obtained from the tuned models</a>)
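
The tuning workflow described above (try a grid of parameters, keep the best combination, save it, and record its path next to its metrics) can be sketched as follows. This is a minimal, standard-library illustration and not the notebook's code: the toy threshold rule, the data, and the name `ToyThresholdClassifier` are hypothetical stand-ins for the scikit-learn estimators, and only the column names `params_tried`, `best_params` and `model_path` come from the metrics tables.

```python
import itertools
import pickle
import tempfile
from pathlib import Path

# Hypothetical toy data: one feature and a binary label.
X = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.8, 0.9, 0.95]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def predict(params, xs):
    """Toy stand-in for an estimator: a single threshold rule."""
    if params["positive_above"]:
        return [int(x >= params["threshold"]) for x in xs]
    return [int(x < params["threshold"]) for x in xs]


def precision(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp else 0.0


def grid_search(param_grid, xs, ys):
    """Score every combination in the grid and keep the best one."""
    keys = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = precision(ys, predict(params, xs))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


params_tried = {"threshold": [0.3, 0.5, 0.65], "positive_above": [True, False]}
best_params, best_score = grid_search(params_tried, X, y)

# Persist the winning configuration and remember where it was written.
model_dir = Path(tempfile.mkdtemp()) / "model"
model_dir.mkdir()
model_path = model_dir / "ToyThresholdClassifier.pkl"
with open(model_path, "wb") as fh:
    pickle.dump(best_params, fh)

# One row of a metrics table, linking the scores to the saved file.
metrics_row = {
    "model": "ToyThresholdClassifier",
    "params_tried": params_tried,
    "best_params": best_params,
    "Precision": best_score,
    "model_path": str(model_path),
}
```

Keeping `model_path` in the same row as the scores means a model chosen from the table can be reloaded directly with `pickle.load`. For simplicity the sketch scores on the training data, whereas a real search would score on held-out folds.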


In [6]:
pd.read_csv(os.getcwd()+'/src/model/model_metrics/baseline_metrics.csv',sep=';').sort_values('Precision',ascending=False)

| Unnamed: 0 | model | ACC | Precision | Recall | F1 | ROC | Jaccard |
|---|---|---|---|---|---|---|---|
| 5 | BaggingClassifier | 0.96875 | 0.909091 | 0.778443 | 0.83871 | 0.884686 | 0.722222 |
| 9 | XGBClassifier | 0.97375 | 0.903226 | 0.838323 | 0.869565 | 0.913928 | 0.769231 |
| 4 | RandomForestClassifier | 0.96125 | 0.894737 | 0.712575 | 0.793333 | 0.851403 | 0.657459 |
| 7 | GradientBoostingClassifier | 0.965 | 0.888112 | 0.760479 | 0.819355 | 0.874657 | 0.693989 |
| 10 | VotingClassifier | 0.9575 | 0.877863 | 0.688623 | 0.771812 | 0.838729 | 0.628415 |
| 6 | AdaBoostClassifier | 0.9425 | 0.804878 | 0.592814 | 0.682759 | 0.788033 | 0.518325 |
| 2 | DecisionTreeClassifier | 0.959375 | 0.786517 | 0.838323 | 0.811594 | 0.905903 | 0.682927 |
| 0 | LogisticRegression | 0.90875 | 0.632911 | 0.299401 | 0.406504 | 0.639582 | 0.255102 |
| 11 | LinearDiscriminantAnalysis | 0.903125 | 0.546154 | 0.42515 | 0.478114 | 0.691989 | 0.314159 |
| 3 | ExtraTreeClassifier | 0.8975 | 0.508571 | 0.532934 | 0.520468 | 0.73646 | 0.351779 |
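
To make the metric columns concrete, the sketch below derives them from the four cells of a binary confusion matrix. The counts used (130 true positives, 13 false positives, 37 false negatives, 1420 true negatives) are not taken from the notebooks; they are back-calculated from the `BaggingClassifier` baseline row and happen to be consistent with it. The `ROC` formula assumes the score was computed on hard 0/1 predictions, in which case ROC AUC reduces to the mean of recall and specificity; that reading is an inference from the numbers, not something the source states.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the table's metric columns from binary confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "ACC": (tp + tn) / (tp + fp + fn + tn),
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        # ROC AUC of hard class predictions equals this average.
        "ROC": (recall + specificity) / 2,
        "Jaccard": tp / (tp + fp + fn),
    }


# Counts inferred from the BaggingClassifier baseline row (an assumption).
row = classification_metrics(tp=130, fp=13, fn=37, tn=1420)
for name, value in row.items():
    print(f"{name}: {value:.6f}")
```

With these counts the six values agree with the `BaggingClassifier` row to six decimals. An accuracy near 0.97 alongside a recall of 0.78 reflects how few positive samples there are relative to negatives.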


In [17]:
pd.read_csv(os.getcwd()+'/src/model/model_metrics/metrics.csv',sep=';').sort_values('Precision',ascending=False)

| Unnamed: 0 | model | params_tried | best_params | ACC | Precision | Recall | F1 | ROC | Jaccard | model_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 11 | GradientBoostingClassifier | `{'loss': ['log_loss'], 'n_estimators': [100, 2...` | `{'learning_rate': 0.01, 'loss': 'log_loss', 'm...` | 0.92 | 0.97561 | 0.239521 | 0.384615 | 0.619412 | 0.238095 | model/GradientBoostingClassifier_1.pkl |
| 10 | XGBClassifier | `{'nthread': [4], 'objective': ['binary:logisti...` | `{'colsample_bytree': 0.7, 'learning_rate': 0.0...` | 0.97375 | 0.937063 | 0.802395 | 0.864516 | 0.898057 | 0.761364 | model/XGBClassifier.pkl |
| 6 | BaggingClassifier | `{'n_estimators': [10, 20, 30, 50, 100], 'max_s...` | `{'max_samples': 0.5, 'n_estimators': 100}` | 0.9675 | 0.919708 | 0.754491 | 0.828947 | 0.873407 | 0.707865 | model/BaggingClassifier_1.pkl |
| 8 | GradientBoostingClassifier | `{'n_estimators': [10, 20, 30, 50, 100], 'max_d...` | `{'criterion': 'mse', 'loss': 'exponential', 'm...` | 0.9725 | 0.918367 | 0.808383 | 0.859873 | 0.900005 | 0.75419 | model/GradientBoostingClassifier.pkl |
| 4 | RandomForestClassifier | `{'n_estimators': array([ 10, 25, 41, 56, 7...` | `{'class_weight': None, 'criterion': 'entropy',...` | 0.9725 | 0.912752 | 0.814371 | 0.860759 | 0.90265 | 0.755556 | model/RandomForestClassifier.pkl |
| 7 | AdaBoostClassifier | `{'n_estimators': [10, 20, 30, 50, 100]}` | `{'n_estimators': 100}` | 0.939375 | 0.777778 | 0.586826 | 0.668942 | 0.783643 | 0.502564 | model/AdaBoostClassifier.pkl |
| 1 | KNeighborsClassifier | `{'n_neighbors': [3, 5, 7, 9, 11, 13, 15], 'wei...` | `{'algorithm': 'ball_tree', 'leaf_size': 20, 'n...` | 0.899375 | 0.615385 | 0.095808 | 0.165803 | 0.544415 | 0.090395 | model/KNeighborsClassifier.pkl |
| 5 | BaggingClassifier | `{'base_estimator': [DecisionTreeClassifier(cla...` | `{'base_estimator': DecisionTreeClassifier(clas...` | 0.91625 | 0.562738 | 0.886228 | 0.688372 | 0.902988 | 0.524823 | model/BaggingClassifier.pkl |
| 2 | DecisionTreeClassifier | `{'criterion': ['log_loss', 'gini', 'entropy'],...` | `{'class_weight': 'balanced', 'criterion': 'gin...` | 0.9125 | 0.554217 | 0.826347 | 0.663462 | 0.874444 | 0.496403 | model/DecisionTreeClassifier.pkl |
| 9 | SVC | `[{'C': [1, 10, 100, 1000], 'kernel': ['linear'...` | `{'C': 1000, 'class_weight': 'balanced', 'kerne...` | 0.870625 | 0.43865 | 0.856287 | 0.580122 | 0.864292 | 0.408571 | model/SVC.pkl |


The <a href='src/2b_ML_BalancedData.ipynb'>Machine Learning</a> notebook covers:

- Feature selection with ***RFE***. (<a href='src/data/processed/data_featureselected.csv'>Metrics of the models with the reduced feature set</a>)

- Data balancing:

    - Oversampling method: ***SMOTE***

    - Combined method: ***SMOTEENN***

- Modeling with the estimators whose metrics were most satisfactory:

    - Baseline on the new dataset

    - Modeling with tuned hyperparameters on the new dataset

    - Modeling with all the features, adding balancing and hyperparameter tuning

- Additionally, the ***Perceptron*** neural network model is tested (`MLPClassifier` in the metrics). (<a href='src/model/model_metrics/DL_metrics.csv'>Perceptron metrics</a>)
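
As background on the balancing step: SMOTE oversamples the minority class by creating synthetic points on the segment between a minority sample and one of its nearest minority neighbours, and SMOTEENN follows that oversampling with an Edited Nearest Neighbours cleaning pass. The sketch below illustrates only the interpolation idea in plain Python. It is a simplified stand-in, not the library implementation the notebook presumably relies on (imbalanced-learn is an assumption), and the function name `smote_like_oversample` and the data are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical imbalanced data: six majority points, three minority points.
majority = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
            (0.15, 0.25), (0.25, 0.05), (0.05, 0.3)]
minority = [(0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Resampling of this kind is normally applied to the training split only, so that the test metrics are still measured on real samples.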

In [12]:
pd.read_csv(os.getcwd()+'/src/model/model_metrics/baseline_fs_metrics.csv',sep=';').sort_values('Precision',ascending=False)

| Unnamed: 0 | model | ACC | Precision | Recall | F1 | ROC | Jaccard |
|---|---|---|---|---|---|---|---|
| 4 | RandomForestClassifier | 0.95875 | 0.917355 | 0.664671 | 0.770833 | 0.828846 | 0.627119 |
| 9 | XGBClassifier | 0.9625 | 0.902256 | 0.718563 | 0.8 | 0.854745 | 0.666667 |
| 10 | VotingClassifier | 0.9575 | 0.896 | 0.670659 | 0.767123 | 0.830793 | 0.622222 |
| 5 | BaggingClassifier | 0.9525 | 0.858268 | 0.652695 | 0.741497 | 0.820067 | 0.589189 |
| 7 | GradientBoostingClassifier | 0.951875 | 0.846154 | 0.658683 | 0.740741 | 0.822363 | 0.588235 |
| 6 | AdaBoostClassifier | 0.930625 | 0.75 | 0.502994 | 0.602151 | 0.741727 | 0.430769 |
| 0 | LogisticRegression | 0.903125 | 0.625 | 0.179641 | 0.27907 | 0.58354 | 0.162162 |
| 2 | DecisionTreeClassifier | 0.924375 | 0.618557 | 0.718563 | 0.66482 | 0.833461 | 0.497925 |
| 11 | LinearDiscriminantAnalysis | 0.89125 | 0.472 | 0.353293 | 0.40411 | 0.653618 | 0.253219 |
| 3 | ExtraTreeClassifier | 0.885625 | 0.457895 | 0.520958 | 0.487395 | 0.72454 | 0.322222 |


In [11]:
pd.read_csv(os.getcwd()+'/src/model/model_metrics/fs_metrics.csv',sep=';').sort_values('Precision',ascending=False)

| Unnamed: 0 | model | params_tried | best_params | ACC | Precision | Recall | F1 | ROC | Jaccard | model_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | `{'n_estimators': array([ 10, 25, 41, 56, 7...` | `{'class_weight': None, 'criterion': 'entropy',...` | 0.9575 | 0.915966 | 0.652695 | 0.762238 | 0.822858 | 0.615819 | model/RandomForestClassifier_1.pkl |
| 1 | BaggingClassifier | `{'n_estimators': [10, 20, 30, 50, 100], 'max_s...` | `{'max_samples': 0.5, 'n_estimators': 100}` | 0.956875 | 0.908333 | 0.652695 | 0.759582 | 0.822509 | 0.61236 | model/BaggingClassifier_2.pkl |
| 2 | XGBClassifier | `{'nthread': [4], 'objective': ['binary:logisti...` | `{'colsample_bytree': 1.0, 'learning_rate': 0.0...` | 0.955 | 0.88 | 0.658683 | 0.753425 | 0.824108 | 0.604396 | model/XGBClassifier_1.pkl |
| 3 | GradientBoostingClassifier | `{'n_estimators': [10, 20, 30, 50, 100], 'max_d...` | `{'criterion': 'mse', 'loss': 'log_loss', 'max_...` | 0.953125 | 0.848485 | 0.670659 | 0.749164 | 0.828351 | 0.59893 | model/GradientBoostingClassifier_2.pkl |


In [16]:
pd.read_csv(os.getcwd()+'/src/model/model_metrics/metrics_balanced.csv',sep=';').sort_values('Precision',ascending=False)

| Unnamed: 0 | model | params_tried | best_params | ACC | Precision | Recall | F1 | ROC | Jaccard | model_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | GradientBoostingClassifier | `[{'n_estimators': [10, 20, 30, 50, 100], 'max_...` | `{'learning_rate': 0.1, 'loss': 'log_loss', 'ma...` | 0.941875 | 0.654167 | 0.94012 | 0.771499 | 0.9411 | 0.628 | model/GradientBoostingClassifier_3.pkl |
| 1 | RandomForestClassifier | `{'n_estimators': array([ 10, 25, 41, 56, 7...` | `{'class_weight': 'balanced', 'criterion': 'ent...` | 0.93875 | 0.638554 | 0.952096 | 0.764423 | 0.944645 | 0.618677 | model/RandomForestClassifier_2.pkl |
| 0 | XGBClassifier | `{'nthread': [4], 'objective': ['binary:logisti...` | `{'colsample_bytree': 0.6, 'learning_rate': 0.0...` | 0.9275 | 0.593407 | 0.97006 | 0.736364 | 0.9463 | 0.582734 | model/XGBClassifier_2.pkl |
| 2 | BaggingClassifier | `{'n_estimators': [10, 20, 30, 50, 100], 'max_s...` | `{'max_samples': 0.5, 'n_estimators': 100}` | 0.921875 | 0.575 | 0.964072 | 0.720358 | 0.940515 | 0.562937 | model/BaggingClassifier_3.pkl |


In [13]:
pd.read_csv(os.getcwd()+'/src/model/model_metrics/DL_metrics.csv',sep=';').sort_values('Precision',ascending=False)

| Unnamed: 0 | model | ACC | Precision | Recall | F1 | ROC | Jaccard |
|---|---|---|---|---|---|---|---|
| 0 | MLPClassifier | 0.946875 | 0.788732 | 0.670659 | 0.724919 | 0.824862 | 0.568528 |
| 1 | MLPClassifier | 0.92 | 0.719101 | 0.383234 | 0.5 | 0.682894 | 0.333333 |


In [18]:
pd.read_csv(os.getcwd()+'/src/model/model_metrics/DL_metrics_hyp.csv',sep=';').sort_values('Precision',ascending=False)

| Unnamed: 0 | model | params_tried | best_params | ACC | Precision | Recall | F1 | ROC | Jaccard | model_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MLPClassifier | `{'activation': ['relu', 'logistic'], 'early_st...` | `{'activation': 'relu', 'early_stopping': True,...` | 0.936875 | 0.858696 | 0.473054 | 0.610039 | 0.731991 | 0.438889 | model/MLPClassifier.pkl |
| 2 | MLPClassifier | `{'activation': ['relu', 'logistic'], 'early_st...` | `{'activation': 'relu', 'early_stopping': True,...` | 0.94 | 0.784 | 0.586826 | 0.671233 | 0.783992 | 0.505155 | model/MLPClassifier_2.pkl |
| 0 | MLPClassifier | `{'activation': ['relu'], 'early_stopping': [Tr...` | `{'activation': 'relu', 'early_stopping': True,...` | 0.93375 | 0.760684 | 0.532934 | 0.626761 | 0.756697 | 0.45641 | model/MLPClassifier_1.pkl |


# **3. Results and conclusions**<a id='3'></a>

The <a href='src/3_Resultados.ipynb'>Results</a> notebook covers:

- Visualization of the results of the different models
- Selection of the model based on its metrics

The selected model is `GradientBoostingClassifier()`.
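
Which metric decided the choice is not stated here, so the following is only a sketch of how a metrics file could be ranked by one column. It uses the standard `csv` module rather than pandas to stay dependency-free. The rows are a subset of columns copied from the balanced-data metrics table, and the helper name `rank_models` is hypothetical.

```python
import csv
import io

# A few columns from the balanced-data metrics table, ';'-separated
# like the CSV files read in the cells above.
METRICS_CSV = """model;Precision;Recall;F1
GradientBoostingClassifier;0.654167;0.94012;0.771499
RandomForestClassifier;0.638554;0.952096;0.764423
XGBClassifier;0.593407;0.97006;0.736364
BaggingClassifier;0.575;0.964072;0.720358
"""


def rank_models(csv_text, metric):
    """Return the rows sorted from best to worst on the chosen metric."""
    rows = list(csv.DictReader(io.StringIO(csv_text), delimiter=";"))
    return sorted(rows, key=lambda r: float(r[metric]), reverse=True)


best_by_precision = rank_models(METRICS_CSV, "Precision")[0]["model"]
best_by_recall = rank_models(METRICS_CSV, "Recall")[0]["model"]
```

On these rows the ranking depends on the metric: `GradientBoostingClassifier` leads on precision and F1, while `XGBClassifier` leads on recall, so the selection criterion should be fixed before comparing models.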