# **WATER QUALITY Prediction**
![img](https://agqlabs.es/tienda/wp-content/uploads/2020/09/análisis-calidad-del-agua.jpg)
## *CONTENTS:*
---
- [0. INTRODUCTION](#0)
- [1. EXPLORATORY DATA ANALYSIS (EDA)](#1)
    - [1.1. Database preparation](#12)
    - [1.2. Visual analysis](#13)
    - [1.3. Statistical analysis](#14)
- [2. MACHINE LEARNING](#2)
    - [2.1. Data preparation and cleaning](#21)
    - [2.2. Feature Engineering](#22)
    - [2.3. Modeling](#23)
- [3. RESULTS](#3)
    - [3.1. Visualization and reporting of results](#31)
    - [3.2. Building a pipeline for the automated workflow](#32)
---
### *Contact:*
___
* Email: ***carla.glezz@gmail.com***
* LinkedIn: ***https://www.linkedin.com/in/mariacarlagonzalezgonzalez/***
---
---


# **0. Introduction**<a id='0'></a>

    import os
    import pandas as pd


The introduction is in the notebook: <a href='src/0_Introduccion.ipynb'>Intro</a>

# **1. Exploratory data analysis**<a id='1'></a>

An exploratory analysis of the data is carried out in the notebook: <a href='src/1_EDA.ipynb'>EDA</a>

# **2. Machine Learning**<a id='2'></a>

Several supervised classification models are tested in the notebook: <a href='src/2a_ML_Baseline.ipynb'>Machine Learning (baseline)</a>

That notebook saves the comparison of the model metrics to: <a href='src/model/model_metrics/baseline_metrics.csv'>baseline_metrics.csv</a>

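    # Load the baseline metrics and sort the models by precision, best first
    baseline_path = os.path.join(os.getcwd(), 'src', 'model', 'model_metrics', 'baseline_metrics.csv')
    pd.read_csv(baseline_path, sep=';').sort_values('Precision', ascending=False)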

| Unnamed: 0 | model | ACC | Precision | Recall | F1 | ROC | Jaccard |
|---|---|---|---|---|---|---|---|
| 5 | BaggingClassifier | 0.96875 | 0.909091 | 0.778443 | 0.83871 | 0.884686 | 0.722222 |
| 9 | XGBClassifier | 0.97375 | 0.903226 | 0.838323 | 0.869565 | 0.913928 | 0.769231 |
| 4 | RandomForestClassifier | 0.96125 | 0.894737 | 0.712575 | 0.793333 | 0.851403 | 0.657459 |
| 7 | GradientBoostingClassifier | 0.965 | 0.888112 | 0.760479 | 0.819355 | 0.874657 | 0.693989 |
| 10 | VotingClassifier | 0.9575 | 0.877863 | 0.688623 | 0.771812 | 0.838729 | 0.628415 |
| 6 | AdaBoostClassifier | 0.9425 | 0.804878 | 0.592814 | 0.682759 | 0.788033 | 0.518325 |
| 2 | DecisionTreeClassifier | 0.959375 | 0.786517 | 0.838323 | 0.811594 | 0.905903 | 0.682927 |
| 0 | LogisticRegression | 0.90875 | 0.632911 | 0.299401 | 0.406504 | 0.639582 | 0.255102 |
| 11 | LinearDiscriminantAnalysis | 0.903125 | 0.546154 | 0.42515 | 0.478114 | 0.691989 | 0.314159 |
| 3 | ExtraTreeClassifier | 0.8975 | 0.508571 | 0.532934 | 0.520468 | 0.73646 | 0.351779 |
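The columns of the baseline table are not independent of each other. As a minimal sketch, assuming they carry their usual binary-classification meanings, every metric can be derived from the four confusion-matrix counts. The helper `binary_metrics` and the counts passed to it are hypothetical: the confusion matrix is not given in this document, and the counts were chosen only because they reproduce the BaggingClassifier row. That the `ROC` column was computed on hard 0/1 predictions is likewise an inference from the numbers, not something the notebook states here.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive the table's metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "ACC": (tp + tn) / total,
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        # ROC AUC of hard 0/1 predictions is the mean of TPR and TNR.
        "ROC": (recall + specificity) / 2,
        "Jaccard": tp / (tp + fp + fn),
    }


# Hypothetical counts (an assumption, not data from the notebook), chosen
# because they reproduce the BaggingClassifier row of the baseline table.
metrics = binary_metrics(tp=130, fp=13, fn=37, tn=1420)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")

# Jaccard and F1 are linked by J = F1 / (2 - F1).
f1 = metrics["F1"]
assert abs(metrics["Jaccard"] - f1 / (2 - f1)) < 1e-12
```

Because Jaccard is a monotonic function of F1, the two columns always rank the models in the same order, so only one of them needs to be read when comparing rows.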


A second metrics comparison, which also records the parameters tried, the best parameters found and the path of each saved model, is available in: <a href='src/model/model_metrics/metrics.csv'>metrics.csv</a>

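    # Load the metrics of the tuned models and sort them by precision, best first
    metrics_path = os.path.join(os.getcwd(), 'src', 'model', 'model_metrics', 'metrics.csv')
    pd.read_csv(metrics_path, sep=';').sort_values('Precision', ascending=False)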

| Unnamed: 0 | model | params_tried | best_params | ACC | Precision | Recall | F1 | ROC | Jaccard | model_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | XGBClassifier | `{'nthread': [4], 'objective': ['binary:logisti...` | `{'colsample_bytree': 0.7, 'learning_rate': 0.0...` | 0.97375 | 0.937063 | 0.802395 | 0.864516 | 0.898057 | 0.761364 | model/XGBClassifier.pkl |
| 6 | BaggingClassifier | `{'n_estimators': [10, 20, 30, 50, 100], 'max_s...` | `{'max_samples': 0.5, 'n_estimators': 100}` | 0.9675 | 0.919708 | 0.754491 | 0.828947 | 0.873407 | 0.707865 | model/BaggingClassifier_1.pkl |
| 8 | GradientBoostingClassifier | `{'n_estimators': [10, 20, 30, 50, 100], 'max_d...` | `{'criterion': 'mse', 'loss': 'exponential', 'm...` | 0.9725 | 0.918367 | 0.808383 | 0.859873 | 0.900005 | 0.75419 | model/GradientBoostingClassifier.pkl |
| 4 | RandomForestClassifier | `{'n_estimators': array([ 10, 25, 41, 56, 7...` | `{'class_weight': None, 'criterion': 'entropy',...` | 0.9725 | 0.912752 | 0.814371 | 0.860759 | 0.90265 | 0.755556 | model/RandomForestClassifier.pkl |
| 7 | AdaBoostClassifier | `{'n_estimators': [10, 20, 30, 50, 100]}` | `{'n_estimators': 100}` | 0.939375 | 0.777778 | 0.586826 | 0.668942 | 0.783643 | 0.502564 | model/AdaBoostClassifier.pkl |
| 1 | KNeighborsClassifier | `{'n_neighbors': [3, 5, 7, 9, 11, 13, 15], 'wei...` | `{'algorithm': 'ball_tree', 'leaf_size': 20, 'n...` | 0.899375 | 0.615385 | 0.095808 | 0.165803 | 0.544415 | 0.090395 | model/KNeighborsClassifier.pkl |
| 5 | BaggingClassifier | `{'base_estimator': [DecisionTreeClassifier(cla...` | `{'base_estimator': DecisionTreeClassifier(clas...` | 0.91625 | 0.562738 | 0.886228 | 0.688372 | 0.902988 | 0.524823 | model/BaggingClassifier.pkl |
| 2 | DecisionTreeClassifier | `{'criterion': ['log_loss', 'gini', 'entropy'],...` | `{'class_weight': 'balanced', 'criterion': 'gin...` | 0.9125 | 0.554217 | 0.826347 | 0.663462 | 0.874444 | 0.496403 | model/DecisionTreeClassifier.pkl |
| 9 | SVC | `[{'C': [1, 10, 100, 1000], 'kernel': ['linear'...` | `{'C': 1000, 'class_weight': 'balanced', 'kerne...` | 0.870625 | 0.43865 | 0.856287 | 0.580122 | 0.864292 | 0.408571 | model/SVC.pkl |
| 3 | ExtraTreeClassifier | `{'criterion': ['gini', 'entropy'], 'max_depth'...` | `{'class_weight': 'balanced', 'criterion': 'gin...` | 0.836875 | 0.385366 | 0.946108 | 0.54766 | 0.885126 | 0.377088 | model/ExtraTreeClassifier.pkl |
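Sorting by `Precision` is only one way to read this comparison. The sketch below copies the `model_path`, `Precision`, `Recall` and `F1` values from the table above and ranks the saved models by each metric in turn. The helper `rank` is hypothetical and written in plain Python for illustration; with pandas, the same result would come from changing the column passed to `sort_values`.

```python
# (model_path, Precision, Recall, F1) copied from the tuned-model table.
# model_path is used as the label because BaggingClassifier appears twice.
rows = [
    ("model/XGBClassifier.pkl", 0.937063, 0.802395, 0.864516),
    ("model/BaggingClassifier_1.pkl", 0.919708, 0.754491, 0.828947),
    ("model/GradientBoostingClassifier.pkl", 0.918367, 0.808383, 0.859873),
    ("model/RandomForestClassifier.pkl", 0.912752, 0.814371, 0.860759),
    ("model/AdaBoostClassifier.pkl", 0.777778, 0.586826, 0.668942),
    ("model/KNeighborsClassifier.pkl", 0.615385, 0.095808, 0.165803),
    ("model/BaggingClassifier.pkl", 0.562738, 0.886228, 0.688372),
    ("model/DecisionTreeClassifier.pkl", 0.554217, 0.826347, 0.663462),
    ("model/SVC.pkl", 0.43865, 0.856287, 0.580122),
    ("model/ExtraTreeClassifier.pkl", 0.385366, 0.946108, 0.54766),
]

COLUMN_INDEX = {"Precision": 1, "Recall": 2, "F1": 3}


def rank(rows, column, top=3):
    """Return the model_path of the `top` rows with the highest `column`."""
    index = COLUMN_INDEX[column]
    ordered = sorted(rows, key=lambda row: row[index], reverse=True)
    return [row[0] for row in ordered[:top]]


for column in COLUMN_INDEX:
    print(column, rank(rows, column))
```

Ranked by precision or F1, XGBClassifier comes first. Ranked by recall, the top three are ExtraTreeClassifier, the BaggingClassifier saved as `model/BaggingClassifier.pkl`, and SVC, all of which have much lower precision. Which ordering is the right one depends on whether false positives or false negatives are more costly for the application.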


# **3. Results and conclusions**<a id='3'></a>