# Drinking Water Potability Project

**Charles Serve-Catelin** - **Samuel Pujade** - **Mathieu Ract**

## Data Loading and Cleaning

In [4]:
import project
from importlib import reload
reload(project)

project.avoid_warnings()
project.load_data("./data/drinking_water_potability.csv", disp=False)
project.display_explanatory_variables(disp=False)
project.check_null_values(disp=False)
project.cleaning_dataset('delete') # 'mean' or 'delete'
project.cope_outliers('Q+1.5') # 'delete' or 'Q+1.5' or 'Q'

project.split_dataset(ratio=0.8, disp=False)
project.scaling_trainset()
project.set_metric('accuracy') # 'accuracy'  or 'f1_score'

project.fitting_KNN_model()
project.testing_KNN_model()

project.fitting_LR_model()
project.testing_LR_model()

project.fitting_RF_model()
project.testing_RF_model()

project.fitting_SVM_model()
project.testing_SVM_model()

project.fitting_XGboost_model()
project.testing_XGboost_model()

Sun Nov  7 22:05:38 2021 : Data loaded
Sun Nov  7 22:05:38 2021 : Null values found
Sun Nov  7 22:05:38 2021 : Null values removed
Sun Nov  7 22:05:38 2021 : Outliers handled
Sun Nov  7 22:05:38 2021 : Dataset split
Sun Nov  7 22:05:38 2021 : Training set scaled
Sun Nov  7 22:05:38 2021 : KNN model fitted
Sun Nov  7 22:05:38 2021 : KNN model tested
Accuracy KNN : 67.74 %

Sun Nov  7 22:05:38 2021 : LR model fitted
Sun Nov  7 22:05:38 2021 : LR model tested
Accuracy LR : 60.3 %

Sun Nov  7 22:05:38 2021 : RF model fitted
Sun Nov  7 22:05:38 2021 : RF model tested
Accuracy RF : 69.23 %

Sun Nov  7 22:05:38 2021 : SVM model fitted
Sun Nov  7 22:05:38 2021 : SVM model tested
Accuracy SVM : 70.72 %

Sun Nov  7 22:05:38 2021 : XGboost model fitted
Sun Nov  7 22:05:38 2021 : XGboost model tested
Accuracy XGboost : 66.25 %
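
The `project` module is not shown in this notebook, so what each helper does internally is an assumption. A minimal sketch of the split, scale, fit and score sequence with scikit-learn, run on synthetic data rather than the potability CSV (the sample size, feature count and `random_state` values are arbitrary), might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (hypothetical data, binary target).
X, y = make_classification(n_samples=500, n_features=9, random_state=0)

# 80 % train / 20 % test, mirroring ratio=0.8 in the cell above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Fit a default kNN classifier and score it on the held-out set.
knn = KNeighborsClassifier().fit(X_train_s, y_train)
acc = accuracy_score(y_test, knn.predict(X_test_s))
print(f"Accuracy KNN : {100 * acc:.2f} %")
```

The accuracy printed here says nothing about the water data; the sketch only illustrates the order of the steps.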



## Tuning kNN hyperparameters

We define a wide parameter grid for `RandomizedSearchCV` to sample from and a narrower one for `GridSearchCV` to search exhaustively:

In [17]:
param_grid_kNN = {'n_neighbors' : list(range(1, 31)), # Number of neighbors to use
    'weights': ['uniform', 'distance'], # Weight function used in prediction
    'leaf_size' : list(range(1, 51)), # Leaf size passed to BallTree or KDTree
    'p' : [1, 2]} # Power parameter for the Minkowski metric

param_grid_kNN_small = {'n_neighbors' : list(range(20, 30)), # Number of neighbors to use
    'weights': ['distance'], # Weight function used in prediction
    'leaf_size' : list(range(1, 11)), # Leaf size passed to BallTree or KDTree
    'p' : [2]} # Power parameter for the Minkowski metric

best_params_kNN_RS = project.tuning_kNN_hyperparameters(param_grid_kNN, 'RandomizedSearchCV')
best_params_kNN_GS = project.tuning_kNN_hyperparameters(param_grid_kNN_small, 'GridSearchCV')

# RandomizedSearchCV time (n_iter = 100): 6 s
# RandomizedSearchCV params: {'weights': 'distance', 'p': 2, 'n_neighbors': 28, 'leaf_size': 1}
# RandomizedSearchCV accuracy: 68.37 %
# GridSearchCV time: 36 s
# GridSearchCV params: {'weights': 'distance', 'p': 2, 'n_neighbors': 28, 'leaf_size': 1}
# GridSearchCV accuracy: 68.37 %


Fri Nov  5 17:31:59 2021 : Best hyperparameters found
Search duration: 5.43 seconds
Best hyperparameters with RandomizedSearchCV: {'weights': 'distance', 'p': 2, 'n_neighbors': 28, 'leaf_size': 1}
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Fri Nov  5 17:32:37 2021 : Best hyperparameters found
Search duration: 38.39 seconds
Best hyperparameters with GridSearchCV: {'leaf_size': 1, 'n_neighbors': 28, 'p': 2, 'weights': 'distance'}
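
The two grids differ greatly in size, which is why only the small one is searched exhaustively. The sketch below counts the candidates with scikit-learn's `ParameterGrid` and draws a random subset with `ParameterSampler`. `n_iter=100` comes from the comment in the cell above; `random_state=0` is an arbitrary choice, and how `project.tuning_kNN_hyperparameters` configures the search internally is not shown in the notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid_kNN = {'n_neighbors': list(range(1, 31)),
                  'weights': ['uniform', 'distance'],
                  'leaf_size': list(range(1, 51)),
                  'p': [1, 2]}
param_grid_kNN_small = {'n_neighbors': list(range(20, 30)),
                        'weights': ['distance'],
                        'leaf_size': list(range(1, 11)),
                        'p': [2]}

# An exhaustive search evaluates every combination in the grid.
n_full = len(ParameterGrid(param_grid_kNN))
n_small = len(ParameterGrid(param_grid_kNN_small))

# A randomized search draws a fixed number of combinations instead.
sampled = list(ParameterSampler(param_grid_kNN, n_iter=100, random_state=0))

print(n_full, n_small, len(sampled))  # 6000 100 100
```

The 100 combinations of the small grid match the "100 candidates" reported in the `GridSearchCV` log line.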


In [30]:
project.fitting_kNN_tuned_model(best_params_kNN_RS)
project.fitting_kNN_tuned_model(best_params_kNN_GS)

Sat Oct 30 18:07:44 2021 : best kNN model fitted and tested
Accuracy kNN : 68.37 %

Sat Oct 30 18:07:44 2021 : best kNN model fitted and tested
Accuracy kNN : 68.37 %



## Tuning Logistic Regression hyperparameters

We specify a single parameter grid that both search methods draw from:

In [75]:
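import numpy as np

# range() cannot step through floats, so the C grid uses np.logspace:
# 20 log-spaced values from 0.001 to 100.
# The C recorded below (1.0023...) equals 10**0.001, so that run probably used a different grid.
param_grid_LR = {'C': list(np.logspace(-3, 2, 20)),  # Inverse of regularization strength
    'penalty': ['l2'], # Norm of the penalty
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']} # Algorithm to use in the optimization problem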

best_params_LR_RS = project.tuning_LR_hyperparameters(param_grid_LR, 'RandomizedSearchCV')
best_params_LR_GS = project.tuning_LR_hyperparameters(param_grid_LR, 'GridSearchCV')

# RandomizedSearchCV time (n_iter = 100): 1 s
# RandomizedSearchCV params: {'solver': 'newton-cg', 'penalty': 'l2', 'C': 1.0023052380778996}
# RandomizedSearchCV accuracy: 64.38 %
# GridSearchCV time: 5 s
# GridSearchCV params: {'solver': 'newton-cg', 'penalty': 'l2', 'C': 1.0023052380778996}
# GridSearchCV accuracy: 64.38 %

Sat Oct 30 19:49:14 2021 : Best hyperparameters found
Search duration: 1.18 seconds
Best hyperparameters with RandomizedSearchCV: {'solver': 'newton-cg', 'penalty': 'l2', 'C': 1.0023052380778996}
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Sat Oct 30 19:49:19 2021 : Best hyperparameters found
Search duration: 5.23 seconds
Best hyperparameters with GridSearchCV: {'C': 1.0023052380778996, 'penalty': 'l2', 'solver': 'newton-cg'}
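
Regularization strengths are usually searched on a logarithmic scale. The sketch below uses `numpy.logspace`, whose first two arguments are exponents of 10 rather than the endpoints themselves. The bounds 0.001 and 100 and the count of 20 are taken from the cell above; the link to the logged `C` value is an inference from the numbers, not something the notebook states.

```python
import math

import numpy as np

# 20 log-spaced values from 10**-3 = 0.001 to 10**2 = 100.
C_values = np.logspace(-3, 2, 20)
print(C_values[0], C_values[-1], len(C_values))

# Passing the endpoints themselves as exponents gives a very different grid:
# its first value would be 10**0.001, which is close to the C reported in the log.
first = 10 ** 0.001
print(math.isclose(first, 1.0023052380778996, rel_tol=1e-9))
```

With 20 values of `C`, one penalty and five solvers, the grid holds 100 combinations, consistent with the "100 candidates" line in the log.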


In [76]:
project.fitting_LR_tuned_model(best_params_LR_RS)
project.fitting_LR_tuned_model(best_params_LR_GS)

Fitting and testing the best LR model ...
Sat Oct 30 19:49:29 2021 : best LR model fitted and tested
Accuracy LR : 64.38 %

Fitting and testing the best LR model ...
Sat Oct 30 19:49:29 2021 : best LR model fitted and tested
Accuracy LR : 64.38 %



## Tuning RF hyperparameters

As with kNN, we define a wide parameter grid for `RandomizedSearchCV` and a narrower one for `GridSearchCV`:

In [80]:
param_grid_RF = {'n_estimators' : list(range(200, 2000, 200)), # The number of trees in the forest
    'max_depth' : list(range(10, 110, 10)) + [None], # max number of levels in each decision tree
    'min_samples_split' : [2, 5, 10], # min number of data points placed in a node before the node is split
    'min_samples_leaf' : [1, 2, 4], # min number of data points allowed in a leaf node
    'bootstrap' : [True, False]} # method for sampling data points (with or without replacement)

param_grid_RF_small = {'n_estimators' : list(range(1000, 1400, 20)), # The number of trees in the forest
    'max_depth' : list(range(50, 150, 10)) + [None], # max number of levels in each decision tree
    'min_samples_split' : [5], # min number of data points placed in a node before the node is split
    'min_samples_leaf' : [4], # min number of data points allowed in a leaf node
    'bootstrap' : [True]} # method for sampling data points (with or without replacement)

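# The RandomizedSearchCV run is skipped here (about 915 s); the parameters recorded from an earlier run are reused.
# best_params_RF_RS = project.tuning_RF_hyperparameters(param_grid_RF, 'RandomizedSearchCV')
best_params_RF_RS = {'n_estimators': 1200, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': 100, 'bootstrap': True}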
best_params_RF_GS = project.tuning_RF_hyperparameters(param_grid_RF_small, 'GridSearchCV')

# RandomizedSearchCV time (n_iter = 100): 915 s
# RandomizedSearchCV params: {'n_estimators': 1200, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_depth': 100, 'bootstrap': True}
# RandomizedSearchCV accuracy: 68.85 %
# GridSearchCV time: 6170 s
# GridSearchCV params: {'bootstrap': True, 'max_depth': 90, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 1020}
# GridSearchCV accuracy: 68.21 %

Fitting 5 folds for each of 220 candidates, totalling 1100 fits
Sun Oct 31 15:16:49 2021 : Best hyperparameters found
Search duration: 6169.54 seconds
Best hyperparameters with GridSearchCV: {'bootstrap': True, 'max_depth': 90, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 1020}
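
The cost of an exhaustive search is the number of grid combinations multiplied by the number of cross-validation folds. A small standard-library sketch reproduces the figures in the log. The fold count and the duration are copied from the log output; the per-fit figure is only an average of wall-clock time and would understate the cost of one fit if the search ran in parallel, which the notebook does not say.

```python
from math import prod

param_grid_RF_small = {
    'n_estimators': list(range(1000, 1400, 20)),
    'max_depth': list(range(50, 150, 10)) + [None],
    'min_samples_split': [5],
    'min_samples_leaf': [4],
    'bootstrap': [True],
}
n_folds = 5                # "Fitting 5 folds" in the log
search_seconds = 6169.54   # search duration reported in the log

# One candidate per combination of values: 20 * 11 * 1 * 1 * 1.
n_candidates = prod(len(values) for values in param_grid_RF_small.values())
n_fits = n_candidates * n_folds
seconds_per_fit = search_seconds / n_fits

print(n_candidates, n_fits, round(seconds_per_fit, 2))  # 220 1100 5.61
```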


In [81]:
project.fitting_testing_best_RF_model(best_params_RF_RS)
project.fitting_testing_best_RF_model(best_params_RF_GS)

Sun Oct 31 15:29:16 2021 : best RF model fitted and tested
Accuracy RF : 68.21 %

Sun Oct 31 15:29:21 2021 : best RF model fitted and tested
Accuracy RF : 68.21 %



## Tuning SVM hyperparameters

We specify a parameter grid for `GridSearchCV` to search exhaustively:

In [None]:
param_grid_SVM = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid'], # Kernel type
    'C': [0.1, 1, 10, 100], # Regularization parameter
    'gamma': [1, 0.1, 0.01, 0.001]} # Kernel coefficient for 'rbf', 'poly' and 'sigmoid'

# best_params_SVM_RS = project.tuning_SVM_hyperparameters(param_grid_SVM, 'RandomizedSearchCV')
best_params_SVM_GS = project.tuning_SVM_hyperparameters(param_grid_SVM, 'GridSearchCV')

Fitting 5 folds for each of 64 candidates, totalling 320 fits
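
In scikit-learn's `SVC`, `gamma` has no effect with the linear kernel, so a flat grid refits the same linear model once for every `gamma` value. One way to avoid this, sketched below, is to pass a list of grids, a form that `GridSearchCV` and `ParameterGrid` both accept. Whether `project.tuning_SVM_hyperparameters` forwards such a list unchanged is an assumption, and `split_grid` is a hypothetical name.

```python
from sklearn.model_selection import ParameterGrid

flat_grid = {'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
             'C': [0.1, 1, 10, 100],
             'gamma': [1, 0.1, 0.01, 0.001]}

# Same search space, but gamma is only varied for the kernels that use it.
split_grid = [
    {'kernel': ['linear'],
     'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf', 'poly', 'sigmoid'],
     'C': [0.1, 1, 10, 100],
     'gamma': [1, 0.1, 0.01, 0.001]},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
print(n_flat, n_split)  # 64 52
```

With 5 folds this would reduce the search from 320 fits to 260.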


In [None]:
# project.fitting_testing_best_SVM_model(best_params_SVM_RS)
project.fitting_testing_best_SVM_model(best_params_SVM_GS)

## Tuning XGBoost hyperparameters

Here too, we specify a parameter grid for `GridSearchCV` to search exhaustively:

In [None]:
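param_grid_XGboost = {'min_child_weight': [1, 5, 10], # Minimum sum of instance weights needed in a child
                      'gamma': [0.5, 1, 1.5, 2, 5], # Minimum loss reduction required to split a leaf further
                      'subsample': [0.6, 0.8, 1.0], # Fraction of training rows sampled for each tree
                      'colsample_bytree': [0.6, 0.8, 1.0], # Fraction of features sampled for each tree
                      'max_depth': [3, 4, 5]} # Maximum depth of a tree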

best_params_XGboost_GS = project.tuning_XGboost_hyperparameters(param_grid_XGboost, 'GridSearchCV')

# Search duration: 1827.81 seconds
# Best hyperparameters with GridSearchCV: {
# 'colsample_bytree': 1.0,
# 'gamma': 5,
# 'max_depth': 5,
# 'min_child_weight': 1,
# 'subsample': 1.0}

In [8]:
project.fitting_XGboost_tuned_model(best_params_XGboost_GS)

Sun Nov  7 21:57:01 2021 : best XGBoost model fitted and tested
Accuracy XGBoost : 63.11 %

