# Modeling

This dataset describes water wells in Tanzania. The data needs to be prepared and cleaned because it will be used in a classifier that predicts the condition of a water well. 

<span style = 'color:green; font-size:13pt'>**Note:**</span> Clicking on anything in the Table of Contents will send you to that section. Clicking on a header will send you back to the Table of Contents.
<hr style="border:2px solid magenta">

## Table of Contents: <a id ="title"></a>
- [Imports](#imports)
- [Opening Data](#opening)
- [Class Explanation](#class)
- [Pipeline Preparation](#pipeline)
- [Dummy Model](#dummy)
- [Baseline Model - Logistic Regression](#lg)
- [Grid Search- Logistic Regression](#gs)
- [SMOTE + Best Logistic Regression](#smote)
- [Grid Search - SMOTE using Best Logistic Regression](#gs2)
- [Summary](#summary)
- [Exporting to CSVs](#exports)
- [Conclusions](#conclusions)

### [Imports](#title) <a id ="imports"></a>
<hr style="border:2px solid magenta">

In [1]:
import numpy as np
import pandas as pd

import statsmodels.api as sm
import sklearn 
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold,cross_validate, cross_val_predict
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, FunctionTransformer, LabelBinarizer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, ConfusionMatrixDisplay, confusion_matrix, classification_report, \
plot_confusion_matrix, roc_curve, auc, accuracy_score

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline

from matplotlib import pyplot as plt
import seaborn as sns
import datetime as dt

### [Opening Data](#title)  <a id ='opening'></a>
<hr style="border:2px solid magenta">

Two models are built from two versions of the data: df has more columns than df2. Both CSVs were created in the [EDA](https://github.com/irwin-lam/PumpItUp/blob/main/EDA.ipynb) notebook.

In [2]:
df = pd.read_csv('./Data/TrainCleaned1.csv')
df2 = pd.read_csv('./Data/TrainCleaned2.csv')

Creating x and y from each df as variables that we can use throughout the notebook

In [3]:
x = df.drop('status_group', axis =1)
y = df.status_group

x2 = df2.drop('status_group', axis =1)
y2 = df2.status_group

### [Class Creation](#title)  <a id ='class'></a>
<hr style="border:2px solid magenta">

This class is called ModelsList  
**functions**  
>[<span style = 'color:green; font-size:13pt'>__ init __</span>](#func1)  
[<span style = 'color:green; font-size:13pt'>update<span style = 'color:red; font-size:13pt'></span></span>](#func2)  
[<span style = 'color:green; font-size:13pt'>class_report<span style = 'color:red; font-size:13pt'></span></span>](#func3)  
[<span style = 'color:green; font-size:13pt'>cv_summary<span style = 'color:red; font-size:13pt'></span></span>](#func4)  
[<span style = 'color:green; font-size:13pt'>delete_last<span style = 'color:red; font-size:13pt'></span></span>](#func5)  
 

**attributes**
><span style = 'color:blue'>xtrain</span>  &rarr; x training array  
<span style = 'color:blue'>xtest</span>  &rarr; x testing array  
<span style = 'color:blue'>ytrain</span>  &rarr; y training array  
<span style = 'color:blue'>ytest</span>  &rarr; y testing array  
<span style = 'color:blue'>length</span>  &rarr; number of models' information stored  
<span style = 'color:blue'>classification_reports</span>  &rarr;  an array of classification report  
<span style = 'color:blue'>cv</span>  &rarr;  an array of tuples with means and std  
<span style = 'color:blue'>df</span>  &rarr; master df that stores most of the data
***
**returns** Object ModelsList
<hr style="border:2px solid orange">  

<span style = 'color:green; font-size:13pt'>__ init __</span><a id = 'func1'></a>  
**parameters**  
><span style = 'color:blue'>self</span> &rarr; refers to the object  
<span style = 'color:blue'>x</span> &rarr; dataframe of values  
<span style = 'color:blue'>y</span> &rarr; dataframe of targets  
***
**returns** Object ModelsList  
***
**how it works**  
* performs a train_test_split of x and y and stores the outputs to the attributes xtrain, xtest, ytrain, ytest
* attribute length set to 0
* attributes classification_reports and cv set to an empty list
* attribute df set to an empty dataframe with columns made   
  **option to add more metrics inside**
  
<hr style="border:2px solid orange">  

<span style = 'color:green; font-size:13pt'>update</span><a id = 'func2'></a>  
**parameters**  
><span style = 'color:blue'>self</span> &rarr; refers to the object  
<span style = 'color:blue'>estimator</span> &rarr; model   
<span style = 'color:blue'>name</span> &rarr; name of the model  
<span style = 'color:blue'>fit</span> &rarr; boolean value to see if the model needs to be fit.  
&emsp;&emsp; DEFAULT: True   
<span style = 'color:blue'>params</span> &rarr; a string dictionary of the hyperparameters    
***
**returns** the master df (attribute df), which the notebook then displays  
***
**how it works**  
* increase length by 1
* if fit is set to True, fit the model with the attributes xtrain and ytrain
* creates array ypred of the predictions of the model for xtest
* variable log_loss is created from negative mean of cross_val_score with scoring of 'neg_log_loss' 
* variable trainscore is the score of the model for xtrain and ytrain  
* variable testscore is the score of the model for xtest and ytest  
* variable model_to_add is a row of the data to be inserted into the df
  **if df has more columns, add the corresponding variable to the correct column**
  
* adds model_to_add to df
* variable cv is a list of cross-validated accuracy scores on the train set
* append a tuple of average and standard deviation of the scores into cv
  
<hr style="border:2px solid orange">  

<span style = 'color:green; font-size:13pt'>class_report</span><a id = 'func3'></a>  
**parameters**  
><span style = 'color:blue'>self</span> &rarr; refers to the object   
***
**returns** print classification reports  
***
**how it works**  
* loops over the stored models (one iteration per value of length) 
* prints df[Model] and the matching entry of classification_reports for each model   

<hr style="border:2px solid orange">  

<span style = 'color:green; font-size:13pt'>cv_summary</span><a id = 'func4'></a>  
**parameters**  
><span style = 'color:blue'>self</span> &rarr; refers to the object
***
**returns** print cv summaries  
***
**how it works**  
* loops over the stored models (one iteration per value of length) 
* prints df[Model] and the matching entry of cv, which is the average and std of the scores

<hr style="border:2px solid orange">  

<span style = 'color:green; font-size:13pt'>delete_last</span><a id = 'func5'></a>  
**parameters**  
><span style = 'color:blue'>self</span> &rarr; refers to the object
***
**returns** None (the object is modified in place)  
***
**how it works**  
* deletes the last row of the df
* updates the length

<hr style="border:2px solid orange">
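
The bookkeeping that update performs, as described above, can be sketched on synthetic data. This is only an illustration: the toy dataset, the plain LogisticRegression, and the variable names are assumptions made for the example, not the notebook's data or pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic 3-class data standing in for the wells data (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the data the model was fit on and on the held-out split.
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)

# scikit-learn exposes log loss as a negated score, so the sign is flipped back.
cv_log_loss = -cross_val_score(clf, X_train, y_train, scoring="neg_log_loss").mean()

# Default scoring for a classifier is accuracy; keep the mean and the spread.
cv_acc = cross_val_score(clf, X_train, y_train)

row = ["toy LogReg", train_score, test_score, cv_log_loss, "None"]
cv_entry = (cv_acc.mean(), cv_acc.std())
print(row)
print(cv_entry)
```

Note that cross_val_score refits clones of the estimator on each fold, so the two cross-validation calls add noticeably to the run time of every update call.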

In [4]:
class ModelsList():
    def __init__(self, x,y):
        self.xtrain, self.xtest, self.ytrain, self.ytest = train_test_split(x,y,random_state=42)
        self.length = 0
        self.classification_reports = []
        self.cv = []
        self.df = pd.DataFrame({'Model' : pd.Series(dtype='str'), 
                    'train_score' : pd.Series(dtype='float64'), 
                    'test_score': pd.Series(dtype='float64'),
                    'log_loss': pd.Series(dtype='float64'),
                    'params':pd.Series(dtype='O')})
    
    def update(self, estimator, name, fit = True, params = 'None'):
        self.length += 1
        if fit:
            estimator.fit(self.xtrain, self.ytrain)
            
        ypred = estimator.predict(self.xtest)
        log_loss = -cross_val_score(estimator, self.xtrain, self.ytrain, scoring = 'neg_log_loss', n_jobs = -1).mean()
        self.classification_reports.append(classification_report(self.ytest, ypred))
        trainscore = estimator.score(self.xtrain, self.ytrain)
        testscore = estimator.score(self.xtest, self.ytest)
        model_to_add = [name, trainscore, testscore, log_loss, params]
        self.df.loc[len(self.df.index)] = model_to_add
        
        cv = cross_val_score(estimator, self.xtrain, self.ytrain)
        self.cv.append((cv.mean(), cv.std()))
        return self.df
    
    def class_report(self):
        for length in range(self.length):
            print(
            f"""Classification Report for '{self.df.Model[length]}':
                
            """
            )
            print(self.classification_reports[length])
            
    def cv_summary(self):
        for length in range(self.length):
            print(
            f"""Classification Report for '{self.df.Model[length]}':
                {self.cv[length][0]:.5f} ± {self.cv[length][1]:.5f} accuracy
            """
            )
            
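    def delete_last(self):
        # Drop the most recent model from every store so the indexes stay aligned.
        self.length -= 1
        self.df.drop(self.df.tail(1).index, inplace=True)
        self.classification_reports.pop()
        self.cv.pop()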

Creating two ModelsList objects to store my information

In [5]:
model1 = ModelsList(x,y)
model2 = ModelsList(x2,y2)

### [Pipeline Preparation](#title) <a id ='pipeline'></a>  
<hr style="border:2px solid magenta">  

Setting up two sub-pipelines to deal with numeric and categorical values

In [6]:
subpipenum = Pipeline([
    ('num_impute',SimpleImputer(add_indicator=True)),
    ('ss', StandardScaler())
])

subpipecat = Pipeline([
    ('cat_impute', SimpleImputer(strategy='most_frequent', add_indicator=True)),
    ('ohe', OneHotEncoder(sparse=True, handle_unknown='ignore'))
])

Using a column transformer to utilize the sub-pipelines  
>NOTE: <span style = 'color:blue'>**remainder = 'passthrough'**</span> passes any columns that were not selected by the transformers through unchanged, and <span style = 'color:blue'>**n_jobs = -1**</span> uses all available CPU cores

In [7]:
ct = ColumnTransformer(transformers = [
    ('subpipe_num',subpipenum, selector(dtype_include=np.number)),
    ('subpipe_cat', subpipecat, selector(dtype_include=object))
], remainder='passthrough', n_jobs = -1)
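
To see what the column transformer produces, here is a minimal sketch on a four-row toy frame. The column names and values are made up for illustration, and the categorical branch is simplified (no add_indicator, default encoder settings) to keep the output easy to count.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 50.0],
    "basin": pd.Series(["lake", "river", np.nan, "lake"], dtype=object),
})

# Numeric branch: mean imputation plus a missing-value indicator, then scaling.
num = Pipeline([
    ("impute", SimpleImputer(add_indicator=True)),
    ("scale", StandardScaler()),
])

# Categorical branch (simplified): fill with the mode, then one-hot encode.
cat = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

toy_ct = ColumnTransformer([
    ("num", num, selector(dtype_include=np.number)),
    ("cat", cat, selector(dtype_include=object)),
], remainder="passthrough")

out = toy_ct.fit_transform(toy)
# Two numeric outputs (value + indicator) and two one-hot columns (lake, river).
print(out.shape)
```

The indicator column is why the transformed data can end up wider than the number of input columns plus the one-hot categories.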

### [Dummy Model](#title)<a id ="dummy"></a>
<hr style="border:2px solid magenta">
    
Creating a dummy model as a baseline model

In [8]:
dummy = Pipeline([
    ('ct', ct),
    ('dummy', DummyClassifier())
])

In [9]:
%%time
#roughly 30s
model1.update(dummy, 'Dummy')



Wall time: 33.7 s




| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447385 | 0.437306 | 19.12076 | |


Took about 30 seconds. Train and test accuracy are roughly 44%, with a log loss of 19.12, which is very high. This is expected for a dummy model.
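
A rough back-of-the-envelope check of why the dummy model's log loss lands near 19. It rests on two assumptions that the notebook does not confirm: that the dummy classifier outputs hard 0/1 probabilities (as scikit-learn's 'stratified' strategy does), and that probabilities are clipped at 1e-15 before the logarithm is taken. The accuracy value is the cross-validated accuracy reported for the Dummy model in the Summary section.

```python
import math

# Assumption: probabilities are clipped at this epsilon before taking the log.
eps = 1e-15

# Cost of one confidently wrong prediction (probability 0 for the true class).
penalty_per_miss = -math.log(eps)

# Cross-validated accuracy of the Dummy model on the first dataset.
cv_accuracy = 0.44842

# Correct hard predictions cost about 0, so only the misses contribute.
approx_log_loss = (1 - cv_accuracy) * penalty_per_miss
print(round(penalty_per_miss, 2), round(approx_log_loss, 2))
```

Under these assumptions the estimate comes out close to the reported value, which suggests the high log loss reflects confident wrong guesses rather than anything specific to the data.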

In [10]:
%%time
#roughly 20s
model2.update(dummy, 'Dummy')



Wall time: 18.6 s




| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447003 | 0.449158 | 19.057962 | |


Took about 20 seconds. Train and test accuracy are roughly 45%, with a log loss of 19.06, which is very high. This is expected for a dummy model.

### [Logistic Regression Model](#title) <a id ="lg"></a>
<hr style="border:2px solid magenta">

Let's fit a basic logistic regression model to see if it improves the accuracy

In [11]:
lgr1 = Pipeline([
    ('ct', ct),
    ('lg', LogisticRegression(random_state=42, n_jobs=-1))
])

In [12]:
lgr2 = Pipeline([
    ('ct', ct),
    ('lg', LogisticRegression(random_state=42, n_jobs=-1))
])

In [13]:
%%time
#roughly 50s
model1.update(lgr1, 'LogReg')

Wall time: 43 s


| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447385 | 0.437306 | 19.12076 | |
| 1 | LogReg | 0.861347 | 0.775286 | 0.570089 | |


Took about 45 seconds. 
> * Training score 86.1%, testing score 77.5%, log loss 0.57. 
> * Delta between the two scores 8.6%
> * This is a huge improvement over the dummy model. 
> * However, I cannot tell whether this is as good as logistic regression gets without tuning the hyperparameters.

In [14]:
%%time
#roughly 50s
model2.update(lgr2, 'LogReg')

Wall time: 33.8 s


| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447003 | 0.449158 | 19.057962 | |
| 1 | LogReg | 0.802716 | 0.739057 | 0.627404 | |


Took about 30 seconds. 
> * Training score 80.3%, testing score 73.9%, log loss 0.63. 
> * Delta between the two scores 6.4%
> * This is a huge improvement over the dummy model. 
> * However, I cannot tell whether this is as good as logistic regression gets without tuning the hyperparameters.

<hr style="border:2px solid orange">  

Looking at the name of the hyper parameters that I can change.

In [15]:
pd.DataFrame.from_dict(lgr1.get_params(), orient='index').index

Index(['memory', 'steps', 'verbose', 'ct', 'lg', 'ct__n_jobs', 'ct__remainder',
       'ct__sparse_threshold', 'ct__transformer_weights', 'ct__transformers',
       'ct__verbose', 'ct__subpipe_num', 'ct__subpipe_cat',
       'ct__subpipe_num__memory', 'ct__subpipe_num__steps',
       'ct__subpipe_num__verbose', 'ct__subpipe_num__num_impute',
       'ct__subpipe_num__ss', 'ct__subpipe_num__num_impute__add_indicator',
       'ct__subpipe_num__num_impute__copy',
       'ct__subpipe_num__num_impute__fill_value',
       'ct__subpipe_num__num_impute__missing_values',
       'ct__subpipe_num__num_impute__strategy',
       'ct__subpipe_num__num_impute__verbose', 'ct__subpipe_num__ss__copy',
       'ct__subpipe_num__ss__with_mean', 'ct__subpipe_num__ss__with_std',
       'ct__subpipe_cat__memory', 'ct__subpipe_cat__steps',
       'ct__subpipe_cat__verbose', 'ct__subpipe_cat__cat_impute',
       'ct__subpipe_cat__ohe', 'ct__subpipe_cat__cat_impute__add_indicator',
       'ct__subpipe_cat__cat_im

In [16]:
params = {
    'lg__solver' : ['lbfgs','newton-cg', 'saga'],
    'lg__max_iter': [250, 750, 1000, 1500],
    'lg__C' : [0.1,0.5, 1],
    'lg__tol' : [0.0001, 0.005, 0.001,0.01, 0.1],
    'lg__class_weight' : [{'functional': 1, 'non functional': 1, 'functional needs repair': 2},
                          {'functional': 1, 'non functional': 1, 'functional needs repair': 1},
                          {'functional': 1, 'non functional': 1, 'functional needs repair': .8}]
}
#540 candidates x 5 folds = 2700 fits

I chose to tune these hyperparameters:
>* `lg__solver`
>* `lg__max_iter`
>* `lg__C`
>* `lg__tol`
>* `lg__class_weight`

I experimented with different values for these. This is the grid I ended up keeping. 
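
The candidate count quoted in the comment above can be checked without fitting anything. The sketch below copies the grid and counts its combinations with scikit-learn's ParameterGrid; the variable names are chosen for the example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, copied here so the example is self-contained.
lg_grid = {
    'lg__solver': ['lbfgs', 'newton-cg', 'saga'],
    'lg__max_iter': [250, 750, 1000, 1500],
    'lg__C': [0.1, 0.5, 1],
    'lg__tol': [0.0001, 0.005, 0.001, 0.01, 0.1],
    'lg__class_weight': [
        {'functional': 1, 'non functional': 1, 'functional needs repair': 2},
        {'functional': 1, 'non functional': 1, 'functional needs repair': 1},
        {'functional': 1, 'non functional': 1, 'functional needs repair': .8},
    ],
}

# 3 solvers x 4 max_iter x 3 C x 5 tol x 3 class weights.
n_candidates = len(ParameterGrid(lg_grid))
n_fits = n_candidates * 5  # 5-fold cross-validation
print(n_candidates, n_fits)  # 540 2700
```

Counting the grid first is a cheap way to estimate run time before committing to a long search.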

### [Grid Search - Logistic Regression Model](#title) <a id ="gs"></a>
<hr style="border:2px solid magenta">
    
Creating GridSearchCV with the hyper parameters set above

In [17]:
gs1 = GridSearchCV(
    estimator= lgr1,
    param_grid= params,
    cv = 5,
    verbose = 2,
    n_jobs = -1
)

In [18]:
gs2 = GridSearchCV(
    estimator= lgr2,
    param_grid= params,
    cv = 5,
    verbose = 2,
    n_jobs = -1
)

Fitting these Grid Searches.  
**Note: These take very long to run**

In [19]:
%%time
#1h 36min 54s
gs1.fit(model1.xtrain, model1.ytrain)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   49.8s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:  9.4min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed: 16.4min
[Parallel(n_jobs=-1)]: Done 989 tasks      | elapsed: 26.4min
[Parallel(n_jobs=-1)]: Done 1434 tasks      | elapsed: 43.5min
[Parallel(n_jobs=-1)]: Done 1961 tasks      | elapsed: 62.8min
[Parallel(n_jobs=-1)]: Done 2568 tasks      | elapsed: 89.2min
[Parallel(n_jobs=-1)]: Done 2700 out of 2700 | elapsed: 94.7min finished


Wall time: 1h 35min 4s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ct',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('subpipe_num',
                                                                         Pipeline(steps=[('num_impute',
                                                                                          SimpleImputer(add_indicator=True)),
                                                                                         ('ss',
                                                                                          StandardScaler())]),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x000001CCE8E33D00>),
                                                                        ('subpipe_cat',
       

In [20]:
%%time
#1h 4min 20s
gs2.fit(model2.xtrain, model2.ytrain)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   27.9s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 989 tasks      | elapsed: 16.2min
[Parallel(n_jobs=-1)]: Done 1434 tasks      | elapsed: 26.0min
[Parallel(n_jobs=-1)]: Done 1961 tasks      | elapsed: 37.5min
[Parallel(n_jobs=-1)]: Done 2568 tasks      | elapsed: 52.6min
[Parallel(n_jobs=-1)]: Done 2700 out of 2700 | elapsed: 55.8min finished


Wall time: 55min 58s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ct',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('subpipe_num',
                                                                         Pipeline(steps=[('num_impute',
                                                                                          SimpleImputer(add_indicator=True)),
                                                                                         ('ss',
                                                                                          StandardScaler())]),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x000001CCE8E33D00>),
                                                                        ('subpipe_cat',
       

The two grid searches took about 2.5 hours in total (roughly 1 h 35 min and 56 min).
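
For readers who want to try the same workflow without the long wait, here is a minimal sketch of a grid search over a pipeline on synthetic data. The dataset, the tiny grid, and the step names are assumptions for the example; only the pattern (step-prefixed parameter names, then best_estimator_ and best_params_) mirrors the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem so the search finishes in seconds.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

pipe = Pipeline([
    ("ss", StandardScaler()),
    ("lg", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(pipe, param_grid={"lg__C": [0.1, 1]}, cv=3)
search.fit(X, y)

best_model = search.best_estimator_   # pipeline refit on all of X, y
best_params = search.best_params_     # winning combination as a dict
print(best_params, round(search.best_score_, 3))
```

Because best_estimator_ is already refit on the full training data, it can be scored directly without calling fit again.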

<hr style="border:2px solid orange">

Making variables to store the best_estimator and best_params.  
Taking a quick look at the hyper parameters

In [21]:
gs1best = gs1.best_estimator_
gs1bestparam = gs1.best_params_
gs1bestparam

{'lg__C': 0.5,
 'lg__class_weight': {'functional': 1,
  'non functional': 1,
  'functional needs repair': 1},
 'lg__max_iter': 750,
 'lg__solver': 'lbfgs',
 'lg__tol': 0.0001}

In [22]:
gs2best = gs2.best_estimator_
gs2bestparam = gs2.best_params_
gs2bestparam

{'lg__C': 1,
 'lg__class_weight': {'functional': 1,
  'non functional': 1,
  'functional needs repair': 0.8},
 'lg__max_iter': 250,
 'lg__solver': 'lbfgs',
 'lg__tol': 0.0001}

One thing I noticed is that the tolerance (value of 0.0001) and the solver (lbfgs) are the same for both searches.  
C (0.5 vs. 1), the class weights, and max_iter are different, which is very interesting to see. 

<hr style="border:2px solid orange">

Using ModelsList class to update and store this model's performance/ hyper-parameters

In [23]:
%%time
#1min 38s
model1.update(gs1best, 'Best LogReg', False, gs1bestparam)

Wall time: 1min 39s


| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447385 | 0.437306 | 19.12076 | |
| 1 | LogReg | 0.861347 | 0.775286 | 0.570089 | |
| 2 | Best LogReg | 0.8778 | 0.781818 | 0.547148 | `{'lg__C': 0.5, 'lg__class_weight': {'functiona...` |


Took about 1 min 40 s. 
> * Train score 87.8%, test score 78.2%, log loss 0.547. 
> * Compared with the baseline LogReg, train score increased by 1.6%, test score increased by 0.6%, log loss decreased by 0.023. 
> * Delta between the two scores 9.6% 
> * The lower log loss suggests a better-fitting model.  
> * However, the gap between the train and test scores grew (from 8.6% to 9.6%), so the model may be slightly more overfit. 

In [24]:
%%time
#roughly 1m
model2.update(gs2best, 'Best LogReg', False, gs2bestparam)

Wall time: 37.5 s


| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447003 | 0.449158 | 19.057962 | |
| 1 | LogReg | 0.802716 | 0.739057 | 0.627404 | |
| 2 | Best LogReg | 0.834613 | 0.745859 | 0.617268 | `{'lg__C': 1, 'lg__class_weight': {'functional'...` |


Took about 40 seconds. 
> * Train score 83.5%, test score 74.6%, log loss 0.617. 
> * Compared with the baseline LogReg, train score increased by 3.2%, test score increased by 0.7%, and log loss decreased by about 0.01. 
> * The test score and log loss both moved in the right direction, so the model is improving. 
> * The train score improved more than the test score, so the gap between them grew (from 6.4% to 8.9%). 
> * The lower log loss suggests a better-fitting model. 

### [Smote +  Best Logistic Regression Model](#title) <a id ="smote"></a>
<hr style="border:2px solid magenta">

Making Pipeline for SMOTE

In [25]:
imbpipe1 = ImPipeline([
    ('ct', ct),
    ('sm', SMOTE(random_state = 42, n_jobs =-1)),
    ('gs_best', LogisticRegression(random_state = 42, 
                                   C = gs1bestparam['lg__C'], 
                                   max_iter = gs1bestparam['lg__max_iter'], 
                                   solver = gs1bestparam['lg__solver'],
                                   tol = gs1bestparam['lg__tol'],
                                   class_weight = gs1bestparam['lg__class_weight'],
                                   n_jobs=-1))
])

In [26]:
imbpipe2 = ImPipeline([
    ('ct', ct),
    ('sm', SMOTE(random_state = 42, n_jobs =-1)),
    ('gs_best', LogisticRegression(random_state = 42, 
                                   C = gs2bestparam['lg__C'], 
                                   max_iter = gs2bestparam['lg__max_iter'], 
                                   solver = gs2bestparam['lg__solver'],
                                   tol = gs2bestparam['lg__tol'],
                                   class_weight = gs2bestparam['lg__class_weight'],
                                   n_jobs=-1))
])
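
For intuition about what the SMOTE step adds to the training data, the sketch below illustrates the interpolation idea behind it using only NumPy. It is a simplified illustration, not imbalanced-learn's implementation, and the function name and toy points are hypothetical. In the imblearn pipeline the resampling happens only while fitting, so the test data is never resampled.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])

def smote_like_sample(points, k, rng):
    """Return one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # how far along the segment, in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Every synthetic point lies on a line segment between two real minority samples, which is why oversampling this way can raise the train score without improving the test score.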

Updating the model1 and model2 ModelsList objects

In [27]:
%%time
#4min 4s
model1.update(imbpipe1, 'Smote with Best LogReg', True, gs1bestparam)

Wall time: 4min 9s


| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447385 | 0.437306 | 19.12076 | |
| 1 | LogReg | 0.861347 | 0.775286 | 0.570089 | |
| 2 | Best LogReg | 0.8778 | 0.781818 | 0.547148 | `{'lg__C': 0.5, 'lg__class_weight': {'functiona...` |
| 3 | Smote with Best LogReg | 0.88963 | 0.752929 | 0.595275 | `{'lg__C': 0.5, 'lg__class_weight': {'functiona...` |


Took ~4 mins. 
>* Train score 89%, test score 75.3%, log loss 0.595.   
>* Compared with Best LogReg, train score increased by 1.2%, test score decreased by 2.9%, and log loss increased by 0.048.  
>* Delta between the two scores 13.7%. That's pretty big.   
>* The higher log loss and the bigger delta imply that this model is overfit.   
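
The train/test delta quoted in these notes is the difference between the two accuracy columns. A small sketch that recomputes it for the first dataset, using the values from the table above (the DataFrame here is rebuilt by hand for the example):

```python
import pandas as pd

# Scores copied from the model1 results table.
results = pd.DataFrame({
    "Model": ["LogReg", "Best LogReg", "Smote with Best LogReg"],
    "train_score": [0.861347, 0.877800, 0.889630],
    "test_score": [0.775286, 0.781818, 0.752929],
})

# A larger gap means the model fits the training data better than unseen data.
results["gap"] = results["train_score"] - results["test_score"]
print(results.round(3))
```

Tracking the gap next to the log loss makes it easier to spot when a higher train score is coming from overfitting.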

In [28]:
%%time
#2min 55s
model2.update(imbpipe2, 'Smote with Best LogReg', True, gs2bestparam)

Wall time: 1min 49s


| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447003 | 0.449158 | 19.057962 | |
| 1 | LogReg | 0.802716 | 0.739057 | 0.627404 | |
| 2 | Best LogReg | 0.834613 | 0.745859 | 0.617268 | `{'lg__C': 1, 'lg__class_weight': {'functional'...` |
| 3 | Smote with Best LogReg | 0.826308 | 0.711785 | 0.680222 | `{'lg__C': 1, 'lg__class_weight': {'functional'...` |


Took ~2 mins. 
>* Train score 82.6%, test score 71.2%, log loss 0.68.   
>* Compared with the baseline LogReg, train score increased by 2.4%, test score decreased by 2.7%, and log loss increased by 0.053.  
>* Delta between the two scores 11.5%. That's pretty big.   
>* The higher log loss and the bigger delta imply that this model is overfit.   

<hr style="border:2px solid orange">

Looking at the hyperparameters for SMOTE

In [29]:
pd.DataFrame.from_dict(imbpipe1.get_params(), orient='index').index

Index(['memory', 'steps', 'verbose', 'ct', 'sm', 'gs_best', 'ct__n_jobs',
       'ct__remainder', 'ct__sparse_threshold', 'ct__transformer_weights',
       'ct__transformers', 'ct__verbose', 'ct__subpipe_num', 'ct__subpipe_cat',
       'ct__subpipe_num__memory', 'ct__subpipe_num__steps',
       'ct__subpipe_num__verbose', 'ct__subpipe_num__num_impute',
       'ct__subpipe_num__ss', 'ct__subpipe_num__num_impute__add_indicator',
       'ct__subpipe_num__num_impute__copy',
       'ct__subpipe_num__num_impute__fill_value',
       'ct__subpipe_num__num_impute__missing_values',
       'ct__subpipe_num__num_impute__strategy',
       'ct__subpipe_num__num_impute__verbose', 'ct__subpipe_num__ss__copy',
       'ct__subpipe_num__ss__with_mean', 'ct__subpipe_num__ss__with_std',
       'ct__subpipe_cat__memory', 'ct__subpipe_cat__steps',
       'ct__subpipe_cat__verbose', 'ct__subpipe_cat__cat_impute',
       'ct__subpipe_cat__ohe', 'ct__subpipe_cat__cat_impute__add_indicator',
       'ct__subpipe_

In [30]:
params = {
    'sm__sampling_strategy' : ['minority', 'not majority', 'all'],
    'sm__k_neighbors': [5,10,15]
}
#9 candidates x 5 folds = 45 fits
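
To make the three sampling_strategy options concrete, the sketch below works out how many synthetic samples each option would request per class. It encodes my understanding of imbalanced-learn's documented semantics for over-samplers (every targeted class is raised to the majority count), which is an assumption rather than something the notebook shows. The class counts are the support values from the classification reports, used only as illustrative numbers, and extra_samples is a hypothetical helper.

```python
# Illustrative class counts (support column of the classification reports).
counts = {"functional": 8098, "functional needs repair": 1074, "non functional": 5678}

def extra_samples(counts, strategy):
    """Number of synthetic samples requested per targeted class (assumed semantics)."""
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    target = counts[majority]
    if strategy == "minority":
        classes = [minority]
    elif strategy == "not majority":
        classes = [c for c in counts if c != majority]
    elif strategy == "all":
        classes = list(counts)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return {c: target - counts[c] for c in classes}

for strategy in ["minority", "not majority", "all"]:
    print(strategy, extra_samples(counts, strategy))
```

Under this reading, 'not majority' and 'all' request the same synthetic samples when over-sampling, because the majority class needs zero extra samples, so the grid may effectively contain fewer distinct settings than nine.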

### [Grid Search - Smote Using Best Logistic Regression Model](#title) <a id ="gs2"></a>
<hr style="border:2px solid magenta">
    
Setting up Grid Searches with the hyper parameters given above

In [31]:
smote_gs1 = GridSearchCV(
    estimator= imbpipe1,
    param_grid= params,
    cv =5,
    verbose = 2,
    n_jobs = -1
)

In [32]:
smote_gs2 = GridSearchCV(
    estimator= imbpipe2,
    param_grid= params,
    cv =5,
    verbose = 2,
    n_jobs = -1
)

In [33]:
%%time
#5min 52s
smote_gs1.fit(model1.xtrain,model1.ytrain)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  5.4min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  5.4min finished


Wall time: 5min 55s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ct',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('subpipe_num',
                                                                         Pipeline(steps=[('num_impute',
                                                                                          SimpleImputer(add_indicator=True)),
                                                                                         ('ss',
                                                                                          StandardScaler())]),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x000001CCE8E33D00>),
                                                                        ('subpipe_cat',
       

In [34]:
%%time
#3min 29s
smote_gs2.fit(model2.xtrain,model2.ytrain)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   49.7s
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  1.8min remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  1.8min finished


Wall time: 1min 56s


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ct',
                                        ColumnTransformer(n_jobs=-1,
                                                          remainder='passthrough',
                                                          transformers=[('subpipe_num',
                                                                         Pipeline(steps=[('num_impute',
                                                                                          SimpleImputer(add_indicator=True)),
                                                                                         ('ss',
                                                                                          StandardScaler())]),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x000001CCE8E33D00>),
                                                                        ('subpipe_cat',
       

This was a lot faster than the logistic regression grid search because it only needs 45 fits compared to 2700 fits

<hr style="border:2px solid orange">

Setting up variables

In [35]:
smote_gs1best = smote_gs1.best_estimator_
smote_gs2best = smote_gs2.best_estimator_

smote_gs1bestparam = smote_gs1.best_params_
smote_gs2bestparam = smote_gs2.best_params_

<hr style="border:2px solid orange">

Updating model1 and model2 ModelsList

In [36]:
%%time
# roughly 1m
model1.update(smote_gs1best, 'Best Smote + LogReg', False, smote_gs1bestparam)

Wall time: 2min 31s


| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447385 | 0.437306 | 19.12076 | |
| 1 | LogReg | 0.861347 | 0.775286 | 0.570089 | |
| 2 | Best LogReg | 0.8778 | 0.781818 | 0.547148 | `{'lg__C': 0.5, 'lg__class_weight': {'functiona...` |
| 3 | Smote with Best LogReg | 0.88963 | 0.752929 | 0.595275 | `{'lg__C': 0.5, 'lg__class_weight': {'functiona...` |
| 4 | Best Smote + LogReg | 0.879596 | 0.753266 | 0.596776 | `{'sm__k_neighbors': 15, 'sm__sampling_strategy...` |


Took ~2.5 mins. 
>* Train score 88%, test score 75.3%, log loss 0.597.   
>* Compared with the baseline LogReg, train score increased by 1.8%, test score decreased by 2.2%, and log loss increased by 0.027.  
>* Delta between the two scores 12.6%. That's pretty big.   
>* The higher log loss and the bigger delta imply that this model is overfit.   
>* Compared with the previous SMOTE model, the train score is slightly lower and the log loss is slightly higher.   

In [37]:
%%time
# roughly 1m
model2.update(smote_gs2best, 'Best Smote + LogReg', False, smote_gs2bestparam)

Wall time: 49.7 s


| | Model | train_score | test_score | log_loss | params |
|---|---|---|---|---|---|
| 0 | Dummy | 0.447003 | 0.449158 | 19.057962 | |
| 1 | LogReg | 0.802716 | 0.739057 | 0.627404 | |
| 2 | Best LogReg | 0.834613 | 0.745859 | 0.617268 | `{'lg__C': 1, 'lg__class_weight': {'functional'...` |
| 3 | Smote with Best LogReg | 0.826308 | 0.711785 | 0.680222 | `{'lg__C': 1, 'lg__class_weight': {'functional'...` |
| 4 | Best Smote + LogReg | 0.826891 | 0.714613 | 0.678226 | `{'sm__k_neighbors': 5, 'sm__sampling_strategy'...` |


### [Summary](#title) <a id ="summary"></a>

<hr style="border:2px solid magenta">
    
Showing useful values to evaluate which model is the best

In [38]:
model1.class_report()

Classification Report for 'Dummy':
                
            
                         precision    recall  f1-score   support

             functional       0.55      0.55      0.55      8098
functional needs repair       0.08      0.08      0.08      1074
         non functional       0.39      0.39      0.39      5678

               accuracy                           0.45     14850
              macro avg       0.34      0.34      0.34     14850
           weighted avg       0.45      0.45      0.45     14850

Classification Report for 'LogReg':
                
            
                         precision    recall  f1-score   support

             functional       0.78      0.87      0.82      8098
functional needs repair       0.54      0.26      0.35      1074
         non functional       0.80      0.74      0.77      5678

               accuracy                           0.78     14850
              macro avg       0.70      0.62      0.65     14850
           weighted

In [39]:
model1.cv_summary()

Classification Report for 'Dummy':
                0.44842 ± 0.00252 accuracy
            
Classification Report for 'LogReg':
                0.77336 ± 0.00296 accuracy
            
Classification Report for 'Best LogReg':
                0.77915 ± 0.00242 accuracy
            
Classification Report for 'Smote with Best LogReg':
                0.75086 ± 0.00287 accuracy
            
Classification Report for 'Best Smote + LogReg':
                0.75169 ± 0.00356 accuracy
            


In [40]:
model1.df

Unnamed: 0,Model,train_score,test_score,log_loss,params
0,Dummy,0.447385,0.437306,19.12076,
1,LogReg,0.861347,0.775286,0.570089,
2,Best LogReg,0.8778,0.781818,0.547148,"{'lg__C': 0.5, 'lg__class_weight': {'functiona..."
3,Smote with Best LogReg,0.88963,0.752929,0.595275,"{'lg__C': 0.5, 'lg__class_weight': {'functiona..."
4,Best Smote + LogReg,0.879596,0.753266,0.596776,"{'sm__k_neighbors': 15, 'sm__sampling_strategy..."


**Notes**:
- The Best Logistic Regression performs best overall: it has the highest test score (0.782) and the lowest log loss (0.547) of the five models. It also has the highest F1 score for the functional class.
- The SMOTE models were better at predicting the needs repair class, but not by much compared with the Logistic Regression.
- Although the SMOTE models have the highest training scores, they appear to overfit: their test scores are about 0.03 lower than that of the Best Logistic Regression.
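
The comparison in these notes can be reproduced with a few lines of pandas. This is a minimal sketch, assuming the scores are typed in by hand from the `model1.df` output above (rounded to four decimals); the `gap` column and the variable names are illustrative and not part of the notebook's own classes.

```python
import pandas as pd

# Scores copied from the model1.df output above, rounded to 4 decimals.
scores = pd.DataFrame(
    {
        "Model": [
            "Dummy",
            "LogReg",
            "Best LogReg",
            "Smote with Best LogReg",
            "Best Smote + LogReg",
        ],
        "train_score": [0.4474, 0.8613, 0.8778, 0.8896, 0.8796],
        "test_score": [0.4373, 0.7753, 0.7818, 0.7529, 0.7533],
        "log_loss": [19.1208, 0.5701, 0.5471, 0.5953, 0.5968],
    }
)

# A larger train-minus-test gap is a rough signal of overfitting.
scores["gap"] = scores["train_score"] - scores["test_score"]

best_by_test = scores.loc[scores["test_score"].idxmax(), "Model"]
best_by_loss = scores.loc[scores["log_loss"].idxmin(), "Model"]
largest_gap = scores.loc[scores["gap"].idxmax(), "Model"]

print(scores.sort_values("gap", ascending=False).to_string(index=False))
print("Highest test score:", best_by_test)
print("Lowest log loss:", best_by_loss)
print("Largest train/test gap:", largest_gap)
```

The gap is only a heuristic: it ignores the variance across cross-validation folds, so it is best read alongside the `cv_summary()` output.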

<hr style="border:2px solid orange">

In [41]:
model2.class_report()

Classification Report for 'Dummy':
                
            
                         precision    recall  f1-score   support

             functional       0.55      0.55      0.55      8098
functional needs repair       0.07      0.07      0.07      1074
         non functional       0.38      0.38      0.38      5678

               accuracy                           0.45     14850
              macro avg       0.33      0.33      0.33     14850
           weighted avg       0.45      0.45      0.45     14850

Classification Report for 'LogReg':
                
            
                         precision    recall  f1-score   support

             functional       0.75      0.84      0.79      8098
functional needs repair       0.52      0.25      0.34      1074
         non functional       0.74      0.69      0.71      5678

               accuracy                           0.74     14850
              macro avg       0.67      0.59      0.61     14850
           weighted

In [42]:
model2.cv_summary()

Classification Report for 'Dummy':
                0.44716 ± 0.00831 accuracy
            
Classification Report for 'LogReg':
                0.73661 ± 0.00495 accuracy
            
Classification Report for 'Best LogReg':
                0.74083 ± 0.00353 accuracy
            
Classification Report for 'Smote with Best LogReg':
                0.70723 ± 0.00230 accuracy
            
Classification Report for 'Best Smote + LogReg':
                0.71057 ± 0.00464 accuracy
            


In [43]:
model2.df

Unnamed: 0,Model,train_score,test_score,log_loss,params
0,Dummy,0.447003,0.449158,19.057962,
1,LogReg,0.802716,0.739057,0.627404,
2,Best LogReg,0.834613,0.745859,0.617268,"{'lg__C': 1, 'lg__class_weight': {'functional'..."
3,Smote with Best LogReg,0.826308,0.711785,0.680222,"{'lg__C': 1, 'lg__class_weight': {'functional'..."
4,Best Smote + LogReg,0.826891,0.714613,0.678226,"{'sm__k_neighbors': 5, 'sm__sampling_strategy'..."


### [Exporting to CSVs](#title) <a id ="exports"><a>

<hr style="border:2px solid magenta">
    
Export the model lists as CSVs to explore in the visualizations notebook.

In [45]:
model1.df.to_csv('./Data/Model1.csv', index=False)
model2.df.to_csv('./Data/Model2.csv', index=False)

Grab the `cv_results_` from each grid search to explore later.

In [46]:
gs1cv_results = pd.DataFrame.from_dict(gs1.cv_results_)
gs2cv_results = pd.DataFrame.from_dict(gs2.cv_results_)

In [47]:
smote_gs1cv_results = pd.DataFrame.from_dict(smote_gs1.cv_results_)
smote_gs2cv_results = pd.DataFrame.from_dict(smote_gs2.cv_results_)

Export these cv_results as CSVs to explore later.

In [48]:
gs1cv_results.to_csv('./Data/GridSearch1cv_results.csv', index=False)
gs2cv_results.to_csv('./Data/GridSearch2cv_results.csv', index=False)
smote_gs1cv_results.to_csv('./Data/SmoteGridSearch1cv_results.csv', index=False)
smote_gs2cv_results.to_csv('./Data/SmoteGridSearch2cv_results.csv', index=False)
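
When these files are revisited, a quick way to read them is to sort by `rank_test_score` and keep only the score columns. The sketch below is a hedged illustration on synthetic data, not the notebook's actual grid search: the data from `make_classification`, the two-value grid for `lg__C`, and the names `pipe`, `gs`, and `summary` are all assumptions made for the example. Only the step name `lg` is borrowed from the `params` column shown earlier.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; the notebook's real features are not used here.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The step name "lg" mirrors the "lg__C" keys seen in the params column.
pipe = Pipeline([("lg", LogisticRegression(max_iter=750))])
gs = GridSearchCV(pipe, param_grid={"lg__C": [0.5, 1]}, cv=3)
gs.fit(X, y)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame.
results = pd.DataFrame(gs.cv_results_)

# Keep the columns that matter for comparing candidates, best rank first.
summary = (
    results[["param_lg__C", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

The same column selection and sort should apply to a DataFrame loaded back with `pd.read_csv`, since the exported files keep the `cv_results_` column names.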

I will create some visualizations in the next [notebook](https://github.com/irwin-lam/PumpItUp/blob/main/Visualizations.ipynb). 

### [Conclusions](#title) <a id ="conclusions"><a>

<hr style="border:2px solid magenta">

The best model is the Logistic Regression with tuned hyperparameters. 
> * The C is set to 0.5. 
> * The class weight is set to be balanced. 
> * The maximum number of iterations is set to 750. 
> * The solver is set to lbfgs. 
> * The tolerance is set to 0.0001. 

Of the five models I tried, this one has the lowest log loss and the highest test accuracy, and the difference between its training score and test score is relatively small. Its recall for the functional class is also the highest. Recall is the number of true positives divided by all actual members of that class (true positives plus false negatives). A recall of 89% means the model correctly identifies 89% of the wells that are truly functional, so only 11% of them are misclassified as another class (false negatives). This matters because we want to avoid sending a crew to a water source that is already functional. 
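
To make the recall definition concrete, here is a small sketch that computes per-class recall and precision from a confusion matrix. The matrix is made up for illustration (its first row is chosen so that functional recall comes out at 89%); it is not the notebook's actual confusion matrix.

```python
import numpy as np

labels = ["functional", "functional needs repair", "non functional"]

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
cm = np.array(
    [
        [89, 3, 8],   # 100 truly functional wells
        [10, 6, 4],   # 20 wells that truly need repair
        [15, 2, 63],  # 80 truly non functional wells
    ]
)

true_pos = np.diag(cm)

# Recall: of the wells that truly belong to a class, how many were found?
recall = true_pos / cm.sum(axis=1)     # TP / (TP + FN)

# Precision: of the wells predicted as a class, how many really are?
precision = true_pos / cm.sum(axis=0)  # TP / (TP + FP)

for name, r, p in zip(labels, recall, precision):
    print(f"{name:<24} recall={r:.2f}  precision={p:.2f}")
```

In this toy matrix the 11 functional wells predicted as another class are false negatives for the functional class. False positives for that class sit in the first column instead (the 10 + 15 wells wrongly predicted as functional), and they lower precision rather than recall.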

The next step would be to convert this into a binary classification problem, since it makes more sense to send someone to a water source that is either non functional or in need of repair than to a functional well. This would remove the uncertainty around the needs repair class, which is limited in our dataset. The other models might perform better in this setting because the class imbalance would be greatly reduced: only about 7% of the data is classified as needs repair and 38% as non functional, so merging them would even out the class distribution to roughly 45-55. 
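
The class arithmetic behind the binary idea can be checked with the test-set support counts from the classification reports above. This is a sketch only: the merged label name `needs attention` and the `binary_map` dictionary are hypothetical choices, not something defined in the notebook.

```python
import pandas as pd

# Test-set class counts from the "support" column of the reports above.
support = pd.Series(
    {
        "functional": 8098,
        "functional needs repair": 1074,
        "non functional": 5678,
    }
)

# Hypothetical mapping that merges the two classes a repair crew would visit.
binary_map = {
    "functional": "functional",
    "functional needs repair": "needs attention",
    "non functional": "needs attention",
}

multi_share = support / support.sum()

# Relabel the index with the binary classes, then add up the merged counts.
binary_counts = support.rename(index=binary_map).groupby(level=0).sum()
binary_share = binary_counts / binary_counts.sum()

print(multi_share.round(3))
print(binary_share.round(3))

# The same mapping would apply to a target column via Series.map.
y_example = pd.Series(["functional", "non functional", "functional needs repair"])
y_binary = y_example.map(binary_map)
print(y_binary.tolist())
```

On these counts the merged class covers about 45% of the test set, which matches the 45-55 split described above.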

Another step would be to gather data on the deaths and illnesses caused by poor water quality in each region. This would provide more insight into which regions to prioritize when sending people to repair or build water sources. Those regions could be given more weight, since focusing on them would help more people. 