BloomTech Data Science

*Unit 2, Sprint 2, Module 3*

---

# Module Project: Hyperparameter Tuning
This week, the module projects focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model that predicts whether a water pump is `functional`, `non functional`, or `functional needs repair`.

## Directions

The tasks for this project are as follows:

- **Task 1:** Use the `wrangle` function to import training and test data.
- **Task 2:** Split training data into feature matrix `X` and target vector `y`.
- **Task 3:** Establish the baseline accuracy score for your dataset.
- **Task 4:** Build `clf_dt`.
- **Task 5:** Build `clf_rf`.
- **Task 6:** Evaluate classifiers using k-fold cross-validation.
- **Task 7:** Tune hyperparameters for the best-performing classifier.
- **Task 8:** Print out the best score and params for the model.
- **Task 9:** Create `submission.csv` and upload to Kaggle.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# I. Wrangle Data

In [21]:
%%capture
# Imports (the `%%capture` cell magic must be the first line of the cell)
!pip install category_encoders==2.*
!pip install pandas_profiling==2.*
from category_encoders import OrdinalEncoder
from pandas_profiling import ProfileReport

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, validation_curve, KFold # k-fold CV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV # Hyperparameter tuning
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Downloading dataset

# mounting your google drive on colab
from google.colab import drive
drive.mount('/content/gdrive')

# work directory
%cd /content/gdrive/My Drive/Kaggle/bloomtech-water-pump-challenge

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/Kaggle/bloomtech-water-pump-challenge


In [37]:
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path, 
                              na_values=[0, -2.000000e-08],
                              parse_dates=['date_recorded']),
                  pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path, 
                     na_values=[0, -2.000000e-08],
                     parse_dates=['date_recorded'],
                     index_col='id')

    # Drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)

    # Create age feature
    df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']
    df.drop(columns=['date_recorded', 'construction_year'], inplace=True)    

    # Drop high-cardinality categorical columns (HCCCs)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Drop duplicate columns (compared on the first 100 rows only)
    is_dupe = df.head(100).T.duplicated()
    dupe_cols = is_dupe[is_dupe].index
    df.drop(columns=dupe_cols, inplace=True)

    # Drop columns with high proportion of zeros
    df.drop(columns='num_private', inplace=True)

    return df
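
To make the duplicate-column and high-cardinality steps in `wrangle` more concrete, here is a minimal sketch on a toy DataFrame. The `toy` frame and its values are made up for illustration, and the dtype filter from `wrangle` is skipped because every toy column is categorical; only the `T.duplicated()` idiom and the `nunique()` cutoff come from the function above.

```python
import pandas as pd

# Toy data (hypothetical values): `quantity_group` is an exact copy of `quantity`
toy = pd.DataFrame({
    "quantity": ["enough", "dry", "enough", "dry"],
    "quantity_group": ["enough", "dry", "enough", "dry"],
    "wpt_name": ["a", "b", "c", "d"],  # one distinct value per row
})

# Transposing turns columns into rows, so duplicated() flags every column
# whose values repeat an earlier column
is_dupe = toy.T.duplicated()
dupe_cols = is_dupe[is_dupe].index.tolist()
print(dupe_cols)

# High-cardinality check: compare the number of distinct values to a cutoff
cutoff = 3  # wrangle uses 100; 3 suits this tiny example
high_card_cols = [col for col in toy.columns if toy[col].nunique() > cutoff]
print(high_card_cols)
```

Note that `wrangle` only compares the first 100 rows when looking for duplicates, so two columns that agree there but differ later would still be treated as duplicates.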


**Task 1:** Use the above `wrangle` function to read `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.

In [38]:
df = wrangle('/content/gdrive/My Drive/Kaggle/bloomtech-water-pump-challenge/train_features.csv', '/content/gdrive/My Drive/Kaggle/bloomtech-water-pump-challenge/train_labels.csv')
X_test = wrangle('/content/gdrive/My Drive/Kaggle/bloomtech-water-pump-challenge/test_features.csv')

In [39]:
print(df.shape)
df.head()

(47519, 29)


| id | amount_tsh | gps_height | longitude | latitude | basin | region | region_code | district_code | population | public_meeting | ... | water_quality | quality_group | quantity | source | source_type | source_class | waterpoint_type | waterpoint_type_group | status_group | pump_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 454.0 | 50.0 | 2092.0 | 35.42602 | -4.227446 | Internal | Manyara | 21 | 1.0 | 160.0 | True | ... | soft | good | insufficient | spring | spring | groundwater | communal standpipe | communal standpipe | functional | 15.0 |
| 510.0 | NaN | NaN | 35.510074 | -5.724555 | Internal | Dodoma | 1 | 6.0 | NaN | True | ... | soft | good | enough | shallow well | shallow well | groundwater | hand pump | hand pump | functional | NaN |
| 14146.0 | NaN | NaN | 32.499866 | -9.081222 | Lake Rukwa | Mbeya | 12 | 6.0 | NaN | True | ... | soft | good | enough | shallow well | shallow well | groundwater | other | other | non functional | NaN |
| 47410.0 | NaN | NaN | 34.060484 | -8.830208 | Rufiji | Mbeya | 12 | 7.0 | NaN | True | ... | soft | good | insufficient | river | river/lake | surface | communal standpipe | communal standpipe | non functional | NaN |
| 1288.0 | 300.0 | 1023.0 | 37.03269 | -6.040787 | Wami / Ruvu | Morogoro | 5 | 1.0 | 120.0 | True | ... | salty | salty | enough | shallow well | shallow well | groundwater | other | other | non functional | 14.0 |


# II. Split Data

**Task 2:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

**Note:** You won't need to do a train-test split because you'll use cross-validation instead.

In [40]:
target = 'status_group'
X = df.drop(columns=target)
y = df[target]

# III. Establish Baseline

**Task 3:** Since this is a **classification** problem, you should establish a baseline accuracy score. Determine the majority class in `y` and the percentage of your training observations it represents.

In [41]:
baseline_acc = y.value_counts(normalize=True).max()
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: 0.5429828068772491


**Note:** Because the data will be split during cross-validation, we establish the baseline using the entire target vector.
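
The majority-class baseline can be cross-checked with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. The sketch below uses a made-up target vector (`y_toy`) and a placeholder feature matrix (`X_toy`), both hypothetical, to show that the two approaches give the same number.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical target: 6 of 10 pumps belong to the majority class
y_toy = pd.Series(["functional"] * 6 + ["non functional"] * 4)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Approach used in this notebook: share of the most frequent class
majority_share = y_toy.value_counts(normalize=True).max()

# Equivalent check: accuracy of a model that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
dummy_acc = dummy.score(X_toy, y_toy)

print(majority_share, dummy_acc)
```

Any real model should beat this number; otherwise it has learned nothing beyond the class frequencies.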

# IV. Build Models

**Task 4:** Build a `Pipeline` named `clf_dt`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `DecisionTreeClassifier` predictor.

**Note:** Do not train `clf_dt`. You'll do that in a subsequent task. 

In [42]:
clf_dt = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    DecisionTreeClassifier(random_state=42)
);

**Task 5:** Build a `Pipeline` named `clf_rf`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `RandomForestClassifier` predictor.

**Note:** Do not train `clf_rf`. You'll do that in a subsequent task. 

In [43]:
clf_rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=25, random_state=42)
);
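
`make_pipeline` names each step after its lowercased class name, and those names are what you use later to address hyperparameters with the `<step>__<parameter>` syntax. The sketch below shows this on a reduced pipeline; the `OrdinalEncoder` is left out only so that the example depends on `sklearn` alone.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Reduced pipeline for illustration (no encoder step)
pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=42))

# Step names are generated automatically from the class names
step_names = list(pipe.named_steps)
print(step_names)

# Nested parameters are exposed as '<step>__<parameter>'
tunable = [name for name in pipe.get_params() if "__" in name]
print("simpleimputer__strategy" in tunable)
print("decisiontreeclassifier__max_depth" in tunable)
```

Calling `get_params()` on your own pipeline is a quick way to list every key that a hyperparameter search will accept.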

# V. Check Metrics

**Task 6:** Evaluate the performance of both of your classifiers using k-fold cross-validation.

In [44]:
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_scores_dt = cross_val_score(clf_dt, X, y, cv=kfold_cv, n_jobs=-1)
cv_scores_rf = cross_val_score(clf_rf, X, y, cv=kfold_cv, n_jobs=-1)

In [45]:
print('CV scores DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV scores DecisionTreeClassifier
[0.74347643 0.74842172 0.7477904  0.753367   0.749658  ]
Mean CV accuracy score: 0.7485427116583068
STD CV accuracy score: 0.0031863920362635535


In [46]:
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.7907197  0.79682239 0.79377104 0.79882155 0.79459118]
Mean CV accuracy score: 0.7949451723733529
STD CV accuracy score: 0.0027534985843720126
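
For intuition about what `cross_val_score` does with a `KFold` splitter, the sketch below reproduces it with an explicit loop on synthetic data. The dataset from `make_classification` and the names `X_toy`, `y_toy`, and `manual_scores` are assumptions for illustration; this is not the water pump data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual k-fold: fit a fresh copy on each training split, score on the held-out fold
manual_scores = []
for train_idx, val_idx in kfold.split(X_toy):
    fold_clf = clone(clf).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_clf.score(X_toy[val_idx], y_toy[val_idx]))
manual_scores = np.array(manual_scores)

# The helper performs the same loop internally
auto_scores = cross_val_score(clf, X_toy, y_toy, cv=kfold)
print(np.allclose(manual_scores, auto_scores))
```

Because the whole pipeline is refit inside each fold, the encoder and imputer only ever see that fold's training rows, which keeps information from the held-out fold out of preprocessing.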


# VI. Tune Model

**Task 7:** Choose the best-performing of your two models and tune its hyperparameters using a `RandomizedSearchCV` named `model`. Make sure that you include cross-validation and that `n_iter` is set to at least `25`.

**Note:** If you're not sure which hyperparameters to tune, check the notes from today's guided project and the `sklearn` documentation. 

In [50]:
param_grid = {    
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': range(5,50,5),
    'randomforestclassifier__n_estimators': range(15,125,25)}
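
It helps to know how large the search space is before choosing `n_iter`. This small sketch counts the combinations in the grid defined above and the share that a randomized search with `n_iter=25` would sample; it uses only the standard library.

```python
from math import prod

param_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': range(5, 50, 5),
    'randomforestclassifier__n_estimators': range(15, 125, 25),
}

# Total number of distinct hyperparameter combinations: 2 * 9 * 5
n_combinations = prod(len(values) for values in param_grid.values())
print(n_combinations)  # 90

# Fraction of the grid covered by 25 random candidates
n_iter = 25
coverage = round(n_iter / n_combinations, 2)
print(coverage)  # 0.28
```

An exhaustive `GridSearchCV` would need 90 candidates per fold, so sampling 25 cuts the number of fits to roughly a quarter at the cost of possibly missing the best combination.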

In [52]:
model = RandomizedSearchCV(clf_rf,
             param_distributions=param_grid,
             n_jobs=-1,
             cv=3,
             verbose=1,
             n_iter=25,
             )

model.fit(X, y)  # fit on the entire training set; cross-validation handles the splits

Fitting 3 folds for each of 25 candidates, totalling 75 fits


RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('ordinalencoder',
                                              OrdinalEncoder()),
                                             ('simpleimputer', SimpleImputer()),
                                             ('randomforestclassifier',
                                              RandomForestClassifier(n_estimators=25,
                                                                     random_state=42))]),
                   n_iter=25, n_jobs=-1,
                   param_distributions={'randomforestclassifier__max_depth': range(5, 50, 5),
                                        'randomforestclassifier__n_estimators': range(15, 125, 25),
                                        'simpleimputer__strategy': ['mean',
                                                                    'median']},
                   verbose=1)

**Task 8:** Print out the best score and best params for `model`.

In [53]:
best_score = model.best_score_
best_params = model.best_params_

print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

Best score for `model`: 0.800185222120398
Best params for `model`: {'simpleimputer__strategy': 'mean', 'randomforestclassifier__n_estimators': 65, 'randomforestclassifier__max_depth': 20}
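
Beyond the single best score, a fitted search object keeps the results for every sampled candidate in `cv_results_`. The sketch below runs a small search on synthetic data (the dataset, the parameter ranges, and the `search` name are all assumptions for illustration) and ranks the candidates. Passing `random_state` to `RandomizedSearchCV` makes the sampled candidates repeatable between runs.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"max_depth": range(2, 10), "n_estimators": range(5, 30, 5)},
    n_iter=4,
    cv=3,
    random_state=42,  # fixes which candidates are sampled
)
search.fit(X_toy, y_toy)

# One row per candidate, best candidate first
results = (
    pd.DataFrame(search.cv_results_)
    .sort_values("rank_test_score")
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
)
print(results)
```

Comparing `mean_test_score` against `std_test_score` shows whether the top candidates are meaningfully different or within noise of one another.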


# VII. Communicate Results

**Task 9:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submission to our competition site. 

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submission follows the same formatting. 

In [54]:
y_pred = model.predict(X_test)
submission = pd.DataFrame({'status_group': y_pred}, index=X_test.index)
submission

| id | status_group |
|---|---|
| 37098 | non functional |
| 14530 | functional |
| 62607 | functional |
| 46053 | non functional |
| 47083 | functional |
| ... | ... |
| 26092 | functional |
| 919 | non functional |
| 47444 | non functional |
| 61128 | functional |
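
Before uploading, a quick sanity check on the submission's shape can catch formatting mistakes. The sketch below is a hypothetical helper (`check_submission` is not part of the assignment), and the expected layout, an `id` index with a single `status_group` column holding one of the three status labels, is an assumption based on the task description; compare against the real `sample_submission.csv` to be sure.

```python
import pandas as pd

# The three classes named in the project description
VALID_LABELS = {"functional", "non functional", "functional needs repair"}


def check_submission(submission: pd.DataFrame) -> bool:
    """Return True if the frame matches the assumed submission layout."""
    return bool(
        submission.index.name == "id"
        and list(submission.columns) == ["status_group"]
        and submission.index.is_unique
        and set(submission["status_group"]).issubset(VALID_LABELS)
    )


# Toy frame using two of the predictions shown above
toy_submission = pd.DataFrame(
    {"status_group": ["non functional", "functional"]},
    index=pd.Index([37098, 14530], name="id"),
)
print(check_submission(toy_submission))
```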


In [55]:
# Create timestamp
pd.Timestamp.now().strftime('%Y-%m-%d_%H%M_')

'2022-10-12_0400_'

In [56]:
datestamp = pd.Timestamp.now().strftime('%Y-%m-%d_%H%M_')  # e.g. '2022-10-12_0400_'
submission.to_csv(f'{datestamp}submission.csv')  # timestamp-prefixed filename

In [59]:
# Generate CSV file
submission.to_csv('third_submission.csv')

# Download CSV file
from google.colab import files
files.download('third_submission.csv')


In [57]:
# Once you have found the best model, save it so you can reload it later for testing

# save model
import pickle

filename = 'assignment_module3_model_rs_80'

# save your model (it is stored in your current working directory; download it to your computer if GDrive is not mounted)
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# load model
with open(filename, 'rb') as f:
    model_rf_loaded = pickle.load(f)
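
After reloading a pickled model, it is worth confirming that it behaves like the original. The sketch below does a round trip with a small stand-in model on synthetic data; `toy_model`, the temporary file path, and the dataset are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in model trained on synthetic data
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly
same_predictions = np.array_equal(toy_model.predict(X_toy), reloaded.predict(X_toy))
print(same_predictions)
```

Pickled scikit-learn models are tied to the library version that created them, so reload them in an environment with the same `sklearn` version, and only unpickle files you trust.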