BloomTech Data Science

*Unit 2, Sprint 2, Module 3*

---

# Module Project: Hyperparameter Tuning
This week, the module projects focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model that predicts whether a water pump is functional, non-functional, or functional but in need of repair.

## Directions

The tasks for this project are as follows:

- **Task 1:** Use the `wrangle` function to import training and test data.
- **Task 2:** Split training data into feature matrix `X` and target vector `y`.
- **Task 3:** Establish the baseline accuracy score for your dataset.
- **Task 4:** Build `clf_dt`.
- **Task 5:** Build `clf_rf`.
- **Task 6:** Evaluate classifiers using k-fold cross-validation.
- **Task 7:** Tune hyperparameters for the best-performing classifier.
- **Task 8:** Print out the best score and params for the model.
- **Task 9:** Create `submission.csv` and upload to Kaggle.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# I. Wrangle Data

In [22]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from pandas_profiling import ProfileReport

from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from category_encoders import OrdinalEncoder

In [55]:
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path, 
                                  na_values=[0, -2.000000e-08],
                                 parse_dates=['date_recorded']),
                      pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path, 
                         na_values=[0, -2.000000e-08],
                         parse_dates=['date_recorded'],
                         index_col='id')
        
    df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']
        
    # Drop the constant column and the raw date (already used for `pump_age`)
    df.drop(columns=['recorded_by', 'date_recorded'], inplace=True)

    # Drop HCCCs
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Drop duplicate columns (compared on the first 100 rows only)
    is_dupe = df.head(100).T.duplicated()
    dupe_cols = is_dupe[is_dupe].index
    df.drop(columns=dupe_cols, inplace=True)
    
#     df.dropna(subset=['longitude', 'latitude'], inplace=True)

    return df

**Task 1:** Use the above `wrangle` function to read `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.

In [56]:
df = wrangle('train_features.csv', 'train_labels.csv')
X_test = wrangle('test_features.csv')

X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11880 entries, 37098 to 8075
Data columns (total 30 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   amount_tsh             3572 non-null   float64
 1   gps_height             7717 non-null   float64
 2   longitude              11501 non-null  float64
 3   latitude               11501 non-null  float64
 4   num_private            140 non-null    float64
 5   basin                  11880 non-null  object 
 6   region                 11880 non-null  object 
 7   region_code            11880 non-null  int64  
 8   district_code          11876 non-null  float64
 9   population             7547 non-null   float64
 10  public_meeting         11235 non-null  object 
 11  scheme_management      11105 non-null  object 
 12  permit                 11263 non-null  object 
 13  construction_year      7674 non-null   float64
 14  extraction_type        11880 non-null  object 
 15 

# II. Split Data

**Task 2:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

**Note:** You won't need to do a train-test split because you'll use cross-validation instead.

In [57]:
target = 'status_group'
X = df.drop(target, axis=1)
y = df[target]

# III. Establish Baseline

**Task 3:** Since this is a **classification** problem, you should establish a baseline accuracy score. Find the majority class in `y` and the percentage of your training observations it represents.

In [50]:
baseline_acc = y.value_counts(normalize=True).max()
baseline_class = y.value_counts().idxmax()
print('Baseline Accuracy Score:', baseline_acc, baseline_class)

Baseline Accuracy Score: 0.5451330121945928 functional
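
The baseline is the accuracy you would get by always predicting the most common class. A minimal standard-library sketch of that calculation follows; the helper `majority_baseline` and the toy labels are hypothetical and only illustrate the idea, they are not the real pump data.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_class, share) for a sequence of class labels."""
    counts = Counter(labels)
    majority_class, n_majority = counts.most_common(1)[0]
    return majority_class, n_majority / len(labels)

# Toy labels (hypothetical, not the real pump data).
toy_y = ['functional'] * 6 + ['non functional'] * 3 + ['functional needs repair']
baseline_class, baseline_acc = majority_baseline(toy_y)
print(baseline_class, baseline_acc)
```

Any classifier worth keeping should beat this share on held-out data.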


# IV. Build Models

**Task 4:** Build a `Pipeline` named `clf_dt`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `DecisionTreeClassifier` predictor.

**Note:** Do not train `clf_dt`. You'll do that in a subsequent task. 

In [51]:
clf_dt = make_pipeline(OrdinalEncoder(),
                      SimpleImputer(strategy='mean'),
                      DecisionTreeClassifier(random_state=11,
                                            max_depth=10))

**Task 5:** Build a `Pipeline` named `clf_rf`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `RandomForestClassifier` predictor.

**Note:** Do not train `clf_rf`. You'll do that in a subsequent task. 

In [52]:
clf_rf = make_pipeline(OrdinalEncoder(),
                      SimpleImputer(strategy='mean'),
                      RandomForestClassifier(random_state=11,
                                            n_jobs=-1,
                                            n_estimators=240,
                                            max_depth=25))

# V. Check Metrics

**Task 6:** Evaluate the performance of both of your classifiers using k-fold cross-validation.

In [8]:
cv_scores_dt = cross_val_score(clf_dt, X, y, cv=5, n_jobs=-1)
cv_scores_rf = cross_val_score(clf_rf, X, y, cv=5, n_jobs=-1)

In [9]:
print('CV scores DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV scores DecisionTreeClassifier
[0.74326599 0.73579545 0.74579125 0.74789562 0.74123961]
Mean CV accuracy score: 0.7427975850085973
STD CV accuracy score: 0.0041624157207019955
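
The mean and standard deviation printed above can be checked by hand. The sketch below recomputes them from the rounded fold scores using only the standard library. It relies on NumPy's `.std()` defaulting to the population standard deviation (dividing by `n`), which corresponds to `statistics.pstdev`; the sample version is shown only for comparison.

```python
from statistics import mean, pstdev, stdev

# Fold scores copied from the DecisionTreeClassifier output above.
scores = [0.74326599, 0.73579545, 0.74579125, 0.74789562, 0.74123961]

mean_score = mean(scores)
pop_std = pstdev(scores)     # divides by n, like NumPy's default .std()
sample_std = stdev(scores)   # divides by n - 1, so it is slightly larger

print('mean:', round(mean_score, 6))
print('population std:', round(pop_std, 6))
print('sample std:', round(sample_std, 6))
```

With only five folds, the gap between the two standard deviations is noticeable, so it helps to know which one a report uses.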


In [10]:
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.80208333 0.80397727 0.80082071 0.80787037 0.80006314]
Mean CV accuracy score: 0.8029629642916236
STD CV accuracy score: 0.002788672279847727
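
Each of the five scores above comes from training on four folds and validating on the fifth. The sketch below shows how plain k-fold partitions row indices, assuming no shuffling and no stratification. Note that for a classifier with an integer `cv`, scikit-learn uses stratified folds, so this is a simplified illustration rather than what `cross_val_score` does internally; `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) lists for plain, unshuffled k-fold."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop

folds = list(kfold_indices(n_samples=12, n_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Every row lands in exactly one validation fold, which is why no separate train-test split is needed for model comparison.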


# VI. Tune Model

**Task 7:** Choose the better-performing of your two models and tune its hyperparameters using a `RandomizedSearchCV` named `model`. Make sure that you include cross-validation and that `n_iter` is set to at least `25`.

**Note:** If you're not sure which hyperparameters to tune, check the notes from today's guided project and the `sklearn` documentation. 

In [12]:
param_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': range(20, 30, 2),
    'randomforestclassifier__n_estimators': range(200, 320, 20)
}
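
It helps to know how large the search space is before launching the search. The sketch below enumerates every combination in the grid above with the standard library and draws `n_iter` of them without replacement, which mirrors how a randomized search behaves when every parameter is given as a list. The seed and variable names are assumptions made for illustration.

```python
import random
from itertools import product

# Same candidate values as `param_grid` above.
space = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': list(range(20, 30, 2)),
    'randomforestclassifier__n_estimators': list(range(200, 320, 20)),
}

names = list(space)
all_combos = [dict(zip(names, values)) for values in product(*space.values())]

n_iter, cv = 25, 5
rng = random.Random(0)  # hypothetical seed, only to make the sketch repeatable
sampled = rng.sample(all_combos, n_iter)  # draw without replacement

print('grid size:', len(all_combos))
print('candidates tried:', len(sampled))
print('cross-validation fits:', len(sampled) * cv)
```

The full grid has 2 × 5 × 6 = 60 candidates, so `n_iter=25` with `cv=5` costs 125 fits instead of the 300 an exhaustive grid search would need.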

In [36]:
model = RandomizedSearchCV(clf_rf,
                           param_distributions=param_grid,
                           n_jobs=-1,
                           verbose=1,
                           n_iter=25,
                           cv=5)
model.fit(X, y)



Fitting 5 folds for each of 3 candidates, totalling 15 fits


KeyboardInterrupt: 

**Task 8:** Print out the best score and best params for `model`.

In [14]:
best_score = model.best_score_
best_params = model.best_params_

print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

Best score for `model`: 0.8045412736931205
Best params for `model`: {'simpleimputer__strategy': 'mean', 'randomforestclassifier__n_estimators': 260, 'randomforestclassifier__max_depth': 24}
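
The keys in the best params follow the `step__parameter` convention: `make_pipeline` names each step after its lowercased class name, and the double underscore separates the step from the parameter. The sketch below groups the printed params by step; `group_params_by_step` is a hypothetical helper, not part of scikit-learn.

```python
def group_params_by_step(params):
    """Split 'step__param' keys into {step: {param: value}}."""
    grouped = {}
    for key, value in params.items():
        step, _, param = key.partition('__')
        grouped.setdefault(step, {})[param] = value
    return grouped

# Best params as printed above.
best_params = {
    'simpleimputer__strategy': 'mean',
    'randomforestclassifier__n_estimators': 260,
    'randomforestclassifier__max_depth': 24,
}
by_step = group_params_by_step(best_params)
print(by_step)
```

Reading the result per step makes it easier to copy the tuned values back into a hand-built pipeline.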


In [58]:
clf_rf = make_pipeline(OrdinalEncoder(),
                      SimpleImputer(strategy='mean'),
                      RandomForestClassifier(random_state=10,
                                            n_jobs=-1,
                                            n_estimators=260,
                                            max_depth=24,
                                            min_samples_leaf=2,
                                            criterion='gini'
                                            ))
clf_rf.fit(X, y)

cv_scores_rf = cross_val_score(clf_rf, X, y, cv=5, n_jobs=-1)
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.80650253 0.80460859 0.80239899 0.81197391 0.80153636]
Mean CV accuracy score: 0.8054040727347841
STD CV accuracy score: 0.0037146157574844893


In [59]:
param_grid = {
    'randomforestclassifier__random_state': [10, 100, 150, 200]
}

In [60]:
model = GridSearchCV(clf_rf,
                     param_grid=param_grid,
                     n_jobs=-1,
                     verbose=1,
                     cv=5)
model.fit(X, y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


In [61]:
best_score = model.best_score_
best_params = model.best_params_

print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

Best score for `model`: 0.8063300163761861
Best params for `model`: {'randomforestclassifier__random_state': 10}


# VII. Communicate Results

**Task 9:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submissions to our competition site. 

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submission follows the same formatting.

In [34]:
submission = pd.DataFrame({'status_group': model.predict(X_test)}, index=X_test.index)

In [35]:
submission

id,status_group
37098,non functional
14530,functional
62607,functional
46053,non functional
47083,functional
...,...
26092,functional
919,non functional
47444,non functional
61128,functional


In [36]:
submission.to_csv('submission_cv.csv')
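
Before uploading, it can be worth checking the file's shape. The sketch below validates CSV text with the standard library, assuming the expected header is `id,status_group` (inferred from how `submission` is built above, not confirmed against `sample_submission.csv`). `check_submission` is a hypothetical helper.

```python
import csv
import io

def check_submission(csv_text, expected_header=('id', 'status_group')):
    """Return the number of data rows if the header matches, else raise."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or tuple(rows[0]) != tuple(expected_header):
        raise ValueError(f'unexpected header: {rows[0] if rows else None}')
    if any(len(row) != len(expected_header) for row in rows[1:]):
        raise ValueError('row with wrong number of fields')
    return len(rows) - 1

# First rows of the submission shown above, as CSV text.
sample = 'id,status_group\n37098,non functional\n14530,functional\n'
n_rows = check_submission(sample)
print(n_rows)
```

To check a saved file, pass the result of reading it as text, for example `check_submission(open('submission_cv.csv').read())`.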