
# Cross-Validation

* Do **k-fold cross-validation** with an independent test set
* Use scikit-learn for **hyperparameter optimization**

In [1]:
import sys
# If you're on Colab:
if 'google.colab' in sys.modules:
  DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/main/data/'
  !pip install category_encoders==2.*
else:
  DATA_PATH = '../data/'

Collecting category_encoders==2.*
  Downloading category_encoders-2.2.2-py2.py3-none-any.whl (80 kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.2.2


In [4]:
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score  # k-fold cross-validation
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV  # hyperparameter tuning
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

# I. Wrangle Data

In [5]:
def wrangle(fm_path, tv_path=None):
  # Read the feature matrix (FM); if a target vector (TV) path is given, merge it in on the shared columns
  if tv_path:
    df = pd.merge(
        pd.read_csv(fm_path, na_values=[0, -2.000000e-08], parse_dates=['date_recorded']),
        pd.read_csv(tv_path)
    ).set_index('id')
  else:
    df = pd.read_csv(fm_path, na_values=[0, -2.000000e-08], parse_dates=['date_recorded'], index_col='id')
  # drop constant columns
  df.drop(columns=['recorded_by'], inplace=True)
  # create age feature (years between construction and the recording date)
  df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']
  df.drop(columns=['date_recorded'], inplace=True)
  # drop high cardinality columns (more than `cutoff` unique values)
  cutoff = 100
  drop_cols = [col for col in df.select_dtypes('object').columns if df[col].nunique() > cutoff]
  df.drop(columns=drop_cols, inplace=True)
  # drop duplicate columns (identical values in the first 15 rows)
  is_dup = df.head(15).T.duplicated()
  dup_cols = is_dup[is_dup].index
  df.drop(columns=dup_cols, inplace=True)
  return df

In [7]:
df = wrangle(DATA_PATH + 'waterpumps/train_features.csv', DATA_PATH + 'waterpumps/train_labels.csv')
X_test = wrangle(DATA_PATH + 'waterpumps/test_features.csv')

# II. Split Data

# Split Target Vector (TV) from Feature Matrix (FM)

In [8]:
target = 'status_group'
y_train = df[target]
X_train = df.drop(columns=target)

# Training-Validation Split

* Since we're doing k-fold cross-validation (CV), there's no need for a separate validation set: each fold takes a turn as the validation data
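
To make the idea concrete, here is a minimal sketch of how k-fold splitting partitions the rows. The helper `k_fold_indices` is a hypothetical name written for illustration, not a scikit-learn function, and it uses plain contiguous folds; for a classifier with an integer `cv`, `cross_val_score` uses stratified folds instead, so the actual splits in this notebook will differ.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample lands in exactly one validation fold."""
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Illustrative sizes only: 10 rows split into 5 folds.
folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx} val={val_idx}")
```

Because every row is used for validation exactly once, the mean of the k scores plays the role that a single held-out validation score would otherwise play.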

# III. Establish Baseline

This is a **classification** problem, so our baseline metric will be **accuracy**: the score we would get by always predicting the majority class.

In [9]:
print('Baseline Accuracy:', y_train.value_counts(normalize=True).max())

Baseline Accuracy: 0.5430899510092763


# IV. Build Models

* DecisionTreeClassifier
* RandomForestClassifier

In [10]:
model_dt = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    DecisionTreeClassifier(random_state=42)
)

In [11]:
model_rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=25, random_state=42)
)

**Check cross-validation scores**

In [19]:
cv_scores_dt = cross_val_score(model_dt, X_train, y_train, cv=5, n_jobs=-1)
cv_scores_rf = cross_val_score(model_rf, X_train, y_train, cv=5, n_jobs=-1)

In [20]:
print('CV score DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV score DecisionTreeClassifier
[0.75782828 0.74991582 0.75917508 0.75580808 0.74779022]
Mean CV accuracy score: 0.7541034981518187
STD CV accuracy score: 0.0044697812186829024


In [21]:
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.80505051 0.79907407 0.79991582 0.79974747 0.79425878]
Mean CV accuracy score: 0.7996093309558248
STD CV accuracy score: 0.003424451376257444


# V. Tune Model

* What are important hyperparameters for RandomForestClassifier?
  * max_depth: 5-35
  * n_estimators: 25-100
  * imputation strategy: mean or median

**GridSearch**: Very thorough, but it can take a long time because it fits every combination in the grid

In [None]:
param_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': range(5, 40, 5),
    'randomforestclassifier__n_estimators': range(25, 125, 25)
}

model_rfgs = GridSearchCV(
    model_rf, 
    param_grid=param_grid,
    n_jobs=-1,
    cv=5,
    verbose=1
)

model_rfgs.fit(X_train, y_train)

In [22]:
7*4*2*5  # model fits for grid search: 7 max_depth values * 4 n_estimators values * 2 strategies * 5 folds

280
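
The same count can be derived from the grid itself rather than by hand. The sketch below uses only the standard library; the variable names `n_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
from itertools import product

# Same search space as the grid search cell, keyed by pipeline step name.
param_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': range(5, 40, 5),
    'randomforestclassifier__n_estimators': range(25, 125, 25),
}
n_folds = 5  # matches cv=5

# Every combination of one value per hyperparameter is one candidate.
n_candidates = len(list(product(*param_grid.values())))
# Each candidate is fit once per fold.
n_fits = n_candidates * n_folds

print('Candidates:', n_candidates)
print('Fits during search:', n_fits)
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate one more time on the full training set after the search finishes.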

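**RandomizedSearchCV**: Quicker, because it samples only a fixed number of combinations (`n_iter`) from the search space; less exhaustive than grid search, but usually good enough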
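
As a rough sketch of what a randomized search over the same hyperparameters could look like, the example below runs on a small synthetic dataset so it is self-contained. Several things here are assumptions rather than part of the notebook: `X_demo` and `y_demo` stand in for `X_train` and `y_train`, the `OrdinalEncoder` step is left out because the synthetic features are already numeric, and `n_iter=3` and `cv=3` are chosen only to keep the demo fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Assumption: synthetic numeric data stands in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Same steps as model_rf, minus the encoder (not needed for numeric data).
pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))

# Same search space as the grid search, keyed by pipeline step name.
param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': list(range(5, 40, 5)),
    'randomforestclassifier__n_estimators': list(range(25, 125, 25)),
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=3,         # illustrative: sample 3 candidates instead of trying all 56
    cv=3,             # illustrative: fewer folds to keep the demo fast
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print('Best params:', search.best_params_)
print('Best CV accuracy:', search.best_score_)
```

On the real data, the same pattern would apply to `model_rf` with `X_train` and `y_train`; the value of `n_iter` then sets the trade-off between run time and how much of the search space is explored.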

# Make Submission