
# Classification Metrics

- get and interpret the **confusion matrix** for classification models
- use classification metrics: **precision, recall**
- understand the relationships between precision, recall, **thresholds, and predicted probabilities**, to help **make decisions and allocate budgets**

In [None]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [None]:
from category_encoders import OrdinalEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# I. Wrangle Data

In [None]:
def wrangle(fm_path, tv_path=None):
    """Load the feature matrix (and, if given, the target vector) and clean it."""
    # treat 0 and -2e-08 as missing values
    na_values = [0, -2.000000e-08]
    if tv_path:
        features = pd.read_csv(fm_path, na_values=na_values, parse_dates=['date_recorded'])
        df = pd.merge(features, pd.read_csv(tv_path)).set_index('id')
        # create new binary target: 1 for any pump that is not 'functional'
        df['needs_repair'] = (df['status_group'] != 'functional').astype(int)
        # drop the original target so it cannot leak into the features
        df.drop(columns='status_group', inplace=True)
    else:
        df = pd.read_csv(fm_path, na_values=na_values, parse_dates=['date_recorded'], index_col='id')
    # drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)
    # create age feature
    df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']
    df.drop(columns=['date_recorded'], inplace=True)
    # drop high cardinality columns
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)
    # drop duplicate columns (compared on the first 15 rows only)
    dup_cols = df.columns[df.head(15).T.duplicated()]
    df.drop(columns=dup_cols, inplace=True)

    return df

In [None]:
df = wrangle(DATA_PATH + 'waterpumps/train_features.csv', DATA_PATH + 'waterpumps/train_labels.csv')
X_test = wrangle(DATA_PATH + 'waterpumps/test_features.csv')

# EDA 
* How can we transform our target so that this is a **binary classification** problem?
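
One possible answer, mirroring what `wrangle` does above, is to collapse the multi-class `status_group` column into a single 0/1 flag. The sketch below uses a toy Series; the three class names are assumed to match the raw data, and the values are illustrative only.

```python
import pandas as pd

# Toy labels; the class names are assumed to match the raw `status_group` column.
status_group = pd.Series([
    'functional', 'non functional', 'functional needs repair',
    'functional', 'non functional',
])

# Anything that is not fully functional counts as needing repair (1); otherwise 0.
needs_repair = (status_group != 'functional').astype(int)

print(needs_repair.tolist())
# Class balance of the new binary target
print(needs_repair.value_counts(normalize=True))
```

With two classes, the questions "how many did we catch?" and "how many of our alarms were real?" become well defined, which is what recall and precision measure later in this notebook.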

# II. Split Data

In [None]:
# split target vector (TV) from feature matrix (FM)
target = 'needs_repair'
X = df.drop(columns=target)
y = df[target]

# train-val split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# sanity check
assert len(X_train) + len(X_val) == len(X)

# III. Establish Baseline

In [None]:
print('Baseline Accuracy:', y_train.value_counts(normalize=True).max())

Baseline Accuracy: 0.5425829668132747


# IV. Build Model

* OrdinalEncoder
* SimpleImputer
* RandomForestClassifier
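
The three steps listed above could be chained roughly as follows. This is a minimal sketch on hypothetical toy data, not the notebook's actual model: it swaps in scikit-learn's own `OrdinalEncoder` (wrapped in a column transformer) as a stand-in for the `category_encoders` version so that it runs without extra installs, and the hyperparameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder  # stand-in for category_encoders.OrdinalEncoder

# Hypothetical toy data: one categorical feature, one numeric feature with gaps.
X_toy = pd.DataFrame({
    'basin': ['rufiji', 'pangani', 'rufiji', 'pangani', 'rufiji', 'pangani'],
    'pump_age': [3.0, 25.0, np.nan, 30.0, 5.0, np.nan],
})
y_toy = pd.Series([0, 1, 0, 1, 0, 1])

toy_model = make_pipeline(
    # Encode the string column as integers; pass numeric columns through unchanged.
    make_column_transformer(
        (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), ['basin']),
        remainder='passthrough',
    ),
    # Fill the remaining missing values with the column mean.
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=25, random_state=42),
)
toy_model.fit(X_toy, y_toy)

toy_preds = toy_model.predict(X_toy)
print(toy_preds)
```

In the notebook itself, with `category_encoders` installed, the analogous pipeline would presumably be `make_pipeline(OrdinalEncoder(), SimpleImputer(), RandomForestClassifier(...))` fit on `X_train` and `y_train`.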

# Interlude: Beware of leakage

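If you leave `status_group` in your feature matrix, you'll have **leakage**: `needs_repair` is derived directly from `status_group`, so the model could simply read the answer off that column.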
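
A small sketch of why this is leakage, on a toy frame (the class names and values are assumptions for illustration): every `status_group` value maps to exactly one target value, so a model given that column has nothing left to learn.

```python
import pandas as pd

# Toy frame; the class names are assumed to match the raw `status_group` column.
toy = pd.DataFrame({
    'status_group': ['functional', 'non functional', 'functional needs repair', 'functional'],
    'pump_age': [3, 25, 12, 7],
})
toy['needs_repair'] = (toy['status_group'] != 'functional').astype(int)

# Count the distinct target values seen for each status_group value.
targets_per_group = toy.groupby('status_group')['needs_repair'].nunique()
print(targets_per_group.max())  # 1 means the column fully determines the target

# Guard: drop the leaky column along with the target before modeling.
X_leak_free = toy.drop(columns=['needs_repair', 'status_group'])
assert 'status_group' not in X_leak_free.columns
```

A suspiciously perfect validation score is often the first visible symptom of a column like this slipping through.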

# V. Check Metrics

In [None]:
print('Training Accuracy:', model.score(X_train, y_train))
print('Validation Accuracy:', model.score(X_val, y_val))

Training Accuracy: 0.9949493886655864
Validation Accuracy: 0.8223905723905723


**Confusion Matrix**
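
As a sketch of how to read a confusion matrix, the example below uses ten made-up labels rather than the pump data. With scikit-learn's convention, rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = needs repair, 0 = functional (illustrative values only).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes, both ordered [0, 1].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print('TN:', tn, 'FP:', fp, 'FN:', fn, 'TP:', tp)
```

For the fitted pipeline, `ConfusionMatrixDisplay.from_estimator(model, X_val, y_val)` (available in scikit-learn 1.0 and later) should draw the same table as a plot.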

**Recall Score**

Of those pumps that actually needed repair, what proportion did you correctly predict as needing repair?
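
In confusion-matrix terms, recall is TP / (TP + FN). A quick check on toy labels (made up for illustration), comparing a manual count with `recall_score`:

```python
from sklearn.metrics import recall_score

# Toy labels: 1 = needs repair, 0 = functional (illustrative values only).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# True positives: needed repair and were flagged. False negatives: needed repair but were missed.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)
print(recall_manual, recall_sklearn)
```

Here 3 of the 5 pumps that needed repair were caught, so both values come out to 0.6.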

**Precision Score**

Of all the pumps that you predicted as needing repair, what proportion actually needed repair?
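
In confusion-matrix terms, precision is TP / (TP + FP). The same kind of check on toy labels (made up for illustration), this time against `precision_score`:

```python
from sklearn.metrics import precision_score

# Toy labels: 1 = needs repair, 0 = functional (illustrative values only).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# True positives: flagged and needed repair. False positives: flagged but actually functional.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))

precision_manual = tp / (tp + fp)
precision_sklearn = precision_score(y_true, y_pred)
print(precision_manual, precision_sklearn)
```

Here 3 of the 4 flagged pumps really needed repair, so both values come out to 0.75.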

**Classification Report**
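
`classification_report` bundles precision, recall, and F1 for every class into one table. A sketch on toy labels (the values and the display names passed to `target_names` are illustrative):

```python
from sklearn.metrics import classification_report

# Toy labels: 1 = needs repair, 0 = functional (illustrative values only).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# target_names follow the sorted label order: 0 first, then 1.
names = ['functional', 'needs repair']
print(classification_report(y_true, y_pred, target_names=names))

# The same numbers as a nested dict, convenient for programmatic checks.
report = classification_report(y_true, y_pred, target_names=names, output_dict=True)
print(report['needs repair']['precision'], report['needs repair']['recall'])
```

For the fitted pipeline, the analogous call would be `classification_report(y_val, model.predict(X_val), ...)`.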

# Case Study

Let's say that it costs the Tanzanian government $100 to inspect a water pump, and there is only funding for 2,000 pump inspections (a total budget of $200,000).

In [None]:
n_inspections = 2000

Scenario 1: Choose pumps randomly
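
A sketch of this scenario on simulated data: when pumps are chosen at random, the share of inspections that find a pump needing repair should land close to the overall base rate. The 46% repair rate, the 30,000-row size, and the variable names below are assumptions for illustration, not results from the pump data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for y_val: 1 = needs repair. The 46% rate is made up.
y_sim = (rng.random(30_000) < 0.46).astype(int)

n_inspections = 2000
cost_per_inspection = 100

# Pick pumps uniformly at random, without replacement.
chosen = rng.choice(len(y_sim), size=n_inspections, replace=False)
repairs_found = int(y_sim[chosen].sum())
wasted = (n_inspections - repairs_found) * cost_per_inspection

print('Inspections:', n_inspections)
print('Pumps needing repair found:', repairs_found)
print('Spent inspecting functional pumps: $', wasted)
print('Hit rate:', repairs_found / n_inspections, 'vs. base rate:', y_sim.mean())
```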

Scenario 2: Using our model 'out of the box'
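
A sketch of this scenario on simulated data: for a binary problem, "out of the box" effectively means flagging pumps whose predicted probability exceeds 0.5, then inspecting a random subset of the flagged pumps. The simulated probabilities below are a hypothetical stand-in for `model.predict_proba(X_val)[:, 1]`, and the Beta distributions are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 30_000

# Hypothetical stand-ins for y_val and the predicted probability of class 1.
y_sim = (rng.random(n_rows) < 0.46).astype(int)
proba_sim = np.where(
    y_sim == 1,
    rng.beta(5, 2, size=n_rows),  # pumps needing repair tend to score high
    rng.beta(2, 5, size=n_rows),  # functional pumps tend to score low
)

n_inspections = 2000

# Default decision rule: flag a pump when its probability exceeds 0.5.
flagged = np.flatnonzero(proba_sim > 0.5)

# Inspect a random subset of the flagged pumps.
chosen = rng.choice(flagged, size=n_inspections, replace=False)
repairs_found = int(y_sim[chosen].sum())

print('Pumps flagged by the model:', len(flagged))
print('Pumps needing repair found:', repairs_found)
print('Hit rate:', repairs_found / n_inspections, 'vs. base rate:', y_sim.mean())
```

With the fitted pipeline, `y_sim` would be replaced by `y_val` and the flagged set by the rows where `model.predict(X_val) == 1`.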

Scenario 3: We emphasize **precision** in our model and only select pumps that the model is very certain (predicted probability > 0.85) need repair
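
A sketch of this scenario on simulated data, comparing the default 0.5 cutoff with the stricter 0.85 cutoff. Raising the threshold should trade recall for precision. The simulated probabilities are a hypothetical stand-in for `model.predict_proba(X_val)[:, 1]`, and the helper name `precision_recall_at` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 30_000

# Hypothetical stand-ins for y_val and the predicted probability of class 1.
y_sim = (rng.random(n_rows) < 0.46).astype(int)
proba_sim = np.where(
    y_sim == 1,
    rng.beta(5, 2, size=n_rows),
    rng.beta(2, 5, size=n_rows),
)

def precision_recall_at(threshold):
    """Return (precision, recall) when flagging pumps with probability > threshold."""
    flagged = proba_sim > threshold
    tp = int((flagged & (y_sim == 1)).sum())
    precision = tp / int(flagged.sum())
    recall = tp / int((y_sim == 1).sum())
    return precision, recall

precision_default, recall_default = precision_recall_at(0.5)
precision_high, recall_high = precision_recall_at(0.85)
print('threshold 0.50 -> precision, recall:', precision_default, recall_default)
print('threshold 0.85 -> precision, recall:', precision_high, recall_high)

# Spend the inspection budget only on pumps above the strict threshold.
n_inspections = 2000
confident = np.flatnonzero(proba_sim > 0.85)
n_chosen = min(n_inspections, len(confident))  # fewer pumps may clear the bar than we can fund
chosen = rng.choice(confident, size=n_chosen, replace=False)
repairs_found = int(y_sim[chosen].sum())
print('Inspected:', n_chosen, 'Needing repair:', repairs_found)
```

The `min(...)` guard matters in practice: a strict threshold can leave fewer candidate pumps than the budget allows, in which case the threshold may need to be lowered.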