<a href="https://colab.research.google.com/github/gptix/DS-Unit-2-Kaggle-Challenge/blob/master/module4/Jud%20Taylor%20-%20assignment_kaggle_challenge_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# TODO - Lesson

# Classification Metrics

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.


## Stretch Goals

### Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)
- [ ] Stacking Ensemble. (See below)

### Stacking Ensemble

Here's some code you can use to "stack" multiple submissions by taking a majority vote across them, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
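
To see what the majority vote does row by row, here is a minimal sketch on made-up predictions. The `model_a`, `model_b`, and `model_c` columns are hypothetical stand-ins for three submission files, not real results.

```python
import pandas as pd

# Hypothetical predictions from three submissions (illustrative values only).
votes = pd.DataFrame({
    'model_a': ['functional', 'non functional', 'functional'],
    'model_b': ['functional', 'functional', 'non functional'],
    'model_c': ['non functional', 'non functional', 'functional needs repair'],
})

# mode(axis='columns') returns every most-frequent value per row;
# column 0 holds the first of them, which we keep as the ensemble prediction.
majority_vote = votes.mode(axis='columns')[0]
print(majority_vote)
```

In the last row all three submissions disagree, so `mode` returns several candidates and `[0]` keeps only the first. A tie is therefore broken by ordering rather than by model quality, which is worth keeping in mind when ensembling a small number of submissions.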

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [3]:
from sklearn.model_selection import train_test_split
# Further split the original 'train' set into 'train' and 'val'.
# The new 'train' set will be 80% of the original set.
# Specify the 'random_state' parameter explicitly, so that the split is
# reproducible.
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)

print(f"Train shape: {train.shape}, Val shape: {val.shape}, Test shape: {test.shape}.")

Train shape: (47520, 41), Val shape: (11880, 41), Test shape: (14358, 40).


In [0]:
import numpy as np

def wrangle(X):
    """A standard set of steps to wrangle data in sets."""
    
    # Work on a copy of the dataframe received, to prevent
    # SettingWithCopyWarning.
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # We know that none of the pumps are located at longitude or latitude zero,
    # so we replace these zeros with NaN and add a '_MISSING' indicator column,
    # so that a later step can impute more appropriate values.
    # Iteration 2 - add 'gps_height'
    cols_with_zeros = ['longitude', 'latitude',
                       'gps_height', 'population', 'construction_year']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col + '_MISSING'] = X[col].isnull()



    # The columns 'quantity' & 'quantity_group' are duplicates, so we drop one.
    # 'payment_type' is also superfluous.
    X = X.drop(columns=['quantity_group', 'payment_type'])

    # 'recorded_by' is invariant and 'id' is a row identifier, so neither is
    # usable as a feature.
    invariant_cols = ['recorded_by', 'id']
    X = X.drop(columns=invariant_cols)

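    # Parse the recording date so we can extract year, month, and day.
    # ('infer_datetime_format' is deprecated in pandas 2.0 and not needed.)
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])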

    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day

    X = X.drop(columns='date_recorded')

    X['years'] = X['year_recorded']-X['construction_year']
    X['years_MISSING'] = X['years'].isnull()

    # return the wrangled dataframe.
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
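
The zero-to-NaN step in `wrangle` is easier to follow on a tiny frame. The sketch below applies the same two lines to a made-up dataframe; the `toy` name and its values are assumptions for illustration, not rows from the waterpumps data.

```python
import numpy as np
import pandas as pd

# Made-up rows: a zero here stands for "value not recorded".
toy = pd.DataFrame({
    'longitude': [35.2, 0.0, 38.1],
    'population': [120, 0, 0],
})

for col in ['longitude', 'population']:
    # Turn placeholder zeros into NaN so an imputer can fill them later.
    toy[col] = toy[col].replace(0, np.nan)
    # Record which rows were missing, since that fact may itself be predictive.
    toy[col + '_MISSING'] = toy[col].isnull()

print(toy)
```

The indicator columns keep the "was missing" signal available to the model even after `SimpleImputer` replaces the NaN values with the median.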

In [5]:
# The value to be predicted is 'status_group'.
target = 'status_group'

# Make a dataframe of all features except the target.  This might be changed by 
# dropping further columns. (The 'id' column was already dropped in wrangle.)
train_features = train.drop(columns=[target])

# Get numeric features (as a list of column names).
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get non-numeric features with relatively low cardinality.
## Get the cardinality of the NON-numeric features (as a Pandas series).
cardinality = train_features.select_dtypes(exclude='number').nunique()

## Get a list of all categorical features with cardinality of at most 50.
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Concatenate the two lists to list all features.
features = numeric_features + categorical_features
print(features)

['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'year_recorded', 'month_recorded', 'day_recorded', 'years', 'basin', 'region', 'public_meeting', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'water_quality', 'quality_group', 'quantity', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group', 'longitude_MISSING', 'latitude_MISSING', 'gps_height_MISSING', 'population_MISSING', 'construction_year_MISSING', 'years_MISSING']
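
The feature selection above can be sketched on a small frame. This is a toy illustration with hypothetical column names and a cardinality cutoff of 2 instead of 50. It also shows why the boolean `_MISSING` columns appear at the end of the printed list: `select_dtypes(include='number')` does not treat booleans as numeric, so they are picked up as low-cardinality non-numeric columns.

```python
import pandas as pd

# Hypothetical columns: one numeric, one low-cardinality category,
# one high-cardinality category, and one boolean flag.
toy = pd.DataFrame({
    'amount': [1.0, 2.0, 3.0, 4.0],
    'basin': ['a', 'b', 'a', 'b'],
    'name': ['w', 'x', 'y', 'z'],
    'amount_MISSING': [True, False, True, True],
})

# Numeric columns only (booleans are excluded).
numeric_features = toy.select_dtypes(include='number').columns.tolist()

# Count distinct values in every non-numeric column.
cardinality = toy.select_dtypes(exclude='number').nunique()

# Keep non-numeric columns with at most 2 distinct values.
categorical_features = cardinality[cardinality <= 2].index.tolist()

features = numeric_features + categorical_features
print(features)
```

The high-cardinality `name` column is dropped, which mirrors how the cutoff above excludes columns with more than 50 distinct values before ordinal encoding.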


In [0]:
X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

X_test = test[features]

In [7]:
from sklearn.pipeline import make_pipeline

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Random Forest Classifier with Ordinal Encoder
RFC_ord_pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=1984, n_jobs=-1, n_estimators=100)
)

RFC_ord_pipeline.fit(X_train, y_train)
print('Validation Accuracy', RFC_ord_pipeline.score(X_val, y_val))

Validation Accuracy 0.8108585858585858


In [34]:
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

y_pred = RFC_ord_pipeline.predict(X_val)

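def plot_cfz_matrix(y_true, y_predicted):
    """Return the confusion matrix as a labeled dataframe."""
    # Use the labels from the arguments (not the global y_val), in the same
    # sorted order that confusion_matrix uses by default.
    ul = unique_labels(y_true, y_predicted)
    cols = [f'Predicted {label}' for label in ul]
    idx = [f'Actual {label}' for label in ul]
    return pd.DataFrame(confusion_matrix(y_true, y_predicted),
                        columns=cols, index=idx)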

plot_cfz_matrix(y_val, y_pred)

| | Predicted functional | Predicted functional needs repair | Predicted non functional |
|---|---|---|---|
| Actual functional | 5713 | 200 | 539 |
| Actual functional needs repair | 425 | 292 | 146 |
| Actual non functional | 861 | 76 | 3628 |
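
The counts in this confusion matrix are enough to recover the headline classification metrics by hand. The sketch below copies the matrix into a NumPy array (rows are actual classes, columns are predicted classes) and derives accuracy, per-class recall and precision, and the majority class baseline for this validation set. The variable names are illustrative.

```python
import numpy as np

labels = ['functional', 'functional needs repair', 'non functional']

# Rows: actual class. Columns: predicted class. Counts copied from above.
cm = np.array([
    [5713, 200, 539],
    [425, 292, 146],
    [861, 76, 3628],
])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Recall: of the pumps that are actually in a class, how many were found?
recall = np.diag(cm) / cm.sum(axis=1)

# Precision: of the pumps predicted to be in a class, how many really are?
precision = np.diag(cm) / cm.sum(axis=0)

# Majority class baseline: always predict the most common actual class.
baseline = cm.sum(axis=1).max() / cm.sum()

print(f'Accuracy: {accuracy:.4f}, baseline: {baseline:.4f}')
for label, r, p in zip(labels, recall, precision):
    print(f'{label}: recall {r:.4f}, precision {p:.4f}')
```

The accuracy computed this way agrees with the validation accuracy printed earlier, and the baseline is about 54%. Recall for "functional needs repair" is only about a third (292 of 863), so the model misses most pumps in that class even though overall accuracy is about 81%.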
