<a href="https://colab.research.google.com/github/medinadiegoeverardo/DS-Unit-2-Kaggle-Challenge/blob/master/module4/medinadiego_4_assignment_kaggle_challenge_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 4

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 60% accuracy (above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.


## Stretch Goals

### Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See below)
- [ ] Stacking Ensemble. (See below)
- [ ] More Categorical Encodings. (See below)

### RandomizedSearchCV / GridSearchCV, for model selection

- _[Introduction to Machine Learning with Python](http://shop.oreilly.com/product/0636920030515.do)_ discusses options for "Grid-Searching Which Model To Use" in Chapter 6:

> You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC ...

The example is shown in [the accompanying notebook](https://github.com/amueller/introduction_to_ml_with_python/blob/master/06-algorithm-chains-and-pipelines.ipynb), code cells 35-37. Could you apply this concept to your own pipelines?
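As a rough sketch of that idea (not the book's exact code), a `param_grid` given as a list of dicts lets `GridSearchCV` swap whole pipeline steps. The synthetic data, the step names `preprocessing` and `classifier`, and the candidate values below are all assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own features and target)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Named steps, so the grid can replace either one
pipe = Pipeline([('preprocessing', StandardScaler()),
                 ('classifier', LogisticRegression(max_iter=1000))])

# Each dict is its own sub-grid: one per candidate model
param_grid = [
    {'classifier': [LogisticRegression(max_iter=1000)],
     'preprocessing': [StandardScaler()],
     'classifier__C': [0.1, 1, 10]},
    {'classifier': [RandomForestClassifier(n_estimators=50, random_state=0)],
     'preprocessing': ['passthrough'],  # trees do not need scaling
     'classifier__max_features': [2, 4]},
]

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
print(grid.best_score_)
```

The same list-of-dicts pattern should also work with `RandomizedSearchCV` via `param_distributions`, which may be the more practical choice when the combined search space gets large.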

### Stacking Ensemble

Here's some code you can use to "stack" multiple submissions, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
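To see what the majority vote does row by row, here is a small in-memory sketch. The three "submissions" are made-up values, not real model output. Note that when every model disagrees, `mode` returns all tied values in sorted order, so taking column `0` picks the alphabetically first label:

```python
import pandas as pd

# Hypothetical predictions from three models for three waterpumps
votes = pd.DataFrame({
    'model_a': ['functional', 'non functional', 'functional'],
    'model_b': ['functional', 'non functional', 'non functional'],
    'model_c': ['non functional', 'functional', 'functional needs repair'],
})

# One row per pump; column 0 holds the most common label in that row
modes = votes.mode(axis='columns')
majority = modes[0]
print(majority)
```

With an odd number of submissions and only a few classes, full ties are uncommon, but it is worth knowing how they are resolved before trusting the ensemble.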


### More Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do. For example, [Catboost](https://catboost.ai/), or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](http://contrib.scikit-learn.org/categorical-encoding/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](http://contrib.scikit-learn.org/categorical-encoding/binary.html).


**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](http://contrib.scikit-learn.org/categorical-encoding/catboost.html)
- [James-Stein Encoder](http://contrib.scikit-learn.org/categorical-encoding/jamesstein.html)
- [Leave One Out](http://contrib.scikit-learn.org/categorical-encoding/leaveoneout.html)
- [M-estimate](http://contrib.scikit-learn.org/categorical-encoding/mestimate.html)
- [Target Encoder](http://contrib.scikit-learn.org/categorical-encoding/targetencoder.html)
- [Weight of Evidence](http://contrib.scikit-learn.org/categorical-encoding/woe.html)

Category Encoders' mean encoding implementations work for regression problems or binary classification problems.

For multi-class classification problems, you will need to temporarily reformulate it as binary classification. For example:

```python
import category_encoders as ce

encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
X_val_encoded = encoder.transform(X_val) # transform the validation features; no target is needed here
```
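To make the general idea of mean encoding concrete, here is a minimal hand-rolled sketch on toy data. It uses simple additive smoothing toward the overall rate, which is an assumption for illustration and not the exact formula used by `ce.TargetEncoder`; the data values and the smoothing strength `m` are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'basin': ['A', 'A', 'A', 'B', 'B', 'C'],
    'status_group': ['functional', 'non functional', 'functional',
                     'non functional', 'non functional', 'functional'],
})

# Reformulate the multi-class target as binary: functional or not
is_functional = (toy['status_group'] == 'functional').astype(int)

prior = is_functional.mean()  # overall share of functional pumps
stats = is_functional.groupby(toy['basin']).agg(['sum', 'count'])

m = 2  # smoothing strength (assumed); larger values pull rare categories toward the prior
encoding = (stats['sum'] + m * prior) / (stats['count'] + m)

# Replace each category with its smoothed target mean
toy['basin_encoded'] = toy['basin'].map(encoding)
print(toy)
```

Because the encoding is computed from the target, it should be fit on training rows only and then applied to validation and test rows, which is what the `fit_transform` / `transform` split in the library does.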

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/learn/embeddings)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation! Maybe you can make your own contributions!**_

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [0]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

### feature eng

In [0]:
# suspected duplicates

train_dupl = train[['source', 'source_type', 'waterpoint_type', 'waterpoint_type_group','extraction_type', 'extraction_type_group',
                 'extraction_type_class', 'payment', 'payment_type', 'quantity', 'quantity_group']]

In [0]:
train_dupl.tail()

Unnamed: 0,source,source_type,waterpoint_type,waterpoint_type_group,extraction_type,extraction_type_group,extraction_type_class,payment,payment_type,quantity,quantity_group
59395,spring,spring,communal standpipe,communal standpipe,gravity,gravity,gravity,pay per bucket,per bucket,enough,enough
59396,river,river/lake,communal standpipe,communal standpipe,gravity,gravity,gravity,pay annually,annually,enough,enough
59397,machine dbh,borehole,hand pump,hand pump,swn 80,swn 80,handpump,pay monthly,monthly,enough,enough
59398,shallow well,shallow well,hand pump,hand pump,nira/tanira,nira/tanira,handpump,never pay,never pay,insufficient,insufficient
59399,shallow well,shallow well,hand pump,hand pump,nira/tanira,nira/tanira,handpump,pay when scheme fails,on failure,enough,enough


In [0]:
train.source.value_counts()
# dropping source_type since source has more unique values
# also waterpoint_type_group since waterpoint_type has 1 unique value more

spring                  17021
shallow well            16824
machine dbh             11075
river                    9612
rainwater harvesting     2295
hand dtw                  874
lake                      765
dam                       656
other                     212
unknown                    66
Name: source, dtype: int64

In [0]:
def replacing_dates(df):
  df['date_recorded'] = pd.to_datetime(df['date_recorded'], infer_datetime_format=True)
  df['year_recorded'] = df['date_recorded'].dt.year
  df['month_recorded'] = df['date_recorded'].dt.month
  df['day_recorded'] = df['date_recorded'].dt.day

replacing_dates(train)
replacing_dates(test)

In [0]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,year_recorded,month_recorded,day_recorded
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional,2011,3,14
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional,2013,3,6
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional,2013,2,25
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional,2013,1,28
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional,2011,7,13


In [0]:
columns_drop = ['payment', 'extraction_type', 'waterpoint_type_group', 'quantity_group', 'source_type', 'date_recorded']
target = 'status_group'

features = train.columns.drop(columns_drop + [target])

In [0]:
features

Index(['id', 'amount_tsh', 'funder', 'gps_height', 'installer', 'longitude',
       'latitude', 'wpt_name', 'num_private', 'basin', 'subvillage', 'region',
       'region_code', 'district_code', 'lga', 'ward', 'population',
       'public_meeting', 'recorded_by', 'scheme_management', 'scheme_name',
       'permit', 'construction_year', 'extraction_type_group',
       'extraction_type_class', 'management', 'management_group',
       'payment_type', 'water_quality', 'quality_group', 'quantity', 'source',
       'source_class', 'waterpoint_type', 'year_recorded', 'month_recorded',
       'day_recorded'],
      dtype='object')

In [0]:
# replace 'none' with np.nan, impute later. no need to reduce cardinality (ordinal encoder will be used)
import numpy as np
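train['wpt_name'] = train['wpt_name'].replace('none', np.nan)
test['wpt_name'] = test['wpt_name'].replace('none', np.nan) # keep test preprocessing consistent with train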
# replacing_nulls_with_nulls(train)

In [0]:
def replacing_nulls_with_nulls(df):
  """In columns that already contain nulls, also treat 0 as missing.

  Modifies df in place and returns the names of the affected columns.
  """
  those_null = []
  for col in df.columns:
    if not df[col].isnull().any():
      continue

    df[col] = df[col].replace(0, np.nan)
    those_null.append(col)
  return those_null

replacing_nulls_with_nulls(train)
replacing_nulls_with_nulls(test)

['funder',
 'installer',
 'subvillage',
 'public_meeting',
 'scheme_management',
 'scheme_name',
 'permit']

In [0]:
x_train = train[features]
y_train = train[target]
x_test = test[features]

target = 'status_group'

features = train.columns.drop(columns_drop + [target])

### pipeline, etc

In [0]:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from scipy.stats import uniform, randint

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(), 
    RandomForestClassifier(random_state=10))

param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestclassifier__n_estimators': randint(50, 300), # range(1, len(X_train.columns)+1)
    'randomforestclassifier__max_depth': [5, 10, 15, 20], 
    'randomforestclassifier__max_features': uniform(0, 1), 
}

search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=5, 
    cv=3, 
    verbose=10,
    return_train_score=True, 
    n_jobs=-1
)

search.fit(x_train, y_train);

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  6.3min
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  9.3min finished


In [0]:
print('best hyperparameters', search.best_params_)
print('best accuracy score: ', search.best_score_) 
y_pred = search.predict(x_test)

best hyperparameters {'randomforestclassifier__max_depth': 15, 'randomforestclassifier__max_features': 0.18884867296076535, 'randomforestclassifier__n_estimators': 173, 'simpleimputer__strategy': 'mean'}
best accuracy score:  0.7987710437710438


In [0]:
submission = sample_submission.copy()
submission['status_group'] = y_pred

In [0]:
submission.to_csv('medinadiegokaggle_4.csv', index=False)

In [0]:
from google.colab import files
files.download('medinadiegokaggle_4.csv')

In [0]:
test.shape

In [0]:
train.shape

### Random Forest Classifier

In [0]:
from sklearn.model_selection import train_test_split
train, validation = train_test_split(train, random_state=10, train_size=.8)

In [0]:
columns_drop = ['payment', 'extraction_type', 'waterpoint_type_group', 'quantity_group', 'source_type', 'date_recorded']
target = 'status_group'

features = train.columns.drop(columns_drop + [target])

In [0]:
replacing_dates(train)
replacing_dates(validation)

# replace 'none' with np.nan, impute later. no need to reduce cardinality (ordinal encoder will be used)
train['wpt_name'] = train['wpt_name'].replace('none', np.nan)
replacing_nulls_with_nulls(train)
replacing_nulls_with_nulls(validation)

In [0]:
xx_train = train[features]
yy_train = train[target]
xx_val = validation[features]
yy_val = validation[target]
xx_test = test[features]

In [0]:
xx_train.head()

In [0]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(random_state=10, max_depth=20, 
                           max_features=0.0287, n_estimators=238))



In [0]:
from sklearn.metrics import accuracy_score

# Fit on train, score on val
pipeline.fit(xx_train, yy_train)
y_pred = pipeline.predict(xx_val)
print('Validation Accuracy', accuracy_score(yy_val, y_pred))

### confusion matrix

In [0]:
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels
import matplotlib.pyplot as plt
import seaborn as sns

confusion_matrix(yy_val, y_pred)

In [0]:
from sklearn.preprocessing import Normalizer
to_normalize = confusion_matrix(yy_val, y_pred)
norm = Normalizer().transform(to_normalize)

labels = unique_labels(yy_val, y_pred) # same label set and order that confusion_matrix uses
columns = [f'predicted {label}' for label in labels]
index = [f'actual {label}' for label in labels]
table_2 = pd.DataFrame(norm, columns=columns, index=index)
table_2

In [0]:
# MinMaxScaler rescales each column to [0, 1], while Normalizer rescales each row
# to unit L2 norm, so the two tables differ; neither gives per-class proportions

from sklearn.preprocessing import MinMaxScaler
to_minmax = confusion_matrix(yy_val, y_pred)
minmax = MinMaxScaler().fit_transform(to_minmax)

labels_2 = unique_labels(yy_val, y_pred)
columns_2 = [f'predicted {label}' for label in labels_2]
index_2 = [f'actual {label}' for label in labels_2]
table_3 = pd.DataFrame(minmax, columns=columns_2, index=index_2)
table_3

In [0]:
labels_2 = unique_labels(yy_val, y_pred)
columns_2 = [f'predicted {label}' for label in labels_2]
index_2 = [f'actual {label}' for label in labels_2]
table_3 = pd.DataFrame(norm, columns=columns_2, index=index_2)
sns.heatmap(table_3, cmap='BuPu_r', fmt='.2%', annot=True) # .1f, d
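`Normalizer()` rescales each row to unit L2 norm by default, so its output is not a set of percentages. If the goal is "what share of each actual class was predicted as each label", dividing every row by its row sum is one way to get there. The sketch below uses a made-up 3-class matrix rather than this notebook's results:

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual class, columns = predicted class)
cm = np.array([[50,  5,  5],
               [10, 20, 10],
               [ 5,  5, 90]])

# Row proportions: each row sums to 1, and the diagonal is per-class recall
row_share = cm / cm.sum(axis=1, keepdims=True)

# Unit L2 norm per row, for comparison with Normalizer's default behavior
l2_scaled = cm / np.linalg.norm(cm, axis=1, keepdims=True)

print(row_share.round(3))
print(row_share.sum(axis=1))   # all ones
print(l2_scaled.sum(axis=1))   # not ones
```

In scikit-learn 0.22 and later, `confusion_matrix(y_true, y_pred, normalize='true')` should produce the row-proportion version directly, and it pairs naturally with `fmt='.2%'` in `sns.heatmap`.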

In [0]:
def con_array(y_true, y_pred):
  labels = unique_labels(y_true, y_pred) # same label set and order that confusion_matrix uses
  columns = [f'predicted {label}' for label in labels]
  index = [f'actual {label}' for label in labels]
  return columns, index

con_array(yy_val, y_pred)

In [0]:
def convert_array_list(y_true, y_pred):
  labels = unique_labels(y_true, y_pred) # same label set and order that confusion_matrix uses
  columns = [f'predicted {label}' for label in labels]
  index = [f'actual {label}' for label in labels]
  table = pd.DataFrame(confusion_matrix(y_true, y_pred), 
                       columns=columns, index=index)
  return table

convert_array_list(yy_val, y_pred)

In [0]:
def convert_array_list(y_true, y_pred):
  labels = unique_labels(y_true, y_pred) # same label set and order that confusion_matrix uses
  columns = [f'predicted {label}' for label in labels]
  index = [f'actual {label}' for label in labels]
  table = pd.DataFrame(confusion_matrix(y_true, y_pred), 
                       columns=columns, index=index)
  return sns.heatmap(table, annot=True, cmap='CMRmap_r', fmt='d') # fmt='d' changes numerical notation

convert_array_list(yy_val, y_pred);

In [0]:
correct_pred = 5957+201+3386
total_pred = 61+466+537+112+1126+34+5957+201+3386
correct_pred / total_pred

In [0]:
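from sklearn.metrics import accuracy_score

# y_train covers the full training set, so predict on x_train to get aligned predictions.
# Accuracy on data the model was fit on is optimistic compared with the cross-validated score.
y_pred = search.predict(x_train)
print('best accuracy score: ', search.best_score_)
print(accuracy_score(y_train, y_pred))

In [0]:
sum(y_pred == y_train) / len(y_pred) # manual check: same value as accuracy_score above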

In [0]:
from sklearn.metrics import classification_report
print(classification_report(y_train, y_pred))

In [0]:
convert_array_list(y_train, y_pred);

In [0]:
total_non_func_pred = 21761+72+79
correct_non_funct = 21761

In [0]:
# precision

correct_non_funct / total_non_func_pred

In [0]:
# recall
actual_non_func = 1060+3+21761
correct_non_funct / actual_non_func
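The hand arithmetic above generalizes to any class: precision divides the diagonal cell by its column sum (everything predicted as that class), and recall divides it by its row sum (everything that actually is that class). A minimal sketch, using a made-up matrix and a hypothetical helper name `precision_recall`:

```python
def precision_recall(matrix, class_index):
    """matrix[i][j] = number of rows with actual class i predicted as class j."""
    true_pos = matrix[class_index][class_index]
    predicted_total = sum(row[class_index] for row in matrix)  # column sum
    actual_total = sum(matrix[class_index])                    # row sum
    precision = true_pos / predicted_total if predicted_total else 0.0
    recall = true_pos / actual_total if actual_total else 0.0
    return precision, recall

# Hypothetical counts (rows = actual, columns = predicted), not this notebook's results
toy_matrix = [[50,  5,  5],
              [10, 20, 10],
              [ 5,  5, 90]]

precision, recall = precision_recall(toy_matrix, class_index=2)
print(f'precision: {precision:.3f}, recall: {recall:.3f}')
```

These values should match the corresponding row of `classification_report` when the matrix comes from `confusion_matrix` on the same predictions.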

### precision, recall, thresholds, and predicted probabilities

In [0]:
len(test)

In [0]:
len(x_train)

In [0]:
y_train.value_counts(normalize=True)

In [0]:
# based on historical data, if you randomly chose waterpumps to inspect, then 
# about 46% of the waterpumps would need repairs, and 54% would not need repairs

trips = 2000
print(f'Baseline: {trips * 0.46} waterpump repairs in {trips} trips')

In [0]:
# REDEFINING our target. Identify which waterpumps are non-functional or are functional but need repair

y_train = y_train != 'functional' # give me those that != functional
y_train.value_counts(normalize=True)

In [0]:
y_train.head()

In [0]:
len(x_test) == len(test)

In [0]:
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test) # use the pipeline just fit on the binary target, not the earlier multi-class search
y_pred

In [0]:
convert_array_list(y_train, pipeline.predict(x_train)); # predictions must align with y_train, so predict on x_train
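With a binary target, a threshold on predicted probabilities decides which pumps get a visit. Raising the threshold usually selects fewer pumps but a higher share of them really need repair. The sketch below uses made-up probabilities and labels, and `precision_at_threshold` is a hypothetical helper, to show how that compares with the random-inspection baseline:

```python
# Hypothetical predicted probabilities of "needs repair" and the true outcomes
probabilities = [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
needs_repair = [True, True, True, False, True, False, True, False, False, False]

def precision_at_threshold(probs, labels, threshold):
    """Share of selected pumps (probability >= threshold) that truly need repair."""
    selected = [label for p, label in zip(probs, labels) if p >= threshold]
    if not selected:
        return 0.0, 0
    return sum(selected) / len(selected), len(selected)

# Baseline: share of pumps needing repair if inspected at random
baseline = sum(needs_repair) / len(needs_repair)
print(f'baseline precision: {baseline:.2f}')

results = {}
for threshold in (0.5, 0.75, 0.9):
    precision, n_selected = precision_at_threshold(probabilities, needs_repair, threshold)
    results[threshold] = (precision, n_selected)
    print(f'threshold {threshold}: {n_selected} pumps selected, precision {precision:.2f}')
```

In this notebook the probabilities would presumably come from something like `pipeline.predict_proba(x)[:, 1]` after fitting on the redefined binary target, evaluated on held-out rows rather than on the rows used for fitting.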