<a href="https://colab.research.google.com/github/nedprz/DS-Unit-2-Kaggle-Challenge/blob/master/module4-classification-metrics/Ned_Przezdziecki_LS_DS_224_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# Classification Metrics

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](http://archive.is/DelgE), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.


## Stretch Goals

### Reading

- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with  interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] Stacking Ensemble. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [6]:
import pandas as pd
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split


# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

  import pandas.util.testing as tm


In [4]:
train.shape, test.shape

((59400, 41), (14358, 40))

In [29]:

from sklearn.preprocessing import LabelEncoder

lb_make = LabelEncoder()
train["code"] = lb_make.fit_transform(train["status_group"])
train[["status_group", "code"]].head()
#train.dtypes

Unnamed: 0,status_group,code
24947,non functional,2
22630,functional,0
13789,functional,0
15697,functional,0
22613,non functional,2


In [0]:
train, val = train_test_split(train,random_state=42)

In [31]:
train.shape, val.shape, test.shape

((33412, 42), (11138, 42), (14358, 40))

In [33]:
val.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,code
46577,59804,0.0,2011-08-05,Hifab,0,Hesawa,33.431506,-2.766605,Kisima Cha Jadi,0,Lake Victoria,Chaya,Mwanza,19,4,Kwimba,Nyambiti,0,True,GeoData Consultants Ltd,VWC,,True,0,other,other,other,vwc,user-group,never pay,never pay,soft,good,seasonal,seasonal,shallow well,shallow well,groundwater,other,other,non functional,2
52347,55302,0.0,2011-04-11,,0,,33.803964,-9.322491,Kikusya,0,Lake Nyasa,Kikusya,Mbeya,12,4,Rungwe,Lupata,0,True,GeoData Consultants Ltd,VWC,K,,0,gravity,gravity,gravity,vwc,user-group,unknown,unknown,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,non functional,2
17899,40571,0.0,2013-03-29,Government Of Tanzania,1659,DWE,36.666703,-3.29979,Shule Ya Msingi,0,Pangani,Maina,Arusha,2,2,Arusha Rural,Kimnyaki,200,True,GeoData Consultants Ltd,VWC,Likamba mindeu pipe line,True,1991,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional,0
1798,54044,0.0,2011-07-27,Ministry Of Water,0,DWE,33.341156,-3.257782,Bomba La Zahanati,0,Lake Victoria,Nghengele,Mwanza,19,4,Kwimba,Hungumalwa,0,True,GeoData Consultants Ltd,VWC,MMILUKI,True,0,mono,mono,motorpump,vwc,user-group,never pay,never pay,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,functional,0
35170,36152,0.0,2011-04-13,,0,,33.853957,-8.832872,Kwa Mzee Bakari,0,Rufiji,Juhudi,Mbeya,12,7,Mbarali,Igurusi,0,True,GeoData Consultants Ltd,WUA,,True,0,gravity,gravity,gravity,wua,user-group,pay monthly,monthly,soft,good,insufficient,insufficient,river,river/lake,surface,communal standpipe,communal standpipe,functional,0


In [28]:
val.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
2980,37098,0.0,2012-10-09,Rural Water Supply And Sanitat,0,DWE,31.985658,-3.59636,Kasela,0,Lake Tanganyika,Bufanka Centre,Shinyanga,17,5,Bukombe,Iyogela,0,True,GeoData Consultants Ltd,WUG,,True,0,other,other,other,wug,user-group,unknown,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other,non functional
5246,14530,0.0,2012-11-03,Halmashauri Ya Manispa Tabora,0,Halmashauri ya manispa tabora,32.832815,-4.944937,Mbugani,0,Lake Tanganyika,Maendeleo,Tabora,14,6,Tabora Urban,Uyui,0,True,GeoData Consultants Ltd,VWC,,True,0,india mark ii,india mark ii,handpump,vwc,user-group,never pay,never pay,milky,milky,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional
22659,62607,10.0,2013-02-25,Siter Fransis,1675,DWE,35.488289,-4.242048,Kwa Leosi,0,Internal,Qatabradiki,Manyara,21,1,Babati,Dareda,148,True,GeoData Consultants Ltd,Water Board,,True,2008,gravity,gravity,gravity,water board,user-group,pay per bucket,per bucket,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,functional
39888,46053,0.0,2011-08-13,Kkkt,0,KKKT,33.140828,-9.059386,Jangi,0,Lake Rukwa,Chawama,Mbeya,12,6,Mbozi,Iyula,0,False,GeoData Consultants Ltd,VWC,,False,0,nira/tanira,nira/tanira,handpump,vwc,user-group,never pay,never pay,soft,good,seasonal,seasonal,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
13361,47083,50.0,2013-02-08,Wateraid,1109,SEMA,34.217077,-4.430529,Mkima,0,Internal,Mkima,Singida,13,1,Iramba,Mtoa,235,True,GeoData Consultants Ltd,WUA,Tyeme water supply,True,2011,mono,mono,motorpump,wua,user-group,pay per bucket,per bucket,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,functional


In [17]:
train['status_group'].value_counts(normalize=True)

functional                 0.542334
non functional             0.384871
functional needs repair    0.072795
Name: status_group, dtype: float64

In [22]:
train['code'].value_counts(normalize=True)

0    0.542334
2    0.384871
1    0.072795
Name: code, dtype: float64

In [0]:
target = 'status_group'
features = ['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'basin', 'region', 'public_meeting', 'recorded_by', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']
from scipy.stats import randint, uniform


In [0]:
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]

In [67]:
pipeline = make_pipeline(
    ce.TargetEncoder(), 
    SimpleImputer(), 
    StandardScaler(), 
    #LogisticRegression(max_iter=200),
    RandomForestClassifier(random_state=42)
)

param_distributions = {
    'simpleimputer__strategy': ['mean','median'],
    'randomforestclassifier__n_estimators': randint(50, 500), 
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None], 
    'randomforestclassifier__max_features': uniform(0, 1),
  
}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=6, 
    cv=3, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)




# Fit on train
search.fit(X_train, y_train)

# Score on val
#print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on test
#y_pred = pipeline.predict(X_test)

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:    6.3s finished


TypeError: ignored

In [0]:
pipeline = search.best_estimator_

In [70]:
from sklearn.metrics import mean_absolute_error



y_pred = pipeline.predict(X_test)
y_pred

TypeError: ignored

In [71]:
pipeline.score(X_val,y_val)

TypeError: ignored

In [0]:
test['status_group'] =y_pred
header = ["id","status_group"]
test.to_csv('xbest.csv',columns=header,index=False)