<a href="https://colab.research.google.com/github/npgeorge/DS-Unit-2-Kaggle-Challenge/blob/master/Nicholas_George_Kaggle_Challenge_Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
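
As a minimal sketch of the binning idea above, `pd.qcut` can split a numeric feature into quantile bins so that the share of functional pumps can be compared per bin. The toy `population` and `functional` values below are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy data standing in for the training set (illustrative values only)
toy = pd.DataFrame({
    'population': [1, 5, 20, 50, 120, 300, 800, 2500],
    'functional': [0, 0, 1, 0, 1, 1, 1, 0],
})

# Cut the numeric feature into 4 quantile bins with equal counts
toy['population_bin'] = pd.qcut(toy['population'], q=4)

# Share of functional pumps within each bin
rate_by_bin = toy.groupby('population_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The binned column can then be used as the `x` variable in a Seaborn categorical plot, with the numeric target as `y`.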

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
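
The same idea can be wrapped in a small helper so that the top values are learned from the training column only and then applied to every split. This is a sketch under assumptions: `reduce_cardinality` and the toy series are hypothetical names, not part of the original assignment code.

```python
import pandas as pd

def reduce_cardinality(train_col, other_cols, n=10, fill='OTHER'):
    """Keep the n most frequent training values; replace the rest with `fill`.

    Hypothetical helper: the top-n list comes from train_col only, so the
    validation and test columns are reduced with the same categories.
    """
    top = train_col.value_counts().index[:n]
    return [col.where(col.isin(top), fill) for col in [train_col, *other_cols]]

# Toy columns (illustrative values only)
train_col = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])
test_col = pd.Series(['a', 'c', 'd'])

train_out, test_out = reduce_cardinality(train_col, [test_col], n=2)
print(train_out.tolist())
print(test_out.tolist())
```

Note that missing values are not in the top-n list, so this sketch also maps them to the fill value.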


In [1]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

Collecting category_encoders==2.*
  Downloading https://files.pythonhosted.org/packages/a0/52/c54191ad3782de633ea3d6ee3bb2837bda0cf3bc97644bb6375cf14150a0/category_encoders-2.1.0-py2.py3-none-any.whl (100kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [3]:
# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)

train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))
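
The `stratify` argument keeps the class proportions the same in both splits. The sketch below checks this on a toy frame; the 60/30/10 class mix and the `toy_*` names are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 60/30/10 class mix (illustrative only)
toy = pd.DataFrame({
    'x': range(100),
    'status_group': (['functional'] * 60
                     + ['non functional'] * 30
                     + ['functional needs repair'] * 10),
})

toy_train, toy_val = train_test_split(
    toy, train_size=0.80, test_size=0.20,
    stratify=toy['status_group'], random_state=42)

# Class shares should match the original mix in both splits
train_share = toy_train['status_group'].value_counts(normalize=True)
val_share = toy_val['status_group'].value_counts(normalize=True)
print(train_share)
print(val_share)
```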

In [0]:
# Disabled: 'year_recorded' does not exist until wrangle() has been applied.
# print('Train Year Min:', train['year_recorded'].min())
# print('Train Year Max:', train['year_recorded'].max())
# print('Test Year Min:', test['year_recorded'].min())
# print('Test Year Max:', test['year_recorded'].max())

In [5]:
train['status_group'].value_counts(normalize=True)

functional                 0.543077
non functional             0.384238
functional needs repair    0.072685
Name: status_group, dtype: float64
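
These proportions give the majority-class baseline: always predicting the most frequent class is correct as often as that class occurs. A minimal sketch, using the proportions copied from the output above (the variable names are illustrative):

```python
import pandas as pd

# Class proportions copied from the value_counts output above
proportions = pd.Series({
    'functional': 0.543077,
    'non functional': 0.384238,
    'functional needs repair': 0.072685,
})

# The baseline always predicts the most frequent class
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()
print(majority_class, round(baseline_accuracy, 3))
```

Any model should beat roughly 54% accuracy to be worth using.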

In [6]:
train.isnull().sum() #missing data

id                           0
amount_tsh                   0
date_recorded                0
funder                    2904
gps_height                   0
installer                 2917
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 286
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            2644
recorded_by                  0
scheme_management         3128
scheme_name              22532
permit                    2443
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [0]:
import numpy as np

def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    # Also create a "missing indicator" column, because the fact that
    # values are missing may be a predictive signal.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col+'_MISSING'] = X[col].isnull()
            
    # Drop duplicate columns
    duplicates = ['quantity_group', 'payment_type']
    X = X.drop(columns=duplicates)
    
    # Drop recorded_by (never varies) and id (always varies, random)
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns=unusable_variance)
    
    # Convert date_recorded to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    
    # Engineer feature: how many years from construction_year to date_recorded
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()

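    # Engineer seasonality: which season was the pump recorded in?
    # (Simplification: November is grouped with the hot dry months and March with the heavy rains.)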
    jan = (X['month_recorded'] == 1)
    feb = (X['month_recorded'] == 2)
    march = (X['month_recorded'] == 3)
    april = (X['month_recorded'] == 4)
    may = (X['month_recorded'] == 5)
    june = (X['month_recorded'] == 6)
    july = (X['month_recorded'] == 7)
    aug = (X['month_recorded'] == 8)
    sep = (X['month_recorded'] == 9)
    octo = (X['month_recorded'] == 10)
    nov = (X['month_recorded'] == 11)
    dec = (X['month_recorded'] == 12)

    hot_dry = nov | dec | jan | feb
    heavy_rain = march | april | may
    cool_dry = june | july | aug | sep | octo

    X['hot_dry_season'] = hot_dry
    X['heavy_rain_season'] = heavy_rain
    X['cool_dry_season'] = cool_dry

    #fix amount_tsh outliers, most likely skewing model
    #X['amount_tsh'] = X['amount_tsh'] > 0


    # return the wrangled dataframe
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
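
To see what the zero-replacement step in `wrangle` does, here is a minimal sketch on a toy frame (the values are made up). It also shows `isin` as a more compact way to build a season flag than combining twelve monthly masks; that is an alternative, not the code used above.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    'construction_year': [1998, 0, 2010, 0],
    'month_recorded': [1, 4, 7, 11],
})

# Zeros stand in for missing values: convert them to NaN and
# record which rows were missing in an indicator column
toy['construction_year'] = toy['construction_year'].replace(0, np.nan)
toy['construction_year_MISSING'] = toy['construction_year'].isnull()

# Compact season flag: True when the month is March, April, or May
toy['heavy_rain_season'] = toy['month_recorded'].isin([3, 4, 5])
print(toy)
```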

In [8]:
train['year_recorded'].min()

2002

In [9]:
train.head()

Unnamed: 0,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,longitude_MISSING,latitude_MISSING,construction_year_MISSING,gps_height_MISSING,population_MISSING,year_recorded,month_recorded,day_recorded,years,years_MISSING,hot_dry_season,heavy_rain_season,cool_dry_season
43360,0.0,,,,33.542898,-9.174777,Kwa Mzee Noa,0,Lake Nyasa,Mpandapanda,Mbeya,12,4,Rungwe,Kiwira,,True,VWC,K,,,gravity,gravity,gravity,vwc,user-group,never pay,soft,good,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,functional,False,False,True,True,True,2011,7,27,,True,False,False,True
7263,500.0,Rc Church,2049.0,ACRA,34.66576,-9.308548,Kwa Yasinta Ng'Ande,0,Rufiji,Kitichi,Iringa,11,4,Njombe,Imalinyi,175.0,True,WUA,Tove Mtwango gravity Scheme,True,2008.0,gravity,gravity,gravity,wua,user-group,pay monthly,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional,False,False,False,False,False,2011,3,23,3.0,False,False,True,False
2486,25.0,Donor,290.0,Do,38.238568,-6.179919,Kwasungwini,0,Wami / Ruvu,Kwedigongo,Pwani,6,1,Bagamoyo,Mbwewe,2300.0,True,VWC,,False,2010.0,india mark ii,india mark ii,handpump,vwc,user-group,pay per bucket,salty,salty,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional,False,False,False,False,False,2011,3,7,1.0,False,False,True,False
313,0.0,Government Of Tanzania,,DWE,30.716727,-1.289055,Kwajovin 2,0,Lake Victoria,Kihanga,Kagera,18,1,Karagwe,Isingiro,,True,,,True,,other,other,other,vwc,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,other,other,non functional,False,False,True,True,True,2011,7,31,,True,False,False,True
52726,0.0,Water,,Gove,35.389331,-6.399942,Chama,0,Internal,Mtakuj,Dodoma,1,6,Bahi,Nondwa,,True,VWC,Zeje,True,,mono,mono,motorpump,vwc,user-group,pay per bucket,soft,good,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional,False,False,True,True,True,2011,3,10,,True,False,True,False


In [0]:
#nice map depiction of the water pumps
# https://plot.ly/python/mapbox-layers/#base-maps-in-layoutmapboxstyle
#import plotly.express as px
#fig = px.scatter_mapbox(train, lat='latitude', lon='longitude', color='status_group', opacity=0.1)
#fig.update_layout(mapbox_style='stamen-terrain')
#fig.show()

In [0]:
#Notes
#try feature engineering on Tanzania dates recorded
#feature engineer a seasonality parameter
#December– February. This is the hot dry season. ...
#March. Intermittent rains start at this time. ...
#April - May. This is the heavier rainy season, and road conditions can become difficult. ...
#June - October. This is the cooler dry season. ...
#November. Here begin the short rains.

#another feature engineering possibility
#location relative to highly populated cities
#the theory being that the more remote a pump is, the more neglected it is, and the more likely it is to fail. 

#another feature
#proximity to water source
#is it upstream or downstream?

#feature
#how many pumps between it and the water source?

#feature 

In [10]:
#from class
# The status_group column is the target
target = 'status_group'

# Get a dataframe with all train columns except the target and the columns listed below (id was already dropped in wrangle)
train_features = train.drop(columns=[target,
                                     'wpt_name',
                                     #'subvillage',
                                     'gps_height',
                                     #'lga', 
                                     #'ward', 
                                     #'subvillage', 
                                     #'funder', 
                                     'scheme_name',
                                     #'population'
                                     #'scheme_management',
                                     #'management',
                                     #'management_group',
                                     'installer',
                                     #'longitude_MISSING',
                                     #'latitude_MISSING',
                                     #'construction_year_MISSING',
                                     #'gps_height_MISSING',
                                     #'population_MISSING',
                                     #'years_MISSING',
                                     'extraction_type',
                                     'extraction_type_group',
                                     'waterpoint_type',
                                     #'source',
                                     'source_type',
                                     ])
# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features
print(len(features))
print(features)

36
['amount_tsh', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'year_recorded', 'month_recorded', 'day_recorded', 'years', 'basin', 'region', 'public_meeting', 'scheme_management', 'permit', 'extraction_type_class', 'management', 'management_group', 'payment', 'water_quality', 'quality_group', 'quantity', 'source', 'source_class', 'waterpoint_type_group', 'longitude_MISSING', 'latitude_MISSING', 'construction_year_MISSING', 'gps_height_MISSING', 'population_MISSING', 'years_MISSING', 'hot_dry_season', 'heavy_rain_season', 'cool_dry_season']


In [11]:
train_features.isnull().sum() #missing data

amount_tsh                       0
funder                        2904
longitude                     1442
latitude                      1442
num_private                      0
basin                            0
subvillage                     286
region                           0
region_code                      0
district_code                    0
lga                              0
ward                             0
population                   17066
public_meeting                2644
scheme_management             3128
permit                        2443
construction_year            16517
extraction_type_class            0
management                       0
management_group                 0
payment                          0
water_quality                    0
quality_group                    0
quantity                         0
source                           0
source_class                     0
waterpoint_type_group            0
longitude_MISSING                0
latitude_MISSING    

In [12]:
train_features.select_dtypes(exclude='number').nunique()

funder                        1716
basin                            9
subvillage                   17231
region                          21
lga                            124
ward                          2082
public_meeting                   2
scheme_management               12
permit                           2
extraction_type_class            7
management                      12
management_group                 5
payment                          7
water_quality                    8
quality_group                    6
quantity                         5
source                          10
source_class                     3
waterpoint_type_group            6
longitude_MISSING                2
latitude_MISSING                 2
construction_year_MISSING        2
gps_height_MISSING               2
population_MISSING               2
years_MISSING                    2
hot_dry_season                   2
heavy_rain_season                2
cool_dry_season                  2
dtype: int64

In [0]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]

In [14]:
X_train.isnull().sum()

amount_tsh                       0
longitude                     1442
latitude                      1442
num_private                      0
region_code                      0
district_code                    0
population                   17066
construction_year            16517
year_recorded                    0
month_recorded                   0
day_recorded                     0
years                        16517
basin                            0
region                           0
public_meeting                2644
scheme_management             3128
permit                        2443
extraction_type_class            0
management                       0
management_group                 0
payment                          0
water_quality                    0
quality_group                    0
quantity                         0
source                           0
source_class                     0
waterpoint_type_group            0
longitude_MISSING                0
latitude_MISSING    

In [15]:
(X_train['longitude']).isnull().sum()

1442

In [16]:
X_train.head(10)
#region code
#district code
#basin
#region

Unnamed: 0,amount_tsh,longitude,latitude,num_private,region_code,district_code,population,construction_year,year_recorded,month_recorded,day_recorded,years,basin,region,public_meeting,scheme_management,permit,extraction_type_class,management,management_group,payment,water_quality,quality_group,quantity,source,source_class,waterpoint_type_group,longitude_MISSING,latitude_MISSING,construction_year_MISSING,gps_height_MISSING,population_MISSING,years_MISSING,hot_dry_season,heavy_rain_season,cool_dry_season
43360,0.0,33.542898,-9.174777,0,12,4,,,2011,7,27,,Lake Nyasa,Mbeya,True,VWC,,gravity,vwc,user-group,never pay,soft,good,insufficient,spring,groundwater,communal standpipe,False,False,True,True,True,True,False,False,True
7263,500.0,34.66576,-9.308548,0,11,4,175.0,2008.0,2011,3,23,3.0,Rufiji,Iringa,True,WUA,True,gravity,wua,user-group,pay monthly,soft,good,enough,spring,groundwater,communal standpipe,False,False,False,False,False,False,False,True,False
2486,25.0,38.238568,-6.179919,0,6,1,2300.0,2010.0,2011,3,7,1.0,Wami / Ruvu,Pwani,True,VWC,False,handpump,vwc,user-group,pay per bucket,salty,salty,insufficient,shallow well,groundwater,hand pump,False,False,False,False,False,False,False,True,False
313,0.0,30.716727,-1.289055,0,18,1,,,2011,7,31,,Lake Victoria,Kagera,True,,True,other,vwc,user-group,never pay,soft,good,enough,shallow well,groundwater,other,False,False,True,True,True,True,False,False,True
52726,0.0,35.389331,-6.399942,0,1,6,,,2011,3,10,,Internal,Dodoma,True,VWC,True,motorpump,vwc,user-group,pay per bucket,soft,good,enough,machine dbh,groundwater,communal standpipe,False,False,True,True,True,True,False,True,False
8558,0.0,31.214583,-8.431428,0,15,2,200.0,1986.0,2011,8,7,25.0,Lake Tanganyika,Rukwa,True,VWC,True,gravity,vwc,user-group,never pay,soft,good,insufficient,river,surface,communal standpipe,False,False,False,False,False,False,False,False,True
2559,20000.0,36.6967,-3.337926,0,2,2,150.0,1995.0,2013,9,3,18.0,Pangani,Arusha,True,VWC,True,gravity,vwc,user-group,pay monthly,soft,good,insufficient,spring,groundwater,communal standpipe,False,False,False,False,False,False,False,False,True
54735,0.0,36.292724,-5.177333,0,1,1,,,2011,4,17,,Internal,Dodoma,True,VWC,False,motorpump,vwc,user-group,pay per bucket,soft,good,enough,machine dbh,groundwater,communal standpipe,False,False,True,True,True,True,False,True,False
25763,0.0,32.877248,-8.925921,0,12,6,,,2011,8,3,,Lake Rukwa,Mbeya,False,VWC,False,handpump,vwc,user-group,never pay,soft,good,enough,machine dbh,groundwater,hand pump,False,False,True,True,True,True,False,False,True
44540,0.0,33.014412,-3.115869,0,19,7,,,2011,8,3,,Lake Victoria,Mwanza,True,VWC,True,submersible,vwc,user-group,pay monthly,soft,good,enough,machine dbh,groundwater,other,False,False,True,True,True,True,False,False,True


In [24]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True, cols=['extraction_type_class',
                                               'basin',
                                               'region',
                                               'source_class',
                                               'payment',
                                               'water_quality',
                                               'quality_group',
                                               'quantity',
                                               'waterpoint_type_group',
                                               'hot_dry_season', 
                                               'heavy_rain_season',
                                               'cool_dry_season']),
    ce.OrdinalEncoder(), 
    SimpleImputer(),
    StandardScaler(),
    RandomForestClassifier(n_estimators=1100, 
                           n_jobs=-1, 
                           min_samples_leaf=2) 
                           #max_depth=100, 
                           #class_weight='balanced',
                           #max_features=5)
)

# Fit on train
pipeline.fit(X_train, y_train)

# Score on Train/Val
print('Training Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on Test Data
y_pred_rfc = pipeline.predict(X_test)


Training Accuracy 0.8985058922558923
Validation Accuracy 0.8111952861952862
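
The assignment also asks for feature importances. The sketch below shows one way to pull them out of a fitted pipeline, using a toy two-column dataset and a smaller forest so it runs quickly. The `toy_*` names, `signal`, and `noise` are assumptions for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data: 'signal' determines the label, 'noise' is unrelated
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_toy = (X_toy['signal'] > 0).astype(int)

toy_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
toy_pipeline.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
forest = toy_pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_,
                        index=X_toy.columns).sort_values()
print(importances)
```

A horizontal bar chart via `importances.plot.barh()` is a common way to display the result. In the notebook's own pipeline the encoders add columns, so the index would need to be the encoded column names rather than `X_train.columns`.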


In [26]:
#cross validation
from sklearn.model_selection import cross_val_score

k = 3
scores = cross_val_score(pipeline, X_train, y_train, cv=k, scoring='accuracy')
print(f'Accuracy for {k} folds:', scores)

Accuracy for 3 folds: [0.80236081 0.8041543  0.79910348]


In [0]:
#randomized search
from sklearn.model_selection import RandomizedSearchCV

# Parameter names must match the pipeline's step names (make_pipeline
# lowercases the class names). The candidate values are illustrative.
param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'], 
    'randomforestclassifier__n_estimators': [100, 300, 500], 
    'randomforestclassifier__min_samples_leaf': [1, 2, 5, 10], 
    'randomforestclassifier__max_features': ['sqrt', 0.5, None], 
}

# If you're on Colab, decrease n_iter & cv parameters
search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=20, 
    cv=3, 
    scoring='accuracy',  # classification problem, so score by accuracy
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train);

In [0]:
from hpsklearn import HyperoptEstimator, random_forest, svc, knn
from hyperopt import hp

# Choose among three classifiers with prior probabilities 0.4 / 0.3 / 0.3
clf = hp.pchoice('my_name',
                 [(0.4, random_forest('my_name.random_forest')),
                  (0.3, svc('my_name.svc')),
                  (0.3, knn('my_name.knn'))])

estim = HyperoptEstimator(classifier=clf)

In [0]:

# Adapted from the hyperopt-sklearn README example. The README version uses
# tfidf preprocessing for text data, which does not apply to this tabular
# dataset, so a general classifier search is used here instead.
from hpsklearn import HyperoptEstimator, any_classifier
from hyperopt import tpe

# hpsklearn expects numeric arrays, so reuse the fitted pipeline's encoding,
# imputing, and scaling steps (everything except the final model).
# Depending on the hpsklearn version, the target may also need label encoding.
preprocess = pipeline[:-1]
X_train_num = preprocess.transform(X_train)
X_val_num = preprocess.transform(X_val)

estim = HyperoptEstimator(classifier=any_classifier('clf'),
                          preprocessing=[],
                          algo=tpe.suggest, trial_timeout=300)
estim.fit(X_train_num, y_train)
# The Kaggle test set has no labels, so score on the validation set instead
print(estim.score(X_val_num, y_val))
print(estim.best_model())

In [0]:
# Highest score so far: configuration and results kept for reference (not executed)
'''
# Get a dataframe with all train columns except the target & (id was dropped earlier)
train_features = train.drop(columns=[target,
                                     'wpt_name',
                                     #'subvillage',
                                     'gps_height',
                                     #'lga', 
                                     #'ward', 
                                     #'subvillage', 
                                     #'funder', 
                                     'scheme_name',
                                     #'population'
                                     #'scheme_management',
                                     #'management',
                                     #'management_group',
                                     'installer',
                                     #'longitude_MISSING',
                                     #'latitude_MISSING',
                                     #'construction_year_MISSING',
                                     #'gps_height_MISSING',
                                     #'population_MISSING',
                                     #'years_MISSING',
                                     #'extraction_type',
                                     #'extraction_type_group',
                                     #'waterpoint_type',
                                     #'source',
                                     #'source_type',
                                     ])

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True, cols=['extraction_type_class',
                                               'region',
                                               'source_class',
                                               'water_quality',
                                               'quality_group',
                                               'quantity',
                                               'hot_dry_season', 
                                               'heavy_rain_season',
                                               'cool_dry_season']),
    ce.OrdinalEncoder(), 
    SimpleImputer(),
    StandardScaler(),
    RandomForestClassifier(n_estimators=1100, n_jobs=-1, min_samples_leaf=2)
)

Training Accuracy 0.9049031986531987
Validation Accuracy 0.8155723905723906
'''

In [0]:
# Write submission csv file
submission = sample_submission.copy()
submission['status_group'] = y_pred_rfc
submission.to_csv('rfc_v12_nick.csv', index=False)
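
Before uploading, it can help to confirm that the submission has the same layout as the sample file and only contains valid labels. A minimal sketch with hypothetical stand-ins for `sample_submission` and the predictions:

```python
import pandas as pd

# Hypothetical stand-ins for sample_submission and the model's predictions
sample = pd.DataFrame({'id': [101, 102, 103],
                       'status_group': ['functional'] * 3})
preds = ['functional', 'non functional', 'functional needs repair']

sub = sample.copy()
sub['status_group'] = preds

# Same columns and row count as the sample, and only known class labels
valid_labels = {'functional', 'non functional', 'functional needs repair'}
same_layout = (list(sub.columns) == list(sample.columns)
               and len(sub) == len(sample))
only_valid_labels = set(sub['status_group']) <= valid_labels
print(same_layout, only_valid_labels)
```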

In [0]:
# Download the submission file (Colab only)
import sys

if 'google.colab' in sys.modules:
    from google.colab import files
    files.download('rfc_v12_nick.csv')