# Phase III Project Technical Notebook

#### Authors: Kyle Dufrane and Brad Horn

In [1]:
# Import needed libraries

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pickle
import pandas as pd
from yellowbrick.classifier import ROCAUC

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score, roc_curve, confusion_matrix, plot_confusion_matrix

pd.set_option('display.max_columns', 999)

### Business Understanding

#### Flatiron LLC has recently been awarded a contract to maintain wells in Tanzania. They're looking for a system to help develop preventative maintenance schedules by predicting pump failures and replacement schedules to better serve their client.

### Overview

#### Given the business problem, we hope to answer the following questions through our EDA:
* Are wells failing by geographic location?
* Does well type or source affect pump longevity?
* Does well management or payment affect pump longevity?

### Data Understanding

#### This dataset comes from the Government of Tanzania and contains roughly 59,000 wells, with the earliest recorded construction year being 1966. Below you will see our data cleaning process.

#### This dataset comes in three files: test_set, training_set_labels, and training_set_values. We will set the test set aside until the final model has been completed, then predict on it and submit our findings.

#### To start we will look at the training_set_labels:

In [7]:
# Import training labels CSV
df_training_labels = pd.read_csv('data/Training_set_labels.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'data/Training_set_labels.csv'

In [None]:
df_training_labels.shape

In [None]:
df_training_labels.head()

In [None]:
df_training_labels.info()

### Checking NA values

In [None]:
df_training_labels.isna().sum()

### Class Imbalance

#### Based on our counts, we can see that we will have to counter the class imbalance. We will fix this issue later on in our model building process. 

In [None]:
df_training_labels['status_group'].value_counts()
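
#### A minimal sketch of how the imbalance could be quantified, and of the weights that 'balanced' class weighting would produce. The toy labels below are hypothetical stand-ins for df_training_labels['status_group'], not the real counts.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels standing in for df_training_labels['status_group']
labels = pd.Series(['functional'] * 6
                   + ['non functional'] * 3
                   + ['functional needs repair'] * 1)

# Share of each class instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)

# 'balanced' weights follow n_samples / (n_classes * class_count),
# so the rarest class receives the largest weight
classes = np.unique(labels)
weights = compute_class_weight(class_weight='balanced',
                               classes=classes,
                               y=labels.to_numpy())
class_weights = dict(zip(classes, weights))
print(class_weights)
```

#### These are the same weights a scikit-learn classifier derives internally when it is given class_weight = 'balanced'.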

### Training_set_values

In [None]:
# Import training values CSV
df_training_values = pd.read_csv('data/Training_set_values.csv')

In [None]:
df_training_values.shape

#### Looking at the above cell's output, we can see that we have 40 columns to choose from: the id column plus the 39 predictive features listed below.

* amount_tsh : Total static head (amount water available to waterpoint)
* date_recorded : The date the row was entered
* funder : Who funded the well
* gps_height : Altitude of the well
* installer : Organization that installed the well
* longitude : GPS coordinate
* latitude : GPS coordinate
* wpt_name : Name of the waterpoint if there is one
* num_private : Private use or not
* basin : Geographic water basin
* subvillage : Geographic location
* region : Geographic location
* region_code : Geographic location (coded)
* district_code : Geographic location (coded)
* lga : Geographic location
* ward : Geographic location
* population : Population around the well
* public_meeting : True/False
* recorded_by : Group entering this row of data
* scheme_management : Who operates the waterpoint
* scheme_name : Who operates the waterpoint
* permit : If the waterpoint is permitted
* construction_year : Year the waterpoint was constructed
* extraction_type : The kind of extraction the waterpoint uses
* extraction_type_group : The kind of extraction the waterpoint uses
* extraction_type_class : The kind of extraction the waterpoint uses
* management : How the waterpoint is managed
* management_group : How the waterpoint is managed
* payment : What the water costs
* payment_type : What the water costs
* water_quality : The quality of the water
* quality_group : The quality of the water
* quantity : The quantity of water
* quantity_group : The quantity of water
* source : The source of the water
* source_type : The source of the water
* source_class : The source of the water
* waterpoint_type : The kind of waterpoint
* waterpoint_type_group : The kind of waterpoint

In [None]:
df_training_values.info()

#### A quick review of the Non-Null Count column shows that we are missing values in this data set. Below we will dive deeper into which columns are the most affected.

In [None]:
df_training_values.isna().sum()

#### Out of the 40 columns, 7 are missing values. A few items stand out:

* funder and installer have close to equal numbers of missing values
* subvillage has the fewest missing values
* scheme_name is missing almost half of its values, so we will drop this column

In [None]:
# Dropping column from dataframe
df_training_values.drop('scheme_name', axis = 1, inplace = True)
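
#### The drop decision above can also be driven by the share of missing values per column. This is a sketch on a hypothetical toy frame; the 0.4 cut-off is an assumption chosen for illustration, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the real data, values are made up
toy = pd.DataFrame({
    'funder': ['A', np.nan, 'B', 'A'],
    'scheme_name': [np.nan, np.nan, 'X', np.nan],
    'basin': ['Rufiji', 'Pangani', 'Rufiji', 'Pangani'],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Assumed cut-off: drop columns with more than 40% missing values
threshold = 0.4
to_drop = missing_share[missing_share > threshold].index.tolist()
print(to_drop)
```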

#### We need to explore more to see how we should handle these values.

In [None]:
# creating a list of columns with missing values
missing_values = ['funder', 'installer', 'subvillage', 'public_meeting',\
                  'scheme_management', 'permit']

# creating a dataframe with above missing_values
df_training_values[missing_values].info()

In [None]:
df_training_values[missing_values].isna().sum()

#### We can now see that all of these features are of dtype object, which narrows down our options for dealing with the missing values. What are these features composed of?

#### To start, let's take a look at our previously mentioned insight that funder and installer have close to the same number of missing values.

##### Note: prior to running the below cells I misread the value counts and thought that both of these columns had the same amount of NA values. The below lines raised the red flag of 'why are the true values the same but the false values differ?'

In [None]:
df_training_values[df_training_values['funder'].isna()]['installer'].isna().value_counts()

In [None]:
df_training_values[df_training_values['installer'].isna()]['funder'].isna().value_counts()

#### The above counts differ only slightly, but enough that we cannot treat the missing values in these two columns as identical.
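
#### The two checks above can be combined into a single table. A minimal sketch on hypothetical toy data, using pd.crosstab on the missing-value flags of both columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the funder / installer columns
toy = pd.DataFrame({
    'funder':    ['A', np.nan, np.nan, 'B', np.nan],
    'installer': ['X', np.nan, 'Y', np.nan, np.nan],
})

# Rows: is funder missing? Columns: is installer missing?
overlap = pd.crosstab(toy['funder'].isna(), toy['installer'].isna(),
                      rownames=['funder_missing'],
                      colnames=['installer_missing'])
print(overlap)
```

#### The off-diagonal cells count the rows where only one of the two columns is missing, which is exactly the mismatch discussed above.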

In [None]:
df_training_values['funder'].value_counts()

In [None]:
df_training_values[df_training_values['funder'].isna()]

In [None]:
df_training_values['installer'].value_counts()

In [None]:
df_training_values[df_training_values['installer'].isna()]

In [None]:
df_training_values['subvillage'].value_counts()

In [None]:
df_training_values[df_training_values['subvillage'].isna()]

In [None]:
df_training_values['public_meeting'].value_counts()

In [None]:
df_training_values[df_training_values['public_meeting'].isna()]

In [None]:
df_training_values[df_training_values['public_meeting'].isna()]['recorded_by'].value_counts()

#### Inspecting the above output, you can see that all of these rows were recorded by GeoData Consultants Ltd. Let's take a look at the whole dataframe.

In [None]:
# Count every value in recorded_by to confirm that a single vendor recorded all rows
df_training_values['recorded_by'].value_counts()

#### Since all of the data was recorded by the same vendor, this column will have no impact on our modeling. This is another column that we can drop.

In [None]:
df_training_values.drop('recorded_by', axis = 1, inplace = True)
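
#### Columns that hold a single value, like recorded_by, can also be found programmatically. A small sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy frame: one constant column, one varying column
toy = pd.DataFrame({
    'recorded_by': ['GeoData Consultants Ltd'] * 4,
    'basin': ['Rufiji', 'Pangani', 'Rufiji', 'Internal'],
})

# A column with at most one distinct value (NaN counted as a value) carries no signal
constant_cols = [col for col in toy.columns
                 if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)
```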

In [None]:
df_training_values['scheme_management'].value_counts()

In [None]:
df_training_values['permit'].value_counts()

## Data Preparation

### For each column with missing values we will create two versions of the data for modeling: one where missing values are filled with the column's mode, and one where they are filled with a new category denoted 'Other'.

In [None]:
# Creating new dataframe
df_training_val_mode = df_training_values.copy()
df_training_val_other = df_training_values.copy()


In [None]:
# Filling NaN values with 'Other'
# (fillna is used because regex matching does not apply to NaN values)

for col in missing_values:
    df_training_val_other[col] = df_training_val_other[col].fillna('Other')


In [None]:
# Filling NaN values with the most common value of each column based on count
# (assigning the result back avoids relying on inplace changes to a column selection)

for col in missing_values:
    most_common = df_training_val_mode[col].value_counts().index[0]
    df_training_val_mode[col] = df_training_val_mode[col].fillna(most_common)

In [None]:
df_training_val_mode.isna().sum()

In [None]:
df_training_val_other.isna().sum()
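
#### The same two strategies can be expressed with scikit-learn's SimpleImputer, which has the advantage that the fill values learned from the training data can later be reused on the test set. This is a sketch on hypothetical toy data, not the method used in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with object-dtype columns and missing values
toy = pd.DataFrame({'funder': ['Gov', 'Gov', np.nan, 'Danida'],
                    'installer': ['DWE', np.nan, 'DWE', 'Gov']},
                   dtype=object)

# Strategy 1: fill with the most frequent value of each column
mode_imputer = SimpleImputer(strategy='most_frequent')
toy_mode = pd.DataFrame(mode_imputer.fit_transform(toy), columns=toy.columns)

# Strategy 2: fill with a new constant category
other_imputer = SimpleImputer(strategy='constant', fill_value='Other')
toy_other = pd.DataFrame(other_imputer.fit_transform(toy), columns=toy.columns)

print(toy_mode)
print(toy_other)
```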

### Joining Tables

#### Now let's merge the tables so we only have two data sets to work with. Both dataframes have an id column, so we will first copy the id on our target set into a new column, id_2, and drop the original to avoid a duplicate column name.

In [None]:
df_training_labels['id_2'] = df_training_labels['id']
df_training_labels.drop('id', axis = 1, inplace = True)

#### Next we will join our tables and create two dataframes for mode and other

In [None]:
df_mode = pd.concat([df_training_val_mode, df_training_labels], join = 'inner', axis = 1)
df_other = pd.concat([df_training_val_other, df_training_labels], join = 'inner', axis = 1)

In [None]:
df_mode

In [None]:
df_mode[df_mode['id'] == df_mode['id_2']]

In [None]:
df_other[df_other['id'] == df_other['id_2']]

#### As seen above, the number of rows with matching ids equals that of the original dataframe, so we can conclude that our merges were successful and we can drop our id_2 column.

In [None]:
df_mode.drop(['id_2'], axis = 1, inplace = True)
df_other.drop(['id_2'], axis = 1, inplace = True)
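
#### An alternative to concatenating on row position and then comparing ids is to merge on the id key directly. The sketch below uses hypothetical toy frames; validate = 'one_to_one' makes pandas raise an error if an id is duplicated on either side.

```python
import pandas as pd

# Hypothetical toy frames; the rows are deliberately in a different order
values = pd.DataFrame({'id': [3, 1, 2],
                       'basin': ['Rufiji', 'Pangani', 'Internal']})
labels = pd.DataFrame({'id': [1, 2, 3],
                       'status_group': ['functional', 'non functional', 'functional']})

# Join on the key rather than on row position
merged = values.merge(labels, on='id', how='inner', validate='one_to_one')
print(merged)
```

#### Because the join is keyed on id, the result stays correct even if the two files are not sorted the same way.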

### Additional Columns to Drop

#### The id and date_recorded columns are administrative and should have little predictive power in our model, so we can drop them.

In [None]:
df_mode.drop(['id', 'date_recorded'], axis = 1, inplace = True)
df_other.drop(['id', 'date_recorded'], axis = 1, inplace = True)

In [None]:
def get_totals(dataframe, filter_column, filter_groupby):

    '''
    **** filter_column & filter_groupby need to be passed
    as strings ****

    1. get_totals will count the values within a column for
    each group and return a new column with the total
    occurrences of each value in the dataframe

    2. get_totals will calculate the percentage of the
    values column vs the total values

    dataframe = pandas dataframe
    filter_column = column to filter by
    filter_groupby = groupby column to filter by

    '''

    values_col = f'{filter_groupby}_values'

    # Count each filter_column value within each group; naming the counts
    # explicitly avoids depending on the default name of value_counts()
    df_new = (dataframe.groupby(filter_groupby)[filter_column]
                       .value_counts()
                       .rename(values_col)
                       .reset_index())

    # Total occurrences of each filter_column value across all groups
    df_new[f'{filter_groupby}_total_values'] = df_new.groupby(filter_column)[values_col].transform('sum')

    df_new[f'{filter_groupby}_percentage'] = df_new[values_col] / df_new[f'{filter_groupby}_total_values']

    return df_new
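
#### For a quick cross-check of the percentages that get_totals produces, pd.crosstab with normalize = 'index' gives the share of each status within every feature value. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy frame with one feature and the target
toy = pd.DataFrame({
    'source': ['spring', 'spring', 'spring', 'river', 'river'],
    'status_group': ['functional', 'functional', 'non functional',
                     'functional', 'non functional'],
})

# Each row sums to 1: share of every status within a given source
shares = pd.crosstab(toy['source'], toy['status_group'], normalize='index')
print(shares)
```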


In [None]:
# function_df = df.drop('status_group', axis = 1)

# percentage_dict = {}

# for idx, column in enumerate(function_df.columns):
#     percentage_dict[column] = get_totals(df, column, 'status_group')

# pickle_out = open('percentage_dict.pickle', 'wb')
# pickle.dump(percentage_dict, pickle_out)

In [None]:
# Load the saved percentages; the with block closes the file afterwards
with open('percentage_dict.pickle', 'rb') as pickle_in:
    percentage_dict = pickle.load(pickle_in)

### First Simple Model

In [None]:
df_mode.columns

In [None]:
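source = percentage_dict['source']
functional = source[source['status_group'] == 'functional']\
                .sort_values('status_group_percentage', ascending = False)

# Plot only the functional rows so each source appears once, highest share first
plt.bar(functional['source'], functional['status_group_percentage'])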
plt.xticks(rotation=45, ha='right')
plt.title('Top Functional Wells by Source')
plt.savefig('saved_objects/source_bar');

In [None]:
payment = percentage_dict['payment']
functional = payment[payment['status_group'] == 'functional']

# Plot only the functional rows so each payment option appears once
plt.bar(functional['payment'], functional['status_group_percentage'])
plt.xticks(rotation=45, ha='right')
plt.title('Functional Wells by Payment')
plt.savefig('saved_objects/money_bar');

#### To start our modeling process we will use only our integers and floats.

In [None]:
X_mode_fsm = df_mode.select_dtypes(['int64', 'float64'])
y_mode_fsm = df_mode['status_group']

X_other_fsm = df_other.select_dtypes(['int64','float64'])
y_other_fsm = df_other['status_group']

In [None]:
X_mode_train, X_mode_test, y_mode_train, y_mode_test = train_test_split(X_mode_fsm, y_mode_fsm, random_state = 42, stratify = y_mode_fsm)

X_other_train, X_other_test, y_other_train, y_other_test = train_test_split(X_other_fsm, y_other_fsm, random_state = 42, stratify = y_other_fsm)


dtc_mode = DecisionTreeClassifier()
dtc_other = DecisionTreeClassifier()

dtc_mode.fit(X_mode_train, y_mode_train)
dtc_other.fit(X_other_train, y_other_train)

In [None]:
print(dtc_mode.score(X_mode_train, y_mode_train))
print(dtc_other.score(X_other_train, y_other_train))

In [None]:
y_hat_mode = dtc_mode.predict(X_mode_train)
y_hat_other = dtc_other.predict(X_other_train)

In [None]:
print('mode recall:', recall_score(y_mode_train, y_hat_mode, average = 'macro'))
print('mode precision:', precision_score(y_mode_train, y_hat_mode, average = 'macro'))
print('mode f1 score:', f1_score(y_mode_train, y_hat_mode, average = 'macro'))

print('---------------------------------------------------------')

print('other recall:', recall_score(y_other_train, y_hat_other, average = 'macro'))
print('other precision:', precision_score(y_other_train, y_hat_other, average = 'macro'))
print('other f1 score:', f1_score(y_other_train, y_hat_other, average = 'macro'))

In [None]:
cross_val_score(dtc_mode, X_mode_train, y_mode_train, cv = 3, scoring = 'recall_macro')

In [None]:
cross_val_score(dtc_other, X_other_train, y_other_train, cv = 3, scoring = 'recall_macro')

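### The cross val scores are pretty consistent across the folds. This doesn't give us much insight into our NaN replacements from the EDA, since the imputed columns are all of dtype object and are therefore excluded from these numeric-only models.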
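
#### A note on reading these numbers: scores computed on the training data and cross-validated scores answer different questions. The sketch below uses synthetic data from make_classification (an assumption, not the well data) to compare the two for an unpruned decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced three-class data standing in for the numeric features
X_toy, y_toy = make_classification(n_samples=600, n_features=8, n_informative=4,
                                   n_classes=3, weights=[0.55, 0.38, 0.07],
                                   random_state=42)

# Recall on the same data the tree was fit on
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_toy, y_toy)
train_recall = recall_score(y_toy, tree.predict(X_toy), average='macro')

# Recall on held-out folds, keeping class proportions in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_recall = cross_val_score(DecisionTreeClassifier(random_state=42),
                            X_toy, y_toy, cv=cv, scoring='recall_macro')

print('train recall:', train_recall)
print('cv recall per fold:', cv_recall)
```

#### An unpruned tree can memorize its training data, so the held-out folds are usually the more honest estimate.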

### Model Exploration

#### Now that we have our baseline established we will loop through other models to see if we can get better results.

In [None]:
# model_selection = [LogisticRegression(random_state = 42, max_iter = 1000, n_jobs = -1),\
#                    RandomForestClassifier(random_state = 42, n_jobs = -1),\
#                    DecisionTreeClassifier(), KNeighborsClassifier(n_jobs = -1), 
#                   SVC(random_state = 42)]

# vanilla_models = {}

# for idx_mode, model in enumerate(model_selection):
#     vanilla_models[idx_mode] = model.fit(X_mode_train, y_mode_train)

In [None]:
# for key, val in enumerate(vanilla_models.values()):
#     print(val, val.score(X_mode_train, y_mode_train))

#### Based on the scores above, our scores are the best using RandomForestClassifier and DecisionTreeClassifier. Let's dig deeper into these two models.

In [None]:
# # Select models from dictionary
# rfc = vanilla_models[1]
# dtc = vanilla_models[2]

# # predict on each model

# rfc_mode_yhat = rfc.predict(X_mode_train)
# dtc_mode_yhat = dtc.predict(X_mode_train)

In [None]:
# # Review scores for both models

# print('rfc recall:', recall_score(y_mode_train, rfc_mode_yhat, average = 'macro'))
# print('rfc precision:', precision_score(y_mode_train, rfc_mode_yhat, average = 'macro'))
# print('rfc f1 score:', f1_score(y_mode_train, rfc_mode_yhat, average = 'macro'))

# print('---------------------------------------------------------')

# print('dtc recall:', recall_score(y_mode_train, dtc_mode_yhat, average = 'macro'))
# print('dtc precision:', precision_score(y_mode_train, dtc_mode_yhat, average = 'macro'))
# print('dtc f1 score:', f1_score(y_mode_train, dtc_mode_yhat, average = 'macro'))


### Small advantage to the decision tree classifier. Let's see if our cross val & AUC scores show any more insights.

In [None]:
# cross_val_score(rfc, X_mode_train, y_mode_train, cv = 5, n_jobs=-1, scoring = 'recall_macro')

In [None]:
# cross_val_score(dtc, X_mode_train, y_mode_train, cv = 5, n_jobs = -1, scoring = 'recall_macro')

### Since our stakeholder is concerned with pump failures, we need to avoid false negatives, i.e., we do not want to say a pump is operational when it is in fact broken. We therefore focus on recall and tune our model accordingly, which is why we're using the recall_macro score. As seen above, our Random Forest is performing the best, so we will move forward with tuning this model.
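
#### To make the link between missed failures and recall concrete, here is a small worked example on hypothetical labels. A missed failure is a broken pump that the model predicts as functional; it lowers the recall of the 'non functional' class.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels for six pumps
labels = ['functional', 'non functional']
y_true = ['non functional', 'non functional', 'non functional',
          'non functional', 'functional', 'functional']
y_pred = ['non functional', 'non functional', 'non functional',
          'functional', 'functional', 'non functional']

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Broken pumps predicted as functional (the costly mistake)
missed_failures = cm[1, 0]

# Recall per class, then the unweighted mean used by 'recall_macro'
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)
macro_recall = recall_score(y_true, y_pred, labels=labels, average='macro')
print(missed_failures, per_class_recall, macro_recall)
```

#### Macro averaging weights every class equally, so poor recall on a small class is not hidden by good recall on the largest class.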

### Our models above only utilized our numerical values. We will now begin using our categorical features and identify feature importance. 

In [None]:
# # Separate data by target and predictors
# X_cat = df_mode.drop('status_group', axis = 1)
# y_cat = df_mode['status_group']

# # Perform train test split
# X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(X_cat, y_cat, random_state = 42, stratify = y_cat)

# # One hot encoded categorical data
# ohe = OneHotEncoder(drop = 'first')


# # Select initial parameters
# df_feat_import = X_train_cat[['extraction_type', 'management', 'payment', 'water_quality', 'source', 'source_class', 'region_code', 'district_code']]

# # fit transform data
# X_mode_train_enc = ohe.fit_transform(df_feat_import)

# # Instantiate model
# rfc_feat_import = RandomForestClassifier(random_state = 42, class_weight= 'balanced', n_jobs = -1)


# # Fit encoded data to model
# rfc_feat_import.fit(X_mode_train_enc, y_train_cat)

# # Model score
# rfc_feat_import.score(X_mode_train_enc, y_train_cat)

# # Predict on training data
# rfc_yhat = rfc_feat_import.predict(X_mode_train_enc)

# # Recall score on training data
# recall_score(y_train_cat, rfc_yhat, average='macro')

# # Precision score on training data
# precision_score(y_train_cat, rfc_yhat, average='macro')

# #F1 Score on training data
# f1_score(y_train_cat, rfc_yhat, average='macro')

# # 5-fold cross validation
# cross_val_score(rfc_feat_import, X_mode_train_enc, y_train_cat, cv = 5, scoring = 'recall_macro')

# plot_confusion_matrix(rfc_feat_import, X_mode_train_enc, y_train_cat);

# visualizer = ROCAUC(clf)
# visualizer.fit(X_train, y_train)
# visualizer.score(X_train, y_train)
# visualizer.show()

# #### Adding features to see if our model improves all other steps are a repeat from above

# ohe = OneHotEncoder(drop = 'first')

# df_feat_import = X_train_cat[[
#  'source_type',
#  'region',
#  'district_code',
#  'public_meeting',
#  'extraction_type',
#  'extraction_type_group',
#  'extraction_type_class',
#  'management',
#  'payment_type',
#  'quantity_group',
#  'source',
#  'source_class',
#  'waterpoint_type_group']]

# X_mode_train_enc = ohe.fit_transform(df_feat_import)

# rfc_feat_import = RandomForestClassifier(random_state = 42, class_weight= 'balanced', n_jobs = -1)

# rfc_feat_import.fit(X_mode_train_enc, y_train_cat)

# rfc_feat_import.score(X_mode_train_enc, y_train_cat)

# rfc_yhat = rfc_feat_import.predict(X_mode_train_enc)

# recall_score(y_train_cat, rfc_yhat, average = 'macro')

# cross_val_score(rfc_feat_import, X_mode_train_enc, y_train_cat, cv = 5, scoring = 'recall_macro')

# plot_confusion_matrix(rfc_feat_import, X_mode_train_enc, y_train_cat);

# visualizer = ROCAUC(clf)
# visualizer.fit(X_train, y_train)
# visualizer.score(X_train, y_train)
# visualizer.show()

# #### Adding features to see if our model improves all other steps are a repeat from above

# ohe = OneHotEncoder(drop = 'first')

# df_feat_import = X_train_cat[['region_code', 'source_type', 'basin', 'region', 'region_code', 'district_code',\
#                    'public_meeting', 'scheme_management', 'permit', 'construction_year', 'extraction_type',\
#                   'extraction_type_group', 'extraction_type_class','management',\
#                    'management_group', 'payment', 'payment_type', 'water_quality',\
#                   'quality_group', 'quantity', 'quantity_group', 'source', 'source_type',
#                    'source_class', 'waterpoint_type', 'waterpoint_type_group']]

# X_mode_train_enc = ohe.fit_transform(df_feat_import)

# rfc_feat_import = RandomForestClassifier(random_state = 42)

# rfc_feat_import.fit(X_mode_train_enc, y_train_cat)

# rfc_feat_import.score(X_mode_train_enc, y_train_cat)

# rfc_yhat = rfc_feat_import.predict(X_mode_train_enc)

# recall_score(y_train_cat, rfc_yhat, average = 'macro')

# cross_val_score(rfc_feat_import, X_mode_train_enc, y_train_cat, cv = 5, scoring = 'recall_macro')

# plot_confusion_matrix(rfc_feat_import, X_mode_train_enc, y_train_cat);

# visualizer = ROCAUC(clf)
# visualizer.fit(X_train, y_train)
# visualizer.score(X_train, y_train)
# visualizer.show()

# ### Now that we have seen model improvement we will use a GridSearch to find our best parameters

# # param_grid = {
# #  'max_depth': [3,10, None],
# #  'criterion': ['gini', 'entropy'],
# #  'min_samples_leaf': [1, 2, 4],
# #  'n_estimators': [100, 500],
# #  'class_weight': ['balanced', 'balanced_subsample'],
# #  'n_jobs': [-1]
# # }

# # grid_search = GridSearchCV(rfc_feat_import, param_grid, n_jobs=-1, cv = 3, return_train_score=True)

# # grid_search.fit(X_mode_train_enc, y_train_cat)

# # 'grid_search.best_params_'

# # "'class_weight': 'balanced',
# #  'criterion': 'entropy',
# #  'max_depth': None,
# #  'min_samples_leaf': 1,
# #  'n_estimators': 500,
# #  'n_jobs': -1")

# #### Based on our best_params_ we will input these parameters into a new model and repeat the above steps

# ohe = OneHotEncoder(drop = 'first')

# df_feat_import = X_train_cat[['region_code', 'source_type', 'basin', 'region', 'region_code', 'district_code',\
#                    'public_meeting', 'scheme_management', 'permit', 'construction_year', 'extraction_type',\
#                   'extraction_type_group', 'extraction_type_class','management',\
#                    'management_group', 'payment', 'payment_type', 'water_quality',\
#                   'quality_group', 'quantity', 'quantity_group', 'source', 'source_type',
#                    'source_class', 'waterpoint_type', 'waterpoint_type_group']]

# X_mode_train_enc = ohe.fit_transform(df_feat_import)

# rfc_feat_import = RandomForestClassifier(random_state = 42, class_weight='balanced', criterion='entropy', n_estimators = 500, n_jobs=-1)

# rfc_feat_import.fit(X_mode_train_enc, y_train_cat)

# rfc_feat_import.score(X_mode_train_enc, y_train_cat)

# rfc_yhat = rfc_feat_import.predict(X_mode_train_enc)

# recall_score(y_train_cat, rfc_yhat, average = 'macro')

# cross_val_score(rfc_feat_import, X_mode_train_enc, y_train_cat, cv = 5, scoring = 'recall_macro')

# plot_confusion_matrix(rfc_feat_import, X_mode_train_enc, y_train_cat);

# visualizer = ROCAUC(clf)
# visualizer.fit(X_train, y_train)
# visualizer.score(X_train, y_train)
# visualizer.show()

# ### Compared with our previous model, our recall jumped to 85.86% from 77.56%! Also, our cross_val_score values fall within a span of less than 2%, which shows that our model has low variance across folds.

# ### Previously we separated our data into two data frames. We will repeat the above process with the second dataframe to see if we get different results.

# X_other = df_other.drop('status_group', axis = 1)
# y_other = df_other['status_group']

# X_train_other, X_test_other, y_train_other, y_test_other = train_test_split(X_other, y_other, random_state = 42, stratify = y_other)

# ohe = OneHotEncoder(drop = 'first')

# df_feat_import = X_train_other[['extraction_type', 'management', 'payment', 'water_quality', 'source', 'source_class', 'region_code', 'district_code']]

# X_other_train_enc = ohe.fit_transform(df_feat_import)

# rfc_feat_import = RandomForestClassifier(random_state = 42, class_weight= 'balanced', n_jobs = -1)

# rfc_feat_import.fit(X_other_train_enc, y_train_other)

# rfc_feat_import.score(X_other_train_enc, y_train_other)

# rfc_yhat = rfc_feat_import.predict(X_other_train_enc)

# recall_score(y_train_other, rfc_yhat, average='macro')

# cross_val_score(rfc_feat_import, X_other_train_enc, y_train_other, cv = 5, scoring = 'recall_macro')

# plot_confusion_matrix(rfc_feat_import, X_other_train_enc, y_train_other);

# visualizer = ROCAUC(clf)
# visualizer.fit(X_train, y_train)
# visualizer.score(X_train, y_train)
# visualizer.show()

# ohe = OneHotEncoder(drop = 'first')

# df_feat_import = X_train_other[[
#  'source_type',
#  'region',
#  'district_code',
#  'extraction_type',
#  'extraction_type_group',
#  'extraction_type_class',
#  'management',
#  'payment_type',
#  'quantity_group',
#  'source',
#  'source_class',
#  'waterpoint_type_group']]

# X_other_train_enc = ohe.fit_transform(df_feat_import)

# rfc_feat_import = RandomForestClassifier(random_state = 42, class_weight= 'balanced', n_jobs = -1)

# rfc_feat_import.fit(X_other_train_enc, y_train_other)

# rfc_feat_import.score(X_other_train_enc, y_train_other)

# rfc_yhat = rfc_feat_import.predict(X_other_train_enc)

# recall_score(y_train_other, rfc_yhat, average='macro')

# cross_val_score(rfc_feat_import, X_other_train_enc, y_train_other, cv = 5, scoring = 'recall_macro')

# plot_confusion_matrix(rfc_feat_import, X_other_train_enc, y_train_other);

# visualizer = ROCAUC(clf)
# visualizer.fit(X_train, y_train)
# visualizer.score(X_train, y_train)
# visualizer.show()

# ohe = OneHotEncoder(drop = 'first')

# df_feat_import = X_train_other[['region_code', 'source_type', 'basin', 'region', 'region_code', 'district_code',\
#                     'scheme_management', 'construction_year', 'extraction_type',\
#                   'extraction_type_group', 'extraction_type_class','management',\
#                    'management_group', 'payment', 'payment_type', 'water_quality',\
#                   'quality_group', 'quantity', 'quantity_group', 'source', 'source_type',
#                    'source_class', 'waterpoint_type', 'waterpoint_type_group']]

# X_other_train_enc = ohe.fit_transform(df_feat_import)

# rfc_feat_import = RandomForestClassifier(random_state = 42)

# rfc_feat_import.fit(X_other_train_enc, y_train_other)

# rfc_feat_import.score(X_other_train_enc, y_train_other)

# rfc_yhat = rfc_feat_import.predict(X_other_train_enc)

# recall_score(y_train_other, rfc_yhat, average='macro')

# cross_val_score(rfc_feat_import, X_other_train_enc, y_train_other, cv = 5, scoring = 'recall_macro')

# plot_confusion_matrix(rfc_feat_import, X_other_train_enc, y_train_other);

# visualizer = ROCAUC(clf)
# visualizer.fit(X_train, y_train)
# visualizer.score(X_train, y_train)
# visualizer.show()

# param_grid = {
#  'max_depth': [3,10, None],
#  'criterion': ['gini', 'entropy'],
#  'min_samples_leaf': [1, 2, 4],
#  'n_estimators': [100, 500],
#  'class_weight': ['balanced', 'balanced_subsample'],
#  'n_jobs': [-1]
# }

# grid_search = GridSearchCV(rfc_feat_import, param_grid, n_jobs=-1, cv = 3, return_train_score=True)

# grid_search.fit(X_other_train_enc, y_train_other)

# grid_search.best_params_

# grid_search.best_score_

# ohe = OneHotEncoder(drop = 'first')

# df_feat_import = X_train_other[['region_code', 'source_type', 'basin', 'region', 'region_code', 'district_code',\
#                     'scheme_management', 'construction_year', 'extraction_type',\
#                   'extraction_type_group', 'extraction_type_class','management',\
#                    'management_group', 'payment', 'payment_type', 'water_quality',\
#                   'quality_group', 'quantity', 'quantity_group', 'source', 'source_type',
#                    'source_class', 'waterpoint_type', 'waterpoint_type_group']]

# X_other_train_enc = ohe.fit_transform(df_feat_import)

# rfc_feat_import = RandomForestClassifier(class_weight='balanced_subsample', criterion='gini', min_samples_leaf=1, n_estimators=100, random_state = 42, n_jobs=-1)

# rfc_feat_import.fit(X_other_train_enc, y_train_other)

# rfc_feat_import.score(X_other_train_enc, y_train_other)

# rfc_yhat = rfc_feat_import.predict(X_other_train_enc)

# recall_score(y_train_other, rfc_yhat, average='macro')

# cross_val_score(rfc_feat_import, X_other_train_enc, y_train_other, cv = 5, scoring = 'recall_macro')

# plot_confusion_matrix(rfc_feat_import, X_other_train_enc, y_train_other);

# visualizer = ROCAUC(clf)
# visualizer.fit(X_train, y_train)
# visualizer.score(X_train, y_train)
# visualizer.show()

### This model performed almost as well as the first one. With a recall score of 85.86%, the first model beat the second by 0.58 percentage points. That is not much, but it is still the kind of increase we're looking for!

### Categorical & Numerical Data

### Finally we will build a pipeline to incorporate all of our data.

In [None]:
X = df_mode[['region_code', 'source_type', 'basin', 'region', 'district_code',\
                   'public_meeting', 'scheme_management', 'permit', 'extraction_type',\
                  'extraction_type_group', 'extraction_type_class','management',\
                   'management_group', 'payment', 'payment_type', 'water_quality',\
                  'quality_group', 'quantity', 'quantity_group', 'source',
                   'source_class', 'waterpoint_type', 'waterpoint_type_group', 'gps_height', 'population',\
                   'construction_year', 'num_private', 'longitude', 'latitude']]

y = df_mode['status_group']

X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 42, stratify = y)

In [None]:
cat_features = ['region_code', 'source_type', 'basin', 'region', 'district_code',\
                   'public_meeting', 'scheme_management', 'permit', 'construction_year', 'extraction_type',\
                  'extraction_type_group', 'extraction_type_class','management',\
                   'management_group', 'payment', 'payment_type', 'water_quality',\
                  'quality_group', 'quantity', 'quantity_group', 'source',
                   'source_class', 'waterpoint_type', 'waterpoint_type_group']

categorical_transformer = OneHotEncoder(handle_unknown = 'ignore')

preprocessor = ColumnTransformer([('cat', categorical_transformer, cat_features)])

clf = Pipeline([('preprocessor', preprocessor), 
               ('classifier', RandomForestClassifier(verbose = 1, random_state = 42))])

clf.fit(X_train, y_train)
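
#### One thing to keep in mind with ColumnTransformer: by default (remainder = 'drop') any column that is not listed in a transformer is dropped before it reaches the classifier. The sketch below, on hypothetical toy data, shows the difference between a categorical-only preprocessor and one that also lists the numeric columns. The names num_cols and cat_and_num are illustrative and not part of this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one categorical and two numeric columns
toy = pd.DataFrame({
    'basin': ['Rufiji', 'Pangani', 'Rufiji', 'Internal'],
    'gps_height': [1390, 686, 263, 0],
    'population': [109, 250, 58, 1],
})

cat_cols = ['basin']
num_cols = ['gps_height', 'population']

# Only the categorical columns are listed, so the numeric ones are dropped
cat_only = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

# Listing the numeric columns as well keeps them in the output
cat_and_num = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols),
])

shape_cat_only = cat_only.fit_transform(toy).shape
shape_cat_and_num = cat_and_num.fit_transform(toy).shape
print(shape_cat_only)     # 3 one-hot columns
print(shape_cat_and_num)  # 3 one-hot columns + 2 scaled numeric columns
```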

In [None]:
clf.score(X_train, y_train)

In [None]:
y_hat = clf.predict(X_train)

In [None]:
recall_score(y_train, y_hat, average = 'macro')

In [None]:
param_grid = {
 'classifier__max_depth': [3,10, None],
 'classifier__criterion': ['gini', 'entropy'],
 'classifier__min_samples_leaf': [1, 2, 4],
 'classifier__n_estimators': [100, 500],
 'classifier__class_weight': ['balanced', 'balanced_subsample'],
 'classifier__n_jobs': [-1]
}

grid_search = GridSearchCV(clf, param_grid, n_jobs=-1, cv = 3, return_train_score=True)

grid_search.fit(X_train, y_train)
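
#### When no scoring argument is given, GridSearchCV ranks parameter combinations by the estimator's default score, which is accuracy for classifiers. If recall is the metric of interest, it can be passed explicitly. A minimal sketch on synthetic data with a deliberately small, hypothetical grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the encoded training set
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=42)

# Small illustrative grid, not the grid used above
toy_grid = {'max_depth': [3, None],
            'class_weight': ['balanced', None]}

# scoring='recall_macro' makes the search optimize macro-averaged recall
search = GridSearchCV(RandomForestClassifier(n_estimators=25, random_state=42),
                      toy_grid, scoring='recall_macro', cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

#### With this setting, best_score_ is the mean cross-validated macro recall of the winning combination rather than its accuracy.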

In [None]:
list(clf.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names(input_features = cat_features))

In [None]:
feat_import_desc = list(clf.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names(input_features = cat_features))
feat_import_num = grid_search.best_estimator_.named_steps['classifier'].feature_importances_
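
#### Importances are reported per one-hot encoded column, so a feature with many categories is spread over many small values. One possible way to roll them back up to the original features is sketched below on hypothetical names and numbers; the helper original_feature is an illustration, not part of this notebook.

```python
import pandas as pd

# Hypothetical feature list, encoded column names and importances
cat_features = ['source', 'source_type', 'basin']
encoded_names = ['source_river', 'source_spring',
                 'source_type_river/lake', 'source_type_spring',
                 'basin_Rufiji']
importances = [0.10, 0.20, 0.30, 0.15, 0.25]

def original_feature(encoded_name, features):
    '''Return the longest feature name that prefixes the encoded column name.'''
    # Longest match so that 'source_type_spring' maps to 'source_type', not 'source'.
    # A category value that itself starts with 'type_' would still be ambiguous.
    matches = [f for f in features if encoded_name.startswith(f + '_')]
    return max(matches, key=len)

feat = pd.DataFrame({'category': encoded_names, 'value': importances})
feat['feature'] = [original_feature(name, cat_features) for name in feat['category']]

# Total importance per original feature, largest first
per_feature = feat.groupby('feature')['value'].sum().sort_values(ascending=False)
print(per_feature)
```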

In [None]:
feat_imprt = pd.DataFrame(list(zip(feat_import_desc, feat_import_num)), columns=('category', 'value'))

# Show the encoded categories with the highest importance first
feat_imprt.sort_values('value', ascending = False).head(20)

In [None]:
grid_search.best_params_

In [None]:
categoricals = list

In [None]:
grid_search.best_score_

In [None]:
yhat = grid_search.predict(X_train)

print('recall score:', recall_score(y_train, yhat, average = 'macro'))
print('precision score:', precision_score(y_train, yhat, average = 'macro'))
print('f1 score:', f1_score(y_train, yhat, average = 'macro'))

In [None]:
plot_confusion_matrix(grid_search, X_train, y_train)
plt.grid(None);

In [None]:
visualizer = ROCAUC(clf)
visualizer.fit(X_train, y_train)
visualizer.score(X_train, y_train)
visualizer.show()


In [None]:
y_hat_test = grid_search.predict(X_test)

print('recall score:', recall_score(y_test, y_hat_test, average = 'macro'))
print('precision score:', precision_score(y_test, y_hat_test, average = 'macro'))
print('f1 score:', f1_score(y_test, y_hat_test, average = 'macro'))

In [None]:
plot_confusion_matrix(grid_search, X_test, y_test)
plt.grid(None)
plt.savefig('saved_objects/final_confusion');

In [None]:
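visualizer = ROCAUC(clf)
# Fit on the training data and score on the held-out test data
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
# Save through yellowbrick; calling plt.savefig after show() can write an empty figure
visualizer.show(outpath = 'saved_objects/final_ROC_AUC.png')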

In [None]:
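from sklearn.preprocessing import PolynomialFeatures

# Polynomial terms are only defined for numeric columns, so exclude the categorical features
num_features = [col for col in X_train.columns if col not in cat_features]

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly_train = pd.DataFrame(poly.fit_transform(X_train[num_features]), columns=poly.get_feature_names(num_features))
X_poly_test = pd.DataFrame(poly.transform(X_test[num_features]), columns=poly.get_feature_names(num_features))
X_poly_train.head()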

In [None]:
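from sklearn.feature_selection import VarianceThreshold

threshold_ranges = np.linspace(0, 2, num=6)

# VarianceThreshold needs numeric input, so restrict it to the numeric columns
num_features = [col for col in X_train.columns if col not in cat_features]

for thresh in threshold_ranges:
    print(thresh)
    selector = VarianceThreshold(thresh)
    reduced_feature_train = selector.fit_transform(X_train[num_features])
    reduced_feature_test = selector.transform(X_test[num_features])
    rfc = RandomForestClassifier(random_state = 42)
    rfc.fit(reduced_feature_train, y_train)
    # Report macro recall on both splits for each threshold
    print('train recall:', recall_score(y_train, rfc.predict(reduced_feature_train), average = 'macro'))
    print('test recall:', recall_score(y_test, rfc.predict(reduced_feature_test), average = 'macro'))
    print('--------------------------------------------------------------------')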