## BloomTech Data Science

---


# Decision Trees

- clean data with **outliers and missing values**
- use scikit-learn for **decision trees**
- get and interpret **feature importances** of a tree-based model
- understand why decision trees are useful models

In [None]:
%%capture 
!pip install category_encoders==2.*
!pip install pandas_profiling==2.*

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# pd.options.display.max_rows = 100

# Downloading the Tanzania Waterpump Dataset

Make sure you use only the dataset that is available through the **DS** **Kaggle Competition**. DO NOT USE any other Tanzania waterpump datasets that you might find online.

There are two ways to get the dataset. Join the competition first!

1. Download the dataset directly from the challenge files at the Kaggle Competition URL on Canvas.

2. Use the Kaggle API with the code in the following cells. This article explains how to fetch a Kaggle dataset into Google Colab using the Kaggle API:

> https://medium.com/analytics-vidhya/how-to-fetch-kaggle-datasets-into-google-colab-ea682569851a

# Using Kaggle API to download dataset

In [None]:
# # mounting your google drive on colab
# from google.colab import drive
# drive.mount('/content/gdrive')

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# # Change your working directory if you want to save, or have already saved, your Kaggle dataset on Google Drive.
# %cd /content/gdrive/My Drive/Kaggle
# # Update it to the folder on your drive that contains the dataset and/or the Kaggle API token JSON file.

In [None]:
# # Download your Kaggle Dataset, if you haven't already done so 
# import os
# os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle" # providing the config path to kaggle.json 
# !kaggle competitions download -c bloomtech-water-pump-challenge # downloading dataset by running the Kaggle API command

In [None]:
# Unzip your Kaggle dataset, if you haven't already done so.
# !unzip \*.zip  && rm *.zip

In [None]:
# # List all files in your Kaggle folder on your google drive.
# !ls

# I. Wrangle Data

### Import data



In [2]:
train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')
test_features = pd.read_csv('test_features.csv')

In [3]:
train_features.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,454,50.0,2013-02-27,Dmdd,2092,DMDD,35.42602,-4.227446,Narmo,0,...,per bucket,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
1,510,0.0,2011-03-17,Cmsr,0,Gove,35.510074,-5.724555,Lukali,0,...,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump
2,14146,0.0,2011-07-10,Kkkt,0,KKKT,32.499866,-9.081222,Mahakama,0,...,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other
3,47410,0.0,2011-04-12,,0,,34.060484,-8.830208,Shule Ya Msingi Chosi A,0,...,monthly,soft,good,insufficient,insufficient,river,river/lake,surface,communal standpipe,communal standpipe
4,1288,300.0,2011-04-05,Ki,1023,Ki,37.03269,-6.040787,Kwa Mjowe,0,...,on failure,salty,salty,enough,enough,shallow well,shallow well,groundwater,other,other


In [4]:
train_labels.head()

Unnamed: 0,id,status_group
0,454,functional
1,510,functional
2,14146,non functional
3,47410,non functional
4,1288,non functional


In [5]:
test_features.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,37098,0.0,2012-10-09,Rural Water Supply And Sanitat,0,DWE,31.985658,-3.59636,Kasela,0,...,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
1,14530,0.0,2012-11-03,Halmashauri Ya Manispa Tabora,0,Halmashauri ya manispa tabora,32.832815,-4.944937,Mbugani,0,...,never pay,milky,milky,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump
2,62607,10.0,2013-02-25,Siter Fransis,1675,DWE,35.488289,-4.242048,Kwa Leosi,0,...,per bucket,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
3,46053,0.0,2011-08-13,Kkkt,0,KKKT,33.140828,-9.059386,Jangi,0,...,never pay,soft,good,seasonal,seasonal,shallow well,shallow well,groundwater,hand pump,hand pump
4,47083,50.0,2013-02-08,Wateraid,1109,SEMA,34.217077,-4.430529,Mkima,0,...,per bucket,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe


In [6]:
train = pd.merge(pd.read_csv('train_features.csv'), pd.read_csv('train_labels.csv'))
X_test = pd.read_csv('test_features.csv')

### EDA

In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47520 entries, 0 to 47519
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     47520 non-null  int64  
 1   amount_tsh             47520 non-null  float64
 2   date_recorded          47520 non-null  object 
 3   funder                 44644 non-null  object 
 4   gps_height             47520 non-null  int64  
 5   installer              44631 non-null  object 
 6   longitude              47520 non-null  float64
 7   latitude               47520 non-null  float64
 8   wpt_name               47520 non-null  object 
 9   num_private            47520 non-null  int64  
 10  basin                  47520 non-null  object 
 11  subvillage             47224 non-null  object 
 12  region                 47520 non-null  object 
 13  region_code            47520 non-null  int64  
 14  district_code          47520 non-null  int64  
 15  lg

In [8]:
ProfileReport(train, minimal=True).to_notebook_iframe()


In [None]:
# train = pd.merge(pd.read_csv('train_features.csv',na_values=[0, -2.000000e-08]), 
#                  pd.read_csv('train_labels.csv'))
# X_test = pd.read_csv('test_features.csv', na_values=[0, -2.000000e-08])


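def wrangle(df):

  # Set the index to 'id'
  df.set_index('id', inplace=True)

  # Drop Constant Column
  # (assumption: 'recorded_by' holds a single value; confirm in the profile report)
  df.drop(columns='recorded_by', inplace=True)

  # Drop Duplicate Column
  # (assumption: 'quantity_group' repeats 'quantity', as the head() output suggests)
  df.drop(columns='quantity_group', inplace=True)

  # Drop High Cardinality Columns
  threshold = 100
  cols_to_drop = [col for col in df.select_dtypes('object') if df[col].nunique() > threshold]
  df.drop(columns=cols_to_drop, inplace=True)

  # Drop columns with high proportion of zeros
  # (assumption: 'num_private' is almost entirely zeros; confirm in the profile report)
  df.drop(columns='num_private', inplace=True)

  return df

train = wrangle(train)
X_test = wrangle(X_test)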


In [None]:
# null island! 
# bunch of data coordinates at 0,0

plt.scatter(train['longitude'], train['latitude'])
plt.show()

In [None]:
train[train['latitude'] == 0]

In [None]:
train[train['longitude'] == 0]
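
One way to handle these placeholder coordinates is to turn them into `NaN` so that an imputer can fill them later. The following is a minimal sketch on a toy frame, not the notebook's own solution: the helper name `mask_null_island` and the tolerance `tol` are hypothetical, and it assumes that values at or very near zero (such as `-2e-08`) mark missing locations.

```python
import numpy as np
import pandas as pd

def mask_null_island(df, cols=('longitude', 'latitude'), tol=1e-6):
    """Return a copy of df with near-zero coordinates replaced by NaN."""
    df = df.copy()
    for col in cols:
        # Keep values whose magnitude exceeds tol; everything else becomes NaN
        df[col] = df[col].where(df[col].abs() > tol, np.nan)
    return df

# Toy data: the middle row mimics a "null island" record
toy = pd.DataFrame({
    'longitude': [35.42602, 0.0, 32.499866],
    'latitude': [-4.227446, -2e-08, -9.081222],
})
cleaned = mask_null_island(toy)
print(cleaned)
```

Because the helper works on a copy, the original frame is left untouched, which makes it easy to compare before and after.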

# II. Split Data

## Split Target Vector (TV) from Feature Matrix (FM)

In [None]:
target = 'status_group'
y = train[target]
X = train.drop(columns=target)

## Training-Validation Split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.2, random_state=42)

# III. Establish Baseline

- Is this a *regression* or a *classification* problem?

In [None]:
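# Majority-class baseline: share of the most frequent label in the training set
print('baseline accuracy:', y_train.value_counts(normalize=True).max())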

# IV. Build Model(s)

**First Model:** Logistic Regression

In [None]:
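# One possible pipeline built from the imports above (an assumption, not the only valid choice)
model_lr = make_pipeline(OneHotEncoder(use_cat_names=True), SimpleImputer(),
                         StandardScaler(), LogisticRegression(max_iter=1000))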

model_lr.fit(X_train, y_train)

**Second Model:** Decision Tree Classifier

In [None]:
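# One possible pipeline built from the imports above (an assumption, not the only valid choice)
model_dt = make_pipeline(OrdinalEncoder(), SimpleImputer(),
                         DecisionTreeClassifier(random_state=42))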

model_dt.fit(X_train, y_train)

**Interlude: How does a tree model work?**
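
A classification tree grows by repeatedly picking the feature threshold that most reduces impurity in the resulting child nodes. The sketch below illustrates that idea for a single numeric feature using Gini impurity. The function names `gini` and `best_split` and the toy data are hypothetical, and scikit-learn's actual implementation is more elaborate (multiple features, stopping rules, tie handling).

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Try midpoints between sorted unique values of x.

    Returns (threshold, weighted child impurity) for the best candidate,
    or (None, parent impurity) if no split improves on the parent node.
    """
    best_t, best_score = None, gini(y)
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data: low values are all one class, high values the other
x = np.array([0, 10, 20, 1000, 1200, 1500])
y = np.array(['non functional'] * 3 + ['functional'] * 3)

threshold, impurity = best_split(x, y)
print('parent impurity:', gini(y))
print('best threshold:', threshold, 'child impurity:', impurity)
```

A full tree applies the same search to every feature at every node and recurses on the two children until a stopping rule such as `max_depth` is reached.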

# V. Check Metrics

In [None]:
print('model_lr accuracy score for training', accuracy_score(y_train, model_lr.predict(X_train)))
print('model_lr accuracy score for val', accuracy_score(y_val, model_lr.predict(X_val)))

In [None]:

print('model_dt accuracy score for training', accuracy_score(y_train, model_dt.predict(X_train)))
print('model_dt accuracy score for val', accuracy_score(y_val, model_dt.predict(X_val)))

# VI. Tune Model

In [None]:
depths = range(5, 20, 2)
list(depths)

In [None]:
# very similar steps to how we tuned alpha for ridge regression

train_acc = []
val_acc = []

for depth in depths:
  # Assumed pipeline: ordinal encoding, mean imputation, then a tree with max_depth varied
  tree_model = make_pipeline(OrdinalEncoder(), SimpleImputer(),
                             DecisionTreeClassifier(max_depth=depth, random_state=42))
  tree_model.fit(X_train, y_train)
  train_acc.append(tree_model.score(X_train, y_train))
  val_acc.append(tree_model.score(X_val, y_val))

In [None]:
plt.plot(depths, train_acc, color='blue', label='training')
plt.plot(depths, val_acc, color='orange', label='validation')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.title('Validation Curves') # These plots are called VALIDATION CURVES!
plt.legend()
plt.show()

# VII. Communicate Results


### Gini importance

In [None]:
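# Assumes model_dt is a pipeline with a DecisionTreeClassifier step and an
# encoder that keeps one column per input feature (e.g. OrdinalEncoder)
features = X_train.columns
gini_importances = model_dt.named_steps['decisiontreeclassifier'].feature_importances_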
pd.Series(data=gini_importances, index=features).sort_values(key=abs).tail(10).plot(kind='barh')
plt.ylabel('features')
plt.xlabel('gini importance');
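
Gini importance measures how much each feature's splits reduce impurity across the tree, normalized so that the importances sum to 1. To build intuition, here is a small sketch on synthetic data (not the waterpump dataset). The column names `signal` and `noise` are made up for illustration: the label depends only on `signal`, so that feature should receive nearly all of the importance.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label is determined by 'signal'; 'noise' is unrelated
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_toy = (X_toy['signal'] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_toy, y_toy)

# feature_importances_ is aligned with the column order of the training data
importances = pd.Series(tree.feature_importances_, index=X_toy.columns)
print(importances.sort_values())
```

Keep in mind that Gini importance reflects how the fitted tree used each feature on the training data, so it can favor high-cardinality features and does not indicate the direction of an effect.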

# VIII. Kaggle Submission

In [None]:
predictions = pd.DataFrame(data=model_lr.predict(X_test), index=X_test.index)

In [None]:
predictions.columns = ['status_group']

In [None]:
predictions

In [None]:
# generate CSV
predictions.to_csv('new_submission.csv')

In [None]:
# download
from google.colab import files
files.download("new_submission.csv")
