# Data Science - Unit 2.2.1
Name: Michael Luo

Date: 2022/11/08

# Module Project: Decision Trees

This week, the module projects will focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model that predicts whether a water pump is functional, non-functional, or functional but in need of repair.


## Directions

The tasks for this project are as follows:

- **Task 1:** Sign up for a [Kaggle](https://www.kaggle.com/) account. Join the Kaggle competition, and download the water pump dataset.
- **Task 2:** Use the `wrangle` function to import the training and test data.
- **Task 3:** Split training data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and validation sets.
- **Task 5:** Establish the baseline accuracy score for your dataset.
- **Task 6:** Build and train `model_dt`.
- **Task 7:** Calculate the training and validation accuracy score for your model.
- **Task 8:** Adjust the model's `max_depth` to reduce overfitting.
- **Task 9 `stretch goal`:** Create a horizontal bar chart showing the 10 most important features for your model.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

In [None]:
!pip install category_encoders
!pip install pandas_profiling
#IMPORTS
import pandas as pd
import numpy as np

#data processing
from sklearn.model_selection import train_test_split
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

#modeling
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier, plot_tree

#visualization
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Kaggle

**Task 1:** [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. **We recommend that you choose a username that's based on your name, since you might include it in your resume in the future.** Go to our Kaggle competition website (the URL is given on Canvas). Go to the **Rules** page. Accept the rules of the competition and download the dataset. Notice that the **Rules** page also has instructions for the Submission process. The **Data** page has feature definitions.

# I. Wrangle Data

In [None]:
#Mount our gdrive as virtual drive
from google.colab import drive
drive.mount('/content/drive')
  #creates a /MyDrive folder in .../content/drive
  #directory to top level Gdrive is .../content/drive/MyDrive/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#cwd to our files
%cd /content/drive/MyDrive/Colab Notebooks/Kaggle

/content/drive/MyDrive/Colab Notebooks/Kaggle


In [None]:
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path, 
                                  na_values=[0, -2.000000e-08]),
                      pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path, 
                         na_values=[0, -2.000000e-08],
                         index_col='id')

    # Drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)

    # Drop high-cardinality categorical columns (HCCCs)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Drop duplicate columns (compared on the first 100 rows only)
    is_dupe = df.head(100).T.duplicated()
    dupe_cols = is_dupe[is_dupe].index
    df.drop(columns=dupe_cols, inplace=True)

    return df

**Task 2:** Using the `wrangle` function above, read the `train_features.csv` and `train_labels.csv` files into the DataFrame `df`. Next, use the same function to read the test set `test_features.csv` into the DataFrame `X_test`.

In [None]:
df = wrangle('train_features.csv', 'train_labels.csv')
X_test = wrangle('test_features.csv')

# II. Split Data

**Task 3:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

In [None]:
target = 'status_group'
X = df.drop(columns=target)
y = df[target]

**Task 4:** Using a randomized split, divide `X` and `y` into a training set (`X_train`, `y_train`) and a validation set (`X_val`, `y_val`).

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# III. Establish Baseline

**Task 5:** Since this is a **classification** problem, you should establish a baseline accuracy score. Figure out what is the majority class in `y_train` and what percentage of your training observations it represents.

In [None]:
baseline_acc = y_train.value_counts(normalize=True).max()
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: 0.5441799289754045


# IV. Build Model

**Task 6:** Build a `Pipeline` named `model_dt`, and fit it to your training data. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `DecisionTreeClassifier` predictor.

**Note:** Don't forget to set the `random_state` parameter for your `DecisionTreeClassifier`.

In [None]:
X.iloc[:, 9:].head()

In [None]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 47519 entries, 454.0 to 23812.0
Data columns (total 29 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   amount_tsh             14189 non-null  float64
 1   gps_height             31245 non-null  float64
 2   longitude              46086 non-null  float64
 3   latitude               46086 non-null  float64
 4   num_private            617 non-null    float64
 5   basin                  47519 non-null  object 
 6   region                 47519 non-null  object 
 7   region_code            47519 non-null  int64  
 8   district_code          47500 non-null  float64
 9   population             30472 non-null  float64
 10  public_meeting         44831 non-null  Int32  
 11  scheme_management      44417 non-null  object 
 12  permit                 45080 non-null  Int32  
 13  construction_year      31017 non-null  float64
 14  extraction_type        47519 non-null  object 

In [None]:
#Modify Features

#Region/Region code: some regions have more than 1 code
# print(X['region'].nunique(), X['region_code'].nunique())
# X[['region','region_code']].value_counts().sort_index()
# X = X.drop(columns='region_code') #MAYBE - TODO

#Binary Features encoded as T/F
# binary_cols = ['public_meeting', 'permit']
# for col in binary_cols:
#   X[col] = X[col].astype(float).astype('Int32') #cast to float -> nullable Int type to impute later?

In [None]:
temp.shape

(38015, 31)

In [None]:
model_dt = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(random_state=42, max_depth=15)
)
model_dt.fit(X_train, y_train)

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['basin', 'region', 'public_meeting',
                                      'scheme_management', 'permit',
                                      'extraction_type',
                                      'extraction_type_group',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'payment_type', 'water_quality',
                                      'quality_group', 'quantity', 'source',
                                      'source_type', 'source_class',
                                      'waterpoint_type',
                                      'waterpoin...
communal standpipe             3
hand pump                      4
improved spring                5
cattle trough                  6
dam                            7
NaN                           -2
dtype: int64},
                

# V. Check Metrics

**Task 7:** Calculate the training and validation accuracy scores for `model_dt`.

In [None]:
training_acc = model_dt.score(X_train, y_train)
val_acc = model_dt.score(X_val, y_val)

print('Training Accuracy Score:', training_acc)
print('Validation Accuracy Score:', val_acc)

Training Accuracy Score: 0.8545048007365513
Validation Accuracy Score: 0.7629419191919192


# VI. Tune Model

**Task 8:** Is there a large difference between your training and validation accuracy? If so, experiment with different settings for `max_depth` in your `DecisionTreeClassifier` to reduce the amount of overfitting in your model.

In [None]:
# Use this cell to experiment and then change 
# your model hyperparameters in Task 6
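
One way to approach Task 8 is to refit the tree at several depths and compare training and validation accuracy side by side. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` so that it runs by itself, and the depth grid and the `Xd_*`/`yd_*` names are assumptions. In the notebook you would substitute `X_train`, `y_train`, `X_val`, `y_val` and rebuild the full `model_dt` pipeline at each depth.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the water pump data (assumption, for illustration only).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8, n_classes=3, random_state=42
)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical depth grid; adjust the range to suit your data.
depths = range(2, 21, 2)
results = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    results.append(
        (depth, tree.score(Xd_train, yd_train), tree.score(Xd_val, yd_val))
    )

for depth, train_acc, val_acc in results:
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  val={val_acc:.3f}")

# The depth with the highest validation accuracy is a reasonable starting point.
best_depth = max(results, key=lambda r: r[2])[0]
print("Best max_depth on validation data:", best_depth)
```

A widening gap between the training and validation columns as depth grows is the usual sign of overfitting; a depth near the point where validation accuracy stops improving is typically a sensible choice.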

# VII. Communicate Results

**Task 9 `stretch goal`:** Create a horizontal bar chart that shows the 10 most important features for `model_dt`, sorted by value.

**Note:** [`DecisionTreeClassifier.feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontreecla#sklearn.tree.DecisionTreeClassifier.feature_importances_) returns values that are different from [`LogisticRegression.coef_`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). All the values will be non-negative, and they will sum to `1`.
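
The two properties in the note can be checked directly. The following is a small sketch on synthetic data; the `X_toy`, `y_toy`, and `toy_tree` names are hypothetical and not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem (assumption, for illustration only).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
toy_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_toy, y_toy)

toy_importances = toy_tree.feature_importances_
print(toy_importances)
print("All non-negative:", bool((toy_importances >= 0).all()))
print("Sums to 1:", bool(np.isclose(toy_importances.sum(), 1.0)))
```

Unlike regression coefficients, these values carry no sign, so they indicate how much a feature contributed to the splits but not the direction of its effect.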

In [None]:
model_dt.named_steps

{'ordinalencoder': OrdinalEncoder(cols=['basin', 'region', 'public_meeting', 'scheme_management',
                      'permit', 'extraction_type', 'extraction_type_group',
                      'extraction_type_class', 'management', 'management_group',
                      'payment', 'payment_type', 'water_quality',
                      'quality_group', 'quantity', 'source', 'source_type',
                      'source_class', 'waterpoint_type',
                      'waterpoint_type_group'],
                mapping=[{'col': 'ba...
                          'mapping': groundwater    1
 surface        2
 unknown        3
 NaN           -2
 dtype: int64},
                         {'col': 'waterpoint_type', 'data_type': dtype('O'),
                          'mapping': other                          1
 communal standpipe multiple    2
 communal standpipe             3
 hand pump                      4
 improved spring                5
 cattle trough                  6
 dam             

In [None]:
tree = model_dt.named_steps['decisiontreeclassifier']
encoder = model_dt.named_steps['ordinalencoder']
importances = pd.DataFrame(data={'Importance': tree.feature_importances_},
                           index=encoder.get_feature_names())
importances = importances.sort_values(by='Importance').tail(n=10)

importances

| Feature | Importance |
|---|---|
| district_code | 0.02733 |
| payment_type | 0.027495 |
| population | 0.037514 |
| extraction_type_class | 0.038892 |
| gps_height | 0.05895 |
| construction_year | 0.062583 |
| latitude | 0.09337 |
| longitude | 0.110781 |
| waterpoint_type | 0.113465 |
| quantity | 0.242005 |
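
Task 9 asks for a horizontal bar chart, which the cells above stop short of drawing. Here is a minimal sketch, assuming the ten values printed above: it rebuilds them in a stand-alone frame named `importances_demo` (a hypothetical name) so that it runs by itself. In the notebook, the same `.plot(kind='barh')` call on `importances` should work without the hard-coded values.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Values copied from the `importances` output above, already sorted ascending.
importances_demo = pd.DataFrame(
    {"Importance": [0.02733, 0.027495, 0.037514, 0.038892, 0.05895,
                    0.062583, 0.09337, 0.110781, 0.113465, 0.242005]},
    index=["district_code", "payment_type", "population",
           "extraction_type_class", "gps_height", "construction_year",
           "latitude", "longitude", "waterpoint_type", "quantity"],
)

# barh draws rows bottom-to-top, so ascending order puts the largest bar on top.
ax = importances_demo["Importance"].plot(kind="barh", figsize=(8, 5))
ax.set_xlabel("Importance")
ax.set_ylabel("Feature")
ax.set_title("Top 10 feature importances for model_dt")
plt.tight_layout()
```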
