Lambda School Data Science

*Unit 2, Sprint 2, Module 2*

---

In [2]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/main/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Random Forests

This week, the module projects will focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model to predict whether a water pump is functional, non-functional, or needs repair.

Dataset source: [DrivenData.org](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

## Directions

The tasks for this project are as follows:

- **Task 1:** Sign up for a [Kaggle](https://www.kaggle.com/) account.
- **Task 2:** Use the `wrangle` function to import training and test data.
- **Task 3:** Split training data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and validation sets.
- **Task 5:** Establish the baseline accuracy score for your dataset.
- **Task 6:** Build and train `model_rf`.
- **Task 7:** Calculate the training and validation accuracy score for your model.
- **Task 8:** Tune the model's `n_estimators` and `max_depth` to get the best validation accuracy score.
- **Task 9 `stretch goal`:** Create a horizontal bar chart showing the 10 most important features for your model.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# I. Wrangle Data

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


In [4]:
train = pd.merge(pd.read_csv('train_features.csv', na_values=[0, -2.000000e-08]), 
                 pd.read_csv('train_labels.csv', na_values=[0, -2.000000e-08]))

def wrangle(df):

  # Set the index to 'id'
  df.set_index('id', inplace=True)

  # Drop Constant Column
  df.drop(columns='recorded_by', inplace=True)

  # Drop Duplicate Column
  df.drop(columns='quantity_group', inplace=True)

  # Drop High Cardinality Columns
  cols_to_drop = [col for col in df.select_dtypes('object') if df[col].nunique() > 100]
  df.drop(columns=cols_to_drop, inplace=True)

  # Drop columns with high proportion of null values
  df.drop(columns='num_private', inplace=True)

  return df

**Task 1:** Sign up for a [Kaggle](https://www.kaggle.com/) account. Choose a username that's based on your real name. Like GitHub, Kaggle is part of your public profile as a data scientist.

**Task 2:** Modify the `wrangle` function to engineer a `'pump_age'` feature. Then use the function to read `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.

In [5]:
test = pd.read_csv('test_features.csv', na_values=[0, -2.000000e-08])

df = wrangle(train)
X_test = wrangle(test)
print(df.shape, X_test.shape)

(47520, 29) (11880, 28)

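The `wrangle` function above does not yet create the `'pump_age'` feature that Task 2 asks for. One possible approach is sketched below. It assumes the raw data has a `date_recorded` column of date strings alongside `construction_year`; the helper name `add_pump_age` and the toy rows are hypothetical, not part of the original notebook.

```python
import pandas as pd

def add_pump_age(df):
    """Return a copy of df with a 'pump_age' column (years from construction to recording)."""
    df = df.copy()
    # Assumption: 'date_recorded' holds parseable date strings such as '2011-03-14'.
    year_recorded = pd.to_datetime(df['date_recorded']).dt.year
    # Assumption: an unknown 'construction_year' is NaN (zeros are read as NaN above),
    # so the resulting age is NaN too and is left for the imputer to handle.
    df['pump_age'] = year_recorded - df['construction_year']
    return df

# Tiny made-up example, not real rows from the dataset.
toy = pd.DataFrame({
    'date_recorded': ['2011-03-14', '2013-02-04', '2012-10-01'],
    'construction_year': [1999.0, None, 2010.0],
})
toy_with_age = add_pump_age(toy)
print(toy_with_age['pump_age'].tolist())
```

A step like this would presumably have to run inside `wrangle` before the high-cardinality drop, because `date_recorded` is an object column with many unique values and would otherwise be removed first.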

# II. Split Data

**Task 3:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

In [6]:
target = 'status_group'
y = df[target]
X = df.drop(columns=target)

**Task 4:** Using a randomized split, divide `X` and `y` into a training set (`X_train`, `y_train`) and a validation set (`X_val`, `y_val`).

In [9]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42)

# III. Establish Baseline

**Task 5:** Since this is a **classification** problem, you should establish a baseline accuracy score. Find the majority class in `y_train` and the percentage of your training observations it represents.

In [10]:
baseline_acc = y_train.value_counts(normalize = True).max()
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: 0.5440867003367004

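The baseline above is the share of the most common class, which is also the accuracy you would get by predicting that class for every row. A minimal sketch of that equivalence, using made-up labels rather than the real `y_train`:

```python
import pandas as pd

# Made-up labels for illustration only.
y_toy = pd.Series(['functional', 'functional', 'non functional',
                   'functional', 'functional needs repair', 'non functional'])

# Share of the most frequent class.
majority_share = y_toy.value_counts(normalize=True).max()

# Accuracy of a "model" that always predicts the majority class.
majority_class = y_toy.mode()[0]
always_majority = pd.Series([majority_class] * len(y_toy))
naive_accuracy = (always_majority == y_toy).mean()

print(majority_class, majority_share, naive_accuracy)
```

Any real model should beat this number on the validation set to be worth keeping.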

# IV. Build Model

**Task 6:** Build a `Pipeline` named `model_rf`, and fit it to your training data. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `RandomForestClassifier` predictor.

**Note:** Don't forget to set the `random_state` parameter for your `RandomForestClassifier`. Also, to decrease training time, set `n_jobs` to `-1`.

In [11]:
from sklearn.ensemble import RandomForestClassifier

model_rf = make_pipeline(OrdinalEncoder(),
                         SimpleImputer(strategy = 'mean'),
                         RandomForestClassifier(random_state = 42, max_depth = 12, n_jobs = -1))

model_rf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['basin', 'region', 'public_meeting',
                                      'scheme_management', 'permit',
                                      'extraction_type',
                                      'extraction_type_group',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'payment_type', 'water_quality',
                                      'quality_group', 'quantity', 'source',
                                      'source_type', 'source_class',
                                      'waterpoint_type...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=12, max_features='auto',
                                   

# V. Check Metrics

**Task 7:** Calculate the training and validation accuracy scores for `model_rf`.

In [12]:
training_acc = model_rf.score(X_train, y_train)
val_acc = model_rf.score(X_val, y_val)

print('Training Accuracy Score:', training_acc)
print('Validation Accuracy Score:', val_acc)

Training Accuracy Score: 0.8146306818181818
Validation Accuracy Score: 0.779776936026936

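Two differences are worth reading off these scores: the gap between training and validation accuracy (a rough sign of overfitting) and the lift of validation accuracy over the baseline. The sketch below just does that arithmetic on the numbers printed above; the variable names `gap` and `lift` are illustrative.

```python
# Scores copied from the cell outputs above.
baseline_acc = 0.5440867003367004
training_acc = 0.8146306818181818
val_acc = 0.779776936026936

# How much better the model does on data it has already seen.
gap = training_acc - val_acc
# How much better the model does than always guessing the majority class.
lift = val_acc - baseline_acc

print(f'Train/validation gap: {gap:.4f}')
print(f'Validation lift over baseline: {lift:.4f}')
```

With `max_depth = 12` the gap is roughly 3.5 percentage points, while the model beats the baseline by about 23.6 points.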

# VI. Tune Model

**Task 8:** Tune `n_estimators` and `max_depth` hyperparameters for your `RandomForestClassifier` to get the best validation accuracy score for `model_rf`. 

In [13]:
# Use this cell to experiment and then change 
# your model hyperparameters in Task 6
depths = np.arange(1, 30, 1)
train_acc_depth = []
val_acc_depth = []


for depth in depths:
  model_rf = make_pipeline(OrdinalEncoder(),
                         SimpleImputer(strategy = 'mean'),
                         RandomForestClassifier(random_state = 42, max_depth = depth) )

  model_rf.fit(X_train, y_train)
  train_acc_depth.append(model_rf.score(X_train, y_train))
  val_acc_depth.append(model_rf.score(X_val, y_val))


In [14]:
pd.DataFrame({'training': train_acc_depth, 'validation': val_acc_depth},
             index = pd.Index(depths, name = 'max_depth'))

max_depth,training,validation
1,0.637153,0.642677
2,0.671507,0.674874
3,0.701941,0.703493
4,0.712674,0.714646
5,0.720276,0.724537
6,0.732428,0.732008
7,0.743739,0.741267
8,0.755577,0.751052
9,0.766283,0.758207
10,0.77904,0.762521

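Once the sweep results are in a DataFrame, `idxmax` can pick the setting with the best validation accuracy, and a gap column shows where training starts to pull away from validation. This sketch uses only the ten rows displayed above (the full sweep runs to depth 29), and the column names are chosen here for illustration.

```python
import pandas as pd

# Scores for depths 1-10, copied from the table above.
scores = pd.DataFrame(
    {'training': [0.637153, 0.671507, 0.701941, 0.712674, 0.720276,
                  0.732428, 0.743739, 0.755577, 0.766283, 0.77904],
     'validation': [0.642677, 0.674874, 0.703493, 0.714646, 0.724537,
                    0.732008, 0.741267, 0.751052, 0.758207, 0.762521]},
    index=pd.RangeIndex(1, 11, name='max_depth'))

# Positive gap means the model scores higher on training than on validation.
scores['gap'] = scores['training'] - scores['validation']

# Index label (depth) of the row with the highest validation accuracy.
best_depth = scores['validation'].idxmax()
print(best_depth, scores.loc[best_depth].round(4).to_dict())
```

Within these ten rows validation accuracy is still rising at depth 10, so the best depth likely lies further along the sweep.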

In [16]:
estimators = np.arange(10, 300, 10)

train_acc_est = []
val_acc_est = []

for estimator in estimators:
  model_rf = make_pipeline(OrdinalEncoder(),
                         SimpleImputer(strategy = 'mean'),
                         RandomForestClassifier(random_state = 42, 
                                                max_depth = 20, 
                                                n_estimators = estimator,
                                                n_jobs = -1) )

  model_rf.fit(X_train, y_train)
  train_acc_est.append(model_rf.score(X_train, y_train))
  val_acc_est.append(model_rf.score(X_val, y_val))

In [22]:
pd.DataFrame({'training': train_acc_est, 'validation': val_acc_est},
             index = pd.Index(estimators, name = 'n_estimators'))

n_estimators,training,validation
10,0.935185,0.798506
20,0.943892,0.804714
30,0.946391,0.805766
40,0.947654,0.806608
50,0.948785,0.806608
60,0.949811,0.808081
70,0.9501,0.80787
80,0.950047,0.807976
90,0.950836,0.809659
100,0.950652,0.809133


In [29]:
samples = np.arange(0.4, 0.75, 0.01)
train_acc_samp = []
val_acc_samp = []

for sample in samples:
  model_rf = make_pipeline(OrdinalEncoder(),
                         SimpleImputer(strategy = 'mean'),
                         RandomForestClassifier(random_state = 42, 
                                                max_depth = 15, 
                                                n_estimators = 20,
                                                max_samples = sample,
                                                n_jobs = -1) )

  model_rf.fit(X_train, y_train)
  train_acc_samp.append(model_rf.score(X_train, y_train))
  val_acc_samp.append(model_rf.score(X_val, y_val))

In [33]:
pd.DataFrame({'training': train_acc_samp, 'validation': val_acc_samp},
             index = pd.Index(samples, name = 'max_samples')).sort_values(by = 'validation')

max_samples,training,validation
0.41,0.842093,0.790614
0.66,0.858112,0.791982
0.51,0.850563,0.792719
0.55,0.852562,0.792824
0.4,0.844092,0.79314
0.43,0.845881,0.79335
0.59,0.852641,0.793561
0.47,0.84759,0.793666
0.56,0.853614,0.793876
0.54,0.852089,0.793981


In [34]:
model_rf2 = make_pipeline(OrdinalEncoder(),
                         SimpleImputer(strategy = 'mean'),
                         RandomForestClassifier(random_state = 42, 
                                                max_depth = 20, 
                                                n_estimators = 130,
                                                max_samples = 0.65,
                                                n_jobs = -1) )

model_rf2.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['basin', 'region', 'public_meeting',
                                      'scheme_management', 'permit',
                                      'extraction_type',
                                      'extraction_type_group',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'payment_type', 'water_quality',
                                      'quality_group', 'quantity', 'source',
                                      'source_type', 'source_class',
                                      'waterpoint_type...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=20, max_features='auto',
                                   

In [21]:
model2_train_acc = model_rf2.score(X_train, y_train)
model2_val_acc = model_rf2.score(X_val, y_val)
print(f'Training Accuracy: {model2_train_acc}; Validation Accuracy: {model2_val_acc}.')

Training Accuracy: 0.9501262626262627; Validation Accuracy: 0.8089225589225589.

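The directions also list a stretch goal of charting the 10 most important features. A fitted `RandomForestClassifier` exposes these as `feature_importances_`. The sketch below shows the idea on synthetic data from `make_classification`, not on the water pump data; the feature names and the `forest` variable are made up for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the water pump dataset).
X_arr, y_toy = make_classification(n_samples=300, n_features=15,
                                   n_informative=5, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(15)])

forest = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
forest.fit(X_toy, y_toy)

# One importance value per column; the values sum to 1.
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)

# Ascending sort puts the most important feature at the top of a barh chart.
top10 = importances.sort_values().tail(10)

ax = top10.plot(kind='barh')
ax.set_xlabel('Gini importance')
ax.set_title('Top 10 features')
plt.tight_layout()
```

For the pipeline in this notebook, the fitted forest should be reachable as `model_rf2.named_steps['randomforestclassifier']`, with `X_train.columns` supplying the feature names, assuming the encoder and imputer leave the column count and order unchanged.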

# VII. Communicate Results

**Task 9:** Generate a list of predictions for `X_test`. The list should be named `y_pred`.

In [36]:
X_test.head()

id,amount_tsh,gps_height,longitude,latitude,basin,region,region_code,district_code,population,public_meeting,scheme_management,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group
37098,,,31.985658,-3.59636,Lake Tanganyika,Shinyanga,17,5.0,,True,WUG,True,,other,other,other,wug,user-group,unknown,unknown,soft,good,dry,shallow well,shallow well,groundwater,other,other
14530,,,32.832815,-4.944937,Lake Tanganyika,Tabora,14,6.0,,True,VWC,True,,india mark ii,india mark ii,handpump,vwc,user-group,never pay,never pay,milky,milky,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump
62607,10.0,1675.0,35.488289,-4.242048,Internal,Manyara,21,1.0,148.0,True,Water Board,True,2008.0,gravity,gravity,gravity,water board,user-group,pay per bucket,per bucket,soft,good,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
46053,,,33.140828,-9.059386,Lake Rukwa,Mbeya,12,6.0,,False,VWC,False,,nira/tanira,nira/tanira,handpump,vwc,user-group,never pay,never pay,soft,good,seasonal,shallow well,shallow well,groundwater,hand pump,hand pump
47083,50.0,1109.0,34.217077,-4.430529,Internal,Singida,13,1.0,235.0,True,WUA,True,2011.0,mono,mono,motorpump,wua,user-group,pay per bucket,per bucket,soft,good,enough,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe


In [35]:
y_pred = model_rf2.predict(X_test)

assert len(y_pred) == len(X_test), f'Your list of predictions should have {len(X_test)} items in it. '

In [39]:
len(y_pred)

11880

**Task 11 `stretch goal`:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submissions to our competition site. 

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submission follows the same formatting.

In [None]:
submission = pd.DataFrame(data = y_pred, index = X_test.index, columns = ['status_group'])
submission

In [50]:
submission.to_csv('submission_fixed.csv')

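Before uploading, it can help to confirm that the CSV header matches the expected format. The sketch below assumes the sample submission uses the columns `id` and `status_group` (check `sample_submission.csv` to confirm) and works on made-up ids and predictions, writing to an in-memory buffer rather than a file.

```python
import io
import pandas as pd

# Made-up ids and predictions for illustration.
toy_index = pd.Index([37098, 14530, 62607], name='id')
toy_pred = ['functional', 'non functional', 'functional']

toy_submission = pd.DataFrame({'status_group': toy_pred}, index=toy_index)

# Write to an in-memory buffer, then inspect the header row.
buffer = io.StringIO()
toy_submission.to_csv(buffer)
csv_lines = buffer.getvalue().splitlines()
header = csv_lines[0]
print(header)
```

Because the index is named `id`, `to_csv` writes it as the first column, so no `index=False` or extra column is needed.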
In [51]:
from google.colab import files
files.download('submission_fixed.csv')
