# 1. Overview

Building on the descriptive and exploratory analysis in notebook 00_data_understanding, this notebook fits two models: a logistic regression and a decision tree classifier. We will choose the model with the better evaluation metrics and then improve it by tuning its hyperparameters.

# 2. Data Understanding

## 2.1 Data Description

This file uses the df_train_transform Excel sheet created in the previous notebook, 00_data_understanding.

## 2.2 Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder


## 2.3 Functions

# 3. Code

## 3.1 Import the database

In [2]:
df = pd.read_excel('df_train_transform.xlsx')
df.head()

Unnamed: 0,amount_tsh,gps_height,population,basin,region,extraction_type_class,management_group,payment_type,quality_group,quantity_group,source_type,waterpoint_type,funder_type,installer_type,scheme_management_grouped,status_group
0,6000.0,1390,109,lake nyasa,iringa,gravity,usergroup,annually,good,enough,spring,communal standpipe,individualother,other,government,functional
1,0.0,1399,280,lake victoria,mara,gravity,usergroup,never pay,good,insufficient,rainwater harvesting,communal standpipe,individualother,other,other,functional
2,25.0,686,250,pangani,manyara,gravity,usergroup,per bucket,good,enough,dam,communal standpipe multiple,individualother,other,government,functional
3,0.0,263,58,ruvuma southern coast,mtwara,submersible,usergroup,never pay,good,dry,borehole,communal standpipe multiple,international aid,ngo,government,non functional
4,0.0,0,0,lake victoria,kagera,gravity,other,never pay,good,seasonal,rainwater harvesting,communal standpipe,individualother,other,other,functional


## 3.2 Class Imbalance checking

In [3]:
print(df['status_group'].value_counts(normalize=True))

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64


We merge 'functional needs repair' into 'functional' so that both form a single class. This turns the task into a binary classification problem.

In [4]:
df['status_group'] = df['status_group'].replace('functional needs repair', 'functional')

# Verify the changes
print(df['status_group'].value_counts(normalize=True))

functional        0.615758
non functional    0.384242
Name: status_group, dtype: float64


## 3.3 Define predictor and target variables

In [5]:
y = df['status_group']
X = df.drop('status_group', axis=1)

## 3.4 Dealing with categorical columns

Identify the categorical columns:

In [6]:
X_categorical = X.select_dtypes(include=['object', 'category'])

X_categorical.columns

Index(['basin', 'region', 'extraction_type_class', 'management_group',
       'payment_type', 'quality_group', 'quantity_group', 'source_type',
       'waterpoint_type', 'funder_type', 'installer_type',
       'scheme_management_grouped'],
      dtype='object')

In [7]:
X_categorical

Unnamed: 0,basin,region,extraction_type_class,management_group,payment_type,quality_group,quantity_group,source_type,waterpoint_type,funder_type,installer_type,scheme_management_grouped
0,lake nyasa,iringa,gravity,usergroup,annually,good,enough,spring,communal standpipe,individualother,other,government
1,lake victoria,mara,gravity,usergroup,never pay,good,insufficient,rainwater harvesting,communal standpipe,individualother,other,other
2,pangani,manyara,gravity,usergroup,per bucket,good,enough,dam,communal standpipe multiple,individualother,other,government
3,ruvuma southern coast,mtwara,submersible,usergroup,never pay,good,dry,borehole,communal standpipe multiple,international aid,ngo,government
4,lake victoria,kagera,gravity,other,never pay,good,seasonal,rainwater harvesting,communal standpipe,individualother,other,other
...,...,...,...,...,...,...,...,...,...,...,...,...
59395,pangani,kilimanjaro,gravity,usergroup,per bucket,good,enough,spring,communal standpipe,individualother,other,water board
59396,rufiji,iringa,gravity,usergroup,annually,good,enough,riverlake,communal standpipe,individualother,other,government
59397,rufiji,mbeya,handpump,usergroup,monthly,fluoride,enough,borehole,hand pump,international aid,other,government
59398,rufiji,dodoma,handpump,usergroup,never pay,good,insufficient,shallow well,hand pump,individualother,other,government


- The following columns have at most 6 categories:
    management_group, quantity_group, waterpoint_type, funder_type, scheme_management_grouped

- We apply one-hot encoding only to these columns, because encoding a column with more than 6 categories would add an excessive number of new columns.
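
As a rough illustration of how such a split by number of categories could be computed, the sketch below counts distinct values per column with `nunique()`. The toy frame, the `candidate_columns` list, and the `MAX_CATEGORIES` threshold are all assumptions made for the example (the notebook itself uses a threshold of 6 on the real data).

```python
import pandas as pd

# Toy frame standing in for X; the values are illustrative only
toy = pd.DataFrame({
    "quantity_group": ["enough", "dry", "enough", "seasonal"],
    "region": ["iringa", "mara", "manyara", "mtwara"],
})

candidate_columns = ["quantity_group", "region"]  # hypothetical list of categorical columns
MAX_CATEGORIES = 3  # hypothetical threshold for the toy data; the notebook uses 6

# Number of distinct categories in each candidate column
cardinality = toy[candidate_columns].nunique()

# Columns at or below the threshold are candidates for one-hot encoding,
# the rest need a more compact encoding
low_cardinality = cardinality[cardinality <= MAX_CATEGORIES].index.tolist()
high_cardinality = cardinality[cardinality > MAX_CATEGORIES].index.tolist()

print(cardinality)
print("one-hot:", low_cardinality)
print("other encoding:", high_cardinality)
```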

In [8]:
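# Create a OneHotEncoder instance that returns a dense array
# (scikit-learn >= 1.2 names this parameter sparse_output; older releases used sparse)
ohe = OneHotEncoder(sparse_output=False)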

# List of columns to encode
columns_to_encode = ['management_group', 'quantity_group', 'waterpoint_type', 'funder_type', 'scheme_management_grouped']

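# Apply OneHotEncoder to the selected columns and wrap the output in a DataFrame that keeps X's index.
# Column names are the bare category labels, so a label shared by several columns (e.g. 'other') yields duplicate names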
X_train_categorical = pd.DataFrame(ohe.fit_transform(X[columns_to_encode]),
                                   index=X.index,
                                   columns=np.hstack(ohe.categories_))

# Remove the original categorical columns from X
X.drop(columns_to_encode, axis=1, inplace=True)

# Concatenate the original DataFrame X with the new one hot encoded columns
X = pd.concat([X, X_train_categorical], axis=1)

X.head()

Unnamed: 0,amount_tsh,gps_height,population,basin,region,extraction_type_class,payment_type,quality_group,source_type,installer_type,...,individualother,international aid,ngo,private companies,religious organizations,community,government,other,private sector,water board
0,6000.0,1390,109,lake nyasa,iringa,gravity,annually,good,spring,other,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,1399,280,lake victoria,mara,gravity,never pay,good,rainwater harvesting,other,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,25.0,686,250,pangani,manyara,gravity,per bucket,good,dam,other,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,263,58,ruvuma southern coast,mtwara,submersible,never pay,good,borehole,ngo,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0,0,lake victoria,kagera,gravity,never pay,good,rainwater harvesting,other,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Let's list the categorical columns that remain in X:

In [9]:
X_categorical = X.select_dtypes(include=['object', 'category'])

X_categorical.columns

Index(['basin', 'region', 'extraction_type_class', 'payment_type',
       'quality_group', 'source_type', 'installer_type'],
      dtype='object')

Because these remaining categorical columns have more than 6 categories, we will apply mean encoding so that the model can use them as numerical values.
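
A minimal sketch of mean (target) encoding on made-up data follows; it is not necessarily the exact procedure used later in this notebook. It assumes the target is first mapped to 0/1 (here 1 = 'non functional', an arbitrary choice for the example) and then replaces each category with the mean of that binary target within the category. The toy frame and the `basin_encoded` column name are hypothetical.

```python
import pandas as pd

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "basin": ["pangani", "pangani", "rufiji", "rufiji", "rufiji"],
    "status_group": ["functional", "non functional",
                     "functional", "functional", "non functional"],
})

# Binary target: 1 for 'non functional', 0 for 'functional' (assumed convention)
target = (toy["status_group"] == "non functional").astype(int)

# Mean of the target within each category
category_means = target.groupby(toy["basin"]).mean()

# Replace each category with its mean target value
toy["basin_encoded"] = toy["basin"].map(category_means)

print(category_means)
print(toy)
```

In a modelling workflow, the per-category means would normally be computed on the training split only and then mapped onto the validation or test data, so that target information from held-out rows does not leak into the features.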