# 1. Overview

Building on the descriptive and exploratory analysis in notebook 00_data_understanding, this notebook preprocesses the data so that it is ready for model training.

# 2. Data Understanding

## 2.1 Data Description

This notebook uses the df_train_transform Excel file created in the previous notebook, 00_data_understanding.

## 2.2 Import Necessary Libraries

In [1]:
# pip install category_encoders

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split



## 2.3 Functions

# 3. Code

## 3.1 Import the database

In [3]:
df = pd.read_excel('df_train_transform.xlsx')
df.head()

Unnamed: 0,amount_tsh,gps_height,population,basin,region,public_meeting,permit,extraction_type_class,management_group,payment_type,quality_group,quantity_group,source_type,waterpoint_type,scheme_management_grouped,status_group
0,6000.0,1390,109,lake nyasa,iringa,1.0,0.0,gravity,usergroup,annually,good,enough,spring,communal standpipe,government,functional
1,0.0,1399,280,lake victoria,mara,,1.0,gravity,usergroup,never pay,good,insufficient,rainwater harvesting,communal standpipe,other,functional
2,25.0,686,250,pangani,manyara,1.0,1.0,gravity,usergroup,per bucket,good,enough,dam,communal standpipe multiple,government,functional
3,0.0,263,58,ruvuma southern coast,mtwara,1.0,1.0,submersible,usergroup,never pay,good,dry,borehole,communal standpipe multiple,government,non functional
4,0.0,0,0,lake victoria,kagera,1.0,1.0,gravity,other,never pay,good,seasonal,rainwater harvesting,communal standpipe,other,functional


## 3.2 Dealing with null values

In [5]:
# For train data
(df.isna().sum()/len(df))*100

amount_tsh                   0.000000
gps_height                   0.000000
population                   0.000000
basin                        0.000000
region                       0.000000
public_meeting               5.612795
permit                       5.144781
extraction_type_class        0.000000
management_group             0.000000
payment_type                 0.000000
quality_group                0.000000
quantity_group               0.000000
source_type                  0.000000
waterpoint_type              0.000000
scheme_management_grouped    0.000000
status_group                 0.000000
dtype: float64

### Column 'public_meeting'

In [6]:
df["public_meeting"].value_counts(normalize=True)

1.0    0.909838
0.0    0.090162
Name: public_meeting, dtype: float64

In [7]:
# Since only about 6% of the values are null, let's replace them with the mode

# Calculate the mode of the 'public_meeting' column
public_meeting_mode = df['public_meeting'].mode()[0]

# Fill missing values in 'public_meeting' with the mode
# (assign back instead of using inplace=True on a column selection)
df['public_meeting'] = df['public_meeting'].fillna(public_meeting_mode)

# Check the distribution after filling the missing values
print(df['public_meeting'].value_counts(normalize=True))

1.0    0.914899
0.0    0.085101
Name: public_meeting, dtype: float64


### Column 'permit'

In [8]:
df["permit"].value_counts(normalize=True)

1.0    0.68955
0.0    0.31045
Name: permit, dtype: float64

In [9]:
# Since only about 5% of the values are null, let's replace them with the mode

# Calculate the mode of the 'permit' column
permit_mode = df['permit'].mode()[0]

# Fill missing values in 'permit' with the mode
# (assign back instead of using inplace=True on a column selection)
df['permit'] = df['permit'].fillna(permit_mode)

# Check the distribution after filling the missing values
print(df['permit'].value_counts(normalize=True))

1.0    0.705522
0.0    0.294478
Name: permit, dtype: float64
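
The two imputation cells above follow the same steps, so they could be wrapped in a small helper. The following is a minimal sketch under stated assumptions: `fill_with_mode` is a hypothetical name (it is not defined in the previous notebook), and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def fill_with_mode(frame, columns):
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's mode."""
    filled = frame.copy()
    for col in columns:
        # mode() ignores NaN by default; [0] picks the first mode if there are ties
        filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled


# Toy data (illustrative only, not the real dataset)
toy = pd.DataFrame({
    "public_meeting": [1.0, 1.0, np.nan, 0.0],
    "permit": [0.0, np.nan, 0.0, 1.0],
})
toy_filled = fill_with_mode(toy, ["public_meeting", "permit"])

# Confirm that no missing values remain
print(toy_filled.isna().sum())
```

Checking `isna().sum()` after the fill is a more direct confirmation than `value_counts`, which silently skips missing values.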


## 3.3 Class Imbalance checking

In [10]:
# Check the class distribution of the target in the full dataset
print("Class distribution of status_group:")
print(df['status_group'].value_counts(normalize=True))

Class distribution of status_group:
functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64


We group 'functional needs repair' together with 'functional' into a single class, which turns this into a binary classification problem.

In [11]:
# Replace 'functional needs repair' with 'functional'
df['status_group'] = df['status_group'].replace('functional needs repair', 'functional')

# Verify the change by checking the class distribution again
print("Class distribution of status_group after replacement:")
print(df['status_group'].value_counts(normalize=True))

Class distribution of status_group after replacement:
functional        0.615758
non functional    0.384242
Name: status_group, dtype: float64


## 3.4 Define predictor and target variables

In [12]:
y = df['status_group']
X = df.drop('status_group', axis=1)

## 3.5 Do a train test split

In [13]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
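
Because the two classes are not evenly balanced, one option is a stratified split that keeps the class ratio the same in both subsets. This is a sketch on toy data, not the split used in this notebook. The `stratify` argument is part of scikit-learn's `train_test_split`, while the `toy_*` and `*s_*` names are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 60% 'functional', 40% 'non functional'
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["functional"] * 60 + ["non functional"] * 40, name="status_group")

# stratify=toy_y asks scikit-learn to preserve the class ratio in both subsets
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(ys_train.value_counts(normalize=True))
print(ys_test.value_counts(normalize=True))
```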

## 3.6 Target encoding the categorical columns

Since most of the categorical columns have more than 6 distinct values, we apply target encoding to these columns.
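
To make the idea concrete, target encoding replaces each category with a statistic of the target observed for that category. The sketch below uses plain pandas and an unsmoothed per-category mean on toy data. It is a simplification: `category_encoders.TargetEncoder` additionally smooths each value toward the global mean, so its numbers will differ. All names and values here are illustrative.

```python
import pandas as pd

# Toy training data (illustrative only); target: 1 = non functional, 0 = functional
toy_train = pd.DataFrame({
    "basin": ["pangani", "pangani", "lake victoria", "lake victoria", "lake victoria"],
    "target": [1, 0, 1, 1, 0],
})
toy_test = pd.DataFrame({"basin": ["lake victoria", "pangani", "lake nyasa"]})

# Learn the per-category mean of the target from the training rows only
category_means = toy_train.groupby("basin")["target"].mean()
global_mean = toy_train["target"].mean()

# Apply the learned mapping; categories unseen in training fall back to the global mean
toy_train["basin_encoded"] = toy_train["basin"].map(category_means)
toy_test["basin_encoded"] = toy_test["basin"].map(category_means).fillna(global_mean)

print(toy_test)
```

The mapping is learned from the training rows and only applied to the test rows, so no test labels are needed to encode the test set.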

**X_train**

In [16]:
# y_train is a Series (not a DataFrame), so encode it directly:
# 'functional' -> 0, 'non functional' -> 1
if y_train.dtype == 'object':
    y_train = (y_train == 'non functional').astype(int)

# Work on a copy so the assignment below does not modify a slice of X
X_train = X_train.copy()

# Capture categorical columns
categorical_columns = list(X_train.select_dtypes(include=['object', 'category']).columns)

# Initialize the TargetEncoder
encoder = TargetEncoder(cols=categorical_columns)

# Fit the encoder on the training data only
encoder.fit(X_train[categorical_columns], y_train)

# Replace the categorical columns with their encoded values
X_train[categorical_columns] = encoder.transform(X_train[categorical_columns])

# Display the DataFrame to check the results
X_train.head()

**X_test**

In [None]:
# Encode y_test with the same mapping used for y_train:
# 'functional' -> 0, 'non functional' -> 1
if y_test.dtype == 'object':
    y_test = (y_test == 'non functional').astype(int)

# Work on a copy so the assignment below does not modify a slice of X
X_test = X_test.copy()

# Reuse the encoder fitted on X_train: fitting a new encoder on the test set
# would leak test labels into the features
X_test[categorical_columns] = encoder.transform(X_test[categorical_columns])

# Display the DataFrame to check the results
X_test.head()

## 3.7 Dealing with numerical columns

**X_train**

In [None]:
# Capture numerical columns
numerical_columns = X_train.select_dtypes(include=['int64', 'float64']).columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the numerical columns
scaler.fit(X_train[numerical_columns])

X_train[numerical_columns] = scaler.transform(X_train[numerical_columns])

# Display the DataFrame to check the results
X_train.head()

**X_test**

In [None]:
# Reuse the numerical columns and the scaler fitted on X_train, so that the
# test set is scaled with the training mean and standard deviation
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])

# Display the DataFrame to check the results
X_test.head()
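
A common convention is to compute the scaling statistics on the training split only and then apply them unchanged to the test split. The sketch below reproduces that arithmetic with NumPy on made-up values. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Toy values (illustrative only)
train_values = np.array([[0.0], [10.0], [20.0]])
test_values = np.array([[10.0], [30.0]])

# Statistics come from the training split only
train_mean = train_values.mean(axis=0)
train_std = train_values.std(axis=0)  # population std (ddof=0)

# Both splits are standardized with the training statistics
train_scaled = (train_values - train_mean) / train_std
test_scaled = (test_values - train_mean) / train_std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

After this transformation the training column has mean 0 and standard deviation 1. The test column generally does not, which is expected.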

## 3.8 Concatenate train on one side and test on the other

In [None]:
# Concatenate all train
df_train = pd.concat([X_train,y_train], axis=1)

# Concatenate all test
df_test = pd.concat([X_test,y_test], axis=1)

# Flag each row as train (0) or test (1)
df_train['is_test'] = 0
df_test['is_test'] = 1

## 3.9 Concatenate everything in one dataframe

In [None]:
# Stack the test rows below the train rows (axis=0); axis=1 would place them side by side
data_processed = pd.concat([df_train, df_test], axis=0)
data_processed
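
Once the two frames are stacked row-wise, the `is_test` flag can be used to recover each split in a later notebook. The following is a minimal sketch with toy frames standing in for `df_train` and `df_test`. The values and the `recovered_*` names are made up for illustration.

```python
import pandas as pd

# Toy frames (illustrative only) standing in for df_train and df_test
toy_train = pd.DataFrame({"amount_tsh": [0.1, -0.3], "status_group": [0, 1], "is_test": 0})
toy_test = pd.DataFrame({"amount_tsh": [0.7], "status_group": [0], "is_test": 1})

# axis=0 stacks rows; ignore_index=True rebuilds a clean 0..n-1 index
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# The flag lets a later notebook recover each split
recovered_train = combined[combined["is_test"] == 0].drop(columns="is_test")
recovered_test = combined[combined["is_test"] == 1].drop(columns="is_test")

print(combined.shape, recovered_train.shape, recovered_test.shape)
```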

# 4. Export the data

In [None]:
data_processed.to_excel('df_data_processed.xlsx', index=False)