# 1. Overview

In this Python script, we apply to the test dataset all the data transformations that were done in the 01_data_preprocessing script. We then pass the transformed test dataset to the best decision tree classifier created in 02_model_creation. This gives us the predicted values, which tell us whether each pump is functional or non-functional.

# 2. Data Understanding

## 2.1 Data Description

This file uses the test dataset provided by DrivenData, called Test_set_values.

## 2.2 Import Necessary Libraries

In [4]:
import pickle
import re  # needed by clean_text (re.sub)
import pandas as pd


## 2.3 Define global variables

In [3]:
INPUT_PATH_Test_set_values = "../Data/Test_set_values.csv"

## 2.4 Functions

In [None]:
def categorize_funder(funder):
    """
    Categorizes a funder name into specific groups based on keywords.
    
    Args:
    funder (str): A string representing the name of the funder to categorize.
    
    Returns:
    str: A category name representing the type of organization the funder belongs to.
    
    This function takes a funder name, converts it to lowercase, removes leading/trailing spaces, 
    and categorizes it into predefined groups like 'Government', 'Religious Organizations', 'NGO',
    'International Aid', 'Private Companies', or 'Individual/Other' based on keywords found within the name.
    """
    funder = funder.lower().strip()  # convert to lowercase and strip whitespaces to standardize
    if any(x in funder for x in ['government','ministry','gov','minis']): 
        return 'Government'
    elif any(x in funder for x in ['church', 'muslim','mus', 'islamic','islam','catholic', 'rc']):
        return 'Religious Organizations'
    elif any(x in funder for x in ['ngo', 'foundation', 'fund', 'trust', 'society','socie']):
        return 'NGO'
    elif any(x in funder for x in ['international','internatio', 'un', 'world bank']):
        return 'International Aid'
    elif any(x in funder for x in ['ltd', 'company','compa', 'group', 'enterprise']):
        return 'Private Companies'
    else:
        return 'Individual/Other'
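
The keyword matching above is order-dependent: the first branch that matches wins. A minimal sketch of the same rules written as an ordered table may make that easier to see. `FUNDER_RULES` and `categorize_by_rules` are hypothetical names used only for illustration; the keywords and categories are copied from `categorize_funder`.

```python
# Ordered (keywords, category) rules copied from categorize_funder.
# The first rule with a matching keyword wins, so order matters.
FUNDER_RULES = [
    (['government', 'ministry', 'gov', 'minis'], 'Government'),
    (['church', 'muslim', 'mus', 'islamic', 'islam', 'catholic', 'rc'], 'Religious Organizations'),
    (['ngo', 'foundation', 'fund', 'trust', 'society', 'socie'], 'NGO'),
    (['international', 'internatio', 'un', 'world bank'], 'International Aid'),
    (['ltd', 'company', 'compa', 'group', 'enterprise'], 'Private Companies'),
]


def categorize_by_rules(name, rules=FUNDER_RULES, default='Individual/Other'):
    """Hypothetical helper: return the category of the first matching rule."""
    name = name.lower().strip()
    for keywords, category in rules:
        if any(keyword in name for keyword in keywords):
            return category
    return default


examples = ['Government Of Tanzania', 'World Bank Fund', 'Unicef', 'Unknown', 'Bruder']
results = {name: categorize_by_rules(name) for name in examples}
print(results)
```

Two things stand out in this sketch. 'World Bank Fund' lands in 'NGO' because 'fund' is checked before 'world bank'. The filler string 'Unknown' lands in 'International Aid' because it contains the substring 'un'. Whether that is intended is worth checking against the training notebook; either way, the test data has to go through exactly the same rules as the training data.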


In [None]:
def categorize_installer(installer):
    """
    Categorizes an installer name into specific groups based on keywords.

    Args:
    installer (str): A string representing the name of the installer to categorize.

    Returns:
    str: A category name representing the type of entity the installer belongs to.

    This function processes an installer name by converting it to lowercase and removing
    any leading/trailing whitespace. It categorizes the name into predefined groups such as 
    'DWE', 'Government', 'Community', 'NGO', 'Private Company', 'Institutional', or 'Other' 
    based on specific keywords present in the installer's name. This helps in standardizing 
    installer data for better analysis and insight extraction.
    """
    installer = installer.lower().strip()  # convert to lowercase and strip whitespaces to standardize
    if 'dw' in installer:
        return 'DWE'
    elif any(x in installer for x in ['government', 'govt', 'gove']):
        return 'Government'
    elif any(x in installer for x in ['resource']):
        return 'Other'
    elif any(x in installer for x in ['community', 'villagers', 'village','commu']):
        return 'Community'
    elif any(x in installer for x in ['ngo', 'unicef', 'foundat']):
        return 'NGO'
    elif 'company' in installer or 'contractor' in installer:
        return 'Private Company'
    elif any(x in installer for x in ['school','schoo','church', 'rc']):
        return 'Institutional'
    else:
        return 'Other'

In [11]:
def group_scheme_management(value):
    """
    Categorizes scheme management types into broader, more generalized groups.

    Args:
    value (str): A string representing the scheme management type to categorize.

    Returns:
    str: A generalized category name representing the type of scheme management.

    This function takes a specific scheme management type and categorizes it into 
    more generalized groups such as 'Government', 'Community', 'Private Sector', 
    'Water Board', or 'Other'. This categorization aids in simplifying the analysis 
    and understanding of the data by reducing the number of distinct categories, 
    making trends and patterns more discernible.
    """
    if value in ['VWC', 'Water authority', 'Parastatal']:
        return 'Government'
    elif value in ['WUG', 'WUA']:
        return 'Community'
    elif value in ['Company', 'Private operator']:
        return 'Private Sector'
    elif value == 'Water Board':
        return 'Water Board'  # Retain this as a separate category if distinct characteristics are important
    else:
        return 'Other'
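
As a rough illustration, the same grouping can also be written as a dictionary lookup. This is only a sketch: `SCHEME_GROUPS` is a hypothetical name and the sample values are made up for the example. It also shows that missing values end up in 'Other', which is what the function above does when it receives NaN.

```python
import pandas as pd

# Hypothetical lookup table mirroring the branches of group_scheme_management
SCHEME_GROUPS = {
    'VWC': 'Government',
    'Water authority': 'Government',
    'Parastatal': 'Government',
    'WUG': 'Community',
    'WUA': 'Community',
    'Company': 'Private Sector',
    'Private operator': 'Private Sector',
    'Water Board': 'Water Board',
}

# Made-up sample values, including a missing one and an unlisted one
sample = pd.Series(['VWC', 'WUG', 'Company', 'Water Board', None, 'Trust'])

# Values not found in the table (including NaN) become 'Other'
grouped = sample.map(SCHEME_GROUPS).fillna('Other')
print(grouped.tolist())
```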

In [None]:
def clean_text(text):
    """
    Cleans a text string by converting it to lowercase, removing every character that is not a letter
    or whitespace (digits included), and collapsing runs of whitespace into a single space.
    Numeric and NaN inputs are returned unchanged.

    Args:
    text (str, numeric, or NaN): The value to clean.

    Returns:
    str, numeric, or NaN: The cleaned lowercase text containing only letters and single spaces,
                or the original value if the input was numeric or NaN.

    Numeric inputs are assumed to be standardized already. Leading or trailing whitespace is
    collapsed to a single space but not stripped.
    """
    if pd.isna(text):
        return text
    if isinstance(text, (int, float)):  # Check if the input is numeric
        return text
    text = text.lower()  # Convert to lowercase
    text = ''.join(char for char in text if char.isalpha() or char.isspace())  # Remove special characters and numbers
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    return text
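
To make the behaviour concrete, here is a compact copy of `clean_text` with a few made-up inputs. It is a sketch for illustration only, with the two early-return checks merged into one line.

```python
import re

import pandas as pd


def clean_text(text):
    """Compact copy of the function above, for illustration."""
    if pd.isna(text) or isinstance(text, (int, float)):
        return text  # NaN and numeric values pass through unchanged
    text = text.lower()
    # Keep letters and whitespace only; digits and punctuation are dropped
    text = ''.join(char for char in text if char.isalpha() or char.isspace())
    return re.sub(r'\s+', ' ', text)


cleaned_name = clean_text('Kwa  Mzee-Pange 2')
cleaned_number = clean_text(1996)
cleaned_phrase = clean_text('never pay')
print(repr(cleaned_name), cleaned_number, repr(cleaned_phrase))
```

Note that the digit is removed but the space in front of it is kept, so the first result ends with a trailing space. Boolean values such as those in 'public_meeting' also pass through unchanged, because `bool` is a subclass of `int`.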

# 3. Code

## 3.1 Import the dataset

In [5]:
df_predict = pd.read_csv(INPUT_PATH_Test_set_values)
df_predict.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,...,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,...,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,...,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


## 3.2 Apply to df_predict the same data transformations as the ones done in 00_data_understanding

## 3.2.1 Applying transformation functions

**Column 'funder'**

In [None]:
# Handling NaN values with a filler string like 'Unknown'
df_predict['funder'] = df_predict['funder'].fillna('Unknown').astype(str)

# Apply the mapping function to the 'funder' column
df_predict['funder_type'] = df_predict['funder'].apply(categorize_funder)

# Check the categorized data
print(df_predict['funder_type'].value_counts())

**Column 'installer'**

In [None]:
# Handling NaN values with a filler string like 'Unknown'
df_predict['installer'] = df_predict['installer'].fillna('Unknown').astype(str)

# Apply the mapping function to the 'installer' column
df_predict['installer_type'] = df_predict['installer'].apply(categorize_installer)

# Check the categorized data
print(df_predict['installer_type'].value_counts())

**Column 'scheme_management_grouped'**

In [12]:
# Apply the grouping function to the 'scheme_management' column
df_predict['scheme_management_grouped'] = df_predict['scheme_management'].apply(group_scheme_management)

# Check the new value counts to see the grouped data
print(df_predict['scheme_management_grouped'].value_counts(normalize=True))

## 3.2.2 Converting data types

In [14]:
# Converting 'construction_year' to object
df_predict['construction_year'] = df_predict['construction_year'].astype('object')

In [17]:
df_predict.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'scheme_name', 'permit', 'construction_year',
       'extraction_type', 'extraction_type_group', 'extraction_type_class',
       'management', 'management_group', 'payment', 'payment_type',
       'water_quality', 'quality_group', 'quantity', 'quantity_group',
       'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'scheme_management_grouped'],
      dtype='object')

## 3.2.3 Drop unnecessary columns

In [None]:
drop_column_list = ['scheme_name', 'num_private', 'wpt_name', 'subvillage', 'lga', 'ward', 'recorded_by','extraction_type_group',
                    'extraction_type', 'management', 'payment', 'water_quality', 'quantity', 'source', 'source_class',
                    'waterpoint_type_group', 'date_recorded','funder','installer','scheme_management',
                    'longitude','latitude','region_code','district_code','construction_year']

# Drop the listed columns from df_predict
df_predict = df_predict.drop(columns=drop_column_list)

## 3.2.4 Cleaning the data set

In [None]:
# Apply the cleaning function to each object-type column in the DataFrame
for col in df_predict.select_dtypes(include='object').columns:
    df_predict[col] = df_predict[col].apply(clean_text)

## 3.2.5 Fill NaNs with the modes calculated in 01_data_preprocessing

In [10]:
(df_predict.isna().sum()/len(df_predict))*100

id                        0.000000
amount_tsh                0.000000
date_recorded             0.000000
funder                    5.851852
gps_height                0.000000
installer                 5.905724
longitude                 0.000000
latitude                  0.000000
wpt_name                  0.000000
num_private               0.000000
basin                     0.000000
subvillage                0.666667
region                    0.000000
region_code               0.000000
district_code             0.000000
lga                       0.000000
ward                      0.000000
population                0.000000
public_meeting            0.000000
recorded_by               0.000000
scheme_management         6.525253
scheme_name              47.757576
permit                    0.000000
construction_year         0.000000
extraction_type           0.000000
extraction_type_group     0.000000
extraction_type_class     0.000000
management                0.000000
management_group    

From the 01_data_preprocessing script we know that the mode of public_meeting is 1.0 and the mode of permit is 1.0, so we fill the NaNs of both columns directly with the value 1.0.

**Fillna in column 'public_meeting'**

In [7]:
df_predict['public_meeting'] = df_predict['public_meeting'].fillna(1.0)

**Fillna in column 'permit'**

In [8]:
df_predict['permit'] = df_predict['permit'].fillna(1.0)
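
The two fills can also be done in a single call. A minimal sketch, assuming the training modes are kept in a dictionary: `TRAIN_MODES` is a hypothetical name, the 1.0 values are the modes stated above, and the small DataFrame is made up for the example.

```python
import pandas as pd

# Modes taken from the training data (values as stated in this notebook)
TRAIN_MODES = {'public_meeting': 1.0, 'permit': 1.0}

# Made-up rows standing in for df_predict
sample = pd.DataFrame({
    'public_meeting': [1.0, None, 0.0],
    'permit': [None, 1.0, 1.0],
    'population': [10, 20, 30],
})

# A dict passed to fillna fills each listed column with its own value
filled = sample.fillna(value=TRAIN_MODES)
print(filled.isna().sum())
```

Keeping the modes in one place makes it harder for the test pipeline to drift from the values computed on the training data.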

Let's check that there are no null values left

In [None]:
(df_predict.isna().sum()/len(df_predict))*100

## 3.3 Encoding the categorical columns

Let's apply one-hot encoding to the categorical columns that have 6 or fewer categories

In [None]:
# Capture the categorical columns of df_predict for encoding
categorical_columns = df_predict.select_dtypes(include=['object', 'category']).columns

# Encoding the categorical columns in df_predict
for col in categorical_columns:
    if df_predict[col].nunique() <= 6:
        # One-hot encode (via pd.get_dummies) columns with 6 or fewer unique values
        df_predict = pd.get_dummies(df_predict, columns=[col], drop_first=True)
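
One caveat with calling `pd.get_dummies` separately on the test set: the resulting columns depend on which categories happen to appear in the test data, so they may not match the columns the model was trained on. A minimal sketch of one way to align them, assuming the list of training columns is available. `train_columns` and the sample rows here are hypothetical.

```python
import pandas as pd

# Hypothetical list of feature columns produced during training
train_columns = ['amount_tsh', 'quantity_group_enough', 'quantity_group_insufficient']

# Made-up test rows in which only one category appears
test = pd.DataFrame({
    'amount_tsh': [0.0, 500.0],
    'quantity_group': ['enough', 'enough'],
})

test_encoded = pd.get_dummies(test, columns=['quantity_group'])

# Add any training column that is missing (filled with 0), drop any extra
# column, and put the columns in the training order
aligned = test_encoded.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

The alignment is done against the full set of training dummy columns; if the training step used `drop_first=True`, the dropped category is simply absent from `train_columns` and is removed here as well.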

For the categorical columns that have more than 6 categories, let's load the target encoders that were fitted and saved in the 01_data_preprocessing script

In [2]:
# Column 'basin'
basin_pickle = pickle.load(open('basin_target_encoder.pickle', 'rb'))
df_predict['basin'] = basin_pickle.transform(df_predict['basin'])

# Column 'extraction_type_class'
extraction_type_class_pickle = pickle.load(open('extraction_type_class_target_encoder.pickle', 'rb'))
df_predict['extraction_type_class'] = extraction_type_class_pickle.transform(df_predict['extraction_type_class'])

# Column 'installer_type'
installer_type_pickle = pickle.load(open('installer_type_target_encoder.pickle', 'rb'))
df_predict['installer_type'] = installer_type_pickle.transform(df_predict['installer_type'])

# Column 'payment_type'
payment_type_pickle = pickle.load(open('payment_type_target_encoder.pickle', 'rb'))
df_predict['payment_type'] = payment_type_pickle.transform(df_predict['payment_type'])

# Column 'region' (its encoder is saved as region_target_encoder.pickle)
region_target_pickle = pickle.load(open('region_target_encoder.pickle', 'rb'))
df_predict['region'] = region_target_pickle.transform(df_predict['region'])

# Column 'source_type'
source_type_pickle = pickle.load(open('source_type_target_encoder.pickle', 'rb'))
df_predict['source_type'] = source_type_pickle.transform(df_predict['source_type'])

# Column 'waterpoint_type'
waterpoint_type_pickle = pickle.load(open('waterpoint_type_target_encoder.pickle', 'rb'))
df_predict['waterpoint_type'] = waterpoint_type_pickle.transform(df_predict['waterpoint_type'])
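
The `pickle.load(open(...))` pattern leaves the file handle open until it is garbage collected. A small sketch of a helper that closes the file explicitly; `load_pickle` is a hypothetical name, and the dictionary written to the temporary file only stands in for a fitted encoder.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Hypothetical helper: load a pickled object and close the file afterwards."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Round trip with a stand-in object (a plain dict, not a real encoder)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'basin_target_encoder.pickle'
    with open(demo_path, 'wb') as handle:
        pickle.dump({'pangani': 0.6}, handle)
    restored = load_pickle(demo_path)

print(restored)
```

With such a helper, each loading line above would read `basin_pickle = load_pickle('basin_target_encoder.pickle')`, and so on for the other columns.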

## 3.4 Dealing with numerical columns

Let's load the scaler that was fitted on the numerical columns and saved in the 01_data_preprocessing script

In [None]:
# Capture numerical columns
numerical_columns = df_predict.select_dtypes(include=['int64', 'float64']).columns

# Numerical Columns
numerical_columns_pickle = pickle.load(open('numerical_columns_scaler.pickle', 'rb'))
df_predict[numerical_columns] = numerical_columns_pickle.transform(df_predict[numerical_columns])

## 3.5 Apply the Decision Tree Classifier created in 02_model_creation

In [None]:
# Decision Tree Classifier
best_tree_pickle = pickle.load(open('best_tree.pickle', 'rb'))
# predict returns one class label per row; predict_proba returns one column per class
df_predict['status_group'] = best_tree_pickle.predict(df_predict)
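
To show the difference between the two prediction methods of a scikit-learn decision tree, here is a small sketch on made-up data. The toy features and labels are assumptions for illustration only, not the model from 02_model_creation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up, perfectly separable toy data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(['non functional', 'non functional', 'functional', 'functional'])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# One label per row: this is what fits into a single DataFrame column
labels = tree.predict(X)

# One probability per class and row; columns follow the order of tree.classes_
probabilities = tree.predict_proba(X)

print(labels.shape, probabilities.shape, tree.classes_)
```

Since `predict_proba` returns a two-dimensional array, it cannot be assigned to a single column such as 'status_group'. A single class's probability would have to be selected by its position in `classes_`.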