# **Project Dataset Selection and Justification**

# **Tanzanian Water Wells**

For this project, I selected the Tanzanian Water Wells dataset from the DrivenData “Pump It Up: Data Mining the Water Table” competition. This dataset provides information about over 59,000 water wells across Tanzania, including details on their construction, location, management, and current operational status. The target variable, status_group, indicates whether a well is functional, functional but needs repair, or non-functional, making this a multiclass classification problem fully aligned with the requirements of the Phase 3 project.

The decision to use this dataset was guided by both technical and business relevance considerations:

Relevance to Real-World Problems:
Access to clean water remains a critical issue in developing regions, and identifying which wells are likely to fail is a meaningful application of data science for public good. This problem has direct implications for NGOs, government agencies, and infrastructure planners responsible for maintaining water systems.

Strong Alignment with Classification Objectives:
The project’s target variable is categorical, enabling the use of supervised classification algorithms such as logistic regression, decision trees, and ensemble models. The dataset’s structure naturally supports an iterative modeling process involving baseline and tuned models, as required by the project guidelines.

Technical Suitability and Manageability:
The dataset is sufficiently complex, with about 40 features spanning categorical, numeric, and geospatial data, yet it is well-documented and manageable to clean. It allows for exploration of essential preprocessing techniques such as handling missing values, encoding categorical features, scaling, and avoiding data leakage, without overwhelming computational or cleaning requirements.

Industry and Career Relevance:
The dataset’s focus on infrastructure reliability and maintenance parallels predictive analytics challenges found in the energy and engineering sectors, including geothermal plant operations and equipment reliability monitoring. As such, this project demonstrates the transferability of classification methods to industrial asset management and sustainability use cases.

In summary, the Tanzanian Water Wells dataset provides a meaningful, technically rich, and career-relevant foundation for applying machine learning classification techniques to a real-world problem. It offers both interpretability and complexity, making it an ideal choice for this phase of the data science program.

# **Business Understanding**

Access to clean and functional water points remains a major challenge in Tanzania, especially in rural areas where communities rely heavily on boreholes and hand pumps. Many wells fail due to poor maintenance, aging infrastructure, or environmental conditions, making it difficult for authorities to prioritize which sites need attention first.

This project aims to support the **Tanzanian Ministry of Water and Irrigation** and **partner organizations** such as **WaterAid, UNICEF, and World Vision** by developing a predictive classification model that identifies the operational status of water wells. The model will help stakeholders allocate maintenance resources more efficiently, reduce downtime, and improve access to clean water.

Our target stakeholders include **government planners,** **NGOs**, and **field maintenance teams** who can use model insights to schedule preventive repairs and optimize field operations. The classification model will predict whether a well is functional, needs repair, or non-functional based on features such as construction year, installer, pump type, and location.

The project scope focuses on building and evaluating this predictive model and communicating actionable insights through analysis and visualization. Implementation of automated systems, real-time monitoring, or field validation lies outside the current scope.

By helping decision-makers transition from reactive to preventive maintenance, this project aligns with Sustainable Development Goal 6 (Clean Water and Sanitation) and supports long-term water access sustainability across Tanzania.



# **Data Understanding**


The dataset used in this project is the Tanzanian Water Wells dataset, sourced from the Taarifa and Tanzania Ministry of Water repositories, and made publicly available through DrivenData. It contains information on over 59,000 water points across Tanzania, including details about their physical characteristics, installation, management, and operational status.

The primary target variable is status_group, which categorizes each water point as functional, functional but needs repair, or non-functional. The objective is to build a classification model capable of predicting this status based on various predictors.

The dataset includes multiple features such as location data (region, district, latitude, longitude), construction attributes (installer, construction year, pump type, extraction type), and management details (management type, water source, payment options). These predictors include a mix of categorical and numerical data, offering a rich basis for feature engineering and model development.

With over 59,000 observations and 40 features, the dataset provides ample data for training and testing robust classification models. However, some preprocessing is necessary to handle missing values, inconsistent labels, and redundant or highly correlated features. Exploratory analysis will help assess feature distributions, class balance, and potential data quality issues.
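
Class balance is worth checking early because it shapes both the choice of metric and the way the data is split. The sketch below shows one way to summarize it; the label series is a small made-up example standing in for the real status_group column, and the helper name class_balance is hypothetical.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class, most frequent first."""
    counts = labels.value_counts()
    shares = labels.value_counts(normalize=True).round(3)
    return pd.DataFrame({"count": counts, "share": shares})


# Hypothetical labels for illustration only; not the real class distribution.
toy_labels = pd.Series(
    ["functional"] * 6 + ["non functional"] * 3 + ["functional needs repair"]
)

balance = class_balance(toy_labels)
print(balance)
```

On the real data the same call would be class_balance(df['status_group']) once the labels have been merged in.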

The data was originally collected through national field surveys conducted by the Ministry of Water and NGO partners. While generally reliable, certain inconsistencies or missing entries may exist due to human error during field reporting. These issues will be addressed during the data cleaning phase to ensure accuracy and consistency for modeling.

Overall, this dataset is suitable for building a classification model aimed at predicting the functionality of water wells, supporting data-driven decision-making in water resource management.

# **Data Preparation**

**Step 1:** Importing relevant libraries

In [1]:
# Import the libraries used for data preparation and modeling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline


**Step 2:** Loading CSV files

In [2]:
# Load the data in readiness for data preparation

#load training data
train_values = pd.read_csv('Training_set_values.csv')
train_labels = pd.read_csv('Training_set_lables.csv')

#load test data
sub_format = pd.read_csv('SubmissionFormat.csv')

test_values = pd.read_csv('Test_set_values.csv')

**Step 3:** Merge Training Data and Labels

In [3]:
#checking for common column for merging
train_values.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [4]:
train_labels.head()


Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [5]:
# Merge on the 'id' column (common key in both files)
df = pd.merge(train_values, train_labels, on='id')
# Preview the data
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
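
A merge like this silently drops or duplicates rows if the key is not unique or does not match on both sides. As a hedged sketch, the check below uses two tiny made-up tables (not the real files) together with the validate and indicator options of pd.merge to confirm that every id pairs up exactly once.

```python
import pandas as pd

# Hypothetical miniature versions of the values and labels tables.
values = pd.DataFrame({"id": [1, 2, 3], "gps_height": [1390, 1399, 686]})
labels = pd.DataFrame(
    {"id": [1, 2, 3], "status_group": ["functional", "functional", "non functional"]}
)

# validate="one_to_one" raises pandas.errors.MergeError if an id repeats on either side.
# how="outer" with indicator=True adds a "_merge" column showing where each row came from.
merged = pd.merge(
    values, labels, on="id", how="outer", validate="one_to_one", indicator=True
)

# Rows present in only one of the two tables would show up here.
unmatched = merged[merged["_merge"] != "both"]
print("rows without a match:", len(unmatched))
```

If the unmatched count is zero, the default inner merge used in this notebook keeps every row.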


**Step 4:** Inspect Basic Info

In [6]:
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55763 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59398 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
count,59400.0,59400.0,59400,55763,59400.0,55745,59400.0,59400.0,59398,59400.0,...,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400
unique,,,356,1896,,2145,,,37399,,...,8,6,5,5,10,7,3,7,6,3
top,,,2011-03-15,Government Of Tanzania,,DWE,,,none,,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
freq,,,572,9084,,17402,,,3563,,...,50818,50818,33186,33186,17021,17021,45794,28522,34625,32259
mean,37115.131768,317.650385,,,668.297239,,34.077427,-5.706033,,0.474141,...,,,,,,,,,,
std,21453.128371,2997.574558,,,693.11635,,6.567432,2.946019,,12.23623,...,,,,,,,,,,
min,0.0,0.0,,,-90.0,,0.0,-11.64944,,0.0,...,,,,,,,,,,
25%,18519.75,0.0,,,0.0,,33.090347,-8.540621,,0.0,...,,,,,,,,,,
50%,37061.5,0.0,,,369.0,,34.908743,-5.021597,,0.0,...,,,,,,,,,,
75%,55656.5,20.0,,,1319.25,,37.178387,-3.326156,,0.0,...,,,,,,,,,,


Before building our classification model, we prepared the Tanzanian Water Wells dataset to ensure data quality and consistency. The dataset contained 59,400 records and 41 columns, including both numeric and categorical features such as location coordinates, water source type, management, and well status. Our target variable was status_group, indicating whether a waterpoint is functional, functional needs repair, or non-functional.

We identified several columns with missing values, including funder, installer, scheme_management, scheme_name, public_meeting, and permit. Since some of these variables contain a significant proportion of missing entries, we plan to handle them by either imputing the most frequent values, filling with “unknown,” or dropping variables that add little value.

Next, we confirmed that some columns such as date_recorded are stored as object types and need conversion to datetime format. Similarly, categorical variables like region, source, and payment_type will later be encoded using one-hot encoding to make them suitable for machine learning models.

We also noted potential multicollinearity among related variables (e.g., extraction_type, extraction_type_group, and extraction_type_class), which will be addressed through correlation checks and feature selection. Numeric variables like gps_height, population, and amount_tsh will be normalized or scaled to reduce model bias from differing value ranges.

Overall, the dataset provides a rich foundation for building a predictive classification model, but careful preprocessing will be essential to ensure data quality and model performance.

# **Data Cleaning**

**Step 1:** **Handle Missing Values**

We’ll identify missing data, then decide how to handle them.

In [7]:
# Check missing values
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

scheme_name          28810
scheme_management     3878
installer             3655
funder                3637
public_meeting        3334
permit                3056
subvillage             371
wpt_name                 2
dtype: int64


The number of missing values is substantial, so they need to be handled carefully.
The strategy for each column is summarized below.


| Column | Missing count | Type | Strategy | Reason |
|---|---|---|---|---|
| scheme_name | 28,810 | Categorical | Fill with "unknown" | Too many missing, no reliable pattern. |
| scheme_management | 3,878 | Categorical | Fill with mode (most frequent) | Limited missing, same entity type (management) likely repeats. |
| installer | 3,655 | Categorical | Fill with "unknown" | Missing likely due to poor field records, not informative to drop. |
| funder | 3,637 | Categorical | Fill with "unknown" | High variety, missing not systematic. |
| public_meeting | 3,334 | Boolean | Fill with mode (most frequent) | Missing probably means “information not recorded.” |
| permit | 3,056 | Boolean | Fill with mode (most frequent) | Missing means no permit info, not necessarily “no permit.” |
| subvillage | 371 | Categorical | Fill with "unknown" | Very few missing. |
| wpt_name | 2 | Categorical | Fill with "unknown" | Only two missing; trivial. |

In [8]:
# Fill high-missing categorical values with 'unknown'
cols_unknown = ['scheme_name', 'installer', 'funder', 'subvillage', 'wpt_name']
for col in cols_unknown:
    df[col] = df[col].fillna('unknown')

# Fill the remaining categorical and boolean fields with the mode (most common value)
cols_mode = ['scheme_management', 'public_meeting', 'permit']
for col in cols_mode:
    df[col] = df[col].fillna(df[col].mode()[0])

# Double-check missing values are gone
print(df.isnull().sum().sort_values(ascending=False).head(10))

id               0
amount_tsh       0
date_recorded    0
funder           0
gps_height       0
installer        0
longitude        0
latitude         0
wpt_name         0
num_private      0
dtype: int64
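
The strategy above notes that a missing permit means "no information" rather than "no permit", and filling with the mode erases that distinction. One optional way to keep it, sketched here on a made-up column, is to record a flag before filling. The column name permit_missing is hypothetical and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical permit values with two gaps.
toy = pd.DataFrame({"permit": [True, False, np.nan, True, np.nan]})

# Record which rows were missing before the gap is filled.
toy["permit_missing"] = toy["permit"].isna().astype(int)

# Fill with the most frequent value, mirroring the mode-based strategy above.
mode_value = toy["permit"].mode()[0]
toy["permit"] = toy["permit"].fillna(mode_value)

print(toy)
```

A model can then use the flag to learn whether missingness itself is related to well status.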




**Step 2: Convert Data Types & Handle Inconsistencies**

This step covers three tasks:

1.   Convert dates properly
2.   Fix boolean-like text values
3.   Ensure numeric columns are truly numeric





In [9]:
# Convert 'date_recorded' to datetime
df['date_recorded'] = pd.to_datetime(df['date_recorded'])

# Convert boolean-like columns (public_meeting, permit) to numeric
df['public_meeting'] = df['public_meeting'].map({True: 1, False: 0, 'True': 1, 'False': 0}).fillna(0)
df['permit'] = df['permit'].map({True: 1, False: 0, 'True': 1, 'False': 0}).fillna(0)

# Confirm data types
print(df.dtypes.head(15))

id                        int64
amount_tsh              float64
date_recorded    datetime64[ns]
funder                   object
gps_height                int64
installer                object
longitude               float64
latitude                float64
wpt_name                 object
num_private               int64
basin                    object
subvillage               object
region                   object
region_code               int64
district_code             int64
dtype: object


**Step 3: Drop Redundant / Low-Value Features**

Some columns in this dataset are:

Duplicates or near-duplicates (e.g. extraction_type, extraction_type_group, and extraction_type_class describe the same thing at different detail levels).

Non-predictive identifiers (e.g. id, wpt_name are unique to each record and not useful for modeling).

Columns that generalize others (e.g. management vs management_group).

We'll keep the most detailed version of each (for example extraction_type, management, and source) and drop the coarser group columns.

In [10]:
# Drop identifiers and duplicates
df = df.drop([
    'id', 'wpt_name',  # identifiers
    'extraction_type_group', 'extraction_type_class',  # redundant
    'management_group', 'quantity_group',  # redundant
    'source_type', 'source_class', 'waterpoint_type_group',  # redundant
    'recorded_by'  # same for all rows: "GeoData Consultants Ltd"
], axis=1)

print("Remaining columns:", len(df.columns))
df.head()

Remaining columns: 31


Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,num_private,basin,subvillage,...,extraction_type,management,payment,payment_type,water_quality,quality_group,quantity,source,waterpoint_type,status_group
0,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,0,Lake Nyasa,Mnyusi B,...,gravity,vwc,pay annually,annually,soft,good,enough,spring,communal standpipe,functional
1,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,0,Lake Victoria,Nyamara,...,gravity,wug,never pay,never pay,soft,good,insufficient,rainwater harvesting,communal standpipe,functional
2,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,0,Pangani,Majengo,...,gravity,vwc,pay per bucket,per bucket,soft,good,enough,dam,communal standpipe multiple,functional
3,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,0,Ruvuma / Southern Coast,Mahakamani,...,submersible,vwc,never pay,never pay,soft,good,dry,machine dbh,communal standpipe multiple,non functional
4,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,0,Lake Victoria,Kyanyamisa,...,gravity,other,never pay,never pay,soft,good,seasonal,rainwater harvesting,communal standpipe,functional
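
Before dropping a column as redundant, it can help to confirm that it really is a coarser grouping of the column being kept. The sketch below is one way to test that, assuming a hypothetical helper named is_coarser_version and a few made-up rows; the real columns have many more levels.

```python
import pandas as pd


def is_coarser_version(df: pd.DataFrame, fine: str, coarse: str) -> bool:
    """True if every value of `fine` maps to exactly one value of `coarse`."""
    return bool(df.groupby(fine)[coarse].nunique().max() == 1)


# Hypothetical rows for illustration only.
toy = pd.DataFrame(
    {
        "extraction_type": ["gravity", "nira/tanira", "swn 80", "gravity"],
        "extraction_type_class": ["gravity", "handpump", "handpump", "gravity"],
    }
)

# The class column is fully determined by the detailed column...
print(is_coarser_version(toy, "extraction_type", "extraction_type_class"))
# ...but not the other way round, since "handpump" covers two detailed types.
print(is_coarser_version(toy, "extraction_type_class", "extraction_type"))
```

When the first check holds, dropping the coarse column loses no information that the detailed column does not already carry.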


The data now has no missing values, and the identifier and duplicate columns listed above have been dropped.

We can now move to the final part of data preparation:

1.   Scaling numeric features
2.   Encoding categorical variables
3.   Encoding the target (status_group)




**Step 1: Scale Numeric Columns**

We’ll standardize only numeric columns (since categorical ones will be encoded separately).

In [11]:

# Identify numeric columns
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
print("Numeric columns:", num_cols)

# Scale numeric features
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

Numeric columns: Index(['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private',
       'region_code', 'district_code', 'population', 'public_meeting',
       'permit', 'construction_year'],
      dtype='object')
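
Fitting the scaler on the full dataset, as in the cell above, lets the test rows influence the mean and standard deviation used for scaling, which is a mild form of data leakage. A minimal sketch of the alternative, on synthetic numbers rather than the wells data, is to split first and fit the scaler on the training rows only. The pipelines in the Modeling section achieve the same thing by fitting the scaler inside the pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=500.0, scale=100.0, size=(200, 2))  # synthetic numeric features
y = rng.integers(0, 2, size=200)                       # synthetic binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse the training mean and std

print("train column means:", X_train_scaled.mean(axis=0).round(6))
print("test column means:", X_test_scaled.mean(axis=0).round(3))
```

The training columns come out centered at zero, while the test columns are close to zero but not exactly, which is the expected behavior when the test set plays no part in fitting.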


**Step 2: Encode the Target Variable**

Our target (status_group) has three classes:

1. Functional

2. Functional needs repair

3. Non functional

We’ll encode them numerically for classification.

In [12]:

label_encoder = LabelEncoder()
df['status_group'] = label_encoder.fit_transform(df['status_group'])

# Mapping reference
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label mapping:", label_mapping)

Label mapping: {'functional': np.int64(0), 'functional needs repair': np.int64(1), 'non functional': np.int64(2)}


**Step 3: One-Hot Encode Categorical Features**

Now we convert all categorical predictors to numeric.

In [13]:
# Identify categorical columns (excluding target)
cat_cols = df.select_dtypes(include=['object']).columns

# One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)

print("Encoded dataset shape:", df_encoded.shape)
df_encoded.head()

Encoded dataset shape: (59400, 28359)


Unnamed: 0,amount_tsh,date_recorded,gps_height,longitude,latitude,num_private,region_code,district_code,population,public_meeting,...,source_river,source_shallow well,source_spring,source_unknown,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_dam,waterpoint_type_hand pump,waterpoint_type_improved spring,waterpoint_type_other
0,1.895665,2011-03-14,1.041252,0.131052,-1.408791,-0.038749,-0.244325,-0.06537,-0.150399,0.304987,...,False,False,True,False,True,False,False,False,False,False
1,-0.10597,2013-03-06,1.054237,0.09461,1.207934,-0.038749,0.267409,-0.376781,0.21229,0.304987,...,False,False,False,False,True,False,False,False,False,False
2,-0.09763,2013-02-25,0.025541,0.515158,0.639751,-0.038749,0.324269,-0.169174,0.14866,0.304987,...,False,False,False,False,False,True,False,False,False,False
3,-0.10597,2013-01-28,-0.584751,0.671308,-1.84972,-0.038749,4.247564,5.955245,-0.25857,0.304987,...,False,False,False,False,False,True,False,False,False,False
4,-0.10597,2011-07-13,-0.9642,-0.448669,1.317271,-0.038749,0.153691,-0.480585,-0.381587,0.304987,...,False,False,False,False,True,False,False,False,False,False
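
The encoded shape above is dominated by columns with thousands of distinct values. One common way to keep such a column without exploding the feature count is to keep only its most frequent values and group the rest. The sketch below illustrates this on a made-up series; the helper name keep_top_categories and the "other" label are assumptions, not part of the notebook.

```python
import pandas as pd


def keep_top_categories(series: pd.Series, top_n: int, other: str = "other") -> pd.Series:
    """Keep the top_n most frequent values and replace the rest with `other`."""
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other)


# Hypothetical installer values for illustration only.
toy = pd.Series(["DWE", "DWE", "DWE", "Government", "Government", "RWE", "Commu"])

reduced = keep_top_categories(toy, top_n=2)
dummies = pd.get_dummies(reduced, drop_first=True)

print(reduced.value_counts())
print("one-hot columns:", list(dummies.columns))
```

In a real workflow the list of top categories should be chosen from the training rows only, so the test set does not influence the encoding.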


**Step 4: Split Data for Modeling**

Finally, we’ll split the data into training and testing sets.

In [14]:


X = df_encoded.drop('status_group', axis=1)
y = df_encoded['status_group']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training shape:", X_train.shape)
print("Testing shape:", X_test.shape)

Training shape: (47520, 28358)
Testing shape: (11880, 28358)


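The split gives 47,520 training rows and 11,880 test rows, stratified by class. With 28,358 columns, however, the encoded matrix is very wide: most of those columns come from one-hot encoding high-cardinality features such as funder and installer (1,896 and 2,145 unique values), which makes it costly to model directly.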

# **Modeling**

**Step 1: Baseline Logistic Regression**

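(Note: the one-hot encoded dataset was too large for my computer to process, so this step works on a random sample of 10,000 rows, which is enough to show the main trends. It also uses OrdinalEncoder, which keeps one column per feature instead of the thousands produced by one-hot encoding, and drops the high-cardinality text columns that account for most of the memory use. The trade-off is that the integer codes imply an ordering the categories do not have, which a linear model will treat as meaningful.)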

In [13]:
#importing additional libraries

from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

#  Reduce dataset size for performance
df_sample = df.sample(n=10000, random_state=42).copy()

# Drop high-cardinality or irrelevant columns
cols_to_drop = ['funder', 'installer', 'wpt_name', 'subvillage', 'scheme_name']
df_sample.drop(columns=cols_to_drop, inplace=True, errors='ignore')

#  Define features and target
X = df_sample.drop(columns=['status_group'], errors='ignore')
y = df_sample['status_group']

# --- Extract datetime features ---
X['year_recorded'] = pd.to_datetime(X['date_recorded']).dt.year
X['month_recorded'] = pd.to_datetime(X['date_recorded']).dt.month
X.drop(columns=['date_recorded'], inplace=True, errors='ignore')

# --- Split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

#  Separate numeric and categorical
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

# Build pipeline with OrdinalEncoder (lighter than OneHot)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_features)
    ]
)

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=300, random_state=42))
])

# --- Fit and evaluate ---
model_pipeline.fit(X_train, y_train)
y_pred = model_pipeline.predict(X_test)

print(" Baseline Logistic Regression Results")
print("Training Accuracy:", accuracy_score(y_train, model_pipeline.predict(X_train)))
print("Testing Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

 Baseline Logistic Regression Results
Training Accuracy: 0.614625
Testing Accuracy: 0.611

Classification Report:
                         precision    recall  f1-score   support

             functional       0.61      0.81      0.70      1083
functional needs repair       0.00      0.00      0.00       144
         non functional       0.61      0.45      0.52       773

               accuracy                           0.61      2000
              macro avg       0.41      0.42      0.40      2000
           weighted avg       0.57      0.61      0.58      2000



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


# **Baseline Model Evaluation (Logistic Regression)**

The baseline logistic regression model achieved a training accuracy of **61.5%** and a testing accuracy of **61.1%**. The near-identical scores indicate that the model is not overfitting, though its overall accuracy is modest.

However, the model exhibits several weaknesses. **It never predicts the minority class “functional needs repair”** (precision and recall of 0.00), which is a common outcome with imbalanced target distributions. The model also raised a convergence warning, meaning the solver reached the maximum number of iterations (300) before the parameters had fully converged. **The F1-scores** show that **the model performs best on the functional class (0.70),** **moderately on non functional (0.52), and fails on functional needs repair (0.00).**
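
To put the 61.1% in context, it helps to compare it with what a model would score by always predicting the largest class. The short calculation below uses only the support and recall figures printed in the classification report above; nothing else is assumed.

```python
# Test-set support per class, copied from the classification report above.
support = {"functional": 1083, "functional needs repair": 144, "non functional": 773}
# Per-class recall of the baseline model, from the same report.
recall = {"functional": 0.81, "functional needs repair": 0.00, "non functional": 0.45}

total = sum(support.values())
majority_accuracy = max(support.values()) / total  # always predict "functional"
macro_recall = sum(recall.values()) / len(recall)  # unweighted mean over the classes

print(f"majority-class accuracy: {majority_accuracy:.4f}")
print(f"macro recall of baseline: {macro_recall:.2f}")
```

Always predicting "functional" would already be right for 1,083 of the 2,000 test rows, so the baseline's accuracy sits roughly 7 percentage points above that trivial strategy, and its macro recall matches the 0.42 shown in the report.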


**Step 2: Improve the Baseline (Hyperparameter Tuning)**

To improve on the baseline, we will:

Increase max_iter to give the solver more iterations to converge.

Tune the regularization strength (C) and the class weighting (class_weight) using GridSearchCV, keeping the L2 penalty.


In [14]:
from sklearn.model_selection import GridSearchCV

# Define pipeline again (reuse preprocessor)
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', log_reg)
])

#  Define parameter grid for tuning
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__penalty': ['l2'],
    'classifier__class_weight': [None, 'balanced']
}

# Grid search
grid_search = GridSearchCV(pipe, param_grid, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Evaluate best model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)
print("\nTuned Model Results:")
print("Training Accuracy:", accuracy_score(y_train, best_model.predict(X_train)))
print("Testing Accuracy:", accuracy_score(y_test, y_pred_best))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_best))

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best Parameters: {'classifier__C': 1, 'classifier__class_weight': None, 'classifier__penalty': 'l2'}
Best CV Score: 0.6190010194712158

Tuned Model Results:
Training Accuracy: 0.624375
Testing Accuracy: 0.6185

Classification Report:
                         precision    recall  f1-score   support

             functional       0.62      0.81      0.70      1083
functional needs repair       0.00      0.00      0.00       144
         non functional       0.62      0.47      0.53       773

               accuracy                           0.62      2000
              macro avg       0.41      0.43      0.41      2000
           weighted avg       0.57      0.62      0.59      2000



STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


# **Tuned Logistic Regression Model Evaluation**

After performing hyperparameter tuning using GridSearchCV, the best parameters were found to be:
**C = 1,** **penalty = 'l2'**, and **class_weight = None.** These match scikit-learn's default settings for LogisticRegression, so the tuned model differs from the baseline only in its higher max_iter. The model achieved a cross-validation score of 0.619, a training accuracy of 62.4%, and a testing accuracy of 61.85%. This is a marginal improvement of under one percentage point over the baseline model (61.1% testing accuracy).

Despite this improvement, the model still fails to identify the **minority class,** “**functional needs repair**”, which again receives zero precision and recall. Because the grid search optimized accuracy, the class_weight='balanced' candidates, which typically trade some overall accuracy for minority-class recall, were not selected. **Functional wells remain the best-predicted class**, while **non-functional wells show moderate performance.**

Overall, hyperparameter tuning produced only a slight gain, the solver still reported a convergence warning at max_iter=1000, and **class imbalance remains a limiting factor**. Further enhancement could involve exploring tree-based models (e.g., Random Forests or Gradient Boosting), class rebalancing techniques such as SMOTE, or selecting models on a class-sensitive metric such as macro F1.
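
Since the grid search above ranks candidates by accuracy, it has little incentive to pick settings that help the rare class. As a hedged sketch, the example below runs a similar search with scoring='f1_macro' on synthetic three-class data with one rare class. The data, the grid values, and the step name "scaler" are assumptions for illustration, and the result says nothing about what the search would select on the wells data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data with one rare class, standing in for the wells sample.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.55, 0.08, 0.37], random_state=42,
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__class_weight": [None, "balanced"],
}

# scoring="f1_macro" weights every class equally, so ignoring the rare class is penalized.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best macro F1:", round(search.best_score_, 3))
```

The same change in the notebook would only require replacing scoring='accuracy' in the GridSearchCV call.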

**Step 3: Build a More Complex Model (Random Forest Classifier)**

In [16]:
#importing additional libraries

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, classification_report

# Identify numeric and categorical columns
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)
    ]
)

# Random Forest model pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1))
])

# Train model
rf_pipeline.fit(X_train, y_train)

# Predictions
y_pred_train = rf_pipeline.predict(X_train)
y_pred_test = rf_pipeline.predict(X_test)

# Evaluation
print("Training Accuracy:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_pred_test))

Training Accuracy: 0.99875
Testing Accuracy: 0.7605

Classification Report (Test Set):
                         precision    recall  f1-score   support

             functional       0.77      0.86      0.81      1083
functional needs repair       0.45      0.27      0.34       144
         non functional       0.78      0.71      0.74       773

               accuracy                           0.76      2000
              macro avg       0.67      0.61      0.63      2000
           weighted avg       0.75      0.76      0.75      2000



# **Model 3: Random Forest Classifier**
Training Accuracy: 0.999

Testing Accuracy: 0.761

Interpretation

The Random Forest improved testing accuracy to about 76%, roughly 14 to 15 percentage points above the Logistic Regression models (which were around 61–62% accuracy). This shows that the model captures non-linear relationships and feature interactions much better.

However, the training accuracy (99.9%) is about 24 percentage points above the testing accuracy (76.1%), which indicates overfitting: the model fits the training data almost perfectly but generalizes noticeably less well to new data.

Class Performance

1. Functional wells: Strong precision (0.77) and recall (0.86).

2. Non-functional wells: Good balance (F1 = 0.74).

3. Functional needs repair: Still weak (precision = 0.45, recall = 0.27, F1 = 0.34), meaning class imbalance remains an issue and the model struggles to detect rare cases.
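The per-class figures above also explain why the report's two averages differ. As a small worked example, the macro and weighted F1 scores can be recomputed from the rounded per-class values in the report. Because the inputs are rounded, this is only an approximation of what scikit-learn computes internally.

```python
# Per-class F1 and support, copied from the baseline Random Forest report.
report = {
    "functional": {"f1": 0.81, "support": 1083},
    "functional needs repair": {"f1": 0.34, "support": 144},
    "non functional": {"f1": 0.74, "support": 773},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = (
    sum(row["f1"] * row["support"] for row in report.values()) / total_support
)

print(f"macro F1    = {macro_f1:.2f}")     # 0.63
print(f"weighted F1 = {weighted_f1:.2f}")  # 0.75
```

The macro average is pulled down by the minority class, while the weighted average (like accuracy) is dominated by the two large classes. For that reason macro F1 is the more sensitive indicator of progress on “functional needs repair”.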

**Summary**

The Random Forest baseline shows that introducing model complexity substantially improves predictive power, but overfitting and minority-class performance remain challenges. These findings justify proceeding to the next step (hyperparameter tuning) to improve generalization and fairness across classes.

**Step 4: Hyperparameter-Tuned Random Forest**

This step focuses on improving generalization and addressing overfitting by tuning parameters like:

`n_estimators`: number of trees

`max_depth`: tree depth

`min_samples_split` and `min_samples_leaf`: control overfitting

`max_features`: number of features considered at each split
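To put the randomized search in context, the short calculation below counts how many combinations the candidate values used in this step define, and how much of that space a 10-candidate search with 3 folds covers. The dictionary mirrors the `param_dist` defined in the search cell for this step.

```python
from math import prod

# Candidate values per hyperparameter (mirrors the param_dist used in this step).
param_dist = {
    "classifier__n_estimators": [100, 200, 300],
    "classifier__max_depth": [10, 20, 30, None],
    "classifier__min_samples_split": [2, 5, 10],
    "classifier__min_samples_leaf": [1, 2, 4],
    "classifier__max_features": ["sqrt", "log2"],
}

n_iter = 10  # candidates sampled by RandomizedSearchCV
cv = 3       # cross-validation folds

total_combinations = prod(len(values) for values in param_dist.values())
total_fits = n_iter * cv
coverage = n_iter / total_combinations

print(total_combinations)  # 216
print(total_fits)          # 30
print(f"{coverage:.1%}")   # 4.6%
```

A full grid search would need 216 × 3 = 648 fits. The randomized search evaluates under 5% of the combinations, so the reported best parameters are the best of the sampled candidates and not necessarily the best of the whole grid.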

In [17]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

#  Pipeline for preprocessing + model
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, n_jobs=-1))
])

# Define parameter grid for RandomizedSearchCV
param_dist = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [10, 20, 30, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    'classifier__max_features': ['sqrt', 'log2']
}

#  Randomized Search
random_search = RandomizedSearchCV(
    rf_pipeline,
    param_distributions=param_dist,
    n_iter=10,                 # limits the number of combinations for speed
    scoring='accuracy',
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

# Fit search on training data
random_search.fit(X_train, y_train)

# Best model
print("Best Parameters:", random_search.best_params_)
print("Best CV Score:", random_search.best_score_)

# Evaluate on test set
best_rf_model = random_search.best_estimator_
y_pred_train = best_rf_model.predict(X_train)
y_pred_test = best_rf_model.predict(X_test)

print("\nTuned Random Forest Results:")
print("Training Accuracy:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_pred_test))

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Parameters: {'classifier__n_estimators': 100, 'classifier__min_samples_split': 5, 'classifier__min_samples_leaf': 1, 'classifier__max_features': 'sqrt', 'classifier__max_depth': 30}
Best CV Score: 0.7623752582315059

Tuned Random Forest Results:
Training Accuracy: 0.894625
Testing Accuracy: 0.767

Classification Report (Test Set):
                         precision    recall  f1-score   support

             functional       0.75      0.91      0.82      1083
functional needs repair       0.64      0.16      0.26       144
         non functional       0.81      0.68      0.74       773

               accuracy                           0.77      2000
              macro avg       0.73      0.58      0.61      2000
           weighted avg       0.77      0.77      0.75      2000



# **Tuned Random Forest Model Summary**

After performing hyperparameter tuning using RandomizedSearchCV, the best combination of parameters found was:

**n_estimators = 100**

**max_depth = 30**

**min_samples_split = 5**

**min_samples_leaf = 1**

**max_features = 'sqrt'**

This optimized model achieved a training accuracy of **0.895** and a testing accuracy of **0.767**. The gap between training and testing accuracy narrowed from about 24 percentage points for the baseline Random Forest (which was nearly perfect on the training data) to about 13 points. This suggests that the model has reduced overfitting and generalizes better to unseen data, although a noticeable gap remains.
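The train/test gap is a quick way to compare overfitting across the models built so far. The snippet below tabulates it from the accuracies reported in this notebook. The logistic regression figures are the rounded percentages quoted in the text.

```python
# (training accuracy, testing accuracy) as reported in this notebook.
results = {
    "Tuned Logistic Regression": (0.624, 0.618),
    "Baseline Random Forest": (0.99875, 0.7605),
    "Tuned Random Forest": (0.894625, 0.767),
}

gaps = {}
for name, (train_acc, test_acc) in results.items():
    # A large positive gap means the model fits training data much better
    # than unseen data, which is the usual signature of overfitting.
    gaps[name] = train_acc - test_acc
    print(f"{name:<26} train={train_acc:.3f} test={test_acc:.3f} gap={gaps[name]:.3f}")
```

The logistic regression barely overfits but is limited by its low accuracy. Tuning roughly halves the Random Forest's gap while slightly raising testing accuracy.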

**In terms of class performance:**

The model performs very well for “functional” wells **(F1 = 0.82)** and reasonably well for “non functional” wells **(F1 = 0.74)**.

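The “functional needs repair” class remains challenging. Precision rose to **0.64**, but recall fell to **0.16** (from 0.27 for the baseline Random Forest), so the tuned model identifies fewer of these wells. This reflects continued class imbalance and limited representation of this category in the training data, and it is consistent with a search that optimized overall accuracy (`scoring='accuracy'`) rather than a per-class metric.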

Overall, the tuned Random Forest has the highest testing accuracy of the models so far and generalizes better than the baseline Random Forest, while remaining interpretable through feature importances. It captures **non-linear interactions** among features and provides a solid foundation for further improvements, such as class weighting or SMOTE balancing.
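One possible way to steer the search toward the minority class is to score candidates on macro F1 and include `class_weight` among the tuned parameters. The sketch below shows the idea on synthetic data generated with `make_classification`. The data, class proportions, and parameter values are illustrative assumptions, not results from the wells dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic 3-class data with one rare class (a stand-in for the wells data).
X, y = make_classification(
    n_samples=1500, n_features=12, n_informative=6, n_classes=3,
    weights=[0.54, 0.07, 0.39], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight is searched alongside the usual tree parameters.
param_dist = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
    "class_weight": [None, "balanced", "balanced_subsample"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5,
    scoring="f1_macro",  # every class counts equally in the selection metric
    cv=3,
    random_state=42,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(classification_report(y_te, search.predict(X_te)))
```

In the notebook's pipeline the same keys would carry the `classifier__` prefix. Whether this improves recall for “functional needs repair” on the real data would need to be checked empirically.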

# **Modeling Summary and Final Model Justification**

The modeling process followed an iterative approach aimed at predicting the operational status of Tanzanian water wells. We began with a baseline Logistic Regression model, chosen for its interpretability and as a foundational benchmark. After appropriate data cleaning, encoding, and scaling, the logistic model achieved approximately 61% testing accuracy, showing modest predictive power. However, it failed to identify the minority class (“functional needs repair”), highlighting the presence of class imbalance and nonlinear relationships in the data.

To improve the model’s performance, we tuned the logistic regression hyperparameters using GridSearchCV, optimizing penalty type, regularization strength, and class weights. The tuned version yielded a small improvement, reaching around 62% accuracy, which confirmed that logistic regression’s linear nature limited its ability to fully capture complex feature interactions. This justified exploring a more flexible, non-linear model.

Subsequently, we trained a Random Forest Classifier to capture interactions and non-linear patterns across categorical and numeric features. The baseline Random Forest significantly improved testing accuracy to about 76%, with much better recall for functional and non-functional wells. Further hyperparameter tuning boosted the performance slightly to about 77% accuracy, improving generalization and reducing overfitting. The tuned Random Forest emerged as the final model for deployment due to its balance between accuracy, robustness, and interpretability via feature importance analysis. This model will be used to generate final predictions for submission on the unseen test dataset.

# **Recommendations and Next Steps**

While the tuned Random Forest performed well, further improvements could be achieved through addressing class imbalance using techniques like SMOTE oversampling or class-weight adjustments to enhance recall for “functional needs repair” wells. Future iterations could also explore Gradient Boosting algorithms (e.g., XGBoost or LightGBM) for potentially higher accuracy and better handling of imbalanced data. Additionally, improving data quality—such as standardizing missing values and verifying key features like installer, funder, and construction_year—could further enhance model performance. Finally, periodic retraining with new well data is recommended to ensure the model remains relevant and adaptive to changing field conditions.
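As a simple stand-in for SMOTE (which is provided by the separate imbalanced-learn package and creates synthetic points), the sketch below uses plain random oversampling with `sklearn.utils.resample`. The toy arrays and class counts are assumptions for illustration. Any resampling of this kind should be applied to the training split only, so that the test set keeps its natural class distribution.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy training set with an imbalanced label distribution (illustrative counts).
y_train_toy = np.array(
    ["functional"] * 540
    + ["functional needs repair"] * 70
    + ["non functional"] * 390
)
X_train_toy = rng.normal(size=(len(y_train_toy), 4))

labels, counts = np.unique(y_train_toy, return_counts=True)
target_size = int(counts.max())

# Keep all original rows, then add duplicated rows for the smaller classes.
X_parts, y_parts = [X_train_toy], [y_train_toy]
for label, count in zip(labels, counts):
    n_extra = target_size - int(count)
    if n_extra == 0:
        continue  # the majority class needs no extra rows
    X_class = X_train_toy[y_train_toy == label]
    X_extra = resample(X_class, replace=True, n_samples=n_extra, random_state=42)
    X_parts.append(X_extra)
    y_parts.append(np.full(n_extra, label))

X_balanced = np.vstack(X_parts)
y_balanced = np.concatenate(y_parts)

balanced_counts = dict(zip(*np.unique(y_balanced, return_counts=True)))
print(balanced_counts)
```

Random oversampling only duplicates existing rows, so it can increase overfitting on the minority class. Comparing it against class weights on a validation set would show which option suits this dataset better.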