# **Project Dataset Selection and Justification**

# **Tanzanian Water Wells**

For this project, I selected the Tanzanian Water Wells dataset from the DrivenData “Pump It Up: Data Mining the Water Table” competition. This dataset provides information about over 59,000 water wells across Tanzania, including details on their construction, location, management, and current operational status. The target variable, status_group, indicates whether a well is functional, functional but needs repair, or non-functional, making this a multiclass classification problem fully aligned with the requirements of the Phase 3 project.

The decision to use this dataset was guided by both technical and business relevance considerations:

Relevance to Real-World Problems:
Access to clean water remains a critical issue in developing regions, and identifying which wells are likely to fail is a meaningful application of data science for public good. This problem has direct implications for NGOs, government agencies, and infrastructure planners responsible for maintaining water systems.

Strong Alignment with Classification Objectives:
The project’s target variable is categorical, enabling the use of supervised classification algorithms such as logistic regression, decision trees, and ensemble models. The dataset’s structure naturally supports an iterative modeling process involving baseline and tuned models, as required by the project guidelines.

Technical Suitability and Manageability:
The dataset is sufficiently complex, with about 40 features spanning categorical, numeric, and geospatial data, but remains clean and well-documented. It allows for exploration of essential preprocessing techniques such as handling missing values, encoding categorical features, scaling, and avoiding data leakage, without overwhelming computational or cleaning requirements.

Industry and Career Relevance:
The dataset’s focus on infrastructure reliability and maintenance parallels predictive analytics challenges found in the energy and engineering sectors, including geothermal plant operations and equipment reliability monitoring. As such, this project demonstrates the transferability of classification methods to industrial asset management and sustainability use cases.

In summary, the Tanzanian Water Wells dataset provides a meaningful, technically rich, and career-relevant foundation for applying machine learning classification techniques to a real-world problem. It offers both interpretability and complexity, making it an ideal choice for this phase of the data science program.

# **Business Understanding**

Access to clean and functional water points remains a major challenge in Tanzania, especially in rural areas where communities rely heavily on boreholes and hand pumps. Many wells fail due to poor maintenance, aging infrastructure, or environmental conditions, making it difficult for authorities to prioritize which sites need attention first.

This project aims to support the **Tanzanian Ministry of Water and Irrigation** and partner organizations such as **WaterAid, UNICEF, and World Vision** by developing a predictive classification model that identifies the operational status of water wells. The model will help stakeholders allocate maintenance resources more efficiently, reduce downtime, and improve access to clean water.

Our target stakeholders include **government planners**, **NGOs**, and **field maintenance teams** who can use model insights to schedule preventive repairs and optimize field operations. The classification model will predict whether a well is functional, functional but in need of repair, or non-functional based on features such as construction year, installer, pump type, and location.

The project scope focuses on building and evaluating this predictive model and communicating actionable insights through analysis and visualization. Implementation of automated systems, real-time monitoring, or field validation lies outside the current scope.

By helping decision-makers transition from reactive to preventive maintenance, this project aligns with Sustainable Development Goal 6 (Clean Water and Sanitation) and supports long-term water access sustainability across Tanzania.



# **Data Understanding**


The dataset used in this project is the Tanzanian Water Wells dataset, sourced from the Taarifa and Tanzania Ministry of Water repositories, and made publicly available through DrivenData. It contains information on over 59,000 water points across Tanzania, including details about their physical characteristics, installation, management, and operational status.

The primary target variable is status_group, which categorizes each water point as functional, functional but needs repair, or non-functional. The objective is to build a classification model capable of predicting this status based on various predictors.

The dataset includes multiple features such as location data (region, district, latitude, longitude), construction attributes (installer, construction year, pump type, extraction type), and management details (management type, water source, payment options). These predictors include a mix of categorical and numerical data, offering a rich basis for feature engineering and model development.

With over 59,000 observations and 40 features, the dataset provides ample data for training and testing robust classification models. However, some preprocessing is necessary to handle missing values, inconsistent labels, and redundant or highly correlated features. Exploratory analysis will help assess feature distributions, class balance, and potential data quality issues.
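A quick way to check class balance is to tabulate the relative frequency of each target class. The sketch below uses a small made-up frame rather than the real data; only the column name status_group and its three labels come from the dataset, and the counts are illustrative.

```python
import pandas as pd

# Toy stand-in for the merged training data (counts are illustrative only).
toy = pd.DataFrame({
    "status_group": ["functional"] * 5
    + ["non functional"] * 4
    + ["functional needs repair"] * 1
})

# Share of each class, largest first.
class_share = toy["status_group"].value_counts(normalize=True)
print(class_share)

# Ratio between the largest and smallest class; large values signal imbalance.
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

On the real data the same two lines can be run on df['status_group'] once the labels are merged in.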

The data was originally collected through national field surveys conducted by the Ministry of Water and NGO partners. While generally reliable, certain inconsistencies or missing entries may exist due to human error during field reporting. These issues will be addressed during the data cleaning phase to ensure accuracy and consistency for modeling.

Overall, this dataset is suitable for building a classification model aimed at predicting the functionality of water wells, supporting data-driven decision-making in water resource management.

# **Data Preparation**

**Step 1:** Importing relevant libraries

In [22]:
# Import the libraries used throughout the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline


**Step 2:** Loading CSV files

In [9]:
# Load the data in preparation for cleaning

#load training data
train_values = pd.read_csv('Training_set_values.csv')
train_labels = pd.read_csv('Training_set_lables.csv')

#load test data
sub_format = pd.read_csv('SubmissionFormat.csv')

test_values = pd.read_csv('Test_set_values.csv')

**Step 3:** Merge Training Data and Labels

In [10]:
#checking for common column for merging
train_values.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [11]:
train_labels.head()


Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [12]:
# Merge on the 'id' column (common key in both files)
df = pd.merge(train_values, train_labels, on='id')
# Preview the data
df.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
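Before relying on the merged frame, it can help to confirm that every id appears exactly once in both files. The following is a minimal sketch on toy frames (the rows are illustrative stand-ins for train_values and train_labels), using the validate and indicator options of pd.merge.

```python
import pandas as pd

# Toy stand-ins for train_values and train_labels (illustrative rows only).
values = pd.DataFrame({"id": [69572, 8776, 34310], "gps_height": [1390, 1399, 686]})
labels = pd.DataFrame({
    "id": [69572, 8776, 34310],
    "status_group": ["functional", "functional", "functional"],
})

# validate="one_to_one" raises MergeError if an id repeats on either side;
# indicator=True adds a _merge column showing where each row came from.
merged = pd.merge(values, labels, on="id", how="outer",
                  validate="one_to_one", indicator=True)

# Every row should be present in both files.
unmatched = merged[merged["_merge"] != "both"]
print("Unmatched rows:", len(unmatched))

merged = merged.drop(columns="_merge")
```

An outer join is used here only so that unmatched ids on either side become visible; the default inner join would silently drop them.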


**Step 4:** Inspect Basic Info

In [13]:
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55763 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59398 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
count,59400.0,59400.0,59400,55763,59400.0,55745,59400.0,59400.0,59398,59400.0,...,59400,59400,59400,59400,59400,59400,59400,59400,59400,59400
unique,,,356,1896,,2145,,,37399,,...,8,6,5,5,10,7,3,7,6,3
top,,,2011-03-15,Government Of Tanzania,,DWE,,,none,,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
freq,,,572,9084,,17402,,,3563,,...,50818,50818,33186,33186,17021,17021,45794,28522,34625,32259
mean,37115.131768,317.650385,,,668.297239,,34.077427,-5.706033,,0.474141,...,,,,,,,,,,
std,21453.128371,2997.574558,,,693.11635,,6.567432,2.946019,,12.23623,...,,,,,,,,,,
min,0.0,0.0,,,-90.0,,0.0,-11.64944,,0.0,...,,,,,,,,,,
25%,18519.75,0.0,,,0.0,,33.090347,-8.540621,,0.0,...,,,,,,,,,,
50%,37061.5,0.0,,,369.0,,34.908743,-5.021597,,0.0,...,,,,,,,,,,
75%,55656.5,20.0,,,1319.25,,37.178387,-3.326156,,0.0,...,,,,,,,,,,


Before building our classification model, we prepared the Tanzanian Water Wells dataset to ensure data quality and consistency. The dataset contained 59,400 records and 41 columns, including both numeric and categorical features such as location coordinates, water source type, management, and well status. Our target variable was status_group, indicating whether a waterpoint is functional, functional needs repair, or non-functional.

We identified several columns with missing values, including funder, installer, scheme_management, scheme_name, public_meeting, and permit. Since some of these variables contain a significant proportion of missing entries, we plan to handle them by either imputing the most frequent values, filling with “unknown,” or dropping variables that add little value.

Next, we confirmed that some columns such as date_recorded are stored as object types and need conversion to datetime format. Similarly, categorical variables like region, source, and payment_type will later be encoded using one-hot encoding to make them suitable for machine learning models.

We also noted potential multicollinearity among related variables (e.g., extraction_type, extraction_type_group, and extraction_type_class), which will be addressed through correlation checks and feature selection. Numeric variables like gps_height, population, and amount_tsh will be normalized or scaled to reduce model bias from differing value ranges.
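Because the extraction columns are categorical, a numeric correlation matrix does not apply to them directly. One simple alternative, sketched here on toy data, is to check whether each value of the finer column maps to exactly one value of the coarser column. The helper name is_determined_by and the sample rows are hypothetical.

```python
import pandas as pd

# Toy rows; the pairing of values is illustrative, not taken from the real data.
toy = pd.DataFrame({
    "extraction_type": ["gravity", "nira/tanira", "swn 80", "submersible", "swn 80"],
    "extraction_type_class": ["gravity", "handpump", "handpump", "submersible", "handpump"],
})

def is_determined_by(frame, fine, coarse):
    """Return True if each value of `fine` maps to exactly one value of `coarse`."""
    return bool((frame.groupby(fine)[coarse].nunique() <= 1).all())

# The coarse column adds nothing once the fine column is known...
print(is_determined_by(toy, "extraction_type", "extraction_type_class"))
# ...but not the other way round.
print(is_determined_by(toy, "extraction_type_class", "extraction_type"))
```

When the first check holds on the real data, the coarser column is a pure grouping of the finer one, and keeping both only duplicates information.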

Overall, the dataset provides a rich foundation for building a predictive classification model, but careful preprocessing will be essential to ensure data quality and model performance.
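The summary statistics above report a minimum longitude of 0.0, which lies outside Tanzania and may be a placeholder for unrecorded coordinates. The sketch below shows one way to flag such rows, assuming that an exact zero means "not recorded"; the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy coordinates; the middle row imitates a placeholder entry.
toy = pd.DataFrame({
    "longitude": [34.938093, 0.0, 37.460664],
    "gps_height": [1390, 0, 686],
})

# Assumption: a longitude of exactly 0 marks a missing coordinate.
placeholder = toy["longitude"] == 0
toy.loc[placeholder, "longitude"] = np.nan

print("Rows flagged:", int(placeholder.sum()))
```

Whether to impute the flagged values (for example with a regional median) or leave them missing is a separate modeling decision.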

# **Data Cleaning**

**Step 1: Handle Missing Values**

We’ll identify missing data, then decide how to handle it.

In [14]:
# Check missing values
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

scheme_name          28810
scheme_management     3878
installer             3655
funder                3637
public_meeting        3334
permit                3056
subvillage             371
wpt_name                 2
dtype: int64


Several columns have a large number of missing values, so we need to handle them carefully. The strategy for each column is outlined below.


| Column | Missing count | Type | Strategy | Rationale |
|---|---|---|---|---|
| scheme_name | 28,810 | Categorical | Fill with "unknown" | Too many missing values, no reliable pattern |
| scheme_management | 3,878 | Categorical | Fill with mode (most frequent) | Limited missing values; the same management entity type likely repeats |
| installer | 3,655 | Categorical | Fill with "unknown" | Missing likely due to poor field records; dropping the rows would not be informative |
| funder | 3,637 | Categorical | Fill with "unknown" | High variety; missingness is not systematic |
| public_meeting | 3,334 | Boolean | Fill with mode (most frequent) | Missing probably means "information not recorded" |
| permit | 3,056 | Boolean | Fill with mode (most frequent) | Missing means no permit information, not necessarily "no permit" |
| subvillage | 371 | Categorical | Fill with "unknown" | Very few missing values |
| wpt_name | 2 | Categorical | Fill with "unknown" | Only two missing values |

In [15]:
# Fill these categorical columns with the placeholder 'unknown'
cols_unknown = ['scheme_name', 'installer', 'funder', 'subvillage', 'wpt_name']
for col in cols_unknown:
    df[col] = df[col].fillna('unknown')

# Fill scheme_management and the boolean-like fields with the mode (most common value)
cols_mode = ['scheme_management', 'public_meeting', 'permit']
for col in cols_mode:
    df[col] = df[col].fillna(df[col].mode()[0])

# Double-check missing values are gone
print(df.isnull().sum().sort_values(ascending=False).head(10))

id               0
amount_tsh       0
date_recorded    0
funder           0
gps_height       0
installer        0
longitude        0
latitude         0
wpt_name         0
num_private      0
dtype: int64




**Step 2: Convert Data Types & Handle Inconsistencies**

In this step we will:

1. Convert dates properly
2. Fix boolean-like text values
3. Ensure numeric columns are truly numeric

In [16]:
# Convert 'date_recorded' to datetime
df['date_recorded'] = pd.to_datetime(df['date_recorded'])

# Convert boolean-like columns (public_meeting, permit) to numeric
df['public_meeting'] = df['public_meeting'].map({True: 1, False: 0, 'True': 1, 'False': 0}).fillna(0)
df['permit'] = df['permit'].map({True: 1, False: 0, 'True': 1, 'False': 0}).fillna(0)

# Confirm data types
print(df.dtypes.head(15))

id                        int64
amount_tsh              float64
date_recorded    datetime64[ns]
funder                   object
gps_height                int64
installer                object
longitude               float64
latitude                float64
wpt_name                 object
num_private               int64
basin                    object
subvillage               object
region                   object
region_code               int64
district_code             int64
dtype: object


**Step 3: Drop Redundant / Low-Value Features**

Some columns in this dataset are:

Duplicates or near-duplicates (e.g. extraction_type, extraction_type_group, and extraction_type_class describe the same thing at different detail levels).

Non-predictive identifiers (e.g. id, wpt_name are unique to each record and not useful for modeling).

Columns that generalize others (e.g. management vs management_group).

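We'll keep the most detailed version of each (for example extraction_type, management, source, and waterpoint_type) and drop the coarser groupings. A few overlapping pairs, such as payment and payment_type or water_quality and quality_group, are left in place at this stage.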

In [17]:
# Drop identifiers and duplicates
df = df.drop([
    'id', 'wpt_name',  # identifiers
    'extraction_type_group', 'extraction_type_class',  # redundant
    'management_group', 'quantity_group',  # redundant
    'source_type', 'source_class', 'waterpoint_type_group',  # redundant
    'recorded_by'  # same for all rows: "GeoData Consultants Ltd"
], axis=1)

print("Remaining columns:", len(df.columns))
df.head()

Remaining columns: 31


Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,num_private,basin,subvillage,...,extraction_type,management,payment,payment_type,water_quality,quality_group,quantity,source,waterpoint_type,status_group
0,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,0,Lake Nyasa,Mnyusi B,...,gravity,vwc,pay annually,annually,soft,good,enough,spring,communal standpipe,functional
1,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,0,Lake Victoria,Nyamara,...,gravity,wug,never pay,never pay,soft,good,insufficient,rainwater harvesting,communal standpipe,functional
2,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,0,Pangani,Majengo,...,gravity,vwc,pay per bucket,per bucket,soft,good,enough,dam,communal standpipe multiple,functional
3,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,0,Ruvuma / Southern Coast,Mahakamani,...,submersible,vwc,never pay,never pay,soft,good,dry,machine dbh,communal standpipe multiple,non functional
4,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,0,Lake Victoria,Kyanyamisa,...,gravity,other,never pay,never pay,soft,good,seasonal,rainwater harvesting,communal standpipe,functional
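Constant columns such as recorded_by can also be found programmatically instead of by inspection. A minimal sketch on toy data follows; the rows are illustrative.

```python
import pandas as pd

# Toy frame with one constant column, mirroring recorded_by in the real data.
toy = pd.DataFrame({
    "recorded_by": ["GeoData Consultants Ltd"] * 4,
    "basin": ["Lake Nyasa", "Lake Victoria", "Pangani", "Lake Victoria"],
    "gps_height": [1390, 1399, 686, 0],
})

# Columns with a single distinct value carry no information for a model.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print("Constant columns:", constant_cols)

toy = toy.drop(columns=constant_cols)
```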


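Missing values are now filled and the redundant columns dropped. The data is not fully clean, though: the summary statistics show a minimum longitude of 0.0, which lies outside Tanzania and likely marks unrecorded coordinates.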

We can now move to the final part of data preparation:

1. Scaling numeric features
2. Encoding the target (status_group)
3. One-hot encoding categorical variables
4. Splitting the data for modeling

**Step 1: Scale Numeric Columns**

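We’ll standardize only numeric columns (since categorical ones will be encoded separately). Selecting by dtype also picks up region_code and district_code, which are category codes, and the 0/1 flags public_meeting and permit, so these are standardized as well.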

In [21]:

# Identify numeric columns
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
print("Numeric columns:", num_cols)

# Scale numeric features
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

Numeric columns: Index(['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private',
       'region_code', 'district_code', 'population', 'public_meeting',
       'permit', 'construction_year'],
      dtype='object')
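Fitting the scaler on the full dataset lets test-set statistics influence the transformation, which is a mild form of data leakage. A leak-free variant splits first and fits the scaler on the training rows only. The sketch below illustrates this on synthetic data; the column names are borrowed from the dataset, but the values and distribution parameters are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and labels (not real data).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "gps_height": rng.normal(650, 700, size=200),
    "population": rng.normal(180, 470, size=200),
})
y = rng.integers(0, 3, size=200)

# Split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows
X_test_scaled = scaler.transform(X_test)        # training statistics are reused

print("Train means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("Test means after scaling:", X_test_scaled.mean(axis=0).round(3))
```

The test means are generally not exactly zero, which is expected: the test set is transformed with statistics it did not contribute to.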


**Step 2: Encode the Target Variable**

Our target (status_group) has three classes:

1. Functional

2. Functional needs repair

3. Non functional

We’ll encode them numerically for classification.

In [23]:

label_encoder = LabelEncoder()
df['status_group'] = label_encoder.fit_transform(df['status_group'])

# Mapping reference
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Label mapping:", label_mapping)

Label mapping: {'functional': np.int64(0), 'functional needs repair': np.int64(1), 'non functional': np.int64(2)}
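The fitted encoder can also map numeric predictions back to readable labels. A small sketch, with hypothetical predicted codes standing in for model output:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Refit an encoder on the three known class names (sorted alphabetically by sklearn).
label_encoder = LabelEncoder()
label_encoder.fit(["functional", "functional needs repair", "non functional"])

# Hypothetical model output as integer codes.
predicted_codes = np.array([0, 2, 1, 0])

# Convert the codes back to the original class names.
predicted_labels = label_encoder.inverse_transform(predicted_codes)
print(predicted_labels)
```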


**Step 3: One-Hot Encode Categorical Features**

Now we convert all categorical predictors to numeric.

In [24]:
# Identify categorical columns (excluding target)
cat_cols = df.select_dtypes(include=['object']).columns

# One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=cat_cols, drop_first=True)

print("Encoded dataset shape:", df_encoded.shape)
df_encoded.head()

Encoded dataset shape: (59400, 28359)


Unnamed: 0,amount_tsh,date_recorded,gps_height,longitude,latitude,num_private,region_code,district_code,population,public_meeting,...,source_river,source_shallow well,source_spring,source_unknown,waterpoint_type_communal standpipe,waterpoint_type_communal standpipe multiple,waterpoint_type_dam,waterpoint_type_hand pump,waterpoint_type_improved spring,waterpoint_type_other
0,1.895665,2011-03-14,1.041252,0.131052,-1.408791,-0.038749,-0.244325,-0.06537,-0.150399,0.304987,...,False,False,True,False,True,False,False,False,False,False
1,-0.10597,2013-03-06,1.054237,0.09461,1.207934,-0.038749,0.267409,-0.376781,0.21229,0.304987,...,False,False,False,False,True,False,False,False,False,False
2,-0.09763,2013-02-25,0.025541,0.515158,0.639751,-0.038749,0.324269,-0.169174,0.14866,0.304987,...,False,False,False,False,False,True,False,False,False,False
3,-0.10597,2013-01-28,-0.584751,0.671308,-1.84972,-0.038749,4.247564,5.955245,-0.25857,0.304987,...,False,False,False,False,False,True,False,False,False,False
4,-0.10597,2011-07-13,-0.9642,-0.448669,1.317271,-0.038749,0.153691,-0.480585,-0.381587,0.304987,...,False,False,False,False,True,False,False,False,False,False
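The encoded frame has more than 28,000 columns because columns such as funder and installer each contain thousands of distinct values. One common way to limit this, sketched here on toy data, is to keep only the most frequent categories and group the rest under "other" before encoding. The helper collapse_rare and the cutoff top_n=2 are hypothetical choices for illustration.

```python
import pandas as pd

# Toy installer column (values are illustrative).
toy = pd.DataFrame({
    "installer": ["DWE", "DWE", "DWE", "Roman", "UNICEF", "Artisan", "DWE", "Roman"],
})

def collapse_rare(series, top_n):
    """Keep the `top_n` most frequent values and replace the rest with 'other'."""
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), "other")

toy["installer"] = collapse_rare(toy["installer"], top_n=2)

# One-hot encoding now yields one column per kept category plus 'other'.
encoded = pd.get_dummies(toy, columns=["installer"])
print(encoded.columns.tolist())
```

To avoid leakage, the set of kept categories would ideally be learned from the training split only and then applied to the test split.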


**Step 4: Split Data for Modeling**

Finally, we’ll split the data into training and testing sets.

In [25]:


X = df_encoded.drop('status_group', axis=1)
y = df_encoded['status_group']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training shape:", X_train.shape)
print("Testing shape:", X_test.shape)

Training shape: (47520, 28358)
Testing shape: (11880, 28358)


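The split yields 47,520 training rows and 11,880 test rows, stratified by status_group. The 28,358 columns, however, come mostly from one-hot encoding high-cardinality text columns such as funder and installer (roughly 1,900 and 2,100 distinct values each), and date_recorded is still a datetime column. This matrix is therefore not yet ready for modeling, and so many sparse indicator columns relative to the number of rows increases the risk of overfitting.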

# **Modeling**

**Step 1: Baseline Logistic Regression**

In [26]:
df.head()

Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,num_private,basin,subvillage,...,extraction_type,management,payment,payment_type,water_quality,quality_group,quantity,source,waterpoint_type,status_group
0,1.895665,2011-03-14,Roman,1.041252,Roman,0.131052,-1.408791,-0.038749,Lake Nyasa,Mnyusi B,...,gravity,vwc,pay annually,annually,soft,good,enough,spring,communal standpipe,0
1,-0.10597,2013-03-06,Grumeti,1.054237,GRUMETI,0.09461,1.207934,-0.038749,Lake Victoria,Nyamara,...,gravity,wug,never pay,never pay,soft,good,insufficient,rainwater harvesting,communal standpipe,0
2,-0.09763,2013-02-25,Lottery Club,0.025541,World vision,0.515158,0.639751,-0.038749,Pangani,Majengo,...,gravity,vwc,pay per bucket,per bucket,soft,good,enough,dam,communal standpipe multiple,0
3,-0.10597,2013-01-28,Unicef,-0.584751,UNICEF,0.671308,-1.84972,-0.038749,Ruvuma / Southern Coast,Mahakamani,...,submersible,vwc,never pay,never pay,soft,good,dry,machine dbh,communal standpipe multiple,2
4,-0.10597,2011-07-13,Action In A,-0.9642,Artisan,-0.448669,1.317271,-0.038749,Lake Victoria,Kyanyamisa,...,gravity,other,never pay,never pay,soft,good,seasonal,rainwater harvesting,communal standpipe,0


In [27]:
# --- Import Required Libraries ---
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# --- Assume df is your cleaned dataset ---
# Define features and target
X = df.drop(columns=['status_group'], errors='ignore')
y = df['status_group']

# --- Split data into train and test sets ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Fix datetime issue before training ---
if 'date_recorded' in X_train.columns:
    X_train['year_recorded'] = X_train['date_recorded'].dt.year
    X_train['month_recorded'] = X_train['date_recorded'].dt.month
    X_train['day_recorded'] = X_train['date_recorded'].dt.day
    X_train = X_train.drop(columns=['date_recorded'], errors='ignore')

if 'date_recorded' in X_test.columns:
    X_test['year_recorded'] = X_test['date_recorded'].dt.year
    X_test['month_recorded'] = X_test['date_recorded'].dt.month
    X_test['day_recorded'] = X_test['date_recorded'].dt.day
    X_test = X_test.drop(columns=['date_recorded'], errors='ignore')

# --- Scale numeric data ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Baseline Model: Logistic Regression ---
baseline_model = LogisticRegression(max_iter=500, random_state=42)
baseline_model.fit(X_train_scaled, y_train)

# --- Evaluate model ---
y_pred_train = baseline_model.predict(X_train_scaled)
y_pred_test = baseline_model.predict(X_test_scaled)

print("Training Accuracy:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_pred_test))

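ValueError: could not convert string to float: 'unknown'

This cell fails because X is built from df, which still contains string-valued categorical columns (including the 'unknown' placeholder filled in earlier), while StandardScaler accepts only numeric input. The next attempt moves scaling and one-hot encoding into a ColumnTransformer so that each column type gets an appropriate transformer.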

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# --- Define features and target ---
X = df.drop(columns=['status_group'], errors='ignore')
y = df['status_group']

# --- Split data ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Extract datetime parts ---
# Work on copies so the column assignments below modify independent frames
X_train, X_test = X_train.copy(), X_test.copy()
for dataset in [X_train, X_test]:
    recorded = pd.to_datetime(dataset['date_recorded'])
    dataset['year_recorded'] = recorded.dt.year
    dataset['month_recorded'] = recorded.dt.month
    dataset['day_recorded'] = recorded.dt.day
    dataset.drop(columns=['date_recorded'], inplace=True, errors='ignore')

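# --- Separate numeric and categorical columns ---
# include='number' also captures the date parts, which can be int32 in recent
# pandas versions and would be skipped by include=['int64', 'float64']
numeric_features = X_train.select_dtypes(include='number').columns
categorical_features = X_train.select_dtypes(include=['object']).columns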

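# --- Create preprocessing pipeline ---
# Keep the one-hot output sparse (the default): with high-cardinality columns
# such as subvillage, funder, and installer, a dense matrix would have tens of
# thousands of columns and may not fit in memory
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)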

# --- Combine preprocessing + model in a pipeline ---
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=500, random_state=42))
])

# --- Train the model ---
model_pipeline.fit(X_train, y_train)

# --- Predict and evaluate ---
y_pred_train = model_pipeline.predict(X_train)
y_pred_test = model_pipeline.predict(X_test)

print("Training Accuracy:", accuracy_score(y_train, y_pred_train))
print("Testing Accuracy:", accuracy_score(y_test, y_pred_test))
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_pred_test))

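NameError: name 'df' is not defined

This error is unrelated to the pipeline itself. The prompt counter restarted at In [1], which suggests a fresh kernel, so the data preparation and cleaning cells that define df must be re-run before this cell.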
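Once the pipeline above runs, accuracy alone may be dominated by the largest class: the summary statistics show functional as the most frequent status with 32,259 of 59,400 rows. A confusion matrix and macro-averaged F1 give a per-class view. The sketch below uses hypothetical true and predicted codes, not results from this model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted class codes (0, 1, 2 as in the label mapping).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 2, 0, 1, 2, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Recall per class: correct predictions divided by the true count of each class.
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print("Recall per class:", recall_per_class.round(2))

# Macro F1 weights every class equally, regardless of its size.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print("Macro F1:", round(macro_f1, 3))
```

With the real model, y_test and y_pred_test from the cell above would take the place of the toy arrays.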
