# Water Filter Predictive Maintenance - End-to-End ML Project

## Step 1: Problem Statement

**Goal**: Predict when a home RO water filter needs maintenance BEFORE it fails.

**Why it matters**:
- Unsafe drinking water if TDS goes too high
- Save money by doing timely maintenance instead of emergency repairs
- Better user experience with proactive alerts

**ML Type**: Binary Classification (maintenance_needed: 0 or 1)

**Key Metric**: **Recall** - we'd rather have a false alarm than miss a bad filter!
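
To see why accuracy alone can mislead here, consider a minimal sketch with made-up counts (the numbers are illustrative assumptions, not results from this dataset). A model that never raises an alert can still score high accuracy while catching none of the failing filters.

```python
# Illustrative counts only (assumed, not taken from the dataset):
# 1,000 readings, 100 of which truly need maintenance.
def accuracy(tp, tn, fp, fn):
    # Share of all readings that were classified correctly
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    # Share of filters that truly need maintenance that the model caught
    return tp / (tp + fn) if (tp + fn) else 0.0

# Model A never raises an alert.
model_a = dict(tp=0, tn=900, fp=0, fn=100)
# Model B raises more false alarms but catches most bad filters.
model_b = dict(tp=90, tn=780, fp=120, fn=10)

for name, c in [('A (never alerts)', model_a), ('B (cautious)', model_b)]:
    print(f"{name}: accuracy={accuracy(**c):.2f}, recall={recall(c['tp'], c['fn']):.2f}")
```

Model A wins on accuracy but misses every bad filter, which is why the rest of this notebook ranks models by recall.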

### Laravel Parallel
Like defining API specs before coding. We know what we want to build and why.

In [None]:
# Import everything we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, auc)

sns.set_theme(style='whitegrid')
%matplotlib inline

print("All imports ready!")

---
## Step 2: Load & Inspect Data

First run `01_generate_water_filter_data.ipynb` to create the dataset!

In [None]:
df = pd.read_csv('../data/water_filter_readings.csv')
print(f"Dataset: {df.shape[0]} rows x {df.shape[1]} columns")
df.head()

In [None]:
df.info()

In [None]:
df.describe().round(2)

---
## Step 3: Exploratory Data Analysis (EDA)

In [None]:
# Target distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# sort_index() keeps the bars in 0/1 order so they match the No/Yes labels
df['maintenance_needed'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color=['green', 'red'])
axes[0].set_title(f'Maintenance Needed ({df["maintenance_needed"].mean():.1%} positive)')
axes[0].set_xticklabels(['No', 'Yes'], rotation=0)

df['tds_alert'].value_counts().sort_index().plot(kind='bar', ax=axes[1], color=['green', 'orange'])
axes[1].set_title(f'TDS Alert ({df["tds_alert"].mean():.1%} positive)')
axes[1].set_xticklabels(['No', 'Yes'], rotation=0)

plt.tight_layout()
plt.show()

In [None]:
# TDS output distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(df['tds_output'], bins=50, kde=True, ax=axes[0])
axes[0].axvline(x=100, color='red', linestyle='--', label='Alert limit (100 ppm)')
axes[0].set_title('TDS Output Distribution')
axes[0].legend()

# TDS output vs filter age - THE KEY RELATIONSHIP
sns.scatterplot(data=df, x='filter_age_days', y='tds_output', 
                hue='maintenance_needed', alpha=0.3, ax=axes[1])
axes[1].axhline(y=100, color='red', linestyle='--', alpha=0.5)
axes[1].set_title('Filter Age vs TDS Output')

plt.tight_layout()
plt.show()

In [None]:
# Maintenance rate by membrane status and region
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

df.groupby('membrane_status')['maintenance_needed'].mean().plot(
    kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Maintenance Rate by Membrane Status')
axes[0].set_ylabel('Maintenance Rate')
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)

df.groupby('region')['maintenance_needed'].mean().plot(
    kind='bar', ax=axes[1], color='steelblue')
axes[1].set_title('Maintenance Rate by Region')
axes[1].set_ylabel('Maintenance Rate')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=0)

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
numeric_cols = df.select_dtypes(include=[np.number]).columns
plt.figure(figsize=(10, 8))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlations')
plt.show()

---
## Step 4: Data Preprocessing

Like validation and data casting in Laravel.

In [None]:
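# Check for missing values
missing = df.isna().sum()
print("Missing values:")
print(missing)
if missing.sum() == 0:
    print("\nNo missing values - our generated data is clean!")
else:
    print(f"\nFound {missing.sum()} missing values - handle them before modeling.")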

In [None]:
# Encode categorical variables
df_ml = df.copy()

# Encode membrane_status: good=0, degraded=1, needs_replacement=2
membrane_map = {'good': 0, 'degraded': 1, 'needs_replacement': 2}
df_ml['membrane_status_encoded'] = df_ml['membrane_status'].map(membrane_map)

# Encode region (one-hot encoding)
df_ml = pd.get_dummies(df_ml, columns=['region'], prefix='region')

# Drop columns not useful for modeling
drop_cols = ['filter_id', 'reading_date', 'membrane_status', 'tds_alert']
df_ml = df_ml.drop(columns=drop_cols)

print(f"Features for modeling: {list(df_ml.columns)}")
df_ml.head()

---
## Step 5: Feature Engineering

Like creating computed attributes / accessors in an Eloquent model.

In [None]:
# Create new features that might help the model

# TDS reduction percentage (how well is the filter working?)
df_ml['tds_reduction_pct'] = ((df_ml['tds_input'] - df_ml['tds_output']) / df_ml['tds_input'] * 100).round(1)

# Filter efficiency (flow per unit pressure)
df_ml['flow_per_pressure'] = (df_ml['flow_rate_lpm'] / df_ml['pressure_psi']).round(4)

# Usage intensity (liters per day of age)
df_ml['usage_intensity'] = (df_ml['total_usage_liters'] / (df_ml['filter_age_days'] + 1)).round(1)

# Is the input water quality poor? (above 500 ppm)
df_ml['high_tds_input'] = (df_ml['tds_input'] > 500).astype(int)

print("New features created!")
df_ml[['tds_reduction_pct', 'flow_per_pressure', 'usage_intensity', 'high_tds_input']].describe().round(2)

---
## Step 6: Model Selection

In [None]:
# Prepare features and target
X = df_ml.drop(columns=['maintenance_needed'])
y = df_ml['maintenance_needed']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training: {len(X_train)} | Test: {len(X_test)}")
print(f"Maintenance rate - Train: {y_train.mean():.1%} | Test: {y_test.mean():.1%}")

In [None]:
# Try 4 models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=5),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),  # Most important for us!
        'F1': f1_score(y_test, y_pred),
    })

results_df = pd.DataFrame(results).set_index('Model').round(3)
print("Model Comparison:")
results_df.style.highlight_max(axis=0, color='lightgreen')

---
## Step 7: Hyperparameter Tuning

In [None]:
# GridSearchCV on Random Forest (a strong default for tabular data; confirm against the comparison table above)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='recall',  # Optimize for recall!
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Recall: {grid_search.best_score_:.3f}")

---
## Step 8: Final Evaluation

In [None]:
# Evaluate the tuned model on test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("=== Final Model Performance ===")
print(classification_report(y_test, y_pred, target_names=['OK', 'Maintenance']))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['OK', 'Maintenance'], yticklabels=['OK', 'Maintenance'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Final Model')
plt.show()

In [None]:
# Feature Importance - which sensors matter most?
importance = pd.Series(
    best_model.feature_importances_, index=X.columns
).sort_values(ascending=True)

plt.figure(figsize=(10, 6))
importance.tail(10).plot(kind='barh', color='steelblue')  # Top 10
plt.xlabel('Importance')
plt.title('Top 10 Most Important Features')
plt.show()

In [None]:
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'Model (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], 'r--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
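
`best_model.predict` flags a reading when its predicted probability is at least 0.5. Since recall is the priority here, one option is to choose a lower cutoff from the predicted probabilities instead. The sketch below is a plain-Python illustration on made-up labels and scores; `threshold_for_recall` is a hypothetical helper, not a scikit-learn function.

```python
def threshold_for_recall(y_true, y_score, target_recall=0.95):
    """Return the highest cutoff whose recall is at least target_recall.

    A reading is flagged when its score is >= the cutoff.
    Hypothetical helper for illustration only.
    """
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("y_true contains no positive examples")
    # Try candidate cutoffs from the highest score down to the lowest
    for cutoff in sorted(set(y_score), reverse=True):
        caught = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= cutoff)
        if caught / positives >= target_recall:
            return cutoff
    raise ValueError("target_recall must be at most 1.0")


# Made-up example: 1 = maintenance needed
y_true_demo = [0, 0, 1, 0, 1, 1, 0, 1]
y_score_demo = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.45, 0.90]

cutoff = threshold_for_recall(y_true_demo, y_score_demo, target_recall=1.0)
flags = [int(s >= cutoff) for s in y_score_demo]
print(f"cutoff={cutoff}, flagged {sum(flags)} of {len(flags)} readings")
```

With this notebook's variables the call would look like `threshold_for_recall(list(y_test), list(y_proba), 0.95)`, although in practice the cutoff is better chosen on a separate validation split so the test set stays untouched.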

---
## Step 9: Build the Alert System

Like building an API endpoint that returns filter status.

In [None]:
def check_filter_health(reading, model, feature_columns):
    """
    Check a water filter's health and return status.
    
    Like a Laravel API endpoint:
    GET /api/filter/{id}/health → {status, message, confidence}
    """
    # Rule-based TDS alert (immediate)
    if reading.get('tds_output', 0) > 100:
        return {
            'status': 'ALERT',
            'message': f'TDS output is {reading["tds_output"]} ppm - exceeds safe limit (100 ppm)!',
            'action': 'Immediate maintenance required. Do not drink this water.'
        }
    
    # ML-based prediction
    reading_df = pd.DataFrame([reading])[feature_columns]
    probability = model.predict_proba(reading_df)[0][1]
    
    if probability > 0.7:
        return {
            'status': 'WARNING',
            'message': f'Maintenance likely needed soon ({probability:.0%} confidence)',
            'action': 'Schedule maintenance within 1 week.'
        }
    elif probability > 0.4:
        return {
            'status': 'WATCH',
            'message': f'Filter showing early signs of wear ({probability:.0%} confidence)',
            'action': 'Monitor closely. Check again in a few days.'
        }
    else:
        return {
            'status': 'OK',
            'message': f'Filter is working well ({1-probability:.0%} confidence)',
            'action': 'No action needed.'
        }

In [None]:
# Demo the alert system
test_readings = [
    {'tds_input': 350, 'tds_output': 38, 'flow_rate_lpm': 2.1, 'pressure_psi': 55,
     'temperature_c': 25, 'filter_age_days': 30, 'daily_usage_liters': 15,
     'total_usage_liters': 450, 'sediment_filter_age_days': 30,
     'membrane_status_encoded': 0, 'region_East': 0, 'region_North': 1,
     'region_South': 0, 'region_West': 0, 'tds_reduction_pct': 89.1,
     'flow_per_pressure': 0.038, 'usage_intensity': 14.5, 'high_tds_input': 0},
    
    {'tds_input': 500, 'tds_output': 85, 'flow_rate_lpm': 1.1, 'pressure_psi': 42,
     'temperature_c': 30, 'filter_age_days': 250, 'daily_usage_liters': 25,
     'total_usage_liters': 6250, 'sediment_filter_age_days': 90,
     'membrane_status_encoded': 1, 'region_East': 1, 'region_North': 0,
     'region_South': 0, 'region_West': 0, 'tds_reduction_pct': 83.0,
     'flow_per_pressure': 0.026, 'usage_intensity': 24.9, 'high_tds_input': 0},
    
    {'tds_input': 600, 'tds_output': 150, 'flow_rate_lpm': 0.5, 'pressure_psi': 38,
     'temperature_c': 28, 'filter_age_days': 340, 'daily_usage_liters': 30,
     'total_usage_liters': 10200, 'sediment_filter_age_days': 110,
     'membrane_status_encoded': 2, 'region_East': 0, 'region_North': 0,
     'region_South': 1, 'region_West': 0, 'tds_reduction_pct': 75.0,
     'flow_per_pressure': 0.013, 'usage_intensity': 29.9, 'high_tds_input': 1},
]

labels = ['New filter (30 days)', 'Aging filter (250 days)', 'Old filter (340 days)']

for label, reading in zip(labels, test_readings):
    result = check_filter_health(reading, best_model, X.columns)
    print(f"\n{'='*50}")
    print(f"Filter: {label}")
    print(f"Status: {result['status']}")
    print(f"Message: {result['message']}")
    print(f"Action: {result['action']}")
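
The demo readings above already contain the encoded and engineered columns, which a real sensor payload would not. A minimal sketch of a helper that derives them from a raw reading follows, reusing the mapping from Step 4 and the formulas from Step 5. `prepare_reading` is a hypothetical name, and the region list is an assumption inferred from the `region_*` columns in the demo.

```python
# Mapping and formulas mirror Steps 4 and 5 of this notebook.
MEMBRANE_MAP = {'good': 0, 'degraded': 1, 'needs_replacement': 2}
# Assumed region names, inferred from the region_* columns in the demo.
REGIONS = ['East', 'North', 'South', 'West']

def prepare_reading(raw):
    """Turn a raw sensor reading (dict) into the model's feature dict."""
    if raw['membrane_status'] not in MEMBRANE_MAP:
        raise ValueError(f"Unknown membrane_status: {raw['membrane_status']!r}")

    # Copy the numeric sensor fields, leaving out the raw categorical ones
    features = {k: v for k, v in raw.items() if k not in ('membrane_status', 'region')}
    features['membrane_status_encoded'] = MEMBRANE_MAP[raw['membrane_status']]

    # One-hot encode region, matching pd.get_dummies(..., prefix='region')
    for region in REGIONS:
        features[f'region_{region}'] = int(raw['region'] == region)

    # Engineered features (same formulas and rounding as Step 5)
    features['tds_reduction_pct'] = round(
        (raw['tds_input'] - raw['tds_output']) / raw['tds_input'] * 100, 1)
    features['flow_per_pressure'] = round(raw['flow_rate_lpm'] / raw['pressure_psi'], 4)
    features['usage_intensity'] = round(
        raw['total_usage_liters'] / (raw['filter_age_days'] + 1), 1)
    features['high_tds_input'] = int(raw['tds_input'] > 500)
    return features


raw_reading = {
    'tds_input': 500, 'tds_output': 85, 'flow_rate_lpm': 1.1, 'pressure_psi': 42,
    'temperature_c': 30, 'filter_age_days': 250, 'daily_usage_liters': 25,
    'total_usage_liters': 6250, 'sediment_filter_age_days': 90,
    'membrane_status': 'degraded', 'region': 'East',
}
features = prepare_reading(raw_reading)
print(features)
```

The returned dict can be passed to `check_filter_health(features, best_model, X.columns)`, which keeps the serving-time feature logic in one place instead of hand-computing it per reading.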

---
## Summary

### What We Built
A complete ML pipeline that:
1. Generates synthetic water filter sensor data (10,000 records, via `01_generate_water_filter_data.ipynb`)
2. Explores and visualizes the data
3. Engineers useful features
4. Trains and compares 4 different ML models
5. Tunes the best model's hyperparameters
6. Evaluates performance with proper metrics
7. Builds a practical alert system

### Key Learnings
- EDA is essential before modeling
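- Feature engineering can improve results, but compare models with and without the new features to confirm it
- Recall matters more than accuracy for safety-critical applications
- Random Forest / Gradient Boosting often outperform simpler models on tabular data like this, so always check the comparison table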

### Real-World Next Steps
- Connect to actual IoT sensor data
- Build a REST API (your Laravel skills apply here!)
- Deploy as a scheduled job checking all filters daily
- Send push notifications / emails for alerts
- Retrain model periodically with new data