{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "intro",
   "metadata": {},
   "source": [
    "# 🌊 Water Quality Prediction\n",
    "## Internship Project - Shell-Edunet Skills4Future AI-ML (June–July 2025)\n",
    "\n",
    "> **“Turning chemical signals into clarity — one prediction at a time.”**\n",
    "\n",
    "## 📌 Project Overview\n",
    "\n",
    "Developed during the Shell-Edunet Skills4Future Internship, this project predicts water quality parameters using machine learning on historical data from Punjab (2000–2021). The model estimates pollutant levels to support sustainable water management and environmental protection, addressing the growing need for clean water amidst pollution and climate change.\n",
    "\n",
    "## ⚙️ Technologies Used\n",
    "- **Python** — Core programming language\n",
    "- **Pandas** — Data manipulation and cleaning\n",
    "- **NumPy** — Numerical computations\n",
    "- **Matplotlib & Seaborn** — Data visualization\n",
    "- **Scikit-learn** — Machine learning algorithms and evaluation\n",
    "\n",
    "## 💧 Predicted Water Quality Parameters\n",
    "- **O₂** (Dissolved Oxygen)\n",
    "- **NO₃** (Nitrates)\n",
    "- **NO₂** (Nitrites)\n",
    "- **SO₄** (Sulfates)\n",
    "- **PO₄** (Phosphates)\n",
    "- **Cl⁻** (Chlorides)\n",
    "\n",
    "## 🤖 Model & Methodology\n",
    "- **Model**: `RandomForestRegressor` wrapped in `MultiOutputRegressor`\n",
    "- **Features**: NH₄, BSK5 (five-day biochemical oxygen demand), Suspended Solids, Year, Month\n",
    "- **Missing Data**: Median imputation\n",
    "- **Data Split**: 80% training, 20% testing\n",
    "- **Evaluation Metrics**: R² Score, Mean Squared Error\n",
    "\n",
    "**Built with 💻, ☕, and a sip of clean water by Abhinandan Kumar**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "install-libs",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ STEP 1: Install & Import Required Libraries\n",
    "# !pip install pandas numpy matplotlib seaborn scikit-learn  # Uncomment if needed\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.multioutput import MultiOutputRegressor\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import mean_squared_error, r2_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "load-explore",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ STEP 2: Load and Explore Dataset\n",
    "print(\"\\n--- Loading Dataset ---\")\n",
    "df = pd.read_csv('PB_All_2000_2021.csv', sep=';')\n",
    "print(f\"\\nDataset Shape: {df.shape}\")\n",
    "\n",
    "# Initial Exploration\n",
    "print(\"\\n--- Dataset Info ---\")\n",
    "df.info()  # info() prints its report directly and returns None\n",
    "\n",
    "print(\"\\n--- Missing Values ---\")\n",
    "print(df.isnull().sum())\n",
    "\n",
    "print(\"\\n--- Statistical Summary ---\")\n",
    "print(df.describe().T)\n",
    "\n",
    "print(\"\\n--- Sample Data (Before Processing) ---\")\n",
    "print(df.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "clean-engineer",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ STEP 3: Data Cleaning & Feature Engineering\n",
    "print(\"\\n--- Cleaning & Feature Engineering ---\")\n",
    "\n",
    "# Convert date column\n",
    "df['date'] = pd.to_datetime(df['date'], format='%d.%m.%Y')\n",
    "\n",
    "# Extract date features\n",
    "df['year'] = df['date'].dt.year\n",
    "df['month'] = df['date'].dt.month\n",
    "\n",
    "# Sort by station and time\n",
    "df = df.sort_values(by=['id', 'date'])\n",
    "\n",
    "# Handle missing values: fill each numeric column with its median (computed over the full dataset)\n",
    "print(\"\\nPerforming median imputation for missing values...\")\n",
    "df.fillna(df.median(numeric_only=True), inplace=True)\n",
    "\n",
    "# Verify cleaning\n",
    "print(\"\\n--- Missing Values After Cleaning ---\")\n",
    "print(df.isnull().sum())\n",
    "\n",
    "print(\"\\n--- Sample Data (After Processing) ---\")\n",
    "print(df.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "features-targets",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ STEP 4: Define Features and Targets\n",
    "features = ['NH4', 'BSK5', 'Suspended', 'year', 'month']\n",
    "targets = ['O2', 'NO3', 'NO2', 'SO4', 'PO4', 'CL']\n",
    "\n",
    "X = df[features]\n",
    "y = df[targets]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "split-data",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ STEP 5: Split into Train and Test Sets\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42\n",
    ")\n",
    "print(f\"\\nTrain/Test Split: {X_train.shape[0]} train, {X_test.shape[0]} test samples\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "train-model",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ STEP 6: Train MultiOutput RandomForest Regressor\n",
    "print(\"\\n--- Model Training ---\")\n",
    "model = MultiOutputRegressor(RandomForestRegressor(random_state=42))\n",
    "model.fit(X_train, y_train)\n",
    "print(\"Training completed!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "evaluate-model",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ STEP 7: Make Predictions and Evaluate\n",
    "print(\"\\n--- Model Evaluation ---\")\n",
    "y_pred = model.predict(X_test)\n",
    "\n",
    "# Store performance metrics\n",
    "performance_data = []\n",
    "for i, col in enumerate(targets):\n",
    "    r2 = r2_score(y_test[col], y_pred[:, i])\n",
    "    mse = mean_squared_error(y_test[col], y_pred[:, i])\n",
    "    performance_data.append({'Parameter': col, 'R² Score': r2, 'Mean Squared Error': mse})\n",
    "    print(f\"\\n--- {col} ---\")\n",
    "    print(f\"R² Score: {r2:.4f}\")\n",
    "    print(f\"MSE: {mse:.4f}\")\n",
    "\n",
    "# Display performance table\n",
    "performance_df = pd.DataFrame(performance_data)\n",
    "print(\"\\nModel Performance Snapshot:\")\n",
    "print(performance_df)"
   ]
  },
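  {
   "cell_type": "markdown",
   "id": "rmse-note",
   "metadata": {},
   "source": [
    "MSE is expressed in the squared units of each parameter, which makes it hard to compare against the measurements themselves. Taking the square root gives RMSE in the original units. The sketch below uses made-up numbers (`y_true_demo` and `y_hat_demo` are illustrative, not values from the dataset) to show the relationship."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rmse-demo",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: RMSE as the square root of MSE (synthetic numbers)\n",
    "import numpy as np\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "# Hypothetical actual and predicted values, not taken from the dataset\n",
    "y_true_demo = np.array([3.0, 5.0, 7.0])\n",
    "y_hat_demo = np.array([2.0, 5.0, 9.0])\n",
    "\n",
    "mse_demo = mean_squared_error(y_true_demo, y_hat_demo)  # mean of squared errors\n",
    "rmse_demo = np.sqrt(mse_demo)  # back in the units of the target\n",
    "print(f'MSE: {mse_demo:.4f}, RMSE: {rmse_demo:.4f}')"
   ]
  },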
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "visualize-results",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ✅ STEP 8: Visualize Predictions\n",
    "print(\"\\nGenerating visualizations...\")\n",
    "sns.set_theme()  # the 'seaborn' Matplotlib style name was removed in Matplotlib 3.8\n",
    "\n",
    "for i, col in enumerate(targets):\n",
    "    plt.figure(figsize=(8, 4))\n",
    "    sns.regplot(x=y_test[col], y=y_pred[:, i],\n",
    "                scatter_kws={'alpha': 0.5, 'color': 'blue', 'edgecolor': 'k'},\n",
    "                line_kws={'color': 'red'})\n",
    "    plt.xlabel(f\"Actual {col}\")\n",
    "    plt.ylabel(f\"Predicted {col}\")\n",
    "    plt.title(f\"{col}: Actual vs Predicted (R² = {r2_score(y_test[col], y_pred[:, i]):.2f})\")\n",
    "    plt.grid(True)\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
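  {
   "cell_type": "markdown",
   "id": "pipeline-note",
   "metadata": {},
   "source": [
    "The median imputation in Step 3 takes its medians from the whole dataset, including rows that later end up in the test set. One possible alternative, sketched here under the assumption that scikit-learn's `SimpleImputer` and `Pipeline` are acceptable, learns the feature medians from the training split only. The data is synthetic and the names `X_demo`, `y_demo`, and `pipe` are illustrative; this is not the notebook's dataset or its results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "pipeline-demo",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: impute feature medians inside a pipeline (synthetic data)\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.multioutput import MultiOutputRegressor\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "\n",
    "# Hypothetical features and two targets; the targets are built before any gaps are introduced\n",
    "X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['NH4', 'BSK5', 'Suspended'])\n",
    "y_demo = pd.DataFrame({\n",
    "    'O2': 2 * X_demo['NH4'] + rng.normal(size=200),\n",
    "    'NO3': X_demo['BSK5'] - X_demo['Suspended'],\n",
    "})\n",
    "\n",
    "# Blank out 20 NH4 values to mimic missing measurements\n",
    "X_demo.loc[rng.choice(200, size=20, replace=False), 'NH4'] = np.nan\n",
    "\n",
    "Xd_train, Xd_test, yd_train, yd_test = train_test_split(\n",
    "    X_demo, y_demo, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "# The imputer is fitted on the training split only, then reused on the test split\n",
    "pipe = Pipeline([\n",
    "    ('impute', SimpleImputer(strategy='median')),\n",
    "    ('model', MultiOutputRegressor(RandomForestRegressor(n_estimators=50, random_state=42))),\n",
    "])\n",
    "pipe.fit(Xd_train, yd_train)\n",
    "print(pipe.predict(Xd_test).shape)"
   ]
  },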
  {
   "cell_type": "markdown",
   "id": "conclusion",
   "metadata": {},
   "source": [
    "## ✅ Final Notes\n",
    "\n",
    "This project trains a multi-output random forest to predict six water quality parameters (O₂, NO₃, NO₂, SO₄, PO₄, Cl⁻) from NH₄, BSK5, suspended solids, year, and month, in support of sustainable water management. Adding features such as rainfall, land use, or industrial activity data could improve accuracy and make real-time applications more practical.\n",
    "\n",
    "**Built with 💻, ☕, and a sip of clean water by Abhinandan Kumar**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}