# 🌧️ Project 2.0: Rainfall/Drought Forecasting

**Goal:** Use time series analysis (e.g., ARIMA, Prophet, or LSTMs) to forecast future rainfall patterns. The forecasts provide predictive input for the Water Quality Model and support drought early-warning systems.

**Owner:** Jetty AI Lab

## 🛠️ Notebook Series Plan

| Step | Description | Status |
| :--- | :--- | :--- |
| **2.0** | **Data Acquisition & Cleaning** | Current |
| **2.1** | **Exploratory Data Analysis (EDA)** | Next |
| **2.2** | **Time Series Decomposition & Stationarity Check** | To Do |
| **2.3** | **Model Training (Prophet / ARIMA / LSTM)** | To Do |
| **2.4** | **Model Evaluation & Forecasting** | To Do |

## 📁 Data Source & Structure

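**Source:** Historical meteorological data (rainfall readings). The raw CSV stores `Year`, `Month`, `Day`, and `Precipitation` columns; the table below describes the cleaned time series produced in this notebook.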

| Column Name | Data Type | Description |
| :--- | :--- | :--- |
| **Date** | Datetime | The timestamp for the rainfall reading. **(Will be set as the index)** |
| **Rainfall_mm** | Float | The measured rainfall in millimeters (the target variable for forecasting). |
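
Before modelling, the cleaned frame can be checked against the schema in the table above. The following is a minimal sketch and not part of the original notebook: the helper name `validate_rainfall_frame` and the rule that rainfall must be non-negative are assumptions made for illustration, and the example values are made up.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def validate_rainfall_frame(df: pd.DataFrame) -> list:
    """Return a list of schema problems (empty if the frame matches the table)."""
    problems = []
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    if "Rainfall_mm" not in df.columns:
        problems.append("missing column 'Rainfall_mm'")
        return problems
    if not is_float_dtype(df["Rainfall_mm"]):
        problems.append("'Rainfall_mm' is not a float column")
    elif (df["Rainfall_mm"] < 0).any():
        # Assumption: negative rainfall is treated as a data error.
        problems.append("'Rainfall_mm' contains negative values")
    return problems


# Tiny synthetic example (made-up values, not the project data).
example = pd.DataFrame(
    {"Rainfall_mm": [0.0, 1.5, 12.3]},
    index=pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"]),
)
example.index.name = "Date"
print(validate_rainfall_frame(example))
```

An empty list means the frame matches the documented structure; any other result names the mismatch to fix before moving on.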

In [49]:
import pandas as pd
import numpy as np

# 1. Load Raw Data (Re-run from the last successful point)
data_path = '~/jetty-ai-lab/projects/water-climate/data/raw/rainfall_data.csv'

try:
    # Read the data
    df_raw = pd.read_csv(data_path)
    
    # 2. Create a Datetime Column and Index (Cleanup)
    # Combine the three columns into a single datetime object
    df_raw['Date'] = pd.to_datetime(df_raw[['Year', 'Month', 'Day']])
    
    # Set the Date as the index
    df_raw.set_index('Date', inplace=True)
    
    # 3. Select Only the Target Variable and Rename
    # We isolate the 'Precipitation' column (our target) and rename it for clarity.
    df_ts = df_raw[['Precipitation']].copy()
    df_ts.rename(columns={'Precipitation': 'Rainfall_mm'}, inplace=True)
    
    # 4. Final Review
    print("✅ Time Series Data Ready. Initial 5 rows:")
    print(df_ts.head())
    print("\nFinal Data Info:")
    df_ts.info()

except FileNotFoundError:
    print(f"❌ Error: Data file not found at {data_path}. Please check the filename and path.")
except KeyError as e:
    # Raised if 'Year', 'Month', 'Day', or 'Precipitation' is missing or renamed in the CSV.
    print(f"❌ Error: A required column was not found in the CSV: {e}")
    

✅ Time Series Data Ready. Initial 5 rows:
            Rainfall_mm
Date                   
2000-01-01         0.00
2000-02-01         0.11
2000-03-01         0.01
2000-04-01         0.02
2000-05-01       271.14

Final Data Info:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 252 entries, 2000-01-01 to 2020-12-01
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Rainfall_mm  252 non-null    float64
dtypes: float64(1)
memory usage: 3.9 KB
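
The summary above shows 252 month-start timestamps from 2000-01-01 to 2020-12-01, which suggests a regular monthly series. One way to confirm that no months are missing is sketched below. It runs on a synthetic stand-in with the same index shape (the values are made up), and the `"MS"` (month-start) frequency is an assumption based on the printed dates.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same index shape as the output above:
# month-start timestamps from 2000-01-01 to 2020-12-01 (values are made up).
idx = pd.date_range("2000-01-01", "2020-12-01", freq="MS")
demo = pd.DataFrame({"Rainfall_mm": np.linspace(0.0, 100.0, len(idx))}, index=idx)
demo.index.name = "Date"

# Infer the sampling frequency from the timestamps.
inferred = pd.infer_freq(demo.index)

# Reindex onto a complete month-start calendar; gaps would appear as NaN rows.
regular = demo.asfreq("MS")
n_missing = int(regular["Rainfall_mm"].isna().sum())

print(f"Inferred frequency: {inferred}")
print(f"Rows: {len(regular)}, missing months: {n_missing}")
```

On the real data, the same two calls could be applied to `df_ts` in place of `demo`.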


In [50]:
# Save the cleaned time series data to the processed folder
output_path = '~/jetty-ai-lab/projects/water-climate/data/processed/rainfall_timeseries.csv'
df_ts.to_csv(output_path)
print(f"✅ Cleaned time series data saved to: {output_path}")

✅ Cleaned time series data saved to: ~/jetty-ai-lab/projects/water-climate/data/processed/rainfall_timeseries.csv
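
When a later notebook reloads the processed file, the `Date` index is not restored automatically because CSV stores it as text. The sketch below shows one way to read it back as a `DatetimeIndex`. It uses a temporary file and made-up values rather than the project path, so the names `original`, `reloaded`, and `tmp_path` are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up values standing in for df_ts.
original = pd.DataFrame(
    {"Rainfall_mm": [0.0, 0.11, 0.01]},
    index=pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"]),
)
original.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "rainfall_timeseries.csv")
    original.to_csv(tmp_path)
    # Without index_col/parse_dates, 'Date' would come back as a plain text column.
    reloaded = pd.read_csv(tmp_path, index_col="Date", parse_dates=True)

values_match = np.allclose(reloaded["Rainfall_mm"], original["Rainfall_mm"])
print(type(reloaded.index).__name__, values_match)
```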


In [52]:
from statsmodels.tsa.stattools import adfuller

def adf_test(series):
    # Perform the ADF test
    result = adfuller(series.dropna())
    
    # Extract and display key results
    print('Augmented Dickey-Fuller Test Results:')
    print(f'ADF Statistic: {result[0]:.4f}')
    print(f'p-value: {result[1]:.4f}')
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'\t{key}: {value:.4f}')
    
    # Interpretation
    # H0 (null hypothesis): the series has a unit root, i.e. it is NOT stationary.
    if result[1] <= 0.05:
        print("\nConclusion: 🟢 REJECT H0. The time series is LIKELY stationary.")
    else:
        print("\nConclusion: 🔴 FAIL TO REJECT H0. The time series may be NON-STATIONARY (no evidence against a unit root).")

# Run the test on your rainfall data
adf_test(df_ts['Rainfall_mm'])

ModuleNotFoundError: No module named 'statsmodels'