
In [2]:
# Step 1: Install required packages
!pip install kagglehub geopy --no-cache-dir

# Step 2: Import libraries
import kagglehub
import os
import zipfile
import pandas as pd
from geopy.geocoders import Nominatim
import time




In [3]:
# Step 3: Download dataset using KaggleHub
uk_rainfall = kagglehub.dataset_download("jakewright/2m-daily-weather-history-uk")
print("✅ Dataset downloaded to:", uk_rainfall)

✅ Dataset downloaded to: /kaggle/input/2m-daily-weather-history-uk


In [4]:
# Step 4: Check for ZIP file and extract it
zip_files = [f for f in os.listdir(uk_rainfall) if f.endswith('.zip')]
if zip_files:
    zip_path = os.path.join(uk_rainfall, zip_files[0])
    print("📦 Found ZIP file:", zip_path)

    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(uk_rainfall)
    print("✅ ZIP file extracted.")

In [5]:
# Step 5: Find CSV file
csv_file_path = None
for root, dirs, files in os.walk(uk_rainfall):
    for file in files:
        if file.endswith('.csv'):
            csv_file_path = os.path.join(root, file)
            break

if not csv_file_path:
    raise FileNotFoundError("❌ CSV file not found in the dataset.")

print("📄 Using CSV file:", csv_file_path)

📄 Using CSV file: /kaggle/input/2m-daily-weather-history-uk/all_weather_data.csv


In [6]:
# Step 6: Load CSV
df = pd.read_csv(csv_file_path)


In [7]:
# Step 7: Handle dates
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df = df.dropna(subset=['date'])

start_date = df['date'].min()
end_date = df['date'].max()

# Step 8: Output date range
print(f"📅 Data covers from {start_date.date()} to {end_date.date()}")

📅 Data covers from 2009-01-01 to 2024-11-12


In [8]:
# Now check unique locations
num_unique_locations = len(df['location'].unique())
print(f"Number of unique locations: {num_unique_locations}")

Number of unique locations: 504


In [9]:
from geopy.geocoders import Nominatim
import time

geolocator = Nominatim(user_agent="uk-weather-study")

# Map each location name to a (latitude, longitude) tuple; later cells
# look coordinates up with location_coords.get(loc, (None, None)).
location_coords = {}

for loc in df['location'].dropna().unique():
    try:
        location = geolocator.geocode(loc + ", UK")
        if location:
            location_coords[loc] = (location.latitude, location.longitude)
        else:
            location_coords[loc] = (None, None)
    except Exception as e:
        print(f"Error for {loc}: {e}")
        location_coords[loc] = (None, None)
    time.sleep(1)  # Nominatim's usage policy allows at most one request per second




In [12]:
# Assuming location_coords = {location: (lat, lon), ...}

# Create a DataFrame for locations and their coordinates
loc_coords_df = pd.DataFrame([
    {'location': loc, 'latitude': coords[0], 'longitude': coords[1]}
    for loc, coords in location_coords.items()
])

# Filter locations inside UK bounding box
uk_loc_df = loc_coords_df[
    (loc_coords_df['latitude'] >= 49.5) & (loc_coords_df['latitude'] <= 61.0) &
    (loc_coords_df['longitude'] >= -8.5) & (loc_coords_df['longitude'] <= 2.0)
]

uk_locations = uk_loc_df['location'].tolist()

print(f"Number of locations within UK bounds: {len(uk_locations)}")


Number of locations within UK bounds: 477


In [13]:
# Aggregate rainfall stats by location
rainfall_stats = df.groupby('location')['rain mm'].agg(['min', 'mean', 'max']).reset_index()
rainfall_stats.rename(columns={'min':'min_rain_mm', 'mean':'avg_rain_mm', 'max':'max_rain_mm'}, inplace=True)


In [14]:
rainfall_stats['latitude'] = rainfall_stats['location'].apply(lambda loc: location_coords.get(loc, (None, None))[0])
rainfall_stats['longitude'] = rainfall_stats['location'].apply(lambda loc: location_coords.get(loc, (None, None))[1])

# Drop rows without coordinates
rainfall_stats = rainfall_stats.dropna(subset=['latitude', 'longitude'])


In [29]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

# --- STEP 1: Prepare Data ---

# Add latitude and longitude from the location_coords dictionary
rainfall_stats['latitude'] = rainfall_stats['location'].map(
    lambda loc: location_coords.get(loc, (None, None))[0]
)
rainfall_stats['longitude'] = rainfall_stats['location'].map(
    lambda loc: location_coords.get(loc, (None, None))[1]
)

# Keep only rows with valid coordinates and average rainfall
rainfall_stats = rainfall_stats.dropna(subset=['latitude', 'longitude', 'avg_rain_mm'])

# Extract arrays of lat, lon, and rainfall
lats = rainfall_stats['latitude'].values
lons = rainfall_stats['longitude'].values
rains = rainfall_stats['avg_rain_mm'].astype(float).values

# --- STEP 2: Create the Heatmap Layer ---

# Use Densitymapbox for heatmap-style shading
heatmap = go.Densitymapbox(
    lat=lats,
    lon=lons,
    z=rains,               # Use average rainfall to control intensity
    radius=30,             # Controls smoothness (bigger = smoother heat)
    colorscale='Blues',    # Color gradient for rainfall
    hovertemplate='Rainfall: %{z:.2f} mm<extra></extra>'
)

# --- STEP 3: Display Map ---

fig = go.Figure(heatmap)

fig.update_layout(
    mapbox=dict(
        style='carto-positron',            # Light map style
        center=dict(lat=54.5, lon=-3),     # Centered over the UK
        zoom=5.5                           # Good zoom level for UK
    ),
    title="UK Rainfall Heatmap (Average Rainfall in mm)",
    margin=dict(r=0, t=50, l=0, b=0),
    height=600
)

fig.show()


In [119]:
#  2. Time-based Split into Train / Validation / Test
# We split based on weeks (after aggregation), preserving time order to avoid data leakage.

# Sort by date
df_weekly = df_weekly.sort_values('date')

# Get unique sorted weeks
weeks = df_weekly['date'].sort_values().unique()

# Define split indices
train_end = int(len(weeks) * 0.7)
val_end = int(len(weeks) * 0.85)

train_weeks = weeks[:train_end]
val_weeks = weeks[train_end:val_end]
test_weeks = weeks[val_end:]

# Split dataframes by weeks
train_df = df_weekly[df_weekly['date'].isin(train_weeks)]
val_df = df_weekly[df_weekly['date'].isin(val_weeks)]
test_df = df_weekly[df_weekly['date'].isin(test_weeks)]

print(f"Train weeks: {len(train_weeks)}")
print(f"Validation weeks: {len(val_weeks)}")
print(f"Test weeks: {len(test_weeks)}")

print("Train sample:")
print(train_df.head())



Train weeks: 580
Validation weeks: 124
Test weeks: 125
Train sample:
            location       date  rain mm
0         Abengourou 2009-01-04      0.8
158154    Mount Sion 2009-01-04      0.0
158961  Mountain Ash 2009-01-04      0.0
159768   Mousley End 2009-01-04      0.0
95208        Kearney 2009-01-04      0.0




In [124]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

# --- 1. Load and clean data ---
# Assume 'df' is already loaded from your CSV with at least columns 'date' and 'rainfall'
# Example:
# df = pd.read_csv('path_to_csv')
# df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df = df.dropna(subset=['date'])

# --- 2. Aggregate daily rainfall to weekly sums ---
df['week_start'] = df['date'] - pd.to_timedelta(df['date'].dt.dayofweek, unit='d')
df_weekly = df.groupby('week_start').agg({'rainfall': 'sum'}).reset_index()

# Rename 'week_start' to 'date' so later code uses df_weekly['date']
df_weekly = df_weekly.rename(columns={'week_start': 'date'})

# Sort weekly data by date ascending
df_weekly = df_weekly.sort_values('date').reset_index(drop=True)

# --- 3. Split data by time (weeks) ---
weeks = df_weekly['date'].values
train_end = int(len(weeks) * 0.7)
val_end = int(len(weeks) * 0.85)

train_weeks = weeks[:train_end]
val_weeks = weeks[train_end:val_end]
test_weeks = weeks[val_end:]

train_mask = df_weekly['date'].isin(train_weeks)
val_mask = df_weekly['date'].isin(val_weeks)
test_mask = df_weekly['date'].isin(test_weeks)

train_data = df_weekly[train_mask]['rainfall'].values
val_data = df_weekly[val_mask]['rainfall'].values
test_data = df_weekly[test_mask]['rainfall'].values

# --- 4. Normalize rainfall data with MinMaxScaler ---
scaler = MinMaxScaler(feature_range=(0, 1))
train_data_scaled = scaler.fit_transform(train_data.reshape(-1, 1)).flatten()
val_data_scaled = scaler.transform(val_data.reshape(-1, 1)).flatten()
test_data_scaled = scaler.transform(test_data.reshape(-1, 1)).flatten()

# --- 5. Create sequences for Seq2Seq ---
def create_sequences(data, past_len, future_len):
    X, y = [], []
    for i in range(len(data) - past_len - future_len + 1):
        X.append(data[i:i+past_len])
        y.append(data[i+past_len:i+past_len+future_len])
    return np.array(X), np.array(y)

PAST_SEQ_LEN = 4
FUTURE_SEQ_LEN = 2

X_train, y_train = create_sequences(train_data_scaled, PAST_SEQ_LEN, FUTURE_SEQ_LEN)
X_val, y_val = create_sequences(val_data_scaled, PAST_SEQ_LEN, FUTURE_SEQ_LEN)
X_test, y_test = create_sequences(test_data_scaled, PAST_SEQ_LEN, FUTURE_SEQ_LEN)

# Add feature dimension (LSTM expects 3D inputs)
X_train = X_train[..., np.newaxis]
y_train = y_train[..., np.newaxis]
X_val = X_val[..., np.newaxis]
y_val = y_val[..., np.newaxis]
X_test = X_test[..., np.newaxis]
y_test = y_test[..., np.newaxis]

# --- 6. Build Seq2Seq LSTM model ---
encoder_inputs = Input(shape=(PAST_SEQ_LEN, 1))
encoder_lstm = LSTM(64, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(FUTURE_SEQ_LEN, 1))
decoder_lstm = LSTM(64, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(1)
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='mse')

model.summary()

# --- 7. Prepare decoder input sequences (teacher forcing) ---
def prepare_decoder_inputs(y):
    decoder_input = np.zeros_like(y)
    decoder_input[:, 1:, :] = y[:, :-1, :]
    return decoder_input

decoder_train_input = prepare_decoder_inputs(y_train)
decoder_val_input = prepare_decoder_inputs(y_val)

# --- 8. Train the model ---
history = model.fit(
    [X_train, decoder_train_input], y_train,
    validation_data=([X_val, decoder_val_input], y_val),
    epochs=30,
    batch_size=16
)

# --- 9. Evaluate on test set ---
decoder_test_input = prepare_decoder_inputs(y_test)
test_loss = model.evaluate([X_test, decoder_test_input], y_test)
print(f"Test Loss (MSE): {test_loss}")

# --- 10. Predict and invert normalization ---
y_pred_scaled = model.predict([X_test, decoder_test_input])

y_pred_reshaped = y_pred_scaled.reshape(-1, 1)
y_pred_inversed = scaler.inverse_transform(y_pred_reshaped).reshape(y_pred_scaled.shape)

print("Sample predicted rainfall (unscaled):")
print(y_pred_inversed[:3])


KeyError: 'date'

# Task
Build a rainfall forecasting model using a Seq2Seq LSTM architecture, train it on the provided dataset, evaluate its performance, and visualize the predictions against the actual values.

## Process data for seq2seq model

### Subtask:
Continue with the data preparation steps in cell `jQXCFSRHLw3U`, which include aggregating daily rainfall to weekly sums, splitting the data into training, validation, and test sets based on time, normalizing the rainfall data, and creating sequences for the Seq2Seq model.


**Reasoning**:
The previous cell failed with `KeyError: 'date'`. It referenced `df['date']` and a `rainfall` column, whereas the weekly data already lives in `df_weekly` with the columns `location`, `date`, and `rain mm`. I need to correct this by making the aggregation and the subsequent steps operate on the correct DataFrame and column names. I will rewrite the data preparation steps so that the weekly aggregation sums the `rain mm` column of `df_weekly`, and the splitting and sequence-creation steps use the `rain mm` values from the resulting weekly DataFrame.



In [125]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

# --- 1. Aggregate daily rainfall to weekly sums ---
# Ensure we are using the correct DataFrame and column names.
# df_weekly is already created in a previous step with columns 'location', 'date', 'rain mm'
# We need to aggregate this by week and sum 'rain mm'.

# Assuming df_weekly from the previous steps is the starting point
# Calculate week start from the 'date' column in df_weekly
df_weekly['week_start'] = df_weekly['date'] - pd.to_timedelta(df_weekly['date'].dt.dayofweek, unit='d')

# Group by week_start and sum 'rain mm'
df_weekly_agg = df_weekly.groupby('week_start').agg({'rain mm': 'sum'}).reset_index()

# Rename 'week_start' to 'date' for consistency with later steps
df_weekly_agg = df_weekly_agg.rename(columns={'week_start': 'date'})

# Sort weekly data by date ascending
df_weekly_agg = df_weekly_agg.sort_values('date').reset_index(drop=True)

# --- 2. Split data by time (weeks) ---
weeks = df_weekly_agg['date'].values
train_end = int(len(weeks) * 0.7)
val_end = int(len(weeks) * 0.85)

train_weeks = weeks[:train_end]
val_weeks = weeks[train_end:val_end]
test_weeks = weeks[val_end:]

train_mask = df_weekly_agg['date'].isin(train_weeks)
val_mask = df_weekly_agg['date'].isin(val_weeks)
test_mask = df_weekly_agg['date'].isin(test_weeks)

# Extract the 'rain mm' values for each set
train_data = df_weekly_agg[train_mask]['rain mm'].values
val_data = df_weekly_agg[val_mask]['rain mm'].values
test_data = df_weekly_agg[test_mask]['rain mm'].values

# --- 3. Normalize rainfall data with MinMaxScaler ---
scaler = MinMaxScaler(feature_range=(0, 1))
train_data_scaled = scaler.fit_transform(train_data.reshape(-1, 1)).flatten()
val_data_scaled = scaler.transform(val_data.reshape(-1, 1)).flatten()
test_data_scaled = scaler.transform(test_data.reshape(-1, 1)).flatten()

# --- 4. Create sequences for Seq2Seq ---
def create_sequences(data, past_len, future_len):
    X, y = [], []
    for i in range(len(data) - past_len - future_len + 1):
        X.append(data[i:i+past_len])
        y.append(data[i+past_len:i+past_len+future_len])
    return np.array(X), np.array(y)

PAST_SEQ_LEN = 4
FUTURE_SEQ_LEN = 2

X_train, y_train = create_sequences(train_data_scaled, PAST_SEQ_LEN, FUTURE_SEQ_LEN)
X_val, y_val = create_sequences(val_data_scaled, PAST_SEQ_LEN, FUTURE_SEQ_LEN)
X_test, y_test = create_sequences(test_data_scaled, PAST_SEQ_LEN, FUTURE_SEQ_LEN)

# Add feature dimension (LSTM expects 3D inputs)
X_train = X_train[..., np.newaxis]
y_train = y_train[..., np.newaxis]
X_val = X_val[..., np.newaxis]
y_val = y_val[..., np.newaxis]
X_test = X_test[..., np.newaxis]
y_test = y_test[..., np.newaxis]

print("Data preparation complete.")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

Data preparation complete.
X_train shape: (575, 4, 1)
y_train shape: (575, 2, 1)
X_val shape: (119, 4, 1)
y_val shape: (119, 2, 1)
X_test shape: (120, 4, 1)
y_test shape: (120, 2, 1)
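
The shapes above follow from the window arithmetic: a series of length n yields n - past_len - future_len + 1 windows, so 580, 124, and 125 weeks give 575, 119, and 120 sequences. The sketch below checks this with NumPy's `sliding_window_view`. It is an alternative to the loop in `create_sequences`, offered as an illustration rather than as the notebook's method; `make_windows` is a hypothetical helper name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, past_len, future_len):
    """Split a 1-D series into (past, future) windows without a Python loop."""
    windows = sliding_window_view(series, past_len + future_len)
    return windows[:, :past_len], windows[:, past_len:]

# Series lengths taken from the train / validation / test week counts
for n_weeks in (580, 124, 125):
    X, y = make_windows(np.arange(n_weeks, dtype=float), past_len=4, future_len=2)
    print(n_weeks, X.shape, y.shape)
```

Because each split is windowed separately, the first four weeks of the validation and test splits only ever appear as inputs, never as targets.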


**Reasoning**:
The data preparation steps are complete, including weekly aggregation, splitting, normalization, and sequence creation. The shapes of the prepared data are also printed, confirming the successful execution of the previous steps. The next logical step in the subtask is to build and train the Seq2Seq LSTM model. I will now define and compile the model architecture as described in the instructions.



In [126]:
# --- Build Seq2Seq LSTM model ---
encoder_inputs = Input(shape=(PAST_SEQ_LEN, 1))
encoder_lstm = LSTM(64, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(FUTURE_SEQ_LEN, 1))
# Use the encoder states as initial states for the decoder
decoder_lstm = LSTM(64, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(1)
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='mse')

model.summary()

**Reasoning**:
The Seq2Seq model architecture is defined and compiled. The next step is to prepare the decoder input sequences for training using teacher forcing, which involves shifting the target sequences one time step forward and padding the first time step with zeros. After preparing the decoder inputs, I will train the model using the prepared training and validation data.
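
As a small worked example of the shift described above: with a future length of 2, the decoder receives `[0, y_0]` and is trained to output `[y_0, y_1]`. The values in `y_demo` are made up, and `shift_targets` is a hypothetical name for the same operation.

```python
import numpy as np

def shift_targets(y):
    """Return decoder inputs: targets delayed by one step, zero at step 0."""
    decoder_input = np.zeros_like(y)
    decoder_input[:, 1:, :] = y[:, :-1, :]
    return decoder_input

# Made-up scaled targets with shape (batch=2, steps=2, features=1)
y_demo = np.array([[[0.3], [0.7]], [[0.1], [0.9]]])
print(shift_targets(y_demo)[:, :, 0])
```

One design point to keep in mind: after min-max scaling, zero is also a valid rainfall value (the training minimum), so the start token is not distinguishable from a genuinely dry week.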



In [127]:
# --- Prepare decoder input sequences (teacher forcing) ---
def prepare_decoder_inputs(y):
    decoder_input = np.zeros_like(y)
    # Shift target sequences one step forward
    decoder_input[:, 1:, :] = y[:, :-1, :]
    return decoder_input

decoder_train_input = prepare_decoder_inputs(y_train)
decoder_val_input = prepare_decoder_inputs(y_val)

# --- Train the model ---
history = model.fit(
    [X_train, decoder_train_input], y_train,
    validation_data=([X_val, decoder_val_input], y_val),
    epochs=30,
    batch_size=16
)

Epoch 1/30
36/36 ━━━━━━━━━━━━━━━━━━━━ 8s 35ms/step - loss: 0.0551 - val_loss: 0.0506
Epoch 2/30
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - loss: 0.0325 - val_loss: 0.0494
Epoch 3/30
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - loss: 0.0321 - val_loss: 0.0489
Epoch 4/30
36/36 ━━━━━━━━━━━━━━━━━━━━ 1s 8ms/step - loss: 0.0316 - val_loss: 0.0484
Epoch 5/30
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.0345 - val_loss: 0.0482
Epoch 6/30
36/36 ━━━━━━━━━━━━━━━━━━━━ 1s 13ms/step - loss: 0.0326 - val_loss: 0.0477
Epoch 7/30
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step - loss: 0.0324 - val_loss: 0.0484
Epoch 8/30
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - loss: 0.0299 - val_loss: 0.0476
Epoch 9/30
36/36 ━━━━━━━━━━━━━━━━━━━━

**Reasoning**:
The model training is complete. The next step is to evaluate the model's performance on the test set and then make predictions. Finally, I need to invert the normalization on the predictions to get the rainfall values back in their original scale.
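
At prediction time the true future values are unavailable, so the decoder has to consume its own outputs instead of teacher-forced targets. The sketch below shows that loop in isolation. `step_fn(prev_value, state)` is a hypothetical stand-in for a single decoder step, and `toy_step` with its arithmetic is invented purely so the example runs without TensorFlow. Starting from a zero token mirrors how the training inputs are padded; that is an assumption about what is desirable, not a description of what the notebook does.

```python
import numpy as np

def greedy_decode(step_fn, init_state, n_steps, start_token=0.0):
    """Feed each prediction back in as the next decoder input."""
    prev, state, outputs = start_token, init_state, []
    for _ in range(n_steps):
        prev, state = step_fn(prev, state)  # one decoder step
        outputs.append(prev)
    return np.array(outputs)

def toy_step(prev_value, state):
    """Hypothetical stand-in for one decoder step: a leaky running average."""
    new_state = 0.5 * state + 0.5 * prev_value
    return new_state + 0.1, new_state

print(greedy_decode(toy_step, init_state=0.4, n_steps=2))
```

The key property is that the state handed to step i + 1 is exactly the state produced by step i, so no input is processed twice.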



In [128]:
# --- Evaluate on test set ---
decoder_test_input = prepare_decoder_inputs(y_test)
test_loss = model.evaluate([X_test, decoder_test_input], y_test)
print(f"Test Loss (MSE): {test_loss}")

# --- Predict and invert normalization ---
# For prediction, we need to use a slightly different approach than teacher forcing.
# We predict one step at a time and feed the prediction back as input for the next step.

def predict_sequence(encoder_model, decoder_model, input_seq, past_len, future_len, scaler):
    # Encode the input sequence; the encoder model returns [state_h, state_c]
    state = encoder_model.predict(input_seq)

    # Seed the decoder with the last value of the input window
    target_seq = np.zeros((1, future_len, 1))
    target_seq[0, 0, 0] = input_seq[0, -1, 0]

    output_sequence = []

    for i in range(future_len):
        # Pass a flat list [decoder_input, state_h, state_c]; nesting the
        # states inside the list triggers Keras's input-structure warning
        output_tokens, h, c = decoder_model.predict([target_seq] + list(state))

        # Store the predicted value for step i
        predicted_value = output_tokens[0, i, 0]
        output_sequence.append(predicted_value)

        # Feed the prediction back in as the next decoder input
        if i < future_len - 1:
            target_seq[0, i + 1, 0] = predicted_value

        # Carry the returned states forward (these are the states after the
        # full target_seq, not after step i alone)
        state = [h, c]

    return np.array(output_sequence)

# Create encoder and decoder models for inference
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(64,))
decoder_state_input_c = Input(shape=(64,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

# Predict on the test set
y_pred_scaled = []
for i in range(X_test.shape[0]):
    input_seq = X_test[i:i+1] # Process one sequence at a time
    predicted_seq_scaled = predict_sequence(encoder_model, decoder_model, input_seq, PAST_SEQ_LEN, FUTURE_SEQ_LEN, scaler)
    y_pred_scaled.append(predicted_seq_scaled)

y_pred_scaled = np.array(y_pred_scaled)

# Invert normalization
y_test_reshaped = y_test.reshape(-1, 1)
y_pred_reshaped = y_pred_scaled.reshape(-1, 1)

y_test_inversed = scaler.inverse_transform(y_test_reshaped).reshape(y_test.shape)
y_pred_inversed = scaler.inverse_transform(y_pred_reshaped).reshape(y_pred_scaled.shape)


print("Sample actual rainfall (unscaled):")
print(y_test_inversed[:3])

print("Sample predicted rainfall (unscaled):")
print(y_pred_inversed[:3])

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - loss: 0.0583
Test Loss (MSE): 0.05460525304079056
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 373ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 243ms/step


The structure of `inputs` doesn't match the expected structure.
Expected: ['keras_tensor_56', 'keras_tensor_61', 'keras_tensor_62']
Received: inputs=('Tensor(shape=(1, 2, 1))', ('Tensor(shape=(1, 64))', 'Tensor(shape=(1, 64))'))



1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 269ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 45ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 106ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 53ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 53ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 64ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s

## Evaluate the model

### Subtask:
Evaluate the trained Seq2Seq LSTM model on the test set to assess its performance.


**Reasoning**:
Calculate and print evaluation metrics (MSE, MAE) on the inverse-transformed test data and predictions.



In [129]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Reshape y_test_inversed and y_pred_inversed to flatten for metric calculation
y_test_flat = y_test_inversed.flatten()
y_pred_flat = y_pred_inversed.flatten()

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test_flat, y_pred_flat)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test_flat, y_pred_flat)
print(f"Mean Absolute Error (MAE): {mae:.4f}")

# Calculate R-squared (R2) score
# Ensure there is variance in the actual values for R2 calculation
if np.var(y_test_flat) > 0:
    ss_total = np.sum((y_test_flat - np.mean(y_test_flat))**2)
    ss_residual = np.sum((y_test_flat - y_pred_flat)**2)
    r2 = 1 - (ss_residual / ss_total)
    print(f"R-squared (R2): {r2:.4f}")
else:
    print("Cannot calculate R-squared: Variance of actual values is zero.")

# Optional: Calculate accuracy for predicting any rainfall vs no rainfall
# Define a threshold for classifying as rainfall (e.g., > 0.1 mm)
rainfall_threshold = 0.1

y_test_binary = (y_test_flat > rainfall_threshold).astype(int)
y_pred_binary = (y_pred_flat > rainfall_threshold).astype(int)

binary_accuracy = np.mean(y_test_binary == y_pred_binary)
print(f"Rain/No-Rain Accuracy (threshold > {rainfall_threshold} mm): {binary_accuracy:.4f}")


Mean Squared Error (MSE): 54173879.3159
Mean Absolute Error (MAE): 6155.5537
R-squared (R2): 0.1415
Rain/No-Rain Accuracy (threshold > 0.1 mm): 0.8708
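
Scores like these are easier to interpret next to trivial baselines. The sketch below uses made-up numbers, not the notebook's data, to compute the MAE of a persistence forecast (repeat the last observed week) and the accuracy obtained by always predicting rain. `persistence_forecast` and `always_rain_accuracy` are hypothetical helper names.

```python
import numpy as np

def persistence_forecast(X, future_len):
    """Repeat the last observed value of each input window for every future step."""
    return np.repeat(X[:, -1:], future_len, axis=1)

def always_rain_accuracy(y_true, threshold=0.1):
    """Accuracy obtained by predicting 'rain' for every sample."""
    return float(np.mean(y_true > threshold))

# Made-up weekly totals (mm), shaped (samples, steps)
X_demo = np.array([[10.0, 12.0, 8.0, 9.0], [12.0, 8.0, 9.0, 0.0]])
y_demo = np.array([[0.0, 11.0], [11.0, 7.0]])

baseline = persistence_forecast(X_demo, future_len=2)
print("Persistence MAE:", np.mean(np.abs(y_demo - baseline)))
print("Always-rain accuracy:", always_rain_accuracy(y_demo))
```

Applied to the real arrays, this would mean passing the inverse-transformed inputs and targets with the trailing feature axis removed. A model is only adding value if it beats both numbers.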


## Predict and visualize rainfall

### Subtask:
Use the trained model to make predictions on the test set and visualize the predicted rainfall against the actual rainfall.


**Reasoning**:
Create a time index for the test set predictions and align the actual and predicted future rainfall values with their corresponding dates. Then, create a pandas DataFrame to hold the actual and predicted rainfall values along with their dates for easier plotting.



In [130]:
# Step 1: Create a time index for the test set predictions.
# The test_weeks array contains the start dates of the weeks in the test set.
# Each sequence in X_test corresponds to a sequence of PAST_SEQ_LEN weeks starting from a date in test_weeks.
# The prediction for a sequence starting at test_weeks[i] corresponds to the FUTURE_SEQ_LEN weeks starting from test_weeks[i + PAST_SEQ_LEN].

# Determine the start dates for the predicted sequences.
# These are the dates in test_weeks that align with the start of the future sequences.
# The first prediction sequence corresponds to the future weeks starting from test_weeks[PAST_SEQ_LEN].
prediction_start_dates = test_weeks[PAST_SEQ_LEN : PAST_SEQ_LEN + len(y_test_inversed)]

# Generate the actual dates for each prediction point in the test set.
# For each prediction sequence (of length FUTURE_SEQ_LEN), the dates are the prediction_start_date + j weeks.
actual_prediction_dates = []
for start_date in prediction_start_dates:
    for j in range(FUTURE_SEQ_LEN):
        actual_prediction_dates.append(start_date + pd.Timedelta(weeks=j))

# Step 2: Reshape the y_test_inversed and y_pred_inversed arrays to have the shape (number_of_sequences * future_seq_len).
# This will flatten the arrays so they match the length of actual_prediction_dates.
y_test_flat = y_test_inversed.flatten()
y_pred_flat = y_pred_inversed.flatten()

# Step 3: Create a pandas DataFrame to hold the actual and predicted rainfall values along with their dates.
prediction_df = pd.DataFrame({
    'date': actual_prediction_dates,
    'actual_rainfall': y_test_flat,
    'predicted_rainfall': y_pred_flat
})

# Sort the DataFrame by date to ensure the plot is in chronological order
prediction_df = prediction_df.sort_values('date').reset_index(drop=True)

display(prediction_df.head())

        date  actual_rainfall  predicted_rainfall
0 2022-07-25           4410.6         9224.137695
1 2022-08-01           3890.3         7477.543945
2 2022-08-01           3890.3         8190.328125
3 2022-08-08            485.4         6866.959473
4 2022-08-08            485.4         8000.530762
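
Because consecutive windows overlap, most weeks appear twice in the flattened frame: once as a 1-week-ahead target and once as a 2-week-ahead target. That is why 2022-08-01 and 2022-08-08 are repeated in the output. The sketch below, with hypothetical dates and values, records the horizon in its own column so each horizon can be plotted as a separate line. `frame_by_horizon` is a hypothetical helper, and it expects 2-D arrays, so the notebook's `(samples, steps, 1)` arrays would need their last axis squeezed first.

```python
import numpy as np
import pandas as pd

def frame_by_horizon(start_dates, y_true, y_pred):
    """Build one row per (forecast date, horizon) instead of a flat series."""
    n, horizon = y_true.shape
    rows = []
    for i in range(n):
        for h in range(horizon):
            rows.append({
                "date": pd.Timestamp(start_dates[i]) + pd.Timedelta(weeks=h),
                "horizon_weeks": h + 1,
                "actual": y_true[i, h],
                "predicted": y_pred[i, h],
            })
    return pd.DataFrame(rows)

starts = pd.date_range("2022-07-25", periods=3, freq="7D")  # hypothetical dates
y_true = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])      # made-up values
y_pred = y_true + 0.5
long_df = frame_by_horizon(starts, y_true, y_pred)
one_week_ahead = long_df[long_df["horizon_weeks"] == 1]
print(one_week_ahead)
```

Filtering on `horizon_weeks` gives a series with unique dates, which avoids the zigzag that a single line plot over duplicated dates would produce.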


**Reasoning**:
Use the created DataFrame to plot the actual and predicted rainfall over time for the test period using Plotly for interactive visualization.



In [131]:
import plotly.graph_objects as go

# Step 5: Use Plotly to create a line plot
fig = go.Figure()

# Add actual rainfall line
fig.add_trace(go.Scatter(
    x=prediction_df['date'],
    y=prediction_df['actual_rainfall'],
    mode='lines',
    name='Actual Rainfall'
))

# Add predicted rainfall line
fig.add_trace(go.Scatter(
    x=prediction_df['date'],
    y=prediction_df['predicted_rainfall'],
    mode='lines',
    name='Predicted Rainfall'
))

# Step 6: Add appropriate labels and title
fig.update_layout(
    title="Actual vs. Predicted Rainfall (Test Set)",
    xaxis_title="Date",
    yaxis_title="Rainfall (mm)",
    hovermode='x unified' # Show hover information for all traces at a given x-coordinate
)

# Step 7: Display the plot
fig.show()

## Summary:

### Data Analysis Key Findings

*   The daily rainfall data was successfully aggregated into weekly sums.
*   The data was split into training (70%), validation (15%), and test (15%) sets based on time.
*   The rainfall data was normalized using `MinMaxScaler`.
*   Sequences for the Seq2Seq model were created with a past sequence length of 4 and a future sequence length of 2.
*   A Seq2Seq LSTM model was built and trained for 30 epochs with a batch size of 16.
*   On the test set, using inverse-transformed values, the model achieved a Mean Squared Error (MSE) of 54173879.3159 and a Mean Absolute Error (MAE) of 6155.5537.
*   The R-squared (R2) score on the test set was 0.1415.
*   The Rain/No-Rain Accuracy on the test set was 0.8708 with a threshold of 0.1 mm.
*   A time series plot visualizing the actual versus predicted weekly rainfall on the test set was generated, showing the model's forecasting performance over time.

### Insights or Next Steps

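*   The large MSE and MAE (computed on weekly totals summed over all locations) and the low R2 score of 0.1415 indicate that the model does not predict rainfall amounts accurately; in the sample rows shown earlier, predictions of roughly 6,900 to 9,200 mm sit well above actual values of 485 to 4,411 mm. The Rain/No-Rain Accuracy of 0.8708 looks better, but with a 0.1 mm threshold applied to weekly totals it should be checked against a trivial always-rain baseline before it is read as evidence of skill.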
*   Future steps could involve exploring different model architectures (e.g., adding more LSTM layers, using different activation functions), tuning hyperparameters (e.g., number of units, learning rate, batch size, number of epochs), incorporating additional relevant features (e.g., temperature, humidity, wind speed), or experimenting with longer sequence lengths to potentially improve the model's quantitative rainfall predictions.
