# Introduction
In this small project we will take a look at the Seattle weather dataset from Kaggle, extract its important features and use them to test how a Recurrent Neural Network performs.
<br><br>
#### Main objective
Build a model with the PyTorch library that predicts weather features such as average temperature, wind and precipitation.
<br><br>
#### Process includes:
1. **Data** <br>
&ensp;1.1 Overview <br>
&ensp;1.2 Anomalies <br>
&ensp;1.3 Visualization <br>
&ensp;1.4 Preparing data for model <br>
2. **Building model** <br>
&ensp;2.1 Train / test split <br>
&ensp;2.2 Sequencing datasets <br>
&ensp;2.3 Class LSTM <br>
&ensp;2.4 Training function <br>
&ensp;2.5 Testing function <br>
&ensp;2.6 Training and Testing the RNN models <br>
3. **Conclusion**

# I - Data
#### 1.1 Overview
To preview the dataset we're going to use the pandas library.

In [None]:
import matplotlib.pyplot as plt

"""DON'T MIND THIS, it's dark theme for matplotlib"""
WHITE_MID = '#b5b5b5'
GREY_DARK = '#141414'

plt.rcParams['figure.facecolor'] = GREY_DARK
plt.rcParams['text.color'] = WHITE_MID
plt.rcParams['axes.facecolor'] = GREY_DARK
plt.rcParams['axes.edgecolor'] = WHITE_MID
plt.rcParams['axes.labelcolor'] = WHITE_MID
plt.rcParams['axes.titlecolor'] = WHITE_MID

plt.rcParams['grid.color'] = WHITE_MID
plt.rcParams['grid.linestyle'] = '--'
plt.rcParams['grid.linewidth'] = 0.5
plt.rcParams['axes.grid'] = True

plt.rcParams['axes.linewidth'] = 1
plt.rcParams['xtick.color'] = WHITE_MID
plt.rcParams['ytick.color'] = WHITE_MID
plt.rcParams['legend.edgecolor'] = WHITE_MID
plt.rcParams['legend.labelcolor'] = WHITE_MID

In [None]:
import pandas as pd

df = pd.read_csv('seattle-weather.csv')
df

Following the author's description: <br><br>
**_precipitation_** - all forms in which water falls on the land surface and open water bodies as rain, sleet, snow, hail, or drizzle <br>
**_temp_max_** - highest temperature recorded that day <br>
**_temp_min_** - lowest temperature recorded that day <br>
**_wind_** - wind speed <br>
**_weather_** - weather condition <br>

In [None]:
# convert date to actual date format and attach it to index column
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

In [None]:
df.describe()

In [None]:
df.isna().sum()

#### 1.2 Anomalies
In the real world a day's maximum temperature can never be lower than its minimum, <br>
so rows where that happens are exactly the type of anomaly we are looking for.

In [None]:
df[df['temp_max'] < df['temp_min']]

Next, days labelled as "rain" weather but with no actual precipitation.

In [None]:
df_rainy = df['weather'] == 'rain'
df_noprecip = df['precipitation'] == 0
df[(df_rainy) & (df_noprecip)]

It's a bit odd, but I guess the 'rain' label doesn't mean rain was actually recorded that day, only that the weather 'felt' rainy.

In [None]:
df.groupby('weather').mean()

#### 1.3 Visualization
Using heatmaps and line plots from seaborn to find correlations and patterns.

In [None]:
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np

In [None]:
corr = df.drop(['weather'], axis=1).corr()

fig = plt.figure(figsize=(6, 5))
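# mask the upper triangle (including the diagonal) with a boolean matrix of the same shape as corr
sb.heatmap(corr, annot=True, cmap='Greys_r', cbar=False, mask=np.triu(np.ones_like(corr, dtype=bool)))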
plt.title('Correlation')

In [None]:
fig, axes = plt.subplots(ncols=1, nrows=3, figsize=(24, 14))
axes = axes.flatten()

mid_blue = '#6f93bf'
mid_red = '#bf6f6f'

sb.lineplot(data=df, x='date', y='temp_max', color=mid_red, ax=axes[0])
sb.lineplot(data=df, x='date', y='temp_min', color=mid_blue, ax=axes[0])
sb.lineplot(data=df, x='date', y='wind', color=mid_blue, ax=axes[1])
sb.lineplot(data=df, x='date', y='precipitation', color=mid_blue, ax=axes[2])

axes[0].set_title('Temperature min, max')
axes[1].set_title('Wind speed')
axes[2].set_title('Precipitation')

fig.suptitle('Trends Over Time - Daily', fontweight='bold')
plt.tight_layout()

The seasonal pattern suggests that more precipitation falls in the autumn months.<br>
During the summer season, wind speed swings between its maximum and minimum values noticeably more than in winter.<br>

Since temp_min and temp_max change in almost the same way, and we don't have hourly timestamps for each day, we can replace them with the _**average temperature**_ for that day. <br>
This removes a redundant feature, which reduces the model's input size and the number of values it has to process.

In [None]:
temp_avg = (df['temp_max'] + df['temp_min']) / 2
df.drop(['temp_max', 'temp_min'], axis=1, inplace=True)
df.insert(1, 'temp_avg', temp_avg)
df

I've created 3 time-resampled versions of the same dataframe (daily, weekly and monthly) so the plots are easier to read.

In [None]:
df_daily = df.drop('weather', axis=1).resample('D').mean()
df_weekly = df.drop('weather', axis=1).resample('W').mean()
df_monthly = df.drop('weather', axis=1).resample('ME').mean()
df_monthly.head()
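
To make the resampling step more concrete, here is a minimal sketch on a hypothetical toy series (the `toy` dataframe and its `value` column are made up for illustration, they are not part of the dataset). It assumes the default behaviour of the `'W'` frequency, which groups rows into weeks ending on Sunday.

```python
import pandas as pd

# Hypothetical toy series: 14 consecutive days starting on a Monday
days = pd.date_range('2024-01-01', periods=14, freq='D')
toy = pd.DataFrame({'value': range(14)}, index=days)

# 'W' groups rows into calendar weeks ending on Sunday and averages each group
toy_weekly = toy.resample('W').mean()
print(toy_weekly)
```

Each of the two weeks collapses into a single row holding the mean of its 7 daily values, which is why the weekly plots look much smoother than the daily ones.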

In [None]:
fig, axes = plt.subplots(ncols=1, nrows=3, figsize=(24, 12))
axes = axes.flatten()

sb.lineplot(data=df_weekly, x='date', y='temp_avg', color=mid_red, ax=axes[0])
sb.lineplot(data=df_weekly, x='date', y='wind', color=mid_blue, ax=axes[1])
sb.lineplot(data=df_weekly, x='date', y='precipitation', color=mid_blue, ax=axes[2])

axes[0].set_title('Average Temperature')
axes[1].set_title('Wind speed')
axes[2].set_title('Precipitation')

plt.suptitle('Trends Over Time - Weekly', fontweight='bold')
plt.tight_layout()

In [None]:
fig = plt.figure(figsize=(8, 6))

sb.countplot(data=df, x='weather', color=mid_blue, edgecolor='none')

#### 1.4 Preparing data for model
First we have to convert our weather condition labels to numeric values, <br>
because models expect class labels as integers rather than strings. <br>
For this I'm going to use simple dictionaries.

In [None]:
weather_to_idx = {
    k: v for v, k in enumerate(df['weather'].unique())
}
idx_to_weather = {
    k: v for k, v in enumerate(df['weather'].unique())
}
weather_to_idx

Just making sure the backwards conversion works.

In [None]:
for w in df['weather'].unique():
    print(f"{w:>10}:", w == idx_to_weather[weather_to_idx[w]])

In [None]:
df['weather'] = df['weather'].apply(lambda x: weather_to_idx[x])
df
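
As a side note, pandas can build the same kind of mapping in one call. The sketch below uses `pd.factorize` on a hypothetical list of labels (not the real column); like the dictionaries above, it numbers the labels in order of first appearance.

```python
import pandas as pd

# Hypothetical labels standing in for the 'weather' column
labels = pd.Series(['drizzle', 'rain', 'sun', 'rain', 'snow', 'sun'])

# codes: integer label per row; uniques: the label each integer stands for
codes, uniques = pd.factorize(labels)

print(codes)          # integer codes in order of first appearance
print(list(uniques))  # lookup table for converting back
```

Indexing `uniques` with `codes` gives back the original labels, so it plays the role of the idx_to_weather dictionary.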

We're going to pass the data as sequences of 7 days each. <br>
Knowing that, we make sure the dataset length is divisible by 7 <br>
by simply cutting it to the number of full weeks it contains.

In [None]:
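n_days = len(df) // 7 * 7  # number of days that fit into whole weeks
df = df.iloc[:n_days].copy()  # copy so later assignments don't write into a slice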

Normalizing features using _**MinMaxScaler**_.

In [None]:
features = ['precipitation', 'temp_avg', 'wind']

for feat in features:
    print(
        f'{feat}\n',
        f'min: {df[feat].min()}',
        f'max: {df[feat].max()}',
        end='\n\n'
    )

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df.loc[:, features] = scaler.fit_transform(df[features])
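
For reference, min-max scaling maps each feature to the [0, 1] range with (x - min) / (max - min). The sketch below checks this by hand against scikit-learn on a hypothetical single-column array (the numbers are made up for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values standing in for a single feature column
values = np.array([[2.0], [4.0], [6.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(np.allclose(manual, scaled))
```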

In [None]:
features = ['precipitation', 'temp_avg', 'wind']

for feat in features:
    print(
        f'{feat}\n',
        f'min: {df[feat].min()}',
        f'max: {df[feat].max()}',
        end='\n\n'
    )

# II - Building model
#### 2.1 Train / test split
Splitting dataset using week stamps.

In [None]:
n_weeks = int(len(df) / 7)
n_weeks

In [None]:
train_size = int(n_weeks * 0.7) + 1
test_size = int(n_weeks * 0.3)

print(
    f"{'train size':>12}: {train_size}",
    f"{'test size':>12}: {test_size}", sep='\n'
)

In [None]:
df_train = df[: train_size*7]
df_test = df[train_size*7 :]

In [None]:
print(
    len(df_train) / 7,
    len(df_test) / 7, sep='\n'
)

First we fix the random seeds for reproducibility, then convert each data split to numpy so we're able to pass it to torch.

In [None]:
import numpy as np
import torch
import random

seed = 42

random.seed(seed)
np.random.seed(seed)

torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [None]:
df_train_torch = df_train.to_numpy()
df_train_torch = torch.tensor(df_train_torch)

df_test_torch = df_test.to_numpy()
df_test_torch = torch.tensor(df_test_torch)
df_test_torch[:7]

#### 2.2 Sequencing datasets
To predict weather features based on the past 6 days, <br>
we have to reshape our data splits into fixed-length sequences. <br><br>
Each sequence contains data from 7 days, where: <br>
X - precipitation, temp_avg and wind speed features from the first 6 days <br>
y - the same features, but only for day 7

In [None]:
from torch.utils.data import TensorDataset, DataLoader

def sequence_dataset(dataset, batch_size, shuffle=True):
    """Split a (n_days, n_columns) tensor into 7-day sliding windows."""
    x, y = [], []

    # + 1 so that the final full 7-day window is included
    for i in range(len(dataset) - 7 + 1):

        week = dataset[i: i + 7, :3] # take everything except 'weather'
        week_features = week[:6, :] # trim to 6 days
        week_target = week[-1, :] # take last day

        x.append(week_features)
        y.append(week_target)

    # convert to tensor dataset (the windows are already torch tensors)
    tensor_dataset = TensorDataset(torch.stack(x), torch.stack(y))

    # convert to data_loader
    data_loader = DataLoader(tensor_dataset, batch_size=batch_size, shuffle=shuffle, drop_last=True)

    return data_loader

In [None]:
batch_size = 4
train_loader = sequence_dataset(df_train_torch, batch_size=batch_size)
# keep the test set in chronological order so the prediction plots read as a time series
test_loader = sequence_dataset(df_test_torch, batch_size=batch_size, shuffle=False)

In [None]:
for batch in test_loader:
    inputs, targets = batch  # avoid shadowing the built-in input()
    print(inputs.size())
    print(targets.size())
    break

The shape corresponds to: <br>
&emsp;4 - batch size <br>
&emsp;6 - n days in sequence <br>
&emsp;3 - n features for single day <br><br>
Note that earlier I said a sequence is 7 days, but our input is now 6 days. That's because the 7th day became the y/label.

In [None]:
for batch in train_loader:
    input, target = batch
    print(input[0], target[0], sep='\n')
    break
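
For comparison, windows with the same 6-day input / 1-day target layout can also be built without a Python loop. This is only a sketch using `torch.Tensor.unfold` on a hypothetical 10-day tensor (the `data` tensor is made up, it is not the notebook's dataset).

```python
import torch

# Hypothetical stand-in for the dataset tensor: 10 days x 3 features
data = torch.arange(30, dtype=torch.float32).reshape(10, 3)

# unfold(dimension, size, step) slides a 7-day window along the day axis;
# the window axis is appended last, so permute to (n_windows, 7 days, 3 features)
windows = data.unfold(0, 7, 1).permute(0, 2, 1)

x = windows[:, :6, :]  # first 6 days of each window -> model input
y = windows[:, -1, :]  # day 7 of each window -> target

print(x.shape)  # torch.Size([4, 6, 3])
print(y.shape)  # torch.Size([4, 3])
```

With 10 days there are 10 - 7 + 1 = 4 full windows, and each window's target is the day that directly follows its 6 input days.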

#### 2.3 Class LSTM
With PyTorch we are able to build a simple RNN with Long Short-Term Memory layers by simply adding _nn.LSTM()_ with the appropriate arguments. <br>
The *\_\_init\_\_()* function initializes the LSTM and fully connected layers. <br>
__*forward()*__ passes our input x through the LSTM, takes the output of the last time step and feeds it to the fully connected layer to produce the prediction.

In [None]:
from torch import nn

class WeatherRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size=3, n_layers=2, dropout=0):
        super(WeatherRNN, self).__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, num_layers=n_layers, dropout=dropout, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
        
    def forward(self, x):
        out, _ = self.rnn(x)
        out = self.fc(out[:, -1, :]) # take the last value
        return out
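
To see what `out[:, -1, :]` actually selects, here is a small shape walk-through. It is a sketch on random input with the same batch layout as above (4 sequences, 6 days, 3 features); the variable names are only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Same layout as the notebook's batches: 4 sequences, 6 days, 3 features
x = torch.randn(4, 6, 3)

lstm = nn.LSTM(input_size=3, hidden_size=32, num_layers=2, batch_first=True)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # torch.Size([4, 6, 32]) -> top-layer hidden state for every day
print(h_n.shape)  # torch.Size([2, 4, 32]) -> final hidden state of each layer

# For a unidirectional LSTM, the last time step of `out`
# is the top layer's final hidden state
print(torch.allclose(out[:, -1, :], h_n[-1]))
```

So the fully connected layer only sees the hidden state after the 6th day, which summarizes the whole sequence.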

#### 2.4 Training function
We take our train_loader and pass it to the training function. <br>
The training function iterates over the batches n_epochs times and passes the sequences from each batch to our model. <br>
The loss is then calculated by comparing the model's outputs with the actual targets using the Mean Squared Error function.

In [None]:
import torch.optim as optim

def train_model(model, device, train_loader, n_epochs=1, lr=0.001):
    
    model.to(device)
    
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    for epoch in range(n_epochs):
        model.train()
        total_loss = 0
        
        for i, batch in enumerate(train_loader):
            inputs, targets = batch
            inputs, targets = (inputs.to(device).float(),
                               targets.to(device).float())
            
            optimizer.zero_grad()
            outputs = model(inputs)
            
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            
        print(f"Epoch {epoch + 1} / {n_epochs} finished | {total_loss/len(train_loader):.4f}")
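
As a quick sanity check of what the loss measures, the sketch below compares `nn.MSELoss` with the mean of squared differences computed by hand. The `outputs` and `targets` tensors are hypothetical values, not model results.

```python
import torch
from torch import nn

# Hypothetical predictions and targets: 2 samples x 2 features
outputs = torch.tensor([[0.5, 0.0], [1.0, 1.0]])
targets = torch.tensor([[0.0, 0.0], [1.0, 0.0]])

# nn.MSELoss with the default reduction='mean' averages over every element
loss = nn.MSELoss()(outputs, targets)

# The same value computed by hand: mean of squared differences
manual = ((outputs - targets) ** 2).mean()

print(loss.item(), manual.item())  # 0.3125 0.3125
```

Because the average runs over all elements, the three features contribute equally to the loss, which only makes sense here because they were scaled to the same range.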

#### 2.5 Testing function
Similar to the training function. This time we run the model without gradient calculations <br>
and iterate through the batches from test_loader. <br>
Then we return the true and predicted y values to see where the model makes the most mistakes.

In [None]:
def test_model(model, device, test_loader):
    model.to(device)
    model.eval()  # switch off dropout for evaluation
    total_loss = 0
    criterion = nn.MSELoss()
    y_true, y_pred = [], []

    with torch.no_grad():
        for batch in test_loader:
            inputs, targets = batch
            inputs, targets = (inputs.to(device).float(),
                               targets.to(device).float())

            outputs = model(inputs)

            loss = criterion(outputs, targets)
            total_loss += loss.item()

            y_true.extend(targets.cpu().numpy())
            y_pred.extend(outputs.cpu().numpy())

    avg_loss = total_loss / len(test_loader)
    print(f"Test MSE Loss: {avg_loss:.4f}")

    return y_true, y_pred
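
One detail worth keeping in mind when testing: dropout behaves differently in training and evaluation mode, and `torch.no_grad()` alone does not change that. The sketch below shows the difference on a standalone `nn.Dropout` layer with a made-up input; the `dropout` argument of _nn.LSTM()_ follows the same train/eval switch.

```python
import torch
from torch import nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
train_out = drop(x)  # some values zeroed, survivors scaled by 1 / (1 - p)

drop.eval()
eval_out = drop(x)   # dropout is a no-op in evaluation mode

print(train_out)
print(eval_out)
```

This is why calling `model.eval()` before testing gives deterministic predictions.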


#### 2.6 Training and Testing the RNN models

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

input_size = 3
output_size = 3
hidden_size = 32
n_layers = 2
dropout = 0.5
learning_rate = 0.0001

n_epochs = 50

weather_rnn = WeatherRNN(input_size=input_size, hidden_size=hidden_size, n_layers=n_layers, output_size=output_size, dropout=dropout)

In [None]:
train_model(weather_rnn, device=device, train_loader=train_loader, n_epochs=n_epochs, lr=learning_rate)

In [None]:
y_true, y_pred = test_model(weather_rnn, device=device, test_loader=test_loader)

In [None]:
precip_true = np.array(y_true)[:, 0]
precip_pred = np.array(y_pred)[:, 0]

temp_avg_true = np.array(y_true)[:, 1]
temp_avg_pred = np.array(y_pred)[:, 1]

wind_true = np.array(y_true)[:, 2]
wind_pred = np.array(y_pred)[:, 2]

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(30, 20))
axes = axes.flatten()

sb.lineplot(x=range(len(precip_true)), y=precip_true, color='red', ax=axes[0], label='True')
sb.lineplot(x=range(len(precip_pred)), y=precip_pred, color='white', ax=axes[0], label='Pred')
axes[0].set_title('precipitation predictions')

sb.lineplot(x=range(len(temp_avg_true)), y=temp_avg_true, color='red', ax=axes[1], label='True')
sb.lineplot(x=range(len(temp_avg_pred)), y=temp_avg_pred, color='white', ax=axes[1], label='Pred')
axes[1].set_title('temp_avg predictions')

sb.lineplot(x=range(len(wind_true)), y=wind_true, color='red', ax=axes[2], label='True')
sb.lineplot(x=range(len(wind_pred)), y=wind_pred, color='white', ax=axes[2], label='Pred')
axes[2].set_title('wind predictions')
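
Note that the values plotted above are in the scaled 0 to 1 range. If predictions are needed in the original units, a fitted MinMaxScaler can map them back with `inverse_transform`. The sketch below uses hypothetical raw values and a made-up prediction, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw values for precipitation, temp_avg and wind
raw = np.array([[0.0, 5.0, 2.0],
                [10.0, 15.0, 4.0],
                [20.0, 25.0, 8.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(raw)

# Pretend this is a model prediction in the scaled [0, 1] range
y_pred_scaled = np.array([[0.5, 0.5, 0.5]])

# inverse_transform maps scaled values back to the original units
y_pred_units = scaler.inverse_transform(y_pred_scaled)
print(y_pred_units)  # halfway between each column's min and max
```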

# III - Conclusion
Clearly the model doesn't capture the chaos in the _precipitation_ and _wind_ columns. <br>
I suppose predicting something like weather with high precision <br>
requires a larger number of features than just 3 numeric columns. <br>
Despite this, the model did a decent job of predicting average temperature. <br><br>
Overall I think this project covers quite simple fundamentals for understanding how Recurrent Neural Networks work.
#### Thanks for reading my project, I hope you enjoyed the process :]
_Gracjan Pawłowski 2025_