<font size="4">**25. Load the Dataset**</font>

In [2]:
import pandas as pd

#load the dataset
file_path = './Downloads/final_rainfall_data.csv'
df = pd.read_csv(file_path)

#display dataset information
print("Sample Data:")
print(df.head())

print("\nDataset Information:")
df.info()  #info() prints its summary directly and returns None

print("\nMissing Values:")
print(df.isnull().sum())

Sample Data:
        Country  Year  Month  Rainfall
0      DJIBOUTI  1981      1  0.000452
1  ILE TROMELIN  1981      1  0.012166
2     SWAZILAND  1981      1  0.023881
3          MALI  1981      1  0.004452
4         NIGER  1981      1  0.007719

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322994 entries, 0 to 322993
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   Country   322994 non-null  object 
 1   Year      322994 non-null  int64  
 2   Month     322994 non-null  int64  
 3   Rainfall  322994 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 9.9+ MB

Missing Values:
Country     0
Year        0
Month       0
Rainfall    0
dtype: int64


<font size="4">**26. Data Cleaning**</font>

In [4]:
from sklearn.preprocessing import MinMaxScaler

#fill any remaining NaN values in Rainfall with the column mean
df['Rainfall'] = df['Rainfall'].fillna(df['Rainfall'].mean())

#combine Year and Month into a Date column
df['Date'] = pd.to_datetime(df[['Year', 'Month']].assign(Day=1))

#normalize Rainfall
scaler = MinMaxScaler()
df['Rainfall'] = scaler.fit_transform(df[['Rainfall']])

<font size="4">**27. Feature Engineering**</font>

This section enhances the rainfall dataset with lagged features, rolling averages, and cyclical month representations. Lag features capture the rainfall from the previous 12 months for each country, helping to identify temporal dependencies. Grouping by Country ensures that lagged values are calculated independently for each country; because shift operates on row order, this assumes each country's rows are already in chronological order. Rolling averages over 3-month and 6-month windows are then computed to capture short-term and medium-term rainfall trends, providing smoothed representations of temporal patterns.
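
To make the per-country lag and rolling-mean logic concrete, here is a minimal standard-library sketch. It only mimics what `groupby('Country').shift(lag)` and a grouped `rolling(window).mean()` compute; the function name `add_lag_and_rolling` and the sample values are hypothetical, and the rows are assumed to be in chronological order.

```python
from collections import defaultdict


def add_lag_and_rolling(rows, lag=1, window=3):
    """Compute a lag value and a rolling mean within each country.

    rows: list of (country, rainfall) tuples, assumed chronological.
    Returns a list of (country, rainfall, lagged, rolling_mean) tuples,
    with None where not enough history exists yet.
    """
    history = defaultdict(list)  # rainfall values seen so far, per country
    out = []
    for country, rainfall in rows:
        past = history[country]
        # Lag: the value `lag` steps back within the same country.
        lagged = past[-lag] if len(past) >= lag else None
        # Rolling window: current value plus the previous window-1 values.
        recent = past[-(window - 1):] + [rainfall] if window > 1 else [rainfall]
        rolling = sum(recent) / window if len(recent) == window else None
        out.append((country, rainfall, lagged, rolling))
        past.append(rainfall)
    return out


# Illustrative data: two countries interleaved month by month.
rows = [
    ("MALI", 1.0), ("NIGER", 10.0),
    ("MALI", 2.0), ("NIGER", 20.0),
    ("MALI", 3.0), ("NIGER", 30.0),
]
result = add_lag_and_rolling(rows, lag=1, window=3)
for row in result:
    print(row)
```

The first rows of each country have no lag or rolling value, which is the source of the NaN values that the notebook fills afterwards.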

To account for seasonality, cyclical features for months are generated using sine and cosine transformations. These transformations ensure that months like December and January, which are numerically far apart but seasonally adjacent, end up close together in feature space. Missing values introduced by lagging and rolling windows are first addressed with linear interpolation. Interpolation does not fill the NaN values at the start of a column, so these are handled by a backward fill followed by a forward fill; this is why the first rows of the sample output share the same value (0.006167) across all lag columns. The engineered features prepare the dataset for the classification models in the following sections.
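
As a small check of the cyclical encoding described above, the sketch below maps month numbers onto the unit circle and compares distances. It uses the same formula as the notebook (`sin(2π·month/12)`, `cos(2π·month/12)`); the helper names `encode_month` and `distance` are illustrative only.

```python
import math


def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(a), encode_month(b))


dec_jan = distance(12, 1)  # adjacent months across the year boundary
jan_feb = distance(1, 2)   # adjacent months within the year
jan_jul = distance(1, 7)   # opposite sides of the year

print(f"Dec-Jan: {dec_jan:.4f}, Jan-Feb: {jan_feb:.4f}, Jan-Jul: {jan_jul:.4f}")
```

December to January comes out exactly as close as January to February, while January to July is the largest possible distance; a raw month number would instead place December and January 11 units apart.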

In [7]:
import numpy as np

#generate lag features (previous 12 months of rainfall)
for lag in range(1, 13):
    df[f'Lag_{lag}'] = df.groupby('Country')['Rainfall'].shift(lag)

#generate rolling averages
df['Rolling_Mean_3'] = df.groupby('Country')['Rainfall'].transform(lambda x: x.rolling(window=3).mean())
df['Rolling_Mean_6'] = df.groupby('Country')['Rainfall'].transform(lambda x: x.rolling(window=6).mean())

#generate cyclical month features
df['Month'] = df['Date'].dt.month
df['Month_sin'] = np.sin(2 * np.pi * df['Month'] / 12)
df['Month_cos'] = np.cos(2 * np.pi * df['Month'] / 12)

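#fill NaNs caused by lagging and rolling averages using interpolation
#(restricted to numeric columns; interpolating object columns is deprecated in recent pandas)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].interpolate(method='linear')

#fill any remaining NaN values with a fallback method
#(bfill()/ffill() replace the deprecated fillna(method=...))
df = df.bfill().ffill()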

print("\nFeature-Engineered Data Sample:")
print(df.head())


Feature-Engineered Data Sample:
        Country  Year  Month  Rainfall       Date     Lag_1     Lag_2  \
0      DJIBOUTI  1981      1  0.016678 1981-01-01  0.006167  0.006167   
1  ILE TROMELIN  1981      1  0.449310 1981-01-01  0.006167  0.006167   
2     SWAZILAND  1981      1  0.881942 1981-01-01  0.006167  0.006167   
3          MALI  1981      1  0.164401 1981-01-01  0.006167  0.006167   
4         NIGER  1981      1  0.285065 1981-01-01  0.006167  0.006167   

      Lag_3     Lag_4     Lag_5  ...     Lag_7     Lag_8     Lag_9    Lag_10  \
0  0.006167  0.006167  0.006167  ...  0.006167  0.006167  0.006167  0.006167   
1  0.006167  0.006167  0.006167  ...  0.006167  0.006167  0.006167  0.006167   
2  0.006167  0.006167  0.006167  ...  0.006167  0.006167  0.006167  0.006167   
3  0.006167  0.006167  0.006167  ...  0.006167  0.006167  0.006167  0.006167   
4  0.006167  0.006167  0.006167  ...  0.006167  0.006167  0.006167  0.006167   

     Lag_11    Lag_12  Rolling_Mean_3  Rolling_



<font size="4">**28. Define Target Variable**</font>

This section categorizes rainfall levels and encodes the categories for further analysis. Rainfall data is divided into three categories: Drought, Normal, and Flood, based on thresholds taken from the 25th and 75th percentiles of the Rainfall column, computed over all countries together. Rainfall below the lower threshold is labeled Drought, rainfall above the upper threshold is labeled Flood, and values in between are labeled Normal. This categorization allows for a simplified analysis of rainfall patterns and their extremes.

Once categorized, the Rainfall_Category column is encoded into numerical labels using Scikit-learn's LabelEncoder. The encoder assigns integers in alphabetical order of the class names, so Drought = 0, Flood = 1, and Normal = 2, as the sample output confirms. The resulting dataset includes both the categorical labels and their numerical representations, making it ready for predictive modeling or statistical analysis.
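
The thresholding and label ordering can be illustrated with a short standard-library sketch. The rainfall values are made up for illustration, and the sorted-label mapping only mimics the alphabetical ordering that LabelEncoder applies; it is not the library's implementation.

```python
import statistics

# Illustrative (hypothetical) scaled rainfall values.
rainfall = [0.0, 0.02, 0.05, 0.07, 0.09, 0.10, 0.16, 0.29, 0.45, 0.88]

# quantiles(n=4) returns the three quartile cut points; the "inclusive"
# method interpolates linearly between data points.
drought_threshold, _, flood_threshold = statistics.quantiles(
    rainfall, n=4, method="inclusive"
)


def categorize(value):
    """Label a rainfall value relative to the two quartile thresholds."""
    if value < drought_threshold:
        return "Drought"
    if value > flood_threshold:
        return "Flood"
    return "Normal"


categories = [categorize(v) for v in rainfall]

# Mimic LabelEncoder: integer codes follow the sorted class names.
classes = sorted(set(categories))
label_to_int = {label: code for code, label in enumerate(classes)}
encoded = [label_to_int[c] for c in categories]

print(classes)       # class names in encoded order
print(label_to_int)  # mapping from name to integer code
```

Because the codes follow alphabetical order rather than severity, any later step that maps codes back to names needs to use this same ordering.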

In [10]:
from sklearn.preprocessing import LabelEncoder

#define thresholds
drought_threshold = df['Rainfall'].quantile(0.25)
flood_threshold = df['Rainfall'].quantile(0.75)

#categorize rainfall
def categorize_rainfall(rainfall):
    if rainfall < drought_threshold:
        return 'Drought'
    elif rainfall > flood_threshold:
        return 'Flood'
    else:
        return 'Normal'

df['Rainfall_Category'] = df['Rainfall'].apply(categorize_rainfall)

#encode labels
encoder = LabelEncoder()
df['Rainfall_Category_Encoded'] = encoder.fit_transform(df['Rainfall_Category'])

print("\nCategorized Data Sample:")
print(df[['Rainfall', 'Rainfall_Category', 'Rainfall_Category_Encoded']].head(10))


Categorized Data Sample:
   Rainfall Rainfall_Category  Rainfall_Category_Encoded
0  0.016678            Normal                          2
1  0.449310             Flood                          1
2  0.881942             Flood                          1
3  0.164401             Flood                          1
4  0.285065             Flood                          1
5  0.093802            Normal                          2
6  0.068182            Normal                          2
7  0.000000           Drought                          0
8  0.077412            Normal                          2
9  0.104673            Normal                          2


<font size="4">**29. Encode Categorical Variables**</font>

This section uses one-hot encoding to convert the categorical Country column into numerical features suitable for machine learning models. The OneHotEncoder from Scikit-learn is applied with the drop='first' option, which avoids perfectly collinear columns by omitting the first category (in sorted order). Rows belonging to the omitted country are represented by zeros in every country column.

The encoded features are stored in a new DataFrame, where each column corresponds to one of the remaining countries, prefixed with Country_. The original index of the dataset is preserved to ensure alignment with the existing data. The encoded features are then concatenated with the original DataFrame after dropping the Country column. This results in a dataset where the categorical Country data is replaced with binary features, as required by machine learning algorithms that accept only numerical inputs.
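
The effect of dropping the first category can be seen in a minimal standard-library sketch. The function `one_hot_drop_first` is a hypothetical helper that imitates the column layout produced above; it is not Scikit-learn's implementation.

```python
def one_hot_drop_first(values, prefix="Country"):
    """One-hot encode a list of labels, omitting the first sorted category.

    Returns (column_names, rows), where each row is a list of 0.0/1.0 flags.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the all-zeros baseline
    columns = [f"{prefix}_{c}" for c in kept]
    rows = [[1.0 if v == c else 0.0 for c in kept] for v in values]
    return columns, rows


countries = ["MALI", "NIGER", "DJIBOUTI", "MALI"]
columns, rows = one_hot_drop_first(countries)

print(columns)
for country, row in zip(countries, rows):
    print(country, row)
```

Here DJIBOUTI sorts first, so it has no column of its own and is encoded as a row of zeros.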

In [13]:
from sklearn.preprocessing import OneHotEncoder

#apply one-hot encoding to the Country column
encoder_onehot = OneHotEncoder(sparse_output=False, drop='first')
encoded_countries = encoder_onehot.fit_transform(df[['Country']])

#create DataFrame for encoded features
country_columns = [f'Country_{cat}' for cat in encoder_onehot.categories_[0][1:]]
encoded_countries_df = pd.DataFrame(encoded_countries, columns=country_columns, index=df.index)

#concatenate encoded features
df = pd.concat([df.drop('Country', axis=1), encoded_countries_df], axis=1)

<font size="4">**30. Split Data into Training and Testing Sets**</font>

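This section prepares the dataset for machine learning by splitting it into training and testing sets. The feature set (X) is created by dropping Rainfall_Category (the categorical version of the target), Rainfall_Category_Encoded (the target variable itself), and Date (a temporal identifier not needed for training). The target variable (y) is defined as the encoded rainfall category (Rainfall_Category_Encoded). Note that the Rainfall column itself remains in X; because the target is derived directly from Rainfall thresholds, the models can recover the label from this single feature, which should be kept in mind when interpreting the accuracy figures that follow.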

The dataset is split using Scikit-learn’s train_test_split function, which ensures that 20% of the data is reserved for testing while 80% is used for training the model. The stratify parameter is set to y to maintain the same class distribution in both the training and testing sets, preventing imbalances in the target variable across splits. The final split sizes are printed to confirm the dimensions of the training and testing sets, ensuring the data is ready for building and evaluating machine learning models.
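
To show what stratification preserves, here is a simplified standard-library sketch of a stratified split. It is an illustration of the idea only: the function `stratified_split` is hypothetical, and Scikit-learn's `train_test_split` differs in its shuffling and rounding details.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class keeps the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])   # this class's share of the test set
        train_idx.extend(indices[n_test:])  # the rest goes to training
    return train_idx, test_idx


# Illustrative labels with a 25% / 25% / 50% class mix.
labels = [0] * 25 + [1] * 25 + [2] * 50
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both subsets keep the 25/25/50 class mix, which matters here because the quartile thresholds make Normal roughly twice as frequent as either extreme.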

In [16]:
from sklearn.model_selection import train_test_split

#features and target variable
X = df.drop(['Rainfall_Category', 'Rainfall_Category_Encoded', 'Date'], axis=1)
y = df['Rainfall_Category_Encoded']

#stratified splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Training Set: {X_train.shape}, Testing Set: {X_test.shape}")

Training Set: (258395, 78), Testing Set: (64599, 78)


<font size="4">**31. Multilayer Perceptron (MLP) for Rainfall Classification**</font>

This section implements a Multilayer Perceptron (MLP) to classify rainfall data into the categories Drought, Normal, and Flood. The input features are standardized with StandardScaler to zero mean and unit variance, which helps training converge. The MLP is built with the Sequential API in Keras and has three fully connected (Dense) hidden layers of 128, 64, and 32 units, each using ReLU activation to introduce non-linearity. Dropout layers (30%) follow the first and second hidden layers to mitigate overfitting by randomly deactivating neurons during training.

The output layer uses a softmax activation function, suitable for multi-class classification, with one neuron per target class. The model is compiled with the Adam optimizer, which adapts per-parameter learning rates, and sparse categorical cross-entropy loss, appropriate for integer-labeled multi-class targets. Training runs for 30 epochs with a batch size of 32, reserving 20% of the training data for validation. Model performance is evaluated on the test set using accuracy and a classification report, which gives precision, recall, and F1-score for each rainfall category. In the training log, validation accuracy reaches about 0.98 by the sixth epoch.
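
The softmax output and the sparse categorical cross-entropy loss can be worked through by hand. The sketch below uses made-up logits for a single sample and plain Python; it illustrates the formulas rather than Keras's internal implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, true_class):
    """Loss for one sample whose label is an integer class index."""
    return -math.log(probs[true_class])


# Hypothetical logits for one sample over three classes.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax

loss_if_class_0 = sparse_categorical_crossentropy(probs, 0)
loss_if_class_2 = sparse_categorical_crossentropy(probs, 2)

print([round(p, 4) for p in probs])
print(predicted, round(loss_if_class_0, 4), round(loss_if_class_2, 4))
```

The loss is small when the true class has a high predicted probability and grows as that probability falls, and the integer label is used directly as an index, which is why no one-hot encoding of the target is needed.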

In [19]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score

#scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#define the deep learning model
model = Sequential([
    #declare the input shape explicitly instead of passing input_dim to the first Dense layer
    tf.keras.Input(shape=(X_train_scaled.shape[1],)),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dense(len(y_train.unique()), activation='softmax')  # Output layer
])

#compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

#train the model
history = model.fit(X_train_scaled, y_train,
                    validation_split=0.2,
                    epochs=30,
                    batch_size=32,
                    verbose=1)

#evaluate the model
y_pred_probs = model.predict(X_test_scaled)
y_pred = tf.argmax(y_pred_probs, axis=1).numpy()

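#classification report
#LabelEncoder orders classes alphabetically (Drought=0, Flood=1, Normal=2),
#so take the display names from the fitted encoder to keep them aligned
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=encoder.classes_)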

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)

Epoch 1/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 4s 493us/step - accuracy: 0.8666 - loss: 0.3168 - val_accuracy: 0.9547 - val_loss: 0.1035
Epoch 2/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 3s 481us/step - accuracy: 0.9490 - loss: 0.1227 - val_accuracy: 0.9690 - val_loss: 0.0807
Epoch 3/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 3s 479us/step - accuracy: 0.9599 - loss: 0.0983 - val_accuracy: 0.9710 - val_loss: 0.0666
Epoch 4/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 3s 483us/step - accuracy: 0.9639 - loss: 0.0882 - val_accuracy: 0.9736 - val_loss: 0.0614
Epoch 5/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 3s 482us/step - accuracy: 0.9669 - loss: 0.0807 - val_accuracy: 0.9768 - val_loss: 0.0595
Epoch 6/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 3s 497us/step - accuracy: 0.9692 - loss: 0.0764 - val_accuracy: 0.9788 - val_loss: 0.0515
Epoch 7/30

<font size="4">**32. LSTM Model for Rainfall Classification**</font>

This section employs a Long Short-Term Memory (LSTM) network to classify rainfall data into categories (Drought, Normal, and Flood). The feature matrix is tabular, with one row per observation rather than one sequence per sample, so the input is reshaped into the 3D format required by LSTM layers, (samples, timesteps, features), with a single timestep (timesteps=1). Before reshaping, the features are standardized with StandardScaler to aid convergence. With only one timestep the recurrent connections have no earlier steps to draw on, so the LSTM layer effectively behaves like a gated feed-forward layer; temporal information reaches the model only through the lag and rolling-mean features.

The model architecture includes an LSTM layer with 50 units and ReLU activation, followed by a Dense layer with 32 neurons for additional feature extraction. The output layer uses a softmax activation function for multi-class classification, with one neuron per target class. Compiled with the Adam optimizer and sparse categorical cross-entropy loss, the model is trained for 30 epochs with a batch size of 32, reserving 20% of the training data for validation. Model performance is evaluated on the test set using accuracy and a classification report. In the training log, validation accuracy reaches about 0.98 within the first six epochs.
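
To see what a single timestep means for an LSTM, the sketch below implements one step of a one-unit LSTM cell in plain Python. It uses the textbook gate equations with tanh activations (the notebook's layer uses ReLU instead), and all weights and inputs are made-up values; it is an illustration, not the Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_single_step(x, w, h_prev=0.0, c_prev=0.0):
    """One step of a one-unit LSTM cell.

    w maps each gate name to (input_weights, recurrent_weight, bias).
    h_prev and c_prev default to zero, as at the start of a sequence.
    """
    def pre(gate):
        w_in, w_rec, b = w[gate]
        return sum(wi * xi for wi, xi in zip(w_in, x)) + w_rec * h_prev + b

    i = sigmoid(pre("input"))        # input gate
    f = sigmoid(pre("forget"))       # forget gate
    g = math.tanh(pre("candidate"))  # candidate cell value
    o = sigmoid(pre("output"))       # output gate
    c = f * c_prev + i * g           # new cell state
    return o * math.tanh(c)          # new hidden state


x = [0.5, -1.2, 0.3]  # one sample's features at its only timestep
weights = {
    "input": ([0.4, 0.1, -0.3], 0.2, 0.0),
    "forget": ([0.9, -0.7, 0.5], 0.3, 1.0),
    "candidate": ([-0.2, 0.6, 0.8], -0.1, 0.0),
    "output": ([0.3, 0.3, 0.3], 0.5, 0.0),
}
# Same cell, but with completely different forget-gate parameters.
altered = dict(weights, forget=([5.0, 5.0, 5.0], 5.0, -5.0))

h_a = lstm_single_step(x, weights)
h_b = lstm_single_step(x, altered)
print(h_a == h_b)
```

With zero initial state the forget gate multiplies a zero cell state and the recurrent weights multiply a zero hidden state, so neither affects the output; the cell's memory mechanism only comes into play when a sample spans more than one timestep.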

In [22]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from sklearn.preprocessing import StandardScaler

#assume X_train and X_test are the original datasets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#reshape data into 3D format: (samples, timesteps, features)
#for example, using 1 timestep if not dealing with time-series
X_train_reshaped = np.expand_dims(X_train_scaled, axis=1)  # Shape: (samples, timesteps=1, features)
X_test_reshaped = np.expand_dims(X_test_scaled, axis=1)    # Shape: (samples, timesteps=1, features)

In [23]:
#define the LSTM model
from tensorflow.keras.layers import Input

lstm_model = Sequential([
    #declare the input shape explicitly: (timesteps, features)
    Input(shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])),
    LSTM(50, activation='relu'),
    Dense(32, activation='relu'),
    Dense(len(y_train.unique()), activation='softmax')  # Output layer for multi-class classification
])

#compile the model
lstm_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = lstm_model.fit(
    X_train_reshaped, y_train,
    validation_split=0.2,
    epochs=30,
    batch_size=32,
    verbose=1
)

#evaluate the model
y_pred_probs = lstm_model.predict(X_test_reshaped)
y_pred = np.argmax(y_pred_probs, axis=1)

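#classification report
#LabelEncoder orders classes alphabetically (Drought=0, Flood=1, Normal=2),
#so take the display names from the fitted encoder to keep them aligned
from sklearn.metrics import classification_report, accuracy_score
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=encoder.classes_)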

print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(report)

Epoch 1/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 5s 602us/step - accuracy: 0.9014 - loss: 0.2441 - val_accuracy: 0.9706 - val_loss: 0.0674
Epoch 2/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 4s 603us/step - accuracy: 0.9711 - loss: 0.0708 - val_accuracy: 0.9760 - val_loss: 0.0579
Epoch 3/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 4s 596us/step - accuracy: 0.9757 - loss: 0.0584 - val_accuracy: 0.9768 - val_loss: 0.0539
Epoch 4/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 4s 586us/step - accuracy: 0.9787 - loss: 0.0513 - val_accuracy: 0.9777 - val_loss: 0.0540
Epoch 5/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 4s 585us/step - accuracy: 0.9791 - loss: 0.0493 - val_accuracy: 0.9825 - val_loss: 0.0416
Epoch 6/30
6460/6460 ━━━━━━━━━━━━━━━━━━━━ 4s 603us/step - accuracy: 0.9815 - loss: 0.0447 - val_accuracy: 0.9825 - val_loss: 0.0424
Epoch 7/30

<font size="4">**33. Save Models**</font>

In [25]:
#save the deep learning model as .h5
deep_learning_filename_h5 = "models/deep_learning_model2.h5"
model.save(deep_learning_filename_h5)
print(f"Deep learning model saved as {deep_learning_filename_h5}")



Deep learning model saved as models/deep_learning_model2.h5


In [26]:
#save the LSTM model as .h5
lstm_model_filename_h5 = "models/lstm_model2.h5"
lstm_model.save(lstm_model_filename_h5)  #save lstm_model, not the MLP stored in `model`
print(f"LSTM model saved as {lstm_model_filename_h5}")



LSTM model saved as models/lstm_model2.h5
