# A Jupyter Notebook to Analyze Weather Data
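This notebook analyzes SYNOP weather-station data with five machine learning algorithms: Linear Regression, Random Forest, Gradient Boosting, Neural Networks, and XGBoost. Two scenarios are considered: one model per physical measurement (Scenario 1) and one model on the combined data (Scenario 2).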

## Import Required Libraries
Import libraries such as pandas, numpy, matplotlib, seaborn, and machine learning libraries like scikit-learn, TensorFlow, and XGBoost.

In [2]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, mean_absolute_percentage_error, explained_variance_score
# TensorFlow/Keras and XGBoost are required by the Neural Networks and XGBoost cells below
import tensorflow as tf
from tensorflow import keras
import xgboost as xgb

## Load and Explore Dataset
Load the dataset into a pandas DataFrame, display the first few rows, and check for missing values and data types.

In [3]:
# Load and Explore Dataset
df = pd.read_csv("dataset_synop.csv", encoding="ISO-8859-1", delimiter=";")
print("First 5 rows of the dataset:")
print(df.head())
print("\nDataset Info:")
df.info()

First 5 rows of the dataset:
   WMO Station ID                       Date  Sea level pressure  \
0            7761  2010-01-05T20:00:00+02:00             99490.0   
1            7790  2010-03-01T17:00:00+02:00            101130.0   
2            7790  2010-02-27T23:00:00+02:00            100990.0   
3            7790  2010-02-28T11:00:00+02:00            100300.0   
4            7790  2010-02-25T17:00:00+02:00            100710.0   

   3-hour pressure variation  Barometric trend type  \
0                     -100.0                    7.0   
1                      -50.0                    5.0   
2                     -240.0                    6.0   
3                       70.0                    3.0   
4                      -80.0                    7.0   

   10-min mean wind direction  10-min mean wind speed  Temperature  Dew point  \
0                        60.0                     1.5       285.05     284.15   
1                       330.0                     3.6       285.65   
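
`df.info()` reports non-null counts, but an explicit per-column tally makes the missing-value check easier to read. The following is a minimal sketch on a toy frame; the values and the `toy` and `missing` names are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the SYNOP data (illustrative values only).
toy = pd.DataFrame({
    "Sea level pressure": [99490.0, 101130.0, np.nan, 100300.0],
    "Temperature": [285.05, 285.65, 284.95, np.nan],
    "Dew point": [284.15, np.nan, np.nan, 283.05],
})

# Count and percentage of missing values per column, worst first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
print(toy.dtypes)
```

On the real data, the same two expressions can be applied to `df` directly.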


## Identify Common Physical Measurements
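List the columns that represent physical measurements and are common across all stations. The list in this cell is written out by hand; `Temperature (C)` is the column later used as the prediction target.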

In [4]:
# Identify Common Physical Measurements
common_columns = [
    "Sea level pressure", "Barometric trend type", "10-min mean wind direction",
    "10-min mean wind speed", "Dew point", "Humidity", "Horizontal visibility",
    "Station pressure", "Temperature (C)"
]
print("Common physical measurement columns:")
print(common_columns)

Common physical measurement columns:
['Sea level pressure', 'Barometric trend type', '10-min mean wind direction', '10-min mean wind speed', 'Dew point', 'Humidity', 'Horizontal visibility', 'Station pressure', 'Temperature (C)']
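
The list above is hard-coded. One possible way to derive it from the data is to keep only the columns for which every station reports at least one value. This is a sketch on a toy frame, assuming `WMO Station ID` identifies the station; the values and the `reported` and `common` names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: station 7790 never reports Humidity (illustrative values only).
toy = pd.DataFrame({
    "WMO Station ID": [7761, 7761, 7790, 7790],
    "Sea level pressure": [99490.0, np.nan, 101130.0, 100990.0],
    "Humidity": [94.0, 90.0, np.nan, np.nan],
    "Dew point": [284.15, 283.9, 280.1, 279.8],
})

measurements = toy.columns.drop("WMO Station ID")

# One row per station; True where the station has at least one non-missing value.
reported = toy[measurements].notna().groupby(toy["WMO Station ID"]).any()

# Keep the columns that every station reports.
common = [col for col in measurements if reported[col].all()]
print(common)
```

A stricter criterion, such as a minimum share of non-missing rows per station, may suit the real data better.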


## Handle Missing Values
Fill or drop missing values in the dataset to prepare it for machine learning models.

In [5]:
# Handle Missing Values
for column in common_columns:
    # Assign the result back; chained inplace fillna operates on a copy in newer pandas
    df[column] = df[column].fillna(df[column].mean())
print("Missing values handled.")

Missing values handled.
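
The column-by-column loop can also be written as a single call, since `DataFrame.fillna` accepts a Series that maps column names to fill values. A minimal sketch on a toy frame follows; the values and the `cols` name are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Humidity": [94.0, np.nan, 90.0],
    "Dew point": [284.0, 282.0, np.nan],
    "Station name": ["A", None, "C"],  # not in cols, so left untouched
})
cols = ["Humidity", "Dew point"]

# toy[cols].mean() is a Series indexed by column name;
# fillna uses it to pick the fill value for each column.
toy[cols] = toy[cols].fillna(toy[cols].mean())
print(toy)
```

Mean imputation is the approach the notebook uses; whether it suits every column (for example, a categorical code such as `Barometric trend type`) is a separate question.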



## Normalize Data
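Normalize the common measurement columns to a range of -1 to 1 using MinMaxScaler. The target column `Temperature (C)` is scaled as well, so the error metrics reported in the model cells are in normalized units, not degrees.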

In [6]:
# Normalize Data
scaler = MinMaxScaler(feature_range=(-1, 1))
df[common_columns] = scaler.fit_transform(df[common_columns])
print("Data normalized.")

Data normalized.
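
The cell above fits the scaler on the full DataFrame before the train/test split, so the test rows influence the learned minimum and maximum. A common alternative is to fit on the training rows only and reuse that mapping for the test rows. The sketch below uses synthetic data and is not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in features (not the weather data).
rng = np.random.default_rng(42)
X = rng.normal(loc=280.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)  # min/max learned from training rows only
X_test_scaled = scaler.transform(X_test)        # same mapping; values may fall slightly outside [-1, 1]

# The mapping is invertible, so scaled values can be converted back.
X_test_restored = scaler.inverse_transform(X_test_scaled)
```

Using a separate scaler for the target column makes it straightforward to convert predictions back to the original units with `inverse_transform`.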


## Linear Regression
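Train and evaluate a linear regression model for each physical measurement in Scenario 1 and on the combined data in Scenario 2. The cell below covers a single case: it predicts the normalized `Temperature (C)` from the other eight common measurements, using an 80/20 train/test split.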

In [7]:
# Linear Regression
X = df[common_columns].drop(columns=["Temperature (C)"])
y = df["Temperature (C)"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression - MAE: {mae}, R²: {r2}")

Linear Regression - MAE: 0.020515196127335416, R²: 0.9892443758042915
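
The import cell brings in more metrics than this cell reports, and the MAE is expressed on the normalized scale. The sketch below shows a hypothetical helper, `regression_report`, that gathers several metrics, and a second hypothetical helper, `mae_to_original_units`, that undoes the MinMaxScaler stretch for an error value. The -20 to 40 range in the usage lines is an assumed example, not the dataset's actual temperature range.

```python
import numpy as np
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)


def regression_report(y_true, y_pred):
    """Collect common regression metrics in one dictionary."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
        "ExplainedVariance": explained_variance_score(y_true, y_pred),
    }


def mae_to_original_units(mae_scaled, data_min, data_max, feature_range=(-1, 1)):
    """Convert an absolute error on the MinMaxScaler scale back to original units."""
    lo, hi = feature_range
    return mae_scaled * (data_max - data_min) / (hi - lo)


# Toy targets on the normalized scale.
y_true = np.array([-1.0, 0.0, 1.0])
y_pred = np.array([-0.9, 0.1, 0.9])
report = regression_report(y_true, y_pred)
print(report)

# Assumed original range of -20 to 40 (hypothetical), giving a span of 60.
print(mae_to_original_units(report["MAE"], data_min=-20.0, data_max=40.0))
```

`mean_absolute_percentage_error` is left out of the helper because targets normalized to [-1, 1] pass through zero, which inflates percentage errors.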


## Random Forest
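Train and evaluate a random forest model for each physical measurement in Scenario 1 and on the combined data in Scenario 2. The cell below reuses the train/test split from the Linear Regression cell, with `Temperature (C)` as the target.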

In [8]:
# Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print(f"Random Forest - MAE: {mae_rf}, R²: {r2_rf}")

Random Forest - MAE: 0.0012570847171043465, R²: 0.999501490611814
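
An R² this close to 1 suggests that a small number of inputs may determine the target almost entirely. A fitted random forest exposes `feature_importances_`, which gives a quick view of which inputs the trees rely on. The sketch below uses synthetic data in which the target is built mostly from one column; the column names are borrowed for readability only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends strongly on "Dew point", weakly on "Humidity".
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Dew point": rng.normal(size=300),
    "Humidity": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 2.0 * X["Dew point"] - 0.5 * X["Humidity"] + rng.normal(scale=0.05, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; a larger value means the trees split on it more.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, the same two lines could be applied to `rf_model` with `X_train.columns` as the index.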


## Gradient Boosting
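Train and evaluate a gradient boosting model for each physical measurement in Scenario 1 and on the combined data in Scenario 2. The cell below reuses the train/test split from the Linear Regression cell, with `Temperature (C)` as the target.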

In [None]:
# Gradient Boosting
gb_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)

mae_gb = mean_absolute_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)
print(f"Gradient Boosting - MAE: {mae_gb}, R²: {r2_gb}")
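
The cell fixes `n_estimators=100`. `GradientBoostingRegressor.staged_predict` yields the prediction after each boosting stage, which makes it possible to see how the error evolves and whether fewer stages would do. The following is a sketch on synthetic data; the `val_mae` and `best_n` names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (not the weather data).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 4))
y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

gb = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# One validation MAE per boosting stage (stage 1 uses one tree, stage 100 uses all).
val_mae = [mean_absolute_error(y_val, pred) for pred in gb.staged_predict(X_val)]
best_n = int(np.argmin(val_mae)) + 1
print(f"Lowest validation MAE at {best_n} estimators: {min(val_mae):.4f}")
```

Choosing the stage count this way should use a validation split rather than the final test set, so that the reported test score stays unbiased.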

## Neural Networks
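Train and evaluate a neural network model for each physical measurement in Scenario 1 and on the combined data in Scenario 2. The cell below reuses the train/test split from the Linear Regression cell, with `Temperature (C)` as the target.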

In [None]:
# Neural Networks
def create_nn_model(input_shape):
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(input_shape,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1)
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

nn_model = create_nn_model(X_train.shape[1])
nn_model.fit(X_train, y_train, epochs=50, verbose=0)
y_pred_nn = nn_model.predict(X_test).flatten()

mae_nn = mean_absolute_error(y_test, y_pred_nn)
r2_nn = r2_score(y_test, y_pred_nn)
print(f"Neural Networks - MAE: {mae_nn}, R²: {r2_nn}")
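
Where TensorFlow is not installed, scikit-learn's `MLPRegressor` can stand in with a comparable architecture (two hidden layers of 64 and 32 ReLU units, Adam optimizer). This is a swapped-in technique rather than the Keras model above, and its defaults (for example, L2 regularization and the stopping rule) differ, so the scores are not directly interchangeable. The sketch uses synthetic data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression problem with inputs already in [-1, 1].
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(500, 4))
y = 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.02, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Two hidden layers mirroring the Dense(64) -> Dense(32) -> Dense(1) layout.
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                   solver="adam", max_iter=500, random_state=42)
mlp.fit(X_tr, y_tr)

y_hat = mlp.predict(X_te)
mae = mean_absolute_error(y_te, y_hat)
r2 = r2_score(y_te, y_hat)
print(f"MLPRegressor - MAE: {mae:.4f}, R²: {r2:.4f}")
```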

## XGBoost
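Train and evaluate an XGBoost model for each physical measurement in Scenario 1 and on the combined data in Scenario 2. The cell below reuses the train/test split from the Linear Regression cell, with `Temperature (C)` as the target.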

In [None]:
# XGBoost
xgb_model = xgb.XGBRegressor(n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost - MAE: {mae_xgb}, R²: {r2_xgb}")
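
Each model cell repeats the same fit, predict, and score steps. One way to compare the models side by side is to loop over a dictionary of estimators and collect the scores in a DataFrame. The sketch below runs on synthetic data; XGBoost is added only if the package is importable, and the `models` and `results` names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (not the weather data).
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(300, 4))
y = 0.7 * X[:, 0] + 0.2 * X[:, 1] ** 2 + rng.normal(scale=0.02, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
}
try:
    # Optional dependency: included only when xgboost is installed.
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(n_estimators=100, random_state=42)
except ImportError:
    pass

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({"model": name,
                 "MAE": mean_absolute_error(y_te, pred),
                 "R2": r2_score(y_te, pred)})

# One row per model, best (lowest) MAE first.
results = pd.DataFrame(rows).set_index("model").sort_values("MAE")
print(results)
```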