<a href="https://colab.research.google.com/github/IvaroEkel/Probabilistic-Machine-Learning_lecture-PROJECTS/blob/main/TEMPLATE_Probabilistic_Machine_Learning_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probabilistic Machine Learning - Project Report

**Course:** Probabilistic Machine Learning (SoSe 2025)  
**Lecturer:** [Alvaro Ruales]  
**Student(s) Name(s):** Konstantin Abe  
**GitHub Username(s):** MetropolisJenensis  
**Date:**  
**PROJECT-ID:** [Assigned Project ID]  

---


## 1. Introduction
### 1.1 Dataset
The dataset used in this analysis originates from historical weather observations collected across various locations in Australia. It comprises a total of 145,460 records and 23 variables, including both continuous measurements and categorical features. The data is typically used for modeling weather-related outcomes, especially for predicting whether it will rain the following day.
Several variables contain missing values, notably Evaporation, Sunshine, Cloud9am, Cloud3pm, and the pressure-related measurements. These missing values must be appropriately handled during data preprocessing to ensure robust analysis and modeling.

The target variable for predictive modeling is RainTomorrow, which indicates whether it rained the next day. This makes the dataset particularly suitable for binary classification tasks in the context of weather forecasting.
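
As a rough illustration of the two points above (missing values and a binary target), the sketch below encodes the target and fills numeric gaps. It is not the preprocessing used in this report: the `Yes`/`No` labels, the median imputation, the helper name `prepare`, and the toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def prepare(df: pd.DataFrame, target: str = "RainTomorrow") -> pd.DataFrame:
    """Encode the binary target and impute numeric gaps (illustrative only)."""
    # Rows without a target label cannot be used for supervised learning.
    out = df.dropna(subset=[target]).copy()
    # Assumption: the target is stored as the strings "Yes" / "No".
    out[target] = out[target].map({"No": 0, "Yes": 1}).astype(int)
    # Assumption: median imputation for numeric columns; other strategies exist.
    numeric_cols = out.select_dtypes(include="number").columns.drop(target)
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Toy stand-in for a few rows of the weather data.
toy = pd.DataFrame({
    "Sunshine": [7.0, np.nan, 3.0, 9.0],
    "Pressure9am": [1010.0, 1015.0, np.nan, 1008.0],
    "RainTomorrow": ["No", "Yes", "No", None],
})
clean = prepare(toy)
print(clean)
```

Median imputation is used here only because it is simple and robust to outliers; columns with a very high share of missing values may call for a different treatment.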


### 1.2 Motivation
Predicting weather patterns, especially rainfall, is crucial for various sectors, including agriculture, transportation, and disaster management. Accurate predictions can help in resource allocation, planning, and mitigating the impacts of adverse weather conditions. This project aims to leverage probabilistic machine learning techniques to build a model that can predict whether it will rain tomorrow based on historical weather data.

Rain prediction is one of the most widely recognized prediction tasks among the general public, largely due to its presence in modern media, TV, and weather apps. Hardly any news show is complete without a weather forecast, and the public is used to seeing these predictions.
Academia is likewise interested in leveraging machine learning techniques to improve the accuracy of weather predictions. Given the growing complexity of the climate system and the limitations of traditional forecasting methods, artificial intelligence has emerged as a transformative tool. These methods are increasingly used not only to improve predictive accuracy, but also to enhance the detection, attribution, and communication of extreme events. Their ability to integrate heterogeneous data sources and uncover complex spatio-temporal patterns makes them especially suited to this task. As highlighted in [1], developing reliable and explainable machine learning models is a critical step toward strengthening early warning systems and building trust in risk communication and decision-making processes.

Due to the complexity of the task, the project focuses only on comparing different methods on the given dataset. The goal is to find the best-performing model, even though state-of-the-art methods are already more advanced.

This leads to the following research question:
What is the best-performing model for predicting rain tomorrow based on the given dataset?

To follow a scientific approach, the project will be structured along the following hypotheses:
1. **Hypothesis 1:** A Random Forest model will outperform a Logistic Regression model in predicting whether it will rain tomorrow based on the historical weather data.

2. **Hypothesis 2:** A Gradient Boosting model will outperform both the Random Forest and Logistic Regression models in terms of prediction accuracy.

3. **Hypothesis 3:** A Neural Network model will outperform all other models in terms of prediction accuracy, given its ability to capture complex patterns in the data.
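
One way the three hypotheses could be tested is to score all four model families with the same cross-validation folds and the same metric. The sketch below does this with scikit-learn on synthetic data. The data generator, the class balance of roughly 22% positives, and every hyperparameter are assumptions for illustration, not the configuration used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather data (assumption: ~22% positive class).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.78, 0.22], random_state=0,
)

# Hyperparameters are placeholders chosen to keep the example fast.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    ),
}

# Same stratified folds and same metric for every model, so scores are comparable.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Because the classes are imbalanced, accuracy alone can be misleading; passing a different `scoring` value such as `"roc_auc"` or `"f1"` to `cross_val_score` would give a complementary view.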


## 2. Data Loading and Exploration

- Code to load data

- Basic exploration (plots, statistics, missing data, etc.)

The dataset was loaded from a CSV file using the pandas library. The initial exploration revealed that the dataset contains 145,460 records and 23 variables, including both continuous and categorical features. The target variable, RainTomorrow, is binary, indicating whether it rained the next day. Several variables have missing values, particularly Evaporation, Sunshine, Cloud9am, Cloud3pm, and the pressure-related measurements. These missing values will need to be addressed during data preprocessing. The column "Location" identifies the weather station at which each observation was recorded. The dataset contains 49 different locations, which are shown in the following figure.
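
The loading and first-look steps described above can be sketched as follows. The file name `weatherAUS.csv` and the helper `summarize` are assumptions; a tiny in-memory CSV stands in for the real file so that the example runs on its own.

```python
import io

import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic facts reported in this section."""
    return {
        "n_records": len(df),
        "n_variables": df.shape[1],
        "n_locations": df["Location"].nunique(),
        # Share of missing values per column, largest first.
        "missing_share": df.isna().mean().sort_values(ascending=False),
    }


# With the real data this would be something like
#   df = pd.read_csv("weatherAUS.csv", parse_dates=["Date"])  # file name assumed
# A small in-memory stand-in keeps the sketch self-contained.
csv_text = """Date,Location,Sunshine,Rainfall,RainTomorrow
2010-01-01,Sydney,8.1,0.0,No
2010-01-02,Sydney,,12.4,Yes
2010-01-01,Perth,,0.0,No
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

summary = summarize(df)
print(summary["n_records"], summary["n_variables"], summary["n_locations"])
print(summary["missing_share"])
```

On the full dataset, the same summary should reproduce the record, variable, and location counts quoted above and show which columns have the largest gaps.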


<img src="picture_australia.png" width="45%"/>


<img src="rainfall_distribution.png" width="45%"/>
<img src="raincounts.png" width="45%"/>
<img src="rainfall_per_month.png" width="45%"/>

<img src="correlation_matrix.png" width="45%"/>

Of the 145,460 records, 31,880 are days with rain.
The dates range from 2007-11-01 to 2017-06-25, covering almost 10 years of data.
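
The two statements above translate into a quick sanity check on class balance and time span. The counts and dates are taken from the text. The majority-class baseline is only a reference point and assumes that the target is about as imbalanced as the rain count quoted here.

```python
from datetime import date

n_records = 145_460  # total records, as reported above
n_rain = 31_880      # records with rain, as reported above

rain_share = n_rain / n_records
# A classifier that always predicts "no rain" would be correct on the remaining share
# (assumption: the target has roughly the same imbalance as this count).
majority_baseline = 1 - rain_share

first_day = date(2007, 11, 1)
last_day = date(2017, 6, 25)
span_days = (last_day - first_day).days
span_years = span_days / 365.25

print(f"rain share: {rain_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"time span: {span_days} days (about {span_years:.1f} years)")
```

Any model compared in this report should be judged against such a baseline, since a high raw accuracy is easy to reach when most days are dry.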


An inspection of the variable Rainfall shows that many of the higher values are linked to floods in Australia. The highest rainfall, 371.0 mm, was recorded in Coffs Harbour, NSW, Australia on 2010-02-17.
Another high value is linked to the floods in Queensland, Australia in 2010 and 2011.

## 3. Data Preprocessing

- Steps taken to clean or transform the data




## 4. Probabilistic Modeling Approach

### 4.1 Logistic Regression
The first chosen model is Logistic Regression, a fundamental probabilistic model used for binary classification tasks. It estimates the probability of a binary outcome based on one or more predictor variables. The logistic function maps any real-valued number into the (0, 1) interval, making it suitable for modeling probabilities. The general form of the logistic regression model is defined as:

$$P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}$$
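
To make the formula concrete, the following minimal sketch evaluates it for a single observation using only the Python standard library. The coefficients and feature values are hypothetical and chosen so that the linear predictor is exactly zero; `sigmoid` and `predict_proba` are illustrative helper names, not part of any library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, beta0, betas):
    """P(Y=1 | X=x) for intercept beta0 and coefficients betas."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)


# Hypothetical coefficients and standardized feature values, for illustration only.
# Linear predictor: -1.0 + 2.0 * 0.8 + 0.5 * (-1.2) = 0.0
p = predict_proba(x=[0.8, -1.2], beta0=-1.0, betas=[2.0, 0.5])
print(p)
```

A linear predictor of zero corresponds to a probability of 0.5, positive values push the probability toward 1, and negative values toward 0.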

- Description of the models chosen
- Why they are suitable for your problem
- Mathematical formulations (if applicable)



## 5. Model Training and Evaluation

- Training process
- Model evaluation (metrics, plots, performance)
- Cross-validation or uncertainty quantification



## 6. Results

- Present key findings
- Comparison of models if multiple approaches were used



## 7. Discussion

- Interpretation of results
- Limitations of the approach
- Possible improvements or extensions



## 8. Conclusion

- Summary of main outcomes



## 9. References

[1] Camps-Valls, G., Fernández-Torres, MÁ., Cohrs, KH. et al. Artificial intelligence for modeling and understanding extreme weather and climate events. Nat Commun 16, 1919 (2025). https://doi.org/10.1038/s41467-025-56573-8

[2] James, G., Witten, D., Hastie, T., & Tibshirani, R. An Introduction to Statistical Learning (1st ed.). Springer (2013).

