<a href="https://colab.research.google.com/github/IvaroEkel/Probabilistic-Machine-Learning_lecture-PROJECTS/blob/main/TEMPLATE_Probabilistic_Machine_Learning_Project_Report.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Probabilistic Machine Learning - Project Report

**Course:** Probabilistic Machine Learning (SoSe 2025)                                    
**Lecturer:** Alvaro Diaz-Ruelas                                                     
**Student(s) Name(s):** Johannes Betzler                                               
**GitHub Username(s):** Gimelot                               
**Date:** 09.06.2025                                                  
**PROJECT-ID:** 26-2BJXXXX  

---


## 1. Introduction

- The dataset is provided by the Deutscher Wetterdienst (DWD) and contains hourly meteorological measurements over a 10-year period, from January 1, 2014, to December 31, 2023. It includes parameters such as temperature, relative humidity, precipitation, pressure at station height, and wind speed. The data was collected from 19 weather stations located within a 100 km radius around the Erfurt-Weimar weather station. 
- The goal of this project is to generate realistic weather time series for an artificial location near Erfurt-Weimar using Gaussian-based methods (e.g., Gaussian Processes or Gaussian Mixture Models). This approach is suitable for capturing the uncertainty and correlations in time-series data and fits well within the context of probabilistic learning.
- For evaluation, the Erfurt-Weimar station will be excluded from the training set, and weather data will be generated for this location. The generated time series will then be compared to the actual recorded data to assess the model's ability to generalize and interpolate weather conditions for unseen locations.
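
The hold-out evaluation described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column name `STATIONS_ID`, the helper `split_holdout_station`, and the toy values are assumptions made for the example.

```python
import pandas as pd


def split_holdout_station(df: pd.DataFrame, holdout_id, station_col: str = "STATIONS_ID"):
    """Split a multi-station frame into (train, holdout) by station identifier.

    `station_col` is an assumed column name; adapt it to the real dataframe.
    """
    is_holdout = df[station_col] == holdout_id
    return df[~is_holdout].copy(), df[is_holdout].copy()


# Toy data with three made-up stations; station 2 plays the role of the held-out site.
toy = pd.DataFrame({
    "STATIONS_ID": [1, 1, 2, 2, 3, 3],
    "TT_TU": [5.0, 6.0, 4.5, 5.5, 7.0, 8.0],
})
train, holdout = split_holdout_station(toy, holdout_id=2)
```

The model would be fitted on `train` only, and the generated series for the held-out location would then be compared against `holdout`.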


## 2. Data Loading and Exploration

- The code used to load and process the data is available in the notebook DataframeGenerator.ipynb.

- Basic Statistics:
    - The dataset contains 87,648 hourly entries per station, totaling 1,665,312 entries for 19 stations.
    - Each station's data includes 22 columns:
        - 8 time-related columns
        - 2 station identification columns
        - 3 geographic metadata columns
        - 9 measured weather parameters, of which 5 are used for generation.
    - The stations are located at elevations between 164 m and 938 m.

- Missing Data:
    - Across all columns, up to 49% of the values in a single column are missing. For the features used in the generation model, the maximum overall missing rate is 1.23%.
    - The longest gap in the dataset is 6054 hours, but the median gap length is 8 hours.
    - For any single feature used in generation, the maximum missing rate per station is 7.61%.

    ![Histogram of Missing Value Percentage](Missing_Value_Precentage_hist.png)
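
The missing-rate and gap-length figures above could be computed along the following lines. This is a hedged sketch: it assumes a regular hourly series in which missing measurements appear as `NaN`, and the helper names `missing_rate` and `gap_lengths` are made up for this example.

```python
import numpy as np
import pandas as pd


def missing_rate(series: pd.Series) -> float:
    """Percentage of NaN entries in a series."""
    return float(series.isna().mean() * 100.0)


def gap_lengths(series: pd.Series) -> pd.Series:
    """Length of every run of consecutive NaN values (one entry per gap)."""
    is_na = series.isna()
    # A new run starts wherever the NaN flag differs from the previous row.
    run_id = is_na.astype(int).diff().ne(0).cumsum()
    return is_na[is_na].groupby(run_id[is_na]).size()


# Toy series with three gaps of length 2, 1 and 3.
toy = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan, 3.0, 4.0, np.nan, np.nan, np.nan])
gaps = gap_lengths(toy)
longest_gap = int(gaps.max())
median_gap = float(gaps.median())
rate = missing_rate(toy)
```

With an hourly index, a gap length in rows equals the gap length in hours; if rows are missing from the index entirely, the series would first need to be reindexed to a full hourly range.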

- Feature Distributions and Characteristics:
    - TT_TU (Temperature):
        - Unrealistic values are filtered.
        - Median: 8.8 °C, Min: –23.4 °C, Max: 38.6 °C.
        - Distributions are bell-shaped and fairly consistent across stations, though some are flatter.

        ![Temperature Values](TT_TU_Values.png)

    - RF_TU (Relative Humidity):
        - Unrealistic values are filtered. 
        - Median: 82%, Min: 4%, Max: 100%.
        - Higher humidity values occur more frequently, with a slight drop just before 100%, which varies by station.
        - The rise toward the maximum differs between stations: roughly linear for some, roughly exponential for others.

        ![Relative Humidity Values](RF_TU_Values.png)

    - R1 (Precipitation):
        - Unrealistic values are filtered.
        - Median and Min: 0 mm/hour, Max: 50 mm/hour.
        - Over 75% of the values are 0 (no precipitation).

        ![Precipitation](R1_Values.png)

    - P0 (Pressure):
        - Unrealistic values are filtered.
        - Median: 972 hPa, Min: 864 hPa, Max: 1027 hPa.
        - Bell-shaped distributions.
        - Stations fall into two groups with pressure modes around 915 hPa and 980 hPa respectively.

        ![Pressure](P0_Values.png)

    - F (Wind Speed):
        - Unrealistic values are filtered.
        - Median: 3.1 m/s, Min: 0 m/s, Max: 23.9 m/s.
        - Bell-shaped distribution with a positive skew (tail toward higher wind speeds), more pronounced at stations with lower peak wind speed.

        ![Windspeed](F_Values.png)
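
The report does not state how unrealistic values are filtered. One possible approach is to replace values outside a plausible physical range with `NaN`, as sketched below. The bounds in `PLAUSIBLE_RANGE`, the helper `mask_unrealistic`, and the `-999` placeholder in the toy data are illustrative assumptions, not the thresholds used in `DataframeGenerator.ipynb`.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility bounds per feature (illustrative only).
PLAUSIBLE_RANGE = {
    "TT_TU": (-50.0, 50.0),    # temperature in °C
    "RF_TU": (0.0, 100.0),     # relative humidity in %
    "R1": (0.0, 200.0),        # precipitation in mm/hour
    "P0": (800.0, 1100.0),     # pressure at station height in hPa
    "F": (0.0, 60.0),          # wind speed in m/s
}


def mask_unrealistic(df: pd.DataFrame, bounds=PLAUSIBLE_RANGE) -> pd.DataFrame:
    """Return a copy in which out-of-range values are replaced by NaN."""
    out = df.copy()
    for col, (lo, hi) in bounds.items():
        if col in out.columns:
            # `where` keeps values for which the condition holds and sets the rest to NaN.
            out[col] = out[col].where(out[col].between(lo, hi))
    return out


# Toy frame: -999.0 stands in for a missing-value placeholder, 104.0 for a sensor glitch.
toy = pd.DataFrame({
    "TT_TU": [12.3, -999.0, 38.6],
    "RF_TU": [82.0, 104.0, 4.0],
})
clean = mask_unrealistic(toy)
```

Masking rather than dropping rows keeps the hourly index intact, so the masked values simply count toward the missing-data statistics.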

- Correlation Analysis:
    - Temperature and Relative Humidity: Negatively correlated (–0.5).
    - Wind Speed and Pressure: Positively correlated (+0.31).
    - Temperature also correlates with:
        - month_sin: +0.45
        - month_cos: +0.64

    ![Correlation Matrix](Corr.png)
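
The `month_sin` and `month_cos` features are cyclical encodings of the month. The sketch below shows one common way to build them and to compute a Pearson correlation matrix. The phase convention (`month - 1`), the column name `timestamp`, and the toy temperatures are assumptions; the sign and size of the resulting correlations depend on the convention actually used in the project.

```python
import numpy as np
import pandas as pd


def add_month_cycle(df: pd.DataFrame, time_col: str = "timestamp") -> pd.DataFrame:
    """Add sine/cosine encodings of the month so that December and January are neighbours."""
    out = df.copy()
    month = out[time_col].dt.month
    angle = 2.0 * np.pi * (month - 1) / 12.0  # assumed phase: January maps to angle 0
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)
    return out


# Toy data: one made-up temperature value per month.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2014-01-01", periods=12, freq="MS"),
    "TT_TU": [0.5, 1.2, 5.0, 9.1, 13.4, 16.8, 18.9, 18.5, 14.2, 9.6, 4.8, 1.7],
})
encoded = add_month_cycle(toy)
corr = encoded[["TT_TU", "month_sin", "month_cos"]].corr()  # Pearson by default
```

Because sine and cosine together describe a point on the unit circle, a seasonal signal with any phase shows up as a linear combination of the two features, which is why both are usually kept.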

- Time Series Analysis:
    - Hourly Patterns:
        - Temperature peaks around 14:00, lowest around 04:00.

        - Relative Humidity is highest at night, lowest during the day.

        - Precipitation and Pressure are fairly constant throughout the day.
        - Wind Speed tends to be higher during the day, though the extent varies by station.
    
    ![Temperature Hourly](TT_TU_hourly.png)

    ![Relative Humidity Hourly](RF_TU_hourly.png)

    ![Windspeed Hourly](F_hourly.png)    

    - Daily Patterns:
        - No significant patterns observed; values are relatively evenly distributed.

    - Monthly Patterns:
        - Temperature peaks in summer months.

        - Relative Humidity is lowest during summer.

        - Precipitation per hour is generally higher in summer than in winter.

        - Pressure is fairly consistent year-round, with smaller variance in summer.
        - Wind Speed tends to be lower in summer, higher in winter.

    ![Temperature Monthly](TT_TU_monthly.png)

    ![Relative Humidity Monthly](RF_TU_monthly.png)

    ![Precipitation Monthly](R1_monthly.png)

    ![Windspeed Monthly](F_monthly.png)
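
Hourly and monthly patterns like those above are typically obtained by averaging a feature over the hour of day or the month. The following sketch assumes a datetime column named `timestamp`; the helper `mean_profile` and the synthetic temperature curve are made up for illustration and do not reproduce the station data.

```python
import numpy as np
import pandas as pd


def mean_profile(df: pd.DataFrame, value_col: str, by: str, time_col: str = "timestamp") -> pd.Series:
    """Mean of `value_col` grouped by a datetime component such as 'hour' or 'month'."""
    key = getattr(df[time_col].dt, by)
    return df.groupby(key)[value_col].mean()


# Synthetic two-day hourly series whose temperature peaks at 14:00 by construction.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2014-01-01", periods=48, freq=pd.Timedelta(hours=1)),
})
hour = toy["timestamp"].dt.hour
toy["TT_TU"] = 10.0 + 5.0 * np.cos(2.0 * np.pi * (hour - 14) / 24.0)

hourly = mean_profile(toy, "TT_TU", by="hour")
peak_hour = int(hourly.idxmax())
```

Calling `mean_profile(df, "TT_TU", by="month")` on the full dataset would give the corresponding monthly profile; grouping by station first would show how much the pattern varies between stations.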
    
- Inter-Parameter Consistency Checks:
    - A notable inconsistency: In 226,511 cases, rain is recorded even though precipitation is 0 mm.
    - All other consistency checks are passed.
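
A check of the kind described above can be expressed as a boolean mask. In this sketch the column name `RS_IND` for the precipitation indicator and the helper `rain_without_amount` are assumptions; the report does not name the column that records whether rain occurred.

```python
import pandas as pd


def rain_without_amount(df: pd.DataFrame, amount_col: str = "R1", indicator_col: str = "RS_IND") -> pd.Series:
    """Flag rows where the indicator reports precipitation but the amount is 0 mm."""
    return (df[indicator_col] == 1) & (df[amount_col] == 0)


# Toy data: rows 1 and 3 are inconsistent; the last row has a missing amount and is not flagged.
toy = pd.DataFrame({
    "R1": [0.0, 0.0, 1.2, 0.0, None],
    "RS_IND": [0, 1, 1, 1, 1],
})
inconsistent = rain_without_amount(toy)
n_inconsistent = int(inconsistent.sum())
```

Rows with a missing amount compare as `False` here, so they are excluded from the count rather than treated as inconsistent.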



## 3. Data Preprocessing

- Steps taken to clean or transform the data




## 4. Probabilistic Modeling Approach

- Description of the models chosen
- Why they are suitable for your problem
- Mathematical formulations (if applicable)



## 5. Model Training and Evaluation

- Training process
- Model evaluation (metrics, plots, performance)
- Cross-validation or uncertainty quantification



## 6. Results

- Present key findings
- Comparison of models if multiple approaches were used



## 7. Discussion

- Interpretation of results
- Limitations of the approach
- Possible improvements or extensions



## 8. Conclusion

- Summary of main outcomes



## 9. References

- Cite any papers, datasets, or tools used