In [None]:
import s3fs
import pyarrow as pa
import pyarrow.dataset as ds

import sys
import os

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import Window

import pandas as pd
import polars as pl
import altair as alt
import matplotlib.pyplot as plt

alt.data_transformers.disable_max_rows()

print("Pandas version: ", pd.__version__)
print("Pyarrow version: ", pa.__version__)
print("Pyspark version: ", pyspark.__version__)
print("Python version: ", sys.version)

We also set up [black](https://github.com/psf/black), a code formatter that is strongly recommended for all your Python projects. With it, you never have to worry about or debate code formatting again. By using it, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from `pycodestyle` nagging about formatting. You will save time and mental energy for more important matters.

In [3]:
import jupyter_black

jupyter_black.load()

## Connecting to the Spark cluster using Spark Connect

Here we connect to the cluster using Spark Connect. You may access the [Spark Connect UI](http://vlenpmod302spk1.hevs.ch:4040/jobs/) to troubleshoot and monitor your jobs.

In [4]:
spark = SparkSession.builder.remote("sc://VLENPMOD302SPK1.hevs.ch:15002").getOrCreate()

## Loading the data 

We first load the data available on the master node at `/home/data/meteosuisse.parquet`.

The data have been written using the so-called [Hive partitioning strategy](https://duckdb.org/docs/data/partitioning/hive_partitioning.html#:~:text=Hive%20partitioning%20is%20a%20partitioning,the%20name%20of%20the%20folder.), which splits a table into multiple files based on partition keys. The files are organized into folders whose names are determined by the partition key values:

```
meteosuisse
├── year=2021
│    ├── month=1
│    │   ├── stn=ZER
│    │   │   └── file0.parquet
│    │   ├── stn=MLS
│    │   │   └── file0.parquet
│    │   └── ...
│    └── month=2
│        ├── stn=ZER
│        │   └── file0.parquet
│        └── ...
└── year=2021
     ├── month=11
     │   ├── stn=ZER
     │   │   └── file0.parquet
     │   ├── stn=MLS
     │   │   └── file0.parquet
     │   └── ...
     └── month=12
         ├── stn=ZER
         │   └── file0.parquet
         └── ...
```
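
To make the layout above concrete, here is a minimal sketch, in plain Python rather than Spark, of how a reader can recover the partition keys from folder names and skip files without opening them. The helper names `parse_partitions` and `prune`, as well as the example file listing, are illustrative assumptions and not part of any library.

```python
from pathlib import PurePosixPath


def parse_partitions(path: str) -> dict[str, str]:
    """Extract key=value pairs encoded in the folder names of a Hive-style path."""
    parts = PurePosixPath(path).parts[:-1]  # drop the file name
    return dict(p.split("=", 1) for p in parts if "=" in p)


def prune(paths: list[str], **wanted: str) -> list[str]:
    """Keep only files whose partition values match every requested key."""
    return [
        p
        for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing mirroring the tree above.
files = [
    "meteosuisse/year=2021/month=1/stn=ZER/file0.parquet",
    "meteosuisse/year=2021/month=1/stn=MLS/file0.parquet",
    "meteosuisse/year=2021/month=2/stn=ZER/file0.parquet",
]

# Only the files under stn=ZER need to be read for a Zermatt query.
zermatt_files = prune(files, stn="ZER")
print(zermatt_files)
```

In this toy listing, a query restricted to `stn="ZER"` touches two of the three files. Skipping the others based on folder names alone is usually called partition pruning.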

**Questions**:

- What is the purpose of using a partitioning strategy such as the one used by Hive?
- What is a lazy dataframe? Does this cell cause any data to be downloaded?

In [1]:
# Load the data and display the schema

# Our data

Here is a description of the fields that you have above:

| Parameter | Unit   | Description
|-----------|--------|-------------------------------------------------------
| `pva200h0`| hPa    | vapour pressure 2 m above ground; hourly mean
| `ods000h0`| W/m²   | diffuse radiation; hourly mean
| `gre000h0`| W/m²   | global radiation; hourly mean
| `prestah0`| hPa    | atmospheric pressure at barometer altitude (QFE); hourly mean
| `tre200hx`| °C     | air temperature 2 m above ground; hourly maximum
| `tre200hn`| °C     | air temperature 2 m above ground; hourly minimum
| `tre200h0`| °C     | air temperature 2 m above ground; hourly mean
| `tre005hn`| °C     | air temperature 5 cm above grass; hourly minimum
| `tre005h0`| °C     | air temperature 5 cm above grass; hourly mean
| `hns000hs`| cm     | snow thickness; hourly instantaneous value
| `rre150hx`| mm     | precipitation; 10-minute total, hourly maximum
| `rre150h0`| mm     | precipitation; hourly total
| `ure200h0`| %      | relative air humidity 2 m above ground; hourly mean
| `htoauths`| cm     | snow depth (measured automatically); hourly instantaneous value
| `hto000hs`| cm     | snow depth; hourly instantaneous value
| `sre000h0`| min    | sunshine duration; hourly total
| `tde200hs`| °C     | dew point 2 m above ground; hourly instantaneous value
| `dkl010h0`| °      | wind direction; hourly mean
| `tso010hs`| °C     | soil temperature at -10 cm; hourly instantaneous value
| `tso020hs`| °C     | soil temperature at -20 cm; hourly instantaneous value
| `tso005hs`| °C     | soil temperature at -5 cm; hourly instantaneous value
| `oli000h0`| W/m²   | longwave irradiation; hourly mean



## Extract high-level statistics

Extract some high-level statistics from the dataset:

- The number of weather stations, identified by the `stn` column.
- The start and end timestamp of the dataset.

In [None]:
# Write your code here

Display how many data points we have for each station, in descending order of the number of data points.

**Question**: What could explain the discrepancies here?

In [None]:
# Write your code here

Show the starting and ending date of each station.

In [None]:
# Write your code here

- Compute the min, max and average time interval (in seconds) between successive data points for each station.
- Filter those stations that are not regular and examine them more closely by plotting the points on a line.
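
The interval computation can be checked on a toy example before running it on the cluster. The sketch below uses plain Python and made-up timestamps. In PySpark, one common approach (not the only one) is to compare each timestamp with the previous one via `F.lag` over a window partitioned by `stn` and ordered by the timestamp column, whose name is not shown in this notebook. The function name `interval_stats` is a hypothetical helper.

```python
from datetime import datetime
from itertools import groupby
from statistics import mean


def interval_stats(rows):
    """Return {station: (min, max, mean)} of gaps in seconds between successive points.

    `rows` is an iterable of (station, timestamp) pairs in any order.
    """
    stats = {}
    ordered = sorted(rows)  # sort by station, then by timestamp
    for stn, group in groupby(ordered, key=lambda r: r[0]):
        times = [t for _, t in group]
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        if gaps:  # a station with a single point has no interval
            stats[stn] = (min(gaps), max(gaps), mean(gaps))
    return stats


# Made-up sample: ZER is regular (hourly), MLS has one missing hour.
sample = [
    ("ZER", datetime(2023, 9, 1, 0)),
    ("ZER", datetime(2023, 9, 1, 1)),
    ("ZER", datetime(2023, 9, 1, 2)),
    ("MLS", datetime(2023, 9, 1, 0)),
    ("MLS", datetime(2023, 9, 1, 1)),
    ("MLS", datetime(2023, 9, 1, 3)),
]

stats = interval_stats(sample)
# A station is considered irregular here when its min and max gaps differ.
irregular = [stn for stn, (lo, hi, _) in stats.items() if lo != hi]
print(stats)
print(irregular)
```

Treating a station as irregular when its minimum and maximum gaps differ is one possible criterion, chosen here for simplicity.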

In [None]:
# Write your code here

In [None]:
# Write your code here

# Analyzing data for Zermatt

Here we will load the data for the Zermatt station (code `ZER`). Take a look at your system monitor and try to assess how much data is downloaded as you try the different columns below. 

## Instructions

- Collect all data for Zermatt and ensure they are ordered by time.
- Print the number of data points, the start and end date.

In [None]:
# Write your code here

# Plotting the data

- Plot the air temperature 2 m above ground in Zermatt for the past 5 years. Make sure the plot has a legend.
- What pattern can you visually spot?

In [None]:
# Write your code here

- Plot the same variable, between September 1, 2023 and September 15, 2023.
- What patterns can you visually spot now?

In [None]:
# Write your code here

### Identify trends

- Compute a monthly moving average of the temperature over the entire Zermatt data (hint: look at `rowsBetween`).
- Visualize the hourly temperature and the corresponding moving average between September 1, 2023 and September 15, 2023. What trend do you observe?
- What can you deduce from this trend? Is that a climate- or weather-related pattern?
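
The `rowsBetween` hint refers to a window frame expressed in rows relative to the current row. The sketch below reproduces that idea in plain Python on made-up values, so that the expected output of a Spark window can be checked by hand. The function `moving_average` is a hypothetical helper, and the values are not taken from the dataset.

```python
def moving_average(values, rows_before, rows_after=0):
    """Average over a frame of rows around each position.

    Mirrors the semantics of a row-based window frame: the frame for index i
    covers indices i - rows_before .. i + rows_after, clipped to the series.
    """
    out = []
    for i in range(len(values)):
        lo = max(0, i - rows_before)
        hi = min(len(values), i + rows_after + 1)
        frame = values[lo:hi]
        out.append(sum(frame) / len(frame))
    return out


# Toy hourly temperatures (made-up values).
temps = [10.0, 12.0, 14.0, 16.0, 18.0]

# Trailing frame of 3 rows: the current row and the two before it.
smoothed = moving_average(temps, rows_before=2)
print(smoothed)  # [10.0, 11.0, 12.0, 14.0, 16.0]
```

For hourly samples, a 30-day trailing frame would be roughly `rows_before = 24 * 30 - 1`, assuming there are no missing hours. If the series has gaps, a row-based frame spans more than 30 days of wall-clock time, in which case a range-based frame (`rangeBetween` in Spark) may be a better fit.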

In [None]:
# Write your code here

- Perform a yearly moving average over the entire Zermatt data and visualize it alongside the hourly temperatures across the entire time period.
- What do you observe? Is that a climate- or weather-related pattern?
- Can you spot any anomaly in this graph? Where does it come from?

In [None]:
# Write your code here

Let's see if there is a significant difference between the period 2010-2013 and 2020-2023:

- Extract two Pandas series from the data that correspond to the temperature samples in the two above periods.
- Plot histograms and compare them visually.
- What are the conclusions you may draw from these results?

In [None]:
# Write your code here

Another approach to visualize these data is as follows:

- Plot all years on the same graph (with a common x-axis from January to December), with different colors for every year, so that we can compare how they evolve.
- First, compute a **monthly** moving average across the entire period.
- Then, loop over each year, extract the data from the dataframe computed at the previous step, and plot them. 

Hint: you will need the `F.dayofyear` function. Also, make sure that years that are close to each other have the same hue (you may use a colormap such as `matplotlib.pyplot.cm.rainbow`).
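
The two ingredients of this plot, a common January-to-December x-axis and a hue that varies smoothly with the year, can be sketched in plain Python as follows. The helpers `split_by_year` and `year_fraction` and the sample values are illustrative assumptions. In Spark, `F.dayofyear` plays the role of `tm_yday`, and the fraction returned by `year_fraction` can be passed to a colormap such as `matplotlib.pyplot.cm.rainbow`.

```python
from collections import defaultdict
from datetime import date


def split_by_year(samples):
    """Group (date, value) pairs into {year: [(day_of_year, value), ...]}."""
    curves = defaultdict(list)
    for d, v in sorted(samples):
        curves[d.year].append((d.timetuple().tm_yday, v))
    return dict(curves)


def year_fraction(year, first, last):
    """Map a year to [0, 1] so that neighbouring years get neighbouring hues."""
    return 0.0 if last == first else (year - first) / (last - first)


# Made-up samples spanning three years.
samples = [
    (date(2022, 1, 1), -3.0),
    (date(2022, 12, 31), -1.0),
    (date(2023, 1, 1), -2.0),
    (date(2024, 12, 31), 0.5),  # 2024 is a leap year: day 366
]

curves = split_by_year(samples)
fractions = {y: year_fraction(y, min(curves), max(curves)) for y in curves}
print(curves)
print(fractions)
```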

In [None]:
# Write your code here

Yet another way to look at these data is by looking at the frequency of anomalies, where we define an anomaly as:

> An anomaly is any daily temperature measurement that lies beyond 2 standard deviations of the mean temperature for the same 30-day calendar period in the past five years.

For instance, a measurement $X$ on May 20, 2017 is considered an anomaly if it is either larger than $\mu + n\cdot\sigma$ or smaller than $\mu - n\cdot\sigma$, where $\mu$ and $\sigma$ are respectively the mean and the standard deviation of the samples that fall in any of the following intervals: 15 days before to 15 days after May 20 of the years 2016, 2015, 2014, 2013 and 2012. Try different values of $n$ (the definition above uses $n = 2$).

Given the above definition of an anomaly, compute the anomaly frequency (i.e., the fraction of anomalous hours) of each month for the period 2015-2024.
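
As a sanity check of the definition, the sketch below applies it to a single measurement using plain Python and a made-up history. It assumes that samples are given as `(date, temperature)` pairs and that $\sigma$ is the sample standard deviation (`statistics.stdev`); the text does not specify which estimator to use. The helper names are hypothetical, and the linear scan is only meant for small inputs, not for the full dataset.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def reference_samples(samples, day, years_back=5, half_width=15):
    """Collect values within +/- half_width days of the same calendar day
    in each of the `years_back` preceding years."""
    selected = []
    for k in range(1, years_back + 1):
        # Note: date.replace raises ValueError for Feb 29 in non-leap years.
        center = day.replace(year=day.year - k)
        lo = center - timedelta(days=half_width)
        hi = center + timedelta(days=half_width)
        selected += [v for d, v in samples if lo <= d <= hi]
    return selected


def is_anomaly(value, day, samples, n=2.0):
    """True if value lies outside mu +/- n * sigma of the reference samples."""
    ref = reference_samples(samples, day)
    mu, sigma = mean(ref), stdev(ref)
    return value > mu + n * sigma or value < mu - n * sigma


# Made-up history: values alternate between 9 and 11 degrees around May 20
# of the years 2012 to 2016.
history = [
    (date(year, 5, 20) + timedelta(days=off), 9.0 if off % 2 == 0 else 11.0)
    for year in range(2012, 2017)
    for off in range(-15, 16)
]

print(is_anomaly(15.0, date(2017, 5, 20), history))  # True
print(is_anomaly(10.5, date(2017, 5, 20), history))  # False
```

Here the reference set has a mean close to 10 and a standard deviation close to 1, so 15.0 falls outside the band for $n = 2$ while 10.5 does not. Lowering $n$ narrows the band and flags more measurements.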

In [None]:
# Write your code here

## Open Challenges with Meteosuisse data (3h30)

In this section, you are free to explore one of the following five open data challenges.

### Challenge 1: **Temperature Trend Analysis**

- **Motivation**: Monitoring long-term temperature trends is key to understanding the effects of climate change, especially in mountainous regions where warming may occur faster than in valley bottoms.
- **Task**: Analyze the long-term trend in temperature across stations at different altitudes, distinguishing between valley and high-altitude locations. Investigate whether higher altitude stations exhibit different temperature trends compared to valley stations.  
- **Expected Outcome**: The goal is to determine if there is evidence of warming and if altitude affects the rate of change in temperature over the 14-year period. You should identify key patterns and trends in temperature variation over time. You should expect to observe a rise in average temperatures, with potential variations in the rate of warming between high-altitude and valley stations. Results may reveal that higher altitudes exhibit different warming patterns compared to valley floors.

### Challenge 2: **Snow Depth and Precipitation Correlation**

- **Motivation**: Understanding the relationship between snow depth and precipitation can provide insights into snowpack formation and potential water resources in mountainous regions. Snow depth also plays a critical role in assessing avalanche risk.
- **Task**: Investigate the correlation between snow depth (`htoauths`, `hto000hs`) and precipitation (`rre150h0`) at different stations, focusing on both valley and high-altitude locations. Analyze seasonal variations and how snow depth responds to precipitation events over the 14-year period.
- **Expected Outcome**: The goal is to identify patterns in how snow depth builds up in response to different levels of precipitation. You may find that snow depth at higher altitudes shows stronger correlations with precipitation compared to valley locations. The challenge will also help reveal any delayed responses in snow accumulation following significant precipitation events.

### Challenge 3: **Precipitation Patterns by Wind Direction**

- **Motivation**: Wind direction influences the distribution of precipitation across mountainous regions, particularly in valleys versus slopes. Understanding this relationship can help in forecasting weather patterns and managing water resources.
- **Task**: Analyze how precipitation (`rre150h0`) is distributed depending on wind direction (`dkl010h0`) across different stations. Focus on comparing the frequency and amount of precipitation during south-bound versus north-bound wind events.
- **Expected Outcome**: The analysis should reveal correlations between wind direction and precipitation patterns, showing whether certain wind directions consistently bring more precipitation. Expect to see patterns where specific directions (e.g., southerly winds) may correlate with higher precipitation amounts due to the orographic effect.

### Challenge 4: **Wind Direction and Precipitation Patterns**

- **Motivation**: Wind can play a significant role in weather patterns, affecting where and how precipitation occurs. Analyzing this relationship is crucial for understanding precipitation distribution in mountainous areas.
- **Task**: Examine how wind direction (`dkl010h0`) influences precipitation patterns (`rre150h0`) across various stations. Focus on how wind direction may cause precipitation to differ between valley and high-altitude locations.
- **Expected Outcome**: The goal is to understand whether specific wind directions lead to higher or lower precipitation at different altitudes. You may discover that certain wind directions are associated with heavy precipitation events in specific locations, potentially influencing local weather conditions.

### Challenge 5: **Extreme Events Frequency Over Time**

- **Motivation**: Identifying the frequency of extreme weather events like massive snowfalls, heatwaves, or heavy rainfall is critical for assessing climate change's impact and preparing for future extreme conditions.
- **Task**: Define criteria for extreme weather events, such as significant snowfall (`hns000hs`), extreme heat (`tre200hx`), or heavy precipitation (`rre150hx`), and analyze how their frequency has evolved over the 14-year period. Compare results across high-altitude and valley stations.
- **Expected Outcome**: The analysis should show whether there is an increase in the frequency of extreme weather events over time. You may observe a rise in heatwaves or more frequent heavy snowfall events at high altitudes, potentially pointing to climate-related shifts in weather patterns.