<a href="https://colab.research.google.com/github/leoalfonso/M11-and-M49/blob/main/06_Descriptive_Stats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div>
<table style="width: 100%">
	<tr>
		<td>
		<table style="width: 100%">
			<tr>
                <td ><center><font size="5"><b>Modules 11 and 49</b></font></center></td>
			</tr>
			<tr>
                <td><center><font size="14">Notebook 6</font></center></td>
			</tr>
			<tr>
                <td><center><font size="6"><b>Descriptive statistics</b></font></center></td>
			</tr>
		</table>
		</td>
		<td><center><img src='https://ihe-delft-ihe-website-production.s3.eu-central-1.amazonaws.com/s3fs-public/styles/792w/public/2022-11/IHE-DELFT-INSTITUTE_UNESCO_RGB.png?itok=-GnfBc2x'></img></center></td>
	</tr>
</table>
</div>

# 🐍 Period 6: Descriptive Statistics with Pandas

In Period 5, we successfully loaded, cleaned, and indexed our precipitation time series. Now, we will analyze this data to answer fundamental hydrological questions:

* What was the total rainfall during this period?
* What was the average daily rainfall?
* What was the single wettest day (the extreme event)?

Pandas provides a built-in method for each of these calculations. We will be working with the `Precipitation_mm` column, which is a Pandas `Series`.

## 1. Reloading and Preparing Our Data

Since this is a new notebook, we must perform our setup steps again:
1.  Import Pandas.
2.  Upload the `precipitation_data.csv` file.
3.  Load it using `pd.read_csv()`.
4.  Convert the `Date` column to `datetime` objects.
5.  Set the `Date` column as the index.
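
Steps 3 to 5 can also be combined into a single `pd.read_csv()` call through its `parse_dates` and `index_col` parameters. The sketch below is only an illustration: the CSV text is an invented stand-in for `precipitation_data.csv` (assuming it has `Date` and `Precipitation_mm` columns), and `demo_df` is a hypothetical name chosen so that it does not clash with `df`.

```python
import io
import pandas as pd

# Invented stand-in for precipitation_data.csv (values are for illustration only)
csv_text = """Date,Precipitation_mm
2023-01-01,0.0
2023-01-02,5.2
2023-01-03,0.0
2023-01-04,12.1
"""

# Load, parse 'Date' as datetime, and set it as the index in one call
demo_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")

print(type(demo_df.index).__name__)  # DatetimeIndex
print(demo_df.shape)                 # (4, 1)
```

Both approaches should give the same DataFrame; the step-by-step version used in this notebook makes each transformation easier to inspect.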

**Task 1.1:** Run the code cell below to perform all setup steps. Review the code to ensure you understand the workflow.

In [None]:
# Task 1.1: Setup and Data Preparation

# 1. Import Pandas and the Colab upload helper
import pandas as pd
from google.colab import files

# 2. Upload the file
print("Please upload the 'precipitation_data.csv' file.")
uploaded = files.upload()

# 3. Load the file into a DataFrame
df = pd.read_csv('precipitation_data.csv')

# 4. Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])

# 5. Set 'Date' as the index
df = df.set_index('Date')

# Inspect the final DataFrame
print("\n--- Data Loaded and Prepared ---")
df.info()
df.head()

## 2. Basic Statistical Aggregations

We can apply statistical methods directly to the `Precipitation_mm` column.

**Task 2.1: Total Precipitation**
* Calculate the total (sum) of all values in the `Precipitation_mm` column.
* Use the `.sum()` method.

**Task 2.2: Mean Precipitation**
* Calculate the average daily rainfall.
* Use the `.mean()` method.

**Task 2.3: Median Precipitation**
* Calculate the median daily rainfall. The median (the 50th percentile) is often a more robust measure of central tendency than the mean, because a few extreme events barely change it.
* Use the `.median()` method.
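
To see how the three methods behave before applying them to the real data, here is a minimal sketch on a toy `Series`. The values are invented for illustration (mostly dry days with one heavy event), and the name `toy` is hypothetical.

```python
import pandas as pd

# Invented daily rainfall in mm: five dry days, one light shower, one storm
toy = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 18.0], name="Precipitation_mm")

total = toy.sum()       # sum of all values
mean = toy.mean()       # total divided by the number of days
median = toy.median()   # middle value of the sorted data

print(f"Total:  {total:.1f} mm")       # Total:  20.0 mm
print(f"Mean:   {mean:.2f} mm/day")    # Mean:   2.86 mm/day
print(f"Median: {median:.2f} mm/day")  # Median: 0.00 mm/day
```

The single 18 mm day lifts the mean to almost 3 mm/day, while the median stays at zero because more than half of the days are dry.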

In [None]:
# Task 2.1: Calculate Total Precipitation

# [STUDENT CODE GOES HERE]
# Select the 'Precipitation_mm' column and apply the .sum() method
# total_precip_mm = df['Precipitation_mm'].sum()

# print(f"Total precipitation over the period: {total_precip_mm:.1f} mm")

In [None]:
# Task 2.2: Calculate Mean Daily Precipitation

# [STUDENT CODE GOES HERE]
# mean_precip_mm = df['Precipitation_mm'].mean()

# print(f"Mean daily precipitation: {mean_precip_mm:.2f} mm/day")

In [None]:
# Task 2.3: Calculate Median Daily Precipitation

# [STUDENT CODE GOES HERE]
# median_precip_mm = df['Precipitation_mm'].median()

# print(f"Median daily precipitation: {median_precip_mm:.2f} mm/day")

# **Discussion Point:** Why is the median (0.0) so different from the mean?
# (Answer: When more than half of the days are 'dry days', the middle value
# of the sorted record is 0.0, while a few 'wet days' are enough to pull the mean up.)

## 3. Identifying Extreme Events

In hydrology, we are often most interested in the extremes, not the average.

**Task 3.1: Maximum Precipitation (The Wettest Day)**
* Find the highest rainfall value recorded in the dataset.
* Use the `.max()` method.

**Task 3.2: The Date of the Extreme Event**
* Finding the *value* (e.g., 12.1 mm) is good, but we need to know *when* it happened.
* We can use `.idxmax()` to find the **index** (in our case, the Date) of the maximum value.
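
The difference between `.max()` and `.idxmax()` can be sketched on a small date-indexed `Series`. The dates and values below are invented for illustration, and `toy` is a hypothetical name.

```python
import pandas as pd

# Invented five-day record with a DatetimeIndex, as in our prepared DataFrame
dates = pd.date_range("2023-01-01", periods=5, freq="D")
toy = pd.Series([0.0, 3.4, 12.1, 0.0, 1.5], index=dates, name="Precipitation_mm")

wettest_value = toy.max()    # the largest value
wettest_date = toy.idxmax()  # the index label where that value occurs

print(f"{wettest_value:.1f} mm on {wettest_date.date()}")  # 12.1 mm on 2023-01-03

# Looking up the returned label with .loc gives back the maximum value
print(toy.loc[wettest_date] == wettest_value)  # True
```

If several days share the maximum value, `.idxmax()` returns the first of them.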

In [None]:
# Task 3.1: Find the maximum precipitation event

# [STUDENT CODE GOES HERE]
# max_precip_mm = df['Precipitation_mm'].max()

# print(f"Maximum single-day rainfall: {max_precip_mm:.1f} mm")

In [None]:
# Task 3.2: Find the DATE of the maximum event

# [STUDENT CODE GOES HERE]
# date_of_max_precip = df['Precipitation_mm'].idxmax()

# print(f"Date of maximum rainfall: {date_of_max_precip}")

# Note: The output will be a Timestamp object, which is exactly what we want.

## 4. The All-in-One: `.describe()`

The `.describe()` method computes the most common descriptive statistics for every numeric column of a DataFrame in a single call.

**Task 4.1:** Run `.describe()` on our DataFrame `df`.
* Analyze the output. You will see:
    * `count`: The number of non-missing observations.
    * `mean`: The average.
    * `std`: The sample standard deviation (a measure of spread).
    * `min`: The minimum value.
    * `25%`: The 25th percentile (1st quartile).
    * `50%`: The 50th percentile (the median).
    * `75%`: The 75th percentile (3rd quartile).
    * `max`: The maximum value.
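
The result of `.describe()` is itself a DataFrame whose rows are labelled with the statistic names listed above, so individual values can be read back with `.loc`. A minimal sketch, using invented values and the hypothetical names `toy_df` and `toy_stats`:

```python
import pandas as pd

# Invented rainfall values in mm, for illustration only
toy_df = pd.DataFrame({"Precipitation_mm": [0.0, 0.0, 0.0, 4.0, 16.0]})

toy_stats = toy_df.describe()
print(toy_stats)

# Pick single statistics by row label and column name
n_days = toy_stats.loc["count", "Precipitation_mm"]
median = toy_stats.loc["50%", "Precipitation_mm"]
wettest = toy_stats.loc["max", "Precipitation_mm"]

print(f"{n_days:.0f} days, median {median:.1f} mm, maximum {wettest:.1f} mm")
# 5 days, median 0.0 mm, maximum 16.0 mm
```

The `50%` row should match the result of `.median()` on the same column.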

In [None]:
# Task 4.1: Run the .describe() method

# [STUDENT CODE GOES HERE]
# Run .describe() on the entire DataFrame
# precip_stats = df.describe()

# print(precip_stats)

**End of Notebook 06**