# Introduction

This notebook analyses historical rainfall data for various cities to determine if London, UK, is as rainy as commonly portrayed. The data for this analysis should be collected beforehand using the `data_collection_script.py` script, which gathers historical rainfall data from the Open-Meteo API and saves it in JSON format.

In this notebook, we will:
1. Load the historical rainfall data.
2. Process and clean the data.
3. Visualise metrics such as the number of rainy days, total rainfall, and average rain intensity across the selected cities.


# 1. Data Collection Prerequisite

Before running this notebook, ensure you have collected data for each city of interest, as explained in the README ('How to Set Up and Run the Project').

For my project, I have chosen to focus on the following five cities:

1. **London, UK** – The main city of interest, often portrayed as rainy.
2. **Singapore** – Known for its tropical climate with high annual rainfall.
3. **Cairo, Egypt** – Represents a dry climate with very low annual rainfall.
4. **Buenos Aires, Argentina** – A moderate climate with regular seasonal rainfall.
5. **Mumbai, India** – Known for very high rainfall, particularly during the monsoon season.

These cities offer a range of climates, from dry (Cairo) to tropical (Singapore, Mumbai) and moderate (Buenos Aires), providing a balanced comparison to assess if London's raininess is as prominent as often portrayed.


Run the following commands from the project root directory:

**1. London, UK**:
```bash
python scripts/data_collection_script.py GB London --start_date 2023-01-01 --end_date 2023-12-31 --output_file data/london_2023.json
```

**2. Singapore**:
```bash
python scripts/data_collection_script.py SG Singapore --start_date 2023-01-01 --end_date 2023-12-31 --output_file data/singapore_2023.json
```

**3. Cairo, Egypt**:
```bash
python scripts/data_collection_script.py EG Cairo --start_date 2023-01-01 --end_date 2023-12-31 --output_file data/cairo_2023.json
```

**4. Buenos Aires, Argentina**:
```bash
python scripts/data_collection_script.py AR 'Buenos Aires' --start_date 2023-01-01 --end_date 2023-12-31 --output_file data/buenos_aires_2023.json
```

**5. Mumbai, India**:
```bash
python scripts/data_collection_script.py IN Mumbai --start_date 2023-01-01 --end_date 2023-12-31 --output_file data/mumbai_2023.json
```
After running these commands, the data for each city will be saved as JSON files in the data folder. Proceed with the cells below to load and analyse the rainfall data.
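
Before moving on, it can help to confirm that all five files were actually written. The snippet below is a minimal sketch, not part of the original project: the helper name `find_missing_files` is hypothetical, and it assumes the file names match the `--output_file` arguments above and that it is run from the project root.

```python
from pathlib import Path

# File names follow the --output_file arguments used in the commands above.
EXPECTED_FILES = [
    "london_2023.json",
    "singapore_2023.json",
    "cairo_2023.json",
    "buenos_aires_2023.json",
    "mumbai_2023.json",
]


def find_missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


# Assumption: run from the project root, so the files live in ./data
missing = find_missing_files("data")
if missing:
    print("Missing data files:", ", ".join(missing))
else:
    print("All five data files are present.")
```

If any file is reported as missing, re-run the corresponding collection command before continuing.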


# 2. Import Libraries ⚙️


The cell below imports the libraries used to load, process, and plot the data:

In [1]:
import json

import pandas as pd

from lets_plot import *
LetsPlot.setup_html()

from datetime import datetime

# 3. Load the JSON data 📥

In this section, we load the historical rainfall data for each city that has been collected and saved in JSON format. Each JSON file corresponds to a city's data, which we will then combine into a single DataFrame for analysis.


In [2]:
# Specify the file paths for each city's data
data_files = {
    "London": "../data/london_2023.json",
    "Singapore": "../data/singapore_2023.json",
    "Cairo": "../data/cairo_2023.json",
    "Buenos Aires": "../data/buenos_aires_2023.json",
    "Mumbai": "../data/mumbai_2023.json"
}

Let's make an empty list to hold the DataFrame for each city.

In [3]:
city_dfs = []

Since we have 5 cities, we can use a `for` loop to read each file into its own DataFrame.

In [4]:
# Load each city's data from its JSON file and add it to the list
for city, file_path in data_files.items():
    with open(file_path, 'r') as file:
        data = json.load(file)
        city_df = pd.DataFrame(data)
        city_df['city'] = city  # Label each row with the city's name
        city_dfs.append(city_df)

Let's concatenate all the city dataframes into a single dataframe.

In [5]:
df = pd.concat(city_dfs, ignore_index=True)

Display the first few rows to check the structure.

In [6]:
display(df.head(5))

Unnamed: 0,time,precipitation_sum,city
0,2023-01-01,4.0,London
1,2023-01-02,0.2,London
2,2023-01-03,3.2,London
3,2023-01-04,0.9,London
4,2023-01-05,0.1,London


Display the last few rows to check the structure.

In [7]:
display(df.tail(5))

Unnamed: 0,time,precipitation_sum,city
1820,2023-12-27,0.0,Mumbai
1821,2023-12-28,0.0,Mumbai
1822,2023-12-29,0.0,Mumbai
1823,2023-12-30,0.0,Mumbai
1824,2023-12-31,0.0,Mumbai


I also want to confirm that there are 1825 entries in total (365 days × 5 cities), so that no days are missing.

In [8]:
entries = len(df)
print(entries)

1825
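
A matching total does not rule out one city having an extra row while another is short, or a day being present with an empty value. The sketch below is an optional, more granular check; `summarise_completeness` is a hypothetical helper, and it assumes the `time`, `precipitation_sum` and `city` columns shown in the outputs above. It is demonstrated on a small made-up frame rather than the real data.

```python
import pandas as pd


def summarise_completeness(frame, expected_days=365):
    """Per-city row count, missing values and duplicate dates."""
    summary = frame.groupby("city").agg(
        rows=("time", "size"),
        missing_values=("precipitation_sum", lambda s: int(s.isna().sum())),
        duplicate_dates=("time", lambda s: int(s.duplicated().sum())),
    )
    # A city is complete only if all three checks pass
    summary["complete"] = (
        (summary["rows"] == expected_days)
        & (summary["missing_values"] == 0)
        & (summary["duplicate_dates"] == 0)
    )
    return summary


# Made-up example: Cairo has a repeated date and an empty value
toy = pd.DataFrame({
    "time": ["2023-01-01", "2023-01-02", "2023-01-01", "2023-01-01"],
    "precipitation_sum": [4.0, 0.2, None, 0.0],
    "city": ["London", "London", "Cairo", "Cairo"],
})
toy_summary = summarise_completeness(toy, expected_days=2)
print(toy_summary)
```

In the notebook, the same helper could be called as `summarise_completeness(df)` with the default of 365 days.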


# 2. Calculate Key Metrics for Raininess Analysis 🧮

Now that the data is loaded into a single DataFrame, we can calculate the metrics below.

We will use the pandas `groupby()` method to compute each metric per city.

- Number of Rainy Days: Days with precipitation_sum > 0

- Average Rain Intensity: Total Rainfall / Number of Rainy Days
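
The two definitions can also be written as formulas. The notation is introduced here only for illustration: $p_{c,d}$ is the `precipitation_sum` (in mm) for city $c$ on day $d$, $R_c$ is the number of rainy days and $I_c$ is the average rain intensity.

$$
R_c = \left|\{\, d : p_{c,d} > 0 \,\}\right|, \qquad
I_c = \frac{\sum_{d} p_{c,d}}{R_c}
$$

Dry days contribute zero to the numerator, so summing over every day of the year gives the same total as summing over rainy days only. $I_c$ is undefined for a city with $R_c = 0$.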




## 2.1 Number of Rainy Days

We use the `size()` method to count the number of rows in each group. Here, we first keep only the days on which `precipitation_sum` is greater than 0 and then count them per city.

This metric gives us a sense of the **frequency** of rain and helps put London’s raininess in perspective compared to other cities.

In [9]:
# Filter the DataFrame to include only rows where precipitation_sum > 0
rainy_days_df = df[df['precipitation_sum'] > 0]

# Group by 'city' and count the number of days for each city
rainy_days = rainy_days_df.groupby('city').size().reset_index()

# Rename the columns for clarity
rainy_days.columns = ['City', 'Rainy Days']

# Display the result
display(rainy_days)

Unnamed: 0,City,Rainy Days
0,Buenos Aires,150
1,Cairo,29
2,London,228
3,Mumbai,157
4,Singapore,350


## 2.2 Average Rain Intensity

Now we can combine the number of rainy days from the previous step with each city's total rainfall to calculate the average rain intensity.

This metric gives us a sense of the **lightness/heaviness** of the rain in London compared to other cities.

In [10]:
# Group by 'city' and calculate total rainfall and rainy days
total_rainfall = df.groupby('city')['precipitation_sum'].sum().reset_index()
rainy_days = df[df['precipitation_sum'] > 0].groupby('city').size().reset_index(name='Rainy Days')

# Merge the total_rainfall and rainy_days DataFrames on 'city'
rain_data = pd.merge(total_rainfall, rainy_days, on='city')

# Calculate the average rain intensity (Total Rainfall / Rainy Days)
rain_data['Average Rain Intensity (mm/day)'] = rain_data['precipitation_sum'] / rain_data['Rainy Days']

# Rename the columns for clarity
rain_data.columns = ['City', 'Total Rainfall (mm)', 'Rainy Days', 'Average Rain Intensity (mm/day)']

# Select only the columns to display
rain_data = rain_data[['City', 'Average Rain Intensity (mm/day)']]

# Display the result
display(rain_data)

Unnamed: 0,City,Average Rain Intensity (mm/day)
0,Buenos Aires,6.108667
1,Cairo,0.975862
2,London,3.424123
3,Mumbai,13.045223
4,Singapore,6.756286
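
As an alternative to computing two tables and merging them, the same metrics can be produced in a single `groupby` pass with named aggregation. This is a sketch of a different technique from the one used in the cell above; the function name `rain_metrics` and the snake_case column names are hypothetical, and the data is made up.

```python
import pandas as pd


def rain_metrics(frame):
    """Total rainfall, rainy days and average intensity per city."""
    metrics = (
        frame.assign(is_rainy=frame["precipitation_sum"] > 0)
        .groupby("city")
        .agg(
            total_rainfall_mm=("precipitation_sum", "sum"),
            rainy_days=("is_rainy", "sum"),
        )
        .reset_index()
    )
    # Cities with no rainy days get NaN instead of a division by zero
    valid_days = metrics["rainy_days"].where(metrics["rainy_days"] > 0)
    metrics["avg_intensity_mm_per_day"] = metrics["total_rainfall_mm"] / valid_days
    return metrics


# Made-up example data
toy = pd.DataFrame({
    "city": ["London"] * 4 + ["Cairo"] * 4,
    "precipitation_sum": [4.0, 0.2, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0],
})
toy_metrics = rain_metrics(toy)
print(toy_metrics)
```

Keeping all three columns in one table means later cells can select whichever metric they need without recomputing it.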


# 3. Visualisations 📊

## 3.1 Number of Rainy Days (Bar Chart)

This metric shows the total number of days with rainfall throughout the year in each city.
 
**Rationale: A higher number of rainy days can contribute to the perception of a 'rainy' city, even if the rain is light.**



In [11]:
# 'rain_data' keeps only 'City' and the intensity column after cell 10,
# so take the counts from 'rainy_days' (columns: 'city', 'Rainy Days') instead
rainy_days_data = (
    rainy_days.rename(columns={'city': 'City'})[['City', 'Rainy Days']]
    .assign(City=lambda df: df['City'].astype('category'))  # Ensure 'City' is treated as a category
    .sort_values(by='Rainy Days', ascending=True)  # Sort by 'Rainy Days' in ascending order
)

plot = (
    ggplot(rainy_days_data, aes(x='City', y='Rainy Days', fill='Rainy Days')) + 
    geom_bar(stat='identity', show_legend=False) +  # Create a bar chart without a legend
    geom_text(aes(label='Rainy Days'), position=position_nudge(y=5.5), color='black', size=10) +  # Add labels on bars
    scale_fill_manual(values=['#92C5F9', '#4394E5', '#0066CC', '#004D99', '#003366']) +  # Custom color gradient
    ggtitle('Figure 1: Singapore Had More Rainy Days Than London', 
            subtitle='London experienced 35% fewer rainy days compared to Singapore.') +  # Add title and subtitle
    theme(
        axis_title_x=element_text(size=20),  # Set x-axis title size
        axis_title_y=element_text(size=20),  # Set y-axis title size
        plot_title=element_text(size=30, face="bold", color='#333333'),  # Title styling
        plot_subtitle=element_text(size=20, color="grey"),  # Subtitle styling
        axis_text_x=element_text(size=20),  # X-axis text size
        axis_text_y=element_text(size=20),  # Y-axis text size
        panel_grid_major_y=element_line(color="grey", size=0.3, linetype="dotted"),  # Dotted horizontal grid lines
        panel_grid_major_x=element_blank(),  # Remove vertical grid lines
        legend_position='none'  # Hide the legend
    ) + 
    labs(x='City', y='Number of Rainy Days') +  # Label x and y axes
    ggsize(1400, 1000)  # Set the plot size
)

# Display the plot
plot

## 3.2 Monthly Total Rainfall by City (Line Graph)

This metric displays the total amount of rainfall each month, highlighting seasonal patterns. 

**Rationale: It helps compare the actual volume of rain London receives compared to other cities, providing context to its 'rainy' reputation.**

In [None]:
# Convert 'time' column to datetime format
df['time'] = pd.to_datetime(df['time'])

# Extract month as a period from 'time' column
df['month'] = df['time'].dt.to_period('M')

# Prepare data for monthly total rainfall by city using method chaining
monthly_rainfall_data = (
    df.groupby(['city', 'month'])['precipitation_sum'].sum().reset_index()
    .assign(month=lambda df: df['month'].astype(str))  # Convert 'month' to string for plotting
)

plot = (
    ggplot(monthly_rainfall_data, aes(x='month', y='precipitation_sum', color='city', group='city', linetype='city')) + 
    geom_line(size=2.5) +  # Thicker lines for improved visibility
    scale_color_manual(values=['#000000', '#e7298a', '#d95f02', '#1b9e77', '#FF0000']) +  # Set distinct colours for each city
    scale_linetype_manual(values=['solid', 'dashed', 'dotted', 'dotdash', 'twodash']) +  # Set line types for each city
    ggtitle('Figure 2: Mumbai exhibits extreme rainfall peaks', subtitle='Steady, lower rainfall trends observed in London.') + 
    theme(
        axis_title_x=element_text(size=20, face="bold"),  # Bold x-axis title
        axis_title_y=element_text(size=20, face="bold"),  # Bold y-axis title
        plot_title=element_text(size=30, face="bold", color='#333333'),  # Bold title styling
        plot_subtitle=element_text(size=20, color="gray"),  # Subtitle styling
        axis_text_x=element_text(size=20, hjust=1, face="bold"),  # Bold x-axis text with horizontal adjustment
        axis_text_y=element_text(size=20, face="bold"),  # Bold y-axis text
        panel_grid_major_y=element_line(color="lightgray", size=0.5, linetype="dotted"),  # Dotted y-axis grid lines
        panel_grid_major_x=element_blank(),  # Remove x-axis grid lines
        legend_position='right',  # Position legend to the right
        legend_title=element_text(size=24),  # Larger legend title
        legend_text=element_text(size=24)  # Larger legend text
    ) + 
    scale_x_discrete(name='Month', labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']) +  # Custom month labels
    labs(x='Month', y='Total Rainfall (mm)', color='City', linetype='City') +  # Axis and legend labels
    ggsize(1400, 800)  # Set plot size
)

# Display the plot
plot
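
The monthly totals behind this figure can also be inspected as a table, which makes it easier to read off exact values for a given month. The sketch below is an optional addition, not part of the original analysis: `monthly_totals_table` is a hypothetical helper, it assumes the `time`, `city` and `precipitation_sum` columns used above, and it runs on made-up data.

```python
import pandas as pd


def monthly_totals_table(frame):
    """Pivot daily rainfall into a month-by-city table of totals (mm)."""
    months = pd.to_datetime(frame["time"]).dt.to_period("M")
    return frame.assign(month=months).pivot_table(
        index="month",
        columns="city",
        values="precipitation_sum",
        aggfunc="sum",
    )


# Made-up example data
toy = pd.DataFrame({
    "time": ["2023-01-01", "2023-01-02", "2023-02-01", "2023-01-01"],
    "city": ["London", "London", "London", "Mumbai"],
    "precipitation_sum": [4.0, 0.2, 3.0, 0.0],
})
toy_table = monthly_totals_table(toy)
print(toy_table)
```

A month with no rows for a city appears as `NaN` in the table, which is also a quick way to spot gaps in the data.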


## 3.3 Average Rain Intensity by City (Bar Graph)

This metric shows the average amount of rain per rainy day, indicating the intensity of rainfall. 

**Rationale: A lower intensity suggests lighter rain, whereas higher intensity indicates heavier downpours on rainy days.**

In [None]:
average_rain_intensity_data = (
    rain_data[['City', 'Average Rain Intensity (mm/day)']]
    .assign(City=lambda df: df['City'].astype('category'))  # Ensure 'City' is treated as a category
    .sort_values(by='Average Rain Intensity (mm/day)', ascending=True)  # Sort by intensity in ascending order
)

plot = (
    ggplot(average_rain_intensity_data, aes(x='City', y='Average Rain Intensity (mm/day)', fill='Average Rain Intensity (mm/day)')) + 
    geom_bar(stat='identity', show_legend=False) +  # Bar chart without legend
    geom_text(
        aes(label=average_rain_intensity_data['Average Rain Intensity (mm/day)'].round(2)),  # Rounded labels to 2 decimals
        position=position_nudge(y=0.16), color='black', size=10  # Position labels closer to bars
    ) + 
    scale_fill_manual(values=['#92C5F9', '#4394E5', '#0066CC', '#004D99', '#003366']) +  # Custom color gradient
    ggtitle(
        'Figure 3: Mumbai had the Highest Average Rain Intensity', 
        subtitle='London’s average rain intensity is 74% lower than Mumbai, highlighting a significant difference.'
    ) + 
    theme(
        axis_title_x=element_text(size=20),  # Set x-axis title size
        axis_title_y=element_text(size=20),  # Set y-axis title size
        plot_title=element_text(size=30, face="bold", color='#333333'),  # Title styling
        plot_subtitle=element_text(size=20, color="grey"),  # Subtitle styling
        axis_text_x=element_text(size=20),  # X-axis text size
        axis_text_y=element_text(size=20),  # Y-axis text size
        panel_grid_major_y=element_line(color="grey", size=0.3, linetype="dotted"),  # Dotted y-axis grid lines
        panel_grid_major_x=element_blank(),  # Remove x-axis grid lines
        legend_position='none'  # Hide legend
    ) + 
    labs(x='City', y='Average Rain Intensity (mm/day)') +  # Axis labels
    ggsize(1400, 1000)  # Set plot size
)

# Display the plot
plot


# 4. Conclusion 🎉

### Null and Alternative Hypothesis
- We use null and alternative hypotheses to establish a baseline (null) assumption that we can test, allowing us to determine if there is enough evidence to reject it in favour of the alternative. This approach helps us make objective conclusions based on data.

- Null Hypothesis (H₀): London is rainy.
- Alternative Hypothesis (H₁): London is not rainy.

### Figure Conclusions
**Figure 1: Number of Rainy Days (Frequency)**
- Conclusion: London had 228 rainy days in 2023. Although this is fewer than Singapore’s 350 rainy days, it still means that London experienced rain on 62.5% (228/365) of the days in the year. This high frequency supports the idea that London is indeed a rainy city.

**Figure 2: Monthly Total Rainfall (Volume)**
- Conclusion: London’s rainfall is steady and moderate compared to cities with extreme rain peaks, like Mumbai. This suggests that while London does not experience heavy downpours, it still receives a consistent volume of rain each month, which aligns with its rainy reputation.

**Figure 3: Average Rain Intensity (Lightness/Heaviness of Rain)**
- Conclusion: London’s rain intensity is much lighter compared to cities like Mumbai, where rain is heavier. This means that although it rains frequently in London, the rain tends to be light or drizzly rather than intense.
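
The percentages quoted in the figure subtitles and in the conclusions above can be recomputed from the metric tables. The snippet below is a small arithmetic check; the input numbers are copied from the outputs of cells 9 and 10, and the variable names are chosen here for illustration.

```python
# Values copied from the outputs of cells 9 and 10
rainy_days = {"London": 228, "Singapore": 350}
intensity = {"London": 3.424123, "Mumbai": 13.045223}
DAYS_IN_2023 = 365

# Share of the year on which London recorded rain
share_of_year = rainy_days["London"] / DAYS_IN_2023 * 100

# How many fewer rainy days London had, relative to Singapore
fewer_than_singapore = (1 - rainy_days["London"] / rainy_days["Singapore"]) * 100

# How much lower London's average intensity is, relative to Mumbai
lighter_than_mumbai = (1 - intensity["London"] / intensity["Mumbai"]) * 100

print(f"London rainy days as a share of the year: {share_of_year:.1f}%")
print(f"Fewer rainy days than Singapore: {fewer_than_singapore:.1f}%")
print(f"Lower average intensity than Mumbai: {lighter_than_mumbai:.1f}%")
```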

### Overall Conclusion and Hypothesis Decision
Based on these figures, we can conclude:

- London experiences frequent rain (as shown by its high number of rainy days), and it receives a consistent amount of rainfall each month, albeit with lighter rain intensity. This combination suggests that London is indeed a "rainy" city, though not the heaviest or most intense in rainfall.

Decision: We fail to reject the null hypothesis (H₀), which states that London is rainy. The data supports the notion that while London may not have the heaviest rain, it does rain frequently enough to fit the rainy image associated with the city.

**In summary, London's raininess may not be as extreme as often portrayed in movies, but it is still valid to consider it a rainy city based on frequency and consistency.** 😊🌧️