# <center>Is London Really as Rainy as the Movies Make It Out To Be?</center>
## <center>An Exploratory Data Analysis</center>

##### This notebook (utilizing functions defined in Analysis_Functions.py) collects data on all of the world's capital cities (including, of course, London) and performs some Exploratory Data Analysis to help answer the question of whether London is truly as rainy as it is portrayed in movies and TV.

***

## Part 1: The Approach
To answer this question, it must first be determined what London should be compared *to*. Of course, there are many possible approaches, each yielding varying answers. The method I thought was most interesting was to compare London to cities spread across the globe. A convenient way to do this would be to compare London to the other capital cities of the world; this would ensure a widely dispersed set of data while including many of the most populous cities (e.g., Paris, Tokyo, etc.). I would then, based on the data (see Part 2), define a "Raininess Index" (see Part 3) to easily compare the raininess of London to other world capitals and create meaningful visualizations of these analyses.

## Part 2: The Data
The data used for testing and the final analyses are contained in the [world_cities.csv](https://github.com/lse-ds105/ds105a-2024-w06-summative-mjae-1616/blob/main/world_cities.csv) and the [country-capital-lat-long-population.csv](https://github.com/lse-ds105/ds105a-2024-w06-summative-mjae-1616/blob/main/country-capital-lat-long-population.csv). The [OpenMeteo API](https://open-meteo.com/) provided the weather data, given the latitude and longitude of cities.

The DS105A LSE course provided the World Cities CSV; it contains data on city names, their country abbreviations, and their coordinates. This proved useful for testing many functions.

The country-capital-lat-long-population CSV was found on [GitHub Gist](https://gist.github.com/ofou/df09a6834a8421b4f376c875194915c9). This contains much of the same data as the World Cities CSV, but only includes the capital cities of each country. This made it easier to iterate through when requesting data from the OpenMeteo API.

The OpenMeteo API is a free API providing the historical and forecast weather data used in creating the raininess index; however, the API does have a rate limit. This is addressed in Part 3. Both historical and forecast weather data were collected for each capital city. Historical data included the past two years of daily rain sum (mm of rain per day) and hours of precipitation. The forecast data included the projected next 7 days of rain sum (mm of rain per day).

#### Importing Python Packages and Libraries

In [177]:
# Importing standard libraries
import pandas as pd
from lets_plot import *
import geopandas as gpd
from lets_plot.geo_data import geocode_countries
from datetime import timedelta, datetime

# Importing custom libraries
from Analysis_Functions import *

#### Collecting London data:

##### To collect data for a particular city (in this case, London), we can utilize functions defined in and imported from the Analysis_Functions.py file, which are as follows:

**(1) get_city_latlon**<br>
Allows a user to input a city and country, and returns the latitude and longitude of the city as a tuple.

**(2) get_historical_precipitation**<br>
Allows a user to input a latitude and longitude, and returns (as a dictionary of two lists) the mm of rain per day and the hours of rain per day from the past 2 years.

**(3) get_forecast_precipitation**<br>
Allows a user to input a latitude and longitude, and returns the daily projected mm of rain for the next 7 days as a list.

**(4) raininess**<br>
Allows a user to input a dictionary of combined historical and forecast data, and will determine a "raininess index" for how rainy the city is.
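
The implementation of these helpers lives in Analysis_Functions.py and is not reproduced here. As a rough illustration of what a historical request could look like, the sketch below builds query parameters for Open-Meteo's archive endpoint. The function names `build_historical_params` and `fetch_historical`, the 730-day window, and the `"Historical Precipitation Hours"` key are assumptions made for this example; only the `"Historical Rain Sum"` key appears in the notebook output.

```python
from datetime import date, timedelta

# Assumed endpoint: Open-Meteo's historical archive API
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_historical_params(latitude, longitude, today=None, days=730):
    """Build query parameters covering roughly the past two years of daily data."""
    today = today or date.today()
    end = today - timedelta(days=1)          # yesterday is the last full day
    start = end - timedelta(days=days - 1)   # inclusive window of `days` days
    return {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start.isoformat(),
        "end_date": end.isoformat(),
        # Daily rain total (mm) and hours with precipitation
        "daily": "rain_sum,precipitation_hours",
        "timezone": "GMT",
    }


def fetch_historical(latitude, longitude):
    """Hypothetical fetch helper; performs one live request."""
    import requests  # third-party; only needed for the live call

    response = requests.get(
        ARCHIVE_URL,
        params=build_historical_params(latitude, longitude),
        timeout=30,
    )
    response.raise_for_status()
    daily = response.json()["daily"]
    return {
        "Historical Rain Sum": daily["rain_sum"],
        # Key name below is an assumption, not taken from the notebook
        "Historical Precipitation Hours": daily["precipitation_hours"],
    }
```

Keeping the parameter construction separate from the network call makes the date window easy to check without spending any of the API's rate limit.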

In [178]:
# Collecting historical and forecast precipitation data for London, GB

city = 'London'
country = 'GB'
latitude, longitude = get_city_latlon(city, country)
# Request each dataset once and reuse it, since the OpenMeteo API is rate limited
historical_precipitation = get_historical_precipitation(latitude, longitude)
forecast_precipitation = get_forecast_precipitation(latitude, longitude)

print(f"Weather data for {city}, {country}:")
print(f"   Historical precipitation: {historical_precipitation}")
print(f"   Forecast precipitation: {forecast_precipitation}")

# Storing the collected data in a dictionary

London = {
    "City": city,
    "Country": country,
    "Historical Precipitation": historical_precipitation,
    "Forecast Precipitation": forecast_precipitation
}

Weather data for London, GB:
   Historical precipitation: {'Historical Rain Sum': [0.3, 12.8, 1.7, 5.0, 1.4, 4.6, 2.3, 4.0, 15.4, 7.7, 0.2, 0.0, 0.0, 0.0, 0.0, 0.9, 1.7, 2.3, 0.0, 1.6, 4.7, 12.3, 2.4, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 5.5, 0.0, 3.6, 1.3, 4.8, 0.1, 7.5, 8.7, 6.6, 3.1, 2.5, 7.9, 6.6, 10.8, 14.5, 5.2, 4.3, 0.0, 1.7, 0.0, 7.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.1, 0.0, 0.0, 1.3, 0.1, 3.5, 6.5, 0.1, 0.0, 1.0, 4.8, 5.4, 12.5, 6.2, 2.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 2.9, 6.9, 1.5, 0.0, 0.6, 0.1, 0.0, 6.7, 0.3, 0.1, 0.0, 0.0, 0.0, 1.0, 18.8, 2.4, 0.3, 0.4, 12.8, 0.3, 6.8, 15.7, 1.0, 2.2, 1.6, 3.9, 0.0, 0.0, 1.2, 4.3, 2.7, 0.8, 7.2, 6.1, 5.5, 0.0, 4.2, 0.2, 5.1, 15.4, 0.0, 0.0, 3.3, 4.0, 3.1, 0.2, 1.6, 0.1, 3.5, 1.7, 0.0, 0.0, 0.6, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.6, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 6.4, 2.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2, 13.6, 3.4, 7.9, 2.2, 0.0, 1.0, 0.2, 0

## Part 3: The Raininess Index
A raininess index was then defined; the past year of weather data was weighted equally (i.e., yesterday's weather influences the index just as much as the weather 365 days ago; this strategy was implemented so as to prevent a skew between the Northern and Southern hemisphere), while the year prior was weighted (equally) less (i.e., 366 days ago is weighted the same amount as 730 days ago, both of which are weighted less than 364 days ago). Forecast data can be very temperamental and non-reflective of true weather patterns, and as such is weighted the least.

Finally, the index was scaled down using a logarithmic scale.
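
The weighting scheme described above can be sketched as follows. This is a minimal illustration, not the `raininess` function from Analysis_Functions.py: the three weights, the use of rain sums only, and `log1p` as the logarithmic scaling are all assumptions, so the values it produces will not match the indexes reported in this notebook.

```python
import math


def raininess_sketch(historical_rain, forecast_rain,
                     w_recent=1.0, w_older=0.5, w_forecast=0.25):
    """Weighted rain total on a log scale.

    historical_rain: daily rain in mm, oldest day first.
    forecast_rain:   projected daily rain in mm for the coming days.
    The weights are illustrative placeholders.
    """
    recent = historical_rain[-365:]   # most recent year, equal weight per day
    older = historical_rain[:-365]    # everything before that, lower weight
    weighted_total = (
        w_recent * sum(recent)
        + w_older * sum(older)
        + w_forecast * sum(forecast_rain)  # forecast is weighted least
    )
    # log1p keeps a completely dry record at 0 instead of -infinity
    return math.log1p(weighted_total)
```

Because every day within a block shares the same weight, a wet season contributes equally whether it fell last month or eleven months ago, which is the property the text relies on to avoid a hemisphere skew.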

#### Capital City Raininess
Using the capital city coordinate data, the raininess index was applied to each capital city and stored in a pandas DataFrame.
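
One way to run such a loop while staying under an API rate limit is to pause between requests. The sketch below is an assumption about how this could be done, not a copy of Data_Collection.py: `collect_raininess`, `fetch_index`, and the one-second pause are hypothetical, while the column names follow the ones used elsewhere in this notebook.

```python
import time


def collect_raininess(capitals, fetch_index, pause_seconds=1.0, sleep=time.sleep):
    """Compute an index for each capital, pausing between API calls.

    capitals:    iterable of dicts with 'Capital City', 'Country',
                 'Latitude' and 'Longitude' keys.
    fetch_index: callable (latitude, longitude) -> raininess index.
    """
    rows = []
    for position, capital in enumerate(capitals):
        if position > 0:
            sleep(pause_seconds)  # space out requests to respect the rate limit
        try:
            index = fetch_index(capital["Latitude"], capital["Longitude"])
        except Exception as error:  # keep going if one city fails
            print(f"Skipping {capital['Capital City']}: {error}")
            index = None
        rows.append({
            "Capital City": capital["Capital City"],
            "Country": capital["Country"],
            "Latitude": capital["Latitude"],
            "Longitude": capital["Longitude"],
            "Raininess": index,
        })
    return rows
```

The returned list of dictionaries can be passed straight to `pd.DataFrame(rows)` and saved with `to_csv`.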

## Visualizations

### World Map
The first visualization is a world map displaying the raininess of countries based on their capital cities. This incorporates data collected in [Data_Collection.py](https://github.com/lse-ds105/ds105a-2024-w06-summative-mjae-1616/blob/main/Data_Collection.py) of each capital city's raininess.

The graph is plotted using lets-plot. The countries included in the lets-plot geomapping package differ slightly from those in [country-capital-lat-long-population.csv](https://github.com/lse-ds105/ds105a-2024-w06-summative-mjae-1616/blob/main/country-capital-lat-long-population.csv); countries in the lets-plot package that aren't in the capital CSV are displayed in grey. To map country names from the lets-plot package to the CSV, some names were adjusted in the CSV (e.g., "Micronesia [Fed States of:]" was changed to "Federated States of Micronesia").
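
The name adjustments described above were made by hand in the CSV. As an alternative, the sketch below shows how the same harmonisation could be done in code and how leftover mismatches could be listed. The helper `harmonise_names` and the `NAME_FIXES` table are hypothetical; only the Micronesia entry comes from this notebook.

```python
# Hypothetical rename table: CSV spelling -> map-package spelling.
# Only the Micronesia entry is taken from the text above.
NAME_FIXES = {
    "Micronesia [Fed States of:]": "Federated States of Micronesia",
}


def harmonise_names(csv_names, map_names, fixes=NAME_FIXES):
    """Apply known renames, then report names that still do not match."""
    renamed = [fixes.get(name, name) for name in csv_names]
    renamed_set = set(renamed)
    map_set = set(map_names)
    unmatched_csv = sorted(renamed_set - map_set)  # in the CSV, not on the map
    unmatched_map = sorted(map_set - renamed_set)  # on the map, no capital data
    return renamed, unmatched_csv, unmatched_map
```

Names left in the second list would need a new entry in the rename table, while names in the third list correspond to the countries drawn in grey.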

In [179]:
LetsPlot.setup_html()

# Read the CSV into a DataFrame
rainiest_cities_df = pd.read_csv('rainiest_cities.csv')

# Convert to a GeoDataFrame
rainiest_cities_gdf = gpd.GeoDataFrame(
    rainiest_cities_df,
    geometry=gpd.points_from_xy(rainiest_cities_df['Longitude'], rainiest_cities_df['Latitude']),
    crs="EPSG:4326"  # WGS84 coordinate reference system
)

# Retrieve world boundaries and set the CRS to match
world_gdf = gpd.GeoDataFrame.from_features(geocode_countries().get_boundaries())
world_gdf.set_crs("EPSG:4326", inplace=True)

# Rename the country columns in both frames to 'Country_Name' so they can be merged
rainiest_cities_gdf.rename(columns={'Country': 'Country_Name'}, inplace=True)
world_gdf.rename(columns={'country': 'Country_Name'}, inplace=True)

# Merge rainiest cities with world boundaries on 'Country_Name'
merged_gdf = world_gdf.merge(rainiest_cities_gdf[['Country_Name', 'Raininess']], on='Country_Name', how='left')

# Fill missing Raininess values with 0 (or leave NaN if gray mapping for missing data is desired)
merged_gdf['Raininess'] = merged_gdf['Raininess'].fillna(0)

gdf = geocode_countries().get_boundaries(2)

# Add raininess data to match the 'Country_Name' with 'found name' in the world map data
# Assuming merged_gdf has 'Country_Name' and 'Raininess' columns, merge them with the country boundaries
gdf = gdf.merge(merged_gdf[['Country_Name', 'Raininess']], left_on='found name', right_on='Country_Name', how='left')

gdf['Raininess'] = pd.to_numeric(gdf['Raininess'], errors='coerce')

print(gdf.dtypes)
print(gdf[['found name', 'Raininess']].head())

# Plot with Raininess as the fill color
plot = (
    ggplot(gdf) +
    geom_map(aes(fill='Raininess'), color='white', 
             tooltips=layer_tooltips().line('@{found name}: @{Raininess}')) +
    scale_fill_gradient(low='lightgray', high='blue', na_value='lightgray') +  # Set NA color for missing data
    ggsize(800, 600) +
    ggtitle("World Raininess Distribution") +
    theme_void()
)

plot

country           object
found name        object
geometry        geometry
Country_Name      object
Raininess        float64
dtype: object
  found name  Raininess
0    Andorra  60.746116
1   Slovakia  59.124297
2    Austria  59.555555
3    Hungary  59.084918
4    Georgia  57.833600


### Raininess Index Histogram

The second visualization is a histogram of the raininess indexes of each world capital. The bin width is 2, and the visualization displays the frequency of occurrences in each bin. Hovering over a specific bin yields the index range and the frequency for that bin. The bin that London falls in is coloured darker than the other bins.

In [180]:
from lets_plot import *
LetsPlot.setup_html()

# Define the bin width and ranges
bin_width = 2
min_raininess = int(rainiest_cities_df['Raininess'].min())
max_raininess = int(rainiest_cities_df['Raininess'].max())
bins = range(min_raininess, max_raininess + bin_width, bin_width)
bin_ranges = [f"{start}-{start + bin_width - 1}" for start in bins[:-1]]

# Cut the data into bins and create labels
rainiest_cities_df['Raininess_bin'] = pd.cut(
    rainiest_cities_df['Raininess'],
    bins=bins,
    labels=bin_ranges,
    include_lowest=True
)

# Count the number of entries per bin
plot_df = rainiest_cities_df['Raininess_bin'].value_counts().sort_index().reset_index()
plot_df.columns = ['Raininess_bin', 'Count']

# Look up the bin that contains London so it can be highlighted without hard-coding a label
london_bin = rainiest_cities_df.loc[rainiest_cities_df['Capital City'] == 'London', 'Raininess_bin'].iloc[0]
plot_df['Color'] = plot_df['Raininess_bin'].apply(lambda x: "#a9a9a9" if x == london_bin else "#d3d3d3")

# Create the plot with custom fill colors
plot = (
    ggplot(plot_df, aes(x="Raininess_bin", y="Count", fill="Color")) +
    geom_bar(
        stat="identity",
        color="black",
        tooltips=layer_tooltips()
            .line("Range: @Raininess_bin")
            .line("Count: @Count")
    ) +
    scale_fill_identity() +  # Use the Color column directly for fill
    ggtitle("Distribution of Raininess Index with London's Position") +
    xlab("Raininess Index") +
    ylab("Frequency") +
    theme(axis_text_x=element_text(angle=45, hjust=1))  # Rotate x-axis labels
)

# Display the plot
plot


London's raininess index is approximately 60.8, which the histogram places in the 61-62 bin, the mode of the raininess indexes.

### Raininess Index Density

A density distribution graph can be used to see more precisely where London falls in relation to other world capital cities. This graph marks London's position directly on the curve.

In [181]:
london_raininess = float(rainiest_cities_df.loc[rainiest_cities_df['Capital City'] == 'London', 'Raininess'].values[0])

plot = (
    ggplot(rainiest_cities_df, aes(x="Raininess")) +
    geom_density(fill="#d3d3d3", color="black") +  # Density plot with gray fill
    geom_vline(
        xintercept=london_raininess,
        color="red",
        linetype="dashed",
        size=1
    ) +  # Dashed line to mark London's raininess
    geom_text(
        x=london_raininess + 2.3,  # Slightly offset to avoid overlap
        y=0.095,  # Near the top of the density range
        label="London",
        color="red",
        size=8,
        vjust=-0.5  # Position label above the line
    ) +  # Label for London
    ggtitle("Raininess Density Distribution with London's Position") +
    xlab("Raininess Index") +
    ylab("Density") +
    theme_minimal() +
    scale_y_continuous(limits=[0, 0.1])  # Fix the y-axis range at 0 to 0.1
)

plot

London (the red dashed line) falls to the left of the peak of the density distribution. Because the curve is an estimate of the probability density function (PDF), the area under the curve to the right of the red line represents the share of cities that are rainier than London, while the area to the left represents the share of cities with a raininess index less than or equal to London's. Since London sits somewhat left of the peak but still within the main cluster, a fair portion of cities are rainier than London. However, it is likely that a significant percentage of cities are also less rainy.

From this, we can gather that London is rainier than a substantial portion of cities, although not the majority in this dataset.
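
The share of capitals below London can also be computed directly instead of being read off the curve. The sketch below is an illustrative helper (`share_less_rainy` is a hypothetical name) applied to the five rows printed in the world map section, so its result says nothing about the full dataset.

```python
def share_less_rainy(indexes, target):
    """Fraction of values strictly below the target index."""
    if not indexes:
        raise ValueError("indexes must not be empty")
    return sum(1 for value in indexes if value < target) / len(indexes)


# The five rows printed in the world map section (not the full dataset)
sample_indexes = [60.746116, 59.124297, 59.555555, 59.084918, 57.833600]
london_index = 60.8  # approximate value quoted in the histogram section

print(share_less_rainy(sample_indexes, london_index))
```

With the full data, the same helper could be called as `share_less_rainy(rainiest_cities_df['Raininess'].tolist(), london_raininess)`.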

### Historical Rainfall Line Graph: London, Tokyo, Paris, New York City, and Sydney

Next, I created a line graph of the historical rainfall (in mm) of five cities over the last two years, averaged by month. The chosen cities (London, Tokyo, Paris, New York City, and Sydney) are globally recognized urban centers spanning several continents. This selection provides a well-rounded comparison of how precipitation patterns vary across different geographical regions and climates.

In [182]:
cities = [
    ("London", "GB"),
    ("Tokyo", "JP"),
    ("Paris", "FR"),
    ("New York City", "US"),
    ("Sydney", "AU")
]

precip_data = {"Date": [], "Rain (mm)": [], "City": []}

# Loop through each city to get the precipitation data
for city, country in cities:
    # Get latitude and longitude for the city
    coordinates = get_city_latlon(city, country)
    
    if coordinates:
        latitude, longitude = coordinates
        # Get historical precipitation data
        historical_data = get_historical_precipitation(latitude, longitude)
        rain_sum = historical_data["Historical Rain Sum"]
        
        # Generate dates based on the length of the data, assuming daily data
        start_date = pd.Timestamp.now() - timedelta(days=len(rain_sum))
        dates = [start_date + timedelta(days=i) for i in range(len(rain_sum))]
        
        # Append data to the main dictionary
        precip_data["Date"].extend(dates)
        precip_data["Rain (mm)"].extend(rain_sum)
        precip_data["City"].extend([city] * len(dates))
    else:
        print(f"Coordinates not found for {city}, {country}")

# Convert data to a DataFrame for Let's-Plot
precip_df = pd.DataFrame(precip_data)

precip_df['Date'] = pd.to_datetime(precip_df['Date'])

# Extract year and month for grouping
precip_df['Year'] = precip_df['Date'].dt.year
precip_df['Month'] = precip_df['Date'].dt.month

monthly_avg_precip = (
    precip_df.groupby(['City', 'Year', 'Month'])
    .filter(lambda x: len(x) >= 28)  # Keep only months with at least 28 days
    .groupby(['City', 'Year', 'Month'])
    .agg({'Rain (mm)': 'mean'})
    .reset_index()
)

# Reformat the date for plotting
monthly_avg_precip['Date'] = pd.to_datetime(monthly_avg_precip[['Year', 'Month']].assign(DAY=1))

# Now plot the updated DataFrame using Let's-Plot
from lets_plot import *
LetsPlot.setup_html()


# Define the cutoff date as two years ago from today
cutoff_date = pd.Timestamp.now() - pd.DateOffset(years=2)

# Filter the DataFrame to only include data from the last two years
monthly_avg_precip_last_2_years = monthly_avg_precip[monthly_avg_precip['Date'] >= cutoff_date]

# Now plot the filtered DataFrame using Let's-Plot
plot = (
    ggplot(monthly_avg_precip_last_2_years, aes(x='Date', y='Rain (mm)', color='City')) +
    geom_line() +
    ggtitle("Monthly Average Rainfall Comparison of Cities (Last 2 Years)") +
    xlab("Date") +
    ylab("Rainfall (mm)")
)

plot


London's monthly average rainfall is relatively consistent, without the pronounced peaks seen in cities like Tokyo. This suggests that London experiences more steady, moderate rainfall rather than seasonal extremes. This aligns with London’s reputation for frequent, light rain rather than intense downpours.
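
The impression of "steady" versus "peaky" rainfall can be put on a numeric footing with a coefficient of variation (standard deviation divided by mean) of each city's monthly averages. This measure is not part of the original analysis; `relative_variability` is a hypothetical helper shown only as one possible way to quantify the comparison.

```python
from statistics import mean, pstdev


def relative_variability(monthly_means):
    """Coefficient of variation of monthly average rainfall.

    Lower values indicate steadier rainfall across months.
    """
    average = mean(monthly_means)
    if average == 0:
        return 0.0  # a completely dry record has no variability
    return pstdev(monthly_means) / average
```

Applied per city, for example with `monthly_avg_precip.groupby('City')['Rain (mm)'].apply(lambda s: relative_variability(s.tolist()))`, a lower value for London than for Tokyo would support the reading above.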

## Part 4: Conclusion

This analysis provides insights into London's raininess relative to other global cities by examining historical rainfall data and a custom-defined raininess index. Through comparative visualizations, it appears that London exhibits moderate rainfall levels with some seasonal fluctuations, placing it neither among the wettest nor the driest cities within this selection. However, it’s essential to note that these results are heavily influenced by the specific parameters chosen to define the raininess index. Different definitions or thresholds for rain events, intensity, or frequency could yield a different perspective on London's relative position. As such, the conclusions drawn here should be understood within the context of these chosen metrics. Further exploration with adjusted parameters or additional climatic factors could offer a more nuanced understanding of London's precipitation patterns in a global context.