# Data Collection

To answer the question **“Is London really as rainy as the movies make it out to be?”**, several choices need to be made: the time period, the cities to compare against London, how to define and measure raininess, and therefore which Open-Meteo variables to use.

### The time period
To explore this question in depth, I am going to focus on a historical period (1st January 1970 - 31st December 1980) and a more recent period (1st January 2010 - 31st December 2020). This provides a large enough dataset to answer the question and allows investigation into how raininess has changed over time.
 
### Other cities to compare against London
To assess whether London is as rainy as movies suggest, a broad sample of global cities should be used in the comparison. This will enhance the reliability of findings, ensuring different climates are accounted for. As such the cities I will investigate are:
 - Manchester, UK: offers a regional counterpoint to London, indicating whether the amount of rain in London is unusual for England
 - Edinburgh, Scotland: provides another UK perspective, allowing London's weather to be assessed against the broader UK climate
 - Cork, Ireland: close to the UK and in a country also known for its wet climate, it provides a meaningful benchmark for whether London's rainfall stands out even among neighbouring regions with similar weather patterns
 - Paris, France: relatively close to London and at a similar latitude, so it may have a similar climate and allows a comparison between major European cities
 - Rome, Italy: offers a warmer Mediterranean climate to contrast with London's supposedly rainier one
 - Seattle, Washington, USA: another city with a rainy reputation, making it a potentially good benchmark for comparison
 - Bogotá, Colombia: a city in tropical South America, characterised by high humidity and significant rainfall, providing a valuable perspective on London's rainfall
 - Cairo, Egypt: a desert city with minimal rain, representing the dry extreme of rainfall levels
 - Cape Town, South Africa: a Southern Hemisphere city with distinct wet and dry seasons
 - Mumbai, India: a tropical monsoon climate with heavy seasonal rains to compare with London
 

In [1]:
import json 
import requests
from lets_plot import *
import pandas as pd

LetsPlot.setup_html()


In [22]:

def plot_city_map(city_data,
                 point_size=2, 
                 point_colour='blue',
                 title='City Locations', 
                 title_size=30):
    
    city_df = pd.DataFrame(city_data, columns=['city', 'latitude', 'longitude'])
    
    # Plotting the points on a map
    plot = (
        ggplot() +
        geom_livemap() +
        geom_point(aes(x='longitude', y='latitude'),
                        size=point_size,
                        colour=point_colour, 
                        show_legend=False, 
                        data=city_df) + 
        ggsize(800,500) +
        labs(
        title=title) + 
        theme_minimal() + 
        theme(plot_title=element_text(size=title_size, hjust=0.5))
    )

    return plot

# Sample data as a list of tuples (city, latitude, longitude)
city_coords = [
    ("London", 51.5072, -0.1276),  
    ("Manchester", 53.4808, -2.2426),  
    ("Edinburgh", 55.9533, -3.1883),
    ("Cork", 51.8985, -8.4756),
    ("Paris", 48.8575, 2.3514),
    ("Rome", 41.8967, 12.4822),
    ("Seattle", 47.6061, -122.3328),    
    ("Bogota", 4.7110, -74.0721),  
    ("Cairo", 30.0444, 31.2357),
    ("Cape Town", -33.9221, 18.4231),
    ("Mumbai", 19.0760, 72.8777)
]

# Call the function to plot the city map
city_map_plot = plot_city_map(city_coords, point_colour='#1F627D')
city_map_plot


As the map demonstrates, these locations will provide a wide range of counterpoints to compare London against.

### Measuring and defining raininess
Raininess depends on the frequency and duration of rain and on the amount of precipitation. The Open-Meteo variables used are therefore:
- Precipitation Sum 
- Rain Sum 
- Precipitation Hours
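
One way these three variables could be turned into comparable raininess measures is sketched below. The function name `summarise_raininess` and the 1.0 mm wet-day threshold are assumptions made for illustration, not part of Open-Meteo or of the analysis in this notebook; the input is assumed to be a dictionary of equal-length daily lists, shaped like the per-city data collected in the following cells.

```python
def summarise_raininess(daily, wet_day_threshold_mm=1.0):
    """Summarise daily precipitation lists into simple raininess measures.

    `daily` is assumed to hold equal-length lists under the keys
    'precipitation_sum' (mm) and 'precipitation_hours' (hours).
    The 1.0 mm wet-day threshold is an illustrative assumption.
    """
    # Pair up the two series and drop days where either value is missing
    days = [
        (p, h)
        for p, h in zip(daily["precipitation_sum"], daily["precipitation_hours"])
        if p is not None and h is not None
    ]
    if not days:
        return None

    # A "wet day" here is any day at or above the chosen threshold
    wet_days = [(p, h) for p, h in days if p >= wet_day_threshold_mm]
    return {
        "total_precipitation_mm": sum(p for p, _ in days),
        "wet_day_share": len(wet_days) / len(days),
        "mean_hours_on_wet_days": (
            sum(h for _, h in wet_days) / len(wet_days) if wet_days else 0.0
        ),
    }


# Illustrative values only, not real measurements
sample = {
    "precipitation_sum": [0.0, 0.3, 1.4, 4.5],
    "precipitation_hours": [0.0, 3.0, 4.0, 8.0],
}
summary = summarise_raininess(sample)
```

Separating the amount (total precipitation), the frequency (share of wet days) and the duration (hours on wet days) keeps the three aspects of raininess named above distinct, so a city with rare but heavy downpours is not conflated with one that has frequent drizzle.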

## Data collection

### Collecting historical data

In [10]:

def get_historical_data(city_coords, start_date, end_date):
    base_historical_url = "https://archive-api.open-meteo.com/v1/archive"
    precipitation = {}
    
    for city, latitude, longitude in city_coords:
        params_lat_long = f"latitude={latitude}&longitude={longitude}"
        params_others = "&daily=precipitation_sum,rain_sum,precipitation_hours" 
        params_dates = f"&start_date={start_date}&end_date={end_date}"
        
        end_url = base_historical_url + '?' + params_lat_long + params_others + params_dates
        
        historical_response = requests.get(end_url)
        
        # Only parse and store the data if the request succeeded
        if historical_response.status_code == 200:
            historical_data = historical_response.json()
            historical_precipitation = {
                "precipitation_sum": historical_data['daily']['precipitation_sum'],
                "rain_sum": historical_data['daily']['rain_sum'],
                "precipitation_hours": historical_data['daily']['precipitation_hours'],
            }
            precipitation[city] = historical_precipitation
        else:
            print(f"Error for {city}: {historical_response.status_code}")
            precipitation[city] = None 

    # Saving the data in json file
    with open("../data/precipitation_data_hist.json", "w") as file:
        json.dump(precipitation, file, indent=4) 
    
    return precipitation 

city_coords = [
    ("London", 51.5072, -0.1276),  
    ("Manchester", 53.4808, -2.2426),  
    ("Edinburgh", 55.9533, -3.1883),
    ("Cork", 51.8985, -8.4756),
    ("Paris", 48.8575, 2.3514),
    ("Rome", 41.8967, 12.4822),
    ("Seattle", 47.6061, -122.3328),    
    ("Bogota", 4.7110, -74.0721),  
    ("Cairo", 30.0444, 31.2357),
    ("Cape Town", -33.9221, 18.4231),
    ("Mumbai", 19.0760, 72.8777)
]

precipitation_hist = get_historical_data(city_coords, start_date="1970-01-01", end_date="1980-12-31")



### Collecting data for the more recent time period
This was done separately to avoid making too many API requests in one day.
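
Since the two collection functions differ only in their dates and output file, the request URL could be built by one shared helper and reused for both periods. The sketch below is an illustration under that assumption: `build_archive_url` is a hypothetical name, and it only assembles the same query string as the cells in this notebook using the standard library, without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = "https://archive-api.open-meteo.com/v1/archive"
DAILY_VARIABLES = ["precipitation_sum", "rain_sum", "precipitation_hours"]


def build_archive_url(latitude, longitude, start_date, end_date):
    """Return the archive request URL for one city and one date range."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "daily": ",".join(DAILY_VARIABLES),
        "start_date": start_date,
        "end_date": end_date,
    }
    # safe="," keeps the comma-separated variable list unescaped in the URL
    return BASE_URL + "?" + urlencode(params, safe=",")


# The two periods used in this notebook, keyed by a short label
periods = {
    "hist": ("1970-01-01", "1980-12-31"),
    "rec": ("2010-01-01", "2020-12-31"),
}
london_urls = {
    label: build_archive_url(51.5072, -0.1276, start, end)
    for label, (start, end) in periods.items()
}
```

Each URL could then be passed to `requests.get` in the same way as in the collection cells, with the two periods still run on different days if the request limit is a concern.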

In [4]:

def get_recent_data(city_coords, start_date, end_date):
    base_historical_url = "https://archive-api.open-meteo.com/v1/archive"
    precipitation = {}
    
    for city, latitude, longitude in city_coords:
        params_lat_long = f"latitude={latitude}&longitude={longitude}"
        params_others = "&daily=precipitation_sum,rain_sum,precipitation_hours"  
        params_dates = f"&start_date={start_date}&end_date={end_date}"
        
        end_url = base_historical_url + '?' + params_lat_long + params_others + params_dates
        
        recent_response = requests.get(end_url)
        
        # Only parse and store the data if the request succeeded
        if recent_response.status_code == 200:
            recent_data = recent_response.json()
            recent_precipitation = {
                "precipitation_sum": recent_data['daily']['precipitation_sum'],
                "rain_sum": recent_data['daily']['rain_sum'],
                "precipitation_hours": recent_data['daily']['precipitation_hours'],
            }
            precipitation[city] = recent_precipitation
        else:
            print(f"Error for {city}: {recent_response.status_code}")
            precipitation[city] = None 

    # Saving the data to a json file
    with open("../data/precipitation_data_rec.json", "w") as file:
        json.dump(precipitation, file, indent=4) 
    
    return precipitation 

city_coords = [
    ("London", 51.5072, -0.1276),  
    ("Manchester", 53.4808, -2.2426),  
    ("Edinburgh", 55.9533, -3.1883),
    ("Cork", 51.8985, -8.4756),
    ("Paris", 48.8575, 2.3514),
    ("Rome", 41.8967, 12.4822),
    ("Seattle", 47.6061, -122.3328),    
    ("Bogota", 4.7110, -74.0721),  
    ("Cairo", 30.0444, 31.2357),
    ("Cape Town", -33.9221, 18.4231),
    ("Mumbai", 19.0760, 72.8777)
]

precipitation_recent = get_recent_data(city_coords, start_date="2010-01-01", end_date="2020-12-31")



#### Creating a dataframe from the JSON files

In [12]:

# Loading JSON data from both files, using the same ../data/ paths they were saved to
with open("../data/precipitation_data_hist.json") as precip_hist, open("../data/precipitation_data_rec.json") as precip_rec:
    precipitation_data_hist = json.load(precip_hist)
    precipitation_data_rec = json.load(precip_rec)

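# Transforming the data into a tidy format: one row per city per day
def transform_to_tidy(data, start_date):
    tidy_data = []
    for city, city_data in data.items():
        # Cities whose request failed were saved as None, so skip them
        if city_data is None:
            continue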
        for i, (precip_sum, rain_sum, precip_hours) in enumerate(zip(
            city_data['precipitation_sum'], city_data['rain_sum'], city_data['precipitation_hours']
        )):
            tidy_data.append({
                "City": city,
                "Date": pd.to_datetime(start_date) + pd.Timedelta(days=i),  
                "Precipitation Sum": precip_sum,
                "Rain Sum": rain_sum,
                "Precipitation Hours": precip_hours
            })
    return tidy_data

start_date_hist = "1970-01-01"
start_date_rec = "2010-01-01"

tidy_data_hist = transform_to_tidy(precipitation_data_hist, start_date_hist)
tidy_data_rec = transform_to_tidy(precipitation_data_rec, start_date_rec)

all_data = pd.DataFrame(tidy_data_hist + tidy_data_rec)

# Testing the df is correctly formatted
print(all_data.head(10))
print(all_data.tail(40))


     City       Date  Precipitation Sum  Rain Sum  Precipitation Hours
0  London 1970-01-01                0.0       0.0                  0.0
1  London 1970-01-02                0.3       0.3                  3.0
2  London 1970-01-03                0.3       0.3                  3.0
3  London 1970-01-04                0.0       0.0                  0.0
4  London 1970-01-05                0.0       0.0                  0.0
5  London 1970-01-06                1.4       0.2                  4.0
6  London 1970-01-07                0.0       0.0                  0.0
7  London 1970-01-08                4.5       0.9                  8.0
8  London 1970-01-09                5.6       5.6                 14.0
9  London 1970-01-10                1.2       1.2                  4.0
         City       Date  Precipitation Sum  Rain Sum  Precipitation Hours
88356  Mumbai 2020-11-22                0.0       0.0                  0.0
88357  Mumbai 2020-11-23                0.0       0.0                

In [13]:
# saving the data as a csv
all_data.to_csv("../data/rainfall_data.csv", index=False)
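
As a rough cross-check on the combined table, the expected number of rows can be worked out from the date ranges alone. The sketch below assumes every one of the 11 cities returned a complete daily series for both periods; `days_inclusive` is a hypothetical helper name.

```python
from datetime import date


def days_inclusive(start, end):
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


hist_days = days_inclusive(date(1970, 1, 1), date(1980, 12, 31))
rec_days = days_inclusive(date(2010, 1, 1), date(2020, 12, 31))
n_cities = 11

# One row per city per day, across both periods
expected_rows = n_cities * (hist_days + rec_days)
print(hist_days, rec_days, expected_rows)  # 4018 4018 88396
```

This agrees with the printed output above, where `tail(40)` starts at index 88356, that is 88396 - 40. Comparing `len(all_data)` with `expected_rows` would flag a city whose request failed or returned a shorter series.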
