# What is our project about?

# "_London is a rainy city._"

# Data Pipeline Diagram

```{mermaid}
flowchart LR
    A{open-meteo} --> L((open-meteo API Requests))
    L --> B[London Dataframe]
    L --> C[City 1 Dataframe]
    L --> D[City ... Dataframe]
    L --> E[City 19 Dataframe]
    G{top 20 cities website} -->|scraping| H(python list of 20 cities)
    I{OpenStreetMaps} --> J((OSM API request))
    H --> J
    J --> K(python list of dictionaries for coordinates of all cities)
    K --> L
    B --> M[Weather Dataframe:<br>each row is a time and city]
    C --> M
    D --> M
    E --> M
    M --> F(save as json files)
    N{Google NGRAMS} --> S(json format)
    S -->|scraping| T(list of frequencies for each search)
    T --> U[NGRAMS Dataframe]
    U --> F
    M --> V[Final Database]
    U --> V
    V -->|Data manipulation| O(London Visualisations)
    V -->|Data manipulation| Q(Descriptive Visualisations)
    V -->|Data manipulation| R(More Complex and Interactive Visualisations)
```


# How did we get the weather data?

---
## Steps
1. Getting our list of cities.
2. Preparing the input to the open-meteo API call.
3. Getting the data from the open-meteo API.
4. Transforming the data into our desired format.

## Scraping the 20 most visited cities
```python
import requests
from scrapy import Selector
cities_url = "https://travelness.com/most-visited-cities-in-the-world" # URL of the page with the list of cities

response = requests.get(cities_url)
sel = Selector(response)

cities = sel.xpath("//table//tr/td[2]/text()").getall()
```
This returns a list of the top 20 most visited cities:
```
['Bangkok', 'Paris', 'London', 'Dubai', 'Singapore', 'Kuala Lumpur', 'New York', 'Istanbul', 'Tokyo', 'Antalya', 'Seoul', 'Osaka', 'Makkah', 'Phuket', 'Pattaya', 'Milan', 'Barcelona', 'Palma de Mallorca', 'Bali', 'Hong Kong SAR']
```

## Geocoding the cities
The open-meteo API takes coordinates rather than place names, so we first geocoded each city name using the OpenStreetMap Nominatim API (via `geopy`).
```python
from geopy.geocoders import Nominatim

def geocode_city(city):
    geolocator = Nominatim(user_agent="my_geocoder")
    location = geolocator.geocode(city)
    if location is None:  # Nominatim found no match for this name
        return None
    return {"city": city, "latitude": location.latitude, "longitude": location.longitude}

def geocode_cities(city_list):
    # Geocode each city exactly once and drop any that could not be resolved
    results = (geocode_city(city) for city in city_list)
    return [result for result in results if result is not None]

# Geocode the list of cities
geocoded_cities = geocode_cities(cities)
```
This returns a list of dictionaries. Here is one of the items as an example:
```
{"city": "Paris", "latitude": 48.8534951, "longitude": 2.3483915}
```
## Preparing the open-meteo API call
- The open-meteo API takes in a dictionary of parameters.
```python
params = {
    "latitude": [city["latitude"] for city in geocoded_cities],
    "longitude": [city["longitude"] for city in geocoded_cities],
    "start_date": "1940-01-01",
    "end_date": "2023-12-31",
    "daily": daily_variables_of_interest,
}
```
where `daily_variables_of_interest` is a list of the names of the daily weather variables we want.
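For illustration, the list might look like the sketch below. Its contents are an assumption: the names are taken from the columns unpacked later in `process_response`, and the order is assumed to match the indices used there (`Variables(0)` to `Variables(7)`).

```python
# Assumed contents: names and order inferred from the columns read in process_response
daily_variables_of_interest = [
    "temperature_2m_max",
    "temperature_2m_min",
    "temperature_2m_mean",
    "daylight_duration",
    "sunshine_duration",
    "precipitation_sum",
    "rain_sum",
    "precipitation_hours",
]
```

The order matters because the response exposes each variable by position rather than by name.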

## The actual API call
Here we use the client code provided in the open-meteo API documentation, which returns a list containing one response per city.

```python
import openmeteo_requests
import requests_cache
from retry_requests import retry

# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)

url = "https://archive-api.open-meteo.com/v1/archive"
responses = openmeteo.weather_api(url, params=params)
```
## Transforming the data into our desired format
Once we had retrieved the API responses, we had to convert them into a usable format. The API call returned a list of response objects that looked like this:
```
<openmeteo_sdk.WeatherApiResponse.WeatherApiResponse at 0x7fd78b1f7d30>
```
On its own, an object like this is not very useful for analysis.

## Open-meteo API documentation and example code
The open-meteo documentation does provide usable example code that returns a pandas dataframe for each city, but it did not suit our needs for two reasons:
- It produces a separate dataframe for each city rather than a single table.
- The dataframes have no column identifying the city, so after merging them we would not know which data points belong to which city.

## How did we overcome this?
Following the documentation, we wrote a custom function that processes each response and adds a column naming the city it corresponds to. Only then did we merge the dataframes.

## How did we overcome this?
```python
import pandas as pd
import openmeteo_requests

def process_response(response, geocoded_cities, i):
    daily = response.Daily()
    temperature_2m_max = daily.Variables(0).ValuesAsNumpy()
    temperature_2m_min = daily.Variables(1).ValuesAsNumpy()
    temperature_2m_mean = daily.Variables(2).ValuesAsNumpy()
    daylight_duration = daily.Variables(3).ValuesAsNumpy()
    sunshine_duration = daily.Variables(4).ValuesAsNumpy()
    precipitation_sum = daily.Variables(5).ValuesAsNumpy()
    rain_sum = daily.Variables(6).ValuesAsNumpy()
    precipitation_hours = daily.Variables(7).ValuesAsNumpy()

    daily_data = {
        "date": pd.date_range(
            start=pd.to_datetime(daily.Time(), unit="s", utc=True),
            end=pd.to_datetime(daily.TimeEnd(), unit="s", utc=True),
            freq=pd.Timedelta(seconds=daily.Interval()),
            inclusive="left"
        ).date
    }

    daily_data["city"] = geocoded_cities[i]['city']
    daily_data["temperature_2m_max"] = temperature_2m_max
    daily_data["temperature_2m_min"] = temperature_2m_min
    daily_data["temperature_2m_mean"] = temperature_2m_mean
    daily_data["daylight_duration"] = daylight_duration
    daily_data["sunshine_duration"] = sunshine_duration
    daily_data["precipitation_sum"] = precipitation_sum
    daily_data["rain_sum"] = rain_sum
    daily_data["precipitation_hours"] = precipitation_hours

    return pd.DataFrame(data=daily_data)
```

```python
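# Build one labelled dataframe per city (process_response is accessed here via the `cf` module)
dataframes_list = [cf.process_response(response, geocoded_cities, i) for i, response in enumerate(responses)]
# Stack the per-city dataframes into a single table and save it
merged_df = pd.concat(dataframes_list, ignore_index=True)
merged_df.to_csv("../data/weather_data.csv", index=False)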
```

## Final Outcome
```python
merged_df[merged_df['city']=='London'].describe()
```
```{HTML}
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>temperature_2m_max</th>
      <th>temperature_2m_min</th>
      <th>temperature_2m_mean</th>
      <th>daylight_duration</th>
      <th>sunshine_duration</th>
      <th>precipitation_sum</th>
      <th>rain_sum</th>
      <th>precipitation_hours</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>count</th>
      <td>30681.000000</td>
      <td>30681.000000</td>
      <td>30681.000000</td>
      <td>30681.000000</td>
      <td>30680.000000</td>
      <td>30680.000000</td>
      <td>30680.000000</td>
      <td>30681.000000</td>
    </tr>
    <tr>
      <th>mean</th>
      <td>13.774742</td>
      <td>6.729281</td>
      <td>10.306868</td>
      <td>44177.980469</td>
      <td>25653.326172</td>
      <td>1.691261</td>
      <td>1.642960</td>
      <td>3.851146</td>
    </tr>
    <tr>
      <th>std</th>
      <td>6.194537</td>
      <td>5.271986</td>
      <td>5.609030</td>
      <td>10799.614258</td>
      <td>16077.271484</td>
      <td>3.265338</td>
      <td>3.221255</td>
      <td>5.007728</td>
    </tr>
    <tr>
      <th>min</th>
      <td>-6.454500</td>
      <td>-15.904500</td>
      <td>-8.721166</td>
      <td>28170.857422</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>25%</th>
      <td>9.195499</td>
      <td>2.795500</td>
      <td>6.139249</td>
      <td>33813.128906</td>
      <td>12710.077148</td>
      <td>0.000000</td>
      <td>0.000000</td>
      <td>0.000000</td>
    </tr>
    <tr>
      <th>50%</th>
      <td>13.745500</td>
      <td>6.995500</td>
      <td>10.437167</td>
      <td>44288.531250</td>
      <td>25880.933594</td>
      <td>0.200000</td>
      <td>0.200000</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>75%</th>
      <td>18.545500</td>
      <td>10.945499</td>
      <td>14.853833</td>
      <td>54571.656250</td>
      <td>38797.984375</td>
      <td>1.900000</td>
      <td>1.800000</td>
      <td>7.000000</td>
    </tr>
    <tr>
      <th>max</th>
      <td>37.952000</td>
      <td>20.851999</td>
      <td>29.131165</td>
      <td>59899.007812</td>
      <td>55052.890625</td>
      <td>39.900002</td>
      <td>39.900002</td>
      <td>24.000000</td>
    </tr>
  </tbody>
</table>
</div>
```
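As a quick sanity check, the `count` of 30681 for London should equal the number of days in the requested range, counting both the start and end dates. A minimal sketch of that arithmetic, assuming all 20 cities were geocoded successfully:

```python
from datetime import date

start_date = date(1940, 1, 1)    # "start_date" from the API parameters
end_date = date(2023, 12, 31)    # "end_date" from the API parameters

# Inclusive day count: the difference in days plus the start day itself
days_per_city = (end_date - start_date).days + 1
n_cities = 20                    # assumes no city was dropped during geocoding
total_rows = days_per_city * n_cities

print(days_per_city)  # 30681
print(total_rows)     # 613620
```

Columns showing a count of 30680 have one missing value, since `describe()` counts only non-null entries.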

## Next steps
- Create a database connection.
- Ensure the data is saved in an efficient format, as the merged table has roughly 600,000 rows (20 cities × 30,681 days each).
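
One possible way to approach both steps is a local SQLite database. This is only a sketch of the idea, not something we have built yet: the table name `weather`, the reduced set of columns, and the in-memory database are assumptions, and the rows are dummy values for illustration rather than real measurements.

```python
import sqlite3

# In-memory database for illustration; a file path would make it persistent
connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE weather (
        date TEXT NOT NULL,
        city TEXT NOT NULL,
        precipitation_sum REAL,
        PRIMARY KEY (date, city)
    )
    """
)

# Dummy rows (not real data) in the same shape as our weather dataframe
dummy_rows = [
    ("1940-01-01", "London", 0.0),
    ("1940-01-02", "London", 1.5),
    ("1940-01-01", "Paris", 0.2),
]
connection.executemany("INSERT INTO weather VALUES (?, ?, ?)", dummy_rows)
connection.commit()

# Query one city, as the London visualisations would need to
london_rows = connection.execute(
    "SELECT COUNT(*) FROM weather WHERE city = ?", ("London",)
).fetchone()[0]
print(london_rows)
```

With pandas, `merged_df.to_sql("weather", connection, if_exists="replace", index=False)` would write the full dataframe to the same connection.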

# What have we found so far?


# Thanks for listening.