# Creating aggregate weather data and combining it with bike data

In this notebook we:
- Combine all of the weather data collected on both daily and hourly scales into ``DataFrame``s,
- Perform some preliminary cleaning on these weather ``DataFrame``s to remove columns containing virtually no information,
- Drop dates from the weather data that fall outside our range of interest, January 1, 2017 - April 30, 2024,
- Combine this data with the aggregated bike usage data (``aggregate data.csv`` in the ``bike data`` folder), and
- Save this combination of weather and bike usage data to ``bike_weather_combined_daily.csv``.

The result, ``bike_weather_combined_daily.csv``, still requires more cleaning; we will do this in another notebook.

In [1]:
import numpy as np
import pandas as pd

In [2]:
## Read the weather CSV files for the given years (and, for hourly data,
## months) and return a single DataFrame combining them, indexed by
## date (daily files) or by date and time (hourly files).
## years and months are lists of strings, e.g. ['2017'] and ['01', '02'].

def aggregate_weather(years, months=None):
    if not months:
        ## No months given: read the yearly files of daily data
        filenames = [year + '_Daily.csv' for year in years]
        index_col = ['Date/Time']
    else:
        ## Months given: read the monthly files of hourly data
        filenames = [year + '-' + month + '_Hourly.csv' for year in years for month in months]
        index_col = ['Date/Time (LST)']

    weathers = [pd.read_csv('./' + file, index_col=index_col, parse_dates=index_col) for file in filenames]
    ## Concatenate once rather than growing a DataFrame inside a loop
    return pd.concat(weathers)
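
The function above reads one CSV per period and stacks the results row-wise. As a minimal sketch of that pattern, the example below uses two small in-memory strings in place of the real files; the contents and the names ``csv_2017`` and ``csv_2018`` are hypothetical, and only the ``Date/Time`` column name follows the notebook.

```python
import io
import pandas as pd

# Two tiny in-memory "files" standing in for yearly daily-weather CSVs.
# The values are made up for illustration.
csv_2017 = "Date/Time,Max Temp (°C)\n2017-01-01,2.2\n2017-01-02,1.4\n"
csv_2018 = "Date/Time,Max Temp (°C)\n2018-01-01,-3.0\n2018-01-02,0.5\n"

frames = [
    pd.read_csv(io.StringIO(text), index_col=['Date/Time'], parse_dates=['Date/Time'])
    for text in (csv_2017, csv_2018)
]

# A single concat over the whole list stacks the rows and keeps the date index.
combined = pd.concat(frames)

print(combined.shape)                 # (4, 1)
print(type(combined.index).__name__)  # DatetimeIndex
```

Because the files are listed in chronological order, the combined index comes out sorted without an explicit ``sort_index()`` call.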

First we will combine the files in the ``weather_data`` folder into ``DataFrame``s, one for the hourly data and another for the daily data. We save each of them to its own file at the end of the notebook.

In [3]:
years = [str(i) for i in range(2017,2025)]
months = ['01','02','03','04','05','06','07','08','09','10','11','12']

aggregate_daily_weather = aggregate_weather(years)
aggregate_hourly_weather = aggregate_weather(years, months)

Looking at the columns of our data, we can see that several of them contain virtually no data (they are almost entirely ``NaN``):

In [4]:
aggregate_daily_weather.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2922 entries, 2017-01-01 to 2024-12-31
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Longitude (x)              2922 non-null   float64
 1   Latitude (y)               2922 non-null   float64
 2   Station Name               2922 non-null   object 
 3   Climate ID                 2922 non-null   int64  
 4   Year                       2922 non-null   int64  
 5   Month                      2922 non-null   int64  
 6   Day                        2922 non-null   int64  
 7   Data Quality               0 non-null      float64
 8   Max Temp (°C)              2656 non-null   float64
 9   Max Temp Flag              36 non-null     object 
 10  Min Temp (°C)              2659 non-null   float64
 11  Min Temp Flag              36 non-null     object 
 12  Mean Temp (°C)             2655 non-null   float64
 13  Mean Temp Flag             36 

In [5]:
aggregate_hourly_weather.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 70128 entries, 2017-01-01 00:00:00 to 2024-12-31 23:00:00
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Longitude (x)        70128 non-null  float64
 1   Latitude (y)         70128 non-null  float64
 2   Station Name         70128 non-null  object 
 3   Climate ID           70128 non-null  int64  
 4   Year                 70128 non-null  int64  
 5   Month                70128 non-null  int64  
 6   Day                  70128 non-null  int64  
 7   Time (LST)           70128 non-null  object 
 8   Temp (°C)            64481 non-null  float64
 9   Temp Flag            0 non-null      float64
 10  Dew Point Temp (°C)  64481 non-null  float64
 11  Dew Point Temp Flag  1 non-null      object 
 12  Rel Hum (%)          64480 non-null  float64
 13  Rel Hum Flag         2 non-null      object 
 14  Precip. Amount (mm)  0 non-null      float64
 15  P

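Now we'll drop the columns that contain almost no data. Even if such a column is useful where it is non-null, it is populated too rarely to support a meaningful analysis. The drop lists also include the station metadata and the ``Year``, ``Month``, and ``Day`` columns, which are fully populated but either describe the station rather than the weather or duplicate information already in the index.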

In [6]:
daily_drop_cols = ['Longitude (x)', 'Latitude (y)', 'Station Name', 'Climate ID',
                  'Year', 'Month', 'Day', 'Data Quality', 'Max Temp Flag', 'Min Temp Flag',
                  'Mean Temp Flag','Heat Deg Days Flag', 'Cool Deg Days Flag','Total Rain Flag',
                  'Total Snow Flag','Total Precip Flag','Snow on Grnd Flag', 'Dir of Max Gust (10s deg)',
                  'Dir of Max Gust Flag','Spd of Max Gust Flag']
hourly_drop_cols = ['Longitude (x)', 'Latitude (y)', 'Station Name', 'Climate ID',
                   'Year', 'Month', 'Day', 'Time (LST)', 'Temp Flag', 'Dew Point Temp Flag',
                   'Rel Hum Flag', 'Precip. Amount (mm)', 'Precip. Amount Flag', 'Wind Dir Flag',
                    'Wind Spd Flag', 'Visibility Flag', 'Stn Press Flag', 'Hmdx',
                    'Hmdx Flag', 'Wind Chill Flag']

aggregate_daily_weather = aggregate_daily_weather.drop(columns=daily_drop_cols)
aggregate_hourly_weather = aggregate_hourly_weather.drop(columns=hourly_drop_cols)
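
The drop lists above were written by hand. As a rough sketch of how candidate columns could be found programmatically, the fraction of missing values per column can be compared against a cutoff. The toy data and the ``threshold`` value here are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one well-populated and one nearly empty column.
toy = pd.DataFrame({
    'Max Temp (°C)': [2.2, 1.4, 0.4, np.nan],
    'Max Temp Flag': [np.nan, np.nan, np.nan, 'M'],
})

# Fraction of missing values in each column.
nan_fraction = toy.isna().mean()

# Hypothetical cutoff: flag columns that are more than half empty.
threshold = 0.5
sparse_cols = nan_fraction[nan_fraction > threshold].index.tolist()

print(sparse_cols)  # ['Max Temp Flag']
```

A cutoff alone would not reproduce the hand-written lists: the station metadata is fully populated yet dropped, while sparse columns such as ``Snow on Grnd (cm)`` and ``Wind Chill`` are kept.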

In [7]:
aggregate_daily_weather.info(), len(aggregate_daily_weather)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2922 entries, 2017-01-01 to 2024-12-31
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Max Temp (°C)           2656 non-null   float64
 1   Min Temp (°C)           2659 non-null   float64
 2   Mean Temp (°C)          2655 non-null   float64
 3   Heat Deg Days (°C)      2655 non-null   float64
 4   Cool Deg Days (°C)      2655 non-null   float64
 5   Total Rain (mm)         2666 non-null   float64
 6   Total Snow (cm)         2672 non-null   float64
 7   Total Precip (mm)       2677 non-null   float64
 8   Snow on Grnd (cm)       124 non-null    float64
 9   Spd of Max Gust (km/h)  1861 non-null   object 
dtypes: float64(9), object(1)
memory usage: 251.1+ KB


(None, 2922)

In [8]:
aggregate_hourly_weather.info(), len(aggregate_hourly_weather)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 70128 entries, 2017-01-01 00:00:00 to 2024-12-31 23:00:00
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Temp (°C)            64481 non-null  float64
 1   Dew Point Temp (°C)  64481 non-null  float64
 2   Rel Hum (%)          64480 non-null  float64
 3   Wind Dir (10s deg)   63062 non-null  float64
 4   Wind Spd (km/h)      64478 non-null  float64
 5   Visibility (km)      64482 non-null  float64
 6   Stn Press (kPa)      64481 non-null  float64
 7   Wind Chill           2531 non-null   float64
 8   Weather              28852 non-null  object 
dtypes: float64(8), object(1)
memory usage: 5.4+ MB


(None, 70128)

In both ``DataFrame``s the number of rows exceeds the largest non-null count of any column. Inspecting the data shows why: the weather files reserve rows for days that had not yet occurred when the data was collected (our files run through December 2024). For example, the last ten rows of the daily data are entirely empty:

In [9]:
aggregate_daily_weather.tail(10)

Date/Time,Max Temp (°C),Min Temp (°C),Mean Temp (°C),Heat Deg Days (°C),Cool Deg Days (°C),Total Rain (mm),Total Snow (cm),Total Precip (mm),Snow on Grnd (cm),Spd of Max Gust (km/h)
2024-12-22,,,,,,,,,,
2024-12-23,,,,,,,,,,
2024-12-24,,,,,,,,,,
2024-12-25,,,,,,,,,,
2024-12-26,,,,,,,,,,
2024-12-27,,,,,,,,,,
2024-12-28,,,,,,,,,,
2024-12-29,,,,,,,,,,
2024-12-30,,,,,,,,,,
2024-12-31,,,,,,,,,,
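
One way to locate where the real observations end, rather than reading it off the tail, is to discard rows that are empty in every column and take the largest remaining index value. This is a sketch on made-up toy data; the column names follow the notebook but the values do not come from the weather files.

```python
import numpy as np
import pandas as pd

# Six consecutive days; only the first three carry any observation.
idx = pd.date_range('2024-04-28', '2024-05-03', freq='D')
toy = pd.DataFrame(
    {
        'Total Precip (mm)': [1.0, 0.0, 2.5, np.nan, np.nan, np.nan],
        'Max Temp (°C)': [10.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    },
    index=idx,
)

# Keep rows with at least one non-null value, then take the latest date.
last_observed = toy.dropna(how='all').index.max()

print(last_observed.date())  # 2024-04-30
```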


Hence the next step is to trim each of these ``DataFrame``s, together with the aggregated bike data, so that all of them cover January 1, 2017 through April 30, 2024. We will store the trimmed ``DataFrame``s in a list for ease of concatenation in the following steps.

In [10]:
aggregate_daily_bike_data = pd.read_csv('../bike data/aggregate data.csv', index_col=['Date'], parse_dates=['Date'])

## Load in aggregated hourly bike data here, aggregate_hourly_bike_data = ...

In [11]:
drop_range = pd.date_range(start='2024-05-01',end='2025-01-01',freq='h')

aggs = [aggregate_daily_bike_data, aggregate_daily_weather, aggregate_hourly_weather] ## Add entry for aggregate_hourly_bike_data

for idx, df in enumerate(aggs):
    aggs[idx] = df.drop([index for index in drop_range if index in df.index])
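
An alternative to building a list of labels to drop is to keep only the rows whose index falls before a cutoff timestamp. The sketch below shows this on toy daily and hourly frames; ``cutoff``, ``toy_daily``, and ``toy_hourly`` are illustrative names, and the approach assumes every frame has a ``DatetimeIndex``.

```python
import pandas as pd

# Toy frames straddling the end of April 2024.
toy_daily = pd.DataFrame(
    {'value': range(6)},
    index=pd.date_range('2024-04-28', periods=6, freq='D'),
)
toy_hourly = pd.DataFrame(
    {'value': range(72)},
    index=pd.date_range('2024-04-29', periods=72, freq='h'),
)

# Keep everything strictly before May 1, 2024 (i.e. through April 30, 23:00).
cutoff = pd.Timestamp('2024-05-01')
trimmed = [df.loc[df.index < cutoff] for df in (toy_daily, toy_hourly)]

print([len(df) for df in trimmed])  # [3, 48]
```

A boolean mask works the same way for daily and hourly indices and avoids checking each candidate timestamp for membership in the index.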

In [12]:
## Verify that the indices of the two trimmed daily DataFrames are the same:
## check whether any index entry (recall these are datetime64[ns] values)
## of the bike data differs from the corresponding entry of the weather data
any(aggs[0].index != aggs[1].index)

## Similarly for hourly data:
## any(aggs[2].index != aggs[3].index)

False
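
The elementwise comparison above needs both indices to have the same length. As a hedged alternative, ``Index.equals`` returns a single boolean and also handles indices of different lengths, and ``Index.difference`` lists the entries that are missing from the other index. The indices below are toy examples, not the notebook's data.

```python
import pandas as pd

a = pd.date_range('2017-01-01', periods=5, freq='D')
b = pd.date_range('2017-01-01', periods=5, freq='D')
c = pd.date_range('2017-01-01', periods=4, freq='D')  # one day shorter

print(a.equals(b))  # True
print(a.equals(c))  # False

# Dates present in a but absent from c.
missing_from_c = a.difference(c)
print(list(missing_from_c))  # [Timestamp('2017-01-05 00:00:00')]
```

Note that the sense is reversed relative to the check above: ``equals`` returns ``True`` when the indices match, whereas ``any(... != ...)`` returns ``False``.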

Now that the ``DataFrame``s have matching indices, we can join them side by side.

In [13]:
full_daily = pd.concat([aggs[0], aggs[1]], axis=1)
full_daily.index.names = ['Date']
## full_hourly = pd.concat([aggs[2], aggs[3]], axis=1)
## full_hourly.index.names = ['Date/Time']
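
To illustrate why the index check matters, the sketch below shows how ``pd.concat`` with ``axis=1`` aligns rows by index label. With the default outer join, a date missing from one frame produces ``NaN`` rather than an error; ``join='inner'`` keeps only the shared dates. The frames are toy examples with a deliberately missing day.

```python
import pandas as pd

idx = pd.date_range('2017-01-01', periods=3, freq='D')
bike = pd.DataFrame({'Bike trips': [162, 270, 384]}, index=idx)
# The weather frame lacks the third day on purpose.
weather = pd.DataFrame({'Max Temp (°C)': [2.2, 1.4]}, index=idx[:2])

# Default (outer) join: all dates kept, gaps filled with NaN.
outer = pd.concat([bike, weather], axis=1)
# Inner join: only dates present in both frames.
inner = pd.concat([bike, weather], axis=1, join='inner')

print(outer.shape)  # (3, 2)
print(inner.shape)  # (2, 2)
print(outer['Max Temp (°C)'].isna().sum())  # 1
```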

In [14]:
full_daily

Date,Bike trips,Total distance (m),Total duration (sec),Mean departure temperature (C),Mean return temperature (C),Electric bike trips,Max Temp (°C),Min Temp (°C),Mean Temp (°C),Heat Deg Days (°C),Cool Deg Days (°C),Total Rain (mm),Total Snow (cm),Total Precip (mm),Snow on Grnd (cm),Spd of Max Gust (km/h)
2017-01-01,162,338025.00,232693.0,3.555556,4.296296,,2.2,-2.3,-0.1,18.1,0.0,0.0,0.0,0.0,3.0,41
2017-01-02,270,660054.00,382729.0,2.718519,3.688889,,1.4,-6.0,-2.3,20.3,0.0,0.0,0.0,0.0,2.0,<31
2017-01-03,384,635395.00,376013.0,0.807292,1.791667,,0.4,-7.8,-3.7,21.7,0.0,0.0,0.0,0.0,1.0,<31
2017-01-04,460,766082.00,376721.0,2.767391,3.643478,,2.2,-8.4,-3.1,21.1,0.0,0.0,0.0,0.0,1.0,32
2017-01-05,524,888222.00,524933.0,2.601145,3.524809,,0.7,-6.6,-3.0,21.0,0.0,0.0,0.0,0.0,1.0,<31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-04-26,3367,9459383.33,4193707.0,13.858925,14.667360,1346.0,14.6,8.7,11.7,6.3,0.0,6.6,0.0,6.6,,
2024-04-27,1291,3274519.00,1443072.0,10.175058,10.984508,547.0,11.1,8.8,10.0,8.0,0.0,6.8,0.0,6.8,,40.0
2024-04-28,2313,6738587.00,2859352.0,11.309987,11.947687,851.0,12.8,7.9,10.4,7.6,0.0,0.5,0.0,0.5,,46.0
2024-04-29,3929,11560811.99,4935558.0,14.148129,14.511835,1467.0,,,,,,0.0,0.0,0.0,,37.0


In [15]:
## full_hourly

Lastly, save the modifications.

In [16]:
aggs[1].to_csv('./aggregate_daily_weather.csv',index_label='Date/Time')
aggs[2].to_csv('./aggregate_hourly_weather.csv',index_label='Date/Time (LST)')
full_daily.to_csv('bike_weather_combined_daily.csv',index_label='Date')
## full_hourly.to_csv('bike_weather_combined_hourly.csv',index_label='Date/Time')
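
When the saved files are read back in a later notebook, the index column has to be named and parsed as dates again. The following is a minimal round-trip sketch using an in-memory buffer in place of a file on disk; the toy frame and the ``buffer`` name are assumptions for illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {'Bike trips': [162, 270]},
    index=pd.date_range('2017-01-01', periods=2, freq='D'),
)

# Write with an explicit index label, as in the cell above.
buffer = io.StringIO()
toy.to_csv(buffer, index_label='Date')
buffer.seek(0)

# Read back, restoring the labelled column as a parsed date index.
restored = pd.read_csv(buffer, index_col=['Date'], parse_dates=['Date'])

print(restored.index.name)                  # Date
print(type(restored.index).__name__)        # DatetimeIndex
```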