# Cleaning and Preprocessing ``bike_weather_combined_daily.csv``

In [1]:
import numpy as np
import pandas as pd

In [2]:
bikeshare = pd.read_csv('./weather_data/bike_weather_combined_daily.csv', index_col=['Date'], parse_dates=['Date'])

In [3]:
bikeshare.info(), len(bikeshare)

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2677 entries, 2017-01-01 to 2024-04-30
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Bike trips                      2677 non-null   int64  
 1   Total distance (m)              2677 non-null   float64
 2   Total duration (sec)            2677 non-null   float64
 3   Mean departure temperature (C)  2645 non-null   float64
 4   Mean return temperature (C)     2645 non-null   float64
 5   Electric bike trips             638 non-null    float64
 6   Max Temp (°C)                   2645 non-null   float64
 7   Min Temp (°C)                   2648 non-null   float64
 8   Mean Temp (°C)                  2644 non-null   float64
 9   Heat Deg Days (°C)              2644 non-null   float64
 10  Cool Deg Days (°C)              2644 non-null   float64
 11  Total Rain (mm)                 2655 non-null   float64
 12  Total Snow (cm) 

(None, 2677)

The first and easiest cleaning steps are:
- Replace null values in ``Electric bike trips`` with zeros
- Replace null values in ``Snow on Grnd (cm)`` with zeros
- Replace null values in ``Spd of Max Gust (km/h)`` with ``<31``

These steps are justified because these measurements are simply not recorded when our replacement value is what actually occurred (e.g., ``Spd of Max Gust`` is not recorded if the largest gust is less than 29 km/h, and the lowest value recorded in ``Spd of Max Gust`` is ``<31``).

The secondary cleaning steps are:
- Replace null values in ``Max Temp``, ``Min Temp``, ``Mean Temp``, ``Heat Deg Days``, and ``Cool Deg Days`` with linear interpolations of their closest non-null values
- Replace null values in ``Mean departure temperature`` and ``Mean return temperature`` with the mean temperature for each given day

These actions reasonably justify our removal of ``Flag`` data from ``bike_weather_comb.ipynb``.

Lastly we will examine the null values in ``Total Rain/Snow/Precip`` and decide how to handle them.

The preprocessing steps we will then take are:
- Add a column for the mean temperature of each bike ride
- Convert the ``Daylength`` values, which are in hours, minutes, and seconds, to the number of minutes of daylight

Before we do any of this, let us rename the columns to have shorter, more typable names.

In [4]:
bikeshare.columns

Index(['Bike trips', 'Total distance (m)', 'Total duration (sec)',
       'Mean departure temperature (C)', 'Mean return temperature (C)',
       'Electric bike trips', 'Max Temp (°C)', 'Min Temp (°C)',
       'Mean Temp (°C)', 'Heat Deg Days (°C)', 'Cool Deg Days (°C)',
       'Total Rain (mm)', 'Total Snow (cm)', 'Total Precip (mm)',
       'Snow on Grnd (cm)', 'Spd of Max Gust (km/h)', 'Daylength'],
      dtype='object')

In [5]:
col_names = {'Bike trips':'num_trips', 'Total distance (m)':'total_dist', 'Total duration (sec)':'total_duration',
       'Mean departure temperature (C)':'mean_dep_temp', 'Mean return temperature (C)':'mean_ret_temp',
       'Electric bike trips':'ebike_trips', 'Max Temp (°C)':'max_temp', 'Min Temp (°C)':'min_temp',
       'Mean Temp (°C)':'mean_temp', 'Heat Deg Days (°C)':'hdd', 'Cool Deg Days (°C)':'cdd',
       'Total Rain (mm)':'rain', 'Total Snow (cm)':'snow', 'Total Precip (mm)':'total_precip',
       'Snow on Grnd (cm)':'snow_on_ground', 'Spd of Max Gust (km/h)':'max_gust', 'Daylength':'day_length'}

bikeshare = bikeshare.rename(columns=col_names)

## First cleaning steps

In [6]:
replacement_dict = {'ebike_trips':{np.nan:0}, 'snow_on_ground':{np.nan:0.0}, 'max_gust':{np.nan:'<31'}}
bikeshare = bikeshare.replace(replacement_dict)
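
As a hedged aside, the same three replacements could also be written with ``DataFrame.fillna`` and a per-column dictionary. The sketch below uses a small made-up frame (``toy`` and its values are hypothetical, not the real data) to show that only the listed columns are filled:

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) mirroring the three columns filled above
toy = pd.DataFrame({
    'ebike_trips': [np.nan, 12.0, np.nan],
    'snow_on_ground': [np.nan, 3.0, 0.0],
    'max_gust': [np.nan, '<31', '45'],
    'rain': [np.nan, 1.2, 0.0],  # not in the fill dict, so its null is left alone
})

# One fill value per column; columns not listed are untouched
fill_values = {'ebike_trips': 0, 'snow_on_ground': 0.0, 'max_gust': '<31'}
filled = toy.fillna(value=fill_values)

print(filled.isna().sum())
```

The per-column dictionary keeps the fill explicit, so a null in a column such as ``rain`` is not silently replaced.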

In [7]:
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2677 entries, 2017-01-01 to 2024-04-30
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   num_trips       2677 non-null   int64  
 1   total_dist      2677 non-null   float64
 2   total_duration  2677 non-null   float64
 3   mean_dep_temp   2645 non-null   float64
 4   mean_ret_temp   2645 non-null   float64
 5   ebike_trips     2677 non-null   float64
 6   max_temp        2645 non-null   float64
 7   min_temp        2648 non-null   float64
 8   mean_temp       2644 non-null   float64
 9   hdd             2644 non-null   float64
 10  cdd             2644 non-null   float64
 11  rain            2655 non-null   float64
 12  snow            2661 non-null   float64
 13  total_precip    2666 non-null   float64
 14  snow_on_ground  2677 non-null   float64
 15  max_gust        2677 non-null   object 
 16  day_length      2677 non-null   object 
dtypes: float64(14), int64(1), object(2)

## Secondary cleaning steps

In [8]:
# Fill the null values of a pandas Series (in place) by linear interpolation between the nearest non-null values.
# The input must begin and end with non-null values, an assumption that holds for the columns we apply this function to.
def interpolate(series):
    for i, item in enumerate(series):
        if np.isnan(item):
            # Earlier nulls have already been filled, so the previous entry is always non-null;
            # it is the left endpoint of the line between the nearest non-null values.
            left = series.iloc[i-1]
            # Find the next non-null value to use as the right endpoint
            right_index = i+1
            for j, val in enumerate(series.iloc[right_index:]):
                if not np.isnan(val):
                    right_index += j
                    break
            right = series.iloc[right_index]

            # The endpoints sit at positions i-1 and right_index, so this is the change per step along the line
            slope = (right - left)/(right_index - i + 1)
            series.iloc[i] = left + slope
    return series

In [9]:
interpolation_columns = ['max_temp','min_temp','mean_temp','hdd','cdd']
for col_name in interpolation_columns:
    # Interpolate a copy and assign it back, rather than relying on in-place modification of a column view
    bikeshare[col_name] = interpolate(bikeshare[col_name].copy())
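
For comparison, pandas ships ``Series.interpolate``, whose default ``method='linear'`` treats the values as equally spaced, the same assumption the hand-written function makes. A minimal sketch on made-up temperatures (the ``temps`` values are invented, not taken from the data):

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with a one-day gap and a two-day gap
temps = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# method='linear' ignores the index and spaces the values evenly
filled = temps.interpolate(method='linear')

print(filled.tolist())  # [1.0, 2.0, 3.0, 5.0, 7.0, 9.0]
```

In the notebook this would amount to something like ``bikeshare[col_name] = bikeshare[col_name].interpolate()``, though whether it reproduces the hand-written function exactly on the real columns has not been checked here.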

In [10]:
bikeshare.loc[np.isnan(bikeshare.mean_dep_temp), 'mean_dep_temp'] = bikeshare.loc[np.isnan(bikeshare.mean_dep_temp), 'mean_temp']
bikeshare.loc[np.isnan(bikeshare.mean_ret_temp), 'mean_ret_temp'] = bikeshare.loc[np.isnan(bikeshare.mean_ret_temp), 'mean_temp']

In [11]:
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2677 entries, 2017-01-01 to 2024-04-30
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   num_trips       2677 non-null   int64  
 1   total_dist      2677 non-null   float64
 2   total_duration  2677 non-null   float64
 3   mean_dep_temp   2677 non-null   float64
 4   mean_ret_temp   2677 non-null   float64
 5   ebike_trips     2677 non-null   float64
 6   max_temp        2677 non-null   float64
 7   min_temp        2677 non-null   float64
 8   mean_temp       2677 non-null   float64
 9   hdd             2677 non-null   float64
 10  cdd             2677 non-null   float64
 11  rain            2655 non-null   float64
 12  snow            2661 non-null   float64
 13  total_precip    2666 non-null   float64
 14  snow_on_ground  2677 non-null   float64
 15  max_gust        2677 non-null   object 
 16  day_length      2677 non-null   object 
dtypes: float64(14), int64(1), object(2)

## Examining the null ``rain``, ``snow``, and ``total_precip`` entries

Sometimes these null entries occur together, like so:

In [12]:
bikeshare.loc[np.isnan(bikeshare.rain)]

Date,num_trips,total_dist,total_duration,mean_dep_temp,mean_ret_temp,ebike_trips,max_temp,min_temp,mean_temp,hdd,cdd,rain,snow,total_precip,snow_on_ground,max_gust,day_length
2017-02-01,914,2138128.0,887087.0,6.001094,7.230853,0.0,7.4,-4.9,1.3,16.7,0.0,,,0.0,0.0,<31,9:26:29
2017-04-23,923,2417567.0,1047365.0,14.743229,15.462622,0.0,13.45,8.6,10.45,7.55,0.0,,,,0.0,<31,14:13:10
2017-10-25,1110,2362888.33,805047.0,13.218919,14.546847,0.0,12.9,6.9,9.7,8.3,0.0,,,,0.0,<31,10:12:36
2017-11-07,1154,2486650.0,2074278.0,7.702773,8.960139,0.0,7.466667,-0.6,3.3,14.7,0.0,,,,0.0,<31,9:30:11
2017-11-10,1649,4266693.0,1807966.0,11.061856,11.838084,0.0,10.0,7.1,8.6,9.4,0.0,,,2.2,0.0,<31,9:21:05
2018-02-21,645,1197318.0,516333.0,0.844961,1.995349,0.0,-1.1,-5.6,-3.4,21.4,0.0,,2.4,1.0,1.0,<31,10:32:34
2018-07-06,3257,9444178.0,6284420.0,25.641388,25.839423,0.0,25.1,15.0,20.1,0.0,2.1,,,,0.0,<31,16:04:26
2020-12-01,1076,2711272.0,1268715.0,7.754647,8.740706,0.0,7.5,0.3,3.9,14.1,0.0,,,0.0,0.0,<31,8:29:01
2020-12-02,1107,2629697.0,1182695.0,7.734417,8.739837,0.0,7.3,-1.2,3.1,14.9,0.0,,,0.0,0.0,<31,8:27:18
2020-12-03,996,2286192.67,962935.0,8.861446,10.121486,0.0,8.1,-1.4,3.4,14.6,0.0,,,0.0,0.0,<31,8:25:39


Since all these quantities must be non-negative, we can make a first pass through the days where ``total_precip`` is zero and replace the ``rain`` and ``snow`` values with zero.

In [13]:
bikeshare.loc[bikeshare.total_precip == 0, ['rain','snow']] = 0

Next, since ``total_precip`` = ``rain`` + ``snow``, knowing any two of these values determines the third. Thus we can fill certain null values via this two-out-of-three rule.

In [14]:
bikeshare.loc[np.isnan(bikeshare.total_precip), 'total_precip'] = (bikeshare.loc[np.isnan(bikeshare.total_precip), 'rain'] +
                                                                        bikeshare.loc[np.isnan(bikeshare.total_precip), 'snow'])
bikeshare.loc[np.isnan(bikeshare.rain), 'rain'] = (bikeshare.loc[np.isnan(bikeshare.rain), 'total_precip'] -
                                                                        bikeshare.loc[np.isnan(bikeshare.rain), 'snow'])
bikeshare.loc[np.isnan(bikeshare.snow), 'snow'] = (bikeshare.loc[np.isnan(bikeshare.snow), 'total_precip'] -
                                                                        bikeshare.loc[np.isnan(bikeshare.snow), 'rain'])
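
The two-out-of-three rule can be illustrated on a toy frame. The values below are invented, and the final check is an added assumption of this sketch rather than part of the notebook: the 2018-02-21 row shown above has ``snow`` = 2.4 and ``total_precip`` = 1.0, so the identity does not always hold exactly in the recorded data, and a derived value can come out negative.

```python
import numpy as np
import pandas as pd

# Invented rows: each has exactly one of the three values missing
toy = pd.DataFrame({
    'rain':         [1.5, np.nan, 0.4, np.nan],
    'snow':         [0.0, 0.0, np.nan, 2.4],
    'total_precip': [np.nan, 2.2, 0.4, 1.0],
})

# Whole-column arithmetic yields NaN wherever an input is NaN,
# so fillna only fills rows where the other two values are known
toy['total_precip'] = toy['total_precip'].fillna(toy['rain'] + toy['snow'])
toy['rain'] = toy['rain'].fillna(toy['total_precip'] - toy['snow'])
toy['snow'] = toy['snow'].fillna(toy['total_precip'] - toy['rain'])

# Flag rows where a derived value is physically impossible (negative)
suspect = toy[(toy[['rain', 'snow', 'total_precip']] < 0).any(axis=1)]

print(toy)
print(suspect)
```

Here the last row is flagged because its derived ``rain`` is negative; how such rows should be handled (for example, by setting them to zero) is a modelling decision left open in this sketch.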

In [15]:
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2677 entries, 2017-01-01 to 2024-04-30
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   num_trips       2677 non-null   int64  
 1   total_dist      2677 non-null   float64
 2   total_duration  2677 non-null   float64
 3   mean_dep_temp   2677 non-null   float64
 4   mean_ret_temp   2677 non-null   float64
 5   ebike_trips     2677 non-null   float64
 6   max_temp        2677 non-null   float64
 7   min_temp        2677 non-null   float64
 8   mean_temp       2677 non-null   float64
 9   hdd             2677 non-null   float64
 10  cdd             2677 non-null   float64
 11  rain            2662 non-null   float64
 12  snow            2667 non-null   float64
 13  total_precip    2666 non-null   float64
 14  snow_on_ground  2677 non-null   float64
 15  max_gust        2677 non-null   object 
 16  day_length      2677 non-null   object 
dtypes: float64(14), int64(1), object(2)

Now note that all of the days where ``snow`` is null have a mean temperature above freezing (the lowest is 2.6 °C):

In [16]:
bikeshare.loc[np.isnan(bikeshare.snow)]

Date,num_trips,total_dist,total_duration,mean_dep_temp,mean_ret_temp,ebike_trips,max_temp,min_temp,mean_temp,hdd,cdd,rain,snow,total_precip,snow_on_ground,max_gust,day_length
2017-04-23,923,2417567.0,1047365.0,14.743229,15.462622,0.0,13.45,8.6,10.45,7.55,0.0,,,,0.0,<31,14:13:10
2017-10-25,1110,2362888.33,805047.0,13.218919,14.546847,0.0,12.9,6.9,9.7,8.3,0.0,,,,0.0,<31,10:12:36
2017-11-07,1154,2486650.0,2074278.0,7.702773,8.960139,0.0,7.466667,-0.6,3.3,14.7,0.0,,,,0.0,<31,9:30:11
2017-11-10,1649,4266693.0,1807966.0,11.061856,11.838084,0.0,10.0,7.1,8.6,9.4,0.0,,,2.2,0.0,<31,9:21:05
2018-07-06,3257,9444178.0,6284420.0,25.641388,25.839423,0.0,25.1,15.0,20.1,0.0,2.1,,,,0.0,<31,16:04:26
2020-12-06,537,1339338.0,536216.0,10.979516,11.98324,0.0,10.6,2.7,6.7,11.3,0.0,,,2.2,0.0,32.0,8:21:11
2020-12-07,424,885906.0,328377.0,10.325472,11.466981,0.0,10.0,8.6,9.3,8.7,0.0,,,10.8,0.0,37.0,8:19:51
2020-12-08,521,1102663.0,412559.0,10.950096,12.113244,0.0,10.4,6.4,8.4,9.6,0.0,,,14.7,0.0,45.0,8:18:38
2023-08-01,6777,19479827.33,8464764.0,24.751217,24.801977,1956.0,22.9,13.4,18.2,0.0,0.2,,,,0.0,31.0,15:08:02
2024-03-05,2564,6247720.33,2896233.0,5.285491,5.784711,974.0,5.0,0.15,2.6,15.4,0.0,,,,0.0,<31,11:17:41


Thus it makes sense to cast these remaining null ``snow`` values as zeros.

In [17]:
bikeshare.loc[np.isnan(bikeshare.snow), 'snow'] = 0

Now that ``snow`` is complete, some days with a null ``rain`` value still have a known ``total_precip`` (e.g., 2020-12-07, with 10.8 mm), so we first re-apply the two-out-of-three rule to them; casting ``rain`` to zero there would contradict the recorded precipitation. For the days where ``rain`` and ``total_precip`` are both null, we cast ``rain`` to zero and ``total_precip`` to the sum of ``rain`` and ``snow``. This may introduce some inaccuracy into our final model, but since only eleven of the 2677 observations are affected (together with the fact that ECCC often simply does not record values for events that didn't happen), it is a reasonable choice.

In [18]:
rain_null = np.isnan(bikeshare.rain)
bikeshare.loc[rain_null, 'rain'] = (bikeshare.loc[rain_null, 'total_precip'] - bikeshare.loc[rain_null, 'snow'])
bikeshare.loc[np.isnan(bikeshare.rain), 'rain'] = 0
bikeshare.loc[np.isnan(bikeshare.total_precip), 'total_precip'] = (bikeshare.loc[np.isnan(bikeshare.total_precip), 'rain'] +
                                                                        bikeshare.loc[np.isnan(bikeshare.total_precip), 'snow'])

In [19]:
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2677 entries, 2017-01-01 to 2024-04-30
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   num_trips       2677 non-null   int64  
 1   total_dist      2677 non-null   float64
 2   total_duration  2677 non-null   float64
 3   mean_dep_temp   2677 non-null   float64
 4   mean_ret_temp   2677 non-null   float64
 5   ebike_trips     2677 non-null   float64
 6   max_temp        2677 non-null   float64
 7   min_temp        2677 non-null   float64
 8   mean_temp       2677 non-null   float64
 9   hdd             2677 non-null   float64
 10  cdd             2677 non-null   float64
 11  rain            2677 non-null   float64
 12  snow            2677 non-null   float64
 13  total_precip    2677 non-null   float64
 14  snow_on_ground  2677 non-null   float64
 15  max_gust        2677 non-null   object 
 16  day_length      2677 non-null   object 
dtypes: float64(14), int64(1), object(2)

## Final preprocessing steps

First a column for mean ride temperatures:

In [20]:
bikeshare['mean_ride_temp'] = (bikeshare.mean_dep_temp + bikeshare.mean_ret_temp)/2

The entries of ``day_length`` are stored as strings of the form ``H:MM:SS``, so it is simple to convert them to minutes.

In [21]:
# Split 'H:MM:SS' into integer hours, minutes, and seconds, then express the day length in minutes
parts = bikeshare.day_length.str.split(':', expand=True).astype(int)
bikeshare.day_length = 60*parts[0] + parts[1] + parts[2]/60
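
The conversion to minutes can also be sketched with ``pd.to_timedelta``, which parses ``H:MM:SS`` strings directly. The sample strings are taken from the rows shown earlier; the variable names are illustrative only:

```python
import pandas as pd

# Three day_length strings as they appear in the tables above
sample = pd.Series(['9:26:29', '14:13:10', '8:29:01'])

# Parse to timedeltas, then express each duration in minutes
minutes = pd.to_timedelta(sample).dt.total_seconds() / 60

print(minutes.round(2).tolist())
```

For instance, ``9:26:29`` is 9·60 + 26 + 29/60 ≈ 566.48 minutes, which is a quick way to sanity-check any hand-written conversion.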

Lastly, we reorder the columns so that weather information comes first and bike information second (since we are trying to predict bike information from weather information, it makes sense for the weather features to come first).

In [22]:
reorganized_columns = ['day_length', 'min_temp', 'max_temp', 'mean_temp', 'hdd', 'cdd', 'rain',
                      'snow', 'total_precip', 'snow_on_ground', 'max_gust', 'mean_dep_temp', 'mean_ret_temp',
                       'mean_ride_temp', 'total_dist', 'total_duration', 'ebike_trips', 'num_trips']
bikeshare = bikeshare[reorganized_columns]

In [23]:
bikeshare.to_csv('bike_weather_daily.csv',index_label='Date')