# Dataset description and splitting off test data

In [1]:
import numpy as np
import pandas as pd

Our full ``bike_weather_daily.csv`` file contains daily weather and bikeshare usage data from January 1, 2017 to April 30, 2024. Here is a quick look at this data:

In [2]:
bikeshare = pd.read_csv('../preprocessing/bike_weather_daily.csv', index_col=['Date'], parse_dates=['Date'])
bikeshare.sample(5, random_state=604)

Date,day_length,min_temp,max_temp,mean_temp,temp_diff,hdd,cdd,rain,snow,total_precip,snow_on_ground,max_gust,mean_dep_temp,mean_ret_temp,mean_ride_temp,total_dist,total_duration,ebike_trips,num_trips
2017-01-07,505.133333,-4.1,1.9,-1.1,6.0,19.1,0.0,0.0,0.0,0.0,0.0,0.0,4.119658,5.247863,4.683761,754413.0,329623.0,0.0,351
2022-04-18,835.216667,5.5,11.8,8.7,6.3,9.3,0.0,19.2,0.0,19.2,0.0,38.0,9.13764,10.171348,9.654494,795692.0,388572.0,0.0,356
2018-09-18,745.2,6.9,16.2,11.6,9.3,6.4,0.0,0.0,0.0,0.0,0.0,0.0,17.745803,18.282631,18.014217,7294024.33,3136911.0,0.0,2919
2018-07-12,955.25,14.3,23.8,19.1,9.5,0.0,1.1,0.0,0.0,0.0,0.0,6.0,26.033056,26.234253,26.133654,11660889.0,5009685.0,0.0,3842
2019-08-30,813.216667,16.4,22.9,19.7,6.5,0.0,1.7,0.0,0.0,0.0,0.0,0.0,22.47762,23.067141,22.77238,10684309.66,4963873.0,0.0,3932


In [3]:
bikeshare.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2677 entries, 2017-01-01 to 2024-04-30
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   day_length      2677 non-null   float64
 1   min_temp        2677 non-null   float64
 2   max_temp        2677 non-null   float64
 3   mean_temp       2677 non-null   float64
 4   temp_diff       2677 non-null   float64
 5   hdd             2677 non-null   float64
 6   cdd             2677 non-null   float64
 7   rain            2677 non-null   float64
 8   snow            2677 non-null   float64
 9   total_precip    2677 non-null   float64
 10  snow_on_ground  2677 non-null   float64
 11  max_gust        2677 non-null   float64
 12  mean_dep_temp   2677 non-null   float64
 13  mean_ret_temp   2677 non-null   float64
 14  mean_ride_temp  2677 non-null   float64
 15  total_dist      2677 non-null   float64
 16  total_duration  2677 non-null   float64
 17  ebike_trips    

Each column of this data records the following measurement:

- ``day_length``: The number of minutes from sunrise to sunset for the given day
- ``min_temp``: The daily minimum temperature (deg C)
- ``max_temp``: The daily maximum temperature (deg C)
- ``mean_temp``: The mean temperature of the day (deg C)
- ``temp_diff``: The difference between the daily minimum and maximum temperatures (deg C)
- ``hdd``: The "Heating Degree Days" of the day, the number of degrees C that the daily average temperature is below 18 degrees C (0 if the daily average temperature is above 18 degrees C)
- ``cdd``: The "Cooling Degree Days" of the day, the number of degrees C that the daily average temperature is above 18 degrees C (0 if the daily average temperature is below 18 degrees C)
- ``rain``: The amount of rain that occurred that day (mm)
- ``snow``: The amount of snow that occurred that day (cm)
- ``total_precip``: The total amount of rain and snow that occurred that day (mm). Note: the standard snow-to-rain conversion is a factor of ten, so 1 cm of snow is equivalent to 1 mm of rain; this column is therefore numerically the sum of ``rain`` and ``snow``
- ``snow_on_ground``: The amount of snow observed on the ground during the day (cm)
- ``max_gust``: The difference between the speed of the observed maximum gust of wind (in km/hr) and 31 km/hr (0 if the maximum observed gust of wind is at or below 31 km/hr)
- ``mean_dep_temp``: The mean departure temperature of all bike rides taken that day (deg C)
- ``mean_ret_temp``: The mean return temperature of all bike rides taken that day (deg C)
- ``mean_ride_temp``: The mean of the departure and return temperatures of all bike rides taken that day (deg C); in the sampled rows above it equals the average of ``mean_dep_temp`` and ``mean_ret_temp``
- ``total_dist``: The total distance covered by all bike rides taken that day (m)
- ``total_duration``: The total time spent on bikes that day (sec)
- ``ebike_trips``: The number of bike rides taken on electric bikes that day
- ``num_trips``: The total number of bike rides taken that day, regardless of bicycle type
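
Several of the columns above are derived from others by simple formulas. As a rough sanity check, the sketch below recomputes ``temp_diff``, ``hdd``, ``cdd`` and ``total_precip`` from their definitions for three of the rows sampled earlier. The names ``sample``, ``recomputed`` and ``BASE_TEMP`` are introduced here for illustration only and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Three rows copied from the sample shown earlier (weather columns only).
sample = pd.DataFrame(
    {
        "min_temp": [-4.1, 5.5, 14.3],
        "max_temp": [1.9, 11.8, 23.8],
        "mean_temp": [-1.1, 8.7, 19.1],
        "temp_diff": [6.0, 6.3, 9.5],
        "hdd": [19.1, 9.3, 0.0],
        "cdd": [0.0, 0.0, 1.1],
        "rain": [0.0, 19.2, 0.0],
        "snow": [0.0, 0.0, 0.0],
        "total_precip": [0.0, 19.2, 0.0],
    },
    index=pd.to_datetime(["2017-01-07", "2022-04-18", "2018-07-12"]),
)

BASE_TEMP = 18.0  # reference temperature (deg C) from the hdd/cdd definitions

# Recompute each derived column from its stated definition.
recomputed = pd.DataFrame(
    {
        "temp_diff": sample["max_temp"] - sample["min_temp"],
        "hdd": (BASE_TEMP - sample["mean_temp"]).clip(lower=0),
        "cdd": (sample["mean_temp"] - BASE_TEMP).clip(lower=0),
        "total_precip": sample["rain"] + sample["snow"],
    }
)

# Compare with a small tolerance to absorb floating-point error.
matches = np.isclose(
    recomputed.to_numpy(),
    sample[recomputed.columns].to_numpy(),
    atol=1e-6,
).all(axis=0)
print(dict(zip(recomputed.columns, matches)))
```

The same comparison could be run on the full ``bikeshare`` frame, though whether every row matches there has not been checked here.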

Now we will set aside the final year of data for testing our final model. The period from May 1, 2023 to April 30, 2024 includes February 29, 2024 (2024 is a leap year), so it spans 366 days; taking the last 366 rows therefore gives exactly this period.

In [4]:
bikeshare_test = bikeshare.iloc[-366:]
bikeshare_train = bikeshare.drop(bikeshare_test.index)
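
Slicing by position only lines up with the intended dates when the index has exactly one row per calendar day. The sketch below illustrates this on a stand-in frame (``daily``, a hypothetical name with made-up values) that covers the same date range, and compares the position-based split with a date-based one using ``.loc``.

```python
import pandas as pd

# Stand-in frame with one row per calendar day over the dataset's date range.
dates = pd.date_range("2017-01-01", "2024-04-30", freq="D", name="Date")
daily = pd.DataFrame({"num_trips": range(len(dates))}, index=dates)

# Position-based slicing is only equivalent to a date-based split when no
# day is missing or duplicated, so check that first.
expected_days = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
assert daily.index.equals(expected_days)

test_by_position = daily.iloc[-366:]     # last 366 rows
test_by_date = daily.loc["2023-05-01":]  # everything from May 1, 2023 onward
train = daily.drop(test_by_position.index)

print(len(daily), len(train), len(test_by_position))
print(test_by_position.index.min().date(), test_by_position.index.max().date())
```

On this stand-in index the two test sets are identical: 2677 days in total, 2311 for training and 366 for testing. The total matches the 2677 entries reported by ``bikeshare.info()``, which suggests the real data has no missing days either.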

In [5]:
bikeshare_test.to_csv('bikeshare_test_data.csv', index_label='Date')
bikeshare_train.to_csv('bikeshare_train_data.csv', index_label='Date')
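
When the saved files are loaded again later, the ``Date`` column has to be parsed back into a ``DatetimeIndex``, as was done for the original file. The following is a minimal round-trip sketch on a small made-up frame; the file name ``example_split.csv`` and the values are hypothetical and are written to a temporary directory only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
dates = pd.date_range("2023-05-01", periods=3, freq="D", name="Date")
frame = pd.DataFrame(
    {"mean_temp": [10.5, 12.0, 9.5], "num_trips": [100, 250, 175]},
    index=dates,
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_split.csv"  # hypothetical file name
    # Write with the same index label used for the train/test files.
    frame.to_csv(path, index_label="Date")
    # Read back, restoring the dates as the index.
    reloaded = pd.read_csv(path, index_col="Date", parse_dates=["Date"])

print(type(reloaded.index).__name__)
print(reloaded.dtypes.to_dict())
```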