<img title="GitHub Octocat" src='./img/Octocat.jpg' style='height: 60px; padding-right: 15px' alt="Octocat" align="left" height="60"> This notebook is part of a GitHub repository: https://github.com/pessini/moby-bikes
<br>MIT Licensed
<br>Author: Leandro Pessini

# <p style="font-size:100%; text-align:left; color:#444444;">Data Wrangling</p>

# <p style="font-size:100%; text-align:left; color:#444444;">Table of Contents:</p>
* [1. Datasets](#1)
  * [1.1 Rentals Data - Moby Bikes](#1.1)
  * [1.2 Weather Data - Met Éireann](#1.2)
* [2. Feature Engineering](#2)
  * [2.1 Target variable distribution](#2.1)
  * [2.2 Missing values](#2.2)
  * [2.3 Exploratory Analysis](#2.3)
  * [2.4 Features Importance](#2.4)

In [154]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from scipy import stats
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

<a id="1"></a>
# <p style="font-size:100%; text-align:left; color:#444444;">1- Datasets</p>

Dataset provided by [Moby Bikes](https://data.gov.ie/dataset/moby-bikes) through a public [API](https://data.smartdublin.ie/mobybikes-api). 

Dataset provided by [Met Éireann](https://www.met.ie/) through a public [API](https://data.gov.ie/organization/meteireann).


[Met Éireann Weather Forecast API](https://data.gov.ie/dataset/met-eireann-weather-forecast-api/resource/5d156b15-38b8-4de9-921b-0ffc8704c88e)

<a id="1.1"></a>
## Rentals Data - Moby Bikes

In [155]:
historical_data = pd.read_csv('../data/raw/historical_data.csv')

In [156]:
historical_data.columns = historical_data.columns.str.lower()
historical_data.head()

Unnamed: 0,harvesttime,bikeid,battery,bikeidentifier,biketypename,ebikeprofileid,ebikestateid,isebike,ismotor,issmartlock,lastgpstime,lastrentalstart,latitude,longitude,spikeid
0,2021-04-01 00:00:03,5,7.0,1,DUB-General,1,2,True,False,False,2021-03-31 23:41:40,2021-03-30 19:18:18,53.3091,-6.21643,1
1,2021-04-01 00:00:03,6,16.0,2,DUB-General,1,2,True,False,False,2021-03-31 23:55:41,2021-03-31 10:31:13,53.3657,-6.32249,2
2,2021-04-01 00:00:03,7,66.0,3,DUB-General,4,2,True,False,False,2021-03-31 23:42:04,2021-03-30 13:07:19,53.2799,-6.14497,3
3,2021-04-01 00:00:03,8,48.0,4,DUB-General,1,2,True,False,False,2021-03-31 23:52:26,2021-03-30 12:43:17,53.2891,-6.11378,4
4,2021-04-01 00:00:03,9,-6.0,5,DUB-General,1,2,True,False,False,2021-03-31 23:50:20,2021-03-29 22:37:58,53.2928,-6.13014,5


In [157]:
print(f'Total number of rows: {historical_data.shape[0]}')
print(f'Total number of columns: {historical_data.shape[1]}')

Total number of rows: 1667841
Total number of columns: 15


In [158]:
historical_data.isnull().sum()

harvesttime            0
bikeid                 0
battery            42341
bikeidentifier         0
biketypename           0
ebikeprofileid         0
ebikestateid           0
isebike                0
ismotor                0
issmartlock            0
lastgpstime            0
lastrentalstart        0
latitude               0
longitude              0
spikeid                0
dtype: int64

<a id="1.2"></a>
## Weather Data - Met Éireann

There are two important decisions to make about the weather data:

- from **which station** the **historical data will be collected**;
- the **frequency of the data**, which can be **hourly or daily**.

### Station Name: **PHOENIX PARK**

In [159]:
# Hourly data from Phoenix Park Station
phoenixpark_weather_hourly = pd.read_csv('../data/raw/hly175.csv')
phoenixpark_weather_hourly.head()

Unnamed: 0,date,ind,rain,ind.1,temp,ind.2,wetb,dewpt,vappr,rhum,msl
0,16-aug-2003 01:00,0,0.0,0,9.2,0,8.9,8.5,11.1,95,1021.9
1,16-aug-2003 02:00,0,0.0,0,9.0,0,8.7,8.5,11.1,96,1021.7
2,16-aug-2003 03:00,0,0.0,0,8.2,0,8.0,7.7,10.5,96,1021.2
3,16-aug-2003 04:00,0,0.0,0,8.4,0,8.1,7.9,10.7,97,1021.2
4,16-aug-2003 05:00,0,0.0,0,7.7,0,7.5,7.3,10.2,97,1021.1


Source: [https://data.gov.ie/dataset/phoenix-park-hourly-data](https://data.gov.ie/dataset/phoenix-park-hourly-data)

In [160]:
# Daily data from Phoenix Park Station
phoenixpark_weather_daily = pd.read_csv('../data/raw/dly175.csv')
phoenixpark_weather_daily.head()

Unnamed: 0,date,ind,maxtp,ind.1,mintp,igmin,gmin,ind.2,rain,cbl,soil
0,16-aug-2003,0,20.1,0,7.5,4,,0,0.0,1013.7,18.565
1,17-aug-2003,0,21.3,0,11.6,0,7.5,0,1.1,1007.5,18.28
2,18-aug-2003,0,20.3,0,8.5,0,4.3,0,0.0,1008.8,17.825
3,19-aug-2003,0,19.9,0,11.3,0,7.7,0,0.0,1014.3,18.138
4,20-aug-2003,0,21.5,0,10.8,0,6.9,0,0.0,1013.6,18.432


Source: [https://data.gov.ie/dataset/phoenixpark-daily-data](https://data.gov.ie/dataset/phoenixpark-daily-data)

### Station Name: **DUBLIN AIRPORT**

In [161]:
# Hourly data from Dublin Airport Station
dublin_airport_weather_hourly = pd.read_csv('../data/raw/hly532.csv')
dublin_airport_weather_hourly.head()

Unnamed: 0,date,ind,rain,ind.1,temp,ind.2,wetb,dewpt,vappr,rhum,...,ind.3,wdsp,ind.4,wddir,ww,w,sun,vis,clht,clamt
0,01-jan-1992 00:00,0,0.0,0,8.4,0,6.6,4.3,8.3,75,...,2,23,2,210,2,11,0.0,25000,999,3
1,01-jan-1992 01:00,0,0.0,0,8.6,0,6.6,4.0,8.1,73,...,2,23,2,220,2,11,0.0,25000,100,6
2,01-jan-1992 02:00,0,0.0,0,9.0,0,7.0,4.5,8.4,73,...,2,22,2,220,2,11,0.0,25000,100,6
3,01-jan-1992 03:00,0,0.0,0,9.5,0,7.4,4.8,8.6,73,...,2,22,2,220,2,11,0.0,25000,25,6
4,01-jan-1992 04:00,0,0.0,0,9.5,0,7.4,4.8,8.6,73,...,2,23,2,230,2,11,0.0,25000,25,6


In [162]:
print(f'Total number of rows: {dublin_airport_weather_hourly.shape[0]}')

Total number of rows: 264409


Source: [https://data.gov.ie/dataset/dublin-airport-hourly-data](https://data.gov.ie/dataset/dublin-airport-hourly-data)

### Phoenix Park Station vs Dublin Airport Station
Geographically, the station at Phoenix Park would be the most suitable choice. Unfortunately, it does not record wind, which in Ireland plays an important role in deciding whether or not to go cycling. For those not familiar with Irish weather: it rains a lot and there is mostly no way around it, but wind is what can keep you indoors or make you choose a different kind of transportation. Heavy rain is not that common, though.

### Hourly vs Daily data
Daily data might make more sense to the business, but because the weather in Ireland is so unpredictable (it can change completely within an hour), hourly data is the better option from a historical perspective. Note also that the Weather API provides its forecast hourly. For simplicity and better planning, the predicted results can always be aggregated by day.
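As a minimal sketch of that last point, hourly predictions can be rolled up into daily totals with `resample`. The frame `predicted` and its column `predicted_count` are hypothetical names with made-up values; no model output from this notebook is used.

```python
import pandas as pd

# Hypothetical hourly predictions for two days (values are made up for illustration)
hours = pd.date_range("2022-03-01 00:00", periods=48, freq="h")
predicted = pd.DataFrame({"date": hours, "predicted_count": [1] * 24 + [2] * 24})

# Sum the hourly predictions into one total per calendar day
daily_predicted = (
    predicted.set_index("date")["predicted_count"]
    .resample("D")
    .sum()
)
print(daily_predicted)
```

Summing suits a count target; for weather variables such as temperature a mean or a min/max pair would probably be the more natural daily aggregate.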

In [163]:
# transforming date columns in weather data to datetime
dublin_airport_weather_hourly['date'] = pd.to_datetime(dublin_airport_weather_hourly['date'])
dublin_airport_weather_hourly.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264409 entries, 0 to 264408
Data columns (total 21 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   date    264409 non-null  datetime64[ns]
 1   ind     264409 non-null  int64         
 2   rain    264409 non-null  float64       
 3   ind.1   264409 non-null  int64         
 4   temp    264409 non-null  float64       
 5   ind.2   264409 non-null  int64         
 6   wetb    264409 non-null  float64       
 7   dewpt   264409 non-null  float64       
 8   vappr   264409 non-null  object        
 9   rhum    264409 non-null  object        
 10  msl     264409 non-null  float64       
 11  ind.3   264409 non-null  int64         
 12  wdsp    264409 non-null  int64         
 13  ind.4   264409 non-null  int64         
 14  wddir   264409 non-null  object        
 15  ww      264409 non-null  int64         
 16  w       264409 non-null  int64         
 17  sun     264409 non-null  floa

In [164]:
dublin_airport_weather_hourly.tail()

Unnamed: 0,date,ind,rain,ind.1,temp,ind.2,wetb,dewpt,vappr,rhum,...,ind.3,wdsp,ind.4,wddir,ww,w,sun,vis,clht,clamt
264404,2022-02-28 20:00:00,3,0.0,0,2.2,0,1.4,0.0,6.1,86,...,2,7,2,300,2,11,0.0,30000,999,1
264405,2022-02-28 21:00:00,3,0.0,0,1.1,0,0.6,-0.3,6.0,90,...,2,5,2,290,2,11,0.0,30000,999,1
264406,2022-02-28 22:00:00,3,0.0,0,0.0,1,-0.3,-1.0,5.7,94,...,2,6,2,290,2,11,0.0,30000,999,1
264407,2022-02-28 23:00:00,3,0.0,0,0.2,1,-0.1,-0.7,5.8,94,...,2,6,2,290,2,11,0.0,30000,999,1
264408,2022-03-01 00:00:00,3,0.0,1,-0.2,1,-0.4,-0.9,5.8,96,...,2,6,2,280,2,11,0.0,30000,999,1


### Sampling

In [165]:
start_date_hist = datetime(2021, 3, 1) # first day
end_date_hist = datetime(2022, 3, 1) # last day used as historical data

In [166]:
recent_dubairport_data = dublin_airport_weather_hourly.copy()
recent_dubairport_data = recent_dubairport_data[(recent_dubairport_data.date >= start_date_hist) & (recent_dubairport_data.date <= end_date_hist)]
len(dublin_airport_weather_hourly), len(recent_dubairport_data)

(264409, 8761)

In [167]:
columns_to_drop = ['ind','ind.1','ind.2','ind.3','vappr','msl','ind.4','wddir','ww','w','sun','vis','clht','clamt','wetb','dewpt']
weather_data = recent_dubairport_data.drop(columns=columns_to_drop)
weather_data.to_csv('../data/interim/hist_weather_data.csv', index=False)

In [168]:
weather_data.head()

Unnamed: 0,date,rain,temp,rhum,wdsp
255648,2021-03-01 00:00:00,0.0,0.1,98,4
255649,2021-03-01 01:00:00,0.0,-1.1,98,3
255650,2021-03-01 02:00:00,0.0,-1.2,98,4
255651,2021-03-01 03:00:00,0.0,-0.9,100,5
255652,2021-03-01 04:00:00,0.0,0.0,100,6


In [169]:
weather_data = weather_data[:-1] # drop the last row from 01/03/2022
weather_data.tail()

Unnamed: 0,date,rain,temp,rhum,wdsp
264403,2022-02-28 19:00:00,0.0,2.5,86,6
264404,2022-02-28 20:00:00,0.0,2.2,86,7
264405,2022-02-28 21:00:00,0.0,1.1,90,5
264406,2022-02-28 22:00:00,0.0,0.0,94,6
264407,2022-02-28 23:00:00,0.0,0.2,94,6
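The date filter above is closed on both ends, which is why the single `2022-03-01 00:00` row had to be dropped afterwards. A possible alternative, sketched here on a toy series rather than the real data, is a half-open window built with `Series.between(..., inclusive="left")` (this spelling assumes pandas 1.3 or later).

```python
from datetime import datetime

import pandas as pd

# Four hourly timestamps straddling the end of the sampling window
stamps = pd.Series(pd.date_range("2022-02-28 22:00", periods=4, freq="h"))
start, end = datetime(2021, 3, 1), datetime(2022, 3, 1)

# inclusive="left" keeps start <= t < end, so midnight on the end date is excluded
in_window = stamps.between(start, end, inclusive="left")
print(stamps[in_window])
```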


<a id="2"></a>
# <p style="font-size:100%; text-align:left; color:#444444;">2- Feature Engineering</p>

## Hypothesis

Hourly trend: Demand is likely to be high at commuting times. Early morning and late evening may follow a different trend (cyclists), and demand should be low between 10:00 pm and 4:00 am.

Daily trend: Demand is expected to be higher on weekdays than on weekends or holidays.

Rain: Demand is expected to be lower on a rainy day than on a sunny day. Similarly, higher humidity should lower demand, and vice versa.

Temperature: In Ireland, temperature is expected to correlate positively with bike demand.

Traffic: Traffic may correlate positively with bike demand, since heavier traffic can push people to cycle instead of using other road transport such as cars or taxis.



### New Features
- date (yyyy-mm-dd)
- month
- hour
- workingday
- peak
- holiday
- season
- battery_start
- battery_end
- path? (multi polygon)
- rental_duration


The number of rentals in each hour will be aggregated later into a new feature, `count`.

In [170]:
rentals_data = historical_data.drop(['harvesttime','ebikestateid'], axis=1).copy()
rentals_data[["lastgpstime", "lastrentalstart"]] = rentals_data[["lastgpstime", "lastrentalstart"]].apply(pd.to_datetime)

# NaN cannot be cast to int16; with errors='ignore' the failed cast is skipped silently,
# so battery keeps its missing values and stays float64
rentals_data = rentals_data.astype({'battery': np.int16}, errors='ignore')

In [171]:
rentals_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1667841 entries, 0 to 1667840
Data columns (total 13 columns):
 #   Column           Non-Null Count    Dtype         
---  ------           --------------    -----         
 0   bikeid           1667841 non-null  int64         
 1   battery          1625500 non-null  float64       
 2   bikeidentifier   1667841 non-null  int64         
 3   biketypename     1667841 non-null  object        
 4   ebikeprofileid   1667841 non-null  int64         
 5   isebike          1667841 non-null  bool          
 6   ismotor          1667841 non-null  bool          
 7   issmartlock      1667841 non-null  bool          
 8   lastgpstime      1667841 non-null  datetime64[ns]
 9   lastrentalstart  1667841 non-null  datetime64[ns]
 10  latitude         1667841 non-null  float64       
 11  longitude        1667841 non-null  float64       
 12  spikeid          1667841 non-null  int64         
dtypes: bool(3), datetime64[ns](2), float64(3), int64(4), obje

<a id="2.1"></a>
### Rentals information

- `coordinates`: converting latitude and longitude to an array to store a GeoJSON object *MultiPoint* 
- `start_battery`: getting the battery status when the rental started
- `lastgpstime`: new variable that will only store the last record when grouping rentals
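The idea behind these features can be sketched on a toy frame: every rental appears as several snapshots sharing the same `lastrentalstart` and `bikeid`, and one row per rental is kept. The sketch uses named aggregation instead of the custom function in the next cell and leaves out `coordinates`; the frame `toy` and its values are made up. Note one difference: `"first"` and `"last"` skip missing values, whereas positional indexing does not.

```python
import pandas as pd

# Toy snapshots: two polls of one rental (bike 5) and one poll of another (bike 6)
toy = pd.DataFrame({
    "lastrentalstart": pd.to_datetime(
        ["2021-04-01 10:00", "2021-04-01 10:00", "2021-04-01 11:00"]),
    "bikeid": [5, 5, 6],
    "battery": [80.0, 75.0, 60.0],
    "lastgpstime": pd.to_datetime(
        ["2021-04-01 10:05", "2021-04-01 10:30", "2021-04-01 11:10"]),
})

# Sort oldest-first so "first" is the battery at rental start and "last" the latest GPS fix
toy_grouped = (
    toy.sort_values("lastgpstime")
    .groupby(["lastrentalstart", "bikeid"])
    .agg(start_battery=("battery", "first"), lastgpstime=("lastgpstime", "last"))
    .reset_index()
)
print(toy_grouped)
```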

In [172]:
def feat_eng(x):
    d = {'coordinates': x[['latitude','longitude']].values.tolist()}
    d['start_battery'] = list(x['battery'])[-1] # get the first battery status (when rental started)
    d['lastgpstime'] = list(x['lastgpstime'])[0] # get the last gpstime (previously sorted)

    return pd.Series(d, index=['coordinates', 'start_battery', 'lastgpstime'])

# also sorting data by lastgpstime
grouped_rentals = rentals_data.sort_values("lastgpstime", ascending=False).groupby(['lastrentalstart', 'bikeid']).apply(feat_eng).reset_index()

In [173]:
grouped_rentals.shape

(48166, 5)

### Date and time - new features
- `rental_date`
- `rental_month`
- `rental_hour`
- `holiday`
- `workingday`
- `peak`
- `season`: one of Winter, Spring, Summer or Autumn
- `duration`*: duration of the rental

\* **Assumption**: Due to the lack of information in the data, the duration of a rental is assumed to be the time between the start of the rental and the last GPS record received for it: $\text{duration} = \text{LastGPSTime} - \text{LastRentalStart}$
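As a worked instance of this formula, the sketch below takes the two timestamps of one record shown later in this notebook (bike 103 on 2021-04-03) and computes the duration in minutes. The variable names are chosen for illustration only.

```python
import pandas as pd

last_rental_start = pd.Timestamp("2021-04-03 12:40:00")
last_gps_time = pd.Timestamp("2021-04-03 12:55:11")

# Dividing a Timedelta by a one-minute Timedelta yields the duration in minutes
duration_minutes = (last_gps_time - last_rental_start) / pd.Timedelta(minutes=1)
print(round(duration_minutes, 6))  # 15 min 11 s -> 15.183333
```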

In [174]:
# weather_data['dt'] = pd.to_datetime(weather_data['date'].dt.date)
weather_data['hour'] = weather_data['date'].dt.hour
weather_data['day'] = weather_data['date'].dt.day
weather_data['month'] = weather_data['date'].dt.month
weather_data['year'] = weather_data['date'].dt.year

Slicing the rentals dataset to the same date window as the weather data above.

In [175]:
start_date_hist, end_date_hist

(datetime.datetime(2021, 3, 1, 0, 0), datetime.datetime(2022, 3, 1, 0, 0))

In [176]:
grouped_rentals['date'] = pd.to_datetime(grouped_rentals['lastrentalstart'].dt.date)
grouped_rentals = grouped_rentals[(grouped_rentals['date'] >= start_date_hist) & (grouped_rentals['date'] <= end_date_hist)]

## Rental's duration

In [177]:
# time of rental in minutes (lastgpstime - rental-start)
grouped_rentals['duration'] = (grouped_rentals['lastgpstime'] - grouped_rentals['lastrentalstart']) / pd.Timedelta(minutes=1)

A few GPS records froze and stopped sending accurate data, which would bias the rental durations.

To avoid using inaccurate information, the duration of these records will be set to `NaN`.
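One way to express this rule, sketched on made-up values, is `Series.where`, which keeps the entries that satisfy a condition and replaces the rest with `NaN`. It is shown as an illustration of the rule, not as the cell the notebook runs.

```python
import pandas as pd

# Made-up durations in minutes; the negative one mimics a frozen GPS record
durations = pd.Series([15.2, -3.0, 410.4])

# Entries failing the condition (duration >= 0) become NaN
cleaned = durations.where(durations >= 0)
print(cleaned)
```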

In [178]:
grouped_rentals['duration'] = np.where(grouped_rentals['duration'] < 0, np.nan, grouped_rentals['duration'])
len(grouped_rentals[ np.isnan(grouped_rentals['duration']) ])

271

## Battery

In [179]:
grouped_rentals['start_battery'] = pd.to_numeric(grouped_rentals['start_battery'])

In [180]:
grouped_rentals[grouped_rentals['start_battery'] > 100]

Unnamed: 0,lastrentalstart,bikeid,coordinates,start_battery,lastgpstime,date,duration
18593,2021-04-03 12:40:00,103,"[[53.3405, -6.2679]]",268.0,2021-04-03 12:55:11,2021-04-03,15.183333


The battery records contain a few cases worth considering: only one record has `battery > 100`, and a few are negative. To simplify the analysis, these records will be normalized to the range `0 <= x <= 100`.

Missing values (*n=571*) will not be transformed, since they may simply reflect a malfunction when transmitting the data and filling them in could mislead the analysis.
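The normalization can be sketched on a handful of values, including the `-6.0` and `268.0` readings that appear in this notebook's outputs. The series name `battery` is hypothetical; missing values pass through untouched.

```python
import numpy as np
import pandas as pd

battery = pd.Series([-6.0, 48.0, 268.0, np.nan])

# Take the absolute value of negative readings, then cap everything at 100
normalised = battery.abs().clip(upper=100)
print(normalised)
```

Treating a negative reading as its absolute value is itself an assumption about how the sensor misreports; clipping at 0 would be the other obvious choice.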

In [181]:
# normalize battery status to the range 0 <= x <= 100
grouped_rentals['start_battery'] = abs(grouped_rentals['start_battery'])
grouped_rentals.loc[grouped_rentals['start_battery'] > 100, 'start_battery'] = 100

In [182]:
grouped_rentals.isnull().sum()

lastrentalstart      0
bikeid               0
coordinates          0
start_battery      472
lastgpstime          0
date                 0
duration           271
dtype: int64

In [183]:
new_rentals = grouped_rentals.copy()
new_rentals.to_csv('../data/interim/new_features_rentals.csv', index=False)
new_rentals.shape

(33119, 7)

## Bank Holidays

In [184]:
bank_holidays = pd.read_json('../data/processed/irishcalendar.json')
bank_holidays['date'] = pd.to_datetime(bank_holidays['date'], utc=True)
bank_holidays['dt'] = pd.to_datetime(bank_holidays['date'].dt.date)

In [185]:
qry_bh = {
    'type': 'National holiday'
}

# bank_holidays = read_mongo(query=qry_bh, collection='irishcalendar')
bank_holidays = bank_holidays[bank_holidays['type'] == 'National holiday']
bank_holidays.drop(['country', 'type', 'date'], axis=1, inplace=True)

In [186]:
# holiday (strip the time of day so that every hour of a holiday matches, not only midnight)
weather_data['holiday'] = weather_data['date'].dt.normalize().isin(bank_holidays['dt'])

# day of the week
weather_data['dayofweek_n'] = weather_data['date'].dt.dayofweek
weather_data['dayofweek'] = weather_data['date'].dt.day_name()

# working day: Monday (0) to Friday (4) and not a holiday
weather_data['working_day'] = weather_data['dayofweek_n'] < 5
# set working_day to False on national bank holidays
weather_data.loc[weather_data['holiday'], 'working_day'] = False
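A point worth checking when flagging holidays on hourly data: `isin` compares full timestamps, so an hourly timestamp only equals a holiday date at midnight unless the time of day is stripped first. The toy sketch below uses St Patrick's Day 2021 as the example holiday; `stamps` and `holidays` are hypothetical names.

```python
import pandas as pd

holidays = pd.Series(pd.to_datetime(["2021-03-17"]))  # St Patrick's Day, as an example
stamps = pd.Series(pd.to_datetime([
    "2021-03-17 00:00",  # holiday, midnight
    "2021-03-17 09:00",  # holiday, morning
    "2021-03-18 09:00",  # ordinary Thursday
    "2021-03-20 09:00",  # Saturday
]))

naive_match = stamps.isin(holidays)                 # matches the midnight row only
is_holiday = stamps.dt.normalize().isin(holidays)   # matches every hour of the holiday
working_day = (stamps.dt.dayofweek < 5) & ~is_holiday
print(pd.DataFrame({"stamp": stamps, "holiday": is_holiday, "working_day": working_day}))
```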

## Seasons

In [187]:
weather_data['date'] = pd.to_datetime(weather_data['date'].dt.date)

Y = 2000 # dummy leap year to allow input X-02-29 (leap day)
seasons = [('Winter', (datetime(Y,  1,  1),  datetime(Y,  3, 20))),
           ('Spring', (datetime(Y,  3, 21),  datetime(Y,  6, 20))),
           ('Summer', (datetime(Y,  6, 21),  datetime(Y,  9, 22))),
           ('Autumn', (datetime(Y,  9, 23),  datetime(Y, 12, 20))),
           ('Winter', (datetime(Y, 12, 21),  datetime(Y, 12, 31)))]

def get_season(date: pd.Timestamp) -> str:
    '''
        Receives a date and returns the name of the corresponding season
        ('Winter', 'Spring', 'Summer' or 'Autumn')
        Vernal equinox (about March 21): day and night of equal length, marking the start of spring
        Summer solstice (June 20 or 21): longest day of the year, marking the start of summer
        Autumnal equinox (about September 23): day and night of equal length, marking the start of autumn
        Winter solstice (December 21 or 22): shortest day of the year, marking the start of winter
    '''
    date = date.replace(year=Y)
    return next(season for season, (start, end) in seasons if start <= date <= end)


weather_data['season'] = weather_data['date'].map(get_season)
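An alternative sketch, not the approach used in the cell above, compares `(month, day)` tuples and so needs no dummy leap year. It uses the same boundary dates; `SEASON_STARTS` and `season_of` are hypothetical names introduced for illustration.

```python
import pandas as pd

# (month, day) on which each season starts, in calendar order
SEASON_STARTS = [
    ((3, 21), "Spring"),
    ((6, 21), "Summer"),
    ((9, 23), "Autumn"),
    ((12, 21), "Winter"),
]

def season_of(ts: pd.Timestamp) -> str:
    """Return the season name by comparing (month, day) tuples."""
    season = "Winter"  # dates before 21 March belong to the winter that began in December
    for start, name in SEASON_STARTS:
        if (ts.month, ts.day) >= start:
            season = name
    return season

for day in ["2021-03-20", "2021-03-21", "2021-12-21", "2024-02-29"]:
    print(day, season_of(pd.Timestamp(day)))
```

Checking the days on either side of each boundary, plus a leap day, is a cheap way to confirm that either implementation assigns seasons as intended.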

## Peak Times

>https://www.independent.ie/irish-news/the-new-commuter-hour-peak-times-increase-with-record-traffic-volumes-36903431.html

In [188]:
# peak: hours 6-10 and 15-19 (both ends inclusive) on working days
weather_data['peak'] = weather_data['working_day'] & (
    weather_data['hour'].between(6, 10) | weather_data['hour'].between(15, 19))

## Times of the Day

- Morning (from 7am to noon)
- Afternoon (from noon to 6pm)
- Evening (from 6pm to 11pm)
- Night (from 11pm to 7am)

In [189]:
conditions = [
    (weather_data['hour'] < 7), # night
    (weather_data['hour'] >= 7) & (weather_data['hour'] < 12), # morning
    (weather_data['hour'] >= 12) & (weather_data['hour'] < 18), # afternoon
    (weather_data['hour'] >= 18) & (weather_data['hour'] < 23) # evening
]
values = ['Night', 'Morning', 'Afternoon', 'Evening']
weather_data['timesofday'] = np.select(conditions, values,'Night')

## Rainfall Intensity Level

| Level | Rainfall Intensity (mm) |
| :- | :-: |
| no rain        | 0            |
| drizzle        | 0.1 to 0.3   |
| light rain     | > 0.3 to 0.5 |
| moderate rain  | > 0.5 to 4   |
| heavy rain     | > 4          |

Source: https://www.metoffice.gov.uk/research/library-and-archive/publications/factsheets

PDF direct link: [Water in the atmosphere](https://www.metoffice.gov.uk/binaries/content/assets/metofficegovuk/pdf/research/library-and-archive/library/publications/factsheets/factsheet_3-water-in-the-atmosphere-v02.pdf)

### Met Éireann Weather Forecast API

(https://data.gov.ie/dataset/met-eireann-weather-forecast-api/resource/5d156b15-38b8-4de9-921b-0ffc8704c88e)

**Precipitation unit:** Rain will be output in *millimetres (mm)*.

The minvalue, value and maxvalue values are derived from statistical analysis of the forecast, and refer to the lower (20th percentile), middle (60th percentile) and higher (80th percentile) expected amount. If minvalue and maxvalue are not output, value is the basic forecast amount.

```html
<precipitation unit="mm" value="0.0" minvalue="0.0" maxvalue="0.1"/>
```
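A minimal sketch of reading such an element with the standard library follows. It parses only the sample element shown above, not a full API response, and the fallback to `value` follows the rule described in the text.

```python
import xml.etree.ElementTree as ET

snippet = '<precipitation unit="mm" value="0.0" minvalue="0.0" maxvalue="0.1"/>'
element = ET.fromstring(snippet)

# minvalue/maxvalue are optional, so fall back to the basic forecast value
value = float(element.get("value"))
low = float(element.get("minvalue", value))
high = float(element.get("maxvalue", value))

print(element.get("unit"), low, value, high)
```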

In [190]:
conditions = [
    (weather_data['rain'] == 0.0), # no rain
    (weather_data['rain'] <= 0.3), # drizzle
    (weather_data['rain'] > 0.3) & (weather_data['rain'] <= 0.5), # light rain
    (weather_data['rain'] > 0.5) & (weather_data['rain'] <= 4), # moderate rain
    (weather_data['rain'] > 4) # heavy rain
    ]
values = ['no rain', 'drizzle', 'light rain', 'moderate rain','heavy rain']
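# a string default keeps the result dtype consistent with the labels;
# 'unknown' would only be assigned if a rain value were missing
weather_data['rain_type'] = np.select(conditions, values, default='unknown')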

In [191]:
weather_data['rain_type'].value_counts()

no rain          7863
drizzle           464
moderate rain     320
light rain        100
heavy rain         13
Name: rain_type, dtype: int64
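The same thresholds can also be expressed with `pd.cut`, sketched here on a few made-up values that sit on and around the bin edges. This is an alternative formulation, not the cell used above. With the default `right=True` each bin includes its upper edge, matching the `<=` comparisons.

```python
import pandas as pd

rain = pd.Series([0.0, 0.1, 0.3, 0.4, 0.5, 2.0, 4.0, 4.1])  # mm, made-up values

bins = [-float("inf"), 0.0, 0.3, 0.5, 4.0, float("inf")]
labels = ["no rain", "drizzle", "light rain", "moderate rain", "heavy rain"]

# Bins are (-inf, 0], (0, 0.3], (0.3, 0.5], (0.5, 4], (4, inf)
rain_type = pd.cut(rain, bins=bins, labels=labels)
print(pd.DataFrame({"rain": rain, "rain_type": rain_type}))
```

The result is an ordered categorical, which keeps the levels in intensity order when plotting or tabulating.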

## Combining Rentals and Weather data

In [192]:
new_rentals.shape, weather_data.shape

((33119, 7), (8760, 17))

In [193]:
new_rentals.head()

Unnamed: 0,lastrentalstart,bikeid,coordinates,start_battery,lastgpstime,date,duration
15047,2021-03-01 02:52:03,5,"[[53.3254, -6.25514], [53.3254, -6.25514], [53...",25.0,2021-03-01 14:04:42,2021-03-01,672.65
15048,2021-03-01 07:35:15,21,"[[53.3428, -6.23861], [53.3428, -6.23862], [53...",25.0,2021-03-01 14:25:39,2021-03-01,410.4
15049,2021-03-01 07:49:36,86,"[[53.3763, -6.27202], [53.3763, -6.27203], [53...",21.0,2021-03-02 16:27:20,2021-03-01,1957.733333
15050,2021-03-01 07:52:24,55,"[[53.3425, -6.29335], [53.3425, -6.29336], [53...",30.0,2021-03-01 13:09:29,2021-03-01,317.083333
15051,2021-03-01 08:26:50,75,"[[53.3763, -6.27199], [53.3763, -6.27199], [53...",88.0,2021-03-06 09:17:46,2021-03-01,7250.933333


In [194]:
weather_data.head()

Unnamed: 0,date,rain,temp,rhum,wdsp,hour,day,month,year,holiday,dayofweek_n,dayofweek,working_day,season,peak,timesofday,rain_type
255648,2021-03-01,0.0,0.1,98,4,0,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain
255649,2021-03-01,0.0,-1.1,98,3,1,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain
255650,2021-03-01,0.0,-1.2,98,4,2,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain
255651,2021-03-01,0.0,-0.9,100,5,3,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain
255652,2021-03-01,0.0,0.0,100,6,4,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain


In [195]:
rentals = new_rentals.copy()
weather = weather_data.copy()
rentals['hour'] = rentals['lastrentalstart'].dt.hour

In [196]:
all_data = pd.merge(rentals, weather, on=['date', 'hour'])
all_data.to_csv('../data/processed/all_data.csv', index=False)
rentals.shape[0] - all_data.shape[0]

0
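The row-count check above confirms that no rental was lost in the merge. A complementary sketch on toy frames: `validate` guards against duplicated weather rows multiplying rentals, and `indicator` lets unmatched rentals be counted directly. The frames and most values are made up.

```python
import pandas as pd

toy_rentals = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-01", "2021-03-02"]),
    "hour": [7, 7, 9],
    "bikeid": [5, 6, 5],
})
toy_weather = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "hour": [7, 9],
    "temp": [2.1, 5.7],
})

# many_to_one raises if a (date, hour) pair appears more than once in the weather frame
merged = pd.merge(toy_rentals, toy_weather, on=["date", "hour"],
                  how="left", validate="many_to_one", indicator=True)

# rentals with no matching weather row are marked "left_only"
unmatched = (merged["_merge"] == "left_only").sum()
print(len(merged), unmatched)
```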

## Grouping data to reflect hourly count of rentals

In [197]:
hourly_rentals = all_data.copy()
count_hourly_rentals = hourly_rentals.groupby(['date', 'hour']).size().reset_index(name='count')
columns_to_drop = ['lastrentalstart','bikeid','coordinates','start_battery','lastgpstime','duration']
hourly_rentals = hourly_rentals.drop(columns_to_drop, axis=1)
hourly_rentals.shape, count_hourly_rentals.shape

((33119, 16), (6966, 3))

### Dataframe with only hours with *at least* 1 rental

In [198]:
hourly_rentals = hourly_rentals.drop_duplicates(subset=['date', 'hour'])
hourly_data = pd.merge(hourly_rentals, count_hourly_rentals, on=['date','hour'])
hourly_data.to_csv('../data/processed/hourly_rentals.csv', index=False)
hourly_data.head()

Unnamed: 0,date,hour,temp,rhum,wdsp,day,month,year,holiday,dayofweek_n,dayofweek,working_day,season,peak,timesofday,rain_type,count
0,2021-03-01,2,-1.2,98,4,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain,1
1,2021-03-01,7,2.1,100,4,1,3,2021,False,0,Monday,True,Winter,True,Morning,no rain,3
2,2021-03-01,8,5.1,98,5,1,3,2021,False,0,Monday,True,Winter,True,Morning,no rain,1
3,2021-03-01,9,5.7,98,5,1,3,2021,False,0,Monday,True,Winter,True,Morning,no rain,4
4,2021-03-01,10,6.7,94,6,1,3,2021,False,0,Monday,True,Winter,True,Morning,no rain,4


### Dataframe with all hours, including those with no rentals

In [199]:
hourly_data_withzeros = pd.merge(weather, count_hourly_rentals, on=['date','hour'], how='left')
hourly_data_withzeros['count'] = hourly_data_withzeros['count'].fillna(0).astype(int)
hourly_data_withzeros.to_csv('../data/processed/hourly_data.csv', index=False)
hourly_data_withzeros.head()

Unnamed: 0,date,rain,temp,rhum,wdsp,hour,day,month,year,holiday,dayofweek_n,dayofweek,working_day,season,peak,timesofday,rain_type,count
0,2021-03-01,0.0,0.1,98,4,0,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain,0
1,2021-03-01,0.0,-1.1,98,3,1,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain,0
2,2021-03-01,0.0,-1.2,98,4,2,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain,1
3,2021-03-01,0.0,-0.9,100,5,3,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain,0
4,2021-03-01,0.0,0.0,100,6,4,1,3,2021,False,0,Monday,True,Winter,False,Night,no rain,0


<img title="GitHub Mark" src="./img/GitHub-Mark-64px.png" style="height: 32px; padding-right: 15px" alt="GitHub Mark" align="left"> [GitHub repository](https://github.com/pessini/moby-bikes) <br>Author: Leandro Pessini