<img title="GitHub Octocat" src='./img/Octocat.jpg' style='height: 60px; padding-right: 15px' alt="Octocat" align="left" height="60"> This notebook is part of a GitHub repository: https://github.com/pessini/moby-bikes
<br>MIT Licensed
<br>Author: Leandro Pessini

# Feature Engineering

## Hypothesis

Hourly trend: Demand is likely to be high when people commute to work. Early morning and late evening may show a different trend (cyclists), and demand is expected to be low between 10:00 pm and 4:00 am.

Daily trend: Users demand more bikes on weekdays than on weekends or holidays.

Rain: Demand for bikes will be lower on a rainy day than on a sunny day. Similarly, higher humidity will lower demand, and vice versa.

Temperature: In Ireland, temperature is expected to correlate positively with bike demand.

Traffic: It may be positively correlated with bike demand. Heavier traffic may push people to use a bike instead of other road transport such as cars or taxis.



### New Features
- date (yyyy-mm-dd)
- month
- hour
- workingday
- peak
- holiday
- season
- battery_start
- battery_end
- path? (multi polygon)
- rental_duration


The number of rentals in each hour will be aggregated later into a new feature, `count`.

In [None]:
rentals_data = historical_data.drop(['harvesttime','ebikestateid'], axis=1).copy()
rentals_data[["lastgpstime", "lastrentalstart"]] = rentals_data[["lastgpstime", "lastrentalstart"]].apply(pd.to_datetime)

rentals_data = rentals_data.astype({'battery': np.int16}, errors='ignore') # errors ignore to keep missing values (not throwing error)

In [None]:
rentals_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1667841 entries, 0 to 1667840
Data columns (total 13 columns):
 #   Column           Non-Null Count    Dtype         
---  ------           --------------    -----         
 0   bikeid           1667841 non-null  int64         
 1   battery          1625500 non-null  float64       
 2   bikeidentifier   1667841 non-null  int64         
 3   biketypename     1667841 non-null  object        
 4   ebikeprofileid   1667841 non-null  int64         
 5   isebike          1667841 non-null  bool          
 6   ismotor          1667841 non-null  bool          
 7   issmartlock      1667841 non-null  bool          
 8   lastgpstime      1667841 non-null  datetime64[ns]
 9   lastrentalstart  1667841 non-null  datetime64[ns]
 10  latitude         1667841 non-null  float64       
 11  longitude        1667841 non-null  float64       
 12  spikeid          1667841 non-null  int64         
dtypes: bool(3), datetime64[ns](2), float64(3), int64(4), obje

### Date and time - new features
- `rental_date`
- `rental_month`
- `rental_hour`
- `holiday`
- `workingday`
- `peak`
- `season`: Winter, Spring, Summer or Autumn
- `duration`*: duration of the rental

\* **Assumption**: Due to the lack of information and data, I assume that the duration of a rental can be approximated by the time between the rental start and the last GPS reading: $\text{duration} = \text{LastGPSTime} - \text{LastRentalStart}$

In [None]:
weather_data['dt'] = pd.to_datetime(weather_data['date'].dt.date)
weather_data['hour'] = weather_data['date'].dt.hour
weather_data['day'] = weather_data['date'].dt.day
weather_data['month'] = weather_data['date'].dt.month
weather_data['year'] = weather_data['date'].dt.year
weather_data.drop(columns='date', axis=1, inplace=True)
weather_data.rename(columns={'dt':'date'},inplace=True)

Slicing the dataset to get the sample as per weather data above.
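The bounds `start_date_hist` and `end_date_hist` are defined outside this section. A minimal sketch of how such bounds could be derived and applied, assuming they are simply the first and last dates in the weather data (the sample frames and their values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for weather_data and grouped_rentals
weather_sample = pd.DataFrame(
    {"date": pd.to_datetime(["2021-03-01", "2021-03-01", "2021-03-02"])}
)
rentals_sample = pd.DataFrame(
    {"lastrentalstart": pd.to_datetime([
        "2021-02-28 23:10", "2021-03-01 08:05",
        "2021-03-02 17:40", "2021-03-03 09:00",
    ])}
)

# Assumption: the slice should cover exactly the dates present in the weather data
start_date_hist = weather_sample["date"].min()
end_date_hist = weather_sample["date"].max()

# normalize() keeps the datetime64 dtype and sets the time to midnight
rentals_sample["date"] = rentals_sample["lastrentalstart"].dt.normalize()

# between() is inclusive on both ends by default
sliced = rentals_sample[rentals_sample["date"].between(start_date_hist, end_date_hist)]
print(sliced)
```

Only the rentals that started on 1 or 2 March remain in `sliced`.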

In [None]:
start_date_hist, end_date_hist

NameError: name 'start_date_hist' is not defined

In [None]:
grouped_rentals['date'] = pd.to_datetime(grouped_rentals['lastrentalstart'].dt.date)
grouped_rentals = grouped_rentals[(grouped_rentals['date'] >= start_date_hist) & (grouped_rentals['date'] <= end_date_hist)]

## Rental's duration


### Period of use
> "5.1 Bikes should not be used for more than 19 consecutive hours, this is the maximum period of use." [General Terms and Conditions (“GTC”)](https://app.mobymove.com/t-c.html)

In [None]:
# time of rental in minutes (lastgpstime - rental-start)
grouped_rentals['duration'] = (grouped_rentals['lastgpstime'] - grouped_rentals['lastrentalstart']) / pd.Timedelta(minutes=1)

A few GPS trackers froze and stopped sending accurate data, leaving a last GPS time that is earlier than the rental start. These records produce negative durations and would bias the rental duration.

To avoid using inaccurate information, these durations will be set to `NaN`.

In [None]:
# np.nan instead of np.NaN: the NaN alias was removed in NumPy 2.0
grouped_rentals['duration'] = np.where(grouped_rentals['duration'] < 0, np.nan, grouped_rentals['duration'])
len(grouped_rentals[ np.isnan(grouped_rentals['duration']) ])

271
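The duration cell above only discards negative values. The 19-hour maximum period of use quoted from the GTC suggests a second sanity check, which this notebook does not apply. A minimal sketch of both checks on plain minute values (`MAX_RENTAL_MINUTES` and `clean_duration` are hypothetical names):

```python
import math

# 19 hours expressed in minutes, from clause 5.1 of the GTC quoted above
MAX_RENTAL_MINUTES = 19 * 60

def clean_duration(minutes: float) -> float:
    """Return the duration unchanged if it is plausible, otherwise NaN."""
    if math.isnan(minutes) or minutes < 0 or minutes > MAX_RENTAL_MINUTES:
        return math.nan
    return minutes

durations = [12.5, -3.0, 1140.0, 1500.0]
cleaned = [clean_duration(d) for d in durations]
print(cleaned)  # [12.5, nan, 1140.0, nan]
```

Whether rentals longer than 19 hours should be dropped or only flagged is a modelling decision left open here.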

## Bank Holidays

In [None]:
bank_holidays = pd.read_json('../data/processed/irishcalendar.json')
# infer_datetime_format is deprecated since pandas 2.0 and is no longer needed
bank_holidays['date'] = pd.to_datetime(bank_holidays['date'], utc=True)
bank_holidays['dt'] = pd.to_datetime(bank_holidays['date'].dt.date)
bank_holidays = bank_holidays[bank_holidays['type'] == 'National holiday']
bank_holidays.drop(['country', 'type', 'date'], axis=1, inplace=True)
# holiday
weather_data['holiday'] = weather_data['date'].isin(bank_holidays['dt'])

In [None]:
# day of the week
weather_data['dayofweek_n'] = weather_data['date'].dt.dayofweek
weather_data['dayofweek'] = weather_data['date'].dt.day_name()

# working day (Monday=0, Sunday=6)
# from 0 to 4 or monday to friday and is not holiday
weather_data['working_day'] = weather_data['dayofweek_n'] < 5
# set working_day to False on National Bank Holidays
weather_data.loc[ weather_data['holiday'] , 'working_day'] = False
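The working-day rule can be checked on individual dates with the standard library. The sketch below uses a hypothetical `is_working_day` helper and a hand-written holiday set (17 March 2021, St. Patrick's Day, fell on a Wednesday); it is not how the notebook computes the column.

```python
from datetime import date

# Assumption: a hand-written holiday set instead of the irishcalendar.json file
holidays = {date(2021, 3, 17)}  # St. Patrick's Day

def is_working_day(day: date) -> bool:
    """Monday=0 ... Sunday=6; a working day is Mon-Fri and not a bank holiday."""
    return day.weekday() < 5 and day not in holidays

results = {
    "thursday": is_working_day(date(2021, 3, 18)),
    "holiday_wednesday": is_working_day(date(2021, 3, 17)),
    "saturday": is_working_day(date(2021, 3, 20)),
}
print(results)
```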

## Seasons

In [None]:
weather_data['date'] = pd.to_datetime(weather_data['date'].dt.date)

Y = 2000 # dummy leap year to allow input X-02-29 (leap day)
seasons = [('Winter', (datetime(Y,  1,  1),  datetime(Y,  3, 20))),
           ('Spring', (datetime(Y,  3, 21),  datetime(Y,  6, 20))),
           ('Summer', (datetime(Y,  6, 21),  datetime(Y,  9, 22))),
           ('Autumn', (datetime(Y,  9, 23),  datetime(Y, 12, 20))),
           ('Winter', (datetime(Y, 12, 21),  datetime(Y, 12, 31)))]

def get_season(date: pd.Timestamp) -> str:
    '''
        Receives a date and returns the name of the corresponding season
        ('Winter', 'Spring', 'Summer' or 'Autumn')
        Vernal equinox (about March 21): day and night of equal length, marking the start of spring
        Summer solstice (June 20 or 21): longest day of the year, marking the start of summer
        Autumnal equinox (about September 23): day and night of equal length, marking the start of autumn
        Winter solstice (December 21 or 22): shortest day of the year, marking the start of winter
    '''
    # move the date into the dummy leap year so it can be compared with the ranges in `seasons`
    date = date.replace(year=Y)
    return next(season for season, (start, end) in seasons if start <= date <= end)


weather_data['season'] = weather_data['date'].map(get_season)
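An alternative sketch of the same lookup compares `(month, day)` tuples, which avoids the dummy leap year. `season_of` is a hypothetical name, and the boundary dates are the ones listed in `seasons` above.

```python
from datetime import date

# (first day of the season as (month, day), season name), in calendar order
SEASON_STARTS = [
    ((3, 21), "Spring"),
    ((6, 21), "Summer"),
    ((9, 23), "Autumn"),
    ((12, 21), "Winter"),
]

def season_of(day: date) -> str:
    """Return the season whose start date is the latest one not after `day`."""
    season = "Winter"  # 1 January to 20 March belongs to the previous winter
    for start, name in SEASON_STARTS:
        if (day.month, day.day) >= start:
            season = name
    return season

print(season_of(date(2021, 3, 20)), season_of(date(2021, 3, 21)), season_of(date(2020, 2, 29)))
```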

## Peak Times

>https://www.independent.ie/irish-news/the-new-commuter-hour-peak-times-increase-with-record-traffic-volumes-36903431.html

In [None]:
# peak hours (06:00 - 10:59 and 15:00 - 19:59) on working days only
is_peak_hour = weather_data['hour'].between(6, 10) | weather_data['hour'].between(15, 19)
weather_data['peak'] = weather_data['working_day'] & is_peak_hour

## Times of the Day

- Morning (from 7am to noon)
- Afternoon (from noon to 6pm)
- Evening (from 6pm to 11pm)
- Night (from 11pm to 7am)

In [None]:
conditions = [
    (weather_data['hour'] < 7), # night 00:00 - 06:59 (23:00 - 23:59 falls through to the default)
    (weather_data['hour'] >= 7) & (weather_data['hour'] < 12), # morning 7:00 - 11:59
    (weather_data['hour'] >= 12) & (weather_data['hour'] < 18), # afternoon 12:00 - 17:59
    (weather_data['hour'] >= 18) & (weather_data['hour'] < 23) # evening 18:00 - 22:59
]
values = ['Night', 'Morning', 'Afternoon', 'Evening']
weather_data['timesofday'] = np.select(conditions, values,'Night')
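The same bucketing can be written as a plain function, which makes the boundaries easy to test hour by hour. `time_of_day` is a hypothetical helper; the boundaries are those used in the conditions above.

```python
def time_of_day(hour: int) -> str:
    """Map an hour (0-23) to the label used in the `timesofday` column."""
    if 7 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 23:
        return "Evening"
    return "Night"  # 23:00 - 06:59

labels = {hour: time_of_day(hour) for hour in range(24)}
print(labels[6], labels[7], labels[12], labels[22], labels[23])
```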

## Rainfall Intensity Level

| Level | Rainfall (mm) |
| :- | :-: |
| no rain        | 0            |
| drizzle        | 0.1 to 0.3   |
| light rain     | > 0.3 to 0.5 |
| moderate rain  | > 0.5 to 4   |
| heavy rain     | > 4          |

Source: https://www.metoffice.gov.uk/research/library-and-archive/publications/factsheets

PDF direct link: [Water in the atmosphere](https://www.metoffice.gov.uk/binaries/content/assets/metofficegovuk/pdf/research/library-and-archive/library/publications/factsheets/factsheet_3-water-in-the-atmosphere-v02.pdf)

### Met Éireann Weather Forecast API

(https://data.gov.ie/dataset/met-eireann-weather-forecast-api/resource/5d156b15-38b8-4de9-921b-0ffc8704c88e)

**Precipitation unit:** Rain will be output in *millimetres (mm)*.

The minvalue, value and maxvalue values are derived from statistical analysis of the forecast, and refer to the lower (20th percentile), middle (60th percentile) and higher (80th percentile) expected amount. If minvalue and maxvalue are not output, value is the basic forecast amount.

```html
<precipitation unit="mm" value="0.0" minvalue="0.0" maxvalue="0.1"/>
```

In [None]:
conditions = [
    (weather_data['rain'] == 0.0), # no rain
    (weather_data['rain'] <= 0.3), # drizzle
    (weather_data['rain'] > 0.3) & (weather_data['rain'] <= 0.5), # light rain
    (weather_data['rain'] > 0.5) & (weather_data['rain'] <= 4), # moderate rain
    (weather_data['rain'] > 4) # heavy rain
    ]
values = ['no rain', 'drizzle', 'light rain', 'moderate rain','heavy rain']
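# an explicit string default avoids a dtype error in recent NumPy versions (the implicit default is 0);
# 'unknown' is an assumed label that is only reached when `rain` is missing
weather_data['rainfall_intensity'] = np.select(conditions, values, default='unknown')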

In [None]:
weather_data['rainfall_intensity'].value_counts()

no rain          7862
drizzle           465
moderate rain     320
light rain        100
heavy rain         13
Name: rainfall_intensity, dtype: int64
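For a single reading, the same thresholds can be expressed with the standard library's `bisect` module. This is a sketch, not the notebook's method; `rain_level` is a hypothetical name and readings are assumed to be non-negative millimetre values.

```python
from bisect import bisect_left

# Inclusive upper bound (mm) of every level except the last
UPPER_BOUNDS = [0.0, 0.3, 0.5, 4.0]
LEVELS = ["no rain", "drizzle", "light rain", "moderate rain", "heavy rain"]

def rain_level(rain_mm: float) -> str:
    """Return the intensity level; bisect_left makes each upper bound inclusive."""
    return LEVELS[bisect_left(UPPER_BOUNDS, rain_mm)]

for reading in (0.0, 0.3, 0.4, 4.0, 4.1):
    print(reading, rain_level(reading))
```

Keeping the bounds in one list makes it harder for the table and the code to drift apart.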

### Wind Speed Beaufort scale

[The Irish Meteorological Service - BEAUFORT SCALE](https://www.met.ie/forecasts/marine-inland-lakes/beaufort-scale)

<img title="BEAUFORT SCALE" src='./img/Beaufort-scale.png' alt="BEAUFORT SCALE" />

Another source: https://www.metoffice.gov.uk/weather/guides/coast-and-sea/beaufort-scale

In [None]:
import math
def scale(value, factor):
    """
    Multiply value by factor, allowing for None values.
    """
    return None if value is None else value * factor

def wind_ms(kn):
    """
    Convert wind from knots to metres per second
    """
    return scale(kn, 0.514)

def wind_kn(ms):
    """
    Convert wind from metres per second to knots
    """
    return scale(ms, 3.6 / 1.852)

def wind_bft(ms):
    """
    Convert wind from metres per second to Beaufort scale
    """
    _bft_threshold = (0.3, 1.5, 3.4, 5.4, 7.9, 10.7, 13.8, 17.1, 20.7, 24.4, 28.4, 32.6)
    if ms is None:
        return None
    return next((bft for bft in range(len(_bft_threshold)) if ms < _bft_threshold[bft]), len(_bft_threshold))

In [None]:
weather_data['wind_bft'] = weather_data.apply(lambda row: wind_bft(wind_ms(row.wdsp)), axis=1)
weather_data['wind_bft'].value_counts().sort_values(ascending=False)

3    3026
2    2875
4    1872
5     529
1     314
6     117
7      25
8       2
Name: wind_bft, dtype: int64
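One caveat of `wind_bft` as written: a missing wind speed stored as `NaN` (rather than `None`) fails every `<` comparison, so it would be classified as force 12. The counts above contain no force 12, so this data does not appear to be affected. A sketch of a variant that guards against `NaN`, using `bisect_right` over the same thresholds (`wind_bft_safe` is a hypothetical name):

```python
import math
from bisect import bisect_right

# Upper limits (m/s) of Beaufort forces 0 to 11, as in wind_bft above
BFT_THRESHOLDS = (0.3, 1.5, 3.4, 5.4, 7.9, 10.7, 13.8, 17.1, 20.7, 24.4, 28.4, 32.6)

def wind_bft_safe(ms):
    """Convert metres per second to Beaufort force; None or NaN gives None."""
    if ms is None or math.isnan(ms):
        return None
    # The number of thresholds that are <= ms equals the Beaufort force
    return bisect_right(BFT_THRESHOLDS, ms)

print(wind_bft_safe(10 * 0.514))    # 10 knots, converted with the factor used above
print(wind_bft_safe(float("nan")))  # missing value stays missing
```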

### Grouped Wind Speed (Beaufort scale)

| Level | Beaufort scale |
| :- | :-: |
| Calm / Light Breeze           | 0-2     |
| Breeze                        | 3       |
| Moderate Breeze               | 4-5     |
| Strong Breeze / Near Gale     | 6-7     |
| Gale / Storm                  | 8-12    |

In [None]:
conditions = [
    (weather_data['wind_bft'] < 3), # Calm / Light Breeze
    (weather_data['wind_bft'] == 3), # Breeze
    (weather_data['wind_bft'] > 3) & (weather_data['wind_bft'] < 6), # Moderate Breeze
    (weather_data['wind_bft'] >= 6) & (weather_data['wind_bft'] < 8), # Strong Breeze / Near Gale
    (weather_data['wind_bft'] > 7) # Gale / Storm
]
values = ['Calm / Light Breeze', 'Breeze', 'Moderate Breeze', 'Strong Breeze / Near Gale','Gale / Storm']
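# an explicit string default avoids a dtype error in recent NumPy versions (the implicit default is 0);
# 'unknown' is an assumed label that is only reached when `wind_bft` is missing
weather_data['wind_speed_group'] = np.select(conditions, values, default='unknown')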

### Rounded Temperature

Capturing the relationship with temperature as a continuous variable can be hard for machine learning algorithms because the range of distinct values is too high. Temperatures of 13.4°C and 13.9°C, or even 13°C and 15°C, are practically the same when deciding whether to go cycling. The same rationale applies to humidity and wind speed.

In [None]:
def round_up(x):
    '''
    Helper function to round to the nearest integer, with halves rounded away from zero
    (e.g. 2.5 -> 3 and -2.5 -> -3)
    '''
    from math import copysign
    return int(x + copysign(0.5, x))

weather_data['temp_r'] = weather_data['temp'].apply(round_up)
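A helper like this is useful because Python's built-in `round` rounds exact halves to the nearest even integer, which may not be what is wanted for temperatures. A small comparison (the sample values are illustrative):

```python
from math import copysign

def round_half_away(x: float) -> int:
    """Same logic as round_up above: halves are rounded away from zero."""
    return int(x + copysign(0.5, x))

samples = [0.5, 2.5, 13.4, 14.5, -2.5]
builtin = [round(x) for x in samples]           # halves go to the even neighbour
helper = [round_half_away(x) for x in samples]  # halves go away from zero
print(builtin)  # [0, 2, 13, 14, -2]
print(helper)   # [1, 3, 13, 15, -3]
```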

### KBinsDiscretizer - Temperature and Humidity

[KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html) - Bin continuous data into intervals.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
# transform the temperature with KBinsDiscretizer (10 ordinal bins, edges from 1D k-means)
enc_temp = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy='kmeans')
# fit_transform takes a 2D input and returns a 2D array; ravel() flattens it into a single column
weather_data['temp_bin'] = enc_temp.fit_transform(weather_data[['temp']]).ravel()

# transform the humidity with KBinsDiscretizer (5 ordinal bins, edges from 1D k-means)
enc_rhum = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy='kmeans')
weather_data['rhum_bin'] = enc_rhum.fit_transform(weather_data[['rhum']]).ravel()

## Combining Rentals and Weather data

In [None]:
new_rentals.shape, weather_data.shape

((33119, 7), (8760, 22))

In [None]:
rentals = new_rentals.copy()
weather = weather_data.copy()
rentals['hour'] = rentals['lastrentalstart'].dt.hour

In [None]:
all_data = pd.merge(rentals, weather, on=['date', 'hour'])
all_data.to_csv('../data/processed/all_data.csv', index=False)
rentals.shape[0] - all_data.shape[0]

0
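The zero above confirms that the inner join dropped no rentals. A sketch of a more explicit check, assuming one weather row per `(date, hour)`: a left join with `indicator=True` shows which rentals found no weather match, and `validate='many_to_one'` raises an error if the weather keys are not unique. The sample frames are made up for illustration.

```python
import pandas as pd

rentals_sample = pd.DataFrame({
    "bikeid": [1, 2, 3],
    "date": pd.to_datetime(["2021-03-01", "2021-03-01", "2021-03-02"]),
    "hour": [8, 8, 17],
})
weather_sample = pd.DataFrame({
    "date": pd.to_datetime(["2021-03-01", "2021-03-02"]),
    "hour": [8, 9],
    "temp": [7.5, 9.1],
})

merged = pd.merge(
    rentals_sample, weather_sample,
    on=["date", "hour"],
    how="left",              # keep every rental, matched or not
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
    validate="many_to_one",  # many rentals per weather row, never the reverse
)
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["bikeid", "date", "hour"]])
```

Here the rental of bike 3 has no weather row for its hour, so it is the only row in `unmatched`.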

## Grouping data to reflect hourly count of rentals

In [None]:
hourly_rentals = all_data.copy()
count_hourly_rentals = hourly_rentals.groupby(['date', 'hour']).size().reset_index(name='count')
columns_to_drop = ['lastrentalstart','bikeid','coordinates','start_battery','lastgpstime','duration']
hourly_rentals = hourly_rentals.drop(columns_to_drop, axis=1)
hourly_rentals.shape, count_hourly_rentals.shape

((33119, 22), (6966, 3))

### Dataframe including only hours with *at least* 1 rental

In [None]:
hourly_rentals = hourly_rentals.drop_duplicates(subset=['date', 'hour'])
hourly_data = pd.merge(hourly_rentals, count_hourly_rentals, on=['date','hour'])
hourly_data.to_csv('../data/processed/hourly_rentals.csv', index=False)
hourly_data.shape

(6966, 23)

### Dataframe with all hours (including hours with no rentals)

In [None]:
hourly_data_withzeros = pd.merge(weather, count_hourly_rentals, on=['date','hour'], how='left')
hourly_data_withzeros['count'] = hourly_data_withzeros['count'].fillna(0).astype(int)
hourly_data_withzeros.to_csv('../data/processed/hourly_data.csv', index=False)
hourly_data_withzeros.shape

(8760, 23)
