### 1. Load Libraries

In [41]:
import numpy as np
import pandas as pd

import data

%matplotlib inline


### 2. Load Dataset

In [112]:
# Extract and retrieve rentals data from Microsoft SQL server
# Refer to documentation within data module for technical and configuration details
df_rentals = data.get_rentals()

df_rentals.head()

| | date | hr | weather | temperature | feels_like_temperature | relative_humidity | windspeed | psi | guest_scooter | registered_scooter |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-07-02 | 23 | clear | 109.0 | 140.0 | 51.0 | 7.0 | 13 | 37 | 631 |
| 1 | 2011-12-22 | 10 | cloudy | 80.2 | 109.4 | 82.0 | 6.0 | 35 | 41 | 894 |
| 2 | 2011-02-25 | 11 | clear | 90.4 | 120.2 | 77.0 | 30.0 | 30 | 27 | 350 |
| 3 | 2012-03-10 | 1 | clear | 71.8 | 95.0 | 36.0 | 17.0 | 40 | 2 | 354 |
| 4 | 2011-06-19 | 5 | cloudy | 102.2 | 132.8 | 78.0 | 0.0 | 1 | 23 | 82 |


### 3. Data Insights

In [38]:
df_rentals.shape

(18643, 10)

- Dataset contains 18,643 observations with 10 features.
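- With 24 hours a day and 365 days a year, two years of hourly data should give 17,520 observations (17,544 if one of the years is a leap year, as 2012 is). The dataset has 18,643 rows, roughly 1,100 more than expected, so it may contain duplicate records.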
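
The expected row count and the presence of duplicates can be checked quickly. The following is a minimal sketch, assuming the data spans exactly the calendar years 2011 and 2012 (the years visible in the sample rows). The small `sample` frame is made up for illustration and stands in for `df_rentals`.

```python
import pandas as pd

# Assumption: the data covers 2011-01-01 00:00 through 2012-12-31 23:00
span = pd.Timestamp("2013-01-01") - pd.Timestamp("2011-01-01")
expected_hours = int(span / pd.Timedelta(hours=1))
print(expected_hours)

# Hypothetical stand-in for df_rentals with one fully duplicated row
sample = pd.DataFrame({
    "date": ["2011-01-01", "2011-01-01", "2011-01-01"],
    "hr": [0, 1, 1],
    "guest_scooter": [5, 8, 8],
})

# Count rows that repeat an earlier row exactly
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)
```

On the real data, `df_rentals.duplicated().sum()` would give the number of exact duplicate rows.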


- The problem statement is to predict the total number of active e-scooter users given the above dataset.

- Each observation records the number of guest and registered users using rental e-scooters in a particular hour of a day.

- I shall assume that the total number of active e-scooter users in a particular hour of a day is the sum of the guest and registered users <i><b>i.e. total active users = guest users + registered users.</b></i>
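
The assumption above translates into a simple column sum. This is a minimal sketch on a made-up two-row frame that reuses values from the sample output; the column name `total_users` is my own choice, not part of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_rentals, using values from the sample rows
sample = pd.DataFrame({
    "guest_scooter": [37, 41],
    "registered_scooter": [631, 894],
})

# total active users = guest users + registered users
# ('total_users' is an illustrative column name)
sample["total_users"] = sample["guest_scooter"] + sample["registered_scooter"]
print(sample)
```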

In [39]:
df_rentals.columns.values

array(['date', 'hr', 'weather', 'temperature', 'feels_like_temperature',
       'relative_humidity', 'windspeed', 'psi', 'guest_scooter',
       'registered_scooter'], dtype=object)

- Column labels of the rentals dataset

In [6]:
df_rentals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18643 entries, 0 to 18642
Data columns (total 10 columns):
date                      18643 non-null object
hr                        18643 non-null int64
weather                   18643 non-null object
temperature               18643 non-null float64
feels_like_temperature    18643 non-null float64
relative_humidity         18643 non-null float64
windspeed                 18643 non-null float64
psi                       18643 non-null int64
guest_scooter             18643 non-null int64
registered_scooter        18643 non-null int64
dtypes: float64(4), int64(4), object(2)
memory usage: 1.4+ MB


- No column contains null/missing values.

### 4. Summary Statistics

In [55]:
df_rentals.describe()

| | hr | temperature | feels_like_temperature | relative_humidity | windspeed | psi | guest_scooter | registered_scooter |
|---|---|---|---|---|---|---|---|---|
| count | 18643.0 | 18643.0 | 18643.0 | 18643.0 | 18643.0 | 18643.0 | 18643.0 | 18643.0 |
| mean | 11.537145 | 88.433037 | 117.313608 | 62.733251 | 12.741082 | 25.142198 | 106.38894 | 1074.471383 |
| std | 6.924281 | 16.2522 | 20.364081 | 19.315897 | 8.217008 | 14.442978 | 147.151664 | 1055.916934 |
| min | 0.0 | 48.1 | 60.8 | 0.0 | 0.0 | 0.0 | -2.0 | -2.0 |
| 25% | 6.0 | 75.2 | 100.4 | 48.0 | 7.0 | 13.0 | 12.0 | 240.0 |
| 50% | 12.0 | 88.7 | 118.4 | 63.0 | 13.0 | 25.0 | 50.0 | 807.0 |
| 75% | 18.0 | 102.2 | 134.6 | 78.0 | 17.0 | 38.0 | 144.0 | 1535.5 |
| max | 23.0 | 131.0 | 179.6 | 100.0 | 57.0 | 50.0 | 1099.0 | 6203.0 |


- There is a large difference between the 75th percentile and the maximum values of columns <b>windspeed</b>, <b>guest_scooter</b>, <b>registered_scooter</b>.
- This suggests that there are extreme values or outliers in these columns.
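
One common way to flag such extreme values is the 1.5 × IQR rule. The notebook does not state which outlier criterion it uses, so the following is only a sketch of that rule on made-up values; on the real data, `values` would be a column such as `df_rentals['windspeed']`.

```python
import pandas as pd

# Made-up windspeed-like values for illustration
values = pd.Series([7, 7, 13, 13, 17, 17, 57])

# Interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```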

### 5. Data Cleaning

#### 5.1 <b style="font-family:'Courier New'; font-size:18px">date</b> Column

In [113]:
# Check data type of date column
df_rentals.dtypes['date']

dtype('O')

- Convert the <b style="font-family:'Courier New'; font-size:15px">date</b> column from <b style="font-family:'Courier New'; font-size:15px">string</b> to <b style="font-family:'Courier New'; font-size:15px">date</b> data type.


- Combine the <b style="font-family:'Courier New'; font-size:15px">date</b> and <b style="font-family:'Courier New'; font-size:15px">hr</b> columns to a <b style="font-family:'Courier New'; font-size:15px">datetime</b> column.
- This is to facilitate the use of datetime/timeseries operations when doing exploration and feature engineering later.

In [131]:
# Convert date column from string to datetime data type
df_rentals['date'] = pd.to_datetime(df_rentals['date'])

# Verify column data type
df_rentals.dtypes['date']

dtype('<M8[ns]')

In [133]:
# Create datetime column by adding the hour of day (hr) to the date column
# pd.to_datetime is a no-op if date is already a datetime column
df_rentals['datetime'] = pd.to_datetime(df_rentals['date']) + pd.to_timedelta(df_rentals['hr'], unit='h')

# Verify column data type
df_rentals.dtypes['datetime']

dtype('<M8[ns]')

#### 5.2 <b style="font-family:'Courier New'; font-size:18px">weather</b> Column

In [134]:
df_rentals.weather.unique()

array(['clear', 'cloudy', 'light snow/rain', 'loudy', 'CLOUDY', 'CLEAR',
       'lear', 'LIGHT SNOW/RAIN', 'clar', 'heavy snow/rain', 'cludy',
       'liht snow/rain'], dtype=object)

- The <b>weather</b> column contains categorical data.
- The <b>weather</b> data is 'dirty' and needs to be cleaned up:
- Mixed case, e.g. clear and CLEAR.
- Misspellings, e.g. lear, clar.


- Correct values 'lear' and 'clar' to be 'clear'.
- Correct values 'cludy' and 'loudy' to be 'cloudy'.
- Correct value 'liht snow/rain' to be 'light snow/rain'.
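
The corrections listed above can also be expressed as a single lookup table. This is a minimal sketch on a made-up sample, not the step-by-step approach used in the cells that follow; the names `weather`, `corrections` and `cleaned` are illustrative.

```python
import pandas as pd

# Made-up sample containing the kinds of dirty values listed above
weather = pd.Series(["clear", "CLEAR", "lear", "clar", "loudy", "cludy",
                     "CLOUDY", "liht snow/rain", "heavy snow/rain"])

# Misspelled value -> corrected value
corrections = {
    "lear": "clear",
    "clar": "clear",
    "loudy": "cloudy",
    "cludy": "cloudy",
    "liht snow/rain": "light snow/rain",
}

# Lower-case first, then replace exact matches using the lookup table
cleaned = weather.str.lower().replace(corrections)
print(sorted(cleaned.unique()))
```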

In [136]:
# Standardize weather column to lower case characters
df_rentals.weather = df_rentals.weather.str.lower()

In [153]:
# Check for the number of weather values with incorrect spelling for 'clear'
len(df_rentals[df_rentals['weather'].isin(['lear', 'clar'])])

332

In [161]:
# Replace incorrect values 'lear' and 'clar' with 'clear'
df_rentals['weather'] = df_rentals['weather'].replace(['lear', 'clar'], 'clear')

# Verify that incorrect values 'lear' and 'clar' have been replaced
len(df_rentals[df_rentals['weather'].isin(['lear', 'clar'])])

0

In [166]:
# Check for the number of weather values with incorrect spelling for 'cloudy'
len(df_rentals[df_rentals['weather'].isin(['cludy', 'loudy'])])

74

In [167]:
# Replace incorrect values 'cludy' and 'loudy' with 'cloudy'
df_rentals['weather'] = df_rentals['weather'].replace(['cludy', 'loudy'], 'cloudy')

# Verify that incorrect values 'cludy' and 'loudy' have been replaced
len(df_rentals[df_rentals['weather'].isin(['cludy', 'loudy'])])

0

In [169]:
# Check for the number of weather values with incorrect spelling for 'light snow/rain'
len(df_rentals[df_rentals['weather']=='liht snow/rain'])

15

In [170]:
# Replace incorrect value 'liht snow/rain' with 'light snow/rain'
df_rentals['weather'] = df_rentals['weather'].replace('liht snow/rain', 'light snow/rain')

# Verify that incorrect value 'liht snow/rain' has been replaced
len(df_rentals[df_rentals['weather']=='liht snow/rain'])

0

In [171]:
df_rentals.weather.unique()

array(['clear', 'cloudy', 'light snow/rain', 'heavy snow/rain'],
      dtype=object)

- The <b>weather</b> column contains 4 unique categorical values i.e. clear, cloudy, light snow/rain and heavy snow/rain.
- One-hot encoding can be applied to the <b>weather</b> column later in feature engineering.
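
As an illustration of the one-hot encoding mentioned above, here is a minimal sketch using `pd.get_dummies` on a made-up sample with the four cleaned categories. The `weather_` prefix is my own choice, and this is only one of several ways to encode the column.

```python
import pandas as pd

# Made-up sample with the four cleaned weather categories
sample = pd.DataFrame({
    "weather": ["clear", "cloudy", "light snow/rain", "heavy snow/rain", "clear"],
})

# One indicator column per category, named weather_<category>
encoded = pd.get_dummies(sample["weather"], prefix="weather")
print(list(encoded.columns))
```

Each row of `encoded` has exactly one indicator set, matching that row's weather category.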

#### 5.3 <b style="font-family:'Courier New'; font-size:18px">temperature</b>, <b style="font-family:'Courier New'; font-size:18px">feels_like_temperature</b> Columns

In [204]:
# Get the maximum and minimum temperature recorded
max(df_rentals.temperature), min(df_rentals.temperature)

(131.0, 48.1)

In [206]:
# Get maximum and minimum feels_like_temperature recorded
max(df_rentals.feels_like_temperature), min(df_rentals.feels_like_temperature)

(179.6, 60.8)

In [203]:
# Number of observations with temperatures above 120°F
len(df_rentals[df_rentals.temperature > 120])

240

- I shall assume that values from the <b>temperature</b> and <b>feels_like_temperature</b> columns are in Fahrenheit.

- I shall assume that this dataset is gathered from a city/town since people are renting e-scooters and e-bikes.


- The maximum value of the <b>temperature</b> column is 131°F which is pretty close to the [highest temperature ever recorded](https://en.wikipedia.org/wiki/List_of_weather_records#Highest_temperatures_ever_recorded) of 134.1°F.


- According to [TripSavvy](https://www.tripsavvy.com/the-worlds-hottest-cities-4070053), some of the highest temperatures recorded in a city include Phoenix 122°F, Marrakech 120°F, Mecca 121.6°F, Kuwait City 126°F, Ahvaz 129°F and Timbuktu 120°F.


- There are 240 observations with temperatures above 120°F. This is plausible only if the dataset comes from a city known for its high temperatures. Otherwise, the temperatures in these observations need to be verified.


- 'Feels like' temperature is also known as the [heat index](https://en.wikipedia.org/wiki/Heat_index).  In short, it is a temperature reading that factors in a component of relative humidity.


- We can verify the values of the <b>feels_like_temperature</b> column using the heat index [formula](https://en.wikipedia.org/wiki/Heat_index#Formula).


- Without any geographical information on this dataset given, I shall assume that all temperature readings are accurate. 
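
For reference, the heat index check mentioned above could be sketched as follows. This uses the Rothfusz regression given on the linked heat index page, with temperature in °F and relative humidity in percent. The regression is only intended for roughly 80°F and above with relative humidity of 40% and above, so large differences outside that range do not necessarily indicate bad data. The function name `heat_index_f` is hypothetical.

```python
def heat_index_f(temp_f, rh):
    """Approximate heat index in °F (Rothfusz regression).

    temp_f: air temperature in °F
    rh: relative humidity in percent (0-100)
    """
    t, r = temp_f, rh
    return (-42.379
            + 2.04901523 * t
            + 10.14333127 * r
            - 0.22475541 * t * r
            - 6.83783e-3 * t ** 2
            - 5.481717e-2 * r ** 2
            + 1.22874e-3 * t ** 2 * r
            + 8.5282e-4 * t * r ** 2
            - 1.99e-6 * t ** 2 * r ** 2)


# Example: 90°F at 70% relative humidity gives a heat index of about 106°F
hi = heat_index_f(90.0, 70.0)
print(round(hi, 1))
```

Because the function uses only arithmetic operators, it should also accept pandas Series, e.g. `heat_index_f(df_rentals.temperature, df_rentals.relative_humidity)`, for comparison against <b>feels_like_temperature</b>.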

#### 5.4 <b style="font-family:'Courier New'; font-size:18px">relative_humidity</b> Column

In [207]:
# Get the maximum and minimum values of relative humidity recorded
max(df_rentals.relative_humidity), min(df_rentals.relative_humidity)

(100.0, 0.0)

In [213]:
# Number of observations with 0 relative humidity
len(df_rentals[df_rentals.relative_humidity==0])

25

- [Relative humidity](https://en.wikipedia.org/wiki/Relative_humidity) (RH) is the actual amount of water vapor present in relation to the capacity that the air has at a particular temperature. It is expressed as a percentage.


- A relative humidity reading of 0 implies [air devoid of water vapor](https://www.chicagotribune.com/news/ct-xpm-2011-12-16-ct-wea-1216-asktom-20111216-story.html). This is highly unlikely given the climate conditions of a city/town, where I assume this dataset is gathered. Values of 0 in the <b>relative_humidity</b> column need to be verified.


- Since there are only 25 observations with 0 relative humidity, I've decided to drop them.


- A relative humidity reading of 100 means that the air is totally saturated with water vapor and cannot hold any more, creating the possibility of rain.  So values of 100 in the <b>relative_humidity</b> column are valid.
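
Dropping the rows with 0 relative humidity could look like the following. This is a minimal sketch on a made-up frame that stands in for `df_rentals`; the names `sample` and `cleaned` are illustrative.

```python
import pandas as pd

# Made-up stand-in for df_rentals with two zero-humidity rows
sample = pd.DataFrame({
    "hr": [23, 10, 11, 1],
    "relative_humidity": [51.0, 0.0, 82.0, 0.0],
})

# Keep only rows with a non-zero relative humidity, then renumber the index
cleaned = sample[sample["relative_humidity"] > 0].reset_index(drop=True)
print(len(sample) - len(cleaned), "rows dropped")
```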

In [29]:
df_rentals[(df_rentals.psi<0) | (df_rentals.psi>400)]

| | date | hr | weather | temperature | feels_like_temperature | relative_humidity | windspeed | psi | guest_scooter | registered_scooter |
|---|---|---|---|---|---|---|---|---|---|---|


In [25]:
len(df_rentals[(df_rentals.guest_scooter<0) | (df_rentals.registered_scooter<0)])

659