Hong Fei Jin<br>
INSY 695 – Enterprise Data Science & ML in Production I<br>
Fatih Nayebi<br>
Sunday, February 12, 2023

# Assignment 2

## Preparing the Data (See Assignment 1)

In [44]:
# Import the data
# Note: 'mbcs' is the Windows ANSI code page codec, so this call only works on Windows
import pandas as pd
df = pd.read_csv('SeoulBikeData.csv', encoding = 'mbcs')

In [45]:
# Get the weekday (Monday = 0, Sunday = 6) from the 'Date' column
import numpy as np

# Parse the day/month/year strings and extract the weekday in one vectorized step.
# This avoids the chained assignment df['Weekday'][i] = ..., which raises a SettingWithCopyWarning.
df['Weekday'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y').dt.weekday


In [46]:
# Split 'Date' into 'Day', 'Month', and 'Year'
df[['Day', 'Month', 'Year']] = df['Date'].str.split('/', expand = True)
df[['Day', 'Month', 'Year', 'Weekday']] = df[['Day', 'Month', 'Year', 'Weekday']].astype('int')
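
As a quick sanity check on the weekday encoding, the sketch below rebuilds the dates of the five rows that `bike.head()` prints later in this notebook and confirms that pandas and the standard `datetime` module agree on the numbering (Monday = 0, Sunday = 6). The `sample_dates` series is an assumption: it is reconstructed from the `Day`, `Month`, and `Year` values in that output, not read from the CSV.

```python
from datetime import datetime

import pandas as pd

# Dates rebuilt from the Day/Month/Year values of the five sample rows (assumed, not read from the CSV)
sample_dates = pd.Series(['05/12/2017', '19/11/2018', '30/05/2018', '14/06/2018', '26/09/2018'])

# Vectorized route: parse the whole column, then take the weekday accessor
vectorized = pd.to_datetime(sample_dates, format='%d/%m/%Y').dt.weekday.tolist()

# Row-by-row route with the standard library, for comparison
looped = [datetime.strptime(d, '%d/%m/%Y').date().weekday() for d in sample_dates]

print(vectorized)  # [1, 0, 2, 3, 2], matching the Weekday column of those rows
print(vectorized == looped)
```

Both routes give the same values, so either can be used to fill the `Weekday` column.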

In [47]:
# Reorder columns for the sake of better visibility
cols = df.columns.tolist()
cols = cols[0:1] + cols[-4:] + cols[1:-4]
df = df[cols]

In [48]:
# Keep only the rows at or below each weather threshold, then check the remaining shape
df = df[(df['Rainfall(mm)'] <= 2.5)]
df = df[(df['Snowfall (cm)'] <= 2.2)]
df = df[(df['Solar Radiation (MJ/m2)'] <= 2.33)]
df = df[(df['Wind speed (m/s)'] <= 4.4)]
df.shape

(7793, 18)

In [49]:
# Hold out 20% of the rows as a test set (fixed seed for reproducibility)
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size = 0.2, random_state = 119)
print('Training set:', len(train_set))
print('Testing set:', len(test_set))

Training set: 6234
Testing set: 1559
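
The two set sizes follow from how scikit-learn handles a fractional `test_size`: the test share is rounded up and the training set takes the remainder. The sketch below reproduces the arithmetic on a stand-in array of 7793 row numbers (the row count reported by `df.shape` above); `rows` is a hypothetical placeholder for the real DataFrame.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 7793             # rows left after filtering, taken from df.shape
rows = np.arange(n_rows)  # hypothetical stand-in for the DataFrame

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=119)

# 0.2 * 7793 = 1558.6, which is rounded up for the test set
expected_test = math.ceil(0.2 * n_rows)
expected_train = n_rows - expected_test

print(len(train_rows), len(test_rows))  # 6234 1559
```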


In [50]:
# Separate the predictors from the target ('Rented Bike Count')
bike = train_set.drop(['Rented Bike Count', 'Date', 'Dew point temperature(°C)'], axis = 1)
bike_labels = train_set['Rented Bike Count'].copy()

In [51]:
bike.shape

(6234, 15)

In [52]:
bike.head()

| | Weekday | Day | Month | Year | Hour | Temperature(°C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) | Seasons | Holiday | Functioning Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 1 | 5 | 12 | 2017 | 4 | -7.2 | 34 | 3.0 | 2000 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 8486 | 0 | 19 | 11 | 2018 | 14 | 10.6 | 34 | 2.8 | 920 | 1.07 | 0.0 | 0.0 | Autumn | No Holiday | Yes |
| 4337 | 2 | 30 | 5 | 2018 | 17 | 24.1 | 41 | 2.2 | 1980 | 1.9 | 0.0 | 0.0 | Spring | No Holiday | Yes |
| 4700 | 3 | 14 | 6 | 2018 | 20 | 22.6 | 74 | 1.2 | 1594 | 0.01 | 0.0 | 0.0 | Summer | No Holiday | Yes |
| 7192 | 2 | 26 | 9 | 2018 | 16 | 24.4 | 35 | 1.8 | 2000 | 1.71 | 0.0 | 0.0 | Autumn | No Holiday | Yes |


In [53]:
# Split the predictors into numeric and categorical columns
bike_num = bike.select_dtypes(include = [np.number])
bike_cat = bike[['Seasons', 'Holiday', 'Functioning Day']]

In [73]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

num_pipeline = make_pipeline(SimpleImputer(strategy = 'median'), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy = 'most_frequent'), 
                             OneHotEncoder(handle_unknown = 'ignore'))

In [93]:
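# Impute and scale the numeric columns, keeping the original column names
bike_num_clean = num_pipeline.fit_transform(bike_num)
bike_num_clean = pd.DataFrame(bike_num_clean, columns = list(bike_num.columns))

# One-hot encode the categorical columns; the labels are listed in the sorted order
# that OneHotEncoder assigns to the categories within each column
bike_cat_labels = ['Autumn', 'Spring', 'Summer', 'Winter', 'Holiday', 'No Holiday', 'No', 'Yes']
bike_cat_clean = cat_pipeline.fit_transform(bike_cat)
bike_cat_clean = pd.DataFrame(bike_cat_clean.toarray(), columns = bike_cat_labels)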
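
The hand-written label list depends on the encoder sorting the categories within each column. As a hedged alternative, the labels can be read back from the fitted encoder through its `categories_` attribute, so they cannot drift out of order. The sketch below uses a hypothetical three-row-plus-one frame (`toy_cat`) with the same three categorical columns; it is an illustration, not the notebook's actual data.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the categorical features
toy_cat = pd.DataFrame({
    'Seasons': ['Winter', 'Autumn', 'Spring', 'Summer'],
    'Holiday': ['No Holiday', 'No Holiday', 'Holiday', 'No Holiday'],
    'Functioning Day': ['Yes', 'Yes', 'No', 'Yes'],
})

cat_pipeline = make_pipeline(SimpleImputer(strategy='most_frequent'),
                             OneHotEncoder(handle_unknown='ignore'))
encoded = cat_pipeline.fit_transform(toy_cat)

# The last pipeline step is the fitted OneHotEncoder; categories_ holds one sorted array per column
encoder = cat_pipeline[-1]
derived_labels = [str(label) for cats in encoder.categories_ for label in cats]

toy_cat_clean = pd.DataFrame(encoded.toarray(), columns=derived_labels)
print(derived_labels)
# ['Autumn', 'Spring', 'Summer', 'Winter', 'Holiday', 'No Holiday', 'No', 'Yes']
```

On this toy frame the derived labels match the hand-written `bike_cat_labels` list exactly.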

In [96]:
# Join the numeric and one-hot columns side by side (both frames share a fresh 0..n-1 index)
bike_cleaned = pd.concat([bike_num_clean, bike_cat_clean], axis = 1, join = 'inner')
bike_cleaned

| | Weekday | Day | Month | Year | Hour | Temperature(°C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) | Autumn | Spring | Summer | Winter | Holiday | No Holiday | No | Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.006173 | -1.207988 | 1.563015 | -3.234124 | -1.011668 | -1.641680 | -1.291095 | 1.452268 | 0.919984 | -0.626638 | -0.177791 | -0.172022 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | -1.508292 | 0.380434 | 1.277657 | 0.309203 | 0.372269 | -0.136938 | -1.291095 | 1.241153 | -0.855277 | 1.058043 | -0.177791 | -0.172022 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 2 | -0.504053 | 1.628480 | -0.434491 | 0.309203 | 0.787450 | 1.004299 | -0.933634 | 0.607810 | 0.887108 | 2.364852 | -0.177791 | -0.172022 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | -0.001933 | -0.186859 | -0.149133 | 0.309203 | 1.202631 | 0.877495 | 0.751537 | -0.447763 | 0.252617 | -0.610893 | -0.177791 | -0.172022 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | -0.504053 | 1.174645 | 0.706941 | 0.309203 | 0.649056 | 1.029660 | -1.240029 | 0.185580 | 0.919984 | 2.065703 | -0.177791 | -0.172022 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6229 | -0.504053 | 1.174645 | 0.706941 | 0.309203 | 1.341024 | 0.564711 | -0.831502 | 0.396695 | 0.919984 | -0.626638 | -0.177791 | -0.172022 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 6230 | 1.002306 | -1.548364 | 1.563015 | -3.234124 | -1.288455 | -1.278175 | 1.364327 | -0.131092 | -0.302974 | -0.626638 | -0.177791 | -0.172022 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 6231 | -0.504053 | 0.607352 | 1.277657 | 0.309203 | -0.873274 | -0.458175 | 1.364327 | -1.186665 | -1.230055 | -0.626638 | -0.177791 | -0.172022 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 6232 | -0.504053 | 0.607352 | -1.005207 | 0.309203 | 0.925843 | -0.863948 | 0.904734 | -0.342206 | 0.919984 | -0.516425 | -0.177791 | -0.172022 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
