# Assignment 2 specification

The purpose of this assignment is to analyse the Bike Sharing Dataset hosted on the UCI repository of datasets.

The dataset is provided with this notebook as a zip file.

There are two related datasets in the zip file: one aggregated by day, and the other aggregated by hour.

They represent the number of bikes that were shared/hired in Washington over that time period, together with the factors that are believed to predict the demand for such bikes.

The predictors include the time unit and several weather measures (temperature, humidity and wind speed). A fuller description can be found [here](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset).

You are asked to

1. Read the _hourly_ data and split into training and test data __[5 marks]__
2. For the _training data_ only, use exploratory data analysis to learn about the data and to indicate how to build a model __[15 marks]__
3. Using a forward selection approach, build a regression model that offers the best performance, using a machine learning measure (i.e., prediction accuracy on the test data) __[30 marks]__
   - You need to pay particular attention to the regression model assumptions
   - For best performance, you will also need to perform feature engineering
     - modifying the existing features
     - transforming them
     - merging them
     - keeping feature correlation as low as possible
   - 10-fold cross-validation should be used to estimate the uncertainty in the fitted model parameters.
4. Identify the 3 target columns. Which of these target columns is easiest to predict accurately? __[5 marks]__
5. Using this "preferred target", derive a new target whose values are the grouped label (taking the values `Q1`, `Q2`, `Q3`, `Q4`) for demand in the quartiles (0 < demand <= 25th percentile of demand), (25th percentile of demand < demand <= 50th percentile of demand), and so on. You might find the [pandas quantile calculator](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html) convenient when computing the quartile end points (25th, 50th and 75th percentiles), and pandas filtering by rows convenient for assigning the new labels. __[5 marks]__
6. Use _two_ classification procedures to predict these demand quartiles, repeating the forward selection procedure to find the best model for each, but this time focusing on classification accuracy on the test set as the measure of performance. Are the same features used in each of the two models? __[35 marks]__
7. Which of the two machine learning procedures (regression and classification) provides the higher classification accuracy on the test set? Why is this? __[5 marks]__

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# Task 1: Read the _hourly_ data and split into training and test data.

##Start of Answer 1##

In [35]:
# read in the hourly data from CSV
hourDF = pd.read_csv('data/hour.csv')
# rename columns to be more human legible
hourDF = hourDF.rename(columns={'instant': 'index', 'dteday': 'date', 'yr': 'year', 'mnth': 'month', 'hr': 'hour', 'weathersit': 'weather', 'temp': 'temperature', 'atemp': 'temperature-feels-like', 'hum': 'humidity','windspeed': 'wind_speed' ,'casual': 'casual_users', 'registered': 'registered_users', 'cnt': 'total_users'})
# convert the discrete-valued columns to the pandas 'category' dtype
categoricalCols = ['season', 'holiday', 'year', 'weekday', 'workingday', 'weather', 'month', 'hour']
hourDF[categoricalCols] = hourDF[categoricalCols].astype('category')
# cache the formatted frame; pickle (unlike CSV) preserves the category dtypes
hourDF.to_pickle('data/hourFormatted.pk1')
hourDF = pd.read_pickle('data/hourFormatted.pk1')

# Split the DataFrame into training, validation and test data.
trainVal, test = train_test_split(hourDF, test_size=0.2)
train, validation = train_test_split(trainVal, test_size=0.1)

In [36]:
# DataFrame.info() prints its summary itself and returns None, so call it directly
hourDF.info()
print()
test.info()
print()
trainVal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   index                   17379 non-null  int64   
 1   date                    17379 non-null  object  
 2   season                  17379 non-null  category
 3   year                    17379 non-null  category
 4   month                   17379 non-null  category
 5   hour                    17379 non-null  category
 6   holiday                 17379 non-null  category
 7   weekday                 17379 non-null  category
 8   workingday              17379 non-null  category
 9   weather                 17379 non-null  category
 10  temperature             17379 non-null  float64 
 11  temperature-feels-like  17379 non-null  float64 
 12  humidity                17379 non-null  float64 
 13  wind_speed              17379 non-null  float64 
 14  casual_users          

##End of Answer 1##

# Task 2: For the training data only, use exploratory data analysis to learn about the data and to indicate how to build a model

##Start of Answer 2##

##End of Answer 2##

# Task 3: Using a forward selection approach, build a regression model that offers the best performance
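
As a rough illustration of the forward selection loop this task asks for, the sketch below greedily adds whichever feature most improves the mean 10-fold cross-validated R² of a linear regression, and stops when no addition helps. It is a minimal sketch under assumptions: the helper `forward_select`, its `min_gain` stopping threshold, and the synthetic `X_demo`/`y_demo` data are hypothetical and are not part of the assignment or the dataset. Final performance would still need to be measured on the held-out test data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, cv=10, min_gain=1e-4):
    """Greedy forward selection scored by mean cross-validated R^2 (hypothetical helper)."""
    remaining = list(X.columns)
    selected = []
    best_score = -np.inf
    while remaining:
        # Score every candidate model that adds exactly one more feature.
        scores = {
            col: cross_val_score(
                LinearRegression(), X[selected + [col]], y, cv=cv, scoring="r2"
            ).mean()
            for col in remaining
        }
        best_col = max(scores, key=scores.get)
        # Stop once the best addition no longer improves the score enough.
        if scores[best_col] - best_score < min_gain:
            break
        selected.append(best_col)
        remaining.remove(best_col)
        best_score = scores[best_col]
    return selected, best_score


# Synthetic stand-in data (NOT the bike-sharing data): y depends on x0 and x1 only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["x0", "x1", "x2", "x3"])
y_demo = 3.0 * X_demo["x0"] - 2.0 * X_demo["x1"] + rng.normal(scale=0.1, size=300)

selected_demo, score_demo = forward_select(X_demo, y_demo)
print(selected_demo, round(score_demo, 4))
```

The spread of the per-fold scores (or of coefficients refitted on each fold) gives a rough indication of how stable the fitted model is.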

##Start of Answer 3##

##End of Answer 3##

# Task 4: Which of the 3 target columns is easiest to predict accurately?

##Start of Answer 4##

##End of Answer 4##

# Task 5: Using this "preferred target", derive a new target whose values are the grouped label.
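
One possible way to assign the quartile labels with `quantile` and row filtering is sketched below. The `demand` values are a hypothetical toy series, not taken from the dataset; in the notebook the end points would be computed from the preferred target column instead.

```python
import pandas as pd

# Hypothetical toy demand values (NOT from the bike-sharing data).
demand = pd.Series([1, 5, 10, 20, 40, 80, 160, 320], name="demand")

# Quartile end points: 25th, 50th and 75th percentiles.
q25, q50, q75 = demand.quantile([0.25, 0.50, 0.75])

# Start with everything in the top group, then overwrite from the top down
# so that each row ends up in the lowest quartile whose upper end point it satisfies.
labels = pd.Series("Q4", index=demand.index, name="demand_quartile")
labels.loc[demand <= q75] = "Q3"
labels.loc[demand <= q50] = "Q2"
labels.loc[demand <= q25] = "Q1"

print(pd.concat([demand, labels], axis=1))
```

Computing the end points on the training data only and reusing them for the test data would avoid leaking test information into the labels.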

##Start of Answer 5##

##End of Answer 5##

# Task 6: Use _two_ classification procedures to predict these demand quartiles.
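
The sketch below shows one way the comparison in this task might be organised: forward selection is run separately for two classifiers, and each is then scored on the test set. Everything here is an assumption made for illustration: the choice of logistic regression and a decision tree, the fixed count of three selected features, the 5-fold cross-validation, and the synthetic data from `make_classification` (which stands in for the quartile labels).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the Q1..Q4 labels (NOT the real data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

results = {}
for name, model in models.items():
    # Forward selection scored by cross-validated accuracy on the training data.
    selector = SequentialFeatureSelector(
        model, n_features_to_select=3, direction="forward",
        scoring="accuracy", cv=5,
    )
    selector.fit(X_train, y_train)
    chosen = selector.get_support(indices=True)
    # Refit on the chosen columns and evaluate once on the held-out test set.
    model.fit(X_train[:, chosen], y_train)
    results[name] = {
        "features": chosen.tolist(),
        "test_accuracy": model.score(X_test[:, chosen], y_test),
    }

for name, result in results.items():
    print(name, result)
```

Comparing the two `features` lists answers whether both models end up using the same features.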

##Start of Answer 6##

##End of Answer 6##

# Task 7: Does regression or classification provide the best classification accuracy on the test set? Why?
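
To compare the two approaches on the same footing, the regression predictions can be binned into the same quartile labels and scored with classification accuracy. The sketch below assumes this approach. The end points, predictions and true labels are hypothetical toy numbers, not results from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical quartile end points (25th, 50th, 75th percentiles of demand).
edges = np.array([8.75, 30.0, 100.0])
quartile_names = np.array(["Q1", "Q2", "Q3", "Q4"])

# Hypothetical regression predictions and the true quartile labels.
predicted_demand = np.array([3.0, 12.0, 95.0, 150.0, 30.0])
true_labels = np.array(["Q1", "Q2", "Q4", "Q4", "Q2"])

# right=True makes each bin include its upper end point,
# matching the "lower < demand <= upper" definition of the quartiles.
bin_index = np.digitize(predicted_demand, edges, right=True)
predicted_labels = quartile_names[bin_index]

accuracy = accuracy_score(true_labels, predicted_labels)
print(predicted_labels.tolist(), accuracy)
```
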

##Start of Answer 7##

##End of Answer 7##