<h1 align="center">Lab 6: Decision Trees for Classification and Regression</h1>

---
## Problem A

The goal of this problem is to predict the number of occupants in a room using data from multiple non-intrusive environmental sensors that measure temperature, light, sound, CO2, and motion (passive infrared, PIR).
1. **Algorithm to be used**: Decision Trees
2. **Dataset**: `Occupancy Estimation.csv`

The columns in the `Occupancy Estimation.csv` file are described below:
<TABLE CAPTION="Room Occupancy">
<TR><TD><B>Name</B></TD><TD><B>Description</B></TD></TR>
<TR><TD>Temperature</TD><TD>In degrees Celsius</TD></TR>
<TR><TD>Light</TD><TD>In Lux</TD></TR>
<TR><TD>Sound</TD><TD>In Volts</TD></TR>
<TR><TD>CO2</TD><TD>In PPM</TD></TR>
<TR><TD>CO2 Slope</TD><TD>Slope of CO2 values taken in a sliding window</TD></TR>
<TR><TD>PIR</TD><TD>Binary value conveying motion detection</TD></TR>   
<TR><TD>Room Occupancy Count (outcome)</TD><TD>Number of occupants in the room</TD></TR>
</TABLE>

Sensor nodes S1-S4 contain temperature, light, and sound sensors; S5 has a CO2 sensor; and S6 and S7 each have one PIR sensor, deployed on the ceiling ledges at an angle that maximizes the sensor's field of view for motion detection.

#### Import Packages

In [None]:
import pandas as pd                  # Pandas
import numpy as np                   # Numpy
from matplotlib import pyplot as plt # Matplotlib

# Package to implement Decision Tree Model
import sklearn
from sklearn.tree import DecisionTreeClassifier

# Package for data partitioning
from sklearn.model_selection import train_test_split

# Package to visualize Decision Tree
from sklearn import tree

# Package for generating confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Package for generating classification report
from sklearn.metrics import classification_report

# Import packages to implement Stratified K-fold CV
from sklearn.model_selection import StratifiedKFold # For creating folds

# Import Package to implement GridSearch CV (Hyperparameter Tuning Method 1)
from sklearn.model_selection import GridSearchCV

# Importing package for Randomized Search CV (Hyperparameter Tuning Method 2)
from sklearn.model_selection import RandomizedSearchCV

# Package to record time
import time

# Suppress all warnings (including deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

#### 1. Import `Occupancy Estimation.csv` file and check if the dataset is balanced/imbalanced with respect to the target variable (Room Occupancy Count).
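
A minimal sketch of one way to check class balance, using a small made-up target column in place of the real file. The column name `Room_Occupancy_Count` is an assumption; check `df.columns` after loading the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data. In the lab this would be
# df = pd.read_csv('Occupancy Estimation.csv'); the column name is assumed.
df = pd.DataFrame({'Room_Occupancy_Count': [0, 0, 0, 0, 0, 0, 1, 2, 2, 3]})

# Share of rows belonging to each class, ordered by class label
class_share = df['Room_Occupancy_Count'].value_counts(normalize=True).sort_index()
print(class_share)
```

Shares that differ widely across classes point to an imbalanced target.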

In [None]:
# Your code here

#### 2. With Room Occupancy Count as the outcome, implement a Decision Tree model and identify the optimal values of its hyperparameters (`max_depth`, `min_samples_split`, and `min_samples_leaf`) using Grid Search CV.

**NOTE**: Consider test data size to be 40% and the number of folds to be 3. Use `f1_macro` as the scoring metric.
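
The setup in the note could be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the lab solution: the candidate values in `hyper_params`, the `stratify=y` split, and every `random_state` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the occupancy dataset (assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# 40% of the rows are held out as the test set
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=1)

# 3 stratified folds for cross-validation on the training set
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=100)

# Hypothetical candidate values: 3 * 2 * 2 = 12 combinations
hyper_params = {
    'max_depth': [2, 4, 6],
    'min_samples_split': [10, 20],
    'min_samples_leaf': [5, 10],
}

model_cv = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                        param_grid=hyper_params,
                        scoring='f1_macro',
                        cv=folds)
model_cv.fit(train_X, train_y)

print('Best hyperparameters:', model_cv.best_params_)
print('Best cross-validated macro F1:', model_cv.best_score_)
```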

In [None]:
# Your code here

#### 3. Create a visualization of the best decision tree (the one with the optimal values of hyperparameters)
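
One possible way to draw a fitted tree is `tree.plot_tree`. The sketch below fits a shallow tree on synthetic data; the feature names and the `max_depth=3` setting are placeholders. In the lab, the fitted search object's `best_estimator_` would be passed instead of `clf`.

```python
from matplotlib import pyplot as plt
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a shallow tree used only for illustration (assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]  # placeholder names
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

# Draw the tree; filled=True colors each node by its majority class
fig, ax = plt.subplots(figsize=(14, 6))
annotations = tree.plot_tree(clf,
                             feature_names=feature_names,
                             class_names=[str(c) for c in clf.classes_],
                             filled=True,
                             fontsize=8,
                             ax=ax)
```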

In [None]:
# Your code here

#### 4. Generate a confusion matrix and a classification report to evaluate the performance of the tuned model on the training set.
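
A sketch of the evaluation calls on synthetic data, with a fixed-depth tree standing in for the tuned model (both are assumptions). The same two calls apply to any split once predictions are available.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a stand-in for the tuned model (assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=1)
clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(train_X, train_y)

# Rows of the matrix are true classes, columns are predicted classes
train_pred = clf.predict(train_X)
cm = confusion_matrix(train_y, train_pred, labels=clf.classes_)
print(cm)

# Per-class precision, recall, and F1, plus macro and weighted averages
print(classification_report(train_y, train_pred))
```

`ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_).plot()` renders the same matrix as a heatmap.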

In [None]:
# Your code here

#### 5. Generate a confusion matrix and a classification report to evaluate the performance of the tuned model on the test set.

In [None]:
# Your code here

#### 6. Print the macro-averaged F1 score for both the training and test sets. Report whether the model appears to be underfitting or overfitting.
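
The comparison can be sketched with `f1_score(..., average='macro')`. The example below uses synthetic data and a deliberately unpruned tree (an assumption, chosen so the gap between the two scores is easy to see); a tuned model would normally show a smaller gap.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=1)

# Fully grown tree: it memorizes the training rows
clf = DecisionTreeClassifier(random_state=42).fit(train_X, train_y)

train_f1 = f1_score(train_y, clf.predict(train_X), average='macro')
test_f1 = f1_score(test_y, clf.predict(test_X), average='macro')

print(f'Train macro F1: {train_f1:.3f}')
print(f'Test macro F1:  {test_f1:.3f}')
print(f'Gap:            {train_f1 - test_f1:.3f}')
```

A high training score with a clearly lower test score suggests overfitting; low scores on both sets suggest underfitting.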

In [None]:
# Your code here

#### 7. Using the tuned model, generate a bar plot to show the importance of input features in occupancy estimation.

**NOTE**: Consider only the features that have importance scores of at least 5%.

What is the total percentage of the reduction in Gini impurity contributed by these features?
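
A hedged sketch of the filtering and summation, on synthetic data with placeholder feature names and a fixed-depth tree (all assumptions). scikit-learn normalizes `feature_importances_` so that they sum to 1, which lets the sum over the retained features be read as their share of the total impurity reduction.

```python
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a stand-in for the tuned model (assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]  # placeholder names
clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# Importance of each feature: its normalized share of the impurity reduction
importance = pd.Series(clf.feature_importances_, index=feature_names)

# Keep only features with an importance of at least 5%
important = importance[importance >= 0.05].sort_values()

# Horizontal bar plot of the retained features
fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(important.index, important.values)
ax.set_xlabel('Importance')
fig.tight_layout()

total_pct = 100 * important.sum()
print(f'Retained features account for {total_pct:.1f}% of the impurity reduction')
```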

In [None]:
# Your code here

#### 8. Implement Randomized Search Cross Validation to identify near-optimal values of the hyperparameters, using `n_iter = 60`.

Do the values match the ones obtained with Grid Search Cross Validation?
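
A minimal sketch of `RandomizedSearchCV` with `n_iter=60`, again on synthetic data. The candidate ranges (250 combinations in total) and the seeds are assumptions; only 60 of the combinations are sampled and evaluated.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# Hypothetical search space: 10 * 5 * 5 = 250 combinations
param_dist = {
    'max_depth': list(range(2, 12)),
    'min_samples_split': list(range(10, 60, 10)),
    'min_samples_leaf': list(range(5, 30, 5)),
}

folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=100)

# Evaluate 60 randomly sampled combinations instead of all 250
random_cv = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                               param_distributions=param_dist,
                               n_iter=60,
                               scoring='f1_macro',
                               cv=folds,
                               random_state=42)
random_cv.fit(X, y)

print('Best sampled hyperparameters:', random_cv.best_params_)
```

Because the candidates are sampled, the result depends on `random_state` and need not coincide with the grid search optimum.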

In [None]:
# Your code here

#### 9. Implement the adaptive strategy of hyperparameter tuning (combining Randomized and Grid Search Cross Validation) to find the optimal values of hyperparameters.
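
One way to read the adaptive strategy is a coarse randomized search followed by a small grid search centered on the best sampled point. The sketch below follows that reading on synthetic data; the search space, the step sizes, and the helper `around` are hypothetical choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=100)
base_tree = DecisionTreeClassifier(random_state=42)

# Stage 1: randomized search over a coarse, wide space (hypothetical values)
coarse_space = {
    'max_depth': [2, 4, 6, 8, 10],
    'min_samples_split': [10, 20, 30, 40, 50],
    'min_samples_leaf': [5, 10, 15, 20, 25],
}
coarse = RandomizedSearchCV(base_tree, coarse_space, n_iter=20,
                            scoring='f1_macro', cv=folds, random_state=42)
coarse.fit(X, y)
best = coarse.best_params_


def around(value, step, lowest):
    """Return the value and its two neighbors, clipped at the smallest legal setting."""
    return sorted({max(lowest, value - step), value, max(lowest, value + step)})


# Stage 2: exhaustive grid search in a narrow window around the stage 1 optimum
fine_space = {
    'max_depth': around(best['max_depth'], 1, 1),
    'min_samples_split': around(best['min_samples_split'], 5, 2),
    'min_samples_leaf': around(best['min_samples_leaf'], 2, 1),
}
fine = GridSearchCV(base_tree, fine_space, scoring='f1_macro', cv=folds)
fine.fit(X, y)

print('Stage 1 best:', best, coarse.best_score_)
print('Stage 2 best:', fine.best_params_, fine.best_score_)
```

Since the refined grid contains the stage 1 optimum and both stages share the same folds, the stage 2 score cannot be lower than the stage 1 score.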

In [None]:
# Your code here

#### 10. Report your observations on the number of experiments required to reach the optimal values of hyperparameters using Grid Search CV compared with the Adaptive Strategy.
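
The number of experiments can be worked out by hand: a search fits one model per candidate per fold. The arithmetic below uses a hypothetical grid and a hypothetical 3 x 3 x 3 refinement grid; substitute the sizes of the grids actually used.

```python
from math import prod

n_folds = 3

# Hypothetical full grid: 10 * 5 * 5 candidate combinations
full_grid = {
    'max_depth': range(2, 12),
    'min_samples_split': range(10, 60, 10),
    'min_samples_leaf': range(5, 30, 5),
}
grid_candidates = prod(len(values) for values in full_grid.values())
grid_fits = grid_candidates * n_folds

# Adaptive strategy: 60 random candidates, then a 3 x 3 x 3 refinement grid
n_iter = 60
fine_candidates = 3 ** 3
adaptive_fits = (n_iter + fine_candidates) * n_folds

print(f'Grid search: {grid_candidates} candidates, {grid_fits} fits')          # 250, 750
print(f'Adaptive:    {n_iter + fine_candidates} candidates, {adaptive_fits} fits')  # 87, 261
```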

Write your observation:

---
## Problem B

<img src="https://media.istockphoto.com/id/618973378/photo/bicycle-sharing-system.jpg?s=612x612&w=0&k=20&c=ms8wYi_uOo2YgghfJiXeIq073M15Dyoc7dEau9qDFOE=" width="400" style="float: center"/>

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Today, there is great interest in these systems because of their important role in traffic, environmental, and health issues.


**In this case study, the objective is to predict the daily count of users renting bikes using weather and seasonal information.**

Data to be used: *bike.csv*

The columns in the *bike.csv* file are described below:

<TABLE CAPTION="Bike Sharing Dataset">
<TR><TD><B>Variable</B></TD><TD><B>Description</B></TD></TR>
<TR><TD>season</TD><TD>season (1:winter, 2:spring, 3:summer, 4:fall)</TD></TR>
<TR><TD>yr</TD><TD>year (0: 2011, 1:2012)</TD></TR>
<TR><TD>mnth</TD><TD>month (1 to 12)</TD></TR>
<TR><TD>holiday</TD><TD>whether day is holiday or not </TD></TR>
<TR><TD>weekday</TD><TD>day of the week</TD></TR>
<TR><TD>workingday</TD><TD>1 if the day is neither a weekend nor a holiday, otherwise 0</TD></TR>
<TR><TD>weathersit</TD><TD>Weather Situation (1,2,3,4)**</TD></TR>
<TR><TD>temp</TD><TD>Normalized temperature in Celsius</TD></TR>
<TR><TD>atemp</TD><TD>Normalized feeling temperature in Celsius</TD></TR>
<TR><TD>hum</TD><TD>Normalized humidity</TD></TR>
<TR><TD>windspeed</TD><TD>Normalized wind speed</TD></TR>
<TR><TD>cnt (outcome)</TD><TD>Count of users renting bikes</TD></TR>
</TABLE>

** For Weather Situation (variable: weathersit), the possible values are:

- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

#### 1. With cnt as the outcome, implement a Decision Tree model and identify the optimal values of its hyperparameters using the adaptive strategy of hyperparameter tuning. Consider test data size to be 20%, number of folds to be 3, and scoring metric to be `r2`.

**NOTE**: To implement a Decision Tree Regressor, add the following import to your Colab notebook:

`from sklearn.tree import DecisionTreeRegressor`
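
For a regression target, the search setup changes in three places: the estimator is `DecisionTreeRegressor`, the folds are plain `KFold` (stratification does not apply to a continuous outcome), and the scoring metric is `r2`. The sketch below shows only the randomized first stage on synthetic data; the search space, `n_iter=20`, and the seeds are assumptions, and the grid refinement around the best candidate is omitted for brevity.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for bike.csv (assumption)
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)

# 20% of the rows are held out as the test set
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=1)

# 3 folds without stratification, since the target is continuous
folds = KFold(n_splits=3, shuffle=True, random_state=100)

# Hypothetical search space
param_dist = {
    'max_depth': list(range(2, 12)),
    'min_samples_split': list(range(10, 60, 10)),
    'min_samples_leaf': list(range(5, 30, 5)),
}

reg_cv = RandomizedSearchCV(estimator=DecisionTreeRegressor(random_state=42),
                            param_distributions=param_dist,
                            n_iter=20,
                            scoring='r2',
                            cv=folds,
                            random_state=42)
reg_cv.fit(train_X, train_y)

print('Best sampled hyperparameters:', reg_cv.best_params_)
print('Best cross-validated R2:', reg_cv.best_score_)
```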

In [None]:
# Your code here

#### 2. Using the best Decision Tree model that is identified in the previous step, do the following:

- Evaluate the performance of the model on the test set using R2 and RMSE values.
- Generate a bar plot to show the importance of input features with non-zero importance scores.
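
Both bullets can be sketched as follows, on synthetic data with a fixed-depth regressor standing in for the best model (assumptions, as are the placeholder feature names). RMSE is computed here as the square root of `mean_squared_error`, which works across scikit-learn versions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and a stand-in for the best model (assumption)
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=1)
reg = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5,
                            random_state=42).fit(train_X, train_y)

# Test-set metrics
pred = reg.predict(test_X)
r2 = r2_score(test_y, pred)
rmse = np.sqrt(mean_squared_error(test_y, pred))
print(f'R2: {r2:.3f}, RMSE: {rmse:.2f}')

# Features the tree actually split on (non-zero importance)
nonzero = {f'feature_{i}': imp
           for i, imp in enumerate(reg.feature_importances_) if imp > 0}
print(nonzero)
```

The `nonzero` dictionary can be passed to `plt.barh(list(nonzero.keys()), list(nonzero.values()))` to draw the bar plot.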

In [None]:
# Your code here