# Introduction
In Part III, we will use machine learning techniques to predict 'Occupancy'. The process goes like this: 

![MachineLearningProcess](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/CommonAssets/MachineLearningProcess.png)

We put this section on all of the projects in UpLevel so bear with us if you've seen this before. 

Generally, the machine learning process has five parts:
1. <strong>Split your data into train and test sets</strong>
2. <strong>Model creation</strong>
<br>
Import your models from sklearn and instantiate them (assign the model object to a variable)
3. <strong>Model fitting</strong>
<br>
Fit the model to your training data
4. <strong>Model prediction</strong>
<br>
Make a set of predictions using your test data
5. <strong>Model assessment</strong>
<br>
Compare your predictions with the ground truth in the test data
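
The five parts above can be sketched end to end in a few lines. This is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the occupancy dataset, and the names `X_demo`, `y_demo` and `model` are placeholders rather than anything defined in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the occupancy dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# 1. Split the data into train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# 2. Model creation: instantiate the model and assign it to a variable
model = LogisticRegression()

# 3. Model fitting: learn from the training data only
model.fit(X_tr, y_tr)

# 4. Model prediction: predict labels for the unseen test features
pred = model.predict(X_te)

# 5. Model assessment: compare predictions with the ground truth
score = f1_score(y_te, pred)
print("F1-score:", score)
```

The same five calls reappear in every modelling step below; only the model class changes.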

Highly recommended readings:
1. [Important] https://scipy-lectures.org/packages/scikit-learn/index.html
2. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
3. https://scikit-learn.org/stable/tutorial/basic/tutorial.html

### Step 1: Import your libraries
We will be using models from sklearn, a popular machine learning library. Rather than importing everything from sklearn, we will import only what we need.

For now, pandas is all we need to load the data; the sklearn imports follow in Step 4.

Import the following:
- pandas as pd

In [1]:
# Step 1: Import your library
import pandas as pd

### Step 2: Read the CSV from Part II as a DataFrame
Read your CSV from the previous Part as a DataFrame. 

You should have:
- 20,560 rows
- 10 columns

In [5]:
# Step 2: Read the CSV from Part II
df = pd.read_csv("df_comb.csv", index_col = 0)
df


| | date | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy | weekday | hour | minute |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-02-02 14:19:00 | 23.7000 | 26.2720 | 585.200000 | 749.200000 | 0.004764 | 1 | 0 | 14 | 19 |
| 1 | 2015-02-02 14:19:59 | 23.7180 | 26.2900 | 578.400000 | 760.400000 | 0.004773 | 1 | 0 | 14 | 19 |
| 2 | 2015-02-02 14:21:00 | 23.7300 | 26.2300 | 572.666667 | 769.666667 | 0.004765 | 1 | 0 | 14 | 21 |
| 3 | 2015-02-02 14:22:00 | 23.7225 | 26.1250 | 493.750000 | 774.750000 | 0.004744 | 1 | 0 | 14 | 22 |
| 4 | 2015-02-02 14:23:00 | 23.7540 | 26.2000 | 488.600000 | 779.000000 | 0.004767 | 1 | 0 | 14 | 23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20555 | 2015-02-18 09:15:00 | 20.8150 | 27.7175 | 429.750000 | 1505.250000 | 0.004213 | 1 | 2 | 9 | 15 |
| 20556 | 2015-02-18 09:16:00 | 20.8650 | 27.7450 | 423.500000 | 1514.500000 | 0.004230 | 1 | 2 | 9 | 16 |
| 20557 | 2015-02-18 09:16:59 | 20.8900 | 27.7450 | 423.500000 | 1521.500000 | 0.004237 | 1 | 2 | 9 | 16 |
| 20558 | 2015-02-18 09:17:59 | 20.8900 | 28.0225 | 418.750000 | 1632.000000 | 0.004279 | 1 | 2 | 9 | 17 |
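
If your column count is off by one, the saved index is a likely cause. The sketch below reads a tiny in-memory CSV (made-up text standing in for `df_comb.csv`) to show what `index_col = 0` does: the first, unnamed column becomes the index instead of appearing as an extra `Unnamed: 0` column.

```python
import io

import pandas as pd

# A tiny stand-in for df_comb.csv (hypothetical content, first column is the saved index)
csv_text = """,date,Temperature,Occupancy
0,2015-02-02 14:19:00,23.7,1
1,2015-02-02 14:19:59,23.718,1
2,2015-02-02 14:21:00,23.73,1
"""

# Without index_col, the saved index shows up as a column named "Unnamed: 0"
df_with_extra = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the index instead
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_with_extra.columns.tolist())
print(df_demo.columns.tolist())
print(df_demo.shape)
```

Checking `df.shape` right after loading is a quick way to confirm the expected row and column counts.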


### Step 3: Prepare your independent and dependent variables
At this point, let's prepare our independent and dependent variables. 

1. Declare a variable, and assign your independent variables to it by dropping 'date' and 'Occupancy'
2. Declare another variable, and assign only the 'Occupancy' column to it

In [8]:
# Step 3: Prepare your independent and dependent variables
X = df.drop(['date', 'Occupancy'], axis = 1)
y = df['Occupancy']
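
After separating features and target, it can help to confirm that the target did not leak into the features and to look at the class balance. The following is a small sketch on a made-up four-row DataFrame (`df_demo` is a placeholder, not the project data).

```python
import pandas as pd

# Made-up rows that mimic a few of the project's columns
df_demo = pd.DataFrame({
    "date": ["2015-02-02 14:19:00", "2015-02-02 14:19:59",
             "2015-02-02 14:21:00", "2015-02-02 14:22:00"],
    "Temperature": [23.7, 23.718, 23.73, 21.0],
    "Light": [585.2, 578.4, 572.7, 0.0],
    "Occupancy": [1, 1, 1, 0],
})

# Independent variables: everything except the timestamp and the target
X_demo = df_demo.drop(["date", "Occupancy"], axis=1)

# Dependent variable: the target column only
y_demo = df_demo["Occupancy"]

# The target must not appear among the features
feature_names = X_demo.columns.tolist()
print(feature_names)

# Share of each class in the target
class_share = y_demo.value_counts(normalize=True)
print(class_share)
```

An imbalanced target is one reason this notebook stratifies the split and reports the F1 score.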


### Step 4: Import machine learning libraries
Time to import other libraries.

The resources provided at the top of this notebook will be immensely useful if you're new to modelling. 

Import the following libraries and methods:
1. train_test_split - sklearn.model_selection
2. DummyClassifier - sklearn.dummy
3. LogisticRegression - sklearn.linear_model
4. DecisionTreeClassifier - sklearn.tree
5. RandomForestClassifier - sklearn.ensemble
6. f1_score - sklearn.metrics
7. confusion_matrix - sklearn.metrics

In [14]:
# Step 4: Import the machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score 
from sklearn.metrics import confusion_matrix 
from sklearn import metrics

### Step 5: Split your dataset into train and test
Now that you have finished importing the libraries you need, split the dataset into train and test sets with an 80/20 split.

Don't forget to stratify by your dependent values with the stratify parameter.

In [11]:
# Step 5: Split your dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)
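
To see what the stratify parameter does, the sketch below splits a toy label array with roughly a 77/23 class balance and compares the share of the positive class in each half. The arrays and the `random_state` value are assumptions for illustration; `random_state` is optional and only makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 770 zeros and 230 ones
y_demo = np.array([0] * 770 + [1] * 230)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

train_share = y_tr.mean()
test_share = y_te.mean()
print("train share of class 1:", train_share)
print("test share of class 1:", test_share)
```

Without stratification the two shares can drift apart by chance, which makes scores on the test set harder to compare between runs.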

### Step 6: Train a DummyClassifier
This is what you'll need to do:
1. Start with a model
2. Declare a variable, and store your model in it (don't forget the parentheses, which instantiate the model)
3. Fit the instantiated model to your training data
4. Declare a variable that contains predictions from the model you just trained, using the test dataset (X_test)
5. Compare the predictions with the actual results (y_test) using the f1_score
6. Print a confusion_matrix of the actual y_test (rows) vs the predictions (columns)

The recommended readings will be very helpful.

Let's start with the DummyClassifier to establish a baseline. This will be useful as we train other models.

In [15]:
# Step 6a: Declare a variable to store the model
dummy = DummyClassifier()

# Step 6b: Fit your train dataset
dummy.fit(X_train, y_train)

# Step 6c: Declare a variable and store your predictions that you make with your model using X test data
dummy_pred = dummy.predict(X_test)

# Step 6d: Print the f1_score between the y test and dummy prediction
print("F1-score:", metrics.f1_score(y_test, dummy_pred))

# Step 6e: Print a confusion_matrix between y_test and your prediction
confusion_matrix(y_test, dummy_pred)

F1-score: 0.0


array([[3162,    0],
       [ 950,    0]])
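
The F1-score of 0.0 follows directly from the confusion matrix: the baseline never predicts class 1, so there are no true positives. In sklearn's layout the rows are the actual classes and the columns are the predicted classes, so a 2x2 matrix reads `[[TN, FP], [FN, TP]]`. Here is a minimal sketch of that arithmetic; the helper name `f1_from_confusion` is made up for this example.

```python
import numpy as np


def f1_from_confusion(cm):
    """F1 for the positive class (label 1) from a 2x2 sklearn-style confusion matrix."""
    # Rows are actual classes, columns are predicted classes
    tn, fp, fn, tp = np.asarray(cm).ravel()
    denom = 2 * tp + fp + fn
    # With no true positives, false positives or false negatives, report 0.0
    return 0.0 if denom == 0 else 2 * tp / denom


# Confusion matrix printed by the DummyClassifier above
dummy_cm = [[3162, 0], [950, 0]]
dummy_f1 = f1_from_confusion(dummy_cm)
print(dummy_f1)
```

Any model that is worth keeping should beat this baseline on the same test set.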

### Step 7: Train a LogisticRegression
Now that we have established the baseline performance of a classifier, let's train a LogisticRegression model. 

As with the DummyClassifier, train the model and then assess its performance with the f1_score and the confusion_matrix.

In [16]:
# Step 7a: Declare a variable to store the LogisticRegression model
logr = LogisticRegression()

# Step 7b: Fit your train dataset
logr.fit(X_train, y_train)

# Step 7c: Declare a variable and store your predictions that you make with your model using X test data 
logr_pred = logr.predict(X_test)

# Step 7d: Print f1_score between the y test and LogisticRegression prediction
print("F1-score:", metrics.f1_score(y_test, logr_pred))

# Step 7e: Print a confusion_matrix between y_test and your prediction
confusion_matrix(y_test, logr_pred)

F1-score: 0.9772492244053775


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([[3123,   39],
       [   5,  945]])
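
The warning in the output above means the solver stopped before it converged, and it suggests two remedies: raise max_iter or scale the data. A hedged sketch of both, on synthetic data with deliberately mismatched feature scales, is shown below. The pipeline and the variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the occupancy dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

# Put the features on very different scales, as sensor readings often are
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 1, 1, 0.001])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Standardise the features, then fit a LogisticRegression with a higher iteration cap
scaled_logr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logr.fit(X_tr, y_tr)

scaled_pred = scaled_logr.predict(X_te)
print("F1-score:", f1_score(y_te, scaled_pred))

# Number of iterations the solver actually used
print("iterations:", scaled_logr[-1].n_iter_[0])
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into the model.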

### Step 8: Train a DecisionTreeClassifier
The LogisticRegression model should perform quite impressively, based on the confusion matrix and the f1_score. 

Can we improve it further? Let's find out by training and assessing a DecisionTreeClassifier.

In [17]:
# Step 8: Train a DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
print("F1-score:", metrics.f1_score(y_test, tree_pred))
confusion_matrix(y_test, tree_pred)

F1-score: 0.9837270341207349


array([[3144,   18],
       [  13,  937]])

### Step 9: Train a RandomForestClassifier
The DecisionTreeClassifier most likely performs (slightly) better than the LogisticRegression in terms of f1 score.

Train a RandomForestClassifier and see if you can push the performance even further.

In [18]:
# Step 9: Train a RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)
print("F1-score:", metrics.f1_score(y_test, forest_pred))
confusion_matrix(y_test, forest_pred)

F1-score: 0.9868904037755636


array([[3146,   16],
       [   9,  941]])

### Optional: Train other classifiers
There are a few other classifiers that you can try, apart from the three that we used above.

It's hard to top RandomForestClassifier for this dataset, but it's still worth trying them out to get some practice in.
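
As one possible starting point, the sketch below trains two other sklearn classifiers, `KNeighborsClassifier` and `GradientBoostingClassifier`, in a loop. These two are our own picks rather than a recommendation from this project, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the occupancy dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

# Extra models to try, keyed by name
extra_models = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=1),
}

# Same fit / predict / assess routine as the steps above
extra_scores = {}
for name, model in extra_models.items():
    model.fit(X_tr, y_tr)
    extra_scores[name] = f1_score(y_te, model.predict(X_te))
    print(name, extra_scores[name])
```

To use it on the project data, swap the synthetic arrays for your own X_train, X_test, y_train and y_test.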

### Step 10: Get a feature importances DataFrame
Create a DataFrame containing the feature importances of your best performing model. 

For example, this is what the DataFrame could look like:

![RandomForestClassifierFeatureImportances](https://uplevelsg.s3.ap-southeast-1.amazonaws.com/ProjectRoomOccupancy/RandomForestClassifierFeatureImportances.png)

What's the most important feature? 

Does it align with what you observed in Part II? 

In [20]:
# Step 10: Create a DataFrame containing feature importances
# Pair each feature name with its importance from the fitted RandomForestClassifier
pd.DataFrame({'feature': X.columns,
             'importance': forest.feature_importances_})

| | feature | importance |
|---|---|---|
| 0 | Temperature | 0.161425 |
| 1 | Humidity | 0.017774 |
| 2 | Light | 0.544876 |
| 3 | CO2 | 0.102417 |
| 4 | HumidityRatio | 0.023608 |
| 5 | weekday | 0.050829 |
| 6 | hour | 0.090849 |
| 7 | minute | 0.008222 |
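
Sorting the table makes the most important feature easier to spot. The sketch below does this on synthetic data with placeholder column names (`a` to `d`); on the project data you would use your own fitted model and `X.columns` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names
X_arr, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=["a", "b", "c", "d"])

forest_demo = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)

# Build the importance table and sort it from most to least important
importances = (
    pd.DataFrame({
        "feature": X_demo.columns,
        "importance": forest_demo.feature_importances_,
    })
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importances)
```

The importances of a random forest sum to 1, so each value can be read as a share of the total.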


## Modelling without 'Light'
Whichever model you used, it's most likely that you identified "Light" as the most important feature in the model.

This makes sense, because if there's 'Light', it's most likely that there's someone in the room. 

Here's a challenge - let's try modelling without "Light" as a feature. 

### Step 11: Repeat Step 3 and drop 'Light'
Repeat what you did in Step 3, i.e. prepare independent and dependent values.

However, this time drop 'Light' on top of 'date' and 'Occupancy' to prepare your independent values.

In [21]:
# Step 11: Prepare new independent and dependent values
X = df.drop(['date', 'Occupancy', 'Light'], axis = 1)
y = df['Occupancy']

### Step 12: Repeat Steps 5-10
Now that you've removed the 'Light' column, it's time to split your data and model again.

One thing to note - when you train a LogisticRegression you <strong>may</strong> receive a warning. Don't worry - just increase the value of max_iter.
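
Since the same split, fit, predict and assess routine is repeated for every model, one option is to wrap it in a small helper. The function below, `evaluate_models`, is a hypothetical helper written for this illustration and is demonstrated on synthetic data. It also passes `max_iter=1000` to LogisticRegression, which is an arbitrary value chosen here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X, y, random_state=0):
    """Split once, then fit and score each model on the same train/test sets."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    models = {
        "DummyClassifier": DummyClassifier(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=random_state),
        "RandomForestClassifier": RandomForestClassifier(random_state=random_state),
    }
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        pred = model.predict(X_te)
        # zero_division=0 returns 0 instead of warning when the metric is undefined
        rows.append({"model": name, "f1": f1_score(y_te, pred, zero_division=0)})
    return pd.DataFrame(rows)


# Synthetic, imbalanced stand-in data (an assumption, not the occupancy dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.77, 0.23], random_state=0
)
results = evaluate_models(X_demo, y_demo)
print(results)
```

Calling the helper once with 'Light' and once without it would give two comparable tables, because every model is scored on the same split.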

In [22]:
# Step 12a: Split your data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)

In [23]:
# Step 12b: Train and assess a DummyClassifier
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)
print("F1-score:", metrics.f1_score(y_test, dummy_pred))
confusion_matrix(y_test, dummy_pred)

F1-score: 0.0


array([[3162,    0],
       [ 950,    0]])

In [24]:
# Step 12c: Train and assess a LogisticRegression
logr = LogisticRegression()
logr.fit(X_train, y_train)
logr_pred = logr.predict(X_test)
print("F1-score:", metrics.f1_score(y_test, logr_pred))
confusion_matrix(y_test, logr_pred)

F1-score: 0.4866920152091255


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


array([[2918,  244],
       [ 566,  384]])

In [25]:
# Step 12d: Train a DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)
print("F1-score:", metrics.f1_score(y_test, tree_pred))
confusion_matrix(y_test, tree_pred)

F1-score: 0.9763531266421439


array([[3138,   24],
       [  21,  929]])

In [26]:
# Step 12e: Train a RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)
print("F1-score:", metrics.f1_score(y_test, forest_pred))
confusion_matrix(y_test, forest_pred)

F1-score: 0.9837781266352694


array([[3141,   21],
       [  10,  940]])

In [27]:
# Step 12f: Create a DataFrame containing feature importances
pd.DataFrame({'feature': X.columns,
             'importance': forest.feature_importances_})

| | feature | importance |
|---|---|---|
| 0 | Temperature | 0.281392 |
| 1 | Humidity | 0.065274 |
| 2 | CO2 | 0.228614 |
| 3 | HumidityRatio | 0.068114 |
| 4 | weekday | 0.08971 |
| 5 | hour | 0.241695 |
| 6 | minute | 0.025201 |


<details>
    <summary><strong>Did removing 'Light' affect model performance adversely?</strong></summary>
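    <div>For the tree-based models, not really. In the run shown above, the f1 score and confusion matrix of the DecisionTreeClassifier and RandomForestClassifier still look great. The LogisticRegression is the exception: its f1 score dropped from about 0.98 to about 0.49.</div>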
</details>

<details>
    <summary><strong>What were the features that were important?</strong></summary>
    <div>In the new DataFrame, the most important features were 'Temperature', 'CO2', and 'hour'. It seems that the model relied mostly on these three features in the absence of 'Light'.</div>
</details>

# The end
You did it! You've arrived at the end. Congratulations and well done on completing this project series! 

Let's review.
1. In Part I, you collected the datasets and combined them to form a single DataFrame. You also investigated the data briefly to see if there was anything remarkable about it
2. In Part II, you performed exploratory data analysis on the dataset, investigating distributions and relationships found between features. You also engineered additional features from the dataset for model building
3. In Part III, you trained machine learning models that can predict room occupancy based on sensor data. In addition, you modelled the problem without a major feature to see if the models performed equally well

Go on, give yourself a pat on the back. We hope this project series has given you more confidence in coding and machine learning. 

What you have learned here is just the tip of the iceberg, and a launchpad for bigger and better things to come. Come join us in our Telegram community over at https://bit.ly/UpLevelSG and our Facebook page at https://fb.com/UpLevelSG