## Data Set Information:

- Abstract: Experimental data used for binary classification (room occupancy) from temperature, humidity, light, and CO2 measurements. Ground-truth occupancy was obtained from time-stamped pictures taken every minute. The task is to predict whether a room is occupied from environmental measures such as temperature and humidity.
- This is a common type of time series classification problem known as room occupancy classification.
- The dataset was collected by monitoring an office with a suite of environmental sensors and using a camera to determine whether the room was occupied.

Attribute Information:

- date: timestamp in the format year-month-day hour:minute:second
- Temperature, in Celsius
- Relative Humidity, in %
- Light, in Lux
- CO2, in ppm (parts per million)
- Humidity Ratio, a quantity derived from temperature and relative humidity, in kg water vapor per kg air
- Occupancy, 0 or 1: 0 for not occupied, 1 for occupied
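The humidity ratio can be reproduced approximately from the temperature and relative humidity columns. The sketch below is only an illustration: the Magnus approximation for saturation vapor pressure and the standard atmospheric pressure of 101325 Pa are assumptions, since the source does not say which formula or pressure was used, and `humidity_ratio` is a hypothetical helper name.

```python
import math

def humidity_ratio(temp_c, rel_humidity_pct, pressure_pa=101325.0):
    """Approximate humidity ratio in kg water vapor per kg dry air."""
    # Saturation vapor pressure over water in Pa (Magnus approximation, an assumption)
    p_sat = 610.94 * math.exp(17.625 * temp_c / (temp_c + 243.04))
    # Partial pressure of water vapor from the relative humidity
    p_w = (rel_humidity_pct / 100.0) * p_sat
    # 0.622 is approximately the ratio of the molar masses of water and dry air
    return 0.622 * p_w / (pressure_pa - p_w)

# First training row printed later in this notebook:
# Temperature 23.18, Humidity 27.272, HumidityRatio 0.004793
w = humidity_ratio(23.18, 27.272)
print(w)
```

Under these assumptions the result lands close to the stored value of 0.004793; a small gap is expected because the original formula and pressure are unknown.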

The three files are used as follows:

- datatest.txt (test): From 2015-02-02 14:19:00 to 2015-02-04 10:43:00
- datatraining.txt (train): From 2015-02-04 17:51:00 to 2015-02-10 09:33:00
- datatest2.txt (validation): From 2015-02-11 14:48:00 to 2015-02-18 09:19:00

## Data Preprocessing:

In [1]:
## Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import math
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
plt.style.use('dark_background')

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [2]:
# # Reading given txt file and creating dataframe

# df_train = pd.read_csv("F:\INC_ML_Shortlisted\Occupancy_Data\datatraining.txt")
# df_valid = pd.read_csv("F:\INC_ML_Shortlisted\Occupancy_Data\datatest2.txt")
# df_test = pd.read_csv("F:\INC_ML_Shortlisted\Occupancy_Data\datatest.txt")
 
# # Storing this dataframe in a csv file into the local file

# df_train.to_csv("F:\INC_ML_Shortlisted\Occupancy_Data\occu_training.csv", 
#                   index = None)
# df_valid.to_csv("F:\INC_ML_Shortlisted\Occupancy_Data\occu_valid.csv", 
#                   index = None)
# df_test.to_csv("F:\INC_ML_Shortlisted\Occupancy_Data\occu_test.csv", 
#                   index = None)

In [3]:
## Reading the datasets (raw strings keep the Windows backslashes from being read as escape sequences)
train = pd.read_csv(r"F:\INC_ML_Shortlisted\Occupancy_Data\occu_training.csv")
valid = pd.read_csv(r"F:\INC_ML_Shortlisted\Occupancy_Data\occu_valid.csv")
test = pd.read_csv(r"F:\INC_ML_Shortlisted\Occupancy_Data\occu_test.csv")

In [4]:
## Drop the date column from train, valid and test, as it is not used as a feature here
train = train.drop('date', axis = 1)
valid = valid.drop('date', axis = 1)
test = test.drop('date', axis = 1)

In [5]:
train.head()

|   | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy |
|---|---|---|---|---|---|---|
| 0 | 23.18 | 27.272 | 426.0 | 721.25 | 0.004793 | 1 |
| 1 | 23.15 | 27.2675 | 429.5 | 714.0 | 0.004783 | 1 |
| 2 | 23.15 | 27.245 | 426.0 | 713.5 | 0.004779 | 1 |
| 3 | 23.15 | 27.2 | 426.0 | 708.25 | 0.004772 | 1 |
| 4 | 23.1 | 27.2 | 426.0 | 704.5 | 0.004757 | 1 |


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8143 entries, 0 to 8142
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Temperature    8143 non-null   float64
 1   Humidity       8143 non-null   float64
 2   Light          8143 non-null   float64
 3   CO2            8143 non-null   float64
 4   HumidityRatio  8143 non-null   float64
 5   Occupancy      8143 non-null   int64  
dtypes: float64(5), int64(1)
memory usage: 381.8 KB


In [7]:
train.shape,valid.shape, test.shape

((8143, 6), (9752, 6), (2665, 6))

In [8]:
train.columns

Index(['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio',
       'Occupancy'],
      dtype='object')

### Obs : The dataset is already split into train, validation and test sets, so we will not combine and re-split the data


In [9]:
train.isnull().sum()

Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupancy        0
dtype: int64

In [10]:
valid.isnull().sum()

Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupancy        0
dtype: int64

In [11]:
test.isnull().sum()

Temperature      0
Humidity         0
Light            0
CO2              0
HumidityRatio    0
Occupancy        0
dtype: int64

### Obs : There are no missing values in any of the three datasets
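The three separate checks above can be collapsed into one table. This is a minimal sketch, assuming the frames are passed in a dictionary; `count_missing` is a hypothetical helper, and the small demo frames contain made-up values (including one deliberate gap) only so that the block runs on its own.

```python
import pandas as pd

def count_missing(frames):
    """Return missing-value counts: one row per dataset, one column per feature."""
    return pd.DataFrame({name: df.isnull().sum() for name, df in frames.items()}).T

# Made-up stand-in frames; the second one contains a single missing value
demo = {
    "train": pd.DataFrame({"Temperature": [23.18, 23.15], "CO2": [721.25, 714.0]}),
    "valid": pd.DataFrame({"Temperature": [21.76, None], "CO2": [1029.7, 1000.0]}),
}
missing = count_missing(demo)
print(missing)
```

With the notebook's own data the call would be `count_missing({"train": train, "valid": valid, "test": test})`.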

In [12]:
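## Function for a rough normality check using skewness and kurtosis

def skw_krt(data1):
    """Print the absolute skewness and absolute excess kurtosis of every column."""
    # pandas reports excess (Fisher) kurtosis, so a normal distribution scores about 0 on both measures
    print(' 1. Skewness of the Every Column in the Dataset is:','\n',data1.skew().abs())
    print('')

    print(' 2. Kurtosis of the Every Column in the Dataset is:','\n',data1.kurtosis().abs())
    print('')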

In [13]:
skw_krt(train)

 1. Skewness of the Every Column in the Dataset is: 
 Temperature      0.450868
Humidity         0.272018
Light            1.237440
CO2              2.380910
HumidityRatio    0.616681
Occupancy        1.407109
dtype: float64

 2. Kurtosis of the Every Column in the Dataset is: 
 Temperature      0.810242
Humidity         0.931880
Light            0.123152
CO2              5.775411
HumidityRatio    0.039355
Occupancy        0.020050
dtype: float64



In [14]:
skw_krt(valid)

 1. Skewness of the Every Column in the Dataset is: 
 Temperature      1.228301
Humidity         0.119257
Light            1.512123
CO2              1.545384
HumidityRatio    0.166972
Occupancy        1.423383
dtype: float64

 2. Kurtosis of the Every Column in the Dataset is: 
 Temperature      1.270792
Humidity         0.850902
Light            1.230746
CO2              1.480166
HumidityRatio    0.611207
Occupancy        0.026024
dtype: float64



In [15]:
skw_krt(test)

 1. Skewness of the Every Column in the Dataset is: 
 Temperature      0.842562
Humidity         0.672762
Light            0.759940
CO2              0.787597
HumidityRatio    0.649385
Occupancy        0.562365
dtype: float64

 2. Kurtosis of the Every Column in the Dataset is: 
 Temperature      0.628995
Humidity         0.272580
Light            0.538937
CO2              0.727056
HumidityRatio    0.729942
Occupancy        1.685011
dtype: float64



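### Obs : By this rough check, CO2 in the training data departs most from normality (absolute skewness 2.38, kurtosis 5.78); the other features stay below about 1.6 on both measures in all three sets. Occupancy is binary, so the check is not meaningful for it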
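Skewness and kurtosis only give a rough indication of normality. A formal test can complement them; the sketch below uses `scipy.stats.normaltest` (the D'Agostino-Pearson test), which is not part of the original analysis. The two samples are synthetic stand-ins, not the occupancy data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins: one symmetric sample and one right-skewed sample
samples = {
    "symmetric": rng.normal(loc=20.0, scale=1.0, size=2000),
    "right_skewed": rng.lognormal(mean=6.0, sigma=0.5, size=2000),
}

p_values = {}
for name, values in samples.items():
    # Null hypothesis: the sample comes from a normal distribution
    statistic, p_value = stats.normaltest(values)
    p_values[name] = p_value
    print(f"{name}: skew={stats.skew(values):.3f}, p-value={p_value:.3g}")
```

A small p-value is evidence against normality. On the real data the same loop could run over `train.drop('Occupancy', axis=1)` column by column; with several thousand rows, even mild departures tend to produce small p-values.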

In [16]:
## Splitting each dataset into independent variables (x) and the dependent variable (y)
x_train = train.drop('Occupancy', axis = 1)
y_train = train['Occupancy']

x_valid = valid.drop('Occupancy', axis = 1)
y_valid = valid['Occupancy']

x_test = test.drop('Occupancy', axis = 1)
y_test = test['Occupancy']

In [17]:
## Feature scaling: fit the scaler on the training data only, then reuse it on the validation data
sc = StandardScaler()
x_train_sc = sc.fit_transform(x_train)
x_valid_sc = sc.transform(x_valid)
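The notebook imports `Pipeline` but scales the data by hand. As an alternative sketch, not the approach used in the cells here, the scaler and a classifier can be chained so that the scaler is always fitted on whatever data `fit` receives and reused unchanged at prediction time. The two-feature data below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic two-class data: class 1 is shifted by 4 units in both features
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(4.0, 1.0, size=(100, 2)),
])
y_demo = np.array([0] * 100 + [1] * 100)

# The scaler is fitted inside fit() and only applied inside predict()/score()
pipe = Pipeline([("scaler", StandardScaler()), ("model", GaussianNB())])
pipe.fit(X_demo, y_demo)

demo_accuracy = pipe.score(X_demo, y_demo)
scaler_mean = pipe.named_steps["scaler"].mean_
print(demo_accuracy, scaler_mean)
```

With the notebook's variables this would read `pipe.fit(x_train, y_train)` followed by `pipe.score(x_valid, y_valid)`, which removes the need to keep separate scaled copies of each split.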

## Model_1 : Baseline Model (Gaussian Naive Bayes)

In [18]:
## Creating  Model
print('------------------ NAIVE BAYES MODEL ------------------')

model_naive = GaussianNB()
model_naive.fit(x_train_sc, y_train)

## Determining the score on the training dataset
print('Training Score_Naive Bayes Model : %0.4f'%model_naive.score(x_train_sc, y_train))
# print('Test Score_Naive Bayes Model : %0.4f'%model_naive.score(x_test_sc,y_test))
print('')
## Predicting and determining accuracy of the model
naive_pred = model_naive.predict(x_valid_sc)
naive_score = accuracy_score(y_valid,naive_pred)
print('Valid Score_Naive Bayes Model : %0.4f'%naive_score)
print('')
## Determining accuracy using Confusion Matrix
naive_results = confusion_matrix(y_valid,naive_pred)
print('Confusion Matrix_Naive Bayes Model :\n',naive_results)
print('')
## Classification Report
print('Classification Report_NB Model :\n',classification_report(y_valid,naive_pred))

------------------ NAIVE BAYES MODEL ------------------
Training Score_Naive Bayes Model : 0.9789

Valid Score_Naive Bayes Model : 0.9869

Confusion Matrix_Naive Bayes Model :
 [[7589  114]
 [  14 2035]]

Classification Report_NB Model :
               precision    recall  f1-score   support

           0       1.00      0.99      0.99      7703
           1       0.95      0.99      0.97      2049

    accuracy                           0.99      9752
   macro avg       0.97      0.99      0.98      9752
weighted avg       0.99      0.99      0.99      9752



### Obs : The model performs well on the validation data: accuracy is 0.9869, and for the occupied class precision is 0.95, recall is 0.99 and f1-score is 0.97
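The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them for the occupied class from the validation matrix printed above, using scikit-learn's layout of rows as true labels and columns as predicted labels.

```python
import numpy as np

# Validation confusion matrix of the Naive Bayes model, copied from the output above
cm = np.array([[7589, 114],
               [14, 2035]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # share of all samples classified correctly
precision = tp / (tp + fp)               # of the predicted "occupied", how many were right
recall = tp / (tp + fn)                  # of the truly "occupied", how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.9869 precision=0.95 recall=0.99 f1=0.97
```

The 114 false positives are what pull precision for the occupied class down to 0.95, while only 14 occupied samples are missed.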

## Model_2 : Logistic Regression

In [19]:
## Creating  Model
print('------------------ LOGISTIC REGRESSION MODEL ------------------')

model_logis = LogisticRegression()
model_logis.fit(x_train_sc, y_train)

## Determining the score on the training dataset
print('Training Score_Logistic Reg : %0.4f'%model_logis.score(x_train_sc, y_train))
print('')
# print('Test Score_Logistic Reg : %0.4f'%model_logis.score(x_test_sc,y_test))
## Predicting and determining accuracy of the model
logis_pred = model_logis.predict(x_valid_sc)
logis_score = accuracy_score(y_valid,logis_pred)
print('Valid Score_Logistic Reg : %0.4f'%logis_score)
print('')
## Determining accuracy using Confusion Matrix
logis_results = confusion_matrix(y_valid,logis_pred)
print('Confusion Matrix_Logistic Reg :\n',logis_results)
print('')
## Classification Report
print('Classification Report_Logistic Reg :\n',classification_report(y_valid,logis_pred))

------------------ LOGISTIC REGRESSION MODEL ------------------
Training Score_Logistic Reg : 0.9860

Valid Score_Logistic Reg : 0.9849

Confusion Matrix_Logistic Reg :
 [[7651   52]
 [  95 1954]]

Classification Report_Logistic Reg :
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      7703
           1       0.97      0.95      0.96      2049

    accuracy                           0.98      9752
   macro avg       0.98      0.97      0.98      9752
weighted avg       0.98      0.98      0.98      9752



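#### Obs : 
- The train and validation scores are close for both models (NB: 0.9789 vs 0.9869; Logistic Reg: 0.9860 vs 0.9849), so neither shows signs of overfitting
- The validation accuracy of the NB model (0.9869) is slightly higher than that of the Logistic Reg model (0.9849), and NB misses fewer occupied samples (14 vs 95 false negatives), so NB is used as the final model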

## Final Model on Test data:

In [20]:
x_test_sc = sc.transform(x_test)
## Predicting on the test data with the final (Naive Bayes) model and determining its accuracy
y_pred = model_naive.predict(x_test_sc)
final_score = accuracy_score(y_test,y_pred)
print('Test Score_NB Model : %0.4f'%final_score)
print('')
## Determining accuracy using Confusion Matrix
final_results = confusion_matrix(y_test,y_pred)
print('Confusion Matrix_NB Model :\n',final_results)
print('')
## Classification Report
print('Classification Report_NB Model :\n',classification_report(y_test,y_pred))
    

Test Score_NB Model : 0.9775

Confusion Matrix_NB Model :
 [[1638   55]
 [   5  967]]

Classification Report_NB Model :
               precision    recall  f1-score   support

           0       1.00      0.97      0.98      1693
           1       0.95      0.99      0.97       972

    accuracy                           0.98      2665
   macro avg       0.97      0.98      0.98      2665
weighted avg       0.98      0.98      0.98      2665



### Obs : The final NB model also performs well on the test data: accuracy is 0.9775, with 55 false positives and 5 false negatives out of 2665 samples
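Because the classes are imbalanced, it helps to compare these accuracies with a trivial baseline that always predicts the majority class. The sketch below derives that baseline from the class supports in the classification reports above; the comparison itself is an addition, not part of the original analysis.

```python
# Class supports (not occupied, occupied) taken from the classification reports above
supports = {"valid": (7703, 2049), "test": (1693, 972)}
# Naive Bayes accuracies reported above
model_accuracy = {"valid": 0.9869, "test": 0.9775}

baselines = {}
for name, (not_occupied, occupied) in supports.items():
    # Accuracy of always predicting the more frequent class
    baselines[name] = max(not_occupied, occupied) / (not_occupied + occupied)
    print(f"{name}: majority baseline={baselines[name]:.4f}, NB accuracy={model_accuracy[name]:.4f}")
# prints:
# valid: majority baseline=0.7899, NB accuracy=0.9869
# test: majority baseline=0.6353, NB accuracy=0.9775
```

The model clears the majority-class baseline by a wide margin on both sets, so the high accuracy is not just a product of class imbalance.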