## Note:
- The Naive Bayes algorithm applies only to classification problems.
- Gaussian Naive Bayes is a parametric approach: it estimates a mean and a variance for each feature within each class.
- An NB model is easy to build and scales well to very large datasets, including inputs with very high dimensionality.
- Gaussian NB assumes that each continuous feature follows a normal distribution within each class.
- If a continuous feature does not follow a normal distribution, apply a transformation (or another method) to bring it closer to normal.
- Because the model assumes the predictors are independent given the class, it is suggested to remove strongly correlated features.
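
To make the assumptions above concrete, here is a minimal from-scratch sketch of Gaussian Naive Bayes in NumPy. It is not scikit-learn's implementation (which also applies variance smoothing); the function names `fit_gaussian_nb` and `predict_gaussian_nb` and the synthetic data are made up for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest log posterior (up to a constant)."""
    # Shape (n_samples, n_classes, n_features): log N(x_j | mean_cj, var_cj)
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Independence assumption: per-feature log densities are simply summed
    log_posterior = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

# Synthetic two-class data (made up for illustration, not the occupancy data)
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

params = fit_gaussian_nb(X, y)
pred = predict_gaussian_nb(X, *params)
accuracy = np.mean(pred == y)
print(f"Training accuracy on synthetic data: {accuracy:.3f}")
```

The only parameters learned are one prior per class plus one mean and one variance per feature and class, which is why the approach is parametric and cheap to fit. Summing the per-feature log densities is exactly where the independence assumption enters.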

## Data Set Information:

- Abstract: Experimental data used for binary classification (room occupancy) from Temperature, Humidity, Light and CO2. Ground-truth occupancy was obtained from time-stamped pictures taken every minute. The task is to predict whether a room is occupied from environmental measures such as temperature, humidity and CO2.
- This is a common type of time series classification problem called room occupancy classification.
- The dataset was collected by monitoring an office with a suite of environmental sensors and using a camera to determine whether the room was occupied.

Attribute Information:

- date, in the format year-month-day hour:minute:second
- Temperature, in Celsius
- Relative Humidity, in %
- Light, in Lux
- CO2, in ppm (parts per million)
- Humidity Ratio, a quantity derived from temperature and relative humidity, in kg water vapor / kg air
- Occupancy, 0 or 1: 0 for not occupied, 1 for occupied
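
The dataset description does not say which formula was used to derive the humidity ratio. As a rough sketch, one standard psychrometric approximation combines the Magnus formula for saturation vapor pressure with an assumed sea-level pressure of 1013.25 hPa. The function name `humidity_ratio` and the sample reading are hypothetical.

```python
import math

def humidity_ratio(temp_c, rel_humidity_pct, pressure_hpa=1013.25):
    """Approximate humidity ratio in kg of water vapor per kg of dry air."""
    # Magnus approximation of saturation vapor pressure over water, in hPa
    p_sat = 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))
    # Partial pressure of water vapor at the given relative humidity
    p_vapor = rel_humidity_pct / 100.0 * p_sat
    # 0.622 is the ratio of the molar masses of water and dry air
    return 0.622 * p_vapor / (pressure_hpa - p_vapor)

w = humidity_ratio(23.0, 27.0)  # hypothetical sensor reading
print(f"Humidity ratio: {w:.5f}")
```

Because this attribute is a deterministic function of Temperature and Humidity, it is strongly related to both, which is worth keeping in mind given the independence assumption of Naive Bayes.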

In [1]:
## Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import math
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
plt.style.use('dark_background')

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split

In [2]:
# Reading the given txt files and creating dataframes
# Raw strings (r"...") keep the backslashes in Windows paths from being read as escape sequences

df_training = pd.read_csv(r"D:\Data Science\Certificates_Udemy_Coursera\Prac_BM_PK_Book\occupancy_data\datatraining.txt")
df_test = pd.read_csv(r"D:\Data Science\Certificates_Udemy_Coursera\Prac_BM_PK_Book\occupancy_data\datatest.txt")
df_test_2 = pd.read_csv(r"D:\Data Science\Certificates_Udemy_Coursera\Prac_BM_PK_Book\occupancy_data\datatest2.txt")

# Storing each dataframe as a csv file on the local disk, without the index column

df_training.to_csv(r"D:\Data Science\Certificates_Udemy_Coursera\Prac_BM_PK_Book\occupancy_data\occu_training.csv",
                   index=False)

df_test.to_csv(r"D:\Data Science\Certificates_Udemy_Coursera\Prac_BM_PK_Book\occupancy_data\occu_test.csv",
               index=False)

df_test_2.to_csv(r"D:\Data Science\Certificates_Udemy_Coursera\Prac_BM_PK_Book\occupancy_data\occu_test_2.csv",
                 index=False)
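
Hard-coded absolute paths tie the notebook to one machine. A possible alternative, sketched here under assumptions, is to build paths with `pathlib`. The temporary folder, the file name `occu_sample.csv` and the tiny dataframe are placeholders, not the real occupancy files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data directory; a temporary folder stands in for the real one
data_dir = Path(tempfile.mkdtemp())

# Tiny made-up frame standing in for the occupancy data
df = pd.DataFrame({"Temperature": [23.1, 23.2], "Occupancy": [1, 0]})

# Path objects are joined with "/" and need no backslash escaping
csv_path = data_dir / "occu_sample.csv"
df.to_csv(csv_path, index=False)

# Read the file back to confirm the round trip keeps rows and columns
restored = pd.read_csv(csv_path)
print(restored.shape)
```

With this pattern only `data_dir` has to change when the notebook moves to another machine.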

In [3]:
## Reading the dataset (raw strings avoid backslash escape issues in Windows paths)
training = pd.read_csv(r"D:\Data Science\Certificates_Udemy_Coursera\Prac_BM_PK_Book\occupancy_data\occu_training.csv")
test = pd.read_csv(r"D:\Data Science\Certificates_Udemy_Coursera\Prac_BM_PK_Book\occupancy_data\occu_test.csv")

In [4]:
## Drop the date column from both train and test, as it is not used as a feature here
training = training.drop('date', axis = 1)
test = test.drop('date', axis = 1)

In [5]:
training.shape, test.shape

((8143, 6), (2665, 6))

In [6]:
training.columns

Index(['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio',
       'Occupancy'],
      dtype='object')

In [7]:
## Splitting the training and test data into independent variables (x) and the dependent variable (y)
x_train = training.drop('Occupancy', axis = 1)
y_train = training['Occupancy']

x_test = test.drop('Occupancy', axis = 1)
y_test = test['Occupancy']

In [8]:
## Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train_sc = sc.fit_transform(x_train)
x_test_sc = sc.transform(x_test)
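
As a small illustration of what the scaling step does, the sketch below checks that `StandardScaler` subtracts the training mean and divides by the training standard deviation (population form, `ddof=0`), and that the test data is transformed with the training statistics. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up train/test arrays for illustration
train = np.array([[20.0, 400.0], [22.0, 600.0], [24.0, 800.0]])
test = np.array([[21.0, 500.0]])

scaler = StandardScaler()
train_sc = scaler.fit_transform(train)  # learns mean and std from the training data
test_sc = scaler.transform(test)        # reuses the training statistics

# Manual equivalent using the population standard deviation (ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test_sc = (test - mean) / std

print(np.allclose(test_sc, manual_test_sc))
```

Fitting the scaler on the training data only, as the cell above does, keeps test-set statistics from leaking into the model.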

In [14]:
## Creating Naive Bayes Model
naive_occu = GaussianNB()
naive_occu.fit(x_train_sc, y_train)

## Determining score of training and test dataset
print('Training Score_Naive Bayes Model : %0.4f'%naive_occu.score(x_train_sc, y_train))
print('Test Score_Naive Bayes Model : %0.4f'%naive_occu.score(x_test_sc,y_test))

Training Score_Naive Bayes Model : 0.9789
Test Score_Naive Bayes Model : 0.9775


In [15]:
## Predicting on the test set
naive_pred = naive_occu.predict(x_test_sc)

## Evaluating the predictions using a confusion matrix
naive_results = confusion_matrix(y_test,naive_pred)
print('Confusion Matrix_Naive Bayes :\n',naive_results)

Confusion Matrix_Naive Bayes :
 [[1638   55]
 [   5  967]]
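
To read the matrix above: scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. Here that means 55 unoccupied rows were predicted as occupied and 5 occupied rows were missed. A small sketch with made-up labels shows the layout:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)

# For a binary problem, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```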


In [16]:
## Creating a Logistic Regression Model
print('------------------ LOGISTIC REGRESSION MODEL ------------------')

logis_occu = LogisticRegression()

logis_occu.fit(x_train_sc, y_train)

logis_pred = logis_occu.predict(x_test_sc)
logis_score = accuracy_score(y_test,logis_pred)
print('Score_Logistic Reg : %0.4f'%logis_score)
logis_results = confusion_matrix(y_test,logis_pred)
print('Confusion Matrix_Logistic Reg :\n',logis_results)

------------------ LOGISTIC REGRESSION MODEL ------------------
Score_Logistic Reg : 0.9771
Confusion Matrix_Logistic Reg :
 [[1639   54]
 [   7  965]]


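#### Obs:
- The NB model's training score (0.9789) and test score (0.9775) are close, so there is no sign of overfitting.
- The test accuracy of the NB model (0.9775) is marginally higher than that of the Logistic Regression model (0.9771). Per the confusion matrices, the gap is a single test sample (2605 vs 2604 correct out of 2665).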
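
Accuracy alone hides how the two models differ, so it can help to derive precision and recall for the occupied class from the two confusion matrices printed in the outputs. The sketch below does that with a hypothetical helper, `summarize`; the matrices are copied from the results reported in this notebook.

```python
import numpy as np

def summarize(cm):
    """Accuracy, precision and recall for the positive class (label 1)."""
    tn, fp, fn, tp = cm.ravel()
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "correct": int(tp + tn),
    }

# Confusion matrices as printed in the notebook outputs
nb_cm = np.array([[1638, 55], [5, 967]])
lr_cm = np.array([[1639, 54], [7, 965]])

nb, lr = summarize(nb_cm), summarize(lr_cm)
for name, s in (("Naive Bayes", nb), ("Logistic Reg", lr)):
    print(f"{name}: acc={s['accuracy']:.4f} "
          f"prec={s['precision']:.4f} rec={s['recall']:.4f}")
print("Extra test rows classified correctly by NB:", nb["correct"] - lr["correct"])
```

Both models make most of their errors as false positives (predicting occupied for an empty room), while missed occupancies are rare.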