# STINTSY Machine Project

**SS1 - Group 3**
1. BERENGUER, Beatrice A.
2. BUENDIA, Leigh Arriane S.
3. ENRIQUEZ, Manolo L.

**<h2> Description of the Task**

The goal is to build a machine learning model that classifies whether a smoke detector's fire alarm should be triggered.

**<h2> List of Requirements**

In [80]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

**<h2> Description of the Dataset**

https://www.kaggle.com/datasets/deepcontractor/smoke-detection-dataset

Training data was collected with IoT devices, since the goal is to develop an AI-based smoke detector. Many different environments and fire sources have to be sampled to ensure a good training dataset. A short list of the captured scenarios:

* Normal indoor
* Normal outdoor
* Indoor wood fire, firefighter training area
* Indoor gas fire, firefighter training area
* Outdoor wood, coal, and gas grill
* Outdoor high humidity etc.



In [81]:
data=pd.read_csv('smoke_detection_iot.csv')
data.head()

,Unnamed: 0,UTC,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,0,1654733331,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,1,1654733332,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
2,2,1654733333,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2,0
3,3,1654733334,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3,0
4,4,1654733335,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4,0


In [82]:
data.shape

(62630, 16)

In [83]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62630 entries, 0 to 62629
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      62630 non-null  int64  
 1   UTC             62630 non-null  int64  
 2   Temperature[C]  62630 non-null  float64
 3   Humidity[%]     62630 non-null  float64
 4   TVOC[ppb]       62630 non-null  int64  
 5   eCO2[ppm]       62630 non-null  int64  
 6   Raw H2          62630 non-null  int64  
 7   Raw Ethanol     62630 non-null  int64  
 8   Pressure[hPa]   62630 non-null  float64
 9   PM1.0           62630 non-null  float64
 10  PM2.5           62630 non-null  float64
 11  NC0.5           62630 non-null  float64
 12  NC1.0           62630 non-null  float64
 13  NC2.5           62630 non-null  float64
 14  CNT             62630 non-null  int64  
 15  Fire Alarm      62630 non-null  int64  
dtypes: float64(8), int64(8)
memory usage: 7.6 MB


In [84]:
data.nunique()

Unnamed: 0        62630
UTC               62630
Temperature[C]    21672
Humidity[%]        3890
TVOC[ppb]          1966
eCO2[ppm]          1713
Raw H2             1830
Raw Ethanol        2659
Pressure[hPa]      2213
PM1.0              1337
PM2.5              1351
NC0.5              3093
NC1.0              4113
NC2.5              1161
CNT               24994
Fire Alarm            2
dtype: int64

**<h2> Data Preprocessing and Cleaning**

In [85]:
data.isnull().sum()

Unnamed: 0        0
UTC               0
Temperature[C]    0
Humidity[%]       0
TVOC[ppb]         0
eCO2[ppm]         0
Raw H2            0
Raw Ethanol       0
Pressure[hPa]     0
PM1.0             0
PM2.5             0
NC0.5             0
NC1.0             0
NC2.5             0
CNT               0
Fire Alarm        0
dtype: int64

In [86]:
data = data.dropna()

In [87]:
data.isnull().sum()

Unnamed: 0        0
UTC               0
Temperature[C]    0
Humidity[%]       0
TVOC[ppb]         0
eCO2[ppm]         0
Raw H2            0
Raw Ethanol       0
Pressure[hPa]     0
PM1.0             0
PM2.5             0
NC0.5             0
NC1.0             0
NC2.5             0
CNT               0
Fire Alarm        0
dtype: int64

In [88]:
data.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
62625    False
62626    False
62627    False
62628    False
62629    False
Length: 62630, dtype: bool
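
The printed Series is truncated, so it cannot show whether any of the 62,630 rows is flagged as a duplicate. A minimal sketch, using a hypothetical toy frame rather than the project data, of counting flagged rows directly:

```python
import pandas as pd

# Toy frame (hypothetical values) in which the second row repeats the first.
toy = pd.DataFrame({
    "Temperature[C]": [20.0, 20.0, 21.5],
    "Humidity[%]": [57.36, 57.36, 55.0],
})

# duplicated() marks every repeat of an earlier row as True; True sums as 1.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates()

print(n_duplicates, "duplicate row(s);", len(deduped), "rows remain")
```

Applied to `data`, a result of `0` from `data.duplicated().sum()` would confirm that `drop_duplicates()` leaves the frame unchanged.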

In [89]:
data = data.drop_duplicates()

In [90]:
data.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
62625    False
62626    False
62627    False
62628    False
62629    False
Length: 62630, dtype: bool

In [91]:
data.drop(['Unnamed: 0', 'UTC'], axis = 1, inplace = True)
data.head()

,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],Raw H2,Raw Ethanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,Fire Alarm
0,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0,0
1,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1,0
2,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2,0
3,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3,0
4,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4,0


In [92]:
data.columns = data.columns.str.replace(' ', '')

In [93]:
X = data.drop(['FireAlarm'], axis = 1)
X.head()

,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],RawH2,RawEthanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT
0,20.0,57.36,0,400,12306,18520,939.735,0.0,0.0,0.0,0.0,0.0,0
1,20.015,56.67,0,400,12345,18651,939.744,0.0,0.0,0.0,0.0,0.0,1
2,20.029,55.96,0,400,12374,18764,939.738,0.0,0.0,0.0,0.0,0.0,2
3,20.044,55.28,0,400,12390,18849,939.736,0.0,0.0,0.0,0.0,0.0,3
4,20.059,54.69,0,400,12403,18921,939.744,0.0,0.0,0.0,0.0,0.0,4


In [94]:
y = data['FireAlarm']
y.head()

0    0
1    0
2    0
3    0
4    0
Name: FireAlarm, dtype: int64

In [95]:
scale = MinMaxScaler()

In [96]:
X = pd.DataFrame(scale.fit_transform(X),columns=X.columns)
X.head()

,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],RawH2,RawEthanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT
0,0.512692,0.723239,0.0,0.0,0.522488,0.525685,0.986014,0.0,0.0,0.0,0.0,0.0,0.0
1,0.512875,0.712535,0.0,0.0,0.534928,0.547185,0.987013,0.0,0.0,0.0,0.0,0.0,4e-05
2,0.513046,0.70152,0.0,0.0,0.544179,0.565731,0.986347,0.0,0.0,0.0,0.0,0.0,8e-05
3,0.513229,0.690971,0.0,0.0,0.549282,0.579682,0.986125,0.0,0.0,0.0,0.0,0.0,0.00012
4,0.513412,0.681818,0.0,0.0,0.553429,0.591498,0.987013,0.0,0.0,0.0,0.0,0.0,0.00016


In [97]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
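
The notebook fits `MinMaxScaler` on all rows before splitting, so the test rows contribute to the min/max statistics. One common alternative is to split first and fit the scaler on the training rows only. A minimal sketch on synthetic data (the names `X_demo`, `y_demo`, and `scaler` are hypothetical and not part of the project):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the sensor features (hypothetical, not the Kaggle file).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=20.0, scale=5.0, size=(1000, 3))
y_demo = rng.integers(0, 2, size=1000)

# Split first so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this ordering the scaled test values may fall slightly outside [0, 1], which is expected because the test rows were not used to set the range.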

In [98]:
smote = SMOTE()

In [99]:
y_train.value_counts()

1    35836
0    14268
Name: FireAlarm, dtype: int64

In [100]:
X_train, y_train = smote.fit_resample(X_train, y_train)

In [101]:
y_train.value_counts()

0    35836
1    35836
Name: FireAlarm, dtype: int64
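
The two `value_counts()` outputs show the effect of the resampling: the minority class (0) is raised to the size of the majority class (1). The arithmetic can be checked with a short sketch that uses only the counts printed above (the variable names are illustrative):

```python
# Class counts printed by y_train.value_counts() before resampling.
before = {1: 35836, 0: 14268}

# After resampling, every class has as many rows as the largest class had.
majority = max(before.values())
after = {label: majority for label in before}

# Number of synthetic rows added per class.
synthetic_added = {label: majority - count for label, count in before.items()}

total_before = sum(before.values())
total_after = sum(after.values())

print("rows added per class:", synthetic_added)
print("training rows:", total_before, "->", total_after)
```

So about 30% of the resampled training set (21,568 of 71,672 rows) consists of synthetic class-0 samples, and 71,672 is the training-set size that appears in the k-NN training accuracy later in the notebook.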

**<h2> Exploratory Data Analysis**

In [102]:
data.describe()

,Temperature[C],Humidity[%],TVOC[ppb],eCO2[ppm],RawH2,RawEthanol,Pressure[hPa],PM1.0,PM2.5,NC0.5,NC1.0,NC2.5,CNT,FireAlarm
count,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0,62630.0
mean,15.970424,48.539499,1942.057528,670.021044,12942.453936,19754.257912,938.627649,100.594309,184.46777,491.463608,203.586487,80.049042,10511.386157,0.714626
std,14.359576,8.865367,7811.589055,1905.885439,272.464305,609.513156,1.331344,922.524245,1976.305615,4265.661251,2214.738556,1083.383189,7597.870997,0.451596
min,-22.01,10.74,0.0,400.0,10668.0,15317.0,930.852,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,10.99425,47.53,130.0,400.0,12830.0,19435.0,938.7,1.28,1.34,8.82,1.384,0.033,3625.25,0.0
50%,20.13,50.15,981.0,400.0,12924.0,19501.0,938.816,1.81,1.88,12.45,1.943,0.044,9336.0,1.0
75%,25.4095,53.24,1189.0,438.0,13109.0,20078.0,939.418,2.09,2.18,14.42,2.249,0.051,17164.75,1.0
max,59.93,75.2,60000.0,60000.0,13803.0,21410.0,939.861,14333.69,45432.26,61482.03,51914.68,30026.438,24993.0,1.0


**<h2> Model Training**

<h3> Binomial Logistic Regression

In [103]:
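# loss='log' gives logistic regression; scikit-learn 1.1+ renamed this loss to 'log_loss'.
# eta0 is not used by the 'optimal' learning-rate schedule.
logreg = SGDClassifier(loss='log', eta0=0.001, learning_rate='optimal', random_state=1, verbose=0)
max_epochs = 200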

In [104]:
class DataLoader(object):

    def __init__(self, X, y, batch_size):
        """Class constructor for DataLoader

        Arguments:
            X {pd.DataFrame} -- A DataFrame of shape (N, D) containing the
            data; there are N samples each of dimension D.
            y {pd.Series} -- A Series of length N containing the
            ground truth values.
            batch_size {int} -- An integer representing the number of instances
            per batch.
        """
        self.X = X
        self.y = y
        self.batch_size = batch_size

        self.indices = np.arange(self.X.shape[0])
        print(self.X.iloc[0])
        np.random.seed(1)

    def shuffle(self):
        """Shuffles the indices in self.indices.
        """

        # Shuffle the indices in place.
        np.random.shuffle(self.indices)

    def get_batch(self, mode='train'):
        """Returns self.X and self.y divided into different batches of size
        self.batch_size according to the shuffled self.indices.

        Arguments:
            mode {str} -- A string which determines the mode of the model. This
            can either be `train` or `test`.

        Returns:
            list, list -- List of pd.DataFrame containing the data divided
            into different batches of size self.batch_size; list of pd.Series
            containing the ground truth labels divided into different batches
            of size self.batch_size
        """

        X_batch = []
        y_batch = []

        # In `train` mode, shuffle the indices before batching; in `test`
        # mode, restore the original order.
        if mode == 'train':
            self.shuffle()
        elif mode == 'test':
            self.indices = np.arange(self.X.shape[0])

        # The loop that will iterate from 0 to the number of instances with
        # step equal to self.batch_size
        for i in range(0, len(self.indices), self.batch_size):

            # Take a full batch if at least self.batch_size indices remain
            # from index i.
            if i + self.batch_size <= len(self.indices):
                indices = self.indices[i:i + self.batch_size]

            # Otherwise, take whatever indices remain from index i to the end.
            else:
                indices = self.indices[i:]

            X_batch.append(self.X.iloc[indices])
            y_batch.append(self.y.iloc[indices])

        return X_batch, y_batch

In [105]:
data_loader = DataLoader(X=X_train, y=y_train, batch_size=128)

Temperature[C]    2.065292e-01
Humidity[%]       6.796463e-01
TVOC[ppb]         3.700000e-03
eCO2[ppm]         0.000000e+00
RawH2             7.904306e-01
RawEthanol        7.744953e-01
Pressure[hPa]     9.668110e-01
PM1.0             3.069691e-05
PM2.5             9.904856e-06
NC0.5             4.879475e-05
NC1.0             9.014791e-06
NC2.5             3.663438e-07
CNT               2.833193e-01
Name: 0, dtype: float64


In [106]:
from sklearn.metrics import log_loss

e = 0
is_converged = False
previous_loss = 0
labels = np.unique(y_train)
print(labels)
# For each epoch
while e < max_epochs and not is_converged:
    loss = 0
    X_batch, y_batch = data_loader.get_batch()
    # Named X_b, y_b so the loop does not overwrite the global X and y
    for X_b, y_b in zip(X_batch, y_batch):
        logreg.partial_fit(X_b, y_b, classes=labels)
        # Log loss on the full training set after each mini-batch update
        y_pred = logreg.predict_proba(X_train)
        loss += log_loss(y_train, y_pred)
    print('Epoch:', e + 1, '\tLoss:', (loss / len(X_batch)))
    # Stop once the summed loss changes by less than 0.1 between epochs
    if abs(previous_loss - loss) < 0.1:
        print('if')
        is_converged = True
    else:
        print('else')
        previous_loss = loss
        e += 1

[0 1]




Epoch: 1 	Loss: 0.18214687813299596
else
Epoch: 2 	Loss: 0.14553671897008877
else
Epoch: 3 	Loss: 0.1448532799880616
else
Epoch: 4 	Loss: 0.14445873437787102
else
Epoch: 5 	Loss: 0.14425674419304255
else
Epoch: 6 	Loss: 0.14406878967190828
else
Epoch: 7 	Loss: 0.1439590198190906
if
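
The loop prints the mean loss per batch, but its stopping test compares the summed loss between epochs against 0.1. With 71,672 training rows and a batch size of 128 there are 560 batches per epoch, so the test is roughly equivalent to a change of 0.1 / 560 ≈ 0.00018 in the printed mean. A small sketch that replays the printed values to show why training stops at epoch 7 (the helper `first_converged_epoch` is illustrative, not part of the notebook):

```python
import math

# Mean per-batch losses printed by the training loop above.
mean_losses = [
    0.18214687813299596,
    0.14553671897008877,
    0.1448532799880616,
    0.14445873437787102,
    0.14425674419304255,
    0.14406878967190828,
    0.1439590198190906,
]

n_train = 71672      # rows in the resampled training set
batch_size = 128
n_batches = math.ceil(n_train / batch_size)


def first_converged_epoch(mean_losses, n_batches, threshold=0.1):
    """Return the first epoch whose summed loss differs from the previous
    epoch's summed loss by less than `threshold`, or None if there is none."""
    previous = 0.0
    for epoch, mean_loss in enumerate(mean_losses, start=1):
        summed = mean_loss * n_batches  # undo the division done when printing
        if abs(previous - summed) < threshold:
            return epoch
        previous = summed
    return None


print("batches per epoch:", n_batches)
print("converged at epoch:", first_converged_epoch(mean_losses, n_batches))
```

Between epochs 5 and 6 the summed loss changes by about 0.105, just above the threshold; between epochs 6 and 7 it changes by about 0.061, which triggers the stop.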


In [107]:
predictions = logreg.predict(X_test)
print(predictions)

[0 1 0 ... 1 0 0]


In [108]:
num_correct = np.sum(predictions == y_test)
print(num_correct, 'out of', len(y_test))

11673 out of 12526


In [109]:
accuracy = num_correct / len(y_test) * 100
print(accuracy, '%')

93.19016445792752 %


In [110]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.82      0.98      0.89      3605
           1       0.99      0.91      0.95      8921

    accuracy                           0.93     12526
   macro avg       0.91      0.95      0.92     12526
weighted avg       0.94      0.93      0.93     12526
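
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows: the macro average is the plain mean over classes, while the weighted average weights each class by its support. A minimal sketch using the rounded values from the report above, so results may differ from the report in the last digit:

```python
# Per-class values copied from the report above (already rounded to 2 d.p.).
report = {
    0: {"precision": 0.82, "recall": 0.98, "f1": 0.89, "support": 3605},
    1: {"precision": 0.99, "recall": 0.91, "f1": 0.95, "support": 8921},
}

total = sum(row["support"] for row in report.values())


def macro_avg(metric):
    """Unweighted mean of the metric over classes."""
    return sum(row[metric] for row in report.values()) / len(report)


def weighted_avg(metric):
    """Mean of the metric over classes, weighted by class support."""
    return sum(row[metric] * row["support"] for row in report.values()) / total


for metric in ("precision", "recall", "f1"):
    print(metric, round(macro_avg(metric), 3), round(weighted_avg(metric), 3))
```

The support-weighted recall equals overall accuracy, which is why both appear as 0.93 in the report.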



<h3> Gaussian Naive Bayes

<h3> k Nearest Neighbors

In [153]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 7)
knn = knn.fit(X_train, y_train)

In [154]:
y_train_knn = knn.predict(X_train)

In [155]:
num_correct = np.sum(y_train_knn == y_train)
print(num_correct, 'out of', len(y_train))

71666 out of 71672


In [156]:
accuracy = num_correct / len(y_train) * 100
print(accuracy, '%')

99.99162852996987 %


In [157]:
y_test_knn = knn.predict(X_test)

In [158]:
print(classification_report(y_test, y_test_knn))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3605
           1       1.00      1.00      1.00      8921

    accuracy                           1.00     12526
   macro avg       1.00      1.00      1.00     12526
weighted avg       1.00      1.00      1.00     12526



In [159]:
num_correct = np.sum(y_test_knn == y_test)
print(num_correct, 'out of', len(y_test))

12524 out of 12526


In [160]:
accuracy = num_correct / len(y_test) * 100
print(accuracy, '%')

99.98403321092128 %


**<h2> Hyperparameter Tuning**



**<h2> Model Selection**

**<h2> Insights and Conclusion**

**<h2> References**