# Binary Classification 

To compare my implementation of logistic regression with sklearn's on a two-class classification problem, I am going to use the [Occupancy Detection Dataset](https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+). 

## Dataset Inspection
First of all, let's see what this dataset looks like and check whether it contains any invalid values.

### Samples
The dataset contains 1 date column, 5 numeric columns, and 1 categorical column. It does not contain any N/A values. 

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('occypancy.txt')
df.head()

Unnamed: 0,date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
1,2015-02-11 14:48:00,21.76,31.133333,437.333333,1029.666667,0.005021,1
2,2015-02-11 14:49:00,21.79,31.0,437.333333,1000.0,0.005009,1
3,2015-02-11 14:50:00,21.7675,31.1225,434.0,1003.75,0.005022,1
4,2015-02-11 14:51:00,21.7675,31.1225,439.0,1009.5,0.005022,1
5,2015-02-11 14:51:59,21.79,31.133333,437.333333,1005.666667,0.00503,1


In [2]:
# Check if null value exists
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9752 entries, 1 to 9752
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           9752 non-null   object 
 1   Temperature    9752 non-null   float64
 2   Humidity       9752 non-null   float64
 3   Light          9752 non-null   float64
 4   CO2            9752 non-null   float64
 5   HumidityRatio  9752 non-null   float64
 6   Occupancy      9752 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 609.5+ KB


### Target
`Occupancy` is the target I'm going to predict. As the cell below shows, it contains only two categories (0 and 1). 

In [3]:
# Count the samples in each target class
df['Occupancy'].value_counts()

0    7703
1    2049
Name: Occupancy, dtype: int64
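
The two classes are imbalanced, so a useful reference point for the accuracy numbers later on is the majority-class baseline. A small sketch using the counts printed above (the variable names are illustrative only):

```python
# Counts copied from the value_counts() output above
counts = {0: 7703, 1: 2049}
total = sum(counts.values())

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = max(counts.values()) / total
# Fraction of samples labelled as occupied (1)
positive_share = counts[1] / total

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Share of occupied samples: {positive_share:.4f}")
```

Any classifier should beat this baseline before its accuracy is considered meaningful.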

## Pre-process

### Unimportant data - date
The date value may not be important for predicting the target. It is also hard for logistic regression to use directly because it is not numeric. It could be transformed into a timestamp stored as a long integer, but I assume the raw timestamp adds little for predicting occupancy, so I decided to drop it.  

In [4]:
df = df.drop('date', axis='columns')
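
For reference, if the time information were kept instead of dropped, the date strings could be turned into numeric columns. The following is only a sketch under that assumption; the column names `hour`, `day_of_week`, and `unix_seconds` are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

# Two timestamps copied from the df.head() output above
sample = pd.DataFrame({"date": ["2015-02-11 14:48:00", "2015-02-11 14:49:00"]})

parsed = pd.to_datetime(sample["date"])

# Calendar features are often more informative than a raw timestamp
sample["hour"] = parsed.dt.hour
sample["day_of_week"] = parsed.dt.dayofweek  # Monday = 0

# Raw timestamp as whole seconds since the Unix epoch
sample["unix_seconds"] = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(sample)
```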

### Shuffle 
To make sure the training and testing data are picked randomly, it's good practice to shuffle the dataset. (`train_test_split` also shuffles by default, so this step is a safeguard rather than a requirement.)

In [5]:
from sklearn.utils import shuffle
df = shuffle(df)

In [6]:
# Split dataset to features and target
features = df.iloc[:, :-1]
target = df.iloc[:, -1]

### Standardize and Split
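I want to standardize the data. The LR in scikit-learn does not standardize its inputs itself, but putting all features on a common scale helps gradient descent converge and lets sklearn's default L2 regularization treat the features evenly. It may also give us a more accurate prediction than training without standardization. 

This dataset will be split into training (80%) and testing (20%) sets.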

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features = scaler.fit_transform(features)

# Split data into train (80%) and test (20%)
train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.2, random_state=1)

# My implementation cannot take array-like features, so convert them to DataFrames.
train_x = pd.DataFrame(train_x)
test_x = pd.DataFrame(test_x)
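
The cell above fits the scaler on the full dataset before splitting. A common alternative is to fit the scaler on the training split only, so that test-set statistics do not influence the preprocessing. Below is a sketch of that variant on synthetic data; the arrays `X` and `y` and the random seed are stand-ins, not the occupancy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 5))  # stand-in for the 5 numeric features
y = rng.integers(0, 2, size=200)                   # stand-in for Occupancy

# Split first, then standardize
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std from the training rows only
X_test_std = scaler.transform(X_test)        # reuse the training statistics on the test rows

print("train means:", X_train_std.mean(axis=0).round(6))
print("test means: ", X_test_std.mean(axis=0).round(3))
```

The test means are close to zero but not exactly zero, which is expected because the test rows were not used to compute the statistics.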

# Performance Comparison


## My Logistic Regression
Once the data is cleaned and standardized, we can train on it.   
First, I use my implementation of logistic regression to fit the data and make a prediction.  
The following graph shows the cost decreasing over the 10,000 iterations.

In [None]:
from LogisticRegression import LogisticRegression as MyLR
mylr = MyLR()
mylr.fit(train_x, train_y)
my_hyp = mylr.predict(test_x)
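
The `LogisticRegression` module imported above is not shown in this notebook. As a rough illustration of what plain batch gradient descent on the logistic loss looks like, here is a minimal NumPy sketch. The function names, learning rate, and iteration count are assumptions made for this example and are not the actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
    """Batch gradient descent on the mean log-loss. Returns (weights, bias, costs)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    costs = []
    eps = 1e-12  # keeps log() away from zero
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)  # predicted probabilities
        costs.append(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        error = p - y
        w -= lr * (X.T @ error) / n_samples  # gradient step for the weights
        b -= lr * error.mean()               # gradient step for the bias
    return w, b, costs

def predict(X, w, b, threshold=0.5):
    return (sigmoid(np.asarray(X, dtype=float) @ w + b) >= threshold).astype(int)

# Tiny synthetic check: the label depends on the sign of the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

w, b, costs = fit_logistic_gd(X_demo, y_demo)
train_acc = (predict(X_demo, w, b) == y_demo).mean()
print("first cost:", costs[0], "last cost:", costs[-1], "train accuracy:", train_acc)
```

Plotting `costs` against the iteration index gives the kind of decreasing cost curve described above.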

## Sklearn Logistic Regression

Now I am going to use the logistic regression from sklearn and compare its accuracy metrics with those of my implementation. 

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(train_x, train_y)
sk_hyp = clf.predict(test_x)
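
The estimator above is created with its default settings. A quick way to see which solver and regularization strength those defaults imply is to inspect the estimator's parameters. This is only a sketch: the printed values depend on the installed scikit-learn version, and in recent releases the default solver is `lbfgs`.

```python
from sklearn.linear_model import LogisticRegression

clf_default = LogisticRegression()
params = clf_default.get_params()

# Optimizer used to fit the model
print("solver:", params["solver"])
# Inverse regularization strength (smaller means stronger regularization)
print("C:", params["C"])
# Iteration cap for the solver
print("max_iter:", params["max_iter"])
```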

### Metrics
We will evaluate explained_variance_score, accuracy_score, and confusion_matrix to compare the performance of the two versions.

In [None]:
from sklearn.metrics import explained_variance_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

In [None]:
my_evs = explained_variance_score(test_y, my_hyp)
my_acc = accuracy_score(test_y, my_hyp)
my_cm = confusion_matrix(test_y, my_hyp)

In [None]:
sk_evs = explained_variance_score(test_y, sk_hyp)
sk_acc = accuracy_score(test_y, sk_hyp)
sk_cm = confusion_matrix(test_y, sk_hyp)

In [None]:
print("My Explained Variance Score:", my_evs)
print("My Accuracy Score:", my_acc)
print("Confusion Matrix:")
print(my_cm)

In [None]:
print("Sklearn's Explained Variance Score:", sk_evs)
print("Sklearn's Accuracy Score:", sk_acc)
print("Confusion Matrix:")
print(sk_cm)
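
Because the classes are imbalanced, precision and recall for the occupied class can be more informative than accuracy alone, and explained variance is normally a regression metric. The sketch below derives these numbers from a confusion matrix. The labels are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 6 unoccupied (0) and 4 occupied (1) samples
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels {0, 1}, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted occupied samples, how many are correct
recall = tp / (tp + fn)     # of the truly occupied samples, how many are found

print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

The same unpacking can be applied to `my_cm` and `sk_cm` to compare the two models class by class.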

# Conclusion

As the metrics above show, sklearn got better accuracy and explained variance. Although my implementation reached 95% accuracy, its explained variance is much lower than sklearn's. This may be because the two implementations use different methods. For example, I only used plain batch gradient descent to fit the model. sklearn's `LogisticRegression`, on the other hand, uses the L-BFGS solver by default and applies L2 regularization, which tends to converge more reliably; variants of [Stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) such as SAG and SAGA are available as optional solvers. ([Source](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_logistic.py))


Overall, although the predictions are not perfect, the accuracy is high enough.