## Logistic Regression From Scratch

In this notebook, we use the "Simple Gender Classification Dataset", available [here](https://www.kaggle.com/datasets/muhammadtalharasool/simple-gender-classification/data), together with `pandas` and `matplotlib` to load the data and perform some basic exploratory analysis. We then implement logistic regression from scratch using `numpy` and compare it with the `scikit-learn` implementation.

### Installation & Setup

In [108]:
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn

import warnings
warnings.filterwarnings("ignore")



In [109]:
import os
import pandas as pd

df = pd.read_pickle(os.path.join('data','dataset.pkl'))

### Exploratory Data Analysis

In [110]:
df.head()

| | Gender | Age | Height (cm) | Weight (kg) | Occupation | Education Level | Marital Status | Income (USD) | Favorite Color | Unnamed: 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | male | 32 | 175 | 70 | Software Engineer | Master's Degree | Married | 75000 | Blue | NaN |
| 1 | male | 25 | 182 | 85 | Sales Representative | Bachelor's Degree | Single | 45000 | Green | NaN |
| 2 | female | 41 | 160 | 62 | Doctor | Doctorate Degree | Married | 120000 | Purple | NaN |
| 3 | male | 38 | 178 | 79 | Lawyer | Bachelor's Degree | Single | 90000 | Red | NaN |
| 4 | female | 29 | 165 | 58 | Graphic Designer | Associate's Degree | Single | 35000 | Yellow | NaN |


In [111]:
df.shape

(131, 10)

In [112]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131 entries, 0 to 130
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0    Gender           131 non-null    object 
 1    Age              131 non-null    int64  
 2    Height (cm)      131 non-null    int64  
 3    Weight (kg)      131 non-null    int64  
 4    Occupation       131 non-null    object 
 5    Education Level  131 non-null    object 
 6    Marital Status   131 non-null    object 
 7    Income (USD)     131 non-null    int64  
 8    Favorite Color   131 non-null    object 
 9   Unnamed: 9        0 non-null      float64
dtypes: float64(1), int64(4), object(5)
memory usage: 10.4+ KB


In [113]:
df.isnull().sum()

 Gender               0
 Age                  0
 Height (cm)          0
 Weight (kg)          0
 Occupation           0
 Education Level      0
 Marital Status       0
 Income (USD)         0
 Favorite Color       0
Unnamed: 9          131
dtype: int64

In [114]:
df[' Gender'].value_counts()

 Gender
male       41
female     39
 male      27
 female    24
Name: count, dtype: int64

Some labels carry a leading space (`' male'`, `' female'`), so `value_counts()` reports four classes instead of two. We will fix that in the feature engineering step.
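One way to normalize such labels is to strip the whitespace before mapping them. The sketch below is only an illustration: `raw` is a hypothetical stand-in for `df[' Gender']`, rebuilt from the counts printed above rather than read from the real DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for df[' Gender'], rebuilt from the value counts above.
raw = pd.Series(["male"] * 41 + ["female"] * 39 + [" male"] * 27 + [" female"] * 24)

# Strip surrounding whitespace so each class has exactly one spelling.
clean = raw.str.strip()
counts = clean.value_counts()
print(counts)

# With normalized labels, a two-entry mapping is enough.
labels = clean.map({"male": 1, "female": 0})
```

On the real data the same idea would presumably read `df[' Gender'].str.strip()`. The column names also start with a space, which `df.columns.str.strip()` could remove in the same way.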

In [115]:
df.describe()

| | Age | Height (cm) | Weight (kg) | Income (USD) | Unnamed: 9 |
|---|---|---|---|---|---|
| count | 131.0 | 131.0 | 131.0 | 131.0 | 0.0 |
| mean | 34.564885 | 173.198473 | 71.458015 | 93206.10687 | NaN |
| std | 5.984723 | 8.045467 | 12.648052 | 74045.382919 | NaN |
| min | 24.0 | 160.0 | 50.0 | 30000.0 | NaN |
| 25% | 29.0 | 166.0 | 60.0 | 55000.0 | NaN |
| 50% | 34.0 | 175.0 | 75.0 | 75000.0 | NaN |
| 75% | 39.0 | 180.5 | 83.0 | 100000.0 | NaN |
| max | 52.0 | 190.0 | 94.0 | 500000.0 | NaN |


### Feature Distribution

### Feature Engineering


In [117]:
df = df.drop('Unnamed: 9', axis=1)

In [118]:
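labels = df[' Gender']
# Encode the target as 1 = male, 0 = female, covering the leading-space variants.
# Caution: ' female' is mapped to 1 here (it should presumably be 0); the outputs that follow reflect this mapping.
labels = labels.map({'male': 1, ' male': 1, '  male': 1, 'female': 0, ' female': 1, '  female': 0})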

In [119]:
df['Gender'] = labels
df = df.drop(' Gender', axis=1)

In [120]:
df['Gender'].value_counts()

Gender
1    92
0    39
Name: count, dtype: int64

In [121]:
categorical_columns = df.select_dtypes(include='object').columns.tolist()
categorical_columns

[' Occupation', ' Education Level', ' Marital Status', ' Favorite Color']

In [122]:
encoded_df = df.copy()

In [123]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in categorical_columns:
    encoded_df[col] = le.fit_transform(df[col])
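`LabelEncoder` assigns integer codes in sorted order of the distinct values, so the resulting numbers carry an ordering that the categories themselves do not have. The sketch below illustrates this on a toy column (the values are made up for illustration) and shows `pd.get_dummies` as one possible alternative for nominal features; it is not what this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column, invented for illustration only.
toy = pd.DataFrame({"Marital Status": ["Single", "Married", "Divorced", "Married"]})

le = LabelEncoder()
codes = le.fit_transform(toy["Marital Status"])
print(list(le.classes_))  # classes are stored in sorted order
print(codes)              # each value is replaced by its index in classes_

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy["Marital Status"])
print(one_hot.columns.tolist())
```

Because the loop above refits the same `le` for every column, only the last column's classes remain available for `inverse_transform`; keeping one encoder per column would avoid that.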

In [124]:
encoded_df.head()

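| | Age | Height (cm) | Weight (kg) | Occupation | Education Level | Marital Status | Income (USD) | Favorite Color | Gender |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | 175 | 70 | 15 | 3 | 1 | 75000 | 1 | 1 |
| 1 | 25 | 182 | 85 | 14 | 1 | 2 | 45000 | 2 | 1 |
| 2 | 41 | 160 | 62 | 6 | 2 | 1 | 120000 | 6 | 1 |
| 3 | 38 | 178 | 79 | 10 | 1 | 2 | 90000 | 7 | 1 |
| 4 | 29 | 165 | 58 | 8 | 0 | 2 | 35000 | 8 | 1 |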


In [125]:
encoded_df.corr()

| | Age | Height (cm) | Weight (kg) | Occupation | Education Level | Marital Status | Income (USD) | Favorite Color | Gender |
|---|---|---|---|---|---|---|---|---|---|
| Age | 1.0 | 0.726308 | 0.784738 | -0.1759 | 0.220823 | -0.135039 | 0.662278 | -0.013207 | 0.462123 |
| Height (cm) | 0.726308 | 1.0 | 0.975157 | -0.167029 | 0.20809 | -0.141778 | 0.456217 | -0.08431 | 0.597277 |
| Weight (kg) | 0.784738 | 0.975157 | 1.0 | -0.188074 | 0.200688 | -0.156897 | 0.486022 | -0.090207 | 0.63449 |
| Occupation | -0.1759 | -0.167029 | -0.188074 | 1.0 | 0.771409 | 0.872901 | -0.234879 | 0.751068 | -0.577755 |
| Education Level | 0.220823 | 0.20809 | 0.200688 | 0.771409 | 1.0 | 0.820698 | 0.074108 | 0.743115 | -0.329923 |
| Marital Status | -0.135039 | -0.141778 | -0.156897 | 0.872901 | 0.820698 | 1.0 | -0.183237 | 0.834993 | -0.627019 |
| Income (USD) | 0.662278 | 0.456217 | 0.486022 | -0.234879 | 0.074108 | -0.183237 | 1.0 | -0.111655 | 0.268208 |
| Favorite Color | -0.013207 | -0.08431 | -0.090207 | 0.751068 | 0.743115 | 0.834993 | -0.111655 | 1.0 | -0.524616 |
| Gender | 0.462123 | 0.597277 | 0.63449 | -0.577755 | -0.329923 | -0.627019 | 0.268208 | -0.524616 | 1.0 |


In [126]:
from sklearn.model_selection import train_test_split

X = encoded_df.drop('Gender', axis=1)
y = encoded_df['Gender']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [127]:
X_train.shape, X_test.shape

((104, 8), (27, 8))
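The 104/27 split follows from the dataset size and `test_size=0.2`. A quick arithmetic check, assuming (as scikit-learn appears to do for a fractional `test_size`) that the test set size is rounded up and the remainder goes to training:

```python
import math

n_samples = 131   # rows in the dataset, from df.shape
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test set size is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```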

In [128]:
X_train

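| | Age | Height (cm) | Weight (kg) | Occupation | Education Level | Marital Status | Income (USD) | Favorite Color |
|---|---|---|---|---|---|---|---|---|
| 78 | 30 | 170 | 64 | 29 | 4 | 6 | 55000 | 17 |
| 47 | 29 | 167 | 63 | 13 | 3 | 2 | 55000 | 7 |
| 0 | 32 | 175 | 70 | 15 | 3 | 1 | 75000 | 1 |
| 12 | 28 | 166 | 60 | 12 | 0 | 1 | 55000 | 6 |
| 42 | 39 | 179 | 83 | 9 | 1 | 1 | 95000 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | 39 | 181 | 82 | 26 | 5 | 6 | 90000 | 9 |
| 106 | 27 | 162 | 56 | 33 | 5 | 6 | 50000 | 11 |
| 14 | 33 | 170 | 65 | 16 | 3 | 1 | 65000 | 8 |
| 92 | 36 | 179 | 78 | 21 | 5 | 5 | 85000 | 13 |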


In [129]:
y_train

78     0
47     1
0      1
12     1
42     1
      ..
71     1
106    0
14     1
92     1
102    1
Name: Gender, Length: 104, dtype: int64

In [130]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
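The scaler is fitted on the training set only and then applied to both splits, so the test data is standardized with the training mean and standard deviation. A minimal sketch of the equivalent manual computation, on synthetic arrays (`train` and `test` are made-up data, not the notebook's features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=170, scale=8, size=(20, 2))  # synthetic "training" data
test = rng.normal(loc=170, scale=8, size=(5, 2))    # synthetic "test" data

scaler = StandardScaler().fit(train)

# Manual version: the statistics come from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_test = (test - mean) / std

print(np.allclose(scaler.transform(test), manual_test))
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.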

In [131]:
X_train

array([[-0.77407763, -0.43847927, -0.62406941,  0.9195311 ,  0.14034637,
         1.01477015, -0.52405126,  1.5323325 ],
       [-0.94499923, -0.82061763, -0.70439518, -0.73106194, -0.3461877 ,
        -0.91282977, -0.52405126, -0.46719799],
       [-0.43223443,  0.198418  , -0.14211482, -0.52473781, -0.3461877 ,
        -1.39472975, -0.24055929, -1.66691628],
       [-1.11592083, -0.94799709, -0.94537248, -0.83422401, -1.8057899 ,
        -1.39472975, -0.52405126, -0.66715104],
       [ 0.76421677,  0.70793581,  0.90212014, -1.1437102 , -1.31925584,
        -1.39472975,  0.04293268, -1.06705714],
       [ 0.08053037,  0.58055636,  0.66114284,  0.40372077,  1.59994857,
         1.01477015, -0.02794031,  0.13266116],
       [ 2.13158957,  0.96269472,  1.06277167, -1.55635846, -0.83272177,
        -1.39472975,  5.78364502, -1.66691628],
       [ 0.59329517,  0.198418  ,  0.42016555, -0.11208955,  0.62688043,
         0.53287017, -0.1696863 ,  0.33261421],
       [ 0.25145197,  0.198418  

### Logistic Regression From Scratch

In [164]:
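import numpy as np

class LogisticRegression:
    """Binary logistic regression trained with batch gradient descent."""

    def __init__(self, learning_rate=0.0001, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None

    def _sigmoid(self, z):
        # Map raw scores to probabilities in (0, 1).
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        num_samples, num_features = X.shape
        y = np.asarray(y)  # accept a pandas Series as well as an array
        # Start from zero weights, so every initial probability is 0.5.
        self.weights = np.zeros(num_features)
        self.bias = 0.0

        for _ in range(self.num_iterations):
            # Forward pass: predicted probability of class 1.
            linear_model = np.dot(X, self.weights) + self.bias
            y_pred = self._sigmoid(linear_model)

            # Gradients of the mean cross-entropy loss.
            dw = (1 / num_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / num_samples) * np.sum(y_pred - y)

            # Gradient-descent step.
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        # Label a sample 1 when its predicted probability exceeds 0.5.
        linear_model = np.dot(X, self.weights) + self.bias
        y_pred = self._sigmoid(linear_model)
        return (y_pred > 0.5).astype(int)

    def score(self, y_true, y_pred):
        # Accuracy: fraction of predictions that match the true labels.
        return np.sum(np.asarray(y_true) == np.asarray(y_pred)) / len(y_true)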
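The update in `fit` is gradient descent on the mean binary cross-entropy. Using notation introduced here for illustration, let $\hat{y}_i = \sigma(w^\top x_i + b)$ with $\sigma(z) = 1 / (1 + e^{-z})$, let $n$ be the number of samples, and let $\eta$ be the learning rate. The quantities `dw` and `db` then correspond to the two partial derivatives below:

```latex
\begin{aligned}
L(w, b) &= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right], \\
\frac{\partial L}{\partial w} &= \frac{1}{n} X^\top (\hat{y} - y), \qquad
\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i), \\
w &\leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}.
\end{aligned}
```

The class never evaluates $L$ itself; it only needs the gradients, which simplify to the residual $\hat{y} - y$ because $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.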

In [168]:
custom_model = LogisticRegression()
custom_model.fit(X_train, y_train)
y_pred = custom_model.predict(X_train)
y_pred

array([0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1])

In [169]:
custom_model.score(y_train, y_pred)

0.9038461538461539

In [170]:
y_test_pred = custom_model.predict(X_test)
y_test_pred

array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 0])

In [171]:
custom_model.score(y_test, y_test_pred)

0.9259259259259259
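With `learning_rate=0.0001` and 1000 iterations, the weights move only a short distance from their zero initialization, so the predicted probabilities likely stay close to 0.5 even when the thresholded predictions are mostly right. The sketch below compares the final mean log-loss for two learning rates on synthetic data (not the notebook's dataset); the helper `train_log_loss` and the data are assumptions made for illustration.

```python
import numpy as np

def train_log_loss(X, y, learning_rate, num_iterations=1000):
    """Run the same gradient-descent update and return the final mean log-loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(num_iterations):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        w -= learning_rate * (X.T @ (p - y)) / n
        b -= learning_rate * np.sum(p - y) / n
    p = 1 / (1 + np.exp(-(X @ w + b)))
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic, roughly standardized data with a noisy linear decision rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)

loss_small = train_log_loss(X, y, learning_rate=1e-4)
loss_large = train_log_loss(X, y, learning_rate=1e-1)
print(loss_small, loss_large)
```

A loss that remains just below $\log 2 \approx 0.693$, the value at zero weights, would suggest the model has barely trained; a larger learning rate or more iterations may then be worth trying.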

### Logistic Regression Using Scikit-Learn

In [172]:
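# Import under an alias so the custom LogisticRegression class defined above is not shadowed.
from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression

# C is the inverse regularization strength; a very large C effectively disables regularization.
model = SklearnLogisticRegression(fit_intercept=True, max_iter=1000, C=1e9, solver='liblinear')
model.fit(X_train, y_train)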

In [173]:
model.predict(X_train)

array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1])

In [174]:
model.score(X_train, y_train)

1.0

In [145]:
model.predict(X_test)

array([1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 0])

In [146]:
model.score(X_test, y_test)

1.0
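To see where the two models differ on the test set, their predictions can be compared element by element. The arrays below are copied from the outputs printed in this notebook; the variable names are chosen here for illustration.

```python
import numpy as np

# Test-set predictions copied from the outputs above.
custom_pred = np.array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
                        0, 1, 1, 0, 0])
sklearn_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
                         0, 1, 1, 1, 0])

disagree = np.flatnonzero(custom_pred != sklearn_pred)  # positions where the models differ
agreement = np.mean(custom_pred == sklearn_pred)        # fraction of identical predictions
print(disagree, agreement)
```

The two models disagree at positions 0 and 25, giving an agreement rate of 25/27. Because the scikit-learn model scores 1.0 on the test set, its predictions equal `y_test`, so this rate coincides with the custom model's test accuracy.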