<a href="https://colab.research.google.com/github/mohamed7456/ML-Course-Assignments/blob/main/notebooks/05_logistic_regression_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Loading and Preprocessing**
Dataset (`heart.csv`, from Kaggle): https://www.kaggle.com/datasets/nareshbhat/health-care-data-set-on-heart-attack-possibility/data?select=heart.csv

In [4]:
# Libraries
import numpy as np
import pandas as pd

In [5]:
df = pd.read_csv("heart.csv") # read data
df.head()

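| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |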


In [6]:
# Check Nulls
nulls = df.isnull().sum()

print("Nulls in the dataframe:")
print(nulls)

Nulls in the dataframe:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64


In [7]:
# Check duplicates
duplicates = df.duplicated().sum()
duplicates

np.int64(1)

In [8]:
df.drop_duplicates(inplace=True)

In [9]:
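from sklearn.preprocessing import StandardScaler
# Standardize the continuous features (zero mean, unit variance)
# Note: the scaler is fit on the full dataset, before the train/test split,
# so test-set statistics influence the scaling of the training data.
numerical_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])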

In [10]:
from sklearn.model_selection import train_test_split
# Separate the features from the target column, then split into training and test sets
X = df.drop(columns='target').values
y = df['target'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
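
In the cells above, the scaler is fit before the split. A minimal sketch of the alternative order (split first, then fit the scaler on the training rows only) is shown below. It uses a small synthetic frame as a stand-in for `heart.csv`; the names `demo_df`, `demo_numerical_cols`, and `demo_scaler` are hypothetical and the values are randomly generated, not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv (hypothetical values, not the real data)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "age": rng.normal(54, 9, size=200),
    "chol": rng.normal(246, 52, size=200),
    "sex": rng.integers(0, 2, size=200),
    "target": rng.integers(0, 2, size=200),
})
demo_numerical_cols = ["age", "chol"]

# Split first, so the test rows play no part in fitting the scaler
train_df, test_df = train_test_split(demo_df, test_size=0.2, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()

# Fit on the training rows, then apply the same transform to the test rows
demo_scaler = StandardScaler()
train_df[demo_numerical_cols] = demo_scaler.fit_transform(train_df[demo_numerical_cols])
test_df[demo_numerical_cols] = demo_scaler.transform(test_df[demo_numerical_cols])

X_train_demo = train_df.drop(columns="target").values
y_train_demo = train_df["target"].values
X_test_demo = test_df.drop(columns="target").values
y_test_demo = test_df["target"].values
```

With this order the training columns have zero mean and unit variance, while the test columns are only approximately standardized, which is the expected behaviour.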

# **Implementing Logistic Regression functions**

In [11]:
# Sigmoid Function
def sigmoid(z):
    z = np.asarray(z)
    return 1 / (1 + np.exp(-z))
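
For very negative inputs, `np.exp(-z)` overflows and NumPy emits a runtime warning, even though the result still rounds to 0. A possible variant that avoids the overflow is sketched below; `stable_sigmoid` is a hypothetical name, and unlike the notebook's `sigmoid` it always returns an array with at least one dimension.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number (hypothetical helper)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)), where exp(z) is below 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

demo_values = stable_sigmoid([-1000.0, 0.0, 1000.0])
print(demo_values)
```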

In [12]:
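# Cost function (binary cross-entropy) and its gradient
def cost_fun(X, y, weights):
    m = len(y)                       # number of samples
    h = sigmoid(np.dot(X, weights))  # predicted probabilities
    # Mean negative log-likelihood (np.log yields -inf if h reaches exactly 0 or 1)
    cost = -1/m * (np.dot(y, np.log(h)) + np.dot((1 - y), np.log(1 - h)))
    # Gradient of the cost with respect to the weights
    gradient = 1/m * np.dot(X.T, (h - y))
    return cost, gradient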
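
One way to gain confidence in the analytic gradient is to compare it with a central finite-difference approximation of the cost. The sketch below repeats `sigmoid` and `cost_fun` so it runs on its own; `numerical_gradient` and the `*_demo` arrays are hypothetical additions, and the data is randomly generated.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.asarray(z)))

def cost_fun(X, y, weights):
    m = len(y)
    h = sigmoid(np.dot(X, weights))
    cost = -1/m * (np.dot(y, np.log(h)) + np.dot((1 - y), np.log(1 - h)))
    gradient = 1/m * np.dot(X.T, (h - y))
    return cost, gradient

def numerical_gradient(X, y, weights, eps=1e-6):
    """Central-difference estimate of the cost gradient (hypothetical helper)."""
    grad = np.zeros_like(weights)
    for j in range(len(weights)):
        step = np.zeros_like(weights)
        step[j] = eps
        cost_plus, _ = cost_fun(X, y, weights + step)
        cost_minus, _ = cost_fun(X, y, weights - step)
        grad[j] = (cost_plus - cost_minus) / (2 * eps)
    return grad

# Randomly generated data, used only to exercise the two gradients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.integers(0, 2, size=50)
w_demo = rng.normal(scale=0.1, size=3)

_, analytic = cost_fun(X_demo, y_demo, w_demo)
numeric = numerical_gradient(X_demo, y_demo, w_demo)
max_diff = np.max(np.abs(analytic - numeric))
print("largest difference between the two gradients:", max_diff)
```

A difference many orders of magnitude smaller than the gradient entries themselves suggests the formula in `cost_fun` matches the cost it is paired with.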

In [13]:
# Gradient Descent
def gradient_descent(X, y, weights, learning_rate, iterations):
    cost_history = []
    for _ in range(iterations):
        cost, gradient = cost_fun(X, y, weights)
        weights -= learning_rate * gradient  # in-place update: also modifies the array passed in
        cost_history.append(cost)            # cost measured before this step's update
    return weights, cost_history
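
A quick sanity check for the loop is that the recorded cost starts at log(2) for zero weights and never increases when the learning rate is small enough. The sketch below runs the same update on randomly generated data; `gradient_descent_copy` is a hypothetical variant that copies the initial weights so the caller's array is left untouched.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.asarray(z)))

def cost_fun(X, y, weights):
    m = len(y)
    h = sigmoid(np.dot(X, weights))
    cost = -1/m * (np.dot(y, np.log(h)) + np.dot((1 - y), np.log(1 - h)))
    gradient = 1/m * np.dot(X.T, (h - y))
    return cost, gradient

def gradient_descent_copy(X, y, weights, learning_rate, iterations):
    """Same loop as the notebook, but works on a copy of the weights (hypothetical variant)."""
    weights = weights.copy()
    cost_history = []
    for _ in range(iterations):
        cost, gradient = cost_fun(X, y, weights)
        weights -= learning_rate * gradient
        cost_history.append(cost)
    return weights, cost_history

# Randomly generated, noisy two-feature problem (not the heart dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo @ np.array([2.0, -1.0]) + noise > 0).astype(int)

w0 = np.zeros(2)
w_fit, history = gradient_descent_copy(X_demo, y_demo, w0, learning_rate=0.1, iterations=500)
print("first cost:", history[0], "last cost:", history[-1])
```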

In [14]:
def train(X_train, y_train, learning_rate = 0.01, iterations = 1000):
    # Start from zero weights, one per feature column (no intercept term is added)
    weights = np.zeros(X_train.shape[1])
    weights, cost_history = gradient_descent(X_train, y_train, weights, learning_rate, iterations)
    return weights, cost_history
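
Because `train` creates one weight per feature column, the model has no intercept: a row of all zeros always gets a predicted probability of 0.5. If an intercept is wanted, one common approach is to prepend a column of ones to the feature matrix. The helper `add_intercept` below is a hypothetical addition, shown on a tiny made-up array.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the first weight acts as the bias (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

# Made-up feature matrix with 3 samples and 2 features
X_demo = np.array([[0.5, -1.2],
                   [1.5,  0.3],
                   [-0.7, 2.0]])
X_with_bias = add_intercept(X_demo)

# One weight per column, so the bias is weights_demo[0]
weights_demo = np.zeros(X_with_bias.shape[1])
print(X_with_bias.shape, weights_demo.shape)
```

The same helper would have to be applied to both the training and the test features before calling `train` and `predict`.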

# **Model Training and Evaluation**

In [15]:
# Train the model
weights, cost_history = train(X_train, y_train)

In [16]:
# Model Evaluation
def predict(X, weights):
    # Classify as 1 when the predicted probability is at least 0.5
    probabilities = sigmoid(np.dot(X, weights))
    return (probabilities >= 0.5).astype(int)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = predict(X_test, weights) # Predict for the test
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")

Accuracy: 0.8852459016393442
Precision: 0.9032258064516129
Recall: 0.875
F1-score: 0.8888888888888888
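
The four metrics above all derive from the confusion-matrix counts. As an illustration, the sketch below computes those counts with NumPy on a small made-up label vector (not the notebook's predictions) and rebuilds accuracy, precision, and recall from them. `confusion_counts` is a hypothetical helper; it lays the matrix out with true classes as rows and predicted classes as columns.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """2x2 matrix [[TN, FP], [FN, TP]] for binary labels (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return np.array([[tn, fp], [fn, tp]])

# Made-up labels for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_counts(y_true_demo, y_pred_demo)
(tn, fp), (fn, tp) = cm

accuracy_demo = (tp + tn) / cm.sum()
precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)
print(cm)
print(accuracy_demo, precision_demo, recall_demo)
```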


## **Comparison with Scikit-learn Logistic Regression**

We now train `LogisticRegression` from `scikit-learn` on the same train/test split and report its accuracy, precision, recall, F1-score, and confusion matrix, so the results can be compared with those of the custom implementation.


In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Train scikit-learn logistic regression model
sk_model = LogisticRegression(max_iter=1000)
sk_model.fit(X_train, y_train)

# Predict and evaluate
y_pred_sk = sk_model.predict(X_test)

print("=== Scikit-learn Logistic Regression ===")
print("Accuracy:", accuracy_score(y_test, y_pred_sk))
print("Precision:", precision_score(y_test, y_pred_sk))
print("Recall:", recall_score(y_test, y_pred_sk))
print("F1 Score:", f1_score(y_test, y_pred_sk))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_sk))
print("Classification Report:\n", classification_report(y_test, y_pred_sk))

=== Scikit-learn Logistic Regression ===
Accuracy: 0.8360655737704918
Precision: 0.8666666666666667
Recall: 0.8125
F1 Score: 0.8387096774193549
Confusion Matrix:
 [[25  4]
 [ 6 26]]
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.86      0.83        29
           1       0.87      0.81      0.84        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61
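
The two models are not set up identically: by default, `LogisticRegression` applies L2 regularization (`C=1.0`) and fits an intercept, while the custom model uses neither. A rough sketch of a closer comparison is shown below, on randomly generated data rather than the heart dataset. It assumes that a very large `C` makes the penalty negligible and that plain gradient descent has been run long enough to converge; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1 / (1 + np.exp(-np.asarray(z)))

# Randomly generated data with labels drawn from a known logistic model
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
true_w = np.array([1.5, -1.0, 0.5])
y_demo = (rng.random(300) < sigmoid(X_demo @ true_w)).astype(int)

# Plain batch gradient descent: no intercept, no regularization
w_gd = np.zeros(3)
for _ in range(5000):
    h = sigmoid(X_demo @ w_gd)
    w_gd -= 0.5 * (X_demo.T @ (h - y_demo)) / len(y_demo)

# scikit-learn configured to be as close as possible to the setup above
sk_demo = LogisticRegression(C=1e6, fit_intercept=False, max_iter=1000)
sk_demo.fit(X_demo, y_demo)
w_sk = sk_demo.coef_.ravel()

print("gradient descent:", w_gd)
print("scikit-learn:    ", w_sk)
```

When the two weight vectors agree, any remaining gap between the models in the notebook is more likely due to the differing settings than to the optimizer.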



# **Summary**

> In this notebook, we implemented a logistic regression model from scratch to predict the likelihood of a heart attack.

- **Data Cleaning**:
  1. The dataset contained no null values.
  2. One duplicate row was found and dropped.
  3. The numerical features were standardized.

- **Model Implementation**: We implemented the logistic regression model, including the sigmoid function, cost function, and gradient descent.

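- **Model Evaluation**: On the 61-sample test set, the custom model achieved an accuracy of 88.52%, precision of 90.32%, recall of 87.50%, and F1-score of 88.89%. For comparison, scikit-learn's `LogisticRegression` scored 83.61%, 86.67%, 81.25%, and 83.87% on the same split.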

- **Challenges and Insights:** The learning rate and the number of iterations both have a large impact on the model's accuracy.
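
The effect of the learning rate can be illustrated with a small sweep. The sketch below runs the same gradient descent update for 1000 iterations at several learning rates on randomly generated data (not the heart dataset) and reports the final cost for each; `run_gd` and the `*_demo` names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.asarray(z)))

def cost_fun(X, y, weights):
    m = len(y)
    h = sigmoid(np.dot(X, weights))
    cost = -1/m * (np.dot(y, np.log(h)) + np.dot((1 - y), np.log(1 - h)))
    gradient = 1/m * np.dot(X.T, (h - y))
    return cost, gradient

def run_gd(X, y, learning_rate, iterations):
    """Run gradient descent from zero weights and return the final cost (hypothetical helper)."""
    weights = np.zeros(X.shape[1])
    for _ in range(iterations):
        _cost, gradient = cost_fun(X, y, weights)
        weights -= learning_rate * gradient
    final_cost, _grad = cost_fun(X, y, weights)
    return final_cost

# Randomly generated data with labels drawn from a known logistic model
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 4))
p_demo = sigmoid(X_demo @ np.array([1.0, -2.0, 0.5, 0.0]))
y_demo = (rng.random(300) < p_demo).astype(int)

final_costs = {lr: run_gd(X_demo, y_demo, lr, 1000) for lr in (0.001, 0.01, 0.1, 1.0)}
for lr, final_cost in final_costs.items():
    print(f"learning_rate={lr}: final cost {final_cost:.4f}")
```

On data like this, the smallest learning rate is still far from converged after 1000 iterations, which is one way the two hyperparameters interact.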


Overall, this project walked through the complete workflow of building and evaluating a logistic regression model from scratch, from data cleaning and gradient descent training to a comparison against scikit-learn's implementation.