Data Analytics II
1. Implement logistic regression using Python/R to perform classification on the Social_Network_Ads.csv dataset.
2. Compute the confusion matrix to find TP, FP, TN, FN, accuracy, error rate, precision, and recall on the given dataset.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv("Social_Network_Ads.csv")
print(df.head())
print(df.info())
print(df.describe())

    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB
None
            User ID         Age  EstimatedSalary   Purchased
count  4.000000e+02  400.000000       400.000000  400.000000
mean   1.569154e+07   37.655000     69742.500000    0.357500
std    7.165832e+04   10

In [4]:
# Encode the categorical Gender column as integers.
# LabelEncoder assigns codes in alphabetical order: Female -> 0, Male -> 1.
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

# Drop the 'User ID' column
df.drop(["User ID"], axis=1, inplace=True)

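# Min-max normalization: rescale a column of df to the range [0, 1] in place.
# Note: the min and max are taken over the full dataset, before the train/test split,
# and StandardScaler is applied again later, so this step does not change the fitted model.
def min_max_normalize(feature):
    df[feature] = (df[feature] - df[feature].min()) / (df[feature].max() - df[feature].min())

min_max_normalize("EstimatedSalary")
min_max_normalize("Age")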

In [5]:
df.head()

   Gender       Age  EstimatedSalary  Purchased
0       1   0.02381          0.02963          0
1       1  0.404762         0.037037          0
2       0  0.190476         0.207407          0
3       0  0.214286         0.311111          0
4       1   0.02381         0.451852          0


In [6]:
X = df[['Gender', 'Age', 'EstimatedSalary']]  # Features
y = df['Purchased']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Feature scaling: standardize the features with StandardScaler,
# fitting on the training set only and reusing that fit for the test set.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [8]:
# Train the model: create a logistic regression classifier and fit it on the training data.
model = LogisticRegression()
model.fit(X_train, y_train)

In [9]:
# Make predictions: use the trained model to predict labels for the test set.
y_pred = model.predict(X_test)

In [13]:
# Evaluate the model: compute the confusion matrix and the evaluation metrics
# (accuracy, error rate, precision, recall, F1 score).
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

# Extract TP, TN, FP, FN from the confusion matrix
TN, FP, FN, TP = cm.ravel()

# Calculate accuracy, error rate, precision, recall, and F1 score
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
# The default average='binary' reports precision for the positive class,
# TP / (TP + FP), consistent with recall_score and f1_score below
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Output the results
print(f"Confusion Matrix:\n{cm}")
print(f"True Positive (TP): {TP}")
print(f"False Positive (FP): {FP}")
print(f"True Negative (TN): {TN}")
print(f"False Negative (FN): {FN}")
print(f"Accuracy: {accuracy:.2f}")
print(f"Error Rate: {error_rate:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Confusion Matrix:
[[50  2]
 [ 7 21]]
True Positive (TP): 21
False Positive (FP): 2
True Negative (TN): 50
False Negative (FN): 7
Accuracy: 0.89
Error Rate: 0.11
Precision: 0.91
Recall: 0.75
F1 Score: 0.82


In [None]:
# Visualize the confusion matrix as a heatmap.
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Purchased', 'Purchased'], yticklabels=['Not Purchased', 'Purchased'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

### **1. What is Logistic Regression?**
Logistic Regression is a statistical method used for binary classification problems, where the outcome variable is categorical with two classes (e.g., Purchased vs. Not Purchased). It is used to estimate the probability that a given input point belongs to a certain class.

**Formula**:
The logistic regression model predicts the probability of a binary outcome using the logistic function:

$$ P(Y = 1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}} $$

Where:
- \( P(Y = 1|X) \) = Probability of the positive class (e.g., Purchased = 1)
- \( \beta_0 \) = Intercept (bias term)
- \( \beta_1, \beta_2, \dots, \beta_n \) = Coefficients for the input features \( X_1, X_2, \dots, X_n \)
- \( e \) = The base of the natural logarithm
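
To make the formula concrete, the following is a minimal sketch that evaluates it by hand for a single sample. The helper name `predict_probability` and the coefficient values are made up for illustration; they are not the coefficients fitted in this notebook.

```python
import math

def predict_probability(features, intercept, coefficients):
    """Return P(Y = 1 | X) for one sample under a logistic regression model."""
    # Linear combination z = beta_0 + beta_1 * x_1 + ... + beta_n * x_n
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # The logistic function maps z to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for (Gender, Age, EstimatedSalary); illustrative only
intercept = -1.0
coefficients = [0.1, 2.0, 1.2]

# One hypothetical, already-scaled sample
p = predict_probability([1.0, 0.5, -0.3], intercept, coefficients)
print(f"P(Purchased = 1) = {p:.3f}")
```

When every term in the exponent is zero, the function returns exactly 0.5, which is why 0.5 is the natural default cutoff.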

### **2. Key Concepts**

- **Logistic Function (Sigmoid)**: Maps any input to a value between 0 and 1, which represents a probability.
  
  $$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

  - Where \( z = \beta_0 + \beta_1 X_1 + \dots \).

- **Decision Boundary**: The model classifies an input as class 1 (e.g., Purchased) when its predicted probability exceeds a threshold, and as class 0 (e.g., Not Purchased) otherwise. The threshold is typically 0.5, which corresponds to the boundary \( z = 0 \) in feature space, but it can be adjusted to trade sensitivity against specificity.

  $$ \text{If } P(Y = 1|X) > 0.5, \text{ classify as class 1} $$  
  $$ \text{If } P(Y = 1|X) \leq 0.5, \text{ classify as class 0} $$
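
The effect of moving the threshold can be sketched in a few lines. The probabilities below are hypothetical values chosen for illustration, and `classify` is a helper name introduced here, not part of any library.

```python
def classify(probabilities, threshold=0.5):
    """Assign class 1 when the probability exceeds the threshold, else class 0."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities for five users
probabilities = [0.10, 0.45, 0.50, 0.62, 0.90]

default_labels = classify(probabilities)       # threshold 0.5
lenient_labels = classify(probabilities, 0.4)  # lower threshold flags more positives

print(default_labels)
print(lenient_labels)
```

With a fitted scikit-learn model, the probabilities would typically come from `model.predict_proba(X_test)[:, 1]`. Lowering the threshold tends to raise recall at the cost of precision.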

### **3. Model Training**
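**Fit the Model**: We estimate the model parameters (\( \beta_0, \beta_1, \dots \)) by minimizing the **Log-Loss** (also known as **Binary Cross-Entropy**) with an iterative optimizer such as Gradient Descent. scikit-learn's `LogisticRegression` uses the L-BFGS solver by default and adds an L2 regularization term to this loss.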

   **Log-Loss Function**:
   
   $$ \text{Log-Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $$

   Where:
   - \( y_i \) = Actual label for the \( i \)-th data point
   - \( p_i \) = Predicted probability of class 1 for the \( i \)-th data point
   - \( m \) = Number of data points
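
As a rough illustration of the training idea, the sketch below computes the log-loss by hand and runs plain batch gradient descent on a single feature. The toy data, learning rate, step count, and function names are all assumptions made for this example; this is not how scikit-learn fits the model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy over m samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip to avoid log(0) when a probability is exactly 0 or 1
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def fit_gradient_descent(xs, ys, learning_rate=0.5, steps=200):
    """Fit beta_0 and beta_1 for a single feature by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        preds = [sigmoid(b0 + b1 * x) for x in xs]
        # Gradient of the log-loss: mean of (p_i - y_i), times x_i for the slope
        grad_b0 = sum(p - y for p, y in zip(preds, ys)) / m
        grad_b1 = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / m
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Toy one-feature dataset (made up): larger x tends to mean class 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

# With all parameters at zero, every prediction is 0.5
initial_loss = log_loss(ys, [sigmoid(0.0) for _ in xs])
b0, b1 = fit_gradient_descent(xs, ys)
final_loss = log_loss(ys, [sigmoid(b0 + b1 * x) for x in xs])

print(f"loss before training: {initial_loss:.3f}")
print(f"loss after training:  {final_loss:.3f}")
print(f"beta_0 = {b0:.3f}, beta_1 = {b1:.3f}")
```

The loss starts at \( \log 2 \) because every prediction is 0.5, and it falls as the slope moves toward a positive value that separates the two classes as well as this overlapping data allows.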

### **4. Model Evaluation Metrics**

Once the model is trained and predictions are made, we evaluate its performance using various metrics:

#### **Confusion Matrix**:
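A confusion matrix summarizes the predictions against the actual values. For binary classification it is a 2x2 matrix; in the layout below, which is the one scikit-learn's `confusion_matrix` returns for labels 0 and 1, rows are actual classes and columns are predicted classes.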

$$
\begin{bmatrix}
TN & FP \\
FN & TP
\end{bmatrix}
$$

Where:
- **True Positive (TP)**: Correctly predicted positive class (e.g., Purchased = 1)
- **False Positive (FP)**: Incorrectly predicted positive class (e.g., Not Purchased, but predicted as Purchased)
- **True Negative (TN)**: Correctly predicted negative class (e.g., Not Purchased = 0)
- **False Negative (FN)**: Incorrectly predicted negative class (e.g., Purchased, but predicted as Not Purchased)
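
The four counts can also be tallied by hand, which makes the definitions above explicit. This is a small sketch on made-up labels; `confusion_counts` is a hypothetical helper, not a library function.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (1 = positive, 0 = negative)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # positive, predicted positive
        elif actual == 0 and predicted == 1:
            fp += 1  # negative, predicted positive
        elif actual == 0 and predicted == 0:
            tn += 1  # negative, predicted negative
        else:
            fn += 1  # positive, predicted negative
    return tn, fp, fn, tp

# Made-up labels for eight samples
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
# Same layout as the matrix above: rows are actual, columns are predicted
matrix = [[tn, fp], [fn, tp]]
print(matrix)
```

The return order matches the unpacking `TN, FP, FN, TP = cm.ravel()` used in the evaluation cell.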

#### **Accuracy**:
Measures the overall correctness of the model.

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

#### **Error Rate**:
The fraction of incorrect predictions.

$$ \text{Error Rate} = \frac{FP + FN}{TP + TN + FP + FN} $$

#### **Precision**:
The fraction of positive predictions that are actually positive.

$$ \text{Precision} = \frac{TP}{TP + FP} $$

#### **Recall (Sensitivity or True Positive Rate)**:
The fraction of actual positives that are correctly predicted.

$$ \text{Recall} = \frac{TP}{TP + FN} $$

#### **F1 Score**:
The harmonic mean of precision and recall, balancing both metrics.

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
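
As a worked check, the formulas above can be applied directly to the counts from this notebook's confusion matrix (TN = 50, FP = 2, FN = 7, TP = 21). The sketch uses plain arithmetic rather than scikit-learn.

```python
# Counts taken from the confusion matrix printed by the evaluation cell
tn, fp, fn, tp = 50, 2, 7, 21
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
precision = tp / (tp + fp)   # positive-class precision
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [
    ("Accuracy", accuracy),
    ("Error Rate", error_rate),
    ("Precision", precision),
    ("Recall", recall),
    ("F1 Score", f1),
]:
    print(f"{name}: {value:.2f}")
```

Precision here is the positive-class value, 21/23 (about 0.91). Calling `precision_score` with `average='macro'` would instead average the per-class precisions, which gives about 0.90 for these counts.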

