Problem Statement: **Data Analytics II**
* Load the **Social\_Network\_Ads.csv** dataset.
* Explore and preprocess the data (handle missing values, encode categoricals, etc.).
* Build and train a **Logistic Regression** model.
* Use the model to make predictions on test data.
* Compute the **Confusion Matrix**.
* Calculate performance metrics:
  * True Positive (TP)
  * False Positive (FP)
  * True Negative (TN)
  * False Negative (FN)
  * Accuracy, Error Rate, Precision, Recall.

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

### Read CSV File

In [2]:
df = pd.read_csv('5_Social_Network_Ads.csv')  # Load the dataset
df.head()

    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0


### Understand Data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


In [4]:
df.isnull().sum()

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

### Drop Unused Columns

In [5]:
df = df.drop(['User ID', 'Gender'], axis=1)  # User ID is only an identifier; Gender is not used as a feature here

### Separate features (X) and target (y) for supervised learning

In [6]:
X = df.drop('Purchased', axis=1)
y = df['Purchased']

### Standardize features (important for Logistic Regression)

### **StandardScaler()**
<p>
  <code>StandardScaler()</code> standardizes features by removing the mean and scaling to unit variance. 
  It transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
</p>

<h3>Mathematical Formula:</h3>
<p><code>z = (x − μ) / σ</code></p>

<h3>Where:</h3>
<ul>
  <li><strong>x:</strong> Original feature value</li>
  <li><strong>μ:</strong> Mean of the feature</li>
  <li><strong>σ:</strong> Standard deviation of the feature</li>
  <li><strong>z:</strong> Standardized value (output)</li>
</ul>

<p>
 Output: By default, <code>fit_transform</code> returns a NumPy array in which each feature is centered and scaled.
</p>
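
The formula can be checked by hand on a few values. The sketch below is a plain-Python illustration, not the scikit-learn implementation; it uses the five `Age` values shown in the `df.head()` output and the population standard deviation (dividing by n), which is the convention `StandardScaler` follows. The helper name `standardize` is an assumption made for this example.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population std (divides by n)
    return [(x - mu) / sigma for x in values]

# First five Age values from the df.head() output
ages = [19, 35, 26, 27, 19]
z = standardize(ages)

print([round(v, 3) for v in z])
# After standardizing, the mean is ~0 and the standard deviation is ~1
print(abs(fmean(z)) < 1e-9, abs(pstdev(z) - 1) < 1e-9)
```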


In [7]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learn each feature's mean and std, then transform

### Split the dataset into training and testing sets

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
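
Because the scaler in this notebook is fitted on all 400 rows before the split, the test rows also contribute to the mean and standard deviation. A common alternative is to split first and estimate the scaling statistics from the training rows only. The sketch below shows that idea in plain Python on a hypothetical toy column (the values and variable names are assumptions, not the real dataset). With scikit-learn, the equivalent would be `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`.

```python
import random
from statistics import fmean, pstdev

# Hypothetical toy feature column (not the real dataset)
values = [19, 35, 26, 27, 19, 27, 32, 25, 35, 26, 20, 48]

# Shuffle indices reproducibly and hold out 25% as the test set
indices = list(range(len(values)))
random.Random(42).shuffle(indices)
n_test = len(values) // 4
test_idx, train_idx = indices[:n_test], indices[n_test:]

train = [values[i] for i in train_idx]
test = [values[i] for i in test_idx]

# Estimate the scaling statistics from the training rows only
mu, sigma = fmean(train), pstdev(train)

train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]  # reuses the training mu and sigma

print(len(train_scaled), len(test_scaled))
```

Applying this order to the notebook would probably change the reported numbers slightly, since the scaling statistics would differ.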

### Model Building

<h2>Logistic Regression</h2>

<h3>Definition:</h3>
<p>
  Logistic Regression is a statistical method used for binary classification problems. 
  It models the probability that a given input point belongs to a particular category (class 0 or 1).
</p>

<h3>Mathematical Formula:</h3>
<p>
  The logistic regression model estimates the probability <strong>p</strong> that the dependent variable <strong>y</strong> equals 1 (positive class) as:
</p>
<p><code>p = 1 / (1 + e<sup>−(β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ)</sup>)</code></p>

<h3>Where:</h3>
<ul>
  <li><strong>p:</strong> Predicted probability of the positive class</li>
  <li><strong>β₀:</strong> Intercept term</li>
  <li><strong>β₁, β₂, ..., βₙ:</strong> Coefficients for each feature</li>
  <li><strong>x₁, x₂, ..., xₙ:</strong> Feature values</li>
</ul>

<h3>Log-Odds (Logit) Transformation:</h3>
<p>
  Taking the natural logarithm of the odds <code>p / (1 − p)</code> gives the logit (log-odds), which is linear in the features:
</p>
<p><code>log(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ</code></p>
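
The two formulas are inverses of each other, which a short numerical check makes concrete. The sketch below is a minimal illustration in plain Python; the coefficients `beta0` and `betas` and the feature values `x` are hypothetical, chosen only for the example, and are not the values learned by the model in this notebook.

```python
import math

def sigmoid(z):
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p; the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and standardized feature values, for illustration only
beta0, betas = -1.0, [2.0, 1.0]
x = [0.5, -0.25]

z = beta0 + sum(b * xi for b, xi in zip(betas, x))  # linear score
p = sigmoid(z)

# logit(p) recovers the linear score z
print(f"z = {z:.2f}, p = {p:.3f}, logit(p) = {logit(p):.2f}")
```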


In [11]:
model = LogisticRegression()

In [12]:
model.fit(X_train, y_train)

### Predict Results on Test Data

In [13]:
y_pred = model.predict(X_test)
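
For a binary problem, `model.predict` returns the class with the higher predicted probability, which amounts to thresholding the positive-class probability at 0.5 (the probabilities themselves are available from `model.predict_proba`). The sketch below illustrates that thresholding step in plain Python; the function name `predict_labels` and the probabilities are hypothetical.

```python
def predict_labels(probabilities, threshold=0.5):
    """Assign class 1 when the predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities of the positive class
probs = [0.08, 0.47, 0.53, 0.91]
labels = predict_labels(probs)
print(labels)  # [0, 0, 1, 1]
```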

### Evaluate Results

In [14]:
cm = confusion_matrix(y_test, y_pred)  # rows = actual class, columns = predicted class
tn, fp, fn, tp = cm.ravel()  # flattened row by row: TN, FP, FN, TP

print(f"Confusion Matrix:\n{cm}")
print(f"\nTrue Positives (TP): {tp}")
print(f"False Positives (FP): {fp}")
print(f"True Negatives (TN): {tn}")
print(f"False Negatives (FN): {fn}")

Confusion Matrix:
[[61  2]
 [12 25]]

True Positives (TP): 25
False Positives (FP): 2
True Negatives (TN): 61
False Negatives (FN): 12
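
To make the four cells concrete, the sketch below counts them directly from pairs of actual and predicted labels, treating 1 as the positive class. It is a plain-Python illustration rather than scikit-learn's implementation; the function name `confusion_counts` and the demo labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels, for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 1, 0]

tn_d, fp_d, fn_d, tp_d = confusion_counts(y_true_demo, y_pred_demo)
# Same layout as the binary confusion matrix: rows = actual, columns = predicted
matrix = [[tn_d, fp_d], [fn_d, tp_d]]
print(matrix)  # [[2, 1], [1, 2]]
```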


In [15]:
# Derived Metrics
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"\nAccuracy: {accuracy:.2f}")
print(f"Error Rate: {error_rate:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")


Accuracy: 0.86
Error Rate: 0.14
Precision: 0.93
Recall: 0.68
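
These values can be reproduced from the four counts alone, which is a useful sanity check on the metric definitions. The sketch below hard-codes the counts from the confusion matrix printed earlier (TN = 61, FP = 2, FN = 12, TP = 25) and recomputes each metric without scikit-learn.

```python
# Counts taken from the confusion matrix printed earlier in this notebook
tn, fp, fn, tp = 61, 2, 12, 25
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all predictions that are correct
error_rate = (fp + fn) / total    # share of all predictions that are wrong
precision = tp / (tp + fp)        # of predicted positives, how many are real
recall = tp / (tp + fn)           # of real positives, how many were found

print(f"{accuracy:.2f} {error_rate:.2f} {precision:.2f} {recall:.2f}")
# 0.86 0.14 0.93 0.68
```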
