# Part A: Logistic Regression (Bank Dataset) 

## 1. Create a Logistic Regression Model

In [3]:
# pip install ucimlrepo
#already installed

In [4]:
import pandas as pd
from ucimlrepo import fetch_ucirepo

# Load the Bank Marketing dataset (UCI ID: 222)
bank_data = fetch_ucirepo(id=222)

# Extract features and target
X = bank_data.data.features
y = bank_data.data.targets

# Combine into a single DataFrame for easier handling
df = pd.concat([X, y], axis=1)

# Display the first few rows
df.head()

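| | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | NaN | 5 | may | 261 | 1 | -1 | 0 | NaN | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | NaN | 5 | may | 151 | 1 | -1 | 0 | NaN | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NaN | 5 | may | 76 | 1 | -1 | 0 | NaN | no |
| 3 | 47 | blue-collar | married | NaN | no | 1506 | yes | no | NaN | 5 | may | 92 | 1 | -1 | 0 | NaN | no |
| 4 | 33 | NaN | single | NaN | no | 1 | no | no | NaN | 5 | may | 198 | 1 | -1 | 0 | NaN | no |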


We fetch the **Bank Marketing dataset** using `fetch_ucirepo(id=222)`.

The dataset includes information on clients contacted during a marketing campaign.

- Features are stored in `X` and the target (`y`) indicates whether the client subscribed to a term deposit.
- We combine both into a single DataFrame for preprocessing.


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          45211 non-null  int64 
 1   job          44923 non-null  object
 2   marital      45211 non-null  object
 3   education    43354 non-null  object
 4   default      45211 non-null  object
 5   balance      45211 non-null  int64 
 6   housing      45211 non-null  object
 7   loan         45211 non-null  object
 8   contact      32191 non-null  object
 9   day_of_week  45211 non-null  int64 
 10  month        45211 non-null  object
 11  duration     45211 non-null  int64 
 12  campaign     45211 non-null  int64 
 13  pdays        45211 non-null  int64 
 14  previous     45211 non-null  int64 
 15  poutcome     8252 non-null   object
 16  y            45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [7]:
df.describe()

| | age | balance | day_of_week | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| count | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 | 45211.0 |
| mean | 40.93621 | 1362.272058 | 15.806419 | 258.16308 | 2.763841 | 40.197828 | 0.580323 |
| std | 10.618762 | 3044.765829 | 8.322476 | 257.527812 | 3.098021 | 100.128746 | 2.303441 |
| min | 18.0 | -8019.0 | 1.0 | 0.0 | 1.0 | -1.0 | 0.0 |
| 25% | 33.0 | 72.0 | 8.0 | 103.0 | 1.0 | -1.0 | 0.0 |
| 50% | 39.0 | 448.0 | 16.0 | 180.0 | 2.0 | -1.0 | 0.0 |
| 75% | 48.0 | 1428.0 | 21.0 | 319.0 | 3.0 | -1.0 | 0.0 |
| max | 95.0 | 102127.0 | 31.0 | 4918.0 | 63.0 | 871.0 | 275.0 |


In [8]:
#Check for missing values
df.isnull().sum()

age                0
job              288
marital            0
education       1857
default            0
balance            0
housing            0
loan               0
contact        13020
day_of_week        0
month              0
duration           0
campaign           0
pdays              0
previous           0
poutcome       36959
y                  0
dtype: int64

In [9]:
# Let's just confirm the unique values in one example column
df['job'].unique()

array(['management', 'technician', 'entrepreneur', 'blue-collar', nan,
       'retired', 'admin.', 'services', 'self-employed', 'unemployed',
       'housemaid', 'student'], dtype=object)

In [10]:
df[['job', 'education', 'contact', 'poutcome']] = df[['job', 'education', 'contact', 'poutcome']].fillna('unknown')

#### Handling Missing Values in the Bank Dataset

Although the UCI repository documentation states that the "bank-full.csv" dataset has **no missing values**, when we load it using the `ucimlrepo` package, we observe `NaN` values in several categorical columns such as:

- job
- education
- contact
- poutcome

These `NaN` values do not reflect data missing from the original dataset. They result from the loader interpreting 'unknown' entries as `NaN` (which is common in some automated preprocessing pipelines).

Rather than dropping these rows or imputing a substitute value such as the most frequent category (which would overwrite what the source data actually recorded), we restore them to their original value of 'unknown'. This retains all 45,211 records and lets the model treat "unknown" as its own category level.
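The restoration step can be sketched on a toy frame. This is a minimal illustration, assuming a hypothetical three-row DataFrame (`toy`) and a two-column subset (`cat_cols`); the values are made up and are not rows from the bank data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data (illustrative values only)
toy = pd.DataFrame({
    "job": ["management", np.nan, "technician"],
    "poutcome": [np.nan, np.nan, "success"],
    "age": [58, 33, 44],
})

# Hypothetical subset of the affected categorical columns
cat_cols = ["job", "poutcome"]

# Restore the 'unknown' level that the loader turned into NaN
toy[cat_cols] = toy[cat_cols].fillna("unknown")

# Verify that nothing is missing and count the restored entries per column
remaining_nulls = int(toy.isnull().sum().sum())
unknown_counts = (toy[cat_cols] == "unknown").sum().to_dict()
print(remaining_nulls, unknown_counts)
```

Running the same two checks on the real `df` after the `fillna` call is a quick way to confirm that no `NaN` values remain before encoding.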


In [12]:
# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)

#### Encoding Categorical Variables

To build a logistic regression model, all input features must be numeric.

Since many features in this dataset (like `job`, `marital`, `education`, etc.) are categorical, we use **one-hot encoding** (`pd.get_dummies`) to convert them into binary columns.

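This creates one binary column per category level, so the model can assign a separate coefficient to each level.

We drop the first category from each feature (`drop_first=True`) to avoid perfect multicollinearity: with all levels kept, the dummy columns of a feature always sum to 1 and are redundant with the intercept. Because `get_dummies` is applied to the whole DataFrame, the target `y` is encoded as well and becomes the single column `y_yes`.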
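To see what `drop_first=True` does to the column names, here is a small sketch on a hypothetical three-row frame (`toy`); the rows are invented for illustration.

```python
import pandas as pd

# Toy frame with one numeric column, one categorical feature and the target
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "divorced"],
    "y": ["no", "yes", "no"],
})

# Categories are sorted alphabetically and the first level of each is dropped
encoded = pd.get_dummies(toy, drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
```

The dropped levels (`divorced` for `marital`, `no` for `y`) become the reference categories, which is why the target ends up in a column named `y_yes`.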

In [14]:
# Separate features and target
X = df_encoded.drop('y_yes', axis=1)
y = df_encoded['y_yes']

#### Split Features and Target

We separate the dataset into:
- `X` = all feature columns (input)
- `y` = target column `y_yes`, the one-hot encoded version of the original `'y'` column

No separate conversion of `'yes'`/`'no'` is needed: `y_yes` is already binary (true for 'yes', false for 'no'), so logistic regression can model it directly.

In [16]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

#### Train-Test Split

We divide the dataset into training and testing sets using an 80:20 split, stratified on `y` so that both sets keep the same share of subscribers (about 11.7%).  
This helps us evaluate how well the model generalizes to unseen data.
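The effect of `stratify` can be checked on synthetic labels. The sketch below assumes made-up data with exactly 10% positives (`X_toy`, `y_toy` are hypothetical names) and compares the positive rate in the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, exactly 10% positives
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

# Plain random split: proportions can drift by chance
_, _, y_tr_r, y_te_r = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

print("stratified test positive rate:", y_te_s.mean())
print("random test positive rate:", y_te_r.mean())
```

With a minority class this small, keeping the proportion fixed makes the test-set recall and precision less dependent on the luck of the split.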

In [18]:
from sklearn.preprocessing import StandardScaler

# Initialize scaler
scaler = StandardScaler()

# Fit on training data, transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

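#### Feature Scaling with StandardScaler

scikit-learn's `LogisticRegression` fits its coefficients with an iterative solver (`lbfgs` by default), and features with very different ranges (e.g., age vs. balance) can slow its convergence. The default L2 penalty also treats all coefficients alike, so unscaled features would be penalized unevenly.

We use `StandardScaler` to standardize the input features so that, on the training set, each has a mean of 0 and a standard deviation of 1.

The scaler is fitted on the training data only and then applied to the test data, so no test-set statistics leak into training.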
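A quick way to confirm what the scaler does is to inspect the column statistics after transforming. This sketch uses synthetic two-column data on very different scales (the means and spreads are arbitrary assumptions, loosely echoing age and balance).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales
rng = np.random.default_rng(0)
train = np.column_stack([rng.normal(40, 10, 500), rng.normal(1400, 3000, 500)])
test = np.column_stack([rng.normal(40, 10, 100), rng.normal(1400, 3000, 100)])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # statistics come from train only
test_scaled = scaler_demo.transform(test)        # reuse the training statistics

print("train mean:", train_scaled.mean(axis=0))
print("train std:", train_scaled.std(axis=0))
print("test mean:", test_scaled.mean(axis=0))
```

The training columns come out with mean 0 and standard deviation 1 up to floating-point error. The test columns are only approximately standardized, which is expected because they are scaled with the training statistics.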

In [20]:
from sklearn.linear_model import LogisticRegression

# Train the logistic regression model using scaled features
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train_scaled, y_train)

#### Training the Logistic Regression Model

We train the logistic regression model using the **scaled training data**. Scaling is important because it helps the solver (`lbfgs`) converge faster, especially when feature values vary widely.

**Parameters used:**
- `max_iter=1000`: Increases max iterations to avoid convergence issues.
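- `random_state=42`: Fixes the random seed. It only affects the `sag`, `saga`, and `liblinear` solvers, so it has no effect with the default `lbfgs` solver used here.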

The model will now learn to predict if a customer will subscribe to a term deposit.

## 2. Evaluate the model

In [23]:
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score

# Predict on test data
y_pred = lr_model.predict(X_test_scaled)

# Print evaluation metrics
print("Logistic Regression Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Logistic Regression Performance:
Accuracy: 0.901249585314608
Precision: 0.643979057591623
Recall: 0.3487712665406427
F1 Score: 0.45248313917841815

Classification Report:
               precision    recall  f1-score   support

       False       0.92      0.97      0.95      7985
        True       0.64      0.35      0.45      1058

    accuracy                           0.90      9043
   macro avg       0.78      0.66      0.70      9043
weighted avg       0.89      0.90      0.89      9043



#### Model Evaluation

We evaluated the performance of our logistic regression model using four key metrics (precision, recall, and F1 are for the positive class, i.e. clients who subscribed):

- **Accuracy** (0.90): How often the model gets predictions right.
- **Precision** (0.64): Of all positive predictions, how many were correct.
- **Recall** (0.35): Of all actual positives, how many did we catch.
- **F1 Score** (0.45): Harmonic mean of precision and recall.

We use `classification_report` for a per-class summary. Accuracy alone is misleading here: always predicting "no" would already score 7985/9043 ≈ 0.88. The low recall shows that the model misses about 65% of actual subscribers, so the minority class is where it is weakest.
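The four metrics can be reproduced by hand from confusion-matrix counts. The counts below are not printed anywhere in the notebook; they are back-solved from the reported precision, recall, and class supports, so treat them as an assumption that happens to be consistent with the output above.

```python
# Confusion-matrix counts back-solved from the reported test metrics
# (assumed values: consistent with the printed report, not printed by the notebook)
tp, fp, fn, tn = 369, 204, 689, 7781

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # correct among predicted positives
recall = tp / (tp + fn)             # found among actual positives
f1 = 2 * precision * recall / (precision + recall)

# Baseline that always predicts the majority class ("no")
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} baseline={majority_baseline:.4f}")
```

Comparing `accuracy` with `majority_baseline` shows how little of the headline accuracy comes from actually identifying subscribers.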

## 3. Create Two Regularized Logistic Regression Models

#### L1 (Lasso) regularization

In [27]:
# L1 Regularized Logistic Regression (Lasso)
lr_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0, max_iter=1000, random_state=42)

# Fit the model
lr_l1.fit(X_train_scaled, y_train)

# Predict on test data
y_pred_l1 = lr_l1.predict(X_test_scaled)

# Print classification report
print("L1 Regularized Logistic Regression Results:\n")
print(classification_report(y_test, y_pred_l1))

L1 Regularized Logistic Regression Results:

              precision    recall  f1-score   support

       False       0.92      0.97      0.95      7985
        True       0.65      0.35      0.45      1058

    accuracy                           0.90      9043
   macro avg       0.78      0.66      0.70      9043
weighted avg       0.89      0.90      0.89      9043



#### L1 Regularized Logistic Regression (Lasso)

- **L1 regularization** adds a penalty proportional to the sum of the absolute values of the model coefficients. In scikit-learn, `C` is the inverse of the regularization strength.
- This encourages sparsity: some feature weights can become exactly zero, acting like built-in feature selection.
- We used the `'liblinear'` solver, which supports the L1 penalty.
- The model is trained and tested on the scaled dataset. With `C=1.0`, its results are nearly identical to the model in Section 1 (True-class precision 0.65 vs. 0.64).

#### L2 (Ridge) regularization

In [30]:
# L2 Regularized Logistic Regression (Ridge)
lr_l2 = LogisticRegression(penalty='l2', C=1.0, max_iter=1000, random_state=42)

# Fit the model
lr_l2.fit(X_train_scaled, y_train)

# Predict on test data
y_pred_l2 = lr_l2.predict(X_test_scaled)

# Print classification report
print("L2 Regularized Logistic Regression Results:\n")
print(classification_report(y_test, y_pred_l2))

L2 Regularized Logistic Regression Results:

              precision    recall  f1-score   support

       False       0.92      0.97      0.95      7985
        True       0.64      0.35      0.45      1058

    accuracy                           0.90      9043
   macro avg       0.78      0.66      0.70      9043
weighted avg       0.89      0.90      0.89      9043



#### L2 Regularized Logistic Regression (Ridge)

- **L2 regularization** adds a penalty proportional to the sum of the squared coefficients.
- It shrinks all coefficients toward zero but does not set any of them exactly to zero.
- This helps prevent overfitting, especially in models with many correlated features.
- `penalty='l2'` with `C=1.0` is the scikit-learn default for `LogisticRegression`, so this model has the same configuration as the one in Section 1, and its report is identical.
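The difference between the two penalties is easiest to see by counting coefficients that are exactly zero. The sketch below uses synthetic data from `make_classification` and a deliberately strong penalty (`C=0.01`); both are assumptions chosen to make the effect visible, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 3 of them informative
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=3,
    n_redundant=0, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

# Same (strong) regularization strength, different penalty type
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.01).fit(X_syn, y_syn)
l2 = LogisticRegression(penalty="l2", C=0.01, max_iter=1000).fit(X_syn, y_syn)

# Count coefficients that were driven exactly to zero
n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print("zero coefficients with L1:", n_zero_l1)
print("zero coefficients with L2:", n_zero_l2)
```

Applying the same count to `lr_l1.coef_` and `lr_l2.coef_` would show how many bank features the L1 model actually discards at `C=1.0`.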

## 4. Use KNN as a baseline model and compare its performance with Logistic Regression

#### Find the Optimal k using GridSearchCV

In [34]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Define parameter grid
param_grid = {'n_neighbors': list(range(1, 21))}

# GridSearchCV to find best k
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search_knn.fit(X_train_scaled, y_train)

# Best k
best_k = grid_search_knn.best_params_['n_neighbors']
print("Best value of k:", best_k)

Best value of k: 11


*We use GridSearchCV to find the best value of k (number of neighbors) for the KNN model.*

*A range of values from 1 to 20 is tested with 5-fold cross-validation.*

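*The value of k with the highest mean cross-validated accuracy is selected (k = 11 here). Because the classes are imbalanced, accuracy mostly rewards correct majority-class predictions.*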
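Because accuracy is dominated by the majority class, the selected k can depend on the scoring metric. The following sketch runs the same kind of search with two metrics on synthetic imbalanced data; the data, the shorter grid, and the use of `'f1'` as an alternative metric are illustrative assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data: roughly 90% negatives, 10% positives
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0)

grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}

# Run the same search once per scoring metric and record the chosen k
best = {}
for metric in ("accuracy", "f1"):
    search = GridSearchCV(KNeighborsClassifier(), grid, cv=5, scoring=metric)
    search.fit(X_syn, y_syn)
    best[metric] = search.best_params_["n_neighbors"]

print(best)
```

If the two metrics pick different values, that is a sign the accuracy-optimal k is trading away minority-class performance.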

#### Train and Evaluate KNN with Best k

In [37]:
from sklearn.metrics import classification_report
import time

knn = KNeighborsClassifier(n_neighbors=best_k)

start_time = time.time()
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)
end_time = time.time()

print("KNN Classification Report:\n")
print(classification_report(y_test, y_pred_knn))
print("Training + Prediction Time (KNN):", round(end_time - start_time, 4), "seconds")

KNN Classification Report:

              precision    recall  f1-score   support

       False       0.91      0.98      0.94      7985
        True       0.63      0.26      0.37      1058

    accuracy                           0.90      9043
   macro avg       0.77      0.62      0.66      9043
weighted avg       0.88      0.90      0.88      9043

Training + Prediction Time (KNN): 0.7718 seconds


*We train a KNN classifier using the best k obtained from GridSearchCV.*

*The combined fit and prediction time is recorded using the time module.*

*Performance is evaluated using accuracy, precision, recall, and F1-score.*

*KNN has no trainable parameters and is computationally expensive during prediction.*

#### Train and Evaluate Logistic Regression

In [40]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000, random_state=42)

start_time = time.time()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)
end_time = time.time()

print("Logistic Regression Classification Report:\n")
print(classification_report(y_test, y_pred_lr))
print("Training + Prediction Time (LogReg):", round(end_time - start_time, 4), "seconds")

Logistic Regression Classification Report:

              precision    recall  f1-score   support

       False       0.92      0.97      0.95      7985
        True       0.64      0.35      0.45      1058

    accuracy                           0.90      9043
   macro avg       0.78      0.66      0.70      9043
weighted avg       0.89      0.90      0.89      9043

Training + Prediction Time (LogReg): 0.0404 seconds


*We train a logistic regression model using scaled training data.*

*Training + prediction time is measured.*

*Performance metrics are obtained using classification_report.*

*Logistic regression is a parametric model and learns coefficients for each feature.*
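The timings above lump fitting and prediction together. A sketch like the following separates the two phases; `time_fit_and_predict` is a hypothetical helper, and the data is synthetic, so the absolute numbers will not match the bank dataset.

```python
import time

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def time_fit_and_predict(model, X_fit, y_fit, X_new):
    """Return (fit_seconds, predict_seconds) for one model."""
    t0 = time.perf_counter()
    model.fit(X_fit, y_fit)
    t1 = time.perf_counter()
    model.predict(X_new)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


# Synthetic stand-in data (hypothetical sizes, not the bank dataset)
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(2000, 10))
y_fit = (X_fit[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_new = rng.normal(size=(500, 10))

timings = {
    "knn": time_fit_and_predict(
        KNeighborsClassifier(n_neighbors=11), X_fit, y_fit, X_new),
    "logreg": time_fit_and_predict(
        LogisticRegression(max_iter=1000), X_fit, y_fit, X_new),
}
for name, (fit_s, pred_s) in timings.items():
    print(f"{name}: fit {fit_s:.4f}s, predict {pred_s:.4f}s")
```

`time.perf_counter()` is used instead of `time.time()` because it is intended for measuring short intervals. Splitting the phases shows where each model spends its time: KNN mostly at prediction, logistic regression mostly at fitting.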

#### Why KNN is Worse than Logistic Regression

**In this task, KNN performs worse than Logistic Regression on the minority class, and here's why:**

**1. Model Performance on Imbalanced Data**

The dataset is imbalanced: only about 12% of test clients (1,058 of 9,043) belong to the True class (subscribed).

Both models reach 0.90 accuracy, but Logistic Regression does better on the minority class: recall 0.35 vs. 0.26 and F1-score 0.45 vs. 0.37 for the True class.

KNN is sensitive to class imbalance. The neighbors of a point are mostly majority-class (False) examples, which biases its votes toward False and hurts detection of the minority class.

**2. Number of Trainable Parameters**

Logistic Regression learns a fixed set of weights (one coefficient per feature plus an intercept), so a prediction costs a single dot product.

KNN learns no parameters. It stores the entire training set and does its computation at prediction time.

**3. Training and Prediction Time**

In our run, fitting and predicting took about 0.04 seconds for Logistic Regression and about 0.77 seconds for KNN.

KNN is slower because each prediction requires finding the nearest training samples, which in the worst case means computing the distance to every one of them.

**4. Scalability and Generalization**

Logistic Regression fits a single global decision boundary, which limits how much it can adapt to noise in individual training points.

KNN is more vulnerable to noise and irrelevant features, particularly in high-dimensional spaces such as this one-hot encoded feature set, where distances become less informative.

**5. Explainability**

Logistic Regression provides interpretability through its coefficients, whose sign and size show how each feature shifts the predicted log-odds.

KNN, on the other hand, has no straightforward interpretation of how input features affect the prediction.

# Part B: SVM Classification (Grid Stability Dataset) 

## 5. Load the Electrical Grid Stability Dataset and Print Dimensions

In [46]:
# Load the dataset
df2 = pd.read_csv('grid_stability.csv')  

# Display first few rows (optional)
df2.head()

| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.95906 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | 0.055347 | unstable |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.78176 | -0.005957 | stable |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.27721 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | 0.003471 | unstable |
| 3 | 0.716415 | 7.6696 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | 0.028871 | unstable |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.79711 | 0.45545 | 0.656947 | 0.820923 | 0.04986 | unstable |


In [47]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   tau1    10000 non-null  float64
 1   tau2    10000 non-null  float64
 2   tau3    10000 non-null  float64
 3   tau4    10000 non-null  float64
 4   p1      10000 non-null  float64
 5   p2      10000 non-null  float64
 6   p3      10000 non-null  float64
 7   p4      10000 non-null  float64
 8   g1      10000 non-null  float64
 9   g2      10000 non-null  float64
 10  g3      10000 non-null  float64
 11  g4      10000 non-null  float64
 12  stab    10000 non-null  float64
 13  stabf   10000 non-null  object 
dtypes: float64(13), object(1)
memory usage: 1.1+ MB


In [48]:
#Check for missing values
df2.isnull().sum()

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stab     0
stabf    0
dtype: int64

In [49]:
df2.describe()

| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5.25 | 5.250001 | 5.250004 | 5.249997 | 3.75 | -1.25 | -1.25 | -1.25 | 0.525 | 0.525 | 0.525 | 0.525 | 0.015731 |
| std | 2.742548 | 2.742549 | 2.742549 | 2.742556 | 0.75216 | 0.433035 | 0.433035 | 0.433035 | 0.274256 | 0.274255 | 0.274255 | 0.274255 | 0.036919 |
| min | 0.500793 | 0.500141 | 0.500788 | 0.500473 | 1.58259 | -1.999891 | -1.999945 | -1.999926 | 0.050009 | 0.050053 | 0.050054 | 0.050028 | -0.08076 |
| 25% | 2.874892 | 2.87514 | 2.875522 | 2.87495 | 3.2183 | -1.624901 | -1.625025 | -1.62496 | 0.287521 | 0.287552 | 0.287514 | 0.287494 | -0.015557 |
| 50% | 5.250004 | 5.249981 | 5.249979 | 5.249734 | 3.751025 | -1.249966 | -1.249974 | -1.250007 | 0.525009 | 0.525003 | 0.525015 | 0.525002 | 0.017142 |
| 75% | 7.62469 | 7.624893 | 7.624948 | 7.624838 | 4.28242 | -0.874977 | -0.875043 | -0.875065 | 0.762435 | 0.76249 | 0.76244 | 0.762433 | 0.044878 |
| max | 9.999469 | 9.999837 | 9.99945 | 9.999443 | 5.864418 | -0.500108 | -0.500072 | -0.500025 | 0.999937 | 0.999944 | 0.999982 | 0.99993 | 0.109403 |


In [50]:
# Display the shape of the dataset
df2.shape

(10000, 14)

*The dataset used is the Electrical Grid Stability Simulated Data.*

*We load it using pandas.read_csv() and print the dimensions using .shape.*

*This gives us the number of rows (10,000 samples) and columns (14: twelve input variables plus the two stability columns, stab and stabf).*

In [52]:
# Feature and Target Split
# 'stabf' is the target variable and the rest are features

X = df2.drop('stabf', axis=1)
y = df2['stabf']

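We separate the dataset into:

Features (X) – all columns except the target.

Target (y) – the stabf column which we aim to predict.

Note that X as defined here still contains stab, the continuous stability value. In the rows shown above, stabf is "stable" exactly when stab is negative, so stab carries the label almost directly, and the results below should be read with that leakage in mind. Dropping stab along with stabf would give a more realistic evaluation.

This separation is essential before performing modeling and preprocessing.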
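Whether `stab` gives away the label can be checked directly. The sketch below uses only the five rows printed by `df2.head()` above; the rule "stable when `stab` is negative" is an assumption inferred from those rows, and the same check would need to be run on the full `df2` to confirm it.

```python
import numpy as np
import pandas as pd

# The 'stab' and 'stabf' values from the five rows shown by df2.head()
head_rows = pd.DataFrame({
    "stab": [0.055347, -0.005957, 0.003471, 0.028871, 0.04986],
    "stabf": ["unstable", "stable", "unstable", "unstable", "unstable"],
})

# Assumed rule: the system is labelled 'stable' when stab is negative
rule = np.where(head_rows["stab"] < 0, "stable", "unstable")

# Fraction of rows where the rule reproduces the label
agreement = float((rule == head_rows["stabf"]).mean())
print("agreement with the sign rule:", agreement)
```

If the agreement is also 1.0 on all 10,000 rows, a classifier that sees `stab` only has to learn a threshold at zero, and dropping it with `df2.drop(['stab', 'stabf'], axis=1)` would be the fairer setup.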

In [54]:
# Encode the Target Labels
# Convert categorical labels to numerical values

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)  # Classes are sorted alphabetically: 'stable' → 0, 'unstable' → 1

The target column stabf contains categorical labels like "stable" and "unstable". Most machine learning models require numeric values, so we use LabelEncoder to convert these text labels into binary numeric values. LabelEncoder assigns codes in alphabetical order, so:

"stable" → 0

"unstable" → 1

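The mapping does not have to be guessed; it can be read off the fitted encoder. A minimal sketch on three hand-written labels (`le_demo` is a separate, hypothetical encoder so that the notebook's `le` is left untouched):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["unstable", "stable", "unstable"])

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {
    str(label): int(code)
    for label, code in zip(le_demo.classes_, le_demo.transform(le_demo.classes_))
}
print(mapping)
print(list(codes))
```

This matters when reading the classification reports below: class 0 is "stable" (693 test samples) and class 1 is "unstable" (1,307 test samples).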
In [56]:
# Train-Test Split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

*We split the data into a training set (80%) and a test set (20%) using train_test_split. This allows us to train the model on one portion of the data and evaluate its performance on unseen data.*

*We set random_state=42 to ensure reproducibility.*

In [58]:
# Feature Scaling

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

*SVM models are sensitive to the scale of features. To ensure fair treatment of all features, we apply standardization using StandardScaler, which transforms the data so that each feature has:*

Mean = 0

Standard Deviation = 1

*We fit the scaler on the training set and transform both training and test sets to avoid data leakage.*

## 6. SVM Classification with 3 Different Kernels

#### SVM with Linear Kernel

In [62]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# SVM with Linear Kernel
svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_linear.fit(X_train_scaled, y_train)
y_pred_linear = svm_linear.predict(X_test_scaled)

print("SVM with Linear Kernel:\n")
print(classification_report(y_test, y_pred_linear))

SVM with Linear Kernel:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       693
           1       1.00      1.00      1.00      1307

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000



*We trained an SVM classifier with a linear kernel, which fits a linear decision boundary. With a finite C it uses a soft margin, so the classes do not have to be perfectly separable. The model was trained on scaled features and evaluated using standard classification metrics.*

**The model's accuracy on the test data rounds to 1.00.**

*Both classes (0 and 1) were predicted with precision, recall, and F1-score of 1.00 (the report rounds to two decimals).*

*This suggests the data is highly linearly separable in these features. The feature set includes stab, whose sign matches the label in the rows shown earlier, which likely explains most of this result.*

*Few, if any, misclassifications occurred.*

#### SVM with RBF Kernel

In [65]:
# SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf', C=1.0, random_state=42)
svm_rbf.fit(X_train_scaled, y_train)
y_pred_rbf = svm_rbf.predict(X_test_scaled)

print("SVM with RBF Kernel:\n")
print(classification_report(y_test, y_pred_rbf))

SVM with RBF Kernel:

              precision    recall  f1-score   support

           0       0.97      0.98      0.97       693
           1       0.99      0.98      0.99      1307

    accuracy                           0.98      2000
   macro avg       0.98      0.98      0.98      2000
weighted avg       0.98      0.98      0.98      2000



*We used the RBF kernel, which is useful for non-linear classification problems. It implicitly maps the inputs into a higher-dimensional space, where a linear boundary corresponds to a non-linear one in the original features.*

**The RBF model achieved 98% accuracy, which is excellent but slightly below the linear kernel.**

*A few misclassifications are present, most visibly in the precision for class 0 (0.97).*

*The extra flexibility of the RBF kernel appears unnecessary here, since the linear kernel already performs almost perfectly.*

*It is useful for more complex datasets, but not essential in this case.*

#### SVM with Polynomial Kernel

In [68]:
# SVM with Polynomial Kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0, random_state=42)
svm_poly.fit(X_train_scaled, y_train)
y_pred_poly = svm_poly.predict(X_test_scaled)

print("SVM with Polynomial Kernel:\n")
print(classification_report(y_test, y_pred_poly))

SVM with Polynomial Kernel:

              precision    recall  f1-score   support

           0       0.97      0.95      0.96       693
           1       0.97      0.99      0.98      1307

    accuracy                           0.97      2000
   macro avg       0.97      0.97      0.97      2000
weighted avg       0.97      0.97      0.97      2000



*We applied a polynomial kernel of degree 3, which captures interactions between features up to the cubic level and gives a non-linear decision boundary.*

**The polynomial kernel achieved 97% accuracy, which is still high, but slightly worse than both linear and RBF kernels.**

*Some misclassification occurred, especially for class 0 (recall = 0.95).*

*This indicates that the added complexity is unnecessary for this dataset and may cause slight overfitting.*

*This kernel is best suited to data that requires modeling polynomial feature interactions.*

## 7. Hyperparameter Tuning: SVM “C” Parameter

### Linear Kernel

In [72]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

**The C parameter in SVM controls the trade-off between a wide margin and classifying training points correctly:**

*Low C: Stronger regularization. Makes the margin wider, tolerating more margin violations and misclassified training points.*

*High C: Weaker regularization. Penalizes violations heavily, so the model tries to classify all training points correctly, resulting in a narrower margin.*

*GridSearchCV is used to search over different C values, with 5-fold cross-validation by default.*

*Each model is trained on the scaled training data.*

*The best model (best_estimator_) is the one with the highest mean cross-validated accuracy, refit on the full training set.*
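One way to see the margin trade-off is to count support vectors as C changes: a wider margin (low C) usually leaves more training points on or inside it. The sketch below uses synthetic, partly overlapping classes from `make_classification`; the data and the three C values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data with some overlap between the classes
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3,
    n_redundant=0, class_sep=1.0, random_state=0)

# Fit a linear SVM for several C values and count the support vectors
n_support = {}
for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X_syn, y_syn)
    n_support[C] = int(clf.n_support_.sum())

print(n_support)
```

The same attribute (`n_support_`) is available on `best_linear`, `best_rbf`, and `best_poly` after the grid searches, and gives a rough sense of how complex each fitted boundary is.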

In [75]:
# Linear Kernel 
grid_linear = GridSearchCV(SVC(kernel='linear', random_state=42), param_grid, scoring='accuracy')
grid_linear.fit(X_train_scaled, y_train)

best_linear = grid_linear.best_estimator_
y_pred_linear = best_linear.predict(X_test_scaled)

print("Best C (Linear):", grid_linear.best_params_)
print("Accuracy (Linear):", accuracy_score(y_test, y_pred_linear))
print("Classification Report (Linear):\n", classification_report(y_test, y_pred_linear))

Best C (Linear): {'C': 100}
Accuracy (Linear): 0.998
Classification Report (Linear):
               precision    recall  f1-score   support

           0       0.99      1.00      1.00       693
           1       1.00      1.00      1.00      1307

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000



**Report:**

*Precision, Recall, F1-score: All values round to 0.99 or 1.00; test accuracy is 0.998.*

**Inference:**

*The linear SVM achieved near-perfect performance.*

*Cross-validation selected the highest C in the grid (100), suggesting the model benefited from penalizing margin violations more strictly. Because 100 is at the edge of the grid, larger values might score equally well or better.*

*Likely, the data is linearly separable or close to it.*

### RBF Kernel

In [78]:
# RBF Kernel
grid_rbf = GridSearchCV(SVC(kernel='rbf', random_state=42), param_grid, scoring='accuracy')
grid_rbf.fit(X_train_scaled, y_train)

best_rbf = grid_rbf.best_estimator_
y_pred_rbf = best_rbf.predict(X_test_scaled)

print("Best C (RBF):", grid_rbf.best_params_)
print("Accuracy (RBF):", accuracy_score(y_test, y_pred_rbf))
print("Classification Report (RBF):\n", classification_report(y_test, y_pred_rbf))

Best C (RBF): {'C': 1}
Accuracy (RBF): 0.9815
Classification Report (RBF):
               precision    recall  f1-score   support

           0       0.97      0.98      0.97       693
           1       0.99      0.98      0.99      1307

    accuracy                           0.98      2000
   macro avg       0.98      0.98      0.98      2000
weighted avg       0.98      0.98      0.98      2000



**Report:**

*High performance with slightly lower accuracy than the linear model.*

*C = 1, the scikit-learn default, gave the best cross-validated accuracy, so the tuned model has the same configuration as the untuned RBF model in Section 6.*

**Inference:**

*RBF captured some nonlinear structure but was slightly less effective than linear for this dataset.*

*Still robust, showing excellent precision and recall.*

### Polynomial Kernel

In [81]:
# Polynomial Kernel
grid_poly = GridSearchCV(SVC(kernel='poly', degree=3, random_state=42), param_grid, scoring='accuracy')
grid_poly.fit(X_train_scaled, y_train)

best_poly = grid_poly.best_estimator_
y_pred_poly = best_poly.predict(X_test_scaled)

print("Best C (Polynomial):", grid_poly.best_params_)
print("Accuracy (Polynomial):", accuracy_score(y_test, y_pred_poly))
print("Classification Report (Polynomial):\n", classification_report(y_test, y_pred_poly))

Best C (Polynomial): {'C': 10}
Accuracy (Polynomial): 0.969
Classification Report (Polynomial):
               precision    recall  f1-score   support

           0       0.95      0.96      0.96       693
           1       0.98      0.98      0.98      1307

    accuracy                           0.97      2000
   macro avg       0.97      0.97      0.97      2000
weighted avg       0.97      0.97      0.97      2000



**Report:**

*Strong performance, though slightly behind RBF and Linear.*

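*The best cross-validated accuracy came at C = 10, i.e. weaker regularization than the default C = 1. Test accuracy (0.969) is essentially unchanged from the untuned model (0.97).*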

**Inference:**

*Polynomial kernel adds complexity (nonlinear decision boundaries).*

*Possibly overfits more than necessary on this dataset compared to linear/RBF.*

## 8. Comparison and Discussion of SVM Kernel Performance

#### Key Observations:

**1. Linear Kernel**

*Top performer with ~99.8% accuracy.*

*Extremely high precision and recall for both classes.*

*Indicates the dataset is likely linearly separable or very close to it.*

*Model is simpler and more efficient compared to nonlinear kernels.*


**2. RBF Kernel**

*Second best, with 98.15% accuracy, about 1.7 percentage points below the linear kernel.*

*Good generalization ability, capturing nonlinear relationships.*

*Best performance at C = 1, suggesting a good trade-off between margin and misclassification.*

*Useful if data is not linearly separable, but adds computation cost.*

**3. Polynomial Kernel**

*Slightly lower accuracy (96.9%) than linear and RBF.*

*May have overfit slightly due to increased model complexity (degree-3 polynomial).*

*Still performs well but may not justify added complexity on this dataset.*

**Linear SVM is the best choice for this dataset due to its simplicity and top accuracy.**

**RBF is a strong fallback for handling nonlinear patterns, with competitive results.**

**Polynomial kernel is more complex and does not outperform simpler alternatives here.**
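To put the accuracy gaps in concrete terms, the tuned test accuracies reported in Section 7 can be converted into misclassified test points. This is plain arithmetic on the printed numbers and the test-set size of 2,000.

```python
# Tuned test accuracies as printed in Section 7
n_test = 2000
tuned_accuracy = {"linear": 0.998, "rbf": 0.9815, "poly": 0.969}

# Number of misclassified test samples implied by each accuracy
errors = {
    kernel: round((1 - acc) * n_test)
    for kernel, acc in tuned_accuracy.items()
}
print(errors)
```

Seen this way, the gap between the kernels is a matter of a few dozen test points out of 2,000.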

## END