
Case Study – 3
Objective:
• Employ SVM from scikit-learn for binary classification.
• Observe the impact of data preprocessing and of hyperparameter search using grid search.
Questions:
1. Load the data from “college.csv” that has attributes collected about private and
public colleges for a particular year. We will try to predict the private/public
status of the college from other attributes.
2. Use LabelEncoder to encode the target variable into numerical form and split
the data such that 20% of the data is set aside for testing.
3. Fit a linear SVM from scikit-learn and observe the accuracy.
[Hint: Use LinearSVC]
4. Preprocess the data using StandardScaler and fit the same model again and
observe the change in accuracy.
[Hint: Refer to scikit-learn’s preprocessing methods]
5. Use scikit-learn’s grid search to select the best hyperparameters for a non-linear
SVM, and identify the model with the best score and its parameters.
[Hint: Refer to the model_selection module of scikit-learn]

Load the dataset & inspect columns

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("college.csv")

# View first rows
df.head()


In [None]:
# List columns
print("Columns in dataset:")
print(df.columns.tolist())


Encode target variable & train-test split (80–20)

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


In [None]:
# Encode target variable
le = LabelEncoder()
df['Private'] = le.fit_transform(df['Private'])
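# LabelEncoder assigns codes in sorted class order, so No -> 0, Yes -> 1
# (confirm with le.classes_)

# Features and target; keep numeric columns only, since LinearSVC cannot
# handle string columns such as a college-name column, if one is present
X = df.drop('Private', axis=1).select_dtypes(include='number')
y = df['Private']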

# Train-test split (20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
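
LabelEncoder assigns integer codes in sorted class order, so it is worth confirming which label became 1 before reading any metrics. The sketch below uses a toy stand-in for the 'Private' column; the values and the names `private`, `le_demo` and `mapping` are illustrative assumptions, not taken from college.csv.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'Private' column (hypothetical values)
private = pd.Series(["Yes", "No", "Yes", "Yes", "No"])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(private)

# classes_ is sorted; the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print("Label mapping:", mapping)
print("Encoded values:", encoded.tolist())

# inverse_transform recovers the original labels from the codes
print("Decoded values:", le_demo.inverse_transform(encoded).tolist())
```

The same check on the notebook's encoder would be `le.classes_`, run after the encoding cell.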


Fit Linear SVM (without preprocessing)

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report


In [None]:
# Linear SVM
linear_svm = LinearSVC(max_iter=10000)
linear_svm.fit(X_train, y_train)

# Predictions
y_pred = linear_svm.predict(X_test)

# Accuracy
print("Accuracy (Linear SVM without Scaling):", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
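
An accuracy figure is easier to judge next to a majority-class baseline, because the two classes may not be balanced. The following is a minimal sketch on made-up labels (a 70/30 split in training, 15/5 in test); the arrays and the name `baseline` are assumptions for illustration and do not reflect the college data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: class 1 is the majority in both splits
y_train_demo = np.array([1] * 70 + [0] * 30)
y_test_demo = np.array([1] * 15 + [0] * 5)

# DummyClassifier ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent training class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print("Baseline accuracy:", baseline_acc)
```

On the notebook's data the equivalent call would fit the dummy model on `X_train, y_train` and score it on `X_test, y_test`; an SVM is only informative to the extent that it beats this number.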


Apply StandardScaler & fit Linear SVM again

In [None]:
from sklearn.preprocessing import StandardScaler


In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Linear SVM again
linear_svm_scaled = LinearSVC(max_iter=10000)
linear_svm_scaled.fit(X_train_scaled, y_train)

# Predictions
y_pred_scaled = linear_svm_scaled.predict(X_test_scaled)

# Accuracy
print("Accuracy (Linear SVM with Scaling):", accuracy_score(y_test, y_pred_scaled))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred_scaled))
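
Scaling and fitting can also be bundled into one estimator, which keeps the "fit the scaler on training data only" rule automatic. The sketch below is one way to do that with `make_pipeline`, using synthetic data from `make_classification` with one feature deliberately inflated; the data and all `*_demo` names are assumptions, not the college dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary problem; blow up one feature to mimic mixed units
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_demo[:, 0] *= 1000.0

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on Xtr only, then reuses it for Xte
scaled_svm = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svm.fit(Xtr, ytr)
acc_demo = scaled_svm.score(Xte, yte)
print("Test accuracy (pipeline):", acc_demo)

# After scaling, each training feature has mean ~0 and standard deviation ~1
train_scaled = scaled_svm.named_steps["standardscaler"].transform(Xtr)
print("Means:", train_scaled.mean(axis=0).round(6))
print("Stds: ", train_scaled.std(axis=0).round(6))
```

Standardizing puts every feature on a comparable scale, which is why a margin-based linear model usually converges faster and is less dominated by large-valued columns.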


Grid Search for Non-Linear SVM (RBF Kernel)

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV


In [None]:
# Parameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1],
    'kernel': ['rbf']
}

# Grid Search
grid = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit grid search on scaled data
grid.fit(X_train_scaled, y_train)
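
Because the scaler above was fitted on the full training set before cross-validation, each validation fold has already influenced the scaling statistics. A common alternative, sketched here under the assumption of synthetic data and hypothetical step names `scaler` and `svc`, is to put the scaler inside a Pipeline so that it is refitted within every fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the unscaled training data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
pipe_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.01, 0.1, 1],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="accuracy")
pipe_search.fit(X_demo, y_demo)

print("Best parameters:", pipe_search.best_params_)
print("Best cross-validation score:", pipe_search.best_score_)
```

With this layout the search would be fitted on the unscaled `X_train`, and `best_estimator_` would scale new data on its own at prediction time.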


In [None]:
# Best model and parameters
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Score:", grid.best_score_)
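
The best score alone hides how close the other parameter combinations were. The sketch below shows one way to tabulate all candidates from `cv_results_`; it reruns a small search on synthetic data so that it is self-contained, and the names `demo_grid` and `summary` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_grid = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[
    ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

For the notebook's own search, the same table comes from `pd.DataFrame(grid.cv_results_)`. A large `std_test_score` relative to the gap between candidates suggests the ranking is not very stable.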


In [None]:
# Test set performance
best_model = grid.best_estimator_
y_pred_grid = best_model.predict(X_test_scaled)

print("Test Accuracy (Best RBF SVM):", accuracy_score(y_test, y_pred_grid))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred_grid))
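
A confusion matrix complements the classification report by showing the raw counts behind precision and recall. This is a minimal sketch on hand-written labels; `y_true_demo` and `y_pred_demo` are made-up arrays, not predictions from the college data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = private, 0 = public
y_true_demo = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred_demo = np.array([1, 1, 0, 0, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# For a binary problem, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

On the notebook's results the call would be `confusion_matrix(y_test, y_pred_grid)`.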
