Support Vector Machine (SVM) is a supervised learning algorithm that finds the maximum-margin hyperplane separating data points of different classes. It works well for binary classification, especially when the classes are well separated.


When to Use SVM
Use SVM when:

You have high-dimensional data (many features).

The data is not linearly separable (you can use kernels).

You want a model that resists overfitting, especially on smaller datasets.

You care more about the decision boundary than about probability estimates.


1. StandardScaler
Purpose: Feature Scaling (Standardization)
Many machine learning models, SVMs in particular, are sensitive to the scale of the input features.

StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance.
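
To make the standardization concrete, here is a minimal sketch that compares StandardScaler with the same calculation done by hand. The array values are made up for illustration (a year-like column and a small encoded column) and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: a year-like column and a small encoded column.
X = np.array([[2010.0, 5.0],
              [2015.0, 40.0],
              [2019.0, 90.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result by hand: subtract the column mean, divide by the column std.
# Both StandardScaler and np.std use the population std (ddof=0) by default.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```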


Why SVM needs it:
SVM (including SVR) relies on distances between points, especially with the rbf (Radial Basis Function) kernel.

If features are on different scales (e.g., year is in the 2000s while an encoded make runs 0–100), the feature with the larger spread dominates the distance calculations and harms model performance.
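
The effect on distances can be sketched with four hypothetical cars described by [year, encoded make]. The numbers are invented for illustration. Before scaling, a 50-point gap in the make code looks five times larger than a 10-year gap. After scaling, the two gaps carry equal weight.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [year, encoded make]. Values are illustrative only.
X = np.array([[2005.0, 10.0],
              [2015.0, 10.0],
              [2005.0, 60.0],
              [2015.0, 60.0]])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances: the make column dominates because its spread is larger.
raw_year_gap = dist(X[0], X[1])   # cars differ only by 10 years
raw_make_gap = dist(X[0], X[2])   # cars differ only by 50 make-code units

# After standardization both columns contribute on the same scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_year_gap = dist(X_scaled[0], X_scaled[1])
scaled_make_gap = dist(X_scaled[0], X_scaled[2])

print(raw_year_gap, raw_make_gap)
print(scaled_year_gap, scaled_make_gap)
```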

2. SVR(kernel='rbf')
Purpose: SVM Regression with RBF Kernel
SVR = Support Vector Regression (the regression version of SVM).

kernel='rbf' selects the Radial Basis Function kernel, which measures similarity as exp(-gamma * ||x - x'||^2) and so implicitly maps inputs into a higher-dimensional space.
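
As a small sketch of what the RBF kernel computes, the snippet below evaluates exp(-gamma * ||x - x'||^2) by hand and checks it against scikit-learn's rbf_kernel helper. The two points and the gamma value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary points and an illustrative gamma (not tuned for any dataset).
x1 = np.array([[0.0, 0.0]])
x2 = np.array([[1.0, 1.0]])
gamma = 0.5

# Manual computation: squared Euclidean distance is 2, so the kernel is exp(-1).
k_manual = float(np.exp(-gamma * np.sum((x1 - x2) ** 2)))

# The same value from scikit-learn.
k_sklearn = float(rbf_kernel(x1, x2, gamma=gamma)[0, 0])

print(k_manual, k_sklearn)
```

Points that are close together give a kernel value near 1, and the value decays toward 0 as they move apart. A larger gamma makes that decay faster.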

Why use RBF Kernel?
It helps capture non-linear relationships between features and the target (price).

It is more flexible than a linear kernel when the relationship between the features and price is non-linear or when price depends on complex feature interactions.
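
The difference between the two kernels can be sketched on a toy non-linear target. The data here is a synthetic sine curve, not car prices, and the scores are computed on the training data, so treat this as an illustration of flexibility rather than a benchmark.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic, clearly non-linear relationship (illustrative only).
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()

linear_model = make_pipeline(StandardScaler(), SVR(kernel='linear'))
rbf_model = make_pipeline(StandardScaler(), SVR(kernel='rbf'))

linear_model.fit(X, y)
rbf_model.fit(X, y)

# R^2 on the training data: the RBF kernel can follow the curve,
# while the linear kernel can only fit a straight line through it.
linear_r2 = linear_model.score(X, y)
rbf_r2 = rbf_model.score(X, y)
print(f"linear R^2: {linear_r2:.3f}, rbf R^2: {rbf_r2:.3f}")
```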

3. make_pipeline(...)
Purpose: Chain together steps into a clean pipeline

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

model = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
This means:

First, the data is scaled by StandardScaler().

Then, the scaled data is passed to SVR(kernel='rbf').

Benefits of using make_pipeline:
Cleaner code: No need to manually scale your X_train and X_test.

Avoids data leakage: Automatically ensures scaling is fitted only on training data during cross-validation.

Reusable and easy to cross-validate: Can be passed directly to GridSearchCV, etc.
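
A minimal sketch of the last point, assuming synthetic data from make_regression in place of the car dataset. The grid values are arbitrary examples. make_pipeline names each step after its lowercased class name, which is why the SVR parameters are addressed as svr__C and svr__gamma.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the car dataset).
X, y = make_regression(n_samples=120, n_features=4, noise=10.0, random_state=0)

model = make_pipeline(StandardScaler(), SVR(kernel='rbf'))

# Illustrative grid; step name "svr" comes from make_pipeline's auto-naming.
param_grid = {
    "svr__C": [1, 10, 100],
    "svr__gamma": ["scale", 0.1],
}

# In each fold the scaler is fitted on the training part only,
# so nothing from the validation part leaks into the scaling.
search = GridSearchCV(model, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```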




import pandas as pd

df = pd.read_csv("jiji_car_data_refined.csv")
df.head()
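
In the sample rows the price column is stored as text such as "₦ 20,000,000", so it has to be converted to a number before it can appear in correlations or serve as a regression target. The sketch below shows one possible way to do that on three rows copied from the sample. The column name price_num is a hypothetical choice, not something from the original notebook.

```python
import pandas as pd

# Three rows copied from the sample data; only the columns needed here.
sample = pd.DataFrame({
    "title": ["Mercedes-Benz C300 2016 Gray",
              "Lexus RX 2019 Black",
              "Lexus ES 350 2010 Gray"],
    "price": ["₦ 20,000,000", "₦ 63,000,000", "₦ 15,400,000"],
})

# Strip everything except digits and the decimal point, then convert to float.
# "price_num" is a hypothetical column name used for illustration.
sample["price_num"] = (
    sample["price"]
    .str.replace(r"[^\d.]", "", regex=True)
    .astype(float)
)

print(sample[["price", "price_num"]])
```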



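import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns only
corr = df.corr(numeric_only=True)

# Plot heatmap
plt.figure(figsize=(10,6))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Feature Correlation with Price")
plt.show()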


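# This shows how numeric features correlate with price. Text columns such as make,
# location, or a price stored as a string must be converted to numbers first,
# because numeric_only=True leaves them out.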

# Assumes price, make, and model have already been converted to numeric values.
sns.pairplot(df[['year', 'price', 'make', 'model']])
plt.show()
# Helps you see pairwise relationships. Especially useful to understand trends and spread.

plt.figure(figsize=(12,6))
sns.boxplot(x='condition', y='price', data=df)
plt.xticks(rotation=45)
plt.title('Price Distribution by Car Condition')
plt.show()
# Helps you see how price changes across categories (like condition, location, etc.).

sns.scatterplot(x='year', y='price', data=df)
plt.title('Car Year vs Price')
plt.show()
# Useful to see trends, e.g., do newer cars really cost more?
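
The evaluation plots that follow use y_test and y_pred, which the notebook text does not define. One way they could be produced is sketched below. The features and target are synthetic stand-ins (a year column, an encoded make column, and a made-up price in millions), so every value here is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the encoded car features (illustrative only).
rng = np.random.default_rng(0)
year = rng.integers(2000, 2024, size=200).astype(float)
make_code = rng.integers(0, 100, size=200).astype(float)
X = np.column_stack([year, make_code])

# Made-up target in millions: newer cars cost more, plus some noise.
y = 0.5 * (year - 2000) + 0.02 * make_code + rng.normal(0, 0.5, size=200)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling is fitted on the training rows only, then applied to the test rows.
model = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```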

# Assumes y_test holds the held-out prices and y_pred = model.predict(X_test).
plt.figure(figsize=(8,6))
plt.scatter(y_test, y_pred)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Price")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # perfect prediction line
plt.show()

residuals = y_test - y_pred
plt.figure(figsize=(8,6))
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted Price")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted")
plt.show()
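
The residual plot can be complemented with a few summary numbers. This is a small sketch using three made-up actual and predicted values, chosen so the arithmetic is easy to check by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration, not model output.
y_test = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

residuals = y_test - y_pred                          # actual minus predicted
mae = mean_absolute_error(y_test, y_pred)            # mean of |residuals|
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # root of mean squared residual
r2 = r2_score(y_test, y_pred)                        # 1 - SS_res / SS_tot

print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```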

First five rows of the dataset, in CSV form:
Unnamed: 0,title,make,model,year,condition,transmission,location,price
0,Mercedes-Benz C300 2016 Gray,,,,local used,automatic,Kaduna / Kaduna State,"₦ 20,000,000"
1,Lexus RX 2019 Black,Lexus,RX,2019.0,foreign used,automatic,Central Business District,"₦ 63,000,000"
2,Lexus ES 350 2010 Gray,Lexus,ES,,foreign used,automatic,Apapa,"₦ 15,400,000"
3,Mercedes-Benz C300 2010 White,,,,local used,automatic,Benin City,"₦ 9,000,000"
4,Mercedes-Benz C400 2015 Black,,,,foreign used,automatic,Central Business District,"₦ 39,500,000"
