# Project 4: Predicting Online Purchase Intent: A KNN-Based Approach to Customer Behaviour

### Project Goal:
The main goal of this project is to build a machine learning model that predicts whether a visitor to an e-commerce website will make a purchase. We'll use data about their browsing session, such as the pages they visited and how long they stayed, to make this prediction, and compare several models to find the most effective one.

### Learning Objectives:
Learn how to clean and prepare data for machine learning, including encoding categorical features and scaling numerical data.

Understand the challenge of class imbalance (when one outcome is much rarer than another) and its impact on model performance.

Build, tune, and evaluate a K-Nearest Neighbors (KNN) classifier.

Compare the performance of multiple classification algorithms (like Logistic Regression, Decision Trees, and Random Forest) to identify the best model for the task.

Explore how techniques like PCA (for simplifying data) and SMOTE (for balancing data) affect model accuracy.



### Importing Necessary Libraries

Before we begin, we need to load all the tools we'll need. This block imports libraries for handling data (pandas, numpy), creating plots (matplotlib, seaborn), and building machine learning models (scikit-learn). We also import SMOTE, a technique to handle imbalanced datasets.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

### Loading and Exploring the Data

Now, let's load our dataset. We're using pandas to read the online_shoppers_intention.csv file. We'll then display the first five rows using .head() to get a quick feel for the data's structure and contents.

In [None]:
df = pd.read_csv("C:/Users/HP/Downloads/online_shoppers_intention.csv")
df.head()

In [None]:
df.info()

#### Summary:
The summary confirms we have 12,330 entries and 18 columns. Importantly, every column is fully non-null, so there are no missing values, which simplifies our preprocessing steps. We can also see that columns like Month and VisitorType are of type 'object', meaning they are text-based and we'll need to convert them into numbers for our model.

### Data Preprocessing and Transformation

Machine learning models work with numbers, not text or boolean values. In this step, we'll convert our non-numeric columns into a format the model can understand. We use LabelEncoder for 'Month' and 'VisitorType' and change the boolean (True/False) values in 'Weekend' and 'Revenue' to integers (1/0).

In [None]:
le = LabelEncoder()
df['Month'] = le.fit_transform(df['Month'])
df['VisitorType'] = le.fit_transform(df['VisitorType'])
df['Weekend'] = df['Weekend'].astype(int)
df['Revenue'] = df['Revenue'].astype(int)

In [None]:
df.info()

As you can see, all columns now have numerical data types (int, float). Our data is clean and fully prepared for the modeling phase.

Now, let's check the distribution of our target variable, Revenue. This tells us how many sessions resulted in a purchase versus those that didn't. It's important to check for class imbalance, where one outcome is much more common than the other.

In [None]:
df['Revenue'].value_counts()

The output shows a significant imbalance: 10,422 sessions did not generate revenue (class 0), while only 1,908 did (class 1). This is a common scenario in purchase prediction, and we must be mindful of it, as it can bias our model towards predicting the more common outcome (no purchase).
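To make the risk concrete, the two counts above already imply a trivial baseline: a model that always predicts "no purchase". The short sketch below only redoes that arithmetic from the reported counts; it does not touch the dataset.

```python
# Class counts reported by value_counts() above
no_purchase = 10422
purchase = 1908
total = no_purchase + purchase

# Share of sessions that ended in a purchase (the minority class)
minority_share = purchase / total

# Accuracy of a trivial model that always predicts class 0
baseline_accuracy = no_purchase / total

print(f"Minority share: {minority_share:.2%}")
print(f"Always-predict-0 accuracy: {baseline_accuracy:.2%}")
```

Any accuracy figure reported later is best read against this baseline of roughly 84.5%, not against 50%.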

### Data Visualization
Let's see if there's a relationship between the month and whether a purchase was made. A line plot can help us visualize this trend.

In [None]:
sns.lineplot(x = df['Month'],y = df['Revenue'])

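The plot shows the purchase rate (the mean of Revenue) varying from month to month, suggesting that seasonality might be a factor in a customer's intent to purchase. Keep in mind that LabelEncoder assigns codes in alphabetical order of the month labels, so the x-axis is not in calendar order and the line should be read as a per-month comparison, not as a trend over time.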

In [None]:
df['Month'].value_counts()
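LabelEncoder assigns integer codes in sorted label order, so the encoded Month values follow the alphabet, not the calendar. The sketch below illustrates the difference and shows one possible alternative: an explicit mapping applied to the raw column before any encoding. The month spellings and the calendar_map dictionary are assumptions made for illustration; check the raw labels with df['Month'].unique() before relying on them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed month labels (hypothetical spellings; verify against the raw data)
raw_months = pd.Series(["Feb", "Mar", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])

# LabelEncoder assigns codes in sorted (alphabetical) label order
le_demo = LabelEncoder()
alphabetical_codes = {m: int(c) for m, c in zip(raw_months, le_demo.fit_transform(raw_months))}

# Hypothetical explicit mapping that preserves calendar order instead
calendar_map = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "June": 6,
                "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}
calendar_codes = raw_months.map(calendar_map)

print(alphabetical_codes)
print(calendar_codes.tolist())
```

With a calendar-ordered code, a line plot of Revenue against Month can be read left to right as a progression through the year.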

### Model Preparation

It's time to prepare our data for modeling. We separate our features (X) from the target variable (y). Then, we split the data into a training set (to build the model) and a testing set (to evaluate it). We use stratify=y to ensure that both the training and testing sets have the same proportion of purchases and non-purchases as the original dataset, which is crucial because of our class imbalance.

In [None]:
X = df.drop('Revenue',axis = 1)
y = df['Revenue']

X_train, X_test, y_train, y_test = train_test_split(X,y,stratify = y,test_size = 0.2,random_state = 42)
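The effect of stratify can be checked directly by comparing class proportions after a split. The sketch below uses a synthetic target with roughly 15% positives as a stand-in for Revenue (an assumption for illustration; X_demo and y_demo are not part of the project data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target: roughly 15% positives (illustrative assumption)
rng = np.random.default_rng(42)
y_demo = (rng.random(2000) < 0.15).astype(int)
X_demo = rng.normal(size=(2000, 3))

# Stratified split keeps the class ratio the same in both partitions
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Plain random split lets the ratio drift by chance
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(f"Overall positive rate:     {y_demo.mean():.4f}")
print(f"Stratified train / test:   {y_tr_s.mean():.4f} / {y_te_s.mean():.4f}")
print(f"Unstratified train / test: {y_tr_r.mean():.4f} / {y_te_r.mean():.4f}")
```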

Our features have different scales (e.g., ProductRelated_Duration vs. Month). Algorithms like KNN are sensitive to this. We use StandardScaler to transform our features so they all have a mean of 0 and a standard deviation of 1. We fit the scaler only on the training data to avoid leaking information from the test set into our training process.

In [None]:
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)
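As a quick sanity check of what the scaler does, the sketch below standardizes two synthetic features on very different scales. The data is made up for illustration; only the fit-on-train, transform-on-test pattern mirrors the cell above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (e.g. seconds vs. a small integer code)
X_tr = np.column_stack([rng.exponential(1000, 800), rng.integers(0, 10, 800)]).astype(float)
X_te = np.column_stack([rng.exponential(1000, 200), rng.integers(0, 10, 200)]).astype(float)

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)   # learn mean/std from training rows only
X_te_sc = scaler.transform(X_te)       # reuse the training statistics

print("train mean:", X_tr_sc.mean(axis=0).round(3), "train std:", X_tr_sc.std(axis=0).round(3))
print("test mean: ", X_te_sc.mean(axis=0).round(3), "test std: ", X_te_sc.std(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 exactly (up to floating-point error), while the test columns are only close to those values, which is the expected behaviour when the test set is kept out of the fit.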

### Building and Evaluating Models
As a baseline, let's first train a simple Logistic Regression model. It's a good starting point for classification problems to establish a benchmark performance.

In [None]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_sc,y_train)
y_pred = model.predict(X_test_sc)
log_acc = model.score(X_test_sc,y_test)
print(log_acc)

The model achieves an accuracy of about 88.32%. This gives us a solid benchmark to compare against more complex models like KNN.


For the K-Nearest Neighbors (KNN) model, choosing the right value for 'k' (the number of neighbors) is critical. We'll test a range of k-values from 1 to 60 and plot the training and testing accuracy for each. This helps us find the "sweet spot" where the model performs well on unseen data without overfitting to the training data.

In [None]:
test_score, train_score = [],[]
k_range = range(1,61)
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_sc,y_train)
    train_score.append(knn.score(X_train_sc,y_train))
    test_score.append(knn.score(X_test_sc,y_test))

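plt.figure(figsize=(10,5))
plt.plot(k_range,test_score,label='Testing Accuracy')
plt.plot(k_range,train_score,label='Training Accuracy')
plt.xlabel('k (number of neighbors)')
plt.ylabel('Accuracy')
plt.legend()  # without this call the line labels are never displayed
plt.show()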

The plot shows that as 'k' increases, the training accuracy decreases while the testing accuracy increases and then stabilizes. A good value for 'k' appears to be around 5, where the testing accuracy is already high before it levels off. This balances the model between being too complex (very small k, which overfits) and too simple (very large k, which underfits). Because k is chosen here by looking at test-set accuracy, the test score for the chosen k is a slightly optimistic estimate.
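One way to choose k without consulting the test set is to cross-validate on the training data only. The sketch below is a minimal illustration on a synthetic, imbalanced dataset from make_classification (a stand-in, not the shopper data); the odd-k grid and the 5-fold setting are arbitrary choices, not the project's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the shopper data (illustrative assumption)
X_syn, y_syn = make_classification(
    n_samples=1500, n_features=10, weights=[0.85, 0.15], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline so each CV fold is scaled using only its own training part
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Search odd k values with 5-fold cross-validation on the training set only
k_grid = {"knn__n_neighbors": list(range(1, 32, 2))}
search = GridSearchCV(pipe, k_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best k:", search.best_params_["knn__n_neighbors"])
print(f"Mean CV accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.4f}")
```

With this arrangement the test set is used once, at the end, so its score remains an honest estimate of performance on unseen data.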

To get a more reliable estimate of our KNN model's performance (with k=5), we use 10-fold cross-validation. This involves splitting the training data into 10 parts, training the model on 9 parts, and testing on the 10th part, repeating this process 10 times to get an average score.

In [None]:
cv_knn = KNeighborsClassifier(n_neighbors=5)
cv_score = cross_val_score(cv_knn,X_train_sc,y_train,cv=10)
cv_score.mean()

The mean accuracy from cross-validation is about 87.97%. This is a robust measure of how well our model is likely to perform on new, unseen data.

In [None]:
sns.boxplot(cv_score,orient='h')
plt.grid(True)

The box plot shows that the accuracy scores are tightly clustered around the mean of ~88%, indicating that the model's performance is quite stable across different subsets of the data.

### Advanced Model Experiments
Our dataset has many features. Let's see if we can simplify the model by using Principal Component Analysis (PCA) for dimensionality reduction. We'll reduce the features to just 5 principal components and then train our KNN model on this simplified data.

In [None]:
pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train_sc)
X_test_pca = pca.transform(X_test_sc)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca,y_train)
y_pred_pca = knn_pca.predict(X_test_pca)

knn_pca_acc = accuracy_score(y_test,y_pred_pca)
print(knn_pca_acc)



The accuracy with PCA is about 83.66%, which is lower than the accuracy of the model with all features (~88%). This suggests that reducing the dimensions to 5 components resulted in the loss of important information needed for accurate predictions.
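A common way to judge how much information a given number of components keeps is the cumulative explained variance. The sketch below shows the idea on a synthetic 17-feature dataset (17 being the feature count left after dropping Revenue from 18 columns); the data and the 95% threshold are illustrative assumptions, so the printed figures say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 17-feature stand-in for the scaled training data (illustrative assumption)
X_syn, _ = make_classification(n_samples=1000, n_features=17, n_informative=8, random_state=42)
X_syn_sc = StandardScaler().fit_transform(X_syn)

# Fit PCA with all components and accumulate the explained variance
pca_full = PCA().fit(X_syn_sc)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components that retains at least 95% of the variance
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(f"Variance kept by 5 components: {cumulative[4]:.2%}")
print(f"Components needed for 95%: {n_95}")
```

Running the same check on X_train_sc would show how much variance the 5 components used above actually retain. scikit-learn also accepts a fraction, as in PCA(n_components=0.95), to pick the component count automatically.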

In [None]:
# Append 10 columns of uniform random noise to see how irrelevant features affect KNN
X_train_noise = np.hstack([X_train_sc,np.random.rand(X_train_sc.shape[0],10)])
X_test_noise = np.hstack([X_test_sc,np.random.rand(X_test_sc.shape[0],10)])

knn_ = KNeighborsClassifier(n_neighbors=5)
knn_.fit(X_train_noise,y_train)
y_pred_noise = knn_.predict(X_test_noise)
knn_noise_acc = accuracy_score(y_test,y_pred_noise)
print(knn_noise_acc)

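The accuracy with 10 added noise features is about 87.39%, only slightly lower than the original model's accuracy; the exact figure varies between runs because the noise is not seeded. The noise columns are uniform on [0, 1), so their standard deviation (about 0.29) is smaller than that of the standardized features, which limits their influence on the distance calculation. This shows that KNN is somewhat robust to this level of noise, but irrelevant features can still degrade its performance.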


In [None]:
confusion_matrix(y_test, y_pred)

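Assuming the cells are run in order, y_pred was last assigned in the Logistic Regression cell, so this matrix describes the baseline model, not KNN. scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], with rows for the actual classes and columns for the predicted classes. It gives us a clearer picture of where the model is making mistakes, which is more insightful than accuracy alone, especially with our imbalanced dataset.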
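To show how the four cells of the matrix turn into metrics, here is a small worked example on ten hand-made labels (not taken from the dataset).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny hand-made example: 7 non-purchases (0) and 3 purchases (1)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 0])

# scikit-learn lays the binary matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted purchases, how many were real
recall = tp / (tp + fn)      # of real purchases, how many were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here accuracy is 0.70 even though only one of the three purchases is caught (recall of about 0.33), which is the kind of gap that accuracy alone hides on imbalanced data.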

### Comparing Multiple Models
In this step, we'll train and evaluate four different classification models: Logistic Regression, KNN, Decision Tree, and Random Forest. We'll compare their accuracy and AUC scores to see which one performs best on our dataset.

In [None]:
model_list = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier()
}
model_score = {}

for model_name, model in model_list.items():
    model.fit(X_train_sc, y_train)
    y_pred = model.predict(X_test_sc)
    model_score[model_name]={
        'Accuracy': accuracy_score(y_test,y_pred),
        'AUC Score': roc_auc_score(y_test,y_pred),
    }

display(model_score)

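The results show that Random Forest is the top-performing model with the highest accuracy (90.15%) and AUC score (0.76). This is unsurprising, as Random Forest is an ensemble model that often does well in classification tasks by combining the predictions of many decision trees. Two caveats apply. First, the tree-based models are not seeded, so their figures vary slightly between runs. Second, the AUC here is computed from hard class predictions, not predicted probabilities, so it equals the balanced accuracy (the average of the recall on each class) and is not a threshold-independent AUC.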
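The AUC score depends on what is passed to roc_auc_score. The sketch below, on a synthetic imbalanced dataset (a stand-in, not the shopper data), compares the score computed from hard labels, as in the loop above, with the score computed from predicted probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the shopper data (illustrative assumption)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: a single operating point on the ROC curve
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the predicted probability of class 1: uses the full ranking of sessions
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
# Balanced accuracy: mean of the recall on each class
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {bal_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

The first two numbers coincide, because a ROC curve built from binary predictions has only one interior point. Passing predict_proba(...)[:, 1] gives the usual threshold-independent AUC.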

In [None]:
print("KNN Project Summary")
# Multiplying the formatted string by 100 would print it 100 times, so the factor is dropped
print('Logistic Regression Accuracy: {:.4f}'.format(log_acc))
print('Cross Validation Accuracy: {:.4f}'.format(cv_score.mean()))
print('PCA and KNN together: {:.4f}'.format(knn_pca_acc))

for model_name, results in model_score.items():
    print(f"{model_name} Accuracy: {results['Accuracy']:.4f}")
    print(f"{model_name} AUC Score: {results['AUC Score']:.4f}")

### Handling Class Imbalance with SMOTE
We noted earlier that our dataset is imbalanced. SMOTE (Synthetic Minority Over-sampling Technique) is a widely used method to address this. It works by creating new, synthetic data points for the minority class (sessions with purchases). Here, we apply SMOTE to our training data only, which gives a balanced training set while leaving the test set untouched so that evaluation still reflects the real class distribution.

In [None]:
sm = SMOTE(random_state=42)
X_train_smote, y_train_smote = sm.fit_resample(X_train_sc, y_train)

smote_score = {}
for model_name, model in model_list.items():
    model.fit(X_train_smote, y_train_smote)
    y_pred_smote = model.predict(X_test_sc)
    smote_score[model_name] = {
        'Accuracy': accuracy_score(y_test, y_pred_smote),
        'AUC Score': roc_auc_score(y_test, y_pred_smote),
    }
print("SMOTE Results:")
for model_name, results in smote_score.items():
    print(f"{model_name} Accuracy: {results['Accuracy']:.4f}")
    print(f"{model_name} AUC Score: {results['AUC Score']:.4f}")
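To illustrate the idea behind SMOTE, the sketch below generates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified illustration, not the imblearn implementation: smote_like_oversample is a hypothetical helper, and the two-dimensional data is made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours: the first one returned is the point itself (assuming no duplicate rows)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbour_idx = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)                     # random minority rows
    picked = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                                       # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Made-up 2-D data: 170 majority rows and 30 minority rows
rng = np.random.default_rng(0)
X_majority = rng.normal(0.0, 1.0, size=(170, 2))
X_minority = rng.normal(2.0, 0.5, size=(30, 2))

# Generate enough synthetic rows to match the majority class size
X_new = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, X_new])

print(X_majority.shape, X_minority_balanced.shape)
```

Because every synthetic row lies on a line segment between two real minority rows, the new points stay inside the region the minority class already occupies and are not exact duplicates of existing rows.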


## Conclusion
This project successfully developed and evaluated several machine learning models to predict online shoppers' purchase intent. The key takeaways are:

1. Data Quality and Imbalance: The dataset was clean with no missing values, but it exhibited a significant class imbalance, with far more non-purchase sessions than purchase sessions. This was the primary challenge throughout the project.

2. Model Performance: We compared four different models. Random Forest emerged as the top performer with an accuracy of approximately 90.15% and an AUC score of 0.76 (computed from hard class predictions). K-Nearest Neighbors (KNN) also performed well, reaching about 88% accuracy with k=5, a figure that 10-fold cross-validation showed to be stable across folds.

3. Feature Engineering Insights: Our experiments showed that dimensionality reduction using PCA was not beneficial, as it led to a drop in accuracy to 83.7%. This indicates that the original features contain valuable information that was lost during the reduction. The KNN model also showed reasonable resilience to noisy data.

4. Addressing the Core Challenge: While high accuracy scores were achieved, the class imbalance means these figures can be misleading. A model that always predicts "no purchase" would already be about 84.5% accurate (10,422 of 12,330 sessions). The project's final step addressed this by applying SMOTE to create a balanced training set and retraining all four models on it.

Future work should focus on analysing the results of the models trained on the SMOTE-resampled data. Resampling is expected to improve the models' ability to identify true purchase intent (the minority class), which should show up in more telling metrics such as recall and a probability-based AUC score, and ultimately lead to a more practically useful prediction tool.