

*   Exploratory Data Analysis

Begin by exploring the dataset, and assessing the distribution of key variables.
Identify relationships between key variables.
Encode categorical variables appropriately.

*   Model Development

Develop at least two predictive models to forecast customer conversion.
Select appropriate predictive modeling techniques.
Split the dataset into training and test sets.
Train and validate the model using performance metrics.

*   Model Interpretation and Insights

Identify key features influencing conversion.
Provide actionable recommendations based on model results.
Discuss limitations and potential improvements.

*   Final Report and Presentation

Compile a comprehensive report detailing your findings, analyses, and recommendations.
Create two presentations highlighting key insights and suggested strategies. Use visuals to communicate your findings.




In [1]:
install.packages("corrplot")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [2]:
install.packages("randomForest")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
install.packages(
  c("ISLR","plotly","MASS","ggcorrplot","GGally","caret","dplyr","ranger","cluster"),
  dependencies = TRUE,
  repos = "https://cloud.r-project.org"
)
install.packages("factoextra")

Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)


In [None]:
# Import libraries
library(tidyverse)
library(readr)
library(ggplot2)
#library(corrplot)
library(caret)
library(dplyr)
library(ranger)
library(factoextra)
library(cluster)

ERROR: Error in library(caret): there is no package called ‘caret’


# 1. Exploratory Data Analysis (Khang & James)
### Data Cleaning & Exploratory Analysis

* Summary statistics
* Data types
* Missing values
* Duplicate values
* Identify key variables and plot their distributions (histograms)
* Correlations across all numeric variables
* Bar charts for categorical columns



# Load dataset

In [None]:
library(googlesheets4)

# Google Sheets URL
sheet_url <- "https://docs.google.com/spreadsheets/d/1wnZjAaoGKUAUMKkVmuCXP1XPvkgIkwYbUEzcG9_Qebg/edit?usp=sharing"

# Disable authentication if the sheet is public
gs4_deauth()

# Read data from Google Sheets
Marketing <- read_sheet(sheet_url)

# View first few rows of the dataset
head(Marketing)

In [None]:
# Get summary statistics for all numerical variables
summary(Marketing)

In [None]:
# View the shape (number of rows and columns)
dim(Marketing)

In [None]:
# Display structure of the dataset
str(Marketing)

In [None]:
# Convert Data Types

Marketing <- Marketing %>%
  mutate(
    CustomerID = as.character(CustomerID),  # Convert ID to chr
    Age = as.integer(Age),               # Convert Age to integer
    Income = as.integer(Income),         # Convert Income to integer
    AdSpend = as.numeric(AdSpend),       # Ensure AdSpend remains numeric
    WebsiteVisits = as.integer(WebsiteVisits),  # Convert to integer
    SocialShares = as.integer(SocialShares),    # Convert to integer
    EmailOpens = as.integer(EmailOpens),
    EmailClicks = as.integer(EmailClicks),
    PreviousPurchases = as.integer(PreviousPurchases),
    LoyaltyPoints = as.integer(LoyaltyPoints),
    Conversion = as.factor(Conversion),  # Convert Conversion to factor (0/1)
    Gender = as.factor(Gender),          # Convert Gender to factor
    CampaignChannel = as.factor(CampaignChannel),  # Convert CampaignChannel to factor
    CampaignType = as.factor(CampaignType)        # Convert CampaignType to factor
  )

# Verify the Changes

str(Marketing)  # Check updated data types


In [None]:
head(Marketing)

In [None]:
table(Marketing$Conversion)

In [None]:
# Check for missing values in each column
colSums(is.na(Marketing))
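
The cleaning checklist above also lists duplicate values, which no cell checks explicitly. A minimal sketch of such a check follows. The data frame `toy` and its values are made up for illustration; on the real data the same calls would be applied to `Marketing`, assuming `CustomerID` is meant to be unique.

```r
# Toy data standing in for Marketing (values are illustrative only)
toy <- data.frame(
  CustomerID = c("8000", "8001", "8001", "8002"),
  Age        = c(34L, 51L, 51L, 29L),
  AdSpend    = c(1200.5, 430.0, 430.0, 980.2)
)

# Number of rows that exactly repeat an earlier row
n_dup_rows <- sum(duplicated(toy))

# Number of repeated IDs (each ID is assumed to appear only once)
n_dup_ids <- sum(duplicated(toy$CustomerID))

# Drop exact duplicate rows, keeping the first occurrence
toy_unique <- toy[!duplicated(toy), ]

n_dup_rows
n_dup_ids
nrow(toy_unique)
```

If both counts are zero on the real data, no de-duplication step is needed.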

# Distribution of Key Variables

In [None]:
# Function to create histograms; `column` is a column name given as a string
plot_histogram <- function(data, column, title, binwidth = NULL) {
  # .data[[column]] replaces the deprecated aes_string()
  ggplot(data, aes(x = .data[[column]])) +
    geom_histogram(binwidth = binwidth, fill = "blue", alpha = 0.6, color = "black") +
    theme_minimal() +
    labs(title = title, x = column, y = "Count")
}

In [None]:
# Ad Spend Distribution
plot_histogram(Marketing, "AdSpend", "Distribution of Ad Spend", binwidth = 500)

**Observations:**

- Range: Ad Spend spans from near 0 up to about 10,000, suggesting a wide variety of spending levels.
- Relatively Even Spread: The distribution appears fairly uniform, with no pronounced peak or strong skew.
- Slight Dip at Lower Values: There’s a small dip around the lower range (0–1,000), indicating fewer instances of minimal spend.
- Potentially Steady Investment: The bulk of the data hovers between 2,000 and 8,000, implying most campaigns allocate a moderate to high spend.

In [None]:
# Click Through Rate (CTR) Distribution
plot_histogram(Marketing, "ClickThroughRate", "Distribution of Click-Through Rate", binwidth = 0.02)

**Observations:**

- Range: The Click-Through Rate (CTR) spans from near 0 up to about 0.30.
- Fairly Even Spread: There is no sharp peak, suggesting that CTR values are distributed relatively uniformly within this range.
- Mild Concentration Around 0.10–0.15: Slightly more data points appear around the mid-range, but overall variation is not extreme.
- No Extreme Outliers: The histogram ends around 0.30, indicating that CTR does not exceed 30%.

In [None]:
# Website Visits Distribution
plot_histogram(Marketing, "WebsiteVisits", "Distribution of Website Visits", binwidth = 2)

**Observations:**

- Range: Website Visits span from 0 to about 50.
- Fairly Even Spread: The histogram shows a relatively uniform distribution, with no single dominant peak.
- Mild Concentration Around Mid-Range: Slightly more visits occur between about 15 and 35, though the overall variation is not extreme.
- No Extreme Outliers: The data caps at 49 visits, indicating most visitors fall within a moderate range of site engagement.

In [None]:
# Income Distribution
plot_histogram(Marketing, "Income", "Distribution of Income", binwidth = 5000)

**Observations:**

- Range: Income varies roughly from $20,000 to $150,000, indicating a broad spectrum of financial backgrounds.
- Slight Concentration Around Mid-Range: A large portion of individuals fall between $40,000 and $80,000.
- Dip Toward Upper Income Levels: Fewer individuals earn beyond $120,000.
- No Sharp Peak: The distribution is relatively spread out, with mild fluctuations across the mid-range.

In [None]:
# Loyalty Points Distribution
plot_histogram(Marketing, "LoyaltyPoints", "Distribution of Loyalty Points", binwidth = 200)

**Observations:**

- Range: Loyalty Points span from 0 up to about 5,000.
- Fairly Uniform Distribution: No single peak dominates, though there’s a slight dip at the lower end (0–500).
- Moderate Variations Across the Range: Mild fluctuations occur, but overall the values are spread evenly between 500 and 4,500.
- No Extreme Outliers: The data caps at 5,000, indicating a clear upper limit for points earned.

In [None]:
options(repr.plot.width=12, repr.plot.height=10)  # Adjust width and height as needed

# **1.2 Bivariate Visualizations**

In [None]:
install.packages("ggcorrplot")
library(ggcorrplot)

# Keep only the numeric columns of 'Marketing'
numeric_vars <- Marketing %>% select(where(is.numeric))

# Compute the correlation matrix and the matching matrix of p-values
corr_matrix <- cor(numeric_vars, use = "complete.obs")
p_matrix <- cor_pmat(numeric_vars)

# Heatmap with explicit color mapping
ggcorrplot(
  corr_matrix,
  hc.order = TRUE,          # Order by hierarchical clustering
  type = "lower",           # Show lower triangle
  lab = TRUE,               # Show correlation values
  lab_size = 5,             # Adjust label size for correlation values
  colors = c("#6D9EC1", "white", "#E46726"), # Explicit color scheme
  outline.color = "black",  # Add black border for clarity
  p.mat = p_matrix,         # p-values used to flag insignificant correlations
  insig = "blank",          # Blank out insignificant correlations (only has an effect when p.mat is supplied)
  title = "Correlation Heatmap", # Optional: Add a title
  ggtheme = ggplot2::theme_gray() +  # Set a theme for the plot
    theme(
      axis.text.x = element_text(size = 30),  # Increase x-tick label size
      axis.text.y = element_text(size = 30),  # Increase y-tick label size
      plot.title = element_text(size = 30)     # Increase plot title size
    )
)

--> The correlation heatmap shows no highly correlated variables, which is good for later modeling because it suggests little multicollinearity among the predictors. We will also use stepwise selection later to eliminate insignificant variables.
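
The stepwise selection mentioned above could look roughly like the sketch below. It uses base R's `step()` on simulated data; the data frame `sim` and the variables `x1`, `x2`, `x3` are hypothetical stand-ins, not columns of `Marketing`. Note that `step()` selects by AIC rather than by p-values, so it is one possible way to do this, not necessarily the one intended here.

```r
set.seed(1)

# Simulated data: only x1 truly drives the outcome (illustrative only)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x1))
sim <- data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

# Full logistic regression, then backward stepwise selection by AIC
full_fit <- glm(y ~ ., data = sim, family = binomial)
step_fit <- step(full_fit, direction = "backward", trace = 0)

# Predictors retained by the procedure
kept <- attr(terms(step_fit), "term.labels")
kept
```

On the real data, the full model would be the logistic regression of `Conversion` on all predictors in the training set.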

## Gender Count

In [None]:
gender_count <- Marketing %>%
  group_by(Gender) %>%
  summarise(Count = n(), .groups = 'drop') %>%
  mutate(Percentage = (Count / sum(Count)) * 100)

gender_count

In [None]:
# Line plot of the average conversion rate at each age
ggplot(data = Marketing, aes(x = Age, y = ConversionRate)) +
  geom_line(stat = "summary", fun = "mean", color = "steelblue", linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4
  labs(title = "Average Conversion Rate by Age", x = "Age", y = "Average Conversion Rate") +
  theme_minimal() +
  theme(axis.title.x = element_text(size = 14), axis.title.y = element_text(size = 14))

## TimeOnSite vs Gender (Age)

In [None]:

# Create age groups for better visualization
Marketing$AgeGroup <- cut(Marketing$Age, breaks = seq(0, 100, by = 10), right = FALSE)

# Calculate total TimeOnSite by Gender and Age Group (a sum, not an average)
time_on_site_gender_age <- Marketing %>%
  group_by(AgeGroup, Gender) %>%
  summarise(TotalTimeOnSite = sum(TimeOnSite, na.rm = TRUE), .groups = 'drop')

# Create the bar chart
ggplot(data = time_on_site_gender_age, aes(x = AgeGroup, y = TotalTimeOnSite, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = round(TotalTimeOnSite, 0)),  # Round to 0 decimal points
            position = position_dodge(width = 0.9),
            vjust = -0.5,
            size = 5) +  # Adjust size of the text
  labs(title = "Total Time on Site by Gender and Age Group",
       x = "Age Group",
       y = "Total Time on Site (minutes)") +
  theme_minimal() +
  theme(
    axis.title.x = element_text(size = 25),  # Increase x-axis label size
    axis.title.y = element_text(size = 25),  # Increase y-axis label size
    axis.text.x = element_text(size = 25),   # Increase x-tick label size
    axis.text.y = element_text(size = 25)    # Increase y-tick label size
  )


## Conversion vs CampaignType (count facet)

In [None]:
# Create a bar plot for Conversion with facets by CampaignType
ggplot(data = Marketing, aes(x = Conversion)) +
  geom_bar(fill = "steelblue", color = "black") +  # Counts are the default y for geom_bar; adjust colors as needed
  facet_wrap(~CampaignType) +  # Create facets for each CampaignType
  labs(title = "Conversion Count by Campaign Type", x = "Conversion", y = "Count") +  # Add labels
  theme(
    axis.title.x = element_text(size = 30),  # Increase x-axis label size
    axis.title.y = element_text(size = 30),  # Increase y-axis label size
    axis.text.x = element_text(size = 25),   # Increase x-tick label size
    axis.text.y = element_text(size = 25),   # Increase y-tick label size
    strip.text = element_text(size = 20),     # Increase facet label size
    plot.title = element_text(size = 26)      # Increase plot title size
  ) +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 4) +  # Add data labels above bars (after_stat replaces the deprecated ..count..)
  coord_cartesian(clip = 'off')  # Ensure labels are not clipped

## Bar Chart (Total Social Shares by Channel)

In [None]:
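# Total social shares per campaign channel
mkt.sum <- Marketing %>%
  group_by(CampaignChannel) %>%
  summarise(total_shares = sum(SocialShares), .groups = 'drop')

mkt.sum  # Print the summary table; unlike View(), this does not need an interactive data viewer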

ggplot(data = mkt.sum, aes(x=CampaignChannel, y = total_shares)) +
  geom_bar(stat = "identity",fill = "steelblue", color = "black")+
  theme(
    axis.title.x = element_text(size = 30),  # Increase x-axis label size
    axis.title.y = element_text(size = 30),  # Increase y-axis label size
    axis.text.x = element_text(size = 25),   # Increase x-tick label size
    axis.text.y = element_text(size = 25),   # Increase y-tick label size
    strip.text = element_text(size = 20),     # Increase facet label size
    plot.title = element_text(size = 26)      # Increase plot title size
  )

## Bar Chart (Channel vs CTR by Gender)

In [None]:
mkt.agg <- Marketing %>%
  group_by(CampaignChannel, Gender) %>%
  summarise(total_CTR = sum(ClickThroughRate), .groups = 'drop')  # Drop grouping after summarising

# Create the bar plot
ggplot(data = mkt.agg, aes(x = CampaignChannel, y = total_CTR, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  geom_text(aes(label = round(total_CTR, 0)), position = position_dodge(width = 0.9), vjust = -0.5, size = 5) +  # Add rounded data labels above bars
  theme(
    axis.title.x = element_text(size = 30),  # Increase x-axis label size
    axis.title.y = element_text(size = 30),  # Increase y-axis label size
    axis.text.x = element_text(size = 25),   # Increase x-tick label size
    axis.text.y = element_text(size = 25),   # Increase y-tick label size
    strip.text = element_text(size = 20),     # Increase facet label size
    plot.title = element_text(size = 26),     # Increase plot title size
    legend.title = element_text(size = 20),   # Increase legend title size
    legend.text = element_text(size = 20)      # Increase legend text size
  ) +
  labs(fill = "Gender")  # Optional: Set legend title

## ANOVA

In [None]:
# Exercise: conduct ANOVA to compare the mean AdSpend by CampaignChannel and Gender

options(scipen = 999)
summary(aov(AdSpend ~ CampaignChannel + Gender, data = Marketing))

## Feedback from Prof.
- Conversions/ConversionRate with Gender or Campaign Channel
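
One way to follow up on this feedback is to compare the conversion rate across campaign channels and test whether conversion is independent of channel. The sketch below uses simulated data; the data frame `toy` and its values are hypothetical, and only the column names mirror `Marketing`. The same pattern would apply to `Gender`.

```r
set.seed(42)

# Simulated stand-in for Marketing (illustrative only)
toy <- data.frame(
  CampaignChannel = factor(sample(c("Email", "PPC", "SEO"), 300, replace = TRUE)),
  Conversion      = factor(sample(c(0, 1), 300, replace = TRUE, prob = c(0.2, 0.8)),
                           levels = c(0, 1))
)

# Conversion rate within each channel
conv_rate <- tapply(toy$Conversion == "1", toy$CampaignChannel, mean)

# Contingency table and chi-squared test of independence
conv_table <- table(toy$CampaignChannel, toy$Conversion)
chi_result <- chisq.test(conv_table)

conv_rate
chi_result$p.value
```

A small p-value would suggest that conversion rates differ by channel. In this simulation the two columns are generated independently, so no real difference exists.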

# 2. Model Development

# **1.3 Encode Categorical Variables for Modeling**

In [None]:
# Make sure the target is a factor with levels 0 and 1 before extracting it
Marketing$Conversion <- factor(Marketing$Conversion, levels = c(0, 1))

# Separate predictors and target
predictors <- Marketing %>% select(-Conversion)
target <- Marketing$Conversion

# Label encoding: replace each factor level with its integer code
# (this is not one-hot encoding; the integer codes imply an ordering of the levels)
encoded_predictors <- predictors %>%
  mutate(across(where(is.factor), as.integer))

head(encoded_predictors)

Marketing_encoded <- data.frame(encoded_predictors, Conversion = target)

head(Marketing_encoded)
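
The cell above replaces factor levels with integer codes. For comparison, here is a minimal sketch of dummy (one-hot style) encoding with `model.matrix()`. The data frame `toy` is a made-up stand-in for the predictors. With the default treatment contrasts, each factor with k levels becomes k - 1 indicator columns, and the first level serves as the baseline.

```r
# Toy predictors (values are illustrative only)
toy <- data.frame(
  Gender          = factor(c("Female", "Male", "Female")),
  CampaignChannel = factor(c("Email", "PPC", "SEO")),
  AdSpend         = c(100, 200, 300)
)

# Build the design matrix and drop the intercept column
encoded <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]

colnames(encoded)
encoded
```

Dummy columns avoid implying an order among unordered categories such as campaign channels, which matters for distance-based methods like kNN and k-means.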

In [None]:
set.seed(1234)

## Partition the Data

In [None]:
marketing_subset = subset(Marketing_encoded, select = -c(CustomerID) ) ##Remove unique identifier column

In [None]:
# Set the sizes of the test and training samples.
# We use 20% of the data for testing:
n <- nrow(marketing_subset)
ntest <- round(0.2*n)
ntrain <- n - ntest

# Split the data into two sets:
train_rows <- sample(1:n, ntrain)
marketing_train <- marketing_subset[train_rows,]
marketing_test <- marketing_subset[-train_rows,]
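
The split above samples rows at random, so the share of converters in the test set can drift from the share in the full data. A stratified split keeps the class proportions the same in both sets. The sketch below is a base-R illustration; `stratified_test_rows` is a hypothetical helper, not a function from any package used in this notebook.

```r
# Return test-set row indices, sampling the same fraction from each class
stratified_test_rows <- function(y, test_frac = 0.2, seed = 1234) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  test_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_test <- round(test_frac * length(idx))
    # sample.int avoids the surprise of sample() on a length-one vector
    idx[sample.int(length(idx), n_test)]
  }), use.names = FALSE)
  sort(test_idx)
}

# Illustrative imbalanced outcome: 80 non-converters, 20 converters
y <- factor(rep(c(0, 1), times = c(80, 20)))

test_rows <- stratified_test_rows(y)
table(y[test_rows])
```

With the real data, `y` would be `marketing_subset$Conversion`, and the returned indices would define the test set.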

In [None]:
# Fit min-max scaling on the training set only, so that no test-set information leaks into the model;
# method = "range" rescales each numeric column to [0, 1]
preProcValues <- preProcess(marketing_train, method = c("range"))

marketing_train_norm <- predict(preProcValues, marketing_train) # Apply the training-set scaling to the training rows
marketing_test_norm <- predict(preProcValues, marketing_test)   # Apply the same scaling to the test rows
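
The `"range"` method is min-max scaling: each value x becomes (x - min) / (max - min). The base-R sketch below shows the arithmetic, with the minimum and maximum taken from training values and then applied to test values. `fit_range` and `apply_range` are hypothetical helper names, and the numbers are made up.

```r
# Min-max scaling with parameters taken from the training values only
fit_range   <- function(x) list(min = min(x), max = max(x))
apply_range <- function(x, params) (x - params$min) / (params$max - params$min)

train_spend <- c(10, 20, 30)   # illustrative training values
test_spend  <- c(15, 40)       # illustrative test values

params     <- fit_range(train_spend)
train_norm <- apply_range(train_spend, params)
test_norm  <- apply_range(test_spend, params)

train_norm  # 0.0 0.5 1.0
test_norm   # 0.25 1.50
```

A test value outside the training range (here 40) maps outside [0, 1], which is expected when the scaling parameters come from the training data alone.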

In [None]:
head(marketing_train_norm)
head(marketing_test_norm)

In [None]:
colSums(is.na(marketing_test_norm))

## K Means Clustering

In [None]:
marketing_subset = subset(Marketing_encoded, select = -c(CustomerID)) # Remove unique identifier column
preProcValues <- preProcess(marketing_subset, method = c("range")) # Min-max scaling: rescales each numeric column to [0, 1] using the column minimums and maximums of the full data
marketing_normalized <- predict(preProcValues, marketing_subset) # Apply the scaling and save the result as a new data frame
head(marketing_normalized)

In [None]:
fviz_nbclust(marketing_normalized, kmeans, method = "wss") + labs(subtitle = "Choosing the Optimal k") +ylim(0,15000) ##Plot the within-cluster sum of squares as a function of the number of clusters to determine the optimal k

In [None]:
fviz_nbclust(marketing_normalized, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette Method")

In [None]:
kmeans_5 <- kmeans(marketing_normalized, 5, nstart=25)
kmeans_5

In [None]:
kmeans_5$betweenss #Returns the between cluster sum of squares
kmeans_5$totss #Returns total sum of squares; total variance in the data

In [None]:
explained_variance_5 <- kmeans_5$betweenss/kmeans_5$totss #Calculates the explained variance/ goodness of fit; 1 being best, 0 being worst
explained_variance_5

In [None]:
kmeans_5$size #Return size of all of the clusters
kmeans_5$centers #Return cluster centroids

In [None]:
kmeans_5$withinss #Returns the within cluster sum of squares for each cluster

In [None]:
# fviz_cluster() needs numeric input, so build a numeric copy for plotting only.
# marketing_normalized itself is left unchanged so that Conversion stays a factor for the classifiers below.
cluster_plot_data <- marketing_normalized %>%
  mutate(across(where(is.factor), ~ as.numeric(as.character(.)))) %>% # Convert factor to numeric
  mutate(across(where(is.character), ~ as.numeric(.)))                # Convert character to numeric

# Plot the clusters to visually inspect overlap
fviz_cluster(kmeans_5, cluster_plot_data)

In [None]:
# 'kmeans_5' is the fitted k-means object and 'marketing_normalized' is the normalized data frame

# Add cluster assignments to the normalized data frame
marketing_normalized$cluster <- kmeans_5$cluster

# Set the sizes of the test and training samples.
# We use 20% of the data for testing:
n <- nrow(marketing_normalized)
ntest <- round(0.2*n)
ntrain <- n - ntest

# Split the data into two sets:
train_rows <- sample(1:n, ntrain)
marketing_train_norm <- marketing_normalized[train_rows,]
marketing_test_norm <- marketing_normalized[-train_rows,]

## Logistic Regression

In [None]:
head(marketing_normalized)

In [None]:
# Run logistic regression with marketing dataset
fit.logit <- glm(Conversion ~ AdSpend+PagesPerVisit, data = marketing_normalized, family = binomial)
summary(fit.logit)

In [None]:
fit.logit <- glm(Conversion ~ ., data = marketing_train_norm, family = binomial)
summary(fit.logit)

In [None]:
pred.logit = predict(fit.logit, marketing_test_norm, type = "response") ##predict the model using the test set
hist(pred.logit) ##visualize the distribution of predicted values

summary(pred.logit) ##check the distribution of predicted values

In [None]:
# Choose 0.5 as the threshold for predicting converting customers:
marketing_test_norm$Conversion.logit = ifelse(pred.logit>0.5,"1","0")
table(marketing_test_norm$Conversion.logit)

In [None]:
# Check the accuracy of our predictions
class.lm = xtabs(~ Conversion + Conversion.logit, data = marketing_test_norm)
class.lm ##show the confusion matrix table

In [None]:
print(paste0("The overall accuracy of the model is: ",mean(marketing_test_norm$Conversion==marketing_test_norm$Conversion.logit)))

In [None]:
print(paste0("The accuracy of predicting converting customers (sensitivity) is: ",mean(marketing_test_norm$Conversion[marketing_test_norm$Conversion=="1"]==marketing_test_norm$Conversion.logit[marketing_test_norm$Conversion=="1"])))
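
Overall accuracy can look good on imbalanced data even when one class is predicted poorly, so it helps to read several metrics off the confusion matrix at once. The sketch below computes them in base R from made-up labels; `actual` and `predicted` are illustrative vectors, not model output from this notebook.

```r
# Illustrative labels and predictions (1 = converted, 0 = not converted)
actual    <- factor(c(1, 1, 1, 0, 0, 1, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 1, 1, 0, 1), levels = c(0, 1))

# Confusion matrix: rows = actual, columns = predicted
cm <- table(Actual = actual, Predicted = predicted)

tp <- cm["1", "1"]; fn <- cm["1", "0"]
fp <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy    <- (tp + tn) / sum(cm)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)       # share of actual converters that were found
specificity <- tn / (tn + fp)       # share of actual non-converters that were found
precision   <- tp / (tp + fp)       # share of predicted converters that are correct

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
```

For these toy vectors the accuracy is 0.75 and the sensitivity is 0.8, while the specificity is only about 0.667.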

## K-Nearest Neighbors

In [None]:
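# Make sure the outcome is a factor so that caret fits a classifier rather than a regression
marketing_train_norm$Conversion <- factor(marketing_train_norm$Conversion, levels = c(0, 1))

ctrl <- trainControl(method = "repeatedcv", repeats = 3)  # 10-fold cross-validation (caret's default), repeated 3 times
knnFit <- train(Conversion ~ ., data = marketing_train_norm, method = "knn", trControl = ctrl, tuneLength = 20) # Test 20 candidate values of k on the normalized training data
knnFit       # Displays the relative performance of different values of k
plot(knnFit) # Plot the accuracy of various k values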

In [None]:
knn_test_predictions <- predict(knnFit, newdata = marketing_test_norm, type = "raw") # Generate class predictions for the test set

In [None]:
table(marketing_test_norm$Conversion)

In [None]:
# The kNN predictions were made on marketing_test_norm, so compare them with the labels from the same rows
knn_test_actual <- factor(marketing_test_norm$Conversion, levels = c(0, 1))
knn_test_predictions <- factor(knn_test_predictions, levels = levels(knn_test_actual))

confusionMatrix(knn_test_predictions, knn_test_actual, positive = "1") # kNN test-set confusion matrix

## Random Forest

In [None]:
library(randomForest)
set.seed(12345)
fit.rf = randomForest(Conversion ~ ., data=marketing_train, ntree=400, importance = TRUE)
fit.rf

In [None]:
#Apply the random forest to the test data
pred.rf <- predict(fit.rf, marketing_test)
marketing_test$Conversion.rf = pred.rf
class.rf = xtabs(~ Conversion + Conversion.rf, data = marketing_test)
class.rf ##show the confusion matrix table

In [None]:
# Add cluster assignments to marketing_test
marketing_test$cluster <- marketing_normalized$cluster[match(rownames(marketing_test), rownames(marketing_normalized))]

clusplot(marketing_test[ ,c("AdSpend","Gender","cluster")], pred.rf, color = TRUE, shade = TRUE, labels =  4, lines = 0, main = "Random Forest Classes, test data")

In [None]:
#variable importance
varImpPlot(fit.rf, main = "Variable importance by default")

In [None]:
print(paste0("The overall accuracy of the model is: ",mean(marketing_test$Conversion==marketing_test$Conversion.rf)))

In [None]:
print(paste0("The accuracy of predicting converting customers is: ",mean(marketing_test$Conversion[marketing_test$Conversion=="1"]==marketing_test$Conversion.rf[marketing_test$Conversion=="1"])))

# 3. Model Interpretation & Insights