# STAT 301 Final Project: Group 22

### Group members: Siluni Jayarathne, Bhumika Kalra, Jeff Lu, Sofiia Prylypka

## Table of contents:
* [Introduction](#Introduction)
* [Method and Results](#Method-and-Results)
* [Discussion](#Discussion)
* [References](#References)

## Introduction
- Thanks to the rapid digitalization of the world, online shopping now reaches a wider audience than ever. As a result, there has been a large push for retail companies to curate personalized shopping experiences for customers. However, the factors that influence whether a customer ultimately makes a purchase are complex and not yet fully understood (Zhou et al., 2007), making it unclear which aspects of a website should be personalized to maximize revenue.
- Because of this, we would like to determine the association between a site visitor's purchasing decision (the response) and predictors related to the visitor's browsing behaviour (e.g. time spent on various types of pages, web page bounce rate) and the time of the site visit (e.g. whether the user visited near a special day or on a weekend). Our primary goal is inference: we want to understand which predictors are relevant to purchasing decisions, rather than to predict a new visitor's purchasing decision.

## Method and Results

### Data

In [3]:
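# Load required libraries
library(tidyverse)   # data wrangling and plotting (already attaches dplyr)
library(broom)       # tidy model summaries
library(car)         # vif() for GVIF
library(MASS)        # stepAIC(); note that MASS::select() masks dplyr::select()
library(tidymodels)  # initial_split(), training(), testing()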

── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

✔ dials        1.2.1     ✔ rsample      1.2.1
✔ infer        1.0.7     ✔ tune         1.2.1
✔ modeldata    1.4.0     ✔ workflows    1.1.4
✔ parsnip      1.2.1     ✔ workflowsets 1.1.0
✔ recipes      1.1.0     ✔ yardstick    1.3.1

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ car::recode()     masks

In [5]:
# Fix the random seed so the split is reproducible (do not change)
set.seed(2025)

# Download the zipped dataset from the Web and read the CSV inside it
zipped <- tempfile()
download.file("https://archive.ics.uci.edu/static/public/468/online+shoppers+purchasing+intention+dataset.zip", zipped, mode = "wb") # binary mode keeps the zip intact on Windows
unzipped <- unz(zipped, "online_shoppers_intention.csv")
shopping <- read.csv(unzipped) |>
    filter(Region != 1) # Keep only sessions outside Region 1, as required for our group's data

# Split into selection and inference sets (stratified by Revenue)
# to avoid the post-selection inference problem
shopping_split <- initial_split(shopping, prop = 0.5, strata = Revenue)
shopping_selection <- training(shopping_split)
shopping_inference <- testing(shopping_split)

head(shopping_selection, 3)

| | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` | `<lgl>` | `<lgl>` |
| 1 | 0 | 0 | 0 | 0 | 2 | 2.666667 | 0.05 | 0.14 | 0 | 0.0 | Feb | 3 | 2 | 2 | 4 | Returning_Visitor | False | False |
| 2 | 0 | 0 | 0 | 0 | 2 | 37.0 | 0.0 | 0.1 | 0 | 0.8 | Feb | 2 | 2 | 2 | 3 | Returning_Visitor | False | False |
| 3 | 0 | 0 | 0 | 0 | 16 | 407.75 | 0.01875 | 0.02583333 | 0 | 0.4 | Feb | 1 | 1 | 4 | 3 | Returning_Visitor | False | False |


##### Dataset summary
- For this project, our group will be working with the [Online Shoppers Purchasing Intention Dataset](https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset).
- This dataset was collected from an observational study and includes data about browser sessions of users visiting an online shopping website.
- The dataset contains 18 variables (described below) for 12,330 observations, with no missing values; however, we will only use observations where the Region is not 1, as specified in our group's instructions.
<table><thead>
  <tr>
    <th>Variable Name</th>
    <th>Type</th>
    <th>Description</th>
    <th>Data collection method</th>
  </tr></thead>
<tbody>
  <tr>
    <td>Administrative</td>
    <td>Integer</td>
    <td>Number of administrative pages visited</td>
    <td>Derived from URL information</td>
  </tr>
  <tr>
    <td>Administrative_Duration</td>
    <td>Continuous</td>
    <td>Total time spent on administrative pages (s)</td>
    <td>Derived from URL information</td>
  </tr>
  <tr>
    <td>Informational</td>
    <td>Integer</td>
    <td>Number of informational pages visited</td>
    <td>Derived from URL information</td>
  </tr>
  <tr>
    <td>Informational_Duration</td>
    <td>Continuous</td>
    <td>Total time spent on informational pages (s)</td>
    <td>Derived from URL information</td>
  </tr>
  <tr>
    <td>ProductRelated</td>
    <td>Integer</td>
    <td>Number of product-related pages visited</td>
    <td>Derived from URL information</td>
  </tr>
  <tr>
    <td>ProductRelated_Duration</td>
    <td>Continuous</td>
    <td>Total time spent on product-related pages (s)</td>
    <td>Derived from URL information</td>
  </tr>
  <tr>
    <td>BounceRates</td>
    <td>Continuous</td>
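    <td>Average bounce rate of the pages visited, where a page's bounce rate is the percentage of visitors who enter the site on that page and leave without triggering any further requests to the analytics server</td>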
    <td>Google Analytics</td>
  </tr>
  <tr>
    <td>ExitRates</td>
    <td>Continuous</td>
    <td>Average exit rate of pages visited</td>
    <td>Google Analytics</td>
  </tr>
  <tr>
    <td>PageValues</td>
    <td>Continuous</td>
    <td>Average value for a web page that a user visited before completing an e-commerce transaction</td>
    <td>Google Analytics</td>
  </tr>
  <tr>
    <td>SpecialDay</td>
    <td>Continuous</td>
    <td>Value between 0 and 1 indicating closeness of site visit to special day</td>
    <td>Calculated based on dynamics of e-commerce</td>
  </tr>
  <tr>
    <td>Month</td>
    <td>Categorical</td>
    <td>Month of site visit</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>OperatingSystems</td>
    <td>Categorical (encoded as Integer)</td>
    <td>Operating system used during site visit</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>Browser</td>
    <td>Categorical (encoded as Integer)</td>
    <td>Browser used during site visit</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>Region</td>
    <td>Categorical (encoded as Integer)</td>
    <td>Geographic region of session</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>TrafficType</td>
    <td>Categorical (encoded as Integer)</td>
    <td>Type of traffic that brought visitor to site</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>VisitorType</td>
    <td>Categorical</td>
    <td>"Returning Visitor," "New Visitor," or "Other"</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>Weekend</td>
    <td>Binary</td>
    <td>Whether the site visit was on a weekend</td>
    <td>NA</td>
  </tr>
  <tr>
    <td>Revenue</td>
    <td>Binary</td>
    <td>Whether the session ended in a transaction</td>
    <td>NA</td>
  </tr>
</tbody></table>
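
Several variables in the table above are categorical but stored as integers, so R would treat them as numeric predictors unless they are converted first. The following is a minimal sketch of that conversion in base R. It uses a small made-up data frame (`toy`, with invented values and only a subset of the columns) rather than the real data, and the name `categorical_cols` is introduced here purely for illustration.

```r
# Toy stand-in for a few columns of the shopping data (values are made up)
toy <- data.frame(
  Month            = c("Feb", "Mar", "Mar", "Nov"),
  OperatingSystems = c(3L, 2L, 1L, 2L),
  Browser          = c(2L, 2L, 1L, 4L),
  Region           = c(2L, 4L, 3L, 2L),
  TrafficType      = c(4L, 3L, 3L, 1L),
  VisitorType      = c("Returning_Visitor", "New_Visitor",
                       "Returning_Visitor", "Other"),
  Weekend          = c(FALSE, FALSE, TRUE, FALSE),
  Revenue          = c(FALSE, TRUE, FALSE, TRUE)
)

# Columns that are categorical even though some are stored as integers
categorical_cols <- c("Month", "OperatingSystems", "Browser",
                      "Region", "TrafficType", "VisitorType")

# Convert each one to a factor so that a model treats its values as
# unordered categories (one dummy variable per non-reference level)
toy[categorical_cols] <- lapply(toy[categorical_cols], factor)

# Inspect the resulting column types
str(toy)
```

Under the same assumption, the conversion could be applied to `shopping` before it is split, so that both halves share the same factor levels.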

### Exploratory Data Analysis

### Methods: Plan

- We will first use `stepAIC()` from the `MASS` package to perform forward variable selection, using only `shopping_selection`. Our dataset has many categorical variables, and `stepAIC()` adds each categorical variable as a whole term, so all of its dummy variables enter the model together. This keeps the selected model interpretable, which is important for inference.
- Using the selected variables, we will then fit an additive logistic regression model on `shopping_selection` with `Revenue` as the response and use the generalized variance inflation factor (GVIF) to check for multicollinearity. If the GVIF of any covariate is greater than 5, we will remove that covariate from the variable set before doing inference.
- A logistic regression model is appropriate because the response variable `Revenue` is binary, and the model keeps the fitted probabilities between 0 and 1. Its coefficients also have an intuitive interpretation (e.g. holding the other predictors fixed, a one-unit increase in a predictor is associated with a multiplicative change in the *odds* of a session ending in a purchase), which is important for inference.
- Finally, we will fit an additive logistic regression model on `shopping_inference` using `Revenue` as the response and interpret the results.
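
The plan above can be sketched roughly as follows. This is only an illustration on simulated data, not our analysis: the data frame `sim`, its columns (`x1`, `x2`, `grp`, `y`), and the coefficients used to generate it are all invented stand-ins for the shopping variables, and the split is done with base R rather than `initial_split()`.

```r
library(MASS)  # stepAIC()
library(car)   # vif()

set.seed(1)

# Simulated stand-in data (NOT the shopping data): two numeric predictors,
# one three-level categorical predictor, and a binary response
n <- 400
sim <- data.frame(
  x1  = rnorm(n),
  x2  = rnorm(n),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * sim$x1 + 0.8 * (sim$grp == "c")))

# Split in half: one part for variable selection, one for inference
idx       <- sample(n, n / 2)
selection <- sim[idx, ]
inference <- sim[-idx, ]

# 1. Forward selection by AIC, starting from the intercept-only model.
#    A factor such as grp is added or left out as a whole term.
null_fit <- glm(y ~ 1, data = selection, family = binomial)
selected <- stepAIC(null_fit,
                    scope = list(lower = ~ 1, upper = ~ x1 + x2 + grp),
                    direction = "forward", trace = FALSE)

# 2. Check multicollinearity on the selection set
#    (vif() needs at least two terms in the model)
if (length(attr(terms(selected), "term.labels")) >= 2) {
  print(vif(selected))
}

# 3. Refit the selected formula on the held-out inference set
final_fit <- glm(formula(selected), data = inference, family = binomial)

# Odds ratios with 95% Wald confidence intervals
exp(cbind(OR = coef(final_fit), confint.default(final_fit)))
```

Exponentiating the coefficients puts them on the odds scale, which matches the interpretation described above. For a model containing a multi-level factor, `vif()` reports a GVIF column alongside the degrees of freedom of each term.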

### Code and Results

## Discussion

## References
Sakar, C. & Kastro, Y. (2018). Online Shoppers Purchasing Intention Dataset [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5F88Q

Sakar, C. O., Polat, S. O., Katircioglu, M., & Kastro, Y. (2018). Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. *Neural Computing and Applications*, 31(10), 6893–6908. https://doi.org/10.1007/s00521-018-3523-0

Zhou, L., Dai, L., & Zhang, D. (2007). Online shopping acceptance model — A critical survey of consumer factors in online shopping. *Journal of Electronic Commerce Research*, 8(1).

Link to download dataset: https://archive.ics.uci.edu/static/public/468/online+shoppers+purchasing+intention+dataset.zip