# DSCI 100 Group 16 Project Proposal

In [37]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 8)
library(readxl)

set.seed(16)

# Preliminary Exploratory Data Analysis

### Creating a usable data set

The data that we downloaded from the website was already in tidy format: each row is an observation, each column is a single variable, and each cell holds a single value. We removed all rows containing 'NA' values and dropped the discrete variables Id, Education, and Complain, because they are not continuous and we can't use them in the classification model. Marital_Status is also discrete; we recoded it as described below and then dropped the original column.
We made the data set a bit more usable by:
- combining the Kidhome and Teenhome columns into a Childhome column
- combining the amount spent on all the different categories of food into a single Total_Spent column
- using the Year_Birth column to find the customer's age (this data was published at the end of 2022)
- changing the date the customer joined to the number of weeks they've been a customer
- changing the marital status column to a Relationship column: 'Yes' (Married or Together) / 'No' (any other status)
- changing response to a factor
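
The derived columns listed above can be illustrated on a tiny example. This is only a sketch: the two customers in `toy` are made up for illustration and do not come from the data set, and `reference_date` is a hypothetical name for the end-of-2022 date mentioned in the list.

```r
library(dplyr)

# Assumed reference date: the end of 2022, as stated in the list above
reference_date <- as.Date("2022-12-31")

# Hypothetical customers; these values are invented for illustration only
toy <- data.frame(
  Year_Birth = c(1970, 1985),
  Dt_Customer = c("6/16/2014", "12/31/2021"),
  Marital_Status = c("Married", "Divorced")
)

toy_derived <- toy |>
  mutate(
    # age at the end of 2022
    Age = 2022 - Year_Birth,
    # parse the month/day/year string, then count weeks up to the reference date
    Weeks_Customer = as.numeric(difftime(
      reference_date,
      as.Date(Dt_Customer, format = "%m/%d/%Y"),
      units = "weeks"
    )),
    # "Yes" for Married or Together, "No" for any other status
    Relationship = if_else(Marital_Status %in% c("Married", "Together"), "Yes", "No")
  ) |>
  select(Age, Weeks_Customer, Relationship)

toy_derived
```

The second customer joined exactly 365 days before the reference date, so their `Weeks_Customer` value is 365 / 7, which gives a quick sanity check on the date arithmetic.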


In [38]:
data <- read_csv("superstore_data.csv")
# reference date: the data was published at the end of 2022
date_2 <- as.Date("2022-12-31")
store_data <- data |>
    # drop rows with missing values and the columns we will not use
    na.omit() |>
    select(-Id, -Education, -Complain) |>
    # total number of children at home
    mutate(Childhome = Kidhome + Teenhome) |>
    select(-Kidhome, -Teenhome) |>
    # age at the end of 2022
    mutate(Age = 2022 - Year_Birth) |>
    select(-Year_Birth) |>
    # total amount spent across the food categories
    mutate(Total_Spent = MntWines + MntFruits + MntMeatProducts + MntFishProducts + MntSweetProducts) |>
    select(-MntWines, -MntFruits, -MntMeatProducts, -MntFishProducts, -MntSweetProducts) |>
    # number of weeks between the join date and the reference date
    mutate(Dt_Customer = as.Date(Dt_Customer, "%m/%d/%Y")) |>
    mutate(Weeks_Customer = difftime(date_2, Dt_Customer, units = "weeks")) |>
    mutate(Weeks_Customer = as.numeric(Weeks_Customer)) |>
    select(-Dt_Customer) |>
    # 'Yes' for Married or Together, 'No' for every other marital status
    mutate(Relationship = case_when(Marital_Status == 'Married' | Marital_Status == 'Together' ~ 'Yes',
                                    Marital_Status !=  'Married' & Marital_Status != 'Together' ~ 'No')) |>
    select(-Marital_Status) |>
    # the classifier needs the label as a factor
    mutate(Response = as.factor(Response))
store_data

Rows: 2240 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Education, Marital_Status, Dt_Customer
dbl (19): Id, Year_Birth, Income, Kidhome, Teenhome, Recency, MntWines, MntF...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Income,Recency,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Childhome,Age,Total_Spent,Weeks_Customer,Relationship
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
84835,0,218,1,4,4,6,1,1,0,52,972,445.7143,No
57091,0,37,1,7,3,7,5,1,0,61,540,445.8571,No
67267,0,30,1,3,2,5,2,0,1,64,221,450.5714,Yes
32474,0,0,1,1,0,2,7,0,2,55,11,425.4286,Yes
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
31056,99,16,1,1,0,3,8,0,1,45,39,518.5714,Yes
46310,99,14,2,6,1,5,8,0,1,46,295,563.7143,No
65819,99,63,1,5,4,10,3,0,0,44,1320,526.2857,Yes
94871,99,144,1,8,5,4,7,1,2,53,934,572.7143,Yes


### Exploring the data
We then split our data into 75% training and 25% testing data, stratified by Response. The data set has 14 variables, but we will not use all of them.
After the split, there are 1661 observations in the training set and 555 observations in the testing set.

About 15% of the customers in the training set gave a positive response (Response = 1).
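
To show what a stratified 75/25 split does, here is a minimal sketch on made-up data. The data frame `toy` (1000 rows, 15% positive) and its column `x` are hypothetical and stand in for the real data; `initial_split()`, `training()`, and `testing()` come from the rsample package, which is loaded as part of tidymodels.

```r
library(rsample)

set.seed(16)

# Hypothetical data: 1000 rows, 15% of them with a positive response
toy <- data.frame(
  x = rnorm(1000),
  Response = factor(rep(c("0", "1"), times = c(850, 150)))
)

# Stratifying by Response keeps the class balance similar in both sets
toy_split <- initial_split(toy, prop = 0.75, strata = Response)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Every row ends up in exactly one of the two sets
nrow(toy_train) + nrow(toy_test)

# Share of each class in the training and testing sets
prop.table(table(toy_train$Response))
prop.table(table(toy_test$Response))
```

With stratification, the share of positive responses should stay close to 15% in both the training and the testing set, rather than drifting by chance.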

In [39]:
store_data_split <- initial_split(store_data, prop = 0.75, strata = Response)
store_data_train <- training(store_data_split)
store_data_train
glimpse(store_data_train)

store_data_train |>
  group_by(Response) |>
  summarize(
    count = n(),
    percentage = n() / nrow(store_data_train) * 100)

Income,Recency,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Response,Childhome,Age,Total_Spent,Weeks_Customer,Relationship
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
67267,0,30,1,3,2,5,2,0,1,64,221,450.5714,Yes
32474,0,0,1,1,0,2,7,0,2,55,11,425.4286,Yes
44931,0,7,1,2,1,3,5,0,1,55,89,467.0000,Yes
65324,0,5,3,6,2,9,4,0,1,68,539,426.0000,Yes
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
77766,97,27,2,11,10,11,6,1,1,55,1502,514.1429,Yes
90687,98,51,1,6,2,8,2,1,0,39,1728,501.4286,No
50611,98,4,6,4,5,7,6,1,1,62,489,559.5714,No
94871,99,144,1,8,5,4,7,1,2,53,934,572.7143,Yes


Rows: 1,661
Columns: 14
$ Income              <dbl> 67267, 32474, 44931, 65324, 65324, 81044, 26872, 4…
$ Recency             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ MntGoldProds        <dbl> 30, 0, 7, 5, 5, 26, 32, 321, 22, 2, 10, 5, 30, 7, …
$ NumDealsPurchases   <dbl> 1, 1, 1, 3, 3, 1, 1, 0, 4, 1, 2, 1, 3, 3, 4, 3, 12…
$ NumWebPurchases     <dbl> 3, 1, 2, 6, 6, 5, 1, 25, 2, 1, 2, 1, 5, 5, 8, 2, 9…
$ NumCatalogPurchases <dbl> 2, 0, 1, 2, 2, 6, 1, 0, 1, 0, 0, 0, 2, 1, 8, 1, 2,…
$ NumStorePurchases   <dbl> 5, 2, 3, 9, 9, 10, 2, 0, 5, 3, 3, 2, 5, 8, 6, 4, 8…
$ NumWebVisitsMonth   <dbl> 2, 7, 5, 4, 4, 1, 6, 1, 4, 4, 8, 7, 7, 5, 6, 7, 8,…
$ Response            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Childhome           <dbl> 1, 2, 1, 1, 1, 0, 0, 1, 2, 1, 1, 3, 1,

Response,count,percentage
<fct>,<int>,<dbl>
0,1412,85.00903
1,249,14.99097
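
The percentages in the summary follow directly from the class counts. A minimal base R sketch of the arithmetic, using the counts 1412 and 249 copied from the output above (the names `counts` and `percentages` are only for illustration):

```r
# Class counts copied from the summary output above
counts <- c("0" = 1412, "1" = 249)

# Share of each class as a percentage of the training set
percentages <- counts / sum(counts) * 100
round(percentages, 2)
```

Dividing by `sum(counts)` instead of a hard-coded total keeps the calculation correct if the size of the training set changes, for example after a different seed or split proportion.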
