

# Black Friday Examined
With the holiday season fast approaching, I found it intriguing to examine a dataset revolving around a hypothetical store and data of its shoppers. As described by the author, "The dataset is comprised of 550,000 observations about Black Friday shoppers in a retail store, it contains different kinds of variables either numerical or categorical. It contains missing values." (Mehdi Dagdoug)

More information and the dataset itself can be found at: https://www.kaggle.com/mehdidag/black-friday

Although this EDA + Apriori only utilizes the provided dataset, similar techniques can be applied to any comparable dataset or business problem.

# Table of Contents
1. <a href="#Introduction">Introduction</a>
2. <a href="#Exploratory Data Analysis (EDA)"> Exploratory Data Analysis (EDA)</a>
    1. <a href="#Gender"> Gender</a>
    2. <a href="#Top Sellers"> Top Sellers</a>
    3. <a href="#Age"> Age</a>
    4. <a href="#City"> City</a>
    5. <a href="#Stay in Current City"> Stay in Current City</a>
    6. <a href="#Purchase"> Purchase</a>
    7. <a href="#Marital Status"> Marital Status</a>
    8. <a href="#Top Shoppers"> Top Shoppers</a>
    9. <a href="#Occupation"> Occupation</a>
3. <a href="#Apriori (Association Rule Learning)"> Apriori (Association Rule Learning)</a>
4. <a href="#Conclusion"> Conclusion</a>
5. <a href="#Works Cited"> Works Cited</a>



<a id="Introduction">
# Introduction
 ### Origins:
The origins of "Black Friday" stem not from a day filled with shopping, discounts, and the start of the holiday season, but rather from a [**financial crisis**][1]! The first recorded use of the term "Black Friday" dates to September 24th, 1869, when two Wall Street businessmen, Jay Gould and Jim Fisk, decided to artificially inflate the price of gold and attempted to sell it for profit. As a result of their nefarious actions, on that specific Friday in September 1869, the price of gold dropped and the United States plunged into a state of financial devastation.

![](http://i0.wp.com/armstrongeconomics.com/wp-content/uploads/2012/03/120.png?zoom=1.5&resize=584%2C636)

[1]: https://www.history.com/news/whats-the-real-history-of-black-friday

### First Recorded Use:
Various stories exist regarding the first recorded use of the term as it relates to holiday shopping, but it continued to carry a negative stigma until the late 20th century.

"Black Friday" and its relation to consumerism first derived from [1950s Philadelphia][1]. Philadelphia suburbanites descended on the city after the Thanksgiving holiday to watch the traditional Army/Navy college football game and take advantage of sales and promotions brought about by the influx of spectators to the city. Philadelphia police officers who were assigned to work that weekend coined the term due to their long, grueling shifts and the masses of people and shoppers. Philadelphia businesses also started to use the term to describe the long lines and shopping mayhem at their stores.

[1]: https://www.cnn.com/2018/11/21/business/black-friday-history/index.html

### Use Within Business:
Although the term "Black Friday" originally represented the pitfalls of two Wall Street businessmen and the mayhem of downtown Philadelphia following Thanksgiving, it is now familiarly known as the **busiest** shopping day of the year.

One possible explanation (or rumor) for the term as it relates to consumers and retailers is that "Black Friday" represents the first day of the year on which businesses turned a profit, back when accounting was done in a handwritten ledger. As described by Oxford Dictionaries, "The use of colors in accounting refers back to the bookkeeping practice of recording the credit side of an account in a ledger in *black* ink and the debit side in *red* ink." ([Oxford Dictionaries][1]) Hence the name "Black Friday" being associated with businesses' credits overtaking their debits, moving their books out of the red and into the black. Although this idea might make sense, the claim hasn't been completely verified.

[1]: https://en.oxforddictionaries.com/explore/why-is-day-after-thanksgiving-black-friday/

### In Recent Times:
Black Friday as we know it today is an extravaganza of sales, promotions, and long lines outside of stores. Retailers such as Target, Best Buy, Amazon, and many others look forward to this day every year with the hopes that consumers will take advantage of door-busting deals. The term "Black Friday" has also spawned other retail holidays such as "Cyber Monday", "Small-Business Saturday", and "Giving Tuesday." Here are a few noteworthy statistics from 2018's "Black Friday":
* [Visits by shoppers to physical stores were down 1.7% from 2017][1]
* [Shoppers spent $6.22 billion online, up 23.6% from 2017][1]
![](http://www.staradvertiser.com/wp-content/uploads/2018/11/web1_AP18327724044802--1-.jpg)

[1]: http://fortune.com/2018/11/25/black-friday-foot-traffic-falls/

<a id="Exploratory Data Analysis (EDA)">
# Exploratory Data Analysis (EDA)
To begin, let's load the dataset that we will be using for this Exploratory Data Analysis (EDA).

In [None]:
dataset = read.csv("../input/BlackFriday.csv")

Now, let's import the libraries we will be utilizing in this kernel.

In [None]:
library(tidyverse)
library(scales)
library(arules)
library(gridExtra)

The [tidyverse][1] package is what we will use for visualizing and exploring our dataset. It is known for easy-to-read syntax and a large number of useful functions. The [scales][2] package will be used mainly to customize plot axes. The gridExtra package provides `grid.arrange()`, which we will use to place plots side by side. Lastly, the [arules][3] package will be utilized in the final part of this kernel, Association Rule Learning and Apriori. Info regarding all packages used during this EDA is provided in the **Works Cited** section of this kernel.

Let's start with a quick overview of the entire dataset.

[1]: https://www.tidyverse.org/
[2]: https://cran.r-project.org/web/packages/scales/scales.pdf
[3]: https://cran.r-project.org/web/packages/arules/arules.pdf

In [None]:
summary(dataset)

head(dataset)

It looks like we have 12 different columns, each representing a corresponding variable below.

* *User_ID*: Unique identifier of shopper.
* *Product_ID*: Unique identifier of product. (No key given)
* *Gender*: Sex of shopper.
* *Age*: Age of shopper split into bins.
* *Occupation*: Occupation of shopper. (No key given)
* *City_Category*: Residence location of shopper. (No key given)
* *Stay_In_Current_City_Years*: Number of years stay in current city.
* *Marital_Status*: Marital status of shopper.
* *Product_Category_1*: Product category of purchase.
* *Product_Category_2*: Product may belong to other category.
* *Product_Category_3*: Product may belong to other category.
* *Purchase*: Purchase amount in dollars.

If we look at the first few rows of our dataset, we can see that each row represents a different transaction, or item purchased by a specific customer. This will come into play later on when we group all transactions by a specific *User_ID* to get a sum of all purchases made by a single customer.

One critique we can make regarding this dataset is that no key is given linking each Product_ID to the item it represents (i.e., we can't attribute P00265242 to an easily recognizable item). In reality, we would want another dataset that provides the name of each item alongside its Product_ID, which we could then join to our existing dataset. This won't necessarily affect our EDA, but it would be more useful during our implementation of the Apriori algorithm and could make some parts of the EDA clearer to interpret.
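
As a rough sketch of the join described above, here is what attaching item names could look like if a lookup table existed. The `product_key` table, its `Product_Name` column, and the item names are all hypothetical, since the dataset ships without such a key.

```r
library(dplyr)

# Toy transactions using the same Product_ID format as the dataset
transactions <- data.frame(
  User_ID    = c(1000001, 1000001, 1000002),
  Product_ID = c("P00265242", "P00110742", "P00265242"),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL lookup table: these names are invented for illustration only
product_key <- data.frame(
  Product_ID   = c("P00265242", "P00110742"),
  Product_Name = c("Item A", "Item B"),
  stringsAsFactors = FALSE
)

# left_join keeps every transaction and attaches the name where a match exists
transactions_named <- left_join(transactions, product_key, by = "Product_ID")
transactions_named
```

A left join is used here so that transactions with no entry in the key would be kept (with `NA` as the name) rather than silently dropped.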

<a id="Gender">
## Gender
To begin our exploration, let's examine the gender of shoppers at this store.

Since each row represents an individual transaction, we must first group the data by User_ID to remove duplicates.

In [None]:
dataset_gender = dataset %>%
                    select(User_ID, Gender) %>%
                    group_by(User_ID) %>%
                    distinct()

head(dataset_gender)

table(dataset_gender$Gender)   # counts per gender; works whether Gender is a factor or a character column

Now that we have a dataframe showing each User_ID's corresponding gender, along with the total counts for reference, let's plot the distribution of gender across our dataset.

In [None]:
options(scipen=10000)   # To remove scientific numbering

genderDist  = ggplot(data = dataset_gender) +
                geom_bar(mapping = aes(x = Gender, y = ..count.., fill = Gender)) +
                labs(title = 'Gender of Customers') +
                scale_fill_brewer(palette = 'PuBuGn')
print(genderDist)

As we can see, there are quite a few more males than females shopping at our store on Black Friday. This gender split metric is helpful to retailers because some might want to modify their store layout, product selection, and other variables differently depending on the gender proportion of their shoppers.

A study published in the Clothing and Textiles Research Journal writes,
* "Involvement, variety seeking, and physical environment of stores were selected as antecedents of shopping experience satisfaction....The structural model for female subjects confirmed the existence of the mediating role of hedonic shopping value in shopping satisfaction, whereas the model for male respondents did not." Chang, E., Burns, L. D., & Francis, S. K. (2004) (Abstract)

Although this does not give direct insight into recommended actions for retail stores, it does display a difference in the value derived from shopping and its relationship to gender, which should be taken into account by retailers.

To investigate further, let's compute the average spending amount as it relates to Gender. For easy interpretation and traceback, we will create separate tables and then join them together.

In [None]:
total_purchase_user = dataset %>%
                        select(User_ID, Gender, Purchase) %>%
                        group_by(User_ID) %>%
                        arrange(User_ID) %>%
                        summarise(Total_Purchase = sum(Purchase))

user_gender = dataset %>%
                select(User_ID, Gender) %>%
                group_by(User_ID) %>%
                arrange(User_ID) %>%
                distinct()

head(user_gender)
head(total_purchase_user)

In [None]:
user_purchase_gender = full_join(total_purchase_user, user_gender, by = "User_ID")
head(user_purchase_gender)

In [None]:
average_spending_gender = user_purchase_gender %>%
                            group_by(Gender) %>%
                            summarize(Purchase = sum(as.numeric(Total_Purchase)),
                                      Count = n(),
                                      Average = Purchase/Count)
head(average_spending_gender)

We can see that the average total spending per customer was 699054.00 for females and 911963.20 for males. Let's visualize our results.

In [None]:
genderAverage  = ggplot(data = average_spending_gender) +
                    geom_bar(mapping = aes(x = Gender, y = Average, fill = Gender), stat = 'identity') +
                    labs(title = 'Average Spending by Gender') +
                    scale_fill_brewer(palette = 'PuBuGn')
print(genderAverage)

Here we see an interesting observation. Even though female shoppers make fewer purchases than males at this specific store, they seem to be spending almost as much on average as the male shoppers. That said, scale needs to be taken into account, because females on average are still spending roughly 213,000 less than males.
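
The size of that gap can be checked directly from the two averages reported above. This is only a quick arithmetic sketch; the variable names are introduced here for illustration.

```r
# Average total spending per shopper, copied from the table above
avg_female <- 699054.00
avg_male   <- 911963.20

gap   <- avg_male - avg_female   # absolute difference between the two averages
ratio <- avg_female / avg_male   # female average as a share of the male average

gap
round(ratio, 2)
```

Looking at both the absolute gap and the ratio helps avoid over-reading a bar chart whose bars look similar in height.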

<a id="Top Sellers">
## Top Sellers
Now let's switch gears and examine our top selling products. In this situation, we won't remove duplicate Product_IDs, since we want every occurrence to be counted in case people are buying two or more units of the same product.

In [None]:
top_sellers = dataset %>%
                count(Product_ID, sort = TRUE)

top_5 = head(top_sellers, 5)

top_5

Looks like our top 5 best sellers are (by product ID):
* P00265242 = 1858
* P00110742 = 1591
* P00025442 = 1586
* P00112142 = 1539
* P00057642 = 1430

Now that we have identified our top 5 best selling products, let's examine the best selling product, P00265242.

In [None]:
best_seller = dataset[dataset$Product_ID == 'P00265242', ]

head(best_seller)

We can see that this product fits into Product_Category_1 = 5 and Product_Category_2 = 8. As mentioned in the introduction, it would be useful to have a key to reference the item name in order to determine what it is.

Another interesting finding is that even though people are purchasing the same product, they are paying different prices. This could be due to various Black Friday promotions, discounts, or coupon codes. Otherwise, investigation would need to be done regarding the reason for different purchase prices of the same product between customers.
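
One way to start that investigation is to measure how much the price of each product varies. The sketch below uses made-up toy data (the `toy` table and its values are not from the dataset) and assumes the same `Product_ID` and `Purchase` column names.

```r
library(dplyr)

# Toy data: the same product sold at different prices (values are made up)
toy <- data.frame(
  Product_ID = c("P1", "P1", "P1", "P2", "P2"),
  Purchase   = c(6900, 8800, 8800, 5000, 5000),
  stringsAsFactors = FALSE
)

price_spread <- toy %>%
  group_by(Product_ID) %>%
  summarise(
    Distinct_Prices = n_distinct(Purchase),  # how many different prices were paid
    Min_Price = min(Purchase),
    Max_Price = max(Purchase)
  )
price_spread
```

Products with more than one distinct price would be the candidates for a closer look at promotions or coupon usage.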

Let's continue to analyze our best seller to see if any relationship to Gender exists.

In [None]:
genderDist_bs  = ggplot(data = best_seller) +
                  geom_bar(mapping = aes(x = Gender, y = ..count.., fill = Gender)) +
                  labs(title = 'Gender of Customers (Best Seller)') +
                  scale_fill_brewer(palette = 'PuBuGn')
print(genderDist_bs)

We see a similar distribution between genders to our overall dataset gender split - let's confirm.


In [None]:
genderDist_bs_prop = ggplot(data = best_seller) +
                          geom_bar(fill = 'lightblue', mapping = aes(x = Gender, y = ..prop.., group = 1)) +
                          labs(title = 'Gender of Customers (Best Seller - Proportion)') +
                          theme(plot.title = element_text(size=9.5))

genderDist_prop = ggplot(data = dataset_gender) +
                      geom_bar(fill = "lightblue4", mapping = aes(x = Gender, y = ..prop.., group = 1)) +
                      labs(title = 'Gender of Customers (Total Proportion)') +
                      theme(plot.title = element_text(size=9.5))

grid.arrange(genderDist_prop, genderDist_bs_prop, ncol=2)

We can see that purchasers of the best seller and purchasers of all products are both roughly 25% female and 75% male. A slight difference does exist, but it seems we can generally conclude that our best seller does not cater to a specific gender.
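
To compare the two splits numerically rather than by eye, the proportions can be tabulated side by side. This is a minimal sketch on invented toy vectors; the counts below are not taken from the dataset.

```r
# Toy gender vectors; the counts are invented, not taken from the dataset
all_customers      <- c(rep("F", 28), rep("M", 72))
best_seller_buyers <- c(rep("F", 5),  rep("M", 15))

# prop.table turns counts into shares that sum to 1
prop_all  <- prop.table(table(all_customers))
prop_best <- prop.table(table(best_seller_buyers))

# Side-by-side comparison of the two gender splits
comparison <- rbind(All         = as.numeric(prop_all),
                    Best_Seller = as.numeric(prop_best))
colnames(comparison) <- names(prop_all)
round(comparison, 2)
```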

Now, let's move on and examine the Age variable.

<a id="Age">
## Age
Let's begin examining Age by creating a table of each individual age group and their respective counts.

In [None]:
customers_age = dataset %>%
                    select(User_ID, Age) %>%
                    distinct() %>%
                    count(Age)
customers_age

Here, we can see a table that shows the count of customers in each Age category at our store. Let's visualize this table.

In [None]:
customers_age_vis = ggplot(data = customers_age) +
                      geom_bar(color = 'black', stat = 'identity', mapping = aes(x = Age, y = n, fill = Age)) +
                      labs(title = 'Age of Customers') +
                      theme(axis.text.x = element_text(size = 10)) +
                      scale_fill_brewer(palette = 'Blues') +
                      theme(legend.position="none")
print(customers_age_vis)

We can also plot a similar chart depicting the distribution of age within our "best seller" category. This will show us if there is a specific age category that purchased the best selling product more than other shoppers.

In [None]:
ageDist_bs  = ggplot(data = best_seller) +
                  geom_bar(color = 'black', mapping = aes(x = Age, y = ..count.., fill = Age)) +
                  labs(title = 'Age of Customers (Best Seller)') +
                  theme(axis.text.x = element_text(size = 10)) +
                  scale_fill_brewer(palette = 'GnBu') +
                  theme(legend.position="none")
print(ageDist_bs)

It seems as though younger people (18-25 & 26-35) account for the highest number of purchases of the best selling product. Let's compare this observation to the overall dataset.

In [None]:
grid.arrange(customers_age_vis, ageDist_bs, ncol=2)

We can see that there is some deviation in the proportion of customers grouped by age when comparing the best selling product to the overall dataset. It looks like customers over age 45 are buying the top seller **slightly** less than other products included in the overall dataset.

Now that we have examined age, let's move to another variable.

<a id="City">
## City
Let's create a table of each User_ID and their corresponding City_Category.

In [None]:
customers_location =  dataset %>%
                            select(User_ID, City_Category) %>%
                            distinct()
head(customers_location)

In [None]:
customers_location_vis = ggplot(data = customers_location) +
                          geom_bar(color = 'white', mapping = aes(x = City_Category, y = ..count.., fill = City_Category)) +
                          labs(title = 'Location of Customers') +
                          scale_fill_brewer(palette = "Dark2") +
                          theme(legend.position="none")
print(customers_location_vis)

We can see that most of our customers live in **City C**. Now, we can compute the total purchase amount by City to see which city's customers spent the most at our store.

In [None]:
purchases_city = dataset %>%
                  group_by(City_Category) %>%
                  summarise(Purchases = sum(Purchase))

purchases_city_1000s = purchases_city %>%
                          mutate(purchasesThousands = Purchases / 1000)

purchases_city_1000s

In order to work with larger numbers, we divided the Purchases column by 1,000. This is a common practice within the business and accounting world, and it makes large numbers easier to read and chart.
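
An alternative worth noting is to leave the data untouched and rescale only the labels. The sketch below assumes a version of the scales package that provides `label_number()`; the amounts are invented toy values.

```r
library(scales)

# Toy purchase totals (invented values)
amounts <- c(1250000, 2400000)

# Build a labeller that divides by 1,000 and appends a "K" suffix
thousands_label <- label_number(scale = 1e-3, suffix = "K",
                                big.mark = ",", accuracy = 1)

labels_k <- thousands_label(amounts)
labels_k
```

A labeller like this could be passed to `scale_y_continuous(labels = ...)` in a plot, which keeps the underlying column in its original units.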

Now that we have our table, let's visualize our results.

In [None]:
purchaseCity_vis = ggplot(data = purchases_city_1000s, aes(x = City_Category, y = purchasesThousands, fill = City_Category)) +
                      geom_bar(color = 'white', stat = 'identity') +
                      labs(title = 'Total Customer Purchase Amount (by City)', y = '($000s)', x = 'City Category') +
                      scale_fill_brewer(palette = "Dark2") +
                      theme(legend.position="none", plot.title = element_text(size = 9))
print(purchaseCity_vis)

In [None]:
grid.arrange(customers_location_vis, purchaseCity_vis, ncol=2)

Here we can see that customers from City C were the most frequent shoppers at our store on Black Friday, but customers from City B had the **highest** amount of total purchases.

Let's continue to investigate and try to determine the reason for this observation.

Let's find how many purchases were made by customers from each city. First, we will get the total number of purchases for each corresponding User_ID.

In [None]:
customers = dataset %>%
              group_by(User_ID) %>%
              count(User_ID)
head(customers)

This tells us how many times each user made a purchase. To dive deeper, let's attach each user's City_Category to these counts and then total the number of purchases for each city.

In [None]:
customers_City =  dataset %>%
                    select(User_ID, City_Category) %>%
                    group_by(User_ID) %>%
                    distinct() %>%
                    ungroup() %>%
                    left_join(customers, by = 'User_ID')   # attach each user's purchase count to their city
head(customers_City)

city_purchases_count = customers_City %>%
                        select(City_Category, n) %>%
                        group_by(City_Category) %>%
                        summarise(CountOfPurchases = sum(n))
city_purchases_count

In [None]:
city_count_purchases_vis = ggplot(data = city_purchases_count, aes(x = City_Category, y = CountOfPurchases, fill = City_Category)) +
                              geom_bar(color = 'white', stat = 'identity') +
                              labs(title = 'Total Purchase Count (by City)', y = 'Count', x = 'City Category') +
                              scale_fill_brewer(palette = "Dark2") +
                              theme(legend.position="none", plot.title = element_text(size = 9))
print(city_count_purchases_vis)

In [None]:
grid.arrange(purchaseCity_vis, city_count_purchases_vis, ncol = 2)

One inference we can make from these charts is that customers from City B are simply making **more** purchases than residents of City A and City C, and not necessarily buying more expensive products. We can make this assumption because the "Total Purchase Count" chart has a very similar appearance to the "Total Customer Purchase Amount" chart. If it were the other case, then customers from City B would most likely have a lower count of total purchases corresponding to a higher total purchase amount.
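
This reasoning can also be checked numerically by dividing each city's total amount by its purchase count. The sketch below runs on invented toy transactions (not the real data) and assumes the same `City_Category` and `Purchase` column names.

```r
library(dplyr)

# Toy transactions; cities and amounts are invented for illustration
toy <- data.frame(
  City_Category = c("A", "A", "B", "B", "B", "B", "C", "C", "C"),
  Purchase      = c(9000, 9400, 9100, 9300, 9200, 9000, 9500, 9100, 9300),
  stringsAsFactors = FALSE
)

city_summary <- toy %>%
  group_by(City_Category) %>%
  summarise(
    Total_Amount     = sum(Purchase),
    Purchase_Count   = n(),
    Avg_Per_Purchase = Total_Amount / Purchase_Count  # average amount per transaction
  )
city_summary
```

If the average amount per purchase is similar across cities, as in this toy example, then differences in total amount are driven mainly by the number of purchases.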

Now, since we have identified that the purchase counts across City_Category follow a similar distribution to total purchase amount, let's examine the distribution of our best selling product (P00265242) within each City_Category.

In [None]:
head(best_seller)

best_seller_city = best_seller %>%
                    select(User_ID, City_Category) %>%
                    distinct() %>%
                    count(City_Category)
best_seller_city

In [None]:
best_seller_city_vis = ggplot(data = best_seller_city, aes(x = City_Category, y = n, fill = City_Category)) +
                              geom_bar(color = 'white', stat = 'identity') +
                              labs(title = 'Best Seller Purchase Count (by City)', y = 'Count', x = 'City Category') +
                              scale_fill_brewer(palette = "Blues") +
                              theme(legend.position="none", plot.title = element_text(size = 9))
grid.arrange(city_count_purchases_vis,best_seller_city_vis, ncol = 2)

An interesting revelation has been made! Although customers residing in City C purchase **more** of our "best seller" than City A + B, residents of City C fall behind City B in overall number of purchases.

<a id="Stay in Current City">
## Stay in Current City
Let's now examine how long customers have lived in their current city.

In [None]:
customers_stay = dataset %>%
                    select(User_ID, City_Category, Stay_In_Current_City_Years) %>%
                    group_by(User_ID) %>%
                    distinct()
head(customers_stay)

Now that we have our dataset in order, we can plot and explore.

Let's see where most of our customers are living.

In [None]:
residence = customers_stay %>%
                group_by(City_Category) %>%
                tally()
head(residence)

Looks like most of our customers are living in City C. Now, let's investigate further.

In [None]:
customers_stay_vis = ggplot(data = customers_stay, aes(x = Stay_In_Current_City_Years, y = ..count.., fill = Stay_In_Current_City_Years)) +
                              geom_bar(stat = 'count') +
                              scale_fill_brewer(palette = 15) +
                              labs(title = 'Customers Stay in Current City', y = 'Count', x = 'Stay in Current City', fill = 'Number of Years in Current City')
print(customers_stay_vis)

It looks like most of our customers have only been living in their respective cities for 1 year. In order to see a better distribution, let's make a stacked bar chart according to each City_Category.

In [None]:
stay_cities = customers_stay %>%
                group_by(City_Category, Stay_In_Current_City_Years) %>%
                tally() %>%
                mutate(Percentage = (n/sum(n))*100)
head(stay_cities)

In [None]:
ggplot(data = stay_cities, aes(x = City_Category, y = n, fill = Stay_In_Current_City_Years)) +
    geom_bar(stat = "identity", color = 'white') +
    scale_fill_brewer(palette = 2) +
    labs(title = "City Category + Stay in Current City",
            y = "Total Count (Years)",
            x = "City",
            fill = "Stay Years")

Looking at this chart we can see the distribution of the total customer base and their respective city residences, split by the amount of time they have lived there. Here, we can notice that in every City_Category, the most common stay length seems to be **1** year.
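
The `Percentage` column computed in `stay_cities` relies on a dplyr grouping detail: after `tally()`, the result is still grouped by `City_Category`, so `sum(n)` is a per-city total. A small sketch on invented toy rows illustrates this.

```r
library(dplyr)

# Toy customers; the rows are invented for illustration
toy <- data.frame(
  City_Category = c("A", "A", "A", "A", "B", "B"),
  Stay_In_Current_City_Years = c("1", "1", "2", "4+", "1", "2"),
  stringsAsFactors = FALSE
)

stay_share <- toy %>%
  group_by(City_Category, Stay_In_Current_City_Years) %>%
  tally() %>%                             # drops the last grouping level only
  mutate(Percentage = n / sum(n) * 100)   # so sum(n) is the total within each city
stay_share
```

Because of this, the percentages add up to 100 within each city rather than across the whole table.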

<a id="Purchase">
## Purchase
Now let's do some investigation regarding store customers and their purchases. We will start by computing the total purchase amount by User ID.

In [None]:
customers_total_purchase_amount = dataset %>%
                                    group_by(User_ID) %>%
                                    summarise(Purchase_Amount = sum(Purchase))

head(customers_total_purchase_amount)

Now that we have summed our purchases by User ID, we will sort and find our top spenders.

In [None]:
customers_total_purchase_amount = arrange(customers_total_purchase_amount, desc(Purchase_Amount))

head(customers_total_purchase_amount)

Looks like User ID 1004277 is our top spender. Let's use summary() to see other facets of our total customer spending data.

In [None]:
summary(customers_total_purchase_amount)

We can see an **average** total purchase amount of 851752, a **max** total purchase amount of 10536783, a **min** total purchase amount of 44108, and a **median** total purchase amount of 512612.

Let's plot a chart showing the distribution of total purchase amounts to see if they are normally distributed or contain some skewness. A density plot will show us where total purchase amounts are most concentrated across the entire customer base. It is important to note that a density plot is an estimate: it takes the observed values as input and draws a smoothed curve approximating their probability density.

In [None]:
ggplot(customers_total_purchase_amount, aes(Purchase_Amount)) +
  geom_density(adjust = 1) +
  geom_vline(aes(xintercept=median(Purchase_Amount)),
             color="blue", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=mean(Purchase_Amount)),
             color="red", linetype="dashed", size=1) +
  geom_text(aes(x=mean(Purchase_Amount), label=round(mean(Purchase_Amount)), y=1.2e-06), color = 'red', angle=360,
            size=4, vjust=3, hjust=-.1) +
  geom_text(aes(x=median(Purchase_Amount), label=round(median(Purchase_Amount)), y=1.2e-06), color = 'blue', angle=360,
            size=4, vjust=0, hjust=-.1) +
  scale_x_continuous(name="Purchase Amount", limits=c(0, 7500000), breaks = seq(0,7500000, by = 1000000), expand = c(0,0)) +
  scale_y_continuous(name="Density", limits=c(0, .00000125), labels = scientific, expand = c(0,0))

Here we see a strongly right-skewed (positively skewed) density plot with a long tail. A relatively small number of customers with very large totals pull the mean above the median, so the totals are clearly not normally distributed. We see that the largest density of purchases is around the 250000 mark.
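
The link between right skew and a mean that sits above the median can be illustrated with simulated data. This is only a sketch: the log-normal parameters below are arbitrary and are not fitted to the Black Friday data.

```r
set.seed(42)  # fixed seed so the simulated sample is reproducible

# Simulated right-skewed totals from a log-normal distribution.
# The parameters are arbitrary and NOT fitted to the Black Friday data.
simulated_totals <- rlnorm(5000, meanlog = 13, sdlog = 1)

sample_mean   <- mean(simulated_totals)
sample_median <- median(simulated_totals)

# In a right-skewed sample the long upper tail pulls the mean above the median
sample_mean > sample_median
```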


<a id="Marital Status">
## Marital Status
Let's now examine the marital status of store customers.

In [None]:
dataset_maritalStatus = dataset %>%
                            select(User_ID, Marital_Status) %>%
                            group_by(User_ID) %>%
                            distinct()

head(dataset_maritalStatus)

Note, we need to quickly change Marital_Status from a numeric variable to a categorical type.

In [None]:
dataset_maritalStatus$Marital_Status = as.character(dataset_maritalStatus$Marital_Status)
typeof(dataset_maritalStatus$Marital_Status)

If we look back at the variable descriptions of the dataset, we don't have a clear guide for marital status. In other cases, it would be best to reach out to the provider of the data to be completely sure of what the values in a column represent, but in this case we will assume that 1 = married and 0 = single.
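
If we wanted that assumption to be visible in tables and plot legends, one option would be a labeled factor. The sketch below uses a toy vector, and the label names are our own choice based on the assumption above, not something the dataset provides.

```r
# Toy vector standing in for the Marital_Status column
marital_raw <- c(0, 1, 0, 0, 1)

# ASSUMPTION: 0 = single and 1 = married; the dataset itself provides no key
marital_labeled <- factor(marital_raw,
                          levels = c(0, 1),
                          labels = c("Single", "Married"))
table(marital_labeled)
```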

In [None]:
marital_vis = ggplot(data = dataset_maritalStatus) +
                    geom_bar(mapping = aes(x = Marital_Status, y = ..count.., fill = Marital_Status)) +
                    labs(title = 'Marital Status') +
                    scale_fill_brewer(palette = 'Pastel2')
print(marital_vis)

It looks like most of our shoppers happen to be single or unmarried. Similar to our investigation of age groups, we can look at the makeup of Marital_Status in each City_Category.

In [None]:
dataset_maritalStatus = dataset_maritalStatus %>%
                            full_join(customers_stay, by = 'User_ID')
head(dataset_maritalStatus)

In [None]:
maritalStatus_cities = dataset_maritalStatus %>%
                        group_by(City_Category, Marital_Status) %>%
                        tally()
head(maritalStatus_cities)

In [None]:
ggplot(data = maritalStatus_cities, aes(x = City_Category, y = n, fill = Marital_Status)) +
    geom_bar(stat = "identity", color = 'black') +
    scale_fill_brewer(palette = 2) +
    labs(title = "City + Marital Status",
            y = "Total Count (Shoppers)",
            x = "City",
            fill = "Marital Status")

Here, we can see that out of all cities, the highest proportion of single shoppers seems to be in City A. Now, let's investigate the Age distribution within each City_Category.

In [None]:
Users_Age = dataset %>%
                select(User_ID, Age) %>%
                distinct()
head(Users_Age)

In [None]:
dataset_maritalStatus = dataset_maritalStatus %>%
                            full_join(Users_Age, by = 'User_ID')
head(dataset_maritalStatus)

In [None]:
City_A = dataset_maritalStatus %>%
            filter(City_Category == 'A')
City_B = dataset_maritalStatus %>%
            filter(City_Category == 'B')
City_C = dataset_maritalStatus %>%
            filter(City_Category == 'C')
head(City_A)
head(City_B)
head(City_C)

In [None]:
City_A_stay_vis = ggplot(data = City_A, aes(x = Age, y = ..count.., fill = Age)) +
                              geom_bar(stat = 'count') +
                              scale_fill_brewer(palette = 8) +
                              theme(legend.position="none", axis.text = element_text(size = 6)) +
                              labs(title = 'City A', y = 'Count', x = 'Age', fill = 'Age')
City_B_stay_vis = ggplot(data = City_B, aes(x = Age, y = ..count.., fill = Age)) +
                              geom_bar(stat = 'count') +
                              scale_fill_brewer(palette = 9) +
                              theme(legend.position="none", axis.text = element_text(size = 6)) +
                              labs(title = 'City B', y = 'Count', x = 'Age', fill = 'Age')
City_C_stay_vis = ggplot(data = City_C, aes(x = Age, y = ..count.., fill = Age)) +
                              geom_bar(stat = 'count') +
                              scale_fill_brewer(palette = 11) +
                              theme(legend.position="none", axis.text = element_text(size = 6)) +
                              labs(title = 'City C', y = 'Count', x = 'Age', fill = 'Age')

grid.arrange(City_A_stay_vis, City_B_stay_vis, City_C_stay_vis, ncol = 3)

It looks as though City A has **fewer** shoppers over the age of 45 living there compared to the other cities. This could be a factor in the resulting levels of Marital_Status within each individual city.

<a id="Top Shoppers">
## Top Shoppers
Now, we will investigate who our top shoppers were on Black Friday.

In [None]:
top_shoppers = dataset %>%
                count(User_ID, sort = TRUE)

head(top_shoppers)

Looks like User_ID 1001680 shows up the most on our master ledger of shopper data. Since each individual row represents a different transaction/product, it looks like this user made over **1000** total transactions! We can join together this top shoppers dataset with our total customer purchases dataset to see them combined.

In [None]:
top_shoppers =  top_shoppers %>%
                    select(User_ID, n) %>%
                    left_join(customers_total_purchase_amount, by = 'User_ID')   # add each user's total Purchase_Amount

head(top_shoppers)

Now that we have joined the two tables together, we can see that although User_ID 1001680 has the highest number of total purchases, User_ID 1004277 has the highest Purchase_Amount as identified in our earlier charts as well. From here, we can also compute the average Purchase_Amount for each user.

In [None]:
top_shoppers = mutate(top_shoppers,
                  Average_Purchase_Amount = Purchase_Amount/n)

head(top_shoppers)

Now, we can sort according to Average_Purchase_Amount to see which customers, on average, are spending the most.

In [None]:
top_shoppers_averagePurchase = top_shoppers %>%
                                    arrange(desc(Average_Purchase_Amount))

head(top_shoppers_averagePurchase)

Looks like User_ID 1005069 has the highest Average_Purchase_Amount and a total Purchase_Amount of 308454. User_ID 1003902 is right behind User_ID 1005069 in Average_Purchase_Amount, but has a much higher total Purchase_Amount of 1746284.
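To make the difference between ranking by number of transactions and ranking by average spend concrete, here is a minimal sketch on a made-up ledger. The `ledger` data frame and its values are invented for illustration, and the steps use base R (`aggregate()` and `merge()`) rather than the dplyr pipeline used above, so the sketch runs without extra packages.

```r
# Toy ledger: one row per purchased product (values are made up)
ledger <- data.frame(
  User_ID  = c(1, 1, 1, 1, 2, 2, 3),
  Purchase = c(10, 20, 10, 20, 100, 200, 50)
)

# Rows per user (number of transactions) and total spend per user
counts <- aggregate(Purchase ~ User_ID, data = ledger, FUN = length)
totals <- aggregate(Purchase ~ User_ID, data = ledger, FUN = sum)
names(counts)[2] <- "n"
names(totals)[2] <- "Purchase_Amount"

# Join the two tables on User_ID and compute the average spend per transaction
shoppers <- merge(counts, totals, by = "User_ID")
shoppers$Average_Purchase_Amount <- shoppers$Purchase_Amount / shoppers$n

# The same table sorted two ways
by_count   <- shoppers[order(-shoppers$n), ]
by_average <- shoppers[order(-shoppers$Average_Purchase_Amount), ]

print(by_count)
print(by_average)
```

In this toy ledger, user 1 has the most transactions but the lowest average spend, while user 2 tops the average ranking with only two transactions. This is the same pattern we see between our most frequent shopper and our highest-average shoppers.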

<a id="Occupation">
## Occupation
The last thing we will analyze is the occupation of customers in our dataset.

In [None]:
customers_Occupation =  dataset %>%
                          select(User_ID, Occupation) %>%
                          group_by(User_ID) %>%
                          distinct() %>%
                          left_join(customers_total_purchase_amount, by = 'User_ID')   # join on User_ID to bring in each customer's total Purchase_Amount

head(customers_Occupation)

Now that we have the dataset we need, we can sum the total Purchase_Amount for each Occupation identifier. We will then convert Occupation to a character data type.

In [None]:
totalPurchases_Occupation = customers_Occupation %>%
                              group_by(Occupation) %>%
                              summarise(Purchase_Amount = sum(Purchase_Amount)) %>%
                              arrange(desc(Purchase_Amount))

totalPurchases_Occupation$Occupation = as.character(totalPurchases_Occupation$Occupation)
typeof(totalPurchases_Occupation$Occupation)

head(totalPurchases_Occupation)

Now, let's plot each occupation and its total Purchase_Amount.

In [None]:
occupation = ggplot(data = totalPurchases_Occupation) +
                  geom_bar(mapping = aes(x = reorder(Occupation, -Purchase_Amount), y = Purchase_Amount, fill = Occupation), stat = 'identity') +
                  scale_x_discrete(name="Occupation", breaks = seq(0,20, by = 1), expand = c(0,0)) +
                  scale_y_continuous(name="Purchase Amount ($)", expand = c(0,0), limits = c(0, 750000000)) +
                  labs(title = 'Total Purchase Amount by Occupation') +
                  theme(legend.position="none")
print(occupation)

Looks like customers labeled as Occupation 4 spent the most at our store on Black Friday, with customers of Occupation 0 and 7 close behind. Here, if a key were given, we could use that information to classify our shoppers accordingly.

<a id="Apriori (Association Rule Learning)">
# Apriori (Association Rule Learning)
Now let's use a machine learning algorithm called [Apriori][2] to make some association rules regarding customer purchases. We will be using the [arules][1] package.

Before we begin, let's elaborate on the idea of Association Rule Learning. In its simplest form, Association Rule Learning attempts to predict customer transactions. In other words, the algorithm solves the problem, "People who bought ----- also bought ----- ." This can prove to be extremely useful for retailers who aim to optimize product placement in stores and promotional campaigns.

In the case of our store on Black Friday, implementing an effective product placement strategy can help optimize sales of products normally bought together. For example, let's say that our store was to have a sale on TVs. It would be smart to place HDMI cables alongside these TVs because those items are usually purchased together. On the other hand, it may also prove to be smart to place them far apart so that customers need to walk throughout the entire store while searching for their desired item, where another product may catch their eye along the way.

The Apriori algorithm specifically finds sets of items that frequently occur together and, from them, rules that estimate the likelihood someone performs/purchases/watches something given knowledge about their prior actions.

[1]: https://cran.r-project.org/web/packages/arules/arules.pdf
[2]: https://en.wikipedia.org/wiki/Apriori_algorithm

To begin, let's import the libraries we will be using for this section, if we have not done so already.

In [None]:
library(arules)
library(arulesViz)
library(tidyverse)

The [arules][1] package was developed specifically to deal with Association Rule and Frequent Itemset mining.
In order to begin our analysis, we must retrieve the necessary data from the original dataset and then apply the correct formatting.

[1]: https://cran.r-project.org/web/packages/arules/arules.pdf

In [None]:
# Data Preprocessing
# Getting the dataset into the correct format
customers_products = dataset %>%
                        select(User_ID, Product_ID) %>%   # Selecting the columns we will need
                        group_by(User_ID) %>%             # Grouping by "User_ID"
                        arrange(User_ID) %>%              # Arranging by "User_ID"
                        mutate(id = row_number()) %>%     # Defining a key column for each "Product_ID" and its corresponding "User_ID" (Must do this for spread() to work properly)
                        spread(User_ID, Product_ID) %>%   # Converting our dataset from tall to wide format, and grouping "Product_IDs" to their corresponding "User_ID"
                        t()                               # Transposing the dataset from columns of "User_ID" to rows of "User_ID"

# Now we can remove the Id row we created earlier for spread() to work correctly.
customers_products = customers_products[-1,]

Now, in order for the Apriori algorithm to work correctly, we need to convert the customers_products table into a sparse matrix. Unfortunately, Apriori doesn't take strings or text as input, but rather 1s and 0s (binary format). This means that we must allocate a column for each individual product, and if a User_ID purchased that product, it will be marked as a 1. On the other hand, if the User_ID did not purchase that Product_ID, it will be marked with a 0.
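To make the 1/0 encoding concrete, here is a minimal base R sketch on a made-up purchase table. The `purchases` data frame and its IDs are invented for illustration, and this is not how arules stores its transactions internally. It only shows what the binary matrix looks like.

```r
# Toy long-format purchases: one row per (user, product) pair (made-up IDs)
purchases <- data.frame(
  User_ID    = c("u1", "u1", "u2", "u2", "u2", "u3"),
  Product_ID = c("P1", "P2", "P1", "P3", "P3", "P2")
)

# Cross-tabulate users against products, then reduce counts to 1 (bought) / 0 (not bought)
counts    <- unclass(table(User = purchases$User_ID, Product = purchases$Product_ID))
incidence <- (counts > 0) * 1L

print(incidence)

# Share of cells equal to 1, i.e. the density of the matrix
print(mean(incidence))
```

Note that user `u2` bought `P3` twice, but the matrix still holds a single 1 for that pair. This is the same effect `rm.duplicates = TRUE` has when we read our transactions.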

In order to do so, we need to use the [arules][1] library as described above. We first write the table out as a .csv file and then read it back in with the arules function "read.transactions()" to get our sparse matrix.

[1]: https://cran.r-project.org/web/packages/arules/arules.pdf

In [None]:
write.csv(customers_products, file = 'customers_products.csv')

customersProducts = read.transactions('customers_products.csv', sep = ',', rm.duplicates = TRUE) # remove duplicates with rm.duplicates

Before we apply the Apriori algorithm to our problem, let's take a look at our newly created sparse matrix.

In [None]:
summary(customersProducts)

Here, we can see that there are 5892 rows (elements/itemsets/transactions) and 10539 columns (items) in our sparse matrix. With this summary function, we get a density of 0.008768598 in our matrix. The density tells us that about 0.9% of the values in our sparse matrix are non-zero (1) and about 99.1% are zero (0).

As in our Exploratory Data Analysis, the summary() function also gives us the most frequent items that customers purchased, so we can cross-reference them against what we discovered earlier. Let's list what our sparse matrix gave us.
* P00265242 = 1858
* P00110742 = 1591
* P00025442 = 1586
* P00112142 = 1539
* P00057642 = 1430
* (Other) = 536489

Now let's compare it to what we discovered earlier.

"Looks like our top 5 best sellers are (by product ID)"
* P00265242 = 1858
* P00110742 = 1591
* P00025442 = 1586
* P00112142 = 1539
* P00057642 = 1430

**Awesome!** Looks like our sparse matrix matches what we discovered earlier. It is important to ensure that all data is being transferred correctly in every step of the analysis. This ensures repeatability and easy debugging should an error occur.

Let's continue to examine our sparse matrix.

In [None]:
summary(customersProducts)

The **"element (itemset/transaction) length distribution"** gives us a distribution of the number of items in a customer's (User) basket, and underneath it we can see more information, including the quartiles and the mean. In this case, we see a mean of 92.41, which means that on average, each customer purchased 92.41 items. Since we are aware of a few customers who purchased over 1000 items, it may be more useful to use the median value of 54.00 items instead, since the mean can be heavily affected by outlier values.
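The figures reported by summary() can be cross-checked against each other with a little arithmetic. The sketch below uses only numbers quoted in this section and assumes that the "(Other)" line is the combined count of all remaining items, so that the six counts add up to the number of 1s in the matrix.

```r
# Dimensions reported by summary(customersProducts)
n_customers <- 5892
n_items     <- 10539

# Most frequent items as listed above; "Other" is assumed to cover every remaining item
item_counts <- c(P00265242 = 1858, P00110742 = 1591, P00025442 = 1586,
                 P00112142 = 1539, P00057642 = 1430, Other = 536489)

# Total number of 1s in the sparse matrix
n_ones <- sum(item_counts)

# Density = share of cells that are 1
density <- n_ones / (n_customers * n_items)

# Mean basket size = 1s per customer
mean_basket <- n_ones / n_customers

print(round(density, 9))      # 0.008768598
print(round(mean_basket, 2))  # 92.41
```

Both values agree with the density and the mean basket size printed by summary(), which is another sign that the data was carried over correctly.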

To get a clearer picture of the items, let's create an item frequency plot, which is included in the arules package.

In [None]:
itemFrequencyPlot(customersProducts, topN = 25)    # topN limits the plot to the top 25 products

Now let's begin training the association rule model.

Our first step will be to set our parameters. The first parameters we will set are the support and confidence. The support value is derived from the frequency of a specific item within the dataset. When we set our support value, we are setting a minimum number of transactions necessary for our rules to take effect.

* **Support**: Our support value will be the minimum number of transactions necessary divided by the total number of transactions.
    * As described by summary(customersProducts), we have a total number of unique customer transactions of 5892.
    * From our dataset, let's assume that we want to choose a product which was purchased by at least **50** different customers.
    * With these two values established, we can compute the support value with simple division. (50/5892) = **0.008486083**
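The division above can be checked directly. The sketch below also shows what the rounded threshold of 0.008 used in the apriori() calls in this notebook corresponds to in whole customers. This is plain arithmetic. How arules rounds the threshold internally is not checked here.

```r
n_customers   <- 5892   # unique customer transactions from summary(customersProducts)
min_customers <- 50     # chosen minimum number of customers per rule

# Minimum support as a fraction of all transactions
min_support <- min_customers / n_customers
print(min_support)

# Rounding the threshold down to 0.008 slightly loosens the requirement:
# 0.008 * 5892 = 47.136, so arithmetically 48 customers are enough to reach it
implied_min_customers <- ceiling(0.008 * n_customers)
print(implied_min_customers)
```

In other words, a support of 0.008 is a slightly more permissive cut-off than the 50 customers we set out with.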

The second parameter we will take into consideration will be the confidence. The confidence value determines how often a rule is to be found true. In other words, the minimum **strength** of any rule is a limit we place when setting our minimum confidence value.

The default confidence value in the apriori() function is 0.80 or 80%, so we can begin with that number and then adjust the parameters until the results are useful.
* **Confidence**: We can determine our confidence value by first starting with the default value and adjusting accordingly.
    * With more domain knowledge, and with Product_IDs referencing items with recognizable names, the Confidence value can be easily changed to see different, and more relevant, results.
    * In our case, we will start with a value and then lower the confidence to see different rules.

In [None]:
rules = apriori(data = customersProducts,
               parameter = list(support = 0.008, confidence = 0.80, maxtime = 0)) # maxtime = 0 will allow our algorithm to run until completion with no time limit

It looks like apriori has created 7 rules in accordance with our specified parameters.

"writing ... [7 rule(s)] done [0.48s]."

Now, let's examine our results to get a better idea of how our algorithm worked.

In [None]:
inspect(sort(rules, by = 'lift'))

Here we see the association rules created by our apriori algorithm. Let's take a look at rule number 1.

We see a few values listed and we will go through them individually.
* The first value, **lhs**, corresponds to a grouping of items which the algorithm has pulled from the dataset.
* The second value, **rhs**, corresponds to the item predicted by apriori to be purchased along with the items in the "lhs" category.
* The third value, **support**, is the number of transactions including that specific set of items divided by the total number of transactions. (As described earlier when we chose the parameters for Apriori.)
* The fourth value, **confidence**, is the share of transactions containing the "lhs" items that also contain the "rhs" item, i.e. how often the rule holds when it applies.
* The fifth value, **lift**, gives us the independence/dependence of a rule. It is the confidence divided by how often the "rhs" item appears in the entire dataset; a lift above 1 means the items are bought together more often than we would expect if they were independent.
* The sixth and final value, **count**, is the number of transactions in our data that contain every item in the rule.
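To see how these values relate to one another, here is a small base R sketch that computes them by hand for a single rule, {TV} => {HDMI}, on five made-up baskets. The baskets and the helper names `has_all` and `support_of` are invented for illustration. The formulas are the standard definitions of support, confidence and lift, not a reimplementation of arules.

```r
# Five made-up baskets (one character vector per customer)
baskets <- list(
  c("TV", "HDMI"),
  c("TV", "HDMI", "Soundbar"),
  c("TV"),
  c("HDMI"),
  c("Soundbar")
)

# TRUE if a basket contains every item in `items`
has_all <- function(basket, items) all(items %in% basket)

# Share of baskets that contain every item in `items`
support_of <- function(items) {
  mean(vapply(baskets, has_all, logical(1), items = items))
}

lhs <- "TV"
rhs <- "HDMI"

# Number of baskets containing both sides of the rule
count <- sum(vapply(baskets, has_all, logical(1), items = c(lhs, rhs)))

support    <- support_of(c(lhs, rhs))      # 2 of 5 baskets = 0.4
confidence <- support / support_of(lhs)    # 0.4 / 0.6, about 0.667
lift       <- confidence / support_of(rhs) # 0.667 / 0.6, about 1.11

print(c(count = count, support = support, confidence = confidence, lift = lift))
```

Here the lift is slightly above 1, so in this toy data buying a TV makes an HDMI purchase a little more likely than it is for a customer picked at random.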

Now, let's visualize these rules using the [arulesViz][1] package.

[1]: https://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf

In [None]:
plot(rules, method = 'graph')

Here we can see a visualization of our association rules. Arrows pointing **from items** to rule vertices indicate LHS (Grouped) items and arrows **from rules** to items indicate the RHS (Rule Item).

The size of the bubbles indicates the support, with larger bubbles representing a higher support value. Fill color represents the lift values, with darker colors representing higher lifts.

Let's now try modifying some of the parameters for the Apriori algorithm and see the results. This process would prove to be more intuitive if given a key for each corresponding Product_ID, so we will only run the algorithm once more.

This time, we will decrease our confidence value to **75%** and keep our support value the same (**0.008**).

In [None]:
rules = apriori(data = customersProducts,
               parameter = list(support = 0.008, confidence = 0.75, maxtime = 0))

Now that we have decreased the minimum confidence value to 75%, we have a total of 171 rules.

writing ... [171 rule(s)] done [0.50s].

This is a much higher number of rules compared to our previous rule list, which only contained 7. This should now give us more interesting rules to examine.

In [None]:
inspect(head(sort(rules, by = 'lift'))) # limiting to the top 6 rules

We can see that we now have a new set of rules, and the rule with the highest lift value has also changed.

Rule number 1 shows that Customers who bought items P00221142 and P00249642 will also purchase item P00103042 **~76%** of the time, given a support of 0.008.

In [None]:
plot(rules, method = 'graph', max = 25)

Now that we have more than 7 rules, this visualization becomes a lot more difficult to interpret. Instead, we can create a grouped matrix plot, which shows similar information and is easier to interpret.

In [None]:
plot(rules, method = 'grouped', max = 25)

In this visualization, we have our LHS on top and, on the right-hand side, the corresponding RHS. The size of the bubbles represents the support value of the rule and the fill/color represents the lift.

<a id="Conclusion">
# Conclusion
Overall, we have made some insightful discoveries from our EDA of this Black Friday dataset. We saw how customers at our store were distributed across multiple categorical classifications such as Gender, Age, Occupation, Stay in Current City, etc. We have also determined who our top purchasing customers were on Black Friday and classified products into "best sellers" and "worst sellers." We have also identified various metrics regarding purchases made on Black Friday, including the average amount spent by customers and the total purchase amount across multiple categories.

After our EDA, we dove into the world of Association Rule Learning and identified some association rules for our store on Black Friday. We discovered multiple situations where customers that purchased a certain set of items were over **75%** likely to purchase another item, given a set of inputs.

<a id="Works Cited">
# Works Cited
**Information**
1. https://www.history.com/news/whats-the-real-history-of-black-friday
2. https://en.oxforddictionaries.com/explore/why-is-day-after-thanksgiving-black-friday/
3. https://www.cnn.com/2018/11/21/business/black-friday-history/index.html
4. https://journals.sagepub.com/doi/abs/10.1177/0887302X0402200404#articleCitationDownloadContainer
5. https://en.wikipedia.org/wiki/Apriori_algorithm
6. https://en.wikipedia.org/wiki/Association_rule_learning

**Images**
1. https://www.armstrongeconomics.com/panic-of-1869/
2. http://www.staradvertiser.com/2018/11/23/breaking-news/black-friday-2018-a-not-so-wild-day-for-american-shoppers/

**Packages**
1. https://www.tidyverse.org/
2. https://cran.r-project.org/web/packages/scales/scales.pdf
3. https://cran.r-project.org/web/packages/arules/arules.pdf
4. https://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf
