# Market Basket Analysis with Spark FPGrowth (Data mining)
#### Dataset download > 
* #### [Instacart](https://www.kaggle.com/c/instacart-market-basket-analysis)

#### Library used >
* #### [FPGrowth](https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html#fp-growth)

Mining frequent items, itemsets, subsequences, or other substructures is usually among the first steps in analyzing a large-scale dataset, and it has been an active research topic in data mining for years. See Wikipedia's article on association rule learning for more information. spark.mllib provides a parallel implementation of FP-growth, a popular algorithm for mining frequent itemsets. The cells below use the DataFrame-based org.apache.spark.ml.fpm.FPGrowth class.
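
To make "frequent itemset" concrete, here is a minimal plain-Python sketch that counts itemsets by brute force on a handful of made-up baskets. The baskets, the `frequent_itemsets` helper, and the `max_size` cap are all illustrative assumptions; this is not how FP-growth works internally, only an illustration of what it computes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets; not taken from the Instacart data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def frequent_itemsets(baskets, min_support, max_size=2):
    """Count every itemset of up to max_size items and keep the frequent ones."""
    counts = Counter()
    for basket in baskets:
        for size in range(1, max_size + 1):
            # Sort so that each itemset has one canonical tuple representation.
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    total = len(baskets)
    # Support is the fraction of baskets that contain the itemset.
    return {s: n for s, n in counts.items() if n / total >= min_support}

frequent = frequent_itemsets(baskets, min_support=0.6)
for itemset, n in sorted(frequent.items()):
    print(itemset, n)
```

With a minimum support of 0.6, an itemset must appear in at least three of the five baskets to be kept. Brute-force enumeration like this grows quickly with basket size, which is one reason to use an algorithm such as FP-growth on real data.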

Market basket analysis can give a retailer insight into the purchase behavior of its buyers. With this information the retailer can better understand buyers' needs and rearrange the store's layout accordingly, develop cross-promotional programs, or even attract new buyers (much like the cross-selling concept). In an apocryphal early example, a supermarket chain discovered in its analysis that male customers who bought diapers often bought beer as well, moved the diapers close to the beer coolers, and saw sales increase dramatically. Although this urban legend is only an example that professors use to illustrate the concept to students, the explanation of this imaginary phenomenon might be that fathers who are sent out to buy diapers often buy a beer as well, as a reward. This kind of analysis is often cited as an example of data mining in use. A widely used example of cross-selling on the web with market basket analysis is Amazon.com's "customers who bought book A also bought book B", e.g. "People who read History of Portugal were also interested in Naval History".

This is the second notebook in a series of two. Its purpose is to do some exploratory data analysis and call the ML logic.

![image-alt-text](https://s3.us-east-2.amazonaws.com/databricks-dennylee/media/buy+it+again+or+recommend.png)

# Exploratory Data Analysis

Explore your Instacart data using Spark SQL

In [1]:
%%sql
--Busiest day of the week
select 
  count(order_id) as total_orders, 
  (case 
     when order_dow = '0' then 'Sunday'
     when order_dow = '1' then 'Monday'
     when order_dow = '2' then 'Tuesday'
     when order_dow = '3' then 'Wednesday'
     when order_dow = '4' then 'Thursday'
     when order_dow = '5' then 'Friday'
     when order_dow = '6' then 'Saturday'              
   end) as day_of_week 
  from instacart.orders  
 group by order_dow 
 order by total_orders desc

<Spark SQL result set with 7 rows and 2 fields>

In [2]:
%%sql
--Breakdown of Orders by Hour of the Day
select 
  count(order_id) as total_orders, 
  order_hour_of_day as hour 
  from instacart.orders 
 group by order_hour_of_day 
 order by order_hour_of_day

<Spark SQL result set with 24 rows and 2 fields>

In [3]:
%%sql
--Max Products by Department
select countbydept.*
  from (
  -- from the products table, count the number of records per department
  -- and sort by that count (lowest to highest)
  select department_id, count(1) as counter
    from instacart.products
   group by department_id
   order by counter asc 
  ) as maxcount
inner join (
  -- let's repeat the exercise, but this time let's join
  -- products and departments tables to get a full list of dept and 
  -- prod count
  select
    d.department_id,
    d.department,
    count(1) as products
    from instacart.departments d
      inner join instacart.products p
         on p.department_id = d.department_id
   group by d.department_id, d.department 
   order by products desc
  ) countbydept 
  -- combine the two queries' results by matching the product count
  on countbydept.products = maxcount.counter

<Spark SQL result set with 21 rows and 3 fields>

In [6]:
%%sql
--Top 10 Popular Items
select count(opp.order_id) as orders, p.product_name as popular_product
  from instacart.order_products opp, instacart.products p
 where p.product_id = opp.product_id 
 group by popular_product 
 order by orders desc 
 limit 10

<Spark SQL result set with 10 rows and 2 fields>

In [8]:
%%sql
--Shelf Space by Department
select d.department, count(distinct p.product_id) as products
  from instacart.products p
    inner join instacart.departments d
      on d.department_id = p.department_id
 group by d.department
 order by products desc
 limit 10

<Spark SQL result set with 10 rows and 2 fields>

In [11]:
# Organize the data by shopping basket
from pyspark.sql.functions import collect_set

# Join each order line to its product name
rawData = spark.sql("select p.product_name, o.order_id from instacart.products p inner join instacart.order_products o on o.product_id = p.product_id")
# Collect the distinct product names of each order into one array column
baskets = rawData.groupBy('order_id').agg(collect_set('product_name').alias('items'))
baskets.createOrReplaceTempView('baskets')

In [13]:
# View Shopping Basket
display(baskets)

SynapseWidget(Synapse.DataFrame, 567ffbb9-5298-4a18-93fb-bd9240380a59)
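
For readers without a Spark session at hand, the basket-building step above can be sketched in plain Python. The `rows` list and the `to_baskets` helper are hypothetical stand-ins for the joined `rawData` and the `groupBy('order_id').agg(collect_set(...))` call; the product names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (order_id, product_name) rows standing in for the joined rawData.
rows = [
    (1, "Banana"),
    (1, "Organic Strawberries"),
    (2, "Banana"),
    (1, "Banana"),  # duplicate row: the set keeps a single copy
    (2, "Organic Baby Spinach"),
]

def to_baskets(rows):
    """Group product names by order id, keeping each product once per order."""
    baskets = defaultdict(set)
    for order_id, product_name in rows:
        baskets[order_id].add(product_name)
    return dict(baskets)

baskets = to_baskets(rows)
for order_id, items in sorted(baskets.items()):
    print(order_id, sorted(items))
```

Using a set mirrors the choice of `collect_set` over `collect_list`: Spark's FP-growth expects the items within a single transaction to be unique.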

# Train ML Model

To understand how frequently items are associated with each other (e.g. peanut butter and jelly), we will use association rule mining for market basket analysis. Spark MLlib implements two algorithms related to frequent pattern mining (FPM): FP-growth and PrefixSpan. The distinction is that FP-growth does not use order information in the itemsets, if any, while PrefixSpan is designed for sequential pattern mining where the itemsets are ordered. We will use FP-growth because order information is not important for this use case.
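
The difference between unordered and ordered matching can be sketched in a few lines of plain Python. Both helper functions below are hypothetical and much simpler than either Spark algorithm (PrefixSpan, for instance, works on sequences of itemsets rather than single items); they only show why the same purchases can match a pattern as an itemset but not as a sequence.

```python
def contains_itemset(basket, pattern):
    """Unordered check: every pattern item appears somewhere in the basket."""
    return set(pattern) <= set(basket)

def contains_subsequence(sequence, pattern):
    """Ordered check: pattern items appear in the sequence in the same order."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so later items must come later.
    return all(item in remaining for item in pattern)

# Hypothetical purchases, listed in the order they were added to the cart.
purchases = ["jelly", "bread", "peanut butter"]
pattern = ["peanut butter", "jelly"]

as_itemset = contains_itemset(purchases, pattern)
as_sequence = contains_subsequence(purchases, pattern)
print(as_itemset, as_sequence)
```

Here the pattern matches as an itemset because both products are in the cart, but not as a sequence because jelly was added before peanut butter.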

Note, we will be using the Scala API so we can configure setMinConfidence.

In [14]:
%%spark
//Use FP-growth

import org.apache.spark.ml.fpm.FPGrowth
 
// Extract out the items 
val baskets_ds = spark.sql("select items from baskets").as[Array[String]].toDF("items")
 
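// Use FPGrowth: keep itemsets found in at least 0.1% of baskets;
// a minimum confidence of 0 keeps every generated association rule
val fpgrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.001).setMinConfidence(0)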
val model = fpgrowth.fit(baskets_ds)

import org.apache.spark.ml.fpm.FPGrowth
baskets_ds: org.apache.spark.sql.DataFrame = [items: array<string>]
fpgrowth: org.apache.spark.ml.fpm.FPGrowth = fpgrowth_04bfe4e85f99
model: org.apache.spark.ml.fpm.FPGrowthModel = FPGrowthModel: uid=fpgrowth_04bfe4e85f99, numTrainingRecords=3346083
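
As a rough worked example, the minimum support of 0.001 can be translated into a basket count using the `numTrainingRecords` value printed above. Rounding up with `ceil` is an assumption about how the threshold is applied, so treat the exact cutoff as approximate.

```python
import math

num_training_records = 3_346_083  # numTrainingRecords from the model output
min_support = 0.001               # value passed to setMinSupport

# Assumption: the support fraction is converted to a count by rounding up.
min_count = math.ceil(min_support * num_training_records)
print(min_count)
```

In other words, an itemset has to appear in a few thousand baskets before it is reported, which keeps the result manageable on a dataset of this size.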


In [15]:
%%spark

// Display frequent itemsets
val mostPopularItemInABasket = model.freqItemsets

//create a view
mostPopularItemInABasket.createOrReplaceTempView("mostPopularItemInABasket")

mostPopularItemInABasket: org.apache.spark.sql.DataFrame = [items: array<string>, freq: bigint]


In [16]:
%%sql
--Most frequent itemsets containing more than two items
select items, freq from mostPopularItemInABasket where size(items) > 2 order by freq desc limit 20

<Spark SQL result set with 20 rows and 2 fields>

# Review Association Rules
In addition to freqItemsets, the FP-growth model also generates association rules. For example, if a shopper purchases peanut butter, what is the likelihood that they will also purchase jelly? For more information, a good reference is Susan Li's "A Gentle Introduction on Market Basket Analysis — Association Rules".
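
The peanut butter and jelly question can be worked through by hand on a tiny example. The sketch below uses made-up baskets and two hypothetical helpers, `support` and `rule_metrics`, to compute confidence and lift from their standard definitions; it is an illustration of the metrics, not Spark's implementation.

```python
# Hypothetical toy baskets; not taken from the Instacart data.
baskets = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"peanut butter", "bread"},
    {"jelly", "milk"},
    {"bread", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    joint = support(set(antecedent) | set(consequent), baskets)
    # Confidence: share of antecedent baskets that also hold the consequent.
    confidence = joint / support(antecedent, baskets)
    # Lift: confidence relative to how common the consequent is overall.
    lift = confidence / support(consequent, baskets)
    return confidence, lift

confidence, lift = rule_metrics({"peanut butter"}, {"jelly"}, baskets)
print(round(confidence, 2), round(lift, 2))
```

In this toy data, two of the three peanut butter baskets also contain jelly, so the confidence is about 0.67. Jelly appears in three of five baskets overall, giving a lift of about 1.11; a lift above 1 suggests the two items co-occur more often than they would if purchased independently.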

In [17]:
%%spark
// Display generated association rules.
val ifThen = model.associationRules

//create a view
ifThen.createOrReplaceTempView("ifThen")

ifThen: org.apache.spark.sql.DataFrame = [antecedent: array<string>, consequent: array<string> ... 3 more fields]


In [18]:
%%sql
select antecedent as `antecedent (if)`, consequent as `consequent (then)`, confidence from ifThen order by confidence desc limit 20

<Spark SQL result set with 20 rows and 3 fields>