## Market Basket Analysis - Final Project
### Problem Analysis
This project uses the Instacart dataset of more than three million orders, together with customers' order histories, to build a recommendation system. The system aims to improve customer engagement by recommending the right items, based on an analysis of the associations between products and of customers' buying patterns.

### Techniques Used
The Spark DataFrames containing the order and product details are analysed with the FP-Growth algorithm for association rule mining. It examines users' buying patterns and recommends suitable products to the customer, which reduces purchase time and increases business efficiency.

### Datasets
#### 1. **orders (~3.4 M)**
   This dataset contains the historical details of about 3.4 million orders, with the fields below.
  
   **order_id**          - Identifier for the orders.
   
   **user_id**           - Identifier for the users.
   
   **eval_set**          - Identifies the eval set the order belongs to.
   
   **order_number**      - The order sequence of the user.
   
   **order_dow**         - The day of the week the order was placed on.
   
   **order_hour_of_day** - The hour of the day the order was placed on.
   
   **days_since_prior_order** - Days since the user's previous order (a user's first order has no prior order, so its value is N/A).
   
#### 2. **products (~50 K)**
   This dataset contains the details of each product, including the aisle and the department it belongs to.

   **product_id** - Identifier for the products.
   
   **product_name** - Name assigned to a product.
   
   **aisle_id** - Aisle associated with the product.
   
   **department_id** - Department associated with the product.
   
#### 3. **aisles (134)**
   A lookup table that maps aisle identifiers to aisle names.

   **aisle_id** - Identifier for the aisles.
   
   **aisle** - Name assigned to the aisle.
   
#### 4. **departments (21)**
   A lookup table that maps department identifiers to department names.

   **department_id** - Identifier for departments.
   
   **department** - Name associated with the departments.
   
#### 5. **order_products_SET (~3 M+)**
   Contains the products in each order, split by evaluation set.

   **order_id** - Foreign key containing the identifier for the orders.
   
   **product_id** - Foreign key containing the identifier for the products.
   
   **add_to_cart_order** - Order in which the product was added to the cart.
   
   **reordered** - Indicates whether the user has ordered the product in the past: 1 if so, 0 if the product was not ordered previously.
   
   **eval_set** in the orders dataset can take one of the three values below.
   
   **prior(~3.2 M)** - orders prior to the user's most recent order.
   
   **train(~131 K)** - training data to aid the ML model.
   
   **test(~75 K)** - test data for ML model.

### List of tables

In [0]:
%fs ls /FileStore/tables

path,name,size,modificationTime
dbfs:/FileStore/tables/aisles-1.csv,aisles-1.csv,2603,1669945057000
dbfs:/FileStore/tables/aisles.csv,aisles.csv,2603,1669944954000
dbfs:/FileStore/tables/departments-1.csv,departments-1.csv,270,1669945057000
dbfs:/FileStore/tables/departments.csv,departments.csv,270,1669944954000
dbfs:/FileStore/tables/order_products__prior.csv,order_products__prior.csv,577550706,1670197832000
dbfs:/FileStore/tables/order_products__train-1.csv,order_products__train-1.csv,24680147,1669945501000
dbfs:/FileStore/tables/order_products__train.csv,order_products__train.csv,24680147,1669945101000
dbfs:/FileStore/tables/orders.csv,orders.csv,108968645,1670178168000
dbfs:/FileStore/tables/products.csv,products.csv,2166953,1669945205000
dbfs:/FileStore/tables/sample_submission.csv,sample_submission.csv,1475693,1669945204000


### Importing Libraries

In [0]:

import numpy as np
import pandas as pd
from functools import reduce
from pyspark.sql import DataFrame



### Reading the csv files from the FileStore/tables into spark dataframes.

In [0]:
departments_df = spark.read.csv("/FileStore/tables/departments.csv", header=True, inferSchema=True)
aisles_df = spark.read.csv("/FileStore/tables/aisles.csv", header=True, inferSchema=True)
products_df = spark.read.csv("/FileStore/tables/products.csv", header=True, inferSchema=True)
orders_df = spark.read.csv("/FileStore/tables/orders.csv", header=True, inferSchema=True)
order_products_prior_df = spark.read.csv("/FileStore/tables/order_products__prior.csv", header=True, inferSchema=True)
order_products_train_df = spark.read.csv("/FileStore/tables/order_products__train.csv", header=True, inferSchema=True)

### Pre-Processing - Null Values Check

In [0]:
from pyspark.sql.functions import isnan, when, count, col
# Note: isnan() flags NaN values only; SQL NULLs are not counted by these checks.
orders_df.select([count(when(isnan(c), c)).alias(c) for c in orders_df.columns]).show()
products_df.select([count(when(isnan(c), c)).alias(c) for c in products_df.columns]).show()
aisles_df.select([count(when(isnan(c), c)).alias(c) for c in aisles_df.columns]).show()
departments_df.select([count(when(isnan(c), c)).alias(c) for c in departments_df.columns]).show()
order_products_prior_df.select([count(when(isnan(c), c)).alias(c) for c in order_products_prior_df.columns]).show()
order_products_train_df.select([count(when(isnan(c), c)).alias(c) for c in order_products_train_df.columns]).show()

+--------+-------+--------+------------+---------+-----------------+----------------------+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|days_since_prior_order|
+--------+-------+--------+------------+---------+-----------------+----------------------+
|       0|      0|       0|           0|        0|                0|                     0|
+--------+-------+--------+------------+---------+-----------------+----------------------+

+----------+------------+--------+-------------+
|product_id|product_name|aisle_id|department_id|
+----------+------------+--------+-------------+
|         0|           0|       0|            0|
+----------+------------+--------+-------------+

+--------+-----+
|aisle_id|aisle|
+--------+-----+
|       0|    0|
+--------+-----+

+-------------+----------+
|department_id|department|
+-------------+----------+
|            0|         0|
+-------------+----------+

+--------+----------+-----------------+---------+
|order_id|product_id|
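
The checks above use `isnan`, which flags NaN values only and does not count SQL NULLs, such as the `null` shown for `days_since_prior_order` in the orders table preview. As a minimal sketch of the difference, in plain Python and assuming `None` stands in for NULL, the hypothetical helpers below count missing values both ways. In PySpark, the analogous NULL test would be `col(c).isNull()`.

```python
import math

def count_nan_only(values):
    """Count floating-point NaN entries (what an isnan-style check sees)."""
    return sum(1 for v in values if isinstance(v, float) and math.isnan(v))

def count_missing(values):
    """Count entries that are either None (standing in for NULL) or NaN."""
    return sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )

# Sample modelled on days_since_prior_order for one user:
# the first order has no prior order, so its value is missing.
days_since_prior_order = [None, 15.0, 21.0, 29.0, 28.0]

print("NaN-only count:", count_nan_only(days_since_prior_order))
print("NULL-or-NaN count:", count_missing(days_since_prior_order))
```

A NaN-only count of zero therefore does not by itself show that a column is free of missing values.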

### Pre-Processing - Statistical Summary using Describe

In [0]:
departments_df.describe().show()
aisles_df.describe().show()
products_df.describe().show()
orders_df.describe().show()
order_products_prior_df.describe().show()
order_products_train_df.describe().show()

+-------+------------------+----------+
|summary|     department_id|department|
+-------+------------------+----------+
|  count|                21|        21|
|   mean|              11.0|      null|
| stddev|6.2048368229954285|      null|
|    min|                 1|   alcohol|
|    max|                21|    snacks|
+-------+------------------+----------+

+-------+-----------------+--------------------+
|summary|         aisle_id|               aisle|
+-------+-----------------+--------------------+
|  count|              134|                 134|
|   mean|             67.5|                null|
| stddev|38.82653731663435|                null|
|    min|                1|air fresheners ca...|
|    max|              134|              yogurt|
+-------+-----------------+--------------------+

+-------+------------------+--------------------+-----------------+------------------+
|summary|        product_id|        product_name|         aisle_id|     department_id|
+-------+--------------

### Displaying the Tables

#### Orders

In [0]:
orders_df.show(10)

+--------+-------+--------+------------+---------+-----------------+----------------------+
|order_id|user_id|eval_set|order_number|order_dow|order_hour_of_day|days_since_prior_order|
+--------+-------+--------+------------+---------+-----------------+----------------------+
| 2539329|      1|   prior|           1|        2|                8|                  null|
| 2398795|      1|   prior|           2|        3|                7|                  15.0|
|  473747|      1|   prior|           3|        3|               12|                  21.0|
| 2254736|      1|   prior|           4|        4|                7|                  29.0|
|  431534|      1|   prior|           5|        4|               15|                  28.0|
| 3367565|      1|   prior|           6|        2|                7|                  19.0|
|  550135|      1|   prior|           7|        1|                9|                  20.0|
| 3108588|      1|   prior|           8|        1|               14|            

#### Products

In [0]:
products_df.show(5)

+----------+--------------------+--------+-------------+
|product_id|        product_name|aisle_id|department_id|
+----------+--------------------+--------+-------------+
|         1|Chocolate Sandwic...|      61|           19|
|         2|    All-Seasons Salt|     104|           13|
|         3|Robust Golden Uns...|      94|            7|
|         4|Smart Ones Classi...|      38|            1|
|         5|Green Chile Anyti...|       5|           13|
+----------+--------------------+--------+-------------+
only showing top 5 rows



#### Departments

In [0]:
departments_df.show()

+-------------+---------------+
|department_id|     department|
+-------------+---------------+
|            1|         frozen|
|            2|          other|
|            3|         bakery|
|            4|        produce|
|            5|        alcohol|
|            6|  international|
|            7|      beverages|
|            8|           pets|
|            9|dry goods pasta|
|           10|           bulk|
|           11|  personal care|
|           12|   meat seafood|
|           13|         pantry|
|           14|      breakfast|
|           15|   canned goods|
|           16|     dairy eggs|
|           17|      household|
|           18|         babies|
|           19|         snacks|
|           20|           deli|
+-------------+---------------+
only showing top 20 rows



#### Aisles

In [0]:
aisles_df.show(10)

+--------+--------------------+
|aisle_id|               aisle|
+--------+--------------------+
|       1|prepared soups sa...|
|       2|   specialty cheeses|
|       3| energy granola bars|
|       4|       instant foods|
|       5|marinades meat pr...|
|       6|               other|
|       7|       packaged meat|
|       8|     bakery desserts|
|       9|         pasta sauce|
|      10|    kitchen supplies|
+--------+--------------------+
only showing top 10 rows



#### Evaluation sets containing prior and training dataset.

In [0]:
order_products_prior_df.show(10)

+--------+----------+-----------------+---------+
|order_id|product_id|add_to_cart_order|reordered|
+--------+----------+-----------------+---------+
|       2|     33120|                1|        1|
|       2|     28985|                2|        1|
|       2|      9327|                3|        0|
|       2|     45918|                4|        1|
|       2|     30035|                5|        0|
|       2|     17794|                6|        1|
|       2|     40141|                7|        1|
|       2|      1819|                8|        1|
|       2|     43668|                9|        0|
|       3|     33754|                1|        1|
+--------+----------+-----------------+---------+
only showing top 10 rows



In [0]:
order_products_train_df.show(10)

+--------+----------+-----------------+---------+
|order_id|product_id|add_to_cart_order|reordered|
+--------+----------+-----------------+---------+
|       1|     49302|                1|        1|
|       1|     11109|                2|        1|
|       1|     10246|                3|        0|
|       1|     49683|                4|        0|
|       1|     43633|                5|        1|
|       1|     13176|                6|        0|
|       1|     47209|                7|        0|
|       1|     22035|                8|        1|
|      36|     39612|                1|        0|
|      36|     19660|                2|        1|
+--------+----------+-----------------+---------+
only showing top 10 rows



#### Temporary Tables for SQL

In [0]:
# Create Temporary Tables to work using sql like commands
aisles_df.createOrReplaceTempView("aisles")
departments_df.createOrReplaceTempView("departments")
order_products_prior_df.createOrReplaceTempView("order_products_prior")
order_products_train_df.createOrReplaceTempView("order_products_train")
orders_df.createOrReplaceTempView("orders")
products_df.createOrReplaceTempView("products")
     

### Count of total orders by the day of the week.

In [0]:
%sql
select 
  count(order_id) as total_orders, 
  (case 
     when order_dow = '0' then 'Sunday'
     when order_dow = '1' then 'Monday'
     when order_dow = '2' then 'Tuesday'
     when order_dow = '3' then 'Wednesday'
     when order_dow = '4' then 'Thursday'
     when order_dow = '5' then 'Friday'
     when order_dow = '6' then 'Saturday'              
   end) as day_of_week 
  from orders  
 group by order_dow 
 order by total_orders desc

total_orders,day_of_week
600905,Sunday
587478,Monday
467260,Tuesday
453368,Friday
448761,Saturday
436972,Wednesday
426339,Thursday



### Result
We started by grouping the dataset by "order_dow", which holds the day of the week on which the order was placed, and counting the orders made on each day. Most Instacart orders are placed on Sunday or Monday. The next query looks at the distribution across the hours of the day to find out when most users place orders on Instacart.

### Orders placed across the day(hour of the day)

In [0]:
%sql
select 
  count(order_id) as total_orders, 
  order_hour_of_day as hour 
  from orders 
 group by order_hour_of_day 
 order by order_hour_of_day
     

total_orders,hour
22758,0
12398,1
7539,2
5474,3
5527,4
9569,5
30529,6
91868,7
178201,8
257812,9


Output can only be rendered in Databricks

### Result
It appears that the majority of orders are placed between 10 am and 4 pm, as can be seen in the bar plot above. Our initial assumption was that the hours after peak office hours would have the most orders, so this came as a surprise.

### Department with most products

In [0]:
%sql
select
d.department_id,
d.department,
    count(1) as no_of_products
    from departments d
      inner join products p
         on p.department_id = d.department_id
   group by d.department_id, d.department 
   order by no_of_products desc

department_id,department,no_of_products
11,personal care,6563
19,snacks,6264
13,pantry,5371
7,beverages,4365
1,frozen,4007
16,dairy eggs,3449
17,household,3084
15,canned goods,2092
9,dry goods pasta,1858
4,produce,1684



### Result
In terms of the number of products, personal care leads with 6,563, followed by snacks (6,264) and pantry (5,371).

### Most popular product

In [0]:
%sql
select count(opp.order_id) as orders, p.product_name as popular_product
  from order_products_prior opp, products p 
 where p.product_id = opp.product_id 
 group by popular_product 
 order by orders desc 
 LIMIT 10

orders,popular_product
472565,Banana
379450,Bag of Organic Bananas
264683,Organic Strawberries
241921,Organic Baby Spinach
213584,Organic Hass Avocado
176815,Organic Avocado
152657,Large Lemon
142951,Strawberries
140627,Limes
137905,Organic Whole Milk



### Result
Bananas, strawberries, and spinach are among the most popular products, and nine of the top ten are fruits and vegetables. In fact, the majority of the top items appear to be healthy choices.

In [0]:
%sql
select 
  count(order_id) as total_orders, 
  add_to_cart_order as order_frequency 
  from order_products_prior  
 group by order_frequency 
 order by total_orders desc
 LIMIT 10

total_orders,order_frequency
3214874,1
3058126,2
2871133,3
2664106,4
2442025,5
2213695,6
1986020,7
1766014,8
1562640,9
1378293,10


### FP Growth Algorithm
It all started with the **Apriori algorithm**, a frequent pattern mining algorithm used in association rule learning that generates candidate itemsets and discovers the most frequent ones. The **FP-Growth algorithm** is an improvement on Apriori in which frequent patterns are generated without candidate generation. The algorithm represents the transactions in the form of a tree, called the **FP tree or frequent pattern tree**. The root of the tree is empty, each node below it represents one item of an itemset, and each path from the root represents a set of items that occur together. The association of a node with the nodes below it, that is, of an item with the other items of its itemsets, is maintained while the tree is formed.
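
To make the tree structure concrete, the following is a minimal sketch of FP-tree construction only (not the full mining step), written in plain Python. The `FPNode` class, the `build_fp_tree` function, and the five sample transactions are hypothetical illustrations and are not part of Spark's implementation.

```python
from collections import Counter

class FPNode:
    """One node of the FP tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = FPNode(None)  # the root carries no item
    # Pass 2: insert each transaction with its frequent items sorted by
    # descending frequency (ties broken alphabetically), so that common
    # prefixes share the same path.
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent

def dump(node, depth=0):
    """Print the tree with indentation showing parent/child links."""
    for child in node.children.values():
        print("  " * depth + f"{child.item} ({child.count})")
        dump(child, depth + 1)

# Hypothetical sample transactions.
transactions = [
    ["Banana", "Limes", "Large Lemon"],
    ["Banana", "Limes"],
    ["Banana", "Organic Avocado"],
    ["Limes", "Large Lemon"],
    ["Banana", "Large Lemon", "Organic Avocado"],
]

root, frequent = build_fp_tree(transactions, min_count=2)
dump(root)
```

Because the four transactions containing Banana share a single Banana node, the tree stores them more compactly than a flat list of transactions would.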

The next stage is to prepare our data for use by the pattern mining (FP-Growth) method. Each row must contain one basket, that is, the set of items in a single order. The snippet below creates a DataFrame called cart to load into the algorithm.

In [0]:
# Organize the data by shopping cart
from pyspark.sql.functions import collect_set
# Get the product names for each order_id (unique order) by joining the product ids
# from the training evaluation set with the products DataFrame.
rawData = spark.sql("select p.product_name, o.order_id from products p inner join order_products_train o on o.product_id = p.product_id")
# collect_set gathers the distinct product names of each order into one array.
cart = rawData.groupBy('order_id').agg(collect_set('product_name').alias('items'))
cart.createOrReplaceTempView('cart')
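
As a rough illustration of what the grouping step produces, the sketch below mimics `groupBy` plus `collect_set` in plain Python. The rows are a small hand-written sample (the duplicate row is hypothetical), and the `carts` dictionary is only an illustration, not the Spark DataFrame itself.

```python
from collections import defaultdict

# (product_name, order_id) pairs, shaped like the rows of rawData.
rows = [
    ("Bulgarian Yogurt", 1),
    ("Cucumber Kirby", 1),
    ("Organic Hass Avocado", 1),
    ("Organic Hass Avocado", 1),  # hypothetical duplicate row
    ("Spring Water", 36),
    ("Asparagus", 36),
    ("Shelled Pistachios", 38),
]

# One set of distinct product names per order, like collect_set per group.
carts = defaultdict(set)
for product_name, order_id in rows:
    carts[order_id].add(product_name)

for order_id, items in sorted(carts.items()):
    print(order_id, sorted(items))
```

Using a set per order matters because FP-Growth in Spark expects the items within a single transaction to be unique.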

In [0]:
rawData.show()

+--------------------+--------+
|        product_name|order_id|
+--------------------+--------+
|    Bulgarian Yogurt|       1|
|Organic 4% Milk F...|       1|
|Organic Celery He...|       1|
|      Cucumber Kirby|       1|
|Lightly Smoked Sa...|       1|
|Bag of Organic Ba...|       1|
|Organic Hass Avocado|       1|
|Organic Whole Str...|       1|
|Grated Pecorino R...|      36|
|        Spring Water|      36|
| Organic Half & Half|      36|
|  Super Greens Salad|      36|
|Cage Free Extra L...|      36|
|Prosciutto, Ameri...|      36|
|Organic Garnet Sw...|      36|
|           Asparagus|      36|
|  Shelled Pistachios|      38|
|Organic Biologiqu...|      38|
|Organic Raw Unfil...|      38|
|Organic Baby Arugula|      38|
+--------------------+--------+
only showing top 20 rows



### The products inside each cart during checkout

In [0]:
#rawData.show(5)
#baskets.show(5)
display(cart.head(10))
          

order_id,items
762,"List(Organic Cucumber, Organic Romaine Lettuce, Celery Hearts, Organic Strawberries)"
844,"List(Organic Red Radish, Bunch, Baby Spinach, Organic Shredded Carrots, Granny Smith Apples, Green Beans, Cheese Pizza Snacks, Garlic Couscous)"
988,"List(Whipped Light Cream, Original, Complete ActionPacs Lemon Burst Dishwasher Detergent, Classic Vanilla Coffee Creamer, Natural Vanilla Ice Cream)"
1139,"List(Cinnamon Rolls with Icing, Red Vine Tomato, Picnic Potato Salad, Flaky Biscuits, Organic Strawberries, Organic Bakery Hamburger Buns Wheat - 8 CT, Buttermilk Biscuits, Banana, Guacamole)"
1143,"List(Water, Natural Premium Coconut Water, Organic Red Radish, Bunch, Organic Capellini Whole Wheat Pasta, Organic Raspberries, Calming Lavender Body Wash, Organic Garlic, Rustic Baguette, Organic Brussel Sprouts, Organic Butterhead (Boston, Butter, Bibb) Lettuce, Organic Blueberries, Spring Water, Large Lemon, Basil Pesto, Baby Arugula, Organic Hass Avocado, Unscented Long Lasting Stick Deodorant)"
1280,"List(Vanilla Soy Milk, French Vanilla Creamer, Organic Half & Half, Lactose Free Half & Half, Organic Whole Milk)"
1342,"List(Raw Shrimp, Seedless Cucumbers, Versatile Stain Remover, Organic Strawberries, Organic Mandarins, Chicken Apple Sausage, Pink Lady Apples, Bag of Organic Bananas)"
1350,"List(Mocha Frappucino Chilled Coffee Drink, Bare Fruit Banana Chips, Pressed Cool Pineapple, Ground Cinnamon, Lemon Love Juice Drink, Chia Sweet Peach Smoothie, Green Apple Chips, Sea Salt Chickpeas, Plus Lotion Facial Tissues, Organic Insect Repellent Fresh Natural Scent, Strawberry Banana Juice)"
1468,"List(Pomegranate Seeds, Organic Red Radish, Bunch, Natural Mini Pork Pepperoni, Cage Free Grade AA Large White Eggs, Bartlett Pears, Organic Red Potato, Organic Ginger Root, Banana, Red Peppers, Active Dry Yeast, Organic Lacinato (Dinosaur) Kale, Organic Baby Broccoli, Carrots, Fresh Cauliflower, Organic English Cucumber, Organic Grape Tomatoes, Organic Hass Avocado)"
1591,"List(Cracked Wheat, Strawberry Rhubarb Yoghurt, Organic Bunny Fruit Snacks Berry Patch, Goodness Grapeness Organic Juice Drink, Honey Graham Snacks, Spinach, Granny Smith Apples, Oven Roasted Turkey Breast, Pure Vanilla Extract, Chewy 25% Low Sugar Chocolate Chip Granola, Banana, Original Turkey Burgers Smoke Flavor Added, Twisted Tropical Tango Organic Juice Drink, Navel Oranges, Lower Sugar Instant Oatmeal Variety, Ultra Thin Sliced Provolone Cheese, Natural Vanilla Ice Cream, Cinnamon Multigrain Cereal, Garlic, Goldfish Pretzel Baked Snack Crackers, Original Whole Grain Chips, Medium Scarlet Raspberries, Lemon Yogurt, Original Patties (100965) 12 Oz Breakfast, Nutty Bars, Strawberry Banana Smoothie, Green Machine Juice Smoothie, Coconut Dreams Cookies, Buttermilk Waffles, Uncured Genoa Salami, Organic Greek Whole Milk Blended Vanilla Bean Yogurt)"


In [0]:
print((cart.count(), len(cart.columns)))

(131209, 2)


### Result
There are ~131K carts (131,209) in the dataset. This dataset is ready to be fed to the Spark FPGrowth algorithm.

The three important measures in the FP-Growth algorithm are described below.

### Support
This measure reveals how frequently a given itemset appears across all transactions. Intuitively, support is the fraction of transactions that contain a given itemset A as a subset. For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.

### Confidence
Given that the cart already contains the antecedent, this measure gives the likelihood that the consequent is also in the cart. In simple terms, confidence is the fraction of the carts containing the antecedent for which the rule turns out to be true.

### Lift
The lift of an association rule is the ratio of the confidence of the rule to its expected confidence, where the expected confidence is the support of the consequent on its own. It measures how much better the rule predicts the consequent than a random choice over the population as a whole: a lift of 1 means the antecedent and consequent are independent, and a lift above 1 means they occur together more often than expected.

### Antecedent and Consequent
The IF part of a rule is known as the antecedent, and the THEN part is known as the consequent.
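
The three measures can be sketched in a few lines of plain Python. The helper functions and the five sample carts below are hypothetical and serve only to illustrate the definitions; they are not how Spark computes the values internally.

```python
def support(itemset, carts):
    """Fraction of carts that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for cart in carts if itemset <= cart) / len(carts)

def confidence(antecedent, consequent, carts):
    """Of the carts containing the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, carts) / support(antecedent, carts)

def lift(antecedent, consequent, carts):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent, carts) / support(consequent, carts)

# Hypothetical sample carts.
carts = [
    {"Limes", "Large Lemon", "Banana"},
    {"Limes", "Large Lemon"},
    {"Limes", "Banana"},
    {"Banana", "Organic Avocado"},
    {"Organic Avocado"},
]

# Rule: IF Limes THEN Large Lemon
print("support(Limes)      =", support({"Limes"}, carts))
print("confidence(rule)    =", confidence({"Limes"}, {"Large Lemon"}, carts))
print("lift(rule)          =", lift({"Limes"}, {"Large Lemon"}, carts))
```

Here Limes appears in 3 of 5 carts (support 0.6), two of those three carts also contain Large Lemon (confidence about 0.67), and Large Lemon on its own appears in 2 of 5 carts, so the lift is about 1.67.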

Let's set the minimum support at 0.001, which means that, for the purposes of our analysis, an itemset must occur in at least 0.001 × 131,209 ≈ 131 carts in order to be taken into account. In other words, at least one in every 1,000 carts contains the itemset.

Support of an association rule can be calculated as: **(Number of carts containing all the products in the rule) / (Total number of carts)**
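
The threshold arithmetic can be checked in a few lines of Python. This is only a sketch: rounding the threshold up to a whole number of carts is an assumption about how the cutoff is applied, and 710 is the frequency of the top three-item itemset reported later in this notebook.

```python
import math

n_carts = 131209      # number of carts in the training set
min_support = 0.001   # minimum support passed to FPGrowth

# Minimum number of carts an itemset must appear in.
threshold = min_support * n_carts
min_count = math.ceil(threshold)  # assumption: the cutoff is rounded up
print(f"threshold = {threshold:.3f} carts, i.e. at least {min_count} carts")

# Support of an itemset that appears in 710 carts.
top_freq = 710
top_support = top_freq / n_carts
print(f"support of an itemset with freq {top_freq} = {top_support:.5f}")
```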

### Use of FP Growth Algorithm

In [0]:

%scala
import org.apache.spark.ml.fpm.FPGrowth

// Extracting the items
val carts_ds = spark.sql("select items from cart").as[Array[String]].toDF("items")

//FPGrowth Algorithm
val fpgrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.001).setMinConfidence(0)
val model = fpgrowth.fit(carts_ds)

### Displaying the most frequent carts.

In [0]:

%scala
// Display frequent itemsets
val mostPopularItemInABasket = model.freqItemsets
mostPopularItemInABasket.createOrReplaceTempView("mostPopularItemInABasket")
mostPopularItemInABasket.show()


In [0]:
%sql
select items, freq from mostPopularItemInABasket where size(items) > 2 order by freq desc limit 10

items,freq
"List(Organic Hass Avocado, Organic Strawberries, Bag of Organic Bananas)",710
"List(Organic Raspberries, Organic Strawberries, Bag of Organic Bananas)",649
"List(Organic Baby Spinach, Organic Strawberries, Bag of Organic Bananas)",587
"List(Organic Raspberries, Organic Hass Avocado, Bag of Organic Bananas)",531
"List(Organic Hass Avocado, Organic Baby Spinach, Bag of Organic Bananas)",497
"List(Organic Avocado, Organic Baby Spinach, Banana)",484
"List(Organic Avocado, Large Lemon, Banana)",477
"List(Limes, Large Lemon, Banana)",452
"List(Organic Cucumber, Organic Strawberries, Bag of Organic Bananas)",424
"List(Limes, Organic Avocado, Large Lemon)",389


The itemset ["Organic Hass Avocado", "Organic Strawberries", "Bag of Organic Bananas"] has the highest frequency, 710, among itemsets with more than two items.

Now let's examine the **"if then"** associations using the association rules attribute of the FP-Growth model and look at the confidence and lift values for various itemsets. **The lift value of a rule must be greater than 1 for it to be useful to Instacart.**

In [0]:
%scala
// Display generated association rules.
val ifThen = model.associationRules
// creating a "if then" table
ifThen.createOrReplaceTempView("ifThen")

In [0]:
%sql
SELECT      antecedent, consequent, lift
FROM        ifThen 

WHERE       ARRAY_CONTAINS(antecedent     ,'Banana') 

ORDER BY    lift DESC

antecedent,consequent,lift
"List(Green Bell Pepper, Banana)",List(Red Peppers),11.217850674406307
"List(Red Peppers, Banana)",List(Green Bell Pepper),9.440398977940903
"List(Limes, Banana)",List(Bunched Cilantro),7.75284834397696
"List(Organic Cilantro, Banana)",List(Limes),7.6413823072202
"List(Bunched Cilantro, Banana)",List(Limes),6.972464960058672
"List(Limes, Organic Avocado, Banana)",List(Large Lemon),6.689848158909315
"List(Organic Avocado, Large Lemon, Banana)",List(Limes),6.656788779810276
"List(Large Lemon, Organic Baby Spinach, Banana)",List(Organic Avocado),6.396389762723926
"List(Limes, Banana)",List(Organic Cilantro),5.8152259931908645
"List(Limes, Large Lemon, Banana)",List(Organic Avocado),5.720295335617887


In [0]:
%sql
select * from ifThen where lift > 1  order by lift desc limit 10

antecedent,consequent,confidence,lift,support
List(Strawberry Rhubarb Yoghurt),List(Blueberry Yoghurt),0.3096646942800789,80.29801358062228,0.0011965642600736
List(Blueberry Yoghurt),List(Strawberry Rhubarb Yoghurt),0.3102766798418972,80.29801358062227,0.0011965642600736
List(Icelandic Style Skyr Blueberry Non-fat Yogurt),List(Nonfat Icelandic Style Strawberry Yogurt),0.2170212765957447,78.66062066533443,0.0011660785464411
List(Nonfat Icelandic Style Strawberry Yogurt),List(Icelandic Style Skyr Blueberry Non-fat Yogurt),0.4226519337016574,78.66062066533442,0.0011660785464411
List(Icelandic Style Skyr Blueberry Non-fat Yogurt),List(Non Fat Acai & Mixed Berries Yogurt),0.2397163120567376,74.88794663964877,0.0012880214009709
List(Non Fat Acai & Mixed Berries Yogurt),List(Icelandic Style Skyr Blueberry Non-fat Yogurt),0.4023809523809524,74.88794663964876,0.0012880214009709
List(Blackberry Cucumber Sparkling Water),List(Kiwi Sandia Sparkling Water),0.2567567567567567,72.44902644580064,0.0010136499782789
List(Kiwi Sandia Sparkling Water),List(Blackberry Cucumber Sparkling Water),0.2860215053763441,72.44902644580063,0.0010136499782789
List(Non Fat Raspberry Yogurt),List(Icelandic Style Skyr Blueberry Non-fat Yogurt),0.3819444444444444,71.08446611505121,0.0016767142497846
List(Icelandic Style Skyr Blueberry Non-fat Yogurt),List(Non Fat Raspberry Yogurt),0.3120567375886525,71.08446611505121,0.0016767142497846


The above table lists the rules in decreasing order of lift. A cart that contains ["Strawberry Rhubarb Yoghurt"] is about 80 times more likely to also contain ["Blueberry Yoghurt"] than a cart picked at random (lift ≈ 80.3), although the confidence of the rule is only about 0.31. The results ordered by confidence are shown below.

Note: Confidence only measures the likelihood that the consequent occurs when the antecedent is present. Lift additionally quantifies the strength of the association that is specifically due to the antecedent, because it divides the confidence by the overall frequency of the consequent.
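
As a worked example of how the three columns relate, the sketch below takes the support, confidence, and lift reported for the rule Strawberry Rhubarb Yoghurt → Blueberry Yoghurt and recovers the underlying cart counts. The variable names are illustrative; the numbers are copied from the table above and from the cart count of 131,209.

```python
n_carts = 131209
rule_support = 0.0011965642600736   # carts with both yoghurts / all carts
confidence = 0.3096646942800789     # P(Blueberry | Strawberry Rhubarb)
lift = 80.29801358062228

# Carts containing both products.
both = rule_support * n_carts
# Carts containing the antecedent (Strawberry Rhubarb Yoghurt).
antecedent_carts = both / confidence
# Baseline share of carts containing the consequent (Blueberry Yoghurt).
baseline = confidence / lift

print("carts with both products:", round(both))                  # 157
print("carts with the antecedent:", round(antecedent_carts))     # 507
print(f"baseline share of the consequent: {baseline:.4f}")       # 0.0039
```

So Blueberry Yoghurt is in roughly 0.4% of all carts but in about 31% of the carts that contain Strawberry Rhubarb Yoghurt, which is where the lift of about 80 comes from.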

In [0]:
%sql
select antecedent as `antecedent (if)`, consequent as `consequent (then)`, confidence from ifThen order by confidence desc limit 50

antecedent (if),consequent (then),confidence
"List(Organic Raspberries, Organic Hass Avocado, Organic Strawberries)",List(Bag of Organic Bananas),0.5984251968503937
"List(Organic Cucumber, Organic Hass Avocado, Organic Strawberries)",List(Bag of Organic Bananas),0.546875
"List(Organic Kiwi, Organic Hass Avocado)",List(Bag of Organic Bananas),0.5459770114942529
"List(Organic Navel Orange, Organic Raspberries)",List(Bag of Organic Bananas),0.5412186379928315
"List(Yellow Onions, Strawberries)",List(Banana),0.5357142857142857
"List(Organic Whole String Cheese, Organic Hass Avocado)",List(Bag of Organic Bananas),0.5314685314685315
"List(Organic Navel Orange, Organic Hass Avocado)",List(Bag of Organic Bananas),0.5283018867924528
"List(Organic Raspberries, Organic Hass Avocado)",List(Bag of Organic Bananas),0.521099116781158
"List(Organic D'Anjou Pears, Organic Hass Avocado)",List(Bag of Organic Bananas),0.5170454545454546
"List(Organic Unsweetened Almond Milk, Organic Hass Avocado)",List(Bag of Organic Bananas),0.5141065830721003


### Result
Based on lift > 1 and on confidence, the tables above show the recommended product (consequent) for a particular combination of items in a cart. Most of the highest-confidence rules have Bag of Organic Bananas as their consequent, with organic fruits and vegetables among the antecedents.
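
To show how such rules could drive a recommendation, here is a minimal plain-Python sketch. The `recommend` function and the sample cart are hypothetical; the three rules and their confidence values are copied from the confidence table above. It suggests the consequent of every rule whose antecedent is fully contained in the cart, skipping products already in the cart.

```python
def recommend(cart, rules, top_n=3):
    """Return up to top_n consequents of the rules that match the cart,
    ordered by descending confidence."""
    cart = set(cart)
    best = {}
    for antecedent, consequent, confidence in rules:
        # A rule applies when the cart holds the whole antecedent
        # and does not already hold the consequent.
        if set(antecedent) <= cart and consequent not in cart:
            best[consequent] = max(confidence, best.get(consequent, 0.0))
    ranked = sorted(best.items(), key=lambda kv: -kv[1])
    return [item for item, _ in ranked[:top_n]]

# (antecedent, consequent, confidence) taken from the table above.
rules = [
    (("Organic Raspberries", "Organic Hass Avocado", "Organic Strawberries"),
     "Bag of Organic Bananas", 0.5984251968503937),
    (("Organic Kiwi", "Organic Hass Avocado"),
     "Bag of Organic Bananas", 0.5459770114942529),
    (("Yellow Onions", "Strawberries"),
     "Banana", 0.5357142857142857),
]

# Hypothetical cart at checkout.
cart = ["Organic Kiwi", "Organic Hass Avocado", "Yellow Onions", "Strawberries"]
print(recommend(cart, rules))
```

For this cart the second and third rules apply, so the sketch suggests Bag of Organic Bananas first and Banana second.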
