# Frequent pattern mining using Databricks

This notebook applies frequent pattern mining (FP-Growth from Spark MLlib) to the [Groceries dataset](https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset) published by **HEERAL DEDHIA** on Kaggle.

### **About Dataset**


**Association Rule Mining**

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Association rules are widely used to analyze retail basket or transaction data. The goal is to identify strong rules in that data using measures of interestingness such as support, confidence, and lift.

**Details of the dataset**

The dataset has 38,765 rows of purchase records from grocery stores. These orders can be analysed, and association rules generated from them, using Market Basket Analysis algorithms such as Apriori.

**Apriori Algorithm**

Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
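The level-wise search described above can be sketched in plain Python. This is a minimal illustration, not the implementation used later in this notebook (which relies on Spark's FP-Growth). The function name `apriori` and the toy baskets are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent individual items.
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy baskets, not taken from the Groceries dataset.
toy_baskets = [
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"milk", "bread"},
    {"butter", "bread"},
    {"milk"},
]
frequent_itemsets = apriori(toy_baskets, min_support=0.4)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what gives Apriori its name: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.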

**An example of Association Rules**

Assume there are 100 customers: 10 of them bought milk, 8 bought butter, and 6 bought both. For the rule "bought milk => bought butter":

* support = P(Milk & Butter) = 6/100 = 0.06
* confidence = support / P(Milk) = 0.06/0.10 = 0.6
* lift = confidence / P(Butter) = 0.6/0.08 = 7.5

Note: this example is extremely small. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

**Some important terms:**

* Support: How popular an itemset is, measured as the proportion of transactions in which the itemset appears.

* Confidence: How likely item Y is to be purchased when item X is purchased, written {X -> Y}. It is measured as the proportion of transactions containing item X in which item Y also appears.

* Lift: How likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. A lift above 1 means Y is bought more often together with X than its overall popularity would suggest; a lift below 1 means less often.
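The three measures can be computed directly from transaction counts. The sketch below uses the counts from the milk and butter example (100 customers, 10 bought milk, 8 bought butter, 6 bought both) for the rule {milk -> butter}. The helper name `rule_metrics` is an assumption made for illustration.

```python
def rule_metrics(n_total, n_x, n_y, n_both):
    """Support, confidence and lift for the rule {X -> Y} from raw counts."""
    support = n_both / n_total           # P(X and Y)
    confidence = n_both / n_x            # P(Y | X)
    lift = confidence / (n_y / n_total)  # P(Y | X) / P(Y)
    return support, confidence, lift


# X = milk, Y = butter
support, confidence, lift = rule_metrics(n_total=100, n_x=10, n_y=8, n_both=6)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.06 confidence=0.60 lift=7.50
```

Swapping the roles of X and Y changes the confidence (6/8 = 0.75 for {butter -> milk}) but leaves support and lift unchanged, since both are symmetric in the two items.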

## Importing Libraries

In [0]:
import pandas as pd
import numpy as np

## Data Loading and Preparation

In [0]:
df1 = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/tanujareddy.maligireddy@sjsu.edu/Groceries_dataset.csv")

In [0]:
df1.createOrReplaceTempView("data")

In [0]:
from pyspark.sql.functions import collect_set, col, count
baskets = df1.groupBy('Member_number').agg(collect_set('itemDescription').alias('items'))
baskets.createOrReplaceTempView('baskets')

In [0]:
display(baskets)


Member_number,items
1000,"List(pickled vegetables, whole milk, misc. beverages, pastry, salty snack, sausage, canned beer, semi-finished bread, hygiene articles, yogurt, soda)"
1001,"List(whole milk, beef, sausage, frankfurter, curd, rolls/buns, soda, white bread, whipped/sour cream)"
1002,"List(whole milk, sugar, butter, butter milk, specialty chocolate, frozen vegetables, tropical fruit, other vegetables)"
1003,"List(frozen meals, sausage, detergent, rolls/buns, root vegetables, dental care)"
1004,"List(pastry, whole milk, pip fruit, canned beer, shopping bags, packaged fruit/vegetables, cling film/bags, frozen fish, hygiene articles, red/blush wine, dish cleaner, rolls/buns, root vegetables, chocolate, tropical fruit, other vegetables)"
1005,"List(rolls/buns, margarine, whipped/sour cream)"
1006,"List(flour, whole milk, softener, frankfurter, chicken, rice, skin care, bottled water, shopping bags, bottled beer, rolls/buns, chocolate)"
1008,"List(liquor (appetizer), photo/film, liver loaf, yogurt, dessert, domestic eggs, white wine, soda, root vegetables, tropical fruit, hamburger meat)"
1009,"List(pastry, canned fish, ketchup, cocoa drinks, yogurt, newspapers, herbs, tropical fruit)"
1010,"List(pip fruit, frankfurter, specialty bar, bottled water, candles, kitchen towels, rolls/buns, UHT-milk, sliced cheese, coffee)"
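
To make the grouping step concrete, here is a plain-Python sketch of what `groupBy('Member_number')` followed by `collect_set('itemDescription')` does: it gathers each member's items into one set and drops duplicates. The function name `build_baskets` and the sample rows are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def build_baskets(rows):
    """Group (member, item) rows into one de-duplicated item set per member."""
    baskets = defaultdict(set)
    for member, item in rows:
        baskets[member].add(item)  # a set keeps each item once per member
    return dict(baskets)


# Hypothetical rows; member numbers are strings because the CSV is read without a schema.
sample_rows = [
    ("1000", "whole milk"),
    ("1000", "soda"),
    ("1000", "whole milk"),  # repeat purchase collapses into one entry
    ("1001", "beef"),
]
sample_baskets = build_baskets(sample_rows)
for member, items in sorted(sample_baskets.items()):
    print(member, sorted(items))
```

Two consequences follow from this grouping. Each basket represents everything a member ever bought rather than a single shopping trip, and repeated purchases of the same item count only once. The de-duplication matters because Spark's FP-Growth expects the items within a basket to be unique.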


## Train ML Model

## Use FP_Growth

In [0]:
%scala
import org.apache.spark.ml.fpm.FPGrowth

// Extract out the items 
val baskets = spark.sql("select items from baskets").as[Array[String]].toDF("items")

// Use FPGrowth
val fpgrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.001).setMinConfidence(0)
val model = fpgrowth.fit(baskets)
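
A `minSupport` of 0.001 is a fraction of all baskets, so it helps to translate it into an absolute count. The sketch below assumes the threshold is the ceiling of `minSupport` times the number of baskets, which is how the MLlib implementation appears to compute it; this is worth confirming against the Spark version in use. The helper name `min_basket_count` and the basket total are hypothetical.

```python
import math


def min_basket_count(min_support, n_baskets):
    """Smallest number of baskets an itemset must appear in to count as frequent."""
    return math.ceil(min_support * n_baskets)


# 2500 is a made-up basket total for illustration, not the real size of `baskets`.
print(min_basket_count(0.001, 2500))
# prints: 3
```

At a threshold this low, an itemset qualifies as frequent after appearing in only a handful of baskets. Combined with `setMinConfidence(0)`, the model will emit rules that rest on very few transactions, so their confidence values should be read with that in mind.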

## Most Frequent Itemsets

In [0]:
%scala
// Display frequent itemsets
val mostPopularItemInABasket = model.freqItemsets
mostPopularItemInABasket.createOrReplaceTempView("mostPopularItemInABasket")


In [0]:
%sql
select items, freq from mostPopularItemInABasket where size(items) > 2 order by freq desc limit 20

items,freq
"List(rolls/buns, other vegetables, whole milk)",320
"List(yogurt, other vegetables, whole milk)",280
"List(soda, other vegetables, whole milk)",270
"List(yogurt, rolls/buns, whole milk)",257
"List(soda, rolls/buns, whole milk)",254
"List(bottled water, other vegetables, whole milk)",219
"List(yogurt, soda, whole milk)",212
"List(soda, rolls/buns, other vegetables)",205
"List(yogurt, rolls/buns, other vegetables)",204
"List(tropical fruit, other vegetables, whole milk)",197


## Review Association Rules

## View Generated Association Rules

In [0]:
%scala
// Display generated association rules.
val ifThen = model.associationRules
ifThen.createOrReplaceTempView("ifThen")

In [0]:
%sql
select antecedent as `antecedent (if)`, consequent as `consequent (then)`, confidence from ifThen order by confidence desc limit 20

antecedent (if),consequent (then),confidence
"List(pasta, misc. beverages, rolls/buns)",List(coffee),1.0
"List(ice cream, bottled beer, citrus fruit, sausage)",List(rolls/buns),1.0
"List(ice cream, bottled beer, citrus fruit, sausage)",List(whole milk),1.0
"List(waffles, chocolate, citrus fruit, bottled water, soda)",List(other vegetables),1.0
"List(frozen dessert, pastry, citrus fruit)",List(whole milk),1.0
"List(beef, whipped/sour cream, canned beer, sausage, soda, whole milk)",List(pastry),1.0
"List(ham, cream cheese , citrus fruit, other vegetables)",List(rolls/buns),1.0
"List(chicken, newspapers, shopping bags, root vegetables)",List(whole milk),1.0
"List(cat food, bottled beer, root vegetables, rolls/buns)",List(whole milk),1.0
"List(beef, pork, tropical fruit, yogurt, soda)",List(whole milk),1.0
