# Frequent Pattern Mining in PySpark's MLlib Project Solution

Let's see if you can use the concepts we learned about in the lecture to try out frequent pattern mining techniques on a new dataset!


## Recap:

Spark MLlib implements two algorithms related to frequent pattern mining (FPM): 

- FP-growth
- PrefixSpan 

The distinction is that FP-growth does not use order information in the itemsets, if any, while PrefixSpan is designed for sequential pattern mining where the itemsets are ordered. 
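
To make the distinction concrete, here is a minimal plain-Python sketch (not Spark code) of the two kinds of matching. The helper names and the toy basket are hypothetical, and the sequence is simplified to single items, whereas PrefixSpan works on sequences of itemsets.

```python
def contains_itemset(transaction, pattern):
    """Unordered match (FP-growth view): every pattern item appears somewhere."""
    return set(pattern).issubset(transaction)


def contains_subsequence(sequence, pattern):
    """Ordered match (PrefixSpan view): pattern items appear in the same order."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator, so each match must come
    # after the previous one.
    return all(item in remaining for item in pattern)


# Hypothetical toy data, for illustration only
basket = ["bread", "milk", "eggs"]

unordered_match = contains_itemset(basket, ["eggs", "bread"])
wrong_order_match = contains_subsequence(basket, ["eggs", "bread"])
right_order_match = contains_subsequence(basket, ["bread", "eggs"])
```

The same pattern matches as an itemset regardless of order, but only matches as a sequence when the items occur in the given order.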

## Data

You own a supermarket mall, and through membership cards you have some basic data about your customers: customer ID, age, gender, annual income and spending score. The spending score is a value you assign to each customer based on parameters you define, such as customer behavior and purchasing data.

## Problem statement

You want to understand which customers can easily be grouped together, so that you can give the marketing team a strategy to plan around.

**Source:**  https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python

In [1]:
# First let's create our PySpark instance
# import findspark
# findspark.init()

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
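# May take a while locally
spark = SparkSession.builder.appName("FPM_Project").getOrCreate()

# Note: this counts the entries in the executor memory status map (one per
# block manager), which is not necessarily the number of CPU cores.
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("You are working with", cores, "core(s)")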
spark
# Click the hyperlinked "Spark UI" link to view details about your Spark session

You are working with 1 core(s)


**Read in the dataframe**

In [2]:
path = "Datasets/"
df = spark.read.csv(path + 'Mall_Customers.csv', inferSchema=True, header=True)

In [3]:
df.limit(4).toPandas()

| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |


In [4]:
df.printSchema()

root
 |-- CustomerID: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Annual Income (k$): integer (nullable = true)
 |-- Spending Score (1-100): integer (nullable = true)



In [5]:
# Let's rename some of these columns to be a bit more user friendly
# Sometimes Spark cannot process a command if column names contain spaces or special characters
df = df.withColumnRenamed("Annual Income (k$)", "income")
df = df.withColumnRenamed("Spending Score (1-100)", "spending_score")
df.show(5)

+----------+------+---+------+--------------+
|CustomerID|Gender|Age|income|spending_score|
+----------+------+---+------+--------------+
|         1|  Male| 19|    15|            39|
|         2|  Male| 21|    15|            81|
|         3|Female| 20|    16|             6|
|         4|Female| 23|    16|            77|
|         5|Female| 31|    17|            40|
+----------+------+---+------+--------------+
only showing top 5 rows



In [6]:
#How many rows do we have in our dataframe?
df.count()

200

## Create a meaningful grouping system

We need to recode our values so they can be grouped and analyzed accordingly. Let's do that here.
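
The recoding in the next cell relies on SQL `CASE` expressions, where branches are tested in order and the first match wins. As a rough plain-Python illustration of the age rule (the function name `age_group` is hypothetical, and this is a sketch rather than the code Spark runs):

```python
def age_group(age):
    """Mirror of the age CASE expression: branches are tested top to bottom."""
    if age is None:
        return "Other"      # a NULL age fails every comparison and hits ELSE
    if age < 30:
        return "Under 30"
    if 30 <= age <= 55:     # SQL BETWEEN is inclusive on both ends
        return "30 to 55"
    if age > 50:            # only reached for ages above 55
        return "50 +"
    return "Other"


labels = [age_group(a) for a in (19, 30, 52, 55, 56)]
```

Because the `30 to 55` branch is checked first, ages 51 to 55 never reach the `50 +` branch, so that bucket effectively contains customers older than 55.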

In [8]:
from pyspark.sql.functions import array, expr

# CASE branches are evaluated in order and the first match wins, so ages 30-55
# fall into '30 to 55' and the '50 +' bucket only ever receives ages above 55.
groups = df.withColumn("age_group", expr("CASE WHEN Age < 30 THEN 'Under 30' WHEN Age BETWEEN 30 AND 55 THEN '30 to 55' WHEN Age > 50 THEN '50 +' ELSE 'Other' END"))
# show() prints the table itself and returns None, so it is not wrapped in print()
groups.groupBy("age_group").count().show()

groups = groups.withColumn("income_group", expr("CASE WHEN income < 40 THEN 'Under 40' WHEN income BETWEEN 40 AND 70 THEN '40 - 70' WHEN income > 70 THEN '70 +' ELSE 'Other' END"))
groups.groupBy("income_group").count().show()

groups = groups.withColumn("spending_group", expr("CASE WHEN spending_score < 30 THEN 'Less than 30' WHEN spending_score BETWEEN 30 AND 60 THEN '30 - 60' WHEN spending_score > 60 THEN '60 +' ELSE 'Other' END"))
groups.groupBy("spending_group").count().show()

groups.groupBy("Gender").count().show()

# FPGrowth reads an array column of items per row; we name it "items" and pass it as itemsCol below
groups = groups.withColumn("items", array('Gender', 'age_group', 'income_group', 'spending_group'))
groups.limit(4).toPandas()

+---------+-----+
|age_group|count|
+---------+-----+
| 30 to 55|  116|
| Under 30|   55|
|     50 +|   29|
+---------+-----+

+------------+-----+
|income_group|count|
+------------+-----+
|     40 - 70|   80|
|    Under 40|   46|
|        70 +|   74|
+------------+-----+

+--------------+-----+
|spending_group|count|
+--------------+-----+
|          60 +|   62|
|       30 - 60|   92|
|  Less than 30|   46|
+--------------+-----+

+------+-----+
|Gender|count|
+------+-----+
|Female|  112|
|  Male|   88|
+------+-----+


| | CustomerID | Gender | Age | income | spending_score | age_group | income_group | spending_group | items |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 | Under 30 | Under 40 | 30 - 60 | [Male, Under 30, Under 40, 30 - 60] |
| 1 | 2 | Male | 21 | 15 | 81 | Under 30 | Under 40 | 60 + | [Male, Under 30, Under 40, 60 +] |
| 2 | 3 | Female | 20 | 16 | 6 | Under 30 | Under 40 | Less than 30 | [Female, Under 30, Under 40, Less than 30] |
| 3 | 4 | Female | 23 | 16 | 77 | Under 30 | Under 40 | 60 + | [Female, Under 30, Under 40, 60 +] |


## Fit the FPGrowth model

Order does not matter in these item lists, so we use FP-growth rather than PrefixSpan.

In [9]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.1)
model = fpGrowth.fit(groups)
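
With `minSupport=0.2` and 200 rows, an itemset has to appear in at least 40 rows to be reported as frequent. To show what "support" means, here is a minimal brute-force sketch in plain Python. It is not the FP-growth algorithm (which avoids enumerating every combination); it only reproduces the same definition of a frequent itemset. The function name `frequent_itemsets` is hypothetical, and the toy rows are the first four `items` arrays shown above.

```python
import math
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Count every item combination and keep those meeting the support threshold."""
    min_count = math.ceil(min_support * len(transactions))
    counts = Counter()
    for row in transactions:
        items = sorted(set(row))  # items within a row must be unique
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


toy_rows = [
    ["Male", "Under 30", "Under 40", "30 - 60"],
    ["Male", "Under 30", "Under 40", "60 +"],
    ["Female", "Under 30", "Under 40", "Less than 30"],
    ["Female", "Under 30", "Under 40", "60 +"],
]

# With 4 rows and min_support=0.5, an itemset needs at least 2 occurrences
result = frequent_itemsets(toy_rows, min_support=0.5)
```

Raising the threshold shrinks the result quickly, which is why a low `minSupport` on a wide `items` array can produce a very large number of itemsets.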

## Determine item popularity

See which item combinations were most popular.

In [10]:
itempopularity = model.freqItemsets
itempopularity.createOrReplaceTempView("itempopularity")
# Then query the temp view, keeping at most the 20 most frequent itemsets
print("Top 20")
spark.sql("SELECT * FROM itempopularity ORDER BY freq DESC").limit(20).toPandas()

Top 20


| | items | freq |
|---|---|---|
| 0 | [30 to 55] | 116 |
| 1 | [Female] | 112 |
| 2 | [30 - 60] | 92 |
| 3 | [Male] | 88 |
| 4 | [40 - 70] | 80 |
| 5 | [40 - 70, 30 - 60] | 77 |
| 6 | [70 +] | 74 |
| 7 | [Female, 30 to 55] | 72 |
| 8 | [60 +] | 62 |
| 9 | [Under 30] | 55 |


## Review Association Rules

In addition to `freqItemsets`, the FP-growth model also generates **associationRules**. For example, if a shopper purchases peanut butter, what is the probability (or confidence) that they will also purchase jelly? For more information, a good reference is Susan Li’s *A Gentle Introduction on Market Basket Analysis — Association Rules*.

A good way to think about association rules is that the model determines that if you purchased something (i.e. the antecedent), then you will purchase this other thing (i.e. the consequent) with the stated confidence.

**Source:** https://databricks.com/blog/2018/09/18/simplify-market-basket-analysis-using-fp-growth-on-databricks.html
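
As a worked example of where these numbers come from, the sketch below recomputes confidence and lift for the rule [40 - 70] → [30 - 60] by hand, using the itemset frequencies found earlier (80, 92 and 77 out of 200 rows). The function names are hypothetical helpers, not part of the Spark API.

```python
def confidence(pair_count, antecedent_count):
    """Share of rows containing the antecedent that also contain the consequent."""
    return pair_count / antecedent_count


def lift(pair_count, antecedent_count, consequent_count, total_rows):
    """Confidence divided by the consequent's overall support."""
    consequent_support = consequent_count / total_rows
    return confidence(pair_count, antecedent_count) / consequent_support


# Frequencies taken from the freqItemsets output in this notebook
total_rows = 200
income_40_70 = 80    # freq of [40 - 70]
spending_30_60 = 92  # freq of [30 - 60]
both = 77            # freq of [40 - 70, 30 - 60]

rule_confidence = confidence(both, income_40_70)
rule_lift = lift(both, income_40_70, spending_30_60, total_rows)
```

A lift above 1 means the consequent is more likely when the antecedent is present than it is across all rows; a lift close to 1 means the antecedent adds little information.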

In [12]:
# Display generated association rules.
assoc = model.associationRules
assoc.createOrReplaceTempView("assoc")
# Then query the temp view, keeping at most the 20 highest-confidence rules
print("Top 20")
spark.sql("SELECT * FROM assoc ORDER BY confidence DESC").limit(20).toPandas()

Top 20


| | antecedent | consequent | confidence | lift |
|---|---|---|---|---|
| 0 | [40 - 70] | [30 - 60] | 0.9625 | 2.092391 |
| 1 | [40 - 70, Female] | [30 - 60] | 0.957447 | 2.081406 |
| 2 | [30 - 60] | [40 - 70] | 0.836957 | 2.092391 |
| 3 | [30 - 60, Female] | [40 - 70] | 0.818182 | 2.045455 |
| 4 | [70 +] | [30 to 55] | 0.743243 | 1.281454 |
| 5 | [Female] | [30 to 55] | 0.642857 | 1.108374 |
| 6 | [30 to 55] | [Female] | 0.62069 | 1.108374 |
| 7 | [30 - 60] | [Female] | 0.597826 | 1.067547 |
| 8 | [40 - 70] | [Female] | 0.5875 | 1.049107 |
| 9 | [40 - 70, 30 - 60] | [Female] | 0.584416 | 1.043599 |


## Takeaways

Awesome! The highest-confidence rule pairs the [40 - 70] income group with the [30 - 60] spending group: about 96% of customers in that income band have a mid-range spending score, with a lift of roughly 2.09. Our advice to the marketing team might therefore be to focus efforts on this group first.