
# Market-basket analysis

The goal is to implement a system that finds frequent itemsets (market-basket analysis) in the «Old Newspapers» dataset, published on Kaggle under the CC0 public-domain dedication.

> Project authors: Mathias Cardarello Fierro & Lorenzo Polli

Algorithms for Massive Data

*Università degli Studi di Milano*

15-Dec-2022


## **FP growth Algorithm**

### **1. Setup and data import**

In [None]:
%%capture
# Download the dataset containing old newspapers
import os

# Supply your own Kaggle API credentials; never commit real keys to a notebook
os.environ["KAGGLE_USERNAME"] = "<your-kaggle-username>"
os.environ["KAGGLE_KEY"] = "<your-kaggle-api-key>"
!pip install kaggle --upgrade
!kaggle datasets download alvations/old-newspapers --unzip

#### **1.1 Setting up PySpark and Spark NLP**


In [None]:
%%capture
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

Spark NLP version 4.2.4
Apache Spark version: 3.2.1


In [None]:
spark

#### **1.2 Import the dataset**

In [None]:
%%time
# Import the dataset and keep only the rows whose language is English

df = spark.read.csv('old-newspaper.tsv', sep='\t', header=True)
df = df.filter("Language == 'English'")
df.show(5)

+--------+------------+----------+--------------------+
|Language|      Source|      Date|                Text|
+--------+------------+----------+--------------------+
| English| latimes.com|2012/04/29|He wasn't home al...|
| English|stltoday.com|2011/07/10|The St. Louis pla...|
| English|   freep.com|2012/05/07|WSU's plans quick...|
| English|      nj.com|2011/02/05|The Alaimo Group ...|
| English|  sacbee.com|2011/10/02|And when it's oft...|
+--------+------------+----------+--------------------+
only showing top 5 rows

CPU times: user 337 ms, sys: 34.6 ms, total: 371 ms
Wall time: 59 s


In [None]:
%%time
# Count articles per source and show the ten largest sources

from pyspark.sql.functions import desc
df.groupby("Source").count().sort(desc("count")).show(10)

+----------------+------+
|          Source| count|
+----------------+------+
|   cleveland.com|152716|
|          nj.com|125230|
|    stltoday.com|120632|
|  oregonlive.com|103496|
|     latimes.com| 60637|
|   azcentral.com| 42693|
|      sfgate.com| 42121|
|baltimoresun.com| 39994|
|       freep.com| 36469|
| startribune.com| 31821|
+----------------+------+
only showing top 10 rows

CPU times: user 373 ms, sys: 39.4 ms, total: 412 ms
Wall time: 1min 10s


In [None]:
%%time
# Count the NULL values in each column
import pyspark.sql.functions as F

df.select([F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns]).show()

+--------+------+----+----+
|Language|Source|Date|Text|
+--------+------+----+----+
|       0|     0|   0|   0|
+--------+------+----+----+



### **2. Pre-processing**

In [None]:
from pyspark.ml import Pipeline
import pyspark.sql.types as T
from typing import List
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.pretrained import PretrainedPipeline

In [None]:
# Initialize the annotators
document = DocumentAssembler()\
    .setInputCol("Text")\
    .setOutputCol("document")\
    .setCleanupMode("shrink") # remove new lines and tabs, and collapse repeated spaces and blank lines into a single space

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

normalizer = Normalizer()\
    .setInputCols(["token"])\
    .setOutputCol("normalized")\
    .setLowercase(True)\
    .setCleanupPatterns([r"[^\w\d\s]"]) # remove punctuation (keep word characters, digits, and whitespace)

stop_words = StopWordsCleaner.pretrained('stopwords_en', 'en')\
    .setInputCols(["normalized"])\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(["cleanTokens"])\
    .setOutputCol("lemma")

prediction_pipeline = Pipeline(stages = [document, sentence, token, normalizer, stop_words, lemmatizer])

stopwords_en download started this may take some time.
Approximate size to download 2.9 KB
[OK!]
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]
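The effect of the normalizer and stop-word stages on individual tokens can be approximated in plain Python. This is only a toy sketch: the tiny `STOP_WORDS` set and the `normalize_tokens` helper are hypothetical stand-ins, not the Spark NLP implementation, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the pretrained 'stopwords_en' model is much larger.
STOP_WORDS = {"he", "the", "a", "and", "when", "at", "all"}

# Same character class as the Normalizer cleanup pattern: drop everything
# that is not a word character, digit, or whitespace.
CLEANUP = re.compile(r"[^\w\d\s]")

def normalize_tokens(tokens):
    """Lowercase, strip punctuation, and drop stop words and empty tokens."""
    cleaned = (CLEANUP.sub("", tok.lower()) for tok in tokens)
    return [tok for tok in cleaned if tok and tok not in STOP_WORDS]

print(normalize_tokens(["He", "wasn't", "home", "at", "all", "."]))
# ['wasnt', 'home']
```

Removing the apostrophe is what turns "wasn't" into `wasnt` and "WSU's" into `wsus` in the baskets.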


In [None]:
clean_df = prediction_pipeline.fit(df).transform(df)

In [None]:
# Add a column where each row corresponds to a different basket of items 
clean_df = clean_df.withColumn("Basket", clean_df.lemma.result) 
clean_df.select(['Text','Basket']).show()

+--------------------+--------------------+
|                Text|              Basket|
+--------------------+--------------------+
|He wasn't home al...|[wasnt, home, app...|
|The St. Louis pla...|[st, louis, plant...|
|WSU's plans quick...|[wsus, plan, quic...|
|The Alaimo Group ...|[alaimo, group, m...|
|And when it's oft...|[difficult, predi...|
|There was a certa...|[amount, scoff, y...|
|14915 Charlevoix,...|[14915, charlevoi...|
|"""It’s just anot...|[long, line, fail...|
|But time and agai...|[time, report, su...|
|I was just trying...|[hit, hard, somep...|
|MHTA President an...|[mhta, president,...|
|"""The absurdity ...|[absurdity, attem...|
|"GM labor relatio...|[gm, labor, relat...|
|Here is why Wandr...|[wandry, matter, ...|
|"""Cheap,"" he sa...|[cheap, hit, hard...|
|Andrade's childre...|[andrades, child,...|
|"""Let your hair ...|              [hair]|
|Born on April 15,...|[bear, april, 15,...|
|House Minority Le...|[house, minority,...|
|The first is the ...|[love, lov

Note: in the last row the item "love" appears more than once. Duplicates inside a basket must be removed before mining, because Spark's FPGrowth requires the items of each transaction to be unique.

In [None]:
# Remove duplicates inside baskets.
# Note: F.udf defaults to a string return type, so BasketNoDup holds the list rendered as text
remove_duplicates = F.udf(lambda x: list(set(x)))
clean_df = clean_df.withColumn("BasketNoDup", remove_duplicates(F.col("Basket")))
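A minimal plain-Python sketch of the de-duplication idea, for illustration. `set` removes duplicates but does not preserve token order, whereas `dict.fromkeys` keeps the first occurrence of each item in order. The `dedupe` helper and the example basket are made up. In Spark itself, the built-in `pyspark.sql.functions.array_distinct` (available since Spark 2.4) is a possible alternative to a Python UDF that keeps the column as an array.

```python
def dedupe(basket):
    """Return the basket with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(basket))

# Made-up basket with a repeated item.
basket = ["love", "love", "story", "love", "time"]
print(dedupe(basket))  # ['love', 'story', 'time']
```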

### **3. Exploratory Data Analysis**

In [None]:
clean_df.createOrReplaceTempView("df_view")

In [None]:
# Number of distinct article texts
total_rows = spark.sql("""SELECT COUNT(DISTINCT Text) AS total_rows 
                        FROM df_view""")
total_rows.show()

+----------+
|total_rows|
+----------+
|   1010092|
+----------+



### **4. Market-Basket Analysis**

#### **4.1 FPGrowth Algorithm**

In [None]:
# FP-Growth (Frequent Pattern Growth) mines the itemsets whose support meets a minimum threshold; association rules are then derived from them and scored by confidence and lift.
from pyspark.ml.fpm import FPGrowth
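Support, confidence, and lift can be written as follows, using their standard definitions. Here $N$ is the number of baskets and $\operatorname{freq}(X)$ is the number of baskets that contain every item of $X$; this notation is introduced only for this sketch.

```latex
\begin{aligned}
\operatorname{support}(X) &= \frac{\operatorname{freq}(X)}{N} \\
\operatorname{confidence}(A \Rightarrow B) &= \frac{\operatorname{support}(A \cup B)}{\operatorname{support}(A)} \\
\operatorname{lift}(A \Rightarrow B) &= \frac{\operatorname{confidence}(A \Rightarrow B)}{\operatorname{support}(B)}
\end{aligned}
```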

In [None]:
# Keep only the text and the de-duplicated baskets; SPLIT turns the comma-separated BasketNoDup string into an array of items
baskets = spark.sql("SELECT Text, SPLIT(BasketNoDup,',') AS BasketNoDup FROM df_view")
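Because `BasketNoDup` is split on commas here, it helps to see what a bare comma split does to a list rendered as text. The sketch below mimics the situation in plain Python; the `rendered` string only imitates how a list is printed, and `clean_split` is a hypothetical helper showing one way to strip the brackets and padding.

```python
# A basket rendered as text, in the "[a, b, c]" style used for lists.
rendered = "[company, year, make]"

# Splitting on "," alone keeps the brackets and the space after each comma.
naive = rendered.split(",")
print(naive)  # ['[company', ' year', ' make]']

def clean_split(text):
    """Drop the surrounding brackets, then split and trim each item."""
    inner = text.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]

print(clean_split(rendered))  # ['company', 'year', 'make']
```

This would explain the padded items (for example `[ year]`) and the stray `[[company]` in the results that follow.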

In [None]:
# Draw a random sample of roughly 5% of the baskets, without replacement (seed 10)
sample = baskets.sample(False, 0.05, 10)
sample.count()

50342
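Sampling by fraction without replacement keeps each row independently with the given probability, so the resulting size is close to, but not exactly, 5% of the input. A rough plain-Python analogue follows; `bernoulli_sample` is a hypothetical helper and does not reproduce Spark's random number generator, so the rows it picks differ from those selected by `sample`.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))
picked = bernoulli_sample(rows, fraction=0.05, seed=10)
print(len(picked))  # close to 5,000, but usually not exactly 5,000
```

Fixing the seed makes the sample reproducible across runs, which is why the two experiments below operate on the same baskets.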

In [None]:
fp_growth = FPGrowth(itemsCol="BasketNoDup", minSupport=0.006, minConfidence=0.006)
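`minSupport` is a fraction of the number of baskets, so it translates into a minimum absolute count. A small sketch of that conversion, assuming the threshold is rounded up to the next whole basket (the `min_count` helper is hypothetical):

```python
import math

def min_count(min_support, n_baskets):
    """Smallest number of baskets an itemset must appear in to be kept."""
    return math.ceil(min_support * n_baskets)

n_baskets = 50342  # size of the sample above
print(min_count(0.006, n_baskets))  # 303
```

With 50,342 baskets, a threshold of 0.006 means an itemset must occur in at least 303 baskets, which is consistent with the least frequent itemsets reported in the results (304 occurrences).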

In [None]:
# Apply Frequent Pattern Growth algorithm to the sample
model = fp_growth.fit(sample)

#### **4.2 Results**

##### **4.2.1 Experiment #1**

FREQUENT ITEMSETS

In [None]:
# Display the frequent itemsets (those meeting minSupport), most frequent first
freq_itemsets = model.freqItemsets
freq_itemsets.sort(freq_itemsets.freq.desc()).show()

+----------+----+
|     items|freq|
+----------+----+
|   [ year]|3884|
|   [ make]|3655|
|  [ state]|2760|
|   [ time]|2340|
| [ people]|2010|
|   [ game]|1930|
|[ include]|1837|
|   [ city]|1774|
|    [ day]|1768|
|   [ back]|1731|
|   [ play]|1730|
|   [ home]|1513|
|   [ team]|1497|
|    [ run]|1441|
|   [ week]|1430|
|   [ work]|1406|
| [ county]|1378|
|   [ find]|1370|
|   [ good]|1343|
|   [ show]|1322|
+----------+----+
only showing top 20 rows



In [None]:
# Display the least frequent itemsets that still meet minSupport
freq_itemsets = model.freqItemsets
freq_itemsets.sort(freq_itemsets.freq.asc()).show(5)

+-----------+----+
|      items|freq|
+-----------+----+
|   [ mayor]| 304|
|[ hospital]| 304|
| [ current]| 306|
| [[company]| 306|
|[ election]| 306|
+-----------+----+
only showing top 5 rows



In [None]:
# Total number of frequent itemsets returned
freq_itemsets.count()

436

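Note: the least frequent itemsets include more topic-specific words (e.g. mayor, hospital, election). The leading space in each item and the stray bracket in `[[company]` appear to be artifacts of splitting the stringified basket on commas.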

In [None]:
# Display frequent pairs of items
freq_itemsets.where(F.size(F.col("items"))>1).sort(freq_itemsets.freq.desc()).show()

+-----------------+----+
|            items|freq|
+-----------------+----+
|   [ play,  game]| 397|
|    [ louis,  st]| 383|
|   [ make,  year]| 381|
|    [ ago,  year]| 377|
|  [ state,  year]| 369|
|   [ time,  year]| 338|
|[ million,  year]| 315|
| [ season,  game]| 308|
+-----------------+----+



ASSOCIATION RULES

In [None]:
%%capture
# Retrieve the generated association rules
association_rules = model.associationRules 

In [None]:
# Order by CONFIDENCE
association_rules.sort(association_rules.confidence.desc()).show(10)

+----------+----------+-------------------+------------------+--------------------+
|antecedent|consequent|         confidence|              lift|             support|
+----------+----------+-------------------+------------------+--------------------+
|  [ louis]|     [ st]| 0.8362445414847162| 46.67208725878445|0.007607961543045...|
|    [ ago]|   [ year]| 0.5974643423137876|7.7439623894852465|0.007488776766914306|
|     [ st]|  [ louis]| 0.4246119733924612| 46.67208725878446|0.007607961543045...|
|[ million]|   [ year]| 0.2830188679245283|3.6683151001690533|0.006257200746891263|
| [ season]|   [ game]|0.23692307692307693| 6.179886807493025|0.006118151841404791|
|   [ play]|   [ game]|0.22947976878612716| 5.985736020845188|0.007886059354018513|
|   [ game]|   [ play]|0.20569948186528497| 5.985736020845188|0.007886059354018513|
|   [ game]| [ season]|0.15958549222797927| 6.179886807493025|0.006118151841404791|
|   [ time]|   [ year]|0.14444444444444443|1.8721993363085019|0.006714075722

In [None]:
# Order by LIFT (desc) - Look for highly dependent items (LIFT >> 1)
association_rules.sort(association_rules.lift.desc()).show(10)

+----------+----------+-------------------+------------------+--------------------+
|antecedent|consequent|         confidence|              lift|             support|
+----------+----------+-------------------+------------------+--------------------+
|     [ st]|  [ louis]| 0.4246119733924612| 46.67208725878446|0.007607961543045...|
|  [ louis]|     [ st]| 0.8362445414847162| 46.67208725878445|0.007607961543045...|
|    [ ago]|   [ year]| 0.5974643423137876|7.7439623894852465|0.007488776766914306|
|   [ year]|    [ ago]| 0.0970648815653965| 7.743962389485246|0.007488776766914306|
|   [ game]| [ season]|0.15958549222797927| 6.179886807493025|0.006118151841404791|
| [ season]|   [ game]|0.23692307692307693| 6.179886807493025|0.006118151841404791|
|   [ play]|   [ game]|0.22947976878612716| 5.985736020845188|0.007886059354018513|
|   [ game]|   [ play]|0.20569948186528497| 5.985736020845188|0.007886059354018513|
|[ million]|   [ year]| 0.2830188679245283|3.6683151001690533|0.006257200746

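Note: a rule and its reverse (A → B and B → A) have the same lift, because lift is symmetric in antecedent and consequent; only the confidence depends on the direction.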
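The symmetry can be checked by hand from the counts reported above for `play` (1730), `game` (1930), the pair (397), and the sample size (50342). The sketch uses the standard definitions of confidence and lift; the helper names are hypothetical.

```python
N = 50342          # baskets in the sample
FREQ_PLAY = 1730   # baskets containing "play"
FREQ_GAME = 1930   # baskets containing "game"
FREQ_BOTH = 397    # baskets containing both

def confidence(freq_both, freq_antecedent):
    """Share of baskets with the antecedent that also hold the consequent."""
    return freq_both / freq_antecedent

def lift(freq_both, freq_antecedent, freq_consequent, n):
    """Confidence divided by the consequent's support."""
    return confidence(freq_both, freq_antecedent) / (freq_consequent / n)

print(confidence(FREQ_BOTH, FREQ_PLAY))          # play -> game, about 0.2295
print(confidence(FREQ_BOTH, FREQ_GAME))          # game -> play, about 0.2057
print(lift(FREQ_BOTH, FREQ_PLAY, FREQ_GAME, N))  # about 5.986
print(lift(FREQ_BOTH, FREQ_GAME, FREQ_PLAY, N))  # same value, up to rounding
```

Swapping antecedent and consequent changes the denominator of the confidence but leaves the lift untouched, since both item counts end up in the same product.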

In [None]:
# Order by LIFT (asc) - Look for substitute items (LIFT < 1)
association_rules.sort(association_rules.lift.asc()).show(10)

+----------+----------+-------------------+------------------+--------------------+
|antecedent|consequent|         confidence|              lift|             support|
+----------+----------+-------------------+------------------+--------------------+
|   [ make]|   [ year]|0.10424076607387141|1.3511041827216361|0.007568233284335147|
|   [ year]|   [ make]|0.09809474768280124|1.3511041827216363|0.007568233284335147|
|  [ state]|   [ year]|0.13369565217391305|1.7328801549276855|0.007329863732072623|
|   [ year]|  [ state]|0.09500514933058703|1.7328801549276855|0.007329863732072623|
|   [ time]|   [ year]|0.14444444444444443|1.8721993363085019|0.006714075722061102|
|   [ year]|   [ time]|0.08702368692070031| 1.872199336308502|0.006714075722061102|
|   [ year]|[ million]|0.08110195674562307|3.6683151001690533|0.006257200746891263|
|[ million]|   [ year]| 0.2830188679245283|3.6683151001690533|0.006257200746891263|
|   [ game]|   [ play]|0.20569948186528497| 5.985736020845188|0.007886059354

Note: the smallest lift is greater than 1, so none of the returned rules points to substitute items (lift < 1).

In [None]:
# Order by SUPPORT (desc): rules derived from the most frequent pairs come first
association_rules.sort(association_rules.support.desc()).show(10)

+----------+----------+-------------------+------------------+--------------------+
|antecedent|consequent|         confidence|              lift|             support|
+----------+----------+-------------------+------------------+--------------------+
|   [ game]|   [ play]|0.20569948186528497| 5.985736020845188|0.007886059354018513|
|   [ play]|   [ game]|0.22947976878612716| 5.985736020845188|0.007886059354018513|
|     [ st]|  [ louis]| 0.4246119733924612| 46.67208725878446|0.007607961543045...|
|  [ louis]|     [ st]| 0.8362445414847162| 46.67208725878445|0.007607961543045...|
|   [ year]|   [ make]|0.09809474768280124|1.3511041827216363|0.007568233284335147|
|   [ make]|   [ year]|0.10424076607387141|1.3511041827216361|0.007568233284335147|
|   [ year]|    [ ago]| 0.0970648815653965| 7.743962389485246|0.007488776766914306|
|    [ ago]|   [ year]| 0.5974643423137876|7.7439623894852465|0.007488776766914306|
|  [ state]|   [ year]|0.13369565217391305|1.7328801549276855|0.007329863732

PREDICTIONS

In [None]:
model.transform(sample).show(truncate=False) # transform checks each basket against all association rules and summarizes the applicable consequents as the prediction

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+
|Text                                                                                                                                                                       
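Conceptually, the prediction step can be sketched in plain Python as follows. This is a simplified illustration rather than Spark's implementation: for each basket it collects the consequents of every rule whose antecedent is contained in the basket, skipping items the basket already has. The `predict` helper and the example baskets are made up; the two rules mirror rules from the tables above.

```python
def predict(basket, rules):
    """Collect consequents of all rules whose antecedent is in the basket."""
    items = set(basket)
    predicted = []
    for antecedent, consequent in rules:
        if set(antecedent) <= items:
            for item in consequent:
                # Skip items already present or already predicted.
                if item not in items and item not in predicted:
                    predicted.append(item)
    return predicted

# Rules as (antecedent, consequent) pairs.
rules = [(["louis"], ["st"]), (["ago"], ["year"])]

print(predict(["louis", "plant", "ago"], rules))  # ['st', 'year']
print(predict(["louis", "st"], rules))            # []
```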

----------------------

##### **4.2.2 Experiment #2**

In [None]:
fp_growth = FPGrowth(itemsCol="BasketNoDup", minSupport=0.003, minConfidence=0.003)

In [None]:
# Apply Frequent Pattern Growth algorithm to the sample
model = fp_growth.fit(sample)

FREQUENT ITEMSETS

In [None]:
# Display the frequent itemsets (those meeting minSupport), most frequent first
freq_itemsets = model.freqItemsets
freq_itemsets.sort(freq_itemsets.freq.desc()).show()



+----------+----+
|     items|freq|
+----------+----+
|   [ year]|3884|
|   [ make]|3655|
|  [ state]|2760|
|   [ time]|2340|
| [ people]|2010|
|   [ game]|1930|
|[ include]|1837|
|   [ city]|1774|
|    [ day]|1768|
|   [ back]|1731|
|   [ play]|1730|
|   [ home]|1513|
|   [ team]|1497|
|    [ run]|1441|
|   [ week]|1430|
|   [ work]|1406|
| [ county]|1378|
|   [ find]|1370|
|   [ good]|1343|
|   [ show]|1322|
+----------+----+
only showing top 20 rows



Note: the most frequent itemsets are the same as in experiment #1. The difference shows up in the tail, where the lower minSupport admits less frequent itemsets.

In [None]:
# Display the least frequent itemsets that still meet minSupport
freq_itemsets = model.freqItemsets
freq_itemsets.sort(freq_itemsets.freq.asc()).show(5)

+-----------+----+
|      items|freq|
+-----------+----+
| [ classic]| 152|
| [ compete]| 152|
|  [ crisis]| 152|
|[ audience]| 152|
|   [ human]| 152|
+-----------+----+
only showing top 5 rows



In [None]:
# Total number of frequent itemsets returned
freq_itemsets.count()

1055

In [None]:
# Display frequent pairs of items
freq_itemsets.where(F.size(F.col("items"))>1).sort(freq_itemsets.freq.desc()).show()

+-----------------+----+
|            items|freq|
+-----------------+----+
|   [ play,  game]| 397|
|    [ louis,  st]| 383|
|   [ make,  year]| 381|
|    [ ago,  year]| 377|
|  [ state,  year]| 369|
|   [ time,  year]| 338|
|[ million,  year]| 315|
| [ season,  game]| 308|
|   [ time,  make]| 294|
|   [ team,  game]| 266|
|    [ win,  game]| 254|
|[ percent,  year]| 253|
| [ people,  make]| 241|
| [ school,  year]| 221|
|   [ good,  make]| 220|
|   [ team,  play]| 216|
|   [ work,  year]| 216|
|   [ play,  make]| 211|
| [ high,  school]| 208|
|[ include,  year]| 202|
+-----------------+----+
only showing top 20 rows



Note: "year", "make", and "play" each appear in several of the most frequent pairs.

ASSOCIATION RULES

In [None]:
%%capture
# Retrieve the generated association rules
association_rules = model.associationRules 

In [None]:
# Order by CONFIDENCE
association_rules.sort(association_rules.confidence.desc()).show(10)

+-----------+------------+-------------------+------------------+--------------------+
| antecedent|  consequent|         confidence|              lift|             support|
+-----------+------------+-------------------+------------------+--------------------+
|   [ louis]|       [ st]| 0.8362445414847162| 46.67208725878445|0.007607961543045...|
|    [ vice]|[ president]|           0.796875|39.291166748285995|0.003039211791347...|
|  [ united]|    [ state]| 0.6599326599326599|12.037076074757234|0.003893369353621...|
|     [ ago]|     [ year]| 0.5974643423137876|7.7439623894852465|0.007488776766914306|
|      [ st]|    [ louis]| 0.4246119733924612| 46.67208725878446|0.007607961543045...|
|    [ ohio]|    [ state]| 0.3163841807909605|5.7708016048472945|0.003337173731675...|
|    [ past]|     [ year]|0.30615640599001664| 3.968209523776884|0.003654999801358...|
| [ million]|     [ year]| 0.2830188679245283|3.6683151001690533|0.006257200746891263|
| [ student]|   [ school]|               0.

In [None]:
# Order by LIFT (desc) - Look for highly dependent items (LIFT >> 1)
association_rules.sort(association_rules.lift.desc()).show(10)

+------------+------------+-------------------+------------------+--------------------+
|  antecedent|  consequent|         confidence|              lift|             support|
+------------+------------+-------------------+------------------+--------------------+
|       [ st]|    [ louis]| 0.4246119733924612| 46.67208725878446|0.007607961543045...|
|    [ louis]|       [ st]| 0.8362445414847162| 46.67208725878445|0.007607961543045...|
|[ president]|     [ vice]|0.14985308521057786|39.291166748285995|0.003039211791347...|
|     [ vice]|[ president]|           0.796875|39.291166748285995|0.003039211791347...|
|    [ state]|   [ united]|0.07101449275362319|12.037076074757236|0.003893369353621...|
|   [ united]|    [ state]| 0.6599326599326599|12.037076074757234|0.003893369353621...|
|   [ school]|  [ student]|0.18214936247723132|11.462204007285974|0.003972825871042072|
|  [ student]|   [ school]|               0.25|11.462204007285974|0.003972825871042072|
|   [ school]| [ district]|0.163

In [None]:
# Order by LIFT (asc) - Look for substitute items (LIFT < 1)
association_rules.sort(association_rules.lift.asc()).show(10)

+----------+----------+-------------------+------------------+--------------------+
|antecedent|consequent|         confidence|              lift|             support|
+----------+----------+-------------------+------------------+--------------------+
|  [ state]|   [ make]|0.06847826086956521|0.9431826562778802|0.003754320448134...|
|   [ make]|  [ state]|0.05170998632010944|0.9431826562778802|0.003754320448134...|
|   [ year]| [ people]|0.04582904222451081|1.1478237033165788|0.003535815025227444|
| [ people]|   [ year]|0.08855721393034825|1.1478237033165788|0.003535815025227444|
|   [ time]|  [ state]|0.06752136752136752|1.2315799578843056|0.003138532438123237|
|  [ state]|   [ time]| 0.0572463768115942|1.2315799578843056|0.003138532438123237|
|   [ year]|   [ play]|0.04402677651905252|  1.28115374770066|0.003396766119740...|
|   [ play]|   [ year]|0.09884393063583816|1.2811537477006603|0.003396766119740...|
|   [ make]|    [ day]| 0.0454172366621067|1.2932095746853933| 0.00329744547

Note: "state" and "make" are close to independent (lift ≈ 0.94, slightly below 1). The pairs "year"/"people" (lift ≈ 1.15) and "time"/"state" (lift ≈ 1.23) are also only weakly associated.
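One way to read a lift close to 1 is to compare the observed number of co-occurrences with the number expected if the two items were independent. The sketch below does this for `state` and `make`, using the item counts from the frequent-itemset table (2760 and 3655) and the sample size (50342). The pair count of 189 is not shown in the output; it is an assumption derived from the rule's confidence (0.0685 × 2760 ≈ 189).

```python
N = 50342           # baskets in the sample
FREQ_STATE = 2760   # baskets containing "state"
FREQ_MAKE = 3655    # baskets containing "make"
FREQ_BOTH = 189     # assumed pair count, derived from confidence * FREQ_STATE

# Co-occurrences expected if "state" and "make" were independent.
expected = FREQ_STATE * FREQ_MAKE / N

# Lift is the ratio of observed to expected co-occurrences.
lift = FREQ_BOTH / expected

print(round(expected, 1))  # 200.4
print(round(lift, 3))      # 0.943
```

The two words co-occur in about 189 baskets against roughly 200 expected under independence, which is why the lift sits just below 1.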

In [None]:
# Order by SUPPORT (desc): rules derived from the most frequent pairs come first
association_rules.sort(association_rules.support.desc()).show(10)

+----------+----------+-------------------+------------------+--------------------+
|antecedent|consequent|         confidence|              lift|             support|
+----------+----------+-------------------+------------------+--------------------+
|   [ game]|   [ play]|0.20569948186528497| 5.985736020845188|0.007886059354018513|
|   [ play]|   [ game]|0.22947976878612716| 5.985736020845188|0.007886059354018513|
|     [ st]|  [ louis]| 0.4246119733924612| 46.67208725878446|0.007607961543045...|
|  [ louis]|     [ st]| 0.8362445414847162| 46.67208725878445|0.007607961543045...|
|   [ make]|   [ year]|0.10424076607387141|1.3511041827216361|0.007568233284335147|
|   [ year]|   [ make]|0.09809474768280124|1.3511041827216363|0.007568233284335147|
|   [ year]|    [ ago]| 0.0970648815653965| 7.743962389485246|0.007488776766914306|
|    [ ago]|   [ year]| 0.5974643423137876|7.7439623894852465|0.007488776766914306|
|  [ state]|   [ year]|0.13369565217391305|1.7328801549276855|0.007329863732

PREDICTIONS

In [None]:
model.transform(sample).show(truncate=False) # transform checks each basket against all association rules and summarizes the applicable consequents as the prediction

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Text                      