# Project 2: Market-basket analysis - IMDB dataset

Project for the course of Algorithms for Massive Dataset <br> Nicolas Facchinetti 961648 <br> Antonio Belotti 960822

# Set up the Spark enviorment

We start by dowloading and installing all the needed tool to deal with Spark. In particular we are interested in obtainig a Java enviorment since Spark in written in Scala and so it need a JVM to run. Then we can download Apache Spark 3.1.2 with Hadoop 3.2 by the Apache CDN and uncompress it. Finally we can get and install PySpark, an interface for Apache Spark in Python

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xf spark-3.1.2-bin-hadoop3.2.tgz
!rm spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark

--2022-02-03 14:04:37--  https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228834641 (218M) [application/x-gzip]
Saving to: ‘spark-3.1.2-bin-hadoop3.2.tgz’


2022-02-03 14:04:38 (195 MB/s) - ‘spark-3.1.2-bin-hadoop3.2.tgz’ saved [228834641/228834641]



The next step is to correctly set the path in our remote enviorment to use the obtained tools.

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"

Finally we can import PySpark in the project

In [3]:
import findspark
findspark.init("spark-3.1.2-bin-hadoop3.2")# SPARK_HOME
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Load preprocessed dataset from file data.zip

Use the code below do load the dataset from a preprocessed file data.zip

In [None]:
from google.colab import files
import os

uploaded = files.upload()

if os.path.isfile("data.zip"):
  !unzip -q data.zip && rm data.zip
  basket_data = spark.read.format("json").option("header", "true").load("data").select('tconst', 'nconsts').rdd
  basket_data.take(5)
else:
  print("Error in loading the file.")

Saving data.zip to data.zip


# Download the dataset from Kaggle

First install the Python module of Kaggle to download the dataset from its datacenter

In [4]:
!pip install kaggle



Then load kaggle.json, a file containing your API credentials to be able to use the services offered by Kaggle

In [5]:
from google.colab import files

uploaded = files.upload()
  
# Move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


Now we can download the dataset

In [6]:
!kaggle datasets download 'ashirwadsangwan/imdb-dataset'

Downloading imdb-dataset.zip to /content
 99% 1.42G/1.44G [00:17<00:00, 89.4MB/s]
100% 1.44G/1.44G [00:17<00:00, 88.1MB/s]


We now must unzip the compressed archive to use it. Once done we can also remove it.

In [7]:
!unzip imdb-dataset.zip && rm imdb-dataset.zip

Archive:  imdb-dataset.zip
  inflating: name.basics.tsv.gz      
  inflating: name.basics.tsv/name.basics.tsv  
  inflating: title.akas.tsv.gz       
  inflating: title.akas.tsv/title.akas.tsv  
  inflating: title.basics.tsv.gz     
  inflating: title.basics.tsv/title.basics.tsv  
  inflating: title.principals.tsv.gz  
  inflating: title.principals.tsv/title.principals.tsv  
  inflating: title.ratings.tsv.gz    
  inflating: title.ratings.tsv/title.ratings.tsv  


# Preapare the data for Spark

We can directly load the downloaded and extracted .tsv file in a Spark DataFrame by using the command read.csv(). We directly pass to the method the columns in which we are interested.

In [8]:
df_principals = spark.read.csv("/content/title.principals.tsv/title.principals.tsv", sep=r'\t', header=True).select('tconst','nconst','category')
df_principals.show(10)

+---------+---------+---------------+
|   tconst|   nconst|       category|
+---------+---------+---------------+
|tt0000001|nm1588970|           self|
|tt0000001|nm0005690|       director|
|tt0000001|nm0374658|cinematographer|
|tt0000002|nm0721526|       director|
|tt0000002|nm1335271|       composer|
|tt0000003|nm0721526|       director|
|tt0000003|nm5442194|       producer|
|tt0000003|nm1335271|       composer|
|tt0000003|nm5442200|         editor|
|tt0000004|nm0721526|       director|
+---------+---------+---------------+
only showing top 10 rows



In [9]:
df_basics = spark.read.csv("/content/title.basics.tsv/title.basics.tsv", sep=r'\t', header=True).select('tconst','titleType')
df_basics.show(10)

+---------+---------+
|   tconst|titleType|
+---------+---------+
|tt0000001|    short|
|tt0000002|    short|
|tt0000003|    short|
|tt0000004|    short|
|tt0000005|    short|
|tt0000006|    short|
|tt0000007|    short|
|tt0000008|    short|
|tt0000009|    movie|
|tt0000010|    short|
+---------+---------+
only showing top 10 rows



By inspecting the content of the column 'category' of df_principlas we can see that there are many jobs other than actors and actress (which are the two we are interested in)

In [10]:
df_principals.select("category").distinct().show()

+-------------------+
|           category|
+-------------------+
|            actress|
|           producer|
|             writer|
|           composer|
|           director|
|               self|
|              actor|
|             editor|
|    cinematographer|
|      archive_sound|
|production_designer|
|    archive_footage|
+-------------------+



Similarly we can do the same thing with df_basics and the column 'titleType' to see how many categories a title can have.

In [11]:
df_basics.select("titleType").distinct().show()

+------------+
|   titleType|
+------------+
|    tvSeries|
|tvMiniSeries|
|     tvMovie|
|   tvEpisode|
|       movie|
|   tvSpecial|
|       video|
|   videoGame|
|     tvShort|
|       short|
+------------+



Once the data is loaded in a Spark DataFrame we can use the PySpark SQL module for processing the data. We start by exctracting only actors and actress from df_principals

In [12]:
pre = df_principals.count()
df_principals.createOrReplaceTempView("PRINCIPALS") # create a temporary table on DataFrame
df_principals = spark.sql("SELECT * from PRINCIPALS WHERE category ='actor' OR category='actress'")
print("We reduced the number of row from {} to {}".format(pre, df_principals.count()))

We reduced the number of row from 36468817 to 14818798


 And then we do the same thing with movies in df_basics

In [13]:
pre = df_basics.count()
df_basics.createOrReplaceTempView("BASICS") # create a temporary table on DataFrame
df_basics = spark.sql("SELECT * from BASICS WHERE titleType ='movie'")
print("We reduced the number of row from {} to {}".format(pre, df_basics.count()))

We reduced the number of row from 6321302 to 536034


We can now see that we have two DataFrame, one containing only the movies and the other only the people which play as actor/actress in a title. To do the desired maket-basket analysis we have to pivot our tconst as rows, so each row stands for one titleId, and then including a list of nconst identifiers of the actors that played in it.

In [14]:
df_basics.show(10)

+---------+---------+
|   tconst|titleType|
+---------+---------+
|tt0000009|    movie|
|tt0000147|    movie|
|tt0000335|    movie|
|tt0000502|    movie|
|tt0000574|    movie|
|tt0000615|    movie|
|tt0000630|    movie|
|tt0000675|    movie|
|tt0000676|    movie|
|tt0000679|    movie|
+---------+---------+
only showing top 10 rows



In [15]:
df_principals.show(10)

+---------+---------+--------+
|   tconst|   nconst|category|
+---------+---------+--------+
|tt0000005|nm0443482|   actor|
|tt0000005|nm0653042|   actor|
|tt0000007|nm0179163|   actor|
|tt0000007|nm0183947|   actor|
|tt0000008|nm0653028|   actor|
|tt0000009|nm0063086| actress|
|tt0000009|nm0183823|   actor|
|tt0000009|nm1309758|   actor|
|tt0000011|nm3692297|   actor|
|tt0000014|nm0166380|   actor|
+---------+---------+--------+
only showing top 10 rows



So we start by joining the two dataframe to extract from df_principals only the records with tconst related to a movie. We can also discard the category column since is no longer useful.

In [16]:
basket_data = df_principals.join(df_basics, "tconst").select(df_principals.tconst, df_principals.nconst).sort("tconst")
basket_data.show(10)

+---------+---------+
|   tconst|   nconst|
+---------+---------+
|tt0000009|nm0183823|
|tt0000009|nm1309758|
|tt0000009|nm0063086|
|tt0000335|nm0675239|
|tt0000335|nm1010955|
|tt0000335|nm0675260|
|tt0000335|nm1012612|
|tt0000335|nm1012621|
|tt0000335|nm1011210|
|tt0000502|nm0215752|
+---------+---------+
only showing top 10 rows



Then we can remove hypothetical duplicated row and then aggregate the data using tconst identifier.

In [17]:
from pyspark.sql import functions as F
basket_data = basket_data.dropDuplicates()
basket_data = basket_data.groupBy("tconst").agg(F.collect_list("nconst").alias("nconsts")).sort('tconst')

In [18]:
print("There are {} titleId buckets".format(basket_data.count()))
basket_data.show(10, False)

There are 393656 titleId buckets
+---------+------------------------------------------------------------------+
|tconst   |nconsts                                                           |
+---------+------------------------------------------------------------------+
|tt0000009|[nm0063086, nm0183823, nm1309758]                                 |
|tt0000335|[nm1010955, nm1012612, nm1011210, nm1012621, nm0675239, nm0675260]|
|tt0000502|[nm0215752, nm0252720]                                            |
|tt0000574|[nm0846887, nm0846894, nm3002376, nm0170118]                      |
|tt0000615|[nm3071427, nm0581353, nm0888988, nm0240418, nm0346387, nm0218953]|
|tt0000630|[nm0624446]                                                       |
|tt0000676|[nm0097421, nm0140054]                                            |
|tt0000679|[nm0000875, nm0122665, nm0933446, nm2924919]                      |
|tt0000793|[nm0691995]                                                       |
|tt0000862|[nm52893

As we can see above we now have the data in the correct format to do our analysis: in each row we have the identifier of a movie and in the second column the list of the idenfiers of the actors that played in it.
Since we have done all the needed pre-processing computation on the data we can transform our DataFrame in a RDD to apply map-reduce functions.

Serialize to file the RDD and download to skip the processing all the time.



In [None]:
basket_data.write.format('json').save("data")

In [None]:
!zip -r data.zip data

In [None]:
from google.colab import files
files.download('data.zip')

# Apriori classic

We start by implementing the classic Apriori algorithm. In particular we search until no more k-itemsets are found.

In [50]:
def generate_candidate_k_set(frequent_kmin1_set, k):
  """
  frequent_kmin1_set: list
  k: int

  return: list

  Take as input an integer k and a frequent k-1 itemset. Return as output the candidate k itemset obtained from frequent_kmin1_set
  """
  if k < 2:
      raise ValueError("k must be >= 2")
  candidates = []
  if k == 2:
      # generate pairs
      for x in frequent_kmin1_set:
          for y in frequent_kmin1_set:
              # to prevent duplicates
              if x < y:
                  candidates.append((x, y))
  else:
    # generate itemsets k>2
    for x in frequent_kmin1_set:
      for y in frequent_kmin1_set:
        """
        Items	x and	y in	k-1 itemsets are	joined	if
        (x[1]=y[1]) ^ (x[2]=y[2]) ^ … ^ (x[k-2]=y[k-2])
        (x[k-1]<y[k-1]) to prevent duplicates
        """
        if x[:k - 2] == y[:k - 2] and x[k - 2] < y[k - 2]:
          candidates.append((*x[:k - 2], x[k - 2], y[k - 2]))
  return candidates

In [51]:
import functools
                    
def apriori_unlimited_k(transactions, support_threshold):
  """
  transactions: list [(key, [elements]), ...]
  support_threshold: int

  return: list [(itemset, support), ...]
  """
  frequent_elements = {}

  # count singletons
  counter = {}
  for _, transaction in transactions:
    for x in transaction:
      # if is not present in counter set return 0 and add 1, otherwise add 1 to entry x
      counter[x] = counter.get(x,0) + 1

  # filter out imtesets with count >= the threshold
  frequent_elements[1] = [(k, v) for k, v in counter.items() if v >= support_threshold]
  
  k = 2
  while True:
    all_candidate_sets = generate_candidate_k_set([e[0] for e in frequent_elements[k - 1]], k)
    if len(all_candidate_sets) == 0:
      # no more candidate set
      break

    counter = {}
    for _, transaction in transactions:
      for candidate_set in all_candidate_sets:
        # check that all the element in the candidate set are in transaction
        if all([candidate_element in transaction for candidate_element in candidate_set]):
          counter[candidate_set] = counter.get(candidate_set, 0) + 1
    # keep only the itemsets with counter >= the threshold
    frequent_elements[k] = [(itemset, occ) for itemset, occ in counter.items() if occ >= support_threshold]
    k += 1
  return functools.reduce(lambda a, b: a + b, frequent_elements.values())

In [52]:
test = [
        ("t1", ["I1", "I3", "I4"]),
        ("t2", ["I2", "I3", "I5"]),
        ("t3", ["I1", "I2", "I3", "I5"]),
        ("t4", ["I2", "I5"]),
  ]

apriori_unlimited_k(test, 2)

[('I1', 2),
 ('I3', 3),
 ('I2', 3),
 ('I5', 3),
 (('I1', 'I3'), 2),
 (('I3', 'I5'), 2),
 (('I2', 'I3'), 2),
 (('I2', 'I5'), 3),
 (('I2', 'I3', 'I5'), 2)]

# Apriori with MAP-REDUCE

Follow an implementatio of the Apriori algorithm using a map-reduce approach.

In [53]:
def filter_candidate(transaction, all_candidate_sets):
  """
  transaction: list
  all_candidate_set: list

  return: list
  return only the candidate set which are in the transaction
  """
  exist = []
  for candidate_set in all_candidate_sets:
    if all([candidate_element in transaction for candidate_element in candidate_set]):
      exist.append(candidate_set)
  return exist

In [62]:
def apriorihmap_unlimited_k(data, support_threshold):
  """ 
  data: Pyspark.rdd 
    [
      [tconst, [nconst,]],
    ]
  support_threshold: int

  return: Pyspark.rdd
  """
  nconst_rdd = data.map(lambda x: x[1])

  # find singletone
  frequent_items_rdd = nconst_rdd.flatMap(lambda x: x) \
        .map(lambda elem: (elem,1)) \
        .reduceByKey(lambda a,b: a+b) \
        .filter(lambda x: x[1] >= support_threshold)

  # to save frequent itemsets for each k iteration
  frequent_elements = frequent_items_rdd.map(lambda x: x[0]).collect()
  k = 2
  # until there are no more frequent itemesets with support >= threshold
  while frequent_elements:
    # generate all the candidate k itemset from frequent_elements (k - 1)
    all_candidate_sets = generate_candidate_k_set(frequent_elements, k) 
    
    frequent_k_rdd = nconst_rdd.flatMap(lambda x: filter_candidate(x, all_candidate_sets)) \
              .map(lambda x: (x,1)) \
              .reduceByKey(lambda a,b: a+b) \
              .filter(lambda x: x[1] >= support_threshold)
    frequent_items_rdd = frequent_items_rdd.union(frequent_k_rdd)

    # add new frequent element
    frequent_elements = frequent_k_rdd.map(lambda x: x[0]).collect()
    k += 1
  return frequent_items_rdd

# SON

We then decided to also implement SON to test out if there is an improvement in time complexity. We decided to partion the data in a number equal to the avaiable processors in the cluster for the sake of experimenta setup; in a real case scenario use the number of nodes in the cluster.

In [72]:
# empirical sweet-spot for the number of partitions (assuming every executor has 4 cores ...)
num_partitions = spark.sparkContext._jsc.sc().getExecutorMemoryStatus().size() * 4
num_partitions

4

We must define a function for the second step to properly count the number of occurrence of frequent itemsets in a partition.

In [73]:
def count_in_partition(data, frequent):
  """
  data: iterable
  frequent:  pyspark.Broadcast

  return: list 
  
  count the occurence of each itemeset in frequent in the partion data
  """
  # prepare data for processing
  frequent = frequent.value   # extract broadcasted values
  data = list(data)           # cast to list to iterate more than one time

  # check foreach frequent itemset
  for frequent_item in frequent:
    # trick to cast single element to list → not remove in the str duplicate char using set()
    if type(frequent_item) is not tuple:
      to_check = [frequent_item]
    else:
      to_check = frequent_item
      
    c = 0     # counter
    # and foreach row of the dataset
    for itemset in data:
      # check if the frequent itemset is subset of the items of the row
      if set(to_check).issubset(itemset[1]):
        c += 1
    yield (frequent_item, c)

In [74]:
def count_in_partition_v2(data, candidate_frequent_itemsets_bv):
  # extract broadcasted values
  candidate_frequent_itemsets = candidate_frequent_itemsets_bv.value

  # check foreach frequent itemset
  for candidate_freq_item in candidate_frequent_itemsets.keys():
    # need candidate_freq_item to be iterable even if it's only a single element
    if type(candidate_freq_item) is not tuple:
      candidate_freq_item = [candidate_freq_item]
      
    c = 0
    for _, bucket in data:
      if set(candidate_freq_item).issubset([x for x in bucket if candidate_frequent_itemsets.get(x,False)]):
        c += 1
    yield (tuple(candidate_freq_item), c)

Then the implementation of SON with a two step map-reduce. The first finds out the frequent itemsets in the partition and the latter go to count them in the dataset and filters out the ones with support greater than threshold.

In [75]:
def son_m_r(data, support):
  """
  data: Pyspark.rdd 
    [
      [tconst, [nconst,]],
    ]
  support: int

  return: Pyspark.rdd
  """
  reduced_support = support//data.getNumPartitions()
  # use apriori on every partition
  first_map = data.mapPartitions(lambda partition: apriori_unlimited_k(partition, reduced_support)).map(lambda x: (x[0], 1))
  # TRYYYYYYYYYYY WITH DISTINCT ON TIME COMPLEXITY
  first_reduce = first_map.reduceByKey(lambda a,b: a+b)       # possible to remove a+b ?????????????????

  
  # extract the frequent itemsets and broadcast them to worker nodes
  frequent_items = [x[0] for x in first_reduce.collect()]
  frequent_items = spark.sparkContext.broadcast(frequent_items)

  second_map = data.mapPartitions(lambda partition: count_in_partition(partition, frequent_items))
  second_reduce = second_map.reduceByKey(lambda a,b: a+b).filter(lambda x: x[1] >= support)
  return second_reduce

In [81]:
def son_m_r_v2(data, support):
  reduced_support = support/data.getNumPartitions()  # mi sa che non si deve arrotondare. va bene se è un float altrimenti poterbbe uscire 0
  candidate_frequent_itemsets_rdd = data.mapPartitions(lambda partition: apriori_unlimited_k(partition, reduced_support)).map(lambda x: x[0]).distinct()

  # broadcast the frequent items to worker nodes
  candidate_frequent_itemsets_bv = spark.sparkContext.broadcast(
      {x:True for x in candidate_frequent_itemsets_rdd.collect()}
  )

  second_map = data.mapPartitions(lambda partition: count_in_partition_v2(partition, candidate_frequent_itemsets_bv))
  frequent_itemsets = second_map.reduceByKey(lambda a,b: a+b).filter(lambda x: x[1] >= support)
  return frequent_itemsets

# Demo FP Growth

To carry our experiment we decided to also use the in library implementation of FP-growth as comparison benchmark.

In [77]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="nconsts")

In [78]:
"""
model = fpGrowth.fit(basket_data)

# Display frequent itemsets.
model.freqItemsets.show()
items = model.freqItemsets

# Display generated association rules.
model.associationRules.show()
rules = model.associationRules

# transform examines the input items against all the association rules and summarize the consequents as prediction
model.transform(basket_data).show()
transformed = model.transform(basket_data)
"""

'\nmodel = fpGrowth.fit(basket_data)\n\n# Display frequent itemsets.\nmodel.freqItemsets.show()\nitems = model.freqItemsets\n\n# Display generated association rules.\nmodel.associationRules.show()\nrules = model.associationRules\n\n# transform examines the input items against all the association rules and summarize the consequents as prediction\nmodel.transform(basket_data).show()\ntransformed = model.transform(basket_data)\n'

# Test of the algorithms

We extract a subset of 500 rows from the dataset to test out that our algorithms work as expected. We define min_support as 1% of the count of the rows.

In [67]:
import pandas as pd
minsup = 0.01
num_rows = 500
sup = minsup*num_rows

minid = basket_data.take(num_rows)
minid = spark.sparkContext.parallelize(minid)
minid.take(5)

[Row(tconst='tt0000009', nconsts=['nm0063086', 'nm0183823', 'nm1309758']),
 Row(tconst='tt0000335', nconsts=['nm1010955', 'nm1012612', 'nm1011210', 'nm1012621', 'nm0675239', 'nm0675260']),
 Row(tconst='tt0000502', nconsts=['nm0215752', 'nm0252720']),
 Row(tconst='tt0000574', nconsts=['nm0846887', 'nm0846894', 'nm3002376', 'nm0170118']),
 Row(tconst='tt0000615', nconsts=['nm3071427', 'nm0581353', 'nm0888988', 'nm0240418', 'nm0346387', 'nm0218953'])]

We start by exectuing the classic implementation of apriori. Is compulsory to  collect the data from the RDD since this is a non distributed implementation.

In [68]:
apriori1 = list(apriori_unlimited_k(minid.collect(), sup))

The we have the Apriori implementation with map-reduce

In [69]:
apriori2 = apriorihmap_unlimited_k(minid, sup).collect()

Follow the implementation with SON. The data must be repartioned on the finded sweet-spot number of partitions

In [79]:
minid = minid.repartition(num_partitions)
son = son_m_r(minid, sup).collect()

In [99]:
son

[('nm0539049', 5),
 ('nm0294276', 5),
 ('nm0526234', 5),
 ('nm0577476', 5),
 ('nm0505354', 7),
 ('nm0068213', 5),
 ('nm0165691', 6),
 ('nm0003425', 9),
 ('nm0252476', 7),
 ('nm0926280', 12),
 ('nm0691995', 8),
 ('nm0190516', 7),
 ('nm0243918', 5),
 ('nm0679170', 6),
 ('nm0746008', 6),
 ('nm0681933', 10),
 ('nm0768187', 5),
 ('nm0908390', 6),
 ('nm0392059', 5),
 ('nm0622772', 5),
 ('nm0292407', 10),
 ('nm0885818', 5),
 ('nm0516974', 9),
 ('nm0110838', 6),
 ('nm0074186', 6),
 ('nm0642190', 6),
 ('nm0140054', 10),
 ('nm0169878', 5),
 ('nm0016799', 8),
 ('nm0528022', 7),
 ('nm0330280', 5),
 ('nm0676473', 7),
 ('nm0163540', 8),
 ('nm0366008', 6)]

In [82]:
son_v2 = son_m_r_v2(minid, sup).collect()

In [83]:
son_v2

[(('nm0539049',), 5)]

Then we also train the in-library implementation of FPGrowth to have a comparison with a correct algorithm

In [96]:
# initialize
fpGrowth.setMinSupport(minsup)
model = fpGrowth.fit(minid.toDF())

# get itemsets
fp_growth = model.freqItemsets.collect()

In [97]:
# trasform the output of FPGrowth same as our implementations
def trasform_format(data):
  strings = []
  tuples = []
  for d in data:
    if len(d[0]) == 1:
      strings.append((d[0][0], d[1]))
    else:
      tuples.append((tuple(sorted(d[0])),d[1]))
  return strings + tuples

# keep only singleton and tuples in fpgrowth result and trasform the format of results
fp_growth = trasform_format(fp_growth)

Let's put the obtained result in a tabular way.

In [98]:
df1 = pd.DataFrame([x[1] for x in apriori1], index=[x[0] for x in apriori1], columns =['Apriori'])
df2 = pd.DataFrame([x[1] for x in apriori2], index=[x[0] for x in apriori2], columns =['Apriori MR'])
df3 = pd.DataFrame([x[1] for x in son], index=[x[0] for x in son], columns =['SON'])
df4 = pd.DataFrame([x[1] for x in fp_growth], index=[x[0] for x in fp_growth], columns =['FPGrowth'])

df = pd.concat([df1, df2, df3, df4], axis=1)
df["Equal"] = df.nunique(axis = 1, dropna=False) == 1       
df

Unnamed: 0,Apriori,Apriori MR,SON,FPGrowth,Equal
nm0140054,10,10,10.0,10,True
nm0691995,8,8,8.0,8,True
nm0243918,5,5,5.0,5,True
nm0169878,5,5,5.0,5,True
nm0539049,5,5,5.0,5,True
nm0330280,5,5,5.0,5,True
nm0768187,5,5,5.0,5,True
nm0528022,7,7,7.0,7,True
nm0681933,10,10,10.0,10,True
nm0003425,9,9,9.0,9,True


# Perfomance comparison