# Project 2: Market-basket analysis - IMDB dataset

Project for the course Algorithms for Massive Datasets <br> Nicolas Facchinetti 961648 <br> Antonio Belotti 960822

# Set up the Spark environment

We start by downloading and installing the tools needed to work with Spark. First we need a Java environment: Spark is written in Scala, so it needs a JVM to run. Then we download Apache Spark 3.1.2 with Hadoop 3.2 from the Apache CDN and uncompress it. Finally we install findspark, which makes the PySpark library shipped with the Spark distribution importable from Python.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark

--2021-09-02 12:42:42--  https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228834641 (218M) [application/x-gzip]
Saving to: ‘spark-3.1.2-bin-hadoop3.2.tgz’


2021-09-02 12:42:44 (163 MB/s) - ‘spark-3.1.2-bin-hadoop3.2.tgz’ saved [228834641/228834641]



The next step is to set the paths in our remote environment so that the installed tools can be found.

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"
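
Before initializing Spark it can help to confirm that both directories actually exist, since a wrong path otherwise only surfaces later as a less readable error. The following is a minimal sketch under that assumption; `missing_dirs` is a hypothetical helper written for this illustration, not part of findspark or PySpark.

```python
import os

def missing_dirs(variables):
    """Return the names of environment variables that are unset or do not point to a directory."""
    missing = []
    for name in variables:
        path = os.environ.get(name)
        if path is None or not os.path.isdir(path):
            missing.append(name)
    return missing

# An empty list means that both locations were found.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```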

Finally we can import PySpark into the project.

In [3]:
import findspark
findspark.init("spark-3.1.2-bin-hadoop3.2")# SPARK_HOME
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Download the dataset from Kaggle

First install the Kaggle Python package, which we use to download the dataset from Kaggle.

In [4]:
!pip install kaggle



Then upload kaggle.json, the file containing your API credentials, so that the Kaggle services can be used.

In [5]:
from google.colab import files

uploaded = files.upload()
  
# Move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
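
A malformed credentials file is a common reason for the download step to fail. The sketch below checks that the file is a JSON object with non-empty "username" and "key" entries, which is the layout of the kaggle.json file generated by Kaggle. `credentials_look_valid` is a hypothetical helper for this illustration, not a function of the Kaggle package.

```python
import json

def credentials_look_valid(path):
    """Check that the file is a JSON object with non-empty 'username' and 'key' entries."""
    try:
        with open(path) as fh:
            creds = json.load(fh)
    except (OSError, ValueError):
        # missing/unreadable file, or content that is not valid JSON
        return False
    return isinstance(creds, dict) and all(creds.get(k) for k in ("username", "key"))
```

It could be called on the expanded path of ~/.kaggle/kaggle.json right after the file has been moved there.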


Now we can download the dataset

In [6]:
!kaggle datasets download 'ashirwadsangwan/imdb-dataset'

Downloading imdb-dataset.zip to /content
 99% 1.43G/1.44G [00:09<00:00, 168MB/s]
100% 1.44G/1.44G [00:09<00:00, 169MB/s]


We now must unzip the compressed archive to use it. Once done we can also remove it.

In [7]:
!unzip imdb-dataset.zip && rm imdb-dataset.zip

Archive:  imdb-dataset.zip
  inflating: name.basics.tsv.gz      
  inflating: name.basics.tsv/name.basics.tsv  
  inflating: title.akas.tsv.gz       
  inflating: title.akas.tsv/title.akas.tsv  
  inflating: title.basics.tsv.gz     
  inflating: title.basics.tsv/title.basics.tsv  
  inflating: title.principals.tsv.gz  
  inflating: title.principals.tsv/title.principals.tsv  
  inflating: title.ratings.tsv.gz    
  inflating: title.ratings.tsv/title.ratings.tsv  


# Prepare the data for Spark

We can load the downloaded and extracted .tsv files directly into a Spark DataFrame with read.csv(), using the tab character as separator. Right after loading we select only the columns we are interested in.

In [8]:
df_principals = spark.read.csv("/content/title.principals.tsv/title.principals.tsv", sep=r'\t', header=True).select('tconst','nconst','category')

In [None]:
df_principals.show(10)

+---------+---------+---------------+
|   tconst|   nconst|       category|
+---------+---------+---------------+
|tt0000001|nm1588970|           self|
|tt0000001|nm0005690|       director|
|tt0000001|nm0374658|cinematographer|
|tt0000002|nm0721526|       director|
|tt0000002|nm1335271|       composer|
|tt0000003|nm0721526|       director|
|tt0000003|nm5442194|       producer|
|tt0000003|nm1335271|       composer|
|tt0000003|nm5442200|         editor|
|tt0000004|nm0721526|       director|
+---------+---------+---------------+
only showing top 10 rows
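
To make the load-and-select step concrete without a Spark session, here is a small sketch that does the same on an in-memory sample with the standard csv module. The two sample rows reuse identifiers from title.principals.tsv; the values of the columns that are not selected (ordering, job, characters) are illustrative assumptions.

```python
import csv
import io

# Two sample rows in the tab-separated layout of title.principals.tsv (header line first).
# "\\N" is the marker the IMDb files use for missing values.
sample = (
    "tconst\tordering\tnconst\tcategory\tjob\tcharacters\n"
    "tt0000001\t1\tnm1588970\tself\t\\N\t\\N\n"
    "tt0000001\t2\tnm0005690\tdirector\t\\N\t\\N\n"
)

# QUOTE_NONE keeps quote characters inside fields as plain text.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
wanted = ("tconst", "nconst", "category")
rows = [tuple(record[col] for col in wanted) for record in reader]
print(rows)
```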

In [9]:
df_basics = spark.read.csv("/content/title.basics.tsv/title.basics.tsv", sep=r'\t', header=True).select('tconst','titleType')

In [None]:
df_basics.show(10)

By inspecting the content of the 'category' column of df_principals we can see that there are many jobs other than actor and actress (the two we are interested in).

In [None]:
df_principals.select("category").distinct().show()

+-------------------+
|           category|
+-------------------+
|            actress|
|           producer|
|             writer|
|           composer|
|           director|
|               self|
|              actor|
|             editor|
|    cinematographer|
|      archive_sound|
|production_designer|
|    archive_footage|
+-------------------+



Similarly, we can inspect the 'titleType' column of df_basics to see which types a title can have.

In [None]:
df_basics.select("titleType").distinct().show()

+------------+
|   titleType|
+------------+
|    tvSeries|
|tvMiniSeries|
|     tvMovie|
|   tvEpisode|
|       movie|
|   tvSpecial|
|       video|
|   videoGame|
|     tvShort|
|       short|
+------------+



Once the data is loaded in a Spark DataFrame we can use the PySpark SQL module to process it. We start by extracting only actors and actresses from df_principals.

In [10]:
pre = df_principals.count()
df_principals.createOrReplaceTempView("PRINCIPALS") # create a temporary table on DataFrame
df_principals = spark.sql("SELECT * from PRINCIPALS WHERE category ='actor' OR category='actress'")
print("We reduced the number of row from {} to {}".format(pre, df_principals.count()))

We reduced the number of row from 36468817 to 14818798


Then we do the same for movies in df_basics.

In [11]:
pre = df_basics.count()
df_basics.createOrReplaceTempView("BASICS") # create a temporary table on DataFrame
df_basics = spark.sql("SELECT * from BASICS WHERE titleType ='movie'")
print("We reduced the number of row from {} to {}".format(pre, df_basics.count()))

We reduced the number of row from 6321302 to 536034


We now have two DataFrames, one containing only the movies and the other only the people who appear as actor or actress in a title. To do the desired market-basket analysis we have to reshape the data so that each row stands for one tconst (title identifier) and contains the list of nconst identifiers of the actors who played in it.

In [None]:
df_basics.show(10)

+---------+---------+
|   tconst|titleType|
+---------+---------+
|tt0000009|    movie|
|tt0000147|    movie|
|tt0000335|    movie|
|tt0000502|    movie|
|tt0000574|    movie|
|tt0000615|    movie|
|tt0000630|    movie|
|tt0000675|    movie|
|tt0000676|    movie|
|tt0000679|    movie|
+---------+---------+
only showing top 10 rows



In [None]:
df_principals.show(10)

+---------+---------+--------+
|   tconst|   nconst|category|
+---------+---------+--------+
|tt0000005|nm0443482|   actor|
|tt0000005|nm0653042|   actor|
|tt0000007|nm0179163|   actor|
|tt0000007|nm0183947|   actor|
|tt0000008|nm0653028|   actor|
|tt0000009|nm0063086| actress|
|tt0000009|nm0183823|   actor|
|tt0000009|nm1309758|   actor|
|tt0000011|nm3692297|   actor|
|tt0000014|nm0166380|   actor|
+---------+---------+--------+
only showing top 10 rows



So we start by joining the two DataFrames to extract from df_principals only the records whose tconst refers to a movie. We can also discard the category column since it is no longer useful.

In [12]:
basket_data = df_principals.join(df_basics, "tconst").select(df_principals.tconst, df_principals.nconst).sort("tconst")

In [None]:
basket_data.show(10)

+---------+---------+
|   tconst|   nconst|
+---------+---------+
|tt0000009|nm0183823|
|tt0000009|nm1309758|
|tt0000009|nm0063086|
|tt0000335|nm0675239|
|tt0000335|nm1010955|
|tt0000335|nm0675260|
|tt0000335|nm1012612|
|tt0000335|nm1012621|
|tt0000335|nm1011210|
|tt0000502|nm0215752|
+---------+---------+
only showing top 10 rows



Then we remove any duplicated rows and aggregate the data by the tconst identifier.

In [13]:
from pyspark.sql import functions as F
basket_data = basket_data.dropDuplicates()
basket_data = basket_data.groupBy("tconst").agg(F.collect_list("nconst").alias("nconsts")).sort('tconst')

In [None]:
print("There are {} titleId buckets".format(basket_data.count()))
basket_data.show(10, False)

There are 393656 titleId buckets
+---------+------------------------------------------------------------------+
|tconst   |nconsts                                                           |
+---------+------------------------------------------------------------------+
|tt0000009|[nm0063086, nm0183823, nm1309758]                                 |
|tt0000335|[nm1010955, nm1012612, nm1011210, nm1012621, nm0675239, nm0675260]|
|tt0000502|[nm0215752, nm0252720]                                            |
|tt0000574|[nm0846887, nm0846894, nm3002376, nm0170118]                      |
|tt0000615|[nm3071427, nm0581353, nm0888988, nm0240418, nm0346387, nm0218953]|
|tt0000630|[nm0624446]                                                       |
|tt0000676|[nm0097421, nm0140054]                                            |
|tt0000679|[nm0000875, nm0122665, nm0933446, nm2924919]                      |
|tt0000793|[nm0691995]                                                       |
|tt0000862|[nm52893
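
The join, de-duplication and grouping steps can be summarized in plain Python on a handful of rows. This is only a sketch of the logic, not of how Spark executes it; `build_baskets` is a hypothetical helper and the toy rows reuse identifiers shown earlier.

```python
from collections import defaultdict

def build_baskets(principals, movie_ids):
    """Group actor ids by title, keeping only titles in movie_ids and dropping duplicate rows."""
    baskets = defaultdict(list)
    seen = set()
    for tconst, nconst in principals:
        if tconst in movie_ids and (tconst, nconst) not in seen:
            seen.add((tconst, nconst))
            baskets[tconst].append(nconst)
    return dict(baskets)

principals = [
    ("tt0000005", "nm0443482"),  # not in the movie set, dropped by the "join"
    ("tt0000009", "nm0063086"),
    ("tt0000009", "nm0183823"),
    ("tt0000009", "nm0183823"),  # duplicate row, kept once
    ("tt0000009", "nm1309758"),
]
baskets = build_baskets(principals, movie_ids={"tt0000009"})
print(baskets)
```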

As we can see above, the data is now in the correct format for our analysis: each row holds the identifier of a movie and, in the second column, the list of identifiers of the actors who played in it.
Since we have done all the needed pre-processing, we can convert our DataFrame into an RDD to apply map-reduce functions.

In [37]:
data = basket_data.rdd
data.take(5)

[Row(tconst='tt0000009', nconsts=['nm0063086', 'nm0183823', 'nm1309758']),
 Row(tconst='tt0000335', nconsts=['nm1010955', 'nm1012612', 'nm1011210', 'nm1012621', 'nm0675239', 'nm0675260']),
 Row(tconst='tt0000502', nconsts=['nm0215752', 'nm0252720']),
 Row(tconst='tt0000574', nconsts=['nm0846887', 'nm0846894', 'nm3002376', 'nm0170118']),
 Row(tconst='tt0000615', nconsts=['nm3071427', 'nm0581353', 'nm0888988', 'nm0240418', 'nm0346387', 'nm0218953'])]

We serialize the baskets to file so that they can be downloaded and reloaded later, skipping the pre-processing on every run.



In [45]:
basket_data.write.format('json').save("data.json")

In [15]:
data.coalesce(1).saveAsTextFile("myrdd")
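
Spark's JSON writer creates a directory (here named data.json) containing part files with one JSON object per line, whereas saveAsTextFile stores the string representation of each Row, which is harder to parse back. The sketch below reads such a JSON directory without Spark. The simulated part file name is an assumption, since real part files carry longer generated names, and `read_json_lines_dir` is a hypothetical helper.

```python
import glob
import json
import os
import tempfile

def read_json_lines_dir(directory):
    """Read every part file of a Spark-style JSON output directory into (tconst, nconsts) tuples."""
    baskets = []
    for part in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(part) as fh:
            for line in fh:
                if line.strip():
                    record = json.loads(line)
                    baskets.append((record["tconst"], record["nconsts"]))
    return baskets

# Simulate one part file holding two records.
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "part-00000.json"), "w") as fh:
    fh.write(json.dumps({"tconst": "tt0000502", "nconsts": ["nm0215752", "nm0252720"]}) + "\n")
    fh.write(json.dumps({"tconst": "tt0000630", "nconsts": ["nm0624446"]}) + "\n")

print(read_json_lines_dir(out_dir))
```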

# APRIORI MAP-REDUCE

In [51]:
data = spark.read.format("json").load("data.json").sort("tconst")  # the "header" option only applies to CSV

In [52]:
data.show(5, False)

+------------------------------------------------------------------+---------+
|nconsts                                                           |tconst   |
+------------------------------------------------------------------+---------+
|[nm0063086, nm0183823, nm1309758]                                 |tt0000009|
|[nm1010955, nm1012612, nm1011210, nm1012621, nm0675239, nm0675260]|tt0000335|
|[nm0215752, nm0252720]                                            |tt0000502|
|[nm0846887, nm0846894, nm3002376, nm0170118]                      |tt0000574|
|[nm3071427, nm0581353, nm0888988, nm0240418, nm0346387, nm0218953]|tt0000615|
+------------------------------------------------------------------+---------+
only showing top 5 rows



In [44]:
data = spark.read.csv("data")
data.show(5, False)

+----------------------+---------------------+--------------+--------------+
|_c0                   |_c1                  |_c2           |_c3           |
+----------------------+---------------------+--------------+--------------+
|Row(tconst='tt0000009'| nconsts=['nm0063086'| 'nm0183823'  | 'nm1309758'])|
|Row(tconst='tt0000335'| nconsts=['nm1010955'| 'nm1012612'  | 'nm1011210'  |
|Row(tconst='tt0000502'| nconsts=['nm0215752'| 'nm0252720'])|null          |
|Row(tconst='tt0000574'| nconsts=['nm0846887'| 'nm0846894'  | 'nm3002376'  |
|Row(tconst='tt0000615'| nconsts=['nm3071427'| 'nm0581353'  | 'nm0888988'  |
+----------------------+---------------------+--------------+--------------+
only showing top 5 rows



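We access field 1 of each row, since field 0 is the bucket identifier; flatMap is used because it merges the lists of all rows into a single flat collection. Note that the output below shows 'o' repeated: in this run data apparently held the raw text lines written by saveAsTextFile, so index 1 selects the second character of "Row(...)" rather than the list of actors.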

In [32]:
data.flatMap(lambda row: row[1]).take(10)

['o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'o']

We map every actor identifier found to a pair made of the identifier itself and 1.

In [None]:
data.flatMap(lambda row: (row[1])) \
  .map(lambda elem: (elem,1)).take(10)

[('nm0063086', 1),
 ('nm0183823', 1),
 ('nm1309758', 1),
 ('nm1010955', 1),
 ('nm1012612', 1),
 ('nm1011210', 1),
 ('nm1012621', 1),
 ('nm0675239', 1),
 ('nm0675260', 1),
 ('nm0215752', 1)]

We add a reduce step that sums the counters associated with each actor.

In [None]:
data.flatMap(lambda row: (row[1])).map(lambda elem: (elem,1)).reduceByKey(lambda a,b: a+b).collect()

[('nm0209738', 7),
 ('nm1268667', 2),
 ('nm0664342', 11),
 ('nm0509449', 6),
 ('nm8425234', 3),
 ('nm0392015', 4),
 ('nm0894679', 4),
 ('nm1211519', 1),
 ('nm0098310', 3),
 ('nm0042422', 23),
 ('nm0148791', 1),
 ('nm0027244', 6),
 ('nm0926330', 1),
 ('nm0718835', 9),
 ('nm0143212', 2),
 ('nm0625346', 4),
 ('nm0172418', 7),
 ('nm0048408', 6),
 ('nm0544842', 1),
 ('nm0272538', 2),
 ('nm0737139', 2),
 ('nm0500347', 1),
 ('nm0051019', 3),
 ('nm0445736', 2),
 ('nm0803103', 7),
 ('nm0336572', 1),
 ('nm0883927', 1),
 ('nm0790966', 2),
 ('nm0675729', 2),
 ('nm0187847', 1),
 ('nm0422012', 2),
 ('nm0347345', 6),
 ('nm1380139', 1),
 ('nm0101182', 1),
 ('nm0607058', 7),
 ('nm0526398', 21),
 ('nm1260667', 11),
 ('nm0272889', 5),
 ('nm0280636', 1),
 ('nm0820461', 1),
 ('nm0288153', 2),
 ('nm0814028', 1),
 ('nm0342376', 1),
 ('nm0550629', 1),
 ('nm0780616', 1),
 ('nm0852748', 3),
 ('nm2557717', 1),
 ('nm0945427', 218),
 ('nm1024669', 5),
 ('nm0522431', 2),
 ('nm0594324', 16),
 ('nm0412045', 1),
 ('nm

We add a threshold (at least 200 appearances).

In [None]:
data.flatMap(lambda row: (row[1])).map(lambda elem: (elem,1)).reduceByKey(lambda a,b: a+b).filter(lambda x: x[1] >=200).take(10)

[('nm0945427', 218),
 ('nm0764762', 245),
 ('nm0045119', 251),
 ('nm0893449', 232),
 ('nm1894124', 273),
 ('nm0415549', 299),
 ('nm1006879', 239),
 ('nm0793813', 411),
 ('nm0482320', 344),
 ('nm0659173', 214)]
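
For intuition, the same first pass can be written in plain Python on a toy input: flattening the baskets, counting each item and filtering by the threshold. This is a sketch of the logic only; `frequent_singletons` and the toy baskets are made up for the illustration.

```python
from collections import Counter
from itertools import chain

def frequent_singletons(baskets, threshold):
    """Count item occurrences across baskets and keep those reaching the threshold."""
    # chain.from_iterable plays the role of flatMap, Counter that of map + reduceByKey
    counts = Counter(chain.from_iterable(baskets))
    # the dictionary comprehension plays the role of filter
    return {item: n for item, n in counts.items() if n >= threshold}

baskets = [["a", "b", "c"], ["a", "b"], ["a", "d"]]
print(frequent_singletons(baskets, threshold=2))  # {'a': 3, 'b': 2}
```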

In [None]:
res = data.flatMap(lambda row: (row[1])) \
          .map(lambda elem: (elem,1)) \
          .reduceByKey(lambda a,b: a+b) \
          .filter(lambda x: x[1] >=200) \
          .collect()
res

[('nm0945427', 218),
 ('nm0764762', 245),
 ('nm0045119', 251),
 ('nm0893449', 232),
 ('nm1894124', 273),
 ('nm0415549', 299),
 ('nm1006879', 239),
 ('nm0793813', 411),
 ('nm0482320', 344),
 ('nm0659173', 214),
 ('nm0798827', 228),
 ('nm0688093', 243),
 ('nm0474820', 230),
 ('nm0706691', 315),
 ('nm0374974', 284),
 ('nm0004417', 287),
 ('nm0046850', 348),
 ('nm0004467', 210),
 ('nm0222432', 274),
 ('nm0080173', 216),
 ('nm0006369', 276),
 ('nm0154164', 235),
 ('nm0695177', 215),
 ('nm1066548', 228),
 ('nm0623427', 438),
 ('nm0619107', 387),
 ('nm0159159', 205),
 ('nm0619309', 313),
 ('nm0035067', 294),
 ('nm0001000', 303),
 ('nm0000465', 295),
 ('nm0695199', 256),
 ('nm0993695', 213),
 ('nm0417314', 207),
 ('nm0451600', 206),
 ('nm0315553', 234),
 ('nm1066229', 228),
 ('nm0352032', 229),
 ('nm3183374', 261),
 ('nm0006982', 585),
 ('nm0613417', 239),
 ('nm0304262', 323),
 ('nm0648803', 565),
 ('nm1001108', 219),
 ('nm0004660', 213),
 ('nm0419707', 214),
 ('nm0004429', 270),
 ('nm0023173'

Let us now look at the second part of Apriori.

In [None]:
data.take(10)

[Row(tconst='tt0000009', nconsts=['nm0063086', 'nm0183823', 'nm1309758']),
 Row(tconst='tt0000335', nconsts=['nm1010955', 'nm1012612', 'nm1011210', 'nm1012621', 'nm0675239', 'nm0675260']),
 Row(tconst='tt0000502', nconsts=['nm0215752', 'nm0252720']),
 Row(tconst='tt0000574', nconsts=['nm0846887', 'nm0846894', 'nm3002376', 'nm0170118']),
 Row(tconst='tt0000615', nconsts=['nm3071427', 'nm0581353', 'nm0888988', 'nm0240418', 'nm0346387', 'nm0218953']),
 Row(tconst='tt0000630', nconsts=['nm0624446']),
 Row(tconst='tt0000676', nconsts=['nm0097421', 'nm0140054']),
 Row(tconst='tt0000679', nconsts=['nm0000875', 'nm0122665', 'nm0933446', 'nm2924919']),
 Row(tconst='tt0000793', nconsts=['nm0691995']),
 Row(tconst='tt0000862', nconsts=['nm5289318', 'nm5289829', 'nm0264569', 'nm0386036', 'nm0511080', 'nm5188470'])]

We take the first record as a test and extract its first two elements. We write a function that checks whether all the elements of a pair are in a row (it could be done better, but this was simple).

In [None]:
copia = ['nm0063086', 'nm0183823']    # first two actors of the first record

def check_row(row, copia):
  for el in copia:
    if el not in row:
      return False
  return True

data.map(lambda x:x[1]).filter(lambda x: check_row(x,copia)).take(5)


[['nm0063086', 'nm0183823', 'nm1309758']]

Next we generate all the possible pairs from each single row. Comparing the positions (a < b) is a trick to avoid duplicates. We use flatMap directly so that the pairs come out already unpacked.

In [None]:
def generate_candidate(x):
  candidates = []
  for a, elemA in enumerate(x):
    for b, elemB in enumerate(x):
      if a < b:
        candidates.append((elemA, elemB))
  return candidates

data.map(lambda x: x[1]).flatMap(lambda x: generate_candidate(x)).take(10)

[('nm0063086', 'nm0183823'),
 ('nm0063086', 'nm1309758'),
 ('nm0183823', 'nm1309758'),
 ('nm1010955', 'nm1012612'),
 ('nm1010955', 'nm1011210'),
 ('nm1010955', 'nm1012621'),
 ('nm1010955', 'nm0675239'),
 ('nm1010955', 'nm0675260'),
 ('nm1012612', 'nm1011210'),
 ('nm1012612', 'nm1012621')]
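
The nested loop with the a < b test produces exactly the 2-element combinations of the row, in the same order as itertools.combinations from the standard library. The sketch below checks this on the first basket; it is offered as an equivalent formulation, not as the version used in the notebook.

```python
from itertools import combinations

def generate_candidate(x):
    # same nested-loop logic as in the cell above
    candidates = []
    for a, elemA in enumerate(x):
        for b, elemB in enumerate(x):
            if a < b:
                candidates.append((elemA, elemB))
    return candidates

row = ["nm0063086", "nm0183823", "nm1309758"]
pairs = list(combinations(row, 2))
assert generate_candidate(row) == pairs
print(pairs)
```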

We then add a check that the generated pair is among the pairs of interest.

In [None]:
copia = [('nm0063086', 'nm0183823'), ('nm0846894', 'nm3002376')]

def generate_candidate(x):
  candidates = []
  for a, elemA in enumerate(x):
    for b, elemB in enumerate(x):
      if a < b:
        candidates.append((elemA, elemB))
  return candidates

data.map(lambda x: x[1]).flatMap(lambda x: generate_candidate(x)).filter(lambda x: x in copia).take(3)

[('nm0063086', 'nm0183823'), ('nm0846894', 'nm3002376')]

This is the actual map step. Tuples are hashable, so each pair can be used directly as a key.

In [None]:
copia = [('nm0063086', 'nm0183823'), ('nm0846894', 'nm3002376')]

def generate_candidate(x):
  candidates = []
  for a, elemA in enumerate(x):
    for b, elemB in enumerate(x):
      if a < b:
        candidates.append((elemA, elemB))
  return candidates

data.map(lambda x: x[1]).flatMap(lambda x: generate_candidate(x)) \
    .filter(lambda x: x in copia).map(lambda x: (x,1)).take(3)

[(('nm0063086', 'nm0183823'), 1), (('nm0846894', 'nm3002376'), 1)]

We add the reduce step and the threshold check.

In [29]:
copia = [('nm0063086', 'nm0183823'), ('nm0846894', 'nm3002376')]

def generate_candidate(x):
  candidates = []
  for a, elemA in enumerate(x):
    for b, elemB in enumerate(x):
      if a < b:
        candidates.append((elemA, elemB))
  return candidates

data.map(lambda x: x[1]).flatMap(lambda x: generate_candidate(x)) \
    .filter(lambda x: x in copia) \
    .map(lambda x: (x,1)) \
    .reduceByKey(lambda a,b: a+b) \
    .filter(lambda x: x[1] >=1) \
    .take(3)
          

[]
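
One detail worth keeping in mind when pairs are used as keys: the tuples (x, y) and (y, x) are different keys, so the same two actors listed in a different order in two rows would be counted separately. A possible remedy, sketched here in plain Python on made-up baskets, is to sort each basket before generating its pairs; `count_pairs` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

def count_pairs(baskets, threshold):
    """Count co-occurring pairs, using sorted order so that each pair has a single key."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n for pair, n in counts.items() if n >= threshold}

baskets = [["b", "a", "c"], ["a", "b"], ["c", "a"]]
print(count_pairs(baskets, threshold=2))  # {('a', 'b'): 2, ('a', 'c'): 2}
```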

In [27]:
from itertools import combinations

def apriori(data, support_threshold):
    """Two-pass Apriori on an RDD of (tconst, nconsts) rows; returns the frequent pairs with their counts."""
    # Pass 1: count singletons in a map-reduce manner and keep the frequent ones
    singletons_count = data.flatMap(lambda row: row[1]) \
          .map(lambda elem: (elem, 1)) \
          .reduceByKey(lambda a, b: a + b) \
          .filter(lambda x: x[1] >= support_threshold) \
          .collect()

    # a set gives constant-time membership tests inside the workers
    frequent_items = set(elem[0] for elem in singletons_count)

    def generate_candidate(x):
        # keep only frequent actors (a pair can be frequent only if both members are)
        # and sort them so that the same pair is never produced in both orders
        items = sorted(set(x) & frequent_items)
        return combinations(items, 2)

    # Pass 2: count the candidate pairs and apply the same support threshold
    couples_count = data.map(lambda x: x[1]) \
      .flatMap(generate_candidate) \
      .map(lambda x: (x, 1)) \
      .reduceByKey(lambda a, b: a + b) \
      .filter(lambda x: x[1] >= support_threshold) \
      .collect()

    return couples_count

In [28]:
rules = apriori(data, 1)
rules

[]

# Demo FP-Growth

In [None]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="nconsts", minSupport=0.0001, minConfidence=0.0001)
model = fpGrowth.fit(basket_data)

In [None]:
# Display frequent itemsets.
model.freqItemsets.show()
items = model.freqItemsets

+--------------------+----+
|               items|freq|
+--------------------+----+
|         [nm1388202]| 153|
|         [nm1246350]|  46|
|         [nm0430646]| 120|
|[nm0430646, nm000...|  66|
|         [nm1800631]| 104|
|         [nm0260020]|  62|
|         [nm0924307]|  59|
|         [nm0870317]|  45|
|         [nm0011964]|  57|
|         [nm1066974]|  43|
|         [nm0320760]|  92|
|         [nm0575655]|  42|
|         [nm0022081]|  84|
|         [nm0576762]|  78|
|         [nm0463590]|  55|
|         [nm0350524]|  53|
|         [nm0215260]|  41|
|         [nm0457554]|  51|
|         [nm0640307]|  40|
|         [nm0007102]|  73|
+--------------------+----+
only showing top 20 rows



In [None]:
# Display generated association rules.
model.associationRules.show()
rules = model.associationRules

+--------------------+-----------+-------------------+------------------+--------------------+
|          antecedent| consequent|         confidence|              lift|             support|
+--------------------+-----------+-------------------+------------------+--------------------+
|[nm1467390, nm062...|[nm0006982]| 0.7333333333333333| 493.4719088319088|1.117727152640884...|
|         [nm1720239]|[nm0628736]|          0.3359375| 905.7795376712329|1.092324262808137...|
|         [nm1720239]|[nm1908630]|              0.375| 1190.491935483871|1.219338711971873...|
|         [nm0103977]|[nm0019382]|0.10150375939849623|175.25247328848437|2.057634076452537...|
|         [nm0103977]|[nm0004469]|0.06892230576441102|150.73155109997214|1.397158940801105...|
|         [nm0103977]|[nm0080238]|0.06140350877192982| 137.3401116427432|1.244741601804621...|
|         [nm0103977]|[nm0707399]|0.06766917293233082|163.42561926288113|1.371756050968358...|
|         [nm0620630]|[nm0707901]| 0.2903225806451
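
The three metrics in the table follow from simple counts: support is the fraction of baskets containing both sides of the rule, confidence divides the joint count by the antecedent count, and lift divides the confidence by the support of the consequent. The sketch below computes them in plain Python. The counts are hypothetical values chosen to be consistent with the second rule displayed above, not numbers read from the model.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift) of a rule antecedent -> consequent from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift

# Assumed counts: 393656 baskets in total, 128 with the antecedent,
# 146 with the consequent and 43 with both.
support, confidence, lift = rule_metrics(
    n_total=393656, n_antecedent=128, n_consequent=146, n_both=43
)
print(support, confidence, lift)
```

With these counts the confidence is 43/128 = 0.3359375, the value shown for that rule.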

In [None]:
# transform checks the input items against all the association rules and summarizes the matching consequents as the prediction
model.transform(basket_data).show()
transformed = model.transform(basket_data)

+---------+--------------------+----------+
|   tconst|             nconsts|prediction|
+---------+--------------------+----------+
|tt0000009|[nm0063086, nm018...|        []|
|tt0000335|[nm1010955, nm101...|        []|
|tt0000502|[nm0215752, nm025...|        []|
|tt0000574|[nm0846887, nm084...|        []|
|tt0000615|[nm3071427, nm058...|        []|
|tt0000630|         [nm0624446]|        []|
|tt0000676|[nm0097421, nm014...|        []|
|tt0000679|[nm0000875, nm012...|        []|
|tt0000793|         [nm0691995]|        []|
|tt0000862|[nm5289318, nm528...|        []|
|tt0000886|         [nm0609814]|        []|
|tt0000891|[nm0727622, nm081...|        []|
|tt0000941|[nm0034453, nm014...|        []|
|tt0000947|[nm0488932, nm081...|        []|
|tt0000992|         [nm0119164]|        []|
|tt0001049|[nm1834127, nm010...|        []|
|tt0001101|         [nm0923594]|        []|
|tt0001112|[nm0135493, nm014...|        []|
|tt0001115|[nm0630641, nm006...|        []|
|tt0001116|[nm0736379, nm006...|
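
The prediction column can be understood as follows: a rule applies to a basket when the basket contains every item of its antecedent, and the consequents of the applicable rules that are not already in the basket form the prediction. Baskets that match no antecedent therefore get an empty list, as in the rows shown above. A plain-Python sketch of this logic, with `predict` as a hypothetical helper and two rules taken from the association rules table:

```python
def predict(basket, rules):
    """Collect the consequents of every rule whose antecedent is fully contained in the basket."""
    items = set(basket)
    prediction = []
    for antecedent, consequent in rules:
        if set(antecedent) <= items:
            for item in consequent:
                # skip items already in the basket or already predicted
                if item not in items and item not in prediction:
                    prediction.append(item)
    return prediction

rules = [(["nm1720239"], ["nm0628736"]), (["nm1720239"], ["nm1908630"])]
print(predict(["nm1720239", "nm0628736"], rules))  # ['nm1908630']
print(predict(["nm0624446"], rules))               # []
```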

# Demo Antonio

In [None]:
import pandas as pd

Let's try loading some data into a pandas DataFrame.

In [None]:
actors_cols = {
    "original": [
        "nconst",  # unique id of the person
        "knownForTitles"  # titles the person is known for
    ],
    "renamed": ["actorId", "titles"]
}

actors_df = pd.read_csv(
    "name.basics.tsv.gz",
    compression="gzip",
    sep='\t',
    usecols=actors_cols["original"]
)

# clean and pre-process
actors_df.columns = actors_cols["renamed"]
actors_df.drop(actors_df[actors_df.titles == "\\N"].index, inplace=True)
actors_df.titles = actors_df.titles.apply(lambda x: x.split(","))

In [None]:
actors_df

Unnamed: 0,actorId,titles
0,nm0000001,"[tt0050419, tt0053137, tt0043044, tt0072308]"
1,nm0000002,"[tt0117057, tt0037382, tt0071877, tt0038355]"
2,nm0000003,"[tt0049189, tt0059956, tt0054452, tt0057345]"
3,nm0000004,"[tt0078723, tt0080455, tt0072562, tt0077975]"
4,nm0000005,"[tt0050986, tt0083922, tt0069467, tt0050976]"
...,...,...
9711011,nm9993708,"[tt9046122, tt8744286]"
9711012,nm9993709,[tt8744286]
9711016,nm9993713,[tt8325250]
9711017,nm9993714,[tt2455546]


In [None]:
def apriori(transactions, support_threshold):
    singleton_counter = []
    lookup_index_table = {}
    reverse_lookup_index_table = {}

    # count singletons
    for bucket in transactions:
        for elem in bucket:
            if elem not in lookup_index_table:
                # The newly discovered element is appended on the tail of the array counter
                lookup_index_table[elem] = len(singleton_counter)
                reverse_lookup_index_table[len(singleton_counter)] = elem
                singleton_counter.append(0)

            idx = lookup_index_table[elem]
            singleton_counter[idx] += 1

    # a set makes the membership test in the pair-counting loop constant time
    frequent_items_table = {i for i, v in enumerate(singleton_counter) if v > support_threshold}

    # count pairs
    pair_counter = {}
    for bucket in transactions:
        frequent_items = [lookup_index_table[item] for item in bucket 
                          if lookup_index_table[item] in frequent_items_table]

        for x in frequent_items:
            for y in frequent_items:
                if x<y:
                    pair_counter[(x,y)] = pair_counter.get((x,y), 0) +1 

    return [list(map(lambda x: reverse_lookup_index_table[x], i)) for i,c in pair_counter.items() 
            if c > support_threshold] 
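
The function relies on the monotonicity of support: a pair can be frequent only if both of its items are, so infrequent items can be dropped before pairs are counted. The toy example below, with made-up transactions, shows how much this pruning shrinks the set of candidate pairs; it is an illustration of the principle rather than a benchmark.

```python
from collections import Counter
from itertools import combinations

transactions = [["a", "b", "c"], ["a", "b", "d"], ["a", "e"], ["b", "f"]]
threshold = 1  # items must occur more than once, matching the strict '>' used above

item_counts = Counter(item for bucket in transactions for item in bucket)
frequent = {item for item, n in item_counts.items() if n > threshold}

# every pair that occurs in some transaction
all_pairs = {p for bucket in transactions for p in combinations(sorted(bucket), 2)}
# pairs built only from frequent items
pruned_pairs = {p for bucket in transactions
                for p in combinations(sorted(i for i in bucket if i in frequent), 2)}

print(len(all_pairs), len(pruned_pairs))  # 7 1
```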

In [None]:
# test
rules = apriori(actors_df.titles, 300)

movies_df = pd.read_csv("title.basics.tsv.gz", compression='gzip', sep='\t')
from IPython.display import display

for x,y in rules:
    display(movies_df.loc[((movies_df.tconst == x) | (movies_df.tconst == y))])


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
117965,tt0120737,movie,The Lord of the Rings: The Fellowship of the Ring,The Lord of the Rings: The Fellowship of the Ring,0,2001,\N,178,"Adventure,Drama,Fantasy"
161829,tt0167260,movie,The Lord of the Rings: The Return of the King,The Lord of the Rings: The Return of the King,0,2003,\N,201,"Adventure,Drama,Fantasy"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
161829,tt0167260,movie,The Lord of the Rings: The Return of the King,The Lord of the Rings: The Return of the King,0,2003,\N,201,"Adventure,Drama,Fantasy"
161830,tt0167261,movie,The Lord of the Rings: The Two Towers,The Lord of the Rings: The Two Towers,0,2002,\N,179,"Adventure,Drama,Fantasy"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
117965,tt0120737,movie,The Lord of the Rings: The Fellowship of the Ring,The Lord of the Rings: The Fellowship of the Ring,0,2001,\N,178,"Adventure,Drama,Fantasy"
161830,tt0167261,movie,The Lord of the Rings: The Two Towers,The Lord of the Rings: The Two Towers,0,2002,\N,179,"Adventure,Drama,Fantasy"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
3674370,tt4154756,movie,Avengers: Infinity War,Avengers: Infinity War,0,2018,\N,149,"Action,Adventure,Sci-Fi"
3674389,tt4154796,movie,Avengers: Endgame,Avengers: Endgame,0,2019,\N,181,"Action,Adventure,Drama"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
806227,tt0831387,movie,Godzilla,Godzilla,0,2014,\N,123,"Action,Adventure,Sci-Fi"
2570231,tt2015381,movie,Guardians of the Galaxy,Guardians of the Galaxy,0,2014,\N,121,"Action,Adventure,Comedy"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
2973307,tt2527336,movie,Star Wars: Episode VIII - The Last Jedi,Star Wars: Episode VIII - The Last Jedi,0,2017,\N,152,"Action,Adventure,Fantasy"
3492744,tt3748528,movie,Rogue One: A Star Wars Story,Rogue One,0,2016,\N,133,"Action,Adventure,Sci-Fi"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
2958034,tt2488496,movie,Star Wars: Episode VII - The Force Awakens,Star Wars: Episode VII - The Force Awakens,0,2015,\N,138,"Action,Adventure,Sci-Fi"
3492744,tt3748528,movie,Rogue One: A Star Wars Story,Rogue One,0,2016,\N,133,"Action,Adventure,Sci-Fi"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
3492744,tt3748528,movie,Rogue One: A Star Wars Story,Rogue One,0,2016,\N,133,"Action,Adventure,Sci-Fi"
3505845,tt3778644,movie,Solo: A Star Wars Story,Solo: A Star Wars Story,0,2018,\N,135,"Action,Adventure,Sci-Fi"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
2973307,tt2527336,movie,Star Wars: Episode VIII - The Last Jedi,Star Wars: Episode VIII - The Last Jedi,0,2017,\N,152,"Action,Adventure,Fantasy"
3505845,tt3778644,movie,Solo: A Star Wars Story,Solo: A Star Wars Story,0,2018,\N,135,"Action,Adventure,Sci-Fi"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
2387413,tt1825683,movie,Black Panther,Black Panther,0,2018,\N,134,"Action,Adventure,Sci-Fi"
3674389,tt4154796,movie,Avengers: Endgame,Avengers: Endgame,0,2019,\N,181,"Action,Adventure,Drama"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
2508312,tt1951265,movie,The Hunger Games: Mockingjay - Part 1,The Hunger Games: Mockingjay - Part 1,0,2014,\N,123,"Action,Adventure,Sci-Fi"
2508313,tt1951266,movie,The Hunger Games: Mockingjay - Part 2,The Hunger Games: Mockingjay - Part 2,0,2015,\N,137,"Action,Adventure,Sci-Fi"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
460904,tt0478970,movie,Ant-Man,Ant-Man,0,2015,\N,117,"Action,Adventure,Comedy"
2570231,tt2015381,movie,Guardians of the Galaxy,Guardians of the Galaxy,0,2014,\N,121,"Action,Adventure,Comedy"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
190776,tt0198093,tvSeries,El comisario,El comisario,0,1999,2009,60,"Action,Adventure,Crime"
284942,tt0297174,tvSeries,Hospital Central,Hospital Central,0,2000,2012,60,"Action,Adventure,Drama"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
116610,tt0119282,movie,Hercules,Hercules,0,1997,\N,93,"Adventure,Animation,Comedy"
220626,tt0230011,movie,Atlantis: The Lost Empire,Atlantis: The Lost Empire,0,2001,\N,95,"Action,Adventure,Animation"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
52506,tt0053494,tvSeries,Coronation Street,Coronation Street,0,1960,\N,30,"Drama,Romance"
66734,tt0068069,tvSeries,Emmerdale,Emmerdale Farm,0,1972,\N,30,"Drama,Romance"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
64045,tt0065323,tvSeries,NFL Monday Night Football,NFL Monday Night Football,0,1970,\N,\N,Sport
869574,tt0896893,tvSeries,NFL on FOX,NFL on FOX,0,1994,\N,\N,Sport


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
273009,tt0284702,tvSeries,Big Brother Uncut,Big Brother Uncut,0,2001,2013,60,Reality-TV
275403,tt0287196,tvSeries,Big Brother: Australia,Big Brother,0,2001,2014,25,"Game-Show,Music,Reality-TV"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
86667,tt0088580,tvSeries,Neighbours,Neighbours,0,1985,\N,22,"Drama,Romance"
106362,tt0108709,tvSeries,Blue Heelers,Blue Heelers,0,1994,2006,60,"Action,Crime,Drama"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
64045,tt0065323,tvSeries,NFL Monday Night Football,NFL Monday Night Football,0,1970,\N,\N,Sport
391224,tt0407424,tvSeries,The NFL on NBC,The NFL on NBC,0,1965,1998,\N,Sport


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
3160174,tt2974350,videoGame,Super Smash Bros. for Wii U,Dairantou sumasshu burazâzu for Wii U,0,2014,\N,\N,"Action,Adventure,Family"
3345498,tt3408266,videoGame,Super Smash Bros. for Nintendo 3DS,Dairantou sumasshu burazâzu for Nintendo 3DS,0,2014,\N,\N,"Action,Family,Fantasy"


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
64045,tt0065323,tvSeries,NFL Monday Night Football,NFL Monday Night Football,0,1970,\N,\N,Sport
391223,tt0407423,tvSeries,The NFL on CBS,The NFL on CBS,0,1956,\N,\N,Sport
