# Project 2: Market-basket analysis - IMDB dataset

Project for the course of Algorithms for Massive Dataset <br> Nicolas Facchinetti 961648 <br> Antonio Belotti 960822

# Set up the Spark enviorment

We start by dowloading and installing all the needed tool to deal with Spark. In particular we are interested in obtainig a Java enviorment since Spark in written in Scala and so it need a JVM to run. Then we can download Apache Spark 3.1.2 with Hadoop 3.2 by the Apache CDN and uncompress it. Finally we can get and install PySpark, an interface for Apache Spark in Python

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xf spark-3.1.2-bin-hadoop3.2.tgz
!rm spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark

--2022-01-21 10:04:23--  https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228834641 (218M) [application/x-gzip]
Saving to: ‘spark-3.1.2-bin-hadoop3.2.tgz’


2022-01-21 10:04:25 (182 MB/s) - ‘spark-3.1.2-bin-hadoop3.2.tgz’ saved [228834641/228834641]



The next step is to correctly set the path in our remote enviorment to use the obtained tools.

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"

Finally we can import PySpark in the project

In [3]:
import findspark
findspark.init("spark-3.1.2-bin-hadoop3.2")# SPARK_HOME
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Load preprocessed dataset from file data.zip

Use the code below do load the dataset from a preprocessed file data.zip

In [4]:
from google.colab import files
import os

uploaded = files.upload()

if os.path.isfile("data.zip"):
  !unzip -q data.zip && rm data.zip
  data = spark.read.format("json").option("header", "true").load("data").select('tconst', 'nconsts').rdd
  data.take(5)
else:
  print("Error in loading the file.")

Saving data.zip to data.zip
replace data/part-00063-717f407c-435f-4def-bca7-0ae425d828a4-c000.json? [y]es, [n]o, [A]ll, [N]one, [r]ename: Y
replace data/.part-00174-717f407c-435f-4def-bca7-0ae425d828a4-c000.json.crc? [y]es, [n]o, [A]ll, [N]one, [r]ename: A


# Download the dataset from Kaggle

First install the Python module of Kaggle to download the dataset from its datacenter

In [None]:
!pip install kaggle

Then load kaggle.json, a file containing your API credentials to be able to use the services offered by Kaggle

In [None]:
from google.colab import files

uploaded = files.upload()
  
# Move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Now we can download the dataset

In [None]:
!kaggle datasets download 'ashirwadsangwan/imdb-dataset'

We now must unzip the compressed archive to use it. Once done we can also remove it.

In [None]:
!unzip imdb-dataset.zip && rm imdb-dataset.zip

# Preapare the data for Spark

We can directly load the downloaded and extracted .tsv file in a Spark DataFrame by using the command read.csv(). We directly pass to the method the columns in which we are interested.

In [None]:
df_principals = spark.read.csv("/content/title.principals.tsv/title.principals.tsv", sep=r'\t', header=True).select('tconst','nconst','category')

In [None]:
df_principals.show(10)

In [None]:
df_basics = spark.read.csv("/content/title.basics.tsv/title.basics.tsv", sep=r'\t', header=True).select('tconst','titleType')

In [None]:
df_basics.show(10)

By inspecting the content of the column 'category' of df_principlas we can see that there are many jobs other than actors and actress (which are the two we are interested in)

In [None]:
df_principals.select("category").distinct().show()

Similarly we can do the same thing with df_basics and the column 'titleType' to see how many categories a title can have.

In [None]:
df_basics.select("titleType").distinct().show()

Once the data is loaded in a Spark DataFrame we can use the PySpark SQL module for processing the data. We start by exctracting only actors and actress from df_principals

In [None]:
pre = df_principals.count()
df_principals.createOrReplaceTempView("PRINCIPALS") # create a temporary table on DataFrame
df_principals = spark.sql("SELECT * from PRINCIPALS WHERE category ='actor' OR category='actress'")
print("We reduced the number of row from {} to {}".format(pre, df_principals.count()))

 And then we do the same thing with movies in df_basics

In [None]:
pre = df_basics.count()
df_basics.createOrReplaceTempView("BASICS") # create a temporary table on DataFrame
df_basics = spark.sql("SELECT * from BASICS WHERE titleType ='movie'")
print("We reduced the number of row from {} to {}".format(pre, df_basics.count()))

We can now see that we have two DataFrame, one containing only the movies and the other only the people which play as actor/actress in a title. To do the desired maket-basket analysis we have to pivot our tconst as rows, so each row stands for one titleId, and then including a list of nconst identifiers of the actors that played in it.

In [None]:
df_basics.show(10)

In [None]:
df_principals.show(10)

So we start by joining the two dataframe to extract from df_principals only the records with tconst related to a movie. We can also discard the category column since is no longer usefull.

In [None]:
basket_data = df_principals.join(df_basics, "tconst").select(df_principals.tconst, df_principals.nconst).sort("tconst")

In [None]:
basket_data.show(10)

Then we can remove hypothetical duplicated row and then aggregate the data using tconst identifier.

In [None]:
from pyspark.sql import functions as F
basket_data = basket_data.dropDuplicates()
basket_data = basket_data.groupBy("tconst").agg(F.collect_list("nconst").alias("nconsts")).sort('tconst')

In [None]:
print("There are {} titleId buckets".format(basket_data.count()))
basket_data.show(10, False)

As we can see above we now have the data in the correct format to do our analysis: in each row we have the identifier of a movie and in the second column the list of the idenfiers of the actors that played in it.
Since we done all the needed pre-processing computation on the data we can transform our DataFrame in a RDD to apply map-reduce functions.

Serialize to file the RDD and download to skip the processing all the time.



In [None]:
basket_data.write.format('json').save("data")

In [None]:
!zip -r data.zip data

In [None]:
from google.colab import files
files.download('data.zip')

# Apriori classic

We start by implementing the classic Apriori algorithm. In particular we search only for tuples and not larger itemsets.

In [39]:
from itertools import tee

def apriori_new(partitionData, support_threshold):
  singleton_counter = {}
  
  d1, d2 = tee(partitionData, 2)

  # count singletons
  for _, bucket in d1:
    for item in bucket:
      singleton_counter[item] = singleton_counter.get(item, 0) + 1

  frequent_singleton = [(k,v) for k,v in singleton_counter.items() if v >= support_threshold]

  # count pairs
  pair_counter = {}
  for _, bucket in d2:
      frequent_items_of_bucket = [item for item in bucket if singleton_counter.get(item, 0) >= support_threshold]
      
      for x in frequent_items_of_bucket:
          for y in frequent_items_of_bucket:
              if x<y:
                  pair_counter[(x,y)] = pair_counter.get((x,y), 0) + 1

  # tuple(sorted(couple)) is done because may a couple be generated backward in a partition
  frequent_couples = [tuple((couple, support)) for couple, support in pair_counter.items() if support >= support_threshold]
  
  return iter(frequent_singleton + frequent_couples)

In [6]:
from itertools import tee

def apriori(partitionData, support_threshold):
  singleton_counter = []
  lookup_index_table = {}
  reverse_lookup_index_table = {}
  
  d1, d2 = tee(partitionData, 2)

  # count singletons
  for bucket in d1:
    for item in bucket[1]:
      if item not in lookup_index_table:
        # The newly discovered element is appended on the tail of the array counter
        lookup_index_table[item] = len(singleton_counter)
        reverse_lookup_index_table[len(singleton_counter)] = item
        singleton_counter.append(0)
      idx = lookup_index_table[item]
      singleton_counter[idx] += 1

  frequent_items_table = [index for index,count in enumerate(singleton_counter) if count >= support_threshold]
  frequent_singleton = [(reverse_lookup_index_table[item], singleton_counter[item]) for item in frequent_items_table]
  # count pairs
  pair_counter = {}
  for bucket in d2:
      frequent_items_of_bucket = [lookup_index_table[item] for item in bucket[1] 
                        if lookup_index_table[item] in frequent_items_table]
      
      for x in frequent_items_of_bucket:
          for y in frequent_items_of_bucket:
              if x<y:
                  pair_counter[(x,y)] = pair_counter.get((x,y), 0) + 1

  # tuple(sorted(couple)) is done because may a couple be generated backward in a partition
  frequent_couples = [(tuple(sorted((reverse_lookup_index_table[couple[0]], reverse_lookup_index_table[couple[1]]))), count) for couple ,count 
                      in pair_counter.items() if count >= support_threshold]
  
  return iter(frequent_singleton + frequent_couples)

# Apriori with MAP-REDUCE

Follow an implementatio of the Apriori algorithm using a map-reduce approach. The logic in the implementation is a bit different than the one provided by the book, in particular in the approach of generating couples. Also in this case we stop our search to tuples of frequent itemsets.

In [7]:
def apriorihmap(data, support_threshold):
    """ 
    data: Pyspark.rdd 
      [
        [tconst, [nconst,]],
      ]
    """
    nconst_rdd = data.map(lambda x: x[1])

    frequent_items_rdd = nconst_rdd.flatMap(lambda x: x) \
          .map(lambda elem: (elem,1)) \
          .reduceByKey(lambda a,b: a+b) \
          .filter(lambda x: x[1] >= support_threshold)

    #print(f"found {frequent_items_rdd.count()} frequent singletons")
    frequent_singletons_bv = spark.sparkContext.broadcast({k[0]:True for k in frequent_items_rdd.collect()})

    def generate_candidate(x):
      candidates = []
      for a in x:
        for b in x:
          if a < b:
            # the tuple may be generated backwards, sort to get rid of the problem
            candidates.append((a,b))
      return candidates
    
    frequent_couples_rdd = data.map(lambda x: x[1]) \
          .filter(lambda x: [elem for elem in x if frequent_singletons_bv.value.get(elem, False)])\
          .flatMap(lambda x: generate_candidate(x)) \
          .map(lambda x: (x,1)) \
          .reduceByKey(lambda a,b: a+b) \
          .filter(lambda x: x[1] >=support_threshold)

    return frequent_items_rdd.union(frequent_couples_rdd)

# SON

We then decided to also implement SON to test out if there is an improvement in time complexity. In the first step of the algorithm we decided to use the classic apriori implementation done before for the map function. We decided to partion the data in a number equal to the avaiable processors in the cluster.

In [8]:
# empirical sweet-spot for the number of partitions (assuming every executor has 4 cores ...)
num_partitions = spark.sparkContext._jsc.sc().getExecutorMemoryStatus().size() * 4
num_partitions

4

We must define a function for the second step to properly count the number of occurrence of frequent itemsets in a partition.

In [9]:
def count_in_partition(data, frequent):
  # prepare data for processing
  frequent = frequent.value   # extract broadcasted values
  data = list(data)           # cast to list to iterate more than one time

  # check foreach frequent itemset
  for frequent_item in frequent:
    # trick to cast single element to list → not remove in the str duplicate char using set()
    if type(frequent_item) is not tuple:
      to_check = [frequent_item]
    else:
      to_check = frequent_item
      
    c = 0     # counter
    # and foreach row of the dataset
    for itemset in data:
      # check if the frequent itemset is subset of the items of the row
      if set(to_check).issubset(itemset[1]):
        c += 1
    yield (frequent_item, c)

In [10]:
def count_in_partition_v2(data, candidate_frequent_itemsets_bv):
  # extract broadcasted values
  candidate_frequent_itemsets = candidate_frequent_itemsets_bv.value

  # check foreach frequent itemset
  for candidate_freq_item in candidate_frequent_itemsets.keys():
    # need candidate_freq_item to be iterable even if it's only a single element
    if type(candidate_freq_item) is not tuple:
      candidate_freq_item = [candidate_freq_item]
      
    c = 0
    for _, bucket in data:
      if set(candidate_freq_item).issubset([x for x in bucket if candidate_frequent_itemsets.get(x,False)]):
        c += 1
    yield (tuple(candidate_freq_item), c)

Then the implementation of SON with a two step map-reduce. The first finds out the frequent itemsets in the partition and the latter go to count them in the dataset and filters out the ones with support greater than threshold.

In [11]:
def son_m_r(data, support):
  reduced_support = support//data.getNumPartitions()
  first_map = data.mapPartitions(lambda partition: apriori(partition, reduced_support)).map(lambda x: (x[0], 1))
  first_reduce = first_map.reduceByKey(lambda a,b: a+b)       # possible to remove a+b ?????????????????

  # extract the frequent items and broadcast them to worker nodes
  frequent_items = [x[0] for x in first_reduce.collect()]
  frequent_items = spark.sparkContext.broadcast(frequent_items)

  second_map = data.mapPartitions(lambda partition: count_in_partition(partition, frequent_items))
  second_reduce = second_map.reduceByKey(lambda a,b: a+b).filter(lambda x: x[1] >= support)
  return second_reduce

In [12]:
def son_m_r_v2(data, support):
  reduced_support = support/data.getNumPartitions()  # mi sa che non si deve arrotondare. va bene se è un float altrimenti poterbbe uscire 0
  candidate_frequent_itemsets_rdd = data.mapPartitions(lambda partition: apriori(partition, reduced_support)).map(lambda x: x[0]).distinct()

  # broadcast the frequent items to worker nodes
  candidate_frequent_itemsets_bv = spark.sparkContext.broadcast(
      {x:True for x in candidate_frequent_itemsets_rdd.collect()}
  )

  second_map = data.mapPartitions(lambda partition: count_in_partition_v2(partition, candidate_frequent_itemsets_bv))
  frequent_itemsets = second_map.reduceByKey(lambda a,b: a+b).filter(lambda x: x[1] >= support)
  return frequent_itemsets

In [59]:
def son_m_r_v3(data, support):
  reduced_support = support//data.getNumPartitions()
  first_map = data.mapPartitions(lambda partition: apriori_new(partition, reduced_support)).map(lambda x: (x[0], 1))
  first_reduce = first_map.reduceByKey(lambda a,b: a+b)       # possible to remove a+b ?????????????????

  # extract the frequent items and broadcast them to worker nodes
  frequent_items = [x[0] for x in first_reduce.collect()]
  frequent_items = spark.sparkContext.broadcast(frequent_items)

  second_map = data.mapPartitions(lambda partition: count_in_partition(partition, frequent_items))
  second_reduce = second_map.reduceByKey(lambda a,b: a+b).filter(lambda x: x[1] >= support)
  return second_reduce

# Demo FP Growth

To carry our experiment we decided to also use the in library implementation of FP-growth as comparison benchmark.

In [13]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="nconsts")

In [14]:
"""
model = fpGrowth.fit(basket_data)

# Display frequent itemsets.
model.freqItemsets.show()
items = model.freqItemsets

# Display generated association rules.
model.associationRules.show()
rules = model.associationRules

# transform examines the input items against all the association rules and summarize the consequents as prediction
model.transform(basket_data).show()
transformed = model.transform(basket_data)
"""

'\nmodel = fpGrowth.fit(basket_data)\n\n# Display frequent itemsets.\nmodel.freqItemsets.show()\nitems = model.freqItemsets\n\n# Display generated association rules.\nmodel.associationRules.show()\nrules = model.associationRules\n\n# transform examines the input items against all the association rules and summarize the consequents as prediction\nmodel.transform(basket_data).show()\ntransformed = model.transform(basket_data)\n'

# Test of the algorithms

We extract a subset of 500 rows from the dataset to test out that our algorithms work as expected. We define min_support as 1% of the count of the rows.

In [15]:
minsup = 0.01
num_rows = 500
sup = minsup*num_rows

minid = data.take(num_rows)
minid = spark.sparkContext.parallelize(minid)
minid.take(5)

[Row(tconst='tt0000009', nconsts=['nm0063086', 'nm0183823', 'nm1309758']),
 Row(tconst='tt0000335', nconsts=['nm1010955', 'nm1012612', 'nm1011210', 'nm1012621', 'nm0675239', 'nm0675260']),
 Row(tconst='tt0000502', nconsts=['nm0215752', 'nm0252720']),
 Row(tconst='tt0000574', nconsts=['nm0846887', 'nm0846894', 'nm3002376', 'nm0170118']),
 Row(tconst='tt0000615', nconsts=['nm3071427', 'nm0581353', 'nm0888988', 'nm0240418', 'nm0346387', 'nm0218953'])]

We start by exectuing the classic implementation of apriori. Is compulsory to before collect the data from the RDD since this is a non distributed implementation.

In [44]:
apriori_classic = list(apriori(minid.collect(), sup))
apriori_classic

[('nm0140054', 10),
 ('nm0243918', 5),
 ('nm0169878', 5),
 ('nm0539049', 5),
 ('nm0003425', 9),
 ('nm0016799', 8),
 ('nm0528022', 7),
 ('nm0679170', 6),
 ('nm0294276', 5),
 ('nm0252476', 7),
 ('nm0526234', 5),
 ('nm0746008', 6),
 ('nm0681933', 10),
 ('nm0330280', 5),
 ('nm0768187', 5),
 ('nm0676473', 7),
 ('nm0908390', 6),
 ('nm0392059', 5),
 ('nm0622772', 5),
 ('nm0577476', 5),
 ('nm0163540', 8),
 ('nm0292407', 10),
 ('nm0926280', 12),
 ('nm0505354', 7),
 ('nm0366008', 6),
 ('nm0885818', 5),
 ('nm0516974', 9),
 ('nm0110838', 6),
 ('nm0691995', 8),
 ('nm0190516', 7),
 ('nm0074186', 6),
 ('nm0642190', 6),
 ('nm0068213', 5),
 ('nm0165691', 6),
 (('nm0140054', 'nm0243918'), 5),
 (('nm0003425', 'nm0016799'), 7),
 (('nm0577476', 'nm0622772'), 5),
 (('nm0292407', 'nm0926280'), 9),
 (('nm0292407', 'nm0505354'), 5),
 (('nm0505354', 'nm0926280'), 5),
 (('nm0292407', 'nm0642190'), 5),
 (('nm0642190', 'nm0926280'), 5)]

In [56]:
apriori_2 = list(apriori_new(minid.collect(), sup))
apriori_2

[('nm0140054', 10),
 ('nm0243918', 5),
 ('nm0169878', 5),
 ('nm0539049', 5),
 ('nm0003425', 9),
 ('nm0016799', 8),
 ('nm0528022', 7),
 ('nm0679170', 6),
 ('nm0294276', 5),
 ('nm0252476', 7),
 ('nm0526234', 5),
 ('nm0746008', 6),
 ('nm0681933', 10),
 ('nm0330280', 5),
 ('nm0768187', 5),
 ('nm0676473', 7),
 ('nm0908390', 6),
 ('nm0392059', 5),
 ('nm0622772', 5),
 ('nm0577476', 5),
 ('nm0163540', 8),
 ('nm0292407', 10),
 ('nm0926280', 12),
 ('nm0505354', 7),
 ('nm0366008', 6),
 ('nm0885818', 5),
 ('nm0516974', 9),
 ('nm0110838', 6),
 ('nm0691995', 8),
 ('nm0190516', 7),
 ('nm0074186', 6),
 ('nm0642190', 6),
 ('nm0068213', 5),
 ('nm0165691', 6),
 (('nm0140054', 'nm0243918'), 5),
 (('nm0003425', 'nm0016799'), 7),
 (('nm0577476', 'nm0622772'), 5),
 (('nm0292407', 'nm0926280'), 9),
 (('nm0292407', 'nm0505354'), 5),
 (('nm0505354', 'nm0926280'), 5),
 (('nm0292407', 'nm0642190'), 5),
 (('nm0642190', 'nm0926280'), 5)]

The we have the Apriori implementation with map-reduce

In [57]:
apriori_map = apriorihmap(minid, sup).collect()

Follow the implementation with SON. The data must be repartioned on the finded sweet-spot number of partitions

In [21]:
import time

In [22]:
minid = minid.repartition(num_partitions)
b = time.time()
son = son_m_r(minid,sup).collect()
print(time.time()-b)

6.215772390365601


In [23]:
b = time.time()
son_v2 = son_m_r_v2(minid, sup).collect()
print(time.time()-b)

0.8593125343322754


In [58]:
son_v2

[(('nm0539049',), 5)]

In [60]:
son_v3 = son_m_r_v3(minid, sup).collect()
son_v3

[('nm0539049', 5),
 ('nm0294276', 5),
 ('nm0526234', 5),
 ('nm0577476', 5),
 ('nm0505354', 7),
 (('nm0577476', 'nm0622772'), 5),
 (('nm0292407', 'nm0505354'), 5),
 ('nm0068213', 5),
 ('nm0165691', 6),
 ('nm0003425', 9),
 ('nm0252476', 7),
 ('nm0926280', 12),
 (('nm0505354', 'nm0926280'), 5),
 ('nm0691995', 8),
 ('nm0190516', 7),
 ('nm0243918', 5),
 ('nm0679170', 6),
 ('nm0746008', 6),
 ('nm0681933', 10),
 ('nm0768187', 5),
 ('nm0908390', 6),
 ('nm0392059', 5),
 ('nm0622772', 5),
 ('nm0292407', 10),
 ('nm0885818', 5),
 ('nm0516974', 9),
 ('nm0110838', 6),
 (('nm0003425', 'nm0016799'), 7),
 ('nm0074186', 6),
 ('nm0642190', 6),
 (('nm0292407', 'nm0642190'), 5),
 ('nm0140054', 10),
 ('nm0169878', 5),
 ('nm0016799', 8),
 ('nm0528022', 7),
 ('nm0330280', 5),
 ('nm0676473', 7),
 ('nm0163540', 8),
 ('nm0366008', 6),
 (('nm0140054', 'nm0243918'), 5),
 (('nm0292407', 'nm0926280'), 9),
 (('nm0642190', 'nm0926280'), 5)]

Then we also train the in-library implementation of FPGrowth

In [24]:
# initialize
fpGrowth.setMinSupport(minsup)
model = fpGrowth.fit(minid.toDF())

# get itemsets
fp_growth = model.freqItemsets.collect()

In [25]:
import pandas as pd

def trasform_format(data):
  strings = []
  tuples = []
  for d in data:
    if len(d[0]) == 1:
      strings.append((d[0][0], d[1]))
    else:
      tuples.append((tuple(sorted(d[0])),d[1]))
  return strings + tuples

# keep only singleton and tuples in fpgrowth result and trasform the format of results
fp_growth = [item for item in fp_growth if len(item[0]) <= 2]
fp_growth = trasform_format(fp_growth)

Let's put the obtained result in a tabular way.

In [61]:
df1 = pd.DataFrame([x[1] for x in apriori_classic], index=[x[0] for x in apriori_classic], columns =['Apriori classic'])
df2 = pd.DataFrame([x[1] for x in apriori_2], index=[x[0] for x in apriori_2], columns =['Apriori 2'])
df3 = pd.DataFrame([x[1] for x in apriori_map], index=[x[0] for x in apriori_map], columns =['Apriori map'])
df4 = pd.DataFrame([x[1] for x in son], index=[x[0] for x in son], columns =['SON'])
df5 = pd.DataFrame([x[1] for x in son_v3], index=[x[0] for x in son_v3], columns =['SON 3'])
df6 = pd.DataFrame([x[1] for x in fp_growth], index=[x[0] for x in fp_growth], columns =['FPGrowth'])

df = pd.concat([df1, df2, df3, df4, df5, df6], axis=1)
df["Equal"] = df.nunique(axis = 1, dropna=False) == 1       
df

Unnamed: 0,Apriori classic,Apriori 2,Apriori map,SON,SON 3,FPGrowth,Equal
nm0140054,10,10,10,10,10,10,True
nm0243918,5,5,5,5,5,5,True
nm0169878,5,5,5,5,5,5,True
nm0539049,5,5,5,5,5,5,True
nm0003425,9,9,9,9,9,9,True
nm0016799,8,8,8,8,8,8,True
nm0528022,7,7,7,7,7,7,True
nm0679170,6,6,6,6,6,6,True
nm0294276,5,5,5,5,5,5,True
nm0252476,7,7,7,7,7,7,True
