



---

# **MARKET-BASKET ANALYSIS**

## Massive Algorithm - Data Science for Economics

**Angelica Longo, Melissa Rizzi**

The goal of this project is to implement a system for **detecting frequent itemsets**, commonly known as **market-basket analysis**.
In this notebook, the detector treats each user’s reviewed books as a basket, with books serving as items.

The project is based on the **[Amazon Books Review](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews)** dataset, published on Kaggle under the public domain CC0 license. Data is downloaded during the execution of the scripts via an API and contains variables related to users and their reviews of purchased books.

Given the large volume of data (about 3 million rows), a subsample of approximately 500,000 rows is created using **PySpark**, with code that also scales to the full dataset.

The project is structured as follows:

- **Preprocessing** – This phase includes data cleaning, checking data integrity, handling null values, removing duplicates, and computing the overall mean to verify consistency with the selected subsample.
- **Subsampling** – A subset of data is created while maintaining a representative distribution of user choices and ratings.
- **Frequent Itemset Mining** – The final step involves implementing an algorithm to identify frequent itemsets within the dataset.

This structured approach ensures both **efficiency** and **scalability** while maintaining **data integrity**.

## **Table of Contents**
- [1. Data Import](#1-data-import)
- [2. Data PreProcessing](#2-data-preprocessing)
  - [2.1 Data Integrity](#21-data-integrity)
  - [2.2 Missing Data](#22-missing-data)
  - [2.3 Data Duplicates](#23-data-duplicates)
  - [2.4 Rating Means](#24-rating-means)
- [3. Subsample Creation](#3-subsample-creation)
  - [3.1 Sample reliability](#31-sample-reliability)
- [4. Frequent Itemset Mining](#4-frequent-itemset-mining)


---
## **1. Data Import**

In [1]:
#import os
#import zipfile

In [2]:
#os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
#os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"  # never commit a real API key

In [4]:
#!kaggle datasets download -d mohamedbakhet/amazon-books-reviews

In [5]:
#with zipfile.ZipFile("amazon-books-reviews.zip", 'r') as zip_ref:
  #  zip_ref.extractall("amazon_books_data")

---
## **2. Data PreProcessing**

In [76]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, min, max, sum, when, collect_set,count
from pyspark.sql.types import DoubleType
from pyspark.ml.fpm import FPGrowth
import pyspark.sql.functions as F

In [7]:
# Create Spark Session
spark = SparkSession.builder.appName("MapReduce").getOrCreate()

In [8]:
# Import data
data = spark.read.csv("amazon_books_data/Books_rating.csv", header=True, inferSchema=True)
#data.show(5)

In [9]:
# Select only useful columns
df = data.select("Id", "Title", "User_id", "review/score").withColumnRenamed("review/score", "score")
df.show(5)

+----------+--------------------+--------------+-----+
|        Id|               Title|       User_id|score|
+----------+--------------------+--------------+-----+
|1882931173|Its Only Art If I...| AVCGYZL8FQQTD|  4.0|
|0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|  5.0|
|0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|  5.0|
|0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|  4.0|
|0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|  4.0|
+----------+--------------------+--------------+-----+
only showing top 5 rows



### **2.1 Data Integrity**

In [10]:
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- score: string (nullable = true)



In [11]:
# Transform 'score' variable in double type
df = df.withColumn("score", col("score").cast(DoubleType()))
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- score: double (nullable = true)



In [12]:
# Check score range
df.select(min(col("score")).alias("min_score"), max(col("score")).alias("max_score")).show()

+---------+----------+
|min_score| max_score|
+---------+----------+
|      1.0|1.295568E9|
+---------+----------+



In [13]:
# Keep just data with the 'score' values in the correct range [1, 5]
df = df.filter((col("score") >= 1) & (col("score") <= 5))
df.select(min("score").alias("min_score"), max("score").alias("max_score")).show()

+---------+---------+
|min_score|max_score|
+---------+---------+
|      1.0|      5.0|
+---------+---------+



### **2.2 Missing Data**

In [14]:
# Count null values for each variable
null_counts = df.select(
    [sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]
)

#null_counts.show()

What stands out right away is the large number of missing values in the User_id variable. One possible reason is that reviewers who are not registered have no user ID. Our goal is to build baskets of books reviewed by the same user, which is impossible without a user ID. We considered using profile names instead, assigning a dummy ID to users with the same name, but different users can share a name, so the results could be inaccurate. Moreover, there were more missing profile names than missing user IDs, which made this solution unfeasible. We therefore decided to **drop the missing values**, as we could not identify a suitable way to replace them.

In [15]:
# Remove null values
df_clean = df.dropna()
#df_clean.show(5)

In [16]:
# Check data size
n_rows = df.count()
n_rows_clean = df_clean.count()
print(f"Number of Rows - Before cleaning: {n_rows}")
print(f"Number of Rows - After cleaning: {n_rows_clean}")

Number of Rows - Before cleaning: 2981912
Number of Rows - After cleaning: 2420237


### **2.3 Data Duplicates**

In [17]:
# Remove duplicated rows
df_clean = df_clean.dropDuplicates()

n_rows_clean = df_clean.count()
print(f"Number of Rows - After duplicates removal: {n_rows_clean}")

Number of Rows - After duplicates removal: 2383199


Up until now, we’ve performed a general cleaning of the dataset. From here on, we’ll focus exclusively on the three columns that are relevant to our analysis (Id, User_id, and score), forming a new dataset: df_short.

In [18]:
# Remove useless columns
df_short = df_clean.select("Id", "User_id","score")
#df_short.show(5)

In [19]:
# Check and remove duplicated rows for the three considered variables
df_short = df_short.dropDuplicates()

Given that the same user could have rated the same book twice, we want to compute the mean of the different scores given by the same user to the same book.

In [20]:
# Find duplicates considering only 'Id' and 'User_id'
duplicati = df_short.groupBy("Id", "User_id").count().filter("count > 1")

In [21]:
# Compute average score for every (Id, User_id)
score_mean = df_short.groupBy('Id', 'User_id').agg(F.mean('score').alias('mean_score'))
df_final = df_short.join(score_mean, on=['Id', 'User_id'], how='left')

# Creation of the final preprocessed dataset
df_final = df_final.select('Id','User_id', 'mean_score')
df_final = df_final.dropDuplicates()
#df_final.show(5)

In [22]:
n_rows_final = df_final.count()
print(f"Number of Rows - Final dataset: {n_rows_final}")

Number of Rows - Final dataset: 2380155


### **2.4 Rating Means**

We want to calculate the overall average score to see if consistency is maintained after creating the subsample.

- Overall mean score:

In [23]:
df_final = df_final.withColumn("mean_score", F.col("mean_score").cast("double"))

overall_mean = df_final.agg(F.avg("mean_score")).collect()[0][0]
print(f"Overall mean score - Final dataset: {overall_mean}")

Overall mean score - Final dataset: 4.223861961370863


---
## **3. Subsample Creation**

We aim to create a subsample that remains consistent with the original dataset. To achieve this, we select a fraction of users while ensuring that all their reviews are included. This approach allows us to better represent their purchasing behavior and rating patterns, preserving the integrity of the data.
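The idea of sampling users rather than individual rows can be illustrated without Spark. The following is a minimal sketch on made-up data; the function name `sample_by_user` and the toy reviews are assumptions introduced only for illustration.

```python
import random

def sample_by_user(reviews, fraction, seed=42):
    """Keep every review of a randomly chosen fraction of users.

    reviews: list of (user_id, book_id, score) tuples.
    Each user is kept independently with probability `fraction`.
    """
    rng = random.Random(seed)
    users = sorted({user for user, _book, _score in reviews})
    kept_users = {user for user in users if rng.random() < fraction}
    # All rows of a kept user survive, so baskets are never split
    return [row for row in reviews if row[0] in kept_users]

# Hypothetical toy data
reviews = [
    ("u1", "b1", 5.0), ("u1", "b2", 4.0),
    ("u2", "b1", 3.0),
    ("u3", "b2", 5.0), ("u3", "b3", 1.0), ("u3", "b4", 4.0),
]
sample = sample_by_user(reviews, fraction=0.5)
print(sample)
```

Sampling rows directly would instead break baskets apart: a user with three reviews could end up with only one of them in the subsample.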

In [24]:
num_users = df_final.select("User_id").distinct().count()
print(f"Total number of different users - Original dataset: {num_users}")

Total number of different users - Original dataset: 1004214


In [25]:
# Keep just 20% of the users
sample_fraction = 0.2
user_sample = df_final.select("User_id").distinct().sample(fraction=sample_fraction, seed=42)

In [26]:
# Create the subsample with the selected users
df_sampled = df_final.join(user_sample, on="User_id", how="inner")
#df_sampled.show(5)

In [27]:
# Check subsample size
n_rows_sample = df_sampled.count()
print(f"Number of Rows - Sample: {n_rows_sample}")

Number of Rows - Sample: 471573


In [28]:
# Check data integrity
df_sampled.printSchema()

root
 |-- User_id: string (nullable = true)
 |-- Id: string (nullable = true)
 |-- mean_score: double (nullable = true)



### **3.1 Sample reliability**

In [77]:
# Check the average number of books rated per user
avg_rows_original = df_final.groupBy("User_id").count().agg(F.mean("count")).collect()[0][0]
avg_rows_sampled = df_sampled.groupBy("User_id").count().agg(F.mean("count")).collect()[0][0]

print(f"Average books rated per user - Original: {avg_rows_original:.2f}")
print(f"Average books rated per user - Subsample: {avg_rows_sampled:.2f}")

Average books rated per user - Original: 2.37
Average books rated per user - Subsample: 2.34


In [75]:
#Check Score Average and Standard Deviation
df_final.select(F.mean("mean_score"), F.stddev("mean_score")).show()
df_sampled.select(F.mean("mean_score"), F.stddev("mean_score")).show()

+-----------------+------------------+
|  avg(mean_score)|stddev(mean_score)|
+-----------------+------------------+
|4.223861961370863|1.1813900530297332|
+-----------------+------------------+

+-----------------+------------------+
|  avg(mean_score)|stddev(mean_score)|
+-----------------+------------------+
|4.221354205322752|1.1848382781134867|
+-----------------+------------------+



---
## **4. Frequent Itemset Mining**

Before continuing the analysis, we kept only books that received a score of 3 or higher, as low ratings might indicate a lack of engagement with the book.

In [61]:
# Filter and keep just rows with rating >= 3
df_filtered = df_sampled.filter(col("mean_score") >= 3)
#df_filtered.count()

We then kept only users who rated at least 2 books: a user with a single highly rated book gives no information about which other books tend to appear in the same basket.

In [63]:
# Filter and keep just users who rated > 1 book
user_counts = df_filtered.groupBy("User_id").agg(count("Id").alias("book_count"))
users_with_multiple_books = user_counts.filter(col("book_count") > 1).select("User_id")

df_filtered = df_filtered.join(users_with_multiple_books, on="User_id", how="inner")
#df_filtered.count()

In [32]:
df_filtered.show(5)

+--------------------+----------+----------+
|             User_id|        Id|mean_score|
+--------------------+----------+----------+
|A0236983QUCQMORABO03|1587888408|       5.0|
|A0236983QUCQMORABO03|1587888432|       5.0|
|A0236983QUCQMORABO03|1593352077|       5.0|
|A025268923L497N34...|B000P0UDX4|       5.0|
|A025268923L497N34...|B00005UVH9|       5.0|
+--------------------+----------+----------+
only showing top 5 rows



Three approaches are applied to the filtered data: FP-Growth (section 4.1), Apriori (section 4.2), and SON (section 4.3).

The Apriori algorithm is an association rule mining method used to discover frequent patterns in large datasets. It proceeds level by level: it first finds the frequent single items, then builds larger candidate itemsets only from itemsets already found to be frequent, discarding at each step the candidates that fall below the support threshold. Association rules expressing relationships between items are then extracted from the frequent itemsets. It is commonly used in market-basket analysis.

MLxtend provides an easy-to-use implementation of Apriori: its apriori function identifies frequent itemsets from transaction data, and its association_rules function generates association rules from those itemsets.

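### **4.1 FP-Growth Algorithm**

The first algorithm is FP-Growth, which mines frequent itemsets without generating candidate itemsets explicitly: it compresses the baskets into a prefix tree (the FP-tree) and extracts the frequent itemsets from that tree. We use the implementation available in PySpark (pyspark.ml.fpm.FPGrowth).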

- Creation of Baskets of items 

In [33]:
# Create baskets of items for every user
df_basket = df_filtered.groupBy("User_id").agg(collect_set("Id").alias("items"))

- Algorithm application

In [34]:
# Apply FP-Growth
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.2)
model = fpGrowth.fit(df_basket)


We set the minimum support to 0.01 and the minimum confidence to 0.2.
 - Support: the fraction of baskets that contain all the items of the itemset
 - Confidence: the probability that a basket containing the antecedent also contains the consequent book
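These measures, together with the lift reported in the association rules, can be computed by hand on a few baskets. The following is a minimal sketch in plain Python; the helper names and the toy baskets are assumptions made for illustration and are not part of the notebook.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Share of baskets with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

def lift(baskets, antecedent, consequent):
    """Confidence divided by the support of the consequent.

    Values above 1 mean the two sides co-occur more often than expected
    if they were independent.
    """
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)

# Hypothetical toy baskets (each basket is a set of book IDs)
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B"},
]

print(support(baskets, {"A", "B"}))       # 2 of 4 baskets
print(confidence(baskets, {"A"}, {"B"}))  # 2 of the 3 baskets containing A
print(lift(baskets, {"A"}, {"B"}))
```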
    

In [35]:
print("Frequent Itemsets:")
model.freqItemsets.show(truncate=False)

# Count number of Frequent Itemsets
#num_freq_itemsets = model.freqItemsets.count()
#print(f"Number of rows of Frequent Itemsets: {num_freq_itemsets}")

Frequent Itemsets:
+------------------------------------------------+----+
|items                                           |freq|
+------------------------------------------------+----+
|[B000ILIJE0]                                    |689 |
|[B000NWU3I4]                                    |687 |
|[B000NWU3I4, B000ILIJE0]                        |686 |
|[B000PC54NG]                                    |685 |
|[B000PC54NG, B000NWU3I4]                        |684 |
|[B000PC54NG, B000NWU3I4, B000ILIJE0]            |683 |
|[B000PC54NG, B000ILIJE0]                        |684 |
|[B000NWQXBA]                                    |683 |
|[B000NWQXBA, B000PC54NG]                        |683 |
|[B000NWQXBA, B000PC54NG, B000NWU3I4]            |682 |
|[B000NWQXBA, B000PC54NG, B000NWU3I4, B000ILIJE0]|682 |
|[B000NWQXBA, B000PC54NG, B000ILIJE0]            |683 |
|[B000NWQXBA, B000NWU3I4]                        |682 |
|[B000NWQXBA, B000NWU3I4, B000ILIJE0]            |682 |
|[B000NWQXBA, B000ILIJE0]    

- High confidence and high lift: a rule with high confidence (close to 1.0) and a very high lift indicates a strong association between the items in the antecedent and those in the consequent. Such rules are particularly useful for product recommendations.

- Low support, high confidence, and high lift: even when support is low (for example, 1% of the transactions), a high lift and a confidence close to 1.0 indicate that the rule is very significant for a small number of transactions.

In [36]:
print("Association Rules:")
model.associationRules.show(truncate=False)

# Count number of Association Rules
num_association_rules = model.associationRules.count()
print(f"Number of rows of Association Rules: {num_association_rules}")

Association Rules:
+------------------------------------------------------------------------+------------+------------------+-----------------+--------------------+
|antecedent                                                              |consequent  |confidence        |lift             |support             |
+------------------------------------------------------------------------+------------+------------------+-----------------+--------------------+
|[B000NDSX6C, B000GQG7D2, B000H9R1Q0, B000PC54NG, B000NWU3I4, B000ILIJE0]|[B000GQG5MA]|0.9640522875816994|78.92355222311484|0.010936051899907321|
|[B000NDSX6C, B000GQG7D2, B000H9R1Q0, B000PC54NG, B000NWU3I4, B000ILIJE0]|[B000Q032UY]|1.0               |79.57227138643069|0.01134383688600556 |
|[B000NDSX6C, B000GQG7D2, B000H9R1Q0, B000PC54NG, B000NWU3I4, B000ILIJE0]|[B000NWQXBA]|1.0               |78.98975109809663|0.01134383688600556 |
|[B000GQG7D2, B000Q032UY, B000PC54NG, B000NWU3I4, B000ILIJE0]            |[B000H9R1Q0]|1.0               

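- Expanding each rule into single-book (antecedent, consequent) pairs

In [37]:
association_rules = model.associationRules

# explode() emits one row per book in the antecedent (and in the consequent).
# Note: confidence, lift and support remain those of the full multi-book rule;
# they are not recomputed for the individual pair.
association_rules_single = association_rules.withColumn(
    "antecedent_single", 
    F.explode(association_rules.antecedent)
).withColumn(
    "consequent_single", 
    F.explode(association_rules.consequent)
)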

association_rules_single.select("antecedent_single", "consequent_single", "confidence", "lift", "support").show(truncate=False)


+-----------------+-----------------+------------------+-----------------+--------------------+
|antecedent_single|consequent_single|confidence        |lift             |support             |
+-----------------+-----------------+------------------+-----------------+--------------------+
|B000NDSX6C       |B000GQG5MA       |0.9640522875816994|78.92355222311484|0.010936051899907321|
|B000GQG7D2       |B000GQG5MA       |0.9640522875816994|78.92355222311484|0.010936051899907321|
|B000H9R1Q0       |B000GQG5MA       |0.9640522875816994|78.92355222311484|0.010936051899907321|
|B000PC54NG       |B000GQG5MA       |0.9640522875816994|78.92355222311484|0.010936051899907321|
|B000NWU3I4       |B000GQG5MA       |0.9640522875816994|78.92355222311484|0.010936051899907321|
|B000ILIJE0       |B000GQG5MA       |0.9640522875816994|78.92355222311484|0.010936051899907321|
|B000NDSX6C       |B000Q032UY       |1.0               |79.57227138643069|0.01134383688600556 |
|B000GQG7D2       |B000Q032UY       |1.0

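### **4.2 A-Priori Algorithms**

Apriori finds frequent itemsets level by level, extending only the itemsets that already meet the support threshold. Two implementations follow: one based on the mlxtend library and one written with Spark RDD operations.

#### **4.2.1 First implementation using the mlxtend library**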

In [38]:
from pyspark.sql.functions import collect_set
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd
from scipy.sparse import lil_matrix

- Create the baskets

In [39]:
# Group the books by user
transactions = df_filtered.groupBy("User_id").agg(collect_set("Id").alias("books"))
#transactions.show(5)

- mlxtend works on pandas data, so the Spark DataFrame must be converted to pandas

In [40]:
pandas_df = transactions.toPandas()

In [41]:
pandas_df.head(5)

| | User_id | books |
|---|---|---|
| 0 | A0236983QUCQMORABO03 | [1587888432, 1587888408, 1593352077] |
| 1 | A025268923L497N34PUMH | [B000P0UDX4, B00005UVH9, B0006IU3EE] |
| 2 | A07084061WTSSXN6VLV92 | [0808510002, B000Q34B8I, 0521639522, B000FC1BY... |
| 3 | A100Q4BGPV187I | [0743236017, 0743554884] |
| 4 | A100TQ7ZRE0W02 | [0971237034, 0976325608, 0971237018, 0972800522] |


- Create the sparse matrix

In [42]:
unique_books = set(item for sublist in pandas_df["books"] for item in sublist)

book_index = {book: idx for idx, book in enumerate(unique_books)}  

sparse_matrix = lil_matrix((len(pandas_df), len(unique_books)), dtype=bool)

- Check the matrix dimensions:

In [43]:
pandas_df.shape

(53950, 2)

In [44]:
len(unique_books)

70168

In [45]:
sparse_matrix

<List of Lists sparse matrix of dtype 'bool'
	with 0 stored elements and shape (53950, 70168)>

- Populate the matrix

In [46]:
for row_idx, books in enumerate(pandas_df["books"]):
    for book in books:
        col_idx = book_index[book]
        sparse_matrix[row_idx, col_idx] = 1

encoded_df = pd.DataFrame.sparse.from_spmatrix(sparse_matrix, columns=book_index.keys())
encoded_df


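| | 075730141X | 0872204529 | B0000891YC | B0006DCMD4 | B000KYIFEY | B000GRB6CS | 1929925700 | 0802789471 | 0976631067 | 0896891755 | ... | 0393974065 | 007228661X | B000QA78L8 | 0838632629 | B000FQ4JKQ | 0671247476 | 0060090308 | 0867162406 | 0061059730 | 1413746209 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53945 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 53946 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 53947 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 53948 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |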


- Apply the Apriori algorithm

In [47]:
frequent_itemsets = apriori(encoded_df, min_support=0.01, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)

print("Frequent itemsets:")
frequent_itemsets


Frequent itemsets:


| | support | itemsets |
|---|---|---|
| 0 | 0.012771 | (B000ILIJE0) |
| 1 | 0.012326 | (B000GQG7D2) |
| 2 | 0.012734 | (B000NWU3I4) |
| 3 | 0.012660 | (B000NWQXBA) |
| 4 | 0.012697 | (B000PC54NG) |
| ... | ... | ... |
| 506 | 0.010936 | (B000H9R1Q0, B000NDSX6C, B000ILIJE0, B000GQG5M... |
| 507 | 0.010955 | (B000H9R1Q0, B000NDSX6C, B000NWQXBA, B000ILIJE... |
| 508 | 0.011270 | (B000H9R1Q0, B000NDSX6C, B000NWQXBA, B000ILIJE... |
| 509 | 0.010936 | (B000H9R1Q0, B000NDSX6C, B000NWQXBA, B000GQG5M... |


In [48]:
print("\nAssociation rules:")
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]


Association rules:


| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (B000ILIJE0) | (B000GQG7D2) | 0.012308 | 0.963716 | 78.184140 |
| 1 | (B000GQG7D2) | (B000ILIJE0) | 0.012308 | 0.998496 | 78.184140 |
| 2 | (B000NWU3I4) | (B000ILIJE0) | 0.012715 | 0.998544 | 78.187910 |
| 3 | (B000ILIJE0) | (B000NWU3I4) | 0.012715 | 0.995646 | 78.187910 |
| 4 | (B000ILIJE0) | (B000NWQXBA) | 0.012660 | 0.991292 | 78.301887 |
| ... | ... | ... | ... | ... | ... |
| 18655 | (B000GQG5MA) | (B000H9R1Q0, B000NDSX6C, B000NWQXBA, B000ILIJE... | 0.010936 | 0.895296 | 78.923552 |
| 18656 | (B000PC54NG) | (B000H9R1Q0, B000NDSX6C, B000NWQXBA, B000ILIJE... | 0.010936 | 0.861314 | 78.759124 |
| 18657 | (B000Q032UY) | (B000H9R1Q0, B000NDSX6C, B000NWQXBA, B000ILIJE... | 0.010936 | 0.870206 | 79.572271 |
| 18658 | (B000NWU3I4) | (B000H9R1Q0, B000NDSX6C, B000NWQXBA, B000ILIJE... | 0.010936 | 0.858806 | 78.396964 |


- Only rules with a single book on each side

In [49]:
filtered_rules = rules[
    (rules["antecedents"].apply(lambda x: len(x) == 1)) &  # Only 1 book as antecedent
    (rules["consequents"].apply(lambda x: len(x) == 1))    # Only 1 book as consequent
]
print("\nAssociation rules:")
filtered_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]


Association rules:


| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (B000ILIJE0) | (B000GQG7D2) | 0.012308 | 0.963716 | 78.184140 |
| 1 | (B000GQG7D2) | (B000ILIJE0) | 0.012308 | 0.998496 | 78.184140 |
| 2 | (B000NWU3I4) | (B000ILIJE0) | 0.012715 | 0.998544 | 78.187910 |
| 3 | (B000ILIJE0) | (B000NWU3I4) | 0.012715 | 0.995646 | 78.187910 |
| 4 | (B000ILIJE0) | (B000NWQXBA) | 0.012660 | 0.991292 | 78.301887 |
| ... | ... | ... | ... | ... | ... |
| 67 | (B000GQG5MA) | (B000NDSX6C) | 0.011437 | 0.936267 | 78.678518 |
| 68 | (B000NDSX6C) | (B000Q032UY) | 0.011715 | 0.984424 | 78.332828 |
| 69 | (B000Q032UY) | (B000NDSX6C) | 0.011715 | 0.932153 | 78.332828 |
| 70 | (B000GQG5MA) | (B000Q032UY) | 0.012030 | 0.984825 | 78.364801 |


In [50]:
sorted_rules = filtered_rules.sort_values(by="confidence", ascending=True)
sorted_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 55 | (B000PC54NG) | (B000NDSX6C) | 0.011807 | 0.929927 | 78.145735 |
| 37 | (B000NWU3I4) | (B000NDSX6C) | 0.011844 | 0.930131 | 78.162878 |
| 25 | (B000GQG7D2) | (B000NDSX6C) | 0.011474 | 0.930827 | 78.221371 |
| 47 | (B000NWQXBA) | (B000NDSX6C) | 0.011789 | 0.931186 | 78.251529 |
| 11 | (B000ILIJE0) | (B000NDSX6C) | 0.011900 | 0.931785 | 78.301887 |
| ... | ... | ... | ... | ... | ... |
| 44 | (B000H9R1Q0) | (B000NWQXBA) | 0.012567 | 1.000000 | 78.989751 |
| 43 | (B000NWQXBA) | (B000PC54NG) | 0.012660 | 1.000000 | 78.759124 |
| 10 | (B000NDSX6C) | (B000ILIJE0) | 0.011900 | 1.000000 | 78.301887 |
| 64 | (B000H9R1Q0) | (B000Q032UY) | 0.012567 | 1.000000 | 79.572271 |


#### **4.2.2 A-Priori algorithm without mlxtend**

In [51]:
# Apriori is not directly implemented in PySpark because it is less scalable than FP-Growth.

The function takes an RDD and a threshold as input, and is built from basic Spark transformations: flatMap, map, reduceByKey, and filter.

In [52]:
from itertools import combinations

def apriori_frequent_books(rdd, threshold):
    """
    Find frequent pairs of books read by the same users.

    Simplified version: every pair is counted directly, without first pruning
    the books that are infrequent on their own.

    :param rdd: RDD with the list of books read by each user.
    :param threshold: Minimum count for a pair to be considered frequent.
    :return: RDD containing the frequent pairs of books and their counts.
    """

    # Step 1: Build the pairs of books for each user
    pairs_rdd = rdd.flatMap(lambda books: [tuple(sorted(pair)) for pair in combinations(books, 2)])

    # Step 2: Count how often each pair occurs
    pair_counts_rdd = pairs_rdd.map(lambda pair: (pair, 1)).reduceByKey(lambda a, b: a + b)

    # Step 3: Keep the pairs whose count is at least the threshold
    frequent_pairs_rdd = pair_counts_rdd.filter(lambda pair_count: pair_count[1] >= threshold)

    return frequent_pairs_rdd

# Convert df_filtered into an RDD with one list of books per user
rdd = df_filtered.rdd.map(lambda row: (row["User_id"], row["Id"])) \
                     .groupByKey() \
                     .map(lambda x: list(set(x[1])))  # List of unique books per user

# Set the minimum count for a pair to be frequent
threshold = 2

# Run Apriori on the books
frequent_book_pairs = apriori_frequent_books(rdd, threshold)



In [53]:
print(frequent_book_pairs.sortBy(lambda x: -x[1]).take(15))


[(('B000ILIJE0', 'B000NWU3I4'), 686), (('B000NWU3I4', 'B000PC54NG'), 684), (('B000ILIJE0', 'B000PC54NG'), 684), (('B000NWQXBA', 'B000PC54NG'), 683), (('B000ILIJE0', 'B000NWQXBA'), 683), (('B000NWQXBA', 'B000NWU3I4'), 682), (('B000H9R1Q0', 'B000PC54NG'), 678), (('B000NWQXBA', 'B000Q032UY'), 678), (('B000ILIJE0', 'B000Q032UY'), 678), (('B000H9R1Q0', 'B000Q032UY'), 678), (('B000H9R1Q0', 'B000ILIJE0'), 678), (('B000PC54NG', 'B000Q032UY'), 678), (('B000H9R1Q0', 'B000NWQXBA'), 678), (('B000H9R1Q0', 'B000NWU3I4'), 677), (('B000NWU3I4', 'B000Q032UY'), 677)]
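The function above counts every pair directly. The classic A-Priori scheme adds a first pass that discards books that are infrequent on their own, so that fewer candidate pairs need to be counted. The following is a minimal sketch in plain Python, not the notebook's Spark code; the function name `apriori_pairs` and the toy baskets are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, threshold):
    """Two-pass A-Priori restricted to pairs.

    baskets: iterable of lists of item IDs.
    threshold: minimum number of baskets an item or pair must appear in.
    """
    baskets = [set(basket) for basket in baskets]

    # Pass 1: count single items and keep the frequent ones
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {item for item, n in item_counts.items() if n >= threshold}

    # Pass 2: count only pairs made of two frequent items.
    # A pair can be frequent only if both of its items are (monotonicity).
    pair_counts = Counter()
    for basket in baskets:
        candidates = sorted(basket & frequent_items)
        pair_counts.update(combinations(candidates, 2))

    return {pair: n for pair, n in pair_counts.items() if n >= threshold}

# Hypothetical toy baskets
baskets = [
    ["b1", "b2", "b3"],
    ["b1", "b2"],
    ["b1", "b2", "b4"],
    ["b3", "b5"],
]
frequent = apriori_pairs(baskets, threshold=2)
print(frequent)
```

By monotonicity, the pruning does not change the result; it only reduces the number of pairs that have to be counted, which matters more as the threshold grows.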


In [54]:
frequent_df = frequent_book_pairs.toDF(["pair", "count"])

In [55]:
book_counts = rdd.flatMap(lambda books: [(book, 1) for book in books]) \
                 .reduceByKey(lambda a, b: a + b) \
                 .toDF(["book", "count"])


In [56]:
frequent_df.show(5)

+--------------------+-----+
|                pair|count|
+--------------------+-----+
|{0486417786, B000...|  108|
|{B0006D70HW, B000...|    7|
|{B000HVR6KY, B000...|  256|
|{068983375X, 0747...|  127|
|{0340283947, 0904...|  127|
+--------------------+-----+
only showing top 5 rows



In [57]:
rules_df = frequent_df.select(
    col("pair").getField("_1").alias("book_A"),
    col("pair").getField("_2").alias("book_B"),
    col("count").alias("pair_count")
)

In [58]:
# Join with the single-book counts to compute the confidence
rules_df = rules_df.join(book_counts.withColumnRenamed("book", "book_A"), "book_A") \
                   .withColumnRenamed("count", "count_A") \
                   .join(book_counts.withColumnRenamed("book", "book_B"), "book_B") \
                   .withColumnRenamed("count", "count_B")

In [59]:
# Compute support, confidence and lift
n_baskets = rdd.count()  # total number of baskets, computed once
rules_df = rules_df.withColumn("support", col("pair_count") / n_baskets) \
                   .withColumn("confidence_AtoB", col("pair_count") / col("count_A")) \
                   .withColumn("confidence_BtoA", col("pair_count") / col("count_B")) \
                   .withColumn("lift", col("confidence_AtoB") / (col("count_B") / n_baskets))

# Show the rules sorted by decreasing lift
rules_df.orderBy(col("lift").desc()).show(15, False)

+----------+----------+----------+-------+-------+--------------------+---------------+---------------+-------+
|book_B    |book_A    |pair_count|count_A|count_B|support             |confidence_AtoB|confidence_BtoA|lift   |
+----------+----------+----------+-------+-------+--------------------+---------------+---------------+-------+
|B000HKIICA|0132126052|2         |2      |2      |3.707136237256719E-5|1.0            |1.0            |26975.0|
|B000PMQ1F6|B000FOZ632|2         |2      |2      |3.707136237256719E-5|1.0            |1.0            |26975.0|
|B000NB04WU|B000NAKPQ6|2         |2      |2      |3.707136237256719E-5|1.0            |1.0            |26975.0|
|B000I10HRM|B0008ARAPA|2         |2      |2      |3.707136237256719E-5|1.0            |1.0            |26975.0|
|0943497809|087552379X|2         |2      |2      |3.707136237256719E-5|1.0            |1.0            |26975.0|
|B0007HT3L8|B0007G22K8|2         |2      |2      |3.707136237256719E-5|1.0            |1.0            |2
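
The top of this ranking can be checked by hand. A worked computation, assuming N = 53950 baskets (the number of rows of pandas_df, which is consistent with the support value 2/53950 ≈ 3.707E-5 in the table):

$$
\text{lift}(A \to B) = \frac{\text{confidence}(A \to B)}{\text{count}_B / N} = \frac{N \cdot \text{pair\_count}}{\text{count}_A \cdot \text{count}_B} = \frac{53950 \cdot 2}{2 \cdot 2} = 26975
$$

Two books that each appear in only two baskets, and always together, therefore reach a very large lift even though their support is minimal. With threshold = 2, such rare pairs pass the filter and dominate the ranking by lift.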

In [60]:
# Show the rules sorted by increasing lift
rules_df.orderBy(col("lift").asc()).show(15, False)

+----------+----------+----------+-------+-------+---------------------+---------------------+---------------------+-------------------+
|book_B    |book_A    |pair_count|count_A|count_B|support              |confidence_AtoB      |confidence_BtoA      |lift               |
+----------+----------+----------+-------+-------+---------------------+---------------------+---------------------+-------------------+
|B000PCESRE|B000GQG5MA|2         |659    |340    |3.707136237256719E-5 |0.0030349013657056147|0.0058823529411764705|0.4815674372935821 |
|B000PMCF1A|B000GQG5MA|2         |659    |339    |3.707136237256719E-5 |0.0030349013657056147|0.0058997050147492625|0.48298799020595257|
|B000I3NFKG|B000GQG5MA|2         |659    |338    |3.707136237256719E-5 |0.0030349013657056147|0.005917159763313609 |0.48441694875685776|
|B000MOOAJG|B000ILIJE0|3         |689    |434    |5.5607043558850786E-5|0.0043541364296081275|0.0069124423963133645|0.5412572819754804 |
|B000PWMT1G|B000ILIJE0|3         |689    

### **4.3 SON Algorithm**

In [65]:
from itertools import combinations

def generate_candidates(basket, k):
    """
    Generate all itemsets of length k from a basket.
    """
    return list(combinations(sorted(basket), k))


In [66]:
from collections import defaultdict

def find_frequent_itemsets(partition, min_support, k):
    """
    Find the locally frequent itemsets in one partition.
    """
    baskets = list(partition)
    local_counts = defaultdict(int)

    # Count the itemsets of length k
    for basket in baskets:
        for candidate in generate_candidates(basket, k):
            local_counts[candidate] += 1

    # Keep the itemsets that meet the minimum support within the partition
    partition_size = len(baskets)
    local_frequent_itemsets = {itemset for itemset, count in local_counts.items() if count / partition_size >= min_support}

    return list(local_frequent_itemsets)


In [67]:
def count_global_frequencies(df_basket, candidates):
    """
    Count the global frequencies of the candidate itemsets.
    """
    global_counts = defaultdict(int)

    # Note: collect() brings every basket to the driver
    for row in df_basket.collect():
        basket = set(row["items"])
        for candidate in candidates:
            if set(candidate).issubset(basket):
                global_counts[candidate] += 1

    return global_counts


In [70]:
min_support = 0.5
k = 2  # Look for frequent pairs (itemsets of size 2)

In [71]:
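# Each partition element must be a basket (list of book IDs), not a whole Row
local_frequent_itemsets = df_basket.rdd.map(lambda row: row["items"]) \
    .mapPartitions(lambda partition: find_frequent_itemsets(partition, min_support, k))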
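
The cells above define the two passes of SON separately. How they fit together can be sketched in plain Python, without Spark. This is an illustrative sketch, not the notebook's implementation: the function names, the chunking scheme, and the toy baskets are assumptions.

```python
from collections import Counter
from itertools import combinations

def local_candidates(chunk, min_support, k):
    """Pass 1: itemsets of size k that are frequent within one chunk."""
    counts = Counter()
    for basket in chunk:
        counts.update(combinations(sorted(set(basket)), k))
    local_threshold = min_support * len(chunk)
    return {itemset for itemset, n in counts.items() if n >= local_threshold}

def son(baskets, min_support, k, n_chunks):
    """Two-pass SON: local candidates per chunk, then an exact global count."""
    # Split the baskets into chunks (stand-ins for Spark partitions)
    chunks = [baskets[i::n_chunks] for i in range(n_chunks)]

    # Pass 1: union of the locally frequent itemsets.
    # An itemset that is frequent overall is frequent in at least one chunk,
    # so no globally frequent itemset is missed.
    candidates = set()
    for chunk in chunks:
        if chunk:
            candidates |= local_candidates(chunk, min_support, k)

    # Pass 2: count the candidates over all baskets and drop false positives
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        for candidate in candidates:
            if items.issuperset(candidate):
                counts[candidate] += 1
    global_threshold = min_support * len(baskets)
    return {c: n for c, n in counts.items() if n >= global_threshold}

# Hypothetical toy baskets
baskets = [
    ["b1", "b2"],
    ["b1", "b2", "b3"],
    ["b1", "b2"],
    ["b2", "b3"],
    ["b1", "b3"],
    ["b1", "b2", "b4"],
]
result = son(baskets, min_support=0.5, k=2, n_chunks=2)
print(result)
```

In this toy run the pair (b2, b3) is frequent in one chunk but not overall, so the second pass removes it.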