



---

# MARKET-BASKET ANALYSIS

## Massive Algorithm 
### Data Science for Economics

##### Angelica Longo, Melissa Rizzi

The goal of this project is to implement a system for **detecting frequent itemsets**, commonly known as **market-basket analysis**.
In this notebook, the detector treats each user’s reviewed books as a basket, with books serving as items.

The project is based on the **[Amazon Books Review](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews)** dataset, published on Kaggle under the public domain CC0 license. Data is downloaded during the execution of the scripts via an API and contains variables related to users and their reviews of purchased books.

Because the dataset is large (about 3 million rows), we use **PySpark** to draw a subsample of approximately 500,000 rows; the same pipeline is written so that it can scale to the full dataset.

The project is structured as follows:

- **Preprocessing** – This phase includes data cleaning, checking data integrity, handling null values, removing duplicates, and computing the overall mean to verify consistency with the selected subsample.
- **Subsampling** – A subset of data is created while maintaining a representative distribution of user choices and ratings.
- **Frequent Itemset Mining** – The final step involves implementing an algorithm to identify frequent itemsets within the dataset.

This structured approach ensures both **efficiency** and **scalability** while maintaining **data integrity**.

### Table of Contents
- [1. Data Import](#1-data-import)
- [2. Data PreProcessing](#2-data-preprocessing)
  - [2.1 Data Integrity](#21-data-integrity)
  - [2.2 Missing Data](#22-missing-data)
  - [2.3 Data Duplicates](#23-data-duplicates)
  - [2.4 Rating Means](#24-rating-means)
- [3. Subsample Creation](#3-subsample-creation)
- [4. Algorithm Implementation](#4-algorithm-implementation)


---
### 1. Data Import

In [1]:
import os
import zipfile

In [2]:
os.environ['KAGGLE_USERNAME'] = "melissarizzi"
# Replace the placeholder with your own Kaggle API key; never commit a real key to a shared notebook
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

In [3]:
#!pip install kaggle

In [12]:
#!kaggle datasets download -d mohamedbakhet/amazon-books-reviews

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0


In [14]:
with zipfile.ZipFile("amazon-books-reviews.zip", 'r') as zip_ref:
    zip_ref.extractall("amazon_books_data")

---
### 2. Data PreProcessing

In [15]:
# Import necessary libraries
from pyspark.sql import SparkSession
# Note: min, max and sum imported here shadow the Python built-ins of the same name
from pyspark.sql.functions import col, min, max, sum, when, collect_set, count
from pyspark.sql.types import DoubleType
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

In [16]:
# Create Spark Session
spark = SparkSession.builder.appName("MapReduce").getOrCreate()

In [17]:
# Import data
data = spark.read.csv("amazon_books_data/Books_rating.csv", header=True, inferSchema=True)
#data.show(5)

In [19]:
# Select only the useful columns
df = data.select("Id", "Title", "User_id", "review/score", "review/text").withColumnRenamed("review/score", "score")
df.show(5)

+----------+--------------------+--------------+-----+--------------------+
|        Id|               Title|       User_id|score|         review/text|
+----------+--------------------+--------------+-----+--------------------+
|1882931173|Its Only Art If I...| AVCGYZL8FQQTD|  4.0|This is only for ...|
|0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|  5.0|I don't care much...|
|0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|  5.0|"If people become...|
|0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|  4.0|Theodore Seuss Ge...|
|0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|  4.0|"Philip Nel - Dr....|
+----------+--------------------+--------------+-----+--------------------+
only showing top 5 rows



#### 2.1 Data Integrity

In [20]:
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- score: string (nullable = true)
 |-- review/text: string (nullable = true)



In [21]:
# Cast the 'score' column to double
df = df.withColumn("score", col("score").cast(DoubleType()))
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- score: double (nullable = true)
 |-- review/text: string (nullable = true)



In [22]:
# Check score range
df.select(min(col("score")).alias("min_score"), max(col("score")).alias("max_score")).show()

+---------+----------+
|min_score| max_score|
+---------+----------+
|      1.0|1.295568E9|
+---------+----------+



In [23]:
# Keep only the rows whose 'score' lies in the valid range [1, 5]
df = df.filter((col("score") >= 1) & (col("score") <= 5))
df.select(min("score").alias("min_score"), max("score").alias("max_score")).show()

+---------+---------+
|min_score|max_score|
+---------+---------+
|      1.0|      5.0|
+---------+---------+



#### 2.2 Missing Data

In [24]:
# Count null values for each variable
null_counts = df.select(
    [sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]
)

#null_counts.show()

What stands out right away, especially for our analysis, is the large number of missing values in the User_id variable. One possible explanation is that reviewers who are not registered have no user ID. Our goal is to identify baskets of books reviewed by the same user, which is impossible without a user ID. We considered using profile names instead, assigning a dummy ID to users who share the same name, but different users can have the same name, so the results could be inaccurate. Moreover, profile names were missing more often than user IDs, which made this workaround unfeasible. Since we found no suitable way to replace the missing values, we ultimately decided to **drop the rows that contain them**.

In [25]:
# Remove null values
df_clean = df.dropna()
#df_clean.show(5)

In [26]:
# Check data size
n_rows = df.count()
n_rows_clean = df_clean.count()
print(f"Number of Rows - Before cleaning: {n_rows}")
print(f"Number of Rows - After cleaning: {n_rows_clean}")

Number of Rows - Before cleaning: 2981912
Number of Rows - After cleaning: 2420235


#### 2.3 Data Duplicates

In [27]:
# Remove duplicated rows
df_clean = df_clean.dropDuplicates()

#n_rows_clean = df_clean.count()
#print(f"Number of Rows - After duplicates removal: {n_rows_clean}")

Up until now, we’ve performed a general cleaning of the dataset. From here on, we’ll focus exclusively on the three columns that are relevant to our analysis (Id, User_id, and score), forming a new dataset: df_short.

In [28]:
# Remove useless columns
df_short = df_clean.select("Id", "User_id","score")
#df_short.show(5)

In [29]:
# Remove rows that are duplicated across the three selected columns
df_short = df_short.dropDuplicates()

Given that the same user could have rated the same book twice, we want to compute the mean of the different scores given by the same user to the same book.

In [30]:
# Find (Id, User_id) pairs that appear more than once
duplicates = df_short.groupBy("Id", "User_id").count().filter("count > 1")

In [31]:
# Compute average score for every (Id, User_id)
score_mean = df_short.groupBy('Id', 'User_id').agg(F.mean('score').alias('mean_score'))
df_final = df_short.join(score_mean, on=['Id', 'User_id'], how='left')

# Creation of the final preprocessed dataset
df_final = df_final.select('Id','User_id', 'mean_score')
df_final = df_final.dropDuplicates()
#df_final.show(5)

In [32]:
n_rows_final = df_final.count()
print(f"Number of Rows - Final dataset: {n_rows_final}")

Number of Rows - Final dataset: 2380153
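The averaging step above can be illustrated on toy data without Spark. This is only a minimal sketch: the rows, the IDs, and the helper name `mean_score_per_pair` are made up for illustration and are not part of the notebook.

```python
from collections import defaultdict

# Toy (book_id, user_id, score) rows; all IDs are hypothetical.
rows = [
    ("book_a", "user_1", 4.0),
    ("book_a", "user_1", 2.0),  # the same user rated the same book twice
    ("book_b", "user_1", 5.0),
    ("book_a", "user_2", 3.0),
]


def mean_score_per_pair(rows):
    """Return {(book_id, user_id): mean score}, one entry per pair."""
    totals = defaultdict(lambda: [0.0, 0])  # pair -> [sum of scores, count]
    for book_id, user_id, score in rows:
        acc = totals[(book_id, user_id)]
        acc[0] += score
        acc[1] += 1
    return {pair: total / n for pair, (total, n) in totals.items()}


mean_scores = mean_score_per_pair(rows)
```

As in the Spark `groupBy(...).agg(...)` call, the result has exactly one entry per (book, user) pair, so the two conflicting ratings of `user_1` for `book_a` collapse into a single averaged value.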


#### 2.4 Rating Means

We want to calculate the overall average score to see if consistency is maintained after creating the subsample.

- Overall mean score:

In [33]:
df_final = df_final.withColumn("mean_score", F.col("mean_score").cast("double"))

overall_mean = df_final.agg(F.avg("mean_score")).collect()[0][0]
print(f"Overall mean score - Final dataset: {overall_mean}")

Overall mean score - Final dataset: 4.22386130919595


- Mean score for each item:

In [34]:
#score_per_id = df_final.groupBy("Id").agg(F.avg("mean_score").alias("avg_score_pre"))
#score_per_id.show(5)

In [35]:
# Check data integrity
#score_per_id_above_5 = score_per_id.filter(F.col("avg_score_pre") > 5)
#score_per_id_above_5.show(5)

---
### 3. Subsample Creation

We aim to create a subsample that remains consistent with the original dataset. To achieve this, we select a fraction of users while ensuring that all their reviews are included. This approach allows us to better represent their purchasing behavior and rating patterns, preserving the integrity of the data.

In [36]:
num_users = df_final.select("User_id").distinct().count()
print(f"Total number of different users - Original dataset: {num_users}")

Total number of different users - Original dataset: 1004214


In [37]:
# Keep just 20% of the users
sample_fraction = 0.2
user_sample = df_final.select("User_id").distinct().sample(fraction=sample_fraction, seed=42)

In [38]:
# Create the subsample with the selected users
df_sampled = df_final.join(user_sample, on="User_id", how="inner")
#df_sampled.show(5)

In [39]:
# Check subsample size
n_rows_sample = df_sampled.count()
print(f"Number of Rows - Sample: {n_rows_sample}")

Number of Rows - Sample: 471573


In [40]:
# Check data integrity
df_sampled.printSchema()

root
 |-- User_id: string (nullable = true)
 |-- Id: string (nullable = true)
 |-- mean_score: double (nullable = true)



In [42]:
# Check mean coherence with the original dataset
overall_mean_sample = df_sampled.agg(F.avg("mean_score")).collect()[0][0]
print(f"Overall mean - Sample: {overall_mean_sample}")

Overall mean - Sample: 4.221354205322752


The overall mean of the subsample (about 4.221) is consistent with the overall mean of the full preprocessed dataset (about 4.224).
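The user-level sampling idea used in this section can be sketched in plain Python. The toy reviews and the helper name `sample_users` are assumptions for illustration; the sketch uses independent per-user draws, so, as with a fraction-based sample, the number of selected users is only approximately `fraction` of the total.

```python
import random


def sample_users(reviews, fraction, seed=42):
    """Select a random subset of users and keep all of their reviews."""
    rng = random.Random(seed)
    users = sorted({user for user, _book, _score in reviews})
    # Each user is kept independently with probability `fraction`.
    kept = {user for user in users if rng.random() < fraction}
    return [r for r in reviews if r[0] in kept], kept


# Toy data: 50 hypothetical users with 4 reviews each.
reviews = [(f"user_{i % 50}", f"book_{i}", 4.0) for i in range(200)]
sample, kept_users = sample_users(reviews, fraction=0.2)
```

Sampling users rather than rows means a selected user's basket is never truncated, which matters later because itemsets are counted per user.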

---
### 4. Algorithm Implementation

#### 4.1 FP-Growth Algorithm

We considered only books that received a score of at least 3.

In [43]:
# Filter and keep just rows with rating >= 3
df_filtered = df_sampled.filter(col("mean_score") >= 3)

In [44]:
# Filter and keep just users who rated > 1 book
user_counts = df_filtered.groupBy("User_id").agg(count("Id").alias("book_count"))
users_with_multiple_books = user_counts.filter(col("book_count") > 1).select("User_id")

df_filtered = df_filtered.join(users_with_multiple_books, on="User_id", how="inner")

- Creation of baskets of items

In [45]:
# Create baskets of items for every user
df_basket = df_filtered.groupBy("User_id").agg(collect_set("Id").alias("items"))
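For readers less familiar with Spark, the filtering and basket construction above can be sketched in plain Python. The rows and the helper name `build_baskets` are hypothetical; the thresholds mirror the ones used in the cells above (score of at least 3, more than one book per user).

```python
from collections import defaultdict


def build_baskets(rows, min_score=3.0, min_items=2):
    """Group well-rated books by user and drop users with too few books."""
    baskets = defaultdict(set)  # a set removes repeated books, like collect_set
    for user_id, book_id, score in rows:
        if score >= min_score:
            baskets[user_id].add(book_id)
    return {u: items for u, items in baskets.items() if len(items) >= min_items}


# Toy (user_id, book_id, mean_score) rows; all IDs are made up.
rows = [
    ("user_1", "book_a", 4.0),
    ("user_1", "book_b", 3.0),
    ("user_1", "book_c", 2.0),  # dropped: score below 3
    ("user_2", "book_a", 5.0),  # dropped: only one qualifying book
    ("user_3", "book_a", 1.0),
    ("user_3", "book_b", 2.5),
]
baskets = build_baskets(rows)
```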

- Algorithm application

In [46]:
# Apply FP-Growth
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.2)
model = fpGrowth.fit(df_basket)

# Support: probability that all the items of an itemset are purchased together
# Confidence: probability that a user who purchases the antecedent itemset also purchases the consequent book


In [47]:
print("Frequent Itemsets:")
model.freqItemsets.show(truncate=False)

# Count number of Frequent Itemsets
#num_freq_itemsets = model.freqItemsets.count()
#print(f"Number of rows of Frequent Itemsets: {num_freq_itemsets}")

Frequent Itemsets:
+------------------------------------------------+----+
|items                                           |freq|
+------------------------------------------------+----+
|[B000ILIJE0]                                    |689 |
|[B000NWU3I4]                                    |687 |
|[B000NWU3I4, B000ILIJE0]                        |686 |
|[B000PC54NG]                                    |685 |
|[B000PC54NG, B000NWU3I4]                        |684 |
|[B000PC54NG, B000NWU3I4, B000ILIJE0]            |683 |
|[B000PC54NG, B000ILIJE0]                        |684 |
|[B000NWQXBA]                                    |683 |
|[B000NWQXBA, B000PC54NG]                        |683 |
|[B000NWQXBA, B000PC54NG, B000NWU3I4]            |682 |
|[B000NWQXBA, B000PC54NG, B000NWU3I4, B000ILIJE0]|682 |
|[B000NWQXBA, B000PC54NG, B000ILIJE0]            |683 |
|[B000NWQXBA, B000NWU3I4]                        |682 |
|[B000NWQXBA, B000NWU3I4, B000ILIJE0]            |682 |
|[B000NWQXBA, B000ILIJE0]    

- High confidence and high lift: a rule with high confidence (close to 1.0) and a very high lift indicates a strong association between the items in the antecedent and those in the consequent. Such rules are particularly useful for product recommendations.

- Low support, high confidence and high lift: even when support is low (for example, 1% of transactions), a high lift and a confidence close to 1.0 indicate that the rule is highly significant for a small number of transactions.
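The three metrics discussed above can be computed by hand on a toy example. This is a minimal sketch using the standard definitions (support as the fraction of baskets containing an itemset, confidence as support of the rule divided by support of the antecedent, lift as confidence divided by support of the consequent); the baskets and function names are made up for illustration.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    supp_rule = support(baskets, set(antecedent) | set(consequent))
    confidence = supp_rule / support(baskets, antecedent)
    lift = confidence / support(baskets, consequent)
    return supp_rule, confidence, lift


# Eight hypothetical baskets over five items.
baskets = [
    frozenset(b)
    for b in (
        {"a", "b"}, {"a", "b"}, {"a", "c"}, {"c", "d"},
        {"c", "d"}, {"d", "e"}, {"c", "e"}, {"d", "e"},
    )
]
supp, conf, lift = rule_metrics(baskets, {"a"}, {"b"})
```

A lift above 1 means the consequent appears more often alongside the antecedent than it does in baskets overall.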

In [59]:
print("Association Rules:")
model.associationRules.show(truncate=False)

# Count number of Association Rules
num_association_rules = model.associationRules.count()
print(f"Number of rows of Association Rules: {num_association_rules}")

Association Rules:
+------------------------------------------------------------------------+------------+------------------+-----------------+--------------------+
|antecedent                                                              |consequent  |confidence        |lift             |support             |
+------------------------------------------------------------------------+------------+------------------+-----------------+--------------------+
|[B000NDSX6C, B000GQG7D2, B000H9R1Q0, B000PC54NG, B000NWU3I4, B000ILIJE0]|[B000GQG5MA]|0.9640522875816994|78.92355222311484|0.010936051899907321|
|[B000NDSX6C, B000GQG7D2, B000H9R1Q0, B000PC54NG, B000NWU3I4, B000ILIJE0]|[B000Q032UY]|1.0               |79.57227138643069|0.01134383688600556 |
|[B000NDSX6C, B000GQG7D2, B000H9R1Q0, B000PC54NG, B000NWU3I4, B000ILIJE0]|[B000NWQXBA]|1.0               |78.98975109809663|0.01134383688600556 |
|[B000GQG7D2, B000Q032UY, B000PC54NG, B000NWU3I4, B000ILIJE0]            |[B000H9R1Q0]|1.0               

- Considering association rules whose antecedent and consequent are each a single book

In [None]:
association_rules = model.associationRules

# Keep only the rules with exactly one book on each side, then unwrap
# the single-element arrays into plain columns
association_rules_single = (
    association_rules
    .filter((F.size("antecedent") == 1) & (F.size("consequent") == 1))
    .withColumn("antecedent_single", F.col("antecedent")[0])
    .withColumn("consequent_single", F.col("consequent")[0])
)

association_rules_single.select("antecedent_single", "consequent_single", "confidence", "lift", "support").show(truncate=False)
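To make the distinction concrete, keeping single-book rules is not the same as flattening multi-book rules into item pairs, because confidence and lift describe the whole rule rather than each item of its antecedent. The sketch below contrasts the two on hypothetical rules; the data and function names are illustrative only.

```python
# Two hypothetical rules in a structure similar to the association rules table.
rules = [
    {"antecedent": ["b1", "b2"], "consequent": ["b3"], "confidence": 0.9},
    {"antecedent": ["b1"], "consequent": ["b2"], "confidence": 0.8},
]


def single_book_rules(rules):
    """Keep only rules with exactly one book on each side."""
    return [
        (r["antecedent"][0], r["consequent"][0], r["confidence"])
        for r in rules
        if len(r["antecedent"]) == 1 and len(r["consequent"]) == 1
    ]


def flattened_pairs(rules):
    """One row per (antecedent item, consequent item) pair of every rule."""
    return [
        (a, c, r["confidence"])
        for r in rules
        for a in r["antecedent"]
        for c in r["consequent"]
    ]


single = single_book_rules(rules)
pairs = flattened_pairs(rules)
```

Here `pairs` contains rows such as `("b1", "b3", 0.9)`, where 0.9 is the confidence of the two-book rule and not of a rule from `b1` alone.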



#### 4.2 Apriori Algorithm
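As a reference for what Apriori does conceptually, here is a minimal pure-Python sketch of its level-wise search. It is not the mlxtend implementation used in the following cells, and the function name `apriori_sketch` and the toy baskets are assumptions for illustration.

```python
from itertools import combinations


def apriori_sketch(baskets, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(baskets)
    min_count = min_support * n

    # Level 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Candidates of size k: unions of frequent (k-1)-itemsets whose
        # (k-1)-subsets are all frequent (the Apriori pruning property).
        candidates = set()
        for s1, s2 in combinations(list(current), 2):
            union = s1 | s2
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        counts = {c: sum(c <= basket for basket in baskets) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1

    return {s: c / n for s, c in frequent.items()}


toy_baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
toy_frequent = apriori_sketch(toy_baskets, min_support=0.6)
```

The sketch rescans all baskets at every level, which is the main cost of Apriori compared with FP-Growth, which builds a compact tree instead.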

In [49]:
#!pip install mlxtend

In [50]:
# Group the books by user
transactions = df_filtered.groupBy("User_id").agg(collect_set("Id").alias("books"))

In [51]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

In [52]:
pandas_df = transactions.toPandas()

In [53]:
pandas_df.head(5)

|   | User_id | books |
|---|---|---|
| 0 | A0236983QUCQMORABO03 | [1587888432, 1587888408, 1593352077] |
| 1 | A025268923L497N34PUMH | [B000P0UDX4, B00005UVH9, B0006IU3EE] |
| 2 | A07084061WTSSXN6VLV92 | [0808510002, B000Q34B8I, 0521639522, B000FC1BY... |
| 3 | A100Q4BGPV187I | [0743236017, 0743554884] |
| 4 | A100TQ7ZRE0W02 | [0971237034, 0976325608, 0971237018, 0972800522] |


In [54]:
#!pip install scipy

In [55]:
from scipy.sparse import lil_matrix

In [56]:
# Build the set of unique books
unique_books = set(item for sublist in pandas_df["books"] for item in sublist)

# Map each book to a column index
book_index = {book: idx for idx, book in enumerate(unique_books)}

# Create an empty sparse matrix in LIL format (one row per user, one column per book)
sparse_matrix = lil_matrix((len(pandas_df), len(unique_books)), dtype=bool)

In [57]:
# Populate the sparse matrix
for row_idx, books in enumerate(pandas_df["books"]):
    for book in books:
        col_idx = book_index[book]
        sparse_matrix[row_idx, col_idx] = 1

# Convert the sparse matrix into a pandas DataFrame
encoded_df = pd.DataFrame.sparse.from_spmatrix(sparse_matrix, columns=book_index.keys())




In [58]:
# Run the Apriori algorithm
frequent_itemsets = apriori(encoded_df, min_support=0.01, use_colnames=True)

# Generate the association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)

# Print the results
print("Frequent itemsets:")
# Note: a bare expression in the middle of a cell is not displayed
frequent_itemsets

print("\nAssociation rules:")
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

Frequent itemsets:

Association rules:


|   | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (B000ILIJE0) | (B000NWU3I4) | 0.012715 | 0.995646 | 78.187910 |
| 1 | (B000NWU3I4) | (B000ILIJE0) | 0.012715 | 0.998544 | 78.187910 |
| 2 | (B000ILIJE0) | (B000NDSX6C) | 0.011900 | 0.931785 | 78.301887 |
| 3 | (B000NDSX6C) | (B000ILIJE0) | 0.011900 | 1.000000 | 78.301887 |
| 4 | (B000ILIJE0) | (B000GQG5MA) | 0.012196 | 0.955007 | 78.183068 |
| ... | ... | ... | ... | ... | ... |
| 18655 | (B000NDSX6C) | (B000Q032UY, B000ILIJE0, B000H9R1Q0, B000NWU3I... | 0.010936 | 0.919003 | 78.823876 |
| 18656 | (B000GQG5MA) | (B000Q032UY, B000ILIJE0, B000H9R1Q0, B000NWU3I... | 0.010936 | 0.895296 | 78.923552 |
| 18657 | (B000GQG7D2) | (B000Q032UY, B000ILIJE0, B000H9R1Q0, B000NWU3I... | 0.010936 | 0.887218 | 78.726009 |
| 18658 | (B000PC54NG) | (B000Q032UY, B000ILIJE0, B000H9R1Q0, B000NWU3I... | 0.010936 | 0.861314 | 78.759124 |
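The reported lift values can be sanity-checked from the other columns. Since the support of B equals the support of the pair divided by the confidence of B → A, lift(A → B) can be rewritten as conf(A → B) · conf(B → A) / supp(A ∪ B). The sketch below applies this identity to rows 0-1 and 2-3 of the rules table above; the helper name is hypothetical, and small differences are expected because the printed values are rounded.

```python
def lift_from_confidences(support_ab, conf_a_to_b, conf_b_to_a):
    """lift(A -> B) = conf(A -> B) * conf(B -> A) / supp(A u B)."""
    return conf_a_to_b * conf_b_to_a / support_ab


# (support, conf A -> B, conf B -> A, reported lift), copied from the table above.
reported = [
    (0.012715, 0.995646, 0.998544, 78.187910),  # rows 0 and 1
    (0.011900, 0.931785, 1.000000, 78.301887),  # rows 2 and 3
]

checks = [
    (lift_from_confidences(supp, c_ab, c_ba), lift)
    for supp, c_ab, c_ba, lift in reported
]
for recomputed, lift in checks:
    print(f"recomputed lift = {recomputed:.4f}, reported lift = {lift:.4f}")
```

The identity also explains why both directions of a rule share the same lift: the expression is symmetric in A and B.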


In [None]:
# Stop the Spark session
#spark.stop()