



---

# MARKET-BASKET ANALYSIS

## Massive Algorithm 
### Data Science for Economics

##### Angelica Longo, Melissa Rizzi

The goal of this project is to implement a system for **detecting frequent itemsets**, commonly known as **market-basket analysis**.
In this notebook, the detector treats each user’s reviewed books as a basket, with books serving as items.

The project is based on the **[Amazon Books Review](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews)** dataset, published on Kaggle under the public domain CC0 license. Data is downloaded during the execution of the scripts via an API and contains variables related to users and their reviews of purchased books.

Given the large volume of data (about 3 million rows), we use **PySpark** to create a subsample of approximately 500,000 rows, while keeping the pipeline scalable to the full dataset.

The project is structured as follows:

- **Preprocessing** – This phase includes data cleaning, checking data integrity, handling null values, removing duplicates, and computing the overall mean to verify consistency with the selected subsample.
- **Subsampling** – A subset of data is created while maintaining a representative distribution of user choices and ratings.
- **Frequent Itemset Mining** – The final step involves implementing an algorithm to identify frequent itemsets within the dataset.

This structured approach ensures both **efficiency** and **scalability** while maintaining **data integrity**.

### Table of Contents
- [1. Data Import](#1-data-import)
- [2. Data PreProcessing](#2-data-preprocessing)
  - [2.1 Data Integrity](#21-data-integrity)
  - [2.2 Missing Data](#22-missing-data)
  - [2.3 Data Duplicates](#23-data-duplicates)
  - [2.4 Rating Means](#24-rating-means)
- [3. Subsample Creation](#3-subsample-creation)
- [4. Frequent Itemset Mining](#4-frequent-itemset-mining)


---
### 1. Data Import

In [1]:
#import os
#import zipfile

In [2]:
#os.environ['KAGGLE_USERNAME'] = "<your-kaggle-username>"
#os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

In [3]:
#!pip install kaggle



In [4]:
#!kaggle datasets download -d mohamedbakhet/amazon-books-reviews

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0
Downloading amazon-books-reviews.zip to C:\Users\angel\Desktop\GitHub\Massive-Data




  

In [5]:
#with zipfile.ZipFile("amazon-books-reviews.zip", 'r') as zip_ref:
#    zip_ref.extractall("amazon_books_data")

---
### 2. Data PreProcessing

In [6]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, min, max, sum, when, collect_set
from pyspark.sql.types import DoubleType
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

In [7]:
# Create Spark Session
spark = SparkSession.builder.appName("MapReduce").getOrCreate()

In [8]:
# Import data
data = spark.read.csv("amazon_books_data/Books_rating.csv", header=True, inferSchema=True)
#data.show(5)

+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|        Id|               Title|Price|       User_id|         profileName|review/helpfulness|review/score|review/time|      review/summary|         review/text|
+----------+--------------------+-----+--------------+--------------------+------------------+------------+-----------+--------------------+--------------------+
|1882931173|Its Only Art If I...| NULL| AVCGYZL8FQQTD|"Jim of Oz ""jim-...|               7/7|         4.0|  940636800|Nice collection o...|This is only for ...|
|0826414346|Dr. Seuss: Americ...| NULL|A30TK6U7DNS82R|       Kevin Killian|             10/10|         5.0| 1095724800|   Really Enjoyed It|I don't care much...|
|0826414346|Dr. Seuss: Americ...| NULL|A3UH4UZ4RSVO82|        John Granger|             10/11|         5.0| 1078790400|Essential for eve...|"If people become...|
|0826414346|Dr. Seuss: Ameri

In [9]:
# Select only the useful columns
df = data.select("Id", "Title", "User_id", "review/score", "review/text").withColumnRenamed("review/score", "score")
#df.show(5)

+----------+--------------------+--------------+-----+--------------------+
|        Id|               Title|       User_id|score|         review/text|
+----------+--------------------+--------------+-----+--------------------+
|1882931173|Its Only Art If I...| AVCGYZL8FQQTD|  4.0|This is only for ...|
|0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|  5.0|I don't care much...|
|0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|  5.0|"If people become...|
|0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|  4.0|Theodore Seuss Ge...|
|0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|  4.0|"Philip Nel - Dr....|
+----------+--------------------+--------------+-----+--------------------+
only showing top 5 rows



#### 2.1 Data Integrity

In [10]:
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- score: string (nullable = true)
 |-- review/text: string (nullable = true)



In [11]:
# Cast the 'score' column to double type
df = df.withColumn("score", col("score").cast(DoubleType()))
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- score: double (nullable = true)
 |-- review/text: string (nullable = true)



In [12]:
# Check score range
df.select(min(col("score")).alias("min_score"), max(col("score")).alias("max_score")).show()

+---------+----------+
|min_score| max_score|
+---------+----------+
|      1.0|1.295568E9|
+---------+----------+



In [14]:
# Keep just data with the 'score' values in the correct range [1, 5]
df = df.filter((col("score") >= 1) & (col("score") <= 5))
df.select(min("score").alias("min_score"), max("score").alias("max_score")).show()

+---------+---------+
|min_score|max_score|
+---------+---------+
|      1.0|      5.0|
+---------+---------+



#### 2.2 Missing Data

In [15]:
# Count null values for each variable
null_counts = df.select(
    [sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]
)

null_counts.show()

+---+-----+-------+-----+-----------+
| Id|Title|User_id|score|review/text|
+---+-----+-------+-----+-----------+
|  0|  196| 561492|    0|          9|
+---+-----+-------+-----+-----------+



What stands out right away, especially for the purpose of our analysis, is the large number of missing values in the `User_id` variable (561,492). One possible reason is that reviewers who are not registered have no user ID. Our goal is to identify baskets of items purchased by the same user, and without a user ID a review cannot be assigned to any basket.

We explored using profile names instead, assigning a dummy ID to users with the same name. However, different users can share a name, so the results might be inaccurate. Moreover, there were more missing profile names than missing user IDs, which made this solution unfeasible. We therefore decided to **drop the rows with missing values**, as we could not identify a suitable method to replace them.

In [16]:
# Remove null values
df_clean = df.dropna()
#df_clean.show(5)

+----------+--------------------+--------------+-----+--------------------+
|        Id|               Title|       User_id|score|         review/text|
+----------+--------------------+--------------+-----+--------------------+
|1882931173|Its Only Art If I...| AVCGYZL8FQQTD|  4.0|This is only for ...|
|0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|  5.0|I don't care much...|
|0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|  5.0|"If people become...|
|0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|  4.0|Theodore Seuss Ge...|
|0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|  4.0|"Philip Nel - Dr....|
+----------+--------------------+--------------+-----+--------------------+
only showing top 5 rows



In [17]:
# Check data size
n_rows = df.count()
n_rows_clean = df_clean.count()
print(f"Number of Rows - Before cleaning: {n_rows}")
print(f"Number of Rows - After cleaning: {n_rows_clean}")

Number of Rows - Before cleaning: 2981912
Number of Rows - After cleaning: 2420235


#### 2.3 Data Duplicates

In [18]:
# Remove duplicated rows
df_clean = df_clean.dropDuplicates()

n_rows_clean = df_clean.count()
print(f"Number of Rows - After duplicates removal: {n_rows_clean}")

Number of Rows - After duplicates removal: 2398220


Up until now, we’ve performed a general cleaning of the dataset. From here on, we’ll focus exclusively on the three columns that are relevant to our analysis (Id, User_id, and score), forming a new dataset: df_short.

In [19]:
# Remove useless columns
df_short = df_clean.select("Id", "User_id","score")
#df_short.show(5)

+----------+--------------+-----+
|        Id|       User_id|score|
+----------+--------------+-----+
|0809080699|A29LG535LJRITI|  5.0|
|0671551345|A3CLKX8W3F1L1D|  3.0|
|0671551345| A4FX5YCJA630V|  3.0|
|B000MCKQRS| A1237ROTM7659|  4.0|
|B000890HE2|A2JYUAIAUYXCQN|  5.0|
+----------+--------------+-----+
only showing top 5 rows



In [20]:
# Remove duplicated rows over the three selected variables
df_short = df_short.dropDuplicates()

Given that the same user could have rated the same book twice, we want to compute the mean of the different scores given by the same user to the same book.
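
To make the averaging step concrete, here is a minimal plain-Python sketch on toy data, without Spark. The review tuples and the names `reviews`, `scores_by_pair` and `mean_scores` are hypothetical and only illustrate the idea of collapsing repeated (book, user) pairs into one row with a mean score.

```python
from collections import defaultdict

# Hypothetical toy reviews: (book_id, user_id, score)
reviews = [
    ("book_a", "user_1", 5.0),
    ("book_a", "user_1", 3.0),  # same user rated the same book twice
    ("book_b", "user_1", 4.0),
    ("book_a", "user_2", 2.0),
]

# Collect every score given to each (book, user) pair
scores_by_pair = defaultdict(list)
for book_id, user_id, score in reviews:
    scores_by_pair[(book_id, user_id)].append(score)

# One row per pair, carrying the mean of its scores
mean_scores = {pair: sum(s) / len(s) for pair, s in scores_by_pair.items()}

for pair, mean_score in sorted(mean_scores.items()):
    print(pair, mean_score)
```

Pairs that appear only once keep their original score, so the step changes only the rows that were actually duplicated.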

In [21]:
# Find duplicates considering only 'Id' and 'User_id'
duplicati = df_short.groupBy("Id", "User_id").count().filter("count > 1")
duplicati.show(5)

+----------+--------------+-----+
|        Id|       User_id|count|
+----------+--------------+-----+
|0451518713|A2TZZQUHX0PVN4|    2|
|B000K0DB8I|A2OJH3S0SUNVGE|    2|
|B000Q56SO6|A23LDF7TIIKTCY|    2|
|B000GSKM2M|A1RJD10TTI568L|    2|
|B0006W43TQ| A81F0YW06W5VQ|    2|
+----------+--------------+-----+
only showing top 5 rows



In [22]:
# Compute average score for every (Id, User_id)
score_mean = df_short.groupBy('Id', 'User_id').agg(F.mean('score').alias('mean_score'))
df_final = df_short.join(score_mean, on=['Id', 'User_id'], how='left')

# Creation of the final preprocessed dataset
df_final = df_final.select('Id','User_id', 'mean_score')
df_final = df_final.dropDuplicates()
#df_final.show(5)

+----------+--------------+-----+----------+
|        Id|       User_id|score|mean_score|
+----------+--------------+-----+----------+
|1597400602|A3DKP67DK28RUB|  5.0|       5.0|
|B0007H4QBK|A3MVU8X8EC9VRT|  5.0|       5.0|
|B0007H4QBK| AFP82QXWXAG2V|  5.0|       5.0|
|B000MCKQRS|A1TQL7XMTVF4JG|  3.0|       3.0|
|B000O3QCH8|A1T97PDD7JYCUA|  1.0|       1.0|
+----------+--------------+-----+----------+
only showing top 5 rows



In [24]:
n_rows_final = df_final.count()
print(f"Number of Rows - Final dataset: {n_rows_final}")

Number of Rows - Final dataset: 2380153


#### 2.4 Rating Means

We want to calculate the overall average score to see if consistency is maintained after creating the subsample.

- Overall mean score:

In [25]:
df_final = df_final.withColumn("mean_score", F.col("mean_score").cast("double"))

overall_mean = df_final.agg(F.avg("mean_score")).collect()[0][0]
print(f"Overall mean score - Final dataset: {overall_mean}")

Overall mean score - Final dataset: 4.22386130919595


- Mean score for each item:

In [26]:
#score_per_id = df_final.groupBy("Id").agg(F.avg("mean_score").alias("avg_score_pre"))
#score_per_id.show(5)

+----------+-------------+
|        Id|avg_score_pre|
+----------+-------------+
|0027861317|        4.625|
|0028622480|          4.0|
|0029267358|          4.0|
|0060929081|         4.24|
|0071409807|         3.75|
+----------+-------------+
only showing top 5 rows



In [27]:
# Check data integrity
#score_per_id_above_5 = score_per_id.filter(F.col("avg_score_pre") > 5)
#score_per_id_above_5.show(5)

+---+-------------+
| Id|avg_score_pre|
+---+-------------+
+---+-------------+



---
### 3. Subsample Creation

We aim to create a subsample that remains consistent with the original dataset. To achieve this, we select a fraction of users while ensuring that all their reviews are included. This approach allows us to better represent their purchasing behavior and rating patterns, preserving the integrity of the data.
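
The idea of sampling users rather than rows can be sketched in plain Python as follows. This is an illustration on hypothetical toy data, not the Spark code used in the notebook: the function `sample_by_user` and its arguments are assumptions. Each user is kept independently with probability `fraction`, which is also how sampling without replacement behaves in Spark, so the sampled share is only approximately equal to the requested fraction.

```python
import random

# Hypothetical toy reviews: (user_id, book_id, score)
reviews = [
    ("u1", "b1", 5.0), ("u1", "b2", 4.0),
    ("u2", "b1", 3.0),
    ("u3", "b2", 5.0), ("u3", "b3", 4.0), ("u3", "b4", 2.0),
    ("u4", "b1", 4.0),
    ("u5", "b3", 5.0), ("u5", "b4", 5.0),
]

def sample_by_user(rows, fraction, seed):
    """Keep each user with probability `fraction`, then keep all of that user's rows."""
    rng = random.Random(seed)
    users = sorted({user for user, _, _ in rows})
    kept_users = {user for user in users if rng.random() < fraction}
    return [row for row in rows if row[0] in kept_users]

sampled = sample_by_user(reviews, fraction=0.5, seed=42)
print(sampled)
```

Because whole users are kept or dropped, every basket in the subsample is complete, which row-level sampling would not guarantee.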

In [34]:
num_users = df_final.select("User_id").distinct().count()
print(f"Total number of different users - Original dataset: {num_users}")

Total number of different users - Original dataset: 1004214


In [58]:
# Keep just 20% of the users
sample_fraction = 0.2
user_sample = df_final.select("User_id").distinct().sample(fraction=sample_fraction, seed=42)

In [59]:
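# Create the subsample with the selected users
# (mean_score is renamed to score, matching the schema and the cells below)
df_sampled = df_final.join(user_sample, on="User_id", how="inner").withColumnRenamed("mean_score", "score")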
#df_sampled.show(5)

In [60]:
# Check subsample size
n_rows_sample = df_sampled.count()
print(f"Number of Rows - Sample: {n_rows_sample}")

Number of Rows - Sample: 510146


In [61]:
# Check data integrity
df_sampled.printSchema()

root
 |-- User_id: string (nullable = true)
 |-- Id: string (nullable = true)
 |-- score: double (nullable = true)



In [62]:
# Check mean coherence with the original dataset
overall_mean_sample = df_sampled.agg(F.avg("score")).collect()[0][0]
print(f"Overall mean - Sample: {overall_mean_sample}")

Overall mean - Sample: 4.226667137120224


The overall mean of the subsample (about 4.227) is consistent with the overall mean of the full preprocessed dataset (about 4.224).

---
### 4. Frequent Itemset Mining

#### 4.1 FP-Growth Algorithm

We considered only books that received a score of at least 3.

In [None]:
# Filter and keep just rows with rating >= 3
df_filtered = df_sampled.filter(col("score") >= 3)

- Creation of baskets of items

In [None]:
# Create baskets of items for every user
df_basket = df_filtered.groupBy("User_id").agg(collect_set("Id").alias("items"))
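
As an illustration of what the baskets look like and what a minimum support threshold selects, here is a plain-Python sketch on hypothetical toy data. It counts itemsets by brute force, which is not FP-Growth and would not scale to the real dataset. The names `pairs`, `baskets` and `frequent_itemsets` are assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy (user_id, book_id) pairs, already filtered by score
pairs = [
    ("u1", "b1"), ("u1", "b2"), ("u1", "b3"),
    ("u2", "b1"), ("u2", "b2"),
    ("u3", "b1"), ("u3", "b2"), ("u3", "b1"),  # a repeated book collapses in a set
    ("u4", "b3"),
]

# One basket per user: the set of distinct books, similar to collect_set
baskets = defaultdict(set)
for user_id, book_id in pairs:
    baskets[user_id].add(book_id)

def frequent_itemsets(baskets, min_support, max_size=2):
    """Brute-force count of itemsets up to `max_size` items; fine only for toy data."""
    n = len(baskets)
    counts = Counter()
    for items in baskets.values():
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    return {itemset: c for itemset, c in counts.items() if c / n >= min_support}

result = frequent_itemsets(baskets, min_support=0.5)
print(result)
```

With four baskets and `min_support=0.5`, an itemset must appear in at least two baskets to be reported. The same logic applies to `minSupport` in the cell that follows, only relative to the number of user baskets.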

- Algorithm application

In [85]:
# Apply FP-Growth
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.2)
model = fpGrowth.fit(df_basket)

# Support: probability that a basket contains the whole itemset
# Confidence: probability that a basket containing the antecedent also contains the consequent book


Frequent Itemsets:
+------------------------------------------------+----+
|items                                           |freq|
+------------------------------------------------+----+
|[B000ILIJE0]                                    |1051|
|[B000NWU3I4]                                    |1046|
|[B000NWU3I4, B000ILIJE0]                        |1045|
|[B000PC54NG]                                    |1037|
|[B000PC54NG, B000NWU3I4]                        |1036|
|[B000PC54NG, B000NWU3I4, B000ILIJE0]            |1035|
|[B000PC54NG, B000ILIJE0]                        |1036|
|[B000NWQXBA]                                    |1033|
|[B000NWQXBA, B000PC54NG]                        |1033|
|[B000NWQXBA, B000PC54NG, B000NWU3I4]            |1032|
|[B000NWQXBA, B000PC54NG, B000NWU3I4, B000ILIJE0]|1032|
|[B000NWQXBA, B000PC54NG, B000ILIJE0]            |1033|
|[B000NWQXBA, B000NWU3I4]                        |1032|
|[B000NWQXBA, B000NWU3I4, B000ILIJE0]            |1032|
|[B000NWQXBA, B000ILIJE0]    

In [None]:
print("Frequent Itemsets:")
model.freqItemsets.show(truncate=False)

# Count number of Frequent Itemsets
num_freq_itemsets = model.freqItemsets.count()
print(f"Number of rows of Frequent Itemsets: {num_freq_itemsets}")

- High confidence and high lift: a rule with high confidence (close to 1.0) and a very high lift value indicates a strong correlation between the items in the antecedent and those in the consequent. These rules are particularly useful for product recommendations.

- Low support, high confidence and high lift: even if the support is low (for example, 1% of the transactions), a high lift and a confidence close to 1.0 indicate that the rule is very significant for a small number of transactions.
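
To show how the three metrics relate, the following plain-Python sketch computes them for one rule on hypothetical toy baskets, using the standard definitions: support of the rule is the share of baskets containing antecedent and consequent together, confidence divides that by the support of the antecedent, and lift divides confidence by the support of the consequent. The functions `support` and `rule_metrics` are assumptions made for this example.

```python
# Hypothetical toy baskets
baskets = [
    {"b1", "b2"},
    {"b1", "b2"},
    {"b1", "b2", "b3"},
    {"b3"},
    {"b3", "b4"},
    {"b4"},
    {"b4"},
    {"b4"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    joint = support(antecedent | consequent, baskets)
    confidence = joint / support(antecedent, baskets)
    lift = confidence / support(consequent, baskets)
    return joint, confidence, lift

sup, conf, lift = rule_metrics({"b1"}, {"b2"}, baskets)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

In this toy case the rule covers 3 of 8 baskets, every basket with `b1` also contains `b2`, and the lift is 8/3: the consequent is about 2.7 times more likely given the antecedent than in a random basket.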

In [86]:
print("Association Rules:")
model.associationRules.show(truncate=False)

# Count number of Association Rules
num_association_rules = model.associationRules.count()
print(f"Number of rows of Association Rules: {num_association_rules}")

Number of rows of Frequent Itemsets: 511
Number of rows of Association Rules: 2295


- Expanding the association rules so that each row shows a single antecedent book and a single consequent book (the metrics still refer to the full rule)

In [89]:
association_rules = model.associationRules

association_rules_single = association_rules.withColumn(
    "antecedent_single", 
    F.explode(association_rules.antecedent)
).withColumn(
    "consequent_single", 
    F.explode(association_rules.consequent)
)

association_rules_single.select("antecedent_single", "consequent_single", "confidence", "lift", "support").show(truncate=False)


+-----------------+-----------------+------------------+-----------------+--------------------+
|antecedent_single|consequent_single|confidence        |lift             |support             |
+-----------------+-----------------+------------------+-----------------+--------------------+
|B000NDSX6C       |B000GQG5MA       |0.9646302250803859|77.10227000047992|0.011203923863112948|
|B000GQG7D2       |B000GQG5MA       |0.9646302250803859|77.10227000047992|0.011203923863112948|
|B000H9R1Q0       |B000GQG5MA       |0.9646302250803859|77.10227000047992|0.011203923863112948|
|B000PC54NG       |B000GQG5MA       |0.9646302250803859|77.10227000047992|0.011203923863112948|
|B000NWU3I4       |B000GQG5MA       |0.9646302250803859|77.10227000047992|0.011203923863112948|
|B000ILIJE0       |B000GQG5MA       |0.9646302250803859|77.10227000047992|0.011203923863112948|
|B000NDSX6C       |B000Q032UY       |1.0               |78.0651117589893 |0.011614734404760423|
|B000GQG7D2       |B000Q032UY       |1.0
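
The repeated confidence and lift values in the table above come from the way the two arrays are exploded. The plain-Python sketch below, on hypothetical toy rules, contrasts that expansion with an alternative that keeps only rules whose antecedent is a single book. The dictionaries only mimic the `antecedent`, `consequent` and `confidence` columns; they are not real output.

```python
from itertools import product

# Hypothetical toy rules mimicking the associationRules columns
rules = [
    {"antecedent": ["b1", "b2"], "consequent": ["b3"], "confidence": 0.9},
    {"antecedent": ["b1"], "consequent": ["b3"], "confidence": 0.6},
]

# Exploding both arrays yields one row per (antecedent item, consequent item)
# pair; each row inherits the metrics of the whole rule
exploded = [
    (a, c, rule["confidence"])
    for rule in rules
    for a, c in product(rule["antecedent"], rule["consequent"])
]

# Alternative: keep only rules whose antecedent and consequent are single books
single = [
    (rule["antecedent"][0], rule["consequent"][0], rule["confidence"])
    for rule in rules
    if len(rule["antecedent"]) == 1 and len(rule["consequent"]) == 1
]

print(exploded)
print(single)
```

In the exploded list the pair `b1`, `b3` appears twice with different confidences, because one row comes from the two-book rule. If genuinely single-book rules are wanted in PySpark, one option, offered here as a suggestion rather than the notebook's method, is to filter on `F.size("antecedent") == 1` before selecting the columns.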

To do:
- Apriori algorithm
- other algorithms?
- dashboard?
- report