



---

# MARKET-BASKET ANALYSIS

## Massive Algorithm 
### Data Science for Economics

##### Angelica Longo, Melissa Rizzi

The goal of this project is to implement a system for **detecting frequent itemsets**, commonly known as **market-basket analysis**.
In this notebook, the detector treats each user’s reviewed books as a basket, with books serving as items.

The project is based on the **[Amazon Books Review](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews)** dataset, published on Kaggle under the public domain CC0 license. Data is downloaded during the execution of the scripts via an API and contains variables related to users and their reviews of purchased books.

Given the large volume of data (3 million rows), a reasonable subsample is created using **PySpark**, consisting of approximately 500,000 rows, while ensuring scalability for the full dataset.

The project is structured as follows:

- **Preprocessing** – This phase includes data cleaning, checking data integrity, handling null values, removing duplicates, and computing the overall mean to verify consistency with the selected subsample.
- **Subsampling** – A subset of data is created while maintaining a representative distribution of user choices and ratings.
- **Frequent Itemset Mining** – The final step involves implementing an algorithm to identify frequent itemsets within the dataset.

This structured approach ensures both **efficiency** and **scalability** while maintaining **data integrity**.
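To make the basket framing concrete, here is a minimal plain-Python sketch on made-up data. It groups books by user and counts how many baskets contain each pair of books. The toy `reviews` list, the helper names `build_baskets` and `frequent_pairs`, and the `min_support` threshold are all assumptions for illustration. This is a naive pair count, not necessarily the algorithm implemented later in the notebook, and it does not use Spark.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy reviews: (user_id, book_id) pairs, not taken from the dataset.
reviews = [
    ("u1", "A"), ("u1", "B"), ("u1", "C"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"),
    ("u4", "B"),
]

def build_baskets(pairs):
    """Group book ids by user: one basket (a set of items) per user."""
    baskets = defaultdict(set)
    for user_id, book_id in pairs:
        baskets[user_id].add(book_id)
    return dict(baskets)

def frequent_pairs(baskets, min_support):
    """Count the baskets containing each pair; keep pairs with count >= min_support."""
    counts = Counter()
    for items in baskets.values():
        # Sort so that each pair has a single canonical representation.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = build_baskets(reviews)
pairs = frequent_pairs(baskets, min_support=2)
print(pairs)  # {('A', 'B'): 2, ('A', 'C'): 2}
```

The pair ('B', 'C') appears in only one basket, so it falls below the threshold of 2 and is discarded.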

### Table of Contents
- [1. Data Import](#1-data-import)
- [2. Data PreProcessing](#2-data-preprocessing)
  - [2.1 Data Integrity](#21-data-integrity)
  - [2.2 Missing Data](#22-missing-data)
  - [2.3 Data Duplicates](#23-data-duplicates)
  - [2.4 Rating Means](#24-rating-means)
- [3. Subsample Creation](#3-subsample-creation)
- [4. Algorithm Implementation](#4-algorithm-implementation)


---
### 1. Data Import

In [1]:
#import os
#import zipfile

In [2]:
#os.environ['KAGGLE_USERNAME'] = "melissarizzi"
#os.environ['KAGGLE_KEY'] = "<YOUR_KAGGLE_API_KEY>"  # never commit a real key

In [3]:
#!pip install kaggle

In [4]:
#!kaggle datasets download -d mohamedbakhet/amazon-books-reviews

In [5]:
#with zipfile.ZipFile("amazon-books-reviews.zip", 'r') as zip_ref:
#    zip_ref.extractall("amazon_books_data")

---
### 2. Data PreProcessing

In [6]:
# Import necessary libraries
# Note: min, max and sum imported below shadow Python's built-ins in this notebook
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, min, max, sum, when
from pyspark.sql.types import DoubleType
from pyspark.sql import functions as F

In [7]:
# Create Spark Session
spark = SparkSession.builder.appName("MapReduce").getOrCreate()

In [8]:
# Import data
data = spark.read.csv("amazon_books_data/Books_rating.csv", header=True, inferSchema=True)
data.show(5)

In [10]:
# Select only the columns needed for the analysis
df = data.select("Id", "Title", "User_id", "review/score", "review/text").withColumnRenamed("review/score", "score")
df.show(5)

#### 2.1 Data Integrity

In [12]:
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- score: string (nullable = true)
 |-- review/text: string (nullable = true)



In [13]:
# Cast the 'score' column to double
df = df.withColumn("score", col("score").cast(DoubleType()))
df.printSchema()

root
 |-- Id: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- score: double (nullable = true)
 |-- review/text: string (nullable = true)



In [14]:
# Check score range
df.select(min(col("score")).alias("min_score"), max(col("score")).alias("max_score")).show()

+---------+----------+
|min_score| max_score|
+---------+----------+
|      1.0|1.295568E9|
+---------+----------+



In [15]:
df.select("score").distinct().show()

+-----------+
|      score|
+-----------+
|        1.0|
|        4.0|
|      19.95|
|       NULL|
|        3.0|
|        2.0|
|        5.0|
|      327.0|
| 1.295568E9|
|1.2089952E9|
|  1.21176E9|
+-----------+



In [16]:
# Keep only rows whose 'score' lies in the valid range [1, 5]
df = df.filter((col("score") >= 1) & (col("score") <= 5))
df.select(min("score").alias("min_score"), max("score").alias("max_score")).show()

+---------+---------+
|min_score|max_score|
+---------+---------+
|      1.0|      5.0|
+---------+---------+



#### 2.2 Missing Data

In [18]:
# Count null values for each variable
null_counts = df.select(
    [sum(when(col(c).isNull(), 1).otherwise(0)).alias(c) for c in df.columns]
)

null_counts.show()

+---+-----+-------+-----+-----------+
| Id|Title|User_id|score|review/text|
+---+-----+-------+-----+-----------+
|  0|  196| 561492|    0|          9|
+---+-----+-------+-----+-----------+



What stands out right away is the large number of missing values in the User_id variable (561,492 rows). One possible reason is that reviewers who are not registered have no user ID. Our goal is to identify baskets of books reviewed by the same user, so rows without a user ID cannot be used. We explored using profile names instead, assigning a dummy ID to users with the same name. However, this could give inaccurate results because different users can share a name, and there were more missing profile names than missing user IDs, which made this solution unfeasible. We therefore decided to **drop the missing values**, as we could not identify a suitable way to replace them. Note that dropna() with no arguments also removes the few rows with a missing Title or review/text.

In [19]:
# Remove null values
df_clean = df.dropna()
df_clean.show(5)

+----------+--------------------+--------------+-----+--------------------+
|        Id|               Title|       User_id|score|         review/text|
+----------+--------------------+--------------+-----+--------------------+
|1882931173|Its Only Art If I...| AVCGYZL8FQQTD|  4.0|This is only for ...|
|0826414346|Dr. Seuss: Americ...|A30TK6U7DNS82R|  5.0|I don't care much...|
|0826414346|Dr. Seuss: Americ...|A3UH4UZ4RSVO82|  5.0|"If people become...|
|0826414346|Dr. Seuss: Americ...|A2MVUWT453QH61|  4.0|Theodore Seuss Ge...|
|0826414346|Dr. Seuss: Americ...|A22X4XUPKF66MR|  4.0|"Philip Nel - Dr....|
+----------+--------------------+--------------+-----+--------------------+
only showing top 5 rows



In [20]:
# Check data size
n_rows = df.count()
n_rows_clean = df_clean.count()
print(f"Number of Rows - Before cleaning: {n_rows}")
print(f"Number of Rows - After cleaning: {n_rows_clean}")

Number of Rows - Before cleaning: 2981912
Number of Rows - After cleaning: 2420235


#### 2.3 Data Duplicates

In [21]:
# Remove duplicated rows
df_clean = df_clean.dropDuplicates()

n_rows_clean = df_clean.count()
print(f"Number of Rows - After duplicates removal: {n_rows_clean}")

Number of Rows - After duplicates removal: 2398220


Up until now, we’ve performed a general cleaning of the dataset. From here on, we’ll focus exclusively on the three columns that are relevant to our analysis (Id, User_id, and score), forming a new dataset: df_short.

In [22]:
# Remove useless columns
df_short = df_clean.select("Id", "User_id","score")
df_short.show(5)

In [24]:
# Remove duplicated rows across the three selected variables
df_short = df_short.dropDuplicates()

Since the same user may have rated the same book more than once, possibly with different scores, we compute the mean of the scores given by each user to each book, so that every (Id, User_id) pair appears only once.
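As a small illustration of this step, the sketch below averages repeated ratings per (book, user) pair in plain Python. The toy `ratings` list and the helper name `mean_score_per_pair` are assumptions for illustration. The notebook itself performs the equivalent aggregation in Spark with a grouped mean.

```python
from collections import defaultdict

# Hypothetical toy ratings: (book_id, user_id, score), not taken from the dataset.
ratings = [
    ("b1", "u1", 5.0),
    ("b1", "u1", 3.0),  # the same user rates the same book again
    ("b2", "u1", 4.0),
    ("b1", "u2", 2.0),
]

def mean_score_per_pair(rows):
    """Return {(book_id, user_id): mean score}, one entry per distinct pair."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per pair
    for book_id, user_id, score in rows:
        acc = totals[(book_id, user_id)]
        acc[0] += score
        acc[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

mean_scores = mean_score_per_pair(ratings)
print(mean_scores[("b1", "u1")])  # 4.0
```

Because the grouped result already holds exactly one entry per pair, it can serve directly as the deduplicated dataset.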

In [25]:
# Find duplicates considering only 'Id' and 'User_id'
duplicati = df_short.groupBy("Id", "User_id").count().filter("count > 1")
duplicati.show(5)

+----------+--------------+-----+
|        Id|       User_id|count|
+----------+--------------+-----+
|0451518713|A2TZZQUHX0PVN4|    2|
|B000K0DB8I|A2OJH3S0SUNVGE|    2|
|B000Q56SO6|A23LDF7TIIKTCY|    2|
|B000GSKM2M|A1RJD10TTI568L|    2|
|B0006W43TQ| A81F0YW06W5VQ|    2|
+----------+--------------+-----+
only showing top 5 rows



In [26]:
# Compute average score for every (Id, User_id)
score_mean = df_short.groupBy('Id', 'User_id').agg(F.mean('score').alias('mean_score'))

df_final = df_short.join(score_mean, on=['Id', 'User_id'], how='left')
df_final.show(5)

+----------+--------------+-----+----------+
|        Id|       User_id|score|mean_score|
+----------+--------------+-----+----------+
|1597400602|A3DKP67DK28RUB|  5.0|       5.0|
|B0007H4QBK|A3MVU8X8EC9VRT|  5.0|       5.0|
|B0007H4QBK| AFP82QXWXAG2V|  5.0|       5.0|
|B000MCKQRS|A1TQL7XMTVF4JG|  3.0|       3.0|
|B000O3QCH8|A1T97PDD7JYCUA|  1.0|       1.0|
+----------+--------------+-----+----------+
only showing top 5 rows



In [27]:
# Creation of the final preprocessed dataset
df_final = df_final.select('Id','User_id', 'mean_score')
df_final = df_final.dropDuplicates()
df_final.show(5)

In [30]:
n_rows_final = df_final.count()
print(f"Number of Rows - Final dataset: {n_rows_final}")

Number of Rows - Final dataset: 2380153


#### 2.4 Rating Means

We want to calculate the overall average score to see if consistency is maintained after creating the subsample.

- Overall mean score:

In [33]:
df_final = df_final.withColumn("mean_score", F.col("mean_score").cast("double"))

overall_mean = df_final.agg(F.avg("mean_score")).collect()[0][0]
print(f"Overall mean score - Final dataset: {overall_mean}")

Overall mean score: 4.22386130919595


- Mean score for each item:

In [34]:
score_per_id = df_final.groupBy("Id").agg(F.avg("mean_score").alias("avg_score_pre"))
score_per_id.show(5)

+----------+-------------+
|        Id|avg_score_pre|
+----------+-------------+
|0027861317|        4.625|
|0028622480|          4.0|
|0029267358|          4.0|
|0060929081|         4.24|
|0071409807|         3.75|
+----------+-------------+
only showing top 5 rows



In [35]:
# Check data integrity
score_per_id_above_5 = score_per_id.filter(F.col("avg_score_pre") > 5)

score_per_id_above_5.show(5)


+---+-------------+
| Id|avg_score_pre|
+---+-------------+
+---+-------------+



---
### 3. Subsample Creation

In [36]:
# Keep 10% of original data
# df_subsample = df_final.sample(withReplacement=False, fraction=0.1, seed=42)

In [37]:
#n_rows_subsample = df_subsample.count()
#print(f"Number of Rows - Subsample: {n_rows_subsample}")

Number of Rows - Subsample: 237713


In [38]:
# Check data integrity
#df_subsample.printSchema()

root
 |-- Id: string (nullable = true)
 |-- User_id: string (nullable = true)
 |-- mean_score: double (nullable = true)



In [39]:
#df_subsample = df_subsample.withColumn("mean_score", F.col("mean_score").cast("double"))

- Overall mean score

In [40]:
#overall_mean_sample = df_subsample.agg(F.avg("mean_score")).collect()[0][0]
#print(f"Overall mean sample: {overall_mean_sample}")

Overall mean sample: 4.227572745285281


This mean, from the initial row-level 10% sample, is consistent with the overall mean computed before subsampling.

In [41]:
####################################

Sampling individual rows, as in the commented-out cells above, can keep only part of a user's reviews and therefore break up baskets. To obtain a subsample that remains consistent with the original dataset, we instead select a fraction of users and include all of their reviews. This better represents their purchasing behavior and rating patterns and preserves the integrity of each basket.
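The idea of sampling users rather than rows can be sketched in plain Python as follows. The toy `rows` list and the helper name `sample_by_user` are assumptions for illustration. Each distinct user is kept independently with probability `fraction`, so the realized share of users is only approximately equal to `fraction`.

```python
import random

# Hypothetical toy rows: (user_id, book_id, mean_score), not taken from the dataset.
rows = [
    ("u1", "A", 5.0), ("u1", "B", 4.0),
    ("u2", "A", 3.0),
    ("u3", "B", 2.0), ("u3", "C", 5.0), ("u3", "D", 4.0),
    ("u4", "C", 1.0),
]

def sample_by_user(rows, fraction, seed):
    """Keep each distinct user with probability `fraction`, then keep all rows of kept users."""
    rng = random.Random(seed)
    users = sorted({user_id for user_id, _, _ in rows})  # sorted for reproducibility
    kept_users = {u for u in users if rng.random() < fraction}
    kept_rows = [r for r in rows if r[0] in kept_users]
    return kept_rows, kept_users

sampled_rows, kept_users = sample_by_user(rows, fraction=0.5, seed=42)

# Every selected user keeps a complete basket; no basket is split.
for user in kept_users:
    full = [r for r in rows if r[0] == user]
    kept = [r for r in sampled_rows if r[0] == user]
    assert full == kept
```

Filtering the rows by the set of selected users plays the same role as the inner join on the user column in the Spark cells below.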

In [42]:
num_users = df_final.select("User_id").distinct().count()
print(f"Total number of different users - Original dataset: {num_users}")

Total number of different users - Original dataset: 1004214


In [43]:
# Keep just 20% of the users
sample_fraction = 0.2
user_sample = df_final.select("User_id").distinct().sample(fraction=sample_fraction, seed=42)

In [45]:
# Create the subsample with the selected users
df_sampled = df_final.join(user_sample, on="User_id", how="inner")
df_sampled.show(5)

+--------------------+----------+----------+
|             User_id|        Id|mean_score|
+--------------------+----------+----------+
|A00891092QIVH4W1Y...|0134354575|       2.0|
|A00891092QIVH4W1Y...|0395051029|       2.0|
|A00891092QIVH4W1Y...|1582790337|       2.0|
|A00891092QIVH4W1Y...|B000P4Q3JS|       2.0|
|A00891092QIVH4W1Y...|0140860282|       2.0|
+--------------------+----------+----------+
only showing top 5 rows



In [46]:
# Check subsample size
n_rows_sample = df_sampled.count()
print(f"Number of Rows - Sample: {n_rows_sample}")

Number of Rows - Sample: 471573


In [47]:
# Check data integrity
df_sampled.printSchema()

root
 |-- User_id: string (nullable = true)
 |-- Id: string (nullable = true)
 |-- mean_score: double (nullable = true)



In [48]:
# Check mean coherence with the original dataset
overall_mean_sample = df_sampled.agg(F.avg("mean_score")).collect()[0][0]
print(f"Overall mean - Sample: {overall_mean_sample}")

Overall mean - Sample: 4.221354205322752


The overall mean of the subsample is consistent with the overall mean of the final preprocessed dataset.
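To quantify this consistency, the short sketch below recomputes the gap between the two means and the share of rows retained. It uses only the figures reported in the outputs above. The variable names are illustrative, and no threshold for "consistent" is implied by the notebook.

```python
# Figures copied from the outputs reported in this notebook.
mean_full = 4.22386130919595      # overall mean, final dataset
mean_sample = 4.221354205322752   # overall mean, user-level subsample
rows_full = 2380153               # rows in the final dataset
rows_sample = 471573              # rows in the subsample

abs_diff = abs(mean_full - mean_sample)
rel_diff = abs_diff / mean_full
row_share = rows_sample / rows_full

print(f"Absolute difference in means: {abs_diff:.4f}")
print(f"Relative difference in means: {rel_diff:.4%}")
print(f"Share of rows kept: {row_share:.2%}")
```

The relative difference is roughly 0.06%, and the subsample keeps just under 20% of the rows, in line with sampling 20% of the users.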

---
### 4. Algorithm Implementation