## MASSIVE ALGORITHMS

### 1. Data Import

In [13]:
#import os
#import zipfile

In [14]:
#os.environ['KAGGLE_USERNAME'] = "melissarizzi"
#os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

In [15]:
#!pip install kaggle



In [16]:
#!kaggle datasets download -d mohamedbakhet/amazon-books-reviews

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0
amazon-books-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [17]:
#with zipfile.ZipFile("amazon-books-reviews.zip", 'r') as zip_ref:
#    zip_ref.extractall("amazon_books_data")

### 2. Data Preprocessing

In [18]:
# Useful libraries
import pandas as pd

In [19]:
data = pd.read_csv("amazon_books_data/Books_rating.csv")

In [20]:
# Check data types to verify integrity of data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000000 entries, 0 to 2999999
Data columns (total 10 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Id                  object 
 1   Title               object 
 2   Price               float64
 3   User_id             object 
 4   profileName         object 
 5   review/helpfulness  object 
 6   review/score        float64
 7   review/time         int64  
 8   review/summary      object 
 9   review/text         object 
dtypes: float64(2), int64(1), object(7)
memory usage: 228.9+ MB


In [21]:
# Check review/score range
min_value = data['review/score'].min()
print(min_value)
max_value = data['review/score'].max()
print(max_value)

1.0
5.0


All rating scores fall within the expected range of 1 to 5.

In [22]:
# Drop the columns that are not needed for the analysis
df = data.drop(['Price', 'profileName', 'review/helpfulness', 'review/time'], axis=1)

In [23]:
df.head()

| | Id | Title | User_id | review/score | review/summary | review/text |
|---|---|---|---|---|---|---|
| 0 | 1882931173 | Its Only Art If Its Well Hung! | AVCGYZL8FQQTD | 4.0 | Nice collection of Julie Strain images | This is only for Julie Strain fans. It's a col... |
| 1 | 826414346 | Dr. Seuss: American Icon | A30TK6U7DNS82R | 5.0 | Really Enjoyed It | I don't care much for Dr. Seuss but after read... |
| 2 | 826414346 | Dr. Seuss: American Icon | A3UH4UZ4RSVO82 | 5.0 | Essential for every personal and Public Library | If people become the books they read and if "t... |
| 3 | 826414346 | Dr. Seuss: American Icon | A2MVUWT453QH61 | 4.0 | Phlip Nel gives silly Seuss a serious treatment | Theodore Seuss Geisel (1904-1991), aka &quot;D... |
| 4 | 826414346 | Dr. Seuss: American Icon | A22X4XUPKF66MR | 4.0 | Good academic overview | Philip Nel - Dr. Seuss: American IconThis is b... |


#### 2.1 Missing data

In [24]:
df.isnull().sum()

Id                     0
Title                208
User_id           561787
review/score           0
review/summary       407
review/text            8
dtype: int64

What stands out immediately, given the purpose of our analysis, is the large number of missing values in User_id (561,787 rows). One possible explanation is that unregistered users can leave reviews without having a user ID. Our goal is to identify baskets of items purchased by the same user, and without a user ID this cannot be done. We considered using profile names as a fallback, assigning a dummy ID to all users who share the same name. This would be unreliable, because different users can have the same profile name. Moreover, profile names have even more missing values than user IDs, so the fallback would not close the gap. Having found no suitable way to replace the missing IDs, we decided to **drop the rows with missing values**.
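The name-collision problem can be illustrated with a minimal sketch on made-up data. The column `true_user` and all values below are hypothetical and exist only to show the effect; the real dataset has no such ground truth.

```python
import pandas as pd

# Hypothetical toy data: three distinct people, two of whom share a profile name.
toy = pd.DataFrame({
    "true_user": ["u1", "u2", "u3"],
    "profileName": ["John Smith", "John Smith", "Jane Doe"],
    "Id": ["B001", "B002", "B003"],
})

# Fallback idea: one dummy ID per distinct profile name.
toy["dummy_id"] = pd.factorize(toy["profileName"])[0]

# Baskets built on the dummy ID merge the two different "John Smith" users.
baskets = toy.groupby("dummy_id")["Id"].apply(list).to_dict()

print(toy["true_user"].nunique(), "real users vs", toy["dummy_id"].nunique(), "dummy IDs")
print(baskets)
```

In this toy case two separate one-item baskets are fused into a single two-item basket, which is exactly the kind of spurious co-occurrence that would distort a basket analysis.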

In [25]:
# Remove missing values
df = df.dropna()
# Check
df.isnull().sum()

Id                0
Title             0
User_id           0
review/score      0
review/summary    0
review/text       0
dtype: int64

In [26]:
# Check dataset shape after null values removal 
df.shape

(2437620, 6)

#### 2.2 Data Duplicates

In [27]:
# Number of rows that are exact duplicates of an earlier row (all columns equal)
df.duplicated().sum()

np.int64(17544)

In [28]:
# Remove duplicated rows
df = df.drop_duplicates()
# Check
df.duplicated().sum()

np.int64(0)

In [29]:
# Check dataset shape after duplicate rows removal
df.shape

(2420076, 6)

Up until now, we’ve performed a general cleaning of the dataset. From here on, we’ll focus exclusively on the three columns that are relevant to our analysis (Id, User_id, and review/score), forming a new dataset: df_short.

In [30]:
# Keep only the columns relevant to the analysis
df_short = df.drop(['Title', 'review/text', 'review/summary'], axis=1)

In [31]:
df_short.head()

| | Id | User_id | review/score |
|---|---|---|---|
| 0 | 1882931173 | AVCGYZL8FQQTD | 4.0 |
| 1 | 826414346 | A30TK6U7DNS82R | 5.0 |
| 2 | 826414346 | A3UH4UZ4RSVO82 | 5.0 |
| 3 | 826414346 | A2MVUWT453QH61 | 4.0 |
| 4 | 826414346 | A22X4XUPKF66MR | 4.0 |


In [32]:
# Select every row that belongs to a duplicated group (keep=False marks all copies, including the first)
df_duplicati = df_short[df_short.duplicated(keep=False)]
df_duplicati.shape

(37049, 3)

In [33]:
# Count only the extra copies (the first occurrence of each group is not counted)
df_short.duplicated().sum()

np.int64(19996)
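The two counts above (37049 and 19996) differ because of the `keep` argument of `DataFrame.duplicated`: `keep=False` flags every member of a duplicated group, while the default `keep='first'` flags only the copies after the first. A small sketch on made-up rows shows the difference:

```python
import pandas as pd

# Hypothetical toy data: one triple, one unique row, one pair.
toy = pd.DataFrame({
    "Id": ["B1", "B1", "B1", "B2", "B3", "B3"],
    "User_id": ["u1", "u1", "u1", "u2", "u3", "u3"],
    "review/score": [5.0, 5.0, 5.0, 4.0, 3.0, 3.0],
})

# All rows that have at least one identical twin (3 + 2 rows).
n_all_members = int(toy.duplicated(keep=False).sum())

# Only the redundant copies, i.e. what drop_duplicates() will remove (2 + 1 rows).
n_extra_copies = int(toy.duplicated().sum())

n_after = len(toy.drop_duplicates())
print(n_all_members, n_extra_copies, n_after)
```

The number of rows removed by `drop_duplicates()` therefore matches the second count, not the first.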

In [34]:
# Remove duplicated rows
df_short = df_short.drop_duplicates()
# Check
df_short.duplicated().sum()

np.int64(0)

In [35]:
# Check duplicates on Id and User_id only, to find users who reviewed the same item more than once with different scores
duplicati = df_short[df_short.duplicated(subset=['Id', 'User_id'], keep=False)]
duplicati.shape

(6020, 3)

In [36]:
# Compute the mean of the different scores given by the same user to the same item
score_mean = df_short.groupby(['Id', 'User_id'])['review/score'].mean()
df_final = pd.merge(df_short, score_mean, on=['Id', 'User_id'], how='left', suffixes=('', '_mean'))
df_final = df_final.drop(columns=['review/score'])
df_final = df_final.rename(columns={'review/score_mean': 'mean_score'})
df_final.head()

| | Id | User_id | mean_score |
|---|---|---|---|
| 0 | 1882931173 | AVCGYZL8FQQTD | 4.0 |
| 1 | 826414346 | A30TK6U7DNS82R | 5.0 |
| 2 | 826414346 | A3UH4UZ4RSVO82 | 5.0 |
| 3 | 826414346 | A2MVUWT453QH61 | 4.0 |
| 4 | 826414346 | A22X4XUPKF66MR | 4.0 |
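Because the merge keeps one row per original review, an (Id, User_id) pair that had several scores still appears several times, now with identical mean values. As an illustration only, and not the approach used in this notebook, the same averaging can be sketched on made-up data with a single `groupby`, which returns one row per pair directly:

```python
import pandas as pd

# Hypothetical toy data: user u1 rated item B1 twice with different scores.
toy = pd.DataFrame({
    "Id": ["B1", "B1", "B2"],
    "User_id": ["u1", "u1", "u2"],
    "review/score": [5.0, 3.0, 4.0],
})

# One row per (Id, User_id) pair, carrying the mean of that pair's scores.
collapsed = (
    toy.groupby(["Id", "User_id"], as_index=False)["review/score"]
       .mean()
       .rename(columns={"review/score": "mean_score"})
)
print(collapsed)
```

Here the two conflicting ratings of `B1` by `u1` collapse into a single row with `mean_score` 4.0.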


In [37]:
# Count duplicated rows
df_final.duplicated().sum()

np.int64(3059)

In [38]:
# Remove duplicated rows
df_final = df_final.drop_duplicates()
# Check
df_final.duplicated().sum()

np.int64(0)

In [39]:
# Check shape after duplicated values removal
df_final.shape

(2397021, 3)

#### 2.3 Rating means

We calculate the overall average score and the average score for each book (Id), so that we can later check whether the subsample stays consistent with the full dataset.

- Overall mean score:

In [40]:
overall_mean_score = df_final['mean_score'].mean()
print(overall_mean_score)

4.224974527409926


- Mean score for each item:

In [41]:
mean_scores = df_final.groupby('Id')['mean_score'].mean().reset_index()
print(mean_scores)

                Id  mean_score
0       0001047604    3.000000
1       0001047655    3.725806
2       0001047736    3.500000
3       0001047825    4.041667
4       0001047876    4.000000
...            ...         ...
216004  B002PWLQ04    5.000000
216005  B0030EY97I    3.000000
216006  B003IMNND8    3.700000
216007  B005MX54HO    4.597561
216008  B0064P287I    4.475904

[216009 rows x 2 columns]
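A consistency check of this kind could look like the following sketch. It uses made-up data and pandas sampling as stand-ins; the names `toy`, `sample`, and `comparison` are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned ratings.
toy = pd.DataFrame({
    "Id": ["B1"] * 4 + ["B2"] * 4,
    "mean_score": [5.0, 4.0, 5.0, 4.0, 2.0, 3.0, 2.0, 3.0],
})

# Stand-in for the subsample (fixed seed for reproducibility).
sample = toy.sample(frac=0.5, random_state=42)

# Per-item means on the full data and on the subsample.
full_means = toy.groupby("Id")["mean_score"].mean()
sample_means = sample.groupby("Id")["mean_score"].mean()

# Side-by-side comparison; items absent from the subsample show NaN.
comparison = pd.concat([full_means, sample_means], axis=1, keys=["full", "sample"])
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()
print(comparison)
```

Small absolute differences across items, together with a similar overall mean, would suggest that the subsample preserves the rating distribution reasonably well.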


### 3. Reasonable Subsample

We want to create a reasonably sized subsample that is representative of the original dataset but cheaper to process. To do so, we use PySpark, which distributes the sampling across partitions in a MapReduce-style fashion.

In [44]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.5.tar.gz (317.2 MB)
Note: you may need to restart the kernel to use updated packages.

In [3]:
# Useful libraries 
from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder \
    .appName("SubsampleBooks") \
    .master("local[2]") \
    .getOrCreate()

# Check 
spark

KeyboardInterrupt: 

In [None]:
# Convert the pandas DataFrame into a Spark DataFrame
df_final_spark = spark.createDataFrame(df_final)

In [None]:
# Reproducible Subsample
sampling_rate = 0.08
seed_value = 42
df_sampled = df_final_spark.sample(fraction=sampling_rate, withReplacement=False, seed=seed_value)

df_sampled.head()
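Sampling without replacement by fraction keeps each row independently with the given probability, so the size of the result is only approximately `fraction` times the number of rows, and a fixed seed makes the draw repeatable. The plain-Python sketch below mimics that idea; `bernoulli_sample` is a hypothetical helper written for illustration and is not Spark's actual implementation.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return [row for row in rows if rng.random() < fraction]


# Hypothetical stand-in for the dataset rows.
rows = list(range(10_000))

sample_a = bernoulli_sample(rows, fraction=0.08, seed=42)
sample_b = bernoulli_sample(rows, fraction=0.08, seed=42)

# Same seed, same sample; the size is close to, but not exactly, 8% of the rows.
print(sample_a == sample_b, len(sample_a))
```

Since each row is decided independently, the selection can be carried out on every partition in parallel, which is what makes this kind of sampling cheap on large datasets.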