<table>
<tr>
    <td>
        <img src="https://www.wordstream.com/wp-content/uploads/2021/07/how-to-get-amazon-reviews.png" width="200"/>
    </td>
    <td style="text-align: left; vertical-align: top;">
        <h1><strong>Amazon Reviews</strong><br></h1>
        <h4>Engineering Large Scale Data Analytics Systems<br>
        ENSF 612 - Fall 2023</h4>
    </td>
</tr>
</table>


**Note:** Run all the code the first time. For subsequent runs, set the `rerun` flag to `False`. This avoids resetting Spark, remounting the drive, and recomputing compute-intensive results that already exist.


In [178]:
rerun = True

**Setting Up Spark**

In [179]:
%%capture
# The %%capture cell magic suppresses this cell's output to avoid clutter; it must be the first line of the cell

if rerun:
  !apt-get install openjdk-8-jdk-headless -qq > /dev/null
  !wget https://dlcdn.apache.org/spark/spark-3.3.3/spark-3.3.3-bin-hadoop3.tgz
  !tar -xvf spark-3.3.3-bin-hadoop3.tgz
  !pip install findspark

  import os
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
  os.environ["SPARK_HOME"] = "/content/spark-3.3.3-bin-hadoop3"

  import findspark
  findspark.init()
  findspark.find()
  from pyspark.sql import SparkSession

  # Use 4 local worker threads, so a 4-core processor can potentially execute 4 tasks in parallel
  spark = SparkSession.builder\
          .master("local[4]")\
          .appName("Colab")\
          .config('spark.ui.port', '4050')\
          .getOrCreate()

  sc = spark.sparkContext

**Mounting Drive & Loading Datasets**

In [180]:
if rerun:
  from google.colab import drive
  drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [181]:
!ls drive/MyDrive/Big\ Data/datasets

All_Beauty_5.json  Cell_Phones_and_Accessories_5_subsample.json  Software_5.json
Appliances_5.json  Musical_Instruments_5_subsample.json		 Video_Games_5_subsample.json


In [182]:
dataset_directory = 'drive/MyDrive/Big Data/datasets'

# Gets the list of files in the dataset directory that end in ".json"
json_files = [file for file in os.listdir(dataset_directory) if file.endswith('.json')]

# Creates a list of full file paths
file_paths = [os.path.join(dataset_directory, file) for file in json_files]

In [183]:
import json

# Parse one line of an NDJSON (newline-delimited JSON) file and extract specific fields
def parse_ndjson(line):
    try:
        # Parse the JSON line and return only reviewText, asin, and reviewerID
        json_line = json.loads(line)
        return (
            json_line.get('reviewText', ''),
            json_line.get('asin', ''),
            json_line.get('reviewerID', '')
        )
    except json.JSONDecodeError:
        # In case of error, skip this record and return None
        return None
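
As a quick sanity check, the parser can be exercised on a few hand-written lines. This is only a minimal sketch: the sample records below are invented for illustration and are not taken from the datasets.

```python
import json

def parse_ndjson(line):
    """Same logic as the cell above: return (reviewText, asin, reviewerID) or None."""
    try:
        json_line = json.loads(line)
        return (
            json_line.get('reviewText', ''),
            json_line.get('asin', ''),
            json_line.get('reviewerID', ''),
        )
    except json.JSONDecodeError:
        return None

# Hypothetical sample lines (not real dataset records)
samples = [
    '{"reviewText": "Works well", "asin": "B000TEST01", "reviewerID": "A1TEST", "overall": 5.0}',
    '{"asin": "B000TEST02", "reviewerID": "A2TEST"}',  # reviewText key is missing
    '{"reviewText": "truncated line',                   # malformed JSON
]

parsed = [parse_ndjson(line) for line in samples]
# Drop the records that failed to parse, as the loading loop does with .filter(...)
kept = [record for record in parsed if record is not None]

for record in parsed:
    print(record)
```

Extra fields such as `overall` are ignored, a missing field comes back as an empty string, and a malformed line yields `None`, so it is dropped by the filter.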

In [184]:
# Initialize an empty RDD
data_rdd = spark.sparkContext.emptyRDD()

# Read each file into an RDD, parse its ndjson objects if not None, and union with the existing RDD
for file_path in file_paths:
    file_rdd = sc.textFile(file_path, 4)
    parsed_rdd = file_rdd.map(parse_ndjson).filter(lambda x: x is not None)
    data_rdd = data_rdd.union(parsed_rdd)

# convert the data_rdd to a distributed Spark DataFrame
df = spark.createDataFrame(data_rdd, schema=['review', 'itemID', 'reviewerID'])

**Note:** A JSON object viewer is defined here to inspect records further down the notebook.

In [185]:
def json_viewer(row):
    if row:  # Check if the row is not empty
        # If the row is a list, get the first element, assuming it's a Row object
        if isinstance(row, list):
            row = row[0]
        # Convert the Row to a dictionary
        row_dict = row.asDict()
        # Pretty-print the JSON
        print(json.dumps(row_dict, indent=4))
    else:
        print("The row is empty or None.")

**Data Pre-Processing**

In [186]:
df.show  # Without parentheses this only displays the bound method, whose repr lists the DataFrame's columns and types

<bound method DataFrame.show of DataFrame[review: string, itemID: string, reviewerID: string]>

In [187]:
df.count()  # Number of records in the DataFrame

118242

In [188]:
df.head(1) # Preview a single record

[Row(review="I've been using Dreamweaver (and it's predecessor Macromedia's UltraDev) for many years.  For someone who is an experienced web designer, this course is a high-level review of the CS5 version of Dreamweaver, but it doesn't go into a great enough level of detail to find it very useful.\n\nOn the other hand, this is a great tool for someone who is a relative novice at web design.  It starts off with a basic overview of HTML and continues through the concepts necessary to build a modern web site.  Someone who goes through this course should exit with enough knowledge to create something that does what you want it do do...within reason.  Don't expect to go off and build an entire e-commerce system with only this class under your belt.\n\nIt's important to note that there's a long gap from site design to actual implementation.  This course teaches you how to implement a design.  The user interface and overall user experience is a different subject that isn't covered here...it's

In [189]:
if rerun:
  from pyspark.sql.functions import col, count

  # Check whether the dataset has missing values (if necessary, we would fill them in with an appropriate method)
  for column in df.columns:
      null_count = df.filter(col(column).isNull()).count()
      print(f"Number of nulls in column {column}: {null_count}")

Number of nulls in column review: 0
Number of nulls in column itemID: 0
Number of nulls in column reviewerID: 0
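
One caveat when reading these counts: `parse_ndjson` substitutes an empty string for any missing key, so a record without a `reviewText` never shows up as a null. Only an explicit JSON `null` would. The following plain-Python sketch shows the distinction, using invented records rather than the actual datasets:

```python
import json

# Hypothetical records (not from the datasets)
lines = [
    '{"reviewText": "Nice", "asin": "B01", "reviewerID": "R1"}',
    '{"asin": "B02", "reviewerID": "R2"}',                      # key absent
    '{"reviewText": null, "asin": "B03", "reviewerID": "R3"}',  # explicit null
]

# Same default as parse_ndjson: a missing key becomes ''
reviews = [json.loads(line).get('reviewText', '') for line in lines]

null_count = sum(1 for r in reviews if r is None)
empty_count = sum(1 for r in reviews if r is not None and r.strip() == '')

print(f"nulls: {null_count}, empty strings: {empty_count}")
```

A fuller check on the DataFrame would therefore probably count empty strings as well as nulls, for example by also filtering on `col(column) == ''`.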


**Text Pre-Processing**

1. Text Cleaning

In [190]:
from pyspark.sql.functions import regexp_replace, udf, PandasUDFType

# Keep contractions by allowing apostrophes in words
pattern = "[^A-Za-z'\\s]+"
cleaned_df = df.withColumn("cleaned_review", regexp_replace("review", pattern, ""))
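
To see what the pattern keeps and drops, the same character class can be tried with Python's `re` module. This is a sketch on an invented sample string. Spark's `regexp_replace` uses Java regular expressions, but for a simple class like this one on ASCII text the two engines should behave the same way.

```python
import re

# Same character class as the Spark pattern: anything that is not a letter,
# an apostrophe, or whitespace is removed
pattern = r"[^A-Za-z'\s]+"

sample = "It's a great e-commerce tool, 100%!"  # hypothetical review text
cleaned = re.sub(pattern, "", sample)

print(repr(cleaned))
```

Because matches are replaced with an empty string rather than a space, hyphenated words are merged (`e-commerce` becomes `ecommerce`), and digits are removed along with punctuation.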

1.b Expanding contractions. Although this step is not strictly necessary, expanding contractions can make the text clearer and more consistent for the model, which can improve its ability to interpret and analyze the words.

In [191]:
%%capture
if rerun:
  !pip install contractions

In [192]:
from pyspark.sql.types import StringType
import contractions

# Define a UDF to expand contractions using the `contractions` library
def expand_contractions_text(text):
    return contractions.fix(text)

expand_contractions_udf = udf(expand_contractions_text, StringType())

# Apply the UDF to the DataFrame to create a new column with expanded contractions
expanded_df = cleaned_df.withColumn("expanded_review", expand_contractions_udf("cleaned_review"))
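
The idea behind this step can be illustrated without the library. The sketch below is a toy stand-in for `contractions.fix`, not its actual implementation: `TOY_CONTRACTIONS` and `expand_toy` are hypothetical names, and the mapping is a tiny hand-written subset.

```python
import re

# Toy stand-in for contractions.fix: a tiny hand-written mapping (hypothetical,
# far smaller than what the library covers)
TOY_CONTRACTIONS = {
    "don't": "do not",
    "doesn't": "does not",
    "it's": "it is",
    "i've": "i have",
}

def expand_toy(text):
    """Replace each known contraction with its expanded form (case-insensitive match)."""
    def replace(match):
        return TOY_CONTRACTIONS[match.group(0).lower()]
    alternatives = "|".join(re.escape(key) for key in TOY_CONTRACTIONS)
    return re.sub(rf"\b(?:{alternatives})\b", replace, text, flags=re.IGNORECASE)

print(expand_toy("It doesn't work and I don't like it"))
```

After expansion, `doesn't` and `does not` produce the same tokens, which is the consistency benefit described above.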

2. Tokenization

In [193]:
from pyspark.ml.feature import Tokenizer

# Tokenize the text reviews
tokenizer = Tokenizer(inputCol="expanded_review", outputCol="words")
tokenized_df = tokenizer.transform(expanded_df)
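
Spark's `Tokenizer` lowercases the text and splits it on whitespace. The plain-Python approximation below (`tokenize_like_spark` is a hypothetical helper, and edge cases such as trailing whitespace may differ from Spark) shows one consequence worth knowing: consecutive spaces, which are common in these reviews, produce empty tokens.

```python
import re

def tokenize_like_spark(text):
    """Approximate pyspark.ml.feature.Tokenizer: lowercase, then split on each whitespace character."""
    return re.split(r"\s", text.lower())

sample = "Great  tool for novices"  # hypothetical text; note the double space
tokens = tokenize_like_spark(sample)
print(tokens)

# Dropping empty tokens is an optional extra step, not something the notebook does
non_empty = [token for token in tokens if token]
```

If empty tokens turn out to matter downstream, `RegexTokenizer` with a pattern such as `\\s+` is one possible alternative.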

3. Stopword Removal

In [194]:
from pyspark.ml.feature import StopWordsRemover

# Remove stopwords
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
filtered_df = remover.transform(tokenized_df)
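
Conceptually, stop-word removal is a membership filter over each token list. The sketch below uses a small hypothetical stop-word set for illustration. `StopWordsRemover` ships with a much longer default English list and, by default, matches case-insensitively.

```python
# Hypothetical subset of English stop words, for illustration only;
# StopWordsRemover ships with a much longer default list
TOY_STOP_WORDS = {"this", "is", "a", "the", "for"}

def remove_stop_words(words, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens whose lowercase form is not in the stop-word set."""
    return [word for word in words if word.lower() not in stop_words]

tokens = ["this", "is", "a", "great", "tool", "for", "novices"]
filtered = remove_stop_words(tokens)
print(filtered)
```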

4. Stemming/Lemmatization

In [195]:
%%capture
from pyspark.sql.functions import pandas_udf, PandasUDFType
from nltk.stem.porter import PorterStemmer
import pandas as pd
import nltk

nltk.download('punkt')
stemmer = PorterStemmer()

# Define a pandas UDF to stem the words
@pandas_udf("array<string>", PandasUDFType.SCALAR_ITER)
def stem_text(iterator):
    for words in iterator:
        yield pd.Series([[stemmer.stem(word) for word in word_list] for word_list in words])

# Apply stemming to the DataFrame
stemmed_df = filtered_df.withColumn("stemmed_words", stem_text("filtered_words"))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [196]:
json_viewer(stemmed_df.take(1))

{
    "review": "I've been using Dreamweaver (and it's predecessor Macromedia's UltraDev) for many years.  For someone who is an experienced web designer, this course is a high-level review of the CS5 version of Dreamweaver, but it doesn't go into a great enough level of detail to find it very useful.\n\nOn the other hand, this is a great tool for someone who is a relative novice at web design.  It starts off with a basic overview of HTML and continues through the concepts necessary to build a modern web site.  Someone who goes through this course should exit with enough knowledge to create something that does what you want it do do...within reason.  Don't expect to go off and build an entire e-commerce system with only this class under your belt.\n\nIt's important to note that there's a long gap from site design to actual implementation.  This course teaches you how to implement a design.  The user interface and overall user experience is a different subject that isn't covered here...