# Machine Learning on Spark
By Gabriel Marco Mercado
### Introduction
In this notebook, I work through a simple regression problem in Apache Spark: predicting the helpfulness score of product review sentences. Spark ML is well suited to large-scale data processing and machine learning because Spark's distributed computing framework processes massive datasets in parallel, which can substantially reduce computation time compared with traditional single-machine tools.

Spark ML's key benefits include its scalability, speed, and distributed computing capabilities. It can scale from a single machine to a large cluster, handling datasets too large for a single machine's memory. Spark's in-memory computation model accelerates data processing, benefiting iterative machine learning algorithms. Additionally, Spark ML integrates seamlessly with other big data tools like Hadoop, HDFS, and Apache Kafka, offering a versatile solution within the big data ecosystem. Its high-level APIs in Java, Scala, Python, and R provide ease of use and accessibility, enabling rapid development and prototyping of machine learning models.

### Tools and Skills Employed

In this project, the following skills were employed:

1. **Apache Spark ML**
   - Creating and managing ML pipelines for streamlined machine learning workflows.
   - Using `StringIndexer` and `VectorAssembler` for feature transformation in a distributed environment.
   - Building and evaluating machine learning models using Pipelines.

2. **Distributed Computing with Spark**
   - Leveraging Spark's in-memory computation for efficient processing of large-scale data.
   - Parallel processing of massive datasets using Spark's distributed architecture.

3. **Integration with AWS S3**
   - Downloading and managing datasets stored in AWS S3 using the `boto3` library.
   - Ensuring seamless data transfer between cloud storage and Spark.

4. **Big Data Tools Integration**
   - Integrating Spark with other big data tools and platforms like Hadoop and HDFS for enhanced data processing capabilities.

These skills demonstrate advanced capabilities in handling and processing large-scale data using distributed computing frameworks, as well as integrating various big data tools to create efficient and scalable machine learning solutions.

### Our Objective
Our goal is to achieve a Mean Absolute Error (MAE) of less than 0.5 on the test set. MAE is the average absolute difference between the predicted and the actual helpfulness scores, so an MAE below 0.5 means that predictions are off by less than half a point on average. A model that meets this threshold could help improve the user experience on e-commerce platforms by identifying the most useful reviews.
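To make the metric concrete, here is a minimal plain-Python sketch of how MAE is computed. The function name `mean_absolute_error` and the sample scores are hypothetical and chosen only for illustration; later in the notebook the metric is computed by Spark's `RegressionEvaluator`.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical helpfulness scores, for illustration only
actual = [1.5, 0.5, 2.0, 1.0]
predicted = [1.25, 1.0, 1.5, 1.25]

mae = mean_absolute_error(actual, predicted)
print(f"MAE: {mae:.3f}, meets target: {mae < 0.5}")
```

Here the absolute errors are 0.25, 0.5, 0.5, and 0.25, so the MAE is 0.375, which would satisfy the target.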

### Dataset
In order to demonstrate Spark ML, we use the Helpful Sentences from Reviews dataset, which is available from the AWS Open Data Registry. This dataset comprises customer reviews from Amazon, specifically focusing on sentences that have been marked as helpful by other users.

Here are the features of the dataset:

| Feature         | Description                                             |
|-----------------|---------------------------------------------------------|
| `asin`          | Amazon Standard Identification Number, uniquely identifies the product |
| `helpful`       | A numerical score indicating how helpful the sentence was rated by users |
| `main_image_url`| URL of the main image associated with the product       |
| `product_title` | Title of the product                                    |
| `sentence`      | The review sentence itself                              |

By using this dataset, we aim to build a machine learning model that predicts the helpfulness score of a review sentence based on these features. This helps us understand the characteristics of helpful reviews, enhancing the user experience on e-commerce platforms by highlighting valuable feedback and assisting customers in making informed purchase decisions. For more information, you can visit the [AWS Open Data Registry](https://registry.opendata.aws/helpful-sentences-from-reviews/).

### Methodology

In this project, we employ the following methodology to predict the helpfulness rate of product reviews:

1. **Data Loading and Exploration**
   - Load the `train.json` and `test.json` datasets from AWS S3 using the `boto3` library.
   - Explore the datasets to understand the schema and contents.

2. **Data Preprocessing**
   - Use `StringIndexer` to convert categorical string columns (`asin`, `main_image_url`, `product_title`, `sentence`) into numerical indices.
   - Apply `VectorAssembler` to combine the indexed columns into a single `features` vector column.

3. **Model Building and Evaluation**
   - Create a machine learning pipeline using Apache Spark ML, incorporating `StringIndexer`, `VectorAssembler`, and `LinearRegression`.
   - Train the model on the training dataset and evaluate it on the test dataset using Mean Absolute Error (MAE) as the performance metric.

4. **Result Analysis**
   - Compare the Test MAE to the goal MAE of 0.5.

### Data Loading and Exploration
We download both the `train.json` and `test.json` files from an AWS S3 bucket.

In [1]:
# Import libraries --------------------------------
from pyspark.sql import SparkSession

# Set up boto3 and download .json files -----------
!pip install boto3
import boto3

s3 = boto3.client('s3', region_name='us-east-1')
bucket_name = 'helpful-sentences-from-reviews'
file_key = 'train.json'
s3.download_file(bucket_name, file_key, 'train.json')
file_key = 'test.json'
s3.download_file(bucket_name, file_key, 'test.json')

# Initialize spark --------------------------------
spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")
         .config("spark.driver.memory", "4g")
         .getOrCreate())


In [2]:
# Read dataset ------------------------------------
df = spark.read.json('train.json')

In [3]:
# Data Exploration --------------------------------
df.printSchema()
df.limit(10).toPandas()

root
 |-- asin: string (nullable = true)
 |-- helpful: double (nullable = true)
 |-- main_image_url: string (nullable = true)
 |-- product_title: string (nullable = true)
 |-- sentence: string (nullable = true)



|   | asin | helpful | main_image_url | product_title | sentence |
|---|------|---------|----------------|---------------|----------|
| 0 | B000AO3L84 | 1.7 | http://ecx.images-amazon.com/images/I/41XAEKR9... | Canon 430EX Speedlite Flash for Canon EOS SLR ... | this flash is a superb value. |
| 1 | B001SEQPGK | 1.3 | http://ecx.images-amazon.com/images/I/71KLvmtc... | Sony Cyber-shot DSC-W290 12 MP Digital Camera ... | The pictures were not sharp at all. |
| 2 | 0553386697 | 1.9 | http://ecx.images-amazon.com/images/I/81HdbmkR... | The Whole-Brain Child: 12 Revolutionary Strate... | A very good resource for parents. |
| 3 | B006SUWZH2 | 0.25 | http://ecx.images-amazon.com/images/I/61A2WQOL... | Memorex Portable CD Boombox with AM FM Radio | We have it in a child's room, and will be swit... |
| 4 | B000W7F5SS | 0.9 | http://ecx.images-amazon.com/images/I/91E7TPDb... | Harry Potter and the Order of the Phoenix (Wid... | Again the makers are too lazy to bring in the ... |
| 5 | B000AO3L84 | 2.0 | http://ecx.images-amazon.com/images/I/41XAEKR9... | Canon 430EX Speedlite Flash for Canon EOS SLR ... | This flash is a great value for the money. |
| 6 | B00081NX5U | 0.73 | http://ecx.images-amazon.com/images/I/51GQZT32... | iPod Detachable Receiver 7 | So I've had these speakers for three days now,... |
| 7 | B00000F1D3 | 0.9 | http://ecx.images-amazon.com/images/I/41H0Y95G... | Believe | they're cd's or tape's forget about the "spice... |
| 8 | B00000FCBH | 1.3 | http://ecx.images-amazon.com/images/I/51iQ7z1W... | 2Pac Greatest Hits | he proved that even with a dysfunctional famil... |
| 9 | B00013M6NU | 0.4 | http://ecx.images-amazon.com/images/I/41XR3APD... | Nikon MH-61 Battery Charger for Nikon EN-EL5 B... | I realize these things can happen, so I truly ... |


### Data Preprocessing
We read the `train.json` and `test.json` files into separate DataFrames and replace missing string values with empty strings using `fillna("")`. We also cache both DataFrames. Caching keeps a DataFrame in memory once it has been computed, so later steps that reuse it do not have to re-read and re-parse the JSON files from disk. Note that `cache()` is lazy: the data is only materialized when the first action runs on the DataFrame.

In [4]:
# Data Preprocessing ------------------------------
train_df = spark.read.json('train.json').fillna("")
test_df = spark.read.json('test.json').fillna("")
train_df.cache()
test_df.cache()

DataFrame[asin: string, helpful: double, main_image_url: string, product_title: string, sentence: string]

Next, we use `StringIndexer` from `pyspark.ml.feature` to convert categorical string columns into numeric indices, which are necessary for machine learning algorithms.

In this process, we:
1. Initialize `StringIndexer` for each categorical column (`asin`, `main_image_url`, `product_title`, `sentence`) to create corresponding indexed columns (`asin_index`, `main_image_url_index`, `product_title_index`, `sentence_index`).
2. Fit the `StringIndexer` models on the training dataset to learn the mapping from categorical values to numeric indices.
3. Transform the training dataset to apply these mappings, converting the categorical columns into numeric indices.
4. Apply the same transformations to the test dataset using the fitted models, ensuring consistency between training and test data. The `sentence` indexer is created with `handleInvalid="keep"`, so test sentences that were not seen during training are mapped to one extra index instead of raising an error.
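The fit and transform steps above can be illustrated with a small plain-Python sketch. This is not Spark's implementation: it only mimics what I understand to be the default `StringIndexer` ordering (most frequent label first, ties broken alphabetically) and the `handleInvalid="keep"` behavior. The helper names `fit_string_index` and `transform_string_index` are hypothetical, and the ASIN values are a toy sample taken from the rows shown earlier.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_string_index(values, mapping, handle_invalid="error"):
    """Apply a fitted mapping; with "keep", all unseen labels share one extra index."""
    unseen_index = float(len(mapping))
    indexed = []
    for value in values:
        if value in mapping:
            indexed.append(mapping[value])
        elif handle_invalid == "keep":
            indexed.append(unseen_index)
        else:
            raise ValueError(f"Unseen label: {value!r}")
    return indexed


# Toy sample: the mapping is learned on training values only
train_asins = ["B000AO3L84", "B001SEQPGK", "B000AO3L84"]
mapping = fit_string_index(train_asins)

train_idx = transform_string_index(train_asins, mapping)
test_idx = transform_string_index(["B001SEQPGK", "B00VG90446"], mapping, "keep")
print(mapping, train_idx, test_idx)
```

Because the mapping is learned from the training data alone, the unseen test value falls into the extra index `2.0`. The same mechanism explains why unseen test sentences all receive one shared `sentence_index` value.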

In [5]:
from pyspark.ml.feature import StringIndexer

si_a = StringIndexer(inputCol="asin", outputCol="asin_index")
si_miu = StringIndexer(inputCol="main_image_url", outputCol="main_image_url_index")
si_pt = StringIndexer(inputCol="product_title", outputCol="product_title_index")
si_s = StringIndexer(inputCol="sentence", outputCol="sentence_index", handleInvalid="keep")

si_a_model = si_a.fit(train_df)
si_miu_model = si_miu.fit(train_df)
si_pt_model = si_pt.fit(train_df)
si_s_model = si_s.fit(train_df)

train_df = si_a_model.transform(train_df)
train_df = si_miu_model.transform(train_df)
train_df = si_pt_model.transform(train_df)
train_df = si_s_model.transform(train_df)

test_df = si_a_model.transform(test_df)
test_df = si_miu_model.transform(test_df)
test_df = si_pt_model.transform(test_df)
test_df = si_s_model.transform(test_df)

In [6]:
train_df.printSchema()
train_df.limit(5).toPandas()

root
 |-- asin: string (nullable = false)
 |-- helpful: double (nullable = true)
 |-- main_image_url: string (nullable = false)
 |-- product_title: string (nullable = false)
 |-- sentence: string (nullable = false)
 |-- asin_index: double (nullable = false)
 |-- main_image_url_index: double (nullable = false)
 |-- product_title_index: double (nullable = false)
 |-- sentence_index: double (nullable = false)



|   | asin | helpful | main_image_url | product_title | sentence | asin_index | main_image_url_index | product_title_index | sentence_index |
|---|------|---------|----------------|---------------|----------|------------|----------------------|---------------------|----------------|
| 0 | B000AO3L84 | 1.7 | http://ecx.images-amazon.com/images/I/41XAEKR9... | Canon 430EX Speedlite Flash for Canon EOS SLR ... | this flash is a superb value. | 40.0 | 40.0 | 40.0 | 19715.0 |
| 1 | B001SEQPGK | 1.3 | http://ecx.images-amazon.com/images/I/71KLvmtc... | Sony Cyber-shot DSC-W290 12 MP Digital Camera ... | The pictures were not sharp at all. | 85.0 | 85.0 | 85.0 | 13765.0 |
| 2 | 0553386697 | 1.9 | http://ecx.images-amazon.com/images/I/81HdbmkR... | The Whole-Brain Child: 12 Revolutionary Strate... | A very good resource for parents. | 37.0 | 37.0 | 37.0 | 744.0 |
| 3 | B006SUWZH2 | 0.25 | http://ecx.images-amazon.com/images/I/61A2WQOL... | Memorex Portable CD Boombox with AM FM Radio | We have it in a child's room, and will be swit... | 119.0 | 119.0 | 119.0 | 16633.0 |
| 4 | B000W7F5SS | 0.9 | http://ecx.images-amazon.com/images/I/91E7TPDb... | Harry Potter and the Order of the Phoenix (Wid... | Again the makers are too lazy to bring in the ... | 22.0 | 22.0 | 21.0 | 916.0 |


In [7]:
test_df.printSchema()
test_df.limit(5).toPandas()

root
 |-- asin: string (nullable = false)
 |-- helpful: double (nullable = true)
 |-- main_image_url: string (nullable = false)
 |-- product_title: string (nullable = false)
 |-- sentence: string (nullable = false)
 |-- asin_index: double (nullable = false)
 |-- main_image_url_index: double (nullable = false)
 |-- product_title_index: double (nullable = false)
 |-- sentence_index: double (nullable = false)



|   | asin | helpful | main_image_url | product_title | sentence | asin_index | main_image_url_index | product_title_index | sentence_index |
|---|------|---------|----------------|---------------|----------|------------|----------------------|---------------------|----------------|
| 0 | B00VG90446 | 1.07 | http://ecx.images-amazon.com/images/I/51O-QlBL... | Flexion KS-902 Kinetic Series Wireless Bluetoo... | so it stays in place around your neck. | 48.0 | 47.0 | 47.0 | 20000.0 |
| 1 | B001196MG0 | 1.33 | http://ecx.images-amazon.com/images/I/11+6ow0D... | Savage 107X12-1 Seamless Background Paper, 107... | Love this seamless paper! | 97.0 | 97.0 | 97.0 | 20000.0 |
| 2 | B00081NX5U | 1.17 | http://ecx.images-amazon.com/images/I/51GQZT32... | iPod Detachable Receiver 7 | very happy with my purchase.-JI | 5.0 | 5.0 | 5.0 | 20000.0 |
| 3 | B003HC9JIW | 1.6 | http://ecx.images-amazon.com/images/I/51u7P1db... | Start! Walking At Home with Leslie Sansone: B... | Even for someone with poor balance, this dvd i... | 54.0 | 54.0 | 55.0 | 20000.0 |
| 4 | B00C30FCUI | 1.49 | http://ecx.images-amazon.com/images/I/416HU0yr... | Symphonized NRG Premium Genuine Wood In-ear No... | , those have always produced a nice deep tone. | 0.0 | 0.0 | 0.0 | 20000.0 |


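Based on the `.printSchema()` results, we have successfully converted the string columns into numeric ones, as indicated by the `double` data type of the new `*_index` columns. Note that every test row shown has a `sentence_index` of 20000.0: these sentences do not appear in the training data, so `handleInvalid="keep"` assigns all of them the same extra index. However, our preprocessing does not end here.

Keep in mind that the values produced by `StringIndexer` are labels assigned by frequency rather than meaningful magnitudes. They satisfy the requirement for numerical input, but a linear model will treat them as ordinary numbers.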

Next, we need to prepare our data for machine learning algorithms in Spark, which require input features to be in a single vector column rather than multiple individual columns. This is where `VectorAssembler` comes in. `VectorAssembler` combines the indexed columns into a single vector column called `features`, which can then be used for model training and prediction.

In [8]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["asin_index", "main_image_url_index",
               "product_title_index", "sentence_index"],
    outputCol="features"
)

train_df = assembler.transform(train_df)
test_df = assembler.transform(test_df)

In [9]:
train_df.printSchema()
train_df.limit(5).toPandas()

root
 |-- asin: string (nullable = false)
 |-- helpful: double (nullable = true)
 |-- main_image_url: string (nullable = false)
 |-- product_title: string (nullable = false)
 |-- sentence: string (nullable = false)
 |-- asin_index: double (nullable = false)
 |-- main_image_url_index: double (nullable = false)
 |-- product_title_index: double (nullable = false)
 |-- sentence_index: double (nullable = false)
 |-- features: vector (nullable = true)



|   | asin | helpful | main_image_url | product_title | sentence | asin_index | main_image_url_index | product_title_index | sentence_index | features |
|---|------|---------|----------------|---------------|----------|------------|----------------------|---------------------|----------------|----------|
| 0 | B000AO3L84 | 1.7 | http://ecx.images-amazon.com/images/I/41XAEKR9... | Canon 430EX Speedlite Flash for Canon EOS SLR ... | this flash is a superb value. | 40.0 | 40.0 | 40.0 | 19715.0 | [40.0, 40.0, 40.0, 19715.0] |
| 1 | B001SEQPGK | 1.3 | http://ecx.images-amazon.com/images/I/71KLvmtc... | Sony Cyber-shot DSC-W290 12 MP Digital Camera ... | The pictures were not sharp at all. | 85.0 | 85.0 | 85.0 | 13765.0 | [85.0, 85.0, 85.0, 13765.0] |
| 2 | 0553386697 | 1.9 | http://ecx.images-amazon.com/images/I/81HdbmkR... | The Whole-Brain Child: 12 Revolutionary Strate... | A very good resource for parents. | 37.0 | 37.0 | 37.0 | 744.0 | [37.0, 37.0, 37.0, 744.0] |
| 3 | B006SUWZH2 | 0.25 | http://ecx.images-amazon.com/images/I/61A2WQOL... | Memorex Portable CD Boombox with AM FM Radio | We have it in a child's room, and will be swit... | 119.0 | 119.0 | 119.0 | 16633.0 | [119.0, 119.0, 119.0, 16633.0] |
| 4 | B000W7F5SS | 0.9 | http://ecx.images-amazon.com/images/I/91E7TPDb... | Harry Potter and the Order of the Phoenix (Wid... | Again the makers are too lazy to bring in the ... | 22.0 | 22.0 | 21.0 | 916.0 | [22.0, 22.0, 21.0, 916.0] |


In [10]:
test_df.printSchema()
test_df.limit(5).toPandas()

root
 |-- asin: string (nullable = false)
 |-- helpful: double (nullable = true)
 |-- main_image_url: string (nullable = false)
 |-- product_title: string (nullable = false)
 |-- sentence: string (nullable = false)
 |-- asin_index: double (nullable = false)
 |-- main_image_url_index: double (nullable = false)
 |-- product_title_index: double (nullable = false)
 |-- sentence_index: double (nullable = false)
 |-- features: vector (nullable = true)



|   | asin | helpful | main_image_url | product_title | sentence | asin_index | main_image_url_index | product_title_index | sentence_index | features |
|---|------|---------|----------------|---------------|----------|------------|----------------------|---------------------|----------------|----------|
| 0 | B00VG90446 | 1.07 | http://ecx.images-amazon.com/images/I/51O-QlBL... | Flexion KS-902 Kinetic Series Wireless Bluetoo... | so it stays in place around your neck. | 48.0 | 47.0 | 47.0 | 20000.0 | [48.0, 47.0, 47.0, 20000.0] |
| 1 | B001196MG0 | 1.33 | http://ecx.images-amazon.com/images/I/11+6ow0D... | Savage 107X12-1 Seamless Background Paper, 107... | Love this seamless paper! | 97.0 | 97.0 | 97.0 | 20000.0 | [97.0, 97.0, 97.0, 20000.0] |
| 2 | B00081NX5U | 1.17 | http://ecx.images-amazon.com/images/I/51GQZT32... | iPod Detachable Receiver 7 | very happy with my purchase.-JI | 5.0 | 5.0 | 5.0 | 20000.0 | [5.0, 5.0, 5.0, 20000.0] |
| 3 | B003HC9JIW | 1.6 | http://ecx.images-amazon.com/images/I/51u7P1db... | Start! Walking At Home with Leslie Sansone: B... | Even for someone with poor balance, this dvd i... | 54.0 | 54.0 | 55.0 | 20000.0 | [54.0, 54.0, 55.0, 20000.0] |
| 4 | B00C30FCUI | 1.49 | http://ecx.images-amazon.com/images/I/416HU0yr... | Symphonized NRG Premium Genuine Wood In-ear No... | , those have always produced a nice deep tone. | 0.0 | 0.0 | 0.0 | 20000.0 | (0.0, 0.0, 0.0, 20000.0) |


After running the code with `VectorAssembler`, you'll notice the presence of a new column `features` in our dataset. The `features` column, created by `VectorAssembler`, combines the numerical indices of `asin`, `main_image_url`, `product_title`, and `sentence` into a single vector. This combined vector is essential for machine learning algorithms, which require input features to be in this format.
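One detail in the test output is worth a note: the last row's `features` value is printed with parentheses, and the same row later appears as `(4,[3],[20000.0])` in the predictions. This is Spark's sparse vector notation (size, indices of the non-zero entries, their values), which Spark may choose when most entries are zero. The plain-Python sketch below converts between the two forms; the helper names `to_sparse` and `to_dense` are hypothetical and are not part of the Spark API.

```python
def to_sparse(values):
    """Return (size, indices, nonzero_values) for a dense list of numbers."""
    indices = [i for i, v in enumerate(values) if v != 0.0]
    return len(values), indices, [values[i] for i in indices]


def to_dense(size, indices, nonzero_values):
    """Rebuild the dense list from its sparse description."""
    dense = [0.0] * size
    for i, v in zip(indices, nonzero_values):
        dense[i] = v
    return dense


row = [0.0, 0.0, 0.0, 20000.0]  # indexed features of the last test row shown
sparse = to_sparse(row)
print(sparse)             # (4, [3], [20000.0])
print(to_dense(*sparse))  # [0.0, 0.0, 0.0, 20000.0]
```

Both forms describe the same vector, so the model treats them identically.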

With the `features` column in place, we can now proceed to build a simple machine learning model to predict the helpfulness score (`helpful`).

### Model Development and Assessment

In [11]:
from pyspark.ml.regression import LinearRegression

model = LinearRegression(
            featuresCol="features",
            labelCol="helpful"
        )

model_trained = model.fit(train_df)
predictions = model_trained.transform(test_df)

In [12]:
predictions[['features', 'sentence', 'prediction']].show()

+--------------------+--------------------+------------------+
|            features|            sentence|        prediction|
+--------------------+--------------------+------------------+
|[48.0,47.0,47.0,2...|so it stays in pl...|1.1836249559867604|
|[97.0,97.0,97.0,2...|Love this seamles...|1.1679165301592547|
|[5.0,5.0,5.0,2000...|very happy with m...| 1.175001097995939|
|[54.0,54.0,55.0,2...|Even for someone ...|1.1951499027694603|
|   (4,[3],[20000.0])|, those have alwa...|1.1753861288566285|
|[97.0,97.0,97.0,2...|but after a year ...|1.1679165301592547|
|[33.0,33.0,33.0,2...|While not quite a...|1.1728449251760786|
|[55.0,55.0,54.0,2...|Until now, Sloe G...|1.1472286821807682|
|[30.0,30.0,30.0,2...|I considered the ...|1.1730759436924922|
|[4.0,4.0,4.0,2000...|"Baby, don't cry,...|1.1750781041680771|
|[116.0,116.0,116....|I'm glad I bought...| 1.166453412888635|
|[5.0,5.0,5.0,2000...|Great sound I bou...| 1.175001097995939|
|[18.0,18.0,18.0,2...|The Age of Enligh...|1.1740000177

In [13]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="helpful",
    predictionCol="prediction",
    metricName="mae")
mae = evaluator.evaluate(predictions)

print(f"Test MAE: {mae}")
if mae < 0.5:  # the objective is an MAE strictly below 0.5
    print("The model meets the required MAE threshold of 0.5.")
else:
    print("The model does not meet the required MAE threshold of 0.5.")

Test MAE: 0.32752672659753945
The model meets the required MAE threshold of 0.5.


From our development and testing of the model, we achieved a Test Mean Absolute Error (MAE) of 0.3275, which is below the target of 0.5. On average, the model's predictions are therefore within about a third of a point of the actual helpfulness scores. Note, however, that the predictions shown above all fall in a narrow band between roughly 1.15 and 1.20, so the MAE should be read alongside a simple baseline before concluding how much the indexed features contribute.
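As a sanity check on any MAE figure, it helps to compare it against a constant prediction such as the mean training label. The sketch below shows the idea in plain Python. It is an assumption-laden illustration, not a step from the original run: it uses only the `helpful` values of the first five training and test rows displayed earlier, and the helper name `mean_baseline_mae` is hypothetical. The resulting number says nothing definitive about the full dataset.

```python
def mean_baseline_mae(train_labels, test_labels):
    """MAE obtained by always predicting the mean of the training labels."""
    baseline = sum(train_labels) / len(train_labels)
    mae = sum(abs(y - baseline) for y in test_labels) / len(test_labels)
    return baseline, mae


# `helpful` values from the first five rows shown above (tiny sample, illustration only)
train_sample = [1.7, 1.3, 1.9, 0.25, 0.9]
test_sample = [1.07, 1.33, 1.17, 1.6, 1.49]

baseline, baseline_mae = mean_baseline_mae(train_sample, test_sample)
print(f"Constant prediction: {baseline:.2f}, baseline MAE on sample: {baseline_mae:.3f}")
```

Running the same comparison on the full `train_df` and `test_df` would indicate whether the linear regression improves on this constant baseline.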

### Pipelines
A key takeaway from this task is the use of Pipelines. A Pipeline chains the feature transformers and the estimator into a single object, so the whole workflow is fitted with one `fit` call and applied with one `transform` call. This reduces boilerplate and guarantees that the test data passes through exactly the same fitted transformations as the training data.

In [14]:
# Import libraries --------------------------------
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Set up boto3 and download .json file ------------
# !pip install boto3
# import boto3

# s3 = boto3.client('s3', region_name='us-east-1')
# bucket_name = 'helpful-sentences-from-reviews'
# file_key = 'train.json'
# s3.download_file(bucket_name, file_key, 'train.json')
# file_key = 'test.json'
# s3.download_file(bucket_name, file_key, 'test.json')

# Initialize spark --------------------------------
spark = (SparkSession.builder
         .config("spark.executor.memory", "4g")
         .config("spark.driver.memory", "4g")
         .getOrCreate())

# Data Preprocessing ------------------------------
train_df = spark.read.json('train.json').fillna("")
test_df = spark.read.json('test.json').fillna("")
train_df.cache()
test_df.cache()

# Pipeline Creation and Model Fitting -------------
from pyspark.ml import Pipeline
pipeline = Pipeline(
    stages=[
        StringIndexer(
            inputCol="asin",
            outputCol="asin_index"
        ),
        StringIndexer(
            inputCol="main_image_url",
            outputCol="main_image_url_index"
        ),
        StringIndexer(
            inputCol="product_title",
            outputCol="product_title_index"
        ),
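        StringIndexer(
            inputCol="sentence",
            outputCol="sentence_index",
            handleInvalid="keep"  # unseen test sentences map to one extra index
        ),
        # Note: unlike the step-by-step version, sentence_index is not
        # included in the feature vector here.
        VectorAssembler(
            inputCols=["asin_index",
                       "main_image_url_index",
                       "product_title_index"],
            outputCol="features"
        ),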
        LinearRegression(
            featuresCol="features",
            labelCol="helpful"
        )]
)

# Train and Test Model ----------------------------
pipeline_model = pipeline.fit(train_df)
predictions = pipeline_model.transform(test_df)

# Evaluate Model ----------------------------------
evaluator = RegressionEvaluator(
    labelCol="helpful",
    predictionCol="prediction",
    metricName="mae")
mae = evaluator.evaluate(predictions)

print(f"Test MAE: {mae}")
if mae < 0.5:  # the objective is an MAE strictly below 0.5
    print("The model meets the required MAE threshold of 0.5.")
else:
    print("The model does not meet the required MAE threshold of 0.5.")

Test MAE: 0.32734439097351814
The model meets the required MAE threshold of 0.5.


By using a Pipeline, we have streamlined the process of transforming the data, fitting the model, and making predictions into a cohesive workflow. This approach simplifies the code and improves the reproducibility of our machine learning tasks. The Pipeline's Test MAE of 0.3273 differs slightly from the 0.3275 obtained earlier, most likely because its `VectorAssembler` does not include `sentence_index`. Both results are below the 0.5 target.
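To show what happens conceptually when a Pipeline is fitted, here is a toy plain-Python sketch. It is not Spark's implementation, and every class name in it (`AddConstant`, `MeanCenterer`, `ToyPipeline`, `ToyPipelineModel`) is hypothetical. The idea it illustrates: stages are processed in order, stages that need fitting are fitted on the data as transformed so far, and the fitted stages are then replayed unchanged on new data.

```python
class AddConstant:
    """Toy transformer: needs no fitting."""

    def __init__(self, constant):
        self.constant = constant

    def transform(self, rows):
        return [r + self.constant for r in rows]


class MeanCenterer:
    """Toy estimator: fit() learns the mean and returns a fitted transformer."""

    def fit(self, rows):
        mean = sum(rows) / len(rows)
        return AddConstant(-mean)


class ToyPipelineModel:
    """Holds fitted stages and applies them in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Fits each stage in order, feeding it the output of the previous stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


pipeline = ToyPipeline([AddConstant(1.0), MeanCenterer()])
model = pipeline.fit([1.0, 2.0, 3.0])      # the mean is learned on training rows only
print(model.transform([1.0, 2.0, 3.0]))   # [-1.0, 0.0, 1.0]
print(model.transform([4.0]))             # [2.0]
```

The new row `4.0` is shifted by the mean learned during fitting rather than by its own statistics, which is the same guarantee the Spark Pipeline gives for the test DataFrame.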

### Conclusion

In this project, we demonstrated the application of Apache Spark ML for predicting the helpfulness score of product review sentences. Using Spark's DataFrame and ML APIs, which scale from a single machine to a cluster, we loaded the data and preprocessed it by converting categorical features into numerical indices and assembling them into feature vectors.

Our linear regression model achieved a Test Mean Absolute Error (MAE) of 0.3275, below the threshold of 0.5, meaning that its predictions are within about a third of a point of the actual helpfulness scores on average. The use of Spark ML Pipelines consolidated the entire workflow into a single fitted object, improving the reproducibility of our machine learning process.

### References
1. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). *Spark: Cluster computing with working sets*. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association. Available at: [https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf](https://www.usenix.org/legacy/event/hotcloud10/tech/full_papers/Zaharia.pdf)

2. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., ... & Zaharia, M. (2016). *MLlib: Machine Learning in Apache Spark*. Journal of Machine Learning Research, 17(34), 1-7. Available at: [http://jmlr.org/papers/v17/15-237.html](http://jmlr.org/papers/v17/15-237.html)

3. Amazon Web Services. (n.d.). *Helpful Sentences from Reviews Dataset*. AWS Open Data Registry. Available at: [https://registry.opendata.aws/helpful-sentences-from-reviews/](https://registry.opendata.aws/helpful-sentences-from-reviews/)