# Create Wine Recommender Using NLP on AWS SageMaker

Authors: __[Zephyr Headley](https://github.com/jzheadley)__ and __[John Naylor](https://jonaylor.xyz)__

Blog Post: \<Blog Post Link>

[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/jonaylor89/WineInAMillion/blob/main/notebooks/Wine%20In%20A%20Million.ipynb)


# Introduction

**Bidirectional Encoder Representations from Transformers** (**BERT**) is a [transformer](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) based [machine learning](https://en.wikipedia.org/wiki/Machine_learning) technique for [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP) pre-training developed by [Google](https://en.wikipedia.org/wiki/Google). BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. In 2019, Google announced that it had begun leveraging BERT in [its search engine](https://en.wikipedia.org/wiki/Google_Search), and by late 2020 it was using BERT in almost every English-language query. A 2020 literature survey concluded that "in a little over a year, BERT has become a ubiquitous baseline in NLP experiments", counting over 150 research publications analyzing and improving the model. BERT improves on previous word2vec models by embedding not just lone words but words in the context they appear in, which represents them more accurately. Today, BERT has been adapted, altered, and fine-tuned for many different use cases, and in this notebook we'll specifically be using **Sentence-BERT**.

### What is Sentence-BERT?

Sentence-BERT is a modification of pretrained BERT described well in an article [here](https://www.capitalone.com/tech/machine-learning/how-to-finetune-sbert-for-question-matching/). To quote a passage from said article:

> Sentence-BERT is a word embedding model. [Word embedding](https://en.wikipedia.org/wiki/Word_embedding) models are used to numerically represent language by transforming phrases, words, or word pieces (parts of words) into vectors. These models can be pre-trained on a large background corpus (dataset) and then later updated with a smaller corpus that is catered towards a specific domain or task. This process is known as fine-tuning.
> 

Sentence-BERT works great for the task we're going to be using it for because it has been optimized for faster similarity computation on the individual sentence level.
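
To make "similarity computation" a bit more concrete, here is a minimal sketch of cosine similarity, the measure this notebook later uses to compare embeddings. The three-dimensional vectors are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (invented values, not produced by any model)
fruity_red = [0.9, 0.1, 0.3]
jammy_red = [0.8, 0.2, 0.4]
crisp_white = [0.1, 0.9, 0.2]

print(cosine_similarity(fruity_red, jammy_red))    # close to 1: similar direction
print(cosine_similarity(fruity_red, crisp_white))  # much lower: different direction
```

Vectors that point in nearly the same direction score close to 1, regardless of their length, which is why cosine similarity is a common choice for comparing text embeddings.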

### Nearest Neighbors

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric: standard Euclidean distance is the most common choice, with Manhattan distance and cosine distance as common alternatives. Neighbors-based methods are known as *non-generalizing* machine learning methods, since they simply “remember” all of their training data (possibly transformed into a fast indexing structure such as a [Ball Tree](https://scikit-learn.org/stable/modules/neighbors.html#ball-tree) or [KD Tree](https://scikit-learn.org/stable/modules/neighbors.html#kd-tree)). This is similar to the popular KNN algorithm, except that KNN usually implies classification or regression based on the neighboring sample points, while plain nearest neighbors simply returns the neighboring sample points themselves.
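
The difference between the two flavors is easy to see in a brute-force sketch. The helper names and the sample points below are hypothetical; a library such as scikit-learn would normally do this work, with an index structure instead of a full scan.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(points, query, k):
    """Indices of the k stored points closest to the query, nearest first."""
    order = sorted(range(len(points)), key=lambda i: euclidean(points[i], query))
    return order[:k]


def within_radius(points, query, radius):
    """Indices of every stored point no farther than radius from the query."""
    return [i for i, p in enumerate(points) if euclidean(p, query) <= radius]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
query = (0.9, 0.1)

print(k_nearest(points, query, k=2))             # [1, 0]
print(within_radius(points, query, radius=1.0))  # [0, 1]
```

Note that nothing is "learned" here: the stored points are simply searched at query time, and the result is the neighbors themselves rather than a predicted label.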

### Overview

In this notebook, we'll demonstrate how BERT can be used in tandem with Nearest Neighbors to create a recommendation engine that uses natural language as an input. To do this, we'll take advantage of a dataset of wine reviews located [here](https://www.kaggle.com/zynicide/wine-reviews) that contains 130k different reviews of various wines. We'll use BERT to convert those wine reviews into embeddings (i.e. vectors) and store the embeddings in AWS S3. The stored embeddings then become the dataset for the Nearest Neighbor algorithm, which will in turn accept new user input, create an embedding for it, and find the *K* closest embeddings to that input. **In essence, we find the wines whose reviews are most similar to the input the user provided.**

In [None]:
!pip install sentence_transformers
!pip install nvidia-ml-py3
!pip install opendatasets

!pip install nb_black
%load_ext nb_black

In [None]:
import os
import tarfile
import json
import time
import pandas as pd
import boto3
import joblib
import sagemaker
import opendatasets as od
from time import gmtime, strftime
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import image_uris
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sentence_transformers import SentenceTransformer
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.predictor import Predictor
from sagemaker.inputs import TrainingInput
from sklearn.neighbors import NearestNeighbors
from sagemaker.pipeline import PipelineModel

# Preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation

from tqdm.notebook import tqdm

tqdm.pandas()

print(f"SageMaker SDK Version: {sagemaker.__version__}")

In [None]:
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

### Setup

Let's start by specifying:

- The S3 buckets and prefixes that you want to use for saving model data and where training data is located. These should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [None]:
role = get_execution_role()

# bucket = "<S3_BUCKET>"
# prefix = "<S3_KEY_PREFIX>"
# filename = "<DATASET_FILENAME>"

bucket = "winemag-data-wineinamillion-23452"
prefix = "data/"
filename = "winemag-data-130k-v2.csv"

assert bucket != "<S3_BUCKET>"
assert prefix != "<S3_KEY_PREFIX>"
assert filename != "<DATASET_FILENAME>"

raw_data_location = f"s3://{bucket}/{prefix}raw/{filename}"
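
One thing to watch for: the f-string above only produces a valid key if `prefix` ends with a `/`. As a small sketch, a helper like the hypothetical `s3_uri` below (not part of boto3 or the SageMaker SDK) joins the pieces so that a missing or doubled slash doesn't matter.

```python
import posixpath


def s3_uri(bucket, *key_parts):
    """Join key parts with single slashes and prepend the s3:// scheme."""
    key = posixpath.join(*(part.strip("/") for part in key_parts if part))
    return f"s3://{bucket}/{key}"


# Bucket and file names here are placeholders
print(s3_uri("my-bucket", "data/", "raw", "reviews.csv"))
print(s3_uri("my-bucket", "data", "raw", "reviews.csv"))
```

Both calls print `s3://my-bucket/data/raw/reviews.csv`.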

### Download the dataset from Kaggle (requires Kaggle account)

In [None]:
# https://www.analyticsvidhya.com/blog/2021/04/how-to-download-kaggle-datasets-using-jupyter-notebook/
od.download("https://www.kaggle.com/zynicide/wine-reviews")
inputs = boto3.resource("s3").Bucket(bucket).upload_file(f"wine-reviews/{filename}", f"{prefix}raw/{filename}")

To make sure everything worked correctly, we can read the data into a Pandas dataframe directly from S3 and then use the `describe` method to give us a summary of our data. The summary statistics aren't very meaningful for this dataset, but they confirm that we were able to pull the data.

In [None]:
df = pd.read_csv(raw_data_location)
df.describe()

We can use `head()` to give us a little sample of the dataset as a whole.

In [None]:
df.head(5)

And we can take a look at one of the descriptions to get a better idea of what they look like.

In [None]:
print(df["description"][0])

# Preprocess & Clean Data

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to prepare your data in S3. Fortunately, we aren't using *too* much data, so we can clean it right here in the notebook and upload the result to S3.

In [None]:
# Build these once instead of on every call
stop_words = set(stopwords.words("english"))
wordnet_lemmatizer = WordNetLemmatizer()


def clean_data(desc):
    # Lowercase the text and drop English stop words
    lower = " ".join(w for w in desc.lower().split() if w not in stop_words)
    # Strip punctuation characters
    punct = "".join(ch for ch in lower if ch not in punctuation)
    # Tokenize, then reduce each token to its lemma
    word_tokens = nltk.word_tokenize(punct)
    lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in word_tokens]
    return " ".join(lemmatized_words)


df['clean_desc'] = df["description"].apply(clean_data)

print(df['clean_desc'].head(5))
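
To see what the first two cleaning steps do without downloading any NLTK data, here is a dependency-free sketch. The stop word set is a tiny stand-in chosen for illustration (NLTK's English list is much longer), and lemmatization is left out.

```python
from string import punctuation

# Tiny stand-in for NLTK's English stop word list (assumption, illustration only)
STOP_WORDS = {"a", "and", "of", "the", "with", "is"}


def strip_stop_words(text):
    """Lowercase the text and drop whitespace-separated stop words."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


def strip_punctuation(text):
    """Remove every ASCII punctuation character."""
    return "".join(ch for ch in text if ch not in punctuation)


review = "The wine is ripe and fruity, with a hint of oak."
step1 = strip_stop_words(review)
step2 = strip_punctuation(step1)

print(step1)  # wine ripe fruity, hint oak.
print(step2)  # wine ripe fruity hint oak
```

Because stop words are matched before punctuation is stripped, a token with punctuation attached (for example `and,`) is not recognized as a stop word and survives the first step.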

After cleaning the dataset, we upload the cleaned version to S3 for later steps.

In [None]:
# Upload the preprocessed dataset to S3
df.to_csv("cleaned_dataset.csv")
clean_data_location = f"{prefix}clean/cleaned_dataset.csv"
# Upload the file we just wrote ("cleaned_dataset.csv") to the clean/ prefix
inputs = boto3.resource("s3").Bucket(bucket).upload_file("cleaned_dataset.csv", clean_data_location)


# Sentence-BERT Embeddings

There is a Python library called **`sentence-transformers`** that provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar text is close together and can efficiently be found using cosine similarity.

The sentence transformer we'll use is "sentence-transformers/all-MiniLM-L6-v2", which is fairly generic but should work if accuracy isn't the number one priority. Downloading this model gives us a folder of model files, which we'll need to bundle and save to S3 in order to use SageMaker to host our model.

Check out the full list of sentence transformers on HuggingFace: [https://huggingface.co/sentence-transformers](https://huggingface.co/sentence-transformers)

In [None]:
# A peek at what the Sentence-BERT embeddings look like

model_name = 'sentence-transformers/all-MiniLM-L6-v2'

saved_model_dir = 'transformer'
if not os.path.isdir(saved_model_dir):
    os.makedirs(saved_model_dir)

model = SentenceTransformer(model_name)
model.save(saved_model_dir)

embeddings = model.encode(df["clean_desc"][0])
print(len(embeddings))

# Generate Initial Embeddings

**[WARNING]** This step will cost some money 💰


In [None]:
embeddings_estimator = PyTorch(
    role = role, 
    entry_point ='embeddings_script.py',
    instance_type="ml.m5.2xlarge",
    instance_count=1,
    source_dir = './src', 
    framework_version = '1.9.0',
    py_version = 'py38',
    sagemaker_session=sagemaker.Session(),
    output_path=f"s3://{bucket}/model/embeddings",
    hyperparameters={
        'output-data-dir': "/opt/ml/output/data/",
    },
)

In [None]:
# Generates embeddings of dataset
# Point the training channel at the cleaned dataset uploaded earlier
embeddings_estimator.fit({"train": f"s3://{bucket}/{clean_data_location}"})
print("[+] finished model fitting")
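
The training script itself isn't shown in this notebook, so the following is only a sketch of how a SageMaker script-mode entry point typically receives its inputs: hyperparameters arrive as command-line flags, and channel and output locations are exposed through `SM_*` environment variables. The function name and fallback defaults are assumptions, not the contents of `embeddings_script.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters and SageMaker-provided directories."""
    parser = argparse.ArgumentParser()
    # Entries in the estimator's `hyperparameters` dict arrive as "--name value" flags
    parser.add_argument(
        "--output-data-dir",
        type=str,
        default=os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"),
    )
    # Each key passed to fit() becomes a channel directory, exposed as SM_CHANNEL_<NAME>
    parser.add_argument(
        "--train",
        type=str,
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Ignore any extra flags the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--output-data-dir", "/opt/ml/output/data/"])
print(args.output_data_dir)
```

Inside the container, the script would read the CSV from `args.train`, compute the embeddings, and write its results under the model or output directory so SageMaker uploads them to S3 when the job finishes.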

### Creating and Testing Embedding Endpoint

After training, we use the `PyTorchModel` object to build and deploy a `PyTorchPredictor`. This creates a SageMaker Endpoint - a hosted prediction service that we can use to perform inference.

We have implementations of `model_fn`, `input_fn`, `predict_fn`, and `output_fn` in the `encoding_inference.py` script, which are required. We rely on the default implementation of `transform_fn` defined in [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers).

The serializer and deserializer configure the `ContentType` field and the `Accept` field, which in our case are both `application/json`. The `ContentType` field configures the first container, while the `Accept` field configures the last container. You can also specify each container's `Accept` and `ContentType` values using environment variables. Here we just send a single test sentence (you'll notice that the inference Python script is very particular about the structure of the inference request data).

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances, and then deploy the Endpoint to a fleet of CPU-based instances. Here we will deploy the model to a single `ml.m5.large` instance.

In [None]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
embeddings_endpoint_name = "embeddings-model-ep-" + timestamp_prefix

embedding_predictor = embeddings_estimator.deploy(
    instance_type='ml.m5.large',
    initial_instance_count=1,
    endpoint_name=embeddings_endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

In [None]:
test_embedding = embedding_predictor.predict(
    {"data": "sweet wine with a hint of tartness"}
)
print(len(test_embedding["embeddings"]))
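
For readers who haven't written these handlers before, here is a minimal sketch of what the four functions could look like for this endpoint. It assumes the request and response shapes used in the test call above (`{"data": ...}` in, `{"embeddings": ...}` out); the actual inference script may differ. `FakeModel` is a hypothetical stand-in so the flow can be tried locally without downloading a model.

```python
import json


def model_fn(model_dir):
    """Load the model saved under model_dir (called once when the container starts)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    return SentenceTransformer(model_dir)


def input_fn(request_body, request_content_type):
    """Deserialize the request into the text to embed."""
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)["data"]


def predict_fn(input_data, model):
    """Run the model and convert its output to a plain list of floats."""
    return [float(x) for x in model.encode(input_data)]


def output_fn(prediction, accept):
    """Serialize the embedding for the caller."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"embeddings": prediction})


class FakeModel:
    """Local stand-in with the same encode() method (not a real embedder)."""

    def encode(self, text):
        return [float(len(text)), 0.0]


body = json.dumps({"data": "sweet wine"})
text = input_fn(body, "application/json")
response = output_fn(predict_fn(text, FakeModel()), "application/json")
print(response)  # {"embeddings": [10.0, 0.0]}
```

The default `transform_fn` simply chains these together: `input_fn`, then `predict_fn`, then `output_fn`.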



# Nearest Neighbors Model 

As mentioned before, neighbors-based methods are known as *non-generalizing* machine learning methods, since they simply “remember” all of their training data. The API is still the same as for any other model (i.e. we'll still call `fit` to give the model our dataset), but there isn't actually any fitting or training going on. In our case, we'll pull the embeddings from S3 and that'll be our input dataset.

The API reference for sklearn's `NearestNeighbors` class is available here: [https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html)

To see the full docs for the SKLearn Estimator and Model, see here:  https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/

In [None]:
nn_estimator = SKLearn(
    entry_point='nn_script.py',
    source_dir = './src', 
    instance_type="ml.m5.large",  # SDK v2 name (was train_instance_type in v1)
    role=role,
    sagemaker_session=sagemaker.Session(),
    framework_version="0.23-1",
    py_version="py3",
    hyperparameters={
        'n_neighbors': 10, 
        'metric': 'cosine',
    },
)
    
nn_estimator.fit({'train': f"s3://{bucket}/model/embeddings/embeddings.csv.tar.gz"})

Similar to the `PyTorchModel` above, after training we use the `SKLearnModel` object to build and deploy a `SKLearnPredictor`.

In [None]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
nn_endpoint_name = "nn-model-ep-" + timestamp_prefix
nn_predictor = nn_estimator.deploy(
    instance_type="ml.m5.large",
    initial_instance_count=1,
    endpoint_name=nn_endpoint_name,
)

In [None]:
predictor = Predictor(
    endpoint_name=nn_endpoint_name,
    sagemaker_session=sagemaker.Session(),
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

prediction = predictor.predict(
    {"embeddings": test_embedding["embeddings"], "kneighbors": 5}
)
print(prediction)
# zipped = list(
#     zip(
#         prediction["recommendations"]["neighbors"][0],
#         prediction["recommendations"]["distance"][0],
#     )
# )
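
If the response has the shape that the commented-out lines above expect, pairing each neighbor with its distance could look like the sketch below. The example response is hypothetical, and the helper name is ours.

```python
def rank_neighbors(prediction):
    """Pair each neighbor index with its distance, closest first."""
    recs = prediction["recommendations"]
    pairs = zip(recs["neighbors"][0], recs["distance"][0])
    return sorted(pairs, key=lambda pair: pair[1])


# Hypothetical response (values invented for illustration)
example = {
    "recommendations": {
        "neighbors": [[42, 7, 19]],
        "distance": [[0.31, 0.12, 0.25]],
    }
}

print(rank_neighbors(example))  # [(7, 0.12), (19, 0.25), (42, 0.31)]
```

Assuming the neighbor indices are row positions in the original dataset, they could then be looked up with `df.iloc[...]` to show the matching wines.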



# Inference Pipeline

Setting up a Machine Learning pipeline can be done with the `PipelineModel`. This sets up a list of models in a single endpoint; in this example, we configure our pipeline model with the BERT embedding model and the fitted Scikit-learn Nearest Neighbors inference model. Deploying the model follows the same `deploy` pattern in the SDK.

Inference Pipeline documentation is located here: [https://sagemaker.readthedocs.io/en/stable/api/inference/pipeline.html](https://sagemaker.readthedocs.io/en/stable/api/inference/pipeline.html)

In [None]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
endpoint_name = "inference-pipeline-ep-" + timestamp_prefix
pipeline_model = PipelineModel(
    role=role, 
    models=[
        embeddings_estimator.create_model(), 
        nn_estimator.create_model(),
    ],
    sagemaker_session=sagemaker.Session(),
)
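
Conceptually, an inference pipeline passes the response of each container to the next one as its request. The sketch below mimics that hand-off with two plain functions; both are toy stand-ins (the "embedding" is just the text length), and the JSON shapes are assumptions based on the requests used earlier in this notebook.

```python
import json
from functools import reduce


def embed_container(body):
    """Stand-in for the embedding container: text in, vector out."""
    text = json.loads(body)["data"]
    return json.dumps({"embeddings": [float(len(text)), 1.0]})


def neighbors_container(body):
    """Stand-in for the nearest-neighbor container: vector in, ids out."""
    vector = json.loads(body)["embeddings"]
    return json.dumps({"recommendations": {"neighbors": [[int(vector[0]) % 3]]}})


def run_pipeline(containers, body):
    """Feed each container's response to the next one, in order."""
    return reduce(lambda payload, container: container(payload), containers, body)


result = run_pipeline(
    [embed_container, neighbors_container],
    json.dumps({"data": "dry red"}),
)
print(result)  # {"recommendations": {"neighbors": [[1]]}}
```

This is why the output format of the first model's `output_fn` has to match what the second model's `input_fn` expects.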


### Hosting

In [None]:
inference_pipeline = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

In [None]:
pipeline_predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(),
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Test Pipeline

We make our request with the payload in `application/json` format, since that is what our scripts currently support. If other formats need to be supported, they would have to be added to the `input_fn()` (for requests) and `output_fn()` (for responses) methods in our entry points. The prediction output in this case is the set of wines whose reviews are most similar to the input text.

In [None]:
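# Pass a dict: JSONSerializer does the JSON encoding, so calling json.dumps
# first would encode the payload twice
test_payload = {"data": "sweet wine with a hint of tartness"}
test_response = pipeline_predictor.predict(data=test_payload)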

print(test_response)
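
Applications outside the notebook usually call the endpoint through the low-level runtime API rather than the SageMaker SDK. Here is a sketch of that call; `invoke_json` is a hypothetical helper, and `FakeRuntime` is a local stand-in so the example runs without AWS credentials.

```python
import io
import json


def invoke_json(runtime_client, endpoint_name, payload):
    """Send a JSON payload to an endpoint and decode the JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


class FakeRuntime:
    """Mimics the part of the runtime client used above (echoes the input)."""

    def invoke_endpoint(self, EndpointName, ContentType, Accept, Body):
        echoed = json.loads(Body)["data"]
        return {"Body": io.BytesIO(json.dumps({"echo": echoed}).encode("utf-8"))}


reply = invoke_json(FakeRuntime(), "my-endpoint", {"data": "sweet wine with a hint of tartness"})
print(reply)

# Against AWS, pass a real client instead:
#   invoke_json(boto3.client("sagemaker-runtime"), endpoint_name, {"data": "..."})
```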

# Clean Up

Finally, we should delete the models and endpoints before we close the notebook so that they stop incurring charges.

In [None]:
# Delete the models first: delete_model() looks them up via the endpoint config
embedding_predictor.delete_model()
nn_predictor.delete_model()
pipeline_predictor.delete_model()

# Delete the endpoints and their endpoint configurations
embedding_predictor.delete_endpoint()
nn_predictor.delete_endpoint()
pipeline_predictor.delete_endpoint()