<a href="https://colab.research.google.com/github/leanmarqs/findMaxCrossingSubarray/blob/master/hertie_transformers_bert_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Measuring Word Similarity with BERT (Sephora Makeup Reviews)

By [The BERT for Humanists](https://melaniewalsh.github.io/BERT-for-Humanists/) Team

How can we measure the similarity of words, or word uses, in a collection of texts? Let's say we're interested in a collection of Sephora makeup reviews, and we specifically want to understand the experiences of customers who are "sensitive" to makeup in various ways. Do reviewers use the word "sensitive" similarly or differently when describing different products, or when rating a product positively or negatively? What about when reviewers discuss how well makeup holds up at the "pool," the "gym," the "office," or a "wedding"? How do these contexts compare?

We can explore all of these questions with BERT, a natural language processing model that has revolutionized the field.

BERT turns words or tokens into vectors: essentially, lists of numbers that act as coordinates in a high-dimensional space, much as (x, y) locates a point in two dimensions. We can then use the geometric similarity between these vectors as a way to represent varying types of similarity between words.
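
To make "geometric similarity" a little more concrete, here is a minimal sketch of cosine similarity, one common way to compare the direction of two vectors. The three short vectors and their names are made-up toy values, not real BERT output, and `cosine_similarity()` is a helper written just for this illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors invented for illustration; they are not real BERT output
sensitive_eyes = [0.9, 0.1, 0.3]
sensitive_skin = [0.8, 0.2, 0.4]
pool = [-0.2, 0.9, -0.5]

# Vectors pointing in a similar direction score close to 1
print(cosine_similarity(sensitive_eyes, sensitive_skin))

# Vectors pointing in very different directions score low or negative
print(cosine_similarity(sensitive_eyes, pool))
```

Because only the angle matters, scaling a vector up or down does not change its score against another vector.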

## In This Notebook
In this Colab notebook, we will specifically analyze a collection of 5k [Sephora makeup reviews](https://github.com/everestpipkin/datagardens/tree/master/students/khanniie/5_newDataSet) scraped by Google engineer [Connie Ye](https://connieye.com/about). While an undergraduate student at Carnegie Mellon, Ye completed a project about Sephora reviews that specifically mentioned crying, and she even created a [website](https://connie.dog/sephora/) where you can explore these waterlogged reviews. 

For our purposes here, we will analyze all 5k Sephora reviews with the [DistilBert model](https://huggingface.co/transformers/model_doc/distilbert.html) and the HuggingFace Python library. DistilBert is a smaller — yet still powerful! — version of BERT. By using the rich representations of words that BERT produces, we will then explore the multivalent meanings of particular words in context.

We hope this notebook will help illustrate how BERT works, how well it works, and how you might use BERT to explore the similarity of words in a collection of texts. But we also hope that these results will expose some of the limitations and challenges of BERT. 

In [None]:
#@title BERT Word Vectors: A Preview { display-mode: "form" }
#@title: Hover
import pandas as pd
import altair as alt

url = "https://github.com/melaniewalsh/Neat-Datasets/raw/main/bert-word-sensitive.csv"
df = pd.read_csv(url, encoding='utf-8')

search_keywords = ['sensitive']
color_by = 'word'

alt.Chart(df, title=f"Word Similarity in Sephora Reviews: {', '.join(search_keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    tooltip=['word', 'context', 'type', 'brand']
    ).interactive().properties(
    width=500,
    height=500
)

The plot above displays a preview of our later results. This is what we're working toward!

You can hover over each point to see the instance of each word in context.

<br><br><br><br>

## **Import necessary Python libraries and modules**

Ok enough introduction! Let's get started.

To use the HuggingFace [`transformers` Python library](https://huggingface.co/transformers/installation.html), we first need to install it with `pip`.

In [None]:
!pip install transformers


Then we will import the DistilBertModel and DistilBertTokenizerFast from the Hugging Face `transformers` library. We will also import a handful of other Python libraries and modules.

In [None]:
# For BERT
from transformers import DistilBertTokenizerFast, DistilBertModel

# For data manipulation and analysis
import pandas as pd
pd.options.display.max_colwidth = 200
import numpy as np
from sklearn.decomposition import PCA

# For interactive data visualization
import altair as alt

In [None]:
from collections import defaultdict
import random
import json
from urllib.request import urlopen

<br><br><br><br>

## **Load text dataset**

Below we will read in a number of JSON files that contain the Sephora reviews. This is simply how we've chosen to load our data, and any way that you are comfortable loading data should work for this step. In the end, all you really need is a list of texts.

In [None]:
# dataset_url = 'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/crying_dataset.json'

dataset_urls = ['https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/better_than_sex.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/better_than_sex_waterproof.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/cannonball.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/diorshow_waterproof.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/emotionproof.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/fenty.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/hangover_primer.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/highliner.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/kat_von_d_tattoo_liner.json',
                # 'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/kvd_inkwell_longwear.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/primer.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/stila_waterproof_eyeliner.json',
                'https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/urban_decay_setting_spray.json']

In [None]:
# Read in each dataset and merge them together
df_list = []
for _url in dataset_urls:
  # _json = pd.read_json(_url)
  print(_url)
  _df = pd.DataFrame(pd.read_json(_url)['reviews'].tolist())
  _json = json.loads(urlopen(_url).read()) 
  _df['brand'] = _json['brand']
  _df['name'] = _json['product_name']
  _df['type'] =  _json['product_type']
  _df['url'] = _json['url']
  df_list.append(_df)

reviews_df = pd.concat(df_list)

len(reviews_df.index)

https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/better_than_sex.json
https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/better_than_sex_waterproof.json
https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/cannonball.json
https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/diorshow_waterproof.json
https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/emotionproof.json
https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/fenty.json
https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/hangover_primer.json
https://raw.githubusercontent.com/everestpipkin/datagardens/master/students/khanniie/5_newDataSet/all_json/

5018

Let's look at a sample of reviews

In [None]:
reviews_df.sample(3)

Unnamed: 0,extra_info,stars,date,title,description,brand,name,type,url
288,<b>Age</b> 25-34,5 stars,5 Oct 2014,Some good stuff!,This primer definitely keeps my eye shadow in place. So much so that I wish I could find a FACE primer that works as well as this does!\nI'd originally received a small sample of a variety of UD p...,Urban Decay,Eyeshadow Primer Potion - Original,Eye Primer,https://www.sephora.com/product/eyeshadow-primer-potion-tube-original-P284716?icid2=products%20grid:p284716
339,<b>Age</b> 18-24,5 stars,13 Oct 2014,This is the PERFECT eyeliner for the waterline because is last all day long.,,Marc Jacobs,Highliner Gel Eye Crayon Eyeliner,Gel Eyeliner,https://www.sephora.com/product/highliner-gel-crayon-P379434?icid2=products%20grid:p379434
502,<b>Age</b> 25-34,4 stars,26 Dec 2015,"great volume, but..","This mascara is hands down the best mascara I've used in regards to volume and blackness. It almost gives my lashes a falsie look, without the feel of false lashes (I HATE falsies). Love the shape...",Too Faced,Better Than Sex Mascara,Mascara,https://www.sephora.com/product/better-than-sex-mascara-P381000?icid2=products%20grid:p381000


What types of makeup are being reviewed?

In [None]:
reviews_df['type'].value_counts()

Mascara            1786
Liquid Eyeliner    1535
Gel Eyeliner        584
Primer              403
Setting Spray       391
Eye Primer          319
Name: type, dtype: int64

What brands are being reviewed?

In [None]:
reviews_df['brand'].value_counts()

Too Faced                  1373
Urban Decay                1033
Marc Jacobs                 584
Kat Von D                   567
FENTY BEAUTY BY RIHANNA     550
Stila                       418
Dior                        407
Tom Ford                     86
Name: brand, dtype: int64

Let's convert the review text into a list of reviews

In [None]:
texts = reviews_df['description'].tolist()

Let's examine one review in particular

In [None]:
texts[4]

"I finally caved trying this after reading so many great reviews. Unfortunately, for me, it was not meant to be. Despite the brush being great, I had a hard time applying this without getting too clumpy. I didn't find the result to be as volumizing as advertised and it smudged after only a few hours on me. Sigh. Conclusion: this mascara is not so sexy."

<br><br><br><br>

## **Encode/tokenize text data for BERT**

Next we need to transform our reviews into a format that BERT (via Huggingface) will understand. This is called *encoding* or *tokenizing* the data.

We will tokenize the reviews with the `tokenizer()` from HuggingFace's `DistilBertTokenizerFast`. Here's what the `tokenizer()` will do:

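1. Truncate the texts if they're longer than 512 tokens, and pad the shorter ones so that every text in the batch has the same length (with `padding=True`, that is the length of the longest text, up to 512 tokens). If a word is not in BERT's vocabulary, it will be broken up into smaller "word pieces," with each continuation piece marked by a leading `##`.

2. Add in special tokens to help BERT:
    - `[CLS]`: Start token of every document
    - `[SEP]`: End of a document (and the separator between the two texts when a pair of texts is encoded together)
    - `[PAD]`: Padding at the end of the document, as many times as necessary to reach the batch length
    - `##`: Start of a continuation "word piece"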
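
The steps above can be sketched in plain Python. This is a simplified illustration, not the actual HuggingFace implementation: the mini-vocabulary `toy_vocab` and the helpers `wordpiece()`, `encode()`, and `pad_batch()` are hypothetical, and the real tokenizer also handles punctuation, truncation, and a much larger vocabulary.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

def pad_batch(batch):
    """Pad every token list with [PAD] up to the longest list in the batch."""
    longest = max(len(tokens) for tokens in batch)
    return [tokens + ["[PAD]"] * (longest - len(tokens)) for tokens in batch]

# Hypothetical mini-vocabulary, just large enough for this example
toy_vocab = {"too", "cl", "##ump", "##y", "i", "cave", "##d"}

batch = pad_batch([encode("too clumpy", toy_vocab), encode("I caved", toy_vocab)])
for tokens in batch:
    print(" ".join(tokens))
```

Note that the shorter text is only padded up to the length of the longer one, not to a fixed 512 tokens.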

Here we will load `DistilBertTokenizerFast` from the HuggingFace library, which will help us transform and encode the texts so they can be used with BERT.

In [None]:
from transformers import DistilBertTokenizerFast

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

The `tokenizer()` will break word tokens into word pieces, truncate to 512 tokens, and add padding and special BERT tokens.

In [None]:
tokenized_reviews = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

Let's examine the tokenized version of the review we printed above (index 4). We can see that the special BERT tokens have been inserted where necessary.

In [None]:
' '.join(tokenized_reviews[4].tokens)

"[CLS] i finally cave ##d trying this after reading so many great reviews . unfortunately , for me , it was not meant to be . despite the brush being great , i had a hard time applying this without getting too cl ##ump ##y . i didn ' t find the result to be as vol ##umi ##zing as advertised and it sm ##udged after only a few hours on me . sigh . conclusion : this mascara is not so sexy . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA

<br><br><br><br>

## **Load pre-trained BERT model**

Here we will load a pre-trained BERT model. To speed things up we will use a GPU, but using a GPU involves a few extra steps.
The command `.to("cuda")` moves data from regular memory to the GPU's memory, so the cells below need a GPU runtime to run.




In [None]:
from transformers import DistilBertModel

In [None]:
model = DistilBertModel.from_pretrained('distilbert-base-uncased').to("cuda")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<br><br><br><br>

## **Get BERT word embeddings for each document in a collection**

To get word embeddings for all the words in our collection, we will use a `for` loop.

For each review in our list, we will tokenize the review, and we will extract the vocabulary word ID for each word/token in the review (to use for later reference). Then we will run the tokenized review through the BERT model and extract the vectors for each word/token in the review.

We thus create two big lists for all the reviews in our collection — `doc_word_ids` and `doc_word_vectors`.

In [None]:
# List of vocabulary word IDs for all the words in each document (aka each review)
doc_word_ids = []

# List of word vectors for all the words in each document (aka each review)
doc_word_vectors = []

# Below we will slice our review to ignore the first (0th) and last (-1) special BERT tokens
start_of_words = 1
end_of_words = -1

# Below we will index the 0th or first document, which will be the only document, since we're analyzing one review at a time
first_document = 0

for i, review in enumerate(texts):
  
    # Here we tokenize each review with the DistilBERT Tokenizer
    inputs = tokenizer(review, return_tensors="pt", truncation=True, padding=True)

    # Here we extract the vocabulary word ids for all the words in the review (the first or 0th document, since we only have one document)
    # We ignore the first and last special BERT tokens
    # We also convert from a Pytorch tensor to a numpy array
    doc_word_ids.append(inputs.input_ids[first_document].numpy()[start_of_words:end_of_words])

    # Here we send the tokenized reviews to the GPU
    # The model is already on the GPU, but this review isn't, so we send it to the GPU
    inputs.to("cuda")
    # Here we run the tokenized review through the DistilBERT model
    outputs = model(**inputs)

    # We take every element from the first or 0th document, from the 2nd to the 2nd to last position
    # Grabbing the last layer is one way of getting token vectors. There are different ways to get vectors with different pros and cons
    doc_word_vectors.append(outputs.last_hidden_state[first_document,start_of_words:end_of_words,:].detach().cpu().numpy())


Confirm that we have the same number of documents for both the tokens and the vectors:

In [None]:
len(doc_word_ids), len(doc_word_vectors)

(5018, 5018)

In [None]:
doc_word_ids[0], doc_word_vectors[0]

(array([ 2023, 27700,  2003,  9643,   999,  1045,  2245,  2009,  2052,
         2022,  6429,  1998,  2001,  2061,  7568,  2008,  2009,  2001,
         2200,  2152,  3737,  1998,  6450,  2021,  2009,  2001,  2107,
         1037, 10520,  1012,  2028,  5435,  2006,  1998,  2009,  2001,
         2061, 18856, 24237,  2100,   999,  2036,  1010,  2011,  1996,
         2203,  1997,  1996,  2154,  2026,  2104,  2159,  2020,  2304,
         2013,  1996, 27700,   999,  2074,  9202]),
 array([[ 0.04701875,  0.0745259 , -0.00122115, ..., -0.26615977,
          0.36384755,  0.2643668 ],
        [ 0.9409262 ,  0.17063163, -0.17726046, ..., -0.22725135,
         -0.29596454, -0.34577647],
        [-0.03995913, -0.02058482, -0.01363328, ..., -0.1370049 ,
          0.1258472 ,  0.62158054],
        ...,
        [ 0.24695502,  0.30529794,  0.20497112, ...,  0.14400621,
          0.541336  ,  0.00972535],
        [ 0.36152843, -0.09225815,  0.26666042, ...,  0.09259592,
          0.11792559,  0.9894849 ],

<br><br><br><br>

## **Concatenate all word IDs/vectors for all documents**

Each element of these lists contains all the tokens/vectors for one document. But we want to concatenate them into two giant collections.

In [None]:
all_word_ids = np.concatenate(doc_word_ids)
all_word_vectors = np.concatenate(doc_word_vectors, axis=0)
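
Once everything is concatenated, a position in `all_word_ids` no longer says which review it came from. If you need that link later (for example, to look up a review's brand or type), one possible approach is to record where each document ends and search those boundaries. The sketch below uses toy arrays; `doc_ends` and `position_to_doc()` are hypothetical names, not part of the notebook.

```python
import numpy as np

# Toy stand-in for doc_word_ids: three "documents" with 3, 2, and 4 tokens
toy_doc_word_ids = [np.array([11, 12, 13]), np.array([21, 22]), np.array([31, 32, 33, 34])]
toy_all_word_ids = np.concatenate(toy_doc_word_ids)

# Flat position where each document ends (exclusive): 3, 5, 9
doc_ends = np.cumsum([len(ids) for ids in toy_doc_word_ids])

def position_to_doc(position):
    """Return the index of the document that contains a flat token position."""
    return int(np.searchsorted(doc_ends, position, side="right"))

print(position_to_doc(0))  # first token of document 0
print(position_to_doc(3))  # first token of document 1
print(position_to_doc(8))  # last token of document 2
```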

We want to make comparisons between vectors quickly. One common option is *cosine similarity*, which measures the angle between vectors but ignores their length. We can speed this computation up by setting all the word vectors to have length 1.0: for unit-length vectors, the cosine similarity is simply their dot product.

In [None]:
# Calculating the length of each vector (Pythagorean theorem)
row_norms = np.sqrt(np.sum(all_word_vectors ** 2, axis=1))

# Dividing every vector by its length
all_word_vectors /= row_norms[:,np.newaxis]
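
To see why this normalization helps, here is a small sketch on a toy matrix (the values are invented, and the variable names are only for illustration). After each row is divided by its length, a single matrix-vector product gives the cosine similarity between one vector and every other vector.

```python
import numpy as np

# Toy matrix of three 2-dimensional "word vectors" (invented values)
vectors = np.array([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]])

# Same normalization as above: divide each row by its Euclidean length
unit_vectors = vectors / np.sqrt(np.sum(vectors ** 2, axis=1))[:, np.newaxis]

# Every row now has length 1.0
print(np.linalg.norm(unit_vectors, axis=1))

# One matrix-vector product: cosine similarity of row 0 to every row
similarities = unit_vectors @ unit_vectors[0]
print(similarities)
```

Rows 0 and 1 point in the same direction, so their similarity is approximately 1, while row 2 is perpendicular to row 0, so its similarity is approximately 0.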

<br><br><br><br>

## **Find all word positions in a collection**

We can use the array `all_word_ids` to find all the places, or *positions*, in the collection where a word appears.

We can find a word's vocab ID in BERT with `tokenizer.vocab` and then check to see where/how many times this ID occurs in `all_word_ids`.

In [None]:
def get_word_positions(words):
  
  """This function accepts a list of words, rather than a single word"""

  # Get word/vocabulary ID from BERT for each word
  word_ids = [tokenizer.vocab[word] for word in words]

  # Find all the positions where the words occur in the collection
  word_positions = np.where(np.isin(all_word_ids, word_ids))[0]

  return word_positions
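
The same lookup can be tried on a toy example to see what `np.isin` and `np.where` are doing. The vocabulary and ID array below are hypothetical stand-ins for `tokenizer.vocab` and `all_word_ids`.

```python
import numpy as np

# Hypothetical toy vocabulary and flat array of token IDs
toy_vocab = {"my": 0, "sensitive": 1, "eyes": 2, "skin": 3}
toy_all_word_ids = np.array([0, 1, 2, 0, 1, 3, 2])

def toy_word_positions(words):
    """Toy version of get_word_positions() that uses the arrays above."""
    word_ids = [toy_vocab[word] for word in words]
    # Boolean mask: True wherever any of the requested IDs occurs
    mask = np.isin(toy_all_word_ids, word_ids)
    # np.where returns a tuple of index arrays; [0] picks the only one
    return np.where(mask)[0]

print(toy_word_positions(["sensitive"]))     # positions 1 and 4
print(toy_word_positions(["eyes", "skin"]))  # positions 2, 5, and 6
```

In both the toy and the real version, a word that is not a key in the vocabulary raises a `KeyError`, which is what happens for words that BERT stores as several word pieces rather than as a single ID.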

Here we'll check to see all the places where the word "sensitive" appears in the collection.

In [None]:
get_word_positions(["sensitive"])

array([  6092,  10738,  14997,  15153,  29942,  41575,  41939,  63591,
        70013,  79906,  83863,  85581,  98966, 100207, 101808, 113760,
       132861, 133107, 139456, 141039, 143913, 150184, 151975, 152983,
       153478, 153664, 155722, 158101, 158191, 158462, 158588, 158733,
       159408, 159555, 160489, 160519, 161026, 164309, 164446, 165695,
       166768, 169200, 169807, 169993, 170953, 173125, 174005, 174135,
       174318, 176044, 177231, 177921, 178412, 178696, 179240, 180165,
       180317, 180691, 181207, 182832, 183028, 183982, 185382, 185398,
       187400, 187941, 188768, 189513, 192046, 198649, 201040, 204296,
       204679, 209885, 218240, 218641, 221080, 222657, 224083, 227826,
       231586, 234207, 237730, 239393, 240547, 262200, 263270, 266501,
       272143, 281776, 281825, 286571, 287928, 287972, 294293, 294672,
       308101, 325566, 325776, 331010, 331186, 333168, 335337, 335643,
       350574, 354482, 354807, 359220, 360683, 360778, 361831, 363309,
      

In [None]:
word_positions = get_word_positions(["sensitive"])

<br><br><br><br>

## **Find word from word position**

Nice! Now we know all the positions where the word "sensitive" appears in the collection. But it would be more helpful to know the actual words that appear in context around it. To find these context words, we have to convert the vocabulary IDs stored at those positions back into words.

In [None]:
# Here we create an array so that we can go backwards from numeric token IDs to words
word_lookup = np.empty(tokenizer.vocab_size, dtype="O")

for word, index in tokenizer.vocab.items():
    word_lookup[index] = word

Now we can use `word_lookup` to find a word based on its position in the collection.

In [None]:
word_positions = get_word_positions(["sensitive"])

for word_position in word_positions:
  print(word_position, word_lookup[all_word_ids[word_position]])

6092 sensitive
10738 sensitive
14997 sensitive
15153 sensitive
29942 sensitive
41575 sensitive
41939 sensitive
63591 sensitive
70013 sensitive
79906 sensitive
83863 sensitive
85581 sensitive
98966 sensitive
100207 sensitive
101808 sensitive
113760 sensitive
132861 sensitive
133107 sensitive
139456 sensitive
141039 sensitive
143913 sensitive
150184 sensitive
151975 sensitive
152983 sensitive
153478 sensitive
153664 sensitive
155722 sensitive
158101 sensitive
158191 sensitive
158462 sensitive
158588 sensitive
158733 sensitive
159408 sensitive
159555 sensitive
160489 sensitive
160519 sensitive
161026 sensitive
164309 sensitive
164446 sensitive
165695 sensitive
166768 sensitive
169200 sensitive
169807 sensitive
169993 sensitive
170953 sensitive
173125 sensitive
174005 sensitive
174135 sensitive
174318 sensitive
176044 sensitive
177231 sensitive
177921 sensitive
178412 sensitive
178696 sensitive
179240 sensitive
180165 sensitive
180317 sensitive
180691 sensitive
181207 sensitive
182832 sens

We can also look for the 3 words that come before "sensitive" and the 3 words that come after it.

In [None]:
word_positions = get_word_positions(["sensitive"])

for word_position in word_positions:

  # Slice 3 words before "sensitive"
  start_pos = word_position - 3
  # Slice 3 words after "sensitive" (the end of a slice is exclusive, hence + 4)
  end_pos = word_position + 4

  context_words = word_lookup[all_word_ids[start_pos:end_pos]]
  # Join the words together
  context_words = ' '.join(context_words)
  print(word_position, context_words)

6092 not for the sensitive type this is
10738 good for my sensitive eyes and for
14997 mascara ##s or sensitive eyes i would
15153 i have very sensitive eyes and most
29942 ##rita ##te my sensitive eyes , and
41575 really watery and sensitive eyes , so
41939 nice for my sensitive eyes ! …
63591 great for my sensitive , watery eyes
70013 i have very sensitive dry eyes and
79906 eyes are so sensitive . it does
83863 ##proof mascara for sensitive eyes . this
85581 surprisingly great for sensitive eyes , and
98966 you have super sensitive eyes with contact
100207 i have really sensitive eyes , and
101808 rub on your sensitive eye area .
113760 ' m super sensitive when it comes
132861 i have very sensitive eyes , rosa
133107 aren ' t sensitive so i got
139456 . i have sensitive eyes so eye
141039 eyes can be sensitive to mascara .
143913 someone with super sensitive eyes and i
150184 due to my sensitive eyes . application
151975 ##er ##gies or sensitive eyes because your
152983 ##y , super 

Let's make some functions that will help us get the context words around a certain word position for whatever size window (certain number of words before and after) that we want.

The first function `get_context()` will simply return the tokens without cleaning them, and the second function `get_context_clean()` will return the tokens in a more readable fashion.

In [None]:
def get_context(word_id, window_size=10):
  
  """Simply get the tokens that occur before and after word position"""

  start_pos = max(0, word_id - window_size) # The token where we will start the context view
  end_pos = min(word_id + window_size + 1, len(all_word_ids)) # The token where we will end the context view

  # Make a list called tokens and use word_lookup to get the words for the token IDs from the start position to the end position
  tokens = [word_lookup[word] for word in all_word_ids[start_pos:end_pos] ]
  
  context_words = " ".join(tokens)

  return context_words
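
The `max()` and `min()` calls keep the window from running off either end of the collection. Here is a small sketch of that clamping on a toy token list; `context_window()` is a hypothetical helper that isolates just the start and end arithmetic.

```python
def context_window(position, window_size, total_tokens):
    """Return (start, end) slice bounds around a position, clamped to the collection."""
    start_pos = max(0, position - window_size)
    end_pos = min(position + window_size + 1, total_tokens)
    return start_pos, end_pos

tokens = ["i", "have", "very", "sensitive", "eyes", "and", "most"]

# A keyword in the middle gets a full window on both sides
start, end = context_window(3, 2, len(tokens))
print(tokens[start:end])

# Near the edges, the window is cut short instead of wrapping around or failing
print(context_window(0, 2, len(tokens)))
print(context_window(6, 2, len(tokens)))
```

Because all reviews sit in one flat array, the window is only clamped at the ends of the whole collection, so the context for a keyword near the start or end of a review can spill into the neighboring review.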

In [None]:
import re

def get_context_clean(word_id, window_size=10):
  
  """Get the tokens that occur before and after word position AND make them more readable"""

  keyword = word_lookup[all_word_ids[word_id]]
  start_pos = max(0, word_id - window_size) # The token where we will start the context view
  end_pos = min(word_id + window_size + 1, len(all_word_ids)) # The token where we will end the context view

  # Make a list called tokens and use word_lookup to get the words for the token IDs from the start position to the end position
  tokens = [word_lookup[word] for word in all_word_ids[start_pos:end_pos] ]
  
  # Make wordpieces slightly more readable
  # This is probably not the most efficient way to clean and correct for weird spacing
  context_words = " ".join(tokens)
  context_words = re.sub(r'\s+([##])', r'\1', context_words)
  context_words = re.sub(r'##', r'', context_words)
  context_words = re.sub(r"\s+'s", "'s", context_words)
  context_words = re.sub(r"\s+'d", "'d", context_words)
  context_words = re.sub(r"\s'er", "'er", context_words)
  context_words = re.sub(r'\s+([-,:?.!;])', r'\1', context_words)
  context_words = re.sub(r'([-\'"])\s+', r'\1', context_words)
  # Repeat these two substitutions: the previous line turns "' s" into "'s", which can now be attached to its word
  context_words = re.sub(r"\s+'s", "'s", context_words)
  context_words = re.sub(r"\s+'d", "'d", context_words)

  # Bold the keyword by putting asterisks around it
  # re.escape keeps keywords that contain regex metacharacters from breaking the pattern
  if keyword in context_words:
    escaped_keyword = re.escape(keyword)
    context_words = re.sub(rf"\b{escaped_keyword}\b", f"**{keyword}**", context_words)
    context_words = re.sub(rf"\b({escaped_keyword}[esdtrlying]+)\b", r"**\1**", context_words)

  return context_words
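
As an alternative to the regular expressions for word pieces, the pieces can also be glued back together before the tokens are joined into a string. This is only a sketch of that idea, and `merge_wordpieces()` is a hypothetical helper that handles the `##` pieces but none of the punctuation spacing.

```python
def merge_wordpieces(tokens):
    """Glue each ##-prefixed piece onto the token before it."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Drop the ## marker and attach the piece to the previous token
            words[-1] += token[2:]
        else:
            # Ordinary token, or a continuation piece with nothing before it
            words.append(token)
    return words

print(merge_wordpieces(["too", "cl", "##ump", "##y", "."]))
```

A context window can begin in the middle of a word, in which case the first piece has nothing to attach to and keeps its `##` marker, which makes the cut-off visible.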

To visualize the search keyword even more easily, we're going to import a couple of Python modules that will allow us to output text with bolded words and other styling. Here we will make a function `print_md()` that will allow us to print with Markdown styling.

In [None]:
from IPython.display import Markdown, display

def print_md(string):
    display(Markdown(string))

In [None]:
word_positions = get_word_positions(["sensitive"])

for word_position in word_positions:

  print_md(f"<br> {word_position}:  {get_context_clean(word_position)} <br>")

<br> 6092:  a compliment. worth the money. not for the **sensitive** type this is not at all worth the hype <br>

<br> 10738:  the day. this product was not good for my **sensitive** eyes and for the price i think it's <br>

<br> 14997:  have problems with the staying power of mascaras or **sensitive** eyes i would skip this one. … read more <br>

<br> 15153:  brand. i absolutely loved it. i have very **sensitive** eyes and most mascaras irritate my eyes <br>

<br> 29942:  udge easy, doesn 't irritate my **sensitive** eyes, and gives my amazing definition and volume! <br>

<br> 41575:  ! to start off, i have really watery and **sensitive** eyes, so i can only wear waterproof mascara <br>

<br> 41939:  to, so less rubbing is really nice for my **sensitive** eyes! … read more this mascara is absolutely amazing <br>

<br> 63591:  . … read more this mascara is great for my **sensitive**, watery eyes and oily lids. stays put <br>

<br> 70013:  also keeps my lashes very soft. i have very **sensitive** dry eyes and this does not aggravate them <br>

<br> 79906:  it was a fiber mascara since my eyes are so **sensitive**. it does work and doesn 't smudge <br>

<br> 83863:  was told this was the best waterproof mascara for **sensitive** eyes. this stuff feels like razors when it <br>

<br> 85581:  fab i decided this mascara is surprisingly great for **sensitive** eyes, and for people who are looking for a <br>

<br> 98966:  starts to get dried. so if you have super **sensitive** eyes with contact lenses, you'd better not <br>

<br> 100207:  to take it off at night. i have really **sensitive** eyes, and my eyes don 't burn from <br>

<br> 101808:  you dont have to tug & rub on your **sensitive** eye area. i tried it with the dior <br>

<br> 113760:  t seem dry to me. i 'm super **sensitive** when it comes to which products work for me, <br>

<br> 132861:  ation i read on a forum. i have very **sensitive** eyes, rosacea. 99 % of mascaras <br>

<br> 133107:  is a bit strong but my eyes aren 't **sensitive** so i got use to the smell. the smell <br>

<br> 139456:  this product in exchange for a review. i have **sensitive** eyes so eye cream shadow etc can cause tears, <br>

<br> 141039:  and last all day long! my eyes can be **sensitive** to mascara. any flaking and i am it <br>

<br> 143913:  this came out. i 'm someone with super **sensitive** eyes and i was hesitant at first to try it <br>

<br> 150184:  my makeup remover which is important due to my **sensitive** eyes. application is super easy especially with the shape <br>

<br> 151975:  goodluck if you have allergies or **sensitive** eyes because your eyes will water and the liner comes <br>

<br> 152983:  this primer. i have oily, super **sensitive** skin and really dislike silicone-based primers <br>

<br> 153478:  , but i guess it ’ s not for my **sensitive** skin. i have been in the skin care business <br>

<br> 153664:  !!!! wow so i have combination, **sensitive** skin and i'd been hearing a lot about <br>

<br> 155722:  holding my makeup in place, but not for my **sensitive** skin. this didn 't work well with my <br>

<br> 158101:  in areas. it smells great. if you are **sensitive** to strong scents then you may want to avoid it <br>

<br> 158191:  … read more great product! i have extremely dry **sensitive** eczema skin and this primer helps soothe <br>

<br> 158462:  my combo / oily / acne prone / **sensitive** skin! i really like how my make up looks <br>

<br> 158588:  objectives. i have very dry skin that is rather **sensitive**. i am not particularly acne prone, but <br>

<br> 158733:  on my face, but ultimately broke out my very **sensitive** skin and i had to stop using it. which <br>

<br> 159408:  reviews about it breaking people out, i have very **sensitive** acne prone skin and my skin has actually improved <br>

<br> 159555:  or is it just me? i have tried their **sensitive** skin poreless primer and was not happy <br>

<br> 160489:  for me as well as very gentle on my sometimes **sensitive** skin. i now use this daily and love it <br>

<br> 160519:  work for me, i don 't really have **sensitive** skin and it made me breakout in places that are <br>

<br> 161026:  it smells like coconut and doesn 't make my **sensitive** skin break out like some primers do. absolutely <br>

<br> 164309:  lovee love loveee this! i suffer from **sensitive** combination skin that is severely dry during the winter months <br>

<br> 164446:  the couple of years, my skin has gotten quite **sensitive**, and with that, quite dehydrated <br>

<br> 165695:  doesn 't cause redness or irritation to my **sensitive** skin. i love the pump dispenser <br>

<br> 166768:  t irritate or make my cranky / **sensitive** skin breakout! no silicones! works fantastic, <br>

<br> 169200:  and really gives my makeup staying power. i have **sensitive**, acne-prone skin and have had no <br>

<br> 169807:  it's a beautiful product. my skin is **sensitive** and reactive, no issues. in fact, i <br>

<br> 169993:  primer i have ever used. i have combination **sensitive** skin so it can get dry randomly. other prime <br>

<br> 170953:  all while making my skin feel smooth. i have **sensitive** skin that literally flares up if i lightly rub <br>

<br> 173125:  super excited to try this product. my skin is **sensitive** and dry, so i thought coconut water sounded great <br>

<br> 174005:  all day even in 85 degree weather. i have **sensitive** oily skin and this is perfect for me. <br>

<br> 174135:  y-i love this product-i also have **sensitive** skin and don 't like to wear too much <br>

<br> 174318:  !! i would deff recommended if you have **sensitive** dry skin. it works wonders. i love how <br>

<br> 176044:  lotion, which i love and soothes my **sensitive** skin. unlike with most of the silicone based <br>

<br> 177231:  . i have tried every primer for my combination **sensitive** skin. from ysl, to laura mercier <br>

<br> 177921:  primers at all costs because my skin is super **sensitive** to silicones. i also have combination oil- <br>

<br> 178412:  dressed up and it's just amazing! having **sensitive**, highly allergic skin and rosacea, i was <br>

<br> 178696:  how it made some people break out ( i have **sensitive** skin ), but i haven 't had any <br>

<br> 179240:  my skin is oily and acne prone and **sensitive** to products. this goes on smooth like a lightweight <br>

<br> 180165:  caking. it also is very nice formula for **sensitive** skin, it doesn 't have a strong scent <br>

<br> 180317:  acne-prone, oily, or have **sensitive** skin! … read more best primer for dry <br>

<br> 180691:  . plus the scent is lovely. i have super **sensitive** skin and just about any type of lotion makes <br>

<br> 181207:  makes my skin look great! i have dry, **sensitive** skin. this means i have to wear a moist <br>

<br> 182832:  sunscreen. love this primer. i have **sensitive** skin too and no problems with this. light coconut <br>

<br> 183028:  product is definitely not good for me i have very **sensitive** fair skin. i foolishly didn 't read <br>

<br> 183982:  one, it peaked my interest. since i have **sensitive** skin, i always read a number of reviews for <br>

<br> 185382:  waterline let alone lash line because i was so **sensitive**. could not keep my eyes open and would not <br>

<br> 185398:  eyes open and would not recommend it for people with **sensitive** eyes. i have been trying out this pencil for <br>

<br> 187400:  lining my waterline for years and do not have **sensitive** eyes in general but for some reason i can ' <br>

<br> 187941:  t have any smudging. i have pretty **sensitive** eyes, and this didn 't bother them one <br>

<br> 188768:  the hunt for something that 'll work with my **sensitive** eyes and oil-slick eyelids. i bought this <br>

<br> 189513:  . maybe i just have a dud and too **sensitive** eyes … read more this is probably the 10th eye <br>

<br> 192046:  ... which is coming from someone with very **sensitive** eyes. i also use this to tightline with <br>

<br> 198649:  fan of waterproof makeup since my eyes can be **sensitive** but this liner does not irritate my eyes <br>

<br> 201040:  let me first preface by stating that i have extremely **sensitive**, allergy-prone, and watery eyes. <br>

<br> 204296:  's fine. just a warning to those with **sensitive** eyes! i love the vibrance of the <br>

<br> 204679:  the eyeliner on very difficult. i also have **sensitive** allergy prone eyes which is why i don ' <br>

<br> 209885:  what to do? a few notes: i have **sensitive** eyes ( allergies ) that water terribly when <br>

<br> 218240:  recommend this for every day use to someone who has **sensitive** skin or eyes. recently received this as a birthday <br>

<br> 218641:  only eyeliner that does not irritate my **sensitive** eyes. my new favorite! i bought the high <br>

<br> 221080:  only recommend this product to those who do not have **sensitive** eyes, or do not react to use of this <br>

<br> 222657:  on smooth, quick and creamy. i have extremely **sensitive** eyes and wear glasses, and a lot of liner <br>

<br> 224083:  . the very best. easy application and nice for **sensitive** eyes wow, i love this eyeliner! i <br>

<br> 227826:  and does not irritate my eyes which are **sensitive**. as eye liner goes, it's about <br>

<br> 231586:  ). i am not sure if this formula is **sensitive** to room temperature, but it was warm in the <br>

<br> 234207:  -tip ( i recommend make up forever's **sensitive** eye remover. it's blue and so <br>

<br> 237730:  as well. as far as being good for my **sensitive** skin: i accidentally stabbed myself in the eye twice <br>

<br> 239393:  black, last all day.. i have really **sensitive** eyes and this does not irratate at all <br>

<br> 240547:  seen about this liner. … read more i have **sensitive** easily irritated eyes and this does not bother them at <br>

<br> 262200:  gets rid of it! … read more i have **sensitive** eyes and with all other liquid eyeliners they <br>

<br> 263270:  eyeliner of all time. i 've got **sensitive** eyes and oily eyelids, so i 've <br>

<br> 266501:  's brush tip is much more comfortable on my **sensitive** eyes than the felt tip on kat von d ' <br>

<br> 272143:  first; if, for example, your eyes are **sensitive** and tend to tear up when you 're applying <br>

<br> 281776:  ruin such a good thing? it can dry out **sensitive** eyelids. the consistency is pretty thick and you only <br>

<br> 281825:  's smooth and creamy and does not bother my **sensitive** skin. as for how well it works i have <br>

<br> 286571:  to heal for several days after. if you have **sensitive** skin, i would not recommend this product to you <br>

<br> 287928:  me. however, it is really harsh for the **sensitive** eyelid skin. i had no troubles at first <br>

<br> 287972:  up but it is too harsh for my generally moderately **sensitive** skin. i 'm 32 so definitely at the <br>

<br> 294293:  -aging version is much better. i have super **sensitive** eyelids and have reacted badly to other brands of prime <br>

<br> 294672:  sure if sephora carries it yet but i have **sensitive** skin and it literally melts makeup away gently and <br>

<br> 308101:  already sampling at the store. my entire body is **sensitive**, so i had to be real careful & pic <br>

<br> 325566:  this pen, and it doesn 't bother my **sensitive** eyes!!! will repurchase this great <br>

<br> 325776:  amazing for many reasons: 1. i have extremely **sensitive** eyes and it doesn 't even tingle when i <br>

<br> 331010:  't be disappointed! also-i have very **sensitive** eyes, and this product has not caused any problems <br>

<br> 331186:  hypoallergenic, my eyes are very **sensitive** and this eyeliner does not irritate them <br>

<br> 333168:  all kinds of eyeliner. my eyes are really **sensitive** and i my eyes randomly start crying through out the <br>

<br> 335337:  plus, it doesn 't irritate my **sensitive** eyes. it's worth every penny. as <br>

<br> 335643:  waterproof eyeliner that stays put. i have **sensitive** eyes and this does not burn. no racco <br>

<br> 350574:  best products ive ever bought! i have really **sensitive** skin and it breaks out or gets irritated pretty easily <br>

<br> 354482:  but it ’ s a great product! i have **sensitive** skin and have always struggled to find the right products <br>

<br> 354807:  make me break out so it's good for **sensitive** skin! i 'll be buying this again! <br>

<br> 359220:  . since then, my body has become very heat **sensitive**. even in the dead of winter, i experience <br>

<br> 360683:  i have normal to dry skin that's very **sensitive** & this product does not dry out or irrita <br>

<br> 360778:  at the end of the shift. i have extremely **sensitive** skin that is acne prone, and this has <br>

<br> 361831:  -greasy, didn 't irritate my **sensitive** skin, and i didn 't notice any fragrance <br>

<br> 363309:  skin itchy. i returned it sadly. my **sensitive** skin doesn 't seem to care much for it <br>

<br> 370747:  t do anything for me. i have combination, **sensitive** skin and live in a pretty humid environment ( houston <br>

<br> 371587:  still work for girls with less acne prone / **sensitive** skin. i submitted a similar review of this product <br>

<br> 371622:  hopefully this review will, since it may help other **sensitive** skin girls avoid a major breakout. … read more <br>

<br> 372567:  it everyday!! i 'm asian, super **sensitive**, oily, acne prone skin and i <br>

Here we make a list of all the context views for our keyword.

In [None]:
word_positions = get_word_positions(["sensitive"])

keyword_contexts = []
keyword_contexts_tokens = []

for position in word_positions:

  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

<br><br><br><br>

## **Get word vectors and reduce them with PCA**

Finally, we don't just want to *read* all the instances of "sensitive" in the collection; we want to *measure* the similarity of all the instances of "sensitive."

To measure similarity between all the instances of "sensitive," we will take the vector for each instance and then use PCA to reduce each 768-dimensional vector to the 2 dimensions that capture the most variation.

In [None]:
from sklearn.decomposition import PCA

word_positions = get_word_positions(["sensitive"])

pca = PCA(n_components=2)

pca.fit(all_word_vectors[word_positions,:].T)

PCA(n_components=2)
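
The cell above fits PCA on the transposed matrix and reads the plot coordinates from `pca.components_`. For comparison, here is a minimal sketch of the more common scikit-learn pattern: calling `fit_transform` on a matrix with one row per word instance, which returns the projected coordinates directly. The random `toy_vectors` are only a stand-in for `all_word_vectors[word_positions,:]`, and the two approaches will not necessarily produce identical coordinates.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data (an assumption for illustration): 10 word instances, 768 dimensions each
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(10, 768))

# Fit PCA with rows as instances and columns as features,
# then project every instance onto the first 2 principal components
toy_pca = PCA(n_components=2)
toy_coords = toy_pca.fit_transform(toy_vectors)

print(toy_coords.shape)  # one (x, y) pair per instance: (10, 2)

# Share of the total variance captured by each of the 2 components
print(toy_pca.explained_variance_ratio_)
```

With this pattern, `toy_coords[:, 0]` and `toy_coords[:, 1]` would play the role of the `x` and `y` columns in the plot.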

Then, for convenience, we will put these PCA results into a Pandas DataFrame, which we will use to generate an interactive plot.

In [None]:
df_to_plot = pd.DataFrame({"x": pca.components_[0,:], 
                           "y": pca.components_[1,:],
                           "context": keyword_contexts, 
                           "tokens": keyword_contexts_tokens})
df_to_plot.head()

| | x | y | context | tokens |
|---|---|---|---|---|
| 0 | -0.079377 | 0.017589 | a compliment. worth the money. not for the **sensitive** type this is not at all worth the hype | a compliment . worth the money . not for the sensitive type this is not at all worth the h ##ype |
| 1 | -0.095697 | -0.080537 | the day. this product was not good for my **sensitive** eyes and for the price i think it's | the day . this product was not good for my sensitive eyes and for the price i think it ' s |
| 2 | -0.0952 | -0.087407 | have problems with the staying power of mascaras or **sensitive** eyes i would skip this one. … read more | have problems with the staying power of mascara ##s or sensitive eyes i would skip this one . … read more |
| 3 | -0.094516 | -0.149339 | brand. i absolutely loved it. i have very **sensitive** eyes and most mascaras irritate my eyes | brand . i absolutely loved it . i have very sensitive eyes and most mascara ##s ir ##rita ##te my eyes |
| 4 | -0.093228 | -0.060299 | udge easy, doesn 't irritate my **sensitive** eyes, and gives my amazing definition and volume! | ##udge easy , doesn ' t ir ##rita ##te my sensitive eyes , and gives my amazing definition and volume ! |


<br><br><br><br>

## **Match context with original text and metadata** 

It's helpful (and fun!) to know where each instance of a word actually comes from. The easiest method we've found for matching a bit of context with its original review and metadata is to 1) add a tokenized version of each review to our original Pandas DataFrame, 2) check whether the context shows up in a review, and 3) if so, grab the original review and metadata.

In [None]:
# Tokenize all the reviews
tokenized_reviews = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

# Get a list of all the tokens for each review
all_tokenized_reviews = []
for i in range(len(tokenized_reviews['input_ids'])):
  all_tokenized_reviews.append(' '.join(tokenized_reviews[i].tokens))

# Add them to the original DataFrame
reviews_df['tokens'] = all_tokenized_reviews

In [None]:
def find_original_review(rows):

  """This function checks to see whether the context tokens show up in the original review,
  and if so, returns the star rating, brand, product name, and makeup type for that review"""

  text = rows['tokens'].replace('**', '')
  text = text[55:70]

  if reviews_df['tokens'].str.contains(text, regex=False).any():
    row = reviews_df[reviews_df['tokens'].str.contains(text, regex=False)].values[0]
    stars, brand, name, makeup_type = row[1], row[5], row[6], row[7]
    return stars, brand, name, makeup_type
  else:
    return None, None, None, None

In [None]:
df_to_plot[['stars', 'brand', 'name', 'type']] = df_to_plot.apply(find_original_review, axis='columns', result_type='expand')
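
Because `find_original_review` searches for a short 15-character slice of the context and keeps only the first matching row, a slice that happens to appear in several reviews could be attributed to the wrong one. Below is a minimal sketch of how one might check for ambiguous matches. The two-row `toy_reviews` DataFrame and the `count_matches` helper are hypothetical and made up for illustration; they are not part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for reviews_df with a tokenized-text column
toy_reviews = pd.DataFrame({
    "stars": [5, 2],
    "tokens": [
        "[CLS] i have sensitive eyes and this liner is gentle [SEP]",
        "[CLS] not good for my sensitive skin at all [SEP]",
    ],
})

def count_matches(snippet, df):
    """Return how many rows of df contain the snippet as a literal substring."""
    return int(df["tokens"].str.contains(snippet, regex=False).sum())

# A distinctive snippet appears in exactly one review
print(count_matches("sensitive eyes", toy_reviews))

# A generic snippet appears in both, so taking the first match would be a guess
print(count_matches("sensitive", toy_reviews))
```

If a count like this comes back greater than 1 for a real context, a longer slice would likely give a more reliable match.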

<br><br><br><br>

## **Plot word embeddings**

Lastly, we will plot the word vectors from this DataFrame with the Python data viz library [Altair](https://altair-viz.github.io/gallery/scatter_tooltips.html).

In [None]:
import altair as alt

In [None]:
alt.Chart(df_to_plot, title="Word Similarity: Sensitive").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    # If you click a point, take you to the URL link 
    # href="link",
    # The categories that show up in the hover tooltip
    tooltip=['context', 'type', 'brand', 'name']
    ).interactive().properties(
    width=500,
    height=500
)

<br><br><br><br>

## **Plot word embeddings from keywords (all at once!)**

We can put the code from the previous few sections into a single cell and plot the BERT word embeddings for any list of words. 

In [None]:
# List of keywords that you want to compare
keywords = ['pool', 'gym', 'wedding', 'funeral', 'office', 'party']

# How to color the points in the plot. Other options include "stars", "brand", or "type"
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_to_plot = pd.DataFrame({"x": pca.components_[0,:], 
                           "y": pca.components_[1,:],
                           "context": keyword_contexts, 
                           "tokens": keyword_contexts_tokens, 
                           "word": words})
# Match original text and metadata
df_to_plot[['stars', 'brand', 'name', 'type']] = df_to_plot.apply(find_original_review, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
# df = df.rename(columns={'brand': 'brand', 'name': 'title'})

# Make the plot
alt.Chart(df_to_plot, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    # href="link",
    tooltip=['word', 'context', 'type', 'brand', 'name', 'stars']
    ).interactive().properties(
    width=500,
    height=500
)

In [None]:
# List of keywords that you want to compare
keywords = ['smooth', 'sharp', 'clean']

# How to color the points in the plot. Other options include "stars", "brand", or "type"
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_to_plot = pd.DataFrame({"x": pca.components_[0,:], 
                           "y": pca.components_[1,:],
                           "context": keyword_contexts, 
                           "tokens": keyword_contexts_tokens, 
                           "word": words})
# Match original text and metadata
df_to_plot[['stars', 'brand', 'name', 'type']] = df_to_plot.apply(find_original_review, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
# df = df.rename(columns={'brand': 'brand', 'name': 'title'})

# Make the plot
alt.Chart(df_to_plot, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    # href="link",
    tooltip=['word', 'context', 'type', 'brand', 'name', 'stars']
    ).interactive().properties(
    width=500,
    height=500
)

In [None]:
# List of keywords that you want to compare
keywords = ['long']

# How to color the points in the plot. Other options include "stars", "brand", or "type"
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_to_plot = pd.DataFrame({"x": pca.components_[0,:], 
                           "y": pca.components_[1,:],
                           "context": keyword_contexts, 
                           "tokens": keyword_contexts_tokens, 
                           "word": words})
# Match original text and metadata
df_to_plot[['stars', 'brand', 'name', 'type']] = df_to_plot.apply(find_original_review, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
# df = df.rename(columns={'brand': 'brand', 'name': 'title'})

# Make the plot
alt.Chart(df_to_plot, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    # href="link",
    tooltip=['word', 'context', 'type', 'brand', 'name', 'stars']
    ).interactive().properties(
    width=500,
    height=500
)