## Your turn!

Now we're going to put together a few pieces from various lessons in order to cluster the adjectives that people use to describe food in their reviews. I'm going to set up a few things for you, but you're going to do most of the work!


In [None]:
# imports

import numpy as np
import pandas as pd
import nltk
import re
from sklearn import feature_extraction

### 1. Load Dataset

First let's load in our data. Remember that this is a JSON file of food reviews with some metadata.

In [None]:
# For downloading large files from Google Drive
import gdown

# download the reviews w/ cuisine categories
gdown.download('https://drive.google.com/uc?export=download&id=1WA_KAOXWOU8yyslDRtvl_JTl6Da0LniS', quiet=False)

Put it in a dataframe

In [None]:
# load the reviews json file
reviews_df = pd.read_json(path_or_buf="./atl_reviews_with_cats.json")

reviews_df

Now create a new object that holds only the review text, i.e. the contents of `row['comment']['text']`. (Despite the `_df` name, the result is a pandas Series rather than a dataframe.)

In [None]:
comments_df = reviews_df['comment'].apply(lambda x: x['text'])

comments_df
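
To see what the `apply` call above is doing, here is a minimal sketch on a made-up two-row frame. It assumes each `comment` cell is a dict with a `text` key, as in the cell above; the rows and the extra `language` key are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for reviews_df: each 'comment' cell is a dict with a 'text' key
# (invented rows, not real reviews)
toy_reviews_df = pd.DataFrame({
    "comment": [
        {"text": "Great tacos", "language": "en"},
        {"text": "Soggy fries", "language": "en"},
    ]
})

# Pull the nested 'text' value out of each dict, one row at a time
toy_comments = toy_reviews_df["comment"].apply(lambda x: x["text"])

print(toy_comments.tolist())
```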

Now create a random sample of 5000 reviews so that we can finish during class time!

In [None]:
comments_sample_df = comments_df.sample(n=5000, random_state=1)
comments_sample_df

### 2. Find the adjectives

Your turn!

The goal here is to create a list called `all_adj_list` that consists of all of the adjectives that appear in these reviews.

In order to do this you will need to:
1. Iterate through the `comments_sample_df` sample
2. Use spaCy to identify any adjectives in that comment
3. Append each adjective to a list

In the next step, we will sort the adjectives by frequency. But for now, we just want to pull the adjectives out of the comments.

In [None]:
# required imports and instantiation of the spacy model
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

# list for storing adjectives
all_adj_list = []

# the rest of your code here!
for comment in comments_sample_df:
    doc = nlp(comment)
    for token in doc:
        if token.pos_ == 'ADJ':
            # print("Adding: " + str(token))
            all_adj_list.append(str(token))

len(all_adj_list)


Now that we have our list, here's some code to count how often each adjective appears. We're going to work only with adjectives that appear 5 or more times in our sample of reviews.

In [None]:
from collections import Counter

# use Counter to create dict w/ value counts from list
all_adj_dict = dict(Counter(all_adj_list))

print("Number of unique adjectives: " + str(len(all_adj_dict)))

# now create a set w/ adjectives used 5 or more times
adj_vocab = set()

for adj in all_adj_dict:
  if all_adj_dict[adj] > 4:
    adj_vocab.add(adj)

print("Number of adjectives used 5 or more times: " + str(len(adj_vocab)))

# take a look
print(adj_vocab)
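
If you want to check the counting logic on something small, here is a sketch using an invented adjective list. The names `toy_adj_list`, `toy_counts`, and `toy_vocab` are made up for this example; the threshold is the same "5 or more" rule used above, written as `>= 5` instead of `> 4`.

```python
from collections import Counter

# Invented adjective list for illustration
toy_adj_list = ["good"] * 6 + ["fresh"] * 5 + ["bland"] * 2 + ["crispy"]

# Counter maps each adjective to the number of times it appears
toy_counts = Counter(toy_adj_list)

# Keep adjectives used 5 or more times
toy_vocab = {adj for adj, count in toy_counts.items() if count >= 5}

print(sorted(toy_vocab))

# most_common(n) lists (adjective, count) pairs from most to least frequent
print(toy_counts.most_common(2))
```

`most_common` is a handy way to eyeball the top of the distribution before settling on a cutoff.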

### 3. Get BERT embeddings

This chunk of code comes from our word similarity notebook. I've pre-run it and saved the embeddings in pickle files that we can load instead of running all of this. But it's good to look at so you can see how the parts are coming together.

First, we tokenize our reviews for BERT.

In [None]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

comments_text = comments_sample_df.tolist()

tokenized_comments = tokenizer(comments_text, truncation=True, padding=True, return_tensors="pt")

Then we load the pretrained model:

In [None]:
from transformers import DistilBertModel

model = DistilBertModel.from_pretrained('distilbert-base-uncased').to("cuda")

Then we get the BERT embeddings for each of the adjectives in our `adj_vocab`.


In [None]:
# List of vocabulary word IDs for all the words in each document (aka each review)
doc_word_ids = []

# List of word vectors for all the words in each document (aka each review)
doc_word_vectors = []

# Below we will slice our reviews to ignore the first (0th) and last (-1) special BERT tokens
start_of_words = 1
end_of_words = -1

# Below we will index the 0th or first document, which will be the only document, since we're analyzing one review at a time
first_document = 0

for i, review in enumerate(comments_text):

    # Here we tokenize each review with the DistilBERT Tokenizer
    inputs = tokenizer(review, return_tensors="pt", truncation=True, padding=True)

    # Here we extract the vocabulary word ids for all the words in the review (the first or 0th document, since we only have one document)
    # We ignore the first and last special BERT tokens
    # We also convert from a Pytorch tensor to a numpy array
    doc_word_ids.append(inputs.input_ids[first_document].numpy()[start_of_words:end_of_words])

    # Here we send the tokenized reviews to the GPU
    # The model is already on the GPU, but this review isn't, so we send it to the GPU
    inputs.to("cuda")
    # Here we run the tokenized reviews through the DistilBERT model
    outputs = model(**inputs)

    # We take every element from the first or 0th document, from the 2nd to the 2nd to last position
    # Grabbing the last layer is one way of getting token vectors. There are different ways to get vectors with different pros and cons
    doc_word_vectors.append(outputs.last_hidden_state[first_document,start_of_words:end_of_words,:].detach().cpu().numpy())
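
The slicing in the loop above is easier to see on small arrays. This is a sketch with invented numbers: `toy_input_ids` and `toy_hidden` only imitate the shapes of `inputs.input_ids` and `outputs.last_hidden_state`, and the ids 900 and 901 are placeholders for the special start and end tokens, not real vocabulary ids.

```python
import numpy as np

# Invented token ids for one short review; the first and last entries stand in
# for the special tokens added at the start and end of the sequence
toy_input_ids = np.array([[900, 11, 12, 13, 901]])  # shape (1 document, 5 tokens)

# Invented "hidden states": one 4-dimensional vector per token
toy_hidden = np.arange(20, dtype=float).reshape(1, 5, 4)

first_document = 0
start_of_words = 1
end_of_words = -1

# Drop the first and last positions so only the real word tokens remain
word_ids = toy_input_ids[first_document][start_of_words:end_of_words]
word_vectors = toy_hidden[first_document, start_of_words:end_of_words, :]

print(word_ids)
print(word_vectors.shape)
```

Because the same slice is applied to both arrays, position `j` in `word_ids` still lines up with row `j` in `word_vectors`, which is what the later lookup by word id relies on.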



Concatenate all wordIDs/vectors for all documents

In [None]:
all_word_ids = np.concatenate(doc_word_ids)
all_word_vectors = np.concatenate(doc_word_vectors, axis=0)

Now pull out the vectors associated with the adjectives in our `adj_vocab` set

In [None]:
# newer version -- more error checking, create average vectors
adj_vocab_list = list(adj_vocab)

final_vocab_list = []
avg_adj_vectors = []

for adj in adj_vocab_list:
  if adj in tokenizer.vocab:
    # get word_id
    word_id = tokenizer.vocab[adj]

    # find all the positions where the word occurs in the dataset
    word_positions = np.where(np.isin(all_word_ids, word_id))[0]

    # skip words with no matching token positions (e.g. lost to truncation),
    # since the mean of zero vectors would be NaN
    if len(word_positions) == 0:
      continue

    final_vocab_list.append(adj)

    # average the vectors at all those positions into one vector for the word
    average_adj_vector = np.mean(all_word_vectors[word_positions], axis=0)

    # append to avg_adj_vectors
    avg_adj_vectors.append(average_adj_vector)

# len should be the same as or less than the number of original adjectives,
# because adjectives that are not in the tokenizer vocab are skipped
len(avg_adj_vectors)

In [None]:
len(avg_adj_vectors)

In [None]:
len(final_vocab_list)
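
Here is the averaging step above reduced to a toy case, in case the indexing is hard to follow. All ids and vectors are invented, and the names `toy_word_ids` and `toy_word_vectors` are stand-ins for `all_word_ids` and `all_word_vectors`.

```python
import numpy as np

# Invented ids and 2-dimensional vectors; id 7 occurs at positions 0 and 3
toy_word_ids = np.array([7, 3, 5, 7])
toy_word_vectors = np.array([
    [1.0, 0.0],
    [9.0, 9.0],
    [4.0, 4.0],
    [3.0, 2.0],
])

word_id = 7

# Positions in the flattened dataset where this word id occurs
positions = np.where(toy_word_ids == word_id)[0]

# Average the vectors of every occurrence into a single vector for the word
average_vector = toy_word_vectors[positions].mean(axis=0)

print(positions)
print(average_vector)
```

Keep in mind that averaging folds every use of a word into one vector, so an adjective used in very different contexts ends up somewhere in between them.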

In [None]:
# export adj_vectors as pickle file

import pickle

with open('avg_adj_vectors_596.pkl', 'wb') as f:
    pickle.dump(avg_adj_vectors, f)

with open('final_vocab_list_596.pkl', 'wb') as f:
    pickle.dump(final_vocab_list, f)

### 4. Cluster adjective vectors!

Import the BERT vectors if you couldn't run the code above

In [None]:
import gdown
import pickle

# import pickle files
# vectors
gdown.download('https://drive.google.com/uc?export=download&id=1itJrQVYCrCCeol5eWZmC5-jaIbWSzKfK', quiet=False)

# vocab
gdown.download('https://drive.google.com/uc?export=download&id=1P1z7wzm4IEsGn3dpkb4MrvYX6a0I1PbP', quiet=False)

with open('avg_adj_vectors_596.pkl', 'rb') as f:
  avg_adj_vectors = pickle.load(f)

with open('final_vocab_list_596.pkl', 'rb') as f:
  final_vocab_list = pickle.load(f)


In [None]:
# create dataframe with two columns, final_vocab_list and avg_adj_vectors

adj_vectors_df = pd.DataFrame({'word': final_vocab_list, 'BERT_vector': avg_adj_vectors})

adj_vectors_df

### 4a. Determine best K

Since we have ~600 vectors, let's try values of k from 1 to 29

In [None]:
from sklearn.cluster import KMeans

inertia_vals = []

for i in range(1, 30):
    km_clustering = KMeans(n_clusters=i, n_init=10)
    km_clustering.fit(avg_adj_vectors)
    inertia_vals.append(km_clustering.inertia_)

ks = list(range(1, 30))
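
As a reminder of what `inertia_` measures: it is the sum of squared distances from each point to its closest cluster center. The sketch below recomputes that quantity by hand with NumPy for fixed, hand-picked centroids. The `inertia` helper and the toy points are made up for this example, and no clustering is actually being fit here.

```python
import numpy as np

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    # Squared distance from every point to every centroid,
    # shape (n_points, n_centroids)
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # For each point keep only the nearest centroid, then add everything up
    return sq_dists.min(axis=1).sum()

# Two tight pairs of points, far apart from each other
toy_points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_centroid = np.array([[5.0, 1.0]])
two_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

print(inertia(toy_points, one_centroid))
print(inertia(toy_points, two_centroids))
```

Inertia keeps shrinking as k grows, which is why we look for the point where the improvement levels off rather than for the minimum.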

In [None]:
# library for quickly calculating "knee"
%pip install kneed

In [None]:
# determine knee
from kneed import KneeLocator
kn = KneeLocator(ks, inertia_vals, curve='convex', direction='decreasing')
print(kn.knee)

In [None]:
# plot
import matplotlib.pyplot as plt
plt.xlabel('number of clusters k')
plt.ylabel('Inertia')
plt.plot(ks, inertia_vals, 'bx-')
plt.vlines(kn.knee, plt.ylim()[0], plt.ylim()[1], linestyles='dashed')
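
For intuition about what a "knee" is, here is a simplified stand-in written with NumPy. This is not the algorithm `kneed` implements; it is a rough sketch that assumes the curve starts at its highest value and ends at its lowest. It rescales both axes to 0-1 and picks the k where the curve sits furthest below the straight line joining the first and last points. The `simple_elbow` function and the inertia values are invented for illustration.

```python
import numpy as np

def simple_elbow(ks, values):
    """Pick the k where a decreasing curve dips furthest below its chord."""
    ks = np.asarray(ks, dtype=float)
    values = np.asarray(values, dtype=float)

    # Rescale both axes to the 0-1 range so they are comparable
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (values - values.min()) / (values.max() - values.min())

    # The chord from the first point (0, 1) to the last point (1, 0) is y = 1 - x;
    # measure how far the curve falls below it at each k
    gap = (1 - x) - y
    return int(ks[np.argmax(gap)])

# Invented inertia values: big drops at first, then a long flat tail
toy_ks = [1, 2, 3, 4, 5, 6, 7, 8]
toy_inertia = [100, 60, 35, 20, 17, 15, 14, 13]

print(simple_elbow(toy_ks, toy_inertia))
```

On real data the curve is usually noisier than this, so treat any automatically chosen knee as a starting point and look at the plot too.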

### 4b. Cluster!


In [None]:
from sklearn.cluster import KMeans

num_clusters = 8

km = KMeans(n_clusters=num_clusters, n_init=10) # default is also 10, but good to know

clusters = km.fit_predict(avg_adj_vectors)

adj_vectors_df['cluster'] = clusters

adj_vectors_df

In [None]:
import textwrap

# find all adjectives in each cluster
for i in range(num_clusters):
    print("Adjectives in cluster " + str(i) + ": ")
    cluster_adjs = ""

    # create new df of only the specific cluster
    # remember boolean selection!
    cluster_df = adj_vectors_df[ adj_vectors_df["cluster"] == i ]

    # create series of adjectives assoc w/ that cluster
    for adj in cluster_df['word']:
        cluster_adjs += adj + ", "

    print(textwrap.fill(cluster_adjs, 100) + "\n")
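
The loop above can also be written with a pandas `groupby`, if you prefer. This is a sketch on an invented five-word frame (`toy_df` and its cluster labels are made up); it assumes the same `word` and `cluster` column names as `adj_vectors_df`.

```python
import pandas as pd

# Invented words and cluster labels for illustration
toy_df = pd.DataFrame({
    "word": ["spicy", "mild", "friendly", "rude", "tangy"],
    "cluster": [0, 0, 1, 1, 0],
})

# Group the rows by cluster and join each group's words into one string
words_by_cluster = toy_df.groupby("cluster")["word"].agg(", ".join)

for cluster_id, words in words_by_cluster.items():
    print(f"Adjectives in cluster {cluster_id}: {words}")
```

Joining with `", ".join` also avoids the trailing comma that string concatenation leaves at the end of each list.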

**If you had to provide an overall characterization of each of these clusters, what would you say?**