# In part 2, we plan to use publicly available deep embeddings of text to predict emotions.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Here we need both HuggingFace's "datasets" and "sentence-transformers" libraries.
# Explanations to follow in the code.

#!pip install datasets
#!pip install sentence-transformers

In [3]:
# Let's download a dataset of English tweets, with an "emotion" label attached
# to each tweet, as we did in part 1.
from datasets import load_dataset, Dataset
emotions = load_dataset("SetFit/emotion")

Repo card metadata block was not found. Setting CardData to empty.


In [4]:
train = emotions["train"].to_pandas()
val = emotions["validation"].to_pandas()

# Converting text to "deep embeddings".
Using HuggingFace's SentenceTransformer platform, we will convert each tweet to a 384-dimensional vector.  This somewhat mysterious embedding was trained on hundreds of millions of sentences available on the web, precisely so that sentences can be compared semantically (that is, to tell whether two sentences mean roughly the same thing).

We do not learn in ML II how to train such sentence-transformers, but we are allowed to use them.
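
To make "semantically compare" a bit more concrete: such embeddings are typically compared with cosine similarity, the cosine of the angle between two vectors. The sketch below uses made-up 4-dimensional vectors (the names `happy`, `glad` and `gloomy` are hypothetical stand-ins, not real model outputs) just to show the computation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "embeddings"; real ones from the model have 384 dimensions.
happy = np.array([0.9, 0.1, 0.0, 0.2])
glad = np.array([0.8, 0.2, 0.1, 0.3])
gloomy = np.array([-0.7, 0.3, 0.2, -0.1])

sim_close = cosine_similarity(happy, glad)    # vectors pointing in similar directions
sim_far = cosine_similarity(happy, gloomy)    # vectors pointing in opposing directions
print(sim_close, sim_far)
```

A classifier that uses a cosine metric works with the corresponding distance, which is 1 minus this similarity.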

In [5]:
def convert_texts_to_deep_embeddings(texts):
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
  # documentation: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

  n = len(texts)
  ret = []

  # Work in batches of 100 tweets, to avoid memory problems
  for i in range(0, n, 100):
    embeddings = model.encode(list(texts[i:min(i+100, n)]))
    ret.append(embeddings)
  return np.concatenate(ret, axis=0)

In [None]:
# The first time you execute the embedding, the model weights will be
# downloaded to your filesystem.  This download is on the order of 100 MB,
# and the HuggingFace library caches it for later runs.

val_embedded = convert_texts_to_deep_embeddings(val["text"])

In [None]:
train_embedded = convert_texts_to_deep_embeddings(train["text"])
print(f"Embedding dimension = {train_embedded.shape[1]}")

In [None]:
from sklearn.neighbors import KNeighborsClassifier
k = 5
clf = KNeighborsClassifier(n_neighbors=k, metric='cosine')
clf.fit(train_embedded, train["label_text"])

In [None]:
from sklearn.metrics import hamming_loss # fraction of samples whose predicted class is wrong

In [None]:
train_preds = clf.predict(train_embedded)
hamming_loss(train["label_text"], train_preds)

In [None]:
val_preds = clf.predict(val_embedded)
hamming_loss(val["label_text"], val_preds)

**Discussion point:**  Compare results to the best you could get with 384 features in part 1 of the notebook.
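
When comparing numbers, it may help to keep in mind that for a single-label multiclass problem like this one, scikit-learn's `hamming_loss` is the fraction of misclassified samples, so it equals 1 minus the accuracy. A small sketch on made-up labels (not data from the notebook):

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array(["joy", "sadness", "anger", "joy", "fear"])
y_pred = np.array(["joy", "joy", "anger", "joy", "sadness"])

loss = hamming_loss(y_true, y_pred)        # 2 of 5 predictions are wrong
accuracy = accuracy_score(y_true, y_pred)  # 3 of 5 predictions are right
print(f"hamming loss = {loss}, accuracy = {accuracy}")
```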

**Discussion point:** Plot a confusion matrix, as done in part 1 of the notebook (feel free to reuse code from the forum threads).
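
As a starting point, here is a minimal sketch of computing a confusion matrix with scikit-learn's `confusion_matrix`. The labels are made up for illustration; in the notebook you would presumably pass `val["label_text"]` and `val_preds` instead.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only.
labels = ["anger", "joy", "sadness"]
y_true = np.array(["joy", "joy", "sadness", "sadness", "anger", "joy"])
y_pred = np.array(["joy", "sadness", "sadness", "sadness", "joy", "joy"])

# Rows correspond to true classes, columns to predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Wrapping the matrix in a DataFrame makes rows and columns readable.
cm_table = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_table)
```

The diagonal counts correct predictions; off-diagonal entries show which emotions get confused with which.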

# Let's use PCA to further reduce the dimension

In [None]:
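from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Let's start with a full PCA and plot the cumulative explained variance
# (a cumulative variant of the "scree" plot).
fitted_pca = PCA().fit(train_embedded)
plt.plot(fitted_pca.explained_variance_ratio_.cumsum())
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance ratio")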

**Discussion point:** As done in part 1, reduce the dimension to various values below 384, and see how it affects the accuracy.  Compare with part 1 of the notebook (bag of words) to see whether you can get the same accuracy at equal dimensionality.
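
One possible way to organize this experiment is sketched below. The helper `accuracy_by_dimension` is a hypothetical name, not something defined elsewhere in the notebook, and the data here is synthetic so that the sketch runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def accuracy_by_dimension(X_train, y_train, X_val, y_val, dims, k=5):
    """Fit PCA on the training set for each dimension in `dims`, then
    report the validation accuracy of a cosine k-NN on the reduced data."""
    results = {}
    for d in dims:
        pca = PCA(n_components=d).fit(X_train)  # fit on training data only
        clf = KNeighborsClassifier(n_neighbors=k, metric="cosine")
        clf.fit(pca.transform(X_train), y_train)
        preds = clf.predict(pca.transform(X_val))
        results[d] = accuracy_score(y_val, preds)
    return results

# Synthetic stand-in for the embeddings: 3 well-separated classes in 20 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 20)) * 3
y_tr = rng.integers(0, 3, size=300)
y_va = rng.integers(0, 3, size=100)
X_tr = centers[y_tr] + rng.normal(size=(300, 20))
X_va = centers[y_va] + rng.normal(size=(100, 20))

scores = accuracy_by_dimension(X_tr, y_tr, X_va, y_va, dims=[2, 5, 10])
print(scores)
```

In the notebook, the same helper could presumably be called with `train_embedded`, `train["label_text"]`, `val_embedded` and `val["label_text"]`, using dimensions such as 10, 50, 100 and 200.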

One of the fun things you can do with PCA on a few dimensions is visualization.  Let's try to visualize the data in 2D: we fit a PCA with 5 components and plot two of them against each other (the code below uses the second and third components), to see if we get any insights.  I believe it is possible to see some separation between sadness and joy, which are two extremes, in this mapping.


**Discussion point:** Can you find other visual-semantic phenomena?  Feel free to do 3D plots!

In [None]:
import matplotlib.pyplot as plt

pca = PCA(n_components=5).fit(train_embedded)
train_reduced = pca.transform(train_embedded)
plt.figure(figsize=(10,10))
for emotion, color in [('sadness', 'black'), ('joy', 'orange')]:
  indices = (train['label_text'] == emotion).to_numpy()
  # Columns 1 and 2 are the second and third principal components (0-based indexing)
  plt.plot(train_reduced[indices,1], train_reduced[indices, 2], '.', color=color, label=emotion)
plt.legend()

# Taking the deep embeddings up a notch
A dimensionality of 384 is quite small compared with the monster models that are out there today.
For a list of examples, see this page:
https://huggingface.co/sentence-transformers

To use another deep semantic embedding model, replace the string "sentence-transformers/all-MiniLM-L6-v2" with the string corresponding to the model you want to try. Note that some models take a long time to compute the embeddings, and also require a lot of disk space for the download.  If you own a GPU, or are using a GPU environment in the cloud, you can take advantage of it to make the embedding run much faster.  Feel free to find examples online for running HuggingFace's sentence-transformers on a GPU, or ask me.
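
A minimal sketch of what this could look like follows. The helper name `embed_texts` and its default values are hypothetical; the sketch assumes that `torch` and `sentence-transformers` are installed, and it relies on the `device` argument of `SentenceTransformer` and the `batch_size` argument of `encode`.

```python
def embed_texts(texts,
                model_name="sentence-transformers/all-MiniLM-L6-v2",
                device=None,
                batch_size=64):
    """Embed texts with a chosen sentence-transformers model.

    Imports are done inside the function, so defining it does not
    require the heavy libraries to be loaded yet.
    """
    import torch
    from sentence_transformers import SentenceTransformer

    if device is None:
        # Use the GPU when PyTorch can see one, otherwise fall back to the CPU.
        device = "cuda" if torch.cuda.is_available() else "cpu"

    model = SentenceTransformer(model_name, device=device)
    # encode() batches internally and returns a NumPy array by default.
    return model.encode(list(texts), batch_size=batch_size,
                        show_progress_bar=True)
```

To try a different model, pass its name as `model_name`, for example `embed_texts(val["text"], model_name=...)` with a name taken from the list linked above.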

Also note that "bigger" is not always "better": some sentence-transformers are bigger than the one we use here, but may not perform as well.

**Advanced Discussion Point:** Feel free to try other sentence-transformer semantic embedding models, and share your findings with everyone!