# Workshop: NLP & Content Analysis

This workshop session offers hands-on experience with modern "out-of-the-box" natural language processing techniques that will let you design systematic methods to analyze text.

**During this lab you will**:

- Get hands-on experience through the following modules:
  - Sentiment lexicon
  - Valence recognition
  - Perspective API
  - Word embeddings
  - Clustering

# 0. Setup Environment

**Please make sure to change the runtime of this Colab notebook to use GPUs.** Select 'Runtime' --> 'Change runtime type' --> select 'GPU' for the Hardware accelerator. GPUs are optimized for graphics-related computations, which involve many matrix multiplications. It turns out that matrix multiplications are similarly pervasive in modern machine learning (i.e. deep neural networks), and GPUs can greatly increase the speed of using these models.

In [1]:
%%capture
!pip install gensim
!pip install sentence-transformers

import matplotlib.pyplot as plt

import pandas as pd
pd.set_option('display.max_colwidth', 0)
pd.set_option('display.max_columns', 999)

import itertools
from pprint import pprint
import numpy as np

import warnings
warnings.filterwarnings("ignore")

import gensim
import scipy
from gensim.models import Word2Vec

#### Download data and models

The following code downloads the data and our toolkit for exploring this data.

In [2]:
# get reddit sample
!gdown https://drive.google.com/uc?id=1ogRFqUjIUqBxIUkGX6z8DobOWonOJp_u

# get backend toolkit code
!wget https://public-thought.media.mit.edu/static/ccc_toolkit_v_21_0.py --no-check-certificate

# get emotions lexicon
!wget https://saifmohammad.com/WebDocs/Lexicons/NRC-Emotion-Lexicon.zip
!unzip -qq NRC-Emotion-Lexicon.zip

# Download and load data (this is going to take a couple of minutes.)
from ccc_toolkit_v_21_0 import (get_clusters,
                          plot_tsne_viz,
                          set_replicable_results,
                          run_clustering,
                          retrieve,
                          SentenceTransformer)

Downloading...
From: https://drive.google.com/uc?id=1ogRFqUjIUqBxIUkGX6z8DobOWonOJp_u
To: /content/reddit_workshop_sample.csv
  0% 0.00/788k [00:00<?, ?B/s]100% 788k/788k [00:00<00:00, 160MB/s]
--2023-01-10 15:20:01--  https://public-thought.media.mit.edu/static/ccc_toolkit_v_21_0.py
Resolving public-thought.media.mit.edu (public-thought.media.mit.edu)... 18.27.78.114
Connecting to public-thought.media.mit.edu (public-thought.media.mit.edu)|18.27.78.114|:443... connected.
  Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 13578 (13K) [application/octet-stream]
Saving to: ‘ccc_toolkit_v_21_0.py’


2023-01-10 15:20:02 (247 MB/s) - ‘ccc_toolkit_v_21_0.py’ saved [13578/13578]

--2023-01-10 15:20:02--  https://saifmohammad.com/WebDocs/Lexicons/NRC-Emotion-Lexicon.zip
Resolving saifmohammad.com (saifmohammad.com)... 192.185.17.122
Connecting to saifmohammad.com (saifmohammad.com)|192.185.17.122|:443... connected.
HTTP request sent, awaiting response...

# 1. Dataset

### Reddit Moderated Comments (Sample)

Selected sample of comments that have been removed by Reddit moderators.


In [3]:
import pandas as pd
reddit_data = pd.read_csv('reddit_workshop_sample.csv')

In [5]:
reddit_data.columns

Index(['text', 'subreddit', 'subreddit description', 'violation reason',
       'moderator comment', 'comment link', 'internal id'],
      dtype='object')

In [6]:
reddit_data.shape

(659, 7)

In [7]:
reddit_data[:2]

Unnamed: 0,text,subreddit,subreddit description,violation reason,moderator comment,comment link,internal id
0,"Oh, himbs is so sad that I won't agree to his weird need to control other people.\n\nSad bro, real sad. I picture you at a party, lecturing at some hot thing as they frantically search for a way to escape the awful drudgery of listening to you.",r/AmItheAsshole,"A catharsis for the frustrated moral philosopher in all of us, and a place to finally find out if you were wrong in an argument that's been bothering you. Tell us about any non-violent conflict you have experienced; give us both sides of the story, and find out if you're right, or you're the asshole. \n\nSee our ~~*Best Of*~~ ""Most Controversial"" at /r/AITAFiltered!","incivility, overly cruel or hostile","Your comment has been removed because it violates rule 1: [Be Civil](https://www.reddit.com/r/AmItheAsshole/about/rules/). Further incidents may result in a ban.\r\n\r\n[""Why do I have to be civil in a sub about assholes?""](https://www.reddit.com/r/AmItheAsshole/wiki/faq)\r\n\r\n**[Message the mods](https://www.reddit.com/message/compose?to=/r/AmItheAsshole) if you have any questions or concerns.**",/r/AmItheAsshole/comments/vuh8j2/aita_for_not_taking_my_kid_to_a_class_that_his/ifgjrwt/,6869
1,"Apparently his legs are too short to walk him back to the counter because he is a (baby). \n(Hopefully I won’t get banned for that)\nHonestly, WHY CANT HE ASK LIKE AN ADULT IF HE WANTS SOMETHING??\n\nAnd then he doubles down saying he shouldn’t ‘have to’ act like an adult because he should be able to take anything he wants from her?!\n\nWHAT?!",r/AmItheAsshole,"A catharsis for the frustrated moral philosopher in all of us, and a place to finally find out if you were wrong in an argument that's been bothering you. Tell us about any non-violent conflict you have experienced; give us both sides of the story, and find out if you're right, or you're the asshole. \n\nSee our ~~*Best Of*~~ ""Most Controversial"" at /r/AITAFiltered!","incivility, overly cruel or hostile","Your comment has been removed because it violates rule 1: [Be Civil](https://www.reddit.com/r/AmItheAsshole/about/rules/). Further incidents may result in a ban.\r\n\r\n[""Why do I have to be civil in a sub about assholes?""](https://www.reddit.com/r/AmItheAsshole/wiki/faq)\r\n\r\n**[Message the mods](https://www.reddit.com/message/compose?to=/r/AmItheAsshole) if you have any questions or concerns.**",/r/AmItheAsshole/comments/vj194d/aita_for_not_giving_my_mayonnaise_to_my_boyfriend/idgw640/,5938


In [8]:
reddit_data.subreddit.value_counts()

r/AmItheAsshole         125
r/changemyview          105
r/collapse              45 
r/texas                 44 
r/DestinyTheGame        40 
r/Christianity          37 
r/NintendoSwitch        37 
r/Israel                29 
r/UFOs                  25 
r/Fantasy               24 
r/Dallas                22 
r/povertyfinance        14 
r/explainlikeimfive     14 
r/mormon                12 
r/RPClipsGTA            10 
r/legaladvice           10 
r/buildapc              9  
r/AskTrumpSupporters    9  
r/doctorwho             5  
r/rpg                   5  
r/ShingekiNoKyojin      5  
r/sex                   5  
r/MMORPG                5  
r/China                 5  
r/liberalgunowners      5  
r/medicine              5  
r/syriancivilwar        4  
r/ExperiencedDevs       4  
Name: subreddit, dtype: int64

In [9]:
reddit_data['violation reason'].value_counts()

incivility, overly cruel or hostile                                       125
rule b -d party/devils advocate/soapboxing (ops only)                     105
be respectful to others                                                   54 
be friendly                                                               44 
no uncivil behavior, witchhunting, etc                                    40 
harassment, pestering, insistence upon debate, etc                        37 
no hate-speech, personal attacks or harassment                            37 
post in a civilized manner                                                29 
follow the standards of civility                                          25 
be kind                                                                   24 
civility                                                                  14 
uncivil behavior                                                          14 
being uncivil                                                   

In [11]:
reddit_data[['moderator comment']]

Unnamed: 0,moderator comment
0,"Your comment has been removed because it violates rule 1: [Be Civil](https://www.reddit.com/r/AmItheAsshole/about/rules/). Further incidents may result in a ban.\r\n\r\n[""Why do I have to be civil in a sub about assholes?""](https://www.reddit.com/r/AmItheAsshole/wiki/faq)\r\n\r\n**[Message the mods](https://www.reddit.com/message/compose?to=/r/AmItheAsshole) if you have any questions or concerns.**"
1,"Your comment has been removed because it violates rule 1: [Be Civil](https://www.reddit.com/r/AmItheAsshole/about/rules/). Further incidents may result in a ban.\r\n\r\n[""Why do I have to be civil in a sub about assholes?""](https://www.reddit.com/r/AmItheAsshole/wiki/faq)\r\n\r\n**[Message the mods](https://www.reddit.com/message/compose?to=/r/AmItheAsshole) if you have any questions or concerns.**"
2,"Your comment has been removed because it violates rule 1: [Be Civil](https://www.reddit.com/r/AmItheAsshole/about/rules/). Further incidents may result in a ban.\r\n\r\n[""Why do I have to be civil in a sub about assholes?""](https://www.reddit.com/r/AmItheAsshole/wiki/faq)\r\n\r\n**[Message the mods](https://www.reddit.com/message/compose?to=/r/AmItheAsshole) if you have any questions or concerns.**"
3,"Your comment has been removed because it violates rule 1: [Be Civil](https://www.reddit.com/r/AmItheAsshole/about/rules/). Further incidents may result in a ban.\r\n\r\n[""Why do I have to be civil in a sub about assholes?""](https://www.reddit.com/r/AmItheAsshole/wiki/faq)\r\n\r\n**[Message the mods](https://www.reddit.com/message/compose?to=/r/AmItheAsshole) if you have any questions or concerns.**"
4,"Your comment has been removed because it violates rule 1: [Be Civil](https://www.reddit.com/r/AmItheAsshole/about/rules/). Further incidents may result in a ban.\r\n\r\n[""Why do I have to be civil in a sub about assholes?""](https://www.reddit.com/r/AmItheAsshole/wiki/faq)\r\n\r\n**[Message the mods](https://www.reddit.com/message/compose?to=/r/AmItheAsshole) if you have any questions or concerns.**"
...,...
654,"Warning - Removed for Rule 1. Discuss in good faith, please. No accusations of lying."
655,"**Please read this entire message**\r\n\r\n---\r\n\r\nYour comment has been removed for the following reason(s):\r\n\r\n* Rule #1 of ELI5 is to *be nice*. Breaking Rule 1 is not tolerated.\r\n\r\n\r\n\r\n---\r\nIf you would like this removal reviewed, please read the [detailed rules](https://www.reddit.com/r/explainlikeimfive/wiki/detailed_rules) first. **If you believe this comment was removed erroneously**, please [use this form](https://old.reddit.com/message/compose?to=%2Fr%2Fexplainlikeimfive&amp;subject=Please%20review%20my%20thread?&amp;message=Link:%20https://www.reddit.com/r/explainlikeimfive/comments/ufsm2y/-/i6vmwdc/%0A%0APlease%20answer%20the%20following%203%20questions:%0A%0A1.%20The%20concept%20I%20want%20explained:%0A%0A2.%20List%20the%20search%20terms%20you%20used%20to%20look%20for%20past%20posts%20on%20ELI5:%0A%0A3.%20How%20is%20this%20post%20unique:) and we will review your submission."
656,"**Please read this entire message**\r\n\r\n---\r\n\r\nYour comment has been removed for the following reason(s):\r\n\r\n* Rule #1 of ELI5 is to *be nice*. Breaking Rule 1 is not tolerated.\r\n\r\n\r\n\r\n---\r\nIf you would like this removal reviewed, please read the [detailed rules](https://www.reddit.com/r/explainlikeimfive/wiki/detailed_rules) first. **If you believe this comment was removed erroneously**, please [use this form](https://old.reddit.com/message/compose?to=%2Fr%2Fexplainlikeimfive&amp;subject=Please%20review%20my%20thread?&amp;message=Link:%20https://www.reddit.com/r/explainlikeimfive/comments/vlw94u/-/ie1fo55/%0A%0APlease%20answer%20the%20following%203%20questions:%0A%0A1.%20The%20concept%20I%20want%20explained:%0A%0A2.%20List%20the%20search%20terms%20you%20used%20to%20look%20for%20past%20posts%20on%20ELI5:%0A%0A3.%20How%20is%20this%20post%20unique:) and we will review your submission."
657,"**Please read this entire message**\r\n\r\n---\r\n\r\nYour comment has been removed for the following reason(s):\r\n\r\n* Rule #1 of ELI5 is to *be nice*. Breaking Rule 1 is not tolerated.\r\n\r\n\r\n\r\n---\r\nIf you would like this removal reviewed, please read the [detailed rules](https://www.reddit.com/r/explainlikeimfive/wiki/detailed_rules) first. **If you believe this comment was removed erroneously**, please [use this form](https://old.reddit.com/message/compose?to=%2Fr%2Fexplainlikeimfive&amp;subject=Please%20review%20my%20thread?&amp;message=Link:%20https://www.reddit.com/r/explainlikeimfive/comments/uykuul/-/ia5hn06/%0A%0APlease%20answer%20the%20following%203%20questions:%0A%0A1.%20The%20concept%20I%20want%20explained:%0A%0A2.%20List%20the%20search%20terms%20you%20used%20to%20look%20for%20past%20posts%20on%20ELI5:%0A%0A3.%20How%20is%20this%20post%20unique:) and we will review your submission."


# 2. Sentiment Lexicons

One thing we might be interested in when analyzing these comments is how emotions are expressed in them.

Natural language processing has some techniques we can use to understand the emotional arc of conversations!

This field of NLP is called "sentiment analysis." 



We will use a list of words called a lexicon to make an emotion classifier, a model which will tell us whether a sentence is generally positive or negative in emotion.

We can make this simple classifier using a sentiment and emotion lexicon called [Emolex](http://saifmohammad.com/WebPages/lexicons.html).

Each word in the sentiment lexicon is tagged with either 'positive', 'negative', or even both in rare cases. Each word in the emotion lexicon is tagged with one or more of the 8 emotions in Plutchik's wheel of emotions -- joy, trust, fear, surprise, sadness, anticipation, anger, and disgust. This is, of course, just one theory of emotion. Other lexicons are available within Emolex, labeling words from different domains along different dimensions (e.g., valence, arousal, dominance).

These lexicons are typically created either (a) manually, through crowdsourced annotations, or (b) automatically (e.g., calculating co-occurrence of words with each emotion word). The version we'll be using was created through crowdsourcing -- details can be found in the [paper](https://arxiv.org/pdf/1308.6297.pdf).

Let's first load in the lexicon and get a sense of what it contains.

In [12]:
from ccc_toolkit_v_21_0 import parse_emolex

lexicon_path = 'NRC-Emotion-Lexicon/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt'
word2sentiments, word2emotions = parse_emolex(lexicon_path)

Here are some labeled positive/negative sentiments:

In [13]:
print('Sentiment lexicon:\n')
for i, (word, sentiments) in enumerate(word2sentiments.items()):
    print(word, sentiments)
    if i == 10:
        break

Sentiment lexicon:

abandon {'negative'}
abandoned {'negative'}
abandonment {'negative'}
abba {'positive'}
abduction {'negative'}
aberrant {'negative'}
aberration {'negative'}
abhor {'negative'}
abhorrent {'negative'}
ability {'positive'}
abject {'negative'}


Here are some labeled emotions:

In [14]:
print('Emotion lexicon:\n')
for i, (word, emotions) in enumerate(word2emotions.items()):
    print(word, emotions)
    if i == 30:
        break

Emotion lexicon:

abacus {'trust'}
abandon {'fear', 'sadness'}
abandoned {'anger', 'fear', 'sadness'}
abandonment {'surprise', 'anger', 'fear', 'sadness'}
abbot {'trust'}
abduction {'surprise', 'fear', 'sadness'}
aberration {'disgust'}
abhor {'anger', 'fear', 'disgust'}
abhorrent {'anger', 'fear', 'disgust'}
abject {'disgust'}
abnormal {'disgust'}
abolish {'anger'}
abominable {'disgust', 'fear'}
abomination {'anger', 'fear', 'disgust'}
abortion {'disgust', 'fear', 'sadness'}
abortive {'sadness'}
abrupt {'surprise'}
abscess {'sadness'}
absence {'fear', 'sadness'}
absent {'sadness'}
absentee {'sadness'}
absolution {'joy', 'trust'}
abundance {'joy', 'trust', 'anticipation', 'disgust'}
abundant {'joy'}
abuse {'sadness', 'anger', 'fear', 'disgust'}
abysmal {'sadness'}
abyss {'fear', 'sadness'}
academic {'trust'}
accelerate {'anticipation'}
accident {'surprise', 'fear', 'sadness'}
accidental {'surprise', 'fear'}


**Next, let's load some comments and create a classifier with our lexicons.** Our approach is simple: a comment $x$, composed of a sequence of words $[w_0, w_1, ..., w_n]$, is classified as positive/negative or as containing an emotion $e$ if any of the words $w_i$ carries that label in the lexicon. For example, the comment "it was abnormal" would be classified as containing the emotion "disgust" because "abnormal" is mapped to disgust in the emotion lexicon.

**Let's look at an example below.**

In [15]:
import re

def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def get_paragraph(string, words=10):
    return '\n'.join([' '.join(x) for x in chunks(string.split(), words)])

comment = """
The way the current system is design has so many pitfalls, and it's all your fault.
"""

print(f'comment: {get_paragraph(comment)}\n')
print('Words in utterance tagged with emotions:\n')
# Raw string so that \w is read as a regex class, not a string escape.
word_list = re.sub(r"[^\w]", " ", comment).lower().split()

for word in word_list:
    if word in word2emotions:
        print('\t', word, word2emotions[word])

comment: The way the current system is design has so many
pitfalls, and it's all your fault.

Words in utterance tagged with emotions:

	 system {'trust'}
	 fault {'sadness'}
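
The rule described above can be wrapped in a small reusable function. The following is a minimal sketch, not the toolkit's implementation: `toy_word2emotions` is a hypothetical mini-lexicon whose entries mirror the `word2emotions` output shown earlier, and `tag_emotions` is a helper name we made up for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; the entries mirror the
# word2emotions output shown earlier, but the real lexicon is far larger.
toy_word2emotions = {
    "system": {"trust"},
    "fault": {"sadness"},
    "abnormal": {"disgust"},
    "abandoned": {"anger", "fear", "sadness"},
}

def tag_emotions(comment, lexicon):
    """Count how many words in the comment carry each emotion tag."""
    words = re.sub(r"[^\w]", " ", comment).lower().split()
    counts = Counter()
    for word in words:
        # Words missing from the lexicon contribute nothing.
        for emotion in lexicon.get(word, ()):
            counts[emotion] += 1
    return counts

example_counts = tag_emotions(
    "It was abnormal, and it's all your fault.", toy_word2emotions)
print(example_counts)
```

Passing the real `word2emotions` dictionary as `lexicon` should work the same way, since it also maps each word to a set of emotion labels.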


This approach has some shortcomings! One issue is the binary categorization of words as positive or negative.

To overcome this, we will try a slightly different approach.

**We'll use a more advanced lexicon-based (and rule-based) method called VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER consists of (1) a crowd-sourced valence-aware sentiment lexicon (i.e., each term has a *strength* associated with it), and (2) five rules to incorporate features such as word-order sensitive relationships between terms.** These include intensifying valence through punctuation or capitalization, degree modifiers through adverbs, contrastive conjunction as a mixed signal with the latter clause dominating (e.g., "The food here is great, but the service is horrible."), and negation of sentiment by preceding terms. The final sentiment is ultimately a sum of each word's valence, adjusted by these rules. See the [paper](http://eegilbert.org/papers/icwsm14.vader.hutto.pdf) and [repository of code](https://github.com/cjhutto/vaderSentiment) for more details.

VADER is well suited to social media posts like these Reddit comments, as its lexicon contains slang, commonly misspelled words, and emoticons (though some of these may be outdated now, as VADER came out in 2014).

In [16]:
!pip install vaderSentiment

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [17]:
reddit_data.columns

Index(['text', 'subreddit', 'subreddit description', 'violation reason',
       'moderator comment', 'comment link', 'internal id'],
      dtype='object')

In [18]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

comment_valences = []
for i, comment in enumerate(reddit_data.text):
    comment_valences.append(analyzer.polarity_scores(comment))
reddit_data['vader_sentiment'] = comment_valences

Let's see what VADER tells us:

In [19]:
for i, (_, row) in enumerate(reddit_data.iterrows()):
    if i == 5: break
    print(get_paragraph(row.text))
    print('>>', row.vader_sentiment)
    print('\n')

Oh, himbs is so sad that I won't agree to
his weird need to control other people. Sad bro, real
sad. I picture you at a party, lecturing at some
hot thing as they frantically search for a way to
escape the awful drudgery of listening to you.
>> {'neg': 0.309, 'neu': 0.621, 'pos': 0.07, 'compound': -0.9325}


Apparently his legs are too short to walk him back
to the counter because he is a (baby). (Hopefully I
won’t get banned for that) Honestly, WHY CANT HE ASK
LIKE AN ADULT IF HE WANTS SOMETHING?? And then he
doubles down saying he shouldn’t ‘have to’ act like an
adult because he should be able to take anything he
wants from her?! WHAT?!
>> {'neg': 0.076, 'neu': 0.793, 'pos': 0.131, 'compound': 0.6239}


Are you stupid? Why is this every ones solution on
this site? Break up because he doesn’t clean as much??
There’s such thing as communication, they don’t need to immediately
jump the guns and break up.
>> {'neg': 0.08, 'neu': 0.779, 'pos': 0.141, 'compound': 0.3736}


You’re an absolu

The `compound` score is computed by summing the valence scores of each word in the text that appears in the lexicon, adjusting them according to the rules, and then normalizing the result to lie between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. It is accurately described as a 'normalized, weighted composite score'.
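
To make the normalization step concrete, here is a minimal sketch. It assumes the squashing function $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$, which is our reading of the vaderSentiment source; treat both the formula and the constant as assumptions to check against the repository. The ±0.05 cut-offs in `label_from_compound` (a helper name we made up) are a commonly used convention rather than part of the score itself.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded summed valence into the interval (-1, 1).

    Assumption: mirrors the normalization used by vaderSentiment.
    """
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label (the threshold is a convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    compound = normalize(raw)
    print(f"raw={raw:+.1f} -> compound={compound:+.4f} "
          f"({label_from_compound(compound)})")
```

Because the function flattens out for large inputs, a comment with many strongly negative words approaches -1 without ever exceeding it.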

## A brief ending note on classifiers

We used lexicon- and rule-based classifiers in our labs today. Another approach is to use *feature-based* classifiers. In traditional machine learning models, a *feature extraction* step is first applied to an input text, extracting up to hundreds of lexical features about the words present, parts of speech present, ordering features, lexicon-based features, etc. Assuming a labeled dataset $X,Y$, where $X$ are the input texts and $Y$ are the ground-truth labels, the model learns probabilistically to predict a label $y$ given the extracted features of $x$.

In deep neural networks (DNNs), this feature extraction step is skipped. Instead of relying on hand-engineered features, DNNs can automatically learn features that are helpful for predicting $y$. These automatically derived features often capture much of the same information as hand-engineered ones -- see [1](https://www.aclweb.org/anthology/N19-1419.pdf), [2](https://hal.inria.fr/hal-02131630/document), etc.

Deep neural networks and other machine learning methods now typically produce better results than lexicon-based approaches. However, as shown with VADER, a lexicon-based approach can still be useful as (a) a strong baseline, (b) a source of weak labels for training a machine learning model, and (c) a sanity check for a machine learning model (e.g., a neural-network-based toxicity classifier such as [Perspective API](https://www.perspectiveapi.com/#/home) correlates extremely strongly with a simple expletive-lexicon-based classifier on Facebook comments).

# 3. Perspective API

"Perspective is a free API that helps you host better conversations online.
The API uses machine learning models to score the perceived impact a comment
might have on a conversation. You can use this score to give feedback to
commenters, help moderators more easily review comments, allow readers
to more easily find interesting or productive comments, and more."

https://developers.perspectiveapi.com/s/?language=en_US

In [None]:
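import os
import json

from googleapiclient import discovery

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  # Read the key from an environment variable (the variable name is our
  # choice) rather than hard-coding a credential in the notebook.
  developerKey=os.environ["PERSPECTIVE_API_KEY"],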
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

text = reddit_data.text[0]
print(text, '\n\n------')

analyze_request = {
  'comment': { 'text': text },
  'requestedAttributes': {'TOXICITY': {},
                          'THREAT': {},
                          'INSULT': {},
                          'IDENTITY_ATTACK': {}}
}

response = client.comments().analyze(body=analyze_request).execute()
print(json.dumps(response, indent=2))
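
The raw JSON is verbose, so it helps to pull out one number per attribute. The sketch below assumes the response nests each score under `attributeScores -> <ATTRIBUTE> -> summaryScore -> value`; `sample_response` is a hypothetical, hand-written stand-in (the values are invented), and `summary_scores` is a helper name we made up.

```python
# Hypothetical response, hand-written for illustration; real scores will differ.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.81, "type": "PROBABILITY"}},
        "INSULT": {"summaryScore": {"value": 0.77, "type": "PROBABILITY"}},
        "THREAT": {"summaryScore": {"value": 0.02, "type": "PROBABILITY"}},
        "IDENTITY_ATTACK": {"summaryScore": {"value": 0.05, "type": "PROBABILITY"}},
    },
    "languages": ["en"],
}

def summary_scores(response):
    """Return {attribute: summary score} from an analyze-comment response."""
    return {
        attribute: body["summaryScore"]["value"]
        for attribute, body in response.get("attributeScores", {}).items()
    }

scores = summary_scores(sample_response)
# Print attributes from highest to lowest score.
for attribute, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{attribute:<16} {value:.2f}")
```

Calling `summary_scores(response)` on the real `response` object would give one row per requested attribute, which is convenient for adding columns to `reddit_data`.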

# 4. Word embeddings

#### Load models

Again, this may take a minute or two. Head on over to the next section while you're waiting for this to finish!

In [None]:
# https://nlp.stanford.edu/projects/glove/
# https://github.com/RaRe-Technologies/gensim-data

import gensim.downloader as api
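glove_model = api.load("glove-wiki-gigaword-100") # "glove-twitter-50"
# api.load returns KeyedVectors; gensim >= 4.0 exposes the vocabulary as
# key_to_index (on gensim 3.x, use glove_model.vocab instead).
glove_model_vocab = glove_model.key_to_index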

### You shall know a word by the company it keeps

A starting question in natural language processing is how to represent words and text. As we saw previously, one simple approach is called "bag of words" (BoW), which simply counts the number of times each word appears. In this approach, each word is represented as a "one-hot vector": a vector of 0's with a single 1 at the index of that word. For example, if our vocabulary consists of `['drink', 'eat', 'pray', 'love', 'earthquake']`, then `drink = [1,0,0,0,0]` and `love = [0,0,0,1,0]`.

--- 

While simple, one drawback to this approach is that semantically related words don't have similar representations. For instance, we might want "eat" and "drink" to be more similar to each other than "eat" and "earthquake". In order to measure similarity, we must first have a distance metric. One common distance metric is Euclidean distance, defined between two vectors $p$ and $q$ as $d(p,q) = \sqrt{\sum_{i=1}^n (p_i - q_i) ^ 2}$. However, note that $d(eat,drink) = d(eat,earthquake) = \sqrt{2}$. In fact, every pair of distinct words is equally far apart under this one-hot representation.
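
The claim about one-hot distances is easy to verify. This is a minimal sketch using only the standard library; `one_hot` is a helper name we made up for illustration.

```python
import math
from itertools import combinations

vocabulary = ['drink', 'eat', 'pray', 'love', 'earthquake']

def one_hot(word, vocab):
    """Vector of 0's with a single 1 at the word's index in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

vectors = {word: one_hot(word, vocabulary) for word in vocabulary}

# Euclidean distance between every pair of distinct words.
distances = {
    (a, b): math.dist(vectors[a], vectors[b])
    for a, b in combinations(vocabulary, 2)
}
for (a, b), d in distances.items():
    print(f"d({a}, {b}) = {d:.4f}")
```

Any two distinct one-hot vectors differ in exactly two positions, so every distance comes out as $\sqrt{2}$ regardless of what the words mean.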

--- 

The goal is thus to learn representations that better capture notions of semantic and lexical similarity. The **Word2vec** model is one popular such approach. Each word is initialized with a randomly distributed (typically Gaussian-like) vector representation. Models are trained on large corpora to learn which words frequently co-occur. For instance, given the context `the hungry hippo ____ his meal`, the model learns to predict that `devoured` and `ate` are reasonable words. Over the course of training, these vector representations are improved such that (1) they can better predict neighboring words, and, as a natural byproduct, (2) they better capture the semantic similarity we desired. (The pretrained vectors we loaded above come from GloVe, a closely related method that learns from corpus-wide co-occurrence counts.)
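
To see what "predicting a word from its context" means in terms of data, the sketch below builds fill-in-the-blank training examples from a single sentence. It only illustrates the data preparation for the CBOW flavor of Word2vec (the skip-gram flavor predicts the context from the word instead) and does no training; `context_target_pairs` and the window size are our own illustrative choices.

```python
def context_target_pairs(tokens, window=2):
    """Yield (context_words, target_word) examples, one per token.

    The context is up to `window` words on each side of the target.
    """
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

sentence = "the hungry hippo devoured his meal".split()
training_examples = list(context_target_pairs(sentence, window=2))
for context, target in training_examples:
    print(context, '->', target)
```

A model trained on many such examples has to give `devoured` and `ate` similar vectors, because both are good answers for nearly the same contexts.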

---

Let's take a look at what exactly we mean by this vector representation:

In [None]:
# Index the KeyedVectors directly; .wv is only needed on full Word2Vec models.
print('Each word is represented as a {}-dimensional vector'.format(
    glove_model['orange'].shape[0]))
print(glove_model['orange'])

### Semantic neighbors

Let's look at what we can do once we have these vector representations.

**First, a sanity check. Let's compute similarity scores between pairs of words.** The default measure is cosine similarity, which accounts for the *direction* of the vectors while normalizing for the *magnitude*. This score ranges from -1 (opposite directions) to 1 (same direction), and related words typically score well above 0.
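
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python version for illustration (`cosine_similarity` is our own helper; gensim computes this for us through `similarity()`), applied to three hand-picked pairs of vectors.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # differ only in magnitude
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # share no direction
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # point opposite ways

print(same_direction, orthogonal, opposite)
```

Doubling a vector's length leaves the score unchanged, which is why the measure is said to normalize for magnitude.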

In [None]:
def compute_similarity(pairs, model):
    """
    This is a helper function to print the similarity of each
    pair of words in pairs. The output is printed in sorted order
    of similarity.

    Args:
        pairs: list of tuples, each tuple contains two strings
        model: Gensim word2vec model with a similarity() function
    """
    results = []
    for w1, w2 in pairs:
        sim = model.similarity(w1, w2)
        results.append((w1, w2, sim))

    for w1, w2, sim in sorted(results, key=lambda x: -x[2]):
        # print('%r\t%r\t%.2f' % (w1, w2, sim))
        print('{:>12}\t{:>12}\t{:.2f}'.format(w1, w2, sim))

In [None]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]

compute_similarity(pairs, glove_model)

**Perhaps you're interested in something a little trickier, such as the similarity of fruits, colors, or size adjectives.**



In [None]:
items = ['apple', 'cantaloupe', 'banana', 'coconut', 'pineapple', 'watermelon']
# items = ['red', 'orange', 'yellow', 'blue', 'green', 'indigo', 'violet']
# items = ['big', 'large', 'huge', 'enormous', 'gargantuan', 'vast']
pairs = itertools.combinations(items, 2)  # creates pairwise combinations

compute_similarity(pairs, glove_model)

**We can also find a word's nearest neighbors.**

For the word 'plum' (example below), these may include other fruits, specific plums (Kakadu plum), and also the notion of a "plum" job (sinecure, cushy, plum assignments, coveted). You may wonder why actor Yeager Lithgow appears further down the list. A quick Google search surfaces a number of articles (of the kind these embeddings were trained on) about "[his wife] pushing him into the plum job".

In [None]:
result = glove_model.most_similar(positive=['plum'], topn=15)
pprint(result)

**Remember that word embeddings are trained by learning which words co-occur with other words. This means that the *.most_similar()* function doesn't necessarily return words with the same *meaning* -- instead, it returns words that tend to appear in similar contexts.**

For example, while most of the words most similar to 'happy' are positive, 'disappointed' and even *'unhappy'* can show up among the top matches.

In [None]:
result = glove_model.most_similar(positive=['happy'], topn=20)
pprint(result)

**We can also find words that are similar to *multiple* words.** This corresponds to finding words in vector space near the average of the two words (or centroid if multiple words are given).

In [None]:
result = glove_model.most_similar(positive=['french', 'pastry'], topn=5)
pprint(result)

**Word embeddings are also famous for being able to reconstruct analogies of the form `A is to B as C is to ___`.** For example, if we calculate `king + woman - man`, the closest word embedding is `queen`! (There are some [nuances](https://www.facebook.com/groups/1174547215919768/permalink/1846673885373761/)). 

Let's try it out for ourselves. In the following, we're finding the terms most similar to both woman and king, but dissimilar from man.

In [None]:
print('King + woman - man:')
result = glove_model.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)
pprint(result)

print('\n' + '-' * 100 + '\n')


# You'll see that warmer appears as a top term.
# This is due to the co-occurrence nature we mentioned before, where
# colder and warmer are likely to occur in similar contexts, and hence
# have similar vector representations. 
print('Louder + cold - loud:')
result = glove_model.most_similar(positive=['louder', 'cold'], negative=['loud'], topn=3)
pprint(result)
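
Both the multi-word queries and the analogies above come down to the same arithmetic: add the vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The sketch below is a simplified illustration of that idea, not gensim's actual code; `toy_vectors` holds hypothetical 2-dimensional "embeddings" that we invented (axis 0 loosely stands for royalty, axis 1 for gender).

```python
import math

# Hypothetical 2-d vectors, invented for illustration only.
toy_vectors = {
    'king':   [1.0, 1.0],
    'queen':  [1.0, -1.0],
    'man':    [0.0, 1.0],
    'woman':  [0.0, -1.0],
    'prince': [0.8, 0.9],
    'apple':  [-1.0, 0.1],
}

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar(positive, negative=(), topn=3, vectors=toy_vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    signed_words = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed_words:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += sign * x
    query = unit(query)
    # The query words themselves are excluded from the ranking.
    exclude = set(positive) | set(negative)
    scored = [
        (word, sum(a * b for a, b in zip(query, unit(vec))))
        for word, vec in vectors.items() if word not in exclude
    ]
    return sorted(scored, key=lambda item: -item[1])[:topn]

analogy = most_similar(positive=['king', 'woman'], negative=['man'])
centroid = most_similar(positive=['king', 'queen'])
print(analogy)
print(centroid)
```

With these toy vectors, `king + woman - man` lands closest to `queen`, and the midpoint of `king` and `queen` lands closest to `prince`. Real embeddings behave far less cleanly, which is one reason the top results are worth inspecting by hand.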

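**Finally, note that word embeddings may contain ["human-like" biases](https://science.sciencemag.org/content/356/6334/183.abstract)**. For example, a well-known study of Word2vec embeddings trained on news text found that `computer_programmer + woman - man` returns `homemaker` as the top result. The cells below run similar probes on our GloVe vectors, so you can see for yourself.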

These biases can have harmful repercussions on downstream tasks. There are [methods to debias](https://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-debiasing-word-embeddings.pdf) these embeddings, but there are [limitations](https://arxiv.org/pdf/1903.03862.pdf) to these methods as well.

In [None]:
print('Doctor + woman - man:')
result = glove_model.most_similar(positive=['doctor', 'woman'],
                                   negative=['man'], topn=3)
pprint(result)

print('\n' + '-' * 100 + '\n')

print('Computer + woman - man:')
result = glove_model.most_similar(positive=['computer', 'woman'],
                                   negative=['man'], topn=3)
pprint(result)

print('\n' + '-' * 100 + '\n')
print('Similarity between "criminal" and various names')
pairs = [
    ('criminal', 'matthew'),
    ('criminal', 'bob'),
    ('criminal', 'jake'),
    ('criminal', 'darnell'),
    ('criminal', 'trayvon'),
    ('criminal', 'deshawn'),
    ('criminal', 'alexander'),
    ('criminal', 'aleksander'),
    ('criminal', 'camilo'),
    ('criminal', 'belén'),
]
compute_similarity(pairs, glove_model)
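
One way to put a number on such a bias, in the spirit of the debiasing paper linked above, is to project a word onto a "gender direction" such as `he - she`. The sketch below uses hypothetical hand-made vectors, so the numbers only illustrate the computation; with real embeddings you would look the vectors up in the embedding model instead.

```python
import numpy as np

# Hypothetical 3-dimensional toy vectors, NOT real GloVe values.
toy = {
    "he": np.array([1.0, 0.2, 0.0]),
    "she": np.array([-1.0, 0.2, 0.0]),
    "programmer": np.array([0.4, 0.9, 0.1]),
    "homemaker": np.array([-0.5, 0.8, 0.2]),
    "table": np.array([0.0, 0.3, 0.9]),
}


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


# Direction pointing from "she" towards "he".
gender_direction = unit(toy["he"] - toy["she"])


def gender_projection(word):
    """Cosine between a word vector and the gender direction.

    Positive values lean towards "he", negative values towards "she".
    """
    return float(np.dot(unit(toy[word]), gender_direction))


for word in ["programmer", "homemaker", "table"]:
    print(f"{word:>12}: {gender_projection(word):+.2f}")
```

A word with no gender association should score close to zero; occupation words that score far from zero are candidates for the kind of bias discussed above.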


# 5. Embedding-based text retrieval


When we come across a particularly interesting comment, we might want to retrieve every comment in our dataset that touches on the same topic.

One way we can do this is through a "text retrieval" approach. 

Our problem setup is: given a query $q$ ("cancelled flight") and a set of documents $D$ (our set of comments), we want to retrieve (and rank) the $n$ documents that most closely match our query. A query can be any natural language string (e.g. a word, a phrase, a sentence, a paragraph, etc.). Each document can similarly be any natural language string.

We will be using a model called [Sentence-BERT](https://arxiv.org/abs/1908.10084), which computes a high-dimensional vector representation of each sentence. It's an exciting method that deserves its own workshop!

**Let's first load the model and the data.**




In [None]:
# There are different versions of the model, specified by the name 'bert-base...'
from sentence_transformers import SentenceTransformer

sent_model = SentenceTransformer('bert-base-nli-mean-tokens')

In [None]:
# Alternative data source, left commented out: a subset of LVN public data.
# with open('lvn_subset_data.pickle', 'rb') as handle:
#    lvn_data = pickle.load(handle)
import numpy as np
np.random.seed(111)

text = reddit_data.text.values.tolist()

# Run the BERT model to get one embedding per comment (this can take ~30 mins).
comments_embeddings = sent_model.encode(text, convert_to_tensor=True)

print("Sampled data contains {} comments.".format(len(text)))

What does an "embedding" representation look like?

In [None]:
comments_embeddings

**Next, let's define the retrieval function and perform some searches.** We embed each query using the same Sentence-BERT model we used to embed the comments. This means that both queries and comments now live in the same high-dimensional vector space (to some degree...). We can now compute the cosine similarity between a query and each comment to retrieve the most similar comments in our dataset.

In [None]:
queries = ["it's your fault. You insulted her."]

retrieve(sent_model, queries, text, comments_embeddings, closest_n=8)
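
The `retrieve` helper is not defined in this section, so its internals are not shown here. As a rough sketch of the idea, assuming it ranks documents by cosine similarity, the following uses made-up 3-dimensional "embeddings" and a hypothetical `top_n_by_cosine` function rather than the actual helper code.

```python
import numpy as np


def top_n_by_cosine(query_vec, doc_vecs, n=2):
    """Return (index, score) for the n documents most similar to the query."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # Sort by descending score and keep the first n indices.
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]


# Toy documents with hand-made embeddings (illustrative values only).
docs = ["flight was cancelled", "my flight got delayed", "great pasta recipe"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])
query_vec = np.array([1.0, 0.1, 0.0])  # stands in for an embedded query

hits = top_n_by_cosine(query_vec, doc_vecs, n=2)
for idx, score in hits:
    print(f"{score:.3f}  {docs[idx]}")

# One number per query: the mean similarity over the top-n hits.
mean_top_n = float(np.mean([score for _, score in hits]))
print(f"mean top-n similarity: {mean_top_n:.3f}")
```

Averaging the returned scores gives one number per query, which is one way to compare how well different queries are covered by the collection.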

**Great. Try creating some of your own queries.** Some questions you may be interested in asking include:
- Do queries that are paraphrases of each other return similar matches?
- For a query of your choosing, how does the average similarity score (over the top n results) compare between queries? What does this say about the relevance of that query / topic?


In [None]:
# Enter your own query between the quotation marks of your_query below!
your_query = ["""My neighbour who sucks at farming but is good at
making wooden wheels makes a deal with me. I will work on his farm,
give him a share of the crops every harvest, and in return I get
to sell the other portion. Is that slavery? How is that possibly slavery?"""]
retrieve(sent_model, your_query, text, comments_embeddings, closest_n=2)

# 6. Clustering and visualization

From the activity above, we see there is some overlap in the topics/sentiment that people bring up across comments.

Assuming we don't want to think of and retrieve each topic individually, how might we sort and visualize our entire set of conversations?

One approach would be through **unsupervised clustering.** A popular method for this is called K-means clustering. Let's watch a short video about K-means below!

In [None]:
%%html
<!--
"""
Here is the algorithm working on a dataset of 300 "documents" embedded in
two dimensions.
"""
-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/5I3Ei69I40s"
frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen>
</iframe>
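
For readers who prefer code to video, here is a minimal NumPy sketch of the standard k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The `kmeans` function and the toy data are illustrative assumptions, not the toolkit's implementation.

```python
import numpy as np


def kmeans(points, k, n_iters=100, seed=0):
    """Plain k-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid: shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy blobs in 2D.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids.round(2))
```

Because the starting centroids are random, different seeds can give different cluster ids (and, on harder data, different clusters).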

Next, let's implement K-means clustering and visualize its results in 2-dimensional space.

K-means clusters data points in high dimensions.
- Originally we have **768** dimensions from **BERT**.
- To run **k-means** we first compress that space to a lower-dimensional one. We do that by using **PCA** to capture as much variance (information) as possible.
- We run **k-means** on those lower-dimensional vectors.
- Finally, we reduce dimensionality to 2D using **t-SNE** for visualization purposes.
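
To give a feel for the PCA step, here is a small NumPy sketch that projects toy data onto its top principal components and reports how much variance they retain. The `pca_reduce` helper and the random data are assumptions made for illustration; the toolkit's `run_clustering` may well use a library implementation instead.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X onto its top principal components via SVD.

    Returns the reduced data and the fraction of variance
    explained by each kept component.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    reduced = X_centered @ Vt[:n_components].T
    return reduced, explained[:n_components]


# Toy data: 50 observed dimensions driven by only 3 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

reduced, explained = pca_reduce(X, n_components=3)
print(reduced.shape)    # (200, 3)
print(explained.sum())  # close to 1: three components keep almost all variance
```

Real sentence embeddings are not this cleanly low-dimensional, so some information is lost when going from 768 dimensions to a handful of components.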


In [None]:
"""
These methods calculate approximate transformations that depend on the
initial random seed, so without a fixed seed every run of this cell gives
different clusters and embeddings. The call below is meant to make the
results repeatable.
"""
set_replicable_results(True)

viz_coord, clstr_model, predicted_clusters, _, _ = run_clustering(text,
                                                                  pca_components=20,
                                                                  k_clusters=20,
                                                                  embeddings=comments_embeddings.cpu())

# Hover over the dots in the plot with your mouse
# to see where each piece of text is located.

plot_tsne_viz(viz_coord, text, title="T-SNE<br>Hover over the dots in the plot with your mouse.")

It's hard to make sense of this. There aren't clear clusters, and it doesn't tell us anything important without additional manual labor.

Let's color our 2D visualization according to the cluster ids we found through k-means. How does it look?

Do we see any clustering that we would expect?

In [None]:
# Predicted clusters
plot_tsne_viz(viz_coord, text,
              clusters=predicted_clusters,
              coloring='clusters',
              title="K-means clusters. We ran k-means and got the following clusters:")