<a href="https://colab.research.google.com/github/steve-wilson/ds32019/blob/master/05_Text_Embeddings_DS3Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fundamentals of Text Analysis for User Generated Content @ EDHEC, 2021

# Part 5: Text Embeddings

[<- Previous: Content Analysis](https://colab.research.google.com/github/gordeli/textanalysis/blob/master/colab/04_Text_Processsing_Basics_DS3Text.ipynb)

[-> Next: Machine Learning](https://colab.research.google.com/github/gordeli/textanalysis/blob/master/colab/06_Text_Processsing_Basics_DS3Text.ipynb)

Dates: February 8 - 15, 2021

Facilitator: [Ivan Gordeliy](https://www.linkedin.com/in/gordeli/)

---



## Initial Setup

- **Run "Setup" below first.**

    - This will load libraries and download some resources that we'll use throughout the tutorial.

    - You will see a message reading "Done with setup!" when this process completes.



In [None]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------

# for defaultdict data structure
import collections
# for reading csv files
import csv
# operating system functions
import os

# 3rd party libraries
# -------------------

import numpy as np

! pip install --upgrade gensim
import gensim
import gensim.test.utils
import gensim.scripts.glove2word2vec
import gensim.models.fasttext

import scipy.spatial.distance

import nltk
nltk.download('stopwords')

! pip install fasttext
import fasttext

if not os.path.exists("glove.twitter.27B.zip"):
    !wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
else:
    print("GloVe already downloaded.")
if not os.path.exists("glove.twitter.27B.50d.txt"):
    !unzip glove.twitter.27B.zip
else:
    print("GloVe already extracted.")
if not os.path.exists("questions-words.txt"):
    !wget http://download.tensorflow.org/data/questions-words.txt
else:
    print("Word analogies data already loaded.")
if not os.path.exists("crawl-300d-2M-subword.zip"):
    !wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
else:
    print("Fasttext embeddings already downloaded")
if not os.path.exists("crawl-300d-2M-subword.vec"):
    print("Extracting Fasttext embeddings. This may take several minutes...")
    !unzip crawl-300d-2M-subword.zip
else:
    print("Fasttext already extracted.")
if not os.path.exists("Stsbenchmark.tar.gz"):
    !wget http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
    !tar -xzf Stsbenchmark.tar.gz
print()
print("Done with setup!")
print("If you'd like, you can click the (X) button to the left to clear this output.")

---

## A - Word Embeddings

### Using gensim for word embeddings

- Gensim is back: this time we'll use it to experiment with word embeddings.
- We can use it to load an embeddings matrix in several formats.
    - Since we've been working with some social media data, let's load the GloVe Twitter embeddings:

In [None]:
# Load 100-dimensional vectors
glove_file_path = "glove.twitter.27B.100d.txt"
# First convert to word2vec format (what gensim expects)
tmp_file_path = gensim.test.utils.get_tmpfile("word2vec_twitter_100d.txt")
_ = gensim.scripts.glove2word2vec.glove2word2vec(glove_file_path, tmp_file_path)

# Now load with gensim (may take several minutes)
print("Loading GloVe...")
glove = gensim.models.KeyedVectors.load_word2vec_format(tmp_file_path)
print("GloVe was loaded into memory.")

- **Now, what can we do with these embeddings?**
- Find the most similar words to a query:

In [None]:
glove.most_similar("france")
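
- As a rough illustration of what `most_similar` computes, the sketch below ranks a toy vocabulary by cosine similarity to a query word. The 3-dimensional vectors and the helper names (`cosine`, `toy_most_similar`) are invented for this example; gensim's own implementation works on the full embedding matrix and is much faster.

```python
import math

# Toy 3-d vectors, invented for illustration (the GloVe vectors above have 100 dims).
toy_vectors = {
    "france":  [0.9, 0.1, 0.0],
    "paris":   [0.8, 0.2, 0.1],
    "germany": [0.85, 0.0, 0.2],
    "banana":  [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(w, cosine(vectors[query], vec))
              for w, vec in vectors.items() if w != query]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]

for word, score in toy_most_similar("france", toy_vectors):
    print(word, round(score, 3))
```

- With these made-up vectors, `paris` and `germany` rank well above `banana`, because their vectors point in nearly the same direction as `france`.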

- Measure similarity between any pair of words, or between two lists of words:

In [None]:
pairs = [ ('data','pastry') , ('beautiful','lovely'), ('hot','cold'),
          ('a person walked their dog','a dog walked with a person') ,
          ('the school was prestigious','the university was highly ranked')]

for pair in pairs:
    sim = glove.n_similarity(pair[0].split(),pair[1].split())
    print(pair,":",sim)
print()
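
- To the best of our understanding, `n_similarity` compares the *mean* vector of each word list using cosine similarity. The sketch below reproduces that idea on invented 2-d vectors (`toy_vectors`, `mean_vector`, and `toy_n_similarity` are names made up for this example) and shows one consequence: word order is ignored.

```python
import math

# Toy 2-d vectors, invented for illustration.
toy_vectors = {
    "a": [0.1, 0.1], "person": [0.9, 0.2], "walked": [0.4, 0.7],
    "dog": [0.8, 0.5], "the": [0.1, 0.0],
}

def mean_vector(words, vectors):
    """Element-wise average of the word vectors (mean pooling)."""
    rows = [vectors[w] for w in words]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def toy_n_similarity(words1, words2, vectors):
    """Cosine similarity between the mean vectors of two word lists."""
    return cosine(mean_vector(words1, vectors), mean_vector(words2, vectors))

s1 = "a person walked the dog".split()
s2 = "the dog walked a person".split()
# Same bag of words, different order: the similarity is 1.0 up to rounding.
print(toy_n_similarity(s1, s2, toy_vectors))
```

- This is worth keeping in mind when reading the score for the two "walked" sentences above: an averaged representation cannot tell who walked whom.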

- Or just get the vector representation for a word:

In [None]:
vector = glove['summer']
print(vector)

### Word embedding geometry

- The classic example:

In [None]:
mystery_word_vec = glove['king'] - glove['man'] + glove['woman']
glove.similar_by_vector(mystery_word_vec)
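
- The same arithmetic can be followed by hand on a toy example. In the sketch below the 2-d vectors are hand-made so that the first axis loosely stands for "royalty" and the second for "gender"; real embeddings are not this clean, and all names here are invented for illustration.

```python
import math

# Hand-made 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# king - man + woman, computed dimension by dimension
target = [k - m + w for k, m, w in
          zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]

# nearest remaining word by cosine similarity, skipping the three input words
candidates = {w: v for w, v in toy_vectors.items()
              if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(target, best)
```

- Note that the sketch skips the input words, whereas `similar_by_vector` does not, so with real embeddings one of the inputs (often `king`) may appear at the top of the list.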

### Putting it together: word analogies

- In the not-too-distant past, these types of questions appeared on American college entrance examinations:
    - e.g., "man:king::woman:?"
        - answer: queen
- Around 2013, when word vectors were becoming extremely popular, this task was studied within the NLP community, and along with it came some standard datasets.
    Let's load one of them here:

In [None]:
analogies = collections.defaultdict(list)
category = ""
with open("questions-words.txt",'r') as analogy_questions:
    for aq in analogy_questions:
        if aq.startswith(":"):
            category = aq.split(":")[1].strip()
        else:
            a,b,c,d = aq.split()
            analogies[category].append([a,b,c,d])
            
print("Loaded analogies with",len(analogies),"categories.")
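
- For reference, the file layout that the loop above expects is simple: a line starting with `:` names a category, and each following line holds one four-word analogy. The sketch below parses a small in-memory sample in that layout (the variable names `sample` and `sample_analogies` are made up for this example).

```python
import collections
import io

# A small in-memory sample in the same layout as questions-words.txt
sample = io.StringIO(
    ": capital-common-countries\n"
    "Athens Greece Baghdad Iraq\n"
    "Athens Greece Bangkok Thailand\n"
    ": family\n"
    "boy girl brother sister\n"
)

sample_analogies = collections.defaultdict(list)
category = ""
for line in sample:
    if line.startswith(":"):
        # category header, e.g. ": family"
        category = line[1:].strip()
    elif line.strip():
        # one analogy per line: a b c d
        sample_analogies[category].append(line.split())

print(dict(sample_analogies))
```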

**Exercise 6**: Word analogies
- Using the GloVe embeddings that we loaded before, write a function to complete word analogies.
    - return a list of the top `top_n` guesses, in order from most likely to least likely.

In [None]:
# solve an analogy a is to b as c is to ?
def guess_analogy(a,b,c,vectors,top_n=5):
    
# ------------- Exercise 6 -------------- #


    return []
# ---------------- End ------------------ #
    

# quick test
a,b,c,d = "man","king","woman","queen"
guesses = guess_analogy(a,b,c,glove)
if d in guesses:
    print("queen was in the top n guesses!")
else:
    print("queen was NOT in the top n guesses...")
    print("guesses were:",guesses)

In [None]:
#@title Sample Solution (double-click to view) {display-mode: "form"}

def guess_analogy(a,b,c,vectors,top_n=5):
    
# ------------- Exercise 6 -------------- #

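    inputs = set([a,b,c])
    # vector arithmetic: b - a + c should land near the answer
    guess_vector = vectors[b] - vectors[a] + vectors[c]
    # use the `vectors` argument (not the global `glove`) and request extra
    # neighbours so that top_n remain after the input words are filtered out
    guesses = vectors.similar_by_vector(guess_vector, topn=top_n + len(inputs))
    return [item[0] for item in guesses if item[0] not in inputs][:top_n]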

# ---------------- End ------------------ #


# quick test
a,b,c,d = "man","king","woman","queen"
guesses = guess_analogy(a,b,c,glove)
if d in guesses:
    print("queen was in the top n guesses!")
else:
    print("queen was NOT in the top n guesses...")
    print("guesses were:",guesses)

- When you are ready, run your function on some more of the dataset:

In [None]:
# evaluation

# just check 100 per category, otherwise you will have to wait a while
max_per_category = 100
correct = collections.defaultdict(int)
top_5_correct = collections.defaultdict(int)
total = collections.defaultdict(int)
for category, analogy_questions in analogies.items():
    print("evaluating",category)
    for aq in analogy_questions[:max_per_category]:
        a,b,c,d = aq
        if all([item in glove for item in [a,b,c,d]]):
            guesses = guess_analogy(a,b,c,glove)
            total[category] += 1
            if d in guesses:
                top_5_correct[category] += 1
                if d == guesses[0]:
                    correct[category] += 1
                
global_correct = 0
global_top_5_correct = 0
global_total = 0
print()
print("Category-level results")
for category in analogies:
    if total[category]:
        print(category)
        print("\t","top-1 score:",float(correct[category])/total[category])
        print("\t","top-5 score:",float(top_5_correct[category])/total[category])
        global_correct += correct[category]
        global_top_5_correct += top_5_correct[category]
        global_total += total[category]
print("Overall results")
print("\t","top-1 score:",float(global_correct)/global_total)
print("\t","top-5 score:",float(global_top_5_correct)/global_total)
 

- Did you get better results than the sample solution?
    - It achieved, overall:
        - top-1: 0.46875
        - top-5: 0.640625
- What could you do to improve the results?
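
- One thing worth checking, offered as an assumption rather than a verified diagnosis: the GloVe Twitter vocabulary appears to be lowercased, while `questions-words.txt` contains capitalized entries such as `Athens`, so the `all(item in glove ...)` filter in the evaluation cell may silently skip many questions. The sketch below uses a made-up vocabulary (`toy_vocab`) and a hypothetical helper (`coverage`) to compare how many questions are usable with and without lowercasing.

```python
# Made-up, lowercase-only vocabulary standing in for the real embedding keys.
toy_vocab = {"athens", "greece", "baghdad", "iraq",
             "boy", "girl", "brother", "sister"}

questions = [
    ["Athens", "Greece", "Baghdad", "Iraq"],
    ["boy", "girl", "brother", "sister"],
]

def coverage(questions, vocab, lowercase=False):
    """Fraction of questions whose four words are all in the vocabulary."""
    covered = 0
    for q in questions:
        words = [w.lower() for w in q] if lowercase else q
        if all(w in vocab for w in words):
            covered += 1
    return covered / len(questions)

print("as-is:     ", coverage(questions, toy_vocab))
print("lowercased:", coverage(questions, toy_vocab, lowercase=True))
```

- With the real data, the same check would use `glove` in place of `toy_vocab`, since the evaluation cell already relies on `in` for membership tests.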

---
## B - Sub-words and Compositionality

- You may have noticed that our word embeddings don't always work:

In [None]:
# note that the current version of GloVe Twitter is about 5 years old now
try:
    print(glove['adulting'])
except Exception as e:
    print("Error:",e)

- How could we address this?
- We could:
    - skip words for which we do not have embeddings
    - use the same "out of vocabulary" (OOV) vector to represent each unknown word.
    - re-train our word embeddings every...
        - year? month? day?
    - change the unit of semantics from *words* to *something else*.
        - What else might we choose?
            - Some proposals include subwords, word pieces, and characters.
- Let's consider the last approach, first by looking at subword embeddings with fasttext:
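
- Before moving on, the first two options in the list above can be sketched in a few lines. The toy embeddings, the all-zero `OOV_VECTOR`, and the helper names are assumptions made for this example; in practice the OOV vector is often learned or set to an average vector instead.

```python
# Toy 2-d embeddings, invented for illustration.
toy_vectors = {"i": [0.1, 0.2], "love": [0.7, 0.3], "weekends": [0.4, 0.9]}

# A single shared vector for every unknown word (here simply all zeros).
OOV_VECTOR = [0.0, 0.0]

def embed_skip(tokens, vectors):
    """Option 1: drop tokens that have no embedding."""
    return [vectors[t] for t in tokens if t in vectors]

def embed_with_oov(tokens, vectors, oov_vector=OOV_VECTOR):
    """Option 2: map every unknown token to the same OOV vector."""
    return [vectors.get(t, oov_vector) for t in tokens]

tokens = "i love adulting".split()
print(len(embed_skip(tokens, toy_vectors)))     # 2 vectors: 'adulting' is dropped
print(embed_with_oov(tokens, toy_vectors)[-1])  # the shared OOV vector
```

- Both options are cheap, but both discard whatever the unknown word meant, which is the motivation for the subword approach.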

### Subword embeddings with FastText

In [None]:
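# load the pretrained binary model with the fasttext library
# (gensim can also read this format via gensim.models.fasttext.load_facebook_model)
# this will also take several minutes to load into memory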
emb_path = "crawl-300d-2M-subword.bin"
fasttext_model = fasttext.load_model(emb_path)

- We can do the same kinds of things that we did before with GloVe:

In [None]:
print("First 50 dimensions of vector for computer:")
print(fasttext_model['computer'][:50])

print("Similarity between 'forest' and 'trees':")
print(1- scipy.spatial.distance.cosine(fasttext_model['forest'], fasttext_model['trees']))

- In addition, we can use subword information to reason about unknown words:

In [None]:
print("First 50 dimensions of vector for adulting:")
print(fasttext_model['adulting'][:50])
print()
# just to prove that totally new words are handled
print("First 50 dimensions of vector for howhoozaling:")
print(fasttext_model['howhoozaling'][:50])
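
- Why does an unseen word still get a vector? fastText builds a word's vector from the vectors of its character n-grams, with `<` and `>` marking the word boundaries. The sketch below only lists those n-grams; `char_ngrams` is written for this example and is not part of the fasttext library, the n-gram lengths are illustrative rather than the settings of the pretrained model, and the real library additionally hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# trigrams only
print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# n-grams that a new word shares with a familiar one
shared = set(char_ngrams("adulting")) & set(char_ngrams("adult"))
print(sorted(shared))
```

- Because `adulting` shares many n-grams with `adult`, its vector is assembled from pieces the model has already seen, even if the full word never occurred in training.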

### Semantic compositions

- Simply averaging word embeddings (mean pooling) turns out to be a strong baseline for short text representations.

In [None]:
sent1 = "The airplane flew over the fields"
sent2 = "A train crossed the river"

sent1_emb = np.mean([fasttext_model[w.lower()] for w in sent1.split()],axis=0)
sent2_emb = np.mean([fasttext_model[w.lower()] for w in sent2.split()],axis=0)

print("Similarity:",1- scipy.spatial.distance.cosine(sent1_emb,sent2_emb))

### Putting it together: short text similarity

- SemEval is a yearly competition to solve a range of semantic NLP tasks.
- A common SemEval task is "semantic text similarity".
    - The goal is to build a model that can produce similarity scores for pairs of texts that are highly correlated with human judgements of similarity.
    - Let's load some data for this task:

In [None]:
test_sts_data = []
fnames = ['section1','section2','section3','docid','score','doc1','doc2']
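# quoting=csv.QUOTE_NONE: treat quote characters inside the sentences as plain text
with open("stsbenchmark/sts-test.csv",'r',encoding='utf-8') as infile:
    reader = csv.DictReader(infile, fieldnames=fnames,dialect=csv.excel_tab,quoting=csv.QUOTE_NONE)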
    for row in reader:
        test_sts_data.append(row)
print("Loaded",len(test_sts_data),"test pairs.")
print("Example:",test_sts_data[0])

- Now, let's build a simple system to get some reasonable results on this task.
    - hint: stopword removal should be helpful here!

In [None]:
def get_text_sim_score(text1,text2,word_embeddings):

# ------------- Exercise 3 -------------- #

    score = 0.0  # placeholder: replace with your similarity computation

# ---------------- End ------------------ #
    return score

In [None]:
#@title Sample Solution (double-click to view) {display-mode: "form"}
stopwords = set(nltk.corpus.stopwords.words('english'))

def get_text_sim_score(text1,text2,word_embeddings):

# ------------- Exercise 3 -------------- #

    vec1 = np.mean([word_embeddings[w] for w in text1.split() if w.lower() not in stopwords],axis=0)
    vec2 = np.mean([word_embeddings[w] for w in text2.split() if w.lower() not in stopwords],axis=0)
    score = 1 - scipy.spatial.distance.cosine(vec1,vec2)

# ---------------- End ------------------ #
    return score
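
- One edge case to keep in mind with stopword removal: if every token in a text is a stopword, the filtered list is empty and there is nothing to average. A minimal sketch of one possible guard, which falls back to the unfiltered tokens, is shown below. The tiny stopword set and the helper name `content_tokens` are assumptions for this example, not part of the sample solution.

```python
# A tiny stand-in for nltk's English stopword list, for illustration only.
toy_stopwords = {"the", "a", "is", "it", "was", "this"}

def content_tokens(text, stopwords):
    """Lowercase, split on whitespace, and drop stopwords.

    Falls back to all tokens when filtering would leave nothing to average.
    """
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in stopwords]
    return kept if kept else tokens

print(content_tokens("The school was prestigious", toy_stopwords))  # ['school', 'prestigious']
print(content_tokens("This is it", toy_stopwords))                  # ['this', 'is', 'it']
```

- The returned tokens could then be looked up in the embeddings and averaged as before.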

- Let's evaluate the performance:

In [None]:
import scipy.stats  # pearsonr lives here; the setup cell only imports scipy.spatial.distance

golds, preds = [],[]
for test_data in test_sts_data:
    text1 = test_data['doc1']
    text2 = test_data['doc2']
    if text1 and text2:
        gold = float(test_data['score'])
        pred = get_text_sim_score(text1,text2,fasttext_model)
        golds.append(gold)
        preds.append(pred)

print("Some examples of pairs and their human annotated scores:")
for item in test_sts_data[:5]:
    print(item['score'],item['doc1'],item['doc2'])
print("Correct labels:",golds[:5])
print("Predictions:",preds[:5])
print("Correlation Score (Pearson r, p-value):",scipy.stats.pearsonr(golds,preds))
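
- For intuition about the metric, the sketch below computes the Pearson correlation by hand on made-up numbers (`pearson_r`, `toy_golds`, and `toy_preds` are names invented for this example). It also shows why it does not matter that the gold scores and the cosine-based predictions live on different scales.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads.

    The 1/n factors of covariance and variance cancel, so they are omitted.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up gold scores and predictions on a different scale.
toy_golds = [0.5, 2.0, 3.5, 5.0]
toy_preds = [0.1, 0.4, 0.7, 1.0]
# The predictions are a linear rescaling of the golds, so r is 1.0 up to rounding.
print(pearson_r(toy_golds, toy_preds))
```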

- That is a good start! 
- It's definitely possible to do better with different composition functions/networks, or even this same approach with different embeddings.
    - See some of the state-of-the-art results here: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark

- [-> Next: Machine Learning](https://colab.research.google.com/github/gordeli/textanalysis/blob/master/colab/06_Text_Processsing_Basics_DS3Text.ipynb)