## Seminar 1: Fun with Word Embeddings (3 points)

Today we're going to play with word embeddings: train our own little embedding, load one from the gensim model zoo, and use it to visualize text corpora.

This whole thing is going to happen on top of an embedding dataset.

__Requirements:__  `pip install --upgrade nltk gensim bokeh` , but only if you're running locally.

In [1]:
import gutil

In [1]:
! pip install --upgrade nltk gensim bokeh

Requirement already up-to-date: nltk in /home/gpashchenko/.local/lib/python3.8/site-packages (3.6.7)
Collecting gensim
  Downloading gensim-4.1.2-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting bokeh
  Downloading bokeh-2.4.2-py3-none-any.whl (18.5 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Collecting typing-extensions>=3.10.0
  Downloading typing_extensions-4.0.1-py3-none-any.whl (22 kB)
Collecting packaging>=16.8
  Downloading packaging-21.3-py3-none-any.whl (40 kB)
Installing collected packages: smart-open, gensim, typing-extensions, packaging, bokeh
Successfully installed bokeh-2.4.2 gensim-4.1.2 packaging-21.3 smart-open

In [2]:
# download the data:
!wget https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1 -O ./quora.txt
# alternative download link: https://yadi.sk/i/BPQrUu1NaTduEw

--2022-01-21 20:39:57--  https://www.dropbox.com/s/obaitrix9jyu84r/quora.txt?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.70.18, 192.52.178.30, 192.33.14.30, ...
Connecting to www.dropbox.com (www.dropbox.com)|162.125.70.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/obaitrix9jyu84r/quora.txt [following]
--2022-01-21 20:39:58--  https://www.dropbox.com/s/dl/obaitrix9jyu84r/quora.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc5e33b90be74421989a622d0169.dl.dropboxusercontent.com/cd/0/get/BePD8USa9i8VH9Eh_bdXIdZ_BC8hv9jeE6yzt31kDLHOoLN_GxVIdGrqPam26re95_b8tmC4EAKNCNjFABXQNXcb-Sr85ag6pB2xor3yyPiKBvGZ0B_SAATHgUF_r6BO12Nft3iKiRFp0bUIR7-9Xmtq/file?dl=1# [following]
--2022-01-21 20:39:58--  https://uc5e33b90be74421989a622d0169.dl.dropboxusercontent.com/cd/0/get/BePD8USa9i8VH9Eh_bdXIdZ_BC8hv9jeE6yzt31kDLHOoLN_GxVIdGrqPam26re95_b8tmC4EAKNCNjFABXQNXcb-

In [1]:
import numpy as np

data = list(open("./quora.txt", encoding="utf-8"))
data[50]

"What TV shows or books help you read people's body language?\n"

In [7]:
len(data)

537272

__Tokenization:__ a typical first step for an NLP task is to split raw data into words.
The text we're working with is in raw format, with punctuation and smileys attached to some words, so a simple `str.split` won't do.

Let's use __`nltk`__, a library that handles many NLP tasks like tokenization, stemming, or part-of-speech tagging.

In [3]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()

print(tokenizer.tokenize(data[50]))

['What', 'TV', 'shows', 'or', 'books', 'help', 'you', 'read', 'people', "'", 's', 'body', 'language', '?']
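
To see concretely why a plain `str.split` falls short, here is a small standard-library sketch. It compares `str.split` with a regular expression that, as far as I know, matches the pattern `WordPunctTokenizer` is built on (`\w+|[^\w\s]+`). The helper name `simple_tokenize` is hypothetical and not part of nltk.

```python
import re

def simple_tokenize(text):
    # a token is either a run of word characters or a run of non-space punctuation
    return re.findall(r"\w+|[^\w\s]+", text)

line = "What TV shows or books help you read people's body language?\n"

split_tokens = line.split()            # whitespace splitting keeps punctuation glued to words
regex_tokens = simple_tokenize(line)   # punctuation becomes separate tokens

print(split_tokens[-3:])  # ["people's", 'body', 'language?']
print(regex_tokens[-6:])  # ['people', "'", 's', 'body', 'language', '?']
```

With `str.split`, `language?` and `language` would end up as two different vocabulary entries, which is exactly what we want to avoid before training embeddings.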


In [4]:
tokenizer.tokenize(data[0])

['Can',
 'I',
 'get',
 'back',
 'with',
 'my',
 'ex',
 'even',
 'though',
 'she',
 'is',
 'pregnant',
 'with',
 'another',
 'guy',
 "'",
 's',
 'baby',
 '?']

In [5]:
# TASK: lowercase everything and extract tokens with tokenizer. 
# data_tok should be a list of lists of tokens for each line in data.

data_tok = [tokenizer.tokenize(x.lower()) for x in data]

In [6]:
assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"

In [7]:
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


__Word vectors:__ as the saying goes, there's more than one way to train word embeddings. There's Word2Vec and GloVe with different objective functions. Then there's fasttext that uses character-level models to train word embeddings. 

The choice is huge, so let's start someplace small: __gensim__ is another NLP library that features many vector-based models, including word2vec.

In [8]:
from gensim.models import Word2Vec
model = Word2Vec(data_tok,
                 vector_size=32,  # embedding vector size
                 min_count=5,     # consider words that occurred at least 5 times
                 window=5).wv     # define context as a 5-word window around the target word

In [9]:
# now you can get word vectors !
model.get_vector('anything')

array([-2.2882588 ,  0.1343236 ,  1.1172318 ,  2.113377  ,  3.0306091 ,
        3.0476143 ,  1.8451658 , -5.1490836 , -0.37057856,  2.200452  ,
        0.07217664,  2.4077938 ,  3.1068797 ,  0.64090174,  1.9048729 ,
       -0.8738476 ,  0.07650772, -1.2080091 ,  0.42416903, -2.8709009 ,
       -3.3885703 ,  0.2821075 , -1.2939146 , -2.8906276 ,  1.0795268 ,
       -3.2066882 ,  0.26704913,  1.1598912 ,  0.43757188,  0.3992324 ,
       -0.12498759, -0.8935104 ], dtype=float32)

In [16]:
model.most_similar("asdasdasdaqwewqe")

KeyError: "Key 'asdasdasdaqwewqe' not present"

In [18]:
# or query similar words directly. Go play with it!
model.most_similar('bread')

[('rice', 0.9588751196861267),
 ('fruit', 0.93880295753479),
 ('cheese', 0.9220866560935974),
 ('butter', 0.9188661575317383),
 ('sauce', 0.9168456196784973),
 ('soup', 0.9117885231971741),
 ('chicken', 0.9066646099090576),
 ('pasta', 0.8991862535476685),
 ('honey', 0.8957147002220154),
 ('beans', 0.8936512470245361)]
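
The scores next to each neighbour are cosine similarities. A minimal sketch of that ranking on made-up data is shown below. The 3-dimensional vectors and the helper `toy_most_similar` are hypothetical and only mimic the idea, not gensim's actual implementation.

```python
import numpy as np

# hypothetical 3-d embeddings, invented purely for illustration
toy_vectors = {
    "bread":  np.array([1.0, 0.9, 0.1]),
    "butter": np.array([0.9, 1.0, 0.2]),
    "car":    np.array([0.1, 0.2, 1.0]),
}

def cosine(u, v):
    # cosine of the angle between u and v
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_most_similar(word, topn=2):
    # score every other word against the query and sort from most to least similar
    query = toy_vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in toy_vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar("bread")
print(ranking)  # "butter" comes first, "car" second
```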

### Using pre-trained model

It took a while, huh? Now imagine training life-sized (100~300D) word embeddings on gigabytes of text: Wikipedia articles or Twitter posts.

Thankfully, nowadays you can get a pre-trained word embedding model in 2 lines of code (no sms required, promise).

In [2]:
import gensim.downloader as api
model = api.load('glove-twitter-100')

In [6]:
model.most_similar(positive = ['bad'])

[('but', 0.8499988317489624),
 ('shit', 0.8407027721405029),
 ('really', 0.8376076221466064),
 ('way', 0.8303709626197815),
 ('thing', 0.8276876211166382),
 ("n't", 0.8259333968162537),
 ('right', 0.8235622048377991),
 ('stupid', 0.8233468532562256),
 ('think', 0.8231157660484314),
 ('crazy', 0.822810173034668)]

In [23]:
vec = model.get_vector("coder") + model.get_vector("money") - model.get_vector('brain')

In [25]:
vec2 = model.get_vector("broker")

In [31]:
np.linalg.norm(vec-vec2)

6.6456246

In [35]:
np.linalg.norm(model.get_vector("money"))

6.4092784

In [34]:
np.linalg.norm(model.get_vector("coder"))

5.394739

In [40]:
np.arccos(np.dot(vec,vec2)/(np.linalg.norm(vec) * np.linalg.norm(vec2)))

0.95243216

In [18]:
model.most_similar(positive=["coder", "money"], negative=["brain"])

[('broker', 0.5820156335830688),
 ('bonuses', 0.5424473285675049),
 ('banker', 0.5385113954544067),
 ('designer', 0.5197198390960693),
 ('merchandising', 0.4964233934879303),
 ('treet', 0.49220189452171326),
 ('shopper', 0.49205613136291504),
 ('part-time', 0.4912828207015991),
 ('freelance', 0.4843311607837677),
 ('aupair', 0.47964534163475037)]
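
The angle computed by hand above (about 0.95 rad, i.e. a cosine of roughly 0.58) is close to, but not exactly, the score that `most_similar` reports for `broker`. One plausible explanation, assuming gensim rescales each input vector to unit length before combining them, is that raw vector arithmetic lets long vectors dominate the sum. The sketch below shows the effect on hypothetical 2-d vectors; none of these numbers come from the GloVe model.

```python
import numpy as np

def unit(v):
    # rescale a vector to unit length
    return v / np.linalg.norm(v)

def cosine(u, v):
    return float(np.dot(unit(u), unit(v)))

# hypothetical vectors with very different lengths
a = np.array([4.0, 0.0])       # a "long" vector
b = np.array([0.0, 1.0])       # a "short" vector
c = np.array([1.0, 0.0])
target = np.array([1.0, 1.0])

raw_query = a + b - c                          # raw arithmetic, as in the manual cells
normed_query = unit(a) + unit(b) - unit(c)     # every input normalized first (assumed gensim-like behaviour)

raw_score = cosine(raw_query, target)
normed_score = cosine(normed_query, target)
print(raw_score, normed_score)  # the two scores differ
```

If the two ways of combining vectors disagree on toy data like this, small differences between the manual computation and `most_similar` on real embeddings are not surprising.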

### Visualizing word vectors

One way to see if our vectors are any good is to plot them. Thing is, those vectors are in 30D+ space and we humans are more used to 2-3D.

Luckily, we machine learners know about __dimensionality reduction__ methods.

Let's use them to plot the 1000 most frequent words.

In [61]:
model.get_vecattr(951,"count")

1192563

In [64]:
words = sorted(model.key_to_index.keys(), 
               key=lambda word: model.get_vecattr(word, "count"),
               reverse=True)[:1000]

print(words[::100])

['<user>', '_', 'please', 'apa', 'justin', 'text', 'hari', 'playing', 'once', 'sei']


In [66]:
words[::100]

['<user>',
 '_',
 'please',
 'apa',
 'justin',
 'text',
 'hari',
 'playing',
 'once',
 'sei']

In [67]:
# for each word, compute its vector with model
word_vectors = [model.get_vector(x) for x in words]
word_vectors = np.array(word_vectors)

In [71]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words), 100)
assert np.isfinite(word_vectors).all()

#### Linear projection: PCA

The simplest linear dimensionality reduction method is **P**rincipal **C**omponent **A**nalysis.

In geometric terms, PCA tries to find axes along which most of the variance occurs. The "natural" axes, if you wish.

<img src="https://github.com/yandexdataschool/Practical_RL/raw/master/yet_another_week/_resource/pca_fish.png" style="width:30%">


Under the hood, it attempts to approximate the object-feature matrix $X$ using two smaller matrices, $W$ and $\hat W$, minimizing the _mean squared error_:

$$\|(X W) \hat{W} - X\|^2_2 \to_{W, \hat{W}} \min$$
- $X \in \mathbb{R}^{n \times m}$ - object matrix (**centered**);
- $W \in \mathbb{R}^{m \times d}$ - matrix of direct transformation;
- $\hat{W} \in \mathbb{R}^{d \times m}$ - matrix of reverse transformation;
- $n$ samples, $m$ original dimensions and $d$ target dimensions;
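
One way to make the objective above tangible is to solve it with an SVD. For a centered $X$, taking the top $d$ right singular vectors as the columns of $W$ and setting $\hat W = W^T$ minimizes the reconstruction error. The sketch below checks this numerically on random data; the data and the helper name `pca_reconstruction_error` are assumptions for illustration, and sklearn's `PCA` may differ in implementation details.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical data: n=200 samples, m=5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)  # PCA assumes a centered matrix

def pca_reconstruction_error(X, d):
    # rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:d].T       # direct transformation, shape (m, d)
    W_hat = W.T        # reverse transformation, shape (d, m)
    return float(np.sum((X @ W @ W_hat - X) ** 2))

errors = [pca_reconstruction_error(X, d) for d in range(1, 6)]
print(errors)  # the error shrinks as d grows and is ~0 when d == m
```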



In [76]:
from sklearn.decomposition import PCA

# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance
pca = PCA(n_components = 2)

pca.fit(word_vectors)

word_vectors_pca = pca.transform(word_vectors)

# and maybe MORE OF YOUR CODE here :)

In [86]:
word_vectors_pca = word_vectors_pca - word_vectors_pca.mean(0)
word_vectors_pca/=word_vectors_pca.std(0)

In [87]:
assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"

#### Let's draw it!

In [88]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

In [89]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words)

# hover a mouse over there and see if you can identify the clusters

### Visualizing neighbors with t-SNE
PCA is nice, but it's strictly linear and thus only able to capture the coarse, high-level structure of the data.

If we instead want to focus on keeping neighboring points near, we could use t-SNE, which is itself an embedding method. Here you can read __[more on t-SNE](https://distill.pub/2016/misread-tsne/)__.

In [109]:
from sklearn.manifold import TSNE

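# map word vectors onto 2d plane with TSNE. hint: don't panic, it may take a minute or two to fit.
# normalize them just like with pca
word_tsne = TSNE(n_components=2).fit_transform(word_vectors)
word_tsne = (word_tsne - word_tsne.mean(axis=0)) / word_tsne.std(axis=0)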

In [None]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words)

### Visualizing phrases

Word embeddings can also be used to represent short phrases. The simplest way is to take __an average__ of vectors for all tokens in the phrase with some weights.

This trick is useful for understanding what data you are working with: finding out whether there are any outliers, clusters, or other artefacts.

Let's try this new hammer on our data!
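
The phrase "with some weights" deserves a quick illustration. Below is a minimal sketch of a weighted average over token vectors, using a hypothetical two-dimensional vocabulary and made-up weights (a weight of 0 switches a word off, as one might do for stop words). The names `toy_vectors`, `toy_weights` and `weighted_phrase_vector` are assumptions for this example only.

```python
import numpy as np

# hypothetical embeddings and per-word weights
toy_vectors = {
    "fast": np.array([1.0, 0.0]),
    "food": np.array([0.0, 1.0]),
    "the":  np.array([1.0, 1.0]),
}
toy_weights = {"fast": 1.0, "food": 1.0, "the": 0.0}

def weighted_phrase_vector(tokens, dim=2):
    # keep only tokens that have a vector
    known = [t for t in tokens if t in toy_vectors]
    weights = np.array([toy_weights.get(t, 1.0) for t in known])
    # nothing usable: fall back to a zero vector
    if len(known) == 0 or weights.sum() == 0:
        return np.zeros(dim)
    vectors = np.stack([toy_vectors[t] for t in known])
    return np.average(vectors, axis=0, weights=weights)

phrase_vec = weighted_phrase_vector(["the", "fast", "food", "addiction"])
empty_vec = weighted_phrase_vector(["addiction"])
print(phrase_vec)  # [0.5 0.5]
print(empty_vec)   # [0. 0.]
```

With all weights equal, this reduces to the plain average used in the rest of this section.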


In [100]:
def get_phrase_embedding(phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    # 1. lowercase phrase
    # 2. tokenize phrase
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros

    tokens = tokenizer.tokenize(phrase.lower())
    vector = np.zeros([model.vector_size], dtype='float32')
    count = 0
    for token in tokens:
        try:
            vector += model.get_vector(token)
            count += 1
        except KeyError:
            pass
    # avoid dividing by zero (NaNs) when no token is in the vocabulary
    if count > 0:
        vector /= count
    return vector

        
    

In [101]:
phrase = "I'm very sure. This never happened to me before..."



In [102]:
get_phrase_embedding(phrase)

array([ 0.31807372,  0.25188074,  0.06748973,  0.07918166, -0.20551686,
        0.3099363 ,  0.12886   , -0.06323332,  0.1167265 ,  0.1529691 ,
       -0.02558171, -0.22829592, -4.4815    ,  0.02976158, -0.12603833,
       -0.1181161 , -0.23754632, -0.02389293, -0.47764084, -0.03727125,
        0.0933293 , -0.11675651, -0.07968461,  0.16103767,  0.22593975,
       -0.9288166 , -0.13147809, -0.27092585,  0.32907042, -0.12769467,
       -0.1002182 ,  0.05345418, -0.33383527,  0.16147001,  0.05949384,
        0.13145241,  0.04188244, -0.11879817,  0.09030958, -0.02781658,
       -1.0278689 ,  0.00457059,  0.21854158, -0.06651367,  0.23628668,
       -0.14505692,  0.17416383, -0.08423841, -0.12737857, -0.01618933,
       -0.16621883,  0.12946318,  0.17684542, -0.01754692,  0.12413007,
        0.01527582, -0.11908681,  0.11422   ,  0.15738223,  0.09719244,
        0.05083408,  0.0163896 ,  0.03656067, -0.11753249,  0.25182876,
       -0.10408742, -0.11623914, -0.11769509, -0.22145198,  0.07

In [103]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")

assert np.allclose(vector[::10],
                   np.array([ 0.31807372, -0.02558171,  0.0933293 , -0.1002182 , -1.0278689 ,
                             -0.16621883,  0.05083408,  0.17989802,  1.3701859 ,  0.08655966],
                              dtype=np.float32))

In [106]:
# let's only consider ~1k phrases for a first run.
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases
#phrase_vectors = # YOUR CODE
phrase_vectors = np.array([get_phrase_embedding(phrase) for phrase in chosen_phrases])

In [107]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), model.vector_size)

In [110]:
# map vectors into 2d space with pca, tsne or your other method of choice
# don't forget to normalize

phrase_vectors_2d = TSNE().fit_transform(phrase_vectors)

phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)



In [111]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)

Finally, let's build a simple "similar question" engine with phrase embeddings we've built.

In [None]:
# compute vector embedding for all lines in data
data_vectors = np.array([get_phrase_embedding(l) for l in data])

In [None]:
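def find_nearest(query, k=10):
    """
    Given a text line (query), return the k most similar lines from data, sorted from most to least similar.
    Similarity is measured as the cosine between query and line embedding vectors.
    Hint: it's okay to use global variables: data and data_vectors. See also: np.argpartition, np.argsort
    """
    query_vector = get_phrase_embedding(query)
    # product of norms; the small epsilon guards against all-zero embeddings
    norms = np.linalg.norm(data_vectors, axis=1) * np.linalg.norm(query_vector) + 1e-9
    similarities = data_vectors @ query_vector / norms
    # indices of the k largest similarities, most similar first
    top_k = np.argsort(-similarities)[:k]
    return [data[i] for i in top_k]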

In [None]:
results = find_nearest(query="How do i enter the matrix?", k=10)

print(''.join(results))

assert len(results) == 10 and isinstance(results[0], str)
assert results[0] == 'How do I get to the dark web?\n'
assert results[3] == 'What can I do to save the world?\n'

In [None]:
find_nearest(query="How does Trump?", k=10)

In [None]:
find_nearest(query="Why don't i ask a question myself?", k=10)

__Now what?__
* Try running TSNE on all data, not just 1000 phrases
* See what other embeddings are there in the model zoo: `gensim.downloader.info()`
* Take a look at [FastText](https://github.com/facebookresearch/fastText) embeddings
* Optimize find_nearest with locality-sensitive hashing: use [nearpy](https://github.com/pixelogik/NearPy) or `sklearn.neighbors`.
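
Before reaching for approximate search, note that exact top-k retrieval can already be made cheaper: `np.argpartition` finds the k best candidates without sorting all scores, and only those k need a final sort. This is plain exact search, not locality-sensitive hashing. The sketch below uses random vectors as a stand-in for `data_vectors`; the names `top_k_cosine`, `toy_matrix` and `toy_query` are hypothetical.

```python
import numpy as np

def top_k_cosine(query_vector, matrix, k):
    # cosine similarity of the query against every row; epsilon avoids division by zero
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vector) + 1e-9
    similarities = matrix @ query_vector / norms
    k = min(k, len(similarities))
    # argpartition moves the k largest scores into the last k slots, in arbitrary order
    candidates = np.argpartition(similarities, -k)[-k:]
    # sort only those k candidates, most similar first
    return candidates[np.argsort(-similarities[candidates])]

rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(1000, 16))                   # stand-in for data_vectors
toy_query = toy_matrix[42] + 0.01 * rng.normal(size=16)    # a slightly perturbed copy of row 42

fast = top_k_cosine(toy_query, toy_matrix, k=5)

# reference answer: full sort over all similarities
full_scores = toy_matrix @ toy_query / (
    np.linalg.norm(toy_matrix, axis=1) * np.linalg.norm(toy_query))
slow = np.argsort(-full_scores)[:5]

print(fast, slow)  # both start with row 42
```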

In [113]:
import gensim.downloader  # import the submodule explicitly so gensim.downloader.info() is available

In [117]:
models = gensim.downloader.info()['models']

In [121]:
len(models)

13

In [120]:
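# use model_name so the loop does not overwrite the `model` embeddings loaded earlier
for model_name in models:
    print(model_name)
    print(models[model_name], "\n\n\n")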

fasttext-wiki-news-subwords-300
{'num_records': 999999, 'file_size': 1005007116, 'base_dataset': 'Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens)', 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/fasttext-wiki-news-subwords-300/__init__.py', 'license': 'https://creativecommons.org/licenses/by-sa/3.0/', 'parameters': {'dimension': 300}, 'description': '1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).', 'read_more': ['https://fasttext.cc/docs/en/english-vectors.html', 'https://arxiv.org/abs/1712.09405', 'https://arxiv.org/abs/1607.01759'], 'checksum': 'de2bb3a20c46ce65c9c131e1ad9a77af', 'file_name': 'fasttext-wiki-news-subwords-300.gz', 'parts': 1} 



conceptnet-numberbatch-17-06-300
{'num_records': 1917247, 'file_size': 1225497562, 'base_dataset': 'ConceptNet, word2vec, GloVe, and OpenSubtitles 2016', 'reader_code': 'https://github.com/RaRe-Technologies/gensim-