In [4]:
# !pip install fasttext
# !pip install gensim

In [4]:
import gensim 
import logging

### NPR Media Dialog Dataset Overview

* Dataset Specifications (npr.org archives):
  * 140,000+ NPR radio interview transcripts
  * 20-year temporal coverage
  * 10,000+ hours of transcribed audio content



* Available via the Kaggle platform:
  * [kaggle.com/datasets/shuyangli94/interview-npr-media-dialog-transcripts](https://www.kaggle.com/datasets/shuyangli94/interview-npr-media-dialog-transcripts)

In [1]:
with open("./media/npr_1000_utterances.csv", "r", encoding="utf-8") as f:
    # Print the header line plus the first three data rows.
    for i, line in enumerate(f):
        print(line)
        if i == 3:
            break
        

episode,episode_order,speaker,utterance

57264,9,"Ms. LOREN MOONEY (Editor-in-Chief, Bicycling Magazine)","It's a 2,200-mile race. To give some sense of perspective, that's roughly the distance between Washington, D.C. and Las Vegas. They do it over the course of three weeks at very fast speeds. But incredibly, oftentimes the distance between first and second is somewhere between and one and three minutes."

57264,10,"Ms. LOREN MOONEY (Editor-in-Chief, Bicycling Magazine)","So for a top competitor like Lance to try to make up that much time -he's now 13 minutes, 26 seconds behind the current race leader, Cadel Evans of Australia. And even Lance said yesterday that for him, the -any chance of winning the tour has gone out the window. He still does have a teammate on his team, RadioShack team, American Levi Leipheimer currently in eighth place, two minutes, 14 seconds back. And Lance is going to do what he can to help Leipheimer do well."

57264,11,"NEAL CONAN, host","So in every team, p

In [2]:
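import csv

# csv.reader handles quoted fields that contain commas (speaker, utterance).
with open("./media/npr_1000_utterances.csv", newline="", encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        print(row)
        break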

['episode', 'episode_order', 'speaker', 'utterance']


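We will use `simple_preprocess` to lowercase and tokenize each string; it can also strip accents when called with `deacc=True`.
The output of `simple_preprocess` is a list of tokens (unicode strings). By default, tokens shorter than 2 or longer than 15 characters are dropped.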



In [7]:
some_text = "WWW.google.com So #$test in ~every! time! What?"

gensim.utils.simple_preprocess(some_text) 

['www', 'google', 'com', 'so', 'test', 'in', 'every', 'time', 'what']
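
To make the behavior above concrete, here is a rough standard-library approximation of what `simple_preprocess` does. It is a sketch under stated assumptions, not gensim's implementation: the helper name `approx_simple_preprocess` and its regular expression are hypothetical, and it only mirrors the lowercasing, the alphabetic tokenization, and the default length filter (`min_len=2`, `max_len=15`).

```python
import re

# Hypothetical stand-in for gensim.utils.simple_preprocess.
# Assumption: tokens are runs of letters only; gensim's own pattern
# differs in edge cases (for example, underscores).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")


def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic runs, drop very short or long tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


sample_tokens = approx_simple_preprocess(
    "WWW.google.com So #$test in ~every! time! What?"
)
print(sample_tokens)
```

Punctuation and symbols act purely as separators here, which is why `#$test` and `~every!` come out as plain words.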

In [8]:
def read_input(input_file):
    """Yield one list of preprocessed tokens per utterance in the CSV file."""
    with open(input_file, newline='', encoding='utf-8') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader, None)  # Skip the header row
        for row in reader:
            text = row[-1]  # The utterance is the last column
            yield gensim.utils.simple_preprocess(text)

utterances = list(read_input("./media/npr_1000_utterances.csv"))


In [10]:
len(utterances)


999

In [12]:
utterances[20][0:10]

['that', 'right', 'one', 'of', 'the', 'very', 'cool', 'things', 'about', 'the']

We will use `gensim` to train a `Word2Vec` model on the 999 utterances loaded above.

https://radimrehurek.com/gensim/models/word2vec.html

In [13]:
model = gensim.models.Word2Vec(utterances, window=10, min_count=2, workers=10)
model

<gensim.models.word2vec.Word2Vec at 0x17783b850>
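
The `min_count=2` argument decides which words receive a vector at all: any word seen fewer than `min_count` times is left out of the vocabulary. The sketch below illustrates that idea with the standard library only. `build_vocab` is a hypothetical helper, not a gensim function, and the alphabetical tie-breaking is an assumption made for determinism.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Map each sufficiently frequent word to an index, most frequent first."""
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = [(word, count) for word, count in counts.items() if count >= min_count]
    # Sort by descending count; break ties alphabetically (an assumption).
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {word: idx for idx, (word, _) in enumerate(kept)}


toy_sentences = [
    ["the", "race", "is", "long"],
    ["the", "race", "is", "fast"],
    ["the", "tour"],
]
toy_vocab = build_vocab(toy_sentences, min_count=2)
print(toy_vocab)              # only words seen at least twice remain
print(toy_vocab.get("tour"))  # a word that was dropped has no index
```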

In [16]:
model.wv.key_to_index 

{'the': 0,
 'and': 1,
 'of': 2,
 'to': 3,
 'that': 4,
 'in': 5,
 'you': 6,
 'it': 7,
 'is': 8,
 'they': 9,
 'we': 10,
 'for': 11,
 'this': 12,
 'on': 13,
 'are': 14,
 'was': 15,
 'have': 16,
 'he': 17,
 'so': 18,
 'be': 19,
 'there': 20,
 'with': 21,
 'but': 22,
 'know': 23,
 'what': 24,
 'as': 25,
 'not': 26,
 'about': 27,
 're': 28,
 'do': 29,
 'think': 30,
 'or': 31,
 'who': 32,
 'people': 33,
 'at': 34,
 'from': 35,
 'can': 36,
 'just': 37,
 'like': 38,
 'very': 39,
 'us': 40,
 'one': 41,
 'if': 42,
 'well': 43,
 'has': 44,
 'some': 45,
 'their': 46,
 'an': 47,
 'all': 48,
 'when': 49,
 'much': 50,
 'these': 51,
 'by': 52,
 'his': 53,
 'them': 54,
 'more': 55,
 'out': 56,
 'how': 57,
 'because': 58,
 'would': 59,
 'don': 60,
 'now': 61,
 'really': 62,
 'right': 63,
 'been': 64,
 'talk': 65,
 'here': 66,
 'get': 67,
 'time': 68,
 'had': 69,
 'case': 70,
 'public': 71,
 'other': 72,
 'npr': 73,
 'up': 74,
 'our': 75,
 'your': 76,
 'were': 77,
 'see': 78,
 'said': 79,
 'those': 80,
 '

In [26]:
len(model.wv.key_to_index)

2337

In [17]:
model.wv.key_to_index.get("washington")

406

In [19]:
model.wv.key_to_index.get("chicago")

124

In [24]:
model.wv.key_to_index.get("tokyo") == None

True

### Question 1.
In the code above, `washington` maps to index 406 and `chicago` maps to index 124. Why does the lookup for `tokyo` return `None`?

In [27]:
model.wv["washington"].size

100

In [28]:
model.wv["washington"]

array([-0.01453825,  0.11561118,  0.05901224,  0.01848982,  0.02993853,
       -0.2699527 ,  0.09842443,  0.34577912, -0.14953035, -0.11280701,
       -0.06670459, -0.14930536, -0.05689364,  0.08399664,  0.07548301,
       -0.07253155,  0.03045561, -0.11971742, -0.12283409, -0.32158768,
        0.09490295,  0.09169069,  0.14070782, -0.11774407,  0.00546052,
        0.01279462, -0.05683667, -0.02806698, -0.18858129,  0.01687071,
        0.19345355, -0.01585543,  0.0303948 , -0.1799805 , -0.04571705,
        0.10247669, -0.01555746, -0.10694227, -0.08396591, -0.10558803,
       -0.01866447, -0.12303678, -0.04440008, -0.02612207,  0.08921865,
       -0.03680946, -0.07383728, -0.0439963 ,  0.13914657,  0.12657501,
        0.10129873, -0.10588013, -0.06920014, -0.03927993, -0.05805773,
        0.03013077,  0.03459086, -0.03486379, -0.15963997,  0.00557181,
        0.0070584 ,  0.02559014,  0.07467877,  0.02383271, -0.14963184,
        0.1535382 ,  0.07176617,  0.10931878, -0.18806885,  0.11

In [29]:
w1 = ["washington"]
model.wv.most_similar(positive=w1, topn=6)

[('him', 0.9987382292747498),
 ('from', 0.9987295866012573),
 ('his', 0.9987185597419739),
 ('after', 0.9987154006958008),
 ('has', 0.9987115859985352),
 ('us', 0.9987057447433472)]
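
The scores returned by `most_similar` are cosine similarities between word vectors. A minimal sketch of that computation, using only the standard library; the 3-dimensional vectors and the `cosine_similarity` helper are invented for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors (assumption: not real embeddings).
toy_vectors = {
    "paris": [0.9, 0.1, 0.0],
    "london": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

query = toy_vectors["paris"]
ranked = sorted(
    (
        (word, cosine_similarity(query, vec))
        for word, vec in toy_vectors.items()
        if word != "paris"
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Because the score depends only on direction, vectors that all point roughly the same way get scores close to 1 regardless of what the words mean.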

### Question 2.
When searching for the words most similar to 'Washington', we get results like 'his', 'from', etc., which are clearly not semantically similar to the word 'Washington'. Why is that? Didn't we show that Word2Vec does a good job of grouping semantically similar words, like city names?

### Some issues with the embeddings

-- Add text after practical

In [31]:
utterances = list(read_input("./media/npr_100000_utterances.csv"))
model = gensim.models.Word2Vec(utterances, window=10, min_count=2, workers=10)
model

<gensim.models.word2vec.Word2Vec at 0x1125c85e0>

In [32]:
w1 = ["peace"]
model.wv.most_similar (positive=w1,topn=6)

[('palestinians', 0.7815094590187073),
 ('israel', 0.7738756537437439),
 ('diplomatic', 0.7640045285224915),
 ('hamas', 0.7584590315818787),
 ('palestinian', 0.7460921406745911),
 ('democracy', 0.7318993806838989)]

In [33]:
w1 = ["france"]
model.wv.most_similar (positive=w1,topn=6)

[('india', 0.7822583317756653),
 ('paris', 0.773598313331604),
 ('indonesia', 0.7654559016227722),
 ('australia', 0.7619239091873169),
 ('kenya', 0.7597754001617432),
 ('nigeria', 0.7595521807670593)]

In [34]:
w1 = ["clean"]
model.wv.most_similar (positive=w1,topn=6)

[('wind', 0.7141364216804504),
 ('blow', 0.6803496479988098),
 ('coal', 0.6799625158309937),
 ('costs', 0.6738210320472717),
 ('carbon', 0.6735171675682068),
 ('product', 0.6574962139129639)]

In [None]:
### Question 3.

Why do the embeddings seem more specific now? Can you explain?

### Facebook's FastText
> FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.


![](https://www.dropbox.com/s/i74guibnv5mxx2h/fasttext.png?dl=1)

https://fasttext.cc/

In [35]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

--2024-10-22 09:52:32--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 2600:9000:20a6:6200:13:6e38:acc0:93a1, 2600:9000:20a6:7800:13:6e38:acc0:93a1, 2600:9000:20a6:5200:13:6e38:acc0:93a1, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:20a6:6200:13:6e38:acc0:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2024-10-22 09:53:03 (21.7 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]



In [36]:
!mv wiki-news-300d-1M.vec.zip media/
!unzip -o media/wiki-news-300d-1M.vec.zip -d media/

Archive:  media/wiki-news-300d-1M.vec.zip


In [37]:
!head ./media/words_to_keep

Athens
Greece
Bangkok
Thailand
Latvia
lats
Bulgaria
lev
bad
worse


In [38]:
keep_words= [x.rstrip() for x in open("./media/words_to_keep")]
keep_words

['Athens',
 'Greece',
 'Bangkok',
 'Thailand',
 'Latvia',
 'lats',
 'Bulgaria',
 'lev',
 'bad',
 'worse',
 'big',
 'bigger',
 'boy',
 'girl',
 'brother',
 'sister']

In [40]:
import numpy as np

keep = set(keep_words)  # set membership is fast; the file has about 1M lines
words_embeds = {}
with open("media/wiki-news-300d-1M.vec", encoding="utf-8") as vec_file:
    for line in vec_file:
        data = line.split()
        if data[0] in keep:
            words_embeds[data[0]] = np.array(list(map(float, data[1:])))

In [50]:
words_embeds["big"].size


300

In [51]:
words_embeds.keys()

dict_keys(['big', 'bad', 'girl', 'Greece', 'boy', 'brother', 'worse', 'sister', 'bigger', 'Thailand', 'Bulgaria', 'Athens', 'Latvia', 'Bangkok', 'lats', 'lev'])
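
The `.vec` file read above is plain text: a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its space-separated components. The sketch below applies the same filtering idea to an in-memory stand-in for the file. The three-word, 3-dimensional content and the `load_selected_vectors` helper are invented for illustration.

```python
import io

# Invented miniature stand-in for a .vec file (assumption: 3 words, 3 dims).
fake_vec_file = io.StringIO(
    "3 3\n"
    "big 0.1 0.2 0.3\n"
    "bad -0.1 0.0 0.5\n"
    "the 0.0 0.0 0.1\n"
)
wanted = {"big", "bad"}


def load_selected_vectors(lines, wanted_words):
    """Keep only the vectors whose word is in wanted_words."""
    embeddings = {}
    next(lines)  # skip the "<vocab_size> <dimension>" header line
    for line in lines:
        word, *values = line.rstrip().split(" ")
        if word in wanted_words:
            embeddings[word] = [float(v) for v in values]
    return embeddings


toy_embeds = load_selected_vectors(fake_vec_file, wanted)
print(toy_embeds)
```

Filtering while streaming keeps memory use small, since only the handful of requested vectors is ever stored.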

In [52]:
res = words_embeds["girl"] - words_embeds["boy"] + words_embeds["brother"]
res

array([ 0.3193, -0.0592, -0.0461, -0.1192, -0.0269,  0.0761, -0.0135,
       -0.0527, -0.0506,  0.073 ,  0.0514, -0.1087,  0.0921, -0.0437,
        0.0175,  0.204 , -0.0187, -0.0671,  0.1144, -0.0361, -0.1047,
        0.1144, -0.2408,  0.0436,  0.0406, -0.0068, -0.1024,  0.1106,
       -0.0419, -0.1826,  0.1547,  0.0084, -0.2653,  0.1108, -0.1934,
        0.152 , -0.1   , -0.0516,  0.0547, -0.0557, -0.0244,  0.1113,
       -0.0535,  0.0241,  0.0024, -0.0016, -0.0264,  0.0544,  0.0263,
       -0.0619,  0.0101, -0.0284, -0.6605,  0.12  ,  0.051 , -0.0162,
       -0.0769,  0.1983,  0.0784,  0.0058,  0.0147, -0.0124, -0.0843,
       -0.076 , -0.0489, -0.0935, -0.0857,  0.113 ,  0.0732, -0.0112,
        0.0516, -0.0555, -0.026 ,  0.0495, -0.0793,  0.1126,  0.0691,
        0.1725, -0.0754,  0.049 ,  0.036 , -0.1203, -0.0533, -0.185 ,
        0.108 , -0.073 , -0.2247, -0.2533,  0.1525,  0.0283, -0.0906,
        0.1891,  0.0952, -0.0831,  0.1297,  0.1307, -0.0464,  0.0717,
        0.0647, -0.0

In [56]:
(res - words_embeds["sister"]).round(2)

array([ 0.09,  0.03, -0.02, -0.01,  0.06,  0.  , -0.03, -0.1 , -0.18,
        0.05,  0.08,  0.09,  0.15, -0.  , -0.02,  0.12, -0.15, -0.14,
        0.08,  0.1 , -0.11,  0.09, -0.12, -0.12,  0.04,  0.05, -0.  ,
       -0.  , -0.04, -0.11, -0.03, -0.03, -0.06,  0.02, -0.13,  0.05,
       -0.13,  0.04, -0.01, -0.03,  0.11,  0.03,  0.07,  0.01,  0.1 ,
        0.  ,  0.02,  0.06,  0.1 , -0.07,  0.02,  0.05, -0.02,  0.12,
        0.07, -0.04, -0.01, -0.03,  0.01, -0.04,  0.03, -0.17,  0.02,
        0.04, -0.16,  0.11, -0.03,  0.05, -0.06,  0.03, -0.05, -0.07,
        0.03, -0.  , -0.08, -0.05,  0.06,  0.1 , -0.01,  0.08,  0.01,
       -0.07,  0.02,  0.03,  0.08, -0.14, -0.11, -0.12,  0.03,  0.05,
        0.02,  0.05,  0.03, -0.02,  0.08, -0.02, -0.06,  0.09,  0.07,
        0.02, -0.  ,  0.1 , -0.  ,  0.06, -0.06, -0.09, -0.  , -0.08,
       -0.04,  0.1 , -0.07,  0.08,  0.1 ,  0.02, -0.  , -0.03, -0.22,
       -0.06, -0.05, -0.07,  0.04, -0.01,  0.06, -0.  ,  0.01,  0.04,
       -0.01,  0.05,
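
Judging 300 raw differences by eye is difficult. A common way to check an analogy is to ask which known word lies closest to the combined vector. The sketch below does this with invented 2-dimensional vectors (one axis loosely standing for gender, the other for "sibling"). `nearest_word` is a hypothetical helper, and none of the numbers come from the fastText vectors.

```python
import math

# Invented toy embeddings (assumption: chosen so the analogy is exact).
toy = {
    "boy": [1.0, 0.0],
    "girl": [-1.0, 0.0],
    "brother": [1.0, 1.0],
    "sister": [-1.0, 1.0],
    "city": [0.0, -1.0],
}


def nearest_word(target, embeddings, exclude=()):
    """Return the word whose vector has the smallest Euclidean distance to target."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return min(candidates, key=lambda w: math.dist(target, candidates[w]))


# girl - boy + brother, computed component by component.
analogy = [
    g - b + br for g, b, br in zip(toy["girl"], toy["boy"], toy["brother"])
]
best = nearest_word(analogy, toy, exclude={"girl", "boy", "brother"})
print(analogy, best)
```

The three input words are excluded from the candidates because the combined vector often stays close to one of them.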

In [57]:
words_embeds["big"] - words_embeds["bigger"]

array([-0.1327,  0.0784, -0.055 ,  0.0611,  0.0797,  0.1164,  0.046 ,
       -0.037 , -0.0355,  0.0513, -0.0199, -0.0035, -0.0899, -0.1537,
        0.1146, -0.0609, -0.0377,  0.0771,  0.1355, -0.0447,  0.0406,
        0.0276, -0.0242, -0.1399, -0.015 , -0.017 ,  0.0434,  0.1223,
       -0.1286, -0.0405, -0.0133,  0.0051,  0.148 , -0.0099, -0.0094,
       -0.0259, -0.0225, -0.0012,  0.0467,  0.0649, -0.0056, -0.1069,
       -0.0731, -0.0892,  0.0733, -0.147 , -0.0383, -0.021 , -0.0656,
       -0.0414,  0.0473, -0.1991,  0.0511, -0.1198, -0.0425,  0.0225,
        0.0601, -0.0004,  0.1398, -0.1176,  0.0966, -0.0023, -0.0718,
       -0.1041, -0.0655,  0.0543, -0.0539, -0.0064,  0.014 , -0.0332,
        0.0053, -0.005 ,  0.0586, -0.0089, -0.0059,  0.0169,  0.0279,
        0.0184, -0.0467, -0.1185,  0.0281, -0.1469,  0.0144, -0.0349,
       -0.029 , -0.0187,  0.0252,  0.0442,  0.0651, -0.029 , -0.0815,
        0.0467, -0.067 ,  0.0571,  0.0345, -0.0431,  0.036 , -0.0751,
        0.0092, -0.0

In [58]:
words_embeds["bad"] - words_embeds["worse"]

array([-0.0787,  0.0242, -0.1185, -0.0537,  0.043 , -0.0393,  0.0571,
        0.0376, -0.0712,  0.0804, -0.0086, -0.0253, -0.0076, -0.0387,
        0.0228, -0.0386, -0.1761,  0.1261, -0.1006, -0.0851, -0.0035,
        0.0702,  0.0701, -0.032 , -0.1056, -0.0551, -0.0382,  0.0622,
       -0.1478,  0.0503, -0.0453,  0.0566,  0.1311,  0.1304, -0.0086,
       -0.0209, -0.1283,  0.0258, -0.0828,  0.1165,  0.031 ,  0.0346,
       -0.166 , -0.0239, -0.0041,  0.0102,  0.1294, -0.0491,  0.0247,
       -0.0963,  0.0205, -0.125 , -0.0094, -0.0271, -0.0095, -0.0034,
        0.0866, -0.0041,  0.1393, -0.1289,  0.0528,  0.0063, -0.1536,
       -0.0411,  0.0974,  0.2016, -0.0616,  0.0381,  0.0947, -0.0441,
        0.0408,  0.0899, -0.0282,  0.0379,  0.0183,  0.0029,  0.0329,
        0.0653,  0.1541, -0.0011,  0.0159, -0.0963,  0.1247, -0.0394,
        0.0224,  0.0022, -0.0763, -0.0646,  0.2064,  0.1143, -0.0372,
        0.0529,  0.0101,  0.0577,  0.0126,  0.0242,  0.0227,  0.0749,
        0.0312, -0.0

In [59]:
(words_embeds["bad"] - words_embeds["worse"]) - (words_embeds["big"] - words_embeds["bigger"])

array([ 5.400e-02, -5.420e-02, -6.350e-02, -1.148e-01, -3.670e-02,
       -1.557e-01,  1.110e-02,  7.460e-02, -3.570e-02,  2.910e-02,
        1.130e-02, -2.180e-02,  8.230e-02,  1.150e-01, -9.180e-02,
        2.230e-02, -1.384e-01,  4.900e-02, -2.361e-01, -4.040e-02,
       -4.410e-02,  4.260e-02,  9.430e-02,  1.079e-01, -9.060e-02,
       -3.810e-02, -8.160e-02, -6.010e-02, -1.920e-02,  9.080e-02,
       -3.200e-02,  5.150e-02, -1.690e-02,  1.403e-01,  8.000e-04,
        5.000e-03, -1.058e-01,  2.700e-02, -1.295e-01,  5.160e-02,
        3.660e-02,  1.415e-01, -9.290e-02,  6.530e-02, -7.740e-02,
        1.572e-01,  1.677e-01, -2.810e-02,  9.030e-02, -5.490e-02,
       -2.680e-02,  7.410e-02, -6.050e-02,  9.270e-02,  3.300e-02,
       -2.590e-02,  2.650e-02, -3.700e-03, -5.000e-04, -1.130e-02,
       -4.380e-02,  8.600e-03, -8.180e-02,  6.300e-02,  1.629e-01,
        1.473e-01, -7.700e-03,  4.450e-02,  8.070e-02, -1.090e-02,
        3.550e-02,  9.490e-02, -8.680e-02,  4.680e-02,  2.420e

In [55]:
# https://github.com/facebookresearch/faiss
# !pip install faiss-cpu

In [61]:
import faiss

In [88]:
words = np.array(list(words_embeds.keys()))
# Faiss works with float32 arrays, so cast explicitly.
embeds = np.array(list(words_embeds.values()), dtype="float32")

In [75]:
index = faiss.IndexFlatL2(300)
index.add(embeds)   

In [80]:
np.array([words_embeds["bad"]])

array([[ 3.180e-02, -1.105e-01, -1.446e-01, -3.510e-02, -4.610e-02,
         6.990e-02,  4.390e-02, -4.310e-02,  7.330e-02, -6.790e-02,
         7.900e-03, -1.142e-01, -3.660e-02, -3.880e-02, -8.730e-02,
         1.160e-02, -1.213e-01,  7.770e-02,  1.200e-03, -9.450e-02,
         8.100e-03,  7.300e-03,  4.880e-02,  1.211e-01, -2.710e-02,
        -1.052e-01, -4.690e-02, -6.370e-02, -4.960e-02, -1.770e-02,
         4.860e-02, -3.890e-02,  2.301e-01,  1.047e-01,  8.890e-02,
         1.141e-01, -2.470e-02, -5.620e-02, -6.120e-02,  9.820e-02,
         1.081e-01,  6.760e-02, -1.718e-01, -1.690e-02, -1.780e-02,
        -6.100e-03,  3.450e-02,  8.070e-02,  1.600e-02,  2.320e-02,
         4.800e-02, -9.170e-02, -7.630e-01, -4.980e-02,  3.170e-02,
        -3.420e-02,  1.029e-01,  2.180e-02,  6.800e-03, -5.440e-02,
        -1.476e-01, -2.400e-03,  3.340e-02, -5.610e-02,  1.173e-01,
        -1.176e-01,  2.580e-02,  9.420e-02, -5.800e-02,  1.440e-02,
        -2.730e-02,  5.440e-02, -6.590e-02, -8.3

In [81]:
index.search(np.array([words_embeds["bad"]], dtype="float32"), k=3)

(array([[0.       , 2.598906 , 3.0293205]], dtype=float32), array([[1, 6, 0]]))
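
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances, which is why the query word itself comes back first with distance 0. With the key order printed earlier, the returned ids `[1, 6, 0]` correspond to 'bad', 'worse', and 'big'. The sketch below shows the same brute-force idea in plain Python on invented 2-dimensional data. `flat_l2_search` is a hypothetical helper, not a Faiss function.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared distances, ids) of the k vectors closest to query."""
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()  # smallest squared distance first
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]


# Invented toy data (assumption: not the real fastText vectors).
toy_words = ["big", "bad", "worse"]
toy_matrix = [[3.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

distances, ids = flat_l2_search(toy_matrix, toy_matrix[1], k=3)
print(distances, [toy_words[i] for i in ids])
```

Mapping the ids back through the word list, as in the last line, is the step needed to turn a Faiss result into readable neighbors.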

In [87]:
words

array(dict_keys(['big', 'bad', 'girl', 'Greece', 'boy', 'brother', 'worse', 'sister', 'bigger', 'Thailand', 'Bulgaria', 'Athens', 'Latvia', 'Bangkok', 'lats', 'lev']),
      dtype=object)

In [None]:
words[0]

In [None]:
index.is_trained

### Demo from Faiss: The Missing Manual
- Demo posted on Pinecone's website

  - [Faiss: The Missing Manual](https://www.pinecone.io/learn/series/faiss/faiss-tutorial/)