<a href="https://colab.research.google.com/github/lucasgneccoh/FNC_nlp_project/blob/main/notebooks/gensim_pretrained_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports and initial definitions

## Clone repository

In [2]:
import os
%cd /content
!git clone https://github.com/lucasgneccoh/FNC_nlp_project.git

os.chdir("/content/FNC_nlp_project")

In [3]:
!pip install gensim==4.0.0b
import gensim
import gensim.downloader as api

Collecting gensim==4.0.0b
  Downloading https://files.pythonhosted.org/packages/13/47/16e2e4f34ec7534db21facf505c5a17e3ba10cbce72f675721277628d454/gensim-4.0.0b0-cp37-cp37m-manylinux1_x86_64.whl (24.0MB)
Installing collected packages: gensim
  Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.0b0


# Word2vec trained on Google News
We would like to work with this pretrained **word2vec** model.

Here we load the model and test it on a few simple examples.

For more information on the model, see the [original Google site](https://code.google.com/archive/p/word2vec/) or the [gensim repository](https://github.com/RaRe-Technologies/gensim-data#models).

In our tests, we liked this model because it seems to capture the semantics of the words well.

Sometimes the download through the `gensim` API fails with a `ConnectionResetError`. In that case, you can download the original Google News model from the [original site](https://code.google.com/archive/p/word2vec/) and load it from disk.
The following cell does that on the Google Colab machine.

In [8]:
""" *** This cell only works in Google Colab ***
    It saves the file on the temporary Colab machine using wget, which is much
    faster than downloading through the gensim API.
    If you are not using Google Colab, simply download the file yourself
    and store its location in the variable path_google_w2v.

"""

cwd = os.getcwd()
%cd /content
!mkdir -p downloaded_embeddings
%cd /content/downloaded_embeddings
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
path_google_w2v = "/content/downloaded_embeddings/GoogleNews-vectors-negative300.bin.gz"

%cd $cwd



--2021-04-30 01:01:49--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.78.166
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.78.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2021-04-30 01:02:14 (64.0 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [9]:
%%time
w2v_google = gensim.models.KeyedVectors.load_word2vec_format(path_google_w2v, binary=True)

CPU times: user 1min 2s, sys: 3.07 s, total: 1min 5s
Wall time: 1min 5s


Here is the code if you want to try the `gensim` API instead.

NOTE: This will take time, and the download weighs about 1.5 GB.

In [None]:
%%time
""" Download the w2v model (1662.8 MB) 
"""
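# return_path=False (the default) returns the loaded KeyedVectors;
# return_path=True would only return the path of the downloaded file.
w2v_google = api.load('word2vec-google-news-300', return_path=False)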
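
The two loading routes above (the `gensim` API, and a manual download loaded from disk) can be combined so that the second is used only when the first fails with a `ConnectionResetError`. The following is a minimal sketch of that idea. The helper `load_with_fallback` and the two stand-in loaders are hypothetical names introduced here for illustration; they are not part of `gensim`, and the stand-ins only simulate the real downloads so the example runs offline.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Call `primary`; if the connection is reset, call `fallback` instead."""
    try:
        return primary()
    except ConnectionResetError as err:
        print(f"Primary loader failed ({err!r}); using the fallback loader.")
        return fallback()


def _flaky_download():
    # Hypothetical stand-in for the gensim API download, simulating the failure.
    raise ConnectionResetError("connection reset by peer")


def _load_local_file():
    # Hypothetical stand-in for loading the manually downloaded file from disk.
    return "vectors loaded from local file"


result = load_with_fallback(_flaky_download, _load_local_file)
print(result)
```

In the notebook, the two arguments would presumably be `lambda: api.load('word2vec-google-news-300')` and `lambda: gensim.models.KeyedVectors.load_word2vec_format(path_google_w2v, binary=True)`.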

## Examples

In [10]:
print(f"Total number of unique words in Google corpus: {len(w2v_google.key_to_index)}")

Total number of unique words in Google corpus: 3000000


In [11]:
""" Here we want to do 'most similar' queries for some words
"""
to_check = ['woman', 'man', 'strong', 'america', 'china', 'weak', 'bank', 'hard', 'easy', 'hoax']
for w in to_check: 
    print("Similar to {}: {}".format(w, w2v_google.most_similar(positive=[w], topn=3)))

Similar to woman: [('man', 0.7664012908935547), ('girl', 0.7494640946388245), ('teenage_girl', 0.7336829304695129)]
Similar to man: [('woman', 0.7664012908935547), ('boy', 0.6824871301651001), ('teenager', 0.6586930155754089)]
Similar to strong: [('solid', 0.7009872198104858), ('stong', 0.6510646939277649), ('robust', 0.6499253511428833)]
Similar to america: [('american', 0.7169357538223267), ('americans', 0.7042055130004883), ('europe', 0.6617692112922668)]
Similar to china: [('dinnerware', 0.6587947607040405), ('crockery', 0.6426128149032593), ('porcelain', 0.6392655372619629)]
Similar to weak: [('weaker', 0.7303191423416138), ('Weak', 0.6872072815895081), ('sluggish', 0.6702948808670044)]
Similar to bank: [('banks', 0.7440759539604187), ('banking', 0.690161406993866), ('Bank', 0.6698698401451111)]
Similar to hard: [('harder', 0.6780325174331665), ('Hard', 0.6441888809204102), ('tough', 0.6342882513999939)]
Similar to easy: [('easier', 0.6639506220817566), ('easiest', 0.6109094023704
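
The scores in these outputs are cosine similarities between word vectors. As a rough illustration of what a "most similar" query does, here is a small pure-Python sketch on made-up 3-dimensional vectors. The toy vectors and the function names are assumptions for this example only; the real model uses 300-dimensional vectors and `gensim`'s own optimized implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors, chosen only so that "strong" and "solid" point the same way.
toy_vectors = {
    "strong": [0.9, 0.1, 0.0],
    "solid": [0.8, 0.2, 0.1],
    "weak": [-0.7, 0.3, 0.1],
    "bank": [0.0, 0.1, 0.9],
}

top_two = toy_most_similar("strong", toy_vectors, topn=2)
print(top_two)
```

Because cosine similarity depends only on the angle between vectors, a word pointing in the opposite direction (here the toy `weak`) gets a negative score and ranks last.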

In [None]:
""" Here we want to do 'most similar' queries using
    positive and negative words
"""
pos = [['woman'], ['man'], ['china'], ['america']]
neg = [['man'], ['woman'], ['america'], ['china']]

for p, n in zip(pos, neg):
    print(f"+: {p}, -: {n}")
    print(*w2v_google.most_similar(positive = p, negative = n, topn=5))
    print('-'*50)

+: ['woman'], -: ['man']
('she', 0.45412716269493103) ('her', 0.39712801575660706) ('Certified_Nurse_Midwife', 0.3824717402458191) ('Ms.', 0.37514764070510864) ('silicone_gel_implant', 0.3704040050506592)
--------------------------------------------------
+: ['man'], -: ['woman']
('Shaun_Maloney_Aiden_McGeady', 0.35027220845222473) ('tactically_adept', 0.3487197160720825) ('Matt_Bramald', 0.3400961458683014) ('strongside_LB', 0.337636798620224) ('newboy', 0.33329278230667114)
--------------------------------------------------
+: ['china'], -: ['america']
('dinnerware', 0.6009308695793152) ('tableware', 0.5477373600006104) ('flatware', 0.533893346786499) ('crockery', 0.5331457853317261) ('vases', 0.5137503743171692)
--------------------------------------------------
+: ['america'], -: ['china']
('americans', 0.3803958296775818) ('nebraska', 0.3545896112918854) ('texas', 0.3540230393409729) ('american', 0.34462225437164307) ('atlanta', 0.3441712558269501)
--------------------------------
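
A query with positive and negative words can be read as vector arithmetic: the normalized vectors of the positive words are added, those of the negative words are subtracted, and the result is compared to every other word by cosine similarity. The sketch below illustrates this on made-up 2-dimensional vectors. It is an approximation of the idea under our assumptions, not `gensim`'s actual code, and all names and values are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_query(positive, negative, vectors, topn=3):
    """Add unit vectors of `positive`, subtract those of `negative`, rank by cosine."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)

    excluded = set(positive) | set(negative)  # input words are never returned
    scored = [
        (word, sum(a * b for a, b in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors: the second coordinate plays the role of a "gender" direction.
toy_vectors = {
    "woman": [1.0, 1.0],
    "man": [1.0, -1.0],
    "she": [0.2, 1.0],
    "he": [0.2, -1.0],
    "bank": [1.0, 0.0],
}

ranked = toy_query(["woman"], ["man"], toy_vectors)
print(ranked)
```

In this toy setup, subtracting `man` from `woman` cancels the shared first coordinate and leaves only the second one, which is why `she` ranks first.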

In [None]:
""" We are done, free memory 
"""
del w2v_google

# FastText
Another interesting option is the pretrained **fastText** model trained on wiki news.

See the [gensim repository](https://github.com/RaRe-Technologies/gensim-data#models) for more information on the available models and the [fastText site](https://fasttext.cc/docs/en/english-vectors.html) for details on this **fastText** model.

In the examples and tests we ran, this model seems to relate words based more on their syntax (surface form, with neighbours such as `woman--` or `strong-`) than on their semantics. For our purposes, we found the Google News **word2vec** model superior.
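
One plausible explanation for this behaviour is that these vectors were trained with subword information: fastText represents a word through its character n-grams, so words that share many n-grams tend to end up close together. The sketch below extracts character n-grams with boundary markers and measures their overlap. The n-gram range of 3 to 6 is fastText's documented default and is an assumption here, since we do not know the exact settings used for this pretrained model; `char_ngrams` and `jaccard` are hypothetical helper names.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Set of character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    token = f"<{word}>"
    return {
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    }


def jaccard(a, b):
    """Share of n-grams that two sets have in common."""
    return len(a & b) / len(a | b)


# A spelling variant shares most n-grams; a synonym with a different spelling shares none.
overlap_variant = jaccard(char_ngrams("strong"), char_ngrams("strong-"))
overlap_synonym = jaccard(char_ngrams("strong"), char_ngrams("robust"))
print(f"strong vs strong-: {overlap_variant:.2f}")
print(f"strong vs robust:  {overlap_synonym:.2f}")
```

Note that the `.vec` file loaded in this notebook contains one vector per whole word; the n-grams influenced those vectors at training time and are not used at query time here.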

In [None]:
from gensim.models import FastText

The following cell downloads the **fastText** vectors to the Google Colab machine.

In [None]:
cwd = os.getcwd()
%cd /content/downloaded_embeddings
!wget -c "https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.vec.zip"
!unzip "wiki-news-300d-1M-subword.vec.zip"
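# The archive is unzipped in the current directory, /content/downloaded_embeddings
path_ft_facebook = "/content/downloaded_embeddings/wiki-news-300d-1M-subword.vec"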

%cd $cwd


In [None]:
%%time
ft_facebook = gensim.models.KeyedVectors.load_word2vec_format(path_ft_facebook, binary=False)

CPU times: user 5min 17s, sys: 10.4 s, total: 5min 27s
Wall time: 5min 18s
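
With `binary=False`, the file is read in the plain-text word2vec format: a header line with the vocabulary size and the vector dimension, then one line per word with its values separated by spaces. Parsing about a million such text lines is probably a large part of why this load takes minutes. The following sketch reads a tiny file in that format; `read_vec_file` is a hypothetical helper for illustration and not how `gensim` implements it.

```python
import os
import tempfile


def read_vec_file(path):
    """Parse a text-format word2vec file into a {word: [floats]} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        # Header: "<vocabulary size> <vector dimension>"
        count, dim = map(int, handle.readline().split())
        for line in handle:
            parts = line.rstrip().split(" ")
            word, values = parts[0], [float(x) for x in parts[1:]]
            if len(values) != dim:
                raise ValueError(f"Expected {dim} values for {word!r}, got {len(values)}")
            vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"Header announced {count} words, found {len(vectors)}")
    return vectors


# Write a two-word toy file and read it back.
sample = "2 3\nstrong 0.1 0.2 0.3\nweak -0.1 -0.2 -0.3\n"
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "toy.vec")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    toy = read_vec_file(sample_path)

print(toy)
```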


Again, here is the code to load the model through the `gensim` API instead.

In [None]:
%%time
ft_facebook = api.load('fasttext-wiki-news-subwords-300',return_path=False)

CPU times: user 10min 37s, sys: 44.2 s, total: 11min 21s
Wall time: 12min 52s


In [None]:
""" Again 'most similar' queries for some words to compare 
    against the Google news word2vec
"""
to_check = ['woman', 'man', 'strong', 'america', 'china', 'weak', 'bank', 'hard', 'easy', 'hoax']
for w in to_check: 
    ans = ft_facebook.most_similar(positive=[w], topn=3)
    print("Similar to {} : {}".format(w, ans))

Similar to woman : [('man', 0.8119728565216064), ('woman--', 0.7959333062171936), ('lady', 0.775004506111145)]
Similar to man : [('woman', 0.8119728565216064), ('man--', 0.73244309425354), ('man--and', 0.7232114672660828)]
Similar to strong : [('weak', 0.7990458607673645), ('strong-', 0.7753340005874634), ('strongish', 0.7710019946098328)]
Similar to america : [('americas', 0.7932907938957214), ('america.', 0.7870585322380066), ('usa', 0.7484654784202576)]
Similar to china : [('china.', 0.6974452137947083), ('chinas', 0.6943490505218506), ('porcelain', 0.6891270875930786)]
Similar to weak : [('strong', 0.7990459203720093), ('weaker', 0.7790804505348206), ('feeble', 0.7767082452774048)]
Similar to bank : [('banks', 0.8217378854751587), ('bank-', 0.7699344754219055), ('banking', 0.7486941814422607)]
Similar to hard : [('harder', 0.7852727174758911), ('tough', 0.7670266032218933), ('hards', 0.7339461445808411)]
Similar to easy : [('straightforward', 0.7886142730712891), ('quick', 0.772329

In [None]:
""" Here we want to do 'most similar' queries using
    positive and negative words
"""
pos = [['woman'], ['man'], ['china'], ['america']]
neg = [['man'], ['woman'], ['america'], ['china']]

for p, n in zip(pos, neg):
    print(f"+: {p}, -: {n}")
    print(*ft_facebook.most_similar(positive = p, negative = n, topn=5))
    print('-'*50)

+: ['woman'], -: ['man']
('WCJF', 0.3708224296569824) ('woman-to-woman', 0.35294315218925476) ('Němcová', 0.3509597182273865) ('OBGYN', 0.3495628237724304) ('OB-GYNs', 0.3440229892730713)
--------------------------------------------------
+: ['man'], -: ['woman']
('guvnor', 0.3640994131565094) ('roght', 0.3340262472629547) ('brillig', 0.3283083438873291) ('genuis', 0.3235689103603363) ('Guvnor', 0.3225095272064209)
--------------------------------------------------
+: ['china'], -: ['america']
('porcelain', 0.47698038816452026) ('porcelains', 0.4386729300022125) ('Jingdezhen', 0.43594178557395935) ('vases', 0.4217434525489807) ('crockery', 0.4191197156906128)
--------------------------------------------------
+: ['america'], -: ['china']
('america.', 0.40032851696014404) ('americas', 0.32730168104171753) ('americans', 0.3270949125289917) ('americans.', 0.32590192556381226) ('american.', 0.31294891238212585)
--------------------------------------------------
