## Pretrained embeddings

Word embeddings are generally computed from word-occurrence statistics (observations about which words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the Word2vec algorithm (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender.
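As a rough illustration of what "co-occurrence statistics" means, the sketch below counts how often pairs of words appear within a small window of each other. The `cooccurrence_counts` helper, the window size, and the toy corpus are assumptions made for illustration only; Word2vec and GloVe use such statistics in more elaborate ways.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each unordered word pair appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only at words to the right so each pair is counted once.
            for neighbor in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((word, neighbor)))] += 1
    return counts


# Toy corpus, made up for illustration.
corpus = [
    "the king rules the land",
    "the queen rules the land",
]
counts = cooccurrence_counts(corpus, window=2)
print(counts[("king", "rules")])  # 1
print(counts[("land", "rules")])  # 2
```

Words that share many co-occurring neighbors (here "king" and "queen") are the ones an embedding method tends to place close together.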

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.
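One common way to use precomputed vectors in an embedding layer is to copy them into a weight matrix whose row `i` holds the vector for the word with integer index `i`. The sketch below shows that step with NumPy only. The `build_embedding_matrix` helper, the toy 3-dimensional vectors, and the convention that index 0 is reserved for padding are all assumptions for illustration, not part of GloVe or Keras.

```python
import numpy as np


def build_embedding_matrix(word_index, embeddings, dim):
    """Return a (len(word_index) + 1, dim) array; row i is the vector for index i."""
    # Row 0 stays all-zero, assuming index 0 is reserved for padding.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vector = embeddings.get(word)
        if vector is not None:
            matrix[i] = vector  # words without a pretrained vector stay all-zero
    return matrix


# Hypothetical toy data standing in for real pretrained vectors.
toy_embeddings = {
    "film": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "plot": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"film": 1, "plot": 2, "unseenword": 3}

embedding_matrix = build_embedding_matrix(toy_word_index, toy_embeddings, dim=3)
print(embedding_matrix.shape)  # (4, 3)
```

An array of this shape, (vocabulary size + 1, embedding dimension), could then serve as the initial weights of an embedding layer.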

GloVe is one of the most widely used sets of pretrained word embeddings. It can be downloaded from https://nlp.stanford.edu/projects/glove/

The download used here is an 822 MB zip file named glove.6B.zip. It holds precomputed embeddings in 50, 100, 200, and 300 dimensions for 400,000 words (or non-word tokens), trained on the 2014 English Wikipedia dump together with the Gigaword 5 corpus.

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

--2023-10-14 06:32:02--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-10-14 06:32:02--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2023-10-14 06:34:47 (4.98 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [None]:
!mkdir glove
!unzip glove.6B.zip -d glove/

Archive:  glove.6B.zip
  inflating: glove/glove.6B.50d.txt  
  inflating: glove/glove.6B.100d.txt  
  inflating: glove/glove.6B.200d.txt  
  inflating: glove/glove.6B.300d.txt  


In [None]:
!head -20 /content/glove/glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353
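Each line of the file holds a token followed by its vector components, separated by single spaces. A minimal sketch of parsing that layout into a dictionary is shown below. The `load_glove` helper is hypothetical, and the sample lines are the first two lines above shortened to four values each so the example stays readable.

```python
import io

import numpy as np


def load_glove(lines):
    """Parse GloVe text lines ("token v1 v2 ... vn") into a {token: vector} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, values = parts[0], parts[1:]
        embeddings[token] = np.array([float(v) for v in values], dtype="float32")
    return embeddings


# Shortened sample lines (four values instead of fifty), for illustration only.
sample = io.StringIO(
    "the 0.418 0.24968 -0.41242 0.1217\n"
    ", 0.013441 0.23682 -0.16899 0.40951\n"
)
vectors = load_glove(sample)
print(sorted(vectors))         # [',', 'the']
print(vectors["the"].shape)    # (4,)
```

For the real file, the same function could be called on an open file handle, for example `with open("/content/glove/glove.6B.50d.txt", encoding="utf-8") as f: vectors = load_glove(f)`.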

### Exploring the embeddings

In [None]:
from gensim.models import KeyedVectors

# Path to the 50-dimensional GloVe vectors extracted above.
# The variable name says "word2vec", but the file is in GloVe's plain-text format.
word2vec_output_file = "/content/glove/glove.6B.50d.txt"

In [None]:
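# GloVe files lack the "<vocab_size> <dim>" header line of the word2vec text format,
# so no_header=True (available in gensim 4.0+) tells the loader not to expect one.
pretrained_w2v_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False, no_header=True)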

In [None]:
pretrained_w2v_model.most_similar('bangalore')

[('chennai', 0.9154870510101318),
 ('hyderabad', 0.8739371299743652),
 ('kolkata', 0.8573629260063171),
 ('ahmedabad', 0.8511857390403748),
 ('pune', 0.8379703164100647),
 ('kanpur', 0.8285415768623352),
 ('patna', 0.7850483655929565),
 ('lahore', 0.7841722369194031),
 ('calcutta', 0.7834596633911133),
 ('delhi', 0.778948187828064)]

In [None]:
pretrained_w2v_model.most_similar('dhoni')

[('dravid', 0.868076503276825),
 ('yuvraj', 0.8542092442512512),
 ('ganguly', 0.8492220640182495),
 ('kumble', 0.8360502123832703),
 ('rahul', 0.8356800675392151),
 ('sehwag', 0.835641622543335),
 ('laxman', 0.8343917727470398),
 ('raina', 0.8334249258041382),
 ('mahendra', 0.8286129236221313),
 ('karthik', 0.8234096765518188)]

In [None]:
pretrained_w2v_model.most_similar('india')

[('indian', 0.8648794293403625),
 ('pakistan', 0.8529723286628723),
 ('malaysia', 0.816650927066803),
 ('bangladesh', 0.8154239058494568),
 ('delhi', 0.8142766356468201),
 ('indonesia', 0.7939143776893616),
 ('thailand', 0.7864410281181335),
 ('sri', 0.7809486985206604),
 ('lanka', 0.7792481780052185),
 ('africa', 0.772837221622467)]

In [None]:
pretrained_w2v_model.most_similar('lakshmi')

[('laxmi', 0.7461349964141846),
 ('parvathi', 0.7299206852912903),
 ('parvati', 0.7262587547302246),
 ('shing', 0.7092645764350891),
 ('devi', 0.705733060836792),
 ('sethu', 0.6944319605827332),
 ('ratan', 0.6929410696029663),
 ('shiva', 0.6819362044334412),
 ('narasimha', 0.6802400946617126),
 ('phoolan', 0.664139449596405)]

In [None]:
pretrained_w2v_model.most_similar('wikipedia')

[('german-language', 0.7409663796424866),
 ('english-language', 0.7226234078407288),
 ('dictionaries', 0.719719648361206),
 ('dictionary', 0.7159746885299683),
 ('publish', 0.7159086465835571),
 ('website', 0.7043480277061462),
 ('encyclopedia', 0.700772762298584),
 ('blog', 0.6988518834114075),
 ('periodical', 0.6961102485656738),
 ('editions', 0.6913991570472717)]
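The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reproduces the idea on hand-made vectors. The `cosine_similarity` and `nearest` helpers and the 3-dimensional toy vectors are assumptions for illustration and are not taken from GloVe or gensim.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors; real GloVe vectors have 50 to 300 dimensions.
toy_vectors = {
    "bangalore": np.array([0.9, 0.1, 0.0]),
    "chennai": np.array([0.8, 0.2, 0.1]),
    "dictionary": np.array([0.0, 0.1, 0.9]),
}
print(nearest("bangalore", toy_vectors, topn=2)[0][0])  # chennai
```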

In [None]:
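def analogy(a, b, c):
    """Return the word that relates to `c` as `b` relates to `a` (roughly b - a + c)."""
    # most_similar adds the `positive` vectors and subtracts the `negative` ones.
    result = pretrained_w2v_model.most_similar(positive=[c, b], negative=[a])
    return result[0][0]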

In [None]:
analogy('india', 'indian', 'japan')

'japanese'

In [None]:
analogy('india', 'delhi', 'antarctica')

'glacier'

In [None]:
analogy('india', 'dhoni', 'england')

'collingwood'
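The analogy queries above rest on vector arithmetic: the answer is the word whose vector lies closest to `b - a + c`. The sketch below spells that out on made-up vectors. The `toy_analogy` helper and the 3-dimensional vectors are assumptions for illustration; gensim's own implementation differs in its details.

```python
import numpy as np


def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)


def toy_analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to b - a + c, excluding the inputs."""
    target = unit(unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c]))
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # skip the query words themselves
        score = float(np.dot(unit(vec), target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Made-up vectors: axis 0 ~ "India", axis 1 ~ "adjective form", axis 2 ~ "Japan".
toy = {
    "india": np.array([1.0, 0.0, 0.0]),
    "indian": np.array([1.0, 1.0, 0.0]),
    "japan": np.array([0.0, 0.0, 1.0]),
    "japanese": np.array([0.0, 1.0, 1.0]),
    "glacier": np.array([-1.0, 0.0, 0.2]),
}
print(toy_analogy("india", "indian", "japan", toy))  # japanese
```

Because the input words are excluded from the candidates, the function always returns some other word, which is why a query with no sensible answer (such as the capital of Antarctica) still produces a result.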

## Excellent References

For further exploration and better understanding, you can use the following references.

- Glossary of Deep Learning: Word Embedding

    https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca


- wevi: word embedding visual inspector

    https://ronxin.github.io/wevi/  
    
    
- Learning Word Embedding    

    https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html


- On the contribution of neural networks and word embeddings in Natural Language Processing

    https://medium.com/@josecamachocollados/on-the-contribution-of-neural-networks-and-word-embeddings-in-natural-language-processing-c8bb1b85c61c

## Learning our own word embeddings


Consider customer reviews about a movie, including a mix of positive and negative sentiments:

- "Outstanding performances and a gripping storyline!"
- "This movie exceeded my expectations, a must-watch!"
- "Heartwarming and beautifully directed."
- "I can't stop thinking about this incredible film."
- "A cinematic masterpiece, truly remarkable."
- "Disappointing plot and lackluster acting."
- "Wasted my time, the worst movie I've seen."
- "Boring and unoriginal, a total letdown."
- "Terrible script, couldn't connect with the characters."
- "Avoid this film, it's a complete disaster."


In [None]:
from gensim.models import Word2Vec

# Initialize a list with movie review sentences
movie_reviews = [
    "Loved it fantastic film",
    "Great plot amazing acting",
    "Disappointed boring storyline",
    "Awesome movie worth watching",
    "Weak performances not impressed",
    "Incredible visuals, weak script",
    "Enjoyed it a good time pass",
    "Terrible film wasted my money",
    "Superb acting clichéd plot",
    "Entertaining but forgettable"
]

# Lowercase each review and split it on whitespace.
# Punctuation stays attached to the word, e.g. "visuals," keeps its comma.
tokenized_reviews = [review.lower().split() for review in movie_reviews]

# Train the Word2Vec model
model = Word2Vec(tokenized_reviews, vector_size=20, window=5, min_count=1)

# Save the model for later use
model.save("word2vec_myconnect.model")

# To load a saved model later, you can use:
# model = Word2Vec.load("word2vec_myconnect.model")

# Find the vector representation of a word
vector = model.wv['great']
print("Vector for 'great':", vector)

Vector for 'great': [-0.04121339  0.04649677 -0.0009883  -0.00983638  0.02301815 -0.02047658
  0.01371557  0.03469983  0.03032713 -0.03755397  0.04691175  0.02335904
  0.0198306  -0.03121753  0.0422999  -0.01075082  0.04412594 -0.02681001
 -0.0406471   0.03412279]
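Splitting on whitespace leaves punctuation attached to words, so "visuals," and "visuals" would be treated as different tokens. A possible alternative, sketched below with a hypothetical `tokenize` helper built on the standard `re` module, keeps only runs of letters and digits (plus apostrophes inside a word).

```python
import re


def tokenize(text):
    """Lowercase the text and return word tokens without surrounding punctuation."""
    # [^\W_]+ matches letters and digits (including accented ones);
    # the optional group keeps contractions such as "couldn't" in one piece.
    return re.findall(r"[^\W_]+(?:'[^\W_]+)*", text.lower())


review = "Incredible visuals, weak script"
print(review.lower().split())  # ['incredible', 'visuals,', 'weak', 'script']
print(tokenize(review))        # ['incredible', 'visuals', 'weak', 'script']
```

The resulting token lists could be passed to `Word2Vec` in place of the whitespace-split ones.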


In [None]:
# Find similar words to a given word
similar_words = model.wv.most_similar('storyline')
print("Similar words to 'storyline':", similar_words)

Similar words to 'storyline': [('performances', 0.38172370195388794), ('time', 0.3259935677051544), ('watching', 0.31713858246803284), ('disappointed', 0.2947767674922943), ('entertaining', 0.24814458191394806), ('but', 0.2424035668373108), ('wasted', 0.2035253942012787), ('incredible', 0.19321590662002563), ('plot', 0.18296225368976593), ('money', 0.16422370076179504)]


In [None]:
similar_words = model.wv.most_similar('terrible')
print("Similar words to 'terrible':", similar_words)

Similar words to 'terrible': [('disappointed', 0.4167259931564331), ('superb', 0.38196608424186707), ('worth', 0.3576344847679138), ('boring', 0.2781214118003845), ('performances', 0.2728114128112793), ('film', 0.257135808467865), ('awesome', 0.24820274114608765), ('my', 0.21764503419399261), ('plot', 0.16161365807056427), ('good', 0.08501722663640976)]

With only ten short reviews, the model sees far too little text to learn meaningful vectors, so these similarity scores are close to noise (note that 'superb' ranks near 'terrible'). Useful embeddings require a much larger corpus.
