# Pretrained embeddings

Word embeddings are generally computed using word-occurrence statistics (observations about which words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started to take off in research and industry applications after the release of one of the most famous and successful word-embedding schemes: the Word2vec algorithm (https://code.google.com/archive/p/word2vec), developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender.

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), which was developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available precomputed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.
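To make "factorizing a matrix of word co-occurrence statistics" more concrete, here is a minimal sketch on a toy corpus. It is not GloVe's actual training procedure: GloVe fits a weighted least-squares objective, whereas this sketch swaps in a plain truncated SVD of the log counts. The function names, the window size, and the toy sentences are all assumptions made for illustration.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Count how often each pair of words appears within `window` tokens of each other."""
    vocab = sorted({w for s in sentences for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
    for s in sentences:
        tokens = s.split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[tokens[j]]] += 1.0
    return vocab, counts

def svd_embeddings(counts, dim=2):
    """Factorize log(1 + counts) with SVD and keep the top `dim` components."""
    u, s, _ = np.linalg.svd(np.log1p(counts))
    return u[:, :dim] * s[:dim]

# Toy corpus (hypothetical, for illustration only)
toy_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
vocab, counts = cooccurrence_matrix(toy_corpus)
vectors = svd_embeddings(counts, dim=2)  # one 2-dimensional vector per vocabulary word
```

Each row of `vectors` is a dense, low-dimensional representation of one vocabulary word, derived only from which words appear near it.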

GloVe is one of the most widely used sets of pretrained word embeddings and can be downloaded from https://nlp.stanford.edu/projects/glove/.

The release used here, glove.6B.zip, is an 822 MB zip file of embeddings precomputed from the 2014 English Wikipedia and the Gigaword 5 corpus. It contains embedding vectors for 400,000 words (or non-word tokens) in four sizes: 50, 100, 200, and 300 dimensions.

In [None]:
!wget https://nlp.stanford.edu/data/glove.6B.zip

--2023-12-11 05:10:38--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-12-11 05:10:38--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2023-12-11 05:13:17 (5.17 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]



In [None]:
!mkdir glove
!unzip glove.6B.zip -d glove/

Archive:  glove.6B.zip
  inflating: glove/glove.6B.50d.txt  
  inflating: glove/glove.6B.100d.txt  
  inflating: glove/glove.6B.200d.txt  
  inflating: glove/glove.6B.300d.txt  


In [None]:
!head -20 /content/glove/glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353
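As the output above shows, each line of a GloVe text file is a token followed by its space-separated vector components. The sketch below parses that format into a dictionary and builds a matrix that could be used to initialize a Keras `Embedding` layer. The helper names, the 3-dimensional sample lines, and the `word_index` mapping are hypothetical; row 0 is reserved for padding, which is a common convention rather than a requirement.

```python
import numpy as np

def parse_glove(lines):
    """Map each token to its vector, given lines in the GloVe text format."""
    index = {}
    for line in lines:
        token, *values = line.rstrip().split(" ")
        index[token] = np.asarray(values, dtype=np.float32)
    return index

def build_embedding_matrix(word_index, glove_index, dim):
    """Row i holds the vector of the word with id i; unknown words stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)  # row 0: padding
    for word, i in word_index.items():
        vector = glove_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

# Made-up 3-dimensional lines standing in for the real file contents
sample_lines = ["the 0.1 0.2 0.3", ", 0.4 0.5 0.6"]
glove_index = parse_glove(sample_lines)

# Hypothetical vocabulary: word -> integer id (as a tokenizer would produce)
word_index = {"the": 1, "unseen": 2}
embedding_matrix = build_embedding_matrix(word_index, glove_index, dim=3)
```

With the real data, `parse_glove` can be given an open file handle, for example `open("glove/glove.6B.50d.txt", encoding="utf-8")`, with `dim=50`.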

### Exploring the embeddings

In [None]:
from gensim.models import KeyedVectors

# GloVe text file extracted above (50-dimensional vectors).
# Despite the variable name, this file is in GloVe format, not word2vec format.
word2vec_output_file = "/content/glove/glove.6B.50d.txt"

In [None]:
# GloVe files lack the word2vec header line, so no_header=True is required (gensim 4.0+)
pretrained_w2v_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False, no_header=True)

In [None]:
len(pretrained_w2v_model)

400000

In [None]:
pretrained_w2v_model.most_similar('bangalore')

[('chennai', 0.9154870510101318),
 ('hyderabad', 0.8739371299743652),
 ('kolkata', 0.8573629260063171),
 ('ahmedabad', 0.8511857390403748),
 ('pune', 0.8379703164100647),
 ('kanpur', 0.8285415768623352),
 ('patna', 0.7850483655929565),
 ('lahore', 0.7841722369194031),
 ('calcutta', 0.7834596633911133),
 ('delhi', 0.778948187828064)]

In [None]:
pretrained_w2v_model.most_similar('dhoni')

[('dravid', 0.868076503276825),
 ('yuvraj', 0.8542092442512512),
 ('ganguly', 0.8492220640182495),
 ('kumble', 0.8360502123832703),
 ('rahul', 0.8356800675392151),
 ('sehwag', 0.835641622543335),
 ('laxman', 0.8343917727470398),
 ('raina', 0.8334249258041382),
 ('mahendra', 0.8286129236221313),
 ('karthik', 0.8234096765518188)]

In [None]:
pretrained_w2v_model.most_similar('google')

[('yahoo', 0.8942785263061523),
 ('aol', 0.8527129292488098),
 ('microsoft', 0.8450710773468018),
 ('internet', 0.8179759979248047),
 ('web', 0.8175380229949951),
 ('facebook', 0.8087005019187927),
 ('ebay', 0.7930073142051697),
 ('netscape', 0.7912958860397339),
 ('online', 0.7908353209495544),
 ('software', 0.7816095352172852)]

In [None]:
pretrained_w2v_model.most_similar('hp')

[('packard', 0.7827728390693665),
 ('compaq', 0.7555494904518127),
 ('ibm', 0.7533679008483887),
 ('hewlett', 0.7533037066459656),
 ('dell', 0.7504765391349792),
 ('cisco', 0.70441734790802),
 ('xerox', 0.6908343434333801),
 ('intel', 0.6819787621498108),
 ('kw', 0.6761818528175354),
 ('lucent', 0.6756484508514404)]

In [None]:
pretrained_w2v_model.most_similar('wikipedia')

[('german-language', 0.7409663796424866),
 ('english-language', 0.7226234078407288),
 ('dictionaries', 0.719719648361206),
 ('dictionary', 0.7159746885299683),
 ('publish', 0.7159086465835571),
 ('website', 0.7043480277061462),
 ('encyclopedia', 0.700772762298584),
 ('blog', 0.6988518834114075),
 ('periodical', 0.6961102485656738),
 ('editions', 0.6913991570472717)]
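The scores in the outputs above are cosine similarities between word vectors. As a minimal sketch of what such a nearest-neighbour lookup computes, the following reimplements it with NumPy on made-up 3-dimensional vectors. The function names and the toy values are assumptions, and gensim's own implementation is vectorized rather than a Python loop.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_words(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(w, cosine(vectors[query], v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors; the real GloVe vectors used above have 50 dimensions
toy_vectors = {
    "bangalore": np.array([0.9, 0.1, 0.0]),
    "chennai": np.array([0.8, 0.2, 0.1]),
    "google": np.array([0.0, 0.1, 0.9]),
}
neighbours = nearest_words("bangalore", toy_vectors)
```

Because cosine similarity depends only on direction, two words score highly when their vectors point the same way, regardless of vector length.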

In [None]:
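def analogy(a, b, c):
    """Solve "a is to b as c is to ?" using vector arithmetic (b - a + c)."""
    result = pretrained_w2v_model.most_similar(positive=[c, b], negative=[a])
    # most_similar returns (word, score) pairs sorted by score; keep the top word
    return result[0][0]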

In [None]:
analogy('india', 'indian', 'japan')

'japanese'

In [None]:
analogy('india', 'delhi', 'france')

'paris'

In [None]:
analogy('india', 'dhoni', 'england')

'collingwood'
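The analogy queries above rest on vector arithmetic: the answer to "a is to b as c is to ?" is the word whose vector lies closest to b - a + c. The sketch below shows this on hand-constructed 3-dimensional vectors, chosen so that the "capital" direction is shared across countries. The vectors and the `toy_analogy` helper are hypothetical. Normalizing each vector to unit length before combining roughly mirrors what gensim's `most_similar` does, but this is not its actual code.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def toy_analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to b - a + c, excluding the query words."""
    target = unit(vectors[b]) - unit(vectors[a]) + unit(vectors[c])
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: float(np.dot(unit(vectors[w]), unit(target))))

# Hand-made vectors: axis 0 = "India", axis 1 = "France", axis 2 = "is a capital city"
toy_vectors = {
    "india": np.array([1.0, 0.0, 0.0]),
    "delhi": np.array([1.0, 0.0, 1.0]),
    "france": np.array([0.0, 1.0, 0.0]),
    "paris": np.array([0.0, 1.0, 1.0]),
    "berlin": np.array([0.2, 0.2, 1.0]),
}
answer = toy_analogy("india", "delhi", "france", toy_vectors)
```

Real embeddings are far noisier than this toy setup, which is why a query such as `analogy('india', 'dhoni', 'england')` returns a plausible but debatable answer.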

# Document Embeddings

In [8]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

In [65]:
# Step 1: Load the all-MiniLM-L6-v2 sentence-embedding model from Hugging Face
model = SentenceTransformer("all-MiniLM-L6-v2")

In [108]:
sentences = ['Government policies should prioritize the welfare of citizens.',
 'A well-cooked meal can bring people together.',
 'Regular exercise helps improve both physical and mental health.',
 'Digital marketing strategies reach a wider global audience.',
 'Fresh ingredients make all the difference in cooking.',
 'Listening to music can evoke deep emotional responses.',
 'History is written by those who shape its events.',
 'Effective branding builds trust and attracts loyal customers.',
 'A balanced diet is key to maintaining good health.',
 'Music has the power to heal and unite people.',
 'Economic stability is essential for national growth.']

In [109]:
# Generate an embedding for each sentence
embeddings = [model.encode(sentence) for sentence in sentences]

In [110]:
# Create DataFrame with sentences and embeddings
embed_df = pd.DataFrame({
    'sentence': sentences,
    'embedding': embeddings
})

In [111]:
# Set the maximum column width to 500 characters
pd.set_option('display.max_colwidth', 500)

In [112]:
embed_df

Unnamed: 0,sentence,embedding
0,Government policies should prioritize the welfare of citizens.,"[-0.051664412, 0.0010407149, 0.037959058, -0.037145548, -0.033960618, 0.1046173, 0.10129703, -0.046506867, -0.09336018, 0.056426253, 0.07446415, 0.123309545, -0.04867209, 0.03245632, 0.07731287, 0.023410179, 0.10810875, 0.008092158, -0.057126448, 0.038353473, -0.036911417, -0.033216406, -0.035005994, 0.016439414, -0.033517495, -0.01811514, -0.0043659084, -0.025708642, 0.0004220911, 0.039880782, 0.043573894, -0.030682992, 0.0011141354, 0.039253056, -0.025893899, 0.052451078, -0.021276634, -0...."
1,A well-cooked meal can bring people together.,"[0.009125884, 0.00903909, -0.001334927, 0.07981436, -0.08276294, -0.028271118, 0.034884, -0.11284357, 0.014852262, -0.031526245, 0.04477015, -0.007579754, -0.05523187, -0.041576672, 0.019067481, -0.10551557, 0.10614761, -0.05365012, -0.04145356, -0.057196602, -0.1467616, -0.011838014, 0.01544652, 0.0021416643, 0.017979424, -0.010859036, 0.048679467, -0.04529665, -0.015292847, 0.037917726, 0.037846956, 0.0014440635, 0.04407926, 0.048026856, -0.095511146, 0.11453601, 0.08775456, -0.025925709, ..."
2,Regular exercise helps improve both physical and mental health.,"[0.064394064, 0.04254936, 0.080913335, 0.061345328, -0.05532228, 0.07707272, 0.014337269, -0.045221876, -0.024948658, 0.0024622462, 0.043790106, 0.11271967, -0.049053364, -0.00089764845, 0.15284145, -0.020286566, -0.008339523, 0.041927837, -0.0028935687, 0.014871353, -0.02962646, -0.03211418, 0.103952125, 0.058435503, -0.025621835, 0.051627062, -0.03728801, -0.008059096, 0.007811645, 0.053138297, 0.040803306, 0.051928952, 0.028830752, 0.036306173, -0.042018335, 0.024275178, 0.005123572, 0.08..."
3,Digital marketing strategies reach a wider global audience.,"[0.029987263, -0.073401935, 0.016416237, -0.07685502, 0.06466903, -0.0021748925, -0.033078045, 0.03690687, 0.008457469, -0.006512489, 0.021165436, 0.046234395, 0.026757963, -0.014263151, 0.107905306, -0.06976661, -0.020068923, -0.04681939, -0.034021147, 0.0015102229, 0.01283612, -0.052093543, 0.08955835, 0.027933681, -0.098328464, -0.04442452, 0.008712511, 0.004594569, 0.014045987, -0.050324596, 0.027554644, 0.12275363, 0.05079254, 0.008511258, -0.0006068854, 0.03576252, -0.013381892, 0.0411..."
4,Fresh ingredients make all the difference in cooking.,"[0.0011540126, -0.062332325, 0.039537884, 0.056177072, 0.07975883, -0.016657073, -0.07671936, -0.052278988, 0.064678475, -0.073357426, 0.048184335, 0.04711113, -0.08348797, -0.050604913, 0.011513779, -0.0993145, 0.1270017, 0.015974017, -0.08919765, -0.0782354, -0.02882138, 0.006713358, 0.043020148, 0.036059275, 0.059294835, 0.0008123851, 0.031915676, 0.00069113704, 0.030869788, -0.07693413, -0.00444681, 0.08795215, 0.029743297, -0.06116345, -0.03439547, 0.026279416, 0.04536655, 0.02058389, 0..."
5,Listening to music can evoke deep emotional responses.,"[0.071400285, -0.04516402, 0.08011501, 0.035749953, -0.0042751217, 0.033621054, 0.023426866, 0.0027543027, 0.07827193, -0.062489584, -0.0639897, -0.03248978, -0.023055824, -0.08534279, 0.038004637, 0.02651968, 0.059148755, 0.06186571, -0.09235571, -0.03791019, -0.02125043, 0.010915964, -0.09601715, 0.02888499, -0.033843253, 0.07182131, 0.03922951, -0.005638566, -0.0065087485, 0.028752303, 0.06270543, 0.037529908, 0.061653122, -0.025585337, -0.09743518, 0.043832846, -0.050009854, -0.003348359..."
6,History is written by those who shape its events.,"[0.040604025, 0.06510615, 0.0030110786, 0.021273384, -0.022880955, 0.07255096, -0.071838245, -0.06530792, 0.049175676, 0.0368702, -0.010962155, 0.07522543, -0.020281682, 0.026332743, -0.073469274, -0.034369268, -0.12367903, -0.040058665, 0.01826153, -0.06527854, 0.02986002, -0.03145467, 0.03773245, 0.022601519, 0.010873018, 0.038034882, 0.04482524, -0.035226107, 0.01051095, 0.025569104, -0.046430405, -0.07299706, 0.004665724, -0.030327912, -0.06214623, 0.006263692, -0.0057589356, 0.052243162..."
7,Effective branding builds trust and attracts loyal customers.,"[0.016988559, 0.025426459, 0.02367814, -0.063959286, 0.015448879, 0.005835783, 0.0708476, 0.027515607, 0.022626976, -0.06485289, 0.056140397, 0.051543415, 0.07392231, -0.034091707, 0.10046593, -0.070732094, 0.065495044, 0.026946872, -0.01895898, -0.06534655, -0.090615325, -0.08987164, 0.007113973, 0.041967377, -0.08849303, -0.024515692, 0.009714136, 0.04521284, 0.023320094, -0.11263317, 0.03691681, 0.06124254, 0.058387075, -0.06754206, -0.039564934, 0.0472665, 0.0026479028, -0.023607083, -0...."
8,A balanced diet is key to maintaining good health.,"[-0.03243351, 0.042308513, 0.035654098, 0.044864688, -0.055672385, 0.04060491, -0.04012269, -0.0018086808, -0.07143829, -0.013335502, 0.023623066, 0.023237322, -0.041928913, -0.10253718, 0.055660002, -0.03207499, 0.0083817495, 0.099899516, -0.03404632, -0.01610899, -0.0350096, 0.011561266, 0.09161728, 0.07689886, -0.12809265, 0.025293691, 0.01980993, -0.09485435, -0.04003415, -0.008018007, 0.11514358, 0.06673689, 0.060249582, -0.015969789, -0.13910697, 0.038419336, 0.024528602, -0.053090483,..."
9,Music has the power to heal and unite people.,"[0.047103435, 0.005926523, 0.083847955, -0.03740527, -0.029195163, 0.11056149, 0.020474575, -0.05220817, 0.019696862, -0.0021311445, -0.05106575, 0.09096371, 0.028264292, -0.091086864, 0.026167339, 0.062635675, -0.064193904, 0.030887276, -0.06600474, -0.031118039, -0.0925712, 0.10744984, -0.02100068, 0.041092735, -0.04183281, 0.057942126, -0.04305716, -0.0005643705, 0.014463392, -0.034809846, 0.025252592, 0.038319476, 0.011513563, -0.032314707, -0.11209268, 0.07901338, -0.0050278176, -0.0005..."


In [113]:
# Calculate pairwise cosine similarity
cosine_sim_matrix = cosine_similarity(embeddings)

# Convert cosine similarity matrix to DataFrame for easy visualization
cosine_sim_df = pd.DataFrame(cosine_sim_matrix, index=embed_df['sentence'], columns=embed_df['sentence'])

In [115]:
cosine_sim_df

sentence,Government policies should prioritize the welfare of citizens.,A well-cooked meal can bring people together.,Regular exercise helps improve both physical and mental health.,Digital marketing strategies reach a wider global audience.,Fresh ingredients make all the difference in cooking.,Listening to music can evoke deep emotional responses.,History is written by those who shape its events.,Effective branding builds trust and attracts loyal customers.,A balanced diet is key to maintaining good health.,Music has the power to heal and unite people.,Economic stability is essential for national growth.
Government policies should prioritize the welfare of citizens.,1.0,0.103339,0.147027,0.073202,0.01335,-0.005713,0.094055,0.140832,0.185193,0.106377,0.300287
A well-cooked meal can bring people together.,0.103339,1.0,0.074206,0.042159,0.395341,0.088751,0.064395,0.125634,0.322217,0.207835,0.128898
Regular exercise helps improve both physical and mental health.,0.147027,0.074206,1.0,-0.059923,0.089724,0.170325,-0.003221,-0.056823,0.46425,0.213618,0.069147
Digital marketing strategies reach a wider global audience.,0.073202,0.042159,-0.059923,1.0,0.040141,0.152679,0.066129,0.452744,0.076899,0.188702,0.067078
Fresh ingredients make all the difference in cooking.,0.01335,0.395341,0.089724,0.040141,1.0,0.031568,-0.012091,0.154825,0.280131,0.08264,0.077206
Listening to music can evoke deep emotional responses.,-0.005713,0.088751,0.170325,0.152679,0.031568,1.0,0.060112,0.085046,0.07875,0.577532,-0.011456
History is written by those who shape its events.,0.094055,0.064395,-0.003221,0.066129,-0.012091,0.060112,1.0,-0.017601,0.017352,0.119593,0.078711
Effective branding builds trust and attracts loyal customers.,0.140832,0.125634,-0.056823,0.452744,0.154825,0.085046,-0.017601,1.0,0.093172,0.112415,0.122516
A balanced diet is key to maintaining good health.,0.185193,0.322217,0.46425,0.076899,0.280131,0.07875,0.017352,0.093172,1.0,0.190044,0.21915
Music has the power to heal and unite people.,0.106377,0.207835,0.213618,0.188702,0.08264,0.577532,0.119593,0.112415,0.190044,1.0,0.111512
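A similarity matrix like the one above is usually read by looking for the largest off-diagonal entries, since the diagonal is always 1.0 (each sentence compared with itself). The sketch below does this with NumPy on a 3×3 excerpt whose values are copied from the output above. The `closest_pair` helper is a hypothetical name.

```python
import numpy as np

def closest_pair(sim, labels):
    """Return the two most similar distinct items and their similarity score."""
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)  # ignore self-similarity on the diagonal
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    return labels[i], labels[j], float(sim[i, j])

# Excerpt of the cosine similarity matrix shown above
labels = [
    "Listening to music can evoke deep emotional responses.",
    "Music has the power to heal and unite people.",
    "Digital marketing strategies reach a wider global audience.",
]
sim = np.array([
    [1.0, 0.577532, 0.152679],
    [0.577532, 1.0, 0.188702],
    [0.152679, 0.188702, 1.0],
])
first, second, score = closest_pair(sim, labels)
```

Here the two music-related sentences come out as the closest pair. Their score of 0.577532 is also the highest off-diagonal value in the rows displayed above.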


## Excellent References

For further exploration and better understanding, you can use the following references.

- Glossary of Deep Learning: Word Embedding

    https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca


- wevi: word embedding visual inspector

    https://ronxin.github.io/wevi/


- Learning Word Embedding

    https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html


- On the contribution of neural networks and word embeddings in Natural Language Processing

    https://medium.com/@josecamachocollados/on-the-contribution-of-neural-networks-and-word-embeddings-in-natural-language-processing-c8bb1b85c61c