# Stanford GloVe Vectors Embedding 

GloVe vectors: https://nlp.stanford.edu/projects/glove/

Paper: https://nlp.stanford.edu/pubs/glove.pdf

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Some words often come in pairs, like `nice and easy` or `pros and cons`, so how often words co-occur in a corpus can teach us something about their meaning.
Sometimes co-occurring words are similar in meaning, and sometimes they are opposites.
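
To make the idea of co-occurrence concrete, here is a minimal sketch that counts how often two words fall within a small window of each other. The function name `cooccurrence_counts`, the toy corpus, and the window size are illustrative assumptions, not part of the GloVe tooling. GloVe itself weights each pair by the inverse of the distance between the two words, whereas this sketch uses plain counts.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look only to the right and store each pair in sorted order,
            # so (a, b) and (b, a) share one symmetric count.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return counts

# Hypothetical toy corpus, used only for illustration.
toy_corpus = [
    "nice and easy does it",
    "weigh the pros and cons",
    "the pros and cons are clear",
]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("cons", "pros")])
```

On a real corpus these counts form the word-word co-occurrence matrix that GloVe is trained on.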

![image.png](attachment:image.png)

The training objective is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.

![image.png](attachment:image.png)
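
For reference, the objective can be written out as the weighted least-squares loss from the GloVe paper linked above. Here $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

The weighting function $f$ limits the influence of very frequent pairs and gives zero weight to pairs that never co-occur:

$$
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The paper reports using $x_{\max} = 100$ and $\alpha = 3/4$.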

## Data Preparation

In [2]:
import numpy as np
from numpy import array
import pandas as pd

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

In [7]:
# !pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

In [8]:
import preprocess_kgptalkie as ps

In [6]:
df = pd.read_csv('twitter4000.csv')
df.head()

| Unnamed: 0 | twitts | sentiment |
|---|---|---|
| 0 | is bored and wants to watch a movie any sugge... | 0 |
| 1 | back in miami. waiting to unboard ship | 0 |
| 2 | @misskpey awwww dnt dis brng bak memoriessss, ... | 0 |
| 3 | ughhh i am so tired blahhhhhhhhh | 0 |
| 4 | @mandagoforth me bad! It's funny though. Zacha... | 0 |


### Preprocessing and Cleaning 

In [10]:
%%time
# Apply each cleaning step from preprocess_kgptalkie to every tweet in turn.
# Note: remove_special_chars runs before remove_emails and remove_urls, so URLs
# and e-mail addresses may already have lost their punctuation by that point.
df['twitts'] = df['twitts'].apply(ps.cont_exp)               # expand contractions
df['twitts'] = df['twitts'].apply(ps.remove_special_chars)
df['twitts'] = df['twitts'].apply(ps.remove_accented_chars)
df['twitts'] = df['twitts'].apply(ps.remove_emails)
df['twitts'] = df['twitts'].apply(ps.remove_html_tags)
df['twitts'] = df['twitts'].apply(ps.remove_urls)
df['twitts'] = df['twitts'].apply(ps.make_base)              # reduce words to their base form
# df['twitts'] = df['twitts'].apply(lambda x: ps.spelling_correction(x).raw_sentences[0])

Wall time: 55.6 s
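
The order of the cleaning steps matters: in the cell above, special characters are removed before URLs. The sketch below illustrates the effect with plain regular expressions. The helpers `strip_urls`, `strip_special_chars`, and `squeeze` are hypothetical stand-ins written for this example, not the `preprocess_kgptalkie` implementations, and the URL pattern is deliberately simple.

```python
import re

# Simplified URL pattern, assumed for illustration only.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Remove anything that looks like a URL."""
    return URL_PATTERN.sub("", text)

def strip_special_chars(text):
    """Keep only letters, digits, and spaces."""
    return re.sub(r"[^A-Za-z0-9 ]+", "", text)

def squeeze(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

tweet = "check www.example.com now!"

# Special characters first: the dots vanish, so the URL pattern no longer matches.
chars_first = squeeze(strip_urls(strip_special_chars(tweet)))

# URLs first: the address is removed before punctuation is stripped.
urls_first = squeeze(strip_special_chars(strip_urls(tweet)))

print(chars_first)
print(urls_first)
```

Tokens such as `wwwtweeteraddercom` in the out-of-vocabulary list further down are consistent with this effect.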


In [13]:
df['twitts']

0        is bore and want to watch a movie any suggestion
1                      back in miami wait to unboard ship
2       misskpey awwww dnt this bring back memoriessss...
3                        ughhh i am so tired blahhhhhhhhh
4       mandagoforth me bad Its funny though Zachary Q...
                              ...                        
3995                                      i just graduate
3996                 templating work it all have to be do
3997                         mommy just bring me starbuck
3998          omarepps watch you on a House rerunlovin it
3999    thank for try to make me smile Ill make your 1...
Name: twitts, Length: 4000, dtype: object

In [14]:
df['sentiment'].value_counts()

1    2000
0    2000
Name: sentiment, dtype: int64

### `GloVe` Vectors 

In [19]:
# Each line of a GloVe text file is a token followed by its space-separated vector components, e.g.:
# you -0.11076 0.30786 -0.5198 0.035138 0.10368 -0.052505 ... -0.059105 0.47604 0.05661

In [16]:
glove_vectors = dict()

In [17]:
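%%time
# Each line of the file holds a word followed by its vector components.
# The components are kept as strings here; NumPy converts them to floats
# when they are later assigned into a float array.
with open('glove/glove.6B.50d.txt', encoding='utf-8') as file:
    for line in file:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:])
        glove_vectors[word] = vectors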

Wall time: 8.79 s
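
The loader above stores each vector as an array of strings, which is why the outputs below show a `<U...` dtype. If numeric vectors are wanted straight away, one option is to pass a `dtype` to `np.asarray`. The sketch below shows this on two made-up lines held in memory; `sample_file` and `float_vectors` are names chosen for this example, and the numbers are not real GloVe values.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its components.
sample_file = io.StringIO(
    "hello 0.1 -0.2 0.3\n"
    "world 0.4 0.5 -0.6\n"
)

float_vectors = {}
for line in sample_file:
    values = line.split()
    # dtype="float32" parses the string components into numbers while loading.
    float_vectors[values[0]] = np.asarray(values[1:], dtype="float32")

print(float_vectors["hello"].dtype)
print(float_vectors["hello"].shape)
```

Storing floats up front avoids repeated string-to-number conversion whenever the vectors are used in arithmetic.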


In [20]:
glove_vectors

{'the': array(['0.418', '0.24968', '-0.41242', '0.1217', '0.34527', '-0.044457',
        '-0.49688', '-0.17862', '-0.00066023', '-0.6566', '0.27843',
        '-0.14767', '-0.55677', '0.14658', '-0.0095095', '0.011658',
        '0.10204', '-0.12792', '-0.8443', '-0.12181', '-0.016801',
        '-0.33279', '-0.1552', '-0.23131', '-0.19181', '-1.8823',
        '-0.76746', '0.099051', '-0.42125', '-0.19526', '4.0071',
        '-0.18594', '-0.52287', '-0.31681', '0.00059213', '0.0074449',
        '0.17778', '-0.15897', '0.012041', '-0.054223', '-0.29871',
        '-0.15749', '-0.34758', '-0.045637', '-0.44251', '0.18785',
        '0.0027849', '-0.18411', '-0.11514', '-0.78581'], dtype='<U11'),
 ',': array(['0.013441', '0.23682', '-0.16899', '0.40951', '0.63812', '0.47709',
        '-0.42852', '-0.55641', '-0.364', '-0.23938', '0.13001',
        '-0.063734', '-0.39575', '-0.48162', '0.23291', '0.090201',
        '-0.13324', '0.078639', '-0.41634', '-0.15428', '0.10068',
        '0.48891', '0

In [18]:
len(glove_vectors)

400000

In [19]:
keys = glove_vectors.keys()
len(keys)

400000

In [21]:
glove_vectors.get('hello')

array(['-0.38497', '0.80092', '0.064106', '-0.28355', '-0.026759',
       '-0.34532', '-0.64253', '-0.11729', '-0.33257', '0.55243',
       '-0.087813', '0.9035', '0.47102', '0.56657', '0.6985', '-0.35229',
       '-0.86542', '0.90573', '0.03576', '-0.071705', '-0.12327',
       '0.54923', '0.47005', '0.35572', '1.2611', '-0.67581', '-0.94983',
       '0.68666', '0.3871', '-1.3492', '0.63512', '0.46416', '-0.48814',
       '0.83827', '-0.9246', '-0.33722', '0.53741', '-1.0616',
       '-0.081403', '-0.67111', '0.30923', '-0.3923', '-0.55002',
       '-0.68827', '0.58049', '-0.11626', '0.013139', '-0.57654',
       '0.048833', '0.67204'], dtype='<U9')

In [24]:
glove_vectors.get('hello').shape

(50,)
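
Once a word maps to a fixed-length vector, a common way to compare two words is cosine similarity. The helper below is a minimal sketch: `cosine_similarity` and the three-dimensional `toy_vectors` are assumptions made for this example. It converts its inputs to floats first, so it also accepts string-typed arrays like the ones stored in `glove_vectors`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    a = np.asarray(a, dtype="float32")
    b = np.asarray(b, dtype="float32")
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Made-up vectors stored as strings, mirroring how the loader keeps them.
toy_vectors = {
    "east": ["1.0", "0.0", "0.0"],
    "also_east": ["2.0", "0.0", "0.0"],
    "north": ["0.0", "1.0", "0.0"],
}

same_direction = cosine_similarity(toy_vectors["east"], toy_vectors["also_east"])
orthogonal = cosine_similarity(toy_vectors["east"], toy_vectors["north"])
print(same_direction, orthogonal)
```

With the real vectors, the same call would look like `cosine_similarity(glove_vectors['hello'], glove_vectors['hi'])`, assuming both words are present in the dictionary.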

In [25]:
# .get returns None for a word that is not in the GloVe vocabulary, so nothing is displayed
glove_vectors.get('aassrfdfa')

In [27]:
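# Assumption: `token` and `vocab_size` are not defined in this section. The cell
# expects `token` to be a fitted tokenizer whose `word_index` maps each word to a
# 1-based integer index (for example a Keras `Tokenizer`), and
# `vocab_size = len(token.word_index) + 1` so that row 0 stays all zeros.
word_vector_matrix = np.zeros((vocab_size, 50))  # 50 = dimension of glove.6B.50d

for word, index in token.word_index.items():
    vector = glove_vectors.get(word)
    if vector is not None:
        word_vector_matrix[index] = vector  # string components are cast to float
    else:
        print(word)  # word has no GloVe vector; its row stays all zeros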

tommcfly
dougiemcfly
donniewahlberg
kirstiealley
quotthe
peterfacinelli
davidarchie
quoti
youngq
jordanknight
atampt
gtlt
songzyuuup
dannygokey
jackalltimelow
ashleytisdale
jonathanrknight
ughhh
twitterberry
misss
shaaqt
cepic
soooooooooooo
spymaster
notbut
sweetkisses
dcdebbie
earthlifeshop
youquot
ltsobgt
wayyy
workno
heidimontag
mampg
seeee
sohotel
xbllygbsn
damohopo
katlb
anddd
rckergirl
mequot
quotyoure
aplusk
wwwtweeteraddercom
paulgdog
hotwords
deltagoodrem
miamiadc
trvsbrkr
thesupergirl
gailporter
jeffreecuntstar
youngcash
misskpey
memoriessss
mandagoforth
kevinmarquis
peterfacinellis
raisenot
stacig
killahzzzz
sfannah
brianquest
garethemery
kimberleymtkg
daniesass
hungrydomaine
quotmermaid
dinnerquot
shvizut
wellat
shirshor
hummm
jchamanes
scaredddd
andrewhuntre
sonjacassella
darenotspeak
mammaj
jkmrprez
yearquot
haemoglobin
stedic
paulmccourt
greenisland
twilightofdoom
krazyfreak
whyhe
xxbrry
brianrubin
lomara
kankzxd
anthothemantho
antonwheel
hanaames
orlax
bastantep
catshat

guendouglas
eliburford
amazingphoebe
ohhhleann
amberlrhea
jonnybabyy
angeliquebates
onlylies
somereason
errored
kennah
tweetsearch
prettyblacklex
mcrmuffin
janabellexo
carissarogers
atabey
mandahead
jillianvalentin
beautyfulashley
beatersidk
ninjafrog
netamarie
freakshowmikey
fsbigbob
moriqua
vidaecaffe
monkeyspanda
madcatdisease
dyankd
mediatemple
rockpapershot
iblogologist
shontane
steveheath
upwhy
errinmerrrill
rubyam
kellymcfly
mewell
onelet
uncleo
babymaybe
serveri
natalietran
goodbut
ellapsycho
madisonxgeorgie
brownskinbunnies
famnot
teegee
officialmelb
missmel
clickwindrepeat
chiefdork
casuallavish
revisting
cherrywopie
sidestomach
staceyandeen
babiiluv
tomcmfly
visitingchile
sugawright
roninofragnarok
seriouz
akihikio
zwoise
mzkiss
johnchandler
scruffypanther
quotnormalskinnyquot
bfoxb
onepov
quotonce
dreamquot
andreashale
goooodyet
chriskey
townhallforhope
suuuperpach
tonycupcakes
maurillio
charices
officialcharice
refundnow
timdisaster
travdave
skeezoyd
dvartistry
indywoodfil

glidden
collingsa
bhinn
artomatic
mickeymonkeyy
hereya
lizpriore
joshuaradin
futuredirected
vectorlovers
runshouse
djodcouk
riffrecording
bluvox
urasawas
anatomyo
undiess
summmeeerrr
ischoolatdrexel
schenkin
dloesch
mischkas
liamvickery
amandalester
trendall
riskyadinda
donniedoll
lovingpaws
pbcomquot
airsmithing
cashola
againugh
davynathan
sazmows
jesusandmary
uniqueguitarist
lechantdoiseau
tidemark
pavwoahva
paulamcfly
handsheldhigh
nennamusic
chrunchie
ffffrrrriiiiiddddaaaaayyyy
lvatt
paulaabdulfan
jessstroup
ohamps
nikkilorenzo
lthas
quinnspurr
bluelipstx
nissietr
nissieeeehave
elleyevee
creepygnome
shootno
chriscyvas
amylaree
joegigantino
selfneed
ericakelly
therealcabbie
killasluddie
realsed
kevinwweaver
buzzbaker
talktodiane
bringxknives
halfdate
themandymoore
isiswisdom
linacalabria
nightynite
waaaida
pianoooo
katetropa
simchabe
sorrynext
joelmadden
wwwmybigdayplannercom
xxkassyxx
jiamiin
barclub
carlosedp
dailymetv
guysthis
koifishsushi
sscullion
timtech
inklesstales
beehughes
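
Most of the words printed above are Twitter handles, words run together, or elongated spellings, so they have no GloVe entry and keep an all-zero row. Instead of printing them, the missing words can be collected and summarised as a coverage figure. The sketch below is self-contained: `build_embedding_matrix`, `toy_word_index`, and `toy_vectors` are hypothetical names and data standing in for `token.word_index` and `glove_vectors`.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Return (matrix, missing): one row per word index, zeros where no vector exists."""
    # Row 0 is left unused because word indices are assumed to start at 1.
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    missing = []
    for word, index in word_index.items():
        vector = vectors.get(word)
        if vector is None:
            missing.append(word)
        else:
            # Explicit conversion, since the vectors may be stored as strings.
            matrix[index] = np.asarray(vector, dtype="float32")
    return matrix, missing

# Made-up stand-ins for token.word_index and glove_vectors.
toy_word_index = {"movie": 1, "tired": 2, "ughhh": 3}
toy_vectors = {"movie": ["0.1", "0.2"], "tired": ["-0.3", "0.4"]}

matrix, missing = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
coverage = 1 - len(missing) / len(toy_word_index)
print(missing, round(coverage, 2))
```

A low coverage figure is a hint that the preprocessing step is producing tokens the pretrained vocabulary cannot match.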