# word2vec: How To Implement word2vec

### Explore Pre-trained Embeddings

The cells below load `glove-wiki-gigaword-100` through `gensim.downloader`. Some other pre-trained options:
- `glove-twitter-{25/50/100/200}`
- `glove-wiki-gigaword-{50/200/300}`
- `word2vec-google-news-300`
- `word2vec-ruscorpora-news-300`
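
The braces in the list above are shorthand for the available vector dimensions; `api.load` expects one full name, such as the `glove-wiki-gigaword-100` used below. As a small sketch, the hypothetical helper `expand_model_names` (not part of gensim) expands the shorthand into concrete names:

```python
import re

def expand_model_names(pattern):
    """Expand 'name-{a/b/c}' shorthand into ['name-a', 'name-b', 'name-c']."""
    match = re.search(r"\{([^}]*)\}", pattern)
    if match is None:
        return [pattern]  # already a full model name
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [prefix + option + suffix for option in match.group(1).split("/")]

twitter_models = expand_model_names("glove-twitter-{25/50/100/200}")
print(twitter_models)
```

Any one of the resulting strings can then be passed to `api.load`.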

In [1]:
# Install gensim
!pip install -U gensim

Collecting gensim
  Downloading gensim-3.8.3-cp36-cp36m-macosx_10_9_x86_64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-4.1.2-py3-none-any.whl (111 kB)
Installing collected packages: smart-open, gensim
  Attempting uninstall: smart-open
    Found existing installation: smart-open 1.8.0
    Uninstalling smart-open-1.8.0:
      Successfully uninstalled smart-open-1.8.0
  Attempting uninstall: gensim
    Found existing installation: gensim 3.7.1
    Uninstalling gensim-3.7.1:
      Successfully uninstalled gensim-3.7.1

In [2]:
# Load pretrained word vectors using gensim
import gensim.downloader as api

wiki_embeddings = api.load('glove-wiki-gigaword-100')



In [12]:
# Explore the word vector for "pain"
wiki_embeddings['pain']

array([ 0.074065 ,  0.27331  ,  0.11436  , -0.021311 , -0.7097   ,
        0.69676  , -0.63839  , -0.10449  , -0.046197 , -0.94039  ,
       -0.45073  , -0.2913   ,  0.16654  ,  0.10863  ,  0.50075  ,
       -0.35285  , -1.1068   , -0.24618  , -0.4846   , -0.28999  ,
        0.26324  , -0.10728  , -1.3835   ,  0.67262  ,  0.090377 ,
        1.4126   ,  0.62699  , -0.9212   ,  0.71476  , -0.4183   ,
        0.36514  ,  0.12508  , -0.60492  ,  0.14183  , -0.75623  ,
       -0.40986  , -0.073459 ,  0.73399  ,  0.20977  ,  0.20305  ,
        0.22164  ,  0.3502   ,  0.13281  , -1.019    , -0.30507  ,
        0.37541  ,  0.72874  , -0.025062 , -0.21775  , -0.63315  ,
       -0.22306  ,  0.12251  ,  0.035594 ,  0.59439  ,  0.43194  ,
       -1.7208   ,  0.24543  , -0.52877  ,  0.68096  ,  0.591    ,
        0.99566  ,  0.87977  , -0.031954 , -0.095788 , -0.036024 ,
        0.12737  ,  0.85311  , -1.157    , -0.15524  , -0.66628  ,
       -0.3557   ,  0.10642  , -0.090021 ,  0.45239  ,  1.1023

In [13]:
# Find the 15 words most similar to "persistent" based on the pre-trained word vectors
wiki_embeddings.most_similar('persistent', topn=15)

[('lingering', 0.7641147375106812),
 ('mounting', 0.7182013988494873),
 ('widespread', 0.6965022087097168),
 ('concern', 0.6881173849105835),
 ('concerns', 0.6766042113304138),
 ('fears', 0.6710178852081299),
 ('worries', 0.6708651781082153),
 ('unrelenting', 0.6659743785858154),
 ('continuing', 0.6657591462135315),
 ('severe', 0.6639633178710938),
 ('chronic', 0.6589462757110596),
 ('worsening', 0.656125545501709),
 ('prolonged', 0.6560916900634766),
 ('nagging', 0.6553113460540771),
 ('serious', 0.6525529623031616)]
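
To make the ranking above less of a black box, here is a minimal sketch of the idea behind `most_similar`: score every vocabulary word against the query by cosine similarity and return the top hits. The function `most_similar_sketch` and the toy 3-dimensional vectors are invented for illustration; gensim's own implementation is more elaborate.

```python
import numpy as np

def most_similar_sketch(word, vocab, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    # Normalize rows to unit length so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[vocab.index(word)]
    scores = unit @ query
    # Sort by descending score and skip the query word itself
    order = [i for i in np.argsort(-scores) if vocab[i] != word]
    return [(vocab[i], float(scores[i])) for i in order[:topn]]

# Toy vectors, invented for illustration only (not real embeddings)
toy_vocab = ["persistent", "chronic", "lingering", "banana"]
toy_vectors = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [1.0, 0.8, 0.1],
    [0.0, 0.1, 1.0],
])

neighbors = most_similar_sketch("persistent", toy_vocab, toy_vectors)
print(neighbors)
```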

In [14]:
# Cosine similarity between "persistent" and "pain"
wiki_embeddings.similarity('persistent', 'pain')

0.5015517

In [15]:
# Cosine similarity between "persistent" and "chronic"
wiki_embeddings.similarity('persistent', 'chronic')

0.6589464
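
The score for `'chronic'` agrees with its entry in the `most_similar` output because both report cosine similarity. A minimal NumPy sketch of that measure follows; the helper name `cosine_similarity` and the toy inputs are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing the same way score 1; perpendicular vectors score 0
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

With the embeddings loaded, `cosine_similarity(wiki_embeddings['persistent'], wiki_embeddings['chronic'])` should come out close to the value shown above, up to floating-point precision.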

### Train Our Own Model

In [5]:
# Read in the data and clean up column names
import gensim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
pd.set_option('display.max_colwidth', 100)

messages = pd.read_csv('../../../data/spam.csv', encoding='latin-1')
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["label", "text"]
messages.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


In [6]:
# Clean data using gensim's built-in cleaner: simple_preprocess lowercases the text,
# tokenizes it, and drops very short and very long tokens
messages['text_clean'] = messages['text'].apply(gensim.utils.simple_preprocess)
messages.head()

| | label | text | text_clean |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g... | [go, until, jurong, point, crazy, available, only, in, bugis, great, world, la, buffet, cine, th... |
| 1 | ham | Ok lar... Joking wif u oni... | [ok, lar, joking, wif, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ... | [free, entry, in, wkly, comp, to, win, fa, cup, final, tkts, st, may, text, fa, to, to, receive,... |
| 3 | ham | U dun say so early hor... U c already then say... | [dun, say, so, early, hor, already, then, say] |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though | [nah, don, think, he, goes, to, usf, he, lives, around, here, though] |
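
The `text_clean` column shows what the cleaner does: single-letter tokens such as "u" and "c" vanish, digits are stripped ("21st" becomes "st"), and everything is lowercased. A rough approximation in plain Python, assuming the default limits of 2 and 15 characters per token; `simple_preprocess_sketch` is a hypothetical stand-in, not gensim's code:

```python
import re

def simple_preprocess_sketch(text, min_len=2, max_len=15):
    """Approximate gensim.utils.simple_preprocess with its default settings."""
    # Lowercase, then keep runs of letters/underscores; digits and punctuation split tokens
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens that are too short, too long, or start with an underscore
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

tokens = simple_preprocess_sketch("Ok lar... Joking wif u oni...")
print(tokens)
```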


In [7]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(messages['text_clean'],
                                                    messages['label'], test_size=0.2)

In [8]:
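# Train the word2vec model
# Note: `size` is the gensim 3.x parameter name; gensim 4.0+ renamed it to `vector_size`
w2v_model = gensim.models.Word2Vec(X_train,
                                   size=100,     # dimensionality of each word vector
                                   window=5,     # max distance between a word and its context words
                                   min_count=2)  # ignore words that appear fewer than 2 times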
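
To illustrate what `window` and `min_count` control, the sketch below drops rare words and then pairs each remaining word with its neighbors inside a fixed window. It is a simplified illustration, not gensim's training loop: the real implementation also subsamples frequent words, randomly shrinks the window, and by default trains CBOW rather than using explicit pairs. The function `context_pairs` and the toy sentences are invented for this example.

```python
from collections import Counter

def context_pairs(sentences, window=2, min_count=2):
    """Return (center, context) word pairs after dropping rare words."""
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        # Words seen fewer than min_count times are removed before windowing
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo, hi = max(0, i - window), min(len(kept), i + window + 1)
            pairs.extend((center, kept[j]) for j in range(lo, hi) if j != i)
    return pairs

# Toy tokenized sentences, invented for illustration
toy_sentences = [["free", "entry", "to", "win"], ["text", "to", "win", "free"]]
pairs = context_pairs(toy_sentences, window=1, min_count=2)
print(pairs)
```

Because "entry" occurs only once, it is dropped, and "free" and "to" become neighbors in the first sentence.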

In [9]:
# Explore the word vector for "king" based on our trained model
w2v_model.wv['king']

array([-0.05425125,  0.04536858, -0.09595686, -0.02699764,  0.11971134,
        0.10338762, -0.03818981, -0.01448795,  0.0171045 ,  0.00511975,
        0.01045221, -0.00677045, -0.12050592,  0.11097632, -0.04719375,
       -0.02802079,  0.01247429, -0.06322849,  0.06611794,  0.07224897,
       -0.02086301,  0.016499  ,  0.02015498,  0.00358362,  0.08886525,
       -0.099216  ,  0.06923407,  0.01566726, -0.05832795,  0.03870581,
       -0.02199215,  0.03693705, -0.00661952, -0.04715456,  0.07135164,
       -0.00723605,  0.02134361, -0.09508089, -0.00362955, -0.03568636,
        0.05925028, -0.01528659, -0.04217548,  0.01903476, -0.02175902,
       -0.08289368, -0.06005706, -0.02793312,  0.06268803,  0.06778472,
       -0.03594127,  0.11335944, -0.06159783, -0.0157827 , -0.03330815,
       -0.00814747, -0.08040741, -0.02449049, -0.02535428, -0.02809742,
        0.03898891, -0.03665545, -0.0125957 ,  0.04661012, -0.04162746,
       -0.04639079, -0.04960034, -0.07714609,  0.04107031, -0.09

In [10]:
# Find the most similar words to "king" based on word vectors from our trained model
w2v_model.wv.most_similar('king')

[('show', 0.9984022974967957),
 ('being', 0.9983983039855957),
 ('coming', 0.9983887672424316),
 ('working', 0.9983633756637573),
 ('watching', 0.9983620643615723),
 ('boy', 0.998355507850647),
 ('gonna', 0.9983476400375366),
 ('poly', 0.9983355402946472),
 ('how', 0.9983333945274353),
 ('friends', 0.9983316659927368)]