# Word Embeddings in Action - Word2Vec

### Word embeddings convert text into numeric vectors that a model can work with while keeping the text's semantic meaning intact.

In [3]:
# import required libraries
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import re
pd.set_option('display.max_colwidth', 200)

In [2]:
# Load the Twitter dataset (forward slashes work on Windows, Linux and macOS)
df = pd.read_csv('datasets/tweets.csv')
df.head()

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!


Skip the code block below if you have already downloaded the stopwords before.

In [None]:
# download stopwords (one-time download)
nltk.download('stopwords')

In [3]:
stop_words = set(stopwords.words('english')) 

In [4]:
nltk.download('wordnet') #one-time download

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rajkumar.mo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
lemmatizer = WordNetLemmatizer() 

In [7]:
# function for text pre-processing
def tweet_cleaner(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)                  # remove user mentions
    text = re.sub(r'#', '', text)                              # remove the hashtag symbol, keep the word
    text = re.sub(r'http\S+', '', text)                        # remove links
    text = re.sub(r"'s\b", '', text)                           # remove possessive 's
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()             # keep letters only, then lowercase
    tokens = [w for w in text.split() if w not in stop_words]  # remove stopwords
    # lemmatize each token and join the tokens back into a single string
    return ' '.join(lemmatizer.lemmatize(w) for w in tokens)
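
To see what the regex steps of the cleaner do on their own, here is a minimal sketch that applies them to a made-up tweet. The helper name `strip_tweet_noise` and the sample text are assumptions for illustration; stopword removal and lemmatization are deliberately left out so the snippet runs without NLTK data.

```python
import re

def strip_tweet_noise(text):
    """Apply only the regex steps of the cleaner (no stopwords, no lemmatization)."""
    text = re.sub(r'@[A-Za-z0-9]+', '', text)   # drop user mentions
    text = re.sub(r'#', '', text)               # drop the hashtag symbol, keep the word
    text = re.sub(r'http\S+', '', text)         # drop links
    text = re.sub(r"'s\b", '', text)            # drop possessive 's
    text = re.sub(r'[^a-zA-Z]', ' ', text)      # replace every non-letter with a space
    return text.lower().split()

# made-up tweet, not taken from the dataset
sample = "@apple Thanks for the #iPhone's new case :) http://t.co/abc123"
tokens = strip_tweet_noise(sample)
print(tokens)
```

The mention, the link, the emoticon and the possessive are gone, while the hashtag word itself survives as an ordinary token.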

In [8]:
# pre-process every tweet and store the result in a new column
df['cleaned_tweets'] = df['tweet'].apply(tweet_cleaner)

In [9]:
df.head()

Unnamed: 0,id,label,tweet,cleaned_tweets
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android #apps #beautiful #cute #health #igers #iphoneonly #iphonesia #iphone,fingerprint pregnancy test android apps beautiful cute health igers iphoneonly iphonesia iphone
1,2,0,Finally a transparant silicon case ^^ Thanks to my uncle :) #yay #Sony #Xperia #S #sonyexperias… http://instagram.com/p/YGEt5JC6JM/,finally transparant silicon case thanks uncle yay sony xperia sonyexperias
2,3,0,We love this! Would you go? #talk #makememories #unplug #relax #iphone #smartphone #wifi #connect... http://fb.me/6N3LsUpCu,love would go talk makememories unplug relax iphone smartphone wifi connect
3,4,0,I'm wired I know I'm George I was made that way ;) #iphone #cute #daventry #home http://instagr.am/p/Li_5_ujS4k/,wired know george made way iphone cute daventry home
4,5,1,What amazing service! Apple won't even talk to me about a question I have unless I pay them $19.95 for their stupid support!,amazing service apple even talk question unless pay stupid support


### Using Google's pre-trained Word2Vec


In [12]:
# download the pre-trained word2vec embeddings (-c resumes a partial download)
! wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2022-01-10 12:33:49--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.146.117
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.146.117|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [16]:
# `unzip` only reads .zip archives and `gunzip` is not available by default on Windows,
# so decompress the gzip file with Python's standard library instead
import gzip, os, shutil

src_path = 'E:\\Analytics_Vidhya\\tools\\GoogleNews-vectors-negative300.bin.gz'
dst_dir = 'E:\\Analytics_Vidhya\\tools\\GoogleNews-vectors-negative300.bin'  # folder used by the loading cell below
os.makedirs(dst_dir, exist_ok=True)
with gzip.open(src_path, 'rb') as src, open(os.path.join(dst_dir, 'GoogleNews-vectors-negative300.bin'), 'wb') as dst:
    shutil.copyfileobj(src, dst)


In [2]:
# the package on PyPI is called gensim; gensim.models is a module inside it
!pip install gensim


In [17]:
from gensim.models import KeyedVectors

# path of the downloaded model
filename = 'E:\\Analytics_Vidhya\\tools\\GoogleNews-vectors-negative300.bin\\GoogleNews-vectors-negative300.bin'

# load into gensim
w2vec = KeyedVectors.load_word2vec_format(filename, binary=True)

Once the cell above runs without errors, the pre-trained word2vec model is loaded into memory as `w2vec`. Let's explore some of the features of this model.

__Contextual Relationship Between Words__

 - One of the impressive things about word2vec is its ability to capture semantic relationships between words. That is why you can perform vector arithmetic on words and get a meaningful result. Have a look at the following example:

    `airplane - fly + drive = car`

 - If you pass the left-hand side of the above equation to the model, the nearest word to the result is the right-hand side. This makes sense: what would you get if you removed the ability to fly from an airplane and added the ability to drive? You would get a car!
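
The arithmetic above can be sketched with a toy example. The 3-dimensional vectors below are hand-made for illustration (they are not real word2vec values), and the helper names `cosine` and `analogy` are assumptions; the point is only to show how "subtract one word, add another, then find the nearest remaining word" works.

```python
import numpy as np

# Hand-made vectors; the axes loosely mean (is a vehicle, flies, drives).
toy_vectors = {
    "airplane": np.array([1.0, 1.0, 0.0]),
    "car":      np.array([1.0, 0.0, 1.0]),
    "boat":     np.array([1.0, 0.0, 0.0]),
    "fly":      np.array([0.0, 1.0, 0.0]),
    "drive":    np.array([0.0, 0.0, 1.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy(positive=["airplane", "drive"], negative=["fly"], vectors=toy_vectors)
print(result)
```

With the real model loaded as `w2vec`, gensim's `w2vec.most_similar(positive=['airplane', 'drive'], negative=['fly'])` performs a comparable nearest-neighbour search; which words it ranks first depends on the embeddings.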

### Text Classification using Word2Vec

Let's now get back to our task of classifying the Twitter data, using __word2vec__ vectors as features. Word2vec gives a vector representation of individual words, so to get one for a sentence or a document you can take the mean of the vectors of its constituent words.

<br>

Please note that the length of every vector of the pre-trained word2vec embeddings is 300.
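
As a small worked example of the averaging idea, the sketch below uses made-up 4-dimensional embeddings in place of the 300-dimensional word2vec vectors. The dictionary `toy_embeddings` and the helper `mean_vector` are assumptions for illustration; tokens without an embedding are simply skipped.

```python
import numpy as np

# Made-up 4-d embeddings standing in for the 300-d word2vec vectors.
toy_embeddings = {
    "love":   np.array([1.0, 0.0, 2.0, 0.0]),
    "iphone": np.array([3.0, 2.0, 0.0, 0.0]),
}

def mean_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding; all zeros if none do."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "makememories" has no embedding, so only "love" and "iphone" are averaged
tweet_vec = mean_vector(["love", "iphone", "makememories"], toy_embeddings, dim=4)
print(tweet_vec)  # [2. 1. 1. 0.]
```

Every tweet ends up with a vector of the same length regardless of how many words it contains, which is what a classifier needs as input.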


In [None]:
# function to get vector representation of a tweet
def word_vector(tokens):
    vec = np.zeros(300)
    count = 0
    for word in tokens:
        try:
            vec += w2vec[word]   # w2vec is already a KeyedVectors object, so index it directly
            count += 1
        except KeyError:         # skip tokens that are not in the vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

In [None]:
# empty array of shape (no. of tweets x 300) to store word2vec features
wordvec_arrays = np.zeros((len(df), 300))

for i, tweet in enumerate(df['cleaned_tweets']):
    wordvec_arrays[i, :] = word_vector(tweet.split())

In [None]:
wordvec_arrays.shape

In [None]:
from sklearn.model_selection import train_test_split

# split into train and test
y = df['label']
X_train_wv, X_test_wv, y_train_wv, y_test_wv = train_test_split(wordvec_arrays, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

In [None]:
model = LogisticRegression()

# Fit the model on the train dataset
model = model.fit(X_train_wv, y_train_wv)

# Make predictions on the test dataset
pred = model.predict(X_test_wv)

In [None]:
# evaluate the model with the F1 score (not plain accuracy)
print("F1 Score:", f1_score(y_test_wv, pred))
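
The F1 score printed above is the harmonic mean of precision and recall for the positive class (label 1). A minimal sketch of that calculation, using made-up labels and predictions rather than the model's actual output:

```python
# made-up labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# count true positives, false positives and false negatives for label 1
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision, "recall:", recall, "F1:", f1)
```

Because F1 ignores true negatives, it is often more informative than accuracy when one class is much rarer than the other.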