This notebook uses pre-trained word2vec embeddings for sentiment analysis of tweets.

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

In [2]:
data = pd.read_csv(r"tweets.csv")
data.head()

Unnamed: 0,id,label,tweet
0,1,0,#fingerprint #Pregnancy Test https://goo.gl/h1...
1,2,0,Finally a transparant silicon case ^^ Thanks t...
2,3,0,We love this! Would you go? #talk #makememorie...
3,4,0,I'm wired I know I'm George I was made that wa...
4,5,1,What amazing service! Apple won't even talk to...


In [3]:
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
def clean_tweet(x):
  x = re.sub(r"@\w+", "", x) # remove user mentions
  x = re.sub(r"#", "", x) # drop the '#' symbol but keep the hashtag text
  x = re.sub(r"http\S+", "", x) # remove links
  x = re.sub(r"'s\b", "", x) # remove possessive 's (raw string, so \b is a word boundary)
  x = re.sub(r"[^a-zA-Z]", " ", x) # keep only alphabetic characters
  x = x.lower() # lowercase
  tokens = [i for i in x.split() if i not in stop_words] # remove stopwords
  return " ".join(tokens)
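
A regex detail that matters when cleaning text this way: in an ordinary Python string, `\b` is the backspace character, so a word-boundary pattern only works as intended when written as a raw string. The short sketch below illustrates the difference on a made-up sample sentence (the sentence is an assumption for illustration, not taken from the dataset).

```python
import re

text = "Apple's new phone"  # hypothetical sample text

# Without the r prefix, "\b" is a backspace character, so the pattern
# looks for "'s" followed by a backspace and finds nothing.
unchanged = re.sub("'s\b", "", text)

# With a raw string, \b reaches the regex engine as a word boundary.
fixed = re.sub(r"'s\b", "", text)

print(unchanged)  # Apple's new phone
print(fixed)      # Apple new phone
```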

In [5]:
data["tweet"] = data["tweet"].apply(clean_tweet)

**Using Google's word2vec**

In [None]:
#downloading and extracting word2vec
! wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
! gunzip GoogleNews-vectors-negative300.bin.gz

In [7]:
from gensim.models import KeyedVectors

# path of the downloaded model
filename = 'GoogleNews-vectors-negative300.bin'

# load into gensim
w2vec = KeyedVectors.load_word2vec_format(filename, binary=True)

**Contextual Relationship Between Words**

One of the impressive things about word2vec is its ability to capture semantic relationships between words. This is why you can do arithmetic on word vectors and get a meaningful result. Consider the following example:

airplane - fly + drive = car

If you pass the left-hand side of the above equation to the model, the word closest to the resulting vector should be the right-hand side. This makes sense: what would you get if you removed the ability to fly from an airplane and added the ability to drive? You would get a car!
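
To make the arithmetic concrete, here is a minimal sketch using hand-written 3-dimensional toy vectors rather than the real 300-dimensional model. The vectors, the `toy_vectors` dictionary, and the `analogy` helper are all hypothetical and exist only for illustration. The helper adds and subtracts the vectors, then returns the remaining word with the highest cosine similarity to the result.

```python
import numpy as np

# Hypothetical toy embeddings (axes: flies, drives, is-a-vehicle).
# Real word2vec vectors are learned from text, not written by hand.
toy_vectors = {
  "airplane": np.array([1.0, 0.0, 1.0]),
  "car":      np.array([0.0, 1.0, 1.0]),
  "boat":     np.array([0.0, 0.0, 1.0]),
  "fly":      np.array([1.0, 0.0, 0.0]),
  "drive":    np.array([0.0, 1.0, 0.0]),
}

def cosine(a, b):
  # cosine similarity between two vectors
  return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(positive, negative, vectors):
  # query = sum of positive vectors minus sum of negative vectors
  query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
  # the input words themselves are excluded from the candidates
  exclude = set(positive) | set(negative)
  candidates = [w for w in vectors if w not in exclude]
  return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy(["airplane", "drive"], ["fly"], toy_vectors)
print(result)  # car
```

With the loaded gensim model, the corresponding query would be `w2vec.most_similar(positive=["airplane", "drive"], negative=["fly"])`. Note that gensim normalizes the vectors before combining them, so it is not identical to the raw arithmetic in this sketch.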

**Text Classification using Word2Vec**

word2vec gives a vector representation of individual words. To get a representation for a sentence or a document, you can take the mean of the vectors of its constituent words.

Each pre-trained word2vec embedding used here has length 300.

In [8]:
# function to get the vector representation of a tweet
def convert_to_vec(x):
  vec = np.zeros(300)
  count = 0
  for token in x.split():
    try:
      vec += w2vec[token] # direct lookup on KeyedVectors; raises KeyError for unknown words
      count += 1
    except KeyError:
      continue
  if count != 0:
    vec = vec / count # mean of the vectors found; stays all zeros if none were found
  return vec

In [9]:
wordvec_arrays = np.zeros((data.shape[0], 300))

for i, j in enumerate(data.tweet):
  wordvec_arrays[i, :] = convert_to_vec(j)

wordvec_arrays.shape

(7920, 300)

In [10]:
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(wordvec_arrays, data.label, test_size=0.2, random_state=42)

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

lr = LogisticRegression()

lr.fit(train_x, train_y)

y_predict = lr.predict(test_x)

In [12]:
f1_score(test_y, y_predict) # scikit-learn expects (y_true, y_pred)

0.747225647348952