Word2Vec, as its name suggests, is a method for converting a word into a vector.

On a small dataset, TF-IDF features may be sufficient, but on a large dataset Word2Vec embeddings are usually the better choice.

# 1. Word Vectors with spaCy

In [10]:
import spacy
# !python -m spacy download en_core_web_lg
import en_core_web_lg

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# !pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git --upgrade --force-reinstall
import preprocess_kgptalkie as ps

!python -m textblob.download_corpora

In [11]:
nlp = en_core_web_lg.load() # load the pre-trained spaCy model

In [14]:
x = 'dog cat lion dsfa'

doc = nlp(x)

In [16]:
for token in doc:
  print(token.text,token.has_vector,token.vector_norm)

dog True 7.0336733
cat True 6.6808186
lion True 6.5120897
dsfa False 0.0


The pre-trained model has vectors for the first three words. The made-up token 'dsfa' is not in its vocabulary, so it has no vector and its norm is 0.
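
The `vector_norm` column is the L2 (Euclidean) length of a token's vector, which is why a token without a vector shows 0.0. Here is a minimal NumPy sketch of that calculation. The 4-dimensional vectors are made up for illustration (the real `en_core_web_lg` vectors have 300 dimensions), and `l2_norm` is a hypothetical helper, not a spaCy function.

```python
import numpy as np

# Hypothetical toy vectors; real spaCy vectors come from the loaded model.
toy_vectors = {
    "dog": np.array([3.0, 4.0, 0.0, 0.0]),
    "dsfa": np.zeros(4),  # stand-in for an out-of-vocabulary token
}

def l2_norm(vec):
    """Euclidean length of a vector."""
    return float(np.sqrt(np.sum(vec ** 2)))

norms = {word: l2_norm(vec) for word, vec in toy_vectors.items()}
for word, norm in norms.items():
    # Mirrors the (text, has_vector, vector_norm) columns printed above
    print(word, norm > 0, norm)
```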

## Semantic Similarity with spaCy

In [17]:
x

'dog cat lion dsfa'

In [18]:
for token1 in doc:
  for token2 in doc:
    print(token1.text,token2.text,token1.similarity(token2))

dog dog 1.0
dog cat 0.80168545
dog lion 0.47424486
dog dsfa 0.0
cat dog 0.80168545
cat cat 1.0
cat lion 0.5265438
cat dsfa 0.0
lion dog 0.47424486
lion cat 0.5265438
lion lion 1.0
lion dsfa 0.0
dsfa dog 0.0
dsfa cat 0.0
dsfa lion 0.0
dsfa dsfa 1.0

Dog and cat are strongly similar (about 0.80), as both are common house pets.

Cat is more similar to lion (about 0.53) than dog is (about 0.47).

As expected, 'dsfa' has zero similarity to every other word, since it has no vector.
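
spaCy documents its default similarity as the cosine similarity between vectors. The sketch below reimplements that idea in NumPy on made-up 3-dimensional vectors. `cosine_similarity` is a hypothetical helper written for this illustration, and returning 0.0 for a zero-length vector is an assumption chosen to match the 0.0 values seen for 'dsfa'.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))

# Hypothetical 3-dimensional vectors, for illustration only.
dog = np.array([1.0, 2.0, 0.0])
cat = np.array([2.0, 4.0, 0.0])   # same direction as dog
unknown = np.zeros(3)             # stand-in for a token without a vector

sim_same_direction = cosine_similarity(dog, cat)
sim_with_unknown = cosine_similarity(dog, unknown)
print(sim_same_direction, sim_with_unknown)
```

Cosine similarity depends only on direction, so two vectors pointing the same way score close to 1 regardless of their lengths. The `dsfa dsfa 1.0` row in the output presumably reflects a token compared with itself being treated as identical, rather than a cosine computed on a zero vector.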

# 2. Model building for word2vec

## 2.1. Data preparation

In [27]:
df = pd.read_csv('/content/drive/MyDrive/My learnings/NLP basics/Codes/IMDB Movie Reviews Sentiment Analysis/Copy of imdb_reviews.txt',sep='\t',header = None)
df.columns = ['reviews','sentiment']

In [28]:
df.head()

| | reviews | sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


Let's preprocess the data.

In [31]:
# Apply a series of preprocessing steps
df['reviews'] = df['reviews'].apply(lambda x: ps.cont_exp(x)) #you're -> you are; i'm -> i am
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_emails(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_html_tags(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_urls(x))

df['reviews'] = df['reviews'].apply(lambda x: ps.remove_special_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_accented_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.make_base(x)) #ran -> run
df['reviews'] = df['reviews'].apply(lambda x: ps.spelling_correction(x).raw_sentences[0]) #seplling -> spelling

In [33]:
df.head()

| | reviews | sentiment |
|---|---|---|
| 0 | a very very very slowmove aimless movie about ... | 0 |
| 1 | not sure who was more lose the flat character ... | 0 |
| 2 | attempt artless with black white and clever ca... | 0 |
| 3 | very little music or anything to speak of | 0 |
| 4 | the good scene in the movie was when Gerard is... | 1 |


## 2.2. Model building

In [34]:
def get_vec(x):
  doc = nlp(x)
  vec = doc.vector
  return vec

In [35]:
df['vec'] = df['reviews'].apply(lambda x:get_vec(x))

In [36]:
df.head()

| | reviews | sentiment | vec |
|---|---|---|---|
| 0 | a very very very slowmove aimless movie about ... | 0 | [-0.074153, 0.11350991, -0.23838478, 0.1394247... |
| 1 | not sure who was more lose the flat character ... | 0 | [0.062192187, 0.1952087, -0.14579107, -0.00481... |
| 2 | attempt artless with black white and clever ca... | 0 | [-0.19790795, 0.015133962, -0.107922316, -0.06... |
| 3 | very little music or anything to speak of | 0 | [-0.09093174, 0.25162372, -0.25681874, 0.15846... |
| 4 | the good scene in the movie was when Gerard is... | 1 | [0.064886056, 0.13270056, -0.15480983, -0.0207... |
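
According to the spaCy documentation, `doc.vector` defaults to the average of the token vectors, which is how `get_vec` turns a review of any length into one fixed-length vector. A minimal sketch of that averaging on made-up 3-dimensional token vectors (the values are hypothetical, and the real vectors have 300 dimensions):

```python
import numpy as np

# Hypothetical 3-dimensional token vectors for a three-word review.
token_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
])

# Averaging over the token axis gives one fixed-length vector per review,
# regardless of how many tokens the review contains.
review_vector = token_vectors.mean(axis=0)
print(review_vector)
```

Averaging discards word order, so two reviews with the same words in a different order get the same vector.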


In [38]:
X = df['vec'].to_numpy()
X = X.reshape(-1,1)

In [39]:
X.shape

(748, 1)

In [40]:
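# Each row of X holds one 300-dimensional vector: flatten them all, then reshape to (n_reviews, 300)
X = np.concatenate(np.concatenate(X,axis = 0),axis = 0).reshape(-1,300)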

In [41]:
X.shape

(748, 300)
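
A shorter way to build the same kind of feature matrix is to stack the per-review vectors directly with `np.vstack`. The sketch below uses a small object array as a hypothetical stand-in for `df['vec'].to_numpy()`, with made-up 3-dimensional vectors:

```python
import numpy as np

# Hypothetical stand-in for df['vec'].to_numpy(): an object array
# in which every element is one 1-D vector.
column = np.empty(2, dtype=object)
column[0] = np.array([1.0, 2.0, 3.0])
column[1] = np.array([4.0, 5.0, 6.0])

# Stack the vectors row by row into a 2-D feature matrix.
X_toy = np.vstack(list(column))
print(X_toy.shape)
```

Applied to the notebook's data, the equivalent call would presumably be `np.vstack(df['vec'].tolist())`, which avoids hard-coding the 300 in the reshape.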

In [42]:
y = df['sentiment']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2, random_state = 0 ,stratify = y)

X_train.shape,X_test.shape

((598, 300), (150, 300))
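
Passing `stratify = y` asks for a split that keeps the share of positive and negative reviews roughly the same in the train and test sets. The sketch below illustrates the idea with a hand-written helper on made-up labels. `stratified_split` is hypothetical and is not how scikit-learn implements it internally.

```python
import numpy as np

def stratified_split(y, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    test_parts = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)   # positions of this class
        rng.shuffle(idx)                   # shuffle within the class
        n_test = int(round(test_size * len(idx)))
        test_parts.append(idx[:n_test])    # take the same fraction from each class
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

# Hypothetical labels: 60 negatives and 40 positives.
y_toy = np.array([0] * 60 + [1] * 40)
train_idx, test_idx = stratified_split(y_toy)

# Share of positives in each part
print(y_toy[train_idx].mean(), y_toy[test_idx].mean())
```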

# 3. Model Training and testing

## 3.1. Logistic Regression model

In [43]:
clf = LogisticRegression(solver = 'liblinear')

In [44]:
clf.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [48]:
y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.79      0.82      0.81        73
           1       0.82      0.79      0.81        77

    accuracy                           0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150



The logistic regression model reaches 81% accuracy on the 150-review test set, a good result for such a simple model.
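
For reference, the columns of the classification report are computed from the counts of correct and incorrect predictions. The sketch below works through the formulas for one class. The counts are hypothetical and are not taken from this notebook.

```python
# Hypothetical counts for one class of a binary classifier (not from the notebook).
tp, fp, fn, tn = 8, 2, 4, 6

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)          # share of all predictions that were right

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

The `support` column is simply the number of true examples of each class in the test set.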

In [47]:
# Save the model
import pickle as pkl

# Use a context manager so the file is closed once the model is written
with open('lr_w2v_sentiment.pkl','wb') as f:
  pkl.dump(clf,f)
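
Loading the model back later is the mirror image of saving it. The sketch below shows a save and load round trip. Since the fitted classifier is not available outside the notebook, a plain dictionary is used as a hypothetical stand-in, and the file is written to a temporary directory.

```python
import os
import pickle as pkl
import tempfile

# Stand-in for the fitted classifier; any picklable object behaves the same way.
model_stand_in = {"solver": "liblinear", "C": 1.0}

path = os.path.join(tempfile.mkdtemp(), "lr_w2v_sentiment.pkl")

# Save
with open(path, "wb") as f:
    pkl.dump(model_stand_in, f)

# Load
with open(path, "rb") as f:
    restored = pkl.load(f)

print(restored == model_stand_in)
```

Only unpickle files from a trusted source, and ideally load a scikit-learn model with the same library version that saved it.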

## 3.2. SVM model

In [50]:
from sklearn.svm import LinearSVC

In [51]:
clf = LinearSVC()

In [52]:
clf.fit(X_train,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [53]:
y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.78      0.81      0.79        73
           1       0.81      0.78      0.79        77

    accuracy                           0.79       150
   macro avg       0.79      0.79      0.79       150
weighted avg       0.79      0.79      0.79       150



# 4. Hyperparameter tuning with grid search cross-validation

In [55]:
lr = LogisticRegression(solver = 'liblinear')

In [56]:
hyperparameters = {
    'penalty': ['l1','l2'],
    'C':(1,2,3,4)
}

In [57]:
clf = GridSearchCV(lr, hyperparameters,n_jobs = -1, cv = 5)
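
Grid search tries every combination of the listed values and evaluates each one with cross-validation. The sketch below enumerates the combinations for the grid above using only the standard library, to show how many models get trained. The variable names `candidates` and `n_fits` are just for illustration.

```python
from itertools import product

hyperparameters = {
    'penalty': ['l1', 'l2'],
    'C': (1, 2, 3, 4),
}
cv = 5

# Every combination of the listed values is one candidate model.
names = list(hyperparameters)
candidates = [dict(zip(names, values))
              for values in product(*(hyperparameters[n] for n in names))]

n_fits = len(candidates) * cv  # one fit per candidate per fold
print(len(candidates), n_fits)
```

With `refit=True` (the default), the best candidate is then fitted once more on the whole training set, and that refitted model is what `clf.predict` uses.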

In [58]:
%%time
clf.fit(X_train,y_train)

CPU times: user 244 ms, sys: 98.6 ms, total: 343 ms
Wall time: 3.79 s


GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': (1, 2, 3, 4), 'penalty': ['l1', 'l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [59]:
clf.best_params_

{'C': 4, 'penalty': 'l2'}

In [60]:
clf.best_score_

0.8361064425770308

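The best mean cross-validated accuracy, averaged over the 5 folds of the training set, is about 83.6%. Note that this is not the accuracy on the training set itself.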

In [61]:
y_pred = clf.predict(X_test)

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.82      0.81        73
           1       0.83      0.81      0.82        77

    accuracy                           0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150



On the test set, accuracy is still 81%, the same as the untuned logistic regression model.

## 4.1. Test other machine learning models using lazypredict

In [None]:
!pip install lazypredict

In [63]:
from lazypredict.Supervised import LazyClassifier

In [65]:
clf = LazyClassifier(verbose = 0, ignore_warnings = False, custom_metric = None)

In [67]:
# %%time
models, predictions = clf.fit(X_train,y_train,X_test,y_test)

ImportError: ignored