## WORD2VEC TEXT EMBEDDING

TIP: Word embedding is a technique that represents words as vectors in a continuous space, where words with similar meanings are mapped to nearby points.

---

This representation captures semantic relationships between words, which is useful for natural language processing tasks such as:

**Document classification:** Documents on similar topics tend to have similar embedding representations, for example when the vectors of their words are averaged.

**Machine translation:** When embeddings for different languages are aligned in a shared space, words with similar meanings map to nearby vectors.

**Question answering:** Word embeddings can be used to pick the best answer to a question based on the semantic similarity between the question and the answer choices.

There are various methods for creating word embeddings, including:

**Count-based methods:** These methods build word embeddings from how often words occur together in a corpus.

**Predictive methods:** These methods learn word embeddings by training a model to predict a word from its neighbors, or the neighbors from the word.

**Contextual methods:** These methods produce a different vector for the same word depending on the sentence in which it is used.

Some popular word embedding models include:

**Word2Vec:** This model uses a shallow neural network (the CBOW or skip-gram architecture) to learn word embeddings from local context windows.

**GloVe:** This model combines global matrix factorization and local context window methods to learn word embeddings.

**ELMo:** This model uses a deep bidirectional LSTM language model to learn word embeddings that are specific to the context in which they are used.

**BERT:** This model uses a transformer architecture to learn word embeddings that are specific to the context in which they are used.
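To make the count-based idea mentioned above concrete, here is a minimal sketch that builds raw co-occurrence vectors from a toy corpus. The corpus, the `cooccurrence_counts` helper, and the window size are all illustrative assumptions; real count-based models such as GloVe go further and factorize or reweight these counts.

```python
from collections import Counter

# Toy corpus (made up for illustration); each sentence is already tokenized.
corpus = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "great"],
    ["the", "movie", "was", "boring"],
]

def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, context) pair occurs within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(corpus, window=1)
vocab = sorted({w for s in corpus for w in s})

# Each word is represented by its row of raw co-occurrence counts.
vectors = {w: [counts[(w, c)] for c in vocab] for w in vocab}
print(vocab)
print(vectors["movie"], vectors["film"])
```

In this toy data, "movie" and "film" share the same neighbors ("the" and "was"), so their count vectors point in the same direction even though the two words never occur together.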

Word embeddings are a powerful tool for natural language processing and can improve performance on tasks such as document classification, machine translation, and question answering.
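The claim that similar words map to nearby points is usually checked with cosine similarity. The sketch below uses three hand-written 3-dimensional vectors (hypothetical values, not taken from any trained model) to show the computation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings; trained vectors typically have 100-300 dimensions.
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "terrible": [-0.7, 0.6, 0.1],
}

sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["terrible"])
print(round(sim_close, 3), round(sim_far, 3))
```

A value near 1 means the vectors point in almost the same direction, while a value at or below 0 means they are unrelated or opposed.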

## Data Preparation

In [4]:
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git


Collecting git+https://github.com/laxmimerit/preprocess_kgptalkie.git
  Cloning https://github.com/laxmimerit/preprocess_kgptalkie.git to /tmp/pip-req-build-lcs_xypq
  Running command git clone --filter=blob:none --quiet https://github.com/laxmimerit/preprocess_kgptalkie.git /tmp/pip-req-build-lcs_xypq
  Resolved https://github.com/laxmimerit/preprocess_kgptalkie.git to commit 96bf02872d9756f29d6cddb8aafaedcd2a39bbb4
  Preparing metadata (setup.py) ... done


In [5]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

import preprocess_kgptalkie as ps

In [6]:
df = pd.read_csv('/content/imdb_reviews.txt', sep="\t", header=None)
df.head()

Unnamed: 0,0,1
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [7]:
df.columns=['Reviews',' Sentiment']
df.head()

Unnamed: 0,Reviews,Sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
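As a possible shortcut, `pd.read_csv` can assign column names while reading through its `names` parameter, which folds the two cells above into one call. The sketch below uses two made-up rows in an in-memory buffer instead of the real file, so the sample text is an assumption.

```python
import io

import pandas as pd

# Two made-up rows standing in for the tab-separated review file.
sample = "A slow, aimless movie.\t0\nThe best scene was the ending.\t1\n"

reviews = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,                      # the file has no header row
    names=["Reviews", "Sentiment"],   # assign column names while reading
)
print(reviews)
```

Note that the notebook's own cell names the label column `' Sentiment'` with a leading space, so later lookups in the notebook must use that exact string.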


In [8]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [9]:
# Expand contractions, then strip e-mail addresses, HTML tags, and URLs.
df['Reviews'] = df['Reviews'].apply(ps.cont_exp)
df['Reviews'] = df['Reviews'].apply(ps.remove_emails)
df['Reviews'] = df['Reviews'].apply(ps.remove_html_tags)
df['Reviews'] = df['Reviews'].apply(ps.remove_urls)

# Drop special and accented characters, reduce words to their base form, and correct spelling.
df['Reviews'] = df['Reviews'].apply(ps.remove_special_chars)
df['Reviews'] = df['Reviews'].apply(ps.remove_accented_chars)
df['Reviews'] = df['Reviews'].apply(ps.make_base)
df['Reviews'] = df['Reviews'].apply(lambda x: ps.spelling_correction(x).raw_sentences[0])
df.head()



Unnamed: 0,Reviews,Sentiment
0,a very very very slowmoving aimless movie abou...,0
1,not sure who was more lose the flat character ...,0
2,attempt artless with black white and clever ca...,0
3,very little music or anything to speak of,0
4,the good scene in the movie was when Gerard is...,1
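For readers who cannot install `preprocess_kgptalkie`, part of the cleaning above can be approximated with the standard library. This is a rough sketch under stated assumptions: the regular expressions and the `clean` helper are written for illustration and are not the library's implementation, and contraction expansion, lemmatization, and spelling correction are left out.

```python
import html
import re

def remove_emails(text):
    # Simplified e-mail pattern; an assumption, not a full RFC 5322 matcher.
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", " ", text)

def remove_urls(text):
    return re.sub(r"(?:https?://|www\.)\S+", " ", text)

def remove_html_tags(text):
    # Replace tags with a space, then decode entities such as &amp;.
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def remove_special_chars(text):
    # Keep letters, digits, and whitespace, then collapse repeated whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def clean(text):
    for step in (remove_emails, remove_urls, remove_html_tags, remove_special_chars):
        text = step(text)
    return text.lower()

cleaned = clean("Great <b>movie</b>! Mail me@example.com or see https://example.com/x")
print(cleaned)
```

The order matters: e-mail addresses and URLs are removed before special characters, because stripping punctuation first would leave fragments such as `examplecom` behind.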


### Model Building

In [10]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m1.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


In [11]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [12]:
def get_vec(x):
    # Run the spaCy pipeline; by default doc.vector is the average of the token vectors.
    doc = nlp(x)
    vec = doc.vector
    return vec
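By default, spaCy's `doc.vector` is the average of the token vectors, which is how a whole review ends up as a single fixed-length vector. The sketch below imitates that averaging with a hypothetical 4-dimensional lookup table and whitespace tokenization. Both are simplifications, and treating unknown tokens as zeros is an assumption of this sketch.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors; en_core_web_lg uses 300 dimensions.
toy_table = {
    "very": np.array([1.0, 0.0, 2.0, 0.0]),
    "little": np.array([0.0, 2.0, 0.0, 2.0]),
    "music": np.array([2.0, 1.0, 1.0, 4.0]),
}

def average_vector(text, table, dim=4):
    """Average the vectors of whitespace tokens; unknown tokens count as zeros."""
    tokens = text.split()
    if not tokens:
        return np.zeros(dim)
    rows = [table.get(tok, np.zeros(dim)) for tok in tokens]
    return np.mean(rows, axis=0)

vec = average_vector("very little music", toy_table)
print(vec)
```

Averaging discards word order, so "not good, just bad" and "not bad, just good" would receive the same vector. That is one limitation of this kind of document representation for sentiment analysis.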

In [13]:
df['vec'] = df['Reviews'].apply(lambda x: get_vec(x)) # Vectorization

In [14]:
df.head()

Unnamed: 0,Reviews,Sentiment,vec
0,a very very very slowmoving aimless movie abou...,0,"[-2.037473, 1.9009953, -2.9871805, -1.7169145,..."
1,not sure who was more lose the flat character ...,0,"[-3.7171004, 1.0740024, -2.7910752, 0.02449170..."
2,attempt artless with black white and clever ca...,0,"[-2.9383492, 0.18677144, -2.3205724, -0.493188..."
3,very little music or anything to speak of,0,"[-1.9868913, 2.2468886, -4.532146, -2.550426, ..."
4,the good scene in the movie was when Gerard is...,1,"[-1.6121409, 2.763651, -2.7169354, -1.0312456,..."


In [15]:
df.shape

(748, 3)

In [16]:
X = df['vec'].to_numpy()
X = X.reshape(-1, 1)

In [17]:
X.shape

(748, 1)

In [18]:
X = np.concatenate(np.concatenate(X, axis = 0), axis = 0).reshape(-1, 300)
X.shape

(748, 300)
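The reshape and double `np.concatenate` above turn a column of per-row arrays into one 2-D matrix. A shorter route to the same result is `np.vstack`, sketched here on three made-up 4-dimensional vectors rather than the notebook's 300-dimensional ones.

```python
import numpy as np
import pandas as pd

# Three made-up 4-dimensional vectors standing in for the 300-dimensional ones.
toy = pd.DataFrame({"vec": [np.arange(4.0), np.ones(4), np.zeros(4)]})

# Stack the per-row arrays directly into a (n_samples, n_features) matrix.
X_stacked = np.vstack(list(toy["vec"]))

# The notebook's approach, reproduced for comparison.
column = toy["vec"].to_numpy().reshape(-1, 1)
X_notebook = np.concatenate(np.concatenate(column, axis=0), axis=0).reshape(-1, 4)

print(X_stacked.shape, np.array_equal(X_stacked, X_notebook))
```

`np.vstack` infers the feature dimension from the vectors themselves, so the hard-coded `300` in the reshape is no longer needed.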

In [19]:
y = df[' Sentiment']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 21, stratify = y)

In [21]:
X_train.shape, X_test.shape

((598, 300), (150, 300))
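The `stratify = y` argument keeps the class proportions the same in the training and test sets. The following sketch shows the effect on synthetic, deliberately imbalanced data; the arrays and the 30% positive rate are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with 5 features, 30% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=21, stratify=y_demo
)

# With stratification, both splits keep roughly the original 30% positive rate.
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(X_tr.shape, X_te.shape, train_ratio, test_ratio)
```

Without `stratify`, a small test set can end up with a noticeably different class balance by chance, which makes the reported metrics harder to compare across runs.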

## Model Training

### Logistic Regression

In [23]:
clf=LogisticRegression(solver = 'liblinear')

In [24]:
clf.fit(X_train, y_train)

In [25]:
y_pred=clf.predict(X_test)

In [26]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.73      0.74        73
           1       0.75      0.78      0.76        77

    accuracy                           0.75       150
   macro avg       0.75      0.75      0.75       150
weighted avg       0.75      0.75      0.75       150
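To see how the columns of the report relate to each other, the sketch below recomputes precision, recall, and F1 from a confusion matrix. The counts are not taken from the notebook, which does not print its confusion matrix; they are one set of values that reproduces the rounded figures in the report above.

```python
# Illustrative confusion counts keyed by (true class, predicted class).
confusion = {
    (0, 0): 53, (0, 1): 20,
    (1, 0): 17, (1, 1): 60,
}

def per_class_metrics(confusion, label, labels=(0, 1)):
    """Return (precision, recall, f1, support) for one class."""
    tp = confusion[(label, label)]
    fn = sum(confusion[(label, other)] for other in labels if other != label)
    fp = sum(confusion[(other, label)] for other in labels if other != label)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, tp + fn

metrics = {label: per_class_metrics(confusion, label) for label in (0, 1)}
accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / sum(confusion.values())

for label, (p, r, f1, support) in metrics.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)
print("accuracy", round(accuracy, 2))
```

Precision asks how many predicted members of a class were correct, recall asks how many true members were found, and F1 is their harmonic mean.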



### SVC

In [27]:
from sklearn.svm import LinearSVC

In [28]:
clf=LinearSVC()

In [29]:
clf.fit(X_train, y_train)



In [31]:
y_pred= clf.predict(X_test)

In [33]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.72      0.68      0.70        73
           1       0.72      0.75      0.73        77

    accuracy                           0.72       150
   macro avg       0.72      0.72      0.72       150
weighted avg       0.72      0.72      0.72       150



### Grid Search Cross Validation

In [35]:
from sklearn.model_selection import GridSearchCV

In [34]:
logisticReg= LogisticRegression(solver="liblinear")

In [36]:
hyperparameters = {
    'penalty': ['l1', 'l2'],
    'C': (1, 2, 3, 4)
}

In [37]:
clf = GridSearchCV(logisticReg, hyperparameters, n_jobs=-1, cv = 5)

In [38]:
clf.fit(X_train, y_train)

In [39]:
clf.best_params_

{'C': 1, 'penalty': 'l1'}

In [40]:
clf.best_estimator_

In [41]:
clf.best_score_

0.7708823529411766

In [42]:
y_pred = clf.predict(X_test)

In [43]:
print( classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.73      0.74        73
           1       0.75      0.78      0.76        77

    accuracy                           0.75       150
   macro avg       0.75      0.75      0.75       150
weighted avg       0.75      0.75      0.75       150
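`best_score_` is the mean cross-validated score of the best parameter combination, computed on folds of the training data, so it need not match the held-out test accuracy in the report above. The sketch below runs the same kind of search on synthetic data from `make_classification` (an assumption, used only to keep the example self-contained) and shows where the per-candidate scores live.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook uses the 300-dimensional review vectors.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# liblinear supports both the l1 and l2 penalties.
param_grid = {"penalty": ["l1", "l2"], "C": [1, 2, 3, 4]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_demo, y_demo)

# One entry per combination: 2 penalties x 4 values of C = 8 candidates.
n_candidates = len(search.cv_results_["params"])
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
print("best:", search.best_params_, round(search.best_score_, 3))
```

Looking at `mean_test_score` for every candidate, rather than only `best_params_`, shows whether the winning setting is clearly better or only marginally ahead of the others.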



In [44]:
import pickle

In [45]:
# Use a context manager so the file is closed once the model is written.
with open('w2v_sentiment.pkl', 'wb') as f:
    pickle.dump(clf, f)
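A saved model is only useful if it can be loaded again. The sketch below shows a full save and load round trip with a small stand-in model and a temporary file; the toy data and the `demo_model.pkl` file name are assumptions, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model trained on made-up one-dimensional data.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
model = LogisticRegression(solver="liblinear").fit(X_demo, y_demo)

# Write the model to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

New text has to pass through the same cleaning and vectorization steps before it is given to the restored model. Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.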