### WORD EMBEDDINGS ###
In NLP, a word embedding is a representation of a word as a vector in a continuous vector space, where semantically similar words are mapped to nearby points. This lets the vectors capture semantic relationships between words, such as synonymy, and makes them useful for NLP tasks such as text classification, sentiment analysis, and machine translation.

There are 2 main families of word representations:
1. **Count or frequency**: This method uses the frequency of words in a corpus to create a vector representation. The most common method is the **Bag of Words (BoW)** model, which creates a vector for each document based on the frequency of each word in the document. Another method is **Term Frequency-Inverse Document Frequency (TF-IDF)**, which weights the frequency of a word in a document by how rare that word is across the entire corpus. Both produce sparse vectors with one dimension per vocabulary word.
2. **Deep learning**: This method learns a dense vector representation of each word from a large corpus of text. The most common methods are **Word2Vec**, **GloVe**, and **FastText**. Word2Vec and FastText train a shallow neural network; GloVe is fit on global co-occurrence counts instead, but is usually grouped here because it also produces dense vectors that reflect the relationships between words.
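
To make the count-based family concrete, here is a minimal sketch in plain Python. The three-document corpus is hypothetical, and the TF-IDF formula used (`tf = count / document length`, `idf = log(N / df)`) is only one common variant; libraries such as scikit-learn apply additional smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

def bow_vector(tokens):
    """Bag of Words: raw count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def tfidf_vector(tokens):
    """TF-IDF with tf = count / len(doc) and idf = log(N / df)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for word in vocab:
        tf = counts[word] / len(tokens)
        df = sum(1 for doc in tokenized if word in doc)  # documents containing the word
        vector.append(tf * math.log(n_docs / df))
    return vector

bow = bow_vector(tokenized[0])
tfidf = tfidf_vector(tokenized[0])
print(dict(zip(vocab, bow)))
print({word: round(score, 3) for word, score in zip(vocab, tfidf)})
```

In the first document "the" has the highest raw count, but "cat" gets the higher TF-IDF score because it appears in only one document. Both vectors have one entry per vocabulary word, which is why they become large and sparse on real corpora.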

#### WORD2VEC ####
- Word2Vec is a popular word embedding technique that uses a shallow neural network to learn the vector representation of words. It was developed by a team of researchers at Google led by Tomas Mikolov in 2013. Word2Vec uses two main architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
    - **CBOW**: The CBOW model predicts the target word based on the context words surrounding it. It takes a set of context words as input and tries to predict the target word in the center. For example, given the context words "the", "cat", and "sat", the model would try to predict the target word "on".
    - **Skip-Gram**: The Skip-Gram model does the opposite of CBOW. It takes a target word as input and tries to predict the context words surrounding it. For example, given the target word "on", the model would try to predict the context words "the", "cat", and "sat".
- Word2Vec vectors are usually compared with cosine similarity. Cosine similarity is the cosine of the angle between two vectors in a multi-dimensional space; it ranges from -1 to 1 and ignores the lengths of the vectors. A cosine similarity of 1 indicates that the two vectors point in the same direction, 0 indicates that they are orthogonal (unrelated), and -1 indicates that they point in opposite directions.
- Word2Vec is trained using a large corpus of text, and the resulting word vectors can be used for various NLP tasks such as text classification, sentiment analysis, and machine translation. The word vectors can also be used to find similar words, perform analogies, and visualize the relationships between words in a multi-dimensional space.
- Advantages of Word2Vec:
    - Sparse matrix --> Dense matrix: Word2Vec creates dense vector representations of words, which are more efficient to store and process than sparse matrices.
    - Semantic information is captured: Word2Vec captures the semantic relationships between words, so related words such as synonyms end up close together in the vector space.
    - Vocab size --> fixed set of dimensions: Word2Vec creates a fixed-size vector representation for each word, which makes it easier to work with in machine learning models.
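    - Out-of-vocabulary words (a limitation, not an advantage): Word2Vec learns one vector per word seen during training and cannot represent unseen words. FastText addresses this by building word vectors from subword (character n-gram) information, which allows rare or unseen words to be represented in the vector space.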
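
The cosine similarity described above can be sketched in a few lines of plain Python. The example vectors are made up for illustration; real word vectors have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors of different lengths: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Vectors pointing in opposite directions: similarity close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` are not identical vectors, yet their similarity is 1 (up to floating-point rounding): cosine similarity measures direction only.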

##### CBOW #####
- Assume that we have a corpus: "The cat sat on the mat".
    - Select a window size of n. This means the context is up to n words before and n words after the target word (fewer near the edges of the sentence). In every case the output layer scores each word in the vocabulary, and training pushes the highest score toward the target. For example, if we select n=3:
        - Input 1: "cat", "sat", "on" -> Target: "The"
        - Input 2: "The", "sat", "on", "the" -> Target: "cat"
        - Input 3: "The", "cat", "on", "the", "mat" -> Target: "sat"
        - Input 4: "The", "cat", "sat", "the", "mat" -> Target: "on"
        - Input 5: "cat", "sat", "on", "mat" -> Target: "the"
        - Input 6: "sat", "on", "the" -> Target: "mat"
    - One-hot encoding: Each word in the vocabulary is represented as a one-hot vector, where the length of the vector is equal to the size of the vocabulary. For example, after lowercasing, the corpus above has 5 distinct words (the, cat, sat, on, mat), so the one-hot encoding for the word "cat" would be [0, 1, 0, 0, 0].
    - The input to the CBOW model is the one-hot encoding of the context words, and the training label is the one-hot encoding of the target word. The model outputs a probability for every word in the vocabulary and learns by adjusting the weights of the neural network so that the target word receives the highest probability.
- CBOW is a fully connected neural network with an input layer, a single hidden layer with no activation function, and a softmax output layer. The hidden layer has as many units as the desired vector dimension. After training, the weights connecting the input layer to the hidden layer are the word vectors, one row per word in the vocabulary.
- The CBOW model is trained on a large corpus of text, and the resulting word vectors are used in the same ways described for Word2Vec above.
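
The following is a minimal sketch of how CBOW training examples could be generated for the corpus above, assuming lowercased tokens and a symmetric window of n=3. The helper names `cbow_pairs` and `one_hot` are hypothetical, and the real Word2Vec implementation adds refinements (such as randomly shrinking the window) that are left out here.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def one_hot(word, vocab):
    """One-hot vector with a 1 at the word's vocabulary index."""
    return [1 if word == entry else 0 for entry in vocab]

tokens = "the cat sat on the mat".split()
# Vocabulary in order of first appearance: the, cat, sat, on, mat
vocab = list(dict.fromkeys(tokens))

pairs = cbow_pairs(tokens, window=3)
for context, target in pairs:
    print(context, "->", target)

# CBOW input for the target "on": the average of the context one-hot vectors.
context, target = pairs[3]
context_vectors = [one_hot(word, vocab) for word in context]
cbow_input = [sum(column) / len(context) for column in zip(*context_vectors)]
print(cbow_input, "->", one_hot(target, vocab))
```

Multiplying the averaged one-hot input by the input weight matrix is the same as averaging the context words' vectors, which is what the hidden layer holds before the output layer scores every vocabulary word.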
         

##### Skip-Gram #####
- The Skip-Gram model is the opposite of the CBOW model. It takes a target word as input and tries to predict the context words surrounding it. For example, given the target word "on", the model would try to predict the context words "the", "cat", and "sat".
- The Skip-Gram model is also a fully connected neural network with an input layer, a hidden layer, and a softmax output layer. The input layer takes the one-hot encoding of the target word, and the output layer produces a probability for every word in the vocabulary, which is trained against the one-hot encoding of each context word in turn. As in CBOW, the weights between the input layer and the hidden layer become the vector representation of the words in the vocabulary.
- Like CBOW, the Skip-Gram model is trained on a large corpus of text, and the resulting word vectors support the same downstream uses.
- The Skip-Gram model is particularly useful for learning word vectors for rare words: every (target, context) pair is a separate training example, so each occurrence of a rare word produces several updates to its vector. This makes it a powerful tool for NLP tasks where the vocabulary size is large and the frequency of words varies widely.
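
As a rough illustration of the training data Skip-Gram sees, the sketch below lists the (target, context) pairs for the example sentence. The function name `skipgram_pairs` and the window size of 2 are assumptions made for this example; the actual Word2Vec implementation also subsamples frequent words and varies the window, which is omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (target_word, context_word) pairs for a symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)

# Each occurrence of "on" yields one training example per neighbor.
on_pairs = [pair for pair in pairs if pair[0] == "on"]
print(on_pairs)
print(len(pairs), "pairs in total")
```

A single occurrence of "on" already produces four separate training examples, whereas CBOW would produce one for the same position.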

##### Conclusion #####
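- CBOW: faster to train, and slightly more accurate for frequent words.
- Skip-Gram: slower to train, but tends to give better vectors for rare words and when the corpus is small.
- To improve the quality of CBOW or Skip-Gram vectors:
    - Increase the size of the training dataset
    - Tune the window size and the vector dimension. These are independent hyperparameters: a larger window widens the context but does not by itself change the vector dimension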

#### Avg Word2Vec ####
- The average Word2Vec model is a simple yet effective way to create a fixed-size vector representation of a document or sentence by averaging the word vectors of the individual words in the document or sentence. This approach is particularly useful when dealing with variable-length documents or sentences, as it allows for the creation of a fixed-size vector representation that can be used in various NLP tasks such as text classification, sentiment analysis, and machine translation.
- The average Word2Vec model works by first looking up the vector of each word in the document or sentence in a trained Word2Vec model, skipping words that are not in its vocabulary. The word vectors are then averaged element-wise into a single vector with the same dimension as the word vectors. This keeps the semantic information of the individual words in a fixed-size representation that machine learning models can consume, but it discards word order.
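
A minimal sketch of the averaging step, using hypothetical 3-dimensional vectors in place of a trained model. The zero-vector fallback for documents with no known words is an assumption of this sketch, chosen so that every document still maps to a vector of the same size.

```python
# Hypothetical 3-dimensional word vectors, standing in for a trained model.
word_vectors = {
    "free":  [0.9, 0.1, 0.0],
    "prize": [0.7, 0.3, 0.2],
    "call":  [0.2, 0.8, 0.4],
}
DIMENSIONS = 3

def average_vector(tokens, vectors, dimensions):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dimensions
    return [sum(column) / len(known) for column in zip(*known)]

# Documents of different lengths all map to vectors of the same size.
short_doc = average_vector(["free", "prize"], word_vectors, DIMENSIONS)
long_doc = average_vector(["call", "free", "prize", "call", "xyzzy"], word_vectors, DIMENSIONS)
empty_doc = average_vector(["xyzzy"], word_vectors, DIMENSIONS)
print(short_doc, long_doc, empty_doc)
```

The out-of-vocabulary token "xyzzy" is skipped rather than counted, so it does not pull the average toward zero.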

### IMPLEMENTATION ###

In [2]:
import gensim
from gensim.models import Word2Vec, KeyedVectors

In [56]:
import os
from gensim.downloader import base_dir

print("Model cache path:", base_dir)

Model cache path: C:\Users\hoang/gensim-data


In [59]:
import os
import shutil
from gensim.downloader import base_dir

model_dir = os.path.join(base_dir, 'word2vec-google-news-300')

if os.path.exists(model_dir):
    shutil.rmtree(model_dir)
    print("Model 'word2vec-google-news-300' has been deleted.")
else:
    print("Model directory not found.")

Model 'word2vec-google-news-300' has been deleted.


In [5]:
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

In [6]:
vec_king = wv['king']
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [7]:
vec_king.shape

(300,)

In [8]:
wv.most_similar('cricket')

[('cricketing', 0.8372225761413574),
 ('cricketers', 0.8165745735168457),
 ('Test_cricket', 0.8094819188117981),
 ('Twenty##_cricket', 0.8068488240242004),
 ('Twenty##', 0.7624265551567078),
 ('Cricket', 0.75413978099823),
 ('cricketer', 0.7372578382492065),
 ('twenty##', 0.7316356897354126),
 ('T##_cricket', 0.7304614186286926),
 ('West_Indies_cricket', 0.6987985968589783)]

In [9]:
wv.most_similar('happy')

[('glad', 0.7408890724182129),
 ('pleased', 0.6632170677185059),
 ('ecstatic', 0.6626912355422974),
 ('overjoyed', 0.6599286794662476),
 ('thrilled', 0.6514049172401428),
 ('satisfied', 0.6437949538230896),
 ('proud', 0.636042058467865),
 ('delighted', 0.627237856388092),
 ('disappointed', 0.6269949674606323),
 ('excited', 0.6247665286064148)]

In [10]:
wv.similarity('hockey', 'sports')

0.53541523

In [11]:
vec = wv['king'] - wv['man'] + wv['woman']
vec

array([ 4.29687500e-02, -1.78222656e-01, -1.29089355e-01,  1.15234375e-01,
        2.68554688e-03, -1.02294922e-01,  1.95800781e-01, -1.79504395e-01,
        1.95312500e-02,  4.09919739e-01, -3.68164062e-01, -3.96484375e-01,
       -1.56738281e-01,  1.46484375e-03, -9.30175781e-02, -1.16455078e-01,
       -5.51757812e-02, -1.07574463e-01,  7.91015625e-02,  1.98974609e-01,
        2.38525391e-01,  6.34002686e-02, -2.17285156e-02,  0.00000000e+00,
        4.72412109e-02, -2.17773438e-01, -3.44726562e-01,  6.37207031e-02,
        3.16406250e-01, -1.97631836e-01,  8.59375000e-02, -8.11767578e-02,
       -3.71093750e-02,  3.15551758e-01, -3.41796875e-01, -4.68750000e-02,
        9.76562500e-02,  8.39843750e-02, -9.71679688e-02,  5.17578125e-02,
       -5.00488281e-02, -2.20947266e-01,  2.29492188e-01,  1.26403809e-01,
        2.49023438e-01,  2.09960938e-02, -1.09863281e-01,  5.81054688e-02,
       -3.35693359e-02,  1.29577637e-01,  2.41699219e-02,  3.48129272e-02,
       -2.60009766e-01,  

In [12]:
wv.most_similar([vec])

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
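
In the output above, 'king' ranks first because the query vector still lies closest to one of its own inputs. gensim's `most_similar(positive=['king', 'woman'], negative=['man'])` does similar arithmetic but leaves the input words out of the results. The sketch below reproduces that effect with hypothetical 2-dimensional vectors; it is a simplified illustration, not gensim's implementation (which also normalizes the vectors first).

```python
import math

# Hypothetical 2-dimensional vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.3],
    "queen": [0.9, -0.3],
    "man":   [0.1, 0.1],
    "woman": [0.1, -0.1],
    "apple": [-0.8, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(positive, negative, exclude_inputs=True):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    query = [0.0, 0.0]
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    skipped = set(positive) | set(negative) if exclude_inputs else set()
    scored = [(word, cosine(query, vec))
              for word, vec in vectors.items() if word not in skipped]
    return sorted(scored, key=lambda item: item[1], reverse=True)

unfiltered = analogy(["king", "woman"], ["man"], exclude_inputs=False)
filtered = analogy(["king", "woman"], ["man"])
print([word for word, _ in unfiltered])
print([word for word, _ in filtered])
```

With the inputs kept, "king" is the nearest neighbor and "queen" comes second, mirroring the notebook output; with the inputs excluded, "queen" comes first.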

##### SPAM CLASSIFICATION #####

In [13]:
import pandas as pd

# forward slash avoids the invalid '\S' escape sequence and works on every OS
messages = pd.read_csv('SpamClassifier/SMSSpamCollection', sep='\t', names=['label', 'message'])

In [14]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [15]:
import re
import nltk

In [16]:
from nltk.corpus import stopwords

In [17]:
# build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))

corpus = []
for text in messages['message']:
    # keep letters only, lowercase, and split into tokens
    tokens = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # drop stopwords and lemmatize what is left
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    corpus.append(' '.join(tokens))

In [18]:
[[i, j, k] for i, j, k in zip(list(map(len, corpus)), corpus, messages['message']) if i < 1]

[[0, '', 'What you doing?how are you?'],
 [0, '', 'Where @'],
 [0, '', '645'],
 [0, '', 'Can a not?'],
 [0, '', ':) '],
 [0, '', 'What you doing?how are you?'],
 [0, '', ':( but your not here....'],
 [0, '', ':-) :-)']]

In [19]:
corpus

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply',
 'u dun say early hor u c already say',
 'nah think go usf life around though',
 'freemsg hey darling week word back like fun still tb ok xxx std chgs send rcv',
 'even brother like speak treat like aid patent',
 'per request melle melle oru minnaminunginte nurungu vettam set callertune caller press copy friend callertune',
 'winner valued network customer selected receivea prize reward claim call claim code kl valid hour',
 'mobile month u r entitled update latest colour mobile camera free call mobile update co free',
 'gonna home soon want talk stuff anymore tonight k cried enough today',
 'six chance win cash pound txt csh send cost p day day tsandcs apply reply hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw'

In [20]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [21]:
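words = []
for doc in corpus:
    # empty messages produce no sentences, so they are dropped here
    for sent in sent_tokenize(doc):
        # simple_preprocess lowercases, tokenizes, and by default drops tokens shorter than 2 characters
        words.append(simple_preprocess(sent))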

In [22]:
words

[['go',
  'jurong',
  'point',
  'crazy',
  'available',
  'bugis',
  'great',
  'world',
  'la',
  'buffet',
  'cine',
  'got',
  'amore',
  'wat'],
 ['ok', 'lar', 'joking', 'wif', 'oni'],
 ['free',
  'entry',
  'wkly',
  'comp',
  'win',
  'fa',
  'cup',
  'final',
  'tkts',
  'st',
  'may',
  'text',
  'fa',
  'receive',
  'entry',
  'question',
  'std',
  'txt',
  'rate',
  'apply'],
 ['dun', 'say', 'early', 'hor', 'already', 'say'],
 ['nah', 'think', 'go', 'usf', 'life', 'around', 'though'],
 ['freemsg',
  'hey',
  'darling',
  'week',
  'word',
  'back',
  'like',
  'fun',
  'still',
  'tb',
  'ok',
  'xxx',
  'std',
  'chgs',
  'send',
  'rcv'],
 ['even', 'brother', 'like', 'speak', 'treat', 'like', 'aid', 'patent'],
 ['per',
  'request',
  'melle',
  'melle',
  'oru',
  'minnaminunginte',
  'nurungu',
  'vettam',
  'set',
  'callertune',
  'caller',
  'press',
  'copy',
  'friend',
  'callertune'],
 ['winner',
  'valued',
  'network',
  'customer',
  'selected',
  'receivea',
 

In [23]:
# train w2v from scratch
model = gensim.models.Word2Vec(words)

In [24]:
# get all the vocabulary
model.wv.index_to_key

['call',
 'get',
 'ur',
 'gt',
 'lt',
 'go',
 'day',
 'ok',
 'free',
 'know',
 'come',
 'like',
 'time',
 'good',
 'got',
 'love',
 'text',
 'want',
 'send',
 'need',
 'one',
 'txt',
 'today',
 'going',
 'stop',
 'home',
 'lor',
 'sorry',
 'see',
 'still',
 'mobile',
 'take',
 'back',
 'da',
 'reply',
 'dont',
 'think',
 'tell',
 'week',
 'phone',
 'hi',
 'new',
 'please',
 'later',
 'pls',
 'co',
 'msg',
 'min',
 'dear',
 'night',
 'make',
 'message',
 'well',
 'say',
 'thing',
 'much',
 'claim',
 'hope',
 'great',
 'oh',
 'hey',
 'give',
 'number',
 'happy',
 'friend',
 'wat',
 'work',
 'way',
 'yes',
 'www',
 'prize',
 'let',
 'right',
 'tomorrow',
 'already',
 'tone',
 'ask',
 'said',
 'win',
 'cash',
 'amp',
 'life',
 'yeah',
 'im',
 'really',
 'meet',
 'babe',
 'find',
 'miss',
 'morning',
 'last',
 'year',
 'service',
 'uk',
 'thanks',
 'care',
 'anything',
 'would',
 'com',
 'also',
 'nokia',
 'lol',
 'feel',
 'every',
 'keep',
 'pick',
 'sure',
 'urgent',
 'sent',
 'contact',


In [25]:
model.corpus_count

5564

In [26]:
model.wv.similar_by_word('good')

[('give', 0.9995896816253662),
 ('day', 0.9995685815811157),
 ('night', 0.9995494484901428),
 ('today', 0.9995375871658325),
 ('much', 0.99953693151474),
 ('get', 0.9995298981666565),
 ('got', 0.9995242953300476),
 ('going', 0.9995211362838745),
 ('said', 0.9995195269584656),
 ('go', 0.9995181560516357)]

In [27]:
model.wv['good'].shape

(100,)

In [28]:
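# avg w2v
def avg_w2v(doc):
    # keep in-vocabulary words only (dict lookup instead of a list scan)
    vectors = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    # np.mean of an empty list is NaN, so fall back to a zero vector
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)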

In [29]:
from tqdm import tqdm
import numpy as np

In [30]:
X = []
for i in tqdm(range(len(words))):
    X.append(avg_w2v(words[i]))

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
100%|██████████| 5564/5564 [00:00<00:00, 12588.07it/s]


In [31]:
len(X)

5564

In [32]:
X[0].shape

(100,)

In [33]:
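# keep labels only for messages whose cleaned text is non-empty, so y lines up with X
y = messages[[len(doc) > 0 for doc in corpus]]
# get_dummies orders columns alphabetically, so column 0 is 'ham': True = ham, False = spam
y = pd.get_dummies(y['label'])
y = y.iloc[:, 0].values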

In [34]:
y.shape

(5564,)

In [40]:
X[0].reshape(1, -1).shape

(1, 100)

In [45]:
import pandas as pd

dfs = []
for i in range(len(X)):
    dfs.append(pd.DataFrame(X[i].reshape(1, -1)))

df = pd.concat(dfs, ignore_index=True)




In [46]:
df.shape

(5564, 100)

In [47]:
# independent feature
X = df

In [48]:
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [50]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()

In [51]:
classifier.fit(X_train, y_train)

In [52]:
y_pred = classifier.predict(X_test)

In [54]:
from sklearn.metrics import accuracy_score, classification_report

print(accuracy_score(y_test, y_pred))

0.967654986522911


In [55]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.93      0.82      0.87       148
        True       0.97      0.99      0.98       965

    accuracy                           0.97      1113
   macro avg       0.95      0.90      0.93      1113
weighted avg       0.97      0.97      0.97      1113

