### `Word Embeddings`
* This notebook walks through word embeddings learned with deep learning. We will not train a new model; we will use pre-trained ones, because training from scratch is expensive.
* We will use `spacy` in this tutorial to demonstrate word embeddings.

``` bash
# Upgrade pip, install spacy, and download model
$ pip install -U pip setuptools wheel
$ pip install -U spacy
$ python -m spacy download en_core_web_md
```

---

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Light-blue colormap used to shade the similarity table
cmap = sns.light_palette('blue', as_cmap=True)

In [2]:
# Using spaCy model
nlp = spacy.load('en_core_web_md')

In [3]:
# Embedding size
embed_size = len(nlp('car').vector)
print('embed_size ->', embed_size)

# Inspect the first 10 values of the vector
nlp('car').vector[:10]

embed_size -> 300


array([ 4.5855 ,  2.4556 , -8.5233 , -6.0595 , -0.44879, -2.5409 ,
        4.3721 ,  1.4889 ,  4.6075 ,  6.7933 ], dtype=float32)

In [4]:
# Vectorize a few sample words
words = ['cat', 'dog', 'car', 'bird', 'eagle']
vectors = [nlp(word).vector for word in words]

In [5]:
# Get similarity
similarities = cosine_similarity(vectors, vectors)
pd.DataFrame(similarities, columns=words, index=words).style.background_gradient(cmap=cmap)

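|       | cat      | dog      | car      | bird     | eagle    |
|-------|----------|----------|----------|----------|----------|
| cat   | 1.0      | 0.822082 | 0.196986 | 0.536937 | 0.330381 |
| dog   | 0.822082 | 1.0      | 0.325002 | 0.45674  | 0.268694 |
| car   | 0.196986 | 0.325002 | 1.0      | 0.153305 | 0.069607 |
| bird  | 0.536937 | 0.45674  | 0.153305 | 1.0      | 0.623637 |
| eagle | 0.330381 | 0.268694 | 0.069607 | 0.623637 | 1.0      |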


* The `spacy` model returns a 300-dimensional vector for each word. The vectors are static and pre-trained (GloVe vectors in older releases of the model; the source can differ between model versions).
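
The similarity table above is built from cosine similarity, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$, which compares the direction of two vectors and ignores their length. A minimal sketch in plain Python, assuming made-up 3-dimensional vectors rather than real `spacy` output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumed values, for illustration only)
cat = [1.0, 2.0, 0.0]
dog = [2.0, 4.0, 0.0]   # same direction as `cat`, twice the length
car = [0.0, 0.0, 3.0]   # orthogonal to `cat`

print(cosine(cat, dog))  # same direction, so the similarity is 1 up to rounding
print(cosine(cat, car))  # no shared direction, so the similarity is 0
```

Because the length cancels out, `dog` scores as identical to `cat` here even though its values are twice as large.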

---

* `The same dataset we are working on`

In [6]:
import numpy as np
import pandas as pd
from collections import Counter
import random
from termcolor import colored
from tqdm.auto import tqdm

# sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

In [7]:
# Load dataset
data = fetch_20newsgroups(subset='test', remove=['headers', 'footers', 'quotes'],
                         categories=['rec.autos', 'comp.windows.x', 
                                     'soc.religion.christian', 'rec.sport.baseball'])

# Split to X & y
X = data.data
y = [data.target_names[i] for i in data.target]

# Split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

* `Vectorizing using spaCy`

In [8]:
# Pre-allocate arrays that hold one document vector per row
X_train_vect = np.zeros((len(X_train), embed_size))
X_test_vect = np.zeros((len(X_test), embed_size))

# Vectorize the training documents (nlp.pipe processes texts as a stream)
for i, doc in tqdm(enumerate(nlp.pipe(X_train)), total=len(X_train)):
    X_train_vect[i, :] = doc.vector

# Vectorize the test documents
for i, doc in tqdm(enumerate(nlp.pipe(X_test)), total=len(X_test)):
    X_test_vect[i, :] = doc.vector

100%|██████████| 1268/1268 [00:56<00:00, 22.25it/s]
100%|██████████| 318/318 [00:17<00:00, 17.80it/s]
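
By default, `doc.vector` in spaCy is the average of the document's token vectors, so every post is reduced to a single 300-dimensional point and word order is lost. A rough sketch of that averaging, assuming a hypothetical 2-dimensional lookup table (`toy_vectors` and `average_vector` are illustrative names, not spaCy API, and skipping unknown tokens is a simplification of this sketch):

```python
# Hypothetical 2-dimensional word vectors (assumed values, illustration only)
toy_vectors = {
    "fast": [1.0, 0.0],
    "red": [0.0, 1.0],
    "car": [2.0, 2.0],
}

def average_vector(tokens, table):
    """Average the vectors of the tokens found in `table`; unknown tokens are skipped."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        # No known token: fall back to a zero vector of the right size
        return [0.0] * len(next(iter(table.values())))
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

doc_vec = average_vector(["fast", "red", "car"], toy_vectors)
print(doc_vec)  # [1.0, 1.0]
```

Note that `["car", "red", "fast"]` would give the same result: an average does not depend on the order of the tokens.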


* `1. Train a Classifier`

In [9]:
# Linear SVC
clf = LinearSVC()
clf.fit(X_train_vect, y_train)

y_pred_test = clf.predict(X_test_vect)



In [10]:
# Report
print(classification_report(y_test, y_pred_test))

                        precision    recall  f1-score   support

        comp.windows.x       0.92      0.89      0.90        79
             rec.autos       0.87      0.90      0.88        79
    rec.sport.baseball       0.91      0.94      0.93        80
soc.religion.christian       0.97      0.95      0.96        80

              accuracy                           0.92       318
             macro avg       0.92      0.92      0.92       318
          weighted avg       0.92      0.92      0.92       318



In [11]:
accuracy_score(y_test, y_pred_test) # of a classifier

0.9182389937106918

----

* `2. Using Cosine Similarity Get Top Similar (as we did before)`

In [12]:
for i in random.choices(range(0, len(X_test)), k=5):
    print(f"ID: {i}")
    print("True label:", colored(y_test[i], 'green'))
    similarities = cosine_similarity(X_test_vect[i].reshape(1, embed_size), X_train_vect).flatten()
    # Higher similarity means closer, so sort in descending order
    indices = np.argsort(similarities)[::-1]
    for rank, j in enumerate(indices[:3]):
        print(f"{rank} nearest label is {colored(y_train[j], 'green' if y_train[j]==y_test[i] else 'red')}",
             f"similarity: {colored(round(similarities[j], 3), 'yellow')}")

ID: 163
True label: comp.windows.x
0 nearest label is comp.windows.x similarity: 0.913
1 nearest label is comp.windows.x similarity: 0.911
2 nearest label is comp.windows.x similarity: 0.906
ID: 5
True label: rec.sport.baseball
0 nearest label is rec.sport.baseball similarity: 0.979
1 nearest label is rec.sport.baseball similarity: 0.978
2 nearest label is rec.sport.baseball similarity: 0.978
ID: 68
True label: rec.autos
0 nearest label is rec.sport.baseball similarity: 0.967
1 nearest label is soc.religion.christian similarity: 0.964
2 nearest label is rec.sport.baseball similarity: 0.963
ID: 135
True label: soc.religion.christian
0 nearest label is soc.religion.christian similarity: 0.982
1 nearest label is soc.religion.christian similarity: 0.981
[output truncated]

In [13]:
# List that collects the predicted label for each test document
y_pred_test = []

# Loop over the entire test dataset
for i in range(len(X_test)):
    # Compute cosine similarity between the test instance and all training instances
    distances = cosine_similarity(X_test_vect[i].reshape(1, embed_size), X_train_vect).flatten()
    # Get the indices of the training instances sorted by similarity in descending order
    indices = np.argsort(distances)[::-1]
    # Get the labels of the three nearest neighbors
    nearest_labels = [y_train[j] for j in indices[:3]]
    # Determine the most common label among the three nearest neighbors
    y_pred_each = Counter(nearest_labels).most_common(1)[0][0]
    # Append to list
    y_pred_test.append(y_pred_each)

# Get accuracy score
acc = accuracy_score(y_test, y_pred_test)
print(f'Accuracy score using cosine similarity is: {acc*100:.3f} %') # using cosine similarity as a metric

Accuracy score using cosine similarity is: 74.214 %
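
The loop above is a 3-nearest-neighbour classifier with a majority vote. The same idea can be sketched as a small function on toy data; `knn_predict`, the sample points, and the shortened labels are hypothetical and only meant to show the mechanics:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_predict(query, train_vectors, train_labels, k=3):
    """Return the most common label among the k most similar training vectors."""
    sims = [cosine(query, vec) for vec in train_vectors]
    # Indices of the training vectors, most similar first
    order = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
    nearest = [train_labels[j] for j in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical 2-dimensional training set (assumed values)
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["autos", "autos", "baseball", "baseball"]

print(knn_predict([1.0, 0.0], train_vectors, train_labels))  # autos
```

With four classes and `k=3`, all three neighbours can carry different labels. `Counter` orders equal counts by first occurrence, so in that case the label of the single nearest neighbour is returned.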


---

* `3. Using Euclidean Distance for measuring similarity`

In [14]:
for i in random.choices(range(0, len(X_test)), k=5):
    print(f"ID: {i}")
    print("True label:", colored(y_test[i], 'green'))
    distances = euclidean_distances(X_test_vect[i].reshape(1, embed_size), X_train_vect).flatten()
    # Smaller distance means closer, so sort in ascending order
    indices = np.argsort(distances)
    for rank, j in enumerate(indices[:3]):
        print(f"{rank} nearest label is {colored(y_train[j], 'green' if y_train[j]==y_test[i] else 'red')}",
             f"distance: {colored(round(distances[j], 3), 'yellow')}")

ID: 138
True label: soc.religion.christian
0 nearest label is soc.religion.christian distance: 5.068
1 nearest label is soc.religion.christian distance: 6.055
2 nearest label is soc.religion.christian distance: 6.318
ID: 63
True label: rec.sport.baseball
0 nearest label is rec.sport.baseball distance: 6.581
1 nearest label is rec.sport.baseball distance: 10.307
2 nearest label is rec.sport.baseball distance: 10.581
ID: 241
True label: soc.religion.christian
0 nearest label is rec.sport.baseball distance: 5.626
1 nearest label is soc.religion.christian distance: 6.032
2 nearest label is comp.windows.x distance: 6.277
ID: 291
True label: soc.religion.christian
0 nearest label is soc.religion.christian distance: 6.368
[output truncated]

In [15]:
# List that collects the predicted label for each test document
y_pred_test = []

# Loop over the entire test dataset
for i in range(len(X_test)):
  
    # Compute euclidean_distances between the test instance and all training instances
    distances = euclidean_distances(X_test_vect[i].reshape(1, embed_size), X_train_vect).flatten() 
    # Get the indices of the training instances sorted by distance in ascending order
    indices = np.argsort(distances)
    # Get the labels of the three nearest neighbors
    nearest_labels = [y_train[j] for j in indices[:3]]
    # Determine the most common label among the three nearest neighbors
    y_pred_each = Counter(nearest_labels).most_common(1)[0][0]
    # Append to list
    y_pred_test.append(y_pred_each)

# Get accuracy score
acc = accuracy_score(y_test, y_pred_test)
print(f'Accuracy score using Euclidean distance is: {acc*100:.3f} %') # using Euclidean distance

Accuracy score using Euclidean distance is: 73.899 %
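
One likely reason the two scores are so close: for vectors scaled to unit length, the two measures are tied together by $\lVert u - v \rVert^2 = 2 - 2\cos(u, v)$, so they rank neighbours identically. The document vectors here are not normalized, which leaves room for the small difference. A quick numerical check of the identity, assuming two arbitrary toy vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Arbitrary toy vectors (assumed values), scaled to unit length
u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])

lhs = euclidean(u, v) ** 2      # squared Euclidean distance
rhs = 2 - 2 * cosine(u, v)      # same quantity expressed through cosine similarity
print(round(lhs, 6), round(rhs, 6))  # 0.08 0.08
```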


---

* `Conclusion`

- Word embeddings are a powerful feature, especially when you have little data: your model reuses what the word2vec or GloVe model has already learned and can therefore make better predictions.
- Word2vec and GloVe do not account for context: a word gets the same vector in every sentence, whatever it means there.
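
To illustrate the second point: a static embedding is a lookup table keyed by the word alone, so the surrounding sentence cannot change the result. A toy sketch, where the table, its values, and the `embed` helper are all made up for illustration:

```python
# Hypothetical static embedding table (assumed values)
static_table = {
    "bank": [0.2, 0.7],
    "river": [0.9, 0.1],
    "money": [0.1, 0.9],
}

def embed(sentence, table):
    """Look up each known word on its own; the neighbouring words are never consulted."""
    return {w: table[w] for w in sentence.lower().split() if w in table}

river_sense = embed("river bank", static_table)
money_sense = embed("money bank", static_table)

# "bank" gets the same vector in both sentences
print(river_sense["bank"] == money_sense["bank"])  # True
```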

----