## Feature Extraction

To feed text to a model, we first need to transform it into numerical features. This notebook shows how to build a bag-of-words model from text so that it can be used later in different applications.

--------------

### Bag of words

A bag-of-words model counts how many times each word occurs in each document of the corpus, ignoring word order.

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
# CountVectorizer (bag of words)
texts = ['the red dog', 'cat eats dog', 'dog eats food',
         'red cat eats', 'the hot dog']

vectorizer = CountVectorizer()
vectorizer.fit(texts)                      # learn the vocabulary
x = vectorizer.transform(texts)            # sparse document-term count matrix
columns = vectorizer.get_feature_names_out()
pd.DataFrame(x.toarray(), columns=columns, index=texts)

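| | cat | dog | eats | food | hot | red | the |
|---|---|---|---|---|---|---|---|
| the red dog | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| cat eats dog | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| dog eats food | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| red cat eats | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| the hot dog | 0 | 1 | 0 | 0 | 1 | 0 | 1 |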
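
To make the counting step concrete, here is a minimal pure-Python sketch of the same idea. It assumes simple whitespace tokenization, which is a simplification of what `CountVectorizer` does, so treat it as an illustration rather than a reimplementation.

```python
from collections import Counter

texts = ['the red dog', 'cat eats dog', 'dog eats food',
         'red cat eats', 'the hot dog']

# Tokenize on whitespace (an assumption made for this sketch;
# CountVectorizer uses a regular-expression tokenizer instead).
tokenized = [text.lower().split() for text in texts]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each document becomes a vector of per-word counts in vocabulary order.
bow = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
for text, row in zip(texts, bow):
    print(f'{text:15} {row}')
```

On this toy corpus the rows should line up with the table above, since every token is a plain lowercase word.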


--------------

### Stop-words

Stop-words are words that contribute little to the topic at hand. For example, `[am, is, are, in, at, ...]` are treated as stop-words in many applications.

Stop-words can also be domain specific. For example, in chatbot data, phrases such as `[can you please, would you please, can I, may I, ...]` appear in many messages without adding meaning. `TF-IDF` can help you find such terms, because words that occur in most documents receive a low IDF weight.

In [3]:
# CountVectorizer (BOW) with stop words
texts = [ 'the red dog', 'cat eats dog', 'dog eats food',
         'red cat eats', 'the hot dog']

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(texts)
x = vectorizer.transform(texts)
columns = vectorizer.get_feature_names_out()
pd.DataFrame(x.todense(), columns=columns, index=texts)

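| | cat | dog | eats | food | hot | red |
|---|---|---|---|---|---|---|
| the red dog | 0 | 1 | 0 | 0 | 0 | 1 |
| cat eats dog | 1 | 1 | 1 | 0 | 0 | 0 |
| dog eats food | 0 | 1 | 1 | 1 | 0 | 0 |
| red cat eats | 1 | 0 | 1 | 0 | 0 | 1 |
| the hot dog | 0 | 1 | 0 | 0 | 1 | 0 |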
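
For the domain-specific stop-words mentioned earlier, one possible way to find candidates is to look at document frequency: words that occur in most documents are the same words that would get a low IDF. The sketch below uses invented chatbot messages and an arbitrary 75% threshold; both are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical chatbot messages, invented for this example.
messages = [
    'can you please reset my password',
    'can you please cancel my order',
    'would you please track my order',
    'can you please update my address',
]

# Document frequency: in how many messages does each word appear?
doc_freq = Counter()
for message in messages:
    doc_freq.update(set(message.split()))

# Words present in at least 75% of the messages are stop-word candidates
# (the threshold is an arbitrary choice for this sketch).
threshold = 0.75 * len(messages)
candidates = sorted(word for word, df in doc_freq.items() if df >= threshold)
print(candidates)
```

A list like `candidates` can then be reviewed by hand and passed to `CountVectorizer(stop_words=candidates)`, since the `stop_words` parameter also accepts a list of strings.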


--------------

### N-Grams

N-grams are sequences of n consecutive words, and counting them captures some of the context that single words lose. The larger the n-gram range, the more context you capture, but the number of features also grows quickly, so watch your memory usage.

In [4]:
# CountVectorizer (BOW) with n_grams and stop words
texts = [ 'the red dog', 'cat eats dog', 'dog eats food',
        'red cat eats', 'the hot dog']

vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
vectorizer.fit(texts)
x = vectorizer.transform(texts)
columns = vectorizer.get_feature_names_out()
pd.DataFrame(x.todense(), columns=columns, index=texts)

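| | cat | cat eats | dog | dog eats | eats | eats dog | eats food | food | hot | hot dog | red | red cat | red dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| the red dog | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| cat eats dog | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| dog eats food | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| red cat eats | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| the hot dog | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |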
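
To see how the feature count grows with the n-gram range, the following sketch builds the n-gram vocabulary by hand for a few upper bounds. The helper `ngrams` is a hypothetical name introduced here, and stop-words are kept for simplicity, so the counts differ from the table above.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

texts = ['the red dog', 'cat eats dog', 'dog eats food',
         'red cat eats', 'the hot dog']
tokenized = [text.split() for text in texts]

# Vocabulary size for ngram ranges (1, 1), (1, 2) and (1, 3).
sizes = {}
for max_n in (1, 2, 3):
    vocab = set()
    for tokens in tokenized:
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    sizes[max_n] = len(vocab)

print(sizes)
```

Even on five three-word documents the vocabulary grows with every extra n; on a real corpus the growth is usually much steeper, which is why options such as `max_features` or `min_df` are often combined with `ngram_range`.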


----------

### TFIDF

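Instead of using raw counts, each word is weighted by TF-IDF, which gives more weight to words that are frequent in a document but rare across the corpus:

$$W_{x, y} = tf_{x, y} \times \log\left(\frac{N}{df_x}\right)$$

where $tf_{x, y}$ is the frequency of term $x$ in document $y$, $df_x$ is the number of documents containing term $x$, and $N$ is the total number of documents.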

In [7]:
# TF-IDF with stop words and n_grams
texts = [ 'the red dog', 'cat eats dog', 'dog eats food',
        'red cat eats', 'the hot dog']

# Integer max_df/min_df are absolute document counts: keep terms found in 2 to 4 documents
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_df=4, min_df=2)
vectorizer.fit(texts)
x = vectorizer.transform(texts)
columns = vectorizer.get_feature_names_out()
pd.DataFrame(x.todense(), columns=columns, index=texts)

| | cat | cat eats | dog | eats | red |
|---|---|---|---|---|---|
| the red dog | 0.0 | 0.0 | 0.572526 | 0.0 | 0.819887 |
| cat eats dog | 0.561066 | 0.561066 | 0.391791 | 0.465735 | 0.0 |
| dog eats food | 0.0 | 0.0 | 0.643744 | 0.765241 | 0.0 |
| red cat eats | 0.520646 | 0.520646 | 0.0 | 0.432183 | 0.520646 |
| the hot dog | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
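
The values in this table do not follow the textbook formula exactly, because `TfidfVectorizer` by default uses a smoothed IDF (`smooth_idf=True`), computed as $\ln\frac{1 + N}{1 + df_x} + 1$, and then L2-normalizes each row (`norm='l2'`). The sketch below recomputes the first row, `the red dog`, by hand under those defaults; the document frequencies are read off the toy corpus.

```python
import math

N = 5                       # number of documents in the toy corpus
df = {'dog': 4, 'red': 2}   # document frequencies of the surviving terms

# Smoothed IDF, as used when smooth_idf=True:
# idf(t) = ln((1 + N) / (1 + df(t))) + 1
idf = {term: math.log((1 + N) / (1 + d)) + 1 for term, d in df.items()}

# 'the red dog' keeps two features after filtering, each with tf = 1
# ('the' is a stop-word and the bigram 'red dog' fails min_df=2).
weights = {term: 1 * idf[term] for term in ('dog', 'red')}

# L2-normalize the row, as norm='l2' does.
norm = math.sqrt(sum(w * w for w in weights.values()))
tfidf = {term: w / norm for term, w in weights.items()}

print({term: round(value, 6) for term, value in tfidf.items()})
```

Up to rounding, the result should agree with the `dog` and `red` entries of the first row. It also explains the last row: `the hot dog` keeps only `dog`, so normalization turns its single weight into 1.0.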


---------

> We can already build some applications using only these features. Let's try a very quick one.

In [8]:
import numpy as np
import pandas as pd
from collections import Counter
import random
from termcolor import colored

# sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances, manhattan_distances
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [9]:
# Load dataset
data = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                        categories=('rec.autos', 'comp.windows.x', 
                                    'soc.religion.christian', 'rec.sport.baseball'))
X = data.data
y = [data.target_names[i] for i in data.target]
print(f'DATA : {X[0]}')
print(f'LABEL: {y[0]}')

DATA : With all the recent problems the Indians have been having
with their pitching staff I have heard numerous names
thrown around about who could solve their problem.

One name I have not heard is Mike Soper (RP).  As far as
I know, Soper has had pretty good minor league stats.
Why not give the kid a chance?  Anyone know anything about
this guy?

-- 
LABEL: rec.sport.baseball


In [10]:
Counter(y)

Counter({'soc.religion.christian': 398,
         'rec.sport.baseball': 397,
         'rec.autos': 396,
         'comp.windows.x': 395})

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

Let's retrieve the three most similar training articles for five randomly chosen test articles.

In [12]:
vectorizer = CountVectorizer(stop_words='english', max_features=1000, max_df=0.7, min_df=0.01)
vectorizer.fit(X_train)
X_train_v = vectorizer.transform(X_train)
X_test_v = vectorizer.transform(X_test)

In [13]:
X_train_v.shape, X_test_v.shape

((1268, 1000), (318, 1000))

* `Using Cosine Similarity`

In [14]:
# random.choices samples with replacement, so an ID may repeat
for i in random.choices(range(len(X_test)), k=5):
    print(f"ID: {i}")
    print("True label:", colored(y_test[i], 'green'))
    # Cosine similarity between one test article and every training article
    similarities = cosine_similarity(X_test_v[i], X_train_v).flatten()
    indices = np.argsort(similarities)[::-1]  # most similar first
    for rank, j in enumerate(indices[:3]):
        print(f"{rank} nearest label is {colored(y_train[j], 'green' if y_train[j]==y_test[i] else 'red')}",
            f"similarity: {colored(round(similarities[j], 3), 'yellow')}")

ID: 50
True label: rec.sport.baseball
0 nearest label is rec.sport.baseball similarity: 0.756
1 nearest label is soc.religion.christian similarity: 0.58
2 nearest label is comp.windows.x similarity: 0.534
ID: 292
True label: rec.sport.baseball
0 nearest label is soc.religion.christian similarity: 0.447
1 nearest label is rec.sport.baseball similarity: 0.39
2 nearest label is rec.sport.baseball similarity: 0.359
ID: 258
True label: rec.autos
0 nearest label is comp.windows.x similarity: 0.287
1 nearest label is rec.autos similarity: 0.223
2 nearest label is rec.sport.baseball similarity: 0.22
ID: 47
True label: rec.sport.baseball
0 nearest label is rec.sport.baseball similarity: 0.319
1 nearest label is rec.sport.baseball similarity: 0.313
2 nearest label i

In [15]:
# Predicted labels for the test set
y_pred_test = []

# Loop over the entire test set
for i in range(len(X_test)):
    # Cosine similarity between the test instance and all training instances
    similarities = cosine_similarity(X_test_v[i], X_train_v).flatten()
    # Indices of the training instances sorted by similarity, descending
    indices = np.argsort(similarities)[::-1]
    # Labels of the three nearest neighbors
    nearest_labels = [y_train[j] for j in indices[:3]]
    # Majority vote among the three nearest neighbors
    y_pred_each = Counter(nearest_labels).most_common(1)[0][0]
    y_pred_test.append(y_pred_each)

# Accuracy score
acc = accuracy_score(y_test, y_pred_test)
print(f'Accuracy Score using cosine similarity is: {acc*100:.3f} %')

Accuracy Score using cosine similarity is: 80.189 %
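
The prediction loop is a 3-nearest-neighbor majority vote. As a minimal sketch, the voting step could be factored into a small helper so that it can be reused with any similarity or distance measure. The name `knn_predict` and its parameters are hypothetical and are not part of the notebook or of scikit-learn.

```python
from collections import Counter

def knn_predict(scores, labels, k=3, higher_is_closer=True):
    """Predict a label by majority vote among the k closest training items.

    scores: one score per training item (a similarity or a distance).
    labels: training labels aligned with scores.
    higher_is_closer: True for similarities, False for distances.
    """
    order = sorted(range(len(scores)), key=lambda j: scores[j],
                   reverse=higher_is_closer)
    nearest = [labels[j] for j in order[:k]]
    # most_common orders equal counts by first occurrence,
    # so ties go to the label of the closer neighbor.
    return Counter(nearest).most_common(1)[0][0]

scores = [0.2, 0.9, 0.8, 0.1]
labels = ['x', 'y', 'y', 'x']
print(knn_predict(scores, labels))                          # scores as similarities
print(knn_predict(scores, labels, higher_is_closer=False))  # scores as distances
```

With the same scores, the prediction flips depending on whether they are read as similarities or distances, which is why the sort direction has to match the measure.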


* `Using Euclidean Distance`

In [16]:
for i in random.choices(range(len(X_test)), k=5):
    print(f"ID: {i}")
    print("True label:", colored(y_test[i], 'green'))
    # Euclidean distance between one test article and every training article
    distances = euclidean_distances(X_test_v[i], X_train_v).flatten()
    indices = np.argsort(distances)  # smallest distance first
    for rank, j in enumerate(indices[:3]):
        print(f"{rank} nearest label is {colored(y_train[j], 'green' if y_train[j]==y_test[i] else 'red')}",
            f"distance: {colored(round(distances[j], 3), 'yellow')}")

ID: 34
True label: rec.sport.baseball
0 nearest label is rec.sport.baseball distance: 13.675
1 nearest label is rec.sport.baseball distance: 15.067
2 nearest label is rec.sport.baseball distance: 15.264
ID: 12
True label: rec.sport.baseball
0 nearest label is rec.sport.baseball distance: 7.81
1 nearest label is rec.sport.baseball distance: 8.0
2 nearest label is soc.religion.christian distance: 8.185
ID: 207
True label: rec.sport.baseball
0 nearest label is rec.autos distance: 7.071
1 nearest label is rec.sport.baseball distance: 7.348
2 nearest label is soc.religion.christian distance: 7.348
ID: 172
True label: rec.autos
0 nearest label is rec.autos distance: 3.742
1 nearest label is rec.sport.baseball distance: 3.742
2 nearest label

In [17]:
# Predicted labels for the test set
y_pred_test = []

# Loop over the entire test set
for i in range(len(X_test)):
    # Euclidean distance between the test instance and all training instances
    distances = euclidean_distances(X_test_v[i], X_train_v).flatten()
    # Indices of the training instances sorted by distance, ascending
    indices = np.argsort(distances)
    # Labels of the three nearest neighbors
    nearest_labels = [y_train[j] for j in indices[:3]]
    # Majority vote among the three nearest neighbors
    y_pred_each = Counter(nearest_labels).most_common(1)[0][0]
    y_pred_test.append(y_pred_each)

# Accuracy score
acc = accuracy_score(y_test, y_pred_test)
print(f'Accuracy Score using Euclidean Distance is: {acc*100:.3f} %')

Accuracy Score using Euclidean Distance is: 52.516 %


* `Using Dot Product`

In [18]:
for i in random.choices(range(len(X_test)), k=5):
    print(f"ID: {i}")
    print("True label:", colored(y_test[i], 'green'))
    # Dot product between one test article and every training article;
    # @ is always matrix multiplication, for sparse matrices and sparse arrays alike
    scores = (X_test_v[i] @ X_train_v.T).toarray().flatten()
    indices = np.argsort(scores)[::-1]  # largest dot product first
    for rank, j in enumerate(indices[:3]):
        print(f"{rank} nearest label is {colored(y_train[j], 'green' if y_train[j]==y_test[i] else 'red')}",
            f"similarity: {colored(round(scores[j], 3), 'yellow')}")

ID: 49
True label: soc.religion.christian
0 nearest label is soc.religion.christian similarity: 1054
1 nearest label is soc.religion.christian similarity: 395
2 nearest label is comp.windows.x similarity: 323
ID: 218
True label: rec.autos
0 nearest label is comp.windows.x similarity: 39
1 nearest label is rec.autos similarity: 37
2 nearest label is comp.windows.x similarity: 31
ID: 149
True label: rec.autos
0 nearest label is comp.windows.x similarity: 65
1 nearest label is comp.windows.x similarity: 31
2 nearest label is comp.windows.x similarity: 16
ID: 70
True label: soc.religion.christian
0 nearest label is soc.religion.christian similarity: 253
1 nearest label is soc.religion.christian similarity: 71
2 nearest label is soc.religion.christian

In [19]:
# Predicted labels for the test set
y_pred_test = []

# Loop over the entire test set
for i in range(len(X_test)):
    # Dot product between the test instance and all training instances
    scores = (X_test_v[i] @ X_train_v.T).toarray().flatten()
    # Indices of the training instances sorted by dot product, descending
    indices = np.argsort(scores)[::-1]
    # Labels of the three nearest neighbors
    nearest_labels = [y_train[j] for j in indices[:3]]
    # Majority vote among the three nearest neighbors
    y_pred_each = Counter(nearest_labels).most_common(1)[0][0]
    y_pred_test.append(y_pred_each)

# Accuracy score
acc = accuracy_score(y_test, y_pred_test)
print(f'Accuracy Score using Dot Product is: {acc*100:.3f} %')

Accuracy Score using Dot Product is: 48.428 %
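
A plausible explanation for the gap between the three measures, not verified on this dataset, is document length: raw counts grow with the length of an article, and only cosine similarity normalizes that away. The toy count vectors below are invented to illustrate the effect.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented count vectors over a three-word vocabulary.
query = [2, 1, 0]
candidates = {
    'same topic, long':   [20, 10, 0],   # same proportions, ten times longer
    'same topic, short':  [4, 2, 0],
    'other topic, long':  [10, 0, 20],
    'other topic, short': [1, 0, 2],
}

for name, vector in candidates.items():
    print(f'{name:20} cosine={cosine(query, vector):.3f} '
          f'euclidean={euclidean(query, vector):.3f} dot={dot(query, vector)}')
```

In this sketch cosine similarity ranks both same-topic vectors first regardless of length, Euclidean distance puts the short off-topic vector closer than the long on-topic one, and the dot product prefers the long off-topic vector over the short on-topic one.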


-------------

> ## `Great Job`

---------