# Bag of Words 

Data format for ML algorithms : 
- Data must be in tabular form
- Training features must be numerical

Bag-of-words model:
- extract word tokens
- compute the frequency of each word token
- construct a vector for each document from these frequencies, with one position per word in the corpus vocabulary
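
The three steps can be sketched in plain Python. This is only an illustration: the two-sentence `docs` list is a made-up example, and the whitespace split is a naive stand-in for a real tokenizer.

```python
from collections import Counter

# Hypothetical two-document corpus, used only for illustration.
docs = ["the cat sat", "the cat saw the dog"]

# Step 1: extract word tokens (naive whitespace split).
tokenized = [doc.split() for doc in docs]

# Step 2: compute token frequencies per document.
counts = [Counter(tokens) for tokens in tokenized]

# Step 3: build a sorted vocabulary over the whole corpus, then one
# frequency vector per document aligned to that vocabulary.
vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
vectors = [[c[tok] for tok in vocabulary] for c in counts]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
for row in vectors:
    print(row)
# [1, 0, 1, 0, 1]
# [1, 1, 0, 1, 2]
```

Stacking the vectors row by row gives the tabular, numerical format that ML algorithms expect.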

We can use the scikit-learn `CountVectorizer` which takes a collection of text documents and creates a matrix of token counts. 

In [96]:
import spacy
from scipy import spatial

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [97]:
# A corpus of sentences.
corpus = [
  "Red Bull drops hint on F1 engine.",
  "Honda exits F1, leaving F1 partner Red Bull.",
  "Hamilton eyes record eighth F1 title.",
  "Aston Martin announces sponsor."
]

## Plain frequency BOW

Intuition: 

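First, we build a **vocabulary**: the list of all unique words across the documents. 

Next, we convert each document into a BOW **vector** that holds, for every vocabulary word, the number of times that word occurs in the document. 

Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The

e.g. doc = "The lion is the king of the jungle"

vector = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]

The vocabulary here is case-sensitive, so `the` (count 2) and `The` (count 1) occupy separate positions.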
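
As a quick check, the vector can be reproduced with a few lines of plain Python. The whitespace split is an assumption that happens to be enough for this sentence.

```python
from collections import Counter

# Vocabulary exactly as listed in the text; list position is the column index.
vocabulary = ["a", "an", "decade", "endangered", "have", "is", "jungle",
              "king", "lifespans", "lion", "Lions", "of", "species", "the", "The"]

doc = "The lion is the king of the jungle"

# Count case-sensitive tokens, then read the counts off in vocabulary order.
counts = Counter(doc.split())
vector = [counts[word] for word in vocabulary]

print(vector)
# [0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]
```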

In [98]:
vectorizer = CountVectorizer()

The `fit_transform` method does two things:
1. It learns a vocabulary dictionary from the corpus.
2. It returns a matrix where each row represents a document and each column represents a token (i.e. term).
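
To make the two steps visible, the sketch below calls `fit` and `transform` separately and compares the result with a single `fit_transform` call. It uses only the first two sentences of the corpus, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Red Bull drops hint on F1 engine.",
    "Honda exits F1, leaving F1 partner Red Bull.",
]

# Step 1: fit() only learns the vocabulary (token -> column index).
two_step = CountVectorizer()
two_step.fit(docs)

# Step 2: transform() maps the documents onto that vocabulary.
matrix_two_step = two_step.transform(docs)

# fit_transform() performs both steps in one call.
one_step = CountVectorizer()
matrix_one_step = one_step.fit_transform(docs)

# Same vocabulary and same counts either way.
same_vocab = two_step.vocabulary_ == one_step.vocabulary_
same_counts = bool((matrix_two_step.toarray() == matrix_one_step.toarray()).all())
print(same_vocab, same_counts)  # True True
```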

In [99]:
bow = vectorizer.fit_transform(corpus)

In [100]:
# View list of features (tokens).
print(vectorizer.get_feature_names_out())

# View vocabulary dictionary.
vectorizer.vocabulary_

['announces' 'aston' 'bull' 'drops' 'eighth' 'engine' 'exits' 'eyes' 'f1'
 'hamilton' 'hint' 'honda' 'leaving' 'martin' 'on' 'partner' 'record'
 'red' 'sponsor' 'title']


{'red': 17,
 'bull': 2,
 'drops': 3,
 'hint': 10,
 'on': 14,
 'f1': 8,
 'engine': 5,
 'honda': 11,
 'exits': 6,
 'leaving': 12,
 'partner': 15,
 'hamilton': 9,
 'eyes': 7,
 'record': 16,
 'eighth': 4,
 'title': 19,
 'aston': 1,
 'martin': 13,
 'announces': 0,
 'sponsor': 18}

In [101]:
print(type(bow))

<class 'scipy.sparse._csr.csr_matrix'>


Specifically, the `CountVectorizer` generates a SciPy sparse matrix, which stores only the non-zero counts. The sparse matrix object supports useful operations such as `toarray()`, indexing and slicing. 

In [102]:
print(bow)

  (0, 17)	1
  (0, 2)	1
  (0, 3)	1
  (0, 10)	1
  (0, 14)	1
  (0, 8)	1
  (0, 5)	1
  (1, 17)	1
  (1, 2)	1
  (1, 8)	2
  (1, 11)	1
  (1, 6)	1
  (1, 12)	1
  (1, 15)	1
  (2, 8)	1
  (2, 9)	1
  (2, 7)	1
  (2, 16)	1
  (2, 4)	1
  (2, 19)	1
  (3, 1)	1
  (3, 13)	1
  (3, 0)	1
  (3, 18)	1


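Each printed line has two parts:
1. Column 1: a `(row, column)` tuple. 
- 1st value: the document (row) index, starting at 0. There are 4 docs, so the indices run from 0 to 3. 
- 2nd value: the index of the token in the vocabulary list. 
2. Column 2: the count of that token in that document. For example, the entry `(1, 8)` with count 2 means that `f1` (index 8) appears twice in the second document. 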
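
A small sketch of how to read these entries: it takes the printed entries for the second document and maps each column index back to its token. The `features` and `entries` values are copied from the outputs shown earlier; in a live session they could presumably be read from the vectorizer and the matrix instead.

```python
# Feature names copied from the get_feature_names_out() output.
features = ['announces', 'aston', 'bull', 'drops', 'eighth', 'engine',
            'exits', 'eyes', 'f1', 'hamilton', 'hint', 'honda', 'leaving',
            'martin', 'on', 'partner', 'record', 'red', 'sponsor', 'title']

# (row, column, count) entries for the second document, copied from print(bow).
entries = [(1, 17, 1), (1, 2, 1), (1, 8, 2), (1, 11, 1),
           (1, 6, 1), (1, 12, 1), (1, 15, 1)]

# Map each column index back to its token.
decoded = {features[col]: count for _, col, count in entries}
print(decoded)
# {'red': 1, 'bull': 1, 'f1': 2, 'honda': 1, 'exits': 1, 'leaving': 1, 'partner': 1}
```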

## Binary BOW with custom tokenizer

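In short, we can customize our tokenizer (i.e. write our own method to return the desired tokens). With `binary=True`, the vectorizer records only whether a token is present (1) or absent (0) in a document, not how often it occurs. 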

In [103]:
# As usual, we start by importing spaCy and loading a statistical model.
nlp = spacy.load('en_core_web_sm')

# Create a tokenizer callback using spaCy under the hood. Here, we tokenize
# the passed-in text and return the tokens, filtering out punctuation.
def spacy_tokenizer(doc):
    return [t.text for t in nlp(doc) if not t.is_punct]

In [104]:
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True)
bow = vectorizer.fit_transform(corpus)

In [105]:
print(vectorizer.get_feature_names_out())
vectorizer.vocabulary_

['Aston' 'Bull' 'F1' 'Hamilton' 'Honda' 'Martin' 'Red' 'announces' 'drops'
 'eighth' 'engine' 'exits' 'eyes' 'hint' 'leaving' 'on' 'partner' 'record'
 'sponsor' 'title']


{'Red': 6,
 'Bull': 1,
 'drops': 8,
 'hint': 13,
 'on': 15,
 'F1': 2,
 'engine': 10,
 'Honda': 4,
 'exits': 11,
 'leaving': 14,
 'partner': 16,
 'Hamilton': 3,
 'eyes': 12,
 'record': 17,
 'eighth': 9,
 'title': 19,
 'Aston': 0,
 'Martin': 5,
 'announces': 7,
 'sponsor': 18}
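
To see what the binary setting changes, the plain-Python sketch below compares frequency and binary values for the second document, where `F1` appears twice. The token list is written out by hand here rather than produced by spaCy.

```python
from collections import Counter

# Second document of the corpus, tokenized by hand with punctuation removed.
tokens = ["Honda", "exits", "F1", "leaving", "F1", "partner", "Red", "Bull"]

counts = Counter(tokens)

# Plain frequency BOW keeps the raw counts.
frequency = dict(counts)

# Binary BOW only records presence, so every non-zero count becomes 1.
binary = {token: 1 for token in counts}

print(frequency["F1"], binary["F1"])  # 2 1
```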

#### `toarray` method
View the dense 2D matrix of our tokens. 

In [106]:
print(bow.toarray())
print()
print('Indexing and slicing.')
print(bow[0])
print()
print(bow[0:2])

[[0 1 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0]
 [0 1 1 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0]
 [0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1]
 [1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0]]

Indexing and slicing.
  (0, 6)	1
  (0, 1)	1
  (0, 8)	1
  (0, 13)	1
  (0, 15)	1
  (0, 2)	1
  (0, 10)	1

  (0, 6)	1
  (0, 1)	1
  (0, 8)	1
  (0, 13)	1
  (0, 15)	1
  (0, 2)	1
  (0, 10)	1
  (1, 6)	1
  (1, 1)	1
  (1, 2)	1
  (1, 4)	1
  (1, 11)	1
  (1, 14)	1
  (1, 16)	1


# Similarity Scores 
1. Dot Product
2. Magnitude of a vector
3. Cosine Similarity

Consider two vectors $\vec{V}$ and $\vec{W}$ of the same length:

$$
\vec{V} = (v_1, v_2, \ldots, v_n), \quad \vec{W} = (w_1, w_2, \ldots, w_n)
$$

The dot product $\vec{V} \cdot \vec{W}$ is calculated as:

$$
\vec{V} \cdot \vec{W} = (v_1 \times w_1) + (v_2 \times w_2) + \ldots + (v_n \times w_n)
$$


For any vector $\vec{V} = (v_1, v_2, \ldots, v_n)$, the magnitude is defined as:

$$
\|\vec{V}\| = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}
$$

The cosine similarity between vectors $\vec{V}$ and $\vec{W}$ is defined as:

$$
\text{cos}(\vec{V}, \vec{W}) = \frac{\vec{V} \cdot \vec{W}}{\|\vec{V}\| \cdot \|\vec{W}\|}
$$


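Cosine similarity is the cosine of the angle between the two vectors. A value of 1 means the vectors point in the same direction; a value of 0 means they are orthogonal, i.e. the documents share no tokens. Because BOW counts are never negative, the score always lies between 0 and 1 here. 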
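
The three formulas translate almost directly into plain Python. The sketch below applies them to the binary BOW rows of the first two documents (copied from the `bow.toarray()` output); the function names are illustrative.

```python
import math

def dot(v, w):
    # Sum of element-wise products.
    return sum(a * b for a, b in zip(v, w))

def magnitude(v):
    # Square root of the sum of squared components.
    return math.sqrt(dot(v, v))

def cosine(v, w):
    # Dot product divided by the product of the magnitudes.
    return dot(v, w) / (magnitude(v) * magnitude(w))

# Binary BOW rows for the first two documents.
doc1 = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0]
doc2 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

print(dot(doc1, doc2))               # 3
print(round(cosine(doc1, doc2), 4))  # 0.4286
```

The two documents share 3 tokens and each contains 7 distinct tokens, so the score is 3 / (√7 · √7) = 3/7.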

## Manual method

In [107]:
# spatial.distance.cosine returns the cosine *distance* (1 - similarity), so we
# subtract it from 1. It expects 1-D array_like inputs, so we convert each
# sparse row to a dense array first.
doc1_vs_doc2 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[1].toarray()[0])
doc1_vs_doc3 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[2].toarray()[0])
doc1_vs_doc4 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[3].toarray()[0])

print(corpus)

print(f"Doc 1 vs Doc 2: {doc1_vs_doc2}")
print(f"Doc 1 vs Doc 3: {doc1_vs_doc3}")
print(f"Doc 1 vs Doc 4: {doc1_vs_doc4}")

['Red Bull drops hint on F1 engine.', 'Honda exits F1, leaving F1 partner Red Bull.', 'Hamilton eyes record eighth F1 title.', 'Aston Martin announces sponsor.']
Doc 1 vs Doc 2: 0.4285714285714286
Doc 1 vs Doc 3: 0.15430334996209194
Doc 1 vs Doc 4: 0.0


## `cosine_similarity` from scikit-learn

In [108]:
from sklearn.metrics.pairwise import cosine_similarity

In [109]:
# cosine_similarity can take either array-likes or sparse matrices.
print(cosine_similarity(bow))

[[1.         0.42857143 0.15430335 0.        ]
 [0.42857143 1.         0.15430335 0.        ]
 [0.15430335 0.15430335 1.         0.        ]
 [0.         0.         0.         1.        ]]
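
The same pairwise matrix can be reproduced with NumPy by scaling each row to unit length and multiplying the matrix by its transpose. This is a sketch of an equivalent computation, not a claim about how scikit-learn implements it; the matrix `X` is copied from the `bow.toarray()` output.

```python
import numpy as np

# Binary BOW matrix, one row per document.
X = np.array([
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
], dtype=float)

# Scale every row to unit length, so a dot product between two rows
# equals their cosine similarity.
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)
similarity = unit_rows @ unit_rows.T

print(np.round(similarity, 4))
```

Entry `[i, j]` is the similarity between document `i` and document `j`, which is why the matrix is symmetric with ones on the diagonal.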


## N-gram models
An n-gram is a contiguous sequence of n elements (here, words) in a given document. 

BOW has a shortcoming: 

Word order, and with it context, is lost, so two docs that contain exactly the same words get identical vectors. Let's see an example: 

'The movie was good and not boring' -> Positive 

'The movie was not good and boring' -> Negative
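
A short check in plain Python confirms that the two reviews are indistinguishable as bags of words (whitespace tokenization is assumed):

```python
from collections import Counter

positive = "The movie was good and not boring"
negative = "The movie was not good and boring"

# A bag of words only keeps token counts, not token order.
bow_positive = Counter(positive.split())
bow_negative = Counter(negative.split())

print(bow_positive == bow_negative)  # True
```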

Sentence: "for you a thousand times over"

For n = 1 (unigrams, i.e. plain Bag-of-Words):

n-grams: ['for', 'you', 'a', 'thousand', 'times', 'over']

For n = 2 (bigrams):

n-grams: ['for you', 'you a', 'a thousand', 'thousand times', 'times over']
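
A minimal n-gram generator, assuming whitespace tokenization (the helper name `ngrams` is illustrative). It reproduces the lists for the example sentence and shows that bigrams can tell the two movie reviews apart.

```python
def ngrams(text, n):
    """Return the contiguous n-word sequences in text (whitespace tokens)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "for you a thousand times over"
print(ngrams(sentence, 1))
# ['for', 'you', 'a', 'thousand', 'times', 'over']
print(ngrams(sentence, 2))
# ['for you', 'you a', 'a thousand', 'thousand times', 'times over']

# Bigrams separate the two reviews that share the same bag of words.
positive = set(ngrams("The movie was good and not boring", 2))
negative = set(ngrams("The movie was not good and boring", 2))
print(sorted(positive - negative))
# ['and not', 'not boring', 'was good']
```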


Cons: adding n-grams enlarges the feature space (curse of dimensionality), and higher-order n-grams are rare, so the resulting vectors become very sparse. 
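
The rarity of higher-order n-grams can be seen even on the small F1 corpus. The sketch below counts, for each n, how many distinct n-grams exist and how many of them occur in more than one document. The regular expression is an assumption that approximates `CountVectorizer`'s default tokenization (words of two or more characters).

```python
import re
from collections import Counter

corpus = [
    "Red Bull drops hint on F1 engine.",
    "Honda exits F1, leaving F1 partner Red Bull.",
    "Hamilton eyes record eighth F1 title.",
    "Aston Martin announces sponsor.",
]

def tokenize(doc):
    # Words of two or more characters, case preserved.
    return re.findall(r"\b\w\w+\b", doc)

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

stats = {}
for n in (1, 2, 3):
    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(ngrams(tokenize(doc), n)))
    shared = sum(1 for count in doc_freq.values() if count > 1)
    stats[n] = (len(doc_freq), shared)
    print(f"n={n}: {len(doc_freq)} distinct, {shared} shared between documents")
# n=1: 20 distinct, 3 shared between documents
# n=2: 20 distinct, 1 shared between documents
# n=3: 17 distinct, 0 shared between documents
```

Only `Red Bull` survives as a shared bigram and no trigram is shared, so documents overlap less and less as n grows.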

In [110]:
# Setting ngram_range parameter to (1, 2) generates both unigrams and bigrams.
vectorizer = CountVectorizer(lowercase=False, binary=True, ngram_range=(1,2))
bigrams = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print('Number of features: {}'.format(len(vectorizer.get_feature_names_out())))

['Aston' 'Aston Martin' 'Bull' 'Bull drops' 'F1' 'F1 engine' 'F1 leaving'
 'F1 partner' 'F1 title' 'Hamilton' 'Hamilton eyes' 'Honda' 'Honda exits'
 'Martin' 'Martin announces' 'Red' 'Red Bull' 'announces'
 'announces sponsor' 'drops' 'drops hint' 'eighth' 'eighth F1' 'engine'
 'exits' 'exits F1' 'eyes' 'eyes record' 'hint' 'hint on' 'leaving'
 'leaving F1' 'on' 'on F1' 'partner' 'partner Red' 'record'
 'record eighth' 'sponsor' 'title']
Number of features: 40


In [111]:
# Setting the ngram_range parameter to (2, 2) generates only bigrams.
vectorizer = CountVectorizer(lowercase=False, binary=True, ngram_range=(2,2))
bigrams = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(vectorizer.vocabulary_)

['Aston Martin' 'Bull drops' 'F1 engine' 'F1 leaving' 'F1 partner'
 'F1 title' 'Hamilton eyes' 'Honda exits' 'Martin announces' 'Red Bull'
 'announces sponsor' 'drops hint' 'eighth F1' 'exits F1' 'eyes record'
 'hint on' 'leaving F1' 'on F1' 'partner Red' 'record eighth']
{'Red Bull': 9, 'Bull drops': 1, 'drops hint': 11, 'hint on': 15, 'on F1': 17, 'F1 engine': 2, 'Honda exits': 7, 'exits F1': 13, 'F1 leaving': 3, 'leaving F1': 16, 'F1 partner': 4, 'partner Red': 18, 'Hamilton eyes': 6, 'eyes record': 14, 'record eighth': 19, 'eighth F1': 12, 'F1 title': 5, 'Aston Martin': 0, 'Martin announces': 8, 'announces sponsor': 10}


## Exercises

EXERCISE 1: Create a spacy_tokenizer callback which takes a string and returns a list of tokens (each token's text) with punctuation filtered out.

In [112]:
# There are 5 docs (=rows) in this new corpus. 
corpus = [
    "Students use their GPS-enabled cellphones to take birdview photographs of a land in order to find specific danger points such as rubbish heaps.",
    "Teenagers are enthusiastic about taking aerial photograph in order to study their neighbourhood.",
    "Aerial photography is a great way to identify terrestrial features that aren’t visible from the ground level, such as lake contours or river paths.",
    "During the early days of digital SLRs, Canon was pretty much the undisputed leader in CMOS image sensor technology.",
    "Syrian President Bashar al-Assad tells the US it will 'pay the price' if it strikes against Syria."
]

nlp = spacy.load('en_core_web_sm')

def spacy_tokenizer(doc):
    # Return a list (not a generator), as the exercise asks.
    return [t.text for t in nlp(doc) if not t.is_punct]

# Initialize a CountVectorizer object and set it to use
# your spacy_tokenizer with lower-casing off and to create a binary BOW.
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True)
binary_bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
vectorizer.vocabulary_

['Aerial' 'Assad' 'Bashar' 'CMOS' 'Canon' 'During' 'GPS' 'President'
 'SLRs' 'Students' 'Syria' 'Syrian' 'Teenagers' 'US' 'a' 'about' 'aerial'
 'against' 'al' 'are' 'as' 'birdview' 'cellphones' 'contours' 'danger'
 'days' 'digital' 'early' 'enabled' 'enthusiastic' 'features' 'find'
 'from' 'great' 'ground' 'heaps' 'identify' 'if' 'image' 'in' 'is' 'it'
 'lake' 'land' 'leader' 'level' 'much' 'neighbourhood' 'n’t' 'of' 'or'
 'order' 'paths' 'pay' 'photograph' 'photographs' 'photography' 'points'
 'pretty' 'price' 'river' 'rubbish' 'sensor' 'specific' 'strikes' 'study'
 'such' 'take' 'taking' 'technology' 'tells' 'terrestrial' 'that' 'the'
 'their' 'to' 'undisputed' 'use' 'visible' 'was' 'way' 'will']


{'Students': 9,
 'use': 77,
 'their': 74,
 'GPS': 6,
 'enabled': 28,
 'cellphones': 22,
 'to': 75,
 'take': 67,
 'birdview': 21,
 'photographs': 55,
 'of': 49,
 'a': 14,
 'land': 43,
 'in': 39,
 'order': 51,
 'find': 31,
 'specific': 63,
 'danger': 24,
 'points': 57,
 'such': 66,
 'as': 20,
 'rubbish': 61,
 'heaps': 35,
 'Teenagers': 12,
 'are': 19,
 'enthusiastic': 29,
 'about': 15,
 'taking': 68,
 'aerial': 16,
 'photograph': 54,
 'study': 65,
 'neighbourhood': 47,
 'Aerial': 0,
 'photography': 56,
 'is': 40,
 'great': 33,
 'way': 80,
 'identify': 36,
 'terrestrial': 71,
 'features': 30,
 'that': 72,
 'n’t': 48,
 'visible': 78,
 'from': 32,
 'the': 73,
 'ground': 34,
 'level': 45,
 'lake': 42,
 'contours': 23,
 'or': 50,
 'river': 60,
 'paths': 52,
 'During': 5,
 'early': 27,
 'days': 25,
 'digital': 26,
 'SLRs': 8,
 'Canon': 4,
 'was': 79,
 'pretty': 58,
 'much': 46,
 'undisputed': 76,
 'leader': 44,
 'CMOS': 3,
 'image': 38,
 'sensor': 62,
 'technology': 69,
 'Syrian': 11,
 'Presid

The string below is a whole paragraph. We want to create another
binary BOW but using the vocabulary of our *current* CountVectorizer. This means
that words in this paragraph which aren't already in the vocabulary won't be
represented. This is to illustrate how BOW can't handle out-of-vocabulary words
unless you rebuild your whole vocabulary. Still, we'll see that if there's
enough overlapping vocabulary, some similarity can still be picked up.

Note that we call only `transform` instead of `fit_transform` because the fit step (i.e. building the vocabulary) is already done and we don't want to re-fit here.
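
The out-of-vocabulary behaviour can be seen on a smaller scale with the default tokenizer. In this sketch the training sentences and the new sentence are chosen only for illustration; `inverse_transform` is used to list which tokens of the new document were kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [
    "Red Bull drops hint on F1 engine.",
    "Aston Martin announces sponsor.",
]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(train_docs)  # the vocabulary is fixed after this point

# 'wins' and 'race' were never seen during fit, so transform() ignores them.
new_doc = ["Red Bull wins F1 race"]
new_vec = vectorizer.transform(new_doc)

# List the vocabulary tokens that are present in the new document.
kept = sorted(str(token) for token in vectorizer.inverse_transform(new_vec)[0])
print(kept)  # ['bull', 'f1', 'red']
```

The new vector has the same number of columns as the fitted vocabulary, which is what makes it comparable with the original document vectors.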

EXERCISE 2: using the pairwise cosine_similarity method from sklearn,
calculate the similarities between each document from the corpus against
this new document (new_bow). HINT: You can pass two parameters to
cosine_similarity in this case. 

Which document is the most similar? Which is the least similar? Do the results make sense based on what you see?

In [113]:
from sklearn.metrics.pairwise import cosine_similarity

# Example new document
s = ["Teenagers take aerial shots of their neighbourhood using digital cameras sitting in old bottles which are launched via kites - a common toy for children living in the favelas. They then use GPS-enabled smartphones to take pictures of specific danger points - such as rubbish heaps, which can become a breeding ground for mosquitoes carrying dengue fever."]

new_bow = vectorizer.transform(s)

# Calculate cosine similarities with all documents in the corpus
similarities = cosine_similarity(binary_bow, new_bow)

similarities

array([[0.69565217],
       [0.40482045],
       [0.29192018],
       [0.19658927],
       [0.0521286 ]])

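Interpretation: 
1. There are 5 values, one for each of the 5 docs in `corpus`. 
2. The 1st doc has the highest cosine similarity score (≈ 0.70), so it is the most similar to the new paragraph. The 5th doc has the lowest score (≈ 0.05), which makes sense: it covers an unrelated topic and shares hardly any tokens with the new paragraph. 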
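
The ranking can also be read off programmatically. A minimal sketch, using the scores copied from the output and 1-based document numbers:

```python
# Similarity scores, one per corpus document.
scores = [0.69565217, 0.40482045, 0.29192018, 0.19658927, 0.0521286]

# Pair each score with its 1-based document number and sort, best match first.
ranking = sorted(enumerate(scores, start=1), key=lambda pair: pair[1], reverse=True)

most_similar, least_similar = ranking[0][0], ranking[-1][0]
print(most_similar, least_similar)  # 1 5
```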