# Bag of Words (BoW)
<hr style="border:2px solid black">

<img src="bag_of_words.png" width="600"/>

## 1. Introduction

### 1.1 BoW: What & Why?

**Modeling text data**
>- text data are usually messy, while ML algorithms prefer well-defined fixed-length inputs
>- ML algorithms cannot work with raw text directly
>- text must be converted into numbers, specifically vectors of numbers

**Feature extraction**
>- feature extraction encodes linguistic properties of the text as numeric vectors
>- BoW is a popular and simple method of feature extraction with text data

**BoW approach**
>- a BoW is a representation of text that describes the occurrence of words within a document
>- involves two things:
    + a vocabulary of known words
    + a measure of the presence of known words
>- information about the order or structure of words in the document is discarded 
>- intuition is that documents are similar if they have similar content
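
The two ingredients above (a vocabulary and a measure of presence) can be sketched in a few lines of plain Python. The two toy documents are made up for illustration and are not part of the corpus used later. Whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# toy documents (hypothetical, for illustration only)
docs = ["the cat sat on the mat", "the dog sat"]

# 1) vocabulary of known words: sorted set of all words in all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

# 2) measure of presence: count of each vocabulary word per document
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Every document maps to a vector of the same length (the vocabulary size), regardless of how long the document is.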

**Pros of BoW**
>- works for any text, easy to understand and implement
>- does not require a language model (no training)
>- successful in problems such as language modeling and document classification

**Cons of BoW**
>- all words are equally (dis)similar (discrete, orthogonal vectors)
>- ignores the context by discarding order of words
>- computationally challenging because of sparse data
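
The loss of word order can be illustrated with a small sketch: two sentences with opposite meanings end up with the same bag. The helper `bag` is a hypothetical name used only here.

```python
from collections import Counter

def bag(sentence):
    """Return the bag (multiset) of lowercase words in a sentence."""
    return Counter(sentence.lower().split())

a = bag("dog bites man")
b = bag("man bites dog")

# same words, same counts: the two bags are indistinguishable
print(a == b)  # True
```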

### 1.2 NLP Jargon

>|   terminology   |                meaning                  |
 |:---------------:|:---------------------------------------:|
 |    `corpus`     |   a list of strings or text documents   |
 | `tokenization`  |dividing text into words (or other units)|
 |    `n-grams`    |    sequences of n consecutive tokens    |
 |   `stop words`  |frequent words that carry little meaning |
 |    `stemming`   |         cutting off word endings        |
 | `lemmatization` | grouping together of different forms of the same word |
 | `vectorization` |       converting text into numbers      |
 |     `tf-idf`    |    method for normalizing token counts  |
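
As a minimal sketch of the `n-grams` entry, the hypothetical helper `ngrams` below slides a window of `n` tokens over an already tokenized line.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["here", "comes", "the", "sun"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['here', 'comes', 'the', 'sun']
print(bigrams)   # ['here comes', 'comes the', 'the sun']
```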

### 1.3 Text Corpus

**Beatles corpus**

In [1]:
BEATLES_CORPUS = [
    "Yesterday, all my troubles seemed so far away", 
    "We all live in a yellow submarine, yellow submarine",
    "When I find myself in times of trouble, mother mary comes to me",
    "Penny lane is in my ears and in my eyes",
    "Here comes the sun and I say it's alright little darling"
]

**Backstreet Boys corpus**

In [2]:
BACKSTREET_BOYS_CORPUS = [       
    "You're the one for me you're my ecstasy youre the one I need hey yeah ohh",
    "You're my fire the one desire believe me when I say I want it that way",
    "Everybody rock your body, everybody rock your body, right backstreets back alright",
    "Show me the meaning of being lonely is this the feeling i need to walk with",
    "Now I can see that weve fallen apart from the way that it used to be yeah"
]

**Total corpus & label**

In [3]:
# corpus
CORPUS = BEATLES_CORPUS + BACKSTREET_BOYS_CORPUS

# label
l1,l2 = len(BEATLES_CORPUS), len(BACKSTREET_BOYS_CORPUS)
LABELS = [f"beatles_{i}" for i in range(l1)] + [f"bboys_{i}" for i in range(l2)] 

In [4]:
CORPUS

['Yesterday, all my troubles seemed so far away',
 'We all live in a yellow submarine, yellow submarine',
 'When I find myself in times of trouble, mother mary comes to me',
 'Penny lane is in my ears and in my eyes',
 "Here comes the sun and I say it's alright little darling",
 "You're the one for me you're my ecstasy youre the one I need hey yeah ohh",
 "You're my fire the one desire believe me when I say I want it that way",
 'Everybody rock your body, everybody rock your body, right backstreets back alright',
 'Show me the meaning of being lonely is this the feeling i need to walk with',
 'Now I can see that weve fallen apart from the way that it used to be yeah']

In [5]:
LABELS

['beatles_0',
 'beatles_1',
 'beatles_2',
 'beatles_3',
 'beatles_4',
 'bboys_0',
 'bboys_1',
 'bboys_2',
 'bboys_3',
 'bboys_4']

<hr style="border:2px solid black">

## 2. BoW from Scratch

**Load packages**

In [6]:
# data analysis stack
import numpy as np
import pandas as pd

# text-related stack
import re
from sklearn.feature_extraction import text

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

### 2.1 Convert to lowercase

In [7]:
lowercase_corpus = [line.lower() for line in CORPUS]

In [8]:
lowercase_corpus

['yesterday, all my troubles seemed so far away',
 'we all live in a yellow submarine, yellow submarine',
 'when i find myself in times of trouble, mother mary comes to me',
 'penny lane is in my ears and in my eyes',
 "here comes the sun and i say it's alright little darling",
 "you're the one for me you're my ecstasy youre the one i need hey yeah ohh",
 "you're my fire the one desire believe me when i say i want it that way",
 'everybody rock your body, everybody rock your body, right backstreets back alright',
 'show me the meaning of being lonely is this the feeling i need to walk with',
 'now i can see that weve fallen apart from the way that it used to be yeah']

### 2.2 Tokenize

In [9]:
# compile the pattern once, instead of on every call
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(line):
    """
    Return the list of tokens (runs of two or more word characters) in a string.
    The parameter is called `line` to avoid shadowing the imported `text` module.
    """
    # findall returns every non-overlapping match as a string
    return TOKEN_PATTERN.findall(line)

In [10]:
tokenized_corpus = [tokenize(line) for line in lowercase_corpus]
tokenized_corpus[0]

['yesterday', 'all', 'my', 'troubles', 'seemed', 'so', 'far', 'away']
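
The token pattern keeps only runs of two or more word characters, which has two side effects worth seeing: single-character words such as "i" and "a" disappear, and contractions are split at the apostrophe. The sketch below applies the same pattern to two short strings. The second string is made up for illustration.

```python
import re

# same pattern as in tokenize() above
pattern = re.compile(r"(?u)\b\w\w+\b")

first = pattern.findall("here comes the sun and i say it's alright")
second = pattern.findall("you're a star")

print(first)   # ['here', 'comes', 'the', 'sun', 'and', 'say', 'it', 'alright']
print(second)  # ['you', 're', 'star']
```

This is why fragments such as "re" can show up as tokens before stop words are removed.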

### 2.3 Remove stop words

In [25]:
# ENGLISH_STOP_WORDS is a frozenset; recent scikit-learn versions raise an error
# unless `stop_words` is 'english', a list, or None, so cast it to a list
# see: https://stackoverflow.com/questions/75643277/how-can-i-solve-the-error-the-stop-words-parameter-of-tfidfvectorizer-must-be
english_stop_words = list(text.ENGLISH_STOP_WORDS)

In [26]:
clean_corpus = [[term for term in tokens if term not in english_stop_words] for tokens in tokenized_corpus]
clean_corpus

[['yesterday', 'troubles', 'far', 'away'],
 ['live', 'yellow', 'submarine', 'yellow', 'submarine'],
 ['times', 'trouble', 'mother', 'mary', 'comes'],
 ['penny', 'lane', 'ears', 'eyes'],
 ['comes', 'sun', 'say', 'alright', 'little', 'darling'],
 ['ecstasy', 'youre', 'need', 'hey', 'yeah', 'ohh'],
 ['desire', 'believe', 'say', 'want', 'way'],
 ['everybody',
  'rock',
  'body',
  'everybody',
  'rock',
  'body',
  'right',
  'backstreets',
  'alright'],
 ['meaning', 'lonely', 'feeling', 'need', 'walk'],
 ['weve', 'fallen', 'apart', 'way', 'used', 'yeah']]

### 2.4 Vectorize

In [13]:
# flatten nested term list
term_list = [term for sub_list in clean_corpus for term in sub_list]
term_list 

['yesterday',
 'troubles',
 'far',
 'away',
 'live',
 'yellow',
 'submarine',
 'yellow',
 'submarine',
 'times',
 'trouble',
 'mother',
 'mary',
 'comes',
 'penny',
 'lane',
 'ears',
 'eyes',
 'comes',
 'sun',
 'say',
 'alright',
 'little',
 'darling',
 'ecstasy',
 'youre',
 'need',
 'hey',
 'yeah',
 'ohh',
 'desire',
 'believe',
 'say',
 'want',
 'way',
 'everybody',
 'rock',
 'body',
 'everybody',
 'rock',
 'body',
 'right',
 'backstreets',
 'alright',
 'meaning',
 'lonely',
 'feeling',
 'need',
 'walk',
 'weve',
 'fallen',
 'apart',
 'way',
 'used',
 'yeah']

In [27]:
len(term_list)

55

In [28]:
# list of sorted unique terms
unique_terms = sorted(list(set(term_list)))

In [29]:
len(unique_terms)

44

In [17]:
unique_terms

['alright',
 'apart',
 'away',
 'backstreets',
 'believe',
 'body',
 'comes',
 'darling',
 'desire',
 'ears',
 'ecstasy',
 'everybody',
 'eyes',
 'fallen',
 'far',
 'feeling',
 'hey',
 'lane',
 'little',
 'live',
 'lonely',
 'mary',
 'meaning',
 'mother',
 'need',
 'ohh',
 'penny',
 'right',
 'rock',
 'say',
 'submarine',
 'sun',
 'times',
 'trouble',
 'troubles',
 'used',
 'walk',
 'want',
 'way',
 'weve',
 'yeah',
 'yellow',
 'yesterday',
 'youre']

In [18]:
count_matrix = pd.DataFrame(index=LABELS)
for term in unique_terms:
    count_matrix[term] = [sub_list.count(term) for sub_list in clean_corpus]


**Count matrix**

In [19]:
count_matrix

,alright,apart,away,backstreets,believe,body,comes,darling,desire,ears,...,troubles,used,walk,want,way,weve,yeah,yellow,yesterday,youre
beatles_0,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
beatles_1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
beatles_2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
beatles_3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
beatles_4,1,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
bboys_0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
bboys_1,0,0,0,0,1,0,0,0,1,0,...,0,0,0,1,1,0,0,0,0,0
bboys_2,1,0,0,1,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bboys_3,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
bboys_4,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,1,0,0,0


### 2.5 Normalization

Problems with raw occurrence counts of terms:
>- long documents receive disproportionately large weight
>- highly frequent terms with little "information content" may dominate

**`Term Frequency (tf)`** 
- $\text{tf}(t,d)$ $~=~$ number of occurrences of term $t$ in document $d$, divided by the total number of terms in $d$

**`Document Frequency (df)`** 
- $\text{df}(t)$ $~=~$ number of documents that contain term $t$

**`Inverse Document Frequency (idf)`** 
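- $\text{idf}(t)$ $~=~$ inverse document frequency of term $t$ $~=~$ $1+\log\left[\frac{1+n}{1+\text{df}(t)}\right]$, where $n$ is the total number of documents and $\log$ is the natural logarithm (the smoothed variant that scikit-learn uses by default)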

**`Term Frequency - Inverse Document Frequency (tf-idf)`** 
- $\text{tf-idf}(t,d)$ $~=~$ $\text{tf}(t,d) \times \text{idf}(t)$
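
The four definitions above can be turned into a small from-scratch sketch. The three tokenized toy documents and the function names are assumptions made for illustration. Note that scikit-learn's own implementation uses raw counts for tf and L2-normalizes each row by default, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# toy tokenized documents (hypothetical, for illustration only)
docs = [
    ["yellow", "submarine", "yellow", "submarine"],
    ["comes", "sun"],
    ["comes", "mary"],
]
n = len(docs)  # total number of documents

def tf(term, doc):
    """Fraction of the tokens in doc that are equal to term."""
    return doc.count(term) / len(doc)

# df(t): number of documents that contain term t
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency, as defined above."""
    return 1 + math.log((1 + n) / (1 + df[term]))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "comes" appears in two documents, "sun" in only one,
# so "sun" gets the higher weight within the second document
print(round(tf_idf("comes", docs[1]), 4))
print(round(tf_idf("sun", docs[1]), 4))
```

Both terms have the same tf in the second document (0.5), so the difference in weight comes entirely from the idf factor.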

### 2.6 Other Techniques

- n-grams (the `ngram_range` hyperparameter in scikit-learn)
- dimensionality reduction techniques, e.g., PCA (or `TruncatedSVD` for sparse matrices)
- stemming
- lemmatization

<hr style="border:2px solid black">

## 3. BoW in Scikit-Learn

### 3.1 [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
count_vectorizer = CountVectorizer(
    #
    ###################################################################
    # convert all characters to lowercase before tokenizing
    #
    lowercase = True, 
    #
    ###################################################################
    # choose words to create features
    #
    analyzer = 'word',
    #
    ###################################################################
    # use built-in stop word list for English (default=None)
    #
    stop_words = english_stop_words,
    #
    ###################################################################
    # select tokens of 2 or more word characters (punctuation ignored) 
    #
    token_pattern = r"(?u)\b\w\w+\b",
    #
    ###################################################################
    # consider only unigrams of tokens
    #
    ngram_range = (1, 1)
    #
    ###################################################################  
)

In [31]:
vec = count_vectorizer.fit_transform(CORPUS)

**Feature matrix (document-term matrix)**

In [None]:
count_matrix_skl = pd.DataFrame(
    vec.todense(), 
    # get_feature_names() was removed in recent scikit-learn versions
    columns=count_vectorizer.get_feature_names_out(), 
    index=LABELS
)

In [None]:
count_matrix_skl 

### 3.2 [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

In [None]:
tf = TfidfTransformer(use_idf=True)

In [None]:
vec2 = tf.fit_transform(vec)

In [None]:
type(vec2)

In [None]:
print(vec2)

In [None]:
type(vec2.todense())

In [None]:
print(vec2.todense())

In [None]:
feature_matrix = pd.DataFrame(
    vec2.todense(), 
    # get_feature_names() was removed in recent scikit-learn versions
    columns=count_vectorizer.get_feature_names_out(), 
    index=LABELS
)

In [None]:
feature_matrix

### 3.3 [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidfvectorizer#sklearn.feature_extraction.text.TfidfVectorizer)

- Option A: Use a CountVectorizer + TfidfTransformer sequentially in a pipeline
- Option B: Use a TfidfVectorizer in a single step
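
A minimal sketch of the two options, assuming default settings for all three classes and a made-up three-line corpus: with identical settings, both routes should produce the same feature matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# toy corpus (hypothetical, for illustration only)
toy_corpus = [
    "we all live in a yellow submarine",
    "here comes the sun",
    "here comes the yellow sun",
]

# Option A: count first, then reweight the counts with tf-idf
pipeline = make_pipeline(CountVectorizer(), TfidfTransformer())
option_a = pipeline.fit_transform(toy_corpus)

# Option B: both steps in a single estimator
option_b = TfidfVectorizer().fit_transform(toy_corpus)

# compare the two sparse results as dense arrays
same = np.allclose(option_a.toarray(), option_b.toarray())
print(same)
```

Option A is handy when the raw counts are also needed; Option B is shorter when only the tf-idf matrix matters.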

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer(stop_words=english_stop_words)

In [None]:
vec3 = vectorizer.fit_transform(CORPUS)

In [None]:
feature_matrix1 = pd.DataFrame(
    vec3.todense(), 
    # take the column names from the vectorizer that produced vec3
    columns=vectorizer.get_feature_names_out(), 
    index=LABELS
)

In [None]:
feature_matrix1

### 3.4 Exercise

**Learn more about [`TfidfVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidfvectorizer#sklearn.feature_extraction.text.TfidfVectorizer), and all its hyperparameters.**

<hr style="border:2px solid black">

## References

- [A Gentle Introduction to the Bag-of-Words Model](https://machinelearningmastery.com/gentle-introduction-bag-words-model/)
- [How Bag of Words (BoW) Works in NLP](https://dataaspirant.com/bag-of-words-bow/)
- [TF(Term Frequency)-IDF(Inverse Document Frequency) from scratch in python](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558)