### Bag-of-What?
Bag-of-words (BoW) is a statistical language model based on word count. Say what?

Let’s start with that first part: a **statistical language model** is a way for computers to make sense of language based on probability. For example, let’s say we have the text:

“Five fantastic fish flew off to find faraway functions. Maybe find another five fantastic fish?”

A statistical language model focused on the starting letter for words might take this text and predict that words are most likely to start with the letter “f” because 11 out of 15 words begin that way. A different statistical model that pays attention to word order might tell us that the word “fish” tends to follow the word “fantastic.”
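
To make the starting-letter example concrete, here is a minimal sketch that counts first letters in the sample text. The regex-based word splitting is an assumption chosen for simplicity; it is not the preprocessing used later in these notes.

```python
from collections import Counter
import re

text = ("Five fantastic fish flew off to find faraway functions. "
        "Maybe find another five fantastic fish?")

# Lowercase the text and pull out alphabetic words.
words = re.findall(r"[a-z]+", text.lower())

# Count how often each starting letter occurs.
first_letters = Counter(word[0] for word in words)
letter, count = first_letters.most_common(1)[0]

print(f"{count} of {len(words)} words start with '{letter}'")
# prints: 11 of 15 words start with 'f'
```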

Bag-of-words does not give a flying fish about word starts or word order though; its sole concern is word count — how many times each word appears in a document.

If you’re already familiar with statistical language models, you may also have heard BoW referred to as the **unigram model**. It’s technically **a special case of another statistical model, the n-gram model**, with n (the number of words in a sequence) set to 1.
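
A small sketch of the "n-gram with n set to 1" idea, using only the standard library. `ngram_counts` is a hypothetical helper written for this note, and the token list is made up for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

tokens = "five fantastic fish find five fantastic functions".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# With n = 1 each "sequence" is a single word, so the counts
# are the same as a plain word count.
word_counts = Counter(tokens)
unigrams_as_words = {gram[0]: count for gram, count in unigrams.items()}

print(unigrams_as_words == dict(word_counts))  # True
print(unigrams[("five",)])                     # 2
print(bigrams[("five", "fantastic")])          # 2
```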

The words from the sentence go into the bag-of-words and come out as a dictionary of words with their corresponding counts. For statistical models, we call the text that we use to build the model our **training data**. Usually, we need to prepare our text data by breaking it up into documents (shorter strings of text, generally sentences).

In [6]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...


True

In [7]:
# preprocessing the text:
import nltk, re
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

stop_words = stopwords.words('english')
normalizer = WordNetLemmatizer()

def get_part_of_speech(word):
  probable_part_of_speech = wordnet.synsets(word)
  pos_counts = Counter()
  pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  )
  pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  )
  pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  )
  pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  )
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech

def preprocess_text(text):
  cleaned = re.sub(r'\W+', ' ', text).lower()
  tokenized = word_tokenize(cleaned)
  normalized = [normalizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized]
  return normalized

# Define text_to_bow() below:
def text_to_bow(some_text):
  bow_dictionary = {}
  tokens = preprocess_text(some_text)
  for token in tokens:
    if token in bow_dictionary:
      bow_dictionary[token] +=1
    else:
      bow_dictionary[token] = 1
  return bow_dictionary
print(text_to_bow("I love fantastic flying fish. These flying fish are just ok, so maybe I will find another few fantastic fish..."))

{'i': 2, 'love': 1, 'fantastic': 2, 'fly': 2, 'fish': 3, 'these': 1, 'be': 1, 'just': 1, 'ok': 1, 'so': 1, 'maybe': 1, 'will': 1, 'find': 1, 'another': 1, 'few': 1}


### Building a Features Dictionary

1. Clean the training data.
2. Tokenize it.
3. Normalize it (lemmatize with part-of-speech tags).
4. Create a dictionary with the unique words as keys and assign an index to each (instead of counting them as above).

This dictionary will serve as the **vocabulary** later on.

In [None]:
def create_features_dictionary(documents):
  features_dictionary = {}
  merged = ' '.join(documents)
  tokens = preprocess_text(merged)
  index = 0
  for token in tokens:
    if token not in features_dictionary:
      features_dictionary[token] = index
      index += 1
  return features_dictionary, tokens
training_documents = ["Five fantastic fish flew off to find faraway functions.", "Maybe find another five fantastic fish?", "Find my fish with a function please!"]

print(create_features_dictionary(training_documents))

### Building a BoW Vector

We can represent a vector as a list.
Each index stands for a word in the vocabulary, and the value at that index is that word's count.
If I understand correctly, the vector shows which vocabulary words appear in some test text and how many times each one appears.

In [None]:
def text_to_bow_vector(some_text, features_dictionary):
  bow_vector = [0 for x in range(len(features_dictionary)) ]
  tokens = preprocess_text(some_text)
  for token in tokens:
    if token in features_dictionary:
      feature_index = features_dictionary[token]
      bow_vector[feature_index] += 1
  return bow_vector, tokens

features_dictionary = {'function': 8, 'please': 14, 'find': 6, 'five': 0, 'with': 12, 'fantastic': 1, 'my': 11, 'another': 10, 'a': 13, 'maybe': 9, 'to': 5, 'off': 4, 'faraway': 7, 'fish': 2, 'fly': 3}

text = "Another five fish find another faraway fish."
print(text_to_bow_vector(text, features_dictionary)[0])

It seems that once we have the features dictionary and the BoW vectors, we can train a model on training text (e.g., spam emails).
Then we can create test vectors to check the accuracy of the model.
However, for this they used something called Naive Bayes classification, which was not explained, so for now I won't add the code below.

Yup, as they say in the next lesson, none of this was strictly necessary, because there are libraries and functions that already do it:

### Spam A Lot No More
Amazing work! As is the case with many tasks in Python, there’s already a library that can do all of that work for you.

For `text_to_bow()`, you can approximate the functionality with the `Counter` class from the collections module:
```
from collections import Counter

tokens = ['another', 'five', 'fish', 'find', 'another', 'faraway', 'fish']
print(Counter(tokens))

# On Python 3.7+, ties keep first-seen order:
# Counter({'another': 2, 'fish': 2, 'five': 1, 'find': 1, 'faraway': 1})
```
For vectorization, you can use CountVectorizer from the machine learning library scikit-learn. You can use fit() to train the features dictionary and then transform() to transform text into a vector:
```
from sklearn.feature_extraction.text import CountVectorizer

training_documents = ["Five fantastic fish flew off to find faraway functions.", "Maybe find another five fantastic fish?", "Find my fish with a function please!"]
test_text = ["Another five fish find another faraway fish."]
bow_vectorizer = CountVectorizer()
bow_vectorizer.fit(training_documents)
bow_vector = bow_vectorizer.transform(test_text)
print(bow_vector.toarray())
# [[2 0 1 1 2 1 0 0 0 0 0 0 0 0 0]]
```


Hopefully I'll soon understand the whole code. In the meantime, I learnt this from Bing:
I was wondering why different methods were used for the training and the test data in the vectorization exercise:
```
# for training data:
vector_from_training_data = vectorizer.fit_transform(training_data)

# for test data:
vector_from_test_data = vectorizer.transform(test_data)
```

So Bing says that this prevents data leakage from the test data into the model: the vocabulary is learned from the training data only.
- For the training data, use `.fit()` followed by `.transform()`, or `.fit_transform()` in one step.
- For the test data, you only need `.transform()`.
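
To see why the split matters, here is a rough pure-Python sketch of what fitting and transforming do. `TinyCountVectorizer` is a hypothetical stand-in written for this note; it mimics the method names but is not scikit-learn's implementation, and it splits on whitespace only.

```python
class TinyCountVectorizer:
    """Hypothetical stand-in for illustration; not scikit-learn's code."""

    def fit(self, documents):
        # Learn the vocabulary from these documents only.
        words = sorted({w for doc in documents for w in doc.lower().split()})
        self.vocabulary_ = {word: index for index, word in enumerate(words)}
        return self

    def transform(self, documents):
        # Count words with the already-learned vocabulary;
        # words that were never seen during fit are skipped.
        vectors = []
        for doc in documents:
            vector = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:
                    vector[self.vocabulary_[w]] += 1
            vectors.append(vector)
        return vectors

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


training_data = ["find my fish", "five fantastic fish"]
test_data = ["find five flying fish"]

vectorizer = TinyCountVectorizer()
train_vectors = vectorizer.fit_transform(training_data)
test_vectors = vectorizer.transform(test_data)

print(vectorizer.vocabulary_)
# {'fantastic': 0, 'find': 1, 'fish': 2, 'five': 3, 'my': 4}
print(test_vectors)
# [[0, 1, 1, 1, 0]]  ("flying" is not in the vocabulary, so it is ignored)
```

If we called `fit` on the test data as well, "flying" would enter the vocabulary and the test set would have shaped the model's features.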

The only thing left to understand from this part is how this MultinomialNB works:
`from sklearn.naive_bayes import MultinomialNB`
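
As a first step towards understanding it, here is a rough sketch of the idea behind multinomial Naive Bayes on word counts. The toy emails and labels are made up, the helper names are hypothetical, and this is my own simplification with add-one smoothing, not scikit-learn's actual code.

```python
import math
from collections import Counter

# Hypothetical toy data: each document is already tokenized and labeled.
training = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "win"], "spam"),
    (["lunch", "meeting", "now"], "ham"),
    (["see", "you", "at", "lunch"], "ham"),
]

label_counts = Counter(label for _, label in training)
word_counts = {label: Counter() for label in label_counts}
for tokens, label in training:
    word_counts[label].update(tokens)
vocabulary = {word for tokens, _ in training for word in tokens}

def log_score(tokens, label):
    # log P(label) + sum of log P(word | label), with add-one smoothing.
    score = math.log(label_counts[label] / len(training))
    total = sum(word_counts[label].values())
    for word in tokens:
        if word in vocabulary:  # words never seen in training are skipped
            likelihood = (word_counts[label][word] + 1) / (total + len(vocabulary))
            score += math.log(likelihood)
    return score

def classify(tokens):
    # Pick the label with the highest score.
    return max(label_counts, key=lambda label: log_score(tokens, label))

print(classify(["free", "money"]))     # spam
print(classify(["lunch", "meeting"]))  # ham
```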

### BoW Wow
As you can see, bag-of-words is pretty useful! BoW also has several advantages over other language models. For one, it’s an easier model to get started with and a few Python libraries already have built-in support for it.

Because bag-of-words relies on single words, rather than sequences of words, there are more examples of each unit of language in the training corpus. More examples means the model has less **data sparsity** (i.e., it has more training knowledge to draw from) than other statistical models.

Imagine you want to make a shirt to sell to people. If you have the shirt exactly tailored to someone’s body, it probably won’t fit that many people. But if you make a shirt that is just a giant bag with arm holes, you know that no one will buy it. What do you do? You loosely fit the shirt to someone’s body, leaving some extra room for different body shapes.

**Overfitting** (adapting a model too strongly to training data, akin to our highly tailored shirt) is a common problem for statistical language models. While BoW still suffers from overfitting in terms of vocabulary, it overfits less than other statistical models, allowing for more flexibility in grammar and word choice.

The combination of low data sparsity and less overfitting makes the bag-of-words model more reliable with smaller training data sets than other statistical models.


The snippet below demonstrates how much more effective BoW is than a bigram model when training data is scarce:

In [8]:

from nltk.util import ngrams
from collections import Counter

text = "It's exciting to watch flying fish after a hard day's work. I don't know why some fish prefer flying and other fish would rather swim. It seems like the fish just woke up one day and decided, 'hey, today is the day to fly away.'"
tokens = preprocess_text(text)

# Bigram approach:
bigrams_prepped = ngrams(tokens, 2)
bigrams = Counter(bigrams_prepped)
print("Three most frequent word sequences and the number of occurrences according to Bigrams:")
print(bigrams.most_common(3))

# Bag-of-Words approach:
# Define bag_of_words here:
bag_of_words = Counter(tokens)
print("\nThree most frequent words and number of occurrences according to Bag-of-Words:")
most_common_three = bag_of_words.most_common(3)
print(most_common_three)


Three most frequent word sequences and the number of occurrences according to Bigrams:
[(('it', 's'), 1), (('s', 'excite'), 1), (('excite', 'to'), 1)]

Three most frequent words and number of occurrences according to Bag-of-Words:
[('fish', 4), ('fly', 3), ('day', 3)]


### BoW Ow
Alas, there is a trade-off for all the brilliance BoW brings to the table.

Unless you want sentences that look like “the a but for the”, BoW is NOT a great primary model for text prediction. If that sort of “sentence” isn’t your bag, it’s because **bag-of-words has high perplexity**, meaning that it’s not a very accurate model for language prediction. Whatever came before, the most probable next word is always just the most frequently used word in the training data.
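
A tiny sketch of why prediction goes wrong, assuming a made-up training sentence. `predict_next_word` is a hypothetical helper; it accepts the previous words only to show that a unigram model never looks at them.

```python
from collections import Counter

tokens = "the fish saw the bird but the bird saw a fish".split()
unigram_counts = Counter(tokens)

def predict_next_word(previous_words):
    # A unigram (bag-of-words) model ignores the previous words entirely
    # and always returns the most frequent word.
    return unigram_counts.most_common(1)[0][0]

print(predict_next_word(["the"]))          # the
print(predict_next_word(["fish", "saw"]))  # the
```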

If your BoW model finds “good” frequently occurring in a text sample, you might assume there’s a positive sentiment being communicated in that text… but if you look at the original text you may find that in fact every “good” was preceded by a “not.”

Hmm, that would have been helpful to know. The BoW model’s word tokens lack context, which can make a word’s intended meaning unclear.
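
The "not good" problem is easy to reproduce. The review text below is invented for illustration; the bigram count is only there to show the context that the plain word count throws away.

```python
from collections import Counter

review = "not good at all the plot was not good and the acting was not good"
tokens = review.split()

bag_of_words = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# BoW sees three occurrences of "good" and nothing else about them.
print(bag_of_words["good"])       # 3

# Looking at pairs shows that every "good" follows a "not".
print(bigrams[("not", "good")])   # 3
```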

Perhaps you are wondering, “What happens if the model comes across a new word that wasn’t in the training data?” As mentioned, like all statistical models, BoW suffers from overfitting when it comes to vocabulary.

There are several ways that NLP developers have tackled this issue. A common approach is through **language smoothing** in which some probability is siphoned from the known words and given to unknown words.
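
One simple form of smoothing is add-one (Laplace) smoothing. The sketch below is an illustration under my own assumptions: a made-up token list, and a single extra vocabulary slot reserved for all unknown words.

```python
from collections import Counter

training_tokens = "five fantastic fish find five fish".split()
counts = Counter(training_tokens)

# Reserve one extra vocabulary slot for any word not seen in training.
vocabulary_size = len(counts) + 1
total = sum(counts.values())

def smoothed_probability(word):
    # Every word, seen or not, gets one extra count, which moves a
    # little probability from known words to unknown ones.
    return (counts[word] + 1) / (total + vocabulary_size)

print(smoothed_probability("fish"))      # seen twice: 3/11
print(smoothed_probability("function"))  # never seen: 1/11, not zero
```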

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#extracting-features-from-text-files