In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

* Bag-of-words converts a text document into a flat vector. It is “flat” because it doesn’t contain any of the original textual structures.
* What is important here is the geometry of data in feature space. In a bag-of-words vector, each word becomes a dimension of the vector. If there are n words in the vocabulary, then a document becomes a point in n-dimensional space.
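
To make the "one dimension per word" idea concrete, here is a minimal sketch in plain Python. The two toy documents and the helper name `bag_of_words` are assumptions for illustration only; they are not part of the Yelp example used later in these notes.

```python
from collections import Counter

# Two toy documents (illustrative only)
docs = [
    "it is a puppy and it is extremely cute",
    "it is a cat",
]

# Vocabulary: one dimension per distinct word, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return the count of each vocabulary word in `doc` as a flat list."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a point with `len(vocab)` coordinates. Word order is discarded, so any reordering of the words in a document maps to the same point.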

* Bag-of-n-grams is a natural extension of bag-of-words. An n-gram is a sequence of n tokens. A word is essentially a 1-gram, also known as a unigram. After tokenization, the counting mechanism can collate individual tokens into word counts, or count overlapping sequences as n-grams. For example, the sentence “Emma knocked on the door” generates the 2-grams “Emma knocked,” “knocked on,” “on the,” and “the door.”
n-grams retain more of the original sequence structure of the text, and therefore the bag-of-n-grams representation can be more informative. However, this comes at a cost. Theoretically, with k unique words, there could be k² unique 2-grams (also called bigrams). In practice, there are not nearly so many, because not every word can follow every other word. Nevertheless, there are usually a lot more distinct n-grams (n > 1) than words. This means that bag-of-n-grams is a much bigger and sparser feature space. It also means that n-grams are more expensive to compute, store, and model. The larger n is, the richer the information, and the greater the cost.
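
The overlapping-window idea can be sketched in a few lines. The helper `ngrams` below is a hypothetical name and uses plain whitespace tokenization, which is cruder than what a real tokenizer would do.

```python
def ngrams(tokens, n):
    """Return the overlapping n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Emma knocked on the door".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with m tokens yields m - n + 1 n-grams, so the five-token sentence gives four bigrams and three trigrams.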

In [9]:
import json
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the first 10,000 reviews (the file holds one JSON object per line)
with open('yelp_academic_dataset_review.json') as f:
    js = [json.loads(f.readline()) for _ in range(10000)]

review_df = pd.DataFrame(js)

# Create feature transformers for unigrams, bigrams, and trigrams.
# The default ignores single-character words, which is useful in practice because
# it trims uninformative words, but we explicitly include them in this example for
# illustration purposes.
bow_converter = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
bigram_converter = CountVectorizer(ngram_range=(2,2), token_pattern='(?u)\\b\\w+\\b')
trigram_converter = CountVectorizer(ngram_range=(3,3), token_pattern='(?u)\\b\\w+\\b')

# Fit the transformers and look at vocabulary size
bow_converter.fit(review_df['text'])
bigram_converter.fit(review_df['text'])
trigram_converter.fit(review_df['text'])

# Get the feature names
words = bow_converter.get_feature_names_out()
bigrams = bigram_converter.get_feature_names_out()
trigrams = trigram_converter.get_feature_names_out()

print(len(words), len(bigrams), len(trigrams))

29222 368943 881620


* Depending on the task, one might also need to filter out rare words. These might be truly obscure words, or misspellings of common words. To a statistical model, a word that appears in only one or two documents is more like noise than useful information.
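
One simple way to filter rare words is to threshold on document frequency. The sketch below assumes a toy corpus and a hypothetical cutoff `MIN_DOC_FREQ`; the right value depends on the task and the corpus size.

```python
from collections import Counter

# Toy corpus (illustrative only); "grate" stands in for a misspelling of "great"
docs = [
    "the food was great",
    "the service was great",
    "the food was grate",
]

# Document frequency: the number of documents each word appears in
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

MIN_DOC_FREQ = 2  # hypothetical threshold; tune per task

vocab = sorted(w for w, df in doc_freq.items() if df >= MIN_DOC_FREQ)
rare = sorted(w for w, df in doc_freq.items() if df < MIN_DOC_FREQ)

print("kept:", vocab)
print("dropped:", rare)
```

scikit-learn's `CountVectorizer` exposes a `min_df` parameter that applies the same kind of cutoff while building the vocabulary.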

The algorithm for detecting common phrases through likelihood ratio test analysis proceeds as follows:

1. Compute occurrence probabilities for all singleton words: P(w).
2. Compute conditional pairwise word occurrence probabilities for all unique bigrams: P(w2 | w1).
3. Compute the likelihood ratio log λ for all unique bigrams.
4. Sort the bigrams based on their likelihood ratio.
5. Take the bigrams with the smallest likelihood ratio values as features.
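
The steps above can be sketched as follows. This is one possible formulation, not necessarily the one the source has in mind: it assumes a binomial model in which the null hypothesis says w2 follows w1 with the same probability as it follows any other word, and the alternate hypothesis lets the two probabilities differ. The function names and the toy token list are illustrative.

```python
import math
from collections import Counter

def log_binomial_likelihood(k, n, p):
    """Log-likelihood of k successes in n trials with success probability p.
    The binomial coefficient is omitted because it cancels in the ratio."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def bigram_log_likelihood_ratios(tokens):
    """Return {(w1, w2): log lambda} for every distinct bigram in `tokens`."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])   # times each word starts a bigram
    second_counts = Counter(tokens[1:])   # times each word ends a bigram
    n_bigrams = len(tokens) - 1

    scores = {}
    for (w1, w2), c12 in bigram_counts.items():
        c1, c2 = first_counts[w1], second_counts[w2]
        rest = n_bigrams - c1                  # bigrams not starting with w1
        p = c2 / n_bigrams                     # P(w2), shared under the null
        p1 = c12 / c1                          # P(w2 | w1)
        p2 = (c2 - c12) / rest if rest > 0 else 0.0   # P(w2 | not w1)
        null = (log_binomial_likelihood(c12, c1, p)
                + log_binomial_likelihood(c2 - c12, rest, p))
        alt = (log_binomial_likelihood(c12, c1, p1)
               + log_binomial_likelihood(c2 - c12, rest, p2))
        scores[(w1, w2)] = null - alt          # log lambda, always <= 0
    return scores

# Toy token stream (illustrative only)
tokens = ("new york is big . i love new york . the city is big . "
          "i love the city . new york is the city").split()

scores = bigram_log_likelihood_ratios(tokens)
ranked = sorted(scores, key=scores.get)   # smallest log lambda first
for bigram in ranked[:5]:
    print(bigram, round(scores[bigram], 3))
```

Under this formulation log λ is never positive, and the most negative values mark the bigrams whose words co-occur far more often than independence would predict. That is why the smallest values are kept as features.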

In [18]:
import pandas as pd
import json
# Load the first 10 reviews (the file holds one JSON object per line)
with open('yelp_academic_dataset_review.json') as f:
    js = [json.loads(f.readline()) for _ in range(10)]
review_df = pd.DataFrame(js)
# First we'll walk through spaCy's functions
!pip install -U pip setuptools wheel
!pip install -U 'spacy[apple]'
!python -m spacy download en_core_web_sm

import spacy
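# Preload the language model. spaCy 3 removed the 'en' shortcut,
# so load the package downloaded above by its full name.
nlp = spacy.load('en_core_web_sm')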
# We can create a Pandas Series of spaCy nlp variables
doc_df = review_df['text'].apply(nlp)
# spaCy gives us coarse-grained parts of speech using (.pos_)
# and fine-grained parts of speech using (.tag_)
for token in doc_df[4]:
    print([token.text, token.pos_, token.tag_])


Traceback (most recent call last):
  File "/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/runpy.py", line 144, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/site-packages/spacy/__init__.py", line 6, in <module>
  File "/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/site-packages/spacy/errors.py", line 3, in <module>
    from .compat import Literal
  File "/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/site-packages/spacy/compat.py", line 4, in <module>
    from thinc.util import copy_array
  File "/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/site-packages/thinc/__in

OSError: dlopen(/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/site-packages/mxnet/libmxnet.so, 0x0006): tried: '/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/site-packages/mxnet/libmxnet.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/site-packages/mxnet/libmxnet.so' (no such file), '/Users/maruanottoni/miniforge3/envs/mlp/lib/python3.8/site-packages/mxnet/libmxnet.so' (not a mach-o file)

The bag-of-words representation is simple to understand, easy to compute, and useful for classification and search tasks. But sometimes single words are too simplistic to encapsulate some information in the text. To fix this problem, people look to longer sequences. Bag-of-n-grams is a natural generalization of bag-of-words. The concept is still easy to understand, and it’s just as easy to compute as bag-of-words.

Bag-of-n-grams generates a lot more distinct n-grams. It increases the feature storage cost, as well as the computation cost of the model training and prediction stages. The number of data points remains the same, but the dimension of the feature space is now much larger. Hence, the data is much more sparse. The higher n is, the higher the storage and computation cost, and the sparser the data. For these reasons, longer n-grams do not always lead to improvements in model accuracy (or any other performance measure). People usually stop at n = 2 or 3. Longer n-grams are rarely used.
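
The growth in dimension and sparsity can be seen even on a tiny corpus. The sketch below uses three made-up documents and a hypothetical helper `density` that reports the fraction of nonzero entries in the document-by-n-gram count matrix. The exact numbers mean little at this scale, but the direction of the change matches the argument above.

```python
# Toy corpus (illustrative only)
docs = [
    "the food was great and the service was great",
    "the service was slow but the food was great",
    "great food and great service",
]

def ngram_set(doc, n):
    """Distinct n-grams (as tuples) in a whitespace-tokenized document."""
    tokens = doc.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def density(docs, n):
    """Return (feature dimension, fraction of nonzero matrix entries)."""
    per_doc = [ngram_set(doc, n) for doc in docs]
    vocab = set().union(*per_doc)
    nonzero = sum(len(s) for s in per_doc)
    return len(vocab), nonzero / (len(docs) * len(vocab))

uni_dim, uni_density = density(docs, 1)
bi_dim, bi_density = density(docs, 2)

print("unigrams:", uni_dim, "features, density", round(uni_density, 3))
print("bigrams: ", bi_dim, "features, density", round(bi_density, 3))
```

The number of documents is fixed at three, yet moving from unigrams to bigrams raises the feature count and lowers the share of nonzero entries.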

One way to combat the increase in sparsity and cost is to filter the n-grams and retain only the most meaningful phrases. This is the goal of collocation extraction. In theory, collocations (or phrases) could form nonconsecutive token sequences in the text. In practice, however, looking for nonconsecutive phrases has a much higher computation cost for not much gain. So, collocation extraction usually starts with a candidate list of bigrams and utilizes statistical methods to filter them.

All of these methods turn a sequence of text tokens into a disconnected set of counts. Sets have much less structure than sequences; they lead to flat feature vectors.

In this chapter, we dipped our toes into the water with simple text featurization techniques. These techniques turn a piece of natural language text, full of rich semantic structure, into a simple flat vector. We discussed a number of common filtering techniques to clean up the vector entries. We also introduced n-grams and collocation extraction as methods that add a little more structure into the flat vectors. The next chapter goes into a lot more detail about another common text featurization trick called tf-idf. Subsequent chapters will discuss more methods for adding structure back into a flat vector.
