## Bag of N-Grams

One-hot encoding, BoW, and TF-IDF all treat words as independent units, with no notion of phrases or word order. The Bag of N-grams (BoN) approach tries to remedy this by breaking the text into chunks of n contiguous words (or tokens) and treating each chunk as a vocabulary item. This captures some local context that the earlier approaches could not: with n = 2, for example, "machine learning" becomes a single feature rather than two unrelated words.
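To make the chunking idea concrete, here is a minimal sketch in plain Python. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration; they are not part of any library used in this notebook.

```python
def ngrams(tokens, n):
    """Return every run of n contiguous tokens, joined by a space."""
    # A sequence of k tokens yields k - n + 1 n-grams (none if k < n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace
tokens = "one good example is to use one hot encoding".split()

unigrams = ngrams(tokens, 1)  # single words, as in plain BoW
bigrams = ngrams(tokens, 2)   # pairs of adjacent words

print(unigrams)
print(bigrams)
```

Each bigram keeps two neighbouring words together, so a phrase like "hot encoding" survives as one unit instead of being split into unrelated words.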

In [3]:
# Our corpus
documents = [
    "Often, machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model.",
    "Getting started in applied machine learning can be difficult, especially when working with real-world data.",
    "One good example is to use a one-hot encoding on categorical data."
]

processed_docs = [doc.lower().replace(",", "").replace(".", "").replace("-", " ") for doc in documents]
processed_docs

['often machine learning tutorials will recommend or require that you prepare your data in specific ways before fitting a machine learning model',
 'getting started in applied machine learning can be difficult especially when working with real world data',
 'one good example is to use a one hot encoding on categorical data']

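CountVectorizer, which we used for BoW, can also produce a Bag of N-grams representation through its ngram_range argument, a (min_n, max_n) tuple. With ngram_range=(1, 2), the vocabulary contains both unigrams and bigrams. Note that with its default settings CountVectorizer ignores single-character tokens such as "a", which is why 'fitting machine' shows up as a bigram in the vocabulary. The code snippet below shows how: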

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# N-gram vectorization example with count vectorizer and unigrams, bigrams
count_vect = CountVectorizer(ngram_range=(1, 2))

# Build the bag-of-n-grams representation for the corpus
bow_rep = count_vect.fit_transform(processed_docs)

# Look at the vocabulary mapping (n-gram -> column index)
print("Our vocabulary: ", count_vect.vocabulary_)

# Look at the count vectors for the first two documents
print("BoW representation for document 1: ", bow_rep[0].toarray())
print("BoW representation for document 2: ", bow_rep[1].toarray())

# Get the representation using this vocabulary, for a new text
temp = count_vect.transform(["machine learning is good"])

print("BoW representation for 'machine learning is good':", temp.toarray())

Our vocabulary:  {'often': 40, 'machine': 37, 'learning': 33, 'tutorials': 65, 'will': 73, 'recommend': 53, 'or': 47, 'require': 55, 'that': 61, 'you': 81, 'prepare': 49, 'your': 83, 'data': 10, 'in': 28, 'specific': 57, 'ways': 69, 'before': 4, 'fitting': 20, 'model': 39, 'often machine': 41, 'machine learning': 38, 'learning tutorials': 36, 'tutorials will': 66, 'will recommend': 74, 'recommend or': 54, 'or require': 48, 'require that': 56, 'that you': 62, 'you prepare': 82, 'prepare your': 50, 'your data': 84, 'data in': 11, 'in specific': 30, 'specific ways': 58, 'ways before': 70, 'before fitting': 5, 'fitting machine': 21, 'learning model': 35, 'getting': 22, 'started': 59, 'applied': 0, 'can': 6, 'be': 2, 'difficult': 12, 'especially': 16, 'when': 71, 'working': 77, 'with': 75, 'real': 51, 'world': 79, 'getting started': 23, 'started in': 60, 'in applied': 29, 'applied machine': 1, 'learning can': 34, 'can be': 7, 'be difficult': 3, 'difficult especially': 13, 'especially when':

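Note that the number of features (and hence the size of each feature vector) grows considerably for the same data compared with the single-word representations. The vocabulary indices above run up to 84, so every document is now an 85-dimensional vector, because each distinct bigram gets its own column alongside the unigrams.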
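The growth in vocabulary size can be checked without scikit-learn. The sketch below is an approximation under stated assumptions: the helper `ngram_vocabulary` is hypothetical, and the regular expression is chosen to mirror CountVectorizer's default token pattern (tokens of two or more word characters), not taken from the notebook itself.

```python
import re

processed_docs = [
    "often machine learning tutorials will recommend or require that you "
    "prepare your data in specific ways before fitting a machine learning model",
    "getting started in applied machine learning can be difficult especially "
    "when working with real world data",
    "one good example is to use a one hot encoding on categorical data",
]

# Assumption: keep only tokens of 2+ word characters, so "a" is dropped,
# mirroring CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def ngram_vocabulary(docs, min_n, max_n):
    """Collect the distinct n-grams for every n in [min_n, max_n]."""
    vocab = set()
    for doc in docs:
        tokens = TOKEN_PATTERN.findall(doc.lower())
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[i:i + n]))
    return vocab

unigram_vocab = ngram_vocabulary(processed_docs, 1, 1)
uni_bi_vocab = ngram_vocabulary(processed_docs, 1, 2)

print("Unigram features:", len(unigram_vocab))
print("Unigram + bigram features:", len(uni_bi_vocab))
```

On this three-sentence corpus, adding bigrams roughly doubles the number of features, and the gap widens quickly as n or the corpus grows, since most longer n-grams occur only once.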