<a href="https://colab.research.google.com/github/nluninja/text-mining-dataviz-aa2526/blob/main/02-Text_Classification/NLP02-02-N-grams.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# N-grams

We’ll use scikit-learn’s CountVectorizer to perform some basic n-gram analysis on product descriptions stored in a Pandas dataframe.

N-grams (also called Q-grams or shingles) are single- or multi-word phrases found within documents. They can reveal a document’s underlying topic, which helps data scientists understand the text, and they can be used as features in NLP models, such as text classifiers.
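To make the definition concrete, here is a minimal pure-Python sketch that slides a window of n words over a sentence. The helper name `word_ngrams` and the sample sentence are made up for illustration, and splitting on whitespace is a simplification of what CountVectorizer does.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in text, in reading order."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "whey protein isolate powder"
one_grams = word_ngrams(sentence, 1)    # single words
two_grams = word_ngrams(sentence, 2)    # overlapping two-word phrases
three_grams = word_ngrams(sentence, 3)  # overlapping three-word phrases
print(two_grams)
```

A sentence of four words therefore yields four unigrams, three bigrams, and two trigrams.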

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

## Load the data
Next, we’ll load a simple dataset containing some text data.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/nluninja/text-mining-dataviz-aa2526/refs/heads/main/02-Text_Classification/gonutrition.csv')
df.head()

Unnamed: 0,product_name,product_description
0,Whey Protein Isolate 90,What is Whey Protein Isolate? Whey Protein Iso...
1,Whey Protein 80,What is Whey Protein 80? Whey Protein 80 is an...
2,Volt Preworkout™,What is Volt™? Our Volt pre workout formula in...


## Fit the CountVectorizer
CountVectorizer tokenizes the text and splits it into chunks called n-grams. We set their length by passing a (min_n, max_n) tuple to the ngram_range argument; (1, 1) yields single words only.
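As a quick way to inspect what the tokenizer produces before fitting anything, the sketch below calls `build_analyzer()` on one short string borrowed from the dataset preview. With default settings the analyzer lowercases the text and keeps tokens of two or more alphanumeric characters; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = "What is Whey Protein 80?"

# build_analyzer() returns the callable CountVectorizer applies to each document
words_only = CountVectorizer(ngram_range=(1, 1)).build_analyzer()
words_and_pairs = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

tokens_1 = words_only(sample)        # unigrams only
tokens_12 = words_and_pairs(sample)  # unigrams followed by bigrams
print(tokens_1)
print(tokens_12)
```

Punctuation is dropped and case is folded, so "What" and "what" end up as the same vocabulary entry.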


In [3]:
text = df['product_description']
model = CountVectorizer(ngram_range = (1, 1))
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)

Unnamed: 0,0,1,2
would,0,0,1
you,10,7,7
your,8,6,7
zinc,0,0,1
zma,0,0,1


## Remove stop words
Stop words (very common words such as “you” and “your”) carry little topical information and can massively bloat the vocabulary, which increases memory use and model training time. As a result, it’s common practice to remove them.
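One detail worth seeing in isolation: stop words are filtered out before multi-word n-grams are assembled, so words on either side of a removed stop word become neighbours. The following sketch, using a made-up phrase, compares bigrams with and without the built-in English list.

```python
from sklearn.feature_extraction.text import CountVectorizer

phrase = "whey protein for recovery"  # illustrative text, not from the dataset

plain = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
filtered = CountVectorizer(ngram_range=(2, 2), stop_words="english").build_analyzer()

bigrams_plain = plain(phrase)        # keeps "protein for" and "for recovery"
bigrams_filtered = filtered(phrase)  # "for" is removed first, joining "protein recovery"
print(bigrams_plain)
print(bigrams_filtered)
```

This is why some phrases in the later outputs never appear verbatim in the original descriptions.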


In [4]:
text = df['product_description']
model = CountVectorizer(ngram_range = (1, 1), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)

Unnamed: 0,0,1,2
working,0,0,3
workout,2,4,18
world,0,2,0
zinc,0,0,1
zma,0,0,1


## Increase the n-gram range
Increasing the ngram_range expands the vocabulary from single words to short phrases of the lengths you choose. Here, (2, 2) yields only two-word phrases (bigrams).

In [5]:
text = df['product_description']
model = CountVectorizer(ngram_range = (2, 2), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)

Unnamed: 0,0,1,2
workout volt,0,0,1
world milk,0,1,0
world renowned,0,1,0
zinc magnesium,0,0,1
zma zinc,0,0,1


In [6]:
df_output.shape


(3, 1200)

You can also select n-grams of multiple sizes all at once by setting an unequal ngram_range.

For example, setting ngram_range to (1, 5) will return n-grams containing __one, two, three, four, and five words__, built after stop words have been removed.

As you’d imagine, this adds some potentially useful phrases to the vocabulary, as well as some nonsense, and increases the vocabulary size to 6091.
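A rough way to see why the vocabulary grows so quickly: a document with T tokens contains T - n + 1 n-grams of length n, so a wider range adds one such term per length. The helper `ngram_count` below is hypothetical and counts occurrences rather than distinct vocabulary entries, so it only gives an upper bound per document.

```python
def ngram_count(num_tokens, n_min, n_max):
    """Number of n-gram occurrences in one document for lengths n_min..n_max."""
    return sum(max(num_tokens - n + 1, 0) for n in range(n_min, n_max + 1))

only_words = ngram_count(10, 1, 1)  # unigrams in a 10-token document
only_pairs = ngram_count(10, 2, 2)  # bigrams in the same document
one_to_five = ngram_count(10, 1, 5) # every length from 1 to 5
print(only_words, only_pairs, one_to_five)
```

Because longer phrases rarely repeat, most of these occurrences become separate vocabulary entries.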

In [7]:
text = df['product_description']
model = CountVectorizer(ngram_range = (1, 5), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)

Unnamed: 0,0,1,2
zma,0,0,1
zma zinc,0,0,1
zma zinc magnesium,0,0,1
zma zinc magnesium recovery,0,0,1
zma zinc magnesium recovery formula,0,0,1


## Setting max_features
CountVectorizer includes a very useful optional argument called max_features that controls the size of the vocabulary, so that it includes only the most commonly encountered terms, ranked by their term frequency across the corpus.

* With no max_features value, we generate a vocabulary of 6091 items, as every n-gram found in the corpus is kept.
* Setting max_features to 100 limits the vocabulary size generated to the top 100 n-grams, as shown in the shape of the output dataframe.

In [8]:
text = df['product_description']
model = CountVectorizer(ngram_range = (1, 5), stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)

Unnamed: 0,0,1,2
zma,0,0,1
zma zinc,0,0,1
zma zinc magnesium,0,0,1
zma zinc magnesium recovery,0,0,1
zma zinc magnesium recovery formula,0,0,1


In [9]:
df_output.shape


(3, 6091)

In [10]:
text = df['product_description']
model = CountVectorizer(ngram_range = (1, 5), max_features = 100, stop_words='english')
matrix = model.fit_transform(text).toarray()
df_output = pd.DataFrame(data = matrix, columns = model.get_feature_names_out())
df_output.T.tail(5)

Unnamed: 0,0,1,2
whey protein concentrate,4,3,0
whey protein isolate,11,0,0
work,0,2,3
workout,2,4,18
workout formula,0,0,6


In [11]:
df_output.shape


(3, 100)

## Define a function to get n-grams

In [12]:
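def get_ngrams(text, ngram_from=2, ngram_to=2, n=None, max_features=20000):
    """Return the n most frequent n-grams in text as (ngram, count) pairs, most frequent first."""
    vec = CountVectorizer(ngram_range = (ngram_from, ngram_to),
                          max_features = max_features,
                          stop_words='english').fit(text)
    bag_of_words = vec.transform(text)
    # Sum each column to get every n-gram's total count across all documents
    sum_words = bag_of_words.sum(axis = 0)
    # vocabulary_ maps each n-gram to its column index in the matrix
    words_freq = [(word, sum_words[0, i]) for word, i in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)

    # n=None returns every n-gram
    return words_freq[:n]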

## Get unigrams or 1-grams


In [13]:
unigrams = get_ngrams(df['product_description'], ngram_from=1, ngram_to=1, n=15)
unigrams_df = pd.DataFrame(unigrams)
unigrams_df.columns=["Unigram", "Frequency"]
unigrams_df.head()

Unnamed: 0,Unigram,Frequency
0,protein,80
1,whey,53
2,workout,24
3,volt,24
4,gn,23


## Get bigrams or 2-grams


In [14]:
bigrams = get_ngrams(df['product_description'], ngram_from=2, ngram_to=2, n=15)
bigrams_df = pd.DataFrame(bigrams)
bigrams_df.columns=["Bigram", "Frequency"]
bigrams_df.head()

Unnamed: 0,Bigram,Frequency
0,whey protein,45
1,pre workout,14
2,protein isolate,11
3,gn whey,11
4,protein 80,11


## Get 5-grams

In [15]:
pentagrams = get_ngrams(df['product_description'], ngram_from=5, ngram_to=5, n=15)
pentagrams_df = pd.DataFrame(pentagrams)
pentagrams_df.columns=["Pentagram", "Frequency"]
pentagrams_df.head()

Unnamed: 0,Pentagram,Frequency
0,free range grass fed cows,3
1,whey protein concentrate whey protein,3
2,whey protein isolate whey protein,2
3,protein isolate whey protein isolate,2
4,isolate whey protein isolate 90,2
