# 1. Creating Text Features - ngrams

N-grams in Creating Text Features

N-grams are contiguous sequences of n items from a given sequence of text or speech. In the context of text features, these items are typically words (but can also be characters or phonemes). N-grams capture the co-occurrence of words in a text and can provide more contextual information than individual words alone (as in a simple Bag of Words model).

Here's a breakdown of different types of n-grams:

1. Uni-grams (n=1): These are single words. A Bag of Words model essentially uses uni-grams.

Example: For the sentence "The quick brown fox", the uni-grams are: "The", "quick", "brown", "fox".

2. Bi-grams (n=2): These are sequences of two adjacent words.

Example: For the sentence "The quick brown fox", the bi-grams are: "The quick", "quick brown", "brown fox".

3. Tri-grams (n=3): These are sequences of three adjacent words.

Example: For the sentence "The quick brown fox", the tri-grams are: "The quick brown", "quick brown fox".

4. Quad-grams (n=4): These are sequences of four adjacent words.

Example: For the sentence "The quick brown fox", the only quad-gram is the full sentence: "The quick brown fox".

And so on for higher values of n.
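The definitions above can be sketched in a few lines of plain Python. The helper name `word_ngrams` is an illustrative assumption, not a library function, and splitting on whitespace stands in for a real tokenizer.

```python
def word_ngrams(text, n):
    """Return the contiguous word n-grams of `text` as a list of strings."""
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The quick brown fox"
for n in range(1, 5):
    print(n, word_ngrams(sentence, n))
```

A sentence with k tokens yields k - n + 1 n-grams, so the four-word sentence gives 4, 3, 2 and 1 n-grams for n = 1 to 4, and none for larger n.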

Why Use N-grams for Text Features?

1. Capture Context: Unlike uni-grams, n-grams (with n > 1) retain some information about the order of words and the context in which they appear. This can be crucial for understanding the meaning and sentiment of a text. For example, the bi-gram "not good" has a very different sentiment than the uni-grams "not" and "good" considered separately.
2. Improve Model Performance: By incorporating n-grams, machine learning models can learn more complex relationships in the text and often achieve better performance in tasks like text classification, sentiment analysis, and language modeling.
3. Handle Negation: Bi-grams like "not happy" can help models understand negation, which is often missed by simple uni-gram models.
4. Common Phrases: N-grams can capture common phrases or idioms that have a specific meaning when the words appear together (e.g., "piece of cake").
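To illustrate the points about word order and negation, the sketch below compares two made-up sentences that contain exactly the same words. The sentences and the helper `ngram_counts` are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count lowercase word n-grams using naive whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "the plot was good not bad"
b = "the plot was bad not good"

# Identical bags of words: uni-grams cannot tell the two sentences apart.
print(ngram_counts(a, 1) == ngram_counts(b, 1))

# The bi-grams differ: only the second sentence contains "not good".
print("not good" in ngram_counts(a, 2), "not good" in ngram_counts(b, 2))
```

A uni-gram model would assign both sentences the same feature vector, while a bi-gram model gives it a feature that separates them.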

How N-grams are Used in Feature Creation:

1. Tokenization: The text is first broken down into individual words (tokens).
2. N-gram Generation: Based on the desired value of n, all possible contiguous sequences of n tokens are generated from each document in the corpus.
3. Vocabulary Creation: A vocabulary of all unique n-grams across the entire corpus is created.
4. Feature Vector Creation: Each document is then represented by a vector where each dimension corresponds to an n-gram in the vocabulary. The value in each dimension can be:

   - The frequency of that n-gram in the document (similar to Bag of Words but for n-grams).
   - The presence or absence of that n-gram in the document (binary representation).
   - The TF-IDF weight of the n-gram.
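The three weighting options can be sketched for a single document as follows. The toy corpus is made up, and the TF-IDF shown is the plain textbook form tf * log(N / df). Libraries such as scikit-learn apply smoothing and normalization on top of this, so their numbers will differ.

```python
import math
from collections import Counter


def bigrams(text):
    """Return lowercase word bi-grams using naive whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


corpus = ["very good very good value", "not good value", "very good fit"]
docs = [Counter(bigrams(d)) for d in corpus]
vocab = sorted(set().union(*docs))  # all unique bi-grams in the corpus

# Document frequency: in how many documents each bi-gram appears.
df = {g: sum(g in d for d in docs) for g in vocab}
n_docs = len(docs)

doc = docs[0]
counts = [doc[g] for g in vocab]                            # frequency
binary = [int(doc[g] > 0) for g in vocab]                   # presence / absence
tfidf = [doc[g] * math.log(n_docs / df[g]) for g in vocab]  # textbook TF-IDF

print(vocab)
print(counts)
print(binary)
print([round(w, 3) for w in tfidf])
```

Note how "very good" has the highest count in the first document but is down-weighted by TF-IDF because it also appears in another document.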

Example of Bi-gram Feature Vectors:

Consider the sentences:

Sentence 1: "The food is very good."

Sentence 2: "The service is not good."

Bi-grams:

Sentence 1: "The food", "food is", "is very", "very good"

Sentence 2: "The service", "service is", "is not", "not good"

Vocabulary of Bi-grams:

{"The food", "food is", "is very", "very good", "The service", "service is", "is not", "not good"}

Bi-gram Frequency Vectors:

Sentence 1: {"The food": 1, "food is": 1, "is very": 1, "very good": 1, "The service": 0, "service is": 0, "is not": 0, "not good": 0}

Sentence 2: {"The food": 0, "food is": 0, "is very": 0, "very good": 0, "The service": 1, "service is": 1, "is not": 1, "not good": 1}
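The worked example can be reproduced with the short sketch below. It keeps the original casing and strips punctuation with a simple regular expression, which is an assumption made to match the vectors shown above rather than the behaviour of any particular library.

```python
import re


def bigrams(text):
    """Return word bi-grams, keeping case and dropping punctuation."""
    tokens = re.findall(r"\w+", text)
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


sentences = ["The food is very good.", "The service is not good."]

# Vocabulary in order of first appearance (dict preserves insertion order).
vocab = list(dict.fromkeys(g for s in sentences for g in bigrams(s)))

# One frequency vector per sentence, keyed by bi-gram for readability.
vectors = [{g: bigrams(s).count(g) for g in vocab} for s in sentences]
for v in vectors:
    print(v)
```

The two vectors have no non-zero dimension in common even though the sentences share the words "The", "is" and "good", which is a small preview of the sparsity that n-gram features introduce.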

Considerations when using N-grams:

1. Increased Vocabulary Size: As n increases, the number of possible n-grams grows exponentially (a vocabulary of V words allows up to V^n distinct n-grams), leading to a very large vocabulary and high-dimensional feature vectors. This increases computational cost and makes the feature matrix sparse, since most n-grams occur in only a few documents.
2. Need for Sufficient Data: To effectively learn patterns from higher-order n-grams (like tri-grams or quad-grams), you typically need a larger dataset.
3. Feature Selection/Reduction: Due to the potential for a large number of n-grams, feature selection or dimensionality reduction techniques might be necessary to focus on the most informative n-grams.
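A rough way to see the sparsity problem, and a simple pruning rule that addresses it, is to count the distinct n-grams for increasing n and check how many occur in more than one document. The toy corpus, the helper names, and the threshold of two documents are assumptions chosen for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Return lowercase word n-grams using naive whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocab_stats(corpus, n, min_docs=2):
    """Return (distinct n-grams, n-grams found in at least `min_docs` documents)."""
    # set() ensures each document counts an n-gram at most once.
    doc_freq = Counter(g for text in corpus for g in set(ngrams(text, n)))
    kept = [g for g, freq in doc_freq.items() if freq >= min_docs]
    return len(doc_freq), len(kept)


corpus = [
    "the fabric is great",
    "the fabric is thin",
    "the fit is great",
    "great fit and great fabric",
]

for n in range(1, 4):
    distinct, shared = vocab_stats(corpus, n)
    print(f"n={n}: {distinct} distinct n-grams, {shared} appear in 2+ documents")
```

On documents this short the distinct count is capped by document length, but the share of n-grams seen in more than one document falls quickly as n grows. In scikit-learn, the `min_df` and `max_features` parameters of `CountVectorizer` offer this kind of pruning.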

In summary, n-grams are sequences of n consecutive words that can capture more contextual information than individual words. By using uni-grams, bi-grams, tri-grams, and higher-order n-grams as features, machine learning models can better understand the relationships between words in a text and improve performance on various NLP tasks.

# Import necessary dependencies

In [33]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create sample dataset

In [34]:
# Sample DataFrame of customer reviews for Men's Sports Apparel

reviews_data = pd.DataFrame({
    'ReviewText': [
        "Good quality product, comfortable for running.",
        "The material is not as expected, a bit thin.",
        "Excellent fit and very breathable for Kolkata weather.",
        "Average product, stitching could be better.",
        "Loved the design and the fabric is great for workouts.",
        "Not worth the price, expected better quality.",
        "Very happy with the purchase, fits perfectly.",
        "Disappointed with the color, it's different from the picture.",
        "Amazing for gym sessions, highly recommended.",
        "Okay product, nothing special."
    ]
})

print("Original Reviews Data:")
reviews_data

Original Reviews Data:


,ReviewText
0,"Good quality product, comfortable for running."
1,"The material is not as expected, a bit thin."
2,Excellent fit and very breathable for Kolkata ...
3,"Average product, stitching could be better."
4,Loved the design and the fabric is great for w...
5,"Not worth the price, expected better quality."
6,"Very happy with the purchase, fits perfectly."
7,"Disappointed with the color, it's different fr..."
8,"Amazing for gym sessions, highly recommended."
9,"Okay product, nothing special."


# N-gram implementation (uni-grams, bi-grams, tri-grams and quad-grams)

In [35]:
# 1. Uni-grams (n=1)

unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
unigram_matrix = unigram_vectorizer.fit_transform(reviews_data['ReviewText'])
unigram_df = pd.DataFrame(unigram_matrix.toarray(), columns=unigram_vectorizer.get_feature_names_out())
reviews_unigram_df = pd.concat([reviews_data, unigram_df], axis=1)
print("\nReviews Data with Uni-grams:")
reviews_unigram_df


Reviews Data with Uni-grams:


,ReviewText,amazing,and,as,average,be,better,bit,breathable,color,...,sessions,special,stitching,the,thin,very,weather,with,workouts,worth
0,"Good quality product, comfortable for running.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"The material is not as expected, a bit thin.",0,0,1,0,0,0,1,0,0,...,0,0,0,1,1,0,0,0,0,0
2,Excellent fit and very breathable for Kolkata ...,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,1,0,0,0
3,"Average product, stitching could be better.",0,0,0,1,1,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,Loved the design and the fabric is great for w...,0,1,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,1,0
5,"Not worth the price, expected better quality.",0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
6,"Very happy with the purchase, fits perfectly.",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,0
7,"Disappointed with the color, it's different fr...",0,0,0,0,0,0,0,0,1,...,0,0,0,2,0,0,0,1,0,0
8,"Amazing for gym sessions, highly recommended.",1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
9,"Okay product, nothing special.",0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [36]:
# 2. Bi-grams (n=2)

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vectorizer.fit_transform(reviews_data['ReviewText'])
bigram_df = pd.DataFrame(bigram_matrix.toarray(), columns=bigram_vectorizer.get_feature_names_out())
reviews_bigram_df = pd.concat([reviews_data, bigram_df], axis=1)
print("\nReviews Data with Bi-grams:")
reviews_bigram_df


Reviews Data with Bi-grams:


,ReviewText,amazing for,and the,and very,as expected,average product,be better,better quality,bit thin,breathable for,...,the design,the fabric,the material,the picture,the price,the purchase,very breathable,very happy,with the,worth the
0,"Good quality product, comfortable for running.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"The material is not as expected, a bit thin.",0,0,0,1,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
2,Excellent fit and very breathable for Kolkata ...,0,0,1,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,"Average product, stitching could be better.",0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Loved the design and the fabric is great for w...,0,1,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
5,"Not worth the price, expected better quality.",0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
6,"Very happy with the purchase, fits perfectly.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,1,0
7,"Disappointed with the color, it's different fr...",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
8,"Amazing for gym sessions, highly recommended.",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"Okay product, nothing special.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
# 3. Tri-grams (n=3)

trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
trigram_matrix = trigram_vectorizer.fit_transform(reviews_data['ReviewText'])
trigram_df = pd.DataFrame(trigram_matrix.toarray(), columns=trigram_vectorizer.get_feature_names_out())
reviews_trigram_df = pd.concat([reviews_data, trigram_df], axis=1)
print("\nReviews Data with Tri-grams:")
reviews_trigram_df


Reviews Data with Tri-grams:


,ReviewText,amazing for gym,and the fabric,and very breathable,as expected bit,average product stitching,breathable for kolkata,color it different,comfortable for running,could be better,...,the design and,the fabric is,the material is,the price expected,the purchase fits,very breathable for,very happy with,with the color,with the purchase,worth the price
0,"Good quality product, comfortable for running.",0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,"The material is not as expected, a bit thin.",0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,Excellent fit and very breathable for Kolkata ...,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,"Average product, stitching could be better.",0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Loved the design and the fabric is great for w...,0,1,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
5,"Not worth the price, expected better quality.",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
6,"Very happy with the purchase, fits perfectly.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,1,0
7,"Disappointed with the color, it's different fr...",0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
8,"Amazing for gym sessions, highly recommended.",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"Okay product, nothing special.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [38]:
# 4. Quad-grams (n=4)

quadgram_vectorizer = CountVectorizer(ngram_range=(4, 4))
quadgram_matrix = quadgram_vectorizer.fit_transform(reviews_data['ReviewText'])
quadgram_df = pd.DataFrame(quadgram_matrix.toarray(), columns=quadgram_vectorizer.get_feature_names_out())
reviews_quadgram_df = pd.concat([reviews_data, quadgram_df], axis=1)
print("\nReviews Data with Quad-grams:")
reviews_quadgram_df


Reviews Data with Quad-grams:


,ReviewText,amazing for gym sessions,and the fabric is,and very breathable for,as expected bit thin,average product stitching could,breathable for kolkata weather,color it different from,design and the fabric,different from the picture,...,the design and the,the fabric is great,the material is not,the price expected better,the purchase fits perfectly,very breathable for kolkata,very happy with the,with the color it,with the purchase fits,worth the price expected
0,"Good quality product, comfortable for running.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"The material is not as expected, a bit thin.",0,0,0,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,Excellent fit and very breathable for Kolkata ...,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,"Average product, stitching could be better.",0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Loved the design and the fabric is great for w...,0,1,0,0,0,0,0,1,0,...,1,1,0,0,0,0,0,0,0,0
5,"Not worth the price, expected better quality.",0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
6,"Very happy with the purchase, fits perfectly.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,1,0
7,"Disappointed with the color, it's different fr...",0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,1,0,0
8,"Amazing for gym sessions, highly recommended.",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"Okay product, nothing special.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
# 5. Uni-grams and bi-grams together (a range of n)

ngram_1_2_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_1_2_matrix = ngram_1_2_vectorizer.fit_transform(reviews_data['ReviewText'])
ngram_1_2_df = pd.DataFrame(ngram_1_2_matrix.toarray(), columns=ngram_1_2_vectorizer.get_feature_names_out())
reviews_ngram_1_2_df = pd.concat([reviews_data, ngram_1_2_df], axis=1)
print("\nReviews Data with Uni-grams and Bi-grams:")
reviews_ngram_1_2_df


Reviews Data with Uni-grams and Bi-grams:


,ReviewText,amazing,amazing for,and,and the,and very,as,as expected,average,average product,...,thin,very,very breathable,very happy,weather,with,with the,workouts,worth,worth the
0,"Good quality product, comfortable for running.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"The material is not as expected, a bit thin.",0,0,0,0,0,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,Excellent fit and very breathable for Kolkata ...,0,0,1,0,1,0,0,0,0,...,0,1,1,0,1,0,0,0,0,0
3,"Average product, stitching could be better.",0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
4,Loved the design and the fabric is great for w...,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,"Not worth the price, expected better quality.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
6,"Very happy with the purchase, fits perfectly.",0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,1,1,0,0,0
7,"Disappointed with the color, it's different fr...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
8,"Amazing for gym sessions, highly recommended.",1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,"Okay product, nothing special.",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


1. Uni-grams: ngram_range=(1, 1): This tells CountVectorizer to extract only sequences of 1 word.
2. Bi-grams: ngram_range=(2, 2): This extracts only sequences of 2 adjacent words.
3. Tri-grams: ngram_range=(3, 3): This extracts only sequences of 3 adjacent words.
4. Quad-grams: ngram_range=(4, 4): This extracts only sequences of 4 adjacent words.
5. Uni-grams and Bi-grams: ngram_range=(1, 2): This extracts all sequences of 1 or 2 adjacent words.

