### N-gram

An N-gram is a sequence of N contiguous items (words, characters, or other tokens) taken from a text.

In [None]:
# Examples: bigram, trigram
# If N = 1 → Unigram

# If N = 2 → Bigram

# If N = 3 → Trigram

# If N = 4 → Four-gram, N = 5 → Five-gram, and so on.
# An item can be a word, a character, or a token.

### Types of N-grams
### 1. Word N-grams

Based on words.

Example sentence:

"I love machine learning"

Unigrams → ["I", "love", "machine", "learning"]

Bigrams → ["I love", "love machine", "machine learning"]

Trigrams → ["I love machine", "love machine learning"]
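The lists above can be reproduced with a sliding window over the tokens. This is a minimal sketch, assuming plain whitespace tokenization; `word_ngrams` is a hypothetical helper name, not a library function.

```python
def word_ngrams(text, n):
    """Return the word n-grams of `text`, splitting on whitespace."""
    tokens = text.split()
    # A window of size n fits len(tokens) - n + 1 times (zero times if n is too large).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I love machine learning"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)

print(unigrams)  # ['I', 'love', 'machine', 'learning']
print(bigrams)   # ['I love', 'love machine', 'machine learning']
print(trigrams)  # ['I love machine', 'love machine learning']
```

A sentence with T tokens yields T - N + 1 N-grams, which is why the trigram list is shorter than the bigram list.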

### 2. Character N-grams

Based on characters.

Word: "cat"

Unigrams: ["c", "a", "t"]

Bigrams: ["ca", "at"]

Trigrams: ["cat"]
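The same sliding-window idea works on characters. A small sketch, where `char_ngrams` is a hypothetical helper name and no padding characters are added at the word boundaries:

```python
def char_ngrams(text, n):
    """Return the character n-grams of `text` (no boundary padding)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


word = "cat"
char_unigrams = char_ngrams(word, 1)
char_bigrams = char_ngrams(word, 2)
char_trigrams = char_ngrams(word, 3)

print(char_unigrams)  # ['c', 'a', 't']
print(char_bigrams)   # ['ca', 'at']
print(char_trigrams)  # ['cat']
```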

Character N-grams are mostly used in:

Spelling correction

Language detection

Text similarity

### Why do we use N-grams?

N-grams help NLP models understand:

Word patterns

Local context

Sentence structure

Probability of word sequences
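To make "probability of word sequences" concrete, here is a rough sketch of a bigram model that counts which word follows which and predicts the most frequent follower. The three-sentence corpus is made up for illustration, and `predict_next` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this example only.
corpus = [
    "i am going to school",
    "i am going to school today",
    "i am going to work",
]

# next_word_counts[w] counts the words seen immediately after w.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current, following in zip(tokens, tokens[1:]):
        next_word_counts[current][following] += 1


def predict_next(word):
    """Return the most frequent word seen after `word`, or None if unseen."""
    candidates = next_word_counts.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


prediction = predict_next("to")
# Estimated P(school | to) = count(to school) / count(to followed by anything)
p_school = next_word_counts["to"]["school"] / sum(next_word_counts["to"].values())

print(prediction)          # school
print(round(p_school, 2))  # 0.67
```

Note that a bigram model conditions only on the single previous word ("to"), not on the whole phrase before it.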

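A bag of unigrams only records how often each word occurs (columns: food, is, good):

| food | is | good |
|------|----|------|
| 1    | 0  | 1    |
| 1    | 1  | 1    |

Bigram (a combination of 2 words): with single words only, "food not good" looks almost the same as "food good". A bigram such as "not good" keeps that local context.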
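A short sketch of why bigrams help with negation, assuming lowercase whitespace tokenization; `ngram_counts` is a hypothetical helper. The two sentences share most of their unigrams but none of their bigrams.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


positive = "food is good"
negative = "food not good"

# Unigram view: both sentences contain "food" and "good".
shared_unigrams = sorted(set(ngram_counts(positive, 1)) & set(ngram_counts(negative, 1)))

# Bigram view: "not good" appears only in the negative sentence.
shared_bigrams = sorted(set(ngram_counts(positive, 2)) & set(ngram_counts(negative, 2)))
negative_bigrams = sorted(ngram_counts(negative, 2))

print(shared_unigrams)   # ['food', 'good']
print(shared_bigrams)    # []
print(negative_bigrams)  # ['food not', 'not good']
```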


### Applications of N-grams
### 1. Language Modeling

Predicting the next word.

Example:

"I am going to" → next word likely "school"

### 2. Text Generation

Older chatbots and predictive keyboards used N-grams.

### 3. Spelling Correction

"teh" → bigrams mismatch → suggest "the".

### 4. Machine Translation

Before neural machine translation, statistical translation systems relied on N-gram language models such as bigram and trigram models.

### 5. Sentiment Analysis Features

Convert text into N-gram features for ML models.

### 6. Text Classification (Spam, Fake news, Reviews)

Common N-grams become features.

In sklearn, the `ngram_range=(min_n, max_n)` parameter controls which N-grams are extracted:

(1, 1) → unigrams only (one word at a time)

(1, 2) → unigrams and bigrams

(1, 3) → unigrams, bigrams, and trigrams

(2, 3) → bigrams and trigrams
         

###  Problems with N-grams
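### 1. Sparsity

As N grows, most word combinations appear rarely or never in the training data, so their counts become unreliable.

### 2. High memory requirement

The number of distinct N-grams grows quickly with N, so N=4 or N=5 models need huge corpora and a lot of memory to store the counts.

### 3. Does not understand long context

N-grams only look at immediate neighbors.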

Example:
"I went to New" → bigram may predict "York",
but cannot capture long-distance dependencies like transformers do.
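The sparsity problem can be seen even on a tiny text. The sketch below, using a made-up toy text and a hypothetical helper `ngram_stats`, measures how many distinct N-grams exist and what fraction of them occur exactly once as N increases.

```python
from collections import Counter

# Toy text, invented for this example only.
text = (
    "the cat sat on the mat the dog sat on the rug "
    "the cat saw the dog on the mat"
)
tokens = text.split()


def ngram_stats(tokens, n):
    """Return (number of distinct n-grams, fraction of them seen exactly once)."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons / len(counts)


stats = {n: ngram_stats(tokens, n) for n in (1, 2, 3)}
for n, (distinct, singleton_fraction) in stats.items():
    print(f"N={n}: {distinct} distinct, {singleton_fraction:.2f} seen only once")
```

On this text the share of N-grams seen only once rises with N, which is the reason higher-order models need much more data to get reliable counts.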

### Smoothing in N-grams

Smoothing fixes zero-probability issues: without it, any N-gram that never appeared in the training data gets probability 0.

Popular smoothing techniques:

Laplace smoothing

Add-k smoothing

Kneser–Ney smoothing (usually the strongest of these for N-gram language models)

Good-Turing smoothing
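As an illustration of the two simplest techniques in the list, here is a minimal sketch of add-k smoothing for bigram probabilities (k = 1 gives Laplace smoothing). The two-sentence corpus and the helper name `bigram_prob` are assumptions made for this example.

```python
from collections import Counter

# Toy corpus, invented for this example only.
corpus = ["i love nlp", "i love machine learning"]

bigram_counts = Counter()   # count(prev, word)
context_counts = Counter()  # how often prev is followed by any word
vocab = set()
for sentence in corpus:
    tokens = sentence.split()
    vocab.update(tokens)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

vocab_size = len(vocab)


def bigram_prob(prev, word, k=1.0):
    """Add-k estimate of P(word | prev); k must be positive, k=1 is Laplace."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)


seen = bigram_prob("love", "nlp")         # (1 + 1) / (2 + 5)
unseen = bigram_prob("love", "learning")  # (0 + 1) / (2 + 5), no longer zero
total = sum(bigram_prob("love", w) for w in vocab)

print(round(seen, 3), round(unseen, 3), round(total, 3))
```

The unseen bigram "love learning" now gets a small non-zero probability, and the probabilities over the vocabulary still sum to 1.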

### N-gram Features in ML Models

When using classical ML algorithms (SVM, Naive Bayes, Logistic Regression), we convert text to:

Bag of Unigrams

Bag of Bigrams

Bag of Trigrams

Combination (1–2 grams)

Example:
Sentence:
"Good product and fast delivery"

Unigrams + bigrams become features like:

good

product

fast

delivery

good product

fast delivery

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform(["I love NLP", "NLP is great"])
print(vectorizer.get_feature_names_out())


['great' 'is' 'is great' 'love' 'love nlp' 'nlp' 'nlp is']


In [3]:
# character N-gram example

In [4]:
vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 5))


Character N-grams are good for:

Plagiarism detection

Language detection

String matching

## Practical implementation


In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# 1. Load dataset from local CSV (tab-separated, no header row)
df = pd.read_csv("SMSSpamCollection.csv", sep='\t', header=None, names=["label", "text"], encoding="latin-1")

# 2. Convert target into binary values
df["label"] = df["label"].map({"ham": 0, "spam": 1})

# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# 4. Character N-grams (3 to 5 grams)
vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 5))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 5. Train model
model = LogisticRegression(max_iter=300)
model.fit(X_train_vec, y_train)

# 6. Predictions
pred = model.predict(X_test_vec)

# 7. Results
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))


Accuracy: 0.9856502242152466
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       1.00      0.89      0.94       149

    accuracy                           0.99      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.99      0.99      0.99      1115

