## Objective
- Tokenize the given text using `nltk.word_tokenize`.
- Remove punctuation tokens to clean the data.
- Generate unigrams, bigrams, and trigrams using `nltk.ngrams`.
- Count the frequency of each n-gram using `collections.Counter`.
- Calculate the probability of each n-gram.


## Step-wise Explanation
1. **Tokenization:** We start by using NLTK's `word_tokenize` function to split the input text into tokens. This includes words and punctuation marks as separate tokens.
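2. **Punctuation Removal:** Next, we keep only the tokens for which `str.isalnum()` is true, which removes standalone punctuation tokens. This test also drops any token that contains a non-alphanumeric character, such as `n't` or a hyphenated word; the sample text has none, so the result is a clean list of word tokens.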
3. **N-gram Generation:** We generate unigrams, bigrams, and trigrams from the filtered tokens using `nltk.ngrams`, which takes the list of tokens and n (1, 2, or 3) to create sequences of words.
4. **Frequency Counting:** Using `collections.Counter`, we count how many times each n-gram occurs in the text. This provides a frequency distribution for unigrams, bigrams, and trigrams.
5. **Probability Calculation:** Finally, we calculate the probability of each n-gram by dividing its frequency by the total number of n-grams of that type. This gives a relative frequency (probability) of each n-gram.
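
Steps 3 to 5 can also be folded into one small helper. The sketch below is illustrative rather than part of the practical's code: the function names `sliding_ngrams` and `relative_frequencies` are assumptions, and a plain `zip` stands in for `nltk.ngrams` (without padding) so the snippet runs without any NLTK downloads.

```python
from collections import Counter


def sliding_ngrams(tokens, n):
    """Return all contiguous n-token tuples (no padding)."""
    return list(zip(*(tokens[i:] for i in range(n))))


def relative_frequencies(tokens, n):
    """Map each n-gram to count / total number of n-grams of that order."""
    counts = Counter(sliding_ngrams(tokens, n))
    total = sum(counts.values())
    if total == 0:
        # Fewer than n tokens: there are no n-grams to score.
        return {}
    return {gram: count / total for gram, count in counts.items()}


# A shortened token list, already stripped of punctuation.
tokens = ["NLP", "is", "amazing", "It", "is", "widely", "used"]
unigram_probs = relative_frequencies(tokens, 1)
bigram_probs = relative_frequencies(tokens, 2)

print(unigram_probs[("is",)])       # "is" occurs 2 times out of 7 unigrams
print(bigram_probs[("It", "is")])   # 1 occurrence out of 6 bigrams
```

Because every value is a count divided by the same total, the probabilities for a given order sum to 1.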


## Code


```python
import os
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Add a writable directory to NLTK's data search path, creating it if needed
os.makedirs('/tmp/nltk_data', exist_ok=True)
nltk.data.path.append('/tmp/nltk_data')

# Download the Punkt tokenizer data used by word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK releases look for punkt_tab

# Define a sample text
text = "NLP is amazing. It is widely used in AI applications, including speech recognition; people love analyzing text data!"

# 1. Tokenization: words and punctuation marks become separate tokens
tokens = word_tokenize(text)
print("Tokens:", tokens)

# 2. Remove punctuation: keep only purely alphanumeric tokens
filtered_tokens = [token for token in tokens if token.isalnum()]
print("Tokens after punctuation removal:", filtered_tokens)

# 3. Generate n-grams (ngrams returns a generator, so materialize it as a list)
unigrams = list(ngrams(filtered_tokens, 1))
bigrams = list(ngrams(filtered_tokens, 2))
trigrams = list(ngrams(filtered_tokens, 3))
print("Unigrams:", unigrams)
print("Bigrams:", bigrams)
print("Trigrams:", trigrams)

# 4. Frequency counts
freq_uni = Counter(unigrams)
freq_bi = Counter(bigrams)
freq_tri = Counter(trigrams)
print("Unigram frequencies:", freq_uni)
print("Bigram frequencies:", freq_bi)
print("Trigram frequencies:", freq_tri)

# 5. Probability calculations: relative frequency = count / total for that order
total_uni = sum(freq_uni.values())
total_bi = sum(freq_bi.values())
total_tri = sum(freq_tri.values())
print(f"Total unigrams: {total_uni}, Total bigrams: {total_bi}, Total trigrams: {total_tri}")

print("Unigram probabilities:")
for uni, count in freq_uni.items():
    print(f"{uni} -> {count} / {total_uni} = {count/total_uni}")
print("Bigram probabilities:")
for bi, count in freq_bi.items():
    print(f"{bi} -> {count} / {total_bi} = {count/total_bi}")
print("Trigram probabilities:")
for tri, count in freq_tri.items():
    print(f"{tri} -> {count} / {total_tri} = {count/total_tri}")
```

```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
```



## Output
```
Tokens: ['NLP', 'is', 'amazing', '.', 'It', 'is', 'widely', 'used', 'in', 'AI', 'applications', ',', 'including', 'speech', 'recognition', ';', 'people', 'love', 'analyzing', 'text', 'data', '!']
Tokens after punctuation removal: ['NLP', 'is', 'amazing', 'It', 'is', 'widely', 'used', 'in', 'AI', 'applications', 'including', 'speech', 'recognition', 'people', 'love', 'analyzing', 'text', 'data']
Unigrams: [('NLP',), ('is',), ('amazing',), ('It',), ('is',), ('widely',), ('used',), ('in',), ('AI',), ('applications',), ('including',), ('speech',), ('recognition',), ('people',), ('love',), ('analyzing',), ('text',), ('data',)]
Bigrams: [('NLP', 'is'), ('is', 'amazing'), ('amazing', 'It'), ('It', 'is'), ('is', 'widely'), ('widely', 'used'), ('used', 'in'), ('in', 'AI'), ('AI', 'applications'), ('applications', 'including'), ('including', 'speech'), ('speech', 'recognition'), ('recognition', 'people'), ('people', 'love'), ('love', 'analyzing'), ('analyzing', 'text'), ('text', 'data')]
Trigrams: [('NLP', 'is', 'amazing'), ('is', 'amazing', 'It'), ('amazing', 'It', 'is'), ('It', 'is', 'widely'), ('is', 'widely', 'used'), ('widely', 'used', 'in'), ('used', 'in', 'AI'), ('in', 'AI', 'applications'), ('AI', 'applications', 'including'), ('applications', 'including', 'speech'), ('including', 'speech', 'recognition'), ('speech', 'recognition', 'people'), ('recognition', 'people', 'love'), ('people', 'love', 'analyzing'), ('love', 'analyzing', 'text'), ('analyzing', 'text', 'data')]
Unigram frequencies: Counter({('is',): 2, ('NLP',): 1, ('amazing',): 1, ('It',): 1, ('widely',): 1, ('used',): 1, ('in',): 1, ('AI',): 1, ('applications',): 1, ('including',): 1, ('speech',): 1, ('recognition',): 1, ('people',): 1, ('love',): 1, ('analyzing',): 1, ('text',): 1, ('data',): 1})
Bigram frequencies: Counter({('NLP', 'is'): 1, ('is', 'amazing'): 1, ('amazing', 'It'): 1, ('It', 'is'): 1, ('is', 'widely'): 1, ('widely', 'used'): 1, ('used', 'in'): 1, ('in', 'AI'): 1, ('AI', 'applications'): 1, ('applications', 'including'): 1, ('including', 'speech'): 1, ('speech', 'recognition'): 1, ('recognition', 'people'): 1, ('people', 'love'): 1, ('love', 'analyzing'): 1, ('analyzing', 'text'): 1, ('text', 'data'): 1})
Trigram frequencies: Counter({('NLP', 'is', 'amazing'): 1, ('is', 'amazing', 'It'): 1, ('amazing', 'It', 'is'): 1, ('It', 'is', 'widely'): 1, ('is', 'widely', 'used'): 1, ('widely', 'used', 'in'): 1, ('used', 'in', 'AI'): 1, ('in', 'AI', 'applications'): 1, ('AI', 'applications', 'including'): 1, ('applications', 'including', 'speech'): 1, ('including', 'speech', 'recognition'): 1, ('speech', 'recognition', 'people'): 1, ('recognition', 'people', 'love'): 1, ('people', 'love', 'analyzing'): 1, ('love', 'analyzing', 'text'): 1, ('analyzing', 'text', 'data'): 1})
Total unigrams: 18, Total bigrams: 17, Total trigrams: 16
Unigram probabilities:
('NLP',) -> 1 / 18 = 0.05555555555555555
('is',) -> 2 / 18 = 0.1111111111111111
('amazing',) -> 1 / 18 = 0.05555555555555555
('It',) -> 1 / 18 = 0.05555555555555555
('widely',) -> 1 / 18 = 0.05555555555555555
('used',) -> 1 / 18 = 0.05555555555555555
('in',) -> 1 / 18 = 0.05555555555555555
('AI',) -> 1 / 18 = 0.05555555555555555
('applications',) -> 1 / 18 = 0.05555555555555555
('including',) -> 1 / 18 = 0.05555555555555555
('speech',) -> 1 / 18 = 0.05555555555555555
('recognition',) -> 1 / 18 = 0.05555555555555555
('people',) -> 1 / 18 = 0.05555555555555555
('love',) -> 1 / 18 = 0.05555555555555555
('analyzing',) -> 1 / 18 = 0.05555555555555555
('text',) -> 1 / 18 = 0.05555555555555555
('data',) -> 1 / 18 = 0.05555555555555555
Bigram probabilities:
('NLP', 'is') -> 1 / 17 = 0.058823529411764705
('is', 'amazing') -> 1 / 17 = 0.058823529411764705
('amazing', 'It') -> 1 / 17 = 0.058823529411764705
('It', 'is') -> 1 / 17 = 0.058823529411764705
('is', 'widely') -> 1 / 17 = 0.058823529411764705
('widely', 'used') -> 1 / 17 = 0.058823529411764705
('used', 'in') -> 1 / 17 = 0.058823529411764705
('in', 'AI') -> 1 / 17 = 0.058823529411764705
('AI', 'applications') -> 1 / 17 = 0.058823529411764705
('applications', 'including') -> 1 / 17 = 0.058823529411764705
('including', 'speech') -> 1 / 17 = 0.058823529411764705
('speech', 'recognition') -> 1 / 17 = 0.058823529411764705
('recognition', 'people') -> 1 / 17 = 0.058823529411764705
('people', 'love') -> 1 / 17 = 0.058823529411764705
('love', 'analyzing') -> 1 / 17 = 0.058823529411764705
('analyzing', 'text') -> 1 / 17 = 0.058823529411764705
('text', 'data') -> 1 / 17 = 0.058823529411764705
Trigram probabilities:
('NLP', 'is', 'amazing') -> 1 / 16 = 0.0625
('is', 'amazing', 'It') -> 1 / 16 = 0.0625
('amazing', 'It', 'is') -> 1 / 16 = 0.0625
('It', 'is', 'widely') -> 1 / 16 = 0.0625
('is', 'widely', 'used') -> 1 / 16 = 0.0625
('widely', 'used', 'in') -> 1 / 16 = 0.0625
('used', 'in', 'AI') -> 1 / 16 = 0.0625
('in', 'AI', 'applications') -> 1 / 16 = 0.0625
('AI', 'applications', 'including') -> 1 / 16 = 0.0625
('applications', 'including', 'speech') -> 1 / 16 = 0.0625
('including', 'speech', 'recognition') -> 1 / 16 = 0.0625
('speech', 'recognition', 'people') -> 1 / 16 = 0.0625
('recognition', 'people', 'love') -> 1 / 16 = 0.0625
('people', 'love', 'analyzing') -> 1 / 16 = 0.0625
('love', 'analyzing', 'text') -> 1 / 16 = 0.0625
('analyzing', 'text', 'data') -> 1 / 16 = 0.0625
```
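
The totals reported in the output (18, 17, 16) follow from the sliding window: a list of N tokens yields N - n + 1 n-grams of order n when no padding is used. The quick check below is a sketch with illustrative variable names; it uses a plain `zip` in place of `nltk.ngrams` so that it runs without NLTK.

```python
# The 18 tokens that remain after punctuation removal.
filtered_tokens = [
    "NLP", "is", "amazing", "It", "is", "widely", "used", "in", "AI",
    "applications", "including", "speech", "recognition", "people",
    "love", "analyzing", "text", "data",
]

ngram_totals = {}
for n in (1, 2, 3):
    # Slide a window of width n over the tokens and count the windows.
    windows = list(zip(*(filtered_tokens[i:] for i in range(n))))
    ngram_totals[n] = len(windows)
    expected = max(len(filtered_tokens) - n + 1, 0)
    print(f"n={n}: counted {len(windows)}, formula gives {expected}")
```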

## Conclusion
In this practical, we tokenized the input text, removed punctuation, and generated unigrams, bigrams, and trigrams. Counting each n-gram and dividing by the total for its order gave relative frequencies, which serve as simple probability estimates. Because the sample is short, only the unigram `is` occurs twice; every bigram and trigram occurs exactly once, so their probabilities are uniform (1/17 and 1/16). This simple NLP pipeline shows how raw text can be turned into statistics about word sequences, and such counts form the basis of many NLP tasks, including language modeling and text analysis.
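
To make the link to language modeling concrete: the practical computes the relative frequency of each n-gram among all n-grams of its order, whereas a bigram language model usually needs the conditional estimate P(w2 | w1) = count(w1, w2) / count(w1). The following is a minimal sketch of that step, not part of the practical itself; `conditional_prob` is a hypothetical helper, and the estimate is the plain maximum-likelihood one with no smoothing.

```python
from collections import Counter

# A shortened token list, already stripped of punctuation.
tokens = ["NLP", "is", "amazing", "It", "is", "widely", "used"]

bigrams = list(zip(tokens, tokens[1:]))
bigram_counts = Counter(bigrams)
# Count each word only where it starts a bigram, so that the
# conditional probabilities for a given first word sum to 1.
prefix_counts = Counter(first for first, _ in bigrams)


def conditional_prob(first, second):
    """Maximum-likelihood estimate of P(second | first); 0.0 if first is unseen."""
    if prefix_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / prefix_counts[first]


# "is" starts two bigrams: ("is", "amazing") and ("is", "widely").
print(conditional_prob("is", "amazing"))
```

With so little text most conditional estimates are either 0 or 1, which is why practical language models are trained on large corpora and apply smoothing.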
