Text Analytics
1. Extract a sample document and apply the following document preprocessing methods: tokenization, POS tagging, stop word removal, stemming, and lemmatization.
2. Create a representation of the document by calculating Term Frequency and Inverse Document Frequency.

In [1]:
import nltk
import math
import re
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# Ensure necessary NLTK data files are downloaded
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/palakagrawal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/palakagrawal/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/palakagrawal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/palakagrawal/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/palakagrawal/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
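# Read the sample documents
# UTF-8 is assumed here because the text contains curly quotes and dashes
with open('doc_01.txt', 'r', encoding='utf-8') as file:
    doc1 = file.read()

with open('doc_02.txt', 'r', encoding='utf-8') as file:
    doc2 = file.read()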

# Tokenization
Tokenization is the process of breaking a text into individual words, numbers, and punctuation marks, called tokens. It is usually the first step in a natural language processing pipeline.
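
To make the idea concrete without relying on NLTK, here is a minimal sketch of a regex-based tokenizer. It is not how `nltk.word_tokenize` works internally; the name `simple_tokenize` and the pattern are assumptions chosen only for illustration.

```python
import re

# Hypothetical pattern: a run of word characters, optionally joined by
# internal hyphens, periods, or apostrophes, or else any single symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("In 2019, the four-year drive planted 1.7 crore trees."))
```

Hyphenated words and decimal numbers stay in one piece, while commas and the final period become separate tokens.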

In [4]:
# Step 1: Tokenization
def tokenize_text(doc):
    # word_tokenize splits the text into words, numbers, and punctuation marks
    return nltk.word_tokenize(doc)

# Running the process for both documents
tokens_doc1 = tokenize_text(doc1)
tokens_doc2 = tokenize_text(doc2)

print("Tokenized Document 1:", tokens_doc1)
print("\n\nTokenized Document 2:", tokens_doc2)

Tokenized Document 1: ['Between', '2016', 'and', '2019', ',', 'the', 'state', 'forest', 'department', 'under', 'the', 'BJP', 'government', 'had', 'launched', '‘', 'Green', 'Maharashtra', '’', 'drive', 'with', 'an', 'aim', 'to', 'plant', '50', 'crore', 'trees', 'across', 'the', 'state', 'in', 'the', 'four-year', 'period', '.', 'In', 'October', '2019', ',', 'the', 'government', 'had', 'claimed', 'it', 'had', 'surpassed', 'the', 'target', 'by', 'planting', '33', 'crore', 'trees', 'in', 'July-September', '2019', '.', 'The', 'Indian', 'Express', 'had', 'found', 'that', 'non-forest', 'agencies', '—', 'such', 'as', 'gram', 'panchayats', '—', 'which', 'were', 'tasked', 'with', 'planting', 'trees', 'had', 'not', 'uploaded', 'the', 'mandatory', 'audio-visual', 'proof', 'of', 'the', 'tree', 'plantation', 'drives', 'on', 'the', 'specially', 'created', 'portal', '.', 'In', 'Pune', 'Revenue', 'Division', ',', 'it', 'was', 'claimed', 'the', 'gram', 'panchayats', 'planted', '1.7', 'crore', 'saplings',

# Stop Words
Stop words are common words like 'the', 'is', 'and', etc., which often do not carry significant meaning in text analysis. Remove these stop words from the text to focus on the more meaningful content.

In [7]:
# Step 2: Stop Word Removal
def remove_stop_words(tokens):
    stop_words = set(nltk.corpus.stopwords.words('english'))
    after_removing = []
    for token in tokens:
        if token.lower() not in stop_words:
            after_removing.append(token)
    return after_removing
    # return [token for token in tokens if token.lower() not in stop_words]
    
# Remove stop words
tokens_doc1 = remove_stop_words(tokens_doc1)
tokens_doc2 = remove_stop_words(tokens_doc2)

print("Tokenized Document 1:", tokens_doc1)
print("\n\nTokenized Document 2:", tokens_doc2)

Tokenized Document 1: ['2016', '2019', ',', 'state', 'forest', 'department', 'BJP', 'government', 'launched', '‘', 'Green', 'Maharashtra', '’', 'drive', 'aim', 'plant', '50', 'crore', 'trees', 'across', 'state', 'four-year', 'period', '.', 'October', '2019', ',', 'government', 'claimed', 'surpassed', 'target', 'planting', '33', 'crore', 'trees', 'July-September', '2019', '.', 'Indian', 'Express', 'found', 'non-forest', 'agencies', '—', 'gram', 'panchayats', '—', 'tasked', 'planting', 'trees', 'uploaded', 'mandatory', 'audio-visual', 'proof', 'tree', 'plantation', 'drives', 'specially', 'created', 'portal', '.', 'Pune', 'Revenue', 'Division', ',', 'claimed', 'gram', 'panchayats', 'planted', '1.7', 'crore', 'saplings', ';', 'however', ',', 'evidence', 'uploaded', '87', 'per', 'cent', '(', '1.49', 'crore', ')', 'saplings', '.', 'Also', ',', '59', 'government', 'agencies', 'involved', 'drive', 'many', '38', 'submitted', 'survival', 'reports', 'saplings', '.', 'year', ',', 'targets', 'set',
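
The output above still contains punctuation tokens such as `','` and `'.'`, because the stop word list covers words only. One possible extra step is sketched below with a hypothetical helper, `drop_punctuation`, which is not part of NLTK; the sample list is a hand-picked subset of the tokens shown above.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

sample = ['2016', '2019', ',', 'state', 'forest', '‘', 'Green', '’',
          'four-year', '.', '1.7', ';']
print(drop_punctuation(sample))
```

Tokens such as `'four-year'` and `'1.7'` survive because they contain letters or digits, while bare punctuation is removed.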

# POS Tagging

POS tagging involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc.

### 9 Primary Parts of Speech:
- **Noun**  
- **Pronoun**  
- **Verb**  
- **Adjective**  
- **Adverb**  
- **Preposition**  
- **Conjunction**  
- **Interjection**  
- **Determiner**  


### Common POS Tags Table

| POS Tag | Meaning                                | Example                        |
|---------|----------------------------------------|--------------------------------|
| **NN**  | Noun, singular                         | *tree, drive*                  |
| **NNS** | Noun, plural                           | *saplings, people*             |
| **NNP** | Proper noun, singular                  | *India, Uttar*                 |
| **JJ**  | Adjective                              | *green, successful*            |
| **VB**  | Verb, base form                        | *ensure, reduce*               |
| **VBD** | Verb, past tense                       | *took, said*                   |
| **VBG** | Verb, gerund/present participle        | *planting, citing*             |
| **VBN** | Verb, past participle                  | *planted, quoted*              |
| **VBP** | Verb, non-3rd person singular present  | *combat, survive*              |
| **RB**  | Adverb                                 | *especially, also*             |
| **IN**  | Preposition/Subordinating conjunction  | *across, along*                |
| **CD**  | Cardinal number                        | *250, 15, 2019*                |
| **DT**  | Determiner                             | *another*                      |
| **PRP** | Personal pronoun                       | *they, we*                     |
| **POS** | Possessive ending                      | *'s*                           |
| **.**   | Sentence-ending punctuation            | *.*                            |
| **,**   | Comma                                  | *,*                            |
| **`` / ''** | Quote marks                        | *“ ” or ' '*                   |



In [29]:
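# Step 3: POS Tagging
def pos_tagging(tokens):
    # Returns a list of (token, Penn Treebank tag) pairs
    return nltk.pos_tag(tokens)

# POS Tagging
# Note: the tokens were already filtered for stop words above, so the tagger
# sees less sentence context than it would on the unfiltered text.
tagged_doc1 = pos_tagging(tokens_doc1)
tagged_doc2 = pos_tagging(tokens_doc2)

print("POS Tagged Document 1:", tagged_doc1)
print("POS Tagged Document 2:", tagged_doc2)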

POS Tagged Document 1: [('2016', 'CD'), ('2019', 'CD'), (',', ','), ('state', 'NN'), ('forest', 'JJS'), ('department', 'NN'), ('BJP', 'NNP'), ('government', 'NN'), ('launched', 'VBD'), ('‘', 'NNP'), ('Green', 'NNP'), ('Maharashtra', 'NNP'), ('’', 'NNP'), ('drive', 'NN'), ('aim', 'NN'), ('plant', 'NN'), ('50', 'CD'), ('crore', 'NN'), ('trees', 'NNS'), ('across', 'IN'), ('state', 'NN'), ('four-year', 'JJ'), ('period', 'NN'), ('.', '.'), ('October', 'NNP'), ('2019', 'CD'), (',', ','), ('government', 'NN'), ('claimed', 'VBD'), ('surpassed', 'JJ'), ('target', 'NN'), ('planting', 'VBG'), ('33', 'CD'), ('crore', 'NN'), ('trees', 'NNS'), ('July-September', 'NNP'), ('2019', 'CD'), ('.', '.'), ('Indian', 'JJ'), ('Express', 'NNP'), ('found', 'VBD'), ('non-forest', 'JJS'), ('agencies', 'NNS'), ('—', 'VBP'), ('gram', 'JJ'), ('panchayats', 'NNS'), ('—', 'VBP'), ('tasked', 'VBN'), ('planting', 'NN'), ('trees', 'NNS'), ('uploaded', 'JJ'), ('mandatory', 'JJ'), ('audio-visual', 'JJ'), ('proof', 'NN'), (

# Stemming

**Stemming** in Natural Language Processing (NLP) refers to the process of reducing a word to its **base or root form**, typically by stripping affixes (most stemmers, including Porter's, remove suffixes only). The resulting base form is known as the *stem*.

For example:  
**eating**, **eats** → **eat** (irregular forms such as **eaten** are generally left unchanged by rule-based stemmers)

### Purpose of Stemming

Stemming is a **linguistic normalization** technique used to:

- Extract the base form of a word.
- Improve the efficiency of NLP tasks.
- Reduce the number of word forms stored or analyzed.

It's similar to trimming a tree down to its stem: the extra branches (affixes) are removed to focus on the core.

### Applications

- **Search Engines**: To index and retrieve documents more effectively.  
  Instead of storing multiple word forms, only the **stem** is indexed, which reduces storage and improves recall because a query for one form also matches the others.

## Porter's Stemmer Algorithm

One of the most widely used stemming algorithms, **Porter’s Stemmer** was introduced in 1980. It is popular for its:

- **Speed**
- **Simplicity**
- Suitability for **English-only** text

### How it Works

The algorithm reduces words by identifying and removing common **suffix patterns**, often in multiple steps.

> Example Rule:  
> If a word ends in **EED** and contains a vowel-consonant sequence before it, replace **EED** with **EE**.  
>  
> - "agreed" → "agree"
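
The rule above can be sketched in a few lines of Python. This covers only this single rule, not the full algorithm, and the function names are hypothetical. The "vowel-consonant sequence" condition is Porter's measure `m > 0`, computed on the part of the word before **EED**. NLTK's `PorterStemmer` applies further steps afterwards, so its final stem for a word can differ from the result of this one rule.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a consonant only at the start
    of the word or directly after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions, i.e. Porter's measure m."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m

def apply_eed_rule(word):
    """Replace a trailing 'eed' with 'ee' when the preceding part has m > 0."""
    word = word.lower()
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word

for w in ["agreed", "feed"]:
    print(w, "->", apply_eed_rule(w))
```

Here "agreed" becomes "agree" because the part before the suffix, "agr", contains a vowel followed by a consonant. "feed" is left alone because "f" has no such sequence.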

### Key Points

- Often used in **Data Mining** and **Information Retrieval**
- Final stemmed words are **not always meaningful** words
- One of the oldest and most widely used stemmers in NLP
- Maps a group of related words to a **single root form**


In [13]:
# Step 4: Stemming
def stemming(tokens):
    stemmer = PorterStemmer()
    after_stemming = []
    for token in tokens:
        after_stemming.append(stemmer.stem(token))
    # return [stemmer.stem(token) for token in tokens]
    return after_stemming
    
# Stemming
stemmed_doc1 = stemming(tokens_doc1)
stemmed_doc2 = stemming(tokens_doc2)

print("Stemmed Document 1:", stemmed_doc1)
print("Stemmed Document 2:", stemmed_doc2)

Stemmed Document 1: ['2016', '2019', ',', 'state', 'forest', 'depart', 'bjp', 'govern', 'launch', '‘', 'green', 'maharashtra', '’', 'drive', 'aim', 'plant', '50', 'crore', 'tree', 'across', 'state', 'four-year', 'period', '.', 'octob', '2019', ',', 'govern', 'claim', 'surpass', 'target', 'plant', '33', 'crore', 'tree', 'july-septemb', '2019', '.', 'indian', 'express', 'found', 'non-forest', 'agenc', '—', 'gram', 'panchayat', '—', 'task', 'plant', 'tree', 'upload', 'mandatori', 'audio-visu', 'proof', 'tree', 'plantat', 'drive', 'special', 'creat', 'portal', '.', 'pune', 'revenu', 'divis', ',', 'claim', 'gram', 'panchayat', 'plant', '1.7', 'crore', 'sapl', ';', 'howev', ',', 'evid', 'upload', '87', 'per', 'cent', '(', '1.49', 'crore', ')', 'sapl', '.', 'also', ',', '59', 'govern', 'agenc', 'involv', 'drive', 'mani', '38', 'submit', 'surviv', 'report', 'sapl', '.', 'year', ',', 'target', 'set', 'forest', 'depart', 'compar', 'modest', '.', 'exampl', ',', 'pune', 'circl', '—', 'compris', 't

# Lemmatization

**Lemmatization** is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Unlike **stemming**, lemmatization uses a vocabulary and the word's **part of speech** (its context in the sentence) to reduce each word to its dictionary form, known as a **lemma**.

### Example:
- **rocks** → **rock**
- **corpora** → **corpus**
- **better** → **good** (when treated as an adjective)

For example, the verb **"walk"** could appear as **"walking"**, **"walks"**, or **"walked"** in different contexts. Lemmatization maps all of these inflected forms to the lemma **"walk"**, provided the lemmatizer knows they are verbs.

---

### Stemming vs Lemmatization

| **Aspect**            | **Stemming**                                      | **Lemmatization**                                      |
|-----------------------|---------------------------------------------------|--------------------------------------------------------|
| **Definition**         | Strips suffixes by rule, without consulting a dictionary. | Uses part of speech and a dictionary to convert words to their base form (lemma). |
| **Example**            | "agencies" → "agenc"                              | "agencies" → "agency"                                  |
| **Accuracy**           | Can produce stems that are not real words.        | Produces dictionary words when the word is in its vocabulary; unknown words are returned unchanged. |
| **Usage**              | Common in search and on large datasets where speed matters more than readable output. | Preferred when the output must consist of real words. |
| **Performance**        | Faster; applies a fixed set of rules.             | Slower; requires dictionary look-ups and, ideally, POS tags. |


In [31]:
# Step 5: Lemmatization
def lemmatization(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]

# Lemmatization
lemmatized_doc1 = lemmatization(tokens_doc1)
lemmatized_doc2 = lemmatization(tokens_doc2)

print("Lemmatized Document 1:", lemmatized_doc1)
print("Lemmatized Document 2:", lemmatized_doc2)

Lemmatized Document 1: ['2016', '2019', ',', 'state', 'forest', 'department', 'BJP', 'government', 'launched', '‘', 'Green', 'Maharashtra', '’', 'drive', 'aim', 'plant', '50', 'crore', 'tree', 'across', 'state', 'four-year', 'period', '.', 'October', '2019', ',', 'government', 'claimed', 'surpassed', 'target', 'planting', '33', 'crore', 'tree', 'July-September', '2019', '.', 'Indian', 'Express', 'found', 'non-forest', 'agency', '—', 'gram', 'panchayat', '—', 'tasked', 'planting', 'tree', 'uploaded', 'mandatory', 'audio-visual', 'proof', 'tree', 'plantation', 'drive', 'specially', 'created', 'portal', '.', 'Pune', 'Revenue', 'Division', ',', 'claimed', 'gram', 'panchayat', 'planted', '1.7', 'crore', 'sapling', ';', 'however', ',', 'evidence', 'uploaded', '87', 'per', 'cent', '(', '1.49', 'crore', ')', 'sapling', '.', 'Also', ',', '59', 'government', 'agency', 'involved', 'drive', 'many', '38', 'submitted', 'survival', 'report', 'sapling', '.', 'year', ',', 'target', 'set', 'forest', 'de
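
In the output above, verb forms such as `'launched'` and `'claimed'` are unchanged because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given. A common workaround is to map the Penn Treebank tags from the POS tagging step to WordNet's single-letter codes. The helper name `penn_to_wordnet` is hypothetical, and the sketch below shows only the mapping, without calling NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('n', 'v', 'a', 'r')."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # default to noun, as the lemmatizer itself does

tagged = [("launched", "VBD"), ("trees", "NNS"), ("better", "JJR")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK, each code could then be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, which is what allows mappings such as better → good.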

# TF-IDF (Term Frequency - Inverse Document Frequency)

**TF-IDF** is a statistical measure used in Natural Language Processing (NLP) and information retrieval to evaluate how important a word is to a document in a collection or corpus. It helps identify which words are more significant or unique in a specific document compared to the entire corpus.

The **TF-IDF** score is calculated by combining two components:

1. **Term Frequency (TF)**
2. **Inverse Document Frequency (IDF)**

### Components of TF-IDF

## 1. **Term Frequency (TF)**

**Term Frequency** (TF) measures how often a term appears in a document relative to the total number of terms in that document. It is calculated using the formula:

$$
\text{TF}(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}
$$

Where:
- $t$ is the specific term.
- The numerator is the number of times the term appears in the document.
- The denominator is the total number of tokens (terms) in the document.
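
As a minimal sketch of this formula, assuming the document is already tokenized (the function name `term_frequency` and the sample tokens are made up for illustration):

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tokens = ["plant", "trees", "plant", "saplings"]
print(term_frequency(tokens))
```

Here "plant" accounts for 2 of the 4 tokens, so its TF is 0.5, and the TF values of a document always sum to 1.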

## 2. **Inverse Document Frequency (IDF)**

**Inverse Document Frequency** (IDF) measures how rare a term is across the corpus. The idea is that terms that appear in many documents are less useful for distinguishing one document from another, so they receive a lower weight. IDF is calculated using the following formula:

$$
\text{IDF}(t) = \log \left( \frac{N}{df(t)} \right)
$$

Where:
- $N$ is the total number of documents in the corpus.
- $df(t)$ is the number of documents containing the term $t$.

The logarithm dampens the ratio so that very rare terms do not dominate, and a term that appears in every document gets an IDF of log(1) = 0.
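
A small sketch of the unsmoothed formula, assuming each document is a list of tokens. The function name and the toy corpus are illustrative only, and the natural logarithm is used here.

```python
import math

def inverse_document_frequency(documents):
    """documents: list of token lists. Returns {term: log(N / df(term))}."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() ensures a term is counted once per document
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = [["plant", "trees"], ["trees", "grow"], ["trees", "survive", "plant"]]
for term, value in sorted(inverse_document_frequency(docs).items()):
    print(f"{term}: {value:.4f}")
```

"trees" appears in all three documents and therefore scores 0, while "grow" and "survive", which each appear in only one, score highest.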

## **TF-IDF Calculation**

The **TF-IDF** score is simply the product of **TF** and **IDF** for a given term $t$:

$$
\text{TF-IDF}(t) = \text{TF}(t) \times \text{IDF}(t)
$$

This score reflects both:
- The frequency of the term in a specific document (TF).
- The rarity of the term across the entire corpus (IDF).

### Example:

Let’s say we have the following three documents in a corpus:

- **Document 1**: "I love NLP."
- **Document 2**: "NLP is amazing."
- **Document 3**: "I love coding."

We want to calculate the TF-IDF for the term **"love"** in **Document 1**.

1. **Term Frequency (TF)** for **"love"** in **Document 1**:

$$
\text{TF}(\text{love}) = \frac{1}{4} = 0.25 \quad \text{("love" appears once in Document 1, which has 4 tokens when the final period is counted)}
$$

2. **Inverse Document Frequency (IDF)** for **"love"**:

The term **"love"** appears in **Document 1** and **Document 3**, so the document frequency $df(\text{love}) = 2$.

$$
\text{IDF}(\text{love}) = \log_{10} \left( \frac{3}{2} \right) \approx 0.1761 \quad \text{(base-10 logarithm; there are 3 documents in total, and "love" appears in 2 of them)}
$$

3. **TF-IDF** for **"love"** in **Document 1**:

$$
\text{TF-IDF}(\text{love}) = 0.25 \times 0.1761 \approx 0.0440
$$
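
The arithmetic of this example can be checked with a few lines of Python. The token lists are written by hand, with the period as its own token to match the count of 4, and `math.log10` is used because 0.1761 is a base-10 value.

```python
import math

corpus = {
    "Document 1": ["I", "love", "NLP", "."],
    "Document 2": ["NLP", "is", "amazing", "."],
    "Document 3": ["I", "love", "coding", "."],
}

term = "love"
doc = corpus["Document 1"]

tf = doc.count(term) / len(doc)                               # 1 / 4
df = sum(1 for tokens in corpus.values() if term in tokens)   # 2 documents
idf = math.log10(len(corpus) / df)                            # log10(3 / 2)
tfidf = tf * idf

print(f"TF={tf:.2f}  IDF={idf:.4f}  TF-IDF={tfidf:.4f}")
```

With the natural logarithm, as `math.log` computes, the IDF would be about 0.4055. The ranking of terms stays the same because changing the base only rescales every score.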

### Significance of TF-IDF:
- **High TF-IDF values** indicate terms that are important and unique to a specific document.
- **Low TF-IDF values** indicate common terms that are not distinctive across documents.

### Applications of TF-IDF:
- **Search Engines**: TF-IDF helps determine which terms are most important in a document relative to a query.
- **Text Classification**: It is used to extract important features for machine learning models.
- **Topic Modeling**: Identifying key topics within a collection of documents.


In [19]:
# Step 6: Term Frequency (TF)
from collections import Counter

def calculate_term_freq(doc):
    # TF is computed on the raw text, so stop words and punctuation are included
    word_tokens = nltk.word_tokenize(doc)
    # Count each distinct token once instead of rescanning the list per word
    counts = Counter(word_tokens)
    total = len(word_tokens)
    return {word: count / total for word, count in counts.items()}

# Calculate Term Frequency (TF)
tf_doc1 = calculate_term_freq(doc1)
tf_doc2 = calculate_term_freq(doc2)

print("Term Frequency (TF) for Document 1:", tf_doc1)
print("Term Frequency (TF) for Document 2:", tf_doc2)

Term Frequency (TF) for Document 1: {'Between': 0.0014749262536873156, '2016': 0.004424778761061947, 'and': 0.02654867256637168, '2019': 0.0058997050147492625, ',': 0.05604719764011799, 'the': 0.058997050147492625, 'state': 0.0029498525073746312, 'forest': 0.008849557522123894, 'department': 0.0029498525073746312, 'under': 0.004424778761061947, 'BJP': 0.0014749262536873156, 'government': 0.0058997050147492625, 'had': 0.01032448377581121, 'launched': 0.0014749262536873156, '‘': 0.008849557522123894, 'Green': 0.004424778761061947, 'Maharashtra': 0.007374631268436578, '’': 0.01032448377581121, 'drive': 0.0058997050147492625, 'with': 0.01032448377581121, 'an': 0.0014749262536873156, 'aim': 0.0029498525073746312, 'to': 0.019174041297935103, 'plant': 0.004424778761061947, '50': 0.004424778761061947, 'crore': 0.019174041297935103, 'trees': 0.007374631268436578, 'across': 0.0014749262536873156, 'in': 0.016224188790560472, 'four-year': 0.0014749262536873156, 'period': 0.0014749262536873156, '.'

In [20]:
# Step 7: Inverse Document Frequency (IDF)
def calculate_idf(doc1, doc2):
    # Sets make the membership tests below fast
    d1_tokens = set(nltk.word_tokenize(doc1))
    d2_tokens = set(nltk.word_tokenize(doc2))
    N = 2  # Number of documents
    all_tokens = d1_tokens | d2_tokens

    idf = dict()
    for token in all_tokens:
        # Document frequency: how many of the two documents contain the token
        df = 0
        if token in d1_tokens:
            df += 1
        if token in d2_tokens:
            df += 1

        # Smoothed variant log(N / (1 + df)) rather than log(N / df).
        # df is never 0 here, so the +1 is not needed to avoid division by zero.
        # With N = 2 it gives log(2/2) = 0 for tokens found in one document
        # and log(2/3) < 0 for tokens found in both.
        idf[token] = math.log(N / (1 + df))
    return idf

# Calculate Inverse Document Frequency (IDF)
idf = calculate_idf(doc1, doc2)

print("Inverse Document Frequency (IDF):", idf)

Inverse Document Frequency (IDF): {'emphasizes': 0.0, 'during': 0.0, '?': 0.0, 'remains': 0.0, 'reports': 0.0, 'put': 0.0, 'activities': 0.0, 'initiated': 0.0, '2.89': 0.0, '95': 0.0, 'Daund': 0.0, 'part': -0.40546510810816444, 'However': -0.40546510810816444, 'download': 0.0, 'Besides': 0.0, 'Uttar': 0.0, 'BJP': 0.0, 'monitor': 0.0, 'up': 0.0, '€5.2': 0.0, 'Highways': 0.0, 'free': 0.0, 'year': 0.0, 'five': 0.0, 'State': -0.40546510810816444, 'helpline': 0.0, 'modest': 0.0, 'maintaining': 0.0, 'participation': 0.0, '87': 0.0, 'regions': 0.0, '2': 0.0, 'between': 0.0, 'quoted': 0.0, 'this': 0.0, 'agencies': 0.0, 'Survey': 0.0, 'Solapur': 0.0, 'official': 0.0, '1.7': 0.0, 'called': 0.0, 'years': -0.40546510810816444, 'society': 0.0, 'plants': 0.0, 'that': -0.40546510810816444, 'led': 0.0, 'emissions': 0.0, 'stakeholders': 0.0, 'Pune': 0.0, 'An': 0.0, 'NGOs': 0.0, 'found': 0.0, 'usually': 0.0, 'work': 0.0, 'four': 0.0, 'growing': 0.0, 'and': -0.40546510810816444, ')': -0.40546510810816444
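
Every value in the output above is either 0.0 or negative, a consequence of the `1 + df` term when N is only 2. One alternative that stays positive is the smoothing that scikit-learn's `TfidfTransformer` documents for `smooth_idf=True`, namely ln((1 + N) / (1 + df)) + 1. The sketch below implements that formula in plain Python; the function name `smoothed_idf` and the toy documents are assumptions, and this is not the method used in the notebook.

```python
import math

def smoothed_idf(documents):
    """documents: list of token lists.
    Returns {term: ln((1 + N) / (1 + df)) + 1}, which is always >= 1."""
    n_docs = len(documents)
    doc_sets = [set(tokens) for tokens in documents]
    vocabulary = set().union(*doc_sets)
    idf = {}
    for term in vocabulary:
        df = sum(1 for doc in doc_sets if term in doc)
        idf[term] = math.log((1 + n_docs) / (1 + df)) + 1
    return idf

docs = [["plant", "trees"], ["plant", "saplings"]]
for term, value in sorted(smoothed_idf(docs).items()):
    print(f"{term}: {value:.4f}")
```

With this variant a term shared by both documents gets a weight of exactly 1, and terms found in only one document get a higher weight, so the resulting TF-IDF scores are never negative.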

In [21]:
# Step 8: TF-IDF Calculation
def calculate_tfidf(tf, idf):
    tfidf = dict()
    for word in tf:
        tfidf[word] = tf[word] * idf.get(word, 0)
    # tfidf = {word: tf[word] * idf.get(word, 0) for word in tf}
    return tfidf

# Calculate TF-IDF
tfidf_doc1 = calculate_tfidf(tf_doc1, idf)
tfidf_doc2 = calculate_tfidf(tf_doc2, idf)

print("TF-IDF for Document 1:", tfidf_doc1)
print("TF-IDF for Document 2:", tfidf_doc2)

TF-IDF for Document 1: {'Between': 0.0, '2016': 0.0, 'and': -0.010764560392252154, '2019': -0.0023921245316115896, ',': -0.0227251830503101, 'the': -0.023921245316115895, 'state': -0.0011960622658057948, 'forest': -0.0035881867974173844, 'department': 0.0, 'under': -0.0017940933987086922, 'BJP': 0.0, 'government': -0.0023921245316115896, 'had': 0.0, 'launched': 0.0, '‘': 0.0, 'Green': 0.0, 'Maharashtra': 0.0, '’': 0.0, 'drive': -0.0023921245316115896, 'with': 0.0, 'an': -0.0005980311329028974, 'aim': 0.0, 'to': -0.0077744047277376665, 'plant': 0.0, '50': 0.0, 'crore': 0.0, 'trees': -0.002990155664514487, 'across': -0.0005980311329028974, 'in': -0.006578342461931871, 'four-year': 0.0, 'period': 0.0, '.': -0.012558653790960845, 'In': 0.0, 'October': 0.0, 'claimed': 0.0, 'it': 0.0, 'surpassed': 0.0, 'target': 0.0, 'by': -0.0011960622658057948, 'planting': -0.0017940933987086922, '33': 0.0, 'July-September': 0.0, 'The': -0.002990155664514487, 'Indian': -0.0005980311329028974, 'Express': -0
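
Long dictionaries like the one above are hard to read directly. A small helper can list the highest-scoring terms instead. The name `top_terms` and the sample scores below are hypothetical and are not taken from the notebook's output.

```python
def top_terms(scores, k=3):
    """Return the k (term, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

example_scores = {"sapling": 0.012, "the": 0.0, "crore": 0.019, "forest": 0.009}
print(top_terms(example_scores, k=2))
```

Applied to `tfidf_doc1` or `tfidf_doc2`, the same call would surface the terms that the chosen weighting considers most distinctive for each document.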