## **1. Environment Setup**
#### First, install NLTK and download the resources (models and corpora) needed for tokenization, POS tagging, stopword removal, and lemmatization.

In [None]:
# Install NLTK
!pip install nltk

import nltk
import pandas as pd

# Download the NLTK resources used in this notebook
nltk.download('punkt')                           # Punkt tokenizer models (older NLTK releases)
nltk.download('averaged_perceptron_tagger')      # POS tagger model (older NLTK releases)
nltk.download('averaged_perceptron_tagger_eng')  # English POS tagger model (newer NLTK releases)
nltk.download('stopwords')                       # Stopword lists
nltk.download('wordnet')                         # WordNet, used by the lemmatizer
nltk.download('omw-1.4')                         # Open Multilingual Wordnet data
nltk.download('punkt_tab')                       # Punkt tokenizer models in the format newer NLTK releases load




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nitin\AppData\Roaming\nltk_data...
[nltk_data]   Package word

True

## **2. Text Preprocessing Functions**
#### Preprocessing is the "cleaning" phase. It turns raw text into a normalized list of tokens that the later steps (tagging, lemmatization, counting) can work with.

### Step A: Tokenization & Noise Removal
#### We convert the text to lowercase, split it into individual words (tokens), and then drop punctuation and stopwords.

In [4]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    
    # 2. Tokenization
    tokens = word_tokenize(text)
    
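    # 3. Remove punctuation and stopwords.
    # Note: `w not in string.punctuation` is a substring test, so it only drops
    # tokens that occur verbatim in string.punctuation; a token such as "'s"
    # passes through (see the output below).
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [w for w in tokens if w not in stop_words and w not in string.punctuation]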
    
    return cleaned_tokens

sample_text = "The quick brown fox jumps over the lazy dog! It's a sunny day in the neighborhood."
tokens = preprocess_text(sample_text)
print(f"Cleaned Tokens: {tokens}")

Cleaned Tokens: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', "'s", 'sunny', 'day', 'neighborhood']
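
The output above still contains `'s`, because the punctuation check only catches tokens that appear verbatim in `string.punctuation`. One possible stricter filter is sketched below. It is not part of the original pipeline: the helper name `keep_alphabetic`, the small `demo_stopwords` set, and the hard-coded `demo_tokens` list (written by hand to mirror what `word_tokenize` is expected to return for the lowercased sample) are all illustrative assumptions.

```python
def keep_alphabetic(tokens, stop_words):
    """Keep tokens that are purely alphabetic and not in stop_words."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]

# Hand-written stand-in for word_tokenize(sample_text.lower()); an assumption.
demo_tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
               "dog", "!", "it", "'s", "a", "sunny", "day", "in", "the",
               "neighborhood", "."]

# Tiny illustrative stopword set; the notebook uses stopwords.words('english').
demo_stopwords = {"the", "over", "it", "a", "in"}

filtered = keep_alphabetic(demo_tokens, demo_stopwords)
print(filtered)
```

The trade-off is that `str.isalpha()` also discards numbers and hyphenated tokens, so it suits word-frequency work better than tasks where those tokens matter.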


## **3. POS Tagging (Part-of-Speech)**
#### POS tagging assigns a category to each word (Noun, Verb, Adjective, etc.) based on its definition and context.

In [5]:
def get_pos_tags(tokens):
    return nltk.pos_tag(tokens)

pos_tags = get_pos_tags(tokens)
print("POS Tags:", pos_tags)

POS Tags: [('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('lazy', 'VBP'), ('dog', 'NN'), ("'s", 'POS'), ('sunny', 'JJ'), ('day', 'NN'), ('neighborhood', 'NN')]


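##### Note: NLTK's default tagger uses the Penn Treebank tagset. For example, NN is a singular noun, NNS a plural noun, JJ an adjective, and VBP a present-tense verb. The tagger relies on sentence context, so tagging tokens after stopwords and punctuation have been removed can produce wrong tags, as with 'jumps' (NNS) and 'lazy' (VBP) above.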
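
To make the tag abbreviations easier to read, the tagged output can be joined with a small lookup table. This is a minimal sketch: `PENN_TAG_NAMES` and `describe_tags` are hypothetical helpers, the table covers only the five tags that appear in the output above, and `sample_tags` is copied from that output rather than recomputed.

```python
# Descriptions for the Penn Treebank tags that appear in the output above.
PENN_TAG_NAMES = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "VBP": "verb, non-3rd person singular present",
    "POS": "possessive ending",
}

def describe_tags(tagged_tokens):
    """Return (word, tag, description) triples; unknown tags get '?'."""
    return [(word, tag, PENN_TAG_NAMES.get(tag, "?")) for word, tag in tagged_tokens]

# Copied from the "POS Tags" output above.
sample_tags = [("quick", "JJ"), ("brown", "NN"), ("fox", "NN"), ("jumps", "NNS"),
               ("lazy", "VBP"), ("dog", "NN"), ("'s", "POS"), ("sunny", "JJ"),
               ("day", "NN"), ("neighborhood", "NN")]

for word, tag, description in describe_tags(sample_tags):
    print(f"{word:<14}{tag:<5}{description}")
```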

## **4. Lemmatization**
#### Unlike stemming (which chops off word endings), lemmatization uses a vocabulary and morphological analysis to return a word to its dictionary base form (the lemma).
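
To illustrate what "chopping off endings" means, here is a deliberately naive suffix stripper. It is a toy written for this comparison, not the Porter algorithm or any NLTK stemmer, and the function name `naive_stem` and its suffix list are assumptions.

```python
def naive_stem(word, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["jumps", "machines", "learning", "is"]:
    print(word, "->", naive_stem(word))
```

Note that `machines` becomes `machin`, which is not a dictionary word. A lemmatizer avoids this kind of result because it looks the word up in a vocabulary instead of trimming characters.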

In [6]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

lemmas = lemmatize_tokens(tokens)
print("Lemmatized:", lemmas)

Lemmatized: ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', "'s", 'sunny', 'day', 'neighborhood']
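
`WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed through its `pos` argument, which is why verb forms are often left unchanged. A possible refinement is to map each Penn Treebank tag to the single-letter codes WordNet uses (`'a'`, `'v'`, `'r'`, `'n'`) and pass that along. The sketch below is an assumption layered on the notebook: `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, and the lemmatizer is passed in as a parameter so the functions work with any object that exposes `lemmatize(word, pos=...)`.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, passing the mapped POS to the lemmatizer."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]

# Show the mapping for a few common tags.
for tag in ["VBZ", "JJ", "RB", "NNS"]:
    print(tag, "->", penn_to_wordnet(tag))
```

In this notebook the call would look like `lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer())`. The result is only as good as the tags, so tagging the full sentence before removing stopwords is likely to help.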


## **5. Frequency Analysis Application**
#### Let's wrap this into a simple analysis tool that identifies the most frequent meaningful words in a text block.

In [7]:
from nltk.probability import FreqDist

def analyze_text(text):
    # Process
    tokens = preprocess_text(text)
    lemmas = lemmatize_tokens(tokens)
    
    # Frequency Distribution
    fdist = FreqDist(lemmas)
    
    # Convert to a DataFrame for clean display (the order among tied counts is not guaranteed)
    df_freq = pd.DataFrame(fdist.items(), columns=['Word', 'Frequency']).sort_values(by='Frequency', ascending=False)
    
    return df_freq.head(10)

# Example Usage
blog_post = """
Artificial intelligence is transforming the world. Intelligence allows machines 
to learn from experience. Learning is essential for artificial systems to improve.
Most intelligence systems rely on data. Data is the new oil.
"""

result = analyze_text(blog_post)
print("Top 10 Words Analysis:")
print(result)

Top 10 Words Analysis:
            Word  Frequency
1   intelligence          3
0     artificial          2
13          data          2
10        system          2
4         allows          1
5        machine          1
2   transforming          1
3          world          1
7     experience          1
6          learn          1
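
`FreqDist` is a subclass of Python's `collections.Counter`, so the same ranking can be obtained without pandas by calling `most_common`. The sketch below uses a plain `Counter` to stay self-contained. The `demo_lemmas` list is an assumption: it was reconstructed by hand from the `blog_post` example and the table above, not produced by running the pipeline.

```python
from collections import Counter

# Lemmas reconstructed by hand from the blog_post example; an assumption.
demo_lemmas = ["artificial", "intelligence", "transforming", "world",
               "intelligence", "allows", "machine", "learn", "experience",
               "learning", "essential", "artificial", "system", "improve",
               "intelligence", "system", "rely", "data", "data", "new", "oil"]

counts = Counter(demo_lemmas)

# most_common returns (word, count) pairs sorted by count, highest first.
top_words = counts.most_common(4)
print(top_words)
```

`most_common` returns words with equal counts in the order they were first seen, which makes the ranking reproducible. The list also shows a limit of noun-only lemmatization: `learn` and `learning` are counted separately, and `allows` keeps its inflected form.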


## **Summary of Workflow**
 **1. Normalization:** Converting text to lowercase.

 **2. Tokenization:** Breaking sentences into word units.

 **3. Stopword Removal:** Eliminating "fluff" words (the, is, at).

 **4. POS Tagging:** Identifying the grammatical role of words (demonstrated in Section 3; `analyze_text` does not use it).
 
 **5. Lemmatization:** Reducing words to their dictionary form (lemma) for consistent counting.