**Assignment 1**
1.	Implement a conflation algorithm to generate the document representative of a text file.

Note: run only on Python versions below 3.12.

In [None]:
# Conflation Algorithm: Generate Document Representative of a Text File
# Using Stemming and Lemmatization

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string


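# Download resources (run once)
nltk.download('punkt')      # Punkt tokenizer models (older NLTK releases)
nltk.download('punkt_tab')  # Punkt tables that newer NLTK releases load for word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')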


# Step 1: Read input text file
filename = "Conflation.txt"   # <-- use your own text file
with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()

print("Original Text:\n", text)
print("-" * 80)

# Step 2: Tokenization
tokens = word_tokenize(text.lower())

# Step 3: Remove punctuation and stopwords
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word.isalpha() and word not in stop_words]

# Step 4: Apply Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in tokens]

# Step 5: Apply Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]

# Step 6: Display results
print("After Stemming:\n", ' '.join(stemmed_words))
print("-" * 80)
print("After Lemmatization:\n", ' '.join(lemmatized_words))
print("-" * 80)

# Step 7: Create Document Representative (word frequency)
from collections import Counter
freq = Counter(stemmed_words)
print("Document Representative (Word Frequency):")
for word, count in freq.most_common(10):
    print(f"{word}: {count}")


In [2]:
filename = "Conflation.txt"   # <-- use your own text file
with open(filename, 'r', encoding='utf-8') as file:
    text = file.read()

print("Original Text:\n", text)
print("-" * 80)

Original Text:
 Information retrieval is the process of obtaining information from large collections of text.
It involves searching, indexing, and ranking documents based on user queries.

--------------------------------------------------------------------------------



Imports (lines 1–6)

import nltk

Loads the NLTK package. In this script it is needed for the nltk.download(...) calls; the tokenizer, corpus, and stemming tools are imported by name in the lines that follow.

from nltk.corpus import stopwords

Imports the stopwords corpus module. This provides predefined lists of frequent words (like "the", "is") that you usually remove during text processing because they carry little semantic content.

from nltk.tokenize import word_tokenize

Imports word_tokenize, a standard tokenizer that splits raw text into a list of tokens (words and punctuation). It handles punctuation and common tokenization edge cases better than str.split().
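
To see why a plain str.split() falls short, the sketch below compares it with a small regular-expression tokenizer. The regex is only a rough stand-in for word_tokenize (an assumption made so the example runs without NLTK data); the real tokenizer covers more cases, such as contractions.

```python
import re

sample = "It involves searching, indexing, and ranking."

# str.split() only breaks on whitespace, so punctuation stays glued to words.
split_tokens = sample.split()

# Rough stand-in for word_tokenize: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(split_tokens)
print(regex_tokens)
```

With split(), "searching," keeps its comma and would never match the plain word "searching"; separating punctuation into its own token avoids that.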

from nltk.stem import PorterStemmer, WordNetLemmatizer

Imports two algorithms:

PorterStemmer: reduces words to a stem (heuristic chopping) — fast but may produce non-dictionary stems.

WordNetLemmatizer: reduces words to dictionary lemmas using WordNet (needs correct POS for best results; default assumes noun).

import string

Imports Python’s string module (useful for punctuation lists such as string.punctuation). In this script the import is unused, because the filter relies on word.isalpha() instead.

Why separate imports? Importing specific names keeps the code readable and makes clear which NLTK components the script uses.

NLTK downloads (lines 8–11)

nltk.download('punkt')

Downloads the Punkt tokenizer models that word_tokenize uses in older NLTK releases. Run once per environment. If the resource is already present, the call returns quickly.

nltk.download('punkt_tab')

Downloads the Punkt tokenizer data in the table format that recent NLTK releases (around 3.9 and later) load in place of the older punkt models. On those releases word_tokenize raises a LookupError that names punkt_tab when the resource is missing. If your script does not include this line and you see that error, add it. On older releases punkt alone is enough.

nltk.download('stopwords')

Downloads the stopword lists (English, etc.). Needed for stopwords.words('english').

nltk.download('wordnet')

Downloads the WordNet lexical database required by WordNetLemmatizer.

Important: Download calls print progress messages, and calling nltk.download() with no arguments opens an interactive downloader. For automated runs you might prefer to check that these resources exist beforehand.
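
One way to check for the resources before running is sketched below. It relies on nltk.data.find, which raises LookupError when a resource is not installed. The resource paths in REQUIRED and the helper names is_installed and missing_resources are assumptions for illustration, based on the usual layout of the NLTK data directory.

```python
import nltk

# Resource paths as NLTK usually stores them, paired with the name nltk.download() expects.
REQUIRED = {
    "tokenizers/punkt": "punkt",
    "corpora/stopwords": "stopwords",
    "corpora/wordnet": "wordnet",
}

def is_installed(resource_path):
    """Return True if NLTK can locate the resource locally."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        return False

def missing_resources(required=REQUIRED):
    """List the download names of resources that are not installed yet."""
    return [name for path, name in required.items() if not is_installed(path)]

# Report what is missing without downloading anything.
print(missing_resources())
```

Each name in the returned list could then be passed to nltk.download(name, quiet=True) so that only the missing pieces are fetched.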

File read (lines 13–16)

filename = "Conflation.txt"

Defines the filename string. The file must be in the directory the script is run from; otherwise, provide an absolute path.

with open(filename, 'r', encoding='utf-8') as file:

Opens the file in read mode with UTF-8 encoding. Using with ensures the file is closed automatically even on error.

text = file.read()

Reads the entire file into one string variable text. For very large files you might stream line-by-line, but for typical assignments this is fine.

Why encoding='utf-8'? — To handle non-ASCII characters safely; avoids UnicodeDecodeError for many text files.
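
For very large files, the line-by-line alternative mentioned above could look like the following sketch. The helper name read_lines and the sample file content are hypothetical; a temporary file is written first so the example runs on its own.

```python
import os
import tempfile

def read_lines(path):
    """Yield one line at a time instead of loading the whole file into memory."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")

# Write a small UTF-8 sample file (hypothetical content) so the sketch is runnable.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write("Information retrieval\ncaf\u00e9 na\u00efve\n")
    sample_path = tmp.name

lines = list(read_lines(sample_path))
os.remove(sample_path)  # clean up the temporary file

print(lines)
```

Because the file is opened with encoding='utf-8', the accented characters are decoded correctly regardless of the platform default.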

Debugging print + separator (lines 18–19)

print("Original Text:\n", text)

Prints the raw content — useful for verifying input and demonstrating results to the examiner.

print("-" * 80)

Prints a visual separator (80 dashes) to make console output easier to read.

Tokenization (line 22)

tokens = word_tokenize(text.lower())

Converts the entire text to lowercase (text.lower()) to normalize case (so “Apple” and “apple” are treated the same).

word_tokenize() splits the text into tokens. Tokens include words and punctuation (e.g., ["the", "runner", ",", "he"]).

Why lowercase first? — Case normalization reduces vocabulary size and improves conflation results.

Stopword and punctuation removal (lines 25–27)

stop_words = set(stopwords.words('english'))

Loads the English stopwords and converts to a set for O(1) membership checks (faster than list).

tokens = [word for word in tokens if word.isalpha() and word not in stop_words]

This list comprehension filters tokens:

word.isalpha() → keeps only tokens consisting of alphabetic characters (removes punctuation and numeric tokens).

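Effect: word_tokenize splits a contraction such as "don't" into "do" and "n't". isalpha() then drops "n't" because it contains an apostrophe, and "do" is removed as a stopword. Hyphenated tokens are dropped as well. If you need to preserve contractions or hyphenated words, consider a different filter.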

word not in stop_words → removes common stopwords.

Result: tokens becomes a list of cleaned, lowercased, alphabetic words with no stopwords.

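Pitfall: isalpha() drops any token that contains an apostrophe (e.g., "n't"), a hyphen (e.g., "well-known"), or a digit. Accented letters count as alphabetic in Python 3, so a word like "café" passes the filter. If your text relies on contractions or hyphenated words, the preprocessing must be customized.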
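
The filter's behavior can be checked on a handful of tokens. In this sketch the token list is hand-written in the shape word_tokenize typically produces, stop_words is a hypothetical miniature stand-in for stopwords.words('english'), and the looser regex filter is one possible customization, not part of the original script.

```python
import re

# Hand-written tokens, so the example does not need NLTK data.
tokens = ["do", "n't", "well-known", "caf\u00e9", "2024", "retrieval", ","]

# Hypothetical miniature stopword set.
stop_words = {"do", "the", "is"}

# Filter used in the script: alphabetic tokens that are not stopwords.
strict = [t for t in tokens if t.isalpha() and t not in stop_words]

# Looser filter: letters, optionally joined by internal hyphens.
word_pattern = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")
loose = [t for t in tokens if word_pattern.fullmatch(t) and t not in stop_words]

print(strict)
print(loose)
```

The strict filter keeps only the accented word and "retrieval", while the looser pattern also keeps "well-known".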

Stemming (lines 30–31)

stemmer = PorterStemmer()

Creates an instance of PorterStemmer. You call methods on this object to stem words.

stemmed_words = [stemmer.stem(word) for word in tokens]

Applies stemming to every token, producing a new list stemmed_words.

Stemming behavior: reduces inflections and derivations by stripping endings with fixed rules (e.g., running → run, queries → queri). Irregular forms are not handled, so ran stays ran.

Note: Stems are not guaranteed to be real words (may be truncated). This is acceptable for bag-of-words style document representatives.
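
A quick way to inspect what the stemmer does is to run it on a few related forms. This sketch uses only PorterStemmer, which needs the nltk package but no downloaded data; the word list is chosen for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Regular inflections, a plural that yields a non-dictionary stem, and an irregular past tense.
words = ["run", "runs", "running", "queries", "ran"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The regular forms collapse to run, queries becomes the non-dictionary stem queri, and the irregular form ran is left as it is.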

Lemmatization (lines 34–35)

lemmatizer = WordNetLemmatizer()

Creates a lemmatizer object that uses WordNet to return dictionary lemmas.

lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]

Lemmatizes each token. By default, lemmatize assumes the word is a noun. For better results, you would supply part-of-speech tags (e.g., 'v' for verb) — otherwise running remains running unless specified as a verb.

Difference vs stemming: Lemmatization returns valid dictionary words, but only when it is given the correct POS. Without POS tags, many verbs and adjectives are left unchanged.
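
Supplying the POS usually means translating tagger output into the letters the lemmatizer accepts. The helper below is a minimal sketch of that mapping, assuming Penn Treebank tags such as those returned by nltk.pos_tag; the name to_wordnet_pos is hypothetical.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    prefix_to_pos = {"J": "a", "V": "v", "N": "n", "R": "r"}
    # Fall back to noun, which is also the lemmatizer's own default.
    return prefix_to_pos.get(treebank_tag[:1], "n")

for tag in ["VBG", "NNS", "JJR", "RB", "DT"]:
    print(tag, "->", to_wordnet_pos(tag))
```

The result can be passed as the second argument, for example lemmatizer.lemmatize("running", to_wordnet_pos("VBG")), which returns run rather than running.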

Display results (lines 38–43)

print("After Stemming:\n", ' '.join(stemmed_words))

Joins stemmed words with spaces and prints them as a single line. This shows the effect of stemming across the document.

print("-" * 80)

Separator.

print("After Lemmatization:\n", ' '.join(lemmatized_words))

Joins and prints lemmatized tokens for comparison with stems.

print("-" * 80)

Separator.

Why show both? Examiners like to see both outputs to judge understanding: stemming (rule-based suffix stripping) vs lemmatization (dictionary lookup).

Document representative (frequency) (lines 46–50)

from collections import Counter

Imports Counter for counting frequencies of items in a list (efficient and concise).

freq = Counter(stemmed_words)

Creates a frequency map (stem → count) using the stemmed tokens. Using stems in the representative merges regular word forms (e.g., run, runs, running → run), which demonstrates conflation. Irregular forms such as ran keep their own entry.

print("Document Representative (Word Frequency):")

Prints a header for the output.

for word, count in freq.most_common(10):

Iterates over the top 10 most frequent stems in descending order.

print(f"{word}: {count}")

Prints each stem and its frequency count.

Why use stems for the representative? Because conflation aims to treat related forms as one feature; counting stems reduces the number of distinct terms and concentrates the counts of related forms on a single entry.
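
The counting step can be tried in isolation. In this sketch the stems list is hand-written for illustration (it is not actual stemmer output), and the relative-frequency column is an optional addition that the original script does not compute.

```python
from collections import Counter

# Hypothetical stems for a short document (hand-written, not stemmer output).
stems = ["inform", "retriev", "inform", "document", "queri", "document", "inform"]

freq = Counter(stems)
total = sum(freq.values())

# Keep the top terms with raw counts and relative frequencies.
representative = [
    (term, count, round(count / total, 2))
    for term, count in freq.most_common(3)
]

for term, count, share in representative:
    print(f"{term}: {count} ({share})")
```

Terms with equal counts appear in the order they were first seen, so the choice of cut-off can matter for short documents.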