# Tutorial I - Text Processing and Information Retrieval

**Duration:** 1.5 hours

**Prerequisites:** Basic Python knowledge, familiarity with pandas and numpy

**Learning Objectives**
*   Understand basic text preprocessing techniques
*   Implement text cleaning and normalization
*   Calculate TF-IDF scores
*   Create a simple document search system




## Setup

In [1]:
import re
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

## Exercise 1: Text Cleaning (20 minutes)
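Implement a function that cleans text by:

*   Removing HTML tags (hint: look for content between < and >)
*   Converting to lowercase
*   Removing special characters
*   Removing extra whitespace

Apply the steps in this order: if special characters are removed first, the angle brackets disappear and the tag names (such as `p`) leak into the cleaned text.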

In [7]:
text = """<p>This is a    Sample Text with HTML tags</p> & special chars!!"""
text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags first, while < and > still exist
text = text.lower()
text = re.sub(r"[^a-z0-9\s]", "", text)   # remove special characters
text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace and trim the ends

print(text)

this is a sample text with html tags special chars


In [8]:
def clean_text(text):
    """
    Clean and normalize text data.

    Args:
        text (str): Input text to clean

    Returns:
        str: Cleaned text
    """
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags first, while < and > still exist
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # remove special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace and trim the ends

    return text

# Test your function
test_text = """<p>This is a    Sample Text
with HTML tags</p> & special chars!!!"""
print("Original:", test_text)
print("Cleaned:", clean_text(test_text))

Original: <p>This is a    Sample Text
with HTML tags</p> & special chars!!!
Cleaned: this is a sample text with html tags special chars
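Since the prerequisites mention pandas, here is a minimal sketch of applying the cleaning step to a whole column. The DataFrame, the column names `text` and `clean`, and the sample strings are made up for illustration. The `clean_text` defined here is a tag-first variant written so the block runs on its own.

```python
import re

import pandas as pd


def clean_text(text):
    """Tag-first cleaning variant (illustrative, self-contained copy)."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags before anything else
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # drop everything except letters, digits, whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace and trim


# Hypothetical data: the column name "text" is an assumption for this sketch
df = pd.DataFrame({"text": ["<b>Hello</b>,   World!", "  Fresh <i>NEWS</i> today  "]})

# Series.apply calls clean_text once per row
df["clean"] = df["text"].apply(clean_text)
print(df["clean"].tolist())
```

`Series.apply` is usually fast enough for tutorial-sized data. For large corpora, the vectorized `Series.str` methods may be worth considering instead.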


## Exercise 2: Building a TF-IDF Search Engine (30 minutes)
Create a simple search engine using TF-IDF to find relevant documents.
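Before building the engine, it may help to see what `TfidfVectorizer` computes. The sketch below recomputes the scores by hand and compares them with scikit-learn's output. It assumes the vectorizer's default settings (raw term counts, smoothed IDF, L2 row normalization). The names `counts`, `idf`, and `tfidf_manual` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "The cat and the dog play",
    "The dog chases a ball",
    "A cat naps in the sun",
]

# Raw term counts: one row per document, one column per vocabulary term
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default formula)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF = term count * idf, then scale every row to unit L2 length
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Same computation done by the library
tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()

print("Manual and library results match:", np.allclose(tfidf_manual, tfidf_sklearn))
```

With the default tokenizer, single-character tokens such as "a" are dropped, so they never enter the vocabulary.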

In [None]:
class SimpleSearchEngine:
    def __init__(self):
        """Initialize the search engine with a TF-IDF vectorizer."""
        self.vectorizer = TfidfVectorizer()
        self.documents = []

    def add_documents(self, documents):
        """
        Add documents to the search engine

        Args:
            documents (list): List of text documents
        """
        # Your code here
        pass

    def search(self, query, top_k=2):
        """
        Search for documents most relevant to query

        Args:
            query (str): Search query
            top_k (int): Number of results to return

        Returns:
            list: Indices of top_k most relevant documents
        """
        # Your code here
        pass

# Test documents
documents = [
    "The cat and the dog play",
    "The dog chases a ball",
    "A cat naps in the sun",
]

# Create and test your search engine
search_engine = SimpleSearchEngine()
search_engine.add_documents(documents)
results = search_engine.search("cat playing")
print("Search Results:", results)
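One possible way to complete this exercise is sketched below; it is not the only valid solution. It assumes cosine similarity between the query vector and each document vector as the ranking score. It also stores the fitted matrix in an attribute named `doc_matrix`, which is not part of the template.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class SimpleSearchEngine:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()
        self.documents = []
        self.doc_matrix = None  # assumed attribute: TF-IDF matrix of the indexed documents

    def add_documents(self, documents):
        # Fit the vocabulary on the documents and keep their TF-IDF vectors
        self.documents = list(documents)
        self.doc_matrix = self.vectorizer.fit_transform(self.documents)

    def search(self, query, top_k=2):
        # Project the query into the same vector space as the documents
        query_vec = self.vectorizer.transform([query])
        scores = cosine_similarity(query_vec, self.doc_matrix).ravel()
        # Indices sorted from highest to lowest score, truncated to top_k
        ranked = np.argsort(scores)[::-1][:top_k]
        return [int(i) for i in ranked]


documents = [
    "The cat and the dog play",
    "The dog chases a ball",
    "A cat naps in the sun",
]

search_engine = SimpleSearchEngine()
search_engine.add_documents(documents)
results = search_engine.search("cat playing")
print("Search Results:", results)
```

Note that "playing" is not in the fitted vocabulary (only "play" is), so the query matches on "cat" alone. Stemming or lemmatization would be needed to connect the two forms.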

## Exercise 3: Document Similarity (20 minutes)
Implement a function that measures how similar two documents are using cosine similarity.

In [None]:
def calculate_similarity(doc1, doc2):
    """
    Calculate cosine similarity between two documents

    Args:
        doc1 (str): First document
        doc2 (str): Second document

    Returns:
        float: Similarity score between 0 and 1
    """
    # Your code here
    pass

# Test documents
doc1 = "The quick brown fox jumps over the lazy dog"
doc2 = "The fast brown fox leaps over the sleepy dog"
doc3 = "Python programming is fun and interesting"

print("Similarity score (similar docs):", calculate_similarity(doc1, doc2))
print("Similarity score (different docs):", calculate_similarity(doc1, doc3))
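A minimal sketch of one way to fill in `calculate_similarity`, assuming the two documents are vectorized with TF-IDF fitted on just that pair. With so few documents the IDF weights are rough, so treat the scores as relative rather than absolute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def calculate_similarity(doc1, doc2):
    # Fit TF-IDF on the two documents only (an assumption of this sketch)
    tfidf = TfidfVectorizer().fit_transform([doc1, doc2])
    # cosine_similarity returns a 1x1 matrix for two single-row inputs
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])


doc1 = "The quick brown fox jumps over the lazy dog"
doc2 = "The fast brown fox leaps over the sleepy dog"
doc3 = "Python programming is fun and interesting"

sim_similar = calculate_similarity(doc1, doc2)
sim_different = calculate_similarity(doc1, doc3)
print("Similarity score (similar docs):", sim_similar)
print("Similarity score (different docs):", sim_different)
```

Because TF-IDF vectors have no negative entries, the cosine score stays between 0 and 1. Documents with no shared terms, such as `doc1` and `doc3`, score 0.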

## Exercise 4: Putting It All Together (20 minutes)
Use everything you've learned to create a complete document processing pipeline.

In [None]:
def process_documents(documents):
    """
    Process a collection of documents:
    1. Clean each document
    2. Calculate TF-IDF
    3. Compute the pairwise similarity matrix used to find the most similar document pairs

    Args:
        documents (list): List of text documents

    Returns:
        tuple: (processed_docs, similarity_matrix)
    """
    # Your code here
    pass

# Test the complete pipeline
test_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "<p>A quick brown fox leaps over a lazy dog!</p>",
    "Programming in Python: A beginner's guide to coding",
    "Python programming tutorial for beginners"
]

processed_docs, similarity_matrix = process_documents(test_docs)
print("Processed Documents:", processed_docs)
print("\nSimilarity Matrix:\n", similarity_matrix)
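The pipeline could be assembled roughly as follows. This is a sketch under stated assumptions, not a reference solution: it reuses a tag-first `clean_text` (repeated here so the block is self-contained) and uses cosine similarity over TF-IDF vectors as the similarity measure.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags first
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # drop special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace and trim


def process_documents(documents):
    # 1. Clean each document
    processed_docs = [clean_text(doc) for doc in documents]
    # 2. Calculate TF-IDF vectors for the cleaned documents
    tfidf = TfidfVectorizer().fit_transform(processed_docs)
    # 3. Pairwise cosine similarity; entry [i, j] compares documents i and j
    similarity_matrix = cosine_similarity(tfidf)
    return processed_docs, similarity_matrix


test_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "<p>A quick brown fox leaps over a lazy dog!</p>",
    "Programming in Python: A beginner's guide to coding",
    "Python programming tutorial for beginners",
]

processed_docs, similarity_matrix = process_documents(test_docs)
print("Processed Documents:", processed_docs)
print("\nSimilarity Matrix:\n", similarity_matrix.round(2))
```

The diagonal of the matrix is 1 because every document is identical to itself. To find the most similar pairs, look for the largest off-diagonal entries.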