# Lab 7 - IR IRBasics Vector Space Proximity Workshop

`Group 7:`
- Paula Ramirez 8963215
- Hasyashri Bhatt 9028501
- Babandeep 9001552

## 📄 Step 1: Document Collection

We collect a set of text documents to build our inverted index. The documents are from Free eBooks | Project Gutenberg (https://www.gutenberg.org/),which include various topics like crime, history, and more.

### 🔧 Our documents:
- we collected 20 text documents into the `sample_docs folder`, from above source. 
- All has plain text and more than 2000 words.
- Load the documents into a list for processing.


In [1]:
# Example: Load text files from a folder
import os

def load_documents(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.txt'):
            with open(os.path.join(folder_path, filename), 'r', encoding='utf-8') as file:
                documents.append(file.read())
    return documents

# Replace 'sample_docs/' with your actual folder
documents = load_documents('sample_docs/')
print(f"Loaded {len(documents)} documents.")


Loaded 20 documents.


This function read all documents from the `sample_docs` folder and returns the number documents loaded. 

## ✂️ Step 2: Tokenizer

In this section we created a basic tokenizer to process the text documents. This tokenizer split each document into tokens (words) and removes punctuation. It also converts all tokens to lowercase and with a regular expression to remove any non-alphanumeric characters.

At the end of this block, we will have a list of tokens for each document.

In [2]:
import re

def tokenize(text):
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

# Test on one document
tokens = tokenize(documents[0])
print(tokens[:20])  # Preview first 20 tokens


['the', 'project', 'gutenberg', 'ebook', 'of', 'glimpses', 'of', 'the', 'dark', 'ages', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in']


## 🔁 Step 3: Normalization Pipeline (Stemming, Stop Word Removal, etc.)

Using `nltk` library, we will implement a normalization pipeline that includes stemming and stop word removal. This will help us reduce the vocabulary size and focus on the most relevant terms in our documents.

For example, the word "anyone" will be stemmed to "anyon", and "glimpse" will be stemmed to "glimps".

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)  # Suppress download warnings
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def normalize_tokens(tokens):
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# Example: normalize one document
norm_tokens = normalize_tokens(tokens)
print(norm_tokens[:20])

['project', 'gutenberg', 'ebook', 'glimps', 'dark', 'age', 'ebook', 'use', 'anyon', 'anywher', 'unit', 'state', 'part', 'world', 'cost', 'almost', 'restrict', 'whatsoev', 'may', 'copi']


## Previous token:

`['the', 'project', 'gutenberg', 'ebook', 'of', 'glimpses', 'of', 'the', 'dark', 'ages', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in']`

## After remove stopwords and applying stemming:

`['project', 'gutenberg', 'ebook', 'glimps', 'dark', 'age', 'ebook', 'use', 'anyon', 'anywher', 'unit', 'state', 'part', 'world', 'cost', 'almost', 'restrict', 'whatsoev', 'may', 'copi']`

The difference between the previous and the new token is that the previous token contains stopwords such as "the", "of", "is", "for", "in", etc., while the new token has these words removed, leaving only the significant words that contribute to the meaning of the text.
We saw that some letters were removed from the words, this is because we applied stemming to the words, which reduces them to their root form. For example, "glimpses" becomes "glimps", "ages" becomes "age", and "anyone" becomes "anyon". This helps in reducing the vocabulary size and improving search accuracy.

## 🔍 Step 4: Inverted Index

In this step, we are creating an inverted index from the processed tokens. The inverted index maps each unique token to the list of documents (or their IDs) where that token appears. This is a crucial step in building an efficient search engine.

One example is: 

`'glimps': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`

In [4]:
from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        tokens = normalize_tokens(tokenize(text))
        seen = set()
        for token in tokens:
            if token not in seen:
                index[token].append(doc_id)
                seen.add(token)
    return index

inverted_index = build_inverted_index(documents)

# print first 20 terms in the inverted index in a table format
print("Term\t\tDoc IDs")
for term, doc_ids in list(inverted_index.items())[:20]:
    print(f"{term}\t\t{doc_ids}")
    


Term		Doc IDs
project		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
gutenberg		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
ebook		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
glimps		[0, 3, 6, 7, 8, 10, 12, 14, 15]
dark		[0, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16]
age		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
use		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
anyon		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
anywher		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
unit		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
state		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
part		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
world		[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
cost		[0, 1, 2, 3, 4, 5, 6, 7, 

## 🧪 Test: Phrase Queries

Here, we are implementing queries using positional indexing. This allows us to search for phrases within the documents.

To support **exact phrase search**, we store the position of each word in each document. The search function only returns documents where the full sequence of words appears **in order and without gaps**.
 
We test two phrases:
- `"crime and punishment"`
- `"this ebook"`
 
The function will print document indices and a short preview from each result for validation.


In [5]:
from collections import defaultdict

# --- Positional Index for Phrase Queries ---
def build_positional_index(documents):
    positional_index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(documents):
        tokens = normalize_tokens(tokenize(text))
        for pos, token in enumerate(tokens):
            positional_index[token][doc_id].append(pos)
    return positional_index

# --- Phrase Query Search Function ---
def phrase_search(query, positional_index):
    words = normalize_tokens(tokenize(query))
    if not words:
        return []
    doc_sets = [set(positional_index[word].keys()) for word in words if word in positional_index]
    if len(doc_sets) != len(words):
        return []
    candidate_docs = set.intersection(*doc_sets)
    matching_docs = []
    for doc_id in candidate_docs:
        positions = [positional_index[word][doc_id] for word in words]
        for pos in positions[0]:
            if all((pos + i) in positions[i] for i in range(1, len(words))):
                matching_docs.append(doc_id)
                break
    return matching_docs

# --- Build Positional Index ---
positional_index = build_positional_index(documents)

# --- Test Phrase Queries ---
query1 = "crime and punishment"
query2 = "this ebook"

results1 = phrase_search(query1, positional_index)
results2 = phrase_search(query2, positional_index)

print(f"🔎 Results for '{query1}':", results1)
for i in results1:
    print(f'→ Doc {i} preview: {documents[i][:150]}...')

print(f"🔎 Results for '{query2}':", results2)
for i in results2:
    print(f'→ Doc {i} preview: {documents[i][:150]}...')

🔎 Results for 'crime and punishment': [3, 16]
→ Doc 3 preview: ﻿The Project Gutenberg eBook of Crime and Punishment
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of th...
→ Doc 16 preview: ﻿The Project Gutenberg eBook of Voyage to the East Indies
    
This ebook is for the use of anyone anywhere in the United States and
most other parts ...
🔎 Results for 'this ebook': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
→ Doc 0 preview: ﻿The Project Gutenberg eBook of Glimpses of the Dark Ages
    
This ebook is for the use of anyone anywhere in the United States and
most other parts ...
→ Doc 1 preview: ﻿The Project Gutenberg eBook of Hannele
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no...
→ Doc 2 preview: ﻿The Project Gutenberg eBook of A history of the Peninsular War, Vol. 3, Sep. 1809-Dec. 1810
    
This ebook is for the use of anyone anywhere in the ...

## 🧪 Six Core NLP Concepts

### 🔹 Term-Document Incidence Matrix

The **Term-Document Incidence Matrix** is a binary matrix that shows whether a term $t$ appears in a document $d$.

- Rows represent terms in the vocabulary  
- Columns represent documents in the corpus  
- Each entry $w_{t,d}$ is defined as:

$$
w_{t,d} =
\begin{cases}
1 & \text{if } t \in d \\
0 & \text{otherwise}
\end{cases}
$$

This is a **binary representation** — it only records the **presence or absence** of a term, not how many times it appears.

---

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
 
#  Load text documents using  function
documents = load_documents('sample_docs/')
doc_ids = [f"Doc{i+1}" for i in range(len(documents))]
 
#  Create CountVectorizer to capture unigrams to trigrams
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
 
#  Fit and transform the documents
X = vectorizer.fit_transform(documents)
 
#  Create Term-Document Incidence Matrix
incidence_matrix = pd.DataFrame(
    X.toarray(),
    index=doc_ids,
    columns=vectorizer.get_feature_names_out()
)
 
#  Filter for specific phrases
target_phrases = ["crime and punishment", "this ebook"]
filtered_columns = [phrase for phrase in target_phrases if phrase in incidence_matrix.columns]
filtered_matrix = incidence_matrix[filtered_columns]
 
#  Display
print(" Term-Document Incidence Matrix (Filtered for Phrases):")
display(filtered_matrix)

 Term-Document Incidence Matrix (Filtered for Phrases):


Unnamed: 0,crime and punishment,this ebook
Doc1,0,1
Doc2,0,1
Doc3,0,1
Doc4,1,1
Doc5,0,1
Doc6,0,1
Doc7,0,1
Doc8,0,1
Doc9,0,1
Doc10,0,1



### 🗣 Talking Point Term-Document Incidence Matrix
>  To check if a phrase like "crime and punishment" or "this ebook" appears in a document, we split the phrase into words and check if both words have a 1 in the same document column. If they do, the document contains both words. But this matrix doesn’t check if the words are next to each other,  just that they are present.  

### 🔹 Term Frequency (TF)

**Term Frequency (TF)** measures how frequently a term $t$ appears in a document $d$.

$$
tf_{t,d} = f_{t,d}
$$

Where $f_{t,d}$ is the raw count of term $t$ in document $d$.

---

#### ✅ Why Use It?

- TF reflects the importance of a word **within a specific document**.
- A higher TF means the term is likely central to the topic of that document.
- It's used as the **first step** in vectorizing text for machine learning models like classification, clustering, or information retrieval.

TF is most effective when combined with **IDF** (Inverse Document Frequency) to balance against very common terms across the corpus.

---

In [7]:
import pandas as pd
from collections import Counter
 
# Load all documents
documents = load_documents('sample_docs/')
doc_ids = [f"Doc{i+1}" for i in range(len(documents))]
 
doc_index = 4 
doc_text = documents[doc_index].lower().split()  # Simple tokenizer
 
# Raw Term Frequency
tf_raw = Counter(doc_text)
total_terms = len(doc_text)
 
# Normalized Term Frequency
tf_normalized = {term: count / total_terms for term, count in tf_raw.items()}
 
# Display
print(f" Document: {doc_ids[doc_index]}")
print("\n Raw Term Frequencies:")
display(pd.DataFrame(tf_raw.items(), columns=["Term", "Raw TF"]))
 
print("\n📏 Normalized Term Frequencies:")
display(pd.DataFrame(tf_normalized.items(), columns=["Term", "TF (Normalized)"]))
 

 Document: Doc5

 Raw Term Frequencies:


Unnamed: 0,Term,Raw TF
0,﻿the,1
1,project,83
2,gutenberg,25
3,ebook,8
4,of,144
...,...,...
1323,produce,1
1324,"ebooks,",1
1325,subscribe,1
1326,newsletter,1



📏 Normalized Term Frequencies:


Unnamed: 0,Term,TF (Normalized)
0,﻿the,0.000237
1,project,0.019673
2,gutenberg,0.005926
3,ebook,0.001896
4,of,0.034131
...,...,...
1323,produce,0.000237
1324,"ebooks,",0.000237
1325,subscribe,0.000237
1326,newsletter,0.000237



### 🗣 Talking Point TF:
>  Looking at the normalized TF for a document, terms with higher TF values are more important because they appear more frequently relative to the document length. Comparing Doc1 with another document, words like “crime,” “punishment,” or “ebook” with higher normalized TFs suggest the document’s main themes. This helps an AI agent prioritize these terms when building context or answering queries related to those topics.



### 🔹 Log Frequency Weight

To reduce the impact of very frequent terms, **log frequency weighting** is applied.

$$
w_{t,d} =
\begin{cases}
1 + \log_{10}(f_{t,d}) & \text{if } f_{t,d} > 0 \\
0 & \text{if } f_{t,d} = 0
\end{cases}
$$

This transformation reduces the skew caused by terms that appear many times in a document. Instead of allowing their raw frequency to dominate, we scale their contribution **logarithmically**.

---

#### ✅ Why Use It?

- Frequent terms are not always the most **important** terms.
- Log scaling ensures that:
  - Words with a raw count of 1 are preserved ($1 + \\log_{10}(1) = 1$),
  - But words with very high counts (e.g., 1000) don’t dominate the document vector.

This helps **normalize the influence** of repetitive terms and improve the **numerical stability** of document representations in models.

---


In [8]:
import pandas as pd
import numpy as np
from collections import Counter
 
# Load documents
documents = load_documents('sample_docs/')
doc_ids = [f"Doc{i+1}" for i in range(len(documents))]
 
# Choose which document to process
doc_index = 11 # Change index from 0–19 to process a different document
tokens = documents[doc_index].lower().split()
 
# Raw term frequency
raw_tf = Counter(tokens)
 
# Log frequency weighting
log_weighted_tf = {
    term: 1 + np.log10(freq) if freq > 0 else 0
    for term, freq in raw_tf.items()
}
 
# Create DataFrame
df = pd.DataFrame({
    "Term": list(raw_tf.keys()),
    "Raw TF (f_{t,d})": list(raw_tf.values()),
    "Log Weight (w_{t,d})": list(log_weighted_tf.values())
})
 
print(f" Document: {doc_ids[doc_index]}")
print(" Log Frequency Weighting:")
display(df)

 Document: Doc12
 Log Frequency Weighting:


Unnamed: 0,Term,"Raw TF (f_{t,d})","Log Weight (w_{t,d})"
0,﻿the,1,1.000000
1,project,83,2.919078
2,gutenberg,25,2.397940
3,ebook,8,1.903090
4,of,122,3.086360
...,...,...,...
11705,"ebooks,",1,1.000000
11706,subscribe,1,1.000000
11707,newsletter,1,1.000000
11708,hear,1,1.000000



### 🗣 Talking Point Log Frequency:
>  Based on the log frequency weighting output for Doc12, terms with higher raw frequencies and consequently higher log weights—such as “project,” “gutenberg,” and “of”—are likely the most important because they appear frequently and thus carry more significance within the document. Comparing these with another document’s TF or log weights can highlight thematic differences. For example, if “ebook” has a higher weight in Doc12 but is less frequent in Doc1, it suggests Doc12 focuses more on ebooks, helping an AI agent prioritize relevant documents when answering queries or building context.


### 🔹 Document Frequency (DF)

**Document Frequency** is the number of documents in which a term $t$ appears:

$$
df_t = |\{ d \in D : t \in d \}|
$$

Where:
- $df_t$ is the document frequency of term $t$
- $D$ is the set of all documents in the corpus
- $t \in d$ means the term $t$ appears in document $d$

---

#### ✅ Why Use It?

- It helps you understand **how common or rare** a word is across the entire document set.
- Words with **high DF** (e.g., “the”, “and”) occur in many documents and are often **less informative**.
- Words with **low DF** are more likely to be **specific and meaningful** for distinguishing between documents.
- DF is a key ingredient in calculating **Inverse Document Frequency (IDF)**.

---

In [9]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
 
# Load documents
documents = load_documents('sample_docs/')
doc_ids = [f"Doc{i+1}" for i in range(len(documents))]
 
# Create CountVectorizer for raw counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
 
# Extract terms and term-document matrix
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()
 
# Compute Document Frequency (number of docs where term appears)
df_counts = (X_array > 0).sum(axis=0)
 
# Build DataFrame
df_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts
}).sort_values("Document Frequency (df_t)", ascending=False)
 
print(" Document Frequency (DF) Table:")
display(df_table)
 

 Document Frequency (DF) Table:


Unnamed: 0,Term,Document Frequency (df_t)
14835,apply,20
101051,your,20
79009,redistribute,20
79010,redistributing,20
79011,redistribution,20
...,...,...
15,034,1
14,033,1
13,030,1
12,024,1



### 🗣 Talking Point Document Frequency:
> Choosing a term like “redistribute” which appears in all 20 documents (high document frequency), its impact on TF-IDF weighting would be lower because it’s very common and less helpful for distinguishing between documents. In contrast, a term like “fun” that appears in only one document (low document frequency) would get a higher TF-IDF weight, making it more important for identifying unique or specific content. This helps an AI agent focus on terms that better differentiate documents when building context or answering queries.


### 🔹 Inverse Document Frequency (IDF)

**Inverse Document Frequency (IDF)** measures how rare or informative a term is across the entire corpus:

$$
idf_t = \log_{10} \left( \frac{N}{df_t} \right)
$$

Where:
- $N$ is the total number of documents in the corpus  
- $df_t$ is the number of documents that contain the term $t$

---

#### ✅ Why Use It?

- IDF is used to **downweight common terms** and **upweight rare ones**.
- Words like “the”, “and”, or “data” appear frequently and are less helpful in distinguishing documents.
- Terms that appear in **fewer documents** are often **more informative** and **discriminative**.
- IDF is a core component of **TF-IDF**, a widely used technique in search engines, document classification, and clustering.

---

In [10]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
 
# Load documents
documents = load_documents('sample_docs/')
N = len(documents)  # Total number of documents
 
# Create CountVectorizer and get document-term matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()
 
# Document frequency for each term
df_counts = (X_array > 0).sum(axis=0)
 
# Compute IDF: log10(N / df_t)
idf_values = np.log10(N / df_counts)
 
# Create DataFrame
idf_table = pd.DataFrame({
    "Term": terms,
    "Document Frequency (df_t)": df_counts,
    "IDF (log10(N / df_t))": idf_values
}).sort_values("IDF (log10(N / df_t))", ascending=False)
 
print(" Inverse Document Frequency (IDF) Table:")
display(idf_table)

 Inverse Document Frequency (IDF) Table:


Unnamed: 0,Term,Document Frequency (df_t),IDF (log10(N / df_t))
101909,₁₂,1,1.30103
0,00,1,1.30103
101908,ῥοιζηδὸν,1,1.30103
2,001,1,1.30103
3,004,1,1.30103
...,...,...,...
72642,place,20,0.00000
14781,appears,20,0.00000
14780,appearing,20,0.00000
14888,approach,20,0.00000



### 🗣 Talking Point IDF:
> In this step, we calculated the Inverse Document Frequency (IDF) for each term in the corpus. The IDF metric helps us identify rare or unique terms that appear in fewer documents — giving them more importance during TF-IDF computation. As shown in the table, many terms like '00', '001', and others have an IDF value of approximately 1.30, which suggests they appear in only one document out of the entire set. However, we also noticed some noisy or non-English tokens like ῥοιζηδὸν or purely numeric tokens, which indicates a need for better token filtering or preprocessing to ensure cleaner term relevance.


### 🔹 TF-IDF Weighting

**TF-IDF (Term Frequency–Inverse Document Frequency)** scores each term $t$ in document $d$ based on how frequent and how rare it is:

$$
w_{t,d} = \left(1 + \log_{10}(f_{t,d})\right) \times \log_{10} \left( \frac{N}{df_t} \right)
$$

Where:
- $f_{t,d}$ is the raw count of term $t$ in document $d$
- $df_t$ is the number of documents that contain term $t$
- $N$ is the total number of documents in the corpus

---

#### ✅ Why Use It?

- TF-IDF balances **term importance within a document** (TF) against **term commonality across all documents** (IDF).
- It **boosts rare, relevant words** while **suppressing frequent, generic words**.
- TF-IDF is foundational in:
  - Information Retrieval (search engines)
  - Document similarity
  - Feature engineering for classification or clustering

---

In [12]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
 
# Load documents
documents = load_documents('sample_docs/')
N = len(documents)
doc_ids = [f"Doc{i+1}" for i in range(N)]
 
# Key phrases to track
target_phrases = ["crime",  "and", "punishment", "this", "ebook"]
 
# Vectorizer with up to trigrams (to capture phrases)
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(documents)
terms = vectorizer.get_feature_names_out()
X_array = X.toarray()
 
# Document Frequencies and IDF
df = (X_array > 0).sum(axis=0)
idf = np.log10(N / df)
 
# Log-weighted TF
tf_log = 1 + np.where(X_array > 0, np.log10(X_array), 0)
 
# TF-IDF calculation
tfidf = tf_log * idf
 
# Full matrix
tfidf_df = pd.DataFrame(tfidf, columns=terms, index=doc_ids)
 
# Filter only selected phrases that exist
filtered_phrases = [phrase for phrase in target_phrases if phrase in tfidf_df.columns]
filtered_tfidf = tfidf_df[filtered_phrases]
 
# Output
print(" Manual TF-IDF Matrix for Selected Phrases:")
display(filtered_tfidf.round(3))
 

  tf_log = 1 + np.where(X_array > 0, np.log10(X_array), 0)


 Manual TF-IDF Matrix for Selected Phrases:


Unnamed: 0,crime,and,punishment,this,ebook
Doc1,0.445,0.0,0.318,0.0,0.0
Doc2,0.301,0.0,0.187,0.0,0.0
Doc3,0.301,0.0,0.276,0.0,0.0
Doc4,0.825,0.0,0.356,0.0,0.0
Doc5,0.301,0.0,0.187,0.0,0.0
Doc6,0.301,0.0,0.243,0.0,0.0
Doc7,0.392,0.0,0.243,0.0,0.0
Doc8,0.392,0.0,0.276,0.0,0.0
Doc9,0.392,0.0,0.187,0.0,0.0
Doc10,0.301,0.0,0.187,0.0,0.0



### 🗣 Talking Point TD-IDF:
> In this step, we computed the manual TF-IDF weights for selected key terms like 'crime', 'punishment', 'this', and 'ebook' across our document set. As seen in the output, the term 'crime' appears frequently and with higher weight in multiple documents, while 'punishment' and 'ebook' show zero TF-IDF—indicating that either these terms are absent or not frequent enough to impact weighting. This insight helps us identify which documents are most contextually relevant to specific themes like crime and punishment.
 
