# **LAB ASSIGNMENT - 4**

**Insert Documents**

In [3]:
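# Note: the string literals inside documents0 and documents1 are not separated by commas,
# so Python joins them and each list holds a single multi-sentence document.
documents0 = [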
    "Stanford University is a great place for research.\n"
    "California is known for its beautiful weather and top-ranked universities.\n"
    "The University of California, Berkeley, is also a highly reputed institution.\n"
    "MIT is one of the most prestigious universities in the world.\n"
    "Harvard University has a long history of excellence in education and research.\n"
]

documents1 = [
    "Artificial Intelligence is transforming industries worldwide.\n"
    "Machine Learning and Deep Learning are subsets of AI.\n"
    "The impact of AI on the healthcare sector is revolutionary.\n"
    "AI technologies are being used in autonomous vehicles.\n"
    "Ethical concerns around AI are growing as the technology advances.\n"
]

documents2 = [
    "The University of California, Berkeley, is also a highly reputed institution."
]

documents3 = [
    "MIT is one of the most prestigious universities in the world."
]

documents4 = [
    "Harvard University has a long history of excellence in education and research."
]

all_documents = documents0 + documents1 + documents2 + documents3 + documents4

print("Documents inserted successfully:\n")
for i, doc in enumerate(all_documents):
    print(f"Doc {i}: {doc}")

Documents inserted successfully:

Doc 0: Stanford University is a great place for research.
California is known for its beautiful weather and top-ranked universities.
The University of California, Berkeley, is also a highly reputed institution.
MIT is one of the most prestigious universities in the world.
Harvard University has a long history of excellence in education and research.

Doc 1: Artificial Intelligence is transforming industries worldwide.
Machine Learning and Deep Learning are subsets of AI.
The impact of AI on the healthcare sector is revolutionary.
AI technologies are being used in autonomous vehicles.
Ethical concerns around AI are growing as the technology advances.

Doc 2: The University of California, Berkeley, is also a highly reputed institution.
Doc 3: MIT is one of the most prestigious universities in the world.
Doc 4: Harvard University has a long history of excellence in education and research.
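
Doc 0 and Doc 1 each contain five sentences because adjacent string literals without a comma between them are concatenated into one string. The following minimal sketch, using made-up placeholder sentences rather than the lab data, shows the difference a comma makes:

```python
# Adjacent string literals with no comma between them are joined into one string.
joined = [
    "first sentence.\n"
    "second sentence.\n"
]

# With commas, each sentence becomes its own list element.
separate = [
    "first sentence.\n",
    "second sentence.\n",
]

print(len(joined))    # 1
print(len(separate))  # 2
```

If one document per sentence is the goal, adding commas would produce that, at the cost of changing the document count and every score reported later in this notebook.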


**Preprocessing + Lemmatization**

In [4]:
import re
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def preprocess(doc):
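    doc = doc.lower()                         # lowercase
    doc = re.sub(r'[^a-z\s]', '', doc)        # delete everything except a-z and whitespace ("top-ranked" -> "topranked")
    tokens = doc.split()                      # whitespace tokenization
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # lemmatize as nouns (default POS), so "has" -> "ha" and "its" -> "it"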
    return tokens

processed_docs = []

print("Preprocessed Documents:\n")

for i, doc in enumerate(all_documents):
    tokens = preprocess(doc)
    processed_docs.append(tokens)
    print(f"Doc {i}: {tokens}\n")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Preprocessed Documents:

Doc 0: ['stanford', 'university', 'is', 'a', 'great', 'place', 'for', 'research', 'california', 'is', 'known', 'for', 'it', 'beautiful', 'weather', 'and', 'topranked', 'university', 'the', 'university', 'of', 'california', 'berkeley', 'is', 'also', 'a', 'highly', 'reputed', 'institution', 'mit', 'is', 'one', 'of', 'the', 'most', 'prestigious', 'university', 'in', 'the', 'world', 'harvard', 'university', 'ha', 'a', 'long', 'history', 'of', 'excellence', 'in', 'education', 'and', 'research']

Doc 1: ['artificial', 'intelligence', 'is', 'transforming', 'industry', 'worldwide', 'machine', 'learning', 'and', 'deep', 'learning', 'are', 'subset', 'of', 'ai', 'the', 'impact', 'of', 'ai', 'on', 'the', 'healthcare', 'sector', 'is', 'revolutionary', 'ai', 'technology', 'are', 'being', 'used', 'in', 'autonomous', 'vehicle', 'ethical', 'concern', 'around', 'ai', 'are', 'growing', 'a', 'the', 'technology', 'advance']

Doc 2: ['the', 'university', 'of', 'california', 'berkele
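
The token `topranked` in Doc 0 appears because the regular expression deletes the hyphen and fuses the two words. One possible alternative, sketched here with a hypothetical helper name `clean_keep_boundaries`, replaces non-letters with a space so that word boundaries survive. It is not the preprocessing used for the results in this notebook:

```python
import re

def clean_keep_boundaries(text):
    """Lowercase the text, then turn every non-letter into a space before splitting."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # substitute a space instead of deleting
    return text.split()

tokens = clean_keep_boundaries("Top-ranked universities, e.g. MIT.")
print(tokens)
# ['top', 'ranked', 'universities', 'e', 'g', 'mit']
```

The trade-off is that abbreviations such as "e.g." break into single letters, which a later filtering step would need to handle.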

**TF-IDF Calculation + Term Ranking**

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Convert tokens back to text
processed_texts = [" ".join(doc) for doc in processed_docs]

# Default settings: tokens shorter than two characters (e.g. "a") are ignored
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(processed_texts)

terms = vectorizer.get_feature_names_out()

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=terms)

print("TF-IDF Matrix:\n")
print(tfidf_df)

TF-IDF Matrix:

    advance        ai      also       and       are    around  artificial  \
0  0.000000  0.000000  0.115731  0.192134  0.000000  0.000000    0.000000   
1  0.128247  0.512988  0.000000  0.085888  0.384741  0.128247    0.128247   
2  0.000000  0.000000  0.357789  0.000000  0.000000  0.000000    0.000000   
3  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000    0.000000   
4  0.000000  0.000000  0.000000  0.276495  0.000000  0.000000    0.000000   

   autonomous  beautiful     being  ...  technology       the  topranked  \
0    0.000000   0.143445  0.000000  ...    0.000000  0.242443   0.143445   
1    0.128247   0.000000  0.128247  ...    0.256494  0.216756   0.000000   
2    0.000000   0.000000  0.000000  ...    0.000000  0.249844   0.000000   
3    0.000000   0.000000  0.000000  ...    0.000000  0.471808   0.000000   
4    0.000000   0.000000  0.000000  ...    0.000000  0.000000   0.000000   

   transforming  university      used   vehicle   weather     wo
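
The heading mentions term ranking, but the cell above only prints the matrix. As a rough illustration, the sketch below computes TF-IDF in plain Python and lists the highest-weighted terms per document. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` and L2 normalisation that `TfidfVectorizer` uses by default; the function names and the toy corpus are hypothetical and are not the lab data:

```python
import math
from collections import Counter

def tfidf_rows(token_lists):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in token_lists:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

def top_terms(row, k=3):
    """Top-k terms by weight; ties are broken alphabetically."""
    return sorted(row.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Toy corpus (made up for illustration)
toy_docs = [
    ["ai", "in", "healthcare", "ai"],
    ["university", "in", "california"],
    ["ai", "vehicle"],
]

rows = tfidf_rows(toy_docs)
for i, row in enumerate(rows):
    print(f"Doc {i}:", [(t, round(w, 3)) for t, w in top_terms(row, 2)])
```

With the notebook's own objects, a comparable ranking could be read off `tfidf_df` by sorting each row in descending order.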

**Query Relevance Using Cosine Similarity**

In [12]:
from sklearn.metrics.pairwise import cosine_similarity

query = input("Enter your query: ")

query_processed = preprocess(query)
query_text = " ".join(query_processed)

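# Project the query onto the vocabulary learned from the documents
query_vector = vectorizer.transform([query_text])

# One cosine similarity score per document
similarity_scores = cosine_similarity(query_vector, tfidf_matrix)[0]

# Sort (doc_id, score) pairs from most to least relevant
results = list(enumerate(similarity_scores))
results.sort(key=lambda x: x[1], reverse=True)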

print("\nDocument Relevance Ranking:\n")

for doc_id, score in results:
    print(f"Doc {doc_id} → Relevance Score: {score:.4f}")

Enter your query: AI IN HEALTHCARE

Document Relevance Ranking:

Doc 1 → Relevance Score: 0.4480
Doc 3 → Relevance Score: 0.0873
Doc 4 → Relevance Score: 0.0861
Doc 0 → Relevance Score: 0.0598
Doc 2 → Relevance Score: 0.0000
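
Docs 3, 4 and 0 receive small non-zero scores even though they say nothing about AI or healthcare. The only query token they share is "in", and Doc 2, which does not contain "in", scores 0. The sketch below illustrates the effect with a hand-written cosine function, made-up raw weights (not the TF-IDF values above), and a hypothetical `STOP_WORDS` set:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical stop-word list, kept tiny for the example
STOP_WORDS = {"in", "the", "of", "is"}

def drop_stop_words(vec):
    """Remove stop-word entries from a sparse vector."""
    return {t: w for t, w in vec.items() if t not in STOP_WORDS}

# Made-up weights for illustration only
query_vec = {"ai": 1.0, "in": 1.0, "healthcare": 1.0}
doc_about_ai = {"ai": 2.0, "healthcare": 1.0, "the": 1.0}
doc_about_mit = {"mit": 1.0, "in": 1.0, "the": 1.0, "world": 1.0}

before = cosine(query_vec, doc_about_mit)
after = cosine(drop_stop_words(query_vec), drop_stop_words(doc_about_mit))
print(f"shared stop word only: {before:.4f}, after filtering: {after:.4f}")
```

In the notebook itself, passing `stop_words='english'` to `TfidfVectorizer` is one way to try this; the resulting scores would differ from those printed above.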
