<a href="https://colab.research.google.com/github/jenchan88/CS466Lab2InfoRetrieval/blob/main/Lab2_InformationRetrieval_ipynb_jchanOldVersion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Lab Assignment: Document Retrieval System**
**Due Date: April 26**

**Objective:**
The objective of this lab assignment is to implement a document retrieval system using the TF-IDF algorithm and cosine similarity distance. Students will learn to preprocess text data, calculate TF-IDF scores, compute cosine similarity, and retrieve the top relevant documents for a given query.
This lab assignment is designed to enhance students' understanding of text processing, information retrieval, and machine learning techniques. It encourages hands-on experience with real-world text data and provides opportunities for experimentation and improvement. You can use the following helper code as a guideline or create your own implementation.

Tasks:
This lab is a continuation of Lab1. You can use the document_class pickel object. Implement classes to represent documents and document collections.
Implement methods to preprocess documents and compute TF-IDF scores.
Fetch documents from a URL and preprocess them.
Implement methods to preprocess user queries and compute cosine similarity between the query and documents.
Perform document retrieval for first 5 queries and display the top 20 relevant documents.

**Expected Output**
As an example, your implementation should give a similar output:

Documents for query 3: [184, 359, 486, 686, 327, 875, 13, 102, 685, 540, 349, 1361, 25, 573, 758, 12, 252, 880, 57, 755]

The numbers inside the parenthesis correspond to the top 20 documents.

**Detailed Tasks:**

1. **Text Preprocessing:**
    - Implement a text preprocessing function that tokenizes text, removes punctuation, converts text to lowercase, and removes stop words.
    - Ensure that the preprocessing function is capable of handling a collection of documents.
    

2. **TF-IDF Calculation:**
- we need to load the document class pickle file to access the TF-IDF scores for each document.
    - The TF-IDF scores are stored in this file, and by loading it, we can access the scores for each document and compare them with the query terms. This allows us to find which documents contain words similar to the ones in the query and retrieve their indexes.
    - Implement a TF-IDF calculation function that computes the TF-IDF scores for each term in the document collection.
    - Ensure that the TF-IDF calculation function normalizes the term frequencies and computes the inverse document frequency.
    - Implement a class or function to represent the document vectors.

3. **Document Retrieval:**
    - Implement a cosine similarity function to compute the similarity between a query and each document in the collection.
    - Use the TF-IDF vectors calculated in step 2 to compute the cosine similarity.
    -- compute_cosine_similarity could be a potential name of this function. You need to complete this function
    - Retrieve the top 20 relevant documents for each query using the files: queries.txt.

4. **Data Loading and Saving:**
    - Load the document collection from a given file (e.g., pickle file).
    - Save the TF-IDF vectors to a pickle file for future use.

5. **Integration and Testing:**
    - Integrate all the implemented components to create a complete document retrieval system.
    - Test the system using a sample query file (e.g., queries.txt).
    - Ensure that the system produces correct results and handles edge cases effectively.

**Submission:**
- Submit the Python notebook file (.ipynb) containing the implementation of each component including from the first Lab where you created the document class, documents_collection pickel file.
- Provide a README file explaining the usage of the code and any additional instructions.
- Ensure that the code is well-commented and organized.


You will create a class TextVector that contains the following methods:


In [4]:
# Install NLTK (if not already installed)
!pip install nltk



In [5]:
# Import the NLTK module
import nltk

# Download the punkt resource (if not already downloaded)
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
import re
import math
import pickle
import requests
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np

class Document:
    def __init__(self, doc_id, text):
        self.doc_id = doc_id
        self.text = text

    def preprocess_text(self):
        # Tokenize text
        tokens = []
        temp = word_tokenize(self.text)
        cleanedTokens = []
        # Remove punctuation and convert to lowercase
        for word in temp:
          if word.isalnum():
            cleanedTokens.append(word.lower())
        # Remove stop words
        for w in cleanedTokens:
          if w not in stopwords.words('english'):
            tokens.append(w)
        return tokens

class DocumentCollection:
    def __init__(self):
        self.documents = {}

    def add_document(self, doc):
        self.documents[doc.doc_id] = doc

    def preprocess_documents(self):
        for doc_id, doc in self.documents.items():
            self.documents[doc_id].text = doc.preprocess_text()

    def compute_tf_idf(self):
        tf_idf = {}
        N = len(self.documents)
        # Compute term frequency (TF) for each document
        term_freqs = {}
        for docID, doc in self.documents.items():
          term_freqs[docID] = Counter(doc.text)
        # Normalize term frequencies
          totalTerms = sum(term_freqs[docID].values())
          for term in term_freqs[docID]:
            term_freqs[docID][term] = term_freqs[docID][term] / totalTerms

        # Compute inverse document frequency (IDF) for each term
        docFreqs = Counter()
        ##give doc frqs same keys (terms) as in freq dictionary
        for freq in term_freqs.values():
          docFreqs.update(freq.keys())
        IDF = {}
        for term, df in docFreqs.items():
          ## check if frequency is non-zero to avoid 0 divison err
          if df != 0:
            IDF[term] = math.log(N / df)
          ## if dividing by 0, just return 0
          else:
            IDF[term] = 0
        # Compute TF-IDF scores
        for docID, freq in term_freqs.items():
          tf_idf[docID] = {}
          for term, tf in freq.items():
            tf_idf[docID][term] = tf * IDF[term]

        return tf_idf

    def save_to_pickle(self, filename):
        with open(filename, 'wb') as file:
            pickle.dump(self.documents, file)

    @staticmethod
    def load_from_pickle(filename):
        with open(filename, 'rb') as file:
            return pickle.load(file)

def fetch_documents(url):
    response = requests.get(url)
    if response.status_code == 200:
        text = response.text
        return extract_documents(text)
    else:
        print("Failed to retrieve the file. Status code:", response.status_code)
        return None

def extract_documents(text):
    collection = DocumentCollection()
    doc_id = None
    text_buffer = ""
    for line in text.splitlines():
        if line.startswith(".I"):
            if doc_id is not None:
                collection.add_document(Document(doc_id, text_buffer))
                text_buffer = ""
            doc_id = int(line.strip().split()[-1])
        elif line.startswith(".W"):
            text_buffer += line.strip().replace(".W", "")
        else:
            text_buffer += line.strip()
    if doc_id is not None:  # Add the last document
        collection.add_document(Document(doc_id, text_buffer))
    return collection

def preprocess_query(query):
    # Implement query preprocessing here (similar to document preprocessing)
    # Tokenize, remove stop words, etc.
    # Tokenize text
    tokens = word_tokenize(query)
    cleanedTokens = []
    query_terms = []

    # Remove punctuation and convert to lowercase
    for word in tokens:
        if word.isalnum():
            cleanedTokens.append(word.lower())

    # Remove stop words

    for w in cleanedTokens:
          if w not in stopwords.words('english'):
            query_terms.append(w)

    return query_terms



def compute_cosine_similarity(query, tf_idf):

    # Normalize query vector
    query_vector = {}
    for term in query:
        query_vector[term] = 1 / len(query)

    # Compute cosine similarity between query and each document
    cosSimilarityDict = {}
    for docID, docVector in tf_idf.items():
      # dotProd = np.dot(list(query_vector.values()), list(docVector.values()))
      common_terms = set(query_vector.keys()) & set(docVector.keys())
      ## Only calculate dot product for common terms
      dotProduct = 0
      for term in common_terms:
          dotProduct += query_vector[term] * docVector[term]
      queryMagnitude = sum(value ** 2 for value in query_vector.values()) ** 0.5
      docMagnitude = sum(value ** 2 for value in docVector.values()) ** 0.5
      # if query_vector.values() != 0 and docVector.values() !=0:
      if queryMagnitude != 0 and docMagnitude != 0:
        cosSim = dotProduct / (queryMagnitude * docMagnitude)
      else:
        cosSim = 0
      cosSimilarityDict[docID] = cosSim
    # Sort documents by similarity score
    sorted_documents = []
    # Sort documents by similarity score
    sorted_documents = sorted(cosSimilarityDict.items(), key=lambda x: x[1], reverse=True)

    # for i in range(min(20, len(cosSimilarityDict))):
    #   maxSimDocID = max(cosSimilarityDict, key=cosSimilarityDict.get)
    #   maxSimScore = cosSimilarityDict[maxSimDocID]
    #   sorted_documents.append((maxSimDocID, maxSimScore))
    return sorted_documents[:20]

def compute_tf_idf_for_query(query, documents):
    # Preprocess the query
    preQuery = preprocess_query(query)

    # Create a document object for the query
    queryDoc = Document("query", query)
    # Compute TF-IDF for the query document
    query_tf_idf = {}
    termFreq = Counter(preQuery)
    totalTerms = sum(termFreq.values())
    for term, freq in termFreq.items():
      tf = freq / totalTerms
      ## for every term, calculate doc frequency
      df = 0
      for doc in documents:
        if term in doc.text:
          df += 1
      if df > 0:
        idf = math.log(len(documents) / (df))
      else:
        idf = 0
      query_tf_idf[term] = tf * idf
    return query_tf_idf


def main():
    # Fetch documents from URL
    url = 'https://raw.githubusercontent.com/sumonacalpoly/Datasets/main/documents.txt'
    documents_collection = fetch_documents(url)

    if documents_collection:
        # Preprocess documents
        documents_collection.preprocess_documents()
        # Compute TF-IDF scores
        tf_idf = documents_collection.compute_tf_idf()
        # Save TF-IDF vectors to pickle file
        documents_collection.save_to_pickle("tfidfVectors.pkl")

        # Read queries from file
        queries_url = 'https://raw.githubusercontent.com/sumonacalpoly/Datasets/main/queries.txt'
        queries_response = requests.get(queries_url)
        if queries_response.status_code == 200:
            # queries = [line.strip() for line in queries_response.text.splitlines()]
            # Perform document retrieval for each query
            queries_data = queries_response.text.strip().split('.I')[1:6]
            queries = {}
            for queryData in queries_data:
              query_lines = queryData.split('.W')  # Split query lines
              query_number = query_lines[0].strip()
              query_text = query_lines[1].strip()
              queries[query_number] = query_text
              # queryTFIDF = compute_tf_idf_for_query(query, documents_collection.documents.values())
              # similarDocs = compute_cosine_similarity(queryTFIDF, tf_idf)
              # print(f"Documents for query '{query}': {[doc[0] for doc in similarDocs[:20]]}")
            ## Doc retrieval
            for query_number, query_text in queries.items():
                # Compute TF-IDF representation for the query
                queryTFIDF = compute_tf_idf_for_query(query_text, documents_collection.documents.values())
                # Compute cosine similarity between the query and documents
                similarDocs = compute_cosine_similarity(queryTFIDF, tf_idf)
                # Print the top 20 relevant documents for the query
                print(f"Documents for query '{query_number}': {[doc[0] for doc in similarDocs[:20]]}")
        else:
            print("Failed to retrieve queries file. Status code:", queries_response.status_code)



if __name__ == "__main__":
    main()

Documents for query '001': [13, 51, 184, 429, 792, 359, 1111, 686, 878, 486, 114, 327, 435, 875, 1144, 540, 700, 102, 100, 154]
Documents for query '002': [51, 12, 792, 700, 429, 875, 884, 1147, 1111, 833, 883, 606, 1163, 726, 746, 804, 100, 114, 1158, 1379]
Documents for query '004': [485, 181, 5, 144, 399, 542, 707, 91, 582, 425, 666, 350, 623, 944, 554, 347, 303, 395, 586, 90]
Documents for query '008': [166, 1312, 236, 185, 1275, 575, 317, 1085, 375, 914, 435, 410, 1189, 656, 1077, 4, 570, 456, 1010, 1061]
Documents for query '009': [103, 540, 26, 1272, 650, 1102, 813, 746, 1379, 568, 327, 1158, 360, 172, 401, 1066, 137, 1391, 573, 575]
