# Updating the Dictionary
This project builds a Boolean retrieval system that answers binary (AND, OR, NOT) and phrase queries. Users can also add documents to the system and delete documents from it.







In [71]:
from functools import total_ordering, reduce
import csv  # Import the csv module for CSV file parsing
import re  # Import the re module for regular expression operations


In [72]:
# Needed to execute the notebook on google colab
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## Construction of the IRsystem

Several classes are needed to build the system:


*   **Posting** class: it represents a posting in an index
*   **PostingList** class: it represents a list of postings in an index
*   **Term** class: it represents a term in an index, which is associated with its posting list
*   **PositionalIndex** class: it represents the index, which is a positional index given that the positions where each term appears in each document are saved in order to answer phrase queries
*   **IRsystem** class: it represents the structure of the whole IR system. It is used to initialize the system, add or delete documents and perform queries.

Let us now inspect these classes
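
As a rough picture of what these classes store, the sketch below builds a positional index with plain dictionaries instead of the classes defined in this notebook. The function name `build_positional_index` and the toy documents are illustrative assumptions, not part of the project.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {docID: [positions]} for a list of already-normalized texts."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for position, token in enumerate(text.split()):
            # Record every position at which the token occurs in this document.
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)

toy_docs = ["the cat sat", "the dog sat on the cat"]
toy_index = build_positional_index(toy_docs)
print(toy_index["the"])  # {0: [0], 1: [0, 4]}
print(toy_index["cat"])  # {0: [1], 1: [5]}
```

Each inner entry plays the role of a posting (a document ID with its positions), and each outer entry plays the role of a term with its posting list.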


### Postings

In [73]:
@total_ordering  # This decorator will add all rich comparison methods based on the definitions of __eq__ and __gt__.
class Posting:    # The class represents a 'Posting' in an index

    def __init__(self, docID, positions):
        # The initializer takes a document ID and the list of positions where the term occurs in that document.
        self._docID = docID  # The document ID is stored in a protected member variable.
        self._positions= positions # The positions in the document where the term appears

    def get_from_corpus(self, corpus):
        # A method to retrieve a document's contents from a corpus using the stored document ID.
        return corpus[self._docID]  # Returns the document associated with the document ID from the corpus.

    def __eq__(self, other):
        # Special method to check equality with another Posting, based on document ID.
        return self._docID == other._docID  # Returns True if the document IDs are equal, otherwise False.

    def __gt__(self, other):
        # Special method to check if this Posting is greater than another Posting, based on document ID.
        return self._docID > other._docID  # Returns True if this document ID is greater than the other's.

    def __repr__(self):
        # Special method to provide the official string representation of the Posting.
        return f"DocID: {self._docID}, Positions: {', '.join(map(str, self._positions))}"
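
To see what `total_ordering` contributes, here is a minimal sketch with a hypothetical stand-in class, `DocRef`, that is compared by document ID only, in the same way as `Posting`. It is an illustration under that assumption, not a replacement for the class above.

```python
from functools import total_ordering

@total_ordering
class DocRef:
    # Hypothetical stand-in for Posting: ordering depends on doc_id only.
    def __init__(self, doc_id, positions):
        self.doc_id = doc_id
        self.positions = positions

    def __eq__(self, other):
        return self.doc_id == other.doc_id

    def __gt__(self, other):
        return self.doc_id > other.doc_id

a = DocRef(3, [1, 7])
b = DocRef(5, [2])
c = DocRef(3, [40])

print(a < b)   # True: __lt__ is derived from __gt__ and __eq__
print(a <= c)  # True: same doc_id, positions are ignored
print(a == c)  # True
print([r.doc_id for r in sorted([b, a])])  # [3, 5]
```

Because equality ignores positions, two postings for the same document compare equal even when they belong to different terms, which is what the merge-style set operations rely on.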


### Posting Lists

In [74]:
class PostingList:
    # This class represents a list of postings

    def __init__(self):
        # The initializer method for the class. It initializes an empty list of postings.
        self._postings = []  # Protected member variable that holds the list of postings.

    @classmethod
    def from_docID(cls, docID, position):
        # A class method to create a PostingList instance with a single Posting from a document ID and the position where the term was found.
        plist = cls()  # Creates a new instance of the class.
        plist._postings = [(Posting(docID, [position]))]  # Initializes the postings list with a single Posting and the position where the term was first found.
        return plist  # Returns the newly created PostingList instance.

    @classmethod
    def from_posting_list(cls, postingList):
        # A class method to create a PostingList instance from an existing list of Postings.
        plist = cls()  # Creates a new instance of the class.
        plist._postings = postingList  # Sets the postings list to the provided list.
        return plist  # Returns the newly created PostingList instance.

    def merge(self, other):
        # A method to merge another PostingList into this one, avoiding duplicates.
        i = 0  # Start index for the other PostingList.
        last = self._postings[-1] if self._postings else None  # The last Posting in the current list.
        while i < len(other._postings):
            current_posting = other._postings[i]
            # Check if there's a matching posting with the same document ID.
            if last is None or last._docID < current_posting._docID:
                # If no match is found, append the posting to the current list.
                self._postings.append(current_posting)
                last = current_posting
            elif last._docID == current_posting._docID:
                # If a match is found, add new positions to the existing posting.
                existing_posting = self._postings[-1]
                existing_posting._positions.extend(current_posting._positions)
            i += 1  # Move to the next posting in the other PostingList.

    def intersection(self, other):
        # A method to compute the intersection of this PostingList with another.
        intersection = []  # Start with an empty list for the intersection.
        i = 0  # Index for this PostingList.
        j = 0  # Index for the other PostingList.
        # Loop until one of the lists is exhausted.
        while (i < len(self._postings) and j < len(other._postings)):
            # If both postings are equal, add to the intersection.
            if (self._postings[i] == other._postings[j]):
                intersection.append(self._postings[i])
                i += 1
                j += 1
            # If the current posting is less, increment this list's index.
            elif (self._postings[i] < other._postings[j]):
                i += 1
            # If the other posting is less, increment the other list's index.
            else:
                j += 1
        return PostingList.from_posting_list(intersection)  # Return a new PostingList of the intersection.

    def union(self, other):
        # A method to compute the union of this PostingList with another.
        union = []  # Start with an empty list for the union.
        i = 0  # Index for this PostingList.
        j = 0  # Index for the other PostingList.
        # Loop until one of the lists is exhausted.
        while (i < len(self._postings) and j < len(other._postings)):
            # If both postings are equal, add to the union and increment both indexes.
            if (self._postings[i] == other._postings[j]):
                union.append(self._postings[i])
                i += 1
                j += 1
            # If the current posting is less, add it to the union and increment this list's index.
            elif (self._postings[i] < other._postings[j]):
                union.append(self._postings[i])
                i += 1
            # Otherwise, add the other posting to the union and increment the other list's index.
            else:
                union.append(other._postings[j])
                j += 1
        # Add any remaining postings from both lists to the union.
        for k in range(i, len(self._postings)):
            union.append(self._postings[k])
        for k in range(j, len(other._postings)):
            union.append(other._postings[k])
        return PostingList.from_posting_list(union)  # Return a new PostingList of the union.

    def difference(self, other):
        # A method to compute the difference of this PostingList with another.
        diff = []  # Start with an empty list for the difference.
        i = 0  # Index for this PostingList.
        j = 0  # Index for the other PostingList.
        # Loop until one of the lists is exhausted.
        while (i < len(self._postings) and j < len(other._postings)):
            # If both postings are equal, skip this posting and increment both indexes.
            if self._postings[i] == other._postings[j]:
                i += 1
                j += 1
            # If the current posting is less, add it to the difference and increment this list's index.
            elif self._postings[i] < other._postings[j]:
                diff.append(self._postings[i])
                i += 1
            # If the other posting is less, skip it and increment the other list's index.
            else:
                j += 1
        # Add any remaining postings from this list to the difference.
        for k in range(i, len(self._postings)):
            diff.append(self._postings[k])
        return PostingList.from_posting_list(diff)  # Return a new PostingList of the difference.

    def get_from_corpus(self, corpus):
        # A method to retrieve the contents of each Posting from a corpus.
        return list(map(lambda x: x.get_from_corpus(corpus), self._postings))  # Use map to apply the retrieval to each Posting.

    def __repr__(self):
        # Special method to provide the official string representation of the PostingList.
        return "/".join(map(str, self._postings))
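
The set operations above all follow the same two-pointer walk over sorted lists. The sketch below shows that walk on plain lists of document IDs; `intersect_sorted` and `difference_sorted` are hypothetical helper names, and the inputs are assumed to be sorted in ascending order without duplicates.

```python
def intersect_sorted(a, b):
    """Intersection of two ascending lists in O(len(a) + len(b))."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def difference_sorted(a, b):
    """Elements of ascending list a that do not appear in ascending list b."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            j += 1
    # Whatever is left in a cannot be in b.
    result.extend(a[i:])
    return result

x = [1, 3, 4, 8]
y = [3, 4, 5, 9]
print(intersect_sorted(x, y))   # [3, 4]
print(difference_sorted(x, y))  # [1, 8]
print(difference_sorted(y, x))  # [5, 9]
```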

### Terms

In [75]:
# Define a custom exception class for handling errors specific to merge operations.
class ImpossibleMergeError(Exception):
    pass

# The total_ordering decorator will automatically provide the other comparison methods based on __eq__ and __gt__.
@total_ordering
class Term:
    # A class that represents a term in a document, along with its posting list.

    def __init__(self, term, docID, position):
        # The initializer takes a term, a document ID and the position of the term in that document.
        self.term = term  # Public attribute to store the term.
        # Initialize posting_list for the term with a PostingList created from the given document ID.
        self.posting_list = PostingList.from_docID(docID,position)

    def merge(self, other):
        # A method to merge another Term's posting list into this one if they have the same term.
        if (self.term == other.term):
            # If terms match, merge the posting lists.
            self.posting_list.merge(other.posting_list)
        else:
            # If terms don't match, it's not possible to merge, so raise an exception.
            raise ImpossibleMergeError

    def __eq__(self, other):
        # Special method to check equality with another Term based on the term string.
        return self.term == other.term  # Comparison is done lexicographically.

    def __gt__(self, other):
        # Special method to determine if this Term is greater than another, based on the term string.
        return self.term > other.term  # Comparison is done lexicographically.

    def __repr__(self):
        # Special method to provide the official string representation of the Term.
        return f"{self.term}: {self.posting_list.__repr__()}"


### Positional Index

In [76]:
# Function to normalize text by removing punctuation, converting to lowercase.
def normalize(text):
    # Removes every character that is not a word character, whitespace, '^' or '-' (hyphens and carets are kept).
    no_punctuation = re.sub(r'[^\w\s^-]', '', text)
    # Converts the text to lowercase.
    downcase = no_punctuation.lower()
    # Returns the normalized text.
    return downcase

# Function to tokenize the description of a movie into individual words.
def tokenize(movie):
    # Normalize the movie description.
    text = normalize(movie.description)
    # Split the text into a list of tokens (words) and return it.
    return list(text.split())

# Define a class that represents a positional inverted index.
class PositionalIndex:

    def __init__(self):
        # Initialize the index with an empty list of Term objects (kept sorted by term).
        self._dictionary = []

    # Class method to create an inverted index from a corpus of documents.
    @classmethod
    def from_corpus(cls, corpus):
        # Create an intermediate dictionary to store terms and their postings.
        intermediate_dict = {}
        # Iterate over the documents in the corpus.
        for document in corpus:
            # Tokenize the document into individual words.
            tokens = tokenize(document)
            for index, token in enumerate(tokens):
                term = Term(token, document.docID, index)
                try:
                    # Try to merge the term with existing one in the intermediate dictionary.
                    intermediate_dict[token].merge(term)
                except KeyError:
                    # If the term is not already in the dictionary, add it.
                    intermediate_dict[token] = term
            # Print progress for every 1000 documents processed.
            if (document.docID % 1000 == 0):
                print("ID: " + str(document.docID))
        # Create a new PositionalIndex instance.
        idx = cls()
        # Sort the terms in the intermediate dictionary and store them in the index's dictionary.
        idx._dictionary = sorted(intermediate_dict.values(), key=lambda term: term.term)
        # Return the newly created inverted index.
        return idx

    # Method to merge indexes
    def merge(self, other):
        merged_dict = {}
        l1 = len(self._dictionary)
        l2 = len(other._dictionary)
        i=0
        j=0
        while i<l1 and j<l2:
            if self._dictionary[i].term == other._dictionary[j].term:
                # Terms match, merge their posting lists.
                self._dictionary[i].posting_list.merge(other._dictionary[j].posting_list)
                merged_dict[self._dictionary[i].term]=self._dictionary[i]
                i+=1
                j+=1
            elif self._dictionary[i].term < other._dictionary[j].term:
                # Term in dict1 is smaller, add it to the merged_dict.
                merged_dict[self._dictionary[i].term]=self._dictionary[i]
                i+=1
            else:
                # Term in dict2 is smaller, add it to the merged_dict.
                merged_dict[other._dictionary[j].term]=other._dictionary[j]
                j+=1
        # Add any remaining terms from dict1 or dict2.
        while i<l1:
            merged_dict[self._dictionary[i].term]=self._dictionary[i]
            i+=1
        while j<l2:
            merged_dict[other._dictionary[j].term]=other._dictionary[j]
            j+=1
        # Create a new PositionalIndex instance.
        idx = PositionalIndex()
        # Sort the terms in the merged dictionary and store them in the index's dictionary.
        idx._dictionary = sorted(merged_dict.values(), key=lambda term: term.term)
        # Return the newly created positional index.
        return idx

    # Special method to retrieve the Term object (and with it the posting list) for a given term string.
    def __getitem__(self, key):
        # Iterate over the terms in the dictionary.
        for term in self._dictionary:
            # If the term matches the key, return the Term object.
            if term.term == key:
                return term
        # If the term is not found, raise a KeyError that names the missing term.
        raise KeyError(key)

    # Special method to provide a string representation of the inverted index.
    def __repr__(self):
        # Returns a string indicating the number of terms in the dictionary.
        return "A dictionary with " + str(len(self._dictionary)) + " terms"
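
`__getitem__` above scans the dictionary linearly. Since `_dictionary` is kept sorted by term, a binary search is one possible alternative. The sketch below shows the idea on a plain sorted list of term strings; `find_term` is a hypothetical helper, not a method of `PositionalIndex`.

```python
from bisect import bisect_left

def find_term(sorted_terms, key):
    """Return the position of key in an ascending list of strings, or raise KeyError."""
    pos = bisect_left(sorted_terms, key)
    if pos < len(sorted_terms) and sorted_terms[pos] == key:
        return pos
    raise KeyError(key)

terms = ["cat", "dog", "mouse", "zebra"]
print(find_term(terms, "mouse"))  # 2
```

In the index itself, the returned position could be used to fetch the matching `Term` from a list kept in the same order.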


### Reading the Corpus
In this section, essential functions for processing the original text and converting it into documents are provided.

The **MovieDescription** class is used to represent movie objects, each identified by a unique ID, title, and description.

The **read_movie_descriptions** function transforms the corpus files into a list of **MovieDescription** objects. It accepts three parameters:

*   **percentage**: Determines the proportion of the original corpus to be read, influencing the number of documents in the final corpus returned by the function.
*   **start_percentage**: Specifies the initial point for reading documents, allowing users to start from the beginning of the provided text or any desired position.
*   **rename**: A boolean variable used to reset the document IDs. Typically, the function assigns IDs based on the order of reading. However, setting rename to True enables users to create a new system with a subset of documents from the main corpus, with their document IDs reset to start from 0.
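
The way `percentage` and `start_percentage` select a slice can be sketched on a plain list. The helper name `slice_by_percentage` is hypothetical, and both parameters are assumed to be fractions between 0 and 1.

```python
def slice_by_percentage(items, percentage, start_percentage):
    """Return the part of items that starts at start_percentage and spans percentage."""
    total = len(items)
    start = int(start_percentage * total)
    end = int((start_percentage + percentage) * total)
    return items[start:end]

docs = list(range(10))
print(slice_by_percentage(docs, 0.5, 0.0))  # [0, 1, 2, 3, 4]
print(slice_by_percentage(docs, 0.3, 0.5))  # [5, 6, 7]
print(slice_by_percentage(docs, 0.2, 0.8))  # [8, 9]
```

Because the boundaries are truncated with `int`, consecutive slices share their boundary index and do not overlap, provided the same fractions are used on both sides of each boundary.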



To successfully execute this notebook, you may need to make adjustments to the filenames, ensuring that they accurately specify the correct file paths.

In [77]:
# Define a class to hold the title and description of a movie.
class MovieDescription:

    def __init__(self, title, description, docID):
        # Constructor for the class that initializes the title and description attributes.
        self.docID = docID
        self.title = title
        self.description = description

    def __repr__(self):
        # Special method to provide the string representation of the MovieDescription object.
        # It returns the movie's title when the object is printed or shown in the interpreter.
        return self.title


# Define a function to read movie descriptions and titles from files.
def read_movie_descriptions(percentage, start_percentage, rename=False):
    # Names of the files containing plot summaries and metadata respectively.
    filename = '/content/gdrive/My Drive/Information retrieval/plot_summaries.txt' # Change the path to your specific path to the file
    movie_names_file = '/content/gdrive/My Drive/Information retrieval/movie.metadata.tsv' # Change the path to your specific path to the file
    # Open the movie metadata file and read all of its rows.
    with open(movie_names_file, 'r', encoding="utf8") as csv_file:
        # Create a csv.reader object to read the file with tab as the delimiter.
        movie_names = csv.reader(csv_file, delimiter='\t')
        # Initialize a dictionary to hold movie IDs and their corresponding titles.
        names_table = {}
        for name in movie_names:
            # Populate the dictionary with movie ID as key and title as value.
            names_table[name[0]] = name[2]
    # Open the file containing plot summaries and read it line by line.
    with open(filename, 'r', encoding="utf8") as csv_file:
        # Create a csv.reader object to read the file with tab as the delimiter.
        descriptions = csv.reader(csv_file, delimiter='\t')
        # Initialize a list to hold the corpus of movie descriptions.
        corpus = []
        docID= 0
        for desc in descriptions:
            try:
                # Create a MovieDescription object using the title from names_table and the description from the file.
                movie = MovieDescription(names_table[desc[0]], desc[1], docID)
                # Add the MovieDescription object to the corpus.
                corpus.append(movie)
                # Update the counter of read documents
                docID+=1
            except KeyError:
                # If the movie ID is not found in names_table, ignore this description.
                pass
        total_lines=len(corpus)
        corpus=corpus[int(start_percentage*total_lines):int((start_percentage+percentage)*total_lines) ]
        if(rename):
            for index, doc in enumerate(corpus):
                doc.docID= index
        # Return the populated list of MovieDescription objects.
        return corpus




### Putting it all together

In [112]:
import copy
# Define a class for an Information Retrieval (IR) system.
class IRsystem:

    def __init__(self, corpus, index, invalidation_vector):
        # Initialize the IR system with a corpus (collection of documents) and the inverted index.
        self._corpus = copy.deepcopy(corpus)  # The corpus of documents.
        self._index = index  # The main inverted index for the corpus.
        self._auxiliary_index = PositionalIndex() # The auxiliary index
        self._merged = True # Boolean variable that says if the indexes have been merged
        self._auxiliary_corpus = [] # The corpus of added documents
        self._invalidation_vector = invalidation_vector  # The invalidation vector.

    @classmethod
    def from_corpus(cls, corpus):
        # Class method to create an IR system instance from a given corpus.
        # It creates an inverted index from the corpus first.
        index = PositionalIndex.from_corpus(corpus)
        invalidation_vector= [True]*len(corpus)
        # Returns an instance of the IR system with the given corpus and created index.
        return cls(corpus, index, invalidation_vector)

    @classmethod
    def from_index_file(cls, filename, corpus):
        index = PositionalIndex()
        # Create an intermediate dictionary to store terms and their postings.
        intermediate_dict = {}
        #read the index from filename
        with open(filename, 'r', encoding='utf-8') as txtfile:
            # Read lines from the file
            lines = txtfile.readlines()
            # Get the first line
            first_row = lines[0].strip()
            # Convert characters back to booleans
            invalidation_vector = [bool(int(char)) for char in first_row]
            # Remove the first line
            lines = lines[1:]
            # Iterate over the remaining lines (starting from the second line)
            for line in lines:
                # Split the line into term and posting list
                term, posting_list_str = line.split(': ', 1)
                # Create Term object
                term = Term(term, 0, 0)
                # Split the posting list into individual postings
                posting_entries = posting_list_str.split('/')
                # Process each posting entry
                postings = []
                for entry in posting_entries:
                    # Extract DocID and Positions
                    docID_str, positions_str = entry.split(', Positions: ')
                    docID = int(docID_str.split(': ')[1])
                    positions = list(map(int, positions_str.split(',')))
                    # Create Posting object
                    posting = Posting(docID, positions)
                    # Add the Posting to the postings list
                    postings.append(posting)
                # Create PostingList object
                postings = PostingList.from_posting_list(postings)
                # Set the term's posting list
                term.posting_list = postings
                # Add the term to the index
                intermediate_dict[term.term] = term
        # Set the dictionary
        index._dictionary = sorted(intermediate_dict.values(), key=lambda term: term.term)
        # Returns an instance of the IR system with the given corpus and the loaded index.
        return cls(corpus, index, invalidation_vector)

    # Method to add documents to the auxiliary index
    def add_documents(self, corpus):
        # Loop through the list of documents
        for movie in corpus:
            # Check if the added document is new or if it is a document that was previously deleted
            if(movie.docID)>=len(self._corpus):
                self._corpus.append(MovieDescription(movie.title, movie.description, movie.docID))
                self._auxiliary_corpus.append(MovieDescription(movie.title, movie.description, movie.docID))
                self._invalidation_vector.append(True)
            else:
                self._invalidation_vector[movie.docID]= True
                self._corpus[movie.docID] = MovieDescription(movie.title, movie.description, movie.docID)
        # Rebuild the auxiliary index from every document added since the last merge,
        # so that earlier unmerged additions are not lost when add_documents is called again
        self._auxiliary_index = PositionalIndex.from_corpus(self._auxiliary_corpus)
        # The indexes count as merged only while no added document is waiting in the auxiliary index
        self._merged = len(self._auxiliary_corpus) == 0

    # Method to delete documents
    def delete_documents(self, corpus):
        for movie in corpus:
            self._invalidation_vector[movie.docID]=False
            self._corpus[movie.docID] = None

    # Method to merge main and auxiliary indexes
    def merge_indexes(self):
        self._index = self._index.merge(self._auxiliary_index)
        self._auxiliary_index = PositionalIndex()
        self._auxiliary_corpus = []
        self._merged = True

    # Method to retrieve valid corpus
    def get_valid_corpus(self):
        # Deleted documents are stored as None in the corpus, so keep only the remaining ones
        return [movie for movie in self._corpus if movie is not None]

    # Method to save the index
    def save_to_txt(self, filename):
        with open(filename, 'w', encoding='utf-8') as txtfile:
            # Convert boolean values to integers and join into a string
            invalidation_vector = ''.join(str(int(value)) for value in self._invalidation_vector)
            # Write the invalidation_vector in the first line
            txtfile.write(invalidation_vector+'\n')
            # Write data
            for term in self._index._dictionary:
                txtfile.write(term.__repr__() + '\n')

    # Method to combine postings lists according to the binary operator
    def combine_postings(self,stack):
        if not stack:
            return None
        result = None
        # Process the inner stacks recursively
        for i in range(len(stack)):
            if isinstance(stack[i], list):
                stack[i] = self.combine_postings(stack[i])
        # Combine the parsed expressions and operators
        result = stack[0]
        for i in range(1, len(stack), 2):
            if i + 1 < len(stack):
                operator, operand = stack[i], stack[i + 1]
                if operator == 'AND':
                    result = result.intersection(operand)
                elif operator == 'OR':
                    result = result.union(operand)
                elif operator == 'NOT':
                    result = result.difference(operand)
                else:
                    raise ValueError("Missing operator.")
            else:
                break
        return result


    # Method to organize the words/operations
    def parse_tokens(self, tokens, index):
        stack = []
        # Loop through the words
        i = 0
        while i < len(tokens):
            if tokens[i].upper() in ['AND', 'OR', 'NOT']:
                stack.append(tokens[i].upper())
                i += 1
            elif tokens[i] == '(':
                # Find the matching closing parenthesis, tracking the nesting depth
                depth = 1
                j = i + 1
                while j < len(tokens) and depth > 0:
                    if tokens[j] == '(':
                        depth += 1
                    elif tokens[j] == ')':
                        depth -= 1
                    j += 1
                if depth != 0:
                    raise ValueError("Unbalanced parentheses.")
                # Recursive call for the expression inside the parentheses
                stack.append(self.parse_tokens(tokens[i+1:j-1], index))
                i = j  # Skip to the expressions outside the parentheses
            elif tokens[i] == ')':
                # A closing parenthesis without a matching opening one
                raise ValueError("Unbalanced parentheses.")
            else:
                # Single term: a term missing from this index matches no document in it
                try:
                    term_postings = index[tokens[i]].posting_list
                except KeyError:
                    term_postings = PostingList()
                stack.append(term_postings)
                i += 1
        return stack


    # Method to answer queries with binary operations
    def answer_binary_query(self, words):
        # Use the parse_tokens function to get the posting lists of the terms from the main index and the operations
        tokens=self.parse_tokens(words, self._index)
        # Combine posting lists and operations
        postings = self.combine_postings(tokens)
        if not self._merged:
            # Use the parse_tokens function to get the posting lists of the terms from the auxiliary index and the operations
            tokens_aux = self.parse_tokens(words, self._auxiliary_index)
            # Combine posting lists and operations
            postings_aux = self.combine_postings(tokens_aux)
            # Combine the posting lists from both indexes.
            postings= postings.union(postings_aux)
        # Filter the postings list to exclude deleted documents
        valid_documents = [doc for doc in postings._postings if self._invalidation_vector[doc._docID]]
        # Return the list of documents from the corpus that satisfy the query.
        valid_documents = PostingList.from_posting_list(valid_documents)
        return valid_documents.get_from_corpus(self._corpus)


    # Method to retrieve the postings list of phrase queries, given the words and index to use
    def retrieve_phrase_posting_lists(self, words, index, proximity=1):
        # Normalize the words in the query to match the normalized index terms.
        norm_words = list(map(normalize, words))
        lw=len(norm_words)
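        # Retrieve the posting lists for each normalized word from the index.
        # A word missing from this index means the phrase cannot occur in it.
        try:
            postings = [index[w].posting_list for w in norm_words]
        except KeyError:
            return PostingList()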
        # Initialize the intersection list to the posting list of the first term
        intersection = PostingList.from_posting_list(copy.deepcopy(postings[0]._postings))
        # Loop through the words
        for k in range(0,lw-1):
            i=len(intersection._postings) - 1
            j=len(postings[k+1]._postings) - 1
            # Loop through the postings lists in reverse
            while (i >= 0 and j >= 0):
                # If both postings docIDs are equal, check the positions.
                if (intersection._postings[i]._docID == postings[k+1]._postings[j]._docID):
                    #Loop through the positions in reverse
                    ii=len(intersection._postings[i]._positions) - 1
                    jj=len(postings[k+1]._postings[j]._positions) - 1
                    while (ii >= 0 and jj >= 0):
                        if(intersection._postings[i]._positions[ii]+ proximity + k == postings[k+1]._postings[j]._positions[jj]):
                            # You can leave the posting with this docID and position in the intersection, so simply skip to the next iteration
                            ii-=1
                            jj-=1
                        elif (intersection._postings[i]._positions[ii] + proximity + k > postings[k+1]._postings[j]._positions[jj]):
                            # Remove the position from the intersection: every remaining position of the next word is too small to follow it
                            intersection._postings[i]._positions.pop(ii)
                            ii-=1
                        else:
                            jj-=1
                    while (ii >= 0):
                        # Remove the positions from the intersection as there are no matching positions for the subsequent words
                        intersection._postings[i]._positions.pop(ii)
                        ii -= 1
                    # Check if the document no longer has any valid positions and, if so, remove it from the list
                    if(len(intersection._postings[i]._positions)==0):
                        intersection._postings.pop(i)
                    i-=1
                    j-=1
                elif(intersection._postings[i]._docID > postings[k+1]._postings[j]._docID):
                    # Remove the document from the intersection as there are no matching documents for the subsequent words
                    intersection._postings.pop(i)
                    i -= 1
                else:
                    j -= 1
            while (i >= 0):
                # Remove the document from the intersection as there are no matching documents for the subsequent words
                intersection._postings.pop(i)
                i -= 1
        return intersection

    # Method to answer phrase queries
    def answer_phrase_query(self, words, proximity=1):
        # Use the retrieve_phrase_posting_lists function for the main index.
        postings = self.retrieve_phrase_posting_lists(words, self._index, proximity)
        # Check if the indexes have been merged.
        if not self._merged:
            # Use the retrieve_phrase_posting_lists function for the auxiliary index.
            postings_aux = self.retrieve_phrase_posting_lists(words, self._auxiliary_index, proximity)
            # Combine the posting lists from both indexes.
            postings = postings.union(postings_aux)
        # Filter the postings list to exclude deleted documents
        valid_documents = [doc for doc in postings._postings if self._invalidation_vector[doc._docID]]
        # Wrap the surviving postings in a PostingList and return the corresponding documents from the corpus.
        valid_documents = PostingList.from_posting_list(valid_documents)
        return valid_documents.get_from_corpus(self._corpus)





# Function to execute a query with binary operations against an IR system.
def binary_query(ir, text):
    # Split the text query into individual words.
    words = text.split()
    # Get the answer to the query using the IR system's answer_binary_query method.
    answer = ir.answer_binary_query(words)
    # Print out each movie that matches the query.
    for movie in answer:
        print(movie)


# Function to execute a phrase query against an IR system.
def phrase_query(ir, text):
    # Split the text query into individual words.
    words = text.split()
    # Get the answer to the query using the IR system's answer_phrase_query method.
    answer = ir.answer_phrase_query(words)
    # Print out each movie that matches the query.
    for movie in answer:
        print(movie)


## Testing the IR system
To test the system, we'll divide the original dataset into three parts: A, B, and C. Initially the index contains A and B; then C is added and B is removed, in two separate steps.

The dataset is split according to the following percentages:
*   A = 50%
*   B = 30%
*   C = 20%

To split the dataset according to these percentages we call the function **read_movie_descriptions** with the appropriate arguments, creating one list of **MovieDescription** objects per part: **corpus_a**, **corpus_b** and **corpus_c**.

We also create two additional lists, **corpus_b_only** and **corpus_c_only**, holding the same documents as **corpus_b** and **corpus_c** but with document IDs that start from 0. They are needed later to compare the results of the test queries. (No such adjustment is needed for the documents in A, whose IDs already start from 0.)

In [79]:
corpus_a = read_movie_descriptions(0.5,0)
corpus_b = read_movie_descriptions(0.3,0.5)
corpus_c = read_movie_descriptions(0.2,0.8)

corpus_b_only = read_movie_descriptions(0.3,0.5, True)
corpus_c_only = read_movie_descriptions(0.2,0.8, True)
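
The body of **read_movie_descriptions** is not shown in this section, so the following is only a minimal sketch of the kind of split described above. It assumes the arguments mean (fraction, offset, reset IDs); the name `split_by_fraction` and the toy documents are hypothetical.

```python
def split_by_fraction(items, fraction, offset=0.0, reset_ids=False):
    """Return (doc_id, item) pairs for the slice starting at `offset`
    and covering `fraction` of `items` (both expressed in [0, 1])."""
    n = len(items)
    # round() avoids off-by-one errors caused by floating-point noise
    start = round(n * offset)
    end = round(n * (offset + fraction))
    first_id = 0 if reset_ids else start
    return [(first_id + k, item) for k, item in enumerate(items[start:end])]


# Toy stand-in for the movie descriptions (hypothetical data)
docs = [f"doc{i}" for i in range(10)]

part_a = split_by_fraction(docs, 0.5, 0.0)
part_b = split_by_fraction(docs, 0.3, 0.5)
part_c = split_by_fraction(docs, 0.2, 0.8)
part_b_only = split_by_fraction(docs, 0.3, 0.5, reset_ids=True)
```

In this sketch `part_b` keeps the IDs the documents have in the full dataset, while `part_b_only` renumbers the same documents from 0, which mirrors the role of **corpus_b_only** above.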

We'll then create 4 separate IR systems.

*   **ir**: Containing the documents from A and B initially, where we will later add C and then remove B
*   **irA**: Containing only the documents in A
*   **irB**: Containing only the documents in B
*   **irC**: Containing only the documents in C




In [113]:
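import copy  # Needed for deepcopy; harmless if already imported earlier in the notebook

# Build the initial corpus (A followed by B) without modifying corpus_a
corpus = copy.deepcopy(corpus_a)
corpus.extend(corpus_b)
ir = IRsystem.from_corpus(corpus)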

ID: 0
ID: 1000
ID: 2000
ID: 3000
ID: 4000
ID: 5000
ID: 6000
ID: 7000
ID: 8000
ID: 9000
ID: 10000
ID: 11000
ID: 12000
ID: 13000
ID: 14000
ID: 15000
ID: 16000
ID: 17000
ID: 18000
ID: 19000
ID: 20000
ID: 21000
ID: 22000
ID: 23000
ID: 24000
ID: 25000
ID: 26000
ID: 27000
ID: 28000
ID: 29000
ID: 30000
ID: 31000
ID: 32000
ID: 33000


In [114]:
irA=IRsystem.from_corpus(corpus_a)

ID: 0
ID: 1000
ID: 2000
ID: 3000
ID: 4000
ID: 5000
ID: 6000
ID: 7000
ID: 8000
ID: 9000
ID: 10000
ID: 11000
ID: 12000
ID: 13000
ID: 14000
ID: 15000
ID: 16000
ID: 17000
ID: 18000
ID: 19000
ID: 20000
ID: 21000


In [115]:
irB=IRsystem.from_corpus(corpus_b_only)

ID: 0
ID: 1000
ID: 2000
ID: 3000
ID: 4000
ID: 5000
ID: 6000
ID: 7000
ID: 8000
ID: 9000
ID: 10000
ID: 11000
ID: 12000


In [116]:
irC=IRsystem.from_corpus(corpus_c_only)

ID: 0
ID: 1000
ID: 2000
ID: 3000
ID: 4000
ID: 5000
ID: 6000
ID: 7000
ID: 8000


### Initial configuration
Now that the IR systems have been created, we can run a first query against **ir**, **irA** and **irB**.

Since **ir** currently contains the documents of both A and B, we expect the result of the query in **ir** to be the union of the results of the same query in **irA** and **irB**.

In [117]:
binary_query(ir, "crocodiles AND chase")

Kronk's New Groove
Indiana Jones and the Temple of Doom
Kangaroo Jack
The Rescuers Down Under
Black Water
The Last Dragon
Shorts
The Chipmunk Adventure


In [118]:
binary_query(irA, "crocodiles AND chase")

Kronk's New Groove
Indiana Jones and the Temple of Doom
Kangaroo Jack
The Rescuers Down Under
Black Water
The Last Dragon
Shorts


In [119]:
binary_query(irB, "crocodiles AND chase")

The Chipmunk Adventure
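
The expectation above can also be checked programmatically rather than by eye. The sketch below simply copies the titles printed by the three queries into sets; the variable names are hypothetical, and in the notebook the titles would have to be collected from the query results, since `binary_query` only prints them.

```python
# Titles printed by binary_query(irA, ...) above
titles_a = {
    "Kronk's New Groove",
    "Indiana Jones and the Temple of Doom",
    "Kangaroo Jack",
    "The Rescuers Down Under",
    "Black Water",
    "The Last Dragon",
    "Shorts",
}
# Titles printed by binary_query(irB, ...) above
titles_b = {"The Chipmunk Adventure"}
# Titles printed by binary_query(ir, ...) above
titles_ab = titles_a | {"The Chipmunk Adventure"}

# The combined system should return exactly the union of the two parts
union_matches = titles_ab == (titles_a | titles_b)
# No title should be returned by both partial systems
parts_disjoint = titles_a.isdisjoint(titles_b)
```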


### Adding documents
Let us now add the documents in C to the general IR system **ir**.

In [120]:
ir.add_documents(corpus_c)

ID: 34000
ID: 35000
ID: 36000
ID: 37000
ID: 38000
ID: 39000
ID: 40000
ID: 41000
ID: 42000


Now that the documents have been added to the system, we can inspect the main and auxiliary indexes.

In [121]:
print(ir._index)

A dictionary with 170829 terms


In [122]:
print(ir._auxiliary_index)

A dictionary with 77236 terms


It is also possible to merge the auxiliary index into the main one. Let's try it and then inspect the indexes again.

In [123]:
ir.merge_indexes()

In [124]:
print(ir._index)

A dictionary with 194757 terms


In [125]:
print(ir._auxiliary_index)

A dictionary with 0 terms
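
The term counts above are consistent with a merge that takes the union of the two vocabularies. As a rough illustration only (this is not the notebook's `merge_indexes` implementation, and plain dictionaries of docID lists stand in for the real index classes), such a merge could look like this:

```python
def merge_toy_indexes(main, auxiliary):
    """Merge `auxiliary` into `main` in place and empty `auxiliary`.
    Both map a term to a sorted list of docIDs (a simplification)."""
    for term, postings in auxiliary.items():
        # Shared terms get the union of both posting lists
        main[term] = sorted(set(main.get(term, [])) | set(postings))
    auxiliary.clear()


main_index = {"crocodile": [0, 3], "chase": [1, 3]}
aux_index = {"chase": [5], "island": [5, 6]}

shared_terms = len(main_index.keys() & aux_index.keys())
total_before = len(main_index) + len(aux_index)

merge_toy_indexes(main_index, aux_index)
```

After the merge the main index holds `total_before - shared_terms` terms. Applying the same reasoning to the numbers printed above, and assuming no term is dropped during the merge, 170829 + 77236 - 194757 = 53308 terms appeared in both indexes.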


We can now run the same query as before in **ir** and in **irC** and inspect the results.

Since **ir** now contains the documents of A, B and C, we expect the result of the query in **ir** to be the union of the previous result and the result of the same query in **irC**.

In [126]:
binary_query(irC, "crocodiles AND chase")

Barbie as the Island Princess


In [127]:
binary_query(ir, "crocodiles AND chase")

Kronk's New Groove
Indiana Jones and the Temple of Doom
Kangaroo Jack
The Rescuers Down Under
Black Water
The Last Dragon
Shorts
The Chipmunk Adventure
Barbie as the Island Princess


### Deleting documents
Let us now delete the documents in B in the general IR system **ir**.

In [128]:
ir.delete_documents(corpus_b)

To check that the documents have been correctly deleted, we can run the usual query in **ir**.

Since **ir** now contains only the documents of A and C, we expect the result of the query in **ir** to be the previous result minus the result of the same query in **irB**.

In [129]:
binary_query(ir, "crocodiles AND chase")

Kronk's New Groove
Indiana Jones and the Temple of Doom
Kangaroo Jack
The Rescuers Down Under
Black Water
The Last Dragon
Shorts
Barbie as the Island Princess
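
The query methods filter their results through `self._invalidation_vector`, keeping a document only when its entry is truthy. The internals of `delete_documents` are not shown in this section, so the following is just a sketch of how deletion through such a vector might behave; the function names and docIDs are hypothetical.

```python
def mark_deleted(invalidation_vector, doc_ids):
    """Flag the given docIDs as deleted (False means 'no longer valid')."""
    for doc_id in doc_ids:
        invalidation_vector[doc_id] = False


def filter_valid(results, invalidation_vector):
    """Keep only the (doc_id, title) pairs whose document is still valid."""
    return [(d, t) for d, t in results if invalidation_vector[d]]


# Six documents, all valid at the start (toy docIDs, not the real ones)
valid = [True] * 6
results = [
    (0, "Black Water"),
    (3, "The Chipmunk Adventure"),
    (5, "Barbie as the Island Princess"),
]

# Pretend documents 2 and 3 form group B and delete them
mark_deleted(valid, [2, 3])
remaining = filter_valid(results, valid)
```

With this design the postings of deleted documents stay in the index and are simply hidden at query time, which is why the result shrinks without the index being rebuilt.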


### Saving and Loading the index
The index can be saved to a .txt file at any time with the following command:

In [97]:
filename='/content/gdrive/My Drive/Information retrieval/index.txt'
ir.save_to_txt(filename)

The index can also be loaded from a file. The user has to specify the name of the file containing the index and the corpus of documents the index refers to.

In [98]:
loaded_ir= IRsystem.from_index_file(filename, corpus)

### Phrase queries
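Phrase queries can also be performed with the provided IR system. A phrase query only returns the documents in which the query words occur together as a phrase, so its result is a subset of the result of the corresponding AND query, as the two cells below show.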

In [99]:
phrase_query(ir, "crocodiles chase")

The Rescuers Down Under


In [100]:
binary_query(ir, "crocodiles AND chase")

Kronk's New Groove
Indiana Jones and the Temple of Doom
Kangaroo Jack
The Rescuers Down Under
Black Water
The Last Dragon
Shorts
Barbie as the Island Princess
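
The gap between the two results comes from the positional information stored in the index. The toy example below is a simplified sketch, not the notebook's **PositionalIndex**: it assumes whitespace tokenisation, handles only two-word phrases, and all function names are hypothetical.

```python
from collections import defaultdict


def build_toy_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a list of strings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def toy_and_query(index, w1, w2):
    """Documents containing both words, anywhere in the text."""
    return sorted(index.get(w1, {}).keys() & index.get(w2, {}).keys())


def toy_phrase_query(index, w1, w2):
    """Documents where w2 occurs immediately after w1."""
    hits = []
    for doc_id in toy_and_query(index, w1, w2):
        next_positions = set(index[w2][doc_id])
        if any(p + 1 in next_positions for p in index[w1][doc_id]):
            hits.append(doc_id)
    return hits


toy_docs = [
    "the crocodiles chase the mouse",
    "they chase away the crocodiles",
    "a quiet island",
]
toy_index = build_toy_positional_index(toy_docs)
and_hits = toy_and_query(toy_index, "crocodiles", "chase")
phrase_hits = toy_phrase_query(toy_index, "crocodiles", "chase")
```

Here both of the first two documents satisfy the AND query, but only the first contains the two words in consecutive positions, so only it satisfies the phrase query.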
