# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.
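
As a quick illustration of why sets suit this task, the three operators map directly onto Python set operations. The document names and posting sets below are made up for the example, and `all_docs` stands for the full collection, which NOT needs as its reference universe.

```python
# Hypothetical posting sets: the documents in which each term appears
docs_with_ghost = {"a.txt", "b.txt"}
docs_with_christmas = {"b.txt", "c.txt"}
all_docs = {"a.txt", "b.txt", "c.txt", "d.txt"}  # assumed full collection

both = docs_with_ghost & docs_with_christmas    # ghost AND christmas
either = docs_with_ghost | docs_with_christmas  # ghost OR christmas
not_ghost = all_docs - docs_with_ghost          # NOT ghost

print(sorted(both))       # ['b.txt']
print(sorted(either))     # ['a.txt', 'b.txt', 'c.txt']
print(sorted(not_ghost))  # ['c.txt', 'd.txt']
```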

### Step 1: Data Preparation
Load and preprocess the text documents from a specified directory: read each file, convert the text to lowercase for uniform processing, and store the results in a dictionary keyed by filename. As in the previous task, the data should be clean and ready for advanced querying.

In [1]:
import os

# Define the path to the directory containing the text files
CORPUS_DIR = '../data'
documents = {}
for filename in os.listdir(CORPUS_DIR):
    if filename.endswith('.txt'):
        file_path = os.path.join(CORPUS_DIR, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            documents[filename] = file.read().lower()  # Read and convert to lowercase

for doc, content in list(documents.items())[:6]:  # Display the first six documents
    print(f'Document: {doc}\nContent: {content[:200]}...')  # Print first 200 characters


Document: A Christmas Carol in Prose Being a Ghost Story of Christmas.txt
Content: 





<!doctype html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true"
  >




  <hea...
Document: A Doll's House.txt
Content: 





<!doctype html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true"
  >




  <hea...
Document: A Modest Proposal.txt
Content: 





<!doctype html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true"
  >




  <hea...
Document: A Room with a View.txt
Content: 





<!doctype html>
<html
  lang="en"
  
  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"
  data-a11y-animated-images="system" data-a11y-link-underlines="true
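
The previews above begin with `<!doctype html>`, which suggests the files contain web-page markup rather than plain text. If that is the case, markup tokens such as `doctype` would end up in the index. The sketch below shows one way to strip tags with the standard library before indexing. The names `_TextExtractor` and `strip_html` are hypothetical helpers, not part of the original notebook, and the approach assumes that discarding the contents of `<script>` and `<style>` is acceptable.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes while skipping <script> and <style> bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces and collapse runs of whitespace.
        # Note: a word split by inline tags would be separated by a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


sample = "<html><head><style>p{color:red}</style></head><body><p>Hello <b>World</b></p></body></html>"
print(strip_html(sample))  # Hello World
```

In the loading loop, this could be applied as `strip_html(file.read()).lower()`, assuming the files really are HTML.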

### Step 2: Create an Inverted Index

Create an inverted index from the documents. The index maps each word to the set of document IDs (here, filenames) in which that word appears, so looking up a term is a single dictionary access rather than a scan of every document.

In [2]:
import re

inverted_index = {}
for doc_id, content in documents.items():
    words = re.findall(r'\w+', content)  # Extract words
    for word in words:
        if word not in inverted_index:
            inverted_index[word] = set()
        inverted_index[word].add(doc_id)

# Display a sample of the inverted index
for word, docs in list(inverted_index.items())[:50]:  # Display the first 50 entries
    print(f'Word: {word}\nDocuments: {docs}')


Word: doctype
Documents: {'Frankenstein.txt', 'Little Women; Or, Meg, Jo, Beth, and Amy.txt', 'The Pleasures of the Table.txt', 'The modern Prometheus.txt', 'The Adventures of Ferdinand Count Fathom — Complete.txt', 'Metamorphosis.txt', 'Crime and Punishment.txt', 'The Modern Regime, Volume 1.txt', 'Magazine of western history.txt', 'The star-stealers.txt', 'The Scarlet Letter.txt', 'The Souls of Black Folk.txt', 'Jane Eyre- An Autobiography.txt', 'Biographical Anecdotes of William Hogarth, With a Catalogue of His Works.txt', 'he Adventures of Sherlock Holmes.txt', 'Chronicles of London Bridge.txt', 'The Yellow Wallpaper.txt', 'A Modest Proposal.txt', 'The Blue Castle- a novel.txt', 'Little Women.txt', 'The Picture of Dorian Gray.txt', "Childe Harold's Pilgrimage.txt", 'Pygmalion.txt', 'The Works of the Rev.txt', 'The Romance of Lust A classic Victorian erotic novel.txt', 'Leviathan.txt', "Grimms' Fairy Tales.txt", 'Ang  Filibusterismo  (Karugtóng ng Noli Me Tangere).txt', 'The Metamor
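
Because the output on the full corpus is long, the structure of the index is easier to see on a toy corpus. The sketch below wraps the same logic in a function; `build_inverted_index` and `toy_documents` are illustrative names with made-up contents, not part of the original notebook.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, content in documents.items():
        for word in re.findall(r"\w+", content.lower()):
            index[word].add(doc_id)
    return dict(index)


# Made-up corpus used only for illustration
toy_documents = {
    "d1.txt": "the ghost of christmas past",
    "d2.txt": "a ghost story",
    "d3.txt": "christmas dinner",
}
toy_index = build_inverted_index(toy_documents)

print(sorted(toy_index["ghost"]))      # ['d1.txt', 'd2.txt']
print(sorted(toy_index["christmas"]))  # ['d1.txt', 'd3.txt']
```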

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve the documents that satisfy the Boolean expression.

In [3]:
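def process_query(query, inverted_index):
    # Normalize the term to match the lowercased index; return an empty set if it is absent
    return inverted_index.get(query.strip().lower(), set())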

# Test the query processor
query = "question"
results = process_query(query, inverted_index)
print(f"Results for the query '{query}': {results}")


Results for the query 'question': {'Frankenstein.txt', 'pg16389.txt', 'Little Women; Or, Meg, Jo, Beth, and Amy.txt', 'The Pleasures of the Table.txt', 'The modern Prometheus.txt', 'The Adventures of Ferdinand Count Fathom — Complete.txt', 'Metamorphosis.txt', 'Crime and Punishment.txt', 'The Modern Regime, Volume 1.txt', 'Magazine of western history.txt', 'The star-stealers.txt', 'The Scarlet Letter.txt', 'The Souls of Black Folk.txt', 'Jane Eyre- An Autobiography.txt', 'Biographical Anecdotes of William Hogarth, With a Catalogue of His Works.txt', 'he Adventures of Sherlock Holmes.txt', 'Chronicles of London Bridge.txt', 'A Modest Proposal.txt', 'pg2641.txt', 'The Blue Castle- a novel.txt', 'Little Women.txt', 'The Picture of Dorian Gray.txt', "Childe Harold's Pilgrimage.txt", 'Pygmalion.txt', 'pg67979.txt', 'The Romance of Lust A classic Victorian erotic novel.txt', 'Leviathan.txt', 'The Metamorphoses of Ovid.txt', 'Twenty years after.txt', 'Peter Pan.txt', 'Memoirs of a London doll
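
The cell above looks up a single term only. As one possible way to cover both bullets of this step, the sketch below tokenizes a query and evaluates it with a small recursive-descent parser over sets. The names `tokenize`, `boolean_search`, and `all_docs` are illustrative and not part of the original notebook. The sketch assumes that operators are written in uppercase, that NOT binds tighter than AND, which binds tighter than OR, and that parentheses may be used for grouping.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}


def tokenize(query):
    """Split a query into parentheses, uppercase operators, and lowercase terms."""
    tokens = []
    for raw in re.findall(r"\(|\)|\w+", query):
        tokens.append(raw if raw in OPERATORS else raw.lower())
    return tokens


def boolean_search(query, inverted_index, all_docs=None):
    """Evaluate a Boolean query and return the set of matching document IDs."""
    if all_docs is None:
        # Assumption: every document contributes at least one word to the index
        all_docs = set().union(*inverted_index.values())
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "OR":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "AND":
            advance()
            result = result & parse_not()
        return result

    def parse_not():
        if peek() == "NOT":
            advance()
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        token = advance()
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return result
        if token is None or token in OPERATORS or token == ")":
            raise ValueError(f"unexpected token: {token!r}")
        return set(inverted_index.get(token, set()))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result


# Made-up index used only for illustration
demo_index = {
    "cat": {"a.txt", "b.txt"},
    "dog": {"b.txt", "c.txt"},
    "fish": {"c.txt"},
}
print(sorted(boolean_search("cat AND dog", demo_index)))      # ['b.txt']
print(sorted(boolean_search("cat AND NOT dog", demo_index)))  # ['a.txt']
```

With the notebook's data, a call might look like `boolean_search("ghost AND NOT christmas", inverted_index, set(documents))`. Passing the document names explicitly keeps NOT correct even for files that contributed no words to the index. Two adjacent terms without an operator raise a `ValueError` in this sketch rather than being treated as an implicit AND.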

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria, and handle queries that match no documents.

In [4]:
# Try different queries
query = input("Enter a Boolean query (e.g., 'books'): ")
results = process_query(query, inverted_index)
if results:
    print("Documents found:", results)
else:
    print("No documents found.")

Documents found: {'Frankenstein.txt', 'pg16389.txt', 'Little Women; Or, Meg, Jo, Beth, and Amy.txt', 'The Pleasures of the Table.txt', 'The modern Prometheus.txt', 'The Adventures of Ferdinand Count Fathom — Complete.txt', 'Crime and Punishment.txt', 'The Modern Regime, Volume 1.txt', 'Magazine of western history.txt', 'The Scarlet Letter.txt', 'The Souls of Black Folk.txt', 'Jane Eyre- An Autobiography.txt', 'Biographical Anecdotes of William Hogarth, With a Catalogue of His Works.txt', 'he Adventures of Sherlock Holmes.txt', 'Chronicles of London Bridge.txt', 'pg2641.txt', 'The Blue Castle- a novel.txt', 'Little Women.txt', 'The Picture of Dorian Gray.txt', 'Pygmalion.txt', 'pg67979.txt', 'The Romance of Lust A classic Victorian erotic novel.txt', 'Leviathan.txt', 'The Metamorphoses of Ovid.txt', 'Twenty years after.txt', 'Peter Pan.txt', 'Memoirs of a London doll.txt', 'The Kama Sutra of Vatsyayana.txt', 'A Christmas Carol in Prose Being a Ghost Story of Christmas.txt', 'Wuthering H
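
Printing a raw set gives a long line in arbitrary order, as the output above shows. A minimal sketch of a more readable display follows. `format_results` and its `limit` parameter are hypothetical and not part of the original notebook.

```python
def format_results(query, results, limit=10):
    """Return a readable summary of the matching documents, sorted by name."""
    if not results:
        return f"No documents found for '{query}'."
    names = sorted(results)
    lines = [f"{len(names)} document(s) found for '{query}':"]
    lines += [f"  - {name}" for name in names[:limit]]
    if len(names) > limit:
        lines.append(f"  ... and {len(names) - limit} more")
    return "\n".join(lines)


print(format_results("cat", {"b.txt", "a.txt"}))
print(format_results("unicorn", set()))
```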