# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.
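
Because the index built later in this exercise stores a set of document IDs per word, each Boolean operator can be read as a set operation. The snippet below is a minimal illustration on made-up data; the names `all_docs`, `has_love`, and `has_war` and the file names are hypothetical and do not come from the corpus.

```python
# Toy posting sets; the document names are invented for illustration.
all_docs = {"a.txt", "b.txt", "c.txt"}
has_love = {"a.txt", "b.txt"}
has_war = {"b.txt", "c.txt"}

love_and_war = has_love & has_war   # AND -> set intersection
love_or_war = has_love | has_war    # OR  -> set union
not_war = all_docs - has_war        # NOT -> complement within the whole corpus

print(sorted(love_and_war))  # ['b.txt']
print(sorted(love_or_war))   # ['a.txt', 'b.txt', 'c.txt']
print(sorted(not_war))       # ['a.txt']
```

Note that NOT needs the full set of document IDs, since "documents without a word" is only defined relative to the corpus.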

### Step 1: Data Preparation
Load and preprocess the documents as in the previous task so the data is clean and ready for Boolean querying. This step reads each `.txt` file from a specified directory, converts the text to lowercase for uniform processing, and stores the results in a dictionary keyed by filename.

In [1]:
import os

# Define the path to the directory containing the text files
CORPUS_DIR = '../data'
documents = {}
for filename in os.listdir(CORPUS_DIR):
    if filename.endswith('.txt'):
        file_path = os.path.join(CORPUS_DIR, filename)
        with open(file_path, 'r', encoding='utf-8') as file:
            documents[filename] = file.read().lower()  # Read and convert to lowercase

for doc, content in list(documents.items())[:6]:  # Preview up to six documents
    print(f'Document: {doc}\nContent: {content[:200]}...')  # Print first 200 characters


Document: datasource.txt
Content: go to https://www.gutenberg.org/browse/scores/top#books-last1
download all 100 books in txt
populate /data folder...
Document: pg100.txt
Content: ﻿the project gutenberg ebook of the complete works of william shakespeare
    
this ebook is for the use of anyone anywhere in the united states and
most other parts of the world at no cost and with a...
Document: pg1342.txt
Content: ﻿the project gutenberg ebook of pride and prejudice
    
this ebook is for the use of anyone anywhere in the united states and
most other parts of the world at no cost and with almost no restrictions
...
Document: pg145.txt
Content: ﻿the project gutenberg ebook of middlemarch
    
this ebook is for the use of anyone anywhere in the united states and
most other parts of the world at no cost and with almost no restrictions
whatsoev...
Document: pg1513.txt
Content: ﻿the project gutenberg ebook of romeo and juliet
    
this ebook is for the use of anyone anywhere in the united states
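
The loading loop above can also be wrapped in a reusable function, which is one way to satisfy the "write a function" wording of this step. This is a sketch, not the notebook's own code: the name `load_documents` is hypothetical, and the switch to the `utf-8-sig` codec is an assumption made because several Gutenberg files in the sample output begin with an invisible byte-order mark, which that codec drops.

```python
import os

def load_documents(corpus_dir):
    """Return {filename: lowercased text} for every .txt file in corpus_dir."""
    documents = {}
    for filename in sorted(os.listdir(corpus_dir)):  # sorted for a stable order
        if filename.endswith('.txt'):
            file_path = os.path.join(corpus_dir, filename)
            # 'utf-8-sig' decodes like UTF-8 but skips a leading byte-order mark
            with open(file_path, 'r', encoding='utf-8-sig') as file:
                documents[filename] = file.read().lower()
    return documents
```

With the directory used in this notebook, the call would be `documents = load_documents('../data')`.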

### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears, so looking up a term is a single dictionary access instead of a scan over every document.

In [5]:
import re

inverted_index = {}
for doc_id, content in documents.items():
    words = re.findall(r'\w+', content)  # Extract words
    for word in words:
        if word not in inverted_index:
            inverted_index[word] = set()
        inverted_index[word].add(doc_id)

# Display a sample of the inverted index
for word, docs in list(inverted_index.items())[:50]:  # Display the first 50 entries
    print(f'Word: {word}\nDocuments: {docs}')


Word: go
Documents: {'pg84.txt', 'pg1513.txt', 'pg42538.txt', 'pg2641.txt', 'pg67979.txt', 'pg37106.txt', 'pg16389.txt', 'pg45368.txt', 'pg47948.txt', 'pg100.txt', 'pg69093.txt', 'pg1342.txt', 'pg2701.txt', 'datasource.txt', 'pg145.txt'}
Word: to
Documents: {'pg84.txt', 'pg1513.txt', 'pg42538.txt', 'pg2641.txt', 'pg67979.txt', 'pg37106.txt', 'pg16389.txt', 'pg45368.txt', 'pg47948.txt', 'pg100.txt', 'pg69093.txt', 'pg1342.txt', 'pg2701.txt', 'datasource.txt', 'pg145.txt'}
Word: https
Documents: {'datasource.txt'}
Word: www
Documents: {'pg84.txt', 'pg1513.txt', 'pg42538.txt', 'pg2641.txt', 'pg67979.txt', 'pg37106.txt', 'pg16389.txt', 'pg45368.txt', 'pg47948.txt', 'pg100.txt', 'pg69093.txt', 'pg1342.txt', 'pg2701.txt', 'datasource.txt', 'pg145.txt'}
Word: gutenberg
Documents: {'pg84.txt', 'pg1513.txt', 'pg42538.txt', 'pg2641.txt', 'pg67979.txt', 'pg37106.txt', 'pg16389.txt', 'pg45368.txt', 'pg47948.txt', 'pg100.txt', 'pg69093.txt', 'pg1342.txt', 'pg2701.txt', 'datasource.txt', 'pg145.txt'
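
To make the structure easier to see than in the full-corpus output, here is the same indexing logic applied to a two-document toy corpus. The helper name `build_inverted_index` and the toy documents are hypothetical; `collections.defaultdict` is used only to avoid the explicit membership check.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, content in documents.items():
        for word in re.findall(r'\w+', content):  # same tokenization as above
            index[word].add(doc_id)
    return dict(index)  # plain dict, so missing words are not created on lookup

# Invented toy corpus, already lowercased
toy_documents = {
    'd1.txt': 'the cat sat',
    'd2.txt': 'the dog sat',
}
toy_index = build_inverted_index(toy_documents)
print(sorted(toy_index['the']))  # ['d1.txt', 'd2.txt']
print(sorted(toy_index['cat']))  # ['d1.txt']
```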

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve the documents that satisfy the Boolean expression. The result is an unordered set of matching document IDs.

In [9]:
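def process_query(query, inverted_index):
    # Baseline: single-term lookup only; AND, OR, and NOT are not handled here.
    # Lowercase the term because the index was built from lowercased text.
    return inverted_index.get(query.strip().lower(), set())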

# Test the query processor
query = "question"
results = process_query(query, inverted_index)
print(f"Results for the query '{query}': {results}")


Results for the query 'question': {'pg84.txt', 'pg1513.txt', 'pg42538.txt', 'pg2641.txt', 'pg67979.txt', 'pg37106.txt', 'pg16389.txt', 'pg45368.txt', 'pg47948.txt', 'pg100.txt', 'pg69093.txt', 'pg1342.txt', 'pg2701.txt', 'pg145.txt'}
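
The cell above looks up a single term. One possible way to extend it to AND, OR, and NOT is a small recursive-descent evaluator over the posting sets, sketched below. Everything here is an assumption rather than the exercise's reference solution: the function name `boolean_search`, the `all_docs` parameter (needed to compute NOT), the precedence NOT > AND > OR, the support for parentheses, and the rule that operators must be written in uppercase so that lowercase "and", "or", and "not" remain searchable words.

```python
import re

def boolean_search(query, inverted_index, all_docs):
    """Evaluate a query with AND, OR, NOT and parentheses; return a set of doc IDs."""
    # Tokens are parentheses or words; operators are recognized in uppercase only.
    tokens = re.findall(r'\(|\)|\w+', query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():                      # lowest precedence
        result = parse_and()
        while peek() == 'OR':
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == 'AND':
            advance()
            result = result & parse_not()
        return result

    def parse_not():                     # highest precedence, may be repeated
        if peek() == 'NOT':
            advance()
            return all_docs - parse_not()
        return parse_term()

    def parse_term():
        token = peek()
        if token is None or token in ('AND', 'OR', ')'):
            raise ValueError(f'Unexpected token in query: {token!r}')
        advance()
        if token == '(':
            result = parse_or()
            if peek() != ')':
                raise ValueError('Missing closing parenthesis')
            advance()
            return result
        # Plain term: copy the posting set so callers cannot mutate the index
        return set(inverted_index.get(token.lower(), set()))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f'Unexpected token in query: {tokens[pos]!r}')
    return result

# Invented toy index for demonstration
toy_index = {
    'love': {'d1.txt', 'd2.txt'},
    'war': {'d2.txt', 'd3.txt'},
    'peace': {'d3.txt'},
}
toy_docs = {'d1.txt', 'd2.txt', 'd3.txt'}

print(sorted(boolean_search('love AND war', toy_index, toy_docs)))         # ['d2.txt']
print(sorted(boolean_search('love OR peace', toy_index, toy_docs)))        # ['d1.txt', 'd2.txt', 'd3.txt']
print(sorted(boolean_search('war AND NOT peace', toy_index, toy_docs)))    # ['d2.txt']
print(sorted(boolean_search('NOT (love OR peace)', toy_index, toy_docs)))  # []
```

On the notebook's own data the call would look like `boolean_search('love AND NOT war', inverted_index, set(documents))`, where `set(documents)` supplies every filename as the universe for NOT.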


### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria, and print a clear message when a query matches no documents.

In [11]:
# Try different queries
query = input("Enter a Boolean query (e.g., 'books'): ")
results = process_query(query, inverted_index)
if results:
    print("Documents found:", results)
else:
    print("No documents found.")

Documents found: {'pg84.txt', 'pg1513.txt', 'pg42538.txt', 'pg2641.txt', 'pg67979.txt', 'pg37106.txt', 'pg16389.txt', 'pg45368.txt', 'pg47948.txt', 'pg100.txt', 'pg69093.txt', 'pg1342.txt', 'pg2701.txt', 'datasource.txt', 'pg145.txt'}
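
Printing a raw set gives an arbitrary order that can change between runs. A small formatting helper, sketched below, sorts the matches and covers the empty case in one place. The name `format_results` and the message wording are hypothetical choices, not part of the original exercise.

```python
def format_results(query, results):
    """Return a printable report for a query and its matching document IDs."""
    if not results:
        return f"No documents found for '{query}'."
    lines = [f"{len(results)} document(s) found for '{query}':"]
    lines.extend(f"  {doc_id}" for doc_id in sorted(results))  # stable order
    return "\n".join(lines)

print(format_results("books", {"pg84.txt", "pg100.txt"}))
print(format_results("zzzz", set()))
```

Sorting is lexicographic on the filename, so `pg100.txt` is listed before `pg84.txt`.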
