# Boolean Retrieval

Build a term-document matrix for four different documents, run the given boolean queries against it, and create an inverted index.

Given documents:

In [1]:
files = ['cinderella.txt', 'jackandthebeanstalk.txt', 'thefarmerandthebadger.txt', 'theprincessandthepea.txt']

# Imports

In [2]:
import numpy as np
import pandas as pd
import re

# Text preprocessing

Check what the first file looks like:

In [3]:
# The context manager closes the file even if reading fails.
with open('data/cinderella.txt', 'r') as file:
    content = file.read()

content[0:1000]

"ONCE there was a gentleman who married, for his second wife, the proudest and most haughty woman that was ever seen. She had, by a former husband, two daughters of her own humor, who were, indeed, exactly like her in all things. He had likewise, by another wife, a young daughter, but of unparalleled goodness and sweetness of temper, which she took from her mother, who was the best creature in the world.\n\nNo sooner were the ceremonies of the wedding over but the mother-in-law began to show herself in her true colors. She could not bear the good qualities of this pretty girl, and the less because they made her own daughters appear the more odious. She employed her in the meanest work of the house: she scoured the dishes, tables, etc., and scrubbed madam's chamber, and those of misses, her daughters; she lay up in a sorry garret, upon a wretched straw bed, while her sisters lay in fine rooms, with floors all inlaid, upon beds of the very newest fashion, and where they had looking-glass

Text preprocessing steps:
convert the text to lowercase, then remove numbers, special characters, and repeated whitespace characters.

In [4]:
def get_preprocessed_text(text):
    """
    Preprocesses the given text.
    It converts the text to lowercase, removes numbers and special characters,
    and collapses repeated whitespace characters into a single space.
    
    Args:
        text (string): The text to convert.
    Returns:
        string: The text after preprocessing.
    """
    text = text.lower()
    text = re.sub(r'\d+', '', text)      # remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # remove special characters (punctuation, hyphens, apostrophes)
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace, including newlines
    
    return text

In [5]:
content = get_preprocessed_text(content)
content[0:1000]

'once there was a gentleman who married for his second wife the proudest and most haughty woman that was ever seen she had by a former husband two daughters of her own humor who were indeed exactly like her in all things he had likewise by another wife a young daughter but of unparalleled goodness and sweetness of temper which she took from her mother who was the best creature in the world no sooner were the ceremonies of the wedding over but the motherinlaw began to show herself in her true colors she could not bear the good qualities of this pretty girl and the less because they made her own daughters appear the more odious she employed her in the meanest work of the house she scoured the dishes tables etc and scrubbed madams chamber and those of misses her daughters she lay up in a sorry garret upon a wretched straw bed while her sisters lay in fine rooms with floors all inlaid upon beds of the very newest fashion and where they had lookingglasses so large that they might see themse
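To see what each substitution contributes, the steps can be applied one at a time to a short string. The following is a minimal sketch: the sample sentence and the helper name `preprocess_steps` are made up for illustration and are not part of the notebook. It also shows why the output above contains `motherinlaw` and `madams`: hyphens and apostrophes count as special characters, so they are dropped and the surrounding letters are joined.

```python
import re

def preprocess_steps(text):
    """Return the intermediate text after each preprocessing step (hypothetical helper)."""
    lowered = text.lower()
    no_digits = re.sub(r'\d+', '', lowered)
    no_special = re.sub(r'[^\w\s]', '', no_digits)
    collapsed = re.sub(r'\s+', ' ', no_special)
    return [lowered, no_digits, no_special, collapsed]

# Made-up sample sentence with a number, punctuation, a newline and a double space.
sample = "The  mother-in-law had 2 daughters;\nshe scrubbed madam's chamber."

for step in preprocess_steps(sample):
    print(repr(step))
```

The last printed line is the fully preprocessed text, with every word separated by a single space.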

## Create term-document matrix

In [6]:
def create_td_matrix(files_names_list):
    """
    Creates a binary term-document matrix based on the given files.
    
    Args:
        files_names_list (list[string]): The list of paths of the files to read.
    Returns:
        pd.DataFrame: The DataFrame with file names (without extension) as columns
                      and words as the index; 1 means the word occurs in the file.
    """
    words_in_docs = []  # [{word1, word2, ...}, {word3, word1, ...}, ...]
    for file_name in files_names_list:
        with open(file_name, 'r') as file:
            content = file.read()
        
        # Only presence matters here, so keep one set of words per document.
        words_in_docs.append(set(get_preprocessed_text(content).split()))

    # Sorted vocabulary over all documents: [word1, word2, word3, ...]
    unique_words = sorted(set().union(*words_in_docs))
    
    columns = [name[:-4] for name in files_names_list]  # strip the '.txt' extension
    dt_matrix = np.zeros(shape=(len(unique_words), len(columns)), dtype=np.int8)
    dt_matrix = pd.DataFrame(dt_matrix, columns=columns, index=unique_words)
    
    for column_name, words in zip(columns, words_in_docs):
        # .loc writes into the frame itself; chained indexing such as
        # dt_matrix[column_name][word] = 1 may assign to a temporary copy.
        dt_matrix.loc[sorted(words), column_name] = 1
                
    return dt_matrix

In [7]:
td_matrix = create_td_matrix(['data/' + file for file in files])

Head of term-document matrix:

In [8]:
td_matrix.head()

| term | data/cinderella | data/jackandthebeanstalk | data/thefarmerandthebadger | data/theprincessandthepea |
|---|---|---|---|---|
| a | 1 | 1 | 1 | 1 |
| able | 1 | 0 | 1 | 0 |
| about | 0 | 1 | 1 | 1 |
| above | 1 | 0 | 0 | 0 |
| abundantly | 1 | 0 | 0 | 0 |
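The same construction can be tried without the data files. The sketch below builds a binary term-document matrix from a small in-memory corpus; the document names, their texts, and the helper `build_td_matrix` are hypothetical and only illustrate the layout (terms as the index, documents as columns).

```python
import pandas as pd

# Hypothetical toy corpus: document name -> already preprocessed text.
toy_docs = {
    'doc_a': 'the badger is an animal',
    'doc_b': 'the princess is beautiful',
    'doc_c': 'a beautiful animal',
}

def build_td_matrix(docs):
    """Build a binary term-document matrix from a dict of name -> text."""
    # One set of words per document, since only presence matters.
    word_sets = {name: set(text.split()) for name, text in docs.items()}
    # Sorted vocabulary over all documents.
    vocabulary = sorted(set().union(*word_sets.values()))
    # One column per document: 1 if the term occurs in it, else 0.
    data = {name: [int(word in words) for word in vocabulary]
            for name, words in word_sets.items()}
    return pd.DataFrame(data, index=vocabulary, dtype='int8')

toy_matrix = build_td_matrix(toy_docs)
print(toy_matrix)
```

Each row of `toy_matrix` is the incidence vector of one term, which is exactly what the boolean queries operate on.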


## Perform boolean operations

### 1. animal AND beautiful

In [9]:
word1 = 'animal'
word2 = 'beautiful'

Show which texts contain the given words:

In [10]:
td_matrix.loc[word1]

data/cinderella               0
data/jackandthebeanstalk      0
data/thefarmerandthebadger    1
data/theprincessandthepea     0
Name: animal, dtype: int8

In [11]:
td_matrix.loc[word2]

data/cinderella               1
data/jackandthebeanstalk      1
data/thefarmerandthebadger    1
data/theprincessandthepea     0
Name: beautiful, dtype: int8

In [12]:
def word1_and_word2(dt_matrix, word1, word2):
    """
    Finds the files that contain both given words.
    
    Args:
        dt_matrix (pd.DataFrame): The term-document matrix.
        word1 (string): The first word from dt_matrix.
        word2 (string): The second word from dt_matrix.
    Returns:
        list[string]: The result of operation 'word1 AND word2' on dt_matrix.
                      The list of files.
    """
    # Element-wise AND of the two term rows; 1 marks a file containing both words.
    query_result = dt_matrix.loc[word1] & dt_matrix.loc[word2]

    # Keep the names of the files whose entry is 1.
    return list(query_result[query_result == 1].index)

In [13]:
print('Result of query: ',word1, ' AND ', word2, ' ->', word1_and_word2(td_matrix, word1, word2))

Result of query:  animal  AND  beautiful  -> ['data/thefarmerandthebadger']


### 2. badger AND NOT (animal OR country)

In [14]:
def query2(dt_matrix, word1, word2, word3):
    """
    Finds the files that contain word1 but neither word2 nor word3.
    
    Args:
        dt_matrix (pd.DataFrame): The term-document matrix.
        word1 (string): The first word from dt_matrix.
        word2 (string): The second word from dt_matrix.
        word3 (string): The third word from dt_matrix.
    Returns:
        list[string]: The result of operation 'word1 AND NOT (word2 OR word3)' on dt_matrix.
                      The list of files.
    """
    # word2 OR word3: 1 marks a file containing at least one of the two words.
    excluded = dt_matrix.loc[word2] | dt_matrix.loc[word3]
    
    # NOT on a 0/1 vector is 1 - x; the bitwise ~ would give -1 and -2 for integers.
    query_result = dt_matrix.loc[word1] & (1 - excluded)

    # Keep the names of the files whose entry is 1.
    return list(query_result[query_result == 1].index)

In [15]:
word1, word2, word3 = 'badger', 'animal', 'country'

Show which texts contain the given words:

In [16]:
td_matrix.loc[word1]

data/cinderella               0
data/jackandthebeanstalk      0
data/thefarmerandthebadger    1
data/theprincessandthepea     0
Name: badger, dtype: int8

In [17]:
td_matrix.loc[word2]

data/cinderella               0
data/jackandthebeanstalk      0
data/thefarmerandthebadger    1
data/theprincessandthepea     0
Name: animal, dtype: int8

In [18]:
td_matrix.loc[word3]

data/cinderella               1
data/jackandthebeanstalk      0
data/thefarmerandthebadger    0
data/theprincessandthepea     0
Name: country, dtype: int8

In [19]:
print('Result of query: ',word1, ' AND NOT (', word2, ' OR ', word3, ') -> ', query2(td_matrix, word1, word2, word3))

Result of query:  badger  AND NOT ( animal  OR  country ) ->  []
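The result is empty because the only document that contains `badger` also contains `animal`, so `NOT (animal OR country)` excludes it. The sketch below reproduces this with boolean rows, where `&`, `|` and `~` map directly onto AND, OR and NOT. The three rows are copied from the outputs above; the shortened column names and the helper `matching_docs` are assumptions made for illustration.

```python
import pandas as pd

# Incidence rows for three terms, copied from the outputs above and cast to bool.
toy = pd.DataFrame(
    {
        'cinderella': [0, 0, 1],
        'jackandthebeanstalk': [0, 0, 0],
        'thefarmerandthebadger': [1, 1, 0],
        'theprincessandthepea': [0, 0, 0],
    },
    index=['badger', 'animal', 'country'],
).astype(bool)

def matching_docs(mask):
    """Return the names of the documents where the boolean mask is True (hypothetical helper)."""
    return list(mask[mask].index)

badger, animal, country = toy.loc['badger'], toy.loc['animal'], toy.loc['country']

# badger AND NOT (animal OR country)
print(matching_docs(badger & ~(animal | country)))

# Dropping 'animal' from the exclusion brings the document back: badger AND NOT country
print(matching_docs(badger & ~country))
```

Working with boolean rows avoids the manual 0/1 flip, since `~` on a boolean Series is a logical negation.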


# Create Inverted Index

In [20]:
# For every term, collect the documents whose matrix entry is 1.
posting_lists = [list(td_matrix.columns[row.to_numpy() == 1])
                 for _, row in td_matrix.iterrows()]

# One row per term, holding its posting list.
inverted_index = pd.DataFrame({'posting_list': posting_lists}, index=td_matrix.index)

Show Inverted Index:

In [21]:
inverted_index.head()

| term | posting_list |
|---|---|
| a | [data/cinderella, data/jackandthebeanstalk, da... |
| able | [data/cinderella, data/thefarmerandthebadger] |
| about | [data/jackandthebeanstalk, data/thefarmerandth... |
| above | [data/cinderella] |
| abundantly | [data/cinderella] |
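With an inverted index, an AND query no longer needs the full matrix: it is enough to intersect the posting lists of the two terms. The sketch below uses a plain dictionary and the standard two-pointer merge over sorted lists. The posting lists for `animal` and `beautiful` are taken from the outputs shown earlier; the names `toy_index` and `intersect` are assumptions and do not appear in the notebook.

```python
def intersect(postings1, postings2):
    """Intersect two sorted posting lists with a linear two-pointer merge."""
    result = []
    i, j = 0, 0
    while i < len(postings1) and j < len(postings2):
        if postings1[i] == postings2[j]:
            # Document occurs in both lists: keep it and advance both pointers.
            result.append(postings1[i])
            i += 1
            j += 1
        elif postings1[i] < postings2[j]:
            i += 1
        else:
            j += 1
    return result

# Posting lists in sorted order, taken from the term rows shown earlier.
toy_index = {
    'animal': ['data/thefarmerandthebadger'],
    'beautiful': ['data/cinderella', 'data/jackandthebeanstalk',
                  'data/thefarmerandthebadger'],
}

# animal AND beautiful
print(intersect(toy_index['animal'], toy_index['beautiful']))
```

The merge relies on both lists being sorted in the same order, which holds here because the matrix columns are listed alphabetically.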
