# Boolean retrieval

This notebook shows an example of Boolean retrieval on the <code>country_dataset</code>. The dataset consists of a set of queries of the form:
<pre>
<code>
  c: [d_0, d_1, ..., d_n]  
</code>
</pre>
where <code>c</code> is the name of a country and <code>[d_0, d_1, ..., d_n]</code> is the list of the document ids that are relevant to <code>c</code>. Document texts are given in the <code>docs</code> list. Document ids correspond to the position of documents in the <code>docs</code> list.
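As a minimal sketch of this layout, the toy dictionary below has the same two keys (<code>docs</code> and <code>queries</code>) that the notebook reads from the JSON file. The texts and ids are invented for illustration and are not taken from the dataset.

```python
# Hypothetical toy dataset with the same shape as the one described above
toy_dataset = {
    "docs": [
        "Rome is the capital of Italy.",       # document id 0
        "Paris is the capital of France.",     # document id 1
        "Milan is a city in northern Italy.",  # document id 2
    ],
    "queries": {
        "Italy": [0, 2],   # ids of the documents relevant to "Italy"
        "France": [1],
    },
}

# A document id is simply an index into the docs list
relevant_texts = [toy_dataset["docs"][i] for i in toy_dataset["queries"]["Italy"]]
print(relevant_texts)
```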

In [1]:
import json

In [2]:
dataset_file = '../data/country_dataset.json'
with open(dataset_file, 'r') as infile:
    dataset = json.load(infile)

In [3]:
docs = dataset['docs']
queries = dataset['queries']

In [4]:
print(docs[10])
print(docs[16])

The group was presented to the Prince of Wales, later King Charles I, in 1623 while he was in Spain negotiating a marriage contract, and it soon became the most famous Italian sculpture in England.

Once the abbot went to Italy, the four brothers decided to celebrate, yet they needed some money.



## Get a tokenizer

In [5]:
from nltk.tokenize import TweetTokenizer

In [6]:
tk = TweetTokenizer()

## Boolean indexing

In [7]:
from collections import defaultdict

In [8]:
I = defaultdict(set)

In [9]:
for i, doc in enumerate(docs):
    for token in tk.tokenize(doc):
        I[token.lower()].add(i)
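To make the structure of the index concrete, here is a small sketch that applies the same loop to three invented documents. It assumes whitespace splitting via <code>str.split</code> as a stand-in for the <code>TweetTokenizer</code>, so that the example runs without NLTK. The names <code>toy_docs</code> and <code>toy_index</code> are illustrative only.

```python
from collections import defaultdict

# Invented documents, not taken from the dataset
toy_docs = [
    "China borders India",
    "India exports tea",
    "Tea is popular in China",
]

toy_index = defaultdict(set)
for doc_id, text in enumerate(toy_docs):
    # str.split stands in for the tokenizer used in the notebook (assumption)
    for token in text.split():
        toy_index[token.lower()].add(doc_id)

# Each token maps to the set of ids of the documents that contain it
print(sorted(toy_index["china"]))  # [0, 2]
print(sorted(toy_index["india"]))  # [0, 1]
print(sorted(toy_index["tea"]))    # [1, 2]
```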

## Query processing
An <code>OR</code> query returns every document that contains at least one of the query tokens, so we implement it as the set union of the tokens' document sets.

In [10]:
q = list(queries.keys())[8]

In [11]:
q

'China'

In [12]:
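def or_query(q, index):
    # Tokenize and lowercase the query exactly as the documents were indexed
    qs = [t.lower() for t in tk.tokenize(q)]
    answers = set()
    for s in qs:
        # .get avoids adding empty entries to the defaultdict for unseen tokens
        answers |= index.get(s, set())
    return answers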

In [13]:
results = or_query(q, I)

In [14]:
for docid in results:
    print(docs[docid])

The women's snowboard halfpipe competition at the 2007 Asian Winter Games in Changchun, China was held on 29 January at the Beida Lake Skiing Resort.

A cupboard in France is probably different from a cupboard in Germany, or in China, or in England etc.

Wuzhou or Wu Prefecture was a zhou (prefecture) in imperial China.

The toli shad or Chinese herring (Tenualosa toli) is a fish of the Clupeidae family, a species of shad distributed in the western Indian Ocean and the Bay of Bengal to the Java Sea and the South China Sea.

Heisch was born June 10, 1872 in Latendorf, Germany, and after entering the marine corps he was sent as a private to China to fight in the Boxer Rebellion.

Lipulekh (also known as Tri-Corner) is a Himalayan pass between Nepal, India and China connecting the North Western Cornered Byash Valley of Nepal and Indian State of Uttarakhand with the old trading town of Taklakot (Purang) in Tibet and belongs to Nepal.

In the United Kingdom, United States, India, and Brazil

### Exercise: implement <code>AND</code> boolean queries

In [None]:
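def and_query(q, index):
    qs = [t.lower() for t in tk.tokenize(q)]
    if not qs:
        # An empty query matches nothing
        return set(), qs
    # Copy the first document set so the index itself is never mutated
    answers = set(index.get(qs[0], set()))
    for s in qs[1:]:
        answers &= index.get(s, set())
    return answers, qs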

In [None]:
and_query("China India", I)
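The difference between the two query types can be checked by hand on a tiny index. The sketch below is not the notebook's implementation: it takes pre-tokenized queries and a hand-written index, and <code>toy_or</code> and <code>toy_and</code> are hypothetical helper names. Starting the intersection from the smallest document set is a common optimization, added here as an assumption rather than something the exercise requires.

```python
# Hand-written toy index: token -> set of document ids (invented data)
toy_index = {
    "china": {0, 2},
    "india": {0, 1},
    "tea": {1, 2},
}

def toy_or(tokens, index):
    """Documents containing at least one of the tokens."""
    result = set()
    for t in tokens:
        result |= index.get(t, set())
    return result

def toy_and(tokens, index):
    """Documents containing all of the tokens."""
    postings = [index.get(t, set()) for t in tokens]
    if not postings:
        return set()
    # Intersect starting from the smallest set to keep intermediate results small
    postings.sort(key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

print(sorted(toy_or(["china", "india"], toy_index)))    # [0, 1, 2]
print(sorted(toy_and(["china", "india"], toy_index)))   # [0]
print(sorted(toy_and(["china", "brazil"], toy_index)))  # []
```

A token that is missing from the index makes the whole <code>AND</code> query empty, while an <code>OR</code> query simply ignores it.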

## Precision and recall

In [15]:
import numpy as np
import pandas as pd

In [16]:
ground_truth = queries['United States of America']

In [17]:
result = or_query('United States of America', I)

In [18]:
TP = set(ground_truth).intersection(result)

In [19]:
FP = [x for x in result if x not in ground_truth]
FN = [x for x in ground_truth if x not in result]

In [20]:
precision = len(TP) / (len(TP) + len(FP))
recall = len(TP) / (len(TP) + len(FN))

In [21]:
print(precision, recall, (2*precision*recall)/(precision + recall))

0.16358024691358025 0.7910447761194029 0.2710997442455243
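The three numbers printed above are precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall). The same computation can be wrapped in a small function; <code>precision_recall_f1</code> is a hypothetical helper name, and the ids in the usage example are invented.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant documents that were retrieved
    # Precision is undefined when nothing is retrieved, recall when nothing is relevant
    precision = tp / len(retrieved) if retrieved else float("nan")
    recall = tp / len(relevant) if relevant else float("nan")
    # F1 = 2*TP / (2*TP + FP + FN), and 2*TP + FP + FN = |retrieved| + |relevant|
    total = len(retrieved) + len(relevant)
    f1 = 2 * tp / total if total else float("nan")
    return precision, recall, f1

# Invented example: 4 documents retrieved, 3 relevant, 2 in common
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 4, 5})
print(p, r, f)
```

The low precision and comparatively high recall for this query are what one would expect from an <code>OR</code> query that includes very common tokens such as "of".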


In [22]:
outcome = defaultdict(lambda: {'precision': 0, 'recall': 0})
for query, expected in queries.items():
    retrieved = or_query(query, I)
    # True positives: relevant documents that were actually retrieved
    tp = len(set(expected).intersection(retrieved))
    try:
        p = tp / len(retrieved)
    except ZeroDivisionError:
        # Precision is undefined when nothing is retrieved
        p = np.nan
    r = tp / len(expected)
    outcome[query]['precision'] = p
    outcome[query]['recall'] = r

In [23]:
O = pd.DataFrame(outcome).T

In [24]:
O.head()

           precision    recall
India            1.0  0.869565
Slovenia         1.0      0.75
Canada           1.0  0.615385
Tanzania         1.0       1.0
Indonesia        1.0      0.75


In [25]:
O.mean()

precision    0.814421
recall       0.720115
dtype: float64

In [26]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [40]:
q = 'United States of America'
ground_truth = queries[q]
result = or_query(q, I)

In [41]:
all_docs = result.union(set(ground_truth))

In [42]:
y_true = np.zeros(len(all_docs))
y_pred = np.zeros(len(all_docs))

In [43]:
for i, d in enumerate(all_docs):
    if d in ground_truth:
        y_true[i] = 1
    if d in result:
        y_pred[i] = 1

In [44]:
y_true[:10]

array([0., 0., 0., 1., 1., 0., 0., 0., 0., 0.])

In [45]:
y_pred[:10]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [46]:
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00       271
         1.0       0.16      0.79      0.27        67

    accuracy                           0.16       338
   macro avg       0.08      0.40      0.14       338
weighted avg       0.03      0.16      0.05       338



In [50]:
confusion_matrix(y_true, y_pred)

array([[  0, 271],
       [ 14,  53]])

In [51]:
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

In [52]:
print(tn, fp, fn, tp)

0 271 14 53
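The true-negative count is 0 by construction: <code>all_docs</code> contains only documents that are retrieved or relevant, so no document can be labelled 0 in both vectors. As a sketch of the alternative, the hypothetical helper below counts over the whole collection, assuming document ids run from 0 to <code>n_docs - 1</code>. The ids in the usage example are invented.

```python
def confusion_counts(retrieved, relevant, n_docs):
    """Return (tn, fp, fn, tp) over a collection of n_docs documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    tn = n_docs - tp - fp - fn       # everything else in the collection
    return tn, fp, fn, tp

# Invented example: 10 documents, 4 retrieved, 3 relevant
print(confusion_counts({0, 1, 2, 3}, {2, 3, 4}, n_docs=10))  # (5, 2, 1, 2)
```

Precision and recall do not depend on the true negatives, so they are the same under both constructions; accuracy, however, changes.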


In [54]:
for d in result:
    print(docs[d])

Zalog is a formerly independent settlement in the eastern part of the capital Ljubljana in central Slovenia.

Beacon Hill is a neighbourhood located in Beacon Hill-Cyrville Ward in the east end of Ottawa, Ontario, Canada.

There is no road connection to the rest of Norway, even though it is located on the mainland.

Her parents, Matthew and Julia Moore, had come to the United States in 1888 and were living at 32 Monroe Street in Manhattan.

Pointe Coupee Parish School Board is a school district headquartered in unincorporated Pointe Coupee Parish, Louisiana, United States.

The descriptions of M. slaina ants were based on worker, queen and males collected from various places in Siberia and Kazakhstan.

The group was presented to the Prince of Wales, later King Charles I, in 1623 while he was in Spain negotiating a marriage contract, and it soon became the most famous Italian sculpture in England.

And there's no such case - they were all Serbs before, and did not belong to one of the p