# Session 3 - User Relevance Feedback

## 1 Document Relevance

In this session we are going to implement pseudo user relevance feedback on top of ElasticSearch.

One feature of the ElasticSearch query results that we have not used so far is the score, which measures the relevance of a document with respect to the terms of a query.

The script `SearchIndexWeights.py` lets you search for keywords in an index, just as you would in any search engine (such as Google Search or Bing).

This script returns a limited number of hits and also shows the score of each document (the documents are sorted by score).

**Read the first section** of the session documentation and play a little bit with the documents that you have in the `news` index.

In [1]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
from elasticsearch_dsl.query import Q


client = Elasticsearch()
s = Search(using=client, index='news')


q = Q('query_string',query='toronto')  # Feel free to change the word

s = s.query(q)
response = s[0:3].execute()
for r in response:  # the slice [0:3] above limits the response to the first 3 hits
    print('ID= %s SCORE=%s' % (r.meta.id,  r.meta.score))
    print('PATH= %s' % r.path)
    print('TEXT: %s' % r.text[:50])
    print('-----------------------------------------------------------------')

ID= udUOwXwBa2h4IIRS8nQo SCORE=7.8208394
PATH= /tmp/20_newsgroups/rec.sport.hockey/0010493
TEXT: Detroit is a very disciplined team.  There's a lot
-----------------------------------------------------------------
ID= ntUOwXwBa2h4IIRS83ou SCORE=7.634695
PATH= /tmp/20_newsgroups/rec.sport.baseball/0009082
TEXT: In article <C51vwC.Lru@usenet.ucs.indiana.edu> bod
-----------------------------------------------------------------
ID= a9UOwXwBa2h4IIRS8naL SCORE=7.510704
PATH= /tmp/20_newsgroups/rec.sport.hockey/0010488
TEXT: Detroit's going to beat Toronto in 6 or LESS!!!
 G
-----------------------------------------------------------------




***

## 2 Rocchio's Rule


To implement relevance feedback we are going to use Rocchio's rule. We will expand the query over a number of iterations, using the terms of the most relevant documents retrieved.

As described in the session documentation, you will need to write a script that, given a query, repeats the following steps a number of times ($nrounds$):

1. Obtain the $k$ most relevant documents
2. Compute a new query using the current query and the terms of the $k$ documents
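The two steps above can be sketched as a simple loop. This is only an illustration under stated assumptions: `feedback_loop`, `toy_search` and `toy_update` are hypothetical names, the corpus is a tiny in-memory list instead of an ElasticSearch index, and the update step simply adds the average of the retrieved documents to the query (Rocchio's rule with both coefficients set to 1).

```python
def feedback_loop(query, search, update, nrounds, k):
    """Repeat nrounds times: retrieve the k best documents, then recompute the query."""
    for _ in range(nrounds):
        top_docs = search(query, k)       # step 1: the k most relevant documents
        query = update(query, top_docs)   # step 2: new query from query + documents
    return query

# Toy stand-ins (assumptions for illustration, not the session scripts).
CORPUS = [
    {"toronto": 1.0, "hockey": 1.0},
    {"toronto": 1.0, "detroit": 1.0},
    {"baseball": 1.0},
]

def toy_search(query, k):
    # Score each document by the dot product with the query vector.
    def score(doc):
        return sum(w * doc.get(t, 0.0) for t, w in query.items())
    return sorted(CORPUS, key=score, reverse=True)[:k]

def toy_update(query, docs):
    # Add the average of the retrieved document vectors to the query.
    new_query = dict(query)
    for doc in docs:
        for t, w in doc.items():
            new_query[t] = new_query.get(t, 0.0) + w / len(docs)
    return new_query

final_query = feedback_loop({"toronto": 1.0}, toy_search, toy_update, nrounds=2, k=2)
print(final_query)
```

In the real script, the search step would query the index and the update step would build the document vectors from the indexed terms, as explained in the session documentation.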

Rocchio's rule involves computing the following:

$$Query' = \alpha \times Query + \beta \times \frac{d_1 + d_2 + \cdots + d_k}{k}$$
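As a small worked example of the formula, the sketch below represents the query and each document as a dictionary from term to weight. The representation, the function names `average_vectors` and `rocchio`, and the sample weights are assumptions made for illustration; how the document weights are actually computed is described in the session documentation.

```python
def average_vectors(docs):
    """Return (d_1 + ... + d_k) / k for a non-empty list of term -> weight dicts."""
    total = {}
    for d in docs:
        for term, w in d.items():
            total[term] = total.get(term, 0.0) + w
    k = len(docs)
    return {term: w / k for term, w in total.items()}

def rocchio(query, docs, alpha, beta):
    """Query' = alpha * Query + beta * (average of the k document vectors)."""
    avg = average_vectors(docs)
    terms = set(query) | set(avg)
    return {t: alpha * query.get(t, 0.0) + beta * avg.get(t, 0.0) for t in terms}

query = {"toronto": 1.0}
docs = [{"toronto": 0.5, "detroit": 0.5}, {"toronto": 0.5, "hockey": 1.0}]
new_query = rocchio(query, docs, alpha=2.0, beta=1.0)
print(sorted(new_query.items()))
# [('detroit', 0.25), ('hockey', 0.5), ('toronto', 2.5)]
```

Terms that appear only in the documents enter the new query with weight $\beta$ times their average, which is how the query gets expanded.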

So we have different parameters to play with:

1. The number of rounds ($nrounds$)
2. The number of relevant documents ($k$)
3. The parameters of Rocchio's rule ($\alpha$ and $\beta$)
4. The number of terms in the recomputed query ($R$)

**Read the documentation** and pay special attention to how you have to build the query that you pass to ElasticSearch so that it includes the weights computed by Rocchio's rule.
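One possible way to carry the weights, sketched here as an assumption rather than as the format required by the session documentation, is the `term^weight` boost operator of the `query_string` syntax. The helper names `query_to_string` and `string_to_query` are hypothetical.

```python
def query_to_string(weights):
    """Build a string such as 'toronto^2.0 hockey^1.0' from a term -> weight dict."""
    return " ".join("%s^%s" % (term, weight) for term, weight in weights.items())

def string_to_query(text):
    """Parse 'toronto^2 hockey' into a term -> weight dict (missing boost means 1.0)."""
    weights = {}
    for token in text.split():
        term, _, boost = token.partition("^")
        weights[term] = float(boost) if boost else 1.0
    return weights

weights = string_to_query("toronto^2 hockey")
print(weights)                   # {'toronto': 2.0, 'hockey': 1.0}
print(query_to_string(weights))  # toronto^2.0 hockey^1.0
```

The resulting string could then be passed as the `query` argument of `Q('query_string', query=...)`, in the same way as the plain keyword in the first cell.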

Keep in mind that some of the elements you need for this part are functions that you already programmed for the previous session's assignment.

**Pay attention** to the documentation that you have to deliver for this session.



In [7]:
D = dict([("a",1),("b",1),("c",1),("d",1)])
key = "a"
if key in D:
    D[key]= D[key]+2
print(D)

{'a': 3, 'b': 1, 'c': 1, 'd': 1}


In [23]:
alpha = 1.0
beta = 1.0

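def rocchioNewQuery(query, merged_d):
    # query and merged_d are term -> weight dictionaries.
    # Build the result in a new dictionary so the arguments are not modified.
    new_query = {term: beta * weight for term, weight in merged_d.items()}

    # Add the alpha-weighted query terms (terms missing from merged_d start at 0)
    for term, weight in query.items():
        new_query[term] = new_query.get(term, 0.0) + alpha * weight

    # Keep only the 3 terms with the highest weight
    return sorted(new_query.items(), key=lambda x: x[1], reverse=True)[:3]

D = dict([("a",1),("b",1),("c",1),("d",1)])
# Not named Q, to avoid shadowing the Q imported from elasticsearch_dsl in the first cell
query_weights = dict([("a",1),("z",3)])
print(rocchioNewQuery(query_weights, D))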

[('z', 3.0), ('a', 2.0), ('b', 1.0)]
