---
Information Retrieval Exercises
====

---
Rider or Die
----

![](http://i.telegraph.co.uk/multimedia/archive/02162/ridderhaggard_2162866i.jpg)

You will improve a rather poorly made information retrieval (IR) system so that it quickly retrieves the documents that match a query.

---
Data 
---

> “...one day a sunrise will come when we shall be among those who are lost, and then others will watch those glorious rays, and grow sad in the midst of beauty, and dream of Death in the full glow of arising Life!”   
> \- Rider Haggard

Your IR system will find relevant documents among a collection of 60 short stories by the famed [Rider Haggard](http://en.wikipedia.org/wiki/H._Rider_Haggard). 

The training data is located in the `data/` directory under the subdirectory `RiderHaggard/`. Within this directory you will see yet another directory `raw/`. This contains the raw text files of 60 different short stories written by Rider Haggard.

A set of development queries and their expected answers is in the `data/` directory, in the files `queries.txt` and `solutions.txt` respectively.

----
Part I
---

Improve upon the IR system provided. This involves implementing:

- **Inverted Index:** a mapping from words to the documents in which they occur.
- **Boolean Retrieval:** in which you return the list of documents that contain all words in a query* 

You will implement and/or improve upon the following functions:

- `index():` This is where you will build the inverted index. The documents will have already been read in for you at this point, so you will want to look at some of the instance variables in the class:
    - `self.titles`
    - `self.docs`
    - `self.vocab`
- `get_posting():` This function returns a list of integers (document IDs) that identifies the documents in which the word is found. This is basically just an API into your inverted index, but you must implement it in order to be evaluated fully.
- `boolean_retrieve():` This function performs Boolean retrieval, returning a list of document IDs corresponding to the documents in which all the words in `query` occur.



\* Yes, we only support conjunctions...
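
The two pieces above can be sketched as follows. This is a minimal illustration, not the provided starter code: it assumes each document is a list of already-stemmed words (as `self.docs` is described) and that a document's ID is its position in that list. The standalone functions `build_index`, `get_posting`, and `boolean_retrieve` and the toy corpus are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    inv_index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for word in words:
            inv_index[word].add(doc_id)
    return inv_index

def get_posting(inv_index, word):
    """Return the sorted list of document IDs containing `word`."""
    return sorted(inv_index.get(word, ()))

def boolean_retrieve(inv_index, query):
    """Return sorted IDs of documents that contain every word in `query`."""
    if not query:
        return []
    postings = [set(inv_index.get(word, ())) for word in query]
    # Conjunction only: intersect the posting sets of all query words.
    return sorted(set.intersection(*postings))

# Toy corpus (hypothetical): three tiny "documents".
docs = [
    ["lion", "hunt", "night"],
    ["lion", "river"],
    ["hunt", "lion", "river"],
]
inv_index = build_index(docs)
print(get_posting(inv_index, "lion"))                   # [0, 1, 2]
print(boolean_retrieve(inv_index, ["lion", "river"]))   # [1, 2]
print(boolean_retrieve(inv_index, ["lion", "ghost"]))   # []
```

Storing postings as sets keeps the intersection simple; sorting only at the boundary gives the list of integers that `get_posting()` is expected to return.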

----
Evaluation
----
Your IR system will be evaluated on a development set of queries as well as a held-out set of queries. The development queries are stored in the file **queries.txt**.

Running the code
---

In [1]:
reset -fs

Running `python/ir_system_part_1.py` (see the `%run` cell further down) runs your IR system and tests it against the development set of queries.

The first time you run the code, the documents will be stemmed.

Then you will see the evaluation metrics.

In [29]:
from collections import defaultdict

In [34]:
b.intersection(l)

{'a'}

In [41]:
h = [3,4,5,7,6,4,3,2,4,2]

In [43]:
p = sorted(h)
p

[2, 2, 3, 3, 4, 4, 4, 5, 6, 7]

In [20]:
l = set(['a','a','a','b'])
l

{'a', 'b'}

In [47]:
%run python/ir_system_part_1.py


Reading in documents...
Already stemmed!
Indexing...
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]  inv index
===== Running tests =====
Inverted Index Test
    Score: 3 Feedback: 5/5 Correct. Accuracy: 1.000000
Boolean Retrieval Test
    Score: 3 Feedback: 5/5 Correct. Accuracy: 1.000000


---


__Note__: The first time you run this, it will create a directory named `stemmed/` in `../data/RiderHaggard/`. This is meant to be a simple cache for the stemmed documents, so later runs will be much faster.

*However*, this means that if something goes wrong during the first run and it does not finish processing all the documents, you may be left with an incomplete set of documents in `../data/RiderHaggard/stemmed/`. If this happens, simply remove the `stemmed/` directory and re-run!
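
If you prefer to clear the cache from Python rather than from a shell, a small helper along these lines would do it. The function name `clear_stem_cache` is hypothetical and not part of the provided code.

```python
import shutil
from pathlib import Path

def clear_stem_cache(cache_dir):
    """Delete the stemmed-document cache if it exists.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    cache_dir = Path(cache_dir)
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False
```

Calling `clear_stem_cache("../data/RiderHaggard/stemmed")` before re-running forces the documents to be stemmed again; adjust the path to match the directory you run from.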

----
Part II
---

Continue improving the IR system by implementing:

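- __tf-idf:__ Compute and store the term-frequency inverse-document-frequency value for every word-document co-occurrence: $w_{t,d}=(1+\text{log}_{10}\text{tf}_{t,d})\times\text{log}_{10}(N/\text{df}_t)$, where $\text{tf}_{t,d}$ is the number of times term $t$ occurs in document $d$, $\text{df}_t$ is the number of documents that contain $t$, and $N$ is the total number of documents.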

- **Cosine Similarity:** Implement cosine similarity in order to improve upon the ranked retrieval system, which currently retrieves documents based upon the Jaccard coefficient between the query and each document.

__Also__ note that when computing $w_{t,q}$ (*i.e.* the weight for the term $t$ in the query $q$) do *not* include the idf term. That is, $w_{t,q}=1+\text{log}_{10}\text{tf}_{t,q}$.
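
As a worked illustration of these weights, here is a small sketch. It reads the first factor of the document weight as the log-scaled term frequency $\text{tf}_{t,d}$, consistent with the query formula, and assigns a weight of 0 to terms that do not occur. The helper names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def doc_weight(tf, df, n_docs):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

def query_weight(tf):
    """w_{t,q} = 1 + log10 tf, with no idf factor; 0 if the term is absent."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Toy corpus (hypothetical): each document is a list of words.
docs = [
    ["lion", "lion", "hunt"],
    ["river", "hunt"],
    ["lion", "river", "river", "river"],
    ["night"],
]
n_docs = len(docs)
# Document frequency: count each word once per document.
df = Counter(word for words in docs for word in set(words))
# Term frequency: one Counter per document.
tf = [Counter(words) for words in docs]

# "lion" occurs twice in document 0 and appears in 2 of the 4 documents.
w = doc_weight(tf[0]["lion"], df["lion"], n_docs)
print(round(w, 4))        # 0.3916
print(query_weight(1))    # 1.0
```

Note that a term occurring in every document gets an idf of $\text{log}_{10}(N/N)=0$, so it contributes nothing to a document's weight vector.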

To improve upon the information retrieval system, you must implement and/or improve upon the following functions:

- `compute_tfidf():` This function computes and stores the tf-idf values for words and documents. For this you will probably want to be aware of the class variables `vocab` and `docs` which hold, respectively, the list of all unique words and the list of documents, where each document is a list of words.
- `get_tfidf():` You must implement this function to return the tf-idf weight for a particular word and document ID.
- `rank_retrieve():` This function returns a priority queue of the top ranked documents for a given query. Right now it ranks documents according to their Jaccard similarity with the query, but you will replace this method of ranking with a ranking using the cosine similarity between the documents and query.
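
A rough sketch of cosine-based ranking is shown below. It is an illustration under stated assumptions, not the provided implementation: document vectors are stored as one `{word: weight}` dict per document, the helper names `tfidf_vectors` and `cosine_rank` are hypothetical, and `heapq.nlargest` stands in for whatever priority queue the starter code uses.

```python
import heapq
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: w_td} dict per document."""
    n_docs = len(docs)
    df = Counter(word for words in docs for word in set(words))
    vectors = []
    for words in docs:
        tf = Counter(words)
        vectors.append({
            word: (1 + math.log10(count)) * math.log10(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors

def cosine_rank(query, vectors, k=10):
    """Return up to k (score, doc_id) pairs, highest cosine similarity first."""
    # Query weights use log-scaled tf only (no idf).
    q_vec = {word: 1 + math.log10(count) for word, count in Counter(query).items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    scores = []
    for doc_id, d_vec in enumerate(vectors):
        d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
        if q_norm == 0 or d_norm == 0:
            continue  # skip empty queries and all-zero document vectors
        dot = sum(w * d_vec.get(word, 0.0) for word, w in q_vec.items())
        scores.append((dot / (q_norm * d_norm), doc_id))
    return heapq.nlargest(k, scores)

# Toy corpus (hypothetical).
docs = [
    ["lion", "lion", "hunt"],
    ["river", "hunt"],
    ["lion", "river", "river", "river"],
    ["night"],
]
vectors = tfidf_vectors(docs)
ranking = cosine_rank(["lion", "hunt"], vectors, k=3)
print([doc_id for _, doc_id in ranking])   # [0, 1, 2]
```

Dividing by the document norm is what distinguishes this from a raw dot product: without it, long documents would be favored simply for containing more words.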

### Evaluation
Your IR system will be evaluated on the same set of queries as Part I.

In [5]:
%run python/ir_system_part_2.py

SyntaxError: invalid syntax (ir_system_part_2.py, line 125)

---
Hints
---

> Smart data structures and dumb code works a lot better than the other way around.

- Take your time - Read the instructions, skim the code, and __read the instructions again__. 
- `set`, `Counter`, and `defaultdict` are your friends
- indexes are your best friends
- `np.log10` is __not__ the same as `np.log`
- Test your system with custom queries:

In [8]:

%run python/ir_system_part_2.py "My very own query"

SyntaxError: invalid syntax (ir_system_part_2.py, line 125)

In [7]:
%run python/ir_system_part_2.py "dream of Death in the full glow of arising Life"

SyntaxError: invalid syntax (ir_system_part_2.py, line 125)

In [13]:
%run python/ir_system_part_2.py "The space aliens were friendly"

Reading in documents...
Already stemmed!
Indexing...
Calculating tf-idf...
Best matching documents to 'The space aliens were friendly':
Hunter Quatermain's Story: 2.805049e-03
Long Odds: 2.238806e-03
The Tale of Three Lions: 1.573977e-03
Black Heart and White Heart: 1.519757e-03
The Mahatma and the Hare: 1.283148e-03
Maiwa's Revenge: 1.270245e-03
Queen of the Dawn (1925): 1.246572e-03
The World's Desire: 1.238237e-03
Allan and the Ice Gods (1927): 1.133530e-03
The Wizard: 1.112038e-03


---