---
Information Retrieval (IR)
----

![](http://boston.lti.cs.cmu.edu/classes/11-744/treclogo-c.gif)

![](http://2.bp.blogspot.com/-cH7ahEOwClg/US7fo0mbloI/AAAAAAAAAlU/nCYCyfS5ztI/s1600/intelligize-logo-534x226.png)

[Intelligize](http://www.intelligize.com/) 

> Efficiently search, retrieve and analyze SEC filings, agreements and exhibits. Utilize advanced analytics and comparison tools to pinpoint only the most relevant precedents.

> Access statutes, rules, regulations and other materials pertinent to US capital markets.   
> Granularly search through Comment Letters, corresponding responses, and No-Action Letters.



By The End Of This Session You Should Be Able To:
----

- Explain what IR is and why it is important
- Draw an IR system
- Evaluate an IR system with the following concepts:  
    - Precision and recall
    - Precision at k
    - Mean average precision
- Create a basic IR system with the following features:
    - Document-term matrix
    - Boolean retrieval

Information Retrieval (IR)
----

IR is just one small, nested part of "search engines". There are also large product and computer science parts.

We are going to focus on the NLP, Statistics, and Machine Learning aspects.  
(However, I will give you some general tips to impress people.)

<img src="images/search_engine.png" style="width: 300px;"/>

IR System: The NLP parts
---

![](http://www.mactech.com/content_images/macsimum/uploads/MultiLanguagePatent.jpg)

Information Retrieval Process (Fundamental)
-----

1. Given a collection of documents 
1. And a user's query
1. Find the most relevant documents

Key IR Terms
----

- Query
- Document
- Collection
- Index
- Term

Query
----

A representation of what the user is looking for. 

Can be stored as a tuple of ngrams.
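
As a rough illustration of the tuple-of-ngrams idea, here is a minimal sketch. The helper names `ngrams` and `query_to_ngrams`, the lowercase/whitespace tokenization, and the `max_n=2` default are all assumptions made for this example, not part of any particular IR library.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def query_to_ngrams(query, max_n=2):
    """Represent a query as a tuple of its 1-grams up to max_n-grams.

    Tokenization here is a naive lowercase + whitespace split (an assumption).
    """
    tokens = query.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(ngrams(tokens, n))
    return tuple(grams)


query_repr = query_to_ngrams("SEC comment letters")
print(query_repr)
```

A tuple is immutable and hashable, so the same representation can double as a cache key for repeated queries.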

Document
----

An information entity that the user wants.

- Not just a "paper" item
- Can be records (medical), pages (websites), images, people, or movies

Document Storage
-------

1. Actual item representation (for users)
1. Value-added representation (for the system)
    - metadata
    - fixed unicode
    - tokens and counts
    - links to it (PageRank)

Collection
----

A set of documents

Index
-----

A representation of information that makes querying easier

What would be a good Python data structure for an index?

```python
{'hoe': {0, 1, 9, 12},
'rake': {0, 1, 5, 9}
}
```
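
A dictionary that maps each term to the set of document ids containing it is often called an inverted index, and it makes Boolean retrieval a matter of set operations. The sketch below shows one way this could look. The toy `docs` collection and the helper names (`build_index`, `boolean_and`, `boolean_or`, `boolean_not`) are hypothetical, chosen so that the resulting postings for `'hoe'` and `'rake'` match the example above.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text
docs = {
    0: "hoe rake shovel",
    1: "hoe rake",
    5: "rake gloves",
    9: "hoe rake seeds",
    12: "hoe seeds",
}


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def boolean_and(index, *terms):
    """Documents that contain every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Documents that contain at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))


def boolean_not(index, term, all_ids):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())


index = build_index(docs)
print(sorted(boolean_and(index, "hoe", "rake")))                    # hoe AND rake
print(sorted(boolean_or(index, "hoe", "rake")))                     # hoe OR rake
print(sorted(index["hoe"] & boolean_not(index, "rake", docs)))      # hoe AND NOT rake
```

Sets are a convenient choice here because intersection, union, and difference correspond directly to AND, OR, and NOT.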

Term
----

A word, token, or n-gram that appears in a document or a query

Document Representations
-----

Term-document matrix (m x n)

Document-document matrix (n x n)

Remember Heaps' Law
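
To make the two shapes concrete, here is a small pure-Python sketch on a made-up three-document collection. Treating the document-document matrix as the count of shared term occurrences (the product of the term-document matrix's transpose with itself) is an assumption; other similarity measures would work as well. Heaps' Law matters because the number of rows m (the vocabulary size) keeps growing, sublinearly, as the collection grows.

```python
# Hypothetical toy collection (n = 3 documents)
docs = ["hoe rake shovel", "hoe rake", "rake gloves"]
tokenized = [doc.split() for doc in docs]

# Vocabulary: the m distinct terms, in sorted order
vocab = sorted({term for doc in tokenized for term in doc})

# Term-document matrix (m x n): rows are terms, columns are documents,
# each cell is the count of the term in the document
term_doc = [[doc.count(term) for doc in tokenized] for term in vocab]

# Document-document matrix (n x n): dot product of every pair of columns
n = len(docs)
doc_doc = [
    [sum(row[i] * row[j] for row in term_doc) for j in range(n)]
    for i in range(n)
]

print(vocab)
for term, row in zip(vocab, term_doc):
    print(term, row)
print(doc_doc)
```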

Evaluation Metrics
----

- Precision and recall
- Precision at k
- Mean average precision

Confusion Matrix
----

![](images/cont.png)

![](images/p_tp.png)

![](images/recall_tp.png)

<img src="images/items.png" style="width: 400px;"/>

![](images/recall.png)
<img src="images/items.png" style="width: 200px;"/>

![](images/precision.png)
<img src="images/items.png" style="width: 200px;"/>

![](images/p.png)

![](images/r.png)

![](images/p.png)
![](images/r.png)

![](images/pvsr.png)

Check for understanding
-----

Suppose 100 documents in a collection are relevant to a given query, and 60 of them are retrieved in a given search.

What is the recall?



Recall = (60/100) = .60

In a given search, the system retrieves 80 items, out of which 30 are relevant and 50 are non-relevant. 

What is the precision?

Precision = (30/80) = .375
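
The two answers above can be checked with a short sketch that treats the retrieved and relevant documents as sets. The function names and the integer document ids are made up for illustration; only the counts (100 relevant, 60 retrieved; 80 retrieved, 30 relevant) come from the questions.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Question 1: 100 relevant documents, 60 of them retrieved
q1_relevant = set(range(100))
q1_retrieved = set(range(60))
print("Recall:", recall(q1_retrieved, q1_relevant))

# Question 2: 80 retrieved documents, of which 30 are relevant
q2_retrieved = set(range(80))
q2_relevant = set(range(30))  # ids 30..79 are the 50 non-relevant results
print("Precision:", precision(q2_retrieved, q2_relevant))
```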

<img src="images/trump.png" style="width: 300px;"/>

When a query has thousands (or millions) of relevant documents, recall is often not a meaningful metric.

No one is interested in reading all of them.

Search Engine Results Page (SERP)
-----

1st position is most important

2nd position is sometimes clicked on

3rd position is rarely clicked on

4th position onward doesn't matter

----

Above the fold is all that matters. The fold (aka the user's attention) is getting smaller: compare desktop to mobile to watch.

Need "precision at k" or p@k
----

[P@10 or "Precision at 10"](https://en.wikipedia.org/wiki/Mean_average_precision#Precision_at_K) is the fraction of the top 10 results that are relevant. Ten is a common cutoff because the first search results page typically shows 10 results.

What is precision at different k for this SERP?

1. N / not relevant document
2. N / not relevant document
3. N / not relevant document
4. R / relevant document
5. R / relevant document
6. N / not relevant document
7. R / relevant document 
8. R / relevant document 
9. N / not relevant document
10. R / relevant document
11. R / relevant document
12. N / not relevant document
13. R / relevant document

```python
# Here is our data
serp = 'N N N R R N R R N R R N R'.split()
```

```python
# Convert to a boolean vector (R = relevant)
serp_relevant = [relevance == 'R' for relevance in serp]
serp_relevant
```

```
[False,
 False,
 False,
 True,
 True,
 False,
 True,
 True,
 False,
 True,
 True,
 False,
 True]
```

```python
# Calculate precision at the rank of each relevant result
precisions = [sum(serp_relevant[:k+1]) / (k+1)
              for k, relevant in enumerate(serp_relevant)
              if relevant]
```

```python
# What are the precisions?
print("Precision at relevant item: ")
print(*precisions, sep='\n')
```

```
Precision at relevant item: 
0.25
0.4
0.42857142857142855
0.5
0.5
0.5454545454545454
0.5384615384615384
```

```python
import numpy as np

print("Average precision for this query: {:.2f}".format(np.mean(precisions)))
```

```
Average precision for this query: 0.45
```
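
For a fixed cutoff k, rather than the rank of each relevant result, a small helper could look like the sketch below. The function name `precision_at_k` is made up for this example, and dividing by k even when fewer than k results exist (so missing results count as non-relevant) is one common convention, stated here as an assumption.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant.

    `relevance` is a list of booleans in ranked order. Dividing by k
    (not by the number of results available) is an assumed convention.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(relevance[:k]) / k


# Same SERP as above, redefined so this block runs on its own
serp = 'N N N R R N R R N R R N R'.split()
serp_relevant = [relevance == 'R' for relevance in serp]

for k in (1, 5, 10):
    print("P@{}: {:.2f}".format(k, precision_at_k(serp_relevant, k)))
```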


Mean average precision (MAP): system performance across multiple queries

![](images/map.png)
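
A minimal sketch of MAP, assuming it is the plain mean of per-query average precision. The second query's relevance judgments are invented for illustration. Note also that this version averages over the relevant results that were actually retrieved, as in the single-query calculation for the SERP earlier; a common alternative divides by the total number of relevant documents in the collection.

```python
def average_precision(relevance):
    """Mean of the precision values at the rank of each relevant result."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(runs):
    """Mean of average precision over a list of ranked result lists."""
    return sum(average_precision(run) for run in runs) / len(runs)


# Query 1 is the SERP from earlier; query 2 is a hypothetical second query
query_1 = [r == 'R' for r in 'N N N R R N R R N R R N R'.split()]
query_2 = [r == 'R' for r in 'R N R N N'.split()]

print("AP query 1: {:.2f}".format(average_precision(query_1)))
print("AP query 2: {:.2f}".format(average_precision(query_2)))
print("MAP: {:.2f}".format(mean_average_precision([query_1, query_2])))
```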

Summary
-----

- Any sufficiently advanced tech company will build its own IR system
- IR Systems:
    - Have a collection of documents
    - Process a user's query
    - Return a SERP
- Evaluate a SERP with:
    + Precision 
    + Recall
    + p@k
    + MAP

<br>
<br> 
<br>

----

---
Advanced metrics
----

[Fall-out](https://en.wikipedia.org/wiki/Information_retrieval#Fall-out): The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available. 

![](images/fallout.png)

----
[Generality](http://crpit.com/confpapers/CRPITV49Yan.pdf): The proportion of documents in the collection that are relevant to a given query.

The larger the collection, the larger the number of non-relevant items for a given query. Hence, an increase in the level of recall tends to cause a decrease in precision.

[Source](http://www.cs.usc.edu/assets/002/82932.pdf)
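
Both advanced metrics can be sketched with set arithmetic. The collection size of 1,000 and the document ids below are hypothetical, and the function names are made up for illustration. Generality is computed here as relevant documents divided by collection size, which is an assumed reading of the definition.

```python
def fall_out(retrieved, relevant, collection):
    """Fraction of all non-relevant documents that were retrieved."""
    non_relevant = set(collection) - set(relevant)
    if not non_relevant:
        return 0.0
    return len(set(retrieved) & non_relevant) / len(non_relevant)


def generality(relevant, collection):
    """Fraction of the collection that is relevant to the query."""
    return len(set(relevant)) / len(set(collection))


# Hypothetical setup: 1,000 documents, 100 of them relevant (ids 0..99)
collection = set(range(1000))
relevant = set(range(100))

# 80 retrieved: 30 relevant (ids 0..29) and 50 non-relevant (ids 100..149)
retrieved = set(range(30)) | set(range(100, 150))

print("Fall-out: {:.4f}".format(fall_out(retrieved, relevant, collection)))
print("Generality: {:.2f}".format(generality(relevant, collection)))
```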

<br>
<br> 
<br>

----