----
Information Retrieval
----


By The End Of This Session You Should Be Able To:
----

- Describe the need to order a search engine results page (SERP)
- Use Jaccard similarity to quantify the relationship between a query and a document
- Use cosine similarity to measure the distance between a query and a document

Distance Metrics
-----

> "It's like being in a library where someone has scattered all the books on the floor, attached them together with threads and you are in the dark."  
> -- MorningSide, CBC Radio, May 1995

The need for ranking SERP

Jaccard similarity
----

![](images/jaccard.png)



![](https://dataaspirant.files.wordpress.com/2015/04/jaccard_similariyt.png)



![](https://dataaspirant.files.wordpress.com/2015/04/jaccaard2.png)



![](https://dataaspirant.files.wordpress.com/2015/04/jaccaard3.png)

[Source](https://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)

Why Jaccard similarity?
----

It uses only set operations (intersection and union) to measure how similar two items are.



It is a simple recommender system! Your content could be words, images, or wines.
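
A minimal sketch of that idea, assuming each item is described by a set of tags. The catalog, the wine names, and the `recommend` helper are all hypothetical and made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersection B| / |A union B|."""
    union = a | b
    if not union:
        # Two empty sets: return 0 here to avoid dividing by zero (a convention).
        return 0.0
    return len(a & b) / len(union)


# Hypothetical catalog: each wine is described by a set of tasting notes.
catalog = {
    "wine_a": {"red", "dry", "oak", "cherry"},
    "wine_b": {"red", "dry", "cherry", "pepper"},
    "wine_c": {"white", "sweet", "peach"},
    "wine_d": {"red", "sweet", "cherry"},
}


def recommend(liked, catalog, k=2):
    """Return the k items most similar to `liked`, best match first."""
    scores = [
        (jaccard(catalog[liked], tags), name)
        for name, tags in catalog.items()
        if name != liked
    ]
    scores.sort(reverse=True)
    return [name for _, name in scores[:k]]


print(recommend("wine_a", catalog))  # ['wine_b', 'wine_d']
```

No ratings or user history are needed, only a description of each item, which is why this kind of content-based approach is a reasonable first step.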

__Hint__: Do this first, way before collaborative filtering.

How do you calculate Jaccard similarity?
---
<img src="images/trumpy.jpg" style="width: 300px;"/>
<img src="images/collins.png" style="width: 300px;"/>

In [10]:
q1 = "I mean, part of the beauty of me is that I'm very very rich."
q2 = "The problem with beauty is that it's like being born rich and getting poorer."

![](images/jaccard.png)

jaccard_sim(q1, q2) = 5/21 = 0.238

### Student Activity

Write a function to calculate Jaccard similarity

In [11]:
from fractions import Fraction

def jaccard_sim(a, b):
    """Calculate the Jaccard similarity of two documents.
    Jaccard similarity is the overlap of two sets of words:
    jaccard_sim = |A intersection B| / |A union B|
    """
    # Munging: lowercase and strip basic punctuation before tokenizing
    a = a.lower().replace(".", "").replace(",", "").replace("'", "")
    b = b.lower().replace(".", "").replace(",", "").replace("'", "")

    a = set(a.split())
    b = set(b.split())
    
    return Fraction(len(a.intersection(b)), len(a.union(b)))

In [12]:
jaccard_sim(q1, q2)

Fraction(5, 21)

What are limitations of Jaccard similarity?
---

1. Assumes items are hashable (i.e., they can be made members of a set).  

2. Ignores frequency: how often an item appears does not affect the score.
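
Both limitations can be seen in a few lines. This is an illustrative sketch; the `jaccard` helper and the toy documents are assumptions, not part of the exercise above.

```python
def jaccard(a, b):
    """Jaccard similarity of two token lists, compared as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


query = "wine".split()
mostly_wine = "wine wine wine wine cheese".split()
mostly_cheese = "wine cheese cheese cheese cheese".split()

# Limitation 2: both documents collapse to the set {"wine", "cheese"},
# so they get the same score even though one is mostly about wine.
print(jaccard(query, mostly_wine))    # 0.5
print(jaccard(query, mostly_cheese))  # 0.5

# Limitation 1: unhashable items (such as lists) cannot be put in a set.
try:
    set([["wine"], ["cheese"]])
except TypeError as err:
    print("Cannot build a set:", err)
```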

What is the cosine similarity?
---

![](https://dataaspirant.files.wordpress.com/2015/04/cosine.png)

![](images/cosine_similarity.png)
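
The formula can be sketched in plain Python as follows. The function name `cosine_sim` and the choice to return 0 for an all-zero vector are assumptions made for this example.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; 0.0 is a convention chosen here.
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_sim([1, 0], [1, 0]))   # same direction -> 1.0
print(cosine_sim([1, 0], [0, 1]))   # orthogonal     -> 0.0
print(cosine_sim([1, 0], [-1, 0]))  # opposite       -> -1.0
```

Only the angle between the vectors matters, not their magnitudes: `cosine_sim([1, 2, 3], [2, 4, 6])` is also 1 (up to floating-point rounding).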

Why is cosine similarity so powerful?
-----

1. It is a vector-based metric, so it is fast and easy to calculate.  

2. It is easy to interpret because it is bounded between -1 and 1 (between 0 and 1 for non-negative vectors such as term counts or tf-idf weights).

[Cosine similarity calculator](http://www.appliedsoftwaredesign.com/archives/cosine-similarity-calculator)

---
tf-idf
---

What is tf-idf?
----

term frequency–inverse document frequency

---
Why tf-idf?
---

A __single__ statistical measure used to evaluate how important a word is to a document in a collection. 

The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. 

What is the tf-idf formula?
-----

![](https://deeplearning4j.org/img/tfidf.png)

TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. Thus, the term frequency is often divided by the document length (i.e., the total number of terms in the document) as a way of normalization: 

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
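
A minimal sketch of this TF formula, assuming whitespace tokenization and lowercasing (real systems usually tokenize more carefully). The name `term_frequency` is chosen for this example.

```python
from collections import Counter


def term_frequency(document):
    """Map each term to (count of the term) / (total terms in the document)."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # 2 of 6 tokens
print(tf["cat"])  # 1 of 6 tokens
```

Because every count is divided by the same document length, the TF values of a document sum to 1.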

![](https://deeplearning4j.org/img/tfidf.png)

IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following: 

IDF(t) = log(Total number of documents / Number of documents with term t in it)


The ratio is weighted on a log scale, not linearly: a 100x difference in how common a term is does not make it 100x more (or less) relevant.
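
The IDF formula can be sketched as below, assuming whitespace tokenization and the natural logarithm. The function name, the toy corpus, and the choice to return 0 for a term that appears in no document are assumptions for illustration.

```python
import math


def inverse_document_frequency(term, documents):
    """IDF(t) = log(total documents / documents containing t)."""
    n_containing = sum(1 for doc in documents if term in doc.lower().split())
    if n_containing == 0:
        # Assumption: a term absent from the corpus gets weight 0 instead of an error.
        return 0.0
    return math.log(len(documents) / n_containing)


corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
print(inverse_document_frequency("the", corpus))   # in 3 of 3 docs: log(1) = 0.0
print(inverse_document_frequency("cat", corpus))   # in 2 of 3 docs: log(1.5)
print(inverse_document_frequency("bird", corpus))  # in 1 of 3 docs: log(3)
```

The log keeps the weights from exploding. With the natural log, a term found in 1 of 1,000 documents gets an IDF of about 6.9, while a term found in 100 of 1,000 gets about 2.3: it is 100x rarer but carries only 3x the weight.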


![](http://3.bp.blogspot.com/-jAaRras-pOM/UXNQOMnz1BI/AAAAAAAAnYA/9FwvHPOp90c/s1600/TFIDF-FIG-01.JPG)

----
How to build an IR system
----

1. Convert each document to a tf-idf vector
1. Convert the query to a tf-idf vector
1. Compute the cosine similarity between each document vector and the query vector
1. Rank the documents by similarity
1. Return the top K results
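
The five steps above can be sketched end to end with only the standard library. This is a toy illustration under several assumptions: whitespace tokenization, the unsmoothed TF and IDF formulas from this session, sparse vectors stored as dictionaries, and query terms missing from the corpus being dropped. All function names are made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(docs_tokens):
    """IDF(t) = log(N / number of documents containing t), for every corpus term."""
    n_docs = len(docs_tokens)
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector as {term: weight}; terms not in the corpus are dropped."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def cosine_sim(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, documents, k=2):
    docs_tokens = [tokenize(d) for d in documents]
    idf = build_idf(docs_tokens)
    # Step 1: documents -> tf-idf vectors
    doc_vectors = [tfidf_vector(tokens, idf) for tokens in docs_tokens]
    # Step 2: query -> tf-idf vector
    query_vector = tfidf_vector(tokenize(query), idf)
    # Step 3: cosine similarity between the query and every document
    scored = [(cosine_sim(query_vector, dv), i) for i, dv in enumerate(doc_vectors)]
    # Step 4: rank by score, highest first (ties broken by document order)
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Step 5: return the top k
    return [(documents[i], score) for score, i in scored[:k]]


documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]
for doc, score in search("cat on a mat", documents):
    print(f"{score:.3f}  {doc}")
```

In practice, a library such as scikit-learn's `TfidfVectorizer` would usually replace the hand-written parts. Note that it applies a smoothed IDF by default, so its scores will differ from the formula used here.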

---
Putting tf-idf and cosine similarity together
----
![](http://images.slideplayer.com/8/2321076/slides/slide_7.jpg)

![](http://nlp.stanford.edu/IR-book/html/htmledition/img411.png)

Check for understanding
---

Queries are short and documents are long. How does the system handle that?
 

The vectors are length-normalized, so differences in length do not affect the score.
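
A quick way to check this, sketched with raw term counts rather than tf-idf weights to keep it short: repeating a document 50 times makes its vector 50 times longer but leaves the cosine similarity unchanged. The helper name and the toy texts are assumptions for illustration.

```python
import math
from collections import Counter


def cosine_sim(u, v):
    """Cosine similarity between two sparse count vectors ({term: count})."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


short_doc = "cat mat"
long_doc = " ".join([short_doc] * 50)  # same content, 50 times as long

query = Counter("cat".split())
score_short = cosine_sim(query, Counter(short_doc.split()))
score_long = cosine_sim(query, Counter(long_doc.split()))

print(score_short)
print(score_long)
print(math.isclose(score_short, score_long))  # True
```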

Summary
----

- We need to rank our SERP to increase precision@k
- We can do that by finding the similarity between the query and the documents
- Cosine similarity is a good distance metric
- tf-idf turns the query and the documents into vectors, which cosine similarity can then compare
- Jaccard similarity is also an okay distance metric
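
To make the first point concrete, here is a small sketch of precision@k, assuming relevance judgments are available as a set of document ids. The ids and both orderings are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


relevant = {"d1", "d3"}                  # hypothetical relevance judgments
unranked = ["d2", "d4", "d1", "d3"]      # arbitrary order
ranked = ["d1", "d3", "d2", "d4"]        # relevant documents moved to the top

print(precision_at_k(unranked, relevant, k=2))  # 0.0
print(precision_at_k(ranked, relevant, k=2))    # 1.0
```

The same four documents are returned in both cases; only the order changes, and with it the precision of what the user sees first.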


----