# 1. Vector Space Retrieval

1. [Vector space retrieval: TF-IDF implementation](#Exercise-1)
2. [Probabilistic retrieval model](#Exercise-2)
3. [Theory exercises](#Exercise-3)

## Exercise 1

In this exercise, we explore how TF-IDF ranking works by implementing the vector space retrieval model.

From Wikipedia:

*TF-IDF (term frequency-inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. Its value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.*

*TF-IDF can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.*

*One of the simplest ranking functions is computed by summing the TF-IDF for each query term; many more sophisticated ranking functions are variants of this simple model.*
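
The simple ranking function mentioned in the quote can be sketched in a few lines. This is only an illustration under stated assumptions: the corpus below is a hypothetical, pre-tokenized toy example (not the exercise data), the helper names `idf` and `tfidf_sum_score` are made up for this sketch, raw counts are used as TF, and IDF is taken as log(N/nt).

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized toy corpus (not the exercise data).
docs = [
    ["bread", "bake", "recipe"],
    ["pie", "pie", "london"],
    ["bread", "pie", "bake", "bake"],
]


def idf(term, docs):
    """IDF as log(N / nt); 0 for terms that occur in no document."""
    n_t = sum(term in d for d in docs)
    return math.log(len(docs) / n_t) if n_t else 0.0


def tfidf_sum_score(query_terms, doc, docs):
    """Sum of TF * IDF over the query terms (TF = raw count)."""
    counts = Counter(doc)
    return sum(counts[t] * idf(t, docs) for t in query_terms)


scores = [tfidf_sum_score(["bake"], d, docs) for d in docs]
# Document indices sorted by decreasing score
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking)  # [2, 0, 1]
```

The third document ranks first because it contains the query term twice; the second document does not contain it and scores 0.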

For testing, we provide a small collection of five documents in the file __bread.txt__:

  DocID | Document Text
  ------|------------------
  1     | How to Bake Breads Without Baking Recipes
  2     | Smith Pies: Best Pies in London
  3     | Numerical Recipes: The Art of Scientific Computing
  4     | Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
  5     | Pastry: A Book of Best French Pastry Recipes

Your task is to find the top-ranked documents according to their TF-IDF score for the query
$Q$ = `"baking"`.

For further testing, use the collection __epfldocs.txt__, which contains recent tweets mentioning EPFL.

Also compare your results with those of the reference implementation based on the scikit-learn library.

#### Implementation

In [5]:
# Loading of libraries and documents
import string
import math

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

from collections import Counter

stemmer = PorterStemmer()


# Tokenize, stem a document
def tokenize(text):
    """Given a string, removes the punctuation and returns
    the lower-cased, stemmed tokens joined by single spaces.
    """
    # Remove punctuation
    text = "".join([ch for ch in text if ch
        not in string.punctuation])
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Return stemmed document
    return " ".join([stemmer.stem(word.lower())
        for word in tokens])


def read_documents(filename):
    # Read documents from file (each line is a document)
    with open(filename) as f:
        content = f.readlines()
    original_documents = [x.strip() for x in content] 
    documents = [tokenize(d).split() for d in original_documents]
    return documents, original_documents


def create_vocabulary(documents):
    # Create the vocabulary: flatten 'documents' into a set of terms
    vocabulary = {item for sublist in documents for item in sublist}
    # Remove stopwords (load the list once, as a set) and sort
    stop_words = set(stopwords.words('english'))
    return sorted(word for word in vocabulary if word not in stop_words)


# Compute IDF, storing values in a dictionary
def idf_values(vocabulary, documents):
    """Computes IDF as log(N/nt)
    where nt is the number of documents where the
    term t appears.
    """
    idf = {}
    num_documents = len(documents)
    for term in vocabulary:
        # nt: number of documents that contain the term
        doc_freq = sum(term in doc for doc in documents)
        idf[term] = math.log(num_documents / doc_freq)
    return idf


# Generate the vector for a document (with normalization)
def vectorize(document, vocabulary, idf):
    """Returns the TF-IDF vector of a document (TF * IDF per term).
    TF is the max-normalized term frequency
        ftd / max_count
    where:
        ftd = nb. of times term t occurs in doc d
        max_count = nb. of times most common term occurs in d
    Dividing by max_count reduces the bias towards longer
    documents.
    """
    tfidf = [0] * len(vocabulary)
    counts = Counter(document)
    # count of the most common word (1 for an empty document,
    # which avoids a division by zero)
    max_count = max(counts.values(), default=1)
    for i, term in enumerate(vocabulary):
        tf = counts[term] / max_count
        tfidf[i] = tf * idf[term]
    return tfidf


# Compute cosine similarity
def cosine_similarity(v1, v2):
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]
        y = v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    if sumxy == 0:
        return 0
    else:
        return sumxy / math.sqrt(sumxx * sumyy)


def vectorize_query(query, vocabulary, idf):
    # get array of vectorized query words
    q = query.split()
    q = [stemmer.stem(w) for w in q]
    # get the vector for the query
    query_vector = vectorize(q, vocabulary, idf)
    return query_vector

    
# Compute the search result (get topk documents)
def search_vec(query, topk, vocabulary, idf, document_vectors,
    original_documents):
    query_vector = vectorize_query(query, vocabulary, idf)
    # compute cosine similarity scores
    scores = [[cosine_similarity(query_vector, document_vectors[d]), d]
        for d in range(len(document_vectors))]
    # sort and return search result
    scores.sort(key=lambda x: -x[0])
    ans = []
    indices = []
    for i in range(min(topk, len(scores))):
        ans.append(original_documents[scores[i][1]])
        indices.append(scores[i][1])
    return ans, indices, query_vector

        
# Put everything together
def search_scores_vec(query, topk, vocabulary, documents, original_documents):
    idf = idf_values(vocabulary, documents)
    document_vectors = [vectorize(d, vocabulary, idf) for d in documents]
    ans, indices, query_vector = search_vec(query, topk, vocabulary,
        idf, document_vectors, original_documents)
    for d in ans:
        print(d)
    return ans, indices, query_vector

[nltk_data] Downloading package stopwords to /home/lucia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/lucia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Try with baking and compare with scikit

In [6]:
documents, original_documents = read_documents("bread.txt")
ans, indices, query_vector = search_scores_vec("baking", 5,
    create_vocabulary(documents), documents, original_documents)

How to Bake Breads Without Baking Recipes
Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
Smith Pies: Best Pies in London
Numerical Recipes: The Art of Scientific Computing
Pastry: A Book of Best French Pastry Recipes


In [7]:
# Reference code using scikit-learn
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
features = tf.fit_transform(original_documents)
npm_tfidf = features.todense()
new_features = tf.transform(['baking'])

cosine_similarities = linear_kernel(new_features, features).flatten()
related_docs_indices = cosine_similarities.argsort()[::-1]
topk = 5
for i in range(topk):
    print(original_documents[related_docs_indices[i]])

How to Bake Breads Without Baking Recipes
Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
Pastry: A Book of Best French Pastry Recipes
Numerical Recipes: The Art of Scientific Computing
Smith Pies: Best Pies in London
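
The two rankings above agree on the first two documents and differ only in the order of the remaining three. A plausible explanation is that those three documents do not contain the query term, so they are tied at similarity 0 and their order depends only on how each sort treats ties. The sketch below uses hypothetical scores (not the values actually computed above) and plain Python sorting to mimic both idioms; note that NumPy's default `argsort` does not guarantee any particular order for ties.

```python
# Hypothetical similarity scores for five documents; three are tied at 0.
scores = [0.6, 0.0, 0.0, 0.4, 0.0]
n = len(scores)

# Stable descending sort, as in search_vec:
# tied documents keep their original order.
stable_desc = sorted(range(n), key=lambda i: -scores[i])

# Stable ascending sort followed by a reversal, which mimics the
# argsort()[::-1] idiom: tied documents end up in reverse order.
reversed_asc = sorted(range(n), key=lambda i: scores[i])[::-1]

print(stable_desc)   # [0, 3, 1, 2, 4]
print(reversed_asc)  # [0, 3, 4, 2, 1]
```

Both orderings are equally valid rankings, since the tied documents have the same score.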


#### Try with computer science and compare with scikit

In [10]:
documents, original_documents = read_documents("epfldocs.txt")
ans, indices, query_vector = search_scores_vec("computer science", 10,
    create_vocabulary(documents), documents, original_documents)

Exciting News: "World University Rankings 2016-2017 by subject: computer science" No1 @ETH &amp; @EPFL on No8. Congrats https://t.co/ARSlXZoShQ
New computer model shows how proteins are controlled "at a distance" https://t.co/zNjK3bZ6mO  via @EPFL_en #VDtech https://t.co/b9TglXO4KD
An interview with Patrick Barth, a new @EPFL professor who combines protein #biophysics with computer modeling https://t.co/iJwBaEbocj
New at @epfl_en Life Sciences @epflSV: "From PhD directly to Independent Group Leader" #ELFIR_EPFL:  Early Independence Research Scholars. See https://t.co/evqyqD7FFl, also for computational biology #compbio https://t.co/e3pDCg6NVb Deadline April 1 2018 at https://t.co/mJqcrfIqkb
Video of Nicola Marzari from @EPFL_en  on Computational Discovery in the 21st Century during #PASC17 now online: https://t.co/tfCkEvYKtq https://t.co/httPdHcK9W
Exposure Science Film Hackathon 2017 applications open! Come join our Scicomm-film-hacking event! #Science #scicomm https://t.co/zwtKPlh6HT


In [9]:
# Reference code using scikit-learn
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,1), min_df = 1, stop_words = 'english')
features = tf.fit_transform(original_documents)
npm_tfidf = features.todense()
new_features = tf.transform(['computer science'])

cosine_similarities = linear_kernel(new_features, features).flatten()
related_docs_indices = cosine_similarities.argsort()[::-1]
topk = 10
for i in range(topk):
    print(original_documents[related_docs_indices[i]])

Exciting News: "World University Rankings 2016-2017 by subject: computer science" No1 @ETH &amp; @EPFL on No8. Congrats https://t.co/ARSlXZoShQ
New computer model shows how proteins are controlled "at a distance" https://t.co/zNjK3bZ6mO  via @EPFL_en #VDtech https://t.co/b9TglXO4KD
An interview with Patrick Barth, a new @EPFL professor who combines protein #biophysics with computer modeling https://t.co/iJwBaEbocj
Exposure Science Film Hackathon 2017 applications open! Come join our Scicomm-film-hacking event! #Science #scicomm https://t.co/zwtKPlh6HT
Le mystère Soulages éblouit la science @EPFL  https://t.co/u3uNICyAdi
@cwarwarrior @EPFL_en @EPFL Doing science at @EPFL_en is indeed pretty cool!!! Thank you for visiting!!!
Blue Brain Nexus: an open-source tool for data-driven science https://t.co/m5yTgXf7ym #epfl
Swiss Data Science on Twitter: "Sign up for @EPFL_en #DataJamDays: learn more a… https://t.co/kNVILHWPGb, see more https://t.co/2wg3BbHBNq
The registration for Exposure Scienc


## Exercise 2

Implement a probabilistic retrieval model based on the query likelihood language model. Use a mixture of the document model and the collection model, each weighted at 0.5, and estimate both as unigram models by maximum likelihood estimation (MLE). You can use the code framework provided below.
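
For reference, the mixture model described above can be written as follows. The notation is an assumption made for this sketch: $M_d$ and $M_c$ denote the document and collection language models, $tf_{t,d}$ and $cf_t$ the number of occurrences of term $t$ in document $d$ and in the whole collection, $L_d$ and $L_c$ the corresponding lengths in tokens, and $\lambda = 0.5$.

$$P(q|d) = \prod_{t \in q} \left( \lambda \, P_{mle}(t|M_d) + (1-\lambda) \, P_{mle}(t|M_c) \right), \qquad P_{mle}(t|M_d) = \frac{tf_{t,d}}{L_d}, \quad P_{mle}(t|M_c) = \frac{cf_t}{L_c}$$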

Now, for the query $Q$ = `"baking"`, find the top ranked documents according to the probabilistic rank.

Compare the results with TF-IDF ranking.

In [None]:
# Loading of libraries and documents
import string
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')

from collections import Counter

stemmer = PorterStemmer()
# Tokenize, stem a document
def tokenize(text):
    """Given a string, removes the punctuation and returns
    the lower-cased, stemmed tokens joined by single spaces.
    """
    # Remove punctuation
    text = "".join([ch for ch in text if ch
        not in string.punctuation])
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Return stemmed document
    return " ".join([stemmer.stem(word.lower())
        for word in tokens])


def read_documents(filename):
    # Read documents from file (each line is a document)
    with open(filename) as f:
        content = f.readlines()
    original_documents = [x.strip() for x in content] 
    documents = [tokenize(d).split() for d in original_documents]
    return documents, original_documents




# ref.: https://nlp.stanford.edu/IR-book/pdf/12lmodel.pdf
# The query likelihood model returns results ranked by
# P(q|d), i.e. the probability of the query q under
# the language model derived from d. 

# probabilistic relevance
def query_prob(q, document, documents, lambda_):
    """Probability of producing the query q under the mixture of
        Md = language model of document d (weight lambda_)
        Mc = language model of the collection (weight 1 - lambda_)
    With 0 < lambda_ < 1, the score is > 0 iff every query term
    appears somewhere in the collection.
    """
    # term frequency in document
    tf = Counter(document)
    # length of document d
    L_doc = len(document)
    # term frequency in collection
    collection = [w for doc in documents for w in doc]
    cf = Counter(collection)
    # length of collection
    L_col = len(collection)

    # compute prob
    prob = 1
    for term in q:
        # MLE estimates; an empty document contributes 0
        p_doc = tf[term] / L_doc if L_doc else 0
        p_col = cf[term] / L_col
        prob = prob * (lambda_ * p_doc + (1 - lambda_) * p_col)
    return prob

# computing the search result
def search_prob(query, k, documents, original_documents):
    q = query.split()
    q = [stemmer.stem(w) for w in q]
    scores = [[query_prob(q, documents[d], documents, 0.5), d]
        for d in range(len(documents))]
    scores.sort(key=lambda x: -x[0])
    for i in range(min(k, len(scores))):
        print(original_documents[scores[i][1]])

In [None]:
documents, original_documents = read_documents("epfldocs.txt")
search_prob("computer science", 10, documents, original_documents)
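
A small numeric check can help when interpreting the scores of the mixture model. The sketch below is self-contained and uses a hypothetical, pre-tokenized toy corpus; the helper name `mixture_score` is made up for this illustration and is not part of the framework above. It shows that a document without the query term still receives a non-zero score through the collection model.

```python
from collections import Counter


def mixture_score(query_terms, doc, docs, lam=0.5):
    """Query likelihood under a document/collection mixture (MLE unigrams)."""
    tf = Counter(doc)
    collection = [w for d in docs for w in d]
    cf = Counter(collection)
    score = 1.0
    for t in query_terms:
        p_doc = tf[t] / len(doc) if doc else 0.0
        p_col = cf[t] / len(collection)
        score *= lam * p_doc + (1 - lam) * p_col
    return score


# Hypothetical toy corpus: 8 tokens in total, "bake" occurs twice.
docs = [["bake", "bread"], ["pie", "london"], ["bake", "pie", "pie", "cake"]]
scores = [mixture_score(["bake"], d, docs) for d in docs]
print(scores)  # [0.375, 0.125, 0.25]
```

The second document never mentions the query term, yet it scores 0.5 * 2/8 = 0.125 because of the smoothing with the collection model.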

## Exercise 3
Following the notation used in class, let us denote the set of terms by $T=\{k_i|i=1,...,m\}$ and the set of documents by $D=\{d_j |j=1,...,n\}$, and let $d_j=(w_{1j},w_{2j},...,w_{mj})$. We are also given a query $q=(w_{1q},w_{2q},...,w_{mq})$. In the lecture we saw that

$sim(q,d_j) = \sum^m_{i=1} \frac{w_{ij}}{|d_j|}\frac{w_{iq}}{|q|}$ .  (1)

Another way of looking at the information retrieval problem is the probabilistic approach. The probabilistic view of information retrieval consists of determining the conditional probability $P(q|d_j)$ that, for a given document $d_j$, the query posed by the user is $q$. In practice, when a query $q$ is given, probabilistic retrieval evaluates for each document how probable it is that the query is relevant to that document, and this yields a ranking of the documents.

In order to relate vector space retrieval to a probabilistic view of information retrieval, we interpret the weights in Equation (1) as follows:

-  $w_{ij}/|d_j|$ can be interpreted as the conditional probability $P(k_i|d_j)$ that for a given document $d_j$ the term $k_i$ is important (to characterize the document $d_j$).

-  $w_{iq}/|q|$ can be interpreted as the conditional probability $P(q|k_i)$ that for a given term $k_i$ the query posed by the user is $q$. Intuitively, $P(q|k_i)$ gives the amount of importance given to a particular term while querying.

With this interpretation you can rewrite Equation (1) as follows:

$sim(q,d_j) = \sum^m_{i=1} P(k_i|d_j)P(q|k_i)$ (2)
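
Equation (2) is a plain sum of products, which the following sketch evaluates with exact fractions. The probability vectors are hypothetical values chosen for illustration (they are not the ones used in the questions below), and the names `sim`, `d_a` and `d_b` are made up for this example.

```python
from fractions import Fraction as F


def sim(p_k_given_d, p_q_given_k):
    """Equation (2): sum over terms of P(k_i|d_j) * P(q|k_i)."""
    return sum(pd * pq for pd, pq in zip(p_k_given_d, p_q_given_k))


# Hypothetical values of P(k_i|d_j) for two documents over three terms
docs = {
    "d_a": (F(1, 2), F(1, 2), F(0)),
    "d_b": (F(0), F(1, 4), F(3, 4)),
}
# Hypothetical values of P(q|k_i)
query = (F(2, 5), F(0), F(4, 5))

scores = {name: sim(p, query) for name, p in docs.items()}
# Document names sorted by decreasing score
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d_b', 'd_a']
```

Here `d_a` scores 1/5 and `d_b` scores 3/5, so `d_b` is ranked first.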

### Question a
Show that indeed with the probabilistic interpretation of weights of vector space retrieval, as given in Equation (2), the similarity computation in vector space retrieval results exactly in the probabilistic interpretation of information retrieval, i.e., $sim(q,d_j)= P(q|d_j)$.
Assume that $d_j$ and $q$ are conditionally independent given $k_i$, i.e., $P(d_j \cap q|k_i) = P(d_j|k_i)P(q|k_i)$. You can assume the existence of joint probability density functions wherever required. (Hint: You might need to use Bayes' theorem.)

### Question b
Using the expression derived for $P(q|d_j)$ in (a), obtain a ranking (documents sorted in descending order of their scores) for the documents $P(k_i|d_1) = (0, 1/3, 2/3)$, $P(k_i|d_2) = (1/3, 2/3, 0)$, $P(k_i|d_3) = (1/2, 0, 1/2)$, and $P(k_i|d_4) = (3/4, 1/4, 0)$ and the query $P(q|k_i) = (1/5, 0, 2/3)$.