# Ranked Information Retrieval
In this exercise, we will use different strategies to vectorize text and compare text similarity. We run our experiments on the 'Learned' genre of the Brown corpus because it is reasonably large (around 182k tokens) and draws on a diverse range of sources (source information for the different Brown genres can be found [here](http://clu.uni.no/icame/manuals/BROWN/INDEX.HTM)).

In [None]:
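# imports first; if the corpus is not installed yet, run nltk.download('brown') once
from nltk.corpus import brown
from tqdm.notebook import tqdm

# tokenized sentences and the flat token list of the 'learned' genre
all_sents = brown.sents(categories='learned')
all_words = brown.words(categories='learned')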

print('token num', len(all_words))
print('sentence num', len(all_sents))

## Task 1: Build vocabulary
Extract all types (distinct word forms) to build the vocabulary list. Consider whether to apply text normalization techniques such as stop-word removal, stemming, or lemmatization. If you are not sure which ones to use, try different combinations and compare their performance in the downstream tasks. Name the resulting vocabulary list **vocab**; we will use it in later tasks.
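
As a rough illustration of how these normalization choices change the vocabulary, the sketch below builds a sorted list of types from a toy token list. The helper name `toy_vocab`, the toy tokens, and the tiny stop-word set are all invented for this example; it is not the intended solution for `get_vocab`, and stemming and lemmatization are left out.

```python
def toy_vocab(words, lower=True, stopwords=frozenset()):
    """Return a sorted list of unique types (hypothetical helper, illustration only)."""
    types = set()
    for w in words:
        if lower:
            w = w.lower()      # merge 'The' and 'the' into one type
        if w in stopwords:
            continue           # drop stop words (checked after optional lowercasing)
        types.add(w)
    return sorted(types)

toy_words = ['The', 'cell', 'divides', 'and', 'the', 'Cell', 'grows']
print(len(toy_vocab(toy_words, lower=False)))          # case-sensitive types
print(toy_vocab(toy_words))                            # lowercased types
print(toy_vocab(toy_words, stopwords={'the', 'and'}))  # lowercased, stop words removed
```

On this toy input, lowercasing shrinks the vocabulary from 7 types to 5, and removing the two stop words leaves 3. The same kind of comparison on the real corpus can help you decide which normalization steps to keep.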

In [None]:
def get_vocab(all_words, stemmer=False, stopremover=False, lemmatizer=False, lower=True):
    # this function should return a list in which each element is a unique type
    # by default only lowercasing is applied (lower=True); all other normalization is off
    pass

vocab = get_vocab(all_words)

## Task 2: Find similar sentences with Jaccard coefficient
Given an arbitrary sentence in the genre, we would like to find the N sentences most similar to it. In the function below, use the Jaccard coefficient to find the top N similar sentences. **NOTE**: if you applied text normalization techniques when building the vocabulary (Task 1), apply the same ones when computing the Jaccard coefficients.
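
For reference, the Jaccard coefficient of two sets $A$ and $B$ is $J(A, B) = |A \cap B| / |A \cup B|$, where here $A$ and $B$ are the sets of types in the two sentences. The sketch below applies it to two toy sentences. The helper name `toy_jaccard`, the lowercasing step, and the convention of returning 0 for two empty sets are assumptions made for this example, not part of the exercise specification.

```python
def toy_jaccard(tokens_a, tokens_b):
    """Jaccard coefficient of two token lists (hypothetical helper, illustration only)."""
    set_a = {t.lower() for t in tokens_a}
    set_b = {t.lower() for t in tokens_b}
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention: two empty sentences have similarity 0
    return len(set_a & set_b) / len(union)

s1 = ['The', 'cell', 'divides']
s2 = ['the', 'cell', 'grows']
print(toy_jaccard(s1, s2))  # 2 shared types out of 4 distinct types -> 0.5
```

Because the coefficient works on sets, repeated words in a sentence do not change the score; only the presence or absence of each type matters.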

In [None]:
def get_jscore(idx1, idx2, all_sents, vocab):
    # measure similarity between all_sents[idx1] and all_sents[idx2] using the Jaccard coefficient
    pass

def jaccard_scores(target_sent_idx, all_sents, vocab):
    # score every sentence (including the target itself) against the target sentence
    similarity_scores = []
    for i in range(len(all_sents)):
        similarity_scores.append(get_jscore(target_sent_idx, i, all_sents, vocab))
    return similarity_scores

# given the similarity scores, print the N most similar sentences
target_idx = 123 # feel free to change this to your favourite index
print('===Target Sentence===')
print(' '.join(all_sents[target_idx]))

jsim_scores = jaccard_scores(target_idx, all_sents, vocab)
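top_n = 10 # how many most similar sentences you want
# rank sentence indices by score; looking scores up with list.index() would
# print the same sentence repeatedly whenever several sentences tie
ranked_idx = sorted(range(len(jsim_scores)), key=lambda j: jsim_scores[j], reverse=True)
for i, sent_idx in enumerate(ranked_idx[:top_n]):
    print('\n===Similar sentence no.{}==='.format(i+1))
    print(' '.join(all_sents[sent_idx]))
    print('similarity score:', jsim_scores[sent_idx])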

## Task 3: Find similar sentences with cosine of TF vectors 
Unlike the last task, which uses Jaccard coefficients to measure sentence similarity, here we turn to the cosine of term-frequency (tf) vectors. Try both the raw-tf version and the logarithmic-tf version.
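
To make the two weightings concrete, the sketch below builds raw-tf and log-tf vectors for two toy sentences and compares their cosine similarity, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. It assumes one common log-tf variant, $1 + \log_{10}(\text{tf})$ for $\text{tf} > 0$ and $0$ otherwise; other variants exist. The names `toy_tf_vec`, `toy_cosine`, and `toy_terms` are invented for this example, and returning 0 for an all-zero vector is an assumed convention.

```python
import math
from collections import Counter

def toy_tf_vec(tokens, vocab, log_tf=False):
    """One tf weight per vocabulary entry (hypothetical helper, illustration only)."""
    counts = Counter(t.lower() for t in tokens)
    vec = []
    for term in vocab:
        tf = counts[term]
        if log_tf and tf > 0:
            tf = 1 + math.log10(tf)  # assumed log-tf variant: 1 + log10(tf)
        vec.append(float(tf))
    return vec

def toy_cosine(v1, v2):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumed convention)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

toy_terms = ['cell', 'divides', 'grows', 'the']
s1 = ['the', 'cell', 'the', 'cell', 'divides']
s2 = ['the', 'cell', 'grows']

raw_cos = toy_cosine(toy_tf_vec(s1, toy_terms), toy_tf_vec(s2, toy_terms))
log_cos = toy_cosine(toy_tf_vec(s1, toy_terms, log_tf=True),
                     toy_tf_vec(s2, toy_terms, log_tf=True))
print('raw tf cosine:', round(raw_cos, 4))
print('log tf cosine:', round(log_cos, 4))
```

Log scaling dampens the effect of repeated terms, so the two scores differ as soon as a sentence contains a word more than once. Note that plain Python lists over the full vocabulary are slow and memory-hungry for roughly 7k sentences; a sparse representation is one option worth considering.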

In [None]:
def get_cosine(vec1, vec2):
    # given two vectors, this function should return their cosine similarity
    pass

def build_tf_vecs(all_sents, vocab):
    # this function should return, for each sentence, a vector representation
    pass 

def get_cos_scores(target_idx, all_vecs):
    tf_scores = []
    for i in tqdm(range(len(all_vecs)), desc='computing cosine similarities'):
        ss = get_cosine(all_vecs[i], all_vecs[target_idx])
        tf_scores.append(ss)
    return tf_scores

# given the similarity scores, print the N most similar sentences
target_idx = 123 # feel free to change this to your favourite index
print('\n===Target Sentence===')
print(' '.join(all_sents[target_idx]))

all_vecs = build_tf_vecs(all_sents, vocab)
tfcos_scores = get_cos_scores(target_idx, all_vecs)
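top_n = 10 # how many most similar sentences you want
# rank sentence indices by score; looking scores up with list.index() would
# print the same sentence repeatedly whenever several sentences tie
ranked_idx = sorted(range(len(tfcos_scores)), key=lambda j: tfcos_scores[j], reverse=True)
for i, sent_idx in enumerate(ranked_idx[:top_n]):
    print('\n===Similar sentence no.{}==='.format(i+1))
    print(' '.join(all_sents[sent_idx]))
    print('similarity score:', tfcos_scores[sent_idx])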

## Task 4: Find similar sentences with cosine of TF-IDF vectors 
Use the cosine similarity of tf-idf vectors to find similar sentences.
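
As a small worked example, the sketch below computes tf-idf vectors for three toy sentences, treating each sentence as a document. It assumes one common weighting, $w_{t,d} = \text{tf}_{t,d} \cdot \log_{10}(N / \text{df}_t)$, where $N$ is the number of sentences and $\text{df}_t$ is the number of sentences containing term $t$; other tf-idf variants are equally valid. The names `toy_tfidf_vecs`, `toy_sents`, and `toy_terms` are invented for this example, and the tokens are assumed to be lowercased already.

```python
import math
from collections import Counter

def toy_tfidf_vecs(sents, vocab):
    """tf-idf vector per sentence (hypothetical helper, illustration only)."""
    n_docs = len(sents)
    # document frequency: in how many sentences does each term occur?
    df = Counter()
    for sent in sents:
        df.update(set(sent))
    # assumed idf variant: log10(N / df); terms that never occur get idf 0
    idf = {t: math.log10(n_docs / df[t]) if df[t] else 0.0 for t in vocab}
    vecs = []
    for sent in sents:
        counts = Counter(sent)
        vecs.append([counts[t] * idf[t] for t in vocab])
    return vecs

toy_sents = [['the', 'cell', 'divides'],
             ['the', 'cell', 'grows'],
             ['the', 'plant', 'grows']]
toy_terms = ['cell', 'divides', 'grows', 'plant', 'the']

toy_vecs = toy_tfidf_vecs(toy_sents, toy_terms)
for vec in toy_vecs:
    print([round(x, 3) for x in vec])
```

In this toy corpus 'the' occurs in every sentence, so its idf is $\log_{10}(3/3) = 0$ and it contributes nothing to the cosine, while rarer terms such as 'divides' and 'plant' receive the largest weights.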

In [None]:
def build_tfidf_vecs(all_sents, vocab):
    # this function should return, for each sentence, a tf-idf vector representation
    pass 

# given the similarity scores, print the N most similar sentences
target_idx = 123 # feel free to change this to your favourite index
print('\n===Target Sentence===')
print(' '.join(all_sents[target_idx]))

all_vecs = build_tfidf_vecs(all_sents, vocab)
tfidf_cos_scores = get_cos_scores(target_idx, all_vecs)
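top_n = 10 # how many most similar sentences you want
# rank sentence indices by score; looking scores up with list.index() would
# print the same sentence repeatedly whenever several sentences tie
ranked_idx = sorted(range(len(tfidf_cos_scores)), key=lambda j: tfidf_cos_scores[j], reverse=True)
for i, sent_idx in enumerate(ranked_idx[:top_n]):
    print('\n===Similar sentence no.{}==='.format(i+1))
    print(' '.join(all_sents[sent_idx]))
    print('similarity score:', tfidf_cos_scores[sent_idx])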