# Ranking evaluation metrics

In this lab you will implement ranking evaluation metrics.

For each query, a search engine returns a ranked list of documents, and we expect the relevant documents to appear at the top.
As in supervised learning, we need labeled data to evaluate quality.

Read these articles first:
http://queirozf.com/entries/evaluation-metrics-for-ranking-problems-introduction-and-examples
https://medium.com/swlh/rank-aware-recsys-evaluation-metrics-5191bba16832

## Binary relevance metrics

We start from the assumption that each document is either relevant or not.

#### recall, precision, f1

The first metrics are already familiar to you from the Introduction to Machine Learning course.
Implement recall, precision, and F1 for the top $k$ documents.

Here relevance is a list in which element $i$ is 1 if the document at position $i$ of the ranking is relevant and 0 otherwise.

In [15]:
import numpy as np
import math

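def recall_at_k(relevance, k):
    # Relevant documents found in the top k, divided by all relevant documents.
    tp_k = sum(1 for rel in relevance[:k] if rel == 1)
    tp = sum(1 for rel in relevance if rel == 1)
    if tp == 0:
        return 0.0  # no relevant documents: report 0 instead of dividing by zero
    return tp_k / tp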

def precision_at_k(relevance, k):
    relevance_k = relevance[:k]
    tp = 0
    for i in relevance_k:
        if i == 1:
            tp+=1
    precision = tp/k 
    return precision

def f1_at_k(relevance, k):
    recall = recall_at_k(relevance, k)
    precision = precision_at_k(relevance, k)
    if precision + recall == 0:
        return 0.0  # the harmonic mean is undefined here; report 0
    f1 = 2 * ((precision * recall)/(precision + recall))
    return f1


In [16]:
r = [1, 0, 1, 1, 0, 1, 0, 0]

print(recall_at_k(r, 1))
print(recall_at_k(r, 8))

print(precision_at_k(r, 1))
print(precision_at_k(r, 8))

print(f1_at_k(r, 1))
print(f1_at_k(r, 8))

0.25
1.0
1.0
0.5
0.4
0.6666666666666666
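
To see how precision and recall change with the cutoff, it can help to compute them for every $k$ at once. The following is a minimal sketch using a cumulative sum; the function name `precision_recall_at_all_k` is made up for this illustration and is not part of the lab template.

```python
import numpy as np

def precision_recall_at_all_k(relevance):
    """Return precision@k and recall@k for every cutoff k = 1..len(relevance)."""
    rel = np.asarray(relevance, dtype=float)
    hits = np.cumsum(rel)              # relevant documents seen in the top k
    ks = np.arange(1, len(rel) + 1)    # cutoffs 1..n
    precision = hits / ks
    total = rel.sum()
    # Recall is undefined without relevant documents; report zeros in that case.
    recall = hits / total if total > 0 else np.zeros_like(hits)
    return precision, recall

precision, recall = precision_recall_at_all_k([1, 0, 1, 1, 0, 1, 0, 0])
print(precision[0], recall[0])   # cutoff k = 1
print(precision[7], recall[7])   # cutoff k = 8
```

For this list the values at $k=1$ and $k=8$ agree with the outputs printed above: precision 1.0 and 0.5, recall 0.25 and 1.0.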


#### Average Precision
You can calculate the AP using the following algorithm:

<img src="http://queirozf.com/images/contents/mnc7sx1.png">

In [17]:
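def average_precision(relevance, K):
    cor_pred = 0  # relevant documents seen so far
    run_sum = 0   # running sum of precision@k taken at each relevant position
    for k, rel in enumerate(relevance[:K]):
        if rel == 1:
            cor_pred += 1
            run_sum += cor_pred / (k+1)
    tp = sum(1 for rel in relevance if rel == 1)  # all relevant documents in the list
    if tp == 0:
        return 0.0  # no relevant documents: avoid dividing by zero
    return run_sum / tp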

In [12]:
print(average_precision(r, 1))
print(average_precision(r, 8))

0.25
0.7708333333333333
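
As a check on the second value: the relevant documents in the list sit at positions 1, 3, 4 and 6, and there are 4 relevant documents in total, so the running sum of precisions is divided by 4.

$$AP@8 = \frac{1}{4}\left(\frac{1}{1}+\frac{2}{3}+\frac{3}{4}+\frac{4}{6}\right) = \frac{37}{48} \approx 0.7708$$

The first value follows the same way. Only position 1 is inside the cutoff, while this implementation still divides by all 4 relevant documents in the list, which gives $AP@1 = \frac{1}{4}\cdot\frac{1}{1} = 0.25$.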


## Relevance as a real number

#### DCG

DCG (discounted cumulative gain) doesn't require relevance to be binary. In many situations one document is more relevant than another, and we want to represent that in supervised evaluation. Relevance is often a grade from $\{0,1,2,3\}$, but $\{0,1\}$ also works.

The idea is that each relevant document brings a "gain" to the user, who looks through the documents starting from the first, so the gains accumulate. A relevant document is worth more near the top of the ranking, so the weight of each gain is discounted as the document position increases. Because the weights keep decreasing, it makes sense to compute this value only for the top $k$ documents of the ranking.

$$DCG@k = \sum_{i=1}^{k}\frac{2^{rel_i}-1}{\log_2(i+1)}$$
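
To make the formula concrete, the sketch below lists the gain, the discount, and the resulting contribution for each position separately. The helper name `dcg_terms` and the toy relevance list are assumptions made for this illustration only.

```python
import math

def dcg_terms(relevance, k):
    """List (position, gain, discount, contribution) for the top-k documents."""
    terms = []
    for i, rel in enumerate(relevance[:k], start=1):
        gain = 2 ** rel - 1            # numerator of the DCG term
        discount = math.log2(i + 1)    # denominator; grows with the position i
        terms.append((i, gain, discount, gain / discount))
    return terms

for position, gain, discount, contribution in dcg_terms([3, 2, 3, 0, 0], k=3):
    print(position, gain, round(discount, 3), round(contribution, 3))
```

Two documents with the same relevance 3 have the same gain of 7, but the one at position 1 contributes $7/1 = 7.0$ while the one at position 3 contributes only $7/2 = 3.5$. Summing the contributions gives DCG@3.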

In [18]:
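def dcg_at_k(relevance, k=10):
    DCG = 0
    # Slicing keeps the loop safe when the list has fewer than k documents.
    for i, rel in enumerate(relevance[:k]):
        # i is 0-based, so the document at position i+1 is discounted by log2(i+2)
        DCG += (2**rel - 1) / (math.log(i+2,2))
    return DCG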

In [14]:
r2 = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]

print(dcg_at_k(r, 1))
print(dcg_at_k(r, 8))

print(dcg_at_k(r2, 1))
print(dcg_at_k(r2, 10))

1.0
2.2868837451814152
7.0
16.802601047827448


#### nDCG
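Now the idea is to normalize DCG by its maximum attainable value: the DCG of the same documents sorted by decreasing relevance (the ideal ranking). The result lies in $[0, 1]$, which makes scores comparable across queries.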
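
Written as a formula in the notation of the DCG definition above (the symbols $IDCG$ and $rel^{*}_i$ are introduced here for illustration and do not appear in the lab text), where $rel^{*}_i$ is the $i$-th largest relevance label in the list:

$$nDCG@k = \frac{DCG@k}{IDCG@k}, \qquad IDCG@k = \sum_{i=1}^{k}\frac{2^{rel^{*}_i}-1}{\log_2(i+1)}$$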

In [19]:
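def ndcg_at_k(relevance, k=10):
    # Ideal DCG: the same labels sorted from most to least relevant.
    IDCG = dcg_at_k(sorted(relevance, reverse=True), k)
    DCG = dcg_at_k(relevance, k)
    if IDCG == 0:
        return 0.0  # no relevant documents at all
    return DCG / IDCG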

In [16]:
print(ndcg_at_k(r, 1))
print(ndcg_at_k(r, 8))

print(ndcg_at_k(r2, 1))
print(ndcg_at_k(r2, 10))

1.0
0.8927537907700458
1.0
0.8951337253357086


## Test on real data

You already have search engines for songs or news. Test the evaluation metrics on real data.

1. Choose a search query
2. Run the search
3. Manually inspect the top 10 results and judge whether each one is relevant
4. Calculate AP and DCG (relevance is either 0 or 1)

In [1]:

!pip3 install nltk
!pip3 install gensim
!pip3 install scikit-learn

from gensim.models.doc2vec import Doc2Vec
import numpy as np 
import re
import unicodedata
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kor19\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kor19\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kor19\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# normalize text
def normalize(text):
    # Strip accents and any remaining non-ASCII characters
    text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii')
    # The hyphen is escaped so that it is a literal and not the range from "+" to "/"
    text = re.sub(r"[”“,.:;()#%!?+\-/'@*]", "", text)
    return text.lower()

def normalize_query(text):
    # Same as normalize, but keeps "-" and "*" in the query
    text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii')
    text = re.sub(r"[”“,.:;()#%!?+/'@]", "", text)
    return text.lower()

def tokenize(text):
    return word_tokenize(text)

def lemmatization(tokens):
    wordnet_lemmatizer = WordNetLemmatizer()
    return [wordnet_lemmatizer.lemmatize(token) for token in tokens]

def remove_stop_word(tokens):
    # Build a new list: removing items from a list while iterating over it skips elements
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

def preprocess(text):
    text = normalize(text)
    tokens = tokenize(text)
    lemmed = lemmatization(tokens)
    clean = remove_stop_word(lemmed)
    return clean    



In [9]:

# Use a context manager so the file is closed after reading
with open('testdata_news_music_2084docs.txt', 'r') as f:
    news = f.readlines()

model = Doc2Vec.load('doc2vec.bin', mmap=None)
news_prep = [preprocess(words) for words in news]
sent_vecs = np.array([model.infer_vector(words) for words in news_prep])


In [10]:
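from sklearn import preprocessing

def norm_vectors(A):
    # L2-normalize every row so that dot products become cosine similarities
    scaler = preprocessing.Normalizer(norm='l2')
    return scaler.fit_transform(A)

def find_k_closest(query, dataset, k=5):
    # Score every vector by its (unnormalized) dot product with the query
    dotp = []
    for i, vector in enumerate(dataset):
        dotp.append((i, vector, np.dot(vector, query)))
    # Sort ascending by score, keep the last k, and reverse to get the best first
    sort_dotp = sorted(dotp, key=lambda x: x[2])[-k:][::-1]
    return sort_dotp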
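
The search above ranks by the raw dot product, so vectors with a large norm can outrank vectors that point in a more similar direction. The following is a minimal sketch of a cosine-similarity variant in plain NumPy, in case that behavior is unwanted. The function name `top_k_cosine` and the toy vectors are made up for this illustration and are not part of the lab code.

```python
import numpy as np

def top_k_cosine(query, dataset, k=5):
    """Return (index, cosine similarity) pairs for the k rows closest to query."""
    dataset = np.asarray(dataset, dtype=float)
    query = np.asarray(query, dtype=float)
    # L2-normalize the rows and the query so the dot product equals the cosine.
    row_norms = np.linalg.norm(dataset, axis=1, keepdims=True)
    unit_rows = dataset / np.where(row_norms == 0, 1.0, row_norms)
    query_norm = np.linalg.norm(query)
    unit_query = query / (query_norm if query_norm > 0 else 1.0)
    sims = unit_rows @ unit_query
    order = np.argsort(sims)[::-1][:k]   # indices by decreasing similarity
    return [(int(i), float(sims[i])) for i in order]

# Toy vectors, made up for illustration (not taken from the news dataset).
toy = np.array([[1.0, 0.0], [10.0, 10.0], [0.0, 2.0]])
print(top_k_cosine(np.array([1.0, 0.1]), toy, k=2))
```

In this toy example the raw dot product would rank the long vector `[10, 10]` first (11 against 1), while cosine similarity ranks `[1, 0]` first because it points almost in the same direction as the query.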

In [12]:
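query = "good mood"

query_vec = model.infer_vector(preprocess(query))
# Stored as "results" so the relevance list r defined earlier is not overwritten
results = find_k_closest(query_vec, sent_vecs)


print("Results for query:", query)
for idx, _, sim in results:
    print("\t", news[idx].rstrip("\n"), "sim=", sim)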

Results for query: good mood
	 ralph lauren ventured back the edwardian era with his fall collection romantic and quite natty place but the morning coats and other gentlemanly attire rendered almost entirely black and charcoal gray don come off masculine mournful instead lauren most elegant collection fuses menswear aesthetic completely feminine figure with nipped waists and lush lace some ruffles and embroidered detailing cutaway coat paired with tiered chiffon skirt for instance jacket with tails and morning suit style pant with cashmere and lace halter top the attitude subdued and restrained and unabashedly luxurious lauren evening attire especially glamorous yet quietly his black beaded numbers are subtle yet sexy the bob mackie show meanwhile was thematically tied broadway musicals corny and vampy that may sound and was slightly though overwhelming crowd pleaser that had the audience standing and applauding the show conclusion most the clothes the collection from pin striped suits

In [20]:
vector = [1,1,0,0,1]
print(average_precision(vector,len(vector)))
print(dcg_at_k(vector,len(vector)))

0.8666666666666667
2.017782560805999
