# Machine Learning metrics

## NLP

### Text summarization: ROUGE and BLEU

refs:
* https://towardsdatascience.com/automatic-text-summarization-evaluation-2e312f66893b

#### ROUGE: Recall-Oriented Understudy for Gisting Evaluation


ROUGE is a set of metrics for measuring the quality of an automatically produced summary or translation against one or more human-produced reference summaries or translations.

* Applications:
    * text summarization
    * language translation

$
Recall = \frac{\text{number of shared n-grams}}{\text{number of n-grams in reference summary}}
$

$
Precision = \frac{\text{number of shared n-grams}}{\text{number of n-grams in ML summary}}
$
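
The two ratios above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference ROUGE implementation: the helper names `ngrams` and `rouge_n` and the example sentences are made up for this note, and shared n-grams are counted with multiplicity (each n-gram counts at most as often as it appears in both texts), which is an assumption about how "shared" is meant.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, candidate, n=1):
    """Return (recall, precision) of n-gram overlap between two token lists."""
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    # Counter intersection keeps the minimum count of every shared n-gram.
    shared = sum((ref_counts & cand_counts).values())
    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())
    recall = shared / ref_total if ref_total else 0.0
    precision = shared / cand_total if cand_total else 0.0
    return recall, precision


# Hypothetical example sentences.
reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat today".split()

recall_1, precision_1 = rouge_n(reference, candidate, n=1)
recall_2, precision_2 = rouge_n(reference, candidate, n=2)
print(recall_1, precision_1)
print(recall_2, precision_2)
```

With these sentences, 5 of the 6 reference unigrams are shared (recall 5/6) while the candidate has 7 unigrams (precision 5/7); for bigrams the overlap drops to 3 of 5 and 3 of 6.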


#### BLEU: Bilingual Evaluation Understudy


BLEU is built on the idea that “the closer a machine translation is to a professional human translation, the better it is”.

It was one of the first metrics to claim a high correlation with human judgements of quality and remains one of the most popular automated and inexpensive metrics.


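BLEU is based on n-gram precision (the full metric additionally clips repeated n-grams and applies a brevity penalty):

$
Precision = \frac{\text{number of shared n-grams}}{\text{number of n-grams in ML output}}
$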


Applications:
* Language translation
* Transformer ["Attention Is All You Need"](https://arxiv.org/pdf/1706.03762.pdf) (the Transformer paper, which reports BLEU scores on translation tasks)



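```python
from nltk.translate.bleu_score import sentence_bleu

# One candidate sentence scored against a list of tokenized references.
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']

score = sentence_bleu(reference, candidate)

print(score)  # prints 1.0 (the candidate matches the first reference exactly)
```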


```python
from nltk.translate.bleu_score import corpus_bleu

# One list of references per candidate; here the corpus holds a single sentence.
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]

score = corpus_bleu(references, candidates)

print(score)  # prints 1.0
```


Problems with BLEU:

* It doesn’t consider meaning
* It doesn’t directly consider sentence structure
* It doesn’t handle morphologically rich languages well
* It doesn’t map well to human judgements



#### BLEU and ROUGE are complementary

* Neither metric takes word order into account beyond what the n-grams themselves capture.

* **BLEU measures precision**: how many of the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries.

* **ROUGE measures recall**: how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.



### Translation: METEOR


METEOR stands for Metric for Evaluation of Translation with Explicit ORdering; it is a metric for evaluating machine translation output.

The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also **"produce good correlation with human judgement at the sentence or segment level"**.
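
To make "recall weighted higher than precision" concrete, here is a small sketch of the weighted harmonic mean. The 9:1 weighting, F = 10PR / (R + 9P), is the one used in the original METEOR formulation and is brought in here as an assumption, since the text above does not state the weights. The function name `meteor_fmean` is made up for this note, and the sketch leaves out the unigram alignment and the fragmentation penalty of the full metric.

```python
def meteor_fmean(precision, recall):
    """Harmonic mean of unigram precision and recall, recall weighted 9:1.

    Assumption: weights taken from the original METEOR formulation,
    F = 10 * P * R / (R + 9 * P). Not the complete METEOR score.
    """
    denominator = recall + 9 * precision
    if denominator == 0:
        return 0.0
    return 10 * precision * recall / denominator


# Swapping precision and recall shows the asymmetry of the weighting.
high_recall = meteor_fmean(precision=0.5, recall=1.0)
high_precision = meteor_fmean(precision=1.0, recall=0.5)
print(high_recall, high_precision)
```

A candidate with perfect recall and mediocre precision scores much higher than the reverse case, which is the behaviour the description above refers to.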


## Recommendation systems




refs:
* https://medium.com/qloo/popular-evaluation-metrics-in-recommender-systems-explained-324ff2fb427d


* https://gab41.lab41.org/recommender-systems-its-not-all-about-the-accuracy-562c7dceeaff    




* **Recall@k** is defined as the percentage of true labels captured when we recommend k labels per example. Its exact definition is:


$
\text{Recall@k} = \frac{\text{number of true labels captured}}{\text{number of true labels}}
$

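* **Precision@k** is defined as the percentage of the k predictions that we get right. Its exact definition is:

$
\text{Precision@k} = \frac{\text{number of true labels captured}}{\text{number of predictions made}}
$

Example (1 means relevant). The numbers below follow when both lists are cut to their first k positions, a label counts as captured when `true` and `pred` are both 1 at that position, Precision@k divides by k, and Recall@k divides by the number of 1s in the first k entries of `true`:

```text
true =  1 1 1 0 1
pred =  1 0 1 0 1

Precision@3 = 2/3; Precision@4 = 2/4; Precision@5 = 3/5

Recall@3 = 2/3; Recall@4 = 2/3; Recall@5 = 3/4
```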
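
The worked example above can be reproduced with a short sketch. It encodes one reading of the example, stated here as an assumption: both lists are truncated to their first k positions, a label is captured when both lists hold a 1 at the same position, precision divides by k, and recall divides by the number of 1s in the truncated `true` list. The function name `precision_recall_at_k` is made up for this note. Other texts define Recall@k against all relevant items rather than the first k, which would give different numbers.

```python
def precision_recall_at_k(true, pred, k):
    """Return (Precision@k, Recall@k) under the truncation reading above."""
    true_k, pred_k = true[:k], pred[:k]
    # A label is captured when it is relevant and was predicted as relevant.
    captured = sum(1 for t, p in zip(true_k, pred_k) if t == 1 and p == 1)
    n_true = sum(true_k)
    precision = captured / k if k else 0.0
    recall = captured / n_true if n_true else 0.0
    return precision, recall


true = [1, 1, 1, 0, 1]
pred = [1, 0, 1, 0, 1]

results = {k: precision_recall_at_k(true, pred, k) for k in (3, 4, 5)}
for k, (p, r) in results.items():
    print(f"Precision@{k} = {p:.3f}, Recall@{k} = {r:.3f}")
```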

* **Diversity**: Diversity measures how dissimilar the recommended items are for a user. This dissimilarity is often determined using the items’ content (e.g. movie genres) but can also be determined using how similarly items are rated.

    * Work with **Cosine Distance** or **Jaccard Similarity**
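
One common way to turn this into a number is the average pairwise cosine distance between the items in a recommendation list. The sketch below assumes that reading; the function names and the genre vectors are hypothetical, and treating a zero vector as maximally distant is a choice made only for this example.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an all-zero vector is treated as fully dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def intra_list_diversity(item_vectors):
    """Average cosine distance over all pairs of recommended items."""
    pairs = list(combinations(item_vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


# Hypothetical genre vectors: [action, comedy, drama].
recommended = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]
diversity = intra_list_diversity(recommended)
print(diversity)
```

Two identical action movies contribute a distance of 0, while each action/comedy pair contributes 1, so the list scores 2/3.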
    
* **Coverage**: Coverage represents the percentage of things (items, users, or ratings) that the recommender system was able to recommend. Not being able to predict for a particular set of users or items is usually caused by an insufficient number of ratings, and is generally known as the cold start problem.

$
Coverage = \frac{n}{N}
$

where n is the number of things the system was able to recommend and N is the total number of such things.
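
As a sketch of the ratio above for the item case (often called catalog coverage), the snippet below counts how many distinct catalog items appear in at least one user's recommendation list. The function name `catalog_coverage`, the user ids, and the item ids are made up for illustration.

```python
def catalog_coverage(recommendations, catalog):
    """Fraction of catalog items that appear in at least one recommendation list.

    recommendations: dict mapping a user id to a list of recommended item ids.
    catalog: iterable of all item ids.
    """
    catalog_items = set(catalog)
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    n = len(recommended & catalog_items)  # items the system managed to recommend
    N = len(catalog_items)                # all items in the catalog
    return n / N if N else 0.0


# Hypothetical data: five catalog items, two users.
catalog = ["a", "b", "c", "d", "e"]
recommendations = {"user_1": ["a", "b"], "user_2": ["b", "c"]}
coverage = catalog_coverage(recommendations, catalog)
print(coverage)
```

Here three of the five items are ever recommended, so the coverage is 0.6; items "d" and "e" never surface.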

* **Serendipity**: Serendipity measures how surprising the successful or relevant recommendations are.


## Machine learning

### Regression


TODO:

1. Comparison with the best constant predictor
1. Comparison with random guessing

### Classifier


* Precision and recall curves, APC
* ROC and AUC

### Multi-class classifier