# Manual evaluation questions

### 1

Filename: 

**Paragraph**: On 26 March 2010, the European Council agreed to the Commission’s proposal to launch a new strategy ‘Europe 2020’. One of the priorities of the Europe 2020 strategy is sustainable growth to be achieved by promoting a more resource-efficient, more sustainable and more competitive economy. That strategy put energy infrastructures at the forefront as part of the flagship initiative ‘Resource efficient Europe’, by underlining the need to urgently upgrade Europe’s networks, interconnecting them at the continental level, in particular to integrate renewable energy sources.

**Question**: What is Europe 2020?

# Evaluation notes

In this notebook we record the results of tests performed with different configurations of the LangChain chain.

Chunk sizes (chunk size/overlap):
- 256/32
    - higher retriever F1 but lower BERTScore
- 1024/256
    - more surrounding context per chunk
    - lower retriever F1 but higher BERTScore
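
To make the chunk size/overlap trade-off concrete, here is a minimal sketch of a sliding-window splitter. It is a simplified stand-in, not the splitter used in the chain: `split_with_overlap` is a hypothetical helper, and it assumes that both numbers are measured in characters.

```python
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows; consecutive windows share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Dummy text (assumption: 2048 characters, for illustration only).
text = "x" * 2048
small = split_with_overlap(text, chunk_size=256, overlap=32)
large = split_with_overlap(text, chunk_size=1024, overlap=256)
print(len(small), len(large))
```

Smaller chunks produce more, shorter retrieval candidates for the same text, which fits the pattern above: the relevant passage is easier to hit, but each hit carries less context for the answer.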

Embedding Models:

**`BAAI/bge-large-en-v1.5` with 5 retrieved documents**
- retriever_recall: 0.69 -> 69% of the relevant documents are retrieved
- retriever_precision: 0.138 -> 13.8% of the retrieved documents are relevant
    - this is likely because the oracle links only one ground-truth (GT) document per question, so we are searching for a needle in a haystack, and the other retrieved sources do not count as ground truth
    - however, more than one chunk from the same EUR-Lex document may be retrieved, which would increase the precision
- retriever_f1: 0.23
- answer_bertscore_f1: 0.85

**`BAAI/bge-large-en-v1.5` with 20 retrieved documents**
- retriever_recall: 0.8 -> unsurprisingly, more relevant documents are retrieved
- retriever_precision: 0.04 -> a smaller share of the retrieved documents is relevant
- retriever_f1: 0.08
- answer_bertscore_f1: 0.84 -> more documents do not necessarily mean better results
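
The retriever metrics above can be sketched per question as follows. This is an illustration under assumptions, not the evaluation code of the pipeline: `retrieval_scores` and the document IDs are hypothetical, and each retrieved chunk is represented by the ID of the document it came from.

```python
def retrieval_scores(retrieved_ids, relevant_ids):
    """Return (precision, recall, F1) for one question.

    retrieved_ids: source-document ID of each retrieved chunk (duplicates allowed).
    relevant_ids:  ground-truth document IDs for the question.
    """
    relevant = set(relevant_ids)
    # Precision counts every retrieved chunk that comes from a relevant document.
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    # Recall counts how many distinct relevant documents were found.
    found = len(relevant & set(retrieved_ids))
    recall = found / len(relevant) if relevant else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# One GT document, five retrieved chunks, one of them from the GT document.
single_hit = retrieval_scores(["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"], ["doc_c"])
# Same question, but two of the five chunks come from the GT document.
double_hit = retrieval_scores(["doc_c", "doc_b", "doc_c", "doc_d", "doc_e"], ["doc_c"])
print(single_hit, double_hit)
```

With a single GT document and five distinct retrieved documents, precision cannot exceed 0.2 even on a hit, which is consistent with the low precision reported above (0.69 / 5 = 0.138 and 0.8 / 20 = 0.04).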

-> @MrWhatsItToYaa and @psaegert found that llama2 can already answer many questions without a retriever.
-> From now on, we therefore compare the answer without RAG to the answer of the full pipeline.
-> The score difference between the two answers is the contribution of the retriever to the answer.
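
A minimal sketch of this comparison, assuming one answer score per question for each variant (for example a BERTScore F1). The helper `retriever_gain` and the numbers are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean


def retriever_gain(rag_scores, no_rag_scores):
    """Mean per-question difference between RAG and no-RAG answer scores."""
    if len(rag_scores) != len(no_rag_scores):
        raise ValueError("both score lists must cover the same questions")
    return mean(rag - no_rag for rag, no_rag in zip(rag_scores, no_rag_scores))


# Hypothetical scores for three questions (not measured values).
rag_scores = [0.86, 0.84, 0.88]
no_rag_scores = [0.83, 0.84, 0.85]
gain = retriever_gain(rag_scores, no_rag_scores)
print(round(gain, 3))
```

A gain close to zero would mean that the model answers about as well without the retrieved context.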

## Default configuration (unless stated otherwise in a Test section)
Model: llama2 \
preprocessing_steps: [remove_html_tags_preprocessor] \
splitter: recursive_character_test_splitter \
embeddings: GPT4AllEmbeddings \
retriever: FAISS \
chat_history_window_size: 2

## Test 1

With the default settings, the retriever recall improves for smaller chunk sizes. We get 0.69 for chunk size 1000 (overlap 256), 0.79 for chunk size 256 (overlap 32), and 0.86 for chunk size 100 (overlap 10).

Unfortunately, the answer_bertscore_f1 drops for smaller chunk sizes: 0.847, 0.841, and 0.833, respectively.

To counteract this, we next increase the number of retrieved documents for smaller chunk sizes.

## Test 2

For larger numbers of retrieved documents, the retriever recall rises as expected. For a chunk size of 100 (overlap 10) and 50 retrieved documents, we get 99% retriever recall and 1.98% retriever precision. The drop in precision is expected. The answer_bertscore_f1 is 83% for these settings.

## Test 3

Using a chunk size of 500 (overlap 50) with 50 retrieved documents, which results in an enormous prompt, does not make a real difference: retriever_recall is 95%, retriever_precision is 1.9%, retriever_f1 is 3.73%, and answer_bertscore_f1 is 83.46%.
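
The precision values in Test 2 and Test 3 follow directly from the recall when each question has exactly one GT document and every retrieved chunk counts once. The sketch below checks this arithmetic; `expected_precision` and `f1_score` are hypothetical helpers, and the single-GT-document setup is an assumption carried over from the evaluation notes.

```python
def expected_precision(recall, k, relevant_per_query=1):
    """Mean precision when at most `relevant_per_query` of the k retrieved items can be hits."""
    return recall * relevant_per_query / k


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Recall values and k as reported in Test 2 and Test 3.
for label, recall, k in [("Test 2", 0.99, 50), ("Test 3", 0.95, 50)]:
    precision = expected_precision(recall, k)
    print(label, round(precision, 4), round(f1_score(precision, recall), 4))
```

Under this assumption, precision is bounded by 1/k, so retrieving more documents can only lower it; the answer score is the more informative signal when comparing these configurations.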

---

## Metrics

### Silver Data

| Config | `A_BERT_RAG_F1` | `A_BERT_RAG_DELTA_F1` | `A_BERT_F1` | `A_AR_F1` | `A_BL_BL` | `A_RGL_SUM` | `RET_AP_1` | `RET_AP_10` | `RET_RC_1` | `RET_RC_10` | `RET_RR` | `RET_NDCG` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `256_10_nlc_bge_fn_mistral.yaml` | 0.867 | 0.026 | 0.841 | 0.909 | 0.07 | 0.227 | 0.164 | 0.176 | 0.164 | 0.224 | 0.176 | 0.164 |
| `256_10_nlc_bge_fn_mistral_enrich.yaml` | 0.866 | 0.023 | 0.843 | 0.891 | 0.063 | 0.206 | 0.119 | 0.146 | 0.119 | 0.209 | 0.146 | 0.119 |
| `sem_40_nlc_bge_fn_mistral.yaml` | 0.861 | 0.021 | 0.84 | 0.839 | 0.065 | 0.196 | 0.075 | 0.089 | 0.075 | 0.119 | 0.089 | 0.075 |
| `256_10_nlc_bge_fn_mixtral.yaml` | 0.861 | 0.018 | 0.843 | 0.914 | 0.06 | 0.193 | 0.164 | 0.176 | 0.164 | 0.224 | 0.176 | 0.164 |
| `512_10_nlc_bge.yaml` | 0.852 | 0.024 | 0.828 | 0.915 | 0.053 | 0.162 | 0.075 | 0.087 | 0.075 | 0.134 | 0.087 | 0.075 |
| `256_10_bge.yaml` | 0.851 | 0.026 | 0.825 | 0.912 | 0.047 | 0.143 | 0.179 | 0.216 | 0.179 | 0.284 | 0.216 | 0.179 |
| `256_10_nlc_bge_fn_enrich.yaml` | 0.849 | 0.023 | 0.826 | 0.919 | 0.051 | 0.144 | 0.119 | 0.146 | 0.119 | 0.209 | 0.146 | 0.119 |
| `1024_5_nlc_bge.yaml` | 0.849 | 0.021 | 0.828 | 0.92 | 0.037 | 0.141 | 0.045 |  | 0.045 |  | 0.06 | 0.045 |
| `256_10_g4a.yaml` | 0.848 | 0.023 | 0.825 | 0.918 | 0.042 | 0.129 | 0.09 | 0.132 | 0.09 | 0.254 | 0.132 | 0.09 |
| `256_10_nlc_bge_fn.yaml` | 0.847 | 0.019 | 0.828 | 0.923 | 0.046 | 0.142 | 0.164 | 0.176 | 0.164 | 0.224 | 0.176 | 0.164 |


### Bronze Data

| Config | `A_BERT_RAG_F1` | `A_BERT_RAG_DELTA_F1` | `A_BERT_F1` | `A_AR_F1` | `A_BL_BL` | `A_RGL_SUM` | `RET_AP_1` | `RET_AP_10` | `RET_RC_1` | `RET_RC_10` | `RET_RR` | `RET_NDCG` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `256_10_nlc_bge_fn_mistral.yaml` | 0.86 | 0.02 | 0.84 | 0.89 | 0.052 | 0.169 | 0.568 | 0.664 | 0.568 | 0.852 | 0.664 | 0.568 |
| `256_10_nlc_bge_fn_mistral_enrich.yaml` | 0.858 | 0.018 | 0.84 | 0.91 | 0.05 | 0.172 | 0.5 | 0.623 | 0.5 | 0.865 | 0.623 | 0.5 |
| `256_10_nlc_bge_fn_mixtral.yaml` | 0.854 | 0.014 | 0.84 | 0.906 | 0.043 | 0.149 | 0.568 | 0.664 | 0.568 | 0.852 | 0.664 | 0.568 |
| `1024_5_nlc_bge.yaml` | 0.848 | 0.019 | 0.829 | 0.907 | 0.038 | 0.13 | 0.25 |  | 0.25 |  | 0.353 | 0.25 |
| `sem_40_nlc_bge_fn_mistral.yaml` | 0.847 | 0.009 | 0.838 | 0.868 | 0.03 | 0.119 | 0.359 | 0.44 | 0.359 | 0.621 | 0.44 | 0.359 |
| `1024_10_nlc_bge.yaml` | 0.847 | 0.019 | 0.828 | 0.903 | 0.048 | 0.132 | 0.25 | 0.366 | 0.25 | 0.605 | 0.366 | 0.25 |
| `512_10_nlc_bge.yaml` | 0.845 | 0.021 | 0.824 | 0.912 | 0.035 | 0.119 | 0.398 | 0.522 | 0.398 | 0.777 | 0.522 | 0.398 |
| `1024_1_nlc_bge.yaml` | 0.843 | 0.015 | 0.828 | 0.905 | 0.032 | 0.117 | 0.25 |  | 0.25 |  | 0.25 | 0.25 |
| `512_5_nlc_bge.yaml` | 0.843 | 0.018 | 0.824 | 0.915 | 0.033 | 0.121 | 0.398 |  | 0.398 |  | 0.511 | 0.398 |
| `256_10_nlc_bge_fn.yaml` | 0.841 | 0.016 | 0.825 | 0.916 | 0.033 | 0.107 | 0.568 | 0.664 | 0.568 | 0.852 | 0.664 | 0.568 |
