# Archat Evaluation

## References Extraction

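To evaluate our approach, we manually recorded the number of references in each of the 10 articles and compared the OpenAI o1 and DeepSeek-R1 models on accuracy, defined as the number of generated references divided by the number of ground-truth references in the articles. The cells below apply a stricter per-article criterion: an article counts as correct only when the extracted reference count equals the ground-truth count exactly, and the reported accuracy is the share of such articles.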

In [1]:
import pandas as pd
from prompts import *
from helpers import *
import json

articles_file = "RE_articles.csv"
articles_df = pd.read_csv(articles_file, delimiter=";", header=None)
articles_df.columns = ['patharticle']

def extract_references_count(article_text, llm_name):
    references_json = extract_references_with_prompts(article_text, llm_name)
    
    if isinstance(references_json, str):
        try:
            references_data = json.loads(references_json)
            
            if "references" in references_data and isinstance(references_data["references"], list):
                titles_count = sum(1 for ref in references_data["references"] if "title" in ref)
                print("titles_count",titles_count)
                return titles_count
            else:
                return 0
        except json.JSONDecodeError:
            return 0
    else:
        return 0

num_references_list = []

for article_path in articles_df['patharticle']:
    article_text = get_txt_content(article_path)
    llm_name = "deepseek-r1:1.5b"
    
    num_references = extract_references_count(article_text, llm_name)
    num_references_list.append(num_references)

articles_df['num_references'] = num_references_list

articles_df.to_csv("RE_ours_references.csv", index=False, sep=";")


In [2]:
ground_truth_file = "RE_ground_truth_references.csv"
ground_truth = pd.read_csv(ground_truth_file, delimiter=";", header=None)
ground_truth.columns = ['patharticle', 'num_references']  

def evaluate_reference_count(extracted_counts, ground_truth_counts):
    correct_count = 0
    total_articles = len(extracted_counts)

    for i in range(total_articles):
        if extracted_counts[i] == ground_truth_counts[i]:
            correct_count += 1

    accuracy = correct_count / total_articles
    return accuracy

ground_truth_counts = []
extracted_counts = articles_df['num_references'].tolist()

for article_path in articles_df['patharticle']:
    ground_truth_article = ground_truth[ground_truth['patharticle'] == article_path]
    ground_truth_count = ground_truth_article['num_references'].values[0]  # Number of references for this article
    ground_truth_counts.append(ground_truth_count)

accuracy = evaluate_reference_count(extracted_counts, ground_truth_counts)

print(f"Accuracy of the extracted reference count: {accuracy * 100:.2f}%")

Accuracy of the extracted reference count: 7.14%
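
Since `extract_references_count` returns 0 whenever `json.loads` raises, any response that surrounds the JSON with extra text (a reasoning preamble, a Markdown fence) is scored as zero references. The notebook does not show the raw model output, so whether that happened here is an open question. The following is a minimal sketch of a more tolerant parser, assuming the response contains a single JSON object; `count_titles_tolerant` is a hypothetical helper, not part of `helpers`, and the sample response is invented.

```python
import json


def count_titles_tolerant(response_text):
    """Count reference entries with a "title" key, tolerating text around the JSON.

    Hypothetical helper: tries a strict parse first, then falls back to the
    span between the first "{" and the last "}" in the response.
    """
    if not isinstance(response_text, str):
        return 0

    candidates = [response_text]
    start, end = response_text.find("{"), response_text.rfind("}")
    if start != -1 and end > start:
        candidates.append(response_text[start:end + 1])

    for candidate in candidates:
        try:
            data = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        references = data.get("references") if isinstance(data, dict) else None
        if isinstance(references, list):
            return sum(1 for ref in references
                       if isinstance(ref, dict) and "title" in ref)
    return 0


# Invented response with a preamble before the JSON payload.
sample = 'Here is the list:\n{"references": [{"title": "A"}, {"title": "B"}, {"year": 2014}]}'
print(count_titles_tolerant(sample))
```

The fallback is deliberately crude: if the surrounding text itself contains braces, the extracted span will not parse and the function still returns 0, so it narrows the failure mode rather than removing it.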


In [8]:
import pandas as pd

csv_references = pd.read_csv("RE_ground_truth_references.csv", header=None, names=["path_expected", "num_references_expected"], delimiter=";")
csv_observations = pd.read_csv("RE_o1_references.csv", header=None, names=["path_observed", "num_references_observed"], delimiter=";")

merged = pd.merge(csv_references, csv_observations, left_on="path_expected", right_on="path_observed", suffixes=("_expected", "_observed"))

merged["match"] = merged["num_references_expected"] == merged["num_references_observed"]

match_count = merged["match"].sum()

accuracy = match_count / len(merged)

# o1 results
print(f"GPT_References_Accuracy: {accuracy:.2%}")

GPT_References_Accuracy: 50.00%
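
Exact-match accuracy treats an article that is off by one reference the same as one where nothing was extracted. The ratio from the introduction (generated references / ground-truth references) gives a finer view. Below is a minimal sketch of both metrics on the same inputs, using plain dictionaries keyed by article path; the function names and the counts are illustrative and are not taken from the CSV files.

```python
def exact_match_accuracy(expected, observed):
    """Share of articles whose extracted count equals the ground-truth count."""
    if not expected:
        return 0.0
    hits = sum(1 for path, n in expected.items() if observed.get(path) == n)
    return hits / len(expected)


def count_ratio(expected, observed):
    """Total generated references divided by total ground-truth references.

    Can exceed 1.0 when the model produces more references than exist.
    """
    total_expected = sum(expected.values())
    if total_expected == 0:
        return 0.0
    total_observed = sum(observed.get(path, 0) for path in expected)
    return total_observed / total_expected


# Made-up counts for three articles.
expected = {"a.txt": 20, "b.txt": 30, "c.txt": 50}
observed = {"a.txt": 20, "b.txt": 27, "c.txt": 0}

print(f"exact match: {exact_match_accuracy(expected, observed):.2%}")
print(f"count ratio: {count_ratio(expected, observed):.2%}")
```

Because the ratio is aggregated over all articles, over-generation in one article can offset under-generation in another, so it is best read alongside the exact-match score rather than in place of it.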
