# Result Analysis

In [1]:
models = ["gpt-3.5-turbo-0125", "gpt-4.1-nano-2025-04-14", "gpt-4.1-mini-2025-04-14", "gpt-4.1-2025-04-14", "llama3.1.70b", "llama4.scout"]
chunkings = ["256_20", "1024_20"]
only_texts = [False, True]

In [2]:
%run ../scripts/load_df_for_analysis.py

In [3]:
%run ../scripts/df_calculations.py

## Gather Results: Model Comparison

In [4]:
all_results = {}

In [5]:
def gather_results(chunking, only_text, model, ai_prompt=False):
    """Evaluate one configuration and store its metrics in all_results."""
    df = load_df_for_analysis(chunking, only_text, model, ai_prompt)

    # Evaluate twice, toggling include_relabelled_partially.
    results = eval_predictions(df, include_relabelled_partially=True)
    results_without_relabelled = eval_predictions(df, include_relabelled_partially=False)

    # Key format: [only_text_]<chunking>[_AI_prompt], e.g. "only_text_256_20_AI_prompt".
    key = f"{'only_text_' if only_text else ''}{chunking}{'_AI_prompt' if ai_prompt else ''}"

    # all_results is mutated, never rebound, so no `global` statement is needed.
    all_results.setdefault(model, {})[key] = {
        "all": results,
        "without_relabelled": results_without_relabelled,
    }
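
The key under which each configuration is stored can be hard to read from the nested f-string. The following minimal sketch reproduces the same naming scheme in a standalone helper; `result_key` is a hypothetical name that does not exist in the notebook or its scripts.

```python
def result_key(chunking: str, only_text: bool, ai_prompt: bool = False) -> str:
    """Build a configuration key in the same format used by gather_results.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    prefix = "only_text_" if only_text else ""
    suffix = "_AI_prompt" if ai_prompt else ""
    return f"{prefix}{chunking}{suffix}"


# Print the keys for the four standard configurations of one model.
for chunking in ["256_20", "1024_20"]:
    for only_text in [False, True]:
        print(result_key(chunking, only_text))
```

With the AI-improved prompt, the same call with `ai_prompt=True` appends `_AI_prompt` to the key.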

### Hyperparameters for all evaluations:
- The PDF text of each reference was extracted with the large GROBID model
- Ollama indexing for choosing the best chunks: embeddings from OpenAI's text-embedding-3-large, chunking with SentenceSplitter, and retrieval of the top 3 matching chunks for each statement
- The model temperature is set to 0
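
To make the "top 3 matching chunks" step concrete, here is a minimal sketch of top-k retrieval over embedding vectors. It assumes cosine similarity as the ranking score, which the notebook does not state, and it uses tiny hand-written vectors instead of real text-embedding-3-large embeddings. The names `cosine_similarity` and `top_k_chunks` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(statement_vec, chunk_vecs, k=3):
    """Return the indices of the k chunks most similar to the statement."""
    scored = [(cosine_similarity(statement_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-dimensional "embeddings", for illustration only.
statement = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5], [1.0, 0.0], [-1.0, 0.0]]
print(top_k_chunks(statement, chunks))  # indices of the three best-matching chunks
```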

### Hyperparameters evaluated here:
- only_text: if "True", only the scientific text is extracted via code from the TEI file produced by GROBID; otherwise the whole text of the TEI document (including sources, authors, ...) is included when choosing the best chunks
- chunking: the chunk size is set to either 256 or 1024; the token overlap is always kept at 20 (encoded as "256_20" and "1024_20")
- model: different models are evaluated and compared for the actual classification step
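
The chunking values are strings such as "256_20" that combine chunk size and overlap. The sketch below shows how such a string could be split into its two numbers and how the full grid of chunking and only_text settings can be enumerated. `parse_chunking` is a hypothetical helper, and the assumption that the string is always `<size>_<overlap>` is inferred from the two values used in this notebook.

```python
from itertools import product

chunkings = ["256_20", "1024_20"]
only_texts = [False, True]


def parse_chunking(chunking: str) -> tuple[int, int]:
    """Split a string like "256_20" into (chunk_size, chunk_overlap)."""
    size, overlap = chunking.split("_")
    return int(size), int(overlap)


# One entry per evaluated configuration (2 chunkings x 2 only_text settings).
grid = []
for chunking, only_text in product(chunkings, only_texts):
    chunk_size, chunk_overlap = parse_chunking(chunking)
    grid.append(
        {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap, "only_text": only_text}
    )

for config in grid:
    print(config)
```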

### GPT 3.5 Turbo

In [6]:
model = "gpt-3.5-turbo-0125"

In [7]:
for chunking in chunkings:
    for only_text in only_texts:
        try:
            gather_results(chunking, only_text, model)
        except Exception as e:
            print(f"Error gathering results for {model}, chunking: {chunking}, only_text: {only_text}: {e}")

Row 21 Model Classification could not be decoded: Expecting value: line 1 column 1 (char 0)

Row 21 Model Classification Label is not a valid label: None
Row 21 Model Classification Label is not a valid label: None
Row 21 Model Classification could not be decoded: Expecting value: line 1 column 1 (char 0)

Row 21 Model Classification Label is not a valid label: None
Row 21 Model Classification Label is not a valid label: None
Row 24 Model Classification could not be decoded: Expecting value: line 1 column 1 (char 0)

Row 24 Model Classification Label is not a valid label: None
Row 24 Model Classification Label is not a valid label: None
Row 21 Model Classification could not be decoded: Expecting value: line 1 column 1 (char 0)

Row 24 Model Classification could not be decoded: Expecting value: line 1 column 1 (char 0)

Row 21 Model Classification Label is not a valid label: None
Row 24 Model Classification Label is not a valid label: None
Row 21 Model Classification Label is not a vali

### GPT 4.1 Nano

In [8]:
model = "gpt-4.1-nano-2025-04-14"

In [9]:
for chunking in chunkings:
    for only_text in only_texts:
        try:
            gather_results(chunking, only_text, model)
        except Exception as e:
            print(f"Error gathering results for {model}, chunking: {chunking}, only_text: {only_text}: {e}")

### GPT 4.1 Mini

In [10]:
model = "gpt-4.1-mini-2025-04-14"

In [11]:
for chunking in chunkings:
    for only_text in only_texts:
        try:
            gather_results(chunking, only_text, model)
        except Exception as e:
            print(f"Error gathering results for {model}, chunking: {chunking}, only_text: {only_text}: {e}")

### GPT 4.1

In [12]:
model = "gpt-4.1-2025-04-14"

In [13]:
for chunking in chunkings:
    for only_text in only_texts:
        try:
            gather_results(chunking, only_text, model)
        except Exception as e:
            print(f"Error gathering results for {model}, chunking: {chunking}, only_text: {only_text}: {e}")

#### AI improved prompt

In [14]:
for chunking in chunkings:
    for only_text in only_texts:
        try:
            gather_results(chunking, only_text, model, ai_prompt=True)
        except Exception as e:
            print(f"Error gathering results for {model}, chunking: {chunking}, only_text: {only_text}: {e}")

Error gathering results for gpt-4.1-2025-04-14, chunking: 256_20, only_text: False: [Errno 2] No such file or directory: '../data/dfs/256_20/gpt-4.1-2025-04-14/AI_prompt/ReferenceErrorDetection_data_with_prompt_results.pkl'
Error gathering results for gpt-4.1-2025-04-14, chunking: 1024_20, only_text: False: [Errno 2] No such file or directory: '../data/dfs/1024_20/gpt-4.1-2025-04-14/AI_prompt/ReferenceErrorDetection_data_with_prompt_results.pkl'
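
Both messages above are missing-file errors, which a broad `except Exception` cannot tell apart from genuine evaluation failures. As a sketch of one alternative, the loop could catch `FileNotFoundError` specifically and collect the configurations that have no saved results. `gather_all` and `fake_gather` are hypothetical names; `fake_gather` only imitates a loader that fails for `only_text=False` and uses a made-up path.

```python
import errno
import os


def gather_all(configs, gather):
    """Call gather(*config) for every configuration.

    Returns a list of (config, missing_path) for configurations whose
    result file does not exist. Any other exception is left to propagate.
    """
    missing = []
    for config in configs:
        try:
            gather(*config)
        except FileNotFoundError as e:
            missing.append((config, e.filename))
    return missing


def fake_gather(chunking, only_text):
    """Hypothetical stand-in for gather_results; the path is made up."""
    if not only_text:
        path = f"missing/{chunking}.pkl"
        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)


configs = [(c, o) for c in ["256_20", "1024_20"] for o in [False, True]]
missing = gather_all(configs, fake_gather)
for config, path in missing:
    print(f"No saved results for {config}: {path}")
```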


### Llama 3.1:70b

In [15]:
model = "llama3.1.70b"

In [16]:
for chunking in chunkings:
    for only_text in only_texts:
        try:
            gather_results(chunking, only_text, model)
        except Exception as e:
            print(f"Error gathering results for {model}, chunking: {chunking}, only_text: {only_text}: {e}")

Row 81 Model Classification could not be decoded: 'label'
{
  "citation": null,
  "conclusion": null,
  "introduction": null,
  "methodology": [
    "Atomic force microscopy force spectroscopy and trans-epithelial electrical resistance assessed changes in cell-cell tethering and paracellular permeability respectively.",
    "Carboxyfluorescein dye uptake, ATP-biosensing, and western blotting were used to assess the ability of Peptide 5 to block hemichannel activity, ATP-release, and ultimately disassembly of the adherens/tight junction complex."
  ],
  "question": null,
  "results": [
    "Co-incubation of TGF-β1 with Peptide 5 significantly reduced dye uptake and restored ATP release to near basal.",
    "Peptide 5 successfully prevented TGF-β1-evoked changes in expression of E-cadherin, N-cadherin, Claudin-2, and ZO-1 in human primary renal proximal tubule cells.",
    "Cx43 +/- mice exhibited minimal disassembly of the adherens and tight junction complex."
  ]
}
Row 81 Model Classif

#### AI improved prompt

In [17]:
for chunking in chunkings:
    for only_text in only_texts:
        try:
            gather_results(chunking, only_text, model, ai_prompt=True)
        except Exception as e:
            print(f"Error gathering results for {model}, chunking: {chunking}, only_text: {only_text}: {e}")

Error gathering results for llama3.1.70b, chunking: 256_20, only_text: False: [Errno 2] No such file or directory: '../data/dfs/256_20/llama3.1.70b/AI_prompt/ReferenceErrorDetection_data_with_prompt_results.pkl'
Error gathering results for llama3.1.70b, chunking: 1024_20, only_text: False: [Errno 2] No such file or directory: '../data/dfs/1024_20/llama3.1.70b/AI_prompt/ReferenceErrorDetection_data_with_prompt_results.pkl'


### Llama 4 Scout

In [18]:
model = "llama4.scout"

In [19]:
for chunking in chunkings:
    for only_text in only_texts:
        try:
            gather_results(chunking, only_text, model)
        except Exception as e:
            print(f"Error gathering results for {model}, chunking: {chunking}, only_text: {only_text}: {e}")

Row 2 Model Classification could not be decoded: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
{Unsubstantiated}
Row 5 Model Classification could not be decoded: Expecting value: line 1 column 1 (char 0)
## Classification Result

The provided reference article does not support the statement regarding the calculation of recommended trust values in a secure routing protocol. 

### Explanation

The reference article focuses on textile manufacturing, discussing topics such as sustainable practices, technological advancements, and environmental impacts. The article does not mention trust calculations, secure routing protocols, or any related concepts. 

### Classification

Unsubstantiated 

The classification result is: Unsubstantiated 

### Justification

There is no connection between the reference article's content and the statement about calculating recommended trust values in a secure routing protocol. The article's topics, such as textile manufacturing, s
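
The decoding errors in this section fall into a few patterns: an empty or free-text response, a bare `{Unsubstantiated}` that is not valid JSON, and valid JSON without a `label` key. The sketch below shows one possible way to recover a label in such cases. It is not the parsing used by the evaluation scripts, and applying it would change how many rows count as invalid. `extract_label` is a hypothetical name, and of the labels listed only "Unsubstantiated" appears in this notebook; "Substantiated" is a placeholder assumption.

```python
import json
import re

# Only "Unsubstantiated" is taken from the notebook output;
# "Substantiated" is a placeholder for the remaining label set.
VALID_LABELS = ("Unsubstantiated", "Substantiated")


def extract_label(raw, valid_labels=VALID_LABELS):
    """Return a valid label found in a raw model response, or None."""
    # Preferred path: a JSON object with a valid "label" value.
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict) and parsed.get("label") in valid_labels:
            return parsed["label"]
    except (json.JSONDecodeError, TypeError):
        pass

    if not isinstance(raw, str):
        return None

    # Fallback: accept the response only if exactly one distinct label
    # occurs as a whole word anywhere in the text.
    found = {
        label
        for label in valid_labels
        if re.search(rf"\b{re.escape(label)}\b", raw)
    }
    return found.pop() if len(found) == 1 else None


print(extract_label('{"label": "Unsubstantiated"}'))
print(extract_label("{Unsubstantiated}"))
print(extract_label('{"citation": null, "results": []}'))
```

Requiring exactly one distinct label in the fallback keeps ambiguous free-text answers, which mention several labels, classified as invalid.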

### Save results

In [20]:
import json

with open("../data/all_results.json", "w") as json_file:
    json.dump(all_results, json_file, indent=4)
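
For later analysis, the saved file has the nesting model → configuration key → "all" / "without_relabelled". The sketch below shows how such a file could be read back and flattened into one row per combination. It writes to a temporary file rather than `../data/all_results.json`, the metric name and value are made-up placeholders rather than real results, and `flatten_results` is a hypothetical helper.

```python
import json
import os
import tempfile


def flatten_results(all_results):
    """Turn the nested results dict into a flat list of row dicts."""
    rows = []
    for model, configs in all_results.items():
        for config_key, variants in configs.items():
            for variant, metrics in variants.items():
                rows.append(
                    {"model": model, "config": config_key, "variant": variant, "metrics": metrics}
                )
    return rows


# Placeholder data with the same nesting as all_results; the values are not real results.
example = {
    "gpt-4.1-2025-04-14": {
        "only_text_256_20": {
            "all": {"placeholder_metric": 0.0},
            "without_relabelled": {"placeholder_metric": 0.0},
        }
    }
}

# Round trip through a temporary JSON file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "all_results.json")
    with open(path, "w") as json_file:
        json.dump(example, json_file, indent=4)
    with open(path) as json_file:
        loaded = json.load(json_file)

rows = flatten_results(loaded)
for row in rows:
    print(row["model"], row["config"], row["variant"], row["metrics"])
```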