# Task
Simplify the text documents from "https://api.github.com/repos/feliperussi/bridging-the-gap-in-health-literacy/contents/llms_testing/Cochrane/ground_truth" using the Gemini 2.5 Flash API, then evaluate the results with BERTScore for relevance, AlignScore for factuality, and six readability metrics: Coleman–Liau Index (CLI), Flesch Reading Ease (FRE), Gunning Fog Index (GFI), SMOG, Flesch–Kincaid Grade Level (FKGL), and Dale–Chall Readability Score (DCRS).

## Setup and authentication

Install the necessary libraries and set up authentication for the Gemini API.


In [1]:
# google-genai is the package that provides `from google import genai`, used in the simplification cell
%pip install google-genai evaluate[bertscore] transformers sentence-transformers matplotlib pandas numpy textstat

Collecting textstat
  Downloading textstat-0.7.10-py3-none-any.whl.metadata (15 kB)
Collecting evaluate[bertscore]
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting pyphen (from textstat)
  Downloading pyphen-0.17.2-py3-none-any.whl.metadata (3.2 kB)
Downloading textstat-0.7.10-py3-none-any.whl (239 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Downloading pyphen-0.17.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pyphen, textstat, evaluate
Successfully installed evaluate-0.4.6 pyphen-0.17.2 textstat-0.7.10


In [None]:
import os
from google.colab import userdata

os.environ['GOOGLE_API_KEY'] = userdata.get('GOOGLE_API_KEY')
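
The cell above only works inside Colab, because `google.colab` cannot be imported elsewhere. A minimal sketch of a fallback, assuming the key may also be supplied as an ordinary environment variable; `resolve_api_key` is a hypothetical helper, not part of any library:

```python
import os

def resolve_api_key(name="GOOGLE_API_KEY"):
    """Return the key from Colab secrets when available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except ImportError:
        pass  # not running in Colab: fall back to the environment
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set as a Colab secret or environment variable")
    return key
```

Outside Colab, export `GOOGLE_API_KEY` before starting the notebook and pass the returned value to the client.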

## Data Ingestion - Cochrane
Download the `.txt` files listed by the GitHub contents API at "https://api.github.com/repos/feliperussi/bridging-the-gap-in-health-literacy/contents/llms_testing/Cochrane/ground_truth" into a local directory.

In [None]:
%pip install requests



In [None]:
import requests
import os

github_repo_url = "https://api.github.com/repos/feliperussi/bridging-the-gap-in-health-literacy/contents/llms_testing/Cochrane/ground_truth"
local_download_dir = 'downloaded_texts'

# Create the local directory if it doesn't exist
os.makedirs(local_download_dir, exist_ok=True)

try:
    response = requests.get(github_repo_url)
    response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
    files = response.json()

    for file_info in files:
        if file_info['type'] == 'file' and file_info['name'].endswith('.txt'):
            file_url = file_info['download_url']
            local_file_path = os.path.join(local_download_dir, file_info['name'])

            print(f"Downloading {file_info['name']} to {local_file_path}")
            file_content_response = requests.get(file_url)
            file_content_response.raise_for_status()

            with open(local_file_path, 'w', encoding='utf-8') as f:
                f.write(file_content_response.text)

    print("Download complete.")

except requests.exceptions.RequestException as e:
    print(f"An error occurred during the request: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Downloading 10.1002-14651858.CD000371.pub7-abstract.txt to downloaded_texts/10.1002-14651858.CD000371.pub7-abstract.txt
Downloading 10.1002-14651858.CD001218.pub3-abstract.txt to downloaded_texts/10.1002-14651858.CD001218.pub3-abstract.txt
Downloading 10.1002-14651858.CD001977.pub2-abstract.txt to downloaded_texts/10.1002-14651858.CD001977.pub2-abstract.txt
Downloading 10.1002-14651858.CD002201.pub6-abstract.txt to downloaded_texts/10.1002-14651858.CD002201.pub6-abstract.txt
Downloading 10.1002-14651858.CD002779.pub3-abstract.txt to downloaded_texts/10.1002-14651858.CD002779.pub3-abstract.txt
Downloading 10.1002-14651858.CD002948.pub2-abstract.txt to downloaded_texts/10.1002-14651858.CD002948.pub2-abstract.txt
Downloading 10.1002-14651858.CD003147.pub5-abstract.txt to downloaded_texts/10.1002-14651858.CD003147.pub5-abstract.txt
Downloading 10.1002-14651858.CD003315.pub3-abstract.txt to downloaded_texts/10.1002-14651858.CD003315.pub3-abstract.txt
Downloading 10.1002-14651858.CD003459.pu

KeyboardInterrupt: 
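
The download cell keeps only the entries of the directory listing that are regular files ending in `.txt`. The sketch below isolates that filter so it can be checked without network access. `select_text_files` and the sample listing are hypothetical; the listing only mimics the `type`, `name`, and `download_url` fields the cell reads, and its URLs are placeholders.

```python
def select_text_files(listing, suffix=".txt"):
    """Return (name, download_url) pairs for regular files whose name ends with suffix."""
    return [
        (entry["name"], entry["download_url"])
        for entry in listing
        if entry.get("type") == "file" and entry.get("name", "").endswith(suffix)
    ]

# Hypothetical listing shaped like the JSON the download cell iterates over.
sample_listing = [
    {"type": "file", "name": "a-abstract.txt", "download_url": "https://example.invalid/a-abstract.txt"},
    {"type": "dir", "name": "archive", "download_url": None},
    {"type": "file", "name": "notes.md", "download_url": "https://example.invalid/notes.md"},
]

selected = select_text_files(sample_listing)
print(selected)
```

Only the first entry survives the filter: the directory and the non-`.txt` file are skipped.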

### Load text

Load the content of the downloaded text files into a dictionary or similar data structure.

In [None]:
import os

downloaded_texts_content = {}

for filename in os.listdir(local_download_dir):
    if filename.endswith('.txt'):
        filepath = os.path.join(local_download_dir, filename)
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                downloaded_texts_content[filename] = f.read()
            print(f"Loaded text from {filename}")
        except Exception as e:
            print(f"Error loading text from {filename}: {e}")


## Text Simplification

### Subtask:
Use the Gemini API with the provided prompt to simplify the extracted text.

In [None]:
import time
from google import genai
from google.genai import types
from google.api_core import exceptions as core_exceptions

simplified_texts = {}
prompt = """Using the following abstract of a biomedical study as input, generate a Plain Language Summary
(PLS) understandable by any patient, regardless of their health literacy. Ensure that the generated text
adheres to the following instructions which should be followed step-by-step:
a. Specific Structure: The generated PLS should be presented in a logical order, using the following
order:
1. Plain Title
2. Rationale
3. Trial Design
4. Results
b. Sections should be authored following these parameters:
1. Plain Title: Simplified title understandable to a layperson that summarizes the research that was
done.
2. Rationale: Include: background or study rationale providing a general description of the
condition, what it may cause or why it is a burden for the patients; the reason and main hypothesis
for the study; and why the study is needed, and why the study medication has the potential to
treat the condition.
3. Trial Design: Answer ‘How is this study designed?’ Include the description of the design,
description of study and patient population (age, health condition, gender), and the expected
amount of time a person will be in the study.
4. Results: Answer ‘What were the main results of the study’, include the benefits for the patients,
how the study was relevant for the area of study, and the conclusions from the investigator.
c. Consistency and Replicability: The generated PLS should be consistent regardless of the order of
sentences or the specific phrasing used in the input protocol text.
d. Compliance with Plain Language Guidelines: The generated PLS must follow all these plain
language guidelines:
• Have readability grade level of 6 or below.
• Do not have jargon. All technical or medical words or terms should be defined or broken down
into simple and logical explanations.
• Active voice, not passive.
• Mostly one or two syllable words.
• Sentences of 15 words or less.
• Short paragraphs of 3-5 sentences.
• Simple numbers (e.g., ratios, no percentages).
e. Do not invent Content: The AI model should not invent information. If the AI model includes data
other than the one given in the input abstract, the AI model should guarantee such data is verified and
real.
f. Aim for an approximate PLS length of 500-900 words.


Abstract of a biomedical study text: {text}
"""

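from google.genai import errors as genai_errors  # exception types raised by the google-genai SDK

client = genai.Client(api_key=os.environ['GOOGLE_API_KEY'])
counter = 0      # files attempted
succeeded = 0    # files simplified successfully
sum_times = 0.0  # total latency of successful calls, in seconds

for filename, text in downloaded_texts_content.items():
    counter += 1
    retries = 3  # Maximum number of attempts per file

    while retries > 0:
        init_time = time.time()

        try:
            response = client.models.generate_content(
                model="gemini-2.5-flash",
                contents=prompt.format(text=text),
                # A thinking budget of 0 turns off the model's thinking step
                config=types.GenerateContentConfig(thinking_config=types.ThinkingConfig(thinking_budget=0))
            )
            simplified_texts[filename] = response.text
            response_time = time.time() - init_time
            sum_times += response_time
            succeeded += 1
            print(f"Simplified text # {counter} for {filename}. Duration: {response_time:.2f} s")
            break  # Exit the retry loop on success
        except (core_exceptions.ServiceUnavailable, genai_errors.ServerError) as e:
            # Server-side (5xx) failures are treated as transient: wait, then retry
            print(f"ServiceUnavailable error for {filename}: {e}. Retrying in 60 seconds...")
            retries -= 1
            if retries == 0:
                print(f"Failed to simplify {filename} after multiple retries.")
            else:
                time.sleep(60)
        except Exception as e:
            print(f"An unexpected error occurred while simplifying {filename}: {e}")
            break  # Exit the retry loop for other errors

# Average over successful calls only; skip the division when nothing succeeded
if succeeded:
    print(f"Average time: {sum_times / succeeded:.2f} s over {succeeded} of {counter} files")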
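
The retry loop above waits a fixed 60 seconds between attempts. One common alternative is exponential backoff, sketched below under stated assumptions: `call_with_backoff` is a hypothetical helper, the delays are illustrative, and `flaky` is a stand-in for the API call rather than a real request.

```python
import time

def call_with_backoff(func, retryable=(Exception,), max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retryable exception wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))

# Stand-in for an API call: fails twice, then succeeds.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, retryable=(ConnectionError,), sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demonstration instant; in the notebook the default `time.sleep` would be used.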


In [None]:
import pandas as pd

# Create a DataFrame from the extracted and simplified texts
data = {'original_text': downloaded_texts_content, 'simplified_text': simplified_texts}
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
csv_filepath = 'simplified_texts.csv'
df.to_csv(csv_filepath, index_label='filename')

print(f"Results saved to {csv_filepath}")
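
Because the DataFrame is built from two dictionaries keyed by filename, a file whose request failed has no entry in `simplified_texts` and ends up with a missing value in the `simplified_text` column. A small sketch for listing such files before saving; `find_missing` is a hypothetical helper and the dictionaries are placeholder data:

```python
def find_missing(originals, simplified):
    """Return the sorted filenames present in originals but absent from simplified."""
    return sorted(set(originals) - set(simplified))

# Placeholder data standing in for downloaded_texts_content and simplified_texts.
originals = {"a.txt": "long abstract", "b.txt": "another abstract"}
simplified = {"a.txt": "short summary"}

missing = find_missing(originals, simplified)
print(missing)
```

Files reported here could be retried or dropped before the scoring steps, which expect a string in every row.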

# Task
Load filename, original_text and simplified_text from "/content/simplified_texts.csv" and provide, for each row, a relevance score using BERTScore, a factual consistency score using AlignScore, and readability scores, including the Flesch–Kincaid Grade Level.

## Load data

Load the data from "/content/simplified_texts.csv" into a pandas DataFrame.


In [2]:
%pip install google-generativeai evaluate[bertscore] transformers sentence-transformers matplotlib pandas numpy textstat



In [None]:
import pandas as pd

df = pd.read_csv("/content/simplified_texts.csv", index_col='filename')
display(df.head())
df.info()  # prints its summary directly and returns None, so it is not wrapped in display()

| filename | original_text | simplified_text |
|---|---|---|
| 10.1002-14651858.CD013409.pub2-pls.txt | Metal implants used to fix broken bones in the... | Here's a Plain Language Summary based on the p... |
| 10.1002-14651858.CD013458.pub2-abstract.txt | An overview of systematic reviews on mental he... | ## Help for Migrants' Mental Health\n\n### Rat... |
| 10.1002-14651858.CD012488.pub2-abstract.txt | Psychological therapies for anxiety and depres... | **Plain Title:** Helping Kids with Long-Term I... |
| 10.1002-14651858.CD015207-pls.txt | Does using low‐sodium salt substitutes (LSSS) ... | Here is a Plain Language Summary based on the ... |
| 10.1002-14651858.CD013512.pub2-pls.txt | What surgical procedure works best to slow the... | ## Plain Language Summary: Eye Disease Treatme... |


<class 'pandas.core.frame.DataFrame'>
Index: 300 entries, 10.1002-14651858.CD013409.pub2-pls.txt to 10.1002-14651858.CD013298.pub2-pls.txt
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   original_text    300 non-null    object
 1   simplified_text  300 non-null    object
dtypes: object(2)
memory usage: 7.0+ KB

## Calculate BERTScore

Compute BERTScore for each pair of original and simplified texts.


In [None]:
%pip install bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


In [None]:
from evaluate import load

bertscore = load("bertscore")

f1_scores = []

for index, row in df.iterrows():
    original_text = row['original_text']
    simplified_text = row['simplified_text']

    # compute() expects lists of strings; the simplified text is the prediction
    # and the original text is the reference
    results = bertscore.compute(predictions=[simplified_text], references=[original_text], model_type="distilbert-base-uncased")

    # Keep only F1, which combines BERTScore precision and recall
    f1_scores.append(results['f1'][0])

df['bertscore'] = f1_scores

display(df.head())

| filename | original_text | simplified_text | bertscore |
|---|---|---|---|
| 10.1002-14651858.CD013409.pub2-pls.txt | Metal implants used to fix broken bones in the... | Here's a Plain Language Summary based on the p... | 0.815162 |
| 10.1002-14651858.CD013458.pub2-abstract.txt | An overview of systematic reviews on mental he... | ## Help for Migrants' Mental Health\n\n### Rat... | 0.802063 |
| 10.1002-14651858.CD012488.pub2-abstract.txt | Psychological therapies for anxiety and depres... | **Plain Title:** Helping Kids with Long-Term I... | 0.810141 |
| 10.1002-14651858.CD015207-pls.txt | Does using low‐sodium salt substitutes (LSSS) ... | Here is a Plain Language Summary based on the ... | 0.8482 |
| 10.1002-14651858.CD013512.pub2-pls.txt | What surgical procedure works best to slow the... | ## Plain Language Summary: Eye Disease Treatme... | 0.867668 |
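
To make the `bertscore` column easier to interpret, here is a toy sketch of the greedy matching behind BERTScore: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 combines the two. The two-dimensional vectors are made up for illustration; the real metric uses contextual embeddings from the chosen model, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy cosine matching of token vectors."""
    precision = sum(max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs) / len(candidate_vecs)
    recall = sum(max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up token vectors: the candidate covers two of the three reference tokens exactly.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

precision, recall, f1 = greedy_match_scores(candidate, reference)
print(precision, recall, f1)
```

Precision is 1.0 because every candidate token has an exact match, while recall is lower because the middle reference token is only partly covered.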


## Calculate AlignScore and Flesch–Kincaid Grade Level

Compute AlignScore for each pair of original and simplified texts, then compute the readability metrics (Flesch–Kincaid Grade Level among them) on each simplified text.

In [None]:
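# Install AlignScore directly from its GitHub repository, skipping its pinned dependencies
!pip install git+https://github.com/yuh-zha/AlignScore.git --no-deps
!pip install pytorch-lightning==1.9.5

import torch
import transformers

# Compatibility shim: AlignScore expects transformers.AdamW; when the installed
# transformers version no longer provides it, alias the PyTorch optimizer instead
if not hasattr(transformers, "AdamW"):
    from torch.optim import AdamW
    transformers.AdamW = AdamW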


Collecting git+https://github.com/yuh-zha/AlignScore.git
  Cloning https://github.com/yuh-zha/AlignScore.git to /tmp/pip-req-build-1je_zn58
  Running command git clone --filter=blob:none --quiet https://github.com/yuh-zha/AlignScore.git /tmp/pip-req-build-1je_zn58
  Resolved https://github.com/yuh-zha/AlignScore.git to commit a0936d5afee642a46b22f6c02a163478447aa493
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: alignscore
  Building wheel for alignscore (pyproject.toml) ... done
  Created wheel for alignscore: filename=alignscore-0.1.3-py3-none-any.whl size=18369 sha256=97fddaa7c81702c0c9870c7772cf3f383060db904bdafa4493ecc4d8f0bace49
  Stored in directory: /tmp/pip-ephem-wheel-cache-1xc7xu6q/wheels/61/78/cf/7adbff432a586bfb8a1673348218105047a4845e89a536007d
Successfully built alignscore
Installing collected pack

In [None]:
from alignscore import AlignScore

In [None]:
# Download the spaCy language model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# Install textstat if not already installed
%pip install textstat

import nltk

nltk.download('punkt_tab')



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
import textstat
import torch

# Use the GPU when one is available
device_id = 'cuda' if torch.cuda.is_available() else 'cpu'

alignscorer = AlignScore(model='roberta-base',
    batch_size=8,
    device=device_id,
    ckpt_path='https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt',
    evaluation_mode='nli_sp',
    verbose=False
)

align_scores = []

for index, row in df.iterrows():
    original_text = row['original_text']
    simplified_text = row['simplified_text']
    # score() takes two lists: the original text is the context,
    # the simplified text is the claim checked against it
    align_scores.append(alignscorer.score([original_text], [simplified_text])[0])
    print(align_scores[-1])

df['alignscore'] = align_scores

display(df.head())

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading: "https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt" to /root/.cache/torch/hub/checkpoints/AlignScore-base.ckpt


100%|██████████| 1.83G/1.83G [00:05<00:00, 383MB/s]
INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.7.7 to v1.9.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file https:/huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt`
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  rank_zero_warn(


0.5913100838661194
0.4930691421031952
0.6340270638465881
0.7049410939216614
0.6615166068077087
0.41266438364982605
0.5513878464698792
0.6282657980918884
0.6737241744995117
0.7142598628997803
0.6393222212791443
0.7264792323112488
0.5593645572662354
0.6847326159477234
0.5374680757522583
0.5637751817703247
0.689601719379425
0.5854660272598267
0.6492792367935181
0.37166526913642883
0.4301823675632477
0.5033786296844482
0.4827814996242523
0.614648699760437
0.7011305689811707
0.562722384929657
0.5188531279563904
0.5890012979507446
0.49235987663269043
0.40849488973617554
0.5145161747932434
0.5948630571365356
0.48513925075531006
0.6318672895431519
0.5188474655151367
0.6025192737579346
0.6122657656669617
0.41322118043899536
0.4880622327327728
0.7171114087104797
0.47215571999549866
0.5670281052589417
0.5140546560287476
0.4634856581687927
0.6517964601516724
0.7051646709442139
0.7872326374053955
0.5601122379302979
0.5365469455718994
0.6731721758842468
0.5642184019088745
0.5205717086791992
0.529557

| filename | original_text | simplified_text | bertscore | alignscore |
|---|---|---|---|---|
| 10.1002-14651858.CD013409.pub2-pls.txt | Metal implants used to fix broken bones in the... | Here's a Plain Language Summary based on the p... | 0.815162 | 0.59131 |
| 10.1002-14651858.CD013458.pub2-abstract.txt | An overview of systematic reviews on mental he... | ## Help for Migrants' Mental Health\n\n### Rat... | 0.802063 | 0.493069 |
| 10.1002-14651858.CD012488.pub2-abstract.txt | Psychological therapies for anxiety and depres... | **Plain Title:** Helping Kids with Long-Term I... | 0.810141 | 0.634027 |
| 10.1002-14651858.CD015207-pls.txt | Does using low‐sodium salt substitutes (LSSS) ... | Here is a Plain Language Summary based on the ... | 0.8482 | 0.704941 |
| 10.1002-14651858.CD013512.pub2-pls.txt | What surgical procedure works best to slow the... | ## Plain Language Summary: Eye Disease Treatme... | 0.867668 | 0.661517 |
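
As a rough guide to reading the `alignscore` column: as the AlignScore authors describe it, the context is split into chunks and the claim into sentences, each sentence keeps its best score over all chunks, and the final score is the average over sentences. The sketch below reproduces only that aggregation step on made-up pair scores; it is an assumption about the method, not code from the AlignScore package, and `aggregate_alignment` is a hypothetical name.

```python
def aggregate_alignment(pair_scores):
    """pair_scores[i][j] is the score of claim sentence i against context chunk j.

    Each sentence keeps its best-supported chunk; the result is the mean over sentences.
    """
    per_sentence = [max(row) for row in pair_scores]
    return sum(per_sentence) / len(per_sentence)

# Made-up scores for three claim sentences against two context chunks.
pair_scores = [
    [0.9, 0.2],  # well supported by the first chunk
    [0.1, 0.6],  # moderately supported by the second chunk
    [0.3, 0.3],  # weakly supported everywhere
]

score = aggregate_alignment(pair_scores)
print(round(score, 3))
```

Under this reading, a single unsupported sentence in a simplified text lowers the average, which is one reason long summaries with added explanations can score well below 1.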


In [None]:

# Readability is measured on the simplified text only.
# Each textstat function takes a single string and returns a number.
readability_metrics = {
    'flesch_kincaid_grade': textstat.flesch_kincaid_grade,        # FKGL
    'coleman_liau_index': textstat.coleman_liau_index,            # CLI
    'flesch_reading_ease': textstat.flesch_reading_ease,          # FRE
    'gunning_fog_index': textstat.gunning_fog,                    # GFI
    'smog_index': textstat.smog_index,                            # SMOG
    'dale_chall_score': textstat.dale_chall_readability_score,    # DCRS
}

for column, metric in readability_metrics.items():
    df[column] = df['simplified_text'].apply(metric)


display(df.head())

| filename | original_text | simplified_text | bertscore | alignscore | flesch_kincaid_grade | coleman_liau_index | flesch_reading_ease | gunning_fog_index | smog_index | dale_chall_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 10.1002-14651858.CD013409.pub2-pls.txt | Metal implants used to fix broken bones in the... | Here's a Plain Language Summary based on the p... | 0.815162 | 0.59131 | 4.911179 | 6.772033 | 80.136741 | 6.376694 | 7.921354 | 8.694478 |
| 10.1002-14651858.CD013458.pub2-abstract.txt | An overview of systematic reviews on mental he... | ## Help for Migrants' Mental Health\n\n### Rat... | 0.802063 | 0.493069 | 4.660206 | 8.329762 | 76.612571 | 5.979048 | 7.554174 | 9.379132 |
| 10.1002-14651858.CD012488.pub2-abstract.txt | Psychological therapies for anxiety and depres... | **Plain Title:** Helping Kids with Long-Term I... | 0.810141 | 0.634027 | 4.341345 | 8.00056 | 80.241325 | 6.238569 | 7.702958 | 9.221175 |
| 10.1002-14651858.CD015207-pls.txt | Does using low‐sodium salt substitutes (LSSS) ... | Here is a Plain Language Summary based on the ... | 0.8482 | 0.704941 | 6.441857 | 8.583939 | 71.597695 | 8.238194 | 9.180166 | 8.97903 |
| 10.1002-14651858.CD013512.pub2-pls.txt | What surgical procedure works best to slow the... | ## Plain Language Summary: Eye Disease Treatme... | 0.867668 | 0.661517 | 4.736361 | 7.693089 | 77.090153 | 6.297913 | 8.066035 | 9.025763 |
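
For reference, the two Flesch formulas behind the `flesch_kincaid_grade` and `flesch_reading_ease` columns can be written directly in terms of word, sentence, and syllable counts. This is a sketch of the standard published formulas with made-up counts; textstat's own sentence splitting and syllable counting may give slightly different inputs for real text, and the function names are hypothetical.

```python
def fkgl_from_counts(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: higher means harder to read."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def fre_from_counts(words, sentences, syllables):
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Made-up counts: 120 words in 10 sentences with 150 syllables,
# i.e. 12 words per sentence and 1.25 syllables per word.
grade = fkgl_from_counts(words=120, sentences=10, syllables=150)
ease = fre_from_counts(words=120, sentences=10, syllables=150)
print(round(grade, 2), round(ease, 3))
```

Both formulas reward short sentences and short words, which is what the prompt's guidelines on sentence length and syllable count push the model toward.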


In [None]:
# Save the DataFrame to a CSV file
csv_filepath = 'simplified_texts_with_scores.csv'
df.to_csv(csv_filepath, index_label='filename')

print(f"DataFrame saved to {csv_filepath}")

DataFrame saved to simplified_texts_with_scores.csv
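
The prompt asks for a readability grade level of 6 or below, so a natural follow-up is to count how many summaries meet that target. The sketch below does this with the five `flesch_kincaid_grade` values shown in the preview above; applying it to the whole dataset would mean reading the column from the saved CSV instead. `TARGET_GRADE` is a name introduced here for illustration.

```python
# Flesch-Kincaid grades of the five rows shown in the preview table.
fkgl_head = {
    "10.1002-14651858.CD013409.pub2-pls.txt": 4.911179,
    "10.1002-14651858.CD013458.pub2-abstract.txt": 4.660206,
    "10.1002-14651858.CD012488.pub2-abstract.txt": 4.341345,
    "10.1002-14651858.CD015207-pls.txt": 6.441857,
    "10.1002-14651858.CD013512.pub2-pls.txt": 4.736361,
}

TARGET_GRADE = 6  # grade level requested in the prompt

within_target = [name for name, grade in fkgl_head.items() if grade <= TARGET_GRADE]
print(f"{len(within_target)} of {len(fkgl_head)} previewed summaries are at grade {TARGET_GRADE} or below")
```

Among these five rows, only the summary for `10.1002-14651858.CD015207-pls.txt` exceeds the target.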
