# Task
Simplify the text documents from "https://api.github.com/repos/feliperussi/bridging-the-gap-in-health-literacy/contents/llms_testing/Cochrane/ground_truth" using the OpenAI API (`gpt-5-nano`), and evaluate the results using BERTScore for relevance, AlignScore for factuality, and six readability metrics: CLI, FRE, GFI, SMOG, FKGL, and DCRS.

## Setup and authentication

Install the necessary libraries and set up authentication for the OpenAI API.


In [None]:
%pip install boto3 evaluate[bertscore] transformers sentence-transformers matplotlib pandas numpy textstat



In [None]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

## Data Ingestion - Cochrane
Load text files from the GitHub repository "https://api.github.com/repos/feliperussi/bridging-the-gap-in-health-literacy/contents/llms_testing/Cochrane/ground_truth".

In [None]:
%pip install requests



In [None]:
import requests
import os

github_repo_url = "https://api.github.com/repos/feliperussi/bridging-the-gap-in-health-literacy/contents/llms_testing/Cochrane/ground_truth"
local_download_dir = 'downloaded_texts'

# Create the local directory if it doesn't exist
os.makedirs(local_download_dir, exist_ok=True)

try:
    response = requests.get(github_repo_url)
    response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
    files = response.json()

    for file_info in files:
        if file_info['type'] == 'file' and file_info['name'].endswith('.txt'):
            file_url = file_info['download_url']
            local_file_path = os.path.join(local_download_dir, file_info['name'])

            print(f"Downloading {file_info['name']} to {local_file_path}")
            file_content_response = requests.get(file_url)
            file_content_response.raise_for_status()

            with open(local_file_path, 'w', encoding='utf-8') as f:
                f.write(file_content_response.text)

    print("Download complete.")

except requests.exceptions.RequestException as e:
    print(f"An error occurred during the request: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Downloading 10.1002-14651858.CD000371.pub7-abstract.txt to downloaded_texts/10.1002-14651858.CD000371.pub7-abstract.txt
Downloading 10.1002-14651858.CD001218.pub3-abstract.txt to downloaded_texts/10.1002-14651858.CD001218.pub3-abstract.txt
Downloading 10.1002-14651858.CD001977.pub2-abstract.txt to downloaded_texts/10.1002-14651858.CD001977.pub2-abstract.txt
Downloading 10.1002-14651858.CD002201.pub6-abstract.txt to downloaded_texts/10.1002-14651858.CD002201.pub6-abstract.txt
Downloading 10.1002-14651858.CD002779.pub3-abstract.txt to downloaded_texts/10.1002-14651858.CD002779.pub3-abstract.txt
Downloading 10.1002-14651858.CD002948.pub2-abstract.txt to downloaded_texts/10.1002-14651858.CD002948.pub2-abstract.txt
Downloading 10.1002-14651858.CD003147.pub5-abstract.txt to downloaded_texts/10.1002-14651858.CD003147.pub5-abstract.txt
Downloading 10.1002-14651858.CD003315.pub3-abstract.txt to downloaded_texts/10.1002-14651858.CD003315.pub3-abstract.txt
Downloading 10.1002-14651858.CD003459.pu

### Load text

Load the content of the downloaded text files into a dictionary or similar data structure.

In [None]:
import os

downloaded_texts_content = {}

for filename in os.listdir(local_download_dir):
    if filename.endswith('.txt'):
        filepath = os.path.join(local_download_dir, filename)
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                downloaded_texts_content[filename] = f.read()
            print(f"Loaded text from {filename}")
        except Exception as e:
            print(f"Error loading text from {filename}: {e}")


Loaded text from 10.1002-14651858.CD013877.pub3-pls.txt
Loaded text from 10.1002-14651858.CD008215.pub3-abstract.txt
Loaded text from 10.1002-14651858.CD013705.pub3-pls.txt
Loaded text from 10.1002-14651858.CD013157.pub2-pls.txt
Loaded text from 10.1002-14651858.CD010222.pub4-abstract.txt
Loaded text from 10.1002-14651858.CD012584.pub2-abstract.txt
Loaded text from 10.1002-14651858.CD013826.pub2-pls.txt
Loaded text from 10.1002-14651858.CD001218.pub3-abstract.txt
Loaded text from 10.1002-14651858.CD013297.pub2-pls.txt
Loaded text from 10.1002-14651858.CD003737.pub4-abstract.txt
Loaded text from 10.1002-14651858.CD013191-pls.txt
Loaded text from 10.1002-14651858.CD009961.pub2-abstract.txt
Loaded text from 10.1002-14651858.CD013756.pub2-pls.txt
Loaded text from 10.1002-14651858.CD011391.pub2-abstract.txt
Loaded text from 10.1002-14651858.CD013404.pub2-pls.txt
Loaded text from 10.1002-14651858.CD008838.pub2-abstract.txt
Loaded text from 10.1002-14651858.CD013162.pub2-pls.txt
Loaded text f
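
The file names above end in either `-abstract.txt` or `-pls.txt`. As a minimal sketch, assuming the part before that suffix identifies the review, the loaded files could be grouped so that an abstract and its plain language summary sit side by side. The helper name `group_by_review` and the second file name in the example list are hypothetical and used only for illustration.

```python
from collections import defaultdict


def group_by_review(filenames):
    """Group file names by review ID, keyed by kind ('abstract' or 'pls')."""
    groups = defaultdict(dict)
    for name in filenames:
        stem = name[:-len(".txt")] if name.endswith(".txt") else name
        # Assumption: the text after the last hyphen is the document kind.
        review_id, _, kind = stem.rpartition("-")
        if kind in ("abstract", "pls"):
            groups[review_id][kind] = name
    return dict(groups)


# Illustrative input; the "-pls" entry for CD001218 is hypothetical.
example = [
    "10.1002-14651858.CD001218.pub3-abstract.txt",
    "10.1002-14651858.CD001218.pub3-pls.txt",
    "10.1002-14651858.CD013191-pls.txt",
]
pairs = group_by_review(example)
for review_id, kinds in pairs.items():
    print(review_id, sorted(kinds))
```

In the notebook, the same call could be made on `downloaded_texts_content.keys()` to see which reviews have both kinds of file.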

# Task: Text simplification

## Text Simplification

Use the OpenAI API with the provided prompt to simplify the extracted text.

In [None]:
import time
from openai import OpenAI
client = OpenAI()


simplified_texts = {}
prompt = """Using the following abstract of a biomedical study as input, generate a Plain Language Summary
(PLS) understandable by any patient, regardless of their health literacy. Ensure that the generated text
adheres to the following instructions which should be followed step-by-step:
a. Specific Structure: The generated PLS should be presented in a logical order, using the following
order:
1. Plain Title
2. Rationale
3. Trial Design
4. Results
b. Sections should be authored following these parameters:
1. Plain Title: Simplified title understandable to a layperson that summarizes the research that was
done.
2. Rationale: Include: background or study rationale providing a general description of the
condition, what it may cause or why it is a burden for the patients; the reason and main hypothesis
for the study; and why the study is needed, and why the study medication has the potential to
treat the condition.
3. Trial Design: Answer ‘How is this study designed?’ Include the description of the design,
description of study and patient population (age, health condition, gender), and the expected
amount of time a person will be in the study.
4. Results: Answer ‘What were the main results of the study’, include the benefits for the patients,
how the study was relevant for the area of study, and the conclusions from the investigator.
c. Consistency and Replicability: The generated PLS should be consistent regardless of the order of
sentences or the specific phrasing used in the input protocol text.
d. Compliance with Plain Language Guidelines: The generated PLS must follow all these plain
language guidelines:
• Have readability grade level of 6 or below.
• Do not have jargon. All technical or medical words or terms should be defined or broken down
into simple and logical explanations.
• Active voice, not passive.
• Mostly one or two syllable words.
• Sentences of 15 words or less.
• Short paragraphs of 3-5 sentences.
• Simple numbers (e.g., ratios, no percentages).
e. Do not invent Content: The AI model should not invent information. If the AI model includes data
other than the one given in the input abstract, the AI model should guarantee such data is verified and
real.
f. Aim for an approximate PLS length of 500-900 words.


Abstract of a biomedical study text: {text}
"""
counter = 0
sum_times = 0


for filename, text in downloaded_texts_content.items():
    counter += 1
    retries = 3  # Maximum number of attempts per file

    while retries > 0:
        init_time = time.time()

        try:
            response = client.responses.create(
                model="gpt-5-nano",
                input=prompt.format(text=text)
            )

            simplified_texts[filename] = response.output_text
            response_time = time.time() - init_time
            sum_times += response_time
            print(f"Simplified text # {counter} for {filename}. Duration: {response_time}")
            break  # Exit the retry loop on success

        except Exception as e:
            retries -= 1  # Use up one attempt so the loop can terminate
            print(f"Error while simplifying {filename}: {e}. Attempts left: {retries}")
            if retries > 0:
                time.sleep(2)  # Brief pause before trying again

# Average over successful calls only, and avoid dividing by zero
if simplified_texts:
    print(f"Average time: {sum_times/len(simplified_texts)}")



Simplified text # 1 for 10.1002-14651858.CD013877.pub3-pls.txt. Duration: 31.018213748931885
Simplified text # 2 for 10.1002-14651858.CD008215.pub3-abstract.txt. Duration: 98.36972999572754
Simplified text # 3 for 10.1002-14651858.CD013705.pub3-pls.txt. Duration: 40.63139343261719
Simplified text # 4 for 10.1002-14651858.CD013157.pub2-pls.txt. Duration: 33.951061487197876
Simplified text # 5 for 10.1002-14651858.CD010222.pub4-abstract.txt. Duration: 55.00382614135742
Simplified text # 6 for 10.1002-14651858.CD012584.pub2-abstract.txt. Duration: 55.382086992263794
Simplified text # 7 for 10.1002-14651858.CD013826.pub2-pls.txt. Duration: 88.69882321357727
Simplified text # 8 for 10.1002-14651858.CD001218.pub3-abstract.txt. Duration: 34.996469259262085
Simplified text # 9 for 10.1002-14651858.CD013297.pub2-pls.txt. Duration: 67.11956596374512
Simplified text # 10 for 10.1002-14651858.CD003737.pub4-abstract.txt. Duration: 32.158323526382446
Simplified text # 11 for 10.1002-14651858.CD01319
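
The `retries` counter in the loop above suggests a retry policy for failed API calls. A minimal sketch of one common policy, exponential backoff, is shown here with a stand-in function instead of a real API call. The helper `call_with_retries` and its parameters are hypothetical names, not part of the OpenAI client.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as error:  # real code would catch narrower error types
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Stand-in for an API call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, attempts=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the sketch quick to try out; in the notebook the default `time.sleep` would be used and `fn` would wrap the `client.responses.create` call.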

In [None]:
import pandas as pd

# Create a DataFrame from the extracted and simplified texts
data = {'original_text': downloaded_texts_content, 'simplified_text': simplified_texts}
df_simplified = pd.DataFrame(data)

# Save the DataFrame to a CSV file
csv_filepath = 'simplified_texts.csv'
df_simplified.to_csv(csv_filepath, index_label='filename')

print(f"Results saved to {csv_filepath}")

Results saved to simplified_texts.csv
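
The prompt asks for sentences of 15 words or fewer and a length of 500-900 words. A rough way to spot-check a generated summary against those two rules is sketched below. The function `guideline_report` is a hypothetical helper, and its sentence splitter is deliberately naive: it splits on `.`, `!`, or `?` followed by whitespace, so headings without punctuation are merged into the following sentence.

```python
import re


def guideline_report(text, max_sentence_words=15, min_words=500, max_words=900):
    """Count words and over-long sentences using a naive sentence splitter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    total_words = sum(lengths)
    return {
        "sentences": len(sentences),
        "total_words": total_words,
        "length_ok": min_words <= total_words <= max_words,
        "long_sentences": sum(1 for n in lengths if n > max_sentence_words),
    }


# Made-up sample: two short sentences and one 17-word sentence.
sample = "We looked at one drug. It did not help. " + "Word " * 16 + "end."
report = guideline_report(sample)
print(report)
```

Applied to each value in `simplified_texts`, a report like this would show how closely the model followed the length instructions before the formal readability metrics are computed.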


# Task: Load data
Load `filename`, `original_text`, and `simplified_text` from "/content/simplified_texts.csv" and compute, for each row, a relevance score using BERTScore, a factual consistency score using AlignScore, and readability scores such as the Flesch–Kincaid Grade Level.

## Load data

Load the data from "/content/simplified_texts.csv" into a pandas DataFrame.


In [1]:
%pip install boto3 evaluate[bertscore] transformers sentence-transformers matplotlib pandas numpy textstat

Collecting boto3
  Downloading boto3-1.40.60-py3-none-any.whl.metadata (6.6 kB)
Collecting textstat
  Downloading textstat-0.7.10-py3-none-any.whl.metadata (15 kB)
Collecting evaluate[bertscore]
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting botocore<1.41.0,>=1.40.60 (from boto3)
  Downloading botocore-1.40.60-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.15.0,>=0.14.0 (from boto3)
  Downloading s3transfer-0.14.0-py3-none-any.whl.metadata (1.7 kB)
Collecting pyphen (from textstat)
  Downloading pyphen-0.17.2-py3-none-any.whl.metadata (3.2 kB)
Downloading boto3-1.40.60-py3-none-any.whl (139 kB)
Downloading textstat-0.7.10-py3-none-any.whl (239 kB)

In [2]:
import pandas as pd

df = pd.read_csv("/content/simplified_texts.csv", index_col='filename')
display(df.head())
df.info()  # info() prints its summary directly and returns None

filename,original_text,simplified_text
10.1002-14651858.CD013877.pub3-pls.txt,Interventions for the prevention of persistent...,Plain Title\nCan nasal sprays help stop long-l...
10.1002-14651858.CD008215.pub3-abstract.txt,Is there an increased risk of ovarian cancer i...,Plain Title\nDo fertility drugs raise the risk...
10.1002-14651858.CD013705.pub3-pls.txt,How accurate are rapid antigen tests for diagn...,Plain Title\nAre fast COVID tests good at find...
10.1002-14651858.CD013157.pub2-pls.txt,Nutritional supplementation for people with no...,Plain Title\nCan nutrition supplements help pe...
10.1002-14651858.CD010222.pub4-abstract.txt,Anticoagulant treatment for subsegmental pulmo...,Plain Title\nAre blood thinners good for small...


<class 'pandas.core.frame.DataFrame'>
Index: 300 entries, 10.1002-14651858.CD013877.pub3-pls.txt to 10.1002-14651858.CD013285.pub2-pls.txt
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   original_text    300 non-null    object
 1   simplified_text  300 non-null    object
dtypes: object(2)
memory usage: 7.0+ KB

# Task: Eval models

## Calculate BERTScore

Compute BERTScore for each pair of original and simplified texts.
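
To make the metric concrete, the following is a minimal sketch of the greedy matching at the core of BERTScore: each candidate token is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 is their harmonic mean. The two-dimensional vectors are made up for illustration; the real metric uses contextual embeddings from the chosen model, and this sketch leaves out optional IDF weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_bertscore(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy token matching."""
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "token embeddings" (assumed values, not model output).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
p, r, f = greedy_bertscore(candidate, reference)
print(p, r, f)
```

In this notebook the simplified text plays the role of the candidate and the original text the role of the reference, so recall drops when the simplification omits content that the original contains.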


In [3]:
%pip install bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


In [4]:
from evaluate import load

bertscore = load("bertscore")

f1_scores = []

# Score each simplified text (prediction) against its original (reference)
for index, row in df.iterrows():
    original_text = row['original_text']
    simplified_text = row['simplified_text']

    results = bertscore.compute(predictions=[simplified_text], references=[original_text], model_type="distilbert-base-uncased")

    # Keep only the F1 component; precision and recall are also available in `results`
    f1_scores.append(results['f1'][0])

df['bertscore'] = f1_scores

display(df.head())

filename,original_text,simplified_text,bertscore
10.1002-14651858.CD013877.pub3-pls.txt,Interventions for the prevention of persistent...,Plain Title\nCan nasal sprays help stop long-l...,0.83074
10.1002-14651858.CD008215.pub3-abstract.txt,Is there an increased risk of ovarian cancer i...,Plain Title\nDo fertility drugs raise the risk...,0.83438
10.1002-14651858.CD013705.pub3-pls.txt,How accurate are rapid antigen tests for diagn...,Plain Title\nAre fast COVID tests good at find...,0.795371
10.1002-14651858.CD013157.pub2-pls.txt,Nutritional supplementation for people with no...,Plain Title\nCan nutrition supplements help pe...,0.824293
10.1002-14651858.CD010222.pub4-abstract.txt,Anticoagulant treatment for subsegmental pulmo...,Plain Title\nAre blood thinners good for small...,0.80177


## Calculate AlignScore

Compute AlignScore for each pair of original and simplified texts.
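
As described by its authors, AlignScore handles long inputs by splitting the context into chunks and the claim into sentences, scoring every chunk-sentence pair, keeping the best chunk for each sentence, and averaging over sentences. The sketch below illustrates only that aggregation step. The word-overlap scorer `toy_overlap` is a hypothetical stand-in for the trained alignment model, and the example strings are made up.

```python
def aggregate_alignment(chunks, sentences, pair_score):
    """Average, over claim sentences, of the best score across context chunks."""
    per_sentence = [
        max(pair_score(chunk, sentence) for chunk in chunks)
        for sentence in sentences
    ]
    return sum(per_sentence) / len(per_sentence)


def toy_overlap(chunk, sentence):
    """Hypothetical scorer: fraction of sentence words that appear in the chunk."""
    words = sentence.lower().split()
    chunk_words = set(chunk.lower().split())
    return sum(w in chunk_words for w in words) / len(words)


# Context chunks (from the original text) and claim sentences (from the summary).
chunks = ["the drug lowered pain", "no serious harms were seen"]
sentences = ["the drug lowered pain", "serious harms were common"]
score = aggregate_alignment(chunks, sentences, toy_overlap)
print(score)
```

Under this scheme the original text is the context and the simplified text is the claim, so a summary sentence with no support in any chunk of the original pulls the average down.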

In [5]:
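# Install AlignScore from its GitHub repository (skipping its own dependency list)
!pip install git+https://github.com/yuh-zha/AlignScore.git --no-deps
!pip install pytorch-lightning==1.9.5

import torch
import transformers
# Compatibility shim: if the installed transformers release does not expose
# AdamW, fall back to the PyTorch implementation under the same name
if not hasattr(transformers, "AdamW"):
    from torch.optim import AdamW
    transformers.AdamW = AdamW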


Collecting git+https://github.com/yuh-zha/AlignScore.git
  Cloning https://github.com/yuh-zha/AlignScore.git to /tmp/pip-req-build-he5_2_1_
  Running command git clone --filter=blob:none --quiet https://github.com/yuh-zha/AlignScore.git /tmp/pip-req-build-he5_2_1_
  Resolved https://github.com/yuh-zha/AlignScore.git to commit a0936d5afee642a46b22f6c02a163478447aa493
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: alignscore
  Building wheel for alignscore (pyproject.toml) ... done
  Created wheel for alignscore: filename=alignscore-0.1.3-py3-none-any.whl size=18369 sha256=97fddaa7c81702c0c9870c7772cf3f383060db904bdafa4493ecc4d8f0bace49
  Stored in directory: /tmp/pip-ephem-wheel-cache-t4w6zdfm/wheels/61/78/cf/7adbff432a586bfb8a1673348218105047a4845e89a536007d
Successfully built alignscore
Installing collected pack

In [6]:
from alignscore import AlignScore

In [7]:
# Download the spaCy language model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [8]:
# Install textstat if not already installed
%pip install textstat

import nltk

nltk.download('punkt_tab')



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [10]:
import time
import textstat

device_id = 'cuda' if torch.cuda.is_available() else 'cpu'
counter = 0
sum_times = 0

alignscorer = AlignScore(model='roberta-base',
    batch_size=8,
    device=device_id,
    ckpt_path='https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt',
    evaluation_mode='nli_sp',
    verbose=False
)

align_scores = []

for index, row in df.iterrows():
    counter += 1
    init_time = time.time()
    print(f"Processing row {counter}")
    original_text = row['original_text']
    simplified_text = row['simplified_text']
    # Pass texts as lists to the score method
    align_scores.append(alignscorer.score([original_text], [simplified_text])[0])
    print(align_scores[-1])
    response_time = time.time() - init_time
    sum_times += response_time
    print(f"Duration: {response_time}")

df['alignscore'] = align_scores

display(df.head())
print(f"Average time: {sum_times/counter}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.7.7 to v1.9.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file https:/huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt`


Processing row 1
0.6130501627922058
Duration: 3.2345499992370605
Processing row 2
0.856458306312561
Duration: 1.180715560913086
Processing row 3
0.6005575656890869
Duration: 10.153597593307495
Processing row 4
0.6646915674209595
Duration: 2.8707330226898193
Processing row 5
0.31245750188827515
Duration: 1.17635178565979
Processing row 6
0.46968451142311096
Duration: 2.2512054443359375
Processing row 7
0.5758981704711914
Duration: 2.602611541748047
Processing row 8
0.6458489894866943
Duration: 2.501387119293213
Processing row 9
0.5080158114433289
Duration: 1.950127124786377
Processing row 10
0.5579732656478882
Duration: 4.471763610839844
Processing row 11
0.6968176960945129
Duration: 6.016550064086914
Processing row 12
0.7380382418632507
Duration: 1.724839687347412
Processing row 13
0.6571438908576965
Duration: 4.949644565582275
Processing row 14
0.5492532849311829
Duration: 2.9597747325897217
Processing row 15
0.6292087435722351
Duration: 7.669080495834351
Processing row 16
0.450768053

filename,original_text,simplified_text,bertscore,alignscore
10.1002-14651858.CD013877.pub3-pls.txt,Interventions for the prevention of persistent...,Plain Title\nCan nasal sprays help stop long-l...,0.83074,0.61305
10.1002-14651858.CD008215.pub3-abstract.txt,Is there an increased risk of ovarian cancer i...,Plain Title\nDo fertility drugs raise the risk...,0.83438,0.856458
10.1002-14651858.CD013705.pub3-pls.txt,How accurate are rapid antigen tests for diagn...,Plain Title\nAre fast COVID tests good at find...,0.795371,0.600558
10.1002-14651858.CD013157.pub2-pls.txt,Nutritional supplementation for people with no...,Plain Title\nCan nutrition supplements help pe...,0.824293,0.664692
10.1002-14651858.CD010222.pub4-abstract.txt,Anticoagulant treatment for subsegmental pulmo...,Plain Title\nAre blood thinners good for small...,0.80177,0.312458


Average time: 3.4571625781059265


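## Calculate Readability scores

Compute six readability metrics for each simplified text with `textstat`: Flesch-Kincaid Grade Level (FKGL), Coleman-Liau Index (CLI), Flesch Reading Ease (FRE), Gunning Fog Index (GFI), SMOG Index, and Dale-Chall Readability Score (DCRS).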



In [11]:

flesch_kincaid_scores = []
coleman_liau_scores = []
flesch_reading_ease_scores = []
gunning_fog_scores = []
smog_index_scores = []
dale_chall_scores = []


for index, row in df.iterrows():
    original_text = row['original_text']
    simplified_text = row['simplified_text']

    # Calculate Flesch-Kincaid Grade Level for the simplified text
    flesch_kincaid_scores.append(textstat.flesch_kincaid_grade(simplified_text))

    # Calculate Coleman-Liau Index
    coleman_liau_scores.append(textstat.coleman_liau_index(simplified_text))

    # Calculate Flesch Reading Ease
    flesch_reading_ease_scores.append(textstat.flesch_reading_ease(simplified_text))

    # Calculate Gunning Fog Index
    gunning_fog_scores.append(textstat.gunning_fog(simplified_text))

    # Calculate SMOG Index
    smog_index_scores.append(textstat.smog_index(simplified_text))

    # Calculate Dale-Chall Readability Score
    dale_chall_scores.append(textstat.dale_chall_readability_score(simplified_text))

df['flesch_kincaid_grade'] = flesch_kincaid_scores
df['coleman_liau_index'] = coleman_liau_scores
df['flesch_reading_ease'] = flesch_reading_ease_scores
df['gunning_fog_index'] = gunning_fog_scores
df['smog_index'] = smog_index_scores
df['dale_chall_score'] = dale_chall_scores


display(df.head())

filename,original_text,simplified_text,bertscore,alignscore,flesch_kincaid_grade,coleman_liau_index,flesch_reading_ease,gunning_fog_index,smog_index,dale_chall_score
10.1002-14651858.CD013877.pub3-pls.txt,Interventions for the prevention of persistent...,Plain Title\nCan nasal sprays help stop long-l...,0.83074,0.61305,7.556513,9.481493,68.819162,9.134757,9.624276,9.473067
10.1002-14651858.CD008215.pub3-abstract.txt,Is there an increased risk of ovarian cancer i...,Plain Title\nDo fertility drugs raise the risk...,0.83438,0.856458,7.479707,9.826374,59.445849,8.220269,8.998389,10.991912
10.1002-14651858.CD013705.pub3-pls.txt,How accurate are rapid antigen tests for diagn...,Plain Title\nAre fast COVID tests good at find...,0.795371,0.600558,7.215536,8.636364,70.685936,9.53406,9.833631,9.053809
10.1002-14651858.CD013157.pub2-pls.txt,Nutritional supplementation for people with no...,Plain Title\nCan nutrition supplements help pe...,0.824293,0.664692,10.577825,11.101468,53.208423,12.076444,11.830496,10.276537
10.1002-14651858.CD010222.pub4-abstract.txt,Anticoagulant treatment for subsegmental pulmo...,Plain Title\nAre blood thinners good for small...,0.80177,0.312458,6.766938,8.601928,72.035301,8.230162,9.223106,8.683787
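
For reference, two of the scores above follow simple closed-form formulas based on word, sentence, and syllable counts. The sketch below applies the standard Flesch-Kincaid Grade Level and Flesch Reading Ease formulas to made-up counts. The function names are hypothetical, and `textstat` uses its own tokenization and syllable counting, so its values for a real text will not necessarily match a hand count.

```python
def fkgl_from_counts(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: higher means harder to read."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fre_from_counts(words, sentences, syllables):
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Made-up counts: 100 words in 10 sentences with 130 syllables.
fkgl = fkgl_from_counts(words=100, sentences=10, syllables=130)
fre = fre_from_counts(words=100, sentences=10, syllables=130)
print(round(fkgl, 2), round(fre, 2))
```

Both formulas depend only on average sentence length and average syllables per word, which is why the prompt's rules on short sentences and short words target these scores directly.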


In [12]:
# Save the DataFrame to a CSV file
csv_filepath = 'simplified_texts_with_scores.csv'
df.to_csv(csv_filepath, index_label='filename')

print(f"DataFrame saved to {csv_filepath}")

DataFrame saved to simplified_texts_with_scores.csv
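
Once the scores are saved, a per-column average gives a quick overall picture. The sketch below uses only the standard library and a two-row sample whose values are copied from the output shown earlier. The helper `column_means` is a hypothetical name, and in the notebook `df[columns].mean()` would give the same kind of summary.

```python
import csv
import io
import statistics


def column_means(csv_text, columns):
    """Return the mean of each named numeric column in a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {c: statistics.mean(float(row[c]) for row in rows) for c in columns}


# Two rows taken from the displayed scores (text columns omitted).
sample_csv = """filename,bertscore,alignscore
10.1002-14651858.CD013877.pub3-pls.txt,0.83074,0.61305
10.1002-14651858.CD008215.pub3-abstract.txt,0.83438,0.856458
"""
means = column_means(sample_csv, ["bertscore", "alignscore"])
print(means)
```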


# Task: Save files in drive

In [None]:
import time
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')
timenow = str(int(time.time()))  # Unix timestamp keeps file names unique

# Define the path to save the CSV file with simplified texts in Google Drive
drive_csv_filepath = '/content/drive/My Drive/colabfiles/proj-gr-9/openai/simplified_texts' + timenow + '.csv'

# Save only the text columns. They are taken from df because df_simplified
# is not defined if the runtime was restarted after the simplification step
df[['original_text', 'simplified_text']].to_csv(drive_csv_filepath, index_label='filename')

print(f"DataFrame saved to {drive_csv_filepath}")

# Define the path to save the CSV file with scores in Google Drive
drive_csv_filepath = '/content/drive/My Drive/colabfiles/proj-gr-9/openai/simplified_texts_with_scores' + timenow + '.csv'

# Save the DataFrame with all scores to the specified path in Google Drive
df.to_csv(drive_csv_filepath, index_label='filename')

print(f"DataFrame saved to {drive_csv_filepath}")