# Inspect datasource

The data source is the [PubMed search engine](https://pubmed.ncbi.nlm.nih.gov/?term=diabetes%20mellitus&filter=simsearch1.fha&filter=pubt.clinicaltrial&sort=date), queried for "diabetes mellitus", restricted to clinical trials and sorted by date.
Collect a representative sample from the source, as large as possible.
Download the search results and save them to a file in the `datasets/1_raw` folder.

In [1]:
from src.tools.file import read_file_as_string

# read downloaded file
raw_file_content = read_file_as_string('datasets/1_raw/pubmed-diabetesme-set.txt')
print(raw_file_content[:1000])

PMID- 37380983
OWN - NLM
STAT- MEDLINE
DCOM- 20230630
LR  - 20230702
IS  - 1475-2840 (Electronic)
IS  - 1475-2840 (Linking)
VI  - 22
IP  - 1
DP  - 2023 Jun 28
TI  - Left ventricular mass predicts cardiac reverse remodelling in patients treated 
      with empagliflozin.
PG  - 152
LID - 10.1186/s12933-023-01849-w [doi]
LID - 152
AB  - BACKGROUND: The cardiovascular (CV) benefits of sodium-glucose transport protein 
      2 inhibitors have been attributed, in part, to cardiac reverse remodelling. The 
      EMPA-HEART CardioLink-6 study reported that sodium-glucose cotransporter-2 
      inhibition for 6 months with empagliflozin was associated with a significant 
      reduction in left ventricular mass indexed to body surface area (LVMi). In this 
      sub-analysis, we evaluated whether baseline LVMi may influence how empagliflozin 
      affects cardiac reverse remodelling. METHODS: A total of 97 patients with type 2 
      diabetes and coronary artery disease were randomized to empa

In [2]:
import pandas as pd
from src.tools.pubmed import parse_pubmed_content

# parse file content
(titles, abstracts) = parse_pubmed_content(raw_file_content)

# create df
pubmed_df = pd.DataFrame({'title': titles, 'abstract': abstracts})
pubmed_df.head()

Unnamed: 0,title,abstract
0,Left ventricular mass predicts cardiac revers...,BACKGROUND: The cardiovascular (CV) benefits ...
1,Effects of therapeutic ultrasound on the endo...,Type 2 diabetes mellitus (T2DM) is characteri...
2,A Pilot Study on the Efficacy of a Diabetic D...,High sugar consumption increases the risk of ...
3,Decreased branched-chain amino acids and elev...,INTRODUCTION: Hypoglycemia is a major limitin...
4,Effects of a Web-Based Lifestyle Intervention...,BACKGROUND: The high proportion of people wit...
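
The implementation of `parse_pubmed_content` is not shown in this notebook. As a rough illustration of what such a parser has to do, here is a minimal sketch for the tagged format printed above: a tag of up to four characters, then `- `, then the value, with continuation lines indented by six spaces and records separated by blank lines. The function name `parse_medline` and the returned structure are assumptions made for this example, not the project's actual code.

```python
WANTED_TAGS = {"TI", "AB"}  # title and abstract


def parse_medline(content: str) -> list:
    """Return one dict per record with the wanted tags (hypothetical helper)."""
    records, current, tag = [], {}, None
    for line in content.splitlines():
        if not line.strip():
            # a blank line closes the current record
            if current:
                records.append(current)
            current, tag = {}, None
        elif line.startswith("      "):
            # continuation of the previous field
            if tag in WANTED_TAGS:
                current[tag] += " " + line.strip()
        elif len(line) >= 6 and line[4:6] == "- ":
            # new field: "TAG - value"
            tag = line[:4].strip()
            if tag in WANTED_TAGS:
                current[tag] = line[6:].strip()
    if current:
        records.append(current)
    return records


sample = (
    "PMID- 1\n"
    "TI  - Left ventricular mass predicts \n"
    "      remodelling.\n"
    "AB  - BACKGROUND: text\n"
    "\n"
    "PMID- 2\n"
    "TI  - Second title\n"
)
records = parse_medline(sample)
```

Keeping each record as a dict makes it visible when an article has a title but no abstract, which two parallel lists could hide.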


In [3]:
from src.tools.file import write_file
import os

# directory constants
DIRECTORY_DATA = os.path.join('', 'datasets')
DIRECTORY_TEXT = os.path.join(DIRECTORY_DATA, '2_cleaned')

# save each abstract to its own file, named after the cleaned title
for i in range(len(titles)):
    # clean title
    title = titles[i].replace('\n', '').replace('\r', '').replace('\t', '').replace('\v', '').replace('\f', '').replace(
        '?', '').replace('.', '').replace('/', '-')

    # cut title if too long
    title = (title[:75] + '..') if len(title) > 75 else title
    file_name = title.strip() + ".txt"

    # select abstract
    abstract = abstracts[i]

    # save to separate file
    write_file(os.path.join(DIRECTORY_TEXT, file_name), abstract)

    print(f"saved {file_name}")

saved Left ventricular mass predicts cardiac reverse remodelling in patients tre...txt
saved Effects of therapeutic ultrasound on the endothelial function of patients ...txt
saved A Pilot Study on the Efficacy of a Diabetic Diet Containing the Rare Sugar...txt
saved Decreased branched-chain amino acids and elevated fatty acids during antec...txt
saved Effects of a Web-Based Lifestyle Intervention on Weight Loss and Cardiomet...txt
saved Ultrasound guided versus blinded injection in trigger finger treatment: a ...txt
saved Effect of the MySweetheart randomized controlled trial on birth, anthropom...txt
saved Tofogliflozin long-term effects on atherosclerosis progression and major c...txt
saved A randomized controlled trial for evaluating an occupational therapy self ...txt
saved The Effectiveness of a Traditional Chinese Medicine-Based Mobile Health Ap...txt
saved Implementation of Self-Care Deficits Assessment and a Nurse-Led Supportive...txt
saved Does fasting plasma glucose values 51
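
The title clean-up in the cell above can also be written as a small reusable function. This is a sketch only: `safe_file_name` is a hypothetical name, and unlike the chained `replace` calls it collapses whitespace to a single space and also removes a few more characters that are invalid in file names on some systems.

```python
import re


def safe_file_name(title: str, max_len: int = 75) -> str:
    """Turn an article title into a file name (hypothetical helper)."""
    # collapse newlines, tabs and repeated spaces into one space
    cleaned = re.sub(r"\s+", " ", title)
    # drop characters that are awkward or invalid in file names
    cleaned = re.sub(r'[<>:"\\|?*.]', "", cleaned)
    cleaned = cleaned.replace("/", "-").strip()
    # cut the title if it is too long, marking the cut with ".."
    if len(cleaned) > max_len:
        cleaned = cleaned[:max_len] + ".."
    return cleaned + ".txt"


short_name = safe_file_name("A/B? test.")
long_name = safe_file_name("x" * 80)
```

Two titles that share their first 75 characters would map to the same file name, so comparing the number of saved files with `len(titles)` is a useful check.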

In [4]:
# check if all files are saved
files = os.listdir(DIRECTORY_TEXT)
print(f"saved {len(files)} files")

saved 30 files


# Mark your data
[doccano](https://github.com/doccano/doccano) is an open-source text annotation tool for humans.

## Run the local service
Start doccano locally, import the cleaned files, and annotate them.

- We annotate spans of text.
- Create three labels for the text:
  - `disease`
  - `drug`
  - `subject` (a person or an animal)

In [5]:
%%bash
# startup doccano
docker compose up -d

# check if it is running
docker compose ps

 Container j1-postgres-1  Created
 Container j1-backend-1  Created
 Container j1-rabbitmq-1  Created
 Container j1-nginx-1  Created
 Container j1-celery-1  Created
 Container j1-flower-1  Created
 Container j1-postgres-1  Starting
 Container j1-rabbitmq-1  Starting
 Container j1-postgres-1  Started
 Container j1-backend-1  Starting
 Container j1-rabbitmq-1  Started
 Container j1-celery-1  Starting
 Container j1-celery-1  Started
 Container j1-flower-1  Starting
 Container j1-backend-1  Started
 Container j1-nginx-1  Starting
 Container j1-nginx-1  Started
 Container j1-flower-1  Started


NAME                IMAGE                      COMMAND                  SERVICE             CREATED             STATUS                  PORTS
j1-backend-1        doccano/doccano:backend    "/opt/bin/prod-djang…"   backend             6 days ago          Up 1 second             
j1-celery-1         doccano/doccano:backend    "/opt/bin/prod-celer…"   celery              6 days ago          Up 1 second             
j1-flower-1         doccano/doccano:backend    "/opt/bin/prod-flowe…"   flower              2 days ago          Up Less than a second   0.0.0.0:5555->5555/tcp
j1-nginx-1          doccano/doccano:frontend   "/docker-entrypoint.…"   nginx               2 days ago          Up Less than a second   80/tcp, 0.0.0.0:80->8080/tcp
j1-postgres-1       postgres:13.3-alpine       "docker-entrypoint.s…"   postgres            6 days ago          Up 2 seconds            5432/tcp
j1-rabbitmq-1       rabbitmq:3.10.7-alpine     "docker-entrypoint.s…"   rabbitmq            6 days ago          Up 

In [6]:
# after exporting the annotated dataset from doccano,
# load it into the project
from src.tools.file import read_file_as_string

DIRECTORY_EXPORTED = os.path.join(DIRECTORY_DATA, '3_exported')
EXPORTED_FILE_PATH = os.path.join(DIRECTORY_EXPORTED, 'admin.jsonl')

marked_dataset = read_file_as_string(EXPORTED_FILE_PATH)
print(marked_dataset[:1000])

{"id":54,"text":" OBJECTIVE: The purpose of the present double-blind, placebo-controlled, randomized clinical trial was to evaluate the efficacy and safety of Cytoflavin in patients with diabetic polyneuropathy (DPN). MATERIAL AND METHODS: Investigational therapy was administered in two steps: intravenous infusions of experimental drug\/placebo for 10 days followed by oral administration for 75 days. In 10 clinical centers, 216 patients aged 45-74 years with a diagnosis of type 2 diabetes mellitus, symptomatic distal sensorimotor DPN, confirmed no earlier than 1 year before screening, on stable therapy (no change of drugs and doses) by oral hypoglycemic drugs, intermediate-acting, long-acting or extra-long-acting insulin, and\/or GLP-1 receptor agonists. RESULTS: By the end of treatment, the change of the Total Symptom Score (TSS) in the experimental group was -2.65 points, in the placebo group -1.73 points (p<0.001). Improvement of symptoms in the experimental group was achieved regar

In [7]:
from src.tools.doccano import parse_doccano_content

# parse marked dataset
doccano_parsed = parse_doccano_content(marked_dataset)
ready_df = pd.DataFrame(doccano_parsed)
ready_df.head()

Unnamed: 0,prompt,completion
0,OBJECTIVE: The purpose of the present double-b...,diabetic polyneuropathy\nic\ntype 2 diabetes ...
1,High sugar consumption increases the risk of d...,diabetes\npatients\ntype 2 diabetes\n\n###\n\n
2,INTRODUCTION: Relationships between glycemic-l...,s of canaglifl\nDiabetes\nindividuals\n\n###\n\n
3,This study evaluated the efficacy of the Occup...,Occupational Therapy Diabetes Self-Management...
4,OBJECTIVES: The aim of the study was to invest...,FPG\nfasting plasma glucose\npregnant women\n...
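
`parse_doccano_content` is project code that is not shown here. The sketch below illustrates one way a single exported line could be turned into a prompt/completion pair. It rests on several assumptions: that each record carries its spans in a `label` field as `[start, end, label]` triples, that the prompt ends with the `\n\n##\n\n` separator used later at inference time, and that `to_prompt_completion` is a hypothetical name. The completion suffix `\n\n###\n\n` is taken from the output above.

```python
import json

PROMPT_SEPARATOR = "\n\n##\n\n"   # assumption: same separator as at inference time
COMPLETION_END = "\n\n###\n\n"    # suffix visible in the parsed completions


def to_prompt_completion(jsonl_line: str) -> dict:
    """Convert one doccano record into a prompt/completion pair (sketch)."""
    record = json.loads(jsonl_line)
    text = record["text"]
    # assumption: spans are stored as [start, end, label] under "label"
    spans = sorted(record.get("label", []), key=lambda span: span[0])
    # slice the ORIGINAL text: the offsets refer to it, not to a stripped copy
    entities = [text[start:end] for start, end, _label in spans]
    return {
        "prompt": text.strip() + PROMPT_SEPARATOR,
        "completion": "\n".join(entities) + COMPLETION_END,
    }


line = json.dumps({
    "id": 1,
    "text": " Patients with diabetes",
    "label": [[15, 23, "disease"], [1, 9, "subject"]],
})
pair = to_prompt_completion(line)
```

Fragments such as `ic` or `s of canaglifl` in the table above may be worth checking against the annotations: one possible cause is slicing spans from text that was stripped or otherwise altered after export, which shifts every offset.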


In [8]:
# split the dataset into validation and training sets

validation_dataset = doccano_parsed[0:5]
training_dataset = doccano_parsed[5:]

print(f"validation dataset size: {len(validation_dataset)}")
print(f"training dataset size: {len(training_dataset)}")

validation dataset size: 5
training dataset size: 21
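
The split above takes the first five records for validation. If the export order is not random, a shuffled split with a fixed seed may give a more representative validation set. A minimal sketch follows; `split_dataset` and its default values are assumptions for illustration.

```python
import random


def split_dataset(records, validation_size=5, seed=42):
    """Shuffle a copy of the records and split it (hypothetical helper)."""
    shuffled = list(records)
    # a seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    return shuffled[:validation_size], shuffled[validation_size:]


validation_part, training_part = split_dataset(list(range(26)))
```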


In [9]:
# save datasets to files
from src.tools.file import write_file

DIRECTORY_READY = os.path.join(DIRECTORY_DATA, '4_ready')
TRAINING_DATASET_NAME = 'training_dataset.jsonl'
VALIDATION_DATASET_NAME = 'validation_dataset.jsonl'

# create dataframes
df_training = pd.DataFrame([vars(f) for f in training_dataset])
df_validation = pd.DataFrame([vars(f) for f in validation_dataset])

# create filepaths
training_file_path = os.path.join(DIRECTORY_READY, TRAINING_DATASET_NAME)
validation_file_path = os.path.join(DIRECTORY_READY, VALIDATION_DATASET_NAME)

# save to files
write_file(training_file_path, df_training.to_json(lines=True, orient='records'))
write_file(validation_file_path, df_validation.to_json(lines=True, orient='records'))
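
Before uploading, it can be useful to confirm that every line of the saved files is valid JSON with the two expected keys. The sketch below assumes the `prompt` and `completion` columns shown earlier; `validate_jsonl` is a hypothetical helper, not part of `src.tools`.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}


def validate_jsonl(content: str) -> int:
    """Return the number of valid records, raising on the first bad line."""
    count = 0
    for number, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {number}: missing keys {sorted(missing)}")
        count += 1
    return count


example = (
    '{"prompt": "text one", "completion": "diabetes"}\n'
    '{"prompt": "text two", "completion": "patients"}\n'
)
valid_records = validate_jsonl(example)
```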

# Train the model
[Fine-tune](https://beta.openai.com/docs/guides/fine-tuning) the model with your data.
The prepared datasets are sent to the OpenAI API, which then trains the model.

```bash
# set the API key (bash syntax; in PowerShell use $env:OPENAI_API_KEY="YOUR_API_KEY")
export OPENAI_API_KEY="YOUR_API_KEY"

# send data to the api
openai api fine_tunes.create \
  --training-file "datasets/4_ready/training_dataset.jsonl" \
  --validation-file "datasets/4_ready/validation_dataset.jsonl" \
  --model ada \
  --learning-rate 5e-5 \
  --batch-size 4 \
  --validation-split 0.1 \
  --epochs 5 \
  --suffix "j1-abs-2023-07-02"
```

Do not forget to write down the fine-tune id and the model id:
ft_id = 'ft-uOzqOs400l8kg0b9yv0hZh0L'
model_id = 'ada:ft-personal:j1-abs-2023-07-02-20-56-14'

## Useful commands
```bash
# follow
openai api fine_tunes.follow -i ft-XXXXXX

# list all models
openai api models.list

# list all finetunes
openai api fine_tunes.list

# retrieve finetune
openai api fine_tunes.get --id ft-XXXXXX

# results
openai api fine_tunes.results --id ft-XXXXXX

# delete finetune
openai api fine_tunes.delete --id ft-XXXXXX
```

# Let's try!
The dataset contains only articles on type 2 diabetes mellitus. Let's run the models on an abstract from an article that is not about type 2 diabetes mellitus.
[I took this one](https://pubmed.ncbi.nlm.nih.gov/35971155/).

In [25]:
import env
import openai

# os is already imported above; repeated so this cell can run on its own
import os

# fine-tune job ids
ft_ada_id = 'ft-uOzqOs400l8kg0b9yv0hZh0L'
ft_curie_id = 'ft-U61qjWKCWzCyDXjyhNphLwF3'
ft_babbage_id = 'ft-pwIUGj8qxPKPXhG2ZltP3OML'
ft_davinci_id = 'ft-8ppSifUvfzXdiGmqeq2kU5Z7'

# fine-tuned model ids
model_ada_id = 'ada:ft-personal:j1-abs-2023-07-02-20-56-14'
model_curie_id = 'curie:ft-personal-2023-07-03-17-28-26'
model_babbage_id = 'babbage:ft-personal-2023-07-03-17-29-44'
model_davinci_id = 'davinci:ft-personal:j1-abs-2023-07-02-2023-07-02-23-21-33'

TEST_PROMPT_DIRECTORY = os.path.join(DIRECTORY_DATA, '5_test_prompts')
TEST_PROMPT_FILE_NAME = 'test_prompt.txt'

test_prompt = read_file_as_string(os.path.join(TEST_PROMPT_DIRECTORY, TEST_PROMPT_FILE_NAME))
print(test_prompt[0:1000])

BACKGROUND: Coronavirus disease-19 (COVID-19) infection causes persistent health
problems such as breathlessness, chest pain and fatigue, and therapies for the 
prevention and early treatment of post-COVID-19 syndromes are needed. 
Accordingly, we are investigating the effect of a resistance exercise 
intervention on exercise capacity and health status following COVID-19 
infection.
METHODS: A two-arm randomised, controlled clinical trial including 220 adults 
with a diagnosis of COVID-19 in the preceding 6 months. Participants will be 
classified according to clinical presentation: Group A, not hospitalised due to 
COVID but persisting symptoms for at least 4 weeks leading to medical review; 
Group B, discharged after an admission for COVID and with persistent symptoms 
for at least 4 weeks; or Group C, convalescing in hospital after an admission 
for COVID. Participants will be randomised to usual care or usual care plus a 
personalised and pragmatic resistance exercise intervention 

In [28]:
# append the prompt separator and query each fine-tuned model
prompt = test_prompt + "\n\n##\n\n"

openai.api_key = env.OPENAI_API_KEY
temperature = 0.0

series = []

# run the prompt through each model
for (m_id, name) in ((model_ada_id, 'ada'), (model_curie_id, 'curie'), (model_babbage_id, 'babbage'), (model_davinci_id, 'davinci')):
    response = openai.Completion.create(
        model=m_id,
        temperature=temperature,
        prompt=prompt,
    )
    series.append(pd.Series(response['choices'][0]['text'].split('\n'), name=name))

df = pd.concat(series, axis=1)
df

Unnamed: 0,ada,curie,babbage,davinci,ada.1,curie.1,babbage.1,davinci.1,ada.2,curie.2,babbage.2,davinci.2
0,Coronavirus disease-19,,,,Women,patients,Women,,type,patients,,
1,participants,Participants,###,Coronavirus disease-19,women,gestational diabetes mellitus,gestational diabetes mellitus,Women,patients,Parkinson's disease,###Patients,METHODS: Thirty-five elderly patients were ran...
2,,,,COVID-19,gestational diabetes mellitus,,,,,,,
3,###,###,,persistent,,###,###,Gestational diabetes mellitus,###,##,###Patients,
4,,,###,,###,,,,,,,
5,Coron,,,,,women,,OGTT,type,patients,###Patients,
6,,,,,Women,,###,,patients,Park,,
7,,###,###,,women,###,,,,,,
8,,,,,,,patients,,###,##,,
9,,,,,,,,,,,,
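
The raw completions above still contain the `###` marker, repeated entities, and cut-off fragments such as `Coron`. A likely contributor is that the call passes neither `stop` nor `max_tokens`, and the legacy Completions endpoint defaults to 16 output tokens; passing `stop` and a larger `max_tokens` may help. Independently of that, the text can be cleaned after the fact. The sketch below keeps only what precedes the first stop marker and drops duplicates; `extract_entities` is a hypothetical helper.

```python
from typing import List


def extract_entities(completion_text: str, stop: str = "###") -> List[str]:
    """Return the unique entities listed before the first stop marker."""
    # everything after the first marker is treated as run-on generation
    head = completion_text.split(stop, 1)[0]
    seen = set()
    entities = []
    for line in head.splitlines():
        entity = line.strip()
        key = entity.lower()  # compare case-insensitively
        if entity and key not in seen:
            seen.add(key)
            entities.append(entity)
    return entities


first = extract_entities("Coronavirus disease-19\nparticipants\n\n###\n\nCoron")
second = extract_entities(" Women\nwomen\ngestational diabetes mellitus\n\n###")
```

Applied to each model's `response['choices'][0]['text']`, this would give one short list per model instead of a column of mixed fragments.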
