# Methodology

The goal is to fine-tune a model that can predict an appropriate response, given a conversation of a given length.

The amount of context the model uses can vary within practical limits. The model should be able to respond appropriately during each phase of a remote digital optometry exam.

### Stages of an exam:
- #### Introduction
  - Initialization: a greeting
  - During: the local technician may need to set the patient up. The assistant asks for the patient's name and introduces itself.
  - Transition: the local technician or patient will indicate their readiness to proceed. The assistant should explain the visual acuity exam.
- #### Visual Acuity
  - Initialization: ready signal from local tech or patient.
  - During: give instructions to the patient to assess the visual acuity without a prescription applied.
  - Transition: binocular and monocular visual acuity for both eyes is observed.
- #### Subjective Refraction
  - Initialization: monocular visual acuity without the prescription is completed.
  - During: give instructions that facilitate monocular, then binocular lens comparisons. Binocular lens comparisons are complete when the patient reports no difference when the lenses are changed.
  - Transition: the patient is shown the previous prescription and the newly found prescription and asked whether they see a difference.
- #### Close Vision Test
  - Initialization: the patient is shown a comparison of the old and new prescriptions. The local tech is instructed to place the reading card in front of the patient. A ready signal should be given.
  - During: the patient is instructed to read a specific row of the card. If the patient correctly identifies 80% of the letters, the exam ends.
  - Transition: Exit instructions and valediction.
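
The stage order above can be written down as a small lookup, which may help when labelling samples by exam phase. This is a minimal sketch: the `ExamStage` enum and the `next_stage` helper are hypothetical names introduced here for illustration, not part of the project code.

```python
from enum import Enum
from typing import Optional


class ExamStage(Enum):
    """Exam stages in the order listed above (hypothetical names)."""
    INTRODUCTION = 1
    VISUAL_ACUITY = 2
    SUBJECTIVE_REFRACTION = 3
    CLOSE_VISION = 4


def next_stage(stage: ExamStage) -> Optional[ExamStage]:
    """Return the stage that follows `stage`, or None after the last one."""
    members = list(ExamStage)  # Enum iteration follows definition order
    position = members.index(stage)
    return members[position + 1] if position + 1 < len(members) else None


print(next_stage(ExamStage.INTRODUCTION))
```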

In [1]:
from glob import glob

data_dir = './recordings/'
jjara = data_dir + 'RT-JJara/'
stephens = data_dir + 'rt-lstephens/'
brokus = data_dir + 'rt-sbrokus/'
# take only first 10 of jjara, listed in some weird order in the directory
good_files = [
    'recordings/RT-JJara/17885_428_VC_1_1_28_11_2023_15_20_01',
    'recordings/RT-JJara/28081_391_VC_1_1_28_11_2023_17_27_15',
    'recordings/RT-JJara/34401_575_VC_1_1_28_11_2023_14_42_16',
    'recordings/RT-JJara/73015_939_VC_1_1_28_11_2023_15_22_48',
    'recordings/RT-JJara/102725_1374_VC_1_1_04_12_2023_16_44_06',
    'recordings/RT-JJara/298880_387_VC_1_1_04_12_2023_15_43_49',
    'recordings/RT-JJara/469703_9570_VC_1_1_29_11_2023_16_23_53',
    'recordings/RT-JJara/477169_1508_VC_1_1_30_11_2023_16_26_11',
    'recordings/RT-JJara/521679_1570_VC_1_1_30_11_2023_12_05_38',
    'recordings/RT-JJara/595841_4602_VC_1_1_01_12_2023_14_34_27'
]
jjara = []
for f in good_files:
    jjara.append('./' + f + '/clean_captions1.txt')
# other directories are cleaner.
stephens = glob(stephens + '*/clean_captions1.txt')
brokus = glob(brokus + '*/clean_captions1.txt')
for i in [jjara, stephens, brokus]:
    print(len(i))


10
10
10


# Text Selection

There are 30 diarized transcripts containing full vision exams. To form a dataset, we will create dialogues of random length, between 1 and 10 lines each.

Each selection should end with the assistant's predicted response.

Step 1 - split caption files into train/val/test

Step 2 - randomly select dialogue length, from 1 to 10 lines

Step 3 - select lines from a random file. Check to make sure the last line is the assistant.

Step 4 - separate the dialogue from the assistant's response.

## Step 1

We have 30 files to be divided in a 70/10/20 train/val/test split.

- 21 will randomly be chosen to be in the training set. 

- 3 for the validation set. 

- 6 for the test set.
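
The counts above follow directly from the split fractions. As a small worked example, the hypothetical helper `split_counts` below rounds each fraction to a whole number of files and gives the remainder to the last split so that the counts always sum to the total.

```python
def split_counts(total, fractions):
    """Convert split fractions into whole-file counts that sum to `total`."""
    counts = [round(total * f) for f in fractions[:-1]]
    counts.append(total - sum(counts))  # remainder goes to the last split
    return counts


# 70/10/20 split of 30 files
print(split_counts(30, [0.7, 0.1, 0.2]))
```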

In [38]:
all_files  = jjara.copy()
all_files  += stephens
all_files  += brokus

print(len(all_files))
print(all_files[8:11])
print(all_files[18:21])
print(all_files[28:30])

30
['./recordings/RT-JJara/521679_1570_VC_1_1_30_11_2023_12_05_38/clean_captions1.txt', './recordings/RT-JJara/595841_4602_VC_1_1_01_12_2023_14_34_27/clean_captions1.txt', './recordings/rt-lstephens/1442410_6967_VC_1_1_01_12_2023_18_35_50/clean_captions1.txt']
['./recordings/rt-lstephens/1052783_6838_VC_1_1_04_12_2023_15_49_25/clean_captions1.txt', './recordings/rt-lstephens/612446_5307_VC_1_1_01_12_2023_18_03_36/clean_captions1.txt', './recordings/rt-sbrokus/34898_571_VC_1_1_04_12_2023_11_05_51/clean_captions1.txt']
['./recordings/rt-sbrokus/405969_4506_VC_1_1_01_12_2023_14_38_33/clean_captions1.txt', './recordings/rt-sbrokus/566198_1378_VC_1_1_28_11_2023_16_38_04/clean_captions1.txt']


In [2]:
all_files[0]

'/home/digitalopt/proj/datasets/Exam_v1/test/000024.txt'

In [3]:
# found that some data files still contain SPEAKER_XX speaker name format
from diarization_utils import CaptionCleaner
from glob import glob


cleaner = CaptionCleaner()
data_dir = '/home/digitalopt/proj/datasets/Exam_v1/'
all_files = glob(data_dir + '*/*.txt')
all_files.sort()
for p in all_files:
    captions = cleaner.read_captions(p)
    if captions[0].startswith("[SPEAKER_"):
        print(f'{p}')
        print(f'before cleaning:\n{captions[:5]}')
        cleaner.speaker_cnt(captions)
        new = cleaner.remap_speaker_names(captions)
        print(f'After:\n{new[:5]}')
        # cleaner.write_captions(new, p)


/home/digitalopt/proj/datasets/Exam_v1/train/000003.txt
before cleaning:
['[SPEAKER_00]  Hello.', '[SPEAKER_01]  Hi, how are you two doing today?', '[SPEAKER_00]  Fine. Have a seat in that chair right there.', '[SPEAKER_01]  Great. Is [NAME] with us today?', '[SPEAKER_03]  I am.']
Dialogue contains : 4 speakers.
speaker: SPEAKER_00, words spoken: 68
speaker: SPEAKER_01, words spoken: 516
speaker: SPEAKER_03, words spoken: 111
speaker: SPEAKER_02, words spoken: 30
More than three speakers in captions! Only 3 speakers allowed.
sorted speakers by word count: 
[('SPEAKER_02', 30), ('SPEAKER_00', 68), ('SPEAKER_03', 111), ('SPEAKER_01', 516)]
/home/digitalopt/proj/datasets/Exam_v1/train/000003.txt is invalid! Only 3 speakers allowed. Removing speakers by lowest word count.
truncated speaker list: ['SPEAKER_00', 'SPEAKER_03', 'SPEAKER_01']
speaker mapping <spaker in data> : <new spaker label>
{'SPEAKER_00': 'LocalTech', 'SPEAKER_03': 'Patient__', 'SPEAKER_01': 'Assistant'}
After:
['[LocalTec
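
The log above suggests that the cleaner counts words per speaker, keeps the three most talkative speakers, and assigns role labels. The sketch below illustrates that idea only; it is not the `CaptionCleaner` implementation. The rule that maps the lowest word count to `LocalTech`, the middle one to `Patient__`, and the highest to `Assistant` is an assumption inferred from this single example, and `remap_speakers` is a hypothetical name.

```python
import re
from collections import Counter

# Assumed role order, from fewest to most words spoken (inferred from the log above)
ROLES = ['LocalTech', 'Patient__', 'Assistant']
SPEAKER_RE = re.compile(r'^\[(SPEAKER_\d+)\]\s*(.*)$')


def remap_speakers(captions):
    """Keep the three speakers with the most words and relabel them (sketch).

    Assumes the captions contain at least three distinct speakers.
    """
    counts = Counter()
    for line in captions:
        match = SPEAKER_RE.match(line)
        if match:
            counts[match.group(1)] += len(match.group(2).split())
    # three most talkative speakers, reordered from fewest to most words
    kept = [name for name, _ in counts.most_common(3)][::-1]
    mapping = dict(zip(kept, ROLES))
    remapped = []
    for line in captions:
        match = SPEAKER_RE.match(line)
        # lines from dropped speakers are skipped
        if match and match.group(1) in mapping:
            remapped.append(f'[{mapping[match.group(1)]}]  {match.group(2)}')
    return remapped, mapping


demo = [
    '[SPEAKER_00]  Hello there.',
    '[SPEAKER_01]  Hi, how are you two doing today?',
    '[SPEAKER_02]  Um.',
    '[SPEAKER_03]  I am here.',
]
remapped, mapping = remap_speakers(demo)
print(mapping)
```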

In [40]:
import random

total = 30
splits = [21, 3, 6]
# shuffle every file index once, then slice: the three sets are disjoint
# by construction and together cover all 30 files
shuffled = list(range(total))
random.shuffle(shuffled)
train_idx = set(shuffled[:splits[0]])
val_idx = set(shuffled[splits[0]:splits[0] + splits[1]])
test_idx = set(shuffled[splits[0] + splits[1]:])
trainVal = train_idx.union(val_idx)
print(f'train: {len(train_idx)}\tval: {len(val_idx)}\t \ttrainval: {len(trainVal)}\ttest: {len(test_idx)}')

train: 21	val: 3	 	trainval: 24	test: 6
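
A useful sanity check for a split like this is that the three index sets are pairwise disjoint and together cover every file. The sketch below wraps a shuffle-and-slice split in a hypothetical `split_indices` helper with a fixed seed so that the assignment is reproducible; the seed value is an arbitrary assumption.

```python
import random


def split_indices(total, sizes, seed=0):
    """Shuffle range(total) with a fixed seed and cut it into consecutive groups."""
    if sum(sizes) != total:
        raise ValueError('split sizes must sum to the number of files')
    indices = list(range(total))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    groups, start = [], 0
    for size in sizes:
        groups.append(set(indices[start:start + size]))
        start += size
    return groups


train_idx, val_idx, test_idx = split_indices(30, [21, 3, 6])
# the sets must not overlap and must cover all 30 files
assert not (train_idx & val_idx or train_idx & test_idx or val_idx & test_idx)
assert train_idx | val_idx | test_idx == set(range(30))
```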


### Move files

In [41]:
from collections import defaultdict

datasets = ['train', 'val','test']
file_paths = defaultdict(list)

indices= [train_idx, val_idx, test_idx]
for idx, idx_group in enumerate(indices):
    print(f'Building {datasets[idx]} set ...')
    for n in idx_group:
        file_paths[datasets[idx]].append(all_files[n])

for (k,v) in file_paths.items():
    print(k + ' ' + str(len(v)))

Building train set ...
Building val set ...
Building test set ...
train 21
val 3
test 6


In [43]:
file_paths['train']

['./recordings/RT-JJara/28081_391_VC_1_1_28_11_2023_17_27_15/clean_captions1.txt',
 './recordings/RT-JJara/34401_575_VC_1_1_28_11_2023_14_42_16/clean_captions1.txt',
 './recordings/RT-JJara/298880_387_VC_1_1_04_12_2023_15_43_49/clean_captions1.txt',
 './recordings/RT-JJara/469703_9570_VC_1_1_29_11_2023_16_23_53/clean_captions1.txt',
 './recordings/RT-JJara/477169_1508_VC_1_1_30_11_2023_16_26_11/clean_captions1.txt',
 './recordings/RT-JJara/521679_1570_VC_1_1_30_11_2023_12_05_38/clean_captions1.txt',
 './recordings/RT-JJara/595841_4602_VC_1_1_01_12_2023_14_34_27/clean_captions1.txt',
 './recordings/rt-lstephens/1442410_6967_VC_1_1_01_12_2023_18_35_50/clean_captions1.txt',
 './recordings/rt-lstephens/696884_5310_VC_1_1_05_12_2023_11_16_56/clean_captions1.txt',
 './recordings/rt-lstephens/1711515_11357_VC_1_1_28_11_2023_18_25_35/clean_captions1.txt',
 './recordings/rt-lstephens/973397_4715_VC_1_1_30_11_2023_17_22_09/clean_captions1.txt',
 './recordings/rt-lstephens/1130238_6039_VC_1_1_01_

In [44]:
import shutil
import os


data_dir = '/home/digitalopt/proj/datasets/Exam_v1/'
for d in datasets:
    os.makedirs(data_dir + d, exist_ok=True)

cnt = -1
for k in file_paths.keys():
    for path in file_paths[k]:
        cnt += 1
        newpath =  data_dir + k + '/' + str(cnt).zfill(6) + '.txt'
        shutil.copyfile(path, newpath)

## Step 2: Randomly sample 1-10 lines from texts

In [54]:
from vision_dataset import VisionDataset
from glob import glob

data_dir = '/home/digitalopt/proj/datasets/Exam_v1/'
train = glob(data_dir + 'train/*.txt')
val = glob(data_dir + 'val/*.txt')
test = glob(data_dir + 'test/*.txt')
print(f'train: {len(train)}\tval: {len(val)}\ttest: {len(test)}')
dataset = VisionDataset()
captions = dataset.read_captions(train[0])

train: 21	val: 3	test: 6


In [55]:
captions[:10]

['[Assistant]  Hello. Can you hear me?',
 '[LocalTech]  I can, yes. We have [NAME].',
 "[Assistant]  My name's [NAME]. I'll help you through some vision testing before the doctor sees you. [NAME]'s gonna get you lined up before we start though. Thank you, [NAME].",
 "[LocalTech]  Thank you. All right, let's move this a little bit. Okay. Left side. Can you get out a little  Right there is perfect. Other side. Can you move it in, please.  Well, right there. Perfect. All set.  All right.",
 '[Assistant]  We are going to start without prescriptions first. It will look blurry, but without squinting what is the smallest row that you can read?',
 '[Patient__]  V-C-K-N-O.',
 '[Assistant]  If we go a little smaller, what is the lowest row you can see now?',
 '[Patient__]  Nothing.',
 '[Assistant]  None of those. Okay. How about any of these?',
 "[Patient__]  Um, they're all so blurry.  Uh, D V O H C."]

In [6]:
captions[0].startswith("[Assistant]")

True
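
Each caption line starts with a bracketed speaker label followed by the utterance. A minimal sketch of splitting a line into those two parts is shown below; `parse_caption` is a hypothetical helper, not a method of `VisionDataset`.

```python
import re

# "[Speaker]" label, optional whitespace, then the utterance
CAPTION_RE = re.compile(r'^\[([^\]]+)\]\s*(.*)$')


def parse_caption(line):
    """Split '[Speaker]  text' into (speaker, text)."""
    match = CAPTION_RE.match(line)
    if match is None:
        raise ValueError(f'not a caption line: {line!r}')
    return match.group(1), match.group(2)


print(parse_caption('[Assistant]  Hello. Can you hear me?'))
```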

In [47]:
from random import choice, randint

def select_sample(captions):
    # candidate end points: one past each Assistant line that has at least
    # one earlier line available as context
    ends = [i + 1 for i in range(1, len(captions))
            if captions[i][1:10] == 'Assistant']
    if not ends:
        raise ValueError('no Assistant response with preceding context')
    end = choice(ends)
    # total sample length (context + response): 2 to 10 lines,
    # clipped to the number of lines available before `end`
    length = min(randint(2, 10), end)
    return captions[end - length:end]
# run a check for errors
for i in range(1000):
    sample = select_sample(captions)
    if len(sample) < 2:
        print(f'Bad sample: \n{sample}')

## Step 3: Separate dialogue from Assistant response

Also, I will try using a text formatting function to see how that performs.
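
The body of `dataset.to_dialogue` is not shown in this notebook. Judging from the printed example, it appears to return a dict with a `dialogue` list of role/content turns and a `response` list holding the final Assistant turn. The function below is a sketch of that behaviour under that assumption, not the actual method.

```python
def to_dialogue(sample):
    """Split a sample into context turns and the final Assistant response (sketch)."""
    turns = []
    for line in sample:
        # the first ']' closes the speaker label, e.g. '[Patient__]  Nothing.'
        role, _, content = line.partition(']')
        turns.append({'role': role.lstrip('['), 'content': content.strip()})
    if not turns or turns[-1]['role'] != 'Assistant':
        raise ValueError('sample must end with an Assistant line')
    return {'dialogue': turns[:-1], 'response': turns[-1:]}


demo = to_dialogue([
    '[Patient__]  Nothing.',
    '[Assistant]  None of those. Okay. How about any of these?',
])
print(demo['response'])
```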

In [48]:
sample = dataset.select_sample(captions)

# named `example` to avoid shadowing the built-in input()
example = dataset.to_dialogue(sample)

def format_instruction(example):
    return f'''### Instruction:
Use the dialogue below to create a response that could help guide a patient through a vision exam.

### Input:
{example['dialogue']}

### Response:
{example['response']}
'''

print(format_instruction(example))

### Instruction:
Use the dialogue below to create a response that could help guide a patient through a vision exam.

### Input:
[{'role': 'Assistant', 'content': 'One.  Two.'}, {'role': 'Patient__', 'content': 'same'}, {'role': 'Assistant', 'content': 'Which one has clear letters?  The red side or the green side?'}, {'role': 'Patient__', 'content': 'the green side.'}, {'role': 'Assistant', 'content': 'Alright. Can you read the bottom row for me now?'}, {'role': 'Patient__', 'content': "H, Z, can't tell if it's a C or an O, K, O."}]

### Response:
[{'role': 'Assistant', 'content': "Beautiful.  Both eyes are uncovered for this comparison.  This is number one. This is number two.  Number two is the new glasses prescription.  One is the prescription.  All right.  That's our distance.  Let's show you up close."}]



# Final Step: create the dataset

## Sampling based on exam phase

Each file is first divided into 4 sections:
- gs: greeting & setup
- va: visual acuity
- sr: subjective refraction
- cv: close vision & valediction

The average percentage of the total dialogue that each phase occupies, measured over the 30 dialogue samples, drives the sampling of each text document so that dataset samples are more consistent with the average real exam. The averages are as follows:
- gs: 15%
- va: 24%
- sr: 54%
- cv: 11%

The close vision (cv) share was increased beyond its representation in the dialogue samples so that short dialogues still yield cv samples.
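
Phase-weighted sampling of this kind can be sketched as follows. This is an illustration only and not the `VisionDatasetCreator` implementation: `draw_plan` is a hypothetical helper, the weights and length bounds are copied from the configuration cell in this section, and drawing the sample length uniformly between the bounds is an assumption.

```python
import random

# [weight, min_lines, max_lines] per exam phase, copied from the configuration cell
strategy = {
    'gs': [0.15, 2, 5],
    'va': [0.25, 3, 10],
    'sr': [0.50, 4, 10],
    'cv': [0.11, 3, 6],
}


def draw_plan(strategy, n_samples, seed=42):
    """Draw (phase, length) pairs: phase by weight, length uniformly within its bounds."""
    rng = random.Random(seed)
    phases = list(strategy)
    # random.choices treats these as relative weights, so they need not sum to 1
    weights = [strategy[p][0] for p in phases]
    plan = []
    for phase in rng.choices(phases, weights=weights, k=n_samples):
        _, low, high = strategy[phase]
        plan.append((phase, rng.randint(low, high)))
    return plan


plan = draw_plan(strategy, 25)
print(plan[:5])
```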

In [1]:
from vision_dataset import VisionDatasetCreator
# avg percentages of exam phase lengths
gs=0.15
va=0.25
sr=0.50
cv=0.11
# minimum dialogue lengths
gs_min = 2
va_min = 3
sr_min = 4
cv_min = 3
# maximum dialogue lengths
gs_max = 5
va_max = 10
sr_max = 10
cv_max = 6

sampling_strategy = dict(
    gs=[gs, gs_min, gs_max],
    va=[va, va_min, va_max],
    sr=[sr, sr_min, sr_max],
    cv=[cv, cv_min, cv_max]
)

data_dir = '/data/datasets/Exam_v2/'
# set seed to get randomization with reproducible results
dataset = VisionDatasetCreator(sampling_strategy, seed=42)
# 25 samples from each file in the training set, which has 21 files total
size = (21*25)
dataset.load(data_dir, 'train', size)
# 25 samples from each validation file, 3 files total
size = (3*25)
dataset.load(data_dir, 'val', size)
# 25 samples from each test file, 6 files total
size = (6*25)
dataset.load(data_dir, 'test', size)

for i in ['train', 'val', 'test']:
    print('\n', i, len(dataset.dataset[i]))

21 files found. Sampling 25 times per file.
sampling file: /data/datasets/Exam_v2/train/000000.txt
sampling file: /data/datasets/Exam_v2/train/000001.txt
sampling file: /data/datasets/Exam_v2/train/000002.txt
sampling file: /data/datasets/Exam_v2/train/000003.txt
sampling file: /data/datasets/Exam_v2/train/000004.txt
sampling file: /data/datasets/Exam_v2/train/000005.txt
sampling file: /data/datasets/Exam_v2/train/000006.txt
sampling file: /data/datasets/Exam_v2/train/000007.txt
sampling file: /data/datasets/Exam_v2/train/000008.txt
sampling file: /data/datasets/Exam_v2/train/000009.txt
sampling file: /data/datasets/Exam_v2/train/000010.txt
sampling file: /data/datasets/Exam_v2/train/000011.txt
sampling file: /data/datasets/Exam_v2/train/000012.txt
sampling file: /data/datasets/Exam_v2/train/000013.txt
sampling file: /data/datasets/Exam_v2/train/000014.txt
sampling file: /data/datasets/Exam_v2/train/000015.txt
sampling file: /data/datasets/Exam_v2/train/000016.txt
sampling file: /data/

In [3]:
# largest maximum dialogue length across all exam phases
big = max(limits[2] for limits in sampling_strategy.values())
for name in ['train', 'val', 'test']:
    print(name)
    for d in dataset.dataset[name]:
        if len(d) > big:
            print(len(d))

train
val
test


In [4]:
big

10