### What can Llama tell me about my medical text descriptions?

Large language models (LLMs) are being applied to a wide range of tasks, and I'm curious how one would interpret my medical dataset. Specifically, I'm using the Pathology Visual Question Answering (Path-VQA) dataset from Hugging Face, and my approach is to use its text descriptions as prompts for an LLM.

To enable this, I've built a local chat interface (chat.py) that interacts with Llama 3.1. Through this interface, I'm exploring how the model interprets a subset of the dataset's text descriptions and seeking inspiration for potential AI-model development.
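The contents of chat.py are not shown in this notebook, so the following is only a minimal sketch of what such a chat loop could look like. The names `chat_loop_sketch`, `generate`, and `read_input` are hypothetical; `generate` stands for any function that sends a prompt to a locally hosted model and returns its reply.

```python
from typing import Callable, Iterable, List, Tuple


def chat_loop_sketch(
    context_items: Iterable[str],
    generate: Callable[[str], str],
    read_input: Callable[[str], str] = input,
) -> List[Tuple[str, str]]:
    """Hypothetical stand-in for a chat loop; not the actual chat.py.

    `generate` maps a prompt string to a reply string, for example a
    thin wrapper around a locally hosted Llama 3.1 model (assumption).
    """
    # The dataset descriptions are placed at the top of every prompt.
    context = "List: " + ", ".join(context_items)
    history: List[Tuple[str, str]] = []
    print("Type 'exit' to quit")
    while True:
        user_text = read_input("You: ").strip()
        if user_text.lower() == "exit":
            break
        # Replay earlier turns so the model sees the whole conversation.
        past = "".join(f"User: {q}\nAI: {a}\n" for q, a in history)
        prompt = f"{context}\n{past}User: {user_text}\nAI:"
        reply = generate(prompt)
        print(f"AI chatbot: {reply}")
        history.append((user_text, reply))
    return history
```

Passing the model call in as a function keeps the loop independent of any particular runtime, and makes it easy to try the loop with a dummy `generate` before connecting a real model.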




- Notes:
    - Although Llama was not fine-tuned for medical question answering (QA), I expect it to comprehend the basics.
    - I hosted the model locally, because data is sometimes private and should not be sent to an external online service.
    - Only a subset of the descriptions is used as a prompt, since LLMs can process only a limited number of tokens.
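To illustrate the last note, here is a rough sketch of how a list of descriptions could be trimmed to fit a prompt budget. The function name `fit_to_budget` and the `max_words` parameter are hypothetical, and counting whitespace-separated words is only a crude proxy for the model's real token count.

```python
from typing import Iterable, List


def fit_to_budget(items: Iterable[str], max_words: int) -> List[str]:
    """Keep items in order until a rough word budget is used up.

    Whitespace-separated words are only a crude proxy for model tokens;
    a real tokenizer often produces more tokens than words.
    """
    kept: List[str] = []
    used = 0
    for item in items:
        n_words = len(item.split())
        if used + n_words > max_words:
            break  # the next item would exceed the budget
        kept.append(item)
        used += n_words
    return kept


# Toy example (not taken from the dataset statistics):
toy = ["metastatic colon carcinoma", "miliary tuberculosis", "normal liver"]
print(fit_to_budget(toy, max_words=5))
# ['metastatic colon carcinoma', 'miliary tuberculosis']
```

For an accurate limit, the word count would need to be replaced by the token count of the tokenizer that belongs to the model in use.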

In [1]:
import os
import datasets
import numpy as np
from chat import handle_conversation



Let's load the dataset from Hugging Face.

In [2]:
ds = datasets.load_dataset('flaviagiammarino/path-vqa')
print(f'dataset content:\n {ds}')

dataset content:
 DatasetDict({
    train: Dataset({
        features: ['image', 'question', 'answer'],
        num_rows: 19654
    })
    validation: Dataset({
        features: ['image', 'question', 'answer'],
        num_rows: 6259
    })
    test: Dataset({
        features: ['image', 'question', 'answer'],
        num_rows: 6719
    })
})


In [10]:
# query only questions and answers
questions = ds['train']['question']
answers = ds['train']['answer']


I will select only the 'what is present?' questions and their corresponding answers.

In [11]:
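# boolean mask selecting the 'what is present?' questions
ind = np.array(questions) == 'what is present?'
# answers that belong to the selected questions
subset_answers = np.array(answers)[ind]
print(f'total number of distinct categories mentioned in the subset: {len(np.unique(subset_answers))},'
      f' total length of subset: {ind.sum()}')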

total number of distinct categories mentioned in the subset: 327, total length of subset: 4065
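Before prompting, it can also help to see how the answers are distributed across categories. The sketch below counts the most frequent answers to one question using only the standard library; the function name `answer_distribution` and the toy lists are illustrative assumptions, not results from Path-VQA.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def answer_distribution(
    questions: Sequence[str],
    answers: Sequence[str],
    target_question: str = "what is present?",
    top_k: int = 5,
) -> List[Tuple[str, int]]:
    """Return the top_k most frequent answers to one question."""
    counts = Counter(
        a for q, a in zip(questions, answers) if q == target_question
    )
    return counts.most_common(top_k)


# Toy data for illustration only:
toy_questions = ["what is present?", "is this normal?",
                 "what is present?", "what is present?"]
toy_answers = ["liver", "no", "liver", "spleen"]
print(answer_distribution(toy_questions, toy_answers))
# [('liver', 2), ('spleen', 1)]
```

With the lists loaded above, the call would be `answer_distribution(questions, answers)`. A strongly skewed distribution would suggest that some categories have too few images for training a classifier.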


My first prompts will be something like:
- The elements of the list provided above are answers to the 'what is present?' question, and each question-answer pair has a pathology image associated with it. What can you tell me about that?
- Given image-question-answer pairs, give me an idea of how to visualize my data and train a useful AI model.


In [5]:
conversation = handle_conversation(np.unique(np.array(answers)[ind]))

Welcome to LLama ChatBot. Type 'exit' to quit


You:   The elements of the list provided above are answer to 'what is present' question and each question-answer pair has a pathology image associated with it. What can you tell me about that? 



AI chatbot: A very interesting list!

Based on the context, I'd say that this list appears to be a collection of medical terms or descriptions related to various pathologies or conditions. Each term is likely a question prompt for a radiology or pathology image (e.g., an X-ray, CT scan, MRI, or microscopic slide) associated with a specific disease or condition.

In other words, the elements in this list are essentially answer choices to questions like "What is present in this patient's imaging study?" or "What condition does this image suggest?"

The presence of medical terms and descriptions suggests that this list might be used for:

1. **Radiology or pathology exams**: To test knowledge of various diseases, conditions, or findings related to different body parts or systems.
2. **Medical education**: As a teaching tool to help students learn about specific pathologies or conditions.
3. **Diagnostic decision support**: As a reference guide for radiologists, pathologists, or other hea

You:  Interesting answer. Given image-question-answer pairs, give me an idea on how to visualize my data and train usefull AI model.



AI chatbot: You'd like to know more about visualizing your data and training a useful AI model using the provided image-question-answer pairs. That's a great follow-up question!

**Visualizing Data:**

To effectively visualize your data, consider the following:

1. **Tagging images**: Assign labels or tags to each image based on the corresponding question-answer pair. For example, "metastatic colon carcinoma" or "miliary tuberculosis".
2. **Image categorization**: Categorize the images into groups or classes based on their content (e.g., radiology, pathology, gross examination, microscopic).
3. **Annotation and labeling**: Use tools like labelme, Annotate, or Label Studio to annotate and label specific features within each image.
4. **Data normalization and standardization**: Normalize and standardize the images by adjusting resolution, brightness, contrast, and other parameters.

**Training a Useful AI Model:**

To train a useful AI model using your data, consider the following steps

You:  Let's focus on tagging images. How can I divide images to specific subgrups for classification if the only information about the images I have are given questions and answers?



AI chatbot: When working with a large dataset of image-question-answer pairs, and you want to group them into specific subgroups for classification, it might be challenging without any additional metadata. However, here's a possible approach:

**Extracting keywords from question-answer pairs**

1. **Text analysis**: Analyze the question-answer pairs and extract relevant keywords or phrases that could serve as labels for your image groups.
2. **Part-of-speech tagging**: Use natural language processing (NLP) techniques to identify key concepts, such as nouns, verbs, and adjectives in the questions and answers.

**Grouping images based on extracted keywords**

1. **Word frequency analysis**: Perform a word frequency analysis to identify common themes or keywords that appear across multiple question-answer pairs.
2. **Concept clustering**: Group images into subcategories (or clusters) based on shared keywords or phrases.
3. **Image categorization**: Manually review and categorize images i

You:  exit
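
The keyword-based grouping suggested in the last reply can be sketched roughly as follows. This is only one possible reading of that suggestion: the helper names `keywords` and `group_by_keyword`, the small stop-word list, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set

# Small illustrative stop-word list (an assumption; extend as needed).
STOP_WORDS: Set[str] = {"a", "an", "and", "in", "of", "the", "with"}


def keywords(text: str) -> List[str]:
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:()?!") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def group_by_keyword(
    answers: Iterable[str], min_count: int = 2
) -> Dict[str, List[str]]:
    """Map each keyword shared by at least min_count answers to those answers."""
    unique_answers = sorted(set(answers))
    # Count in how many distinct answers each keyword appears.
    freq = Counter(w for a in unique_answers for w in set(keywords(a)))
    groups: Dict[str, List[str]] = defaultdict(list)
    for a in unique_answers:
        for w in set(keywords(a)):
            if freq[w] >= min_count:
                groups[w].append(a)
    return dict(groups)


# Toy example for illustration only:
toy = ["metastatic colon carcinoma", "colon polyp", "miliary tuberculosis"]
print(group_by_keyword(toy))
# {'colon': ['colon polyp', 'metastatic colon carcinoma']}
```

An answer can land in several groups when it shares more than one keyword, so the groups would still need a manual review before being used as class labels.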
