- Download the Meta-Llama model:
```
pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download mlx-community/Meta-Llama-3-8B-Instruct-4bit --exclude "original/*" --local-dir mlx-community/Meta-Llama-3-8B-Instruct-4bit
# or, for the larger 70B model:
huggingface-cli download mlx-community/Meta-Llama-3.1-70B-Instruct-4bit --exclude "original/*" --local-dir mlx-community/Meta-Llama-3.1-70B-Instruct-4bit
```

- Install curate_gpt: https://github.com/monarch-initiative/curategpt
- Set your OpenAI API key (the curate_gpt extractor below uses `gpt-4o`): `export OPENAI_API_KEY=`

In [1]:
import sys
!{sys.executable} -m pip install --upgrade ipywidgets
!{sys.executable} -m pip install torch
!{sys.executable} -m pip install mlx-lm



In [2]:
from mlx_lm import load, generate

# # Define your model to import
# model_name = "mlx-community/Meta-Llama-3-8B-Instruct-4bit"
# # Loading model
# model, tokenizer = load(model_name)

# model_filename = '/Users/hk9/Downloads/llama_mlx/mlx-community/Meta-Llama-3-8B-Instruct-4bit'
model_filename = '/Users/hk9/Downloads/llama_mlx/mlx-community/Meta-Llama-3.1-70B-Instruct-4bit'
model, tokenizer = load(model_filename)

In [6]:
cell_type_name = "Striated cell of salivary gland"
cell_definition = '''
The striated cells of the salivary gland, also known as striated duct cells, play a central role in the secretion and modification of saliva. These cells exhibit distinctive longitudinal striations along their basal membrane, which give them their name (Roussa, 2011). Highly specialized and unique to the salivary glands, striated cells are considered an integral part of the salivary duct system, specifically in the striated ducts that connect the intercalated duct with the excretory duct in the gland.

Why are the striated cells vital? The saliva created within the salivary glands is initially isotonic, which implies it contains the same concentration of vital electrolytes as the plasma in the blood. Yet, when it is released into the mouth, it is often hypotonic. The secretion undergoes a significant transformation within the duct system and the striated cells play a pivotal part in this process (Tandler et al., 2001). The basal infoldings in striated cells, laden with mitochondria, supply an active transport mechanism that reabsorbs sodium ions and secretes potassium ions into the saliva (Tandler et al., 2001; Roussa, 2011; Pedersen et al., 2018). This process is crucial for generating a hypotonic saliva upon release, which aids in digestion and oral hygiene.

Additionally, the striated duct cells of the salivary glands serve other important physiological functions as well. These cells contribute significantly to the bicarbonate content of saliva (Roussa, 2011; Ohana, 2013; Pedersen et al., 2018), which plays a critical role in neutralizing the acids in the oral cavity, and thereby protecting the dental enamel from being eroded. This function further underlines their role in maintaining oral health. Thus, the striated cells of the salivary gland, with their distinctive structure and secretory functions, play a considerable role in the physiological action of saliva within the oral cavity.
'''

Extract keywords from definition

In [4]:
chatbot_role = '''
You are a knowledgeable cell biologist that has professional experience writing and curating accurate and informative descriptions of cell types.
'''

# command = '''
# Can you extract the most important 4 keywords from this text so that I can find the most relevant articles using PubMed search: "{cell_def}". Just provide me a json list of keywords (strings) with values without outputting any extra text. No yapping.
# '''

command = '''
Using this cell type definition "{cell_def}" and your own knowledge, can you provide me the location of this cell type ({cell_name}), the most important characteristics of this cell type and the most important biological process it is related to. Just provide me a json object response with keys 'cell_type_name', 'location', 'cellular_component' and 'biological_process'. Additionally can you extract the most important/distinctive 4 keywords from this json that I can use for finding references for this cell type and add them to the json as a list of strings with key 'search'. No yapping.
'''

my_question = command.format(cell_name=cell_type_name, cell_def=cell_definition)

# Set up the chat scenario with roles
messages = [
    {"role": "system", "content": chatbot_role},
    {"role": "user", "content": my_question}
]

# Apply the chat template to format the input for the model
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Decode the tokenized input back to text format to be used as a prompt for the model
prompt = tokenizer.decode(input_ids)

# Generate a response using the model
response = generate(model, tokenizer, max_tokens=2048, prompt=prompt)
response

'Here is the JSON object with the requested information:\n\n```\n{\n  "cell_type_name": "Striated cell of salivary gland",\n  "location": "Salivary glands, specifically in the striated ducts that connect the intercalated duct with the excretory duct",\n  "cellular_component": "Basal membrane with longitudinal striations, basal infoldings with mitochondria",\n  "biological_process": "Saliva secretion and modification, sodium and potassium ion transport, bicarbonate secretion, acid neutralization and dental enamel protection",\n  "search": ["Striated duct cells", "Salivary gland", "Saliva secretion", "Ion transport"]\n}\n```\n\nLet me know if you need any further assistance!'
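The reply above wraps the JSON in prose and a fenced block, so passing it straight to `json.loads` would fail. A minimal sketch of a tolerant parser follows, assuming the reply contains a single JSON object or list somewhere in the text. `extract_json` is a hypothetical helper name, not part of `mlx_lm`.

```python
import json


def extract_json(text):
    """Return the first JSON object or list embedded in `text`.

    Hypothetical helper: scans for the first '{' or '[' and lets
    json.JSONDecoder.raw_decode parse from that position, ignoring
    any prose or code fences around the payload.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char in "{[":
            try:
                value, _ = decoder.raw_decode(text, start)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON from here; keep scanning
    raise ValueError("no JSON payload found in model response")


# Shortened stand-in for a chatty model reply
sample = 'Here is the JSON object:\n\n{"search": ["Salivary gland", "Ion transport"]}\n\nLet me know if you need more!'
parsed = extract_json(sample)
print(parsed["search"])
```

With the real output, `extract_json(response)["search"]` would give the keyword list for the PubMed search step.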

In [5]:
chatbot_role = '''
You are a knowledgeable cell biologist that has professional experience writing and curating accurate and informative descriptions of cell types.
'''

data = '''
{
"cell_type_name": "Striated cell of salivary gland",
"location": "Salivary glands, specifically in the striated ducts that connect the intercalated duct with the excretory duct",
"cellular_component": "Basal membrane with longitudinal striations, basal infoldings laden with mitochondria",
"biological_process": "Saliva secretion and modification, including sodium and potassium ion transport, bicarbonate secretion, and acid neutralization to maintain oral health and facilitate digestion"
}
'''

command = '''
Can you extract the following 3 statements from this definition: "{cell_def}".

1- cell type's location
2- cell type's cellular components
3- cell type related biological processes

All definitions should be at most 8 words long. 
No yapping.
'''

my_question = command.format(cell_def=cell_definition)

# Set up the chat scenario with roles
messages = [
    {"role": "system", "content": chatbot_role},
    {"role": "user", "content": my_question}
]

# Apply the chat template to format the input for the model
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Decode the tokenized input back to text format to be used as a prompt for the model
prompt = tokenizer.decode(input_ids)

# Generate a response using the model
response = generate(model, tokenizer, max_tokens=2048, prompt=prompt)

from IPython.display import Markdown

# Printing models response using Markdown cell formatting
Markdown(response)

Here are the 3 extracted statements:

1. Striated cells are in salivary glands.
2. Striated cells have basal infoldings and mitochondria.
3. Striated cells modify and secrete saliva.

In [31]:
import json

chatbot_role = '''
You are a highly knowledgeable cell biologist with professional expertise in writing and curating concise and informative descriptions of various cell types. Your task is to extract and generate clear, precise statements about specific cell types from detailed definitions provided to you.'''

# command = '''
# Can you extract the most important 4 keywords from this text so that I can find the most relevant articles using PubMed search: "{cell_def}". Just provide me a json list of keywords (strings) with values without outputting any extra text. No yapping.
# '''

command = '''
Extract the {result_count} most important and distinctive facts about '{cell_name}' from the given definition: '{cell_def}'. Ensure each statement is concise, not exceeding 8 words. Focus on accuracy and brevity.
Return results as a json list. Each response should be a complete sentence with '{cell_name}' (full name) as the subject.
Just return the json list as response, don't print anything else. Your response must be a valid json string, don't wrap json list items with curly braces. No yapping.
'''

def extract_facts(cell_type_name, cell_definition, count=10):
    my_question = command.format(cell_name=cell_type_name, cell_def=cell_definition, result_count=str(count))
    
    # Set up the chat scenario with roles
    messages = [
        {"role": "system", "content": chatbot_role},
        {"role": "user", "content": my_question}
    ]
    
    # Apply the chat template to format the input for the model
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    
    # Decode the tokenized input back to text format to be used as a prompt for the model
    prompt = tokenizer.decode(input_ids)
    
    # Generate a response using the model
    response = generate(model, tokenizer, max_tokens=2048, prompt=prompt)
    
    # from IPython.display import Markdown
    # Markdown(response)
    return json.loads(response)

In [32]:
response = extract_facts(cell_type_name, cell_definition)
response

['Striated cell of salivary gland has longitudinal striations.',
 'Striated cell of salivary gland is highly specialized.',
 'Striated cell of salivary gland is unique to salivary glands.',
 'Striated cell of salivary gland is part of salivary duct system.',
 'Striated cell of salivary gland modifies saliva composition.',
 'Striated cell of salivary gland reabsorbs sodium ions.',
 'Striated cell of salivary gland secretes potassium ions.',
 'Striated cell of salivary gland aids in digestion.',
 'Striated cell of salivary gland maintains oral hygiene.',
 'Striated cell of salivary gland neutralizes oral acids.']
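The prompt asks for statements of at most 8 words with the full cell type name as the subject, but nothing enforces this, and some of the facts above run longer (the name alone takes 5 words). A minimal sketch of a post-check is below. `check_facts` and its `max_words` parameter are hypothetical names introduced here for illustration.

```python
def check_facts(facts, cell_name, max_words=8):
    """Split facts into (ok, rejected) according to the prompt's constraints.

    A fact is accepted when it starts with the full cell type name and
    has at most `max_words` whitespace-separated words.
    """
    ok, rejected = [], []
    for fact in facts:
        starts_with_name = fact.startswith(cell_name)
        short_enough = len(fact.split()) <= max_words
        if starts_with_name and short_enough:
            ok.append(fact)
        else:
            rejected.append(fact)
    return ok, rejected


facts = [
    "Striated cell of salivary gland has longitudinal striations.",
    "Striated cell of salivary gland is part of salivary duct system.",
]
ok, rejected = check_facts(facts, "Striated cell of salivary gland")
print("ok:", ok)
print("rejected:", rejected)
```

Rejected facts could be fed back to the model for shortening, or kept as-is if the word limit is only a soft guideline.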

In [8]:
import yaml
from curate_gpt.agents.evidence_agent import EvidenceAgent
from curate_gpt.extract.basic_extractor import BasicExtractor
from curate_gpt.wrappers.literature.pubmed_wrapper import PubmedWrapper
from curate_gpt.store import get_store

def curategpt_citeseek(query):
    db = get_store("chromadb", None)
    extractor = BasicExtractor()
    extractor.model_name = "gpt-4o"
    chatbot = PubmedWrapper(local_store=db, extractor=extractor)
    ea = EvidenceAgent(chat_agent=chatbot)
    response = ea.find_evidence_simple(query)
    return response

In [20]:
response = curategpt_citeseek("Striated cell of salivary gland, also known as striated duct cells, has longitudinal striations.")
print(yaml.dump(response, sort_keys=False))

- reference: PMID:21054495
  supports: SUPPORT
  snippet: Striated duct adenomas are unilayered ductal tumours that recapitulate
    normal striated ducts.
  explanation: The literature describes striations as a characteristic of normal striated
    ducts.
- reference: PMID:9825895
  supports: SUPPORT
  snippet: The duct system of the rat submandibular gland (granular convoluted tubule
    [GCT], striated duct, excretory duct, main excretory duct [MED] and salivary bladder)
    was studied...
  explanation: The description implies striations are a characteristic feature of
    the striated ducts.
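To compare statements at a glance, the `supports` labels can be tallied per statement. This is a small sketch, assuming the response is a list of dicts shaped like the YAML above. `tally_support` is a hypothetical helper, not part of curate_gpt.

```python
from collections import Counter


def tally_support(references):
    """Count how often each `supports` label occurs in a reference list."""
    # References without a `supports` key are counted as "UNKNOWN"
    return Counter(ref.get("supports", "UNKNOWN") for ref in references)


# Trimmed copy of the response shown above (snippets omitted)
references = [
    {"reference": "PMID:21054495", "supports": "SUPPORT"},
    {"reference": "PMID:9825895", "supports": "SUPPORT"},
]
counts = tally_support(references)
print(dict(counts))
```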



In [34]:
import csv
import json

definitions_benchmark = dict()
with open("./definitions.csv", mode='r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        cell_type = row.get('cell_type')
        if cell_type:
            references = list()
            # Columns ref_1 .. ref_10 hold JSON-encoded references; skip empty or missing ones
            for i in range(1, 11):
                ref = row.get('ref_' + str(i))
                if ref:
                    references.append(json.loads(ref))
            definitions_benchmark[cell_type] = {"cell_type": cell_type,
                                                "definition": row.get('definition').replace("\n", " "),
                                                "references": references}
print(definitions_benchmark["Striated cell of salivary gland"])


{'cell_type': 'Striated cell of salivary gland', 'definition': 'The striated cells of the salivary gland, also known as striated duct cells, play a central role in the secretion and modification of saliva. These cells exhibit distinctive longitudinal striations along their basal membrane, which give them their name (Roussa, 2011). Highly specialized and unique to the salivary glands, striated cells are considered an integral part of the salivary duct system, specifically in the striated ducts that connect the intercalated duct with the excretory duct in the gland.  Why are the striated cells vital? The saliva created within the salivary glands is initially isotonic, which implies it contains the same concentration of vital electrolytes as the plasma in the blood. Yet, when it is released into the mouth, it is often hypotonic. The secretion undergoes a significant transformation within the duct system and the striated cells play a pivotal part in this process (Tandler et al., 2001).The 

In [41]:
results = list()
for cell_type in definitions_benchmark:
    record = definitions_benchmark[cell_type]
    facts = extract_facts(cell_type, record["definition"])
    cell_type_record = dict()
    cell_type_record["cell_type"] = cell_type
    statements = list()
    for fact in facts:
        stmt = dict()
        stmt["statement"] = fact
        try:
            references = curategpt_citeseek(fact)
            stmt["references"] = references
            statements.append(stmt)
        except Exception:
            # Skip statements whose evidence lookup fails, but log them
            print("Processing failed for : " + fact)
    cell_type_record["statements"] = statements
    # print(yaml.dump(references[0], sort_keys=False))
    # print("----------")
    results.append(cell_type_record)
    
results

Failed to search for Striated cell of salivary gland is unique to salivary glands.; params: {'db': 'pubmed', 'term': 'striated cell OR  salivary gland OR  salivary gland striated duct OR  striated duct cells OR  unique cell types in salivary glands OR  salivary gland histology OR  striated duct epithelium OR  salivary gland structure OR  acinar cells OR  intercalated ducts', 'retmax': 100, 'sort': 'relevance', 'retmode': 'json'}
Failed to fetch data for ['10682899', '10759422', '1476190', '15709965', '15990024', '17157080', '1807485', '20108531', '20831576', '21120532', '21534000', '22208652', '22298651', '23061636', '23177984', '23443543', '23878825', '24164806', '24240699', '24406086', '24559652', '24598807', '24646566', '24675464', '24862590', '24862591', '24862599', '25680367', '25843887', '26285812', '26454716', '26537593', '26592972', '26662479', '26694219', '26751783', '27592814', '27796887', '28170182', '28192873', '28623666', '28732801', '29109049', '29181608', '29187363', '29

Processing failed for : Striated cell of salivary gland contributes to bicarbonate content.
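The log above does not show why the search and fetch calls failed. If the cause is transient (an assumption, for example rate limiting), retrying with a growing delay may recover some of the dropped statements. A minimal sketch is below. `with_retries` and its parameters are hypothetical and not part of curate_gpt.

```python
import time


def with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries and
    re-raises the last error once `attempts` (>= 1) tries are used up.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Stand-in for a lookup that fails twice before succeeding
calls = {"count": 0}


def flaky(query):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok: " + query


# The no-op sleep keeps the demo instant; drop it for real use
result = with_retries(flaky, "some statement", sleep=lambda seconds: None)
print(result, calls["count"])
```

In the loop above, the lookup would become `references = with_retries(curategpt_citeseek, fact)`.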


[{'cell_type': 'Striated cell of salivary gland',
  'statements': [{'statement': 'Striated cell of salivary gland has longitudinal striations.',
    'references': [{'reference': 'PMID:21054495',
      'supports': 'NO_EVIDENCE',
      'snippet': "Prominent cell membranes, reminiscent of ''striations'' of normal striated ducts, were seen.",
      'explanation': "The snippet mentions 'striations' of striated ducts but does not provide explicit evidence on the presence of longitudinal striations in striated cells. Thus, there's no direct evidence supporting the statement."},
     {'reference': 'PMID:11590591',
      'supports': 'NO_EVIDENCE',
      'snippet': 'Broad-based comparisons of ultrastructural and other data about SDs offer some insight into evolutionary history of salivary glands.',
      'explanation': 'This reference discusses striated ducts (SDs) in general, but does not directly mention or provide evidence about longitudinal striations in striated cells.'}]},
   {'statement':

{'statements': [{'statement': 'Striated cell of salivary gland has longitudinal striations.',
   'references': [{'reference': 'PMID:21054495',
     'supports': 'WRONG_STATEMENT',
     'snippet': "Prominent cell membranes, reminiscent of ''striations'' of normal striated ducts, were seen.",
     'explanation': 'The document refers to striated duct cells having a prominent cell membrane that appear similar to striations, but does not mention longitudinal striations within individual striated cells themselves. The statement implies individual cells have longitudinal striations, which is incorrect.'}]},
  {'statement': 'Striated cell of salivary gland is highly specialized.',
   'references': [{'reference': 'PMID:11590591',
     'supports': 'PARTIAL',
     'snippet': 'In addition to their role in electrolyte homeostasis, striated ducts (SDs) in the major salivary glands of many mammalian species engage in secretion of organic products.',
     'explanation': 'The striated ducts show special