In [1]:
import ollama
from ollama import chat
from pydantic import BaseModel
import re
import pandas as pd
import torch
import sys
import os
import json

from neurovlm.data import get_data_dir
from neurovlm.models import Specter



## Load data and models

In [6]:
# Load data and Specter
data_dir = get_data_dir()
# Try the pyarrow engine first and fall back to fastparquet if it fails
try:
    df = pd.read_parquet(data_dir / "publications.parquet", engine="pyarrow")
except Exception as e:
    print(f"pyarrow failed: {e}, trying fastparquet...")
    df = pd.read_parquet(data_dir / "publications.parquet", engine="fastparquet")
specter = Specter()
aligner = torch.load(data_dir / "aligner.pt", weights_only=False).to("cpu")
latent_text = torch.load(data_dir / "latent_text.pt", weights_only=True).to("cpu")
latent_text = latent_text / latent_text.norm(dim=1, keepdim=True)  # unit norm
neurowiki_text = pd.read_parquet(data_dir / "neurowiki.parquet", engine="pyarrow")

There are adapters available but none are activated for the forward pass.


## Neurowiki data filtering 

I looked over the data and found that keeping rows with `cos_sim_summary` above 0.4 is a decent cutoff for neuroscience-related articles.
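One way to sanity-check a cutoff like this is to count how many rows survive at several thresholds. The sketch below is illustrative only: `toy` is a hypothetical four-row stand-in for `neurowiki_text` (the similarity values are copied from rows shown in this notebook), and `rows_kept` is a helper name introduced here, not part of `neurovlm`.

```python
import pandas as pd

# Hypothetical stand-in for neurowiki_text; the real table has the same
# "cos_sim_summary" column but far more rows.
toy = pd.DataFrame(
    {
        "title": ["Electrophysiology", "Dendritic cell", "Transport", "Beyond and Back"],
        "cos_sim_summary": [0.785440, 0.400029, 0.202092, 0.240299],
    }
)


def rows_kept(frame: pd.DataFrame, cutoffs: list[float]) -> dict[float, int]:
    """Count rows whose cos_sim_summary is strictly above each cutoff."""
    return {c: int((frame["cos_sim_summary"] > c).sum()) for c in cutoffs}


counts = rows_kept(toy, [0.2, 0.3, 0.4, 0.5])
print(counts)
```

Running the same helper on the full table would show how sharply the row count drops around 0.4, which is a quick way to judge whether the cutoff is sensitive to small changes.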

In [61]:
filtered_neuro = neurowiki_text[neurowiki_text["cos_sim_summary"] > 0.4]

In [62]:
filtered_neuro.sort_values("cos_sim_summary", ascending=True)

Unnamed: 0,title,summary,cos_sim_summary,cos_sim_title,cos_sim_avg,cos_sim_invalid,cos_sim
8192,Pheochromocytoma,Pheochromocytoma (British English: phaeochromo...,0.400014,0.376887,0.388450,0.173785,0.214665
28212,Georg Nagel,Georg Nagel (born 24 August 1953 in Weingarten...,0.400021,0.383029,0.391525,0.314777,0.076748
21768,PLCB2,"1-Phosphatidylinositol-4,5-bisphosphate phosph...",0.400022,0.379800,0.389911,0.266159,0.123752
11318,Excited delirium,"Excited delirium (ExDS), also known as agitate...",0.400024,0.509676,0.454850,0.265422,0.189428
2049,Dendritic cell,A dendritic cell (DC) is an antigen-presentin...,0.400029,0.613064,0.506547,0.190504,0.316043
...,...,...,...,...,...,...,...
1843,Intracellular recording,*Intracellular recording** involves measuring ...,0.785440,0.450340,0.617890,0.293035,0.324854
33,Electrophysiology,Electrophysiology (from Ancient Greek: ἤλεκτρο...,0.785440,1.000000,0.892720,0.310617,0.582103
752,Network neuroscience,Network neuroscience is an approach to underst...,0.789676,1.000000,0.894838,0.503342,0.391496
332,Bayesian approaches to brain function,Bayesian approaches to brain function investig...,0.795987,0.844852,0.820420,0.379451,0.440969


## Example query

The titles and abstracts most related to the query will be passed to the LM.

In [8]:
# Encode the query with Specter, then rank publications by cosine similarity
query = "what is the role of the hippocampus in memory formation"
encoded_text = specter(query)[0].detach()
encoded_text_norm = encoded_text / encoded_text.norm()
cos_sim = latent_text @ encoded_text_norm
inds = torch.argsort(cos_sim, descending=True)

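# Aggregate the top 5 publications to pass to the LM.
# re.sub collapses every whitespace run (including newlines) into one space,
# so words on adjacent lines are not glued together.
papers = "\n".join(
    [f"[{ind + 1}] " + df.iloc[int(i)]["name"] + "\n" + re.sub(r'\s+', ' ', df.iloc[int(i)]["description"]) + "\n"
     for ind, i in enumerate(inds[:5])]
)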
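The ranking step works because the dot product of two unit-length vectors equals their cosine similarity. Below is a minimal sketch of that idea on toy vectors. NumPy stands in for torch here, and `top_k_by_cosine` and the toy arrays are hypothetical names introduced for illustration, not part of the notebook's pipeline.

```python
import numpy as np


def top_k_by_cosine(corpus: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k corpus rows most similar to the query."""
    # Normalize rows and query to unit length
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    # Dot product of unit vectors is the cosine similarity
    scores = corpus_unit @ query_unit
    # Sort descending by score and keep the first k indices
    return np.argsort(-scores)[:k]


# Toy 2-D "embeddings": row 0 points almost the same way as the query,
# row 2 is at roughly 45 degrees, row 3 points the opposite way.
corpus = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])
top = top_k_by_cosine(corpus, query, k=2)
print(top.tolist())
```

Because vector length is divided out, row 2 outranks row 1 even though both have larger norms than row 0; only direction matters.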

In [9]:
# Top 10 related publications (only the top 5 are passed to the LM)
df.iloc[inds[:10]]["name"].values.tolist()

['Factors affecting the hippocampal BOLD response during spatial memory.',
 'Role of hippocampal CA1 atrophy in memory encoding deficits in amnestic Mild Cognitive Impairment.',
 'Memory related dysregulation of hippocampal function in major depressive disorder.',
 'Hippocampal functional connectivity and episodic memory in early childhood.',
 'Hippocampal activation for autobiographical memories over the entire lifetime in  healthy aged subjects: an fMRI study.',
 'Long-term retrograde amnesia...the crucial role of the hippocampus.',
 'Memory in frontal lobe epilepsy: an fMRI study.',
 'The hippocampus remains activated over the long term for the retrieval of truly episodic memories.',
 'The stressed hippocampus, synaptic plasticity and lost memories.',
 'Probing the relevance of the hippocampus for conflict-induced memory improvement']

In [15]:
system_prompt = """
You are a helpful neuroscience research assistant.  
You will receive a set of publications and a user query. Your task is to summarize key findings and insights from these publications, focusing on how they relate to the query.

Your response must:
- **Start with a brief overview** (2-4 sentences) summarizing the main themes or takeaways across the publications.  
- Be **entirely based on the information in the publications** and how it **directly ties to the user's query**. Do not add outside knowledge or speculation.  
- **Identify how each publication relates to the query**. If the publications **directly answer the query**, state the answer clearly. If they **do not answer it fully**, highlight **relevant points, evidence, or gaps** that inform the query.  
  - Bullet points → lists of findings, comparisons, or key points.  
  - Paragraphs → integrative or narrative summaries.  
- Maintain an **objective, precise, scholarly tone** suitable for neuroscience research contexts.  
"""

In [16]:
user_prompt = f"""
Here are some publications related to the query "{query}":
{papers}
"""
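The numbered publication block that goes into the user prompt could also be built by a small helper, which makes the number of papers an explicit parameter. This is a sketch under assumptions: `format_papers` and the two placeholder records are hypothetical, and only the `name` and `description` keys mirror the columns used in this notebook.

```python
import re


def format_papers(records: list[dict], max_papers: int = 5) -> str:
    """Build a numbered "[n] title / abstract" block for the user prompt."""
    blocks = []
    for n, rec in enumerate(records[:max_papers], start=1):
        # Collapse newlines and repeated spaces in the abstract
        abstract = re.sub(r"\s+", " ", rec["description"]).strip()
        blocks.append(f"[{n}] {rec['name']}\n{abstract}\n")
    return "\n".join(blocks)


# Placeholder records, not real publications
records = [
    {"name": "Paper A.", "description": "First line\nsecond  line."},
    {"name": "Paper B.", "description": "  Only one line. "},
]
papers_demo = format_papers(records)
print(papers_demo)
```

With a DataFrame, the records could come from something like `df.iloc[inds[:5].tolist()].to_dict("records")`, so changing how many papers the LM sees only requires changing the slice or `max_papers`.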

In [17]:
response = chat(
    model='qwen2.5:3b-instruct',  # alternative: 'llama3.2:3b'
    messages=[
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_prompt},
    ],
)

In [18]:
output_text = response['message']['content']

In [19]:
print(output_text)

The publications collectively provide insights into the role of the hippocampus in various aspects of memory formation and function. They vary from studying spatial memory tasks to examining hippocampal activity during specific brain regions involved in memory, as well as developmental changes in the hippocampus across different ages. 

Publication 1 highlights how the hippocampus contributes specifically to spatial memory by activating areas outside itself, such as the parahippocampal gyrus and precuneus, even though it remains active under conditions where other structures are engaged.

Publication 2 reveals that atrophy within the CA1 subfield of the hippocampus correlates with encoding deficits in amnestic Mild Cognitive Impairment (aMCI), supporting a specific role for the hippocampus in this type of memory processing.

Publication 3 shows dysregulated hippocampal function in major depressive disorder, where normal modulation of hippocampal activation during encoding is impaired. 
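The response above discusses only Publications 1 to 3, although five were passed in. A small check can flag such gaps automatically. The sketch below assumes the model refers to papers as "Publication N", which holds for this output but is not guaranteed in general; `mentioned_publications` and `missing_publications` are hypothetical helper names.

```python
import re


def mentioned_publications(text: str) -> set[int]:
    """Collect the numbers that follow the word 'Publication' in the text."""
    return {int(m) for m in re.findall(r"\bPublication\s+(\d+)", text)}


def missing_publications(text: str, n_passed: int) -> list[int]:
    """List the publication numbers (1..n_passed) the text never mentions."""
    return sorted(set(range(1, n_passed + 1)) - mentioned_publications(text))


# Shortened stand-in for the model output shown above
sample = "Publication 1 highlights ... Publication 2 reveals ... Publication 3 shows ..."
missing = missing_publications(sample, n_passed=5)
print(missing)  # [4, 5]
```

Applied to `output_text`, a non-empty result would be a cue to tighten the system prompt or retry the request.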

In [None]:
# Should I remove the parts about Synthesis and Conflicting or Divergent Findings? That might make the LLM respond faster.

In [None]:
# NOTE: despite the name, this is an alias of neurowiki_text; no sort is applied here
neurowiki_text_sorted = neurowiki_text
neurowiki_text_sorted.head(20)

Unnamed: 0,title,summary,cos_sim_summary,cos_sim_title,cos_sim_avg,cos_sim_invalid,cos_sim
37347,Michael M. J. Fischer,Michael M. J. Fischer is Andrew W. Mellon Prof...,0.288876,0.247793,0.268334,0.361846,-0.093511
37814,Michael S. Brown,Michael Stuart Brown ForMemRS NAS AAA&S APS (b...,0.254344,0.284701,0.269523,0.402182,-0.132659
36989,Beyond and Back,Beyond and Back is a 1978 American documentary...,0.240299,0.299665,0.269982,0.345923,-0.075941
37772,Eliezer Yudkowsky,Eliezer S. Yudkowsky ( EL-ee-AY-zər yuud-KOW-s...,0.281894,0.258097,0.269996,0.397902,-0.127907
37499,Jeffrey Long,Jeffrey Long is an American author and researc...,0.268129,0.274974,0.271552,0.375685,-0.104133
37572,Jerome H. Barkow,"Jerome H. Barkow (January 18, 1944 – April 30,...",0.280865,0.262324,0.271594,0.381359,-0.109765
37457,Eric Poeschla,Eric Murnane Poeschla is an American infectiou...,0.244713,0.298669,0.271691,0.372829,-0.101138
37884,Scott Fisher (technologist),Scott Fisher is the Professor and Founding Cha...,0.226878,0.316597,0.271738,0.412626,-0.140888
31265,Transport,Transport (in British English) or transportati...,0.202092,0.343434,0.272763,0.227129,0.045634
36927,Robert Hecht-Nielsen,"Robert Hecht-Nielsen (July 18, 1947 – May 25, ...",0.251217,0.294358,0.272788,0.345907,-0.073119
