This notebook generates test data for the semantic search feature.

Approach:
* take several PubMed IDs from different research areas
* download the abstracts from PubMed
* generate several partial summaries of each abstract/title.
    * Use synonyms rather than the words from the input
    * Generate in different languages
    * Summaries may be keywords or sentences
* Use these keywords/sentences as test data for the semantic search feature. A search for them is expected to return the original paper
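
The expectation in the last step can be expressed as a recall@k check. The following is a minimal sketch, not the notebook's actual evaluation code: `recall_at_k` and `fake_search` are hypothetical names, and `fake_search` is a fixed lookup table standing in for the real semantic search engine, whose API is not shown here.

```python
from typing import Callable, Iterable, Sequence


def recall_at_k(
    cases: Iterable[tuple[str, str]],
    search: Callable[[str], Sequence[str]],
    k: int = 10,
) -> float:
    """Fraction of (query, expected_pmid) cases whose PMID is in the top-k results."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(1 for query, pmid in cases if pmid in list(search(query))[:k])
    return hits / len(cases)


# Hypothetical stand-in for the search engine (assumption): returns ranked PMIDs.
def fake_search(query: str) -> list[str]:
    index = {
        "demyelination repair": ["20534452", "15660127"],
        "Multiple Sklerose": ["14769454", "20534452"],
    }
    return index.get(query, [])


# Each case pairs a generated query with the PMID it was derived from.
cases = [
    ("demyelination repair", "20534452"),
    ("Multiple Sklerose", "20534452"),
    ("unknown query", "15660127"),
]
print(recall_at_k(cases, fake_search, k=1))
print(recall_at_k(cases, fake_search, k=2))
```

Passing the search function as a parameter keeps the metric independent of how the engine is called, so the same check could be reused once the real client is available.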


------------

Install:

```
# Install the API clients
pip install --upgrade google-genai
pip install google-cloud-aiplatform
# Biopython provides the `Bio.Entrez` module used below
pip install biopython

# Install `gcloud` first, see: https://cloud.google.com/sdk/docs/install
# Authorize before running the notebook
gcloud auth application-default login
```

In [4]:
import requests
import os

from Bio import Entrez

In [5]:
# Always set your email for Entrez
Entrez.email = "your.email@example.com"

In [None]:
VERTEX_CLIENT_PROJECT = os.getenv('VERTEX_CLIENT_PROJECT', 'default_project')

In [7]:
pubmed_ids = ["15660127", "20534452", "14769454"]

In [8]:
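def fetch_paper_abstrct(pmid):
    """Return the abstract text of one PubMed article as a single string."""
    # Fetch the article metadata, including the abstract, as XML
    handle = Entrez.efetch(db="pubmed", id=pmid, rettype="abstract", retmode="xml")
    try:
        xml_records = Entrez.read(handle)
    finally:
        handle.close()

    found_paper = xml_records['PubmedArticle']
    assert len(found_paper) == 1, f"expected exactly one article for PMID {pmid}"
    article = found_paper[0]['MedlineCitation']['Article']
    # Structured abstracts (Background/Methods/...) arrive as several parts
    abstract_text = article['Abstract']['AbstractText']
    assert len(abstract_text) >= 1, f"no abstract text for PMID {pmid}"

    return " ".join(str(part) for part in abstract_text)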

In [11]:
pmid = pubmed_ids[1]
fetch_paper_abstrct(pmid)

'Multiple sclerosis (MS) is a human demyelinating disease characterized by multifocal regions of inflammation, progressive myelin loss within the central nervous system (CNS), and eventual failure to remyelinate damaged axons. These problems suggest deficiencies in recruiting and/or maturation of oligodendrocyte progentior cells (OPCs) and highlight cell replacement therapies to promote remyelination. We have used a model of viral-induced demyelination to characterize signaling cues associated with positional migration of transplanted remyelination-competent cells. Although successful transplantation of rodent-derived glial cell types into models of MS has been performed, the mechanisms by which these cells navigate within an inflammatory environment created by a persistent virus has not been defined. Infection of the mouse CNS with the neurotropic JHM strain of mouse hepatitis virus (JHMV) results in an immune-mediated demyelinating disease with clinical and histologic similarities to

In [14]:
from google import genai
from google.genai import types
import base64

def generate(paper_text):
  client = genai.Client(
      vertexai=True,
      project=VERTEX_CLIENT_PROJECT,
      location="us-central1",
  )

  msg1_text1 = types.Part.from_text(text=paper_text)
  si_text1 = """You are a bot tasked with helping researchers summarize abstracts of scientific papers into keyword sets. The generated summaries will be used as test data for a new semantic search engine that finds papers by user query. Your task is to generate 5 summaries of the given abstract and repeat this in 4 languages: English, German, Polish, Russian.

Be sure to:
* write each summary as a keyword set, not as normal sentences.
* make the summary length a random number from 1 to 6 words
* include the language and an index with each summary
* focus on the main points of the text
* different summaries are allowed to cover different aspects of the user input
* keep it concise and to the point
* make all summaries different from each other
* use synonyms and avoid using the same words as in the original input
* use biological synonyms for biological terms, like gene names, diseases etc. E.g. replace gene symbols with names from the Ensembl database or transcript ids. E.g. ACTB is known as ENSG00000075624, you could see it from the genecards web service using a request like https://www.genecards.org/cgi-bin/carddisp.pl?gene=ACTB&keywords=ACTB
* do not hallucinate
* output as json text, with 'lang' key for language, "keywords" for keywords"""

  model = "gemini-2.0-flash-lite-001"
  contents = [
    types.Content(
      role="user",
      parts=[
        msg1_text1
      ]
    ),
  ]
  generate_content_config = types.GenerateContentConfig(
    temperature = 1.9,
    top_p = 0.95,
    max_output_tokens = 8192,
    response_modalities = ["TEXT"],
    safety_settings = [types.SafetySetting(
      category="HARM_CATEGORY_HATE_SPEECH",
      threshold="OFF"
    ),types.SafetySetting(
      category="HARM_CATEGORY_DANGEROUS_CONTENT",
      threshold="OFF"
    ),types.SafetySetting(
      category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
      threshold="OFF"
    ),types.SafetySetting(
      category="HARM_CATEGORY_HARASSMENT",
      threshold="OFF"
    )],
    system_instruction=[types.Part.from_text(text=si_text1)],
  )

  for chunk in client.models.generate_content_stream(
    model = model,
    contents = contents,
    config = generate_content_config,
    ):
    # A streamed chunk may carry no text; skip it instead of printing "None"
    if chunk.text:
      print(chunk.text, end="")
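
The response may arrive wrapped in a Markdown code fence, so the raw text cannot always be passed to `json.loads` directly. Below is a minimal sketch of a parser, assuming the streamed chunks have first been collected into one string (the function above only prints them). `parse_keyword_sets` and `to_queries` are hypothetical helpers, and joining keywords with a space to form a query is an assumption about how the search engine will be fed.

```python
import json

# Built programmatically so this example nests safely inside Markdown.
FENCE = "`" * 3


def parse_keyword_sets(raw: str) -> list[dict]:
    """Strip an optional Markdown code fence and parse the JSON list."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (the fence plus an optional "json" tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence, if present.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    data = json.loads(text)  # raises json.JSONDecodeError on truncated output
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of keyword sets")
    for item in data:
        if "lang" not in item or "keywords" not in item:
            raise ValueError(f"missing 'lang' or 'keywords' in {item!r}")
    return data


def to_queries(keyword_sets: list[dict]) -> list[tuple[str, str]]:
    """Turn each keyword set into a (language, query string) pair."""
    return [(item["lang"], " ".join(item["keywords"])) for item in keyword_sets]


sample = (
    FENCE
    + 'json\n[{"lang": "English", "keywords": ["multiple sclerosis", "neural stem cells"]}]\n'
    + FENCE
)
for lang, query in to_queries(parse_keyword_sets(sample)):
    print(lang, "|", query)
```

A response cut off by the token limit will fail in `json.loads`; in this sketch that error is left to propagate so that incomplete test data is not silently accepted.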

In [15]:
pmid = pubmed_ids[1]
generate(fetch_paper_abstrct(pmid))

```json
[
  {
    "lang": "English",
    "keywords": [
      "multiple sclerosis",
      "neural stem cells",
      "CXCL12:CXCR4 signaling"
    ]
  },
  {
    "lang": "English",
    "keywords": [
      "demyelination",
      "oligodendrocyte progenitors",
      "viral infection",
      "remyelination"
    ]
  },
  {
    "lang": "English",
    "keywords": [
      "stem cell engraftment",
      "immune-mediated demyelination",
      "CXCR4 antagonism",
      "axonal repair"
    ]
  },
  {
    "lang": "English",
    "keywords": [
      "glial cell migration",
      "MS model",
      "CXCL12 signaling blockade",
      "cell replacement therapy"
    ]
  },
  {
    "lang": "English",
    "keywords": [
      "CNS inflammation",
      "neural stem cell",
      "demyelination repair",
      "CXCL12:CXCR4 pathway"
    ]
  },
    {
    "lang": "German",
    "keywords": [
      "Multiple Sklerose",
      "neurale Stammzellen",
      "CXCL12:CXCR4 Signalweg"
    ]
  },
  {
    "lang": "German",
  