# GolgiGPT - PMC Based Biomedical RAG System 

Workflow:
  - Knowledge Base (KB) Selection:
      - MeSH Search String
      - PMC Full Text Only
  - Vector DB:
      - {'Title-Abstract-Keywords Embedding': 'PMID'} (BioBERT or SPECTER)
  - Full-Text Paragraph Retrieval System:
      - PMC API
  - Paragraph Embedding:
      - BioBERT or SPECTER
  - Paragraph Selection by Cosine Similarity:
      - Paragraph-RAG On-the-Fly
  - Context Augmentation with Sentences and References
  - LLM Reply
  - Implementation in Streamlit
  - Deployment
  - Query End Metric Evaluation by Domain Expert
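
The paragraph-selection step in the workflow can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the project's implementation: `cosine_similarity` and `top_k_paragraphs` are hypothetical helper names, and the toy 3-dimensional vectors stand in for real BioBERT or SPECTER embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k_paragraphs(query_vec, paragraph_vecs, k=3):
    """Return (index, score) pairs for the k paragraphs most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(paragraph_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real paragraph embeddings (assumption for illustration)
paragraphs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
ranking = top_k_paragraphs(query, paragraphs, k=2)
print(ranking)
```

In the on-the-fly setting, the indices returned here would point back at the paragraphs of the retrieved full text, which are then passed to the LLM as context.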

In [22]:
import pandas as pd
import os

print(os.getcwd())
df = pd.read_csv("filtered_dataset_resdisc.csv")
# Read the dataset
df = pd.read_csv("fulltext_dataset.csv")
# Replace all space characters with NaN
# df = df.where(~df.eq(" "), None)

# df.to_csv("fulltext_dataset.csv", index=False)

/root/projects/nano-graphrag/biomedical


In [None]:
# Drop rows where all four sections are NaN, then rows where both RESULTS and DISCUSS are NaN
df = pd.read_csv("fulltext_dataset.csv")
df = df[~((df['INTRO'].isna() & df['METHODS'].isna() & df['RESULTS'].isna() & df['DISCUSS'].isna()))]
df = df[~((df['RESULTS'].isna() & df['DISCUSS'].isna()))]
df[["RESULTS", "DISCUSS"]]

Unnamed: 0,RESULTS,DISCUSS
0,To examine segregation of the exon 53 deletion...,The affected BC in this study and other report...
2,Carriage of the CCR5delta32 allele was not ass...,"In conclusion, our results, based on a large c..."
3,The GGT concentration of HBV-related LT recipi...,The authors have no conflicts of interest to d...
4,To investigate whether GROα levels are affecte...,"In conclusion, we have identified plasma GRO a..."
5,Two patients in the advanced cirrhosis cohort ...,"In conclusion, 12 weeks of oral treatment with..."
...,...,...
10989,Serum hsCRP also correlated negatively with fi...,"In conclusion, we propose that the putative be..."
10993,To further support the hypothesis that impaire...,"In summary, we have provided the evidence for ..."
10994,The absence of HapA was significantly associat...,"Taken together, the combination of a lack of i..."
10996,"In contrast, SNPs rs11196205 and rs7895340 (ta...","In conclusion, we show that diabetes-associate..."


# Literature Citation Exporter

In [53]:
import pyperclip as pc
import pandas as pd
import requests
import nbib

# Function to get formatted citation
def get_formatted_citation(article_id, format='medline'): # format= 'citation'
    print('PMC' in article_id)
    # Choose the repository ('pmc' or 'pubmed') for the Literature Citation Exporter base URL
    if 'PMC' in article_id:
        article_id = article_id.replace('PMC', '')
        repo = 'pmc'
    else:
        repo = 'pubmed'
    BASE_URL = f'https://api.ncbi.nlm.nih.gov/lit/ctxp/v1/{repo}/'
    
    # Create request URL
    #request_url = f"{BASE_URL}?format=citation&contenttype=json&id={article_id}&style={format}"
    request_url = f"{BASE_URL}?format={format}&id={article_id}"
    print(request_url)
    
    pc.copy(request_url)
    
    # Send GET request to API
    response = requests.get(request_url)

    # Check if request was successful
    if response.status_code == 200:
        print(response)
        if format == 'citation':
            return response.json()  # Return the citation in JSON format
        elif format == 'medline':
            text = response.text
            data = nbib.read(text)
            return data
    else:
        return {'error': f"Request failed with status code {response.status_code}"}

# Example usage
article_id = '31435807'  # Replace with your article ID
article_id = 'PMC11169733'
citation = get_formatted_citation(article_id)
print(citation)

True
https://api.ncbi.nlm.nih.gov/lit/ctxp/v1/pmc/?format=medline&id=11169733
<Response [200]>
[{'pubmed_id': 38872946, 'citation_owner': 'NLM', 'nlm_status': 'PubMed-not-MEDLINE', 'last_revision_date': datetime.datetime(2024, 6, 15, 0, 0), 'print_issn': '1662-4548', 'electronic_issn': '1662-453X', 'linking_issn': '1662-453X', 'journal_volume': '18', 'publication_date': '2024', 'title': 'Rehmanniae Radix Preparata ameliorates behavioral deficits and hippocampal neurodevelopmental abnormalities in ADHD rat model.', 'pages': '1402056', 'abstract': 'OBJECTIVES: Abnormal hippocampal neurodevelopment, particularly in the dentate gyrus region, may be a key mechanism of attention-deficit/hyperactivity disorder (ADHD). In this study, we investigate the effect of the most commonly used Chinese herb for the treatment of ADHD, Rehmanniae Radix Preparata (RRP), on behavior and hippocampal neurodevelopment in spontaneously hypertensive rats (SHR). METHODS: Behavior tests, including Morris water maz

In [18]:
data = nbib.read(citation)
dfbib = pd.DataFrame(data)
dfbib

<Response [200]>
PMID- 31435807
OWN - NLM
STAT- MEDLINE
DCOM- 20190911
LR  - 20240328
IS  - 0080-1844 (Print)
IS  - 0080-1844 (Linking)
VI  - 67
DP  - 2019
TI  - Golgi Structure and Function in Health, Stress, and Diseases.
PG  - 441-485
LID - 10.1007/978-3-030-23173-6_19 [doi]
AB  - The Golgi apparatus is a central intracellular membrane-bound organelle with key 
      functions in trafficking, processing, and sorting of newly synthesized membrane 
      and secretory proteins and lipids. To best perform these functions, Golgi 
      membranes form a unique stacked structure. The Golgi structure is dynamic but 
      tightly regulated; it undergoes rapid disassembly and reassembly during the cell 
      cycle of mammalian cells and is disrupted under certain stress and pathological 
      conditions. In the past decade, significant amount of effort has been made to 
      reveal the molecular mechanisms that regulate the Golgi membrane architecture and 
      funct

Unnamed: 0,pubmed_id,citation_owner,nlm_status,last_revision_date,print_issn,linking_issn,journal_volume,publication_date,title,pages,...,nlm_journal_id,descriptors,pmcid,entrez_time,pubmed_time,medline_time,pmc-release_time,pii,doi,publication_status
0,31435807,NLM,MEDLINE,2024-03-28,0080-1844,0080-1844,67,2019,"Golgi Structure and Function in Health, Stress...",441-485,...,173555,"[{'descriptor': 'Animals', 'major': False}, {'...",7076563,2019-08-23 06:00:00,2019-08-23 06:00:00,2019-09-12 06:00:00,2020-04-03,19,10.1007/978-3-030-23173-6_19,ppublish
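
For readers unfamiliar with the MEDLINE tag format shown above, the following sketch illustrates how such text breaks down into fields. It is only an illustration of the format, not a replacement for `nbib`: `parse_medline` is a hypothetical helper, and the sample is an excerpt of the output printed above.

```python
def parse_medline(text):
    """Parse MEDLINE-tagged text into {tag: [values]}, joining continuation lines."""
    record = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if len(line) > 5 and line[4:6] == "- ":
            # New field: the tag occupies the first four columns (space-padded)
            last_tag = line[:4].strip()
            record.setdefault(last_tag, []).append(line[6:].strip())
        elif last_tag is not None and line[0].isspace():
            # Continuation line: append to the most recent value of the current tag
            record[last_tag][-1] += " " + line.strip()
    return record

sample = """PMID- 31435807
OWN - NLM
STAT- MEDLINE
IS  - 0080-1844 (Print)
IS  - 0080-1844 (Linking)
TI  - Golgi Structure and Function in Health, Stress, and Diseases.
AB  - The Golgi apparatus is a central intracellular membrane-bound organelle with key
      functions in trafficking, processing, and sorting of newly synthesized membrane
      and secretory proteins and lipids.
"""

fields = parse_medline(sample)
print(fields["PMID"], fields["IS"])
```

Repeated tags such as `IS` are kept as lists, which is why every value is stored in a list.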


# BioC API for PMC

In [22]:
import requests
import pandas as pd

def retrieve_pubmed_article(article_id, format='BioC_json', encoding='unicode'):
    """
    Retrieve an article from the BioC API for PMC in the specified format and encoding.
    
    Parameters:
    - article_id (str) : PubMed ID or PMC ID of the article
    - format (str) : 'BioC_xml' or 'BioC_json' (default: 'BioC_json')
    - encoding (str) : 'unicode' or 'ascii' (default: 'unicode')
    
    Returns:
    - Parsed JSON content for 'BioC_json', otherwise the raw response text (str)
    """
    base_url = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"
    url = f"{base_url}/{format}/{article_id}/{encoding}"
    response = requests.get(url)

    if response.ok:
        return response.json() if format == 'BioC_json' else response.text
    else:
        response.raise_for_status()

In [None]:
from pychatgpt import copilot
from pyperclip import paste, copy
copilot('@apply this function to pmc and give me another column \n\n'+paste())

In [29]:
# Import PMC Ids
pc_id = pd.read_table('pmc_result.txt', header=None)
pc_id[0].to_list()[:10]

['PMC11169026',
 'PMC11171598',
 'PMC11164214',
 'PMC11152573',
 'PMC11149193',
 'PMC11177981',
 'PMC11169733',
 'PMC11172763',
 'PMC11135019',
 'PMC11160616']

In [9]:
# Import PMC Ids
pc_id = pd.read_table('pmc_result.txt', header=None)
pmc_ids = pc_id[0].to_list()

In [21]:
%%time
# Example usage
article_id = pmc_ids[0]  # PMC ID from the list; a PubMed ID such as '17299597' can be passed instead
article_content = retrieve_pubmed_article(article_id)

# main contents
content = pd.DataFrame(article_content)
# document
documents = pd.DataFrame(content.documents.iloc[0])
# passages
passages = pd.DataFrame(documents.passages.iloc[0])
# info
print('Title: '+passages.text.iloc[0])
pub_data = pd.DataFrame([passages.infons.iloc[0]])

Title: FMOD Alleviates Depression-Like Behaviors by Targeting the PI3K/AKT/mTOR Signaling After Traumatic Brain Injury
CPU times: total: 578 ms
Wall time: 1.46 s


In [22]:
# Display data
display(content)
display(documents)
display(passages)
display(pub_data)

Unnamed: 0,bioctype,source,date,key,version,infons,documents
0,BioCCollection,PMC,20240615,pmc.key,1.0,{},"[{'bioctype': 'BioCDocument', 'id': '11169026'..."


Unnamed: 0,bioctype,id,infons,passages,annotations,relations
0,BioCDocument,11169026,{'license': 'CC BY'},"[{'bioctype': 'BioCPassage', 'offset': 0, 'inf...",[],[]


Unnamed: 0,bioctype,offset,infons,text,sentences,annotations,relations
0,BioCPassage,0,{'article-id_doi': '10.1007/s12017-024-08793-2...,FMOD Alleviates Depression-Like Behaviors by T...,[],[],[]
1,BioCPassage,112,"{'section_type': 'ABSTRACT', 'type': 'abstract'}",Depression frequently occurs following traumat...,[],[],[]
2,BioCPassage,1790,"{'section_type': 'ABSTRACT', 'type': 'abstract...",Supplementary Information,[],[],[]
3,BioCPassage,1816,"{'section_type': 'ABSTRACT', 'type': 'abstract'}",The online version contains supplementary mate...,[],[],[]
4,BioCPassage,1908,"{'section_type': 'INTRO', 'type': 'title_1'}",Introduction,[],[],[]
...,...,...,...,...,...,...,...
124,BioCPassage,47822,"{'name_0': 'surname:Wu;given-names:Y', 'name_1...",Levomilnacipran improves lipopolysaccharide-in...,[],[],[]
125,BioCPassage,48007,"{'fpage': '656', 'issue': '6', 'lpage': '663',...",Anxiety and depression in frontline health car...,[],[],[]
126,BioCPassage,48095,"{'name_0': 'surname:Zhang;given-names:Y', 'nam...",Gut microbiota from NLRP3-deficient mice ameli...,[],[],[]
127,BioCPassage,48224,"{'fpage': '17050', 'name_0': 'surname:Zheng;gi...",Fibromodulin reduces scar formation in adult c...,[],[],[]


Unnamed: 0,article-id_doi,article-id_pmc,article-id_pmid,article-id_publisher-id,elocation-id,issue,kwd,license,name_0,name_1,...,name_5,name_6,name_7,name_8,name_9,section_type,title,type,volume,year
0,10.1007/s12017-024-08793-2,11169026,38864941,8793,24,1,TBI Depression FMOD Synaptic plasticity PI3K/A...,Open Access This article is licensed under a C...,surname:Huang;given-names:Xuekang,surname:Zhu;given-names:Ziyu,...,surname:Zhang;given-names:Jie,surname:Tan;given-names:Weilin,surname:Wu;given-names:Biying,surname:Liu;given-names:Lian,surname:Liao;given-names:Z. B.,TITLE,Keywords,front,26,2024


In [24]:
%%time
# Citation Explorer
print('Citation Data')
citation = get_formatted_citation(article_id)
citation = pd.DataFrame(citation)
display(citation)

Citation Data
True
https://api.ncbi.nlm.nih.gov/lit/ctxp/v1/pmc/?format=medline&id=11169026
<Response [200]>


Unnamed: 0,pubmed_id,citation_owner,nlm_status,last_revision_date,electronic_issn,print_issn,linking_issn,journal_volume,journal_issue,publication_date,...,conflict_of_interest,received_time,accepted_time,medline_time,pubmed_time,entrez_time,pmc-release_time,pii,doi,publication_status
0,38864941,NLM,MEDLINE,2024-06-15,1559-1174,1535-1084,1535-1084,26,1,2024 Jun 12,...,The authors declare no competing interests.,2024-02-11,2024-05-26,2024-06-12 12:43:00,2024-06-12 12:42:00,2024-06-12 11:07:00,2024-06-12,8793,10.1007/s12017-024-08793-2,epublish


CPU times: total: 500 ms
Wall time: 1 s


In [40]:
m='''@
please load this data into a dataframe 
'''+str(doc_in.infons.iloc[0])
copilot(m)

```python
import pandas as pd

# Provided data dictionary
data = {
    'article-id_doi': '10.1007/s12017-024-08793-2', 
    'article-id_pmc': '11169026', 
    'article-id_pmid': '38864941', 
    'article-id_publisher-id': '8793', 
    'elocation-id': '24', 
    'issue': '1', 
    'kwd': 'TBI Depression FMOD Synaptic plasticity PI3K/AKT/mTOR', 
    'license': "Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permi
```

# Embedding text

In [None]:
!pip install sentence_transformers
!pip install transformers

### BERT Pretrained Models
https://www.sbert.net/docs/pretrained_models.html

**Scientific Publications**  

| Model Name         | Description                                       |  Dimensions | 
|----------------------|--------------------------------------------------|------------|
| *allenai-specter* | SPECTER is a model trained on scientific citations and can be used to estimate the similarity of two publications. We can use it to find similar papers. | 768 |

### SPECTER:: Sentence Transformers

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. 

In [32]:
%%time
from sentence_transformers import SentenceTransformer
models= ['dmis-lab/biobert-base-cased-v1.2', "allenai-specter"]

# Choose embedding model
embedding_model = SentenceTransformer("allenai-specter") 

def get_embedding(text):
    embedding = embedding_model.encode(text, show_progress_bar=False)
    return embedding
doc_in['paragraph_embedding'] = doc_in['text'].apply(get_embedding)
doc_in

CPU times: total: 5.34 s
Wall time: 4.59 s


Unnamed: 0,offset,infons,text,sentences,annotations,relations,paragraph_embedding
0,0,"{'alt-title': 'Population Genetic Complexity',...",Quantifying Organismal Complexity using a Popu...,[],[],[],"[-0.34946486, 0.81233764, -1.1245518, 0.239255..."
1,70,"{'section_type': 'ABSTRACT', 'type': 'abstract...",Background,[],[],[],"[-0.4900074, 0.9371992, -0.64640814, 0.0891367..."
2,81,"{'section_type': 'ABSTRACT', 'type': 'abstract'}",Various definitions of biological complexity h...,[],[],[],"[-0.05157757, 1.1551077, -0.39407745, -0.25576..."
3,337,"{'section_type': 'ABSTRACT', 'type': 'abstract...",Methodology,[],[],[],"[-0.34070057, 0.52787834, -0.7744749, 0.057282..."
4,349,"{'section_type': 'ABSTRACT', 'type': 'abstract'}",Here we propose an alternative complexity metr...,[],[],[],"[0.44728622, 0.87169987, -0.25241286, 0.232644..."
...,...,...,...,...,...,...,...
116,41294,"{'fpage': '1287', 'lpage': '1291', 'name_0': '...",Adaptation to the fitness costs of antibiotic ...,[],[],[],"[-0.24779828, 0.09557337, -0.8857135, 0.883678..."
117,41372,"{'fpage': '1471', 'lpage': '1481', 'name_0': '...",Compensatory evolution in rifampin-resistant E...,[],[],[],"[-0.5028894, 0.23429127, -0.2778873, 0.8616125..."
118,41435,"{'fpage': '5233', 'lpage': '5238', 'name_0': '...",Replication of Phi-X174 DNA with Purified Enzy...,[],[],[],"[0.3537047, 0.470016, 0.38906127, 0.8454049, 0..."
119,41559,"{'fpage': '1687', 'lpage': '1699', 'name_0': '...",Requirement for cyclophilin A for the replicat...,[],[],[],"[1.1115869, 0.8275077, -0.67756414, 0.71098024..."


In [60]:
print('DB size forecast:\n')
df = doc_in
# Calculate the DataFrame size in bytes
df_size_bytes = df.memory_usage(deep=True).sum()

# Convert size to kilobytes
df_size_kb = df_size_bytes / 1024
print(f'DataFrame size: {df_size_kb:.2f} KB')
line_size = df_size_kb/len(df)
print(f'Size for line: {line_size:.2f} KB')

print(f'\nSize for a vector db of 100k full text: {df_size_kb*100000/1024/1024:.2f} GB')
print(f'\nSize for a vector db of 100k pmids: < {line_size*100000/1024:.2f} MB')

DB size forecast:

DataFrame size: 145.31 KB
Size for line: 1.20 KB

Size for a vector db of 100k full text: 13.86 GB

Size for a vector db of 100k pmids: < 117.28 MB
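
The forecast above is based on `memory_usage(deep=True)`, which may not count the full payload of vectors stored in an object column: the reported 1.20 KB per row is smaller than the 3 KB that a single 768-dimensional float32 vector occupies. A complementary back-of-envelope estimate of the raw vector payload is sketched below. It assumes 768 dimensions (as listed for SPECTER above), 4 bytes per value (float32), and roughly 120 paragraphs per article (the size of the example frame); `vector_store_bytes` is a hypothetical helper, and index or metadata overhead is ignored.

```python
def vector_store_bytes(n_vectors, dim=768, bytes_per_value=4):
    """Raw payload of n_vectors dense vectors (no index or metadata overhead)."""
    return n_vectors * dim * bytes_per_value

n_articles = 100_000
paragraphs_per_article = 120  # assumption: roughly the row count of the example frame

abstract_level = vector_store_bytes(n_articles)
paragraph_level = vector_store_bytes(n_articles * paragraphs_per_article)

print(f"One vector: {vector_store_bytes(1) / 1024:.2f} KiB")
print(f"100k title/abstract vectors: {abstract_level / 1024**2:.2f} MiB")
print(f"100k full texts at paragraph level: {paragraph_level / 1024**3:.2f} GiB")
```

Under these assumptions, the vectors alone would take more space than the DataFrame-based figures suggest, which matters when choosing between a persistent paragraph-level store and embedding paragraphs on the fly.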


### BioBERT:: transformers pipeline

In [33]:
%%time
# Import necessary libraries
from transformers import pipeline

# Load the feature-extraction pipeline
embedding_model = pipeline("feature-extraction", model='dmis-lab/biobert-base-cased-v1.2')

# Define a function to get embedding
def get_embedding_pipe(text):
    # The pipeline returns token-level vectors as a nested list of shape [1, n_tokens, hidden_size]
    embedding = embedding_model(text)
    return embedding

# Apply the function to a DataFrame column
doc_in['paragraph_embedding'] = doc_in['text'].apply(get_embedding_pipe)
doc_in

CPU times: total: 1min 3s
Wall time: 17.4 s


Unnamed: 0,offset,infons,text,sentences,annotations,relations,paragraph_embedding
0,0,"{'alt-title': 'Population Genetic Complexity',...",Quantifying Organismal Complexity using a Popu...,[],[],[],"[[[0.14836755394935608, -0.2629369795322418, -..."
1,70,"{'section_type': 'ABSTRACT', 'type': 'abstract...",Background,[],[],[],"[[[0.22972223162651062, -0.023931751027703285,..."
2,81,"{'section_type': 'ABSTRACT', 'type': 'abstract'}",Various definitions of biological complexity h...,[],[],[],"[[[0.16660861670970917, -0.15411078929901123, ..."
3,337,"{'section_type': 'ABSTRACT', 'type': 'abstract...",Methodology,[],[],[],"[[[0.32288044691085815, -0.13535937666893005, ..."
4,349,"{'section_type': 'ABSTRACT', 'type': 'abstract'}",Here we propose an alternative complexity metr...,[],[],[],"[[[0.1790732890367508, -0.013409426435828209, ..."
...,...,...,...,...,...,...,...
116,41294,"{'fpage': '1287', 'lpage': '1291', 'name_0': '...",Adaptation to the fitness costs of antibiotic ...,[],[],[],"[[[0.3783426880836487, 0.06785178184509277, -0..."
117,41372,"{'fpage': '1471', 'lpage': '1481', 'name_0': '...",Compensatory evolution in rifampin-resistant E...,[],[],[],"[[[0.3111024498939514, 0.04340777546167374, -0..."
118,41435,"{'fpage': '5233', 'lpage': '5238', 'name_0': '...",Replication of Phi-X174 DNA with Purified Enzy...,[],[],[],"[[[0.3082647919654846, -0.18137937784194946, -..."
119,41559,"{'fpage': '1687', 'lpage': '1699', 'name_0': '...",Requirement for cyclophilin A for the replicat...,[],[],[],"[[[0.46516531705856323, -0.03741031885147095, ..."
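
Unlike the SPECTER cell, the `feature-extraction` pipeline output above is nested three levels deep (one vector per token), so it cannot be compared with cosine similarity directly. One common way to obtain a single vector per paragraph is mean pooling over tokens; the sketch below shows that reduction on a toy nested list. It is an illustration of the technique, not necessarily the pooling the project will use, and `mean_pool` and `pool_pipeline_output` are hypothetical helper names.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]

def pool_pipeline_output(pipeline_output):
    """Reduce a nested [1][n_tokens][hidden_size] list to a single [hidden_size] vector."""
    return mean_pool(pipeline_output[0])

# Toy stand-in for the pipeline result: 1 text, 3 tokens, hidden size 2
fake_output = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
pooled = pool_pipeline_output(fake_output)
print(pooled)
```

Applied to the DataFrame, this would be `doc_in['paragraph_embedding'].apply(pool_pipeline_output)`. Note that the average includes the special tokens the tokenizer adds.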


# Extra

## PMC ID Converter

### simple converter

In [28]:
########## Single Converter ##########
import requests
import xml.etree.ElementTree as ET # Import ElementTree for XML parsing

def convert_pmcid_to_pmid(pmcid):
    # Base URL for the conversion API
    base_url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"

    # API request parameters
    params = {
        'tool': 'pmcid_converter',
        'email': 'your_email@example.com',  # Please replace with your email
        'ids': pmcid
    }

    # Make a GET request to the API
    response = requests.get(base_url, params=params)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the response XML
        root = ET.fromstring(response.content)

        # Extract the PMID from the XML response; guard against a missing <record>
        record = root.find('.//record')
        pmid = record.get('pmid') if record is not None else None

        if pmid:
            return pmid
        else:
            return "PMID not found."
    else:
        return "Error in API request."

# Example usage
pmcid = "PMC5334499"
pmid = convert_pmcid_to_pmid(pmcid)
print("PMCID:", pmcid, "=> PMID:", pmid)

PMCID: PMC5334499 => PMID: 28298962


In [63]:
########## List Converter ##########

def fetch_pubmed_data(ids, tool, email, idtype=None, versions=True, format='json'):
    base_url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"

    # Construct parameters dictionary
    params = {
        'ids': ','.join(ids), # Convert list of IDs to comma-separated string
        'tool': tool,
        'email': email,
        'format': format
    }

    if idtype:
        params['idtype'] = idtype

    if not versions:
        params['versions'] = 'no'

    # Make the GET request to the API
    response = requests.get(base_url, params=params)

    # Raise error if the request was unsuccessful
    response.raise_for_status()

    # Return the response data in specified format
    if format == 'json':
        return response.json()
    elif format == 'csv':
        return response.text
    elif format == 'xml' or format == 'html':
        return response.content

# Example usage
if __name__ == "__main__":
    ids = ['PMC3531190', 'PMC3245039']
    tool = 'example_tool'
    email = 'example@example.com'
    data = fetch_pubmed_data(ids, tool, email, idtype='pmcid', versions=True, format='json')
    print(data['records'])


[{'pmcid': 'PMC3531190', 'pmid': '23193287', 'doi': '10.1093/nar/gks1195', 'versions': [{'pmcid': 'PMC3531190.1', 'current': 'true'}]}, {'pmcid': 'PMC3245039', 'pmid': '22144687', 'doi': '10.1093/nar/gkr1202', 'versions': [{'pmcid': 'PMC3245039.1', 'current': 'true'}]}]


In [None]:
###python
def get_pmid_from_pmcid(pmcids):
    ids = pmcids.tolist()[2:20] # Assuming pmcids is a pandas column or list
    tool = 'example_tool'
    email = 'example@example.com'

    # Uses the module-level fetch_pubmed_data() defined in the List Converter cell
    data = fetch_pubmed_data(ids, tool, email, idtype='pmcid', versions=True, format='json')
    print(data['records'])

    pmids = []
    for record in data['records']:
        if 'pmid' in record:
            pmid = str(record['pmid'])
            pmids.append(pmid)
            print(pmid)

    return pmids
###

In [76]:

def get_pmid_from_pmcid(pmcids):
    ids = pmcids[0].to_list()[2:20]
    tool = 'example_tool'
    email = 'example@example.com'
    data = fetch_pubmed_data(ids, tool, email, idtype='pmcid', versions=True, format='json')
    print(data['records'])

    pmids = []
    for record in data['records']:
        if 'pmid' in record:
            pmid = str(record['pmid'])
            pmids.append(pmid)
            print(pmid)
        else:
            print('pmid not in record')
    return pmids

pmids = get_pmid_from_pmcid(pc_id)

[{'pmcid': 'PMC11164214', 'pmid': '38859952', 'doi': '10.2147/IJN.S462374', 'versions': [{'pmcid': 'PMC11164214.1', 'current': 'true'}]}, {'pmcid': 'PMC11152573', 'pmid': '38837189', 'doi': '10.7554/eLife.89306', 'versions': [{'pmcid': 'PMC11152573.1', 'current': 'true'}]}, {'pmcid': 'PMC11149193', 'pmid': '38834986', 'doi': '10.1186/s12935-024-03377-3', 'versions': [{'pmcid': 'PMC11149193.1', 'current': 'true'}]}, {'pmcid': 'PMC11177981', 'pmid': '38883790', 'doi': '10.21203/rs.3.rs-3373803/v1', 'versions': [{'pmcid': 'PMC11177981.1', 'current': 'true'}]}, {'pmcid': 'PMC11169733', 'pmid': '38872946', 'doi': '10.3389/fnins.2024.1402056', 'versions': [{'pmcid': 'PMC11169733.1', 'current': 'true'}]}, {'pmcid': 'PMC11172763', 'doi': '10.3390/ijms25116033', 'versions': [{'pmcid': 'PMC11172763.1', 'current': 'true'}]}, {'pmcid': 'PMC11135019', 'pmid': '38808948', 'doi': '10.1002/cam4.7308', 'versions': [{'pmcid': 'PMC11135019.1', 'current': 'true'}]}, {'pmcid': 'PMC11160616', 'pmid': '38853

In [72]:
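def get_pmid_from_pmcid(records):
    # `records` is the 'records' list returned by the ID Converter, e.g. data['records'].
    # Records without a 'pmid' key are skipped instead of raising KeyError.
    pmids = []
    for record in records:
        if 'pmid' in record:
            pmids.append(str(record['pmid']))
    return pmids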

In [7]:
import pandas as pd
pd.DataFrame(data['records'])

Unnamed: 0,pmcid,pmid,doi,versions
0,PMC3531190,23193287,10.1093/nar/gks1195,"[{'pmcid': 'PMC3531190.1', 'current': 'true'}]"
1,PMC3245039,22144687,10.1093/nar/gkr1202,"[{'pmcid': 'PMC3245039.1', 'current': 'true'}]"


In [None]:
pmids = []
for i in range(len(data['records'])):
    pmid = str(data['records'][i]['pmid'])
    pmids.append(pmid)
pmids

### txt converter

In [30]:
import pandas as pd
from tqdm import tqdm

# Convert PMCID to PMIDs
pmc_id = pd.read_table('pmc_result.txt', header=None)
pmids = []

# Wrap the for loop with tqdm for progress bar
for i in tqdm(range(len(pmc_id))):
    pmcid = pmc_id[0][i]
    pmid = convert_pmcid_to_pmid(pmcid)
    pmids.append(pmid)

pmids

  0%|          | 8/16515 [00:08<4:35:28,  1.00s/it]


KeyboardInterrupt: 
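
At about one second per request, converting all 16,515 IDs one at a time takes hours, which is why the run above was interrupted. Since the List Converter cell already sends several IDs per request, batching is the natural fix. The sketch below assumes a batch size of 200 (my recollection of the ID Converter's per-request limit, worth confirming against its documentation); `chunked`, `convert_in_batches`, and `fake_fetch` are hypothetical names, and `fake_fetch` replaces the network call so the example runs offline.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def convert_in_batches(pmcids, fetch, batch_size=200):
    """Map PMCID -> PMID (or None) using one `fetch` call per batch."""
    mapping = {}
    for batch in chunked(pmcids, batch_size):
        data = fetch(batch)
        for record in data.get("records", []):
            mapping[record["pmcid"]] = record.get("pmid")  # some records carry no PMID
    return mapping

# Offline stand-in for fetch_pubmed_data, so the sketch runs without network access
def fake_fetch(batch):
    return {"records": [{"pmcid": p, "pmid": p.replace("PMC", "")} for p in batch]}

ids = [f"PMC{n}" for n in range(1, 451)]
result = convert_in_batches(ids, fake_fetch)
print(len(result), "IDs converted in", len(list(chunked(ids, 200))), "requests")
```

With the real service, `fetch` could be `lambda batch: fetch_pubmed_data(batch, tool, email, idtype='pmcid')`, which would cover the full list in 83 requests at this batch size.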

## OA

In [41]:
import requests
import xml.etree.ElementTree as ET

def fetch_pubmed_data(base_url, params):
    response = requests.get(base_url, params=params)
    response.raise_for_status()  # Raise HTTPError for bad responses
    return response.content

def parse_xml_data(xml_data):
    root = ET.fromstring(xml_data)
    results = []
    for record in root.findall('.//record'):
        citation = record.findtext('.//title')
        license_info = record.findtext('.//license')
        update_date = record.findtext('.//update_date')
        ftp_location = record.findtext('.//file')
        results.append({
            'citation': citation,
            'license': license_info,
            'update_date': update_date,
            'ftp_location': ftp_location
        })
    return results

def get_articles_by_date(base_url, from_date, until_date=None, format=None):
    params = {
        'from': from_date,
    }
    if until_date:
        params['until'] = until_date
    if format:
        params['format'] = format

    xml_data = fetch_pubmed_data(base_url, params)
    return parse_xml_data(xml_data)

if __name__ == "__main__":
    base_url = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"
    from_date = "2022-01-01"
    until_date = "2022-12-31"
    article_format = "pdf"

    articles = get_articles_by_date(base_url, from_date, until_date, article_format)
    for article in articles:
        print(f"Citation: {article['citation']}")
        print(f"License: {article['license']}")
        print(f"Update Date: {article['update_date']}")
        print(f"FTP Location: {article['ftp_location']}")
        print("--------------------")


Citation: None
License: None
Update Date: None
FTP Location: None
--------------------
...
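
Every field in the output above is `None`, which suggests that `record.findtext('.//title')` and the other lookups do not match the structure of the response. A likely explanation, stated here as an assumption to verify against a real response, is that the OA service puts the citation and license in attributes of `<record>` and the update date and location in attributes of a nested `<link>` element. The sketch below reads attributes instead of child-element text; `parse_oa_records` is a hypothetical replacement for `parse_xml_data`, and the sample XML is a hand-written placeholder, not real service output.

```python
import xml.etree.ElementTree as ET

def parse_oa_records(xml_data):
    """Read record metadata from attributes rather than child-element text."""
    root = ET.fromstring(xml_data)
    results = []
    for record in root.findall(".//record"):
        link = record.find("link")  # first download link, if any
        results.append({
            "id": record.get("id"),
            "citation": record.get("citation"),
            "license": record.get("license"),
            "update_date": link.get("updated") if link is not None else None,
            "ftp_location": link.get("href") if link is not None else None,
        })
    return results

# Hand-written placeholder response; the ID and values are not real records
sample_xml = """<OA>
  <records>
    <record id="PMC0000000" citation="Example J. 2022; 1(1):1-2" license="CC BY">
      <link format="pdf" updated="2022-01-01 00:00:00" href="ftp://example.org/a.pdf"/>
    </record>
  </records>
</OA>"""

articles = parse_oa_records(sample_xml)
print(articles[0])
```

Printing `response.content[:500]` from the real request is a quick way to confirm the actual element and attribute names before adopting this parser.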

## API Retriever

In [10]:
# Building API Retriever
from pychatgpt import copilot
import scrapers as sc
m='''@
write an API retriever for the PubMed Central API in Python.

Get formatted citations and tag formats, such as MEDLINE or RIS, for journal articles from PubMed and PMC.
Base URL: https://api.ncbi.nlm.nih.gov/lit/ctxp/v1/pmc/

get formatted citation
/lit/ctxp/v1/pmc/?format=citation&contenttype=json&id=

Below are the official developer guidelines
'''#+sc.get_text_from_url(url='https://api.ncbi.nlm.nih.gov/lit/ctxp/')
#m='''NameError: name 'ET' is not defined'''
copilot(m)

```python
import requests

# Base URL for PubMed Central API
BASE_URL = 'https://api.ncbi.nlm.nih.gov/lit/ctxp/v1/pmc/'

# Function to get formatted citation
def get_formatted_citation(article_id, format='medline'):
    # Create request URL
    request_url = f"{BASE_URL}?format=citation&contenttype=json&id={article_id}&style={format}"
    
    # Send GET request to API
    response = requests.get(request_url)
    
    # Check if request was successful
    if response.status_code == 200:
        return response.json()  # Return the citation in JSON format
    else:
        return {'error': f"Request failed with status code {response.status_code}"}

# Example usage
article_id = 'PMC1234567'  # Replace with your article ID
citation = get_formatted_citation(article_id)
print(citation)
```
