In [1]:
import pandas as pd
from semanticscholar import SemanticScholar
import openai
from xml.etree import ElementTree
from Bio import Entrez

In [2]:
from openai_query import *

In [3]:
email = 'i5lee@health.ucsd.edu'

In [4]:
paragraphs = """The interacting protein system being analyzed consists of six proteins: CHST9, DKK1, LRP5, LRP6, RASD2, and TK2. These proteins are involved in various mechanisms and biological processes, with a significant emphasis on the Wnt signaling pathway, which plays a critical role in embryonic development, tissue homeostasis, and cell differentiation. The canonical Wnt signaling pathway includes three of the proteins: DKK1, LRP5, and LRP6. This pathway is crucial for the regulation of gene expression and cellular functions.

Regarding cellular components and complexes, the system's proteins are located in various cellular regions. Four of the proteins (CHST9, LRP5, LRP6, and RASD2) are found in the membrane, with DKK1, LRP5, LRP6, and RASD2 being present in the plasma membrane. CHST9, DKK1, and LRP6 are located in the extracellular region, while LRP5 and LRP6 are found in the endoplasmic reticulum. Additionally, DKK1 and LRP6 are present in the early endosome membrane. The Wnt-Frizzled-LRP5/6 complex involves LRP5 and LRP6, which participate in the coreceptor activity for both canonical Wnt signaling and the Wnt signaling pathway.

With regards to ASD (Autism Spectrum Disorder), LRP5 and LRP6 are genes with high-confidence mutations in ASD-diagnosed individuals. All six proteins (CHST9, DKK1, LRP5, LRP6, RASD2, and TK2) are part of systems with two or more known ASD-risk genes. However, specific ASD-risk genes identified in the Satterstrom et al., 2020, and Fu et al., 2022 studies were not provided, and information on proteins connected to ASD-risk proteins (AP-MS experiment) and ASD-risk in SFARI categories 2 and 3 is also missing.

From the Uniprot analysis, three proteins (LRP5, LRP6, and TK2) have disease variant associations. LRP6 and RASD2 are associated with synapses, which are crucial for neural communication and may play a role in ASD pathophysiology. Moreover, LRP5 and LRP6 are involved in the regulation of various transcription factors and processes, which may influence the expression of genes linked to ASD.

In summary, this interacting protein system appears to be predominantly involved in the Wnt signaling pathway, with an emphasis on the canonical Wnt signaling. The presence of high-confidence ASD-associated mutations in LRP5 and LRP6, along with the system's overall connection to ASD-risk genes, suggests that further investigation into the role of these proteins and their relationship with ASD is warranted.

As mentioned earlier, LRP5 and LRP6 are the genes with high-confidence mutations in ASD-diagnosed individuals in this system. To determine whether they can be considered as potential novel ASD-risk genes, we need to verify whether they are included in known ASD-risk gene sets, such as SFARI. Unfortunately, the provided information does not mention their inclusion or exclusion from such gene sets. Assuming that they are not included in any known ASD-risk gene sets, we can analyze their potential as novel ASD-risk genes based on their associations and functions.

LRP5: This gene is involved in the canonical Wnt signaling pathway, which is crucial for the regulation of gene expression and cellular functions. It is a membrane protein found in the plasma membrane and endoplasmic reticulum. LRP5 also participates in the Wnt-Frizzled-LRP5/6 complex and is involved in the regulation of various transcription factors and processes. Although there is no direct evidence of LRP5's association with diseases comorbid with ASD, its involvement in the Wnt signaling pathway, which is known to play a role in brain development and function, could imply a potential connection to ASD.

LRP6: Like LRP5, LRP6 is also involved in the canonical Wnt signaling pathway and has a similar cellular distribution. It is associated with synapses, which are crucial for neural communication. Dysfunctions at the synapse level have been implicated in ASD. LRP6's association with synapses and the Wnt signaling pathway could potentially link it to ASD. However, there is no specific information about LRP6's association with diseases comorbid with ASD.

In summary, assuming LRP5 and LRP6 are not included in known ASD-risk gene sets, they can be considered as potential novel ASD-risk genes based on their involvement in the Wnt signaling pathway, cellular distribution, and, in the case of LRP6, its association with synapses."""

In [5]:
paragraphs = list(filter(None, paragraphs.split("\n")))

In [6]:
import requests

api_key = 'use your api key'
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}

In [7]:
paragraph = paragraphs[0]

In [37]:
def get_genes_from_paragraph(paragraph, gpt_model='gpt-4', verbose=False):
    context = "I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with \"Unknown\"."
    query = """I have paragraph\nParagraph:\n%s\nI would like to search PubMed to validate this abstract. give me a list of gene symbols from paragraph. please only include genes. Just tell me keywords only with comma seperated without spacing, if there is no gene, please tell me "Unknown" """%paragraph
    #'''
    keyword_extraction_data = {
    "model": gpt_model,
        "temperature": 0,
        "messages": [
        {"role": "system", "content": context},
    ] + [{"role": "user", "content": query}]}
    
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=keyword_extraction_data)
    response_json = response.json()
    if 'choices' in response_json.keys():
        result = response_json["choices"][0]["message"]["content"]
    else:
        result = None
    #'''
    #context = 
    #result = openai_chat(context, query, gpt_model, 0, max_tokens=2000, rate_per_token=0.001)
    if verbose: 
        print("Query:")
        print(query)
        print("Result:")
        print(result)
    if result is not None:
        return [keyword.strip() for keyword in result.split(",")]
    else:
        print(response_json)

In [38]:
def get_molecular_functions_from_paragraph(paragraph, gpt_model='gpt-4', verbose=False):
    context = "I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with \"Unknown\"."
    query = """I have paragraph\nParagraph:\n%s\nI would like to search PubMed to validate this abstract. give me a list of keywords of molecular functions. please don't include gene symbols. please order keywords by their importance in paragraph, from high important to low important. Just tell me keywords only with comma seperated without spacing. if there is no molecular function, please tell me "Unknown" """%paragraph
    #'''
    keyword_extraction_data = {
    "model": gpt_model,
        "temperature": 0,
        "messages": [
        {"role": "system", "content": context},
    ] + [{"role": "user", "content": query}]}
    
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=keyword_extraction_data)
    response_json = response.json()
    if 'choices' in response_json.keys():
        result = response_json["choices"][0]["message"]["content"]
    else:
        result = None
    #'''
    #context = 
    #result = openai_chat(context, query, gpt_model, 0, max_tokens=2000, rate_per_token=0.001)
    if verbose: 
        print("Query:")
        print(query)
        print("Result:")
        print(result)
    if result is not None:
        return [keyword.strip() for keyword in result.split(",")]
    else:
        print(response_json)

In [10]:
# A `global` statement at module level has no effect; a plain assignment is enough.
LOG_FILE = "./temp.log"

In [11]:
genes = get_genes_from_paragraph(paragraph)

In [12]:
genes

['CHST9', 'DKK1', 'LRP5', 'LRP6', 'RASD2', 'TK2']

In [39]:
no_gene = get_genes_from_paragraph("embryonic development")

In [40]:
no_gene

['Unknown']

In [14]:
molecular_functions = get_molecular_functions_from_paragraph(paragraph)

In [15]:
molecular_functions

['Wnt signaling pathway',
 'embryonic development',
 'tissue homeostasis',
 'cell differentiation',
 'canonical Wnt signaling',
 'gene expression',
 'cellular functions']

In [41]:
no_molecular_functions = get_molecular_functions_from_paragraph("CHST9")

In [42]:
no_molecular_functions

['Unknown']
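
Both extraction helpers split the model's reply on commas and signal an empty result with the one-item list ['Unknown']. A minimal sketch of a shared parser, assuming that reply format; `parse_keyword_reply` is a hypothetical name that is not defined in this notebook, and it returns an empty list instead of the sentinel so callers can test the result with a plain truthiness check:

```python
def parse_keyword_reply(reply):
    """Split a comma-separated model reply into clean keywords.

    Hypothetical helper (not part of the notebook): returns [] when the
    reply is missing or is the "Unknown" sentinel requested in the prompt.
    """
    if reply is None:
        return []
    # Trim whitespace around each item and drop empty items.
    keywords = [keyword.strip() for keyword in reply.split(",")]
    keywords = [keyword for keyword in keywords if keyword]
    # Treat Unknown (with or without surrounding quotes) as "nothing found".
    if not keywords or keywords[0].strip('"').lower() == "unknown":
        return []
    return keywords


print(parse_keyword_reply("CHST9,DKK1, LRP5"))
print(parse_keyword_reply("Unknown"))
```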

In [16]:
def get_mla_citation(doi):
    url = f'https://api.crossref.org/works/{doi}'
    headers = {'accept': 'application/json'}
    
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        #print(data)
        item = data['message']
        
        authors = item['author']
        formatted_authors = []
        for author in authors:
            formatted_authors.append(f"{author['family']}, {author.get('given', '')}")
        authors_str = ', '.join(formatted_authors)
        
        title = item['title'][0]
        container_title = item['container-title'][0]
        year = item['issued']['date-parts'][0][0]
        volume = item.get('volume', '')
        issue = item.get('issue', '')
        page = item.get('page', '')
        
        mla_citation = f"{authors_str}. \"{title}.\" {container_title}"
        if volume or issue:
            mla_citation += f", vol. {volume}" if volume else ''
            mla_citation += f", no. {issue}" if issue else ''
        mla_citation += f", {year}, pp. {page}."
        
        return mla_citation

In [17]:
str({"hello", "there"})

"{'there', 'hello'}"

In [18]:
def get_mla_citation_from_pubmed_id(paper_dict):
    article = paper_dict['MedlineCitation']['Article']
    #print(article.keys())
    authors = article['AuthorList']
    formatted_authors = []
    for author in authors:
        last_name = author['LastName'] if author['LastName'] is not None else ''
        first_name = author['ForeName'] if author['ForeName'] is not None else ''
        formatted_authors.append(f"{last_name}, {first_name}")
    authors_str = ', '.join(formatted_authors)

    title = article['ArticleTitle']
    journal = article['Journal']['Title']
    year = article['Journal']['JournalIssue']['PubDate']['Year']
    page = article['Pagination']['MedlinePgn']
    mla_citation = f"{authors_str}. \"{title}\" {journal}"
    # Volume and Issue are children of JournalIssue, not of PubDate.
    journal_issue = article['Journal']['JournalIssue']
    volume = journal_issue.get('Volume', '')
    issue = journal_issue.get('Issue', '')
    mla_citation += f", vol. {volume}" if volume else ''
    mla_citation += f", no. {issue}" if issue else ''
    mla_citation += f", {year}, pp. {page}."
    return mla_citation

articles = search_pubmed("LRP5 AND LRP6 AND DKK1 AND Wnt signaling pathway", email)

get_mla_citation_from_pubmed_id(articles[0])
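
As a sketch of the record layout this function reads, the snippet below formats a hand-built dictionary shaped like one parsed PubmedArticle entry, so it runs without a network call. The record contents are invented placeholders, and `format_citation` is a hypothetical stand-alone variant rather than the notebook's function. It reads Volume and Issue from JournalIssue, which is where the PubMed XML places them, and it tolerates a missing Pagination block:

```python
def format_citation(record):
    """Format an MLA-style citation from a PubmedArticle-shaped dict (sketch)."""
    article = record["MedlineCitation"]["Article"]
    authors = ", ".join(
        f"{author.get('LastName', '')}, {author.get('ForeName', '')}"
        for author in article["AuthorList"]
    )
    title = article["ArticleTitle"]
    journal = article["Journal"]["Title"]
    journal_issue = article["Journal"]["JournalIssue"]

    citation = f'{authors}. "{title}" {journal}'
    if journal_issue.get("Volume"):
        citation += f", vol. {journal_issue['Volume']}"
    if journal_issue.get("Issue"):
        citation += f", no. {journal_issue['Issue']}"
    citation += f", {journal_issue['PubDate'].get('Year', 'n.d.')}"
    pages = article.get("Pagination", {}).get("MedlinePgn", "")
    if pages:
        citation += f", pp. {pages}"
    return citation + "."


# Placeholder record: every value below is made up for illustration.
example_record = {
    "MedlineCitation": {
        "Article": {
            "AuthorList": [{"LastName": "Doe", "ForeName": "Jane"}],
            "ArticleTitle": "An example title.",
            "Journal": {
                "Title": "Example Journal",
                "JournalIssue": {
                    "Volume": "12",
                    "Issue": "3",
                    "PubDate": {"Year": "2020"},
                },
            },
            "Pagination": {"MedlinePgn": "100-110"},
        }
    }
}

print(format_citation(example_record))
```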

In [19]:
def get_citation(paper):
    names = ",".join([author['name'] for author in paper['authors']])
    corrected_title = paper['title']
    journal = paper['journal']['name']
    pub_date = paper['publicationDate']
    if 'volume' in paper['journal'].keys(): 
        volume = paper['journal']['volume'].strip()
    else:
        volume = ''
    if 'pages' in paper['journal'].keys():
        pages = paper['journal']['pages'].strip()
    else:
        doi = paper['externalIds']['DOI']
        pages = doi.strip().split(".")[-1]
    citation = f"{names}. {corrected_title} {journal} {volume} ({pub_date[0:4]}):{pages}"
    return citation

In [20]:
def get_references(queried_papers, paragraph, gpt_model='gpt-4', n=10, verbose=False):
    citations = []
    for paper in queried_papers:
        abstract = paper['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
        message = """I have paragraph\n Paragraph:\n%s\nand abstract.\n Abstract:\n%s\nDoes this abstract support this paragraph? Please tell me yes or no"""%(paragraph, abstract)
        
        # Send the question once: a system prompt followed by a single user message.
        reference_check_data = {
            "model": gpt_model,
            "temperature": 0,
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": message},
            ],
        }
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=reference_check_data)

        response_json = response.json()

        if 'choices' in response_json.keys():
            result = response_json['choices'][0]['message']['content']
            if result[:3].lower()=='yes':
                try:
                    citation = get_mla_citation_from_pubmed_id(paper)
                    if citation not in citations:
                        citations.append(citation)
                except Exception as e:
                    print("Cannot parse citation even though this paper supports the paragraph")
                    print("Error detail: ", e)
                if len(citations)>=n:
                    return citations
        else:
            result = "No"    
        if verbose:
            print("Title: ", paper['MedlineCitation']['Article']['ArticleTitle'])
            print("Query: ")
            print(message)
            print("Result:")
            print(result)
            print("="*200)
    return citations

In [21]:
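# ESearch documents 'relevance', 'pub_date', 'Author' and 'JournalName' as PubMed sort values; 'citation_count' is not among them.
def search_pubmed(keywords, email, sort_by='relevance', retmax=10):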
    Entrez.email = email

    search_query = f"{keywords} AND (hasabstract[text])"
    search_handle = Entrez.esearch(db='pubmed', term=search_query, sort=sort_by, retmax=retmax)
    search_results = Entrez.read(search_handle)
    search_handle.close()

    id_list = search_results['IdList']

    if not id_list:
        print("No results found.")
        return []

    fetch_handle = Entrez.efetch(db='pubmed', id=id_list, retmode='xml')
    articles = Entrez.read(fetch_handle)['PubmedArticle']
    fetch_handle.close()

    return articles

In [44]:
def get_papers(keywords, n):
    total_papers = []
    for keyword in keywords:
        print("Searching Keyword :", keyword)
        try:
            pubmed_queried_keywords= search_pubmed(keyword, email=email)
            print("%d papers are found"%len(pubmed_queried_keywords))
            total_papers += list(pubmed_queried_keywords[:n])
        except Exception as e:
            # Report the actual failure (network, parsing) instead of hiding it.
            print("Search failed for this keyword:", e)
    return total_papers

In [43]:
def get_keywords_combinations(paragraph, gpt_model='gpt-4', verbose=False):
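    genes = get_genes_from_paragraph(paragraph, gpt_model, verbose)
    functions = get_molecular_functions_from_paragraph(paragraph, gpt_model, verbose)
    # Both helpers return None when the API call fails and ['Unknown'] when nothing is found.
    if not genes or not functions or genes[0]=='Unknown' or functions[0]=='Unknown':
        return []
    gene_query = " OR ".join(["(%s[Title/Abstract])"%gene for gene in genes])
    # Parenthesize the OR-joined genes so the AND applies to the whole gene group.
    keywords = ["(%s) AND (%s[Title/Abstract])"%(gene_query, function) for function in functions]
    return keywords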

In [46]:
get_keywords_combinations('KRAS', verbose=True)

Query:
I have paragraph
Paragraph:
KRAS
I would like to search PubMed to validate this abstract. give me a list of gene symbols from paragraph. please only include genes. Just tell me keywords only with comma seperated without spacing, if there is no gene, please tell me "Unknown" 
Result:
KRAS
Query:
I have paragraph
Paragraph:
KRAS
I would like to search PubMed to validate this abstract. give me a list of keywords of molecular functions. please don't include gene symbols. please order keywords by their importance in paragraph, from high important to low important. Just tell me keywords only with comma seperated without spacing. if there is no molecular function, please tell me "Unknown" 
Result:
Unknown


[]
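
The empty list above comes from the 'Unknown' check. For the non-empty case, here is a minimal sketch of how one query string per function keyword can be assembled, with the OR-joined gene clause wrapped in parentheses so the grouping is explicit. `build_pubmed_queries` is a hypothetical helper that takes already-extracted lists, so it needs no API call:

```python
def build_pubmed_queries(genes, functions):
    """Return one PubMed query per function keyword (hypothetical helper)."""
    if not genes or not functions:
        return []
    # Any of the genes may appear in the title or abstract.
    gene_clause = " OR ".join(f"({gene}[Title/Abstract])" for gene in genes)
    # Each query requires the gene group AND one function keyword.
    return [
        f"({gene_clause}) AND ({function}[Title/Abstract])"
        for function in functions
    ]


for query in build_pubmed_queries(["LRP5", "LRP6"], ["Wnt signaling pathway"]):
    print(query)
```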

In [24]:
def get_references_for_paragraphs(paragraphs, n=5, gpt_model='gpt-4', verbose=False):
    references_paragraphs = []
    for i, paragraph in enumerate(paragraphs):
        if verbose:
            print("""Extracting keywords from paragraph\nParagraph:\n%s"""%paragraph)
            print("="*75)
        keywords = get_keywords_combinations(paragraph, gpt_model=gpt_model, verbose=verbose)
        #keywords = list(sorted(keywords, key=len))
        keyword_joined = ",".join(keywords)
        #print("Keywords: ", keyword_joined)
        print("Searching paper with keywords...")
        pubmed_queried_keywords = get_papers(keywords, n)
        if len(pubmed_queried_keywords)==0:
            print("No paper searched!!")
            references_paragraphs.append([])
            # Skip the reference check so this paragraph gets exactly one (empty) entry.
            continue
        print("In paragraph %d, %d references are queried"%(i+1, len(pubmed_queried_keywords)))
        references = get_references(pubmed_queried_keywords, paragraph, gpt_model=gpt_model, n=n, verbose=verbose)
        references_paragraphs.append(references)
        print("In paragraph %d, %d references are matched"%(i+1, len(references)))
        print("")
        print("")
    n_refs = sum(len(refs) for refs in references_paragraphs)
    print("Total %d references are matched"%n_refs)
    print(references_paragraphs)
    i = 1
    referenced_paragraphs = ""
    footer = "="*200+"\n"
    for paragraph, references in zip(paragraphs, references_paragraphs):
        referenced_paragraphs += paragraph
        for reference in references:
            referenced_paragraphs += "[%d]"%i
            footer += "[%d] %s"%(i, reference) + '\n'
            i+=1
        referenced_paragraphs += "\n"
    return referenced_paragraphs + footer
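
The numbering step at the end of this function can be exercised without any API call. A minimal sketch, assuming one list of citation strings per paragraph; `number_references` is a hypothetical name and the citation strings are placeholders. Because `zip` pairs paragraphs with reference lists by position, the two inputs need exactly one entry per paragraph:

```python
def number_references(paragraphs, references_per_paragraph):
    """Append [n] markers to each paragraph and build a numbered footer."""
    body_lines = []
    footer_lines = []
    counter = 1
    for paragraph, references in zip(paragraphs, references_per_paragraph):
        markers = ""
        for reference in references:
            # The same running number appears after the paragraph and in the footer.
            markers += f"[{counter}]"
            footer_lines.append(f"[{counter}] {reference}")
            counter += 1
        body_lines.append(paragraph + markers)
    return "\n".join(body_lines), "\n".join(footer_lines)


body, footer = number_references(
    ["First paragraph.", "Second paragraph."],
    [["Citation A", "Citation B"], []],
)
print(body)
print(footer)
```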
        

In [25]:
paragraphs_with_references = get_references_for_paragraphs(paragraphs[:2], n=3, gpt_model="gpt-4", verbose=True)

Extracting keywords from paragraph
Paragraph:
The interacting protein system being analyzed consists of six proteins: CHST9, DKK1, LRP5, LRP6, RASD2, and TK2. These proteins are involved in various mechanisms and biological processes, with a significant emphasis on the Wnt signaling pathway, which plays a critical role in embryonic development, tissue homeostasis, and cell differentiation. The canonical Wnt signaling pathway includes three of the proteins: DKK1, LRP5, and LRP6. This pathway is crucial for the regulation of gene expression and cellular functions.
Query:
I have paragraph
Paragraph:
The interacting protein system being analyzed consists of six proteins: CHST9, DKK1, LRP5, LRP6, RASD2, and TK2. These proteins are involved in various mechanisms and biological processes, with a significant emphasis on the Wnt signaling pathway, which plays a critical role in embryonic development, tissue homeostasis, and cell differentiation. The canonical Wnt signaling pathway includes thre

In [27]:
print(paragraphs_with_references)

The interacting protein system being analyzed consists of six proteins: CHST9, DKK1, LRP5, LRP6, RASD2, and TK2. These proteins are involved in various mechanisms and biological processes, with a significant emphasis on the Wnt signaling pathway, which plays a critical role in embryonic development, tissue homeostasis, and cell differentiation. The canonical Wnt signaling pathway includes three of the proteins: DKK1, LRP5, and LRP6. This pathway is crucial for the regulation of gene expression and cellular functions.[1][2][3]
Regarding cellular components and complexes, the system's proteins are located in various cellular regions. Four of the proteins (CHST9, LRP5, LRP6, and RASD2) are found in the membrane, with DKK1, LRP5, LRP6, and RASD2 being present in the plasma membrane. CHST9, DKK1, and LRP6 are located in the extracellular region, while LRP5 and LRP6 are found in the endoplasmic reticulum. Additionally, DKK1 and LRP6 are present in the early endosome membrane. The Wnt-Frizzle