## Query OpenTargets with LLMs

##### This notebook was prepared for the Pistoia Alliance LLM evaluation effort
Author: Helena F. Deus, PhD
License: CC-BY-NC-SA (https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)
Date: 2024-02-23

In [1]:
graphQL_prompt = '''Your task is to create a GraphQL query for the user request given a schema. The instructions below will guide you systematically in creating a functional query. 

1. Determine the Operation Type: Start by defining the operation type of your GraphQL. For instance, in the provided examples, the operation type is `query`. 

2. Identify the Main Object: Scrutinize the request and identify the main subjects. These subjects correspond to the main objects in your GraphQL query. In the example below, 'disease' with the parameter 'efoId' is the main object.

3. Specify the Parameters: The requests usually contain specific values that need to be passed as parameters in the query. Pass these in the parentheses following the object. For instance, in the example below, 'efoId' is a parameter of the 'disease' object.

4. Set Return Fields: In GraphQL, you need to specify exactly what details you want returned by your queries. Identify the appropriate fields you need returned for your query from the request and list them within a pair of curly brackets after your object and parameters. 

5. Use Nested Fields: Sometimes, the information you need is nested within other fields. Make sure to check the schema and include these nested fields where necessary. For instance, the field 'associatedTargets' in the example below contains the subfields 'count' and 'rows'.

6. Reference Schema Documentation: Always remember to refer to the schema documentation when creating your query. The schema contains the structure and type of data that can be queried and will indicate what fields are available and how they are connected.

7. Test Your Query: Once your query is prepared, test it to ensure it works effectively and returns the required data. If you encounter any issues or the data returned is not as expected, refer back to the schema and adjust your query accordingly.

8. Return only the query, with no explanation, so that it can be parsed correctly.

9. EFO identifiers use an underscore instead of a colon, as in the example below. If the user provides a colon, replace it with an underscore: EFO:0000349 becomes EFO_0000349.

<Examples>
	

	Question: Find targets associated with a specific disease or phenotype (EFO_0000349)
	Answer: query associatedTargets {{
	  disease(efoId: "EFO_0000349") {{
	    id
	    name
	    associatedTargets {{
	      count
	      rows {{
	        target {{
	          id
	          approvedSymbol
	        }}
	        score
	      }}
	    }}
	  }}
	}}
    
    

<Schema> 
{schema}

<Question>
{question}

Remember that GraphQL is strongly typed and case sensitive. Fields are returned only if they are explicitly requested. Errors do not need to be requested: the server reports them in the top-level 'errors' field of the response. When you are stuck, refer back to the schema or to the GraphQL documentation.'''
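
The example query inside the prompt doubles every brace (`{{`, `}}`) because single braces are reserved for the `{schema}` and `{question}` placeholders. The sketch below illustrates that escaping rule with Python's built-in `str.format`, on the assumption that the template is rendered with the same f-string-style rules; `template` is a shortened, hypothetical stand-in for `graphQL_prompt`.

```python
# Shortened, hypothetical stand-in for graphQL_prompt.
template = (
    "Answer: query associatedTargets {{\n"
    '  disease(efoId: "EFO_0000349") {{ id name }}\n'
    "}}\n"
    "<Question>\n"
    "{question}"
)

# Doubled braces are emitted as literal braces; {question} is substituted.
rendered = template.format(question="Find targets for EFO_0000571")
print(rendered)
```

Forgetting to double a brace in the examples would typically surface as a missing-variable error when the template is rendered, since the text between the braces would be read as a placeholder name.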


In [2]:
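## Replace this path with the location of your own OpenAI API key file
with open("../openaikey-personal", "r") as key_file:
    # strip() removes a trailing newline that would otherwise corrupt the key
    OPENAI_API_KEY = key_file.read().strip()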

## Define the prompt, LLM, and output parser

In [3]:
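from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Build the prompt from the template defined above
prompt = ChatPromptTemplate.from_template(graphQL_prompt)

# Chat model; no model name is given, so the library default is used
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY)

# Convert the model's reply into a plain string
output_parser = StrOutputParser()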

## Run the chain with input from the user

In [4]:
with open("./data/open-targets-schema.txt", "r") as schema_file:
    schema = schema_file.read()
question = "Find the targets associated with lung adenocarcinoma (EFO:0000571)"
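
Step 9 of the prompt asks the model to rewrite `EFO:0000571` as `EFO_0000571`. The same rewrite can be done deterministically before the chain is invoked, which removes one thing the model could get wrong. The following is a minimal sketch; `normalize_ontology_ids` is a hypothetical helper, not part of the original notebook, and the pattern assumes identifiers of the form letters, colon, digits.

```python
import re

# Hypothetical helper: rewrite CURIE-style IDs (PREFIX:digits) into the
# underscore form, e.g. EFO:0000571 -> EFO_0000571.
_CURIE = re.compile(r"\b([A-Za-z]+):(\d+)\b")

def normalize_ontology_ids(text: str) -> str:
    """Replace the colon in every PREFIX:digits token with an underscore."""
    return _CURIE.sub(r"\1_\2", text)

example_question = "Find the targets associated with lung adenocarcinoma (EFO:0000571)"
print(normalize_ontology_ids(example_question))
```

The pattern is deliberately broad, so it would also rewrite unrelated tokens such as `page:12`. Restricting the prefix to a known list (for example `EFO`, `MONDO`, `HP`) would be a reasonable tightening.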

In [5]:
chain = prompt | llm | output_parser
output = chain.invoke({"question": question, "schema":schema})
print(output)

query associatedTargets {
  disease(efoId: "EFO_0000571") {
    id
    name
    associatedTargets {
      count
      rows {
        target {
          id
          approvedSymbol
        }
        score
      }
    }
  }
}
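
The prompt asks for the bare query, and the run above complied. Chat models do sometimes wrap their reply in a Markdown code fence despite such an instruction, which would make the parsing step fail. A small, hedged sketch of a cleanup step follows; `strip_code_fences` is a hypothetical helper and only handles a single fence around the whole reply.

```python
import re

FENCE = "`" * 3  # three backticks
# An optional language tag after the opening fence, then the body, then the closing fence.
_FENCED = re.compile(rf"^{FENCE}[A-Za-z]*[ \t]*\n(.*?)\n{FENCE}$", re.DOTALL)

def strip_code_fences(reply: str) -> str:
    """Return the reply without a surrounding Markdown code fence, if any."""
    reply = reply.strip()
    match = _FENCED.match(reply)
    return match.group(1).strip() if match else reply

fenced_reply = FENCE + "graphql\nquery { disease { id } }\n" + FENCE
print(strip_code_fences(fenced_reply))
```

Applied to a reply that has no fence, the helper returns the text unchanged apart from trimming surrounding whitespace.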


In [6]:
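# Is the query syntactically valid?
# Note: parse() checks GraphQL syntax only, not whether the fields exist in the schema.
from graphql import GraphQLSyntaxError, parse

try:
    parsed = parse(output)
    query_valid = True
    print("Query is valid!")
except GraphQLSyntaxError as err:
    query_valid = False
    print(f"Query not valid: {err}")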


Query is valid!


## Now run this against OpenTargets API

In [7]:
import requests

# Define the URL of the Open Targets GraphQL API endpoint
url = 'https://api.platform.opentargets.org/api/v4/graphql'

# Set up the request headers
headers = {
    'Content-Type': 'application/json',
}

# Set up the request payload
data = {
    'query': output
}

# Send the GraphQL query request (the timeout prevents the cell from hanging indefinitely)
response = requests.post(url, headers=headers, json=data, timeout=30)

# Check if the request was successful
if response.status_code == 200:
    # Print the JSON response
    json_data = response.json()
    print(json_data)
else:
    # Print an error message if the request failed
    print(f"Error: {response.status_code}")
    print(response.text)
    


{'data': {'disease': {'id': 'EFO_0000571', 'name': 'lung adenocarcinoma', 'associatedTargets': {'count': 7962, 'rows': [{'target': {'id': 'ENSG00000146648', 'approvedSymbol': 'EGFR'}, 'score': 0.8267159415027404}, {'target': {'id': 'ENSG00000141510', 'approvedSymbol': 'TP53'}, 'score': 0.7679343542735289}, {'target': {'id': 'ENSG00000133703', 'approvedSymbol': 'KRAS'}, 'score': 0.7564033322165448}, {'target': {'id': 'ENSG00000157764', 'approvedSymbol': 'BRAF'}, 'score': 0.7291460177657758}, {'target': {'id': 'ENSG00000118046', 'approvedSymbol': 'STK11'}, 'score': 0.7224909751877195}, {'target': {'id': 'ENSG00000171094', 'approvedSymbol': 'ALK'}, 'score': 0.6864287642178499}, {'target': {'id': 'ENSG00000141736', 'approvedSymbol': 'ERBB2'}, 'score': 0.6630123723575108}, {'target': {'id': 'ENSG00000182872', 'approvedSymbol': 'RBM10'}, 'score': 0.6510321647486}, {'target': {'id': 'ENSG00000121879', 'approvedSymbol': 'PIK3CA'}, 'score': 0.639439527044695}, {'target': {'id': 'ENSG00000047936
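
The status-code check above covers transport-level failures. A GraphQL server may also report problems in a top-level `errors` list inside the JSON body, and a lookup that finds nothing may come back as a null object. The sketch below shows one defensive way to pull out the rows; `extract_rows` is a hypothetical helper, and the assumption that an unknown `efoId` yields `"disease": null` has not been checked against the live API.

```python
def extract_rows(payload: dict) -> list:
    """Return the associatedTargets rows, raising if the response carries errors."""
    if payload.get("errors"):
        messages = "; ".join(e.get("message", "unknown error") for e in payload["errors"])
        raise ValueError(f"GraphQL errors: {messages}")
    disease = (payload.get("data") or {}).get("disease")
    if disease is None:
        # Assumption: an unknown efoId yields "disease": null rather than an error.
        return []
    return (disease.get("associatedTargets") or {}).get("rows", [])

sample = {"data": {"disease": {"associatedTargets": {"count": 1, "rows": [
    {"target": {"id": "ENSG00000146648", "approvedSymbol": "EGFR"}, "score": 0.83}
]}}}}
print(extract_rows(sample))
```

Calling `extract_rows(json_data)` before building the DataFrame would turn a silent `KeyError` or `TypeError` into an explicit message.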

## Now load this into pandas

In [8]:
from IPython.display import display
import pandas as pd
# Extract the rows from the JSON data
rows = json_data['data']['disease']['associatedTargets']['rows']

# Build one list per column from the rows
target_ids = [row['target'].get('id', '') for row in rows]
approved_symbols = [row['target'].get('approvedSymbol', '') for row in rows]
# The generated query does not request the target's 'name', so this column stays empty
names = [row['target'].get('name', '') for row in rows]
scores = [row.get('score', '') for row in rows]
disease_name = json_data['data']['disease']['name']

# Create a DataFrame with the extracted information
df = pd.DataFrame({
    'Disease Name': [disease_name] * len(rows),  # Repeat disease name for all rows
    'Target ID': target_ids,
    'Approved Symbol': approved_symbols,
    'Name': names,
    'Score': scores
})

df_sorted = df.sort_values(by='Score', ascending=False)

# Display the DataFrame
display(df_sorted)

| | Disease Name | Target ID | Approved Symbol | Name | Score |
|---|---|---|---|---|---|
| 0 | lung adenocarcinoma | ENSG00000146648 | EGFR | | 0.826716 |
| 1 | lung adenocarcinoma | ENSG00000141510 | TP53 | | 0.767934 |
| 2 | lung adenocarcinoma | ENSG00000133703 | KRAS | | 0.756403 |
| 3 | lung adenocarcinoma | ENSG00000157764 | BRAF | | 0.729146 |
| 4 | lung adenocarcinoma | ENSG00000118046 | STK11 | | 0.722491 |
| 5 | lung adenocarcinoma | ENSG00000171094 | ALK | | 0.686429 |
| 6 | lung adenocarcinoma | ENSG00000141736 | ERBB2 | | 0.663012 |
| 7 | lung adenocarcinoma | ENSG00000182872 | RBM10 | | 0.651032 |
| 8 | lung adenocarcinoma | ENSG00000121879 | PIK3CA | | 0.639440 |
| 9 | lung adenocarcinoma | ENSG00000047936 | ROS1 | | 0.637853 |
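
The response reports a `count` of 7962 associated targets, far more than the rows returned by a single request. Retrieving the rest would require paging. The sketch below only builds the request payloads and sends nothing. It assumes that `associatedTargets` accepts a `page: {index, size}` argument and that `efoId` is typed `String!`, both of which should be checked against the schema file. `PAGED_QUERY`, `page_payloads`, and the page size of 500 are illustrative choices, not part of the original notebook.

```python
# Assumption: associatedTargets accepts page: {index, size}; verify against the schema.
PAGED_QUERY = """
query associatedTargets($efoId: String!, $index: Int!, $size: Int!) {
  disease(efoId: $efoId) {
    associatedTargets(page: {index: $index, size: $size}) {
      count
      rows {
        target { id approvedSymbol }
        score
      }
    }
  }
}
"""

def page_payloads(efo_id: str, total: int, size: int = 500):
    """Yield one request payload per page needed to cover `total` rows."""
    pages = -(-total // size)  # ceiling division
    for index in range(pages):
        yield {
            "query": PAGED_QUERY,
            "variables": {"efoId": efo_id, "index": index, "size": size},
        }

payloads = list(page_payloads("EFO_0000571", total=7962, size=500))
print(len(payloads))  # 16
```

Each payload can be sent with the same `requests.post(url, json=payload)` call used earlier, and the `rows` lists concatenated before building the DataFrame.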
