# QSAR-LLM

A basic attempt at training and querying QSAR models through large language models (LLMs).

The repository implements:
1. A wrapper around `lightgbm`'s `LGBMRegressor` and `LGBMClassifier` that makes it easy to train simple regression and classification models
2. A set of `langchain` tools that let an LLM download ChEMBL data packages for biological targets, train models, and predict the activity of molecules


Model parameters are saved as `lightgbm.Booster` parameter TXT files, and information on all trained models is stored using [`shelve`](https://docs.python.org/3/library/shelve.html).
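
As a rough illustration of this storage scheme, the sketch below records metadata for one trained model in a `shelve` database. The field names mirror the `get_models` output shown later in this README; the `register_model` and `list_models` helpers and the `models_db` file name are hypothetical and not taken from the repository.

```python
import os
import shelve
import tempfile
from datetime import datetime


def register_model(db_path, name, params_path, metric_name, mode, performance):
    """Store metadata for one trained model under its name (hypothetical helper)."""
    with shelve.open(db_path) as db:
        db[name] = {
            "params_path": params_path,
            "metric_name": metric_name,
            "mode": mode,
            "performance": performance,
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        }


def list_models(db_path):
    """Return all stored model records as a plain dict (hypothetical helper)."""
    with shelve.open(db_path) as db:
        return dict(db)


# Use a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "models_db")  # assumed file name
    register_model(
        db_path, "JAK2_regression", "JAK2_regression.txt", "r2", "regression", 0.58
    )
    models = list_models(db_path)
    print(sorted(models))
```

Keying the records by model name means that retraining a model under the same name simply overwrites its previous entry.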

The tools and the agent are set up to handle most parsing and internal errors, which helps the LLM correct its queries.
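
One way to picture this error handling is a wrapper that turns exceptions into text the LLM can read as an observation, instead of letting them crash the agent loop. The sketch below uses only the standard library; `as_observation` and `lookup_target` are hypothetical names, and the repository's actual tools may rely on LangChain's own mechanisms instead.

```python
import functools


def as_observation(func):
    """Return the error message as a string instead of raising (hypothetical decorator)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # The message goes back to the LLM, which can then fix its query.
            return f"Error in {func.__name__}: {exc}"

    return wrapper


@as_observation
def lookup_target(target_chembl_id):
    """Toy stand-in for a tool that needs a non-empty ChEMBL target ID."""
    if not target_chembl_id:
        raise ValueError("No data found for an empty target ID")
    return f"Data found for {target_chembl_id}"


ok_result = lookup_target("CHEMBL2971")
error_result = lookup_target("")
print(ok_result)
print(error_result)
```

The traces below show this kind of recovery in practice: a query with an empty or wrong target ID returns a "No data found" observation, and the agent then resolves the correct ID.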

In [1]:
from langchain.llms import openai
from langchain.agents import initialize_agent
from langchain.agents import AgentType

from tools import (
    check_local_files,
    get_models,
    protein_name_to_accession,
    accession_to_chembl_target_id,
    query_target_data,
    train_model,
    predict_molecule,
    shelf,
)

## Initialize the LLM and the LangChain agent

In [2]:
llm = openai.OpenAI(temperature=0.0)
agent = initialize_agent(
    [
        check_local_files,
        get_models,
        protein_name_to_accession,
        accession_to_chembl_target_id,
        query_target_data,
        train_model,
        predict_molecule,
    ],
    llm,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    handle_parsing_errors=True,
    handle_tool_error=True,
)

## Download ChEMBL data for JAK2 and train a regression model

In [3]:
text = """
Train a regression model to predict activity for JAK2 using a fresh ChEMBL data package for the target.
Make sure to resolve the protein name to the correct ChEMBL target ID"""
agent.run(text)



> Entering new AgentExecutor chain...

Action:
```
{
  "action": "protein_name_to_accession",
  "action_input": {
    "protein_name": "JAK2"
  }
}
```

Observation: ('JAK2_HUMAN', 'Tyrosine-protein kinase JAK2', 'O60674')
Thought: I need to use the accession to get the ChEMBL target ID
Action:
```
{
  "action": "accession_to_chembl_target_id",
  "action_input": {
    "accession": "O60674"
  }
}
```

Observation: ('Tyrosine-protein kinase JAK2', 'CHEMBL2971', 'O60674')
Thought: I need to query the ChEMBL data package
Action:
```
{
  "action": "query_target_data",
  "action_input": {
    "target_chembl_id": "CHEMBL2971",
    "output_csv": "JAK2.csv"
  }
}
```

Querying ChEMBL for CHEMBL2971: 100%|███████████████████████████████████████████| 7454/7454 [00:00<00:00, 15249.65it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = pd.to_numeric(df[col])



Observation: {'pchembl_value': {'count': 6484.0, 'mean': 7.341097866394738, 'std': 1.219005643404945, 'min': 3.84, '25%': 6.48, '50%': 7.31, '75%': 8.14, 'max': 10.97}}
Thought: I need to train a model
Action:
```
{
  "action": "train_model",
  "action_input": {
    "input_csv": "JAK2.csv",
    "mode": "regression",
    "model_name": "JAK2_regression",
    "output_path": "JAK2_regression.txt"
  }
}
```

Observation: ('r2', 0.5840131903651216)
Thought: I have the final answer
Action:
```
{
  "action": "Final Answer",
  "action_input": "I have trained a regression model for JAK2 with an R2 score of 0.58."
}
```

> Finished chain.


'I have trained a regression model for JAK2 with an R2 score of 0.58.'
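
For reference, the R2 score reported above is the coefficient of determination, which compares the model's squared errors with those of a constant predictor that always outputs the mean. The plain-Python sketch below uses made-up values; how the repository computes and validates the metric (for example, on which data split) is not shown in this README and is not assumed here.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up measured and predicted activity values, for illustration only.
measured = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]
score = r2_score(measured, predicted)
print(round(score, 3))
```

A score of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean.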

## Train a classification model

We can instruct the LLM to use different activity thresholds (here, the 50th percentile of the activity values).
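
To make the thresholding concrete, here is a minimal sketch of turning continuous activity values into binary labels at the median. The pChEMBL values are made up for illustration, `binarize_at_median` is a hypothetical helper, and treating values equal to the threshold as active is an assumption rather than the repository's documented behavior.

```python
import statistics


def binarize_at_median(values):
    """Label each value as active (True) if it is >= the median (hypothetical helper)."""
    threshold = statistics.median(values)
    labels = [value >= threshold for value in values]
    return threshold, labels


# Made-up pChEMBL values, for illustration only.
pchembl_values = [5.2, 6.4, 7.3, 8.1, 9.0]
threshold, labels = binarize_at_median(pchembl_values)
print(threshold, labels)
```

Splitting at the median yields roughly balanced classes, which tends to make classification metrics easier to interpret.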

In [4]:
text = """
Train a classification model to predict activity for JAK2 using a fresh ChEMBL data package for the target.
Use the 50% percentile value of the pChEMBL value as threshold to define activity"""
agent.run(text)



> Entering new AgentExecutor chain...

Action:
```
{
  "action": "query_target_data",
  "action_input": {
    "target_chembl_id": "",
    "output_csv": "JAK2.csv",
    "standard_types": ["Kd", "Ki", "IC50"]
  }
}
```

Querying ChEMBL for : 0it [00:00, ?it/s]


Observation: No data found for
Thought: Need to get the ChEMBL target ID
Action:
```
{
  "action": "protein_name_to_accession",
  "action_input": {
    "protein_name": "JAK2"
  }
}
```

Observation: ('JAK2_HUMAN', 'Tyrosine-protein kinase JAK2', 'O60674')
Thought: Need to get the ChEMBL target ID from the Uniprot accession
Action:
```
{
  "action": "accession_to_chembl_target_id",
  "action_input": {
    "accession": "O60674"
  }
}
```

Observation: ('Tyrosine-protein kinase JAK2', 'CHEMBL2971', 'O60674')
Thought: Need to query the ChEMBL data package
Action:
```
{
  "action": "query_target_data",
  "action_input": {
    "target_chembl_id": "CHEMBL2971",
    "output_csv": "JAK2.csv",
    "standard_types": ["Kd", "Ki", "IC50"]
  }
}
```

Querying ChEMBL for CHEMBL2971: 100%|███████████████████████████████████████████| 7454/7454 [00:00<00:00, 15985.06it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = pd.to_numeric(df[col])



Observation: {'pchembl_value': {'count': 6484.0, 'mean': 7.341097866394738, 'std': 1.219005643404945, 'min': 3.84, '25%': 6.48, '50%': 7.31, '75%': 8.14, 'max': 10.97}}
Thought: Need to train a classification model
Action:
```
{
  "action": "train_model",
  "action_input": {
    "input_csv": "JAK2.csv",
    "mode": "classification",
    "model_name": "JAK2_classification",
    "output_path": "JAK2_classification.txt",
    "activity_threshold": 7.31
  }
}
```

Observation: ('mcc', 0.5989637798269798)
Thought: I know what to respond
Action:
```
{
  "action": "Final Answer",
  "action_input": "I have trained a classification model for JAK2 with a Matthews Correlation Coefficient of 0.5989637798269798."
}
```

> Finished chain.


'I have trained a classification model for JAK2 with a Matthews Correlation Coefficient of 0.5989637798269798.'
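
For reference, the Matthews correlation coefficient (MCC) reported above can be computed from the four confusion-matrix counts. The sketch below is a plain-Python version of the standard formula with made-up labels; the repository presumably obtains the metric from a library, which is an assumption.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up activity labels, for illustration only.
actual = [True, True, False, False]
guessed = [True, False, False, False]
mcc = matthews_corrcoef(actual, guessed)
print(round(mcc, 3))
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level predictions, so it stays informative even when the classes are imbalanced.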

## Predict the activity for a single molecule

In [5]:
agent.run("Predict the activity of the molecule 'c1ccccc1' for JAK2")



> Entering new AgentExecutor chain...

Action:
```
{
  "action": "predict_molecule",
  "action_input": {
    "smiles": "c1ccccc1",
    "model_name": "JAK2_regression"
  }
}
```

Observation: 5.009191304941544
Thought: I have the prediction result, now I need to return the final answer.
Action:
```
{
  "action": "Final Answer",
  "action_input": "The predicted activity of the molecule 'c1ccccc1' for JAK2 is 5.009191304941544."
}
```

> Finished chain.


"The predicted activity of the molecule 'c1ccccc1' for JAK2 is 5.009191304941544."

## Predict a single molecule using the classification model

In [6]:
agent.run(
    "Predict whether the molecule 'CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12' is active for JAK2 using the classification model"
)



> Entering new AgentExecutor chain...

Action:
```
{
  "action": "predict_molecule",
  "action_input": {
    "smiles": "CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12",
    "model_name": "JAK2_classification"
  }
}
```

Observation: False
Thought: I need to check if the model exists
Action:
```
{
  "action": "get_models",
  "action_input": {}
}
```

Observation: {'JAK2_regression': {'params_path': 'JAK2_regression.txt', 'metric_name': 'r2', 'mode': 'regression', 'performance': 0.5840131903651216, 'timestamp': '2023-07-17 14:15:37'}, 'JAK2_classification': {'params_path': 'JAK2_classification.txt', 'metric_name': 'mcc', 'mode': 'classification', 'performance': 0.5989637798269798, 'timestamp': '2023-07-17 14:16:12'}}
Thought: The model exists, so I can respond
Action:
```
{
  "action": "Final Answer",
  "action_input": "The molecule 'CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12' is predicted to be inactive for JAK2 using the classification model."
}
```

> Finished chain.

"The molecule 'CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12' is predicted to be inactive for JAK2 using the classification model."

## Compare the predictions for two molecules

In [7]:
text = """
Which molecule is more active according to the JAK2 regression model?
Tofacitinib (SMILES CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12) or Baricitinib (SMILES CCS(=O)(=O)N1CC(CC#N)(n2cc(-c3ncnc4[nH]ccc34)cn2)C1)
"""
agent.run(text)



> Entering new AgentExecutor chain...

Action:
```
{
  "action": "predict_molecule",
  "action_input": {
    "smiles": "CC1CCN(C(=O)CC#N)CC1N(C)c1ncnc2[nH]ccc12",
    "model_name": "JAK2_regression"
  }
}
```

Observation: 7.246299979282368
Thought:
Action:
```
{
  "action": "predict_molecule",
  "action_input": {
    "smiles": "CCS(=O)(=O)N1CC(CC#N)(n2cc(-c3ncnc4[nH]ccc34)cn2)C1",
    "model_name": "JAK2_regression"
  }
}
```

Observation: 7.839277643356649
Thought: I know what to respond
Action:
```
{
  "action": "Final Answer",
  "action_input": "The molecule with SMILES CCS(=O)(=O)N1CC(CC#N)(n2cc(-c3ncnc4[nH]ccc34)cn2)C1 is more active according to the JAK2 regression model."
}
```

> Finished chain.


'The molecule with SMILES CCS(=O)(=O)N1CC(CC#N)(n2cc(-c3ncnc4[nH]ccc34)cn2)C1 is more active according to the JAK2 regression model.'

## List available models

In [8]:
agent.run("Which models are available?")



> Entering new AgentExecutor chain...

Action:
```
{
  "action": "get_models",
  "action_input": {}
}
```

Observation: {'JAK2_regression': {'params_path': 'JAK2_regression.txt', 'metric_name': 'r2', 'mode': 'regression', 'performance': 0.5840131903651216, 'timestamp': '2023-07-17 14:15:37'}, 'JAK2_classification': {'params_path': 'JAK2_classification.txt', 'metric_name': 'mcc', 'mode': 'classification', 'performance': 0.5989637798269798, 'timestamp': '2023-07-17 14:16:12'}}
Thought: I have the list of models available
Action:
```
{
  "action": "Final Answer",
  "action_input": "The available models are JAK2_regression and JAK2_classification."
}
```

> Finished chain.


'The available models are JAK2_regression and JAK2_classification.'

## Retrain a model

In [9]:
agent.run(
    "Download the ChEMBL data package for JAK2 again and retrain the JAK2 regression model with it"
)



> Entering new AgentExecutor chain...

Action:
```
{
  "action": "query_target_data",
  "action_input": {
    "target_chembl_id": "CHEMBL542",
    "output_csv": "JAK2.csv",
    "standard_types": ["Kd", "Ki", "IC50"]
  }
}
```

Querying ChEMBL for CHEMBL542: 0it [00:02, ?it/s]



Observation: No data found for CHEMBL542
Thought: I need to get the ChEMBL target ID for JAK2
Action:
```
{
  "action": "protein_name_to_accession",
  "action_input": {
    "protein_name": "JAK2"
  }
}
```

Observation: ('JAK2_HUMAN', 'Tyrosine-protein kinase JAK2', 'O60674')
Thought: I need to get the ChEMBL target ID for the Uniprot accession
Action:
```
{
  "action": "accession_to_chembl_target_id",
  "action_input": {
    "accession": "O60674"
  }
}
```

Observation: ('Tyrosine-protein kinase JAK2', 'CHEMBL2971', 'O60674')
Thought: I need to query the ChEMBL data package for the ChEMBL target ID
Action:
```
{
  "action": "query_target_data",
  "action_input": {
    "target_chembl_id": "CHEMBL2971",
    "output_csv": "JAK2.csv",
    "standard_types": ["Kd", "Ki", "IC50"]
  }
}
```

Querying ChEMBL for CHEMBL2971: 100%|███████████████████████████████████████████| 7454/7454 [00:00<00:00, 17712.15it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = pd.to_numeric(df[col])



Observation: {'pchembl_value': {'count': 6484.0, 'mean': 7.341097866394738, 'std': 1.219005643404945, 'min': 3.84, '25%': 6.48, '50%': 7.31, '75%': 8.14, 'max': 10.97}}
Thought: I need to train a model from the CSV ChEMBL data package
Action:
```
{
  "action": "train_model",
  "action_input": {
    "input_csv": "JAK2.csv",
    "mode": "regression",
    "model_name": "JAK2_regression",
    "output_path": "JAK2_regression.txt",
    "activity_threshold": 6
  }
}
```

Observation: ('r2', 0.5840131903651216)
Thought: I know what to respond
Action:
```
{
  "action": "Final Answer",
  "action_input": "I have successfully downloaded the ChEMBL data package for JAK2, retrained the JAK2 regression model with it, and achieved an R2 score of 0.58."
}
```

> Finished chain.


'I have successfully downloaded the ChEMBL data package for JAK2, retrained the JAK2 regression model with it, and achieved an R2 score of 0.58.'

In [10]:
shelf.close()