In [1]:
# Example: Fetching connectivity SMILES from PubChemPy (no deprecation warning)
import pubchempy as pcp

compounds = pcp.get_compounds('aspirin', 'name')
for c in compounds:
    print(f"CID: {c.cid}, Connectivity SMILES: {c.connectivity_smiles}")

CID: 2244, Connectivity SMILES: CC(=O)OC1=CC=CC=C1C(=O)O
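
If the same code also has to run against an older PubChemPy release, one option is to fall back to the legacy attribute name. The sketch below assumes that older releases expose the same string as `canonical_smiles`; `get_connectivity_smiles` is a hypothetical helper, not part of PubChemPy, and the `SimpleNamespace` objects only stand in for real compounds.

```python
from types import SimpleNamespace


def get_connectivity_smiles(compound):
    """Return the connectivity SMILES, preferring the newer attribute name."""
    # Try the new name first so the legacy attribute is only read when needed.
    for name in ("connectivity_smiles", "canonical_smiles"):
        value = getattr(compound, name, None)
        if value:
            return value
    return None


# Stand-in objects for illustration; real code would pass pcp.Compound instances.
new_style = SimpleNamespace(cid=2244, connectivity_smiles="CC(=O)OC1=CC=CC=C1C(=O)O")
old_style = SimpleNamespace(cid=2244, canonical_smiles="CC(=O)OC1=CC=CC=C1C(=O)O")

print(get_connectivity_smiles(new_style))
print(get_connectivity_smiles(old_style))
```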


## Check for Existing Notebooks

Let's search the current directory for Jupyter notebook files.

In [2]:
import os

# List all .ipynb files in the current directory
notebooks = [f for f in os.listdir('.') if f.endswith('.ipynb')]
print("Notebooks in this directory:", notebooks)

Notebooks in this directory: ['chapter3-cheminformatics-essentials.ipynb', 'chapter6-graph-neural-networks.ipynb', 'sample_generated_notebook.ipynb', 'chapter11-transformers-chemistry.ipynb', 'chapter12-protein-structure.ipynb', 'chapter9-molecular-generation.ipynb', 'chapter15-retrosynthesis.ipynb', 'chapter8-admet-prediction.ipynb', 'chapter7-bioactivity-prediction.ipynb', 'create_missing_chapter_notebooks.ipynb', 'chapter19-external-tools.ipynb', 'chapter18-case-studies.ipynb', 'chapter13-molecular-docking.ipynb', 'chapter21-future-next-steps.ipynb', 'chapter1-genomics-to-molecules.ipynb', 'chapter14-ml-enhanced-docking.ipynb', 'chapter5-molecular-properties.ipynb', 'chapter4-compound-screening.ipynb', 'chapter10-conditional-generation.ipynb', 'chapter20-ethical-considerations.ipynb', 'chapter2-target-identification.ipynb', 'chapter16-multi-modal-learning.ipynb', 'chapter17-end-to-end-pipeline.ipynb']


## Generate a New Notebook File

You can programmatically create a new Jupyter notebook using the `nbformat` library.

In [3]:
import nbformat as nbf

nb = nbf.v4.new_notebook()
nb['cells'] = [nbf.v4.new_code_cell("print('Hello, new notebook!')")]
with open('sample_generated_notebook.ipynb', 'w') as f:
    nbf.write(nb, f)
print("Created sample_generated_notebook.ipynb")

Created sample_generated_notebook.ipynb
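
A `.ipynb` file is a JSON document, so a generated notebook can also be inspected with the standard library alone. The sketch below writes a minimal version 4 notebook by hand into a temporary directory (the file name `inspect_demo.ipynb` and the helper `summarize_notebook` are hypothetical) and counts its cells by type. For real work `nbformat` is likely the safer choice, since it validates the notebook schema.

```python
import json
import tempfile
from pathlib import Path

# Minimal notebook structure written by hand, for illustration only.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print('Hello, new notebook!')",
        }
    ],
}


def summarize_notebook(path):
    """Return the major format version and a count of cells per cell type."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    counts = {}
    for cell in data.get("cells", []):
        counts[cell["cell_type"]] = counts.get(cell["cell_type"], 0) + 1
    return data.get("nbformat"), counts


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "inspect_demo.ipynb"
    demo_path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    version, cell_counts = summarize_notebook(demo_path)

print(version, cell_counts)
```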


## List All Notebooks in the Directory

Let's list the `.ipynb` files in the current directory again to confirm that `sample_generated_notebook.ipynb` is present.

In [4]:
# List all .ipynb files again
notebooks = [f for f in os.listdir('.') if f.endswith('.ipynb')]
print("Notebooks in this directory:", notebooks)

Notebooks in this directory: ['chapter3-cheminformatics-essentials.ipynb', 'chapter6-graph-neural-networks.ipynb', 'sample_generated_notebook.ipynb', 'chapter11-transformers-chemistry.ipynb', 'chapter12-protein-structure.ipynb', 'chapter9-molecular-generation.ipynb', 'chapter15-retrosynthesis.ipynb', 'chapter8-admet-prediction.ipynb', 'chapter7-bioactivity-prediction.ipynb', 'create_missing_chapter_notebooks.ipynb', 'chapter19-external-tools.ipynb', 'chapter18-case-studies.ipynb', 'chapter13-molecular-docking.ipynb', 'chapter21-future-next-steps.ipynb', 'chapter1-genomics-to-molecules.ipynb', 'chapter14-ml-enhanced-docking.ipynb', 'chapter5-molecular-properties.ipynb', 'chapter4-compound-screening.ipynb', 'chapter10-conditional-generation.ipynb', 'chapter20-ethical-considerations.ipynb', 'chapter2-target-identification.ipynb', 'chapter16-multi-modal-learning.ipynb', 'chapter17-end-to-end-pipeline.ipynb']
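
`os.listdir` returns entries in arbitrary order, which is why the chapters appear shuffled in the output above. A minimal sketch of one way to order them by chapter number follows; `chapter_key` is a hypothetical helper, and it assumes the `chapterN-title.ipynb` naming pattern seen in the listing.

```python
import re


def chapter_key(filename):
    """Sort key: chapters by number first, then any other notebooks by name."""
    match = re.match(r"chapter(\d+)-", filename)
    if match:
        return (0, int(match.group(1)), filename)
    return (1, 0, filename)


# A few names taken from the listing above.
sample = [
    "chapter3-cheminformatics-essentials.ipynb",
    "chapter11-transformers-chemistry.ipynb",
    "sample_generated_notebook.ipynb",
    "chapter1-genomics-to-molecules.ipynb",
]
ordered = sorted(sample, key=chapter_key)
for name in ordered:
    print(name)
```

Comparing the extracted number as an integer keeps `chapter11` after `chapter3`, which a plain alphabetical sort would not.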


## Open a Notebook in Visual Studio Code

You can open a notebook from the terminal (this requires the `code` command to be on your `PATH`) using:

```bash
code chapter2-target-identification.ipynb
```

## Run Code Cells Programmatically

You can execute notebook cells programmatically using `nbconvert` or `papermill`. Example with papermill:

```python
import papermill as pm
pm.execute_notebook('chapter2-target-identification.ipynb', 'output.ipynb')
```
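
For comparison, the `nbconvert` route can be driven from Python by calling its command-line interface. This is a sketch under assumptions: it presumes the `jupyter` executable is on your `PATH`, the helper names `build_nbconvert_command` and `run_notebook` are hypothetical, and the 600-second timeout is an arbitrary choice.

```python
import subprocess


def build_nbconvert_command(input_path, output_name, timeout_s=600):
    """Build the argument list that executes a notebook and saves the result."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", input_path,
        "--output", output_name,
        f"--ExecutePreprocessor.timeout={timeout_s}",
    ]


def run_notebook(input_path, output_name):
    """Execute the notebook; raises CalledProcessError if nbconvert fails."""
    subprocess.run(build_nbconvert_command(input_path, output_name), check=True)


command = build_nbconvert_command("chapter2-target-identification.ipynb", "output.ipynb")
print(" ".join(command))
```

Only the command is printed here; call `run_notebook(...)` when you actually want to execute the notebook.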


## Practical Target Identification Examples

Below are three hands-on code examples for target identification using NLP and public databases. Each example includes an explanation to help you understand the workflow and its relevance to drug discovery.

1. **NER on Custom Biomedical Text:** Use a pretrained NLP model to extract gene/protein names (potential drug targets) from biomedical sentences.
2. **Automated PubMed Search and NER:** Search PubMed for disease-related abstracts, then use NER to extract targets from real literature.
3. **UniProt Query for Target Information:** Query UniProt for detailed information about a gene/protein target identified in previous steps.

In [5]:
# 1. NER on Custom Biomedical Text
from transformers import pipeline

# Use a public general NER model
nlp = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
text = "APP and BACE1 are key targets in Alzheimer's disease."
results = nlp(text)
print("Named Entities:", results)

Skipping import of cpp extensions due to incompatible torch version 2.9.1 for torchao version 0.13.0
W1201 13:37:06.753000 63367 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenc

Named Entities: [{'entity_group': 'ORG', 'score': np.float32(0.56968904), 'word': 'AP', 'start': 0, 'end': 2}, {'entity_group': 'MISC', 'score': np.float32(0.67893267), 'word': 'BACE', 'start': 8, 'end': 12}, {'entity_group': 'MISC', 'score': np.float32(0.90901405), 'word': 'Alzheimer', 'start': 33, 'end': 42}]


In [6]:
# 2. Automated PubMed Search and NER
from Bio import Entrez
from transformers import pipeline

Entrez.email = "your.name@example.com"  # Replace with your own email address; NCBI uses it to contact you about problems

# Search PubMed for abstracts related to Alzheimer's targets
handle = Entrez.esearch(db="pubmed", term="Alzheimer target", retmax=1)
record = Entrez.read(handle)
handle.close()  # Release the connection once the result is parsed
pubmed_ids = record["IdList"]

# Fetch the abstract text
if pubmed_ids:
    fetch_handle = Entrez.efetch(db="pubmed", id=pubmed_ids[0], rettype="abstract", retmode="text")
    abstract = fetch_handle.read()
    fetch_handle.close()  # Release the connection once the text is read
    print("Abstract:\n", abstract)
    # Run NER on the abstract using a public general NER model
    nlp = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
    entities = nlp(abstract)
    print("Named Entities:", entities)
else:
    print("No PubMed results found.")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Abstract:
 1. Mol Neurobiol. 2025 Dec 1;63(1):226. doi: 10.1007/s12035-025-05538-5.

Production, Mechanisms, and Therapeutic Strategies of Tryptophan Metabolites in 
CNS Diseases.

Shi C(#)(1)(2), Dong J(#)(1)(2), Hui X(#)(3), Xu Z(1), Zhao Z(4), Dong L(5).

Author information:
(1)Department of Spine Surgery, Honghui Hospital, Xi'an Jiaotong University, 
Xi'an, Shaanxi, China.
(2)Medical College of Yan'an University, Yan'an University, Yan'an, Shaanxi, 
China.
(3)Shenzhen Nanshan People's Hospital, Shenzhen, Guangdong, China.
(4)Department of Sports Medicine, Honghui Hospital, Xi'an Jiaotong University, 
Xi'an, Shaanxi, China. zhaozandong@163.com.
(5)Department of Spine Surgery, Honghui Hospital, Xi'an Jiaotong University, 
Xi'an, Shaanxi, China. dongliang-526@163.com.
(#)Contributed equally

Tryptophan (TRP) metabolites, which are produced from TRP via three 
pathways-kynurenine, 5-hydroxytryptamine, and indole-are key signaling molecules 
of the gut-brain axis and are involved in a v

Device set to use cpu


Named Entities: [{'entity_group': 'ORG', 'score': np.float32(0.951797), 'word': 'Mol Neurobiol', 'start': 3, 'end': 16}, {'entity_group': 'MISC', 'score': np.float32(0.76769227), 'word': '##eu', 'start': 108, 'end': 110}, {'entity_group': 'ORG', 'score': np.float32(0.7694095), 'word': 'Tryptop', 'start': 128, 'end': 135}, {'entity_group': 'MISC', 'score': np.float32(0.6065106), 'word': '##han', 'start': 135, 'end': 138}, {'entity_group': 'MISC', 'score': np.float32(0.96675605), 'word': 'C', 'start': 155, 'end': 156}, {'entity_group': 'MISC', 'score': np.float32(0.5766957), 'word': 'Disease', 'start': 159, 'end': 166}, {'entity_group': 'ORG', 'score': np.float32(0.5497527), 'word': 'Shi', 'start': 170, 'end': 173}, {'entity_group': 'PER', 'score': np.float32(0.9377253), 'word': 'Dong J', 'start': 186, 'end': 192}, {'entity_group': 'PER', 'score': np.float32(0.9911506), 'word': 'Hui X', 'start': 203, 'end': 208}, {'entity_group': 'PER', 'score': np.float32(0.99876195), 'word': 'Xu Z', 's

In [7]:
# 3. UniProt Query for Target Information
import requests

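def get_uniprot_by_accession(accession):
    """Fetch selected UniProtKB fields for one accession as TSV and print them."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.tsv?fields=accession,protein_name,gene_names,organism_name,length"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail loudly on HTTP errors instead of printing an error body
    print(f"Results for {accession} (UniProt):\n{response.text}")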

# Example: Query UniProt for several Alzheimer's-related targets by accession
get_uniprot_by_accession("P05067")  # APP
get_uniprot_by_accession("P56817")  # BACE1
get_uniprot_by_accession("P10636")  # MAPT
get_uniprot_by_accession("P49768")  # PSEN1
get_uniprot_by_accession("P49810")  # PSEN2

Results for P05067 (UniProt):
Entry	Protein names	Gene Names	Organism	Length
P05067	Amyloid-beta precursor protein (APP) (ABPP) (APPI) (Alzheimer disease amyloid A4 protein homolog) (Alzheimer disease amyloid protein) (Amyloid precursor protein) (Amyloid-beta (A4) precursor protein) (Amyloid-beta A4 protein) (Cerebral vascular amyloid peptide) (CVAP) (PreA4) (Protease nexin-II) (PN-II) [Cleaved into: N-APP; Soluble APP-alpha (S-APP-alpha); Soluble APP-beta (S-APP-beta); C99 (Beta-secretase C-terminal fragment) (Beta-CTF); Amyloid-beta protein 42 (Abeta42) (Beta-APP42); Amyloid-beta protein 40 (Abeta40) (Beta-APP40); C83 (Alpha-secretase C-terminal fragment) (Alpha-CTF); P3(42); P3(40); C80; Gamma-secretase C-terminal fragment 59 (Amyloid intracellular domain 59) (AICD-59) (AID(59)) (Gamma-CTF(59)); Gamma-secretase C-terminal fragment 57 (Amyloid intracellular domain 57) (AICD-57) (AID(57)) (Gamma-CTF(57)); Gamma-secretase C-terminal fragment 50 (Amyloid intracellular domain 50) (AICD-5

### 1. NER on Custom Biomedical Text
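This cell uses `dslim/bert-base-NER`, a general-purpose pretrained BERT model, to perform Named Entity Recognition (NER) on a biomedical sentence. Because the model was trained on general-domain labels (person, organization, location, miscellaneous) rather than on biomedical text, it fragments and mislabels the gene symbols: in the output above, APP is returned as 'AP' (ORG) and BACE1 as 'BACE' (MISC). A model fine-tuned on biomedical corpora, such as one based on BioBERT, is better suited to identifying the gene and protein names that are potential drug targets. NER of this kind is a common first step in literature mining for target identification.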
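
Until a biomedical model is plugged in, a plain dictionary lookup can serve as a baseline for known targets. The sketch below is a simple gazetteer match, not NER: the `GENE_SYMBOLS` set is a hypothetical, hand-picked list built from the targets named in this notebook, and `find_gene_symbols` is an illustrative helper.

```python
import re

# Hypothetical, hand-picked list; a real workflow would load a curated gene nomenclature.
GENE_SYMBOLS = {"APP", "BACE1", "MAPT", "PSEN1", "PSEN2"}


def find_gene_symbols(text, symbols=GENE_SYMBOLS):
    """Return exact, case-sensitive matches of known symbols with character offsets."""
    hits = []
    for match in re.finditer(r"\b[A-Za-z0-9]+\b", text):
        if match.group() in symbols:
            hits.append({"word": match.group(), "start": match.start(), "end": match.end()})
    return hits


text = "APP and BACE1 are key targets in Alzheimer's disease."
hits = find_gene_symbols(text)
print(hits)
```

The lookup returns whole symbols with their offsets, but it cannot discover targets that are missing from the list, which is the gap a biomedical NER model is meant to fill.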

### 2. Automated PubMed Search and NER
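This cell demonstrates how to search PubMed for abstracts related to a disease (e.g., Alzheimer's), retrieve the abstract text, and run NER on it. Two limitations are visible in the output above. First, the fetched text includes the citation line, author list, and affiliations, so many of the recognized entities are journal and author names rather than targets. Second, the general-purpose model splits biomedical terms into fragments such as 'Tryptop' and '##han'. With the citation header removed and a biomedical NER model substituted, the same workflow can help automate finding and prioritizing candidate drug targets in recent literature.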
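
Some of the noise in the entity list can be reduced with light post-processing. The sketch below joins fragments whose character offsets touch and then drops person names; `merge_adjacent` and `drop_groups` are hypothetical helpers, and the sample entries are copied from the output above with scores written as plain floats.

```python
def merge_adjacent(entities):
    """Join entities whose spans touch, stripping the '##' word-piece prefix."""
    merged = []
    for ent in entities:
        word = ent["word"][2:] if ent["word"].startswith("##") else ent["word"]
        if merged and ent["start"] == merged[-1]["end"]:
            prev = merged[-1]
            prev["word"] += word
            prev["end"] = ent["end"]
            # Keep the lower score as a conservative confidence; the label of the
            # first fragment is retained.
            prev["score"] = min(prev["score"], ent["score"])
        else:
            merged.append({**ent, "word": word})
    return merged


def drop_groups(entities, groups=("PER",)):
    """Remove entity types that cannot be drug targets, such as author names."""
    return [ent for ent in entities if ent["entity_group"] not in groups]


sample = [
    {"entity_group": "ORG", "score": 0.7694095, "word": "Tryptop", "start": 128, "end": 135},
    {"entity_group": "MISC", "score": 0.6065106, "word": "##han", "start": 135, "end": 138},
    {"entity_group": "PER", "score": 0.9377253, "word": "Dong J", "start": 186, "end": 192},
    {"entity_group": "PER", "score": 0.9911506, "word": "Hui X", "start": 203, "end": 208},
]
cleaned = drop_groups(merge_adjacent(sample))
print(cleaned)
```

This only tidies the output; the merged entity still carries the general model's label, so it does not replace a biomedical model.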

### 3. UniProt Query for Target Information
This cell shows how to query the UniProt database by accession for information about a protein (such as APP). The request used here retrieves five fields: accession, protein names, gene names, organism, and sequence length. After identifying a potential target using NER and literature mining, you can request further UniProt fields to learn about its biological function, sequence, and relevance to disease, helping you decide if it is a good candidate for drug development.
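
Printing the raw TSV is fine for a quick look, but later steps usually need structured records. A minimal sketch of parsing such a response with the standard `csv` module follows; `parse_uniprot_tsv` is a hypothetical helper, the header row matches the output above, and the data row is a made-up placeholder rather than a real UniProt entry.

```python
import csv
import io


def parse_uniprot_tsv(tsv_text):
    """Turn a tab-separated response into a list of dictionaries keyed by column name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Placeholder row for illustration only; in practice pass response.text.
sample_tsv = (
    "Entry\tProtein names\tGene Names\tOrganism\tLength\n"
    "X00000\tExample protein\tGENE1\tExample organism\t123\n"
)
rows = parse_uniprot_tsv(sample_tsv)
print(rows[0]["Entry"], rows[0]["Gene Names"], int(rows[0]["Length"]))
```

Quoting is disabled so that any quote characters inside protein names are kept literally instead of being interpreted by the parser.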