# Introduction to the GDSC Dataset

The **Genomics of Drug Sensitivity in Cancer (GDSC)** project is a comprehensive resource aimed at identifying the molecular characteristics of cancer cells that predict sensitivity to anti-cancer drugs. It integrates drug response data from thousands of cell lines with genomic, transcriptomic, and proteomic features to study the impact of genetic variants on drug sensitivity.
![GDSC Project Overview](../Figures/GDSC_project.png)
*Figure 1: Overview of the GDSC project from the paper "A Landscape of Pharmacogenomic Interactions in Cancer" by Iorio et al., published in Cell, Volume 166, Issue 3, pages 740-754 (2016).*

---

## GDSC1

**GDSC1** represents the initial phase of the GDSC project, with screenings conducted **before 2015**. This dataset includes:

- **970 cancer cell lines**.
- **403 drugs** focused primarily on well-established therapeutic agents.

## GDSC2

**GDSC2** is the expanded phase of the project, with data generated **after 2015**. It includes:

- **969 cancer cell lines**, with some overlap but also new cell lines not present in GDSC1.
- **297 drugs**, including newer experimental compounds targeting a broader range of cancer pathways.

---

In this notebook, we focus on **GDSC2**, as it provides the more recent drug response data and includes newer experimental compounds, making it suitable for analyzing modern cancer therapies.




In [1]:
import pandas as pd
import os
import pubchempy as pcp
import sys
# Make the parent directory importable so the local `utils` module can be found
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from utils import *  # provides GDSCProcessor

# Initialize the processor with the dose-response, drug metadata, and expression URLs
processor = GDSCProcessor(
    gdsc_link='https://cog.sanger.ac.uk/cancerrxgene/GDSC_release8.5/GDSC2_fitted_dose_response_27Oct23.xlsx',
    drug_meta_link='https://www.cancerrxgene.org/api/compounds?list=all&export=csv',
    exp_data_link='https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/preprocessed/Cell_line_RMA_proc_basalExp.txt.zip',
    verbose=False,
    data_path='../Data/'
)

# Run the full process
processor.run()

## 1. Drug Response Data

The GDSC dataset provides **IC50** (half maximal inhibitory concentration) values for hundreds of drugs across various cancer cell lines. The **IC50** value represents the concentration of a drug required to inhibit a biological process (such as cell growth) by 50%. Lower IC50 values indicate higher sensitivity of the cell line to the drug.

### Key Aspects of Drug Response Data:
- **IC50 values**: A measure of how effective a drug is at inhibiting cancer cell proliferation.
- **Drug ID**: Each drug is assigned a unique ID in the dataset.
- **COSMIC_ID**: Each cell line is identified by a unique COSMIC ID, which allows for linking the drug response data to other genomic features of the cell line.

The GDSC2 data used here contain IC50 values for hundreds of drugs tested across nearly 1,000 cancer cell lines. This information is critical for understanding which cell lines are more sensitive or resistant to particular drugs.
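
To make the IC50 definition concrete, the following minimal sketch uses a simple Hill curve. This is an illustrative assumption and not necessarily the curve model that GDSC fits. The function `viability`, the example concentrations, and the cell line names are hypothetical. The sketch also assumes that IC50 values are stored on a natural-log scale (as the `LN_IC50` column name in the GDSC files suggests), so they are exponentiated before being read as concentrations.

```python
import math


def viability(conc_um, ic50_um, hill_slope=1.0):
    """Fraction of surviving cells under a simple Hill model (illustrative only)."""
    return 1.0 / (1.0 + (conc_um / ic50_um) ** hill_slope)


# By definition, the response at a concentration equal to the IC50 is 50%.
ic50_um = 0.25  # hypothetical IC50 in micromolar
half_response = viability(ic50_um, ic50_um)

# Hypothetical log-scale IC50 values for one drug in two cell lines.
ln_ic50_by_cell_line = {"line_A": -2.0, "line_B": 1.5}

# Convert back to concentrations; a lower IC50 means a more sensitive cell line.
ic50_by_cell_line = {
    name: math.exp(value) for name, value in ln_ic50_by_cell_line.items()
}
most_sensitive = min(ic50_by_cell_line, key=ic50_by_cell_line.get)
print(f"Most sensitive cell line: {most_sensitive}")
```

Because the exponential is monotonic, ranking cell lines by the log-scale value or by the converted concentration gives the same order.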

In [53]:
# Count the unique drugs
num_drugs = len(processor.df['DRUG_ID'].unique())
print(f'There are {num_drugs} drugs in GDSC2.')  # 295 unique drugs
# Count the unique cell lines
num_CCLs = len(processor.df['COSMIC_ID'].unique())
print(f'There are {num_CCLs} cancer cell lines in GDSC2.')  # 969 unique cell lines

There are 295 drugs in GDSC2.
There are 969 cancer cell lines in GDSC2.


## 2. Cancer Cell Lines 

Cancer cell lines (CCLs) are used as model systems for studying drug response and the genetic features of cancer. The GDSC dataset includes CCLs derived from a variety of cancer types, including breast, lung, and colon cancers. Each CCL has its own set of molecular features, such as gene expression data, mutation status, and copy number variations. In this notebook, we use the basal gene expression data as the omic feature. The file loaded above (`Cell_line_RMA_proc_basalExp`) contains RMA-normalized expression values, a normalization typically applied to microarray data rather than RNA-seq.

### Key Aspects of CCL Data:
- **COSMIC_ID**: A unique identifier for each CCL, provided by the Catalogue of Somatic Mutations in Cancer (COSMIC) database.
- **Gene Expression Profiles**: RMA-normalized basal expression levels of thousands of genes across different CCLs.
<!-- - **Mutation Data**: Information on specific mutations found in each CCL, which may contribute to drug sensitivity or resistance. -->

By linking drug response data to CCL molecular features, we can investigate the relationships between genetic alterations and drug efficacy.
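
The linking step can be sketched as a join on `COSMIC_ID`. This is a minimal example on toy data, not the processing done inside `GDSCProcessor`: the drug IDs and `LN_IC50` values are hypothetical, and the expression table is assumed to be indexed by `COSMIC_ID` with one column per gene.

```python
import pandas as pd

# Toy drug response table: one row per (cell line, drug) pair (hypothetical values).
response = pd.DataFrame({
    "COSMIC_ID": [683667, 683667, 684052],
    "DRUG_ID": [1003, 1004, 1003],
    "LN_IC50": [-1.2, 0.8, 2.1],
})

# Toy expression table: one row per cell line, one column per gene.
expression = pd.DataFrame(
    {"TSPAN6": [7.78, 7.30], "TNMD": [2.75, 2.89]},
    index=pd.Index([683667, 684052], name="COSMIC_ID"),
)

# Attach each cell line's expression profile to every response row for that line.
# validate="many_to_one" raises if a COSMIC_ID appears twice in the expression table.
linked = response.merge(
    expression.reset_index(),
    on="COSMIC_ID",
    how="inner",
    validate="many_to_one",
)
print(linked)
```

An inner join keeps only cell lines present in both tables, which is usually what a downstream model needs. Switching to `how="left"` would instead keep response rows that lack expression data.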


In [14]:
processor.exp_df.head()

Unnamed: 0,TSPAN6,TNMD,DPM1,SCYL3,C1orf112,FGR,CFH,FUCA2,GCLC,NFYA,...,LINC00526,PPY2,Unnamed: 17730,Unnamed: 17731,KRT18P55,Unnamed: 17733,POLRMTP1,UBL5P2,TBC1D3P5,Unnamed: 17737
683667,7.780713,2.753253,9.960137,4.351073,3.71674,3.222277,8.221606,3.823474,4.756228,5.805642,...,3.34752,3.230713,3.032447,9.040972,3.102091,2.870875,3.169188,9.81043,3.266915,8.45208
684052,7.301344,2.890533,9.922489,4.125088,3.678987,3.096576,3.588391,4.809305,4.951782,5.089165,...,5.05426,3.003521,2.874737,8.532759,3.068187,2.874065,3.135479,9.073222,3.098364,6.824238
684057,8.233101,2.824687,10.015884,4.749715,3.839433,3.142754,5.32983,3.272124,5.538055,6.428482,...,6.261573,3.031862,3.370459,8.930821,3.322455,3.083922,2.81344,8.893197,3.266184,8.758289
684059,8.333466,3.966757,9.793991,3.976923,3.505669,3.079943,3.37364,4.199048,5.794734,5.902391,...,3.885425,2.993918,2.843472,8.246666,3.219777,3.683564,3.033869,8.691401,3.27923,8.236239
684062,8.391341,2.96836,10.26068,4.295875,4.129471,3.31876,7.103957,3.447994,5.988208,6.257495,...,5.584552,2.959515,2.952987,8.625519,3.056066,3.059551,3.127004,9.396462,3.217885,7.248236


## Summary of Cancer Types in GDSC2

After processing the dataset, we identified **31 distinct cancer types**, excluding the `UNCLASSIFIED` category. The number of cell lines varies across cancer types, which matters when comparing drug sensitivity patterns between them.

The `UNCLASSIFIED` label in the `TCGA_DESC` column marks cell lines that do not fall into the predefined cancer types. For this analysis, we exclude this category, along with entries whose cancer type is missing, and focus on the main cancer types.

In [38]:
cancer_df = processor.df.groupby('TCGA_DESC')['COSMIC_ID'].nunique().reset_index()
cancer_df.set_index('TCGA_DESC', inplace=True)
cancer_df.columns = ['Number of Unique Cell Lines']
cancer_df.sort_values('Number of Unique Cell Lines', ascending=False, inplace=True)
## Drop UNCLASSIFIED and nan in 'TCGA_DESC'
cancer_df = cancer_df[cancer_df.index != 'UNCLASSIFIED'].dropna()
print(f'There are {len(cancer_df)} cancer types and one \'UNCLASSIFIED\' cancer in GDSC2.')

There are 31 cancer types and one 'UNCLASSIFIED' cancer in GDSC2.


### Top 10 Cancer Types by Number of Unique Cell Lines

The table below shows the **top 10 cancer types** with the highest number of unique cell lines in the GDSC2 dataset:

| Cancer Type (TCGA_DESC)              | Number of Unique Cell Lines |
|--------------------------------------|-----------------------------|
| LUAD (Lung Adenocarcinoma)           | 62                          |
| SCLC (Small Cell Lung Cancer)        | 59                          |
| SKCM (Skin Cutaneous Melanoma)       | 54                          |
| BRCA (Breast Cancer)                 | 51                          |
| COREAD (Colorectal Cancer)           | 46                          |
| HNSC (Head and Neck Cancer)          | 39                          |
| ESCA (Esophageal Cancer)             | 35                          |
| DLBC (Diffuse Large B-Cell Lymphoma) | 34                          |
| GBM (Glioblastoma Multiforme)        | 34                          |
| OV (Ovarian Cancer)                  | 34                          |

These are the cancer types with the most cell lines in the dataset, and they provide a broad range of data for drug sensitivity analysis.


In [39]:
print(cancer_df.head(5))

           Number of Unique Cell Lines
TCGA_DESC                             
LUAD                                62
SCLC                                59
SKCM                                54
BRCA                                51
COREAD                              46


## 3. Drugs

In addition to drug response data, the GDSC dataset includes information about the drugs themselves. Each drug is characterized by various chemical properties that can be used to predict its effectiveness in inhibiting cancer cell growth.

### Key Aspects of Drug Features:
- **PubChem ID**: A unique identifier for each drug in the PubChem database.
- **SMILES**: A Simplified Molecular Input Line Entry System string, a textual representation of the molecular structure of the drug, which can be used to calculate molecular descriptors for quantitative structure-activity relationship (QSAR) modeling.
- **Pathway Information**: Many drugs target specific biological pathways (e.g., PI3K/AKT/mTOR signaling) that are dysregulated in cancer cells. Understanding these pathways helps identify which drugs might be effective against particular cell lines.
<!-- - **Molecular Descriptors**: Features such as molecular weight, LogP (partition coefficient), and hydrogen bond donors/acceptors provide additional information about the chemical properties of each drug. These descriptors can be used to predict drug efficacy. -->

By integrating drug features with cell line molecular data, we can build predictive models for drug response, identify potential biomarkers of sensitivity, and uncover novel therapeutic strategies for cancer treatment.

# Drug Target Pathways in the GDSC2 Dataset

The **Genomics of Drug Sensitivity in Cancer (GDSC)** dataset not only provides drug response data but also links each drug to the biological pathway it targets. This helps in understanding the mechanisms through which drugs act on different cancer cell lines and allows for the identification of common pathways targeted by multiple drugs.

## Summary of Drug Target Pathways in GDSC2

In the GDSC2 dataset, drugs are categorized by the pathway they target. A drug may act on multiple pathways, but the dataset assigns each drug a single primary pathway. These pathways represent key biological processes often dysregulated in cancer, such as **PI3K/MTOR signaling**, **DNA replication**, and **ERK MAPK signaling**.
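
The one-pathway-per-drug property is worth checking before counting drugs per pathway, since a drug listed under two pathways would be counted twice. The following is a minimal sketch on toy data, and the drug IDs are hypothetical. On the real data, the same check could be run on the `DRUG_ID` and `PATHWAY_NAME` columns of `processor.df`.

```python
import pandas as pd

# Toy annotation table with repeated rows per drug, as in a long-format response file.
annotations = pd.DataFrame({
    "DRUG_ID": [1, 1, 2, 3],
    "PATHWAY_NAME": [
        "PI3K/MTOR signaling",
        "PI3K/MTOR signaling",
        "DNA replication",
        "ERK MAPK signaling",
    ],
})

# Number of distinct pathway labels attached to each drug.
pathways_per_drug = annotations.groupby("DRUG_ID")["PATHWAY_NAME"].nunique()

# Drugs annotated with more than one pathway would break the assumption.
ambiguous = pathways_per_drug[pathways_per_drug > 1]
is_unique = ambiguous.empty
print(f"Each drug maps to a single pathway: {is_unique}")
```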

### Key Drug Target Pathways

After processing the dataset, we identified **22 distinct pathways** associated with drug targets, excluding the `Unclassified` and `Other` categories. These pathways play critical roles in cancer development and progression, and targeting these pathways can lead to potential therapeutic interventions.

In [50]:
# Count the unique drugs annotated to each target pathway
drug_df = processor.df.groupby('PATHWAY_NAME')['DRUG_ID'].nunique().reset_index()
drug_df.set_index('PATHWAY_NAME', inplace=True)
drug_df.columns = ['Number of Unique Drugs']
drug_df.sort_values('Number of Unique Drugs', ascending=False, inplace=True)
# Drop the 'Unclassified' and 'Other' categories
drug_df = drug_df[~drug_df.index.isin(['Unclassified', 'Other'])].dropna()
print(f"There are {len(drug_df)} pathways in GDSC2, one 'Unclassified', and one 'Other' drug target pathway.")

There are 22 pathways in GDSC2, one 'Unclassified', and one 'Other' drug target pathway.



### Top 10 Drug Target Pathways by Number of Unique Drugs

The table below shows the **top 10 drug target pathways** with the highest number of unique drugs in the GDSC2 dataset:

| Pathway Name                  | Number of Unique Drugs |
|-------------------------------|------------------------|
| PI3K/MTOR signaling            | 27                     |
| Other, kinases                 | 22                     |
| DNA replication                | 21                     |
| ERK MAPK signaling             | 15                     |
| Chromatin histone methylation  | 13                     |
| Apoptosis regulation           | 13                     |
| Cell cycle                     | 13                     |
| Genome integrity               | 13                     |
| RTK signaling                  | 12                     |
| Chromatin other                | 10                     |

These pathways are the ones most frequently targeted by drugs in the dataset. **PI3K/MTOR signaling** is the most targeted pathway, with 27 unique drugs, followed by **Other, kinases** (22), **DNA replication** (21), and **ERK MAPK signaling** (15).

In [52]:
drug_df.head(5)

| PATHWAY_NAME                  | Number of Unique Drugs |
|-------------------------------|------------------------|
| PI3K/MTOR signaling           | 27                     |
| Other, kinases                | 22                     |
| DNA replication               | 21                     |
| ERK MAPK signaling            | 15                     |
| Chromatin histone methylation | 13                     |
