# IntOGen Data Set Exploration  


In this notebook we study the data sets provided by the [intOGen](https://www.intogen.org) 
portal, which catalogs cancer driver gene mutations. 

The *intOGen* framework combines several cancer driver gene identification methods, which are applied to samples from different sources. 

This notebook is part of a project in which we use machine learning techniques to identify carcinogenic
gene mutations. 

In [1]:
#Libraries and modules 

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import os
import zipfile
import urllib.request  # urlretrieve lives in the urllib.request submodule

from zipfile import ZipFile

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer

from pandas.plotting import scatter_matrix

The information we will download from the intOGen portal is divided into two groups. *Cohorts* contains the specifics of the cohorts analyzed in the current release of intOGen. *Drivers* includes data from the unfiltered list of driver genes and data from the catalog of driver mutations. 

In [2]:
# URLs of the data sets
DOWNLOAD_ROOT = "https://www.intogen.org/download"
COHORTS_URL = DOWNLOAD_ROOT + "?file=IntOGen-Cohorts-20200201.zip" # Current Cohorts download URL
DRIVERS_URL = DOWNLOAD_ROOT + "?file=IntOGen-Drivers-20200201.zip" # Current Drivers download URL

In [3]:
# Relative path of the directory where the data sets will be stored
INTOGEN_DATA_PATH = "intogen_datasets" 

The following function downloads and unzips the files provided by the intOGen portal. 
It takes as arguments the URLs of the two archives and the path to the folder where the files will be stored. It returns a two-element list: the entry names of the Cohorts archive, followed by the entry names of the Drivers archive. 

In [4]:
# Download both intOGen archives, extract them, and list their contents
import urllib.request  # makes sure the request submodule is loaded

def fetch_intOGen_data(drivers_url=DRIVERS_URL, 
                       cohorts_url=COHORTS_URL, 
                       data_path=INTOGEN_DATA_PATH):
    os.makedirs(data_path, exist_ok=True) # Creates the directory where the data sets will be stored.
    drivers_zip_path = os.path.join(data_path, "Drivers.zip")
    cohorts_zip_path = os.path.join(data_path, "Cohorts.zip")
    urllib.request.urlretrieve(drivers_url, drivers_zip_path)
    urllib.request.urlretrieve(cohorts_url, cohorts_zip_path)
    # The context managers close each archive, even if extraction fails.
    with zipfile.ZipFile(drivers_zip_path) as drivers_zipfile:
        drivers_zipfile.extractall(path=data_path)
        drivers_file_list = drivers_zipfile.namelist()
    with zipfile.ZipFile(cohorts_zip_path) as cohorts_zipfile:
        cohorts_zipfile.extractall(path=data_path)
        cohorts_file_list = cohorts_zipfile.namelist()
    return [cohorts_file_list, drivers_file_list]
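
The extract-and-list step can be tried offline, without contacting the portal. The following is a minimal sketch under stated assumptions: `extract_and_list`, `Demo.zip`, and `demo_folder` are hypothetical names introduced only for illustration, and the archive is a small stand-in built on the fly, not an intOGen file.

```python
import os
import tempfile
import zipfile

def extract_and_list(zip_path, data_path):
    """Extract zip_path into data_path and return the archive's entry names."""
    os.makedirs(data_path, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:  # closed automatically on exit
        archive.extractall(path=data_path)
        return archive.namelist()

with tempfile.TemporaryDirectory() as tmp:
    # Build a small stand-in archive with one folder and one file.
    demo_zip = os.path.join(tmp, "Demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo_folder/", "")
        archive.writestr("demo_folder/cohorts.tsv", "COHORT\tSAMPLES\nX\t1\n")

    out_dir = os.path.join(tmp, "out")
    names = extract_and_list(demo_zip, out_dir)
    extracted_ok = os.path.isfile(os.path.join(out_dir, "demo_folder", "cohorts.tsv"))

print(names)
print(extracted_ok)
```

Note that `namelist()` returns the folder entry as well as the files inside it, which is why the listings printed later in this notebook start with a directory name.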

In [5]:
intOGen_data_names = fetch_intOGen_data()

In [6]:
print("Cohorts folder and files:")
for name in intOGen_data_names[0]:
    print(name)

Cohorts folder and files:
2020-02-02_IntOGen-Cohorts-20200213/
2020-02-02_IntOGen-Cohorts-20200213/cohorts.tsv
2020-02-02_IntOGen-Cohorts-20200213/LICENSE.txt
2020-02-02_IntOGen-Cohorts-20200213/README.txt


In [7]:
print("Drivers folder and files:")
for name in intOGen_data_names[1]:
    print(name)

Drivers folder and files:
2020-02-02_IntOGen-Drivers-20200213/
2020-02-02_IntOGen-Drivers-20200213/Compendium_Cancer_Genes.tsv
2020-02-02_IntOGen-Drivers-20200213/Unfiltered_driver_results_05.tsv
2020-02-02_IntOGen-Drivers-20200213/LICENSE.txt
2020-02-02_IntOGen-Drivers-20200213/README.txt


In [8]:
cohorts_subdirectory_path = os.path.join(INTOGEN_DATA_PATH,intOGen_data_names[0][0]) 
drivers_subdirectory_path = os.path.join(INTOGEN_DATA_PATH,intOGen_data_names[1][0]) 

In [9]:
Cohorts = os.path.join(cohorts_subdirectory_path,"cohorts.tsv")
Compendium = os.path.join(drivers_subdirectory_path,"Compendium_Cancer_Genes.tsv")
Unfiltered = os.path.join(drivers_subdirectory_path,"Unfiltered_driver_results_05.tsv")
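
The subdirectory paths above are taken from the first entry of each archive listing, which assumes that the archive lists its folder first. A minimal sketch of a less order-dependent alternative follows; `top_level_dir` is a hypothetical helper, not part of the notebook or of any library, and the sample names are copied from the Cohorts listing printed earlier.

```python
def top_level_dir(names):
    """Return the single top-level folder shared by all archive entries.

    Zip entry names always use "/" as separator, regardless of platform.
    """
    roots = {name.split("/", 1)[0] for name in names if name}
    if len(roots) != 1:
        raise ValueError(f"expected one top-level folder, found {sorted(roots)}")
    return roots.pop()

# Entry names as printed for the Cohorts archive, deliberately reordered.
sample_names = [
    "2020-02-02_IntOGen-Cohorts-20200213/cohorts.tsv",
    "2020-02-02_IntOGen-Cohorts-20200213/",
    "2020-02-02_IntOGen-Cohorts-20200213/LICENSE.txt",
]
cohorts_root = top_level_dir(sample_names)
print(cohorts_root)
```

The helper raises an error when the entries do not share one folder, so a changed archive layout would surface immediately instead of producing a wrong path.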

## Cohorts Data

In [10]:
# Cohorts Dataframe
cohorts_df = pd.read_csv(Cohorts, sep = '\t')

In [11]:
cohorts_df.head()

Unnamed: 0,COHORT,CANCER_TYPE,CANCER_TYPE_NAME,SOURCE,PLATFORM,PROJECT,REFERENCE,TYPE,TREATED,AGE,SAMPLES,MUTATIONS,WEB_SHORT_COHORT_NAME,WEB_LONG_COHORT_NAME
0,PEDCBIOP_WXS_ACC-PRY,ACC,Adrenocortical carcinoma,PEDCBIOP,WXS,acc_pry_pediatric_dkfz_2017,PMID: 29489754,Primary,Untreated,Pediatric,8,239,ACC_PRY_DKFZ_2017,"Adrenocortical carcinoma - DKFZ, Nature 2017"
1,D_ACC,ACC,Adrenocortical carcinoma,STJUDE,WGS,,DOI: 10.1038/ncomms7302,Primary,Untreated,Pediatric,20,17560,ACC_D_STJUDE,Adrenocortical carcinoma data from St. Jude Ch...
2,TCGA_WXS_ACC,ACC,Adrenocortical carcinoma,TCGA,WXS,TCGA_WXS_ACC,PMID:24071849,Primary,Untreated,Adult,91,11572,ACC_TCGA,Adrenocortical carcinoma data from TCGA/PanCan...
3,CBIOP_WXS_ACY_2019,ACY,Adenoid cystic carcinoma,CBIOP,WXS,acy_2019,DOI:https://doi.org/10.1172/JCI128227,Primary,Untreated,Adult,35,2457,ACY_2019_PROJECT,Adenoid cystic carcinoma project - Multi-Insti...
4,PEDCBIOP_WXS_BALL-PRY,ALL,Acute lymphoblastic leukemia,PEDCBIOP,WXS,ball_pry_pediatric_dkfz_2017,PMID: 29489754,Primary,Untreated,Pediatric,44,276,BALL_PRY_DKFZ_2017,"Acute lymphoblastic leukemia primary - DKFZ, N..."


In [12]:
cohorts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221 entries, 0 to 220
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   COHORT                 221 non-null    object
 1   CANCER_TYPE            221 non-null    object
 2   CANCER_TYPE_NAME       221 non-null    object
 3   SOURCE                 221 non-null    object
 4   PLATFORM               221 non-null    object
 5   PROJECT                175 non-null    object
 6   REFERENCE              221 non-null    object
 7   TYPE                   221 non-null    object
 8   TREATED                221 non-null    object
 9   AGE                    221 non-null    object
 10  SAMPLES                221 non-null    int64 
 11  MUTATIONS              221 non-null    int64 
 12  WEB_SHORT_COHORT_NAME  221 non-null    object
 13  WEB_LONG_COHORT_NAME   221 non-null    object
dtypes: int64(2), object(12)
memory usage: 24.3+ KB
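
The summary above reports 175 non-null values for `PROJECT` out of 221 rows, so that column has gaps. A minimal sketch of how missing values could be tallied per column is shown below. It runs on a tiny synthetic frame (`demo_df`, with made-up values) and not on the real data; the same two lines could be applied to `cohorts_df`.

```python
import pandas as pd

# Tiny synthetic frame standing in for cohorts_df; the values are made up.
demo_df = pd.DataFrame({
    "COHORT": ["A", "B", "C", "D"],
    "PROJECT": ["p1", None, "p3", None],
    "SAMPLES": [8, 20, 91, 35],
})

# Count missing entries per column and keep only columns that have any.
missing_counts = demo_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```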


## Drivers Data, Compendium of Cancer Genes

In [13]:
# Compendium_Cancer_Genes Dataframe
compendium_df = pd.read_csv(Compendium, sep = '\t')

In [14]:
compendium_df.head()

Unnamed: 0,SYMBOL,TRANSCRIPT,COHORT,CANCER_TYPE,METHODS,MUTATIONS,SAMPLES,%_SAMPLES_COHORT,QVALUE_COMBINATION,ROLE,CGC_GENE,CGC_CANCER_GENE,DOMAIN,2D_CLUSTERS,3D_CLUSTERS,EXCESS_MIS,EXCESS_NON,EXCESS_SPL
0,ABCB1,ENST00000622132,ICGC_WGS_ESAD_UK,ESCA,"dndscv,cbase",16.0,14,0.093333,0.000201,Act,False,False,,915:915,,0.973575,0.0,0.0
1,ABI1,ENST00000376142,ICGC_WGS_ESAD_UK,ESCA,smregions,3.0,2,0.013333,0.000107,ambiguous,True,False,PF00018:452:496,,,0.90674,0.977878,0.0
2,ABL1,ENST00000372348,HARTWIG_LIVER,HC,combination,3.0,3,0.058824,0.001472,Act,True,False,,,,0.920462,0.0,0.0
3,ABL1,ENST00000372348,TCGA_WXS_UCEC,UCEC,combination,13.0,10,0.019881,0.006205,Act,True,False,,,,0.565525,0.772649,0.0
4,ABL2,ENST00000502732,ICGC_WGS_BRCA_FR,BRCA,combination,2.0,2,0.027778,0.008338,Act,True,False,,,,0.863833,0.986215,0.0


In [15]:
compendium_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SYMBOL              3333 non-null   object 
 1   TRANSCRIPT          3333 non-null   object 
 2   COHORT              3333 non-null   object 
 3   CANCER_TYPE         3333 non-null   object 
 4   METHODS             3333 non-null   object 
 5   MUTATIONS           3333 non-null   float64
 6   SAMPLES             3333 non-null   int64  
 7   %_SAMPLES_COHORT    3333 non-null   float64
 8   QVALUE_COMBINATION  3333 non-null   float64
 9   ROLE                3329 non-null   object 
 10  CGC_GENE            3333 non-null   bool   
 11  CGC_CANCER_GENE     3333 non-null   bool   
 12  DOMAIN              644 non-null    object 
 13  2D_CLUSTERS         961 non-null    object 
 14  3D_CLUSTERS         518 non-null    object 
 15  EXCESS_MIS          3333 non-null   float64
 16  EXCESS

## Drivers Data, Unfiltered Driver Results

In [16]:
unfiltered_df = pd.read_csv(Unfiltered, sep = '\t')

In [17]:
unfiltered_df.head()

Unnamed: 0,SYMBOL,COHORT,CANCER_TYPE,METHODS,QVALUE_COMBINATION,TIER,MUTATIONS_COHORT,SAMPLES_COHORT,ROLE,CGC_GENE,...,CGC_CANCER_GENE,NUM_COHORTS,SIGNATURE9,WARNING_EXPRESSION,WARNING_GERMLINE,SAMPLES_3MUTS,OR_WARNING,KNOWN_ARTIFACT,NUM_PAPERS,FILTER
0,A1CF,HARTWIG_ANUS,AN,cbase,0.02111618,3,435957,17,Act,True,...,False,3,0.0,True,False,0.0,False,False,0.0,Warning expression
1,A1CF,CBIOP_WXS_SKCM_BROAD,CM,,0.04965412,2,157355,88,Act,True,...,True,3,0.0,True,False,0.0,False,False,0.0,Warning expression
2,A1CF,ICGC_WGS_PRAD_UK,PRAD,,0.028867,2,584433,182,LoF,True,...,False,3,0.0,True,False,0.0,False,False,0.0,Warning expression
3,ABCA13,HARTWIG_SKIN_SKIN_SQUAMOUS_CELL_CARCINOMA,SSCC,"oncodriveclustl,cbase",3.979008e-05,1,2535437,11,Act,False,...,False,1,0.0,True,False,3.0,False,False,0.0,Warning expression
4,ABCA6,TCGA_WXS_LAML,AML,"oncodriveclustl,dndscv,cbase,mutpanning",7.035128e-09,1,8182,140,LoF,False,...,False,1,0.0,False,False,0.0,False,False,0.0,Lack of literature evidence


In [18]:
unfiltered_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4164 entries, 0 to 4163
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   SYMBOL              4164 non-null   object 
 1   COHORT              4164 non-null   object 
 2   CANCER_TYPE         4164 non-null   object 
 3   METHODS             2665 non-null   object 
 4   QVALUE_COMBINATION  4164 non-null   float64
 5   TIER                4164 non-null   int64  
 6   MUTATIONS_COHORT    4164 non-null   int64  
 7   SAMPLES_COHORT      4164 non-null   int64  
 8   ROLE                4164 non-null   object 
 9   CGC_GENE            4164 non-null   bool   
 10  TIER_CGC            3454 non-null   float64
 11  CGC_CANCER_GENE     4164 non-null   bool   
 12  NUM_COHORTS         4164 non-null   int64  
 13  SIGNATURE9          4164 non-null   float64
 16  SAMPLES_3MUTS       4164 non-null   float64
 18  KNOWN_ARTIFACT      4164 non-null   bool   
 19  NUM_PA