**Computational Health Laboratory Project, A.Y. 2021/2022**

**Authors:** Niko Dalla Noce, Alessandro Ristori, Andrea Zuppolini

# **CHL Project, Pathway Analysis**
Starting from one or more genes, extract from interaction databases the genes they interact with. Using the expanded gene set, perform pathway analysis and obtain all disease pathways in which the genes appear. Merge the pathways to obtain a larger graph. Perform further network analysis to extract central biomarkers and communities beyond pathways. Compute a distance between the initial gene set and the various pathways (diseases).

## **Colab setup**
Takes care of the project setup on Colab.

In [1]:
if 'google.colab' in str(get_ipython()):
    import subprocess
    from google.colab import drive
    out_clone = subprocess.run(["git", "clone", "https://github.com/nikodallanoce/ComputationalHealthLaboratory"], text=True, capture_output=True)
    print("{0}{1}".format(out_clone.stdout, out_clone.stderr))
    %pip install -U PyYAML
    %pip install gseapy
    drive.mount("/content/drive/")
    %cp "/content/drive/Shareddrives/CHL/config.yml" "/content/ComputationalHealthLaboratory"
    %cd ComputationalHealthLaboratory

## **Obtain all the genes that interact with the starting one**
Starting from a gene, obtain its neighbours and the interactions between them.


In [2]:
import requests
import json
import pandas as pd
import numpy as np
import re
from config import ACCESS_KEY, BASE_URL

In [3]:
gene_interactions = pd.read_csv("datasets/geneset.csv", sep=";")
gene_interactions["InteractorA"] = gene_interactions["InteractorA"].str.upper()
gene_interactions.drop_duplicates(inplace=True)
proteins_list = list(gene_interactions["InteractorA"])  # all the proteins that interact with our starting gene

In [4]:
gene_interactions

Unnamed: 0,InteractorA,InteractorB
0,YWHAG,SON
1,YWHAB,SON
3,SIRT7,SON
4,TCF3,SON
5,SF3B1,SON
...,...,...
149,NSP8,SON
150,NSP9,SON
151,ORF6,SON
152,ORF8,SON


## **Expand the interactions dataset**
Expand the dataset using the proteins obtained from the previous step.

In [5]:
request_url = BASE_URL + "/interactions"
data = {}

step = 5
for i in range(0, len(proteins_list), step):
    end = i+step
    if end >= len(proteins_list):
        end = len(proteins_list)
    
    # List of genes to search for
    gene_list = proteins_list[i:end] # ["SRPK2"]

    params = {
        "accesskey": ACCESS_KEY,
        "format": "json",  # Return results in JSON format
        "geneList": "|".join(gene_list),  # Must be | separated
        "searchNames": "true",  # Search against official names
        "includeInteractors": "true",  # Set to true to get any interaction involving EITHER gene, set to false to get interactions between genes
        "includeInteractorInteractions": "false",  # Set to true to get interactions between the geneList’s first order interactors
        "includeEvidence": "false",  # If false "evidenceList" is evidence to exclude, if true "evidenceList" is evidence to show
        "selfInteractionsExcluded": "true",  # If true no self-interactions will be included
    }

    r = requests.get(request_url, params=params)
    interactions = r.json()
    
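    # Stop if the response reached the maximum allowed number of interactions,
    # since the results for this batch would be incomplete
    if len(interactions) == 10000:
        raise RuntimeError("The request hit the maximum number of interactions, use a smaller step")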

    # Create a hash of results by interaction identifier
    for interaction_id, interaction in interactions.items():
        data[interaction_id] = interaction
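
The batching performed by the loop can be looked at separately from the HTTP request. The following is only a sketch: the helper `batches` is hypothetical (it is not part of the project code), the gene names are a few of those shown in the table above, and no request is sent.

```python
from typing import Iterator, List


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements (hypothetical helper)."""
    for start in range(0, len(items), size):
        # Slicing past the end of a list is safe, so no explicit bound check is needed
        yield items[start:start + size]


toy_proteins = ["YWHAG", "YWHAB", "SIRT7", "TCF3", "SF3B1", "NSP8", "NSP9"]

# One "|"-separated string per request, as required by the geneList parameter
gene_lists = ["|".join(batch) for batch in batches(toy_proteins, 5)]
print(gene_lists)
```

With seven genes and a step of 5, two requests would be issued: the first with five genes and the second with the remaining two.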

In [6]:
# Load the data into a pandas dataframe
dataset = pd.DataFrame.from_dict(data, orient="index")

# Re-order the columns and select only the columns we want to see
columns = ["OFFICIAL_SYMBOL_A", "OFFICIAL_SYMBOL_B"]
dataset = dataset[columns]

# Rename the columns and make all the values uppercase
dataset = dataset.rename(columns={"OFFICIAL_SYMBOL_A": "InteractorA", "OFFICIAL_SYMBOL_B": "InteractorB"})
dataset["InteractorA"] = dataset["InteractorA"].str.upper()
dataset["InteractorB"] = dataset["InteractorB"].str.upper()

# Print the dataframe
dataset

Unnamed: 0,InteractorA,InteractorB
8289,TCF3,HAND2
8324,TCF3,ID3
31348,VAP-33B,SIRT7
31539,SIRT7,CKIIBETA
37873,SIRT7,POLO
...,...,...
3305885,CCNF,ZBTB1
3305886,CCNF,ZGPAT
3305887,CCNF,ZNF638
3305888,CCNF,ZNF687


Drop duplicated interactions, since they are not interesting from our point of view.

In [7]:
# Look for duplicated interactions
duplicated_interactions = pd.DataFrame(np.sort(dataset[["InteractorA", "InteractorB"]].values, 1)).duplicated()
print("Duplicated interactions:\n{0}".format(duplicated_interactions.value_counts()))

# Delete such interactions from the dataset
dataset = dataset[~duplicated_interactions.values]

Duplicated interactions:
False    79296
True     25281
dtype: int64
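
The row-wise sort is what makes the check independent of the order of the two interactors: after sorting, (A, B) and (B, A) become the same row. A minimal sketch on a toy dataframe (the data below is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy interactions: the second row is the first one with the interactors swapped
toy = pd.DataFrame({
    "InteractorA": ["A", "B", "A", "C"],
    "InteractorB": ["B", "A", "C", "C"],
})

# Sort each row alphabetically so that (B, A) becomes (A, B)
sorted_pairs = pd.DataFrame(np.sort(toy[["InteractorA", "InteractorB"]].values, axis=1))

# Mark every repeated pair after its first occurrence
is_duplicate = sorted_pairs.duplicated()

# Keep only the first occurrence of each unordered pair
deduplicated = toy[~is_duplicate.values]
print(deduplicated)
```

Only the swapped copy is dropped; the self-interaction (C, C) survives this step and is handled separately.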


Drop self-loops since they're useless for our analysis.

In [8]:
# Look for interactions where both proteins are the same
same_proteins_interactions = pd.DataFrame(dataset[["InteractorA", "InteractorB"]].nunique(axis=1) == 1)
print("Useless interactions:\n{0}".format(same_proteins_interactions.value_counts()))

# Delete such interactions from the dataset
dataset = dataset[~same_proteins_interactions.values]

Useless interactions:
False    79283
True        13
dtype: int64


Merge the interactions of the starting gene with the ones obtained from the requests to the BioGRID database.

In [9]:
dataset = pd.concat([dataset, gene_interactions])


In [10]:
nodes = pd.concat([dataset["InteractorA"], dataset["InteractorB"]]).unique()
# Appending to genes would be enough if we only considered the initial nodes
print("Number of nodes: {0}".format(len(nodes)))

Number of nodes: 13010


Finally, save the interactions and the nodes to CSV files for pathway enrichment.

In [11]:
# Save interactions and nodes dataset to csv
dataset.to_csv("datasets/interactions.csv")
pd.DataFrame(nodes).to_csv("datasets/genes.csv")

## **Pathway enrichment**
Find all the diseases linked to the nodes retrieved by the previous step.

In [12]:
import gseapy as gp

List all the datasets from which we can retrieve pathways by using the gseapy package.

In [13]:
gp.get_library_name()

['ARCHS4_Cell-lines',
 'ARCHS4_IDG_Coexp',
 'ARCHS4_Kinases_Coexp',
 'ARCHS4_TFs_Coexp',
 'ARCHS4_Tissues',
 'Achilles_fitness_decrease',
 'Achilles_fitness_increase',
 'Aging_Perturbations_from_GEO_down',
 'Aging_Perturbations_from_GEO_up',
 'Allen_Brain_Atlas_10x_scRNA_2021',
 'Allen_Brain_Atlas_down',
 'Allen_Brain_Atlas_up',
 'Azimuth_Cell_Types_2021',
 'BioCarta_2013',
 'BioCarta_2015',
 'BioCarta_2016',
 'BioPlanet_2019',
 'BioPlex_2017',
 'CCLE_Proteomics_2020',
 'CORUM',
 'COVID-19_Related_Gene_Sets',
 'COVID-19_Related_Gene_Sets_2021',
 'Cancer_Cell_Line_Encyclopedia',
 'CellMarker_Augmented_2021',
 'ChEA_2013',
 'ChEA_2015',
 'ChEA_2016',
 'Chromosome_Location',
 'Chromosome_Location_hg19',
 'ClinVar_2019',
 'DSigDB',
 'Data_Acquisition_Method_Most_Popular_Genes',
 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019',
 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019',
 'Descartes_Cell_Types_and_Tissue_2021',
 'DisGeNET',
 'Disease_Perturbations_from_GEO_down',
 'Disease_Perturbati

Obtain all the pathways connected to our nodes. In our case we are going to use the DisGeNET dataset.

In [14]:
import os
if os.path.exists("datasets/diseases_pathways.csv"):
    df_diseases = pd.read_csv("datasets/diseases_pathways.csv", sep=",", index_col=0)
elif "df_diseases" not in locals():
    enr = gp.enrichr(gene_list=pd.DataFrame(nodes),
                      gene_sets=['DisGeNET'],  # Datasets from the gp.get_library_name() method
                      organism='Human',
                      description='DEGs_up_1d',
                      outdir='test'
                  )

Keep those pathways with an adjusted p-value < 0.1.

In [15]:
if 'enr' in locals():
    df_diseases = enr.results[enr.results["Adjusted P-value"] < 0.1][["Term", "Overlap", "P-value", "Adjusted P-value", "Genes"]]

df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...


Save the pathways in a csv file just like for the interactions and nodes.

In [16]:
df_diseases.to_csv("datasets/diseases_pathways.csv")

Build a dict with all the diseases; this will be helpful when we need to work on the graph.

In [17]:
diseases = dict()

for i, disease in df_diseases.iterrows():
    disease_genes = disease['Genes'].split(";")
    term = disease['Term']
    diseases[i] = {"name": term, "genes": disease_genes}

## **Protein-Protein network**
Build the protein-to-protein network and link each node to its diseases.


In [18]:
import networkx as nx
def intersection(lst1, lst2):
    inters = list()
    if not (len(lst1)==0 or len(lst2)==0):
        set1 = set(lst1)
        inters = [elem for elem in lst2 if elem in set1]
    return inters

Build the graph and fill it with its nodes (the proteins coming from the dataset).

In [46]:
# Build the graph
protein_graph = nx.Graph(name='Protein Interactions Graph')

# Add the nodes
for node in nodes:
    protein_graph.add_node(node, diseases=[])  # Each node will have a list with the disease pathways it belongs to

Attach to each node the diseases it belongs to.

In [47]:
for i, disease in diseases.items():
    disease_genes = disease['genes']
    for gene in disease_genes:
        protein_graph.nodes[gene]["diseases"].append(i)

There could be nodes without any disease; they still need to be kept in the network.

In [48]:
nodes_no_disease = list()
for node in protein_graph.nodes:
    if len(protein_graph.nodes[node]["diseases"])==0:
        nodes_no_disease.append(str(node))

In [49]:
print("Nodes without diseases: {0}".format(len(nodes_no_disease)))

Nodes without diseases: 5101


Then build the edges, which is straightforward as the nodes are known. Each edge is weighted by the number of disease pathways shared by its two proteins.

In [50]:
for _, interaction in dataset.iterrows():
    # Proteins involved in the interaction
    first_protein, second_protein = interaction["InteractorA"], interaction["InteractorB"]

    # Disease pathways each protein belongs to
    prot1_dis = protein_graph.nodes[first_protein]['diseases']
    prot2_dis = protein_graph.nodes[second_protein]['diseases']
    # Build the edge, weighted by the number of shared disease pathways
    protein_graph.add_edge(first_protein, second_protein, weight=len(intersection(prot1_dis, prot2_dis)))

In [51]:
protein_graph.edges.data()

EdgeDataView([('TCF3', 'HAND2', {'weight': 0}), ('TCF3', 'ID3', {'weight': 8}), ('TCF3', 'SKP2', {'weight': 12}), ('TCF3', 'NHLH1', {'weight': 3}), ('TCF3', 'EP300', {'weight': 19}), ('TCF3', 'CREBBP', {'weight': 19}), ('TCF3', 'KAT2B', {'weight': 8}), ('TCF3', 'CALM1', {'weight': 15}), ('TCF3', 'LMX1A', {'weight': 3}), ('TCF3', 'TAL1', {'weight': 11}), ('TCF3', 'LDB1', {'weight': 3}), ('TCF3', 'CBFA2T3', {'weight': 6}), ('TCF3', 'ELK3', {'weight': 2}), ('TCF3', 'MYOD1', {'weight': 6}), ('TCF3', 'TWIST1', {'weight': 11}), ('TCF3', 'PSMD9', {'weight': 9}), ('TCF3', 'ID1', {'weight': 11}), ('TCF3', 'ID2', {'weight': 6}), ('TCF3', 'MYF5', {'weight': 0}), ('TCF3', 'MYF6', {'weight': 0}), ('TCF3', 'MYOG', {'weight': 2}), ('TCF3', 'UBE2I', {'weight': 4}), ('TCF3', 'LYL1', {'weight': 7}), ('TCF3', 'MAPKAPK3', {'weight': 0}), ('TCF3', 'DAXX', {'weight': 3}), ('TCF3', 'MEN1', {'weight': 12}), ('TCF3', 'TRIM27', {'weight': 2}), ('TCF3', 'TCF4', {'weight': 11}), ('TCF3', 'BRCA1', {'weight': 17}),
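
The edge weights in the output count the disease pathways that the two endpoints have in common. A minimal sketch on a toy graph, with invented node names and disease indices; `shared_diseases` is a hypothetical helper that uses a set intersection, which gives the same count as the `intersection` function as long as the lists contain no repeated indices:

```python
import networkx as nx

toy_graph = nx.Graph()
toy_graph.add_node("P1", diseases=[0, 1, 2])
toy_graph.add_node("P2", diseases=[1, 2, 3])
toy_graph.add_node("P3", diseases=[])  # a protein without any disease


def shared_diseases(graph: nx.Graph, u: str, v: str) -> int:
    """Number of disease indices that appear on both nodes (hypothetical helper)."""
    return len(set(graph.nodes[u]["diseases"]) & set(graph.nodes[v]["diseases"]))


for u, v in [("P1", "P2"), ("P1", "P3")]:
    toy_graph.add_edge(u, v, weight=shared_diseases(toy_graph, u, v))

print(list(toy_graph.edges(data="weight")))
```

An edge that touches a protein without diseases always gets weight 0, which is why weights of 0 show up in the real graph as well.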

Finally, save the graph.

In [52]:
nx.write_gpickle(protein_graph,'datasets/protein_graph.gpickle')

## **Metrics**
Metrics needed to compare the various diseases and proteins.

Load the graph if already built previously.

In [25]:
import os

if os.path.exists("datasets/protein_graph.gpickle"):
    protein_graph = nx.read_gpickle("datasets/protein_graph.gpickle")
elif not "protein_graph" in locals():
    raise ValueError("It was not possible to find the graph, build it from the previous steps")

**Size of largest pathway component:** Fraction of disease proteins that lie in the disease's largest pathway component (i.e., the relative size of the largest connected component (LCC) of the disease).

In [26]:
def largest_conn_comp(diseases: dict) -> list:
    lcc_score = list()
    for _, disease in diseases.items():
        sub_graph = protein_graph.subgraph(disease['genes'])  # Subgraph of the current disease
        largest_cc = max(nx.connected_components(sub_graph), key=len)
        lcc_score.append(len(largest_cc) / len(sub_graph.nodes()))
    
    return lcc_score

In [27]:
df_diseases.insert(len(df_diseases.columns), "LCC Score", largest_conn_comp(diseases), True)

In [28]:
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes,LCC Score
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...,0.018182
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043
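
The LCC score can be checked by hand on a small example. The sketch below uses an invented toy graph and gene list and follows the same steps as `largest_conn_comp` for a single disease:

```python
import networkx as nx

# Toy network: a path A-B-C, an edge D-E and an isolated node F
toy_graph = nx.Graph([("A", "B"), ("B", "C"), ("D", "E")])
toy_graph.add_node("F")

# Toy disease: E is not part of it, so D is isolated inside the disease subgraph
disease_genes = ["A", "B", "C", "D", "F"]
sub_graph = toy_graph.subgraph(disease_genes)

# The components of the subgraph are {A, B, C}, {D} and {F}
largest_cc = max(nx.connected_components(sub_graph), key=len)
lcc_score = len(largest_cc) / sub_graph.number_of_nodes()
print(lcc_score)
```

Three of the five disease proteins lie in the largest component, so the score is 3/5.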


**Distance of pathway components:** For each pair of pathway components, we compute the average shortest path length between the proteins of the first component and the proteins of the second one; the score is the mean of these averages over all pairs of components.

In [29]:
from numpy.ma.core import mean
from tqdm.notebook import tqdm

def distance_pathway_comps(diseases: dict) -> list:
    dpc_score = list()
    for _, disease in tqdm(diseases.items()):
        sub_graph = protein_graph.subgraph(disease['genes'])
        conn_comps = list(nx.connected_components(sub_graph))
        distances = list()
        for i, comp in enumerate(conn_comps):
            for j in range(i+1, len(conn_comps)):
                dist = 0
                for first_comp_protein in comp:
                    for second_comp_protein in conn_comps[j]:
                        dist += nx.shortest_path_length(protein_graph, source=first_comp_protein, target=second_comp_protein)
                
                distances.append(dist / (len(comp) * len(conn_comps[j])))
        dpc_score.append(mean(distances))
    
    return dpc_score
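
As a sanity check, the same computation can be run on a toy graph where the distances are easy to verify by hand. This is a sketch, not the notebook's function: `toy_dpc` is a hypothetical, self-contained variant for a single disease, and returning NaN when the disease has a single component is an assumption made here.

```python
from itertools import combinations

import networkx as nx


def toy_dpc(graph: nx.Graph, genes: list) -> float:
    """Mean, over pairs of components, of the average shortest path length between them."""
    components = list(nx.connected_components(graph.subgraph(genes)))
    pair_means = []
    for comp_a, comp_b in combinations(components, 2):
        # Shortest paths are measured on the whole graph, not on the disease subgraph
        total = sum(
            nx.shortest_path_length(graph, source=u, target=v)
            for u in comp_a
            for v in comp_b
        )
        pair_means.append(total / (len(comp_a) * len(comp_b)))
    # Assumption: with fewer than two components there is nothing to average
    return sum(pair_means) / len(pair_means) if pair_means else float("nan")


# Path A-B-C-X-D: X is not a disease gene, so the disease splits into {A, B, C} and {D}
toy_graph = nx.Graph([("A", "B"), ("B", "C"), ("C", "X"), ("X", "D")])
score = toy_dpc(toy_graph, ["A", "B", "C", "D"])
print(score)
```

The distances from D to A, B and C are 4, 3 and 2, so the only pair of components has an average distance of 3.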

In [30]:
import os

if os.path.exists("datasets/mean_distances.csv"):
    df_mean_distances = pd.read_csv("datasets/mean_distances.csv", sep=",", index_col=0)
elif not "df_mean_distances" in locals():
    df_mean_distances = pd.DataFrame(distance_pathway_comps(diseases))
    df_mean_distances.to_csv('datasets/mean_distances.csv')

In [31]:
df_diseases["DPC Score"] = df_mean_distances
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes,LCC Score,DPC Score
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...,0.018182,2.678114
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694


In [32]:
df_diseases.to_csv("datasets/diseases_scores.csv")

**Network modularity:** Fraction of edges that fall within the pathway minus the expected fraction if edges were randomly distributed:
\begin{equation}
Q_d = \frac{1}{2m} \sum_{ij} \left( I((i, j) \in E) - \frac{k_i k_j}{2m} \right) \delta(p_i, p_j)
\end{equation}
where $m$ is the number of edges of the network, $I((i, j) \in E)$ is 1 if the edge $(i, j)$ exists and 0 otherwise, $k_i$ is the degree of $i$, and $\delta(p_i, p_j)$ is 1 if $p_i$ and $p_j$ are equal and 0 otherwise.


In [33]:
def intersection(lst1, lst2):
    inters = list()
    if not (len(lst1)==0 or len(lst2)==0):
        set1 = set(lst1)
        inters = [elem for elem in lst2 if elem in set1]
    return inters

In [34]:
def network_modularity(protein_graph: nx.Graph, diseases: dict) -> list:
    """Compute the modularity score of each disease over the unordered pairs of its proteins."""
    m = protein_graph.number_of_edges()
    one_m = 1 / (2 * m)
    Q = list()
    for _, disease in tqdm(diseases.items()):
        sub_graph = protein_graph.subgraph(disease['genes'])
        disease_nodes = list(sub_graph.nodes())
        Q_dis = 0
        # Visit each unordered pair of proteins of the disease once
        for i, node_i in enumerate(disease_nodes):
            for j in range(i+1, len(disease_nodes)):
                node_j = disease_nodes[j]
                a = protein_graph.number_of_edges(node_i, node_j)  # 1 if the two proteins interact, 0 otherwise
                k_i = protein_graph.degree[node_i]
                k_j = protein_graph.degree[node_j]
                Q_dis += a - (k_i * k_j) / (2 * m)  # Observed edge minus its expected value

        Q.append(one_m * Q_dis)

    return Q
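
A worked example on a toy graph helps to check the arithmetic. The sketch below is a hypothetical, self-contained variant for a single disease: like the function above, it sums over unordered pairs of distinct proteins, so its values are not directly comparable with a sum over all ordered pairs $ij$.

```python
from itertools import combinations

import networkx as nx


def pairwise_modularity(graph: nx.Graph, genes: list) -> float:
    """Modularity-like score of one gene set, summed over unordered pairs of genes."""
    m = graph.number_of_edges()
    total = 0.0
    for u, v in combinations(genes, 2):
        a_uv = 1 if graph.has_edge(u, v) else 0  # observed edge
        expected = graph.degree[u] * graph.degree[v] / (2 * m)  # expected under random wiring
        total += a_uv - expected
    return total / (2 * m)


# Triangle A-B-C plus a pendant edge C-D: m = 4, degrees A=2, B=2, C=3, D=1
toy_graph = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
q = pairwise_modularity(toy_graph, ["A", "B", "C"])
print(q)
```

The three pairs contribute 1 - 4/8, 1 - 6/8 and 1 - 6/8, which sum to 1; dividing by 2m = 8 gives 0.125.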

In [35]:
import os

if os.path.exists("datasets/modularities.csv"):
    df_modularities = pd.read_csv("datasets/modularities.csv", sep=",", index_col=0)
elif not "df_modularities" in locals():
    df_modularities = pd.DataFrame(network_modularity(protein_graph, diseases))
    df_modularities.to_csv('datasets/modularities.csv')

In [36]:
df_diseases["Modularity"] = df_modularities
df_diseases.tail()

Unnamed: 0,Term,Overlap,P-value,Adjusted P-value,Genes,LCC Score,DPC Score,Modularity
584,Chronic otitis media,55/69,0.005896,0.09895,IGHM;CD81;WIPF1;FMR1;DOCK8;CHD7;JMJD1C;COMT;GT...,0.018182,2.678114,-8e-06
585,Inadequate arch length for tooth size,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694,4e-06
586,Tooth Crowding,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694,4e-06
587,Tooth mass arch size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694,4e-06
588,Tooth size discrepancy,47/58,0.005953,0.099228,AMER1;SETD5;NOTCH3;TRIO;RPL10;SATB2;GNAI3;PLOD...,0.234043,2.693694,4e-06


In [37]:
df_diseases.to_csv("datasets/diseases_scores.csv")

In [38]:
#list(protein_graph.nodes())
nodes_degree = pd.DataFrame(protein_graph.degree(list(protein_graph.nodes())), columns=['protein', 'degree'])
nodes_degree = nodes_degree.sort_values(by='degree', ascending=False)

In [39]:
nodes_degree[nodes_degree['degree']==1].count()

protein    4673
degree     4673
dtype: int64

In [40]:
nodes_degree.iloc[0:100,:]

Unnamed: 0,protein,degree
524,KIAA1429,2913
475,ESR2,2287
183,ESR1,2275
409,FANCD2,2117
79,MYC,2040
...,...,...
897,NHP2L1,267
1993,SAP18,262
1999,PHGDH,261
861,PRPF3,260


In [60]:
# louvain_partitions yields one partition (a list of sets of nodes) per level of the Louvain hierarchy
parts = nx.algorithms.community.louvain_partitions(protein_graph)
# Store each partition as a list of non-empty communities, each one converted to a list of nodes
communities = [[list(elem) for elem in partition if elem] for partition in parts]

In [67]:
print(communities[1]==communities[0])

False


In [71]:
parts = nx.algorithms.community.louvain_communities(protein_graph)
parts_lst = list(parts)
parts_lst = [com for com in parts_lst if len(com)>1]

In [86]:
summation=0
for i, val in diseases.items():
    summation += len(val['genes'])
summation=summation / len(diseases.keys())
print(summation)

246.55857385398983


In [107]:
# Check whether the communities are pairwise disjoint. Spoiler: they are.
def communities_are_distincts(parts_lst):
    for i, com in enumerate(parts_lst):
        for j in range(i+1, len(parts_lst)):
            com_j =  parts_lst[j]
            if len(intersection(com, com_j))>0:
                return False
    return True
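
The same check can be tried on a graph small enough to inspect by hand. This is a sketch under a few assumptions: the toy graph and the fixed `seed` are chosen here for illustration, `pairwise_disjoint` is a hypothetical set-based equivalent of the function above, and the installed NetworkX version must provide `louvain_communities`, as the notebook already requires.

```python
from itertools import combinations

import networkx as nx

# Two cliques of 5 nodes joined by a single edge
toy_graph = nx.barbell_graph(5, 0)

# A fixed seed makes the run repeatable
toy_parts = nx.algorithms.community.louvain_communities(toy_graph, seed=42)


def pairwise_disjoint(parts: list) -> bool:
    """True if no node belongs to two different communities (hypothetical helper)."""
    return all(set(a).isdisjoint(b) for a, b in combinations(parts, 2))


covered = set().union(*toy_parts)
print(pairwise_disjoint(toy_parts), covered == set(toy_graph.nodes))
```

Since the communities form a partition of the nodes, they are expected to be disjoint and to cover the whole graph.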

In [103]:
comm_rank=dict()
for i, com in enumerate(parts_lst):
    ranks = dict()
    for k, val in diseases.items():
        genes = val['genes']
        common_genes = intersection(genes, com)
        ranks[k]= len(common_genes) #{'inters' :len(common_genes), 'disease' : val['name']}

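    # Sort the diseases by the number of genes shared with the community, largest first
    ranks = sorted(ranks.items(), key=lambda x: x[1], reverse=True)
    # Keep only the diseases sharing more than 246 genes with the community
    # (roughly the average number of genes per disease printed earlier)
    comm_rank[i]=[(diseases[r[0]]['name'], r[1]) for r in ranks if r[1]>246]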

In [104]:
print(comm_rank)

{0: [], 1: [('Malignant neoplasm of breast', 290), ('Breast Carcinoma', 287), ('Carcinogenesis', 275)], 2: [('Intellectual Disability', 465), ('Malignant neoplasm of breast', 414), ('Breast Carcinoma', 388), ('Carcinogenesis', 280), ('Global developmental delay', 274), ('Mental and motor retardation', 269), ('Cognitive delay', 256), ('Mental Retardation', 250)], 3: [('Malignant neoplasm of breast', 474), ('Breast Carcinoma', 474), ('Carcinogenesis', 382), ('Malignant neoplasm of prostate', 343), ('Mammary Neoplasms', 270)], 4: [('Malignant neoplasm of breast', 526), ('Breast Carcinoma', 507), ('Carcinogenesis', 465), ('Malignant neoplasm of prostate', 277), ('Mammary Neoplasms', 253)], 5: [], 6: [], 7: [('Malignant neoplasm of breast', 556), ('Breast Carcinoma', 526), ('Carcinogenesis', 403), ('Malignant neoplasm of prostate', 320), ('Mammary Neoplasms', 272), ('Malignant neoplasm of lung', 265), ('melanoma', 256), ('Non-Small Cell Lung Carcinoma', 253), ('Primary malignant neoplasm of
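
The ranking can be reproduced on toy data to see how the threshold shapes the output. Everything in this sketch is invented for illustration: the disease names, the community and the `min_overlap` threshold, which plays the role of the value 246 used in the notebook.

```python
# Toy diseases, in the same format as the `diseases` dict
toy_diseases = {
    0: {"name": "Disease X", "genes": ["A", "B", "C"]},
    1: {"name": "Disease Y", "genes": ["C", "D"]},
    2: {"name": "Disease Z", "genes": ["E"]},
}
toy_community = {"A", "B", "C", "D"}
min_overlap = 1  # hypothetical threshold

# Number of genes each disease shares with the community
overlaps = {
    key: len(toy_community.intersection(value["genes"]))
    for key, value in toy_diseases.items()
}

# Sort by overlap (largest first) and keep the diseases above the threshold
ranked = sorted(overlaps.items(), key=lambda item: item[1], reverse=True)
toy_rank = [(toy_diseases[key]["name"], count) for key, count in ranked if count > min_overlap]
print(toy_rank)
```

A community that shares few genes with every disease yields an empty list, as happens for some communities in the output above.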