# Example GO Pruning Using TreeParser

In [1]:
import pandas as pd
from src.utils.tree import TreeParser

Let's load the full GO hierarchy and initialize `tree_parser`.

In [2]:
ont = 'GO_files/GO_BP_full.txt'

In [3]:
tree_parser = TreeParser(ont, sys_annot_file='GO_files/goID_2_name.tab')

27596 Systems are queried
17775 Genes are queried
Total 134564 Gene-System interactions are queried
Building descendant dict
Subtree types:  ['default']


## Download GWAS statistics

This is just an example, so let's download an arbitrary set of GWAS summary statistics from the GWAS Catalog.

In [4]:
!wget https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90257001-GCST90258000/GCST90257283/GCST90257283.tsv.gz

--2025-02-10 11:41:09--  https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90257001-GCST90258000/GCST90257283/GCST90257283.tsv.gz
Resolving ftp.ebi.ac.uk (ftp.ebi.ac.uk)... 193.62.193.165
Connecting to ftp.ebi.ac.uk (ftp.ebi.ac.uk)|193.62.193.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 216858141 (207M) [application/x-gzip]
Saving to: ‘GCST90257283.tsv.gz’


2025-02-10 11:41:22 (19.2 MB/s) - ‘GCST90257283.tsv.gz’ saved [216858141/216858141]



## Load GWAS statistics

In [5]:
gwas_results = pd.read_csv('GCST90257283.tsv.gz', sep='\t', compression='gzip')

In [6]:
gwas_results.shape

(9097072, 10)

In [7]:
gwas_results.head()

Unnamed: 0,snpid,chromosome,base_pair_location,other_allele,effect_allele,beta,standard_error,effect_allele_frequency,Qual,p_value
0,chr1:710225:T:A,1,710225,T,A,-0.087379,0.118473,0.05,0.2315,0.460793
1,chr1:722408:G:C,1,722408,G,C,-0.02968,0.042131,0.75,0.518,0.481133
2,chr1:722700:G:A,1,722700,G,A,0.105705,0.13623,0.0,0.43333,0.437792
3,chr1:727233:G:A,1,727233,G,A,-0.135374,0.148444,0.0,0.31792,0.361793
4,chr1:727242:G:A,1,727242,G,A,0.036811,0.050237,0.1,0.69987,0.463714
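The full table has about 9.1 million rows. If memory is tight, one option is to read the file in chunks and keep only the rows below a p-value cutoff. The sketch below uses a small in-memory stand-in for the file; `load_significant`, its `chunksize`, and the toy rows are illustrative assumptions, not part of the original workflow.

```python
import io
import pandas as pd

# Toy stand-in for the GWAS file; column names follow the table above,
# the rows themselves are made up.
toy_tsv = (
    "snpid\tchromosome\tbase_pair_location\tp_value\n"
    "chr1:100:A:G\t1\t100\t0.5\n"
    "chr1:200:C:T\t1\t200\t0.00002\n"
    "chr2:300:G:A\t2\t300\t0.00009\n"
    "chr2:400:T:C\t2\t400\t0.2\n"
)

def load_significant(source, pval_col="p_value", threshold=1e-4, chunksize=2):
    """Read a tab-separated file in chunks and keep rows with p <= threshold."""
    kept = []
    for chunk in pd.read_csv(source, sep="\t", chunksize=chunksize):
        kept.append(chunk.loc[chunk[pval_col] <= threshold])
    return pd.concat(kept, ignore_index=True)

sig = load_significant(io.StringIO(toy_tsv))
print(sig["snpid"].tolist())
```

For the real file, `source` would be the path to `GCST90257283.tsv.gz` and `chunksize` something much larger (for example, one million rows).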


## Nearest Gene Case

If you already have a SNP-to-gene (SNP2Gene) mapping from another source, **please skip this section** and use your own mapping.

### Load gtf file

You can download the GTF file for GRCh37 from Ensembl (https://ftp.ensembl.org/pub/grch37/release-87/gtf/homo_sapiens/).

In [8]:
gtf = pd.read_csv("GO_files/Homo_sapiens.GRCh37.87.gtf", skiprows=5, sep='\t', header=None)


In [9]:
gtf.columns = ["CHR", "POS", "type", "start", "end", "..", "strand", "...", "properties"]

In [10]:
def get_nearst_gene(gtf, chromosome, pos):
    """Return the name of the gene whose start or end is closest to pos, or None."""
    try:
        gtf_chromosome = gtf[gtf['CHR'] == chromosome]

        # Distance from the SNP to the start and to the end of each gene
        # (kept as separate Series so the filtered slice of gtf is not modified)
        distance_to_start = (gtf_chromosome['start'] - pos).abs()
        distance_to_end = (gtf_chromosome['end'] - pos).abs()
        # Smallest distance to either gene boundary
        min_distance = pd.concat([distance_to_start, distance_to_end], axis=1).min(axis=1)
        return gtf_chromosome.loc[min_distance.idxmin(), "gene_name"]
    except (KeyError, ValueError, TypeError):
        # No gene on this chromosome, or unusable coordinates
        return None

def get_property_dict(values):
    # Parse a GTF attribute string such as 'gene_id "X"; gene_name "Y";' into a dict
    result_dict = {}
    for prop in values.strip().split(";")[:-1]:
        # Split on the first space only, so quoted values containing spaces stay intact
        key, value = prop.strip().split(" ", 1)
        result_dict[key] = value.strip('"')
    return result_dict


def normalize_chrome(value):
    # Convert numeric chromosome labels ("1", "2", ...) to int; leave labels such as "X" unchanged
    if isinstance(value, int):
        return value
    if value.isdigit():
        return int(value)
    return value
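To make the attribute parsing concrete, here is a minimal, self-contained sketch of how one GTF attribute string turns into a dictionary. The helper name `parse_gtf_attributes` and the shortened example string are illustrative assumptions; real Ensembl lines carry more fields.

```python
def parse_gtf_attributes(attributes):
    """Turn 'key "value"; key "value";' into a dict (illustrative helper)."""
    result = {}
    for field in attributes.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after the final ';'
        key, _, value = field.partition(" ")
        result[key] = value.strip('"')
    return result

# Shortened example in the style of an Ensembl GTF attribute column
example = 'gene_id "ENSG00000186092"; gene_name "OR4F5"; gene_biotype "protein_coding";'
parsed = parse_gtf_attributes(example)
print(parsed["gene_name"], parsed["gene_biotype"])
```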

Process the GTF file:

In [11]:
gtf['CHR'] = gtf.CHR.map(normalize_chrome)
gtf["properties"] = gtf["properties"].map(get_property_dict)
gtf["gene_name"] = gtf["properties"].map(lambda a: a.get("gene_name"))
gtf["gene_biotype"] = gtf["properties"].map(lambda a: a.get("gene_biotype"))
# Keep only gene-level rows for protein-coding genes
gtf = gtf.loc[(gtf['gene_biotype']=='protein_coding') & (gtf['type']=='gene')]
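The nearest-gene rule used in this section measures the distance from a SNP to each gene's start and end and picks the gene with the smallest of those distances. The toy example below applies that rule to made-up coordinates; the gene names, positions, and the helper `nearest_by_boundary` are hypothetical.

```python
import pandas as pd

# Hypothetical genes on one chromosome (coordinates are made up)
genes = pd.DataFrame({
    "gene_name": ["GENE_A", "GENE_B", "GENE_C"],
    "start": [1000, 5000, 9000],
    "end": [2000, 6000, 9500],
})

def nearest_by_boundary(genes, pos):
    """Return the gene whose start or end coordinate is closest to pos."""
    to_start = (genes["start"] - pos).abs()
    to_end = (genes["end"] - pos).abs()
    min_distance = pd.concat([to_start, to_end], axis=1).min(axis=1)
    return genes.loc[min_distance.idxmin(), "gene_name"]

first = nearest_by_boundary(genes, 2300)   # 300 bp past the end of GENE_A
second = nearest_by_boundary(genes, 8000)  # 1000 bp before the start of GENE_C
print(first, second)
```

Because distances are taken to gene boundaries, a SNP deep inside a long gene can still have a nonzero distance to that gene, which is worth keeping in mind when interpreting the mapping.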

## Collapse Ontology based on GWAS

In [12]:
pval_col = 'p_value'
pval_threshold = 1e-4

In [13]:
gwas_results_sig = gwas_results.loc[gwas_results[pval_col] <= pval_threshold].copy()  # copy so a new column can be added safely

In [14]:
# Adjust snp_info.chromosome and snp_info.base_pair_location to match the column names in your GWAS statistics
nearest_genes = gwas_results_sig.apply(lambda snp_info: get_nearst_gene(gtf, snp_info.chromosome, snp_info.base_pair_location), axis=1)
gwas_results_sig['gene_name'] = nearest_genes


In [15]:
gwas_results_sig.head()

Unnamed: 0,snpid,chromosome,base_pair_location,other_allele,effect_allele,beta,standard_error,effect_allele_frequency,Qual,p_value,gene_name
121914,chr1:39594106:A:G,1,39594106,A,G,0.110808,0.027581,0.35,0.98547,5.9e-05,MACF1
122064,chr1:39642015:G:A,1,39642015,G,A,-0.191866,0.049165,0.1,0.98063,9.5e-05,MACF1
122143,chr1:39673690:C:T,1,39673690,C,T,-0.280666,0.070832,0.05,0.99305,7.4e-05,MACF1
180947,chr1:60663562:A:T,1,60663562,A,T,-0.503807,0.127223,0.0,0.83033,7.5e-05,C1orf87
237131,chr1:78797567:A:C,1,78797567,A,C,0.112562,0.026526,0.5,0.9691,2.2e-05,PTGFR


In [16]:
sig_genes = gwas_results_sig['gene_name'].unique()

In [17]:
len(sig_genes)

237
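Since `get_nearst_gene` returns `None` when no gene is found, the `gene_name` column may contain missing values, and `unique()` would keep one as an entry. A small sketch, on made-up data, of dropping them before building the gene list:

```python
import pandas as pd

# Made-up gene assignments, including one SNP with no gene found
gene_names = pd.Series(["MACF1", "MACF1", None, "PTGFR"])

# Drop missing values first, then take the unique gene symbols
unique_genes = gene_names.dropna().unique()
print(list(unique_genes))
```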

The `retain_genes` function removes non-significant genes from the hierarchy.

In [18]:
tree_parser.retain_genes(sig_genes)

27596 Systems are queried
191 Genes are queried
Total 1610 Gene-System interactions are queried
Building descendant dict
Subtree types:  ['default']


The `collapse` function collapses the ontology based on the retained genes. Use `min_term_size` to ensure that even the smallest system contains at least N genes.

In [19]:
tree_parser.collapse(min_term_size=5)

The number of nodes to collapse: 27116
480 Systems are queried
191 Genes are queried
Total 3039 Gene-System interactions are queried
Building descendant dict
Subtree types:  ['default']


You can repeat the collapse until the number of systems in the hierarchy no longer changes (this is optional!).

In [20]:
tree_parser.collapse(min_term_size=5)

The number of nodes to collapse: 179
301 Systems are queried
191 Genes are queried
Total 2024 Gene-System interactions are queried
Building descendant dict
Subtree types:  ['default']
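Repeating the collapse until the system count stops changing is a fixed-point loop. The sketch below shows that loop in generic form; `collapse_until_stable`, `ToyTree`, and its numbers are hypothetical stand-ins so the example runs without `TreeParser`.

```python
def collapse_until_stable(count_systems, collapse_once, max_rounds=20):
    """Call collapse_once() until count_systems() stops changing."""
    previous = count_systems()
    for _ in range(max_rounds):
        collapse_once()
        current = count_systems()
        if current == previous:
            break  # nothing changed in this round: fixed point reached
        previous = current
    return previous

class ToyTree:
    """Stand-in object whose system count shrinks and then stabilizes (made-up numbers)."""
    def __init__(self):
        self.sizes = [100, 40, 25, 25]
        self.step = 0

    def n_systems(self):
        return self.sizes[self.step]

    def collapse(self):
        self.step = min(self.step + 1, len(self.sizes) - 1)

toy = ToyTree()
final_count = collapse_until_stable(toy.n_systems, toy.collapse)
print(final_count)
```

With the objects from this notebook, `collapse_once` could be `lambda: tree_parser.collapse(min_term_size=5)`. How to count systems depends on `TreeParser`; counting unique `parent` values in `tree_parser.ontology` is one untested possibility.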


In [21]:
tree_parser.ontology

Unnamed: 0,parent,child,interaction
0,GO:0022414,GO:0003006,default
1,GO:0003006,GO:0007548,default
2,GO:0001775,GO:0045321,default
3,GO:0045321,GO:0046649,default
4,GO:0002376,GO:0002252,default
...,...,...,...
2506,GO:0060284,MACF1,gene
2507,GO:0060284,PGLYRP1,gene
2508,GO:0060284,CR1,gene
2509,GO:0060284,VEGFC,gene
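The table mixes two kinds of rows: system-to-system edges (`interaction` is `default`) and gene annotations (`interaction` is `gene`). A minimal sketch, on a toy table with the same columns, of separating them and counting directly annotated genes per system; the gene assignments below are illustrative, not real annotations.

```python
import pandas as pd

# Toy table with the same columns as the output above; gene rows are made up
ontology = pd.DataFrame({
    "parent": ["GO:0022414", "GO:0003006", "GO:0003006", "GO:0007548", "GO:0007548"],
    "child": ["GO:0003006", "GO:0007548", "GENE_A", "GENE_A", "GENE_B"],
    "interaction": ["default", "default", "gene", "gene", "gene"],
})

is_gene = ontology["interaction"] == "gene"
system_edges = ontology.loc[~is_gene]      # parent system -> child system
gene_annotations = ontology.loc[is_gene]   # system -> gene

# Number of distinct genes annotated directly to each system
genes_per_system = gene_annotations.groupby("parent")["child"].nunique()
print(genes_per_system.to_dict())
```

These counts cover direct annotations only; genes inherited from descendant systems are not included.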


Now save your pruned ontology!

In [None]:
tree_parser.save_ontology('output_dir')