---
title: Investigating the various CAGE brain tissue mouse tracks
author: Sabrina Mi
date: 9/1/2023
description: I chose the CAGE:hippocampus mouse track to represent brain tissue, since the mouse targets have no equivalent of a general CAGE:Brain track.
---


In [26]:
import pandas as pd
import numpy as np
import h5py

In [23]:
## subset to all brain-specific CAGE targets
targets = pd.read_csv("https://raw.githubusercontent.com/calico/basenji/master/manuscripts/cross2020/targets_mouse.txt", sep = "\t")
CAGE_targets = targets[targets['index'].isin([6612, 6613, 6622, 6627])]
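# The DataFrame row labels (1299, 1300, 1309, 1314), not the 'index' column,
# are what select the prediction columns further down
track_indices = list(CAGE_targets.index)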
CAGE_targets.head()

Unnamed: 0,index,genome,identifier,file,clip,scale,sum_stat,description
1299,6612,1,CNhs10477,/home/drk/tillage/datasets/mouse/cage/fantom/C...,384,1,sum,"CAGE:medulla oblongata, adult"
1300,6613,1,CNhs10478,/home/drk/tillage/datasets/mouse/cage/fantom/C...,384,1,sum,"CAGE:hippocampus, adult"
1309,6622,1,CNhs10489,/home/drk/tillage/datasets/mouse/cage/fantom/C...,384,1,sum,"CAGE:olfactory brain, adult"
1314,6627,1,CNhs10494,/home/drk/tillage/datasets/mouse/cage/fantom/C...,384,1,sum,"CAGE:cerebellum, adult"
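
The four targets above are picked by hard-coded `index` values. As a rough alternative, the same rows could be selected by matching the `description` column. The sketch below runs on a small toy table, not the real targets file; the `CAGE:liver, adult` row and the names `toy_targets`, `tissues`, and `selected_indices` are made up for illustration. On the full table a text match may pick up more tracks than the four chosen here, so the result should be checked by eye.

```python
import pandas as pd

# Toy stand-in for the targets table; only the 'description' column matters here.
# The first row is a hypothetical non-brain track added for contrast.
toy_targets = pd.DataFrame(
    {
        "description": [
            "CAGE:liver, adult",
            "CAGE:medulla oblongata, adult",
            "CAGE:hippocampus, adult",
            "CAGE:olfactory brain, adult",
            "CAGE:cerebellum, adult",
        ]
    },
    index=[0, 1299, 1300, 1309, 1314],
)

tissues = ["medulla oblongata", "hippocampus", "olfactory brain", "cerebellum"]
pattern = "|".join(tissues)

# Keep rows that are CAGE assays and mention one of the brain regions
is_cage = toy_targets["description"].str.startswith("CAGE:")
is_brain = toy_targets["description"].str.contains(pattern, regex=True)
selected = toy_targets[is_cage & is_brain]

# Row labels of the selected tracks
selected_indices = list(selected.index)
print(selected_indices)
```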


In [22]:
predictions_dir = "/home/s1mi/Br_predictions/predictions_folder/personalized_Br_selected_genes/predictions_2023-09-01/enformer_predictions"
gene_expr_bed = "/home/s1mi/enformer_rat_data/expression_data/Brain.rn7.expr.tpm.bed"
obs_gene_expr = pd.read_csv(gene_expr_bed, sep="\t", header=0, index_col='gene_id')
annot_df = pd.read_csv("/home/s1mi/enformer_rat_data/annotation/rn7.gene.txt", sep="\t", header= 0, index_col='geneId')
gene_list = ["ENSRNOG00000060185", "ENSRNOG00000022448", "ENSRNOG00000006331", "ENSRNOG00000000435", "ENSRNOG00000001336", "ENSRNOG00000016623", "ENSRNOG00000025324", "ENSRNOG00000012087", "ENSRNOG00000021663", "ENSRNOG00000012333"]




In [4]:
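expr_dict = {}
for gene in gene_list:
    # Skip the first three columns of the row (presumably the BED coordinate fields)
    # and keep the per-individual values as floats
    obs = obs_gene_expr.loc[gene][3:].astype(float)
    expr_dict[gene] = pd.DataFrame({"observed": obs})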

In [None]:
for gene in gene_list:
    gene_annot = annot_df.loc[gene]
    # Prediction files are named after the TSS interval: chr{chromosome}_{tss}_{tss}
    interval = f"chr{gene_annot['chromosome']}_{gene_annot['tss']}_{gene_annot['tss']}"
    medulla_oblongata = []
    hippocampus = []
    olfactory_brain = []
    cerebellum = []
    for individual in expr_dict[gene].index:
        # Open read-only and close the file as soon as the slice is loaded
        with h5py.File(f"{predictions_dir}/{individual}/haplotype0/{interval}_predictions.h5", "r") as haplo0:
            # Bins 446:450 (the center of the window if the output has 896 bins),
            # restricted to the four brain CAGE tracks
            predictions = haplo0["mouse"][446:450, track_indices]
        # Average each track over the four selected bins
        medulla_oblongata.append(np.average(predictions[:,0]))
        hippocampus.append(np.average(predictions[:,1]))
        olfactory_brain.append(np.average(predictions[:,2]))
        cerebellum.append(np.average(predictions[:,3]))
    expr_dict[gene]["medulla oblongata"] = medulla_oblongata
    expr_dict[gene]["hippocampus"] = hippocampus
    expr_dict[gene]["olfactory brain"] = olfactory_brain
    expr_dict[gene]["cerebellum"] = cerebellum
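
To show what the slicing and averaging in the loop above does, here is a minimal sketch on a random array instead of a real prediction file. The shape `(896, 1643)` is an assumption about the mouse output, and `toy_predictions` is a made-up stand-in for `haplo0["mouse"]`. The four per-track averages can be taken in one call with `mean(axis=0)`.

```python
import numpy as np

# Assumed shape of one mouse prediction array: (bins, tracks)
n_bins, n_tracks = 896, 1643
rng = np.random.default_rng(0)
toy_predictions = rng.random((n_bins, n_tracks))  # random stand-in, not real data

track_indices = [1299, 1300, 1309, 1314]

# Four bins around the middle of the window, four selected tracks -> shape (4, 4)
center = toy_predictions[446:450, track_indices]

# One average per track, taken over the bin axis
per_track_mean = center.mean(axis=0)

# Same numbers as averaging each column separately, as the loop above does
column_means = [np.average(center[:, i]) for i in range(len(track_indices))]
assert np.allclose(per_track_mean, column_means)
print(per_track_mean.shape)
```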

In [131]:
corr_by_gene_and_track = pd.DataFrame(columns = ["medulla oblongata", "hippocampus", "olfactory brain", "cerebellum"], index = gene_list)

for gene in gene_list:
    corr_df = expr_dict[gene].corr()
    corr_by_gene_and_track.loc[gene] = pd.to_numeric(corr_df.iloc[0,1:])
print(corr_by_gene_and_track)

                   medulla oblongata hippocampus olfactory brain cerebellum
ENSRNOG00000060185          0.340841    0.339026        0.340439   0.338799
ENSRNOG00000022448          0.176105    0.183011        0.179076   0.176374
ENSRNOG00000006331          0.284351    0.326806        0.327064    0.32246
ENSRNOG00000000435          0.212012    0.233529        0.171748   0.097784
ENSRNOG00000001336          0.550703    0.540987         0.54114   0.551966
ENSRNOG00000016623          0.060619    0.170785        0.153715  -0.109455
ENSRNOG00000025324         -0.315003   -0.223645       -0.240727  -0.174633
ENSRNOG00000012087          0.285358    0.309409        0.312098   0.312001
ENSRNOG00000021663         -0.325343   -0.430207       -0.426303  -0.014811
ENSRNOG00000012333         -0.081875    0.098419        0.077553   0.038098


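Looking at the table row by row, hippocampus has the highest correlation with observed expression for four of the ten genes, more than any other track, so I've decided to use the CAGE:hippocampus mouse track as a representative of brain tissue.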
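
The row-wise comparison can also be tallied in code. The sketch below re-enters the correlation values printed above and counts, per gene, which track has the highest correlation. The names `corr`, `best_track`, `wins`, and `spread` are illustrative, and "highest correlation" is only one possible selection criterion.

```python
import pandas as pd

tracks = ["medulla oblongata", "hippocampus", "olfactory brain", "cerebellum"]

# Correlations copied from the table printed above
corr = pd.DataFrame(
    [
        [0.340841, 0.339026, 0.340439, 0.338799],
        [0.176105, 0.183011, 0.179076, 0.176374],
        [0.284351, 0.326806, 0.327064, 0.32246],
        [0.212012, 0.233529, 0.171748, 0.097784],
        [0.550703, 0.540987, 0.54114, 0.551966],
        [0.060619, 0.170785, 0.153715, -0.109455],
        [-0.315003, -0.223645, -0.240727, -0.174633],
        [0.285358, 0.309409, 0.312098, 0.312001],
        [-0.325343, -0.430207, -0.426303, -0.014811],
        [-0.081875, 0.098419, 0.077553, 0.038098],
    ],
    columns=tracks,
    index=[
        "ENSRNOG00000060185", "ENSRNOG00000022448", "ENSRNOG00000006331",
        "ENSRNOG00000000435", "ENSRNOG00000001336", "ENSRNOG00000016623",
        "ENSRNOG00000025324", "ENSRNOG00000012087", "ENSRNOG00000021663",
        "ENSRNOG00000012333",
    ],
)

# Track with the highest correlation for each gene
best_track = corr.idxmax(axis=1)

# Number of genes for which each track comes out on top
wins = best_track.value_counts()

# How far apart the tracks are for each gene
spread = corr.max(axis=1) - corr.min(axis=1)

print(wins)
print(spread.round(3))
```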