# Parse New Gene Table

**from:** Maria D. Vibranovski

Attached is a list from Yong Zhang's group, based on our paper from 2010. It is an updated version that has not been published yet; he shared it with me, and you can use it.

If you need details about the columns, please look at Table 2a in https://genome.cshlp.org/content/suppl/2010/08/27/gr.107334.110.DC1/SupplementalMaterial.pdf.

But mainly, what you need to select is the child genes with:

gene_type = D or R or DI or RI
m_type= M
note that contains "chrX-"

D and R stand for DNA-based duplication and RNA-based duplication.
I means that the assignment of the parental genes is less reliable.
M indicates movement between chromosomes.

Hope it helps. If you need, I can parse it for you; please do not hesitate to ask. But I thought you would prefer a complete list where you can look at subsets.

cheers

Maria
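
The selection described in the email can be sketched in pandas as follows. This is only an illustration on a toy table: the column names `gene_type`, `m_type`, and `note` come from the email, but the rows, the placeholder codes `A` and `N`, and the exact layout of the `note` strings are assumptions made up for the example.

```python
import pandas as pd

# Toy rows; every value below is an invented placeholder, not a real record.
toy = pd.DataFrame({
    'child_id': ['g1', 'g2', 'g3', 'g4'],
    'gene_type': ['D', 'RI', 'A', 'R'],   # 'A' stands in for any other gene type
    'm_type': ['M', 'M', 'M', 'N'],       # 'N' stands in for any non-M value
    'note': ['chrX->chr2L', 'chr3R->chrX', 'chrX->chr3L', 'chrX->chrX'],
})

# The three criteria from the email, one boolean mask each
is_duplication = toy['gene_type'].isin(['D', 'R', 'DI', 'RI'])
is_moved = toy['m_type'] == 'M'
has_chrx = toy['note'].str.contains('chrX-', regex=False)  # literal substring match

selected = toy[is_duplication & is_moved & has_chrx]
print(selected)
```

Keeping each criterion as its own mask makes it easy to drop or swap one of them when looking at other subsets of the complete list.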


In [7]:
import os
import sys
from pathlib import Path
import re

from IPython.display import display, HTML, Markdown
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Project level imports
sys.path.insert(0, '../lib')
from larval_gonad.notebook import Nb
from larval_gonad.plotting import make_figs
from larval_gonad.config import memory

# Setup notebook
nbconfig = Nb.setup_notebook()

last updated: 2018-05-10 
Git hash: 24131ab928c3728f9d1a9a7071609d227c96d379


## Import data from Maria

In [105]:
dat = pd.read_excel('../data/external/maria/dm6_ver78_genetype.xlsx')

# Focus on genes that arose by DNA-based duplication or retrotransposition
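# The email spells the less reliable codes 'DI' and 'RI' (capital I); accept
# that spelling as well as 'Dl' and 'Rl' (lowercase L)
gene_type_mask = dat.gene_type.isin(['D', 'R', 'DI', 'RI', 'Dl', 'Rl'])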

# Focus on genes that moved chromosomes
m_type_mask = dat.m_type == 'M'

# Grab genes that moved
moved_genes = dat.loc[gene_type_mask & m_type_mask, ['child_id', 'parent_id', 'note']]

In [106]:
# Parse out the child and parent chroms from the note (first match is the child, second the parent)
chroms = np.array([re.findall('chr[X234Y][RL]?', s) for s in moved_genes.note])
moved_genes['child_chrom'] = chroms[:, 0]
moved_genes['parent_chrom'] = chroms[:, 1]
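
The cell above relies on every note containing exactly two chromosome names; otherwise `np.array` does not produce a clean two-column array. A minimal sketch of a stricter variant is shown below. It reuses the same regular expression, but the helper name `parse_chroms` and the toy note strings are hypothetical, and it keeps the assumption that the first match is the child chromosome and the second the parent.

```python
import re
import pandas as pd

CHROM_RE = re.compile(r'chr[X234Y][RL]?')

def parse_chroms(notes):
    """Return child_chrom/parent_chrom columns, in order of appearance in each note."""
    found = notes.apply(CHROM_RE.findall)
    # Fail loudly if any note does not hold exactly two chromosome names
    bad = found[found.str.len() != 2]
    if not bad.empty:
        raise ValueError(f'notes without exactly two chromosomes: {list(bad.index)}')
    return pd.DataFrame(found.tolist(), index=notes.index,
                        columns=['child_chrom', 'parent_chrom'])

# Invented notes; the real note format may differ
toy_notes = pd.Series(['chr2L<-chrX', 'chr3R<-chr4'], index=[10, 11])
parsed = parse_chroms(toy_notes)
print(parsed)
```

Because the result keeps the original index, it could be joined back onto a table such as `moved_genes` without relying on row order.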

In [107]:
otable = moved_genes.rename({'child_id': 'FBgn', 'child_chrom': 'chrom'}, axis=1)[['FBgn', 'chrom', 'parent_id', 'parent_chrom']]

## FBgn sanitizer

I don't know which annotation release these FBgns come from, so I need to sanitize them to my current annotation.

In [98]:
assembly = nbconfig.assembly
tag = nbconfig.tag
pth = Path(os.environ['REFERENCES_DIR'], f'{assembly}/{tag}/fb_annotation/{assembly}_{tag}.fb_annotation')

# Create an FBgn mapper: each primary and secondary FBgn points to its current primary FBgn
mapper = {}

for record in pd.read_csv(pth, sep='\t').to_records():
    mapper[record.primary_FBgn] = record.primary_FBgn
    
    try:
        for g in record.secondary_FBgn.split(','):
            mapper[g] = record.primary_FBgn
    except AttributeError:
        # Missing secondary_FBgn values are not strings and have no .split(); skip them
        pass
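
To make the behaviour of the mapper concrete, here is a small self-contained sketch on a toy annotation table. The column names `primary_FBgn` and `secondary_FBgn` match the annotation file used above; the FBgn values and the helper name `build_mapper` are made up for illustration.

```python
import pandas as pd

# Toy annotation; the IDs are invented placeholders
toy_annot = pd.DataFrame({
    'primary_FBgn': ['FBgn0000001', 'FBgn0000002'],
    'secondary_FBgn': ['FBgn9000001,FBgn9000002', None],
})

def build_mapper(annot):
    """Map every primary and secondary FBgn to its primary FBgn."""
    mapper = {}
    for primary, secondary in zip(annot['primary_FBgn'], annot['secondary_FBgn']):
        mapper[primary] = primary
        # Rows without secondary IDs hold a missing value rather than a string
        if isinstance(secondary, str):
            for old_id in secondary.split(','):
                mapper[old_id] = primary
    return mapper

toy_mapper = build_mapper(toy_annot)

# An outdated ID is updated, a current ID is kept, an unknown ID passes through
ids = pd.Series(['FBgn9000002', 'FBgn0000002', 'FBgn7777777'])
sanitized = ids.replace(toy_mapper)
print(sanitized.tolist())
```

Note that `replace` leaves IDs that are absent from the mapper untouched, so any FBgn unknown to the current annotation would only surface in a later check.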

In [108]:
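# Map child and parent IDs to current primary FBgns; IDs missing from the mapper are left unchanged
otable['FBgn'] = otable['FBgn'].replace(mapper)
otable['parent_id'] = otable['parent_id'].replace(mapper)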

## Check Chromosomes

I just want to double check that the chromosomes are correct according to my annotation.

In [110]:
# check chroms are right
for record in otable.to_records():
    assert record.chrom == nbconfig.fbgn2chrom.loc[record.FBgn, 'chrom']
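
The loop above stops at the first problem it meets. As an alternative, the sketch below collects every gene whose chromosome disagrees with the annotation, or that is missing from it. The toy tables are hypothetical stand-ins: `toy_fbgn2chrom` is assumed to look like `nbconfig.fbgn2chrom`, that is, indexed by FBgn with a `chrom` column.

```python
import pandas as pd

# Toy stand-ins for otable and nbconfig.fbgn2chrom; all values are invented
toy_otable = pd.DataFrame({
    'FBgn': ['FBgn0000001', 'FBgn0000002', 'FBgn0000003'],
    'chrom': ['chrX', 'chr2L', 'chr3R'],
})
toy_fbgn2chrom = pd.DataFrame(
    {'chrom': ['chrX', 'chr2R']},
    index=pd.Index(['FBgn0000001', 'FBgn0000002'], name='FBgn'),
)

# Look up the annotated chromosome for each FBgn; unknown IDs become missing values
annotated = toy_otable['FBgn'].map(toy_fbgn2chrom['chrom'])

missing = toy_otable[annotated.isna()]
mismatched = toy_otable[annotated.notna() & (annotated != toy_otable['chrom'])]

print('not in annotation:', missing['FBgn'].tolist())
print('chromosome differs:', mismatched['FBgn'].tolist())
```

Listing all problem rows at once makes it easier to tell a single outdated ID apart from a systematic difference between annotations.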

In [114]:
otable.set_index('FBgn').to_csv('../output/new_genes.tsv', sep='\t')