Script to cross-reference the GFF annotation with the list of differentially expressed genes (DEGs) for each condition, extracted from the full DEG table produced by AskoR.

Structure
Project directory (cwd): holds all the scripts. It contains three subdirectories: RNA, FAIRE, and integrated. General data such as the GFF and GTF files are also stored in this root folder.

In [4]:
import pandas as pd
import os

cwd = os.getcwd() # current working directory. All scripts in this directory. Inside : RNA, FAIRE, and integrated
RNA_directory = cwd + '/RNA/' # there's also the R scripts and result of the AskoR analysis inside
gff_header = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']

# conversion to dataframes
gff_df = pd.read_csv('Acyrthosiphon_pisum_JIC1_v1.0.scaffolds.braker2.gff', sep='\t', header=None, comment='#', names=gff_header)
DEG_df = pd.read_csv(RNA_directory + 'AskoR/MaleVsPartheno/DEanalysis/DEtables/MalevsPartheno.txt', sep='\t')

gff_df.shape

(892845, 9)

In [3]:
# Separating partheno from males, with an FDR < 0.05
DEG_male_df = DEG_df[(DEG_df['Significance'] == 1) & (DEG_df['FDR'] < 0.05)]
DEG_partheno_df = DEG_df[(DEG_df['Significance'] == -1) & (DEG_df['FDR'] < 0.05)]
DEG_male_df.shape  # same as the results from AskoR

(2867, 13)

In [3]:
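# Selecting only the lines that correspond to a gene (.copy() avoids a SettingWithCopyWarning on the assignment below)
gff_df = gff_df[gff_df['type'] == 'gene'].copy()
# Adding a gene column to the GFF to make the join with the DEG lists easier
# [^;]+ stops at the first ';' so only the ID value is captured
gff_df['gene'] = gff_df['attributes'].str.extract(r'ID=([^;]+)', expand=False)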
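
How the gene identifier is pulled out of the `attributes` column depends on the regular expression. The sketch below compares a greedy pattern with one bounded at the first semicolon. The attribute strings are toy values: the second and third are hypothetical layouts that may not occur in this annotation.

```python
import pandas as pd

# Toy attribute strings; only the first mirrors the gene rows shown in this notebook.
attributes = pd.Series(["ID=g168;", "ID=g169;Name=example;", "ID=g170"])

# Greedy: captures everything up to the LAST ';' and needs a ';' to match at all.
greedy = attributes.str.extract(r"ID=(.*);", expand=False)
# Bounded: captures up to the FIRST ';' (or the end of the string).
bounded = attributes.str.extract(r"ID=([^;]+)", expand=False)

comparison = pd.DataFrame({"attributes": attributes, "greedy": greedy, "bounded": bounded})
print(comparison)
```

On rows that hold a single `ID=...;` field both patterns give the same identifier; they only diverge when extra fields follow or the trailing semicolon is missing.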

In [4]:
# Keeping only the GFF rows whose gene is a DEG in each condition
DEG_male_gff = gff_df[(gff_df['gene'].isin(DEG_male_df.gene))]
DEG_partheno_gff = gff_df[(gff_df['gene'].isin(DEG_partheno_df.gene))]
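
Because `isin` silently drops identifiers that have no match, it can be worth checking that every DEG was found in the GFF. The following is a minimal sketch on toy tables; `missing_from_gff` is a hypothetical helper, not part of the original script, and it assumes both tables carry a `gene` column as above.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only).
gff_genes = pd.DataFrame({"gene": ["g1", "g2", "g3"]})
deg_table = pd.DataFrame({"gene": ["g2", "g3", "g99"]})

def missing_from_gff(deg_df, gff_df, key="gene"):
    """Return the sorted DEG identifiers that have no row in the GFF table."""
    return sorted(set(deg_df[key]) - set(gff_df[key]))

missing = missing_from_gff(deg_table, gff_genes)
print(missing)
```

An empty list would mean that the filtered GFF contains one row for every DEG, provided gene identifiers are unique in the GFF.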

In [5]:
DEG_male_gff

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,attributes,gene
304,scaffold_1,AUGUSTUS,gene,184336,191720,0.04,+,.,ID=g168;,g168
338,scaffold_1,AUGUSTUS,gene,218045,223926,0.32,-,.,ID=g169;,g169
359,scaffold_1,AUGUSTUS,gene,224516,257836,0.03,+,.,ID=g170;,g170
1016,scaffold_1,AUGUSTUS,gene,612679,616236,0.1,-,.,ID=g190;,g190
1037,scaffold_1,AUGUSTUS,gene,616456,619176,0.03,+,.,ID=g191;,g191
...,...,...,...,...,...,...,...,...,...,...
891657,scaffold_418,AUGUSTUS,gene,1,6678,0.63,+,.,ID=g33;,g33
892111,scaffold_462,AUGUSTUS,gene,1,4546,0.56,-,.,ID=g31176;,g31176
892307,scaffold_488,AUGUSTUS,gene,1,3322,0.54,+,.,ID=g20198;,g20198
892434,scaffold_512,AUGUSTUS,gene,510,3076,0.12,-,.,ID=g22836;,g22836


To generate the BED file of the regions that **surround the TSS** (+/- 1500 bp): use the **start** of the gene on the positive strand, but the **end** of the gene on the negative strand. Care is needed to keep start < end in the right columns, and to avoid going below 0, which the `clip` method handles.

In [10]:
DEG_male_tss_bed = pd.DataFrame({
    'seqid': DEG_male_gff['seqid'],
    'start': DEG_male_gff.apply(lambda x: x['start'] - 1500 if x['strand'] == '+' else x['end'] - 1500, axis=1).clip(lower=0),
    'end': DEG_male_gff.apply(lambda x: x['start'] + 1500 if x['strand'] == '+' else x['end'] + 1500, axis=1).clip(lower=0),
    'strand': DEG_male_gff['strand'],
    'gene': DEG_male_gff['gene']
})

DEG_male_tss_bed

Unnamed: 0,seqid,start,end,strand,gene
304,scaffold_1,182836,185836,+,g168
338,scaffold_1,222426,225426,-,g169
359,scaffold_1,223016,226016,+,g170
1016,scaffold_1,614736,617736,-,g190
1037,scaffold_1,614956,617956,+,g191
...,...,...,...,...,...
891657,scaffold_418,0,1501,+,g33
892111,scaffold_462,3046,6046,-,g31176
892307,scaffold_488,0,1501,+,g20198
892434,scaffold_512,1576,4576,-,g22836


In [11]:
DEG_partheno_tss_bed = pd.DataFrame({
    'seqid': DEG_partheno_gff['seqid'],
    'start': DEG_partheno_gff.apply(lambda x: x['start'] - 1500 if x['strand'] == '+' else x['end'] - 1500, axis=1).clip(lower=0),
    'end': DEG_partheno_gff.apply(lambda x: x['start'] + 1500 if x['strand'] == '+' else x['end'] + 1500, axis=1).clip(lower=0),
    'strand': DEG_partheno_gff['strand'],
    'gene': DEG_partheno_gff['gene']
})

DEG_partheno_tss_bed

Unnamed: 0,seqid,start,end,strand,gene
57,scaffold_1,70696,73696,+,g157
570,scaffold_1,345946,348946,+,g175
644,scaffold_1,400346,403346,+,g178
713,scaffold_1,440526,443526,+,g180
1570,scaffold_1,1107061,1110061,+,g210
...,...,...,...,...,...
889035,scaffold_215,10291,13291,-,g20213
889941,scaffold_263,6686,9686,-,g22722
891424,scaffold_399,0,2976,-,g40
891534,scaffold_404,5426,8426,-,g22716
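
The male and partheno cells apply the same rule, so it could be factored into one function. The sketch below is one possible vectorised version, not the code that produced the tables above. `tss_window` and its `scaffold_sizes` argument are hypothetical names, and the scaffold lengths in the example are made up; the gene coordinates are copied from the male table.

```python
import numpy as np
import pandas as pd

def tss_window(genes, flank=1500, scaffold_sizes=None):
    """Build a seqid/start/end/strand/gene table centred on each TSS.

    scaffold_sizes is an optional, hypothetical dict {seqid: length} used to
    cap the window end; without it only the lower bound is clipped at 0.
    """
    # TSS is the gene start on '+' and the gene end on '-'.
    tss = pd.Series(
        np.where(genes["strand"] == "+", genes["start"], genes["end"]),
        index=genes.index,
    )
    start = (tss - flank).clip(lower=0)
    end = tss + flank
    if scaffold_sizes is not None:
        end = end.clip(upper=genes["seqid"].map(scaffold_sizes))
    return pd.DataFrame({
        "seqid": genes["seqid"],
        "start": start,
        "end": end,
        "strand": genes["strand"],
        "gene": genes["gene"],
    })

toy = pd.DataFrame({
    "seqid": ["scaffold_1", "scaffold_1", "scaffold_418"],
    "start": [184336, 218045, 1],
    "end": [191720, 223926, 6678],
    "strand": ["+", "-", "+"],
    "gene": ["g168", "g169", "g33"],
})
bed = tss_window(toy)
# Scaffold lengths below are invented for the illustration.
capped = tss_window(toy, scaffold_sizes={"scaffold_1": 225000, "scaffold_418": 6678})
print(bed)
print(capped)
```

Capping the end only matters for genes close to the end of a scaffold; whether it is needed depends on the downstream tool.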


In [12]:
# Removing the previously added column to get back to a regular gff file
DEG_male_gff = DEG_male_gff.drop('gene', axis=1)
DEG_partheno_gff = DEG_partheno_gff.drop('gene', axis=1)

In [13]:
DEG_male_gff

Unnamed: 0,seqid,source,type,start,end,score,strand,phase,attributes
304,scaffold_1,AUGUSTUS,gene,184336,191720,0.04,+,.,ID=g168;
338,scaffold_1,AUGUSTUS,gene,218045,223926,0.32,-,.,ID=g169;
359,scaffold_1,AUGUSTUS,gene,224516,257836,0.03,+,.,ID=g170;
1016,scaffold_1,AUGUSTUS,gene,612679,616236,0.1,-,.,ID=g190;
1037,scaffold_1,AUGUSTUS,gene,616456,619176,0.03,+,.,ID=g191;
...,...,...,...,...,...,...,...,...,...
891657,scaffold_418,AUGUSTUS,gene,1,6678,0.63,+,.,ID=g33;
892111,scaffold_462,AUGUSTUS,gene,1,4546,0.56,-,.,ID=g31176;
892307,scaffold_488,AUGUSTUS,gene,1,3322,0.54,+,.,ID=g20198;
892434,scaffold_512,AUGUSTUS,gene,510,3076,0.12,-,.,ID=g22836;


In [14]:
DEG_male_gff.to_csv(RNA_directory + 'DEG_male.gff', sep='\t', index=False, header=False)
DEG_partheno_gff.to_csv(RNA_directory + 'DEG_partheno.gff', sep='\t', index=False, header=False)
DEG_male_tss_bed.to_csv(RNA_directory + 'DEG_male_tss.bed', sep='\t', index=False, header=False)
DEG_partheno_tss_bed.to_csv(RNA_directory + 'DEG_partheno_tss.bed', sep='\t', index=False, header=False)
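
A quick read-back can confirm that the written intervals are well formed. This is a minimal sketch: `check_bed` is a hypothetical helper, and an in-memory buffer stands in for a file such as `DEG_male_tss.bed`, with two rows copied from the male table above.

```python
import io
import pandas as pd

bed_columns = ["seqid", "start", "end", "strand", "gene"]

def check_bed(handle):
    """Read a headerless 5-column BED-like table and verify its intervals."""
    bed = pd.read_csv(handle, sep="\t", header=None, names=bed_columns)
    assert (bed["start"] >= 0).all(), "negative start coordinate"
    assert (bed["start"] < bed["end"]).all(), "start is not below end"
    assert bed["strand"].isin(["+", "-"]).all(), "unexpected strand value"
    return bed

# In-memory stand-in for the file on disk.
example = io.StringIO(
    "scaffold_1\t182836\t185836\t+\tg168\n"
    "scaffold_418\t0\t1501\t+\tg33\n"
)
checked = check_bed(example)
print(len(checked))
```

Note that the standard BED layout expects the name in column 4, a score in column 5 and the strand in column 6, so strand-aware tools may need the columns reordered before they can use these files.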