# 2-5-Transcription-factors-motif-inference
Jakke Neiro$^1$
1. Aboobaker laboratory, Department of Zoology, University of Oxford

## Contents of notebook
* 1. Introduction
* 2. Transcription factor sequences
    * 2.1 Selection of sequences
    * 2.2 Protein sequences
* 3. Motif inference
    * 3.1 Jaspar profile inference
    * 3.2 Jaspar result processing
    * 3.3 Downloading motifs
* 4. Final transcription factor table

## Files
* Input: tf_genes_April2021.csv
* Output: all.meme

# 1. Introduction

DNA-binding motifs were predicted for all transcription factors (TFs) from their protein sequences using JASPAR profile inference.

# 2. Transcription factor sequences

## 2.1 Selection of sequences

Firstly, the computationally predicted and manually curated sequences were used:

In [4]:
%%bash
cd /hydra/TF_data
ls tf_genes_April2021.csv 

tf_genes_April2021.csv


In [14]:
import pandas as pd
tf_genes = pd.read_csv("/hydra/TF_data/tf_genes_April2021.csv", sep=";")

In [5]:
gffcmp = pd.read_csv("/hydra/sexual_genome_annotation_files/ncrna_Neiro/gffcmp.stringtie_merged.gtf.tmap", sep="\t")
gene2transcript = gffcmp.iloc[:,[0,3,4]]

The IDs were saved as a list: 

In [33]:
gene2transcript[gene2transcript["qry_gene_id"].isin(tf_genes.iloc[:,1])].iloc[:,2].to_csv("/hydra/TF_data/tf_genes_April2021.list", index=False, header=False)

In [34]:
%%bash
cd /hydra/TF_data
head tf_genes_April2021.list

MSTRG.22.1
MSTRG.22.2
SMEST026639002.1
SMEST026639001.1
MSTRG.27.1
MSTRG.27.2
MSTRG.27.3
SMEST026861001.1
SMEST027161004.1
SMEST026461001.1
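
The gene-to-transcript lookup above selects columns by position (`iloc[:,[0,3,4]]`). As a minimal sketch, the same lookup can be written with column names, assuming the standard gffcompare `.tmap` header (`ref_gene_id`, `qry_gene_id`, `qry_id`). The table and IDs below are toy values, not data from this analysis.

```python
import io
import pandas as pd

# Toy stand-in for a gffcompare .tmap table; every ID below is illustrative.
toy_tmap = io.StringIO(
    "ref_gene_id\tref_id\tclass_code\tqry_gene_id\tqry_id\n"
    "geneA\ttxA.1\t=\tMSTRG.1\tMSTRG.1.1\n"
    "geneA\ttxA.1\tj\tMSTRG.1\tMSTRG.1.2\n"
    "geneB\ttxB.1\t=\tMSTRG.2\tMSTRG.2.1\n"
)
tmap = pd.read_csv(toy_tmap, sep="\t")

# Select the mapping columns by name rather than by position.
gene2transcript_toy = tmap[["ref_gene_id", "qry_gene_id", "qry_id"]]

# Hypothetical list of TF gene IDs.
tf_gene_ids = ["MSTRG.1"]

# Keep every transcript (isoform) that belongs to a TF gene.
tf_transcripts = gene2transcript_toy.loc[
    gene2transcript_toy["qry_gene_id"].isin(tf_gene_ids), "qry_id"
].tolist()
print(tf_transcripts)
```

Selecting by name keeps the lookup valid if the column order of the input ever changes.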


Sequences were extracted as a fasta file:

In [35]:
%%bash
cd /hydra/TF_data
seqtk subseq /hydra/FACS/stringtie_transcripts.fa tf_genes_April2021.list > tf_genes_April2021.fasta

In [1]:
%%bash
cd /hydra/TF_data
grep ">" tf_genes_April2021.fasta | head -2
head -2 tf_genes_April2021.fasta

>MSTRG.22.1
>MSTRG.22.2
>MSTRG.22.1
GTTTTTGTTTAAATAAACTCTTATATTTTGTTCATTTAAAAGATAATTTAATTTTATAAAAATATTTTAGCATGATTGATCCTGAAGATGATGCAGTTTATTCAGTAGATGCATCACAATGTGACGATAGGGTTTGGTTGGTAAAAATTCCCAATTATTTATCAAATGAATGGATGAACTCTCCGGACAATTCTATTGTTGCAAAAATTGTAGTGGATCAAGATAAAGATAAAGAAGCTGTCTACAAATTAATTTGTAATCCCGACTTTATAAAAAATAAAGACATACCAACAGTAAACAAATTTATTATTCAAGGTATTACTGAAAAAGTTAATGTTAAATCCGAAGAACGAGCTAAAAACTTATCTATTGGTTCCCGTGATGGTAAAACGTTCATATTGCAATCATCTGATATCGAAGGAGGCTATAACCCAGGAAAAAAGCCAAGATATAAGAAAAAAGCCATTATAGGCCGTGTAAGTGTAAGATGTAATGTCATGCCACCTGATGACAATGCATATTTCGCATTGAAATCTAAACAAATTAGAACTTATAACACCCCGTTAAGAAAAACTCAAATTTGTGAGGAACAAGGTGTGGAGTTTAAACCGAAATCCACCGGCACAATCAATAAAAAGAAAGATCCCAGTGGTCGAGATGGAACTAGATCAGCCCGAATGGAGCACAGCAAACTCATGGACTTGATCTGTTCACATTTTGAGAAACATCAATTTTATAATATCAAAGACTTGATCGATCTAACTGGTCAACCGCCGGTAAGATTTCAGTGCTTATTTTGGGTATTTAACCGATTATTGTAGGGATATGTGAAGGAAATATTGAAAGAAGTTGCCACATTAAGCAAAGCTCCTTCCCGTCGCCACATGTGGGAACTGAAACCCGAATATCGCCATTATTCCTAATGTTTTATAAAAATTTTAAATATTGCCTTGTTTTTTGTTGG
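
A transcript ID in the list that has no record in the transcript FASTA would presumably yield no sequence in the extracted file, so a quick completeness check can be useful. The sketch below is one possible way to do it in plain Python; the function names and the toy records are hypothetical.

```python
def fasta_ids(lines):
    """Return the set of record IDs (first word after '>') found in FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def missing_ids(requested, fasta_lines):
    """Return the requested IDs that have no record in the FASTA, in input order."""
    present = fasta_ids(fasta_lines)
    return [seq_id for seq_id in requested if seq_id not in present]


# Toy input: two records extracted, three IDs requested.
toy_fasta = [">MSTRG.22.1", "GTTTTTG", ">MSTRG.22.2", "ACGT"]
toy_requested = ["MSTRG.22.1", "MSTRG.22.2", "MSTRG.27.1"]

missing = missing_ids(toy_requested, toy_fasta)
print(missing)
```

On real data, the two arguments would be the lines of `tf_genes_April2021.list` and `tf_genes_April2021.fasta`.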

Secondly, the sequences of TFs reported in the literature were used:

In [1]:
%%bash
cd /hydra/TF_data
ls tf_genes_May2021.xlsx

tf_genes_May2021.xlsx


In [2]:
import pandas as pd
tf_genes_lit = pd.read_excel("/hydra/TF_data/tf_genes_May2021.xlsx")

In [7]:
tf_genes_lit.iloc[565:,:]

Unnamed: 0,Column1,Column2,Symbol,Old symbol,Description,TF group,TF class,Identification,RNAi,In situ,Reference
565,SMESG000072873.1,SMESG000072873.1,foxJ1-2,foxJ1-2,forkhead box J1,FOX,,1.0,0.0,1.0,"(Vij et al., 2012; Pascual-Carreras et al., 2021)"
566,SMESG000077917.1,SMESG000077917.1,slou,,Homeobox protein slou,,,,,,
567,SMESG000078554.1,SMESG000078554.1,TLF-1,,,,,,,,
568,SMESG000065670.1,MSTRG.19539,foxA1,foxA,Smed-foxA1,,,1.0,1.0,1.0,"(Adler et al., 2014; Roberts-Galbraith et al.,..."
569,SMESG000037781.1,MSTRG.11814,foxO,foxO,,FOX,,1.0,1.0,1.0,"(van Wolfswinkel et al., 2014; Pascual-Carrera..."
570,SMESG000066725.1,MSTRG.19831,soxP-5,soxD-1,,SOX,,,,,"(Önal et al., 2012; van Wolfswinkel et al., 20..."
571,SMESG000032173.1,MSTRG.10122,ZMYM-1,ZMYM-1,,,,1.0,1.0,1.0,"(Wagner et al., 2012)"
572,SMESG000077995.1,MSTRG.21518,ZNF207-1,ZNF207-1,ZNF207-1,,,1.0,1.0,1.0,"(Wagner et al., 2012)"
573,SMESG000010063.1,MSTRG.3856,soxP-1,soxP-1,,SOX,,1.0,1.0,1.0,"(Wagner et al., 2012; Önal et al., 2012; van W..."
574,SMESG000074761.1,MSTRG.21964,soxP-2,soxP-2,,SOX,,1.0,1.0,1.0,"(Wagner et al., 2012; Önal et al., 2012; van W..."


In [2]:
import pandas as pd
gffcmp = pd.read_csv("/hydra/sexual_genome_annotation_files/ncrna_Neiro/gffcmp.stringtie_merged.gtf.tmap", sep="\t")
gene2transcript = gffcmp.iloc[:,[0,3,4]]

The IDs were saved as a list: 

In [9]:
gene2transcript[gene2transcript["qry_gene_id"].isin(tf_genes_lit.iloc[565:,1])].iloc[:,2].to_csv("/hydra/TF_data/tf_genes_May2021lit.list", index=False, header=False)

In [10]:
%%bash
cd /hydra/TF_data
head -2 tf_genes_May2021lit.list

MSTRG.624.1
MSTRG.624.2


Sequences were extracted as a fasta file:

In [11]:
%%bash
cd /hydra/TF_data
seqtk subseq /hydra/FACS/stringtie_transcripts.fa tf_genes_May2021lit.list > tf_genes_May2021lit.fasta

In [13]:
%%bash
cd /hydra/TF_data
grep -c ">" tf_genes_May2021lit.fasta 

236


## 2.2 Protein sequences

Firstly, the ORFs and protein sequences of the computationally predicted TFs were extracted:

In [None]:
#%%bash
#cd /hydra/TF_data
#nohup TransDecoder.LongOrfs -t tf_genes_April2021.fasta -O Transdecoder_tfApril2021 &

In [3]:
#%%bash
#cd /hydra/TF_data/Transdecoder_tfApril2021
#cp longest_orfs.pep longest_orfs_ip.pep
#sed -i 's_*__g' longest_orfs_ip.pep
#head longest_orfs_ip.pep

In [11]:
%%bash
cd /hydra/TF_data/Transdecoder_tfApril2021
awk 'NR % 2 == 0' longest_orfs_ip.pep > longest_orfs_ip_seq.pep
awk 'NR % 2 == 1' longest_orfs_ip.pep > longest_orfs_ip_id.pep

In [14]:
%%bash
cd /hydra/TF_data/Transdecoder_tfApril2021
head -1 longest_orfs_ip_seq.pep
head -1 longest_orfs_ip_id.pep

MIDPEDDAVYSVDASQCDDRVWLVKIPNYLSNEWMNSPDNSIVAKIVVDQDKDKEAVYKLICNPDFIKNKDIPTVNKFIIQGITEKVNVKSEERAKNLSIGSRDGKTFILQSSDIEGGYNPGKKPRYKKKAIIGRVSVRCNVMPPDDNAYFALKSKQIRTYNTPLRKTQICEEQGVEFKPKSTGTINKKKDPSGRDGTRSARMEHSKLMDLICSHFEKHQFYNIKDLIDLTGQPPVRFQCLFWVFNRLL
>MSTRG.22.1.p1 type:complete len:250 gc:universal MSTRG.22.1:72-821(+)


In [18]:
%%bash
cd /hydra/TF_data/Transdecoder_tfApril2021
wc -l longest_orfs_ip_seq.pep

2847 longest_orfs_ip_seq.pep
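
The `awk 'NR % 2'` split above relies on every record occupying exactly two lines (one header, one unwrapped sequence). As a hedged alternative, the sketch below pairs headers with sequences even if a sequence is wrapped over several lines, and drops the `*` stop symbol in the same pass. The function name and the toy records are illustrative.

```python
def read_protein_fasta(lines):
    """Yield (header, sequence) pairs; joins wrapped lines and drops '*' stop symbols."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line.replace("*", ""))
    if header is not None:
        yield header, "".join(chunks)


# Toy peptide file with one wrapped record.
toy_pep = [
    ">tx1.p1 type:complete",
    "MIDPED",
    "DAVYS*",
    ">tx2.p1 type:complete",
    "MDVSC*",
]
records = list(read_protein_fasta(toy_pep))
ids = [header for header, _ in records]
seqs = [seq for _, seq in records]
print(seqs)
```

Keeping IDs and sequences as pairs also guarantees that the two lists stay in the same order, which the downstream processing depends on.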


Secondly, the ORFs and protein sequences of TFs derived from literature were extracted: 

In [16]:
#%%bash
#cd /hydra/TF_data
#TransDecoder.LongOrfs -t tf_genes_May2021lit.fasta -O Transdecoder_tfMay2021lit

In [18]:
#%%bash
#cd /hydra/TF_data/Transdecoder_tfMay2021lit
#cp longest_orfs.pep longest_orfs_ip.pep
#sed -i 's_*__g' longest_orfs_ip.pep
#head longest_orfs_ip.pep

In [19]:
%%bash
cd /hydra/TF_data/Transdecoder_tfMay2021lit
awk 'NR % 2 == 0' longest_orfs_ip.pep > longest_orfs_ip_seq.pep
awk 'NR % 2 == 1' longest_orfs_ip.pep > longest_orfs_ip_id.pep

In [20]:
%%bash
cd /hydra/TF_data/Transdecoder_tfMay2021lit
head -1 longest_orfs_ip_seq.pep
head -1 longest_orfs_ip_id.pep

MDVSCDPEIICVFCGLEFTSVENVELHINHNHQSSSTDHIENKKTCGIDDYADQTQFDGDFFMAKENKETNLDEVQIKSKSDNSSNDPIKVIQLENREKSDLANPRKSNKPNHSKHRDPNQVKESIYSDNTRNHRESTILSISETIANLVDNKKVKLRKRLDQTFHNRKCPKCFKRFFFKVTQIIHSQNHQRKQRHKWNCEKCGFRYSRKRVLLQHYERMCGTRHSSNSKQKSLLESLRCNFCQMIFTDQIYLTMHQLRICHHFITENPKTYKATDNDKTNTESVDEPQKKLQQDDNIKDLSSKSCSTEHSISENFESESNPRSPENSSSKNDKSENLTINANQKFHWHNFDENQLKIFQCEVCKKSFSSRSSLSNHVKSHYSSRGQPFTCRDCNKRFINLLSLQDHRREMCSQKSQDSEQKIVYSLSNFNGVSQVPIWPNITNINSVNQPVTCLHCSGVFYDRLELEEHALSKHSNEQGNILCLLCDRSFTTNMALRVHLTKSHGFANGGCPGVSQNLPPMPKLETNYHIGDDLPTKKKRKDKRIYEFDKDYACDNCHRNFSTGQALGNHKRACLNLPNECISSQTPKSNSKLHSIESLIFDNSNIRSTFTNDKNVFLAFN
>MSTRG.624.1.p1 type:complete len:623 gc:universal MSTRG.624.1:989-2857(+)


In [21]:
%%bash
cd /hydra/TF_data/Transdecoder_tfMay2021lit
wc -l longest_orfs_ip_seq.pep

289 longest_orfs_ip_seq.pep


# 3. Motif inference

Motifs were predicted based on the protein sequence:

## 3.1 Jaspar profile inference

Firstly, motifs were predicted for the computationally derived TFs: 

In [22]:
%%bash
cd /hydra/TF_data
echo "#!/bin/bash" > jaspar_motif.sh
echo "while read p; do coreapi action infer read -p sequence=\$p >> jaspar_output.txt; done < Transdecoder_tfApril2021/longest_orfs_ip_seq.pep" >> jaspar_motif.sh
chmod +x jaspar_motif.sh
less jaspar_motif.sh

#!/bin/bash
while read p; do coreapi action infer read -p sequence=$p >> jaspar_output.txt; done < Transdecoder_tfApril2021/longest_orfs_ip_seq.pep


In [1]:
#%%bash
#cd /hydra/TF_data
#nohup ./jaspar_motif.sh &

Secondly, motifs were predicted for the TFs derived from literature:

In [22]:
%%bash
cd /hydra/TF_data
echo "#!/bin/bash" > jaspar_motif_lit.sh
echo "while read p; do coreapi action infer read -p sequence=\$p >> jaspar_output_lit.txt; done < Transdecoder_tfMay2021lit/longest_orfs_ip_seq.pep" >> jaspar_motif_lit.sh
chmod +x jaspar_motif_lit.sh
less jaspar_motif_lit.sh

#!/bin/bash
while read p; do coreapi action infer read -p sequence=$p >> jaspar_output_lit.txt; done < Transdecoder_tfMay2021lit/longest_orfs_ip_seq.pep


In [None]:
#%%bash
#cd /hydra/TF_data
#nohup ./jaspar_motif_lit.sh &

## 3.2 Jaspar result processing

The profile inference result for the computationally derived TFs was processed into table format:

In [1]:
file1 = open('/hydra/TF_data/jaspar_output.txt', 'r')
Lines = file1.readlines()

In [2]:
import pandas as pd
profiles = pd.read_csv("/hydra/TF_data/Transdecoder_tfApril2021/longest_orfs_ip_id.pep", header=None)

In [3]:
profiles["TranscriptID"] = " "
profiles["GeneID"] = " "
profiles["MatrixID"] = " "
profiles["Dbd"] = 0
profiles["Evalue"] = 0

In [6]:
for i in range(len(profiles)):
    trans_id = profiles.iloc[i,0].split(">")[1].split(".p")[0]
    profiles.iloc[i, 1] = trans_id
    profiles.iloc[i, 2] = gene2transcript[gene2transcript["qry_id"] ==  trans_id].iloc[0,1]

In [7]:
# Responses are matched to the rows of `profiles` by order of appearance.
other_index = 0  # row of `profiles` that the next response belongs to
for i in range(len(Lines)):
    parts = Lines[i].split(" ")
    # Each response contains a line of the form `    "count": N,`.
    if len(parts) > 4 and parts[4] == '"count":':
        results_n = int(Lines[i].split(":")[1].split(" ")[1].split(",")[0])
        if results_n > 0:
            # The first hit follows at fixed offsets from the "count" line.
            ma_id = Lines[i+6].split(":")[1].split(" ")[1].split(",")[0].split("\"")[1]
            dbd = Lines[i+7].split(":")[1].split(" ")[1].split(",")[0]
            evalue = Lines[i+5].split(":")[1].split(" ")[1].split(",")[0]
            profiles.iloc[other_index,3] = ma_id
            profiles.iloc[other_index,4] = float(dbd)
            profiles.iloc[other_index,5] = float(evalue)
        else:
            # No hit: placeholder matrix ID and an E-value of 1, which is filtered out later.
            profiles.iloc[other_index,3] = "-"
            profiles.iloc[other_index,4] = 0.0
            profiles.iloc[other_index,5] = 1.0
        other_index += 1
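
The parsing above depends on fixed line offsets within each response. If the output file is a concatenation of valid JSON objects, the same fields could be read with the standard `json` module instead. This is only a sketch under that assumption: the field names `matrix_id`, `dbd` and `evalue` are inferred from the variable names in the notebook, and the toy responses are invented for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def best_hit(response):
    """Return (matrix_id, dbd, evalue) of the first result, or the no-hit placeholder."""
    if response.get("count", 0) > 0:
        first = response["results"][0]
        return first["matrix_id"], float(first["dbd"]), float(first["evalue"])
    return "-", 0.0, 1.0


# Toy responses: one with a hit, one without (values are illustrative).
toy_output = """
{"count": 1, "results": [{"evalue": 1e-40, "matrix_id": "MA0830.1", "dbd": 0.9}]}
{"count": 0, "results": []}
"""
hits = [best_hit(obj) for obj in iter_json_objects(toy_output)]
print(hits)
```

The no-hit placeholder `("-", 0.0, 1.0)` mirrors the values written by the notebook loop, so the later `Evalue < 1` filter behaves the same way.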

In [8]:
profiles_filtered = profiles.loc[profiles.groupby("GeneID")["Evalue"].idxmin()]

In [9]:
profiles_filtered = profiles_filtered[profiles_filtered["Evalue"] < 1]
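
The two filtering steps keep, for each gene, the isoform with the lowest E-value, and then discard genes whose best E-value is the no-hit placeholder of 1. A toy example of the same `groupby` and `idxmin` pattern, with invented IDs and values:

```python
import pandas as pd

# Toy table: gene g1 has two isoforms with hits, gene g2 has no hit.
toy_profiles = pd.DataFrame(
    {
        "TranscriptID": ["g1.1", "g1.2", "g2.1"],
        "GeneID": ["g1", "g1", "g2"],
        "MatrixID": ["MA0001.1", "MA0002.1", "-"],
        "Evalue": [1e-5, 1e-20, 1.0],
    }
)

# One row per gene: the row label of the minimum E-value within each gene.
best = toy_profiles.loc[toy_profiles.groupby("GeneID")["Evalue"].idxmin()]

# Drop genes whose best E-value is the no-hit placeholder.
best = best[best["Evalue"] < 1]

kept = best["TranscriptID"].tolist()
print(kept)
```

Only `g1.2` survives: it beats `g1.1` within gene `g1`, and `g2` is removed because its only isoform had no hit.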

In [10]:
tf_genes = pd.read_csv("/hydra/TF_data/tf_genes_April2021.csv", sep=";")

In [11]:
profiles_filtered["Symbol"] = " "
for i in range(len(profiles_filtered)):
    name = tf_genes[tf_genes["Neiro"] == profiles_filtered.iloc[i, 2]].iloc[0, 3]
    profiles_filtered.iloc[i, 6] = name

In [13]:
profiles_filtered["URL"] = " "
for i in range(len(profiles_filtered)):
    profiles_filtered.iloc[i,7] = "http://jaspar.genereg.net/api/v1/matrix/" + profiles_filtered.iloc[i,3] + "/?format=meme"

In [24]:
profiles_filtered.iloc[:,7].to_csv("/hydra/TF_data/April2021motifs/matrixURL.list", index=False, header=False)
profiles_filtered.iloc[:,2].to_csv("/hydra/TF_data/April2021motifs/matrixURLname.list", index=False, header=False)

In [17]:
profiles_filtered.to_csv("/hydra/TF_data/jaspar_filtered.csv", index=False)

The profile inference result for the TFs derived from the literature was processed into table format:

In [3]:
file1 = open('/hydra/TF_data/jaspar_output_lit.txt', 'r')
Lines = file1.readlines()
import pandas as pd
profiles = pd.read_csv("/hydra/TF_data/Transdecoder_tfMay2021lit/longest_orfs_ip_id.pep", header=None)
profiles["TranscriptID"] = " "
profiles["GeneID"] = " "
profiles["MatrixID"] = " "
profiles["Dbd"] = 0
profiles["Evalue"] = 0
for i in range(len(profiles)):
    trans_id = profiles.iloc[i,0].split(">")[1].split(".p")[0]
    profiles.iloc[i, 1] = trans_id
    profiles.iloc[i, 2] = gene2transcript[gene2transcript["qry_id"] ==  trans_id].iloc[0,1]
# Responses are matched to the rows of `profiles` by order of appearance.
other_index = 0  # row of `profiles` that the next response belongs to
for i in range(len(Lines)):
    parts = Lines[i].split(" ")
    # Each response contains a line of the form `    "count": N,`.
    if len(parts) > 4 and parts[4] == '"count":':
        results_n = int(Lines[i].split(":")[1].split(" ")[1].split(",")[0])
        if results_n > 0:
            # The first hit follows at fixed offsets from the "count" line.
            ma_id = Lines[i+6].split(":")[1].split(" ")[1].split(",")[0].split("\"")[1]
            dbd = Lines[i+7].split(":")[1].split(" ")[1].split(",")[0]
            evalue = Lines[i+5].split(":")[1].split(" ")[1].split(",")[0]
            profiles.iloc[other_index,3] = ma_id
            profiles.iloc[other_index,4] = float(dbd)
            profiles.iloc[other_index,5] = float(evalue)
        else:
            # No hit: placeholder matrix ID and an E-value of 1, which is filtered out later.
            profiles.iloc[other_index,3] = "-"
            profiles.iloc[other_index,4] = 0.0
            profiles.iloc[other_index,5] = 1.0
        other_index += 1

In [4]:
profiles_filtered = profiles.loc[profiles.groupby("GeneID")["Evalue"].idxmin()]
profiles_filtered = profiles_filtered[profiles_filtered["Evalue"] < 1]

In [6]:
import pandas as pd
tf_genes_lit = pd.read_excel("/hydra/TF_data/tf_genes_May2021.xlsx")
tf_genes = tf_genes_lit.iloc[565:,:]

In [9]:
profiles_filtered["Symbol"] = " "
for i in range(len(profiles_filtered)):
    name = tf_genes[tf_genes["Column2"] == profiles_filtered.iloc[i, 2]].iloc[0, 3]
    profiles_filtered.iloc[i, 6] = name

In [10]:
profiles_filtered["URL"] = " "
for i in range(len(profiles_filtered)):
    profiles_filtered.iloc[i,7] = "http://jaspar.genereg.net/api/v1/matrix/" + profiles_filtered.iloc[i,3] + "/?format=meme"

In [11]:
profiles_filtered.iloc[:,7].to_csv("/hydra/TF_data/April2021motifs/matrixURLlit.list", index=False, header=False)
profiles_filtered.iloc[:,2].to_csv("/hydra/TF_data/April2021motifs/matrixURLnamelit.list", index=False, header=False)

In [12]:
profiles_filtered.to_csv("/hydra/TF_data/jaspar_filtered_lit.csv", index=False)

## 3.3 Downloading motifs

Based on the matrix IDs, the motifs were downloaded in MEME format.

In [30]:
%%bash
cd /hydra/TF_data/April2021motifs
echo "#!/bin/bash" > jasparurl.sh
echo "while read p; do wget \$p -O \${p:40:6}.meme; done < matrixURL.list" >> jasparurl.sh
chmod +x jasparurl.sh
less jasparurl.sh

#!/bin/bash
while read p; do wget $p -O ${p:40:6}.meme; done < matrixURL.list
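
In the download loop, `${p:40:6}` takes six characters starting at offset 40 of the URL, which yields the matrix ID without its version suffix (for example `MA0830` from `MA0830.1`). This depends on the fixed length of the URL prefix. A sketch of a less position-dependent way to derive the same file name, with hypothetical helper names:

```python
from urllib.parse import urlparse


def matrix_id_from_url(url):
    """Return the matrix ID, i.e. the last non-empty path component of the URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return parts[-1]


def meme_filename(url, keep_version=False):
    """Name the output file after the matrix ID, optionally keeping the version suffix."""
    matrix_id = matrix_id_from_url(url)
    if not keep_version:
        matrix_id = matrix_id.split(".")[0]
    return matrix_id + ".meme"


url = "http://jaspar.genereg.net/api/v1/matrix/MA0830.1/?format=meme"
print(meme_filename(url))                      # MA0830.meme
print(meme_filename(url, keep_version=True))   # MA0830.1.meme
```

Because the version is dropped by default (as in the shell loop), two versions of the same matrix would be written to the same file; `keep_version=True` avoids that.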


In [33]:
#%%bash
#cd /hydra/TF_data/April2021motifs
#./jasparurl.sh

In [13]:
%%bash
cd /hydra/TF_data/April2021motifs
echo "#!/bin/bash" > jasparurl_lit.sh
echo "while read p; do wget \$p -O \${p:40:6}.meme; done < matrixURLlit.list" >> jasparurl_lit.sh
chmod +x jasparurl_lit.sh
less jasparurl_lit.sh

#!/bin/bash
while read p; do wget $p -O ${p:40:6}.meme; done < matrixURLlit.list


In [None]:
#%%bash
#cd /hydra/TF_data/April2021motifs
#./jasparurl_lit.sh

In [14]:
%%bash
cd /hydra/TF_data/April2021motifs
#rm all.meme
meme2meme *meme > all.meme

In [15]:
%%bash
cd /hydra/TF_data/April2021motifs
ls -l all.meme

-rw-rw-r-- 1 ubuntu ubuntu 98110 Jun  7 21:50 all.meme


# 4. Final transcription factor table

The TPM values, proportional expression values, motif IDs and motif E-values were added to the transcription factor table.

In [1]:
import pandas as pd
facs = pd.read_csv("/hydra/FACS/FACS_prop.csv")

In [2]:
TF_table = pd.read_excel("/hydra/TF_data/Transcription_factors_01072021.xlsx")

In [17]:
profiles_filtered1 = pd.read_csv("/hydra/TF_data/jaspar_filtered.csv")
profiles_filtered2 = pd.read_csv("/hydra/TF_data/jaspar_filtered_lit.csv")
profiles_filtered = pd.concat([profiles_filtered1, profiles_filtered2])

In [38]:
# Expression values and E-values are floats; initialise the columns accordingly.
for col in ["X1", "X2", "Xins", "X1 percent", "X2 percent", "Xins percent"]:
    TF_table[col] = 0.0
TF_table["Jaspar MatrixID"] = "0"
TF_table["Jaspar Evalue"] = 0.0
# FACS column -> TF_table column
facs_columns = {"X1": "X1", "X2": "X2", "Xins": "Xins",
                "X1.prop": "X1 percent", "X2.prop": "X2 percent", "Xins.prop": "Xins percent"}
for idx in TF_table.index:
    gene = TF_table.at[idx, "Neiro"]
    facs_row = facs[facs["gene"] == gene].iloc[0]
    # Assign through .at on the DataFrame itself; chained `TF_table[col].iloc[i] = ...`
    # may write to a copy and triggers SettingWithCopyWarning.
    for facs_col, table_col in facs_columns.items():
        TF_table.at[idx, table_col] = facs_row[facs_col]
    matrixidtable = profiles_filtered[profiles_filtered["GeneID"] == gene]
    if len(matrixidtable) > 0:
        TF_table.at[idx, "Jaspar MatrixID"] = matrixidtable["MatrixID"].iloc[0]
        TF_table.at[idx, "Jaspar Evalue"] = matrixidtable["Evalue"].iloc[0]


In [42]:
TF_table.to_excel("/hydra/TF_data/Transcription_factors_01072021_values.xlsx")
TF_table.to_csv("/hydra/TF_data/Transcription_factors_01072021_values.csv")
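
The row-by-row filling of the table can also be expressed as two left joins. The sketch below illustrates the idea on toy frames with a reduced set of columns; all values are invented. Unlike the loop, a gene missing from the expression table would end up with missing values here instead of raising an error.

```python
import pandas as pd

# Toy stand-ins for TF_table, facs and profiles_filtered.
tf = pd.DataFrame({"Neiro": ["g1", "g2"], "Symbol": ["a", "b"]})
facs_toy = pd.DataFrame(
    {"gene": ["g1", "g2", "g3"], "X1": [1.0, 2.0, 3.0], "X1.prop": [0.5, 0.25, 0.1]}
)
motifs = pd.DataFrame({"GeneID": ["g2"], "MatrixID": ["MA0001.1"], "Evalue": [1e-10]})

merged = (
    tf.merge(facs_toy.drop_duplicates("gene"), how="left", left_on="Neiro", right_on="gene")
    .drop(columns="gene")
    .rename(columns={"X1.prop": "X1 percent"})
    .merge(motifs.drop_duplicates("GeneID"), how="left", left_on="Neiro", right_on="GeneID")
    .drop(columns="GeneID")
    .rename(columns={"MatrixID": "Jaspar MatrixID", "Evalue": "Jaspar Evalue"})
)

# Reproduce the placeholders used for genes without a motif.
merged["Jaspar MatrixID"] = merged["Jaspar MatrixID"].fillna("0")
merged["Jaspar Evalue"] = merged["Jaspar Evalue"].fillna(0.0)
print(merged)
```

`drop_duplicates` keeps the first row per gene, which corresponds to the `.iloc[0]` used in the loop.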

## 4.2 Isoforms of transcription factors

In [2]:
import pandas as pd
TF_table = pd.read_csv("/hydra/TF_data/Transcription_factors_01072021_values.csv").iloc[:,1:]

In [3]:
len(TF_table)

551

In [8]:
len(TF_table[~TF_table["Reference"].isna()])

248

In [11]:
TF_table[~TF_table["Reference"].isna()]

Unnamed: 0,Rink,Neiro,Symbol,Old symbol,Description,TF group,TF class,Identification,RNAi,In situ,Reference,X1,X2,Xins,X1 percent,X2 percent,Xins percent,Jaspar MatrixID,Jaspar Evalue
0,SMESG000003328.1,MSTRG.707,fos-1,,Fos proto-oncogene,FOS,Basic domain,1.0,Cyclopic blastemas and asymmetric tails,Ubiquitous,"(Wenemoser et al., 2012; Zhu et al., 2015)",73.531050,56.195082,82.571524,0.346358,0.264699,0.388942,0,0.000000e+00
3,SMESG000044121.1,MSTRG.13956,atf1,ATFl1,Cyclic AMP-dependent transcription factor ATF-1,ATF,Basic domain,1.0,0,0,"(Wenemoser et al., 2012)",19.975386,73.754351,29.475104,0.162132,0.598632,0.239237,0,0.000000e+00
11,SMESG000032044.1,MSTRG.10086,da,,Daughterless,E2A,Basic domain,1.0,Eye defects,Ubiquitous,"(Cowles et al., 2013; Scimone et al., 2018)",27.345086,28.801769,6.739212,0.434835,0.457999,0.107165,MA0830.1,1.074150e-40
12,SMESG000034317.1,MSTRG.11214,myoD,,Myogenic differentiation 1,MYOD,Basic domain,1.0,CNS and blastema defect,Mesenchyme,"(Cowles et al., 2013; Scimone et al., 2018; Ra...",6.057592,4.919220,6.696601,0.342752,0.278340,0.378908,MA0499.1,2.134130e-21
13,SMESG000032619.1,MSTRG.10663,ascl-2,,Achaete scute like-2,MYOD,Basic domain,1.0,No defect,Mesenchyme,"(Cowles et al., 2013; Fincher et al., 2018)",6.350572,9.947515,12.070575,0.223859,0.350652,0.425490,0,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
493,SMESG000038786.1,MSTRG.12140,tcf4,,TCF/LEF transcription factor 4,TCF,Other,1.0,0,CNS,"(Brown et al., 2018)",13.809530,25.002619,12.127656,0.271095,0.490827,0.238078,0,0.000000e+00
494,SMESG000017328.1,MSTRG.5099,tcf5,,TCF/LEF transcription factor 5,TCF,Other,1.0,0,CNS,"(Brown et al., 2018)",5.024201,13.912388,4.023106,0.218827,0.605948,0.175225,0,0.000000e+00
498,SMESG000013159.1,MSTRG.4715,Dach,,Dachshund,DACH,Other,1.0,No eye defect,Scattered in the head,"(Lapan and Reddien, 2011)",0.902512,5.428261,1.249944,0.119054,0.716062,0.164885,0,0.000000e+00
501,SMESG000067290.1,MSTRG.19902,ski-1,ski,Ski oncogene 1,SKI,Other,1.0,0,Neoblasts,"(Molinaro et al., 2016; Stückemann et al., 2017)",2.109682,30.611419,23.561391,0.037484,0.543889,0.418627,MA0508.1,7.792400e-01


In [2]:
counts = pd.read_csv("/hydra/sexual_genome_annotation_files/ncrna_Neiro/counts_annotated.csv")

In [18]:
counts[counts.iloc[:,0].isin(TF_table["Neiro"])].sort_values(by="Counts", ascending=False).iloc[120:150,:]

Unnamed: 0,Genes,Counts,Description,Symbol,Synonym
23584,MSTRG.9898,5.0,SMAD family member 4 [Source:HGNC Symbol;Acc:H...,SMAD4,DPC4
16813,MSTRG.3802,5.0,four and a half LIM domains 2 [Source:HGNC Sym...,FHL2,DRAL
18738,MSTRG.5535,5.0,zinc finger protein 541 [Source:HGNC Symbol;Ac...,ZNF541,DKFZp434I1930
10925,MSTRG.19831,5.0,SRY-box transcription factor 5 [Source:HGNC Sy...,SOX5,L-SOX5
12807,MSTRG.21524,5.0,tumor protein p63 [Source:HGNC Symbol;Acc:HGNC...,TP63,EEC3
10851,MSTRG.19765,5.0,forkhead box F1 [Source:HGNC Symbol;Acc:HGNC:3...,FOXF1,FKHL5
17772,MSTRG.4666,5.0,zinc finger E-box binding homeobox 1 [Source:H...,ZEB1,AREB6
13295,MSTRG.21964,5.0,kelch domain containing 9 [Source:HGNC Symbol;...,KLHDC9,KARCA1
987,MSTRG.10887,5.0,aryl hydrocarbon receptor nuclear translocator...,ARNT,bHLHe2
10141,MSTRG.19125,5.0,PR/SET domain 1 [Source:HGNC Symbol;Acc:HGNC:9...,PRDM1,BLIMP1


In [28]:
TF_table[TF_table["Neiro"] == "MSTRG.442"]

Unnamed: 0,Rink,Neiro,Symbol,Old symbol,Description,TF group,TF class,Identification,RNAi,In situ,Reference,X1,X2,Xins,X1 percent,X2 percent,Xins percent,Jaspar MatrixID,Jaspar Evalue
245,SMESG000025656.1,MSTRG.442,Hox3b,Xlox,Homeobox protein 3b,HOX,Helix-turn-helix,1.0,Fission defects,Anterior axially restricted,"(Currie et al., 2016; Stückemann et al., 2017)",1.645654,0.765388,0.190008,0.632688,0.294261,0.073051,0,0.0


In [29]:
%%bash
cd /hydra/sexual_genome_annotation_files/ncrna_Neiro/
grep "MSTRG.442" stringtie_merged.gtf

dd_Smes_g4_1	StringTie	transcript	11234218	11237544	1000	+	.	gene_id "MSTRG.442"; transcript_id "MSTRG.442.1"; 
dd_Smes_g4_1	StringTie	exon	11234218	11235030	1000	+	.	gene_id "MSTRG.442"; transcript_id "MSTRG.442.1"; exon_number "1"; 
dd_Smes_g4_1	StringTie	exon	11235271	11235482	1000	+	.	gene_id "MSTRG.442"; transcript_id "MSTRG.442.1"; exon_number "2"; 
dd_Smes_g4_1	StringTie	exon	11236833	11237544	1000	+	.	gene_id "MSTRG.442"; transcript_id "MSTRG.442.1"; exon_number "3"; 
dd_Smes_g4_1	StringTie	transcript	11234419	11237544	1000	+	.	gene_id "MSTRG.442"; transcript_id "MSTRG.442.2"; 
dd_Smes_g4_1	StringTie	exon	11234419	11235030	1000	+	.	gene_id "MSTRG.442"; transcript_id "MSTRG.442.2"; exon_number "1"; 
dd_Smes_g4_1	StringTie	exon	11235244	11235482	1000	+	.	gene_id "MSTRG.442"; transcript_id "MSTRG.442.2"; exon_number "2"; 
dd_Smes_g4_1	StringTie	exon	11236833	11237544	1000	+	.	gene_id "MSTRG.442"; transcript_id "MSTRG.442.2"; exon_number "3"; 
dd_Smes_g4_1	StringTie	transcript	1123
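
A plain `grep "MSTRG.442"` also matches longer IDs such as a hypothetical `MSTRG.4420`, and it returns exon rows as well as transcript rows. As a minimal sketch, isoforms can be listed per gene by matching the `gene_id` attribute exactly and using only `transcript` rows. The first three toy rows are taken from the output above; the last row, with gene `MSTRG.4420`, is invented to show the difference.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR = re.compile(r'(\w+) "([^"]*)"')


def isoforms_per_gene(gtf_lines):
    """Map gene_id -> sorted list of transcript_ids, using only 'transcript' rows."""
    result = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR.findall(fields[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            result[attrs["gene_id"]].add(attrs["transcript_id"])
    return {gene: sorted(ids) for gene, ids in result.items()}


def gtf_row(feature, start, end, attributes):
    """Assemble one tab-separated GTF row for the toy input."""
    return "\t".join(
        ["dd_Smes_g4_1", "StringTie", feature, start, end, "1000", "+", ".", attributes]
    )


toy_gtf = [
    gtf_row("transcript", "11234218", "11237544",
            'gene_id "MSTRG.442"; transcript_id "MSTRG.442.1"; '),
    gtf_row("exon", "11234218", "11235030",
            'gene_id "MSTRG.442"; transcript_id "MSTRG.442.1"; exon_number "1"; '),
    gtf_row("transcript", "11234419", "11237544",
            'gene_id "MSTRG.442"; transcript_id "MSTRG.442.2"; '),
    # Hypothetical row for a different gene whose ID merely starts with MSTRG.442.
    gtf_row("transcript", "100", "200",
            'gene_id "MSTRG.4420"; transcript_id "MSTRG.4420.1"; '),
]
isoforms = isoforms_per_gene(toy_gtf)
print(isoforms["MSTRG.442"])
```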

# FINISHED