# 2-1-Transcription-factors-identification
Jakke Neiro$^1$
1. Aboobaker laboratory, Department of Zoology, University of Oxford

## Contents of notebook

* 1. Introduction
* 2. Classification of protein families: interproscan
    * 2.1 Transdecoder
    * 2.2 Interproscan
* 3. Identification of transcription factors
    * 3.1 Transcription factor IDs
    * 3.2 Transcription factor IDs in interproscan result
    * 3.3 From Interproscan result to gene IDs
    * 3.4 Comparison with Swapna et al. 2018
* 4. Manual curation

## Files
* Input: stringtie_transcripts.fa
* Output:

# 1. Introduction

Transcription factors (TFs) were identified based on the new annotation. ORFs and peptide sequences were generated with Transdecoder, and the peptides were scanned for TF-like protein domains with Interproscan. TFs were selected by three criteria: Pfam IDs, Superfamily IDs, and the phrase "Transcription factor" in the Interproscan description. Lastly, the computationally derived TFs were manually curated.

# 2. Classification of protein families: interproscan

## 2.1 Transdecoder

In [2]:
%%bash
cd /hydra/FACS/
grep -c ">" stringtie_transcripts.fa

91068


Transdecoder was used to obtain open reading frames (ORFs):

In [None]:
#%%bash
#cd /hydra/FACS/
#nohup TransDecoder.LongOrfs -t stringtie_transcripts.fa -O Transdecoder_all &

The trailing * (stop codon) was removed from each protein sequence:

In [None]:
#%%bash
#cd /hydra/FACS/Transdecoder_all
#cp longest_orfs.pep longest_orfs_ip.pep
#sed -i 's_*__g' longest_orfs_ip.pep
#head longest_orfs_ip.pep
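
The same clean-up can be sketched in Python. This is a minimal illustration on toy records, not the command that was run: the helper name `strip_stop_asterisks` and the toy sequences are assumptions. Unlike the sed command above, which removes every * in the file, the sketch only touches sequence lines and leaves FASTA headers unchanged.

```python
import io

def strip_stop_asterisks(lines):
    """Yield FASTA lines with '*' removed from sequence lines; header lines pass through unchanged."""
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            yield line.replace("*", "")

# Toy input standing in for longest_orfs.pep (invented records, not real data).
toy_fasta = io.StringIO(
    ">MSTRG.1.1.p1 type:complete\n"
    "MKTLLV*\n"
    ">MSTRG.2.1.p1 type:5prime_partial\n"
    "GAVLK*\n"
)

cleaned = "".join(strip_stop_asterisks(toy_fasta))
print(cleaned)
```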

In [4]:
%%bash
cd /hydra/FACS/Transdecoder_all
grep -c ">" longest_orfs_ip.pep

112819


## 2.2 Interproscan

Interproscan was run on the peptide sequences with the -goterms option:

In [None]:
%%bash
cd /hydra/FACS
cd TF_interpro
echo "#!/bin/bash" > interproscan_TF.sh
echo "/hydra/software/interproscan-5.46-81.0/interproscan.sh -i /hydra/FACS/Transdecoder_all/longest_orfs_ip.pep -d /hydra/FACS/TF_interpro/interproscan_24082020 -goterms " >> interproscan_TF.sh
chmod +x interproscan_TF.sh
less interproscan_TF.sh

In [None]:
#%%bash
#cd /hydra/FACS/TF_interpro
#nohup ./interproscan_TF.sh &

# 3. Identification of transcription factors

Transcription factors were selected by three criteria: Pfam IDs, Superfamily IDs, and the phrase "Transcription factor" in the Interproscan description.

## 3.1 Transcription factor IDs

The Pfam IDs were reorganised into a list:

In [None]:
import pandas as pd
pfam_tf = pd.read_csv("/hydra/sexual_genome_annotation_files/Swapna/pfam_tf.txt", sep="\t", header=None)

In [None]:
pfam_tf.iloc[:,1].to_csv("/hydra/sexual_genome_annotation_files/Swapna/pfam_tf_list.txt", index=False, header=None)

The Superfamily IDs were reorganised into a list: 

In [None]:
import pandas as pd
superfamily_tf = pd.read_csv("/hydra/sexual_genome_annotation_files/Swapna/SUPERFAMILY_1_69.dbds.v2.03.txt", sep="\t", header=None)
superfamily_tf_id = []
for i in range(0, len(superfamily_tf)):
    superfamily_tf_id.append("SSF"+str(superfamily_tf.iloc[i,5]))
superfamily_tf_id[0:5]

In [None]:
pd.DataFrame(superfamily_tf_id).to_csv("/hydra/sexual_genome_annotation_files/Swapna/superfamily_list.txt", index=False, header=None)
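
As an aside, the SSF prefix can also be added in one vectorised step. The sketch below uses a toy table in place of the SUPERFAMILY file, so the values and the variable names `toy`, `ssf_ids` and `ssf_unique` are assumptions; only the use of the sixth column (index 5) follows the cell above.

```python
import pandas as pd

# Toy table standing in for the SUPERFAMILY file; only column index 5 matters here.
toy = pd.DataFrame([
    [0, 0, 0, 0, 0, 56548],
    [0, 0, 0, 0, 0, 49417],
    [0, 0, 0, 0, 0, 49417],
])

# Prefix every numeric ID with "SSF" without an explicit Python loop.
ssf_ids = "SSF" + toy.iloc[:, 5].astype(str)

# Optional: drop repeated IDs while keeping first-seen order.
ssf_unique = ssf_ids.drop_duplicates().tolist()
print(ssf_unique)
```

Dropping repeated IDs before the search step would avoid matching the same ID more than once, if the source file lists an ID on several rows.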

In [None]:
%%bash
cd /hydra/sexual_genome_annotation_files/Swapna
wc -l superfamily_list.txt

In [6]:
%%bash
cd /hydra/sexual_genome_annotation_files/Swapna
head -2 superfamily_list.txt

SSF56548
SSF49417


## 3.2 Transcription factor IDs in interproscan result

In [7]:
%%bash
cd /hydra/FACS/TF_interpro/interproscan_24082020
wc -l longest_orfs_ip.pep.tsv

700276 longest_orfs_ip.pep.tsv


Transcription factors were chosen based on pfam_ids:

In [None]:
#%%bash
#cd /hydra/FACS/TF_interpro/interproscan_24082020
#while read p; do grep -w $p longest_orfs_ip.pep.tsv >> pfam.tf.pep.tsv ; done < /hydra/sexual_genome_annotation_files/Swapna/pfam_tf_list.txt

In [8]:
%%bash
cd /hydra/FACS/TF_interpro/interproscan_24082020
wc -l pfam.tf.pep.tsv

4202 pfam.tf.pep.tsv
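
The grep-based selection matches an ID anywhere on a line. A stricter alternative, sketched here on toy data, compares the IDs only against the signature accession field, which is the fifth column of the Interproscan TSV output. The column labels, the toy rows and the toy ID set are assumptions for illustration; real output has more columns than shown.

```python
import io
import pandas as pd

# Hypothetical labels for the first six Interproscan TSV fields.
cols = ["protein", "md5", "length", "analysis", "signature_acc", "signature_desc"]

# Toy rows standing in for longest_orfs_ip.pep.tsv (invented, not real results).
toy_tsv = io.StringIO(
    "MSTRG.1.1.p1\tabc\t120\tPfam\tPF00046\tHomeodomain\n"
    "MSTRG.2.1.p1\tdef\t300\tPfam\tPF00069\tProtein kinase domain\n"
    "MSTRG.3.1.p2\tghi\t210\tSUPERFAMILY\tSSF46689\tHomeodomain-like\n"
)
hits = pd.read_csv(toy_tsv, sep="\t", header=None, names=cols)

# Toy stand-in for the contents of pfam_tf_list.txt.
tf_pfam_ids = {"PF00046", "PF00010"}

# Keep rows whose signature accession is one of the TF Pfam IDs.
pfam_hits = hits[hits["signature_acc"].isin(tf_pfam_ids)]
print(pfam_hits["protein"].tolist())
```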


Transcription factors were chosen based on superfamily_ids:

In [None]:
#%%bash
#cd /hydra/FACS/TF_interpro/interproscan_24082020
#while read p; do grep -w $p longest_orfs_ip.pep.tsv >> superfamily.tf.pep.tsv ; done < /hydra/sexual_genome_annotation_files/Swapna/superfamily_list.txt

In [9]:
%%bash
cd /hydra/FACS/TF_interpro/interproscan_24082020
wc -l superfamily.tf.pep.tsv

132410 superfamily.tf.pep.tsv


Transcription factors were chosen based on the name "Transcription factor" in Interproscan results:

In [None]:
%%bash
cd /hydra/FACS/TF_interpro/interproscan_24082020
grep "Transcription factor" longest_orfs_ip.pep.tsv > transcription_factor.tf.pep.tsv

In [None]:
%%bash
cd /hydra/FACS/TF_interpro/interproscan_24082020
wc -l transcription_factor.tf.pep.tsv

The results were combined:

In [None]:
%%bash
cd /hydra/FACS/TF_interpro/interproscan_24082020
cat pfam.tf.pep.tsv superfamily.tf.pep.tsv transcription_factor.tf.pep.tsv > final.tf.pep.tsv

In [10]:
%%bash
cd /hydra/FACS/TF_interpro/interproscan_24082020
wc -l final.tf.pep.tsv

137758 final.tf.pep.tsv
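
Because the three result files are concatenated, a row that satisfies more than one criterion appears more than once in final.tf.pep.tsv. This does not affect the gene list derived later, since IDs are de-duplicated there, but the line count overstates the number of distinct hits. A minimal sketch of combining the criteria so that each row is kept once, using toy data and a single hypothetical `description` column:

```python
import pandas as pd

# Toy hits (invented); "description" is a hypothetical merged description column.
hits = pd.DataFrame({
    "protein": ["MSTRG.1.1.p1", "MSTRG.2.1.p1", "MSTRG.3.1.p1", "MSTRG.4.1.p1"],
    "signature_acc": ["PF00319", "SSF46689", "PF00069", "G3DSA:toy"],
    "description": [
        "Transcription factor, MADS-box",
        "Homeodomain-like",
        "Protein kinase domain",
        "Transcription factor E2F",
    ],
})
pfam_ids = {"PF00319"}   # toy stand-in for the Pfam TF list
ssf_ids = {"SSF46689"}   # toy stand-in for the Superfamily TF list

is_pfam = hits["signature_acc"].isin(pfam_ids)
is_ssf = hits["signature_acc"].isin(ssf_ids)
is_named = hits["description"].str.contains("Transcription factor", regex=False, na=False)

# Logical OR keeps each qualifying row exactly once.
tf_hits = hits[is_pfam | is_ssf | is_named]

# Row count that plain concatenation of the three selections would give.
n_concatenated = int(is_pfam.sum() + is_ssf.sum() + is_named.sum())
print(len(tf_hits), n_concatenated)
```

In this toy case the first row matches both the Pfam list and the keyword, so concatenation counts it twice.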


## 3.3 From Interproscan result to gene IDs 

In [None]:
import pandas as pd
interpro_all_df = pd.read_csv("/hydra/FACS/TF_interpro/interproscan_24082020/final.tf.pep.tsv", sep="\t", header=None, low_memory=False)

In [None]:
len(interpro_all_df)

Protein IDs are converted into transcript IDs:

In [None]:
interpro_all_df_transcript_ids = []
for i in range(len(interpro_all_df)):
    id_i = interpro_all_df.iloc[i,0].split(".p")[0]
    if id_i not in interpro_all_df_transcript_ids:
        interpro_all_df_transcript_ids.append(id_i)

In [None]:
len(interpro_all_df_transcript_ids)
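
A vectorised variant of this conversion is sketched below on toy IDs. It assumes that protein IDs end in a ".p" suffix followed by a number, as in the Transdecoder output used here; anchoring the pattern at the end of the string avoids cutting an ID at an earlier ".p". The variable names are illustrative.

```python
import pandas as pd

# Toy protein IDs (invented) in the "<transcript>.p<number>" form.
protein_ids = pd.Series(["MSTRG.1.1.p1", "MSTRG.1.1.p2", "MSTRG.7.2.p1"])

# Remove only a trailing ".p<number>" suffix, then keep each transcript ID once.
transcript_ids = (
    protein_ids.str.replace(r"\.p\d+$", "", regex=True)
    .drop_duplicates()
    .tolist()
)
print(transcript_ids)
```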

The table containing the correspondence between transcript ID and gene ID is loaded:

In [None]:
gffcmp = pd.read_csv("/hydra/sexual_genome_annotation_files/ncrna_Neiro/gffcmp.stringtie_merged.gtf.tmap", sep="\t")
gene2transcript = gffcmp.iloc[:,[0,3,4]]
gene2transcript

Transcripts with a TF ID are selected:

In [None]:
tf_transcripts = gene2transcript[gene2transcript["qry_id"].isin(interpro_all_df_transcript_ids)]

In [None]:
qry_gene_id = pd.Categorical(tf_transcripts["qry_gene_id"]).categories.to_list()
ref_gene_id = []
for i in range(len(qry_gene_id)):
    ref_gene_id.append(tf_transcripts[tf_transcripts["qry_gene_id"] == qry_gene_id[i]].iloc[0,0])
tf_genes_Aug2020 = pd.DataFrame({"Rink": ref_gene_id, "Neiro": qry_gene_id})

In [None]:
len(tf_genes_Aug2020)
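
The loop above keeps, for every query gene, the reference gene ID of its first transcript. The same table can be sketched with a groupby on toy data; the IDs below are invented and the variable names are illustrative. One difference to be aware of: `first()` returns the first non-null value per group, whereas the loop takes the first row as it is.

```python
import pandas as pd

# Toy stand-in for tf_transcripts (invented IDs, same three column names as the tmap subset).
tf_transcripts = pd.DataFrame({
    "ref_gene_id": ["REF_A", "REF_A", "REF_B"],
    "qry_gene_id": ["MSTRG.1", "MSTRG.1", "MSTRG.7"],
    "qry_id": ["MSTRG.1.1", "MSTRG.1.2", "MSTRG.7.2"],
})

# One row per query gene, sorted by gene ID, with the first reference gene ID seen.
first_ref = tf_transcripts.groupby("qry_gene_id", sort=True)["ref_gene_id"].first()

tf_genes_sketch = pd.DataFrame({
    "Rink": first_ref.to_numpy(),
    "Neiro": first_ref.index.to_numpy(),
})
print(tf_genes_sketch)
```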

In [2]:
import pandas as pd
tf_genes_Aug2020 = pd.read_csv("/hydra/FACS/TF_interpro/tf_genes_Aug2020.csv")

In [3]:
len(tf_genes_Aug2020)

982

## 3.4 Comparison with Swapna et al. 2018

In [None]:
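# NOTE: tf_genes (the Swapna et al. 2018 table) must already be loaded; that step is not shown in this notebook.
tf_genes_Aug2020["Swapna"] = "-"
swapna_col = tf_genes_Aug2020.columns.get_loc("Swapna")
for i in range(len(tf_genes_Aug2020)):
    swapna_id = tf_genes[tf_genes.iloc[:, 0] == tf_genes_Aug2020.iloc[i,1]]
    if (len(swapna_id) == 1):
        # Assign to row i only; assigning to the whole column would overwrite every row.
        tf_genes_Aug2020.iloc[i, swapna_col] = swapna_id.iloc[0,1]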
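
The row-by-row comparison can also be written as a lookup. The sketch below is an illustration on toy data only: the two-column layout of the earlier table (new gene ID, earlier ID), the column names and all values are assumptions. Gene IDs that occur more than once in the earlier table are skipped, mirroring the check for exactly one match.

```python
import pandas as pd

# Toy stand-in for the new TF table (invented IDs).
tf_new = pd.DataFrame({
    "Rink": ["REF_A", "REF_B", "REF_C"],
    "Neiro": ["MSTRG.1", "MSTRG.7", "MSTRG.9"],
})

# Hypothetical layout for the earlier table: gene ID in the first column, earlier ID in the second.
earlier = pd.DataFrame({
    "gene": ["MSTRG.1", "MSTRG.9", "MSTRG.9"],
    "earlier_id": ["TF_001", "TF_002", "TF_003"],
})

# Keep only gene IDs that occur exactly once in the earlier table.
unique_rows = earlier[~earlier["gene"].duplicated(keep=False)]
lookup = unique_rows.set_index("gene")["earlier_id"]

# Genes without a unique match keep the placeholder "-".
tf_new["Swapna"] = tf_new["Neiro"].map(lookup).fillna("-")
print(tf_new)
```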

The table is saved as a csv file.

In [None]:
tf_genes_Aug2020.to_csv("/hydra/FACS/TF_interpro/tf_genes_Aug2020.csv", index=False)

The IDs are saved as a txt file.

In [None]:
tf_genes_Aug2020["Neiro"].to_csv("/hydra/FACS/TF_interpro/tf_genes_Aug2020_NeiroIDs.txt", index=False, header=False)

# 4. Manual curation

The results were manually curated: transcripts whose best BLAST hits were transposons or non-TF genes were removed. The resulting file was tf_genes_April2021.csv.

The planarian literature was reviewed and TFs that had not been computationally derived were added to the list. The resulting file tf_genes_May2021.xlsx contains the literature-derived TFs.

In [6]:
%%bash
ls -l /hydra/TF_data/tf_genes_April2021.csv
ls -l /hydra/TF_data/tf_genes_May2021.xlsx

-rwxrwxr-x 1 ubuntu ubuntu 53682 Apr 26 20:19 /hydra/TF_data/tf_genes_April2021.csv
-rwxrwxr-x 1 ubuntu ubuntu 60449 Jun  7 20:34 /hydra/TF_data/tf_genes_May2021.xlsx


In [9]:
import pandas as pd
# Write to a .csv path so that the Excel file is not overwritten with CSV text.
pd.read_excel("/hydra/TF_data/tf_genes_May2021.xlsx").to_csv("/hydra/TF_data/tf_genes_May2021.csv", index=False)

The review of planarian literature is described in 2-2-Transcription-factors-literature-2000-2009, 

# FINISHED