# Pseudogene detection test

Software: https://github.com/filip-husnik/pseudofinder 

Install via https://github.com/filip-husnik/pseudofinder/wiki/2.-Installing-Pseudofinder#easy-installation 

I had to modify `setup.sh` as follows:

```bash
#!/usr/bin/env bash

# setting colors to use
GREEN='\033[0;32m'
RED='\033[0;31m'
NC='\033[0m'
PATH_TO_PSEUDOFINDER=`dirname $0`

printf "\n    ${GREEN}Setting up conda environment...${NC}\n\n"

chmod +x $PATH_TO_PSEUDOFINDER/pseudofinder.py

## creating environment and installing dependencies
mamba env create --file $PATH_TO_PSEUDOFINDER/modules/environment.yml 

## activating environment
source activate pseudofinder

## creating directory for conda-env-specific source files
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d

## adding codeml-2.ctl file path:
echo '#!/bin/sh'" \

export PATH=\"$(pwd):"'$PATH'\"" \

export ctl=\"$(pwd)/codeml-2.ctl\"" >> ${CONDA_PREFIX}/etc/conda/activate.d/env_vars.sh

# re-activating environment so variable and PATH changes take effect
source activate pseudofinder

printf "\n        ${GREEN}DONE!${NC}\n\n"

# to reset:
# conda env remove --name pseudofinder
```

*This really should just be in Bioconda* 


## Getting started

Pseudofinder needs a reference database of intact ("true") genes, as amino acid sequences, to compare against. I am using the wgMLST gene panel from EnteroBase.

```bash
wget https://enterobase.warwick.ac.uk/schemes/Salmonella.wgMLST/exemplar.alleles.fasta.gz
```

These alleles are nucleotide sequences, so they must first be translated to amino acids.

In [31]:
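import gzip

from Bio import SeqIO
from Bio.Data.CodonTable import TranslationError

# Translate each allele as a complete CDS. Alleles that fail the CDS checks
# (invalid start codon, length not a multiple of three, missing or internal
# stop codon) are reported and skipped.
# Note: translate() defaults to the standard table (1); table=11 is the
# bacterial code and may rescue some of the failures listed below.
# The with-block closes both files, so the output is fully flushed to disk.
with gzip.open("exemplar.alleles.fasta.gz", "rt") as gfile, \
        open("sal_alleles.faa", "w") as out_file:
    for record in SeqIO.parse(gfile, "fasta"):
        try:
            record.seq = record.seq.translate(cds=True)
            out_file.write(record.format("fasta"))
        except TranslationError:
            print(f"Could not translate {record.id}")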


Could not translate STMMW_00401_1
Could not translate STMMW_01291_1
Could not translate STMMW_01331_1
Could not translate STMMW_01221_1
Could not translate STMMW_01241_1
Could not translate STMMW_01021_1
Could not translate STMMW_01141_1
Could not translate STMMW_01641_1
Could not translate STMMW_02251_1
Could not translate STMMW_02341_1
Could not translate STMMW_02532_1
Could not translate STMMW_02311_1
Could not translate STMMW_03191_1
Could not translate STMMW_04461_1
Could not translate STMMW_04631_1
Could not translate STMMW_03271_1
Could not translate STMMW_04681_1
Could not translate STMMW_05221_1
Could not translate STMMW_05311_1
Could not translate STMMW_04881_1
Could not translate STMMW_04781_1
Could not translate STMMW_04771_1
Could not translate STMMW_05381_1
Could not translate STMMW_06031_1
Could not translate STMMW_05551_1
Could not translate STMMW_06331_1
Could not translate STMMW_06591_1
Could not translate STMMW_06451_1
Could not translate STMMW_06991_1
[output truncated]

Could not translate NCTC12418_01137_1
Could not translate NCTC12418_03313_1
Could not translate NCTC12418_01000_1
Could not translate NCTC12418_04335_1
Could not translate NCTC12418_03785_1
Could not translate NCTC12418_03856_1
Could not translate NCTC12420_01392_1
Could not translate NCTC12420_03993_1
Could not translate NCTC12420_03241_1
Could not translate NCTC12420_04768_1
Could not translate NCTC12420_04703_1
Could not translate NCTC4444_04607_1
Could not translate NCTC13348_04993_1
Could not translate NCTC5742_01629_1
Could not translate NCTC5773_03448_1
Could not translate NCTC5750_04784_1
Could not translate NCTC5706_01118_1
Could not translate NCTC5793_01145_1
Could not translate NCTC5773_03451_1
Could not translate NCTC5773_03450_1
Could not translate NCTC5798_05067_1
Could not translate NCTC6385_00873_1
Could not translate NCTC6385_00841_1
Could not translate NCTC6385_03543_1
Could not translate NCTC6385_04405_1
Could not translate NCTC6385_04657_1
[output truncated]

Could not translate STMMW_33011_1
Could not translate STMMW_38861_1
Could not translate STMMW_35771_1
Could not translate STMMW_33321_1
Could not translate STMMW_43341_1
Could not translate STMMW_45361_1
Could not translate STMMW_42631_1
Could not translate STMMW_43271_1
Could not translate STMMW_44331_1
Could not translate T_RS08145_1
Could not translate T_RS19665_1
Could not translate STY3288_1
Could not translate T_RS05005_1
Could not translate T_RS03710_1
Could not translate T_RS18610_1
Could not translate SU5_RS23005_1
Could not translate ZV79_RS09900_1
Could not translate ZV79_RS12805_1
Could not translate ZV79_RS13710_1
Could not translate ZV79_RS21585_1
Could not translate X506_RS23160_1
Could not translate STMMW_34801_1
Could not translate 32473_B02_04504_1
Could not translate STMMW_40581_1
Could not translate A464_RS02400_1
Could not translate A464_RS02295_1
Could not translate A464_RS02325_1
Could not translate A464_RS09650_1
Could not translate A464_RS09660_1
[output truncated]

Could not translate NCTC8267_00931_1
Could not translate NCTC8270_05036_1
Could not translate NCTC8270_03008_1
Could not translate NCTC8271_03759_1
Could not translate NCTC8270_03017_1
Could not translate NCTC8273_01575_1
Could not translate NCTC8272_02445_1
Could not translate NCTC8273_01577_1
Could not translate NCTC8273_01595_1
Could not translate NCTC8273_01586_1
Could not translate NCTC8273_03643_1
Could not translate NCTC8273_02059_1
Could not translate NCTC8273_01615_1
Could not translate NCTC8273_01610_1
Could not translate NCTC8273_04179_1
Could not translate NCTC8273_04602_1
Could not translate NCTC8297_01585_1
Could not translate NCTC8297_01581_1
Could not translate NCTC9606_00427_1
Could not translate NCTC9606_00433_1
Could not translate NCTC9606_01036_1
Could not translate NCTC9606_01066_1
Could not translate NCTC9606_01069_1
Could not translate NCTC9606_02014_1
Could not translate NCTC9606_01075_1
Could not translate NCTC9606_02305_1
Could not translate NCTC9606_03771_1
[output truncated]

Could not translate SAL_CA9998AA_02874_1
Could not translate SAL_CA9998AA_02875_1
Could not translate SAL_CA9998AA_02877_1
Could not translate SAL_DA0313AA_02728_1
Could not translate SAL_DA0313AA_01196_1
Could not translate SAL_DA0431AA_00392_1
Could not translate SAL_DA0809AA_00023_1
Could not translate SAL_DA0678AA_00449_1
Could not translate SAL_DA0809AA_03186_1
Could not translate SAL_DA0947AA_02349_1
Could not translate SAL_DA0947AA_01856_1
Could not translate SAL_DA1178AA_00516_1
Could not translate SAL_DA0809AA_04437_1
Could not translate SAL_DA1427AA_04071_1
Could not translate SAL_DA1521AA_00670_1
Could not translate SAL_DA1362AA_01019_1
Could not translate SAL_DA1521AA_02287_1
Could not translate SAL_DA1521AA_02214_1
Could not translate SAL_DA1521AA_02217_1
Could not translate SAL_DA1532AA_00607_1
Could not translate SAL_DA1601AA_02833_1
Could not translate SAL_DA2140AA_02610_1
Could not translate SAL_DA2140AA_02599_1
Could not translate SAL_DA2140AA_02584_1
[output truncated]

Could not translate SEHA_RS08155_1
Could not translate SEHA_RS00505_1
Could not translate SEHA_RS02130_1
Could not translate SEHA_RS08060_1
Could not translate SEHA_RS16985_1
Could not translate SEHA_RS22680_1
Could not translate SEHA_RS15540_1
Could not translate SEHA_RS13150_1
Could not translate SEN_RS06485_1
Could not translate SEHA_RS22765_1
Could not translate SEHA_RS22860_1
Could not translate SEN_RS05200_1
Could not translate SEN_RS07445_1
Could not translate SESA_RS01550_1
Could not translate SESA_RS00315_1
Could not translate SESA_RS02090_1
Could not translate SESA_RS00325_1
Could not translate SESA_RS00550_1
Could not translate SESA_RS04035_1
Could not translate SESA_RS06675_1
Could not translate SESA_RS22560_1
Could not translate SESA_RS14285_1
Could not translate SG_RS05275_1
Could not translate SG_RS01460_1
Could not translate SLT-BT0091_1
Could not translate SLT-BT0371_1
Could not translate SLT-BT0481_1
Could not translate SLT-BT0431_1
Could not translate SLT-BT0621_1
[output truncated]
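
A lot of alleles fail the complete-CDS check, so it helps to know why. The sketch below is a plain-Python approximation of the checks that Biopython documents for `translate(cds=True)`. It is not Biopython's own implementation, and the `cds_problems` helper is a name made up for this note. The start codon sets are those of NCBI translation tables 1 and 11, so a `GTG` start that is valid under the bacterial table would fail under the default standard table.

```python
from typing import List, Set

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Start codons of NCBI translation table 1 (standard) and table 11 (bacterial).
START_CODONS_TABLE1 = {"ATG", "TTG", "CTG"}
START_CODONS_TABLE11 = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}


def cds_problems(seq: str, start_codons: Set[str] = START_CODONS_TABLE11) -> List[str]:
    """Return the reasons a nucleotide sequence is not a complete CDS (hypothetical helper)."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of three")
    # Split into whole codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] not in start_codons:
        problems.append("no valid start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


# A GTG start passes with the bacterial table but not with the standard one.
print(cds_problems("GTGAAATAA"))
print(cds_problems("GTGAAATAA", start_codons=START_CODONS_TABLE1))
```

Running something like this over the alleles that failed would show whether most of them are genuinely broken or just use an alternative start codon.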

In [38]:
!head sal_alleles.faa
!tail sal_alleles.faa
!grep -c '>' sal_alleles.faa

>STMMW_00651_1
MHEAQIRVAIAGAGGRMGRQLIQAAMAMEGVQLGAALEREGSSLLGSDAGELAGAGKSGV
IVQSSLEAVKDDFDVFIDFTRPEGTLTHLAFCRQHGKGMVIGTTGFDDAGKQAIREASQE
IAIVFAANFSVGVNVMLKLLEKAAKVMGDYSDIEIIEAHHRHKVDAPSGTALAMGEAIAG
ALDKNLKDCAVYSREGYTGERVAGTIGFATVRAGDIVGEHTAMFADIGERVEITHKASSR
MTFANGALRSALWLKTKKNGLFDMRDVLGLDVL
>STMMW_00121_1
MAKRDYYEILGVSKTAEEREIKKAYKRLAMKYHPDRNQGDKEAEAKFKEIKEAYEVLTDA
QKRAAYDQYGHAAFEQGGMGGGFGGGFNGGADFSDIFGDVFGDIFGGGRGRQRAARGADL
RYNMDLTLEEAVRGVTKEIRIPTLEECDVCHGSGAKAGTQPQTCPTCHGSGQVQMRQGFF
VGLDFIN*
>ZV79_RS13700_1
MEIDHIKILSKSALVILTEYIDLISSDLYHLIDYTYTDKYKTYCDLSQVISDFTKNNIDK
IKEISLPINNDFSVHYYDLCMISSKLSDFKMNCETLIKDNDIFYSEILRIFGFNSNVPME
IVICSLYKNYSFMHFVLKDDDMRNELTKFYSSIDANYNAFMVEYFSYKKIQSCDDISNYA
SLAVDQLIEYEQVDAENLLHNKKVVYIDQNIISAYCSEKNKKLRSLLNSLKESGEYVFVF
SPYLVEDGIKMDYVYFNLYLAQVLKLTNGVFISKVNNEIRYVKEEFYTLVNRVIEWLPAT
SVAENIKYYKAKLNYFAYPFVRKDSRIVSKINDDISDFFMAIDSTKNIMINDINASFFDF
LQSVLLNITNQFDLEDMKAGRISVDKDFDYVEIIERVSEFLDIINYKTERVRDKKKILSS
YQDVQHLAHAWKADYFLTNDDRLIERGGYIYSLLGVKTKFIKEKELADLK*
19399

The translated alleles now need to be made into a BLAST protein database.

In [39]:
!makeblastdb -in sal_alleles.faa -dbtype prot



Building a new DB, current time: 07/12/2022 22:41:23
New DB name:   /home/ubuntu/code/nabil-labbook/pseudo/sal_alleles.faa
New DB title:  sal_alleles.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /home/ubuntu/code/nabil-labbook/pseudo/sal_alleles.faa
Keep MBits: T
Maximum file size: 1000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 113188: 1-464831, 464834-464836, 464839-464845
Adding sequences from FASTA; added 19475 sequences in 0.483955 seconds.
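
The `makeblastdb` log above warns about invalid residues on one line of the FASTA file, so a quick sanity check of the input seems worthwhile. This is a minimal stdlib-only sketch: `read_fasta` and `find_suspect_records` are hypothetical helpers, and the set of accepted letters is my assumption, not necessarily the alphabet `makeblastdb` enforces.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumed alphabet: the 20 standard amino acids, common ambiguity/rare codes, and '*'.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBZXJUO*")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Record ID is the first whitespace-delimited token after '>'.
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_suspect_records(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map record IDs to the sorted unexpected characters found in their sequences."""
    suspects = {}
    for name, seq in read_fasta(lines):
        unexpected = sorted(set(seq.upper()) - VALID_RESIDUES)
        if unexpected:
            suspects[name] = unexpected
    return suspects
```

On the real file this would be called as `with open("sal_alleles.faa") as fh: print(find_suspect_records(fh))`. It could also be extended to count records, to cross-check the `grep` count against the number of sequences `makeblastdb` reports.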




We also need a genome to query. I picked a random genome from EnteroBase (SAL_VC3907AA_AS) and used its pregenerated GenBank annotation (gbk).

## Running pseudofinder



In [40]:
!pseudofinder.py annotate -g SAL_VC3907AA_AS.enterobase.gbk -db sal_alleles.faa -op test 

2022-07-12 22:41:45	CDS extracted from:			SAL_VC3907AA_AS.enterobase.gbk
			Written to file:			test_cds.fasta.
2022-07-12 22:41:45	Intergenic regions extracted from:	SAL_VC3907AA_AS.enterobase.gbk
			Written to file:			test_intergenic.fasta.
2022-07-12 22:41:45	blastp executed with 4 threads on test_proteome.faa.
2022-07-12 22:43:07	blastx executed with 4 threads on test_intergenic.fasta.
2022-07-12 22:43:30	Checking contig 1 / 62 for pseudogenes.
2022-07-12 22:43:30	Number of ORFs on this contig: 624
			Number of pseudogenes flagged: 41
2022-07-12 22:43:30	Checking contig 2 / 62 for pseudogenes.
2022-07-12 22:43:30	Number of ORFs on this contig: 582
			Number of pseudogenes flagged: 43
2022-07-12 22:43:30	Checking contig 3 / 62 for pseudogenes.
2022-07-12 22:43:30	Number of ORFs on this contig: 459
			Number of pseudogenes flagged: 42
[output truncated]

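The per-contig counts in the run output are easier to read once totalled. This is a rough sketch that parses lines of the shape printed above. `summarise_log` and `totals` are hypothetical helpers, and I am assuming (not verified) that the complete output or `test_log.txt` uses the same wording for every contig.

```python
import re
from typing import Dict, Iterable, Tuple

CONTIG_RE = re.compile(r"Checking contig (\d+) / (\d+)")
ORFS_RE = re.compile(r"Number of ORFs on this contig: (\d+)")
PSEUDO_RE = re.compile(r"Number of pseudogenes flagged: (\d+)")


def summarise_log(lines: Iterable[str]) -> Dict[int, Dict[str, int]]:
    """Collect ORF and flagged-pseudogene counts per contig number."""
    per_contig: Dict[int, Dict[str, int]] = {}
    current = None
    for line in lines:
        match = CONTIG_RE.search(line)
        if match:
            current = int(match.group(1))
            per_contig[current] = {"orfs": 0, "pseudogenes": 0}
            continue
        if current is None:
            continue  # ignore anything before the first contig line
        match = ORFS_RE.search(line)
        if match:
            per_contig[current]["orfs"] = int(match.group(1))
            continue
        match = PSEUDO_RE.search(line)
        if match:
            per_contig[current]["pseudogenes"] = int(match.group(1))
    return per_contig


def totals(per_contig: Dict[int, Dict[str, int]]) -> Tuple[int, int]:
    """Return (total ORFs, total flagged pseudogenes) over all contigs."""
    return (sum(c["orfs"] for c in per_contig.values()),
            sum(c["pseudogenes"] for c in per_contig.values()))


# The three contigs visible in the excerpt above.
LOG_EXCERPT = """\
Checking contig 1 / 62 for pseudogenes.
Number of ORFs on this contig: 624
Number of pseudogenes flagged: 41
Checking contig 2 / 62 for pseudogenes.
Number of ORFs on this contig: 582
Number of pseudogenes flagged: 43
Checking contig 3 / 62 for pseudogenes.
Number of ORFs on this contig: 459
Number of pseudogenes flagged: 42
""".splitlines()

print(totals(summarise_log(LOG_EXCERPT)))  # (1665, 126) for these three contigs
```

For the three contigs shown, that is 126 flagged out of 1665 ORFs, roughly 7.6%.
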
In [41]:
!ls

SAL_VC3907AA_AS.enterobase.gbk	test_contigs.fasta
exemplar.alleles.fasta.gz	test_intact.faa
pseudofinder			test_intact.ffn
psuedo-work.ipynb		test_intact.gff
sal_alleles.faa			test_interactive_results.html
sal_alleles.faa.dmnd		test_intergenic.fasta
sal_alleles.faa.pdb		test_intergenic.fasta.blastX_output.tsv
sal_alleles.faa.phr		test_log.txt
sal_alleles.faa.pin		test_map.pdf
sal_alleles.faa.pot		test_proteome.faa
sal_alleles.faa.psq		test_proteome.faa.blastP_output.tsv
sal_alleles.faa.ptf		test_pseudos.fasta
sal_alleles.faa.pto		test_pseudos.gff
test_cds.fasta
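
To compare how many features ended up in `test_pseudos.gff` and `test_intact.gff`, a small counter like the one below could be used. It assumes the files follow the usual 9-column, tab-separated GFF layout with `#` comment lines. I have not checked Pseudofinder's exact output format, and `count_gff_features` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def count_gff_features(lines: Iterable[str]) -> Counter:
    """Count feature lines per sequence ID (column 1) in GFF-formatted lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a standard 9-column feature line
        counts[fields[0]] += 1
    return counts


# Made-up example lines, only to show the expected input shape.
example = [
    "##gff-version 3",
    "contig_1\tdemo\tgene\t10\t400\t.\t+\t.\tID=a",
    "contig_1\tdemo\tgene\t500\t900\t.\t-\t.\tID=b",
    "contig_2\tdemo\tgene\t1\t300\t.\t+\t.\tID=c",
]
print(count_gff_features(example))
```

On the real files this would be `with open("test_pseudos.gff") as fh: print(sum(count_gff_features(fh).values()))`, and the same again for `test_intact.gff`.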
