# Pseudogene detection test

Software: https://github.com/filip-husnik/pseudofinder 

Install via https://github.com/filip-husnik/pseudofinder/wiki/2.-Installing-Pseudofinder#easy-installation 

I had to modify `setup.sh` as follows:

```bash
#!/usr/bin/env bash

# setting colors to use
GREEN='\033[0;32m'
RED='\033[0;31m'
NC='\033[0m'
PATH_TO_PSEUDOFINDER=$(dirname "$0")

printf "\n    ${GREEN}Setting up conda environment...${NC}\n\n"

chmod +x "$PATH_TO_PSEUDOFINDER/pseudofinder.py"

## creating environment and installing dependencies
mamba env create --file "$PATH_TO_PSEUDOFINDER/modules/environment.yml"

## activating environment
source activate pseudofinder

## creating directory for conda-env-specific source files
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d

## adding codeml-2.ctl file path:
echo '#!/bin/sh'" \

export PATH=\"$(pwd):"'$PATH'\"" \

export ctl=\"$(pwd)/codeml-2.ctl\"" >> ${CONDA_PREFIX}/etc/conda/activate.d/env_vars.sh

# re-activating environment so variable and PATH changes take effect
source activate pseudofinder

printf "\n        ${GREEN}DONE!${NC}\n\n"

# to reset:
# conda env remove --name pseudofinder
```

*This really should just be in Bioconda* 


## Getting started

Pseudofinder needs a database of true (intact) genes, as amino acid sequences, to compare against. I am using the wgMLST gene panel from EnteroBase.

```bash
wget https://enterobase.warwick.ac.uk/schemes/Salmonella.wgMLST/exemplar.alleles.fasta.gz
```

These are nucleotide sequences, so they must be translated to amino acids.

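In [1]:
import gzip

from Bio import SeqIO
from Bio.Data.CodonTable import TranslationError

number_skip = 0
# Context managers close both files, so the output is flushed before it is inspected.
with gzip.open("exemplar.alleles.fasta.gz", "rt") as gfile, \
        open("sal_alleles.faa", "w") as out_file:
    for record in SeqIO.parse(gfile, "fasta"):
        try:
            # cds=True only accepts a complete CDS: start codon, length divisible
            # by three, terminal stop codon and no internal stop codons.
            record.seq = record.seq.translate(cds=True)
            out_file.write(record.format("fasta"))
        except TranslationError:
            # Alleles that fail the CDS check are counted and left out.
            number_skip += 1
print(f'Could not translate {number_skip} seqs ')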


Could not translate 1596 seqs 
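
To get a feel for why alleles are skipped, the checks behind a strict CDS translation can be approximated in plain Python. This is only a rough sketch: `cds_failure_reason` is a hypothetical helper, the start codon sets are my own transcription of NCBI translation tables 1 and 11, and it makes no attempt to reproduce Biopython's exact behaviour. As far as I understand, `translate(cds=True)` uses the standard table unless `table=` is passed, so an allele starting with GTG would be rejected there but accepted under the bacterial table.

```python
from collections import Counter

# Assumption: stop codons shared by NCBI translation tables 1 and 11.
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: start codons as listed for table 1 (standard) and table 11 (bacterial).
START_TABLE_1 = {"ATG", "CTG", "TTG"}
START_TABLE_11 = START_TABLE_1 | {"GTG", "ATT", "ATC", "ATA"}


def cds_failure_reason(seq, start_codons=START_TABLE_1):
    """Return a short reason why seq is not a complete CDS, or None if it passes."""
    seq = seq.upper()
    if seq[:3] not in start_codons:
        return "no start codon"
    if len(seq) % 3 != 0:
        return "length not a multiple of three"
    if seq[-3:] not in STOP_CODONS:
        return "no terminal stop codon"
    # Look at every codon between the start codon and the final stop codon.
    internal = (seq[i:i + 3] for i in range(3, len(seq) - 3, 3))
    if any(codon in STOP_CODONS for codon in internal):
        return "internal stop codon"
    return None


# Toy sequences (made up for illustration, not taken from the allele file).
examples = {
    "complete": "ATGAAATAA",
    "gtg_start": "GTGAAATAA",
    "frameshift": "ATGAAAATAA",
    "truncated": "ATGAAAAAA",
    "internal_stop": "ATGTAAAAATAA",
}
reasons = Counter(cds_failure_reason(s) for s in examples.values())
for reason, count in reasons.items():
    print(count, reason or "ok")
```

Running the same function over the skipped records with `START_TABLE_11` would show how many of them are rejected only because of the start codon.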


In [2]:
!head sal_alleles.faa
!tail sal_alleles.faa
!cat  sal_alleles.faa | grep '>' | wc -l 

>STMMW_00651_1
MHEAQIRVAIAGAGGRMGRQLIQAAMAMEGVQLGAALEREGSSLLGSDAGELAGAGKSGV
IVQSSLEAVKDDFDVFIDFTRPEGTLTHLAFCRQHGKGMVIGTTGFDDAGKQAIREASQE
IAIVFAANFSVGVNVMLKLLEKAAKVMGDYSDIEIIEAHHRHKVDAPSGTALAMGEAIAG
ALDKNLKDCAVYSREGYTGERVAGTIGFATVRAGDIVGEHTAMFADIGERVEITHKASSR
MTFANGALRSALWLKTKKNGLFDMRDVLGLDVL
>STMMW_00121_1
MAKRDYYEILGVSKTAEEREIKKAYKRLAMKYHPDRNQGDKEAEAKFKEIKEAYEVLTDA
QKRAAYDQYGHAAFEQGGMGGGFGGGFNGGADFSDIFGDVFGDIFGGGRGRQRAARGADL
RYNMDLTLEEAVRGVTKEIRIPTLEECDVCHGSGAKAGTQPQTCPTCHGSGQVQMRQGFF
QDALAILRNKLVVREHYLPCVLFGDDAPTEFTVGPVTFTQNAMFFRDKKSVFRHSVDINT
NAHIKSVTSAITQGFFRENVPTPDESRKFVGEFQKRAIKIYKDYPWVASIKVTDCDEVTS
QERAIQATELAIHIIRILLGAEPTRKIRLAWSRSNALNTAHLYSDADGVIHASVGANSLG
PVGIINWYKALMKCDLELEILGSALLPIVNPIETNHLHQRLIDAINWFGDAATDSNPSSS
IVKYVSAIERLFFGKFESGRTKVFAGRIKYILDAFGCDGDHQVYDQALKVYRARSILVHG
EIYQTEDEANESICLASSLSRMCLLCSAQLYSMMQNAFDNPDALALEEIMKRIGAEGLDW
LVDAAGFHK
>ZV79_RS12785_1
MKVETISYVKKNAATLDLSEPILVTQNGVPAYVIESYDQQQERENAIALLKLLTLSEKDK
AEGRVFSKDQLLDSLED
19466


The translated alleles then need to be made into a protein BLAST database.

In [3]:
!makeblastdb -in sal_alleles.faa -dbtype prot



Building a new DB, current time: 07/12/2022 23:26:26
New DB name:   /home/ubuntu/code/journal/pseudo/sal_alleles.faa
New DB title:  sal_alleles.faa
Sequence type: Protein
Deleted existing Protein BLAST database named /home/ubuntu/code/journal/pseudo/sal_alleles.faa
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 19466 sequences in 0.479658 seconds.
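
A quick way to confirm the database was written is to look for its index files next to the FASTA file. The sketch below assumes that a single-volume protein database consists of at least `.phr`, `.pin` and `.psq` files (newer BLAST+ versions write additional files as well); `missing_blast_db_files` is a hypothetical helper, not part of BLAST+ or Pseudofinder.

```python
from pathlib import Path

# Assumption: core files of a single-volume protein BLAST database.
PROTEIN_DB_SUFFIXES = (".phr", ".pin", ".psq")


def missing_blast_db_files(db_path, suffixes=PROTEIN_DB_SUFFIXES):
    """Return the names of expected database files that do not exist."""
    db_path = Path(db_path)
    return [
        db_path.name + suffix
        for suffix in suffixes
        if not Path(str(db_path) + suffix).exists()
    ]


missing = missing_blast_db_files("sal_alleles.faa")
print("missing:", missing or "none")
```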




We also need a genome to query. I selected a random Typhi genome from EnteroBase (SAL_XA0264AA_AS) and used its pregenerated annotation (gbk). As a reference I used a Paratyphi genome (SAL_XA0359AA_AS).

## Running Pseudofinder



In [None]:
!pseudofinder/pseudofinder.py annotate -g SAL_XA0264AA_AS.Typhi.gbk --reference SAL_XA0359AA_AS.Paratyphi.gbk  -db sal_alleles.faa -op fast_test 

2022-07-12 23:26:30	CDS extracted from:			SAL_XA0264AA_AS.Typhi.gbk
			Written to file:			fast_test_cds.fasta.
2022-07-12 23:26:30	Intergenic regions extracted from:	SAL_XA0264AA_AS.Typhi.gbk
			Written to file:			fast_test_intergenic.fasta.
2022-07-12 23:26:30	blastp executed with 4 threads on fast_test_proteome.faa.
2022-07-12 23:27:48	blastx executed with 4 threads on fast_test_intergenic.fasta.
2022-07-12 23:28:12	Starting Sleuth...
2022-07-12 23:28:14	Running BLAST...
2022-07-12 23:28:15	Done with BLAST.
2022-07-12 23:28:16	Starting Muscle.
2022-07-12 23:28:41	Running Muscle: 15%
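
For repeated runs, the same `annotate` call can be driven from Python instead of a shell escape. This is a minimal sketch that reuses only the flags from the command above (`-g`, `--reference`, `-db`, `-op`); `build_annotate_cmd` and `missing_inputs` are hypothetical helpers, and the script path is assumed to be the local checkout used in this notebook.

```python
import subprocess
from pathlib import Path


def build_annotate_cmd(genome, reference, database, prefix,
                       script="pseudofinder/pseudofinder.py"):
    """Assemble the argument list for the annotate run shown above."""
    return [
        script, "annotate",
        "-g", genome,
        "--reference", reference,
        "-db", database,
        "-op", prefix,
    ]


def missing_inputs(*paths):
    """Return the paths that are not existing files."""
    return [p for p in paths if not Path(p).is_file()]


cmd = build_annotate_cmd(
    "SAL_XA0264AA_AS.Typhi.gbk",
    "SAL_XA0359AA_AS.Paratyphi.gbk",
    "sal_alleles.faa",
    "fast_test",
)
# Check the script, both GenBank files and the FASTA behind the database.
absent = missing_inputs(cmd[0], cmd[3], cmd[5], cmd[7])
if absent:
    print("Not running, missing files:", absent)
else:
    subprocess.run(cmd, check=True)
```

Checking the inputs first avoids waiting on a run that fails late because of a mistyped file name.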

In [None]:
!ls

In [None]:
pseudofinder.py sleuth -a GENOME_PROTS -n GENOME_GENES -ra REFERENCE_PROTS -rn REFERENCE_GENES
