# MUSCLE

MUSCLE takes a FASTA amino acid file and makes a multiple sequence alignment of the proteins in it. You can then use that alignment to build a PSSM or HMM profile and do profile-based searches. Unlike most OrthoLang functions, it expects a set of variants of the same protein, NOT complete proteomes. One way to get a list of sequences like that is to do a BLAST search first:

In [12]:
seed = load_faa "example-data/Mycoplasma_genitalium_single.faa"



In [13]:
seed

>gi|84626162|gb|AAC71217.2|
MKILINKSELNKILKKMNNVIISNNKIKPHHSYFLIEAKEKEINFYANNEYFSVKCNLNK
NIDILEQGSLIVKGKIFNDLINGIKEEIITIQEKDQTLLVKTKKTSINLNTINVNEFPRI
RFNEKNDLSEFNQFKINYSLLVKGIKKIFHSVSNNREISSKFNGVNFNGSNGKEIFLEAS
DTYKLSVFEIKQETEPFDFILESNLLSFINSFNPEEDKSIVFYYRKDNKDSFSTEMLISM
...

In [20]:
proteomes = load_faa_glob "example-data/Mycoplasma_*_refseq.faa"



In [25]:
db = makeblastdb_prot_all proteomes
matches = blastp_db 1e-20 seed db



In [26]:
# this will have one duplicate because the seed sequence is also from one of the proteomes
# for our purposes that doesn't really matter
matches

gi|84626162|gb|AAC71217.2|	WP_009885562.1	100.000	380	0	0	1	380	1	380	0.0	741
gi|84626162|gb|AAC71217.2|	WP_009885562.1	100.000	380	0	0	1	380	1	380	0.0	741
gi|84626162|gb|AAC71217.2|	WP_011076828.1	27.083	384	263	7	1	380	7	377	1.17e-30	119
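
The hit table above is in the standard BLAST tabular layout (query ID, target ID, percent identity, alignment length, mismatches, gap opens, query start/end, target start/end, e-value, bit score). As a rough sketch of how the duplicate row could be dropped when collecting target IDs, here is a small Python example. The function names `parse_hits` and `unique_targets` are made up for illustration and are not part of OrthoLang.

```python
# Illustrative sketch only: parse BLAST tabular hits and collect unique target IDs.
# parse_hits and unique_targets are hypothetical helpers, not OrthoLang functions.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hits(text):
    """Return one dict per non-empty, tab-separated hit line."""
    hits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hits.append(hit)
    return hits


def unique_targets(hits):
    """Target IDs in first-seen order, with repeats dropped."""
    seen = []
    for hit in hits:
        if hit["sseqid"] not in seen:
            seen.append(hit["sseqid"])
    return seen


# The three rows shown in the notebook output, rebuilt with explicit tabs.
rows = [
    ["gi|84626162|gb|AAC71217.2|", "WP_009885562.1", "100.000", "380",
     "0", "0", "1", "380", "1", "380", "0.0", "741"],
    ["gi|84626162|gb|AAC71217.2|", "WP_009885562.1", "100.000", "380",
     "0", "0", "1", "380", "1", "380", "0.0", "741"],
    ["gi|84626162|gb|AAC71217.2|", "WP_011076828.1", "27.083", "384",
     "263", "7", "1", "380", "7", "377", "1.17e-30", "119"],
]
example = "\n".join("\t".join(row) for row in rows)

hits = parse_hits(example)
targets = unique_targets(hits)
print(targets)
```

Keeping first-seen order means the best hit for each target stays at the front, which is usually what you want before building an alignment.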

In [1]:
:h muscle

muscle : faa -> aln

where
  faa = FASTA amino acid
  aln = multiple sequence alignment

In [2]:
:h aln

The aln extension is for multiple sequence alignment.

You can create them with these 1 functions:
  muscle : faa -> aln

And use them with these 1 functions:
  hmmbuild : aln -> hmm

In [3]:
:h pssm

The pssm extension is for PSI-BLAST position-specific substitution matrix as ASCII.

You can create them with these 3 functions:
  psiblast_train_faa_faa : num faa faa -> pssm
  psiblast_train_faa_faas_all : num faa faa.list -> pssm
  psiblast_train_faa_pdb : num faa faa.blastdb -> pssm

And use them with these 5 functions:
  psiblast_search_pssm_faa : num pssm faa -> bht
  psiblast_search_pssm_faas : num pssm faa.list -> bht.list
  psiblast_search_pssm_pdb : num pssm faa.blastdb -> bht
  psiblast_search_pssm_pdbs : num pssm faa.blastdb.list -> bht.list
  psiblast_search_pssm_pdbs_all : num pssm faa.blastdb.list -> bht

In [4]:
magala = load_faa "example-data/Mycoplasma_agalactiae_small.faa"



In [5]:
magala

>gi|290752267|emb|CBH40238.1|
MNINSPNDKEIALKSYTETFLDILRQELGDQMLYKNFFANFEIKDVSKIGHITIGTTNVT
PNSQYVIRAYESSIQKSLDETFERKCTFSFVLLDSAVKKKVKRERKEAAIENIELSNREV
DKTKTFENYVEGNFNKEAIRIAKLIVEGEEDYNPIFIYGKSGIGKTHLLNAICNELLKKE
VSVKYINANSFTRDISYFLQENDQRKLKQIRNHFDNADIVMFDDFQSYGIGNKKATIELI
...

In [36]:
# bug here: split_fasta.py raises a TypeError when it writes a str to a bytes-only handle
split_faa (load_faa "example-data/Mycoplasma_genitalium_small.faa")

Traceback (most recent call last):
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.split_fasta.py-wrapped", line 83, in <module>
    main(*argv[1:])
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.split_fasta.py-wrapped", line 80, in main
    split_fasta(outlist, outdir, infasta, prefix, suffix)
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.split_fasta.py-wrapped", line 61, in split_fasta
    place_by_hash(oh, tmpfile, outdir, md5sum, suffix)
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.split_fasta.py-wrapped", line 35, in place_by_hash
    outhandle.write(outfile + '\n')
TypeError: a bytes-like object is required, not 'str'
error: split_fasta.py failed with ExitFailure 1.
The files it was working on have been deleted:
/home/jefdaj/myrepos/ortholang-notebooks/.ortholang-kernels/76e0b8d4-187b-4ae0-8ee1-821a14f4184f/cache/split_faa/eafc65d3ba
/home/jefdaj/myrepos/ortholang-notebooks/.
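
The `TypeError` in the traceback above is the message Python 3 gives when a `str` is written to a handle opened in binary mode (for example, `tempfile.NamedTemporaryFile` defaults to `'w+b'`). The sketch below reproduces that class of error and shows one possible fix. It is not the actual `split_fasta.py` code, and `write_line_binary` is a hypothetical helper; how `outhandle` is really opened is not visible in the traceback.

```python
import os
import tempfile


def write_line_binary(path, text):
    """Write one line to a binary-mode handle by encoding the str first."""
    with open(path, "wb") as handle:
        handle.write((text + "\n").encode("utf-8"))


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "outlist.txt")  # hypothetical output list file

# Reproduce the failure: a str cannot be written to a binary-mode handle.
try:
    with open(path, "wb") as handle:
        handle.write("some/output/file.faa" + "\n")
    reproduced = False
    message = ""
except TypeError as err:
    reproduced = True
    message = str(err)

# One possible fix: encode before writing (opening in text mode also works).
write_line_binary(path, "some/output/file.faa")
with open(path, "r") as handle:
    contents = handle.read()

print(reproduced, repr(contents))
```

Opening the handle in text mode (`"w"`) instead of encoding each line would be an equally reasonable fix if nothing else writes raw bytes to it.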

In [10]:
:h extract_seqs

extract_seqs : fa str.list -> fa

where
  fa = any fasta file (fna or faa)
  str = string

In [41]:
# bug here too: this should print the extracted sequence, but no output is shown
extract_seqs magala ["gi|290752271|emb|CBH40242.1|"]



In [45]:
random_seq = extract_seqs magala (sample 1 (extract_ids magala))



In [46]:
random_seq

>gi|290752317|emb|CBH40288.1|
MSIKYIDCHTHPIKEYYKDNFQVIEKAYFKGVAAMLITGCDPKENLEVLNICSHFDYTFP
VIGVHPNNSTGAIDGEIVESQLTKDVVAIGEIGLDYHYPDTKKDIQKESFIAQIKVAQRH
NLPVVVHMRDSYEDLFEILSQFKDVKFMIHTFSGNLYWAKKFNDLGCYFSFSAIATYKNN
SSLLEVLQYLPVDKILTETDAPYLPPASKRGMLNYPNYVKHTANYIAGVKGLSIEKFTDK
...

In [50]:
matches = blastp_db 1e-20 random_seq db
match_ids = extract_targets matches
# does this need to be written?
match_seqs = concat_faa (extract_seqs_each proteomes match_ids)

extract_seqs_each has the type signature [Some (TypeGroup {tgExt = "fa", tgDesc = "FASTA nucleic OR amino acid", tgMembers = [Exactly fna,Exactly faa]}) "any fasta file",Exactly str.list.list],
which doesn't match its inputs [faa.list,str.list]
Specifically, the faa.list in position 1 doesn't match Some (TypeGroup {tgExt = "fa", tgDesc = "FASTA nucleic OR amino acid", tgMembers = [Exactly fna,Exactly faa]}) "any fasta file".
the str.list in position 2 doesn't match Exactly str.list.list.
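
Reading the error message above, `extract_seqs_each` wants a single FASTA file plus a list of ID lists (`str.list.list`), whereas the call passed a list of FASTA files plus a single ID list. The Python sketch below illustrates that shape difference under the assumption that the `_each` variant maps over its second argument. The functions are stand-ins written for this example, not OrthoLang's implementation.

```python
# Illustrative stand-ins for the shapes of extract_seqs and extract_seqs_each.
# These mirror the type signatures in the error message, not OrthoLang's real code.

def parse_fasta(text):
    """Parse FASTA text into an ordered list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line.strip() and header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def seq_id(header):
    """The ID is taken to be the first whitespace-delimited word of the header."""
    parts = header.split()
    return parts[0] if parts else ""


def extract_seqs(records, ids):
    """One FASTA, one list of IDs -> one FASTA (shape: fa str.list -> fa)."""
    wanted = set(ids)
    return [(h, s) for h, s in records if seq_id(h) in wanted]


def extract_seqs_each(records, id_lists):
    """One FASTA, a list of ID lists -> a list of FASTAs (assumed mapping)."""
    return [extract_seqs(records, ids) for ids in id_lists]


toy = ">a first\nMKIL\nINKS\n>b second\nMNIN\n>c third\nMSIK\n"
records = parse_fasta(toy)
single = extract_seqs(records, ["a", "c"])
each = extract_seqs_each(records, [["a"], ["b", "c"]])
print(single)
print(each)
```

Under that reading, concatenating the proteomes into one file first and calling plain `extract_seqs` with a single ID list avoids the mismatch.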

In [2]:
# another approach:
proteomes = load_faa_glob "example-data/Mycoplasma_*_refseq.faa"
proteomes_single_file = concat_faa proteomes
random_seq = extract_seqs proteomes_single_file (sample 1 (extract_ids proteomes_single_file))
matches = blastp 1e-20 random_seq proteomes_single_file
match_ids = extract_targets matches
match_seqs = extract_seqs proteomes_single_file match_ids



In [8]:
match_seqs

Traceback (most recent call last):
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.cat.py-wrapped", line 48, in <module>
    main(*argv[1:])
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.cat.py-wrapped", line 41, in main
    if not is_empty(infile):
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.cat.py-wrapped", line 29, in is_empty
    with open(filetotest, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$TMPDIR/cache/load/0d8dfa261e.faa'
error: cat.py failed with ExitFailure 1.
The files it was working on have been deleted:
/home/jefdaj/myrepos/ortholang-notebooks/.ortholang-kernels/76e0b8d4-187b-4ae0-8ee1-821a14f4184f/exprs/concat_faa/1b8a0ea696/result

In [14]:
# bug: cat.py can't find the cached load file; the path still contains a literal $TMPDIR
proteomes_single_file

Traceback (most recent call last):
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.cat.py-wrapped", line 48, in <module>
    main(*argv[1:])
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.cat.py-wrapped", line 41, in main
    if not is_empty(infile):
  File "/nix/store/ks4dfnkrwbi7jzqsnirh7alazxhrjv1y-OrthoLang-SeqIO/bin/.cat.py-wrapped", line 29, in is_empty
    with open(filetotest, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '$TMPDIR/cache/load/0d8dfa261e.faa'
error: cat.py failed with ExitFailure 1.
The files it was working on have been deleted:
/home/jefdaj/myrepos/ortholang-notebooks/.ortholang-kernels/76e0b8d4-187b-4ae0-8ee1-821a14f4184f/exprs/concat_faa/1b8a0ea696/result
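
Both tracebacks above report a path that begins with the literal text `$TMPDIR`. Python's `open()` does not expand environment variables, so one plausible reading is that a placeholder was never substituted before the path reached `cat.py`. The sketch below shows what such a substitution step could look like. `resolve_cache_path` and the example directory are hypothetical, and how OrthoLang actually resolves these paths is not shown in the notebook.

```python
import os


def resolve_cache_path(path, env=None):
    """Replace a leading literal '$TMPDIR' with its value from env, if present."""
    env = os.environ if env is None else env
    prefix = "$TMPDIR"
    if path.startswith(prefix) and "TMPDIR" in env:
        return env["TMPDIR"] + path[len(prefix):]
    return path  # nothing to substitute, or no value available


literal = "$TMPDIR/cache/load/0d8dfa261e.faa"

# With a value for TMPDIR (made up for this example) the path becomes usable.
resolved = resolve_cache_path(literal, {"TMPDIR": "/tmp/ortholang-example"})

# Without one, the path is returned unchanged and open() would fail on it.
unresolved = resolve_cache_path(literal, {})

print(resolved)
print(unresolved)
```

Passing the environment as an argument keeps the function easy to test; calling it with no second argument falls back to `os.environ`.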