# **Data preparation**

In [None]:
! wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

In [4]:
%%bash

total_seqs=$(zcat uniprot_sprot.fasta.gz | grep ">" | wc -l)

# Extract Mouse Only
zcat uniprot_sprot.fasta.gz | seqkit grep -n -r -p "OX=10090\s" > swissprot_mouse.fa
mouse_seqs=$(cat swissprot_mouse.fa | grep ">" | wc -l)

# Extract Human Only
zcat uniprot_sprot.fasta.gz | seqkit grep -n -r -p "OX=9606\s" > swissprot_human.fa
human_seqs=$(cat swissprot_human.fa | grep ">" | wc -l)

# Exclude Mouse (-v)
zcat uniprot_sprot.fasta.gz | seqkit grep -n -r -p "OX=10090\s" -v > swissprot_without_mouse.fa
without_mouse_seqs=$(cat swissprot_without_mouse.fa | grep ">" | wc -l)

echo -e "Total: ${total_seqs}\nMouse: ${mouse_seqs}\nHuman: ${human_seqs}\nWithout_Mouse: ${without_mouse_seqs}"

Total: 561568
Mouse: 17027
Human: 20367
Without_Mouse: 544541
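
The header filter used above can also be approximated in plain Python when seqkit is not at hand. This is a minimal sketch, assuming UniProt-style headers that carry an `OX=<taxid>` field followed by whitespace; the function name `split_by_taxon` and the toy records (including their accessions) are illustrative only and not part of the original workflow.

```python
import re

def split_by_taxon(lines, taxid):
    """Split FASTA records into (matching, non_matching) by the OX= header field."""
    # Mirrors the seqkit pattern "OX=10090\s": the trailing \s keeps
    # OX=10090 from also matching a longer ID such as OX=100901.
    pattern = re.compile(rf"OX={taxid}\s")
    matching, non_matching = [], []
    current = None
    for line in lines:
        if line.startswith(">"):
            # A header starts a new record in one of the two buckets.
            current = matching if pattern.search(line) else non_matching
            current.append([line])
        elif current is not None:
            # Sequence lines belong to the most recent record.
            current[-1].append(line)
    return matching, non_matching

# Toy input (hypothetical records, for illustration only).
toy = """>sp|P00001|A_MOUSE Protein A OS=Mus musculus OX=10090 GN=A PE=1 SV=1
MKT
>sp|P00002|B_HUMAN Protein B OS=Homo sapiens OX=9606 GN=B PE=1 SV=1
MSS
>sp|P00003|C_OTHER Protein C OS=Example OX=100901 GN=C PE=1 SV=1
MAA
""".splitlines()

mouse, others = split_by_taxon(toy, 10090)
print(len(mouse), len(others))  # 1 2
```

The same split explains the counts printed above: the mouse and without-mouse sets are complementary, so 17027 + 544541 = 561568, the total.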


# MMseqs
1. Download the whole Swiss-Prot and create an MMseqs database from it >> **swissprot**
2. Keep only the HUMAN sequences (taxon 9606) >> **swissprot_only_human**
3. Keep the sequences of all species **except** HUMAN >> **swissprot_wo_human**
4. Search **swissprot_only_human** (query) against **swissprot_wo_human** (target) with a sensitivity of 7 (very sensitive) >> **aln_res**
5. Convert the alignment results to a tabular format >> **result.m8**

---
### **Options**
- `-s [float] Target sensitivity in the range [1:7.5] (default=5.7).`
    - Adjusts the sensitivity of the prefiltering step, which in turn affects its run time: 1.0 is the fastest setting and 8.5 the most sensitive. A sensitivity between 8 and 8.5 should be about as sensitive as BLAST.
- `-a` stores the alignment backtrace in the results, so that `convertalis` can report alignment information

---

### **Output Format**
The tabular output has 12 tab-separated columns: (1-2) identifiers of the query and target sequences/profiles, (3) sequence identity, (4) alignment length, (5) number of mismatches, (6) number of gap openings, (7-8) start and end position in the query, (9-10) start and end position in the target, (11) E-value, and (12) bit score.
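
To make the column layout concrete, one line of such a table can be turned into a typed record. This is a minimal sketch, assuming a headerless, tab-separated line with exactly these 12 columns; the `Hit` type and `parse_hit` function are names made up for this example, while the field names follow the header of the result table in this notebook.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    target: str
    pident: float
    alnlen: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    tstart: int
    tend: int
    evalue: float
    bits: float

# One converter per column, in the same order as the fields of Hit.
CONVERTERS = (str, str, float, int, int, int, int, int, int, int, float, float)

def parse_hit(line):
    """Parse one tab-separated alignment line into a Hit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONVERTERS):
        raise ValueError(f"expected {len(CONVERTERS)} columns, got {len(fields)}")
    return Hit(*(convert(value) for convert, value in zip(CONVERTERS, fields)))

hit = parse_hit("P62807\tQ6ZWY9\t1.000\t126\t0\t0\t1\t126\t1\t126\t3.978E-71\t236")
print(hit.query, hit.target, hit.alnlen)  # P62807 Q6ZWY9 126
```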


### **Commands**

```bash
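# Download Swiss-Prot and build the MMseqs database
mmseqs databases UniProtKB/Swiss-Prot swissprot tmp
# Keep only human sequences (taxon 9606)
mmseqs filtertaxseqdb swissprot swissprot_only_human --taxon-list "9606"
# Keep everything except human
mmseqs filtertaxseqdb swissprot swissprot_wo_human --taxon-list "!9606"
# Search human (query) against non-human (target); -s 7 = high sensitivity, -a = keep backtrace
mmseqs search swissprot_only_human swissprot_wo_human aln_res tmp --add-self-matches 1 -s 7 -a
# Convert the alignment results to a tab-separated .m8 file
mmseqs convertalis swissprot swissprot aln_res result.m8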
```

### **Output head**
```tsv
query	target	pident	alnlen	mismatch	gapopen	qstart	qend	tstart	tend	evalue	bits
P62807	Q6ZWY9	1.000	126	0	0	1	126	1	126	3.978E-71	236
P62807	Q5R893	1.000	126	0	0	1	126	1	126	3.978E-71	236
P62807	P62808	1.000	126	0	0	1	126	1	126	3.978E-71	236
P62807	Q64478	0.992	126	1	0	1	126	1	126	1.026E-70	235
P62807	P10854	0.992	126	1	0	1	126	1	126	1.407E-70	234
P62807	P10853	0.992	126	1	0	1	126	1	126	2.647E-70	233
P62807	Q2PFX4	0.984	126	2	0	1	126	1	126	3.630E-70	233
P62807	Q64525	0.984	126	2	0	1	126	1	126	3.630E-70	233
P62807	Q8CGP1	0.984	126	2	0	1	126	1	126	6.829E-70	232
P62807	Q5RCP8	0.984	126	2	0	1	126	1	126	6.829E-70	232
```
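
A table like the one above can be filtered further in Python, for example to keep only near-identical targets. This is a minimal sketch, assuming the file carries the header line shown above and reports `pident` as a fraction (as in this output); the function name `targets_at_or_above` and the 0.99 cutoff are illustrative choices, and the sample reuses three rows from the table.

```python
import csv
import io

def targets_at_or_above(tsv_text, min_identity):
    """Return target IDs whose pident is >= min_identity (pident as a fraction)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["target"] for row in reader if float(row["pident"]) >= min_identity]

header = ["query", "target", "pident", "alnlen", "mismatch", "gapopen",
          "qstart", "qend", "tstart", "tend", "evalue", "bits"]
rows = [
    ["P62807", "Q6ZWY9", "1.000", "126", "0", "0", "1", "126", "1", "126", "3.978E-71", "236"],
    ["P62807", "Q64478", "0.992", "126", "1", "0", "1", "126", "1", "126", "1.026E-70", "235"],
    ["P62807", "Q2PFX4", "0.984", "126", "2", "0", "1", "126", "1", "126", "3.630E-70", "233"],
]
# Join with explicit tabs so the sample does not depend on editor whitespace.
sample = "\n".join("\t".join(r) for r in [header] + rows)

close_hits = targets_at_or_above(sample, 0.99)
print(close_hits)  # ['Q6ZWY9', 'Q64478']
```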

k-mer size = read_length / 2

Cluster the unitigs (CD-HIT and MMseqs) at 95%, taking the longest unitig in each cluster as the representative (GFA simplification).
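
The "longest unitig as representative" rule can be sketched independently of the clustering tool. The example below is a rough illustration, assuming the clustering itself has already been done (by CD-HIT or MMseqs) and its result has been loaded into a mapping from unitig ID to cluster ID; that input format, the function name `longest_representatives`, and the toy unitigs are all assumptions made for this example.

```python
def longest_representatives(cluster_of, sequences):
    """Pick the longest member of each cluster as its representative.

    cluster_of: dict mapping unitig ID -> cluster ID (assumed input format)
    sequences:  dict mapping unitig ID -> nucleotide sequence
    Returns a dict mapping cluster ID -> representative unitig ID.
    """
    representative = {}
    for unitig_id, cluster_id in cluster_of.items():
        current = representative.get(cluster_id)
        # Strictly longer wins, so on a tie the first unitig seen is kept.
        if current is None or len(sequences[unitig_id]) > len(sequences[current]):
            representative[cluster_id] = unitig_id
    return representative

# Toy data (hypothetical unitigs, for illustration only).
cluster_of = {"u1": "c1", "u2": "c1", "u3": "c2"}
sequences = {"u1": "ACGT", "u2": "ACGTACGT", "u3": "GG"}

reps = longest_representatives(cluster_of, sequences)
print(reps)  # {'c1': 'u2', 'c2': 'u3'}
```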