## Data

### Downloading & Dumping

```bash
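# Download the SRA archive; -O saves it under the .sra name that fastq-dump reads below
wget -c https://sra-download.ncbi.nlm.nih.gov/traces/sra51/SRR/010757/SRR11015356 -O SRR11015356.sra

# Dump to unwrapped FASTA (--fasta 0), one file per mate (--split-files)
fastq-dump --fasta 0 --split-files SRR11015356.sra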
```

### Creating the cDBG of `SRR11015356_1.fasta` and `SRR11015356_2.fasta`

- #### Create cDBGs and GFAs for k-mer sizes 25 and 75

    ```bash
    ls -1 *fasta > list_reads
    bcalm -kmer-size 25 -max-memory 12000 -out SRR11015356_k25 -in list_reads
    bcalm -kmer-size 75 -max-memory 12000 -out SRR11015356_k75 -in list_reads
    python convertToGFA.py SRR11015356_k25.unitigs.fa SRR11015356_k25.GFA 25
    python convertToGFA.py SRR11015356_k75.unitigs.fa SRR11015356_k75.GFA 75
    ```
- #### Indexing the GFA files with the [odgi](https://github.com/vgteam/odgi) tool to reduce their size (removing the redundant edges)
    - **Indexing**
        ```bash
        odgi build -g SRR11015356_k75.GFA -G > reduced_SRR11015356_k75.GFA
        odgi build -g SRR11015356_k25.GFA -G > reduced_SRR11015356_k25.GFA
        ```
    - **Stats**: effect of the odgi graph reduction (the number of nodes stays the same)
        - `grep "^L" SRR11015356_k75.GFA | wc -l`: 24099578 links
        - `grep "^L" reduced_SRR11015356_k75.GFA | wc -l`: 12050926 links
    - **Generate connected components**:
        ```bash
        python gfa_to_connected_components.py reduced_SRR11015356_k75.GFA
        python gfa_to_connected_components.py reduced_SRR11015356_k25.GFA
        ```
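
`gfa_to_connected_components.py` is not shown here, so the following is only a minimal sketch of what such a step might compute, not the script itself. It assumes GFA1 input where `S` lines declare segments and `L` lines declare links, treats links as undirected (orientation is ignored), and groups segments with a union-find structure. The function name `gfa_connected_components` is hypothetical.

```python
from collections import Counter


def gfa_connected_components(gfa_lines):
    """Map every segment ID to a representative ID of its connected component."""
    parent = {}

    def find(node):
        # Iterative find with path compression.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # Register the segment so that unlinked segments form their own component.
            find(fields[1])
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            union(fields[1], fields[3])

    return {node: find(node) for node in parent}


# Toy graph (made-up data): segments 0 and 1 are linked, segment 2 is isolated.
example_gfa = [
    "H\tVN:Z:1.0",
    "S\t0\tACGT",
    "S\t1\tCGTA",
    "S\t2\tTTTT",
    "L\t0\t+\t1\t+\t3M",
]
components = gfa_connected_components(example_gfa)
component_sizes = Counter(components.values())
print(len(component_sizes), sorted(component_sizes.values()))
```

On a real file the same function can be fed an open file handle, for example `open("reduced_SRR11015356_k75.GFA")`, since it only iterates over lines.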

### CD-HIT clustering and cDBG of representative sequences

1. Clustering the k=75 unitigs at 95% sequence identity
    ```bash
    cd-hit-est -i SRR11015356_k75.unitigs.fa -n 11 -c 0.95 -o clusters_SRR11015356 -d 0 -T 0 -M 12000
    ```
2. Exporting only the representative sequences from the unitigs file (**k=75 only**)
    ```bash
    cat clusters_SRR11015356.clstr | grep "\*" | awk -F"[>.]" '{print ">"$2}' | grep -Fwf - -A1 <(seqkit seq -w 0 SRR11015356_k75.unitigs.fa) | grep -v "^\-\-" > reps_unitigs_SRR11015356_k75.fa
    ```
3. Constructing cDBGs with **k=25** and **k=75** from the **k=75** representative sequences
    ```bash
    bcalm -kmer-size 25 -max-memory 12000 -out reps_unitigs_SRR11015356_beforek75_afterk25.fa -in reps_unitigs_SRR11015356_k75.fa  -abundance-min 1
    bcalm -kmer-size 75 -max-memory 12000 -out reps_unitigs_SRR11015356_beforek75_afterk75.fa -in reps_unitigs_SRR11015356_k75.fa  -abundance-min 1
    ```
4. Converting the unitigs from step 3 to GFAs
    ```bash
    python convertToGFA.py reps_unitigs_SRR11015356_beforek75_afterk25.fa.unitigs.fa reps_unitigs_SRR11015356_beforek75_afterk25.GFA 25
    python convertToGFA.py reps_unitigs_SRR11015356_beforek75_afterk75.fa.unitigs.fa reps_unitigs_SRR11015356_beforek75_afterk75.GFA 75
    ```
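
The `grep`/`awk` pipeline in step 2 can be hard to follow, so here is a rough Python equivalent as a sketch. It assumes the usual CD-HIT `.clstr` layout, in which the representative of each cluster is the member line ending in `*` and the sequence name sits between `>` and `...`. The helper names `representative_ids` and `filter_fasta` are hypothetical, and the sample lines are made up.

```python
import re

# Member line of a .clstr file that marks the cluster representative, e.g.
# "0\t80nt, >7... *"
REP_LINE = re.compile(r">(.+?)\.\.\.\s+\*\s*$")


def representative_ids(clstr_lines):
    """Collect the sequence names of all cluster representatives."""
    ids = set()
    for line in clstr_lines:
        match = REP_LINE.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Yield (header, sequence) for records whose first header token is in keep_ids."""
    def wanted(header):
        tokens = header[1:].split()
        return bool(tokens) and tokens[0] in keep_ids

    header, chunks = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None and wanted(header):
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None and wanted(header):
        yield header, "".join(chunks)


# Made-up sample data in the assumed formats.
sample_clstr = [
    ">Cluster 0",
    "0\t80nt, >7... *",
    "1\t76nt, >12... at +/96.05%",
    ">Cluster 1",
    "0\t75nt, >3... *",
]
sample_fasta = [">3 LN:i:75", "ACGT", ">7 LN:i:80", "GGCC", ">12 LN:i:76", "TTAA"]

reps = representative_ids(sample_clstr)
kept = list(filter_fasta(sample_fasta, reps))
for header, sequence in kept:
    print(header)
    print(sequence)
```

Matching on the exact first header token avoids relying on `grep -w` word boundaries, and joining sequence chunks means the input does not need to be unwrapped with `seqkit` first.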

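### Subset the first 100k reads from the raw `SRR11015356_1.fasta` file
```bash
# Each record is 2 lines (header + unwrapped sequence, from --fasta 0), so 200000 lines = 100k reads
head -n 200000 SRR11015356_1.fasta > 100ksubset_SRR11015356_1.fasta
```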
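
The `head` command yields exactly 100k reads only when every record occupies two lines. As a sketch of a subsetting step that does not depend on line wrapping, the hypothetical helper below counts header lines instead of raw lines; the sample records are made up.

```python
def first_n_records(fasta_lines, n):
    """Yield the lines of the first n FASTA records, whatever the line wrapping."""
    seen = 0
    for line in fasta_lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break
        if seen:
            # Skip anything before the first header, keep everything after it.
            yield line


# Made-up example: the first record is wrapped over two sequence lines.
sample_lines = [">r1\n", "ACGT\n", "AC\n", ">r2\n", "GG\n", ">r3\n", "TT\n"]
subset = list(first_n_records(sample_lines, 2))
print("".join(subset), end="")
```

For the real file, the same generator could be applied to an open handle on `SRR11015356_1.fasta` with `n=100000` and the result written out with `writelines`.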

### MMseqs2 searching

1. Create an MMseqs2 DB for the `reps_unitigs_SRR11015356_beforek75_afterk75` unitigs file
    ```bash
    mmseqs createdb reps_unitigs_SRR11015356_beforek75_afterk75.fa.unitigs.fa DBreps_unitigs_SRR11015356_beforek75_afterk75
    ```
2. Create an MMseqs2 DB for the `100ksubset_SRR11015356_1` query reads file
    ```bash
    mmseqs createdb 100ksubset_SRR11015356_1.fasta 100kqueryDB
    ```
3. Search in *nucleotide mode* (`--search-type 3`), with the reads as the query and the unitigs as the target
    ```bash
    mmseqs search 100kqueryDB DBreps_unitigs_SRR11015356_beforek75_afterk75 aln_res tmp --add-self-matches 1 -s 7 -a --search-type 3
    ```
4. Convert the search result *aln_res* to TSV format, passing the databases in the same order as in the search
    ```bash
    mmseqs convertalis 100kqueryDB DBreps_unitigs_SRR11015356_beforek75_afterk75 aln_res result2.m8
    ```


### Transform MMseqs TSV to get the best hits
```bash
python transform_m8.py result2.m8 > transformed_result2.tsv
```
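
`transform_m8.py` is not included here, so what it does exactly is unknown. As an illustration of one way to pick best hits, the sketch below assumes the default `convertalis` column order (query, target, identity, alignment length, mismatches, gap openings, query start/end, target start/end, E-value, bit score) and keeps, for each query, the hit with the highest bit score. The function name `best_hits` and the sample rows are hypothetical.

```python
def best_hits(m8_lines):
    """Return {query: (target, bit_score)} keeping the highest-scoring hit per query."""
    best = {}
    for line in m8_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        query, target = fields[0], fields[1]
        bits = float(fields[11])  # 12th column: bit score (assumed default layout)
        # On ties the first hit seen is kept.
        if query not in best or bits > best[query][1]:
            best[query] = (target, bits)
    return best


# Made-up rows in the assumed 12-column layout.
sample_m8 = [
    "r1\tu5\t0.98\t75\t1\t0\t1\t75\t10\t84\t1e-30\t120",
    "r1\tu9\t1.00\t75\t0\t0\t1\t75\t1\t75\t1e-35\t139",
    "r2\tu5\t0.95\t60\t3\t0\t1\t60\t5\t64\t1e-20\t90",
]
hits = best_hits(sample_m8)
for query, (target, bits) in hits.items():
    print(query, target, bits, sep="\t")
```

Ranking by E-value instead of bit score would be an equally reasonable choice; bit score is used here only because it needs no special handling of values reported as zero.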