# Very brief introduction to IGUA-df v.1.0-alpha.3

> NOTE: This version of IGUA-df can be used for preliminary analysis. However, the code still needs further testing to confirm that the original IGUA behavior was preserved during refactoring.

## Installation 


1. Install Rust.

1. Create a dedicated conda environment: 

    ```bash
    conda create -n igua-df python
    conda activate igua-df
    ```

1. Install MMseqs2: `conda install bioconda::mmseqs2`

1. For now, install IGUA in editable mode by following the instructions in [this commit](https://github.com/zellerlab/IGUA/commit/6ce03937bb7293ce3490771c523726db049fd0b3). More precisely, clone the repository into a directory named `IGUA-df` and run:

    ```bash
    cd IGUA-df
    python -m pip install --no-build-isolation -e .
    ```

1. Finally, you may see messages about additional missing dependencies. Install them as the messages indicate.

## Running IGUA-df on a single genome

IGUA-df can be run on a single genome/strain by providing all the required files one by one. 

```bash 
igua \
  --defense-systems-tsv /path/to/my_strain_defense_finder_systems.tsv \
  --defense-genes-tsv /path/to/my_strain_defense_finder_genes.tsv \
  --gff /path/to/my_strain.gff \
  --genome /path/to/my_strain.fa \
  --protein-file /path/to/my_strain.faa
```

Running this command should result in a successful run of IGUA-df. The logger will indicate that the final GCFs table was saved to `gcfs.tsv` in the same directory.
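Before launching the command, it can help to confirm that every input path actually exists. The following is a minimal sketch of such a pre-flight check. The flag names are copied from the command above, while the helper `find_missing_inputs` and the constant `REQUIRED_FLAGS` are hypothetical names introduced only for illustration and are not part of IGUA-df.

```python
from pathlib import Path
from typing import Dict, List

# Flag names as used in the single-genome command above.
# The helper itself is hypothetical and not part of IGUA-df.
REQUIRED_FLAGS = (
    "--defense-systems-tsv",
    "--defense-genes-tsv",
    "--gff",
    "--genome",
    "--protein-file",
)


def find_missing_inputs(inputs: Dict[str, str]) -> List[str]:
    """Return one message per flag that is absent or points to a missing file."""
    problems = []
    for flag in REQUIRED_FLAGS:
        path = inputs.get(flag)
        if path is None:
            problems.append(f"{flag}: not provided")
        elif not Path(path).is_file():
            problems.append(f"{flag}: file not found ({path})")
    return problems
```

An empty list means all five files were found; otherwise each entry names the flag that needs attention.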

## Running IGUA-df on multiple genomes

The other way to run IGUA-df is to provide a single TSV file containing the paths to the files specified above. IGUA-df processes this file row by row, _i.e._ genome by genome.

```bash
igua -i ./strains_metadata.tsv 
```

To obtain the TSV file, first run a pre-processing step that identifies all the file paths to be fed to IGUA-df (raw and annotated data along with the DefenseFinder outputs). Below is a quick function to help with the file detection; feel free to adapt it to your own needs. Please note that this function only works with the specific directory structure described in its docstring.

In [None]:
import os
from pathlib import Path
from typing import Optional

import pandas as pd
import rich.progress  # makes rich.progress.* available to the function below
from rich.console import Console

In [None]:
def create_defense_finder_metadata(
    data_dir: str, 
    output_file: str,
    pattern: str = "*-defensefinder",
    progress: Optional[rich.progress.Progress] = None
) -> str:
    """
    Scan directory for defense-finder files and create a metadata TSV file.
    
    This function scans for strains with the following structure:
    data_dir/
    ├── strain1/
    │   ├── strain1.fa
    │   ├── strain1.faa  # Protein sequences
    │   ├── strain1.fna  # Gene sequences
    │   ├── strain1.gff  # Genome annotation
    │   └── strain1-defensefinder/
    │       ├── strain1_defense_finder_genes.tsv
    │       └── strain1_defense_finder_systems.tsv
    
    Args:
        data_dir: Base directory containing strain directories
        output_file: Path to output metadata TSV file
        pattern: Pattern to match defensefinder directories inside a given genome/strain
        progress: Optional progress bar
        
    Returns:
        str: Path to created metadata file
    """
    console = Console() if progress is None else progress.console
    
    # List all subdirectories (strains)
    strain_dirs = [d for d in Path(data_dir).iterdir() if d.is_dir()]
    
    metadata_rows = []
    
    # progress 
    using_external_progress = progress is not None
    if not progress:
        progress = rich.progress.Progress(
            rich.progress.SpinnerColumn(),
            rich.progress.TextColumn("[bold blue]{task.description}"),
            rich.progress.BarColumn(),
            rich.progress.TextColumn("[bold]{task.completed}/{task.total}")
        )
    
    try:
        if not using_external_progress:
            progress.start()
            
        task = progress.add_task(f"[bold cyan]Scanning[/] {data_dir} for defense-finder data", total=len(strain_dirs))
        
        for strain_dir in strain_dirs:
            strain_id = strain_dir.name
            progress.update(task, description=f"[bold cyan]Scanning[/] strain: {strain_id}")
            
            # find genome and annotation files
            fa_files = list(strain_dir.glob(f"{strain_id}.fa"))
            gff_files = list(strain_dir.glob(f"{strain_id}.gff"))
            
            # find protein and gene sequence files
            faa_files = list(strain_dir.glob(f"{strain_id}.faa"))
            fna_files = list(strain_dir.glob(f"{strain_id}.fna"))
            
            # find defensefinder directory
            defensefinder_dirs = list(strain_dir.glob(pattern))
            
            if not defensefinder_dirs:
                progress.console.print(f"[bold yellow]Warning:[/] No defensefinder directory found for strain {strain_id}")
                progress.update(task, advance=1)
                continue
                
            defensefinder_dir = defensefinder_dirs[0]
            
            # inside defensefinder, find TSV files
            systems_tsv = list(defensefinder_dir.glob(f"{strain_id}_defense_finder_systems.tsv"))
            genes_tsv = list(defensefinder_dir.glob(f"{strain_id}_defense_finder_genes.tsv"))
            
            # proceed only if all required files were found
            if (fa_files and gff_files and systems_tsv and genes_tsv):
                # metadata entry with optional protein/gene files
                # providing full paths
                metadata_entry = {
                    "strain_id": strain_id,
                    "systems_tsv": str(systems_tsv[0].absolute()),
                    "genes_tsv": str(genes_tsv[0].absolute()),
                    "gff_file": str(gff_files[0].absolute()),
                    "fasta_file": str(fa_files[0].absolute())
                }

                # Add protein and gene files if they exist                
                if faa_files:
                    metadata_entry["faa_file"] = str(faa_files[0].absolute())
                if fna_files:
                    metadata_entry["fna_file"] = str(fna_files[0].absolute())
                
                metadata_rows.append(metadata_entry)
            else:
                # Report missing files
                missing = []
                if not fa_files:
                    missing.append("FASTA")
                if not gff_files:
                    missing.append("GFF")
                if not systems_tsv:
                    missing.append("systems TSV")
                if not genes_tsv:
                    missing.append("genes TSV")
                progress.console.print(f"[bold yellow]Warning:[/] Missing files for strain {strain_id}: {', '.join(missing)}")
                
            progress.update(task, advance=1)
        
        # write metadata file
        if metadata_rows:
            df = pd.DataFrame(metadata_rows)
            df.to_csv(output_file, sep='\t', index=False)
            progress.console.print(f"[bold green]Created[/] metadata file with {len(metadata_rows)} strains: {output_file}")
        else:
            progress.console.print(f"[bold red]Error:[/] No valid strains found")
            
        return output_file
        
    finally:
        if not using_external_progress:
            progress.stop()

Running the function writes a `strains_metadata.tsv` file. It contains the paths to the `.fa`, `.gff`, `.faa`, `.fna`, `defense_finder_systems.tsv`, and `defense_finder_genes.tsv` files, along with the strain ID (taken from the directory name). Some of these files will be fed to IGUA internally.

Note that it is essential for the `strains_metadata.tsv` file to have the following column names: `[strain_id, systems_tsv, genes_tsv, gff_file, fasta_file, faa_file, fna_file]`, where `fna_file` is optional.
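If the metadata file was assembled by hand or with an adapted script, a quick header check can catch column-name mismatches before starting a long run. The sketch below uses only the standard library. The column names come from the list above, and the helper `check_metadata_header` is a hypothetical name introduced for illustration, not an IGUA-df function.

```python
import csv

# Column names as listed above; only fna_file is treated as optional,
# following the text. The helper is hypothetical, not part of IGUA-df.
REQUIRED_COLUMNS = [
    "strain_id", "systems_tsv", "genes_tsv",
    "gff_file", "fasta_file", "faa_file",
]
OPTIONAL_COLUMNS = ["fna_file"]


def check_metadata_header(tsv_path):
    """Return (missing, unexpected) column names from the TSV header row."""
    with open(tsv_path, newline="") as handle:
        # An empty file yields an empty header instead of raising StopIteration.
        header = next(csv.reader(handle, delimiter="\t"), [])
    known = set(REQUIRED_COLUMNS) | set(OPTIONAL_COLUMNS)
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    unexpected = [col for col in header if col not in known]
    return missing, unexpected
```

Both lists being empty indicates that the header matches the expected layout.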

In [None]:
data_dir = "/path/to/your/data/directory"           # Change to your data directory
test_run_directory = "/path/to/result/directory"    # Change to your desired working directory

create_defense_finder_metadata(
    data_dir=data_dir,  
    output_file=os.path.join(test_run_directory, "strains_metadata.tsv")
)

Next, from your console, navigate to the test run directory, which should now contain `strains_metadata.tsv`:

```bash
cd "/path/to/result/directory"
```

We are ready to run IGUA-df: 

```bash
igua -i ./strains_metadata.tsv 
```

You can then take a quick look at the results, for example as shown below.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# quick look at the cluster sizes 
tsv_path = 'path/to/gcfs.tsv' 
gcfs_df = pd.read_csv(tsv_path, sep="\t")
gcfs_df['cluster_length'].hist(bins=100)
plt.xlabel('Cluster Length [bp]')
plt.ylabel('Frequency')
# plt.savefig('hist_cluster_length.pdf')  # to save the figure, call this before plt.show()
plt.show()

In [None]:
# quick look at column summaries 

for col in gcfs_df.columns:
    print(f"Column: {col}")
    print(gcfs_df[col].describe())
    print("\n")

## Filtering by activity

To filter the defense systems by activity, use the `--activity` argument. It accepts one of `defense` (default), `antidefense`, or `all` (no filtering). 

```bash
igua -i ./strains_metadata.tsv --activity antidefense
```

## Extracting and writing the defense systems 

When run, IGUA-df extracts the genomic and protein sequences of the defense systems. These can be written to a separate directory using the `--write-defense-systems defsysts` argument, which writes all the defense systems to `./defsysts`. It may be helpful to go back through the files for debugging/sanity-checking or some downstream analysis. 

```bash
igua -i ./strains_metadata.tsv --write-defense-systems defsysts 
```

Note that writing the defense systems slows down IGUA-df and is not required for the clustering functionality. 

Also note that `--write-defense-systems` is distinct from `--features`, which writes the final clusters in FASTA format. Meanwhile, `--write-defense-systems` will write several files: 
- `strain_id@system_id.fasta`: genomic sequence of the entire defense system, from the start of the `sys_begin` gene to the end of the `sys_end` gene, including non-coding sequences
- `strain_id@system_id@@protein_id.faa`: protein sequences for all proteins that make up a given system
- `strain_id@system_id@@protein_id.fna` (optional): nucleotide sequences for all proteins that make up a given system, written only if the path to a corresponding `.fna` file is provided
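
For downstream analysis, the naming scheme above can be split back into its components. The following is a minimal sketch that assumes neither the strain ID nor the system ID contains an `@` character. The names `DefenseSystemFile` and `parse_defense_system_filename` are hypothetical and not part of IGUA-df.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class DefenseSystemFile(NamedTuple):
    strain_id: str
    system_id: str
    protein_id: Optional[str]  # None for whole-system .fasta files
    suffix: str


def parse_defense_system_filename(filename: str) -> DefenseSystemFile:
    """Split 'strain@system[@@protein].ext' into its parts.

    Assumes strain and system IDs contain no '@' (an assumption of this sketch).
    """
    path = Path(filename)
    stem, suffix = path.stem, path.suffix
    # '@@' separates the system part from the protein ID, when present.
    system_part, double_at, protein_id = stem.partition("@@")
    # A single '@' separates the strain ID from the system ID.
    strain_id, single_at, system_id = system_part.partition("@")
    if not single_at or not strain_id or not system_id:
        raise ValueError(f"unexpected file name: {filename}")
    return DefenseSystemFile(
        strain_id, system_id, protein_id if double_at else None, suffix
    )
```

Only the final extension is treated as the suffix, so protein IDs that themselves contain a dot are kept whole.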

## Verbose defense system extraction feedback 

When extracting/writing defense systems, get extra verbose feedback on the number of systems written from every strain and the number of proteins/nucleotide sequences extracted from every system using the `--defense-finder-verbose` option.

```bash
igua -i ./strains_metadata.tsv --output gcfs.tsv --write-defense-systems defsysts --defense-finder-verbose
```
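
When scripting several runs, the options introduced so far can be assembled programmatically. The sketch below builds the argument list using only flags that appear in this document and assumes `igua` is available on the `PATH`. The helper `build_igua_command` is a hypothetical name introduced for illustration.

```python
from typing import List, Optional

# Accepted values for --activity, as described in the filtering section.
ACTIVITY_CHOICES = ("defense", "antidefense", "all")


def build_igua_command(
    metadata_tsv: str,
    output: Optional[str] = None,
    activity: str = "defense",
    write_defense_systems: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble an igua command line (hypothetical helper, not part of IGUA-df)."""
    if activity not in ACTIVITY_CHOICES:
        raise ValueError(f"activity must be one of {ACTIVITY_CHOICES}")
    cmd = ["igua", "-i", metadata_tsv, "--activity", activity]
    if output:
        cmd += ["--output", output]
    if write_defense_systems:
        cmd += ["--write-defense-systems", write_defense_systems]
    if verbose:
        cmd.append("--defense-finder-verbose")
    return cmd
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with paths that contain spaces.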

## Checking memory usage with the `profiler` module

You can profile parts of the code by decorating function definitions with `@profiler.profile_function` (import the profiler first with `from .profiler import profiler`). Decorated functions are then profiled when IGUA is run with the `--profile-memory` option. By default (`--profile-memory` or `--profile-memory quiet`), only a summary table of the profiled functions is displayed. Setting the option to `verbose` additionally prints the memory usage of every function call.

```bash
igua -i strains_metadata.tsv --write-defense-systems defsysts --output gcfs_prof.tsv --profile-memory
```
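
To illustrate the general decorator pattern behind this kind of profiling, here is a stand-alone sketch built on the standard-library `tracemalloc` module. It is not IGUA's `profiler` and makes no claim about how `profile_function` is implemented; the names `profile_memory` and `build_list` are hypothetical.

```python
import functools
import tracemalloc


def profile_memory(func):
    """Record the peak traced memory (in bytes) of each call in wrapper.peaks.

    Stand-alone illustration only; this is not IGUA's profiler.
    Nested use is not handled: stopping the tracer also ends any outer trace.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            # get_traced_memory() returns (current, peak) in bytes.
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            wrapper.peaks.append(peak)

    wrapper.peaks = []
    return wrapper


@profile_memory
def build_list(n):
    """Toy workload that allocates a list of n integers."""
    return list(range(n))
```

After calling `build_list(100_000)`, the list `build_list.peaks` holds one entry per call, which could be summarized in a table or printed per call, loosely mirroring the quiet and verbose modes described above.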