# 3.8 Gene GC and protein AA proportions and pI

## Software and versions used in this study

- pepstats: EMBOSS v6.6.0

## Additional custom scripts

Note: custom scripts have been tested in Python v3.11.6 and R v4.2.1 and may not behave as expected in other versions.

- scripts/general/extract_aa_and_gc_proportions.py
- scripts/general/summarise_pepstats_pI.py

*Required Python packages: pandas, numpy, Bio (Biopython). Standard-library modules also used: argparse, re, os, itertools*

***

## Gene GC and protein AA proportions and pI

Note: A small dataset of sequences is provided for workflow testing: *data/refseq.Caudoviricetes.n50.prodigal_gv.faa*. Stated runtimes are based on this test set.

In the full study, analysed data included Waiwera vOTUs and all high-quality sequences from the IMG/VR database.

#### Extract AA and GC proportions from prodigal output

In [None]:
mkdir -p DNA/3.viruses/11.gene_and_protein_stats

scripts/viruses.general/extract_aa_and_gc_proportions.py \
--input_format prodigal \
--protein_sequences data/refseq.Caudoviricetes.n50.prodigal_gv.faa \
--sample_id test_data \
--output_filename DNA/3.viruses/11.gene_and_protein_stats/AA_and_GC.summary_table.tsv


*n50 test runtime < 10s*
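
For orientation, the sketch below illustrates the kind of calculation this step performs. It is not the code in `extract_aa_and_gc_proportions.py`, whose internals are not shown here. It assumes prodigal-style protein headers that carry a `gc_cont=` field and computes per-protein amino acid proportions by simple counting. The function names (`parse_fasta`, `aa_and_gc`) and the example record are hypothetical.

```python
import re
from collections import Counter

# Assumption: the header description contains a prodigal-style "gc_cont=" field.
GC_PATTERN = re.compile(r"gc_cont=([0-9.]+)")


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def aa_and_gc(header, sequence):
    """Return (gene_id, gene GC content or None, {amino acid: proportion})."""
    gene_id = header.split()[0]
    match = GC_PATTERN.search(header)
    gc = float(match.group(1)) if match else None
    residues = sequence.rstrip("*")  # drop the trailing stop symbol, if any
    total = len(residues)
    counts = Counter(residues)
    proportions = {aa: n / total for aa, n in counts.items()} if total else {}
    return gene_id, gc, proportions


# Made-up record for illustration only (not taken from the test data set).
example = [
    ">contig_1_1 # 2 # 16 # 1 # ID=1_1;partial=00;start_type=ATG;gc_cont=0.400",
    "MKKA*",
]

for header, seq in parse_fasta(example):
    print(aa_and_gc(header, seq))
```

In this sketch the gene GC value is read from the header rather than recomputed, because a protein FASTA file does not contain the nucleotide sequence.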

#### Protein isoelectric point (pI) via pepstats

Note: pepstats trims sequence IDs at the '|' character, so any '|' characters present in the IDs need to be replaced before running pepstats.

In [None]:
# Replace '|' characters with '__'
sed -e 's/|/__/g' data/refseq.Caudoviricetes.n50.prodigal_gv.faa > data/refseq.Caudoviricetes.n50.prodigal_gv.edit.faa
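
To check whether any IDs are affected before editing, a small Python sketch along the following lines could be used. It is a hypothetical helper, not part of the workflow scripts. Unlike the `sed` command, which substitutes on every line, it only touches header lines and reports how many were changed.

```python
def sanitise_headers(lines, old="|", new="__"):
    """Replace `old` with `new` in FASTA header lines only.

    Returns the (possibly modified) lines and the number of headers changed.
    """
    out, changed = [], 0
    for line in lines:
        if line.startswith(">") and old in line:
            line = line.replace(old, new)
            changed += 1
        out.append(line)
    return out, changed


# Made-up records for illustration only.
example = [">ref|NC_000001_1 # 1 # 9", "MKK*", ">plain_2", "MA*"]

fixed, n_changed = sanitise_headers(example)
print(f"{n_changed} header(s) edited")
print(fixed[0])
```

A count of zero would indicate that the original file can be passed to pepstats unchanged.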

#### Run pepstats

In [None]:
pepstats \
"data/refseq.Caudoviricetes.n50.prodigal_gv.edit.faa" \
-outfile "DNA/3.viruses/11.gene_and_protein_stats/pepstats_results.txt"

*n50 test runtime < 10s*

#### Summarise pepstats pI results

In [None]:
scripts/viruses.general/summarise_pepstats_pI.py \
-i DNA/3.viruses/11.gene_and_protein_stats/pepstats_results.txt \
-s test_data \
-o DNA/3.viruses/11.gene_and_protein_stats/pepstats.summaryTable.tsv


*n50 test runtime < 10s*
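
As a rough illustration of what the summary step involves, the sketch below pulls the isoelectric point for each protein out of pepstats text output. It is not the code in `summarise_pepstats_pI.py`. It assumes each record begins with a line of the form `PEPSTATS of <ID> from <start> to <end>` and contains a line of the form `Isoelectric Point = <value>`; check this against your own pepstats output before relying on it. The function name and the example values are made up.

```python
import re

# Assumed record layout of pepstats text output (verify against real output).
RECORD_PATTERN = re.compile(r"^PEPSTATS of (\S+) from \d+ to \d+")
PI_PATTERN = re.compile(r"^Isoelectric Point\s*=\s*([0-9.]+)")


def parse_pepstats_pi(lines):
    """Return {sequence_id: isoelectric point} parsed from pepstats output lines."""
    result = {}
    current_id = None
    for line in lines:
        record = RECORD_PATTERN.match(line)
        if record:
            current_id = record.group(1)
            continue
        pi = PI_PATTERN.match(line)
        if pi and current_id is not None:
            result[current_id] = float(pi.group(1))
    return result


# Made-up, abbreviated output for illustration only.
example = """PEPSTATS of contig_1_1 from 1 to 120

Molecular weight = 13374.21    Residues = 120
Isoelectric Point = 8.9033

PEPSTATS of contig_1_2 from 1 to 85

Molecular weight = 9512.80    Residues = 85
Isoelectric Point = 5.1204
""".splitlines()

for seq_id, pi in parse_pepstats_pi(example).items():
    print(f"{seq_id}\t{pi}")
```

The resulting dictionary could then be written to a tab-separated table alongside a sample identifier, similar in spirit to the `-s` and `-o` options used above.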

***