Collection of general scripts and files for bioinformatics

Ideas to implement

A simple script to remove sequences from a multi-FASTA file by ID, without external libraries such as BioPython. This means writing a sequence parser, testing for an exact match of each identifier, and explicitly reporting which sequences were removed. By default, the output goes to STDOUT and the information messages to STDERR. An output file and a log file can be used too.
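A minimal sketch of this idea, using only the standard library. The function names (`parse_fasta`, `remove_sequences`) and the log message wording are hypothetical, and the sketch assumes the identifier is the first whitespace-delimited word of each header line:

```python
import sys


def parse_fasta(handle):
    """Yield (header, sequence_lines) tuples from a FASTA-formatted handle."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line[1:], []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield header, lines


def remove_sequences(handle, ids_to_remove, out=sys.stdout, log=sys.stderr):
    """Write records whose ID is not in ids_to_remove and report the removals."""
    ids_to_remove = set(ids_to_remove)
    removed = []
    for header, lines in parse_fasta(handle):
        # Assumption: the ID is the first word of the header; exact match only,
        # so removing "seq1" never removes "seq10".
        fields = header.split(None, 1)
        seq_id = fields[0] if fields else ""
        if seq_id in ids_to_remove:
            removed.append(seq_id)
            continue
        out.write(">" + header + "\n")
        for seq_line in lines:
            out.write(seq_line + "\n")
    # Explicitly report what was removed and which requested IDs were absent.
    for seq_id in removed:
        log.write(f"removed: {seq_id}\n")
    for seq_id in sorted(ids_to_remove - set(removed)):
        log.write(f"not found: {seq_id}\n")
    return removed
```

With the defaults, kept sequences go to STDOUT and the report goes to STDERR; passing open file objects as `out` and `log` gives the output file and log file variant.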


Examples of dotfiles are in the dotfiles/ directory. Do not forget to copy them and add the characteristic . prefix!

Conda environments

First, a tip: how do you activate an environment in a Bash script? Here are two lines to add at the top of your script:

CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/

# Then activate one of your environments, e.g.:
conda activate MySuperEnvironment

Some scripts presented below require non-native Python libraries or external tools. Recipes for the Conda environments I used are provided in the resource/ directory.
To install such an environment from a recipe, follow these instructions:

  1. First, I suggest using mamba as the package manager; its dependency solver is usually much faster than the classic Conda solver
conda install -n base -c conda-forge mamba
  2. Install the environment from the YML recipe
conda env create --solver libmamba -f /path/to/environment.yml -n MyEnvName
  3. Activate your freshly installed environment
conda activate MyEnvName

For more about Conda usage, see the official documentation.

Python3 scripts


Here is a list of homemade libraries:

  • provides generic functions that can be used in my other scripts

Scripts for parsing

Here is a list of scripts used for different tasks. Note that the library cited above is mandatory for some of them!

  • takes an NCBI taxid as input and outputs the full taxonomy. Several options are available. Uses the ETE3 toolkit.
  • return some basic statistics for an assembly
  • extract a list of Pfam profiles from the IDs.
  • reformat the amino-acid identity (AAI) results obtained by comparem aai_wf, as the table is not very easy to understand...
  • get the proportion of gaps for each sequence in an alignment file, Fasta format
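To illustrate the last item, here is a minimal sketch of how the proportion of gaps per sequence could be computed from an aligned Fasta file. It is not the script from this repository: the function name `gap_proportions` is hypothetical, and it assumes that `-` is the only gap character and that the identifier is the first word of each header.

```python
def gap_proportions(handle, gap_chars="-"):
    """Return {sequence_id: proportion_of_gaps} for an aligned FASTA handle."""
    counts = {}  # sequence_id -> [gap_count, total_length]
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first whitespace-delimited word.
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            counts[seq_id] = [0, 0]
        elif seq_id is not None:
            # Sequences may span several lines, so accumulate per line.
            counts[seq_id][0] += sum(line.count(char) for char in gap_chars)
            counts[seq_id][1] += len(line)
    return {
        sid: (gaps / total if total else 0.0)
        for sid, (gaps, total) in counts.items()
    }
```

Pass `gap_chars="-."` if the alignment also uses `.` to mark gaps.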


Run a workflow

In general, to run a Snakemake workflow, use:

conda activate snakemake

snakemake -s ~/bioinfoscripts/snakefiles/phylosift_run.smk  --config samples=samples.tsv outdir=result/00_test_pipeline

A list of interesting parameters:

  • --dry-run, to test the behavior
  • -c, --cores, the number of cores that Snakemake can use. The jobs of the workflow are distributed over this number of cores
  • --config param=value, to pass expected parameters

The list of workflows

A list of Snakemake scripts

  • snakefiles/phylosift_run.smk: runs PhyloSift on a genome. This workflow requires one input: a tab-separated file with 2 columns and no header line, where each line is the genome identifier {tab} path/to/sequence/file.fa. Pass this file to the workflow with --config samples=/path/to/my_samples.tsv. An output directory can be set with --config outdir=path/to/dir. The number of threads used to run PhyloSift can be customised too, through --config thread={int}.
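Since the samples file has no column names, a malformed line is easy to miss. The sketch below shows one way to check the file before launching the workflow; the function name `read_samples` is hypothetical and the check is not part of the workflow itself.

```python
import csv


def read_samples(handle):
    """Return a list of (genome_id, path) pairs from a headerless 2-column TSV."""
    samples = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip empty lines
        # Each line must be exactly: identifier<TAB>path, both non-empty.
        if len(row) != 2 or not row[0] or not row[1]:
            raise ValueError(
                f"line {line_number}: expected 'identifier<TAB>path', got {row!r}"
            )
        samples.append((row[0], row[1]))
    return samples
```

For example, `read_samples(open("samples.tsv"))` raises a `ValueError` naming the first line that does not have exactly two tab-separated fields, such as a line where spaces were used instead of a tab.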


Generate URLs to share files stored in S3

Here is the procedure to generate a temporary URL to download a single file stored in an S3 bucket:

# Generate the link, replace 'http' by 'https'
s3cmd signurl S3://BUCKET/path/MyFile `date -d 'now + 7 days' +%s` | \
    sed 's/http:/https:/g'

# Download with wget OR curl
wget -O MyFile URL
curl -o MyFile URL

A full example:

# Get the list of files; 'microstore' is an alias to my S3 storage name
rclone ls microstore:for_data_sharing/ | cut -f 2 -d " " | \
    awk 'BEGIN{FS="/"}{print "for_data_sharing/" $0 "\t" $3}' \
    > metaplasmidomes_files.tsv

# Generate the links
while read file name; do
    s3cmd signurl "S3://$file" `date -d 'now + 7 days' +%s` | \
        sed 's/http:/https:/g' | awk -v file="$file" -v name="$name" \
        '{print file "\t" name "\t" $0}'
    sleep 1
done < metaplasmidomes_files.tsv > metaplasmidomes_files_urls.tsv

# Download
while read file name url; do
    ## Uncomment the line with your preferred tool
    # wget -O "$name" "$url"
    # curl -o "$name" "$url"
    sleep 2 # because it is always better to let the server rest for a few seconds
done < metaplasmidomes_files_urls.tsv

Usage, Sharing and Contributions

All resources available in this repository are released under the GNU General Public License v2.0, see LICENCE for more details.

Any contribution is welcome!


