A simple script to remove sequences from a multi-FASTA file using their IDs, without external libraries like BioPython. This means writing a sequence parser, testing for the exact presence of each identifier, and explicitly reporting which sequences were removed. By default, the output goes to STDOUT and the information messages to STDERR. An output file and a log file can be used too.
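The approach described above can be sketched as follows, using only the Python standard library. This is a minimal illustration, not the actual script: the function names parse_fasta and remove_sequences are hypothetical, and it assumes that the sequence ID is the first whitespace-delimited word of the header line.

```python
import sys


def parse_fasta(handle):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_sequences(handle, ids_to_remove, out=sys.stdout, log=sys.stderr):
    """Write records whose ID is not in ids_to_remove; report removals to log."""
    wanted = set(ids_to_remove)
    removed = []
    for header, seq in parse_fasta(handle):
        fields = header.split(None, 1)
        # Assumption: the ID is the first word of the header line
        seq_id = fields[0] if fields else ""
        if seq_id in wanted:  # exact match, so "seq1" does not match "seq10"
            removed.append(seq_id)
            print(f"Removed: {seq_id}", file=log)
        else:
            print(f">{header}\n{seq}", file=out)
    # Also report the requested IDs that were never seen in the input
    for missing in sorted(wanted - set(removed)):
        print(f"Not found: {missing}", file=log)
    return removed
```

Testing membership in a set gives the exact-identifier behaviour: a substring search would wrongly remove seq10 when seq1 is requested. Note that this sketch writes each kept sequence on a single line rather than preserving the original line wrapping.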
Examples of dotfiles are available in the directory dotfiles/. Do not forget to copy them and add the characteristic "." prefix!
First, a tip: how do you activate a Conda environment in a Bash script? Here are two lines to add at the top of your script:
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
# Then load one of your environment, eg:
conda activate MySuperEnvironment
Some scripts presented below require non-native Python libraries or external tools. I provide recipes for the Conda environments I used under the resource/ directory.
To install such an environment from a recipe, follow these instructions:
- First, I suggest using mamba as the package manager; it resolves and installs environments much faster than the classic Conda solver
conda install -n base -c conda-forge mamba
- Install the environment from the YML recipe
conda env create --solver libmamba -f /path/to/environment.yml -n MyEnvName
- Activate your freshly installed environment
conda activate MyEnvName
For more details about Conda usage, I suggest you have a look at the official documentation.
Here is a list of homemade libraries:
- generic_utils.py: provides generic functions that can be used in my other scripts
Here is a list of scripts used for different tasks. Note that the library generic_utils.py cited above is mandatory for some of them!
- ncbi_taxid_to_taxonomy.py: takes an NCBI taxid as input and outputs the full taxonomy. Several options are available. Uses the ETE3 toolkit.
- assembly_statistics.py: returns some basic statistics for an assembly
- get_pfam_specific_hmm.py: extracts a list of Pfam profiles from their IDs.
- comparem_aai_result_to_matrix.py: reformats the amino-acid identity (AAI) results obtained by comparem aai_wf, as the table is not very easy to understand...
- number_informative_site_alignment.py: gets the proportion of gaps for each sequence in an alignment file, Fasta format
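To make the last item concrete, here is a minimal sketch of how a per-sequence gap proportion can be computed from an aligned Fasta file. It is not the code of number_informative_site_alignment.py: the function name gap_proportions is hypothetical, and counting only "-" as a gap character is an assumption.

```python
def gap_proportions(lines, gap_chars="-"):
    """Return {sequence_id: proportion of gap characters} for an aligned FASTA."""
    lengths, gaps = {}, {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first word of the header line
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current], gaps[current] = 0, 0
        elif current is not None:
            lengths[current] += len(line)
            gaps[current] += sum(line.count(char) for char in gap_chars)
    # Proportion = number of gap characters / aligned length of the sequence
    return {
        seq_id: (gaps[seq_id] / lengths[seq_id] if lengths[seq_id] else 0.0)
        for seq_id in lengths
    }
```

For example, a sequence aligned as "AC-T" followed by "--GT" has 3 gap characters over 8 columns, so its proportion is 0.375.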
In general, to run a SnakeMake workflow, use:
conda activate snakemake
snakemake -s ~/bioinfoscripts/snakefiles/phylosift_run.smk --config samples=samples.tsv outdir=result/00_test_pipeline
A list of interesting parameters:
- --dry-run, to test the behavior
- -c, --cores, the number of threads that SnakeMake can use. The workflow is distributed on this number of cores
- --config param=value, to pass expected parameters
A list of SnakeMake scripts:
- snakefiles/phylosift_run.smk: runs PhyloSift on a genome. This workflow requires one input: a tab-separated file with 2 columns, genome identifier{tab}path/to/sequence/file.fa, without column names. Pass this file to the script with --config samples=/path/to/my_samples.tsv. It is also possible to give an output directory with --config outdir=path/to/dir. The number of threads used to run PhyloSift can be customised too, through --config thread={int}.
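The expected samples file can be generated and checked with a few lines of Python. This is only a sketch of the format as described above (two tab-separated columns, no header); the helper names write_samples_tsv and read_samples_tsv, as well as the genome identifiers and paths in the usage example, are hypothetical.

```python
import csv
from pathlib import Path


def write_samples_tsv(genomes, tsv_path):
    """Write one 'identifier<TAB>path' line per genome, without a header row."""
    with open(tsv_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for genome_id, fasta_path in genomes.items():
            writer.writerow([genome_id, str(fasta_path)])


def read_samples_tsv(tsv_path):
    """Read the file back, checking that every line has exactly 2 columns."""
    samples = {}
    with open(tsv_path, newline="") as handle:
        for number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if not row:
                continue  # skip empty lines
            if len(row) != 2:
                raise ValueError(
                    f"line {number}: expected 2 columns, found {len(row)}"
                )
            samples[row[0]] = Path(row[1])
    return samples
```

A call such as write_samples_tsv({"genomeA": "data/genomeA.fa"}, "samples.tsv") produces a file that can then be passed with --config samples=samples.tsv.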
Here is the procedure to generate a temporary URL to download a single file stored in an S3 bucket:
# Generate the link, replace 'http' by 'https'
s3cmd signurl S3://BUCKET/path/MyFile `date -d 'now + 7 days' +%s` | \
sed 's/http:/https:/g'
# Download with wget OR curl
wget -O MyFile URL
curl -o MyFile URL
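The two small computations in the commands above, the expiry date passed to s3cmd signurl and the switch from http to https, can be sketched in Python as follows. The function names expiry_epoch and force_https are hypothetical. Two differences are worth noting: the sketch adds a fixed 24 hours per day, whereas date works in local time, and it only rewrites a leading "http:" whereas the sed command replaces every occurrence.

```python
import time


def expiry_epoch(days=7, now=None):
    """Epoch timestamp in seconds, 'days' from now (fixed 24-hour days)."""
    now = time.time() if now is None else now
    return int(now + days * 24 * 3600)


def force_https(url):
    """Replace a leading 'http:' scheme by 'https:'; other URLs are unchanged."""
    if url.startswith("http:"):
        return "https:" + url[len("http:"):]
    return url
```

The value returned by expiry_epoch() plays the role of the date -d 'now + 7 days' +%s argument, and force_https() would be applied to the URL printed by s3cmd.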
A full example:
# Get the list of files; 'microstore' is an alias to my S3 storage name
rclone ls microstore:for_data_sharing/ | cut -f 2 -d " " | \
awk 'BEGIN{FS="/"}{print "for_data_sharing/" $0 "\t" $3}' \
>metaplasmidomes_files.tsv
# Generate the links
while read file name
do
s3cmd signurl S3://$file `date -d 'now + 7 days' +%s` | \
sed 's/http:/https:/g' | awk -v file=$file -v name=$name \
'BEGIN{}{print file "\t" name "\t" $0}' \
>>metaplasmidomes_files_urls.tsv
sleep 1
done < metaplasmidomes_files.tsv
# Download
while read file name url
do
## Uncomment the line with your preferred tool
# wget -O $name $url
# curl -o $name $url
sleep 2 # because it is always better to let the server rest for a few seconds
done < metaplasmidomes_files_urls.tsv
All resources available in this repository are released under the GNU General Public License v2.0; see LICENCE for more details.
Any contribution is welcome!