Functions and scripts used in a pipeline to find novel drug resistance associated mutations in HIV pol-RT

lucblassel/utils_hiv

HIV Project util functions

This Python package gathers functions and classes used to study resistance mutations and to search for potentially resistance-associated mutations in HIV-1 reverse transcriptase sequences with machine learning methods.
You can read about this project here and here

Module description

This package is split into 5 submodules:

DRM utils

This submodule contains the functions that return different subsets of DRMs (e.g. NRTI DRMs, NNRTI DRMs, accessory DRMs, SDRMs, etc.). Each function returns a list of the selected DRMs.

data utils

This submodule contains functions and classes to pre-process the encoded dataset before model training. You can remove features corresponding to known DRMs, remove sequences that have DRMs, balance target classes by sub-sampling or over-sampling, or create cross-validation folds.
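As an illustration of two of these pre-processing steps, here is a minimal sketch using only the standard library. The function names `drop_drm_features` and `subsample_balance`, and the representation of the dataset as a list of dictionaries, are assumptions made for this example and are not the submodule's actual API.

```python
import random
from collections import defaultdict


def drop_drm_features(rows, known_drms):
    """Remove the features (dictionary keys) that correspond to known DRMs."""
    return [{name: value for name, value in row.items() if name not in known_drms}
            for row in rows]


def subsample_balance(labels, seed=0):
    """Return sorted sample indices in which every class is cut down
    to the size of the rarest class (balancing by sub-sampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Hypothetical one-hot features named after the mutation they encode.
rows = [{"M184V": 1, "K103N": 0, "E40D": 1}]
filtered = drop_drm_features(rows, known_drms={"M184V", "K103N"})

labels = [0, 0, 0, 0, 1, 1]          # 4 samples of class 0, 2 of class 1
kept = subsample_balance(labels)     # 2 indices per class are kept
```

Over-sampling would follow the same pattern, drawing with replacement up to the size of the largest class instead.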

learning utils

This submodule contains functions and classes for the classifiers used during the study, including custom classifiers based on Fisher exact tests. It provides functions to train classifiers, get predictions from them and extract their coefficients / weights.

param utils

This submodule contains functions useful for the generation and selection of the best hyper-parameter set via random search.
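The idea of random search can be sketched as follows: draw a fixed number of hyper-parameter sets at random from a search space, score each one, and keep the best. The names `sample_params` and `random_search`, the search space, and the toy scoring function are all hypothetical; in the real pipeline the score would presumably come from cross-validated model performance.

```python
import random


def sample_params(space, n_iter, seed=0):
    """Draw n_iter hyper-parameter sets uniformly at random from the space."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_iter)]


def random_search(space, score_fn, n_iter=10, seed=0):
    """Return the sampled hyper-parameter set with the highest score."""
    candidates = sample_params(space, n_iter, seed)
    return max(candidates, key=score_fn)


# Hypothetical search space and a stand-in scoring function.
space = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}


def toy_score(params):
    # Pretend that models with C close to 1 perform best.
    return -abs(params["C"] - 1)


best = random_search(space, toy_score, n_iter=20, seed=0)
```

Fixing the seed keeps the sampled candidates reproducible, which makes it possible to compare runs that use different scoring functions.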

metrics

This submodule contains a set of custom performance metrics that we devised to account for class imbalance and for the differing importance of false positives (more important) and false negatives (less important).
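To make the idea concrete, the sketch below shows one possible metric with both properties. It is not one of the metrics defined in this submodule: the name `weighted_error` and the weights are assumptions chosen for illustration. Error rates are computed per class, so that the majority class does not dominate, and the false positive rate is given a larger weight than the false negative rate.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def weighted_error(y_true, y_pred, fp_weight=2.0, fn_weight=1.0):
    """Weighted mean of the false positive rate and false negative rate.

    Lower is better; 0.0 means no errors. Using rates rather than raw
    counts makes the score insensitive to class imbalance.
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return (fp_weight * fpr + fn_weight * fnr) / (fp_weight + fn_weight)


y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]
score = weighted_error(y_true, y_pred)  # FPR = 0.25, FNR = 0.5
```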

independent scripts

Additionally, two useful scripts are present.

compute_fisher_values.py

This script computes p-values for Fisher exact tests comparing the prevalence of mutations with respect to a binary characteristic, such as RTI treatment status or presence/absence of any DRM. It outputs a table with one row per considered mutation, containing the raw p-value as well as p-values corrected for multiple testing with the Bonferroni, Benjamini-Hochberg and Benjamini-Yekutieli methods. This script was used to generate the table: utils_hiv/data/fisher_p_values.tsv
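The computation described above can be sketched with the standard library alone. This is a minimal illustration, not the code of compute_fisher_values.py: the function names are hypothetical, and in practice the scipy and statsmodels packages listed in the dependencies provide equivalent, better tested routines. Each mutation gives a 2x2 table (mutation present/absent against the binary characteristic), and the two-sided p-value sums the probabilities of all tables that are at most as likely as the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small tolerance guards against floating point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def bonferroni(pvals):
    """Bonferroni correction: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals, yekutieli=False):
    """Benjamini-Hochberg adjusted p-values (Benjamini-Yekutieli if requested)."""
    m = len(pvals)
    factor = sum(1.0 / k for k in range(1, m + 1)) if yekutieli else 1.0
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m * factor / rank)
        adjusted[i] = running_min
    return adjusted


# Mutation present in 3/3 treated and 0/3 naive sequences.
p_value = fisher_exact_two_sided(3, 0, 0, 3)  # 0.1

raw = [0.01, 0.04, 0.03]
bh = benjamini_hochberg(raw)                  # approximately [0.03, 0.04, 0.04]
```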

data_encoder.py

This script is used to create the OneHot encoded dataset from HIVDB files and an additional metadata file.
To run this script you need the PrettyRTAA_naive.tsv and PrettyRTAA_treated.tsv files, generated by submitting the naive.fa and treated.fa FASTA alignments to the HIVDB sequence program. The HIVDB program also outputs ResistanceSummary_naive.tsv and ResistanceSummary_treated.tsv, which the script needs as well.
You can also specify the starting and ending positions when running the script.
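For readers unfamiliar with one-hot encoding of sequences, here is a minimal sketch of the general technique. It does not reproduce data_encoder.py: it assumes the input is a dictionary of aligned amino-acid strings rather than HIVDB files, it treats the starting and ending positions as 1-based inclusive bounds of the encoded region, and the feature naming scheme (`position_residue`) is hypothetical.

```python
def one_hot_encode(sequences, start=1, end=None):
    """One-hot encode aligned amino-acid sequences.

    sequences: dict mapping a sequence id to an aligned amino-acid string.
    start, end: 1-based inclusive positions delimiting the encoded region.
    Returns (feature_names, rows): one binary feature per (position, residue)
    pair observed in the data, and one row of 0/1 values per sequence.
    """
    if end is None:
        end = min(len(seq) for seq in sequences.values())
    features = sorted({(pos, seq[pos - 1])
                       for seq in sequences.values()
                       for pos in range(start, end + 1)})
    names = [f"{pos}_{aa}" for pos, aa in features]
    rows = {seq_id: [int(seq[pos - 1] == aa) for pos, aa in features]
            for seq_id, seq in sequences.items()}
    return names, rows


# Toy alignment; real inputs would be full RT sequences.
sequences = {"s1": "MKV", "s2": "MRV"}
names, rows = one_hot_encode(sequences, start=2, end=3)
# names == ["2_K", "2_R", "3_V"]
# rows  == {"s1": [1, 0, 1], "s2": [0, 1, 1]}
```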

data files

These files are in utils_hiv/data and are used by submodules.

DRM files

NRTI.tab and NNRTI.tab are local copies of HIVDB files (1, 2).
mutation_characteristic.tab is used by the DRM_utils submodule and contains known DRMs with their type (NRTI, NNRTI, Other) and their SDRM status. This was obtained through the HIVDB program and hand-curated. The accessory/primary role of each mutation was determined from the HIVDB program comment.

consensus.fa

This file contains the reference sequences for the main HIV-1 subtypes present in our datasets. These sequences were obtained from the Los Alamos HIV sequence database. They are used to determine which features to remove when encoding sequences.

fisher_p_values.tsv

This file contains the results of Fisher exact tests for all mutations in the datasets with respect to treatment status or DRM presence/absence, with raw p-values and p-values corrected for multiple testing. These p-values are used to build our "Fisher classifiers".

dependencies

This module depends on the following python packages:

  • python 3.7.6
  • pandas 0.25.3
  • scikit-learn 0.20.3
  • biopython 1.74
  • statsmodels 0.9.0
  • category_encoders 1.3.0
  • scipy 1.4.1
  • numpy 1.18.1
