clinAc_PGx_WES

This repository includes all self-made scripts written in Python to analyze the pharmacogenetic (PGx) germline genetic variation of 5,001 individuals referred for diagnostic Whole Exome Sequencing (WES).

Mainly, it explains how to translate genotyping data (VCF-like format) into PGx alleles and phenotypes depicted by all the CPIC guidelines (https://cpicpgx.org/guidelines/)

These scripts are ordered according to different steps in the analysis.

IMPORTANT COMMENT

Individual-associated data (i.e., variation, gender and location) is not avalable due to data protection matters.
Genotyping data (VCFs) will be available at a public repository upon publication.
The folder called "testdata" aims to provide with examples of how the input data would look like, with the aim of helping anyone to examine and understand the code

HOW TO

Prepare your input (per sample) VCF files:

Use BCFTOOLS to split alleles in each individual VCF, then remove duplicated lines (if needed), then split again (just in case some alleles have been merged again) and merge all individual VCFs together.
Finally, a last splitting alleles of the merged VCF

We go to Python3 scripts to generate MERGED_GVCFS.csv --> Alleles.csv --> (merge Alleles.csv from both panels). Alleles.csv contains all resulting PGx alleles for each individual obtained from the VCF files.

conda activate py36

The following .CSV file contains all reference alleles defined in CPIC tables which have been used in this study (actionable alleles) HAPREF=path/to/test_data/Ref_table.csv

CREATE Alleles.csv file from MERGED VCF file (both CCP17 and SSV6 panels, separately)

Taking Ref_table.csv file and merged VCF file with all variant calling results of the whole cohort, the aim is to create Alleles.csv. This file contains genotyping data relevant to resolve each pharmacogenetic allele. First, work with either CCP17 or SSV6 data (CES or WES data), and then, merge both panel into a final Alleles.csv file.

Several "normalization steps" are applied using bcftools in order to split multiallelic sites. These steps make easier to parse merged vcf into csv file

NOTE: individual VCF files can be found at https://www.ebi.ac.uk/ena/browser/view/PRJEB48632

CCP17 panel

FILES_DIR=/path/to/test_data/CCP17_panel/

cd $FILES_DIR

for i in *vcf; do mv $i tmp.vcf; bgzip $i; tabix -p vcf $i.gz; rm tmp.vcf.gz;done

for i in *vcf.gz; do mv $i tmp.vcf.gz; bcftools norm -m -both tmp.vcf.gz -Oz -o $i; rm tmp.vcf.gz;done

for i in *vcf.gz; do mv $i tmp.vcf.gz; bcftools norm -D tmp.vcf.gz -Oz -o $i; rm tmp.vcf.gz; done

for i in *vcf.gz; do mv $i tmp.vcf.gz; bcftools norm -m -both tmp.vcf.gz -Oz -o $i; rm tmp.vcf.gz;done

realpath *vcf.gz > list.txt

bcftools merge -o tmp.vcf.gz -Oz --file-list list.txt

bcftools norm -m -both -o MERGED.vcf -Ov tmp.vcf.gz;rm tmp.vcf.gz

#Create MERGED with gt info:

python /path/to/scripts/PGx_0_FromMergedVCF2MergedCSV.py --vcf MERGED.vcf

#Next, create Alleles.csv for this panel (CCP17)

python /path/to/scripts/PGx_1_merged_to_alleles.py --fileformat '.vcf' --searchpath $FILES_DIR --hapref $HAPREF --mergedCSV MERGED_GVCFs.csv

SSV6 panel

FILES_DIR=/path/to/test_data/SSV6_panel/

cd $FILES_DIR

for i in *vcf; do mv $i tmp.vcf; bgzip $i; tabix -p vcf $i.gz; rm tmp.vcf.gz;done

for i in *vcf.gz; do mv $i tmp.vcf.gz; bcftools norm -m -both tmp.vcf.gz -Oz -o $i; rm tmp.vcf.gz;done

for i in *vcf.gz; do mv $i tmp.vcf.gz; bcftools norm -D tmp.vcf.gz -Oz -o $i; rm tmp.vcf.gz; done

for i in *vcf.gz; do mv $i tmp.vcf.gz; bcftools norm -m -both tmp.vcf.gz -Oz -o $i; rm tmp.vcf.gz;done

realpath *vcf.gz > list.txt

bcftools merge -o tmp.vcf.gz -Oz --file-list list.txt

bcftools norm -m -both -o MERGED.vcf -Ov tmp.vcf.gz;rm tmp.vcf.gz

#Create MERGED with gt info

python /path/to/scripts/PGx_0_FromMergedVCF2MergedCSV.py --vcf MERGED.vcf

#Next, create Alleles.csv for this panel (SSV6)

python /path/to/scripts/PGx_1_merged_to_alleles.py --fileformat '.vcf' --searchpath $FILES_DIR --hapref $HAPREF --mergedCSV MERGED_GVCFs.csv

Merge data from both panels (SSv6 and CCP17, a.k.a WES and CES, respectively) [input files are specified inside the script]

#Merge Alleles.csv from both panels

python /path/to/scripts/PGx_2_mergeAlleles_CCP17_SSV6_together.py

#Get per sample diplotype information (haplotypes.csv) using Alleles.csv. [input files are specified inside the script] haplotypes.csv contains per-sample pharmacogenetic allele information

python /path/to/scripts/PGx_3_mergeSex_Proc_AllelesperInd.py (This script is important to correct gene-specific cases which require to remove redundant alleles. For example, if an alleleX is defined by variant A and alleleY is defined by phased variants A and B, the allele with the lowest number of variants is filtered out. All positive cases in our data have been revised and included in this script. This script also adds indels which have been manually revised to avoid missing some variants due to variant caller annotation diversity)

#Convert haplotypes to phenotypes [input files are specified inside the script]

python /path/to/scripts/PGx4_diplotypes2phenotypes.py

##Please, find the scripts used to plot figures derived from these output files in /path/to/scripts/Figures_tables

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
DISCOVERY_scripts		DISCOVERY_scripts
Figures_tables		Figures_tables
testdata		testdata
LICENSE		LICENSE
PGx_0_FromMergedVCF2MergedCSV.py		PGx_0_FromMergedVCF2MergedCSV.py
PGx_1_merged_to_alleles.py		PGx_1_merged_to_alleles.py
PGx_2_mergedAlleles_CCP17_SSV6_together.py		PGx_2_mergedAlleles_CCP17_SSV6_together.py
PGx_3_mergeSex_Proc_AllelesperInd.py		PGx_3_mergeSex_Proc_AllelesperInd.py
PGx_4_diplotypes2phenotypes.py		PGx_4_diplotypes2phenotypes.py
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DISCOVERY_scripts

DISCOVERY_scripts

Figures_tables

Figures_tables

testdata

testdata

LICENSE

LICENSE

PGx_0_FromMergedVCF2MergedCSV.py

PGx_0_FromMergedVCF2MergedCSV.py

PGx_1_merged_to_alleles.py

PGx_1_merged_to_alleles.py

PGx_2_mergedAlleles_CCP17_SSV6_together.py

PGx_2_mergedAlleles_CCP17_SSV6_together.py

PGx_3_mergeSex_Proc_AllelesperInd.py

PGx_3_mergeSex_Proc_AllelesperInd.py

PGx_4_diplotypes2phenotypes.py

PGx_4_diplotypes2phenotypes.py

README.md

README.md

Repository files navigation

clinAc_PGx_WES

IMPORTANT COMMENT

HOW TO

CREATE Alleles.csv file from MERGED VCF file (both CCP17 and SSV6 panels, separately)

CCP17 panel

SSV6 panel

Merge data from both panels (SSv6 and CCP17, a.k.a WES and CES, respectively) [input files are specified inside the script]

About

Releases

Packages

Languages

License

jlanillos/clinAcc_PGx_WES

Folders and files

Latest commit

History

Repository files navigation

clinAc_PGx_WES

IMPORTANT COMMENT

HOW TO

CREATE Alleles.csv file from MERGED VCF file (both CCP17 and SSV6 panels, separately)

CCP17 panel

SSV6 panel

Merge data from both panels (SSv6 and CCP17, a.k.a WES and CES, respectively) [input files are specified inside the script]

About

Resources

License

Stars

Watchers

Forks

Languages