GitHub - leveveaudrey/analysis-of-polymorphism-S-locus: Config files for my GitHub profile.


I) generate the different files


	The first step of the analysis is to generate all the files needed by the pipeline. The methods used to generate these files are described below, one by one:

	a) vcf

		The VCF file was generated with the pipeline "sequencing_genome_vcf.py", which takes a reference genome (-r), a file containing the names of the genomic data files (-i) and a bed file (-b).
		ex: sequencing_genome_vcf.py -d -o nivelle_halleri -r Arabidopsis_lyrata.v.1.0.23.dna.genome.fa -i nivelle -b lyrata_control.bed -p 2

	b) file with depth of each position
		
		With samtools depth and the bam files (ex: samtools depth *.bam > profondeur.csv), we generated a file with the depth at each position in each individual, named "profondeur.csv".
		From this file, we first computed the mean depth at each position, then sorted these mean depths, and finally took the value at the 97.5th percentile of the distribution, which is needed for "max_depth" in the analysis pipeline.
		To do this, we used the Python script "depth_mean.py". If you want to use it, you must change the name of the file with depth by position and individual (here, "profondeur.csv", line 1)
		and the name of the bed file with the regions of interest (here, "lyrata_control.bed", line 7).
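
		The depth threshold described above can be sketched as follows. This is a minimal illustration, not the content of "depth_mean.py": the function names are hypothetical, the input is assumed to be the tab-separated output of samtools depth (chromosome, 1-based position, then one depth column per bam file), and the nearest-rank definition of the percentile is an assumption.

```python
import math


def mean_depth_per_position(depth_lines):
    """Return {(chrom, pos): mean depth across individuals}.

    Assumes tab-separated rows: chrom, 1-based pos, one depth column per BAM.
    """
    means = {}
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        chrom, pos = fields[0], int(fields[1])
        depths = [int(x) for x in fields[2:]]
        means[(chrom, pos)] = sum(depths) / len(depths)
    return means


def in_regions(chrom, pos, regions):
    """True if a 1-based position falls in any BED region (0-based, half-open)."""
    return any(c == chrom and start < pos <= end for c, start, end in regions)


def percentile_nearest_rank(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to summarise")
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def max_depth_threshold(depth_lines, regions, q=97.5):
    """Mean depth per position, restricted to the BED regions, then the q-th percentile."""
    means = mean_depth_per_position(depth_lines)
    kept = [m for (chrom, pos), m in means.items() if in_regions(chrom, pos, regions)]
    return percentile_nearest_rank(kept, q)
```

		With the files named above, max_depth_threshold(open("profondeur.csv"), regions) would give a candidate value for "max_depth", where regions is a list of (chrom, start, end) tuples read from the bed file.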

	c) files generated by vcftools (vcf file from "a" required)
		
		With vcftools and the vcf file generated in "a", we generated two files needed by the main analysis pipeline: a file with pi for each position in each population
		(ex: vcftools --vcf GATK_HC_Arabidopsis_lyrata.v.1.0.23.dna.genome_filtered.vcf --out pi --site-pi) and another with Tajima's D estimated in non-overlapping windows of 5 kb
		(ex: vcftools --vcf GATK_HC_Arabidopsis_lyrata.v.1.0.23.dna.genome_filtered.vcf --TajimaD 5000 --out tajima).
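
		To check the two vcftools outputs before running the main pipeline, they can be loaded as in the sketch below. It assumes the usual tab-separated vcftools layouts (CHROM, POS, PI for the per-site pi file; CHROM, BIN_START, N_SNPS, TajimaD for the Tajima's D file); the function names are hypothetical and this is not part of the pipeline itself.

```python
import csv
import io
import math


def read_table(text):
    """Parse a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def mean_pi(rows):
    """Mean of the PI column over all sites in the table."""
    values = [float(row["PI"]) for row in rows]
    return sum(values) / len(values)


def tajima_by_window(rows):
    """Map (chrom, window start) to Tajima's D, skipping windows reported as nan."""
    result = {}
    for row in rows:
        d = float(row["TajimaD"])
        if math.isnan(d):
            continue  # windows without SNPs have no defined Tajima's D
        result[(row["CHROM"], int(row["BIN_START"]))] = d
    return result
```

		For example, mean_pi(read_table(open("pi.sites.pi").read())) would summarise the per-site file, assuming vcftools appended ".sites.pi" to the --out prefix.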

	d) file generated by SIFT4G (vcf file from "a" required)
		
		With SIFT4G and the vcf file generated in "a", we generated a file needed by the main analysis pipeline: a file with the predicted effect of some positions in each population
		(ex: java -jar SIFT4G_Annotator.jar -c -i GATK_HC_Arabidopsis_lyrata.v.1.0.23.dna.genome_filtered.vcf -d lyrata_ref -r ./result).
		To use SIFT4G, please see the documentation (https://sift.bii.a-star.edu.sg/sift4g/).
		We then modified the generated file: first, we removed the leading comment lines, keeping only the one with the column names; second, we changed the separator to ";".
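
		The two modifications of the SIFT4G file can be sketched as below. This is only an illustration: it assumes that the comment lines to drop start with "##", that the line with the column names does not, and that the original separator is a tab. The function name is hypothetical.

```python
def clean_annotation_lines(lines, comment_prefix="##", sep_in="\t", sep_out=";"):
    """Drop leading comment lines and change the field separator.

    Lines starting with `comment_prefix` are removed; the column-name line and
    all data lines are kept, with `sep_in` replaced by `sep_out`.
    """
    cleaned = []
    for line in lines:
        if line.startswith(comment_prefix):
            continue  # assumed comment line, not the column-name line
        cleaned.append(line.rstrip("\n").replace(sep_in, sep_out))
    return cleaned
```

		Note that a plain replacement is only safe if no field already contains ";"; if some do, those fields would need to be handled separately.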

	e) file generated by SnpEff (vcf file from "a" required)
		
		With SnpEff and the vcf file generated in "a", we generated a file needed by the main analysis pipeline: a file with the annotation of each mutation in each population
		(ex: ./snpeff -V lyrata107_control GATK_HC_Arabidopsis_lyrata.v.1.0.23.dna.genome_filtered.vcf > GATK_HC_Arabidopsis_lyrata.v.1.0.23.dna.genome_filtered.ann.vcf).
		To use SnpEff and build your own database, please see the documentation (https://pcingola.github.io/SnpEff/se_introduction/).

	f) file with the annotation of each position in the reference genome (independent step)

		To determine the 0-fold, 2-fold and 4-fold degenerate sites in the reference genome, we used a corrected version of a published pipeline (the original version had a bug) that takes a gff file and a fasta file of the reference genome
		and generates a large csv file with the predicted annotation of each position (ex: python NewAnnotateRef2.py Arabidopsis_lyrata.v.1.0.23.dna.genome.fa Arabidopsis_lyrata.v.1.0.23.gff3 -o > lyrata.csv )
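
		As an illustration of what 0-fold, 2-fold and 4-fold degenerate means, the sketch below classifies one position of one codon with the standard genetic code. It is not the code of "NewAnnotateRef2.py", which works on whole genes from the gff and fasta files; the function name is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def fold_degeneracy(codon, index):
    """Degeneracy of position `index` (0, 1 or 2) in `codon`.

    Counts how many of the four bases at that position give the same amino
    acid as the original codon. A count of 1 means every change is
    non-synonymous (0-fold); otherwise the count itself is returned (2, 3 or 4).
    """
    codon = codon.upper()
    original = CODON_TABLE[codon]
    synonymous = sum(
        1
        for base in BASES
        if CODON_TABLE[codon[:index] + base + codon[index + 1:]] == original
    )
    return 0 if synonymous == 1 else synonymous
```

		For instance, the third position of GCT (alanine) is 4-fold degenerate, the third position of TTT (phenylalanine) is 2-fold degenerate, and the first position of GCT is 0-fold degenerate.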

	g) bed files

		We created two different bed files: first, a bed file with all the regions of interest (control regions and 6 regions of 25 kb around the S locus) found in the lyrata genome,
		and second, a special "bed" file for the analysis of non-overlapping windows of 2.5 kb, with two added columns: the position relative to the S locus (in bases) and the number of cds positions in the window (in bases).
		We generated this second bed file with the script "special_bed.py", which takes a gff file (here, "Arabidopsis_lyrata.v.1.0.23.gff3") and the first and last positions of each region (5' and 3')
		around the S locus (lines 15-18 and lines 59-61 respectively). Finally, we removed the empty first line. To change the window size, change the value on line 11.
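
		The layout of the windowed bed file can be sketched as follows. This is not the content of "special_bed.py": the function names are hypothetical, coordinates are assumed to be already converted to the bed convention (0-based, half-open), and the relative position is assumed to be the window start minus one chosen S-locus boundary.

```python
def make_windows(start, end, size):
    """Non-overlapping windows [w_start, w_end) covering [start, end); the last may be shorter."""
    return [(s, min(s + size, end)) for s in range(start, end, size)]


def merge_intervals(intervals):
    """Merge overlapping or touching intervals so that no base is counted twice."""
    merged = []
    for s, e in sorted(intervals):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def cds_bases_in_window(w_start, w_end, cds_intervals):
    """Number of bases of the window covered by at least one cds interval."""
    total = 0
    for s, e in merge_intervals(cds_intervals):
        total += max(0, min(e, w_end) - max(s, w_start))
    return total


def window_rows(chrom, start, end, size, s_locus_edge, cds_intervals):
    """Rows of (chrom, w_start, w_end, relative position, cds bases) for one region."""
    rows = []
    for w_start, w_end in make_windows(start, end, size):
        relative = w_start - s_locus_edge  # assumed definition, negative upstream
        rows.append(
            (chrom, w_start, w_end, relative,
             cds_bases_in_window(w_start, w_end, cds_intervals))
        )
    return rows
```

		Calling window_rows with size=2500 gives windows of 2.5 kb; each row can then be written as one tab-separated line of the bed file.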

	h) parameters files
		
		To use the main analysis pipeline, you must change the parameters in parameters.txt (see example).

II) run "pipeline_analyse_polymorphism.py" in the same folder as all the files previously generated



To test the pipeline, please download and unzip all the zip files with the code. The "post_analysis" archive contains the R script used in the paper.