forked from spflanagan/popgen
-
Notifications
You must be signed in to change notification settings - Fork 0
/
300_nop_analysis.txt
252 lines (193 loc) · 11.3 KB
/
300_nop_analysis.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
Nerophis ophidion Pop Gen notes
1. gunzip file.gz
2. .~/Programs/FastQC/fastqc and check each one
3. ran process_radtags on each plate using the 96barcodes.txt file as the barcode list
4. In jEdit, created 'rename.sh' files with standardized formats and checked the uniqueness of each name
5. Renamed output of process_radtags using the rename files and checked read depth with wc -l *.fq
6. Moved all sample files to ./samples/ folder
7. Ran run_denovo_map.sh
8. Now first pruning with populations
populations -b 3 -P ./stacks/ -M ./nop_map.txt -r 0.75 -a 0.05 -p 6 -t 8 --vcf --plink --structure
9. prune_snps.cpp
5. plink --file pruned --hwe 0.001 --noweb --allow-no-sex --write-snplist
plink --file pruned --extract plink.snplist --recode --recodeA --recode-structure --out subset --noweb --allow-no-sex
plink --file pruned --extract plink.snplist --recodeA --out subset --noweb --allow-no-sex
6. PCAdapt
7. Adegenet
8. FastStructure
9. Structure
Structure Harvester
10. Bayenv2
11. calculate_global_fst
#######################################LAB BOOK#############################################
#####Tuesday, 7 February 2017
Installed ddocent, and am zipping and moving the nerophis fq files to a ddocent folder. This seems to have possibly frozen ubuntu but also worked.
#####Friday, 3 February 2017
I'm running stacks with different parameters to see if that changes the results at all.
run_denovo_map_nop_m10_n2.sh
populations -b 3 -P ./stacks/stacks_m10_n2/ -M ./nop_map.txt -r 0.5 -a 0.05 -p 6 -t 8 --vcf --plink --structure
Only contained 36 snps...
Re-running populations without SEW
../programs/prune_snps/prune_snps -p ./stacks/stacks_m10_n2/batch_3.plink.ped -m ./stacks/stacks_m10_n2/batch_3.plink.map -o ./stacks/stacks_m10_n2/pruned_m10
Now only 43 SNPs!
OK, so this doesn't seem like it's going to work. 2380 if i don't use -r with populations
plink --file pruned_m10 --hwe 0.001 --noweb --allow-no-sex --write-snplist
plink --file pruned_m10 --extract plink.snplist --recode --recodeA --recode-structure --out subset --noweb --allow-no-sex
plink --file pruned_m10 --extract plink.snplist --recodeA --out subset --noweb --allow-no-sex
2091 remaining
#####Tuesday, 22 November 2016
How to analyze the Bayenv results? Because i used a quantile, they all have the same number of outliers
Is there a chi-squared test I could use?
How about blast results?
->Only five of the BF outliers had blastx matches
->736 of the nerophis tags match the scovelli tags.
Each one matches multiple locations, but only 495 total scovelli locations are matched.
How can I learn more about the salinity-associated things?
->The five tags that blasted to genes were outliers in both salinity and temperature associations.
->19 salinity associated tags were also in ssc.
->15 temp associated tags were also in ssc.
used PGDSpider 2.1.0.3 to convert subset.ped to subset.bayescan
but it doesn't seem to have worked right, so I"ll do it myself.
#####Monday, 7 November 2016
Bayenv finished running and I'm working on the analyses...trying
blastx -db nr -query bf_tags.fasta -remote -out bf_tags.blastx -outfmt 7
to annotate outliers
Also blasted the tags to the scovelli genome.
#####Thursday, 3 November 2016
Bayenv seems to not have worked because matrix.5.out is the actual output not a representative matrix.
I copied the last matrix from matrix.5.out into its own file and am re-running run_bayenv_general.sh
This seems to be working--but I need to delete old .freqs files. NOW we're on the right track.
Meanwhile, 601 of 14526 SNPs show significant IBD.
41 of those are also outliers in the PCAdapt analysis.
#####Wednesday, 2 November 2016
Made a map in R.
Also ran isolation by distance.
Need to analyze bayenv output, see if any of my outliers are similar, and see if they match anything in scovelli.
What is the overall pattern I'm seeing here??
So the bayenv output doesn't look right.
Running
~/Programs/bayenv_2/calc_bf.sh nop.all.snpsfile env_data_bayenv_std.txt matrix.5.out 5 100000 4
to do the environmental correlations.
#####Tuesday, 1 November 2016
OK, I fixed it--for some reason it hadn't been re-written? IDK but now it works and I have bayenv.frq.strat
Ran:
plink --ped bayenv.plink.ped --map subset.map --out bayenv --noweb --allow-no-sex --freq --within plink.clust.txt
to get the correct SNPSFILE format.
First need to run bayenv2 for matrix estimation (run_bayenv2_matrix_general.sh)
Then I can use R to find consensus matrix for environmental associations and XtX.
Meanwhile also making the snpfiles for all 49740 markers using R
And made the standardized environmental file in R
(and environmental variables not correlated with geographic distance)
>mantel.rtest(as.dist(t(geo.dist)),as.dist(env.dist),999)
Monte-Carlo test
Observation: 0.01898832
Call: mantelnoneuclid(m1 = m1, m2 = m2, nrepet = nrepet)
Based on 999 replicates
Simulated p-value: 0.438
Input for bayenv:
nop.snpsfile SNPSFILE
env_data_bayenv_std.txt ENVIRONFILE
Meanwhile, at work:
Ran PCAdapt and got very little information. Excellent. There is basically no population structure (just like adegenet). But I was able to identify 1032 significant outliers (alpha = 0.05).
These outliers have varied global Fst values
> summary(global.fst[global.fst$RADloc %in% outliers,"Fst"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.734e-05 3.989e-02 5.816e-02 5.524e-02 7.230e-02 1.169e-01
If there's not much population structure, then is Fst informative?
#####Monday, 31 October 2016
I need the coordinates for the nerophis sites, which I'm not sure where the file is.
plink --file pruned --extract plink.snplist --freq --within --out subset --noweb --allow-no-sex
Trying to do the clustering to generate bayenv files but it keeps saying "0 of 240 individuals assigned to clusters"
#####Wednesday, 14 September 2016
Structure finished running, so I ran Structure Harvester..got K=3 as best (admixture model).
#####Sunday, 11 September 2016
Running structure on the no-SEW group in STRUCTURE with 9 reps, K=1 through K=5.
#####Thursday, 14 July 2016
The new populations run seems to have yielded 49740 SNPs.
I ran prune_snps.cpp and retained 15398 SNPs. After plink hwe pruning, 14526 SNPs retained.
#####Friday, 8 July 2016
I'm running populations without the SEW population. I think this might help things, since SEW had so much missing data.
populations -b 3 -P ./stacks -M ./nop_map_nosew.txt -s -t 3 --fstats --plink --vcf
#####Thursday, 9 June 2016
I've received the coordinates from Josefin Sundin and Kai Linstrom, waiting on Olivia Roth.
Meanwhile I'm downloading World Oceans Database data for the Baltic region.
#####Tuesday, 7 June 2016
What do I need to continue?
Get the coordinates of the collecting locations to do IBD analysis
Use coordinates to download salinity data
Look at Fsts among populations (pairwise and global).
Emailed Josefin and Olivia for coordinates of the collecting sites. Also emailed Nilla for info about the Finland samples.
Meanwhile, running calculate_global_fst on pruned.ped and pruned.map (these are unlinked but not pruned for HWE).
Fsts are surprisingly clumped around 0.05...it's a normal distribution rather than skewed. Weird. Probalby an artifact due to sequencing error or something.
Plotted the Stacks distributions and they don't seem to be quite as weird.
#####Wednesday, 11 May 2016
SEW LEM GEL STR GTL FIN
SEW 0.0241251 0.0233355 0.0231902 0.036876 0.037098
LEM 0.0279487 0.0270594 0.0374785 0.0363935
GEL 0.0133522 0.0194884 0.0209477
STR 0.0234515 0.0249285
GTL 0.0142245
FIN
Running coverage_From_stacks
Num individuals 288
Num loci 618457
Average per-locus coverage 3.81336
average per-individual coverage 4.71528
Num Loci with >=5x coverage 105597
Num loci with >=10x coverage 5436
Overall minimum per-locus coverage 0
Overall maximum per-locus coverage 46198
#####Monday, 9 May 2016
The structure run showed that deltaK was maximized at K=4. Looking at the graphs, though, I'm not sure that's the best interpretation (they all look kind of the same).
Adegenet does not provide clear clusters
I want to re-run populations with the whitelist to get a Fst matrix...
Moved old populations output to 'populations' folder.
Opened up plink.snplist and replaced _ with \t to have locus\tbp setup. Saved as nop.subset.whitelist.txt
populations -b 3 -P ./stacks -M ./nop_map.txt -s -W ./stacks/nop.subset.whitelist.txt -t 3 --fstats --plink --vcf
#####Sunday, 8 May 2016
I re-ran stacks with different filters and got only 41 polymorphic sites...
I'll just go with the populations run with a 0.05, p 2 and r 0.5.
Ran the plink files through prune_snps (saved as pruned.ped and pruned.map)
plink --file pruned --hwe 0.001 --noweb --allow-no-sex --write-snplist
1295 markers excluded (genotyping rate is 0.284024)
plink --file pruned --extract plink.snplist --recode --recodeA --recode-structure --out subset --noweb --allow-no-sex
Using 3317 SNPs.
Started structure, admix 10000 and 10000 k=1 through k=6.
The adegenet plot showing missing data shows that the SEW population is missing the most and LEM is missing a lot too.
#####Saturday, 19 March 2016
Starting to parse the structure output
Using Structure harvester, I found that K=4 is best in admixture model but K=2 is best in no admixture model. So clearly I need to do more analyses.
Used this and Notepad++ to get a list of the files and then run parse_structure_output
sarah@sarah-VirtualBox: ls nerophis/structure/nop_popgen/admixture/Results/*_f > ./scripts/parse_nerophis_structure.sh
sarah@sarah-VirtualBox:~/sf_ubuntushare/popgen$ ls nerophis/structure/nop_popgen/no_admix/Results/*_f >> ./scripts/parse_nerophis_structure.sh
Now I need to run faststructure, adegenet, and PCAdapt.
#####Thursday, 10 March 2016
Wrote parse_structure_output to convert the Results/*_f files into something more readable by R.
#####Tuesday, 8 March 2016
ran
plink --file pruned --recode-structure --recode --recodeA --noweb --allow-no-sex --out nop.subset
**Note: Total genotyping rate in remaining individuals is 0.284024
and then started structure
#####Monday, 7 March 2016
So the numbers below are wrong. The populations run with -r 75 had 273 loci.
Let's try
populations -b 3 -P ./stacks -s -t 3 -M ./nop_map.txt -a 0.05 -p 2 -r 0.5 --plink --structure --vcf
That's better. Now there are 6527.
Used prune_snps and ended up with 4612 SNPs.
#####Friday, 4 March 2016
That one was killed too. I guess asking it to keep all 6 populations is too much? Maybe I'll just go with the one with -r 0.75...one last try!
populations -b 3 -P ./stacks -s -t 3 -M ./nop_map.txt -a 0.05 --fstats --plink --structure -p 2 --vcf --vcf_haplotypes
This just isn't working, even without fstats.
#####Thursday, 3 March 2016
Trying this:
populations -b 3 -P ./stacks -s -t 3 -M ./nop_map.txt -a 0.05 --fstats --plink --structure -p 6 --vcf --vcf_haplotypes
The one below was killed.
#####Wednesday, 2 March 2016
/usr/local/bin/populations -b 3 -P ./stacks -s -t 6 -M ./nop_map.txt -a 0.05 --genomic --plink --structure -p 6 --vcf
#####Tuesday, 1 March 2016
/usr/local/bin/populations -b 3 -P ./stacks -s -t 6 -M ./nop_map.txt -r 0.75 -a 0.05 --genomic --plink --structure
resulted in 144780 loci.
Now running populations -b 3 -P ./stacks -s -t 6 -M ./nop_map.txt -r 0.75 -a 0.05 --genomic --plink --structure -p 6
1235
#####Saturday, 27 February 2016
I've installed Stacks v.1.37 on my home computer and I'm going to re-run stacks from the beginning. Added the renaming and moving of files to run_process_radtags.sh