### Introduction

We first need to obtain the 1000 Genomes (1kG) data. The download we have available is on the hg19 build, so we will have to lift it over to hg38. Additionally, after some QC steps, we will attempt to combine it with our genotype data on the variants the two datasets share. This will allow us to plot the ancestries.

In [None]:
#download the data!
#uncomment the line below if need to download
#wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz

In [2]:
#convert the 1000 Genomes (1kG) VCF to PLINK binary format
plink --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz --make-bed --out ALL.2of4intersection.20100804.genotypes

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.2of4intersection.20100804.genotypes.log.
Options in effect:
  --make-bed
  --out ALL.2of4intersection.20100804.genotypes
  --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz

193099 MB RAM detected; reserving 96549 MB for main workspace.
--vcf: ALL.2of4intersection.20100804.genotypes-temporary.bed +
ALL.2of4intersection.20100804.genotypes-temporary.bim +
ALL.2of4intersection.20100804.genotypes-temporary.fam written.
25488488 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to ALL.2of4intersection.20100804.genotypes.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies...

In [3]:
#assign IDs of the form chr:pos[b37]allele1,allele2 to variants that have no ID
plink --bfile ALL.2of4intersection.20100804.genotypes --set-missing-var-ids @:#[b37]\$1,\$2 --make-bed --out ALL.2of4intersection.20100804.genotypes_no_missing_IDs

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.2of4intersection.20100804.genotypes_no_missing_IDs.log.
Options in effect:
  --bfile ALL.2of4intersection.20100804.genotypes
  --make-bed
  --out ALL.2of4intersection.20100804.genotypes_no_missing_IDs
  --set-missing-var-ids @:#[b37]$1,$2

193099 MB RAM detected; reserving 96549 MB for main workspace.
25488488 variants loaded from .bim file.
10375501 missing IDs set.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
ALL.2of4intersection.20100804.genotypes_no_missing_IDs.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies...

In [4]:
ls

ALL.2of4intersection.20100804.genotypes.bed
ALL.2of4intersection.20100804.genotypes.bim
ALL.2of4intersection.20100804.genotypes.fam
ALL.2of4intersection.20100804.genotypes.log
ALL.2of4intersection.20100804.genotypes_no_missing_IDs.bed
ALL.2of4intersection.20100804.genotypes_no_missing_IDs.bim
ALL.2of4intersection.20100804.genotypes_no_missing_IDs.fam
ALL.2of4intersection.20100804.genotypes_no_missing_IDs.log
ALL.2of4intersection.20100804.genotypes_no_missing_IDs.nosex
ALL.2of4intersection.20100804.genotypes.nosex
ALL.2of4intersection.20100804.genotypes.vcf.gz
AncestryAnd1KG.ipynb
final_IBP.bed
final_IBP.bim
final_IBP.fam
final_IBP.log
final_IBP.nosex
final_phenotypes.txt


Begin QC of the 1kG data.

In [5]:
# Remove variants with more than 20% missing genotypes.
plink --bfile ALL.2of4intersection.20100804.genotypes_no_missing_IDs --geno 0.2 --allow-no-sex --make-bed --out 1kG_1

# Remove individuals with more than 20% missing genotypes.
plink --bfile 1kG_1 --mind 0.2 --allow-no-sex --make-bed --out 1kG_2

# Remove variants with more than 2% missing genotypes.
plink --bfile 1kG_2 --geno 0.02 --allow-no-sex --make-bed --out 1kG_3

# Remove individuals with more than 2% missing genotypes.
plink --bfile 1kG_3 --mind 0.02 --allow-no-sex --make-bed --out 1kG_4

# Remove variants with a minor allele frequency below 5%.
plink --bfile 1kG_4 --maf 0.05 --allow-no-sex --make-bed --out 1kG_5


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_1.log.
Options in effect:
  --allow-no-sex
  --bfile ALL.2of4intersection.20100804.genotypes_no_missing_IDs
  --geno 0.2
  --make-bed
  --out 1kG_1

193099 MB RAM detected; reserving 96549 MB for main workspace.
25488488 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 1kG_1.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.615305.
16481066 variants removed due to missing genotype data (--geno).
9007422 variants and 
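
Each `.bim` file holds one line per variant, so comparing line counts across `1kG_1` to `1kG_5` shows how many variants each variant filter removed. A minimal sketch; `count_variants` is a hypothetical helper name, not a PLINK feature:

```shell
# count_variants (hypothetical helper): print "prefix<TAB>number of variants"
# for each PLINK fileset prefix, based on the line count of prefix.bim.
count_variants() {
    for prefix in "$@"; do
        if [ -f "${prefix}.bim" ]; then
            # Reading from stdin makes wc print only the number;
            # tr removes the padding some wc implementations add.
            n=$(wc -l < "${prefix}.bim" | tr -d ' ')
            printf '%s\t%s\n' "$prefix" "$n"
        else
            printf '%s\tmissing\n' "$prefix"
        fi
    done
}

# Possible use with the prefixes from the cell above:
# count_variants 1kG_1 1kG_2 1kG_3 1kG_4 1kG_5
```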

We can also remove the strand-ambiguous SNPs (A/T and C/G pairs), whose alleles read the same on both strands.

In [14]:
# Keep the IDs (column 2) of variants that are not A/T or C/G pairs.
awk '!(($5=="A" && $6=="T") || \
       ($5=="T" && $6=="A") || \
       ($5=="G" && $6=="C") || \
       ($5=="C" && $6=="G")) {print $2}' 1kG_5.bim > non_ambiguous_snps.txt

In [15]:
wc -l non_ambiguous_snps.txt 1kG_5.bim

  4938714 non_ambiguous_snps.txt
  5808310 1kG_5.bim
 10747024 total
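
The filter can be checked on a few hand-written variants before it is trusted on millions of them. A small sketch, assuming the standard six-column `.bim` layout (chromosome, ID, cM, bp, A1, A2); `keep_unambiguous` is a hypothetical helper name:

```shell
# keep_unambiguous (hypothetical helper): print the IDs of variants in a
# .bim file whose allele pair is not A/T, T/A, C/G or G/C.
keep_unambiguous() {
    awk '{
        pair = $5 $6    # concatenate A1 and A2, e.g. "AG"
        if (pair != "AT" && pair != "TA" && pair != "CG" && pair != "GC")
            print $2
    }' "$1"
}

# Possible use:
# keep_unambiguous 1kG_5.bim > non_ambiguous_snps.txt
```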


In [16]:
plink --bfile 1kG_5 --extract non_ambiguous_snps.txt --allow-no-sex --make-bed --out 1kG_6

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_6.log.
Options in effect:
  --allow-no-sex
  --bfile 1kG_5
  --extract non_ambiguous_snps.txt
  --make-bed
  --out 1kG_6

193099 MB RAM detected; reserving 96549 MB for main workspace.
5808310 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 1kG_6.nosex .
--extract: 4938714 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999561.
4938714 variants and 629 people pass filters and QC.
Note: No phenotypes pr

In [17]:
wc -l 1kG_6.bim

4938714 1kG_6.bim


Remove duplicated SNPs

In [114]:
plink --bfile 1kG_6 --write-snplist --out all_snps

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to all_snps.log.
Options in effect:
  --bfile 1kG_6
  --out all_snps
  --write-snplist

193099 MB RAM detected; reserving 96549 MB for main workspace.
4938714 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to all_snps.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999561.
4938714 variants and 629 people pass filters and QC.
Note: No phenotypes present.
List of variant IDs written to all_snps.snplist .


In [115]:
sort all_snps.snplist | uniq -d > duplicated_snps.snplist

In [116]:
plink --bfile 1kG_6 --exclude duplicated_snps.snplist --make-bed --out 1kg_6_unique

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kg_6_unique.log.
Options in effect:
  --bfile 1kG_6
  --exclude duplicated_snps.snplist
  --make-bed
  --out 1kg_6_unique

193099 MB RAM detected; reserving 96549 MB for main workspace.
4938714 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 1kg_6_unique.nosex .
--exclude: 4938714 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999561.
4938714 variants and 629 people pass filters and QC.
Note: No phenotyp
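
In the log above, `--exclude` leaves all 4938714 variants in place, which indicates that the duplicate list was empty for this dataset. How `uniq -d` behaves when duplicates do exist can be seen with a small sketch (`list_duplicate_ids` is a hypothetical helper name):

```shell
# list_duplicate_ids (hypothetical helper): print each ID that occurs more
# than once in a file with one ID per line. uniq -d only compares adjacent
# lines, so the input is sorted first.
list_duplicate_ids() {
    sort "$1" | uniq -d
}

# Possible use:
# list_duplicate_ids all_snps.snplist > duplicated_snps.snplist
```

Each duplicated ID is listed once, and passing that list to `--exclude` drops every copy of the variant, not just the extra ones.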

Now lift the 1kG data over from hg19 to hg38.

In [6]:
#download liftOver
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver

--2022-11-15 19:06:05--  http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/liftOver
Resolving hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.cse.ucsc.edu (hgdownload.cse.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 34882144 (33M)
Saving to: ‘liftOver’


2022-11-15 19:06:10 (10.1 MB/s) - ‘liftOver’ saved [34882144/34882144]



In [7]:
#download the chain file to lift from hg19 to hg38
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz

--2022-11-15 19:06:13--  http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 227698 (222K) [application/x-gzip]
Saving to: ‘hg19ToHg38.over.chain.gz’


2022-11-15 19:06:14 (351 KB/s) - ‘hg19ToHg38.over.chain.gz’ saved [227698/227698]



In [5]:
#must convert from binary to text (.ped/.map) files
plink --bfile 1kg_6_unique --recode --out 1kg_6_unique

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kg_6_unique.log.
Options in effect:
  --bfile 1kg_6_unique
  --out 1kg_6_unique
  --recode

193099 MB RAM detected; reserving 96549 MB for main workspace.
4938714 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 1kg_6_unique.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999561.
4938714 variants and 629 people pass filters and QC.
Note: No phenotypes present.
--recode ped to 1kg_6_unique.ped + 1kg_6_unique.map ... 10

In [6]:
ls 1kg_6_unique.*

1kg_6_unique.bed  1kg_6_unique.fam  1kg_6_unique.map	1kg_6_unique.ped
1kg_6_unique.bim  1kg_6_unique.log  1kg_6_unique.nosex


In [8]:
head 1kg_6_unique.map

1	rs58108140	0	10583
1	1:11508[b37]A,G	0	11508
1	1:15820[b37]G,T	0	15820
1	1:16378[b37]C,T	0	16378
1	1:30923[b37]G,T	0	30923
1	1:40261[b37]A,C	0	40261
1	1:49298[b37]C,T	0	49298
1	rs62637812	0	51803
1	1:52238[b37]G,T	0	52238
1	1:54421[b37]A,G	0	54421


The positions must be encoded in the UCSC BED format that liftOver expects (not to be confused with PLINK's binary .bed), for example:


`chr1    743267  743268  rs3115860`

In [9]:
awk '{print "chr"$1,"\t",$4,"\t",$4+1,"\t"$2,"\t",$1}' 1kg_6_unique.map > 1kG_liftOver.bed
head 1kG_liftOver.bed

chr1 	 10583 	 10584 	rs58108140 	 1
chr1 	 11508 	 11509 	1:11508[b37]A,G 	 1
chr1 	 15820 	 15821 	1:15820[b37]G,T 	 1
chr1 	 16378 	 16379 	1:16378[b37]C,T 	 1
chr1 	 30923 	 30924 	1:30923[b37]G,T 	 1
chr1 	 40261 	 40262 	1:40261[b37]A,C 	 1
chr1 	 49298 	 49299 	1:49298[b37]C,T 	 1
chr1 	 51803 	 51804 	rs62637812 	 1
chr1 	 52238 	 52239 	1:52238[b37]G,T 	 1
chr1 	 54421 	 54422 	1:54421[b37]A,G 	 1
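
UCSC BED intervals are 0-based and half-open, so a single base at 1-based position p is conventionally written with start p-1 and end p. The cell above writes p and p+1 instead and leaves spaces around the tabs. A sketch of the conventional encoding, assuming a four-column `.map` file as input (`map_to_bed` is a hypothetical helper name):

```shell
# map_to_bed (hypothetical helper): convert a PLINK .map file
# (chr, ID, cM, bp) into tab-separated BED lines:
# chrom, 0-based start, end, ID, original chromosome code.
map_to_bed() {
    awk 'BEGIN { OFS = "\t" } { print "chr" $1, $4 - 1, $4, $2, $1 }' "$1"
}

# Possible use:
# map_to_bed 1kg_6_unique.map > 1kG_liftOver.bed
```

With this encoding the lifted 1-based position is column 3 of the liftOver output rather than column 2.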



In [12]:
liftOver 1kG_liftOver.bed hg19ToHg38.over.chain.gz 1kG_lifted.bed 1kG_unlifted.bed

Reading liftover chains
Mapping coordinates


In [8]:
#need to exclude the SNPs not lifted over
tail 1kG_lifted.bed
wc -l 1kG_lifted.bed

chr22	50793229	50793230	22:51231657[b37]C,T	22
chr22	50794884	50794885	rs62240043	22
chr22	50797019	50797020	22:51235447[b37]A,G	22
chr22	50797068	50797069	22:51235496[b37]A,G	22
chr22	50797070	50797071	22:51235498[b37]A,G	22
chr22	50797119	50797120	22:51235547[b37]A,C	22
chr22	50797531	50797532	22:51235959[b37]C,T	22
chr22	50798635	50798636	rs3896457	22
chr22	50799821	50799822	22:51238249[b37]A,C	22
chr22	50802920	50802921	22:51241348[b37]C,T	22
4937181 1kG_lifted.bed


In [14]:
#convert lifted .bed back to .map
awk '{print $5,$4,0,$2}' 1kG_lifted.bed > 1kG_2_lifted.map
head 1kG_2_lifted.map

1 rs58108140 0 10583
1 1:11508[b37]A,G 0 11508
1 1:15820[b37]G,T 0 15820
1 1:16378[b37]C,T 0 16378
1 1:30923[b37]G,T 0 30923
1 1:40261[b37]A,C 0 40261
1 1:49298[b37]C,T 0 49298
1 rs62637812 0 51803
1 1:52238[b37]G,T 0 52238
1 1:54421[b37]A,G 0 54421
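
A rewritten `.map` only changes coordinates if it is the file PLINK reads next to the genotypes; `--recode` and `--make-bed` write their own map from whatever fileset they are given. One alternative, sketched here under assumptions, is to turn the liftOver output into an `ID position` table for PLINK's `--update-map`. The helper name `lifted_to_update_map` and the `1kG_hg38_*` file names are hypothetical. Variants that liftOver moved to another chromosome or contig are skipped, because `--update-map` changes positions only.

```shell
# lifted_to_update_map (hypothetical helper): read liftOver output with the
# columns chrom, start, end, ID, original chromosome code and print
# "ID<TAB>new position" for variants that stayed on their chromosome.
# Column 2 is used because the BED file in this notebook was written with
# start = position.
lifted_to_update_map() {
    awk 'BEGIN { OFS = "\t" } $1 == ("chr" $5) { print $4, $2 }' "$1"
}

# Possible use (not run here):
# lifted_to_update_map 1kG_lifted.bed > 1kG_hg38_positions.txt
# cut -f1 1kG_hg38_positions.txt > 1kG_hg38_ids.txt
# plink --bfile 1kg_6_unique --extract 1kG_hg38_ids.txt \
#       --update-map 1kG_hg38_positions.txt --make-bed --out 1kG_7_hg38
```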


In [19]:
plink --file 1kg_6_unique --exclude 1kG_unlifted.bed --allow-no-sex --recode --out 1kG_2_lifted.ped

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_2_lifted.ped.log.
Options in effect:
  --allow-no-sex
  --exclude 1kG_unlifted.bed
  --file 1kg_6_unique
  --out 1kG_2_lifted.ped
  --recode

193099 MB RAM detected; reserving 96549 MB for main workspace.
.ped scan complete (for binary autoconversion).

In [21]:
ls 1kG_2_lifted*

1kG_2_lifted.map      1kG_2_lifted.ped.map    1kG_2_lifted.ped.ped
1kG_2_lifted.ped.log  1kG_2_lifted.ped.nosex


In [23]:
plink --file 1kG_2_lifted.ped --allow-no-sex --make-bed --out 1kG_7

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_7.log.
Options in effect:
  --allow-no-sex
  --file 1kG_2_lifted.ped
  --make-bed
  --out 1kG_7

193099 MB RAM detected; reserving 96549 MB for main workspace.
.ped scan complete (for binary autoconversion).

In [None]:
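#optional sanity check: the three cells below confirm that extracting the lifted IDs
#gives the same variant count as excluding the unlifted ones (candidates for deletion)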

In [17]:
awk '{print $4}' 1kG_lifted.bed > 1kG_6_lifted.txt
head 1kG_6_lifted.txt
wc -l 1kG_6_lifted.txt 1kg_6_unique.bim 1kG_unlifted.bed

rs58108140
1:11508[b37]A,G
1:15820[b37]G,T
1:16378[b37]C,T
1:30923[b37]G,T
1:40261[b37]A,C
1:49298[b37]C,T
rs62637812
1:52238[b37]G,T
1:54421[b37]A,G
  4937181 1kG_6_lifted.txt
  4938714 1kg_6_unique.bim
     3066 1kG_unlifted.bed
  9878961 total


In [11]:
plink --file 1kg_6_unique --extract 1kG_6_lifted.txt --allow-no-sex --recode --make-bed --out 1kG_test

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_test.log.
Options in effect:
  --allow-no-sex
  --extract 1kG_6_lifted.txt
  --file 1kg_6_unique
  --make-bed
  --out 1kG_test
  --recode

193099 MB RAM detected; reserving 96549 MB for main workspace.
.ped scan complete (for binary autoconversion).

In [14]:
#4933775
wc -l 1kG_6.bim 1kG_test.bim 1kG_7.bim 1kG_6_lifted.txt

  4938714 1kG_6.bim
  4937181 1kG_test.bim
  4937181 1kG_7.bim
  4937181 1kG_6_lifted.txt
 19750257 total


In [24]:
wc -l 1kG_6.bim 1kG_7.bim

  4938714 1kG_6.bim
  4937181 1kG_7.bim
  9875895 total


Verify that the variant IDs in the 1000 Genomes data use the same format as those in our cohort.

In [25]:
head final_IBP.bim

1	1:10930	0	10930	A	G
1	1:10989	0	10989	A	G
1	1:11171	0	11171	C	CCTTG
1	1:23197	0	23197	T	TTAAAA
1	1:23308	0	23308	C	G
1	1:24963	0	24963	G	GT
1	1:28918	0	28918	C	G
1	1:39743	0	39743	T	TA
1	1:47647	0	47647	T	G
1	1:48824	0	48824	C	T


In [26]:
head 1kG_7.bim

1	rs58108140	0	10583	A	G
1	1:11508[b37]A,G	0	11508	A	G
1	1:15820[b37]G,T	0	15820	T	G
1	1:16378[b37]C,T	0	16378	T	C
1	1:30923[b37]G,T	0	30923	G	T
1	1:40261[b37]A,C	0	40261	A	C
1	1:49298[b37]C,T	0	49298	T	C
1	rs62637812	0	51803	C	T
1	1:52238[b37]G,T	0	52238	T	G
1	1:54421[b37]A,G	0	54421	G	A


In [27]:
awk '{print $2,$1":"$4}' 1kG_7.bim > 1kG_updated_names.bim
head 1kG_updated_names.bim

rs58108140 1:10583
1:11508[b37]A,G 1:11508
1:15820[b37]G,T 1:15820
1:16378[b37]C,T 1:16378
1:30923[b37]G,T 1:30923
1:40261[b37]A,C 1:40261
1:49298[b37]C,T 1:49298
rs62637812 1:51803
1:52238[b37]G,T 1:52238
1:54421[b37]A,G 1:54421
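
Renaming to `chr:pos` gives two variants the same name whenever they share a position, so it may be worth checking for collisions before calling `--update-name`. A minimal sketch (`colliding_positions` is a hypothetical helper name):

```shell
# colliding_positions (hypothetical helper): print every chr:bp label that
# more than one variant in the given .bim file would receive.
colliding_positions() {
    awk '{ key = $1 ":" $4; n[key]++ }
         END { for (k in n) if (n[k] > 1) print k }' "$1" | sort
}

# Possible use:
# colliding_positions 1kG_7.bim | head
```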


In [28]:
plink --bfile 1kG_7 --update-name 1kG_updated_names.bim --make-bed --out 1kG_8
head 1kG_8.bim

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_8.log.
Options in effect:
  --bfile 1kG_7
  --make-bed
  --out 1kG_8
  --update-name 1kG_updated_names.bim

193099 MB RAM detected; reserving 96549 MB for main workspace.
4937181 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 1kG_8.nosex .
--update-name: 4937181 values updated.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999561.
4937181 variants and 629 people pass filters and QC.
Note: No phenotypes present.
--make-

Extract variants present in both our cohort and 1kg

In [29]:
awk '{print$2}' final_IBP.bim > IBP_SNPs.txt
plink --bfile 1kG_8 --extract IBP_SNPs.txt --make-bed --out 1kG_9

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_9.log.
Options in effect:
  --bfile 1kG_8
  --extract IBP_SNPs.txt
  --make-bed
  --out 1kG_9

193099 MB RAM detected; reserving 96549 MB for main workspace.
4937181 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 1kG_9.nosex .
--extract: 64529 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999621.
64529 variants and 629 people pass filters and QC.
Note: No phenotypes present.
--make-bed to 1kG_9.bed 

In [30]:
awk '{print$2}' 1kG_9.bim > 1kG_9_SNPs.txt
plink --bfile final_IBP --extract 1kG_9_SNPs.txt --allow-no-sex --recode --make-bed --out final_IBP_2
#now datasets contain same variants!

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to final_IBP_2.log.
Options in effect:
  --allow-no-sex
  --bfile final_IBP
  --extract 1kG_9_SNPs.txt
  --make-bed
  --out final_IBP_2
  --recode

193099 MB RAM detected; reserving 96549 MB for main workspace.
29070215 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to final_IBP_2.nosex .
3864 phenotype values loaded from .fam.
--extract: 64529 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... done.
64529 variants and 3864 people pass filters
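
The two `--extract` calls amount to intersecting the ID columns of the two `.bim` files. The same set can be computed directly, which makes the overlap easy to inspect before any genotypes are rewritten. A sketch (`shared_variant_ids` is a hypothetical helper name); because the IDs are `chr:pos` labels, the result is only meaningful if both files are on the same genome build:

```shell
# shared_variant_ids (hypothetical helper): print the IDs (column 2) that
# occur in both .bim files, in the order of the second file.
shared_variant_ids() {
    awk 'NR == FNR { seen[$2] = 1; next }   # first file: remember its IDs
         ($2 in seen) { print $2 }          # second file: print IDs seen before
        ' "$1" "$2"
}

# Possible use:
# shared_variant_ids final_IBP.bim 1kG_8.bim | wc -l
```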

Do the two datasets have the same build? Update the 1kG positions from our cohort's map to make sure they agree.

In [31]:
#plink --bfile final_IBP_2 --recode --out final_IBP_2

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to final_IBP_2.log.
Options in effect:
  --bfile final_IBP_2
  --out final_IBP_2
  --recode

193099 MB RAM detected; reserving 96549 MB for main workspace.
64529 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to final_IBP_2.nosex .
3864 phenotype values loaded from .fam.
phenotypes to be ignored, use the --allow-no-sex flag.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... done.
64529 variants and 3864 people pass filters and QC.
Among remaining phenotypes, 306

In [32]:
awk '{print$2,$4}' final_IBP_2.map > buildhapmap.txt

In [33]:
plink --bfile 1kG_9 --update-map buildhapmap.txt --make-bed --out 1kG_9_intermediate


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to 1kG_9_intermediate.log.
Options in effect:
  --bfile 1kG_9
  --make-bed
  --out 1kG_9_intermediate
  --update-map buildhapmap.txt

193099 MB RAM detected; reserving 96549 MB for main workspace.
64529 variants loaded from .bim file.
629 people (0 males, 0 females, 629 ambiguous) loaded from .fam.
Ambiguous sex IDs written to 1kG_9_intermediate.nosex .
--update-map: 64529 values updated.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 629 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999621.
64529 variants and 629 people pass filters and QC.
Note: No ph

In [31]:
head 1kG_9.bim
head 1kG_9_intermediate.bim

1	1:54490	0	54490	A	G
1	1:813109	0	813109	C	A
1	1:846808	0	846808	T	C
1	1:885657	0	885657	A	G
1	1:974494	0	974494	G	T
1	1:978193	0	978193	A	G
1	1:1025301	0	1025301	T	C
1	1:1056006	0	1056006	A	G
1	1:1088777	0	1088777	C	T
1	1:1157564	0	1157564	A	G
1	1:54490	0	54490	A	G
1	1:813109	0	813109	C	A
1	1:846808	0	846808	T	C
1	1:885657	0	885657	A	G
1	1:974494	0	974494	G	T
1	1:978193	0	978193	A	G
1	1:1025301	0	1025301	T	C
1	1:1056006	0	1056006	A	G
1	1:1088777	0	1088777	C	T
1	1:1157564	0	1157564	A	G


Before merging, we must:

1) Make sure the reference alleles are the same in our cohort and the 1000 Genomes Project datasets.

2) Resolve strand issues.

3) Remove the SNPs that still differ between the datasets after the previous two steps.


In [34]:
#set the reference (A1) allele in our cohort to the 1kG A1 allele
awk '{print$2,$5}' 1kG_9_intermediate.bim > 1kg_ref-list.txt
plink --bfile final_IBP_2 --reference-allele 1kg_ref-list.txt --make-bed --out final_IBP_3


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to final_IBP_3.log.
Options in effect:
  --a1-allele 1kg_ref-list.txt
  --bfile final_IBP_2
  --make-bed
  --out final_IBP_3

193099 MB RAM detected; reserving 96549 MB for main workspace.
64529 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to final_IBP_3.nosex .
3864 phenotype values loaded from .fam.
phenotypes to be ignored, use the --allow-no-sex flag.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... done.

In [25]:
wc -l final_IBP_3.bim

64529 final_IBP_3.bim


In [35]:
# Check for potential strand issues.
awk '{print$2,$5,$6}' 1kG_9_intermediate.bim > 1kG_9_tmp
awk '{print$2,$5,$6}' final_IBP_3.bim > final_IBP_3_tmp
sort 1kG_9_tmp final_IBP_3_tmp |uniq -u > all_differences.txt

In [33]:
head all_differences.txt
wc -l all_differences.txt

10:100023957 G T
10:100023957 T C
10:100168631 C G
10:100168631 T G
10:10018514 A G
10:10018514 T TG
10:10021863 G A
10:10021863 T C
10:10057699 C CA
10:10057699 T C
78452 all_differences.txt
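
`sort | uniq -u` reports every line that differs between the two files, but not how it differs. A sketch that joins the two `.bim` files on variant ID and labels each shared variant as a match, an A1/A2 swap, a strand flip, or an unresolved mismatch (`classify_alleles` is a hypothetical helper name, and the labels are illustrative):

```shell
# classify_alleles (hypothetical helper): compare the alleles of a target
# .bim (second argument) against a reference .bim (first argument) and
# print "ID label" for every variant present in both.
classify_alleles() {
    awk '
    function comp(b) {   # complement of a single base, "N" for anything else
        if (b == "A") return "T"
        if (b == "T") return "A"
        if (b == "C") return "G"
        if (b == "G") return "C"
        return "N"
    }
    NR == FNR { a1[$2] = $5; a2[$2] = $6; next }   # reference .bim
    ($2 in a1) {
        r1 = a1[$2]; r2 = a2[$2]; x = $5; y = $6
        if      (x == r1 && y == r2)             label = "match"
        else if (x == r2 && y == r1)             label = "swapped"
        else if (comp(x) == r1 && comp(y) == r2) label = "flipped"
        else if (comp(x) == r2 && comp(y) == r1) label = "flipped_swapped"
        else                                     label = "mismatch"
        print $2, label
    }' "$1" "$2"
}

# Possible use: tabulate the labels.
# classify_alleles 1kG_9_intermediate.bim final_IBP_3.bim | awk '{print $2}' | sort | uniq -c
```

Only the variants labelled as flipped are natural candidates for `--flip`; flipping a variant whose alleles do not correspond at all cannot make it match.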


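We may need to keep only variants with single-base alleles: our cohort contains multi-character alleles (indels such as `T`/`TG`), which the 1kG data does not.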

In [60]:
awk 'length($5)>1 || length($6) > 1 {print $5}' final_IBP_2.bim | wc -l

3556


In [61]:
awk 'length($5)>1 || length($6) > 1 {print $5}' 1kG_8.bim | wc -l

0
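
The length test above flags multi-character alleles. A slightly stricter variant, sketched here as an alternative rather than as the notebook's method, keeps a variant only if both alleles are a single A, C, G or T, so it also flags missing-allele codes such as `0` (`list_non_snv_ids` is a hypothetical helper name):

```shell
# list_non_snv_ids (hypothetical helper): print the IDs of variants in a
# .bim file where either allele is not exactly one of A, C, G, T.
list_non_snv_ids() {
    awk '$5 !~ /^[ACGT]$/ || $6 !~ /^[ACGT]$/ { print $2 }' "$1"
}

# Possible use:
# list_non_snv_ids final_IBP_3.bim > non_biallelic.txt
```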


In [44]:
awk 'length($5)>1 || length($6) > 1 {print $2}' final_IBP_3.bim > non_biallelic.txt

In [45]:
plink --bfile final_IBP_3 --exclude non_biallelic.txt --allow-no-sex --make-bed --out final_IBP_3_intermediate

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to final_IBP_3_intermediate.log.
Options in effect:
  --allow-no-sex
  --bfile final_IBP_3
  --exclude non_biallelic.txt
  --make-bed
  --out final_IBP_3_intermediate

193099 MB RAM detected; reserving 96549 MB for main workspace.
64529 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to final_IBP_3_intermediate.nosex .
3864 phenotype values loaded from .fam.
--exclude: 60973 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... done.
60973 variant

In [46]:
wc -l final_IBP_3_intermediate.bim

60973 final_IBP_3_intermediate.bim


In [47]:
# Redo the check for potential strand issues, now without the multi-character alleles.
awk '{print$2,$5,$6}' 1kG_9_intermediate.bim > 1kG_9_tmp
awk '{print$2,$5,$6}' final_IBP_3_intermediate.bim > final_IBP_3_int_tmp
sort 1kG_9_tmp final_IBP_3_int_tmp |uniq -u > all_differences_int.txt

In [48]:
awk '{print$1}' all_differences_int.txt | sort -u > flip_list_int.txt
#flip non-corresponding SNPs
plink --bfile final_IBP_3_intermediate --flip flip_list_int.txt --reference-allele 1kg_ref-list.txt --make-bed --out final_IBP_4_2


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to final_IBP_4_2.log.
Options in effect:
  --a1-allele 1kg_ref-list.txt
  --bfile final_IBP_3_intermediate
  --flip flip_list_int.txt
  --make-bed
  --out final_IBP_4_2

193099 MB RAM detected; reserving 96549 MB for main workspace.
60973 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to final_IBP_4_2.nosex .
3864 phenotype values loaded from .fam.
--flip: 43987 SNPs flipped, 3556 SNP IDs not present.
phenotypes to be ignored, use the --allow-no-sex flag.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... 1011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677

In [49]:
# Check for SNPs which are still problematic after they have been flipped.
awk '{print$2,$5,$6}' final_IBP_4_2.bim > final_IBP_4_2_tmp
sort 1kG_9_tmp final_IBP_4_2_tmp |uniq -u  > uncorresponding_SNPs_int.txt


In [50]:
wc -l uncorresponding_SNPs_int.txt

61138 uncorresponding_SNPs_int.txt


In [51]:
# 3) Remove problematic SNPs from our own dataset and from 1000 Genomes.
# List the IDs of the SNPs that still differ between the two datasets after flipping.
awk '{print$1}' uncorresponding_SNPs_int.txt | sort -u > SNPs_for_exlusion_int.txt
# Remove the problematic SNPs from both datasets.
plink --bfile final_IBP_4_2 --exclude SNPs_for_exlusion_int.txt --make-bed --out final_IBP_5_2
plink --bfile 1kG_9_intermediate --exclude SNPs_for_exlusion_int.txt --make-bed --out 1kG_10_2


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to final_IBP_5_2.log.
Options in effect:
  --bfile final_IBP_4_2
  --exclude SNPs_for_exlusion_int.txt
  --make-bed
  --out final_IBP_5_2

193099 MB RAM detected; reserving 96549 MB for main workspace.
60973 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to final_IBP_5_2.nosex .
3864 phenotype values loaded from .fam.
--exclude: 32182 variants remaining.
phenotypes to be ignored, use the --allow-no-sex flag.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done
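The counts reported above can be cross-checked with a little arithmetic. One reading that is consistent with the logs (an interpretation, not something the logs state directly) is that a variant present in both datasets with mismatching alleles contributes two lines to `uncorresponding_SNPs_int.txt`, while a variant present in only one dataset contributes a single line. A sketch using only numbers copied from the outputs above:

```shell
# Numbers copied from the outputs above
loaded=60973       # variants loaded from final_IBP_4_2.bim
remaining=32182    # variants remaining after --exclude
diff_lines=61138   # wc -l uncorresponding_SNPs_int.txt

# Variants actually removed from final_IBP_4_2
excluded=$((loaded - remaining))

# Lines left over if each removed variant accounts for two lines
single_lines=$((diff_lines - 2 * excluded))

echo "excluded from final_IBP_4_2: $excluded"
echo "lines not explained by pairs: $single_lines"
```

The remainder equals the 3556 IDs that `--flip` reported as not present, and also the number of variants dropped by `--exclude non_biallelic.txt` (64529 - 60973), which suggests those variants are still present in `1kG_9_intermediate`.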

In [52]:
plink --bfile final_IBP_5_2 --bmerge 1kG_10_2.bed 1kG_10_2.bim 1kG_10_2.fam --allow-no-sex --make-bed --out merged_int

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to merged_int.log.
Options in effect:
  --allow-no-sex
  --bfile final_IBP_5_2
  --bmerge 1kG_10_2.bed 1kG_10_2.bim 1kG_10_2.fam
  --make-bed
  --out merged_int

193099 MB RAM detected; reserving 96549 MB for main workspace.
3864 people loaded from final_IBP_5_2.fam.
629 people to be merged from 1kG_10_2.fam.
Of these, 629 are new, while 0 are present in the base dataset.
32182 markers loaded from final_IBP_5_2.bim.
32182 markers to be merged from 1kG_10_2.bim.
Of these, 0 are new, while 32182 are present in the base dataset.
Performing single-pass merge (4493 people, 32182 variants).
Merged fileset written to merged_int-merge.bed + merged_int-merge.bim +
merged_int-merge.fam .
32182 variants loaded from .bim file.
4493 people (0 males, 0 females, 4493 ambiguous) loaded from .fam.
Ambiguous sex IDs written to merged_int.no

In [53]:
plink --bfile merged_int --extract indepSNP.prune.in --genome --out merged_2_int
plink --bfile merged_int --read-genome merged_2_int.genome --cluster --pca 10 --out pca_merged


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to merged_2_int.log.
Options in effect:
  --bfile merged_int
  --extract indepSNP.prune.in
  --genome
  --out merged_2_int

193099 MB RAM detected; reserving 96549 MB for main workspace.
32182 variants loaded from .bim file.
4493 people (0 males, 0 females, 4493 ambiguous) loaded from .fam.
Ambiguous sex IDs written to merged_2_int.nosex .
3864 phenotype values loaded from .fam.
--extract: 27448 variants remaining.
phenotypes to be ignored, use the --allow-no-sex flag.
Using up to 35 threads (change this with --threads).
Before main variant filters, 4493 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping 

In [54]:

awk '{print$1,$2,"OWN"}' final_IBP.fam>racefile_own.txt


In [55]:
cat race_1kG14.txt racefile_own.txt | sed -e '1i\FID IID race' > racefile.txt


In [56]:
Rscript pca_script.R

 [1] "IID"   "FID"   "V3"    "V4"    "V5"    "V6"    "V7"    "V8"    "V9"   
[10] "V10"   "V11"   "V12"   "group"
    IID   FID         V3           V4           V5          V6          V7
1     1     1 0.00605037 -3.91818e-04 -0.000579187 -0.01041400 0.013223500
2 10003 10003 0.00564534  9.73189e-05  0.005577410  0.01466550 0.015328500
3 10004 10004 0.00606815 -4.65484e-04 -0.000221216 -0.00751029 0.003144000
4 10006 10006 0.00605308 -6.78670e-05 -0.000110103 -0.00642852 0.018216500
5 10008 10008 0.00604955 -4.06068e-04 -0.000855008 -0.00659588 0.008964420
6 10013 10013 0.00607332  7.45801e-05 -0.000300801 -0.00393606 0.000204327
            V8          V9          V10         V11         V12 group
1 -0.011138900 -0.00722605 -0.000937656 -0.00715494 -0.00332161   OWN
2 -0.017788700  0.00104997  0.005708050 -0.00628088 -0.01195530   OWN
3 -0.009155880 -0.01100770 -0.015187600  0.00192387 -0.00399194   OWN
4 -0.002063110  0.00433389  0.007567520  0.00665285  0.00377068   OWN
5  0.003730

In [36]:
awk '{print$1}' all_differences.txt | sort -u > flip_list.txt
#flip non-corresponding SNPs
plink --bfile final_IBP_3 --flip flip_list.txt --reference-allele 1kg_ref-list.txt --make-bed --out final_IBP_4


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to final_IBP_4.log.
Options in effect:
  --a1-allele 1kg_ref-list.txt
  --bfile final_IBP_3
  --flip flip_list.txt
  --make-bed
  --out final_IBP_4

193099 MB RAM detected; reserving 96549 MB for main workspace.
64529 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to final_IBP_4.nosex .
3864 phenotype values loaded from .fam.
--flip: 39226 SNPs flipped.
phenotypes to be ignored, use the --allow-no-sex flag.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.

In [36]:
wc -l flip_list.txt
head flip_list.txt

39226 flip_list.txt
10:100023957
10:100168631
10:10018514
10:10021863
10:10057699
10:100767209
10:101062818
10:101068037
10:101093036
10:101096372


In [37]:
# Check for SNPs which are still problematic after they have been flipped.
awk '{print$2,$5,$6}' final_IBP_4.bim > final_IBP_4_tmp
sort 1kG_9_tmp final_IBP_4_tmp |uniq -u  > uncorresponding_SNPs.txt


In [34]:
wc -l uncorresponding_SNPs.txt
head uncorresponding_SNPs.txt

48060 uncorresponding_SNPs.txt
10:100023957 G A
10:100023957 G T
10:100168631 G C
10:100168631 T G
10:10018514 A G
10:10018514 A TG
10:10057699 G CA
10:10057699 T C
10:101062818 A C
10:101062818 A G


In [38]:
# 3) Remove problematic SNPs from our own dataset and from 1000 Genomes.
# List the IDs of the SNPs that still differ between the two datasets after flipping.
awk '{print$1}' uncorresponding_SNPs.txt | sort -u > SNPs_for_exlusion.txt
# Remove the problematic SNPs from both datasets.
plink --bfile final_IBP_4 --exclude SNPs_for_exlusion.txt --make-bed --out final_IBP_5
plink --bfile 1kG_9_intermediate --exclude SNPs_for_exlusion.txt --make-bed --out 1kG_10


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to final_IBP_5.log.
Options in effect:
  --bfile final_IBP_4
  --exclude SNPs_for_exlusion.txt
  --make-bed
  --out final_IBP_5

193099 MB RAM detected; reserving 96549 MB for main workspace.
64529 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to final_IBP_5.nosex .
3864 phenotype values loaded from .fam.
--exclude: 40499 variants remaining.
phenotypes to be ignored, use the --allow-no-sex flag.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
40499 vari

In [28]:
wc -l SNPs_for_exlusion.txt

24030 SNPs_for_exlusion.txt
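For this run the counts line up exactly: every ID in the exclusion list appears on two lines of `uncorresponding_SNPs.txt` (one per dataset), and removing those IDs from the 64529 loaded variants gives the 40499 reported by `--exclude`. A quick check with the numbers printed above:

```shell
# Numbers copied from the outputs above
diff_lines=48060   # wc -l uncorresponding_SNPs.txt
unique_ids=24030   # wc -l SNPs_for_exlusion.txt
loaded=64529       # variants loaded from final_IBP_4.bim

ids_from_pairs=$((diff_lines / 2))
left_after_exclude=$((loaded - unique_ids))

echo "IDs implied by paired lines: $ids_from_pairs"
echo "variants left after exclusion: $left_after_exclude"
```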


In [39]:
plink --bfile final_IBP_5 --bmerge 1kG_10.bed 1kG_10.bim 1kG_10.fam --allow-no-sex --make-bed --out merged

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to merged.log.
Options in effect:
  --allow-no-sex
  --bfile final_IBP_5
  --bmerge 1kG_10.bed 1kG_10.bim 1kG_10.fam
  --make-bed
  --out merged

193099 MB RAM detected; reserving 96549 MB for main workspace.
3864 people loaded from final_IBP_5.fam.
629 people to be merged from 1kG_10.fam.
Of these, 629 are new, while 0 are present in the base dataset.
40499 markers loaded from final_IBP_5.bim.
40499 markers to be merged from 1kG_10.bim.
Of these, 0 are new, while 40499 are present in the base dataset.
Performing single-pass merge (4493 people, 40499 variants).
Merged fileset written to merged-merge.bed + merged-merge.bim +
merged-merge.fam .
40499 variants loaded from .bim file.
4493 people (0 males, 0 females, 4493 ambiguous) loaded from .fam.
Ambiguous sex IDs written to merged.nosex .
3864 phenotype values loaded from 

The LD pruning step was missed earlier, so it is run here to produce indepSNP.prune.in, which the PCA step needs.

In [83]:
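# Prune SNPs for LD (window of 50 variants, step 5, r^2 threshold 0.2), excluding the regions listed in inversion.txt.
# Note: PLINK 1.9 reports --range as deprecated; "--exclude range inversion.txt" is the current form.
plink --bfile final_IBP --exclude inversion.txt --range --indep-pairwise 50 5 0.2 --out indepSNP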

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to indepSNP.log.
Options in effect:
  --bfile final_IBP
  --exclude inversion.txt
  --indep-pairwise 50 5 0.2
  --out indepSNP
  --range

Note: --range flag deprecated.  Use e.g. "--extract range <filename>".
193099 MB RAM detected; reserving 96549 MB for main workspace.
29070215 variants loaded from .bim file.
3864 people (0 males, 0 females, 3864 ambiguous) loaded from .fam.
Ambiguous sex IDs written to indepSNP.nosex .
3864 phenotype values loaded from .fam.
--exclude range: 29070215 variants remaining.
phenotypes to be ignored, use the --allow-no-sex flag.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 3864 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626

Run the PCA on the merged data, restricted to the pruned SNPs.

In [21]:
plink --bfile merged --extract indepSNP.prune.in --genome --out merged_2
plink --bfile merged --read-genome merged_2.genome --cluster --pca 10 --out pca_merged


PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to merged_2.log.
Options in effect:
  --bfile merged
  --extract indepSNP.prune.in
  --genome
  --out merged_2

193099 MB RAM detected; reserving 96549 MB for main workspace.
40499 variants loaded from .bim file.
4493 people (0 males, 0 females, 4493 ambiguous) loaded from .fam.
Ambiguous sex IDs written to merged_2.nosex .
3864 phenotype values loaded from .fam.
--extract: 34934 variants remaining.
phenotypes to be ignored, use the --allow-no-sex flag.
Using up to 35 threads (change this with --threads).
Before main variant filters, 4493 founders and 0 nonfounders present.
Calculating allele frequencies... 10111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989 done.
Total genotyping rate is 0.999948

In [41]:
head pca_merged.eigenvec

1 1 0.00603594 -0.000350045 -0.00048743 0.017065 -0.0070905 0.00745327 -0.00756285 -0.00423995 -0.00871638 0.00177332
2 2 0.00605742 8.93595e-05 -0.000802358 -0.00598124 0.00364067 -0.0334369 0.0125996 -0.00719591 -0.0171038 -0.00925383
3 3 0.00602447 0.000426876 0.000195555 0.0151474 -0.00976435 0.00420849 0.00445705 -0.00166869 0.0103126 0.00735778
5 5 0.0060142 1.78281e-05 -0.000210749 -0.0106741 0.0142587 -0.0101749 0.00178847 -0.000256724 -0.00828762 -0.00493752
7 7 0.00603393 0.000362052 0.000405538 0.0107006 0.00239714 0.0233185 0.0109672 -0.00151261 -0.0187987 -0.00167633
8 8 0.0060431 -7.67423e-06 -0.000674436 0.0157471 -0.0120818 -0.0158165 -0.0033956 -0.00355967 -0.00420267 -0.000821562
15 15 0.00598809 -0.000174661 0.000274143 -0.00868255 0.0187184 0.0157521 -0.0171103 0.00274475 -0.0134457 -0.00171759
33 33 0.00596288 -0.000258922 0.00171454 -0.00808063 -0.0271303 -0.00494605 -0.0101549 -0.000486719 -0.00156137 -0.000617413
41 41 0.0060186 -3.33726e-05 -0.000199956 -0.0183

Make sure this ran correctly, then plot the ancestry. The population labels for the 1000 Genomes samples come from the panel file downloaded below.
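One simple way to check the run is to confirm that `pca_merged.eigenvec` has one row per sample and 12 columns (FID, IID, and 10 components, since `--pca 10` was requested). A sketch on an invented two-line file (`toy.eigenvec` and its values are hypothetical; for the real file the expected row count would be 4493, the number of people in the merged data):

```shell
# Toy check of an eigenvec-style file; the contents are invented.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > toy.eigenvec <<'EOF'
1 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
2 2 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
EOF

expected_cols=12   # FID + IID + 10 principal components

# Count rows and flag any row whose column count differs from the expectation
awk -v want="$expected_cols" '
  NF != want { bad++ }
  END { printf "rows=%d bad_rows=%d\n", NR, bad }
' toy.eigenvec > toy_check.txt

cat toy_check.txt
```

Pointing the same `awk` command at the real `pca_merged.eigenvec` would show whether any rows are malformed before the plotting script reads the file.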

In [86]:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel


--2022-11-15 21:22:44--  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel
           => ‘20100804.ALL.panel’
Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... 193.62.193.140
Connecting to ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)|193.62.193.140|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/ftp/release/20100804 ... done.
==> SIZE 20100804.ALL.panel ... 13499
==> PASV ... done.    ==> RETR 20100804.ALL.panel ... done.
Length: 13499 (13K) (unauthoritative)


2022-11-15 21:22:45 (177 KB/s) - ‘20100804.ALL.panel’ saved [13499]



In [87]:
# Build "FID IID population" from the panel file, then collapse the populations into continental groups.
awk '{print$1,$1,$2}' 20100804.ALL.panel > race_1kG.txt
sed -e 's/JPT/ASN/g' -e 's/ASW/AFR/g' -e 's/CEU/EUR/g' -e 's/CHB/ASN/g' \
    -e 's/CHD/ASN/g' -e 's/YRI/AFR/g' -e 's/LWK/AFR/g' -e 's/TSI/EUR/g' \
    -e 's/MXL/AMR/g' -e 's/GBR/EUR/g' -e 's/FIN/EUR/g' -e 's/CHS/ASN/g' \
    -e 's/PUR/AMR/g' race_1kG.txt > race_1kG14.txt
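Because the `sed` substitutions above match anywhere on the line, a population code that happened to occur inside another field would be rewritten as well. An alternative sketch maps only the third column through a lookup table in `awk`; the toy panel lines are invented for illustration, and `toy_panel.txt` and `toy_race.txt` are hypothetical file names:

```shell
# Toy illustration of a column-restricted population mapping; sample IDs are invented.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# "FID IID population" layout, as produced by the awk step above
cat > toy_panel.txt <<'EOF'
S1 S1 CEU
S2 S2 YRI
S3 S3 JPT
S4 S4 MXL
EOF

awk 'BEGIN {
       # Same population-to-group pairs as the sed substitutions
       n = split("JPT ASN ASW AFR CEU EUR CHB ASN CHD ASN YRI AFR LWK AFR TSI EUR MXL AMR GBR EUR FIN EUR CHS ASN PUR AMR", kv, " ")
       for (i = 1; i < n; i += 2) map[kv[i]] = kv[i + 1]
     }
     { if ($3 in map) $3 = map[$3]; print }' toy_panel.txt > toy_race.txt

cat toy_race.txt
```

Codes that are not in the table pass through unchanged, which makes unexpected population labels easy to spot in the output.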

In [1]:
awk '{print$1,$2,"OWN"}' final_IBP_2.fam>racefile_own.txt

In [2]:
cat race_1kG14.txt racefile_own.txt | sed -e '1i\FID IID race' > racefile.txt
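The one-line `1i\` form used above is a GNU `sed` extension. A more portable sketch writes the header first and then appends the two label files; the toy file names and contents below are invented:

```shell
# Toy illustration of prepending a header without sed; all data are invented.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'S1 S1 EUR\n' > toy_race_ref.txt
printf '1 1 OWN\n'   > toy_race_own.txt

# Header line first, then both label files in order
{
  printf 'FID IID race\n'
  cat toy_race_ref.txt toy_race_own.txt
} > toy_racefile.txt

cat toy_racefile.txt
```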
 

In [3]:
Rscript pca_script.R 

 [1] "IID"   "FID"   "V3"    "V4"    "V5"    "V6"    "V7"    "V8"    "V9"   
[10] "V10"   "V11"   "V12"   "group"
    IID   FID         V3           V4           V5          V6          V7
1     1     1 0.00605037 -3.91818e-04 -0.000579187 -0.01041400 0.013223500
2 10003 10003 0.00564534  9.73189e-05  0.005577410  0.01466550 0.015328500
3 10004 10004 0.00606815 -4.65484e-04 -0.000221216 -0.00751029 0.003144000
4 10006 10006 0.00605308 -6.78670e-05 -0.000110103 -0.00642852 0.018216500
5 10008 10008 0.00604955 -4.06068e-04 -0.000855008 -0.00659588 0.008964420
6 10013 10013 0.00607332  7.45801e-05 -0.000300801 -0.00393606 0.000204327
            V8          V9          V10         V11         V12 group
1 -0.011138900 -0.00722605 -0.000937656 -0.00715494 -0.00332161   OWN
2 -0.017788700  0.00104997  0.005708050 -0.00628088 -0.01195530   OWN
3 -0.009155880 -0.01100770 -0.015187600  0.00192387 -0.00399194   OWN
4 -0.002063110  0.00433389  0.007567520  0.00665285  0.00377068   OWN
5  0.003730