# WIHS2 and WIHS3 Troubleshooting
**author**: Jesse Marks

We suspect that WIHS2 and WIHS3 have overlapping samples. This notebook documents our investigation of that suspicion.

## Description
We recently performed a TOPMed-imputed GWAS meta-analysis for HIV acquisition ([ref](https://github.com/RTIInternational/bioinformatics/issues/97#issuecomment-698927826)). The lambda value was substantially inflated. After some troubleshooting, we discovered that when WIHS2 was removed from the meta-analysis, the resulting lambda value was no longer inflated. Likewise, when we excluded WIHS3 from the analysis and kept WIHS2, the lambda value was not inflated. This suggests potential sample overlap between the two cohorts.
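
For reference, and assuming the lambda reported here is the standard genomic inflation factor, it is computed from the 1-degree-of-freedom association test statistics as

$$
\lambda_{GC} = \frac{\mathrm{median}\left(\chi^2_{\mathrm{obs}}\right)}{\mathrm{median}\left(\chi^2_{1}\right)} \approx \frac{\mathrm{median}\left(\chi^2_{\mathrm{obs}}\right)}{0.4549}
$$

A value near 1 indicates no inflation. Samples shared between two cohorts make their summary statistics correlated, while a standard meta-analysis treats the cohorts as independent, so inflation that disappears when either cohort is dropped is consistent with overlap.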

Next, we looked at the phenotype files supplied during the GWAS. We found that a very large proportion of the WIHS3 sample IDs were identical to WIHS2 sample IDs.



## Compare phenotype files
We will look at both the processed (what went into the GWAS) and unprocessed (straight from dbGaP) phenotype files.

### unprocessed files
These phenotype files were downloaded directly from dbGaP.

* `s3://rti-hiv/shared_data/raw/wihs2/genotype/array/0001/2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt.gz`
* `s3://rti-common/dbGaP/phs001503_wihs3/RootStudyConsentSet_phs001503.WIHS.v1.p1.c1.HMB-IRB/PhenotypeFiles/phs001503.v1.pht007316.v1.p1.c1.WIHS_Subject_Phenotypes.HMB-IRB.txt.gz`

#### WIHS2

In [34]:
%%bash 

cd /home/jesse/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype
#aws s3 cp s3://rti-hiv/shared_data/raw/wihs2/genotype/array/0001/Smokescreen_NIDA_Study53_Aouizerat_clean_full_sample.fam.gz .
#aws s3 cp s3://rti-hiv/shared_data/raw/wihs2/genotype/array/0001/2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt.gz .
#gunzip Smokescreen_NIDA_Study53_Aouizerat_clean_full_sample.fam.gz
#gunzip 2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt.gz

head Smokescreen_NIDA_Study53_Aouizerat_clean_full_sample.fam
echo ""
head 2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt

echo ""
echo "Number of subjects (plus the header)."
wc -l 2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt

tail -n +2 2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt |\
  cut -f2 > wihs2_ids.txt

3432 204477 218943 206387 2 -9
3432 209890 218943 206387 2 -9
3433 210459 216387 214964 2 -9
3433 214964 0 0 2 -9
3434 218207 0 0 2 -9
3434 219343 211118 218207 2 -9
3435 204110 0 0 2 -9
3435 204205 209556 204110 2 -9
3436 208915 219556 210231 2 -9
3436 210231 0 0 2 -9

study	subject_id	age	sex	family_race	race	hisp	HIV_status
53	203299	29	2	WHITE	WHITE	NON-HISPANIC	1
53	203314	46	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
53	203323	46	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
53	203333	47	2	HISPANIC	WHITE	HISPANIC	0
53	203336	47	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
53	203361	NULL	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	NULL
53	203378	52	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
53	203391	53	2	HISPANIC	WHITE	HISPANIC	0
53	203399	37	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1

Number of subjects (plus the header).
1036 2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt


#### WIHS3

In [47]:
%%bash 

# WIHS3 unprocessed data
cd /home/jesse/Projects/hiv/scratch/gwas/wihs3/data/0001/phenotype
#aws s3 cp s3://rti-common/dbGaP/phs001503_wihs3/RootStudyConsentSet_phs001503.WIHS.v1.p1.c1.HMB-IRB/GenotypeFiles/phg001045.v1.WIHS.genotype-calls-matrixfmt.Axiom_Smokesc1.c1.HMB-IRB.tar.gz .
#aws s3 cp s3://rti-common/dbGaP/phs001503_wihs3/RootStudyConsentSet_phs001503.WIHS.v1.p1.c1.HMB-IRB/PhenotypeFiles/phs001503.v1.pht007316.v1.p1.c1.WIHS_Subject_Phenotypes.HMB-IRB.txt.gz .
    
#tar -xvzf phg001045.v1.WIHS.genotype-calls-matrixfmt.Axiom_Smokesc1.c1.HMB-IRB.tar.gz
#gunzip phs001503.v1.pht007316.v1.p1.c1.WIHS_Subject_Phenotypes.HMB-IRB.txt.gz
echo ""

head matrix/Smokescreen_NIDA_Study53_Aouizerat_clean.fam #Smokescreen_NIDA_Study53_Aouizerat_clean_full_sample.fam
echo ""
tail -n +11  phs001503.v1.pht007316.v1.p1.c1.WIHS_Subject_Phenotypes.HMB-IRB.txt | head
echo ""
echo "Number of subjects (plus the header)."
tail -n +11  phs001503.v1.pht007316.v1.p1.c1.WIHS_Subject_Phenotypes.HMB-IRB.txt | wc -l


tail -n +12  phs001503.v1.pht007316.v1.p1.c1.WIHS_Subject_Phenotypes.HMB-IRB.txt |\
  cut -f3 > wihs3_ids.txt


3432 204477 218943 206387 2 -9
3432 209890 218943 206387 2 -9
3433 210459 216387 214964 2 -9
3434 218207 0 0 2 -9
3434 219343 211118 218207 2 -9
3435 204110 0 0 2 -9
3435 204205 209556 204110 2 -9
3436 208915 219556 210231 2 -9
3436 210231 0 0 2 -9
3437 205792 208573 218573 2 -9

dbGaP_Subject_ID	study	SUBJECT_ID	age	sex	family_race	race	hisp	HIV_status
2407677	53	203299	29	2	WHITE	WHITE	NON-HISPANIC	1
2407286	53	203314	46	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
2407096	53	203333	47	2	HISPANIC	WHITE	HISPANIC	0
2406915	53	203336	47	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
2407333	53	203378	52	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
2407262	53	203391	53	2	HISPANIC	WHITE	HISPANIC	0
2407040	53	203399	37	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
2407107	53	203404	45	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	0
2407554	53	203419	49	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	HISPANIC	1

Number of subjects (plus the header).
865


#### intersection

In [66]:
%%bash 

cd /home/jesse/Projects/hiv/scratch/gwas/wihs3/data/0001/phenotype/
# how many subject IDs do the two files have in common
comm -12 /home/jesse/Projects/hiv/scratch/gwas/wihs3/data/0001/phenotype/wihs3_ids.txt \
  /home/jesse/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype/wihs2_ids.txt | wc -l
echo 

# preview the phenotype columns (subject ID through HIV status) of each file
head <(tail -n +12 phs001503.v1.pht007316.v1.p1.c1.WIHS_Subject_Phenotypes.HMB-IRB.txt | cut -f3-9) 
echo 
head <(tail -n +2 /home/jesse/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype/2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt | cut -f2-8 ) 
echo 

# how many full phenotype lines do they have in common
comm -12 <(tail -n +12 phs001503.v1.pht007316.v1.p1.c1.WIHS_Subject_Phenotypes.HMB-IRB.txt | cut -f3-9) \
  <(tail -n +2 /home/jesse/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype/2a_dbGaP_SubjectPhenotypesDS_NIDA_Study53_Aouizerat.txt | cut -f2-8 ) |\
  wc -l


864

203299	29	2	WHITE	WHITE	NON-HISPANIC	1
203314	46	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
203333	47	2	HISPANIC	WHITE	HISPANIC	0
203336	47	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
203378	52	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
203391	53	2	HISPANIC	WHITE	HISPANIC	0
203399	37	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
203404	45	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	0
203419	49	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	HISPANIC	1
203424	36	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1

203299	29	2	WHITE	WHITE	NON-HISPANIC	1
203314	46	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
203323	46	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
203333	47	2	HISPANIC	WHITE	HISPANIC	0
203336	47	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
203361	NULL	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	NULL
203378	52	2	AFRICAN_AMERICAN	AFRICAN_AMERICAN	NON-HISPANIC	1
203391	53	2	HISPANIC	WHITE	HISPANIC	0
203399	37	2	AFRICAN_AMERICAN	AF

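The number of shared IDs (864) equals the number of WIHS3 subjects (865 lines including the header), which suggests that every subject in WIHS3 is also in WIHS2.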
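
Note that `comm` assumes both inputs are sorted. The ID files above appear to be in ascending order already, but nothing in the cell checks this. As a minimal sketch (the function name `count_shared_ids` and its temporary files are illustrative, not part of the original analysis), the shared-ID count can be made independent of input order by sorting and de-duplicating first:

```shell
# Hypothetical helper: count IDs that occur in both files (one ID per line).
# Sorting under LC_ALL=C gives comm the byte-wise order it expects.
count_shared_ids() {
    csi_tmp_a=$(mktemp) || return 1
    csi_tmp_b=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$csi_tmp_a"
    LC_ALL=C sort -u "$2" > "$csi_tmp_b"
    # -12 suppresses lines unique to either file, leaving only shared lines
    LC_ALL=C comm -12 "$csi_tmp_a" "$csi_tmp_b" | wc -l
    rm -f "$csi_tmp_a" "$csi_tmp_b"
}

# Example call with the ID files written in the cells above:
# count_shared_ids wihs3_ids.txt ../../../../wihs2/data/0001/phenotype/wihs2_ids.txt
```

If the ID files are in fact sorted, this should agree with the count reported above.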

### processed files

In [2]:
%%bash 

# download phenotype files used in the GWAS
cd /home/jesse/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype
aws s3 cp s3://rti-hiv/gwas/wihs2/data/acquisition/0001/phenotype/afr/wihs2_afr_hiv_status_ageatbl_pcs.tsv .

cd /home/jesse/Projects/hiv/scratch/gwas/wihs3/data/0001/phenotype
aws s3 cp s3://rti-hiv/gwas/wihs3/data/acquisition/0001/phenotype/wihs3_afr_hiv_status_age_sex_pcs.tsv .
    
grep -v -f ids.txt  ../../../../wihs2/data/0001/phenotype/wihs2_afr_hiv_status_ageatbl_pcs.tsv  | wc -l

In [51]:
%%bash

cd /home/jesse/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype
tail -n +2 wihs2_afr_hiv_status_ageatbl_pcs.tsv | cut -f1 > ids.txt

# shared IDs
grep -f ids.txt  ../../../../wihs3/data/0001/phenotype/wihs3_afr_hiv_status_age_sex_pcs.tsv  | sort | head
echo ""
grep -f ids.txt  ../../../../wihs3/data/0001/phenotype/wihs3_afr_hiv_status_age_sex_pcs.tsv  | wc -l
echo ""

# unique IDs
grep -v -f ids.txt  ../../../../wihs3/data/0001/phenotype/wihs3_afr_hiv_status_age_sex_pcs.tsv  | wc -l
grep -v -f ids.txt  ../../../../wihs3/data/0001/phenotype/wihs3_afr_hiv_status_age_sex_pcs.tsv  

3432_204477	3432_204477	NA	NA	2	60	1	0.0046	0.0047	-0.0303
3434_219343	3434_219343	NA	NA	2	30	1	-0.0306	-0.0379	0.0093
3435_204205	3435_204205	NA	NA	2	34	1	0.1108	-0.0128	0.6652
3436_210231	3436_210231	NA	NA	2	51	2	0.0374	-0.0981	0.0153
3437_205792	3437_205792	NA	NA	2	56	1	-0.0152	-0.0717	-0.0201
3438_219088	3438_219088	NA	NA	2	34	1	-0.0126	-0.0381	0.0206
3439_205762	3439_205762	NA	NA	2	51	2	-0.0183	-0.382	-0.0284
3440_215575	3440_215575	NA	NA	2	59	2	-0.065	0.2744	-0.0587
3441_213146	3441_213146	NA	NA	2	30	1	-0.015	0.0246	-0.0109
3441_217939	3441_217939	NA	NA	2	34	1	-0.0127	0.0311	-0.002

719

11
fid	iid	fatid	matid	sex	age	hiv_status	PC3	PC8	PC2
3432_209890	3432_209890	NA	NA	2	55	2	6e-04	0.0013	-0.0327
3433_210459	3433_210459	NA	NA	2	33	2	-0.001	-0.029	-0.0044
3434_218207	3434_218207	NA	NA	2	55	1	-0.0317	-0.0492	0.0058
3435_204110	3435_204110	NA	NA	2	53	1	0.1118	-0.0126	0.6323
3436_208915	3436_208915	NA	NA	2	30	1	0.0393	-0.0844	0.0056
3437_209158	3437_209158	NA	NA	2	61	1	-0.0111	-0.06


|       | shared IDs | unique IDs | total IDs |
|-------|------------|------------|-----------|
| WIHS2 | 719        | 124        | 843       |
| WIHS3 | 719        | 10         | 729       |
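
The `grep -f` approach treats every line of `ids.txt` as a pattern that may match anywhere in a row, and `grep -v` also lets the header row through (hence the raw count of 11 above versus 10 unique IDs in the table). A sketch that compares the first column exactly and skips the header is shown below. The function name `id_overlap` is hypothetical, and both inputs are assumed to be tab-delimited with one header row and the ID in column 1, as in the files above.

```shell
# Hypothetical helper: report shared, unique, and total ID counts for the
# target file ($2) relative to a reference file ($1).
# Assumes tab-delimited input, one header row per file, ID in column 1.
id_overlap() {
    awk -F'\t' '
        FNR == 1     { file_no++; next }     # first line of each file: count the file, skip its header
        file_no == 1 { ref[$1] = 1; next }   # reference file: remember every ID
        { if ($1 in ref) shared++; else unique++ }   # target file: exact ID lookup
        END { printf "shared=%d unique=%d total=%d\n", shared, unique, shared + unique }
    ' "$1" "$2"
}

# Example call for the WIHS3 row of the table (run from the wihs3 phenotype directory):
# id_overlap ../../../../wihs2/data/0001/phenotype/wihs2_afr_hiv_status_ageatbl_pcs.tsv \
#            wihs3_afr_hiv_status_age_sex_pcs.tsv
```

Swapping the two arguments gives the corresponding WIHS2 row. Any difference from the `grep`-based counts would point to substring matches rather than true ID matches.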

In [8]:
%%bash

cd /home/jesse/Projects/hiv/scratch/gwas/wihs3/data/0001/phenotype
tail -n +2 wihs3_afr_hiv_status_age_sex_pcs.tsv | cut -f1 > ids.txt


# shared IDs
grep -f ids.txt  ~/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype/wihs2_afr_hiv_status_ageatbl_pcs.tsv  | sort | head 
echo ""
grep -f ids.txt  ~/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype/wihs2_afr_hiv_status_ageatbl_pcs.tsv  | wc -l
echo ""

# unique IDs
grep -v -f ids.txt  ~/Projects/hiv/scratch/gwas/wihs2/data/0001/phenotype/wihs2_afr_hiv_status_ageatbl_pcs.tsv  | wc -l

3432_204477	3432_204477	NA	NA	2	58	1	-0.0149	-0.0039	0.0053	0.0083	-0.026	-0.015	-0.0429	0.0014	0.0023	-0.0519
3434_219343	3434_219343	NA	NA	2	28	1	0.0205	0.0015	-0.0106	0.0086	-0.0077	0.0067	-0.0089	0.0142	-0.003	0.0168
3435_204205	3435_204205	NA	NA	2	33	1	0.0022	-0.0027	0.0044	-0.004	-0.0184	0.0105	0.0248	-0.0375	0.0147	-0.0234
3436_210231	3436_210231	NA	NA	2	49	2	-0.0063	0.0067	0.015	-0.0054	-0.0115	-0.0117	0.0244	0.0076	-0.006	-0.0014
3437_205792	3437_205792	NA	NA	2	55	1	0.0296	0.0014	0.0026	0.0017	0.0103	0.005	-0.0108	-0.0027	0.0013	0.0113
3438_219088	3438_219088	NA	NA	2	32	1	-0.0283	0.0355	-2e-04	-0.0213	0.0276	0.021	0.0023	0.0139	-0.0174	0.0506
3439_205762	3439_205762	NA	NA	2	48	2	0.0191	-0.009	-0.0062	-0.0098	-0.0037	0.0017	-9e-04	0.0054	-0.001	0.0065
3440_215575	3440_215575	NA	NA	2	56	2	-0.0338	0.021	-0.0097	-0.0385	0.0028	0.0058	-0.0059	-0.0372	-0.006	0.001
3441_213146	3441_213146	NA	NA	2	28	1	-0.1254	-0.0197	-0.0126	0.0302	0.0628	-0.0553	-0.1147	0.0657	-0.0304	-0.011
3441_21