# CFAR Public Control Matching

**author**: Jesse Marks <br>
**GitHub**: [Issue 133](https://github.com/RTIInternational/bioinformatics/issues/133)

CFAR is reported to have 4,761 subjects with both genotype and phenotype data.
Self-reported race is mostly Black and White, and all subjects are HIV cases. We were able to match the CFAR EAs with the COGA EAs for a GWAS with a sample size of ~4,000. This still leaves ~2,000 HIV+ AAs on the table. We are therefore looking at public controls that have TOPMed data, since other genotyping arrays wouldn’t overlap well with the Smokescreen-genotyped CFAR samples. The TOPMed WGS data overlap completely with the Smokescreen genotyping array.

The short list of public control cohorts to consider matching with CFAR AAs is:
* ARIC N=3,895
* COPDGene N=3,269
* JHS N=2,928

# ARIC
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000280.v7.p1
 
Since ARIC has the largest African American WGS sample size, we will start with it and see whether we can get a good match.

These data comprise multiple substudies, two of which have WGS data:
* [GENEVA: The Atherosclerosis Risk in Communities (ARIC) Study](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000090.v7.p1)
* [NHLBI ARIC Candidate Gene Association Resource (CARe)](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000557.v6.p1)

The WGS data have already been downloaded to AWS S3 at:
`s3://rti-shared/shared_data/raw/topmed/genotype/sequencing/TOPMed_ARIC_phs001211/Freeze8/`
These data are split by chromosome (1–22) and are in VCF format.

The WGS data have an associated harmonized phenotype file on S3 at `s3://rti-shared/shared_data/raw/topmed/harmonized_phenotypes/JHS_FHS_COPDGene_ARIC_harmonized_phenotypes_10_21.txt`, which we will filter down to just the ARIC samples. This file has fewer variables than the parent accession phenotype files, which are also on S3 at `s3://rti-common/dbGaP/phs000280_aric/PhenoGenotypeFiles`. In particular, we know that the harmonized phenotype file is missing age and alcohol dependence. Both are covariates in the CFAR+COGA EUR GWAS model, so we would like to include them in the CFAR+ARIC AFR GWAS as well.

We still need the harmonized phenotype file because it contains the WGS sample IDs (`VCF.Sample.Id`) that we will use to filter the WGS data. We will use its `Study.Accession.with.Patient.ID` variable to match records to the parent accession phenotype files so we can extract the age and alc_dep variables.
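
The filter-then-match step described above can be sketched with a two-pass `awk` join. This is a minimal illustration on toy files, not the actual procedure: the `study` and `age` columns, the column order, the tab delimiter, and every ID and value below are made up, and the real files will need their own column positions.

```bash
# Toy inputs only: all IDs and values, and the "study", "subject_id", "age" columns, are hypothetical.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' \
  'Study.Accession.with.Patient.ID' 'VCF.Sample.Id' 'study' \
  'ID_A' 'S1' 'ARIC' \
  'ID_B' 'S2' 'JHS' \
  'ID_C' 'S3' 'ARIC' \
  'ID_D' 'S4' 'ARIC' > "$workdir/harmonized.tsv"
printf '%s\t%s\n' \
  'subject_id' 'age' \
  'ID_A' '54' \
  'ID_C' '61' > "$workdir/parent.tsv"

merged=$(awk -F'\t' -v OFS='\t' '
  NR == FNR { if (FNR > 1) age[$1] = $2; next }   # pass 1: index parent-file age by subject ID
  FNR == 1  { print $0, "age"; next }             # pass 2: extend the harmonized header
  $3 == "ARIC" { print $0, (($1 in age) ? age[$1] : "NA") }   # keep ARIC rows; NA when unmatched
' "$workdir/parent.tsv" "$workdir/harmonized.tsv")

echo "$merged"
rm -rf "$workdir"
```

Unmatched ARIC subjects are kept with `NA` rather than dropped, so it is easy to count how many samples lack a covariate before deciding how to handle them.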

Let's first see what variables are available. 
```

```
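
One way to list the available variables is to print the header row one column per line. The sketch below assumes the phenotype file is tab-delimited (the delimiter is not stated here) and runs on a toy header containing only the two column names mentioned above. `list_columns` is a hypothetical helper name.

```bash
# Print the column names from the header row of a delimited file, numbered.
# Usage: list_columns FILE [DELIMITER]   (defaults to tab, which is an assumption)
list_columns() {
  local file="$1" delim="${2:-$'\t'}"
  awk -F"$delim" 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; exit }' "$file"
}

# Toy stand-in for the harmonized phenotype file: header only, hypothetical column order.
toy=$(mktemp)
printf 'Study.Accession.with.Patient.ID\tVCF.Sample.Id\n' > "$toy"

columns=$(list_columns "$toy")
echo "$columns"
rm -f "$toy"
```

For the real file, the same function would be pointed at a local copy of the harmonized phenotype file, passing `,` as the second argument if it turns out to be comma-separated.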


The overall plan is:

1. Download the phenotype data and provide a summary.
   * `s3://rti-shared/shared_data/raw/topmed/genotype/sequencing/TOPMed_ARIC_phs001211/PhenotypeFiles/phs001211.v3.pht005755.v3.p2.TOPMed_WGS_ARIC_Sample.MULTI.txt.gz`
2. Download the genotype data (on an EC2 instance with ample storage; the WGS data total 497.8 GB).
3. [GAWMerge](https://www.biorxiv.org/content/10.1101/2021.10.19.464854v1)
   * Extract only the genotyped variants from the WGS data
   * QC the array and WGS data independently
   * Phase the array and WGS data independently
   * Merge the phased data
4. Imputation
5. GWAS
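
The first GAWMerge sub-step, restricting the WGS calls to the sites on the genotyping array, can be illustrated on a toy VCF. This is only a sketch of position-based filtering, not GAWMerge's own procedure: the site list format (chromosome and position, tab-separated), the file names, and all records are hypothetical. It matches on position alone, ignores alleles, and assumes both files use the same chromosome naming. In practice a dedicated tool would do this on the compressed per-chromosome VCFs.

```bash
# Toy inputs only: positions and alleles are made up for illustration.
workdir=$(mktemp -d)
printf '%s\t%s\n' 'chr1' '100' 'chr1' '300' > "$workdir/array_sites.tsv"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n'
  printf 'chr1\t200\t.\tC\tT\t.\tPASS\t.\n'
  printf 'chr1\t300\t.\tG\tA\t.\tPASS\t.\n'
} > "$workdir/wgs.vcf"

subset=$(awk -F'\t' '
  NR == FNR { keep[$1 ":" $2] = 1; next }   # pass 1: remember array sites as chrom:pos
  /^#/      { print; next }                 # pass 2: always keep VCF header lines
  ($1 ":" $2) in keep                       # keep records located at an array site
' "$workdir/array_sites.tsv" "$workdir/wgs.vcf")

echo "$subset"
rm -rf "$workdir"
```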

## Phenotype summary 
* The harmonized TOPMed phenotype file is located at `s3://rti-shared/shared_data/raw/topmed/harmonized_phenotypes/JHS_FHS_COPDGene_ARIC_harmonized_phenotypes_10_21.txt`.

The parent accession phenotype files are located:
* s3://rti-common/dbGaP/phs000280_aric/PhenoGenotypeFiles/ChildStudyConsentSet_phs000090.RootStudy.v7.p1.c1.HMB-IRB/PhenotypeFiles/
* s3://rti-common/dbGaP/phs000280_aric/PhenoGenotypeFiles/ChildStudyConsentSet_phs000557.RootStudy.v6.p1.c1.HMB-IRB/PhenotypeFiles/ 
* s3://rti-common/dbGaP/phs000280_aric/PhenoGenotypeFiles/RootStudyConsentSet_phs000280.RootStudy.v7.p1.c1.HMB-IRB/PhenotypeFiles/

The ARIC TOPMed WGS data comprise both the CARe and GENEVA substudies. We will select only the ARIC African Americans for this study.

We need to clean up the phenotype file so that it follows the PLINK coding standard:
* Sex code ('1' = male, '2' = female, '0' = unknown)
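
As a minimal sketch of that recoding, the function below maps a text sex column to the PLINK codes. It assumes a two-column CSV of sample ID and sex with a header row. `recode_sex` is a hypothetical helper, and the `EXAMPLE_UNKNOWN` row is made up to show the fallback code.

```bash
# Recode column 2 of a CSV read from stdin: Male -> 1, Female -> 2, anything else -> 0.
recode_sex() {
  awk -F',' -v OFS=',' '
    NR == 1 { print; next }          # pass the header through unchanged
    {
      s = tolower($2)
      if (s == "male")        $2 = 1
      else if (s == "female") $2 = 2
      else                    $2 = 0 # unknown or missing
      print
    }
  '
}

recoded=$(printf '%s\n' \
  'sample.id,sex' \
  'NWD694030,Female' \
  'NWD536080,Male' \
  'EXAMPLE_UNKNOWN,NA' | recode_sex)
echo "$recoded"
```

Matching is case-insensitive, and any value other than male or female falls through to `0`, so unexpected labels surface as unknown rather than being silently miscoded.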

```bash
cd /Users/jmarks/projects/hiv/shared_data/processed/topmed/harmonized_phenotypes/topmed_aric_phs001211

head aric_aa_topmed_phenotypes.csv
#Patient.ID,X.DCC.Harmonized.data.set.01...Demographics.A.distinct.subgroup.within.a.study..generally.indicating.subjects.who.share.similar.characteristics.due.to.study.design..Subjects.may.belong.to.only.one.subcohort..,X.DCC.Harmonized.data.set.01...Demographics.Harmonized.race.category.of.participant..,X.DCC.Harmonized.data.set.01...Demographics.Subject.sex..as.recorded.by.the.study..,X._VCF.Sample.Id.
#700072,ARIC: No Subcohort Structure,Black or African American,Female,NWD694030
#700090,ARIC: No Subcohort Structure,Black or African American,Male,NWD536080
#700096,ARIC: No Subcohort Structure,Black or African American,Female,NWD297271
#700097,ARIC: No Subcohort Structure,Black or African American,Female,NWD313611
```



```r
setwd("/Users/jmarks/projects/hiv/shared_data/processed/topmed/harmonized_phenotypes/topmed_aric_phs001211/")
aric_pheno <- read.csv("aric_aa_topmed_phenotypes.csv")

names(aric_pheno) <- c("patient.id", "cohort.name", "race", "sex", "sample.id")
# keep only necessary columns
aric_pheno <- aric_pheno[, c("sample.id", "sex")]
head(aric_pheno)
```

| | sample.id | sex |
|---|---|---|
| | `<chr>` | `<chr>` |
| 1 | NWD694030 | Female |
| 2 | NWD536080 | Male |
| 3 | NWD297271 | Female |
| 4 | NWD313611 | Female |
| 5 | NWD929639 | Female |
| 6 | NWD440740 | Female |
