# UHS1–4 dbGaP submission

__Author__: Jesse Marks <br>

**GitHub Issue:** [Issue #125](https://github.com/RTIInternational/bioinformatics/issues/125)

This notebook documents the preparation of the UHS1–4 submission to [dbGaP](https://submit.ncbi.nlm.nih.gov/dbgap/). We will use Eric O. Johnson's login information. <br>
**Note:** Once logged in, click the "eRA Commons" button, then click the Study Name "CIDR-NIDA Study of HIV Host Genetics" to access the Submission Portal.

We submitted the UHS genotype and phenotype data to dbGaP in November 2020. In early 2021, dbGaP responded with a description of issues in our data that needed to be corrected before the data could be published. The details of this email are in [this GitHub comment.](https://github.com/RTIInternational/bioinformatics/issues/125#issuecomment-781473484)

The first steps were to combine the UHS genotype data and see whether we could replicate the issues that dbGaP identified. To do so, we perform pre-imputation procedures on each UHS dataset: UHS1, UHS2, UHS3 (chip version 1–2), UHS3 (chip version 1–3), and UHS4. The starting point for the pre-imputation procedures is the quality-controlled observed genotypes for each batch/wave. We then merge these data and run them through our automated [genotype array QC](https://github.com/RTIInternational/biocloud_gwas_workflows/tree/master/genotype_array_qc) workflow, removing any overlapping samples. The quality-controlled genotypes are on the GRCh37 plus strand.



## Data Overview
### genotypes
PLINK binary filesets were obtained from AWS S3 storage. Nathan Gaddis details the whereabouts of the UHS1–3 genotype data in [this post from GitHub Issue #117](https://github.com/RTIInternational/bioinformatics/issues/117#issuecomment-469845859) and of the UHS4 data in [this](https://github.com/RTIInternational/bioinformatics/issues/97#issuecomment-488712833) GitHub post. Note that the following S3 paths have been updated: an S3 reorganization relocated these data.

* UHS1: `s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.{aa,ea}*`
* UHS2: `s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.{ea,aa}*`
* UHS3_v1-2: `s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.{ea,aa}.V1-2*`
* UHS3_v1-3: `s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.{ea,aa}.V1-3*`
* UHS4: `s3://rti-shared/rti-midas-data/studies/uhs4/observed/genotypes/final/uhs4.merged2.{aa,ea}.*`

**Note:** the above locations have been updated. The details are in [this GitHub comment.](https://github.com/RTIInternational/bioinformatics/issues/125#issuecomment-781473484)

### phenotypes
The master phenotype data is contained in a spreadsheet on AWS S3 at: `s3://rti-heroin/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv.gz`

A former RTI International employee (Christie G.) created the phenotype files for dbGaP and posted them in [this GitHub comment](https://github.com/RTIInternational/bioinformatics/issues/125#issuecomment-553973222). The first submission of these data to dbGaP was reported by Jesse Marks in [this GitHub comment.](https://github.com/RTIInternational/bioinformatics/issues/125#issuecomment-562162459) The data submitted to dbGaP are on S3 at `s3://rti-hiv/gwas/uhs1234/dbgap`.

## Prepare UHS for QC
Combine the different batches of UHS based on the SNP intersection. Then run these data through the [automated genotype QC workflow](https://github.com/RTIInternational/biocloud_gwas_workflows/tree/master/genotype_array_qc) to check for sex discrepancies and relatedness issues.

### Summary statistics for each batch

These counts use the [UHS4](https://github.com/RTIInternational/bioinformatics/issues/97#issuecomment-488712833) version that was initially submitted to dbGaP.

**AFR**

| Data Set  | Initial N | Initial M | Pre-imputation Filtering M | Intersection M | Final M (Remove 3+ Allele SNPs) |
|-----------|-----------|-----------|----------------------------|----------------|---------------------------------|
| UHS1*     | 2,016     | 805,863   |  803,736                   |  352,937       | 352,926                         |
| UHS2      | 767       | 1,416,262 |  1,412,272                 |  352,937       | 352,926                         |
| UHS3_v1-2 | 84        | 1,845,927 |  1,841,676                 |  352,937       | 352,926                         |
| UHS3_v1-3 | 94        | 1,806,722 |  1,802,415                 |  352,937       | 352,926                         |
| UHS4      | 1,072     | 1,878,335 |  1,873,484                 |  352,937       | 352,926                         |

\* 2,015 samples with chrX and 2,016 with autosomes.

<br><br>

**EUR**

| Data Set  | Initial N | Initial M | Pre-imputation Filtering M | Intersection M | Final M (Remove 3+ Allele SNPs) |
|-----------|-----------|-----------|----------------------------|----------------|---------------------------------|
| UHS1*     | 1,142     | 808,822   |   806,592                  | 413,790        |  413,782                        |
| UHS2      | 828       | 1,841,954 |   1,836,789                | 413,790        |  413,782                        |
| UHS3_v1-2 | 33        | 1,924,759 |   1,919,896                | 413,790        |  413,782                        |
| UHS3_v1-3 | 44        | 1,868,612 |   1,863,909                | 413,790        |  413,782                        |
| UHS4      | 989       | 2,073,618 |   2,067,863                | 413,790        |  413,782                        |

\* 1,140 samples with chrX and 1,142 with autosomes.

### Download genotypes

In [None]:
study="uhs1234"
study_list="uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4"
ancestry_list="afr eur"

## create directory structure
baseDir=/shared/rti-shared/shared_data/pre_qc/$study/genotype/array/observed/0002
mkdir -p $baseDir/{eur,afr}/{uhs1,uhs2,uhs3_v1-2,uhs3_v1-3,uhs4}
mkdir -p /shared/rti-common/ref_panels/1000G/2014.10/legend_with_chr/dbsnp_b153_ids/

## download reference files
cd /shared/rti-common/ref_panels/1000G/2014.10/legend_with_chr/dbsnp_b153_ids/
aws s3 sync s3://rti-common/ref_panels/1000G/2014.10/legend_with_chr/dbsnp_b153_ids/ .

## download study genotype data
cd $baseDir
for ext in {bed,bim,fam}; do
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.aa.$ext.gz afr/uhs1/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.aa.chr23.$ext.gz afr/uhs1/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.ea.$ext.gz eur/uhs1/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.ea.chr23.$ext.gz eur/uhs1/

    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.aa.$ext.gz afr/uhs2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.aa.chr23.$ext.gz afr/uhs2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.ea.$ext.gz eur/uhs2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs2.ea.chr23.$ext.gz eur/uhs2/

    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-2.$ext.gz afr/uhs3_v1-2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-2.chr23.$ext.gz afr/uhs3_v1-2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-2.$ext.gz eur/uhs3_v1-2/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-2.chr23.$ext.gz eur/uhs3_v1-2/

    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-3.$ext.gz afr/uhs3_v1-3/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-3.chr23.$ext.gz afr/uhs3_v1-3/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.$ext.gz eur/uhs3_v1-3/
    aws s3 cp s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.chr23.$ext.gz eur/uhs3_v1-3/

    aws s3 cp s3://rti-shared/rti-midas-data/studies/uhs4/observed/genotypes/final/uhs4.merged2.aa.$ext.gz afr/uhs4/
    aws s3 cp s3://rti-shared/rti-midas-data/studies/uhs4/observed/genotypes/final/uhs4.merged2.ea.$ext.gz eur/uhs4/
done

gunzip -r * 

wc -l */*/*bim

```
   805863 afr/uhs1/uhs1.aa.bim
    17396 afr/uhs1/uhs1.aa.chr23.bim
  1395852 afr/uhs2/uhs2.aa.bim
    23484 afr/uhs2/uhs2.aa.chr23.bim
  1845925 afr/uhs3_v1-2/uhs3.aa.V1-2.bim
    41884 afr/uhs3_v1-2/uhs3.aa.V1-2.chr23.bim
  1806719 afr/uhs3_v1-3/uhs3.aa.V1-3.bim
    44706 afr/uhs3_v1-3/uhs3.aa.V1-3.chr23.bim
  1878335 afr/uhs4/uhs4.merged2.aa.bim
  
   808822 eur/uhs1/uhs1.ea.bim
    17408 eur/uhs1/uhs1.ea.chr23.bim
  1820881 eur/uhs2/uhs2.ea.bim
    32857 eur/uhs2/uhs2.ea.chr23.bim
  1924758 eur/uhs3_v1-2/uhs3.ea.V1-2.bim
    24202 eur/uhs3_v1-2/uhs3.ea.V1-2.chr23.bim
  1868610 eur/uhs3_v1-3/uhs3.ea.V1-3.bim
    35440 eur/uhs3_v1-3/uhs3.ea.V1-3.chr23.bim
  2073618 eur/uhs4/uhs4.merged2.ea.bim
  ```

### capture sex information
We want to merge the sex information from the chrX genotype data into the whole-genome data. The sex info is missing in the whole-genome data for UHS2. We will perform this command on all of the waves of UHS to make sure the sex information is consistent.  

In [None]:
# UHS1
cd $baseDir/afr/uhs1/
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs1.aa.chr23 \
    --bmerge uhs1.aa \
    --make-bed \
    --out uhs1_afr
wc -l *fam
#  2015 uhs1.aa.chr23.fam
#  2016 uhs1.aa.fam
#  2016 uhs1_afr.fam

wc -l *bim
#  805863 uhs1.aa.bim
#   17396 uhs1.aa.chr23.bim
#  805863 uhs1_afr.bim

cd $baseDir/eur/uhs1/
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs1.ea.chr23 \
    --bmerge uhs1.ea \
    --make-bed \
    --out uhs1_eur
wc -l *fam
#  1140 uhs1.ea.chr23.fam
#  1142 uhs1.ea.fam
#  1142 uhs1_eur.fam

wc -l *bim
#  808822 uhs1.ea.bim
#   17408 uhs1.ea.chr23.bim
#  808822 uhs1_eur.bim


# UHS2
cd $baseDir/afr/uhs2/
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs2.aa.chr23 \
    --bmerge uhs2.aa \
    --make-bed \
    --out uhs2_afr
wc -l *fam
#   767 uhs2.aa.chr23.fam
#   767 uhs2.aa.fam
#   767 uhs2_afr.fam

wc -l *bim
#  1395852 uhs2.aa.bim
#    23484 uhs2.aa.chr23.bim
#  1416262 uhs2_afr.bim

cd $baseDir/eur/uhs2/
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs2.ea.chr23 \
    --bmerge uhs2.ea \
    --make-bed \
    --out uhs2_eur
wc -l *fam
#   828 uhs2.ea.chr23.fam
#   828 uhs2.ea.fam
#   828 uhs2_eur.fam

wc -l *bim
#  1820881 uhs2.ea.bim
#    32857 uhs2.ea.chr23.bim
#  1841954 uhs2_eur.bim


# UHS3_v1-2
cd $baseDir/afr/uhs3_v1-2
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs3.aa.V1-2.chr23 \
    --bmerge uhs3.aa.V1-2 \
    --make-bed \
    --out uhs3_v1-2_afr
wc -l *bim *fam
#  1845925 uhs3.aa.V1-2.bim
#    41884 uhs3.aa.V1-2.chr23.bim
#  1845927 uhs3_v1-2_afr.bim
#       84 uhs3.aa.V1-2.chr23.fam
#       84 uhs3.aa.V1-2.fam
#       84 uhs3_v1-2_afr.fam

cd $baseDir/eur/uhs3_v1-2
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs3.ea.V1-2.chr23 \
    --bmerge uhs3.ea.V1-2 \
    --make-bed \
    --out uhs3_v1-2_eur
wc -l *bim *fam
#  1924758 uhs3.ea.V1-2.bim
#    24202 uhs3.ea.V1-2.chr23.bim
#  1924759 uhs3_v1-2_eur.bim
#       33 uhs3.ea.V1-2.chr23.fam
#       33 uhs3.ea.V1-2.fam
#       33 uhs3_v1-2_eur.fam


# UHS3_v1-3
cd $baseDir/afr/uhs3_v1-3
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs3.aa.V1-3.chr23 \
    --bmerge uhs3.aa.V1-3 \
    --make-bed \
    --out uhs3_v1-3_afr
wc -l *bim *fam
#  1806719 uhs3.aa.V1-3.bim
#    44706 uhs3.aa.V1-3.chr23.bim
#  1806722 uhs3_v1-3_afr.bim
#       94 uhs3.aa.V1-3.chr23.fam
#       94 uhs3.aa.V1-3.fam
#       94 uhs3_v1-3_afr.fam

cd $baseDir/eur/uhs3_v1-3
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --bfile uhs3.ea.V1-3.chr23 \
    --bmerge uhs3.ea.V1-3 \
    --make-bed \
    --out uhs3_v1-3_eur
wc -l *bim *fam
#  1868610 uhs3.ea.V1-3.bim
#    35440 uhs3.ea.V1-3.chr23.bim
#  1868612 uhs3_v1-3_eur.bim
#       44 uhs3.ea.V1-3.chr23.fam
#       44 uhs3.ea.V1-3.fam
#       44 uhs3_v1-3_eur.fam


# UHS4
cd $baseDir/afr/uhs4/
# rename files for consistency 
for ext in {bim,bed,fam}; do
    cp uhs4.merged2.aa.$ext uhs4_afr.$ext
done
wc -l *afr.bim *afr.fam
# 1878335 uhs4_afr.bim
#    1072 uhs4_afr.fam

cd $baseDir/eur/uhs4/
# rename files for consistency 
for ext in {bim,bed,fam}; do
    cp uhs4.merged2.ea.$ext uhs4_eur.$ext
done
wc -l *eur.bim *eur.fam
# 2073618 uhs4_eur.bim
#     989 uhs4_eur.fam
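
One way to confirm that a merged fileset carries usable sex information is to tabulate the fifth column of its FAM file (in the PLINK format, 1 = male, 2 = female, 0 = unknown). The following is a minimal sketch on a made-up FAM file; `count_sex_codes` and the sample IDs are hypothetical and not part of the original workflow.

```shell
# Tabulate PLINK sex codes (column 5 of a FAM file): 1 = male, 2 = female, 0 = unknown.
count_sex_codes() {
    # prints "<code> <count>" per line, sorted by code
    awk '{n[$5]++} END {for (c in n) print c, n[c]}' "$1" | sort -n
}

# Toy FAM file (FID IID PAT MAT SEX PHENO); these IDs are invented for illustration.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fam" <<'EOF'
F1 S1 0 0 1 -9
F2 S2 0 0 2 -9
F3 S3 0 0 0 -9
F4 S4 0 0 2 -9
EOF

count_sex_codes "$tmpdir/toy.fam"
```

Run against the real data, a large count for code `0` in a merged FAM file would suggest that the sex information was not carried over.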

### Upload UHS2 to S3
UHS2 was essentially the only dataset that changed in the previous step: the merge added the updated sex information to its FAM file. We therefore upload only the updated UHS2 files.

In [None]:
cd /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/

for an in {afr,eur}; do
    for ext in {bed,bim,fam}; do
        gzip $an/uhs2/uhs2_${an}.$ext
        aws s3 cp $an/uhs2/uhs2_${an}.$ext.gz s3://rti-shared/shared_data/post_qc/uhs2/genotype/array/observed/0001/$an/
    done
done

### Pre-imputation processing

In [None]:
## Perform pre-imputation processing
for ancestry in {afr,eur}; do
    for cohort in ${study_list}; do
          group=$(echo $ancestry | perl -ne 'print uc($_);')
          python /shared/bioinformatics/software/python/prepare_imputation_input.py \
            --bfile ${baseDir}/$ancestry/$cohort/${cohort}_${ancestry} \
            --ref /shared/rti-common/ref_panels/1000G/2014.10/legend_with_chr/dbsnp_b153_ids/1000GP_Phase3.legend.gz \
            --ref_group $group \
            --freq_diff_threshold 0.2 \
            --out_prefix ${baseDir}/$ancestry/$cohort/${cohort}_${ancestry}_preimputation_filters \
            --working_dir ${baseDir}/$ancestry/$cohort/ \
            --plink /shared/bioinformatics/software/third_party/plink-1.90-beta-6.16-x86_64/plink \
            --bgzip /shared/bioinformatics/software/third_party/htslib-1.6/bin/bgzip \
            --bgzip_threads 4 \
            --keep_plink True \
            --cohort $cohort
    done
done

study_list="uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4"

```
Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
805863 variants in study
787412 variants overlap with ref
2127 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
8075 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
1416262 variants in study
1342060 variants overlap with ref
3990 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
31 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
1845927 variants in study
1713386 variants overlap with ref
4251 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
42 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
1806722 variants in study
1671876 variants overlap with ref
4307 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
137 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
1878335 variants in study
1775405 variants overlap with ref
4851 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
49 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
808822 variants in study
790195 variants overlap with ref
2230 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
8072 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
1841954 variants in study
1716296 variants overlap with ref
5165 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
38 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
1924759 variants in study
1788224 variants overlap with ref
4863 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
39 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
1868612 variants in study
1731844 variants overlap with ref
4703 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
134 non-plus strand variants (flipped in output)

Calculating study frequencies
Reading study frequencies
Reading ref
Comparing study and ref
Generating files for imputation.
2073618 variants in study
1934064 variants overlap with ref
5755 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
63 non-plus strand variants (flipped in output)
```
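
The log reports counts of A/T and C/G variants. These are strand-ambiguous: complementing both alleles gives back the same allele pair, so a strand flip cannot be detected from the alleles alone. The sketch below only illustrates how such variants can be listed from a BIM file; it is not the logic of `prepare_imputation_input.py`, which, according to the log, also applies the MAF and frequency-difference thresholds. The helper name and the toy variants are hypothetical.

```shell
# List strand-ambiguous (A/T or C/G) variants in a PLINK BIM file.
# BIM columns: chr, variant ID, cM, bp, allele 1, allele 2.
list_ambiguous_snps() {
    awk '{
        a = toupper($5) toupper($6)
        if (a == "AT" || a == "TA" || a == "CG" || a == "GC") print $2
    }' "$1"
}

# Toy BIM file; variant IDs and positions are invented for illustration.
tmpdir=$(mktemp -d)
printf '1\trs1\t0\t1000\tA\tT\n1\trs2\t0\t2000\tA\tG\n2\trs3\t0\t3000\tG\tC\n2\trs4\t0\t4000\tC\tT\n' > "$tmpdir/toy.bim"

list_ambiguous_snps "$tmpdir/toy.bim"
```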

In [None]:
wc -l $baseDir/*/uhs*/uhs*_preimputation_filters.bim

```
   803736 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs1/uhs1_afr_preimputation_filters.bim
  1412272 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs2/uhs2_afr_preimputation_filters.bim
  1841676 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs3_v1-2/uhs3_v1-2_afr_preimputation_filters.bim
  1802415 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs3_v1-3/uhs3_v1-3_afr_preimputation_filters.bim
  1873484 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs4/uhs4_afr_preimputation_filters.bim
  
   806592 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs1/uhs1_eur_preimputation_filters.bim
  1836789 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs2/uhs2_eur_preimputation_filters.bim
  1919896 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs3_v1-2/uhs3_v1-2_eur_preimputation_filters.bim
  1863909 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs3_v1-3/uhs3_v1-3_eur_preimputation_filters.bim
  2067863 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs4/uhs4_eur_preimputation_filters.bim
  ```
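
In this run, each filtered BIM count is consistent with the log: the number of variants in the study minus the excluded A/T and C/G variants (for UHS1 AFR, 805,863 − 2,127 = 803,736). A minimal sketch of that cross-check, using a log excerpt copied from the pre-imputation output; `expected_filtered_count` is a hypothetical helper.

```shell
# Expected post-filter variant count = "variants in study" - excluded A/T and C/G variants.
expected_filtered_count() {
    awk '/variants in study/ {total = $1}
         /excluded in output/ {excluded = $1}
         END {print total - excluded}' "$1"
}

tmpdir=$(mktemp -d)
# Log excerpt for UHS1 AFR, copied from the pre-imputation output.
cat > "$tmpdir/uhs1_afr.log" <<'EOF'
805863 variants in study
787412 variants overlap with ref
2127 A/T and C/G variants with MAF > 0.4 or freq difference compared to ref > 0.2 (excluded in output)
8075 non-plus strand variants (flipped in output)
EOF

# Compare the printed value with: wc -l < uhs1_afr_preimputation_filters.bim
expected_filtered_count "$tmpdir/uhs1_afr.log"
```

For UHS1 AFR this prints 803736, which matches the `wc -l` result for `uhs1_afr_preimputation_filters.bim`.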

### Intersection set of SNPs

In [None]:
## Get intersection set of SNPs
studies=($study_list)  # array of study names
num=${#studies[@]}
    
for ancestry in ${ancestry_list};do
    bim_files=()
    for (( i=0; i<${num}; i++ ));do
        bim_files+=(${baseDir}/$ancestry/${studies[$i]}/${studies[$i]}_${ancestry}_preimputation_filters.bim)
    done
    
    echo -e "\nCalculating intersection between $ancestry: ${study_list}...\n"
    cat ${bim_files[@]}| cut -f2 | sort |  uniq -c | awk -v num=$num '$1 == num {print $2}' \
        > ${baseDir}/$ancestry/${ancestry}_variant_intersection.txt
    wc -l ${baseDir}/$ancestry/${ancestry}_variant_intersection.txt
done 

```
Calculating intersection between afr: uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4...
352937 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/afr_variant_intersection.txt

Calculating intersection between eur: uhs1 uhs2 uhs3_v1-2 uhs3_v1-3 uhs4...
413790 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/eur_variant_intersection.txt
```
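
The intersection is computed by counting how many BIM files each variant ID appears in and keeping the IDs seen in all of them. This assumes that variant IDs are unique within each BIM file; a repeated ID in one file would distort the count. The sketch below shows the same idea on toy files with a per-file de-duplication step added as a safeguard. The function name and variants are hypothetical.

```shell
# Intersect variant IDs (column 2) across several BIM files.
# IDs are de-duplicated per file first so a repeated ID cannot be counted twice.
intersect_variant_ids() {
    n=$#
    for f in "$@"; do
        cut -f2 "$f" | sort -u
    done | sort | uniq -c | awk -v n="$n" '$1 == n {print $2}'
}

# Toy BIM files; b.bim deliberately lists rs3 twice.
tmpdir=$(mktemp -d)
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tA\tC\n' > "$tmpdir/a.bim"
printf '1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tA\tC\n1\trs3\t0\t300\tA\tC\n' > "$tmpdir/b.bim"
printf '1\trs3\t0\t300\tA\tC\n1\trs4\t0\t400\tG\tT\n1\trs2\t0\t200\tC\tT\n' > "$tmpdir/c.bim"

intersect_variant_ids "$tmpdir/a.bim" "$tmpdir/b.bim" "$tmpdir/c.bim"
```

Without the `sort -u` step, `rs3` would be counted four times across the three toy files and dropped by the `$1 == n` test even though it is present in all of them.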

In [None]:
## Perform SNP extraction
for ancestry in ${ancestry_list};do
    for study in ${studies[@]}; do
        /shared/bioinformatics/software/third_party/plink-1.90-beta-6.16-x86_64/plink \
            --bfile ${baseDir}/$ancestry/$study/${study}_${ancestry}_preimputation_filters \
            --extract ${baseDir}/$ancestry/${ancestry}_variant_intersection.txt \
            --make-bed \
            --out ${baseDir}/$ancestry/$study/${study}_${ancestry}_preimputation_filters_snp_intersection
    done
done

wc -l ${baseDir}/{afr,eur}/uhs*/*filters_snp_intersection.{bim,fam}

```
   352937 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs1/uhs1_afr_preimputation_filters_snp_intersection.bim
   352937 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs2/uhs2_afr_preimputation_filters_snp_intersection.bim
   352937 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs3_v1-2/uhs3_v1-2_afr_preimputation_filters_snp_intersection.bim
   352937 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs3_v1-3/uhs3_v1-3_afr_preimputation_filters_snp_intersection.bim
   352937 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs4/uhs4_afr_preimputation_filters_snp_intersection.bim
     2016 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs1/uhs1_afr_preimputation_filters_snp_intersection.fam
      767 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs2/uhs2_afr_preimputation_filters_snp_intersection.fam
       84 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs3_v1-2/uhs3_v1-2_afr_preimputation_filters_snp_intersection.fam
       94 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs3_v1-3/uhs3_v1-3_afr_preimputation_filters_snp_intersection.fam
     1072 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/uhs4/uhs4_afr_preimputation_filters_snp_intersection.fam
     
   413790 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs1/uhs1_eur_preimputation_filters_snp_intersection.bim
   413790 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs2/uhs2_eur_preimputation_filters_snp_intersection.bim
   413790 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs3_v1-2/uhs3_v1-2_eur_preimputation_filters_snp_intersection.bim
   413790 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs3_v1-3/uhs3_v1-3_eur_preimputation_filters_snp_intersection.bim
   413790 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs4/uhs4_eur_preimputation_filters_snp_intersection.bim
     1142 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs1/uhs1_eur_preimputation_filters_snp_intersection.fam
      828 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs2/uhs2_eur_preimputation_filters_snp_intersection.fam
       33 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs3_v1-2/uhs3_v1-2_eur_preimputation_filters_snp_intersection.fam
       44 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs3_v1-3/uhs3_v1-3_eur_preimputation_filters_snp_intersection.fam
      989 /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/uhs4/uhs4_eur_preimputation_filters_snp_intersection.fam
```

### Merge ancestry specific batches/waves

#### AFR

In [None]:
cd $baseDir

# create merge list
wc -l afr/uhs*/uhs*_afr_preimputation_filters_snp_intersection.fam | awk '{print $2}' | head -n -1 | awk -F"." '{print $1}' > afr/uhs_afr_merge_list.txt

/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --merge-list afr/uhs_afr_merge_list.txt \
    --make-bed \
    --out afr/uhs1234_afr
#Error: 11 variants with 3+ alleles present.

# remove those variants and try the merge again.
while read line; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --bfile $line \
        --exclude afr/uhs1234_afr-merge.missnp \
        --make-bed \
        --out ${line}_remove_missnps
done < afr/uhs_afr_merge_list.txt

# create new merge list
wc -l afr/uhs*/*remove_missnps.fam | awk '{print $2}' | head -n -1 | awk -F"." '{print $1}' > afr/uhs_afr_new_merge_list.txt

# try merge again
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --merge-list afr/uhs_afr_new_merge_list.txt \
    --make-bed \
    --out afr/uhs1234_afr

gzip afr/uhs1234_afr* &
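
The merge list is a text file with one PLINK fileset prefix per line. The `wc -l ... | head -n -1` pipeline builds it by dropping the `total` line and cutting each path at the first `.`, which relies on GNU `head` and on the paths containing no other dots. A possible alternative, sketched here on empty placeholder files, strips only the trailing `.fam`; `make_merge_list` is a hypothetical helper.

```shell
# Write one PLINK fileset prefix per line for every .fam path given as an argument.
make_merge_list() {
    for fam in "$@"; do
        printf '%s\n' "${fam%.fam}"   # strip only the trailing .fam
    done
}

# Placeholder directory tree that mimics the layout used in this notebook.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/afr/uhs1" "$tmpdir/afr/uhs3_v1-2"
touch "$tmpdir/afr/uhs1/uhs1_afr_preimputation_filters_snp_intersection.fam" \
      "$tmpdir/afr/uhs3_v1-2/uhs3_v1-2_afr_preimputation_filters_snp_intersection.fam"

# Run in a subshell so the caller's working directory is unchanged.
(cd "$tmpdir" && make_merge_list afr/uhs*/uhs*_afr_preimputation_filters_snp_intersection.fam) \
    > "$tmpdir/afr/uhs_afr_merge_list.txt"
cat "$tmpdir/afr/uhs_afr_merge_list.txt"
```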

#### EUR

In [None]:
cd $baseDir

# create merge list
wc -l eur/uhs*/uhs*_eur_preimputation_filters_snp_intersection.fam | awk '{print $2}' | head -n -1 | awk -F"." '{print $1}' > eur/uhs_eur_merge_list.txt

/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --merge-list eur/uhs_eur_merge_list.txt \
    --make-bed \
    --out eur/uhs1234_eur
# Error: 8 variants with 3+ alleles present.

# remove those variants and try the merge again.
while read line; do
    /shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
        --bfile $line \
        --exclude eur/uhs1234_eur-merge.missnp \
        --make-bed \
        --out ${line}_remove_missnps
done < eur/uhs_eur_merge_list.txt

# create new merge list
wc -l eur/uhs*/*remove_missnps.fam | awk '{print $2}' | head -n -1 | awk -F"." '{print $1}' > eur/uhs_eur_new_merge_list.txt

# try merge again
/shared/bioinformatics/software/third_party/plink-1.90-beta-4.10-x86_64/plink \
    --merge-list eur/uhs_eur_new_merge_list.txt \
    --make-bed \
    --out eur/uhs1234_eur

gzip eur/uhs1234_eur* 
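
When a merge fails because of variants with 3+ alleles, PLINK writes their IDs to `<out>-merge.missnp`. After excluding them, the merged variant count should equal the intersection count minus the number of listed IDs (352,937 − 11 = 352,926 for AFR and 413,790 − 8 = 413,782 for EUR). A minimal sketch of that check on toy ID lists; `count_after_exclusion` and the IDs are hypothetical.

```shell
# Count the variant IDs that remain after dropping the IDs listed in a .missnp file.
count_after_exclusion() {
    # $1: file of variant IDs (one per line); $2: file of IDs to exclude
    # -x -F: match whole lines as fixed strings, so rs2 does not also remove rs20
    grep -v -x -F -f "$2" "$1" | wc -l | tr -d ' '
}

tmpdir=$(mktemp -d)
printf 'rs1\nrs2\nrs3\nrs4\nrs5\n' > "$tmpdir/intersection.txt"
printf 'rs2\nrs5\n' > "$tmpdir/toy-merge.missnp"

count_after_exclusion "$tmpdir/intersection.txt" "$tmpdir/toy-merge.missnp"
```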

### Upload to S3
Upload the combined UHS data to S3.

In [None]:
cd $baseDir/afr/

# include a README within the version-level directory
for file in uhs1234_afr*; do
    aws s3 mv $file  s3://rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/snp_intersection/
done

cd $baseDir/eur/
# include a README within the version-level directory
for file in uhs1234_eur*; do
    aws s3 mv $file  s3://rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/snp_intersection/
done

## Genotype QC: relatedness & sex-check
Perform the sex-check and relatedness sub-workflows of the genotype array QC workflow on the combined UHS data.

Using the UHS4 version:
`s3://rti-shared/rti-midas-data/studies/uhs4/observed/genotypes/final/uhs4.merged2.{ea,aa}.$ext.gz`

### Create Directories

In [None]:
mkdir -p /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/{wf_input,wf_output}
mkdir -p /shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/{sex_check,relatedness}/

###  Set up config file for QC pipeline

Copy the JSON config files (here, the test configs from the workflow repository) and modify the settings so that each config includes the appropriate cohort information.

In [None]:
cd /shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/
cp /shared/biocloud_gwas_workflows/genotype_array_qc/sex_check/test/test_sex_check.json sex_check/uhs1234_afr_sex_check_wf.json
cp /shared/biocloud_gwas_workflows/genotype_array_qc/relatedness/test_relatedness_wf.json relatedness/uhs1234_afr_relatedness_wf.json
cp /shared/biocloud_gwas_workflows/genotype_array_qc/sex_check/test/test_sex_check.json sex_check/uhs1234_eur_sex_check_wf.json
cp /shared/biocloud_gwas_workflows/genotype_array_qc/relatedness/test_relatedness_wf.json relatedness/uhs1234_eur_relatedness_wf.json

# Edit wf config files

### Zip biocloud_gwas_workflows repo

In [None]:
cd /shared/biocloud_gwas_workflows
git pull
git submodule update --init --recursive
git rev-parse HEAD > /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/wf_input/git_hash.txt
cd /shared
zip \
    --exclude=*/var/* \
    --exclude=*.git/* \
    --exclude=*/test/* \
    --exclude=*/.idea/* \
    -r /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/wf_input/biocloud_gwas_workflows.zip \
    biocloud_gwas_workflows/

cd /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/

### Submit jobs

In [None]:
# Open session in terminal 1
ssh -i ~/.ssh/gwas_rsa -L localhost:8000:localhost:8000 ec2-user@54.174.185.7

# Submit jobs in terminal 2

# AFR relatedness
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/relatedness/relatedness_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/relatedness/uhs1234_afr_relatedness_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/wf_input/biocloud_gwas_workflows.zip"
echo ""

# AFR sex_check
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/sex_check/sex_check_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/sex_check/uhs1234_afr_sex_check_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/wf_input/biocloud_gwas_workflows.zip"
echo ""

# EUR relatedness
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/relatedness/relatedness_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/relatedness/uhs1234_eur_relatedness_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/wf_input/biocloud_gwas_workflows.zip"
echo ""

# EUR sex_check
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/sex_check/sex_check_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/sex_check/uhs1234_eur_sex_check_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0004/wf_input/biocloud_gwas_workflows.zip"
echo ""

In [None]:
# Monitor job in terminal 1
#tail -f /tmp/cromwell-server.log

afr_relate=22794ea1-ebd0-4f17-af93-9044d925d49d
afr_sex=ab5e5132-94f5-48ab-affd-6e90d7c3160d

eur_relate=2b5ffe83-12d5-46ae-830a-544c8558d911
eur_sex=96c0221f-98f2-4681-b500-528dfadeb050

# check job status in terminal 2
for job in "$afr_relate" "$afr_sex" "$eur_relate" "$eur_sex"; do
    curl -X GET "http://localhost:8000/api/workflows/v1/${job}/status"
    echo ""
done

```
{"status":"Succeeded","id":"22794ea1-ebd0-4f17-af93-9044d925d49d"}
{"status":"Succeeded","id":"ab5e5132-94f5-48ab-affd-6e90d7c3160d"}
{"status":"Succeeded","id":"2b5ffe83-12d5-46ae-830a-544c8558d911"}
{"status":"Succeeded","id":"96c0221f-98f2-4681-b500-528dfadeb050"}
```

## Sex-check results
31 AFR samples to remove.<br>
36 EUR samples to remove.

The code below was run in Jesse Marks' local environment.

In [None]:
## local
cd ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1234/uhs4_vmerged2/sex_check

#96c0221f-98f2-4681-b500-528dfadeb050_final_outputs.json
#ab5e5132-94f5-48ab-affd-6e90d7c3160d_final_outputs.json
curl -X GET "http://localhost:8000/api/workflows/v1/${afr_sex}/outputs" -H "accept: application/json" \
  > ${afr_sex}_final_outputs.json
curl -X GET "http://localhost:8000/api/workflows/v1/${eur_sex}/outputs" -H "accept: application/json" \
  > ${eur_sex}_final_outputs.json

## download remove files
wc -l *remove
#      31 uhs1234_afr_sex_check.sexcheck.sexcheck.remove
#      36 uhs1234_eur_sex_check.sexcheck.sexcheck.remove

## Find out which waves the sex discrepancies came from and create remove lists

### UHS1

In [None]:
## UHS1 AFR
mkdir -p ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs1/
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
     ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1/uhs1.aa.fam | \
    cut -d" " -f1 > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs1/uhs1_aa_sexcheck_remove.txt
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
     ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1/uhs1.aa.fam | wc -l
#1 problematic sample

        # 357532@1054698370 357532@1054698370 0 0 2 -9
    
    

## UHS1 EUR
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
     ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1/uhs1.ea.fam | \
    cut -d" " -f1 > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs1/uhs1_ea_sexcheck_remove.txt
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
     ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1/uhs1.ea.fam | wc -l
#2 problematic samples

        ## 162691@1054698377 162691@1054698377 0 0 2 -9
        ## 975568@1054752720 975568@1054752720 0 0 1 -9

### UHS2

In [None]:
### UHS2
## UHS2 AFR
mkdir -p ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs2/
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
     ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs2/uhs2.aa.chr23.fam | \
    cut -d" " -f1 > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs2/uhs2_aa_sexcheck_remove.txt
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
     ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs2/uhs2.aa.chr23.fam | wc -l
#13 problematic samples

        #8002020629_HHG7930_AS97-10339 8002020629_HHG7930_AS97-10339 0 0 2 -9
        #8002022836_HHG9308_AS88-5991 8002022836_HHG9308_AS88-5991 0 0 2 -9
        #8002219639_HHG6875_AS95-00499 8002219639_HHG6875_AS95-00499 0 0 2 -9
        #8002220224_HHG5779_AS87-2842 8002220224_HHG5779_AS87-2842 0 0 2 -9
        #8002221789_HHG7523_AS96-03326 8002221789_HHG7523_AS96-03326 0 0 2 -9
        #8002688031_HHG5202_AS00-09358 8002688031_HHG5202_AS00-09358 0 0 2 -9
        #8002688586_HHG3456_AS01-15849 8002688586_HHG3456_AS01-15849 0 0 2 -9
        #8002689002_HHG3587_AS01-19301 8002689002_HHG3587_AS01-19301 0 0 2 -9
        #8002689503_HHG3705_AS02-03196 8002689503_HHG3705_AS02-03196 0 0 2 -9
        #8002690295_HHG1302_AS01-08806 8002690295_HHG1302_AS01-08806 0 0 2 -9
        #8002694995_HHG0641_AS00-01871 8002694995_HHG0641_AS00-01871 0 0 2 -9
        #8002694999_HHG0539_AS99-11829 8002694999_HHG0539_AS99-11829 0 0 2 -9
        #8002697175_HHG0897_AS00-10209 8002697175_HHG0897_AS00-10209 0 0 2 -9

        

## UHS2 EUR
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
     ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs2/uhs2.ea.chr23.fam | \
    cut -d" " -f1 > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs2/uhs2_ea_sexcheck_remove.txt
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
     ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs2/uhs2.ea.chr23.fam | wc -l
#14 problematic samples

        #8002020535_HHG8094_AS01-19139 8002020535_HHG8094_AS01-19139 0 0 2 -9
        #8002161405_HHG0077_AS91-0851 8002161405_HHG0077_AS91-0851 0 0 2 -9
        #8002220141_HHG5508_AS01-07007 8002220141_HHG5508_AS01-07007 0 0 2 -9
        #8002221104_HHG7353_AS02-08126 8002221104_HHG7353_AS02-08126 0 0 2 -9
        #8002221598_HHG6991_AS95-02824 8002221598_HHG6991_AS95-02824 0 0 2 -9
        #8002221680_HHG7504_AS96-02261 8002221680_HHG7504_AS96-02261 0 0 2 -9
        #8002221698_HHG7434_AS95-10698 8002221698_HHG7434_AS95-10698 0 0 2 -9
        #8002688451_HHG2500_AS92-3130 8002688451_HHG2500_AS92-3130 0 0 2 -9
        #8002689261_HHG4111_AS91-1758 8002689261_HHG4111_AS91-1758 0 0 2 -9
        #8002689363_HHG3965_AS90-6385 8002689363_HHG3965_AS90-6385 0 0 2 -9
        #8002690341_HHG0968_AS00-12469 8002690341_HHG0968_AS00-12469 0 0 2 -9
        #8002690600_HHG1902_AS93-2853 8002690600_HHG1902_AS93-2853 0 0 2 -9
        #8002690716_HHG1441_AS01-13920 8002690716_HHG1441_AS01-13920 0 0 2 -9
        #8002697188_HHG0910_AS00-10319 8002697188_HHG0910_AS00-10319 0 0 2 -9

### UHS3_1-2

In [None]:
### UHS3_v1-2
## UHS3_v1-2 AFR
mkdir -p ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs3/
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_1-2/uhs3.aa.V1-2.fam |\
    cut -d" " -f1 > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs3/uhs3_1-2_aa_sexcheck_remove.txt
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_1-2/uhs3.aa.V1-2.fam | wc -l
#2 problematic samples

        #8002022049_HHG8460_AS90-6445 8002022049_HHG8460_AS90-6445 0 0 2 -9
        #8002690630_HHG1800_AS88-5431 8002690630_HHG1800_AS88-5431 0 0 2 -9
        

## UHS3_v1-2 EUR
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_1-2/uhs3.ea.V1-2.fam | \
    cut -d" " -f1 > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs3/uhs3_1-2_ea_sexcheck_remove.txt
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_1-2/uhs3.ea.V1-2.fam | wc -l
#3 problematic samples

        #8002690370_HHG1042_AS00-13921 8002690370_HHG1042_AS00-13921 0 0 2 -9
        #8002690839_HHG1196_AS01-07048 8002690839_HHG1196_AS01-07048 0 0 2 -9
        #8002695466_HHG0676_AS00-03992 8002695466_HHG0676_AS00-03992 0 0 2 -9

### UHS3_1-3

In [None]:
### UHS3_v1-3
## UHS3_v1-3 AFR
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_1-3/uhs3.aa.V1-3.fam | \
    cut -d" " -f1  > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs3/uhs3_1-3_aa_sexcheck_remove.txt
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_1-3/uhs3.aa.V1-3.fam | wc -l
#6 problematic samples
        #8002221030_HHG6599_AS93-4798 8002221030_HHG6599_AS93-4798 0 0 2 -9
        #8002221634_HHG7376_AS95-08042 8002221634_HHG7376_AS95-08042 0 0 2 -9
        #8002688054_HHG5193_AS00-09274 8002688054_HHG5193_AS00-09274 0 0 2 -9
        #8002688215_HHG2990_AS96-04581 8002688215_HHG2990_AS96-04581 0 0 2 -9
        #8002688706_HHG3293_AS97-05079 8002688706_HHG3293_AS97-05079 0 0 2 -9
        #8002688835_HHG3107_AS96-10532 8002688835_HHG3107_AS96-10532 0 0 2 -9
        

## UHS3_v1-3 EUR
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_1-3/uhs3.ea.V1-3.fam | \
    cut -d" " -f1  > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs3/uhs3_1-3_ea_sexcheck_remove.txt
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_1-3/uhs3.ea.V1-3.fam | wc -l
#5 problematic samples
        #8002022140_HHG8667_AS91-4372 8002022140_HHG8667_AS91-4372 0 0 2 -9
        #8002161390_HHG0168_AS95-08217 8002161390_HHG0168_AS95-08217 0 0 2 -9
        #8002220223_HHG5763_AS01-14187 8002220223_HHG5763_AS01-14187 0 0 2 -9
        #8002221088_HHG7296_AS89-2397 8002221088_HHG7296_AS89-2397 0 0 2 -9
        #8002688694_HHG3292_AS97-05074 8002688694_HHG3292_AS97-05074 0 0 2 -9

### UHS4

In [None]:
### UHS4
## AFR
mkdir -p ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs4/
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs4/uhs4.merged2.aa.fam| \
    cut -d" " -f1  > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs4/uhs4_aa_sexcheck_remove.txt
grep -f uhs1234_afr_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs4/uhs4.merged2.aa.fam| wc -l
#9 problematic samples
        #AS00-06545_8002220476_HHG6317_33_E01 AS00-06545_8002220476_HHG6317_33_E01 0 0 2 -9
        #AS01-02391_8002220098_HHG5424_8_A01 AS01-02391_8002220098_HHG5424_8_A01 0 0 2 -9
        #AS02-05580_8002020034_HHG8251_36_D05 AS02-05580_8002020034_HHG8251_36_D05 0 0 2 -9
        #AS95-00970_8002220541_HHG6905_27_C08 AS95-00970_8002220541_HHG6905_27_C08 0 0 2 -9
        #AS95-00977_8002220565_HHG6907_27_E08 AS95-00977_8002220565_HHG6907_27_E08 0 0 2 -9
        #AS95-03805_8002221602_HHG7039_28_B01 AS95-03805_8002221602_HHG7039_28_B01 0 0 2 -9
        #AS95-8948_8002688254_HHG2868_19_B03 AS95-8948_8002688254_HHG2868_19_B03 0 0 2 -9
        #AS96-02177_8002221679_HHG7491_27_D11 AS96-02177_8002221679_HHG7491_27_D11 0 0 2 -9
        #AS98-13797_8002687443_HHG4785_17_A08 AS98-13797_8002687443_HHG4785_17_A08 0 0 2 -9
        

## EUR
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs4/uhs4.merged2.ea.fam | \
    cut -d" " -f1  > ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs4/uhs4_ea_sexcheck_remove.txt
grep -f uhs1234_eur_sex_check.sexcheck.sexcheck.remove \
    ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs4/uhs4.merged2.ea.fam | wc -l
#12 problematic samples
				#AS00-12753_8002220592_HHG5305_35_C03 AS00-12753_8002220592_HHG5305_35_C03 0 0 2 -9
				#AS01-07476_8002690855_HHG1242_2_C06 AS01-07476_8002690855_HHG1242_2_C06 0 0 2 -9
				#AS88-0067_8002690151_HHG1624_6_A10 AS88-0067_8002690151_HHG1624_6_A10 0 0 2 -9
				#AS90-3001_8002161168_HHG10477_10_F05 AS90-3001_8002161168_HHG10477_10_F05 0 0 2 -9
				#AS90-7230_8002689401_HHG4010_34_G08 AS90-7230_8002689401_HHG4010_34_G08 0 0 2 -9
				#AS91-0868_8002689367_HHG4064_34_C10 AS91-0868_8002689367_HHG4064_34_C10 0 0 2 -9
				#AS91-4444_8002022069_HHG8673_34_A06 AS91-4444_8002022069_HHG8673_34_A06 0 0 2 -9
				#AS94-4123_8002220689_HHG5608_33_F12 AS94-4123_8002220689_HHG5608_33_F12 0 0 2 -9
				#AS95-05752_8002221152_HHG7363_27_H10 AS95-05752_8002221152_HHG7363_27_H10 0 0 2 -9
				#AS98-02119_8002160687_HHG0215_15_G05 AS98-02119_8002160687_HHG0215_15_G05 0 0 2 -9
				#AS98-12584_8002687430_HHG4768_17_H06 AS98-12584_8002687430_HHG4768_17_H06 0 0 2 -9
				#AS99-13954_8002160702_HHG0248_15_C07 AS99-13954_8002160702_HHG0248_15_C07 0 0 2 -9
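
As a cross-check, the per-wave counts recorded in the comments above should add up to the totals in the two `.remove` files (31 AFR, 36 EUR). The sketch below only redoes that arithmetic. The counts are copied by hand from the cells above, and the dictionary names are illustrative, not part of the workflow.

```python
# Per-wave sex-check removals, copied by hand from the comments in the cells above.
afr_counts = {"uhs1": 1, "uhs2": 13, "uhs3_1-2": 2, "uhs3_1-3": 6, "uhs4": 9}
eur_counts = {"uhs1": 2, "uhs2": 14, "uhs3_1-2": 3, "uhs3_1-3": 5, "uhs4": 12}

# Totals reported by `wc -l *remove`.
expected = {"afr": 31, "eur": 36}

afr_total = sum(afr_counts.values())
eur_total = sum(eur_counts.values())

# If a total does not match, a flagged sample was missed or matched in more than one wave.
print("AFR:", afr_total, "expected", expected["afr"])
print("EUR:", eur_total, "expected", expected["eur"])
```

Matching totals suggest that each flagged sample was found in exactly one wave, although the sums alone cannot prove it.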

## Relatedness results
[AFR duplicate pairs](https://www.notion.so/b2fc96a26e0e4b7fbfbb771964d8c94a)
The number of AFR sample pairs identified as monozygotic twins or duplicates.

| Wave Name | UHS1 # | UHS2 # | UHS3 #  | UHS4 # | Total |
|-----------|--------|--------|---------|--------|-------|
| UHS1      | 0      | 118    | 42      | 161    | 321   |
| UHS2      | -      | 0      | 24      | 105    | 129   |
| UHS3      | -      | -      | 1       | 24     | 25    |
| UHS4      | -      | -      | -       | 0      | 0     |

**SUM: 475**

<br>

[AFR 1st° pairs](https://www.notion.so/1fc703974de04df7a1c2cc22805b26b2)
The number of AFR sample pairs identified as being 1st degree relatives.

| Wave Name | UHS1 # | UHS2 # | UHS3 #  | UHS4 # | Total |
|-----------|--------|--------|---------|--------|-------|
| UHS1      | 0      | 44     | 11      | 58     | 113   |
| UHS2      | -      | 0      | 4       | 25     | 29    |
| UHS3      | -      | -      | 0       | 4      | 4     |
| UHS4      | -      | -      | -       | 0      | 0     |

**SUM: 146**




<br><br>
___


[EUR duplicate pairs](https://www.notion.so/5b9557807f214127bb16fa3ec07748b5)
The number of EUR sample pairs identified as monozygotic twins or duplicates.

| Wave Name | UHS1 # | UHS2 # | UHS3 #  | UHS4 # | Total |
|-----------|--------|--------|---------|--------|-------|
| UHS1      | 0      | 52     | 5       | 77     | 134   |
| UHS2      | -      | 0      | 9       | 81     | 90    |
| UHS3      | -      | -      | 0       | 9      | 9     |
| UHS4      | -      | -      | -       | 0      | 0     |


**SUM: 233**

<br>

[EUR 1st° pairs](https://www.notion.so/70529799e18a4f7e9972c0beedc44cf5)
The number of EUR sample pairs identified as being 1st degree relatives.


| Wave Name | UHS1 # | UHS2 # | UHS3 #  | UHS4 # | Total |
|-----------|--------|--------|---------|--------|-------|
| UHS1      | 0      | 8      | 1       | 8      | 17    |
| UHS2      | -      | 0      | 0       | 8      | 8     |
| UHS3      | -      | -      | 0       | 1      | 1     |
| UHS4      | -      | -      | -       | 0      | 0     |

**SUM: 26**
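
The row totals and sums in the four tables can be rechecked with a few lines of Python. This is a minimal sketch with the cell values typed in by hand from the tables above. Each row lists only the upper-triangle cells, starting with the diagonal, and the dictionary keys are illustrative labels.

```python
# Upper-triangle cells of each table, one list per row (UHS1, UHS2, UHS3, UHS4),
# each starting with the within-wave (diagonal) count.
tables = {
    "afr_duplicates":   [[0, 118, 42, 161], [0, 24, 105], [1, 24], [0]],
    "afr_first_degree": [[0, 44, 11, 58],   [0, 4, 25],   [0, 4],  [0]],
    "eur_duplicates":   [[0, 52, 5, 77],    [0, 9, 81],   [0, 9],  [0]],
    "eur_first_degree": [[0, 8, 1, 8],      [0, 0, 8],    [0, 1],  [0]],
}

# Sum of every cell in each table, which should equal the number of pairs.
totals = {name: sum(sum(row) for row in rows) for name, rows in tables.items()}

for name, total in totals.items():
    print(name, total)
```

Each total should also equal the line count of the corresponding pair list minus one, since those files start with a header line.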

Below is the code used to count the samples with relatedness or sex-discrepancy issues. It was run in Jesse Marks' local environment.

### Create ID lists
Get the sample IDs for each wave of UHS so we can determine which wave each problematic sample pair is from. 

In [None]:
## Open tunnel session in terminal 1
ssh -i ~/.ssh/gwas_rsa -L localhost:8000:localhost:8000 ec2-user@54.174.185.7

## Open session in terminal 2
cd ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506/uhs1234/uhs4_vmerged2/relatedness/

## check job status in terminal 2
afr_sex=ab5e5132-94f5-48ab-affd-6e90d7c3160d
eur_sex=96c0221f-98f2-4681-b500-528dfadeb050
afr_relate=22794ea1-ebd0-4f17-af93-9044d925d49d
eur_relate=2b5ffe83-12d5-46ae-830a-544c8558d911

for job in "$afr_relate" "$afr_sex" "$eur_relate" "$eur_sex"; do
    curl -X GET "http://localhost:8000/api/workflows/v1/${job}/status"
    echo ""
done

#{"status":"Succeeded","id":"22794ea1-ebd0-4f17-af93-9044d925d49d"}
#{"status":"Succeeded","id":"ab5e5132-94f5-48ab-affd-6e90d7c3160d"}
#{"status":"Succeeded","id":"2b5ffe83-12d5-46ae-830a-544c8558d911"}
#{"status":"Succeeded","id":"96c0221f-98f2-4681-b500-528dfadeb050"}

## Download results of workflow from Swagger UI.
curl -X GET "http://localhost:8000/api/workflows/v1/${afr_relate}/outputs" -H "accept: application/json" \
  > ${afr_relate}_final_outputs.json
curl -X GET "http://localhost:8000/api/workflows/v1/${eur_relate}/outputs" -H "accept: application/json" \
  > ${eur_relate}_final_outputs.json

## Download output files from S3
aws s3 cp s3://rti-cromwell-output/cromwell-execution/relatedness_wf/22794ea1-ebd0-4f17-af93-9044d925d49d/call-final_get_relateds/KING.king_kinship_wf/44282123-f968-4a60-8436-356c0e2a49fb/call-prune_related_samples/uhs1234_afr_relatedness.final.king.pruned.annotated.k0 .
aws s3 cp s3://rti-cromwell-output/cromwell-execution/relatedness_wf/2b5ffe83-12d5-46ae-830a-544c8558d911/call-final_get_relateds/KING.king_kinship_wf/daf7be74-3f71-4f64-8984-d643b7845089/call-prune_related_samples/uhs1234_eur_relatedness.final.king.pruned.annotated.k0 .

head -2 uhs1234_afr_relatedness.final.king.pruned.annotated.k0
#FID1	ID1	FID2	ID2	N_SNP	HetHet	IBS0	Kinship	Classification
#23	706@1064714543	732	346384@1054752919	99853	0.1543	0.0504	0.0729	3+_degree_relative

cd ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506/uhs1/

cut -d " " -f1 uhs1.aa.fam > uhs1_aa_ids.txt
cut -d " " -f1 uhs1.ea.fam > uhs1_ea_ids.txt

cd ../uhs2/
cut -d " " -f1 uhs2.aa.fam > uhs2_aa_ids.txt
cut -d " " -f1 uhs2.ea.fam > uhs2_ea_ids.txt

# UHS3 came in two batches, which we combine into one ID list per ancestry.
# The lists are appended to (>>), so delete any old copies before rerunning.
mkdir -p ../uhs3_combined
cd ../uhs3_1-2/
cut -d " " -f1 uhs3.aa.V1-2.fam >> ../uhs3_combined/uhs3_aa_ids.txt
cut -d " " -f1 uhs3.ea.V1-2.fam >> ../uhs3_combined/uhs3_ea_ids.txt

cd ../uhs3_1-3/
cut -d " " -f1 uhs3.aa.V1-3.fam >> ../uhs3_combined/uhs3_aa_ids.txt
cut -d " " -f1 uhs3.ea.V1-3.fam >> ../uhs3_combined/uhs3_ea_ids.txt

cd ../uhs4/
cut -d " " -f1 uhs4.aa.fam > uhs4_aa_ids.txt
cut -d " " -f1 uhs4.ea.fam > uhs4_ea_ids.txt
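
The same ID extraction can be sketched in Python, which avoids the append-on-rerun problem of `>>` when combining the two UHS3 batches. The function names `fam_ids` and `write_ids` are hypothetical helpers, not part of the workflow. Unlike `cut`, this sketch also drops repeated IDs, which is an assumption about what is wanted here.

```python
def fam_ids(fam_path):
    """Return the first column (family ID) of each non-empty line in a PLINK .fam file."""
    ids = []
    with open(fam_path) as fam:
        for line in fam:
            fields = line.split()
            if fields:
                ids.append(fields[0])
    return ids


def write_ids(fam_paths, out_path):
    """Write the IDs from one or more .fam files to out_path, overwriting any old copy.

    IDs are written once each, in first-seen order. Returns the number of IDs written.
    """
    seen = set()
    with open(out_path, "w") as out:
        for fam_path in fam_paths:
            for sample_id in fam_ids(fam_path):
                if sample_id not in seen:
                    seen.add(sample_id)
                    out.write(sample_id + "\n")
    return len(seen)
```

For example, `write_ids(["uhs3.aa.V1-2.fam", "uhs3.aa.V1-3.fam"], "uhs3_aa_ids.txt")` would produce the combined UHS3 AFR list in one step, assuming both files are in the working directory.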

### Create problematic samples lists
Organize the results from the genotype QC workflows that performed the relatedness checks into lists. We will use these lists to count the number of duplicate pairs and 1st-degree relative pairs.

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1234/uhs4_vmerged2/relatedness/

## download files from the output JSON files:
# 22794ea1-ebd0-4f17-af93-9044d925d49d_final_outputs.json
# 2b5ffe83-12d5-46ae-830a-544c8558d911_final_outputs.json

head -2 uhs1234_afr_relatedness.final.king.pruned.annotated.k0
#FID1	ID1	FID2	ID2	N_SNP	HetHet	IBS0	Kinship	Classification
#23	706@1064714543	732	346384@1054752919	99838	0.1552	0.05	0.0738	3+_degree_relative

## get all 1st degree pairs
head -1 uhs1234_afr_relatedness.final.king.pruned.annotated.k0 | \
  cut -f2,4 > uhs1234_afr_1st_degree_relative_pairs.txt
awk '$9=="1st_degree_relative"' uhs1234_afr_relatedness.final.king.pruned.annotated.k0 | \
  cut -f2,4 >> uhs1234_afr_1st_degree_relative_pairs.txt

head -1 uhs1234_eur_relatedness.final.king.pruned.annotated.k0 | \
  cut -f2,4 > uhs1234_eur_1st_degree_relative_pairs.txt
awk '$9=="1st_degree_relative"' uhs1234_eur_relatedness.final.king.pruned.annotated.k0 | \
  cut -f2,4 >> uhs1234_eur_1st_degree_relative_pairs.txt


## get all MZ-twins/duplicates
head -1 uhs1234_afr_relatedness.final.king.pruned.annotated.k0 | \
  cut -f2,4 > uhs1234_afr_duplicate_pairs.txt
awk '$9=="MZ_twin_or_duplicate"' uhs1234_afr_relatedness.final.king.pruned.annotated.k0 | \
  cut -f2,4 >> uhs1234_afr_duplicate_pairs.txt

head -1 uhs1234_eur_relatedness.final.king.pruned.annotated.k0 | \
  cut -f2,4 > uhs1234_eur_duplicate_pairs.txt
awk '$9=="MZ_twin_or_duplicate"' uhs1234_eur_relatedness.final.king.pruned.annotated.k0 | \
  cut -f2,4 >> uhs1234_eur_duplicate_pairs.txt

## summary (each line count includes the one-line header)
wc -l uhs1234_afr_1st_degree_relative_pairs.txt
     #147 uhs1234_afr_1st_degree_relative_pairs.txt
wc -l uhs1234_afr_duplicate_pairs.txt
     #476 uhs1234_afr_duplicate_pairs.txt
wc -l uhs1234_eur_1st_degree_relative_pairs.txt
     #27 uhs1234_eur_1st_degree_relative_pairs.txt
wc -l uhs1234_eur_duplicate_pairs.txt
     #234 uhs1234_eur_duplicate_pairs.txt
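
The `awk`/`cut` filtering above can also be written in Python by selecting on the column name instead of the column position. This is a minimal sketch that assumes the tab-delimited header shown in the `head -2` output (`ID1`, `ID2`, `Classification`). The function name `pairs_by_class` is hypothetical.

```python
import csv


def pairs_by_class(king_file, classification):
    """Return (ID1, ID2) tuples for rows whose Classification column matches exactly."""
    with open(king_file, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            (row["ID1"], row["ID2"])
            for row in reader
            if row["Classification"] == classification
        ]
```

Because the header is consumed by `csv.DictReader`, `len(pairs_by_class(path, "MZ_twin_or_duplicate"))` gives the number of pairs directly, with no header line to subtract.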

### Count the number of duplicate pairs

In [None]:
cd ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/
mkdir -p remove_samples_methodically/uhs4_vmerged2/
cd remove_samples_methodically/uhs4_vmerged2/

ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1234/uhs4_vmerged2/relatedness/uhs1234_afr_duplicate_pairs.txt
ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1234/uhs4_vmerged2/relatedness/uhs1234_eur_duplicate_pairs.txt
ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1234/uhs4_vmerged2/relatedness/uhs1234_afr_1st_degree_relative_pairs.txt
ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1234/uhs4_vmerged2/relatedness/uhs1234_eur_1st_degree_relative_pairs.txt

ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1/uhs1_aa_ids.txt
ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs1/uhs1_ea_ids.txt

ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs2/uhs2_aa_ids.txt
ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs2/uhs2_ea_ids.txt

ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_combined/uhs3_aa_ids.txt
ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs3_combined/uhs3_ea_ids.txt

ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs4/uhs4_aa_ids.txt
ln -s ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/uhs4/uhs4_ea_ids.txt

In [None]:
### Python

"""
count_pairs.py

Given two lists of samples (one column per list) and a 
third list containing sample pairs (two columns), this 
function counts the number of sample pairs in the third 
list that contain samples from both list 1 and list 2.
"""

def count_pairs(keep_file, remove_file, pairs_file):
    with open(keep_file) as keepF,\
         open(remove_file) as removeF,\
         open(pairs_file) as pairF:

        # add samples to a set to mark for keeping
        keep_set = set()
        line = keepF.readline()
        while line:
            keep_set.add(line.strip())
            line = keepF.readline()

        # add samples to a set to mark for removing
        remove_set = set()
        line = removeF.readline()
        while line:
            remove_set.add(line.strip())
            line = removeF.readline()
        
        # go through the pairs list and count the pairs that span keep_set and remove_set
        pairF.readline()  # skip the header line
        line = pairF.readline()

        count = 0
        while line:
            sl = line.split()
            id1 = sl[0]
            id2 = sl[1]
            if (id1 in keep_set) or (id2 in keep_set): # is a sample from the keep_set in this pair?
                if (id1 in keep_set) and (id2 in keep_set): # are both samples of the pair from the keep_set?
                    print('Uh Oh! A duplicate within same wave!') 
                    print(line)
                if (id1 in remove_set) or (id2 in remove_set): # if so, is the other sample from the remove_set?
                    count += 1
            line = pairF.readline()
        print(count)

In [None]:
### AFR
pairs = "uhs1234_afr_duplicate_pairs.txt"
uhs1 = "uhs1_aa_ids.txt"
uhs2 = "uhs2_aa_ids.txt"
uhs3 = "uhs3_aa_ids.txt"
uhs4 = "uhs4_aa_ids.txt"


count_pairs(keep_file=uhs1, remove_file=uhs2, pairs_file=pairs) # 118
count_pairs(keep_file=uhs1, remove_file=uhs3, pairs_file=pairs) # 42
count_pairs(keep_file=uhs1, remove_file=uhs4, pairs_file=pairs) # 161
count_pairs(keep_file=uhs2, remove_file=uhs3, pairs_file=pairs) # 24
count_pairs(keep_file=uhs2, remove_file=uhs4, pairs_file=pairs) # 105
count_pairs(keep_file=uhs3, remove_file=uhs4, pairs_file=pairs) # 24

# Note about UHS3. There was a duplicate identified within this wave.
# This is because we combined wave3_v1-2 with wave3_v1-3
# The duplicate was across these two waves.

#8002220849_HHG6007_AS88-0946	8002695215_HHG0308_AS89-8248
#grep 8002220849_HHG6007_AS88-0946 ../uhs3_1-3/*fam
    #../uhs3_1-3/uhs3.aa.V1-3.fam:8002220849_HHG6007_AS88-0946 8002220849_HHG6007_AS88-0946 0 0 1 -9
#grep 8002695215_HHG0308_AS89-8248 ../uhs3_1-2/*fam
    #../uhs3_1-2/uhs3.aa.V1-2.fam:8002695215_HHG0308_AS89-8248 8002695215_HHG0308_AS89-8248 0 0 1 -9



####################################################################################################        
### EUR
pairs = "uhs1234_eur_duplicate_pairs.txt"
uhs1 = "uhs1_ea_ids.txt"
uhs2 = "uhs2_ea_ids.txt"
uhs3 = "uhs3_ea_ids.txt"
uhs4 = "uhs4_ea_ids.txt"

count_pairs(keep_file=uhs1, remove_file=uhs2, pairs_file=pairs) # 52
count_pairs(keep_file=uhs1, remove_file=uhs3, pairs_file=pairs) # 5
count_pairs(keep_file=uhs1, remove_file=uhs4, pairs_file=pairs) # 77
count_pairs(keep_file=uhs2, remove_file=uhs3, pairs_file=pairs) # 9
count_pairs(keep_file=uhs2, remove_file=uhs4, pairs_file=pairs) # 81
count_pairs(keep_file=uhs3, remove_file=uhs4, pairs_file=pairs) # 9
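
The six `count_pairs` calls per ancestry can be collapsed into a single pass that tallies every wave combination at once, including within-wave pairs. This is a sketch under assumptions, not the code that produced the tables: `load_wave_map` and `tally_pairs` are hypothetical names, and an ID that appears in more than one wave list is assigned to the last list read.

```python
from collections import Counter


def load_wave_map(id_files):
    """Map each sample ID to a wave label; id_files is a dict of {wave_label: path}."""
    wave_of = {}
    for wave, path in id_files.items():
        with open(path) as handle:
            for line in handle:
                sample_id = line.strip()
                if sample_id:
                    wave_of[sample_id] = wave
    return wave_of


def tally_pairs(pairs_file, wave_of):
    """Count pairs per sorted (wave, wave) combination; IDs not in wave_of count as 'unknown'."""
    counts = Counter()
    with open(pairs_file) as handle:
        next(handle)  # skip the ID1/ID2 header line
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # ignore blank or malformed lines
            waves = sorted(wave_of.get(sample_id, "unknown") for sample_id in fields[:2])
            counts[tuple(waves)] += 1
    return counts
```

With `wave_of = load_wave_map({"uhs1": "uhs1_aa_ids.txt", "uhs2": "uhs2_aa_ids.txt", "uhs3": "uhs3_aa_ids.txt", "uhs4": "uhs4_aa_ids.txt"})`, the entry `tally_pairs("uhs1234_afr_duplicate_pairs.txt", wave_of)[("uhs1", "uhs2")]` corresponds to the UHS1 by UHS2 cell of the AFR duplicates table.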

### Count the number of 1st degree relative pairs

In [None]:
cd remove_samples_methodically

ln -s ../uhs1234/uhs4_vmerged2/relatedness/uhs1234_afr_1st_degree_relative_pairs.txt
ln -s ../uhs1234/uhs4_vmerged2/relatedness/uhs1234_eur_1st_degree_relative_pairs.txt

vim count_samples.py

In [None]:
"""
Given two lists of samples (one column per list) and a 
third list containing sample pairs (two columns), this 
function counts the number of sample pairs in the third 
list that contain samples from both list 1 and list 2.
"""

def count_pairs(keep_file, remove_file, pairs_file):
    with open(keep_file) as keepF,\
         open(remove_file) as removeF,\
         open(pairs_file) as pairF:

        # add samples to a set to mark for keeping
        keep_set = set()
        line = keepF.readline()
        while line:
            keep_set.add(line.strip())
            line = keepF.readline()

        # add samples to a set to mark for removing
        remove_set = set()
        line = removeF.readline()
        while line:
            remove_set.add(line.strip())
            line = removeF.readline()
        
        # go through the pairs list and count the pairs that span keep_set and remove_set
        pairF.readline()  # skip the header line
        line = pairF.readline()

        count = 0
        while line:
            sl = line.split()
            id1 = sl[0]
            id2 = sl[1]
            if (id1 in keep_set) or (id2 in keep_set): # is a sample from the keep_set in this pair?
                if (id1 in keep_set) and (id2 in keep_set): # are both samples of the pair from the keep_set?
                    print('Uh Oh! A duplicate within same wave!') 
                    print(line)
                if (id1 in remove_set) or (id2 in remove_set): # if so, is the other sample from the remove_set?
                    count += 1
            line = pairF.readline()
        print(count)

In [None]:
### AFR
pairs = "uhs1234_afr_1st_degree_relative_pairs.txt"
uhs1 = "uhs1_aa_ids.txt"
uhs2 = "uhs2_aa_ids.txt"
uhs3 = "uhs3_aa_ids.txt"
uhs4 = "uhs4_aa_ids.txt"


count_pairs(keep_file=uhs1, remove_file=uhs2, pairs_file=pairs) # 44
count_pairs(keep_file=uhs1, remove_file=uhs3, pairs_file=pairs) # 11
count_pairs(keep_file=uhs1, remove_file=uhs4, pairs_file=pairs) # 58
count_pairs(keep_file=uhs2, remove_file=uhs3, pairs_file=pairs) # 4
count_pairs(keep_file=uhs2, remove_file=uhs4, pairs_file=pairs) # 25
count_pairs(keep_file=uhs3, remove_file=uhs4, pairs_file=pairs) # 4

####################################################################################################        
### EUR
pairs = "uhs1234_eur_1st_degree_relative_pairs.txt"
uhs1 = "uhs1_ea_ids.txt"
uhs2 = "uhs2_ea_ids.txt"
uhs3 = "uhs3_ea_ids.txt"
uhs4 = "uhs4_ea_ids.txt"

count_pairs(keep_file=uhs1, remove_file=uhs2, pairs_file=pairs) # 8
count_pairs(keep_file=uhs1, remove_file=uhs3, pairs_file=pairs) # 1
count_pairs(keep_file=uhs1, remove_file=uhs4, pairs_file=pairs) # 8
count_pairs(keep_file=uhs2, remove_file=uhs3, pairs_file=pairs) # 0
count_pairs(keep_file=uhs2, remove_file=uhs4, pairs_file=pairs) # 8
count_pairs(keep_file=uhs3, remove_file=uhs4, pairs_file=pairs) # 1

## Create remove-samples lists
Combine the samples flagged as problematic for any of the following reasons: sex discrepancy, 1st-degree relatedness, or sample duplication.

### Duplicate samples list
Give UHS1 precedence over the other waves. In particular, if one sample of a pair is from UHS1 and the other is from UHS2, remove the UHS2 sample and keep the UHS1 sample. The same rule is applied down the line: UHS2 takes precedence over UHS3 and UHS4, and UHS3 over UHS4.
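
The precedence rule can be sketched as a small function. The ordering of UHS2 through UHS4 is inferred from the output file names used in the cells that follow (for example `uhs3_aa_duplicates_of_uhs2_remove.txt`). `WAVE_ORDER` and `sample_to_remove` are hypothetical names introduced for illustration.

```python
# Earlier waves take precedence. The order beyond UHS1 is an assumption
# inferred from the keep/remove pairings in the cells that follow.
WAVE_ORDER = ["uhs1", "uhs2", "uhs3", "uhs4"]


def sample_to_remove(id1, wave1, id2, wave2, order=WAVE_ORDER):
    """Return the ID from the lower-precedence wave, or None for a within-wave pair."""
    rank1 = order.index(wave1)
    rank2 = order.index(wave2)
    if rank1 == rank2:
        return None  # same wave: needs a manual decision
    return id2 if rank1 < rank2 else id1
```

A within-wave pair returns `None` because the rule gives no basis for choosing between the two samples, so that case has to be handled by hand.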

#### AFR
Remember to remove one of the duplicates in UHS3. The duplicate was across v1-2 and v1-3.

In [None]:
def main():
    pairs = "uhs1234_afr_duplicate_pairs.txt"
    uhs1 = "uhs1_aa_ids.txt"
    uhs2 = "uhs2_aa_ids.txt"
    uhs3 = "uhs3_aa_ids.txt"
    uhs4 = "uhs4_aa_ids.txt"

    out_remove = "uhs2_aa_duplicates_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs2, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs3_aa_duplicates_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs3, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_aa_duplicates_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs3_aa_duplicates_of_uhs2_remove.txt"
    remove_samples(keep_file=uhs2, remove_file=uhs3, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_aa_duplicates_of_uhs2_remove.txt"
    remove_samples(keep_file=uhs2, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_aa_duplicates_of_uhs3_remove.txt"
    remove_samples(keep_file=uhs3, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)


def remove_samples(keep_file, remove_file, pairs_file, out_remove):
    with open(keep_file) as keepF,\
         open(remove_file) as removeF,\
         open(pairs_file) as pairF, \
         open(out_remove, "w") as finalF:

        # add samples to a set to mark for keeping
        keep_set = set()
        line = keepF.readline()
        while line:
            keep_set.add(line.strip())
            line = keepF.readline()

        # add samples to a set to mark for removing
        remove_set = set()
        line = removeF.readline()
        while line:
            remove_set.add(line.strip())
            line = removeF.readline()

        # go through pairs list and identify all samples from remove_set to remove
        head = pairF.readline()
        #outF.write(head)
        line = pairF.readline()
        count = 0
        while line:
            sl = line.split()
            id1 = sl[0]
            id2 = sl[1]
            if (id1 in keep_set) or (id2 in keep_set): # is a sample from the keep_set in this pair?
                if (id1 in keep_set) and (id2 in keep_set): # are both samples of the pair from the keep_set?
                    print("Uh Oh! A duplicate within same wave!")
                    print(line)
                elif (id1 in remove_set): # is one from keep_set and other from the remove_set?
                    finalF.write(id1 + "\n")
                    count += 1
                elif (id2 in remove_set): # is one from keep_set and other from the remove_set?
                    finalF.write(id2 + "\n")
                    count += 1
                else: # if no to both, then don't remove the pair
                    pass
                    #outF.write(line)
            else:
                pass
                #outF.write(line)
            line = pairF.readline()
        print(count)

####################################################################################################        
if __name__ == "__main__":
    main()

In [None]:
## Combine the different remove-duplicate-samples files
mkdir -p uhs{2..4}

mv uhs2_aa_duplicates* uhs2
mv uhs3_aa_duplicates* uhs3
mv uhs4_aa_duplicates* uhs4

cd uhs2
paste <(cat uhs2_aa_duplicates_of*) <(cat uhs2_aa_duplicates_of*) > uhs2_aa_duplicates_to_remove.txt
wc -l uhs2_aa_duplicates_to_remove.txt # 118 uhs2_aa_duplicates_to_remove.txt

cd ../uhs3
paste <(cat uhs3_aa_duplicates_of*) <(cat uhs3_aa_duplicates_of*) > uhs3_aa_duplicates_to_remove.txt
echo -e "8002220849_HHG6007_AS88-0946\t8002220849_HHG6007_AS88-0946" >> uhs3_aa_duplicates_to_remove.txt
wc -l uhs3_aa_duplicates_to_remove.txt # 67 uhs3_aa_duplicates_to_remove.txt

cd ../uhs4
paste <(cat uhs4_aa_duplicates_of*) <(cat uhs4_aa_duplicates_of*) > uhs4_aa_duplicates_to_remove.txt
wc -l uhs4_aa_duplicates_to_remove.txt # 290 uhs4_aa_duplicates_to_remove.txt

#### EUR

In [None]:
def main():
    pairs = "uhs1234_eur_duplicate_pairs.txt"
    uhs1 = "uhs1_ea_ids.txt"
    uhs2 = "uhs2_ea_ids.txt"
    uhs3 = "uhs3_ea_ids.txt"
    uhs4 = "uhs4_ea_ids.txt"

    out_remove = "uhs2_ea_duplicates_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs2, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs3_ea_duplicates_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs3, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_ea_duplicates_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs3_ea_duplicates_of_uhs2_remove.txt"
    remove_samples(keep_file=uhs2, remove_file=uhs3, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_ea_duplicates_of_uhs2_remove.txt"
    remove_samples(keep_file=uhs2, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_ea_duplicates_of_uhs3_remove.txt"
    remove_samples(keep_file=uhs3, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)


def remove_samples(keep_file, remove_file, pairs_file, out_remove):
    with open(keep_file) as keepF,\
         open(remove_file) as removeF,\
         open(pairs_file) as pairF, \
         open(out_remove, "w") as finalF:

        # add samples to a set to mark for keeping
        keep_set = set()
        line = keepF.readline()
        while line:
            keep_set.add(line.strip())
            line = keepF.readline()

        # add samples to a set to mark for removing
        remove_set = set()
        line = removeF.readline()
        while line:
            remove_set.add(line.strip())
            line = removeF.readline()

        # go through pairs list and identify all samples from remove_set to remove
        head = pairF.readline()
        #outF.write(head)
        line = pairF.readline()
        count = 0
        while line:
            sl = line.split()
            id1 = sl[0]
            id2 = sl[1]
            if (id1 in keep_set) or (id2 in keep_set): # is a sample from the keep_set in this pair?
                if (id1 in keep_set) and (id2 in keep_set): # are both samples of the pair from the keep_set?
                    print("Uh Oh! A duplicate within same wave!")
                    print(line)
                elif (id1 in remove_set): # is one from keep_set and other from the remove_set?
                    finalF.write(id1 + "\n")
                    count += 1
                elif (id2 in remove_set): # is one from keep_set and other from the remove_set?
                    finalF.write(id2 + "\n")
                    count += 1
                else: # if no to both, then don't remove the pair
                    pass
                    #outF.write(line)
            else:
                pass
                #outF.write(line)
            line = pairF.readline()
        print(count)

####################################################################################################        
if __name__ == "__main__":
    main()

In [None]:
## Combine the different remove-duplicate-samples files

mv uhs2_ea_duplicates* uhs2/
mv uhs3_ea_duplicates* uhs3/
mv uhs4_ea_duplicates* uhs4/

cd uhs2
paste <(cat uhs2_ea_duplicates_of*) <(cat uhs2_ea_duplicates_of*) > uhs2_ea_duplicates_to_remove.txt
wc -l uhs2_ea_duplicates_to_remove.txt # 52 uhs2_ea_duplicates_to_remove.txt

cd ../uhs3
paste <(cat uhs3_ea_duplicates_of*) <(cat uhs3_ea_duplicates_of*) > uhs3_ea_duplicates_to_remove.txt
wc -l uhs3_ea_duplicates_to_remove.txt # 14 uhs3_ea_duplicates_to_remove.txt

cd ../uhs4
paste <(cat uhs4_ea_duplicates_of*) <(cat uhs4_ea_duplicates_of*) > uhs4_ea_duplicates_to_remove.txt
wc -l uhs4_ea_duplicates_to_remove.txt # 167 uhs4_ea_duplicates_to_remove.txt

### 1st degree relatives list
Give UHS1 precedence over the other waves. In particular, if one sample of a pair is from UHS1 and the other is from UHS2, remove the UHS2 sample and keep the UHS1 sample. 
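
The precedence rule can be sketched compactly as below. This is a minimal illustration, not the notebook's actual script: the `WAVE_RANK` mapping, the `samples_to_remove` helper, and the toy IDs are all hypothetical, and the sketch assumes the rule extends to later waves (UHS2 over UHS3, and so on), as the pairwise calls in the cells that follow suggest.

```python
from typing import Dict, Iterable, Set, Tuple

# Lower rank = higher precedence (hypothetical labels for the four waves).
WAVE_RANK = {"uhs1": 1, "uhs2": 2, "uhs3": 3, "uhs4": 4}


def samples_to_remove(pairs: Iterable[Tuple[str, str]],
                      wave_of: Dict[str, str]) -> Set[str]:
    """Return the lower-precedence member of every cross-wave pair."""
    remove = set()
    for id1, id2 in pairs:
        rank1 = WAVE_RANK[wave_of[id1]]
        rank2 = WAVE_RANK[wave_of[id2]]
        if rank1 == rank2:
            # Same wave: the precedence rule does not apply, flag for review.
            print(f"Within-wave pair, review manually: {id1} {id2}")
        elif rank1 < rank2:
            remove.add(id2)
        else:
            remove.add(id1)
    return remove


# Toy IDs, not real samples.
wave_of = {"A": "uhs1", "B": "uhs2", "C": "uhs4", "D": "uhs3"}
pairs = [("A", "B"), ("C", "D"), ("B", "C")]
print(sorted(samples_to_remove(pairs, wave_of)))  # ['B', 'C']
```

Returning a set means a sample that pairs with several higher-precedence samples is still listed only once.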

#### AFR

In [None]:
def main():
    pairs = "uhs1234_afr_1st_degree_relative_pairs.txt"
    uhs1 = "uhs1_aa_ids.txt"
    uhs2 = "uhs2_aa_ids.txt"
    uhs3 = "uhs3_aa_ids.txt"
    uhs4 = "uhs4_aa_ids.txt"

    out_remove = "uhs2_aa_1st_degree_relatives_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs2, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs3_aa_1st_degree_relatives_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs3, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_aa_1st_degree_relatives_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs3_aa_1st_degree_relatives_of_uhs2_remove.txt"
    remove_samples(keep_file=uhs2, remove_file=uhs3, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_aa_1st_degree_relatives_of_uhs2_remove.txt"
    remove_samples(keep_file=uhs2, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_aa_1st_degree_relatives_of_uhs3_remove.txt"
    remove_samples(keep_file=uhs3, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)


def remove_samples(keep_file, remove_file, pairs_file, out_remove):
    with open(keep_file) as keepF,\
         open(remove_file) as removeF,\
         open(pairs_file) as pairF, \
         open(out_remove, "w") as finalF:

        # add samples to a set to mark for keeping
        keep_set = set()
        line = keepF.readline()
        while line:
            keep_set.add(line.strip())
            line = keepF.readline()

        # add samples to a set to mark for removing
        remove_set = set()
        line = removeF.readline()
        while line:
            remove_set.add(line.strip())
            line = removeF.readline()

        # go through pairs list and identify all samples from remove_set to remove
        head = pairF.readline()
        #outF.write(head)
        line = pairF.readline()
        count = 0
        while line:
            sl = line.split()
            id1 = sl[0]
            id2 = sl[1]
            if (id1 in keep_set) or (id2 in keep_set): # is a sample from the keep_set in this pair?
                if (id1 in keep_set) and (id2 in keep_set): # are both samples of the pair from the keep_set?
                    print("Uh Oh! A duplicate within same wave!")
                    print(line)
                elif (id1 in remove_set): # is one from keep_set and other from the remove_set?
                    finalF.write(id1 + "\n")
                    count += 1
                elif (id2 in remove_set): # is one from keep_set and other from the remove_set?
                    finalF.write(id2 + "\n")
                    count += 1
                else: # if no to both, then don't remove the pair
                    pass
                    #outF.write(line)
            else:
                pass
                #outF.write(line)
            line = pairF.readline()
        print(count)

####################################################################################################        
if __name__ == "__main__":
    main()

In [None]:
## Combine the different remove-1st-degree-relative-samples files
mv uhs2_aa_1st_degree_relatives* uhs2
mv uhs3_aa_1st_degree_relatives* uhs3
mv uhs4_aa_1st_degree_relatives* uhs4

cd uhs2
paste <(cat uhs2_aa_1st_degree_relatives*) \
  <(cat uhs2_aa_1st_degree_relatives*) > uhs2_aa_1st_degrees_to_remove.txt
wc -l uhs2_aa_1st_degrees_to_remove.txt # 44 uhs2_aa_1st_degrees_to_remove.txt

cd ../uhs3
paste <(cat uhs3_aa_1st_degree_relatives*) \
  <(cat uhs3_aa_1st_degree_relatives*) > uhs3_aa_1st_degrees_to_remove.txt
wc -l uhs3_aa_1st_degrees_to_remove.txt # 15 uhs3_aa_1st_degrees_to_remove.txt

cd ../uhs4
paste <(cat uhs4_aa_1st_degree_relatives*) \
  <(cat uhs4_aa_1st_degree_relatives*) > uhs4_aa_1st_degrees_to_remove.txt
wc -l uhs4_aa_1st_degrees_to_remove.txt # 87 uhs4_aa_1st_degrees_to_remove.txt


#### EUR

In [None]:
def main():
    pairs = "uhs1234_eur_1st_degree_relative_pairs.txt"
    uhs1 = "uhs1_ea_ids.txt"
    uhs2 = "uhs2_ea_ids.txt"
    uhs3 = "uhs3_ea_ids.txt"
    uhs4 = "uhs4_ea_ids.txt"

    out_remove = "uhs2_ea_1st_degree_relatives_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs2, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs3_ea_1st_degree_relatives_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs3, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_ea_1st_degree_relatives_of_uhs1_remove.txt"
    remove_samples(keep_file=uhs1, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs3_ea_1st_degree_relatives_of_uhs2_remove.txt"
    remove_samples(keep_file=uhs2, remove_file=uhs3, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_ea_1st_degree_relatives_of_uhs2_remove.txt"
    remove_samples(keep_file=uhs2, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

    out_remove = "uhs4_ea_1st_degree_relatives_of_uhs3_remove.txt"
    remove_samples(keep_file=uhs3, remove_file=uhs4, pairs_file=pairs, out_remove=out_remove)

def remove_samples(keep_file, remove_file, pairs_file, out_remove):
    with open(keep_file) as keepF,\
         open(remove_file) as removeF,\
         open(pairs_file) as pairF, \
         open(out_remove, "w") as finalF:

        # add samples to a set to mark for keeping
        keep_set = set()
        line = keepF.readline()
        while line:
            keep_set.add(line.strip())
            line = keepF.readline()

        # add samples to a set to mark for removing
        remove_set = set()
        line = removeF.readline()
        while line:
            remove_set.add(line.strip())
            line = removeF.readline()

        # go through pairs list and identify all samples from remove_set to remove
        head = pairF.readline()
        #outF.write(head)
        line = pairF.readline()
        count = 0
        while line:
            sl = line.split()
            id1 = sl[0]
            id2 = sl[1]
            if (id1 in keep_set) or (id2 in keep_set): # is a sample from the keep_set in this pair?
                if (id1 in keep_set) and (id2 in keep_set): # are both samples of the pair from the keep_set?
                    print("Uh Oh! A duplicate within same wave!")
                    print(line)
                elif (id1 in remove_set): # is one from keep_set and other from the remove_set?
                    finalF.write(id1 + "\n")
                    count += 1
                elif (id2 in remove_set): # is one from keep_set and other from the remove_set?
                    finalF.write(id2 + "\n")
                    count += 1
                else: # if no to both, then don't remove the pair
                    pass
                    #outF.write(line)
            else:
                pass
                #outF.write(line)
            line = pairF.readline()
        print(count)

####################################################################################################        
if __name__ == "__main__":
    main()

In [None]:
## Combine the different remove-1st-degree-relative-samples files
mv uhs2_ea_1st_degree_relatives* uhs2
mv uhs3_ea_1st_degree_relatives* uhs3
mv uhs4_ea_1st_degree_relatives* uhs4

cd uhs2
paste <(cat uhs2_ea_1st_degree_relatives*) \
  <(cat uhs2_ea_1st_degree_relatives*) > uhs2_ea_1st_degrees_to_remove.txt
wc -l uhs2_ea_1st_degrees_to_remove.txt # 8 uhs2_ea_1st_degrees_to_remove.txt

cd ../uhs3
paste <(cat uhs3_ea_1st_degree_relatives*) \
  <(cat uhs3_ea_1st_degree_relatives*) > uhs3_ea_1st_degrees_to_remove.txt
wc -l uhs3_ea_1st_degrees_to_remove.txt # 1 uhs3_ea_1st_degrees_to_remove.txt

cd ../uhs4
paste <(cat uhs4_ea_1st_degree_relatives*) \
  <(cat uhs4_ea_1st_degree_relatives*) > uhs4_ea_1st_degrees_to_remove.txt
wc -l uhs4_ea_1st_degrees_to_remove.txt # 17 uhs4_ea_1st_degrees_to_remove.txt

## Upload remove-files to S3

In [None]:
## UHS1
cd ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs1/
aws s3 cp uhs1_aa_sexcheck_remove.txt s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/
aws s3 cp uhs1_ea_sexcheck_remove.txt s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/


## UHS2
cd ~/Projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/remove_samples_methodically/uhs4_vmerged2/uhs2/

for file in *aa*; do
    aws s3 cp $file s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/
done

for file in *ea*; do
    aws s3 cp $file s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/
done


## UHS3
cd ../uhs3/

for file in *aa*; do
    aws s3 cp $file s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/
done

for file in *ea*; do
    aws s3 cp $file s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/
done


## UHS4
cd ../uhs4/

for file in *aa*; do
    aws s3 cp $file s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/
done

for file in *ea*; do
    aws s3 cp $file s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/
done

## Genotype QC UHS1234: removed relatedness- & sex-discrepant samples
Run the relatedness workflow on the combined UHS1234 data to verify that all of the duplicates have been removed.


*Note* that another way to handle the duplicates is as follows: find the duplicates of UHS1 across the different waves and keep the UHS1 samples while removing the duplicates of UHS1 from the other waves. Next, create a combined UHS data set of all the waves except for UHS1. Then run the combined set through the relatedness workflow to determine which samples should then be removed.

## Prepare data
Combine the duplicate-remove, 1st-degree-remove, & sex-issues-remove files. Run the relatedness and sex-check workflows on the combined UHS1234 data to verify that all of the duplicates and sex-discrepant samples have been removed. Note that there are instances of a sample being a duplicate of multiple other samples. For example, the UHS4 sample 
`AS99-14089_8002220307_HHG6143_28_B09`
is a duplicate of the UHS2 sample
`8002689005_HHG3625_AS01-19675`
and also of the UHS1 sample
`104984@1054754941`.
In this case, we give preference to the UHS1 sample and remove both of the other duplicates. So, when we retain only the unique samples within the remove list, the actual number may be smaller than the sum of the three lists (dups, 1st-degrees, and sex-issues) because, for example, AS99-14089_8002220307_HHG6143_28_B09 would have been listed twice.
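
A small sketch of why the de-duplicated count can fall below the summed count. The list names and toy IDs here are hypothetical; the notebook itself does the equivalent with `sort -u` in the next cell.

```python
def combine_remove_lists(*lists):
    """Union several lists of sample IDs so that each ID is kept once."""
    combined = set()
    for ids in lists:
        combined.update(ids)
    return combined


# Toy IDs: "S3" is a duplicate of both a UHS1 and a UHS2 sample,
# so it appears in two of the per-wave duplicate lists.
dups_of_uhs1 = ["S1", "S3"]
dups_of_uhs2 = ["S3"]
first_degrees = ["S4"]
sex_issues = ["S5"]

all_lists = [dups_of_uhs1, dups_of_uhs2, first_degrees, sex_issues]
total_listed = sum(len(ids) for ids in all_lists)
combined = combine_remove_lists(*all_lists)
print(total_listed, len(combined))  # 5 2-column entries listed, 4 unique samples
```

The gap between the two numbers is the number of repeated entries, which is a quick sanity check when comparing the unfiltered and unique-only totals.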

In [None]:
### AFR
## Download combined UHS1234 data
cd /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0003/afr/

for ext in {bed,bim,fam}; do
    aws s3 cp s3://rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/afr/snp_intersection/uhs1234_afr.$ext.gz .
done

## gunzip
## download remove lists
cd remove_lists/
aws s3 sync s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/ .

## combine remove lists
cat <(paste uhs1_aa_sexcheck_remove.txt uhs1_aa_sexcheck_remove.txt) \
    <(paste uhs2_aa_sexcheck_remove.txt uhs2_aa_sexcheck_remove.txt) \
    <(paste uhs3_1-2_aa_sexcheck_remove.txt uhs3_1-2_aa_sexcheck_remove.txt) \
    <(paste uhs3_1-3_aa_sexcheck_remove.txt uhs3_1-3_aa_sexcheck_remove.txt) \
    <(paste uhs4_aa_sexcheck_remove.txt uhs4_aa_sexcheck_remove.txt) \
    uhs2_aa_1st_degrees_to_remove.txt \
    uhs3_aa_1st_degrees_to_remove.txt \
    uhs4_aa_1st_degrees_to_remove.txt \
    uhs2_aa_duplicates_to_remove.txt \
    uhs3_aa_duplicates_to_remove.txt \
    uhs4_aa_duplicates_to_remove.txt  | sort -u > \
    ../uhs1234_aa_combined_remove_list.txt  # 652 total without unique-only filter

wc -l ../uhs1234_aa_combined_remove_list.txt
#580 uhs1234_aa_combined_remove_list.txt
    
####################################################################################################
### EUR
cd /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0003/eur/
## Download combined UHS1234 data
for ext in {bed,bim,fam}; do
    aws s3 cp s3://rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0002/eur/snp_intersection/uhs1234_eur.$ext.gz .
done

## gunzip
## download remove lists
cd remove_lists/
aws s3 sync s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/ .

cat <(paste uhs1_ea_sexcheck_remove.txt uhs1_ea_sexcheck_remove.txt) \
    <(paste uhs2_ea_sexcheck_remove.txt uhs2_ea_sexcheck_remove.txt) \
    <(paste uhs3_1-2_ea_sexcheck_remove.txt uhs3_1-2_ea_sexcheck_remove.txt) \
    <(paste uhs3_1-3_ea_sexcheck_remove.txt uhs3_1-3_ea_sexcheck_remove.txt) \
    <(paste uhs4_ea_sexcheck_remove.txt uhs4_ea_sexcheck_remove.txt) \
    uhs2_ea_1st_degrees_to_remove.txt \
    uhs3_ea_1st_degrees_to_remove.txt \
    uhs4_ea_1st_degrees_to_remove.txt \
    uhs2_ea_duplicates_to_remove.txt \
    uhs3_ea_duplicates_to_remove.txt \
    uhs4_ea_duplicates_to_remove.txt  | sort -u > \
    ../uhs1234_ea_combined_remove_list.txt  # 295 total without unique-only filter

wc -l ../uhs1234_ea_combined_remove_list.txt
#278 ../uhs1234_ea_combined_remove_list.txt

In [None]:
## use plink to remove the samples
cd /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0003/afr
/shared/bioinformatics/software/third_party/plink-1.90-beta-6.16-x86_64/plink \
    --bfile uhs1234_afr \
    --remove uhs1234_aa_combined_remove_list.txt \
    --make-bed \
    --out uhs1234_afr_snp_intersection_removed_relative_and_sex_issues

wc -l  *fam
#  4033 uhs1234_afr.fam
#  3453 uhs1234_afr_snp_intersection_removed_relative_and_sex_issues.fam

gzip uhs1234_afr_snp_intersection_removed_relative_and_sex_issues*

## upload to S3
for ext in {bed,bim,fam,hh,log}; do
    file=uhs1234_afr_snp_intersection_removed_relative_and_sex_issues.$ext.gz
    aws s3 cp $file s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/
done



## use plink to remove the samples
cd /shared/rti-shared/shared_data/pre_qc/uhs1234/genotype/array/observed/0003/eur
/shared/bioinformatics/software/third_party/plink-1.90-beta-6.16-x86_64/plink \
    --bfile uhs1234_eur \
    --remove uhs1234_ea_combined_remove_list.txt \
    --make-bed \
    --out uhs1234_eur_snp_intersection_removed_relative_and_sex_issues

wc -l  *fam
#  3036 uhs1234_eur.fam
#  2758 uhs1234_eur_snp_intersection_removed_relative_and_sex_issues.fam

gzip uhs1234_eur_snp_intersection_removed_relative_and_sex_issues*

## upload to S3
for ext in {bed,bim,fam,hh,log}; do
    file=uhs1234_eur_snp_intersection_removed_relative_and_sex_issues.$ext.gz
    aws s3 cp $file s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/
done

## Create Directories

In [None]:
mkdir -p /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/{wf_input,wf_output}
mkdir -p /shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/{sex_check,relatedness}/

##  Set up config file for QC pipeline

Copy JSON file from previous run and modify settings. Then edit this config file to include the appropriate cohort information.
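
One way to script the edit step, shown as a hedged sketch: load the copied JSON, override selected inputs, and write the cohort-specific file. The `write_config` helper and the keys in the demo (`wf.output_basename`, `wf.cpu`) are made up for illustration; the real input names come from the workflow's own JSON and are not shown in this notebook.

```python
import json
import os
import tempfile


def write_config(template_path, out_path, overrides):
    """Copy a JSON config, replacing only keys that already exist in it."""
    with open(template_path) as fh:
        config = json.load(fh)
    unknown = set(overrides) - set(config)
    if unknown:
        # Rejecting unknown keys catches typos in workflow input names.
        raise KeyError(f"keys not in template: {sorted(unknown)}")
    config.update(overrides)
    with open(out_path, "w") as fh:
        json.dump(config, fh, indent=4)
    return config


# Demo with a throwaway template and hypothetical keys.
with tempfile.TemporaryDirectory() as tmp:
    template = os.path.join(tmp, "template.json")
    with open(template, "w") as fh:
        json.dump({"wf.output_basename": "test", "wf.cpu": 1}, fh)
    new_config = write_config(
        template,
        os.path.join(tmp, "uhs1234_afr.json"),
        {"wf.output_basename": "uhs1234_afr"},
    )
print(new_config)
```

Editing by hand works equally well; the scripted route mainly helps when the same overrides must be applied to both ancestry groups.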

In [None]:
cd /shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/
cp /shared/biocloud_gwas_workflows/genotype_array_qc/sex_check/test/test_sex_check.json sex_check/uhs1234_afr_sex_check_wf.json
cp /shared/biocloud_gwas_workflows/genotype_array_qc/relatedness/test_relatedness_wf.json relatedness/uhs1234_afr_relatedness_wf.json
cp /shared/biocloud_gwas_workflows/genotype_array_qc/sex_check/test/test_sex_check.json sex_check/uhs1234_eur_sex_check_wf.json
cp /shared/biocloud_gwas_workflows/genotype_array_qc/relatedness/test_relatedness_wf.json relatedness/uhs1234_eur_relatedness_wf.json

# Edit wf config files

## Zip biocloud_gwas_workflows repo

In [None]:
cd /shared/biocloud_gwas_workflows
git pull
git submodule update --init --recursive
git rev-parse HEAD > /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/wf_input/git_hash.txt
cd /shared
zip \
    --exclude=*/var/* \
    --exclude=*.git/* \
    --exclude=*/test/* \
    --exclude=*/.idea/* \
    -r /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/wf_input/biocloud_gwas_workflows.zip \
    biocloud_gwas_workflows/

cd /shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/

## Submit jobs

In [None]:
# Open session in terminal 1
ssh -i ~/.ssh/gwas_rsa -L localhost:8000:localhost:8000 ec2-user@54.174.185.7

# Submit jobs in terminal 2

# AFR relatedness
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/relatedness/relatedness_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/relatedness/uhs1234_afr_relatedness_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/wf_input/biocloud_gwas_workflows.zip"
echo ""


# AFR sex_check
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/sex_check/sex_check_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/sex_check/uhs1234_afr_sex_check_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/wf_input/biocloud_gwas_workflows.zip"


# EUR relatedness
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/relatedness/relatedness_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/relatedness/uhs1234_eur_relatedness_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/wf_input/biocloud_gwas_workflows.zip"
echo ""

# EUR sex_check
curl -X POST "http://localhost:8000/api/workflows/v1" -H "accept: application/json" \
    -F "workflowSource=@/shared/biocloud_gwas_workflows/genotype_array_qc/sex_check/sex_check_wf.wdl" \
    -F "workflowInputs=@/shared/bioinformatics/methods/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/sex_check/uhs1234_eur_sex_check_wf.json" \
    -F "workflowDependencies=@/shared/rti-shared/shared_data/post_qc/uhs1234/genotype/array/observed/0005/wf_input/biocloud_gwas_workflows.zip"


afr_relate=33d6b7eb-0999-44a5-a314-b5663209253a
afr_sex=e4832d20-cb76-4c5a-a87f-1398553db011

eur_relate=d967e853-635b-4805-942e-f6db877def34
eur_sex=2c607c54-ac9a-4897-be14-2ad189f01804

# Monitor job in terminal 1
#tail -f /tmp/cromwell-server.log

# check job status in terminal 2
for job in {$afr_relate,$afr_sex,$eur_relate,$eur_sex}; do
    curl -X GET "http://localhost:8000/api/workflows/v1/${job}/status"   
    echo ""
done


```
{"status":"Succeeded","id":"33d6b7eb-0999-44a5-a314-b5663209253a"}
{"status":"Succeeded","id":"e4832d20-cb76-4c5a-a87f-1398553db011"}
{"status":"Succeeded","id":"d967e853-635b-4805-942e-f6db877def34"}
{"status":"Succeeded","id":"2c607c54-ac9a-4897-be14-2ad189f01804"}
```

In [None]:
# Download output from Swagger UI.
curl -X GET "http://localhost:8000/api/workflows/v1/${afr_relate}/outputs" -H "accept: application/json" \
  > ${afr_relate}_final_outputs.json
curl -X GET "http://localhost:8000/api/workflows/v1/${eur_relate}/outputs" -H "accept: application/json" \
  > ${eur_relate}_final_outputs.json


curl -X GET "http://localhost:8000/api/workflows/v1/${afr_sex}/outputs" -H "accept: application/json" \
  > ${afr_sex}_final_outputs.json
curl -X GET "http://localhost:8000/api/workflows/v1/${eur_sex}/outputs" -H "accept: application/json" \
  > ${eur_sex}_final_outputs.json

## QC Results
Examine the results of the UHS1–4 automated relatedness & sex-check workflows. The input was the combined UHS1–4 data after removing the duplicates, 1st-degree relatives, and sex-discrepant samples.

### Relatedness

#### AFR

Remove the residual UHS2_AFR 1st-degree relative:
- 8002221665_HHG7462_AS96-00773
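
The kinship estimates for this pair, reported in the cells below, sit close to a class boundary, which may explain why it was not flagged in the previous run. The sketch below assumes the workflow's annotation follows the inference ranges given in the KING documentation (>0.354 duplicate/MZ twin, 0.177–0.354 1st-degree, 0.0884–0.177 2nd-degree, 0.0442–0.0884 3rd-degree); the `classify_kinship` helper and its labels are illustrative, not the workflow's code.

```python
def classify_kinship(kinship):
    """Map a KING kinship coefficient to a relationship class."""
    if kinship > 0.354:
        return "duplicate_or_MZ_twin"
    if kinship > 0.177:
        return "1st_degree"
    if kinship > 0.0884:
        return "2nd_degree"
    if kinship > 0.0442:
        return "3rd_degree"
    return "unrelated"


# Kinship values for the same pair in the two workflow runs.
for label, kinship in [("current run", 0.1795), ("previous run", 0.173)]:
    print(label, kinship, classify_kinship(kinship))
```

Under these ranges 0.1795 lands just above the 0.177 cutoff and 0.173 just below it, matching the classifications reported by the two runs.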

In [None]:
### Results from current QC wf

## download results 
curl -X GET "http://localhost:8000/api/workflows/v1/${afr_relate}/outputs" -H "accept: application/json"   >\
  ${afr_relate}_final_outputs.json
aws s3 cp s3://rti-cromwell-output/cromwell-execution/relatedness_wf/33d6b7eb-0999-44a5-a314-b5663209253a/call-final_get_relateds/KING.king_kinship_wf/40dad790-2b20-4a8b-8396-09db2c3e3867/call-prune_related_samples/cacheCopy/uhs1234_afr_relatedness.final.king.pruned.annotated.k0 .

## determine if there are any residual twins/duplicates or first-degree relatives
grep "MZ_twin" uhs1234_afr_relatedness.final.king.pruned.annotated.k0               
grep "1st_degree" uhs1234_afr_relatedness.final.king.pruned.annotated.k0
#FID1    ID1     FID2    ID2     N_SNP   HetHet  IBS0    Kinship Classification
#1621    805550@1054754025       2314    8002221665_HHG7462_AS96-00773   93884   0.1823  0.0249  0.1795  1st_degree_relative



In [None]:
### from previous QC wf

## 22794ea1-ebd0-4f17-af93-9044d925d49d
head -1 uhs1234_afr_relatedness.final.king.pruned.annotated.k0; grep -E "805550@1054754025.*8002221665_HHG7462_AS96-00773" uhs1234_afr_relatedness.final.king.pruned.annotated.k0
#FID1	ID1	FID2	ID2	N_SNP	HetHet	IBS0	Kinship	Classification
#1622	805550@1054754025	2409	8002221665_HHG7462_AS96-00773	94048	0.1795	0.0257	0.173	2nd_degree_relative


#### EUR
No residual relatives.

In [None]:
## download results
curl -X GET "http://localhost:8000/api/workflows/v1/${eur_relate}/outputs" -H "accept: application/json"   >\
  ${eur_relate}_final_outputs.json

aws s3 cp s3://rti-cromwell-output/cromwell-execution/relatedness_wf/d967e853-635b-4805-942e-f6db877def34/call-final_get_relateds/KING.king_kinship_wf/3ab8c6f5-bfe0-46bb-8694-2bcfa7378699/call-prune_related_samples/cacheCopy/uhs1234_eur_relatedness.final.king.pruned.annotated.k0 .

## determine if there are any residual twins/duplicates or first-degree relatives
grep "MZ_twin" uhs1234_eur_relatedness.final.king.pruned.annotated.k0
grep "1st_degree" uhs1234_eur_relatedness.final.king.pruned.annotated.k0

### Sex-check

#### AFR
Remove the two residual UHS2_AFR sex discrepant samples:
- 8002697189_HHG0924_AS00-11262
- 8002690857_HHG1120_AS01-03526
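
Both samples have F estimates that hover around 0.2, as the cells below show. The sketch assumes the workflow uses the PLINK 1.9 `--check-sex` defaults (F below 0.2 is called female, F above 0.8 is called male, anything in between is ambiguous); the workflow's actual thresholds are not shown in this notebook, and the helper names are hypothetical.

```python
def snp_sex(f_stat, female_max=0.2, male_min=0.8):
    """Return PLINK sex coding: 1 = male, 2 = female, 0 = ambiguous."""
    if f_stat < female_max:
        return 2
    if f_stat > male_min:
        return 1
    return 0


def status(pedsex, f_stat):
    """Simplified check: OK only if the reported sex is known and matches."""
    return "OK" if pedsex != 0 and snp_sex(f_stat) == pedsex else "PROBLEM"


# F estimates for the two samples: current run first, then previous run.
for f_stat in (0.2012, 0.2031, 0.1997, 0.1987):
    print(f_stat, snp_sex(f_stat), status(2, f_stat))
```

With these defaults, the current-run values just above 0.2 give an ambiguous call and a PROBLEM status, while the previous-run values just below 0.2 give a female call and pass.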

In [None]:
### Results from current QC wf

## e4832d20-cb76-4c5a-a87f-1398553db011
cat uhs1234_afr_sex_check.sexcheck.sexcheck.problems.tsv
#FID     IID     PEDSEX  SNPSEX  STATUS  F
#8002690857_HHG1120_AS01-03526   8002690857_HHG1120_AS01-03526   2       0       PROBLEM 0.2012
#8002697189_HHG0924_AS00-11262   8002697189_HHG0924_AS00-11262   2       0       PROBLEM 0.2031

In [None]:
### from previous QC wf

## ab5e5132-94f5-48ab-affd-6e90d7c3160d
grep  "8002690857_HHG1120_AS01-03526\|8002697189_HHG0924_AS00-11262" uhs1234_afr_sex_check.sexcheck.sexcheck.all.tsv
#8002690857_HHG1120_AS01-03526	8002690857_HHG1120_AS01-03526	2	2	OK	0.1997
#8002697189_HHG0924_AS00-11262	8002697189_HHG0924_AS00-11262	2	2	OK	0.1987 

## which wave were they from?
grep "8002690857_HHG1120_AS01-03526" uhs2/uhs2.aa.fam
#8002690857_HHG1120_AS01-03526 8002690857_HHG1120_AS01-03526 0 0 0 -9
grep "8002697189_HHG0924_AS00-11262" uhs2/uhs2.aa.fam
#8002697189_HHG0924_AS00-11262 8002697189_HHG0924_AS00-11262 0 0 0 -9

#### EUR
Remove the residual UHS4_EUR sex discrepant sample:
- AS95-02724_8002161389_HHG0159_18_G12

In [None]:
### Results from current QC wf

## 2c607c54-ac9a-4897-be14-2ad189f01804
cat uhs1234_eur_sex_check.sexcheck.sexcheck.problems.tsv
#FID     IID     PEDSEX  SNPSEX  STATUS  F
#AS95-02724_8002161389_HHG0159_18_G12    AS95-02724_8002161389_HHG0159_18_G12    2       0       PROBLEM 0.2022

In [None]:
### from previous QC wf

## 96c0221f-98f2-4681-b500-528dfadeb050
grep AS95-02724_8002161389_HHG0159_18_G12 uhs1234_eur_sex_check.sexcheck.sexcheck.all.tsv
#AS95-02724_8002161389_HHG0159_18_G12	AS95-02724_8002161389_HHG0159_18_G12	2	2	OK	0.1987


## which wave was this sample from?
grep "AS95-02724_8002161389_HHG0159_18_G12" uhs4/uhs4.merged2.ea.fam
#AS95-02724_8002161389_HHG0159_18_G12 AS95-02724_8002161389_HHG0159_18_G12 0 0 2 -9

## Make final files

### Genotypes
Note that there were some samples marked to be removed for multiple reasons. For example, the UHS2_AFR sample `8002022836_HHG9308_AS88-5991` was a duplicate of a UHS1 sample, and this particular UHS2_AFR sample was also a sex-discrepant sample.
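
To see which samples are flagged for more than one reason, the remove lists can be inverted into a per-sample set of reasons. This is an illustrative sketch with toy IDs and a hypothetical `reasons_by_sample` helper, not part of the original workflow.

```python
from collections import defaultdict


def reasons_by_sample(remove_lists):
    """Invert {reason: [sample IDs]} into {sample ID: {reasons}}."""
    reasons = defaultdict(set)
    for reason, ids in remove_lists.items():
        for sample_id in ids:
            reasons[sample_id].add(reason)
    return reasons


# Toy IDs: "S2" is both a duplicate and a sex-discrepant sample.
remove_lists = {
    "duplicate": ["S1", "S2"],
    "1st_degree": ["S3"],
    "sexcheck": ["S2", "S4"],
}
reasons = reasons_by_sample(remove_lists)
multi = {s: sorted(r) for s, r in reasons.items() if len(r) > 1}
print(multi)  # {'S2': ['duplicate', 'sexcheck']}
```

Such samples contribute more than one line to a concatenated remove file, so the final remove count should be taken after de-duplication.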

#### UHS1

In [None]:
cd create_final_dbgap_files/uhs1

# Download the Plink fileset
mkdir eur afr
cd eur
for ext in {bed,bim,fam}; do
  file=s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.ea.$ext.gz
  aws s3 cp $file .
done
gunzip *

cd ../afr/
for ext in {bed,bim,fam}; do
  file=s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs1.aa.$ext.gz
  aws s3 cp $file .
done
gunzip *

In [None]:
## Download remove files
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs1/afr/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs1_aa_sexcheck_remove.txt .

## Note that this is the same sample we were to remove based on this GitHub comment:
## https://github.com/RTIInternational/bioinformatics/issues/125#issuecomment-525743686
awk '{print $1,$1}' uhs1_aa_sexcheck_remove.txt > uhs1_afr_final_dbgap_remove_file.txt

wc -l uhs1_afr_final_dbgap_remove_file.txt
# 1

## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs1.aa \
	--remove /data/uhs1_afr_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs1_afr_dbgap_ready

wc -l *fam
#    2016 uhs1.aa.fam
#    2015 uhs1_afr_dbgap_ready.fam        
        
        

cd ../eur/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs1_ea_sexcheck_remove.txt .

## Note that these are the same samples we were to remove based on this GitHub comment:
## https://github.com/RTIInternational/bioinformatics/issues/125#issuecomment-525743686
awk '{print $1,$1}' uhs1_ea_sexcheck_remove.txt > uhs1_eur_final_dbgap_remove_file.txt

sort -u uhs1_eur_final_dbgap_remove_file.txt  | wc -l 
# 2

## No other remove samples to consider.
## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs1.ea \
	--remove /data/uhs1_eur_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs1_eur_dbgap_ready
        
wc -l *fam
#    1142 uhs1.ea.fam
#    1140 uhs1_eur_dbgap_ready.fam

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs1/
# sync version with others
# s3://rti-shared/shared_data/post_qc/uhs1/genotype/array/observed/

#### UHS2

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs2/

# Download the Plink fileset
mkdir eur afr
cd eur/
for ext in {bed,bim,fam}; do
  file=s3://rti-shared/shared_data/post_qc/uhs2/genotype/array/observed/0001/eur/uhs2_eur.$ext.gz
  aws s3 cp $file .
done
gunzip *

cd ../afr/
for ext in {bed,bim,fam}; do
  file=s3://rti-shared/shared_data/post_qc/uhs2/genotype/array/observed/0001/afr/uhs2_afr.$ext.gz
  aws s3 cp $file .
done
gunzip *

In [None]:
## Download remove files
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs2/afr/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs2_aa_duplicates_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs2_aa_1st_degrees_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs2_aa_sexcheck_remove.txt .

awk '{print $1,$1}' uhs2_aa_duplicates_to_remove.txt > uhs2_afr_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs2_aa_1st_degrees_to_remove.txt >> uhs2_afr_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs2_aa_sexcheck_remove.txt >> uhs2_afr_final_dbgap_remove_file.txt

## add the one residual 1st-degree relative and 2 sex discrepant samples
echo -e "8002221665_HHG7462_AS96-00773 8002221665_HHG7462_AS96-00773" >> uhs2_afr_final_dbgap_remove_file.txt
echo -e "8002697189_HHG0924_AS00-11262 8002697189_HHG0924_AS00-11262" >>  uhs2_afr_final_dbgap_remove_file.txt
echo -e "8002690857_HHG1120_AS01-03526 8002690857_HHG1120_AS01-03526" >> uhs2_afr_final_dbgap_remove_file.txt

wc -l *txt
#      44 uhs2_aa_1st_degrees_to_remove.txt
#     118 uhs2_aa_duplicates_to_remove.txt
#      13 uhs2_aa_sexcheck_remove.txt
#     178 uhs2_afr_final_dbgap_remove_file.txt

# Final number of samples to remove (remove duplicates)
sort -u uhs2_afr_final_dbgap_remove_file.txt | wc -l
#     172

# determine which samples had multiple discrepancies
grep -f uhs2_aa_sexcheck_remove.txt uhs2_aa_1st_degrees_to_remove.txt
#8002220224_HHG5779_AS87-2842	8002220224_HHG5779_AS87-2842
grep -f uhs2_aa_1st_degrees_to_remove.txt uhs2_aa_duplicates_to_remove.txt
grep -f uhs2_aa_sexcheck_remove.txt uhs2_aa_duplicates_to_remove.txt
#8002022836_HHG9308_AS88-5991	8002022836_HHG9308_AS88-5991
#8002697175_HHG0897_AS00-10209	8002697175_HHG0897_AS00-10209
#8002689503_HHG3705_AS02-03196	8002689503_HHG3705_AS02-03196
#8002694995_HHG0641_AS00-01871	8002694995_HHG0641_AS00-01871
#8002694999_HHG0539_AS99-11829	8002694999_HHG0539_AS99-11829

### Final count then is
### Sex issues 9 (=13-1-5+2), 1st degree 45 (=44+1), and duplicates 118.
### This count includes the 3 residuals added above.

## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs2_afr \
	--remove /data/uhs2_afr_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs2_afr_dbgap_ready

wc -l *fam
# 767 uhs2_afr.fam
# 595 uhs2_afr_dbgap_ready.fam






cd ../eur/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs2_ea_duplicates_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs2_ea_1st_degrees_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs2_ea_sexcheck_remove.txt .

awk '{print $1,$1}' uhs2_ea_duplicates_to_remove.txt > uhs2_eur_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs2_ea_1st_degrees_to_remove.txt >> uhs2_eur_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs2_ea_sexcheck_remove.txt >> uhs2_eur_final_dbgap_remove_file.txt

wc -l uhs2_eur_final_dbgap_remove_file.txt 
# 74 
sort -u uhs2_eur_final_dbgap_remove_file.txt | wc -l 
# 74

sort -u uhs2_ea_duplicates_to_remove.txt | wc -l
# 52 

sort -u uhs2_ea_1st_degrees_to_remove.txt | wc -l
# 8 

sort -u uhs2_ea_sexcheck_remove.txt | wc -l
# 14

### determine which samples had multiple discrepancies
grep -f uhs2_ea_sexcheck_remove.txt uhs2_ea_1st_degrees_to_remove.txt
grep -f uhs2_ea_sexcheck_remove.txt  uhs2_ea_duplicates_to_remove.txt
cut -f1 uhs2_ea_1st_degrees_to_remove.txt | xargs -I{} grep {} uhs2_ea_duplicates_to_remove.txt  | sort -u


### Final count then is
### Sex issues 14, 1st degree 8, and duplicates 52.

wc -l uhs2_eur_final_dbgap_remove_file.txt
# 74

sort uhs2_eur_final_dbgap_remove_file.txt | uniq | wc -l
# 74

## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs2_eur \
	--remove /data/uhs2_eur_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs2_eur_dbgap_ready

wc -l *fam
# 828 uhs2_eur.fam
# 754 uhs2_eur_dbgap_ready.fam
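
The per-category counts above (for example "Sex issues 9 (=13-1-5+2)") were reconciled by hand from the `grep -f` overlaps. A minimal sketch of the same bookkeeping follows, assuming a priority order of duplicates, then 1st-degree relatives, then sex discrepancies, which is the order the hand counts appear to follow. File names and sample IDs are made up for illustration, and samples are keyed on the first column only (FID equals IID in these remove files).

```shell
#!/usr/bin/env bash
# Toy illustration: the file names and sample IDs below are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' "A A" "B B" "C C" > "$workdir/duplicates.txt"
printf '%s\n' "C C" "D D"       > "$workdir/first_degree.txt"
printf '%s\n' "A A" "D D" "E E" > "$workdir/sexcheck.txt"

# Assign each sample to the first list (in argument order) that mentions it,
# then print one count per list, in the same order, on a single line.
tally_by_priority() {
    awk '!($1 in owner) { owner[$1] = FILENAME; count[FILENAME]++ }
         END {
             for (i = 1; i < ARGC; i++)
                 printf "%s%d", (i > 1 ? " " : ""), count[ARGV[i]]
             print ""
         }' "$@"
}

tally=$(tally_by_priority "$workdir/duplicates.txt" \
                          "$workdir/first_degree.txt" \
                          "$workdir/sexcheck.txt")
echo "$tally"
```

The counts sum to the number of unique samples across all lists, so they can be compared directly with the `sort -u ... | wc -l` total of the final remove file.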

#### UHS3_1-2
Note that the UHS3 lists of 1st-degree relatives and duplicates were combined across chip versions (v1-2 and v1-3), so each remove list contains more IDs than are actually removed from a given fileset.
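
As a hedged illustration of that point, the sketch below counts how many entries of a combined remove list are actually present in one fileset's `.fam`. It matches on the exact FID/IID pair with `awk`, which avoids the substring hits that a plain `grep` can produce. The file names and sample IDs are hypothetical toy data, not the UHS3 files.

```shell
#!/usr/bin/env bash
# Toy illustration: the file names and sample IDs below are hypothetical.
workdir=$(mktemp -d)

# A combined remove list (FID IID) covering both chip versions; S2 is listed twice.
printf '%s\n' "S1 S1" "S2 S2" "S3 S3" "S2 S2" > "$workdir/remove.txt"

# A tab-delimited .fam for one chip version; only S2 and S3 occur in it.
printf '%s\t%s\t0\t0\t0\t-9\n' S2 S2 S3 S3 S4 S4 > "$workdir/toy.fam"

# usage: count_applicable <fam> <remove_list>
# Prints the number of distinct remove-list entries whose FID and IID
# both match a row of the .fam.
count_applicable() {
    awk 'FNR == NR { seen[$1 FS $2] = 1; next }
         ($1 FS $2) in seen && !counted[$1 FS $2]++ { n++ }
         END { print n + 0 }' "$1" "$2"
}

n_applicable=$(count_applicable "$workdir/toy.fam" "$workdir/remove.txt")
echo "$n_applicable"
```

The printed number is the count to compare against the difference between the input and output `.fam` line counts.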

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs3_1-2/

# Download the Plink fileset
mkdir eur afr
cd eur/
for ext in {bed,bim,fam}; do
  file=s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-2.$ext.gz
  aws s3 cp $file .
done
gunzip *

cd ../afr/
for ext in {bed,bim,fam}; do
  file=s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-2.$ext.gz
  aws s3 cp $file .
done
gunzip *


In [None]:
## Download remove files
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs3_1-2/afr/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs3_aa_duplicates_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs3_aa_1st_degrees_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs3_1-2_aa_sexcheck_remove.txt .

awk '{print $1,$1}' uhs3_aa_duplicates_to_remove.txt > uhs3_1-2_afr_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs3_aa_1st_degrees_to_remove.txt >> uhs3_1-2_afr_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs3_1-2_aa_sexcheck_remove.txt >> uhs3_1-2_afr_final_dbgap_remove_file.txt

wc -l uhs3_1-2_afr_final_dbgap_remove_file.txt 
# 84 
sort -u uhs3_1-2_afr_final_dbgap_remove_file.txt | wc -l 
# 69 

# actual # of samples to remove
sort -u uhs3_1-2_afr_final_dbgap_remove_file.txt | xargs -I{} grep {} uhs3.*fam | wc -l
# 33 

cut -f1 uhs3_aa_duplicates_to_remove.txt | xargs -I{} grep {} uhs3.*fam | awk '{print $1,$2}'  > uhs3_1-2_aa_duplicates_to_remove.txt
sort -u uhs3_1-2_aa_duplicates_to_remove.txt | wc -l
#      28

cut -f1 uhs3_aa_1st_degrees_to_remove.txt | xargs -I{} grep {} uhs3.*fam | awk '{print $1,$2}'  > uhs3_1-2_aa_1st_degrees_to_remove.txt
sort -u uhs3_1-2_aa_1st_degrees_to_remove.txt | wc -l
#      5 

sort -u uhs3_1-2_aa_sexcheck_remove.txt | wc -l
#     2

### determine which samples had multiple discrepancies
grep -f uhs3_1-2_aa_sexcheck_remove.txt uhs3_1-2_aa_1st_degrees_to_remove.txt
grep -f uhs3_1-2_aa_sexcheck_remove.txt  uhs3_1-2_aa_duplicates_to_remove.txt
grep -f uhs3_1-2_aa_1st_degrees_to_remove.txt uhs3_1-2_aa_duplicates_to_remove.txt 
# 8002022049_HHG8460_AS90-6445 8002022049_HHG8460_AS90-6445
# 8002690630_HHG1800_AS88-5431 8002690630_HHG1800_AS88-5431

### Final count then is
### Sex issues 2, 1st degree  3 (=5-2), and duplicates 28.

## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs3.aa.V1-2 \
	--remove /data/uhs3_1-2_afr_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs3_v1-2_afr_dbgap_ready
        
wc -l *fam
# 84 uhs3.aa.V1-2.fam
# 51 uhs3_v1-2_afr_dbgap_ready.fam






## Download remove files
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs3_1-2/eur/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs3_ea_duplicates_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs3_ea_1st_degrees_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs3_1-2_ea_sexcheck_remove.txt .

awk '{print $1,$1}' uhs3_ea_duplicates_to_remove.txt > uhs3_1-2_eur_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs3_ea_1st_degrees_to_remove.txt >> uhs3_1-2_eur_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs3_1-2_ea_sexcheck_remove.txt >> uhs3_1-2_eur_final_dbgap_remove_file.txt


wc -l uhs3_1-2_eur_final_dbgap_remove_file.txt 
# 18 
sort -u uhs3_1-2_eur_final_dbgap_remove_file.txt | wc -l 
# 18

# actual # of samples to remove
sort -u uhs3_1-2_eur_final_dbgap_remove_file.txt | xargs -I{} grep {} uhs3.*fam | wc -l
# 9 

cut -f1 uhs3_ea_duplicates_to_remove.txt | xargs -I{} grep {} uhs3.*fam | awk '{print $1,$2}'  > uhs3_1-2_ea_duplicates_to_remove.txt
sort -u uhs3_1-2_ea_duplicates_to_remove.txt | wc -l
#      5

cut -f1 uhs3_ea_1st_degrees_to_remove.txt | xargs -I{} grep {} uhs3.*fam | awk '{print $1,$2}'  > uhs3_1-2_ea_1st_degrees_to_remove.txt
sort -u uhs3_1-2_ea_1st_degrees_to_remove.txt | wc -l
#      1 

sort -u uhs3_1-2_ea_sexcheck_remove.txt | wc -l
#     3

### determine which samples had multiple discrepancies
grep -f uhs3_1-2_ea_sexcheck_remove.txt uhs3_1-2_ea_1st_degrees_to_remove.txt
grep -f uhs3_1-2_ea_sexcheck_remove.txt  uhs3_1-2_ea_duplicates_to_remove.txt
grep -f uhs3_1-2_ea_1st_degrees_to_remove.txt uhs3_1-2_ea_duplicates_to_remove.txt 

### Final count then is
### Sex issues 3, 1st degree  1, and duplicates 5.


## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs3.ea.V1-2 \
	--remove /data/uhs3_1-2_eur_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs3_v1-2_eur_dbgap_ready

wc -l *fam
# 33 uhs3.ea.V1-2.fam
# 24 uhs3_v1-2_eur_dbgap_ready.fam

#### UHS3_1-3

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs3_1-3/

# Download the Plink fileset
mkdir eur afr
cd eur/
for ext in {bed,bim,fam}; do
  file=s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.ea.V1-3.$ext.gz
  aws s3 cp $file .
done
gunzip *

cd ../afr/
for ext in {bed,bim,fam}; do
  file=s3://rti-hiv/rti-midas-data/studies/hiv/observed/final/uhs3.aa.V1-3.$ext.gz
  aws s3 cp $file .
done
gunzip *

In [None]:
## Download remove files
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs3_1-3/afr/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs3_aa_duplicates_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs3_aa_1st_degrees_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs3_1-3_aa_sexcheck_remove.txt .

awk '{print $1,$1}' uhs3_aa_duplicates_to_remove.txt > uhs3_1-3_afr_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs3_aa_1st_degrees_to_remove.txt >> uhs3_1-3_afr_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs3_1-3_aa_sexcheck_remove.txt >> uhs3_1-3_afr_final_dbgap_remove_file.txt


wc -l uhs3_1-3_afr_final_dbgap_remove_file.txt 
# 88
sort -u uhs3_1-3_afr_final_dbgap_remove_file.txt | wc -l 
# 73 

# There are 40 samples to remove
sort -u uhs3_1-3_afr_final_dbgap_remove_file.txt | xargs -I{} grep {} uhs3.*fam | wc -l
# 40 

cut -f1 uhs3_aa_duplicates_to_remove.txt | xargs -I{} grep {} uhs3.*fam | awk '{print $1,$2}'  > uhs3_1-3_aa_duplicates_to_remove.txt
sort -u uhs3_1-3_aa_duplicates_to_remove.txt | wc -l
#      29

cut -f1 uhs3_aa_1st_degrees_to_remove.txt | xargs -I{} grep {} uhs3.*fam | awk '{print $1,$2}'  > uhs3_1-3_aa_1st_degrees_to_remove.txt
sort -u uhs3_1-3_aa_1st_degrees_to_remove.txt | wc -l
#      9 

sort -u uhs3_1-3_aa_sexcheck_remove.txt | wc -l
#     6

### determine which samples had multiple discrepancies
grep -f uhs3_1-3_aa_sexcheck_remove.txt uhs3_1-3_aa_1st_degrees_to_remove.txt
grep -f uhs3_1-3_aa_sexcheck_remove.txt  uhs3_1-3_aa_duplicates_to_remove.txt
#8002221030_HHG6599_AS93-4798	8002221030_HHG6599_AS93-4798
#8002688215_HHG2990_AS96-04581	8002688215_HHG2990_AS96-04581
grep -f uhs3_1-3_aa_1st_degrees_to_remove.txt uhs3_1-3_aa_duplicates_to_remove.txt 
#8002022646_HHG8892_AS99-06074	8002022646_HHG8892_AS99-06074
#8002689386_HHG3930_AS89-1797	8002689386_HHG3930_AS89-1797

### Final count then is
### Sex issues 4 (=6-2), 1st degree 7 (=9-2), and duplicates 29.

## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs3.aa.V1-3 \
	--remove /data/uhs3_1-3_afr_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs3_v1-3_afr_dbgap_ready

wc -l *fam
#      94 uhs3.aa.V1-3.fam
#      54 uhs3_v1-3_afr_dbgap_ready.fam





## Download remove files
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs3_1-3/eur/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs3_ea_duplicates_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs3_ea_1st_degrees_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs3_1-3_ea_sexcheck_remove.txt .

awk '{print $1,$1}' uhs3_ea_duplicates_to_remove.txt > uhs3_1-3_eur_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs3_ea_1st_degrees_to_remove.txt >> uhs3_1-3_eur_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs3_1-3_ea_sexcheck_remove.txt >> uhs3_1-3_eur_final_dbgap_remove_file.txt

wc -l uhs3_1-3_eur_final_dbgap_remove_file.txt 
# 20
sort -u uhs3_1-3_eur_final_dbgap_remove_file.txt | wc -l 
# 19 

# There are 13 samples to remove
sort -u uhs3_1-3_eur_final_dbgap_remove_file.txt | xargs -I{} grep {} uhs3.*fam | wc -l
# 13 

cut -f1 uhs3_ea_duplicates_to_remove.txt | xargs -I{} grep {} uhs3.*fam | awk '{print $1,$2}'  > uhs3_1-3_ea_duplicates_to_remove.txt
sort -u uhs3_1-3_ea_duplicates_to_remove.txt | wc -l
#      9

cut -f1 uhs3_ea_1st_degrees_to_remove.txt | xargs -I{} grep {} uhs3.*fam | awk '{print $1,$2}'  > uhs3_1-3_ea_1st_degrees_to_remove.txt
sort -u uhs3_1-3_ea_1st_degrees_to_remove.txt | wc -l
#      0 

sort -u uhs3_1-3_ea_sexcheck_remove.txt | wc -l
#     5

### determine which samples had multiple discrepancies
grep -f uhs3_1-3_ea_sexcheck_remove.txt uhs3_1-3_ea_1st_degrees_to_remove.txt
grep -f uhs3_1-3_ea_sexcheck_remove.txt  uhs3_1-3_ea_duplicates_to_remove.txt
# 8002221088_HHG7296_AS89-2397 8002221088_HHG7296_AS89-2397
grep -f uhs3_1-3_ea_1st_degrees_to_remove.txt uhs3_1-3_ea_duplicates_to_remove.txt 

### Final count then is
### Sex issues 4 (=5-1), 1st degree 0, and duplicates 9.

## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs3.ea.V1-3 \
	--remove /data/uhs3_1-3_eur_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs3_v1-3_eur_dbgap_ready

wc -l *fam
# 44 uhs3.ea.V1-3.fam
# 31 uhs3_v1-3_eur_dbgap_ready.fam

#### UHS4

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs4/

# Download the Plink fileset
mkdir eur afr
cd eur/
for ext in {bed,bim,fam}; do
  file=s3://rti-shared/rti-midas-data/studies/uhs4/observed/genotypes/final/uhs4.merged2.ea.$ext.gz
  aws s3 cp $file .
done
gunzip *

cd ../afr/
for ext in {bed,bim,fam}; do
  file=s3://rti-shared/rti-midas-data/studies/uhs4/observed/genotypes/final/uhs4.merged2.aa.$ext.gz
  aws s3 cp $file .
done
gunzip *

In [None]:
## Download remove files
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs4/afr/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs4_aa_duplicates_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs4_aa_1st_degrees_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/afr/remove_sample_files/uhs4_aa_sexcheck_remove.txt .

awk '{print $1,$1}' uhs4_aa_duplicates_to_remove.txt > uhs4_afr_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs4_aa_1st_degrees_to_remove.txt >> uhs4_afr_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs4_aa_sexcheck_remove.txt >> uhs4_afr_final_dbgap_remove_file.txt

wc -l uhs4_afr_final_dbgap_remove_file.txt 
# 386 
sort -u uhs4_afr_final_dbgap_remove_file.txt | wc -l 
# 337

sort -u uhs4_aa_duplicates_to_remove.txt | wc -l
#      259

sort -u uhs4_aa_1st_degrees_to_remove.txt | wc -l
#      79 

sort -u uhs4_aa_sexcheck_remove.txt | wc -l
#     9

### determine which samples had multiple discrepancies
grep -f uhs4_aa_sexcheck_remove.txt uhs4_aa_1st_degrees_to_remove.txt
grep -f uhs4_aa_sexcheck_remove.txt  uhs4_aa_duplicates_to_remove.txt
cut -f1 uhs4_aa_1st_degrees_to_remove.txt | xargs -I{} grep {} uhs4_aa_duplicates_to_remove.txt  | sort -u  | wc -l
# 10

### Final count then is
### Sex issues 9, 1st degree 69, and duplicates 259.

## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs4.merged2.aa \
	--remove /data/uhs4_afr_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs4_afr_dbgap_ready

wc -l *fam
# 1072 uhs4.merged2.aa.fam
# 735 uhs4_afr_dbgap_ready.fam




cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs4/eur/
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs4_ea_duplicates_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs4_ea_1st_degrees_to_remove.txt .
aws s3 cp s3://rti-shared/scratch/shared_data/post_qc/uhs1234/genotype/array/observed/0001/eur/remove_sample_files/uhs4_ea_sexcheck_remove.txt .

awk '{print $1,$1}' uhs4_ea_duplicates_to_remove.txt > uhs4_eur_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs4_ea_1st_degrees_to_remove.txt >> uhs4_eur_final_dbgap_remove_file.txt
awk '{print $1,$1}' uhs4_ea_sexcheck_remove.txt >> uhs4_eur_final_dbgap_remove_file.txt

## add the one residual sex discrepant sample
echo -e "AS95-02724_8002161389_HHG0159_18_G12 AS95-02724_8002161389_HHG0159_18_G12" >> uhs4_eur_final_dbgap_remove_file.txt

sort -u uhs4_eur_final_dbgap_remove_file.txt | wc -l
# 197

sort -u uhs4_ea_duplicates_to_remove.txt | wc -l
# 155

sort -u uhs4_ea_1st_degrees_to_remove.txt | wc -l
# 17 

sort -u uhs4_ea_sexcheck_remove.txt | wc -l
# 12

### determine which samples had multiple discrepancies
grep -f uhs4_ea_sexcheck_remove.txt uhs4_ea_1st_degrees_to_remove.txt | wc -l
# 1
grep -f uhs4_ea_sexcheck_remove.txt  uhs4_ea_duplicates_to_remove.txt | wc -l
# 1
cut -f1 uhs4_ea_1st_degrees_to_remove.txt | xargs -I{} grep {} uhs4_ea_duplicates_to_remove.txt  | sort -u  | wc -l
# 2

### Final count then is
### Sex issues 11 (=12-1-1+1), 1st degree 15 (=17-2), and duplicates 155.

## Use Plink to create final file set
docker run -v $PWD:/data/ rtibiocloud/plink:v2.0-4d3bad3 plink2 \
	--bfile /data/uhs4.merged2.ea \
	--remove /data/uhs4_eur_final_dbgap_remove_file.txt \
	--make-bed \
	--out /data/uhs4_eur_dbgap_ready

wc -l *fam
# 989 uhs4.merged2.ea.fam
# 808 uhs4_eur_dbgap_ready.fam

### Phenotype
Our starting point for this file will be the phenotype file most recently submitted to dbGaP, `dbGaP_phenotypeDS_20201130.txt`.

#### Create combined subject sample list

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/

## create combined subject sampid list
touch uhs_afr_combined_sample_ids.txt
touch uhs_eur_combined_sample_ids.txt

for an in {afr,eur}; do
    cut -f1 uhs1/$an/uhs1_${an}_dbgap_ready.fam >> uhs_${an}_combined_sample_ids.txt
    cut -f1 uhs2/$an/uhs2_${an}_dbgap_ready.fam >> uhs_${an}_combined_sample_ids.txt
    cut -f1 uhs3_1-2/$an/uhs3_v1-2_${an}_dbgap_ready.fam >> uhs_${an}_combined_sample_ids.txt
    cut -f1 uhs3_1-3/$an/uhs3_v1-3_${an}_dbgap_ready.fam >> uhs_${an}_combined_sample_ids.txt
    cut -f1 uhs4/$an/uhs4_${an}_dbgap_ready.fam >> uhs_${an}_combined_sample_ids.txt
done

cat uhs_afr_combined_sample_ids.txt uhs_eur_combined_sample_ids.txt \
    > uhs_cross_ancestry_combined_sample_ids.txt
wc -l *txt
#    3450 uhs_afr_combined_sample_ids.txt
#    6207 uhs_cross_ancestry_combined_sample_ids.txt
#    2757 uhs_eur_combined_sample_ids.txt

#### Create dbGaP_SubjectSampleMappingDS
The previous file submitted to dbGaP was `dbGaP_SubjectSampleMappingDS_20191202.txt`. We will create a new filtered version of this file.



In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files

# sym link file to current directory from previous dbGaP submission
ln -s ~/projects/hiv/gwas/uhs1234/dbgap_upload/final/004/dbGaP_SubjectSampleMappingDS_20191202.txt

head -4 dbGaP_SubjectSampleMappingDS_20191202.txt
#SUBJECT_ID	SAMPLE_ID
#HHG1579	109@1064714572
#HHG1561	202@1064714531
#HHG5943	245@1064714500

# convert tabs to single spaces (the output keeps a .tsv name but is space-delimited)
awk '$1=$1' dbGaP_SubjectSampleMappingDS_20191202.txt > dbGaP_SubjectSampleMappingDS_20191202.tsv

echo "SUBJECT_ID SAMPLE_ID" > dbGaP_SubjectSampleMappingDS_v01.txt

awk '
FNR==NR{ map[$2] = $0; next }
{ if (( $1 in map ))  print map[$1] } ' OFS=" " dbGaP_SubjectSampleMappingDS_20191202.tsv \
    uhs_cross_ancestry_combined_sample_ids.txt \
    >> dbGaP_SubjectSampleMappingDS_v01.txt
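
The filter above relies on the two-file `awk` idiom: while `FNR==NR` holds, `awk` is still reading the first file and stores each row keyed on its sample ID; rows of the second file are then looked up in that table. A small self-contained sketch with toy data (hypothetical file names and IDs) that writes matched and unmatched IDs in a single pass:

```shell
#!/usr/bin/env bash
# Toy illustration: the file names and IDs below are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' "SUBJ1 samp_a" "SUBJ2 samp_b" "SUBJ3 samp_c" > "$workdir/mapping.txt"
printf '%s\n' samp_b samp_x samp_a                         > "$workdir/keep_ids.txt"

awk -v matched="$workdir/matched.txt" -v missing="$workdir/missing.txt" '
    # First file: remember each mapping row under its sample ID (column 2).
    FNR == NR   { map[$2] = $0; next }
    # Second file: known IDs emit their mapping row ...
    ($1 in map) { print map[$1] > matched; next }
    # ... and unknown IDs are set aside for manual follow-up.
                { print $1 > missing }
' "$workdir/mapping.txt" "$workdir/keep_ids.txt"

cat "$workdir/matched.txt"
cat "$workdir/missing.txt"
```

Output rows follow the order of the second file, so the filtered mapping file inherits the order of the combined sample ID list rather than that of the original mapping file.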

In [None]:
awk '
FNR==NR{ map[$2] = $0; next }
{ if (!( $1 in map ))  print $1 } ' dbGaP_SubjectSampleMappingDS_20191202.tsv \
    uhs_cross_ancestry_combined_sample_ids.txt \
    > samples_to_add_manually.txt

# Here are those 20 samples. 
# Note that the cell_line & line number from master phenotype file (after the octothorpe) are appended here just for reference
"""
49282@1054752684 # HHG7167; 7473
119705@1054753355 #  HHG6464; 289
232828@1054755560 #  HHG3647; 4724
541308@1054753694 #  HHG8876; 6818
885463@1054755444 #  HHG7234; 7014
994455@1054755536 #  HHG4152; 1471
AS88-2170_8002023143_HHG10437_15_A01 #  HHG10437; 8986
631@1064714686 #  HHG0556;  2523
129734@1054697780 #  HHG5406; 5778
167887@1054752878 #  HHG8549; 10327
177455@1054753301 #  HHG1268; 3867
246540@1054754394 #  HHG8508; 10252
353343@1054752875 #  HHG5906;  8599
388179@1054755347 #  HHG6290; 2871
499992@1054753526 #  HHG9066; 1880
507306@1054735568 #  HHG1244; 3820
507342@1054752627 #  HHG0666; 5450
748007@1054754953 #  HHG2299;  8942
766342@1054753836 #  HHG7211; 203
819309@1054753189 #  HHG7612; 3101
"""

# use vim to create a file containing those 20 missing samples, so that their subject IDs & sample IDs can be appended to the mapping file
head -2 sampid_and_cell_line.txt
# HHG7167 49282@1054752684
# HHG6464 119705@1054753355

cat dbGaP_SubjectSampleMappingDS_v01.txt sampid_and_cell_line.txt \
	> dbGaP_SubjectSampleMappingDS_v02.txt

wc -l dbGaP_SubjectSampleMappingDS_v02.txt
#    6208 dbGaP_SubjectSampleMappingDS_v02.txt

I manually searched for these samples in the "Master Phenotype file", `hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv`.
I used Excel's search function on the "sampid" column, which matches the first portion of the sample ID. For example, for sample 49282@1054752684 I searched for 49282 within the sampid column and then recorded the line number and the ID found in the "cell_line" column.
For sample AS88-2170_8002023143_HHG10437_15_A01 I searched for AS88-2170 in the master phenotype file and found a match on line 8986. The sampid did not match, but the ciderid did.
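
The same lookup could probably be scripted rather than done in Excel. The sketch below is one way to do it with `awk`, under two assumptions that were not checked against the real file: that the CSV has no quoted fields containing commas, and that the header occupies a single line. The toy CSV, its values, and the helper name `lookup_cell_line` are hypothetical; only the column names `sampid` and `cell_line` come from the master phenotype file.

```shell
#!/usr/bin/env bash
# Toy illustration: this CSV and its values are hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/master.csv" <<'EOF'
personid,cell_line,sampid,ciderid
1,HHG0001,11111,AS00-00001
2,HHG0002,22222,AS00-00002
EOF

# usage: lookup_cell_line <sampid> <csv>
# Prints "<line number> <cell_line>" for every row whose sampid column
# equals the query. Columns are located by header name, not by position.
lookup_cell_line() {
    awk -F',' -v want="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $col["sampid"] == want { print NR, $col["cell_line"] }
    ' "$2"
}

# The part of the sample ID before "@" is what the sampid column holds.
query="22222@1054700000"
lookup_cell_line "${query%%@*}" "$workdir/master.csv"
```

Line numbers reported this way count the header as line 1, which matches how a spreadsheet numbers its rows.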

#### Create dbGaP_SampleAttributesDS

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files

# sym link file to current directory from previous dbGaP submission
ln -s ~/projects/hiv/gwas/uhs1234/dbgap_upload/final/004/dbGaP_SampleAttributesDS_20191202.txt

head dbGaP_SampleAttributesDS_20191202.txt
# SAMPLE_ID	BODY_SITE	ANALYTE_TYPE	HISTOLOGICAL_TYPE	IS_TUMOR
# 109@1064714572	Serum	DNA	Serum	N
# 202@1064714531	Serum	DNA	Serum	N

head dbGaP_SubjectSampleMappingDS_v02.txt
#SUBJECT_ID SAMPLE_ID
#HHG1579	109@1064714572
#HHG1561	202@1064714531

echo "SAMPLE_ID BODY_SITE ANALYTE_TYPE HISTOLOGICAL_TYPE IS_TUMOR" > dbGaP_SampleAttributesDS_v02.txt
awk '
{ print $2,"Serum","DNA","Serum", "N" }
' <(tail -n +2 dbGaP_SubjectSampleMappingDS_v02.txt) >> dbGaP_SampleAttributesDS_v02.txt

wc -l dbGaP_SampleAttributesDS_v02.txt
# 6208

#### Create dbGaP_SubjectConsentDS
The previous file submitted to dbGaP was `dbGaP_SubjectConsentDS_20191202.txt`. We will create a new filtered version of this file.

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files

# sym link file to current directory from previous dbGaP submission
ln -s ~/projects/hiv/gwas/uhs1234/dbgap_upload/final/004/dbGaP_SubjectConsentDS_20191202.txt

head dbGaP_SubjectConsentDS_20191202.txt
#SUBJECT_ID	CONSENT	SUBJECT_SOURCE	SOURCE_SUBJECT_ID
#HHG1579	1	NA	NA
#HHG1561	1	NA	NA

head dbGaP_SubjectSampleMappingDS_v02.txt
#SUBJECT_ID SAMPLE_ID
#HHG1579	109@1064714572
#HHG1561	202@1064714531

echo "SUBJECT_ID CONSENT SUBJECT_SOURCE SOURCE_SUBJECT_ID" > dbGaP_SubjectConsentDS_v02.txt
awk '
{ print $1,1,"NA","NA" }
' <(tail -n +2 dbGaP_SubjectSampleMappingDS_v02.txt) >> dbGaP_SubjectConsentDS_v02.txt

wc -l dbGaP_SubjectConsentDS_v02.txt
# 6208
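
Matching line counts are a useful first check, but they do not show that the three files list the same IDs. A minimal sketch of a stricter comparison, using toy files with the same headers as the files created above (the data rows, file names, and helper names are hypothetical):

```shell
#!/usr/bin/env bash
# Toy illustration: the data rows and file names below are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' "SUBJECT_ID SAMPLE_ID" "SUBJ1 samp_a" "SUBJ2 samp_b" \
    > "$workdir/mapping.txt"
printf '%s\n' "SAMPLE_ID BODY_SITE ANALYTE_TYPE HISTOLOGICAL_TYPE IS_TUMOR" \
    "samp_a Serum DNA Serum N" "samp_b Serum DNA Serum N" \
    > "$workdir/attributes.txt"
printf '%s\n' "SUBJECT_ID CONSENT SUBJECT_SOURCE SOURCE_SUBJECT_ID" \
    "SUBJ1 1 NA NA" "SUBJ2 1 NA NA" \
    > "$workdir/consent.txt"

# usage: ids_in_column <file> <column>
# Sorted, de-duplicated values of one column, with the header row skipped.
ids_in_column() {
    tail -n +2 "$1" | awk -v c="$2" '{ print $c }' | sort -u
}

# usage: same_ids <file1> <col1> <file2> <col2>
same_ids() {
    if [ "$(ids_in_column "$1" "$2")" = "$(ids_in_column "$3" "$4")" ]; then
        echo consistent
    else
        echo mismatch
    fi
}

sample_check=$(same_ids "$workdir/mapping.txt" 2 "$workdir/attributes.txt" 1)
subject_check=$(same_ids "$workdir/mapping.txt" 1 "$workdir/consent.txt" 1)
echo "samples: $sample_check, subjects: $subject_check"
```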

#### Create dbGaP_phenotypeDS

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files

# sym link file to current directory from previous dbGaP submission
ln -s ~/projects/hiv/gwas/uhs1234/dbgap_upload/final/005/dbGaP_phenotypeDS_20201130.txt

head dbGaP_phenotypeDS_20201130.txt
# subjid	hivstat	age_hiv	viralload_cperml	viralload_log10	sex_selfreport	gwassex	ancestry_selfreport	age	surveyyear	lca_group	mpartners	stis	anal	needleshare	sexwork	site	heroin_ever	heroin_ever_inj	opioid_ever	opioid_ever_inj	totopioid_ever	heroin_age_ons	opioid_age_ons	totopioid_age_ons	heroin_inj_30d	heroin_non_30d	opioid_inj_30d	opioid_non_30d	totopioid_inj_30d	totopioid_non_30d	totopioid_tot_30d	heroin_case	opioid_case	totopioid_case	inj_ever	inj_freq	inj_case	inj_age_ons	coc_ever	coc_ever_inj	amphet_ever	amphet_ever_inj	sed_ever	sed_ever_inj	mj_ever	coc_age_ons	amphet_age_ons	sed_age_ons	mj_age_ons	cocaine_inj_30d	cocaine_non_30dtotcoc_30d	amphet_inj_30d	amphet_non_30d	totamphet_30d	mj_30d	totcoc_case	totamphet_case	mj_case
# HHG1579	0	-99999	-99999	-99999	1	1	2	26	1987	2	1	1	0	-99999	-99999	2	-99999	-99999	1	11	-99999	-99999	-99999	0	-99999	-99999	-99999	0	-99999	0	0	-99999	-99999	1	17	1	-99999	1	1	11	-99999	-99999	-99999	-99999	-99999	-99999	-99999	0	-99999	0	17	-99999	17	-99999	0	1	-99999
# HHG1561	0	-99999	-99999	-99999	2	2	2	27	1987	2	0	1	0	-99999	-99999	2	1	1	1	11	-99999	-99999	-99999	17	-99999	-99999	-99999	17	-99999	17	1	-99999	1	1	17	1	-99999	-99999	-99999-99999	-99999	-99999	-99999	-99999	-99999	-99999	-99999	-99999	0	-99999	0	0	-99999	0	-99999	0	0	-99999

head dbGaP_SubjectSampleMappingDS_v02.txt
# SUBJECT_ID SAMPLE_ID
# HHG1579 109@1064714572
# HHG1561 202@1064714531

# convert tabs to single spaces (the output keeps a .tsv name but is space-delimited)
awk '$1=$1' dbGaP_phenotypeDS_20201130.txt >  dbGaP_phenotypeDS_20201130.tsv

head -1 dbGaP_phenotypeDS_20201130.tsv > dbGaP_phenotypeDS_v01.txt

awk '
FNR==NR { map[$1] = $0; next }
{ if ($1 in map) print map[$1] }
' dbGaP_phenotypeDS_20201130.tsv dbGaP_SubjectSampleMappingDS_v02.txt \
	>> dbGaP_phenotypeDS_v01.txt

wc -l dbGaP_phenotypeDS_v01.txt
#    6188 dbGaP_phenotypeDS_v01.txt

# print the missing samples
awk '
FNR==NR { map[$1] = $0; next }
{ if (!($1 in map)) print $0 }
' dbGaP_phenotypeDS_20201130.tsv dbGaP_SubjectSampleMappingDS_v02.txt
#SUBJECT_ID SAMPLE_ID
#HHG7167 49282@1054752684
#HHG6464 119705@1054753355
#HHG3647 232828@1054755560
#HHG8876 541308@1054753694
#HHG7234 885463@1054755444
#HHG4152 994455@1054755536
#HHG10437 AS88-2170_8002023143_HHG10437_15_A01
#HHG0556 631@1064714686
#HHG5406 129734@1054697780
#HHG8549 167887@1054752878
#HHG1268 177455@1054753301
#HHG8508 246540@1054754394
#HHG5906 353343@1054752875
#HHG6290 388179@1054755347
#HHG9066 499992@1054753526
#HHG1244 507306@1054735568
#HHG0666 507342@1054752627
#HHG2299 748007@1054754953
#HHG7211 766342@1054753836
#HHG7612 819309@1054753189

In [None]:
# add 20 missing samples manually from the master phenotype file
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files

# sym link the master phenotype file to the current directory
ln -s ~/projects/hiv/gwas/uhs1234/dbgap_upload/hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv
head -3 hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv

# personid,serum,cell_line2,cell_line3,condition,gwasserum,gwaspresid,gwasdate_acq,gwasbalance,gwashiv,gwasrace,balance,hivwave,hivstat,race,date_acq,presid,wave.x,pmmn,aage,asex,astate,actliv,azipco,azpij1,azpij2,aracgp,aschgr,amarst,hhepa,hhepb,hbca,apar,fparf1y,fparf5y,fparm1y,fparm5y,fsex15y,fsex25y,fsexm5y,fpmc5y,fsxconpt,gegsgm5y,gegmhs5y,bdgb5y,bidgb5y,bidgb5yp,bdgb1y,bdgb1yt,bidgb1yt,bidgb1yp,bdgc5y,bidgc5y,bidgc5yp,bdgc1y,bdgc1yt,bidgc1yt,bidgc1yp,bdge5y,bidgd5y,bidgd5yp,bdge1y,bdge1yt,bidgd1yt,bidgd1yp,bdgf5y,bidgbar5y,bidgx5yp,bidgbar1y,bidgx1yp,bdgg5y,bidgy5yp,bdgg1y,bidgtrn1y,bidgy1yp,bdgi5y,bidge5yp,bdgi1y,bidge1yt,bidge1yp,bdgpsy5y,bidgpsy5y,bidgz5yp,bdgj1y,bdgi1yt,bidglsd1y,bidgw1yp,bdgk5y,bdgk1y,bdgl5y,bdgl1y,csndpe1y,csndpe1m,ijail5y,ijailij,intdate,bdgf1y,ahisplat,ahmls,alivea,aliveb,aparcond,htb,fparf6m,fparm6m,gegsgd5y,gegdhs5y,fsexa5y,fsexb5y,fsexc5y,fsexd5y,fsexem5y,fsexfm5y,fsexg5y,fsexh5y,fpfa,fpfa6mp,fpma,fpma6mp,fpmb,fpmb6mp,fpfb6mp,fpmc6mp,ijailsex,gspref,bdga1y,bdga1yt,bidga1yt,bidga1yp,bdga6m,bidga6mt,bidga6mp,bdgb6mt,bidgb6mt,bidgb6mp,bdgc6m,bidgc6m,bidgc6mt,bidgc6mp,bdge6m,bidgd6mt,bidgd6mp,bdgf6m,bidgbar6m,bidgx6mp,bdgg6m,bidgtrn6m,bidgy6mp,bidge6mt,bidge6mp,bdgpsy6m,bidgpsy6m,bidgz6mp,bdgk6m,bdgl6m,cscokpt,azipco1_n,aisfix,fpfa6mpt,fpma6mpt,fpmb6mpt,fpmc6mpt,ijailsnd,bdgb6m,habs,gegsgm1y,gegmhs,gegmhs1y,gegsgd1y,gegdhs,gegdhs1y,bdga1m,bdgagesb,bdgb1m,bdgb1mt,bdgageenon,bdgc1m,bdgc1mt,bdgagehnon,bdgd1mt,bdgd1y,bdgd1yt,bdgagesm,bdge1m,bdge1mt,bdgagednon,bdgf1m,bdgg1m,bdgagetr,bdgi1m,bdgi1mt,bdgageinon,bdgj1m,bdgagels,bdgk1m,bdgageb,bdgl1m,bdgagel,bidga1mt,bidga1mp,bidgaag,bidgb1mt,bidgb1mp,bidgbag,bidgc1mt,bidgc1mp,bidgd1mt,bidgd1mp,bidgdag,bidgbar1m,bidgx1mp,bidgbara,bidgtrn1m,bidgy1mp,bidgtrna,bidge1mt,bidge1mp,bidgeag,bidglsd1m,bidgw1mp,tatcnd2,gegsgm6m,gegmhs6m,gegsgd6m,gegdhs6m,ijailw6m,ijailw5y,bdgh1m,bdgh1mt,bdgh1y,bdgh1yt,bdgageic,abthyr,csrinpt,cscotpt,atrans,hvhepb,fparf1m,fparm1m,gegsgmt,gegmhst,gegsgdt,gegdhst,ctused,ctgun
,htba,ahisp,habsdr,ijailcok,ijailcot,ijailrin,fpfa1mpt,fpfb,fpfb6mpt,fpfb1mpt,fpma1mpt,fpmb1mpt,fpmc1mpt,fsxcon1m,csndpe6m,cndupt,cnduapt,cscok1m,csrin1m,cscot1m,ijailltw,ctused6m,hepbpos,hepcpos,bdgcig,bdgcigd,bdgcigy,bidgfag,bidgf6m,bidgf6mt,bidgf1m,bidgf1mt,fpfsa,fpfsa6pt,fpfsa1pt,fpfsb,fpfsb6pt,fpmsa,fpmsa6pt,fpmsa1pt,fpmsb,fpmsb6pt,fpmsb1pt,fpmsc,fpmsc6pt,fpmsc1pt,fpfoa,fpfoa6mp,fpfoa6pt,fpfoa1pt,fpfob,fpfob6mp,fpfob6pt,fpfob1pt,fpmoa,fpmoa6mp,fpmoa6pt,fpmoa1pt,fpmob,fpmob6mp,fpmob6pt,fpmob1pt,fpmoc,fpmoc6mp,fpmoc6pt,fpmoc1pt,aparnum,aparc,aparcnum,ijailwyr,ijailmo,bdgm1m,bidga,bidgb,bidgc,bidgd,bidgf,bidge,iprisltw,fpfpa,fpfpa6pt,fpfpa1pt,fpfpb,fpfpb6pt,fpfpb1pt,fpmpa,fpmpa6pt,fpmpa1pt,fpmpb,fpmpb6pt,fpmpb1pt,fpmpc,fpmpc6pt,fpmpc1pt,bdgn1m,bdgo1m,bdgp1m,bdgq1m,bidgg,bidgg6m,bidgg6mt,bidgg1m,bidgg1mt,ch2cotu,fpfa6pt,fpfb6pt,fpma6pt,fpmb6pt,fpmc6pt,bdgq6m,bdgm6m,bdgn6m,bdgn6md,bdgd6m,bdgd6md,bdgo6m,bdgi6md,bdgp6m,bdgi6m,hlivr,fpfas6mp,fpfap6mp,fpfac6mp,fpfac6pt,fpfbs6mp,fpfbp6mp,fpfbc6mp,fpfbc6pt,fpmas6mp,fpmap6mp,fpmac6mp,fpmac6pt,fpmcs6mp,fpmcp6mp,fpmcc6mp,fpmcc6pt,fpmbs6mp,fpmbp6mp,fpmbc6mp,fpmbc6pt,fptas6mp,fptap6mp,fptac6mp,fptcs6mp,fptcp6mp,fptcc6mp,fptbs6mp,fptbp6mp,fptbc6mp,geggc6,bdgr6m,bdgr1m,cscokpt1,csrinpt1,azipco1,hgawrt,holgono,hulgono,hargono,htrich,azipco2,hiv,bdga1md,bdgb1md,bdgc1md,bdgd1md,bdge1md,bdgi1md,bdgm1md,bdgn1md,bdgo1md,bdgq1md,bdgr1md,bidgb6m,bidga6m,bidgd6m,bidge6m,bidga1m,bidgd1m,bidge1m,bdgd1m,hhivpat,hhivdx,hhepc,hhepu,hghrp,hgwrt,hgono,hsyph,hchla,hbt,hgsore,gegsgm,gegsgd,bidgd1y,bidgb1y,bidgc1m,bidgc1y,bidga1y,bidge1y,fpmc,htstyrl,hhivts,hiv_gwas,hisp_hiv,heroin_abuse,city,bdylstb,bedrgc,bdylstc,btnijc,bdgaged,bdylstd,bdyijd,btijd,bdynijd,btnijd,bdgagee,bdylste,btije,bdynije,btnije,bdylstf,btijf,bdynijf,btnijf,bdylstg,bdyijg,btijg,bdynijg,btnijg,bdgageh,bdylsth,bdyijh,btijh,bdynijh,btnijh,bdylsti,bdyiji,btiji,bdyniji,btniji,fsexp,hhlthb,hhlthc,hhlthe,hhiv,fsexg6m,fsexh6m,fpfsb1pt,citywave,race_rec,sex_rec,surveyyear,syear_rec,a
ge_quad,age_tert,bdyijd_rec,bidgd1m_rec,bidgd1y_rec,bidgb1mt_rec,bidgb1m_rec,bidgb1y_rec,bdyijh_rec,bidgc1m_rec,bidgc1y_rec,bidga1mt_rec,bidga1m_rec,bidga1y_rec,bdyiji_rec,bidge1m_rec,bidge1y_rec,injdrug,ctused_rec,recneedle,fsexp_rec,fparf1m_rec,fparf1y_rec,fparm1m_rec,fparm1y_rec,npartners,multipartners,gegsgd6m_rec,gegsgd_rec,gegsgd5y_rec,gegsgm_rec,gegsgm6m_rec,gegsgm5y_rec,sexwork,hhepb_rec,hepbpos_rec,hhepc_rec,hepcpos_rec,hhlthe_rec,hchla_rec,hhlthc_rec,hsyph_rec,hhlthb_rec,hgono_rec,stis,fsexg6m_rec,fsexh6m_rec,fpmc_rec,fpmc5y_rec,fpmsc_rec,fpmoc_rec,fpmpc6pt_rec,fpmpc_rec,fpmcs6mp_rec,fpmcc6mp_rec,fpmcp6mp_rec,anal,cprob1,cprob2,cprob3,class,fscore,match,ciderid,cell_line,useable,control,plate,repon,city2,sampid,sampid2,tothopuse,totnihopuse,totihopuse,comboheroin,comboinheroin,combonoiheroin,cocni30d,smokecoc30d,speedni30d,injcoc30d,injspeed30d,smokeamph30d,injcrack30d,noincocdays30,smkcocdays30,noninspeeddays30,cocni30d_cat,smokecoc30d_cat,speedni30d_cat,smokeamph30d_cat,injcoc30d_cat,injcrack30d_cat,injspeed30d_cat,altsamplex,herointype,cocni30d_yn,smokecoc30d_yn,speedni30d_yn,smokeamph30d_yn,injcrack30d_yn,injcoc30d_yn,injspeed30d_yn,viralload_cperml.x,viralload_log10.x,viralload_date,hisplcaprob1,hisplcaprob2,hisplcaprob3,hisplca,hispmatch,hisp_select,typing,cluster,hhivmed,hhivmeds,hivmede1,hivmedc1,hivmede0,hivmedc0,hazt,hivmed3,tx1,tx2,treatment,gwassex,treatment2,aids,aidscorrect,ruid_performance,perf,genotyperace,bdgk6mns,bdgk1yns,bdgk5yns,bdge5yns,bdge1yns,bdge6mns,bdgb5yns,bdgb1yns,bdgb6mns,bdga1yns,bdga6mns,bdgc5yns,bdgc1yns,bdgc6mns,bdgi5yns,bdgi1yns,bdgpsy5yns,bdgj1yns,bdgpsy6mns,bdgl5yns,bdgl1yns,bdgl6mns,bdgf5yns,bdgf1yns,bdgf6mns,bdgg5yns,bdgg1yns,bdgg6mns,bdga1ytns,bdga6mtns,bdgb1ytns,bdgb6mtns,bdgc1ytns,bdgc6mtns,bdge1ytns,bdge6mtns,bdgi1ytns,bdgi6mtns,dbgapexclude,dbexc_reason,dupselect,unexdupselect,uwselect,hvl,bidgcag,bidgpsya,bidglsda,goofball_10,speedball_10,mj_10,amphetamines_10,opiates_10,coke_10,heroin_10,bdga1mt,bidgb1m,speedba
ll_total,opiates_total,bdgc6mt,bdga6mt,tothopuse2,married,syr2,classsex,genoqcpass,cocainenumber,cocaine10,speednumber,speed10,gwasraceselect,gwassex_t,jbeern,jhlqn,jwinen,halcoft,jbeernd,jwinend,jhlqnd,total_alc,totalalc_wk,redistribute,select4wt,oversampex,reweightselect,hivposgrtest,newsamp,serum2,notselect,status_052015,ptnum,balance_052015,spec_type,state052015,nidalist,abletotypefornewsamp,wave.y,ind_id,hiv_status,ancestry_selfreport,viralload_cperml.y,viralload_log10.y,inj_ever,age,sex_selfreport,age_hiv,mj_age_ons,heroin_inj_age_ons,heroin_noninj_age_ons,amphet_inj_age_ons,amphet_noninj_age_ons,opioid_inj_age_ons,opioid_noninj_age_ons,sed_inj_age_ons,sed_noninj_age_ons,sed_age_ons,coc_noninj_age_ons,coc_inj_age_ons,amphet_age_ons,opioid_age_ons,heroin_age_ons,coc_age_ons,mj_ever,heroin_ever_inj,heroin_ever,opioid_ever_inj,opioid_ever,coc_ever_inj,coc_ever,amphet_ever_inj,amphet_ever,sed_ever_inj,sed_ever,totopioid_ever,totopioid_age_ons,inj_age_ons,heroin_inj_30d,heroin_non_30d,opioid_non_30d,opioid_inj_30d,amphet_non_30d,amphet_inj_30d,cocaine_non_30d,cocaine_inj_30d,mj_30d,heroin_30d,opioid_30d,totopioid_inj_30d,totopioid_non_30d,totopioid_tot_30d,heroin_case,opioid_case,totopioid_case,mj_case,totcoc_30d,totamphet_30d,totcoc_case,totamphet_case,inj_freq,inj_case
# 1,AS95-07896,,,0,AS98-04724,63705,13977,3,0,2,4.75,,0,2,4/8/1998,63705,23,MY,45,1,46,,,,,2,14,4,0,0,,0,,,,,,,,,,,,,,,,0,0,,,,,,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,,4/8/1998,,,1,45,0,99,,5,0,,,,,,,,,,,0,999,9,999,9,999,999,999,,3,,,,,,540,,0,2,,,9,0,,,0,,,,,,,,0,,,,,,,0,164,164,99,99,99,99,,,1,,,,,,,0,,0,0,,0,0,,30,,,,0,0,,,0,,0,0,,,,0,,,,90,,44,0,,35,0,,0,,99,,,,,,,0,,,,,,9999,,9999,,,,,,,,,1952,0,0,0,0,2,0,999,,999,,0,0,,99,1,,,,99,0,99,99,99,99,99,,,,,9,9,9,260,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,0,,,0,0,1,1,9,9,,9,9,1,0,9,0,,0,,1,0,0,,,0,0,,,9,,,,9,1997,1,1,0,1,SF,,,,,,,,,,,,,0,,0,,0,,0,,,,,,,,,0,,0,,,,,,,,,,,,,,1023,2,1,1998,2,2,1,,0,,0,0,,,0,,1,,,,0,,,0,0,,2,,0,,2,1,0,0,,0,0,,0,0,,0,,,0,,0,,1,1,,,0,,,,,,,,,0,0.018,0.154,0.828,3,0.634,32221,AS98-04724,HHG4618,1,0,46,,1,859701,,90,0,90,1,1,0,0,30,0,0,0,,,,,,0,3,0,,0,,0,,1,0,1,0,,,0,0,,,,,,,,,,1,138,0,,99,99,99,99,,,,0,0,1,0,0,,,,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,859701,1,,99,,,,1,,0,0,1,1,0,0,90,0,0,0,90,4,2,M3,1,30,1,0,0,AA,Male,56,0,0,,,,,56,56,yes,0,1,,,2,,,Genotyped Fine,UHS63705,2,1,,,,18,24907,0,2,-9,-9,1,43,1,,-8,25,-8,-8,-8,-9,-8,-8,-8,-8,-8,32,-8,,25,32,-8,1,1,1,1,1,1,1,1,-8,-8,1,25,25,135,0,0,0,0,90,0,0,,135,0,135,0,135,1,0,,,0,90,0,1,225,1
# 10,AS87-4078,,,0,AS88-1109,4447,10294,5.75,0,1,6.75,,0,1,3/8/1988,4447,3,DR,36,1,1,1,3,10,,1,11,1,2,1,,0,2,4,0,0,,,,,0,0,0,,,,,0,1300,5,,,,,0,0,0,,,,,,340,2,,,,2,0,,,,2,0,,,2,0,0,,,,,,2,0,,,,,5,2,1,1,3/8/1988,,2,2,0,0,0,2,0,0,0,0,2,1,2,2,2,2,2,2,2,0,2,0,2,,0,0,0,3,,,2,3,,0,0,0,340,2,,2,0,0,,1,1,,2,0,,2,0,0,0,,2,0,,,70,,,,,,,,,,,,,,,,,,,0,,,0,,,,,,,,,,,,,,,,,,,,,0,,,0,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0,0,0,,,,0,,,0,0,1,0,1,0,,,,,0,0,,0,0,,0,0,0,0,1,,,1,1,,0,1,0,0,1987,1,1,0,0,SF,,,,,,,,,,,,,0,,0,,0,,0,,,,,,,,,0,,0,,,,,,,,,,,,,,1003,1,1,1988,1,1,0,,,1,,,1,,,0,,,1,,,0,,,,,,2,,0,0.166666667,0,,,0,,,0,,1,,,,,0,,0,,0,1,,,0,,,,,,,,,0,0.065,0.756,0.179,2,-0.101,21210,AS88-1109,HHG6025,1,0,27,,1,900312,,56.66666667,,56.66666667,1,1,,,,0,28.33333333,0,,,,,,,,0,,2,,0,,0,,,0,,,1,0,,,,,,,,,,1,,,,,,,,,,,,0,1,0,0,,,,1,0,0,,,1,1,,1,,1,0,,0,0,,0,,0,0,,0,0,,0,0,,1,0,2,0,1300,340,0,0,340,1,0,0,0,,,,1,,,,,,,,,,,1,0,,,,0,0,56.66666667,1,0,M2,1,0.166666667,0,0,0,EA,Male,0,0,0,,,,,0,0,yes,0,1,,,2,,,Genotyped Fine,UHS4447,5,1,,,,2,12192,0,1,-9,-9,1,36,1,,-8,-8,-8,-9,-8,-8,-8,-8,-8,-8,-8,-8,-8,-8,-8,-8,-8,1,1,1,1,1,1,1,1,-8,-8,1,-8,,83.33333333,,,,,0.083333333,,2.083333333,,83.33333333,,83.33333333,,83.33333333,1,,,,2.083333333,0.083333333,0,0,85.5,1# add 20 missing samples manually from the master phenotype file

In [None]:
### Python

"""
Capture the phenotype data of the 20 missing samples
from the master phenotype file, and write to file.
"""

out_head = "subjid hivstat age_hiv viralload_cperml viralload_log10 sex_selfreport gwassex ancestry_selfreport age surveyyear lca_group mpartners stis anal needleshare sexwork site heroin_ever heroin_ever_inj opioid_ever opioid_ever_inj totopioid_ever heroin_age_ons opioid_age_ons totopioid_age_ons heroin_inj_30d heroin_non_30d opioid_inj_30d opioid_non_30d totopioid_inj_30d totopioid_non_30d totopioid_tot_30d heroin_case opioid_case totopioid_case inj_ever inj_freq inj_case inj_age_ons coc_ever coc_ever_inj amphet_ever amphet_ever_inj sed_ever sed_ever_inj mj_ever coc_age_ons amphet_age_ons sed_age_ons mj_age_ons cocaine_inj_30d cocaine_non_30d totcoc_30d amphet_inj_30d amphet_non_30d totamphet_30d mj_30d totcoc_case totamphet_case mj_case"

# modified out_head because some variables are missing or have a different name
keep_cols = "cell_line hivstat age_hiv viralload_cperml.x viralload_log10.x sex_selfreport gwassex ancestry_selfreport age surveyyear lca_group multipartners stis anal needleshare sexwork site heroin_ever heroin_ever_inj opioid_ever opioid_ever_inj totopioid_ever heroin_age_ons opioid_age_ons totopioid_age_ons heroin_inj_30d heroin_non_30d opioid_inj_30d opioid_non_30d totopioid_inj_30d totopioid_non_30d totopioid_tot_30d heroin_case opioid_case totopioid_case inj_ever inj_freq inj_case inj_age_ons coc_ever coc_ever_inj amphet_ever amphet_ever_inj sed_ever sed_ever_inj mj_ever coc_age_ons amphet_age_ons sed_age_ons mj_age_ons cocaine_inj_30d cocaine_non_30d totcoc_30d amphet_inj_30d amphet_non_30d totamphet_30d mj_30d totcoc_case totamphet_case mj_case"

keep_cols = keep_cols.split()

### missing fields from master
# subjid -> Original dbGaP Subject ID. See cell_line from master
# lca_group -> Latent Class Analysis (LCA) assigned HIV risk group used in frequency matching  -99999=Missing or N/A
# mpartners ->  multipartners in master
# needleshare -> Sharing needles in the past 30 days, indicator used in LCA. -99999=Missing or N/A
# site ->  blinded site indicator site 1 vs. 2. -99999=Missing or N/A
# viralload_cperml -> viralload_cperml.x in master
# viralload_log10 -> viralload_log10.x in master

master = "hiv_all_merged_with_uhs_all_phenotype_data_08282017.csv"
missing_file = "sampid_and_cell_line.txt"
outfile = "data_for_20_missing_samples.txt"
with open(missing_file) as missF, open(master) as masterF, open(outfile, "w") as outF:
    head = masterF.readline().strip()
    head = head.split(",")

    # create a list based off of keep_cols. The list will contain the column
    # number that the associated variable is located in the header of the master
    keep_indices = []
    for pheno in keep_cols:
        if pheno in ("lca_group", "needleshare", "site"):
            # These keep_cols are not present in the master, so store the
            # placeholder "NA" instead of a column number. Using "NA" as a
            # list index later raises an exception, which is caught and the
            # value is written as missing (-99999).
            keep_indices.append("NA")
        else:
            keep_indices.append(head.index(pheno))

    # create set with subject_ids (first whitespace-delimited field of each line)
    subject_set = set()
    for line in missF:
        sl = line.split()
        if sl:  # skip blank lines instead of failing on sl[0]
            subject_set.add(sl[0])

    # create a dictionary from the master. The cell_line will be the key
    # and the value will be the entries in keep_cols (except subjid which is the cell_line in master).
    line = masterF.readline().strip()
    master_dict = {}
    while line:
        sl = line.split(",")
        subjid = sl[keep_indices[0]]
        if subjid in subject_set:
            master_dict[subjid] = []
            for i in range(1, len(keep_cols)):
                try:
                    entry = sl[keep_indices[i]]
                    if entry == "": # if the master has a blank entry, fill in the blank
                        entry = "-99999"
                    master_dict[subjid].append(entry)
                except (TypeError, IndexError):
                    # TypeError: the index is the "NA" placeholder (variable not in the header)
                    # IndexError: the row has fewer fields than the header
                    master_dict[subjid].append("-99999")
        line = masterF.readline().strip()

    # write the data we found for the missing samples to file
    outF.write(out_head + "\n")
    for key in master_dict:
        values = master_dict[key]
        values = " ".join(values)
        outline = "{} {}\n".format(key, values)
        outF.write(outline)
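
The script above writes space-delimited rows, so a master value containing a space would shift every column after it. A minimal sketch of a field-count check follows. The function name `check_field_counts` and the toy data are hypothetical and are not part of the original workflow.

```python
import io


def check_field_counts(handle, sep=None):
    """Return (n_header_fields, bad_rows).

    bad_rows lists (line_number, n_fields) for every non-blank row whose
    number of fields differs from the header. sep=None splits on whitespace.
    """
    header = handle.readline().split(sep)
    bad_rows = []
    for line_number, line in enumerate(handle, start=2):
        fields = line.split(sep)
        if fields and len(fields) != len(header):
            bad_rows.append((line_number, len(fields)))
    return len(header), bad_rows


# Toy stand-in for the output file (made-up values, not real data)
demo = io.StringIO(
    "subjid hivstat age\n"
    "ID1 0 26\n"
    "ID2 1\n"
)
n_fields, bad = check_field_counts(demo)
print(n_fields, bad)  # 3 [(3, 2)]
```

On the real file, the same function could be called with `open("data_for_20_missing_samples.txt")`; an empty `bad_rows` list would mean every row has as many fields as the header.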

In [None]:
cat dbGaP_phenotypeDS_v01.txt <(tail -n +2 data_for_20_missing_samples.txt) \
	> dbGaP_phenotypeDS_v02.txt

wc -l dbGaP_phenotypeDS_v02.txt
# 6208

head dbGaP_phenotypeDS_v02.txt
# subjid hivstat age_hiv viralload_cperml viralload_log10 sex_selfreport gwassex ancestry_selfreport age surveyyear lca_group mpartners stis anal needleshare sexwork site heroin_ever heroin_ever_inj opioid_ever opioid_ever_inj totopioid_ever heroin_age_ons opioid_age_ons totopioid_age_ons heroin_inj_30d heroin_non_30d opioid_inj_30d opioid_non_30d totopioid_inj_30d totopioid_non_30d totopioid_tot_30d heroin_case opioid_case totopioid_case inj_ever inj_freq inj_case inj_age_ons coc_ever coc_ever_inj amphet_ever amphet_ever_inj sed_ever sed_ever_inj mj_ever coc_age_ons amphet_age_ons sed_age_ons mj_age_ons cocaine_inj_30d cocaine_non_30d totcoc_30d amphet_inj_30d amphet_non_30d totamphet_30d mj_30d totcoc_case totamphet_case mj_case
# HHG1579 0 -99999 -99999 -99999 1 1 2 26 1987 2 1 1 0 -99999 -99999 2 -99999 -99999 1 1 1 -99999 -99999 -99999 0 -99999 -99999 -99999 0 -99999 0 0 -99999 -99999 1 17 1 -99999 1 1 1 1 -99999 -99999 -99999 -99999 -99999 -99999 -99999 0 -99999 0 17 -99999 17 -99999 0 1 -99999
# HHG1561 0 -99999 -99999 -99999 2 2 2 27 1987 2 0 1 0 -99999 -99999 2 1 1 1 1 1 -99999 -99999 -99999 17 -99999 -99999 -99999 17 -99999 17 1 -99999 1 1 17 1 -99999 -99999 -99999 -99999 -99999 -99999 -99999 -99999 -99999 -99999 -99999 -99999 0 -99999 0 0 -99999 0 -99999 0 0 -99999
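
Because the 20 rows are appended with `cat`, a subject that already existed in `dbGaP_phenotypeDS_v01.txt` would appear twice in the combined file. The sketch below shows one way to look for repeated subject IDs. The helper name `duplicated_ids` and the demo rows are assumptions for illustration only.

```python
from collections import Counter


def duplicated_ids(lines):
    """Return sorted IDs (first whitespace-delimited field) seen more than once.

    The first line is treated as the header and skipped; blank lines are ignored.
    """
    ids = [line.split()[0] for line in lines[1:] if line.strip()]
    return sorted(sid for sid, n in Counter(ids).items() if n > 1)


# Toy rows (made-up IDs, not real data)
demo_lines = [
    "subjid hivstat",
    "ID_A 0",
    "ID_B 1",
    "ID_A 0",
]
print(duplicated_ids(demo_lines))  # ['ID_A']
```

For the real file, the lines could be read with `open("dbGaP_phenotypeDS_v02.txt").read().splitlines()`; an empty list would indicate that no subject ID is repeated.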

## Move dbGaP files to final location

In [None]:
mkdir -p /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/20210630
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/20210630

# sample attributes
cp /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/dbGaP_SampleAttributesDS_v02.txt \
    dbGaP_SampleAttributesDS_20210630.txt
cp /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/004/dbGaP_SampleAttributesDD_20191202.xlsx \
   dbGaP_SampleAttributesDD_20210630.xlsx

# subject consent
cp /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/dbGaP_SubjectConsentDS_v02.txt \
    dbGaP_SubjectConsentDS_20210630.txt
cp /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/004/dbGaP_SubjectConsentDD_20191202.xlsx \
    dbGaP_SubjectConsentDD_20210630.xlsx

# subject to sample mapping
cp /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/dbGaP_SubjectSampleMappingDS_v02.txt \
    dbGaP_SubjectSampleMappingDS_20210630.txt
cp /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/004/dbGaP_SubjectSampleMappingDD_20191202.xlsx \
    dbGaP_SubjectSampleMappingDD_20210630.xlsx

# phenotypes
cp /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/dbGaP_phenotypeDS_v02.txt \
    dbGaP_phenotypeDS_20210630.txt
cp /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/005/dbGaP_phenotypeDD_20201130.xlsx \
    dbGaP_phenotypeDD_20210630.xlsx

### Upload to S3

In [None]:
cd ~/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/

for wave in {uhs1,uhs2,uhs4}; do
    for ancestry in {afr,eur}; do
        for ext in {bed,bim,fam}; do
            file=$wave/$ancestry/${wave}_${ancestry}_dbgap_ready.$ext
            gzip $file
            aws s3 cp $file.gz s3://rti-shared/shared_data/post_qc/$wave/genotype/array/observed/0010/$ancestry/
            cp $file.gz /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/20210630/
        done
    done
done


for ancestry in {afr,eur}; do
    for ext in {bed,bim,fam}; do
        file=uhs3_1-2/$ancestry/uhs3_v1-2_${ancestry}_dbgap_ready.$ext
        gzip $file
        aws s3 cp $file.gz s3://rti-shared/shared_data/post_qc/uhs3_1-2/genotype/array/observed/0010/$ancestry/
        cp $file.gz /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/20210630/
    done
done


for ancestry in {afr,eur}; do
    for ext in {bed,bim,fam}; do
        file=uhs3_1-3/$ancestry/uhs3_v1-3_${ancestry}_dbgap_ready.$ext
        gzip $file
        aws s3 cp $file.gz s3://rti-shared/shared_data/post_qc/uhs3_1-3/genotype/array/observed/0010/$ancestry/
        cp $file.gz /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/20210630/
    done
done
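
The three loops above should leave 30 gzipped PLINK files (5 waves × 2 ancestries × 3 extensions) in the final directory. The following sketch enumerates the expected file names from the naming pattern used in the loops and reports any that are absent. The helper names are hypothetical, and nothing here touches S3 or the file system.

```python
from itertools import product

# Directory name -> file-name prefix, taken from the shell loops above
WAVES = {
    "uhs1": "uhs1",
    "uhs2": "uhs2",
    "uhs3_1-2": "uhs3_v1-2",
    "uhs3_1-3": "uhs3_v1-3",
    "uhs4": "uhs4",
}
ANCESTRIES = ("afr", "eur")
EXTENSIONS = ("bed", "bim", "fam")


def expected_gz_names():
    """List every gzipped genotype file name the loops should produce."""
    return [
        f"{prefix}_{ancestry}_dbgap_ready.{ext}.gz"
        for prefix, ancestry, ext in product(WAVES.values(), ANCESTRIES, EXTENSIONS)
    ]


def missing_files(present_names):
    """Return the expected names that are not in present_names."""
    present = set(present_names)
    return [name for name in expected_gz_names() if name not in present]


names = expected_gz_names()
print(len(names))  # 30
# Pretend the last file was never copied
print(missing_files(names[:-1]))  # ['uhs4_eur_dbgap_ready.fam.gz']
```

In practice `missing_files(os.listdir(final_dir))` could be used, where `final_dir` is the `final/20210630` directory.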

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/20210630/

## zip up genotype data
zip uhs_genotypes_dbgap_ready_20210630.zip uhs*dbgap_ready*

aws s3 sync . s3://rti-shared/shared_data/post_qc/uhs1234/phenotype/0010/

# Final sample count

| Wave Name | EUR N | AFR N | Sum   |
|-----------|-------|-------|-------|
| UHS1      | 1,140 | 2,015 | 3,155 |
| UHS2      | 754   | 595   | 1,349 |
| UHS3_1-2  | 24    | 51    | 75    |
| UHS3_1-3  | 31    | 54    | 85    |
| UHS4      | 808   | 735   | 1,543 |

* AFR: 3,450
* EUR: 2,757
* AFR+EUR: 6,207
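
The totals listed above can be recomputed from the per-wave EUR and AFR counts. This is a small arithmetic check using only the numbers in the table; the variable names are illustrative.

```python
# Per-wave sample counts copied from the EUR N and AFR N columns
counts = {
    "UHS1": {"EUR": 1140, "AFR": 2015},
    "UHS2": {"EUR": 754, "AFR": 595},
    "UHS3_1-2": {"EUR": 24, "AFR": 51},
    "UHS3_1-3": {"EUR": 31, "AFR": 54},
    "UHS4": {"EUR": 808, "AFR": 735},
}

# Row sums (EUR + AFR per wave) and column totals
wave_sums = {wave: c["EUR"] + c["AFR"] for wave, c in counts.items()}
eur_total = sum(c["EUR"] for c in counts.values())
afr_total = sum(c["AFR"] for c in counts.values())

for wave, total in wave_sums.items():
    print(wave, total)
print(eur_total, afr_total, eur_total + afr_total)  # 2757 3450 6207
```

The grand total of 6,207 should equal the number of data rows in the phenotype file, which has one additional header line.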

In [None]:
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/final/20210630
## number of samples with phenotype data
## (the line count includes the header row, so 6,208 lines = 6,207 samples)
wc -l dbGaP_phenotypeDS_20210630.txt
#    6208 dbGaP_phenotypeDS_20210630.txt

## UHS1: number of samples with genotype data
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs1/
wc -l */*dbgap_ready*fam
#    2015 afr/uhs1_afr_dbgap_ready.fam
#    1140 eur/uhs1_eur_dbgap_ready.fam
#    3155 total

## UHS2: number of samples with genotype data
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs2/
wc -l */*dbgap_ready*fam
#     595 afr/uhs2_afr_dbgap_ready.fam
#     754 eur/uhs2_eur_dbgap_ready.fam
#    1349 total

## UHS3_v1-2: number of samples with genotype data
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs3_1-2/
wc -l */*dbgap_ready*fam
#      51 afr/uhs3_v1-2_afr_dbgap_ready.fam
#      24 eur/uhs3_v1-2_eur_dbgap_ready.fam
#      75 total

## UHS3_v1-3: number of samples with genotype data
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs3_1-3/
wc -l */*dbgap_ready*fam
#      54 afr/uhs3_v1-3_afr_dbgap_ready.fam
#      31 eur/uhs3_v1-3_eur_dbgap_ready.fam
#      85 total
    
## UHS4: number of samples with genotype data
cd /Users/jmarks/projects/hiv/gwas/uhs1234/dbgap_upload/20210506_troubleshooting/create_final_dbgap_files/uhs4/
wc -l */*dbgap_ready*fam
#     735 afr/uhs4_afr_dbgap_ready.fam
#     808 eur/uhs4_eur_dbgap_ready.fam
#    1543 total
