# New version validations
This notebook checks the subject IDs in new versions of studies to confirm that they have not changed between versions. Changed subject IDs can cause problems with genomic data or lead to incorrect patient mapping in PIC-SURE.

### Prerequisites
- Access to the S3 bucket
- Files from the new study version, downloaded via the "Pull raw data from gen3" Jenkins job

In [None]:
import pandas as pd
from check_version_utils import check_new_version, check_new_df

In [None]:
# Set these paths to the directories containing the files of interest
old_dir = '/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/aric/rawDataOld/' # old version files
new_dir = '/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/aric/rawData/' # newly downloaded file versions

### Comparing Subject_MULTI files

In [None]:
# Check all columns of the subject_multi file 

In [None]:
subject_cols = ['INDIVIDUAL_ID', 'SUBJID', 'SUBJECT_ID']
exclude_cols = ['DBGAP_SUBJECT_ID']

In [None]:
old_sub_multi = old_dir+'phs000280.v5.pht001440.v5.p1.ARIC_Subject.MULTI.txt'
new_sub_multi = new_dir+'phs000280.v7.pht001440.v5.p1.ARIC_Subject.MULTI.txt'

In [None]:
old_diffs, new_diffs = check_new_version(old_sub_multi, new_sub_multi, subject_cols)

In [None]:
old_data, new_data = check_new_df(old_sub_multi, new_sub_multi, include_cols=None,
                                  exclude_cols=exclude_cols, old_diffs=old_diffs, new_diffs=new_diffs)

In [None]:
# Manual inspection of dataframes
#old = pd.read_csv(old_sub_multi, sep = '\t', skiprows=10)
#new = pd.read_csv(new_sub_multi, sep = '\t', skiprows=10)
#old
#new

### Comparing Sample_MULTI files

In [None]:
# The subject ID and the sample ID should match for each row

In [None]:
old_sam_multi = old_dir+'phs000280.v5.pht001441.v5.p1.ARIC_Sample.MULTI.txt'
new_sam_multi = new_dir+'phs000280.v7.pht001441.v7.p1.ARIC_Sample.MULTI.txt'

In [None]:
sample_cols = ['SAMPID', 'SAMPLE_ID', 'SAMPLEID']
include_cols = sample_cols+subject_cols

In [None]:
old_diffs, new_diffs = check_new_version(old_sam_multi, new_sam_multi, sample_cols)

In [None]:
new_diffs

In [None]:
old_data, new_data = check_new_df(old_sam_multi, new_sam_multi, include_cols=include_cols,
                                  exclude_cols=None, old_diffs=old_diffs, new_diffs=new_diffs)

In [None]:
# Manual inspection of dataframes
#old = pd.read_csv(old_sam_multi, sep = '\t', skiprows=10)
#new = pd.read_csv(new_sam_multi, sep = '\t', skiprows=10)
#old
#new

### Unique cases
If the older version of the study has more sample IDs than the new version:
- If a subject ID previously had a sample ID but is no longer associated with any sample IDs, then we need to orphan that subject's genomic data. RED FLAG
- If some sample IDs associated with a subject ID were removed, but the subject ID is still associated with at least one sample ID, this is okay. GREEN LIGHT
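
The two cases above can be sketched as a per-subject set comparison. This is only an illustrative sketch and is not part of `check_version_utils`: the function name `classify_lost_samples` and the toy IDs are hypothetical, and the input is assumed to be plain (subject ID, sample ID) pairs.

```python
def classify_lost_samples(old_pairs, new_pairs):
    """Classify subjects that lost sample IDs between versions.

    old_pairs / new_pairs: iterables of (subject_id, sample_id) tuples.
    Returns (red_flag, green_light) as sorted lists of subject IDs.
    Hypothetical helper for illustration only.
    """
    def by_subject(pairs):
        # Map each subject ID to the set of its sample IDs
        mapping = {}
        for subject, sample in pairs:
            mapping.setdefault(subject, set()).add(sample)
        return mapping

    old_map = by_subject(old_pairs)
    new_map = by_subject(new_pairs)

    red_flag, green_light = [], []
    for subject, old_samples in old_map.items():
        new_samples = new_map.get(subject, set())
        if not (old_samples - new_samples):
            continue  # no sample IDs were removed for this subject
        if new_samples:
            green_light.append(subject)  # still has at least one sample ID
        else:
            red_flag.append(subject)  # no sample IDs left in the new version
    return sorted(red_flag), sorted(green_light)


# Toy data (made up): S1 loses one of two samples, S2 loses its only sample
old_pairs = [("S1", "A"), ("S1", "B"), ("S2", "C"), ("S3", "D")]
new_pairs = [("S1", "A"), ("S3", "D")]
red_flag, green_light = classify_lost_samples(old_pairs, new_pairs)
print("RED FLAG:", red_flag)
print("GREEN LIGHT:", green_light)
```

With dataframes loaded from the Sample_MULTI files, the pairs could be built with something like `zip(df['SUBJECT_ID'], df['SAMPLE_ID'])`, assuming those are the column names in the file and that rows with missing sample IDs have been dropped first.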

### Helpful code for manual troubleshooting

In [None]:
# Merging on specific columns and getting the difference between datasets
#diffs = newsub.merge(oldsub, how='outer', on=['SUBJECT_ID', 'SAMPLE_ID', 'BioSample Accession'], indicator=True)
#res = diffs[diffs._merge == 'right_only']
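
As a runnable illustration of the same merge pattern, here is a small sketch on made-up data. The dataframes `old` and `new` and their IDs are hypothetical stand-ins for the loaded Sample_MULTI files; only `SUBJECT_ID` and `SAMPLE_ID` are used as merge keys here.

```python
import pandas as pd

# Toy stand-ins for the old and new Sample_MULTI dataframes (made-up IDs)
old = pd.DataFrame({"SUBJECT_ID": ["S1", "S1", "S2"],
                    "SAMPLE_ID": ["A", "B", "C"]})
new = pd.DataFrame({"SUBJECT_ID": ["S1", "S3"],
                    "SAMPLE_ID": ["A", "D"]})

# An outer merge with indicator=True adds a `_merge` column recording
# whether each row came from the left frame, the right frame, or both
merged = new.merge(old, how="outer", on=["SUBJECT_ID", "SAMPLE_ID"], indicator=True)

# `new` is the left frame, so 'right_only' rows exist only in the old version
only_in_old = merged[merged["_merge"] == "right_only"]
# 'left_only' rows exist only in the new version
only_in_new = merged[merged["_merge"] == "left_only"]

print(only_in_old)
print(only_in_new)
```

Rows in `only_in_old` are the subject/sample pairs that were dropped in the new version, which are the ones to review against the cases described under "Unique cases".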