# Data type identification
This notebook classifies each variable in a study's data as either continuous or categorical for use in PIC-SURE.

- Input: `decoded_data` S3 directory, output file location in the `pic-sure-metadata-curation` directory
- Output: CSV file of interpreted data types, encoded value, and file location

In [None]:
# Do imports
import pandas as pd
from data_type_utils import identify_var_types, check_against_sas, parse_data, check_data, output_results

### Step 1: Identify variable types based on pandas dataframe types. 
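
`identify_var_types` is imported from `data_type_utils`, and its implementation is not shown in this notebook. As a rough sketch of the idea only, the hypothetical helper `classify_columns` below labels each column from its pandas dtype: numeric dtypes become `continuous` and everything else becomes `categorical`. The helper name, the output column names, and the toy data are assumptions for illustration, not the actual utility.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def classify_columns(df, file_name):
    """Hypothetical helper: label each column as continuous or categorical."""
    rows = []
    for col in df.columns:
        # pandas treats booleans as numeric, so exclude them explicitly.
        if is_numeric_dtype(df[col]) and not is_bool_dtype(df[col]):
            var_type = "continuous"
        else:
            var_type = "categorical"
        rows.append({"file": file_name, "varname": col, "type": var_type})
    return pd.DataFrame(rows)


# Toy data standing in for one decoded CSV file (assumed, not from the study).
toy = pd.DataFrame({"AGE": [34, 51, 47], "SEX": ["F", "M", "F"]})
toy_types = classify_columns(toy, "toy.csv")
```

A dtype-only rule like this will mislabel numerically coded categories (for example, `1`/`2` for yes/no), which is one reason the later steps compare against the SAS metadata and inspect the data directly.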

In [None]:
file_dir = '/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/walk-phasst/decoded_data/'
decoded_df = identify_var_types(file_dir+'*')
decoded_df['file'].unique()

### Step 2: Compare with SAS data types.

Before proceeding, manually curate the `file` column so that it matches the SAS file names. Specifically, the new `df` column derived from `file` should match the `MEMNAME` column in the SAS file.
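
One way to confirm that the curation is complete is to compare the normalized keys against the set of `MEMNAME` values and list whatever is left over. The sketch below uses toy stand-ins: `demog.csv` and the `memnames` set are made-up values, and only the `lab_t_r` to `labt_r` rename comes from this notebook.

```python
import pandas as pd

# Toy stand-ins; the real values come from decoded_df['file'] and the SAS metadata CSV.
files = pd.Series(["lab_t_r.csv", "demog.csv"])
memnames = {"LABT_R", "DEMOG"}

# regex=False treats '.csv' as a literal string rather than a pattern
# in which '.' matches any character.
keys = (
    files.str.replace(".csv", "", regex=False)
    .str.replace("lab_t_r", "labt_r", regex=False)
    .str.upper()
)

# Any key listed here still needs manual curation before the merge.
unmatched = sorted(set(keys) - memnames)
```

An empty `unmatched` list suggests every data file has a corresponding SAS member name.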

In [None]:
# Peek at SAS file to get an understanding of how to format the DF for merging
test = pd.read_csv('/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/walk-phasst/sas_files/WALKPHASST_METADATA.csv')
test['MEMNAME'].unique()

In [None]:
# Manual curation of the "file" column
#new = decoded_df['file'].str.replace("pfu_", '')
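# regex=False treats '.csv' as a literal string, not a pattern where '.' matches any character
new = decoded_df['file'].str.replace('.csv', '', regex=False)
new = new.str.replace("lab_t_r", "labt_r", regex=False)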
#new.unique()
decoded_df['df'] = new.str.upper()
decoded_df
#new = csscd['file'].str.split("_", expand=True)
#new = new[1].str.split('.', expand=True)
#new
#dfs = []
#for i, df in enumerate(new):
#    if df.upper() in ["LAB_T_R"]:
#        #new['df'] = df.upper()
#        dfs.append("LABT_R")
#    else:
#        #new['df'] = df.upper()+"_PUBN"
#        dfs.append(df.upper())
#decoded_df['df'] = dfs

#csscd['df'] = new[0].str.upper()
#csscd

In [None]:
# Be sure to change the directory to the correct metadata file
comparison = check_against_sas('/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/walk-phasst/sas_files/WALKPHASST_METADATA.csv', 
                            decoded_df)
mismatch_comparison = comparison[comparison['type_match'] == False].reset_index()
mismatch_comparison

The dataframe above displays the variables where the pandas variable type does not match the SAS variable type.
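
How `check_against_sas` derives `type_match` is not shown here. A minimal sketch of one plausible approach follows, assuming the SAS metadata uses the `PROC CONTENTS` convention in which `TYPE` is `1` for numeric and `2` for character variables. The column names `type`, `TYPE`, and `sas_type`, and the toy rows, are assumptions for illustration.

```python
import pandas as pd

# Toy merged table: pandas-derived type next to the SAS TYPE code (assumed layout).
merged = pd.DataFrame({
    "varname": ["AGE", "SITE"],
    "type": ["continuous", "continuous"],
    "TYPE": [1, 2],
})

# Assumed mapping: SAS numeric (1) -> continuous, SAS character (2) -> categorical.
sas_type_map = {1: "continuous", 2: "categorical"}
merged["sas_type"] = merged["TYPE"].map(sas_type_map)

# True where the pandas-derived type agrees with the SAS-derived type.
merged["type_match"] = merged["type"] == merged["sas_type"]
mismatches = merged[~merged["type_match"]].reset_index(drop=True)
```

In this toy example only `SITE` is flagged, because pandas read it as numeric while SAS declares it as character.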

In [None]:
data_only = comparison[comparison['_merge'] != 'both'].reset_index()
data_only#['df'].unique()

The dataframe above displays the variables that were not found in both sources (`_merge` is not `'both'`). The intent is to catch variables that exist in the data file but *not* in the SAS file. As a sanity check, this dataframe should be empty: every variable in the data file should also appear in the SAS information.
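
The `_merge` column is the default name pandas gives the indicator column when `DataFrame.merge` is called with `indicator=True`, so `check_against_sas` presumably performs such a merge. The sketch below shows how the indicator separates the two one-sided cases. The toy frames and the data-side column name `varname` are assumptions, and `NAME` is assumed to be the SAS variable-name column.

```python
import pandas as pd

# Toy inputs: variables seen in the data files and in the SAS metadata.
data_vars = pd.DataFrame({"df": ["DEMOG", "DEMOG"], "varname": ["AGE", "EXTRA"]})
sas_vars = pd.DataFrame({"MEMNAME": ["DEMOG", "DEMOG"], "NAME": ["AGE", "SEX"]})

# An outer merge keeps rows from both sides; indicator=True adds the _merge column.
merged = data_vars.merge(
    sas_vars,
    how="outer",
    left_on=["df", "varname"],
    right_on=["MEMNAME", "NAME"],
    indicator=True,
)

# 'left_only' rows exist only in the data; 'right_only' rows exist only in SAS.
data_only_rows = merged[merged["_merge"] == "left_only"]
sas_only_rows = merged[merged["_merge"] == "right_only"]
```

Filtering on `!= 'both'` returns both one-sided groups together, so filtering on `'left_only'` or `'right_only'` is the more specific check when only one direction matters.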

### Step 3: Peek at data to check disagreements between pandas and SAS.

The following code selects a variable from `mismatch_comparison` (the type-mismatch dataframe from Step 2) and peeks at its data.
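
What `check_data` prints is not shown in this notebook. As an illustration of the kind of summary that helps resolve a disagreement, the hypothetical helper `peek_variable` below reports how many distinct values a column has, what share of its non-null values parse as numbers, and its most frequent values. The helper and the toy series are assumptions.

```python
import pandas as pd


def peek_variable(series, n=5):
    """Hypothetical helper: summarize a column to help decide its type."""
    non_null = series.dropna()
    # Values that cannot be parsed as numbers become NaN.
    as_number = pd.to_numeric(non_null, errors="coerce")
    return {
        "n_unique": int(non_null.nunique()),
        "share_numeric": float(as_number.notna().mean()) if len(non_null) else 0.0,
        "top_values": non_null.value_counts().head(n).to_dict(),
    }


# Toy column: mostly numeric strings with one text code and one missing value.
summary = peek_variable(pd.Series(["1", "2", "2", "unknown", None]))
```

A column with a high numeric share and many distinct values is likely continuous, while a handful of repeated codes points toward categorical.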

In [None]:
# Can do manual check of specific variable by specifying varname parameter
#check_data(mismatch_comparison, file_dir, varname='F04REV')

In [None]:
# This cell randomly chooses 5 of the mismatched variables to display as a sanity check
check_data(mismatch_comparison, file_dir)

### Step 4: Output the variable type information
The final step is to output the identified variable type information to be used in the curation process of the metadata JSON file. This file should include the output variable name, variable label, file name, and type (and could perhaps include the SAS file and `MEMNAME`).

Note that this file should be saved in the `intermediates` folder of the study folder in the pic-sure-metadata-curation repo. For example: `pic-sure-metadata-curation/csscd/intermediates/`.

In [None]:
output_results(comparison, 
               "/home/ec2-user/SageMaker/pic-sure-metadata-curation/walk-phasst/intermediates/walk-phasst_data_info.csv")

In [None]:
# Data type validation and identification
# Reports folder in each study
# Each variable has a report file
# Each study has a variable report overview

# Next steps
# Expand current pipeline to include dbGaP studies
# Spit out mapping2_postanalyzer.csv
# Use TEXT/NUMERIC instead of continuous/categorical