# Data type identification
This notebook classifies each variable as either continuous or categorical for use in PIC-SURE.

- Input: `decoded_data` S3 directory, output file location in picsure-metadata-curation directory
- Output: CSV file of interpreted data types, encoded value, and file location

In [1]:
# Do imports
from data_type_utils import identify_var_types, check_against_sas, parse_data, check_data

### Step 1: Identify variable types based on pandas dataframe types. 

In [2]:
file_dir = '/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/csscd/decoded_data/'
csscd = identify_var_types(file_dir+'*')
csscd

Unnamed: 0,index,varname,raw_type,picsure_type,file
0,0,ANONID,int64,continuous,phase1_r02.csv
1,1,F02CYCLE,int64,continuous,phase1_r02.csv
2,2,F02RSP,object,categorical,phase1_r02.csv
3,3,F02MS,object,categorical,phase1_r02.csv
4,4,F02ED,object,categorical,phase1_r02.csv
...,...,...,...,...,...
95,14,DISCREP2,object,categorical,phase1_r04.csv
96,15,DISCREP3,object,categorical,phase1_r04.csv
97,16,DISCREP4,object,categorical,phase1_r04.csv
98,17,F04ALPHA,float64,continuous,phase1_r04.csv
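
The source of `identify_var_types` is not shown in this notebook, so the following is only a minimal sketch of the mapping the output above suggests: numeric pandas dtypes become `continuous` and everything else becomes `categorical`. The function name `sketch_var_types`, the decision to treat booleans as categorical, and the example values are assumptions for illustration, not the actual `data_type_utils` implementation.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def sketch_var_types(df: pd.DataFrame, file_name: str) -> pd.DataFrame:
    """Hypothetical stand-in for identify_var_types, applied to one DataFrame."""
    rows = []
    for varname in df.columns:
        dtype = df[varname].dtype
        # Numeric dtypes (e.g. int64, float64) -> continuous; anything else -> categorical.
        # Booleans count as numeric in pandas, so they are excluded explicitly (an assumption).
        numeric = is_numeric_dtype(dtype) and not is_bool_dtype(dtype)
        rows.append({
            "varname": varname,
            "raw_type": str(dtype),
            "picsure_type": "continuous" if numeric else "categorical",
            "file": file_name,
        })
    return pd.DataFrame(rows)


# Toy data: column names come from the output above, the values are made up.
example = pd.DataFrame({
    "ANONID": [1, 2, 3],
    "F02RSP": ["SELF", "PARENT", None],
    "F04ALPHA": [0.5, None, 1.25],
})
types = sketch_var_types(example, "phase1_r02.csv")
print(types)
```

A dtype-only rule like this cannot tell a numeric code (such as a cycle number) from a true measurement, which is why the comparison against the SAS metadata in the next step is useful.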


### Step 2: Compare with SAS data types.

Before proceeding, we need to manually curate the `file` column to match the SAS file names. Specifically, a new column `df`, derived from `file`, should match the `MEMNAME` column in the SAS metadata file.

In [3]:
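# Manual curation of the "file" column: derive `df` from the file name,
# e.g. "phase1_r02.csv" -> "r02.csv" -> "r02" -> "R02"
new = csscd['file'].str.split("_", expand=True)  # split off the "phase1" prefix
new = new[1].str.split('.', expand=True)         # drop the ".csv" extension
csscd['df'] = new[0].str.upper()                 # upper-case to match MEMNAME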
csscd

Unnamed: 0,index,varname,raw_type,picsure_type,file,df
0,0,ANONID,int64,continuous,phase1_r02.csv,R02
1,1,F02CYCLE,int64,continuous,phase1_r02.csv,R02
2,2,F02RSP,object,categorical,phase1_r02.csv,R02
3,3,F02MS,object,categorical,phase1_r02.csv,R02
4,4,F02ED,object,categorical,phase1_r02.csv,R02
...,...,...,...,...,...,...
95,14,DISCREP2,object,categorical,phase1_r04.csv,R04
96,15,DISCREP3,object,categorical,phase1_r04.csv,R04
97,16,DISCREP4,object,categorical,phase1_r04.csv,R04
98,17,F04ALPHA,float64,continuous,phase1_r04.csv,R04


In [4]:
comparison = check_against_sas('/home/ec2-user/SageMaker/studies/ALL-avillach-73-bdcatalyst-etl/csscd/sas_files/PHASE1_METADATA.csv', 
                            csscd)
mismatch_comparison = comparison[comparison['type_match'] == False].reset_index()
mismatch_comparison

Unnamed: 0,index,varname,raw_type,picsure_type,file,df,MEMNAME,NAME,TYPE,LABEL,type_match,FORMAT
0,1,F02CYCLE,int64,continuous,phase1_r02.csv,R02,R02,F02CYCLE,categorical,UPDATE NUMBER,False,
1,16,F02MIL_R,object,categorical,phase1_r02.csv,R02,R02,F02MIL_R,continuous,ACTIVE DUTY FOR MILITARY SERVICE,False,
2,17,F02OC_R,object,categorical,phase1_r02.csv,R02,R02,F02OC_R,continuous,JOB CODE,False,$F02OC_R
3,18,JF02DATE,int64,continuous,phase1_r02.csv,R02,R02,JF02DATE,categorical,DATE OF VISIT - DAYS SINCE DOE,False,
4,20,F03CYCLE,int64,continuous,phase1_r03.csv,R03,R03,F03CYCLE,categorical,UPDATE NUMBER,False,
5,77,F04COST,float64,continuous,phase1_r03.csv,R03,R03,F04COST,categorical,COST OF TRANSPORTATION TO CLINIC,False,
6,78,F03OCH_R,object,categorical,phase1_r03.csv,R03,R03,F03OCH_R,continuous,JOB CODE (HOUSEHOLD HEAD),False,$F03OCHR
7,79,JF03DATE,float64,continuous,phase1_r03.csv,R03,R03,JF03DATE,categorical,DATE OF VISIT - DAYS SINCE DOE,False,
8,80,JF04DATE,float64,continuous,phase1_r03.csv,R03,R03,JF04DATE,categorical,DATE OF INTERVIEW - DAYS SINCE DOE,False,
9,82,C_EPS,object,categorical,phase1_r04.csv,R04,R04,C_EPS,continuous,HINC II - EPSILON RESULT,False,


The dataframe above lists the variables whose pandas-derived type does not match the SAS variable type.
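
How `check_against_sas` builds this table is not shown here. One plausible approach, sketched below under that assumption, is to merge the pandas-derived types with the SAS metadata on (`df`, `varname`) = (`MEMNAME`, `NAME`) and compare `picsure_type` with `TYPE`. The two mismatched rows are taken from the output above; `EXAMPLE_VAR` is a hypothetical matching variable added for contrast.

```python
import pandas as pd

# Pandas-derived types (EXAMPLE_VAR is hypothetical).
pandas_types = pd.DataFrame({
    "varname": ["F02CYCLE", "F02OC_R", "EXAMPLE_VAR"],
    "picsure_type": ["continuous", "categorical", "continuous"],
    "df": ["R02", "R02", "R02"],
})

# SAS metadata, with TYPE already expressed as continuous/categorical.
sas_meta = pd.DataFrame({
    "MEMNAME": ["R02", "R02", "R02"],
    "NAME": ["F02CYCLE", "F02OC_R", "EXAMPLE_VAR"],
    "TYPE": ["categorical", "continuous", "continuous"],
})

# Align each variable with its SAS entry, then flag agreement.
merged = pandas_types.merge(
    sas_meta,
    how="left",
    left_on=["df", "varname"],
    right_on=["MEMNAME", "NAME"],
)
merged["type_match"] = merged["picsure_type"] == merged["TYPE"]

# Keep only the disagreements, as in the cell above.
mismatches = merged[~merged["type_match"]].reset_index(drop=True)
print(mismatches[["varname", "picsure_type", "TYPE"]])
```

With a left merge, a variable that has no SAS entry gets a missing `TYPE` and is therefore also reported as a mismatch.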

### Step 3: Peek at data to check disagreements between pandas and SAS.

The following code allows us to select a variable from the dataframe displayed above and peek at the data. 

In [5]:
# Manually check a specific variable by passing its name as the varname parameter
check_data(mismatch_comparison, file_dir, varname='F02OC_R')

Information for variable F02OC_R
	pandas type: categorical
	SAS type: continuous
Values in decoded data:
[nan 'CLERICAL' 'OPERATIVES' 'SERVICE' 'PROFESSIONAL' 'CRAFTSMAN'
 'MANAGERS' 'TRANSPORT' 'LABORERS' 'HOUSEHOLD'
 'UNCLASSIFIED (NOT OTHERWISE DEFINED)' 'FARM']
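
The kind of peek shown above can be approximated with plain pandas. The sketch below is an illustration only: `peek_values` is a hypothetical helper, not the real `check_data`, and it works on an in-memory Series rather than reading from `file_dir`. The example values are a subset of those printed above.

```python
import pandas as pd


def peek_values(values: pd.Series, pandas_type: str, sas_type: str) -> list:
    """Print both candidate types and the distinct values of one variable."""
    # unique() keeps the order of first appearance and includes missing values.
    distinct = values.unique().tolist()
    print(f"Information for variable {values.name}")
    print(f"\tpandas type: {pandas_type}")
    print(f"\tSAS type: {sas_type}")
    print("Values in decoded data:")
    print(distinct)
    return distinct


job_codes = pd.Series(
    ["CLERICAL", "OPERATIVES", "CLERICAL", "SERVICE", None],
    name="F02OC_R",
)
distinct = peek_values(job_codes, "categorical", "continuous")
```

Seeing text labels such as `CLERICAL` makes it clear that this variable should be treated as categorical, whatever type the SAS metadata records.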


In [6]:
# Without varname, check_data randomly chooses 5 of the mismatched variables to display as a sanity check
check_data(mismatch_comparison, file_dir)


 F03OCH_R
Information for variable F03OCH_R
	pandas type: categorical
	SAS type: continuous
Values in decoded data:
['UNCLASSIFIED (NOT OTHERWISE DEFINED)' 'OPERATIVES' 'CLERICAL'
 'TRANSPORT' 'CRAFTSMAN' nan 'PROFESSIONAL' 'SERVICE' 'MANAGERS'
 'LABORERS' 'FARM']

 JF02DATE
Information for variable JF02DATE
	pandas type: continuous
	SAS type: categorical
Values in decoded data:
[    0   126   124     1   -47     5   -21    76   -20   -70   -14    51
     6   -69   -13   -31   -99    -3    91   -32   -42    61  -365  -152
    19    -7   -15  -366  -360   -18    -1   176     3    77   -29    59
   112     7    21   131   -30  -103   -35   -54   178    -4    89    13
   198   -68    29 -1096    41    17   -28   -56   122    36    63 -1462
    28    33   -98     2   -50    70   173   -11   -62     9   -37  -688
  -187 -1461    14   105    20    49  -731  -105    -6   -49   -19   -51
    58   730    56    -2  -155    16   -17    53   159   -38   -63    27
   154     4   238    62    -9   

In [None]:
# Output varname, file and type (maybe file and MEMNAME)
# Add check to make sure all variables in data are in SAS files as well
# Only output the identified variable type - NOT the sas variable type
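
The second note above, checking that every variable in the data also appears in the SAS files, could be approached with a left merge and pandas' `indicator` option. This is a minimal sketch on toy data, not an implemented part of the workflow; `NOT_IN_SAS` is a hypothetical variable name, and which variables really appear in the SAS metadata is not known from this notebook.

```python
import pandas as pd

# Variables found in the decoded data (toy rows).
data_vars = pd.DataFrame({
    "df": ["R02", "R02", "R04"],
    "varname": ["ANONID", "F02CYCLE", "NOT_IN_SAS"],
})

# Variables listed in the SAS metadata (toy rows).
sas_meta = pd.DataFrame({
    "MEMNAME": ["R02", "R02"],
    "NAME": ["ANONID", "F02CYCLE"],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left merge.
checked = data_vars.merge(
    sas_meta,
    how="left",
    left_on=["df", "varname"],
    right_on=["MEMNAME", "NAME"],
    indicator=True,
)
missing_from_sas = checked.loc[checked["_merge"] == "left_only", ["df", "varname"]]
print(missing_from_sas)
```

An empty `missing_from_sas` would mean every variable in the data has a matching SAS entry.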