# Amazon SageMaker Data Wrangler demo
## Data source
This demo of Amazon SageMaker Data Wrangler uses the [UCI diabetic patient readmission dataset](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008). The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes.

A detailed description of all the attributes is provided in Table 1 of:

    Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.


In [11]:
import boto3
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'sagemaker/demo-diabetic-datawrangler'

s3_client = boto3.client("s3")

In [2]:
%%sh
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip
unzip dataset_diabetes.zip

Archive:  dataset_diabetes.zip
  inflating: dataset_diabetes/diabetic_data.csv  
  inflating: dataset_diabetes/IDs_mapping.csv  


--2021-03-11 00:43:51--  https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3347213 (3.2M) [application/x-httpd-php]
Saving to: ‘dataset_diabetes.zip’


## Split the data for demo purposes


In [3]:
import pandas as pd
df = pd.read_csv('dataset_diabetes/diabetic_data.csv', index_col = 'encounter_id')

In [6]:
demographic_feature_columns = 'patient_nbr,race,gender,age,weight'.split(',')
hospital_visits_feature_columns = 'patient_nbr,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,payer_code,medical_specialty,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,diag_1,diag_2,diag_3,readmitted'.split(',')
labs_feature_columns = 'patient_nbr,A1Cresult,max_glu_serum'.split(',')
medication_feature_columns = 'patient_nbr,metformin,repaglinide,nateglinide,chlorpropamide,glimepiride,acetohexamide,glipizide,glyburide,tolbutamide,pioglitazone,rosiglitazone,acarbose,miglitol,troglitazone,tolazamide,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed'.split(',')

Split the CSV into multiple CSVs, one per feature group, and upload them to an S3 bucket.

In [12]:
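dfs = []
suffix = ['demographic', 'hospital_visits', 'labs', 'medication']
for i, columns in enumerate([demographic_feature_columns, hospital_visits_feature_columns, 
                             labs_feature_columns, medication_feature_columns]):
    # Each subset keeps the encounter_id index and the shared patient_nbr column
    df_tmp = df[columns]
    dfs.append(df_tmp)
    # Write the subset locally, e.g. dataset_diabetes/diabetic_data_labs.csv
    fname = 'dataset_diabetes/diabetic_data_%s.csv' % suffix[i]
    df_tmp.to_csv(fname)
    # The S3 key mirrors the local path under the prefix
    s3_client.upload_file(fname, bucket, '%s/%s' % (prefix, fname))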
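
The column-wise split above can be illustrated offline on a tiny table. The following is a minimal sketch, not part of the original notebook: the `toy` DataFrame and its values are made up for illustration, and only a few of the real column names are used. It shows that each subset keeps the `encounter_id` index, so the pieces can be recombined later.

```python
import pandas as pd

# Synthetic stand-in for diabetic_data.csv; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "encounter_id": [1, 2, 3],
        "patient_nbr": [10, 10, 20],
        "race": ["A", "A", "B"],
        "gender": ["Female", "Female", "Male"],
        "A1Cresult": ["None", ">7", "Norm"],
        "max_glu_serum": ["None", "None", ">200"],
    }
).set_index("encounter_id")

# Hypothetical, shortened versions of the notebook's column groups.
column_groups = {
    "demographic": ["patient_nbr", "race", "gender"],
    "labs": ["patient_nbr", "A1Cresult", "max_glu_serum"],
}

# Column-wise split: every subset shares the encounter_id index.
splits = {name: toy[cols] for name, cols in column_groups.items()}

# Recombine on the shared index, dropping the duplicated patient_nbr column.
rejoined = splits["demographic"].join(splits["labs"].drop(columns="patient_nbr"))
print(rejoined)
```

Because `patient_nbr` is repeated in every subset, it has to be dropped from all but one piece (or used as part of the join key) when the pieces are recombined.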

In [15]:
print('Your data is uploaded to s3://%s/%s/' % (bucket, prefix))

Your data is uploaded to s3://sagemaker-us-east-1-029454422462/sagemaker/demo-diabetic-datawrangler/
