# Prepare Data


The goal of this stage of the project is to get data ready for all modelling and analysis.

By the end of this section the data will be **loaded, processed, partitioned and stored in S3**.

These stages are executed through the configuration API, which avoids hard-coding paths.
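
As a rough illustration of the idea, the sketch below keeps every data path in one place. The `ProjectPaths` class and its attribute names are hypothetical and are not the actual interface of `utils.config`; only the directory and file names (`raw/raw.csv`, `processed/processed.csv`, `partitioned`) are taken from later cells in this notebook.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ProjectPaths:
    """Hypothetical path configuration; NOT the real utils.config API."""

    data_dir: Path

    @property
    def raw(self) -> Path:
        # Location of the raw CSV relative to the data directory.
        return self.data_dir / "raw" / "raw.csv"

    @property
    def processed(self) -> Path:
        # Location of the processed CSV relative to the data directory.
        return self.data_dir / "processed" / "processed.csv"

    @property
    def partitioned(self) -> Path:
        # Directory that holds the train/validation/test files.
        return self.data_dir / "partitioned"


paths = ProjectPaths(data_dir=Path("data"))
print(paths.raw)
```

Scripts that take their locations from a single object like this can be pointed at a different data directory without editing each script.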

In [1]:
import pandas as pd
import sys
sys.path.append("../")
import utils.config as cfg
import utils.display as disp

## Download data

In this demonstration we will use a [Hospital Readmission Dataset](https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008) from the [UCI ML Repository](https://archive.ics.uci.edu/ml/index.php).

We need to download this data and unzip it into the right location for downstream processing.

All of this is done with the script: **src/load_data.sh**
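
For readers who prefer Python, the steps visible in the log output of this section can be approximated as follows. This is a sketch, not a translation of `load_data.sh`: the URL and archive member names are copied from the log, the target names `raw.csv` and `IDs_mapping.csv` from the `ls raw` listing, and the function names `extract_dataset` and `load_data` are made up for illustration.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the download log in this notebook.
DATA_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "00296/dataset_diabetes.zip"
)


def extract_dataset(zip_path: Path, raw_dir: Path) -> None:
    """Unpack the archive and place its two CSV files directly in raw_dir."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(raw_dir)
    unpacked = raw_dir / "dataset_diabetes"
    # Assumed renaming: the main table becomes raw.csv, as listed by `ls raw`.
    shutil.move(str(unpacked / "diabetic_data.csv"), str(raw_dir / "raw.csv"))
    shutil.move(str(unpacked / "IDs_mapping.csv"), str(raw_dir / "IDs_mapping.csv"))
    # Remove the now-empty folder created by the archive.
    unpacked.rmdir()


def load_data(raw_dir: Path = Path("raw")) -> None:
    """Download the archive, extract it, then delete the zip (needs network)."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    zip_path = raw_dir / "dataset_diabetes.zip"
    urllib.request.urlretrieve(DATA_URL, zip_path)
    extract_dataset(zip_path, raw_dir)
    zip_path.unlink()
```

Nothing is downloaded until `load_data()` is called explicitly.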


In [2]:
disp.display_file("../src/load_data.sh")

In [4]:
# EXECUTE IT USING THE MASTER RUN SCRIPT

!../RUN.sh load

Loading your data...
mkdir: raw: File exists
--2020-12-23 09:58:35--  https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3347213 (3.2M) [application/x-httpd-php]
Saving to: ‘raw/dataset_diabetes.zip’


2020-12-23 09:58:38 (1.62 MB/s) - ‘raw/dataset_diabetes.zip’ saved [3347213/3347213]

Archive:  dataset_diabetes.zip
  inflating: dataset_diabetes/diabetic_data.csv  
  inflating: dataset_diabetes/IDs_mapping.csv  
rm: dataset_diabetes/*: No such file or directory
Done.



In [5]:
!ls raw

IDs_mapping.csv README.md       raw.csv


### Inspect the Raw Data

In [6]:
df_data = pd.read_csv("raw/raw.csv", sep=",")
df_data.head()

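| Unnamed: 0 | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |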


## Global Data Engineering

At this point in the project we perform any data engineering that must be applied globally, before the data is partitioned.

* **We need to be aware that any processing done here needs to be replicated in the deployed model pipeline.**

In this demo, we add in values from a lookup table and re-shape the target variable.

All of this is done with the script: **src/process_data.py**
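
The two transformations can be sketched in pandas as below. This is an illustration under assumptions, not the contents of `process_data.py`: the lookup column name `description` and the function name `process` are hypothetical, and treating every value other than `NO` as a positive readmission is an assumption (the sample output in this notebook only shows `NO` becoming 0 and `>30` becoming 1).

```python
import pandas as pd


def process(df: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    """Replace an ID column with its description and binarise the target."""
    out = df.copy()
    # Assumed lookup layout: one row per ID with a text description.
    id_to_text = lookup.set_index("admission_type_id")["description"]
    out["admission_type_id"] = out["admission_type_id"].map(id_to_text)
    # Assumption: anything other than "NO" counts as the positive class.
    out["readmitted"] = (out["readmitted"] != "NO").astype(int)
    return out


# Tiny hand-made frames for illustration; the ID/description pairs mirror
# the values visible in the raw and processed samples in this notebook.
raw = pd.DataFrame({"admission_type_id": [6, 1], "readmitted": ["NO", ">30"]})
lookup = pd.DataFrame(
    {
        "admission_type_id": [1, 6],
        "description": [
            "Physician Referral",
            "Transfer from another health care facility",
        ],
    }
)
processed = process(raw, lookup)
print(processed)
```

Keeping the logic in a single function makes it easier to reuse the same step in the deployed pipeline.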

In [7]:
disp.display_file("../src/process_data.py")

We perform that processing using the master RUN script.

In [8]:
# EXECUTE IT USING THE MASTER RUN SCRIPT

!../RUN.sh process

Processing your data...
Done.



### Inspect Processed Data

In [9]:
df_data = pd.read_csv("processed/processed.csv", sep=",")
df_data.head()

| Unnamed: 0 | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | ? | Transfer from another health care facility | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | 0 |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | ? | Physician Referral | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | 1 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | Physician Referral | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | 0 |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | ? | Physician Referral | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | 0 |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | ? | Physician Referral | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | 0 |


## Partition the data

Next we partition the data in preparation for modelling.

The configuration file will determine how this partitioning is done.

The partitioning is performed by the script: **src/partition_data.py**
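
A minimal sketch of a shuffled three-way split is shown below. The function name, the default fractions and the seed are assumptions for illustration; the real values come from the configuration file. The file sizes in the directory listing in this section are consistent with roughly a 60/20/20 split, which is why those defaults are used here.

```python
import pandas as pd


def partition(df, train_frac=0.6, validation_frac=0.2, seed=42):
    """Shuffle rows, then cut them into train, validation and test frames."""
    # A fixed seed keeps the split reproducible between runs.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled.iloc[:n_train]
    validation = shuffled.iloc[n_train:n_train + n_validation]
    # Whatever remains after train and validation becomes the test set.
    test = shuffled.iloc[n_train + n_validation:]
    return train, validation, test


# Small made-up frame, for illustration only.
df = pd.DataFrame({"encounter_id": range(10), "readmitted": [0, 1] * 5})
train, validation, test = partition(df)
print(len(train), len(validation), len(test))
```

Each part could then be written out with `to_csv`, for example to `train.csv`, `validation.csv` and `test.csv`.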

In [10]:
disp.display_file("../src/partition_data.py")

Execute it using the master RUN script.

In [27]:
# NOW EXECUTE IT USING THE MASTER RUN SCRIPT

!../RUN.sh partition

Partitioning your data...
Data written to: /Users/jlhawk/Projects/aws-sagemaker-workbench-demo/data/partitioned
Done.



### Inspect the partitioned data files

In [28]:
!ls -la partitioned

total 40992
drwxr-xr-x  6 jlhawk  staff       192 23 Dec 10:37 .
drwxr-xr-x  8 jlhawk  staff       256 23 Dec 10:42 ..
-rw-r--r--  1 jlhawk  staff        73 22 Dec 14:45 README.md
-rw-r--r--  1 jlhawk  staff   4160353 23 Dec 10:43 test.csv
-rw-r--r--  1 jlhawk  staff  12477088 23 Dec 10:43 train.csv
-rw-r--r--  1 jlhawk  staff   4160516 23 Dec 10:43 validation.csv


## Store the data

Finally, we upload the data to S3 so that the SageMaker modelling instances can use it.

This is performed by the script: **src/store_data.py**
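
The upload step might look roughly like the sketch below. It is not the contents of `store_data.py`: the function name `upload_partitions`, the bucket name and the key prefix are placeholders. The only library call relied on is the boto3 S3 client method `upload_file(Filename, Bucket, Key)`, and the client is passed in so the function can be exercised without AWS access.

```python
from pathlib import Path


def upload_partitions(s3_client, local_dir, bucket, prefix):
    """Upload every CSV in local_dir under prefix/ and return the keys used."""
    keys = []
    for csv_path in sorted(Path(local_dir).glob("*.csv")):
        key = f"{prefix}/{csv_path.name}"
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
        s3_client.upload_file(str(csv_path), bucket, key)
        keys.append(key)
    return keys


# Usage (requires boto3 and AWS credentials; names below are placeholders):
# import boto3
# upload_partitions(boto3.client("s3"), "partitioned", "my-bucket", "data")
```

Only `.csv` files are picked up, so the `README.md` in the partitioned directory would be left out.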


In [30]:

disp.display_file("../src/store_data.py")


In [None]:
!../RUN.sh store