# Redshift ML Workshop 
---

### Overview of the Workshop
All labs in this workshop use Jupyter notebooks running on Amazon SageMaker Notebook Instances. The workshop covers three labs, one for each of three user personas:   
1. **Data Analyst - Naive Machine Learning User**  
2. **Advanced Data Analyst - Intermediate Machine Learning User**  
3. **Data Scientist - Advanced Machine Learning Expert**  


### Lab Components
    
* __Jupyter Notebook__:  
You are currently in a Jupyter notebook. This is an exploratory environment where you can run many different types of code, see the results, and interact with them. Each lab in this workshop is a single notebook. These notebooks are accessible through the table of contents at the top of any lab.

* __Amazon SageMaker Notebook Instance__:  
This notebook is running in an Amazon SageMaker notebook instance. This is a fully managed Amazon EC2 instance that has a preconfigured Jupyter notebook server and a set of `conda` libraries. All necessary dependencies for the labs in this workshop are already present. 

* __`conda` Python Kernel__:  
Kernels are processes that receive and execute interactive code and return output to the user. The notebook frontend communicates with the kernel backend. In these labs we use the `conda_python3` kernel.

[Project Jupyter]: https://jupyter.org/
[SageMaker example notebooks]: https://github.com/awslabs/amazon-sagemaker-examples


### Tips:
 
* Labs progress by running the grey `code` cells _in order_ top to bottom.
* Each cell has a title text to explain what happens when you run it.
* Chrome is recommended, but any modern browser should work.
* Poor network connectivity may cause minor delays when navigating the notebook.
* When a cell is running you will see the text to the left change to `In [*]:`.
* When a cell's code has finished you will see the text to the left change to `In [19]:`. 
    * The number indicates the order in which the cell was run.
* We're here to help. If you get stuck or something doesn't work, please let us know.
* **Finally** - You're free to experiment and rerun cells. 
    * Nothing should break if cells are rerun or run out of order.

## 1. Customize Labs notebooks for your test account
Set up the credentials used to access the Redshift cluster.
In this step, replace the `host_name` value with your Redshift cluster's hostname (endpoint).

We will also install the Python libraries needed for this notebook.

-----
**Expected Outputs**: None

In [1]:
%%bash


echo "{
  \"user_name\": \"awsuser\",
  \"password\": \"Awsuser123\",
  \"host_name\": \"redshift-ml-demo.cehdaz7u3h70.us-west-2.redshift.amazonaws.com\",
  \"port_num\": \"5439\",
  \"db_name\": \"dev\"
}" > redshift-ml-workshop.creds

cat redshift-ml-workshop.creds

pip install psycopg2-binary
pip install sqlalchemy 
pip install simplejson
pip install ipython-sql




{
  "user_name": "awsuser",
  "password": "Awsuser123",
  "host_name": "redshift-ml-demo.cehdaz7u3h70.us-west-2.redshift.amazonaws.com",
  "port_num": "5439",
  "db_name": "dev"
}


You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


## 2. Connect to your Redshift cluster and run a query
You will use the sqlalchemy and ipython-sql Python libraries to manage the Redshift connection.  
This test confirms that you can proceed with the rest of the Labs.


-----

Replace the hostname in the connection string with your Redshift cluster's endpoint to connect.

**Sample Outputs**:

| `current_user` | `version` |
| --- | --- |
| awsuser | PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.23274 |

In [2]:
%reload_ext sql

%sql postgresql+psycopg2://awsuser:Awsuser123@redshift-ml-demo.cehdaz7u3h70.us-west-2.redshift.amazonaws.com:5439/dev
#SELECT current_user, version();

In [3]:
%sql SELECT current_user, version();

 * postgresql+psycopg2://awsuser:***@redshift-ml-demo.cehdaz7u3h70.us-west-2.redshift.amazonaws.com:5439/dev
1 rows affected.


current_user,version
awsuser,"PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.23274"
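
The connection string above repeats the values already stored in `redshift-ml-workshop.creds`. As a minimal sketch (the helper names `load_creds` and `build_connection_url` are hypothetical, not part of the workshop), the same URL could be assembled from that file, which avoids editing the hostname in two places:

```python
import json
from urllib.parse import quote_plus


def load_creds(path="redshift-ml-workshop.creds"):
    """Read the JSON credentials file written in step 1."""
    with open(path) as f:
        return json.load(f)


def build_connection_url(creds):
    """Assemble a SQLAlchemy-style URL from the credential fields."""
    # quote_plus escapes characters such as '@' or ':' that would
    # otherwise break the user:password@host part of the URL.
    user = quote_plus(creds["user_name"])
    password = quote_plus(creds["password"])
    return (
        f"postgresql+psycopg2://{user}:{password}"
        f"@{creds['host_name']}:{creds['port_num']}/{creds['db_name']}"
    )
```

In a notebook cell, `build_connection_url(load_creds())` should produce a string of the same shape as the one passed to `%sql` above.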


# Redshift-ML-Workshop - Use Case 1 - Data Analyst User
---

### Data Set Information: ###
### Bank Marketing data set ###

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed to ('yes') or not ('no').


The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

### Attribute Information /Input variables / bank client data: ###
1 - age (numeric)   
2 - job  
3 - marital   
4 - education   
5 - default  
6 - housing  
7 - loan  
8 - contact 
9 - month  
10 - day_of_week  
11 - duration  
12 - campaign  
13 - pdays  
14 - previous  
15 - poutcome  
16 - emp.var.rate  
17 - cons.price.idx  
18 - cons.conf.idx  
19 - euribor3m  
20 - nr.employed  

Output variable (desired target):  
21 - y - has the client subscribed to a term deposit? (binary: 'yes','no')  


**Reference:** https://archive.ics.uci.edu/ml/datasets/bank+marketing

Complete SQL is here <add the file>
    
The sample dataset is already loaded into the `bank_details_training` and `bank_details_inference` tables.

A model created with the default runtime (90 minutes) is already prebuilt in your Redshift cluster. In this workshop we will run a modified `CREATE MODEL` statement with the `MAX_RUNTIME` option set to 900 seconds, so that training fits within the live session.

For the inference queries we can use the SQL function created by the prebuilt model.



#### Modified CREATE MODEL for Use Case 1 - Data Analyst User

In [None]:
%%sql
/* -- Create model for bank marketing use case with max runtime 900 secs -- */
 CREATE MODEL model_bank_marketing_v2
FROM (
    SELECT age
        ,job
        ,marital
        ,education
        ,"default"
        ,housing
        ,loan
        ,contact
        ,month
        ,day_of_week
        ,duration
        ,campaign
        ,pdays
        ,previous
        ,poutcome
        ,emp_var_rate
        ,cons_price_idx
        ,cons_conf_idx
        ,euribor3m
        ,nr_employed
        ,y
    FROM bank_details_training
    ) TARGET y 
FUNCTION func_model_bank_marketing_v2 
IAM_ROLE '<< replace IAM role arn >>' 
SETTINGS(S3_BUCKET '<< replace S3 output bucket >>', MAX_RUNTIME 900);
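
The statement above only runs once both placeholders are replaced. The sketch below shows one way to fill them in from Python and to fail early if a placeholder is left behind. The function `render_create_model` and the example ARN and bucket name in the usage note are hypothetical; the clause order simply follows the statement in this lab.

```python
def render_create_model(model, function, table, target, features,
                        iam_role, s3_bucket, max_runtime=900):
    """Return a CREATE MODEL statement shaped like the one in this lab."""
    # Refuse to render while the workshop placeholders are still in place.
    for name, value in (("iam_role", iam_role), ("s3_bucket", s3_bucket)):
        if "<<" in value or not value.strip():
            raise ValueError(f"{name} still needs a real value: {value!r}")
    # The target column is selected together with the feature columns.
    columns = "\n        ,".join(list(features) + [target])
    return (
        f"CREATE MODEL {model}\n"
        f"FROM (\n"
        f"    SELECT {columns}\n"
        f"    FROM {table}\n"
        f"    ) TARGET {target}\n"
        f"FUNCTION {function}\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}', MAX_RUNTIME {max_runtime});"
    )
```

For example, calling it with `iam_role="arn:aws:iam::123456789012:role/ExampleRole"` and `s3_bucket="example-bucket"` (both made-up values) yields text that can be printed and pasted into a `%%sql` cell. Reserved words such as `"default"` must be passed already quoted.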


#### Show MODEL for Use Case 1 

In [None]:
%%sql
/* -- Show all models -- */
 SHOW model ALL;

In [None]:
%%sql
/* -- Show prebuilt model for bank marketing  -- */
 SHOW model model_bank_marketing;

In [None]:
%%sql
/* -- Show model for bank marketing created during the workshop -- */
 SHOW model model_bank_marketing_v2;
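
Because training with `MAX_RUNTIME 900` takes several minutes, you may end up rerunning `SHOW MODEL` until the model is usable. The sketch below shows a generic polling loop for that. It is an illustration only: `fetch_state` is a hypothetical callable that you would write to return the model state from the `SHOW MODEL` output, and the state names `READY` and `FAILED` are assumptions to check against what your cluster actually reports.

```python
import time

# Assumed names of the states in which training has finished.
TERMINAL_STATES = {"READY", "FAILED"}


def wait_for_model(fetch_state, poll_seconds=60, timeout_seconds=1800,
                   sleep=time.sleep):
    """Call fetch_state() until it reports a terminal state or time runs out."""
    waited = 0
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"model still in state {state!r} after {waited} seconds"
            )
        # Pause between checks so the cluster is not queried in a tight loop.
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter is injectable so that the loop can be exercised without real waiting.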

#### Check Inference/Accuracy of the model `model_bank_marketing`
This query checks the accuracy of the model. It calls the function created by the prebuilt model on the data in the inference table `bank_details_inference` and compares each prediction with the actual value. Feel free to run the same query against the function created by the model we built in the workshop (`func_model_bank_marketing_v2`).

In [None]:
%%sql

/* -- Check accuracy for bank marketing using prebuilt model function  -- */
 WITH infer_data
AS (
	SELECT y AS actual
		,func_model_bank_marketing(age, job, marital, education, "default", housing, loan, contact, month, day_of_week, duration, campaign, pdays, previous, poutcome, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, nr_employed) AS predicted
		,CASE 
			WHEN actual = predicted
				THEN 1::INT
			ELSE 0::INT
			END AS correct
	FROM bank_details_inference
	)
	,aggr_data
AS (
	SELECT SUM(correct) AS num_correct
		,COUNT(*) AS total
	FROM infer_data
	)
SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy
FROM aggr_data;
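
To make the arithmetic of the accuracy query explicit, here is a small Python sketch of the same calculation on in-memory lists. The function name `accuracy` is illustrative. As in the SQL `CASE`, a row counts as correct only when both values are present and equal, so missing values count as incorrect.

```python
def accuracy(actual, predicted):
    """Fraction of rows where the prediction equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one row is required")
    # Mirrors CASE WHEN actual = predicted THEN 1 ELSE 0 END:
    # a comparison involving a missing value never counts as correct.
    num_correct = sum(
        1 if a is not None and p is not None and a == p else 0
        for a, p in zip(actual, predicted)
    )
    return num_correct / len(actual)
```

With three matching rows out of four, the function returns 0.75, which corresponds to `num_correct / total` in the final `SELECT`.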

#### Predict how many customers will subscribe to a term deposit and how many will not
We run this query against the data in the inference table `bank_details_inference`.

#### Sample output for prediction query

```sql
     deposit_prediction     | count
----------------------------+-------
 Yes-will-do-a-term-deposit |  5362
 No-term-deposit            | 35826
(2 rows)

```

In [None]:
%%sql 
/* -- Predict whether the customer will do a term deposit or not  -- */
WITH term_data AS ( SELECT func_model_bank_marketing( age,job,marital,education,"default",housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed) AS predicted 
FROM bank_details_inference )
SELECT 
CASE WHEN predicted = 'Y'  THEN 'Yes-will-do-a-term-deposit'
     WHEN predicted = 'N'  THEN 'No-term-deposit'
     ELSE 'Neither' END as deposit_prediction,
COUNT(1) AS count
from term_data GROUP BY 1;
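
The prediction query maps each predicted value to a readable label and then counts rows per label. A rough Python equivalent of that `CASE` plus `GROUP BY`, using a hypothetical helper `count_predictions`, looks like this:

```python
from collections import Counter

# Same mapping as the CASE expression in the query above.
LABELS = {
    "Y": "Yes-will-do-a-term-deposit",
    "N": "No-term-deposit",
}


def count_predictions(predicted):
    """Count predictions per label; anything unexpected becomes 'Neither'."""
    return Counter(LABELS.get(p, "Neither") for p in predicted)
```

Any value other than `'Y'` or `'N'`, including a missing prediction, falls into the `Neither` bucket, just as in the `ELSE` branch of the query.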

# Redshift-ML-Workshop - Use Case 2 - Advanced Data Analyst User
----  

### Data Set Information: ####
### Iris data set ###

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) 
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

*Predicted attribute:* class of iris plant.


Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
    * Iris Setosa
    * Iris Versicolour
    * Iris Virginica

In this use case, the user creates the model and supplies additional information, such as the `PROBLEM_TYPE` and `OBJECTIVE`, as part of the `CREATE MODEL` statement.

SageMaker Autopilot then uses the `PROBLEM_TYPE` and `OBJECTIVE` specified by the user instead of trying every option.

For this example, the `PROBLEM_TYPE` we provide is `multiclass_classification`.

For all problem types supported by SageMaker Autopilot, see https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development-problem-types.html

The `OBJECTIVE` we provide is `accuracy`. The objective metric is used to measure the predictive quality of a machine learning system.

*Defaults:* MSE for regression, F1 for binary classification, and Accuracy for multiclass classification.

For all supported objectives, see https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html#r_user_guidance_create_model

Complete SQL is here <insert sql here > 

The sample dataset is already loaded into the `iris_data_train` and `iris_data_test` tables.

A model created with the default runtime (90 minutes) is already prebuilt in your Redshift cluster. For the workshop we will run a modified `CREATE MODEL` statement with the `MAX_RUNTIME` option set to 900 seconds.

For the inference queries we can use the SQL function created by the prebuilt model.

Make sure to replace the IAM role and S3 bucket placeholders.


#### Create Model for Advanced Data Analyst User ####

In [None]:
%%sql
/* -- Create model for iris use case - with max runtime 900 secs -- */
 CREATE MODEL model_iris_v2
FROM (
SELECT 
   Id,
   SepalLengthCm,
   SepalWidthCm,
   PetalLengthCm,
   PetalWidthCm,
   Species
FROM iris_data_train
)
TARGET Species 
FUNCTION func_model_iris_v2 IAM_ROLE '<< replace IAM role arn >>' 
PROBLEM_TYPE multiclass_classification 
OBJECTIVE 'accuracy' 
SETTINGS (S3_BUCKET '<< replace S3 output bucket >>', MAX_RUNTIME 900);



### Show model for Iris data set for Use case 2

In [None]:
%%sql
/* -- show pre built model for iris -- */
 SHOW model model_iris;

In [None]:
%%sql 
/* -- show model for model created during the workshop -- */
 SHOW model model_iris_v2;

#### Check Inference/Accuracy of the model `model_iris`
This query checks the accuracy of the model. It calls the function created by the prebuilt model on the data in the inference table `iris_data_test`.

Feel free to run the same query against the function created by the model we built in the workshop (`func_model_iris_v2`).

In [None]:
%%sql
/* -- Inference query for iris data set -- */
WITH infer_data AS (
    SELECT Species AS label,
        func_model_iris(Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm) AS predicted,
        CASE WHEN label is NULL THEN NULL ELSE label END AS actual,
        CASE WHEN actual = predicted THEN 1::INT
        ELSE 0::INT END AS correct
    FROM iris_data_test
),
aggr_data AS (
    SELECT SUM(correct) as num_correct, COUNT(*) as total FROM infer_data
)
SELECT (num_correct::float/total::float) AS accuracy FROM aggr_data;

#### Predict the class of the Iris flower using the testing data set
We run this query against the data in the inference table `iris_data_test`.

#### Sample output for prediction 

```sql
  class_distribution   | count
-----------------------+-------
 Class-Iris-versicolor |    82
 Class-Iris-setosa     |    81
 Class-Iris-virginica  |    88
(3 rows)
```


In [None]:
%%sql
/* -- Predict the Iris flower class -- */ 
WITH class_data AS ( SELECT func_model_iris( 
   Id,
   SepalLengthCm,
   SepalWidthCm,
   PetalLengthCm,
   PetalWidthCm) AS class 
FROM iris_data_test )
SELECT 
CASE WHEN class = 'Iris-versicolor'  THEN 'Class-Iris-versicolor'
     WHEN class = 'Iris-setosa'  THEN 'Class-Iris-setosa'
     WHEN class = 'Iris-virginica'  THEN 'Class-Iris-virginica'
     ELSE 'Class-Other' END as class_distribution,
COUNT(1) AS count
from class_data GROUP BY 1;

# Redshift-ML-Workshop - Use Case 3 - Data Scientist / Machine Learning Expert
---

### Data Set Information: ###

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

From the original data, examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled for use with an ANN (by dividing by 200).


Attribute Information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.

| Name | Data Type | Measurement Unit | Description |
| --- | --- | --- | --- |
| Sex | nominal | -- | M, F, and I (infant) |
| Length | continuous | mm | Longest shell measurement |
| Diameter | continuous | mm | perpendicular to length |
| Height | continuous | mm | with meat in shell |
| Whole weight | continuous | grams | whole abalone |
| Shucked weight | continuous | grams | weight of meat |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight | continuous | grams | after being dried |
| Rings | integer | -- | +1.5 gives the age in years |
  
*Reference* : https://archive.ics.uci.edu/ml/datasets/Abalone  

In this example, the user is an advanced machine learning expert. Autopilot is not used (`AUTO OFF`), and the user directly provides advanced properties, including preprocessors and hyperparameters.


We are going to provide the `MODEL_TYPE`, `OBJECTIVE`, `PREPROCESSORS`, and `HYPERPARAMETERS`.

For all supported options, see https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html#r_auto_off_create_model


Complete SQL is here <insert sql here > 

The sample dataset for this use case goes into the `abalone_xgb_train` and `abalone_xgb_test` tables.

We will run `CREATE MODEL` live in the session using `xgboost`, which should take about 15 minutes.

For the inference queries we can use the SQL function created by the model.

Before running `CREATE MODEL`, we will create the tables and load the sample data into them. Make sure to replace the IAM role placeholder in the `COPY` statements.


In [None]:
%%sql
/* -- Create table for xgboost model training -- */
CREATE TABLE abalone_xgb_train (
length_val float, 
diameter float, 
height float,
whole_weight float, 
shucked_weight float, 
viscera_weight float,
shell_weight float, 
rings int
);


In [None]:
%%sql
/* -- Create table for xgboost model testing -- */
CREATE TABLE abalone_xgb_test (
length_val float, 
diameter float, 
height float,
whole_weight float, 
shucked_weight float, 
viscera_weight float,
shell_weight float, 
rings int
);


In [None]:
%%sql
/* -- COPY for abalone_xgb_train -- */
COPY abalone_xgb_train FROM 's3://redshift-downloads/redshift-ml/workshop/xgboost_abalone_data/train/' REGION 'us-east-1' IAM_ROLE '<< replace IAM role arn >>' IGNOREHEADER 1 CSV;


In [None]:
%%sql
/* -- COPY for abalone_xgb_test -- */
COPY abalone_xgb_test FROM 's3://redshift-downloads/redshift-ml/workshop/xgboost_abalone_data/test/' REGION 'us-east-1' IAM_ROLE '<< replace IAM role arn >>' IGNOREHEADER 1 CSV;


In [None]:
%%sql
/* -- Create model -- */
CREATE MODEL model_abalone_xgboost_regression 
FROM (SELECT
      length_val,
      diameter,
      height,
      whole_weight,
      shucked_weight,
      viscera_weight,
      shell_weight,
      rings
     FROM abalone_xgb_train)
TARGET Rings 
FUNCTION func_model_abalone_xgboost_regression 
IAM_ROLE '<< replace IAM role arn >>' 
AUTO OFF 
MODEL_TYPE xgboost 
OBJECTIVE 'reg:squarederror' 
PREPROCESSORS 'none' 
HYPERPARAMETERS DEFAULT EXCEPT (NUM_ROUND '100') 
SETTINGS (S3_BUCKET '<< S3 bucket >>');


### Show model for xgboost ###

In [None]:
%%sql
/* -- Show model created with xgboost -- */
SHOW model model_abalone_xgboost_regression;

#### Check Inference/accuracy of the model ####
For regression problems, we measure model quality with the Mean Squared Error (MSE) or Root Mean Squared Error (RMSE); the lower the value, the better. The query below computes the RMSE on the test table.

In [None]:
%%sql
/* -- Accuracy query -- */
WITH infer_data AS (
    SELECT Rings AS label, func_model_abalone_xgboost_regression(
Length_val, Diameter, Height, Whole_weight, Shucked_weight, Viscera_weight,
Shell_weight
) AS predicted,
    CASE WHEN label is NULL THEN 0 ELSE label END AS actual
    FROM abalone_xgb_test
)
SELECT SQRT(AVG(POWER(actual - predicted, 2))) AS rmse FROM infer_data;
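
For reference, the RMSE computed by the query can be reproduced in a few lines of Python. This is a sketch on in-memory lists, with a hypothetical function name `rmse`. It keeps the query's choice of treating a missing label as 0.

```python
import math


def rmse(labels, predicted):
    """Root mean squared error between labels and predictions."""
    if len(labels) != len(predicted) or not labels:
        raise ValueError("labels and predicted must be non-empty and equal in length")
    # Mirrors CASE WHEN label IS NULL THEN 0 ELSE label END AS actual.
    actual = [0 if label is None else label for label in labels]
    # Mirrors SQRT(AVG(POWER(actual - predicted, 2))).
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.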


#### Predict the age group of Abalone Species for harvesting, run on the test table #### 

Sample output

```sql
     age_group     | count
-------------------+-------
 age_between_10_20 |   589
 age_between_5_10  |   247
 age_5_and_under   |     1
 age_over_20       |     1
(4 rows)
```

In [None]:
%%sql
/* -- Prediction query -- */
WITH age_data AS ( SELECT func_model_abalone_xgboost_regression( length_val, 
                                               diameter, 
                                               height, 
                                               whole_weight, 
                                               shucked_weight, 
                                               viscera_weight, 
                                               shell_weight ) + 1.5 AS age
FROM abalone_xgb_test )
SELECT 
CASE WHEN age  > 20 THEN 'age_over_20'
     WHEN age  > 10 THEN 'age_between_10_20'
     WHEN age  > 5  THEN 'age_between_5_10'
     ELSE 'age_5_and_under' END as age_group,
COUNT(1) AS count
from age_data GROUP BY 1;
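
The bucketing in the prediction query can be read as the following Python sketch (the function name `age_group` is illustrative). The predicted ring count is first converted to an age by adding 1.5, as described in the attribute table, and the first matching threshold wins.

```python
def age_group(predicted_rings):
    """Map a predicted ring count to the age bucket used in the query."""
    # Rings + 1.5 gives the estimated age in years.
    age = predicted_rings + 1.5
    # The comparisons are strict, so an age of exactly 10 or 20
    # falls into the lower bucket, as in the SQL CASE.
    if age > 20:
        return "age_over_20"
    if age > 10:
        return "age_between_10_20"
    if age > 5:
        return "age_between_5_10"
    return "age_5_and_under"
```

Note that the order of the checks matters: testing `age > 5` first would place every older abalone in `age_between_5_10`.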

# System Tables for debugging #

In [None]:
%%sql
/* -- stv_ml_model_info -- */
SELECT * FROM stv_ml_model_info WHERE model_name='model_abalone_xgboost_regression';
