# Amazon SageMaker Data Wrangler

## **Contents**
1. [About the dataset](#About-the-dataset)
1. [Data preprocessing](#Data-preprocessing)

## About the dataset

The original data comes from the [Jester Datasets for Recommender Systems and Collaborative Filtering Research](http://eigentaste.berkeley.edu/dataset/). It was converted to CSV format and uploaded to S3 with the notebook [jester_dataset.ipynb](jester_dataset.ipynb).

The Jester data comes in three parts (datasets 1, 3, and 4). This demo uses dataset 4 because it is more recent and contains fewer records than the earlier ones.

Dataset 4 contains over 100,000 new ratings from 7,699 users, collected from April 2015 to November 2019.

### Download data from S3

```python
csv_orig = 'jester_ds4_orig.csv'
csv_long = 'jester_ds4_long.csv'
```


```bash
%%bash -s "$csv_orig" "$csv_long"

# -p keeps the cell re-runnable when the directory already exists
mkdir -p data
aws s3 cp "s3://your-bucket/your-folder/$1" "data/$1"
aws s3 cp "s3://your-bucket/your-folder/$2" "data/$2"
```


### Original data as a csv file

```bash
%%bash -s "data/$csv_orig"

head -2 "$1"
```


user_id,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158
1,99,99,99,99,99,99,99.0,99.0,99,99,99,99,99.0,99,99.0,99.0,99.0,99.0,99.0,99,99.0,99.0,99.0,99.0,99.0,99.0,99,99.0,99.0,99.0,99,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99,99,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,3.7,99,99.0,99.0,99.0,99.0,99.0,99.0,99,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99.0,99,99.0,99.0

### Converted from wide form to long form

```bash
%%bash -s "data/$csv_long"

head -5 "$1"
```

```
user_id,item_id,rating
1,1,99.0
1,2,99.0
1,3,99.0
1,4,99.0
```
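
The conversion itself is done in `jester_dataset.ipynb`, which is not reproduced here. As a rough illustration only, a wide-to-long reshape of this kind could be sketched with `pandas.DataFrame.melt`. The three-joke sample data and the variable names below are made up for the example and are not taken from that notebook.

```python
import io

import pandas as pd

# Hypothetical miniature of the wide file: one row per user, one column per joke id.
wide_csv = io.StringIO(
    "user_id,1,2,3\n"
    "1,99,99,3.7\n"
    "2,-1.5,99,8.0\n"
)
df_wide = pd.read_csv(wide_csv)

# Turn every joke column into its own (user_id, item_id, rating) row.
df_long = (
    df_wide.melt(id_vars="user_id", var_name="item_id", value_name="rating")
    .astype({"item_id": int, "rating": float})
    .sort_values(["user_id", "item_id"])
    .reset_index(drop=True)
)

print(df_long)
```

Each user ends up with one row per joke, so a wide table with `n` users and `m` joke columns yields `n * m` rows in long form.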


## Data preprocessing
Each rating ranges from -10.00 to +10.00; a value of 99 marks a null rating (the user did not rate that joke).
22 of the jokes have no ratings at all. Their ids are: {1, 2, 3, 4, 5, 6, 9, 10, 11, 12, 14, 20, 27, 31, 43, 51, 52, 61, 73, 80, 100, 116}.
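
To make the sentinel convention concrete, here is a minimal pandas sketch that treats 99 as missing and lists the items that never received a real rating. It is not part of the Data Wrangler flow; the sample rows and the names `NULL_RATING`, `df_rated`, and `unrated_items` are assumptions made for the illustration.

```python
import io

import numpy as np
import pandas as pd

NULL_RATING = 99  # sentinel for "user did not rate this joke"

# Hypothetical long-form sample; the real file is data/jester_ds4_long.csv.
long_csv = io.StringIO(
    "user_id,item_id,rating\n"
    "1,1,99.0\n"
    "1,7,3.7\n"
    "2,1,99.0\n"
    "2,7,-9.5\n"
    "2,8,99.0\n"
)
df = pd.read_csv(long_csv)

# Mark the sentinel as missing, then keep only real ratings.
df["rating"] = df["rating"].replace(NULL_RATING, np.nan)
df_rated = df.dropna(subset=["rating"]).reset_index(drop=True)

# Items that appear in the data but have no real rating from any user.
unrated_items = set(df["item_id"]) - set(df_rated["item_id"])

print(sorted(unrated_items))
```

Run against the full long-form file, the same set difference should give the joke ids with no ratings in that file.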



### Create new data flow
* [File] => [New] => [Flow], or [File] => [New Launcher] => [New data flow], or [click me](empty.flow)
* Click on [Amazon S3], select `jester_ds4_long.csv` from S3
* Click on [Import dataset]

### Transform data

* Add transformer, Handle outliers, Min-max numeric outliers
* Add transformer, Process numeric, Min-max scaler
* (Optional) Add transformer, Custom transform, Python (Pandas), and enter code like the following

```python
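# Table is available as variable `df`
import time

# Append a `ts` column holding the current Unix timestamp (in seconds)
ts_now = int(time.time())
df.insert(len(df.columns), 'ts', ts_now)

# Prepend an `id` column built from a fresh 0-based row index
df = df.reset_index(drop=True)
df.insert(0, 'id', df.index)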
```
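
For readers who want to see what the two built-in steps above do to the ratings, the following pandas sketch approximates them outside Data Wrangler. It assumes the outlier step removes rows outside the valid range -10 to +10 (the fix method chosen in the UI is not stated here) and that the scaler maps to [0, 1]; the sample values and names are hypothetical.

```python
import pandas as pd

# Hypothetical ratings, including the 99 null sentinel.
df_demo = pd.DataFrame({"rating": [-10.0, -2.5, 0.0, 7.5, 10.0, 99.0]})

LOWER, UPPER = -10.0, 10.0  # valid rating range from the dataset description

# Outlier handling (assumption: drop rows outside [LOWER, UPPER]).
in_range = df_demo["rating"].between(LOWER, UPPER)
df_clean = df_demo[in_range].copy()

# Min-max scaling: (x - min) / (max - min) maps the column onto [0, 1].
r_min = df_clean["rating"].min()
r_max = df_clean["rating"].max()
df_clean["rating_scaled"] = (df_clean["rating"] - r_min) / (r_max - r_min)

print(df_clean)
```

Handling the sentinel before scaling matters: if 99 stayed in the column it would become the maximum and compress every real rating into the lower part of the scaled range.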

### Analysis

* Create new Analysis, Configure Tab, Chart Histogram, X axis RATING
* (Optional) Create new Analysis, Code Tab, and enter code like the following, which uses [Altair: Declarative Visualization in Python](https://altair-viz.github.io/)

```python
# Table is available as variable `df` of pandas dataframe
# Output Altair chart is available as variable `chart`
import altair as alt

df_tmp = df['ITEM_ID'].value_counts()

quantiles = [0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.98, 0.99, 1]
qntl = df_tmp.quantile(quantiles)
df_qntl = qntl.to_frame('Observation')
df_qntl.index.rename('%', inplace=True)
df_qntl['Quantiles'] = df_qntl.index


chart = alt.Chart(df_qntl).mark_line().encode(
  x=alt.X(
    "Quantiles:Q",
    axis=alt.Axis(
      tickCount=df_qntl.shape[0],
      grid=False,
      labelExpr="datum.value % 1 ? null : datum.label",
    )
  ),
  y='Observation'
)

```


### Export data flow
* Export as Data Wrangler Job
* (Optional) Export as Pipeline
* (Optional) Export as Feature Store