## An Introduction to SageMaker LDA in R
#### R implementation counterpart of "An Introduction to SageMaker LDA"

**Amazon SageMaker LDA** is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. **Latent Dirichlet Allocation (LDA)** is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. 

This notebook is the `R` counterpart of **An Introduction to SageMaker LDA**, which is written in `Python`. Because the majority of the SageMaker sample notebooks are written in `Python`, data scientists who already work in `R` can use this notebook as a reference instead of converting their existing `R` scripts and notebooks.

As with the original `Python` notebook, this notebook does not aim to:

- discuss and interpret the generated synthetic document data
- understand the LDA model
- interpret the meaning of the inference output

### I. Setting Up

The `reticulate` package lets `R` call `Python` libraries directly. Its `import()` function is the `R` counterpart of the `Python` `import` statement, so the `Python` code below can be written and executed in `R` almost line for line.

```
import sagemaker

session = sagemaker.Session()

...
```

In [1]:
library(reticulate)
sagemaker <- import('sagemaker')

In [2]:
session <- sagemaker$Session()

bucket <- '581320662326-sagemaker-ap-southeast-2'
prefix <- 'sagemaker/DEMO-lda-introduction'
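The translation pattern used throughout this notebook is mechanical: a `Python` `import` becomes `reticulate`'s `import()`, and `.` attribute access becomes `$`. The sketch below shows this with the `os` module from the `Python` standard library rather than `sagemaker`, so it runs without AWS credentials. It assumes only that `reticulate` can find a `Python` installation.

```r
library(reticulate)

# Python: import os
os <- import('os')

# Python: os.path.basename('sagemaker/DEMO-lda-introduction')
base_name <- os$path$basename('sagemaker/DEMO-lda-introduction')

# The Python str result is converted back to an R character vector.
base_name
```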

The `role_arn` is the IAM Role ARN used to give training and hosting access to your data.

In [3]:
role_arn <- sagemaker$get_execution_role()
role_arn

The `Python` code below translates to the next set of `R` statements to get the target `container` for the current `region`.

```
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

region_name = boto3.Session().region_name
container = get_image_uri(region_name, 'lda')
```

In [5]:
registry <- sagemaker$amazon$amazon_estimator$registry(session$boto_region_name, algorithm='lda')
container <- paste(registry, '/lda:latest', sep='')

container

### II. Training

The `Python` code below translates to the next set of `R` statements, which specify the general training job settings and the hyperparameters for LDA.

```
lda = sagemaker.estimator.Estimator(
    container,
    role,
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    train_instance_count=1,
    train_instance_type='ml.c4.2xlarge',
    sagemaker_session=session,
)

lda.set_hyperparameters(
    num_topics=num_topics,
    feature_dim=vocabulary_size,
    mini_batch_size=num_documents_training,
    alpha0=1.0,
)

```

In [6]:
# Match the Python original: s3://<bucket>/<prefix>/output
s3_output <- paste0('s3://', bucket, '/', prefix, '/output')
estimator <- sagemaker$estimator$Estimator(image_name = container,
                                           role = role_arn,
                                           train_instance_count = 1L,
                                           train_instance_type = 'ml.c4.2xlarge',
                                           output_path = s3_output,
                                           output_kms_key = NULL,
                                           base_job_name = NULL,
                                           sagemaker_session = session)

In [9]:
# alpha0 = 1.0 mirrors the Python original.
estimator$set_hyperparameters(num_topics = 5L,
                              feature_dim = 25L,
                              mini_batch_size = 45L,
                              alpha0 = 1.0)
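
The `L` suffix on the hyperparameter values matters. A plain `R` number such as `5` is a double, which `reticulate` passes to `Python` as a `float`, whereas `5L` is passed as an `int`. The check below is a small sketch of that conversion using `reticulate`'s `r_to_py()`; it is independent of SageMaker.

```r
library(reticulate)

# Base R: 5L is an integer, 5 is a double.
is.integer(5L)
is.integer(5)

# r_to_py() returns the Python object an R value is converted to.
int_is_py_int <- inherits(r_to_py(5L), 'python.builtin.int')
dbl_is_py_float <- inherits(r_to_py(5), 'python.builtin.float')

c(int_is_py_int, dbl_is_py_float)
```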

Finally, we run the training job on the input data in S3. The training data (topic mixtures) was generated with the `generate_griffiths_data` script provided in the original notebook.

In [10]:
job_name <- paste('sagemaker-train-lda-r', format(Sys.time(), '%H-%M-%S'), sep = '-')
input_data <- list('train' = paste0('s3://', bucket, '/', prefix, '/train/lda.data'))

estimator$fit(inputs = input_data,
              job_name = job_name)
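
The job name above contains only the time of day, so two runs at the same clock time on different days would produce the same name, and SageMaker requires training job names to be unique within an account and region. A possible variant, sketched below, adds the date. The helper `make_job_name` is hypothetical and not part of the original notebook; the pattern it checks is an assumption based on SageMaker's naming rule for training jobs (letters, digits, and hyphens, at most 63 characters).

```r
# Hypothetical helper: build a job name from a base string and a timestamp.
make_job_name <- function(base, time = Sys.time()) {
  name <- paste(base, format(time, '%Y-%m-%d-%H-%M-%S'), sep = '-')
  # Assumed constraint: alphanumerics and hyphens only, 63 characters at most.
  valid <- grepl('^[a-zA-Z0-9](-*[a-zA-Z0-9])*$', name) && nchar(name) <= 63
  if (!valid) stop('invalid training job name: ', name)
  name
}

# A fixed timestamp keeps the example reproducible.
example_name <- make_job_name('sagemaker-train-lda-r',
                              time = as.POSIXct('2020-01-02 03:04:05', tz = 'UTC'))
example_name
```

In the notebook, the result of `make_job_name('sagemaker-train-lda-r')` could be passed as `job_name` to `estimator$fit`.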

After the training job completes, the model artifact `model.tar.gz` is saved to `S3`. Its location is available as `estimator$model_data`.

In [12]:
estimator$model_data

### III. Inference

Once training is complete and we have the model, we create an **endpoint** by calling the **deploy** method with the *instance count* and *instance type* as parameters.

In [13]:
model_endpoint <- estimator$deploy(initial_instance_count = 1L,
                                   instance_type = 'ml.t2.medium')

The deployed endpoint name is available as `model_endpoint$endpoint`. The endpoint should also appear under Endpoints in the Amazon SageMaker console.

In [25]:
model_endpoint$endpoint

The `Python` code below translates to the next set of `R` statements, which use `csv_serializer` and `json_deserializer` to configure how requests to the inference endpoint are encoded and how its responses are decoded.

```
from sagemaker.predictor import csv_serializer, json_deserializer

lda_inference.content_type = 'text/csv'
lda_inference.serializer = csv_serializer
lda_inference.deserializer = json_deserializer
```

In [14]:
model_endpoint$content_type <- 'text/csv'
model_endpoint$serializer <- sagemaker$predictor$csv_serializer
model_endpoint$deserializer <- sagemaker$predictor$json_deserializer

We use the sample test data stored in `lda-test.csv`. 

In [15]:
test_data <- read.csv(file = 'lda-test.csv', header=FALSE)
head(test_data)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25
36,1,0,2,0,24,0,0,3,0,...,27,2,0,2,0,28,1,0,3,0
32,0,0,0,1,27,0,0,0,0,...,32,0,0,0,2,23,0,0,0,0
3,0,23,1,0,1,0,15,5,0,...,2,1,30,1,0,1,0,14,3,0
18,8,1,0,11,10,9,1,0,11,...,12,9,2,0,7,16,4,2,0,7
25,0,0,2,0,30,0,0,0,4,...,26,0,0,4,0,33,0,0,3,2


In [16]:
test_data <- as.matrix(test_data)

Finally, we use the inference endpoint to perform predictions using the test data provided. The `serializer` and `deserializer` used to configure the endpoint will automatically perform the datatype conversions required.

In [17]:
predictions <- model_endpoint$predict(test_data)
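
To make the conversion concrete, the sketch below builds by hand the kind of `text/csv` payload that a CSV serializer produces from a matrix: one line per document, with comma-separated word counts. It is an illustration in base `R`, not the SDK's actual serializer code, and it uses only the first three counts of the first two test documents shown above.

```r
# First three word counts of two test documents (illustrative subset).
counts <- matrix(c(36L, 1L, 0L,
                   32L, 0L, 0L),
                 nrow = 2, byrow = TRUE)

# One comma-separated line per row, joined with newlines.
rows <- apply(counts, 1, paste, collapse = ',')
payload <- paste(rows, collapse = '\n')

cat(payload, '\n')
```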

In [18]:
predictions
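
The JSON response is deserialized into nested `R` lists. Assuming it follows the SageMaker LDA response layout, a `predictions` list with one `topic_mixture` vector per document, the mixtures can be stacked into a matrix with one row per document. The sketch below uses a small mock object, `mock_predictions`, with made-up values in place of the real endpoint output.

```r
# Mock response with made-up values; the real object comes from the endpoint.
mock_predictions <- list(
  predictions = list(
    list(topic_mixture = c(0.9, 0.1, 0.0)),
    list(topic_mixture = c(0.2, 0.3, 0.5))
  )
)

# Stack the per-document topic mixtures into a documents-by-topics matrix.
topic_mixtures <- do.call(
  rbind,
  lapply(mock_predictions$predictions, function(p) unlist(p$topic_mixture))
)

# Index of the highest-weight topic for each document.
dominant_topic <- apply(topic_mixtures, 1, which.max)

dim(topic_mixtures)
dominant_topic
```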

### IV. Stop / Close the Endpoint

In [19]:
session$delete_endpoint(model_endpoint$endpoint)

### V. Epilogue

In this notebook, we used `reticulate` to convert an existing SageMaker example notebook written in `Python` to `R`.