# Word-level language modeling using PyTorch and Meeshkan


## Setup

### Install requirements
Using `meeshkan` with SageMaker requires installing both Python SDKs. If you have installed `meeshkan` previously, you may have to **restart the Jupyter kernel (`Kernel` -> `Restart`) for the updated dependency to take effect.**

In [None]:
!pip install --upgrade meeshkan sagemaker

In [None]:
import meeshkan
import sagemaker

### Start the Meeshkan agent
The Meeshkan agent is a daemonized process that monitors your jobs in the background and notifies you when events happen. The command `meeshkan.init(token=YOUR_TOKEN)` starts the agent using the token you got when signing up at [meeshkan.com](https://meeshkan.com).
The token is stored in a local file at `~/.meeshkan/credentials`, so you only need to include the token here _once_.

In [None]:
meeshkan.init(token="YOUR_TOKEN_HERE")

You can also control the agent with the following commands:
```python
meeshkan.start()   # Starts the agent (assumes credentials have been set up)
meeshkan.stop()   # Stops the agent
meeshkan.restart()  # Restarts the agent, useful if you installed a new version of `meeshkan` and want the changes to take effect
```

## Working with SageMaker (original notebook [here](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/pytorch_lstm_word_language_model))

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Monitor with Meeshkan](#Monitor)

---

## Background

This example trains a multi-layer LSTM RNN model on a language modeling task, based on the [PyTorch word language model example](https://github.com/pytorch/examples/tree/master/word_language_model). By default, the training script uses the Wikitext-2 dataset. We will train the model on SageMaker and monitor the training job with Meeshkan.

For more information about PyTorch in SageMaker, please visit the [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) GitHub repositories.

---

## Setup

_This notebook was created and tested on an ml.p2.xlarge notebook instance._

Let's create a SageMaker session and specify:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See [the documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).

In [None]:
sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-pytorch-rnn-lstm'

role = sagemaker.get_execution_role()  # Use this when running on a SageMaker notebook instance
# If you are running the notebook locally, fill in the appropriate execution role ARN here.
# role = "EXECUTOR_ROLE_ARN_HERE"

## Data
### Getting the data
As mentioned above, we are going to use [the wikitext-2 raw data](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/). This data is from Wikipedia and is licensed CC-BY-SA-3.0. Before you use this data for any purpose other than this example, you should understand the data license, described at https://creativecommons.org/licenses/by-sa/3.0/.

In [None]:
%%bash
wget http://research.metamind.io.s3.amazonaws.com/wikitext/wikitext-2-raw-v1.zip
unzip -n wikitext-2-raw-v1.zip
cd wikitext-2-raw
mv wiki.test.raw test && mv wiki.train.raw train && mv wiki.valid.raw valid


Let's preview what the data looks like.

In [None]:
!head -5 wikitext-2-raw/train

### Uploading the data to S3
We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies that location; we will use it later when we start the training job.



In [None]:
inputs = sagemaker_session.upload_data(path='wikitext-2-raw', bucket=bucket, key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))
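For a directory upload like this one, the printed value is typically an S3 URI built from the bucket and the key prefix. The sketch below only illustrates that expected format; `expected_s3_uri` is a hypothetical helper, the bucket name is made up, and the value actually returned by `upload_data` remains the authoritative one.

```python
def expected_s3_uri(bucket, key_prefix):
    """Hypothetical helper: the S3 URI we would expect for a directory upload."""
    return "s3://{}/{}".format(bucket, key_prefix.strip("/"))


# Made-up bucket name, combined with the prefix used in this notebook.
example_uri = expected_s3_uri("my-example-bucket", "sagemaker/DEMO-pytorch-rnn-lstm")
print(example_uri)
```

Comparing this against the printed `inputs` value is a quick way to confirm that the data landed under the prefix you intended.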

## Train
### Training script
We need to provide a training script that can run on the SageMaker platform. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.
  These artifacts are uploaded to S3 for model hosting.
* `SM_OUTPUT_DATA_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may
  include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed
  and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method,
the following variable is set, named according to the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.
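Inside a training script, these variables can be read with `os.environ`. The following is a minimal sketch rather than the code in `source/train.py`: the helper name `read_sagemaker_env` and the local fallback paths are assumptions, added so that the snippet also runs outside SageMaker.

```python
import os


def read_sagemaker_env(environ=os.environ):
    """Collect the SageMaker paths described above, with assumed local fallbacks."""
    return {
        "model_dir": environ.get("SM_MODEL_DIR", "./model"),
        "output_data_dir": environ.get("SM_OUTPUT_DATA_DIR", "./output"),
        "training_dir": environ.get("SM_CHANNEL_TRAINING", "./wikitext-2-raw"),
    }


paths = read_sagemaker_env()
print(paths)
```

Passing the environment in as an argument keeps the function easy to exercise with a plain dictionary when the real variables are not set.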

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script run by this notebook:

In [None]:
!pygmentize 'source/train.py'
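The argument-parsing part of such a script might look roughly like the sketch below. It is not the contents of `source/train.py`: the `--epochs` and `--tied` names mirror the hyperparameters used later in this notebook, while the defaults, the `--model-dir` and `--data-dir` options, and the `str_to_bool` helper are assumptions made for illustration.

```python
import argparse
import os


def str_to_bool(value):
    """Interpret command-line strings such as 'True' or 'false' as booleans."""
    return str(value).lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, e.g. `--epochs 2 --tied True`.
    parser.add_argument("--epochs", type=int, default=6)
    parser.add_argument("--tied", type=str_to_bool, default=False)
    # Paths default to the SageMaker environment variables, with assumed local fallbacks.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "./data"))
    return parser.parse_args(argv)


args = parse_args(["--epochs", "2", "--tied", "True"])
print(args.epochs, args.tied)
```

An explicit converter is used for `--tied` because `type=bool` would treat any non-empty string, including `"False"`, as true.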

For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers).

In the current example we also need to provide a source directory, since the training script imports data and model classes from other modules.

In [None]:
!ls source

### Run training in SageMaker
The `PyTorch` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script and source directory, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on an `ml.m4.xlarge` instance; a GPU instance such as `ml.p2.xlarge` trains faster. As the example shows, you can also specify hyperparameters.


### Define the metrics to capture
Our training script prints rows such as `TrainingLoss=2.2342`, so we tell SageMaker to capture those values with regular expressions. **Meeshkan can only notify you of the metrics that you define here**. Note that built-in Amazon algorithms come with their own metric definitions, which are watched automatically.

In [None]:
float_pattern = "([0-9\\.]+)"
epoch_pattern = "Epoch={}".format(float_pattern)
val_loss_pattern = "ValidationLoss={}".format(float_pattern)
training_loss_pattern = "TrainingLoss={}".format(float_pattern)
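Before submitting a job, it can help to check the patterns locally against a sample log line. In the sketch below the log line is hypothetical (the real script's output may be laid out differently), and Python's `re.search` only approximates how SageMaker applies the regexes on its side.

```python
import re

float_pattern = "([0-9\\.]+)"
training_loss_pattern = "TrainingLoss={}".format(float_pattern)
val_loss_pattern = "ValidationLoss={}".format(float_pattern)

# Hypothetical log line, loosely modeled on the 'TrainingLoss=2.2342' example above.
log_line = "Epoch=1 TrainingLoss=2.2342 ValidationLoss=2.1017"

# The first capture group holds the numeric value of each metric.
train_loss = float(re.search(training_loss_pattern, log_line).group(1))
val_loss = float(re.search(val_loss_pattern, log_line).group(1))
print(train_loss, val_loss)
```

Note that `float_pattern` matches only digits and dots, so a value printed in scientific notation such as `1e-05` would be captured as `1`. Printing metrics in fixed-point notation avoids that.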

### Create a PyTorch estimator

In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='1.0.0',
                    train_instance_count=1,
                    train_instance_type='ml.m4.xlarge',  # Use e.g. `ml.p2.xlarge` to train faster
                    source_dir='source',
                    hyperparameters={
                        'epochs': 2,
                        'tied': True
                    },
                    metric_definitions=[
                        {'Name': 'epoch', 'Regex': epoch_pattern},
                        {'Name': 'val:loss', 'Regex': val_loss_pattern},
                        {'Name': 'train:loss', 'Regex': training_loss_pattern}
                    ]
                   )

### Launch training job
After we've constructed our PyTorch object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [None]:
from time import gmtime, strftime

job_name = 'pytorch-meeshkan-rnn-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

estimator.fit({'training': inputs}, job_name=job_name, wait=False)
print("Submitted job {:s}".format(job_name))
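Training job names have to be unique and are restricted to alphanumeric characters and hyphens, which is why the name above is built from a fixed prefix plus a timestamp. The sketch below checks a candidate name locally; the 63-character limit and the pattern are taken from my reading of the SageMaker API reference for `TrainingJobName` and should be treated as assumptions, and `is_valid_job_name` is a hypothetical helper.

```python
import re
from time import gmtime, strftime

# Assumed constraints: at most 63 characters, alphanumerics separated by hyphens.
JOB_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def is_valid_job_name(name):
    """Return True if `name` satisfies the assumed naming constraints."""
    return len(name) <= MAX_JOB_NAME_LENGTH and JOB_NAME_RE.match(name) is not None


candidate = "pytorch-meeshkan-rnn-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(candidate, is_valid_job_name(candidate))
```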

You can check the status of the job with the low-level `boto3` client:
```python
import boto3
boto3.client('sagemaker').describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
```
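To wait until the job finishes, the status check above can be wrapped in a polling loop. This is a minimal sketch under stated assumptions: `wait_for_training_job` is a hypothetical helper, the set of terminal statuses reflects the values I understand `describe_training_job` to return, and a stand-in function replaces the real client so the snippet runs without AWS access.

```python
import time

# Statuses after which the job will not change any more (assumed).
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_interval=60, max_polls=120,
                          sleep=time.sleep):
    """Poll `describe` until the job reaches a terminal status or max_polls is hit."""
    status = None
    for _ in range(max_polls):
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_interval)
    return status


# Stand-in for `boto3.client('sagemaker').describe_training_job`, for a dry run.
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_fake_statuses)}


final_status = wait_for_training_job(fake_describe, "example-job",
                                     sleep=lambda seconds: None)
print(final_status)
```

With a real client, you would pass `boto3.client('sagemaker').describe_training_job` as `describe` and keep the default `sleep`.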

## Monitor
Start monitoring the job with Meeshkan, checking for updates every minute. You will be notified via Slack when new metrics are posted by the SageMaker job.

In [None]:
meeshkan.sagemaker.monitor(job_name=job_name, poll_interval=60)

### Stop the job
For demonstration purposes, you can stop the training job using the low-level `boto3` client as shown below. Stopping the job may take a few minutes, but once it's done, you'll get a Slack notification from Meeshkan that your job has finished.

In [None]:
import boto3
sagemaker_client = boto3.client('sagemaker')
sagemaker_client.stop_training_job(TrainingJobName=job_name)

### Stop Meeshkan agent
Stop the agent once you no longer need monitoring.

In [None]:
meeshkan.stop()