# ESPnet on Amazon SageMaker
## Overview

This notebook helps you run ESPnet2 on Amazon SageMaker: run the Jupyter notebook cells from top to bottom. It uses a SageMaker-compatible version of the ESPnet code, available at https://github.com/harusametime/espnet

### How this works
ESPnet runs in multiple steps: data preparation, model training, post-processing, and so on. Most of these steps run in this notebook, while the computationally demanding ones, such as training ASR, TTS, and language models, run as Amazon SageMaker Training jobs on a powerful, scalable cluster of AWS instances provisioned on demand. The original ESPnet code triggers these computationally demanding steps through `launch.py` for distributed training. In the SageMaker-compatible code, `launch.py` is modified to start a SageMaker Training job when you request it. You request it by passing a new argument, `--sagemaker_train_config`, with the path to a SageMaker config file, e.g., `conf/train_sagemaker.yaml`. The SageMaker config specifies, among other things, which instance type to use and how many instances you need.
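
The dispatch described above can be pictured with a minimal sketch. This is not the actual `launch.py`: the function names `parse_launch_args` and `choose_backend` are hypothetical, and only the `--sagemaker_train_config` flag comes from the text. It simply shows the idea that the presence of the flag decides where training runs.

```python
import argparse


def parse_launch_args(argv):
    """Parse only the flag discussed here; other launcher flags are returned untouched."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--sagemaker_train_config", default=None)
    known, remaining = parser.parse_known_args(argv)
    return known, remaining


def choose_backend(argv):
    """Return 'sagemaker' when a SageMaker config is passed, otherwise 'local'."""
    known, _ = parse_launch_args(argv)
    return "sagemaker" if known.sagemaker_train_config else "local"


print(choose_backend(["--sagemaker_train_config", "conf/train_sagemaker.yaml"]))
print(choose_backend(["--ngpu", "0"]))
```

Using `parse_known_args` keeps every other launcher option intact, so a branch like this could sit in front of existing argument handling without changing it.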

Just as you would run ESPnet without SageMaker, you run the shell script `run.sh` from this notebook; that is, you run a command such as `!bash run.sh` in a notebook cell. You can also resume the process from a stage that has already been reached. Look for an ESPnet log message like `Generate 'exp/hogehoge/run.sh'. You can resume the process from stage 5 using this script`, then resume with `!bash exp/hogehoge/run.sh`.

### Requirements
- This notebook runs on a SageMaker Studio notebook with PyTorch 1.10, and probably NOT on a SageMaker Classic notebook, because ESPnet requires Ubuntu as the OS.
    - For how to create a SageMaker Studio notebook, see: https://catalog.us-east-1.prod.workshops.aws/workshops/63069e26-921c-4ce1-9cc7-dd882ff62575/ja-JP/prerequisites/option2
- This notebook invokes SageMaker Training jobs with PyTorch 1.11 to train large models. That version is supported by both ESPnet and the SageMaker pre-built training image.
- Get the SageMaker-compatible code by cloning the GitHub repo onto the SageMaker Studio notebook: `git clone -b sagemaker https://github.com/harusametime/espnet.git`
- The SageMaker Studio notebook needs a trust policy and permissions for CodeBuild; see https://github.com/aws-samples/sagemaker-studio-image-build-cli

## 1. Preparation

Note: Please make sure this notebook is running on a SageMaker Studio notebook with PyTorch 1.10 Python 3.8 CPU or GPU.

Because a SageMaker Training job must run from a Docker image, this notebook begins by building a Docker image and uploading it to Amazon ECR. A Studio notebook itself runs inside Docker, so docker-in-docker is not straightforward. This notebook therefore uses `sm-docker`, which builds the Docker image with AWS CodeBuild outside of this notebook and publishes it to ECR.

If a Docker image that you built earlier for another task is still available, you do not need to rebuild it.

## 1-1. Installing sm-docker

`sm-docker` can be installed via pip: https://github.com/aws-samples/sagemaker-studio-image-build-cli

If you see an error like `not authorized to perform`, please review the trust policy and permissions again: https://github.com/aws-samples/sagemaker-studio-image-build-cli

In [6]:
!pip install --upgrade pip
!pip install sagemaker-studio-image-build

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting pip
  Using cached pip-22.2.2-py3-none-any.whl (2.0 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.3
    Uninstalling pip-22.0.3:
      Successfully uninstalled pip-22.0.3
Successfully installed pip-22.2.2
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com

## 1-2. Creating docker directory

Create a `docker` directory to store the `Dockerfile`. All config files, data files, and Docker-related files need to be stored in `(dataset_name)/(task_name)`, such as `jsut/asr1`, which is the directory containing this notebook.
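
As a small illustration of the `(dataset_name)/(task_name)` layout, the sketch below derives both names from a recipe path. The helper names and the example path `/root/espnet/egs2/jsut/asr1` are assumptions made for illustration; only the `jsut/asr1` convention and the `docker` directory come from the text.

```python
from pathlib import PurePosixPath


def recipe_names(recipe_dir):
    """Split a recipe path such as .../jsut/asr1 into (dataset_name, task_name)."""
    path = PurePosixPath(recipe_dir)
    return path.parent.name, path.name


def docker_dir(recipe_dir):
    """Location of the docker directory inside the recipe directory."""
    return str(PurePosixPath(recipe_dir) / "docker")


# Hypothetical location of this notebook's directory.
dataset, task = recipe_names("/root/espnet/egs2/jsut/asr1")
print(dataset, task)
print(docker_dir("/root/espnet/egs2/jsut/asr1"))
```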

In [3]:
!mkdir -p docker

## 1-3. Creating Dockerfile

The easiest way to build a Docker image that can fully use SageMaker features is to extend a pre-built SageMaker image by adding the necessary libraries. Here the Docker image is based on the PyTorch 1.11 GPU image and contains ESPnet built for PyTorch 1.11.

In [4]:
%%writefile "./docker/Dockerfile"
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker

ENV AWS_DEFAULT_REGION=us-west-2

RUN apt-get update \
 && apt-get install -y --no-install-recommends \
    pkg-config \
    ffmpeg \
    flac \
    libsndfile1-dev \
    libpng-dev \
    libfreetype6-dev \
    sox \
    bc \
    nkf

RUN pip install nltk

RUN python -c "import nltk; nltk.download('averaged_perceptron_tagger'); nltk.download('cmudict')"


ADD https://api.github.com/repos/harusametime/espnet/git/refs/heads/sagemaker version.json
RUN git clone -b sagemaker https://github.com/harusametime/espnet.git
RUN cd espnet/tools && ./setup_python.sh $(command -v python3)
RUN cd espnet/tools &&  make TH_VERSION=1.11.0

RUN conda install -c conda-forge curl -y

RUN python -V

Writing ./docker/Dockerfile


## 1-4. Building docker image and pushing it into ECR

Run `sm-docker`, pointing it at the `./docker` directory that stores the `Dockerfile`. At the end, the log prints the image URI, like `(your_account_id).dkr.ecr.(region_name).amazonaws.com/sagemaker-espnet-pytorch111:latest`, which you need for the SageMaker config. It is convenient to avoid '.' in the repository name, because a SageMaker Training job name is derived from the image name by default and must not contain '.'. With `pytorch111` you do not need to set a job name explicitly, whereas `pytorch1.11` would require one.
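
The naming advice above can be checked before building. The sketch below is illustrative only: the helper names are hypothetical, and the pattern encodes an assumed SageMaker rule for training job names (alphanumeric characters and hyphens, at most 63 characters), so please confirm it against the API reference before relying on it.

```python
import re

# Assumed naming rule for SageMaker training jobs: alphanumeric characters
# and hyphens only, at most 63 characters.
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def is_safe_job_name_base(repository):
    """True if the repository name (without tag) could be reused in a job name."""
    return JOB_NAME_PATTERN.fullmatch(repository) is not None


def image_uri(account_id, region, repository, tag="latest"):
    """Assemble an ECR image URI in the form printed by the build log."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


print(is_safe_job_name_base("sagemaker-espnet-pytorch111"))
print(is_safe_job_name_base("sagemaker-espnet-pytorch1.11"))
```

The second call returns `False` because of the '.', which is the case the text recommends avoiding.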

This may take around 10 minutes. 

In [20]:
!cd docker && sm-docker build . --repository sagemaker-espnet-pytorch111:latest

....[Container] 2022/09/21 00:27:14 Waiting for agent ping

[Container] 2022/09/21 00:27:15 Waiting for DOWNLOAD_SOURCE
[Container] 2022/09/21 00:27:18 Phase is DOWNLOAD_SOURCE
[Container] 2022/09/21 00:27:18 CODEBUILD_SRC_DIR=/codebuild/output/src015096783/src
[Container] 2022/09/21 00:27:18 YAML location is /codebuild/output/src015096783/src/buildspec.yml
[Container] 2022/09/21 00:27:18 Setting HTTP client timeout to higher timeout for S3 source
[Container] 2022/09/21 00:27:18 Processing environment variables
[Container] 2022/09/21 00:27:18 No runtime version selected in buildspec.
[Container] 2022/09/21 00:27:18 Moving to directory /codebuild/output/src015096783/src
[Container] 2022/09/21 00:27:18 Configuring ssm agent with target id: codebuild:90d32f7b-b095-45e8-b007-a699e5538524
[Container] 2022/09/21 00:27:18 Successfully updated ssm agent configuration
[Container] 2022/09/21 00:27:18 Registering with agent
[Container] 2022/09/21 00:27:18 Phases found in YAML: 3
[Container] 2022/

## 2. Installing ESPnet and its Python libraries on the SageMaker notebook image

This section installs ESPnet on this notebook's image, because the notebook itself runs most of the ESPnet steps with Kaldi. The following cell simply follows the installation instructions.

https://espnet.github.io/espnet/installation.html

In [None]:
!apt-get update
!apt-get install -y pkg-config libpng-dev libfreetype6-dev sox bc nkf
!cd ../../../tools && ./setup_python.sh $(command -v python3)
!cd ../../../tools && make > compile.log

## 3. Create SageMaker Config

You can specify the following parameters to run training with Amazon SageMaker.

- sagemaker_config_path: path to the config file that you are creating
- image_uri: must match the URI to which you pushed the Docker image
- key_prefix: location under s3_bucket (default: dataset_name/task_name, like JSUT/asr1)
- s3_bucket: S3 bucket that you want to use (default: sagemaker-(region_name)-(account_id))
- train_instance_type: choose from the SageMaker GPU instance families (P4d, P3, or G5). G4dn may not have sufficient GPU memory.
- train_instance_count: number of instances that you want to use. If the count is greater than 1, distributed training is triggered.
- data_upload: if true, data is uploaded with S3 sync; if false, data is not uploaded.
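
Before writing the config file, a quick sanity check on the values can catch simple mistakes. The sketch below is an assumption-laden illustration, not part of the SageMaker-compatible ESPnet code: the function name `check_sagemaker_config` is hypothetical, and which keys are treated as mandatory is a guess (the text documents defaults only for `key_prefix` and `s3_bucket`).

```python
# Keys assumed to be mandatory in this sketch; the real code may differ.
REQUIRED_KEYS = ("image_uri", "train_instance_type", "train_instance_count")


def check_sagemaker_config(config):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    instance_type = str(config.get("train_instance_type", ""))
    if instance_type and not instance_type.startswith("ml."):
        problems.append("train_instance_type should look like 'ml.p4d.24xlarge'")
    count = config.get("train_instance_count", 1)
    if not isinstance(count, int) or count < 1:
        problems.append("train_instance_count must be a positive integer")
    return problems


# Placeholder values for illustration only.
example = {
    "image_uri": "(your_account_id).dkr.ecr.(region_name).amazonaws.com/sagemaker-espnet-pytorch111:latest",
    "train_instance_type": "ml.p4d.24xlarge",
    "train_instance_count": 1,
}
print(check_sagemaker_config(example))
```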



In [22]:
import os
import sagemaker
import yaml

sagemaker_config_path = 'conf/train_sagemaker.yaml'
image_uri = '373011628954.dkr.ecr.us-west-2.amazonaws.com/sagemaker-espnet-pytorch111:latest'

dataset_name =  os.getcwd().split('/')[-2]  # jsut
task_name = os.getcwd().split('/')[-1] # asr1
key_prefix = os.path.join('sagemaker_espnet',dataset_name,task_name)

config_dict = dict(
    s3_bucket = sagemaker.Session().default_bucket(),
    image_uri = image_uri,
    role = sagemaker.get_execution_role(),
    key_prefix = key_prefix,
    train_instance_type = 'ml.p4d.24xlarge',
    train_instance_count = 1,
    data_upload = 'true'
)

with open(sagemaker_config_path, 'w') as outfile:
    yaml.dump(config_dict, outfile, default_flow_style=False)

## 4. Run ESPnet on SageMaker

### 4-1. First run 

To run ESPnet on SageMaker, execute `run.sh` with the argument `--sagemaker_train_config`. As with the original ESPnet code, you can pass the arguments defined in `asr.sh`; for example, `--ngpu 0` for a non-GPU instance, or `--asr_args "--max_epoch 10"` to limit the number of epochs when training the ASR model. **Note: do not forget to specify `--ngpu 0` on a non-GPU instance, because the script assumes a GPU instance and will otherwise raise a GPU-related error.**
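
If you prefer to assemble the command programmatically, the sketch below shows one way to do it with correct quoting. The helper `build_run_command` is hypothetical; only the flags `--sagemaker_train_config`, `--ngpu`, and `--asr_args` come from the text.

```python
import shlex


def build_run_command(sagemaker_config, ngpu=None, asr_args=None):
    """Compose the `run.sh` command line; options left as None are omitted."""
    cmd = ["bash", "run.sh", "--sagemaker_train_config", sagemaker_config]
    if ngpu is not None:
        cmd += ["--ngpu", str(ngpu)]
    if asr_args:
        # Kept as a single argument so that it reaches asr.sh unsplit.
        cmd += ["--asr_args", asr_args]
    return shlex.join(cmd)


print(build_run_command("conf/train_sagemaker.yaml", ngpu=0, asr_args="--max_epoch 10"))
```

`shlex.join` quotes `--max_epoch 10` as one shell word, which mirrors the double quotes used in the notebook cell.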

On the first execution, the shell script begins by downloading the JSUT dataset as specified in `db.sh`: if `JSUT=downloads` is set in `db.sh`, the script downloads the dataset automatically. If you download the dataset yourself, specify the name of the directory storing the dataset instead of `downloads`.
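
To see which case applies before running the script, you could inspect `db.sh` from a notebook cell. The sketch below is a minimal illustration with hypothetical helper names; it assumes `db.sh` contains plain `NAME=value` assignments, as in the `JSUT=downloads` example above, and works on the file's text.

```python
import re


def read_db_entry(db_sh_text, name="JSUT"):
    """Return the value assigned to `name` in db.sh-style text, or None if absent."""
    pattern = re.compile(rf"^[ \t]*{re.escape(name)}=(\S*)", re.MULTILINE)
    match = pattern.search(db_sh_text)
    return match.group(1).strip("\"'") if match else None


def will_auto_download(db_sh_text, name="JSUT"):
    """True when the entry is set to `downloads`, the automatic-download case."""
    return read_db_entry(db_sh_text, name) == "downloads"


sample = "# dataset locations\nJSUT=downloads\n"
print(read_db_entry(sample), will_auto_download(sample))
```

In the notebook you would pass the real file contents, for example `open("db.sh").read()`.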

**Note: After the download, this script stops with a permission error. Please execute the same shell script again; it then continues with the step that follows the download. The error occurs because we cannot run this as root in a SageMaker Studio notebook. This issue may be addressed in the future.**

If you would rather not see the expanded log output, try `Enable Scrolling for Outputs` in the right-click menu on the log output.


In [None]:
!bash run.sh --sagemaker_train_config conf/train_sagemaker.yaml --ngpu 0 --asr_args "--max_epoch 10"

2022-09-21T05:55:36 (asr.sh:256:main) ./asr.sh --ngpu 4 --lang jp --token_type char --feats_type raw --fs 16000 --speed_perturb_factors 0.9 1.0 1.1 --local_data_opts --fs 16000 --asr_config conf/tuning/train_asr_conformer8.yaml --inference_config conf/decode_transformer.yaml --lm_config conf/train_lm.yaml --train_set tr_no_dev --valid_set dev --test_sets dev eval1 --lm_train_text data/tr_no_dev/text --audio_format flac --sagemaker_train_config conf/train_sagemaker.yaml --ngpu 0 --asr_args --max_epoch 10
2022-09-21T05:55:36 (asr.sh:447:main) Stage 1: Data preparation for data/tr_no_dev, data/dev, etc.
2022-09-21T05:55:36 (data.sh:17:main) local/data.sh --fs 16000
stage -1: Data Download
Already exists. Skipped.
Already exists. Skipped.
finished making wav.scp, utt2spk, spk2utt.
finished making text.
   Search for the word 'bold' in http://kaldi-asr.org/doc/data_prep.html
   for more information.
utils/validate_data_dir.sh: Successfully validated data-directory data/train
utils/subset_da

### 4-2. Resume the process

After each step completes, you will see a log message like `2022-09-20T13:26:27 (asr.sh:758:main) Generate 'exp/lm_stats_jp_char/run.sh'. You can resume the process from stage 6 using this script`. Following this message, you can resume the process by running the generated script.
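
When the log is long, the resume hints can be extracted automatically. The sketch below is a hedged illustration: the function name `find_resume_points` is hypothetical, and the regular expression assumes the message format matches the example quoted above.

```python
import re

RESUME_PATTERN = re.compile(
    r"Generate '(?P<script>[^']+)'\. "
    r"You can resume the process from stage (?P<stage>\d+) using this script"
)


def find_resume_points(log_text):
    """Return (script, stage) pairs for every resume hint found in the log text."""
    return [
        (m.group("script"), int(m.group("stage")))
        for m in RESUME_PATTERN.finditer(log_text)
    ]


line = (
    "2022-09-20T13:26:27 (asr.sh:758:main) Generate 'exp/lm_stats_jp_char/run.sh'. "
    "You can resume the process from stage 6 using this script"
)
print(find_resume_points(line))
```

The last pair in the returned list points to the most recent script, which is the one to pass to `!bash` when resuming.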

In [25]:
!bash exp/lm_stats_jp_char/run.sh

2022-09-21T04:11:46 (asr.sh:256:main) ./asr.sh --ngpu 4 --lang jp --token_type char --feats_type raw --fs 16000 --speed_perturb_factors 0.9 1.0 1.1 --local_data_opts --fs 16000 --asr_config conf/tuning/train_asr_conformer8.yaml --inference_config conf/decode_transformer.yaml --lm_config conf/train_lm.yaml --train_set tr_no_dev --valid_set dev --test_sets dev eval1 --lm_train_text data/tr_no_dev/text --audio_format flac --sagemaker_train_config conf/train_sagemaker.yaml --stage 6 --ngpu 0
2022-09-21T04:11:46 (asr.sh:728:main) Stage 6: LM collect stats: train_set=dump/raw/lm_train.txt, dev_set=dump/raw/dev/text
2022-09-21T04:11:47 (asr.sh:760:main) Generate 'exp/lm_stats_jp_char/run.sh'. You can resume the process from stage 6 using this script
2022-09-21T04:11:47 (asr.sh:764:main) LM collect-stats started... log: 'exp/lm_stats_jp_char/logdir/stats.*.log'
^C
