# Espnet on Amazon SageMaker 
## Overview

This notebook helps you run Espnet2 on Amazon SageMaker by running the Jupyter notebook cells from top to bottom. It uses a SageMaker-compatible version of the Espnet code, available at https://github.com/harusametime/espnet

### How this works
Espnet consists of multiple steps, such as data preparation, model training, and post-processing. Most of these steps run in this notebook on Amazon SageMaker, while computationally demanding steps, such as training ASR, TTS, and language models, are processed by Amazon SageMaker Training jobs, which use a powerful and scalable cluster of AWS instances on demand. The original Espnet code triggers these computationally demanding steps in `launch.py` for distributed training. In the SageMaker-compatible code, `launch.py` is modified to start a SageMaker training job when you request it. To do so, pass the new argument `--sagemaker_train_config` with a SageMaker config file, e.g., `conf/train_sagemaker.yaml`. The SageMaker config specifies, among other things, which instance type to use and how many instances you need.

Just as when you run Espnet without SageMaker, you run a shell script, namely `run.sh`, from this notebook; that is, you run a command like `!bash run.sh` in a notebook cell. You can also resume the process after the steps you have already completed. Look for a message in Espnet's log like `Generate 'exp/hogehoge/run.sh'. You can resume the process from stage 5 using this script`; you can then resume with `!bash exp/hogehoge/run.sh`.

### Requirement
- This notebook runs on a SageMaker Studio notebook with the PyTorch 1.10 image. It probably does NOT run on a SageMaker Classic notebook, because Espnet requires Ubuntu as the OS.
    - You can find how to create SageMaker Studio Notebook here: https://catalog.us-east-1.prod.workshops.aws/workshops/63069e26-921c-4ce1-9cc7-dd882ff62575/ja-JP/prerequisites/option2
- This notebook invokes SageMaker Training jobs with PyTorch 1.11 to train large models. This version is supported by both Espnet and the pre-built SageMaker Training image.
- Get the SageMaker-compatible code by cloning the GitHub repository onto the SageMaker Studio notebook: `git clone -b sagemaker https://github.com/harusametime/espnet.git`
- The execution role of the SageMaker Studio notebook needs the trust policy and permissions for CodeBuild described in https://github.com/aws-samples/sagemaker-studio-image-build-cli
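
The trust policy mentioned in the last requirement can be sketched as below. This is a minimal sketch, assuming the Studio execution role already trusts SageMaker and only needs CodeBuild added as a second principal; the variable name `trust_policy` is illustrative, and the sm-docker README linked above remains the reference for the accompanying permission policy.

```python
import json

# Assumption: the Studio execution role already trusts SageMaker; CodeBuild is
# added so that sm-docker can start a build with the same role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["sagemaker.amazonaws.com", "codebuild.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

# Print the JSON document that would be pasted into the IAM console.
print(json.dumps(trust_policy, indent=2))
```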

## 1. Preparation

Note: Please make sure this notebook is running on a SageMaker Studio notebook with the PyTorch 1.10 Python 3.8 CPU or GPU image.

Because a SageMaker Training job must run with a Docker image, this notebook begins by building a Docker image and uploading it to Amazon ECR. A Studio notebook itself runs inside Docker, so building an image there (docker-in-docker) is not straightforward. This notebook therefore uses `sm-docker`, which builds the Docker image and publishes it to ECR with AWS CodeBuild, outside of this notebook.

If the Docker image has already been built, for example for another task, you do not need to rebuild it.

## 1-1. Installing sm-docker

`sm-docker` can be installed via pip. See https://github.com/aws-samples/sagemaker-studio-image-build-cli

If you see an error like `not authorized to perform`, please review the trust policy and permissions again: https://github.com/aws-samples/sagemaker-studio-image-build-cli

In [None]:
!pip install --upgrade pip
!pip install sagemaker-studio-image-build

## 1-2. Creating docker directory

Create a `docker` directory to store the `Dockerfile`. All config files, data files, and Docker-related files need to be stored in `(dataset_name)/(task_name)`, such as `ljspeech/tts1`, which is the same location as this notebook.

In [None]:
!mkdir -p docker

## 1-3. Creating Dockerfile

The easiest way to build a Docker image that can fully use SageMaker features is to extend a pre-built SageMaker image by adding the necessary libraries. Here the Docker image is based on the PyTorch 1.11 GPU image and contains Espnet compiled for PyTorch 1.11.

In [None]:
%%writefile "./docker/Dockerfile"
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker

ENV AWS_DEFAULT_REGION=us-west-2

RUN apt-get update \
 && apt-get install -y --no-install-recommends \
    pkg-config \
    ffmpeg \
    flac \
    libsndfile1-dev \
    libpng-dev \
    libfreetype6-dev \
    sox \
    bc \
    nkf

RUN pip install nltk

RUN python -c "import nltk; nltk.download('averaged_perceptron_tagger'); nltk.download('cmudict')"


ADD https://api.github.com/repos/harusametime/espnet/git/refs/heads/sagemaker version.json
RUN git clone -b sagemaker https://github.com/harusametime/espnet.git
RUN cd espnet/tools && ./setup_python.sh $(command -v python3)
RUN cd espnet/tools &&  make TH_VERSION=1.11.0

RUN conda install -c conda-forge curl -y

RUN python -V

## 1-4. Building docker image and pushing it into ECR

Run `sm-docker`, pointing it at the `./docker` directory that stores the `Dockerfile`. At the end, the log outputs the image URI, like `(your_account_id).dkr.ecr.(region_name).amazonaws.com/sagemaker-espnet-pytorch111:latest`, which is needed in the SageMaker config. Avoid using '.' in the image name, because a SageMaker Training job name is derived from the image name by default and must not contain '.'. With `pytorch111` you do not need to specify a job name yourself, whereas with `pytorch1.11` you would.

This may take around 10 minutes. 

In [None]:
!cd docker && sm-docker build . --repository sagemaker-espnet-pytorch111:latest
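
The naming advice above can be checked before building. The following is a minimal sketch, not part of the Espnet or SageMaker code: the helper names `build_image_uri` and `usable_as_job_name` are hypothetical, the account ID is a placeholder, and the pattern is the one documented for `TrainingJobName` in the SageMaker `CreateTrainingJob` API. A job name derived from the image name also gets a suffix appended, so a passing check here is necessary rather than sufficient.

```python
import re

# Pattern SageMaker applies to training job names: alphanumerics and hyphens,
# at most 63 characters, so '.' is rejected.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def build_image_uri(account_id, region, repository, tag="latest"):
    """Compose an ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def usable_as_job_name(repository):
    """Return True if the repository name contains only characters allowed in a job name."""
    return JOB_NAME_PATTERN.match(repository) is not None


# Placeholder account ID; replace it with your own.
print(build_image_uri("123456789012", "us-west-2", "sagemaker-espnet-pytorch111"))
print(usable_as_job_name("sagemaker-espnet-pytorch111"))   # True
print(usable_as_job_name("sagemaker-espnet-pytorch1.11"))  # False
```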

## 2. Installing Espnet with Python libraries to SageMaker Notebook image

This step installs Espnet into this notebook image, because this notebook also runs most of the Espnet steps with Kaldi. The following cell simply follows the installation instructions.

https://espnet.github.io/espnet/installation.html

In [None]:
!apt-get update
!apt-get install -y pkg-config libpng-dev libfreetype6-dev sox bc nkf
!cd ../../../tools && ./setup_python.sh $(command -v python3)
!cd ../../../tools && make > compile.log

## 3. Create SageMaker Config

You can specify the following parameters to run training with Amazon SageMaker.

- sagemaker_config_path: path to the config file that you are creating
- image_uri: must be consistent with the URI to which you pushed the Docker image
- key_prefix: location under s3_bucket (default: dataset_name/task_name like JSUT/asr1)
- s3_bucket: S3 bucket that you want to use (default: sagemaker-(region_name)-(Account ID))
- train_instance_type: choose from the SageMaker GPU instance families (P4d, P3, or G5). G4dn may not have sufficient GPU memory.
- train_instance_count: number of instances that you want to use. If the count is greater than 1, distributed training is triggered.
- data_upload: if true, data is uploaded by S3 sync; if false, data is not uploaded.



In [None]:
import os
import sagemaker
import yaml

sagemaker_config_path = 'conf/train_sagemaker.yaml'
image_uri = '(Account_id).dkr.ecr.us-west-2.amazonaws.com/sagemaker-espnet-pytorch111:latest'

dataset_name =  os.getcwd().split('/')[-2]  # ljspeech
task_name = os.getcwd().split('/')[-1] # tts1
key_prefix = os.path.join('sagemaker_espnet',dataset_name,task_name)

config_dict = dict(
    s3_bucket = sagemaker.Session().default_bucket(),
    image_uri = image_uri,
    role = sagemaker.get_execution_role(),
    key_prefix = key_prefix,
    train_instance_type = 'ml.p4d.24xlarge',
    train_instance_count = 1,
    data_upload = 'true'
)

with open(sagemaker_config_path, 'w') as outfile:
    yaml.dump(config_dict, outfile, default_flow_style=False)
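
Before launching a job, it can help to sanity-check the values written to the config. The sketch below is only an illustration: the function `check_sagemaker_config` is hypothetical, and treating every key from the parameter list as required is an assumption, since the checks actually performed by the modified `launch.py` are not shown here. The example values (account ID, role ARN, bucket name) are placeholders.

```python
# Keys taken from the parameter list in this section (assumed to be required).
REQUIRED_KEYS = {
    "s3_bucket", "image_uri", "role", "key_prefix",
    "train_instance_type", "train_instance_count", "data_upload",
}


def check_sagemaker_config(config):
    """Return a list of problems; an empty list means none were found."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    if "(" in str(config.get("image_uri", "")):
        problems.append("image_uri still contains a placeholder such as (Account_id)")
    if not str(config.get("train_instance_type", "ml.")).startswith("ml."):
        problems.append("train_instance_type should look like ml.<family>.<size>")
    count = config.get("train_instance_count", 1)
    if not isinstance(count, int) or count < 1:
        problems.append("train_instance_count should be a positive integer")
    return problems


example = {
    "s3_bucket": "sagemaker-us-west-2-123456789012",
    "image_uri": "(Account_id).dkr.ecr.us-west-2.amazonaws.com/sagemaker-espnet-pytorch111:latest",
    "role": "arn:aws:iam::123456789012:role/ExampleRole",
    "key_prefix": "sagemaker_espnet/ljspeech/tts1",
    "train_instance_type": "ml.p4d.24xlarge",
    "train_instance_count": 1,
    "data_upload": "true",
}
for problem in check_sagemaker_config(example):
    print(problem)
```

In the notebook, the same function could be applied to `config_dict` right before it is dumped to YAML.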

## 4. Run Espnet on SageMaker

### 4-1. End-to-End Training



When you run Espnet on SageMaker, you have to execute `run.sh` with the argument `--sagemaker_train_config`. As with the original Espnet code, you can pass the arguments defined in `tts.sh`; for example, `--ngpu 0` for a non-GPU instance, or `--train_args "--max_epoch 10"` to limit the number of training epochs. **Note: do not forget to specify `--ngpu 0` for a non-GPU instance, because the script assumes a GPU instance and will otherwise raise a GPU-related error.**

In the first execution, the shell script begins by downloading the dataset as specified in `db.sh`; if `LJSPEECH=downloads` is set in `db.sh`, the script downloads the dataset automatically. If you have downloaded the dataset yourself, specify the name of the directory storing it instead of `downloads`.

**Note: After the download, this script will stop due to a permission error. Please execute the same shell script again; it then continues successfully to the step after the download. The error happens because we cannot run the script as root in a SageMaker Studio notebook. This issue may be addressed in the future.**

If you do not want to see the expanded log messages, try `Enable Scrolling for Outputs` in the right-click menu on the log output.


In [None]:
!bash run.sh --sagemaker_train_config conf/train_sagemaker.yaml --ngpu 0 --train_args "--max_epoch 10"
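
When the command is assembled programmatically, the nested quoting of `--train_args` is easy to get wrong. The following is a minimal sketch using `shlex.join` from the standard library (Python 3.8 or later); the helper `build_run_command` is hypothetical and only covers the three options used in the cell above.

```python
import shlex


def build_run_command(sagemaker_config, ngpu=0, train_args=None, script="run.sh"):
    """Assemble the run.sh command line with shell-safe quoting."""
    command = ["bash", script, "--sagemaker_train_config", sagemaker_config,
               "--ngpu", str(ngpu)]
    if train_args:
        # The whole string is passed as ONE argument, so it must be quoted.
        command += ["--train_args", train_args]
    return shlex.join(command)


print(build_run_command("conf/train_sagemaker.yaml", ngpu=0, train_args="--max_epoch 10"))
```

The printed string can be pasted after `!` in a notebook cell.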

### 4-2. Resume the process

After each step completes, you will see a log message like `2022-09-20T13:26:27 (asr.sh:758:main) Generate 'exp/lm_stats_jp_char/run.sh'. You can resume the process from stage 6 using this script`. Following this message, you can resume the process.

In [None]:
!bash exp/lm_stats_jp_char/run.sh
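
If the log is long, the resume script can be located automatically. This is a minimal sketch, assuming the log message keeps the wording quoted above; the function `find_resume_point` is hypothetical and not part of Espnet.

```python
import re

# Matches messages such as:
#   ... Generate 'exp/lm_stats_jp_char/run.sh'. You can resume the process from stage 6 using this script
RESUME_PATTERN = re.compile(
    r"Generate '(?P<script>[^']+)'\. You can resume the process from stage (?P<stage>\d+)"
)


def find_resume_point(log_text):
    """Return (script_path, stage) from the last resume message, or None."""
    matches = list(RESUME_PATTERN.finditer(log_text))
    if not matches:
        return None
    last = matches[-1]
    return last.group("script"), int(last.group("stage"))


log_line = (
    "2022-09-20T13:26:27 (asr.sh:758:main) Generate 'exp/lm_stats_jp_char/run.sh'. "
    "You can resume the process from stage 6 using this script"
)
print(find_resume_point(log_line))
```

The last match is used because each completed step writes a new message, and the most recent one is the relevant resume point.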

### 4-3. Training with different configuration
#### 4-3-1. Train conformer fastspeech2 + hifigan G + hifigan D from scratch

If you want to use a GAN for the TTS task (`tts_task = gan_tts`), you can use the preset defined in `conf/tuning/train_joint_conformer_fastspeech2_hifigan.yaml`. This trains a fastspeech2 model, which requires a teacher model to generate durations before training.

Let's begin by downloading a teacher model, `tacotron2`, from https://huggingface.co/espnet/kan-bayashi_ljspeech_tacotron2.
To download it, we first install git-lfs, which enables us to download large files with git.


In [None]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
!apt-get install git-lfs

With git-lfs, we can clone the model files from `huggingface.co`. The downloaded files are a model file and stats files. By default, Espnet assumes those files are located under `exp`, so this cell moves them to `exp`.

In [None]:
!git lfs install
!git clone https://huggingface.co/espnet/kan-bayashi_ljspeech_tacotron2 pretrained
!mkdir -p exp
!mv pretrained/exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space ./exp
!mv pretrained/exp/tts_stats_raw_phn_tacotron_g2p_en_no_space ./exp

As explained in the instructions at https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/tts1/README.md#fastspeech2-training, the following code generates durations in `decode_use_teacher_***` under the directory specified in `--tts_exp`. Because the durations are generated from the data specified in `test_sets`, for example `data/tr_no_dev` and `data/dev`, you need to have the data before running the following cell. If you do not have it, you can prepare the data by running Stages -1 to 1. The arguments `--inference_model` and `--tts_exp` must be consistent with the model file name and the path to the directory storing the model file.

**Note: this takes a long time, over 10 hours with a t3.medium instance. Running this notebook on an accelerated computing instance, such as the P, G, or C family, is highly recommended.**
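
Since the run is long, it may be worth confirming that the data directories exist first. The following is a minimal sketch, assuming each entry of `test_sets` corresponds to a directory of the same name under `data/`; the function `missing_data_sets` is hypothetical.

```python
from pathlib import Path


def missing_data_sets(test_sets, data_root="data"):
    """Return the names in test_sets that have no directory under data_root."""
    root = Path(data_root)
    return [name for name in test_sets.split() if not (root / name).is_dir()]


missing = missing_data_sets("tr_no_dev dev eval1")
if missing:
    print("Prepare these sets first (Stages -1 to 1):", " ".join(missing))
else:
    print("All data sets are present.")
```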

In [None]:
!bash ./run.sh --stage 7 \
    --inference_model 199epoch.pth \
    --tts_exp exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space \
    --inference_args "--use_teacher_forcing true" \
    --test_sets "tr_no_dev dev eval1"

Before running the training job, you may need to shorten the paths (reduce the number of characters in them), because SageMaker allows us to pass up to 2500 characters for hyperparameters (equivalent to arguments). Long paths may lead to an error. The following cell simply shortens the paths.

In [None]:
!mv exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/decode_use_teacher_forcingtrue_199epoch/ exp/dump/
!mv exp/tts_stats_raw_phn_tacotron_g2p_en_no_space/ exp/stats/
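
The effect of shortening the paths can be estimated as follows. This is a rough sketch only: it uses the 2500-character figure quoted above and assumes the limit applies to the combined length of all keys and values, which may differ from how SageMaker actually counts. The names `hyperparameter_length` and `fits_limit` are hypothetical.

```python
# Limit quoted in the text above; treat it as an assumption and adjust if needed.
HYPERPARAMETER_CHAR_LIMIT = 2500


def hyperparameter_length(hyperparameters):
    """Total number of characters over all keys and values."""
    return sum(len(str(key)) + len(str(value)) for key, value in hyperparameters.items())


def fits_limit(hyperparameters, limit=HYPERPARAMETER_CHAR_LIMIT):
    """Return True if the combined length stays within the limit."""
    return hyperparameter_length(hyperparameters) <= limit


long_args = {
    "teacher_dumpdir": "exp/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space/"
                       "decode_use_teacher_forcingtrue_199epoch"
}
short_args = {"teacher_dumpdir": "exp/dump"}
print(hyperparameter_length(long_args), hyperparameter_length(short_args))
```

A single path rarely exceeds the limit on its own; the saving matters because Espnet passes many path-valued arguments in one job.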

We are now ready to train the fastspeech2 model. We need to pass the durations by specifying their directory with `--teacher_dumpdir`, and start the process from Stage 5.

In [None]:
!bash run.sh  --stage 5 \
            --train_config conf/tuning/train_joint_conformer_fastspeech2_hifigan.yaml \
            --sagemaker_train_config conf/train_sagemaker.yaml \
            --tts_task gan_tts \
            --tts_stats_dir exp/stats \
            --teacher_dumpdir exp/dump \
            --ngpu 0 \
            --train_args "--max_epoch 10"