# Train a Coqui 🐸 STT model with a Common Voice Dataset 🤖

👋 Hello and welcome

This is a copy of the official Colab notebook, which is updated regularly:
https://github.com/coqui-ai/STT/tree/main/notebooks

You have to do three things to get this code running for your project:


*   Download the Common Voice dataset and upload it to your Google Drive
*   Create an alphabet.txt file for your language
*   Change a few of the paths to match your folder structure on Drive and your language code

I used this notebook with a paid Google Drive account and Colab+, but it should work with the free version if your dataset is small enough. Due to space limits in Colab, it likely won't work with CV datasets that are bigger than 150 GB when extracted and converted to WAV. Converting and packing the files inside Colab also gets very hard with datasets that are bigger than 15 GB.

## Transfer learning

Transfer learning documentation:
https://stt.readthedocs.io/en/latest/TRANSFER_LEARNING.html?highlight=transfer#transfer-learning-new-alphabet




# Basic setup

In [None]:
## Install Coqui STT 
!git clone --depth 1 https://github.com/coqui-ai/STT.git
!cd STT; pip install -U pip wheel setuptools; pip install .
# Right now Coqui STT needs a different version of TensorFlow for GPU use; this may change in the future
!pip uninstall --yes tensorflow && pip install tensorflow-gpu==1.15.4

In [None]:
# install libraries to convert mp3 to wav
!apt-get install sox libsox-fmt-mp3

## ✅ Mount Google Drive and Download your alphabet.txt

**First things first**: we need some data from Google Drive, GitHub or another source of your choice. 


In [None]:
# mount your private google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# create folder and download alphabet.txt
! mkdir -p /content/eo/
%cd /content/eo/
! wget https://raw.githubusercontent.com/parolteknologio/stt-esperanto/master/deepspeech-coqui/alphabet.txt
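
If no ready-made alphabet exists for your language, one can be derived from the transcripts. The following is a minimal sketch, not part of Coqui STT: `build_alphabet` and `write_alphabet` are hypothetical helpers, and both the lower-casing and the output layout (one character per line, with the space character on its own line) are assumptions you should check against the alphabet.txt downloaded above.

```python
from pathlib import Path


def build_alphabet(sentences):
    """Return the sorted list of distinct characters used in the transcripts."""
    chars = set()
    for sentence in sentences:
        # Assumption: transcripts are lower-cased before training.
        chars.update(sentence.lower())
    # Line breaks are separators, not part of the alphabet.
    chars.discard("\n")
    chars.discard("\r")
    return sorted(chars)


def write_alphabet(chars, path):
    """Write one character per line (assumed alphabet.txt layout)."""
    Path(path).write_text("".join(c + "\n" for c in chars), encoding="utf-8")


# Example with two made-up Esperanto sentences.
alphabet = build_alphabet(["Saluton mondo", "Ĉu vi fartas bone"])
print(alphabet)
```

In practice the sentences would come from the `sentence` column of the Common Voice TSV files; review the result by hand, because stray punctuation or foreign characters in the transcripts end up in the alphabet too.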

# Convert mp3s to wav and create a tar.gz file of it
**You only have to do this once; after that, skip this step and use the tar file.** If you can do the conversion on a local machine, I recommend doing it there instead of on Colab, uploading the result to Google Drive, and skipping this step.

Based on https://stt.readthedocs.io/en/latest/COMMON_VOICE_DATA.html

In [None]:
# untar the Dataset from Common Voice
!mkdir -p /content/data
!tar -xzvf "/content/drive/MyDrive/Deepspeech/cv-corpus-7.0-2021-07-21-eo.tar.gz" -C "/content/data"   

In [None]:
#rename folder for easier paths below
!mv /content/data/cv-corpus-7.0-2021-07-21 /content/data/cv-corpus-7

In [None]:
# This step converts the mp3s to wav-files. The result will be around three times as big as your mp3 folder.
!/content/STT/bin/import_cv2.py --filter_alphabet /content/eo/alphabet.txt /content/data/cv-corpus-7/eo
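
Before starting the conversion, it can help to check whether the WAV files will fit on the disk. The sketch below uses the rule of thumb from the comment above (WAV output around three times the size of the MP3 folder); `folder_size_bytes` and `conversion_fits` are hypothetical helpers, and the factor of 3 is an estimate, not a guarantee.

```python
import shutil
from pathlib import Path


def folder_size_bytes(folder, pattern="*.mp3"):
    """Sum the sizes of all files below `folder` that match `pattern`."""
    return sum(p.stat().st_size for p in Path(folder).rglob(pattern))


def conversion_fits(mp3_bytes, free_bytes, growth_factor=3.0):
    """True if the estimated WAV output fits into the free space.

    The MP3s stay on disk until they are deleted, so the WAVs
    need `growth_factor * mp3_bytes` of additional free space.
    """
    return mp3_bytes * growth_factor <= free_bytes


# In the notebook the folder would be the clips directory used above.
free = shutil.disk_usage(".").free
mp3_total = folder_size_bytes(".")
print("fits" if conversion_fits(mp3_total, free) else "not enough space")
```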


In [None]:
# delete all mp3 files AFTER they have been converted to wav (rm doesn't work with this many files in Colab)
!find /content/data/cv-corpus-7/eo/clips/ -name "*.mp3" -delete

In [None]:
#pack WAVs and CSVs into tar.gz inside of the workspace
!tar czf /content/data/converted-eo-corpus-7.tar.gz /content/data/cv-corpus-7/
# if the archive is too big you can split it into chunks like this
# (the parts contain gzip data even though their names say "bz2"):
#!split -b 6000M /content/data/converted-eo-corpus-7.tar.gz "/content/data/corpus.tar.bz2.part"

tar: Removing leading `/' from member names
tar: /content/data/cv-corpus-7: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors


In [None]:
#Delete Wav Files because the copy process needs disk space. This can also be done in the terminal during the copy process.
!find /content/data/cv-corpus-7/eo/clips/ -name "*.wav" -delete

In [None]:
#copy big file to Google Drive (for small files in low numbers mv also works well). 
import shutil
shutil.move("/content/data/converted-eo-corpus-7.tar.gz", "/content/drive/MyDrive/Deepspeech/")

# in case the archive was split into parts:
#!cp /content/data/* /content/drive/MyDrive/Deepspeech/stt-downloads/converted-corpus-parts

'/content/drive/MyDrive/Deepspeech/converted-eo-corpus-7.tar.gz'

In [None]:
# if files don't appear in your Google Drive, this often helps (sync and disconnect drive)
from google.colab import drive
drive.flush_and_unmount()
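
To confirm that a large archive arrived on Drive intact, one option is to compare checksums taken before and after the move. This is a sketch with a hypothetical helper `sha256_of`; the temporary files stand in for the real archive and its copy on Drive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, block_bytes=1024 * 1024):
    """Hash a file in blocks so large archives never sit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_bytes), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "archive.tar.gz"
    source.write_bytes(b"example archive bytes")
    checksum_before = sha256_of(source)   # take this before moving
    moved = Path(tmp) / "moved.tar.gz"
    source.rename(moved)                  # stands in for the move to Drive
    checksum_after = sha256_of(moved)     # take this after the move

print(checksum_before == checksum_after)
```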

# Untar converted wav-files and download checkpoints
This part uses the file prepared above. This speeds up the process a lot: untarring is much quicker than converting everything every time you want to train a model.

In [None]:
! mkdir -p /content/data
#!tar -xzvf "/content/drive/MyDrive/Deepspeech/stt-downloads/converted-eo-corpus-7.tar.gz" -C "/content/data" 
#192 GB 30 min  
# if you have a split archive use
%cd /content/drive/MyDrive/Deepspeech/stt-downloads/converted-corpus-parts
!cat corpus.tar.bz2.* | tar xvfz - -C /content/data
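
What `split` and `cat` do with the archive can be sketched in plain Python: cut a file into fixed-size parts, then concatenate the parts in name order to get the original bytes back. `split_file` and `join_files` are hypothetical helpers for illustration; the zero-padded part suffix is an assumption and differs from the suffixes `split` generates.

```python
import shutil
import tempfile
from pathlib import Path


def split_file(src, part_prefix, chunk_bytes, block_bytes=1024 * 1024):
    """Write `src` into numbered parts of at most `chunk_bytes` each."""
    parts = []
    with open(src, "rb") as source:
        index = 0
        while True:
            block = source.read(min(block_bytes, chunk_bytes))
            if not block:
                break
            part = Path(f"{part_prefix}{index:03d}")
            written = 0
            with open(part, "wb") as out:
                while block:
                    out.write(block)
                    written += len(block)
                    remaining = chunk_bytes - written
                    if remaining <= 0:
                        break
                    block = source.read(min(block_bytes, remaining))
            parts.append(part)
            index += 1
    return parts


def join_files(parts, dest):
    """Concatenate the parts in sorted name order into `dest`."""
    with open(dest, "wb") as out:
        for part in sorted(parts):
            with open(part, "rb") as source:
                shutil.copyfileobj(source, out)


with tempfile.TemporaryDirectory() as tmp:
    payload = bytes(range(10))
    original = Path(tmp) / "archive.bin"
    original.write_bytes(payload)
    parts = split_file(original, str(Path(tmp) / "archive.part"), chunk_bytes=4, block_bytes=3)
    part_sizes = [p.stat().st_size for p in parts]
    joined = Path(tmp) / "joined.bin"
    join_files(parts, joined)
    restored = joined.read_bytes()

print(part_sizes, restored == payload)
```

The order of the parts matters: concatenating them in any other order yields a corrupt archive, which is why the part names must sort correctly.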

In [None]:
#create folders for checkpoints and exports
! mkdir -p /content/eo/checkpoints
! mkdir -p /content/eo/exports

In [None]:
#copy checkpoints
#!cp /content/drive/MyDrive/Deepspeech/old_checkpoints/2048_transfer_learning_1-5/* /content/eo/checkpoints
!cp /content/drive/MyDrive/Deepspeech/checkpoints/* /content/eo/checkpoints

In [None]:
#copy scorer
!cp /content/drive/MyDrive/Deepspeech/stt-downloads/kenlm.scorer /content/eo/

In [None]:
#english checkpoints for transfer learning
! mkdir -p /content/en/
%cd /content/en/
!wget https://github.com/coqui-ai/STT/releases/download/v1.0.0/coqui-stt-1.0.0-checkpoint.tar.gz
!tar -xzvf "coqui-stt-1.0.0-checkpoint.tar.gz" -C "/content/en" 

# ✅ Configure & set hyperparameters

Coqui STT comes with a long list of hyperparameters you can tweak. We've set default values, but you will often want to set your own. You can use `initialize_globals_from_args()` to do this. 

You must **always** configure the paths to your data, and you must **always** configure your alphabet. Additionally, here we show how you can specify the size of hidden layers (`n_hidden`), the number of epochs to train for (`epochs`), and to initialize a new model from scratch (`load_train="init"`).

https://stt.readthedocs.io/en/latest/playbook/TRAINING.html
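
The choice between `load_train="init"` and `load_train="best"` can be derived from the contents of the checkpoint folder instead of being switched by hand. The sketch below is a hypothetical helper, not part of Coqui STT; it assumes that a previous run leaves an index file named `best_dev_checkpoint` in the checkpoint directory, so verify that name against your own folder.

```python
import tempfile
from pathlib import Path


def pick_load_train(checkpoint_dir, marker_name="best_dev_checkpoint"):
    """Return "best" if a best-dev index file exists, otherwise "init".

    Assumption: `marker_name` is the index file written by earlier runs.
    """
    marker = Path(checkpoint_dir) / marker_name
    return "best" if marker.is_file() else "init"


with tempfile.TemporaryDirectory() as tmp:
    mode_empty = pick_load_train(tmp)                   # fresh folder
    (Path(tmp) / "best_dev_checkpoint").write_text("")  # simulate an earlier run
    mode_resumed = pick_load_train(tmp)

print(mode_empty, mode_resumed)
```

The result could then be passed as `load_train=pick_load_train("/content/eo/checkpoints")` in the configuration cell.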

In [None]:
from coqui_stt_training.util.config import initialize_globals_from_args
#@title String fields
initialize_globals_from_args(
    alphabet_config_path="/content/eo/alphabet.txt", #@param {type:"string"}
    train_files=["/content/data/content/data/cv-corpus-7/eo/clips/train-all.csv"], #@param {type:"string"}
    dev_files=["/content/data/content/data/cv-corpus-7/eo/clips/dev.csv"],#@param {type:"string"}
    test_files=["/content/data/content/data/cv-corpus-7/eo/clips/test.csv"], #@param {type:"string"}
    load_train="best",  #@param ["best", "init"] {allow-input: true} 
    #@markdown load_train="init" for the first epoch and "best" for any further training from saved checkpoints
    n_hidden=2048, 
    #@markdown size of the model. The default of 2048 is only useful for thousands of hours of data or transfer learning
    epochs=1, #@param {type:"raw"} # keep the epoch number small if you want to save checkpoints regularly and stay within Colab's time restrictions
    train_batch_size=4,#@param {type:"raw"} # a smaller batch size can give better accuracy but also means slower training
    dev_batch_size=4,#@param {type:"raw"}
    test_batch_size=4,#@param {type:"raw"}
    export_batch_size=4,#@param {type:"raw"}
    #automatic_mixed_precision=True,
    dropout_rate=0.3, #@param {type:"raw"} # the default of 0.5 is not ideal for datasets with less than 1000 hours
    learning_rate=0.0001, #@param {type:"raw"} #decreased after problems with growing loss
    checkpoint_dir="/content/eo/checkpoints",#@param {type:"string"}
    #load_checkpoint_dir="/content/en/coqui-stt-1.0.0-checkpoint",
    #save_checkpoint_dir="/content/eo/checkpoints",
    #drop_source_layers=1, #remove this after the first transfer learning
    export_dir="/content/eo/exports", #@param {type:"string"}
    scorer_path="/content/eo/kenlm.scorer",#@param {type:"string"}
    load_cudnn=True
)

In [None]:
from coqui_stt_training.util.config import Config

# Take a peek at the entire Config
print(Config.to_json())

## ✅ Train a new model

Let's kick off a training run 🚀🚀🚀 (using the configuration you set above).

This notebook should work on either a GPU or a CPU. However, GPU training is a lot quicker.

https://stt.readthedocs.io/en/latest/TRAINING_ADVANCED.html

In [None]:
from coqui_stt_training.train import train, early_training_checks
from coqui_stt_training.evaluate  import test

early_training_checks()

train()
!cp /content/eo/checkpoints/* /content/drive/MyDrive/Deepspeech/checkpoints/
test()

## ✅ Test the model


In [None]:
from coqui_stt_training.evaluate  import test
from coqui_stt_training.util.config import Config

Config.test_files=["/content/data/content/data/cv-corpus-7/eo/clips/test.csv"]
Config.load_checkpoint_dir="/content/eo/checkpoints"

test()

# Create Production Model
https://stt.readthedocs.io/en/latest/EXPORTING_MODELS.html

`n_hidden` has to be identical to the value you set in the configuration above.

In [None]:
!python3 -m coqui_stt_training.export --n_hidden 2048 --checkpoint_dir /content/eo/checkpoints/ --export_dir /content/drive/MyDrive/Deepspeech/exports
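
One way to keep `n_hidden` identical between training and export is to build the export command from the same constant. This is a sketch: `export_command` and `N_HIDDEN` are hypothetical names, and only the flags already used in the cell above appear in it.

```python
import shlex

N_HIDDEN = 2048  # must equal the n_hidden used in the training configuration


def export_command(n_hidden, checkpoint_dir, export_dir):
    """Build the argument list for the export module shown above."""
    return [
        "python3", "-m", "coqui_stt_training.export",
        "--n_hidden", str(n_hidden),
        "--checkpoint_dir", checkpoint_dir,
        "--export_dir", export_dir,
    ]


cmd = export_command(
    N_HIDDEN,
    "/content/eo/checkpoints/",
    "/content/drive/MyDrive/Deepspeech/exports",
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`, and the same `N_HIDDEN` can be used as `n_hidden=N_HIDDEN` in the configuration cell.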

# Export big files to Drive

In [None]:
# copy big file to Google Drive (for small files in low numbers mv also works well).
# If space gets low during the transfer, open the terminal and use: find /content/data/ -name "*.wav" -delete
import shutil
shutil.move("/content/eo", "/content/drive/MyDrive/Deepspeech/cp/")

In [None]:
# if files don't appear in your Google Drive, this often helps (sync and disconnect drive)
from google.colab import drive
drive.flush_and_unmount()