# Train a Coqui 🐸 STT model with a Common Voice Dataset 🤖

👋 Hello and welcome

This is a copy of the official Colab notebook. The original is updated regularly and lives here:
https://github.com/coqui-ai/STT/tree/main/notebooks

You have to do three things to get this code running for your project:


*   Download the Common Voice dataset and upload it to your Google Drive
*   Create an alphabet.txt file for your language
*   Change a few of the paths to match your folder structure on Drive and your language code

I used this notebook with a paid Google Drive account and Colab+, but it should work with the free version if your dataset is small enough. Because of the space limits in Colab, it likely won't work with CV datasets that are bigger than 150 GB once extracted and converted to WAV. Converting and packing the files inside of Colab already gets very hard with datasets that are bigger than 15 GB.
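
Since disk space is the main constraint here, it can help to check how much room the runtime has before extracting anything. The following is a minimal sketch, not part of the original notebook: the helper names `free_gigabytes` and `fits_on_disk` and the 5 GB safety margin are assumptions made for illustration.

```python
import shutil


def free_gigabytes(path="/"):
    """Return the free space of the filesystem containing `path`, in GB (10**9 bytes)."""
    usage = shutil.disk_usage(path)
    return usage.free / 1e9


def fits_on_disk(required_gb, path="/", safety_margin_gb=5.0):
    """Return True if `required_gb` plus a safety margin fits into the free space.

    The default margin of 5 GB is an arbitrary assumption, not a Colab limit.
    """
    return free_gigabytes(path) >= required_gb + safety_margin_gb


print(f"Free space: {free_gigabytes():.1f} GB")
print("Room for a 15 GB dataset:", fits_on_disk(15))
```

Keep in mind that the required size is the extracted and converted dataset, not the size of the downloaded archive.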




# Basic setup

In [None]:
## Install Coqui STT 
!git clone --depth 1 https://github.com/coqui-ai/STT.git
!cd STT; pip install -U pip wheel setuptools; pip install .

In [None]:
# install libraries to convert mp3 to wav
!apt-get install sox libsox-fmt-mp3

## ✅ Mount Google Drive and Download your alphabet.txt

**First things first**: we need some data from Google Drive, GitHub or another source of your choice. 


In [None]:
# mount your private google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# create folder
! mkdir -p /content/eo/
%cd /content/eo/
! wget https://raw.githubusercontent.com/parolrekonado/deepspeech-esperanto/master/alphabet.txt
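
If no ready-made alphabet.txt exists for your language, one way to get a first draft is to collect the characters that occur in your transcripts. The sketch below is an assumption-laden illustration, not the method used by this notebook: `build_alphabet` and `write_alphabet` are hypothetical helpers, the sample sentences are made up, and lowercasing the text is a choice you may not want for your language. The result will usually contain punctuation, so review it by hand before training.

```python
def build_alphabet(sentences):
    """Collect the sorted set of lowercase characters that occur in `sentences`."""
    chars = set()
    for sentence in sentences:
        chars.update(sentence.lower())
    chars.discard("\n")  # a newline cannot be a label in a line-based file
    return sorted(chars)


def write_alphabet(chars, path):
    """Write one character per line; a leading '#' marks a comment line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("# One label per line. Lines starting with # are comments.\n")
        for char in chars:
            # A literal '#' label has to be escaped so it is not read as a comment.
            handle.write(("\\#" if char == "#" else char) + "\n")


# Hypothetical sample data; in practice you would read the transcripts
# from your dataset instead.
sample = ["Saluton mondo", "Ĉu vi parolas Esperanton"]
alphabet = build_alphabet(sample)
print(alphabet)
```

To use it, pass the transcripts of your dataset to `build_alphabet` and then call `write_alphabet(alphabet, "/content/eo/alphabet.txt")`.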

## Convert MP3s to WAV and pack them into a tar.gz file
**You only have to do this once. After that, skip this step and use the tar file.** If you can, run the conversion on a local machine instead of Colab, upload the result to Google Drive, and skip this step.

Based on https://stt.readthedocs.io/en/latest/COMMON_VOICE_DATA.html?highlight=common%20voice
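
To decide whether this section can be skipped, you could check whether the converted archive is already on your Drive. This is a small sketch under assumptions: the path matches the one used in the cells of this notebook and needs to be adjusted to your own folder layout, and `needs_conversion` is a hypothetical helper name.

```python
from pathlib import Path

# Path used by the cells in this notebook; adjust it to your own Drive layout.
CONVERTED_ARCHIVE = Path(
    "/content/drive/MyDrive/Deepspeech/converted-eo-corpus-7.tar.gz"
)


def needs_conversion(archive=CONVERTED_ARCHIVE):
    """Return True when the converted archive is missing or empty."""
    return not (archive.is_file() and archive.stat().st_size > 0)


if needs_conversion():
    print("No converted archive found: run the conversion cells in this section.")
else:
    print("Converted archive found: skip ahead to the untar section.")
```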

In [None]:
# untar the Dataset from Common Voice
!mkdir -p /content/data
!tar -xzvf "/content/drive/MyDrive/Deepspeech/cv-corpus-7.0-2021-07-21-eo.tar.gz" -C "/content/data"   

In [None]:
#rename folder for easier paths below
!mv /content/data/cv-corpus-7.0-2021-07-21 /content/data/cv-corpus-7

In [None]:
# This step converts the mp3s to wav-files. The result will be around three times as big as your mp3 folder.
!/content/STT/bin/import_cv2.py --filter_alphabet /content/eo/alphabet.txt /content/data/cv-corpus-7/eo --normalize


In [None]:
# delete all mp3 files AFTER they have been converted to wav (rm doesn't work with so many files in Colab)
!find /content/data/cv-corpus-7/eo/clips/ -name "*.mp3" -delete

In [None]:
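# pack WAVs and CSVs into a tar.gz inside of the workspace
# Note: tar drops the leading "/" from the stored paths, so extracting this archive
# with -C /content/data later puts the files under /content/data/content/data/cv-corpus-7/
!tar czf /content/data/converted-eo-corpus-7.tar.gz /content/data/cv-corpus-7/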

In [None]:
# move the big file to Google Drive (for a small number of small files, mv also works well)
# If space gets low during the transfer, open the terminal and run:
#   find /content/data/cv-corpus-7/eo/clips/ -name "*.wav" -delete
import shutil
shutil.move("/content/data/converted-eo-corpus-7.tar.gz", "/content/drive/MyDrive/Deepspeech/")

In [None]:
# if files don't appear in your Google Drive, this often helps (sync and disconnect drive)
from google.colab import drive
drive.flush_and_unmount()

# Untar converted wav-files and download checkpoints
This part uses the prepared archive from above. This speeds up the process a lot: untarring is much quicker than converting everything each time you want to train a model.

In [None]:
! mkdir -p /content/data
!tar -xzvf "/content/drive/MyDrive/Deepspeech/converted-eo-corpus-7.tar.gz" -C "/content/data"   

In [None]:
#create some missing folders
! mkdir -p /content/eo/checkpoints
! mkdir -p /content/eo/exports

In [None]:
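# If you have checkpoints from earlier training runs, you can import them here.
# Note: shutil.move removes the folder from Drive, and because /content/eo/checkpoints
# already exists, the folder ends up inside it (/content/eo/checkpoints/checkpoints).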
import shutil
shutil.move("/content/drive/MyDrive/Deepspeech/cp/checkpoints", "/content/eo/checkpoints")

## ✅ Configure & set hyperparameters

Coqui STT comes with a long list of hyperparameters you can tweak. We've set default values, but you will often want to set your own. You can use `initialize_globals_from_args()` to do this. 

You must **always** configure the paths to your data, and you must **always** configure your alphabet. Additionally, here we show how you can specify the size of hidden layers (`n_hidden`), the number of epochs to train for (`epochs`), and to initialize a new model from scratch (`load_train="init"`).

In [None]:
from coqui_stt_training.util.config import initialize_globals_from_args

initialize_globals_from_args(
    alphabet_config_path="/content/eo/alphabet.txt",
    train_files=["/content/data/content/data/cv-corpus-7/eo/clips/train-all.csv"],
    dev_files=["/content/data/content/data/cv-corpus-7/eo/clips/dev.csv"],
    test_files=["/content/data/content/data/cv-corpus-7/eo/clips/test.csv"],
    load_train="init",
    n_hidden=100,
    epochs=5,
    train_batch_size=64,
    dev_batch_size=64,
    test_batch_size=64,
    export_batch_size=64,
    automatic_mixed_precision=True,
    checkpoint_dir="/content/eo/checkpoints",
    export_dir="/content/eo/exports"
)
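
The CSV paths above contain `/content/data/content/data/` because the archive was created from an absolute path and then extracted into `/content/data`. A wrong path here only surfaces once training starts, so a quick existence check can save time. The following is a minimal sketch: `missing_files` is a hypothetical helper, and the list simply repeats the paths from the cell above.

```python
from pathlib import Path


def missing_files(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [path for path in paths if not Path(path).is_file()]


# Same CSV paths as in the configuration cell; adjust them to your dataset.
csv_files = [
    "/content/data/content/data/cv-corpus-7/eo/clips/train-all.csv",
    "/content/data/content/data/cv-corpus-7/eo/clips/dev.csv",
    "/content/data/content/data/cv-corpus-7/eo/clips/test.csv",
]

missing = missing_files(csv_files)
if missing:
    print("Missing files:")
    for path in missing:
        print("  ", path)
else:
    print("All CSV files found.")
```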

In [None]:
from coqui_stt_training.util.config import Config

# Take a peek at the entire Config
print(Config.to_json())

## ✅ Train a new model

Let's kick off a training run 🚀🚀🚀 (using the configuration you set above).

This notebook should work on either a GPU or a CPU. However, GPU training is a lot quicker.
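
To see whether the runtime has a GPU attached at all, one option is to ask `nvidia-smi`. This is only a rough sketch: `gpu_names` is a hypothetical helper, and a GPU that shows up here is not a guarantee that the training code will actually use it.

```python
import shutil
import subprocess


def gpu_names():
    """Return the names of the NVIDIA GPUs reported by nvidia-smi, or an empty list."""
    if shutil.which("nvidia-smi") is None:
        return []  # tool not installed, most likely a CPU-only runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # driver present but no usable GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = gpu_names()
print("GPUs found:", gpus if gpus else "none, training will run on the CPU")
```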

In [None]:
from coqui_stt_training.train import train, early_training_checks

early_training_checks()

train()

## ✅ Test the model


In [None]:
from coqui_stt_training.evaluate import test

test()

# Export model or checkpoints to Drive

In [None]:
# move the whole /content/eo folder (alphabet, checkpoints and exports) to Google Drive
# If space gets low during the transfer, open the terminal and run:
#   find /content/data/cv-corpus-7/eo/clips/ -name "*.wav" -delete
import shutil
shutil.move("/content/eo", "/content/drive/MyDrive/Deepspeech/cp/")

In [None]:
# if files don't appear in your Google Drive, this often helps (sync and disconnect drive)
from google.colab import drive
drive.flush_and_unmount()