# wav2vec-u CV-sv - GAN
> "GAN training for wav2vec-u on Common Voice Swedish"

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [kaggle, colab, wav2vec-u]

The original attempt on [Kaggle](https://www.kaggle.com/jimregan/wav2vec-u-cv-swedish-gan) does not run because of a cuDNN issue, but this notebook runs fine on Colab.

## Preparation

In [1]:
!pip install condacolab

Collecting condacolab
  Downloading https://files.pythonhosted.org/packages/ee/47/6f9fe13087c31aba889c4b09f9beaa558bf216bf9108c9ccef44e6c9dcfe/condacolab-0.1.2-py3-none-any.whl
Installing collected packages: condacolab
Successfully installed condacolab-0.1.2


In [2]:
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:36
🔁 Restarting kernel...


In [1]:
%%capture
!conda install -c pykaldi pykaldi -y

In [2]:
!git clone https://github.com/jimregan/fairseq/ --branch issue3581

Cloning into 'fairseq'...
remote: Enumerating objects: 28296, done.
remote: Total 28296 (delta 0), reused 0 (delta 0), pack-reused 28296
Receiving objects: 100% (28296/28296), 11.77 MiB | 28.02 MiB/s, done.
Resolving deltas: 100% (21286/21286), done.


In [3]:
!git clone https://github.com/kpu/kenlm

Cloning into 'kenlm'...
remote: Enumerating objects: 14046, done.
remote: Counting objects: 100% (359/359), done.
remote: Compressing objects: 100% (291/291), done.
remote: Total 14046 (delta 107), reused 121 (delta 55), pack-reused 13687
Receiving objects: 100% (14046/14046), 5.76 MiB | 17.08 MiB/s, done.
Resolving deltas: 100% (7987/7987), done.


In [4]:
%%capture
!apt-get -y install libeigen3-dev liblzma-dev zlib1g-dev libbz2-dev

In [5]:
%%capture
%cd /content/kenlm
!python setup.py install
%cd /tmp

In [6]:
import os
os.environ['PATH'] = f"{os.environ['PATH']}:/content/kenlm/build/bin/"
os.environ['FAIRSEQ_ROOT'] = '/content/fairseq'
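
The `PATH` entry above only helps if the KenLM command-line tools actually exist under `/content/kenlm/build/bin/`. `python setup.py install` installs the Python module, and no separate build of the binaries is shown in this notebook. The sketch below checks which tools are visible on the `PATH`. It assumes the usual KenLM binary names `lmplz` and `build_binary`, and `find_tools` is a made-up helper name used only for illustration.

```python
import shutil


def find_tools(names=("lmplz", "build_binary")):
    """Map each tool name to its resolved path, or None if it is not on PATH.

    The default names are an assumption about which KenLM binaries are needed.
    """
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

If a tool is reported as missing, the path set in the cell above probably does not contain a build yet.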

In [7]:
%cd /content/fairseq/

/content/fairseq


In [8]:
%%capture
!python setup.py install

In [9]:
os.environ['HYDRA_FULL_ERROR'] = '1'

In [10]:
%%capture
!pip install editdistance

See [Kaggle API in Colab](https://colab.research.google.com/github/corrieann/kaggle/blob/master/kaggle_api_in_colab.ipynb) for the Kaggle API setup used below.

In [11]:
%%capture
!pip install kaggle

In [12]:
from google.colab import files

uploaded = files.upload()

for fn, data in uploaded.items():
    print(f'User uploaded file "{fn}" with length {len(data)} bytes')
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 64 bytes


In [13]:
%cd /content

/content


In [14]:
!kaggle datasets download "jimregan/w2vu-cvsv-prepared-text"

Downloading w2vu-cvsv-prepared-text.zip to /content
 52% 9.00M/17.4M [00:00<00:00, 30.4MB/s]
100% 17.4M/17.4M [00:00<00:00, 54.4MB/s]


In [15]:
%%capture
!unzip /content/w2vu-cvsv-prepared-text.zip

In [16]:
!kaggle datasets download -d jimregan/w2vu-cvsv-precompute-pca512-cls128-mean-pooled

Downloading w2vu-cvsv-precompute-pca512-cls128-mean-pooled.zip to /content
100% 393M/394M [00:06<00:00, 50.5MB/s]
100% 394M/394M [00:06<00:00, 65.7MB/s]


In [17]:
%%capture
!unzip w2vu-cvsv-precompute-pca512-cls128-mean-pooled.zip

In [18]:
!rm *.zip

## GAN

In [19]:
import torch
torch.version.cuda

'10.1'

In [20]:
torch.backends.cudnn.version()

7603
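
Since the Kaggle run failed on a cuDNN issue, it can help to record these values together when comparing environments. This is a small sketch, and `describe_gpu_stack` is a hypothetical helper name. It relies only on `torch.__version__`, `torch.version.cuda`, `torch.backends.cudnn.version()` and `torch.cuda.is_available()`, and returns a stub if PyTorch is not installed.

```python
def describe_gpu_stack():
    """Collect the PyTorch, CUDA and cuDNN versions into one dict.

    Hypothetical helper for illustration; values are None when a
    component is unavailable (for example on a CPU-only build).
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "gpu_available": torch.cuda.is_available(),
    }


print(describe_gpu_stack())
```

Running the same cell on Kaggle and on Colab gives two dicts that can be compared directly.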

In [21]:
%cd /content/fairseq

/content/fairseq


In [22]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
%%writefile rungan.sh
PREFIX=w2v_unsup_gan_xp
TASK_DATA=/content/precompute_pca512_cls128_mean_pooled
TEXT_DATA=/content/preppedtext/phones/
KENLM_PATH=/content/preppedtext/phones/lm.phones.filtered.04.bin

PREFIX=$PREFIX CUDA_LAUNCH_BLOCKING=1 fairseq-hydra-train \
	-m --config-dir fairseq/config/model/wav2vecu/gan \
	--config-name w2vu \
	task.data=${TASK_DATA} \
	task.text_data=${TEXT_DATA} \
	task.kenlm_path=${KENLM_PATH} \
	checkpoint.no_epoch_checkpoints=true \
	checkpoint.save_dir=/content/drive/MyDrive/w2vu \
	'common.seed=range(0,5)'

Writing rungan.sh
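
The `-m` flag puts Hydra in multirun mode, and `common.seed=range(0,5)` is a sweep override, so the script launches one training job per seed. As far as I understand Hydra's `range` sweep, it follows Python's `range` semantics (the stop value is excluded), giving seeds 0 through 4. The sketch below only enumerates the overrides each job would receive. `expand_seed_sweep` is an illustrative helper, not part of fairseq or Hydra.

```python
def expand_seed_sweep(start, stop, base_overrides):
    """Return one override list per seed in range(start, stop).

    Illustrative only: mimics how a range sweep fans out into jobs.
    """
    return [base_overrides + [f"common.seed={seed}"] for seed in range(start, stop)]


# Overrides copied from rungan.sh (a subset, for brevity).
base = [
    "task.data=/content/precompute_pca512_cls128_mean_pooled",
    "task.text_data=/content/preppedtext/phones/",
]

jobs = expand_seed_sweep(0, 5, base)
for overrides in jobs:
    print(" ".join(overrides))
```

All jobs in the script appear to share the same `checkpoint.save_dir`, so a later seed would likely overwrite `checkpoint_last.pt` from an earlier one.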


In [24]:
!bash rungan.sh

[2021-06-08 09:24:25,403][valid][INFO] - {"epoch": 9314, "valid_loss": "1.007", "valid_ntokens": "3039.79", "valid_nsentences": "144.214", "valid_lm_score_sum": "-89285.5", "valid_num_pred_chars": "48498", "valid_vocab_seen_pct": "0.855401", "valid_uer": "100.707", "valid_weighted_lm_ppl": "80.0009", "valid_lm_ppl": "58.5375", "valid_wps": "16800.8", "valid_wpb": "3039.8", "valid_bsz": "144.2", "valid_num_updates": "149024", "valid_best_weighted_lm_ppl": "72.0447"}
[2021-06-08 09:24:25,405][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 9314 @ 149024 updates
[2021-06-08 09:24:25,405][fairseq.trainer][INFO] - Saving checkpoint to /content/drive/MyDrive/w2vu/checkpoint_last.pt
[2021-06-08 09:24:25,464][fairseq.trainer][INFO] - Finished saving checkpoint to /content/drive/MyDrive/w2vu/checkpoint_last.pt
[2021-06-08 09:24:25,464][fairseq.checkpoint_utils][INFO] - Saved checkpoint /content/drive/MyDrive/w2vu/checkpoint_last.pt (epoch 9314 @ 149024 updates, score 80