
# Spoken Language Processing - Instituto Superior Técnico
## Laboratory Assignment 2 - Native Language Identification challenge

# PART 2 - Fine-tuning a self-supervised pre-trained model


This notebook contains the guide and code cells for implementing an advanced native language identification system based on fine-tuning a self-supervised pre-trained model. It also shows how to obtain predictions and score the systems on the development set.

**In contrast to the previous notebook, we will use some scripts that are part of the S3PRL framework. These are typically run in a terminal, so some of the following steps may be simpler to run in a terminal rather than as a cell in the Notebook itself.**

## Before starting

Let's import some modules and make some definitions:

In [None]:
import os 
import csv 

from pf_tools import CheckThisCell

LANGUAGES = ('CHI',  'GER',  'HIN',  'ITA')
LANG2ID = {'CHI':1, 'GER':2, 'HIN':3, 'ITA':4}
ID2LANG = {idx: lang for lang, idx in LANG2ID.items()}  # inverse mapping: numeric ID -> language label


Like in the previous Notebooks, you need to mount Google Drive if you are working on Google Colab. Otherwise, skip or delete the following code cell:

In [None]:

raise CheckThisCell ## <---- Remove this line to run this cell if you are on Google Colab
from google.colab import drive
drive.mount('/content/drive')


As in Part 1, the audio data is expected to be located in a folder with the following structure:

```
ets_data/
├── train/
│   ├── audio/
│   │   └── wav files
│   └── key.lst
│
└── train100/
    ├── audio/
    │   └── wav files
    └── key.lst
...
```

You should already have this from Part 1, so you can set up your data directory:

In [None]:

# raise CheckThisCell ## <---- Remove this after completing/checking this cell

CWD = os.getcwd()
DATADIR = f'{CWD}/ets_data/' # <--- Change this variable to the directory containing the ETS data
if not os.path.isdir(DATADIR):
    os.mkdir(DATADIR)
    print(f'WARNING: Your data is not in the folder {DATADIR}')

os.chdir(CWD)
print(f'Your ETS data should be in this folder {DATADIR}')
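
Before moving on, it can help to confirm that the folder really follows the layout shown above. The following is a minimal sketch, not part of the lab material: the helper name `missing_entries` is hypothetical, and which entries are required in each partition (for example, whether every partition ships a `key.lst`) is an assumption you should adapt to your data.

```python
import os
import tempfile

def missing_entries(datadir, partitions, required=("audio",)):
    """Return the relative paths under datadir that are expected but absent."""
    missing = []
    for part in partitions:
        for entry in required:
            if not os.path.exists(os.path.join(datadir, part, entry)):
                missing.append(os.path.join(part, entry))
    return missing

# Self-contained demo on a temporary directory: only train/ is populated
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "train", "audio"))
    open(os.path.join(tmp, "train", "key.lst"), "w").close()
    demo_missing = missing_entries(
        tmp, ("train", "train100"), required=("audio", "key.lst")
    )
    print(demo_missing)
```

In the notebook, a call such as `missing_entries(DATADIR, ('train', 'train100'), required=('audio', 'key.lst'))` should return an empty list when the data is in place.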


If you need to download the data again, you can run the following cell:

In [None]:
raise CheckThisCell

os.chdir(DATADIR)

# download train
!wget http://groups.tecnico.ulisboa.pt/speechproc/pf24/lab2/train.tgz
!tar -xzvf train.tgz

#download train100
!wget http://groups.tecnico.ulisboa.pt/speechproc/pf24/lab2/train100.tgz
!tar -xzvf train100.tgz

#download dev
!wget http://groups.tecnico.ulisboa.pt/speechproc/pf24/lab2/dev.tgz
!tar -xzvf dev.tgz

#download evl
!wget http://groups.tecnico.ulisboa.pt/speechproc/pf24/lab2/evl.tgz
!tar -xzvf evl.tgz

os.chdir(CWD)

## 2.1 The SSL based model

The goal of this part of the laboratory is to expose students to modern tools and methods for speech classification.
In particular, we will use the [s3prl](https://github.com/s3prl/s3prl) toolkit to build a native language identification system that uses self-supervised learning (SSL) models as feature extractors.

[s3prl](https://github.com/s3prl/s3prl), which stands for Self-Supervised Speech Pre-training and Representation Learning, is an open-source toolkit. In this toolkit, self-supervised pre-trained speech models are called upstream models, and they are used in various downstream tasks.

The toolkit supports pre-training upstream models, loading already pre-trained upstream models, and using these upstream models in the many downstream tasks it already defines.

For this lab, the faculty team configured a downstream task and a simple model specifically for our native language identification task and data. The model consists of a projection layer, followed by average pooling over time and a linear output layer.
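
To make the three stages concrete, here is a toy forward pass written in plain Python so that it runs without any dependency. It is only an illustrative sketch: it is not the code shipped in the lab's downstream task, and all function names, dimensions, and weights are made up for the example.

```python
def linear(x, weight, bias):
    """Apply y = W x + b to one vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def mean_pool(frames):
    """Average a list of equally sized vectors over the time axis."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def forward(features, proj_w, proj_b, out_w, out_b):
    projected = [linear(f, proj_w, proj_b) for f in features]  # one vector per frame
    pooled = mean_pool(projected)                              # single utterance vector
    return linear(pooled, out_w, out_b)                        # one score per class

# Toy input: 3 frames of 2-dimensional upstream features (made-up values)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proj_w, proj_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # identity projection
out_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]  # 4 classes
out_b = [0.0, 0.0, 0.0, 0.0]

logits = forward(features, proj_w, proj_b, out_w, out_b)
predicted_class = max(range(len(logits)), key=logits.__getitem__)
print(logits, predicted_class)
```

The point of the pooling step is that utterances of any length end up as one fixed-size vector, so a single linear layer can produce the four language scores.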

In this part of the lab, students are expected to *play* with the different upstream models to build the best possible native language identification system.
In particular, students are encouraged to explore which of the available SSL models is the best candidate for their classification system. Note that using a large SSL model will make the training process very slow, so you should choose wisely, for instance based on the performance reported in similar tasks.

Besides playing with the different upstream models, interested students can try to modify some of the details of the "expert" downstream model. This can be done relatively easily using one of the many examples already included in the toolkit as a starting point.

## 2.2 Installing the toolkit
Let's start by cloning the repository of the s3prl toolkit:

In [None]:
!git clone https://github.com/s3prl/s3prl.git

As a result, a new folder named  `s3prl/` with the contents of the toolkit has been created. We'll now install the toolkit itself.

In [None]:
S3PRLDIR = CWD + '/s3prl/'
os.chdir(S3PRLDIR)
!pip install -e .

## 2.3 Configuring the downstream task
Let's create the downstream native language identification task. The `s3prl/downstream/` folder contains plenty of examples. The faculty team took one of them as a starting point to create the configuration needed for this lab assignment. Let's download it and copy it to the downstream folder:

In [None]:
os.chdir(f'{S3PRLDIR}/s3prl/downstream/') # <----- change to the downstream folder
!wget http://groups.tecnico.ulisboa.pt/speechproc/pf24/lab2/nli_s3prl_downstream.tgz # <--- download the lab specific downstream task
!tar -xzvf nli_s3prl_downstream.tgz  # <---- unzip
!rm nli_s3prl_downstream.tgz
os.chdir(S3PRLDIR)

Have a look at the contents of the folder `s3prl/downstream/native_language_identification/`. There are some important files that help to define the task:
- `dataset.py`: this file provides the class that loads the ETS data. It is similar to the ETS class used in Part 1, but follows the formatting rules of the s3prl toolkit. **You don't need to change anything here**.
- `expert.py`: this file defines the expert downstream task. In this case, the expert takes the output of the upstream model (configurable), applies a projection layer, and then applies a classification model (configurable) to obtain the final predictions. **You don't need to change anything here**.
- `model.py`: this file contains the definitions of the model applied after the projection. It could include several configurations that can later be selected when running the actual experiment. The model included is just an average pooling (which reduces the time dimension to a single vector) followed by a linear output layer. **You don't need to change anything here, but you may want to explore other configurations following the examples of other downstream tasks included in s3prl**.
- `config.yaml`: this file configures some parameters of your experiment, including the path that contains the task data and the training set to be used (either train or train100).

Let's configure our experiment: 

Edit the `file_path` entry in the configuration file `downstream/native_language_identification/config.yaml` so that it points to the folder containing the data:

```yaml
downstream_expert:
    datarc:
        file_path: "your_path/ets_data"
```



In the same `downstream/native_language_identification/config.yaml` file, select either the "train100" partition or the full training data "train" by editing the following entry:

```yaml
downstream_expert:
    datarc:
        ...
        train: "train100"
```

For quick experimentation with different configurations, you may also want to reduce the number of training steps to 1000 or 2000 by editing the `total_steps` entry (shown here with a value of 5000):

```yaml
runner:
  total_steps: 5000
  ...
```
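
If you prefer to apply these edits from Python rather than by hand, one possible approach is a plain text substitution that needs no YAML library. This is only a sketch: the helper `set_yaml_scalar` is hypothetical, it changes the first line whose key matches (so check that the key name is unique in your file), and it overwrites anything that follows the value on that line, including comments. The sample text only contains the entries shown above, not the full configuration file.

```python
import re

def set_yaml_scalar(text, key, value):
    """Replace the value of the first `key: ...` line, keeping its indentation."""
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}:[ \t]*).*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), text, count=1)
    if count == 0:
        raise KeyError(f"'{key}' not found")
    return new_text

# Sample built only from the entries discussed in this section
sample = '''runner:
  total_steps: 5000
downstream_expert:
    datarc:
        file_path: "your_path/ets_data"
        train: "train100"
'''

updated = set_yaml_scalar(sample, "total_steps", 1000)
updated = set_yaml_scalar(updated, "train", '"train"')
print(updated)
```

To use it on the real file, read `config.yaml` into a string, apply the substitutions, and write the result back, ideally after saving a backup copy.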


## 2.4 Training the downstream model and classification of the dev set
Now it's time for training. For that, we will use the Python script `run_downstream.py` in train mode, for which we need to set:
- an arbitrary name that identifies this experiment and the folder in which the results will be saved (`ExpName`)
- the upstream model to be used, for instance `fbank`. You can find more available pre-trained SSL models here: https://s3prl.github.io/s3prl/tutorial/upstream_collection.html
- the downstream task, in this case "native_language_identification"


```bash
python3 run_downstream.py -n ExpName -m train -u fbank -d native_language_identification
```

Since this can take a while (actually, a lot depending on the chosen upstream model), you probably want to run this in a terminal, rather than inside the Notebook:


In [None]:
os.chdir(f'{S3PRLDIR}/s3prl')
!python3 run_downstream.py -n fbank -m train -u fbank -d native_language_identification

This training step generates a folder containing the training results in `result/downstream/{ExpName}`:

In [None]:

os.chdir(f'{S3PRLDIR}/s3prl')
!ls result/downstream/fbank/

Some interesting files: `dev_predict.txt` contains the predictions on the dev set, `dev-best.ckpt` contains the model parameters of the best checkpoint, and `log.log` contains information about the training process, including the identification accuracy on the train and dev sets. Note that we will later use `dev_predict.txt` together with the predictions for the `evl` set to create the submission file to upload to Kaggle.
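
The dev predictions can also be scored directly once the true labels are available as a dictionary. The sketch below assumes that each line of `dev_predict.txt` holds a file identifier and a language label separated by whitespace, and that only the part of the identifier after the last `-` matters, which mirrors the parsing done in the submission cell of this notebook. The identifiers in the demo are made up, and loading the ground truth from your data is left to you.

```python
def parse_predictions(lines):
    """Map file id -> predicted language from 'file_id LANG' lines."""
    predictions = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        file_id, lang = parts
        predictions[file_id.split("-")[-1]] = lang
    return predictions

def accuracy(predictions, truth):
    """Fraction of correctly predicted files among those present in both mappings."""
    common = predictions.keys() & truth.keys()
    if not common:
        raise ValueError("no file ids in common between predictions and truth")
    return sum(predictions[k] == truth[k] for k in common) / len(common)

# Demo with hypothetical identifiers and labels
demo_lines = ["dev-0001 CHI", "dev-0002 GER", "dev-0003 ITA"]
demo_truth = {"0001": "CHI", "0002": "GER", "0003": "HIN"}
demo_accuracy = accuracy(parse_predictions(demo_lines), demo_truth)
print(f"{demo_accuracy:.3f}")
```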

If the training process is interrupted, you can continue from the last saved checkpoint (a checkpoint is saved every 200 iterations):

```bash
python3 run_downstream.py -m train -e result/downstream/fbank/states-2000.ckpt
```
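
When resuming, you need the checkpoint with the highest step count. A small sketch for picking it from a directory listing is given below; it assumes that intermediate checkpoints follow the `states-<step>.ckpt` naming seen in the command above, and the helper name `latest_checkpoint` is hypothetical.

```python
import re

def latest_checkpoint(filenames):
    """Return the 'states-<step>.ckpt' name with the largest step, or None."""
    best_step, best_name = -1, None
    for name in filenames:
        match = re.fullmatch(r"states-(\d+)\.ckpt", name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name

# Demo on a made-up directory listing
listing = ["states-200.ckpt", "states-2000.ckpt", "states-1800.ckpt",
           "dev-best.ckpt", "log.log"]
print(latest_checkpoint(listing))
```

Comparing the step as an integer matters here: a plain alphabetical sort would rank `states-200.ckpt` after `states-1800.ckpt`. In the notebook, the listing could come from `os.listdir('result/downstream/fbank')`.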

Finally, note that `fbank` is a very weak upstream model. You need to try other upstream models; you can start with those discussed in the lectures or those that have shown good performance in similar tasks. Be careful (and wise) in your decisions: experiments may be slow!
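
If you plan to compare several upstream models, it may be convenient to generate the training commands programmatically and then run them one by one in a terminal. The sketch below only uses the options already shown in this section (`-n`, `-m`, `-u`, `-d`). The helper name is hypothetical, and the upstream names other than `fbank` are assumptions: check the exact identifiers in the upstream collection page before running anything.

```python
import shlex

def build_train_command(upstream, experiment=None,
                        task="native_language_identification"):
    """Build the run_downstream.py training command as a list of arguments."""
    experiment = experiment or upstream  # default: name the experiment after the upstream
    return ["python3", "run_downstream.py",
            "-n", experiment, "-m", "train", "-u", upstream, "-d", task]

# 'hubert' and 'wav2vec2' are assumed names; verify them in the upstream collection
candidates = ["fbank", "hubert", "wav2vec2"]
commands = [shlex.join(build_train_command(u)) for u in candidates]
for command in commands:
    print(command)
```

Naming each experiment after its upstream keeps the results in separate folders under `result/downstream/`, which is also what the submission cell later in this notebook expects.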


## 2.5 Classification of the evl set 

Now that we have trained our downstream model, we can use it to predict on the blind evl set. For this purpose, we will again use the `run_downstream.py` script, this time in evaluate mode, and we need to select the checkpoint to use:


```bash
python3 run_downstream.py -m evaluate -e result/downstream/fbank/dev-best.ckpt
```

NOTE: Ignore the test accuracy reported at the end, since we don't have the ground truth for the evl set.

In [None]:
!python3 run_downstream.py -m evaluate -e result/downstream/fbank/dev-best.ckpt

The `test_predict.txt` file contains the predictions of this model for the evl set.

## 2.6 Create the final predictions file and submit to the challenge

Like in Part 1, we will create the predictions file in the expected format.
The predictions file used for submission and scoring is a CSV file containing the predictions of both the `dev` and `evl` partitions.
The file has two fields: fileId and Lang. The fileId is the unique audio file identifier and the Lang field is the language prediction (numeric from 1 to 4). The predictions file name must be as follows:

`G<YY>_<SYSTEMID>.csv` 

where `<YY>` is the students' group number (use 2 digits) and `<SYSTEMID>` is an identifying string for that submission/system.
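
As a small illustration of the naming rule, the following hypothetical helper zero-pads the group number to two digits. It is a sketch of the convention described above, not part of the lab material.

```python
def submission_filename(group, system_id):
    """Build 'G<YY>_<SYSTEMID>.csv' with a two-digit group number."""
    return f"G{int(group):02d}_{system_id}.csv"

example_name = submission_filename(7, "ssl_train100_fbank")
print(example_name)
```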

In [None]:

trainset = 'train100'
upstream_id = 'fbank' ## <--- CHANGE THIS ACCORDINGLY
group, system = '00', f'ssl_{trainset}_{upstream_id}'

filename_dev = f'{S3PRLDIR}/s3prl/result/downstream/{upstream_id}/dev_predict.txt'
filename_evl = f'{S3PRLDIR}/s3prl/result/downstream/{upstream_id}/test_predict.txt'


def read_predictions(path):
    # Each non-empty line holds a file identifier and the predicted language label
    with open(path, 'r') as f:
        return [line.split() for line in f if line.strip()]

# The file name must follow the G<YY>_<SYSTEMID>.csv convention
with open(f'{CWD}/G{group}_{system}.csv', 'w', newline='') as file:

    csv_writer = csv.writer(file) # CSV writer
    csv_writer.writerow(('fileId', 'Lang')) # Header of the CSV

    # Save dev results first, then evl results
    for filename in (filename_dev, filename_evl):
        for file_id, lang in read_predictions(filename):
            file_id = file_id.split('-')[-1] # Keep only the part after the last '-'
            csv_writer.writerow((file_id, LANG2ID[lang]))
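
Before uploading, it may be worth sanity-checking the generated file against the format described above: a `fileId,Lang` header, two fields per row, and a numeric language between 1 and 4. The sketch below is one possible check. The helper name is hypothetical, and the rule that each `fileId` should appear only once is an assumption rather than a stated requirement.

```python
import csv
import io

def validate_submission(rows, valid_langs=frozenset({"1", "2", "3", "4"})):
    """Return a list of human-readable problems found in the CSV rows."""
    rows = list(rows)
    problems = []
    if not rows or rows[0] != ["fileId", "Lang"]:
        problems.append("header must be fileId,Lang")
    seen = set()
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {number}: expected 2 fields")
            continue
        file_id, lang = row
        if lang not in valid_langs:
            problems.append(f"line {number}: Lang '{lang}' is not in 1..4")
        if file_id in seen:  # assumed rule: one prediction per file
            problems.append(f"line {number}: duplicate fileId '{file_id}'")
        seen.add(file_id)
    return problems

# Demo on in-memory CSV text with made-up identifiers
good = "fileId,Lang\n0001,1\n0002,4\n"
bad = "fileId,Lang\n0001,5\n0001,2\n"
good_problems = validate_submission(csv.reader(io.StringIO(good)))
bad_problems = validate_submission(csv.reader(io.StringIO(bad)))
print(good_problems, bad_problems)
```

For the real file, open it with `newline=''` and pass `csv.reader(file)` to the function; an empty list means no problems were detected by these checks.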

Finally, you can submit your prediction(s) in the following [Kaggle competition](https://www.kaggle.com/t/312cd4200cfb4e138ea9372ce5bc33fd).



# Contacts and support
You can contact the professors during classes or office hours.

In particular, for this second laboratory assignment, you should contact Prof. Alberto Abad: alberto.abad@tecnico.ulisboa.pt


