# Spoken Language Processing - Instituto Superior Técnico
### Laboratory Assignment 2 - Native Language Identification challenge

The second laboratory assignment of the course is designed to simulate a **native language identification** challenge. In this challenge, participants (that is, students enrolled in the course) receive training, development and evaluation (blind) data sets, together with a set of (weak) baseline systems for the task at hand: closed-set identification of the native language (L1) of non-native English (L2) speakers in a given audio file, out of a set of four target L1 languages: Chinese, German, Hindi, and Italian.

The **goal** for each participant is to build the best possible native language identification system. To this end, participants are first required to complete the lab notebooks and are then encouraged to incorporate other techniques and explore any approach that improves their results.

During the first week (Part 1), students are expected to:
- Understand the main components of a simple baseline system based on **MFCC** features and **GMM** models.
- Complete and run the main components of this baseline.
- Propose, develop and explore simple modifications to the feature extraction process.
- Propose, develop and explore simple modifications to the GMM native language models.
- Evaluate the models on the development partition.

During the second week (Part 2), students are expected to:
- Understand the main components of a modern system based on self-supervised learning.
- Complete and run the main components of this system.
- Propose, develop and explore simple modifications to the native language classifiers.
- Evaluate the models on the development partition.
- Propose, develop, extend, combine and/or modify the different systems to obtain the best possible native language identifier.
- Obtain predictions for the blind test partition and prepare the submission.


<!--
The challenge distinguishes two different tracks or evaluation conditions:
- Track 1 - Participants are not allowed to use any kind of pre-trained model (such as x-vectors).
- Track 2 - Participants are allowed to use anything.
-->

## About the data

The data consists of mono audio files sampled at 16 kHz, all of them containing English speech (L2) spoken by native speakers of one of the following L1 target languages: Chinese (`'CHI'`), German (`'GER'`), Hindi (`'HIN'`), and Italian (`'ITA'`).
```python 
LANGUAGES = ('CHI',  'GER',  'HIN',  'ITA')
```

The dataset is organized in 4 partitions:
- `'train'`: This is the full training set, consisting of 1200 audio samples of approximately 45 seconds each, containing speech from English students with different L1 backgrounds. (**ATTENTION**: Processing this partition can be slow. Do not use it to train your models unless your system is very fast or you are building your final model.)
- `'train100'`: This is a subset of the full training set that consists of 100 audio files per target L1 language. (**RECOMMENDATION**: Use this partition in your quick experiments to validate alternatives more rapidly.)
- `'dev'`: This is the development set. It contains the same kind of segments as the training partitions. You will typically use this to validate the quality of your model.
- `'evl'`: This is the evaluation set. It contains the same kind of segments as the training partitions. You do not have the ground truth for this set. You are expected to generate a prediction file and submit it.

The data used in this challenge is a subset of the [ETS Corpus of Non-Native Spoken English](http://dx.doi.org/10.21437/Interspeech.2016-129).

The difference is that only four L1 target languages are considered instead of eleven, and that the original development partition has been split into the development and evaluation sets used in this course.

## Before starting
The following conditions must be met to run this notebook correctly:

*   All modules included in the `requirements.txt` file need to be installed in the Python environment. In a local installation, you can run in the command line: `pip install -r requirements.txt`
*   The module `pf_tools` needs to be accessible (if you are using Google Colab, you will need to copy the `pf_tools.py` file every time you start a new session).

In [1]:
from pf_tools import CheckThisCell

## How can you download (and process) the data

The first thing we have to do is to set our working directory. If you are using Google Colab, you probably want to mount Google Drive to keep persistent information, such as data, features and models. If you are not using Google Colab, you should comment out or delete the following code cell:

In [2]:
raise CheckThisCell ## <---- Remove this to run this cell if you are on Google Colab
from google.colab import drive
drive.mount('/content/drive')

CheckThisCell: 

In [2]:
#raise CheckThisCell ## <---- Remove this after completing/checking this cell
import os 

CWD = os.getcwd() # <--- Change this variable to your working directory 
DATADIR = f'{CWD}/ets_data/' # <--- Change this variable to your folder containing the ETS data
# Create the data folder (and any missing parent folders) if it does not exist yet
os.makedirs(DATADIR, exist_ok=True)

os.chdir(CWD)
print(f'Current working directory is set to {CWD}')   
print(f'Your ETS data folder is {DATADIR}')


Current working directory is set to c:\Users\Piotr\Desktop\spoken-language-processing-23-24-IST\slp-lab2
Your ETS data folder is c:\Users\Piotr\Desktop\spoken-language-processing-23-24-IST\slp-lab2/ets_data/


The class `ETS` permits downloading, transforming and storing the different data partitions. Each `ETS` instance can be used to iterate over all the samples of the partition. It can also be used in combination with a PyTorch `DataLoader` to read batches of data for training neural networks. For instance, consider the following piece of code:


In [3]:
import numpy as np
from pf_tools import ETS
import librosa 

def audio_transform(filename):
    y, _ = librosa.load(filename, sr=16000, mono=True)
    return y.reshape(-1,1)
    
train_ets = ETS(DATADIR, 'train100', transform_id='raw', audio_transform=audio_transform)

100%|██████████| 491M/491M [00:55<00:00, 9.24MB/s] 
100%|██████████| 400/400 [00:21<00:00, 18.49it/s]



This will first download and uncompress the .tar.gz file containing all the necessary data of the `'train100'` partition, that is, the audio files (stored in DATADIR/train100/audio/) and the key file (DATADIR/train100/key.lst). Then, the audio transformation `audio_transform` is applied to each file and the result is stored in DATADIR/train100/raw/.

(If the feature extraction process is interrupted, you will need to delete the corresponding transformation folder to restart.)

**Audio transformations** receive the filename of an audio file and return an array of dimensions (NxD), in which N is the time dimension and D is the dimension of the feature vector. In this simple case D is 1 because the transform just returns the raw audio signal.

The `ETS` class permits chunking the output of the audio transformation (of size NxD) into chunks of size CxD. The chunking operation divides the result of the transformation into multiple smaller pieces with a configurable chunk size and hop length. These chunks can be further transformed and stored as individual feature files. The number of chunks depends on the length of the original file, the chunk size and the hop length. For instance:

In [4]:
train_ets = ETS(DATADIR, 'train100', 
                     transform_id='chunks', 
                     audio_transform=audio_transform, 
                     chunk_size=10*16000, 
                     chunk_hop=5*16000)

100%|██████████| 400/400 [00:07<00:00, 51.36it/s]


This will download and uncompress the partition data only if that was not already done before. Then, as previously, the simple transform that returns the waveform is applied to each audio file. After this, the resulting array of dimension Nx1, in which N = 16000 x duration_in_seconds, is split into chunks of length 160000 (that is, 10 seconds) with a chunk hop of 5 seconds, so consecutive chunks overlap by 5 seconds. Each one of these 10-second chunks is stored and will be accessed whenever we iterate over the dataset.
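The chunking step described above can be illustrated with a minimal NumPy sketch. The helper `split_into_chunks` is hypothetical and is not part of `pf_tools`; in particular, it assumes that trailing samples that do not fill a complete chunk are dropped, which may differ from what the `ETS` class actually does.

```python
import numpy as np

def split_into_chunks(x, chunk_size, chunk_hop):
    """Split an (N, D) array into overlapping (chunk_size, D) chunks.

    Assumption: samples at the end that do not fill a whole chunk are dropped.
    The real ETS class may treat them differently (for example, by padding).
    """
    n = x.shape[0]
    if n < chunk_size:
        return []
    starts = range(0, n - chunk_size + 1, chunk_hop)
    return [x[s:s + chunk_size] for s in starts]

sr = 16000
waveform = np.zeros((45 * sr, 1), dtype=np.float32)  # synthetic 45-second signal
chunks = split_into_chunks(waveform, chunk_size=10 * sr, chunk_hop=5 * sr)
print(len(chunks), chunks[0].shape)  # 8 (160000, 1)
```

Under this assumption, a 45-second file yields chunks starting at 0, 5, ..., 35 seconds, that is, 8 chunks of 10 seconds each.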

Additionally, the optional argument `chunk_transform` permits defining a transformation to be applied to each chunk before storing it to disk. It can be any function that receives an array of size CxD and returns an array of size HxW, in which H is the *new time dimension*. For instance, the following example takes the audio segments of size 160000x1, computes the mean and standard deviation every 0.1 seconds (1600 samples) and returns a feature array of size 100x2.

In [5]:
def chunk_transform(x):
    x = x.reshape(-1,1600)
    return np.concatenate((x.mean(axis=1, keepdims=True), x.std(axis=1, keepdims=True)),axis=1)

train_ets = ETS(DATADIR, 'train100', 
                     transform_id='chunks_mv', 
                     audio_transform=audio_transform, 
                     chunk_size=10*16000, 
                     chunk_hop=5*16000, 
                     chunk_transform=chunk_transform)


100%|██████████| 400/400 [00:07<00:00, 56.68it/s]



Notice that, while the above example is probably useless as an effective feature extraction method, the proper combination of audio and chunk transformations is expected to permit quite flexible feature extraction that (hopefully) can match the needs of almost any training setting. 
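Before launching a full feature extraction, it can be useful to check a chunk transformation on a synthetic chunk. The sketch below uses a hypothetical function `chunk_mean_std`, equivalent in spirit to the `chunk_transform` example above, and assumes that the chunk length is a multiple of the frame length (otherwise `reshape` raises an error).

```python
import numpy as np

def chunk_mean_std(x, frame_len=1600):
    """Per-frame mean and standard deviation of a (C, 1) chunk.

    Returns an array of shape (C // frame_len, 2). C must be a multiple
    of frame_len, otherwise the reshape below fails.
    """
    frames = x.reshape(-1, frame_len)
    return np.stack((frames.mean(axis=1), frames.std(axis=1)), axis=1)

# Synthetic 10-second chunk: first 0.1 s has value 3.0, the rest 1.0
chunk = np.ones((160000, 1), dtype=np.float32)
chunk[:1600] = 3.0

feats = chunk_mean_std(chunk)
print(feats.shape)  # (100, 2)
print(feats[0], feats[1])  # first frame: mean 3, std 0; second frame: mean 1, std 0
```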

Once we have instantiated an `ETS` dataset, we can iterate over it to access each processed sample, for instance:

In [6]:
import time
start = time.time()
for i, sample in enumerate(train_ets):
    data, label, basename = sample # array, str, str
    if i % 300 == 0:
        print(i, data.shape, label, basename)

print(f'Finished reading all data in {time.time() - start}')

0 (100, 2) HIN train_0005
300 (100, 2) CHI train_0110
600 (100, 2) ITA train_0224
900 (100, 2) CHI train_0332
1200 (100, 2) HIN train_0434
1500 (100, 2) ITA train_0556
1800 (100, 2) CHI train_0652
2100 (100, 2) GER train_0742
2400 (100, 2) GER train_0849
2700 (100, 2) CHI train_0960
3000 (100, 2) GER train_1057
Finished reading all data in 16.211992979049683


Now you can use the `ETS` class to check the number of files and the duration (in minutes) of the training set for each target language. You should keep these numbers to include them in your system description paper:

In [16]:
# Inspect the training data to find the size of each training language
#raise CheckThisCell ## <---- Remove this after completing/checking this cell

LANGUAGES = ('CHI',  'GER',  'HIN',  'ITA')    
train_ets = ETS(DATADIR, 'train', transform_id='raw', audio_transform=audio_transform)
train_ets = ETS(DATADIR, 'train100', transform_id='raw', audio_transform=audio_transform)

num_files = dict().fromkeys(LANGUAGES, 0)
num_samples = dict().fromkeys(LANGUAGES, 0)

# <----- LAB WORK: ADD YOUR CODE HERE
for item in train_ets:
    data, label, basename = item
    num_files[label] += 1
    num_samples[label] += data.shape[0]


for lang in LANGUAGES:
    print(f'{lang}:\t{num_files[lang]}\t{num_samples[lang]/(60*16000)} minutes')

# The expected output for train100 should be something like:
# CHI:	100 files	77.44 minutes
# GER:	100 files	77.58666666666667 minutes
# HIN:	100 files	77.58666666666667 minutes
# ITA:	100 files	77.59333333333333 minutes
#
# And the following for train:
# CHI:	300	232.38666666666666 minutes
# GER:	300	232.79333333333332 minutes
# HIN:	300	232.63333333333333 minutes
# ITA:	300	232.70666666666668 minutes

CHI:	100	77.44 minutes
GER:	100	77.58666666666667 minutes
HIN:	100	77.58666666666667 minutes
ITA:	100	77.59333333333333 minutes


Notice that the `ETS` class extends `torch.utils.data.Dataset`, so it can be used in combination with a PyTorch `DataLoader` to read data in batches:

In [8]:
import torch 

train_ets= ETS(DATADIR, 'train100', 
                     transform_id='chunks_mv', 
                     audio_transform=audio_transform, 
                     chunk_size=10*16000, 
                     chunk_hop=5*16000, 
                     chunk_transform=chunk_transform)

train_batches = torch.utils.data.DataLoader(
        dataset=train_ets,
        batch_size=10,
        shuffle=True
)

start = time.time()
for i, batch in enumerate(train_batches):
    data, label, basename = batch
    if i % 100 == 0:
        print(data.shape, len(label), len(basename))

print(f'Finished reading all data in {time.time() - start}')


torch.Size([10, 100, 2]) 10 10
torch.Size([10, 100, 2]) 10 10
torch.Size([10, 100, 2]) 10 10
torch.Size([10, 100, 2]) 10 10
Finished reading all data in 13.179200172424316
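
In the batches above, `label` is a sequence of language codes (strings), whereas PyTorch loss functions generally expect integer class indices. A minimal sketch of such a mapping follows; `LABEL_TO_INDEX` and `encode_labels` are hypothetical helpers, not part of `pf_tools`, and the index order simply follows `LANGUAGES`.

```python
LANGUAGES = ('CHI', 'GER', 'HIN', 'ITA')

# Hypothetical mapping from language code to class index (order of LANGUAGES)
LABEL_TO_INDEX = {lang: idx for idx, lang in enumerate(LANGUAGES)}

def encode_labels(labels):
    """Convert a sequence of language codes into a list of class indices."""
    return [LABEL_TO_INDEX[lang] for lang in labels]

batch_labels = ('HIN', 'CHI', 'ITA', 'GER', 'CHI')
print(encode_labels(batch_labels))  # [2, 0, 3, 1, 0]
```

Inside a training loop, the resulting list could be converted with `torch.tensor(encode_labels(label), dtype=torch.long)` before computing a classification loss.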


Before starting Part 1 of this lab assignment, you should delete the folders containing the dummy features that you just generated: _'raw'_, _'chunks'_, _'chunks_mv'_.

# What should you deliver at the end of this lab assignment?
You should deliver the following three elements:
- You must submit (via [Kaggle](https://www.kaggle.com/t/312cd4200cfb4e138ea9372ce5bc33fd) and Fênix) at least one prediction file in the format that will be described in the Part 1 notebook.
- You must submit (via Fênix) all the modified notebooks and any additional code used for your proposed system(s).
- You must submit a report (via Fênix) of at most 2 pages describing your work, your system(s), the approaches explored (including unsuccessful ones), the parameters explored, lessons learnt, results on the dev partition, etc. You can use the following Overleaf template for the report: [report](https://www.overleaf.com/latex/templates/interspeech-2023-paper-kit/kzcdqdmkqvbr)


# Contacts and support
You can contact the professors during classes or office hours.

For this second laboratory assignment in particular, you should contact Prof. Alberto Abad: alberto.abad@tecnico.ulisboa.pt



