# Lab Voice Data

The purpose of this lab is to gain familiarity with speech data you might use to train an Automatic Speech Recognition (ASR) system. In the following steps, you'll:

* Explore the LibriSpeech data set and format
* Create your own audio files
* Build your own audio data set

## Exploration of LibriSpeech Corpus 

LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from audiobooks read for the LibriVox project and has been carefully segmented and aligned, so each audio file is matched with its English transcript. Because the data set is free and built from public domain books, it is well suited to ASR training.

Source: http://www.openslr.org/12/

In [5]:
ls -la

total 1864
drwxr-xr-x@  9 x17  staff     306  7 Mar 15:08 ./
drwxr-xr-x@ 11 x17  staff     374  7 Mar 15:08 ../
-rw-r--r--@  1 x17  staff    6148  7 Mar 15:11 .DS_Store
-rwxr-xr-x@  1 x17  staff  117314  3 Oct  2014 BOOKS.TXT*
-rwxr-xr-x@  1 x17  staff  676931 17 Aug  2014 CHAPTERS.TXT*
-rwxr-xr-x@  1 x17  staff     199 17 Aug  2014 LICENSE.TXT*
-rwxr-xr-x@  1 x17  staff    8217  3 Oct  2014 README.TXT*
-rwxr-xr-x   1 x17  staff  127530 17 Aug  2014 SPEAKERS.TXT*
drwxr-xr-x@  4 x17  staff     136  7 Mar 15:09 dev-clean/


In [7]:
!cd LibriSpeech_Samples
!ls -la

# Lists of books, chapters, and speakers as text files,
# plus the 'dev-clean' folder, a sample of how the audio files are organized
# Note: `!cd` runs in a temporary subshell and does not change the notebook's
# working directory (use `%cd` for that). The error below apparently means the
# notebook was already inside this folder when the cell ran.

/bin/sh: line 0: cd: LibriSpeech_Samples: No such file or directory
total 1864
drwxr-xr-x@  9 x17  staff     306  7 Mar 15:08 .
drwxr-xr-x@ 11 x17  staff     374  7 Mar 15:08 ..
-rw-r--r--@  1 x17  staff    6148  7 Mar 15:11 .DS_Store
-rwxr-xr-x@  1 x17  staff  117314  3 Oct  2014 BOOKS.TXT
-rwxr-xr-x@  1 x17  staff  676931 17 Aug  2014 CHAPTERS.TXT
-rwxr-xr-x@  1 x17  staff     199 17 Aug  2014 LICENSE.TXT
-rwxr-xr-x@  1 x17  staff    8217  3 Oct  2014 README.TXT
-rwxr-xr-x   1 x17  staff  127530 17 Aug  2014 SPEAKERS.TXT
drwxr-xr-x@  4 x17  staff     136  7 Mar 15:09 dev-clean


In [14]:
!cd dev-clean
!ls -la

# 1993 --> speaker ID

/bin/sh: line 0: cd: dev-clean: No such file or directory
total 16
drwxr-xr-x@ 4 x17  staff   136  7 Mar 15:09 .
drwxr-xr-x@ 9 x17  staff   306  7 Mar 15:08 ..
-rw-r--r--@ 1 x17  staff  6148  7 Mar 15:11 .DS_Store
drwxr-xr-x@ 4 x17  staff   136  7 Mar 15:09 1993


In [16]:
!cd 1993
!ls -la

# 147965 --> chapter number

/bin/sh: line 0: cd: 1993: No such file or directory
total 16
drwxr-xr-x@  4 x17  staff   136  7 Mar 15:09 .
drwxr-xr-x@  4 x17  staff   136  7 Mar 15:09 ..
-rw-r--r--@  1 x17  staff  6148  7 Mar 15:11 .DS_Store
drwxr-xr-x@ 13 x17  staff   442  7 Mar 15:09 147965
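
The listings above show the layout: a subset folder, then one folder per speaker ID, then one folder per chapter ID. A minimal sketch of walking that layout from Python, assuming exactly this nesting; the function name `list_chapters` is hypothetical and not part of the lab materials:

```python
from pathlib import Path


def list_chapters(subset_dir):
    """Return sorted (speaker_id, chapter_id) pairs found under a subset folder.

    Assumes the layout <subset>/<speaker>/<chapter>/ seen in the listings
    above, and skips stray files such as .DS_Store.
    """
    pairs = []
    for speaker_dir in Path(subset_dir).iterdir():
        if not speaker_dir.is_dir():
            continue
        for chapter_dir in speaker_dir.iterdir():
            if chapter_dir.is_dir():
                pairs.append((speaker_dir.name, chapter_dir.name))
    return sorted(pairs)
```

For the sample folder listed above, `list_chapters("dev-clean")` should yield the single pair for speaker 1993 and chapter 147965.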


Let's have a look at the chapter list. This command displays lines 13 through 50 of the file.

In [44]:
!sed -n 13,50p CHAPTERS.TXT

;
;ID    |READER|MINUTES| SUBSET           | PROJ.|BOOK ID| CH. TITLE | PROJECT TITLE
1      | 110  | 19.77 | train-other-500  | 53   | 1023  | In Chancery | Bleak House
2      | 110  | 10.30 | train-other-500  | 53   | 1023  | In Fashion | Bleak House
159    | 4174 | 7.67  | train-other-500  | 68   | 2184  | Letter XXV | Unbeaten Tracks in Japan
198    | 19   | 8.42  | train-clean-100  | 219  | 121   | Chapter 01 | Northanger Abbey
199    | 98   | 11.68 | train-clean-360  | 219  | 121   | Chapter 02 | Northanger Abbey
200    | 173  | 11.25 | train-other-500  | 219  | 121   | Chapter 03 | Northanger Abbey
201    | 44   | 7.57  | train-other-500  | 219  | 121   | Chapter 04 | Northanger Abbey
204    | 92   | 12.76 | train-other-500  | 219  | 121   | Chapter 07 | Northanger Abbey
205    | 20   | 12.82 | train-other-500  | 219  | 121   | Chapter 08 | Northanger Abbey
207    | 44   | 18.33 | train-other-500  | 219  | 121   | Chapter 10 | Northanger Abbey
208    | 14
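
The rows above are pipe-delimited, with comment lines starting with `;`. A minimal sketch of reading them into Python dictionaries, assuming the eight columns shown in the header row; the function name and the dictionary keys are hypothetical choices made for illustration:

```python
def parse_chapters(lines):
    """Parse CHAPTERS.TXT rows into dicts; lines starting with ';' are comments."""
    fields = ["id", "reader", "minutes", "subset", "project", "book_id",
              "chapter_title", "project_title"]
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        # Split at most 7 times so a '|' inside the last title is preserved.
        parts = [p.strip() for p in line.split("|", len(fields) - 1)]
        if len(parts) != len(fields):
            continue  # skip malformed or truncated rows
        row = dict(zip(fields, parts))
        row["minutes"] = float(row["minutes"])
        rows.append(row)
    return rows
```

Passing an open file handle for CHAPTERS.TXT as `lines` works, since the function only iterates over it line by line. Filtering the result on `row["subset"]` or `row["reader"]` then replaces ad hoc `grep` calls.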

In [45]:
!grep '147965' CHAPTERS.TXT

147965 | 1993 | 0.95  | dev-clean        | 1592 | 19810 | Book 1 (The Shimerdas), Chapter 12 | My Antonia


In [52]:
!cat 1993-147965.trans.txt

1993-147965-0000 GRANDFATHER CAME DOWN WEARING A WHITE SHIRT AND HIS SUNDAY COAT
1993-147965-0001 MORNING PRAYERS WERE LONGER THAN USUAL
1993-147965-0002 HE GAVE THANKS FOR OUR FOOD AND COMFORT AND PRAYED FOR THE POOR AND DESTITUTE IN GREAT CITIES WHERE THE STRUGGLE FOR LIFE WAS HARDER THAN IT WAS HERE WITH US
1993-147965-0003 BECAUSE HE TALKED SO LITTLE HIS WORDS HAD A PECULIAR FORCE THEY WERE NOT WORN DULL FROM CONSTANT USE
1993-147965-0004 ALL AFTERNOON HE SAT IN THE DINING ROOM
1993-147965-0005 AT ABOUT FOUR O'CLOCK A VISITOR APPEARED MISTER SHIMERDA WEARING HIS RABBIT SKIN CAP AND COLLAR AND NEW MITTENS HIS WIFE HAD KNITTED
1993-147965-0006 HE SAT STILL AND PASSIVE HIS HEAD RESTING AGAINST THE BACK OF THE WOODEN ROCKING CHAIR HIS HANDS RELAXED UPON THE ARMS
1993-147965-0007 HIS FACE HAD A LOOK OF WEARINESS AND PLEASURE LIKE THAT OF SICK PEOPLE WHEN THEY FEEL RELIEF FROM PAIN
1993-147965-0008 HE MADE THE SIGN OF THE CROSS OVER ME PUT ON HIS CAP AND WENT OFF IN THE D

In this case:

* The speaker is Wendy (speaker ID 1993)
* The book being read is "My Antonia"
* The transcripts have no punctuation other than apostrophes
* The transcripts are entirely in upper case
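
The transcript format described above (one utterance per line, an ID of the form speaker-chapter-index followed by the text) is easy to load. A minimal sketch, assuming the ID and the text are separated by whitespace; both function names are hypothetical:

```python
def parse_transcript(lines):
    """Map utterance ID -> transcript text for the rows of a .trans.txt file."""
    utterances = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The first whitespace-separated token is the ID, the rest is the text.
        parts = line.split(None, 1)
        utterances[parts[0]] = parts[1] if len(parts) > 1 else ""
    return utterances


def split_utterance_id(utt_id):
    """Split an ID such as '1993-147965-0001' into speaker, chapter, index."""
    speaker, chapter, index = utt_id.split("-")
    return speaker, chapter, int(index)
```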

## Build your own dataset

### Step 0: Generate .wav files

Record yourself saying a few short sentences with [Sonic Visualiser](https://www.sonicvisualiser.org) and export each recording as a .wav file. Take your time to explore the application and view the spectrogram of your voice.
For the steps that follow you will need to install pysoundfile if it is not already installed.


### Step 1: Convert and structure

The .wav files need to be converted from the IEEE float format produced by Sonic Visualiser to the lower-resolution PCM-16 format required in later processing steps. In addition, the audio files need to be named and placed in a structure similar to the LibriSpeech file structure, i.e. sorted and identified by speaker and chapter. This requires an arbitrary speaker number and chapter number. A utility, convert_flt_pcm.py, has been provided for this purpose:

In [66]:
# pip install pysoundfile  --> If the package is not already installed 


# usage: convert_flt_pcm.py input_directory output_directory group speaker chapter

# positional arguments:
#   input_directory  Path to input directory
#   data_directory   Path to output data directory
#   group            group
#   speaker          speaker number
#   chapter          chapter number

# optional arguments:
#   -h, --help       show this help message and exit



!python convert_flt_pcm.py audio_samples MySpeech my_dev 1 12345  # audio_samples is the directory containing your recordings

1-12345-0000
1-12345-0001
1-12345-0002
1-12345-0003
1-12345-0004
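
The provided convert_flt_pcm.py is not reproduced here. The following is only a sketch of the two ideas it relies on: scaling float samples to 16-bit PCM, and naming files speaker-chapter-index as in the output above. It uses the standard-library `wave` module; all function names are hypothetical, the 16 kHz default is an assumption taken from the LibriSpeech sample rate, and reading the float input (for which the lab uses pysoundfile) is left out.

```python
import struct
import wave


def float_to_pcm16(samples):
    """Clip float samples to [-1.0, 1.0] and scale them to signed 16-bit ints."""
    return [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]


def utterance_id(speaker, chapter, index):
    """Build a LibriSpeech-style ID such as '1-12345-0000'."""
    return f"{speaker}-{chapter}-{index:04d}"


def write_pcm16_wav(path, samples, sample_rate=16000):
    """Write mono float samples to `path` as a PCM-16 WAV file."""
    pcm = float_to_pcm16(samples)
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)  # assumed default: 16 kHz
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
```

Clipping before scaling matters because float recordings can exceed the nominal range, and an unclipped value would overflow the 16-bit integer range.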


### Step 2: Add utterances to your transcript file

Find your 1-12345.trans.txt file and add the sentences you have read. Note that the file name and the IDs will differ if you gave different "speaker" and "chapter" numbers during the conversion step. **Add the sentence that corresponds to each .wav file under the same ID.** The utterances should be entirely in capital letters, with no punctuation except apostrophes where needed.

Output: your 1-12345.trans.txt file

1-12345-0000	YET WHEN WE COLLABORATE WE HAVE A TRICKY OFTEN AVOIDED CHARGE TO ASK FOR AND RECEIVE FEEDBACK
1-12345-0001	IT CAN MAXIMIZE EFFICIENCY AND HEAL BROKEN TEAMS
1-12345-0002	FEEDBACK IS ESSENTIAL IT CAN COURSE CORRECT AND ENCOURAGE DEVELOPMENT
1-12345-0003	BEFORE RECEIVING FEEDBACK FROM SOMEONE PREPARE BY WRITING DOWN SPECIFIC QUESTIONS ABOUT THE THINGS YOU MIGHT CHANGE
1-12345-0004	THIS BOOK WILL MAKE YOU MORE EFFECTIVE AND CONFIDENT GETTING FEEDBACK


[SOURCE: How to get and give feedback (A practical guide) SYPartners]
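
Before moving on, it can help to check the transcript against the rules stated in this step. A minimal sketch, assuming the ID and the text are separated by whitespace and that only capital letters, apostrophes, and spaces are allowed (rejecting digits is an assumption, since the step does not mention them); the function name `check_transcript` is hypothetical:

```python
import re

# Allowed characters in the text: capital letters, apostrophes, spaces.
_TEXT_RE = re.compile(r"[A-Z' ]+")


def check_transcript(lines, wav_ids):
    """Return a list of human-readable problems found in a transcript.

    `lines` are the rows of the .trans.txt file; `wav_ids` is the set of
    utterance IDs for which a .wav file exists.
    """
    problems = []
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        utt_id = parts[0]
        text = parts[1] if len(parts) > 1 else ""
        seen.add(utt_id)
        if utt_id not in wav_ids:
            problems.append(f"{utt_id}: no matching .wav file")
        if not _TEXT_RE.fullmatch(text):
            problems.append(f"{utt_id}: text must be upper case with no "
                            "punctuation except apostrophes")
    for missing in sorted(set(wav_ids) - seen):
        problems.append(f"{missing}: .wav file has no transcript line")
    return problems
```

An empty list means every transcript line has a matching audio file, every audio file has a line, and all text follows the formatting rules.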


### Step 3: Build your dataset. Create the .json file needed for processing

