<a href="https://colab.research.google.com/github/merriekay/Researchers-Guide-to-UncommonVoice/blob/main/UncommonVoiceGuide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Researcher's Guide to the UncommonVoice Dataset

This guide will take you through the basic steps of getting UncommonVoice loaded and ready to work with.

## How to get the data:
To get access to the dataset, email meredith.moore@drake.edu with the subject `UncommonVoice Download`. An automatic reply will provide a link to download the dataset, along with instructions on how to cite the dataset in your work. The zip file is 1.63 GB and includes 3693 speech samples from individuals with and without voice disorders.
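
Once the download finishes, a quick sanity check is to extract the archive and count the `.wav` files against the 3693 samples mentioned above. The following is a minimal sketch, not an official loader: the helper name `extract_and_count` and the archive name in the usage note are hypothetical, and it assumes nothing about the folder layout inside the zip beyond the recordings being `.wav` files.

```python
import zipfile
from pathlib import Path

EXPECTED_SAMPLES = 3693  # sample count stated in this guide


def extract_and_count(zip_path, out_dir):
    """Extract the archive into out_dir and return the number of .wav files found."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    # Search recursively, because the folder layout inside the zip is not assumed.
    # Hidden files (names starting with ".") are skipped.
    return sum(1 for p in out_dir.rglob("*.wav") if not p.name.startswith("."))
```

For example, `extract_and_count("UncommonVoice.zip", "UncommonVoice_data") == EXPECTED_SAMPLES` should hold if the download is complete (the archive name here is a placeholder for whatever your downloaded file is called).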

## How to cite the data:
BibTeX:
```
@article{moore2020uncommonvoice,
  title={UncommonVoice: A Crowdsourced Dataset of Dysphonic Speech},
  author={Moore, Meredith and Papreja, Piyush and Saxon, Michael and Berisha, Visar and Panchanathan, Sethuraman},
  journal={Proc. Interspeech 2020},
  pages={2532--2536},
  year={2020}
}
```
Other formats can be [found here](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C16&q=UncommonVoice%3A+A+Crowdsourced+Dataset+of+Dysphonic+Speech&btnG=#d=gs_cit&u=%2Fscholar%3Fq%3Dinfo%3AwzEJyLPfAG0J%3Ascholar.google.com%2F%26output%3Dcite%26scirp%3D0%26hl%3Den)

## Want more info?
If you're curious about the details of UncommonVoice, you have a few options:
- Read the [Interspeech 2020 paper here](https://img1.wsimg.com/blobby/go/bb8819fe-ceab-4aab-9326-de58f46295cf/downloads/UncommonVoice_IS2020.pdf?ver=1604346789008). 
- Check out my [personal website](https://merriekay.com/uncommonvoice), which recaps all of these details.
- Watch the [Interspeech 2020 highlight video](https://youtu.be/QwXwfGbWAH4).
- Or watch the [15-minute Interspeech presentation](https://youtu.be/lBEYCujz2L4).

## Licensing:
UncommonVoice is licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/).


# Some Tips/tricks:

## Prompts:
Subjects were asked to complete four sections of data collection: 

1. **Task 1**: Non-words (sustained corner vowels and diadochokinetic (DDK) rate)

2. **Task 2**: Read Speech: Randomly selected TIMIT sentences and the sentences required to complete the CAPE-V intelligibility assessment

3. **Task 3**: Image Descriptions: Spontaneous speech to describe images from the MSCOCO dataset.

4. **Task 4**: Non-words (round 2, to measure any change in voice over the data collection process).

These prompts can be found in the `prompt` folder. They are split into T1-T4 corresponding to the above tasks. 
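
To look up the text behind a prompt ID, something like the sketch below may help. It assumes each prompt is stored as `<promptID>.txt` (this guide later refers to `si1997.txt`) somewhere under the `prompt` folder, and that the files are UTF-8 text. The exact layout of the T1-T4 subfolders is not assumed, so the search is recursive. The function name `find_prompt_text` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_prompt_text(prompt_dir, prompt_id) -> Optional[str]:
    """Return the text of <prompt_id>.txt found under prompt_dir, or None if absent."""
    # Recursive search, since the T1-T4 folder layout is an assumption.
    matches = sorted(Path(prompt_dir).rglob(f"{prompt_id}.txt"))
    if not matches:
        return None
    # Assumes UTF-8 encoded prompt files.
    return matches[0].read_text(encoding="utf-8").strip()
```

Returning `None` instead of raising keeps the helper usable in a loop over many files where some prompts may be missing.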


## How to decode the filename:
Filename example: **FD04\_si1997\_31.wav**

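\<**unique ID**\>\_\<**promptID**\>\_\<**number of days post Botox treatment**\>.wav

Note that the filenames in the file listing produced later in this guide also carry a task tag (**T1** to **T4**) between the unique ID and the prompt ID, for example **FD05\_T2\_si1055\_116.wav**.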

Let's go character by character:
- **First character** is either **F** or **M** and corresponds to the sex of the participant. F: female, M: male. 
- **Second character** is either **D** or **C** and corresponds to whether or not the speaker has a voice disorder. D: disorder C: Control (no disorder).
- Then comes a number. Together, the sex, the disorder status, and this number form the first section of the filename, which uniquely identifies each participant.
- After the first underscore comes the prompt. This is what was presented to the individual. For more information on the prompts, see above.
    - Note that for non-words, the prompt does not directly correspond to what the participant said. 
    - For example, the prompt may have been `Please hold /a/ as in 'pot' for 5 seconds` but the speaker did not just read the prompt, they completed the task that it asked.
- Last, but certainly not least, we have another number. This number corresponds to the number of days since the speaker's last Botox injection. Botox is the most common treatment for spasmodic dysphonia, and the speaker's voice may be more or less clear depending on when they received their last treatment. If this is **NA**, the speaker does not receive Botox treatment.

Filename example: **FD04\_si1997\_31.wav**

So for the example given above, we know the speech sample came from a female (**F**) speaker with a voice disorder (**D**) who is saying what is found in `si1997.txt`. We also know that it has been 31 days since she had a Botox injection.
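
The decoding rules above can be sketched as a small helper. This is a minimal sketch, not part of the dataset's tooling: `SampleInfo` and `parse_filename` are hypothetical names, and it assumes that no part of the filename contains an underscore of its own. It accepts both the three-part pattern described above and the four-part form with a task tag (for example `MC04_T2_si1949_NA.wav`) that appears in the file listing in this guide.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class SampleInfo(NamedTuple):
    speaker_id: str                  # e.g. "FD04"
    sex: str                         # "female" or "male"
    has_disorder: bool               # True for D, False for C (control)
    task: Optional[str]              # e.g. "T2", or None if the filename has no task tag
    prompt_id: str                   # e.g. "si1997"
    days_since_botox: Optional[int]  # None when the filename says NA


def parse_filename(filename: str) -> SampleInfo:
    """Decode an UncommonVoice filename into its labeled parts."""
    parts = Path(filename).stem.split("_")  # drop ".wav", then split on "_"
    if len(parts) == 4:
        speaker_id, task, prompt_id, days = parts
    elif len(parts) == 3:
        speaker_id, prompt_id, days = parts
        task = None
    else:
        raise ValueError(f"Unexpected filename format: {filename!r}")

    sex_codes = {"F": "female", "M": "male"}
    disorder_codes = {"D": True, "C": False}
    if speaker_id[0] not in sex_codes or speaker_id[1] not in disorder_codes:
        raise ValueError(f"Unexpected speaker ID: {speaker_id!r}")

    return SampleInfo(
        speaker_id=speaker_id,
        sex=sex_codes[speaker_id[0]],
        has_disorder=disorder_codes[speaker_id[1]],
        task=task,
        prompt_id=prompt_id,
        days_since_botox=None if days == "NA" else int(days),
    )
```

Calling `parse_filename("FD04_si1997_31.wav")` gives a record with `sex == "female"`, `has_disorder == True`, `prompt_id == "si1997"`, and `days_since_botox == 31`. Raising on unexpected names, rather than guessing, makes it easier to spot files that do not follow the convention.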

In [27]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
import os

# I extracted my files here; change this to wherever you have your files.
directory = '/content/drive/MyDrive/Research/UncommonVoice/UncommonVoice_data/'

# Create a CSV file that makes the information in each filename explicit:
# <filename, sex, disorder, prompt, days since btx>

file_list = []

for filename in os.listdir(directory):
  if filename.endswith('.wav'):

    # e.g. 'FD05_T2_si1055_116.wav' -> ['FD05', 'T2', 'si1055', '116.wav']
    split_file = filename.split('_')
    unique_id = split_file[0]
    task = split_file[1]
    prompt = split_file[2]
    # days since the last Botox injection, or 'NA' if the speaker does not receive Botox
    days_since_btx = split_file[3].split('.')[0]

    # add sex (first character: F = female, M = male)
    sex = 'female' if unique_id[0] == 'F' else 'male'

    # add disorder (second character: D = disorder, C = control)
    disorder = 1 if unique_id[1] == 'D' else 0

    # append
    file_list.append([filename, sex, disorder, prompt, days_since_btx])

# convert to a DataFrame
df = pd.DataFrame(file_list, columns=['filename', 'sex', 'disorder', 'prompt', 'days_since_btx'])
df.head()


| | filename | sex | disorder | prompt | days_since_btx |
|---|---|---|---|---|---|
| 0 | FD032_T2_sx432_NA.wav | female | 1 | sx432 | NA |
| 1 | MC04_T2_si1949_NA.wav | male | 0 | si1949 | NA |
| 2 | FD020_T2_si1083_NA.wav | female | 1 | si1083 | NA |
| 3 | FD05_T2_si1055_116.wav | female | 1 | si1055 | 116 |
| 4 | MD06_T2_sx448_NA.wav | male | 1 | sx448 | NA |


In [29]:
# number of files per sex
df.value_counts('sex')

sex
female    2776
male       917
dtype: int64

In [30]:
# number of files with/without voice disorder
df.value_counts('disorder')

disorder
1    2913
0     780
dtype: int64
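
The counts above are per file, not per speaker. To break the files down by sex and disorder status together, or to count speakers instead of files, something like the sketch below could be used. The small `sample` frame stands in for the full `df` and contains only rows copied from the listing in this guide. The `speaker_id` column is a hypothetical addition derived from the first section of each filename.

```python
import pandas as pd

# A few rows copied from the listing in this guide, standing in for the full `df`.
sample = pd.DataFrame({
    "filename": [
        "FD032_T2_sx432_NA.wav",
        "MC04_T2_si1949_NA.wav",
        "FD020_T2_si1083_NA.wav",
        "FD05_T2_si1055_116.wav",
        "MD06_T2_sx448_NA.wav",
    ],
    "sex": ["female", "male", "female", "female", "male"],
    "disorder": [1, 0, 1, 1, 1],
})

# Files broken down by sex (rows) and disorder status (columns).
file_counts = pd.crosstab(sample["sex"], sample["disorder"])

# The first section of the filename identifies the speaker.
sample["speaker_id"] = sample["filename"].str.split("_").str[0]

# Keep one row per speaker, then tabulate speakers instead of files.
speakers = sample.drop_duplicates("speaker_id")
speaker_counts = pd.crosstab(speakers["sex"], speakers["disorder"])
```

Replacing `sample` with the full `df` should give the same breakdown for the whole dataset.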

In [35]:
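# index=False keeps the row index out of the file, so it does not come back as an "Unnamed: 0" column
df.to_csv('UncommonVoice_file_descriptions.csv', index=False)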
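
One thing to watch when reading the exported file back: by default, `pd.read_csv` treats the string `NA` as a missing value, so the `NA` entries in `days_since_btx` come back as `NaN`. The sketch below shows the difference using a small in-memory stand-in for the exported CSV (two rows taken from the listing in this guide). Passing `keep_default_na=False` is one way to keep the literal strings.

```python
import io

import pandas as pd

# In-memory stand-in for the exported CSV, with two rows from the listing in this guide.
csv_text = (
    "filename,sex,disorder,prompt,days_since_btx\n"
    "FD05_T2_si1055_116.wav,female,1,si1055,116\n"
    "MC04_T2_si1949_NA.wav,male,0,si1949,NA\n"
)

# Default behavior: the string "NA" is parsed as a missing value (NaN).
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps "NA" as a literal string.
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# If a numeric column is wanted, convert explicitly; "NA" becomes NaN here by choice.
days = pd.to_numeric(literal["days_since_btx"], errors="coerce")
```

If the file was written with the default index column, passing `index_col=0` to `pd.read_csv` keeps that column out of the data columns.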

In [37]:
from google.colab import files

files.download('UncommonVoice_file_descriptions.csv') 
