### **Link:** https://platform.stratascratch.com/data-projects/voice-recordings-analysis

### **Difficulty:** Hard

# Voice Recordings Analysis

<div><p><em>This data project has been used as a take-home assignment in the recruitment process for the data science positions at Sandvik.</em></p>
<h2>Assignment</h2>
<p>In this assignment you will extract, explore and analyze audio data from English-speaking male and female speakers, and <strong>build learning models that predict a speaker's gender from vocal features such as mean frequency, spectral entropy or mode frequency.</strong></p>
<p>Most online communities for data science, machine learning and artificial intelligence share ready-made datasets, but such datasets rarely exist in the wild. You will therefore have to explore one or more ways of downloading a raw dataset containing thousands of individual audio files and extracting meaningful features from it.</p>
<p><strong>Question Set</strong>
The following are reference points that should be taken into account in the submission. Please use them to guide the reasoning behind the feature extraction, exploration, analysis and model building, rather than answer them point blank.</p>
<ol>
<li>How did you go about extracting features from the raw data?</li>
<li>Which features do you believe contain relevant information?
<ol>
<li>How did you decide which features matter most?</li>
<li>Do any features contain similar information content?</li>
<li>Are there any insights about the features that you didn't expect? If so, what are they?</li>
<li>Are there any other (potential) issues with the features you've chosen? If so, what are they?</li>
</ol>
</li>
<li>Which goodness-of-fit metrics have you chosen, and what do they tell you about the performance of your model(s)?
<ol>
<li>Which model performs best?</li>
<li>How would you decide between using a more sophisticated model versus a less complicated one?</li>
</ol>
</li>
<li>What kind of benefits do you think your model(s) could have as part of an enterprise application or service?</li>
</ol>
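<p>As one way to ground question 3, the sketch below computes a few common classification metrics from true and predicted labels. The function name <code>binary_metrics</code> and the choice of <code>"female"</code> as the positive class are assumptions made for illustration; they are not part of the assignment.</p>

```python
def binary_metrics(y_true, y_pred, positive="female"):
    """Accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("no labels given")
    # Count the four cells of the confusion matrix.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

<p>Reporting precision and recall alongside accuracy is useful when the two classes are unevenly represented, because accuracy alone can look high for a model that mostly predicts the majority class.</p>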
<h2>Data Description</h2>
<p>The provided dataset (available via the 'Download Datasets' button on this page) is a small extract from a repository of voice recordings. The raw data is packaged as compressed <code>.tgz</code> archives. The extract contains 100 such archives with 1,000 voice samples in total. Each sample is a recording of a short English sentence spoken by either a male or a female speaker. The samples are <code>.wav</code> files with a sampling rate of 16 kHz and a bit depth of 16 bits.</p>
<p>Each <code>.tgz</code> compressed file contains the following directory structure and files:</p>
<ul>
<li><code>&lt;file&gt;/</code>
<ul>
<li><code>etc/</code>
<ul>
<li><code>GPL_license.txt</code></li>
<li><code>HDMan_log</code></li>
<li><code>HVite_log</code></li>
<li><code>Julius_log</code></li>
<li><code>PROMPTS</code></li>
<li><code>prompts-original</code></li>
<li><code>README</code></li>
</ul>
</li>
<li><code>LICENSE</code></li>
<li><code>wav/</code>
<ul>
<li>10 unique <code>.wav</code> audio files</li>
</ul>
</li>
</ul>
</li>
</ul>
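<p>Given the directory layout listed above, one archive can be read without unpacking it to disk. The following is a minimal sketch using only the Python standard library. The function name <code>read_archive</code> is hypothetical, and the code assumes that <code>etc/README</code> contains a line such as <code>Gender: Male</code>; that assumption should be checked against the actual files.</p>

```python
import io
import re
import tarfile
import wave


def read_archive(tgz_path):
    """Return (gender or None, [(wav name, sample rate, frame count), ...])."""
    gender = None
    recordings = []
    with tarfile.open(tgz_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            name = member.name
            if name.endswith("etc/README"):
                raw = tar.extractfile(member).read()
                text = raw.decode("utf-8", errors="replace")
                # ASSUMPTION: the README holds a line like "Gender: Male".
                match = re.search(r"^\s*Gender\s*:\s*(\w+)", text,
                                  re.IGNORECASE | re.MULTILINE)
                if match:
                    gender = match.group(1).lower()
            elif name.lower().endswith(".wav"):
                data = tar.extractfile(member).read()
                # Parse the WAV header from memory to get basic metadata.
                with wave.open(io.BytesIO(data), "rb") as wav:
                    recordings.append(
                        (name, wav.getframerate(), wav.getnframes())
                    )
    return gender, recordings
```

<p>Only one file is held in memory at a time, so the same function can be called in a loop over all archives.</p>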
<p>However, to increase the performance of a model, you may fetch the data directly from the original repository that can be found <strong><a href="http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/16kHz_16bit/">here</a></strong>. This repository consists of over 100,000 audio samples. The total size of the raw dataset is approximately 12.5 GB once it has been uncompressed.</p>
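<p>Fetching the full repository can be automated by collecting the archive links from its index page. The sketch below assumes the index is a plain HTML directory listing whose <code>&lt;a href&gt;</code> entries point at <code>.tgz</code> files; the class and function names are hypothetical, and the parser works on HTML text that has already been downloaded.</p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ArchiveLinkParser(HTMLParser):
    """Collect absolute URLs of all links that end in .tgz."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.endswith(".tgz"):
            # Resolve relative links against the index page URL.
            self.links.append(urljoin(self.base_url, href))


def list_archives(index_html, base_url):
    """Return the .tgz URLs found in an HTML directory listing."""
    parser = ArchiveLinkParser(base_url)
    parser.feed(index_html)
    return parser.links
```

<p>The index page itself could be retrieved with <code>urllib.request.urlopen</code>, and each archive then downloaded and processed one at a time, so the full 12.5 GB never has to sit on disk or in memory at once.</p>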
<h2>Practicalities</h2>
<p>This assignment should be completed within 10 days. You should present your work in a way that clearly and succinctly walks us through your approach to extracting features, exploring them, uncovering any potential constraints or issues with the data in its provided form, your choice of predictive models and your analysis of the models' performance. Try to keep it concise.</p>
<p>A good presentation covers potential caveats, findings and insights about the dataset, along with an analysis of the goodness-of-fit metrics that benchmarks the performance of different learning models.</p>
<p>A great presentation tells a visual, potentially even interactive, story about the data and how specific insights can be used to guide our product development so that non-technical colleagues can understand and act upon them.</p>
<h3>Tips</h3>
<p>We recommend considering the following for your data pre-processing:</p>
<ol>
<li>Automate the raw data download using web scraping techniques</li>
<li>Pre-process data using audio signal processing packages such as <a href="https://cran.r-project.org/web/packages/warbleR/vignettes/warbleR_workflow.html">WarbleR</a>, <a href="https://cran.r-project.org/web/packages/tuneR/index.html">TuneR</a>, <a href="https://cran.r-project.org/web/packages/seewave/index.html">seewave</a> for R, or similar packages for other programming languages</li>
<li>Consider, in particular, the <a href="https://en.wikipedia.org/wiki/Voice_frequency#Fundamental_frequency">human vocal range</a>: the fundamental frequency of speech typically lies within <strong>0 Hz-280 Hz</strong></li>
<li>To help you on your way to identify potentially interesting features, consider the following (non-exhaustive) list:
<ul>
<li>Mean frequency (in kHz)</li>
<li>Standard deviation of frequency</li>
<li>Median frequency (in kHz)</li>
<li>First quartile (in kHz)</li>
<li>Third quartile (in kHz)</li>
<li>Interquartile range (in kHz)</li>
<li>Skewness</li>
<li>Kurtosis</li>
<li>Mode frequency</li>
<li>Peak frequency</li>
</ul>
</li>
<li>Make sure to check out all of the files in the raw data; you might find valuable data in files beyond the audio ones</li>
</ol>
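<p>Several of the features in tip 4 can be obtained by treating the amplitude spectrum as a probability distribution over frequency. The sketch below takes that approach and restricts the spectrum to the band from tip 3. It assumes the spectrum has already been computed (for example with an FFT routine) and is sorted by ascending frequency. The function name, the returned keys and the definition of the mode as the bin with the largest amplitude are illustrative choices, not the definitions used by any particular package.</p>

```python
import math


def spectral_properties(freqs_hz, amplitudes, fmin=0.0, fmax=280.0):
    """Summary statistics of an amplitude spectrum, in Hz."""
    # Keep only bins inside the band of interest (assumed 0-280 Hz).
    band = [(f, a) for f, a in zip(freqs_hz, amplitudes) if fmin <= f <= fmax]
    total = sum(a for _, a in band)
    if total <= 0:
        raise ValueError("no spectral energy inside the requested band")
    # Normalise amplitudes so they sum to 1, like a probability mass function.
    probs = [(f, a / total) for f, a in band]

    mean = sum(f * p for f, p in probs)
    sd = math.sqrt(sum(p * (f - mean) ** 2 for f, p in probs))

    def quantile(q):
        # First frequency at which the cumulative mass reaches q.
        cumulative = 0.0
        for f, p in probs:
            cumulative += p
            if cumulative >= q:
                return f
        return probs[-1][0]

    q25, median, q75 = quantile(0.25), quantile(0.5), quantile(0.75)
    # Standardised third and fourth moments; undefined when sd is zero.
    skew = sum(p * (f - mean) ** 3 for f, p in probs) / sd ** 3 if sd else 0.0
    kurt = sum(p * (f - mean) ** 4 for f, p in probs) / sd ** 4 if sd else 0.0
    mode = max(probs, key=lambda fp: fp[1])[0]

    return {
        "meanfreq": mean, "sd": sd, "median": median,
        "Q25": q25, "Q75": q75, "IQR": q75 - q25,
        "skew": skew, "kurt": kurt, "mode": mode,
    }
```

<p>The values are returned in Hz; dividing the frequency-valued entries by 1000 gives the kHz units used in the list above.</p>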
<h3>Frequently Asked Questions</h3>
<ol>
<li>
<p><strong>Can I use &lt;Insert SDK or Framework here&gt; for the take home assignment?</strong></p>
<blockquote>
<p><strong>Answer:</strong> Yes, you are free to make use of the tools that you are most comfortable working with. We work with a mix of frameworks, and try to use the one best fit for the task at hand.</p>
</blockquote>
</li>
<li>
<p><strong>The raw data is too large to fit in memory. What do I do?</strong></p>
<blockquote>
<p><strong>Answer:</strong> This is part of the challenge: the dataset is by design larger than what fits in the memory of a typical computer. You will have to come up with a solution that processes the data in a batch-like or streaming fashion to extract meaningful features.</p>
</blockquote>
</li>
</ol></div>
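<p>The batch-like processing mentioned in the last answer can be sketched as a loop that handles one archive at a time and writes feature rows to disk as soon as they are produced. The names <code>stream_feature_rows</code> and <code>extract_rows</code> are hypothetical; <code>extract_rows</code> stands for any function that takes an archive path and yields one dictionary of features per recording, all with the same keys.</p>

```python
import csv


def stream_feature_rows(archive_paths, extract_rows, out_csv):
    """Write features archive by archive; return the number of rows written."""
    written = 0
    with open(out_csv, "w", newline="") as handle:
        writer = None
        for path in archive_paths:
            # Only the rows of the current archive are ever held in memory.
            for row in extract_rows(path):
                if writer is None:
                    # The first row defines the CSV columns.
                    writer = csv.DictWriter(handle, fieldnames=list(row))
                    writer.writeheader()
                writer.writerow(row)
                written += 1
    return written
```

<p>Because the output is a flat table of features rather than raw audio, the resulting file is small enough to load in full for exploration and modelling.</p>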

## **Data:**

## **Solution:**