# Introducing Scikit-Learn (ii)

Having completed a basic overview, we will now use Scikit-Learn to visualise a few fundamental concepts in machine learning. These will include:
* Bias-variance trade-off
* Scoring functions

Thereafter, we will move to a real-world example to give you the opportunity to work through a mock data science problem yourself. There is no expectation on the scores your models achieve. The only goal is that you get a feel for how one might approach an ML problem, and what tools exist to score your model.

### Bias-variance trade-off

In [None]:
...
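
As a minimal sketch of one way to illustrate the trade-off, the cell below fits polynomial regressors of increasing degree to a small synthetic dataset and compares the error on the training data with the error on held-out data. The noisy sine data, the chosen degrees, and all variable names are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption for illustration): noisy samples of a sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=60)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

degrees = [1, 4, 15]  # low, moderate, and high model complexity
train_errors, val_errors = [], []
for degree in degrees:
    # Polynomial feature expansion followed by ordinary least squares.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

for d, tr, va in zip(degrees, train_errors, val_errors):
    print(f"degree={d:2d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```

A model that is too simple tends to have high error on both sets (high bias), while a very flexible model tends to fit the training data closely but generalise poorly (high variance). Plotting both error lists against `degrees` is one way to visualise this.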

### Scoring functions


*   Precision-Recall curves
*   Receiver operating characteristic curves
*   Confusion matrices
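
The three tools listed above are all available in `sklearn.metrics`. The following is a minimal sketch of how they could be computed, assuming a synthetic binary classification problem and a logistic regression classifier; both are stand-ins chosen for illustration, not the model used later in this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)               # hard class labels
y_score = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Both curves sweep a decision threshold over the predicted scores.
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

print(cm)
print(f"ROC AUC: {auc:.3f}")
```

The confusion matrix needs hard predictions, whereas the two curves need continuous scores, which is why both `predict` and `predict_proba` are called.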



## Example real-world problem

Machine learning is prevalent in today's scientific landscape, with applications spanning biomedical engineering, astrostatistics, finance, the entertainment industry, and countless other disciplines.

In this tutorial we will consider how one might use ML to classify audio recordings into categories. This type of analysis involves:



1.   Data collection and pre-processing
2.   Feature generation
3.   Model selection
4.   Model training
5.   Analysing model performance

You will find in practice that the key to unlocking good predictive insights is the quality and quantity of the data used. We will not, however, make this the focal point, and will assume the data is reasonably clean.

If you are interested in common difficulties encountered with data processing, please refer to [XYZ]



### Problem definition
#### Mosquito acoustic detection: can we use machine learning to detect mosquitoes from the sound of their acoustic wingbeat?

Mosquitoes are responsible for over xyz yyz [cite]. As a byproduct of their behaviour patterns, they produce a characteristic buzz from their flight, mating calls, etc. The idea is to leverage this sound with cheap sensors (acoustic smartphone sensors in an IoT network) to estimate the prevalence of mosquitoes in a particular area. To do this, we need algorithms capable of distinguishing the buzz of a mosquito from its surroundings. In this challenge we will show how Scikit-learn can be used to build a basic classifier that does this.

### 1. Data collection and pre-processing

By default, opening Colab will place you in the following directory:

In [18]:
import os
os.getcwd()

'/content'

We can now download the dataset of interest from the repository with `wget` and unzip it to the subfolder `data`. The `!` before a command runs it as an operating system command directly from the notebook cell (in this case, on whichever Linux OS the Colab machines are using).

In [23]:
!wget https://github.com/ikiskin/UNIQ-deepmind/raw/master/data/CulexMozzSounds.zip
!unzip /content/CulexMozzSounds.zip -d /content/data/

--2022-06-29 17:58:43--  https://github.com/ikiskin/UNIQ-deepmind/raw/master/data/CulexMozzSounds.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ikiskin/UNIQ-deepmind/master/data/CulexMozzSounds.zip [following]
--2022-06-29 17:58:43--  https://raw.githubusercontent.com/ikiskin/UNIQ-deepmind/master/data/CulexMozzSounds.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28644396 (27M) [application/zip]
Saving to: ‘CulexMozzSounds.zip’


2022-06-29 17:58:43 (157 MB/s) - ‘CulexMozzSounds.zip’ saved [28644396/28644396]

Archive:  /content/CulexMozzSounds.zip
  inflating: /content/data/0001_norm

To start, we will split the recordings into a training set and a held-out test set. Splitting by recording ensures that parts of the same recording never appear in both training and testing.

In [30]:
train_list = []
test_list = []

# File names begin with a four-digit recording number (e.g. 0001).
for fname in sorted(os.listdir('data/')):
  if int(fname[:4]) > 30:  # Reserve 27/57 recordings for testing
    test_list.append(fname)
  else:
    train_list.append(fname)

We now have the raw data accessible as files for train and for test, with corresponding label information in `csv` format. Next, we need to map this data into a form that can be used to perform computations with scikit-learn.

We pre-process by subtracting the mean and dividing by the standard deviation. We will store the results in xyz to then apply to the test data. Note that there are several schemes for normalisation:

* Normalise per sample/recording. This is similar to how an image can be normalised using only its own intensity statistics.
* As above, but normalise in batches [read more about this]
* Use the entire training dataset to remove offset statistics such as the mean, and standardise the variance. When predicting over test data, we apply the same transform, with the same stored statistics, to the test data.

There is no universally accepted method of normalising audio data, as each scheme has benefits and drawbacks. You may experiment with different schemes. However, it is important to consider that some ML algorithms expect their inputs to lie in a certain range, and so require re-scaling to appropriate units. An example of this is the SVM because XYZ

In [None]:
# By default, let us subtract the mean and divide by the standard deviation per sample.
...
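
The per-sample and whole-dataset schemes described above could be sketched as follows. This is a minimal illustration, assuming the recordings have already been loaded into equal-length NumPy arrays (one row per recording); the random arrays and the helper `normalise_per_sample` are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in data (an assumption): each row plays the role of one recording.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, scale=2.0, size=(20, 100))
X_test = rng.normal(loc=3.0, scale=2.0, size=(5, 100))


def normalise_per_sample(X, eps=1e-12):
    """Give every row zero mean and unit standard deviation (hypothetical helper)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return (X - mean) / (std + eps)  # eps guards against division by zero


# Per-sample scheme: each recording is normalised using only its own statistics,
# so nothing needs to be stored for the test data.
X_train_per_sample = normalise_per_sample(X_train)
X_test_per_sample = normalise_per_sample(X_test)

# Dataset-level scheme: StandardScaler learns one mean and standard deviation
# per column from the training data, and the same transform is reused on test.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training data only, and reusing it for the test data, keeps information about the test set from leaking into the model.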

### 2. Feature generation
In general, the features extracted will vary from domain to domain. We could opt for highly hand-crafted descriptors, or let our inference models learn their own representations entirely. Current state-of-the-art approaches tend to use something in between, though this is highly dependent on the domain.

For creating features we have several options to explore with audio:


1.   Learn hierarchical feature representations with neural networks from:
  1. Raw audio waveform
  2. Intermediate feature representations

2. Extract descriptive features. In audio these could be MFCCs (mel-frequency cepstral coefficients), which are computed from the energies in a bank of non-linearly spaced frequency bands. The bands follow the mel scale (named after "melody"), on which humans perceive the bands as evenly spaced in pitch [CITE]. There are many other features we could use, such as zero crossing rate, spectral power, fluctuations in xyz. For a complete list you could refer to OpenSMILE.
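
As a minimal sketch of the second option, two of the descriptive features named above (zero crossing rate and spectral power) can be computed per frame with NumPy alone. The synthetic 440 Hz tone, the 8 kHz sample rate, the frame sizes, and the helper function names are all assumptions made for illustration; they are not properties of the mosquito dataset.

```python
import numpy as np


def frame_signal(x, frame_length, hop_length):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)[:, None]
    return x[starts + np.arange(frame_length)[None, :]]


def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)


def spectral_power(frames):
    """Power spectrum of each Hann-windowed frame."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1)) ** 2


# Synthetic stand-in signal (an assumption): one second of a 440 Hz tone.
sample_rate = 8000
frame_length, hop_length = 512, 256
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

frames = frame_signal(signal, frame_length, hop_length)
zcr = zero_crossing_rate(frames)
power = spectral_power(frames)

# One feature row per frame: zero crossing rate and mean spectral power.
features = np.column_stack([zcr, power.mean(axis=1)])

# Frequency of the strongest bin, averaged over frames.
peak_freq = power.mean(axis=0).argmax() * sample_rate / frame_length
print(features.shape, peak_freq)
```

The resulting `features` array has one row per frame, which is the two-dimensional `(n_samples, n_features)` layout that scikit-learn estimators expect.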


