# Report
#### UJ SN2019 Task 2: Nocturnal Bird Migrations (Zadanie 2: Nocne Ptasie Wędrówki)

## Data
Recordings are 10 s WAV files annotated with the time intervals in which a bird voice was heard. <br>
__Purpose__: Predict the probability that a bird voice occurs in a given second of a recording.

Statistics about bird voices can be found in __first_approach/data_analysis.ipynb__.

#### I gathered the following statistics:
- count of intervals where a bird was heard (for each recording)
- whether intervals are correctly marked (stop time is not before start time) <br>
    I intended to remove all incorrectly marked intervals, but none were found.
- bird voice length
    - Max length: 0.4543999999999997
    - Min length: 0.0039999999999995595
    - Median length: 0.020500000000000185
    - Mean length: 0.058235670103092746<br>
    Based on these values, I chose the interval length for the BirdDetector model: lower than the max length and higher than the mean (about twice the mean).
- uniqueness of bird voice length
- distribution of bird voice length with reference to median
    - Higher than median: 30 (unique values: 13)
    - Lower than median: 117 (unique values: 53)
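
The statistics above can be sketched as follows. This is a minimal illustration, assuming annotations are available as `(start, stop)` pairs in seconds; the function name `interval_stats` and the example values are hypothetical and not taken from the notebook.

```python
import numpy as np

def interval_stats(intervals):
    """Summarise (start, stop) annotations given in seconds."""
    arr = np.asarray(intervals, dtype=float)
    starts, stops = arr[:, 0], arr[:, 1]
    lengths = stops - starts
    median = np.median(lengths)
    return {
        # An interval is treated as correctly marked when stop is after start.
        "all_valid": bool(np.all(stops > starts)),
        "max": float(lengths.max()),
        "min": float(lengths.min()),
        "median": float(median),
        "mean": float(lengths.mean()),
        "above_median": int(np.sum(lengths > median)),
        "below_median": int(np.sum(lengths < median)),
    }

# Hypothetical annotations, not taken from the competition data.
example = [(0.10, 0.15), (1.20, 1.22), (3.00, 3.40)]
stats = interval_stats(example)
```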

# First approach

### Data preparation
For a given recording, the data was split into seconds. <br>
For each second I checked: 'Was there a bird voice in this second?'
- no: <br>
    the second was split into 0.1 s parts and each part was assigned 0, meaning no bird is present in that part
- yes: <br>
    I found the exact time at which the bird is heard in that second and took the interval [start_time - 1s, end_time + 1s] <br>
    Then I determined how many intervals to create for positive samples. The number equals the count of samples in the recording where no bird is present (number of seconds without a bird * 10). The multiplication by 10 is because 1 second has ten 0.1 s parts. <br>
    Each interval was split into 0.1 s parts, and for each part I checked whether a bird voice was heard:
    - If yes - assign 1
    - If no - assign 0
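
The labelling of 0.1 s parts described above could look roughly like this. It is a sketch under assumptions: a part is marked positive when it overlaps an annotated interval (the report does not state the exact rule), only whole parts that fit in the window are kept, and `label_parts` and the example times are hypothetical.

```python
import numpy as np

def label_parts(window_start, window_end, bird_intervals, part=0.1):
    """Return (part_starts, labels) for consecutive `part`-second slices."""
    # Keep only whole parts that fit inside the window (assumption).
    n_parts = int(np.floor((window_end - window_start) / part + 1e-9))
    part_starts = window_start + part * np.arange(n_parts)
    labels = np.zeros(n_parts, dtype=int)
    for bird_start, bird_end in bird_intervals:
        # Assumed rule: a part is positive if it overlaps the bird interval.
        overlaps = (part_starts < bird_end) & (part_starts + part > bird_start)
        labels[overlaps] = 1
    return part_starts, labels

# Hypothetical bird call; the window is [start_time - 1 s, end_time + 1 s].
bird = (4.23, 4.29)
starts, labels = label_parts(bird[0] - 1.0, bird[1] + 1.0, [bird])
```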
    
__Final state of data__: spectrograms of 0.1 s recording parts with corresponding labels (0 or 1). <br>
Spectrograms were normalized.

Data was split into train (70%) and validation (30%).

Train labels contain:
- negative samples (without bird voice): 2160
- positive samples (with bird voice): 5113

Validation labels contain:
- negative samples (without bird voice): 961
- positive samples (with bird voice): 2156
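
The counts above add up to 7273 training and 3117 validation samples, which matches a 70/30 split of 10390 samples. A minimal sketch of such a split follows; shuffling before splitting and the helper name `train_val_split` are assumptions, since the report does not describe how the split was drawn.

```python
import numpy as np

def train_val_split(n_samples, train_fraction=0.7, seed=0):
    """Return shuffled index arrays for the train and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

# 10390 = 2160 + 5113 + 961 + 2156, the label counts listed above.
train_idx, val_idx = train_val_split(10390)
```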

### Architecture
Two models:
1. convolutional neural network
    - takes the spectrogram of a 0.1 s part as input
    - returns the probability of a bird voice in that 0.1 s part (0-1) <br>
    __The highest validation ROC AUC: 0.685988128082064__
2. MLP
    - takes 10 consecutive outputs of the model above, joined into one vector, as input
    - returns the probability of a bird voice in the given second (0-1)<br>
    __The highest validation ROC AUC: 0.7633924884328613__
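
Joining the outputs of the first model into inputs for the second one can be sketched as a simple reshape. This assumes the per-part probabilities are stored in time order; `group_into_seconds` and the example values are hypothetical.

```python
import numpy as np

def group_into_seconds(part_probs, parts_per_second=10):
    """Reshape per-0.1 s probabilities into one row of 10 values per second."""
    part_probs = np.asarray(part_probs, dtype=float)
    n_seconds = part_probs.size // parts_per_second
    # Drop a trailing incomplete second, if any.
    usable = part_probs[: n_seconds * parts_per_second]
    return usable.reshape(n_seconds, parts_per_second)

# Hypothetical CNN outputs for one 10 s recording: 100 values.
cnn_outputs = np.linspace(0.0, 1.0, 100)
mlp_inputs = group_into_seconds(cnn_outputs)
```

Each row of `mlp_inputs` would then be one input vector for the MLP.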
    
Training is stopped when there is no improvement for N consecutive epochs.
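
The stopping rule can be illustrated independently of any framework. The sketch below assumes the monitored value is a validation score where higher is better; `epochs_until_stop` and the score list are hypothetical.

```python
def epochs_until_stop(val_scores, patience):
    """Return the 1-based epoch at which training would stop."""
    best = float("-inf")
    stale = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores)

# Hypothetical validation scores per epoch.
scores = [0.60, 0.65, 0.68, 0.67, 0.66, 0.68, 0.70]
stop_epoch = epochs_until_stop(scores, patience=3)
```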
    
__Score on public Kaggle leaderboard: 0.41438__

### To reproduce my score:
- unzip train and test data from https://www.kaggle.com/c/ujnn2019-2/data and place them in first_approach
- run the notebooks in the following order:
    1. first_approach/data_analysis.ipynb
       - for checking statistics about recordings
    2. first_approach/data_preparation.ipynb
       - for preparing data for convolutional neural network
    3. first_approach/model/balanced_data/convolutional_neural_network.ipynb
       - for training the model on 0.1 s samples and preparing data for the classifier model
    4. first_approach/model/balanced_data/classifier.ipynb
       - for training the model on 1 s samples and preparing the final submission (based on 0.1 s parts joined into 1 s)

__In my opinion, this is the most promising approach__ (because of the data preparation and the combination of two models: one predicts the probability for a small part of the recording, and the second joins those predictions and predicts the probability for 1 second). <br>

The first model is the core of this solution. If it reached a better validation ROC AUC score, the overall score could be much higher.<br>

The final score looks like the product of the highest ROC AUC scores of the two models.

# Second approach

### Data preparation
For each second of a recording, spectrograms of its 0.2 s parts were created and stacked one on top of another. <br>
Each such representation was assigned a label:
- 1 - with bird voice
- 0 - without bird voice


__Final state of data__: an image consisting of 5 channels (each channel holds the spectrogram of one 0.2 s part) with a corresponding label (0 or 1).
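
Building the 5-channel image can be sketched with `numpy.stack`. The channel-first layout, the spectrogram shape `(64, 20)` and the helper name `stack_second` are assumptions for illustration only.

```python
import numpy as np

def stack_second(spectrograms):
    """Stack five (freq, time) spectrograms into one (5, freq, time) array."""
    if len(spectrograms) != 5:
        raise ValueError("expected five 0.2 s spectrograms per second")
    return np.stack(spectrograms, axis=0)

# Hypothetical spectrograms of the five 0.2 s parts of one second.
rng = np.random.default_rng(0)
parts = [rng.random((64, 20)) for _ in range(5)]
image = stack_second(parts)
```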

__Important__: data is imbalanced in this case!


#### Train data
- Probability of class '0': 0.8717847249703206
- Probability of class '1': 0.12821527502967947

#### Validation data
- Probability of class '0': 0.8707294552169899
- Probability of class '1': 0.12927054478301014
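
Class probabilities like these can be computed directly from the label array. The sketch below uses the overall counts reported in the Upsampling section (464 with a bird, 3146 without); the helper name `class_probabilities` is hypothetical.

```python
import numpy as np

def class_probabilities(labels):
    """Return the empirical probability of class 0 and class 1."""
    counts = np.bincount(np.asarray(labels, dtype=int), minlength=2)
    return counts / counts.sum()

# Labels rebuilt from the counts reported before upsampling.
labels = np.array([0] * 3146 + [1] * 464)
probs = class_probabilities(labels)
```

With these counts the share of class '1' is 464 / 3610, about 0.1285, which lies between the train and validation values listed above.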


<cite>"In the case of imbalanced data, majority classes dominate over minority classes, causing the machine learning classifiers to be more biased towards majority classes." </cite>

I created a model to observe this effect. <br>
Observations can be found in __second_approach/model/imbalanced_data/MLP.ipynb__:
- Accuracy on the validation dataset is really low(!).
- Accuracy on the training data is close to the share of samples labelled "bird does not exist".
- The model has not learned to recognize the rarely occurring class.
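
To illustrate why accuracy close to the majority share is a warning sign, the sketch below scores a constant "no bird" predictor on hypothetical labels with a 87/13 class ratio. The pairwise `roc_auc` helper is a simple stand-in written for this example, not the metric implementation used in the notebook.

```python
import numpy as np

def accuracy(labels, scores, threshold=0.5):
    """Share of samples whose thresholded score equals the label."""
    return float(np.mean((scores >= threshold).astype(int) == labels))

def roc_auc(labels, scores):
    """Probability that a positive is scored above a negative (ties count 0.5)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Hypothetical imbalanced labels and a model that always predicts "no bird".
labels = np.array([0] * 87 + [1] * 13)
constant = np.zeros(labels.size)
acc = accuracy(labels, constant)
auc = roc_auc(labels, constant)
```

The constant predictor reaches an accuracy equal to the majority share while its ROC AUC stays at chance level, which is why ROC AUC is the more informative number here.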

#### Upsampling
Next, I performed upsampling using imblearn.over_sampling.RandomOverSampler. Then I used sklearn.model_selection.StratifiedKFold to split the data into train and validation sets, each with the same number of samples for labels 0 and 1.

Before upsampling:
- Number of samples for 'bird exists': 464
- Number of samples for 'bird does not exist': 3146

After upsampling and splitting into train and validation:
- train
    - Number of samples for 'bird exists': 2517
    - Number of samples for 'bird does not exist': 2517
- validation:
    - Number of samples for 'bird exists': 629
    - Number of samples for 'bird does not exist': 629
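
The effect of random oversampling on the counts above can be reproduced with a NumPy-only stand-in. This is not the imblearn implementation, only a sketch of the same idea (duplicating randomly chosen minority samples until the classes are equal); `random_oversample` and the dummy features are hypothetical.

```python
import numpy as np

def random_oversample(features, labels, seed=0):
    """Duplicate random samples of smaller classes up to the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Draw the missing samples with replacement from the same class.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return features[keep], labels[keep]

# Counts reported before upsampling; features are dummy sample ids.
labels = np.array([1] * 464 + [0] * 3146)
features = np.arange(labels.size).reshape(-1, 1)
x_bal, y_bal = random_oversample(features, labels)
```

After oversampling each class has 3146 samples, and 2517 + 629 = 3146, so the reported train and validation counts are consistent with holding out roughly one fifth of each class. That five-fold reading is an inference from the numbers, not something the report states.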

### Architecture
Convolutional neural network. <br>
Training is stopped when there is no improvement for N consecutive epochs. <br>

__The highest validation ROC AUC score: 0.5763912233565278__ <br>
__Score on public Kaggle leaderboard: 0.47123__  <br>

### To reproduce my score:
- unzip train and test data from https://www.kaggle.com/c/ujnn2019-2/data and place them in second_approach
- run the notebooks in the following order:
    1. second_approach/imbalanced_data.ipynb
       - for creating imbalanced train and validation data
    2. second_approach/model/imbalanced_data/MLP.ipynb
       - for checking ROC AUC score on imbalanced data
    3. second_approach/balancing_data.ipynb
       - for preparing balanced data
    4. second_approach/model/balanced_data/convolutional_neural_network.ipynb
       - for training the model on images consisting of 5 channels (each channel holds the spectrogram of one 0.2 s part) and preparing the final submission

# Third approach

1. Copy helpers.ipynb from Kaggle, replace its model with a convolutional neural network, and adapt the model input accordingly.
2. The same as above, but with the ReduceLROnPlateau learning rate scheduler.
<br>
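
The idea behind a ReduceLROnPlateau scheduler can be sketched without any framework. Both PyTorch and Keras provide a scheduler with this name; the report does not say which one was used, so the function below is a hypothetical stand-in that assumes a monitored score where higher is better and illustrative values for `factor` and `patience`.

```python
def reduce_on_plateau(val_scores, lr=1e-3, factor=0.1, patience=2):
    """Return the learning rate in effect after each epoch."""
    best = float("-inf")
    stale = 0  # epochs since the last improvement
    history = []
    for score in val_scores:
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        if stale > patience:
            # The score has plateaued: shrink the learning rate and start over.
            lr *= factor
            stale = 0
        history.append(lr)
    return history

# Hypothetical validation scores per epoch.
lrs = reduce_on_plateau([0.60, 0.65, 0.64, 0.63, 0.62, 0.66],
                        lr=0.01, factor=0.5, patience=2)
```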

Sounds boring? Compared with the approaches above (in my opinion, more creative ideas), for me it does. <br>
The time spent on the first and second approaches (over a dozen hours) was much higher than on the third approach (a few minutes). <br>
Quite demotivating when I look at the scores of all my approaches on the public Kaggle leaderboard...

Without ReduceLROnPlateau:
- __The highest validation ROC AUC score: 0.691548716261__ <br>
- __Score on public Kaggle leaderboard: 0.81271__

With ReduceLROnPlateau:
- __The highest validation ROC AUC score: 0.7858403110047847__ <br>
- __Score on public Kaggle leaderboard: 0.72617__

### To reproduce my score:
- unzip train and test data from https://www.kaggle.com/c/ujnn2019-2/data and place them in third_approach
- run third_approach/helpers.ipynb or third_approach/helpers_ReduceLROnPlateau.ipynb (depending on which solution you want to reproduce)