**Find the Optimal Hampel Filters**

# Imports

## Libraries

In [1]:
import os
import pandas as pd
import glob
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.rcParams.update({'font.size': 12})
import numpy as np
import librosa
import numba 
from numba import jit
import warnings # from https://stackoverflow.com/questions/14463277/how-to-disable-python-warnings
def fxn():
    warnings.warn("deprecated", DeprecationWarning)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()
from functions import *
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score



## Data

In [2]:
cwd = os.getcwd()
corrected_textgrid_names_list = glob.glob(cwd + "/corrected_textgrid/*ENF_0?TextGrid")

In [3]:
corrected_textgrid_names_list

['/Users/gregfeliu/Desktop/Flatiron Bootcamp/Vowel Identifier/corrected_textgrid/ENF_06TextGrid',
 '/Users/gregfeliu/Desktop/Flatiron Bootcamp/Vowel Identifier/corrected_textgrid/ENF_02TextGrid']

### DataFrame

In [4]:
df = pd.read_csv("combined_corrected_textgrids", index_col= 0)

In [5]:
df.head(3)

Unnamed: 0,Phone_Text,Phone_xmin,Phone_xmax,Word_Text,Word_xmin,Word_xmax,Vowel,Speaker,Phone_Duration
0,,0.0,2.0,,0.0,2.0,0,6,2.0
1,!SIL,2.0,4.589832,!SIL,2.0,4.361218,0,6,2.589832
2,AH,4.589832,4.831125,!SIL,2.0,4.361218,1,6,0.241293


### Audio

The main focus of this notebook is finding the best parameters for the Hampel filter, using the corrected textgrids from speakers ENF_02 and ENF_06. To keep processing time manageable during the parameter search, I sample only the first 90 seconds of the audio file, resampled to 8000 Hz.

In [6]:
short_audio, sr = librosa.load("./original_en_diapix_data/DP_ENF_02_ENF_06_EN_ENF_02_DP_ENF_02_ENF_06_EN_ENF_06.wav", duration = 90, sr = 8000)

In [7]:
len(short_audio)

720000

# Using Hampel Filter

I will apply the Hampel filter with a given set of parameters (window size and number of sigmas), then evaluate how many of its predictions fall inside a vowel interval. For each parameter set I will compute recall, precision, and F1 score, then compare the parameter sets on these metrics to determine which one is best.
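For readers unfamiliar with the technique, here is a minimal sketch of Hampel-style outlier flagging. It is not the implementation used in this notebook (that lives in the `functions` module, which is not shown), and the name `hampel_outlier_mask` and the choice to treat `window_size` as a half-width are assumptions made for illustration. How the notebook's helper maps flagged samples to vowel predictions is also not shown here.

```python
import numpy as np

def hampel_outlier_mask(signal, window_size, n_sigmas):
    """Flag samples that deviate from the local median by more than
    n_sigmas scaled median absolute deviations (MAD).

    Hypothetical helper: window_size is treated as a half-width here.
    """
    signal = np.asarray(signal, dtype=float)
    k = 1.4826  # scales the MAD to estimate the standard deviation for Gaussian data
    mask = np.zeros(signal.size, dtype=bool)
    # Edge samples without a full window are left unflagged.
    for i in range(window_size, signal.size - window_size):
        window = signal[i - window_size : i + window_size + 1]
        median = np.median(window)
        scaled_mad = k * np.median(np.abs(window - median))
        mask[i] = np.abs(signal[i] - median) > n_sigmas * scaled_mad
    return mask

# Toy signal: one period of a sine wave with a spike injected at index 50.
toy = np.sin(np.linspace(0, 2 * np.pi, 100))
toy[50] += 5.0
flagged = np.flatnonzero(hampel_outlier_mask(toy, window_size=5, n_sigmas=3.0))
```

Lowering `n_sigmas` shrinks the threshold, so more samples are flagged; this is the trade-off the parameter search below explores.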

## Building Evaluation Method

### Find all of the intervals for each speaker that are actually vowels

#### Narrowing dataframe

In [8]:
short_df = df[(df['Phone_xmax'] < 90) & (df['Vowel'] == 1)].reset_index(drop = True)

In [9]:
short_df.head(3)

Unnamed: 0,Phone_Text,Phone_xmin,Phone_xmax,Word_Text,Word_xmin,Word_xmax,Vowel,Speaker,Phone_Duration
0,AH,4.589832,4.831125,!SIL,2.0,4.361218,1,6,0.241293
1,AY,5.481125,5.881125,mine,5.141125,6.111125,1,6,0.4
2,IH,6.161722,6.441125,mine,5.141125,6.111125,1,6,0.279403


#### Chunk the original audio into segments of 1/8000th of a second 

In [10]:
vowel_indices = chunk_vowels_to_sr(short_df)
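`chunk_vowels_to_sr` comes from the `functions` module and its body is not shown. The sketch below is one plausible way to turn vowel intervals (in seconds) into sample indices at 8000 Hz; the function name, its signature, and the rounding rule are assumptions, not the notebook's actual code.

```python
import numpy as np

def intervals_to_sample_indices(xmins, xmaxs, sample_rate=8000):
    """Return the sorted, unique sample indices covered by the
    half-open intervals [xmin, xmax), given in seconds.

    Hypothetical stand-in for chunk_vowels_to_sr.
    """
    pieces = [
        np.arange(int(round(start * sample_rate)), int(round(end * sample_rate)))
        for start, end in zip(xmins, xmaxs)
    ]
    if not pieces:
        return np.array([], dtype=int)
    return np.unique(np.concatenate(pieces))

# The first two vowel intervals from short_df.head(3).
indices = intervals_to_sample_indices([4.589832, 5.481125], [4.831125, 5.881125])
```

Dividing `len(indices)` by the sample rate gives the covered duration in seconds, which is the same sanity check performed in the next cells.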

##### Checking if this method worked

In [12]:
# doing this, I predict that 22 seconds of the first 90 seconds of the data are vowels
len(vowel_indices) / 8000

22.0005

In [13]:
# checking the actual duration of vowel sounds in this section of the audio 
short_df.Phone_Duration.sum()

21.97939940385457

In [14]:
# I am predicting that 24% of this section of the audio file is made up of vowel sounds. 
# Initially, I saw 12% of the audio was made up of vowel sounds
len(vowel_indices) / (90 * 8000)

0.24445

### Calculate the metrics

In [12]:
vowel_indices_binary = make_results_into_binary(vowel_indices)
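`make_results_into_binary` is also defined in `functions` and not shown. A minimal sketch of the idea, assuming it produces one 0/1 label per audio sample, follows. The names `indices_to_binary` and `precision_recall_f1` and the explicit `total_samples` argument are hypothetical; the `sklearn.metrics` functions imported at the top of the notebook compute the same three quantities.

```python
import numpy as np

def indices_to_binary(indices, total_samples):
    """Build a 0/1 label array with 1 at every listed sample index."""
    labels = np.zeros(total_samples, dtype=int)
    labels[np.asarray(indices, dtype=int)] = 1
    return labels

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary label arrays."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # predicted vowel, actually vowel
    fp = np.sum(~y_true & y_pred)   # predicted vowel, actually not
    fn = np.sum(y_true & ~y_pred)   # missed vowel sample
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example with 8 samples: truth covers 2-4, prediction covers 3-6.
y_true = indices_to_binary([2, 3, 4], total_samples=8)
y_pred = indices_to_binary([3, 4, 5, 6], total_samples=8)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

For this notebook, `total_samples` would be `len(short_audio)`, i.e. 720000.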

## Testing Different Filters

In [30]:
# putting all results into a dataframe
metric_df = pd.DataFrame()

In [31]:
# trying window sizes 25-200 (step 25) and n_sigmas 0.0-3.5 (step 0.5)
metric_df_list = []
for y in range(0, 36, 5): # range() needs integer steps, so y/10 gives n_sigmas = 0.0, 0.5, ..., 3.5
    for x in range(25, 201, 25):
        single_filter_metric_dict = use_filter_and_calculate_metrics(x, (y/10), short_audio, vowel_indices_binary)
        single_filter_metric_dict['window_size'] = x
        single_filter_metric_dict['n_sigmas'] = (y/10)
        single_metric_df = pd.DataFrame.from_dict(single_filter_metric_dict, orient='index').T
        metric_df_list.append(single_metric_df)

In [32]:
all_metric_values_df = pd.concat(metric_df_list)

In [33]:
all_metric_values_df.sort_values(by='F1_score', inplace=True, ascending=False)
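As an aside, concatenating many one-row dataframes leaves every row with index 0. A slightly tidier pattern is to collect one dict per parameter pair and build the dataframe once. The sketch below assumes an `evaluate` callable that returns a dict of metrics; `run_grid` and `dummy_evaluate` are hypothetical names, and the dummy scores are placeholders with no meaning.

```python
from itertools import product

import pandas as pd

def run_grid(window_sizes, n_sigmas_values, evaluate):
    """Evaluate every (window_size, n_sigmas) pair and return one
    dataframe sorted by F1_score, with a clean 0..n-1 index."""
    rows = []
    for n_sigmas, window_size in product(n_sigmas_values, window_sizes):
        row = dict(evaluate(window_size, n_sigmas))
        row["window_size"] = window_size
        row["n_sigmas"] = n_sigmas
        rows.append(row)
    result = pd.DataFrame(rows)
    return result.sort_values("F1_score", ascending=False).reset_index(drop=True)

def dummy_evaluate(window_size, n_sigmas):
    # Placeholder scores only; swap in a real evaluation function.
    return {"F1_score": 1.0 / (1.0 + n_sigmas + window_size / 100)}

grid = run_grid(
    window_sizes=range(25, 201, 25),
    n_sigmas_values=[y / 10 for y in range(0, 36, 5)],
    evaluate=dummy_evaluate,
)
```

With the same ranges as the loop above, this covers 8 window sizes times 8 sigma values, or 64 parameter pairs.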

In [13]:
all_metric_values_df.head(5)

NameError: name 'all_metric_values_df' is not defined

In [35]:
# all_metric_values_df.to_csv("First_round_of_hampel_filter_values.csv")

A few conclusions from this round:
- The maximum F1 score in this grid is 0.42
- With n_sigmas set to 0, every sample is predicted to be a vowel
- Recall (the share of actual vowel samples that were predicted as vowels) is almost always higher than precision (the share of predicted vowel samples that are actually vowels)
- The lower sigma values did much better
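The observation that n_sigmas = 0 predicts every sample as a vowel gives a useful baseline. In that case recall is 1 and precision equals the share of vowel samples in the audio (0.24445, from the check earlier in the notebook). The short calculation below, using a hypothetical helper `f1_from`, shows the F1 score such a trivial predictor would get.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

vowel_fraction = 0.24445  # share of samples labeled as vowels in the first 90 s
baseline_f1 = f1_from(precision=vowel_fraction, recall=1.0)
print(round(baseline_f1, 3))  # 0.393
```

So the best score in this round (0.42) is only modestly above what predicting "vowel" everywhere would achieve.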

## Second round of testing filters

In this round of testing, we'll focus on the lower sigma values: n_sigmas from 0.1 to 1.0 in steps of 0.1.
For the window size, we'll test values spread over a much wider range (16 to 1296).

In [20]:
# fourth powers of 2..6 give non-linearly spaced window sizes: 16, 81, 256, 625, 1296
log_window_values = [x**4 for x in range(2, 7)]

In [21]:
# putting all results into a dataframe
metric_df2 = pd.DataFrame()

In [23]:
# trying window sizes 16, 81, 256, 625, 1296 and n_sigmas 0.1-1.0 (step 0.1)
metric_df_list2 = []
for y in range(1, 11, 1):
    for x in log_window_values:
        single_filter_metric_dict = use_filter_and_calculate_metrics(x, (y/10), short_audio, vowel_indices_binary)
        single_filter_metric_dict['window_size'] = x
        single_filter_metric_dict['n_sigmas'] = (y/10)
        single_metric_df = pd.DataFrame.from_dict(single_filter_metric_dict, orient='index').T
        metric_df_list2.append(single_metric_df)

In [24]:
all_metric_values_df2 = pd.concat(metric_df_list2)

In [25]:
all_metric_values_df2.sort_values(by='F1_score', inplace=True, ascending=False)

In [28]:
all_metric_values_df2.head()

Unnamed: 0,F1_score,Recall,Precision,window_size,n_sigmas
0,0.436068,0.8132,0.297908,16.0,0.2
0,0.435113,0.746324,0.307068,16.0,0.3
0,0.429632,0.680919,0.31382,16.0,0.4
0,0.427297,0.888708,0.281266,16.0,0.1
0,0.423631,0.499028,0.368027,1296.0,1.0


#### Saving the dataframe

In [27]:
# all_metric_values_df2.to_csv("Second_round_of_hampel_filter_values.csv")

Final conclusions:
- Even after manual testing, the best F1 score this method achieves is about 0.44 (0.436 with window_size 16 and n_sigmas 0.2), only slightly above the first round's 0.42