# Supervised learning of Hillenbrand vowel data

This notebook implements supervised learning of vowel clusters using a Support Vector Machine (SVM). An SVM was ultimately chosen after trying out several different classifiers (not shown). The predictors/features are the formant values (steady state and various points in between onset and steady-state value). The vowel identities themselves are used as the targets.

In [1]:
# run setup script

%run '~/GitHub/hillenbrand-vowel-clustering/scripts/hillenbrand-data-setup.py'

In [2]:
# reduction via PCA;
# not used here, but additional features if so desired

from sklearn.decomposition import PCA

pca = PCA(n_components = 2)

formant_reduced = pca.fit_transform(formant_mtx)
formant_ratio_reduced = pca.fit_transform(formant_ratio_mtx)

# add PCA columns to hillenbrand_data DataFrame
# not needed for this implementation, but
# could be used as additional predictors
# (the vowel labels, not the PCs, are used as targets so that
# classification reports and confusion matrices are easier to read)

hillenbrand_data['Formant_PC1'] = formant_reduced[:, 0]
hillenbrand_data['Formant_PC2'] = formant_reduced[:, 1]

hillenbrand_data['Formant_Ratio_PC1'] = formant_ratio_reduced[:, 0]
hillenbrand_data['Formant_Ratio_PC2'] = formant_ratio_reduced[:, 1]

In [3]:
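# import needed modules and functions
# (sklearn.grid_search and sklearn.cross_validation exist only in older
# scikit-learn releases; later releases moved both into sklearn.model_selection)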

from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import StratifiedKFold, train_test_split 

## Split into test and train sets

The predictors for the classification are:  
1. steady-state formant values  
2. formant values at 20%, 50%, 80% of steady-state  
3. (optional) z-score normalized versions of `1` and `2` above (can help speed up fitting)  

There is an 80/20 split between training and test sets.
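
The z-score columns in item 3 come from the setup script, which is not shown here. As a rough illustration of what such a column contains, here is a minimal sketch; the `zscore` helper and the sample values are assumptions made for this example, and the setup script may well use a different standard deviation estimator (e.g., the sample rather than the population version).

```python
import math

def zscore(values):
    """Return (x - mean) / sd for each x, using the population standard deviation."""
    n = len(values)
    mean = sum(values) / float(n)
    variance = sum((x - mean) ** 2 for x in values) / float(n)
    sd = math.sqrt(variance)
    return [(x - mean) / sd for x in values]

# made-up F1 values in Hz, for illustration only (not taken from the data set)
f1_example = [300.0, 450.0, 600.0, 750.0]
f1_zscore = zscore(f1_example)
print(f1_zscore)
```

After this transformation each column has mean 0 and unit standard deviation, so all formants end up on a comparable scale regardless of their range in Hz.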

In [4]:
# data split into train and test sets

# 80/20 data split

# using untransformed columns to predict
# other columns there just in case
# normed columns should give same results; 
# difference is that model should converge faster

formant_cols = ['F1', 'F2', 'F3', 'F4', 
                'F1_20', 'F2_20', 'F3_20', 
                'F1_50', 'F2_50', 'F3_50',
                'F1_80', 'F2_80', 'F3_80']


formant_norm_cols = ['F1_zscore', 'F2_zscore', 'F3_zscore', 'F4_zscore', 
                     'F1_20_zscore', 'F2_20_zscore', 'F3_20_zscore', 
                     'F1_50_zscore', 'F2_50_zscore', 'F3_50_zscore',
                     'F1_80_zscore', 'F2_80_zscore', 'F3_80_zscore']


vowel_train, vowel_test, vowel_label_train, vowel_label_test = \
  train_test_split(hillenbrand_data[formant_cols], hillenbrand_data['Vowel'], 
                   test_size = 0.2, random_state = 8393)

## Classifier parameters

Several different classifiers were tested, and the SVM performed best on the initial run, so it was chosen to classify the data. A polynomial kernel was chosen due to the complexity of the vowel production and perceptual space. A grid search with 5-fold cross-validation was used to determine the best values for `C` (regularization parameter), `gamma` (kernel coefficient) and the polynomial degree (3, 5, 7). `C` and `gamma` were each searched over 7 log-spaced values so as to keep the duration of the search reasonable.  

Since class probabilities are also being computed, fitting will take longer.
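
To give a sense of the size of the search, the sketch below rebuilds the grid described above in plain Python and counts the model fits. It assumes the 7 values are evenly spaced in the exponent between -4 and 3, which is what `np.logspace(-4, 3, num = 7)` produces; the variable names are illustrative only.

```python
from itertools import product

# 7 exponents evenly spaced from -4 to 3, mirroring np.logspace(-4, 3, num=7)
exponents = [-4 + i * (3 - (-4)) / 6.0 for i in range(7)]
c_values = [10 ** e for e in exponents]
gamma_values = list(c_values)
degrees = (3, 5, 7)

# every (C, gamma, degree) combination in the grid
grid = list(product(c_values, gamma_values, degrees))
n_folds = 5
n_fits = len(grid) * n_folds  # one fit per combination per fold

print('combinations: %d, cross-validation fits: %d' % (len(grid), n_fits))
```

With 7 x 7 x 3 combinations and 5 folds, the search fits several hundred models before the final refit, which is why the grid was kept coarse.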

In [5]:
# set up classifier parameters
  
svc = ('classifier', SVC(kernel = 'poly', probability = True, random_state = 4748))

# creating pipeline (though not really needed since single step)

svc_pipeline = Pipeline([svc])

# set up parameter grid search

svc_params = \
  {'classifier__C': (np.logspace(-4, 3, num = 7)),
  'classifier__gamma': (np.logspace(-4, 3, num = 7)),
  'classifier__degree': (3, 5, 7)}

# create search grid

vowel_grid = \
  GridSearchCV(svc_pipeline, svc_params, refit = True, n_jobs = -1,
               scoring = 'accuracy', cv = StratifiedKFold(vowel_label_train, n_folds = 5))

In [6]:
# fitting using search parameters
# takes longer due to calculation of class probabilities

%time vowel_detector = vowel_grid.fit(vowel_train, vowel_label_train)

CPU times: user 9.76 s, sys: 246 ms, total: 10 s
Wall time: 10min 17s


## Predictions and performance metrics

Metrics of interest:  
1. best score  
2. best parameters  
3. precision, recall, F1 score (classification report)  
4. confusion matrix  
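
For reference, precision, recall and F1 for each class can be read off a confusion matrix whose rows are true labels and whose columns are predicted labels (the layout that `confusion_matrix` returns). The sketch below uses a made-up 3-class matrix, not the vowel results, and `per_class_metrics` is a hypothetical helper written for this illustration.

```python
def per_class_metrics(matrix):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = true_pos / float(predicted_k) if predicted_k else 0.0
        recall = true_pos / float(actual_k) if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results

# made-up counts: rows are true classes, columns are predicted classes
toy_matrix = [[8, 2, 0],
              [1, 7, 2],
              [0, 1, 9]]

toy_metrics = per_class_metrics(toy_matrix)
for precision, recall, f1 in toy_metrics:
    print('%.2f %.2f %.2f' % (precision, recall, f1))
```

Precision divides the diagonal entry by its column sum (everything predicted as that class), recall divides it by its row sum (everything that truly is that class), and F1 is the harmonic mean of the two.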

In [7]:
print 'Best parameters from SVM classifying vowels:'
vowel_detector.best_params_

Best parameters from SVM classifying vowels:


{'classifier__C': 0.0001, 'classifier__degree': 7, 'classifier__gamma': 0.0001}

In [8]:
print 'Best score from 5-fold cross-validation on grid search parameters:'
vowel_detector.best_score_

Best score from 5-fold cross-validation on grid search parameters:


0.89890710382513661

In [9]:
# predictions and performance metrics

vowel_prediction = vowel_detector.predict(vowel_test)

vowel_prediction_probs = vowel_detector.predict_proba(vowel_test)

In [10]:
print 'Target class: Vowel'
print 'Classification report'
print classification_report(vowel_label_test, vowel_prediction)
print 'Confusion matrix'
print confusion_matrix(vowel_label_test, vowel_prediction)

Target class: Vowel
Classification report
             precision    recall  f1-score   support

         ae       0.94      0.88      0.91        17
         ah       0.69      0.73      0.71        15
         aw       0.83      0.79      0.81        24
         eh       0.81      0.89      0.85        19
         ei       0.94      0.94      0.94        18
         er       1.00      0.96      0.98        28
         ih       0.89      0.96      0.93        26
         iy       0.96      0.96      0.96        26
         oa       0.81      0.87      0.84        15
         oo       0.96      0.92      0.94        26
         uh       0.85      0.90      0.88        31
         uw       0.92      0.80      0.86        30

avg / total       0.89      0.89      0.89       275

Confusion matrix
[[15  1  0  1  0  0  0  0  0  0  0  0]
 [ 0 11  3  0  0  0  0  0  0  0  1  0]
 [ 0  3 19  0  0  0  0  0  0  0  2  0]
 [ 1  0  0 17  0  0  1  0  0  0  0  0]
 [ 0  0  0  0 17  0  1  0  0  0  0  0]
 

## Comparison with normalized data
Normalizing the data speeds up computation considerably: the grid search took about 10 minutes of wall time on the raw formant values and under a minute on the z-scored values. The best parameters and some of the performance metrics change as well. A comparison using the normed vowel formant data is below.

In [11]:
vowel_norm_train, vowel_norm_test, vowel_norm_label_train, vowel_norm_label_test = \
  train_test_split(hillenbrand_data[formant_norm_cols], hillenbrand_data['Vowel'], 
                   test_size = 0.2, random_state = 8393)

In [12]:
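# note: fit() returns vowel_grid itself, so this refit also replaces the model behind vowel_detector
%time vowel_norm_detector = vowel_grid.fit(vowel_norm_train, vowel_norm_label_train)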

CPU times: user 3.6 s, sys: 120 ms, total: 3.72 s
Wall time: 46.6 s
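
One plausible contributor to the shorter fit time is the scale of the kernel values. scikit-learn's polynomial kernel is `(gamma * <x, x'> + coef0) ** degree`; with formants in Hz, the dot product between two tokens runs into the millions before it is raised to the degree, whereas z-scored features keep it near unit scale. The sketch below illustrates the difference with made-up vectors; it is an illustration of the formula under those assumptions, not a profile of the actual fit.

```python
def poly_kernel(x, y, gamma, degree, coef0=0.0):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

# made-up (F1, F2, F3) vectors in Hz and in z-score units, for illustration only
raw_a, raw_b = [300.0, 2300.0, 3000.0], [700.0, 1200.0, 2600.0]
z_a, z_b = [-1.0, 1.2, 0.5], [1.1, -0.9, -0.3]

k_raw = poly_kernel(raw_a, raw_b, gamma=1.0, degree=3)
k_z = poly_kernel(z_a, z_b, gamma=1.0, degree=3)
print('raw: %.3g, z-scored: %.3g' % (k_raw, k_z))
```

The raw-scale kernel value is many orders of magnitude larger than the z-scored one, which would also fit with the grid search preferring a very small `gamma` on the raw data.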


In [13]:
print 'Best parameters from SVM classifying vowels (z-score formant values):'
vowel_norm_detector.best_params_

Best parameters from SVM classifying vowels (z-score formant values):


{'classifier__C': 0.0014677992676220704,
 'classifier__degree': 3,
 'classifier__gamma': 4.641588833612782}

In [14]:
print 'Best score from 5-fold cross-validation on grid search parameters (z-score formant values):'
vowel_norm_detector.best_score_

Best score from 5-fold cross-validation on grid search parameters (z-score formant values):


0.89708561020036426

In [15]:
vowel_norm_prediction = vowel_norm_detector.predict(vowel_norm_test)

vowel_norm_prediction_probs = vowel_norm_detector.predict_proba(vowel_norm_test)

In [16]:
print 'Target class: Vowel (z-score data)'
print 'Classification report'
print classification_report(vowel_norm_label_test, vowel_norm_prediction)
print 'Confusion matrix (z-score data)'
print confusion_matrix(vowel_norm_label_test, vowel_norm_prediction)

Target class: Vowel (z-score data)
Classification report
             precision    recall  f1-score   support

         ae       0.83      0.88      0.86        17
         ah       0.76      0.87      0.81        15
         aw       0.87      0.83      0.85        24
         eh       0.62      0.84      0.71        19
         ei       1.00      0.94      0.97        18
         er       1.00      0.96      0.98        28
         ih       0.96      1.00      0.98        26
         iy       1.00      1.00      1.00        26
         oa       0.93      0.87      0.90        15
         oo       0.88      0.88      0.88        26
         uh       0.96      0.84      0.90        31
         uw       0.93      0.83      0.88        30

avg / total       0.91      0.90      0.90       275

Confusion matrix (z-score data)
[[15  0  0  2  0  0  0  0  0  0  0  0]
 [ 0 13  2  0  0  0  0  0  0  0  0  0]
 [ 0  3 20  0  0  0  0  0  0  0  1  0]
 [ 3  0  0 16  0  0  0  0  0  0  0  0]
 [ 0  0  0

Overall, the best cross-validation score is about the same for the two representations (0.899 for the raw formants, 0.897 for the z-scores). The difference lies in how well specific vowels are identified: `ah` and `eh` show the largest changes in F1 (0.71 to 0.81 and 0.85 to 0.71, respectively), while most other vowels have similar precision, recall and F1 values under both representations. This is likely because the formant values are not normally distributed; perhaps scaling in a different manner (e.g., over 2 standard deviations) might resolve this discrepancy.

In [17]:
import sys
import IPython

In [18]:
print 'Python version: ', sys.version
print 'Platform: ', sys.platform
print 'IPython version: ', IPython.__version__
print 'NumPy version: ', np.__version__
print 'Pandas version: ', pd.__version__

Python version:  2.7.10 |Anaconda 2.3.0 (x86_64)| (default, May 28 2015, 17:04:42) 
[GCC 4.2.1 (Apple Inc. build 5577)]
Platform:  darwin
IPython version:  3.2.1
NumPy version:  1.9.2
Pandas version:  0.16.2
