<a href="https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Exercise_2_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 2.3 Vowel Classification Problem

In this exercise we implement a system to classify vowels from their formant frequencies. We first explore some characteristics of the data and then implement a simple k-nearest-neighbour classifier.

(a) The following code reads in, summarises and generates plots from a data set of vowel formant measurements. Run the code blocks and add comments to describe what is happening in each step.

In [None]:
#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/vowels.csv")

# 
df.head()

In [None]:
# 
df.describe()

In [None]:
#
def plot_compare(data,ylabel):
  plt.boxplot(data,labels=("male","female"))
  plt.xlabel("Sex")
  plt.ylabel(ylabel)

# 
male=df.loc[df.SEX=="male"]
female=df.loc[df.SEX=="female"]

# 
plt.figure(figsize=(16,5))

# 
plt.subplot(1,3,1)
plot_compare([male.F1,female.F1],"F1 (Hz)")

# 
plt.subplot(1,3,2)
plot_compare([male.F2,female.F2],"F2 (Hz)")

# 
plt.subplot(1,3,3)
plot_compare([male.HEIGHT,female.HEIGHT],"Height (cm)")

# 
plt.show()


---
(b) This code plots an F1-F2 scatter plot in which different vowels are displayed in different colours. Run the code and then add comments to the code to describe what is happening in each step.


In [None]:
# 
df["VOWEL"]=df.VOWEL.astype("category")
print(df.VOWEL.cat.categories)

# 
df["VOWELIDX"]=df.VOWEL.cat.codes
print(df.VOWELIDX)

# 
plt.figure(figsize=(10,10))
plt.scatter(df.F2,df.F1,c=df.VOWELIDX,cmap="tab10")
plt.axis([3000,500,1100,100])
plt.xlabel("F2 (Hz)")
plt.ylabel("F1 (Hz)")
plt.grid()
plt.show()

---
(c) This code builds a simple vowel classifier based on formant frequencies. It works by taking each vowel in turn and finding the 5 closest other vowels, then selecting the label that occurs most often among those nearest neighbours.

Run the code then add comments describing what is happening in each step.
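
As a warm-up, the nearest-neighbour idea can be sketched on a handful of made-up points. This is only an illustrative sketch, not the classifier used in this exercise: the `points` and `labels` arrays, the function name `knn_label` and the choice of `k=2` are all assumptions introduced for the example.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: (F1, F2) pairs in Hz with made-up vowel labels
points = np.array([
    [300.0, 2300.0],
    [320.0, 2250.0],
    [310.0, 2200.0],
    [700.0, 1200.0],
    [720.0, 1100.0],
    [690.0, 1150.0],
])
labels = np.array(["i", "i", "i", "a", "a", "a"])

def knn_label(points, labels, row, k=2):
    # Euclidean distance from the chosen row to every row
    dists = np.sqrt(((points - points[row]) ** 2).sum(axis=1))
    # Exclude the point itself so it cannot vote for its own label
    dists[row] = np.inf
    # Indices of the k closest remaining points
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbours' labels
    return str(Counter(labels[nearest]).most_common(1)[0][0])

# Classify every point from its neighbours and measure the hit rate
predicted = [knn_label(points, labels, i) for i in range(len(points))]
accuracy = float(np.mean(np.array(predicted) == labels))
print(predicted, accuracy)
```

Setting the point's own distance to infinity is one way of leaving it out of the vote; the code cell in this part takes a different route to the same goal.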


In [None]:
# 
from math import sqrt

# 
def distance(df,row1,row2):
  return(sqrt((df.F1[row1]-df.F1[row2])**2+(df.F2[row1]-df.F2[row2])**2))

# 
def getneighbours(df,row,n=5):
  # 
  distances = []
  for i in range(len(df)):
    distances.append(distance(df,row,i))
  # 
  index=np.argsort(distances)
  # 
  neighbours = df.index.values[index[1:n+1]]
  # 
  return neighbours

# 
def vote(df,neighbours):
  # 
  counts=df.loc[neighbours,"VOWEL"].value_counts()
  # 
  return counts.index[0]

# 
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/vowels.csv")

# 
correct=0
total=0
for i in range(len(df)):
  # 
  neighbours=getneighbours(df,i)
  # 
  vowel=vote(df,neighbours)
  print(i,df.VOWEL[i],vowel)
  # 
  if (df.VOWEL[i]==vowel):
    correct += 1
  total += 1

# 
print("Correct = %d/%d (%.1f%%)" % (correct,total,100.0*correct/total))


---
(d) This code converts the F1 and F2 frequencies to z-scores for each speaker individually. Run the code then add comments describing what is happening in each step.

This code is rather inefficient - can you see why?
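
For reference, a z-score expresses each value as a number of standard deviations away from the mean. The sketch below works this through for one imaginary speaker; the values in `f1` are invented for illustration. Note that pandas `.std()` uses the sample standard deviation (`ddof=1`) by default whereas NumPy defaults to `ddof=0`, so `ddof=1` is passed explicitly here to mirror the pandas behaviour.

```python
import numpy as np

# Hypothetical F1 values (Hz) for a single speaker
f1 = np.array([300.0, 500.0, 700.0])

# Mean and sample standard deviation (ddof=1 mirrors the pandas default)
mean = f1.mean()
std = f1.std(ddof=1)

# z-score: distance from the speaker's mean in units of standard deviation
z = (f1 - mean) / std
print(mean, std, z)
```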

In [None]:
#
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/vowels.csv")

#
for i in range(len(df)):
  #
  spkr=df.SPEAKER[i]
  #
  dfs=df.loc[df.SPEAKER==spkr]
  #
  mnf1=dfs.F1.mean()
  sdf1=dfs.F1.std()
  mnf2=dfs.F2.mean()
  sdf2=dfs.F2.std()
  #
  df.at[i,"F1norm"]=(df.F1[i]-mnf1)/sdf1
  df.at[i,"F2norm"]=(df.F2[i]-mnf2)/sdf2

#
df.describe()

---
(e) This code also converts the F1 and F2 frequencies to z-scores but in a more efficient manner. Run the code and add comments describing what is happening in each step.

Why is this code more efficient?
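
A related pandas idiom, not used in the cell below, is `groupby(...).transform(...)`, which returns per-group statistics already aligned with the original rows. The following is a minimal sketch on an invented two-speaker table; the frame `toy` and its values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data: two speakers with three F1 measurements each
toy = pd.DataFrame({
    "SPEAKER": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "F1": [300.0, 500.0, 700.0, 400.0, 600.0, 800.0],
})

# Group the F1 column by speaker once
grouped = toy.groupby("SPEAKER")["F1"]

# transform() broadcasts each speaker's statistic back to that speaker's rows,
# so the z-scores are computed for the whole column in one expression
toy["F1norm"] = (toy["F1"] - grouped.transform("mean")) / grouped.transform("std")
print(toy)
```

Like the cell below, this computes each speaker's mean and standard deviation only once rather than once per row.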

In [None]:
#
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/vowels.csv")

#
means=df.groupby("SPEAKER")[["F1","F2"]].agg("mean")
stds=df.groupby("SPEAKER")[["F1","F2"]].agg("std")

#
F1mean=means.F1[df.SPEAKER].to_numpy()
F1std=stds.F1[df.SPEAKER].to_numpy()
F2mean=means.F2[df.SPEAKER].to_numpy()
F2std=stds.F2[df.SPEAKER].to_numpy()

#
df["F1norm"]=(df.F1-F1mean)/F1std
df["F2norm"]=(df.F2-F2mean)/F2std

#
df.describe()

---
(f) Write code to run the nearest neighbour classifier again using the normalised F1 and F2 data.

**Hint:** you will need to re-use code from block (c) but with the F1norm and F2norm values replacing the F1 and F2 values.

Why is performance better after normalisation?