
[![PALS0039 Logo](https://www.phon.ucl.ac.uk/courses/pals0039/images/pals0039logo.png)](https://www.phon.ucl.ac.uk/courses/pals0039/)

# Exercise 2.3 Vowel Classification Problem

In this exercise we implement a system to classify vowels from their formant frequencies. We first explore some characteristics of the data and then implement a simple k-nearest-neighbour classifier.

(a) The following code reads in, summarises and generates plots from a data set of vowel formant measurements. Run the code blocks and add comments to describe what is happening in each step.

In [None]:
#import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#read csv file into a var
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/vowels.csv")

# display first 20 rows of data
df.head(20)

In [None]:
# display descriptive stats
df.describe()

In [None]:
#define plot function to compare y values distribution by sex (x)
def plot_compare(data,ylabel):
  plt.boxplot(data,labels=("male","female"))
  plt.xlabel("Sex")
  plt.ylabel(ylabel)

# split the rows into male and female subsets
male=df.loc[df.SEX=="male"]
female=df.loc[df.SEX=="female"]

# set plot size
plt.figure(figsize=(16,5))

# set subplot location and content (compare f1 by sex)
plt.subplot(1,3,1)
plot_compare([male.F1,female.F1],"F1 (Hz)")

# subplot 2, same for f2
plt.subplot(1,3,2)
plot_compare([male.F2,female.F2],"F2 (Hz)")

# subplot 3 compares participants height by sex 
plt.subplot(1,3,3)
plot_compare([male.HEIGHT,female.HEIGHT],"Height (cm)")

# display plot
plt.show()


---
(b) This code plots an F1-F2 scatter plot in which different vowels are displayed in different colours. Run the code and then add comments to the code to describe what is happening in each step.


In [None]:
# convert the vowel series into categories
df["VOWEL"]=df.VOWEL.astype("category")
print(df.VOWEL.cat.categories)

# set up a series in which the vowels are stored as numbers 
df["VOWELIDX"]=df.VOWEL.cat.codes
print(df.VOWELIDX)

# set the plot size, choose scatterplot, add colour mapping
plt.figure(figsize=(10,10))
plt.scatter(df.F2,df.F1,c=df.VOWELIDX,cmap="tab10")
plt.axis([3000,500,1100,100])
plt.xlabel("F2 (Hz)")
plt.ylabel("F1 (Hz)")
plt.grid()
plt.show()

---
(c) This code builds a simple vowel classifier based on formant frequencies. It works by taking each vowel in turn, finding the 5 closest other vowels, and then selecting the label that occurs most often among those nearest vowels.

Run the code then add comments describing what is happening in each step.


In [None]:
# import the square root function
from math import sqrt

# calculate the euclidean distance between rows
def distance(df,row1,row2):
  return(sqrt((df.F1[row1]-df.F1[row2])**2+(df.F2[row1]-df.F2[row2])**2))

# get the nearest neighbours of a given row
def getneighbours(df,row,n=5):
  # get table of all the inter-row distances
  distances = []
  for i in range(len(df)):
    distances.append(distance(df,row,i))
  # list the indexes of the distances sorted by value
  index=np.argsort(distances)
  # choose the rows with the n nearest distances (excluding the original)
  neighbours = df.index.values[index[1:n+1]]
  # return the best rows
  return neighbours

# find the most frequently occurring vowel among the nearest neighbours
def vote(df,neighbours):
  # Return a series containing counts of unique values starting with most frequent
  # (get table of counts ordered by frequency)
  counts=df.loc[neighbours,"VOWEL"].value_counts()
  # return the index of the most frequent vowel
  return counts.index[0]

# read the vowel formant data 
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/vowels.csv")

# apply the classifier to the whole dataset
# (this is called: leave one out cross-validation)
correct=0
total=0
for i in range(len(df)):
  # for each row in df get the 5 nearest neighbours
  neighbours=getneighbours(df,i)
  # choose the most common vowel among the nearest neighbours
  vowel=vote(df,neighbours)
  # print row number, actual vowel, predicted vowel
  print(i,df.VOWEL[i],vowel)
  # if the chosen vowel is correct, keep tally
  if (df.VOWEL[i]==vowel):
    correct += 1
  total += 1

# report performance: 
# calculate the number of correctly classified vowels, total and success percentage
print("Correct = %d/%d (%.1f%%)" % (correct,total,100.0*correct/total))
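
As a rough illustration of the voting step, the sketch below runs the same leave-one-out nearest-neighbour idea on six hand-made tokens using NumPy broadcasting instead of a Python loop over rows. The data values, the function name `loo_knn_predict`, and the choice of `n=3` are hypothetical and are not part of the exercise code.

```python
import numpy as np

# Hypothetical toy data: six tokens of two vowels in (F1, F2) space, in Hz.
points = np.array([
    [300.0, 2300.0],   # "i"
    [320.0, 2250.0],   # "i"
    [310.0, 2400.0],   # "i"
    [700.0, 1200.0],   # "a"
    [720.0, 1250.0],   # "a"
    [690.0, 1100.0],   # "a"
])
labels = np.array(["i", "i", "i", "a", "a", "a"])

def loo_knn_predict(points, labels, n=3):
    """Leave-one-out n-nearest-neighbour prediction for every row."""
    # Pairwise Euclidean distances via broadcasting: shape (rows, rows).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # A token must not vote for itself, so push the diagonal out of reach.
    np.fill_diagonal(dist, np.inf)
    # Indices of the n closest other tokens for each row.
    nearest = np.argsort(dist, axis=1)[:, :n]
    predictions = []
    for row in nearest:
        # Count the labels among the neighbours and keep the most frequent.
        # Ties are broken alphabetically, which is an arbitrary choice.
        values, counts = np.unique(labels[row], return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)

predicted = loo_knn_predict(points, labels)
accuracy = (predicted == labels).mean()
print(predicted, accuracy)
```

Setting the diagonal to infinity excludes each token from its own neighbour list explicitly, rather than assuming that the token itself always sorts first.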


---
(d) This code converts the F1 and F2 frequencies to z-scores for each speaker individually. Run the code then add comments describing what is happening in each step.

This code is rather inefficient - can you see why?

In [None]:
#read in the vowel data
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/vowels.csv")

#for each row (vowel) in df:
for i in range(len(df)):
  #assign speaker's ID to a var
  spkr=df.SPEAKER[i]
  #assign subset data for this speaker into dfs var
  dfs=df.loc[df.SPEAKER==spkr]
  #get means and SDs for this speaker's F1 and F2
  mnf1=dfs.F1.mean()
  sdf1=dfs.F1.std()
  mnf2=dfs.F2.mean()
  sdf2=dfs.F2.std()
  #normalize F1 and F2 for this vowel
  #store the z-scores for this row in new F1norm and F2norm columns of df
  df.at[i,"F1norm"]=(df.F1[i]-mnf1)/sdf1
  df.at[i,"F2norm"]=(df.F2[i]-mnf2)/sdf2

#display descriptive stats
df.describe()

---
(e) This code also converts the F1 and F2 frequencies to z-scores but in a more efficient manner. Run the code and add comments describing what is happening in each step.

Why is this code more efficient?

In [None]:
#this code doesn't iterate through each row, instead first groups the data points
#by speaker id and maps the function over arrays

#assign vowel data into a var
df=pd.read_csv("https://www.phon.ucl.ac.uk/courses/pals0039/data/vowels.csv")

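#group each speaker's data, aggregate F1/F2 means and SDs into variables
#(select the numeric formant columns first: text columns such as VOWEL cannot be averaged)
means=df.groupby('SPEAKER')[['F1','F2']].agg("mean")
stds=df.groupby('SPEAKER')[['F1','F2']].agg("std")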

#convert means and SDs for the speaker's F1 and F2 into numpy arrays
#(replicate means and sds to one per vowel)
F1mean=means.F1[df.SPEAKER].to_numpy()
F1std=stds.F1[df.SPEAKER].to_numpy()
F2mean=means.F2[df.SPEAKER].to_numpy()
F2std=stds.F2[df.SPEAKER].to_numpy()

#create new column with normalized f1 and f2 (z-scores) 
# (process all vowels at the same time)
df["F1norm"]=(df.F1-F1mean)/F1std
df["F2norm"]=(df.F2-F2mean)/F2std

#display descriptive stats
df.describe()
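
For comparison, per-speaker z-scores can also be written with `groupby(...).transform(...)`, which applies a function within each group and returns values aligned with the original rows. This is only a sketch on hypothetical toy data that borrows the column names of `vowels.csv`. The helper name `zscore` is an assumption, not part of the exercise code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for vowels.csv (same column names).
toy = pd.DataFrame({
    "SPEAKER": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "F1": [300.0, 500.0, 700.0, 400.0, 600.0, 800.0],
    "F2": [2200.0, 1500.0, 1100.0, 2500.0, 1800.0, 1300.0],
})

def zscore(column):
    # std() uses the sample standard deviation (ddof=1) by default,
    # the same default used by the std calls in blocks (d) and (e).
    return (column - column.mean()) / column.std()

# transform() runs zscore separately for each speaker and keeps row order.
normed = toy.groupby("SPEAKER")[["F1", "F2"]].transform(zscore)
toy["F1norm"] = normed["F1"]
toy["F2norm"] = normed["F2"]

print(toy)
```

After this step each speaker's normalised formants have mean 0 and standard deviation 1, so the speakers share a common scale.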

(f) Write code to run the nearest neighbour classifier again using the normalised F1 and F2 data.

**Hint:** you will need to re-use code from block (c) but with the F1norm and F2norm values replacing the F1 and F2 values.

Why is performance better after normalisation?
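
One possible way to approach (f) is to pass the feature column names into the neighbour search, so the same functions work for raw and normalised formants. The sketch below is an assumption about how that could look, not the course's reference solution. The `features` parameter, the `evaluate` helper, and the toy data are hypothetical. The toy data is deliberately contrived (two speakers with the same vowel pattern at shifted frequencies, `n=1`) to show the effect, and its accuracies say nothing about the real data set.

```python
import numpy as np
import pandas as pd

def getneighbours(df, row, features, n=5):
    # Euclidean distance from the chosen row to every row, over `features`.
    values = df[features].to_numpy(dtype=float)
    distances = np.sqrt(((values - values[row]) ** 2).sum(axis=1))
    # Exclude the row itself explicitly.
    distances[row] = np.inf
    return df.index.values[np.argsort(distances)[:n]]

def vote(df, neighbours):
    # Most frequent vowel label among the neighbours.
    return df.loc[neighbours, "VOWEL"].value_counts().index[0]

def evaluate(df, features, n=5):
    # Leave-one-out accuracy as a fraction between 0 and 1.
    correct = 0
    for i in range(len(df)):
        if vote(df, getneighbours(df, i, features, n)) == df.VOWEL[i]:
            correct += 1
    return correct / len(df)

# Hypothetical, contrived data: speaker s2 has the same vowel pattern as s1
# but shifted, so raw formants of different vowels land close together.
toy = pd.DataFrame({
    "SPEAKER": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "VOWEL": ["i", "e", "a", "i", "e", "a"],
    "F1": [300.0, 500.0, 700.0, 500.0, 700.0, 900.0],
    "F2": [2000.0, 1700.0, 1400.0, 1750.0, 1450.0, 1150.0],
})

# Per-speaker z-scores, as in blocks (d) and (e).
normed = toy.groupby("SPEAKER")[["F1", "F2"]].transform(
    lambda c: (c - c.mean()) / c.std())
toy["F1norm"] = normed["F1"]
toy["F2norm"] = normed["F2"]

raw_acc = evaluate(toy, ["F1", "F2"], n=1)
norm_acc = evaluate(toy, ["F1norm", "F2norm"], n=1)
print("raw: %.2f, normalised: %.2f" % (raw_acc, norm_acc))
```

In this toy case the raw formants of one speaker's vowel sit next to a different vowel from the other speaker, while the z-scores place matching vowels from both speakers at the same point. With the real data, calling `evaluate(df, ["F1norm", "F2norm"])` after block (e) would use the default of 5 neighbours.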