# Speech emotion recognition using PyTorch - Prediction Pipeline

**James Morgan (jhmmorgan)**

_2022-04-20_

# 📖 Background

Humans are really good at understanding the emotion behind spoken words.  Does someone sound happy, or angry? Can we build an AI to do this and, if so, why would we want to when we're already so good at it?

We're good at understanding emotion because it's vital to our communication.  We react and respond differently to people depending on their emotional state, be it happy, sad, angry or another emotion.

Having an AI complete this task offers advantages that would be impractical for people to deliver.  Digital assistants (think Siri and Alexa) could understand the emotional intent of a request, and a business could analyse its calls to identify potential complaints or the general mood of its customers. You could also apply this to call center telephony routing, where the system picks up angry or upset customers and transfers them to a human to prevent further agitation.

AI also helps here because it can process audio in bulk and without human intervention.  Rather than having a person listen to a sample of calls, an AI can process a large volume with less intrusion on privacy, since no person needs to listen in, whilst reporting the general positive or negative mood of the business's customers.

### The Task
Our task is to create a model that can predict the emotion of spoken audio.  This workbook is the outcome of this task.  It uses a custom class that takes new audio as input and predicts its emotion or, if required, returns the probability of each emotion.

# 📚 Libraries
For this to work, the main library we need is our own audio_model utility, loaded here alongside a local utils module and IPython's Audio widget for playback.
The audio_model utility itself relies on a number of other libraries.

In [1]:
from utils import *
from audio_model import *
from IPython.display import Audio # To play audio files

# 🎶 Speech emotion prediction
To predict the emotion of any given audio, we create an instance of our class **speech_prediction()** and pass the path of an audio file to its **predict()** method.

We can also provide it with the optional argument **return_all = True**, which returns the probability of each emotion rather than a single predicted emotion.

In [2]:
sp = speech_prediction()
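
The internals of **speech_prediction()** live in audio_model and are not shown in this workbook. As a rough sketch of what a **return_all**-style flag usually switches between, the example below assumes the model produces one raw score (logit) per emotion and converts those scores to probabilities with a softmax. The names `predict_from_logits` and `LABELS`, the example scores, and the placeholder label names are all made up for illustration; only the positions of "happy" (index 2) and "disgust" (index 6) are inferred from the outputs printed in this workbook.

```python
import numpy as np

# Hypothetical label order: the real order used by audio_model is not shown here.
LABELS = ["emotion_0", "emotion_1", "happy", "emotion_3",
          "emotion_4", "emotion_5", "disgust", "emotion_7"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def predict_from_logits(logits, return_all=False):
    """Return the most likely label, or every probability if return_all is True."""
    probs = softmax(np.asarray(logits, dtype=float))
    if return_all:
        return np.round(probs, 3)
    return LABELS[int(np.argmax(probs))]


# Made-up scores, for illustration only
example_logits = [0.1, -1.2, 4.0, 0.3, -0.5, 0.0, 2.5, -2.0]
print(predict_from_logits(example_logits))
print(predict_from_logits(example_logits, return_all=True))
```

The single-label answer is simply the position of the largest probability, so the two calls always agree with each other.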

---

In [3]:
print("Audio One: Happy")
audio_one = "new recordings/happy.wav"
Audio(audio_one)

Audio One: Happy


In [4]:
audio_one_pred = sp.predict(audio_one)
audio_one_prob = sp.predict(audio_one, return_all=True)

print(f"Predicted emotion is: {audio_one_pred}")
print(f"Probabilities of all emotions are: {audio_one_prob}")

Predicted emotion is: happy
Probabilities of all emotions are: [0. 0. 1. 0. 0. 0. 0. 0.]


---

In [5]:
print("Audio Two: Happy (Amused)")
audio_two = "new recordings/amused.wav"
Audio(audio_two)

Audio Two: Happy (Amused)


In [6]:
audio_two_pred = sp.predict(audio_two)
audio_two_prob = sp.predict(audio_two, return_all=True)

print(f"Predicted emotion is: {audio_two_pred}")
print(f"Probabilities of all emotions are: {audio_two_prob}")

Predicted emotion is: happy
Probabilities of all emotions are: [0.    0.    0.999 0.001 0.    0.    0.    0.   ]


---

In [7]:
print("Audio Three: Anger")
audio_three = "new recordings/anger.wav"
Audio(audio_three)

Audio Three: Anger


In [8]:
audio_three_pred = sp.predict(audio_three)
audio_three_prob = sp.predict(audio_three, return_all=True)

print(f"Predicted emotion is: {audio_three_pred}")
print(f"Probabilities of all emotions are: {audio_three_prob}")

Predicted emotion is: disgust
Probabilities of all emotions are: [0.    0.    0.416 0.048 0.02  0.    0.516 0.   ]
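
A bare probability vector is hard to read at a glance. One way to make it easier, sketched below, is to pair each probability with a label and list the top few. The label order is an assumption: only index 2 ("happy") and index 6 ("disgust") can be inferred from the predictions in this workbook, so the remaining names are placeholders, and `top_k` is a hypothetical helper rather than part of audio_model.

```python
import numpy as np

# Assumed label order: only "happy" (index 2) and "disgust" (index 6) are
# inferred from the outputs in this workbook; the other names are placeholders.
LABELS = ["emotion_0", "emotion_1", "happy", "emotion_3",
          "emotion_4", "emotion_5", "disgust", "emotion_7"]


def top_k(probs, labels, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(labels[i], float(probs[i])) for i in order]


# Probability vector printed for audio three
audio_three_vector = [0.0, 0.0, 0.416, 0.048, 0.02, 0.0, 0.516, 0.0]

for label, p in top_k(audio_three_vector, LABELS):
    print(f"{label:>10}: {p:.1%}")
```

Listing the runner-up alongside the winner makes it clear that "disgust" only narrowly beat "happy" for this recording.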


---

In [9]:
print("Audio Four: Disgust")
audio_four = "new recordings/disgust.wav"
Audio(audio_four)

Audio Four: Disgust


In [10]:
audio_four_pred = sp.predict(audio_four)
audio_four_prob = sp.predict(audio_four, return_all=True)

print(f"Predicted emotion is: {audio_four_pred}")
print(f"Probabilities of all emotions are: {audio_four_prob}")

Predicted emotion is: happy
Probabilities of all emotions are: [0.    0.    0.608 0.    0.    0.    0.392 0.   ]


---

### Summary
As you can see, when given new audio, the model performs reasonably but does make mistakes.  In audio three, it predicts disgust (51.6%) for an angry recording, with happy close behind at 41.6%.  In the last prediction, it predicts happy (60.8%) and is only 39.2% confident the emotion is disgust.

The model is limited because it was trained on audio covering only two spoken statements.  If we trained it on more data, such as the EmoV-DB audio set, which contains a large range of different statements, it would hopefully perform much better.
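
Until the model is retrained, one possible safeguard is to flag predictions where the top two probabilities are close together. The sketch below applies that idea to the four probability vectors printed in this workbook. The helper `top_two_margin` and the `MARGIN_THRESHOLD` value of 0.5 are assumptions chosen for illustration, not part of audio_model, and a real threshold would need tuning on held-out data.

```python
import numpy as np

# Probability vectors printed earlier in this workbook
results = {
    "audio_one": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "audio_two": [0.0, 0.0, 0.999, 0.001, 0.0, 0.0, 0.0, 0.0],
    "audio_three": [0.0, 0.0, 0.416, 0.048, 0.02, 0.0, 0.516, 0.0],
    "audio_four": [0.0, 0.0, 0.608, 0.0, 0.0, 0.0, 0.392, 0.0],
}

MARGIN_THRESHOLD = 0.5  # hypothetical cut-off; tune on held-out data


def top_two_margin(probs):
    """Difference between the highest and second-highest probability."""
    ordered = np.sort(np.asarray(probs, dtype=float))[::-1]
    return float(ordered[0] - ordered[1])


for name, probs in results.items():
    margin = top_two_margin(probs)
    status = "uncertain" if margin < MARGIN_THRESHOLD else "confident"
    print(f"{name}: margin={margin:.3f} -> {status}")
```

With this threshold, audio three and audio four are flagged as uncertain, which matches the two recordings the model got wrong. Four samples are far too few to conclude the rule generalises.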