**Overview**

In this project I am developing a system that detects human emotion from a person's voice.

I am using a dataset of voice recordings from a number of people. Several annotators have labelled each recording with the emotion that its tone and pitch convey, e.g. anger or happiness.

Just as with images, these recordings are processed to extract features that a model can understand and use for classification.

The extracted features include MFCCs (Mel-Frequency Cepstral Coefficients), which compactly describe the short-term power spectrum of a sound, as well as chroma and the mel spectrogram (a spectrogram on the mel frequency scale).
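
To make the "mel" in these feature names more concrete, here is a minimal sketch of one common Hz-to-mel conversion, the HTK-style formula. It is an illustration only: librosa uses a different (Slaney-style) variant by default, and the names `hz_to_mel` and `mel_to_hz` below are defined here rather than imported from librosa.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz cover fewer and fewer mels as frequency increases,
# which mirrors how pitch differences are perceived.
for f in (100, 1000, 4000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

The mel spectrogram and the MFCCs both start from a spectrum whose frequency axis has been warped in this way.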

The zero crossing rate is the rate of sign-changes along a signal, i.e., the rate at which the signal changes from positive to negative or back. This feature has been used heavily in both speech recognition and music information retrieval. It usually has higher values for highly percussive sounds like those in metal and rock.
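
The definition above can be sketched in a few lines of plain Python. This is a whole-signal illustration under simplifying assumptions (zero is treated as positive); librosa computes the rate frame by frame, so its numbers will not match this helper exactly. The function name and the two toy signals are made up for the example.

```python
def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(signal) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(signal) - 1)

smooth = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]   # a single sign change
noisy = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]    # sign flips at every step

print(zero_crossing_rate(smooth))
print(zero_crossing_rate(noisy))
```

The rapidly alternating signal has a much higher rate than the smooth one, which is why noisy or percussive sounds score high on this feature.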

The spectral centroid indicates where the "centre of mass" of a sound's spectrum is located and is calculated as the weighted mean of the frequencies present in the sound. If the frequency content of a piece of music stays the same throughout, the centroid stays around one value over time; if high frequencies dominate towards the end of the sound, the centroid rises towards the end.
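
As a rough sketch of the weighted mean, the helper below takes one frame's frequency bins and their magnitudes. The function name and the toy spectra are assumptions for illustration; librosa derives the bins and magnitudes from a short-time Fourier transform.

```python
def spectral_centroid(freqs_hz, magnitudes):
    """Magnitude-weighted mean frequency of one spectrum frame."""
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs_hz, magnitudes)) / total

freqs = [100.0, 200.0, 300.0, 400.0]
low_heavy = [4.0, 3.0, 2.0, 1.0]    # most magnitude in the low bins
high_heavy = [1.0, 2.0, 3.0, 4.0]   # most magnitude in the high bins

print(spectral_centroid(freqs, low_heavy))
print(spectral_centroid(freqs, high_heavy))
```

Shifting magnitude towards the higher bins pulls the centroid upwards, which is why the feature is often described as a measure of brightness.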

Spectral rolloff is the frequency below which a specified percentage of the total spectral energy, e.g. 85%, lies.
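
A minimal sketch of that definition for a single frame, assuming the per-bin energies are already available. The function name, the `roll_percent` parameter name and the toy values are illustrative choices, not librosa internals.

```python
def spectral_rolloff(freqs_hz, energies, roll_percent=0.85):
    """Lowest frequency at which the cumulative energy reaches
    roll_percent of the total energy."""
    threshold = roll_percent * sum(energies)
    cumulative = 0.0
    for f, e in zip(freqs_hz, energies):
        cumulative += e
        if cumulative >= threshold:
            return f
    return freqs_hz[-1] if freqs_hz else 0.0

freqs = [100.0, 200.0, 300.0, 400.0, 500.0]
energies = [5.0, 3.0, 1.0, 0.5, 0.5]   # total energy = 10

print(spectral_rolloff(freqs, energies))        # 85% threshold
print(spectral_rolloff(freqs, energies, 0.5))   # 50% threshold
```

With these values, 85% of the energy is reached at the 300 Hz bin, while 50% is already reached at the first bin.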

**Imports**

In [4]:
import librosa
import soundfile
import os, glob, pickle
import numpy as np
from sklearn.neural_network import MLPClassifier

In [5]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import MaxAbsScaler

In [6]:
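def get_prediction(classifier, X_train, X_test, y_train, y_test):
    """Fit the classifier, then print its test accuracy, confusion matrix and classification report."""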
    model = classifier.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(" Accuarcy: {}".format(round(accuracy_score(y_test, y_pred)*100,2)))
    cm = confusion_matrix(y_test, y_pred)
    print(" Confusion Matrix: \n", cm)
    print(" Classification Report: \n", classification_report(y_test, y_pred))
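
To show how the three printed outputs relate to each other, the sketch below recomputes accuracy, precision, recall and F1 from a confusion matrix by hand. It uses the LogisticRegression matrix reported further down in this notebook and assumes scikit-learn's layout (rows are true classes, columns are predicted classes, labels sorted alphabetically). The helper name `per_class_metrics` and the majority-class baseline are additions for illustration.

```python
def per_class_metrics(cm, labels):
    """Precision, recall and F1 per class from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)   # column sum
        actual = sum(cm[i])                     # row sum (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

cm = [[35, 15], [12, 34]]
labels = ["fearful", "happy"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
# Accuracy of always predicting the most frequent true class
baseline = max(sum(row) for row in cm) / total
metrics = per_class_metrics(cm, labels)

print("accuracy:", round(accuracy * 100, 2), "baseline:", round(baseline * 100, 2))
for label, (p, r, f1) in metrics.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2))
```

Comparing each classifier's accuracy with the majority-class baseline is a quick way to spot models that do no better than always predicting one class.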

In [7]:
def extract_feature(file_name, mfcc, chroma, mel, contrast, tonnetz, zero_crossings,
                    spectral_centroids, spectral_rolloff, spectral_bandwidth):
    """
    Extract the requested features from one audio file.

    Each boolean flag switches one feature on or off. Every feature is averaged
    over time, and the results are concatenated into a single 1-D vector.
    """
    # librosa.load opens and decodes the file itself, so no separate file handle is needed
    X, sample_rate = librosa.load(file_name)
    result = np.array([])
    if chroma or contrast:
        # Magnitude spectrogram shared by the chroma and contrast features
        stft = np.abs(librosa.stft(X))
    if mfcc:
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        chroma_mean = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, chroma_mean))
    if mel:
        # The audio is passed as y=...; recent librosa versions reject it as a positional argument
        mel_mean = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
        result = np.hstack((result, mel_mean))
    if contrast:
        contrast_mean = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T, axis=0)
        result = np.hstack((result, contrast_mean))
    if tonnetz:
        tonnetz_mean = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T, axis=0)
        result = np.hstack((result, tonnetz_mean))
    # The remaining features are each reduced to a single scalar (mean over all frames)
    if zero_crossings:
        result = np.hstack((result, np.mean(librosa.feature.zero_crossing_rate(y=X))))
    if spectral_centroids:
        result = np.hstack((result, np.mean(librosa.feature.spectral_centroid(y=X, sr=sample_rate))))
    if spectral_rolloff:
        result = np.hstack((result, np.mean(librosa.feature.spectral_rolloff(y=X, sr=sample_rate))))
    if spectral_bandwidth:
        result = np.hstack((result, np.mean(librosa.feature.spectral_bandwidth(y=X, sr=sample_rate))))
    return result

In [8]:
# Emotions in the RAVDESS dataset
emotions={
  '01':'neutral',
  '02':'calm',
  '03':'happy',
  '04':'sad',
  '05':'angry',
  '06':'fearful',
  '07':'disgust',
  '08':'surprised'
}
#DataFlair - Emotions to observe
observed_emotions=[ 'happy', 'fearful']

In [9]:
def load_data(test_size=0.2):
    """
    Load the audio files, extract the features of each file,
    and return a train/test split of features and labels.
    """
    x, y = [], []
    for file in glob.glob("archive/audio_speech_actors_01-24/Actor_*/*.wav"):
        file_name = os.path.basename(file)
        # The third dash-separated field of the file name is the emotion code
        emotion = emotions[file_name.split("-")[2]]
        if emotion not in observed_emotions:
            continue
        feature = extract_feature(file, mfcc=True, chroma=True, mel=True, contrast=True,
                                  tonnetz=True, zero_crossings=True, spectral_centroids=False,
                                  spectral_rolloff=True, spectral_bandwidth=True)
        x.append(feature)
        y.append(emotion)
    # Split the dataset into training and test sets
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)
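
The label lookup in `load_data` relies on the RAVDESS file naming scheme, in which the third dash-separated field is the emotion code. The sketch below isolates that step. The example path is hypothetical (it only follows the naming pattern), and `emotion_from_path` is a helper introduced here for illustration.

```python
import os

# Same code-to-label mapping as the `emotions` dict in this notebook
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def emotion_from_path(path):
    """Return the emotion label encoded in a RAVDESS-style file name."""
    file_name = os.path.basename(path)          # drop the directories
    stem = os.path.splitext(file_name)[0]       # drop the ".wav" extension
    fields = stem.split("-")
    return EMOTIONS[fields[2]]                  # third field = emotion code

# Hypothetical path that follows the dataset's naming pattern
example = "archive/audio_speech_actors_01-24/Actor_12/03-01-06-01-02-01-12.wav"
print(emotion_from_path(example))
```

Taking the base name first matters here, because the directory name `audio_speech_actors_01-24` also contains a dash and would otherwise shift the field positions.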

In [10]:
X_train,X_test,y_train,y_test=load_data(test_size=0.25)

In [11]:
print((X_train.shape[0], X_test.shape[0]))  # Number of training and test samples

(288, 96)
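
The two numbers above add up to 384 samples, split 75/25. As a small sanity check, the sketch below reproduces the sizes, assuming the test set is rounded up to a whole number of samples and the training set receives the rest (this mirrors how `train_test_split` handles a fractional `test_size`). The helper `split_sizes` is introduced here for illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(384, 0.25))
```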


In [12]:
print(f'Features extracted: {X_train.shape[1]}')  # Number of features extracted per sound file

Features extracted: 196
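
The figure of 196 can be checked by adding up the length of each enabled feature. The sizes below assume librosa's default settings apart from `n_mfcc=40`, which is set explicitly in `extract_feature`; `spectral_centroids` is left out because it is disabled in `load_data`.

```python
# Length of each feature vector after averaging over time
feature_sizes = {
    "mfcc": 40,                # n_mfcc=40 in extract_feature
    "chroma": 12,              # 12 pitch classes (librosa default)
    "mel": 128,                # 128 mel bands (librosa default)
    "contrast": 7,             # 6 frequency bands + 1 (librosa default)
    "tonnetz": 6,              # 6 tonal centroid dimensions
    "zero_crossings": 1,       # scalar mean
    "spectral_rolloff": 1,     # scalar mean
    "spectral_bandwidth": 1,   # scalar mean
}

total = sum(feature_sizes.values())
print(total)
```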


**Scaling**

In [None]:
# Scale each feature by its maximum absolute value (fitted on the training set only)
sc = MaxAbsScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
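
To illustrate what this scaling step does, here is a plain-Python sketch of max-abs scaling: each column is divided by the largest absolute value seen in the training data, and the same divisors are reused for the test data. The helpers `fit_max_abs` and `transform` and the toy numbers are assumptions for illustration, not scikit-learn code.

```python
def fit_max_abs(rows):
    """Per-column maximum absolute value (1.0 for all-zero columns)."""
    n_cols = len(rows[0])
    scales = []
    for j in range(n_cols):
        m = max(abs(row[j]) for row in rows)
        scales.append(m if m != 0 else 1.0)
    return scales

def transform(rows, scales):
    """Divide every value by the scale of its column."""
    return [[v / s for v, s in zip(row, scales)] for row in rows]

train = [[2.0, -100.0], [-4.0, 50.0]]
test = [[1.0, 200.0]]

scales = fit_max_abs(train)              # learned from training data only
train_scaled = transform(train, scales)
test_scaled = transform(test, scales)

print(scales)
print(train_scaled)
print(test_scaled)
```

Training values end up in the range [-1, 1], while test values can fall outside it when they exceed anything seen during fitting, as the second test feature does here.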

**Classification**

In [14]:
classifiers = [LogisticRegression(), SGDClassifier(), BernoulliNB(), LinearSVC(),
              KNeighborsClassifier(n_neighbors=5), DecisionTreeClassifier(), GradientBoostingClassifier(), 
               RandomForestClassifier(), XGBClassifier(),MLPClassifier()]

for classifier in classifiers:
    print("\n\n", classifier)
    get_prediction(classifier, X_train, X_test, y_train, y_test)



 LogisticRegression()
 Accuarcy: 71.88
 Confusion Matrix: 
 [[35 15]
 [12 34]]
 Classification Report: 
               precision    recall  f1-score   support

     fearful       0.74      0.70      0.72        50
       happy       0.69      0.74      0.72        46

    accuracy                           0.72        96
   macro avg       0.72      0.72      0.72        96
weighted avg       0.72      0.72      0.72        96



 SGDClassifier()
 Accuarcy: 48.96
 Confusion Matrix: 
 [[ 1 49]
 [ 0 46]]
 Classification Report: 
               precision    recall  f1-score   support

     fearful       1.00      0.02      0.04        50
       happy       0.48      1.00      0.65        46

    accuracy                           0.49        96
   macro avg       0.74      0.51      0.35        96
weighted avg       0.75      0.49      0.33        96



 BernoulliNB()
 Accuarcy: 53.12
 Confusion Matrix: 
 [[25 25]
 [20 26]]
 Classification Report: 
               precision    recall  f1

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression



 LinearSVC()
 Accuarcy: 47.92
 Confusion Matrix: 
 [[ 0 50]
 [ 0 46]]
 Classification Report: 
               precision    recall  f1-score   support

     fearful       0.00      0.00      0.00        50
       happy       0.48      1.00      0.65        46

    accuracy                           0.48        96
   macro avg       0.24      0.50      0.32        96
weighted avg       0.23      0.48      0.31        96



 KNeighborsClassifier()
 Accuarcy: 48.96
 Confusion Matrix: 
 [[24 26]
 [23 23]]
 Classification Report: 
               precision    recall  f1-score   support

     fearful       0.51      0.48      0.49        50
       happy       0.47      0.50      0.48        46

    accuracy                           0.49        96
   macro avg       0.49      0.49      0.49        96
weighted avg       0.49      0.49      0.49        96



 DecisionTreeClassifier()
 Accuarcy: 69.79
 Confusion Matrix: 
 [[38 12]
 [17 29]]
 Classification Report: 
               precision    recall  f1-score