In [1]:
!pip install librosa

Collecting librosa
  Downloading librosa-0.11.0-py3-none-any.whl.metadata (8.7 kB)
Collecting audioread>=2.1.9 (from librosa)
  Downloading audioread-3.1.0-py3-none-any.whl.metadata (9.0 kB)
Collecting soundfile>=0.12.1 (from librosa)
  Downloading soundfile-0.13.1-py2.py3-none-win_amd64.whl.metadata (16 kB)
Collecting pooch>=1.1 (from librosa)
  Downloading pooch-1.8.2-py3-none-any.whl.metadata (10 kB)
Collecting soxr>=0.3.2 (from librosa)
  Downloading soxr-1.0.0-cp39-cp39-win_amd64.whl.metadata (5.6 kB)
Collecting msgpack>=1.0 (from librosa)
  Downloading msgpack-1.1.2-cp39-cp39-win_amd64.whl.metadata (8.4 kB)
Downloading librosa-0.11.0-py3-none-any.whl (260 kB)
Downloading audioread-3.1.0-py3-none-any.whl (23 kB)
Downloading msgpack-1.1.2-cp39-cp39-win_amd64.whl (71 kB)
Downloading pooch-1.8.2-py3-none-any.whl (64 kB)
Downloading soundfile-0.13.1-py2.py3-none-win_amd64.whl (1.0 MB)

#### Librosa is a Python library for audio analysis.

In [9]:
import librosa 
import numpy as np

#### The dataset used in this experiment is the RAVDESS Emotional Speech Audio dataset, which contains 1440 speech audio files covering 8 emotions: neutral, calm, happy, sad, angry, fearful, surprised and disgust.

In [13]:
file_path = r"C:\Users\ksksw\OneDrive\Desktop\VIT\Sem 2\EdgeIntelligence\Lab1\Actor_23\03-01-07-02-02-01-23.wav"

In [14]:
audio_data, sample_rate = librosa.load(file_path, sr=None)

#### Sample rate is the number of samples taken per second from a continuous signal (like sound) to convert it into a digital format.
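
As a small illustration of what the sample rate means in practice, the sketch below builds a synthetic tone (not the RAVDESS file) and recovers its duration from the sample count. The 16000 Hz rate, the 0.5 s length and the 440 Hz tone are assumptions chosen for the example. Note that `librosa.load(..., sr=None)` keeps the file's native rate, so `sample_rate` above is whatever the recording was stored at.

```python
import numpy as np

# Assumed values, for illustration only (not taken from the dataset).
sample_rate = 16000      # samples per second (Hz)
duration_s = 0.5         # length of the synthetic signal in seconds

# Build a 440 Hz sine tone sampled at `sample_rate`.
t = np.arange(int(sample_rate * duration_s)) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

# Duration in seconds = number of samples / sample rate.
n_samples = len(signal)
recovered_duration = n_samples / sample_rate
print(n_samples, recovered_duration)
```

The same division, `len(audio_data) / sample_rate`, gives the duration of the loaded clip.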

In [15]:
mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate)

#### MFCCs (Mel-Frequency Cepstral Coefficients) are compact numerical features that summarize the short-term spectral envelope of audio on the mel scale, a frequency scale designed to approximate how humans perceive pitch.

In [16]:
print(mfccs)

[[-756.04144 -756.04144 -756.04144 ... -756.04144 -756.04144 -756.04144]
 [   0.         0.         0.      ...    0.         0.         0.     ]
 [   0.         0.         0.      ...    0.         0.         0.     ]
 ...
 [   0.         0.         0.      ...    0.         0.         0.     ]
 [   0.         0.         0.      ...    0.         0.         0.     ]
 [   0.         0.         0.      ...    0.         0.         0.     ]]
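
One plausible reading of the printed corners (a constant first row and zeros elsewhere) is that the first and last frames are near-silent, so their log-mel energies sit at a flat floor. The sketch below is a hand-rolled orthonormal DCT-II, the kind of transform typically used as the last MFCC step; it is not librosa's internal code. The 128 mel bands and the `floor_db` level are assumptions for illustration.

```python
import numpy as np

def dct2_ortho(x):
    """Orthonormal DCT-II of a 1D array (hand-rolled for illustration)."""
    x = np.asarray(x, dtype=float)
    n_bands = x.shape[0]
    n = np.arange(n_bands)               # band index
    k = n.reshape(-1, 1)                 # coefficient index (column vector)
    basis = np.cos(np.pi * (n + 0.5) * k / n_bands)
    coeffs = np.sqrt(2.0 / n_bands) * (basis @ x)
    coeffs[0] /= np.sqrt(2.0)            # extra scaling for the 0th term
    return coeffs

n_mels = 128        # assumed number of mel bands
floor_db = -66.8    # hypothetical flat log-mel level of a silent frame

silent_frame = np.full(n_mels, floor_db)
coeffs = dct2_ortho(silent_frame)[:20]   # keep 20 coefficients

print(coeffs[0])                  # equals floor_db * sqrt(n_mels)
print(np.abs(coeffs[1:]).max())   # numerically zero
```

For a flat input, only coefficient 0 is non-zero, which matches the pattern in the printed frames.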


#### We get a 2D MFCC matrix with one row per coefficient and one column per time frame. To convert it into a 1D vector, we take the mean of each row, reducing each coefficient's time trajectory to a single representative value.

In [17]:
mfccs_processed = np.mean(mfccs.T, axis=0)
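
To make the averaging step concrete, here is a minimal sketch on a toy matrix (2 coefficients by 3 frames, values invented for illustration). It shows that transposing and averaging over `axis=0` is the same as averaging each row of the original matrix.

```python
import numpy as np

# Toy stand-in for an MFCC matrix: rows = coefficients, columns = frames.
toy = np.array([[1.0, 2.0, 3.0],
                [10.0, 20.0, 30.0]])

# Same pattern as in the notebook: transpose, then average over frames.
mean_via_transpose = np.mean(toy.T, axis=0)

# Equivalent form: average each row directly.
mean_via_rows = toy.mean(axis=1)

print(mean_via_transpose)  # one value per coefficient
print(mean_via_rows)
```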

In [19]:
print("Audio-->Vector: ",mfccs_processed)

Audio-->Vector:  [-598.6879       58.09425      -0.7908758    11.595831     -1.9955281
   11.347894     -4.6325517     1.7604246    -7.0438576    -3.06213
   -8.477985     -5.2278395    -0.59940976   -3.1489127    -5.64305
   -1.5659062    -3.3292205    -2.1775796    -1.486551     -1.229761  ]


### The vector holds 20 MFCC features, numbered MFCC 0 to MFCC 19 (20 is the default number of coefficients in `librosa.feature.mfcc`).
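
A small sketch for pairing each coefficient index with its value, rounded to two decimals. The numbers are copied from the printed vector above; the `labels` name is just an illustrative choice.

```python
import numpy as np

# Values copied from the printed "Audio-->Vector" output.
mfccs_processed = np.array([
    -598.6879, 58.09425, -0.7908758, 11.595831, -1.9955281,
    11.347894, -4.6325517, 1.7604246, -7.0438576, -3.06213,
    -8.477985, -5.2278395, -0.59940976, -3.1489127, -5.64305,
    -1.5659062, -3.3292205, -2.1775796, -1.486551, -1.229761,
])

# Label each coefficient with its index and a two-decimal rounded value.
labels = [f"MFCC {i}: {v:.2f}" for i, v in enumerate(mfccs_processed)]
for line in labels:
    print(line)
```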

#### MFCC 0 : -598.69
Represents the overall log-energy of the signal; a large negative value indicates low energy.

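#### MFCC 1 : 58.09
Relates to spectral tilt (the balance between low- and high-frequency energy). A larger positive MFCC 1 means more energy in the low mel bands relative to the high ones, i.e. a darker, less bright sound; a smaller or negative value indicates a brighter sound.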

#### MFCC 2 : -0.79
Represents the curvature of the spectral envelope; it describes how the mid-frequency energy is distributed relative to the low and high ends.

#### MFCC 3 : 11.60
Represents formant-related variation, often connected to vowel shape or resonance in speech.

#### MFCC 4 : -2.00
The lower-order MFCCs (4–5) still describe the broad spectral shape. A negative value suggests a dip in certain mid frequencies.

#### MFCC 5 : 11.35
A positive value suggests a slight rising curvature in the mid-high frequencies.

#### MFCC 6 : -4.63  
#### MFCC 7 : 1.76
#### MFCC 8 : -7.04
These mid-order MFCCs (6–8) represent finer spectral variations such as roughness, nasality, harmonic structure, and subtle timbre differences.

Negative values -> dips or soft regions in those frequency ranges.
Positive values -> peaks.

#### MFCC 9 : -3.06
#### MFCC 10 : -8.48
#### MFCC 11 : -5.23

These coefficients express even finer timbral variation. They don't map to a single physical property, but broadly reflect how jagged or smooth the frequency spectrum is.

High magnitude -> more variation / textured sound
Small magnitude -> smoother / duller sound

#### MFCC 12 : -0.60

#### MFCC 13 : -3.15

#### MFCC 14 : -5.64

#### MFCC 15 : -1.57

#### MFCC 16 : -3.33

#### MFCC 17 : -2.18

#### MFCC 18 : -1.49

#### MFCC 19 : -1.23

These high-order MFCCs capture tiny fluctuations in the spectral envelope, fine-grained texture, noise characteristics, and micro-variations. They are rarely human-interpretable individually.

### Summary 
We loaded the audio with librosa, extracted 20 MFCC coefficients per time frame, and averaged them over time to create a single 20-dimensional feature vector. Each coefficient represents a different aspect of the sound's spectral shape and texture.