<a href="https://colab.research.google.com/github/mraskj/css_fall2023/blob/main/code/class09/class09-solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class 09: Audio Basics - Solution

In this exercise, we explore how to work with audio in Python. We'll learn how to plot waveforms, how to convert audio from the time domain to the frequency domain, and how to convert audio into spectrograms.



## Setup

We start by:

1. Cloning the course GitHub repo
2. Importing necessary modules





### 0.1 Cloning GitHub Repository

In [None]:
# Clone the course GitHub repository into the current working directory
!git clone https://github.com/mraskj/css_fall2023.git

### 0.2 Importing Modules

In [None]:
# MODULES

# For file and directory management
import os

# For data handling
import numpy as np

# For plotting
import matplotlib.pyplot as plt

# For signal processing
import scipy
import scipy.signal  # imported explicitly because scipy.signal.resample is used below
import librosa
from scipy.io import wavfile

## Exercise 1: Reading and Writing Audio Files

### Exercise 1.0: Reading Audio

1. Read in one of the audio files from *data/audio/class09/*. You decide which one.
2. After reading the file, inspect the characteristics of the audio signal (e.g. number of samples, sampling rate, duration, max and min values, and so on)

[https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.read.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.read.html)

### Solution 1.0

In [None]:
# Define function that prints various characteristics of the audio signal.
def describe_audio(signal, sr):
  """
  Analyze and describe characteristics of an audio signal.

  Parameters:
  signal (numpy.ndarray): The input audio signal to be analyzed.
  sr (int): The sample rate (in Hz) of the input signal.

  Returns:
  dict:
      - 'n_samples': The length of the signal.
      - 'sr': The sample rate.
      - 'duration': The duration of the signal in seconds.
      - 'bit_depth': The data type of the signal.
      - 'max_signal': The maximum amplitude in the signal.
      - 'min_signal': The minimum amplitude in the signal.
      - 'mean_signal': The mean amplitude of the signal.
      - 'std_signal': The standard deviation of the amplitudes in the signal.
  """

  print(f"Length of signal: {len(signal)}")
  print(f"Sampling rate of the signal: {sr}")
  print(f"Duration (s): {len(signal)/sr}")
  print(f"Bit depth: {signal.dtype}")
  print(f"Max amplitude: {np.max(signal)}")
  print(f"Min amplitude: {np.min(signal)}")
  print(f"Mean amplitude: {np.mean(signal)}")
  print(f"Std amplitude: {np.std(signal)}")

  return {'n_samples': len(signal),
          'sr': sr,
          'duration': len(signal)/sr,
          'bit_depth': signal.dtype,
          'max_signal': np.max(signal),
          'min_signal': np.min(signal),
          'mean_signal': np.mean(signal),
          'std_signal': np.std(signal)}

In [None]:
# Define base directory
base_dir = os.path.join(os.getcwd(), 'css_fall2023/data/audio/class09')
print(f"Directory: {base_dir}")

# Define audio filepath
fname = 'speaker0_q90'
audio_fpath = os.path.join(base_dir, fname + '.wav')
print(f"Audio filepath: {audio_fpath}")

In [None]:
# Read audio file using the read function from the scipy.io.wavfile module
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.read.html
sr, signal = wavfile.read(audio_fpath)

In [None]:
# Inspect the audio signal
audio_info = describe_audio(signal, sr)

### Exercise 1.1: Writing Audio

Audio signals can easily be saved to disk using the `wavfile.write` function. Define an audio signal of your choice (e.g., a sine wave) with frequency $f$, length $l$ (i.e., duration in seconds), and sampling rate $sr$, and write it to a WAV file. You decide the values, but I recommend a frequency $f$ in the range $[50, 100]$ Hz and a length $l$ in the range $[0.0, 5.0]$ seconds.


[https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.write.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.write.html)

### Solution 1.1

In [None]:
# Define sampling rate
sr = 16000

# Define frequency
f = 100

# Define duration
length = 1.0

# Construct time samples: sr * length points, so the signal lasts exactly `length` seconds
t = np.linspace(0, length, int(sr * length), endpoint=False)

# Construct amplitude
amplitude = np.iinfo(np.int16).max

# Construct signal
data = amplitude * np.sin(2 * np.pi * f * t)

# Write audio file
wavfile.write("/content/writing_example.wav", sr, data.astype(np.int16))
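One way to check a written file is to read it back and compare it with the array that was saved. The sketch below is a minimal round trip under stated assumptions: it writes to a temporary directory instead of `/content`, and the file name `roundtrip_example.wav` is hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

sr = 16000          # sampling rate in Hz
f = 100             # frequency in Hz
length = 1.0        # duration in seconds

# One time stamp per sample, excluding the end point
t = np.arange(int(sr * length)) / sr

# Scale the sine wave to the 16-bit range and cast to int16
data = (np.iinfo(np.int16).max * np.sin(2 * np.pi * f * t)).astype(np.int16)

# Write to a temporary directory (hypothetical location) and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip_example.wav")
    wavfile.write(path, sr, data)
    read_sr, read_data = wavfile.read(path)

# Sampling rate, data type, and duration of the file that was read back
print(read_sr, read_data.dtype, len(read_data) / read_sr)
```

If the sampling rate, data type, and samples all match, the file was encoded as intended.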

## Exercise 2: Resampling

Signals can be both down- and upsampled. We generally want to work with the same sampling rate across all our files. If you use a pretrained model, you are typically required to resample your audio to the sampling rate the model expects. Why do you think that is necessary?

Whenever we want to change the sampling rate, we can *resample* our original audio file to a target rate. However, upsampling does not improve the quality of the original audio. It only interpolates between the existing samples to accommodate the target rate.
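The point about upsampling can be illustrated with a small round trip. The sketch below is only an illustration under assumptions: the two-tone test signal and the rates of $16,000$ and $8,000$ Hz are made up for the example. Downsampling removes everything above the new Nyquist frequency of $4,000$ Hz, and upsampling back to the original rate does not restore it.

```python
import numpy as np
import scipy.signal

sr = 16000                       # original sampling rate in Hz
low_sr = 8000                    # lower rate with a Nyquist frequency of 4000 Hz
t = np.arange(sr) / sr           # one second of time stamps

# Two tones: 1000 Hz fits below the new Nyquist limit, 6000 Hz lies above it
original = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 6000 * t)

# Downsample to 8000 Hz, then upsample back to 16000 Hz
downsampled = scipy.signal.resample(original, low_sr)
roundtrip = scipy.signal.resample(downsampled, sr)

# Magnitude spectra; for a one-second signal, bin k corresponds to k Hz
mag_original = np.abs(np.fft.rfft(original))
mag_roundtrip = np.abs(np.fft.rfft(roundtrip))

for freq in (1000, 6000):
    print(f"{freq} Hz: before={mag_original[freq]:.1f}, after={mag_roundtrip[freq]:.1f}")
```

The round-trip signal has the original length and sampling rate again, but the 6000 Hz component is gone.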

In [None]:
# Define function to resample signal
def resampling(signal, sr, target_sr):
  """
  Resample an audio signal from the old sample rate to the new sample rate.

  Parameters:
  signal (numpy.ndarray): The input audio signal to be resampled.
  sr (int): The original sample rate (in Hz) of the input signal.
  target_sr (int): The target sample rate (in Hz) for the resampled signal.

  Returns:
  tuple: A tuple containing the new sample rate (target_sr) and the resampled signal (numpy.ndarray).
  """

  # resample ratio
  resample_ratio = target_sr / sr

  # resample signal
  resampled_signal = scipy.signal.resample(signal,
                                           int(len(signal) * resample_ratio))

  return target_sr, resampled_signal

### Exercise 2.0: Naive Resampling

We start by using only the `wavfile.write` function to conduct the resampling.

1. Read in the audio file we also used in *Exercise 1.0*
2. Describe the audio in the same way as in *Exercise 1.0*
3. Save the audio file using a sampling rate of $44,100$ Hz by specifying the `rate` argument in the `write()` function. Call the file "*naive_resampling_44100Hz.wav*". Encode the signal as $16$-bit
4. Repeat steps 1 and 2 for "*naive_resampling_44100Hz.wav*"
5. Listen to "*naive_resampling_44100Hz.wav*" yourself. What's the problem?

### Solution 2.0

In [None]:
# Step 1: Read in original audio file
sr, signal = wavfile.read(audio_fpath)

# Step 2: Describe the audio
audio_info_original = describe_audio(signal=signal, sr=sr)

# Step 3: Write audio to 44100 Hz
target_sr = 44100
wavfile.write(filename=f"/content/naive_resampling_{target_sr}Hz.wav",
              rate=target_sr,
              data=signal.astype(np.int16))

In [None]:
# Step 4: Read in resampled audio and describe
sr, signal = wavfile.read(f"/content/naive_resampling_{target_sr}Hz.wav")
audio_info_resampled = describe_audio(signal=signal, sr=sr)

# Step 5: Manual listening or play from Python

### Exercise 2.1: Correct Resampling

You probably figured out the problem by listening to the audio and comparing the durations of the original and resampled audio files. The goal of resampling is not to change the duration. The duration should stay fixed. What we want to change is the distance in time between consecutive samples. Note that a sampling rate of $16,000$ Hz corresponds to one sample every $\frac{1}{16}=0.0625 \hspace{.1cm}\text{ms}$ or $\frac{1}{16000}=6.25e\text{-}5 \hspace{.1cm}\text{s}$ (where $6.25 \times 10^{-5}=0.0000625$).
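The arithmetic above can be checked without any audio file. The sketch below assumes a hypothetical clip of $10$ seconds. It compares the duration after the naive approach, where only the rate label changes, with the duration after scaling the number of samples by the ratio of the two rates.

```python
sr = 16000            # original sampling rate in Hz (as in the text above)
target_sr = 44100     # target sampling rate in Hz
duration = 10.0       # hypothetical clip length in seconds
n_samples = int(sr * duration)

# Time between two consecutive samples, in milliseconds
sample_period_ms = 1000 / sr

# Naive approach: same samples, new rate label, so the clip becomes shorter
naive_duration = n_samples / target_sr

# Resampling: scale the number of samples by target_sr / sr, so the duration is kept
n_resampled = int(n_samples * target_sr / sr)
resampled_duration = n_resampled / target_sr

print(f"Sample period: {sample_period_ms} ms")
print(f"Naive duration: {naive_duration:.3f} s")
print(f"Resampled: {n_resampled} samples, {resampled_duration} s")
```

The naive file plays back in roughly a third of the original time, which is why it sounds sped up.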

The trick is to scale the number of samples in the original signal by the ratio of the target rate to the original sampling rate. To conduct the resampling, you can use the `resample()` function from the `scipy.signal` module. Use a target rate of $44,100$ Hz.

1. Conduct steps 1 and 2 from *Exercise 2.0*
2. Compute the resample ratio
3. Use `scipy.signal.resample` to resample the audio read in step 1 and assign it to an object called `resampled_signal`
4. Write the audio to a file called "*correct_resampling_44100Hz.wav*". Make sure to encode the audio as $16$-bit
5. Repeat step 1 for "*correct_resampling_44100Hz.wav*". Describe the results and compare them to your naive solution.
6. Listen to "*correct_resampling_44100Hz.wav*" to verify your solution.

### Solution 2.1

In [None]:
# Step 1: Read in original audio file and describe it (steps 1 and 2 from Exercise 2.0)
sr, signal = wavfile.read(audio_fpath)
audio_info_original = describe_audio(signal=signal, sr=sr)

# Step 2: Compute resample ratio
target_sr = 44100
resample_ratio = target_sr / sr
print(f"Resample signal from {sr} to {target_sr} Hz")

# Step 3: Resample
resampled_signal = scipy.signal.resample(signal, int(len(signal) * resample_ratio))

In [None]:
# Step 4: Write resampled signal
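# Clip to the 16-bit range before casting so that resampling overshoot cannot wrap around
int16_info = np.iinfo(np.int16)
wavfile.write(filename=f"correct_resampling_{target_sr}Hz.wav",
              rate=target_sr,
              data=np.clip(resampled_signal, int16_info.min, int16_info.max).astype(np.int16))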

# Step 5: Read in resampled audio file and describe
sr, resampled_signal = wavfile.read(f"correct_resampling_{target_sr}Hz.wav")
audio_info_resampled = describe_audio(signal=resampled_signal, sr=sr)

# Step 6: Manual listen

## Exercise 3: Speech Waveforms

So far we have seen dummy waveforms with only one or two frequencies. In reality, audio signals contain a lot of frequencies. This is also true for human speech. In the next four exercises, we will plot the speech waveforms of four different audio files uttered by two different speakers, a male and a female speaker. Each speaker has two audio files, one being an exemplar of a more subdued speaking style (*q10*) and the other an exemplar of an activated speaking style (*q90*).

The files are in the *data/audio/class09/* folder:
- *speaker0_q10.wav*
- *speaker0_q90.wav*
- *speaker1_q10.wav*
- *speaker1_q90.wav*
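Before turning to real speech, it can help to see how the frequency content of a dummy waveform is recovered. The sketch below is a minimal illustration, assuming a made-up signal with two tones at $120$ Hz and $300$ Hz and a sampling rate of $16,000$ Hz. It uses NumPy's real-input FFT to move from the time domain to the frequency domain.

```python
import numpy as np

sr = 16000                         # assumed sampling rate in Hz
t = np.arange(sr) / sr             # one second of time stamps

# Hypothetical signal: a strong 120 Hz tone plus a weaker 300 Hz tone
signal = 1.0 * np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)

# Real-input FFT and the frequency (in Hz) that belongs to each bin
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two largest peaks, sorted by frequency
peak_freqs = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(peak_freqs)
```

A speech recording would show energy spread over many bins rather than two isolated peaks.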

### Exercise 3.0: Reading and Normalizing Audio Files

1. Read in each of the four audio files
2. Write a function that normalizes a vector
3. Normalize the amplitude of each audio file (*Hint*: you might get an error message when using your normalization function. Type cast the audio array as a float when you normalize to avoid the error)




### Solution 3.0

In [None]:
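# Define function to min-max normalize an audio signal to the range [0, 1]
def normalization(x):
  # NumPy reductions are much faster on arrays than the built-in min/max
  x_min, x_max = np.min(x), np.max(x)
  return (x - x_min) / (x_max - x_min)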

In [None]:
# Define names of audio files in a list
speaker_labels = [0, 1]
quantiles = [10, 90]

audio_files = []
for s in speaker_labels:
  for q in quantiles:
    audio_files += [f'speaker{str(s)}_q{str(q)}']

In [None]:
# Define path to directory containing audio files
base_dir = os.path.join(os.getcwd(), 'css_fall2023/data/audio/class09')

# Read in each audio file
audio_signals = [wavfile.read(os.path.join(base_dir, f + '.wav')) for f in audio_files]

# Unpack each (sampling rate, signal) tuple; keep the signals and a single sampling rate
# (this assumes all files share the same sampling rate)
sr, audio_raw_signals = zip(*audio_signals)
sr = sr[0]

In [None]:
# Normalize audio - note that we need to type cast each array to a float to avoid errors
audio_normalized_signals = [normalization(a.astype(float)) for a in audio_raw_signals]
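Why the cast to float matters can be sketched on a toy array. This is an assumption about the underlying problem, since the exercise text does not name the error: 16-bit integer arithmetic cannot represent a range larger than $32,767$, so the subtraction inside a min-max normalization can overflow. The helper name `min_max_normalize` and the three sample values are hypothetical.

```python
import numpy as np

def min_max_normalize(x):
    """Scale x to [0, 1] after casting to float (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Toy 16-bit signal whose range (60000) exceeds the int16 maximum (32767)
raw = np.array([-30000, 0, 30000], dtype=np.int16)

# Integer arithmetic keeps the int16 dtype, so the largest value wraps around
shifted_int = raw - raw.min()

# Casting to float first gives the expected result
normalized = min_max_normalize(raw)

print(shifted_int.dtype, shifted_int)
print(normalized)
```

In the integer version the largest sample ends up negative after the shift, while the float version maps the samples onto the interval from 0 to 1.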

### Exercise 3.1: Full Speech

Plot the entire span of each speech for each audio file in a 2 by 2 grid with 2 rows and 2 columns. The first row should be *speaker0* and the second row should be *speaker1*. Give each speaker a unique color.

### Solution 3.1

In [None]:
num_subplots = len(audio_normalized_signals)
num_rows = (num_subplots + 1) // 2
num_cols = 2
colors = ['#381a61', '#381a61', '#e78429', '#e78429']
speakers = ['Claus Hjorth Frederiksen', 'Claus Hjorth Frederiksen', 'Özlem Cekic', 'Özlem Cekic']

In [None]:
c = 0
plt.figure(figsize=(20, 12))
for j in range(num_cols):
  for i in range(num_rows):
    plt.subplot(num_rows, num_cols, c + 1)
    t = np.linspace(0, len(audio_normalized_signals[c])/sr, len(audio_normalized_signals[c]), endpoint=False)
    plt.plot(t, audio_normalized_signals[c], color=colors[c], alpha=0.7)

    if c > 1:
      plt.xlabel('Time (s)', size=16)

    if c % 2 == 0:
      plt.ylabel('Amplitude', size=16)

    plt.title(f"Speech {i}: {speakers[c]}", size=14)
    plt.grid(True)
    c += 1
plt.show()


### Exercise 3.2: Five Seconds of Speech

As you can see, it is difficult to extract any meaning from the full speech waveforms. Try to plot a subset of five seconds for each speech. I use start=$20$ and stop=$25$, but feel free to choose any other interval.

Once again, plot the waveforms in a 2 by 2 grid with 2 rows and 2 columns. The first row should be *speaker0* and the second row should be *speaker1*. Give each speaker a unique color.

Describe the results. Try to listen to the five seconds in each speech manually.

### Solution 3.2

In [None]:
num_subplots = len(audio_normalized_signals)
num_rows = (num_subplots + 1) // 2
num_cols = 2
start=20
stop=25
colors = ['#381a61', '#381a61', '#e78429', '#e78429']
speakers = ['Claus Hjorth Frederiksen', 'Claus Hjorth Frederiksen', 'Özlem Cekic', 'Özlem Cekic']

In [None]:
c = 0
plt.figure(figsize=(20, 12))
for j in range(num_cols):
  for i in range(num_rows):
    plt.subplot(num_rows, num_cols, c + 1)
    t = np.linspace(0, len(audio_normalized_signals[c])/sr, len(audio_normalized_signals[c]), endpoint=False)
    plt.plot(t[start*sr:int(stop*sr)], audio_normalized_signals[c][start*sr:int(stop*sr)], color=colors[c], alpha=0.7)

    if c > 1:
      plt.xlabel('Time (s)', size=16)

    if c % 2 == 0:
      plt.ylabel('Amplitude', size=16)

    plt.title(f"Speech {i}: {speakers[c]}", size=14)
    plt.grid(True)
    c += 1
plt.show()


### Exercise 3.3: 25 milliseconds (0.025 seconds)

This is clearly more informative. We can go even further to see what's going on. Try to plot $25$ ms of each speech in a 2 by 2 grid with 2 rows and 2 columns. The first row should be *speaker0* and the second row should be *speaker1*. Give each speaker a unique color.

I use start=$11$ and stop=$11.025$, but you can choose any interval you like.
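Selecting an interval in seconds comes down to converting the start and stop times into sample indices. The sketch below assumes a sampling rate of $16,000$ Hz and uses a placeholder array of zeros instead of a real recording. The helper `seconds_to_slice` is hypothetical and not part of the course code.

```python
import numpy as np

def seconds_to_slice(start, stop, sr):
    """Return a slice covering [start, stop) seconds at sampling rate sr (hypothetical helper)."""
    # round() guards against products that land slightly below an integer in floating point
    return slice(int(round(start * sr)), int(round(stop * sr)))

sr = 16000                               # assumed sampling rate in Hz
signal = np.zeros(20 * sr)               # placeholder for 20 seconds of audio

window = seconds_to_slice(11, 11.025, sr)
segment = signal[window]

# First index, last index (exclusive), and number of samples in the window
print(window.start, window.stop, len(segment))
```

At this sampling rate, a window of $25$ ms contains only a few hundred samples, which is why individual oscillations become visible in the plot.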

### Solution 3.3

In [None]:
num_subplots = len(audio_normalized_signals)
num_rows = (num_subplots + 1) // 2
num_cols = 2
start=11
stop=11.025
colors = ['#381a61', '#381a61', '#e78429', '#e78429']
speakers = ['Claus Hjorth Frederiksen', 'Claus Hjorth Frederiksen', 'Özlem Cekic', 'Özlem Cekic']

In [None]:
c = 0
plt.figure(figsize=(20, 12))
for j in range(num_cols):
  for i in range(num_rows):
    plt.subplot(num_rows, num_cols, c + 1)
    t = np.linspace(0, len(audio_normalized_signals[c])/sr, len(audio_normalized_signals[c]), endpoint=False)
    plt.plot(t[start*sr:int(stop*sr)], audio_normalized_signals[c][start*sr:int(stop*sr)], color=colors[c], alpha=0.7)

    if c > 1:
      plt.xlabel('Time (s)', size=16)

    if c % 2 == 0:
      plt.ylabel('Amplitude', size=16)

    plt.title(f"Speech {i}: {speakers[c]}", size=14)
    plt.grid(True)
    c += 1
plt.show()


## Exercise 4: Spectrograms

Repeat the exact same procedure as in *Exercise 3*, but plot the speeches as spectrograms instead of waveforms. I use the ranges:

* Exercise 4.0: Full speech
* Exercise 4.1: 5-20 seconds (15 seconds in total)
* Exercise 4.2: 11-12.5 seconds (1.5 seconds in total)

You decide if you use mel-spectrograms or the standard spectrogram. For the former, you should use the `librosa` module. For the latter, you can use `plt.specgram`.
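To see what a standard spectrogram contains before plotting one, the values can also be computed directly. The sketch below swaps in `scipy.signal.spectrogram` for `plt.specgram` so that it runs without a display. The test signal (a tone that jumps from $500$ Hz to $2,000$ Hz), the sampling rate, and the segment length of $512$ samples are assumptions made for the example.

```python
import numpy as np
import scipy.signal

sr = 16000                                   # assumed sampling rate in Hz
t = np.arange(sr) / sr                       # one second of time stamps

# Hypothetical test signal: 500 Hz in the first half, 2000 Hz in the second half
tone = np.where(t < 0.5, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

# Short-time power spectrum: rows are frequencies, columns are time frames
freqs, times, power = scipy.signal.spectrogram(tone, fs=sr, nperseg=512)

# Convert power to decibels; the small constant avoids log(0)
power_db = 10 * np.log10(power + 1e-12)

# Strongest frequency in every time frame
dominant = freqs[np.argmax(power_db, axis=0)]
print(power_db.shape, dominant[0], dominant[-1])
```

Each column of `power_db` is the spectrum of one short frame, so the dominant frequency changes from the first frame to the last. A spectrogram plot shows the same matrix as an image.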

In [None]:
num_subplots = len(audio_normalized_signals)
num_rows = (num_subplots + 1) // 2
num_cols = 2
colors = ['viridis', 'viridis', 'inferno', 'inferno']
speakers = ['Claus Hjorth Frederiksen', 'Claus Hjorth Frederiksen', 'Özlem Cekic', 'Özlem Cekic']

### Exercise 4.0: Full Speech




### Solution 4.0

In [None]:
c = 0
plt.figure(figsize=(20, 12))
for j in range(num_cols):
  for i in range(num_rows):
    plt.subplot(num_rows, num_cols, c + 1)
    t = np.linspace(0, len(audio_normalized_signals[c])/sr, len(audio_normalized_signals[c]), endpoint=False)
    Pxx, freqs, spectimes, cax = plt.specgram(audio_normalized_signals[c],
                                              Fs=sr,
                                              scale='dB',
                                              cmap=colors[c])

    if c > 1:
      plt.xlabel('Time (s)', size=16)

    if c % 2 == 0:
      plt.ylabel('Frequency (Hz)', size=16)

    plt.title(f"Speech {i}: {speakers[c]}", size=14)
    plt.colorbar(format='%+2.0f dB')
    plt.grid(True)
    c += 1
plt.show()


### Exercise 4.1: 5-20 seconds

### Solution 4.1

In [None]:
start=5
stop=20
c = 0
plt.figure(figsize=(20, 12))
for j in range(num_cols):
  for i in range(num_rows):
    plt.subplot(num_rows, num_cols, c + 1)
    Pxx, freqs, spectimes, cax = plt.specgram(audio_normalized_signals[c][start*sr:stop*sr],
                                              Fs=sr,
                                              scale='dB',
                                              cmap=colors[c])

    if c > 1:
      plt.xlabel('Time (s)', size=16)

    if c % 2 == 0:
      plt.ylabel('Frequency (Hz)', size=16)

    plt.title(f"Speech {i}: {speakers[c]}", size=14)
    plt.colorbar(format='%+2.0f dB')
    plt.grid(True)
    c += 1
plt.show()


### Exercise 4.2: 11-12.5 seconds

### Solution 4.2

In [None]:
start=11
stop=12.5
c = 0
plt.figure(figsize=(20, 12))
for j in range(num_cols):
  for i in range(num_rows):
    plt.subplot(num_rows, num_cols, c + 1)
    Pxx, freqs, spectimes, cax = plt.specgram(audio_normalized_signals[c][start*sr:int(stop*sr)],
                                              Fs=sr,
                                              scale='dB',
                                              cmap=colors[c])

    if c > 1:
      plt.xlabel('Time (s)', size=16)

    if c % 2 == 0:
      plt.ylabel('Frequency (Hz)', size=16)

    plt.title(f"Speech {i}: {speakers[c]}", size=14)
    plt.colorbar(format='%+2.0f dB')
    plt.grid(True)
    c += 1
plt.show()