<a href="https://colab.research.google.com/github/mraskj/css_fall2023/blob/main/code/class09/class09-exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class 09: Audio Basics - Exercise

In this exercise, we explore how to play around with audio in Python. We'll learn how to plot waveforms, convert audio from the time domain to the frequency domain, and turn audio into spectrograms.



## Setup

We start by:

1. Cloning the course GitHub repo
2. Importing necessary modules





### 0.1 Cloning GitHub Repository

In [None]:
# Clone the course GitHub repository into the current working directory
!git clone https://github.com/mraskj/css_fall2023.git

### 0.2 Importing Modules

In [None]:
# MODULES

# For file and directory management
import os

# For data handling
import numpy as np

# For plotting
import matplotlib.pyplot as plt

# For signal processing
import scipy
import scipy.signal  # make sure the signal submodule used for resampling is loaded
import librosa
from scipy.io import wavfile

## Exercise 1: Reading and Writing Audio Files

### Exercise 1.0 Reading Audio

1. Read in one of the audio files from *data/audio/class09/*. You decide which one.
2. After reading the file, inspect the characteristics of the audio signal (e.g. sampling rate, number of samples, duration, and minimum and maximum values).

[https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.read.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.read.html)
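
As a rough sketch of what such an inspection could look like, the snippet below first writes a short synthetic tone to a temporary file and then reads it back with `wavfile.read`. The synthetic file is only a stand-in for the course audio (an assumption made so the example runs anywhere), and the helper name `describe_audio` is hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile


def describe_audio(sr, signal):
    """Collect basic characteristics of an audio signal (hypothetical helper)."""
    return {
        "sampling_rate": sr,
        "n_samples": signal.shape[0],
        "duration_s": signal.shape[0] / sr,
        "dtype": str(signal.dtype),
        "min": signal.min(),
        "max": signal.max(),
    }


# Stand-in file: one second of a 100 Hz tone at 16,000 Hz, stored as 16-bit integers
demo_sr = 16000
t = np.arange(demo_sr) / demo_sr
tone = (0.5 * np.iinfo(np.int16).max * np.sin(2 * np.pi * 100 * t)).astype(np.int16)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
wavfile.write(demo_path, demo_sr, tone)

# wavfile.read returns the sampling rate and the samples as a NumPy array
sr, signal = wavfile.read(demo_path)
info = describe_audio(sr, signal)
for key, value in info.items():
    print(f"{key}: {value}")
```

To inspect one of the course files instead, replace `demo_path` with the path to that file.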

### Exercise 1.1: Writing Audio

Audio signals can easily be saved to disk using the `wavfile.write` function. Define an audio signal of your choice (e.g. a sine wave) with frequency $f$ (in Hz), length $l$ (i.e. duration in seconds), and sampling rate $sr$, then save it as a WAV file. You decide the values, but I recommend $f \in [50, 100]$ and $l \in [0.0, 5.0]$.


[https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.write.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.write.html)
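
One possible way to do this is sketched below. The concrete values for the frequency, length, and sampling rate, the output file name, and the use of a temporary directory are all illustrative assumptions.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# Illustrative values (assumptions): 100 Hz tone, 2 seconds, 16,000 Hz sampling rate
f = 100
length = 2.0
sr = 16000

# Time axis with sr samples per second, then the sine wave itself
t = np.arange(int(length * sr)) / sr
sine = np.sin(2 * np.pi * f * t)

# Scale the [-1, 1] float signal to the 16-bit integer range before writing
sine_int16 = (sine * np.iinfo(np.int16).max).astype(np.int16)

out_path = os.path.join(tempfile.mkdtemp(), "sine_100Hz.wav")
wavfile.write(out_path, sr, sine_int16)

# Read the file back to confirm that the rate and length survived the round trip
sr_back, sine_back = wavfile.read(out_path)
print(sr_back, len(sine_back) / sr_back)
```

Writing the float array directly also works, but `wavfile.write` then stores 32-bit floating-point samples rather than 16-bit integers.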

## Exercise 2: Resampling

We can always downsample and upsample signals. In practice, we want all our files to share the same sampling rate. If you use a pretrained model, you are typically required to preprocess your audio to the sampling rate that the model expects. Why do you think that is necessary?

Whenever we want to change the sampling rate, we can *resample* our original audio file to a target rate. However, upsampling does not improve the quality of your original audio. It only interpolates between the existing samples to accommodate the target rate.

In [None]:
# Define function to resample signal
def resampling(signal, sr, target_sr):
  """
  Resample an audio signal from the old sample rate to the new sample rate.

  Parameters:
  signal (numpy.ndarray): The input audio signal to be resampled.
  sr (int): The original sample rate (in Hz) of the input signal.
  target_sr (int): The target sample rate (in Hz) for the resampled signal.

  Returns:
  tuple: A tuple containing the new sample rate (target_sr) and the resampled signal (numpy.ndarray).
  """

  # resample ratio
  resample_ratio = target_sr / sr

  # resample signal
  resampled_signal = scipy.signal.resample(signal,
                                           int(len(signal) * resample_ratio))

  return target_sr, resampled_signal

### Exercise 2.0: Naive Resampling

We start by using only the `wavfile.write` function to conduct the resampling.

1. Read in the audio file we used in *Exercise 1.0*
2. Describe the audio in the same way as in *Exercise 1.0*
3. Save the audio file using a sampling rate of $44,100$ Hz by specifying the `rate` argument in the `write()` function. Call the file "*naive_resampling_44100Hz.wav*". Encode the signal as $16$-bit
4. Repeat steps 1 and 2 for "*naive_resampling_44100Hz.wav*"
5. Listen to "*naive_resampling_44100Hz.wav*" yourself. What's the problem?

### Exercise 2.1: Correct Resampling

You probably spotted the problem by listening to the audio and comparing the durations of the original and the resampled audio files. The goal of resampling is not to change the duration. The duration should stay fixed. What we want to change is the distance in time between consecutive samples. Note that a sampling rate of $16,000$ Hz corresponds to one sample every $\frac{1}{16000}\,\text{s} = 6.25 \times 10^{-5}\,\text{s} = 0.0625\,\text{ms}$.
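
The arithmetic behind this can be checked in a few lines. The sketch below assumes an original rate of $16,000$ Hz and an illustrative five-second recording. It shows that reinterpreting the same samples at $44,100$ Hz, as the naive approach does, shortens the duration.

```python
sr = 16000          # original sampling rate in Hz
target_sr = 44100   # rate used when the file is written naively
n_samples = 5 * sr  # five seconds of audio at the original rate (illustrative)

# Time between two consecutive samples
interval_s = 1 / sr
interval_ms = 1000 * interval_s

# Duration is the number of samples divided by the sampling rate
original_duration = n_samples / sr
naive_duration = n_samples / target_sr  # same samples, played back faster

print(f"sample interval:   {interval_s:.2e} s = {interval_ms:.4f} ms")
print(f"original duration: {original_duration:.2f} s")
print(f"naive duration:    {naive_duration:.2f} s")
```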

The trick is to scale the number of samples by the ratio of the target rate to the original sampling rate. To conduct the resampling, you can use the `resample()` function from the `scipy.signal` module. Use a target rate of $44,100$ Hz.

1. Conduct steps 1 and 2 from *Exercise 2.0*
2. Compute the resample ratio
3. Use `scipy.signal.resample` to resample the audio read in step 1 and assign it to an object called `resampled_signal`
4. Write the audio to a file called "*correct_resampling_44100Hz.wav*". Make sure to encode the audio as $16$-bit
5. Do step 1 for "*correct_resampling_44100Hz.wav*". Describe the results and compare them to your naive solution.
6. Listen to "*correct_resampling_44100Hz.wav*" to verify your solution.
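
The steps above could be sketched roughly as follows. A synthetic tone stands in for the course audio, and the file is written to a temporary directory; both are assumptions made to keep the example self-contained.

```python
import os
import tempfile

import numpy as np
import scipy.signal
from scipy.io import wavfile

# Stand-in for the course audio: two seconds of a 100 Hz tone at 16,000 Hz
sr = 16000
target_sr = 44100
t = np.arange(2 * sr) / sr
signal = (0.5 * np.iinfo(np.int16).max * np.sin(2 * np.pi * 100 * t)).astype(np.int16)

# The ratio of the two rates gives the number of samples needed at the target rate;
# rounding (rather than truncating) guards against floating-point error
resample_ratio = target_sr / sr
n_target = int(round(len(signal) * resample_ratio))

# scipy.signal.resample returns floats, so convert back to 16-bit before writing
resampled_signal = scipy.signal.resample(signal.astype(np.float64), n_target)
resampled_int16 = np.clip(np.round(resampled_signal), -32768, 32767).astype(np.int16)

out_path = os.path.join(tempfile.mkdtemp(), "correct_resampling_44100Hz.wav")
wavfile.write(out_path, target_sr, resampled_int16)

# The duration should be unchanged even though the number of samples grew
sr_back, signal_back = wavfile.read(out_path)
print(len(signal) / sr, len(signal_back) / sr_back)
```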

## Exercise 3: Speech Waveforms

So far we have seen dummy waveforms with only one or two frequencies. In reality, audio signals contain a lot of frequencies. This is also true for human speech. In the next four exercises, we will plot the speech waveforms of four audio files uttered by two speakers, one male and one female. Each speaker has two audio files: one is an exemplar of a more subdued speaking style (*q10*) and the other is an exemplar of an activated speaking style (*q90*).

The files are in the *data/audio/class09/* folder:
- *speaker0_q10.wav*
- *speaker0_q90.wav*
- *speaker1_q10.wav*
- *speaker1_q90.wav*

### Exercise 3.0: Reading and Normalizing Audio Files

1. Read in each of the four audio files
2. Write a function that normalizes a vector
3. Normalize the amplitude of each audio file (*Hint*: you might get an error message when using your normalization function. Cast the audio array to float before you normalize to avoid the error)
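
A minimal sketch of such a normalization function is shown below. It assumes peak normalization (dividing by the largest absolute value); other definitions, such as dividing by the vector norm, are equally valid choices. The function name `normalize` is hypothetical.

```python
import numpy as np


def normalize(signal):
    """Scale a signal so that its largest absolute value is 1 (peak normalization)."""
    # Cast to float first: integer audio cannot hold fractional values,
    # and np.abs(-32768) overflows in int16
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal))
    if peak == 0:
        return signal  # silent signal, nothing to scale
    return signal / peak


# Small 16-bit example, including the most negative int16 value
raw = np.array([0, 8192, -32768, 16384], dtype=np.int16)
normalized = normalize(raw)
print(normalized)
```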




### Exercise 3.1: Full Speech

Plot the full waveform of each audio file in a 2 by 2 grid. The first row should be *speaker0* and the second row should be *speaker1*. Give each speaker a unique color.

### Exercise 3.2: Five Seconds of Speech

As you can see, it is difficult to extract any meaning from the full speech waveforms. Try to plot a subset of five seconds for each speech. I use start=$20$ and stop=$25$, but feel free to choose any other interval.

Once again, plot the waveforms in a 2 by 2 grid with 2 rows and 2 columns. The first row should be *speaker0* and the second row should be *speaker1*. Give each speaker a unique color.

Describe the results. Try to listen to the five seconds in each speech manually.
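
Selecting a time interval comes down to converting seconds into sample indices by multiplying with the sampling rate. The sketch below illustrates this together with the 2 by 2 layout. Random noise stands in for the four speeches (an assumption so the example runs without the course files), and the helper name `slice_seconds` is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt


def slice_seconds(signal, sr, start, stop):
    """Return the samples between start and stop, both given in seconds."""
    return signal[int(start * sr):int(stop * sr)]


# Stand-ins for the four speeches: 30 seconds of noise each (assumption)
sr = 16000
rng = np.random.default_rng(0)
names = ["speaker0_q10", "speaker0_q90", "speaker1_q10", "speaker1_q90"]
speeches = {name: rng.uniform(-1, 1, 30 * sr) for name in names}

start, stop = 20, 25
colors = ["tab:blue", "tab:orange"]  # one color per speaker (row)

fig, axes = plt.subplots(2, 2, figsize=(10, 6), sharex=True, sharey=True)
for i, (ax, name) in enumerate(zip(axes.flat, names)):
    segment = slice_seconds(speeches[name], sr, start, stop)
    time = start + np.arange(len(segment)) / sr  # time axis in seconds
    ax.plot(time, segment, color=colors[i // 2], linewidth=0.5)
    ax.set_title(name)
for ax in axes[-1, :]:
    ax.set_xlabel("Time (s)")
for ax in axes[:, 0]:
    ax.set_ylabel("Amplitude")
fig.tight_layout()
```

The same helper works for shorter intervals by changing `start` and `stop`.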

### Exercise 3.3: 25 milliseconds (0.025 seconds)

This is clearly more informative. We can go even further to see what's going on. Try to plot $25$ ms of each speech in a 2 by 2 grid with 2 rows and 2 columns. The first row should be *speaker0* and the second row should be *speaker1*. Give each speaker a unique color.

I use start=$11$ and stop=$11.025$, but you can choose any interval you like.

## Exercise 4: Spectrograms

Repeat the exact same procedure as in *Exercise 3*, but plot the speeches as spectrograms instead of waveforms. I use the ranges:

* Exercise 4.0: Full speech
* Exercise 4.1: 5-20 seconds (15 seconds in total)
* Exercise 4.2: 11-12.5 seconds (1.5 seconds in total)

You decide if you use mel-spectrograms or the standard spectrogram. For the former, you should use the `librosa` module. For the latter, you can use `plt.specgram`.
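
For the standard spectrogram, a single panel could be sketched as below. A noisy 440 Hz tone stands in for a speech file, and the window length of 25 ms and hop of 10 ms are illustrative assumptions rather than required settings.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in signal (assumption): three seconds of a noisy 440 Hz tone at 16,000 Hz
sr = 16000
rng = np.random.default_rng(0)
t = np.arange(3 * sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(len(t))

# Keep only the part between start and stop (in seconds)
start, stop = 1.0, 2.5
segment = signal[int(start * sr):int(stop * sr)]

# Window and hop sizes in samples (illustrative values)
n_fft = int(0.025 * sr)
hop = int(0.010 * sr)

fig, ax = plt.subplots(figsize=(8, 4))
spectrum, freqs, times, image = ax.specgram(
    segment, NFFT=n_fft, Fs=sr, noverlap=n_fft - hop
)
ax.set_xlabel("Time (s)")
ax.set_ylabel("Frequency (Hz)")
fig.colorbar(image, ax=ax, label="Power (dB)")

# The strongest frequency bin should sit close to the 440 Hz tone
peak_freq = freqs[spectrum.mean(axis=1).argmax()]
print(peak_freq)
```

Note that the time axis returned by `specgram` is relative to the start of the selected segment, not to the start of the full recording.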

### Exercise 4.0: Full Speech




### Exercise 4.1: 5-20 seconds

### Exercise 4.2: 11-12.5 seconds