# Pitch Recognition using Fast Fourier Transform (FFT)

This notebook is a simple implementation of pitch recognition using the Fast Fourier Transform (FFT). The idea is to use the FFT to convert the audio signal into frequency data
measured in hertz. This data can be used to identify the pitch of the audio signal and hence the note being played.
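
As a minimal sketch of this idea, the snippet below uses a synthetic 440 Hz sine wave instead of a recording, with an assumed sample rate of 44100 Hz. The strongest bin of the FFT magnitude spectrum should land on the tone's frequency.

```python
import numpy as np

fs = 44100            # assumed sample rate in Hz
duration = 0.25       # seconds of audio to analyse
t = np.arange(int(fs * duration)) / fs
tone = np.sin(2 * np.pi * 440.0 * t)   # synthetic A4 test tone

spectrum = np.abs(np.fft.rfft(tone))            # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(tone), d=1 / fs)    # bin centre frequencies in Hz

peak_freq = freqs[np.argmax(spectrum)]
print(f"Strongest frequency: {peak_freq:.1f} Hz")
```

Real recordings contain harmonics and noise, so the strongest bin is not always the fundamental. That is one reason the notebook reports several top notes rather than a single one.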

## Package Installation

The project uses a series of packages that need to be installed. The packages are as follows:
- sounddevice: this package is used to record and play audio
- tqdm: this package is used to display a progress bar
- kaleido: package used to generate static images from plotly figures
- plotly: package used to generate interactive plots
- numpy: package used for numerical operations
- scipy: package used for scientific operations
- librosa: package used for audio processing

In [33]:
%pip install kaleido==0.1.0post1
%pip install tqdm
%pip install sounddevice
%pip install librosa
import sounddevice as sd
import numpy as np
import librosa
from scipy.fftpack import fft
from scipy.io import wavfile
from scipy.io.wavfile import write
import os
import plotly.graph_objects as go
import plotly.io as pio
import tqdm
import glob

Note: you may need to restart the kernel to use updated packages.


The code below sets the parameters for the FFT analysis: the video frame rate (frames per second), the length of audio that makes up one FFT window, the range of frequencies to display, and the number of top notes to label in each frame.

These parameters can be adjusted to improve the accuracy of the pitch recognition.
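
One way to reason about these trade-offs: the spacing between FFT bins is roughly the reciprocal of the window length, so a 0.25 s window gives bins about 4 Hz apart. The sketch below compares that spacing with the gap between neighbouring semitones, which narrows at low pitches. The helper names are illustrative and not part of the notebook.

```python
FFT_WINDOW_SECONDS = 0.25  # same value as the configuration cell

def bin_spacing_hz(window_seconds):
    """Approximate spacing between adjacent FFT bins for a window length."""
    return 1.0 / window_seconds

def semitone_gap_hz(midi_note):
    """Distance in Hz from a note to the next semitone up (A4 = 69 = 440 Hz)."""
    f = 440.0 * 2.0 ** ((midi_note - 69) / 12.0)
    return f * (2.0 ** (1 / 12.0) - 1.0)

spacing = bin_spacing_hz(FFT_WINDOW_SECONDS)
for name, midi in [("C2", 36), ("C3", 48), ("C4", 60)]:
    gap = semitone_gap_hz(midi)
    verdict = "wider than one bin" if gap > spacing else "narrower than one bin"
    print(f"{name}: semitone gap {gap:.2f} Hz, bin spacing {spacing:.2f} Hz -> {verdict}")
```

A longer window narrows the bins and helps with low notes, at the cost of blurring fast note changes across more audio.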

In [34]:
# Configuration
FPS = 30
FFT_WINDOW_SECONDS = 0.25 # how many seconds of audio make up an FFT window

# Frequency range to display (Hz)
FREQ_MIN = 10
FREQ_MAX = 1000

# Notes to display
TOP_NOTES = 3

# Names of the notes
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Output size. Generally use SCALE for higher res, unless you need a non-standard aspect ratio.
RESOLUTION = (1920, 1080)
SCALE = 2 # 0.5=qHD(960x540), 1=HD(1920x1080), 2=4K(3840x2160)

# Audio device index. This is machine-specific; list the available devices with sd.query_devices()
sd.default.device = 2

The code below defines a function that records audio from the default device and saves it as a WAV file. The cell after it runs the function.

In [35]:
def record_audio(duration, filename, fs=44100):
    print("Recording...")
    recording = sd.rec(int(duration * fs), samplerate=fs, channels=2)
    sd.wait()  # Wait until recording is finished
    print("Recording finished. Saving to file...")
    write(filename, fs, recording)  # Save as WAV file 

In [38]:
duration = 2
filename = 'output.wav'
record_audio(duration, filename)

Recording...
Recording finished. Saving to file...


In [36]:
AUDIO_FILE = "c-scale-demo.wav" # EDIT THIS LINE TO USE YOUR OWN FILE

fs, data = wavfile.read(AUDIO_FILE) # load the data

# Check if the audio data has more than one channel
if len(data.shape) > 1 and data.shape[1] > 1:
    audio = data.T[0] # this is a two channel soundtrack, get the first track
else:
    audio = data # this is a mono soundtrack

FRAME_STEP = (fs / FPS) # audio samples per video frame
FFT_WINDOW_SIZE = int(fs * FFT_WINDOW_SECONDS)
AUDIO_LENGTH = len(audio)/fs

The cell below defines three helper functions for the FFT analysis and its plots:

1. `plot_fft`: plots the FFT of the audio signal. It draws the normalized FFT magnitude against frequency as a scatter trace and annotates the graph with the notes that were recognized.
2. `extract_sample`: extracts a window of audio samples from the full audio signal for processing.
3. `find_top_notes`: finds the most prominent notes in the audio signal. It picks the strongest frequencies in the FFT and maps them to note names.
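
To make the windowing concrete, here is a small standalone sketch of the trailing-window idea behind `extract_sample`, using toy sizes instead of the notebook's globals. The function name `trailing_window` and its parameters are illustrative. The point is that early frames are left-padded with zeros so that every window has the same length.

```python
import numpy as np

def trailing_window(audio, frame_number, frame_offset, window_size):
    """Return the window_size samples ending at frame_number * frame_offset,
    left-padded with zeros when the window starts before the audio does."""
    end = frame_number * frame_offset
    begin = end - window_size
    if begin >= 0:
        return audio[begin:end].astype(float)
    padding = np.zeros(-begin, dtype=float)
    return np.concatenate([padding, audio[:end].astype(float)])

audio = np.arange(1, 11)   # toy signal: 1, 2, ..., 10
for frame in range(4):
    print(frame, trailing_window(audio, frame, frame_offset=3, window_size=4))
```

Keeping the window length constant means the same window function and the same frequency axis can be reused for every frame.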

In [37]:
def plot_fft(p, xf, fs, notes, dimensions=(960,540)):
  layout = go.Layout(
      title="frequency spectrum",
      autosize=False,
      width=dimensions[0],
      height=dimensions[1],
      xaxis_title="Frequency (note)",
      yaxis_title="Magnitude",
      font={'size' : 24}
  )

  fig = go.Figure(layout=layout,
                  layout_xaxis_range=[FREQ_MIN,FREQ_MAX],
                  layout_yaxis_range=[0,1]
                  )
  
  fig.add_trace(go.Scatter(
      x = xf,
      y = p))
  
  for note in notes:
    fig.add_annotation(x=note[0]+10, y=note[2],
            text=note[1],
            font = {'size' : 48},
            showarrow=False)
  return fig

def extract_sample(audio, frame_number):
  end = frame_number * FRAME_OFFSET
  begin = int(end - FFT_WINDOW_SIZE)

  if end == 0:
    # We have no audio yet, return all zeros (very beginning)
    return np.zeros((np.abs(begin)),dtype=float)
  elif begin<0:
    # We have some audio, pad with zeros
    return np.concatenate([np.zeros((np.abs(begin)),dtype=float),audio[0:end]])
  else:
    # Usually this happens, return the next sample
    return audio[begin:end]

def find_top_notes(fft,num):
  if np.max(fft.real)<0.001:
    return []

  lst = [x for x in enumerate(fft.real)]
  lst = sorted(lst, key=lambda x: x[1],reverse=True)

  idx = 0
  found = []
  found_note = set()
  while( (idx<len(lst)) and (len(found)<num) ):
    f = xf[lst[idx][0]]
    y = lst[idx][1]
    idx += 1
    if f <= 0:
      # Skip the DC bin: log2(0) is undefined, so it has no note number
      continue
    n = freq_to_number(f)
    n0 = int(round(n))
    name = note_name(n0)

    if name not in found_note:
      found_note.add(name)
      s = [f,name,y]
      found.append(s)
    
  return found

The cell below defines the note-conversion helpers, builds the window function, and runs the FFT over the audio one video frame at a time. For each frame it plots the FFT of the audio signal, annotated with the most prominent notes, and saves the plot as a PNG image in the `frames` folder.

The final cell then combines the saved frames and the original audio into a video with ffmpeg.
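
The frame loop relies on converting a frequency to a MIDI note number and then to a note name. A minimal standalone sketch of that conversion follows, using the same formula as the notebook (A4 = 440 Hz = note 69). As an assumption for illustration, it computes the octave number with floor division.

```python
import numpy as np

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def freq_to_number(f):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return 69 + 12 * np.log2(f / 440.0)

def note_name(n):
    """Convert an integer MIDI note number to a name such as 'A4'."""
    return NOTE_NAMES[n % 12] + str(n // 12 - 1)

for f in (261.63, 440.0, 659.25):
    n = int(round(freq_to_number(f)))
    print(f"{f:7.2f} Hz -> note {n} -> {note_name(n)}")
```

Rounding to the nearest integer snaps each frequency to the closest equal-tempered semitone, so slightly out-of-tune input still maps to a note name.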

In [39]:
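# Make sure the output folder exists, then clear frames left over from earlier runs
os.makedirs('frames', exist_ok=True)
png_files = glob.glob('frames/*.png')

# Remove each file
for file in png_files:
    os.remove(file)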

# See https://newt.phys.unsw.edu.au/jw/notes.html
def freq_to_number(f): return 69 + 12*np.log2(f/440.0)
def number_to_freq(n): return 440 * 2.0**((n-69)/12.0)
def note_name(n): return NOTE_NAMES[n % 12] + str(n // 12 - 1)

# Hann window function
window = 0.5 * (1 - np.cos(np.linspace(0, 2*np.pi, FFT_WINDOW_SIZE, False)))

xf = np.fft.rfftfreq(FFT_WINDOW_SIZE, 1/fs)
FRAME_COUNT = int(AUDIO_LENGTH*FPS)
FRAME_OFFSET = int(len(audio)/FRAME_COUNT)

# Pass 1, find out the maximum amplitude so we can scale.
mx = 0
for frame_number in range(FRAME_COUNT):
  sample = extract_sample(audio, frame_number)

  fft = np.fft.rfft(sample * window)
  fft = np.abs(fft).real 
  mx = max(np.max(fft),mx)

print(f"Max amplitude: {mx}")

# Pass 2, produce the animation
print("Producing frames...")
for frame_number in tqdm.tqdm(range(FRAME_COUNT)): 
  sample = extract_sample(audio, frame_number)

  fft = np.fft.rfft(sample * window)
  fft = np.abs(fft) / mx 
  
  
  s = find_top_notes(fft,TOP_NOTES)
  
  fig = plot_fft(fft.real,xf,fs,s,RESOLUTION)
  try:
    fig.write_image(f"frames/frame{frame_number}.png",scale=SCALE)
  except Exception as e:
    print(f"Error writing image: {e}")


Max amplitude: 31356926.787363894
Producing frames...


100%|██████████| 223/223 [00:46<00:00,  4.75it/s]


In [40]:
!ffmpeg -y -r {FPS} -f image2 -s 1920x1080 -i frames/frame%d.png -i {AUDIO_FILE} -c:v libx264 -pix_fmt yuv420p demo-movie.mp4

ffmpeg version 2024-05-27-git-01c7f68f7a-full_build-www.gyan.dev Copyright (c) 2000-2024 the FFmpeg developers
  built with gcc 13.2.0 (Rev5, Built by MSYS2 project)

## Sources of information

- Extract Musical notes in python with FFT: https://www.youtube.com/watch?v=rj9NOiFLxWA
    - This video was used to understand the concept of FFT and how it can be used to extract musical notes from an audio signal. 
    - A lot of the code and functions used in this notebook come from the code provided in the video.
- But what is the Fourier Transform? A visual introduction: https://www.youtube.com/watch?v=spUNpyF58BY
    - This video was used to understand the concept of the Fourier Transform and how it can be used to convert a signal from the time domain to the frequency domain.
- Note Recognition in Python: https://medium.com/@ianvonseggern/note-recognition-in-python-c2020d0dae24
    - This article was used to understand the concept of note recognition and how it can be implemented in Python.
    - I did not end up using the code from this article but it was helpful in understanding the concept of note recognition.
    - I did take the code to record and play audio from this article.

## Next Steps

My main goal for the next project is to build an interactive UI that displays the FFT of the audio signal and the notes being played in real time. I was not able to implement this here, so it will be the focus of the next project.

One of my main concerns is that performing an FFT on the audio signal is computationally expensive, so it may not be possible to run it in real time. I will look into ways to make the FFT more efficient, or alternatively to run the FFT after recording and pair the results with a suggestion for how to transpose playback to better suit the user's needs.
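
A rough way to check the real-time concern is to time one windowed FFT against the per-frame time budget. The sketch below assumes a 44100 Hz sample rate and uses random noise as a stand-in for audio. It measures only the FFT itself, not plotting or image export, and the result will vary by machine.

```python
import time
import numpy as np

fs = 44100                      # assumed sample rate in Hz
fps = 30                        # frame rate used in this notebook
window_size = int(fs * 0.25)    # samples per FFT window

rng = np.random.default_rng(0)
sample = rng.standard_normal(window_size)   # stand-in for real audio
window = np.hanning(window_size)            # stand-in window function

def time_one_fft(repeats=50):
    """Average wall-clock seconds for one windowed real FFT."""
    start = time.perf_counter()
    for _ in range(repeats):
        np.abs(np.fft.rfft(sample * window))
    return (time.perf_counter() - start) / repeats

per_fft = time_one_fft()
budget = 1.0 / fps
print(f"One FFT: {per_fft * 1e3:.3f} ms, frame budget: {budget * 1e3:.1f} ms")
```

If the FFT time is well under the budget, the remaining per-frame cost likely comes from other steps in the frame loop, which would be worth profiling separately.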