<a href="https://colab.research.google.com/github/lauratomokiyo/imspeaking/blob/main/EdTechProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ed Tech Project - Laura Tomokiyo
## Instructions

To run this notebook, run each code block separately, one at a time.  At the top left of each block is an arrow.  Click that arrow and wait until a check mark appears beside it.  Some blocks take a while to run and print output messages along the way.  If the output ends with something like "error", or you get a red mark instead of a green check mark, there is a problem with the code.  Please let me know so I can fix it.

## Initial Setup

In [None]:
# Initial imports and installs
!sudo apt install ffmpeg
!pip install torchaudio ipywebrtc notebook
!pip install ipywebrtc --upgrade
!jupyter nbextension enable --py widgetsnbextension
!pip install praatio
from ipywebrtc import AudioRecorder, CameraStream
import scipy
from IPython.display import Audio, display,HTML

from google.colab import output
output.enable_custom_widget_manager()

import ipywidgets as widgets
import matplotlib.pyplot as plt
import scipy.io.wavfile
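import librosa
import librosa.display  # explicit import so librosa.display.waveshow is available on older librosa versions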
import numpy as np
import soundfile as sf

from praatio import textgrid

mode = 'offline' # offline or live

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Enabling notebook extension jupyter-js-widgets/extension...
Paths used for configuration of notebook: 
    	/root/.jupyter/nbconfig/notebook.json
Paths used for configuration of notebook: 
    	
      - Validating: OK
Paths used for configuration of notebook: 
    	/root/.jupyter/nbconfig/notebook.json


## MFA Setup

This section sets up the Montreal Forced Aligner.  It takes some time to load and run.  The forced aligner is used to find the boundaries between words and phonemes in the speech signal.

Credit https://gist.github.com/NTT123/12264d15afad861cb897f7a20a01762e

Credit https://eleanorchodroff.com/tutorial/montreal-forced-aligner.html
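
As a rough illustration of what the alignment boundaries are used for later in this notebook, here is a minimal sketch that computes a speaking rate and an average vowel duration from a list of phone intervals. The intervals below are made-up example data, and the helper names `is_vowel` and `duration_statistics` are hypothetical. The sketch assumes the phone labels mark vowels with a stress digit (0, 1, or 2), as in ARPABET; a different phone set would need a different vowel test.

```python
# Hypothetical phone intervals as (start_s, end_s, label) tuples.
# In the notebook these would come from the "phones" tier of an MFA TextGrid.
VOWEL_STRESS_DIGITS = ("0", "1", "2")


def is_vowel(label):
    """Assumption: vowel labels carry a stress digit, e.g. AH0 or OW1."""
    return any(digit in label for digit in VOWEL_STRESS_DIGITS)


def duration_statistics(intervals):
    """Return (phones per second, mean vowel duration in seconds)."""
    if not intervals:
        raise ValueError("no intervals to summarize")
    # Speaking time runs from the start of the first phone to the end of the last.
    speaking_duration = intervals[-1][1] - intervals[0][0]
    vowel_durations = [end - start for start, end, label in intervals if is_vowel(label)]
    speaking_rate = len(intervals) / speaking_duration
    # Guard against utterances in which no vowel was aligned.
    mean_vowel = sum(vowel_durations) / len(vowel_durations) if vowel_durations else 0.0
    return speaking_rate, mean_vowel


example = [(0.0, 0.1, "HH"), (0.1, 0.3, "AH0"), (0.3, 0.4, "L"), (0.4, 0.8, "OW1")]
rate, mean_vowel = duration_statistics(example)
print(f"{rate:.2f} phones/s, mean vowel duration {mean_vowel:.2f} s")
```

Working on plain tuples keeps the arithmetic separate from the aligner, so it can be checked without running MFA.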

In [None]:
%%writefile install_mfa.sh
#!/bin/bash

## a script to install Montreal Forced Aligner (MFA)

root_dir=${1:-/tmp/mfa}
mkdir -p $root_dir
cd $root_dir

# download miniconda3
wget -q --show-progress https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $root_dir/miniconda3 -f

# create the aligner conda env (the commented-out lines are alternative variants)
#$root_dir/miniconda3/bin/conda create -n aligner -c conda-forge openblas python=3.8 openfst pynini ngram baumwelch -y
$root_dir/miniconda3/bin/conda create -n aligner -c conda-forge montreal-forced-aligner
#$root_dir/miniconda3/bin/conda create -n aligner -c conda-forge python=3.11 montreal-forced-aligner=3.0.2 kalpy=0.6.2 kaldi=5.5.1112=cpu*
source $root_dir/miniconda3/bin/activate aligner

# install mfa, download kaldi
pip install montreal-forced-aligner
pip install git+https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner.git # install latest updates
conda install -yq pytorch torchvision torchaudio cpuonly -c pytorch

#mfa thirdparty download
mfa download

echo -e "\n======== DONE =========="
echo -e "\nTo activate MFA, run: source $root_dir/miniconda3/bin/activate aligner"
echo -e "\nTo delete MFA, run: rm -rf $root_dir"
echo -e "\nSee: https://montreal-forced-aligner.readthedocs.io/en/latest/aligning.html to learn how to use MFA"



Overwriting install_mfa.sh


Important: the next code block requires the user to respond "y" to a "y/n" question during the installation process.  Watch the output of the installation until you see this question, and then click at the end of that line of text and type "y".

In [None]:
# download and install mfa
INSTALL_DIR="/tmp/mfa" # path to install directory

# make sure the environment and dependencies are right for MFA
!bash ./install_mfa.sh {INSTALL_DIR}
!source {INSTALL_DIR}/miniconda3/bin/activate aligner; mfa align --help

# install sox tool
!sudo apt install -q -y sox

# download an english acoustic model (needs to be downloaded because we're in a colab?)
!wget -q --show-progress https://github.com/MontrealCorpusTools/mfa-models/raw/main/acoustic/english.zip


PREFIX=/tmp/mfa/miniconda3
Unpacking payload ...

Installing base environment...

Preparing transaction: ...working... done
Executing transaction: ...working... done
installation finished.
    You currently have a PYTHONPATH environment variable set. This may cause
    unexpected behavior when running the Python interpreter in Miniconda3.
    For best results, please verify that your PYTHONPATH only points to
    directories of packages that are compatible with the Python interpreter
    in Miniconda3: /tmp/mfa/miniconda3

Remove existing environment?
This will remove ALL directories contained within this specified prefix directory, including any other conda environments.

 (y/[n])? 

In [None]:
# Get audio files
# these are the reference files that have been pre-recorded, and also pre-recorded learner files for debugging

!wget https://raw.githubusercontent.com/lauratomokiyo/imspeaking/main/r4.zip
!wget https://raw.githubusercontent.com/lauratomokiyo/imspeaking/main/rm4.zip
!wget https://raw.githubusercontent.com/lauratomokiyo/imspeaking/main/promptlex.txt
!unzip /content/r4.zip  # learner
!unzip /content/rm4.zip # male reference

reffiledir = "/content/reference_audio/"

learnerfiledir = "/content/learner_baseline_audio/"
if mode == 'offline':
  !wget https://raw.githubusercontent.com/lauratomokiyo/imspeaking/main/f.zip
  !unzip /content/f.zip

tempalignmentdir = '/content/mfa/'



In [None]:
prompts = [
  "I'd like to make an appointment.",
  "Can I change my flight return date?",
  "I'd like the blue cheese burger with bacon.",
  "How can I get to the train station?",
  "That's a difficult situation.",
  "Could we get the check, please?",
  "Does this bus go to the airport?",
  "You could ask him.",
  "Is it okay if I join you?",
  "Tariffs are paid by the consumer."
]

In [None]:
reffiledir = "/content/reference_audio_male/"

def get_duration_statistics(tier):
  # tier is the phone tier; vowel labels carry a stress digit (0, 1, or 2), e.g. AH0 or OW1
  vowel_count = 0
  vowel_duration = 0
  for (start,stop,phone) in tier.entries:
    if any(digit in phone for digit in ['0','1','2']):
      vowel_count += 1
      vowel_duration += (stop-start)

  nphones = len(tier)
  utt_start = tier.entries[0][0]
  utt_end = tier.entries[-1][1]
  speaking_duration = utt_end-utt_start
  speaking_rate = nphones/speaking_duration  # phones per second
  # avoid dividing by zero when the alignment contains no vowels
  ave_vowel_dur = vowel_duration / vowel_count if vowel_count else 0.0
#  print(f'there is {speaking_duration} total speaking time and {nphones} phonemes')
  return(speaking_rate,ave_vowel_dur)

def plot_contours(idx,mode='live',recorder_object=None):
#idx = 1
  ref_file = f'{reffiledir}r{idx}.wav'
  learner_file = f'f{idx}.wav'
  print(f'learner file is {learner_file}')

  # write file if needed and load audio
  if mode == 'live':
#    recorder_object = f'recorder{idx}.audio.value'
#    print(f'recorder object is {recorder_object}')
#    with open('recording.webm', 'wb') as f:
#      f.write(recorder_object)
#    !ffmpeg -i recording.webm -ac 1 -f wav learner_file -y -hide_banner -loglevel panic
    y_l,sr_l = librosa.load(learner_file)
  elif mode == 'offline':
    print(f'Prompt is: {prompts[idx-1]}')
    y_l,sr_l = librosa.load(f'{learnerfiledir}f{idx}.wav')
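  else:
    # stop here: without loaded audio the rest of the function would fail with a NameError
    raise ValueError(f"unknown mode {mode!r}; expected 'live' or 'offline'")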

  # make temporary directory for alignments
  !rm -rf /content/mfa/
  !mkdir -p /content/mfa/sourcefiles/
  !mkdir -p /content/mfa/aligned/

  # write prompt text file to learner file dir
#  with open (f'{tempalignmentdir}t{idx}.txt','w') as f:
  with open (f'{tempalignmentdir}sourcefiles/f{idx}.txt','w') as f:
#  with open (f'/content/mfa/aligned/t{idx}.txt','w') as f:
    f.write(prompts[idx-1])

  # load reference audio files
  y_r,sr_r = librosa.load(ref_file)
  # trim silence and mouse clicks from learner audio
  y_trimmed = trim_silence(trim_clicks(y_l))
  # write temporary copy of learner wave file to alignment directory
  sf.write(f"{tempalignmentdir}sourcefiles/f{idx}.wav", y_trimmed, sr_l)

  # get alignments
  # this is SUPER slow, even with minimal lexicon
  !source {INSTALL_DIR}/miniconda3/bin/activate aligner; mfa align --clean -j 1 /content/mfa/sourcefiles promptlex.txt english.zip /content/mfa/aligned
  tg = textgrid.openTextgrid(f'/content/mfa/aligned/f{idx}.TextGrid',includeEmptyIntervals=False)
#  textgrid_entries = tg.getTier('words')
  textgrid_entries = tg.getTier('phones')
  (spk_rate,vowel_len) = get_duration_statistics(textgrid_entries)
  #print(f'speaking rate is {spk_rate:.2f} and average vowel duration is {vowel_len:.2f}')
  print(f'phone sequence is {[p for (s,e,p) in textgrid_entries]}')
  # PEPLOs pitch extraction
  # average male: 120: average female: 220
#  fo, voiced_flag, voiced_probs = librosa.pyin(y_l, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7')) # original recording
  f0, voiced_flag, voiced_probs = librosa.pyin(y_trimmed, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7')) # trimmed recording
  f0_ref, vf_ref, vp_ref = librosa.pyin(y_r, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7')) # reference recording

  # plot against length of longest utterance
#  timeo = librosa.frames_to_time(np.arange(len(fo)), sr=sr_l) # original
  time1 = librosa.frames_to_time(np.arange(len(f0)), sr=sr_l) # trimmed
  time2 = librosa.frames_to_time(np.arange(len(f0_ref)), sr=sr_r) # reference
  dur1 = librosa.get_duration(y=y_trimmed,sr=sr_l)
  dur2 = librosa.get_duration(y=y_r,sr=sr_r)
  dur_max = max(dur1,dur2)

  # Plot waveform of learner - untrimmed
#  plt.figure(figsize=(10, 3))
#  librosa.display.waveshow(y_l, sr=sr_l, alpha=0.5)
#  plt.xlim(0,dur_max)
#  plt.xlabel(f'Time (s) - ORIGINAL UNTRIMMED')
#  plt.ylabel('Amplitude')
#  plt.show()

  # Plot waveform of learner - trimmed
#  plt.figure(figsize=(10, 3))
#  librosa.display.waveshow(y_trimmed, sr=sr_l, alpha=0.5)
#  plt.xlim(0,dur_max)
#  plt.xlabel(f'Time (s) - TRIMMED')
#  plt.ylabel('Amplitude')
#  plt.show()

  # Plot waveform of learner
  plt.figure(figsize=(10, 3))
  librosa.display.waveshow(y_trimmed, sr=sr_l, alpha=0.5)

  # add alignments
  for start, end, word in textgrid_entries:
    plt.axvspan(start, end, color='lightgray', alpha=0.5)  # Highlight the word region
    plt.text((start + end) / 2, -0.1, word, horizontalalignment='center', verticalalignment='center', fontsize=8, color='black')
    plt.axvline(start,color='r',linestyle='--',alpha=0.15)
    plt.axvline(end,color='r',linestyle='--',alpha=0.15)


  plt.xlim(0,dur_max)
  plt.xlabel(f'Time (s) - Learner speaking rate {spk_rate:.2f} phonemes/second and average vowel length {vowel_len:.2f} seconds')
  plt.ylabel('Amplitude')
  plt.show()

  # Plot contours
  plt.figure(figsize=(10, 3))
#  plt.plot(timeo, fo, color='r', label='untrimmed learner audio')
  plt.plot(time1, f0, color='b', label='learner audio')
  plt.plot(time2, f0_ref, color='g', label='reference audio')
  plt.xlabel('Time (s)')
  plt.ylabel('Frequency (Hz)')
  plt.title('Pitch Contour')
  plt.legend()
  plt.show()

  # get alignments

  # provide listening
  print("Click the button below to play the learner audio:")
  display(Audio(y_trimmed, rate=sr_l))
  print("Click the button below to play the reference audio:")
  display(Audio(y_r, rate=sr_r))

def trim_clicks(audiofile):
  # assume hop_length = 512
  # click is about 0.05s
  trim_dur = 512*3
  return(audiofile[trim_dur:-trim_dur])

def trim_silence(audiofile):
  trimmed,index = librosa.effects.trim(audiofile,top_db=20)
  # add a small amount of silence back for naturalness
  buffer_length = 10*512 # 10 frames * 512 hop length
  start_index = max(0,index[0] - buffer_length)
  end_index = min(index[1] + buffer_length, len(audiofile))
  audio_with_buffer = audiofile[start_index:end_index]
  return(audio_with_buffer)


In [None]:
mode='live'
if mode=='live':
  camera = CameraStream(constraints={'audio': True,'video':False})
  recorder1 = AudioRecorder(stream=camera)
  recorder2 = AudioRecorder(stream=camera)
  recorder3 = AudioRecorder(stream=camera)
  recorder4 = AudioRecorder(stream=camera)
  recorder5 = AudioRecorder(stream=camera)
  recorder6 = AudioRecorder(stream=camera)
  recorder7 = AudioRecorder(stream=camera)
  recorder8 = AudioRecorder(stream=camera)
  recorder9 = AudioRecorder(stream=camera)
  recorder10 = AudioRecorder(stream=camera)


## OFFLINE TESTING

In [None]:
plot_contours(3,mode='offline')


## LIVE TESTING

### Prompt 1

**BOTH** code blocks must be run to process a new recording.  The first one resets the recorder and asks you to press once to record and again to stop.  Once you have recorded, run the second code block.  It picks up your new recording and processes it.
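
The second code block of each prompt repeats the same two steps: save the recorder's bytes as `recording.webm`, then convert that file to a mono WAV with ffmpeg. As a sketch only, those steps could be wrapped in one helper. The names `build_ffmpeg_command` and `save_recording` are hypothetical and are not used elsewhere in this notebook; the ffmpeg flags are the same ones the cells below pass, with the global options moved to the front.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(webm_path, wav_path):
    """Build the ffmpeg call that converts a WebM recording to a mono WAV file."""
    return [
        "ffmpeg", "-y", "-hide_banner", "-loglevel", "panic",
        "-i", str(webm_path),
        "-ac", "1",          # downmix to one channel
        "-f", "wav",
        str(wav_path),
    ]


def save_recording(audio_bytes, idx, workdir="."):
    """Write the recorded bytes to recording.webm and convert them to f{idx}.wav."""
    if not audio_bytes:
        # An empty value usually means nothing was recorded yet.
        raise ValueError("recorder is empty; record something first")
    webm_path = Path(workdir) / "recording.webm"
    wav_path = Path(workdir) / f"f{idx}.wav"
    webm_path.write_bytes(audio_bytes)
    # check=True raises CalledProcessError if ffmpeg fails.
    subprocess.run(build_ffmpeg_command(webm_path, wav_path), check=True)
    return wav_path


# Usage in a notebook cell (assumes recorder1 already holds a recording):
# save_recording(recorder1.audio.value, 1)
# plot_contours(1, mode='live')
```

Raising an error on an empty recording makes a skipped first code block visible right away, instead of passing an unreadable file on to ffmpeg.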

In [None]:
print(f'Say the sentence "{prompts[0]}"')
recorder1


In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder1.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f1.wav -y -hide_banner -loglevel panic
plot_contours(1,mode='live',recorder_object=recorder1.audio.value)

### Prompt 2

In [None]:
print(f'Say the sentence "{prompts[1]}"')
recorder2

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder2.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f2.wav -y -hide_banner -loglevel panic
plot_contours(2)

### Prompt 3

In [None]:
print(f'Say the sentence "{prompts[2]}"')
recorder3

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder3.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f3.wav -y -hide_banner -loglevel panic
plot_contours(3)

### Prompt 4

In [None]:
print(f'Say the sentence "{prompts[3]}"')
recorder4

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder4.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f4.wav -y -hide_banner -loglevel panic
plot_contours(4)

### Prompt 5


In [None]:
print(f'Say the sentence "{prompts[4]}"')
recorder5

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder5.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f5.wav -y -hide_banner -loglevel panic
plot_contours(5)

### Prompt 6

In [None]:
print(f'Say the sentence "{prompts[5]}"')
recorder6

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder6.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f6.wav -y -hide_banner -loglevel panic
plot_contours(6)

### Prompt 7

In [None]:
print(f'Say the sentence "{prompts[6]}"')
recorder7

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder7.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f7.wav -y -hide_banner -loglevel panic
plot_contours(7)

### Prompt 8

In [None]:
print(f'Say the sentence "{prompts[7]}"')
recorder8

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder8.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f8.wav -y -hide_banner -loglevel panic
plot_contours(8)

### Prompt 9

In [None]:
print(f'Say the sentence "{prompts[8]}"')
recorder9

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder9.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f9.wav -y -hide_banner -loglevel panic
plot_contours(9)

### Prompt 10

In [None]:
print(f'Say the sentence "{prompts[9]}"')
recorder10

In [None]:
with open('recording.webm', 'wb') as f:
    f.write(recorder10.audio.value)
!ffmpeg -i recording.webm -ac 1 -f wav f10.wav -y -hide_banner -loglevel panic
plot_contours(10)

## HANDY-MAYBE?


In [None]:
# reading and writing wave files
for i in range(1,len(prompts)+1):
  y,sr = librosa.load(f'{learnerfiledir}f{i}.wav')
  yt = trim_silence(trim_clicks(y))
  sf.write(f"ft{i}.wav", yt, sr)

#data, samplerate = sf.read('existing_file.wav')
#sf.write('new_file.flac', data, samplerate)

In [None]:
# accessing text grid
tg = textgrid.openTextgrid(f'/content/mfa/aligned/f3.TextGrid',includeEmptyIntervals=False)
w =   tg.getTier('phones')
vowel_count = 0
vowel_duration = 0
for (start,stop,word) in w.entries:
  print(f'{start} {word} {stop}')
  if any(digit in word for digit in ['0','1','2']):
    vowel_count += 1
    vowel_duration += (stop-start)


nwords = len(w)
utt_start = w.entries[0][0]
utt_end = w.entries[-1][1]
speaking_duration = utt_end-utt_start
ave_vowel_dur = vowel_duration / vowel_count

print (f'{nwords} phonemes starting at {utt_start} and ending at {utt_end} total time {speaking_duration}')
print(f'speaking rate is {nwords/speaking_duration:.3f} phones per second')
print(f'average vowel duration is {ave_vowel_dur:.3f} seconds')

### EXTRA STUFF - IGNORE

In [None]:
## using torchaudio to load and display audio
import torchaudio
yy,ss = torchaudio.load('/content/f10.wav')
audiopath = '/content/f1.wav'
ex_waveform, SR = torchaudio.load(audiopath)
plt.plot(np.arange(0,ex_waveform.shape[1])/SR, ex_waveform[0])
Audio(audiopath)

In [None]:
reffiledir = "/content/reference_audio/"
# Load the two audio files (or one from user recording and another pre-recorded)
y1, sr1 = librosa.load('f1.wav')   # Corresponding to the red pitch contour
y2, sr2 = librosa.load(f'{reffiledir}r1.wav')  # Corresponding to the blue pitch contour

# Extract pitch contours for both audio files using librosa's piptrack function
pitches1, magnitudes1 = librosa.core.piptrack(y=y1, sr=sr1)
pitches2, magnitudes2 = librosa.core.piptrack(y=y2, sr=sr2)

# For each frame of the first audio, take the pitch of the strongest-magnitude bin; frames with no detected pitch become NaN
pitch_contour1 = []
for t in range(pitches1.shape[1]):
    index = magnitudes1[:, t].argmax()
    pitch = pitches1[index, t]
    if pitch > 0:
        pitch_contour1.append(pitch)
    else:
        pitch_contour1.append(np.nan)

# For each frame of the second audio, take the pitch of the strongest-magnitude bin; frames with no detected pitch become NaN
pitch_contour2 = []
for t in range(pitches2.shape[1]):
    index = magnitudes2[:, t].argmax()
    pitch = pitches2[index, t]
    if pitch > 0:
        pitch_contour2.append(pitch)
    else:
        pitch_contour2.append(np.nan)

# Plot the two pitch contours on the same plot
plt.figure(figsize=(10, 6))
plt.plot(pitch_contour1, label='First Audio (Red)', color='red')
plt.plot(pitch_contour2, label='Second Audio (Blue)', color='blue')
plt.xlabel('Frames')
plt.ylabel('Pitch (Hz)')
plt.title('Pitch Contour of Two Audio Files')
plt.legend()
plt.grid(True)
plt.show()

# Display two buttons to play each audio file
print("Click the buttons below to play the corresponding audio files:")

# Audio player for the first audio (Red line)
display(Audio(y1, rate=sr1))

# Audio player for the second audio (Blue line)
display(Audio(y2, rate=sr2))

# Add side-by-side layout for the buttons
display(HTML("""
<div style="display:flex; justify-content: space-around;">
    <div>
        <button onclick="document.querySelector('audio:nth-of-type(1)').play()">Play Red Audio</button>
    </div>
    <div>
        <button onclick="document.querySelector('audio:nth-of-type(2)').play()">Play Blue Audio</button>
    </div>
</div>
"""))


In [None]:
#import librosa.display
#import matplotlib.pyplot as plt

plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr) # on older librosa versions without waveshow, use waveplot instead
plt.show()



In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# 1. Upload an audio file
#from google.colab import files
#uploaded = files.upload()

# Assuming the audio file is uploaded and the filename is extracted
#filename = list(uploaded.keys())[0]

# 2. Load the audio file
y, sr = librosa.load("file2.wav", sr=None)  # y is the audio time series, sr is the sampling rate

# 3. Extract pitch (fundamental frequency) using librosa's piptrack function
# The piptrack function returns both the pitch and the magnitude for each frame.
pitches, magnitudes = librosa.core.piptrack(y=y, sr=sr)

# For each frame, take the pitch of the strongest-magnitude bin
pitch_contour = []
for t in range(pitches.shape[1]):
    index = magnitudes[:, t].argmax()
    pitch = pitches[index, t]
    if pitch > 0:  # Filter out frames without detected pitch
        pitch_contour.append(pitch)
    else:
        pitch_contour.append(np.nan)

# 4. Plot the pitch contour
plt.figure(figsize=(10, 6))
plt.plot(pitch_contour, label='Pitch Contour', color='blue')
plt.xlabel('Frames')
plt.ylabel('Pitch (Hz)')
plt.title('Pitch Contour of the Audio File')
plt.legend()
plt.grid(True)
plt.show()
