# Spoken Language Processing 2022-23

# Lab3 - Dialogue Systems

_Bruno Martins_


This lab assignment will introduce tools and concepts related to the development of dialogue systems, exemplifying also the use of automatic speech recognition and text-to-speech models.

Students will be tasked with the development of a simple (spoken/conversational) question answering system, reusing different models associated with the HuggingFace Transformers library:

* Speech recognition models (e.g., OpenAI Whisper).
* Large language models for natural language understanding and generation (e.g., GPT-2 or Alpaca models).
* Text-to-speech models (e.g., SpeechT5).

The first parts of this notebook will guide students in the use of the tools, while the last part presents the main problem that is to be tackled. Note that the first parts also feature intermediate tasks, which students are required to solve.

To complete the project, student groups must submit through Fenix an updated version of this notebook, featuring the proposed solutions to each task, together with a small PDF report (2 pages) outlining the methods that were developed (you can use the [following Overleaf template](https://www.overleaf.com/latex/templates/interspeech-2023-paper-kit/kzcdqdmkqvbr) for the report).

Students are encouraged to modify examples, incorporate any other techniques, and in general explore any approach that may permit improving the results. Assessment will be based on task completion, creativity in the proposed solutions, and overall accuracy over a benchmark dataset.

### Group identification

Initialize the variable `group_id` with the number that Fenix assigned to your group and `student1_name`, `student1_id`, `student2_name` and `student2_id` with your names and student numbers.

In [None]:
group_id = 3

student1_name = "Duarte Almeida"
student1_id = 95565

student2_name = "Leonor Barreiros"
student2_id = 95618

print(f"Group number: {group_id}")
print(f"Student 1: {student1_name} ({student1_id})")
print(f"Student 2: {student2_name} ({student2_id})")

In [None]:
assert isinstance(group_id, int) and isinstance(student1_id, int) and isinstance(student2_id, int)
assert isinstance(student1_name, str) and isinstance(student2_name, str)
assert (group_id > 0) and (group_id < 40)
assert (student1_id > 60000) and (student1_id < 120000) and (student2_id > 60000) and (student2_id < 120000)

# Python packages

NumPy is a Python library that provides functions to process multidimensional array objects. The NumPy documentation is available [here](https://numpy.org/doc/1.24/).

[Librosa](https://librosa.org/) is a Python package for analyzing and processing audio signals. It provides a wide range of tools for tasks such as loading and manipulating audio files, extracting features from audio signals, and visualizing and playing back audio data.

IPython display is a module in the IPython interactive computing environment that provides a set of functions for displaying various types of media in Jupyter notebooks and other IPython-compatible environments. For example, the `display()` function renders an object, such as an audio object, in a notebook cell.

Matplotlib is a popular Python library that allows users to create a wide range of visualizations using a simple and intuitive syntax.

HuggingFace Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models based on the Transformer architecture. The documentation is available [here](https://huggingface.co/docs/transformers/index); for more details, see the official [HuggingFace course](https://huggingface.co/course/chapter1/1).

The associated HuggingFace libraries named [datasets](https://huggingface.co/docs/datasets/index) and [evaluate](https://huggingface.co/docs/evaluate/index) respectively support direct access to many well-known datasets and to common evaluation metrics used in NLP and speech research.

In [None]:
!pip3 install sentencepiece
!pip3 install xformers
!pip3 install transformers
!pip3 install datasets
!pip3 install evaluate
!pip3 install jiwer
!pip3 install librosa

In [1]:
import transformers
import datasets
import evaluate
import numpy as np
import librosa
import librosa.display
from IPython.display import Audio
from matplotlib import pyplot as plt



# Using OpenAI Whisper

Whisper is an exciting new model for Automatic Speech Recognition (ASR), developed by OpenAI and made available through the HuggingFace Transformers library.

The following example illustrates the use of the Whisper model to transcribe a small audio sample taken from the LibriSpeech dataset (which is available through the HuggingFace datasets library).

More detailed information about Whisper, including information on how to fine-tune the model with task-specific data, is available on a [tutorial in the HuggingFace blog](https://huggingface.co/blog/fine-tune-whisper).

In [None]:
import torch
import librosa
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

audio = ds[0]["audio"]["array"]
audio = librosa.resample(audio, orig_sr=16000, target_sr=16000) # Resample audio to 16kHz (not needed in the case of this dataset)
print(audio)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features

display(Audio(audio, rate=16000)) # You are able to hear the audio inputs

generated_ids = model.generate(inputs=input_features)
transcription = processor.batch_decode(generated_ids, max_length=250, skip_special_tokens=True)[0]

print(transcription)

Automatic Speech Recognition (ASR) models are frequently evaluated through the Word Error Rate (WER).

The WER is derived from the Levenshtein distance, working at the word level and aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. The metric can then be computed as:

WER = (S + D + I) / N = (S + D + I) / (S + D + C),

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N=S+D+C). The WER value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system, with a WER of 0 being a perfect score.
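To make the formula concrete, the following is a minimal sketch of a word-level Levenshtein computation in plain Python. The function name `word_error_rate` is illustrative and is not part of any library; whitespace tokenization is an assumption made for simplicity.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # cost of deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    # (S + D + I) / N
    return previous[-1] / len(ref)


print(word_error_rate("this is the reference", "this is the prediction"))    # 0.25
print(word_error_rate("there is another one", "there is an other sample"))  # 0.75
```

In the second pair, "another one" is replaced by "an other sample": two substitutions and one insertion over four reference words. Since insertions are counted, the WER can exceed 1.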

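The example below illustrates the computation of the WER for two paired examples of a generated sentence versus a reference sentence. The score produced as output pools the two examples together, dividing the total number of errors by the total number of reference words, rather than averaging the per-example values.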

In [None]:
from evaluate import load

wer = load("wer")
predictions = ["this is the prediction", "there is an other sample"]
references = ["this is the reference", "there is another one"]
wer_score = wer.compute(predictions=predictions, references=references)

print(wer_score)

## Intermediate tasks:

* Collect two small audio samples with your own voice, together with a transcription of the spoken messages. The following [example shows how to record audio from your microphone within a Python notebook running on Google Colab](https://colab.research.google.com/gist/ricardodeazambuja/03ac98c31e87caf284f7b06286ebf7fd/microphone-to-numpy-array-from-your-browser-in-colab.ipynb#scrollTo=H4rxNhsEpr-c), but you can use any other method to collect the audio samples.
* Use the Whisper speech recognition model to transcribe the two spoken messages that were collected.
* Use the transcriptions to compute the word error rate.
* Experiment with the use of different recognition models (e.g., larger Whisper models), and see if the error rate changes.
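When comparing recognizers, note that casing and punctuation count as word differences in the WER, and different models may or may not output them. The sketch below shows one possible normalization step applied to both predictions and references before scoring. The helper name `normalize_for_wer` is hypothetical, and the rules (lowercasing, dropping ASCII punctuation) are a simplifying assumption rather than a standard.

```python
import string


def normalize_for_wer(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


print(normalize_for_wer("Hi! I'm very happy today. What's your name?"))
# hi im very happy today whats your name
```

Removing apostrophes merges contractions ("I'm" becomes "im"), which is acceptable as long as the same function is applied to both sides of the comparison.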

In [None]:
# # # # # # # # # # # # # # # # # #
# SPEECH RECOGNITION WITH WHISPER #
# # # # # # # # # # # # # # # # # #

from transformers import AutoProcessor, WhisperForConditionalGeneration

processors = []
models = []
audios = []
references = ["The exhibition is about to commence.", "Hi! I'm very happy today. What's your name?"]
names = ["whisper-tiny", "whisper-small", "whisper-medium"]

utt_1, _ = librosa.load("audio_duarte.wav", sr=16000)
utt_1 = librosa.resample(utt_1, orig_sr=16000, target_sr=16000)
display(Audio(utt_1, rate=16000))
audios.append(utt_1)

utt_2, _ = librosa.load("audio_leonor.wav", sr=16000)
utt_2 = librosa.resample(utt_2, orig_sr=16000, target_sr=16000)
display(Audio(utt_2, rate=16000))
audios.append(utt_2)

processors.append(AutoProcessor.from_pretrained("openai/whisper-tiny.en"))
models.append(WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en"))

processors.append(AutoProcessor.from_pretrained("openai/whisper-small.en"))
models.append(WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en"))

processors.append(AutoProcessor.from_pretrained("openai/whisper-medium.en"))
models.append(WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en"))

wer = load("wer")  # load the metric once, outside the loop

for (processor, model, name) in zip(processors, models, names):
    print(f"Model {name}: ")
    predictions = []
    for audio in audios:
        inputs = processor(audio = audio, sampling_rate=16000, return_tensors="pt")
        input_features = inputs.input_features
        generated_ids = model.generate(inputs=input_features)
        prediction = processor.batch_decode(generated_ids, max_length=250, skip_special_tokens=True)[0]
        predictions.append(prediction)
        print("\tPrediction: ", prediction)
    wer_score = wer.compute(predictions=predictions, references=references)

    print(f"WER score: {wer_score}")


In [None]:
# # # # # # # # # # # # # # # # # # #
# SPEECH RECOGNITION WITH SPEECHT5  #
# # # # # # # # # # # # # # # # # # #

from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

audios = {"task_1/audio_duarte.wav" : "The exhibition is about to commence.", "task_1/audio_leonor.wav": "Hi! I'm very happy today. What's your name?"}
for audio in audios:
    utt, st = librosa.load(audio, sr=16000)
    utt = librosa.resample(utt, orig_sr=16000, target_sr=16000)
    display(Audio(utt, rate=16000))
    inputs = processor(audio = utt, sampling_rate=16000, return_tensors="pt")
    predicted_ids = model.generate(**inputs, max_length=100)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)


    print(transcription)

    wer = load("wer")
    predictions = transcription
    print(audio)
    references = [audios[audio]]
    wer_score = wer.compute(predictions=predictions, references=references)

    print(wer_score)

# Using LLMs for conditional language generation

OpenAI GPT-2 is a transformer-based language model whose largest variant has 1.5 billion parameters, trained on a dataset of 8 million web pages (the `gpt2` checkpoint used below is the smallest variant, with roughly 124 million parameters). GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. Thus, GPT-2 can be used to address problems like question answering, modeling the task as language generation conditioned on the question (plus other relevant additional context).

The following example illustrates the use of GPT-2 through the HuggingFace Transformers library. In this case, instead of using the model directly, we are using the model through the pipeline API, which facilitates the adaptation to other LLMs. The `pipeline()` function connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

In [None]:
from transformers import pipeline, set_seed

set_seed(42) # make results deterministic

generator = pipeline(model='gpt2')
generator("Who is the president of the United States? The answer is", max_length=15, num_return_sequences=1)

## Intermediate tasks:

* Adapt the example showing how to use GPT-2 to do question answering over the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) (available from HuggingFace datasets).
* Evaluate the results obtained with different models (e.g., [Alpaca-based models](https://huggingface.co/declare-lab/flan-alpaca-base)) and/or different usage strategies (e.g., consider prompting, parameter efficient fine-tuning, etc.).
* Compute the error over the first 1000 examples from the validation split from the SQuAD dataset, using the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu) for comparing the generated answers against the ground truth.
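BLEU is built on clipped n-gram precision: each n-gram in the prediction counts as a match at most as many times as it occurs in the reference. The sketch below computes only this building block for a single n; the full metric additionally combines several n-gram orders and applies a brevity penalty. The function names are illustrative, and whitespace tokenization is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction: str, reference: str, n: int = 1) -> float:
    """Fraction of predicted n-grams that also occur in the reference, with clipping."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0  # the prediction is too short to contain any n-gram
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total


print(clipped_precision("the the the", "the cat"))                 # 1 match out of 3 unigrams
print(clipped_precision("Denver Broncos", "Denver Broncos", n=2))  # 1.0
print(clipped_precision("Broncos", "Denver Broncos", n=2))         # 0.0, no bigrams in the prediction
```

The last call hints at why higher-order n-gram statistics can be unforgiving for very short answers, such as the ones that are typical in SQuAD.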


In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# USING GPT-2 OR ALPACA-BASED TO DO QA OVER SQUAD DATASET #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

from transformers import pipeline, set_seed, logging
from datasets import load_dataset
from tqdm import tqdm

logging.set_verbosity_error()

set_seed(42) # make results deterministic

ds = load_dataset("squad", split="validation[:200]")

generators = []
names = ["flan-alpaca-gpt4-xl", "gpt2"]
generators.append(pipeline(model="declare-lab/flan-alpaca-gpt4-xl"))
generators.append(pipeline(model='gpt2'))

for (generator, name) in zip(generators, names):
    predictions = []
    references = []

    print(f"Evaluating {name}...")
    for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
        question = sample["question"] + " The answer is"
        if name == "gpt2":
          prediction = generator(question, max_length=20, num_return_sequences=1,  pad_token_id=generator.tokenizer.eos_token_id)[0]["generated_text"][len(question):]
        else:
          prediction = generator(question, max_length=20, num_return_sequences=1,  pad_token_id=generator.tokenizer.eos_token_id)[0]["generated_text"]

        reference  = sample["answers"]["text"][0]
        predictions.append(prediction)
        references.append(reference)

    bleu = evaluate.load("bleu")
    results = bleu.compute(predictions=predictions, references=references)
    print(f"Bleu score: {results['bleu']}")

# both have terrible results!! Let's find alternatives

In [None]:
# # # # # # # # # #
# QA WITH CONTEXT #
# # # # # # # # # #

# Model 1, source: https://huggingface.co/MaRiOrOsSi/t5-base-finetuned-question-answering
# This model was fine-tuned for QA on a different dataset, but it was evaluated on the squad dataset

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# T5 is an encoder-decoder model; AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

ds = load_dataset("squad", split="validation[:200]")

predictions, references = [], []
for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
    reference  = sample["answers"]["text"][0]
    question = sample["question"]
    context = sample["context"]
    prompt = f"question: {question} context: {context}"  # avoid shadowing the built-in input()
    encoded_input = tokenizer([prompt],
                             return_tensors='pt',
                             max_length=512,
                             truncation=True)
    output = model.generate(input_ids = encoded_input.input_ids,
                            attention_mask = encoded_input.attention_mask)
    output = tokenizer.decode(output[0], skip_special_tokens=True)
    predictions.append(output)
    references.append(reference)

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(f"Bleu score: {results['bleu']}")

# Bleu score: 0.5328188114327255 --> we're getting there!


In [None]:
# # # # # # # # #
# FINAL RESULTS #
# # # # # # # # #

# Compute the error over the first 1000 examples from the validation split from the SQuAD dataset, using the BLEU metric for comparing the generated answers against the ground truth.
# NOTE: this is hardcoded because partial results were calculated separately using Google colab



# Bleu score: 0.7129344873580541
# Keep in mind that this model was trained on the same dataset we are evaluating it on,
# so the nature of the data is the same, which explains the strong results. However, we
# will conduct our experiments using another dataset, so we will probably need to do
# some fine-tuning (or else our results won't be as good).

# Using SpeechT5 for converting text-to-speech

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in different natural language processing tasks, the unified-modal SpeechT5 framework explores encoder-decoder pre-training for self-supervised speech/text representation learning.

The model is again conveniently available through the HuggingFace Transformers library. The following example illustrates the use of the SpeechT5 model for generating a spectrogram from a textual input, together with a neural vocoder model for producing a speech signal.

More detailed information about SpeechT5 is available on a [tutorial on the HuggingFace blog](https://huggingface.co/blog/speecht5).

In [None]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed
from IPython.display import Audio
from datasets import load_dataset
import soundfile as sf
import librosa
import torch
import matplotlib.pyplot

set_seed(42) # make results deterministic

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
speaker_embeddings = torch.zeros((1, 512))

# You can optionally use "speaker embeddings" to customize the output to a particular speaker’s voice characteristics
#embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
# speaker_embeddings = torch.tensor(embeddings_dataset[42]["xvector"]).unsqueeze(0)

spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
with torch.no_grad(): speech = vocoder(spectrogram)

# You can plot the generated spectrogram
# import matplotlib.pyplot as plt
# plt.figure()
# plt.imshow(spectrogram.T)
# plt.show()

librosa.display.waveshow(speech.numpy(), sr=16000) # You can plot the generated waveform
sf.write("tts_example.wav", speech.numpy(), samplerate=16000) # You can save the audio to a .wav file
display(Audio(speech.numpy(), rate=16000)) # You can listen to the generated audio

## Intermediate tasks:

* Connect the results from your answer to the previous intermediate task (i.e., conditioned language generation) to the SpeechT5 text-to-speech model, so as to produce speech outputs from the text generated by the model.
* Produce speech-based answers for the first 5 questions in the validation split from the SQuAD dataset.
* Connect also the results from your answer to the first intermediate task (i.e., automated speech recognition) to the SpeechT5 model and the LLM, so as to take spoken questions as input and produce a speech output.
* Collect small audio samples, with your own voice, for the first 5 questions in the validation split from the SQuAD dataset, and produce speech-based answers for these five questions.
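The tasks above amount to chaining three components: speech recognition, question answering, and speech synthesis. The following sketch shows that composition with the components passed in as plain callables, so that each one can be swapped independently. The function `spoken_qa` and the three stand-in functions are hypothetical placeholders for this illustration; in the notebook, they would wrap the Whisper, question answering, and SpeechT5 models.

```python
def spoken_qa(audio, context, transcribe, answer, synthesize):
    """Run one spoken question through ASR, QA, and TTS components."""
    question = transcribe(audio).strip()  # ASR output may carry leading whitespace
    reply = answer(question, context)
    return question, reply, synthesize(reply)


# Stand-in components, used here only to make the sketch runnable without models
def fake_transcribe(audio):
    return " Where did she live?"


def fake_answer(question, context):
    return "a barn near a farmhouse" if "live" in question else ""


def fake_synthesize(text):
    return [0.0] * len(text)  # one dummy sample per character


question, reply, speech = spoken_qa(
    [0.0] * 16000,
    "Cotton lived in a barn near a farmhouse.",
    fake_transcribe,
    fake_answer,
    fake_synthesize,
)
print(question)
print(reply)
print(len(speech))
```

Keeping the stages behind simple function boundaries also makes it easier to evaluate each one in isolation, for instance measuring the WER of the recognizer separately from the answer quality.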


In [None]:
!mkdir -p task_3.1/

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# PRODUCE SPEECH OUTPUTS FROM THE TEXT GENERATED BY THE MODEL #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed
import soundfile as sf
import torch

# Use SpeechT5 to produce speech outputs from the text generated by the model
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

# Placeholder speaker embedding (all zeros), as in the example above
speaker_embeddings = torch.zeros((1, 512))

for i, prediction in enumerate(predictions):
    inputs = processor(text=prediction, return_tensors="pt")
    spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
    with torch.no_grad():
        speech = vocoder(spectrogram)

    sf.write(f"task_3.1/{i}.wav", speech.numpy(), samplerate=16000) # Save the audio to a .wav file


In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
#  TAKE SPOKEN QUESTIONS AS INPUT AND PRODUCE A SPEECH OUTPUT #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

from transformers import pipeline, set_seed
import librosa
from transformers import AutoProcessor, WhisperForConditionalGeneration
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed

# Generate the text from the spoken questions
processor = AutoProcessor.from_pretrained("openai/whisper-medium.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")

files = ["task_3.2/squad_0.wav", "task_3.2/squad_1.wav", "task_3.2/squad_2.wav", "task_3.2/squad_3.wav", "task_3.2/squad_4.wav"]
questions = []
for file in files:
    utt, st = librosa.load(file, sr=16000)
    utt = librosa.resample(utt, orig_sr=16000, target_sr=16000)
    inputs = processor(audio = utt, sampling_rate=16000, return_tensors="pt")
    predicted_ids = model.generate(**inputs, max_length=100)
    questions.append(processor.batch_decode(predicted_ids, skip_special_tokens=True))

print("Questions: ", questions)

In [None]:
from datasets import load_dataset

# Apply the model to the transcriptions, to get the answer
model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
ds = load_dataset("squad", split="validation[:5]")

answers = []
j = 0
for sample in ds:
    question = questions[j][0] # generated question from the audio
    context = sample["context"]
    QA_input = {
        'question': question,
        'context': context
    }
    res = nlp(QA_input)
    answers.append(res['answer'])
    j += 1

print("Answers: ", answers)

In [None]:
!mkdir -p task_3.3/
import torch
import soundfile as sf

# Produce speech-based answers for these five questions
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

# Placeholder speaker embedding (all zeros)
speaker_embeddings = torch.zeros((1, 512))

for i, answer in enumerate(answers):
    inputs = processor(text=answer, return_tensors="pt")
    spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
    with torch.no_grad():
        speech = vocoder(spectrogram)

    sf.write(f"task_3.3/{i}.wav", speech.numpy(), samplerate=16000) # Save the audio to a .wav file

# Main problem

Students are tasked with joining together the speech recognition, language understanding and generation, and text-to-speech models, in order to build a conversational spoken question answering approach.

* The method should take as input speech utterances with questions.
* The language understanding and generation component should use as input a transcription for the current speech utterance, and optionally also transcriptions from previous speech utterances (i.e., the conversation context).
* The language understanding and generation component can explore different strategies for improving answer quality:
  * Use of larger LLMs trained with reinforcement learning from human feedback.
  * Prompting the language model with retrieved in-context examples.
  * Using parameter-efficient fine-tuning with existing conversational question answering datasets (e.g., [the CoQA dataset](https://stanfordnlp.github.io/coqa/), available from HuggingFace datasets).
  * ...
* The text-to-speech component takes as input the results from language generation, and produces a speech output.
* Both the automated speech recognition and the text-to-speech components can explore different approaches, although students should attempt to justify their choices (e.g., if changing the automated speech recognition component, show that it achieves a lower WER).
* Collect small audio samples, with your own voice, for the first two instances in the CoQA validation split, and show the results produced by your method for these examples.
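One possible way to handle the conversation context, sketched here under assumptions, is to keep the passage fixed and to prepend the most recent question/answer turns to the current question. The class `Conversation`, its fields, and the choice of keeping two previous turns are all hypothetical; this is one strategy among several, and not necessarily the best one for CoQA.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    passage: str                               # text the answers are extracted from
    turns: list = field(default_factory=list)  # (question, answer) pairs so far
    max_history: int = 2                       # number of previous turns to keep

    def build_question(self, question: str) -> str:
        """Prefix the current question with the most recent turns."""
        history = self.turns[-self.max_history:] if self.max_history > 0 else []
        parts = [f"{q} {a}" for q, a in history]
        parts.append(question)
        return " ".join(parts)

    def record(self, question: str, answer: str) -> None:
        """Store a completed turn for use in later questions."""
        self.turns.append((question, answer))


conv = Conversation(passage="Cotton lived in a barn near a farmhouse.")
print(conv.build_question("Where did she live?"))
conv.record("Where did she live?", "in a barn")
print(conv.build_question("Who did she live with?"))
```

Because the passage itself is never modified, an extractive model cannot select a span from an earlier question as its answer; the history only reaches the model through the question string.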

In [2]:
def is_question(utterance):
  # Ignore surrounding whitespace and avoid an IndexError on empty transcriptions
  return utterance.strip().endswith('?')

In [3]:
from transformers import pipeline, set_seed
import librosa
from transformers import AutoProcessor, WhisperForConditionalGeneration
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed

# cycle of listening to questions and replying to them, while keeping in mind the
# context of the conversation so far
context = ""
processor_asr = AutoProcessor.from_pretrained("openai/whisper-medium.en")
model_asr = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")

In [4]:
from  transformers  import  AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name_llm = "deepset/roberta-base-squad2"
nlp_llm = pipeline('question-answering', model=model_name_llm, tokenizer=model_name_llm)

In [5]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed

model_tts = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder_tts = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor_tts = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

In [10]:
# # # # # # # # # # # # # # # # # # # # # # # # # # #
# DIALOGUE SYSTEM - EXPERIMENTING WITH COQA DATASET #
# # # # # # # # # # # # # # # # # # # # # # # # # # #
import torch

n_questions = [12, 11]
for i in range(2): # first 2 instances in coqa validation split
  print("Ask me a question, or give me some context... :)")
  # # # # # # # # # # # # # #
  # STEP 1: ASR OF CONTEXT  #
  # # # # # # # # # # # # # #
  
  # TODO: split the audio into segments of at most 30 s
  # TODO: do not use an extractive model
  
  context_audio, _ = librosa.load(f"final_task/context_{i}.wav", sr=16000)
  context_audio = librosa.resample(context_audio, orig_sr=16000, target_sr=16000)
  
  inputs = processor_asr(audio = context_audio, sampling_rate=16000, return_tensors="pt")
  predicted_ids = model_asr.generate(**inputs, max_length=model_asr.config.max_target_positions) # the decoder cannot generate beyond this length
  context = processor_asr.batch_decode(predicted_ids, skip_special_tokens=True)[0]

  print("Your context: ", context)
  
  for j in range(1, n_questions[i] + 1): # first instance has 12 questions, second has 11
    print("Ask me a question, or give me some context... :)")
    
    question_audio, _ = librosa.load(f"final_task/question_{i}_{j:02d}.wav", sr=16000)
    question_audio = librosa.resample(question_audio, orig_sr=16000, target_sr=16000)

    inputs = processor_asr(audio = question_audio, sampling_rate=16000, return_tensors="pt")
    predicted_ids = model_asr.generate(**inputs, max_length=100)
    question = processor_asr.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    print("Your question: ", question)
    
    # # # # # # # #
    # STEP 2: LUG #
    # # # # # # # #
    QA_input = {
        'question': question,
        'context': context
    }
    res = nlp_llm(QA_input)['answer']

    context += (question + " ")
    
    print("My answer is: ", res)

    # # # # # # # #
    # STEP 3: TTS #
    # # # # # # # #

    inputs = processor_tts(text=res, return_tensors="pt")
    speaker_embeddings = torch.zeros((1, 512))

    spectrogram = model_tts.generate_speech(inputs["input_ids"], speaker_embeddings)
    with torch.no_grad(): speech = vocoder_tts(spectrogram)
    display(Audio(speech.numpy(), rate=16000)) # You can listen to the generated answer

Ask me a question, or give me some context... :)
Your context:   Once upon a time, in a barn near a farmhouse, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn, where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no! She shared her hay bed with her mommy and five other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white
Ask me a question, or give me some context... :)
Your question:   What's color Wisconsin?
 Once upon a time, in a barn near a farmhouse, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn, where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no! She shared her hay bed with her mommy and five other sisters. All of her sisters were cute and fluffy, lik

Ask me a question, or give me some context... :)
Your question:   Where did she live?
 Once upon a time, in a barn near a farmhouse, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn, where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no! She shared her hay bed with her mommy and five other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white What's color Wisconsin?  Where did she live? 
My answer is:  a barn near a farmhouse


Ask me a question, or give me some context... :)
Your question:   Did she leave alone?
 Once upon a time, in a barn near a farmhouse, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn, where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no! She shared her hay bed with her mommy and five other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white What's color Wisconsin?  Where did she live?  Did she leave alone? 
My answer is:  Did she leave alone


Ask me a question, or give me some context... :)
Your question:   Who did she live with?
 Once upon a time, in a barn near a farmhouse, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn, where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no! She shared her hay bed with her mommy and five other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white What's color Wisconsin?  Where did she live?  Did she leave alone?  Who did she live with? 
My answer is:  her mommy and five other sisters


Ask me a question, or give me some context... :)
Your question:   What scholar were her sisters?
 Once upon a time, in a barn near a farmhouse, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn, where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no! She shared her hay bed with her mommy and five other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white What's color Wisconsin?  Where did she live?  Did she leave alone?  Who did she live with?  What scholar were her sisters? 
My answer is:  


Ask me a question, or give me some context... :)


KeyboardInterrupt: 

In [None]:
import evaluate
import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from transformers import (
    AutoProcessor,
    AutoTokenizer,
    BertForQuestionAnswering,
    BertTokenizer,
    RobertaForQuestionAnswering,
    pipeline,
)

In [None]:
# Download the CoQA train and dev partitions and load them from the JSON files
!curl -O https://downloads.cs.stanford.edu/nlp/data/coqa/coqa-train-v1.0.json
!curl -O https://downloads.cs.stanford.edu/nlp/data/coqa/coqa-dev-v1.0.json
ds_train = load_dataset('json', data_files='coqa-train-v1.0.json', field='data')
ds_dev = load_dataset('json', data_files='coqa-dev-v1.0.json', field='data')

In [None]:
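# load_dataset puts a single JSON file under the default 'train' split,
# so the dev stories are accessed through ds_dev['train']
for i in range(2):
    print(ds_dev['train'][i])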

In [None]:
# Pretrained extractive QA models (listed in order of increasing BLEU score)
#model = BertForQuestionAnswering.from_pretrained('bert-large-cased-whole-word-masking-finetuned-squad')
#tokenizer = BertTokenizer.from_pretrained('bert-large-cased-whole-word-masking-finetuned-squad')

model = RobertaForQuestionAnswering.from_pretrained('deepset/roberta-base-squad2')
tokenizer = AutoTokenizer.from_pretrained('deepset/roberta-base-squad2')

In [None]:
# source: https://towardsdatascience.com/question-answering-with-a-fine-tuned-bert-bc4dafd45626
# Extractive QA for BERT checkpoints (use with the BERT model and tokenizer, not RoBERTa)
def question_answer(question, text):
    # tokenize question and text as a pair
    input_ids = tokenizer.encode(question, text)

    # string version of the token ids
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # segment ids: segment A (question) ends at the first [SEP] token
    sep_idx = input_ids.index(tokenizer.sep_token_id)
    num_seg_a = sep_idx + 1
    num_seg_b = len(input_ids) - num_seg_a
    segment_ids = [0] * num_seg_a + [1] * num_seg_b
    assert len(segment_ids) == len(input_ids)

    # forward pass without gradient tracking (inference only)
    with torch.no_grad():
        output = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids]))

    # most likely start and end positions, as plain Python ints
    answer_start = int(torch.argmax(output.start_logits))
    answer_end = int(torch.argmax(output.end_logits))

    # rebuild the answer, gluing WordPiece continuations ("##...") to the previous token
    answer = ""
    if answer_end >= answer_start:
        answer = tokens[answer_start]
        for i in range(answer_start + 1, answer_end + 1):
            if tokens[i].startswith("##"):
                answer += tokens[i][2:]
            else:
                answer += " " + tokens[i]

    # a span that starts at [CLS] is treated here as "no answer found"
    if answer.startswith("[CLS]"):
        answer = "Unable to find the answer to your question."
    return answer
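
The answer reconstruction in `question_answer` relies on the WordPiece convention that a token starting with `##` continues the previous token. A minimal standalone sketch of that merging step follows; `merge_wordpieces` is a hypothetical helper name, not part of the Transformers library, and the token list is a toy example.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuations to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # continuation piece: append to the last word without a space
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


# toy token list, chosen for illustration only
print(merge_wordpieces(["her", "mom", "##my", "and", "five", "sisters"]))
# her mommy and five sisters
```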


In [None]:
# Extractive QA for RoBERTa checkpoints (no segment ids needed)
def question_answer_2(question, text):
    # truncate only the story, so that long inputs fit the model's 512-token limit
    inputs = tokenizer(question, text, truncation="only_second", max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # most likely start and end token positions
    answer_start_index = outputs.start_logits.argmax()
    answer_end_index = outputs.end_logits.argmax()

    # an empty slice (end before start) decodes to an empty string
    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    return tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)
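
Taking the start and end argmax independently can yield an end position that precedes the start, in which case the decoded answer is empty. One common alternative, sketched below under stated assumptions, is to search for the highest-scoring pair with `start <= end`. The helper name `best_span` and the `max_answer_len` cap are illustrative choices, not something the notebook or the library prescribes, and the logits are made-up numbers.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with start <= end that maximizes
    start_logits[start] + end_logits[end], with a cap on the span length."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# made-up logits: the independent argmax gives start=3, end=2 (an invalid span)
start_logits = [0.1, 2.0, 0.3, 2.5]
end_logits = [0.2, 1.0, 3.0, 0.1]
print(best_span(start_logits, end_logits))
# (1, 2)
```

With real model outputs, the two logit tensors could be converted with `.tolist()` before calling the helper.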

In [None]:
from tqdm import tqdm
from transformers import logging

logging.set_verbosity_error()

NUM_STORIES = 25  # evaluate on a subset to speed up the process
predictions = []
references = []

# Evaluate the extractive QA model on the first NUM_STORIES stories of the dev partition
for sample in tqdm(ds_dev["train"].select(range(NUM_STORIES))):
    questions = sample["questions"]
    # one tuple per question: the main answer plus the three additional human answers
    answer_sets = zip(sample["answers"], sample["additional_answers"]["0"],
                      sample["additional_answers"]["1"], sample["additional_answers"]["2"])
    for question, answer_set in zip(questions, answer_sets):
        predictions.append(question_answer_2(question["input_text"], sample["story"]))
        references.append([a["input_text"] for a in answer_set])
        print(f"Predicted answer: {predictions[-1]}")
        print(f"Real answers: {references[-1]}")

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(results)
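
The evaluation loop pairs every question with four human answers: the main one and three additional ones. The sketch below isolates that step so it can be checked without loading the dataset. `reference_sets` is a hypothetical helper, the field layout is assumed from the loop above, and the sample is toy data made up for illustration.

```python
def reference_sets(sample):
    """Return, for each turn, the list of all human answers (main + additional)."""
    annotators = [sample["answers"]] + [
        sample["additional_answers"][key]
        for key in sorted(sample["additional_answers"])
    ]
    # zip(*annotators) groups the answers given to the same question
    return [[a["input_text"] for a in turn] for turn in zip(*annotators)]


# toy sample mimicking the assumed CoQA dev layout (not real dataset content)
toy_sample = {
    "answers": [{"input_text": "white"}, {"input_text": "in a barn"}],
    "additional_answers": {
        "0": [{"input_text": "white"}, {"input_text": "a barn"}],
        "1": [{"input_text": "White"}, {"input_text": "in a barn near a farmhouse"}],
        "2": [{"input_text": "white"}, {"input_text": "a barn"}],
    },
}
for refs in reference_sets(toy_sample):
    print(refs)
```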

In [None]:
# Instruction-tuned sequence-to-sequence model; the pipeline infers the task from the checkpoint
generator = pipeline(model="declare-lab/flan-alpaca-base")

In [None]:
from tqdm import tqdm
from transformers import logging

logging.set_verbosity_error()

NUM_STORIES = 25  # evaluate on a subset to speed up the process
predictions = []
references = []

# Evaluate the generative model on the first NUM_STORIES stories of the dev partition
for sample in tqdm(ds_dev["train"].select(range(NUM_STORIES))):
    questions = sample["questions"]
    # one tuple per question: the main answer plus the three additional human answers
    answer_sets = zip(sample["answers"], sample["additional_answers"]["0"],
                      sample["additional_answers"]["1"], sample["additional_answers"]["2"])
    for question, answer_set in zip(questions, answer_sets):
        # prompt = story followed by the question, with terminal punctuation enforced
        prompt = sample["story"]
        if not prompt.endswith("."):
            prompt += "."
        prompt += " " + question["input_text"]
        if not prompt.endswith("?"):
            prompt += "?"
        output = generator(prompt, max_length=20, num_return_sequences=1,
                           pad_token_id=generator.tokenizer.eos_token_id)
        predictions.append(output[0]["generated_text"])
        references.append([a["input_text"] for a in answer_set])

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(f"BLEU score: {results['bleu']}")
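
The prompt for the generative model is the story followed by the question, with a final period and question mark enforced. A small sketch of that construction as a reusable function is given below. `build_prompt` is a hypothetical name, and stripping surrounding whitespace is an added assumption that the loop above does not make.

```python
def build_prompt(story, question):
    """Concatenate story and question, making sure the story ends with '.'
    and the question ends with '?'."""
    prompt = story.strip()
    if not prompt.endswith("."):
        prompt += "."
    prompt += " " + question.strip()
    if not prompt.endswith("?"):
        prompt += "?"
    return prompt


print(build_prompt("Cotton lived in a barn", "Where did she live"))
# Cotton lived in a barn. Where did she live?
```

Keeping the prompt logic in one function makes it easier to try other templates (for example, an explicit instruction before the story) without touching the evaluation loop.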