# Spoken Language Processing 2022-23

# Lab3 - Dialogue Systems

_Bruno Martins_


This lab assignment will introduce tools and concepts related to the development of dialogue systems, exemplifying also the use of automatic speech recognition and text-to-speech models.

Students will be tasked with the development of a simple (spoken/conversational) question answering system, reusing different models associated to the HuggingFace Transformers library:

* Speech recognition models (e.g., OpenAI Whisper).
* Large language models for natural language understanding and generation (e.g., GPT-2 or Alpaca models).
* Text-to-speech models (e.g., SpeechT5).

The first parts of this notebook will guide students in the use of the tools, while the last part presents the main problem that is to be tackled. Note that the first parts also features intermediate tasks, which students are required to solve.

To complete the project, student groups must deliver in Fenix an updated version of this notebook, featuring the proposed solutions to each task, together with a small PDF report (2 pages) outlining the methods that were developed (you can use the [following Overleaf template](https://www.overleaf.com/latex/templates/interspeech-2023-paper-kit/kzcdqdmkqvbr) for the report).

Students are encouraged to modify examples, incorporate any other techniques, and in general explore any approach that may permit improving the results. Assessment will be based on task completion, creativity in the proposed solutions, and overall accuracy over a benchmark dataset.

### Group identification

Initialize the variable `group_id` with the number that Fenix assigned to your group and `student1_name`, `student1_id`, `student2_name` and `student2_id` with your names and student numbers.

In [1]:
group_id = 3

student1_name = "Duarte Almeida"
student1_id = 95565

student2_name = "Leonor Barreiros"
student2_id = 95618

print(f"Group number: {group_id}")
print(f"Student 1: {student1_name} ({student1_id})")
print(f"Student 2: {student2_name} ({student2_id})")

Group number: 3
Student 1: Duarte Almeida (95565)
Student 2: Leonor Barreiros (95618)


In [2]:
assert isinstance(group_id, int) and isinstance(student1_id, int) and isinstance(student2_id, int)
assert isinstance(student1_name, str) and isinstance(student2_name, str)
assert (group_id > 0) and (group_id < 40)
assert (student1_id > 60000) and (student1_id < 120000) and (student2_id > 60000) and (student2_id < 120000)

# Python packages

NumPy is a Python library that provides functions to process multidimensional array objects. The NumPy documentation is available [here](https://numpy.org/doc/1.24/).

[Librosa](https://librosa.org/) is a Python package for analyzing and processing audio signals. It provides a wide range of tools for tasks such as loading and manipulating audio files, extracting features from audio signals, and visualizing and playing back audio data.

IPython display is a module in the IPython interactive computing environment that provides a set of functions for displaying various types of media in the Jupyter notebook or other IPython-compatible environments. For example, you can use the display() function to display an object in a notebook cell (for example an audio object).

Matplotlib is a popular Python library that allows users to create a wide range of visualizations using a simple and intuitive syntax.

Huggingface transformers provides APIs and tools to easily download and train state-of-the-art pretrained models based on the Transformer architecture. The documentation is available [here](https://huggingface.co/docs/transformers/index) and, for more details, look at the official [HuggingFace course](https://huggingface.co/course/chapter1/1).

The associated HuggingFace libraries named [datasets](https://huggingface.co/docs/datasets/index) and [evaluate](https://huggingface.co/docs/evaluate/index) respectivly suport the direct access to many well-known datasets and common evaluation metrics used in NLP and speech research.

In [None]:
!pip3 install sentencepiece
!pip3 install xformers
!pip3 install transformers
!pip3 install datasets
!pip3 install evaluate
!pip3 install jiwer
!pip3 install librosa
!pip install sentence_transformers

In [4]:
import transformers
import datasets
import evaluate
import numpy as np
import librosa
import librosa.display
from IPython.display import Audio
from matplotlib import pyplot as plt

# Using OpenAI Whisper

Whisper is an exciting new model for Automatic Speech Recognition (ASR), developed by OpenAI and made available through the HuggingFace Transformers library.

The following example illustrates the use of the Whisper model to transcribe a small audio sample taken from the LibriSpeech dataset (which is available through the HuggingFace datasets library).

More detailed information about Whisper, including information on how to fine-tune the model with task-specific data, is available on a [tutorial in the HuggingFace blog](https://huggingface.co/blog/fine-tune-whisper).

In [6]:
import torch
import librosa
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")

audio = ds[0]["audio"]["array"]
audio = librosa.resample(audio, orig_sr=16000, target_sr=16000) # Resample audio to 16kHz (not needed in the case of this dataset)
print(audio)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features

display(Audio(audio, rate=16000)) # You are able to hear the audio inputs

generated_ids = model.generate(inputs=input_features)
transcription = processor.batch_decode(generated_ids, max_length=250, skip_special_tokens=True)[0]

print(transcription)

Downloading builder script:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

Downloading and preparing dataset librispeech_asr_dummy/clean to /root/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset librispeech_asr_dummy downloaded and prepared to /root/.cache/huggingface/datasets/hf-internal-testing___librispeech_asr_dummy/clean/2.1.0/d3bc4c2bc2078fcde3ad0f0f635862e4c0fef78ba94c4a34c4c250a097af240b. Subsequent calls will reuse this data.


Downloading (…)rocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/844 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/999k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.13M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)main/normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.94k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/151M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

[0.00238037 0.0020752  0.00198364 ... 0.00042725 0.00057983 0.0010376 ]




 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.


Automatic Speech Recognition (ASR) models are frequently evaluated through the Word Error Rate (WER).

The WER is derived from the Levenshtein distance, working at the word level and aligning the recognized word sequence with the reference (spoken) word sequence using dynamic string alignment. The metric can then be computed as:

WER = (S + D + I) / N = (S + D + I) / (S + D + C),

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, C is the number of correct words, and N is the number of words in the reference (N=S+D+C). The WER value indicates the average number of errors per reference word. The lower the value, the better the performance of the ASR system, with a WER of 0 being a perfect score.

The example below illustrates the computation of the WER for two paired examples of a generated sentence versus a reference sentence. The score produced as output is the average value accross the two examples.

In [7]:
from evaluate import load

wer = load("wer")
predictions = ["this is the prediction", "there is an other sample"]
references = ["this is the reference", "there is another one"]
wer_score = wer.compute(predictions=predictions, references=references)

print(wer_score)

Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

0.5


## Intermediate tasks:

* Collect two small audio samples with your own voice, together with a transcription of the spoken messages. The following [example shows how to record audio from your microphone within a Python notebook running on Google Colab](https://colab.research.google.com/gist/ricardodeazambuja/03ac98c31e87caf284f7b06286ebf7fd/microphone-to-numpy-array-from-your-browser-in-colab.ipynb#scrollTo=H4rxNhsEpr-c), but you can use any other method to collect the audio samples.
* Use the Whisper speech recognition model to transcribe the two spoken messages that were collected.
* Use the transcriptions to compute the word error rate.
* Experiment with the use of different recognition models (e.g., larger Whisper models), and see if the error rate changes.

In [8]:
# # # # # # # # # # # # # # # # # #
# SPEECH RECOGNITION WITH WHISPER #
# # # # # # # # # # # # # # # # # #

from transformers import AutoProcessor, WhisperForConditionalGeneration

processors = []
models = []
audios = []
references = ["The exhibition is about to commence.", "Hi! I'm very happy today. What's your name?"]
names = ["whisper-tiny", "whisper-small", "whisper-medium"]

utt_1, _ = librosa.load("task_1/audio_duarte.wav", sr=16000)
utt_1 = librosa.resample(utt_1, orig_sr=16000, target_sr=16000)
display(Audio(utt_1, rate=16000))
audios.append(utt_1)

utt_2, _ = librosa.load("task_1/audio_leonor.wav", sr=16000)
utt_2 = librosa.resample(utt_2, orig_sr=16000, target_sr=16000)
display(Audio(utt_2, rate=16000))
audios.append(utt_2)

processors.append(AutoProcessor.from_pretrained("openai/whisper-tiny.en"))
models.append(WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en"))

processors.append(AutoProcessor.from_pretrained("openai/whisper-small.en"))
models.append(WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en"))

processors.append(AutoProcessor.from_pretrained("openai/whisper-medium.en"))
models.append(WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en"))

for (processor, model, name) in zip(processors, models, names):
    print(f"Model {name}: ")
    predictions = []
    for audio in audios:
        inputs = processor(audio = audio, sampling_rate=16000, return_tensors="pt")
        input_features = inputs.input_features
        generated_ids = model.generate(inputs=input_features)
        prediction = processor.batch_decode(generated_ids, max_length=250, skip_special_tokens=True)[0]
        predictions.append(prediction)
        print("\tPrediction: ", prediction)
    wer = load("wer")
    wer_score = wer.compute(predictions=predictions, references=references)

    print(f"Wer score: {wer_score}")

# small and medium have the best scores

Downloading (…)rocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/845 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/999k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.13M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)main/normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.94k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

Downloading (…)rocessor_config.json:   0%|          | 0.00/185k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/999k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.13M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)main/normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/2.08k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.95k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/3.06G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

Model whisper-tiny: 




	Prediction:   The exhibition is about to commence.
	Prediction:   Hi, I'm very happy to say what's your name?
Wer score: 0.2857142857142857
Model whisper-small: 
	Prediction:   The exhibition is about to commence.
	Prediction:   Hi, I'm very happy today. What's your name?
Wer score: 0.07142857142857142
Model whisper-medium: 
	Prediction:   The exhibition is about to commence.
	Prediction:   Hi, I'm very happy today. What's your name?
Wer score: 0.07142857142857142


In [9]:
# # # # # # # # # # # # # # # # # #
#    SPEECH RECOGNITION WITH T5   #
# # # # # # # # # # # # # # # # # #

from transformers import SpeechT5Processor, SpeechT5ForSpeechToText

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

audios = {"task_1/audio_duarte.wav" : "The exhibition is about to commence.", "task_1/audio_leonor.wav": "Hi! I'm very happy today. What's your name?"}
for audio in audios:
    utt, st = librosa.load(audio, sr=16000)
    utt = librosa.resample(utt, orig_sr=16000, target_sr=16000)
    display(Audio(utt, rate=16000))
    inputs = processor(audio = utt, sampling_rate=16000, return_tensors="pt")
    predicted_ids = model.generate(**inputs, max_length=100)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)


    print(transcription)

    wer = load("wer")
    predictions = transcription
    print(audio)
    references = [audios[audio]]
    wer_score = wer.compute(predictions=predictions, references=references)

    print(wer_score)

# unsatisfactory results

Downloading (…)rocessor_config.json:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading spm_char.model:   0%|          | 0.00/238k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/232 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/606M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

['de exhibition is about to commence']
task_1/audio_duarte.wav
0.3333333333333333


["high i'm very happy to day what's your name"]
task_1/audio_leonor.wav
0.75


# Using LLMs for conditional language generation

OpenAI GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. Thus, GPT-2 can be used to address problems like question answering, modeling the task as language generation conditioned in the question (plus other relevant additional context).

The following example illustrates the use of the GPT-2 through the Huggingface Transformers library. In this case, instead of using the model directly, we are using the model through the pipeline API, which facilitates the adaptation to the case of other LLMs. The pipeline() function can be used to connect a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer.

In [10]:
from transformers import pipeline, set_seed

set_seed(42) # make results deterministic

generator = pipeline(model='gpt2')
generator("Who is the president of the United States? The answer is", max_length=15, num_return_sequences=1)

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Who is the president of the United States? The answer is yes." This'}]

## Intermediate tasks:

* Adapt the example showing how to use GPT-2 to do question answering over the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/) (available from HuggingFace datasets).
* Evaluate the results obtained with different models (e.g., [Alpaca-based models](https://huggingface.co/declare-lab/flan-alpaca-base)) and/or different usage strategies (e.g., consider prompting, parameter efficient fine-tuning, etc.).
* Compute the error over the first 1000 examples from the validation split from the SQuAD dataset, using the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu) for comparing the generated answers against the ground truth.


In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# USING GPT-2 OR ALPACA-BASED TO DO QA OVER SQUAD DATASET #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

from transformers import pipeline, set_seed, logging
from datasets import load_dataset
from tqdm import tqdm

logging.set_verbosity_error()

set_seed(42) # make results deterministic

ds = load_dataset("squad", split="validation[:200]")

generators = []
names = ["flan-alpaca-large", "gpt2"]
generators.append(pipeline(model="declare-lab/flan-alpaca-large"))
generators.append(pipeline(model='gpt2'))

In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
#         RETRIEVAL OF IN-CONTEXT EXAMPLES                #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

from sentence_transformers import SentenceTransformer
import torch
from tqdm import tqdm

train_ds = load_dataset("squad", split="train")
ir_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

sentences = []

for i in tqdm(range(train_ds.num_rows), total = train_ds.num_rows):
    question = train_ds[i]["question"]
    text = train_ds[i]["context"]
    sentences.append(text+ " " + question)

utt_representations = ir_model.encode(sentences)

# get norms of each embedding (to do cosine similarity afterwards)
utt_representations_norms = np.linalg.norm(utt_representations, ord = 2, axis = 1)

In [None]:
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# USING GPT-2 OR ALPACA-BASED TO DO QA OVER SQUAD DATASET #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

retrieval = False
predictions = []
references = []

best_predictions = []
best_bleu = 0
for (generator, name) in zip(generators, names):

    predictions = []
    references = []
    i = 0
    print(f"Evaluating {name}...")
    for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
        prompt = sample["context"] + " " + sample["question"] + " The shortest and simplest answer possible is: "

        # if in-context retrieval is activate, get append most similar context + question + answer triple to prompt
        if retrieval:
            output = ir_model.encode([text + " " + question])
            scores = np.linalg.norm(utt_representations - output[0], ord = 2, axis = 1)
            utt_norm = np.linalg.norm(output[0], ord = 2)
            scores = (scores) / (utt_norm * utt_representations_norms) # cosine similarity
            best = -np.argpartition(scores, 1)[:1]
            top_score = scores[best]
            if top_score > 0.10:
                prompt = train_ds[int(best[0])]["context"] + " " + \
                train_ds[int(best[0])]["question"] + \
                " The shortest and simplest answer possible is: " + \
                train_ds[int(best[0])]["answer"] + prompt
        if name == "gpt2":
            prediction = generator(prompt, max_length=200, num_return_sequences=1,  pad_token_id=generator.tokenizer.eos_token_id)[0]["generated_text"][len(prompt) + 1:]
        else:
            prediction = generator(prompt, max_length=20, num_return_sequences=1,  pad_token_id=generator.tokenizer.eos_token_id)[0]["generated_text"]
            if prediction[-1] == ".":
                prediction = prediction[:-1]
        reference  = sample["answers"]["text"][0]
        predictions.append(prediction)
        references.append(reference)

    bleu = evaluate.load("bleu")
    results = bleu.compute(predictions=predictions, references=references)
    print(f"Bleu score: {results['bleu']}")

# gpt2 terrible, alpaca gets ~0.25 bleu after forcing it after prompting it fo short answering

In [None]:
# # # # # # # # # #
# QA WITH CONTEXT #
# # # # # # # # # #

# Model 1, source: https://huggingface.co/MaRiOrOsSi/t5-base-finetuned-question-answering
# This model was fine-tuned for QA on a different dataset, but it was evaluated on the squad dataset

from  transformers  import  AutoTokenizer, AutoModelWithLMHead, pipeline

model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

ds = load_dataset("squad", split="validation[:200]")

predictions, references = [], []
for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
    reference  = sample["answers"]["text"][0]
    question = sample["question"]
    context = sample["context"]
    input = f"question: {question} context: {context}"
    encoded_input = tokenizer([input],
                             return_tensors='pt',
                             max_length=512,
                             truncation=True)
    output = model.generate(input_ids = encoded_input.input_ids,
                            attention_mask = encoded_input.attention_mask)
    output = tokenizer.decode(output[0], skip_special_tokens=True)
    predictions.append(output)
    references.append(reference)

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(f"Bleu score: {results['bleu']}")

# Bleu score: 0.5328188114327255 --> we're getting there!
# Note that this model only works if there's a context. Although it's a generative
# model... So we can only use it if there's a context from which the answer may be
# deduced (not necessarily extracted)


In [None]:
# # # # # # # #
# QA USING T5 #
# # # # # # # #

# source: https://huggingface.co/consciousAI/question-answering-generative-t5-v1-base-s-q-c?text=What+color+is+the+sky%3F

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer
)
from transformers import pipeline, set_seed, logging
from datasets import load_dataset
import torch

logging.set_verbosity_error()

set_seed(42) # make results deterministic

ds = load_dataset("squad", split="validation[:200]")

def _generate(query, context, model, device):

    FT_MODEL = AutoModelForSeq2SeqLM.from_pretrained(model).to(device)
    FT_MODEL_TOKENIZER = AutoTokenizer.from_pretrained(model)

    if (context != None):
      input_text = "question: " + query + "</s> question_context: " + context
    else:
      input_text = query

    input_tokenized = FT_MODEL_TOKENIZER.encode(input_text, return_tensors='pt', truncation=True, padding='max_length', max_length=1024).to(device)
    _tok_count_assessment = FT_MODEL_TOKENIZER.encode(input_text, return_tensors='pt', truncation=True).to(device)

    summary_ids = FT_MODEL.generate(input_tokenized,
                                       max_length=30,
                                       min_length=5,
                                       num_beams=2,
                                       early_stopping=True,
                                   )
    output = [FT_MODEL_TOKENIZER.decode(id, clean_up_tokenization_spaces=True, skip_special_tokens=True) for id in summary_ids]

    return str(output[0])

device = [0 if torch.cuda.is_available() else 'cpu'][0]

# # # # # # # # # #
# WITHOUT CONTEXT #
# # # # # # # # # #
predictions = []
references = []

for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
    prediction = _generate(sample['question'], None, model="consciousAI/question-answering-generative-t5-v1-base-s-q-c", device=device)
    reference  = sample["answers"]["text"][0]
    predictions.append(prediction)
    references.append(reference)

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(f"Bleu score without context: {results['bleu']}")

# Without context this model performs poorly. That's because in our experiments there
# is a right answer that's extracted from the context - without seeing the context, the
# model doesn't give the right answer (but it does give a possible one if there's no
# information)

# # # # # # # # # #
#   WITH CONTEXT  #
# # # # # # # # # #
predictions = []
references = []

for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
    prediction = _generate(sample['question'], sample['context'], model="consciousAI/question-answering-generative-t5-v1-base-s-q-c", device=device)
    reference  = sample["answers"]["text"][0]
    predictions.append(prediction)
    references.append(reference)

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(f"Bleu score with context: {results['bleu']}")


# Bleu score with context: 0.16175428918549858
# Not great results! Note that this model provides a "complete sentence", while
# the reference only contains the answer - which makes the BLEU score worse

In [None]:
# # # # # # # # # # #
# QA USING DIALOGPT #
# # # # # # # # # # #

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from tqdm import tqdm
from datasets import load_dataset


tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

ds = load_dataset("squad", split="validation[:200]")


contexts = [True, False]
for context_value in contexts:
  predictions, references = [], []
  for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
    reference  = sample["answers"]["text"][0]
    question = sample["question"]
    if context_value:
      context = sample["context"]
      input_ids = tokenizer.encode(context + " " + question + tokenizer.eos_token, return_tensors='pt')
      N = len(context + " " + question)
    else:
      input_ids = tokenizer.encode(question + tokenizer.eos_token, return_tensors='pt')
      N = len(question)

    output_ids = model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)

    output = tokenizer.decode(output_ids[0], skip_special_tokens=True)[N:]
    predictions.append(output)
    references.append(reference)

  bleu = evaluate.load("bleu")
  results = bleu.compute(predictions=predictions, references=references)
  print(f"Bleu score with context={context_value}: {results['bleu']}")

  # both have bad performance for this type of questions, but this model behaves
  # more like a natural dialog, may be interesting for the final system

In [None]:
# # # # # # # # #
# FINAL RESULTS #
# # # # # # # # #

# Compute the error over the first 1000 examples from the validation split from the SQuAD dataset, using the BLEU metric for comparing the generated answers against the ground truth.
# NOTE: this is hardcoded because partial results were calculated separately using Google colab

from  transformers  import  AutoTokenizer, AutoModelWithLMHead, pipeline

model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

ds = load_dataset("squad", split="validation[:1000]")

predictions, references = [], []
for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
    reference  = sample["answers"]["text"][0]
    question = sample["question"]
    context = sample["context"]
    input = f"question: {question} context: {context}"
    encoded_input = tokenizer([input],
                             return_tensors='pt',
                             max_length=512,
                             truncation=True)
    output = model.generate(input_ids = encoded_input.input_ids,
                            attention_mask = encoded_input.attention_mask)
    output = tokenizer.decode(output[0], skip_special_tokens=True)
    predictions.append(output)
    references.append(reference)

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references)
print(f"Bleu score: {results['bleu']}")

# Bleu score: 0.5057555208611301
# Don't forget that we're using the context (although this is a generative model and
# so the answers are generated and not extracted from the context). In the final model,
# we might want to use a model without context (especially is there isn't any)

# Using SpeechT5 for converting text-to-speech

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in different natural language processing tasks, the unified-modal SpeechT5 framework explores encoder-decoder pre-training for self-supervised speech/text representation learning.

The model is again conveniently available through the HuggingFace Transformers library. The following example illustrates the use of the SpeechT5 model for generating a spectrogram from a textual input, together with a neural vocoder model for producing a speech signal.

More detailed information about SpeechT5 is available on a [tutorial on the HuggingFace blog](https://huggingface.co/blog/speecht5).

In [None]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed
from IPython.display import Audio
from datasets import load_dataset
import soundfile as sf
import librosa
import torch
import matplotlib.pyplot

set_seed(42) # make results deterministic

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
speaker_embeddings = torch.zeros((1, 512))

# You can optionally use "speaker embeddings" to customize the output to a particular speaker’s voice characteristics
#embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
# speaker_embeddings = torch.tensor(embeddings_dataset[42]["xvector"]).unsqueeze(0)

spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
with torch.no_grad(): speech = vocoder(spectrogram)

# You can plot the generated spectrogram
# import matplotlib.pyplot as plt
# plt.figure()
# plt.imshow(spectrogram.T)
# plt.show()

librosa.display.waveshow(speech.numpy(), sr=16000) # You can plot the generated waveform
sf.write("tts_example.wav", speech.numpy(), samplerate=16000) # You can save the audio to a .wav file
display(Audio(speech.numpy(), rate=16000)) # You can hear the audio inputs

## Intermediate tasks:

* Connect the results from your answer to the previous intermediate task (i.e., conditioned language generation) to the SpeechT5 text-to-speech model, so as to produce speech outputs from the text generated by the model.
* Produce speech-based answers for the first 5 questions in the validation split from the SQuaD dataset.
* Connect also the results from your answer to the first intermediate task (i.e., automated speech recognition) to the SpeechT5 model and the LLM, so as to take spoken questions as input and produce a speech output.
* Collect small audio samples, with your own voice, for the first 5 questions in the validation split from the SQuaD dataset, and produce speech-based answers for these five questions.


In [None]:
# !mkdir task_3.1/

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# PRODUCE SPEECH OUTPUTS FROM THE TEXT GENERATED BY THE MODEL #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed
import soundfile as sf
import torch

# Use SpeechT5 to produce speech outputs from the text generated by the model
i = 0
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

for prediction in predictions:
    inputs = processor(text=prediction, return_tensors="pt")
    speaker_embeddings = torch.zeros((1, 512))

    spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
    with torch.no_grad(): speech = vocoder(spectrogram)

    sf.write("task_3.1/" + str(i) + ".wav", speech.numpy(), samplerate=16000) # You can save the audio to a .wav file
    i += 1

# !zip -r task_3.1/ task_3.1

In [1]:
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
#  TAKE SPOKEN QUESTIONS AS INPUT AND PRODUCE A SPEECH OUTPUT #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

from transformers import pipeline, set_seed
import librosa
from transformers import AutoProcessor, WhisperForConditionalGeneration
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed

# Generate the text from the spoken questions
processor = AutoProcessor.from_pretrained("openai/whisper-medium.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")

files = ["task_3.2/squad_0.wav", "task_3.2/squad_1.wav", "task_3.2/squad_2.wav", "task_3.2/squad_3.wav", "task_3.2/squad_4.wav"]
questions = []
for file in files:
    utt, st = librosa.load(file, sr=16000)
    utt = librosa.resample(utt, orig_sr=16000, target_sr=16000)
    inputs = processor(audio = utt, sampling_rate=16000, return_tensors="pt")
    predicted_ids = model.generate(**inputs, max_length=100)
    questions.append(processor.batch_decode(predicted_ids, skip_special_tokens=True))

print("Questions: ", questions)

Questions:  [[' Which NFL team represented the AFC at Super Bowl 50?'], [' Which NFL team represented the NFC at Super Bowl 50?'], [' Where did Super Bowl 50 take place?'], [' Which NFL team won Super Bowl 50?'], [' What colour was used to emphasize the 50th anniversary of the Super Bowl?']]


In [4]:
from  transformers  import  AutoTokenizer, AutoModelWithLMHead, pipeline
from datasets import load_dataset
from tqdm import tqdm

model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

ds = load_dataset("squad", split="validation[:5]")

answers = []
for (i, sample) in tqdm(enumerate(ds), total=len(ds)):
    question = sample["question"]
    context = sample["context"]
    input = f"question: {question} context: {context}"
    encoded_input = tokenizer([input],
                             return_tensors='pt',
                             max_length=512,
                             truncation=True)
    output = model.generate(input_ids = encoded_input.input_ids,
                            attention_mask = encoded_input.attention_mask)
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    answers.append(answer)

print("Answers: ", answers)

100%|██████████| 5/5 [00:05<00:00,  1.01s/it]

Answers:  ['Denver Broncos', 'Denver Broncos', 'Santa Clara, California', 'Denver Broncos', 'gold']





In [5]:
# !mkdir task_3.3/
import torch
import soundfile as sf

# Produce speech-based answers for these five questions
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

i = 0
for answer in answers:
    inputs = processor(text=answer, return_tensors="pt")
    speaker_embeddings = torch.zeros((1, 512))

    spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
    with torch.no_grad(): speech = vocoder(spectrogram)

    sf.write("task_3.3/" + str(i) + ".wav", speech.numpy(), samplerate=16000) # You can save the audio to a .wav file
    i += 1

# !zip -r task_3.3/ task_3.3

Downloading (…)lve/main/config.json:   0%|          | 0.00/2.06k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/585M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/50.7M [00:00<?, ?B/s]

Downloading (…)rocessor_config.json:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading spm_char.model:   0%|          | 0.00/238k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/40.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/232 [00:00<?, ?B/s]

# Main problem

Students are tasked with joining together the speech recognition, language understanding and generation, and text-to-speech models, in order to build a conversational spoken question answering approach.

* The method should take as input speech utterances with questions.
* The language understanding and generation component should use as input a transcription for the current speech utterance, and optionally also transcriptions from previous speech utterances (i.e., the conversation context).
* The language understanding and generation component can explore different strategies for improving answer quality:
  * Use of large LLMs trained with reinforcement learning from human feedback.
  * Prompting the language model with retrieved in-context examples.
  * Using parameter-efficient fine-ting with existing conversational question answering datasets (e.g., [the CoQA dataset](https://stanfordnlp.github.io/coqa/), available from HuggingFace datasets).
  * ...
* The text-to-speech component takes as input the results from language generation, and produces a speech output.
* Both the automated speech recognition and the text-to-speech components can explore different approaches, although students should attempt to justify their choices (e.g., if changing the automated speech recognition component, show that it achieves a lower WER).
* Collect small audio samples, with your own voice, for the first two instances in the CoQA validation split, and show the results produced by your method for these examples.

In [6]:
def is_question(utterance):
  # this could be done using a classifier, but that isn't required for this assignment
  return utterance[-1] == '?'

In [7]:
# # # # # # # # # # # # # # # # # # # #
# AUTOMATIC SPEECH RECOGNITION - ASR  #
# # # # # # # # # # # # # # # # # # # #

from transformers import pipeline, set_seed
import librosa
from transformers import AutoProcessor, WhisperForConditionalGeneration
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed

# cycle of listening to questions and replying to them, while keeping in mind the
# context of the conversation so far
processor_asr = AutoProcessor.from_pretrained("openai/whisper-medium.en")
model_asr = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium.en")

In [8]:
# # # # # # # # # # # # # # # # # # # # # # # # #
# LANGUAGE UNDERSTANDING AND GENERATION  - LUG  #
# # # # # # # # # # # # # # # # # # # # # # # # #

from  transformers  import  AutoTokenizer, AutoModelWithLMHead, AutoModelForCausalLM, pipeline
import torch
from tqdm import tqdm

# WITH CONTEXT: T5 fine-tuned for QA
context_llm_model_name = "MaRiOrOsSi/t5-base-finetuned-question-answering"
context_llm_tokenizer = AutoTokenizer.from_pretrained(context_llm_model_name)
context_llm_model = AutoModelWithLMHead.from_pretrained(context_llm_model_name)

# WITHOUT CONTEXT: dialoGPT
# Note that we could also have used the generative T5 model for QA, but it would have been slower
# In a dialogue setting, we want answers to be quick, in order for the conversation to flow
free_llm_model_name = "microsoft/DialoGPT-medium"
free_llm_tokenizer = AutoTokenizer.from_pretrained(free_llm_model_name, padding_side='left')
free_llm_model = AutoModelForCausalLM.from_pretrained(free_llm_model_name)

In [9]:
# # # # # # # # # # # # #
# TEXT TO SPEECH - TTS  #
# # # # # # # # # # # # #

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed

model_tts = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder_tts = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor_tts = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

In [10]:
contexts = {'coqa_0': "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. \n\n\"What are you doing, Cotton?!\" \n\n\"I only wanted to be more like you\". \n\nCotton's mommy rubbed her face on Cotton's and said \"Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way\". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. \n\n\"Don't ever do that again, Cotton!\" they all cried. \"Next time you might mess up that pretty white fur of yours and we wouldn't want that!\" \n\nThen Cotton thought, \"I change my mind. I like being special\".", 'coqa_1': "Once there was a beautiful fish named Asta. Asta lived in the ocean. There were lots of other fish in the ocean where Asta lived. They played all day long. \n\nOne day, a bottle floated by over the heads of Asta and his friends. They looked up and saw the bottle. \"What is it?\" said Asta's friend Sharkie. \"It looks like a bird's belly,\" said Asta. But when they swam closer, it was not a bird's belly. It was hard and clear, and there was something inside it. \n\nThe bottle floated above them. They wanted to open it. They wanted to see what was inside. So they caught the bottle and carried it down to the bottom of the ocean. They cracked it open on a rock. When they got it open, they found what was inside. It was a note. The note was written in orange crayon on white paper. Asta could not read the note. Sharkie could not read the note. They took the note to Asta's papa. \"What does it say?\" they asked. \n\nAsta's papa read the note. He told Asta and Sharkie, \"This note is from a little girl. She wants to be your friend. If you want to be her friend, we can write a note to her. But you have to find another bottle so we can send it to her.\" And that is what they did.", 'free_dialogue': ""}

In [15]:
# # # # # # # # # # # # # # # # # # # # # # # # # # #
# DIALOGUE SYSTEM - EXPERIMENTING WITH COQA DATASET #
# # # # # # # # # # # # # # # # # # # # # # # # # # #
import torch
import os
from datasets import load_dataset
from IPython.display import Audio

for dirname, dirnames, filenames in os.walk('final_task'):
    for subdirname in dirnames:
        print(f"Dialogue {subdirname}")
        context = contexts[subdirname]
        for d, ds, files in os.walk(f'final_task/{subdirname}'):
            files.sort()
            for f in files:
                print(f)
                if not ".wav" in f:
                  continue
                print("Ask me a question, or give me some context... :)")
                # # # # # # # #
                # STEP 1: ASR #
                # # # # # # # #
                audio, _ = librosa.load(f"final_task/{subdirname}/{f}", sr=16000)
                audio    = librosa.resample(audio, orig_sr=16000, target_sr=16000)

                inputs        = processor_asr(audio = audio, sampling_rate=16000, return_tensors="pt")
                predicted_ids = model_asr.generate(**inputs, max_length=1000000)
                utterance     = processor_asr.batch_decode(predicted_ids, skip_special_tokens=True)[0]

                print("Your utterance: ", utterance)

                if is_question(utterance):
                    # # # # # # # #
                    # STEP 2: LUG #
                    # # # # # # # #
                    answer = ""
                    # with context
                    if context != "":
                        input = f"question: {utterance} context: {context}"


                        encoded_input = context_llm_tokenizer([input],
                                                return_tensors='pt',
                                                max_length=512,
                                                truncation=True)
                        output = context_llm_model.generate(input_ids = encoded_input.input_ids,
                                                attention_mask = encoded_input.attention_mask)
                        answer = context_llm_tokenizer.decode(output[0], skip_special_tokens=True)
                    # without context
                    elif context == "" or len(answer) == 0:
                        input_ids = free_llm_tokenizer.encode(utterance + tokenizer.eos_token, return_tensors='pt')
                        N = len(question)
                        output_ids = free_llm_model.generate(input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
                        answer = free_llm_tokenizer.decode(output_ids[0], skip_special_tokens=True)[N:]

                    print("My answer is: ", answer)

                    # # # # # # # #
                    # STEP 3: TTS #
                    # # # # # # # #

                    inputs = processor_tts(text=answer, return_tensors="pt")

                    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
                    speaker_embeddings = torch.tensor(embeddings_dataset[8]["xvector"]).unsqueeze(0) # scottish male

                    spectrogram = model_tts.generate_speech(inputs["input_ids"], speaker_embeddings)
                    with torch.no_grad(): speech = vocoder_tts(spectrogram)
                    display(Audio(speech.numpy(), rate=16000))

                context += (utterance + " ")

Dialogue coqa_1
question_1_01.wav
Ask me a question, or give me some context... :)
Your utterance:   What was the name of the fish?
My answer is:  Asta




question_1_02.wav
Ask me a question, or give me some context... :)
Your utterance:   What looked like a bird's belly?
My answer is:  




question_1_03.wav
Ask me a question, or give me some context... :)
Your utterance:   Who said that?
My answer is:  Asta




question_1_04.wav
Ask me a question, or give me some context... :)
Your utterance:   Was Sharky a friend?
My answer is:  Yes




question_1_05.wav
Ask me a question, or give me some context... :)
Your utterance:   Did they get the bottle?
My answer is:  Yes.




question_1_06.wav
Ask me a question, or give me some context... :)
Your utterance:   What was in it?
My answer is:  A note




question_1_07.wav
Ask me a question, or give me some context... :)
Your utterance:   Did the little boy write the notes?
My answer is:  No




question_1_08.wav
Ask me a question, or give me some context... :)
Your utterance:   Who could read the notes?
My answer is:  Asta




question_1_09.wav
Ask me a question, or give me some context... :)
Your utterance:   What did I do with the nodes?
My answer is:  did the math




question_1_10.wav
Ask me a question, or give me some context... :)
Your utterance:   Did they write back?
My answer is:  No




question_1_11.wav
Ask me a question, or give me some context... :)
Your utterance:   Were they excited?
My answer is:  No, they got closer.




Dialogue coqa_0
question_0_01.wav
Ask me a question, or give me some context... :)
Your utterance:   What's color Wisconsin?
My answer is:  white




question_0_02.wav
Ask me a question, or give me some context... :)
Your utterance:   Where did she live?
My answer is:  barn




question_0_03.wav
Ask me a question, or give me some context... :)
Your utterance:   Did she leave alone?
My answer is:  no




question_0_04.wav
Ask me a question, or give me some context... :)
Your utterance:   Who did she live with?
My answer is:  She lived with her mommy and 5 other sisters.




question_0_05.wav
Ask me a question, or give me some context... :)
Your utterance:   What scholar were her sisters?
My answer is:  




question_0_06.wav
Ask me a question, or give me some context... :)
Your utterance:   Was Cotton happy that she looked different than the rest of her family?
My answer is:  Yes




question_0_07.wav
Ask me a question, or give me some context... :)
Your utterance:   What did she do to try to make herself the same color as her sisters?
My answer is:  She mixed colors.




question_0_08.wav
Ask me a question, or give me some context... :)
Your utterance:   Whose paint was it?
My answer is:  The old farmer's orange paint




question_0_09.wav
Ask me a question, or give me some context... :)
Your utterance:   What did Cotton's mother and siblings do when they saw her painted orange?
My answer is:  They laughed




question_0_10.wav
Ask me a question, or give me some context... :)
Your utterance:   Where did Cotton's mother put her to clean the paint off?
My answer is:  bleacher




question_0_11.wav
Ask me a question, or give me some context... :)
Your utterance:   What did the other cats do when cotton emerged from the bucket of water?
My answer is:  Kiss




question_0_12.wav
Ask me a question, or give me some context... :)
Your utterance:   Did they want Cotton to change the color of her fur?
My answer is:  Yes




Dialogue free_dialogue
question_0.wav
Ask me a question, or give me some context... :)
Your utterance:   Yesterday I went to the beach but I forgot to wear sunscreen. What do you think might happen?
My answer is:  ou think might happen?</s>




question_1.wav
Ask me a question, or give me some context... :)
Your utterance:   Besides this, I also fell asleep at the sun. Do you think this will solve my problem or actually make it worse?
My answer is:  




question_2.wav
Ask me a question, or give me some context... :)
Your utterance:   What solutions do you recommend?
My answer is:  2


