# From chatting to talking to any source of data

Converse with any source of data while doing any activity (walking on the treadmill, cooking, cleaning, ...): a tutorial on leveraging OpenAI's GPT and Whisper models and Python libraries for audio processing.

In a previous tutorial ([Chat to any source of data](https://medium.com/@juanabascal78/chat-to-any-source-of-data-with-langchain-and-openai-3677ecb8665d)), we showed how to use *LangChain* and *OpenAI* to chat with any data (pdf, url, youtube link, xlsx, tex, ...). In this tutorial, we level up and show how to convert text to audio and vice versa while interacting with both GPT and Whisper. Make your APIs talk like the now-talking [ChatGPT](https://openai.com/blog/chatgpt-can-now-see-hear-and-speak)!

Let's start!

## Requirements

First, we install the required dependencies in the environment of choice.

In [None]:
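# Install required libraries
# Note: in a notebook each `!` command runs in its own subshell, so the activation
# below does not persist across lines; run these commands in a terminal if needed.
!python -m venv venv
!source venv/bin/activate
!pip install -r requirements_talk_to_your_data.txt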

where the `requirements_talk_to_your_data.txt` file is given below:
```
openai==1.2.4
langchain==0.0.335
chromadb==0.3.26
pydantic==1.10.8
langchain[docarray]
gTTS==2.4.0
pvrecorder==1.2.1
playsound==1.3.0
bs4==0.0.1
tiktoken==0.5.1
```

## OpenAI API Key


You will need an OpenAI API key, which we will read from a JSON file. You may also use an environment variable ([best practices](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety)).
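
As an alternative to the JSON file, the key can be read from environment variables. The following is a minimal sketch: the variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` follow the convention of the `openai` package, while the helper name `read_key_from_env` and its `env` parameter are hypothetical and only used for illustration.

```python
import os


def read_key_from_env(env=os.environ):
    """Return (api_key, organization) read from environment variables.

    The organization is optional and is None when the variable is not set.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return api_key, env.get("OPENAI_ORG_ID")


# Usage (hypothetical), once the variables are exported in your shell:
# openai.api_key, openai.organization = read_key_from_env()
```

Passing the mapping as a parameter keeps the helper easy to check without touching the real environment.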

In [None]:
import os
import json
import openai

path_file_key = '/home/username/openai'
name_file_key = "openai_key.json" 
    
"""
json file: 
openai_key.json:     
    {"organization": <org_key>, 
    "api_key": <api_key>}
"""

def read_key_from_file(path_file, name_file_key):
    """Read the OpenAI credentials from a JSON file, set them on the openai module,
    and return the API key."""
    with open(os.path.join(path_file, name_file_key), 'r') as f:
        org_data = json.load(f)

    openai.organization = org_data['organization']
    openai.api_key = org_data['api_key']
    return org_data['api_key']

# Read the OpenAI key from the JSON file
openai_key = read_key_from_file(path_file_key, name_file_key)

Now that we are all set, let's kick off with the fun stuff.

## Transcribe text to audio

We will explore the functionality required to convert text to speech and vice versa. The easiest part is converting text to natural-sounding speech. For this, we compare two different methods:
- gTTS ([Google Text-to-Speech](https://gtts.readthedocs.io/en/latest/index.html)), which leverages Google Translate's speech functionality, providing text-to-speech conversion that allows unlimited lengths of text while keeping proper intonation and abbreviations. It supports several languages and accents ([gTTS accents](https://gtts.readthedocs.io/en/latest/module.html)). To see the available languages, run `gtts-cli --all`. For instance, we can use the following commands:
```
    gTTS('hello', lang='en', tld='co.uk')
    gTTS('bonjour', lang='fr')
```
- OpenAI's text-to-speech model [tts-1](https://platform.openai.com/docs/guides/text-to-speech). It offers different voice options (`alloy`, `echo`, `fable`, ...) and supports a wide range of languages (the same as the Whisper model). It also supports real-time audio streaming using chunked transfer encoding.

First, we convert text to speech and write it to a file with `gTTS`. To play audio we use the Python library [playsound](https://pypi.org/project/playsound/).

In [None]:
from gtts import gTTS
from playsound import playsound

def play_text(text, language='en', accent='co.uk', file_audio="../tmp/audio.mp3"):
    """ 
    play_text: Play text with gTTS and playsound libraries. It writes the audio file
    first and then plays it.
    """
    # gTTS writes MP3 data; tld selects the regional accent
    tts = gTTS(text, lang=language, tld=accent)
    tts.save(file_audio)
    playsound(file_audio)

# The misspelling "beataful" is kept on purpose for the comparison below
text = "Hello, how are you Today? It's a beataful day, isn't it? Have a nice day!"
play_text(text, file_audio="../tmp/hello.mp3")

We compare this to OpenAI's TTS model:

In [None]:
# Text to speech with OpenAI
def play_text_oai(text, file_audio="../tmp/audio.mp3", model="tts-1", voice="alloy"):
    """ 
    play_text_oai: Play text with OpenAI and playsound libraries. It writes the audio file
    first and then plays it.
    """
    # Uses the module-level client configured with openai.api_key above
    response = openai.audio.speech.create(
      model=model,
      voice=voice,
      input=text
    )
    response.stream_to_file(file_audio)
    playsound(file_audio)

play_text_oai(text, file_audio="../tmp/audio.mp3")


OpenAI's model produces more natural-sounding speech and corrects the spelling mistake! However, if you provide a foreign address, it may be rendered incorrectly.

## Transcribe audio to text

The key part is to **transcribe audio to text**. For this we use [openai.audio.transcriptions](https://platform.openai.com/docs/guides/speech-to-text), which provides speech-to-text transcription for many languages, and translation to English, based on OpenAI's Whisper model. It supports several audio formats (mp3, mp4, wav and others), with a file size limit of 25 MB, and several output formats (JSON by default).
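
To get a feeling for what the 25 MB limit means for uncompressed recordings, the sketch below estimates the longest WAV file that fits. It assumes that "25 MB" means 25 × 1024 × 1024 bytes, a canonical 44-byte PCM header, and 16 kHz mono 16-bit audio; the function name `max_wav_seconds` is hypothetical.

```python
WAV_HEADER_BYTES = 44  # assumption: canonical PCM WAV header size


def max_wav_seconds(max_bytes=25 * 1024 * 1024,
                    sample_rate=16000,
                    sample_width=2,
                    channels=1):
    """Estimate the longest uncompressed WAV duration (seconds) under max_bytes."""
    bytes_per_second = sample_rate * sample_width * channels
    return (max_bytes - WAV_HEADER_BYTES) / bytes_per_second


minutes = max_wav_seconds() / 60
print(f"About {minutes:.1f} minutes of 16 kHz mono 16-bit audio fit in one upload")
```

Under these assumptions the limit is far above the 10 to 20 second questions used in this tutorial; compressed formats such as mp3 allow much longer recordings.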

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual (98 languages) and multitask supervised data collected from the web. Because it is trained on a large and diverse dataset, it may be less accurate than models tuned to specific datasets but should be more robust to new data. It also outperforms many translation models. It is based on an encoder-decoder transformer architecture: audio is split into 30-second chunks and converted to log-Mel spectrograms, and the model is trained to predict the next token on several tasks (language identification, transcription, and to-English speech translation).
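
The 30-second chunking step can be illustrated with a few lines of plain Python. This is only a sketch of the windowing idea, not Whisper's actual preprocessing (which also computes the log-Mel spectrogram); the function name `split_into_windows`, the zero-padding of the last window, and the 16 kHz default are assumptions made for illustration.

```python
def split_into_windows(samples, sample_rate=16000, window_seconds=30):
    """Split a sequence of samples into fixed-length windows.

    The last window is zero-padded so that all windows have the same length.
    """
    window = sample_rate * window_seconds
    chunks = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        chunk.extend([0] * (window - len(chunk)))  # pad the final, shorter window
        chunks.append(chunk)
    return chunks


# 70 seconds of dummy audio at 16 kHz gives three 30-second windows
windows = split_into_windows([1] * (70 * 16000))
print(len(windows), len(windows[-1]))
```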

### Record audio

First of all, we need to record audio and write it to a file. For recording audio, we use [PvRecorder](https://pypi.org/project/pvrecorder/), an easy-to-use, cross-platform audio recorder designed for real-time speech audio processing. For writing audio to a file, we use [wave](https://docs.python.org/3/library/wave.html), which makes it easy to read and write WAV files. Other options are [soundfile](https://pypi.org/project/SoundFile/) and [pydub](https://pypi.org/project/pydub/).

In [None]:
from pvrecorder import PvRecorder
devices = PvRecorder.get_available_devices()
print(devices)

In [None]:
import wave
import struct
from pvrecorder import PvRecorder

def write_audio_to_file(audio, 
                        audio_frequency=16000, 
                        file_audio="tmp.wav"):
    """ 
    write_audio_to_file: Write audio to file with wave library.
    """
    with wave.open(file_audio, 'w') as f:
        f.setparams((1, 2, audio_frequency, len(audio), "NONE", "NONE"))
        f.writeframes(struct.pack("h" * len(audio), *audio))


def record_audio(device_index=-1, 
                 frame_length=512, # audio samples at each read
                 num_frames = 600, # 20 seconds
                 audio_frequency=16000, 
                 file_audio="tmp.wav"):
    """
    record_audio: Record audio with pvrecorder library.
    """

    # Record audio
    # Init the recorder
    recorder = PvRecorder(frame_length=frame_length, device_index=device_index)

    print("\nRecording...")
    try:
        audio = []
        recorder.start()
        for fr_id in range(num_frames):
            frame = recorder.read()
            audio.extend(frame)
        write_audio_to_file(audio, audio_frequency=audio_frequency, file_audio=file_audio)
        recorder.stop()
    except KeyboardInterrupt:
        recorder.stop()
        write_audio_to_file(audio, audio_frequency=audio_frequency, file_audio=file_audio)
    finally:
        recorder.delete()
    print("Recording finished.")
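
The WAV writing step can be checked without a microphone. The sketch below writes a synthetic sine tone with the same `wave` and `struct` approach (mono, 16-bit samples) and reads it back to verify its duration. The helper names `sine_wave_bytes` and `wav_duration_seconds` are hypothetical, and an in-memory buffer is used instead of a file on disk.

```python
import io
import math
import struct
import wave


def sine_wave_bytes(duration_s=1.0, freq_hz=440.0, sample_rate=16000):
    """Return the bytes of a mono 16-bit WAV file containing a sine tone."""
    n = int(duration_s * sample_rate)
    samples = [int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
               for i in range(n)]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 2 bytes per sample (int16)
        f.setframerate(sample_rate)
        f.writeframes(struct.pack("<%dh" % n, *samples))
    return buf.getvalue()


def wav_duration_seconds(wav_bytes):
    """Read WAV bytes back and return the duration in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as f:
        return f.getnframes() / f.getframerate()


print(wav_duration_seconds(sine_wave_bytes(0.5)))
```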

Now, we test the audio recording. Run the following code to record and play back the audio.

In [None]:
# Record audio sample and play it
# (OpenAI TTS returns MP3 data by default, so the prompt files use the .mp3 extension)
play_text_oai("Please, say something (you have 5 seconds)", file_audio="../tmp/tmp.mp3")
record_audio(file_audio="../tmp/audio.wav", num_frames=150, device_index=-1)
play_text_oai("You said", file_audio="../tmp/tmp.mp3")
playsound("../tmp/audio.wav")
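
The recording length follows from the number of frames, the samples per frame, and the sampling frequency. The small helpers below make that relation explicit; the names `recording_seconds` and `frames_for_seconds` are hypothetical, and the defaults mirror the values used in this tutorial.

```python
import math


def recording_seconds(num_frames, frame_length=512, audio_frequency=16000):
    """Duration in seconds of num_frames reads of frame_length samples each."""
    return num_frames * frame_length / audio_frequency


def frames_for_seconds(seconds, frame_length=512, audio_frequency=16000):
    """Smallest number of frames that covers at least the requested duration."""
    return math.ceil(seconds * audio_frequency / frame_length)


for n in (150, 300, 600):
    print(n, "frames ->", recording_seconds(n), "seconds")
```

With these defaults, 150 frames last slightly less than 5 seconds, which is why the prompts round the duration.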

### Transcribe audio file to text

Now, we are ready to transcribe audio to text using `openai.audio.transcriptions`. For this, we set the model name to `whisper-1` and specify the user's language. Specifying the language is key to getting good results; otherwise, the model may get confused by accents.

To test it, we record some audio, play it back and print the transcribed text.

In [None]:
# Speech-to-text model name
llm_audio_name = "whisper-1"

# Language of user speech (For better accuracy; otherwise accents lead to errors)
language_user = "en"                

# Record audio sample and play it
play_text_oai("Please, say something (you have 10 seconds)", file_audio="../tmp/tmp.mp3")
record_audio(file_audio="../tmp/audio.wav", num_frames=300, device_index=-1)
play_text_oai("Now, we print what you said:", file_audio="../tmp/tmp.mp3")

# Read audio file and transcribe it, passing the user language
with open(os.path.join("../tmp", "audio.wav"), "rb") as audio_file:
    text = openai.audio.transcriptions.create(model=llm_audio_name, 
                                              file=audio_file, 
                                              language=language_user,
                                              response_format="text")
print(f"\nQuestion: {text}")

Note that this is not real-time transcription. That would require streaming the audio in chunks and a lot of optimization.

## Recap on LangChain and OpenAI

Refer to the previous tutorial ([Chat to any source of data](https://medium.com/@juanabascal78/chat-to-any-source-of-data-with-langchain-and-openai-3677ecb8665d)) for a short introduction to *LangChain* covering data loading, splitting data into chunks, using embeddings, creating vector database stores and creating high-level chains to easily interact with an LLM. We recap the main steps in this section.

Define any source of data.

In [None]:
# Document type: "pdf" or "url" or "youtube"
example_type = "url"                

if example_type == "url":
    doc_type = "url" 
    doc_path = "https://en.wikipedia.org/wiki/Cinque_Terre"
elif example_type == "pdf":
    doc_type = "pdf" 
    doc_path = "../data/Adler_DeepPosteriorSampling19.pdf"
elif example_type == "youtube":
    doc_type = "youtube" 
    #doc_path = "https://www.youtube.com/watch?v=PNVgh4ZSjCw"
    doc_path = "https://www.youtube.com/watch?v=W0DM5lcj6mw"

Read the data.

In [None]:
import re

# Collapse runs of blank lines in web pages
def clear_blank_lines(docs):
    for doc in docs:
        doc.page_content = re.sub(r"\n\n\n+", "\n\n", doc.page_content)
    return docs

# Read document with langchain.document_loaders
def read_doc(doc_type, doc_path):
    if doc_type == "pdf":
        from langchain.document_loaders import PyPDFLoader
        loader = PyPDFLoader(doc_path)
        docs = loader.load()
    elif doc_type == "url":
        from langchain.document_loaders import WebBaseLoader
        url = doc_path
        loader = WebBaseLoader(url)
        docs = loader.load()
    elif doc_type == "youtube":
        # See the requirements file for the extra required libraries
        # Not working currently on langchain with current openAI version for STT!!!
        from langchain.document_loaders.blob_loaders.youtube_audio import \
            YoutubeAudioLoader
        from langchain.document_loaders.generic import GenericLoader
        from langchain.document_loaders.parsers import OpenAIWhisperParser
        save_path = "./downloads"
        url = doc_path
        loader = GenericLoader(YoutubeAudioLoader([url], save_path), OpenAIWhisperParser())
        docs = loader.load()
    else:
        raise ValueError(f"Unsupported doc_type: {doc_type}")

    # Collapse runs of blank lines in web pages
    clear_blank_lines(docs)

    print(f"Loaded {len(docs)} pages/documents")
    print(f"First page: {docs[0].metadata}")
    print(docs[0].page_content[:500])
    return docs

def pretty_print_docs(docs, question = None):
    print(f"\n{'-' * 100}\n")
    if question:
        print(f"Question: {question}")

    for i, doc in enumerate(docs):
        print(f"Document {i+1}:\n\nMetadata: {doc.metadata}\n")
        print(doc.page_content)
    print("\n")        

# Read document with langchain.document_loaders
docs = read_doc(doc_type, doc_path)

Split into chunks.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Parameters for splitting documents into chunks
chunk_size = 1500                   
chunk_overlap = 150
add_start_index = True

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, 
    chunk_overlap=chunk_overlap,
    add_start_index=add_start_index)

docs_split = text_splitter.split_documents(docs)
print(f"Split into {len(docs_split)} chunks")
print(f"First chunk: {docs_split[0].metadata}")
print(docs_split[0].page_content)
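
To see what `chunk_size` and `chunk_overlap` mean, the sketch below implements a plain sliding-window splitter. It is a simplification written for illustration: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunks will generally differ. The function name `split_with_overlap` is hypothetical; the returned start offsets play the role of the start index stored when `add_start_index` is enabled.

```python
def split_with_overlap(text, chunk_size=1500, chunk_overlap=150):
    """Split text into (start_index, chunk) pairs using a fixed-size sliding window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Small example: windows of 4 characters that share 1 character with the next one
for start, chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1):
    print(start, chunk)
```

The overlap means that a sentence cut at a chunk boundary still appears in part in the neighbouring chunk, which helps retrieval.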

Define the conversational chain.

In [None]:
from langchain.chat_models import ChatOpenAI
#from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DocArrayInMemorySearch

# Chat model name
llm_name = "gpt-3.5-turbo"

# Init the LLM and memory
# llm = OpenAI(temperature=0, openai_api_key=openai_key)
llm = ChatOpenAI(model_name=llm_name,
                 temperature=0,
                 openai_api_key=openai.api_key)

# Memory buffer
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Define embedding
embedding = OpenAIEmbeddings(openai_api_key=openai.api_key)    

# Create vector database from data    
db = DocArrayInMemorySearch.from_documents(
    docs_split, 
    embedding=embedding)

# Conversational chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=db.as_retriever(),
    memory=memory
)
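
Conceptually, the retriever embeds the question and returns the chunks whose embeddings are closest to it. The toy example below shows this idea with hand-made 2D vectors and cosine similarity. It is an illustration only: the function names are hypothetical, real embeddings have many more dimensions, and cosine similarity is used here as an illustrative choice of metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Indices of the k document vectors most similar to the query vector."""
    ranked = sorted(enumerate(doc_vecs),
                    key=lambda pair: cosine_similarity(query_vec, pair[1]),
                    reverse=True)
    return [index for index, _ in ranked[:k]]


# Toy "embeddings" for three chunks and one question
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```

The retrieved chunks are then passed to the LLM together with the question and the chat history.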

## Build a talking chatbot

Finally, we reach the point where we can hold a conversation with our data. We start by defining some parameters.

In [None]:
# Parameters

# LLM name (note: qa_chain above was built with "gpt-3.5-turbo"; rebuild the chain to use this model)
llm_name = "gpt-4"
llm_audio_name = "whisper-1"

# Document
example_type = "url"                # Document type: "pdf" or "url" or "youtube"
chunk_size = 1500                   # Parameters for splitting documents into chunks
chunk_overlap = 150
mode_input = "file"                 # Mode read: "file" or "db", db if db already saved to drive (avoid reading it)

# Mode of interaction
question_mode = "audio"             # "text" or "audio"
language_user = "en"                # Language of user speech (For better accuracy; otherwise accents lead to errors)
language_answer = "en"              # Desired language for reply speech (gTTS)

# Parameters for recording audio
audio_frequency = 16000
frame_length = 512                  # audio samples at each read
num_frames = 300                    # 600 for 20 seconds

Then, we specify text scripts to interact with the LLM. 

In [None]:
path_tmp = "../tmp"                         # Path to save audio
name_tmp_audio = "audio.wav"                # Recorded question (written as WAV by record_audio)
file_audio_intro = "../tmp/talk_intro.mp3"  # Temporary audio files
file_audio_question = "../tmp/question.mp3"
file_audio_answer = "../tmp/answer.mp3"

persist_path = "./docs/chroma"              # Persist path to save vector database

# Create the temporary folder if it does not exist
os.makedirs(path_tmp, exist_ok=True)
file_tmp_audio = os.path.join(path_tmp, name_tmp_audio)

# Audio samples
text_intro = f"""
You are chatting to {llm_name}, transcriptions by {llm_audio_name}, 
about the provided {example_type} link. 
You can ask questions or chat about the document provided, in any language. 
You have 10 to 20 seconds to ask each question. 
Answers will be played back to you and printed out in the language selected. 
To end the chat, say 'End chat' when providing a question.
"""

text_question = "Ask your question"

Finally, we are ready to talk to our data. In a loop, we record a spoken question, transcribe it, pass it to the conversational chain, and play back the answer.

In [None]:
# Start interaction
play_text_oai(text_intro, file_audio=file_audio_intro)
while True:
    # Prompt the user to ask a question
    print(text_question)
    play_text_oai(text_question, file_audio=file_audio_question)

    # Record audio
    record_audio(device_index=-1, 
                 frame_length=frame_length,   # audio samples at each read
                 num_frames=num_frames,       # about 10 seconds for 300 frames
                 audio_frequency=audio_frequency, 
                 file_audio=file_tmp_audio)

    # Transcribe audio
    with open(file_tmp_audio, "rb") as audio_file:
        question = openai.audio.transcriptions.create(model=llm_audio_name, 
                                                      file=audio_file, 
                                                      response_format="text", 
                                                      language=language_user)
    print(f"\nQuestion: {question}")

    # The transcript may carry punctuation or whitespace, so match the stop phrase loosely
    if "end chat" in question.lower():
        break

    # Run QA chain
    result = qa_chain({"question": question})
    print(f"Answer: {result['answer']}")

    # Text to speech
    if question_mode == "audio":
        play_text_oai(result['answer'], file_audio=file_audio_answer)
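
Speech-to-text output typically carries capitalization, punctuation and trailing whitespace, so an exact string comparison with the stop phrase is brittle. A more tolerant check could look like the sketch below; the helper name `is_end_chat` is hypothetical and the normalization rules are one possible choice.

```python
import string


def is_end_chat(transcript, stop_phrase="end chat"):
    """Return True if the transcript contains the stop phrase.

    Case, punctuation and repeated whitespace are ignored.
    """
    no_punct = transcript.lower().translate(str.maketrans("", "", string.punctuation))
    normalized = " ".join(no_punct.split())
    return stop_phrase in normalized


print(is_end_chat("End chat.\n"))
print(is_end_chat("What is Cinque Terre?"))
```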


Other speech and audio tools include the following:
- **SpeechRecognition**: speech recognition with support for several engines and APIs, online and offline
- **PyAudio**: access devices and record/play audio
- **Librosa**: audio analysis, pitch detection, beat tracking, audio segmentation