# Voice Input Processing with the SpeechToText and VoiceInputChain classes for Ollama models

> **[SpeechToText](https://openai.com/index/whisper/)** is a wrapper around the OpenAI Whisper API, which uses machine learning to transcribe audio files to English text.
>
> The parser supports `.mp3`, `.mp4`, `.mpeg`, `.mpga`, `.m4a`, `.wav`, and `.webm`.

The current implementation follows LangChain core principles and can be combined with other loaders to handle both audio downloading and parsing. As a result, the parser yields its output lazily and returns an `Iterator[Document]`.
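
To illustrate what a lazily yielded `Iterator[Document]` means for the caller, here is a minimal sketch in plain Python. `FakeDocument` and `lazy_parse_stub` are hypothetical stand-ins invented for this example, not the actual `SpeechToText` or LangChain classes, and no audio is transcribed.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def lazy_parse_stub(chunks: List[str], source: str) -> Iterator[FakeDocument]:
    """Yield one document per transcript chunk, only when the caller asks for it."""
    for index, text in enumerate(chunks):
        yield FakeDocument(
            page_content=text,
            metadata={"source": source, "chunk": str(index)},
        )


docs = lazy_parse_stub(["first chunk", "second chunk"], source="audio_input.wav")
first = next(docs)      # only the first chunk has been produced so far
remaining = list(docs)  # consuming the iterator produces the rest
```

Because the function is a generator, nothing is produced until the iterator is consumed, which is what lets a parser of this shape be chained after other loaders.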

> **VoiceInputChain** is a class that runs chains based on voice input from users.

## Prerequisites

The **SpeechToText** class requires an OpenAI API key (either passed to the class or set as an environment variable). The **VoiceInputChain** class does not require a key, because Ollama models run locally. Ollama and the model you intend to use must be downloaded on the user's device, and the required dependencies must be installed.
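
The "passed to the class or as an environment variable" choice can be sketched as below. This is an illustration only: the helper `resolve_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumption based on the usual OpenAI convention; the notebook does not state which variable `SpeechToText` actually reads.

```python
import os
from typing import Optional


def resolve_api_key(
    explicit_key: Optional[str] = None,
    env_var: str = "OPENAI_API_KEY",  # assumed variable name, not confirmed by the source
) -> str:
    """Prefer an explicitly passed key, fall back to the environment, else fail loudly."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key was passed and {env_var} is not set.")
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error raised later during transcription.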


In [None]:
%pip install -Uq langchain langchain-community openai

## Use Case 1: Using pre-recorded audio as voice input for Ollama models

The `SpeechToText` method `.lazy_parse` accepts a `Blob` object containing the path of the file to be transcribed. Once transcribed, the audio input can be passed to the `VoiceInputChain` class and run through an Ollama model.
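
Before handing a file path to the parser, it can help to check that the extension is one of the supported formats listed at the top of this notebook. The helper below is a hypothetical convenience written for this example; it is not part of `SpeechToText`, and it only inspects the file name, not the file contents.

```python
from pathlib import Path

# Formats listed as supported at the top of this notebook.
SUPPORTED_SUFFIXES = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def is_supported_audio(path: str) -> bool:
    """Return True when the file extension is one of the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES
```

For example, `is_supported_audio("clip.WAV")` is true because the comparison is case-insensitive, while a path with no extension is rejected.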

In [None]:
audio_path = "path/to/your/audio/file"
key = "<your_api_key>"

In [None]:
from langchain_community.tools.ollama_voice_input import SpeechToText, VoiceInputChain

stt = SpeechToText(api_key=key, audio_path=audio_path)
voice_model = VoiceInputChain(stt=stt)  # llama2 model by default
response = voice_model.run()

In [None]:
print(response)  # view response from voice input from Ollama model

## Use Case 2: Recording audio with a `SpeechToText` object as voice input for Ollama models

First, the `SpeechToText` method `.record_audio` accepts integer `duration` (in seconds) and `sample_rate` (in samples per second) parameters to record voice input from the user, and saves the recording at the location given by the `path` parameter. By default, the audio input is saved as `audio_input.wav` in the current directory.
Then, the `SpeechToText` method `.lazy_parse` accepts a `Blob` object containing the path of the file to be transcribed. Once transcribed, the audio input can be passed to the `VoiceInputChain` class and run through an Ollama model.
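
To make the relationship between `duration` and `sample_rate` concrete, the sketch below writes a silent WAV file with Python's standard `wave` module. It does not record from a microphone and is not how `record_audio` is implemented; `write_silent_wav`, the 16-bit mono format, and the 16000 Hz rate are assumptions chosen for illustration.

```python
import os
import tempfile
import wave


def write_silent_wav(path: str, duration: int, sample_rate: int) -> int:
    """Write `duration` seconds of 16-bit mono silence and return the frame count."""
    n_frames = duration * sample_rate  # one frame per sample in mono audio
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample, i.e. 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    return n_frames


demo_path = os.path.join(tempfile.gettempdir(), "audio_input_demo.wav")
frames = write_silent_wav(demo_path, duration=2, sample_rate=16000)

# Reading the file back recovers the duration from frames / sample rate.
with wave.open(demo_path, "rb") as wav_file:
    seconds = wav_file.getnframes() / wav_file.getframerate()
```

The file length in seconds is the frame count divided by the sample rate, so a longer `duration` or a higher `sample_rate` both increase the file size.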

In [None]:
from langchain_community.tools.ollama_voice_input import SpeechToText, VoiceInputChain

In [None]:
stt = SpeechToText(api_key=key)
stt.record_audio(duration=20)

In [None]:
voice_model = VoiceInputChain(stt=stt)  # llama2 model by default
response = voice_model.run()
print(response)  # view response from voice input from Ollama model

## Use Case 3: Using pre-recorded voice input with a customised Ollama chain

The `SpeechToText` method `.lazy_parse` accepts a `Blob` object containing the path of the file to be transcribed. Once transcribed, the audio input can be passed to the `VoiceInputChain` class and run through a pre-made Ollama chain (e.g. a custom summarizer for RAG).
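
The data flow through such a chain can be sketched with plain Python functions: the transcript is wrapped under the template's key, the prompt is formatted, a model produces text, and the result is returned as a string. Everything here is a hypothetical stand-in; `fake_model` simply echoes its input in upper case and does not call Ollama or LangChain.

```python
from typing import Dict

PROMPT_TEMPLATE = "Give a concise summary of the text. Text: {element}"


def to_prompt_input(transcript: str) -> Dict[str, str]:
    """Wrap the raw transcript under the key the template expects."""
    return {"element": transcript}


def format_prompt(values: Dict[str, str]) -> str:
    """Fill the template with the provided values."""
    return PROMPT_TEMPLATE.format(**values)


def fake_model(prompt: str) -> str:
    """Hypothetical model stand-in: returns the text after 'Text: ' in upper case."""
    return prompt.split("Text: ", 1)[1].upper()


def run_chain(transcript: str) -> str:
    """Apply the three steps in order, as a piped chain would."""
    return fake_model(format_prompt(to_prompt_input(transcript)))
```

Each stage takes the previous stage's output as its only input, which is why stages can be swapped without touching the rest of the pipeline.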

In [None]:
from langchain_community.llms import ollama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

In [None]:
prompt_text = """You are an assistant tasked with summarizing text for retrieval.
These summaries will be embedded and used to retrieve the raw text.
Give a concise summary of the text that is well optimized for retrieval. Text: {element}"""
prompt = ChatPromptTemplate.from_template(prompt_text)
model = ollama.Ollama(temperature=0, model="llama2")  # Ollama models do not need an API key
summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

In [None]:
stt = SpeechToText(api_key=key, audio_path=audio_path)  # pre-recorded audio, as in Use Case 1
voice_model = VoiceInputChain(stt=stt, chain=summarize_chain)
print(voice_model.run())  # view response from voice input from Ollama chain