Commit
Add new document_loader: AssemblyAIAudioTranscriptLoader (#9667)
This PR adds a new document loader, `AssemblyAIAudioTranscriptLoader`, that transcribes audio files with the [AssemblyAI API](https://www.assemblyai.com) and loads the transcribed text into documents.

- Add new document_loader with class `AssemblyAIAudioTranscriptLoader`
- Add optional dependency `assemblyai`
- Add unit tests (using a Mock client)
- Add docs notebook

This is the equivalent of the JS integration already available in LangChain.js. See the [LangChain JS docs AssemblyAI page](https://js.langchain.com/docs/modules/data_connection/document_loaders/integrations/web_loaders/assemblyai_audio_transcription).

At its simplest, you can use the loader to get a transcript back from an audio file like this:

```python
from langchain.document_loaders.assemblyai import AssemblyAIAudioTranscriptLoader

loader = AssemblyAIAudioTranscriptLoader(file_path="./testfile.mp3")
docs = loader.load()
```

To use it, you need the `assemblyai` Python package installed and the environment variable `ASSEMBLYAI_API_KEY` set to your API key. Alternatively, the API key can be passed as an argument.

Twitter handles to shout out if so kindly 🙇 [@AssemblyAI](https://twitter.com/AssemblyAI) and [@patLoeber](https://twitter.com/patloeber)

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
1 parent 25f2c82, commit 5990651
Showing 6 changed files with 491 additions and 2 deletions.
224 changes: 224 additions & 0 deletions
docs/extras/integrations/document_loaders/assemblyai.ipynb
@@ -0,0 +1,224 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# AssemblyAI Audio Transcripts\n",
    "\n",
    "The `AssemblyAIAudioTranscriptLoader` transcribes audio files with the [AssemblyAI API](https://www.assemblyai.com) and loads the transcribed text into documents.\n",
    "\n",
    "To use it, you should have the `assemblyai` Python package installed, and the\n",
    "environment variable `ASSEMBLYAI_API_KEY` set with your API key. Alternatively, the API key can also be passed as an argument.\n",
    "\n",
    "More info about AssemblyAI:\n",
    "\n",
    "- [Website](https://www.assemblyai.com/)\n",
    "- [Get a Free API key](https://www.assemblyai.com/dashboard/signup)\n",
    "- [AssemblyAI API Docs](https://www.assemblyai.com/docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installation\n",
    "\n",
    "First, you need to install the `assemblyai` Python package.\n",
    "\n",
    "You can find more info about it inside the [assemblyai-python-sdk GitHub repo](https://github.com/AssemblyAI/assemblyai-python-sdk)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install assemblyai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example\n",
    "\n",
    "The `AssemblyAIAudioTranscriptLoader` needs at least the `file_path` argument. Audio files can be specified as a URL or a local file path."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.assemblyai import AssemblyAIAudioTranscriptLoader\n",
    "\n",
    "audio_file = \"https://storage.googleapis.com/aai-docs-samples/nbc.mp3\"\n",
    "# or a local file path: audio_file = \"./nbc.mp3\"\n",
    "\n",
    "loader = AssemblyAIAudioTranscriptLoader(file_path=audio_file)\n",
    "\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: Calling `loader.load()` blocks until the transcription is finished."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transcribed text is available in the `page_content`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0].page_content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "\"Load time, a new president and new congressional makeup. Same old ...\"\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `metadata` contains the full JSON response with more meta information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs[0].metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'language_code': <LanguageCode.en_us: 'en_us'>,\n",
    " 'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3',\n",
    " 'punctuate': True,\n",
    " 'format_text': True,\n",
    " ...\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transcript Formats\n",
    "\n",
    "You can specify the `transcript_format` argument for different formats.\n",
    "\n",
    "Depending on the format, one or more documents are returned. These are the different `TranscriptFormat` options:\n",
    "\n",
    "- `TEXT`: One document with the transcription text\n",
    "- `SENTENCES`: Multiple documents, splits the transcription by each sentence\n",
    "- `PARAGRAPHS`: Multiple documents, splits the transcription by each paragraph\n",
    "- `SUBTITLES_SRT`: One document with the transcript exported in SRT subtitles format\n",
    "- `SUBTITLES_VTT`: One document with the transcript exported in VTT subtitles format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.document_loaders.assemblyai import (\n",
    "    AssemblyAIAudioTranscriptLoader,\n",
    "    TranscriptFormat,\n",
    ")\n",
    "\n",
    "loader = AssemblyAIAudioTranscriptLoader(\n",
    "    file_path=\"./your_file.mp3\",\n",
    "    transcript_format=TranscriptFormat.SENTENCES,\n",
    ")\n",
    "\n",
    "docs = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transcription Config\n",
    "\n",
    "You can also specify the `config` argument to use different audio intelligence models.\n",
    "\n",
    "Visit the [AssemblyAI API Documentation](https://www.assemblyai.com/docs) to get an overview of all available models!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import assemblyai as aai\n",
    "\n",
    "config = aai.TranscriptionConfig(speaker_labels=True,\n",
    "                                 auto_chapters=True,\n",
    "                                 entity_detection=True\n",
    ")\n",
    "\n",
    "loader = AssemblyAIAudioTranscriptLoader(\n",
    "    file_path=\"./your_file.mp3\",\n",
    "    config=config\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pass the API Key as an argument\n",
    "\n",
    "Besides setting the API key as the environment variable `ASSEMBLYAI_API_KEY`, you can also pass it as an argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = AssemblyAIAudioTranscriptLoader(\n",
    "    file_path=\"./your_file.mp3\",\n",
    "    api_key=\"YOUR_KEY\"\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
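The notebook describes two ways to supply the API key: the `ASSEMBLYAI_API_KEY` environment variable or the `api_key` argument. Because `loader.load()` blocks on a network call, a caller may prefer to confirm a key is available before constructing the loader. The following is a minimal sketch of that check, not part of this commit: `resolve_api_key` and its `env` parameter are hypothetical names introduced here for illustration, and neither LangChain nor the `assemblyai` SDK provides them.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else the ASSEMBLYAI_API_KEY variable.

    Hypothetical helper for illustration only; it is not part of LangChain
    or the assemblyai package.
    """
    # Allow a custom mapping so the lookup can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = api_key or env.get("ASSEMBLYAI_API_KEY")
    if not key:
        raise ValueError(
            "No AssemblyAI API key found. "
            "Set ASSEMBLYAI_API_KEY or pass api_key explicitly."
        )
    return key
```

The returned value could then be passed as `api_key=resolve_api_key()` when constructing `AssemblyAIAudioTranscriptLoader`, so a missing key surfaces before any audio is uploaded.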
111 changes: 111 additions & 0 deletions
libs/langchain/langchain/document_loaders/assemblyai.py
@@ -0,0 +1,111 @@
from __future__ import annotations

from enum import Enum
from typing import TYPE_CHECKING, List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

if TYPE_CHECKING:
    import assemblyai


class TranscriptFormat(Enum):
    """Transcript format to use for the document loader."""

    TEXT = "text"
    """One document with the transcription text"""
    SENTENCES = "sentences"
    """Multiple documents, splits the transcription by each sentence"""
    PARAGRAPHS = "paragraphs"
    """Multiple documents, splits the transcription by each paragraph"""
    SUBTITLES_SRT = "subtitles_srt"
    """One document with the transcript exported in SRT subtitles format"""
    SUBTITLES_VTT = "subtitles_vtt"
    """One document with the transcript exported in VTT subtitles format"""


class AssemblyAIAudioTranscriptLoader(BaseLoader):
    """
    Loader for AssemblyAI audio transcripts.

    It uses the AssemblyAI API to transcribe audio files
    and loads the transcribed text into one or more Documents,
    depending on the specified format.

    To use, you should have the ``assemblyai`` Python package installed, and the
    environment variable ``ASSEMBLYAI_API_KEY`` set with your API key.
    Alternatively, the API key can also be passed as an argument.

    Audio files can be specified via a URL or a local file path.
    """

    def __init__(
        self,
        file_path: str,
        *,
        transcript_format: TranscriptFormat = TranscriptFormat.TEXT,
        config: Optional[assemblyai.TranscriptionConfig] = None,
        api_key: Optional[str] = None,
    ):
        """
        Initializes the AssemblyAI AudioTranscriptLoader.

        Args:
            file_path: A URL or a local file path.
            transcript_format: Transcript format to use.
                See class ``TranscriptFormat`` for more info.
            config: Transcription options and features. If ``None`` is given,
                the Transcriber's default configuration will be used.
            api_key: AssemblyAI API key.
        """
        try:
            import assemblyai
        except ImportError:
            raise ImportError(
                "Could not import assemblyai python package. "
                "Please install it with `pip install assemblyai`."
            )
        if api_key is not None:
            assemblyai.settings.api_key = api_key

        self.file_path = file_path
        self.transcript_format = transcript_format
        self.transcriber = assemblyai.Transcriber(config=config)

    def load(self) -> List[Document]:
        """Transcribes the audio file and loads the transcript into documents.

        It uses the AssemblyAI API to transcribe the audio file and blocks until
        the transcription is finished.
        """
        # The transcribe call raises a ValueError if no API key is set.
        transcript = self.transcriber.transcribe(self.file_path)

        if transcript.error:
            raise ValueError(f"Could not transcribe file: {transcript.error}")

        if self.transcript_format == TranscriptFormat.TEXT:
            return [
                Document(
                    page_content=transcript.text, metadata=transcript.json_response
                )
            ]
        elif self.transcript_format == TranscriptFormat.SENTENCES:
            sentences = transcript.get_sentences()
            return [
                Document(page_content=s.text, metadata=s.dict(exclude={"text"}))
                for s in sentences
            ]
        elif self.transcript_format == TranscriptFormat.PARAGRAPHS:
            paragraphs = transcript.get_paragraphs()
            return [
                Document(page_content=p.text, metadata=p.dict(exclude={"text"}))
                for p in paragraphs
            ]
        elif self.transcript_format == TranscriptFormat.SUBTITLES_SRT:
            return [Document(page_content=transcript.export_subtitles_srt())]
        elif self.transcript_format == TranscriptFormat.SUBTITLES_VTT:
            return [Document(page_content=transcript.export_subtitles_vtt())]
        else:
            raise ValueError("Unknown transcript format.")
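The branch structure of `load()` determines how many documents come back for each `TranscriptFormat`. As a rough illustration, the sketch below mirrors that dispatch with stand-in classes so it runs without network access or an API key. Only the enum values and the order of the branches come from the diff above; `FakeTranscript`, `to_documents`, the sample text, and the use of plain dicts in place of `Document` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class TranscriptFormat(Enum):
    TEXT = "text"
    SENTENCES = "sentences"
    PARAGRAPHS = "paragraphs"
    SUBTITLES_SRT = "subtitles_srt"
    SUBTITLES_VTT = "subtitles_vtt"


@dataclass
class FakeTranscript:
    """Stand-in for an SDK transcript; every field here is illustrative."""

    text: str
    sentences: List[str] = field(default_factory=list)
    paragraphs: List[str] = field(default_factory=list)


def to_documents(
    transcript: FakeTranscript, fmt: TranscriptFormat
) -> List[Dict[str, str]]:
    """Mirror the loader's dispatch, returning one dict per document."""
    if fmt == TranscriptFormat.TEXT:
        return [{"page_content": transcript.text}]
    if fmt == TranscriptFormat.SENTENCES:
        return [{"page_content": s} for s in transcript.sentences]
    if fmt == TranscriptFormat.PARAGRAPHS:
        return [{"page_content": p} for p in transcript.paragraphs]
    if fmt in (TranscriptFormat.SUBTITLES_SRT, TranscriptFormat.SUBTITLES_VTT):
        # The real loader asks the SDK to export subtitles; this placeholder
        # only records which format was requested.
        return [{"page_content": f"<{fmt.value} export of transcript>"}]
    raise ValueError("Unknown transcript format.")


fake = FakeTranscript(
    text="Hello there. How are you?",
    sentences=["Hello there.", "How are you?"],
    paragraphs=["Hello there. How are you?"],
)

# Number of documents produced for each format with the sample above.
counts = {fmt.name: len(to_documents(fake, fmt)) for fmt in TranscriptFormat}
```

With this sample, only `SENTENCES` yields more than one document; `TEXT` and both subtitle formats always yield exactly one, which matches the descriptions attached to the enum members.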