Add new document_loader: AssemblyAIAudioTranscriptLoader #9667

Merged
224 changes: 224 additions & 0 deletions docs/extras/integrations/document_loaders/assemblyai.ipynb
@@ -0,0 +1,224 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# AssemblyAI Audio Transcripts\n",
"\n",
"The `AssemblyAIAudioTranscriptLoader` transcribes audio files with the [AssemblyAI API](https://www.assemblyai.com) and loads the transcribed text into documents.\n",
"\n",
"To use it, you need the `assemblyai` Python package installed and the\n",
"environment variable `ASSEMBLYAI_API_KEY` set to your API key. Alternatively, you can pass the API key as an argument.\n",
"\n",
"More info about AssemblyAI:\n",
"\n",
"- [Website](https://www.assemblyai.com/)\n",
"- [Get a Free API key](https://www.assemblyai.com/dashboard/signup)\n",
"- [AssemblyAI API Docs](https://www.assemblyai.com/docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation\n",
"\n",
"First, install the `assemblyai` Python package.\n",
"\n",
"You can find more info about it in the [assemblyai-python-sdk GitHub repo](https://github.com/AssemblyAI/assemblyai-python-sdk)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install assemblyai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example\n",
"\n",
"The `AssemblyAIAudioTranscriptLoader` needs at least the `file_path` argument. Audio files can be specified as a URL or a local file path."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.assemblyai import AssemblyAIAudioTranscriptLoader\n",
"\n",
"audio_file = \"https://storage.googleapis.com/aai-docs-samples/nbc.mp3\"\n",
"# or a local file path: audio_file = \"./nbc.mp3\"\n",
"\n",
"loader = AssemblyAIAudioTranscriptLoader(file_path=audio_file)\n",
"\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: Calling `loader.load()` blocks until the transcription is finished."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The transcribed text is available in the `page_content`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs[0].page_content"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"\"Load time, a new president and new congressional makeup. Same old ...\"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `metadata` contains the full JSON response with more meta information:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"docs[0].metadata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```\n",
"{'language_code': <LanguageCode.en_us: 'en_us'>,\n",
" 'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3',\n",
" 'punctuate': True,\n",
" 'format_text': True,\n",
" ...\n",
"}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transcript Formats\n",
"\n",
"You can set the `transcript_format` argument to choose a different output format.\n",
"\n",
"Depending on the format, one or more documents are returned. These are the different `TranscriptFormat` options:\n",
"\n",
"- `TEXT`: One document with the transcription text\n",
"- `SENTENCES`: Multiple documents, one per sentence of the transcription\n",
"- `PARAGRAPHS`: Multiple documents, one per paragraph of the transcription\n",
"- `SUBTITLES_SRT`: One document with the transcript exported in SRT subtitles format\n",
"- `SUBTITLES_VTT`: One document with the transcript exported in VTT subtitles format"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders.assemblyai import (\n",
"    AssemblyAIAudioTranscriptLoader,\n",
"    TranscriptFormat,\n",
")\n",
"\n",
"loader = AssemblyAIAudioTranscriptLoader(\n",
"    file_path=\"./your_file.mp3\",\n",
"    transcript_format=TranscriptFormat.SENTENCES,\n",
")\n",
"\n",
"docs = loader.load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transcription Config\n",
"\n",
"You can also specify the `config` argument to use different audio intelligence models.\n",
"\n",
"Visit the [AssemblyAI API Documentation](https://www.assemblyai.com/docs) to get an overview of all available models!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import assemblyai as aai\n",
"\n",
"config = aai.TranscriptionConfig(\n",
"    speaker_labels=True,\n",
"    auto_chapters=True,\n",
"    entity_detection=True,\n",
")\n",
"\n",
"loader = AssemblyAIAudioTranscriptLoader(\n",
"    file_path=\"./your_file.mp3\",\n",
"    config=config,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pass the API Key as an Argument\n",
"\n",
"Besides setting the API key via the environment variable `ASSEMBLYAI_API_KEY`, you can also pass it as an argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = AssemblyAIAudioTranscriptLoader(\n",
"    file_path=\"./your_file.mp3\",\n",
"    api_key=\"YOUR_KEY\",\n",
")"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
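As the loader code in this PR notes, a missing API key only surfaces as a `ValueError` once `load()` calls the API. A minimal sketch of a fail-early check follows. It assumes a hypothetical helper, `resolve_api_key`, which is not part of this PR or of the `assemblyai` SDK. The helper prefers an explicit argument over the `ASSEMBLYAI_API_KEY` environment variable:

```python
import os
from typing import Mapping, Optional

ENV_VAR = "ASSEMBLYAI_API_KEY"  # environment variable named in the notebook


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: return the explicit key, else the environment value.

    Raises ValueError when neither source provides a non-empty key.
    """
    # `env` is injectable only to make the helper easy to test.
    env = os.environ if env is None else env
    key = api_key or env.get(ENV_VAR)
    if not key:
        raise ValueError(
            f"No AssemblyAI API key found: pass `api_key` or set {ENV_VAR}."
        )
    return key
```

The result could then be passed as `api_key=resolve_api_key()` when constructing the loader, so a configuration mistake shows up before any audio is uploaded.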
2 changes: 2 additions & 0 deletions libs/langchain/langchain/document_loaders/__init__.py
@@ -31,6 +31,7 @@
from langchain.document_loaders.apify_dataset import ApifyDatasetLoader
from langchain.document_loaders.arcgis_loader import ArcGISLoader
from langchain.document_loaders.arxiv import ArxivLoader
from langchain.document_loaders.assemblyai import AssemblyAIAudioTranscriptLoader
from langchain.document_loaders.async_html import AsyncHtmlLoader
from langchain.document_loaders.azlyrics import AZLyricsLoader
from langchain.document_loaders.azure_blob_storage_container import (
@@ -219,6 +220,7 @@
"ApifyDatasetLoader",
"ArcGISLoader",
"ArxivLoader",
"AssemblyAIAudioTranscriptLoader",
"AsyncHtmlLoader",
"AzureBlobStorageContainerLoader",
"AzureBlobStorageFileLoader",
110 changes: 110 additions & 0 deletions libs/langchain/langchain/document_loaders/assemblyai.py
@@ -0,0 +1,110 @@
from __future__ import annotations

from enum import Enum
from typing import TYPE_CHECKING, List, Optional

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

if TYPE_CHECKING:
    import assemblyai


class TranscriptFormat(Enum):
    """Transcript format to use for the document loader."""

    TEXT = "text"
    """One document with the transcription text"""
    SENTENCES = "sentences"
    """Multiple documents, splits the transcription by each sentence"""
    PARAGRAPHS = "paragraphs"
    """Multiple documents, splits the transcription by each paragraph"""
    SUBTITLES_SRT = "subtitles_srt"
    """One document with the transcript exported in SRT subtitles format"""
    SUBTITLES_VTT = "subtitles_vtt"
    """One document with the transcript exported in VTT subtitles format"""


class AssemblyAIAudioTranscriptLoader(BaseLoader):
    """
    Loader for AssemblyAI audio transcripts.

    It uses the AssemblyAI API to transcribe audio files
    and loads the transcribed text into one or more Documents,
    depending on the specified format.

    To use, you should have the ``assemblyai`` Python package installed, and the
    environment variable ``ASSEMBLYAI_API_KEY`` set with your API key.
    Alternatively, the API key can also be passed as an argument.

    Audio files can be specified via a URL or a local file path.
    """

    def __init__(
        self,
        file_path: str,
        transcript_format: TranscriptFormat = TranscriptFormat.TEXT,
        config: Optional[assemblyai.TranscriptionConfig] = None,
        api_key: Optional[str] = None,
    ):
        """
        Initializes the AssemblyAI AudioTranscriptLoader.

        Args:
            file_path: A URL or a local file path.
            transcript_format: Transcript format to use.
                See class ``TranscriptFormat`` for more info.
            config: Transcription options and features. If ``None`` is given,
                the Transcriber's default configuration will be used.
            api_key: AssemblyAI API key.
        """
        try:
            import assemblyai
        except ImportError:
            raise ImportError(
                "Could not import assemblyai python package. "
                "Please install it with `pip install assemblyai`."
            )
        if api_key is not None:
            assemblyai.settings.api_key = api_key

        self.file_path = file_path
        self.transcript_format = transcript_format
        self.transcriber = assemblyai.Transcriber(config=config)

    def load(self) -> List[Document]:
        """Transcribes the audio file and loads the transcript into documents.

        It uses the AssemblyAI API to transcribe the audio file and blocks until
        the transcription is finished.
        """
        # This will raise a ValueError if no API key is set.
        transcript = self.transcriber.transcribe(self.file_path)

        if transcript.error:
            raise ValueError(f"Could not transcribe file: {transcript.error}")

        if self.transcript_format == TranscriptFormat.TEXT:
            return [
                Document(
                    page_content=transcript.text, metadata=transcript.json_response
                )
            ]
        elif self.transcript_format == TranscriptFormat.SENTENCES:
            sentences = transcript.get_sentences()
            return [
                Document(page_content=s.text, metadata=s.dict(exclude={"text"}))
                for s in sentences
            ]
        elif self.transcript_format == TranscriptFormat.PARAGRAPHS:
            paragraphs = transcript.get_paragraphs()
            return [
                Document(page_content=p.text, metadata=p.dict(exclude={"text"}))
                for p in paragraphs
            ]
        elif self.transcript_format == TranscriptFormat.SUBTITLES_SRT:
            return [Document(page_content=transcript.export_subtitles_srt())]
        elif self.transcript_format == TranscriptFormat.SUBTITLES_VTT:
            return [Document(page_content=transcript.export_subtitles_vtt())]
        else:
            raise ValueError("Unknown transcript format.")
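
To exercise the format dispatch in `load()` without calling the AssemblyAI API, the transcript can be replaced by a stub. The sketch below is illustrative only: `Doc`, `FakeTranscript`, and `to_docs` are hypothetical stand-ins, not LangChain or SDK classes. They mirror the branch structure above in simplified form (sentences are plain strings here, not SDK objects) and show how many documents each format yields.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in for LangChain's Document (hypothetical, for illustration)."""

    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


class FakeTranscript:
    """Stub exposing only what the dispatch below needs."""

    text = "Hello there. How are you?"

    def get_sentences(self) -> List[str]:
        # Simplification: the real SDK returns objects with a `.text` field.
        return ["Hello there.", "How are you?"]

    def export_subtitles_srt(self) -> str:
        return "1\n00:00:00,000 --> 00:00:02,000\nHello there. How are you?\n"


def to_docs(transcript: FakeTranscript, fmt: str) -> List[Doc]:
    """Map a transcript to one or more documents, depending on `fmt`."""
    if fmt == "text":
        return [Doc(transcript.text)]
    if fmt == "sentences":
        return [Doc(s) for s in transcript.get_sentences()]
    if fmt == "subtitles_srt":
        return [Doc(transcript.export_subtitles_srt())]
    raise ValueError(f"Unknown transcript format: {fmt!r}")
```

A test for the real loader could follow the same idea by swapping `self.transcriber` for an object whose `transcribe()` returns such a stub.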