Add new document_loader: AssemblyAIAudioTranscriptLoader (#9667)

This PR adds a new document loader, `AssemblyAIAudioTranscriptLoader`, that transcribes audio files with the [AssemblyAI API](https://www.assemblyai.com) and loads the transcribed text into documents.

- Add new document_loader with class `AssemblyAIAudioTranscriptLoader`
- Add optional dependency `assemblyai`
- Add unit tests (using a Mock client)
- Add docs notebook

This is the equivalent of the JS integration already available in LangChain.js. See the [LangChain JS docs AssemblyAI page](https://js.langchain.com/docs/modules/data_connection/document_loaders/integrations/web_loaders/assemblyai_audio_transcription).

At its simplest, you can use the loader to get a transcript back from an audio file like this:

```python
from langchain.document_loaders.assemblyai import AssemblyAIAudioTranscriptLoader

loader = AssemblyAIAudioTranscriptLoader(file_path="./testfile.mp3")
docs = loader.load()
```

To use it, you need the `assemblyai` Python package installed and the environment variable `ASSEMBLYAI_API_KEY` set to your API key. Alternatively, the API key can be passed as an argument.

Twitter handles to shout out if so kindly 🙇 [@AssemblyAI](https://twitter.com/AssemblyAI) and [@patLoeber](https://twitter.com/patloeber)

---------

Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
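
The summary above mentions unit tests that use a mock client. The test file itself is not shown in this diff, so the following is only a minimal sketch of the idea, not the PR's actual test: `Document` is a simplified stand-in for LangChain's class, and `load_text` is a hypothetical helper that mirrors the loader's default text branch. A `MagicMock` takes the place of the transcriber, so no request is sent to the API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List
from unittest.mock import MagicMock


@dataclass
class Document:
    """Simplified stand-in for LangChain's Document (assumption)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def load_text(transcriber: Any, file_path: str) -> List[Document]:
    """Hypothetical helper mirroring the loader's TEXT branch."""
    transcript = transcriber.transcribe(file_path)
    if transcript.error:
        raise ValueError(f"Could not transcribe file: {transcript.error}")
    return [
        Document(page_content=transcript.text, metadata=transcript.json_response)
    ]


# The mock replaces the real transcriber, so nothing leaves the machine.
mock_transcriber = MagicMock()
mock_transcriber.transcribe.return_value = MagicMock(
    text="Test transcription text",
    error=None,
    json_response={"id": "1"},
)

docs = load_text(mock_transcriber, "./testfile.mp3")

# Verify the transcriber was called exactly once with the given path.
mock_transcriber.transcribe.assert_called_once_with("./testfile.mp3")
```

A test against the real loader would presumably patch the `assemblyai` transcriber in a similar way, but that wiring is an assumption here.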
langchain-ai · Aug 24, 2023 · 5990651
1 parent 25f2c82
commit 5990651
Showing 6 changed files with 491 additions and 2 deletions.
diff --git a/docs/extras/integrations/document_loaders/assemblyai.ipynb b/docs/extras/integrations/document_loaders/assemblyai.ipynb
@@ -0,0 +1,224 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# AssemblyAI Audio Transcripts\n",
+    "\n",
+    "The `AssemblyAIAudioTranscriptLoader` transcribes audio files with the [AssemblyAI API](https://www.assemblyai.com) and loads the transcribed text into documents.\n",
+    "\n",
+    "To use it, you should have the `assemblyai` python package installed, and the\n",
+    "environment variable `ASSEMBLYAI_API_KEY` set with your API key. Alternatively, the API key can also be passed as an argument.\n",
+    "\n",
+    "More info about AssemblyAI:\n",
+    "\n",
+    "- [Website](https://www.assemblyai.com/)\n",
+    "- [Get a Free API key](https://www.assemblyai.com/dashboard/signup)\n",
+    "- [AssemblyAI API Docs](https://www.assemblyai.com/docs)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Installation\n",
+    "\n",
+    "First, you need to install the `assemblyai` python package.\n",
+    "\n",
+    "You can find more info about it inside the [assemblyai-python-sdk GitHub repo](https://github.com/AssemblyAI/assemblyai-python-sdk)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!pip install assemblyai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example\n",
+    "\n",
+    "The `AssemblyAIAudioTranscriptLoader` needs at least the `file_path` argument. Audio files can be specified as a URL or a local file path."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders.assemblyai import AssemblyAIAudioTranscriptLoader\n",
+    "\n",
+    "audio_file = \"https://storage.googleapis.com/aai-docs-samples/nbc.mp3\"\n",
+    "# or a local file path: audio_file = \"./nbc.mp3\"\n",
+    "\n",
+    "loader = AssemblyAIAudioTranscriptLoader(file_path=audio_file)\n",
+    "\n",
+    "docs = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note: Calling `loader.load()` blocks until the transcription is finished."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The transcribed text is available in the `page_content`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "docs[0].page_content"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "```\n",
+    "\"Load time, a new president and new congressional makeup. Same old ...\"\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The `metadata` contains the full JSON response with more meta information:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "docs[0].metadata"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "```\n",
+    "{'language_code': <LanguageCode.en_us: 'en_us'>,\n",
+    " 'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3',\n",
+    " 'punctuate': True,\n",
+    " 'format_text': True,\n",
+    "  ...\n",
+    "}\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Transcript Formats\n",
+    "\n",
+    "You can specify the `transcript_format` argument for different formats.\n",
+    "\n",
+    "Depending on the format, one or more documents are returned. These are the different `TranscriptFormat` options:\n",
+    "\n",
+    "- `TEXT`: One document with the transcription text\n",
+    "- `SENTENCES`: Multiple documents, splits the transcription by each sentence\n",
+    "- `PARAGRAPHS`: Multiple documents, splits the transcription by each paragraph\n",
+    "- `SUBTITLES_SRT`: One document with the transcript exported in SRT subtitles format\n",
+    "- `SUBTITLES_VTT`: One document with the transcript exported in VTT subtitles format"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.document_loaders.assemblyai import (\n",
+    "    AssemblyAIAudioTranscriptLoader,\n",
+    "    TranscriptFormat,\n",
+    ")\n",
+    "\n",
+    "loader = AssemblyAIAudioTranscriptLoader(\n",
+    "    file_path=\"./your_file.mp3\",\n",
+    "    transcript_format=TranscriptFormat.SENTENCES,\n",
+    ")\n",
+    "\n",
+    "docs = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Transcription Config\n",
+    "\n",
+    "You can also specify the `config` argument to use different audio intelligence models.\n",
+    "\n",
+    "Visit the [AssemblyAI API Documentation](https://www.assemblyai.com/docs) to get an overview of all available models!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import assemblyai as aai\n",
+    "\n",
+    "config = aai.TranscriptionConfig(speaker_labels=True,\n",
+    "                                 auto_chapters=True,\n",
+    "                                 entity_detection=True\n",
+    ")\n",
+    "\n",
+    "loader = AssemblyAIAudioTranscriptLoader(\n",
+    "    file_path=\"./your_file.mp3\",\n",
+    "    config=config\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Pass the API Key as an Argument\n",
+    "\n",
+    "Besides setting the API key through the environment variable `ASSEMBLYAI_API_KEY`, you can also pass it as an argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = AssemblyAIAudioTranscriptLoader(\n",
+    "    file_path=\"./your_file.mp3\",\n",
+    "    api_key=\"YOUR_KEY\"\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
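
The notebook describes two ways to supply the API key: the `ASSEMBLYAI_API_KEY` environment variable or the `api_key` argument. The sketch below illustrates one plausible precedence rule, in which an explicit argument wins over the environment. `resolve_api_key` is a hypothetical helper written for illustration; the loader in this PR does not define it, and the exact lookup inside the `assemblyai` SDK is not shown in this diff.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "ASSEMBLYAI_API_KEY"


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: prefer the explicit argument, else the environment."""
    if api_key is not None:
        return api_key
    # Fall back to the process environment unless a mapping is injected.
    source = os.environ if env is None else env
    value = source.get(ENV_VAR)
    if not value:
        raise ValueError(f"No API key found: pass api_key or set {ENV_VAR}.")
    return value


# An explicit argument takes precedence over the environment value.
from_argument = resolve_api_key("YOUR_KEY", env={ENV_VAR: "key-from-env"})
# Without an argument, the environment value is used.
from_environment = resolve_api_key(env={ENV_VAR: "key-from-env"})
```

Passing the environment as a parameter keeps the helper easy to test without modifying `os.environ`.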
diff --git a/libs/langchain/langchain/document_loaders/__init__.py b/libs/langchain/langchain/document_loaders/__init__.py
@@ -31,6 +31,7 @@
 from langchain.document_loaders.apify_dataset import ApifyDatasetLoader
 from langchain.document_loaders.arcgis_loader import ArcGISLoader
 from langchain.document_loaders.arxiv import ArxivLoader
+from langchain.document_loaders.assemblyai import AssemblyAIAudioTranscriptLoader
 from langchain.document_loaders.async_html import AsyncHtmlLoader
 from langchain.document_loaders.azlyrics import AZLyricsLoader
 from langchain.document_loaders.azure_blob_storage_container import (
@@ -219,6 +220,7 @@
     "ApifyDatasetLoader",
     "ArcGISLoader",
     "ArxivLoader",
+    "AssemblyAIAudioTranscriptLoader",
     "AsyncHtmlLoader",
     "AzureBlobStorageContainerLoader",
     "AzureBlobStorageFileLoader",

diff --git a/libs/langchain/langchain/document_loaders/assemblyai.py b/libs/langchain/langchain/document_loaders/assemblyai.py
@@ -0,0 +1,111 @@
+from __future__ import annotations
+
+from enum import Enum
+from typing import TYPE_CHECKING, List, Optional
+
+from langchain.docstore.document import Document
+from langchain.document_loaders.base import BaseLoader
+
+if TYPE_CHECKING:
+    import assemblyai
+
+
+class TranscriptFormat(Enum):
+    """Transcript format to use for the document loader."""
+
+    TEXT = "text"
+    """One document with the transcription text"""
+    SENTENCES = "sentences"
+    """Multiple documents, splits the transcription by each sentence"""
+    PARAGRAPHS = "paragraphs"
+    """Multiple documents, splits the transcription by each paragraph"""
+    SUBTITLES_SRT = "subtitles_srt"
+    """One document with the transcript exported in SRT subtitles format"""
+    SUBTITLES_VTT = "subtitles_vtt"
+    """One document with the transcript exported in VTT subtitles format"""
+
+
+class AssemblyAIAudioTranscriptLoader(BaseLoader):
+    """
+    Loader for AssemblyAI audio transcripts.
+
+    It uses the AssemblyAI API to transcribe audio files
+    and loads the transcribed text into one or more Documents,
+    depending on the specified format.
+
+    To use, you should have the ``assemblyai`` python package installed, and the
+    environment variable ``ASSEMBLYAI_API_KEY`` set with your API key.
+    Alternatively, the API key can also be passed as an argument.
+
+    Audio files can be specified via a URL or a local file path.
+    """
+
+    def __init__(
+        self,
+        file_path: str,
+        *,
+        transcript_format: TranscriptFormat = TranscriptFormat.TEXT,
+        config: Optional[assemblyai.TranscriptionConfig] = None,
+        api_key: Optional[str] = None,
+    ):
+        """
+        Initializes the AssemblyAI AudioTranscriptLoader.
+
+        Args:
+            file_path: A URL or a local file path.
+            transcript_format: Transcript format to use.
+                See class ``TranscriptFormat`` for more info.
+            config: Transcription options and features. If ``None`` is given,
+                the Transcriber's default configuration will be used.
+            api_key: AssemblyAI API key.
+        """
+        try:
+            import assemblyai
+        except ImportError:
+            raise ImportError(
+                "Could not import assemblyai python package. "
+                "Please install it with `pip install assemblyai`."
+            )
+        if api_key is not None:
+            assemblyai.settings.api_key = api_key
+
+        self.file_path = file_path
+        self.transcript_format = transcript_format
+        self.transcriber = assemblyai.Transcriber(config=config)
+
+    def load(self) -> List[Document]:
+        """Transcribes the audio file and loads the transcript into documents.
+
+        It uses the AssemblyAI API to transcribe the audio file and blocks until
+        the transcription is finished.
+        """
+        # The call below raises a ValueError if no API key is set.
+        transcript = self.transcriber.transcribe(self.file_path)
+
+        if transcript.error:
+            raise ValueError(f"Could not transcribe file: {transcript.error}")
+
+        if self.transcript_format == TranscriptFormat.TEXT:
+            return [
+                Document(
+                    page_content=transcript.text, metadata=transcript.json_response
+                )
+            ]
+        elif self.transcript_format == TranscriptFormat.SENTENCES:
+            sentences = transcript.get_sentences()
+            return [
+                Document(page_content=s.text, metadata=s.dict(exclude={"text"}))
+                for s in sentences
+            ]
+        elif self.transcript_format == TranscriptFormat.PARAGRAPHS:
+            paragraphs = transcript.get_paragraphs()
+            return [
+                Document(page_content=p.text, metadata=p.dict(exclude={"text"}))
+                for p in paragraphs
+            ]
+        elif self.transcript_format == TranscriptFormat.SUBTITLES_SRT:
+            return [Document(page_content=transcript.export_subtitles_srt())]
+        elif self.transcript_format == TranscriptFormat.SUBTITLES_VTT:
+            return [Document(page_content=transcript.export_subtitles_vtt())]
+        else:
+            raise ValueError("Unknown transcript format.")
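
To make the effect of `transcript_format` concrete, here is a self-contained sketch of how the number of returned documents depends on the chosen format. It does not call the AssemblyAI SDK: `FakeTranscript` is a hypothetical object holding canned data, `Document` is a simplified stand-in, and `to_documents` only imitates the branching in `load()` above. Metadata is left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class TranscriptFormat(Enum):
    TEXT = "text"
    SENTENCES = "sentences"
    PARAGRAPHS = "paragraphs"
    SUBTITLES_SRT = "subtitles_srt"
    SUBTITLES_VTT = "subtitles_vtt"


@dataclass
class Document:
    """Simplified stand-in for LangChain's Document (assumption)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class FakeTranscript:
    """Hypothetical transcript with canned data; not the SDK's class."""

    text = "Hello there. How are you? Fine, thanks."
    sentences = ["Hello there.", "How are you?", "Fine, thanks."]
    paragraphs = ["Hello there. How are you?", "Fine, thanks."]
    srt = "1\n00:00:00,000 --> 00:00:01,000\nHello there."
    vtt = "WEBVTT\n\n00:00.000 --> 00:01.000\nHello there."


def to_documents(t: FakeTranscript, fmt: TranscriptFormat) -> List[Document]:
    """Imitates the loader's format branching on canned data."""
    if fmt is TranscriptFormat.TEXT:
        return [Document(t.text)]
    if fmt is TranscriptFormat.SENTENCES:
        return [Document(s) for s in t.sentences]
    if fmt is TranscriptFormat.PARAGRAPHS:
        return [Document(p) for p in t.paragraphs]
    if fmt is TranscriptFormat.SUBTITLES_SRT:
        return [Document(t.srt)]
    if fmt is TranscriptFormat.SUBTITLES_VTT:
        return [Document(t.vtt)]
    raise ValueError("Unknown transcript format.")


# Count the documents produced for each format.
counts = {
    fmt.name: len(to_documents(FakeTranscript(), fmt)) for fmt in TranscriptFormat
}
```

With this canned data, only `SENTENCES` and `PARAGRAPHS` yield more than one document, which matches the format descriptions in the `TranscriptFormat` docstrings.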