
YoutubeAudioLoader and updates to OpenAIWhisperParser #5772

Merged

Conversation

rlancemartin
Collaborator

@rlancemartin rlancemartin commented Jun 6, 2023

This introduces the YoutubeAudioLoader, which loads blobs from a YouTube URL and writes them to disk. Blobs are then parsed by OpenAIWhisperParser(), as shown in this PR, but we extend the parser to split audio so that each chunk meets the 25MB OpenAI size limit. As shown in the notebook, this enables a very simple UX:

# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url],save_dir),OpenAIWhisperParser())
docs = loader.load()

Tested on full set of Karpathy lecture videos:

# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files 
save_dir = "~/Downloads/YouTube"
 
# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls,save_dir),OpenAIWhisperParser())
docs = loader.load()

# Split the audio into chunk_duration_ms chunks
for split_number, i in enumerate(range(0, len(audio), chunk_duration_ms)):
    print(f"Transcribing part {split_number}!")
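The loop above walks the audio in fixed-size windows. A standalone sketch of that chunking arithmetic (function and variable names are illustrative, not the parser's actual internals):

```python
# Hypothetical helper mirroring the split loop above: yield the
# (split_number, start_ms, end_ms) window for each chunk of a recording.
def chunk_ranges(audio_length_ms: int, chunk_duration_ms: int):
    """Yield (split_number, start_ms, end_ms) for each fixed-size chunk."""
    for split_number, start in enumerate(range(0, audio_length_ms, chunk_duration_ms)):
        yield split_number, start, min(start + chunk_duration_ms, audio_length_ms)

# A 25-minute file with 10-minute chunks yields three pieces,
# the last one shorter than the rest.
print(list(chunk_ranges(25 * 60_000, 10 * 60_000)))
```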
Collaborator


Suggested change
print(f"Transcribing part {split_number}!")


with blob.as_bytes_io() as f:
    transcript = openai.Audio.transcribe("whisper-1", f)
    yield Document(
Collaborator


Should we yield a single document if the input is a single audio file and we're trying to hide the fact that there's chunking under the hood? We can collect the transcripts and concatenate them. The only open question is which delimiter to join on.

Collaborator Author


It would be easy to do this. E.g., we can build a single blob from the combined docs:

combined_docs = "\n\n".join(doc.page_content for doc in docs)

But, as discussed, it's kind of nice to have the intermediate outputs.

(Latency is somewhat high: roughly 15 minutes for a 2-hour video.)
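A minimal sketch of the concatenation idea discussed above; the `"\n\n"` delimiter and the function name are assumptions, since the PR leaves the delimiter choice open:

```python
# Hypothetical helper: collapse per-chunk transcripts into one string.
# The delimiter is an illustrative default, not a settled choice.
def combine_transcripts(page_contents, delimiter="\n\n"):
    """Join chunk transcripts into a single document body."""
    return delimiter.join(page_contents)

print(combine_transcripts(["part one", "part two"]))
```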

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url, download=False)
    title = info.get('title', 'video')
    print(f"Writing file: {title} to {self.save_dir}")
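Separating the path construction from the download makes the snippet above easier to test. A hedged sketch of that step, assuming the metadata dict shape shown above; the `destination` helper and the `.m4a` extension are illustrative assumptions:

```python
from pathlib import Path

# Hypothetical helper: given yt_dlp's info dict and a save directory,
# report where the audio file would be written. The ".m4a" extension
# is an assumption for illustration.
def destination(info: dict, save_dir: str) -> Path:
    title = info.get("title", "video")
    return Path(save_dir).expanduser() / f"{title}.m4a"

print(destination({"title": "Lecture 1"}, "~/Downloads/YouTube"))
```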
Collaborator


Suggested change
print(f"Writing file: {title} to {self.save_dir}")

@rlancemartin rlancemartin force-pushed the rlm/simple_audio_load_and_split branch 7 times, most recently from 01a5729 to 74326d6 Compare June 6, 2023 18:39
try:
    from pydub import AudioSegment
except ImportError:
    print("Please install pydub : pip install pydub")
Collaborator


replace with raise ValueError or ImportError

try:
    import openai
except ImportError:
    print("Please install openai : pip install openai")
Collaborator


Needs to be raised as well
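The guard pattern both reviewers ask for, sketched as a reusable helper. The `require` function is illustrative, not part of the PR; the point is that a missing dependency should raise rather than print:

```python
import importlib

# Hypothetical helper: import a package, or raise ImportError with an
# install hint instead of silently printing and continuing.
def require(package: str):
    """Return the imported module, or raise with a pip install hint."""
    try:
        return importlib.import_module(package)
    except ImportError as e:
        raise ImportError(
            f"{package} package not found, please install it with "
            f"`pip install {package}`"
        ) from e

# Standard-library modules import fine; a missing package raises loudly.
json = require("json")
```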

@rlancemartin rlancemartin force-pushed the rlm/simple_audio_load_and_split branch from 74326d6 to 4f0e4ca Compare June 6, 2023 21:59
@rlancemartin rlancemartin force-pushed the rlm/simple_audio_load_and_split branch from 4f0e4ca to e1fa1a4 Compare June 6, 2023 22:03
@rlancemartin rlancemartin merged commit 4092fd2 into langchain-ai:master Jun 6, 2023
13 checks passed
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
This was referenced Jun 25, 2023
kacperlukawski pushed a commit to kacperlukawski/langchain that referenced this pull request Jun 29, 2023