Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into anthropic-client-fix
Showing 25 changed files with 2,269 additions and 45 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,273 @@ | ||
--- | ||
draft: False | ||
date: 2024-07-11 | ||
slug: youtube-transcripts | ||
comments: true | ||
authors: | ||
- jxnl | ||
--- | ||
|
||
# Analyzing YouTube Transcripts with Instructor | ||
|
||
## Extracting Chapter Information | ||
|
||
!!! info "Code Snippets" | ||
|
||
    As always, the code is available for reference in the `run.py` file in the `examples/youtube` folder of our repo. | ||
|
||
In this post, we'll show you how to summarise YouTube video transcripts into distinct chapters using `instructor`, then explore some ways to adapt the code to different applications. | ||
|
||
By the end of this article, you'll be able to build an application like the one in the video below. | ||
|
||
![](../../hub/img/youtube.gif) | ||
|
||
Let's first install the required packages. | ||
|
||
```bash | ||
pip install openai instructor pydantic youtube_transcript_api | ||
``` | ||
|
||
!!! info "Quick Note" | ||
|
||
    The video we'll be using in this tutorial is [A Hacker's Guide To Language Models](https://www.youtube.com/watch?v=jkrNMKz9pWU) by Jeremy Howard. Its video ID is `jkrNMKz9pWU`. | ||
|
||
Next, let's define a Pydantic model for the structured chapter information that we want. | ||
|
||
```python | ||
from pydantic import BaseModel, Field | ||
|
||
|
||
class Chapter(BaseModel): | ||
    start_ts: float = Field( | ||
        ..., | ||
        description="Starting timestamp for a chapter.", | ||
    ) | ||
    end_ts: float = Field( | ||
        ..., | ||
        description="Ending timestamp for a chapter", | ||
    ) | ||
    title: str = Field( | ||
        ..., description="A concise and descriptive title for the chapter." | ||
    ) | ||
    summary: str = Field( | ||
        ..., | ||
        description="A brief summary of the chapter's content, don't use words like 'the speaker'", | ||
    ) | ||
``` | ||
|
||
We can use `youtube-transcript-api` to fetch the transcript of a video with the following function: | ||
|
||
```python | ||
from youtube_transcript_api import YouTubeTranscriptApi | ||
|
||
|
||
def get_youtube_transcript(video_id: str) -> str: | ||
    try: | ||
        transcript = YouTubeTranscriptApi.get_transcript(video_id) | ||
        return " ".join( | ||
            [f"ts={entry['start']} - {entry['text']}" for entry in transcript] | ||
        ) | ||
    except Exception as e: | ||
        print(f"Error fetching transcript: {e}") | ||
        return "" | ||
``` | ||
|
||
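To make the transcript format concrete, here is a minimal offline sketch of the same formatting step applied to hand-written entries. It assumes each entry is a dict with `start` and `text` keys, which is how the function above reads them; `format_entries` and `sample_entries` are illustrative names, not part of `youtube-transcript-api`.

```python
def format_entries(entries: list[dict]) -> str:
    """Join transcript entries into one string, prefixing each with its start time."""
    return " ".join(f"ts={entry['start']} - {entry['text']}" for entry in entries)


# Hand-written sample entries (no network call), mirroring the keys used above.
sample_entries = [
    {"start": 0.539, "text": "hi I am Jeremy Howard from fast.ai and"},
    {"start": 4.62, "text": "this is a hacker's guide to language"},
]

formatted = format_entries(sample_entries)
print(formatted)
```

Keeping the `ts=` prefix on every entry gives the model the information it needs to fill in `start_ts` and `end_ts`.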
Once we've done so, we can put it all together in the following script. | ||
|
||
```python hl_lines="30-31 38-48" | ||
import instructor | ||
from openai import OpenAI | ||
from pydantic import BaseModel, Field | ||
from youtube_transcript_api import YouTubeTranscriptApi | ||
|
||
# Set up OpenAI client | ||
client = instructor.from_openai(OpenAI()) | ||
|
||
|
||
class Chapter(BaseModel): | ||
    start_ts: float = Field( | ||
        ..., | ||
        description="The start timestamp indicating when the chapter starts in the video.", | ||
    ) | ||
    end_ts: float = Field( | ||
        ..., | ||
        description="The end timestamp indicating when the chapter ends in the video.", | ||
    ) | ||
    title: str = Field( | ||
        ..., description="A concise and descriptive title for the chapter." | ||
    ) | ||
    summary: str = Field( | ||
        ..., | ||
        description="A brief summary of the chapter's content, don't use words like 'the speaker'", | ||
    ) | ||
|
||
|
||
def get_youtube_transcript(video_id: str) -> list[str]: | ||
    try: | ||
        transcript = YouTubeTranscriptApi.get_transcript(video_id) | ||
        return [f"ts={entry['start']} - {entry['text']}" for entry in transcript] | ||
    except Exception as e: | ||
        print(f"Error fetching transcript: {e}") | ||
        return []  # keep the return type consistent on failure | ||
|
||
|
||
def extract_chapters(transcript: str): | ||
    return client.chat.completions.create_iterable( | ||
        model="gpt-4o",  # You can experiment with different models | ||
        response_model=Chapter, | ||
        messages=[ | ||
            { | ||
                "role": "system", | ||
                "content": "Analyze the given YouTube transcript and extract chapters. For each chapter, provide a start timestamp, end timestamp, title, and summary.", | ||
            }, | ||
            {"role": "user", "content": transcript}, | ||
        ], | ||
    ) | ||
|
||
|
||
if __name__ == "__main__": | ||
    transcripts = get_youtube_transcript("jkrNMKz9pWU") | ||
|
||
    for transcript in transcripts[:2]: | ||
        print(transcript) | ||
        #> ts=0.539 - hi I am Jeremy Howard from fast.ai and | ||
        #> ts=4.62 - this is a hacker's guide to language | ||
|
||
    formatted_transcripts = " ".join(transcripts)  # separate entries with spaces | ||
    chapters = extract_chapters(formatted_transcripts) | ||
|
||
    for chapter in chapters: | ||
        print(chapter.model_dump_json(indent=2)) | ||
        """ | ||
        { | ||
          "start_ts": 0.539, | ||
          "end_ts": 9.72, | ||
          "title": "Introduction", | ||
          "summary": "Jeremy Howard from fast.ai introduces the video, mentioning it as a hacker's guide to language models, focusing on a code-first approach." | ||
        } | ||
        """ | ||
        """ | ||
        { | ||
          "start_ts": 9.72, | ||
          "end_ts": 65.6, | ||
          "title": "Understanding Language Models", | ||
          "summary": "Explains the code-first approach to using language models, suggesting prerequisites such as prior deep learning knowledge and recommends the course.fast.ai for in-depth learning." | ||
        } | ||
        """ | ||
        """ | ||
        { | ||
          "start_ts": 65.6, | ||
          "end_ts": 250.68, | ||
          "title": "Basics of Language Models", | ||
          "summary": "Covers the concept of language models, demonstrating how they predict the next word in a sentence, and showcases OpenAI's text DaVinci for creative brainstorming with examples." | ||
        } | ||
        """ | ||
        """ | ||
        { | ||
          "start_ts": 250.68, | ||
          "end_ts": 459.199, | ||
          "title": "How Language Models Work", | ||
          "summary": "Dives deeper into how language models like ULMfit and others were developed, their training on datasets like Wikipedia, and the importance of learning various aspects of the world to predict the next word effectively." | ||
        } | ||
        """ | ||
        # ... other chapters | ||
``` | ||
|
||
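The `start_ts` and `end_ts` values are plain seconds, so they are easy to turn into something a viewer can click. The sketch below is one possible post-processing step and is not part of the original example: `format_timestamp` and `chapter_link` are hypothetical helpers, and the link relies on YouTube's `t` URL parameter for the start offset.

```python
def format_timestamp(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS, dropping fractions of a second."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def chapter_link(video_id: str, start_ts: float) -> str:
    """Build a watch URL that starts playback at the chapter's start time."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start_ts)}s"


# Values taken from the sample output above.
print(format_timestamp(250.68), chapter_link("jkrNMKz9pWU", 250.68))
```

Truncating rather than rounding means a link never starts playback after the chapter has begun.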
## Alternative Ideas | ||
|
||
Now that we've seen a complete example of chapter extraction, let's explore some alternative ideas using different Pydantic models. These models can be used to adapt our YouTube transcript analysis for various applications. | ||
|
||
### 1. Study Notes Generator | ||
|
||
```python | ||
from pydantic import BaseModel, Field | ||
from typing import List | ||
|
||
|
||
class Concept(BaseModel): | ||
    term: str = Field(..., description="A key term or concept mentioned in the video") | ||
    definition: str = Field( | ||
        ..., description="A brief definition or explanation of the term" | ||
    ) | ||
|
||
|
||
class StudyNote(BaseModel): | ||
    timestamp: float = Field( | ||
        ..., description="The timestamp where this note starts in the video" | ||
    ) | ||
    topic: str = Field(..., description="The main topic being discussed at this point") | ||
    key_points: List[str] = Field(..., description="A list of key points discussed") | ||
    concepts: List[Concept] = Field( | ||
        ..., description="Important concepts mentioned in this section" | ||
    ) | ||
``` | ||
|
||
This model structures the video content into clear topics, key points, and important concepts, making it ideal for revision and study purposes. | ||
|
||
### 2. Content Summarization | ||
|
||
```python | ||
from pydantic import BaseModel, Field | ||
from typing import List | ||
|
||
|
||
class ContentSummary(BaseModel): | ||
    title: str = Field(..., description="The title of the video") | ||
    duration: float = Field( | ||
        ..., description="The total duration of the video in seconds" | ||
    ) | ||
    main_topics: List[str] = Field( | ||
        ..., description="A list of main topics covered in the video" | ||
    ) | ||
    key_takeaways: List[str] = Field( | ||
        ..., description="The most important points from the entire video" | ||
    ) | ||
    target_audience: str = Field( | ||
        ..., description="The intended audience for this content" | ||
    ) | ||
``` | ||
|
||
This model provides a high-level overview of the entire video, perfect for quick content analysis or deciding whether a video is worth watching in full. | ||
|
||
### 3. Quiz Generator | ||
|
||
```python | ||
from pydantic import BaseModel, Field | ||
from typing import List | ||
|
||
|
||
class QuizQuestion(BaseModel): | ||
    question: str = Field(..., description="The quiz question") | ||
    # Pydantic v2 uses min_length/max_length for list sizes (min_items/max_items are deprecated) | ||
    options: List[str] = Field( | ||
        ..., min_length=2, max_length=4, description="Possible answers to the question" | ||
    ) | ||
    correct_answer: int = Field( | ||
        ..., | ||
        ge=0, | ||
        lt=4, | ||
        description="The index of the correct answer in the options list", | ||
    ) | ||
    explanation: str = Field( | ||
        ..., description="An explanation of why the correct answer is correct" | ||
    ) | ||
|
||
|
||
class VideoQuiz(BaseModel): | ||
    title: str = Field( | ||
        ..., description="The title of the quiz, based on the video content" | ||
    ) | ||
    questions: List[QuizQuestion] = Field( | ||
        ..., | ||
        min_length=5, | ||
        max_length=20, | ||
        description="A list of quiz questions based on the video content", | ||
    ) | ||
``` | ||
|
||
This model transforms video content into an interactive quiz, perfect for testing comprehension or creating engaging content for social media. | ||
|
||
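One caveat with `QuizQuestion`: `lt=4` bounds `correct_answer` by the maximum number of options, not by the number actually returned, so an index of 3 would pass validation even when only two options are present. A minimal sketch of an extra check follows. It works on plain dicts (for example the result of `model_dump()`); `find_quiz_problems` is a hypothetical helper name, and the sample question is made up for illustration.

```python
def find_quiz_problems(question: dict) -> list[str]:
    """Return a list of human-readable problems found in one quiz question."""
    problems = []
    options = question["options"]
    answer = question["correct_answer"]
    if not 0 <= answer < len(options):
        problems.append(
            f"correct_answer={answer} is out of range for {len(options)} options"
        )
    if len(set(options)) != len(options):
        problems.append("options contain duplicates")
    return problems


# Made-up example: only two options, but the answer index points at a third.
bad_question = {
    "question": "What does a language model predict?",
    "options": ["The next word", "The previous word"],
    "correct_answer": 2,
    "explanation": "Language models are trained to predict the next word.",
}
print(find_quiz_problems(bad_question))
```

The same check could also live inside the model as a Pydantic validator, so that invalid questions are rejected at parse time.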
To use these alternative models, you would replace the `Chapter` model in our original code with one of these alternatives and adjust the system prompt in the `extract_chapters` function accordingly. | ||
|
||
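As a rough sketch of that swap, the request can be assembled by a small helper that takes the response model and the system prompt as arguments. `build_request` is a hypothetical helper, and `VideoQuiz` here is a bare stand-in class so the snippet runs without dependencies; in practice you would pass the Pydantic model defined above.

```python
class VideoQuiz:
    """Stand-in for the Pydantic `VideoQuiz` model defined earlier."""


def build_request(transcript: str, response_model: type, instruction: str) -> dict:
    """Assemble keyword arguments for an instructor-patched chat completion call."""
    return {
        "model": "gpt-4o",
        "response_model": response_model,
        "messages": [
            {"role": "system", "content": instruction},
            {"role": "user", "content": transcript},
        ],
    }


request = build_request(
    transcript="ts=0.539 - hi I am Jeremy Howard from fast.ai and",
    response_model=VideoQuiz,
    instruction="Analyze the given YouTube transcript and write a quiz that tests its key ideas.",
)
print(request["messages"][0]["content"])
```

With the real model, the call would then look like `client.chat.completions.create(**request)`. Because `VideoQuiz` and `ContentSummary` each describe a single object rather than a series of items, `create` is likely a better fit for them than `create_iterable`.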
## Conclusion | ||
|
||
The power of this approach lies in its flexibility. By defining the result of our function calls as Pydantic models, we can quickly adapt the code to a wide variety of applications, whether that's generating quizzes, creating study materials, or simply optimizing for SEO. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
# Leveraging Local Models for Classifying Private Data | ||
|
||
In this article, we'll show you how to use `llama-cpp-python` with `instructor` for classification. This is a perfect use case for users who want to ensure that confidential documents are handled securely without ever leaving their own infrastructure. | ||
|
||
## Setup | ||
|
||
Let's start by installing the required libraries in your local Python environment. | ||
|
||
```bash | ||
pip install instructor pydantic | ||
``` | ||
|
||
Next, we'll install `llama-cpp-python`, a Python package that lets us use llama.cpp from our Python scripts. This might take a while, since `llama-cpp` needs to be built and compiled for your specific environment. | ||
|
||
For this tutorial, we'll be using `Mistral-7B-Instruct-v0.2-GGUF` by `TheBloke` to do our function calls. This will require around 6GB of RAM and a GPU. | ||
|
||
We can install the package by running the following command: | ||
|
||
```bash | ||
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python | ||
``` | ||
|
||
!!! note "Don't have a GPU?" | ||
|
||
    If you don't have a GPU, we recommend using the `Qwen2-0.5B-Instruct` model and compiling `llama-cpp-python` with `OpenBLAS` support. This lets you run the program on your CPU instead. | ||
|
||
    You can compile `llama-cpp-python` with `OpenBLAS` support by running the following command: | ||
|
||
    ```bash | ||
    CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python | ||
    ``` | ||
|
||
## Using `llama-cpp-python` | ||
|
||
Here's an example of how to implement a system for handling confidential document queries using local models: | ||
|
||
```python hl_lines="7-12 14-16 40-46" | ||
from llama_cpp import Llama | ||
import instructor | ||
from pydantic import BaseModel | ||
from enum import Enum | ||
from typing import Optional | ||
|
||
llm = Llama.from_pretrained( | ||
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # (1)! | ||
    filename="*Q4_K_M.gguf", | ||
    verbose=False,  # (2)! | ||
    n_gpu_layers=-1,  # (3)! | ||
) | ||
|
||
create = instructor.patch( | ||
    create=llm.create_chat_completion_openai_v1,  # (4)! | ||
) | ||
|
||
# Define query types for document-related inquiries | ||
class QueryType(str, Enum): | ||
    DOCUMENT_CONTENT = "document_content" | ||
    LAST_MODIFIED = "last_modified" | ||
    ACCESS_PERMISSIONS = "access_permissions" | ||
    RELATED_DOCUMENTS = "related_documents" | ||
|
||
# Define the structure for query responses | ||
class QueryResponse(BaseModel): | ||
    query_type: QueryType | ||
    response: str | ||
    additional_info: Optional[str] = None | ||
|
||
def process_confidential_query(query: str) -> QueryResponse: | ||
    prompt = f"""Analyze the following confidential document query and provide an appropriate response: | ||
    Query: {query} | ||
    Determine the type of query (document content, last modified, access permissions, or related documents), | ||
    provide a response, and include any additional relevant information. | ||
    Remember, you're handling confidential data, so be cautious about specific details. | ||
    """ | ||
|
||
    return create( | ||
        response_model=QueryResponse,  # (5)! | ||
        messages=[ | ||
            {"role": "system", "content": "You are a secure AI assistant trained to handle confidential document queries."}, | ||
            {"role": "user", "content": prompt}, | ||
        ], | ||
    ) | ||
|
||
|
||
# Sample confidential document queries | ||
confidential_queries = [ | ||
    "What are the key findings in the Q4 financial report?", | ||
    "Who last accessed the merger proposal document?", | ||
    "What are the access permissions for the new product roadmap?", | ||
    "Are there any documents related to Project X's budget forecast?", | ||
    "When was the board meeting minutes document last updated?", | ||
] | ||
|
||
# Process each query and print the results | ||
for query in confidential_queries: | ||
    response: QueryResponse = process_confidential_query(query) | ||
    # Use .value so the label prints the same way on every Python version | ||
    print(f"{query} : {response.query_type.value}") | ||
    """ | ||
    #> What are the key findings in the Q4 financial report? : document_content | ||
    #> Who last accessed the merger proposal document? : access_permissions | ||
    #> What are the access permissions for the new product roadmap? : access_permissions | ||
    #> Are there any documents related to Project X's budget forecast? : document_content | ||
    #> When was the board meeting minutes document last updated? : last_modified | ||
    """ | ||
``` | ||
|
||
1. We load in the model from Hugging Face and cache it locally. This makes it quick and easy for us to experiment with different model configurations and types. | ||
|
||
2. We can set `verbose` to `True` to log all of the output from `llama.cpp`. This helps if you're trying to debug specific issues. | ||
|
||
3. If you have a GPU with limited memory, set `n_gpu_layers` to a lower number (e.g. 10). We've set it to `-1` here so that all of the model layers are loaded onto the GPU. | ||
|
||
4. Make sure to patch the client with the `create_chat_completion_openai_v1` API, which is OpenAI-compatible. | ||
|
||
5. Pass in the response model as a parameter, just like with any other inference client we support. | ||
|
||
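`QueryType` subclasses both `str` and `Enum`, which keeps the labels a closed set while still letting them behave like plain strings. The standard-library sketch below illustrates this with a trimmed copy of the enum; it needs no model to run, and the rejected label is made up for illustration.

```python
import json
from enum import Enum


class QueryType(str, Enum):
    DOCUMENT_CONTENT = "document_content"
    LAST_MODIFIED = "last_modified"


# A raw label maps back to the matching enum member.
parsed = QueryType("last_modified")
print(parsed is QueryType.LAST_MODIFIED)

# Because members are also strings, they serialise to JSON without extra work.
print(json.dumps({"query_type": parsed}))

# Labels outside the enum are rejected with a ValueError.
try:
    QueryType("delete_everything")
except ValueError as error:
    print(f"rejected: {error}")
```

This is why a classification that falls outside the four defined query types fails validation instead of silently passing through.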
## Conclusion | ||
|
||
`instructor` provides a robust solution for organizations needing to handle confidential document queries locally. By processing these queries on your own hardware, you can leverage advanced AI capabilities while maintaining the highest standards of data privacy and security. | ||
|
||
But this goes far beyond confidential documents: using local models unlocks a whole new world of interesting use cases, fine-tuned specialist models, and more! |