References
1. [Whisper](https://blog.devgenius.io/transcribing-youtube-videos-using-openais-whisper-%EF%B8%8F-%EF%B8%8F-a29d264d6fb1)
2. [Langchain and LLama](https://www.youtube.com/watch?v=k_1pOF1mj8k)

### Basic Imports

In [1]:
import yt_dlp


In [2]:
def download(video_id: str) -> str:
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    return f'audio/{video_id}.m4a'

In [3]:
download('CuBzyh4Xmvk')

[youtube] Extracting URL: https://www.youtube.com/watch?v=CuBzyh4Xmvk
[youtube] CuBzyh4Xmvk: Downloading webpage
[youtube] CuBzyh4Xmvk: Downloading ios player API JSON
[youtube] CuBzyh4Xmvk: Downloading android player API JSON
[youtube] CuBzyh4Xmvk: Downloading m3u8 information
[info] CuBzyh4Xmvk: Downloading 1 format(s): 140
[download] audio/CuBzyh4Xmvk.m4a has already been downloaded
[download] 100% of   72.26MiB
[ExtractAudio] Not converting audio audio/CuBzyh4Xmvk.m4a; file is already in target format m4a


'audio/CuBzyh4Xmvk.m4a'

In [4]:
import whisper

In [5]:
whisper_model = whisper.load_model("base.en")


In [6]:
transcription = whisper_model.transcribe("audio/CuBzyh4Xmvk.m4a", fp16=True, verbose=True)

[00:00.000 --> 00:05.400]  Please look at the code mentioned above and please sign up on the Google Cloud.
[00:05.400 --> 00:08.520]  We've already started making some announcements.
[00:08.520 --> 00:14.240]  You will likely end up missing the announcements and you'll have no one else to play with.
[00:14.240 --> 00:20.080]  The second quick logistical announcement is that we'll have an extra lecture on Saturday,
[00:20.080 --> 00:23.800]  11th Jan at 11am in 1.101.
[00:23.800 --> 00:26.240]  So a lot of ones over there.
[00:26.240 --> 00:32.000]  And I think one or two people still have conflict, but in the larger, in the larger
[00:32.000 --> 00:36.240]  phone we will have almost everyone available, so we'll have to stick with this.
[00:36.240 --> 00:43.960]  FAQ and the projects which were earlier shared on Google Docs, I'll give all of you a comment
[00:43.960 --> 00:48.960]  access on it so that if you have any questions, queries, things like what should be the,
[00:48.960 --> 00

In [7]:
transcription.keys()

dict_keys(['text', 'segments', 'language'])
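
Each entry in `segments` is a dictionary; the fields used in the next cell are `start` and `end` (in seconds) and `text`. A minimal sketch of how the structure can be inspected is shown here. It uses a hand-made stand-in for the Whisper result so that it runs without the model. The values are copied from the verbose output above, and the helper `summarize_segments` is a hypothetical name, not part of Whisper.

```python
# Stand-in for the dict returned by whisper_model.transcribe().
# Values are copied from the verbose output above, not produced by running Whisper.
sample_segments = [
    {"start": 0.0, "end": 5.4,
     "text": " Please look at the code mentioned above and please sign up on the Google Cloud."},
    {"start": 5.4, "end": 8.52,
     "text": " We've already started making some announcements."},
]
sample_transcription = {
    "text": "".join(segment["text"] for segment in sample_segments),
    "segments": sample_segments,
    "language": "en",
}


def summarize_segments(result):
    """Return (number of segments, seconds spanned from first start to last end)."""
    segments = result["segments"]
    if not segments:
        return 0, 0.0
    return len(segments), segments[-1]["end"] - segments[0]["start"]


count, duration = summarize_segments(sample_transcription)
print(f"{count} segments spanning {duration:.2f} s")
```

The same function should work on the real `transcription` object, since it only reads the `segments` list.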

In [11]:
def create_srt_from_transcription(transcription_objects, srt_file_path):
    with open(srt_file_path, 'w', encoding='utf-8') as srt_file:
        index = 1  # SRT format starts with index 1

        for entry in transcription_objects['segments']:
            start_time = entry['start']
            end_time = entry['end']
            text = entry['text']

            # Convert time to SRT format
            start_time_str = format_time(start_time)
            end_time_str = format_time(end_time)

            # Write entry to SRT file
            srt_file.write(f"{index}\n")
            srt_file.write(f"{start_time_str} --> {end_time_str}\n")
            srt_file.write(f"{text}\n\n")

            index += 1

def format_time(time_seconds):
    # Work in whole milliseconds so the fractional part of each timestamp is kept
    total_ms = int(round(time_seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, milliseconds = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"


In [12]:
create_srt_from_transcription(transcription, "audio/CuBzyh4Xmvk.srt")

In [13]:
!head audio/CuBzyh4Xmvk.srt

1
00:00:00,000 --> 00:00:05,400
 Please look at the code mentioned above and please sign up on the Google Cloud.

2
00:00:05,400 --> 00:00:08,520
 We've already started making some announcements.

3
00:00:08,520 --> 00:00:14,240

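To sanity-check a written subtitle file, the `HH:MM:SS,mmm` timestamps can be parsed back into seconds and compared with the segment times. This is a minimal sketch; the helper name `parse_srt_timestamp` is hypothetical and not part of the notebook or of Whisper.

```python
def parse_srt_timestamp(timestamp):
    """Convert an SRT timestamp such as '01:02:03,250' into seconds as a float."""
    clock, milliseconds = timestamp.strip().split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(milliseconds) / 1000


def parse_srt_time_line(line):
    """Split a 'start --> end' line into a (start_seconds, end_seconds) tuple."""
    start, end = line.split("-->")
    return parse_srt_timestamp(start), parse_srt_timestamp(end)


print(parse_srt_timestamp("01:02:03,250"))
print(parse_srt_time_line("00:00:00,000 --> 00:01:00,500"))
```
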

In [14]:
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler 
                                 
llm = Ollama(model="llama2", 
             callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]))
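
A full lecture transcript can be longer than the model's context window, in which case the model may only see part of the text. One common workaround is to split the transcript into chunks and prompt on each chunk separately. This is a minimal sketch, assuming a word count is an acceptable rough proxy for tokens; `chunk_words` and the 1500-word default are illustrative choices, not values taken from this notebook, LangChain, or Ollama.

```python
def chunk_words(text, max_words=1500):
    """Split text into consecutive chunks of at most max_words words each."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Small demonstration with an artificially low budget
demo_chunks = chunk_words("one two three four five", max_words=2)
print(demo_chunks)
```

Each chunk could then be appended to a prompt in place of the full transcript text; how to merge the per-chunk answers is left open here.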

In [23]:
prompt_qs = ["Please provide a bullet-point summary for the given text:",
             "Summarize the following in Markdown bullets:",
             "Highlight the important topics and subtopics in the given lecture:",
             "Give us some question for a quiz based on the following text:"]

prompts = [q + "\n" + transcription["text"] for q in prompt_qs]

# Use a separate loop variable so the prompt_qs list is not overwritten
for prompt, question in zip(prompts, prompt_qs):
    print(question, end="\n\n")
    output = llm(prompt)
    print(output, end="\n\n")
    print("=="*50, end="\n\n")

Please provide a bullet-point summary for the given text:

The lecture discusses the concept of entropy and its relationship to decision trees, particularly in the context of information gain. The speaker explains that entropy is a measure of disorder or uncertainty in a system, and that in decision tree learning, we want to choose an attribute that reduces the entropy of the system. This attribute is called the information gain, and it is calculated by partitioning the set of examples based on an attribute and then weighting the entropy of each subset. The speaker also mentions that the weighted entropy is zero for the entire set of examples, but the side with the highest entropy has the most information gain.

The lecture starts by explaining that decision trees are constructed using a greedy algorithm, which chooses the attribute that gives the biggest estimated performance gain at each level of the tree. However, this greedy approach is not optimal, as it does not consider the glob