
# Summarize a YouTube video lecture
- Summarize a YouTube video's script by the chapters its creator configured.
  - Create `markdown_note.md` with the script and summary.
- Uses the [youtube-transcript-api](https://pypi.org/project/youtube-transcript-api/), [langchain](https://github.com/hwchase17/langchain), and [OpenAI](https://github.com/openai/openai-python) packages.


## Install dependency

In [1]:
# Install ffmpeg (macOS, Homebrew).
! brew install ffmpeg 

# For Debian/Ubuntu Linux (apt)
# apt-get install libav-tools libavcodec-extra ffmpeg

# Install the Python packages
! pip install -r requirements.txt

==> Downloading https://formulae.brew.sh/api/formula.jws.json
######################################################################### 100.0%
==> Downloading https://formulae.brew.sh/api/cask.jws.json
######################################################################### 100.0%
To reinstall 6.0, run:
  brew reinstall ffmpeg


## Input variables

In [15]:
# YouTube video ID
youtube_video_id="esr-1s91_9A"

# Language of the subtitles
language = "ko"

# LLM: parameters that worked well in my testing.
max_token = 3000
model = "gpt-3.5-turbo"
chunk_size = 700
chunk_overlap = 30

# There is no official way to get the chapters automatically,
# so copy and paste the timestamps and chapter titles from the YouTube video description.

chapter_part_in_description = """
1:12 질서자유주의와 사회적 시장경제
3:49 실존철학의 메시지
6:32 독일 기본법(헌법)과 대한민국 헌법의 정신
18:58 우리는 어떻게 살아가고 있는가?
23:54 우리는 앞으로 어떻게 살아야 하는가?
28:56 피라미드형 계급구조 vs. 네트워크형 수평구조
32:56 의사결정은 반드시 합의를 거친다 (consensus)
45:54 대중의 지혜와 집단어리석음
50:10 ‘게르만 모형’의 공동결정법과 ‘일하는 방식’
54:55 아메리칸 드림에서 유러피언 드림으로
1:02:03 자기의식과 변증법적 역사발전
1:12:09 “우리는 다리를 놓으며 그 다리를 건너야 한다”
"""

In [3]:
# There is no official way to get the chapters automatically,
# so we parse the text from the description into a list of tuples:
# [ (time_in_sec, chapter_title) ]
import re 
pattern = r'(\d+(:\d+){1,2})\s(.+)'
matches = re.findall(pattern, chapter_part_in_description)

def time_to_seconds(time):
    parts = time.split(':')
    seconds = int(parts[-1])
    minutes = int(parts[-2]) if len(parts) > 1 else 0
    hours = int(parts[-3]) if len(parts) > 2 else 0
    return hours * 3600 + minutes * 60 + seconds

chapters = [(time_to_seconds(time), title.strip()) for time, _, title in matches]
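
As a quick sanity check of the parsing step, the sketch below runs the same pattern on two lines in the "timestamp title" format. It is only an illustration: `sample_description` and `sample_chapters` are made-up names, the titles are placeholders, and `time_to_seconds` is rewritten here in an equivalent form so the block runs on its own.

```python
import re

# Hypothetical two-line sample in the same "timestamp title" format.
sample_description = """
1:12 Intro topic
1:02:03 Later topic
"""

# Same pattern as above: "M:SS" or "H:MM:SS", one whitespace, then the title.
pattern = r'(\d+(:\d+){1,2})\s(.+)'

def time_to_seconds(time: str) -> int:
    # Fold the parts left to right: each step shifts the running total by 60.
    seconds = 0
    for part in time.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds

# re.findall returns one tuple per match: (full timestamp, last ":NN" group, title).
sample_chapters = [
    (time_to_seconds(time), title.strip())
    for time, _, title in re.findall(pattern, sample_description)
]
print(sample_chapters)  # [(72, 'Intro topic'), (3723, 'Later topic')]
```

The middle element of each match is the inner repeated group, which is why it is discarded with `_`.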


# Build up the note with the script under each chapter

In [4]:
import yt_dlp

# Download youtube video and extract audio file. 
def download(video_id: str) -> str:
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'mp3',
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    return f'audio/{video_id}.mp3'

file_path = download(youtube_video_id)



[youtube] Extracting URL: https://www.youtube.com/watch?v=esr-1s91_9A
[youtube] esr-1s91_9A: Downloading webpage
[youtube] esr-1s91_9A: Downloading ios player API JSON
[youtube] esr-1s91_9A: Downloading android player API JSON
[youtube] esr-1s91_9A: Downloading m3u8 information
[info] esr-1s91_9A: Downloading 1 format(s): 140
[download] audio/esr-1s91_9A.m4a has already been downloaded
[download] 100% of   70.00MiB
[ExtractAudio] Destination: audio/esr-1s91_9A.mp3
Deleting original file audio/esr-1s91_9A.m4a (pass -k to keep)


In [5]:
# Split the audio file into one clip per chapter.
from pydub import AudioSegment

audio_data = AudioSegment.from_mp3(file_path)

for i, (start_in_sec, _title) in enumerate(chapters):
    # A chapter runs from its own timestamp to the next chapter's timestamp;
    # the last chapter runs to the end of the audio. This keeps audio/{i}.mp3
    # aligned with chapters[i], which the transcription step relies on.
    start_in_ms = start_in_sec * 1000
    if i + 1 < len(chapters):
        end_in_ms = chapters[i + 1][0] * 1000
    else:
        end_in_ms = len(audio_data)  # len() of an AudioSegment is in milliseconds

    # pydub slices by milliseconds. Audio before the first timestamp is not exported.
    chapter_audio = audio_data[start_in_ms:end_in_ms]
    chapter_audio.export(f'audio/{i}.mp3', format="mp3")
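
When pairing clips with titles, it helps to make the chapter boundaries explicit. The sketch below is an illustration and not part of the notebook: it assumes that a chapter runs from its own timestamp to the next chapter's timestamp, and that the last chapter runs to the end of the audio. `chapter_ranges_ms` and `total_ms` are hypothetical names, and no audio library is needed to run it.

```python
from typing import List, Tuple

def chapter_ranges_ms(
    chapters: List[Tuple[int, str]], total_ms: int
) -> List[Tuple[str, int, int]]:
    """Return (title, start_ms, end_ms) for each chapter."""
    ranges = []
    for i, (start_sec, title) in enumerate(chapters):
        start_ms = start_sec * 1000
        if i + 1 < len(chapters):
            # The chapter ends where the next one starts.
            end_ms = chapters[i + 1][0] * 1000
        else:
            # The last chapter ends with the audio.
            end_ms = total_ms
        ranges.append((title, start_ms, end_ms))
    return ranges

example = chapter_ranges_ms([(72, "A"), (229, "B")], total_ms=300_000)
print(example)  # [('A', 72000, 229000), ('B', 229000, 300000)]
```

Printing these ranges before exporting is a cheap way to confirm that clip `i` really contains the audio for the title at index `i`.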


In [8]:
# Transcribe the text from audio files.
import os
import whisper


# You can adjust the model used here. Model choice is typically a tradeoff between accuracy and speed.
# All available models are located at https://github.com/openai/whisper/#available-models-and-languages.
whisper_model = whisper.load_model("small")

script_by_chapter = []
def transcribe(file_path: str) -> str:
    # `fp16` defaults to `True` (half-precision inference), which is intended for GPUs.
    # For local demonstration purposes we run on the CPU, so set it to `False` to use FP32.
    transcription = whisper_model.transcribe(file_path, fp16=False)
    return transcription['text']

for i in range( len(chapters) ):
    current_time_in_sec, current_title = chapters[i]
    print( f'{current_title} is transcripting... \n' )
    audio_file_path = os.path.join( os.getcwd(), 'audio', f'{i}.mp3' )
    transcript = transcribe(audio_file_path)
    chapter_data = { 
                "title": current_title,
                "script": transcript,
                "summary" : ""
                }
    script_by_chapter.append(chapter_data)

    # Save transcript file
    text_file_folder_path = os.path.join( os.getcwd(), 'text')
    os.makedirs(text_file_folder_path, exist_ok=True)

    text_file = os.path.join(text_file_folder_path, f'{i}.txt')
    # Write as UTF-8 so the Korean text is saved correctly on any platform.
    with open( text_file, "w", encoding="utf-8") as file:
        file.write(transcript)



100%|███████████████████████████████████████| 461M/461M [00:47<00:00, 10.3MiB/s]


질서자유주의와 사회적 시장경제 is transcripting... 

실존철학의 메시지 is transcripting... 

독일 기본법(헌법)과 대한민국 헌법의 정신 is transcripting... 

우리는 어떻게 살아가고 있는가? is transcripting... 

우리는 앞으로 어떻게 살아야 하는가? is transcripting... 

피라미드형 계급구조 vs. 네트워크형 수평구조 is transcripting... 

의사결정은 반드시 합의를 거친다 (consensus) is transcripting... 

대중의 지혜와 집단어리석음 is transcripting... 

‘게르만 모형’의 공동결정법과 ‘일하는 방식’ is transcripting... 

아메리칸 드림에서 유러피언 드림으로 is transcripting... 

자기의식과 변증법적 역사발전 is transcripting... 

“우리는 다리를 놓으며 그 다리를 건너야 한다” is transcripting... 



In [9]:

# Temporarily save the data to a file
import os 
import json 

with open( "temp_script_by_chapter.json", "w", encoding="utf-8") as file:
    file.write( json.dumps(script_by_chapter, indent=2, ensure_ascii=False) )
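
Transcription is the slow step, so a saved JSON file is presumably useful for picking the work up again without re-running Whisper. The round trip can be sketched as follows. This is a minimal example under assumptions: it uses a temporary directory and made-up `sample` data so it does not touch the notebook's own files, and `restored` is a hypothetical name.

```python
import json
import os
import tempfile

# Made-up sample in the same shape as `script_by_chapter`.
sample = [{"title": "Chapter A", "script": "some transcript", "summary": ""}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "temp_script_by_chapter.json")

    # Save: ensure_ascii=False keeps non-ASCII text readable in the file.
    with open(path, "w", encoding="utf-8") as file:
        json.dump(sample, file, indent=2, ensure_ascii=False)

    # Load: restores the list of chapter dictionaries.
    with open(path, "r", encoding="utf-8") as file:
        restored = json.load(file)

print(restored == sample)  # True
```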


# Write the note by summarizing the contents


In [10]:
import openai
from dotenv import load_dotenv

# Setup OpenAI API key 
load_dotenv()


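def summarize_text_with_gpt3(text, max_token=3000, model="gpt-3.5-turbo", languages="ko"):
    # Note: `languages` is currently unused; the prompt below is written in Korean.
    # English gloss of the prompt: "Suppose you are a university student making notes
    # by summarizing the text below. Summarize and organize it with bullet points,
    # preserving the author's intent and wording as much as possible. If an odd word
    # appears in the text, mark it with ()."
    prompt = f"""
        네가 대학생이고 아래 문장을 요약해서 노트를 만든다고 하자. 최대한 저자의 의도와 문장을 살려서 
        bullet point를 붙여서 요약/정리해줘. 무언가 문장에 이상한 단어가 나오면 () 로 표시해줘.
        ----------
        {text}
        """
    # prompt = f"Summarize the following text with bullet points in Korean:\n{text}"
    response = openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_token
    )

    # The summary is the message content of the first returned choice.
    summary = response.choices[0].message.content
    return summary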

In [11]:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = chunk_size,
    chunk_overlap  = chunk_overlap,
    length_function = len
)
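
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified stand-in written in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter also prefers to break on separators such as paragraph breaks, line breaks, and spaces, which this sketch ignores. `split_with_overlap` is a hypothetical helper used only for illustration.

```python
from typing import List

def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    # Each new chunk starts `step` characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop early enough that the last chunk is not just a repeat of the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the effect the overlap is meant to have: a sentence cut at a chunk boundary keeps a little of its context on both sides.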


In [18]:
import time


# Create temporary folder to store note
text_file_folder_path = os.path.join( os.getcwd(), 'note')
if not os.path.exists( text_file_folder_path ):
    os.makedirs(text_file_folder_path) 

# Summarize each chapter
index = 0
for c in script_by_chapter:
    
    # Split script
    texts = text_splitter.split_text(c["script"])

    # Summarize the text 
    title = c["title"]
    print( f"Chapter {title} is in-pro. ")
    summarized_text = ""
    for t in texts:
        partial_summary = summarize_text_with_gpt3(t, max_token = max_token, model = model)
        summarized_text += partial_summary
        print( ".", end="")

    c["summary"] = summarized_text

    # Save note into file
    text_file = os.path.join(text_file_folder_path, f'{index}.txt')
    with open( text_file, "w", encoding="utf-8") as file:
        file.write(summarized_text)
    index += 1

    print('\n')
    time.sleep(0.5) # Brief pause between chapters to ease pressure on the API.


Chapter 질서자유주의와 사회적 시장경제 is in-pro. 
.

Chapter 실존철학의 메시지 is in-pro. 
..

Chapter 독일 기본법(헌법)과 대한민국 헌법의 정신 is in-pro. 
..

Chapter 우리는 어떻게 살아가고 있는가? is in-pro. 


ServiceUnavailableError: The server is overloaded or not ready yet.
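
An error like the one above is usually transient, so one common remedy is to retry the call with exponential backoff. The sketch below is a generic illustration, not the notebook's code: `call_with_retry`, `max_attempts`, and `base_delay` are hypothetical names, and `flaky` is a stand-in that fails twice before succeeding. It catches every `Exception` for brevity; in practice the `except` clause would be narrowed to the transient error types of the API client in use.

```python
import time

def call_with_retry(func, *args, max_attempts=5, base_delay=1.0, **kwargs):
    """Call `func`, retrying on failure with delays of base_delay * 1, 2, 4, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Give up after the final attempt and surface the original error.
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for a remote call that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("server overloaded")
    return "ok"

result = call_with_retry(flaky, base_delay=0.01)
print(result, calls["count"])  # ok 3
```

Wrapping the per-chunk summarization call this way would let a long run survive a brief outage instead of stopping partway through a chapter.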

## Publish markdown document

The resulting `markdown_note.md` contains the summarized note for this YouTube video. 

In [None]:
import os
import json

# Save chapter data into file 
with open( "script_by_chapter.json", "w", encoding="utf-8") as file:
    file.write( json.dumps(script_by_chapter, indent=2, ensure_ascii=False) )

# Remove temporary data now that the final file has been written
os.remove("temp_script_by_chapter.json")



In [None]:
full_markdown_text = ""

for c in script_by_chapter:
    full_markdown_text += f"# {c['title']} \n\n"
    full_markdown_text += f"## Summary \n"
    full_markdown_text += f"{c['summary']} \n\n"
    full_markdown_text += f"## Script \n\n"
    full_markdown_text += f"{c['script']} \n"
    full_markdown_text += "\n\n"

In [None]:

# Write the markdown document for the note (UTF-8 so Korean text is saved correctly).
with open( "markdown_note.md", "w", encoding="utf-8") as file:
    file.write(full_markdown_text)