<a href="https://colab.research.google.com/github/martinopiaggi/summarize/blob/main/Martino_Summarize_videos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization notebook with AIs

Repository: https://github.com/martinopiaggi/summarize

In [46]:
#@markdown # Source of the summary

Type = "Text" #@param ['Text', 'Text from Google Drive','Youtube video or playlist', 'Videos on Google Drive folder']
#@markdown (*Run this cell again if you change the source*)

#@markdown ---
#@markdown #### Text
#@markdown (*only if type is text*)
Text = "\u201CHow are things?\u201D asked a friend. \u201CIt\u2019s busy, but I\u2019ll take some time to relax when things ease up,\u201D I replied. I recently caught myself giving a variation of this answer every time I was asked how I was doing. \u201CSo much work, but hopefully it will be better next week.\u201D Being busy all the time can give us an illusion of productivity which may feel reassuring, but isn\u2019t there a risk we are too busy to enjoy life?  For some people, being busy is unfortunately not an option. Students working part-time to pay for their studies, parents with two jobs just to stay afloat\u2014not everyone has the luxury of managing their time the way they see fit. But many people do have this flexibility, and yet rush from one task to another without ever taking a step back to ask: am I really enjoying any of this? Or are these tasks actually making me too busy to enjoy life?  Research shows that humans tend to do whatever it takes to keep busy, even if the activity feels meaningless to them. Dr Bren\xE9 Brown from the University of Houston describes being \u201Ccrazy busy\u201D as a numbing strategy we use to avoid facing the truth of our lives.  We are scared of idleness because stopping would mean having to really consider what we want out of life and what we currently have. Sometimes, the gap feels so wide, we\u2019d rather stay on the hamster wheel.  Being busy is a defense mechanism. It\u2019s a way to avoid just being. Having responsibilities, deadlines, a long task list\u2026 Overloading our senses can make us believe we are moving in the right direction, or at least in a direction. But the constant cycle of tasks we tackle without ever thinking often leaves us stagnant. Who proudly looks back at their old to-do lists at the end of the year and thinks: \u201CWow, I tackled so many tasks this year\u201D?  Instead of measuring progress by the quantity of work we produce, we should consider the quality of our work. 
Not just the quality of the output, as usually measured by externally-designed metrics, but the quality of the impact it has on our mental and physical well-being. \u201CDid the work feel intellectually stimulating, did I learn something new, did it help me cultivate my curiosity, did it give me the opportunity to connect with interesting people?\u201D are sensible questions to ask when work represents such a huge chunk of our lives.  \u201CYou cannot step into the same river twice, for other waters are continually flowing on,\u201D supposedly said Heraclitus. Time is like a river. If you\u2019re too busy to enjoy life\u2014too busy to spend time with friends and family, too busy to learn how to paint or play the guitar, too busy to go on that hike, too busy to cook something nice for yourself\u2014these moments will be gone, and you will never get that time back.  You may think it\u2019s too late. It\u2019s not. Like many people, I personally experience time anxiety\u2014the recurring thought that it\u2019s too late to start or accomplish something new\u2014but the reality is you probably still have many years in front of you. Defining what \u201Ctime well spent\u201D means to you and making space for these moments is one of the greatest gifts you can make to your future self.  Next time you think of learning something new, or a friend asks you if you want to do something together or have a chat, and your automatic answer is: \u201CI\u2019m just too busy\u201D, take a few minutes to actually consider whether you are actually too busy, and, if that\u2019s the case, whether this busyness is more valuable to you in the long-term than learning something new or spending time with your friend.  Maybe you are actually going through a temporary phase where you\u2019re working on an exciting but all-consuming project\u2014and that\u2019s fine. Such activities you feel extremely passionate about are actually nurturing. 
But if \u201CI\u2019m just too busy\u201D is becoming a recurrent answer of yours, you may want to consider whether it is possible to be that excited over such a long period of time.   Again, if that\u2019s the case, lucky you\u2014the problem is we usually don\u2019t even take the time to consider the alternative, which is that we\u2019re numbing our minds with work. Being busy with exciting work is good. Being too busy to enjoy life, spending time with the people you love, and exploring your full potential is not. If you belong to the people who do have a choice, consider making the most of your fortunate situation." #@param {type:"string"}
#@markdown #### YouTube video or playlist
#@markdown (*only if type is YouTube video or playlist*)
URL = "https://www.youtube.com/watch?v=VqnF1TTkKV0" #@param {type:"string"}
#@markdown #### Google Drive video, audio (mp4, wav), or folder containing video and/or audio files
#@markdown (*only if type is from Google Drive*)
video_path = "Colab Notebooks/transcription/my_video.mp4" #@param {type:"string"}
#@markdown ---
#@markdown #### If the source is a video, do you want timestamps in the final summary?
Timestamps = True #@param {type:"boolean"}
# Plain-text sources carry no timestamps, so force the flag off for them
if Type in ("Text", "Text from Google Drive"):
  Timestamps = False

In [None]:
#@markdown ---
#@markdown # Install libraries
#@markdown This cell will take a little while to download several libraries

#@markdown ---
!pip install transformers
!pip install tensorflow
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn",device=0)

import re
import math

if Type in ("Youtube video or playlist", "Videos on Google Drive folder"):

  video_path_local_list = []
  ! pip install faster-whisper
  from faster_whisper import WhisperModel
  from pathlib import Path
  import subprocess
  import torch
  import shutil
  import numpy as np

  model = WhisperModel('small', device="cuda", compute_type='int8')


if Type == "Youtube video or playlist":
  !pip install yt-dlp
  from pathlib import Path
  import yt_dlp

  ydl_opts = {
  'format': 'm4a/bestaudio/best',
  'outtmpl': '%(id)s.%(ext)s',
  # ℹ️ See help(yt_dlp.postprocessor) for a list of available Postprocessors and their arguments
  'postprocessors': [{  # Extract audio using ffmpeg
  'key': 'FFmpegExtractAudio',
  'preferredcodec': 'wav',
  }]
  }

  with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    error_code = ydl.download([URL])
    list_video_info = [ydl.extract_info(URL, download=False)]

  for video_info in list_video_info:
    video_path_local_list.append(Path(f"{video_info['id']}.wav"))

  for video_path_local in video_path_local_list:
    # Re-encode only .mp4 inputs; yt-dlp has already extracted .wav audio above
    if video_path_local.suffix == ".mp4":
        video_path_local = video_path_local.with_suffix(".wav")
        # Convert to 16 kHz mono 16-bit PCM WAV
        result  = subprocess.run(["ffmpeg", "-i", str(video_path_local.with_suffix(".mp4")), "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", str(video_path_local)])
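
The conversion step above re-encodes audio to 16 kHz, mono, 16-bit PCM WAV before transcription. As a minimal sketch, the same ffmpeg argument list can be assembled by a small helper. `build_wav_command` is a hypothetical name that does not appear in the notebook, and the `-y` (overwrite without prompting) flag is an added assumption.

```python
from pathlib import Path


def build_wav_command(source: Path) -> list[str]:
    """Return the ffmpeg arguments that convert `source` to 16 kHz mono WAV.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    target = source.with_suffix(".wav")
    return [
        "ffmpeg", "-y",          # -y: overwrite the output file (assumption)
        "-i", str(source),       # input file
        "-vn",                   # drop any video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM audio
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        str(target),
    ]


cmd = build_wav_command(Path("my_video.mp4"))
print(" ".join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` so that a failed conversion raises an error instead of passing silently.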


In [39]:
#@markdown ---


if Type not in ("Text", "Text from Google Drive"):

  def seconds_to_time_format(s):
      # Convert seconds to whole hours, minutes, and seconds (the fraction is dropped)
      hours = s // 3600
      s %= 3600
      minutes = s // 60
      s %= 60
      seconds = s // 1

      # Return the formatted string as HH:MM:SS
      return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"


  #@markdown # Transcription
  #@markdown Transcription of videos (if needed)
  language = "auto" #@param ["auto", "en", "zh", "ja", "fr", "de"] {allow-input: true}
  initial_prompt = "Here are some English words you may need: OneDrive" #@param {type:"string"}

  segments, info = model.transcribe(str(video_path_local), beam_size=5,
                                    language=None if language == "auto" else language,
                                    initial_prompt=initial_prompt,
                                    vad_filter=True, #voice activity detection
                                    vad_parameters=dict(min_silence_duration_ms=50))

  ext_name = ".srt"
  transcript_file_name = video_path_local.stem + ext_name
  sentence_idx = 1
  with open(transcript_file_name, 'w') as f:
    for segment in segments:
      if Timestamps:
        ts_start = seconds_to_time_format(segment.start)
        ts_end = seconds_to_time_format(segment.end)
        f.write(f"{ts_start} -> {ts_end} ")
      f.write(f"{segment.text.strip()}\n")
      sentence_idx = sentence_idx + 1

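  from IPython.display import Markdown, display

  try:
    # Copy the transcript to Drive if drive_whisper_path has been defined elsewhere
    shutil.copy(video_path_local.parent / transcript_file_name,
                drive_whisper_path / transcript_file_name)
    display(Markdown(f"**Transcript file created: {drive_whisper_path / transcript_file_name}**"))
  except (NameError, OSError):
    # No Drive destination available: report the local file instead
    display(Markdown(f"**Transcript file created: {video_path_local.parent / transcript_file_name}**"))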
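
When timestamps are enabled, each transcript line written by the cell above has the form `HH:MM:SS -> HH:MM:SS text`. The following is a minimal, standalone sketch of that line format. `format_timestamp` and `transcript_line` are hypothetical names used for illustration, not functions from the notebook.

```python
def format_timestamp(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS, dropping the fraction."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcript_line(start: float, end: float, text: str,
                    timestamps: bool = True) -> str:
    """Build one transcript line, optionally prefixed with its time range."""
    prefix = ""
    if timestamps:
        prefix = f"{format_timestamp(start)} -> {format_timestamp(end)} "
    return f"{prefix}{text.strip()}\n"


print(transcript_line(3725.4, 3730.9, " Hello there. "), end="")
# → 01:02:05 -> 01:02:10 Hello there.
```

Keeping the range in a fixed `start -> end ` shape is what lets the summarization cell find the timestamps again with regular expressions.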



In [None]:
#@markdown ---
#@markdown # Summarization
#@markdown Using https://huggingface.co/facebook/bart-large-cnn

summarizer = pipeline("summarization", model="facebook/bart-large-cnn",device=0)
tokenizer = summarizer.tokenizer

if Type not in ("Text", "Text from Google Drive"):
  Text = open(transcript_file_name, "r").read()

Text = re.sub(r'\n', ' ', Text)

tokens = tokenizer.encode(Text.strip())

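# Number of chunks needed so that each chunk holds at most 512 tokens
num_chunks = math.ceil(len(tokens) / 512)
# Tokens per chunk (floor division, so a short extra chunk may be left over)
chunk_size = len(tokens) // num_chunks

# Split the tokens into chunks of chunk_size tokens
chunks = [tokens[i:i+chunk_size] for i in range(0, len(tokens), chunk_size)]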

# Merge a short trailing chunk into the previous one, keeping the original token order
if len(chunks) > 1 and (len(chunks[-1]) + len(chunks[-2])) < 1024:
  last_chunk = chunks.pop()
  chunks[-1] = chunks[-1] + last_chunk

summary = ''

for chunk in chunks:
    # Decode unconditionally so chunkText exists even when Timestamps is False
    chunkText = tokenizer.decode(chunk, skip_special_tokens=True)
    if Timestamps:
      # First start time and last end time covered by this chunk
      init_ts = re.findall(r"\d{2}:\d{2}:\d{2} -", chunkText)[0]
      end_ts = re.findall(r"> \d{2}:\d{2}:\d{2} ", chunkText)[-1]
      # Strip the timestamps before summarizing
      chunkText = re.sub(r"(\d{2}:?)* -> (\d{2}:?)*", '', chunkText)

    # Set max_length and min_length based on token count
    max_length = len(chunk) // 3
    min_length = len(chunk) // 5

    # Generate the summary for each chunk without sampling
    summary_chunk = summarizer(chunkText, max_length=max_length, min_length=min_length, do_sample=False)
    if Timestamps:
      summary += init_ts
      summary += end_ts + ' '
    summary += summary_chunk[0]['summary_text'] + "\n"

print(summary)
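
The chunking logic in the cell above splits the token list into roughly equal pieces of at most 512 tokens. Below is a minimal sketch of the same idea on a plain list of integers, with no tokenizer required. `split_into_chunks` is a hypothetical helper, and it rounds the chunk size up rather than down, which is a variation on the notebook's approach that avoids a tiny leftover chunk.

```python
import math


def split_into_chunks(tokens: list[int], max_len: int = 512) -> list[list[int]]:
    """Split `tokens` into near-equal chunks, each at most `max_len` long."""
    if not tokens:
        return []
    # Fewest chunks that can hold every token within the limit
    num_chunks = math.ceil(len(tokens) / max_len)
    # Round up so the tokens spread over exactly num_chunks pieces
    chunk_size = math.ceil(len(tokens) / num_chunks)
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


chunks = split_into_chunks(list(range(1300)))
print([len(c) for c in chunks])  # → [434, 434, 432]
```

Spreading tokens evenly, rather than filling each chunk to 512 and leaving a short remainder, gives the summarizer inputs of similar length, so the per-chunk `max_length` and `min_length` stay comparable.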