<a href="https://colab.research.google.com/github/martinopiaggi/summarize/blob/main/Martino_Summarize_videos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization notebook with AIs

---



Repository: https://github.com/martinopiaggi/summarize

In [None]:
#@markdown # Source of the summary
#@markdown ## **Type**
Type = "Text" #@param ['Text', 'Text from Google Drive','Youtube video or playlist', 'Videos on Google Drive folder','Dropbox video link']
#@markdown (*Run this cell again if you change the source*)

#@markdown ---
#@markdown #### **Text**
#@markdown (*only if type is text*)
Text = "In the past decade, there have been significant advancements in extending and enhancing the MapReduce framework. These improvements have brought about various changes:  - The traditional two-step processing in MapReduce has evolved into the capability to handle arbitrary acyclic graphs of transformations. This expansion allows for more complex computations and even supports iterative calculations in some cases. - Another noteworthy improvement is the shift from batch processing to stream processing. With this, MapReduce can not only handle large datasets with high throughput but also provide low latency, enabling real-time data analysis. - Additionally, advancements in storage approaches have been made. Systems have moved beyond relying solely on disk storage, with some utilizing main-memory or hybrid approaches. These developments enhance performance and efficiency within the MapReduce framework." #@param {type:"string"}
#@markdown #### **YouTube video or playlist**
#@markdown (*only if type is YouTube video or playlist*)
URL = "https://www.youtube.com/watch?v=tLK-vfFXL50" #@param {type:"string"}
#@markdown #### **Google Drive video**
#@markdown *Video or audio file (mp4, wav), or a folder containing video and/or audio files*
#@markdown (*only if type is from Google Drive*)
video_path = "Colab Notebooks/transcription/my_video.mp4" #@param {type:"string"}
#@markdown #### **Dropbox link video**
#@markdown *The video share link which allows anyone to view it*
dropbox_URL = "https://www.dropbox.com/scl/fi/fj96cauwfcz1ih9t9629i/2023_10_25_DistSys_BigData.mp4?rlkey=rz95nslyghxhcmod3ra4slsje&dl=1" #@param {type:"string"}
#@markdown ---
#@markdown #### If the source is a video, include timestamps in the final summary?
Timestamps = False #@param {type:"boolean"}
#@markdown ---
#@markdown #### Desired output length as a fraction of the original length (0.2 = 20%)
#@markdown

Min_percentual_summary = 0.2 #@param {type:"number"}
Max_percentual_summary = 0.4 #@param {type:"number"}

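# Text sources have no timestamps; `in` checks both options
# (`"A" or "B"` evaluates to just "A", so the second option was never tested)
if Type in ("Text", "Text from Google Drive"):
  Timestamps = False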
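
The source-type check above compares `Type` against several options. A minimal sketch of why a membership test is the safer form: in Python, `("A" or "B")` evaluates to its first truthy operand, so a comparison against that expression only ever tests the first option. The helper name `is_text_source` is hypothetical and not part of the notebook.

```python
def is_text_source(source_type: str) -> bool:
    """Return True for sources that need no transcription (hypothetical helper)."""
    # A tuple membership test checks every listed option
    return source_type in ("Text", "Text from Google Drive")


# The `or` expression collapses to its first truthy operand before any comparison
collapsed = ("Text" or "Text from Google Drive")

print(collapsed)
print(is_text_source("Text from Google Drive"))
print(is_text_source("Dropbox video link"))
```

The same pattern applies to the other multi-option checks on `Type` in later cells.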

In [None]:
#@markdown ---
#@markdown # Install libraries
#@markdown This cell will take a little while to download several libraries

#@markdown ---
!pip install transformers
!pip install tensorflow
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn",device=0)

import re
import math

# Video sources need the transcription stack; `in` tests all three options
if Type in ("Youtube video or playlist",
            'Videos on Google Drive folder',
            "Dropbox video link"):

  video_path_local_list = []
  !pip install faster-whisper
  from faster_whisper import WhisperModel
  from pathlib import Path
  import subprocess
  import torch
  import shutil
  import numpy as np

  if Type == "Youtube video or playlist":
    !pip install yt-dlp
    from pathlib import Path
    import yt_dlp

  if Type == "Dropbox video link":
    # -y answers the install prompt so the cell does not wait for input
    !sudo apt update && sudo apt install -y ffmpeg



In [None]:
#@markdown ---
#@markdown # Downloading videos
#@markdown Download the video(s) and convert them to audio (if needed)

#@markdown ---

if Type == "Youtube video or playlist":

  ydl_opts = {
  'format': 'm4a/bestaudio/best',
  'outtmpl': '%(id)s.%(ext)s',
  # ℹ️ See help(yt_dlp.postprocessor) for a list of available Postprocessors and their arguments
  'postprocessors': [{  # Extract audio using ffmpeg
  'key': 'FFmpegExtractAudio',
  'preferredcodec': 'wav',
  }]
  }

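  with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    # Download once and reuse the metadata returned by the same call
    info = ydl.extract_info(URL, download=True)

  # A playlist lists its videos under 'entries'; a single video is its own entry
  list_video_info = [e for e in info['entries'] if e] if 'entries' in info else [info]

  for video_info in list_video_info:
    # FFmpegExtractAudio has already converted each download to <id>.wav
    video_path_local_list.append(Path(f"{video_info['id']}.wav"))

  # Note: the transcription cell uses the last value of video_path_local
  for video_path_local in video_path_local_list:
    source_mp4 = video_path_local.with_suffix(".mp4")
    # Convert to 16 kHz mono WAV only when an .mp4 is actually present
    if source_mp4.exists():
      result = subprocess.run(["ffmpeg", "-i", str(source_mp4), "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", str(video_path_local)])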

if Type == "Dropbox video link":
    # Quote the URL: it contains '&', which the shell would otherwise treat as a command separator
    !wget -O dropbox_video.mp4 "$dropbox_URL"
    !ffmpeg -i dropbox_video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 dropbox_video_audio.wav
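
The conversion step above turns a video into 16 kHz mono PCM audio. A minimal sketch of building the same ffmpeg argument list in Python, so it can be inspected before being passed to `subprocess.run`. The function name `build_wav_command` is an assumption made for illustration; the flags are the ones already used in this cell.

```python
from pathlib import Path


def build_wav_command(source: str) -> list[str]:
    """Build an ffmpeg command that extracts 16 kHz mono PCM audio (hypothetical helper)."""
    src = Path(source)
    dst = src.with_suffix(".wav")  # same name, .wav extension
    return [
        "ffmpeg",
        "-i", str(src),        # input file
        "-vn",                 # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", "16000",        # 16 kHz sample rate
        "-ac", "1",            # mono
        str(dst),
    ]


cmd = build_wav_command("dropbox_video.mp4")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would raise an error if ffmpeg fails, instead of continuing silently.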


In [None]:
#@markdown # Transcription
#@markdown Transcription of videos (if needed)
language = "en" #@param ["auto", "en", "zh", "ja", "fr", "de"] {allow-input: true}
initial_prompt = "" #@param {type:"string"}

if Type == ("Dropbox video link"):
    video_path_local = "dropbox_video_audio.wav"


if Type not in ("Text", "Text from Google Drive"):

  def seconds_to_time_format(s):
      # Convert seconds to hours, minutes, and whole seconds (fractions are dropped)
      hours = s // 3600
      s %= 3600
      minutes = s // 60
      seconds = s % 60

      # Return the formatted string as HH:MM:SS
      return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"


  model = WhisperModel('small', device="cuda", compute_type='int8')
  segments, info = model.transcribe(str(video_path_local), beam_size=5,
                                    language=None if language == "auto" else language,
                                    initial_prompt=initial_prompt,
                                    vad_filter=True, #voice activity detection
                                    vad_parameters=dict(min_silence_duration_ms=50))

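  transcript_file_name = "dropbox_video_audio" + ".txt"
  # Write the transcript to the file that the summarization cell reads back
  with open(transcript_file_name, 'w') as f:
    for segment in segments:
      if Timestamps:
        ts_start = seconds_to_time_format(segment.start)
        ts_end = seconds_to_time_format(segment.end)
        # "HH:MM:SS -> HH:MM:SS " is the format the summarization regexes expect
        f.write(f"{ts_start} -> {ts_end} ")
      f.write(segment.text.strip() + " ")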


#
 # try:
  #  shutil.copy(video_path_local.parent / transcript_file_name,
   #           drive_whisper_path / transcript_file_name
    #)
    #display(Markdown(f"**Transcript file created: {drive_whisper_path / transcript_file_name}**"))
 # except:
  #  display(Markdown(f"**Transcript file created: {video_path_local.parent / transcript_file_name}**"))
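
The transcript marks each segment with a start and end time that the summarization cell later recovers with regular expressions. A minimal sketch of that round trip, assuming the `HH:MM:SS -> HH:MM:SS ` layout; `format_timestamp` and `transcript_line` are hypothetical names used only for this illustration.

```python
import re


def format_timestamp(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS, dropping fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcript_line(start: float, end: float, text: str) -> str:
    """Prefix a segment's text with its 'start -> end' timestamps."""
    return f"{format_timestamp(start)} -> {format_timestamp(end)} {text.strip()} "


line = transcript_line(3725.4, 3731.9, " hello world ")

# The same patterns the summarization cell uses to find chunk boundaries
starts = re.findall(r"\d{2}:\d{2}:\d{2} -", line)
ends = re.findall(r"> \d{2}:\d{2}:\d{2} ", line)

print(line)
print(starts, ends)
```

The spaces around `->` matter: without them neither pattern matches and the timestamp lookup finds nothing.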



In [None]:
#@markdown ---
#@markdown # Summarization
#@markdown Using https://huggingface.co/facebook/bart-large-cnn

summarizer = pipeline("summarization", model="facebook/bart-large-cnn",device=0)
tokenizer = summarizer.tokenizer

if Type not in ("Text", "Text from Google Drive"):
  # Read the transcript produced by the transcription cell
  with open(transcript_file_name, "r") as f:
    Text = f.read()

Text = re.sub(r'\n', ' ', Text)

tokens = tokenizer.encode(Text.strip())

# Calculate the number of chunks needed (at most 512 tokens each)
num_chunks = max(1, math.ceil(len(tokens) / 512))
chunk_size = max(1, len(tokens) // num_chunks)

# Split the tokens into chunks
chunks = [tokens[i:i+chunk_size] for i in range(0, len(tokens), chunk_size)]
# Integer division can leave a short remainder chunk:
# fold it into the previous chunk, keeping the tokens in their original order
if len(chunks) > 1 and len(chunks[-1]) < chunk_size:
  last_chunk = chunks.pop()
  chunks[-1] = chunks[-1] + last_chunk

summary = ''

for chunk in chunks:
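    # Skip <s> and </s> so the special tokens do not leak into the text
    chunkText = tokenizer.decode(chunk, skip_special_tokens=True)
    if Timestamps:
      starts = re.findall(r"\d{2}:\d{2}:\d{2} -", chunkText)
      ends = re.findall(r"> \d{2}:\d{2}:\d{2} ", chunkText)
      # A chunk may hold no complete timestamp; fall back to an empty label
      init_ts = starts[0] if starts else ''
      end_ts = ends[-1] if ends else ''
      chunkText = re.sub(r"(\d{2}:?)* -> (\d{2}:?)*", '', chunkText)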

    # Set max_length and min_length as fractions of the chunk's token count
    max_length = int(len(chunk) * Max_percentual_summary)
    min_length = int(len(chunk) * Min_percentual_summary)

    # Generate the summary for each chunk without sampling (deterministic decoding)
    summary_chunk = summarizer(chunkText, max_length=max_length, min_length=min_length, do_sample=False)
    if Timestamps:
      summary += init_ts
      summary += end_ts + ' '
    summary += summary_chunk[0]['summary_text'] + "\n"
    print(summary_chunk[0]['summary_text'])

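# transcript_file_name exists only for video sources;
# "summary.txt" is a fallback name chosen here for text sources
output_file_name = transcript_file_name if Type not in ("Text", "Text from Google Drive") else "summary.txt"
with open(output_file_name, 'w') as f:
  f.write(summary)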
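
The chunking and length logic in this cell can be tried out without loading the model. A minimal sketch on a plain list of integers standing in for token ids, assuming a 512-token budget per chunk and the 0.2 / 0.4 ratios from the form above. `split_into_chunks` and `length_bounds` are hypothetical helper names, not part of the notebook.

```python
import math


def split_into_chunks(tokens, max_chunk_tokens=512):
    """Split tokens into near-equal chunks, folding a short tail into the last chunk."""
    if not tokens:
        return []
    num_chunks = math.ceil(len(tokens) / max_chunk_tokens)
    chunk_size = len(tokens) // num_chunks
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    # Integer division may leave a few tokens over; append them in order
    if len(chunks) > 1 and len(chunks[-1]) < chunk_size:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + tail
    return chunks


def length_bounds(chunk_len, min_ratio=0.2, max_ratio=0.4):
    """Derive (min_length, max_length) for the summarizer from a chunk's token count."""
    min_length = max(1, int(chunk_len * min_ratio))
    # Keep max_length strictly above min_length, even for tiny chunks
    max_length = max(min_length + 1, int(chunk_len * max_ratio))
    return min_length, max_length


tokens = list(range(1100))  # stand-in for tokenizer output
chunks = split_into_chunks(tokens)
sizes = [len(c) for c in chunks]
rejoined = [t for c in chunks for t in c]

print(sizes)
print(length_bounds(sizes[0]))
```

Checking that the rejoined chunks equal the input is a quick way to confirm that no tokens were dropped or reordered during the merge.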