<a href="https://colab.research.google.com/github/learningsomethingnew/podcast_summary/blob/main/summarize_podcasts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Click the badge 👆 to begin if you found this on GitHub.

# Podcast Summarizer
Podcasts are a wealth of information, but they tend to have tons of episodes that run upwards of 1-2 hours each. This Google Colab takes a given podcast and creates a transcript and summary text file for each mp3 audio episode, for free! There is also an option to use OpenAI's GPT3 for the summaries, which does charge, but is super cheap. 

## Details
Each episode will get a text file in the transcription folder that will contain:

*   Speaker identification (optional)
*   A Speech to Text transcription of what was said by each speaker. Example: 
  > 0:00:00.000-0:00:33.960> SPEAKER_01:  Welcome to the Profitable Farmer podcast, where we share stories and tips to help you run a better farming business and create your very own freedom farm. If you're looking to work smarter and not harder in your farm business, welcome, you're in the right place. .... 
*   A summary of the episode. Example:
  > FarmTender was founded in March 2012, but the backstory of its founder goes back further. He grew up on a family farm in the Wimmera and made the decision to move away from primary production so he could pursue something new. He had an interest in building databases and the internet, and was brave enough to leave something that offered certainty to go and try something else. After trying a few different things, he heard a farmer at a field day asking why he couldn't sell his products online. This gave him the idea for FarmTender, and he worked with a developer to get it up and running. It was a tough process, but he had a database of 2000 members to help him get started. With lots of cold calling and content creation, FarmTender has grown significantly since then....
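
As a rough illustration of the transcript line format shown above, the sketch below turns a start time, stop time, speaker label, and text into one line. The helper names `to_hms` and `format_line` are made up for this example, and the separator characters are an assumption based on the sample line, so the notebook's own output may differ slightly.

```python
def to_hms(seconds: float) -> str:
    """Convert a number of seconds to H:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_line(start: float, stop: float, speaker: str, text: str) -> str:
    """Build one transcript line (hypothetical layout, see the note above)."""
    return f"{to_hms(start)}-{to_hms(stop)}> {speaker}: {text}"


print(format_line(0.0, 33.96, "SPEAKER_01", "Welcome to the podcast."))
```

Working in whole milliseconds before splitting into hours, minutes, and seconds avoids rounding surprises at boundaries such as 59.9996 seconds.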

### Note About The Summaries
This project is aiming for good enough to quickly read through.

## FAQ

### Does my computer need a graphics card?
NO! This runs in the cloud and does NOT require your personal machine to have a graphics card.

### Where is the output located?
gDrive -> My Drive -> Summarize_Podcasts -> PODCAST_NAME_HERE -> transcripts <br>

In your gDrive there will be a directory called "Summarize_Podcasts". When you run this notebook for a podcast, a directory named after the podcast (lowercased, with spaces replaced by underscores) will be created in this folder. Podcast "Reply All" will be "reply_all". In there you will find the "transcripts" folder, which holds the transcript and summary for each episode.
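
To make the folder layout concrete, here is a small sketch of how the output path can be assembled. The function name `transcript_dir` is hypothetical. The lowercase-and-underscore normalization and the `transcripts` folder name mirror what the code cell in this notebook does, but treat the rest as an illustration only.

```python
import os


def transcript_dir(base_dir: str, podcast_title: str) -> str:
    """Return the folder where an episode's transcript and summary would land."""
    # Lowercase the title and swap spaces for underscores to get the folder name.
    folder = podcast_title.lower().replace(" ", "_")
    return os.path.join(base_dir, folder, "transcripts")


print(transcript_dir("/content/drive/MyDrive/Summarize_Podcasts", "My Show"))
```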

### How long does this need to run?
* In testing, for a 1 hour podcast episode it takes around 20-25 minutes.

### Which Runtime Type?
You will need to use a **GPU** Runtime. In order to select a runtime, on the menu: 
1. Click “Runtime” -> “Change runtime type”
2. Under "Hardware Accelerator" -> select "GPU" 
3. Click “Save”
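
If you want to confirm that the runtime really has a GPU before running everything, a check along these lines can help. This is a minimal sketch that assumes the `nvidia-smi` command line tool is how the GPU is detected. The helper names `listing_has_gpu` and `runtime_has_gpu` are made up for this example.

```python
import shutil
import subprocess


def listing_has_gpu(listing: str) -> bool:
    """True if the text printed by `nvidia-smi -L` mentions a GPU."""
    return "gpu" in listing.lower()


def runtime_has_gpu() -> bool:
    """Check the current machine, treating a missing nvidia-smi as 'no GPU'."""
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(["nvidia-smi", "-L"], stdout=subprocess.PIPE)
    return listing_has_gpu(result.stdout.decode("utf-8"))


print(runtime_has_gpu())
```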

### How do I run this?
1. Go to the Required Information section, below, and fill out the forms.
2. Once filled in, on the menu, click “Runtime” -> “Run All”
 * Or you can press "shift + enter" on your keyboard until you get to the end of this document to run each cell.

### Keep this tab open and don't let your computer sleep
Unless you are on the $50 plan, Colab requires this tab to stay open and the computer running it to stay awake while the notebook executes.

### Uh oh, the runtime shutdown before completing
No need to worry. This code has been set up to pick up where it left off. Just reconnect to a GPU Colab runtime and run it again. Depending on how long each episode is and how many episodes there are, this could take a few sessions. 
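
The resume behaviour boils down to checking whether an output file already exists before redoing the work. The sketch below is a simplified illustration of that idea, not the notebook's exact logic. `process_once` and the file name are hypothetical.

```python
import os
import tempfile


def process_once(output_path: str, make_output) -> bool:
    """Run make_output() and save its result only if output_path is missing.

    Returns True when the work was done, False when it was skipped.
    """
    if os.path.isfile(output_path):
        return False
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(make_output())
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "episode-summary.txt")
    first = process_once(path, lambda: "summary text")   # does the work
    second = process_once(path, lambda: "summary text")  # skipped, file exists
    print(first, second)
```

Because the finished files live in Google Drive rather than on the Colab machine, they survive a disconnect and the next session can skip them.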

### Empty gDrive Trash
If you run this a few times, you will need to empty your gdrive trash to free up space. [More information can be found here on how to do that ](https://support.google.com/drive/answer/2375102?hl=en&co=GENIE.Platform%3DDesktop)

### Which languages are supported?
* English

While Whisper does support additional languages, with varying degrees of accuracy, I don't know enough of the various languages to validate whether the summarization is even in the "good enough" category. Reach out if you want to help.

### How do I track progress?
After you click the run button, scroll down to the bottom to follow the progress in the logs.

In [None]:
#@markdown 👈👈👈 Click this play button after filling out this form. Scroll down to see the progress
import json
from google.colab import drive
import subprocess
import logging
import sys

# logging setup
logger = logging.getLogger('podcast_summary')

logging.basicConfig(
  format='%(asctime)s - %(message)s', 
  level=logging.INFO,
  force=True
)

#@title Please fill out this form

#@markdown ## Podcast XML <img src="https://upload.wikimedia.org/wikipedia/en/thumb/4/43/Feed-icon.svg/256px-Feed-icon.svg.png" alt="RSS Feed Icon" height=35; />
#@markdown Please provide the Podcast's RSS Feed XML link here.

podcast_xml = "https://feeds.megaphone.fm/replyall" #@param {type:"string"}
#@markdown Number of episodes you would like to summarize, starting from most recent? Use 0 for all episodes.
num_of_episodes = 5 #@param {type:"integer"}
#@markdown <br>



#@markdown ### Hugging Face Setup <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="RSS Feed Icon" height=35; />
#@markdown Do you want to get an output file that identifies each speaker? 
#@markdown If "False" then you can skip the api key
identify_speakers = True #@param ["False", "True"] {type:"raw"}
#@markdown 1. Please provide a Hugging Face READ api key. [You can get one by following this guide](https://huggingface.co/docs/hub/security-tokens)
apikey_for_hugging_face = '' #@param {type:"string"}
#@markdown 2. Visit [pyannote/speaker-diarization](https://huggingface.co/pyannote/speaker-diarization) and accept user conditions <br> <img src="https://i.imgur.com/HpwAGpR.png" alt="RSS Feed Icon" height=120; />
#@markdown 3. Visit [pyannote/segmentation](https://huggingface.co/pyannote/segmentation) and accept user conditions
#@markdown <br>
# compare the variable, not the string literal of its name, so a blank key is caught
if identify_speakers and apikey_for_hugging_face == "":
  logger.error("No API Key")
  sys.exit("ERROR: !!!!! Please input an API key for Hugging Face")

#@markdown  ## Summarize Setup <img src="https://i.imgur.com/Mc95bds.png" alt="RSS Feed Icon" height=40; />

#@markdown In order to summarize a podcast episode, you will need to select which tool you would like to use. This colab is setup with:<br>
#@markdown * Sumy - A free library that uses Latent Semantic Analysis (LSA)... tldr: It is OK and summarizes by finding high value sentences 
#@markdown and combining them.
#@markdown * GPT3 - Requires an [API key and a credit card after the free credits](https://elephas.app/blog/how-to-create-openai-api-keys-cl5c4f21d281431po7k8fgyol0). More advanced & powerful than Sumy.
#@markdown   *  In testing, 3, 1 hour podcasts summarized cost me ~1 dollar USD. 
summarization_tool = "Sumy (Free)" #@param ["Sumy (Free)", "GPT3 (Requires API Key)"]
#@markdown if you selected GPT3, you will need to input the API key here. Otherwise you can leave blank.
apikey_for_openai = "" #@param {type:"string"}
#@markdown <br>
if summarization_tool.startswith("GPT3") and apikey_for_openai == "":
  logger.error("No API Key")
  sys.exit("ERROR: !!!!! Please input an API key for OpenAI")

#@markdown ## Optional Settings <img src="https://cdn2.iconfinder.com/data/icons/connectivity/32/setting-512.png" height=30;>
#@markdown Do you want to summarize each episode?
summarize_episode = True #@param ["False", "True"] {type:"raw"}

#@markdown What percent of sentences do you want to trim off the front and back of each
#@markdown episode? The idea here is to skip ads, intros, and outros to reduce noise
#@markdown for the summary. For example, with 800 sentences at 10%, 80 sentences from the front
#@markdown  and 80 from the back will be ignored when summarizing.
percent_sentences_to_skip = 4 #@param {type:"slider", min:0, max:40, step:1}

#@markdown ### Google Drive <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/12/Google_Drive_icon_%282020%29.svg/2295px-Google_Drive_icon_%282020%29.svg.png" alt="RSS Feed Icon" height=20; />
#@markdown Free Colab has a limited runtime. By leveraging your Google Drive, we can iterate through the podcast 
#@markdown and pick back up where we left off after the runtime ends. 
logger.info('A pop up will appear asking for access to your gDrive.')
drive.mount('/content/drive')
#@markdown Google Drive Directory Path. (Can leave as default)
path_to_use = "/content/drive/MyDrive/Summarize_Podcasts" #@param {type:"string"}

#@markdown ### Whisper (Speech To Text)
#@markdown Which speech to text model would you like to use? The larger the model, 
#@markdown the more RAM is needed. Only change this if you know what it means.
whisper_model = "medium.en" #@param ["large-v2", "large-v1", "large", "medium", "medium.en", "base", "base.en", "tiny", "tiny.en", ""]
#@markdown ### Storage Management
#@markdown Do you want to delete the podcast MP3s after completion? Helps conserve gDrive space.
#@markdown This is done after this code fully completes to reduce duplicate downloads
delete_mp3s_when_done = True #@param ["False", "True"] {type:"raw"}
#@markdown ### Auto Shutdown
#@markdown If you want Colab to auto disconnect the environment when completed
auto_shutoff = True #@param ["False", "True"] {type:"raw"}

#@markdown 👇👇👇 Progress logs will appear below this 👇👇👇 


# validate that we have the right runtime environment
logger.info("Verifying the current runtime environment has a GPU")
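try:
  result = subprocess.run(["nvidia-smi", "-L"], stdout=subprocess.PIPE)
  output = result.stdout.decode('utf-8')
except FileNotFoundError:
  # runtimes without a GPU may not have nvidia-smi installed at all
  output = ""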
# catch the runtime if it does not have a GPU. Example output "GPU 0: Tesla T4"
if 'gpu' not in output.lower():
  logger.error("No GPU detected")
  logger.info("You will need to use a **GPU** Runtime. In order to select a runtime, on the menu:") 
  logger.info("1. Click “Runtime” -> “Change runtime type”")
  logger.info('2. Under "Hardware Accelerator" -> select "GPU"')
  logger.info("""3. Click “Save”""")
  sys.exit("ERROR: !!!!! Please change the Runtime environment to include a GPU. See the instructions above.")
else: 
  logger.info("Runtime environment has a GPU! Now installing packages")

#################### PACKAGE INSTALLING
################################################################################
logger.info("Installing needed libraries")

def install_packages(packages: list):
  """
  Installs the given packages, skipping any that pip already lists as installed
  """
  logger.info("Checking if packages have already been installed")
  result = subprocess.run(["pip", "freeze"], stdout=subprocess.PIPE)
  installed_packages = result.stdout.decode().split("\n")
  packages_to_install = []
  for package in packages:
      package_installed = False
      for p in installed_packages:
          if package in p:
              package_installed = True
              break
      if not package_installed:
          packages_to_install.append(package)
  if packages_to_install:
    logger.info(f"Installing {len(packages_to_install)} packages. This will take a few minutes")
    subprocess.run(["pip", "install", "-q"] + packages_to_install)
    for package in packages_to_install:
        logger.info(f"Installed package {package}")
  else:
    logger.info("All packages already installed")

packages = [
  # requests for downloading the podcast feed and mp3s
  "requests", 
  # for processing the podcast xml 
  "feedparser",
  # for speech to text
  "git+https://github.com/openai/whisper.git",
]

if identify_speakers:
  logger.info("Identifying Speakers set to True.")
  packages += [
          "torch==1.11.0",
          "torchvision==0.12.0",
          "torchaudio==0.11.0",
          "torchtext==0.12.0",
          "speechbrain==0.5.12",
          "pyannote.audio==2.1.1",
          "pydub==0.25.1"
      ]

# for text summarization
if summarization_tool.startswith("GPT3"):
  logger.info("GPT3 selected for summarizing")
  packages += [
      "openai==0.26.4",
      "backoff"
    ]
else:
  logger.info("Sumy selected for summarizing")
  packages+=["sumy==0.11.0"]

install_packages(packages)

#################### PACKAGE IMPORTING
################################################################################
logger.info("Importing the packages")
import os
import glob
import json
import time
import re
import gc
import math
import shutil

import google.colab
from google.colab import runtime

import feedparser
import requests

if summarization_tool.startswith("GPT3"):
  # backoff is only installed when GPT3 is selected
  import backoff
import whisper
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

if identify_speakers==True:
  from pyannote.audio import Audio 
  from pyannote.audio import Pipeline
  from pydub import AudioSegment
  from pydub import silence
  from pydub.silence import split_on_silence

if summarization_tool.startswith("GPT3"):
  import openai
  openai.api_key = apikey_for_openai
else:
  from sumy.parsers.plaintext import PlaintextParser
  from sumy.nlp.tokenizers import Tokenizer
  from sumy.summarizers.lsa import LsaSummarizer as Summarizer
  from sumy.nlp.stemmers import Stemmer
  from sumy.utils import get_stop_words
  

# max length of the clips. We add .5 seconds to the front and back if we 
# split on silence as 30 seconds is the max length of whisper. Multiply by 1k
# to get milliseconds
audio_max_clip_length = 29 * 1000

logger.info("Loading functions")
#################### FILE HANDLING
################################################################################
def file_path_validate_get(folder_path: str, file_name: str):
  """
  Validates that a path is available for a file. If it is not, then it will create
  the path as needed.
  folder_path: the folder directory
  file_name: the name of the file
  return: str of file path
  """
  if not os.path.exists(folder_path):
      logger.info(f"Creating directory '{folder_path}' in Google Drive")
      os.makedirs(folder_path)
  return os.path.join(folder_path, file_name)

def files_in_folder_delete(folder_path: str, file_type: str = None):
  """
  Finds all of the files, or files by type, in the specified folder and
  puts them into gDrive trash. Cannot hard delete.
  folder_path: str path of folder
  file_type: str file extension of files to delete (optional)
  return: n/a 
  """
  for file_name in os.listdir(folder_path):
    if file_type:
      if not file_name.endswith(file_type):
        continue
    file_path = os.path.join(folder_path, file_name)
    if os.path.isfile(file_path):
        os.remove(file_path)
          
def json_save_to(data: dict, file_path: str):
  with open(file_path, 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)

def json_load(file_path: str):
  """
  Check to see if the file is present. If it is, then return the file. If it 
  is not, return False.
  """
  try:
    with open(file_path, 'r', encoding='utf-8') as f:
      return json.load(f)
  except FileNotFoundError:
    return False

def file_write(directory: str, data: str):
  with open(directory, 'w', encoding='utf-8') as f:
    f.write(data)

#################### ML MODEL LOADING
################################################################################
# create the path to save the model to
path_ml_models = os.path.join(path_to_use, 'ml_models/')
_ = file_path_validate_get(path_ml_models, "")

# download the needed ML models
# Diarization is the model needed to identify speakers
# only load if we want to identify speakers
if identify_speakers:
  pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization', 
                                      use_auth_token=apikey_for_hugging_face, 
                                      #this model doesn't look to see if it is cached
                                      # cache_dir=path_ml_models 
                                      )

  # setting up whisper for use with speaker identification
  model = whisper.load_model(whisper_model, download_root=path_ml_models)

#################### PODCAST DOWNLOADING
################################################################################

def mp3_download(url: str, file_name: str, save_path: str):
  """
  From a given url, downloads the mp3 into a specified path
  """
  file_path = file_path_validate_get(save_path, file_name)
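  response = requests.get(url)
  # stop here on HTTP errors so an error page is never saved as an mp3
  response.raise_for_status()
  with open(file_path, 'wb') as f:
    f.write(response.content)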
  logger.info(f"Downloaded {file_name}")
  del response


def podcast_feed_download(rss_feed_url: str):
  """
  downloads the entire podcast's rss, performs a light refactoring
  on the file, and then stores feed for later use/reference
  """
  # get the rss_feed
  feed = feedparser.parse(rss_feed_url)
  logger.info(f"There are {len(feed.entries)} episodes in this podcast")
  data = {
      'title': feed.feed['title'].lower().replace(" ", "_"),
      'feed': feed.feed,
      'entries': []
  }
  # keep the data that we want
  for entry in feed.entries:
    temp_entry = {'title': entry['title']}

    # get the href link for the podcast audio file
    for link in entry.links:
      if link.get('type') == 'audio/mpeg':
        temp_entry['href'] = link['href']
    # skip the entries that don't have an audio href, as there is nothing to download
    if 'href' not in temp_entry:
      continue
    temp_entry['file_name'] = entry['title'].replace(" ", "_").lower()+".mp3"
    data['entries'].append(temp_entry)

  del feed
  return data
  

#################### AUDIO/MP3 MANAGEMENT
################################################################################

def match_target_amplitude(chunk, target_dBFS):
  """
  Normalize given audio chunk
  """
  return chunk.apply_gain(target_dBFS - chunk.dBFS)

def chunk_normalize(chunk):
  """
  Pads the chunk with 0.5 seconds (500 ms) of silence on each side and then
  normalizes the volume of the padded chunk.
  copied from https://stackoverflow.com/a/46001755
  """
  silence_chunk = AudioSegment.silent(duration=500)
  # add silence to the beginning and end of the chunk
  audio_chunk = silence_chunk + chunk + silence_chunk
  # normalize the entire chunk
  return match_target_amplitude(audio_chunk, -20.0)

def chunk_save(chunk, chunk_index):
  normalized_chunk = chunk_normalize(chunk)
  normalized_chunk.export(path_split_mp3s + f"chunk_{chunk_index}.mp3", format="mp3")

def mp3_split(audio, start_time, end_time):
  # Convert the start and end times from seconds to milliseconds
  start_time_ms = start_time * 1000
  end_time_ms = end_time * 1000
  chunk = audio[start_time_ms:end_time_ms]
  return chunk

def mp3_convert_to_wav(mp3_file_path: str, wav_file_path: str):
    # we need a wave file for us to determine speaker timing of each mp3 file
    # check to see if the folders exists
    sound = AudioSegment.from_mp3(mp3_file_path)
    sound.export(wav_file_path, format="wav")
    logger.info("MP3 to WAV conversion successful")
    del sound

def mp3_split_on_silence(chunk, silence_threshold=-50, min_silence_len=400, clip_duration=30*1000, chunk_index=""):
  # split up the audio into chunks wherever there is a long enough silence
  logger.debug(f"Splitting up audio into chunks, chunk_index = {chunk_index}")
  chunks = split_on_silence(chunk, min_silence_len=min_silence_len, silence_thresh=silence_threshold)

  for i, sub_chunk in enumerate(chunks):
    sub_index = chunk_index+"."+str(i)
    # if a chunk is still longer than 29 secs, split it again on shorter silences.
    # 29 secs because .5 second of padding is added to the front and back when saved
    if len(sub_chunk) > audio_max_clip_length:
      mp3_split_on_silence(sub_chunk, silence_threshold=silence_threshold, min_silence_len=min_silence_len-50, chunk_index=sub_index)
    else:
      chunk_save(sub_chunk, sub_index)
  del chunks

def split_and_save_thread(audio, start_time, end_time, index):
  duration = end_time - start_time
  chunk = mp3_split(audio, start_time, end_time)
  # if the segment is longer than audio_len_split
  if duration > audio_max_clip_length/1000:
    logger.debug(f"This segment has duration of {duration} seconds")
    mp3_split_on_silence(chunk, silence_threshold=-50, min_silence_len=500, clip_duration=30*1000, chunk_index=str(index))
  else:
    chunk_save(chunk, index)
  
  del chunk

def mp3_split_speaker_segments(audio_segments, mp3_file_path):
  num_seg = len(audio_segments)
  logger.info(f"There are {num_seg} segments to split up. This will take a few minutes")
  # Load the entire mp3 file into memory
  audio = AudioSegment.from_file(mp3_file_path, format="mp3")

  for index, segment in enumerate(audio_segments):
    # split up the mp3 into various chunks to make transcribing easier
    split_and_save_thread(audio, segment['start'], segment['stop'], index)

  del audio


#################### SPEAKER IDENTIFICATION/DIARIZATION
################################################################################

def capture_speaker_changes(diarization):
  """
  Iterates through all of the turns from the diarization and adds up the
  time that the same speaker speaks for to create one consistent clip vs smaller
  clips.
  returns: a list of dicts with keys "speaker", "start", "stop", "duration"
  """
  result = []
  current_speaker = None
  for turn, _, speaker in diarization.itertracks(yield_label=True):
      start, stop = turn.start, turn.end
      if current_speaker != speaker:
        if current_speaker is not None:
            result.append({
                "speaker": current_speaker, 
                "start": start_time, 
                "stop": stop_time, 
                "duration": stop_time-start_time,
                "transcribed_text": ""})
        current_speaker = speaker
        start_time = start
        stop_time = stop
      else:
        stop_time = max(stop_time, stop)
  if current_speaker is not None:
    result.append({
      "speaker": current_speaker, 
      "start": start_time, 
      "stop": stop_time, 
      "duration": stop_time-start_time,
      "transcribed_text": ""
      })
  # clear memory of diarization
  del diarization
  return result

def audio_segments_get(wave_file_path):
    """
    Runs the diarization pipeline on the wav file and returns the result
    """
    WAVE_FILE = {'audio': wave_file_path}
    logger.info("Identifying speakers. This will take a few minutes")
    waveform, sample_rate = Audio()(WAVE_FILE)
    # free the waveform from memory as we no longer need it
    del waveform
    return pipeline(WAVE_FILE)
    


#################### TRANSCRIPTION WITH WHISPER
################################################################################
## Whisper Transcribe

def mp3_transcribe(mp3_path: str):
  # load audio and pad/trim it to fit 30 seconds
  audio = whisper.load_audio(mp3_path)
  audio = whisper.pad_or_trim(audio)

  # make log-Mel spectrogram and move to the same device as the model
  mel = whisper.log_mel_spectrogram(audio).to(model.device)

  # decode the audio
  options = whisper.DecodingOptions(language="en", without_timestamps=True)
  result = whisper.decode(model, mel, options)
  del audio
  return result.text

def mp3_get_index_from_name(chunk_path):
  regex= r"chunk_(\d+)"
  found = re.findall(regex, chunk_path.split("/")[-1])
  return int(found[0])

def convert_to_hms(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    remaining_seconds = int(seconds % 60)
    milliseconds = int(round(seconds * 1000) % 1000)
    return f"{hours}:{minutes:02d}:{remaining_seconds:02d}.{milliseconds:03d}"

def transcript_get(audio_segments, no_speaker=False):
  transcript = ""
  if no_speaker:
      for segment in audio_segments:
        transcript += f"{segment['transcribed_text']} \n"
  else:
    for segment in audio_segments:
      t_start = convert_to_hms(segment['start'])
      t_stop = convert_to_hms(segment['stop'])
      transcript += f"""{t_start}-{t_stop}: {segment['speaker']} - {segment['transcribed_text']} \n"""
  return transcript

def mp3_chunks_transcribe(audio_segments):
  # sort numerically so the sub parts of a segment are appended in spoken
  # order, e.g. chunk_3.0.mp3 before chunk_3.1.mp3 before chunk_3.10.mp3
  chunks = sorted(
      glob.glob(path_split_mp3s + "*.mp3"),
      key=lambda p: [int(n) for n in os.path.basename(p)[len("chunk_"):-len(".mp3")].split(".")]
  )

  for index, chunk_path in enumerate(chunks):
    segment_index = mp3_get_index_from_name(chunk_path)

    # this is the same dict that is stored in audio_segments, so updating it
    # here updates the list as well
    speaker_seg = audio_segments[segment_index]

    # by splitting on periods, the len of a regular chunk is 2 and anything
    # higher will have multiple sub parts
    file_name_split = os.path.basename(chunk_path).split(".")
    # if there are multiple sub parts to one section, we need to append the 
    # scripts together to complete one chunk. 
    if len(file_name_split) > 2:
      speaker_seg['transcribed_text'] += mp3_transcribe(chunk_path) + " "
    else:
      speaker_seg['transcribed_text'] = mp3_transcribe(chunk_path)

    # log the progress every 25 chunks
    if index % 25 == 0:
      logger.info(f"Transcribed {index} out of {len(chunks)}")

  return audio_segments


#################### ML MODEL SUMMARIZATION
################################################################################
def summy_summarize(text, sentence_count=10):
  result = ""
  # Summarize using sumy's LSA (Latent Semantic Analysis) summarizer
  LANGUAGE = "english"
  SENTENCES_COUNT= sentence_count
  parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
  stemmer = Stemmer(LANGUAGE)

  summarizer = Summarizer(stemmer)
  summarizer.stop_words = get_stop_words(LANGUAGE)

  for sentence in summarizer(parser.document, SENTENCES_COUNT):
    result += str(sentence) + " "
  # free up mem
  del parser
  del stemmer
  del summarizer
  return result

def break_up_text(tokens, chunk_size, overlap_size):
  if len(tokens) <= chunk_size:
    yield tokens
  else:
    chunk = tokens[:chunk_size]
    yield chunk
    yield from break_up_text(tokens[chunk_size-overlap_size:], chunk_size, overlap_size)

def break_up_transcript_to_chunks(text, chunk_size=2000, overlap_size=100):
  tokens = word_tokenize(text)
  return list(break_up_text(tokens, chunk_size, overlap_size))

def convert_to_detokenized_text(tokenized_text):
  prompt_text = " ".join(tokenized_text)
  prompt_text = prompt_text.replace(" 's", "'s")
  return prompt_text

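# handle the rate limiting of API calls. Only defined for GPT3, as backoff and
# openai are not installed or imported when Sumy is selected
if summarization_tool.startswith("GPT3"):
  @backoff.on_exception(backoff.expo, openai.error.RateLimitError)
  def completions_with_backoff(**kwargs):
    return openai.Completion.create(**kwargs)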


def openai_make_summarization(text):
  # this was inspired by https://sungkim11.medium.com/how-to-get-around-openai-gpt-3-token-limits-b11583691b32
  # split text up by max tokens
  chunks = break_up_transcript_to_chunks(text)
  de_tokenized_chunks = [{"index": i, "summary":"", "chunk": convert_to_detokenized_text(chunk)} for i, chunk in enumerate(chunks)]
  logger.info(f"Breaking up the transcript into {len(de_tokenized_chunks)} chunks for summarizing")
  summary = ""
  
  # check to see if a partial summary is present. If it is, load it.
  # The OpenAI API is not always reliable and is sometimes overloaded, so the
  # chunks that were already summarized are saved and reused
  partial_summary = json_load(file_path_temp_summary)
  # if there was not a partial temp summary
  if partial_summary == False:
    partial_summary = []
  else:
    logger.info("A partial summary was found. Picking up where we left off")
    # restore the summaries that were already made so they are not redone
    for done in partial_summary:
      de_tokenized_chunks[done['index']]['summary'] = done['summary']

  for i, chunk in enumerate(de_tokenized_chunks):
    # if there is not a summary, process it
    if chunk['summary'] == "":
      try:
        logger.info("Making summarizing OpenAI API call")
        res = completions_with_backoff(
            model="text-davinci-003",
            prompt=chunk['chunk'] + "\n\ntl;dr",
            temperature= 0.7,
            max_tokens= 1700,  
            top_p= 1, 
            frequency_penalty= 0.0, 
            presence_penalty= 1
          )
        chunk['summary'] = res["choices"][0]["text"].strip() + "\n"
        partial_summary.append({"index": i, "summary": chunk['summary']})

      except Exception as e:
        logger.error(f"While summarizing, ran into the following \n {e}.")
        logger.info("No worries. Storing the progress made. Try again in a few hours.")
        logger.info("Check https://status.openai.com/ for more information")
        logger.info("This code will pick up where it left off to reduce cost "
                    "by not resummarizing what's already been done")
        json_save_to(partial_summary, file_path_temp_summary)
        logger.info("Shutting runtime off")
        runtime.unassign()
        # stop here so an incomplete summary is never saved as the final one
        raise
    # aggregate the summaries into one
    summary += chunk['summary']
    
  return summary
  

def summarize(transcript):
  if summarization_tool.startswith("GPT3"):
    return openai_make_summarization(transcript)
  else:
    return summy_summarize(transcript, sentence_count=10)


def transcript_wadsworth_constant(audio_segments):
  # cut out the ads, intros, and outros for the summary
  no_speaker = transcript_get(audio_segments, no_speaker=True)
  lines = no_speaker.split("\n")
  # how many lines to skip at the front and at the back
  num_skip = math.ceil(len(lines) * (percent_sentences_to_skip/100))
  # slice with an explicit end index, as lines[0:-0] would return an empty
  # list when nothing is skipped
  return "".join(line + "\n" for line in lines[num_skip:len(lines)-num_skip])


################################################################################
########################## MAIN ################################################
################################################################################

start = time.time()

## pull down the podcast
logger.info("Grabbing the latest podcast xml feed")
feed_data = podcast_feed_download(podcast_xml)
tokens = 0

#################### PATH CREATION/VALIDATION
################################################################################
path_working_base_dir = os.path.join(path_to_use, feed_data['title'])
path_full_mp3 = os.path.join(path_working_base_dir, "full_mp3/")
path_full_wave = os.path.join(path_working_base_dir, "full_wav/")
path_split_mp3s = os.path.join(path_full_mp3, "split_mp3s/")
path_completed_transcripts = os.path.join(path_working_base_dir, "transcripts/")

logger.info("Making sure that the needed directories that we need exist and/or creating as needed")
_ = file_path_validate_get(path_to_use, "")
_ = file_path_validate_get(path_full_mp3, "")
_ = file_path_validate_get(path_full_wave, "")
_ = file_path_validate_get(path_split_mp3s, "")
_ = file_path_validate_get(path_completed_transcripts, "")

## save the podcast feed
feed_json_path = path_working_base_dir + f"/{feed_data['title']}_podcast_feed.json"
json_save_to(feed_data, feed_json_path)
logger.info(f"Saving podcast episodes for tracking at {feed_json_path}")


index = 0
entry_duration = []

if num_of_episodes==0:
  logger.info("You have selected all episodes in the podcast. Know that this will probably take multiple sessions to complete.")
  num_of_episodes = len(feed_data['entries'])
else:
  logger.info(f"You have selected to download {num_of_episodes} episodes")
# allow for downloading a specific number of files
for entry in feed_data['entries'][:num_of_episodes]:
  entry_start=time.time()

  logger.info(f"Starting on {entry['title']}")
  mp3_file_path = path_full_mp3 + entry['file_name']
  file_name_no_extension = entry['file_name'][:-4]
  file_path_wave = path_full_wave + file_name_no_extension + ".wav"
  file_path_audio_segments = path_full_mp3 + file_name_no_extension + "_audio_segments.json"
  file_path_transcript_path = path_completed_transcripts + f"{file_name_no_extension}-speaker_transcript.txt"
  file_path_summary = path_completed_transcripts + f"{file_name_no_extension}-summary.txt"
  file_path_temp_summary = path_full_mp3 + f"{file_name_no_extension}-temp_summary.txt"
  
  ## DOWNLOAD MP3
  # check to see if the mp3 already exists
  if os.path.isfile(mp3_file_path) == False:
    logger.info(f"Downloading {entry['title']}")
    mp3_download(url=entry['href'], file_name=entry['file_name'], save_path=path_full_mp3)
    

  ### DIARIZATION
  if identify_speakers == True:
    # delete the split mp3s
    logger.info("Clearing the working cache")
    files_in_folder_delete(path_split_mp3s)
    
    start_diarization = time.time()

    #check to see if the segment file already exists
    if os.path.isfile(file_path_audio_segments) == False:
      # convert to wave file as diarization requires wav
      logger.info("Converting the mp3 to wav for speaker identification")
      mp3_convert_to_wav(mp3_file_path=mp3_file_path, wav_file_path=file_path_wave)
      # get the diarization and then get the output in the format we need it in
      audio_segments = capture_speaker_changes(audio_segments_get(file_path_wave))
      # save the speaker segments
      logger.info("Saving the speaker segments")
      json_save_to(audio_segments, file_path_audio_segments)
      logger.info("Deleting the wav file")
      files_in_folder_delete(path_full_wave)
      logger.info(f"Speaker Identification has completed in {time.time()-start_diarization} seconds")
    # otherwise load the previously saved speaker segments
    else:
      logger.info("Found existing speaker segmentation file. Loading...")
      audio_segments = json_load(file_path_audio_segments)

    ##### MP3 SPLIT UP & WHISPER Transcribe

    if not os.path.isfile(file_path_transcript_path):
      start_whisper = time.time()
      # split the mp3 into smaller chunks: whisper needs less RAM that way
      # and prefers chunks of 30 seconds or less
      logger.info("Splitting the MP3 based on speakers")
      start_split = time.time()
      mp3_split_speaker_segments(audio_segments, mp3_file_path)
      logger.info(f"Splitting MP3 into chunks has completed in {time.time()-start_split} seconds")
      audio_segments = mp3_chunks_transcribe(audio_segments)
      logger.info(f"Transcription has completed in {time.time()-start_whisper} seconds")
      # save the audio_segments
      json_save_to(audio_segments, file_path_audio_segments)
      # save the transcript for this episode
      file_write(file_path_transcript_path, transcript_get(audio_segments))
      
      
    ### SUMMARIZE
    if not os.path.isfile(file_path_summary) and summarize_episode:
      summary = summarize(transcript_wadsworth_constant(audio_segments))
      file_write(file_path_summary, summary)
      
  
  # no speaker identification: transcribe the whole file in one pass
  else:
    # transcribe the file if the file doesn't already exist
    if not os.path.isfile(file_path_transcript_path):
      logger.info("Starting transcription. 1hr episode takes around 20-35 min.")
      start_whisper = time.time()
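      # pass the arguments as a list so file names with spaces or shell
      # metacharacters are not misread by a shell
      return_code = subprocess.call([
          "whisper", mp3_file_path, "--language", "en", "--model", whisper_model,
          "--output_format", "json", "--model_dir", path_ml_models,
          "--output_dir", path_split_mp3s, "--verbose", "True"])
      if return_code != 0:
        logger.error(f"whisper exited with code {return_code}")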
      logger.info(f"Transcription has completed in {time.time()-start_whisper} seconds")

      logger.info("Converting the output to template")
      # calling CLI whisper, it generates a json file that we will convert
      # to the template
      temp_json = glob.glob(path_split_mp3s + "*.json")[0]
      data = json_load(temp_json)
      transcript = data['text']
      audio_segments = []
      for row in data['segments']:
        audio_segments.append({ 
            "speaker": "SPEAKER",
            "start": row['start'], 
            "stop": row['end'], 
            "duration": row['end']-row['start'],
            "transcribed_text": row["text"]
          })
      json_save_to(audio_segments, file_path_audio_segments)
      os.remove(temp_json)
      # convert audio_segments to the transcript and save it
      file_write(file_path_transcript_path, transcript_get(audio_segments))
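      logger.info(f"You can find this file at {file_path_transcript_path}")
    else:
      # the transcript already exists, so reload its segments for the summary
      audio_segments = json_load(file_path_audio_segments)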
    
    if not os.path.isfile(file_path_summary) and summarize_episode:
      summary = summarize(transcript_wadsworth_constant(audio_segments))
      logger.info(f"Saving the summary at {file_path_summary}")
      file_write(file_path_summary, summary)


  dur = convert_to_hms(time.time() - entry_start)
  entry_duration.append(dur)
  logger.info(f"This episode took {dur}")

  # update the podcast entries
  feed_data['entries'][index] = entry
  # update the file
  json_save_to(feed_data, feed_json_path)
  index += 1

logger.info(f"Completed the summarization of {num_of_episodes} episodes of the podcast")
display(str(time.time()-start))
if delete_mp3s_when_done:
  logger.info("Deleting podcast mp3s to conserve space")
  files_in_folder_delete(path_full_mp3, file_type='mp3')
# kill this session
if auto_shutoff:
  logger.info("Shutting runtime off")
  runtime.unassign()
