# Clean Up Podcast Transcripts

Sometimes, when the speaker in the podcast is talking quickly, the base Whisper model transcribes a very long run-on sentence. Such a sentence is difficult to chunk and send to an LLM, since my usual method splits on the period character.
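
To illustrate why run-on sentences are a problem, here is a minimal sketch of period-based chunking. It is not the exact code used elsewhere in this project: `chunk_by_sentence` and `max_chars` are hypothetical names, and the 200-character limit is an arbitrary assumption.

```python
def chunk_by_sentence(text: str, max_chars: int = 200) -> list[str]:
    """Group period-delimited sentences into chunks of roughly max_chars.

    Hypothetical helper: a single sentence longer than max_chars is kept
    whole, which is exactly what happens with an unpunctuated run-on.
    """
    # Split on the period character and restore the period on each sentence.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit, so close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Well-punctuated text splits into several small chunks.
short_chunks = chunk_by_sentence("A b. C d. E f.", max_chars=10)

# A run-on with no periods cannot be split and yields one oversized chunk.
run_on_chunks = chunk_by_sentence("word " * 100, max_chars=200)
```

With no periods to split on, the run-on text comes back as a single chunk well over the limit, which is why the punctuation needs to be restored first.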

I use [fullstop-punctuation-multilang-large](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large) from Hugging Face to clean up the punctuation in the podcast transcript. It is not perfect, but using an LLM for this task felt like over-engineering.

In [None]:
!pip install deepmultilingualpunctuation

In [1]:
CONFIG = {
    "podcast": {
        "rss_url": "https://anchor.fm/s/74aab30/podcast/rss",
        "summary_regex": r"<p>(?P<speaker>[\w\s]+)\s-\s(?P<reference>.*)<\/p>",
    },
    "output_dir": "media",
}

In [None]:
from pathlib import Path

from deepmultilingualpunctuation import PunctuationModel

# Load the punctuation restoration model once and reuse it for every transcript.
model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilang-large")

# Write a punctuated copy next to each transcript found under the output directory.
for transcript_file in Path(CONFIG["output_dir"]).glob("**/transcript.txt"):
    transcript = transcript_file.read_text()
    with_punctuation = model.restore_punctuation(transcript)
    transcript_with_punctuation_file = transcript_file.parent / "transcript_with_punctuation.txt"
    transcript_with_punctuation_file.write_text(with_punctuation)
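
Since the model is not perfect, one rough way to check whether the cleanup helped is to compare the longest period-delimited sentence before and after. This is only a sketch: `longest_sentence_length` is a hypothetical helper, and the sample strings stand in for real transcript text.

```python
def longest_sentence_length(text: str) -> int:
    """Return the character length of the longest period-delimited sentence."""
    return max(len(sentence.strip()) for sentence in text.split("."))


# Stand-in strings; in practice these would be the contents of
# transcript.txt and transcript_with_punctuation.txt.
before = "one two three"
after = "one. two three."

before_length = longest_sentence_length(before)
after_length = longest_sentence_length(after)
```

If the longest sentence in the punctuated file is still very long, that transcript probably needs a manual look before chunking.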