# Download Podcast Episodes

The first step in summarizing podcast episodes is to download them. This notebook uses no generative AI, only plain Python.

In [None]:
!pip install feedparser
!pip install pytz
!pip install requests

## Define the CONFIG

This config dictionary identifies the podcast to be transcribed. It also has a few fields you can tweak for your use case.

I tried to abstract some things to make it easy to adapt this project to your podcast of choice, but in the end this is an example project for learning. Everything here can be improved upon!

In [10]:
CONFIG = {
    "podcast": {
        "rss_url": "https://anchor.fm/s/74aab30/podcast/rss",
        "summary_regex": r"<p>(?P<speaker>[\w\s]+)\s-\s(?P<reference>.*)<\/p>",
    },
    "output_dir": "media",
}
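
To see what `summary_regex` captures, it can be applied to a single summary string. The sketch below is a standalone check, assuming summaries of the form `<p>Speaker - Reference</p>`; the names `pattern` and `sample_summary` are only for illustration and are not used elsewhere in the notebook.

```python
import re

# Same pattern as CONFIG["podcast"]["summary_regex"]
pattern = re.compile(r"<p>(?P<speaker>[\w\s]+)\s-\s(?P<reference>.*)<\/p>")

# Hypothetical stand-in for one entry's summary field
sample_summary = "<p>Chris Price - Galatians 6:1-5</p>"

# match() returns None when the summary does not fit the pattern
match = pattern.match(sample_summary)
parsed = match.groupdict() if match else {}
print(parsed)
# {'speaker': 'Chris Price', 'reference': 'Galatians 6:1-5'}
```

If your podcast formats its summaries differently, adjust the named groups. Each group name becomes a key in the distilled entry.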

## Get the RSS feed for the podcast

Show the structure of one item in the feed.

In [11]:
import feedparser

feed = feedparser.parse(CONFIG["podcast"]["rss_url"])
feed["entries"][0]

{'title': 'Bearing Burdens',
 'title_detail': {'type': 'text/plain',
  'language': None,
  'base': 'https://anchor.fm/s/74aab30/podcast/rss',
  'value': 'Bearing Burdens'},
 'summary': '<p>Chris Price - Galatians 6:1-5</p>',
 'summary_detail': {'type': 'text/html',
  'language': None,
  'base': 'https://anchor.fm/s/74aab30/podcast/rss',
  'value': '<p>Chris Price - Galatians 6:1-5</p>'},
 'links': [{'rel': 'alternate',
   'type': 'text/html',
   'href': 'https://podcasters.spotify.com/pod/show/providence-fellowship/episodes/Bearing-Burdens-e2eogit'},
  {'length': '35724141',
   'type': 'audio/mpeg',
   'href': 'https://anchor.fm/s/74aab30/podcast/play/81592349/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2024-0-21%2F34f870d4-0747-b973-19b3-67401ecc891b.mp3',
   'rel': 'enclosure'}],
 'link': 'https://podcasters.spotify.com/pod/show/providence-fellowship/episodes/Bearing-Burdens-e2eogit',
 'id': '62e6dc62-60d8-4f54-821c-bad82dc74ee2',
 'guidislink': False,
 'authors': [{'name

### Extract desired info from feed

There's a lot of data in the feed. Let's distill it down to only the fields we need.

In [None]:
import re

remove_html_tags_regex = re.compile('<.*?>')
summary_regex = re.compile(CONFIG["podcast"]["summary_regex"])

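# Characters stripped from titles before they are used in file names.
# Several of them (such as ? : \ / " | * < >) are not allowed in Windows filenames.
bad_file_symbols = r"[.?!:%$#}{\\/'\"+`|=*<>]"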


distilled = []
for e in feed["entries"]:
    clean_title = re.sub(bad_file_symbols, "", e["title"])
    parsed_summary_info = dict()
    # match() returns None when the summary does not fit the custom regex
    match = summary_regex.match(e["summary"] or "")
    if match:
        parsed_summary_info = match.groupdict()
    else:
        if e["summary"] is not None:
            # Fall back to using the tag-stripped summary as the speaker
            parsed_summary_info["speaker"] = remove_html_tags_regex.sub("", e["summary"])
        print("Could not parse summary:", e["summary"], "with custom regex:", CONFIG["podcast"]["summary_regex"], "Falling back to the tag-stripped summary as the speaker.")
    distilled.append(dict(
        id=e["id"],
        title=e["title"],
        title_clean=clean_title,
        summary=e["summary"],
        audio=[link for link in e["links"] if link["type"].startswith("audio")][0],
        published=e["published"],
        **parsed_summary_info,
    ))

distilled[0]
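
The two less obvious steps in the loop are the title cleanup and the audio link selection. The sketch below runs both on a single hand-written entry, so it works without fetching the feed. `sample_entry` and its URLs are made up for illustration, and `next(...)` with a default is swapped in for the `[0]` index so that a missing audio link yields `None` instead of an `IndexError`.

```python
import re

# Hypothetical stand-in for one feedparser entry, trimmed to the fields used here
sample_entry = {
    "title": "Who Is Jesus? (Part 1/2)",
    "links": [
        {"rel": "alternate", "type": "text/html", "href": "https://example.com/episode"},
        {"rel": "enclosure", "type": "audio/mpeg", "href": "https://example.com/episode.mp3"},
    ],
}

# Character class like the one used in the loop above
bad_file_symbols = r"[.?!:%$#}{\\/'\"+`|=]"

# Strip characters that are unsafe in file names
title_clean = re.sub(bad_file_symbols, "", sample_entry["title"])

# Pick the first link whose MIME type starts with "audio", or None if there is none
audio = next(
    (link for link in sample_entry["links"] if link.get("type", "").startswith("audio")),
    None,
)

print(title_clean)    # Who Is Jesus (Part 12)
print(audio["href"])  # https://example.com/episode.mp3
```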

## Download the podcast audio files

You could edit this code to download only a subset of the audio files. For example, the following edit would download only the first 10 episodes in the feed:

```python
for entry in distilled[0:10]:
    ...
```
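
Another way to narrow the download is to filter by publication date rather than by position. This is a minimal sketch, assuming the `published` field holds an RFC 2822 timestamp such as `'Mon, 17 Dec 2018 18:05:01 GMT'`. It uses the standard library's `email.utils.parsedate_to_datetime` rather than the notebook's own parsing, and `published_since`, `sample_entries`, and `cutoff` are hypothetical names with made-up data.

```python
import datetime as dt
from email.utils import parsedate_to_datetime

# Hypothetical stand-ins for distilled entries; only "published" matters here
sample_entries = [
    {"title": "Episode A", "published": "Sun, 21 Jan 2024 18:00:00 GMT"},
    {"title": "Episode B", "published": "Mon, 17 Dec 2018 18:05:01 GMT"},
]


def published_since(entries, cutoff):
    """Return the entries whose `published` timestamp is at or after `cutoff`."""
    return [e for e in entries if parsedate_to_datetime(e["published"]) >= cutoff]


# The cutoff must be timezone-aware so it can be compared with the parsed timestamps
cutoff = dt.datetime(2024, 1, 1, tzinfo=dt.timezone.utc)
recent = published_since(sample_entries, cutoff)
print([e["title"] for e in recent])  # ['Episode A']
```

The download loop could then iterate over the filtered list instead of `distilled`.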

> 📃 Important! This code assumes the [ffmpeg tool](https://ffmpeg.org/download.html) is available to convert non-mp3 audio files to mp3 format. Visit the link and install the tool if you want to run this on your machine.
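
Before starting a long download run, it can help to confirm that ffmpeg is actually reachable. The sketch below uses `shutil.which` from the standard library to look for an `ffmpeg` executable on the PATH; `ffmpeg_available` is a hypothetical helper name, not part of the notebook.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg was not found on the PATH; non-mp3 episodes cannot be converted.")
```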

In [None]:
import subprocess
import mimetypes
import json
import requests
from datetime import datetime
from pytz import timezone
from pathlib import Path

output_dir = Path(CONFIG["output_dir"])


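def download_audio(filepath: Path, audio: dict[str, str]) -> Path:
    """Download the audio described by `audio` and save it as an mp3 at `filepath`."""
    # Determine the file extension from the MIME type
    if audio["type"] == "audio/mpeg":
        # Set explicitly: guess_extension may return ".mp2" on some Python versions
        extension = ".mp3"
    elif audio["type"] == "audio/x-m4a":
        extension = ".m4a"
    else:
        extension = mimetypes.guess_extension(audio["type"])
        if not extension:
            raise ValueError(f"Unsupported audio type: {audio['type']}")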

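    # Download the audio file (the timeout keeps a stalled connection from hanging forever)
    response = requests.get(audio["href"], timeout=60)
    if response.status_code != 200:
        raise Exception(f"Failed to download the file (HTTP {response.status_code})")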

    # Save the audio file
    downloaded_filepath = filepath.with_suffix(extension)
    with open(downloaded_filepath, 'wb') as audio_file:
        audio_file.write(response.content)

    # Check if conversion to mp3 is needed
    if audio["type"] != "audio/mpeg":
        # Convert the file to mp3 using ffmpeg
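        try:
            subprocess.run(["ffmpeg", "-i", str(downloaded_filepath), str(filepath)], check=True)
            downloaded_filepath.unlink()
        except subprocess.CalledProcessError as err:
            # Chain the original error so the ffmpeg failure stays visible in the traceback
            raise Exception("Failed to convert the file to mp3") from err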

    return filepath

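def parse_published_ts(ts: str) -> datetime:
    # Example input: 'Mon, 17 Dec 2018 18:05:01 GMT'
    utc = timezone('UTC')
    central = timezone('US/Central')
    # strptime returns a naive datetime here; the timestamp is assumed to be in GMT
    published_time = datetime.strptime(ts, '%a, %d %b %Y %H:%M:%S %Z')
    published_gmt = published_time.replace(tzinfo=utc)
    # Convert to US Central time (CST or CDT, depending on the date)
    published_central = published_gmt.astimezone(central)
    return published_central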

filename_template = "{date} {title}"

for entry in distilled:
    published = parse_published_ts(entry["published"])
    name = filename_template.format(
        date=published.strftime("%Y-%m-%d"),
        title=entry["title_clean"],
    )
    metadata_path = output_dir / name / "metadata.json"
    print(f"   - Metadata location: {metadata_path}")
    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    with open(metadata_path, "w") as metadata_file:
        json.dump(entry, metadata_file)
    audio_path = output_dir / name / f"{name}.mp3"
    if audio_path.exists():
        continue
    print(f"Downloading {name}")
    download_audio(audio_path, entry["audio"])
    print(f"   - Audio location: {audio_path}")