# Convert Substack Posts into Markdown Files

## Functionality
This Jupyter Notebook takes a Substack export (in the format as of June 2023) and converts the HTML posts into Markdown files.

## Use Case
I want to build an archive of my posts in Gatsby, which will host my personal website. To do that, I need each post as a Markdown file.

## Folder Structure

```
/root
├── converter.ipynb
├── posts.csv         # From the Substack export
├── converted-posts/  # CREATED AUTOMATICALLY, path set in OUTPUT_FOLDER
├── img/              # CREATED AUTOMATICALLY
├── posts/            # From the Substack export, path set in INPUT_FOLDER
│   ├── post-1.html
│   ├── post-1.delivers.csv
│   ├── post-1.opens.csv
│   ├── ... (other posts)
```
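
Before running the notebook, it can help to confirm that the two inputs from the tree above are in place. The following is a minimal sketch; `missing_inputs` is a hypothetical helper that is not part of the notebook, and it assumes the notebook is run from the root folder.

```python
from pathlib import Path


# Hypothetical helper (not part of the notebook): report which of the
# required inputs from the folder tree are missing under `root`.
def missing_inputs(root="."):
    root = Path(root)
    missing = []
    if not (root / "posts.csv").is_file():
        missing.append("posts.csv")
    if not (root / "posts").is_dir():
        missing.append("posts/")
    return missing


if __name__ == "__main__":
    problems = missing_inputs(".")
    if problems:
        print("Missing from the export:", ", ".join(problems))
    else:
        print("Layout looks ready for conversion.")
```

The output folders (`converted-posts/` and `img/`) are not checked because the notebook creates them.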

## Caveats
- Titles are not escaped, so special characters such as `:` will break the YAML frontmatter.
- You'll have to resolve links to Substack posts and embeds manually.
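
One possible way around the first caveat is to wrap each frontmatter value in double quotes. The sketch below is not part of the notebook: `yaml_quote` is a hypothetical helper, and it relies on the assumption that a JSON string literal is also a valid YAML double-quoted scalar, which holds for ordinary titles and subtitles. The example title is made up.

```python
import json


# Hypothetical helper: quote a value so it is safe inside YAML frontmatter.
# json.dumps emits a double-quoted string with escapes that YAML also
# understands, so characters such as ':' or '#' lose their special meaning.
def yaml_quote(value):
    # pandas represents an empty CSV cell as NaN (a float), so any
    # non-string value is written as an empty string here.
    text = value if isinstance(value, str) else ""
    return json.dumps(text, ensure_ascii=False)


# Made-up title containing a colon
print(f"title: {yaml_quote('Example: a title with a colon')}")
```

In the conversion step, the frontmatter lines would then use `yaml_quote(post.title)` and `yaml_quote(post.subtitle)` in place of the raw values.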

# 1. Install required libraries

In [8]:
!pip install markdownify pandas bs4

DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1272 sha256=7529a1e6271c2d1ae5274779345317a012187093907534a239d8ba0d0c5b7f05
  Stored in directory: /Users/kyurikotpq/Library/Caches/pip/wheels/73/2b/cb/099980278a0c9a3e57ff1a89875ec07bfa0b6fcbebb9a8cad3
Successfully built bs4
Installing collected packages: bs4

# 2. Read Posts from CSV File

In [2]:
import pandas as pd

df = pd.read_csv("posts.csv")
df.head()

Unnamed: 0,post_id,post_date,is_published,email_sent_at,inbox_sent_at,type,audience,title,subtitle,podcast_url
0,125176456.can-materialism-be-productive,,False,,,newsletter,everyone,,,
1,125068899.reflecting-on-my-law-of-attraction,,False,,,newsletter,everyone,,,
2,124355314.why-celebrate-private-wins,2023-05-30T05:00:37.088Z,True,2023-05-30T05:00:37.156Z,2023-05-30T05:00:37.156Z,newsletter,everyone,Why you should celebrate your private wins,"Private wins are the bedrock of everything, ye...",
3,123724680.how-to-stop-feeling-left-behind-in,,False,,,newsletter,everyone,,,
4,122686529.my-mid-year-review-ritual,,False,,,newsletter,everyone,,,
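
In the preview above, each `post_id` appears to combine a numeric ID and a URL slug, separated by a dot. The conversion step later uses the full `post_id` for the input file name and only the slug for the output file name. A minimal sketch of that split follows; `split_post_id` is a hypothetical helper, not something the notebook defines.

```python
# Hypothetical helper: split a Substack post_id such as
# "124355314.why-celebrate-private-wins" into its numeric ID and slug.
def split_post_id(post_id):
    # partition splits on the first dot only, so a slug that happened to
    # contain a dot would be kept whole.
    numeric_id, _, slug = post_id.partition(".")
    return numeric_id, slug


numeric_id, slug = split_post_id("124355314.why-celebrate-private-wins")
print(numeric_id)  # 124355314
print(slug)        # why-celebrate-private-wins
```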


# 3. Get only published posts (optional)

In [3]:
published_df = df[df['is_published']]
published_df.head()

Unnamed: 0,post_id,post_date,is_published,email_sent_at,inbox_sent_at,type,audience,title,subtitle,podcast_url
2,124355314.why-celebrate-private-wins,2023-05-30T05:00:37.088Z,True,2023-05-30T05:00:37.156Z,2023-05-30T05:00:37.156Z,newsletter,everyone,Why you should celebrate your private wins,"Private wins are the bedrock of everything, ye...",
7,122077161.3-things-to-introspect-for-accelerated,2023-05-23T05:00:13.808Z,True,2023-05-23T05:00:13.868Z,2023-05-23T05:00:13.868Z,newsletter,everyone,3 data points to collect for faster goal achie...,Your efforts will always pay off. Introspectio...,
8,120361656.a-simple-system-that-made-my-life,2023-05-16T05:01:08.829Z,True,2023-05-16T05:01:08.911Z,2023-05-16T05:01:08.911Z,newsletter,everyone,A simple system that made my life transitions ...,These 3 things helped me settle down quickly a...,
11,117531865.the-science-of-deliberate-experiment...,2023-04-27T03:33:27.744Z,True,2023-04-27T03:33:27.870Z,2023-04-27T03:33:27.870Z,newsletter,everyone,The Science of Deliberate Experimentation,How I explore new interests to figure out what...,
13,115219323.what-is-autopilot-mode-and-how-you,2023-04-18T08:39:15.288Z,True,2023-04-18T08:39:15.359Z,2023-04-18T08:39:15.359Z,newsletter,everyone,What is Autopilot Mode? (and how you can harne...,A counterintuitive but effective approach to i...,
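
For reference, the same filter can be sketched without pandas using the standard `csv` module. This is only an illustration: the two sample rows are taken from the preview above and reduced to two columns, and how the raw file spells the boolean (`true`, `True`, ...) is an assumption, so the comparison ignores case.

```python
import csv
import io

# Two rows from the preview above, reduced to the columns used here.
# The lowercase spelling of the boolean is an assumption about the raw file.
SAMPLE = """post_id,is_published
125176456.can-materialism-be-productive,false
124355314.why-celebrate-private-wins,true
"""


def published_rows(csv_file):
    """Return the rows whose is_published cell reads 'true' (any case)."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["is_published"].strip().lower() == "true"]


rows = published_rows(io.StringIO(SAMPLE))
print([row["post_id"] for row in rows])
```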


# 4. Convert HTML files to Markdown

In [21]:
import os
import requests
from bs4 import BeautifulSoup

# Helper function to download the images referenced in an HTML string
# ATTRIBUTION: This code was adapted from ChatGPT's answer
def download_images_from_html(html_content, output_directory):
    # Create the output directory if it doesn't exist
    os.makedirs(output_directory, exist_ok=True)

    # Parse the HTML using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all the image tags in the HTML
    image_tags = soup.find_all('img')

    # Download each image
    for img_tag in image_tags:
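        # Get the source URL of the image (None if the tag has no src attribute)
        image_url = img_tag.get('src')

        # Ignore missing or relative URLs
        if not image_url or not image_url.startswith('http'):
            continue

        # Download the image (the timeout keeps a stalled request from hanging the loop)
        response = requests.get(image_url, timeout=30)
        if response.status_code == 200:
            # Extract the image file name from the URL, dropping any query string
            image_file_name = os.path.basename(image_url.split('?')[0])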

            # Save the image to the output directory
            output_path = os.path.join(output_directory, image_file_name)
            with open(output_path, 'wb') as image_file:
                image_file.write(response.content)
                print(f"Downloaded image: {image_file_name}")

            # Replace with relative file path
            html_content = html_content.replace(image_url, output_path)
        else:
            print(f"Failed to download image: {image_url}")

    return html_content
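
File names derived from a URL tend to be more robust when they come from the parsed path rather than the raw string, because the parsed path excludes any query string or fragment and can be percent-decoded. The sketch below shows one way to do this with the standard library; `image_filename_from_url` is a hypothetical helper and the URLs are placeholders.

```python
import posixpath
from urllib.parse import unquote, urlparse


# Hypothetical helper: derive a local file name from an image URL.
def image_filename_from_url(image_url):
    # urlparse separates the path from the query string and fragment
    path = urlparse(image_url).path
    # Decode percent-escapes (e.g. %20) and keep only the last path component
    return posixpath.basename(unquote(path))


print(image_filename_from_url("https://example.com/public/images/photo_960x1280.jpeg?width=800"))
```

An empty result (for example, when the URL ends in `/`) would still need to be handled by the caller.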

In [22]:
import markdownify


# Helper function to read an HTML file,
# convert it to Markdown, and add metadata
def html_to_md(input_file_path, output_file_path, post):
    # Read the content of the input file
    # (explicit UTF-8 avoids depending on the platform's default encoding)
    with open(input_file_path, 'r', encoding='utf-8') as file:
        html = file.read()

    # Add metadata
    # NOTE: This doesn't automatically
    # escape special chars like #
    metadata = f"---\n" + \
    f"templateKey: blog-post\n" + \
    f"title: {post.title}\n" + \
    f"date: {post.post_date}\n" + \
    f"featuredpost: false\n" + \
    f"featuredimage: \n" + \
    f"description: {post.subtitle}\n" + \
    f"---\n"


    # Save images and convert them to relative paths
    html = download_images_from_html(html, "img")

    # Convert to Markdown
    md_version = markdownify.markdownify(html)

    # Write the content to the output file
    with open(output_file_path, 'w', encoding='utf-8') as file:
        file.write(metadata + md_version)
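
Links that still point to Substack have to be resolved by hand, but they can at least be listed once a post has been converted. The following is a rough sketch, not part of the notebook: `find_substack_links` is a hypothetical helper, it only recognises inline Markdown links of the form `[text](url)`, and it assumes the publication lives on a `substack.com` host (a custom domain would need a different check).

```python
import re
from urllib.parse import urlparse

# Matches inline Markdown links: [text](url)
MD_LINK = re.compile(r"\[([^\]]*)\]\((https?://[^)\s]+)\)")


# Hypothetical helper: return (text, url) pairs whose host is on substack.com
def find_substack_links(markdown_text):
    found = []
    for text, url in MD_LINK.findall(markdown_text):
        host = urlparse(url).hostname or ""
        if host == "substack.com" or host.endswith(".substack.com"):
            found.append((text, url))
    return found


# Placeholder text with one Substack link and one unrelated link
sample = "See [an older post](https://example.substack.com/p/some-post) and [a site](https://example.com/)."
print(find_substack_links(sample))
```

Running this over each file in the output folder would give a checklist of links to update after the posts are published elsewhere.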


In [24]:
import os

INPUT_FOLDER = "./posts/"
OUTPUT_FOLDER = "./converted-posts/"

# Create the output directory if it doesn't exist
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

for index in published_df.index:
    post = published_df.loc[index]
    input_file_path = INPUT_FOLDER + post.post_id + ".html"
    output_file_path = OUTPUT_FOLDER + post.post_id.split(".")[1] + ".md"
    html_to_md(input_file_path, output_file_path, post)
        

Downloaded image: 737cfde0-0740-4cbe-9a55-eb3c9b41e8cf_960x1280.jpeg
Downloaded image: 5144165a-c581-4433-bda8-951a364728bd_1274x800.jpeg
Downloaded image: da6eb6e2-a4da-459d-986f-76582c17b767_710x710.jpeg
Downloaded image: 3429fed1-2829-4d55-8418-a21c42c464de_800x560.jpeg
Downloaded image: 83efe949-8787-4f8b-8dbc-499b869bbf42_702x395.jpeg
Downloaded image: 5d07880c-356a-472d-aafd-f5a090cf23b2_1540x1544.png
Downloaded image: dae8084c-a9ef-4146-b6a7-6a9770b232c4_960x1280.jpeg
Downloaded image: 836c33b5-41b9-4823-8c3d-3386c04f2b72_1080x1080.jpeg
Downloaded image: 3cd296ff-7755-484b-a7bd-d1e2d8f26a2f_718x835.jpeg
Downloaded image: 44478dae-50c7-49aa-8528-fe90cdde8256_470x562.png
Downloaded image: a3e6a028-415f-4381-83b2-f2b0b620f460_1746x910.jpeg
Downloaded image: c0bdf5a4-3556-4681-a3bf-970a68ca7b49_1645x855.jpeg
Downloaded image: 34058967-75e1-4a1d-90e0-3a32a6c67b4d_1748x911.jpeg
Downloaded image: 1d61d5ff-2902-4b4a-ad6e-3f4409b55994_1711x892.jpeg
Downloaded image: 16c32f02-1988-4696-87