In [1]:
import httpx
from fastcore.utils import first, last, L
import html2text

# Medium To Markdown

I want to migrate my Medium posts to Markdown. The plan is to extract the content of each post listed on my Medium profile and then convert that content to Markdown.

In [2]:
url = "https://medium.com/@pol.avec"
r = httpx.get(url)
r.text[:600]

'<!doctype html><html lang="en"><head><title data-rh="true">Pol Alvarez Vecino – Medium</title><meta data-rh="true" charset="utf-8"/><meta data-rh="true" name="viewport" content="width=device-width,minimum-scale=1,initial-scale=1,maximum-scale=1"/><meta data-rh="true" name="theme-color" content="#000000"/><meta data-rh="true" name="twitter:app:name:iphone" content="Medium"/><meta data-rh="true" name="twitter:app:id:iphone" content="828256236"/><meta data-rh="true" property="al:ios:app_name" content="Medium"/><meta data-rh="true" property="al:ios:app_store_id" content="828256236"/><meta data-rh='

**AI Prompt**

Could you help me use beautiful soup to extract the titles of each of the articles? They are placed inside the <article> tag and some children below, like this: <div role="link" data-href="https://medium.com/@pol.avec/why-give-ai-agents-access-to-money-be460a819a9c" tabindex="0" style="position: relative; display: flex;">

**AI Response**

I'll help you extract the article titles using BeautifulSoup. Let's do this step by step:

1. First, let's import BeautifulSoup and parse the HTML:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')
```

2. Let's try to find all article elements and then locate the titles:

```python
articles = soup.find_all('article')
for article in articles:
    # Find div with role="link"
    link_div = article.find('div', attrs={'role': 'link'})
    if link_div:
        print("Found link:", link_div.get('data-href'))
        # Try to find the title text within this div
        title = link_div.get_text(strip=True)
        print("Title:", title)
    print("---")
```

Let's run this and see what we get. If we need to adjust the selectors, we can modify them based on the output. The structure of Medium's HTML can be a bit complex and might require some trial and error to get exactly what we want.

Would you like to try this first step and let me know what output you get? Then we can refine the approach if needed.

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')
articles = soup.find_all('article')
len(articles)

4

In [4]:
def extract(article):
    """Return (url, title) for one <article> card on the profile page."""
    link_div = article.find('div', attrs={'role': 'link'})
    return link_div.get('data-href'), link_div.find('h2').get_text(strip=True)
extract(first(articles))

('https://medium.com/@pol.avec/why-give-ai-agents-access-to-money-be460a819a9c',
 'Why Give AI Agents Access to Money?')

In [5]:
arts = L(articles).map(extract)
arts

(#4) [('https://medium.com/@pol.avec/why-give-ai-agents-access-to-money-be460a819a9c', 'Why Give AI Agents Access to Money?'),('https://medium.com/@pol.avec/building-apps-no-one-needs-dogfooding-hammers-and-over-engineering-5689c5e265c1', 'Building apps no one needs: Dogfooding, Hammers and Over Engineering'),('https://medium.com/@pol.avec/from-text-to-actions-llms-as-the-new-software-consumers-8dc5fd065623', 'From text to actions: LLMs as the new software consumers'),('https://medium.com/@pol.avec/how-i-validated-a-gmail-ai-assistant-in-under-a-week-with-chatgpt-d33213d6fd0d', 'How I Validated a Gmail AI Assistant in Under a Week with ChatGPT')]

In [6]:
r = httpx.get(arts[1][0])

In [7]:
soup = BeautifulSoup(r.text, 'html.parser')
webp_sources = soup.find_all('source', attrs={'type': 'image/webp'})
webp_sources

[<source sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" srcset="https://miro.medium.com/v2/resize:fit:640/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 640w, https://miro.medium.com/v2/resize:fit:720/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 720w, https://miro.medium.com/v2/resize:fit:750/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 750w, https://miro.medium.com/v2/resize:fit:786/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 786w, https://miro.medium.com/v2/resize:fit:828/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 828w, https://miro.medium.com/v2/resize:fit:1100/for

**AI Prompt**

now I need to extract the image name for each source set. There are multiple versions but the file name is always the same for each source. For example, from "https://miro.medium.com/v2/resize:fit:640/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png" I want to extract "1*mjRPB7HiX7dsX_2ZzMOJFA.png"

**AI Response**

I'll help you extract the image filenames from the source sets. Here's how we can do it:

```python
def get_image_name(source):
    # Get the srcset attribute
    srcset = source['srcset']
    # Take the first URL (they all have same filename)
    first_url = srcset.split()[0]
    # Get the filename after the last '/'
    return first_url.split('/')[-1]

# Get all unique image names
image_names = L(webp_sources).map(get_image_name)
image_names
```

This will extract the filename from each source. The `map` function will apply our `get_image_name` function to each source element. Try running this and let me know if you need any adjustments!
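A slightly more defensive variant of the same idea, as a sketch using only the standard library: it parses the URL first, so a query string or fragment would not end up in the file name, and it removes duplicates while keeping order. The names `image_name` and `unique_names` are hypothetical helpers, and they take the raw `srcset` string rather than the BeautifulSoup element.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def image_name(srcset: str) -> str:
    """Return the file name of the first URL listed in a srcset string."""
    first_url = srcset.split()[0]
    # urlparse separates the path from any query string or fragment
    return PurePosixPath(urlparse(first_url).path).name

def unique_names(srcsets) -> list:
    """File names for several srcset strings, de-duplicated, order preserved."""
    return list(dict.fromkeys(image_name(s) for s in srcsets))

example = ("https://miro.medium.com/v2/resize:fit:640/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 640w, "
           "https://miro.medium.com/v2/resize:fit:720/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 720w")
print(image_name(example))
```

With the notebook's objects this could be called as `unique_names(s['srcset'] for s in webp_sources)`.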

In [8]:
def get_images(source):
    name = source['srcset'].split()[0].split('/')[-1]
    return f"https://miro.medium.com/{name}"


imgs = L(webp_sources).map(get_images)
imgs

(#2) ['https://miro.medium.com/1*mjRPB7HiX7dsX_2ZzMOJFA.png','https://miro.medium.com/1*x9KHym2NjNpINIXn1dlXpg.png']
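The cell above relies on the bare `https://miro.medium.com/{name}` URL serving the image. If that pattern ever stops working, one fallback might be to pick the widest candidate that the `srcset` itself advertises. The following is a minimal sketch, assuming every entry has the form `url <width>w` as in the output shown earlier; `parse_srcset` and `widest_url` are hypothetical names.

```python
def parse_srcset(srcset: str) -> list:
    """Split a srcset string into (url, width) pairs, skipping malformed entries."""
    candidates = []
    for entry in srcset.split(','):
        parts = entry.split()
        # Keep only entries shaped like "<url> <digits>w"
        if len(parts) == 2 and parts[1].endswith('w') and parts[1][:-1].isdigit():
            candidates.append((parts[0], int(parts[1][:-1])))
    return candidates

def widest_url(srcset: str):
    """Return the URL with the largest width descriptor, or None if there is none."""
    candidates = parse_srcset(srcset)
    return max(candidates, key=lambda c: c[1])[0] if candidates else None

sample = ("https://miro.medium.com/v2/resize:fit:640/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 640w, "
          "https://miro.medium.com/v2/resize:fit:720/format:webp/1*mjRPB7HiX7dsX_2ZzMOJFA.png 720w")
print(widest_url(sample))
```

Note that this returns a resized WebP variant, not the original file, so it is a trade-off rather than a drop-in replacement for `get_images`.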

**AI Prompt**

now let's extract the page into markdown text using a library

**AI Response**

We can use the `html2text` library to convert the HTML content into Markdown. Let's install and use it:

```python
!pip install html2text
import html2text

# Create converter instance
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = False

# Get the main article content (Medium usually has the article in an 'article' tag)
article = soup.find('article')
if article:
    markdown_content = h.handle(str(article))
    print(markdown_content[:500])  # Let's look at the first 500 chars to see how it looks
```

This should convert the HTML content to Markdown while preserving links and images. If this doesn't give us exactly what we want, we might need to adjust some settings of the HTML2Text converter or we could try alternative approaches like targeting specific content sections. Let me know how this works and we can refine it if needed!

**AI Prompt**

the article actually starts after the element <figure> (not inside it, but its siblings). Also, let's display the markdown using IPython

**AI Response**

Let's modify our approach to get the content after the `<figure>` element and display it nicely using IPython:

```python
from IPython.display import Markdown, display

# Find the figure element
figure = soup.find('figure')

# Get all the content after the figure
content = []
for elem in figure.find_next_siblings():
    content.append(str(elem))

# Convert to markdown
h = html2text.HTML2Text()
h.ignore_links = False
h.ignore_images = False
markdown_content = h.handle(''.join(content))

# Display using IPython's Markdown
display(Markdown(markdown_content))
```

This should give us a cleaner version of the article with proper Markdown formatting. Let's try it and see how it looks!

In [9]:
def get_md(soup):
    """Convert everything after the first <figure> (its siblings) into Markdown."""
    content = []
    for elem in soup.find('figure').find_next_siblings():
        content.append(str(elem))

    h = html2text.HTML2Text()
    return h.handle(''.join(content))
get_md(soup)[:50]

'So, I had this brilliant idea for a platform. Spen'

In [10]:
def get_img_md(soup):
    img = L(soup.find_all('source', attrs={'type': 'image/webp'})).map(get_images)
    md = get_md(soup)
    return img, md
i,m = get_img_md(soup)
i

(#2) ['https://miro.medium.com/1*mjRPB7HiX7dsX_2ZzMOJFA.png','https://miro.medium.com/1*x9KHym2NjNpINIXn1dlXpg.png']

In [11]:
from fastcore.xtras import dict2obj

def parse(el):
    url, title = el
    r = httpx.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    imgs,md = get_img_md(soup)
    return dict2obj(dict(url=url, title=title, imgs=imgs, content=md))
parse(first(arts))

```json
{ 'content': 'The current wave of AI automation largely focuses on optimizing '
             'existing\n'
             'workflows — making them faster, more efficient, or better '
             'tailored to user\n'
             'needs. While valuable, this approach merely scratches the '
             'surface of AI’s\n'
             'potential. The real revolution lies in enabling AI to drive '
             'genuine innovation\n'
             'by participating directly in the market economy.\n'
             '\n'
             'True innovation in a market economy means finding novel '
             'applications for\n'
             'limited resources that create more value than they consume. For '
             'AI to do this,\n'
             'it needs more than just the ability to analyze — it needs to '
             'effect change in\n'
             'the world. And in our economy, that requires the ability to pay '
             'for resources.\n'
             '\n'
             'The key transformation is from AI-as-tools to AI-as-innovators. '
             'When we treat\n'
             'AI as tools, we limit them to optimizing what exists. But '
             'innovators need\n'
             'capital — to acquire and experiment with resources freely. By '
             'giving AI agents\n'
             'both the ability to discover new tools at runtime and deploy '
             'capital to use\n'
             'them, we transform them from optimization engines into genuine '
             'innovators.\n'
             '\n'
             'Consider Replit’s AI agents that can build software tools '
             'independently. While\n'
             'powerful, they’re still limited by their inability to access '
             'paid services or\n'
             'APIs they might need without human supervision. Imagine instead '
             'if these\n'
             'agents could discover and integrate new APIs at runtime, pay for '
             'necessary\n'
             'cloud resources, or even hire other AI services to solve parts '
             'of the problem.\n'
             'That’s when we’ll see true AI-driven innovation.\n'
             '\n'
             'This isn’t about making existing processes better. It’s about '
             'enabling AI to\n'
             'create things we haven’t imagined yet. By giving AI agents both '
             'the tools to\n'
             'discover new possibilities and the capital to pursue them, we’re '
             'enabling them\n'
             'to help create the future.\n'
             '\n',
  'imgs': ['https://miro.medium.com/1*nIukoXOaNmaVjaV93QTpsA.png'],
  'title': 'Why Give AI Agents Access to Money?',
  'url': 'https://medium.com/@pol.avec/why-give-ai-agents-access-to-money-be460a819a9c'}
```
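The `content` above is hard-wrapped: there are `\n` characters in the middle of sentences, which appears to come from html2text's default line wrapping. Setting `h.body_width = 0` on the converter should turn that off. Alternatively, the text can be post-processed. Below is a naive sketch with a hypothetical helper, `unwrap_paragraphs`, that joins the lines of each paragraph; it assumes plain prose and would also flatten lists and code blocks.

```python
import re

def unwrap_paragraphs(md: str) -> str:
    """Join hard-wrapped lines so that each paragraph sits on a single line."""
    # Paragraphs are separated by one or more blank lines
    paragraphs = re.split(r'\n\s*\n', md.strip())
    joined = (' '.join(line.strip() for line in p.splitlines()) for p in paragraphs)
    return '\n\n'.join(joined) + '\n'

wrapped = ('The current wave of AI automation largely focuses on optimizing existing\n'
           'workflows.\n'
           '\n'
           'True innovation in a market economy means finding novel applications for\n'
           'limited resources.\n'
           '\n')
print(unwrap_paragraphs(wrapped))
```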

In [14]:
cts = arts.map(parse)
first(cts)


```json
{ 'content': 'The current wave of AI automation largely focuses on optimizing '
             'existing\n'
             'workflows — making them faster, more efficient, or better '
             'tailored to user\n'
             'needs. While valuable, this approach merely scratches the '
             'surface of AI’s\n'
             'potential. The real revolution lies in enabling AI to drive '
             'genuine innovation\n'
             'by participating directly in the market economy.\n'
             '\n'
             'True innovation in a market economy means finding novel '
             'applications for\n'
             'limited resources that create more value than they consume. For '
             'AI to do this,\n'
             'it needs more than just the ability to analyze — it needs to '
             'effect change in\n'
             'the world. And in our economy, that requires the ability to pay '
             'for resources.\n'
             '\n'
             'The key transformation is from AI-as-tools to AI-as-innovators. '
             'When we treat\n'
             'AI as tools, we limit them to optimizing what exists. But '
             'innovators need\n'
             'capital — to acquire and experiment with resources freely. By '
             'giving AI agents\n'
             'both the ability to discover new tools at runtime and deploy '
             'capital to use\n'
             'them, we transform them from optimization engines into genuine '
             'innovators.\n'
             '\n'
             'Consider Replit’s AI agents that can build software tools '
             'independently. While\n'
             'powerful, they’re still limited by their inability to access '
             'paid services or\n'
             'APIs they might need without human supervision. Imagine instead '
             'if these\n'
             'agents could discover and integrate new APIs at runtime, pay for '
             'necessary\n'
             'cloud resources, or even hire other AI services to solve parts '
             'of the problem.\n'
             'That’s when we’ll see true AI-driven innovation.\n'
             '\n'
             'This isn’t about making existing processes better. It’s about '
             'enabling AI to\n'
             'create things we haven’t imagined yet. By giving AI agents both '
             'the tools to\n'
             'discover new possibilities and the capital to pursue them, we’re '
             'enabling them\n'
             'to help create the future.\n'
             '\n',
  'imgs': ['https://miro.medium.com/1*nIukoXOaNmaVjaV93QTpsA.png'],
  'title': 'Why Give AI Agents Access to Money?',
  'url': 'https://medium.com/@pol.avec/why-give-ai-agents-access-to-money-be460a819a9c'}
```

## Saving it all
Let's save all the posts as Markdown files and download their images.

In [31]:
from urllib.parse import urlparse
from pathlib import Path
import re

In [None]:
def download_image(url, save_dir):
    """Download the image into save_dir and return its file name (not the full path)"""
    r = httpx.get(url, follow_redirects=True)
    r.raise_for_status()
    # Get filename from URL or create one
    fname = Path(urlparse(url).path).name or f"img_{hash(url)}.jpg"
    fpath = save_dir / fname
    
    # Save image
    fpath.write_bytes(r.content)
    return fname
download_image(first(cts).imgs[0], Path('.'))

In [28]:
L(first(cts).imgs).map(lambda img: download_image(img, Path('.'))).map(lambda iname: f'images/{iname}')


(#1) ['images/1*nIukoXOaNmaVjaV93QTpsA.png']
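One thing to watch in the output above: Medium's image names contain `*`, which is fine on Linux and macOS but is a reserved character in Windows file names. If the files need to be portable, a sketch like the following could sanitize the name before saving. `safe_image_name` is a hypothetical helper, not part of the notebook, and any links to the image would need to use the same sanitized name.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

def safe_image_name(url: str) -> str:
    """Derive a file name from a URL, replacing characters reserved on Windows."""
    name = PurePosixPath(unquote(urlparse(url).path)).name
    # < > : " / \ | ? * are not allowed in Windows file names
    return re.sub(r'[<>:"/\\|?*]', '_', name) or 'image'

print(safe_image_name('https://miro.medium.com/1*nIukoXOaNmaVjaV93QTpsA.png'))
```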

In [30]:
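def save_post(post, base_dir='posts'):
    # Setup directories (parents=True also creates post_dir and any nested base_dir)
    post_dir = Path(base_dir)
    img_dir = post_dir / 'images'
    img_dir.mkdir(parents=True, exist_ok=True)

    # Clean title for filename
    fname = re.sub(r'[^\w\s-]', '', post.title).strip().lower().replace(' ', '-')

    # Download images and collect paths
    content = post.content
    img_paths = L(post.imgs).map(lambda img: download_image(img, img_dir)).map(lambda iname: f'images/{iname}')
    # Build the list outside the f-string: a backslash inside an f-string
    # expression is a SyntaxError before Python 3.12
    img_list = '\n'.join(f'- {img}' for img in img_paths)
    # Quote the title so a colon in it (e.g. "Building apps no one needs: ...")
    # does not break the YAML frontmatter
    title = post.title.replace('\\', '\\\\').replace('"', '\\"')

    # Create markdown with frontmatter
    md = f"""---
title: "{title}"
original: {post.url}
images:
{img_list}
---

{content}
"""
    # Save markdown file (explicit encoding, since the posts contain non-ASCII characters)
    (post_dir / f'{fname}.md').write_text(md, encoding='utf-8')

for post in cts: save_post(post)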
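Since `save_post` derives the file name from the title, two posts whose titles reduce to the same slug would silently overwrite each other. That does not happen with the four posts here, but a sketch of a guard could look like this. `slugify` repeats the expression used in `save_post`; `unique_slug` is a hypothetical helper that appends a counter on collision.

```python
import re

def slugify(title: str) -> str:
    """Same cleaning rule as in save_post: drop punctuation, lowercase, spaces to dashes."""
    return re.sub(r'[^\w\s-]', '', title).strip().lower().replace(' ', '-')

def unique_slug(slug: str, taken: set) -> str:
    """Return slug, or slug-2, slug-3, ... if it is already in `taken`; records the result."""
    candidate, n = slug, 1
    while candidate in taken:
        n += 1
        candidate = f'{slug}-{n}'
    taken.add(candidate)
    return candidate

taken = set()
first_slug = unique_slug(slugify('Why Give AI Agents Access to Money?'), taken)
second_slug = unique_slug(slugify('Why give AI agents access to money'), taken)
print(first_slug, second_slug)
```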