# Writing Utils
> Various utilities to help with writing

In [1]:
#|default_exp writing

In [2]:
#|export
import httpx, os, subprocess
from pathlib import Path
from fastcore.utils import L
from hamel import yt
from hamel.gem import gem

## PDF to Images

Split a PDF into one PNG image per page (slide). Requires poppler, which provides the `pdftoppm` command (`brew install poppler` on macOS or `apt-get install poppler-utils` on Ubuntu).

In [3]:
#|export
def pdf2imgs(pdf_path, output_dir=".", prefix="slide"):
    "Split a PDF file into individual slide images using poppler's pdftoppm."
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    cmd = ["pdftoppm", "-png", str(pdf_path), str(output_path / prefix)]
    subprocess.run(cmd, check=True, capture_output=True)
    created_files = sorted(output_path.glob(f"{prefix}-*.png"))
    return [str(f.rename(output_path / f"{prefix}_{i}.png")) for i, f in enumerate(created_files, 1)]

For example, you can split the NewFrontiersInIR.pdf file into individual slide images:

In [4]:
# Split NewFrontiersInIR.pdf into individual slides
output_folder = "slides_output"
image_files = pdf2imgs("NewFrontiersInIR.pdf", output_dir=output_folder)

# Show number of slides created
print(f"Created {len(image_files)} slide images in {output_folder}/")

Created 65 slide images in slides_output/


In [5]:
!rm -rf slides_output/
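
The renaming step in `pdf2imgs` relies on `pdftoppm` zero-padding page numbers so that a plain `sorted` keeps pages in order. Below is a minimal sketch of a more defensive variant that sorts by the parsed page number instead. The helper name `renumber_pages` is hypothetical and not part of this module, and the files here are empty stand-ins created in a temporary directory, so poppler is not needed to run it.

```python
import re
import tempfile
from pathlib import Path

def renumber_pages(directory, prefix="slide"):
    "Rename `{prefix}-<n>.png` files to `{prefix}_<i>.png`, ordered by numeric page number."
    directory = Path(directory)
    pattern = re.compile(rf"{re.escape(prefix)}-(\d+)\.png$")
    pages = []
    for f in directory.glob(f"{prefix}-*.png"):
        m = pattern.match(f.name)
        if m:
            pages.append((int(m.group(1)), f))
    renamed = []
    # Sort on the integer page number, not on the file name
    for i, (_, f) in enumerate(sorted(pages), 1):
        renamed.append(f.rename(directory / f"{prefix}_{i}.png").name)
    return renamed

# Simulate unpadded output files (hypothetical names) without calling pdftoppm
with tempfile.TemporaryDirectory() as d:
    for name in ["slide-10.png", "slide-2.png", "slide-1.png"]:
        (Path(d) / name).touch()
    demo_names = renumber_pages(d)

print(demo_names)
```

Sorting on the integer keeps page 10 after page 2 even when the file names are not padded.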

## Gather Context From Webpages

I often want to gather context from a set of web pages.

In [6]:
#|export
def jina_get(url):
    "Get a website as md with Jina."
    if not (jkey := os.getenv('JINA_READER_KEY')): raise Exception('JINA_READER_KEY env variable not set.') 
    return httpx.get(f"https://r.jina.ai/{url}",
                     headers = {"Authorization": f"Bearer {jkey}"},
                     timeout=60).text

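def gather_urls(urls, tag='example'):
    "Fetch each URL with `jina_get` and wrap the contents in numbered XML-style tags."
    # Opening and closing tags share the same 1-based index; L(...) also accepts a plain list
    xml=[f'<{tag}-{i}>\n{c}\n</{tag}-{i}>' for i,c in enumerate(L(urls).map(jina_get), 1)]
    return f'<{tag}s>\n' + '\n'.join(xml) + f'\n</{tag}s>'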
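
To see the shape of the text that `gather_urls` is meant to produce without calling the Jina API, here is a small offline sketch. `wrap_in_tags` and `fake_fetch` are hypothetical stand-ins written for this illustration. They are not part of the module, and the example.com URLs are placeholders.

```python
def wrap_in_tags(contents, tag="example"):
    "Wrap each item in <tag-N>...</tag-N>, then wrap the whole thing in <tags>...</tags>."
    items = [f"<{tag}-{i}>\n{c}\n</{tag}-{i}>" for i, c in enumerate(contents, 1)]
    return f"<{tag}s>\n" + "\n".join(items) + f"\n</{tag}s>"

def fake_fetch(url):
    "Stand-in for a real fetcher such as `jina_get` (assumption: returns page text)."
    return f"content of {url}"

urls = ["https://example.com/a", "https://example.com/b"]
wrapped = wrap_in_tags([fake_fetch(u) for u in urls])
print(wrapped)
```

Each document gets its own numbered tag, so a prompt can refer to a specific example by name.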

For example, these are the posts I might use as context for annotated posts:

In [7]:
#|export
_annotated_post_urls = L([
    'https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/rag/p1-intro.md',
    'https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/rag/p2-evals.md',
    'https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/evals/inspect.qmd',
])

In [8]:
_annotated_post_content = gather_urls(_annotated_post_urls)
print(_annotated_post_content[:500])

<examples>
<example-1>
Title: 

URL Source: https://raw.githubusercontent.com/hamelsmu/hamel-site/refs/heads/master/notes/llm/rag/p1-intro.md

Markdown Content:
---
title: "P1: I don't use RAG, I just retrieve documents"
description: "Ben Clavié's introduction to advanced retrieval techniques"
image: p1-images/slide_12.png
date: 2025-06-25
---

As part of our [LLM Evals course](https://bit.ly/evals-ai){target="_blank"}, I hosted [Benjamin Clavié](https://ben.clavie.eu/){target="_blank"} to kick 


In [9]:
#|export
def outline_slides(slide_path):
    "Ask `gem` for a numbered, one-sentence-per-slide outline of the deck at `slide_path`."
    return gem("Provide a numbered list of each slide with a one-sentence summary of each. Just a numbered list please, no other asides or meta explanations of the task are required.", slide_path)

In [10]:
_o = outline_slides('NewFrontiersInIR.pdf')
print(_o[:300])

Here is a one-sentence summary for each slide:

1.  This slide introduces the presentation "New Frontiers in IR: Instruction Following and Reasoning" by Orion Weller from Johns Hopkins Whiting School of Engineering.
2.  This slide shows a "Message ChatGPT" interface with a prominent "Search" button,


## Annotated Posts From Talk

In [29]:
#|export
def generate_annotated_talk_post(slide_path, 
                                 youtube_link, 
                                 image_dir,
                                 transcript_path=None, 
                                 example_urls=_annotated_post_urls):
    "Draft an annotated blog post from a talk's slides, video, and transcript."
    
    youtube_chapters = yt.yt_chapters(youtube_link)
    slide_outline = outline_slides(slide_path)
    transcript = Path(transcript_path).read_text() if transcript_path else yt.transcribe(youtube_link)
    examples = gather_urls(example_urls)
    _ = pdf2imgs(slide_path, output_dir=image_dir)
    
    prompt=f"""Attached is the transcript (in <transcript> tags) of a technical talk for the attached slides. I'd like to make an annotated presentation blog post as illustrated in <examples> tags.

For each slide, provide a detailed synopsis of the information to maximize understanding for the reader for the purposes of educating the reader.  Each section should provide enough commentary and info to understand the full context of that particular slide.  The idea is that the reader will not have to watch the video and can instead read the material so the writing + slide should stand alone.  Do not simply repeat the information on each slide, briefly describe what the slide is about, and capture supplementary information that was provided in the talk that is NOT in the slides.   Be thoroughly detailed and capture useful asides or commentary as well, such that the notes you generate should be a legitimate value add on top of the slides.

When writing the article, provide markdown placeholders with appropriate captions where the slides will go.  For example, you might have a placeholder like this:

[Overview of xyz concept]({image_dir}/slide_1.png) 

Note that images for this post will be placed in {image_dir}/

Refer to slides with naming convention (slide_1.png, slide_2.png, etc)

Additionally, reference the correct timestamp in the form of a timestamped link to the YouTube video that corresponds to the start of each slide.   The link to this presentation is {youtube_link} (so use this when adding timestamps please).  

I have included other annotated posts as an example for you to understand the format. These examples are in <examples> tags.

Finally, there might be a Q&A section of the talk that will not correspond to any slides at all.  If that exists, list all those questions with answers in a Q&A section.  If there is a Q&A section, it should be drafted to maximize learning such that people who have not listened to the talk can understand the full context.  Add timestamps if possible to each question in the Q&A as well.  The post should be written from the perspective of Hamel Husain (me) who hosted the talk as part of a course on LLM Evals (https://bit.ly/evals-ai).   Put a CTA at the beginning and end of the post in a tasteful way that is appropriate for a developer blog that looks something like the example posts, particularly following `p1-intro.md`.  

Example CTA: We are teaching our last and final cohort of our [AI Evals course](https://bit.ly/evals-ai) next month (we have to get back to building). Here is a [35% discount code](https://bit.ly/evals-ai) for readers.

Here is the transcript
<transcript>
{transcript}
</transcript>

In case it is helpful, here is the YouTube description with chapters from the talk.  However, please use timestamps from the transcript when possible when constructing timestamped links. 
<youtube-chapters>
{youtube_chapters}
</youtube-chapters>

Below is a brief slide outline (in addition to the attached pdf)
<slide-outline>
{slide_outline}
</slide-outline>

Here are example posts that I have previously written:
{examples}

When writing the introduction, annotations, and Q&A, keep the following writing guidelines in mind:

1. Do not add filler words. 
2. Make every sentence information-dense without repetition.
3. Get to the point while providing necessary context.
4. Use short words and fewer words.
5. Avoid multiple examples if one suffices.
6. Make questions neutral without telegraphing answers.
7. Remove sentences that restate the premise.
8. Cut transitional fluff like "This is important because..."
9. Combine related ideas into single statements.
10. Avoid overusing bullet points. Prefer flowing prose that combines related concepts. Use lists only for truly distinct items.
11. Trust the reader's intelligence.
12. Start sections with specific advice, not general statements.
13. Replace em dashes with periods, commas, or colons.
14. Cut qualifying phrases that add no concrete information.
15. Use direct statements. Avoid hedge words unless exceptions matter.
16. Remove setup phrases like "It's worth noting that" or "The key point is."
17. Avoid unnecessarily specific claims when general statements work.
18. Avoid explanatory asides and redundant clauses.
19. Each sentence should add new information.
20. Avoid "Remember... the goal is not X but Y" conclusions.
21. No emojis in professional writing.
22. Use simple language. Present information objectively. Avoid exaggeration.
23. No formulaic conclusions with labels and prescriptive wisdom.

Please go ahead and draft the post. Please also include front matter similar to the front matter in the examples and select the best slide from the talk as the cover image (which is not the title slide, but instead another interesting slide that is punchy).
"""
    draft_post = gem(prompt, [slide_path, youtube_link], model='gemini-2.5-pro')
    return draft_post
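
The prompt asks the model to link each slide to its start time in the video. As a rough illustration of what such a link can look like, here is a sketch that adds a `t=<seconds>` query parameter to the video URL. The helpers `to_seconds` and `timestamp_link` are hypothetical and not part of this module, and the sketch assumes the `t` parameter is how the start time is passed.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def to_seconds(ts):
    "Convert 'SS', 'MM:SS', or 'HH:MM:SS' into a number of seconds."
    secs = 0
    for part in ts.split(":"):
        secs = secs * 60 + int(part)
    return secs

def timestamp_link(url, ts):
    "Return `url` with its `t` query parameter set to the offset given by `ts`."
    parts = urlparse(url)
    # Drop any existing `t` value, keep the other query parameters as they are
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "t"]
    query.append(("t", str(to_seconds(ts))))
    return urlunparse(parts._replace(query=urlencode(query)))

link = timestamp_link("https://youtu.be/YB3b-wPbSH8?si=u_x0Puwreld3YCGf", "12:34")
print(link)
```

A helper like this can also be used to spot-check the links in a generated draft against the transcript.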

## Example Post

In [24]:
post = generate_annotated_talk_post(slide_path='orion_example/NewFrontiersInIR.pdf',
                                    youtube_link='https://youtu.be/YB3b-wPbSH8?si=u_x0Puwreld3YCGf',
                                    image_dir='orion_example/p3_images',
                                    transcript_path='orion_example/transcript.md')

In [22]:
Path('orion_example/p3_orion.qmd').write_text(post)

30263
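
Before publishing, it can help to check that every slide image the draft refers to exists. Here is a minimal sketch, assuming the draft uses the `slide_<n>.png` naming convention requested in the prompt. `referenced_slides` and `missing_slides` are hypothetical helpers, and the draft text below is made up for the example.

```python
import re

def referenced_slides(post_text):
    "Return the sorted, de-duplicated slide numbers mentioned as `slide_<n>.png`."
    return sorted({int(n) for n in re.findall(r"slide_(\d+)\.png", post_text)})

def missing_slides(post_text, n_slides):
    "Return referenced slide numbers that fall outside 1..n_slides."
    return [n for n in referenced_slides(post_text) if not 1 <= n <= n_slides]

# Made-up draft text that refers to one slide the deck does not have
draft = (
    "![Intro](p3_images/slide_1.png)\n"
    "Some commentary.\n"
    "![Results](p3_images/slide_70.png)\n"
)
bad = missing_slides(draft, n_slides=65)
print(bad)
```

In practice `n_slides` could come from `len(pdf2imgs(...))`, and `draft` from the string returned by `generate_annotated_talk_post`.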