<a href="https://colab.research.google.com/github/pfunk5150/jina-reader-small-lms/blob/main/reader_lm_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reader-LM Tutorial

[Read full release post](https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown)

In this notebook, we will demonstrate how to use Jina AI’s latest reader-lm model to convert HTML directly into markdown format. For this tutorial, we will use the `reader-lm-1.5b` model, which is compatible with the Colab T4 free tier. Additionally, a smaller version, `reader-lm-0.5b`, is available for those who need a lighter model.

✋ **Important Note:** The free-tier T4 GPU has certain limitations that may prevent you from utilizing advanced optimizations for model execution. Features such as bf16, flash-attn, and others cannot be used on the T4, which may result in higher vRAM usage and slower performance for longer inputs. For production environments, using a higher-end GPU such as the RTX3090/4090 is recommended for significantly better performance.

## GPU only: make sure that your runtime is set to GPU!

Go to `Menu Bar -> Runtime -> Change runtime type -> T4 GPU` or higher tier.


In [None]:
# check if CUDA is >=12.1 (https://docs.vllm.ai/en/latest/getting_started/installation.html#install-with-pip)
! nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
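
If you would rather check the version programmatically than read it by eye, the release number can be pulled out of the `nvcc --version` text. This is a minimal sketch: `parse_cuda_release` is a hypothetical helper (not part of vLLM or the notebook), and the `(12, 1)` threshold is taken from the comment in the cell above.

```python
import re

def parse_cuda_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from the 'release X.Y' part of nvcc output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return int(match.group(1)), int(match.group(2))

# Line copied from the output shown above.
sample_output = "Cuda compilation tools, release 12.2, V12.2.140"
version = parse_cuda_release(sample_output)

# Tuples compare element by element, so (12, 2) >= (12, 1) holds.
print(version, version >= (12, 1))  # (12, 2) True
```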


## Install vLLM + Triton

In [None]:
!pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly
!pip install vllm

Looking in indexes: https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/
Collecting triton-nightly
  Downloading https://aiinfra.pkgs.visualstudio.com/2692857e-05ef-43b4-ba9c-ccf1c22c437c/_packaging/07c94329-d4c3-4ad4-9e6b-f904a60032ec/pypi/download/triton-nightly/3.post20240716052845/triton_nightly-3.0.0.post20240716052845-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (138.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.2/138.2 MB 6.1 MB/s eta 0:00:00
Installing collected packages: triton-nightly
Successfully installed triton-nightly-3.0.0.post20240716052845
Collecting vllm
  Downloading vllm-0.6.0-cp38-abi3-manylinux1_x86_64.whl.metadata (2.2 kB)
Collecting fastapi (from vllm)
  Downloading fastapi-0.113.0-py3-none-any.whl.metadata (27 kB)
Collecting openai>=1.0 (from vllm)
  Downloading openai-1.43.1-py3-none-any.whl.metadata (22 kB)
Collecting uvicorn[standard] (from vllm)
  Downloa

In [None]:
# @title Config reader-lm parameters { run: "auto" }

# @markdown ### Model:
# @markdown ---

model_name = 'jinaai/reader-lm-1.5b' # @param ["jinaai/reader-lm-1.5b", "jinaai/reader-lm-0.5b"]
# @markdown ---
# @markdown ### SamplingParams:

top_k = 1 # @param {type:"integer"}
temperature = 0 # @param {type:"slider", min:0, max:1, step:0.1}
repetition_penalty = 1.08 # @param {type:"number"}
presence_penalty = 0.25 # @param {type:"slider", min:0, max:1, step:0.1}
max_tokens = 1024 # @param {type:"integer"}
# @markdown ---

from vllm import SamplingParams

sampling_params = SamplingParams(temperature=temperature, top_k=top_k, presence_penalty=presence_penalty, repetition_penalty=repetition_penalty, max_tokens=max_tokens)

print('sampling_params', sampling_params)

sampling_params SamplingParams(n=1, best_of=1, presence_penalty=0.25, frequency_penalty=0.0, repetition_penalty=1.08, temperature=0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None)
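
Note that the printed parameters show `top_k=-1` even though `top_k = 1` was set. With `temperature=0`, vLLM appears to switch to greedy decoding and reset `top_k`, so the result is the same: the highest-scoring token is always picked. The two penalties still change which token scores highest. The following is a simplified, pure-Python sketch of how repetition and presence penalties are commonly applied to logits. `apply_penalties` is a hypothetical helper for illustration, not vLLM's implementation, which is vectorized and may differ in detail.

```python
def apply_penalties(logits, prompt_ids, output_ids,
                    repetition_penalty=1.08, presence_penalty=0.25):
    """Return a penalized copy of `logits` (one float per token id)."""
    adjusted = list(logits)
    # Repetition penalty: scale down tokens seen in the prompt or the output.
    for tok in set(prompt_ids) | set(output_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    # Presence penalty: flat subtraction for tokens already generated.
    for tok in set(output_ids):
        adjusted[tok] -= presence_penalty
    return adjusted

logits = [2.0, 1.9, -1.0, 0.5]
# Token 2 appeared in the prompt; token 0 was already generated.
adjusted = apply_penalties(logits, prompt_ids=[2], output_ids=[0])

# Greedy decoding picks the index of the largest adjusted logit.
next_token = max(range(len(adjusted)), key=adjusted.__getitem__)
print(next_token)  # 1 (token 0 would have won without the penalties)
```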


### Load reader-lm onto the GPU

In [None]:
from vllm import LLM

llm = LLM(model=model_name, dtype='float16')

config.json:   0%|          | 0.00/705 [00:00<?, ?B/s]

INFO 09-06 03:30:02 config.py:999] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 09-06 03:30:02 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='jinaai/reader-lm-1.5b', speculative_config=None, tokenizer='jinaai/reader-lm-1.5b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=256000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=jinaai/reader-lm-1.5b, use_v2_bloc

tokenizer_config.json:   0%|          | 0.00/1.30k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/80.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/367 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/335 [00:00<?, ?B/s]

INFO 09-06 03:30:12 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 09-06 03:30:12 selector.py:116] Using XFormers backend.


  @torch.library.impl_abstract("xformers_flash::flash_fwd")
  @torch.library.impl_abstract("xformers_flash::flash_bwd")


INFO 09-06 03:30:13 model_runner.py:915] Starting to load model jinaai/reader-lm-1.5b...
INFO 09-06 03:30:13 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 09-06 03:30:13 selector.py:116] Using XFormers backend.
INFO 09-06 03:30:14 weight_utils.py:236] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

INFO 09-06 03:31:27 weight_utils.py:280] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 09-06 03:31:31 model_runner.py:926] Loading model weights took 2.9417 GB
INFO 09-06 03:31:32 gpu_executor.py:122] # GPU blocks: 19972, # CPU blocks: 9362
INFO 09-06 03:31:37 model_runner.py:1217] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-06 03:31:37 model_runner.py:1221] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-06 03:32:03 model_runner.py:1335] Graph capturing finished in 26 secs.
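
The `# GPU blocks` line in the log gives a rough idea of how many tokens the KV cache can hold. The arithmetic below is a back-of-the-envelope sketch: the block count and `max_seq_len` are copied from the log above, while the block size of 16 tokens is an assumption (it is not visible in the truncated log line).

```python
gpu_blocks = 19972     # from "# GPU blocks: 19972" in the log above
max_seq_len = 256000   # from "max_seq_len=256000" in the log above
block_size = 16        # ASSUMPTION: tokens per KV-cache block, not shown in the log

# Total number of tokens the GPU KV cache could hold under this assumption.
kv_capacity_tokens = gpu_blocks * block_size
print(kv_capacity_tokens)                 # 319552
print(kv_capacity_tokens >= max_seq_len)  # True
```

Under that assumption, the cache has room for one sequence of the full context length, with some headroom to spare.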


In [None]:
# @title ## Specify a URL as input{"run":"auto","vertical-output":true}

import re
import requests
from IPython.display import display, Markdown

def display_header(text):
    display(Markdown(f'**{text}**'))

def display_rendered_md(text):
    # mimics "Reading mode" in Safari/Firefox by rendering the markdown
    display(Markdown(text))

def display_content(text):
    display(Markdown(f'```\n{text}\n```'))

def get_html_content(url):
    api_url = f'https://r.jina.ai/{url}'
    headers = {'X-Return-Format': 'html'}
    try:
        response = requests.get(api_url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        return f"error: {str(e)}"


def create_prompt(text: str, tokenizer) -> str:
    messages = [
        {
            "role": "user",
            "content": text
        },
    ]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )



# Remove <script>...</script> blocks and variations
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'  # lazily match any characters up to the closing tag

# Remove <style>...</style> blocks and variations
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'  # lazily match any characters up to the closing tag

# Remove <meta ...> tags (these have no closing tag)
META_PATTERN = r'<[ ]*meta.*?>'  # lazily match any characters up to the first '>'

# Remove HTML comments <!-- ... --> and variations
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'  # lazily match any characters up to the end of the comment

# Remove <link ...> tags (these have no closing tag)
LINK_PATTERN = r'<[ ]*link.*?>'  # lazily match any characters up to the first '>'

# (REPLACE base64 images)
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'

# (REPLACE <svg> to </svg> and variations)
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'


def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )


def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)


def has_base64_images(text: str) -> bool:
    base64_content_pattern = r'data:image/[^;]+;base64,[^"]+'
    return bool(re.search(base64_content_pattern, text, flags=re.DOTALL))


def has_svg_components(text: str) -> bool:
    return bool(re.search(SVG_PATTERN, text, flags=re.DOTALL))


def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
    html = re.sub(SCRIPT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(STYLE_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(META_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(COMMENT_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))
    html = re.sub(LINK_PATTERN, '', html, flags=(re.IGNORECASE | re.MULTILINE | re.DOTALL))

    if clean_svg:
        html = replace_svg(html)

    if clean_base64:
        html = replace_base64_images(html)

    return html

url = "https://www.hackernews.com" # @param {type:"string"}


print(f'We will use Jina Reader to fetch the **raw HTML** from: {url}')

We will use Jina Reader to fetch the **raw HTML** from: https://www.hackernews.com


## Action!

First, we use the [Jina Reader API](https://jina.ai/reader) to fetch the **raw HTML** from that URL. By default, the Jina Reader API returns formatted markdown (produced with rule-based heuristics), but here we add `{'X-Return-Format': 'html'}` to the request headers so that it returns the raw HTML instead.

In [None]:
html = get_html_content(url)
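
One thing to watch: `get_html_content` does not raise on failure. It returns a string starting with `error: `, which would then be passed to the model as if it were HTML. A minimal guard could look like the sketch below, where `ensure_html` is a hypothetical helper that is not part of the original notebook.

```python
def ensure_html(text: str) -> str:
    """Raise if `text` is the error string produced by get_html_content."""
    prefix = "error: "
    if text.startswith(prefix):
        raise ValueError(f"fetch failed: {text[len(prefix):]}")
    return text

# A successful fetch passes through unchanged.
print(ensure_html("<html><body>ok</body></html>"))

# A failed fetch is turned into an exception instead of a prompt.
try:
    ensure_html("error: 404 Client Error")
except ValueError as exc:
    print(exc)  # fetch failed: 404 Client Error
```

In the notebook this would be used as `html = ensure_html(get_html_content(url))`.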

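Second, we remove `<script>`, `<style>`, `<meta>`, and `<link>` tags and HTML comments from the raw HTML, and replace the contents of `<svg>` elements and base64-encoded images with short placeholders. This reduces the noise and length of the input (i.e. makes it friendlier to the T4's VRAM). This step is optional but generally helpful.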

In [None]:
html = clean_html(html, clean_svg=True, clean_base64=True)
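
To see what the cleaning step does, here is a small self-contained sketch that applies the same regular expressions to a hand-written HTML snippet. The patterns and placeholder values are copied from the cell above; `strip_noise` and the `sample` string are made up for illustration. Regex-based cleaning is a heuristic and may miss unusual markup.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL

# Same patterns as SCRIPT/STYLE/META/COMMENT/LINK_PATTERN in the notebook.
NOISE_PATTERNS = [
    r'<[ ]*script.*?\/[ ]*script[ ]*>',
    r'<[ ]*style.*?\/[ ]*style[ ]*>',
    r'<[ ]*meta.*?>',
    r'<[ ]*!--.*?--[ ]*>',
    r'<[ ]*link.*?>',
]
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'

def strip_noise(html: str) -> str:
    """Drop noisy tags, then shrink SVG bodies and base64 images."""
    for pattern in NOISE_PATTERNS:
        html = re.sub(pattern, '', html, flags=FLAGS)
    # Keep the <svg> wrapper but replace its contents with a placeholder.
    html = re.sub(
        SVG_PATTERN,
        lambda m: f"{m.group(1)}this is a placeholder{m.group(3)}",
        html,
        flags=re.DOTALL,
    )
    # Replace inline base64 images with a stub <img> tag.
    return re.sub(BASE64_IMG_PATTERN, '<img src="#"/>', html)

sample = (
    '<html><head><meta charset="utf-8"><link rel="stylesheet" href="a.css">'
    '<style>p { color: red; }</style><script>alert(1)</script></head>'
    '<body><!-- nav --><p>Hello <b>world</b></p>'
    '<svg width="10"><path d="M0 0"/></svg>'
    '<img alt="dot" src="data:image/png;base64,AAAA"></body></html>'
)
cleaned = strip_noise(sample)
print(cleaned)
# <html><head></head><body><p>Hello <b>world</b></p><svg width="10">this is a placeholder</svg><img src="#"/></body></html>
```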

Now we pass the cleaned HTML to vLLM for generation, subject to our predefined sampling parameters.

In [None]:
prompt = create_prompt(html, llm.get_tokenizer())
results = llm.generate(prompt, sampling_params=sampling_params)

Processed prompts: 100%|██████████| 1/1 [00:30<00:00, 30.29s/it, est. speed input: 453.94 toks/s, output: 33.81 toks/s]
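
The progress bar reports throughput, which can be turned back into approximate token counts. The sketch below uses only the numbers printed above plus the `max_tokens = 1024` setting from the config cell. The counts are estimates because the reported speeds are rounded.

```python
elapsed_s = 30.29           # wall-clock time from the progress bar
input_tok_per_s = 453.94    # "est. speed input"
output_tok_per_s = 33.81    # "output"
max_tokens = 1024           # value set in the config cell

prompt_tokens = round(elapsed_s * input_tok_per_s)
generated_tokens = round(elapsed_s * output_tok_per_s)

# If generation used the whole budget, the output was probably cut off.
hit_limit = generated_tokens >= max_tokens
print(prompt_tokens, generated_tokens, hit_limit)  # 13750 1024 True
```

The estimate of about 1024 generated tokens matches `max_tokens`, which suggests generation stopped at the limit and not at a natural end of the page. In vLLM, each completion also carries a `finish_reason` field (for example `results[0].outputs[0].finish_reason`), which should read `"length"` in that case. Raising `max_tokens` gives the model room to finish, at the cost of a longer run.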


## Finally, print the results!

Here we iterate over `results`, which holds one entry per prompt. Each entry's `outputs` list contains `n` completions; since the `sampling_params` printed earlier show `n=1`, there is exactly one completion per prompt.

In [None]:
for output in results:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    display_content(generated_text)

```
Hacker News new | past | comments | ask | show | jobs | submit	login



1.	
	People can read their manager's mind (yosefk.com)
	81 points by luu 2 hours ago | hide | 20 comments

2.	
	UE5 Nanite in WebGPU (github.com/scthe)
	279 points by vouwfietsman 9 hours ago | hide | 83 comments

3.	
	Phind-405B and faster, high quality AI answers for everyone (phind.com)
	204 points by rushingcreek 10 hours ago | hide | 83 comments

4.	
	LwIP – Lightweight IP Stack (nongnu.org)
	19 points by fidotron 2 hours ago | hide | 15 comments

5.	
	Tell HN: Burnout is bad to your brain, take care
	252 points by tuyguntn 3 hours ago | hide | 118 comments

6.	
	AlphaProteo generates novel proteins for biology and health research (deepmind.google)
	229 points by meetpateltech 12 hours ago | hide | 80 comments

7.	
	Deploying Rust in Existing Firmware Codebases (googleblog.com)
	113 points by pjmlp 10 hours ago | hide | 59 comments

8.	
	serverless-registry: A Docker registry backed by Workers and R2 (github.com/cloudflare)
	115 points by tosh 10 hours ago | hide | 47 comments

9.	
	Launch HN: Maitai (YC S24) – Self-Optimizing LLM Platform
	116 points by cmdalsanto 13 hours ago | hide | 58 comments

10.	
	Show HN: Feature Flags Backed by Git (flipt.io)
	63 points by bullcitydev 7 hours ago | hide | 18 comments

11.	
	The Origins of the Steam Engine (rootsofprogress.org)
	16 points by bpierre 4 hours ago | hide | 2 comments

12.	
	Swiss watchmakers put employees on state-funded leave as luxury demand disappear (fortune.com)
	17 points by cwwc 1 hour ago | hide | 18 comments

13.	
	Show HN: AnythingLLM – Open-Source, All-in-One Desktop AI Assistant (github.com/mintplex-labs)
	211 points by tcarambat1010 11 hours ago | hide | 57 comments

14.	
	The 'Freakish Radio Writings' of 1924 (centauri-dreams.org)
	50 points by JPLeRouzic 9 hours ago | hide | 5 comments

15.	
	Why I self host my servers and what I've recently learned (chollinger.com)
	191 points by transpute 1 day ago | hide | 77 comments

16.	
	Show HN: We built a FOSS documentation CMS with a pretty GUI (difuse.io)
	87 points by arch1e 10 hours ago | hide | 15 comments

17.	
	Clojure 1.12.0 is now available (clojure.org)
	157 points by msolli 7 hours ago | hide | 23 comments

18.	
	The Work You Do, the Person You Are (2017) (newyorker.com)
	3 points by mitchbob 2 hours ago | hide | discuss

19.	
	Why Don't Tech Companies Pay Their Engineers to Stay? (goethena.com)
	51 points by samspenc 2 hours ago | hide | 66 comments

20.	
	Visa to launch pay-by-bank payments, an alternative to credit cards (cnbc.com)
	33 points by jnord 3 hours ago | hide | 26 comments

21.	
	Common food dye found to make skin and muscle temporarily transparent (theguardian.com)
	155 points by _Microft 7 hours ago | hide | 59 comments

22.	
	Libations: Tailscale on the Rocks (jnsgr.uk)
	81 points by yarapavan 11 hours ago | hide | 15 comments

23.	
	My job is to watch dreams die (2011) (reddit.com)
	245 points by eezurr 8 hours ago | hide
```

## Don't forget to release vRAM by resetting vLLM!

In [None]:
from vllm.distributed.parallel_state import destroy_model_parallel, destroy_distributed_environment
import gc
import os
import torch

destroy_model_parallel()
destroy_distributed_environment()
del llm.llm_engine.model_executor.driver_worker
del llm.llm_engine.model_executor
del llm
gc.collect()
torch.cuda.empty_cache()

print(f"cuda memory: {torch.cuda.memory_allocated() // 1024 // 1024}MB")