### *Requirements*
*To use this notebook, you need a WARC export of your search results from SolrWayback (see **[SolrWayback > Export](https://nlnwa.github.io/research-services/docs/solrwayback/solrwayback-5export.html)**).*

*The exported data:*
- *must be in the WARC format,*
- *must be indexed as `type:html` or `content_type:"text/html"`.*

*After export, the WARC file should be moved to the `/warc/` folder.*

*We also recommend giving it a meaningful name, e.g. `domain-regjeringen-no_content-type-html.warc`.*

# WARC to HTML
This notebook allows you to extract HTML files from WARC records.

The purpose is to enable further work on the HTML files, e.g. removing boilerplate, tokenisation, etc.

## Import packages
Before starting, we must import the necessary Python libraries.

To run a code cell, make sure it is selected and then press <kbd>Shift</kbd> + <kbd>Enter</kbd>.

In [None]:
import os
from warcio.archiveiterator import ArchiveIterator

## Set WARC file path and HTML output
First, we set the path and file name of the WARC file we want to extract from.

In [None]:
warc_path = '../warc/file-name.warc'

Then, we define the folder where the HTML files should be written:

In [None]:
html_dir = '../html/{name-of}_html/'

# Create the output folder (and any missing parent folders) if it doesn't exist
os.makedirs(html_dir, exist_ok=True)
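
Before extracting, it can be useful to confirm that the file at `warc_path` exists and actually looks like a WARC file. The cell below is a minimal sketch of such a check. The helper name `looks_like_warc` is hypothetical and not part of warcio. It assumes the export is either a plain WARC file (which begins with a `WARC/` version line) or a gzip-compressed one.

```python
import gzip
import os

def looks_like_warc(path):
    """Return True if the file begins with a WARC version line (plain or gzipped)."""
    if not os.path.isfile(path):
        return False
    # Gzip files start with the two magic bytes 0x1f 0x8b
    with open(path, 'rb') as f:
        magic = f.read(2)
    opener = gzip.open if magic == b'\x1f\x8b' else open
    try:
        with opener(path, 'rb') as f:
            # Every WARC record starts with a version line such as 'WARC/1.0'
            return f.read(5) == b'WARC/'
    except (OSError, EOFError):
        # Unreadable or truncated file
        return False
```

For example, `looks_like_warc(warc_path)` could be called right after setting `warc_path`; if it returns `False`, check the file name and that the file was moved to the `/warc/` folder.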

## Open WARC and extract HTML

In [None]:
# Open the WARC file and iterate through its records
with open(warc_path, 'rb') as stream:
    for record in ArchiveIterator(stream):
        # Only process 'response' records
        if record.rec_type == 'response':
            # Get the WARC-Record-ID and sanitize into valid filename
            warc_record_id = record.rec_headers.get_header('WARC-Record-ID')
            warc_record_id = warc_record_id.replace('<', '').replace('>', '').replace(':', '_')
            
            # Create HTML filename based on WARC-Record-ID (URN UUID)
            html_file_name = f"{warc_record_id}.html"
            
            # Create full path to the HTML file
            html_file_path = os.path.join(html_dir, html_file_name)
            
            # Extract and write HTML payload to file
            payload = record.content_stream().read()
            with open(html_file_path, 'wb') as f:
                f.write(payload)
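
The loop above writes every `response` record to an `.html` file, relying on the export containing only HTML. The cell below is a small sketch of two helpers that make the file naming explicit and allow a defensive check of the content type. Both function names (`record_id_to_filename`, `is_html_content_type`) are hypothetical and introduced here only for illustration.

```python
def record_id_to_filename(record_id, extension='.html'):
    """Mirror the sanitisation used above: drop angle brackets, replace colons."""
    cleaned = record_id.replace('<', '').replace('>', '').replace(':', '_')
    return f"{cleaned}{extension}"

def is_html_content_type(content_type):
    """Return True for header values such as 'text/html; charset=utf-8'."""
    if not content_type:
        return False
    # Ignore parameters like charset and compare case-insensitively
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'text/html'

# A WARC-Record-ID is typically a URN UUID wrapped in angle brackets
print(record_id_to_filename('<urn:uuid:12345678-1234-1234-1234-123456789abc>'))
# → urn_uuid_12345678-1234-1234-1234-123456789abc.html
```

Inside the extraction loop, the HTTP `Content-Type` of a response record could be read with `record.http_headers.get_header('Content-Type')` (when `record.http_headers` is not `None`) and passed to `is_html_content_type` to skip any non-HTML records.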