### *Requirements*
*To use this notebook, you need a WARC export of your search results from SolrWayback (see **[SolrWayback > Export](https://nlnwa.github.io/research-services/docs/solrwayback/solrwayback-5export.html)**).*

*We also recommend giving the file a meaningful name, e.g. `domain-regjeringen-no_content-type-html.warc`*

# WARC to any
This notebook allows you to extract the underlying files from WARC records.

## Import packages
Before starting, we must import the necessary Python libraries.

To run a code cell, make sure it is selected and then press <kbd>Shift</kbd> + <kbd>Enter</kbd>.

In [None]:
from pathlib import Path
from mimetypes import guess_extension

from warcio.archiveiterator import ArchiveIterator

# detect_from_content is provided by the file-magic package (libmagic bindings)
from magic import detect_from_content
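
As a rough illustration of what `guess_extension` contributes, the sketch below maps a MIME type to a file extension using only the standard library. The MIME type `application/x-hypothetical-type` is made up for the example and only serves to show the case where no extension is known.

```python
from mimetypes import guess_extension

# A well-known MIME type maps to an extension, including the leading dot.
pdf_extension = guess_extension("application/pdf")
print(pdf_extension)

# A MIME type that is not registered yields None, so callers need a fallback.
unknown_extension = guess_extension("application/x-hypothetical-type")
print(unknown_extension)
```

This is why the extraction step has to handle the case where no extension can be derived from a detected MIME type.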

## Set WARC file path and output directory
First, set the path to the WARC file you want to extract from.

In [None]:
warc_path = Path.home() / "Downloads" / "<insert-name-here>"  # it will likely look something like this: solrwayback_2023-09-05_08-39-11.warc.gz

Then, we define where the extracted files should be written:

In [None]:
repo_root = Path("..").resolve()
output_root_dir = repo_root / "output"
output_root_dir.mkdir(parents=True, exist_ok=True)
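
The extraction step writes each file into a subdirectory of the output directory named after its file extension, with the sanitized WARC record ID as the file name. The following is a minimal sketch of that layout; `build_destination` and the record ID `<urn:uuid:hypothetical-id>` are illustrative names that are not part of the notebook, and a temporary directory stands in for the real output directory.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def build_destination(output_root: Path, record_id: str, file_extension: str) -> Path:
    """Return output_root/<extension without dot>/<safe record id><extension>."""
    # Drop "<" and ">" and replace ":" so the ID is usable as a file name
    safe_id = record_id.replace("<", "").replace(">", "").replace(":", "_")
    return output_root / file_extension.lstrip(".") / f"{safe_id}{file_extension}"


with TemporaryDirectory() as tmp:
    example_root = Path(tmp) / "output"
    destination = build_destination(example_root, "<urn:uuid:hypothetical-id>", ".html")
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(b"<html></html>")
    # Path of the written file relative to the output root
    relative = destination.relative_to(example_root).as_posix()
    written = destination.read_bytes()
    print(relative)
```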

## Open WARC and extract files

In [None]:
def _write_output(record_id: str, payload: bytes, file_extension: str) -> None:
    """Write the payload to <output_root_dir>/<extension>/<record_id><extension>."""
    output_dir = output_root_dir / file_extension[1:]
    output_dir.mkdir(parents=True, exist_ok=True)
    destination = output_dir / f"{record_id}{file_extension}"
    destination.write_bytes(payload)


# Open the WARC file and iterate through its records
with open(warc_path, "rb") as file_pointer:
    for record in ArchiveIterator(file_pointer):
        # Skip everything except response records
        if record.rec_type != "response":
            continue
        raw_record_id = record.rec_headers.get_header("WARC-Record-ID")
        # Remove "<" and ">" and replace ":" so the ID can be used as a file name
        warc_record_id = raw_record_id.replace("<", "").replace(">", "").replace(":", "_")
        payload = record.content_stream().read()
        # Detect the MIME type from the payload itself
        detected_mime_type = detect_from_content(payload).mime_type
        file_extension = guess_extension(detected_mime_type)
        if file_extension is None:
            # Fall back to the Content-Type header, dropping parameters such as "; charset=utf-8"
            content_type = record.http_headers.get_header("Content-Type") if record.http_headers else None
            provided_mime_type = (content_type or "").split(";")[0].strip()
            print(f"Failed to detect file extension from mime type '{detected_mime_type}', using provided mime type '{provided_mime_type}' instead")
            file_extension = guess_extension(provided_mime_type)
        if file_extension is None:
            print(f"Failed to determine type of '{raw_record_id}'")
            continue
        _write_output(warc_record_id, payload, file_extension)
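
Once the loop has finished, it can be useful to check what was extracted. The sketch below counts files per extension under a directory; `count_files_by_extension` is a hypothetical helper that is not part of the notebook, and the demo builds a small temporary directory rather than reading the real output.

```python
from collections import Counter
from pathlib import Path
from tempfile import TemporaryDirectory


def count_files_by_extension(root: Path) -> Counter:
    """Count all files below root, grouped by file suffix."""
    return Counter(path.suffix for path in root.rglob("*") if path.is_file())


with TemporaryDirectory() as tmp:
    # Build a small stand-in for the output directory
    demo_root = Path(tmp)
    (demo_root / "html").mkdir()
    (demo_root / "pdf").mkdir()
    (demo_root / "html" / "a.html").write_bytes(b"")
    (demo_root / "html" / "b.html").write_bytes(b"")
    (demo_root / "pdf" / "c.pdf").write_bytes(b"")
    demo_counts = count_files_by_extension(demo_root)
    print(dict(demo_counts))
```

In the notebook itself, the same helper could be called as `count_files_by_extension(output_root_dir)`, assuming the extraction cell above has been run.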