 <font size=6> <b>Working with WARC Files</b></font> <br>
 <font size=4> The Gold Standard to Create Own Collections of Webdata in Panelformat</font>

by *Julian Oliver Dörr*

This notebook shows how to work with files in the **Web ARChive (WARC)** format. **WARC** is the **ISO standard** ([ISO 28500:2017](https://www.iso.org/standard/68004.html)) for archiving web data. As crawling and storing World Wide Web material grew in the late 1990s, the lack of a standardized storage format became a serious problem. The WARC format (the successor of the ARC format) was developed to allow standardized storage, management and exchange of digital objects from the web. Today, most national library systems recognise WARC as the standard for web archiving, and major web archives such as [Common Crawl](https://commoncrawl.org/the-data/get-started/) (CC) and the [Internet Archive](https://archive.org/) (IA) use it as the gold standard for storing their web crawls.

In simple terms, a WARC file is a sequence of records, and each record consists of two parts:
1. **header** information including compulsory metadata such as a Uniform Resource Identifier (URI), which is typically the website's URL, the crawling date, the content type and the length of the record, as well as non-compulsory metadata
2. **content** block which comprises the actual content (the so-called payload) found on the webpage (e.g. HTML code, PDFs, images, videos)

The International Internet Preservation Consortium ([IIPC](https://netpreserve.org/)) provides a detailed description of the WARC file format and its different elements which can be found [here](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/).
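To make the header/content split concrete, here is a minimal sketch that builds one WARC record by hand and separates it again using only the standard library. All values (URI, date, record ID, payload) are made up for illustration, and `split_record` is a hypothetical helper, not part of any WARC library; for real files a proper parser should be used.

```python
# A hand-written, minimal WARC 'response' record (illustrative values, not real crawl data)
content_block = (
    b'HTTP/1.0 200 OK\r\n'
    b'Content-Type: text/html\r\n'
    b'\r\n'
    b'<html><body>Hello</body></html>'
)
header = (
    b'WARC/1.0\r\n'
    b'WARC-Type: response\r\n'
    b'WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n'
    b'WARC-Date: 2021-07-27T12:00:27Z\r\n'
    b'WARC-Target-URI: https://www.example.com/\r\n'
    b'Content-Type: application/http; msgtype=response\r\n'
    b'Content-Length: ' + str(len(content_block)).encode('ascii') + b'\r\n'
)
# header, blank line, content block, and two CRLFs closing the record
raw_record = header + b'\r\n' + content_block + b'\r\n\r\n'


def split_record(raw):
    """Split one uncompressed WARC record into version line, header fields and content block."""
    head, _, rest = raw.partition(b'\r\n\r\n')       # the first blank line ends the WARC header
    lines = head.decode('utf-8').split('\r\n')
    version = lines[0]
    fields = dict(line.split(': ', 1) for line in lines[1:])
    length = int(fields['Content-Length'])           # Content-Length is the size of the content block
    return version, fields, rest[:length]


version, fields, block = split_record(raw_record)
print(version)
print(fields['WARC-Type'], fields['WARC-Target-URI'])
print(block.splitlines()[0])
```

For a `response` record the content block itself starts with the HTTP status line and HTTP headers, followed by the payload.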

The following code snippets show how to create your own WARC files, both by crawling live websites and by building collections from existing web archives (such as CC and IA). For this purpose, we work with the Python module [`warcio`](https://github.com/webrecorder/warcio). warcio provides a standalone way to read and write WARC files compliant with the WARC ISO standard. Importantly, the module is designed for fast, low-level access to web archival content and is oriented around a stream of WARC records rather than files.

In [3]:
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders
import requests # required to access live website content given URL

Here we define two helper functions that print the size of a file (needed later).

In [33]:
import gzip
import os

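def get_uncompressed_size(filename):
    # Seek to the end of the decompressed stream to obtain the uncompressed size in bytes
    with gzip.open(filename, 'rb') as fd:
        fd.seek(0, 2)
        size = fd.tell()
    print(size/1000000, 'MB')

def get_compressed_size(filename):
    # Size of the (compressed) file on disk in bytes
    print(os.stat(filename).st_size/1000000, 'MB')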

# WARC files from live websites 

First, we crawl the start pages of a set of corporate websites and then store the responses as WARC files.

## Write .warc files 

First we define a short list of (corporate) URLs whose current content we want to store in a WARC file.

In [4]:
urls = ['https://www.zew.de/', 'https://istari.ai/', 'https://new.siemens.com/']

We now create a .warc file and stream the web content found on the above websites into it, adhering to the WARC 1.0 ISO standard. Note that we compress the file with GZIP, which saves a significant amount of storage. GZIP is widely used and supported across many free and commercial software packages and operating systems. GZIP compression is common practice when creating WARC files, which is why most of them end in .warc.gz.

In [139]:
with open(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True) # instantiate the WARC writer object which writes the retrieved web data into the example.warc.gz file

    for url in urls:
        # get the webserver's response to the HTTP request ...
        response = requests.get(
            url,                            # ... of the given URL
            headers={'Accept-Encoding': 'identity'},
            stream=True                     # by default, when you make a request, the body of the response is downloaded immediately.
                               )            # Override this behaviour and defer downloading the response body until you access 
                                            # the Response.content attribute by setting stream = True
                                           
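        # get raw headers from the response as a list of (name, value) tuples
        # (list() is needed because newer urllib3 versions return a view without .append)
        headers_list = list(response.raw.headers.items())
        # append url to headers_list
        headers_list.append(('URL', url))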
        
        # from the raw headers create a WARC compatible header
        http_headers = StatusAndHeaders('200 OK', headers_list, protocol='HTTP/1.0')
        
        # create WARC record ...
        record = writer.create_warc_record(
            uri=url,                         # ... of the given URL which serves as Uniform Resource Identifier (URI) of the record
            record_type='response',
            payload=response.raw,            # content block found on website
            http_headers=http_headers)       # header information (metadata) of respective website

        writer.write_record(record)

The above code snippet has created a gzipped WARC file. The sizes of the uncompressed and the compressed file are as follows.

In [141]:
get_uncompressed_size(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example.warc.gz')
get_compressed_size(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example.warc.gz')

0.291773 MB
0.053284 MB


We see that GZIP compression shrinks the WARC file by a factor of roughly 5.5.
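The compression factor can also be computed directly instead of being read off the two printed numbers. The following is a minimal sketch using only the standard library; `size_report` is a hypothetical helper, and the demo file with repetitive HTML-like content is created purely for illustration, so its ratio says nothing about real crawls.

```python
import gzip
import os
import tempfile


def size_report(path):
    """Return (uncompressed_bytes, compressed_bytes, ratio) for a gzip file."""
    compressed = os.path.getsize(path)
    uncompressed = 0
    with gzip.open(path, 'rb') as fd:
        # read in chunks so that large archives do not have to fit into memory
        for chunk in iter(lambda: fd.read(1024 * 1024), b''):
            uncompressed += len(chunk)
    return uncompressed, compressed, uncompressed / compressed


# Hypothetical demo file with repetitive, HTML-like content
demo_data = b'<html><body><p>Hello WARC</p></body></html>\n' * 2000
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.warc.gz')
    with gzip.open(demo_path, 'wb') as fd:
        fd.write(demo_data)
    uncompressed, compressed, ratio = size_report(demo_path)

print(uncompressed, compressed, round(ratio, 1))

# Ratio for example.warc.gz, using the two sizes printed above
print(round(0.291773 / 0.053284, 2))  # 5.48
```

Returning the numbers rather than printing them makes it easy to collect sizes for many WARC files in a data frame.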

## Read .warc files 

The stored WARC file and its headers and payloads can be easily accessed via warcio's `ArchiveIterator`.

In [124]:
from warcio.archiveiterator import ArchiveIterator
import pandas as pd # used to store results in a data frame

In [146]:
wanted = ['URL', 'Date', 'Content-Type', 'Content-Length']
rows = []
with open(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # keep only the HTTP headers of interest (DataFrame.append was removed in pandas 2.0, so rows are collected in a list)
        rows.append({name: value for name, value in record.http_headers.headers if name in wanted})

df_header = pd.DataFrame(rows, columns=wanted)

df_header.reset_index(drop=True, inplace=True)

In [147]:
df_header

Unnamed: 0,URL,Date,Content-Type,Content-Length
0,https://www.zew.de/,"Tue, 27 Jul 2021 12:00:27 GMT",text/html; charset=utf-8,75275
1,https://istari.ai/,"Tue, 27 Jul 2021 12:00:27 GMT",text/html,194894
2,https://new.siemens.com/,"Tue, 27 Jul 2021 12:00:29 GMT",text/html; charset=utf-8,16954


We can see that the WARC file stores the following metadata: the `URL` of the scraped website, the `Date` of scraping, the `Content-Type` found on the website, which is HTML text in all instances, and finally the `Content-Length` of the content block (in bytes).

Besides the header information, the WARC file of course also includes the content blocks, which we now access using warcio's `content_stream`. We further use `BeautifulSoup` to parse the stored HTML content.

In [148]:
from bs4 import BeautifulSoup

In [189]:
rows = []
with open(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        url = record.rec_headers.get_header('WARC-Target-URI')
        html = record.content_stream().read()
        # strip the markup and turn line breaks and tabs into spaces
        text = BeautifulSoup(html, 'html.parser').get_text().replace('\n', " ").replace('\t', " ").strip()
        rows.append([url, text])

df_content = pd.DataFrame(rows, columns=['URL', 'html/text'])

df_content.reset_index(drop=True, inplace=True)

In [190]:
pd.set_option('max_colwidth', 400)
df_content

Unnamed: 0,URL,html/text
0,https://www.zew.de/,ZEW – Leibniz-Zentrum für Europäische Wirtschaftsforschung - Startseite Menü Presse Team Karriere ZEW-Newsletter Kontakt ZEWnews DE EN ...
1,https://istari.ai/,"ISTARI.AI - Die Zukunft der Unternehmensdatenbank Primary Menu HOME TEAM REFERENZEN WEBAI NEWS KONTAKT Die Zukunft der Unternehmensdatenbank – Umfangreiche Informationen in Echtzeit Verborgene Unternehmensdaten automatisiert sichtbar machen, ..."
2,https://new.siemens.com/,SiemensWe're sorry but the new Siemens doesn't work properly without JavaScript enabled. Please enable it to continue.


# WARC files from existing web archives 

In a second step, we build our own collection of websites that have already been stored in web archives by saving this historical web content into WARC files. This means that we retrieve only the relevant websites from existing web archives such as CC and IA; in our context, that is the historical content of corporate web domains only. Note that CC's and IA's web archives comprise far more than corporate websites. Working with their WARC files directly is therefore impractical, and it is necessary to create domain-specific collections in our own WARC files instead. In the following, we work with `cdx_toolkit`, which provides structured access to the archives via CC's and IA's CDX APIs.

In [209]:
import cdx_toolkit
from tqdm import tqdm # required for showing progress in accessing web archives
import pandas as pd
import re             # required for wildcarding URLs (see below)

## Access web archives and write .warc files  

First, we load a number of year-url combinations for a sample of corporate web domains.

In [195]:
df_url = pd.read_csv(r"Q:\Meine Bibliotheken\Research\01_Promotion\05_Ideas\06_GreenFinance\05_Data\mup2afid_urls_sample.txt", sep = "\t")
df_url.loc[(df_url.crefo==3270030744) & (df_url.year.isin([2019, 2020])),'url'] = "www.thomas-gruppe.de" # minor manual correction
df_mup = pd.read_csv(r"Q:\Meine Bibliotheken\Research\01_Promotion\05_Ideas\06_GreenFinance\05_Data\mup2afid.txt", sep="\t")
df_url

Unnamed: 0,crefo,year,url
0,5050292256,2010,www.tulip.de
1,5050292256,2011,www.tulip.de
2,5050292256,2012,www.tulip.de
3,5050292256,2013,www.tulip.de
4,5050292256,2014,www.tulip.de
...,...,...,...
1942,7290613987,2016,
1943,7290613987,2017,
1944,7290613987,2018,
1945,7290613987,2019,


When accessing web archives, it is most convenient to wildcard the corporate web domain's start page as this way all subpages will be extracted from the archives as well.

In [202]:
# Define function that allows wildcarding of URLs
def wildcarding(url):
    url_wildcarded = re.search(r'\..{1,}\.(?:com|de|rwe|net|eu|fr|heise-service)', url).group(0)[1:] + '/*'
    return url_wildcarded
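Note that the regular expression above requires a dot before the domain name and a top-level domain from the hard-coded list, so a URL such as `tulip.de` (without `www.`) would not match. As a hedged alternative, the sketch below derives the wildcard pattern with `urllib.parse` instead. `wildcard_host` is a hypothetical helper and behaves differently from `wildcarding`: it only drops a leading `www.`, whereas the regular expression drops whatever comes before the first dot.

```python
from urllib.parse import urlsplit


def wildcard_host(url):
    """Return '<host>/*' for a URL, dropping scheme, port, path and a leading 'www.'."""
    # urlsplit only recognises the host when a scheme or '//' is present
    if '//' not in url:
        url = '//' + url
    host = urlsplit(url).hostname or ''
    if not host:
        raise ValueError(f'no host found in {url!r}')
    if host.startswith('www.'):
        host = host[4:]
    return host + '/*'


for example in ['www.tulip.de', 'tulip.de', 'https://www.zew.de/', 'http://www.omya.de:80/C12574C800506F23/direct/Home']:
    print(example, '->', wildcard_host(example))
```

Because no list of top-level domains is needed, this variant does not have to be extended when new domains enter the sample.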

In [205]:
# Reset index for loop
df_url = df_url.reset_index(drop=True)
# Check if any url cannot be wildcarded
ind = []
for i in range(df_url.shape[0]):
    if pd.notna(df_url.url[i]):
        try:
            wildcarding(df_url.url[i])
        except AttributeError:   # re.search returned None, i.e. the pattern did not match
            ind.append(i)
            
# If the resulting list is empty, all URLs have been wildcarded successfully; otherwise it holds the indices of the URLs in df_url for which wildcarding failed
print(ind)

[]


In [206]:
# Conduct the wildcarding
df_url.loc[df_url.url.notnull(), 'url'] = df_url.loc[df_url.url.notnull(), 'url'].apply(wildcarding)

In [208]:
df_url

Unnamed: 0,crefo,year,url
0,5050292256,2010,tulip.de/*
1,5050292256,2011,tulip.de/*
2,5050292256,2012,tulip.de/*
3,5050292256,2013,tulip.de/*
4,5050292256,2014,tulip.de/*
...,...,...,...
1942,7290613987,2016,
1943,7290613987,2017,
1944,7290613987,2018,
1945,7290613987,2019,


We now define some parameters for accessing CC and IA and for storing the retrieved data in WARC files. Both steps are handled by `cdx_toolkit`.

In [211]:
client = cdx_toolkit.CDXFetcher(source='ia')         # define client for fetching data from source (ia: Internet Archive, cc: Common Crawl)
limit = 1000                                         # maximum number of captures to retrieve from the archive for each year-url combination
crefos = list(df_url.crefo.drop_duplicates().values) # list of unique company IDs (crefos) for which the panel dataset of corporate website content is created
len(crefos)

177

In [212]:
# A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or until next 'warcinfo' record.
# Typically, this appears once and at the beginning of a WARC file. 
# For a web archive, it often contains information about the web crawl which generated the following records.
warcinfo = {
    'software': 'pypi_cdx_toolkit iter-and-warc',
    'isPartOf': 'GREENWASHING-SAMPLE-IA',
    'description': 'warc extraction',
    'format': 'WARC file version 1.0',
}

In [215]:
for year in range(2010, 2011):
    writer = cdx_toolkit.warc.CDXToolkitWARCWriter(
        prefix='example',         # first part of .warc file where warc records will be stored
        subprefix=str(year),     # second part of .warc file where warc records will be stored
        info=warcinfo,           
        size=1000000000,         # once the .warc file exceeds 1 GB of size a new .warc file will be created for succeeding records
        gzip=True)            
    
    for i, crefo in enumerate(crefos[0:3]):
        row = df_url.loc[(df_url.crefo==crefo) & (df_url.year == year),:].squeeze(axis=0)
        
        if pd.isna(row.url):
            pass                 # skip if the firm did not exist in the respective year (i.e. the url entry is missing for that year)
        else:
            print(str(i), '- Crefo: ', str(row.crefo))
            for obj in tqdm(client.iter(row.url, from_ts=str(row.year), to=str(row.year), limit=limit, verbose='v', collapse='urlkey', filter=['status:200', 'mime:text/html'])):
                url = obj['url']
                status = obj['status']
                timestamp = obj['timestamp']

                try:
                    record = obj.fetch_warc_record()
                    # Save crefo into header information of the WARC record so it is not lost in the WARC file
                    record.rec_headers['crefo'] = str(row.crefo)
                except RuntimeError:
                    print(f'Skipping capture for RuntimeError 404: {url} {timestamp}')
                    continue
                writer.write_record(record)
                
       

0 ) Crefo:  5050292256


1it [00:02,  2.41s/it]


1 ) Crefo:  7010096344


55it [01:30,  1.64s/it]


2 ) Crefo:  3270014138


0it [00:00, ?it/s]


The above code snippet has created a gzipped WARC file. The output shows the number of archived webpages retrieved for the respective crefo from the Internet Archive for the year 2010. The sizes of the uncompressed and the compressed file are as follows.

In [219]:
get_uncompressed_size(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example-2010-000000.extracted.warc.gz')
get_compressed_size(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example-2010-000000.extracted.warc.gz')

1.055119 MB
0.289178 MB


We see that GZIP compression shrinks this WARC file by a factor of roughly 3.6.

## Read .warc files 

Again, the stored WARC file and its headers and payloads can be easily accessed via warcio's `ArchiveIterator`.

In [220]:
from warcio.archiveiterator import ArchiveIterator
import pandas as pd # used to store results in a data frame

Header information (metadata):

In [242]:
wanted = ['crefo', 'WARC-Source-URI', 'WARC-Date', 'Content-Type', 'Content-Length']
rows = []
with open(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example-2010-000000.extracted.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_headers['WARC-Type'] == 'warcinfo':
            continue             # skip the warcinfo record, which describes the file rather than a capture
        # keep only the WARC headers of interest (DataFrame.append was removed in pandas 2.0, so rows are collected in a list)
        rows.append({name: value for name, value in record.rec_headers.headers if name in wanted})

df_header = pd.DataFrame(rows, columns=wanted)

df_header.reset_index(drop=True, inplace=True)

In [243]:
df_header.head(5)

Unnamed: 0,crefo,WARC-Source-URI,WARC-Date,Content-Type,Content-Length
0,5050292256,https://web.archive.org/web/20100402084249id_/http%3A//www.tulip.de%3A80/,2010-04-02T08:42:52Z,application/http; msgtype=response,36018
1,7010096344,https://web.archive.org/web/20100327143928id_/http%3A//www.omya.de%3A80/,2010-03-27T14:56:08Z,application/http; msgtype=response,10134
2,7010096344,https://web.archive.org/web/20100824072815id_/http%3A//www.omya.de%3A80/C12574C800506F23/direct/Home,2010-08-24T07:48:58Z,application/http; msgtype=response,16723
3,7010096344,https://web.archive.org/web/20100824072820id_/http%3A//www.omya.de%3A80/C12574C800506F23/vwWebPagesByID/1AA35D5E7DF8930AC12574E300474F38,2010-08-24T07:49:04Z,application/http; msgtype=response,18842
4,7010096344,https://web.archive.org/web/20100802002229id_/http%3A//www.omya.de%3A80/C12574C800506F23/vwWebPagesByID/1C593814E6E896D2C12574E300474FEC,2010-08-02T00:42:36Z,application/http; msgtype=response,16002
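For a panel dataset it is handy to recover the capture timestamp and the original URL from the `WARC-Source-URI` values shown above. The following is a minimal sketch using only the standard library; `parse_source_uri` is a hypothetical helper, and the pattern assumes the Wayback Machine URL layout visible in the table (`/web/<14-digit timestamp>id_/<percent-encoded URL>`).

```python
import re
from datetime import datetime
from urllib.parse import unquote

# Assumed layout: .../web/<YYYYMMDDhhmmss>id_/<percent-encoded original URL>
WAYBACK_PATTERN = re.compile(r'/web/(\d{14})(?:id_)?/(.+)$')


def parse_source_uri(source_uri):
    """Return (capture datetime, original URL) for a Wayback Machine source URI."""
    match = WAYBACK_PATTERN.search(source_uri)
    if match is None:
        raise ValueError(f'not a Wayback Machine capture URI: {source_uri!r}')
    captured_at = datetime.strptime(match.group(1), '%Y%m%d%H%M%S')
    original_url = unquote(match.group(2))   # e.g. '%3A' becomes ':'
    return captured_at, original_url


captured_at, original_url = parse_source_uri(
    'https://web.archive.org/web/20100402084249id_/http%3A//www.tulip.de%3A80/'
)
print(captured_at.year, original_url)  # 2010 http://www.tulip.de:80/
```

Applied to the `WARC-Source-URI` column, this yields a year column that can be merged with the crefo-year structure of `df_url`.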


And of course the content block (payload):

In [225]:
with open(r'Q:\Meine Bibliotheken\Research\05_Code\02_ArchiveSpark\example-2010-000000.extracted.warc.gz', 'rb') as stream:
    for i, record in enumerate(ArchiveIterator(stream)):
        if record.http_headers is None:
            pass                 # records without HTTP headers (such as the warcinfo record) are skipped
        else:
            print(record.http_headers['X-Archive-X-Cache-Key'] + '\n' + BeautifulSoup(record.content_stream().read(), 'html.parser').get_text(strip=True))
            print('\n')
            if i > 2:
                break

httpsweb.archive.org/web/20100402084249id_/http%3A//www.tulip.de%3A80/DE
TULIP DeutschlandWillkommen auf der Homepage von Tulip, Germany.Bitte klicken Sie aufeinen der Buttons,um auf diejeweiligeHomepage zu gelangen:Tulip Food Company, DüsseldorfTulip Food Service, KielTulip Food Company, DüsseldorfAls Vertriebs-und Marketing-organisation mit Sitz in Düsseldorf beliefert die Tulip Food CompanyGmbH denEinzelhandel mit Kühl-und Tiefkühlprodukten sowieKonserven.Tulip Food Service, Kielbietet ein umfassendes GV-Sortiment tiefgekühlterundgekühlter Fleisch-Convenience-Produkte und ist attraktiverLieferant für dieweiterverarbeitendeIndustrie.


httpsweb.archive.org/web/20100327143928id_/http%3A//www.omya.de%3A80/DE
OMYA AG - white minerals, calcium carbonat and talcsOMYA Omya,paper,paints,plastics,adhesives,coatings,calcium carbonate,calcium,talc,white minerals,industrial minerals,fillers,extenders,pigments,chalk,limestone,marble,pcc,gcc,coating pigments Omya,Papier,Farben,Lacke,Kunststoff,El

# Why WARC files?

Besides WARC being the ISO standard for storing web data, working with WARC files has another big advantage: the web data collections can be processed with cluster computing frameworks such as Hadoop and Spark. These frameworks allow efficient processing, extraction and derivation of data from large archival collections (i.e. WARC files). One framework built for this purpose is [ArchiveSpark](https://github.com/helgeho/ArchiveSpark). Its main use case is efficient access to archival data with the goal of deriving corpora: filters and tools are applied to extract information from the original raw data, and the result is stored in a more accessible format such as JSON while preserving the data lineage of each derived value.

<img src="https://www.clipartmax.com/png/small/78-780281_whats-new-in-apache-spark-apache-spark-logo.png" alt="What's New In Apache Spark - Apache Spark Logo @clipartmax.com">

So even after reducing the large web archives of CC and IA to a collection of corporate website data only, we are still talking about a data volume close to 1 TB (my own rough estimate). To derive smaller corpora from this collection that comprise only the web content relevant for the research question at hand, ArchiveSpark and similar cluster computing solutions are the way to go.