<a href="https://colab.research.google.com/github/javieraespinosa/lifranum/blob/main/Crawl_and_WARC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Crawling and Archiving Websites

This notebook illustrates the use of [Wget](https://www.gnu.org/software/wget/manual/wget.html) for crawling and archiving websites using the [WARC file format](https://en.wikipedia.org/wiki/Web_ARChive).




# Configuration

## Updating Wget

Wget has supported writing WARC files since v1.14. Colab ships Wget v1.19.4 by default, but that build does not support WARC compression. The following cell builds and installs Wget 1.21, which supports per-record and file-level WARC compression and considerably reduces storage usage.

In [None]:
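TMP_DIR = "tmp"

# Build in a scratch directory that is removed afterwards
!mkdir -p {TMP_DIR}
%cd {TMP_DIR}

# Download and unpack the Wget 1.21 sources
!wget -nv  http://ftp.gnu.org/gnu/wget/wget-1.21.tar.gz
!tar -xzf wget-1.21.tar.gz

# Configure with OpenSSL support, then compile and install
!./wget-1.21/configure --quiet --with-ssl=openssl > /dev/null 2>&1

!make > log.txt 2>&1
!make install >> log.txt 2>&1

%cd ..
!rm -r {TMP_DIR}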
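
To confirm which Wget is active after the build, the version string can be parsed from Python. This is a minimal sketch: the helper names `parse_wget_version` and `installed_wget_version` are illustrative, and it assumes `wget --version` starts with a line of the form `GNU Wget <version> built on ...`. It only checks the version number against the v1.14 WARC threshold mentioned above, not whether compression support was compiled in.

```python
import re
import shutil
import subprocess


def parse_wget_version(version_output):
    """Extract the version tuple from the text printed by `wget --version`."""
    match = re.search(r"GNU Wget (\d+(?:\.\d+)*)", version_output)
    if match is None:
        raise ValueError("unrecognised `wget --version` output")
    return tuple(int(part) for part in match.group(1).split("."))


def installed_wget_version():
    """Return the version of the wget found on PATH, or None if unavailable."""
    if shutil.which("wget") is None:
        return None
    result = subprocess.run(["wget", "--version"], capture_output=True, text=True)
    try:
        return parse_wget_version(result.stdout)
    except ValueError:
        return None


version = installed_wget_version()
print("wget version:", version)
if version is not None and version < (1, 14):
    print("This Wget predates WARC support (added in v1.14).")
```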

## Wget General Parameters

The crawling examples in this notebook assume the existence of an `input.txt` file containing the list of URLs to crawl. You can modify the content of this file according to your needs. 

In [None]:
!echo "http://example.com/" > input.txt
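
For longer seed lists, the same file can be generated from Python. The sketch below is one possible approach, not part of the original notebook: `write_seed_urls` is a hypothetical helper that writes one URL per line, rejects anything that is not an absolute http(s) URL, and drops duplicates.

```python
from urllib.parse import urlsplit


def write_seed_urls(urls, path="input.txt"):
    """Write one absolute http(s) URL per line, skipping duplicates."""
    seen = []
    for url in urls:
        url = url.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if url not in seen:
            seen.append(url)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(seen) + "\n")
    return seen


# Equivalent to the `echo` cell above: a single seed URL
write_seed_urls(["http://example.com/"])
```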

The following variables define Wget's **default behaviour** when crawling. You can modify them at will, either here or directly in the cell executing `!wget`.

In [None]:
LEVEL=1       # maximum recursion depth when following links (i.e., crawl depth)
WAIT=0.1      # seconds to wait between consecutive requests

INPUT_FILE = "input.txt"     # list of URLs to crawl

OUT_DIR       = "WARC"       # folder where crawl results will be stored
OUT_WARC_FILE = "out"        # prefix for WARC files
OUT_LOG_FILE  = "log.txt"    # file containing Wget's log
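
If the crawl needs to be launched from plain Python rather than from an `!wget` cell, the same parameters can be assembled into an argument list. The following is a sketch under assumptions: `build_wget_args` is a hypothetical helper, its defaults mirror the variables above, and the flags are the ones used by the examples in this notebook (the `--user-agent` option is left out for brevity).

```python
import shlex


def build_wget_args(level=1, wait=0.1, input_file="input.txt",
                    warc_prefix="out", log_file="log.txt", span_hosts=False):
    """Return the wget argument list used by the examples in this notebook."""
    args = [
        "wget",
        "--delete-after", "-nd",
        f"--input-file={input_file}",
        "--recursive", f"--level={level}", "--no-parent",
        f"--wait={wait}", "--random-wait",
        "--follow-tags=a", "--accept", "html", "--adjust-extension",
        f"--warc-file={warc_prefix}", "--warc-cdx=on",
        "--warc-max-size=1G", "--no-warc-keep-log",
        f"--output-file={log_file}",
    ]
    if span_hosts:
        args.insert(1, "-H")  # also follow links that point to other hosts
    return args


# Print the command line without running it
print(shlex.join(build_wget_args()))
```

Passing the returned list to `subprocess.run` would start the crawl. Keeping the arguments as a list avoids shell quoting problems.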

# Examples

## Ex1. Crawling a specific domain



The following example crawls the URLs in `input.txt` and produces a WARC file containing only HTML pages (i.e., Wget ignores images, CSS, JavaScript and all other files). By default, Wget only follows links that stay on the same host as the starting URL. This behaviour is useful for crawling an entire site.

> With `LEVEL=1`, Wget stops after following links one level deep. To crawl an entire site, set `LEVEL=0` (no depth limit).


In [None]:
LEVEL=1
!wget \
  --delete-after -nd \
  --input-file={INPUT_FILE}  \
  --recursive   \
  --level={LEVEL}     \
  --no-parent   \
  --wait={WAIT}    \
  --random-wait   \
  --follow-tags=a \
  --accept html  \
  --adjust-extension \
  --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15" \
  --warc-file={OUT_WARC_FILE}  \
  --warc-cdx=on \
  --warc-max-size=1G  \
  --no-warc-keep-log  \
  --output-file={OUT_LOG_FILE} 


In [None]:
# Move resulting files to the OUT_DIR folder
!mkdir -p {OUT_DIR} 
!mv *.warc.gz *.cdx {OUT_LOG_FILE} {OUT_DIR} 
!cp {INPUT_FILE} {OUT_DIR}
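
To check what the crawl actually captured, the record headers of the resulting archives can be listed with the standard library alone. This is a simplified sketch, not a full WARC parser: `iter_warc_headers` is a hypothetical helper that assumes well-formed records (a `WARC/` version line, headers, a blank line, then `Content-Length` bytes of body), and the `WARC/*.warc.gz` pattern assumes the files were moved as in the cell above. Python's `gzip` module reads per-record compressed files transparently.

```python
import glob
import gzip


def iter_warc_headers(stream):
    """Yield one dict of WARC headers per record from a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # a blank line ends the header block
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body so its content is never mistaken for headers
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


# List the type and target URI of every record in the archived crawl
for path in sorted(glob.glob("WARC/*.warc.gz")):
    with gzip.open(path, "rb") as stream:
        for headers in iter_warc_headers(stream):
            print(headers.get("WARC-Type"), headers.get("WARC-Target-URI"))
```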


## Ex2. Multi-host crawling


This example repeats the previous crawl but also follows links pointing to other hosts.

> **Attention!** If `LEVEL=0`, Wget places no limit on the crawl depth and, with host spanning enabled, will try to crawl the entire web!

In [None]:
LEVEL=1
# -H (--span-hosts) lets wget follow links to any host
!wget \
  -H  \
  --delete-after -nd \
  --input-file={INPUT_FILE}  \
  --recursive   \
  --level={LEVEL}     \
  --no-parent   \
  --wait={WAIT}    \
  --random-wait   \
  --follow-tags=a \
  --accept html  \
  --adjust-extension \
  --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15" \
  --warc-file={OUT_WARC_FILE}  \
  --warc-cdx=on \
  --warc-max-size=1G  \
  --no-warc-keep-log  \
  --output-file={OUT_LOG_FILE} 

In [None]:
# Move resulting files to the OUT_DIR folder
!mkdir -p {OUT_DIR} 
!mv *.warc.gz *.cdx {OUT_LOG_FILE} {OUT_DIR} 
!cp {INPUT_FILE} {OUT_DIR}
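
A quick way to see how far a multi-host crawl spread is to count the captured URLs per host. The sketch below assumes the URLs are already available as a list of strings (for instance, collected from the `WARC-Target-URI` headers of the archive). `count_hosts` is a hypothetical helper, and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(uris):
    """Count how many URIs belong to each host name."""
    return Counter(urlsplit(uri).hostname for uri in uris)


# Made-up sample; replace with the URIs captured by the crawl
sample = [
    "http://example.com/",
    "http://example.com/about",
    "https://www.iana.org/domains/example",
]
for host, total in count_hosts(sample).most_common():
    print(host, total)
```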