# Exploration of WebArchives: Quickstart (Docker)

## Environment configuration

Initialize Spark in a [single-node cluster](https://docs.databricks.com/clusters/single-node.html) with the AUT (Archives Unleashed Toolkit) and GraphFrames libraries.


In [None]:
%run ../scripts/spark-init-docker.ipynb
spark

## LIFRANUM dataset

WARC collections available in Google Cloud Storage:

| WARC collection | Size |
| --- | --- |
| lifranum-method | 2.84 GB |
| autres | 721 MB |
| cartoweb | 336 MB |
| repo-ecritures-num | 158 MB |

> **Tip**: Experiment with the smallest collection (`repo-ecritures-num`) first, then move to the bigger collections once your code is ready.

In [None]:
%%capture
DIR="LIFRANUM"
!mkdir -p $DIR

# --------------------------------------------------------
# UNCOMMENT THE LINE(S) BELOW TO DOWNLOAD
# THE WARC COLLECTION(S) OF YOUR CHOICE
# --------------------------------------------------------

# !gsutil -m cp -r gs://cpe-lyon/LIFRANUM/autre $DIR
# !gsutil -m cp -r gs://cpe-lyon/LIFRANUM/cartoweb $DIR
# !gsutil -m cp -r gs://cpe-lyon/LIFRANUM/lifranum-method $DIR
# !gsutil -m cp -r gs://cpe-lyon/LIFRANUM/repo-ecritures-num $DIR
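
Once a collection has been downloaded, it can help to check which WARC files are actually on disk and how big they are before pointing Spark at them. The following is a minimal sketch using only the Python standard library. The helper name `list_warcs` is hypothetical (not part of AUT), and it assumes the files end in `.warc.gz`, as in the path used later in this notebook.

```python
from pathlib import Path


def list_warcs(root="LIFRANUM"):
    """Return (path, size_in_MB) pairs for every *.warc.gz file under root, smallest first."""
    root = Path(root)
    if not root.is_dir():
        # Nothing downloaded yet (or wrong folder): return an empty list instead of failing.
        return []
    files = [(p, p.stat().st_size / 1e6) for p in root.rglob("*.warc.gz")]
    return sorted(files, key=lambda item: item[1])


# Print the files found under the default folder, smallest first.
for path, size_mb in list_warcs():
    print(f"{size_mb:10.1f} MB  {path}")
```

Sorting by size makes it easy to pick the smallest file for a first experiment.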

## Querying Web Archives

Note:

* **AUT generates dataframes**. See the [AUT dataframe schemas](https://aut.docs.archivesunleashed.org/docs/dataframe-schemas) for more info.
* Check the [AUT documentation](https://aut.docs.archivesunleashed.org/docs/home) for more examples.



In [None]:
from aut import *


# Read a WARC file from the LIFRANUM folder
WARCs_path = "LIFRANUM/repo-ecritures-num/out-00000.warc.gz"

df = WebArchive(spark.sparkContext, sqlContext, WARCs_path)

r = df.all().count()              # all records in the archive
p = df.webpages().count()         # df.webpages() is an expensive operation!

print("Number of records:", r)
print("Number of pages:",   p)
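
Because some of these operations are expensive, it can be useful to measure how long they take on the smallest collection before moving to a bigger one. The sketch below is one possible way to do that with the standard library. The helper `timed` is hypothetical (not part of AUT or Spark), and the workload shown is a stand-in so that the cell runs without Spark.

```python
import time


def timed(label, action):
    """Run a zero-argument callable, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result


# Stand-in workload (assumption: replaced by a Spark action in the real notebook).
total = timed("sum of 1..1000", lambda: sum(range(1, 1001)))
print(total)
```

In this notebook, the same helper could wrap a Spark action, for example `timed("webpages count", lambda: df.webpages().count())`.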