# Access and Decode Common Crawl Data

> Please follow the instructions below to set up the environment before using the code.

## Setup
### Install Dependencies 
```zsh
pip install pandas
pip install comcrawl 
pip install beautifulsoup4
```
### Update the `comcrawl` package 
Common Crawl has changed its file download path, so the installed `comcrawl` package needs a small patch. Locate the installed `comcrawl` package directory (in most IDEs, hold `Ctrl` and click the `comcrawl` name in an import statement). Open `utils/download.py` and replace the following definitions with the code provided:
- `download_multiple_results`
  ```python
  def download_multiple_results(results: ResultList, threads: int = None) -> ResultList:
      # Initialise the list so the single-threaded branch can append to it.
      results_with_html = []
      # multi-threaded download
      if threads:
          multithreaded_download = make_multithreaded(download_single_result, threads)
          results_with_html = multithreaded_download(results)
      # single-threaded download
      else:
          for result in results:
              # Retry each record until it downloads successfully.
              success = False
              while not success:
                  try:
                      result_with_html = download_single_result(result)
                      results_with_html.append(result_with_html)
                      success = True
                  except Exception as e:
                      print(f"Library Error: download_single_result failed ({e!r}), retrying...")
      return results_with_html
  ```
- `URL_TEMPLATE`
  ```python
  URL_TEMPLATE = "https://data.commoncrawl.org/{filename}"
  ```
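
If no IDE is at hand, the package directory can also be found from Python itself. The following is a minimal sketch using only the standard library; `locate_package_file` is a hypothetical helper name, not part of `comcrawl`.

```python
import importlib.util
from pathlib import Path


def locate_package_file(package: str, relative_path: str) -> Path:
    """Return the path of a file inside an installed package."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package} is not installed in this environment")
    # spec.origin points at the package's __init__.py; its parent is the package directory.
    return Path(spec.origin).parent / relative_path
```

Calling `locate_package_file("comcrawl", "utils/download.py")` in the environment where `comcrawl` was installed should give the file to edit.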

## Usage

### Specify crawl data range and source
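Set `time_code` to the crawl to search (the part of the crawl ID after `CC-MAIN-`) and `searching_uri` to the URL pattern to look up. A trailing `*` matches every URL under that prefix.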

```python
# Crawling time codes can be accessed at:
# https://commoncrawl.org/the-data/get-started/
time_code = "2022-05"
searching_uri = "www.cna.com.tw/news/afe/*"
```
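
The available time codes can also be listed programmatically. The sketch below assumes the index server's collection list at `https://index.commoncrawl.org/collinfo.json` is reachable and that each entry carries an `id` such as `CC-MAIN-2022-05`. The helper names and the sample entries are illustrative, not taken from this project.

```python
import json
from urllib.request import urlopen

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def fetch_collinfo(url: str = COLLINFO_URL) -> list:
    """Download the list of available crawls (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def time_codes_from_collinfo(entries: list) -> list:
    """Strip the 'CC-MAIN-' prefix from each crawl ID."""
    prefix = "CC-MAIN-"
    return [entry["id"][len(prefix):] for entry in entries
            if entry["id"].startswith(prefix)]


# Hand-written sample in the shape of collinfo.json entries (illustrative only).
sample = [
    {"id": "CC-MAIN-2022-05", "name": "January 2022 Index"},
    {"id": "CC-MAIN-2021-49", "name": "November/December 2021 Index"},
]
print(time_codes_from_collinfo(sample))  # ['2022-05', '2021-49']
```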

### Import Dependencies

```python
from comcrawl import IndexClient
import pandas as pd
import os
```

### Create Output Directories

```python
searching_uri_dir = searching_uri.replace("/", "-").replace("*", "-all")
# makedirs creates any missing parent directories; exist_ok avoids an error on reruns.
os.makedirs(f"output/{time_code}/{searching_uri_dir}", exist_ok=True)
```
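
To make the resulting layout concrete, here is a small sketch of the same name mapping as a function. `output_dir_for` is a hypothetical helper, and the `root` parameter is an assumption added for illustration.

```python
from pathlib import Path


def output_dir_for(time_code: str, searching_uri: str, root: str = "output") -> Path:
    """Map a crawl time code and URI pattern to the output directory."""
    # Same replacements as above: '/' becomes '-', and '*' becomes '-all'.
    uri_dir = searching_uri.replace("/", "-").replace("*", "-all")
    return Path(root) / time_code / uri_dir


print(output_dir_for("2022-05", "www.cna.com.tw/news/afe/*").as_posix())
# output/2022-05/www.cna.com.tw-news-afe--all
```

The double hyphen before `all` comes from the `/` that precedes the `*` in the pattern.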

### Retrieve Pages from Index
Note that the Common Crawl Index Server is often under heavy load and frequently responds with a `504` timeout, so the search is retried until it succeeds.

```python
client = IndexClient([time_code], verbose=True)
success = False
while not success:
    try:
        client.search(searching_uri)
        # Keep only the most recent capture of each URL.
        client.results = (pd.DataFrame(client.results)
                          .sort_values(by="timestamp")
                          .drop_duplicates("urlkey", keep="last")
                          .to_dict("records"))
        pd.DataFrame(client.results).to_csv(f"output/{time_code}/{searching_uri_dir}/index.csv", index=False)
        success = True
    except Exception as e:
        # Catch Exception rather than using a bare except so Ctrl+C still stops the loop.
        print(f"Index search failed ({e!r}). Retrying...")
```
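
An unbounded retry loop can hammer a server that is already overloaded. One alternative is a bounded retry with exponential backoff, sketched below with the standard library only. `retry_with_backoff` and its parameters are hypothetical and not part of `comcrawl`.

```python
import time


def retry_with_backoff(action, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error!r}); retrying in {delay:.0f}s...")
            sleep(delay)

# Possible use with the client above (assumes the action raises on failure):
# retry_with_backoff(lambda: client.search(searching_uri))
```

Whether `client.search` raises on a timeout or simply returns an empty result list may depend on the installed `comcrawl` version, so the wrapped action may need to raise explicitly when no results come back.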

### Download the files 
Depending on the number of webpages and the pages' sizes, this can take up to a few hours. 

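```python
success = False
while not success:
    try:
        client.download()
        success = True
    except Exception as e:
        # Catch Exception rather than using a bare except so Ctrl+C still stops the loop.
        print(f"Download Server Error ({e!r}). Retrying...")
```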
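
For context on why this step is slow: each page is fetched as a separate byte range out of a large compressed archive file. The sketch below illustrates that idea under the assumption that index records carry `offset` and `length` fields and that each record is an independently gzipped block. Both function names are hypothetical, and this is not `comcrawl`'s actual implementation.

```python
import gzip


def byte_range_header(offset, length):
    """Build the HTTP Range header covering one record in an archive file."""
    start = int(offset)            # index records store these values as strings
    end = start + int(length) - 1  # the end of an HTTP byte range is inclusive
    return {"Range": f"bytes={start}-{end}"}


def html_from_record(compressed: bytes) -> str:
    """Decompress one gzipped record and return its HTTP body."""
    text = gzip.decompress(compressed).decode("utf-8", errors="replace")
    # A response record has three parts separated by blank lines:
    # archive headers, HTTP headers, and the body.
    parts = text.strip().split("\r\n\r\n", 2)
    return parts[2] if len(parts) == 3 else ""
```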

### Extract and Write Paragraphs

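```python
# web_parse is assumed to be a local module shipped with this project.
from web_parse import extract_main_paragraphs
for result in client.results:
    extracted_text = extract_main_paragraphs(result["html"])
    # Skip pages with too little text to be a real article.
    if len(extracted_text) < 100:
        continue
    # Write as UTF-8 explicitly so non-ASCII text is saved correctly on every platform.
    with open(f"output/{time_code}/{searching_uri_dir}/{result['urlkey'].replace('/', '-')}.txt", "w", encoding="utf-8") as f:
        f.write(extracted_text)
```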
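
The `web_parse` module is not shown here. As a rough idea of what a paragraph extractor might do, the following stand-in collects the text inside `<p>` tags using only the standard library's `html.parser`. It is a hypothetical sketch with hypothetical names, not the project's `extract_main_paragraphs`, which may well use `beautifulsoup4` and different rules.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def flush(self):
        # Store the buffered text as one paragraph if it is not empty.
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # a new <p> implicitly closes an unclosed one
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs_sketch(html: str) -> str:
    """Return the page's paragraph texts joined by newlines."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    return "\n".join(collector.paragraphs)
```

Text outside `<p>` elements, such as navigation menus and scripts, is ignored by this sketch.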