# How to generate Seed URLs from CommonCrawl

This notebook explains how to generate seed URLs from CommonCrawl.

First, decide which languages you want to generate seed URLs for. CommonCrawl identifies languages with ISO 639-3 codes. You can find the list of codes in the [ISO 639 Code Tables](https://iso639-3.sil.org/code_tables/639/data).

In this example, we generate seed URLs for these languages:

| Language    | ISO-639-3 Code |
|-------------|----------------|
| Swahili     | swa            |
| Hausa       | hau            |
| Amharic     | amh            |
| Yoruba      | yor            |
| Oromo       | orm            |
| Kinyarwanda | kin            |
| Kirundi     | run            |


## Step 1: Download URLs from CommonCrawl

Next, set up the Amazon Athena database for CommonCrawl by following the article [How to query Common Crawl data using Amazon Athena](https://medium.com/@vtbs55596/how-to-query-common-crawl-data-using-amazon-athena-416ad13e54f8). You need a credit card for this, but the query is not very expensive (less than 1 USD).

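After the setup, execute the following query, which contains the list of language codes. Note that `content_languages` can hold several comma-separated codes for one page, so the `IN` filter only matches pages where exactly one of the listed languages was detected.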

```
SELECT url, url_host_registered_domain, content_languages
FROM "ccindex"."ccindex"
WHERE subset = 'warc' AND content_languages IN ('swa', 'hau', 'amh', 'yor', 'orm', 'kin', 'run');
```
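If you change the language list often, the query text can be generated instead of edited by hand. The following is a minimal sketch; `build_query` is a hypothetical helper name, not part of any CommonCrawl or Athena tooling, and it only produces the query string that you then paste into the Athena console.

```python
def build_query(languages):
    """Return the Athena query text for the given ISO 639-3 codes."""
    for code in languages:
        # Guard against typos: ISO 639-3 codes are three lowercase ASCII letters.
        if not (len(code) == 3 and code.isascii() and code.isalpha() and code.islower()):
            raise ValueError(f"not an ISO 639-3 code: {code!r}")
    in_list = ", ".join(f"'{code}'" for code in languages)
    return (
        "SELECT url, url_host_registered_domain, content_languages\n"
        'FROM "ccindex"."ccindex"\n'
        f"WHERE subset = 'warc' AND content_languages IN ({in_list});"
    )


print(build_query(["swa", "hau", "amh"]))
```

The format check only catches malformed codes. It cannot tell whether a well-formed code is the right one for a language, so it is still worth looking each code up in the ISO 639-3 table.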

You can download the results as CSV. The result files have cryptic names; in our case, it is `d07aed5d-212e-4741-8b48-e854939c20ff.csv`. Let's assume you downloaded the file to `~/Downloads/d07aed5d-212e-4741-8b48-e854939c20ff.csv`. The file is large, in our case about 700 MB.
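Because the export is large, it can be useful to sanity-check it before loading everything into memory. The sketch below streams the file row by row with the standard `csv` module and counts rows per language. The function name `count_languages` and the sample rows are made up for illustration; only the three column names come from the query.

```python
import csv
import io
from collections import Counter


def count_languages(lines):
    """Count rows per content_languages value while streaming the CSV."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row["content_languages"]] += 1
    return counts


# Made-up sample with the same three columns as the query result.
sample = io.StringIO(
    '"url","url_host_registered_domain","content_languages"\n'
    '"http://a.example/","a.example","swa"\n'
    '"http://b.example/","b.example","orm"\n'
    '"http://a.example/x","a.example","swa"\n'
)
counts = count_languages(sample)
print(counts.most_common())
```

For the real file, pass an open file object instead of the sample, for example one opened with `open(os.path.expanduser(path), newline="", encoding="utf-8")`.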

## Step 2: Convert URLs into the proper format

You need to have the pandas library installed to execute the code.

In [2]:
!pip install pandas



In [16]:
import pandas as pd

infile = '~/Downloads/d07aed5d-212e-4741-8b48-e854939c20ff.csv'
df = pd.read_csv(infile)
# drop duplicate urls
df = df.drop_duplicates('url').copy()
df

|         | url                                                | url_host_registered_domain | content_languages |
|---------|----------------------------------------------------|----------------------------|-------------------|
| 0       | http://ambachhof.at/                               | ambachhof.at               | swa               |
| 1       | http://members.aon.at/lemu/Homepage/Index.htm      | aon.at                     | orm               |
| 3       | http://154.118.228.138:8080/mis/                   |                            | swa               |
| 4       | http://serma.al/index.php/en/products/producer...  | serma.al                   | orm               |
| 5       | http://41.41.114.242/EyeCare/public/index.php/...  |                            | orm               |
| ...     | ...                                                | ...                        | ...               |
| 7406702 | https://aguiarbuenosaires.com/category/tango-m...  | aguiarbuenosaires.com      | kin               |
| 7406703 | https://aguiarbuenosaires.com/tag/buenos-aires...  | aguiarbuenosaires.com      | kin               |
| 7406704 | https://aguiarbuenosaires.com/tag/documentos/      | aguiarbuenosaires.com      | kin               |
| 7406705 | https://aguiarbuenosaires.com/tag/vegetarianos/    | aguiarbuenosaires.com      | kin               |


First, let's compute some statistics on the data.

In [17]:
# how many urls per language?
df.content_languages.value_counts()

content_languages
swa    1863433
kin     619991
hau     571159
amh     500201
yor     308485
orm      80595
Name: count, dtype: int64

In [18]:
# how many domains per language?
df.drop_duplicates('url_host_registered_domain').content_languages.value_counts()

content_languages
swa    19568
kin    16091
hau    10834
amh     7799
yor     5104
orm     4620
Name: count, dtype: int64
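Note that `drop_duplicates('url_host_registered_domain')` keeps each domain once, under the language of its first row, so a domain that serves several languages is counted for only one of them. If you would rather count the distinct domains within each language, a `groupby` with `nunique` is one alternative. The sketch below compares both on a tiny made-up DataFrame with the same columns.

```python
import pandas as pd

# Tiny made-up sample with the same columns as the Athena export.
sample = pd.DataFrame({
    "url": ["http://a.example/1", "http://a.example/2", "http://b.example/"],
    "url_host_registered_domain": ["a.example", "a.example", "b.example"],
    "content_languages": ["swa", "kin", "swa"],
})

# Approach used above: every domain is attributed to the language of its first row.
first_only = (
    sample.drop_duplicates("url_host_registered_domain")
    .content_languages.value_counts()
)

# Alternative: number of distinct domains seen within each language.
per_language = (
    sample.groupby("content_languages")["url_host_registered_domain"].nunique()
)

print(first_only)
print(per_language)
```

Here `a.example` has pages in both `swa` and `kin`, so the first approach never counts it for `kin`, while the second counts it for both. The two approaches also treat missing domains differently: `nunique` ignores them, whereas `drop_duplicates` keeps one row for all URLs without a registered domain (such as bare IP addresses).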

Next, let's define the scripts. Scripts are encoded with ISO 15924 codes; see the [ISO 15924 Code List](https://unicode.org/iso15924/iso15924-codes.html). We use the Latin script (`Latn`) for all languages except Amharic, for which we use the Ge'ez (Ethiopic) script (`Ethi`).

In [21]:
languages2scripts = {'swa': 'Latn',
 'orm': 'Latn',
 'kin': 'Latn',
 'amh': 'Ethi',
 'hau': 'Latn',
 'yor': 'Latn'}
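Before writing any output, it may be worth checking that every language in the data has a script assigned, since a missing entry would otherwise only surface later as a `KeyError`. A minimal sketch, where `missing_scripts` is a hypothetical helper and the sample values are made up:

```python
def missing_scripts(languages, mapping):
    """Return the language codes that have no entry in the script mapping."""
    return sorted(set(languages) - set(mapping))


# Made-up example: 'yor' appears in the data but has no script assigned.
sample_mapping = {"swa": "Latn", "amh": "Ethi"}
sample_languages = ["swa", "amh", "yor", "swa"]
print(missing_scripts(sample_languages, sample_mapping))
```

In the notebook, the equivalent call would be `missing_scripts(df.content_languages.unique(), languages2scripts)`, which should return an empty list.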

Next, let's create the output files.

In [None]:
import os
import shutil
import gzip

outfolder = '/tmp/crawlzilla-seeds'

if os.path.exists(outfolder):
    shutil.rmtree(outfolder)

os.mkdir(outfolder)

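# one gzip-compressed text writer per language, keyed by language code
writers = {language: gzip.open(os.path.join(outfolder, language + ".txt.gz"), "wt", encoding="utf-8")
           for language in languages2scripts}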
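Keeping one open file per language means several handles have to be closed again, even if an error occurs halfway through. One way to handle this is `contextlib.ExitStack`, sketched below. The helper name `write_per_language` and the one-URL-per-line layout are assumptions for illustration only and are not necessarily the seed format the crawler expects.

```python
import gzip
import os
import tempfile
from contextlib import ExitStack


def write_per_language(rows, languages, outfolder):
    """Write (language, url) pairs to <outfolder>/<language>.txt.gz, one URL per line."""
    with ExitStack() as stack:
        # Every writer is registered on the stack and closed when the block exits.
        writers = {
            language: stack.enter_context(
                gzip.open(os.path.join(outfolder, language + ".txt.gz"), "wt", encoding="utf-8")
            )
            for language in languages
        }
        for language, url in rows:
            writers[language].write(url + "\n")


# Made-up rows, written to a temporary folder.
demo_dir = tempfile.mkdtemp()
demo_rows = [("swa", "http://example.org/"), ("amh", "http://example.com/")]
write_per_language(demo_rows, ["swa", "amh"], demo_dir)

with gzip.open(os.path.join(demo_dir, "swa.txt.gz"), "rt", encoding="utf-8") as f:
    swa_content = f.read()
print(swa_content)
```

Opening the files with `gzip.open` in text mode (`"wt"`) makes the `.txt.gz` files real gzip archives; a plain `open` would write uncompressed text under a misleading file name.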