# Markup to Identify Adult Content

## HTML Metatag "Rating"

Specified in the HTML `<head>` element:

```html
<meta name="rating" content="adult">
<!-- or alternatively -->
<meta name="rating" content="RTA-5042-1996-1400-1577-RTA">
```

or in the HTTP response header, which also allows tagging non-HTML content:

```
HTTP/1.1 200 OK
Server: Apache/2.4.6
Rating: RTA-5042-1996-1400-1577-RTA
...
```
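
To illustrate how a consumer might read both variants, here is a minimal sketch using only the Python standard library. The names `MetaRatingParser`, `find_ratings`, `is_adult` and the `ADULT_VALUES` set are assumptions made for this example; they are not part of any specification or of the Spark job described later.

```python
from html.parser import HTMLParser

# Assumption for illustration: the values treated as "adult" ratings.
ADULT_VALUES = {"adult", "mature", "rta-5042-1996-1400-1577-rta"}


class MetaRatingParser(HTMLParser):
    """Collect the content of every <meta name="rating"> element."""

    def __init__(self):
        super().__init__()
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so normalize them.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() == "rating":
            self.ratings.append(attributes.get("content", "").strip())


def find_ratings(html, headers=None):
    """Return rating values from the HTML metatag and the HTTP header."""
    parser = MetaRatingParser()
    parser.feed(html)
    ratings = list(parser.ratings)
    for name, value in (headers or {}).items():
        if name.lower() == "rating":  # HTTP header names are case-insensitive
            ratings.append(value.strip())
    return ratings


def is_adult(ratings):
    """True if any rating value is one of the assumed adult values."""
    return any(rating.lower() in ADULT_VALUES for rating in ratings)
```

Calling `find_ratings(html, headers)` with the response headers as a plain dictionary covers non-HTML content as well, since the header lookup does not depend on the parsed markup.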

References:
- https://www.asacp.org/index.html?content=RTA
- https://www.rtalabel.org/index.php?content=howto
- https://developers.google.com/search/docs/crawling-indexing/special-tags
- https://developers.google.com/search/docs/crawling-indexing/safesearch
- https://davidwalsh.name/rta-label
- https://wiki.whatwg.org/wiki/MetaExtensions


## Schema.org Annotation

https://schema.org/AdultOrientedEnumeration

(so far not observed "in the wild")


## Extraction of "Rating" Metadata from Common Crawl WAT Files


### Sample Set of WAT Files

1. download WAT file paths
   ```
   curl -O https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/wat.paths.gz
   ```

2. reproducibly sample a list of WAT files
   - based on the SHA-512 hash of the WAT path
   - select paths starting with `f[012]` which results in a pseudo-random sample of about ${3\over{256}} = 1.17\%$ of the paths listed, here 1007 files
   - two shell one-liners to do the sampling
     a.
       ```bash
       python3 -c 'import gzip,hashlib,re; [print(line) for line in map(str.strip, gzip.open("wat.paths.gz", "rt")) if re.match("f[012]", hashlib.sha512(line.encode("ascii")).hexdigest())]' >wat.sample.paths
       ```
     b.
       ```bash
       zcat wat.paths.gz | while read -r line; do digest=$(printf '%s' "$line" | sha512sum); if [[ $digest =~ ^f[012] ]]; then echo "$line"; fi; done >wat.sample.paths
       ```
3. (for further experiments) prepare the list of corresponding WET files
   ```
   sed 's@wat@wet@g' wat.sample.paths >wet.sample.paths
   ```
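
The hash-based sampling above can also be written as a small reusable function, which makes it easy to check the expected sampling rate. This is a sketch only: the function name `sample_paths` and the synthetic paths are made up for this example.

```python
import hashlib
import re


def sample_paths(paths, digest_prefix="f[012]"):
    """Keep the paths whose SHA-512 hex digest starts with the given regex."""
    pattern = re.compile(digest_prefix)
    return [
        path for path in paths
        if pattern.match(hashlib.sha512(path.encode("ascii")).hexdigest())
    ]


# Synthetic, made-up paths, used only to check the sampling rate.
paths = [f"example/path-{i:05d}.warc.wat.gz" for i in range(100_000)]
sample = sample_paths(paths)

# The first hex digit must be "f" (1/16) and the second one of "0", "1", "2"
# (3/16), so the rate should be close to 3/256 = 0.0117.
print(len(sample) / len(paths))
```

Because the decision depends only on the path itself, the same files are selected on every run and on every machine, and a larger sample (for example `f[0-5]`) is a superset of a smaller one.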

### Extraction of "Rating" Metadata

Run the Spark job [MetaRatingCountJob](./meta_rating_counts.py), based on [cc-pyspark](https://github.com/commoncrawl/cc-pyspark).

Here are the job counters:
```
2023-03-09 15:13:19,078 INFO MetaRatingCount: WARC/WAT/WET input files processed = 1007
2023-03-09 15:13:19,081 INFO MetaRatingCount: WARC/WAT/WET input files failed = 0
2023-03-09 15:13:19,084 INFO MetaRatingCount: WARC/WAT/WET records processed = 109479686
2023-03-09 15:13:19,087 INFO MetaRatingCount: records failed to process = 0
2023-03-09 15:13:19,090 INFO MetaRatingCount: response records WAT = 36492893
2023-03-09 15:13:19,093 INFO MetaRatingCount: records WAT parsing failed = 0
2023-03-09 15:13:19,096 INFO MetaRatingCount: records not HTML = 553543
2023-03-09 15:13:19,099 INFO MetaRatingCount: records with rating = 498128
2023-03-09 15:13:19,102 INFO MetaRatingCount: records with rating HTML metatag = 497901
2023-03-09 15:13:19,104 INFO MetaRatingCount: records with rating HTTP header = 2450
2023-03-09 15:13:19,107 INFO MetaRatingCount: records with rating HTTP header not HTML = 0
```
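
A few derived figures follow directly from these counters. The sketch below assumes that "records with rating" counts every record that carries a metatag, an HTTP header, or both; that reading is an assumption and is not confirmed by the job output itself.

```python
# Counter values copied from the job log above.
response_records = 36_492_893
with_rating = 498_128
with_metatag = 497_901
with_header = 2_450

# Share of response records carrying any "rating" metadata.
share = with_rating / response_records

# Assuming with_rating is the union of both sources, inclusion-exclusion
# gives the records carrying both the metatag and the HTTP header.
both = with_metatag + with_header - with_rating

# Records where the rating comes from the HTTP header only.
header_only = with_rating - with_metatag

print(f"share with rating: {share:.4f}")
print(f"metatag and header: {both}, header only: {header_only}")
```

Under that assumption, most pages that send the `Rating` HTTP header also include the metatag.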

The resulting counts can be downloaded using the [aws-cli](https://github.com/aws/aws-cli) by running
```bash
for i in $(seq 0 7); do aws --no-sign-request s3 cp s3://commoncrawl-dev/oscar/metadata-rating/wat-1k-metadata-rating/part-0000$i-103b9138-9d7b-47bf-bda6-14915ac04abc-c000.gz.parquet wat-1k-metadata-rating/; done
```

### Exploring "Rating" Metadata

Following the job counters, there are about 500k pages with a "rating" metatag - out of 36 million WAT response records. But which "rating" values are observed?

We read the extracted metadata into a [polars DataFrame](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/index.html)...

In [1]:
```python
import polars as pl

pl.Config.set_fmt_str_lengths(40) # show up to 40 characters in string columns

df = pl.read_parquet('wat-1k-metadata-rating/*.parquet')
df = df.unnest('key')

df.head()
```

| type | value | host | url | count |
|---|---|---|---|---|
| str | str | str | str | i64 |
| HTML-metatag-rating | mature | `cz.ohmicam.com` | `http://cz.ohmicam.com/profile/tagira-` | 1 |
| HTML-metatag-rating | General | `dwtest101.dev.autofunds.com` | `http://dwtest101.dev.autofunds.com/2005...` | 1 |
| HTML-metatag-rating | general | `giuseppini.murialdo.org` | `http://giuseppini.murialdo.org/index.ph...` | 1 |
| HTML-metatag-rating | general | `www.astakos-news.gr` | `http://www.astakos-news.gr/2021/04/12.h...` | 1 |
| HTML-metatag-rating | general | `dealers.heromotocorp.com` | `https://dealers.heromotocorp.com/shri-t...` | 1 |


#### "Rating" Values

In [2]:
```python
value_counts = df[['value', 'count']].groupby('value').sum().sort('count', descending=True)
value_counts.head(20)
```

| value | count |
|---|---|
| str | i64 |
| general | 183266 |
| General | 150010 |
| RTA-5042-1996-1400-1577-RTA | 88646 |
| GENERAL | 45305 |
| mature | 11241 |
| adult | 10035 |
| Safe For Kids | 5206 |
| safe for kids | 4810 |
| Mature | 2285 |
| All | 2267 |


In [3]:
```python
# raw string, so that "\s" reaches the regex engine unchanged
rta_pattern = r'(?i)(?:mature|adult(?:s\s*only)?|RTA-5042-1996-1400-1577-RTA)'

value_counts.filter(value_counts['value'].str.count_match(rta_pattern) > 0)
```

| value | count |
|---|---|
| str | i64 |
| RTA-5042-1996-1400-1577-RTA | 88646 |
| mature | 11241 |
| adult | 10035 |
| Mature | 2285 |
| Adult | 110 |
| ADULTS ONLY | 55 |
| ADULT | 24 |
| rta-5042-1996-1400-1577-rta | 10 |
| General, Mature | 7 |
| MATURE | 7 |


In [4]:
```python
# we focus on the most common "adult" rating values:
# the pattern is anchored, so it must match the entire value
rta_pattern = r'(?i)^\s*(?:mature|adult(?:s\s*only)?|RTA-5042-1996-1400-1577-RTA)\s*$'

value_counts.filter(value_counts['value'].str.count_match(rta_pattern) > 0)
```

| value | count |
|---|---|
| str | i64 |
| RTA-5042-1996-1400-1577-RTA | 88646 |
| mature | 11241 |
| adult | 10035 |
| Mature | 2285 |
| Adult | 110 |
| ADULTS ONLY | 55 |
| ADULT | 24 |
| rta-5042-1996-1400-1577-rta | 10 |
| MATURE | 7 |

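The difference between the unanchored and the anchored pattern can be checked outside polars with Python's `re` module. This is a sketch for illustration: the names `loose`, `strict` and `classify` are made up here, and polars uses a different regex engine, although these particular patterns should behave the same way in both.

```python
import re

# Unanchored: matches if an adult keyword occurs anywhere in the value.
loose = re.compile(r"(?i)(?:mature|adult(?:s\s*only)?|RTA-5042-1996-1400-1577-RTA)")
# Anchored: the whole value (ignoring surrounding whitespace) must match.
strict = re.compile(r"(?i)^\s*(?:mature|adult(?:s\s*only)?|RTA-5042-1996-1400-1577-RTA)\s*$")


def classify(value):
    """Return (loose match, strict match) for one rating value."""
    return bool(loose.search(value)), bool(strict.search(value))


for value in ["adult", "ADULTS ONLY", "General, Mature", " mature ", "general"]:
    print(repr(value), classify(value))
```

Mixed values such as "General, Mature" match only the unanchored pattern, which is why they no longer appear in the second result.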

#### Hosts with Adult Rating

In [5]:
```python
hosts_adult_rating = (
    df.filter(df['value'].str.count_match(rta_pattern) > 0)[['host', 'count']]
    .groupby('host').sum()
    .sort('count', descending=True)
)
hosts_adult_rating
```

| host | count |
|---|---|
| str | i64 |
| `adult.contents.fc2.com` | 436 |
| `en.smotri.com` | 290 |
| `forum.mydebut.ru` | 256 |
| `ru.smotri.com` | 252 |
| `xnxxfilm.cc` | 247 |
| `prostoporno.net.ru` | 241 |
| `russkoeporno.net.ru` | 233 |
| `mydebut.ru` | 223 |
| `pompini.org` | 178 |
| `dic.academic.ru` | 165 |


#### Share of Pages Rated as Adult

There are 112k pages tagged as adult using the "rating" metatags. This is 0.3% of all 36 million pages in the sample. 

In [6]:
```python
hosts_adult_rating.write_csv('hosts_adult_content.tsv', sep='\t')

hosts_adult_rating['count'].sum()
```

112413

#### Overlap with UT1 Blocking List

To estimate the overlap with the "[UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php)" blacklist compiled by the Université Toulouse 1, we first need to download the "adult" blacklist:

```
$> curl -O ftp://ftp.ut-capitole.fr/pub/reseau/cache/squidguard_contrib/adult.tar.gz

$> md5sum adult.tar.gz 
7f13aa0bda6c0baf55d1439b72fe3521  adult.tar.gz

$> date
Fri Mar 10 15:52:34 CET 2023

$> tar xvfz adult.tar.gz
adult/domains
adult/urls
adult/expressions
adult/usage
```

Then we look up which host names are also among the UT1 domain names, or have a UT1 domain as a domain suffix. We use a double-array trie package ([datrie](https://pypi.org/project/datrie/)) to look up [reverse domain names](https://en.wikipedia.org/wiki/Reverse_domain_name_notation), which turns the suffix match into a prefix match. Because 80% of the blacklist entries are subdomains of blogspot, we match the blogspot domains separately.

In [7]:
```python
import datrie, re, string

domain_name_alphabet = string.ascii_lowercase + '0123456789._-'
ut1_adult_domain_trie = datrie.BaseTrie(domain_name_alphabet)

# matches the blogspot domain under any of its country-specific suffixes
rx_blogspot_domain = re.compile(
    r'\.blogspot\.(?:c(?:[ahlz]|o(?:m(?:\.(?:e[egs]|a[ru]|b[ry]|c[oy]|mt|ng|tr|uy))?|\.(?:i[dl]|at|ke|nz|uk|za)))|s[egikn]|i[enst]|m[dkxy]|a[elm]|b[aeg]|h[kru]|l[itu]|r[osu]|[gk]r|d[ek]|f[ir]|n[lo]|p[et]|jp|qa|tw|ug)$',
)
ut1_adult_subdomains_blogspot = set()

def reverse_domain(domain):
    return '.'.join(reversed(domain.split('.')))

with open('adult/domains') as ut1_adult_domains:
    for line in ut1_adult_domains:
        domain = line.strip()
        m = rx_blogspot_domain.search(domain)
        if m:
            # keep only the subdomain part in front of ".blogspot.<tld>"
            ut1_adult_subdomains_blogspot.add(domain[0:m.start()])
        else:
            ut1_adult_domain_trie[reverse_domain(domain)] = 0

print(len(ut1_adult_subdomains_blogspot), 'adult blogspot subdomains')
print(len(ut1_adult_domain_trie.values()), 'adult domain trie size')
```

```
58134 adult blogspot subdomains
891000 adult domain trie size
```


In [8]:
```python
def ut1_is_adult_domain(domain):
    m = rx_blogspot_domain.search(domain)
    if m:
        return domain[0:m.start()] in ut1_adult_subdomains_blogspot
    else:
        rev_domain = reverse_domain(domain)
        try:
            # note: only the longest listed prefix is checked; a shorter listed
            # parent domain is missed if the longest prefix ends inside a label
            pfx = ut1_adult_domain_trie.longest_prefix(rev_domain)
            # accept exact matches and prefixes ending at a label boundary
            return (pfx == rev_domain) or (rev_domain[len(pfx)] == '.')
        except KeyError:
            return False

# note: tests depend on the current UT1 domain list and may need adjustments if the list changes
assert(ut1_is_adult_domain('000-sex-you-tube.blogspot.com'))
assert(ut1_is_adult_domain('subdomain.domain.adult'))
assert(ut1_is_adult_domain('pornhub.com'))
assert(ut1_is_adult_domain('subdomain.pornhub.com'))
assert(not ut1_is_adult_domain('this-is-a-different-domain-than-pornhub.com'))
assert(not ut1_is_adult_domain('pornhub-this-is-a-different-domain.com'))
```
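
The same suffix matching can be sketched without a trie by testing the host and each of its parent domains against a plain set. This is not the method used above, only an illustration of the matching rule; `domain_suffixes`, `is_listed` and the entries in `blocked` are hypothetical.

```python
def domain_suffixes(host):
    """Yield the host itself and each parent domain, longest first."""
    labels = host.lower().split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def is_listed(host, blocked_domains):
    """True if the host or any of its parent domains is in the set."""
    return any(suffix in blocked_domains for suffix in domain_suffixes(host))


# Hypothetical entries standing in for the UT1 domain list.
blocked = {"example.com", "adult"}

print(is_listed("sub.example.com", blocked))   # parent domain is listed
print(is_listed("not-example.com", blocked))   # different registered domain
```

Since suffixes are always cut at label boundaries, a host such as `not-example.com` can never match `example.com`. The set-based variant needs one lookup per label, while the trie finds the match in a single traversal and stores shared prefixes only once.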

In [9]:
```python
hosts_adult_rating = hosts_adult_rating.with_columns(hosts_adult_rating['host'].apply(ut1_is_adult_domain).alias('inUT1'))
hosts_adult_rating.groupby(['inUT1']).agg(pl.col('host').count(), pl.col('count').sum())
```

| inUT1 | host | count |
|---|---|---|
| bool | u32 | i64 |
| False | 12189 | 76026 |
| True | 5228 | 36387 |


About one third of the hosts (5,228 of 17,417, or 30%) and of the pages (36,387 of 112,413, or 32%) tagged as "adult" are included in the UT1 "adult" blacklist.

#### Top-Level Domains of Adult Hosts

- Which top-level domains (TLDs) have the most sites tagged as "adult"?
- Which TLDs are covered best by the UT1 blacklist?

In [10]:
```python
hosts_adult_rating = hosts_adult_rating.with_columns(
    hosts_adult_rating['host'].apply(lambda host: host.split('.')[-1]).alias('tld'))
tld_adult_rating = hosts_adult_rating.groupby('tld').agg(
    pl.col('host').count(),
    pl.col('count').sum(),
    pl.col('inUT1').sum()).sort('count', descending=True)
tld_adult_rating = tld_adult_rating.with_columns(
    (100.0 * tld_adult_rating['inUT1'] / tld_adult_rating['host']).alias('%inUT1'))
tld_adult_rating.head(25)
```

| tld | host | count | inUT1 | %inUT1 |
|---|---|---|---|---|
| str | u32 | i64 | u32 | f64 |
| com | 9223 | 56239 | 3138 | 34.023637 |
| ru | 1120 | 9411 | 44 | 3.928571 |
| net | 1101 | 7561 | 380 | 34.514078 |
| mobi | 248 | 2588 | 103 | 41.532258 |
| pro | 575 | 2549 | 219 | 38.086957 |
| org | 443 | 2537 | 136 | 30.699774 |
| top | 384 | 2462 | 21 | 5.46875 |
| xyz | 153 | 2445 | 22 | 14.379085 |
| me | 276 | 2388 | 115 | 41.666667 |
| hu | 183 | 1773 | 8 | 4.371585 |




### Precision of Tagged Adult Hosts

Overall, the metatags seem to be correct. However, the tagged hosts also include
- book reviews on Blogspot, possibly of erotic literature
- dictionaries, e.g. "medicine.en-academic.com", which might contain articles about "adult" topics but are genuinely far removed from the bulk of tagged sites



## TODO

For the next run of the Spark job to extract "rating" metadata:
- include all URLs or total page counts per host to detect
  - which sites tag all pages or only parts as "adult"
  - which URLs or sites are included in the UT1 adult content block list but are not marked as adult by rating metatags
- extract content language of pages to estimate the distribution of tagged content over languages
