## Notebook to create curated Curlie dataset (flattened categories for .de domains)

The Curlie categories listed for each domain are condensed into a single category, using a set of heuristics derived from manual inspection.

- Author: Hadi Asghari
- Version: 2023.02


- __Input__: curated Curlie dataset from the H2V project (links below)
- __Output__: curlie-ourset.csv (this is already provided so rerunning this notebook is unnecessary)

In [1]:
# imports
import json
import pickle
import binascii
import zlib
from time import time
from collections import Counter
from os import path
import pandas as pd
import numpy as np
import tldextract

In [10]:
# Download the Curlie dataset curated by the H2V project as input
# This is necessary because Curlie has removed the archives of its own data
# However, we use our own heuristics instead of H2V's categories, as theirs are a bit too broad
# and exclude Regional, which is very important in Germany (at least in v3)

# H2V project: https://github.com/epfl-dlab/homepage2vec 
# Data repository: https://figshare.com/ndownloader/files/38937971 
# The file is curlie.csv.gz (unfiltered) version 3.
!wget https://figshare.com/ndownloader/files/34491131
!mv 34491131 ./data/curlie-by-h2v.csv.gz

--2023-02-24 16:52:01--  https://figshare.com/ndownloader/files/34491131
Resolving figshare.com (figshare.com)... 99.81.233.31, 46.137.13.70, 2a05:d018:1f4:d000:614f:f7d5:b342:899c, ...
Connecting to figshare.com (figshare.com)|99.81.233.31|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/34491131/curlie.csv.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20230224/eu-west-1/s3/aws4_request&X-Amz-Date=20230224T155201Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=1d0a57adb0835cb060322e43d2b5d79f03bafceaab0e5e6572949a0944144241 [following]
--2023-02-24 16:52:01--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/34491131/curlie.csv.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20230224/eu-west-1/s3/aws4_request&X-Amz-Date=20230224T155201Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=1d0a57adb0835cb060322e43d2b5d79f03bafceaab

In [3]:
# STEP 1: load the H2V raw Curlie data (downloaded above)

# we extract two domain parts: the FQDN (includes the subdomain, minus www./m./de.) and the registered domain;
# in this notebook we base the analysis on the registered domain (r-domain)
# we keep only the `de` & `en` language parts of the Curlie data, and further filter domains to `.de`

curlie_ = pd.read_csv("./data/curlie-by-h2v.csv.gz")  # 3s to load; note `uid` repeats hence is not index
curlie_.drop(columns=['Unnamed: 0', 'uid'], inplace=True)

# en + de are the biggest communities on Curlie -- we limit to these two groups
curlie1 = curlie_[curlie_.lang.isin(['de', 'en'])].copy()

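# decode the percent-encoded (%XX) UTF-8 sequences that appear in some labels
def convert_htmlperc(s):
    ll = s.split('%')
    bs = bytes(ll[0], encoding="UTF-8")
    for l in ll[1:]:
        # the first two characters after each '%' are one hex byte; the rest is plain text
        bs += binascii.unhexlify(l[0:2]) + bytes(l[2:], encoding="UTF-8")
    return bs.decode()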

curlie1.label = curlie1.label.apply(convert_htmlperc)

# extract full/reg domains.
# (note: removing www/m/de may be problematic for some very short domain names, but we can ignore that for now)
print('extracting f/r domains...')
curlie1.loc[curlie1.url=="www.netzgeek.de%20title=", "url"] = "www.netzgeek.de"  # manually fix 1 error
curlie1['fdomain'] = curlie1.url.apply(lambda x: tldextract.extract(x).fqdn)  # 10s+
curlie1['fdomain'] = curlie1.fdomain.str.replace(r"^www[.]", "", regex=True)
curlie1['fdomain'] = curlie1.fdomain.str.replace(r"^m[.]", "", regex=True)
curlie1['fdomain'] = curlie1.fdomain.str.replace(r"^de[.]", "", regex=True)
curlie1['domain'] = curlie1.url.apply(lambda x: tldextract.extract(x).registered_domain)  # 10s+
curlie1 = curlie1[curlie1.domain!=""]  # drop the few empty ones

# num of labels based on `domain` (rdomain is much larger)
print("\ncurlie data records/domains:", len(curlie1), len(set(curlie1.domain)))
curlie1.head(2)

extracting f/r domains...

curlie data records/domains: 1467914 1210441


Unnamed: 0,url,label,lang,fdomain,domain
14,www.malaysiakini.com,/en/Regional/Asia/Malaysia/News_and_Media,en,malaysiakini.com,malaysiakini.com
16,www.bernama.com,/en/Regional/Asia/Malaysia/News_and_Media,en,bernama.com,bernama.com
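
The label clean-up and prefix stripping from Step 1 can be tried on single strings without pandas. The sketch below is only an illustration under assumptions: `decode_label` and `strip_host_prefix` are hypothetical helper names, the sample label and `example.de` are made up, and `urllib.parse.unquote` stands in for `convert_htmlperc`. For well-formed `%XX` sequences it should give the same result, but it is more lenient with malformed input, where the notebook's function raises.

```python
import re
from urllib.parse import unquote


def decode_label(label):
    """Decode percent-encoded UTF-8 sequences (%XX) in a Curlie label."""
    return unquote(label)


def strip_host_prefix(fqdn):
    """Strip a leading 'www.', then 'm.', then 'de.', in the same order as the notebook cell."""
    for prefix in (r"^www[.]", r"^m[.]", r"^de[.]"):
        fqdn = re.sub(prefix, "", fqdn)
    return fqdn


# Illustrative inputs only; the label and example.de are not taken from the dataset.
print(decode_label("/de/Regional/Europa/Deutschland/Baden-W%C3%BCrttemberg"))
print(strip_host_prefix("www.malaysiakini.com"))
print(strip_host_prefix("www.m.example.de"))
```

Because the three patterns are applied one after another, a host such as `www.m.example.de` loses both prefixes, which mirrors the three consecutive `str.replace` calls in the cell.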


In [4]:
# SPECIFIC TO OUR RESEARCH DESIGN: LIMIT CURLIE ONLY TO .DE/.AT DOMAINS 
# This is done to speed up processing in this file. (we eventually only use the .de in the paper)
curlie1 = curlie1[curlie1.domain.str[-3:].isin(['.de', '.at'])].copy()  # 1210k => .de: 228k + .at: 17k
print(len(set(curlie1.domain)))

245060


In [5]:
# STEP 2: derive a single-word category for each record
# from the Curlie category/label, we keep the first/top category for all categories except 'Regional'.
# for Regional, we look at the next relevant labels
# for an explanation of the actual categories see curlie.org

# get the top category (ignoring the starting /de/ or /en/)
curlie1['cat'] = curlie1.label.apply(lambda x: x.split('/',3)[2])
curlie1['catreg'] = False

# combine the German & English labels (we do this manually)
curlie1.loc[curlie1.cat=='Computer', 'cat'] = 'Computers'
curlie1.loc[curlie1.cat=='Freizeit', 'cat'] = 'Recreation'
curlie1.loc[curlie1.cat=='Gesellschaft', 'cat'] = 'Society'
curlie1.loc[curlie1.cat=='Gesundheit', 'cat'] = 'Health'
curlie1.loc[curlie1.cat=='Kultur', 'cat'] = 'Arts'
curlie1.loc[curlie1.cat=='Medien', 'cat'] = 'News'
curlie1.loc[curlie1.cat=='Online-Shops', 'cat'] = 'Shopping'
curlie1.loc[curlie1.cat=='Spiele', 'cat'] = 'Games'
curlie1.loc[curlie1.cat=='Sport', 'cat'] = 'Recreation'  # this is deu
curlie1.loc[curlie1.cat=='Sports', 'cat'] = 'Recreation'  # this is eng
curlie1.loc[curlie1.cat=='Wirtschaft', 'cat'] = 'Business'
curlie1.loc[curlie1.cat=='Wissen', 'cat'] = 'Reference'
curlie1.loc[curlie1.cat=='Wissenschaft', 'cat'] = 'Science'
curlie1.loc[curlie1.cat=='Zuhause', 'cat'] = 'Home'

# We split Reference/Education off as Education; this matters for all .edu/university websites, for instance
for ix, row in curlie1[curlie1.cat=='Reference'].iterrows():
    if row['label'].startswith("/en/Reference/Education/") or row['label'].startswith("/de/Wissen/Bildung/"):
        curlie1.loc[ix, 'cat'] = "Education"

# Similarly, we make Recreation/Travel its own category (given the many regional sites)
for ix, row in curlie1[curlie1.cat=='Recreation'].iterrows():
    if row['label'].startswith("/en/Recreation/Travel/") or row['label'].startswith("/de/Freizeit/Reisen/"):
        curlie1.loc[ix, 'cat'] = "Travel"

# Let's further unpack `regional` (only German)
# we use first match from left
# these categories can later on be merged with the top categories
for ix, row in curlie1[curlie1.cat=='Regional'].iterrows():
    lbl = row['label']
    cat = None
    for ll in lbl.split('/'):
        if ll == "Society_and_Culture" or ll == 'Gesellschaft':
            cat = 'Society'
            break
        elif ll == "Business_and_Economy"  or ll == "Wirtschaft":
            cat = "Business"
            break
        elif ll == "Arts_and_Entertainment" or ll == "Kultur":
            cat = "Arts"
            break
        elif ll == "Health" or ll == "Gesundheit":
            cat = "Health"
            break
        elif ll == "Education" or ll == "Bildung":
            cat = "Education"
            break
        elif ll == "Science_and_Environment" or ll == "Natur_und_Umwelt":
            cat = "Science"
            break
        elif ll in ("Travel_and_Tourism", "Transportation", "Transport", "Reise_und_Tourismus", "Verkehr", "Gastgewerbe"):
            cat = "Travel"
            break
        elif ll == "Recreation_and_Sports" or ll == "Sport" or ll == "Freizeit":
            cat = "Recreation"  # merged with sports
            break
        elif ll == "Government" or ll == "Staat" or ll == "Ämter":
            cat = "Government"  # incs. member of parliament
            break
        elif ll == "News_and_Media" or ll == "Nachrichten_und_Medien" or ll == "Weather" or ll == "Wetter":
            cat = "News"  # mix of news & media
            break
        elif ll == "Guides_and_Directories" or ll == "Verzeichnisse_und_Portale":
            cat = "Guide"  # in regions, these guides are often run by local governments
            break
    if not cat:
        # what remains (incls Provinces; States_and_Federal_Territories; Districts; Regions; Maps_and_Views...)
        cat = 'RegOther'
    curlie1.loc[ix, 'cat'] = cat
    curlie1.loc[ix, 'catreg'] = True

# summarize so far
print(len(curlie1))
print(curlie1.groupby('cat').label.count())

286231
cat
Arts              18072
Business          75483
Computers          7262
Education          9228
Games              2225
Government         1607
Guide               207
Health            22190
Home               1197
Kids_and_Teens     1295
News               2240
Recreation        36526
Reference           459
RegOther          49637
Science            5614
Shopping           8060
Society           27786
Travel            17143
Name: label, dtype: int64
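
The Step 2 rules can also be read as one function from a label to a `(cat, catreg)` pair. The following is a condensed sketch rather than the notebook's code: `flatten_label`, `TOP_LEVEL_MERGE` and `REGIONAL` are hypothetical names, and the `iterrows` loops are replaced by dictionary lookups. The category pairs themselves are copied from the cell above.

```python
# Top-level names that are merged (German names plus English 'Sports'), as in the cell above
TOP_LEVEL_MERGE = {
    "Computer": "Computers", "Freizeit": "Recreation", "Gesellschaft": "Society",
    "Gesundheit": "Health", "Kultur": "Arts", "Medien": "News",
    "Online-Shops": "Shopping", "Spiele": "Games", "Sport": "Recreation",
    "Sports": "Recreation", "Wirtschaft": "Business", "Wissen": "Reference",
    "Wissenschaft": "Science", "Zuhause": "Home",
}

# Sub-labels that decide the category of a Regional entry
REGIONAL = {
    "Society_and_Culture": "Society", "Gesellschaft": "Society",
    "Business_and_Economy": "Business", "Wirtschaft": "Business",
    "Arts_and_Entertainment": "Arts", "Kultur": "Arts",
    "Health": "Health", "Gesundheit": "Health",
    "Education": "Education", "Bildung": "Education",
    "Science_and_Environment": "Science", "Natur_und_Umwelt": "Science",
    "Travel_and_Tourism": "Travel", "Transportation": "Travel", "Transport": "Travel",
    "Reise_und_Tourismus": "Travel", "Verkehr": "Travel", "Gastgewerbe": "Travel",
    "Recreation_and_Sports": "Recreation", "Sport": "Recreation", "Freizeit": "Recreation",
    "Government": "Government", "Staat": "Government", "Ämter": "Government",
    "News_and_Media": "News", "Nachrichten_und_Medien": "News",
    "Weather": "News", "Wetter": "News",
    "Guides_and_Directories": "Guide", "Verzeichnisse_und_Portale": "Guide",
}


def flatten_label(label):
    """Return (cat, catreg) for one Curlie label such as '/en/Regional/Asia/...'."""
    top = label.split("/", 3)[2]  # skip the leading '' and the language part
    top = TOP_LEVEL_MERGE.get(top, top)
    if top == "Reference" and label.startswith(("/en/Reference/Education/", "/de/Wissen/Bildung/")):
        return "Education", False
    if top == "Recreation" and label.startswith(("/en/Recreation/Travel/", "/de/Freizeit/Reisen/")):
        return "Travel", False
    if top == "Regional":
        for part in label.split("/"):  # first match from the left wins
            if part in REGIONAL:
                return REGIONAL[part], True
        return "RegOther", True
    return top, False


print(flatten_label("/en/Regional/Asia/Malaysia/News_and_Media"))
```

A dictionary keeps the "first match from the left" behaviour because each path segment is checked in order and the loop stops at the first hit.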


In [6]:
# STEP 3: MERGE categories by fdomain
# We condense our frame into a frame grouped by domain, 
# and combine the labels into one, based on some heuristics derived from manual inspection

%time curlie2 = curlie1.groupby(['fdomain'], as_index=False).agg({'cat': set, 'domain': set})
curlie2['cat'] = curlie2.cat.apply(lambda x: str(sorted(x))[1:-1].replace("'", "").replace(",", " "))
curlie2['domain'] = curlie2.domain.apply(lambda x: str(sorted(x))[1:-1].replace("'", ""))

# Now, iterate & merge categories (11 cats max per domain)
# - note in both losing & dominating categories, the order is important (as to what cat is left standing)
# - in general, the larger category/groups can be replaced with more specific ones

# 1. LOSING CATEGORIES :   CAT & X => X
for i, row in curlie2.iterrows():
    cat = row['cat']
    if "RegOther" in cat and " " in cat:
        cat = cat.replace("RegOther", "").strip()
    if "Business" in cat and " " in cat:
        cat = cat.replace("Business", "").strip()
    if "Reference" in cat and " " in cat:
        cat = cat.replace("Reference", "").strip()
    if "Society" in cat and " " in cat:
        cat = cat.replace("Society", "").strip()
    if "Recreation" in cat and " " in cat:
        cat = cat.replace("Recreation", "").strip()
    if "Computers" in cat and " " in cat:
        cat = cat.replace("Computers", "").strip()
    if "News" in cat and " " in cat:
        cat = cat.replace("News", "").strip()
    if "Science" in cat and " " in cat:
        cat = cat.replace("Science", "").strip()
    if "Home" in cat and " " in cat:
        cat = cat.replace("Home", "").strip()
    curlie2.loc[i, "cat"] = cat

# 2. DOMINATING CATEGORIES :  CAT & X => CAT!
c = Counter(curlie2.cat)
for k,v in c.items():
    if "Government" in k and " " in k:
        curlie2.loc[curlie2.cat==k, "cat"] = "Government"
    elif "Kids_and_Teens" in k and " " in k:
        curlie2.loc[curlie2.cat==k, "cat"] = "Kids_and_Teens"
    elif "Travel" in k and " " in k:
        curlie2.loc[curlie2.cat==k, "cat"] = "Travel"
    elif "Arts" in k and " " in k:
        curlie2.loc[curlie2.cat==k, "cat"] = "Arts"
    elif "Health" in k and " " in k:
        curlie2.loc[curlie2.cat==k, "cat"] = "Health"
    elif "Shopping" in k and " " in k:
        curlie2.loc[curlie2.cat==k, "cat"] = "Shopping"
    elif "Education" in k and " " in k:
        curlie2.loc[curlie2.cat==k, "cat"] = "Education"

assert len(curlie2[curlie2.cat.str.contains(" ")]) == 0  # sanity check

# save interim results
# with open("./data/tmp-curlie-fdomain.p", "wb") as f:  
#    pickle.dump(curlie2, f)

print('fqdomains:', len(curlie2))
curlie2.head(3)

CPU times: user 6.63 s, sys: 40 ms, total: 6.66 s
Wall time: 6.66 s
fqdomains: 252162


Unnamed: 0,fdomain,cat,domain
0,007-berlin.de,Society,007-berlin.de
1,011.joomla.schule.bremen.de,Education,bremen.de
2,01art.de,Business,01art.de
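
To make the order of the two rule lists in Step 3 easier to follow, here is a minimal sketch that applies them to a plain set of categories instead of a space-separated string. `merge_cats`, `LOSING` and `DOMINATING` are hypothetical names; the list contents and their order are taken from the cell above. It assumes that no category name is a substring of another, so set membership should behave like the substring checks in the notebook.

```python
# "Losing" categories are dropped, in this order, while more than one category remains
LOSING = ["RegOther", "Business", "Reference", "Society", "Recreation",
          "Computers", "News", "Science", "Home"]

# "Dominating" categories win over whatever is left, in this priority order
DOMINATING = ["Government", "Kids_and_Teens", "Travel", "Arts",
              "Health", "Shopping", "Education"]


def merge_cats(cats):
    """Condense a set of categories for one fdomain into a single category."""
    remaining = set(cats)
    for loser in LOSING:
        if loser in remaining and len(remaining) > 1:
            remaining.discard(loser)
    if len(remaining) == 1:
        return next(iter(remaining))
    for winner in DOMINATING:
        if winner in remaining:
            return winner
    # corresponds to the sanity-check assert in the notebook cell
    raise ValueError(f"cannot reduce {sorted(remaining)} to one category")


print(merge_cats({"RegOther", "Business"}))
print(merge_cats({"Arts", "Health", "Business"}))
```

The order matters in both lists: with `{"Science", "Home"}`, Science is dropped first, so Home is the category left standing.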


In [7]:
# STEP 4: FURTHER MERGE INTO registered DOMAIN (combining subdomains)
# - we condense using heuristics, MAJORITY VOTING, and tie-breaker rules similar to those used for fdomain
# - there are a few large domains left as 'MIX' which are manually reclassified later

%time curlie3 = curlie2.groupby(['domain'], as_index=False).agg({'cat': Counter, 'fdomain': len})
curlie3.rename(columns={'cat': 'cats', 'fdomain':'fdoms'}, inplace=True)
# note, I also tried grouping by curlie.domain, as then the weights will slightly differ
#       however, upon manual inspection, that wasn't really better

# main loop to combine categories per domain
st = time()
curlie3['cat'] = ''
for i, row in curlie3.iterrows():
    cats, fdoms = row['cats'].copy(), row['fdoms']

    # first some simple heuristics
    # - if RegOther and other categories, let's keep the other ones
    # - if both science & education, let's merge them (most are higher-ed)
    if len(cats) > 1 and 'RegOther' in cats:
        cats.pop('RegOther')
    if 'Education' in cats and 'Science' in cats:
        cats['Education'] += cats['Science']
        cats.pop('Science')

    # now, if we have just one category, use that
    if len(cats) == 1:
        curlie3.loc[i, 'cat'] = list(cats.keys())[0]
        continue

    # next, if there are many subdomains, use majority voting (or MIX);
    #            if few subdomains, use top category, unless tied
    (cat1, n1), (cat2, n2) = cats.most_common(2)
    if fdoms > 99:
        if n1 / fdoms > 0.5:
            curlie3.loc[i, 'cat'] = cat1
        else:
            curlie3.loc[i, 'cat'] = 'MIX'
    else:
        if n1 > n2:
            curlie3.loc[i, 'cat'] = cat1
        else:
            # finally, for ties: use the heuristics/rules we used for fdomains (curlie2)
            # we check for any dominating categories, otherwise, remove categories until we get to 1...
            top = [c for c,n in cats.items() if n==n1]
            if 'Government' in top:
                curlie3.loc[i, 'cat'] = 'Government'
            elif 'Kids_and_Teens' in top:
                curlie3.loc[i, 'cat'] = 'Kids_and_Teens'
            elif 'Travel' in top:
                curlie3.loc[i, 'cat'] = 'Travel'
            elif 'Arts' in top:
                curlie3.loc[i, 'cat'] = 'Arts'
            elif 'Health' in top:
                curlie3.loc[i, 'cat'] = 'Health'
            elif 'Shopping' in top:
                curlie3.loc[i, 'cat'] = 'Shopping'
            elif 'Education' in top:
                curlie3.loc[i, 'cat'] = 'Education'
            else:
                if len(top) > 1 and 'Business' in top:
                    top.remove('Business')
                if len(top) > 1 and 'Reference' in top:
                    top.remove('Reference')
                if len(top) > 1 and 'Society' in top:
                    top.remove('Society')
                if len(top) > 1 and 'Recreation' in top:
                    top.remove('Recreation')
                if len(top) > 1 and 'Computers' in top:
                    top.remove('Computers')
                if len(top) > 1 and 'News' in top:
                    top.remove('News')
                if len(top) > 1 and 'Science' in top:
                    top.remove('Science')
                if len(top) > 1 and 'Home' in top:
                    top.remove('Home')
                assert len(top) == 1  # only 1 at this point 
                curlie3.loc[i, 'cat'] = top[0]

#
print("in", round(time()-st), "s")  # ~100s

# save interim results
# with open("./data/tmp-curlie-rdomain.p", "wb") as f:  
#    pickle.dump(curlie3, f)

print('domains:', len(curlie3))
print(len(curlie3[curlie3.cat.str.startswith('MIX')]))  
curlie3.head(3)

CPU times: user 6.22 s, sys: 11.9 ms, total: 6.23 s
Wall time: 6.23 s
in 181 s
domains: 245060
4


Unnamed: 0,domain,cats,fdoms,cat
0,007-berlin.de,{'Society': 1},1,Society
1,007box.de,{'Computers': 1},1,Computers
2,01art.de,{'Business': 1},1,Business
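
The per-domain decision in Step 4 can be isolated into a function that takes the category counts and the number of subdomains. This is a sketch under assumptions, not the notebook's code: `vote_category` and the two list names are hypothetical, while the thresholds (more than 99 subdomains, share above 0.5) and the tie-break order are copied from the loop above.

```python
from collections import Counter

DOMINATING = ["Government", "Kids_and_Teens", "Travel", "Arts",
              "Health", "Shopping", "Education"]
LOSING = ["Business", "Reference", "Society", "Recreation",
          "Computers", "News", "Science", "Home"]


def vote_category(cats, fdoms):
    """Pick one category for a registered domain from its per-fdomain category counts."""
    cats = Counter(cats)  # work on a copy
    if len(cats) > 1 and "RegOther" in cats:
        cats.pop("RegOther")
    if "Education" in cats and "Science" in cats:
        science = cats.pop("Science")
        cats["Education"] += science
    if len(cats) == 1:
        return next(iter(cats))

    (cat1, n1), (_, n2) = cats.most_common(2)
    if fdoms > 99:  # many subdomains: strict majority, else MIX
        return cat1 if n1 / fdoms > 0.5 else "MIX"
    if n1 > n2:     # few subdomains: the top category wins unless tied
        return cat1

    # tie: dominating categories first, then drop losing ones until one is left
    top = [c for c, n in cats.items() if n == n1]
    for winner in DOMINATING:
        if winner in top:
            return winner
    for loser in LOSING:
        if len(top) > 1 and loser in top:
            top.remove(loser)
    if len(top) != 1:
        raise ValueError(f"unresolved tie: {top}")
    return top[0]


print(vote_category(Counter({"RegOther": 3, "Business": 1}), 4))
print(vote_category(Counter({"Government": 40, "Society": 35, "Arts": 30}), 105))
```

Note that the majority share is computed against `fdoms`, the number of subdomains, as in the notebook, and not against the sum of the remaining counts.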


In [8]:
# some individual checks below
# - note, in this list, only two appear wrong (xs4all.nl & essen.de)
# - the rest appear correct, which is good 
# curlie[curlie.domain.isin(["nrw.de", "uni-potsdam.de", "bayern.de", "bayer.de", "lmu.de", "lego.com", "pbs.org",
#                            "sagepub.com", "gmu.edu", "xs4all.nl", "ucanr.edu", "sourceforge.net",
#                            "uwa.edu.au", "drk.de", "essen.de", "nationalgeographic.com",
#                            "dlrg.de", "eq.edu.au", "weebly.com", "webs.com", "freeservers.com",
#                            "free.fr", "homestead.com", "uk.com", "house.gov", "iheart.com",
#                            "play-cricket.com", "schoolloop.com", "ox.ac.uk",
#                            "freeservers.com",  "iwarp.com", "jimdofree.com", "beepworld.de",
#                            "sourceforge.net", "blogspot.com", "wordpress.com", "tripod.com"])]

In [9]:
# STEP 5: Final touches before saving
curlie = curlie3.copy()

# manually reclassify the remaining four 'MIX' domains
curlie.loc[curlie.domain=='nrw.de', 'cat'] = 'Government'
curlie.loc[curlie.domain=='t-online.de', 'cat'] = 'Recreation'  # includes a lot of private pages
curlie.loc[curlie.domain=='beepworld.de', 'cat'] = 'Recreation' # similar
curlie.loc[curlie.domain=='gmxhome.de', 'cat'] = 'Recreation'   # similar

# simplify
curlie.loc[curlie.cat=='Kids_and_Teens', 'cat'] = 'Kids'  # simpler name
curlie.loc[curlie.cat=='Guide', 'cat']  = 'Travel'  # merge as includes a lot of travel sites
curlie.loc[curlie.cat=='Reference', 'cat'] = 'RegOther'  # merge 

# remove unused column and finally save
curlie.drop(columns=['cats', 'fdoms'], inplace=True)
curlie.to_csv("./data/curlie-ourset.csv", index=False)

# some stats
print('curlie .de/.at set:', len(curlie))
print("\n", curlie.groupby('cat').domain.count(), "\n")

curlie .de/.at set: 245060

 cat
Arts          16114
Business      65234
Computers      6860
Education      7425
Games          2099
Government     1171
Health        20216
Home           1142
Kids           1253
News           1812
Recreation    30685
RegOther      40017
Science        3751
Shopping       7794
Society       22911
Travel        16576
Name: domain, dtype: int64 
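
Since the saved file has just the two columns `domain` and `cat`, it can be used as a simple lookup table. The sketch below assumes that layout and uses an in-memory stand-in for `./data/curlie-ourset.csv`, so it runs without the file; `load_categories` is a hypothetical helper name and `not-in-curlie.example` is a made-up domain.

```python
import csv
import io

# Stand-in for ./data/curlie-ourset.csv: same header as the saved file,
# with three rows copied from the Step 4 preview.
sample = io.StringIO(
    "domain,cat\n"
    "007-berlin.de,Society\n"
    "007box.de,Computers\n"
    "01art.de,Business\n"
)


def load_categories(fileobj):
    """Read a domain -> category mapping from a CSV with 'domain' and 'cat' columns."""
    return {row["domain"]: row["cat"] for row in csv.DictReader(fileobj)}


categories = load_categories(sample)
print(categories.get("01art.de"))
print(categories.get("not-in-curlie.example", "unknown"))
```

With the real file, the same function could be called on `open("./data/curlie-ourset.csv", newline="")`.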

