## This notebook uses data from the GAW to train a PersonCentric Webpage-classifier. 

This is performed on multiple data sets to investigate the accuracy depending on the training data, thus providing a way of evaluating the data set creation method.

# What is the problem, why is it hard?
The Web, as a representation of the physical world, provides the opportunity to study large-scale phenomena of entities and relations that originate from structures of offline interactions.

As stated by the IDC, unstructured data occupies approximately 80% of the digital space by volume, compared to only 20% for structured data, and continues to be the primary driver of data growth \cite{potnis2019idc}.

Investigating these interactions requires a prior information extraction process through which \emph{entity}- and \emph{relation}-centric \emph{informational needs} are met \cite{broder2002taxonomy, butt2015taxonomy}.

This task becomes increasingly complex if the data is not available in a structured form (e.g., RDF/XML) \cite{gandhi2016information} but only as unstructured HTML/text documents spread over millions of web pages.

Therefore, a semantic enrichment process of unstructured web content is necessary to extract \emph{entity}- and \emph{relation}-centric information.

With large data resources, such a process \emph{must} be performed automatically, and in less time than humans would need to collect and structure the same information using the available Web search tools.

Despite being unstructured, web documents provide a structural and textual aspect of their content, which has been previously described in a combined representation \cite{lanotte2017exploiting, fathi2004web}.

The remaining difficulty is connecting these Web-based structures to physical or organizational structures. As a first step towards this goal, we aim to provide the web data and the network of the people associated with the different university web structures in Germany.

In [1]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    # Otherwise, fall back to the CPU.
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 4 GPU(s) available.
We will use the GPU: GeForce GTX 1080 Ti


In [3]:
seed = 42
split_ratio_train_test = 0.8 
split_ratio_train_val = 0.9 

In [11]:
import pandas as pd
import numpy as np
import os
from collections import Counter
import random as rd

def load_raw_datasets(pathname):
    df =  pd.read_csv("datasets/" + pathname, delimiter='\t', header=None, names=['sentence', 'label']).drop_duplicates().reset_index(drop=True)
    return df

# RegEx
fromRegExTrue = load_raw_datasets("fromRegEx/train_true.tsv")
fromRegExFalse = load_raw_datasets("fromRegEx/train_false.tsv")

# Existing resource DBLP
fromDBLPTrue = load_raw_datasets("fromDBLP/train_true.tsv")

# Manual annotation
fromURLLabelTrue = load_raw_datasets("fromURLLabel/train_true.tsv")
fromURLLabelFalse = load_raw_datasets("fromURLLabel/train_false.tsv")

# Existing resource Wikidata
fromWikidata_on_seeds = load_raw_datasets("fromWikidata_urls_in_GAW/test_true.tsv")

In [12]:
# Concatenate the true and false samples of each source, drop every sentence
# that occurs more than once within a source, then remove sentences that
# co-occur across sources (DBLP, RegEx, URLLabel, Wikidata).
def dedup_by_sentences(dfList):
    df_merged = pd.concat(dfList, ignore_index=True)
    # keep=False marks every copy of a repeated sentence, so all copies are dropped
    df_deduped = df_merged[~df_merged.sentence.duplicated(keep=False)].reset_index(drop=True)
    return df_deduped

def dup_sentences(dfList):
    dfunion = pd.concat(dfList,ignore_index=True)
    duplicates = dfunion[dfunion.sentence.duplicated(keep=False)]
    return duplicates

def disjoint_samples(df1,df2):
    df1_without_df2 = df1[~df1.sentence.isin(df2.sentence)]
    df1_clean = df1_without_df2.sample(frac=1.0,random_state=seed).reset_index(drop=True)
    return df1_clean

## URLLabel merge true and false
## remove internally duplicate sentences
fromURLLabel_deduped = dedup_by_sentences([fromURLLabelTrue,fromURLLabelFalse])
## DBLP
fromDBLP_deduped = dedup_by_sentences([fromDBLPTrue])
## RegEx 
fromRegEx_deduped = dedup_by_sentences([fromRegExTrue,fromRegExFalse])
## Wikidata 
fromWikidata_deduped = dedup_by_sentences([fromWikidata_on_seeds])


## Find duplicate sentences
_duplicateDU = dup_sentences([fromDBLP_deduped,fromURLLabel_deduped])
_duplicateDR = dup_sentences([fromDBLP_deduped,fromRegEx_deduped])
_duplicateRU = dup_sentences([fromRegEx_deduped,fromURLLabel_deduped])

## Wikidata dups
_duplicateWD = dup_sentences([fromWikidata_deduped,fromDBLP_deduped])
_duplicateWR = dup_sentences([fromWikidata_deduped,fromRegEx_deduped])
_duplicateWU = dup_sentences([fromWikidata_deduped,fromURLLabel_deduped])


## Deduplicated data sets
_URLLabel_withoutDU = disjoint_samples(fromURLLabel_deduped,_duplicateDU)
URLLabel_clean = disjoint_samples(_URLLabel_withoutDU,_duplicateWU)

_DBLP_withoutDU = disjoint_samples(fromDBLP_deduped,_duplicateDU)
DBLP_clean = disjoint_samples(_DBLP_withoutDU,_duplicateWD)

_RegEx_withoutRU = disjoint_samples(fromRegEx_deduped,_duplicateRU)
_RegEx_withoutRU_DR = disjoint_samples(_RegEx_withoutRU,_duplicateDR)
RegEx_clean = disjoint_samples(_RegEx_withoutRU_DR,_duplicateWR)

Wikidata_on_seeds_clean = disjoint_samples(disjoint_samples(fromWikidata_on_seeds,fromDBLP_deduped),fromRegEx_deduped)

# Note: DataFrame.duplicated() compares whole rows (sentence and label).
assert not pd.concat([RegEx_clean,DBLP_clean,URLLabel_clean,Wikidata_on_seeds_clean]).reset_index(drop=True).duplicated().any()
# assert all(pd.concat([RegEx_clean,DBLP_clean,URLLabel_clean,fromWikidata_on_seeds]).reset_index(drop=True).duplicated() == False) == True
print("No duplicate sentences in all data sets")

No duplicate sentences in all data sets
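The deduplication logic above can be illustrated on a toy example. This is a minimal sketch, assuming two made-up sources (`source_a`, `source_b`) with invented sentences and labels; it mirrors the `keep=False` and `isin` steps of the helpers rather than calling them, so it runs on its own.

```python
import pandas as pd

# Toy frames; sentences and labels are invented for illustration only.
source_a = pd.DataFrame({"sentence": ["s1", "s2", "s2", "s3"], "label": [1, 1, 0, 0]})
source_b = pd.DataFrame({"sentence": ["s3", "s4"], "label": [1, 1]})

# Internal dedup: keep=False flags every copy of a repeated sentence,
# so "s2" disappears completely instead of surviving once.
a_deduped = source_a[~source_a.sentence.duplicated(keep=False)].reset_index(drop=True)

# Cross-source overlap: sentences that occur in both sources.
union = pd.concat([a_deduped, source_b], ignore_index=True)
overlap = union[union.sentence.duplicated(keep=False)]

# Remove the overlap from both sides so the two sources become disjoint.
a_clean = a_deduped[~a_deduped.sentence.isin(overlap.sentence)].reset_index(drop=True)
b_clean = source_b[~source_b.sentence.isin(overlap.sentence)].reset_index(drop=True)

print(a_clean.sentence.tolist())  # ['s1']
print(b_clean.sentence.tolist())  # ['s4']
```

Dropping all copies of a shared sentence, rather than keeping one, avoids having to decide which source (and which label) should win when the copies disagree.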


In [21]:
# Construction of train, test dataset 

def construct_test_set(df_for_true, df_for_false, sampling_true, sampling_false):
    # Positives and negatives may come from different sources; each is sampled separately.
    test_true = df_for_true[df_for_true.label == 1].sample(**sampling_true).reset_index(drop=True)
    test_false = df_for_false[df_for_false.label == 0].sample(**sampling_false).reset_index(drop=True)
    test_df = pd.concat([test_true, test_false],ignore_index=True).reset_index(drop=True)
    return test_df

def construct_train_set(df, testsamples):
    df_train = df[~df.sentence.isin(testsamples.sentence)]
    return df_train

URLLabel_clean_counter = Counter(URLLabel_clean.label)
print("URLLabel_clean_counter:",URLLabel_clean_counter)
# Counter({0: 1407, 1: 605})
false_true_ratio = URLLabel_clean_counter[0]/URLLabel_clean_counter[1]
print("false_true_ratio: ", false_true_ratio)

## Test dataset
# URLLabel 
URLLabel_test = construct_test_set(URLLabel_clean,URLLabel_clean,{"frac": 1-split_ratio_train_test,"random_state" : seed},{"frac": 1-split_ratio_train_test,"random_state" : seed})
URLLabel_test_counter = Counter(URLLabel_test.label)
print("URLLabel_test_counter: " + str(URLLabel_test_counter))

# DBLP/RegEx
DBLP_test = construct_test_set(DBLP_clean, RegEx_clean, {"n": URLLabel_test_counter[1],"random_state" : seed},{"n": URLLabel_test_counter[0],"random_state" : seed})
DBLP_test_counter = Counter(DBLP_test.label)
DBLP_clean_counter = Counter(DBLP_clean.label)
print(DBLP_clean_counter)
print("DBLP_test_counter: " + str(DBLP_test_counter))

# Wikidata
n_wikidata = len(Wikidata_on_seeds_clean)
Wikidata_test = construct_test_set(Wikidata_on_seeds_clean,RegEx_clean, {"n": n_wikidata,"random_state" : seed},{"n": int(n_wikidata*URLLabel_test_counter[0]/URLLabel_test_counter[1]),"random_state" : seed})
Wikidata_test_counter = Counter(Wikidata_test.label)
print("Wikidata_test_counter: " + str(Wikidata_test_counter))
Wikidata_on_seeds_clean_counter = Counter(Wikidata_on_seeds_clean.label)
print(Wikidata_on_seeds_clean_counter)

## Train dataset
# URLLabel
URLLabel_train = URLLabel_clean[~URLLabel_clean.sentence.isin(URLLabel_test.sentence)]
URLLabel_train_counter = Counter(URLLabel_train.label)
print("URLLabel_train_counter: "+ str(URLLabel_train_counter))

#DBLP/RegEx
DBLP_train_true = DBLP_clean[~DBLP_clean.sentence.isin(DBLP_test.sentence)]
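# Combine both conditions in one mask; chaining two boolean filters makes pandas reindex the second mask and emit a warning.
DBLP_train_false = RegEx_clean[(RegEx_clean.label == 0) & ~RegEx_clean.sentence.isin(DBLP_test.sentence)].sample(int(len(DBLP_train_true)*false_true_ratio),random_state=seed).reset_index(drop=True)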
DBLP_train = pd.concat([DBLP_train_true, DBLP_train_false],ignore_index=True).reset_index(drop=True)
DBLP_train_counter = Counter(DBLP_train.label)
print("DBLP_train_counter: " + str(DBLP_train_counter))

# RegEx: remaining samples that are not already used in DBLP_train
remaining_RegEx = RegEx_clean[~RegEx_clean.sentence.isin(DBLP_train.sentence)]
RegEx_train_true = remaining_RegEx[remaining_RegEx.label == 1].sample(10*URLLabel_train_counter[1],random_state=seed)
RegEx_train_false = remaining_RegEx[remaining_RegEx.label == 0].sample(10*URLLabel_train_counter[0],random_state=seed)
RegEx_train = pd.concat([RegEx_train_true, RegEx_train_false],ignore_index=True).sample(frac=1,random_state=seed).reset_index(drop=True)
RegEx_train_counter = Counter(RegEx_train.label)
print("RegEx_train_counter: "+ str(RegEx_train_counter))

URLLabel_clean_counter: Counter({0: 1407, 1: 605})
false_true_ratio:  2.3256198347107437
URLLabel_test_counter: Counter({0: 281, 1: 121})
Counter({1: 1790})
DBLP_test_counter: Counter({0: 281, 1: 121})
Wikidata_test_counter: Counter({0: 589, 1: 254})
Counter({1: 254})
URLLabel_train_counter: Counter({0: 1126, 1: 484})
DBLP_train_counter: Counter({0: 3881, 1: 1669})
RegEx_train_counter: Counter({0: 11260, 1: 4840})
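The printed counts can be cross-checked by hand. The sketch below recomputes them from the `URLLabel_clean`, `DBLP_clean` and Wikidata totals shown in the output, assuming that `DataFrame.sample(frac=...)` draws `round(frac * n)` rows; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the printed output above.
n_false, n_true = 1407, 605      # URLLabel_clean
n_dblp_true = 1790               # DBLP_clean (positives only)
n_wikidata_true = 254            # Wikidata_on_seeds_clean (positives only)
test_frac = 1 - 0.8              # 1 - split_ratio_train_test

# Assumption: sample(frac=...) draws round(frac * n) rows per class.
test_false = round(n_false * test_frac)
test_true = round(n_true * test_frac)

# The negative:positive ratio of the manual annotation is reused for the other sets.
false_true_ratio = n_false / n_true

# DBLP training set: leftover positives, negatives scaled by the ratio.
dblp_train_true = n_dblp_true - test_true
dblp_train_false = int(dblp_train_true * false_true_ratio)

# Wikidata test set: all positives, negatives scaled by the test-set ratio.
wikidata_test_false = int(n_wikidata_true * test_false / test_true)

# RegEx training set: ten times the URLLabel training counts per class.
regex_train_false = 10 * (n_false - test_false)
regex_train_true = 10 * (n_true - test_true)

print(test_false, test_true)                 # 281 121
print(dblp_train_false, dblp_train_true)     # 3881 1669
print(wikidata_test_false, n_wikidata_true)  # 589 254
print(regex_train_false, regex_train_true)   # 11260 4840
```

All three training sets therefore keep roughly the same class ratio of about 2.33 negatives per positive, so differences in accuracy are less likely to stem from class imbalance alone.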




In [None]:
def formatDataset(df, n0, n1, seed):
    df0 = df[df['label'] == 0].sample(n0,random_state=seed)
    df1 = df[df['label'] == 1].sample(n1,random_state=seed)
    return pd.concat([df0,df1],ignore_index=True).reset_index(drop=True)

def traintestDict(df1, df2List):
    return {"train":df1.sample(frac=1,random_state=seed).reset_index(drop=True), 
          "test":[df2.sample(frac=1,random_state=seed).reset_index(drop=True) for df2 in df2List]}

def printCounter(dataset):
    return Counter(dataset["train"].label), [Counter(df.label) for df in dataset["test"]]

testset_list = [URLLabel_test, DBLP_test, Wikidata_test]

M_dataset = traintestDict(URLLabel_train, testset_list)
D_dataset = traintestDict(DBLP_train, testset_list)
R_dataset = traintestDict(RegEx_train, testset_list)


M_counter = printCounter(M_dataset)
D_counter = printCounter(D_dataset)
R_counter = printCounter(R_dataset)


print(printCounter(M_dataset))
print(printCounter(D_dataset))
print(R_counter)
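Each resulting dictionary pairs one training set with all three test sets. As a rough illustration of that structure, the following sketch builds the same kind of dictionary from toy frames; `shuffled`, `toy_train` and `toy_tests` are hypothetical names with made-up rows and are not part of the notebook.

```python
from collections import Counter
import pandas as pd

seed = 42  # same seed value as in the notebook

def shuffled(df):
    # Shuffle rows reproducibly and reset the index, as traintestDict does.
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy stand-ins for one training set and two test sets (made-up rows).
toy_train = pd.DataFrame({"sentence": ["a", "b", "c"], "label": [0, 1, 0]})
toy_tests = {
    "URLLabel": pd.DataFrame({"sentence": ["d", "e"], "label": [1, 0]}),
    "DBLP": pd.DataFrame({"sentence": ["f"], "label": [1]}),
}

dataset = {
    "train": shuffled(toy_train),
    "test": [shuffled(t) for t in toy_tests.values()],
}

# One training set is paired with every test set; report class counts per pair.
train_counts = Counter(dataset["train"].label.tolist())
for name, test_df in zip(toy_tests, dataset["test"]):
    print(name, dict(train_counts), dict(Counter(test_df.label.tolist())))
```

Keeping the test sets in a fixed list order makes it possible to compare models trained on different sources against the same three benchmarks.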