## This notebook uses data from the GAW to train a person-centric web page classifier.

Training is repeated on several data sets to investigate how accuracy depends on the training data, which provides a way of evaluating each data set creation method.

# What is the problem, why is it hard?
The Web, as a representation of the physical world, provides the opportunity to study large-scale phenomena of entities and relations that originate from structures of offline interactions.

As stated by the IDC, unstructured data occupies approximately 80% of the digital space by volume, compared to only 20% for structured data, and continues to be the primary driver of data growth \cite{potnis2019idc}.

Investigating these interactions requires a prior information extraction process through which \emph{entity}- and \emph{relation}-centric \emph{informational needs} are met \cite{broder2002taxonomy, butt2015taxonomy}.

This task becomes increasingly complex if the data is not available in a structured form (e.g., RDF/XML) \cite{gandhi2016information} but only as unstructured HTML or plain-text documents spread over millions of web pages.

Therefore, a semantic enrichment process of unstructured web content is necessary to extract \emph{entity}- and \emph{relation}-centric information.

For large data resources, such a process \emph{must} run automatically and take less time than collecting and structuring the information manually with the available Web search tools.

Despite being unstructured, web documents have both structural and textual aspects, which have previously been described in a combined representation \cite{lanotte2017exploiting, fathi2004web}.

The remaining difficulty is connecting these Web-based structures to physical or organizational structures. As a first step towards this goal, we aim to provide the web data and the network of people associated with the different university web structures in Germany.

In [1]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
    # If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 4 GPU(s) available.
We will use the GPU: GeForce GTX 1080 Ti


In [2]:
seed = 42
split_ratio_train_test = 0.8 
split_ratio_train_val = 0.9 

In [3]:
import pandas as pd
import numpy as np
from collections import Counter


def load_raw_datasets(pathname):
    df =  pd.read_csv("datasets/" + pathname, delimiter='\t', header=None, names=['sentence', 'label']).drop_duplicates().reset_index(drop=True)
    return df

# RegEx
fromRegExTrue = load_raw_datasets("fromRegEx/train_true.tsv")
fromRegExFalse = load_raw_datasets("fromRegEx/train_false.tsv")

# Existing resource DBLP
fromDBLPTrue = load_raw_datasets("fromDBLP/train_true.tsv")

# Manual annotation
fromURLLabelTrue = load_raw_datasets("fromURLLabel/train_true.tsv")
fromURLLabelFalse = load_raw_datasets("fromURLLabel/train_false.tsv")

# Existing resource
fromWikidata_on_seeds = load_raw_datasets("fromWikidata_urls_in_GAW/test_true.tsv")

In [4]:
# Concatenate true and false samples from the individual sources, remove internally duplicated sentences,
# and remove samples that co-occur in RegEx and DBLP from URLLabel.
def dedup_by_sentences(dfList):
    df_merged = pd.concat(dfList, ignore_index=True)
    # keep=False marks every copy of a repeated sentence, so none of the copies survives.
    df_deduped = df_merged[~df_merged.sentence.duplicated(keep=False)].reset_index(drop=True)
    return df_deduped

def dup_sentences(dfList):
    dfunion = pd.concat(dfList,ignore_index=True)
    duplicates = dfunion[dfunion.sentence.duplicated(keep=False)]
    return duplicates

def disjoint_samples(df1,df2):
    df1_without_df2 = df1[~df1.sentence.isin(df2.sentence)]
    df1_clean = df1_without_df2.sample(frac=1.0,random_state=seed).reset_index(drop=True)
    return df1_clean

## URLLabel merge true and false
## remove internally duplicate sentences
fromURLLabel_deduped = dedup_by_sentences([fromURLLabelTrue,fromURLLabelFalse])
## DBLP
fromDBLP_deduped = dedup_by_sentences([fromDBLPTrue])
## RegEx 
fromRegEx_deduped = dedup_by_sentences([fromRegExTrue,fromRegExFalse])
## Wikidata 
fromWikidata_deduped = dedup_by_sentences([fromWikidata_on_seeds])


## Find duplicate sentences
_duplicateDU = dup_sentences([fromDBLP_deduped,fromURLLabel_deduped])
_duplicateDR = dup_sentences([fromDBLP_deduped,fromRegEx_deduped])
_duplicateRU = dup_sentences([fromRegEx_deduped,fromURLLabel_deduped])

## Wikidata dups
_duplicateWD = dup_sentences([fromWikidata_deduped,fromDBLP_deduped])
_duplicateWR = dup_sentences([fromWikidata_deduped,fromRegEx_deduped])
_duplicateWU = dup_sentences([fromWikidata_deduped,fromURLLabel_deduped])


## Deduplicated data sets
_URLLabel_withoutDU = disjoint_samples(fromURLLabel_deduped,_duplicateDU)
URLLabel_clean = disjoint_samples(_URLLabel_withoutDU,_duplicateWU)

_DBLP_withoutDU = disjoint_samples(fromDBLP_deduped,_duplicateDU)
DBLP_clean = disjoint_samples(_DBLP_withoutDU,_duplicateWD)

_RegEx_withoutRU = disjoint_samples(fromRegEx_deduped,_duplicateRU)
_RegEx_withoutRU_DR = disjoint_samples(_RegEx_withoutRU,_duplicateDR)
RegEx_clean = disjoint_samples(_RegEx_withoutRU_DR,_duplicateWR)

Wikidata_on_seeds_clean = disjoint_samples(disjoint_samples(fromWikidata_on_seeds,fromDBLP_deduped),fromRegEx_deduped)

# Check that no (sentence, label) row occurs more than once across the cleaned sets.
assert not pd.concat([RegEx_clean,DBLP_clean,URLLabel_clean,Wikidata_on_seeds_clean]).reset_index(drop=True).duplicated().any()
# assert all(pd.concat([RegEx_clean,DBLP_clean,URLLabel_clean,fromWikidata_on_seeds]).reset_index(drop=True).duplicated() == False) == True
print("No duplicate sentences in all data sets")

No duplicate sentences in all data sets
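The two filtering steps above can be illustrated on toy data. The following is a minimal sketch, assuming two small hypothetical frames (`source_a`, `source_b`) that are not part of the GAW data; it mirrors the logic of `dedup_by_sentences` and `disjoint_samples` (without the shuffle) to show that `duplicated(keep=False)` drops every copy of a repeated sentence rather than keeping one.

```python
import pandas as pd

# Hypothetical toy frames standing in for two sources (not GAW data).
source_a = pd.DataFrame({"sentence": ["alice page", "bob page", "bob page"],
                         "label": [1, 1, 0]})
source_b = pd.DataFrame({"sentence": ["bob page", "carol page"],
                         "label": [1, 0]})

# Same logic as dedup_by_sentences: keep=False flags *every* copy of a
# repeated sentence, so repeated sentences disappear entirely.
merged = pd.concat([source_a, source_b], ignore_index=True)
deduped = merged[~merged.sentence.duplicated(keep=False)].reset_index(drop=True)

# Same filter as disjoint_samples (shuffle omitted): drop the rows of
# source_a whose sentence also occurs in source_b.
a_without_b = source_a[~source_a.sentence.isin(source_b.sentence)].reset_index(drop=True)

print(deduped.sentence.tolist())
print(a_without_b.sentence.tolist())
```

Dropping all copies is the conservative choice here: a sentence that appears with both labels cannot end up in a training set with an arbitrary label.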


# Construction of train and test datasets


In [5]:
def construct_test_set(df_for_true, df_for_false, sampling_true, sampling_false):
    # Positives and negatives may come from different sources; each sampling dict is passed to DataFrame.sample.
    test_true = df_for_true[df_for_true.label == 1].sample(**sampling_true).reset_index(drop=True)
    test_false = df_for_false[df_for_false.label == 0].sample(**sampling_false).reset_index(drop=True)
    test_df = pd.concat([test_true, test_false], ignore_index=True)
    return test_df

def construct_train_set(df, testsamples):
    df_train = df[~df.sentence.isin(testsamples.sentence)]
    return df_train

URLLabel_clean_counter = Counter(URLLabel_clean.label)
print("URLLabel_clean_counter:",URLLabel_clean_counter)
# Counter({0: 1407, 1: 605})
false_true_ratio = URLLabel_clean_counter[0]/URLLabel_clean_counter[1]
print("false_true_ratio: ", false_true_ratio)

## Test dataset
# URLLabel 
URLLabel_test = construct_test_set(URLLabel_clean,URLLabel_clean,{"frac": 1-split_ratio_train_test,"random_state" : seed},{"frac": 1-split_ratio_train_test,"random_state" : seed})
URLLabel_test_counter = Counter(URLLabel_test.label)
print("URLLabel_test_counter: " + str(URLLabel_test_counter))

# DBLP/RegEx
DBLP_test = construct_test_set(DBLP_clean, RegEx_clean, {"n": URLLabel_test_counter[1],"random_state" : seed},{"n": URLLabel_test_counter[0],"random_state" : seed})
DBLP_test_counter = Counter(DBLP_test.label)
DBLP_clean_counter = Counter(DBLP_clean.label)
print(DBLP_clean_counter)
print("DBLP_test_counter: " + str(DBLP_test_counter))

# Wikidata
n_wikidata = len(Wikidata_on_seeds_clean)
Wikidata_test = construct_test_set(Wikidata_on_seeds_clean,RegEx_clean, {"n": n_wikidata,"random_state" : seed},{"n": int(n_wikidata*URLLabel_test_counter[0]/URLLabel_test_counter[1]),"random_state" : seed})
Wikidata_test_counter = Counter(Wikidata_test.label)
print("Wikidata_test_counter: " + str(Wikidata_test_counter))
Wikidata_on_seeds_clean_counter = Counter(Wikidata_on_seeds_clean.label)
print(Wikidata_on_seeds_clean_counter)

## Train dataset
# URLLabel
URLLabel_train = URLLabel_clean[~URLLabel_clean.sentence.isin(URLLabel_test.sentence)]
URLLabel_train_counter = Counter(URLLabel_train.label)
print("URLLabel_train_counter: "+ str(URLLabel_train_counter))

#DBLP/RegEx
DBLP_train_true = DBLP_clean[~DBLP_clean.sentence.isin(DBLP_test.sentence)]
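# Combine both conditions into one mask so the boolean index stays aligned with RegEx_clean.
DBLP_train_false_pool = RegEx_clean[(RegEx_clean.label == 0) & ~RegEx_clean.sentence.isin(DBLP_test.sentence)]
DBLP_train_false = DBLP_train_false_pool.sample(int(len(DBLP_train_true)*false_true_ratio),random_state=seed).reset_index(drop=True)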
DBLP_train = pd.concat([DBLP_train_true, DBLP_train_false],ignore_index=True).reset_index(drop=True)
DBLP_train_counter = Counter(DBLP_train.label)
print("DBLP_train_counter: " + str(DBLP_train_counter))

# Remaining sample from the RegEx
#RegEx
remaining_RegEx = RegEx_clean[~RegEx_clean.sentence.isin(DBLP_train.sentence)]
RegEx_train_true = remaining_RegEx[remaining_RegEx.label == 1].sample(10*URLLabel_train_counter[1],random_state=seed)
RegEx_train_false = remaining_RegEx[remaining_RegEx.label == 0].sample(10*URLLabel_train_counter[0],random_state=seed)
RegEx_train = pd.concat([RegEx_train_true, RegEx_train_false],ignore_index=True).sample(frac=1,random_state=seed).reset_index(drop=True)
RegEx_train_counter = Counter(RegEx_train.label)
print("RegEx_train_counter: "+ str(RegEx_train_counter))

URLLabel_clean_counter: Counter({0: 1407, 1: 605})
false_true_ratio:  2.3256198347107437
URLLabel_test_counter: Counter({0: 281, 1: 121})
Counter({1: 1790})
DBLP_test_counter: Counter({0: 281, 1: 121})
Wikidata_test_counter: Counter({0: 589, 1: 254})
Counter({1: 254})
URLLabel_train_counter: Counter({0: 1126, 1: 484})
DBLP_train_counter: Counter({0: 3881, 1: 1669})
RegEx_train_counter: Counter({0: 11260, 1: 4840})
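The counts printed above follow from the class ratio of the manually labelled set. As a quick sanity check, the arithmetic can be reproduced without the data. This sketch uses only the numbers shown in the output and assumes that `DataFrame.sample(frac=...)` rounds `frac * len` to the nearest integer.

```python
# Class counts reported in the output above for URLLabel_clean.
n_false, n_true = 1407, 605
split_ratio_train_test = 0.8

false_true_ratio = n_false / n_true

# Assumption: sample(frac=...) draws round(frac * len) rows per class.
test_false = round((1 - split_ratio_train_test) * n_false)
test_true = round((1 - split_ratio_train_test) * n_true)

# Wikidata test set: all 254 positives, negatives scaled to the same ratio.
n_wikidata = 254
wikidata_false = int(n_wikidata * test_false / test_true)

# DBLP training set: 1790 positives minus the 121 used for testing.
dblp_train_true = 1790 - test_true
dblp_train_false = int(dblp_train_true * false_true_ratio)

print(test_false, test_true, wikidata_false, dblp_train_true, dblp_train_false)
```

All three test sets therefore share roughly the same negative-to-positive ratio of about 2.33, so differences in the test metrics are not driven by class balance.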




In [6]:
def formatDataset(df, n0, n1, seed):
    df0 = df[df['label'] == 0].sample(n0,random_state=seed)
    df1 = df[df['label'] == 1].sample(n1,random_state=seed)
    return pd.concat([df0,df1],ignore_index=True).reset_index(drop=True)

def traintestDict(df1, df2List):
    return {"train":df1.sample(frac=1,random_state=seed).reset_index(drop=True), 
          "test":[df2.sample(frac=1,random_state=seed).reset_index(drop=True) for df2 in df2List]}

def printCounter(dataset):
    return Counter(dataset["train"].label), [Counter(df.label) for df in dataset["test"]]

testset_list = [URLLabel_test, DBLP_test, Wikidata_test]

M_dataset = traintestDict(URLLabel_train, testset_list)
D_dataset = traintestDict(DBLP_train, testset_list)
R_dataset = traintestDict(RegEx_train, testset_list)

print("M_dataset:", printCounter(M_dataset))
print("D_dataset:", printCounter(D_dataset))
print("R_dataset:", printCounter(R_dataset))

M_dataset: (Counter({0: 1126, 1: 484}), [Counter({0: 281, 1: 121}), Counter({0: 281, 1: 121}), Counter({0: 589, 1: 254})])
D_dataset: (Counter({0: 3881, 1: 1669}), [Counter({0: 281, 1: 121}), Counter({0: 281, 1: 121}), Counter({0: 589, 1: 254})])
R_dataset: (Counter({0: 11260, 1: 4840}), [Counter({0: 281, 1: 121}), Counter({0: 281, 1: 121}), Counter({0: 589, 1: 254})])


In [7]:
import time
import datetime
import sklearn as sk
import sklearn.preprocessing as pp

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))

    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded)) 

def person_dataset2bert_dataset(df:pd.DataFrame, tokenizer):
    sentences = df.sentence.values
    labels = df.label.values

    # Tokenize all of the sentences and map the tokens to their word IDs.
    input_ids = []
    attention_masks = []

    # For every sentence...
    for sent in sentences:
      # `encode_plus` will:
      #   (1) Tokenize the sentence.
      #   (2) Prepend the `[CLS]` token to the start.
      #   (3) Append the `[SEP]` token to the end.
      #   (4) Map tokens to their IDs.
      #   (5) Pad or truncate the sentence to `max_length`
      #   (6) Create attention masks for [PAD] tokens.
      encoded_dict = tokenizer.encode_plus(
                          sent,                           # Sentence to encode.
                          add_special_tokens = True,      # Add '[CLS]' and '[SEP]'
                          max_length = 128,               # Pad & truncate all sentences.
                          pad_to_max_length = True,
                          return_attention_mask = True,   # Construct attn. masks.
                          return_tensors = 'pt',          # Return pytorch tensors.
                    )

      # Add the encoded sentence to the list.    
      input_ids.append(encoded_dict['input_ids'])

      # And its attention mask (simply differentiates padding from non-padding).
      attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels)

    from torch.utils.data import TensorDataset, random_split

    # Combine the training inputs into a TensorDataset.
    dataset = TensorDataset(input_ids, attention_masks, labels)
    return dataset

def log2prob(logits):
    probits = list(map(lambda l: np.exp(l)/(1+ np.exp(l)), logits))
    return pp.normalize(probits,norm='l1')
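Note that `log2prob` applies a sigmoid to each logit and then L1-normalises every row, which is not the same as a softmax over the two logits. The sketch below is a NumPy-only replica (the names `sigmoid_l1` and `softmax` and the example logits are assumptions for illustration). It suggests that both mappings produce rows summing to one and agree on the predicted class, while the probability values themselves differ.

```python
import numpy as np

def sigmoid_l1(logits):
    """Replica of log2prob: element-wise sigmoid, then L1-normalise each row."""
    s = 1.0 / (1.0 + np.exp(-logits))
    return s / s.sum(axis=1, keepdims=True)

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical logits for three samples and two classes.
example_logits = np.array([[2.0, -1.0], [0.5, 0.3], [-3.0, 1.5]])

p_sigmoid = sigmoid_l1(example_logits)
p_softmax = softmax(example_logits)

print(p_sigmoid)
print(p_softmax)
```

Because the sigmoid is monotonic, the class ranking within a row is preserved, so accuracy and F1 are unaffected; threshold-free scores such as AUC depend on the ordering across samples and may differ slightly between the two mappings.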

In [8]:
def createDataLoader(train_dataset, val_dataset,batch_size):
    # The DataLoader needs to know our batch size for training, so we specify it 
    # here. For fine-tuning BERT on a specific task, the authors recommend a batch 
    # size of 16 or 32.
    from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

    # Create the DataLoaders for our training and validation sets.
    train_dataloader = DataLoader(
              train_dataset,                          # The training samples.
              sampler = RandomSampler(train_dataset), # Select batches randomly
              batch_size = batch_size                 # Trains with this batch size.
          )

    validation_dataloader = DataLoader(
              val_dataset,                              # The validation samples.
              sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
              batch_size = batch_size                   # Evaluate with this batch size.
          )
    return train_dataloader, validation_dataloader
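The batch size chosen here determines how many optimizer steps the training cell runs, and with it the shape of the linear learning-rate decay. The following pure-Python sketch works through the smallest configuration (200 samples). `linear_decay_factor` is a hypothetical helper that mirrors what a linear schedule with warmup is understood to compute; it is not the library function itself.

```python
import math

def linear_decay_factor(step, total_steps, warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Values taken from the notebook: smallest sample size, 90/10 train/val split.
n_samples = 200
split_ratio_train_val = 0.9
batch_size = 32
epochs = 4
base_lr = 2e-5

train_size = int(split_ratio_train_val * n_samples)
batches_per_epoch = math.ceil(train_size / batch_size)  # the last batch may be smaller
total_steps = batches_per_epoch * epochs

# Learning rate at the start, midpoint, and end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_decay_factor(step, total_steps))
```

With so few steps at the small sample sizes, each step changes the learning rate noticeably, which is one reason results at 200 samples can vary more across seeds than results on the full sets.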


In [None]:
pretrained_model_name = "bert-base-multilingual-cased" 
import sklearn
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score
from sklearn.metrics import matthews_corrcoef
from collections import defaultdict
import random
import numpy as np
from torch.utils.data import TensorDataset, random_split
import warnings
warnings.filterwarnings('ignore')
# import logging
# logging.getLogger("transformers.tokenization_utils").setLevel(logging.ERROR)
# logging.getLogger("pytorch_pretrained_bert.tokenization").setLevel(logging.ERROR)

start_sampleNo = 200
step_size_sampleNo = 200
sampleNoList = list(np.arange(start_sampleNo,len(M_dataset["train"]),step_size_sampleNo)) + ['max']
seedList=list(range(42,52))

results = {}
for person_dataset_name, person_dataset in zip(["M/M - M/M", "R/R - M/M", "DBLP/R - M/M"],[M_dataset, R_dataset, D_dataset]):
# for person_dataset_name, person_dataset in zip([ "R/R - M/M", "DBLP/R - M/M"],[ R_dataset, D_dataset]):
# for person_dataset_name, person_dataset in zip([ "DBLP/R - M/M"],[ D_dataset]):
# for person_dataset_name, person_dataset in zip(["M/M - M/M"],[M_dataset]): 
    data_set_results = {}
    save_flag = 0
    person_dataset_name_path = "_".join("".join(person_dataset_name.split("/")).split())
    for sampleNo in sampleNoList:
        seed_stats = []
        for seed in seedList:
            if sampleNo != 'max':
                df_train = person_dataset["train"].sample(sampleNo,replace=False,random_state=seed).reset_index(drop=True)
            else:
                df_train = person_dataset["train"]

            df_test = person_dataset["test"]

            sampleNoStr = str(len(df_train))
            print(person_dataset_name + " " + sampleNoStr + " " + str(seed) + format_time(time.time()))

            from transformers import BertTokenizer

            # print('Loading BERT tokenizer...')
            tokenizer = BertTokenizer.from_pretrained(pretrained_model_name, do_lower_case=False)

            dataset = person_dataset2bert_dataset(df_train, tokenizer)

            train_size = int(split_ratio_train_val * len(dataset))
            val_size = len(dataset) - train_size

            train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

            batch_size = 32
            train_dataloader, validation_dataloader = createDataLoader(train_dataset, val_dataset,batch_size)
            from transformers import BertForSequenceClassification, BertConfig

            model = BertForSequenceClassification.from_pretrained(
              pretrained_model_name, # The 12-layer multilingual BERT model, with a cased vocab.
              num_labels = 2, # The number of output labels--2 for binary classification.
                              # You can increase this for multi-class tasks.   
              output_attentions = False, # Whether the model returns attentions weights.
              output_hidden_states = False, # Whether the model returns all hidden-states.
            )

            # Move the model to the device selected above (GPU if available, otherwise CPU).
            model.to(device)

            from torch import optim
            optimizer = optim.AdamW(model.parameters(), lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                              eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                            )

            from transformers import get_linear_schedule_with_warmup

            epochs = 4
            total_steps = len(train_dataloader) * epochs

            scheduler = get_linear_schedule_with_warmup(optimizer, 
                                                      num_warmup_steps = 0, # Default value in run_glue.py
                                                      num_training_steps = total_steps)


            # This training code is based on the `run_glue.py` script here:
            # https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

            seed_val = seed

            random.seed(seed_val)
            np.random.seed(seed_val)
            torch.manual_seed(seed_val)
            torch.cuda.manual_seed_all(seed_val)

            training_stats = []
            predict_stats = []

            total_t0 = time.time()
            total_steps = 0
            # For each epoch...
            for epoch_i in range(0, epochs):
                    
              # ========================================
              #               Training
              # ========================================

              # Measure how long the training epoch takes.
                t0 = time.time()

                # Reset the total loss for this epoch.
                total_train_loss = 0
                model.train()

                # For each batch of training data...
                for step, batch in enumerate(train_dataloader):

                    total_steps+=1

                    if step % 40 == 0 and not step == 0:
                        elapsed = format_time(time.time() - t0)
                      # print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

                    b_input_ids = batch[0].to(device)
                    b_input_mask = batch[1].to(device)
                    b_labels = batch[2].to(device)

                    model.zero_grad()        

                    loss_logits = model(b_input_ids, 
                                      token_type_ids=None, 
                                      attention_mask=b_input_mask, 
                                      labels=b_labels)

                    loss = loss_logits[0]
                    logits = loss_logits[1]

                    total_train_loss += loss.item()

                    loss.backward()

                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

                    optimizer.step()
                    scheduler.step()

                avg_train_loss = total_train_loss / len(train_dataloader)            
                training_time = format_time(time.time() - t0)
              
              # ========================================
              #               Validation
              # ========================================


                t0 = time.time()

                model.eval()

                # Tracking variables 
                total_eval_accuracy = 0
                total_eval_loss = 0
                nb_eval_steps = 0
                total_eval_auc_roc = []
                total_eval_f1_score = []
                total_eval_recall = []
                total_eval_prec = []
                y_true = []
                y_pred = []

              # Evaluate data for one epoch
                for batch in validation_dataloader:

                    b_input_ids = batch[0].to(device)
                    b_input_mask = batch[1].to(device)
                    b_labels = batch[2].to(device)

                    # Tell pytorch not to bother with constructing the compute graph during
                    # the forward pass, since this is only needed for backprop (training).
                    with torch.no_grad():        

                        loss_logits = model(b_input_ids, 
                                            token_type_ids=None, 
                                            attention_mask=b_input_mask,
                                            labels=b_labels)

                        loss = loss_logits[0]
                        logits = loss_logits[1]

                    total_eval_loss += loss.item()

                    # Move logits and labels to CPU
                    logits = logits.detach().cpu().numpy()
                    label_ids = b_labels.to('cpu').numpy()

                    # Determine probability
                    probits = log2prob(logits)

                    # Calculate the accuracy for this batch of test sentences, and
                    # accumulate it over all batches.
                    y_true.append(label_ids)
                    y_pred.append(probits.T[1])

                    total_eval_accuracy += flat_accuracy(logits, label_ids)
                    
                flat_true_labels = np.concatenate(y_true, axis=0)

                flat_probs = np.concatenate(y_pred, axis=0)

                # Combine the correct labels for each batch into a single list.
                y_true = flat_true_labels
                y_pred = flat_probs

                # total_eval_auc_roc.append(sklearn.metrics.roc_auc_score(y_true, y_pred))
                total_eval_f1_score.append(f1_score(y_true,np.round(y_pred)))
                total_eval_recall.append(recall_score(y_true,np.round(y_pred)))
                total_eval_prec.append(precision_score(y_true,np.round(y_pred)))

                # Report the final accuracy for this validation run.
                avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
                # print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

                # Calculate the average loss over all of the batches.
                avg_val_loss = total_eval_loss / len(validation_dataloader)
              
                # Measure how long the validation run took.
                validation_time = format_time(time.time() - t0)
              
                # print("  Validation Loss: {0:.2f}".format(avg_val_loss))
                # print("  Validation took: {:}".format(validation_time))

                # Record all statistics from this epoch.
                val_stats_dict = {   'epoch': epoch_i + 1,
                                'total_steps': total_steps,
                      'Training Loss': avg_train_loss,
                      # 'Valid. avg. AUC': np.average(total_eval_auc_roc),
                      'Valid. avg. F1' : np.average(total_eval_f1_score),
                      'Valid. Recall': np.average(total_eval_recall),
                      'Valid. Prec.' : np.average(total_eval_prec),
                      'Valid. Loss': avg_val_loss,
                      'Valid. Accur.': avg_val_accuracy,
                      'Training Time': training_time,
                      'Validation Time': validation_time
                  }
                training_stats.append( val_stats_dict
                )

                for testset, df_predict in enumerate(df_test):
                              # df_predict = df_test
                    # Report the number of sentences.
                    from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

                    # print('Number of test sentences: {:,}\n'.format(df_predict.shape[0]))
                    prediction_data = person_dataset2bert_dataset(df_predict, tokenizer)
                    prediction_sampler = SequentialSampler(prediction_data)
                    prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

                    # print('Predicting labels for {:,} test sentences...'.format(len(input_ids)))

                    # Put model in evaluation mode
                    model.eval()

                    # Tracking variables 
                    predictions , true_labels = [], []
                    probs = []

                    # Predict 
                    for batch in prediction_dataloader:
                            # Add batch to GPU
                            batch = tuple(t.to(device) for t in batch)

                            # Unpack the inputs from our dataloader
                            b_input_ids, b_input_mask, b_labels  = batch

                            # Telling the model not to compute or store gradients, saving memory and 
                            # speeding up prediction
                            with torch.no_grad():
                              # Forward pass, calculate logit predictions
                              outputs = model(b_input_ids, token_type_ids=None, 
                                              attention_mask=b_input_mask)

                            logits = outputs[0]

                            # Move logits and labels to CPU
                            logits = logits.detach().cpu().numpy()
                            label_ids = b_labels.to('cpu').numpy()

                            # Probits
                            probits = log2prob(logits)

                            # Store predictions and true labels
                            predictions.append(logits)
                            true_labels.append(label_ids)
                            probs.append(probits.T[1])


                    flat_predictions = np.concatenate(predictions, axis=0)
                    flat_probs = np.concatenate(probs, axis=0)
                    # For each sample, pick the label (0 or 1) with the higher score.
                    flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

                    # Combine the correct labels for each batch into a single list.
                    flat_true_labels = np.concatenate(true_labels, axis=0)

                    predict_stats.append(
                        {   'TestSet':testset,
                            'epoch' : epoch_i + 1,
                            'total_steps': total_steps,
                            'Test AUC'   : sklearn.metrics.roc_auc_score(flat_true_labels, flat_probs),
                            'Test F1'    : f1_score(flat_true_labels, flat_predictions, average='macro'),
                            'Test Accu.' : accuracy_score(flat_true_labels, flat_predictions),
                            'Test Recall': recall_score(flat_true_labels, flat_predictions),
                            'Test Prec.' : precision_score(flat_true_labels, flat_predictions),
                            'MCC.'  : matthews_corrcoef(flat_true_labels, flat_predictions),
                            # 'True_Labels' : flat_true_labels,
                            # 'Predictions' : flat_predictions,
                        }
                    )

              # if (sampleNo == 'max') & (save_flag == 0):
              #   save_flag = 1
              #   torch.save(model.state_dict(), '/content/gdrive/MyDrive/InstituteClustering/bert/models/with_predictions_seeded_'+ person_dataset_name_path)
     
            seed_stats.append({str(seed) : { "train_stats": training_stats , "predict_stats":predict_stats}})
        data_set_results[sampleNoStr] = seed_stats
    results[person_dataset_name] = data_set_results
    import json
    person_dataset_name_path = "_".join("".join(person_dataset_name.split("/")).split())
    with open('/content/gdrive/MyDrive/InstituteClustering/bert/results/camera_ready_with_predictions_seeded_' + person_dataset_name_path +'.json', 'w') as fp:
        json.dump(data_set_results, fp) 

R/R - M/M 200 4218774 days, 12:15:01


R/R - M/M 200 4318774 days, 12:17:07
R/R - M/M 200 4418774 days, 12:18:25
R/R - M/M 200 4518774 days, 12:19:36
R/R - M/M 200 4618774 days, 12:20:46
R/R - M/M 200 4718774 days, 12:21:55
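The prediction loop records a macro-averaged F1 and the Matthews correlation coefficient (MCC) on the test sets, whereas the validation F1 uses the default binary averaging. To make the difference concrete, here is a minimal pure-Python sketch that derives these scores from confusion counts. The helper names and the toy labels are assumptions for illustration and are not the scikit-learn implementations.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(tp, fp, fn, tn):
    # For class 0 the roles swap: tn are its hits, fn its false alarms, fp its misses.
    return (f1_from_counts(tp, fp, fn) + f1_from_counts(tn, fn, fp)) / 2

def mcc(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced toy labels (3 positives, 7 negatives).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("binary F1:", f1_from_counts(tp, fp, fn))
print("macro F1: ", macro_f1(tp, fp, fn, tn))
print("MCC:      ", mcc(tp, fp, fn, tn))
```

On imbalanced data such as these test sets, macro F1 is pulled up by the majority class, so it should not be compared directly with the binary validation F1.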
