## Distant Supervision

To run this notebook, you need the following files in the data/ directory:
- "news_headlines_usa_neutral.csv"
- "news_headlines_usa_biased.csv"

In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [3]:
# gpu card
!nvidia-smi

Wed Apr 21 10:55:56 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.67       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
import time
import random
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification
from transformers import BertTokenizer, TFBertForSequenceClassification

In [None]:
# set seeds: TF also uses Python's random module and NumPy, so these must be fixed as well
tf.random.set_seed(0)
random.seed(0)
np.random.seed(0)
os.environ['PYTHONHASHSEED']=str(0)
os.environ['TF_DETERMINISTIC_OPS'] = '1'  # '1' requests deterministic ops; '0' leaves them non-deterministic

## Media Cloud

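The idea is that outlets such as Alternet, Breitbart, and Federalist contain a higher density of biased language than Reuters. Using Media Cloud, I extract article headlines from these outlets on controversial topics. Every headline from Breitbart and similar outlets is assumed to be biased (label 1), and every headline from the neutral outlets is assumed to be neutral (label 0). No headline is annotated individually, so the labels are noisy.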

In [None]:
def read_media_cloud_data(path, label):
    """Read in data downloaded from media cloud and assign a label to all rows"""
    df = pd.read_csv(path)
    df['Label_bias'] = label
    df = df.rename({'title': 'sentence'}, axis=1)
    return df

# read in two datasets
PATH_biased = "data/news_headlines_usa_biased.csv"
PATH_neutral = "data/news_headlines_usa_neutral.csv"
df_biased = read_media_cloud_data(PATH_biased, 1)
df_neutral = read_media_cloud_data(PATH_neutral, 0)

# combine them
df_distant = pd.concat([df_biased, df_neutral], axis=0, ignore_index=True)
df_distant = shuffle(df_distant)

# train-test split
df_distant_train, df_distant_test = train_test_split(df_distant, test_size=0.2)

In [None]:
df_distant['Label_bias'].value_counts()

0    83143
1    45605
Name: Label_bias, dtype: int64
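
The counts above show that the two classes are not balanced. As a rough check, the following sketch derives the share of each label and the accuracy of a trivial majority-class predictor from the printed counts. The variable names are illustrative and not part of the notebook.

```python
# Label counts copied from the value_counts() output above
counts = {0: 83143, 1: 45605}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent label reaches this accuracy
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"label {label}: {counts[label]} headlines ({share:.1%})")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Under these counts, a model would need to exceed roughly 65% accuracy before it tells us more than the label distribution alone.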

In [None]:
def preprocess(df):
    """Convert a pandas DataFrame into a TensorFlow dataset of (encodings, label) pairs"""
    # select the columns instead of pop() so the caller's DataFrame stays unchanged
    target = df['Label_bias']
    sentence = df['sentence']

    #tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased') #uncased
    #tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

    encodings = tokenizer(
        sentence.tolist(),
        add_special_tokens=True,     # add <s> and </s>, RoBERTa's counterparts of [CLS] and [SEP]
        truncation=True,             # cut off text at the model's maximum input length
        padding=True,                # pad to the longest headline with <pad> tokens
        return_attention_mask=True,  # attention mask so the model ignores pad tokens
    )

    dataset = tf.data.Dataset.from_tensor_slices(
        (dict(encodings), target.tolist()))
    return dataset
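
To make the `padding` and `return_attention_mask` options more concrete, here is a toy re-implementation of what they produce. It is only an illustration and not the tokenizer's actual code: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 1 is the one used by `roberta-base`.

```python
from typing import Dict, List

PAD_ID = 1  # pad token id of roberta-base; an assumption for any other vocabulary


def pad_batch(token_ids: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in token_ids)
    input_ids, attention_mask = [], []
    for ids in token_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# made-up token ids; 0 and 2 stand in for the <s> and </s> special tokens
batch = pad_batch([[0, 713, 16, 2], [0, 713, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence receives one pad token and a trailing 0 in its mask, so both rows have the same length and can be stacked into one tensor.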

In [None]:
# pandas -> tensorflow
train_distant_dataset = preprocess(df_distant_train)
test_distant_dataset = preprocess(df_distant_test)

# batch and randomize
BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_distant_dataset = train_distant_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)
test_distant_dataset = test_distant_dataset.batch(BATCH_SIZE).prefetch(tf.data.experimental.AUTOTUNE)

CPU: a mini-batch size of 64 gives an ETA of 19 hours; a mini-batch size of 256 gives an ETA of 16 hours.

GPU: a mini-batch size of 256 gives an ETA of 20 minutes.
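
For context on these ETAs, the sketch below works out how many training steps one epoch takes at each batch size. It assumes the label counts printed earlier and the 80/20 split used above; `steps_per_epoch` is a hypothetical helper, not a notebook function.

```python
import math

N_SAMPLES = 83143 + 45605   # total headlines, from the value_counts() output
TEST_SIZE = 0.2             # as passed to train_test_split

# train_test_split rounds the test share up and gives the rest to training
n_test = math.ceil(TEST_SIZE * N_SAMPLES)
n_train = N_SAMPLES - n_test


def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Number of batches per epoch when the last, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)


for batch_size in (64, 256):
    print(f"batch size {batch_size}: {steps_per_epoch(n_train, batch_size)} steps")
```

A batch size of 256 needs about a quarter of the steps, but each step processes four times as many headlines, which may explain why the CPU estimate improves only modestly.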

In [None]:
tf.keras.backend.clear_session()

In [None]:
# train entire model with distant signals
#bert = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") # DistilBERT with a sequence classification/regression head on top (a linear layer on top of the pooled output)
#bert = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
roberta = TFRobertaForSequenceClassification.from_pretrained('roberta-base')

callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True) # stop after 1 epoch without improvement in val_loss and restore the best weights

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
roberta.compile(optimizer=optimizer, loss=roberta.compute_loss) 

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
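
The patience rule configured in the `EarlyStopping` callback can be illustrated with a simplified simulation. This is a sketch of the general idea, not Keras's implementation: it ignores options such as `min_delta`, `simulate_early_stopping` is a hypothetical helper, and the validation losses are invented.

```python
from typing import List, Tuple


def simulate_early_stopping(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (last_epoch_run, best_epoch), both 0-based, for a patience rule on val_loss."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop here; the best weights are from best_epoch
    return len(val_losses) - 1, best_epoch


# hypothetical validation losses, not results from this notebook
last_epoch, best_epoch = simulate_early_stopping([0.60, 0.55, 0.57, 0.54], patience=1)
print(last_epoch, best_epoch)
```

With `patience=1`, training stops at the first epoch that fails to improve, so the later, lower loss of 0.54 is never reached. With `epochs=1`, as in the `fit` call in this notebook, a rule like this can never stop training early.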


In [None]:
history_roberta = roberta.fit(train_distant_dataset, epochs=1, validation_data=test_distant_dataset, callbacks=[callback])

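
The notebook imports accuracy, F1, precision, and recall metrics but does not show them being used. As a sketch of how predictions could be scored, the following turns two-class logits into labels and computes those four metrics by hand with NumPy. `binary_metrics` is a hypothetical helper and the logits are invented, not model output.

```python
import numpy as np


def binary_metrics(logits: np.ndarray, y_true: np.ndarray) -> dict:
    """Accuracy, precision, recall and F1 for the positive (biased = 1) class."""
    y_pred = np.argmax(logits, axis=1)  # index of the larger logit is the predicted label
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = float(np.mean(y_pred == y_true))
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# made-up logits for four headlines, one row per headline, one column per class
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]])
toy_labels = np.array([0, 1, 0, 1])
metrics = binary_metrics(toy_logits, toy_labels)
print(metrics)
```

In the notebook, the logits would come from the model's predictions on `test_distant_dataset`; how they are accessed on the returned object depends on the transformers version. The scikit-learn functions imported at the top (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) compute the same quantities from the predicted and true labels.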


In [None]:
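# keep a copy of the weights of the model's first layer (the RoBERTa encoder)
trained_layer = roberta.get_layer(index=0).get_weights()

# save all model weights as a checkpoint
roberta.save_weights('./checkpoints/roberta_final_checkpoint_news_headlines_USA')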

#bert.load_weights('./checkpoints/final_checkpoint_distant_learning')