
## **Import necessary Python libraries and modules**

In [1]:
!pip install transformers[torch]



In [2]:
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification
from transformers import Trainer, TrainingArguments

In [3]:
from collections import defaultdict
import gdown
import gzip
import json
import os
import random
import pickle

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import ticker
sns.set(style='ticks', font_scale=1.2)

In [4]:
!pip install datasets



In [5]:
from datasets import load_dataset


## **Load StorySeeker annotations from Hugging Face**

In [10]:
storyseeker_dataset = load_dataset('') # removed for anonymity


In [14]:
train_df = pd.DataFrame(storyseeker_dataset['train'])
len(train_df.index)

301

In [16]:
val_df = pd.DataFrame(storyseeker_dataset['val'])
len(val_df.index)

100


## **Load Reddit data**

"The Webis TLDR Corpus (2017) consists of approximately 4 Million content-summary pairs extracted for Abstractive Summarization, from the Reddit dataset for the years 2006-2016. This corpus is first of its kind from the social media domain in English and has been created to compensate the lack of variety in the datasets used for abstractive summarization research using deep learning models."

Read more here: [https://webis.de/data/webis-tldr-17.html](https://webis.de/data/webis-tldr-17.html)

From:
Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. [TL;DR: Mining Reddit to Learn Automatic Summarization](https://downloads.webis.de/publications/papers/voelske_2017.pdf). In Giuseppe Carenini, Jackie Chi Kit Cheung, Fei Liu, and Lu Wang, editors, Workshop on New Frontiers in Summarization at EMNLP 2017, pages 59–63, September 2017. Association for Computational Linguistics.

In [30]:
reddit_path = 'https://zenodo.org/records/1043504/files/corpus-webis-tldr-17.zip?download=1'

In [31]:
import urllib.request
import json
from io import BytesIO
from zipfile import ZipFile

In [27]:
# Use a set so the membership test in the corpus scan is O(1) per record
target_ids = set(train_df['id'].tolist() + val_df['id'].tolist())

In [41]:
access_url = urllib.request.urlopen(reddit_path)

id_data_dict = {}
with ZipFile(BytesIO(access_url.read()), 'r') as zip_obj:
  for _filename in zip_obj.namelist():
    with zip_obj.open(_filename) as _file:
      for _line in _file:
        _data = json.loads(_line)
        if _data['id'] in target_ids:
          id_data_dict[_data['id']] = _data
len(id_data_dict)

400
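
The scan above can be tried out on a tiny in-memory archive before downloading the full corpus. This is a minimal sketch, not the notebook's own code: the helper name `filter_jsonl_zip` and the toy records are hypothetical, and it assumes each archive member holds one JSON object per line with an `id` field, as the loop above does.

```python
import json
from io import BytesIO
from zipfile import ZipFile


def filter_jsonl_zip(zip_bytes, wanted_ids):
    """Return {id: record} for JSON-lines records whose 'id' is in wanted_ids."""
    wanted = set(wanted_ids)  # set membership is O(1) per record
    found = {}
    with ZipFile(BytesIO(zip_bytes), 'r') as zip_obj:
        for name in zip_obj.namelist():
            with zip_obj.open(name) as handle:
                for raw_line in handle:
                    if not raw_line.strip():
                        continue  # skip blank lines
                    record = json.loads(raw_line)
                    if record['id'] in wanted:
                        found[record['id']] = record
    return found


# Toy archive standing in for the real corpus (hypothetical records).
buffer = BytesIO()
with ZipFile(buffer, 'w') as toy_zip:
    lines = [json.dumps({'id': f'post{i}', 'body': f'text {i}'}) for i in range(5)]
    toy_zip.writestr('toy.json', '\n'.join(lines))

subset = filter_jsonl_zip(buffer.getvalue(), ['post1', 'post3', 'missing'])
print(sorted(subset))
```

Comparing `len(subset)` with the number of distinct requested ids is a quick way to see whether any id was not found in the archive.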


In [43]:
# random.sample needs a sequence; passing a dict view is deprecated since Python 3.9
random.sample(list(id_data_dict.items()), 1)


[('c3q5abb',
  {'author': 'phishsticker',
   'body': "With Downing and Kuyt on the field, we lack that cutting edge. Downing is basically invisible most of the game, and Kuyt really lacks the pace and skill to work on the right side. Our attack is so easily defended against that its laughable. We need a skillful RW/LW that can actually beat their player and whip a GOOD ball in or take a 1/2 decent shot. \n\nCarroll was back to his usual self, he really seems to hate coming on as a sub. You could hear Kenny on the side constantly yelling at him to apply some pressure. \n\nBoth goals were through individual mistakes and were avoidable, but we didn't deserve anything from the match. We were atrocious for 90% of the game. I really think we will regret not bringing in a striker or rw in january, I have no idea where our goals will come from unless Bellamy magically wakes up with working knees. Oh and Evra is a cunt.\n\n**tldr: we were shit, evra is a cunt**",
   'normalizedBody': "With Down

In [44]:
train_df['text'] = train_df['id'].apply(lambda x: id_data_dict[x]['body'])
val_df['text'] = val_df['id'].apply(lambda x: id_data_dict[x]['body'])

In [46]:
X_train = train_df['text'].tolist()
X_val = val_df['text'].tolist()

# y_train = _story_df_train['gold_intersection'].tolist()
# y_val = _story_df_test['gold_intersection'].tolist()
y_train = train_df['gold_consensus'].tolist()
y_val = val_df['gold_consensus'].tolist()

len(X_train), len(X_val), len(y_train), len(y_val)

(301, 100, 301, 100)


## **Load StorySeeker model from Hugging Face**

In [48]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [64]:
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'
max_length = 512

In [None]:
tokenizer = AutoTokenizer.from_pretrained("") # removed for anonymity
model = AutoModelForSequenceClassification.from_pretrained("") # removed for anonymity


In [57]:
trainer = Trainer(model=model)
trainer.model = model.to(device_name)


## **Encode the StorySeeker data and run prediction**

In [50]:
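# Sort the labels so the mapping is deterministic; this assumes the model's output indices follow the sorted label order
unique_labels = sorted(set(y_val))
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}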

In [53]:
val_encodings  = tokenizer(X_val, truncation=True, padding=True, max_length=max_length)

val_labels_encoded  = [label2id[y] for y in y_val]

In [54]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [55]:
val_dataset = MyDataset(val_encodings, val_labels_encoded)

In [58]:
predicted_results = trainer.predict(val_dataset)
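
For a sequence classification model, `predicted_results.predictions` holds raw logits, one row per example and one column per label. If probabilities are wanted rather than hard labels, a softmax over each row gives them. The sketch below uses plain Python and made-up logits; the names `softmax`, `toy_logits`, `toy_probs`, and `toy_labels` are illustrative and not part of the notebook.

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three examples and two classes.
toy_logits = [[2.0, -1.0], [0.0, 0.0], [-0.5, 1.5]]
toy_probs = [softmax(row) for row in toy_logits]

# Index of the largest logit per row; ties resolve to the first index.
toy_labels = [max(range(len(row)), key=row.__getitem__) for row in toy_logits]
print(toy_labels)
```

Because softmax is monotonic, taking the argmax of the probabilities gives the same labels as taking the argmax of the logits.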


## **Examine the prediction output**

In [59]:
predicted_labels = predicted_results.predictions.argmax(-1)
predicted_labels = predicted_labels.flatten().tolist()
predicted_labels = [id2label[l] for l in predicted_labels]

In [60]:
print(classification_report(y_val,
                            predicted_labels))

              precision    recall  f1-score   support

           0       0.92      0.89      0.91        54
           1       0.88      0.91      0.89        46

    accuracy                           0.90       100
   macro avg       0.90      0.90      0.90       100
weighted avg       0.90      0.90      0.90       100
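
The per-class figures in a report like this come from counts of true positives, false positives, and false negatives. The sketch below recomputes precision, recall, and F1 for one class by hand. The helper `binary_metrics` and the toy label lists are hypothetical and are not the validation data used above.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the `positive` class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'tp': tp, 'fp': fp, 'fn': fn,
            'precision': precision, 'recall': recall, 'f1': f1}


# Toy labels for illustration only (1 = story, 0 = not story).
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 0, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Here `fn` counts examples with true label 1 that were predicted 0 (false negatives), and `fp` counts examples with true label 0 that were predicted 1 (false positives).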



### Examine false negatives

In [62]:
for _predicted_label, _true_label, _text in zip(predicted_labels, y_val, X_val):
  if _predicted_label != _true_label and _true_label == 1:
    print(' '.join(_text.split()))
    print()

Steam's recent decision to make Steam trading cards selling dependent on whether or not you could afford to have a cellphone has made it difficult for some like me to continue to enjoy the Steam sales. Up until now, it was completely optional to add a phone number or to add Two-factor authentication through the Steam app to your account. Now, it's being tied to if you can sell Steam trading cards. If you don't, you have to wait 15 days before it gets put on the market. That just doesn't seem like an option, does it? Well, for poor folks like me without a cellphone or those that just have a broken one, I have an option for you! I found a way around all this, without needing a physical cellphone. It only takes a few minutes with some basic software and then you can not only complete the Steam badge, but you'll be able to sell Steam cards again! 1. Download [Bluestacks]( It's a free Android emulator. With this installed, we can begin step two. 2. After Bluestacks is installed, click "MyBl

### Examine false positives

In [63]:
for _predicted_label, _true_label, _text in zip(predicted_labels, y_val, X_val):
  if _predicted_label != _true_label and _true_label == 0:
    print(' '.join(_text.split()))
    print()

Commenting as I watch, all that jazz. - Mauler, your mic seems a bit... better, honestly, but it's still got that element of slight fuzziness about it. Can't quite pinpoint what might make it better, since it isn't ambient noise, but even so. - Game audio is fuckin' loud, man. Real loud. Entire video is a bit loud, but game is especially so. - What's with the static lookin' things? Is that a cut in the video? Not a bad touch, admittedly, just didn't know this was an edited LP and oh I just noticed this is part 2. - Okay yeah when the main character starts grinding his teeth it's especially noticable that game audio is loud. It's like we're rubbing together pumice stones in my ear canal. - I'm still ehh on reading shit ingame, but alright. - I genuinely laughed at the bedmaking bit, it was funny... but ear pumice made the entire thing seem a bit less pleasant. - Laughed again at "come at me mate", it was appropriate to the situation... just be careful, don't ever overdo that sort of thi