<a href="https://colab.research.google.com/github/jlucasa/cs6390/blob/main/JAmen6390Proj.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Persuasion Extraction (FLAGA)**
## **Jared Amen - University of Utah**

This project is based on [SemEval 2021 Task 6, Subtask 2](https://propaganda.math.unipd.it/semeval2021task6/index.html) and utilizes the corpora both [for the task](https://github.com/di-dimitrov/SEMEVAL-2021-task6-corpus) and [for SemEval 2020 Task 11](http://propaganda.qcri.org/ptc/teampage.php?passcode=4f0be54df4e32e11416cbe7081c6056c) (where the latter requires registration to receive data). For this project, the model attempts to tag sequences in an input sentence with any of a set of labels (where a sequence can be assigned multiple labels). The definitions of these labels can be found [here](https://propaganda.math.unipd.it/semeval2021task6/definitions.html). **Important!** Some labels from this subtask are either not found or are ambiguous in the dataset provided from 2020 - these are listed in the `ignore` key of the translation map between labels given in the 2020 dataset and labels given 

## About

This phase utilizes a Seq2Seq Model from SimpleTransformers which trains and tests based on a formatted `input_text` and `target_text`. Transitions are considered on a character level and tags are marked simply by their start and end tags, grouped with an index that corresponds to the respective technique used.
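
As a rough illustration of the target format described above, the following sketch wraps character spans in start and end tags. It is a simplified stand-in, not the notebook's own implementation: the helper name `tag_spans`, the example sentence, and the label indices are assumptions made for this example only.

```python
def tag_spans(text, spans):
    """Wrap each (start, end, label_index) span in [S-i] ... [E-i] markers.

    `end` is exclusive, so text[start:end] is the labeled fragment.
    Spans are assumed to be non-empty and to lie inside the text.
    """
    starts, ends = {}, {}
    for start, end, idx in spans:
        starts.setdefault(start, []).append(f' [S-{idx}] ')
        ends.setdefault(end - 1, []).append(f' [E-{idx}] ')

    pieces = []
    for pos, char in enumerate(text):
        pieces.extend(starts.get(pos, []))   # start tags go before the character
        pieces.append(char)
        pieces.extend(ends.get(pos, []))     # end tags go after the character
    return ''.join(pieces)


sentence = 'Make America great again'
# Hypothetical label indices: 6 covers the whole sentence, 8 covers one word.
target = tag_spans(sentence, [(0, 24, 6), (13, 18, 8)])
print(target)
```

Because every tag is padded with a space on each side, nested and overlapping spans stay readable, and the `input_text` is simply the same sentence without tags.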

**To reset the model state for this environment, run the following cell.**

In [None]:
rm -rf train_results_2020/ train_results_2021/

In order to run the program, a `requirements.txt` with the following libraries is necessary:

```
transformers
simpletransformers
torch==1.7.1
torchvision==0.8.2
tensorflow==2.3.0
absl-py
pandas
numpy
pytest
```

You can make a `requirements.txt` file with those libraries and upload it here, or use the following cell to copy-paste the `requirements.txt` file from Google Drive.

In [None]:
!cp drive/MyDrive/requirements.txt .

In [None]:
pip install -r requirements.txt

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting simpletransformers
  Downloading simpletransformers-0.63.3-py3-none-any.whl (247 kB)
Collecting torch==1.7.1
  Downloading torch-1.7.1-cp37-cp37m-manylinux1_x86_64.whl (776.8 MB)
Collecting torchvision==0.8.2
  Downloading torchvision-0.8.2-cp37-cp37m-manylinux1_x86_64.whl (12.8 MB)
Collecting tensorflow==2.3.0
  Downloading tensorflow-2.3.0-cp37-cp37m-manylinux2010_x86_64.whl (320.4 MB)
Collecting numpy
  Downloading numpy-1.18.5-cp37-cp37m-manylinux1_x86_64.whl (20.1 MB)
Collecting tensorflow-estimator<2.4.

# **Project Utils**

Includes the text fragment class as well as methods to form tagged text and extract from tagged text utilizing text fragments.
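
To make the extraction direction concrete, here is a minimal sketch of how tagged text can be turned back into plain text plus character spans. It is a simplified variant written for illustration, not the `extract_tagged_text` function defined in the cell that follows: the name `strip_tags` is hypothetical, and the sketch assumes every tag is padded with exactly one space on each side.

```python
import re

# One optional space on each side is treated as tag padding, not as text.
TAG = re.compile(r'\s?\[([SE])-(\d+)\]\s?')


def strip_tags(tagged):
    """Return (plain_text, spans), where spans are (start, end, label_index)."""
    plain = []
    open_at = {}   # label index -> offset in the plain text where the span opened
    spans = []
    pos = 0
    while pos < len(tagged):
        match = TAG.match(tagged, pos)
        if match is None:
            plain.append(tagged[pos])
            pos += 1
            continue
        kind, idx = match.group(1), int(match.group(2))
        if kind == 'S':
            open_at[idx] = len(plain)
        elif idx in open_at:
            # Close the span; an end tag without a start tag is ignored
            spans.append((open_at.pop(idx), len(plain), idx))
        pos = match.end()
    return ''.join(plain), spans


plain, spans = strip_tags('It was a  [S-8] disaster [E-8]  for everyone')
print(plain)
print(spans)
```

Offsets in the returned spans refer to the text without tags, so `plain[start:end]` gives back the tagged fragment.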

In [None]:
from dataclasses import dataclass

import pandas as pd
import regex as re

whitespace_chars = [' ', '\n', '\t']

@dataclass
class text_fragment:
    # A labeled character span: text[start:end] is tagged with `technique`.
    # Declaring fields lets @dataclass generate __init__ and a field-wise __eq__.
    start: int = 0
    end: int = 0
    technique: str = ''

    def __len__(self):
        return self.end - self.start

    def __lt__(self, other):
        return self.start < other.start

    def __gt__(self, other):
        return self.start > other.start

    def __ge__(self, other):
        return self.start >= other.start

    def __le__(self, other):
        return self.start <= other.start

    def __add__(self, other):
        return text_fragment(self.start + other, self.end + other, self.technique)

    def __sub__(self, other):
        return text_fragment(self.start - other, self.end - other, self.technique)

    def __and__(self, other):
        frag1, frag2 = text_fragment(), text_fragment()
        if self < other:
            frag1, frag2 = self, other
        else:
            frag1, frag2 = other, self
        
        if frag1.technique != '' and frag2.technique != '' and frag2.start >= frag1.end:
            # There is no intersection between the two fragments
            return None
        elif frag1.technique == '' and frag2.technique == '':
            return None
        else:
            return text_fragment(frag2.start, min(frag1.end, frag2.end), f'{frag1.technique},{frag2.technique}')
    
    def __or__(self, other):
        frag1, frag2 = text_fragment(), text_fragment()
        if self < other:
            frag1, frag2 = self, other
        else:
            frag1, frag2 = other, self
        
        return text_fragment(frag1.start, max(frag1.end, frag2.end), f'{frag1.technique},{frag2.technique}')

    def __str__(self):
        return f'{self.start}, {self.end}, {self.technique}'

    def get_text_from_fragment(self, text):
        return text[self.start:self.end]


def convert_technique_to_tag(technique, all_techniques):
    """
    Converts a technique to a symbol for the transformer
    :param technique: The given technique
    :type technique: str
    :param all_techniques: The list of all techniques
    :type all_techniques: list
    :return: The 'inside' and 'outside' tags for the transformer to use
    :rtype: list
    """

    try:
        idx = all_techniques.index(technique)
    except ValueError:
        # list.index raises when the item is missing; it never returns None
        raise ValueError(f'No index value could be found for the given technique "{technique}"') from None

    inside_tag = f'[S-{idx}]'
    outside_tag = f'[E-{idx}]'
    return [inside_tag, outside_tag]


def get_technique_from_tag(tag, all_techniques):
    """
    Converts a start or end tag back to the technique it stands for
    :param tag: The tag (in '[S-{num}]' or '[E-{num}]' form)
    :type tag: str
    :param all_techniques: The list of all techniques
    :type all_techniques: list
    :return: The technique which corresponds to the index in tag
    :rtype: str
    """

    # Brackets are escaped and optional, so both 'S-3' and '[S-3]' are accepted
    match = re.search(r'\[?[SE]-(\d+)\]?', tag)

    if match is None:
        raise ValueError(f'No technique could be found for the given tag "{tag}"')

    # The captured index is a string and must be converted before indexing
    return all_techniques[int(match.group(1))]


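def get_all_tags_from_techniques(techniques):
    """
    Builds the start and end tags for every technique
    :param techniques: The list of all techniques
    :type techniques: list
    :return: A flat list holding the start tag and end tag of each technique, in order
    :rtype: list
    """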
    tags_to_return = []

    for tech in techniques:
        tags_to_return.extend(convert_technique_to_tag(tech, techniques))

    return tags_to_return


def form_tagged_text(text, frags, all_techniques, should_print_sentence):
    """
    Gets a string which contains the tags assigned to each character in a given text.
    :param text: The input text
    :type text: str
    :param frags: The list of fragments for the text
    :type frags: list
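    :param all_techniques: The list of all techniques
    :type all_techniques: list
    :param should_print_sentence: Whether to print the tagged text as a spot check
    :type should_print_sentence: bool
    :return: A string which contains the tags assigned to each character in a given text.
    :rtype: str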
    """

    possible_techs = set([frag.technique for frag in frags])
    tech2txt = {tech: [None] * len(text) for tech in possible_techs}
    tech2tag = {tech: convert_technique_to_tag(tech, all_techniques) for tech in possible_techs}

    for frag in frags:
        tech2txt[frag.technique][frag.start] = tech2tag[frag.technique][0]          # Corresponds to inside tag
        tech2txt[frag.technique][frag.end - 1] = tech2tag[frag.technique][1]        # Corresponds to outside tag

    all_text_with_tags = []
    for idx in range(len(text)):
        for tech in possible_techs:
            corresponding_tagwithtxt = tech2txt[tech][idx]
            if corresponding_tagwithtxt is not None and corresponding_tagwithtxt.startswith('[S-'):
                all_text_with_tags.append(f' {corresponding_tagwithtxt} ')
                # all_text_with_tags.extend(['~', corresponding_tagwithtxt, '~'])

        all_text_with_tags.append(text[idx])

        for tech in possible_techs:
            corresponding_tagwithtxt = tech2txt[tech][idx]
            if corresponding_tagwithtxt is not None and corresponding_tagwithtxt.startswith('[E-'):
                all_text_with_tags.append(f' {corresponding_tagwithtxt} ')
                # all_text_with_tags.extend(['~', corresponding_tagwithtxt, '~'])

    tagged_text = ''.join(all_text_with_tags)
    if should_print_sentence:
        print(tagged_text)

    return tagged_text


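def extract_tagged_text(tagged_text, all_techniques):
    """
    Strips the tags out of a tagged text and recovers the fragments they delimit
    :param tagged_text: Text containing '[S-{num}]' and '[E-{num}]' tags
    :type tagged_text: str
    :param all_techniques: The list of all techniques
    :type all_techniques: list
    :return: The text without tags, and the fragments found (offsets refer to the text without tags)
    :rtype: str, [text_fragment]
    """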

    extracted_text = []
    tag_in_text = {idx: None for idx in range(len(all_techniques))}
    all_frags = []

    start_tag_matcher = r'^\[S-(\d+)\]\s?'
    end_tag_matcher = r'^\s?\[E-(\d+)\]'

    char_idx = 0
    while char_idx < len(tagged_text):
        begintag = re.match(start_tag_matcher, tagged_text[char_idx:])
        corresponding_endtag = re.match(end_tag_matcher, tagged_text[char_idx:])

        if begintag is not None:
            start_tag_idx = int(begintag.group(1))
            tag_in_text[start_tag_idx] = len(extracted_text)
            char_idx += len(begintag.group(0))
        elif corresponding_endtag is not None:
            end_tag_idx = int(corresponding_endtag.group(1))
            corresponding_start = tag_in_text[end_tag_idx]
            if corresponding_start is not None:
                tag_in_text[end_tag_idx] = None                     # reset for future iterations
                span_len = len(extracted_text)
                while corresponding_start < span_len and extracted_text[corresponding_start] in whitespace_chars:
                    corresponding_start += 1
                if span_len > corresponding_start:
                    all_frags.append(text_fragment(corresponding_start, span_len, all_techniques[end_tag_idx]))
            char_idx += len(corresponding_endtag.group(0))
        else:
            extracted_text.append(tagged_text[char_idx])
            char_idx += 1

    return ''.join(extracted_text), all_frags


def get_processed_df_for_entries(all_entries, all_techniques):
    """

    :param all_entries: All of the entries of a given dataset, that have the keys: `'id'`, `'text'`, `'fragments'`
    :type all_entries: list(dict)
    :param all_techniques: The list of all techniques
    :type all_techniques: list
    :return: A dataframe that has all of the entries formatted with tags attached to text
    :rtype: pd.DataFrame
    """

    if all_entries is None:
        return None

    to_form = []

    for entry in all_entries:
        to_form.append([
            entry['id'],
            entry['text'],
            # Print only the first tagged entry as a spot check
            form_tagged_text(entry['text'], entry['fragments'], all_techniques, len(to_form) == 0)
        ])

    return pd.DataFrame(to_form, columns=['id', 'input_text', 'target_text'])

# **Project Model**

Includes the project model with tokenizers assigned from the techniques passed in, as well as methods to train/test/evaluate/make predictions using the model.

In [None]:
import json  # used by ProjModel.save_results_official
import random

import numpy as np
import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs


class ProjModel(Seq2SeqModel):
    def __init__(self, transformer_type, io_path, labels, args, **kwargs):
        super().__init__(
            encoder_decoder_type=transformer_type,
            encoder_decoder_name=io_path,
            args=args,
            use_cuda=True,
            **kwargs
        )

        self.labels = labels
        self.analyzed_results = None
        self.scores = None
        tokens = get_all_tags_from_techniques(labels)

        self.encoder_tokenizer.add_tokens(tokens)
        self.decoder_tokenizer.add_tokens(tokens)

        if transformer_type == 'bart':
            self.model.resize_token_embeddings(len(self.encoder_tokenizer))
            self.model.resize_token_embeddings(len(self.decoder_tokenizer))
        else:
            self.model.encoder.resize_token_embeddings(len(self.encoder_tokenizer))
            self.model.decoder.resize_token_embeddings(len(self.decoder_tokenizer))

    def train_model(
        self,
        train_data,
        output_dir=None,
        show_running_loss=True,
        args=None,
        eval_data=None,
        verbose=True,
        **kwargs
    ):
        train_df = get_processed_df_for_entries(train_data, self.labels)
        val_df = get_processed_df_for_entries(eval_data, self.labels)
        super().train_model(
            train_df,
            output_dir=output_dir,
            show_running_loss=show_running_loss,
            args=args,
            eval_data=val_df,
            verbose=verbose,
            **kwargs
        )

    def eval_model(
        self,
        eval_data,
        output_dir=None,
        verbose=True,
        silent=False,
        **kwargs
    ):
        test_df = get_processed_df_for_entries(eval_data, self.labels)

        preds = self.predict([entry['text'] for entry in eval_data])
        preds_df = pd.DataFrame(preds, columns=['text', 'predicted_fragments'])
        self.analyzed_results = pd.concat([test_df, preds_df], axis=1)
        self.scores = score_model(preds, eval_data, self.labels)

    def predict(self, to_predict):
        preds = super().predict(to_predict)
        return self.format_predictions(preds)

    def pretty_predict(self, to_predict, filepath=None):
        # predict() already runs the model and formats its output
        formatted_predictions = self.predict(to_predict)

        # Build the report once, then either print it or write it to a file
        lines = []
        for form_pred in formatted_predictions:
            text = form_pred['extracted_text']
            lines.append('======================================================')
            lines.append(f'Sentence: {text}')
            if len(form_pred['fragments']) == 0:
                lines.append('Could not find persuasive phrases')
            else:
                for pred_frag in form_pred['fragments']:
                    lines.append(f'Found an instance of {pred_frag.technique} in "{pred_frag.get_text_from_fragment(text)}"')

        report = ''.join(f'{line}\n' for line in lines)

        if filepath is None:
            print(report, end='')
        else:
            with open(filepath, 'w') as file:
                file.write(report)

    def format_predictions(self, preds):
        all_formatted_preds = []

        for idx in range(len(preds)):
            text, fragments = extract_tagged_text(preds[idx], self.labels)

            all_formatted_preds.append({
                'id': f'<PREDICTED_{idx}>',
                'text': preds[idx],
                'fragments': fragments,
                'extracted_text': text,
                'predicted_fragments': [f'{str(frag)}, {frag.get_text_from_fragment(text)}' for frag in fragments]
            })

        return all_formatted_preds

    def save_results_official(self, path):
        to_output = []

        for index, row in self.analyzed_results.iterrows():
            text, fragments = extract_tagged_text(row['text'], self.labels)
            label_instance = []
            for frag in fragments:
                label_instance.append({
                    'start': frag.start,
                    'end': frag.end,
                    'technique': frag.technique
                })
            to_output.append({
                'id': row['id'],
                'labels': label_instance
            })
        
        with open(path, 'w+') as file:
            json.dump(to_output, file, indent=6)


    def save_results(self, path):
        self.analyzed_results.to_csv(path)

    def save_scores(self, path):
        with open(path, 'w+') as file:
            file.write('================== SCORES ==================\n')
            file.write(f'Total Precision: {self.scores["precision"]}\n')
            file.write(f'Total Recall: {self.scores["recall"]}\n')
            file.write(f'Total F1: {self.scores["f1"]}\n')

            for technique in self.scores['scores_for_techniques'].keys():
                file.write('============================================\n')
                file.write(f'{technique}\n')
                file.write(f'\tPrecision: {self.scores["scores_for_techniques"][technique]["precision"]}\n')
                file.write(f'\tRecall: {self.scores["scores_for_techniques"][technique]["recall"]}\n')
                file.write(f'\tF1: {self.scores["scores_for_techniques"][technique]["f1"]}\n')


def create_model(transformer_type, io_path, labels, args):
    return ProjModel(
        transformer_type=transformer_type,
        io_path=io_path,
        labels=labels,
        args=args
    )


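def train_model_on_data(model, fileset, possible_techniques, is_2020data, args):
    """
    Trains the given model on a set of data, dependent upon whether the data is
    from 2020 or from 2021.
    :param model: The model created using create_model
    :type model: ProjModel
    :param fileset: The fileset passed in (used for I/O)
    :type fileset: Fileset
    :param possible_techniques: The list of all techniques (currently unused; the model already holds its labels)
    :type possible_techniques: list
    :param is_2020data: True to train on the 2020 data, False to train on the 2021 data
    :type is_2020data: bool
    :param args: Training arguments forwarded to `model.train_model`
    :type args: dict
    :return: None
    """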
    if is_2020data:
        train_data_2020 = adjust_2020_data(load_and_convert_2020_data(fileset, is_dev=False))
        val_data_2020 = adjust_2020_data(load_and_convert_2020_data(fileset, is_dev=True))

        model.train_model(train_data_2020, eval_data=val_data_2020, args=args)
    else:
        train_data_2021 = load_2021_train_data(fileset)
        val_data_2021 = load_2021_val_data(fileset)

        model.train_model(train_data_2021, eval_data=val_data_2021, args=args)


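def calculate_overlap_between_pred_and_actual(pred_frag, actl_frag, frag_len):
    # Overlap only counts when both fragments carry the same technique
    if pred_frag.technique != actl_frag.technique or frag_len <= 0:
        return 0

    # Compute the intersection once instead of twice
    intersection = pred_frag & actl_frag
    if intersection is None:
        return 0

    return len(intersection) / frag_len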


def compute_f1(prec=0, rec=0):
    f1 = 0

    if prec == 0 and rec == 0:
        f1 = 0

    if prec > 0 and rec > 0:
        f1 = (2 * (prec * rec))/(prec + rec)

    return f1


def get_num_frags_with_technique(entries, technique):
    num_frags_with_technique = 0
    for entry in entries:
        num_frags_with_technique += len([frag for frag in entry['fragments'] if frag.technique == technique])

    return num_frags_with_technique


def score_model(preds, actuals, all_techniques):
    total_frags_pred = 0
    total_frags_actl = 0

    technique_precision = {tech: 0 for tech in all_techniques}
    technique_recall = {tech: 0 for tech in all_techniques}

    total_precision = 0
    total_recall = 0

    for pred, actual in zip(preds, actuals):
        total_frags_pred += len(pred['fragments'])
        total_frags_actl += len(actual['fragments'])

        curr_precision = 0
        curr_recall = 0

        for pred_fragment in pred['fragments']:
            for actl_fragment in actual['fragments']:
                precision = calculate_overlap_between_pred_and_actual(pred_fragment, actl_fragment, len(pred_fragment))
                recall = calculate_overlap_between_pred_and_actual(pred_fragment, actl_fragment, len(actl_fragment))

                curr_precision += precision
                curr_recall += recall

                if pred_fragment.technique == actl_fragment.technique:
                    technique_precision[pred_fragment.technique] += precision
                    technique_recall[pred_fragment.technique] += recall

        total_precision += curr_precision
        total_recall += curr_recall

        # prec_for_text = curr_precision / len(pred['fragments']) if len(pred['fragments']) > 0 else 0
        # rec_for_text = curr_recall / len(actual['fragments']) if len(actual['fragments']) > 0 else 0

    weighted_total_precision = total_precision / total_frags_pred if total_frags_pred > 0 else 0
    weighted_total_recall = total_recall / total_frags_actl if total_frags_actl > 0 else 0
    weighted_total_f1 = compute_f1(weighted_total_precision, weighted_total_recall)

    scores_for_techniques = {}

    for tech in all_techniques:
        num_pred_frags_for_tech = get_num_frags_with_technique(preds, tech)
        num_actl_frags_for_tech = get_num_frags_with_technique(actuals, tech)

        precision_for_tech = technique_precision[tech] / num_pred_frags_for_tech \
            if num_pred_frags_for_tech > 0 else 0
        recall_for_tech = technique_recall[tech] / num_actl_frags_for_tech \
            if num_actl_frags_for_tech > 0 else 0
        f1_for_tech = compute_f1(precision_for_tech, recall_for_tech)

        scores_for_techniques.update({
            tech: {
                'precision': np.round(precision_for_tech, 3),
                'recall': np.round(recall_for_tech, 3),
                'f1': np.round(f1_for_tech, 3)
            }
        })

    return {
        'precision': np.round(weighted_total_precision, 3),
        'recall': np.round(weighted_total_recall, 3),
        'f1': np.round(weighted_total_f1, 3),
        'scores_for_techniques': scores_for_techniques
    }

# **Project IO**

Includes the list of techniques for both 2020 and 2021 data, as well as methods to extract/translate the data from the files based on the `Fileset` class.
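
The 2020 label files are read as tab-separated rows of article id, technique, start, and end, and the technique names are then translated to their 2021 equivalents. The sketch below shows that idea on a few made-up rows. The helper `group_labels`, the sample rows, and the reduced maps are assumptions for illustration only; the two translations shown do appear in `TECH_2020MAP`.

```python
from collections import defaultdict

# Reduced copies of the translation data, for illustration only
LABEL_MAP = {'Loaded_Language': 'Loaded Language', 'Flag-Waving': 'Flag-waving'}
IGNORED = {'Bandwagon,Reductio_ad_hitlerum'}


def group_labels(lines):
    """Group tab-separated `id, technique, start, end` rows by article id."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        article_id, technique, start, end = line.split('\t')
        if technique in IGNORED:
            continue   # no 2021 counterpart, so the span is dropped
        technique = LABEL_MAP.get(technique, technique)
        grouped[int(article_id)].append((int(start), int(end), technique))
    return dict(grouped)


# Made-up rows in the assumed file layout
sample = [
    '111\tLoaded_Language\t10\t18\n',
    '111\tFlag-Waving\t40\t55\n',
    '222\tBandwagon,Reductio_ad_hitlerum\t0\t7\n',
    '222\tLoaded_Language\t3\t9\n',
]
groups = group_labels(sample)
print(groups)
```

Grouping with a dictionary keeps every row of an article, including the first one, regardless of how the rows are ordered in the file.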

In [None]:
import json


TECH_2021SET = [
    'Appeal to authority',
    'Appeal to fear/prejudice',
    'Black-and-white Fallacy/Dictatorship',
    'Causal Oversimplification',
    'Doubt',
    'Exaggeration/Minimisation',
    'Flag-waving',
    'Glittering generalities (Virtue)',
    'Loaded Language',
    'Misrepresentation of Someone\'s Position (Straw Man)',
    'Name calling/Labeling',
    'Obfuscation, Intentional vagueness, Confusion',
    'Presenting Irrelevant Data (Red Herring)',
    'Reductio ad hitlerum',
    'Repetition',
    'Slogans',
    'Smears',
    'Thought-terminating cliché',
    'Whataboutism',
    'Bandwagon'
]

TECH_2020MAP = {
    'consider': {
        'Appeal_to_Authority': TECH_2021SET[0],
        'Appeal_to_fear-prejudice': TECH_2021SET[1],
        'Black-and-White_Fallacy': TECH_2021SET[2],
        'Causal_Oversimplification': TECH_2021SET[3],
        'Doubt': TECH_2021SET[4],
        'Exaggeration,Minimisation': TECH_2021SET[5],
        'Flag-Waving': TECH_2021SET[6],
        'Loaded_Language': TECH_2021SET[8],
        'Name_Calling,Labeling': TECH_2021SET[10],
        'Repetition': TECH_2021SET[14],
        'Slogans': TECH_2021SET[15],
        'Thought-terminating_Cliches': TECH_2021SET[17]
    },
    'ignore': [
        'Bandwagon,Reductio_ad_hitlerum',
        'Whataboutism,Straw_Men,Red_Herring'
    ]
}


class Fileset:
    def __init__(self):
        self.IN_MODEL_DIR = ''
        self.OUT_MODEL_DIR = ''
        self.DEV_SET = 'drive/MyDrive/2021data/dev_set_task2.txt'
        self.TEST_SET = 'drive/MyDrive/2021data/test_set_task2.txt'
        self.TRAIN_SET_2021 = 'drive/MyDrive/2021data/training_set_task2.txt'
        self.TRAIN_SET_2020_DIR = 'drive/MyDrive/2020data/train-articles'
        self.TRAIN_SET_2020_SUM = 'drive/MyDrive/2020data/train-task-flc-tc.labels'
        self.DEV_SET_2020_SUM = 'drive/MyDrive/2020data/dev-task-flc-tc.labels'
        self.DEV_SET_2020_DIR = 'drive/MyDrive/2020data/dev-articles'


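def read_2020_text(id, fileset, is_dev):
    """
    Reads the raw text of one 2020 article
    :param id: The numeric article id (as found in the article's file name)
    :type id: int
    :param fileset: The fileset which holds the paths to the 2020 data
    :type fileset: Fileset
    :param is_dev: True to read from the dev articles, False to read from the train articles
    :type is_dev: bool
    :return: The full text of the article
    :rtype: str
    """

    article_dir = fileset.DEV_SET_2020_DIR if is_dev else fileset.TRAIN_SET_2020_DIR
    path = f'{article_dir}/article{id}.txt'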

    with open(path, 'r', encoding='utf-8') as file:
        return file.read()


def load_and_convert_2020_data(fileset, is_dev):
    """
    Loads the 2020 span labels and pairs them with the article texts
    :param fileset: The fileset which holds the paths to the 2020 data
    :type fileset: Fileset
    :param is_dev: True to load the dev split, False to load the train split
    :type is_dev: bool
    :return: One entry per article with the keys: `'id'`, `'text'`, `'fragments'`
    :rtype: list(dict)
    """

    with open(fileset.DEV_SET_2020_SUM if is_dev else fileset.TRAIN_SET_2020_SUM, 'r') as file:
        all_articles = file.readlines()

    loaded = []
    curr_id = -1
    frags = []

    for article in all_articles:
        id, technique, start, end = article.strip().split('\t')

        id, start, end = int(id), int(start), int(end)

        if id == curr_id:
            frags.append(text_fragment(start, end, technique))
        else:
            if curr_id != -1:
                loaded.append({
                    'id': curr_id,
                    'text': read_2020_text(curr_id, fileset, is_dev),
                    'fragments': frags
                })

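            # Start the next article with the fragment on the current line,
            # otherwise the first fragment of every article would be dropped
            curr_id = id
            frags = [text_fragment(start, end, technique)]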

    if len(frags) != 0:
        if curr_id == -1:
            raise ValueError('"curr_id" cannot be set to a value of -1')

        loaded.append({
            'id': curr_id,
            'text': read_2020_text(curr_id, fileset, is_dev),
            'fragments': frags
        })

    return loaded


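def get_specific_2021_data(path):
    """
    Loads one 2021 JSON file and converts its labels to text fragments
    :param path: The path to the JSON file
    :type path: str
    :return: One entry per item with the keys: `'id'`, `'text'`, `'fragments'`
    :rtype: list(dict)
    """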

    loaded = []
    with open(path, 'r', encoding='utf-8') as file:
        entries = json.load(file)

    for entry in entries:
        id = entry['id']
        text = entry['text']
        labels = entry['labels']

        frags = []
        for label_set in labels:
            if label_set['text_fragment'].strip() == '':
                continue
            frags.append(text_fragment(label_set['start'], label_set['end'], label_set['technique']))

        loaded.append({
            'id': id,
            'text': text,
            'fragments': frags
        })

    return loaded


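def load_and_convert_2021_data(fileset):
    """
    Loads all three 2021 splits
    :param fileset: The fileset which holds the paths to the 2021 data
    :type fileset: Fileset
    :return: The dev, test, and train data, in that order
    :rtype: tuple
    """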

    return \
        get_specific_2021_data(fileset.DEV_SET),\
        get_specific_2021_data(fileset.TEST_SET),\
        get_specific_2021_data(fileset.TRAIN_SET_2021)


def load_2021_train_data(fileset):
    return get_specific_2021_data(fileset.TRAIN_SET_2021)


def load_2021_test_data(fileset):
    return get_specific_2021_data(fileset.TEST_SET)


def load_2021_val_data(fileset):
    return get_specific_2021_data(fileset.DEV_SET)


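def get_correspondent_2021_tech(tech_2020):
    """
    Translates a 2020 technique label to its 2021 counterpart
    :param tech_2020: The technique label as given in the 2020 dataset
    :type tech_2020: str
    :return: The 2021 label, None if the label is in the `ignore` list, or the input unchanged if it is unknown
    :rtype: str or None
    """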

    if tech_2020 in TECH_2020MAP['consider'].keys():
        return TECH_2020MAP['consider'][tech_2020]
    elif tech_2020 in TECH_2020MAP['ignore']:
        return None

    return tech_2020


def adjust_2020_data(dataset):
    new_dataset = []

    for entry in dataset:
        new_frags = []

        for frag in entry['fragments']:
            translated_technique = get_correspondent_2021_tech(frag.technique)
            if translated_technique is not None:
                frag.technique = translated_technique
                new_frags.append(frag)
        entry['fragments'] = new_frags
        new_dataset.append(entry)

    return new_dataset

# **Project Training**

Includes training of the model with the provided 2020 data and 2021 data.

In [None]:
fileset = Fileset()

Here, we train the model starting at `bart-base` for 25 epochs on the **2020 data**. We'll load in the training and validation data, and train the model against that data.

In [None]:
!mkdir -p first_results
!cp train_results_2021/added_tokens.json train_results_2021/config.json train_results_2021/merges.txt train_results_2021/model_args.json train_results_2021/pytorch_model.bin train_results_2021/special_tokens_map.json train_results_2021/tokenizer_config.json train_results_2021/training_args.bin train_results_2021/vocab.json first_results/
!rm -rf train_results_2021/


In [None]:
args = dict(
    num_train_epochs=25,
    overwrite_output_dir=True,
    output_dir='train_results_2020'
)

model = create_model('bart', 'facebook/bart-base', TECH_2021SET, args)
train_model_on_data(model, fileset, TECH_2021SET, True, args)

50305
50305
Next plague outbreak in Madagascar could be 'stronger': WHO

Geneva - The World Health Organisation chief on Wednesday said a deadly plague epidemic  [S-4] appeared [E-4]  to have been brought under control in Madagascar, but warned the next outbreak would likely be stronger.

"The next transmission could be more pronounced or stronger," WHO Director-General Tedros Adhanom Ghebreyesus told reporters in Geneva, insisting that "the issue is serious."

An outbreak of both bubonic plague, which is spread by infected rats via flea bites, and pneumonic plague, spread person to person, has killed more than 200 people in the Indian Ocean island nation since August.

Madagascar has suffered bubonic plague outbreaks almost every year since 1980, often caused by rats fleeing forest fires.

The disease tends to make a comeback each hot rainy season, from September to April.
On average, between 300 and 600 infections are recorded every year among a population approaching 25 million peop

KeyboardInterrupt: ignored

Here, we train the model starting at `train_results_2020` (so that training continues from the weights learned on the 2020 data) for 20 epochs on the **2021 data**, as set by `num_train_epochs=20` in the cell below. We'll load in the training and validation data, and train the model against that data. Running this second stage on the 2021 data gives that data more priority, which has helped scores slightly.

In [None]:
args = dict(
    num_train_epochs=20,
    overwrite_output_dir=True,
    output_dir='train_results_2021',
)

model = create_model('bart', 'train_results_2020', TECH_2021SET, args=args)
train_model_on_data(model, fileset, TECH_2021SET, False, args=args)


Here, we evaluate the model on the spans it generates. We load the 2021 test data and evaluate the model against it, writing the predictions to the `predicted_fragments` field of `test_results.csv` and reporting the overall F1, precision, and recall scores for the test data along with the per-category F1, precision, and recall scores. *A score of 0 in `test_scores.txt` means that the fragment was not considered or not found in the predictions made by the model.*

# **Project Evaluation**

Includes evaluating the model and saving the results to `test_results.csv` and the scores to `test_scores.txt`.
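
The overall and per-category precision, recall, and F1 scores can be sketched as follows. This is a minimal illustration that only counts exact matches between predicted and gold spans; the scoring actually used in this notebook (and the official task scorer, which may award partial credit for overlapping spans) can differ. The `(technique, start, end)` tuple format, the function names, and the example spans are assumptions made for this sketch.

```python
def prf(true_positives, n_predicted, n_gold):
    """Precision, recall, and F1 from raw counts; 0.0 wherever a ratio is undefined."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_spans(gold, pred):
    """Score exact-match spans overall and per technique.

    gold, pred: sets of (technique, start, end) tuples (hypothetical format).
    """
    overall = prf(len(gold & pred), len(pred), len(gold))
    per_technique = {}
    for technique in sorted({t for t, _, _ in gold | pred}):
        g = {span for span in gold if span[0] == technique}
        p = {span for span in pred if span[0] == technique}
        per_technique[technique] = prf(len(g & p), len(p), len(g))
    return overall, per_technique


# Illustrative spans only; these are not taken from the dataset.
gold = {("Loaded Language", 0, 12), ("Name calling/Labeling", 20, 31), ("Smears", 40, 60)}
pred = {("Loaded Language", 0, 12), ("Smears", 41, 60)}

overall, per_technique = score_spans(gold, pred)
print("overall P/R/F1:", overall)
for technique, scores in per_technique.items():
    print(technique, scores)
```

In this toy example, `Name calling/Labeling` has no predicted span and `Smears` is off by one character, so both score 0 under exact matching, which is the situation the note about zeros in `test_scores.txt` describes.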

In [None]:
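args = dict(
    max_length=200,             # Anything less or more seemed to degrade performance
    top_p=0.8,                  # Nucleus sampling: keep the top 80% of probability mass
    top_k=0,                    # 0 disables top-k filtering
    length_penalty=0.5,
    repetition_penalty=2.0,     # Values above 1.0 penalize repeated tokens
    num_beams=5,
    num_return_sequences=1,
    do_sample=True              # Enable sampling (needed for top_p to take effect)
)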

model = create_model('bart', 'train_results_2021', TECH_2021SET, args=args)
test_data_2021 = load_2021_test_data(fileset)

model.eval_model(test_data_2021)
model.save_results_official('official_results_cnn.txt')
# model.save_results('test_results_withargs.csv')
# model.save_scores('test_results_withargs.txt')
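
The `top_p=0.8` and `top_k=0` settings in the cell above request nucleus (top-p) sampling with top-k filtering turned off. The sketch below shows the idea on a toy distribution, assuming the usual definition of nucleus sampling: keep the smallest set of most-likely tokens whose cumulative probability reaches `top_p`, then renormalize. `top_p_filter` and the toy probabilities are hypothetical and only illustrate the technique; they are not the code the library runs internally.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize to sum to 1.

    probs: dict mapping token -> probability (assumed to sum to 1).
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (illustrative values only).
probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}
nucleus = top_p_filter(probs, top_p=0.8)
print(nucleus)  # "d" is dropped; the remaining three tokens are renormalized
```

Sampling then draws only from the tokens that survive the filter, which trims the low-probability tail while keeping some variety in the generated tags.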

# **Model IO**

Here, we zip up the contents of the final models and upload the archives to Google Drive.

In [None]:
!zip -r train_results_2020_final.zip train_results_2020/added_tokens.json train_results_2020/config.json train_results_2020/merges.txt train_results_2020/model_args.json train_results_2020/pytorch_model.bin train_results_2020/special_tokens_map.json train_results_2020/tokenizer_config.json train_results_2020/training_args.bin train_results_2020/vocab.json

  adding: train_results_2020/added_tokens.json (deflated 72%)
  adding: train_results_2020/config.json (deflated 64%)
  adding: train_results_2020/merges.txt (deflated 53%)
  adding: train_results_2020/model_args.json (deflated 62%)
  adding: train_results_2020/pytorch_model.bin (deflated 7%)
  adding: train_results_2020/special_tokens_map.json (deflated 50%)
  adding: train_results_2020/tokenizer_config.json (deflated 79%)
  adding: train_results_2020/training_args.bin (deflated 50%)
  adding: train_results_2020/vocab.json (deflated 59%)


In [None]:
!zip -r train_results_2021_mnli.zip train_results_2021/added_tokens.json train_results_2021/config.json train_results_2021/merges.txt train_results_2021/model_args.json train_results_2021/pytorch_model.bin train_results_2021/special_tokens_map.json train_results_2021/tokenizer_config.json train_results_2021/training_args.bin train_results_2021/vocab.json

  adding: train_results_2021/added_tokens.json (deflated 71%)
  adding: train_results_2021/config.json (deflated 59%)
  adding: train_results_2021/merges.txt (deflated 53%)
  adding: train_results_2021/model_args.json (deflated 62%)
  adding: train_results_2021/pytorch_model.bin (deflated 7%)
  adding: train_results_2021/special_tokens_map.json (deflated 50%)
  adding: train_results_2021/tokenizer_config.json (deflated 79%)
  adding: train_results_2021/training_args.bin (deflated 50%)
  adding: train_results_2021/vocab.json (deflated 59%)


In [None]:
from google.colab import auth
from googleapiclient.http import MediaFileUpload
from googleapiclient.discovery import build

auth.authenticate_user()

In [None]:
drive_service = build('drive', 'v3')

def save_file_to_drive(name, path):
    file_metadata = {
      'name': name,
      'mimeType': 'application/zip'
    }

    media = MediaFileUpload(path, mimetype='application/zip', resumable=True)

    created = drive_service.files().create(body=file_metadata, media_body=media, fields='id').execute()

    print('File ID: {}'.format(created.get('id')))

    return created

In [None]:
saved_model_paths = [
  # 'train_results_2020_large.zip',
  'train_results_2021_mnli.zip'
]

for path in saved_model_paths:
  save_file_to_drive(path, path)

File ID: 1zqR1LolcIiXIcv1H4UBQSkzZNOlgEWxC


**To reset the model state for this environment, run the following cell.**

In [None]:
rm -rf train_results_2020/ train_results_2021/

# **Custom Project Evaluation**

Here, you can supply your own sentences to the model, which will attempt to make predictions on them.

If the models are already trained on the right data, you can load them in from Google Drive here. Otherwise, you'll need to train them by running the cells above.

In [None]:
!cp drive/MyDrive/train_results_2020_final.zip .
!cp drive/MyDrive/train_results_2021_final.zip .

!unzip train_results_2020_final.zip -d train_results_2020
!unzip train_results_2021_final.zip -d train_results_2021

!mv train_results_2020/train_results_2020/* train_results_2020/
!mv train_results_2021/train_results_2021/* train_results_2021/

!rmdir train_results_2020/train_results_2020/
!rmdir train_results_2021/train_results_2021/

!rm train_results_2020_final.zip
!rm train_results_2021_final.zip

Archive:  train_results_2020_final.zip
  inflating: train_results_2020/train_results_2020/added_tokens.json  
  inflating: train_results_2020/train_results_2020/config.json  
  inflating: train_results_2020/train_results_2020/merges.txt  
  inflating: train_results_2020/train_results_2020/model_args.json  
  inflating: train_results_2020/train_results_2020/pytorch_model.bin  
  inflating: train_results_2020/train_results_2020/special_tokens_map.json  
  inflating: train_results_2020/train_results_2020/tokenizer_config.json  
  inflating: train_results_2020/train_results_2020/training_args.bin  
  inflating: train_results_2020/train_results_2020/vocab.json  
Archive:  train_results_2021_final.zip
  inflating: train_results_2021/train_results_2021/added_tokens.json  
  inflating: train_results_2021/train_results_2021/config.json  
  inflating: train_results_2021/train_results_2021/merges.txt  
  inflating: train_results_2021/train_results_2021/model_args.json  
  inflating: train_results
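
The archives store each file under its directory name (for example `train_results_2021/config.json`), which is why the cell above has to move the files up one level and remove the nested directory after unzipping. As an alternative, here is a minimal sketch using Python's standard `zipfile` module that extracts every file straight into the target directory. `extract_flat` is a hypothetical helper, not part of this project, and it assumes no two files in the archive share a base name.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_flat(zip_path, dest_dir):
    """Extract every file in zip_path directly into dest_dir,
    dropping any directory prefix stored in the archive."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            target = dest / Path(info.filename).name
            # Stream the member to disk so large weight files are not held in memory.
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
    return sorted(p.name for p in dest.iterdir())


# Demonstration on a throwaway archive with the same nested layout.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("train_results_2021/config.json", "{}")
        archive.writestr("train_results_2021/vocab.json", "{}")
    names = extract_flat(zip_path, Path(tmp) / "train_results_2021")

print(names)
```

With the real archives, a call such as `extract_flat('train_results_2021_final.zip', 'train_results_2021')` would stand in for the `unzip`, `mv`, and `rmdir` steps for that model.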

In [None]:
def get_samples_or_custom():
    choice = input('Please select whether you\'d like to use the provided test sentences (\'sample\') or your own custom sentences (\'custom\'): ')

    if choice != 'sample' and choice != 'custom':
        print('Invalid input. Please type either \'sample\' or \'custom\'.')
        return get_samples_or_custom()
    
    return choice


def run_on_test_data(model):
    test_data_2021 = load_2021_test_data(fileset)
    sample_sentences = [entry['text'] for entry in test_data_2021]
    model.pretty_predict(sample_sentences, filepath='sample_predictions.txt')


def run_on_user_data(model, user_sentences):
    if len(user_sentences) > 10:
        fpath = input('Number of sentences provided is greater than 10. Input a filepath to store the predictions in: ')
        model.pretty_predict(user_sentences, filepath=fpath)
    else:
        model.pretty_predict(user_sentences, filepath=None)


def get_user_sentences():
    user_sentences = []
    try:
        num_sentences = int(input('Please type in the number of sentences you\'d like to provide to the model: '))
    except ValueError:
        print('Please input a number for how many sentences you\'d like to provide to the model.')
        return get_user_sentences()

    print(f'Accepting {num_sentences} sentences...')
    for i in range(num_sentences):
        sentence = input('Type in a sentence you\'d like to provide to the model: ')
        user_sentences.append(sentence)

    return user_sentences
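
The helpers above re-prompt by calling themselves whenever the input is invalid. That works, but every invalid answer adds a stack frame, and in `get_user_sentences` the retry restarts the whole function. A loop-based variant is sketched below as one possible alternative; `prompt_choice` and its `read`/`write` parameters are hypothetical names introduced so the prompt can be exercised without a real user.

```python
def prompt_choice(prompt, valid, read=input, write=print):
    """Ask until the reply is one of `valid`, using a loop instead of recursion."""
    while True:
        reply = read(prompt).strip()
        if reply in valid:
            return reply
        write(f"Invalid input. Please type one of: {', '.join(valid)}.")


# Scripted replies stand in for a user typing at the prompt.
replies = iter(["smaple", "sample"])
messages = []
choice = prompt_choice(
    "Use provided test sentences ('sample') or your own ('custom')? ",
    ("sample", "custom"),
    read=lambda _prompt: next(replies),
    write=messages.append,
)
print(choice, messages)
```

In the notebook itself the defaults (`input` and `print`) would be used, so the call reduces to `prompt_choice(prompt, ('sample', 'custom'))`.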

In [None]:
print('Creating model with weights from provided training results...')

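args = dict(
    max_length=150,             # Anything less or more seemed to degrade performance
    top_p=0.8,                  # Nucleus sampling: keep the top 80% of probability mass
    top_k=0,                    # 0 disables top-k filtering
    length_penalty=0.5,
    repetition_penalty=2.0,     # Values above 1.0 penalize repeated tokens
    num_beams=5,
    num_return_sequences=1,
    do_sample=True              # Enable sampling (needed for top_p to take effect)
)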

model = create_model('bart', 'train_results_2021', TECH_2021SET, args=args)
choice = get_samples_or_custom()

if choice == 'sample':
    print('Running the model on the provided testing data...')
    run_on_test_data(model)
else:
    user_sentences = get_user_sentences()
    run_on_user_data(model, user_sentences)

Creating model with weights from provided training results...
50305
50305
Please select whether you'd like to use the provided test sentences ('sample') or your own custom sentences ('custom'): sample
Running the model on the provided testing data...

