**Basil Dataset**

Base code and description for this preprocessing were taken from the MTL project and only slightly adjusted:
The labels were reshaped to a binary label. For now, the output contains ONLY BIASED sentences (all other sentences could eventually be added as unbiased).
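
One way the remaining sentences could later be added as unbiased examples is sketched below. This is an assumption about a possible extension, not something the current preprocessing does: `label_all_sentences` is a hypothetical helper, the toy article is invented, and treating every sentence without an annotation as unbiased is itself a modeling choice.

```python
import itertools
from typing import Dict, List


def label_all_sentences(article: Dict, annotations: List[Dict]) -> List[Dict]:
    """Label annotated body sentences 1 and all remaining body sentences 0."""
    # Flatten the list of paragraphs into one list of sentences.
    sentences = list(itertools.chain.from_iterable(article["body-paragraphs"]))
    # Indices of sentences with at least one annotation; "title" ids are skipped here.
    biased = {int(a["id"].split("p")[-1]) for a in annotations if a["id"] != "title"}
    return [
        {"text": sentence, "label": int(idx in biased)}
        for idx, sentence in enumerate(sentences)
    ]


# Toy data, invented for illustration only.
toy_article = {"body-paragraphs": [["First sentence.", "Second sentence."], ["Third sentence."]]}
toy_annotations = [{"id": "p1"}, {"id": "title"}]

rows = label_all_sentences(toy_article, toy_annotations)
print([row["label"] for row in rows])
```

Note that this sketch emits one row per sentence, whereas the preprocessing in this notebook emits one row per annotation, so a sentence with several annotations appears several times there.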


A collection of 300 articles sampled between 2010 and 2019.
For each article, the authors provide 2 files: An article file and a file containing annotations.
Each article object contains the following keys: ['title', 'keywords', 'date', 'uuid', 'url', 'main-entities', 'word-count', 'source', 'main-event', 'triplet-uuid', 'body-paragraphs'] where 'body-paragraphs' contains the sentences of each paragraph.
Each phrase-level annotation object (stored under the key 'phrase-level-annotations') contains the following keys: ['aim', 'bias', 'end', 'id', 'indirect-ally-opponent-sentiment', 'indirect-target-name', 'notes', 'polarity', 'quote', 'speaker', 'start', 'target', 'txt'].
For our purposes, we extracted the labels bias, aim, quote, txt from each annotation.
Additionally, we extracted the sentences contained in body-paragraphs from each article.
We entirely discarded the article-level annotations.
With these labels, we perform binary classification for three targets. Additionally, we can perform POS tagging for the bias-inducing POS (txt).
Because of the quality of the dataset, we didn't have to discard any observation. The final dataset contains 1724 observations.
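
As a minimal sketch of how an annotation is matched to its sentence, the snippet below flattens `body-paragraphs` and resolves an annotation `id`. The toy article and annotation are made up for illustration, and `resolve_sentence` is a hypothetical helper, not part of the BASIL tooling. It assumes ids are either `"title"` or `"p<index>"` with a zero-based index into the flattened sentences, which is how the preprocessing code in this notebook treats them.

```python
import itertools
from typing import Dict, List


def resolve_sentence(article: Dict, annotation: Dict) -> str:
    """Return the sentence (or title) that an annotation refers to."""
    # Flatten the list of paragraphs into one list of sentences.
    sentences: List[str] = list(itertools.chain.from_iterable(article["body-paragraphs"]))
    ann_id = annotation["id"]
    if ann_id == "title":
        return article["title"]
    # Assumed id format: "p<index>", e.g. "p2" -> sentences[2].
    return sentences[int(ann_id.split("p")[-1])]


# Toy data, invented for illustration only.
toy_article = {
    "title": "A headline",
    "body-paragraphs": [["First sentence.", "Second sentence."], ["Third sentence."]],
}
toy_annotation = {"id": "p2", "txt": "Third"}

print(resolve_sentence(toy_article, toy_annotation))
```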

[Dataset Source](https://github.com/launchnlp/BASIL)


Domains of the columns:


label: Binary bias label. Always 1 (biased) in the current output.

bias_type: Type of bias (inf=informational, lex=lexical)

quote: Whether the phrase is or contains a quote.

aim: Whether the phrase aims at the target directly or indirectly.

pos: The sequence that induces the bias. This sequence is part of the sentence.

text: The source sentence.
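
To make the column domains concrete, the following sketch checks a single output row against them. The column names are taken from the preprocessing code in this notebook; `check_row` is a hypothetical helper and the toy row (including its `quote` and `aim` values) is invented. The substring check reflects the statement that `pos` is part of the sentence, but it may fail on real rows if text cleaning alters the sentence.

```python
from typing import Dict

# Column names as written by the preprocessing code in this notebook.
EXPECTED_COLUMNS = {"id", "text", "label", "pos", "bias_type", "quote", "aim"}


def check_row(row: Dict) -> bool:
    """Return True if the row has the expected columns and consistent values."""
    return (
        set(row) == EXPECTED_COLUMNS
        and row["label"] == 1  # only biased sentences are written for now
        and row["pos"] in row["text"]  # the biased span should occur in the sentence
    )


# Toy row, invented for illustration only.
toy_row = {
    "id": 0,
    "text": "Third sentence.",
    "label": 1,
    "pos": "Third",
    "bias_type": "inf",
    "quote": "no",
    "aim": "dir",
}

print(check_row(toy_row))
```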


Note: The authors used the term phrase instead of sentence. To ensure readability and comparability to other datasets, we used the term text.

Citation Identifier: fan_plain_2019

Title: In Plain Sight: Media Bias Through the Lens of Factual Reporting.


In [1]:
import itertools
import json
import os
from typing import List, Tuple

import pandas as pd
from prep_collection import PrepCollection as prep
#from preprocessors import PreprocessorBlueprint

def _load_raw_data_from_local() -> Tuple[List, List]:
    """Load the raw data of 09_BASIL."""
    articles, annotations = [], []
    # Resolve the dataset root once, two levels above the current working directory.
    wdr_path = os.path.dirname(os.path.dirname(os.getcwd()))
    ds_raw_path = os.path.join(wdr_path, "Datasets", "Linguistic Bias", "BASIL")
    for year in range(2010, 2020):
        arts = sorted(os.listdir(os.path.join(ds_raw_path, "articles", str(year))))

        anns = sorted(os.listdir(os.path.join(ds_raw_path, "annotations", str(year))))
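        # Strip the "_ann" suffix so annotation file names can be compared to article file names.
        anns_cut = [ann.replace("_ann", "") for ann in anns]

        assert arts == anns_cut, f"Article and annotation files do not match for {year}"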

        for art, ann in zip(arts, anns):
            try:
                with open(os.path.join(ds_raw_path, "articles", str(year), art), "r", encoding="utf-8", errors="replace") as f:
                    article_data = json.load(f)

                with open(os.path.join(ds_raw_path, "annotations", str(year), ann), "r", encoding="utf-8", errors="replace") as f:
                    annotation_data = json.load(f)

                # Append only after both files parsed, so the two lists stay aligned.
                articles.append(article_data)
                annotations.append(annotation_data)
            except json.JSONDecodeError:
                print(f"Skipping {art}: the article or its annotation file is empty or not valid JSON.")

    return articles, annotations


def _preprocess():
    """Preprocess the raw data of 09_BASIL."""
    article_data, annotation_data = _load_raw_data_from_local()
    observations = []
    sent_id = 0
    for art, art_ann in zip(article_data, annotation_data):
        paragraphs = art.get("body-paragraphs")  # List of paragraphs, each a list of sentences
        annotations = art_ann.get("phrase-level-annotations")  # List of annotation dicts

        # Flatten the paragraphs into one list of sentences
        phrases = list(itertools.chain.from_iterable(paragraphs))
        for ann in annotations:
            # The id can be in 2 different formats:
            # "p<phrase-id>" or "title"
            ann_id = ann["id"]
            text = ann["txt"]  # The bias-inducing span
            bias = ann["bias"]  # Type of bias. (inf=informational, lex=lexical)
            quote = ann["quote"]  # Binary, whether the phrase is or contains a quote.
            aim = ann["aim"]  # direct/indirect. If indirect, annotations for ...-sentiment are available.
            if ann_id == "title":
                phrase = art.get("title")
            else:
                # "p<phrase-id>" indexes into the flattened sentence list
                phrase = phrases[int(ann_id.split("p")[-1])]

            observation = {
                "id": sent_id,
                "text": prep.prepare_text(phrase),
                "label": 1,  # Only biased sentences are kept for now
                "pos": text,
                "bias_type": bias,
                "quote": quote,
                "aim": aim,
            }
            observations.append(observation)
            sent_id += 1

    df = pd.DataFrame(observations)
    out_dir = os.path.join(os.path.dirname(os.path.dirname(os.getcwd())), "Preprocessed_Datasets")
    df.to_csv(os.path.join(out_dir, "009-Basil.csv"))


_preprocess()