**Multidimensional Dataset**

There are four main annotation categories: hidden assumptions, subjectivity, framing, and bias. I left out framing for two reasons: its format does not match the others (so I could not use it for binary categories), and it appears to be a very topic-specific form of sentiment analysis that is likely already covered by the bias category.
Each sentence has 5 annotations; both the average and the majority vote are given (the raw data contains even more categories).

Columns in the preprocessed dataset: ['id', 'text', 'label', 'average_subjectivity', 'majority_subjectivity', 'average_hidden_assumptions', 'majority_hidden_assumptions', 'average_bias_west', 'majority_bias_west', 'average_bias_russia', 'majority_bias_russia']


The retained categories are already on a 0 to 3 scale, so we keep them unchanged.
Scales (from the paper):
Hidden Assumptions: no; rather no; rather yes; yes; (0.0 to 3.0)
Subjectivity: objective; rather objective; rather subjective; subjective; (0.0 to 3.0)
Framing (for each government, i.e., Russian/Ukrainian/Western government(s)): negative; slightly negative; neutral; slightly positive; positive; (-2.0 to 2.0)
Bias (for each tendency of bias, i.e., Pro-Russia/Pro-West): no; rather no; rather yes; yes; (0.0 to 3.0)

```
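To construct a binary label (based on the majority votes):
0 ('NON-BIASED'): 0 and 1
1 ('BIASED'): 2 and 3
Majority votes can be floats, so the cutoff between the two groups is 1.5
If any of subjectivity, hidden assumptions, pro-West bias or pro-Russia bias is present, the sentence is given label 1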
```
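The labeling rule can be sketched as a small standalone function. This is only an illustration: the names `THRESHOLD` and `binarize` are hypothetical and do not appear in the notebook, and the 1.5 cutoff mirrors the one used in the preprocessing function, which checks all four majority scores. Unlike that function, the sketch validates every score up front, which is a slightly stricter assumption.

```python
THRESHOLD = 1.5  # assumption: cutoff between "rather no" (1) and "rather yes" (2)


def binarize(majority_scores):
    """Return 1 ('BIASED') if any majority score reaches THRESHOLD, else 0."""
    for score in majority_scores:
        # Every retained category is annotated on a 0 to 3 scale.
        if not 0 <= score <= 3:
            raise ValueError(f"score {score} is outside the 0 to 3 scale")
    return int(any(score >= THRESHOLD for score in majority_scores))


# Order of scores: subjectivity, hidden assumptions, pro-West bias, pro-Russia bias
print(binarize([0, 1, 0, 0]))    # all scores below the cutoff
print(binarize([0, 1.5, 0, 0]))  # one score at the cutoff
```

A single dimension at or above the cutoff is enough to mark the sentence as biased, which is why the definition is fairly broad.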

NOTE: Very few sentences here are actually biased. Even with this rather broad definition, only 50 out of 2000 sentences are labeled as biased.
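To make the imbalance concrete, the share of each label can be computed as below. This is a minimal sketch using only the standard library; `label_share` is a hypothetical helper, and the label list is synthetic, built from the 50 and 2000 figures quoted in the note rather than read from the actual CSV.

```python
from collections import Counter


def label_share(labels):
    """Return the fraction of the dataset that falls under each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Synthetic labels reproducing the counts from the note: 50 biased, 1950 non-biased
labels = [1] * 50 + [0] * 1950
print(label_share(labels))
```

On the real preprocessed DataFrame, `df['label'].value_counts(normalize=True)` gives the same information.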

In [1]:
import os
import sys
import pandas as pd
from prep_collection import PrepCollection as prep
import json
import numpy as np

In [3]:
wdr_path = os.path.dirname(os.path.dirname(os.getcwd()))
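# Pass the path components separately so os.path.join actually does the joining
ds_raw_path = os.path.join(wdr_path, "Datasets", "Text Level Bias", "Multidimensional Dataset")
# Match on the file extension rather than on '.json' appearing anywhere in the name
files = [ele for ele in os.listdir(ds_raw_path) if ele.endswith('.json')]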

In [5]:
def preprocess_multidimensional(wdr_path, ds_raw_path, files):
    article_id = 0
    ds_lst = []
    for k in files:
        with open(os.path.join(ds_raw_path, k)) as f:
            article = json.load(f)
            sentences = article['sentences']
        sentence_id = 0
        for i in sentences:
            final_id = str(article_id) + '-' + str(sentence_id)  # so every sentence can be traced back to the article it belongs to
            text = prep.prepare_text(i['content'])
            avg_subjectivity = i['subjectivity']['score']['avg']
            maj_subjectivity = i['subjectivity']['score']['maj']
            avg_hidden = i['hidden_assumptions']['score']['avg']
            maj_hidden = i['hidden_assumptions']['score']['maj']
            avg_bias_west = i['bias']['score']['pro-west']['avg']
            maj_bias_west = i['bias']['score']['pro-west']['maj']
            avg_bias_russia = i['bias']['score']['pro-russia']['avg']
            maj_bias_russia = i['bias']['score']['pro-russia']['maj']

            maj_scores = (maj_subjectivity, maj_hidden, maj_bias_west, maj_bias_russia)
            # Majority votes can turn out to be floats, so threshold at 1.5 instead of testing for 2 or 3
            if any(1.5 <= s <= 3 for s in maj_scores):
                label = 1
            elif all(0 <= s < 1.5 for s in maj_scores):
                label = 0
            else:  # sanity check: a score fell outside the 0 to 3 scale
                print(final_id)
                raise ValueError
            ds_lst.append([final_id, text, label, avg_subjectivity, maj_subjectivity, avg_hidden, maj_hidden, avg_bias_west, maj_bias_west, avg_bias_russia, maj_bias_russia])
            sentence_id += 1
        article_id += 1
    df = pd.DataFrame(ds_lst, columns = ['id', 'text', 'label', 'average_subjectivity', 'majority_subjectivity', 'average_hidden_assumptions', 'majority_hidden_assumptions', 'average_bias_west', 'majority_bias_west', 'average_bias_russia', 'majority_bias_russia'])
    df.to_csv(os.path.join(wdr_path, "Preprocessed_Datasets", "019-MultidimensionalDataset.csv"))

In [6]:
preprocess_multidimensional(wdr_path, ds_raw_path, files)
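Because each `id` is built as the article index, a hyphen, and the sentence index, it can be split back into its two parts when a sentence needs to be traced to its article. The helper below is a hypothetical sketch and not part of the notebook.

```python
def split_sentence_id(final_id):
    """Split an id such as '3-12' into (article_index, sentence_index)."""
    article_part, sentence_part = final_id.split('-')
    return int(article_part), int(sentence_part)


article_index, sentence_index = split_sentence_id('3-12')
print(article_index, sentence_index)
```

Note that `os.listdir` returns entries in arbitrary order, so an article index maps back to the same file only if the file list is kept or sorted consistently between runs.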