## 105 - rtgender


The dataset is split into original `post` and `response` files, except for TED, where only responses to a talk are available.

Each annotation contains a `sentiment`, the `gender` of the OP (it is unclear whether this is assumed or provided), the data `source`, and the referenced entity (stored as `relevance`, which takes one of the values `'Content'`, `'Irrelevant'`, `'Poster'`, `'ContentPoster'`).

Dataset size: `15353` post-response pairs (3.8 MB) are manually annotated, while the majority of the data (5.6 GB) remains unlabelled.


CSV input format: `source,op_gender,post_text,response_text,sentiment,relevance`

CSV output format: `id,text,label,category,source,op_gender,relevance`


`sentiment` mapping to `(label, category)`:
- 'Neutral': `(0,0)`
- 'Positive': `(1,1)`
- 'Negative': `(1,2)`
- 'Mixed': `(1,3)`

```
Label:
- 0: not biased
- 1: biased
```

```
Category:
- 0: neutral
- 1: positive bias
- 2: negative bias
- 3: mixed
```
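
The sentiment mapping can also be written as a single lookup table, so that `label` and `category` cannot drift apart. The following is a minimal sketch, not the notebook's own code: `SENTIMENT_TO_TARGET` and `map_sentiment` are hypothetical names introduced here for illustration.

```python
from typing import Tuple

# Lookup table mirroring the documented mapping: sentiment -> (label, category).
SENTIMENT_TO_TARGET = {
    "Neutral": (0, 0),
    "Positive": (1, 1),
    "Negative": (1, 2),
    "Mixed": (1, 3),
}


def map_sentiment(sentiment: str) -> Tuple[int, int]:
    """Return (label, category) for one sentiment annotation."""
    try:
        return SENTIMENT_TO_TARGET[sentiment]
    except KeyError:
        # Fail loudly instead of silently producing a missing label.
        raise ValueError(f"unexpected sentiment value: {sentiment!r}") from None


for s in ["Neutral", "Positive", "Negative", "Mixed"]:
    label, category = map_sentiment(s)
    print(f"{s}: label={label}, category={category}")
```

Raising on unknown values is a design choice of this sketch; a dictionary-based `Series.map` in pandas would instead yield missing values for any sentiment not in the table.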


Other design decisions:
- `text` is the concatenation of `post` (context) and `response`.
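
Since only responses are available for TED, the concatenation may need to tolerate a missing post. The sketch below shows one way to do that, under two assumptions that the source does not confirm: that a missing post appears as `None` or NaN, and that a single space is an acceptable separator. `build_text` is a hypothetical helper and does not stand in for `prep.prepare_text`.

```python
from typing import Optional


def build_text(post: Optional[str], response: Optional[str], sep: str = " ") -> str:
    """Join post (context) and response, skipping parts that are missing or blank."""
    parts = []
    for part in (post, response):
        # Treat None and non-string values (e.g. float NaN from pandas) as missing.
        if isinstance(part, str) and part.strip():
            parts.append(part.strip())
    return sep.join(parts)


print(build_text("Great talk!", "Thanks for sharing."))
print(build_text(None, "Only a response is available."))
```

Without a separator, the last word of the post and the first word of the response would be fused into one token, which is why the sketch inserts one.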

In [147]:
import os
import sys
import pandas as pd
from prep_collection import PrepCollection as prep

In [148]:
path_input_relative = "/Datasets/Gender Bias/105-rtgender/annotations.csv"
path_output_relative = "/Preprocessed_Datasets/105-rtgender.csv"

In [149]:
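# The project root is two directory levels above the current working directory.
wdr_path = os.path.dirname(os.path.dirname(os.getcwd()))
# Both relative paths start with '/', so plain concatenation is used here.
df = pd.read_csv(wdr_path + path_input_relative)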

dict_label = {'Neutral' : 0, 'Positive' : 1, 'Negative' : 1, 'Mixed' : 1}
dict_category = {'Neutral' : 0, 'Positive' : 1, 'Negative' : 2, 'Mixed' : 3}


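# 1-based running id
df['id'] = range(1, len(df) + 1)
# Fill missing values first: with the 'string' dtype, NA + text yields NA,
# which apply(str) would turn into the literal '<NA>'.
# A space separates post and response so their words are not fused.
df['text'] = (df['post_text'].fillna('').astype('string') + ' '
              + df['response_text'].fillna('').astype('string')).str.strip()
df['text'] = df['text'].apply(str).apply(prep.prepare_text)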

# Map the sentiment string to both target columns.
df['label'] = df['sentiment'].map(dict_label)
df['category'] = df['sentiment'].map(dict_category)

df = df.reindex(columns=['id','text','label','category','source','op_gender','relevance'])


In [150]:
# index=False keeps the file in the documented output format (no extra index column).
df.to_csv(wdr_path + path_output_relative, index=False)
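
A quick sanity check on the written file can catch a stray index column or unmapped `label`/`category` values. The following is a sketch using only the standard library. `check_output` is a hypothetical helper, and the sample row is made up for illustration rather than taken from the dataset.

```python
import csv
import io

# Header of the documented CSV output format.
EXPECTED_COLUMNS = ["id", "text", "label", "category", "source", "op_gender", "relevance"]


def check_output(handle) -> int:
    """Validate the header and the label/category values; return the number of data rows."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        if row["label"] not in {"0", "1"}:
            raise ValueError(f"row {row['id']}: invalid label {row['label']!r}")
        if row["category"] not in {"0", "1", "2", "3"}:
            raise ValueError(f"row {row['id']}: invalid category {row['category']!r}")
        n_rows += 1
    return n_rows


# Illustrative, made-up row in the documented output format.
sample = io.StringIO(
    "id,text,label,category,source,op_gender,relevance\n"
    "1,some post some response,1,2,example_source,W,Content\n"
)
print(check_output(sample))
```

For the real file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`.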