**RedditBias Dataset**

Data source: https://github.com/umanlp/RedditBias
Consists of four subcategories: Religion, Race, Gender and Queerness (encoded as ORIENTATION).
Only the annotated comments were used. In the source data, both the entire sentence and the target phrase were annotated for bias. The entire sentence (including the phrase) was added to the dataset, with the sentence-level annotation (`bias_sent`) as its label.

```
LABEL:
1 - BIASED
0 - NON-BIASED
```
```
CATEGORIES:
0 - GENDER
1 - ORIENTATION
2 - RACE
3 - RELIGION
```
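
The two encodings above can be written down as plain dictionaries. The sketch below also shows one possible way to derive a category code from a RedditBias file name. `category_code` is a hypothetical helper, not part of this notebook, and the regular expression assumes file names of the form `reddit_comments_<category>[digit]_...`. Note that the preprocessing cell further down stores the file name itself in the `category` column, so a mapping like this would be an additional step.

```python
import re

# Encodings as documented above.
LABELS = {1: "BIASED", 0: "NON-BIASED"}
CATEGORIES = {0: "GENDER", 1: "ORIENTATION", 2: "RACE", 3: "RELIGION"}

# Reverse lookup: lower-cased category name -> integer code.
_NAME_TO_CODE = {name.lower(): code for code, name in CATEGORIES.items()}


def category_code(filename: str) -> int:
    """Hypothetical helper: infer the category code from a file name."""
    # Assumed pattern: reddit_comments_<category>[optional digit]_...
    match = re.match(r"reddit_comments_([a-z]+)\d*_", filename)
    if match is None or match.group(1) not in _NAME_TO_CODE:
        raise ValueError(f"cannot infer category from {filename!r}")
    return _NAME_TO_CODE[match.group(1)]


print(category_code("reddit_comments_religion2_muslims_processed_phrase_annotated.csv"))
```

With this mapping, both religion files (`religion1_jews` and `religion2_muslims`) fall into the single RELIGION category.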

In [1]:
import os
import sys
import pandas as pd
from prep_collection import PrepCollection as prep

In [2]:
# Assumes a file structure with ./Datasets and ./Preprocessed_Datasets
# located two directory levels above the current working directory.
wdr_path = os.path.dirname(os.path.dirname(os.getcwd()))
ds_raw_path = os.path.join(wdr_path, "Datasets", "Linguistic Bias", "RedditBias", "data")
files = [
    'reddit_comments_gender_female_processed_phrase_annotated.csv',
    'reddit_comments_orientation_lgbtq_processed_phrase_annotated.csv',
    'reddit_comments_race_black_processed_phrase_annotated.csv',
    'reddit_comments_religion1_jews_processed_phrase_annotated.csv',
    'reddit_comments_religion2_muslims_processed_phrase_annotated.csv',
]
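
Because the paths depend on the local directory layout, it can help to check that every expected file is present before preprocessing. The following is a minimal sketch. `missing_files` is a hypothetical helper, and the temporary directory with `present.csv` and `absent.csv` only stands in for the real data folder.

```python
import os
import tempfile


def missing_files(base_dir, filenames):
    """Return the names in `filenames` that do not exist under `base_dir`."""
    return [
        name for name in filenames
        if not os.path.isfile(os.path.join(base_dir, name))
    ]


# Demonstration on a throwaway directory (not the real dataset location).
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "present.csv"), "w").close()
    missing = missing_files(tmp, ["present.csv", "absent.csv"])

print(missing)  # ['absent.csv']
```

In the notebook, the same check could be called as `missing_files(ds_raw_path, files)`.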

In [3]:
def redditbias_preprocessing(wdr_path, ds_raw_path, files):
    """Collect the annotated comments of all files into one DataFrame and save it as CSV."""
    # Annotation values that are not a plain 0/1 label; rows carrying them are dropped
    excluded = ['1 - context needed', 're-state', 'biased?', 'toxic-unrelated', 'fact?', 'question']
    df = pd.DataFrame(columns=['text', 'label', 'category'])
    for file in files:
        file_path = os.path.join(ds_raw_path, file)
        print(file_path)
        # errors='replace' substitutes undecodable bytes instead of raising an error
        with open(file_path, errors='replace') as f:
            df_original = pd.read_csv(f)
        # Drop all rows that do not have a sentence-level label
        df_original = df_original.dropna(subset=['bias_sent'])
        df_original = df_original[~df_original['bias_sent'].isin(excluded)]
        df_sub = pd.DataFrame()
        df_sub['text'] = df_original['comment'].apply(prep.prepare_text)
        df_sub['category'] = file  # the source file name is stored as the category
        df_sub['label'] = df_original['bias_sent'].apply(int)
        # Note: the number of observations does not exactly match the paper;
        # we could not find the reason for the difference.
        df = pd.concat([df, df_sub])

    df.index.name = 'id'
    df.to_csv(os.path.join(wdr_path, "Preprocessed_Datasets", "075-RedditBias.csv"))
    return df
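
The filtering step inside `redditbias_preprocessing` can be illustrated on a small made-up frame. The rows below are invented and are not RedditBias data. Only the column names `comment` and `bias_sent` and the excluded annotation strings come from the function above, and `keep_labelled` is a hypothetical helper name.

```python
import pandas as pd

# Annotation values that are not a plain 0/1 label (copied from the function above).
EXCLUDED = ['1 - context needed', 're-state', 'biased?',
            'toxic-unrelated', 'fact?', 'question']


def keep_labelled(df_original: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose `bias_sent` value is a usable 0/1 label."""
    out = df_original.dropna(subset=['bias_sent'])
    out = out[~out['bias_sent'].isin(EXCLUDED)]
    return out.assign(label=out['bias_sent'].astype(int))


# Invented toy rows: one unlabelled, one with a free-text annotation.
toy = pd.DataFrame({
    'comment': ['a', 'b', 'c', 'd'],
    'bias_sent': ['1', None, 'question', '0'],
})
result = keep_labelled(toy)
print(result[['comment', 'label']])
```

The surviving rows keep their original row positions as index values (0 and 3 here), which is consistent with the gaps visible in the `id` column of the final frame.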

In [4]:
df = redditbias_preprocessing(wdr_path, ds_raw_path, files)

/home/user/docs/studium/ongoing/22-22/thesis/working_dir/Datasets/Linguistic Bias/RedditBias/data/reddit_comments_gender_female_processed_phrase_annotated.csv
/home/user/docs/studium/ongoing/22-22/thesis/working_dir/Datasets/Linguistic Bias/RedditBias/data/reddit_comments_orientation_lgbtq_processed_phrase_annotated.csv
/home/user/docs/studium/ongoing/22-22/thesis/working_dir/Datasets/Linguistic Bias/RedditBias/data/reddit_comments_race_black_processed_phrase_annotated.csv
/home/user/docs/studium/ongoing/22-22/thesis/working_dir/Datasets/Linguistic Bias/RedditBias/data/reddit_comments_religion1_jews_processed_phrase_annotated.csv
/home/user/docs/studium/ongoing/22-22/thesis/working_dir/Datasets/Linguistic Bias/RedditBias/data/reddit_comments_religion2_muslims_processed_phrase_annotated.csv


In [14]:
df

id,text,label,category
0,im sorry this is happening for you at least yo...,0,reddit_comments_gender_female_processed_phrase...
1,their mothers are in on the joke art project,1,reddit_comments_gender_female_processed_phrase...
2,i feel like this is some deep artistic dance t...,0,reddit_comments_gender_female_processed_phrase...
3,or biology over protective and smothering moth...,0,reddit_comments_gender_female_processed_phrase...
4,writing poetry to a girl is the fastest way to...,0,reddit_comments_gender_female_processed_phrase...
...,...,...,...
10563,saying that a muslim is violent because they d...,0,reddit_comments_religion2_muslims_processed_ph...
10565,funny that a muslim is talking about other rel...,1,reddit_comments_religion2_muslims_processed_ph...
10568,a good muslim is a violent racist homophobic o...,1,reddit_comments_religion2_muslims_processed_ph...
10575,yeah no muslim is violent,0,reddit_comments_religion2_muslims_processed_ph...
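
Since the number of observations does not exactly match the paper, a per-file label count is a natural first check. The sketch below runs on an invented stand-in frame with the same columns as `df`, so the file names and values are placeholders. It also shows that the `id` index may repeat across files, because each file's rows keep their own row numbers when concatenated, and how a fresh unique index could be assigned if needed.

```python
import pandas as pd

# Invented stand-in for the frame returned by redditbias_preprocessing.
toy = pd.DataFrame(
    {
        'text': ['a', 'b', 'c', 'd'],
        'label': [0, 1, 1, 0],
        'category': ['file_a.csv', 'file_a.csv', 'file_b.csv', 'file_b.csv'],
    },
    index=[0, 2, 0, 5],  # per-file row numbers, so 0 occurs twice
)
toy.index.name = 'id'

# Number of observations per source file and label.
counts = toy.groupby(['category', 'label']).size().unstack(fill_value=0)
print(counts)

# The concatenated index is not guaranteed to be unique.
print(toy.index.is_unique)

# Optional: replace it with a fresh, unique running id.
unique_ids = toy.reset_index(drop=True)
unique_ids.index.name = 'id'
```

On the real data, `df.groupby(['category', 'label']).size()` would give the counts to compare against the figures reported for each file.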
