# Contents

This notebook cleans the three working datasets and merges them for the final presentation. The merged dataframe contains the original Reddit post id, the body of the post, and the names, gender and ethnicity of the person or people addressed in the post (note that ethnicity is NaN for most of the sample). The original VAD scores obtained by Marakovic et al. (2020) are preserved in the capitalized columns `Valence`, `Arousal` and `Dominance`.

We extend the original data with text complexity metrics, in particular the Gunning-Fog index and the type-token ratio; alternative VAD scores obtained through LEIA, together with the LEIA sentiment label (one of six emotions); and the probability that a given Reddit post is perceived as toxic, along with further indicators returned by the Perspective API.
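
For readers unfamiliar with the two complexity metrics, the sketch below shows the standard textbook definitions. It is an illustration only: the function names are hypothetical, and it is an assumption that the `Gunning_Fog` and `share_unique words` columns in the data were computed exactly this way (in particular, the standard formula counts words of three or more syllables as "complex").

```python
def gunning_fog(nr_words, nr_sentences, nr_complex_words):
    """Gunning-Fog index: 0.4 * (average sentence length + percentage of complex words)."""
    if nr_words == 0 or nr_sentences == 0:
        return float("nan")  # undefined for empty text
    avg_sentence_length = nr_words / nr_sentences
    pct_complex = 100 * nr_complex_words / nr_words
    return 0.4 * (avg_sentence_length + pct_complex)


def type_token_ratio(tokens):
    """Share of unique tokens among all tokens (case-insensitive)."""
    if not tokens:
        return float("nan")  # undefined for empty text
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)


# 100 words in 5 sentences, 10 of them complex: 0.4 * (20 + 10)
print(gunning_fog(100, 5, 10))  # 12.0
# 5 unique tokens out of 6
print(round(type_token_ratio("the cat sat on the mat".split()), 3))  # 0.833
```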

In [1]:
# import necessary libraries
import pandas as pd

In [2]:
# load all three datasets
complexity = pd.read_csv("/kaggle/input/all-comments-text-complexity/all_comments_text_complexity.csv", low_memory=False)
leia = pd.read_csv("/kaggle/input/all-comments-text-complexity/leia_vad_sample_normalized_137k.csv")
toxicity = pd.read_csv("/kaggle/input/all-comments-text-complexity/comments_scores.csv", low_memory=False)

In [3]:
print([col for col in complexity.columns])

['id', 'body', 'subreddit', 'to_type', 'NEL', 'Names', 'created_utc', 'sex', 'ethnicity', 'origin', 'DOB', 'highest_position', 'party', 'entity_given_name', 'entity_family_name', 'given_name_used', 'family_name_used', 'full_name_used', 'nickname_used', 'Adjectives', 'Verbs', 'Nouns', 'Descriptors_parsed', 'Verbs_parsed', 'Relation', 'Valence', 'Arousal', 'Dominance', 'nr_sentences', 'nr_words', 'nr_characters', 'nr_letters', 'nr_syllables', 'nr_words_one_syllable', 'nr_words_more_syllables', 'tokens', 'nr_unique_words', 'share_unique words', 'nr_swear_words', 'share_swear_words', 'bert_vector', 'Gunning_Fog', 'gender_dummy']


In [4]:
print([col for col in leia.columns])

['id', 'body', 'subreddit', 'to_type', 'NEL', 'Names', 'created_utc', 'sex', 'ethnicity', 'origin', 'DOB', 'highest_position', 'party', 'entity_given_name', 'entity_family_name', 'given_name_used', 'family_name_used', 'full_name_used', 'nickname_used', 'Adjectives', 'Verbs', 'Nouns', 'Descriptors_parsed', 'Verbs_parsed', 'Relation', 'Valence', 'Arousal', 'Dominance', 'leia', 'sentiment', 'sentiment_prob', 'valence', 'arousal', 'dominance', 'norm_valence', 'norm_arousal', 'norm_dominance']


In [5]:
print([col for col in toxicity.columns])

['id', 'body', 'subreddit', 'to_type', 'NEL', 'Names', 'created_utc', 'sex', 'ethnicity', 'origin', 'DOB', 'highest_position', 'party', 'entity_given_name', 'entity_family_name', 'given_name_used', 'family_name_used', 'full_name_used', 'nickname_used', 'Adjectives', 'Verbs', 'Nouns', 'Descriptors_parsed', 'Verbs_parsed', 'Relation', 'Valence', 'Arousal', 'Dominance', 'insult', 'profanity', 'threat', 'severe_toxicity', 'identity_attack', 'toxicity']


**Drop columns that would be duplicated after the merge**

In [6]:
columns_to_drop = ['body', 'subreddit', 'to_type', 'NEL', 'Names', 'created_utc', 'sex', 'ethnicity', 'origin', 'DOB', 'highest_position', 'party', 'entity_given_name', 'entity_family_name', 'given_name_used', 'family_name_used', 'full_name_used', 'nickname_used', 'Adjectives', 'Verbs', 'Nouns', 'Descriptors_parsed', 'Verbs_parsed', 'Relation', 'Valence', 'Arousal', 'Dominance']

In [7]:
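# keep only 'id' plus the new columns from the two auxiliary datasets
leia_dropped = leia.drop(columns=columns_to_drop, errors='ignore')
toxicity_dropped = toxicity.drop(columns=columns_to_drop, errors='ignore')
# drop the BERT vectors, which are not carried into the merged file
complexity = complexity.drop(columns=['bert_vector'])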

In [19]:
# check nr of rows
print(leia_dropped.shape[0])
print(toxicity_dropped.shape[0])
print(complexity.shape[0])

137555
137556
138626
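
The three files differ in length, so a left merge onto `complexity` will leave some rows without LEIA or toxicity values. A minimal sketch of how that coverage could be measured with the `indicator` option of `DataFrame.merge` follows. The toy dataframes and their values are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (all values are invented).
complexity_toy = pd.DataFrame({"id": ["a", "b", "c"], "Gunning_Fog": [8.2, 11.5, 6.0]})
leia_toy = pd.DataFrame({"id": ["a", "b"], "sentiment": ["label_1", "label_2"]})

# indicator=True adds a '_merge' column that records where each row was found.
checked = complexity_toy.merge(leia_toy, on="id", how="left", indicator=True)

# 'left_only' rows exist in the left table but found no match on the right.
unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{unmatched} of {len(checked)} rows have no LEIA match")
```

Rows flagged `left_only` are the ones that end up with NaN in the merged-in columns.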


**Merge datasets**

In [20]:
# left-join the LEIA and toxicity columns onto the complexity data by post id
merged_df = complexity.merge(leia_dropped, on='id', how='left').merge(toxicity_dropped, on='id', how='left')

In [21]:
# check columns again
print([col for col in merged_df.columns])

['id', 'body', 'subreddit', 'to_type', 'NEL', 'Names', 'created_utc', 'sex', 'ethnicity', 'origin', 'DOB', 'highest_position', 'party', 'entity_given_name', 'entity_family_name', 'given_name_used', 'family_name_used', 'full_name_used', 'nickname_used', 'Adjectives', 'Verbs', 'Nouns', 'Descriptors_parsed', 'Verbs_parsed', 'Relation', 'Valence', 'Arousal', 'Dominance', 'nr_sentences', 'nr_words', 'nr_characters', 'nr_letters', 'nr_syllables', 'nr_words_one_syllable', 'nr_words_more_syllables', 'tokens', 'nr_unique_words', 'share_unique words', 'nr_swear_words', 'share_swear_words', 'Gunning_Fog', 'gender_dummy', 'leia', 'sentiment', 'sentiment_prob', 'valence', 'arousal', 'dominance', 'norm_valence', 'norm_arousal', 'norm_dominance', 'insult', 'profanity', 'threat', 'severe_toxicity', 'identity_attack', 'toxicity']


In [22]:
# and drop all columns that are irrelevant to us:
further_columns_to_drop = ['to_type', 'NEL', 'created_utc', 'origin', 'DOB', 'highest_position', 'party', 'entity_given_name', 'entity_family_name', 'given_name_used', 'family_name_used', 'full_name_used', 'nickname_used', 'Adjectives', 'Verbs', 'Nouns', 'Descriptors_parsed', 'Verbs_parsed', 'Relation']

In [23]:
merged_df = merged_df.drop(columns=further_columns_to_drop, errors='ignore')

In [24]:
# final column check
print([col for col in merged_df.columns])

['id', 'body', 'subreddit', 'Names', 'sex', 'ethnicity', 'Valence', 'Arousal', 'Dominance', 'nr_sentences', 'nr_words', 'nr_characters', 'nr_letters', 'nr_syllables', 'nr_words_one_syllable', 'nr_words_more_syllables', 'tokens', 'nr_unique_words', 'share_unique words', 'nr_swear_words', 'share_swear_words', 'Gunning_Fog', 'gender_dummy', 'leia', 'sentiment', 'sentiment_prob', 'valence', 'arousal', 'dominance', 'norm_valence', 'norm_arousal', 'norm_dominance', 'insult', 'profanity', 'threat', 'severe_toxicity', 'identity_attack', 'toxicity']


**And check for duplicate ids**

In [25]:
# check nr of rows
print(merged_df.shape[0])

145430
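
The merged frame has more rows than `complexity` had before the merge (138626). A left merge can only add rows when an id occurs more than once in a right-hand table, because each left row is repeated once per matching right row. The toy example below (made-up values, hypothetical variable names) reproduces the effect and sketches one way to guard against it: deduplicate the right-hand table first and let pandas verify key uniqueness via `validate`.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (all values are invented).
left = pd.DataFrame({"id": ["a", "b", "c"], "nr_words": [10, 20, 30]})
right = pd.DataFrame({"id": ["a", "a", "b"], "toxicity": [0.1, 0.1, 0.7]})

# The duplicated id "a" on the right inflates the result from 3 to 4 rows.
inflated = left.merge(right, on="id", how="left")

# Deduplicating the right table first keeps exactly one row per left row.
# validate="many_to_one" raises pandas.errors.MergeError if right keys are not unique.
right_unique = right.drop_duplicates(subset="id", keep="first")
clean = left.merge(right_unique, on="id", how="left", validate="many_to_one")

print(len(inflated), len(clean))  # 4 3
```

Deduplicating after the merge, as this notebook does, yields one row per id as well; checking before the merge simply surfaces the problem earlier.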


In [26]:
duplicate_ids = merged_df[merged_df.duplicated(subset='id', keep=False)]

In [27]:
if not duplicate_ids.empty:
    merged_df = merged_df.drop_duplicates(subset='id', keep='first')

In [28]:
# check nr of rows again after dropping duplicate ids:
merged_df.shape[0]

137030
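
Note that `keep='first'` retains whichever copy of an id happens to come first. The notebook does not show whether the dropped copies were exact duplicates or carried different scores. A possible check, sketched here on invented toy data, counts the distinct rows per duplicated id:

```python
import pandas as pd

# Toy data (invented): id "a" is an exact duplicate, id "b" has conflicting scores.
df = pd.DataFrame({
    "id": ["a", "a", "b", "b", "c"],
    "toxicity": [0.1, 0.1, 0.7, 0.9, 0.3],
})

# All rows whose id occurs more than once.
dupes = df[df.duplicated(subset="id", keep=False)]

# Distinct rows per duplicated id: 1 means exact copies,
# more than 1 means the copies disagree in at least one column.
distinct_per_id = dupes.drop_duplicates().groupby("id").size()
conflicting_ids = distinct_per_id[distinct_per_id > 1].index.tolist()

print(conflicting_ids)  # ['b']
```

If such a check returns an empty list on the real data, the choice of `keep` does not matter.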

**Finally, save to csv**

In [29]:
merged_df.to_csv('gender_bias_toxicity_complexity.csv', index=False)