# Merge datasets

This script is meant to merge the ad_hominems dataset with the general dataset.

In [8]:
import numpy as np
import pandas as pd

general = pd.read_csv("../../data/general/ad_hominem_attacks.csv", sep=";")
adHominem = pd.read_csv("../../data/ad_hominem/reddit_ad_hominem.csv")

### 1. Preprocessing
We annotated the general dataset manually. Each of us wrote 1 or 0 in the column bearing our name, depending on whether we considered the row a 'good' non-fallacy or a 'good' ad hominem attack. We keep a row for our dataset if at least 2 of the 3 annotators marked it with 1.

In [9]:
# Keep rows approved by at least 2 of the 3 annotators; copy so the new column is not set on a slice
general = general[general["Pieter"] + general["Murilo"] + general["Eric"] >= 2].copy()
general["isAdHominem"] = general["fallacies.df.Intended.Fallacy"] == "Ad Hominem"
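
The voting rule can be checked on a small toy table. This is only a sketch: the rows below are invented, and the `strict`/`lenient` names are ours. It also shows one detail worth knowing: adding the columns with `+` propagates a missing vote as NaN, so such a row is always dropped, whereas `sum(axis=1)` would count the missing vote as 0.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the annotated general dataset; all values are invented.
toy = pd.DataFrame({
    "fallacies.df.Intended.Fallacy": ["No Fallacy", "Ad Hominem", "Ad Hominem", "No Fallacy"],
    "Eric": [1.0, 0.0, np.nan, 0.0],
    "Pieter": [1, 1, 1, 0],
    "Murilo": [0, 0, 1, 1],
})
annotators = ["Eric", "Pieter", "Murilo"]

# Adding the columns with + propagates NaN, and NaN >= 2 is False,
# so a row with a missing vote is always dropped.
strict = (toy["Eric"] + toy["Pieter"] + toy["Murilo"]) >= 2

# sum(axis=1) skips NaN by default, so a missing vote counts as 0 instead.
lenient = toy[annotators].sum(axis=1) >= 2

kept = toy[lenient].copy()
kept["isAdHominem"] = kept["fallacies.df.Intended.Fallacy"] == "Ad Hominem"
print(kept)
```

Which of the two behaviours is preferable depends on how a missing vote should be interpreted; the cell above uses the strict variant.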

In [10]:
general.head()

Unnamed: 0,fallacies.df.Topic,fallacies.df.Intended.Fallacy,fallacies.df.Text,Eric,Pieter,Murilo,isAdHominem
0,Are humans to blame for certain animal extinct...,No Fallacy,"Yes, human beings have hunted and eaten animal...",1.0,1,1,False
1,Are humans to blame for certain animal extinct...,No Fallacy,Humans are not to be blamed for animal extinct...,0.0,1,1,False
7,Are humans to blame for certain animal extinct...,No Fallacy,Humans don't care enough for living beings.,1.0,1,1,False
9,Are humans to blame for certain animal extinct...,Ad Hominem,Of course. You throw your garbage into the oce...,1.0,1,1,True
11,Are Quentin Tarantinos movies too violent?,Ad Hominem,"Oh now, I'm not going to debate with you... Ha...",1.0,1,1,True


In [11]:
adHominem.head()

Unnamed: 0.1,Unnamed: 0,archived,author_name,body,body_html,controversiality,created,created_utc,delta,downs,...,id,link_id,mod_reports,name,parent_id,title,ups,user_reports,violated_rule,ad_hominem
0,0,True,deckerparkes,What makes corporations different in this case...,"<div class=""md""><p>What makes corporations dif...",0,1466105000.0,1466076301.0,False,0,...,d4bfrtv,t3_4ocqwc,[],t1_d4bfrtv,t1_d4bfkyw,,9,[],0.0,False
1,1,True,strapt313,>there are arguments that they will consider b...,"<div class=""md""><blockquote>\n<p>there are arg...",0,1424406000.0,1424377113.0,False,0,...,coqorty,t3_2wbn3s,[],t1_coqorty,t1_coqer7x,,1,[],2.0,True
2,2,True,ZeusThunder369,"Basically to believe a patriarchy exists, you ...","<!-- SC_OFF --><div class=""md""><p>Basically to...",0,1486987000.0,1486958071.0,False,0,...,5tqqra,,[],t3_5tqqra,,CMV: I don't know how one can believe a patria...,5,[],0.0,False
3,3,True,TheToastIsBlue,The punishment for heresy was being burned at ...,"<div class=""md""><p>The punishment for heresy w...",0,1490136000.0,1490107248.0,False,0,...,df7ulkr,t3_60mna9,[],t1_df7ulkr,t1_df7topk,,42,[],0.0,False
4,4,True,,No it doesn't. Sex is defined by DNA. DNA cann...,"<div class=""md""><p>No it doesn&#39;t. Sex is d...",0,1444970000.0,1444940749.0,False,0,...,cw11y9d,t3_3ov1gq,[],t1_cw11y9d,t1_cw11qpe,,-9,[],0.0,False


In [30]:
df_ad_hominems = pd.concat([adHominem['body'], adHominem['ad_hominem']], axis=1, keys=['body', 'isAdHominem'])
df_general = pd.concat([general['fallacies.df.Text'], general['isAdHominem']], axis=1, keys=['body', 'isAdHominem'])
df_merged = pd.concat([df_ad_hominems, df_general])  # DataFrame.append was removed in pandas 2.0

df_merged["body"] = df_merged["body"].astype(str)
df_merged["isAdHominem"] = df_merged["isAdHominem"].astype(bool)
# Drop rows containing NaN, inf, or the string 'nan' (what a missing body becomes after astype(str))
df_merged = df_merged[~df_merged.isin([np.nan, np.inf, -np.inf, 'nan']).any(axis=1)]

print(df_ad_hominems.shape)
print(df_general.shape)
print(df_merged.shape)

(29281, 2)
(382, 2)
(29660, 2)
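
The merged frame has 29660 rows, three fewer than 29281 + 382 = 29663, which is consistent with the filter dropping three rows. An alternative ordering, sketched below on invented data, drops missing values before casting. It assumes missing entries are real NaN values rather than text, and it avoids one pitfall of casting first: a missing label cast with `astype(bool)` becomes `True`.

```python
import numpy as np
import pandas as pd

# Invented rows: one missing body and one missing label.
toy = pd.DataFrame({
    "body": ["first comment", np.nan, "third comment", "fourth comment"],
    "isAdHominem": [True, False, np.nan, False],
})

# A missing value cast to bool becomes True, which would mislabel the row.
nan_as_bool = pd.Series([np.nan]).astype(bool).tolist()

# Drop incomplete rows first, then cast the remaining values.
clean = toy.dropna(subset=["body", "isAdHominem"]).copy()
clean["body"] = clean["body"].astype(str)
clean["isAdHominem"] = clean["isAdHominem"].astype(bool)
print(clean.shape)
```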


### 2. Export

After running the cell below, the file merged_datasets.csv is created and contains both datasets. Rows from the ad_hominems dataset come first (29281 before the NaN filter), followed by rows from the general dataset (382 before the NaN filter). The merged frame has 29660 rows because the filter removed 3 of the 29663. The first column holds each row's id in its original dataset: it starts at 0 (first row of the ad_hominems dataset), runs up to 29280 (last row of the ad_hominems dataset), then starts over at 0 (first row of the general dataset) and ends at 588 (last row of the general dataset). The ids are therefore not unique across the file, and the general ids have gaps because of the annotation filter.
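
Because the ids restart, a row cannot be identified by the first column alone. One possible way to make rows traceable, sketched here on invented data, is to record the origin in an extra column and give the merged frame a fresh index. The `source` and `original_id` column names are our own and are not part of the exported file.

```python
import pandas as pd

# Invented stand-ins for the two frames; the general one keeps gaps in its index.
reddit = pd.DataFrame({"body": ["a", "b", "c"], "isAdHominem": [False, True, False]})
general = pd.DataFrame({"body": ["d", "e"], "isAdHominem": [True, False]}, index=[0, 7])

# Tag each frame with its origin before concatenating.
merged = pd.concat([
    reddit.assign(source="ad_hominem"),
    general.assign(source="general"),
])
print(merged.index.tolist())  # the original ids, which restart for the second frame

# Keep the old ids as a column and replace the index with a unique 0..n-1 range.
merged = merged.rename_axis("original_id").reset_index()
print(merged)
```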

In [31]:
df_merged.to_csv("../../data/merged_datasets.csv")