# Hate Speech Detector 2.0
---
**Row-wise (tweet-wise) data binder and data duplicator**
1. Load the dataframe with class labels.
2. Perform a cardinality analysis of the tweet class combinations.
3. For each class combination whose cardinality is below the desired threshold (e.g. at least 10 tweets per class combination):
    1. Select the tweets belonging to that class combination.
    2. Randomly select, with replacement, as many tweets as are needed to reach the threshold (e.g. if there are 2 examples, randomly draw 10-2=8 tweets).
    3. Append the selected tweets to the combined dataset.
4. Save the duplicated dataset to a .csv file.
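
The steps above can be sketched compactly on toy data. This is only an illustrative sketch, not the notebook's implementation: the helper name `top_up_groups`, the toy columns `x` and `y`, and the fixed `seed` are assumptions made for the example.

```python
import pandas as pd

def top_up_groups(df, label_cols, threshold, seed=0):
    """Hypothetical helper: duplicate rows until every label combination has `threshold` rows."""
    extras = []
    for _, group in df.groupby(label_cols):
        missing = threshold - len(group)
        if missing > 0:
            # sample with replacement so that very small groups can still be topped up
            extras.append(group.sample(n=missing, replace=True, random_state=seed))
    # original rows first, duplicated rows appended at the end
    return pd.concat([df] + extras, ignore_index=True)

# toy data: two stand-in label columns instead of the seven real labels
toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e', 'f'],
    'x': [0, 0, 0, 0, 1, 1],
    'y': [0, 0, 0, 0, 0, 1],
})
topped = top_up_groups(toy, ['x', 'y'], threshold=3)
counts = topped.groupby(['x', 'y']).size()
```

Combinations that already meet the threshold are left untouched; only the under-represented ones gain duplicated rows.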

In [1]:
import numpy as np
import pandas as pd

import random

In [2]:
LABELS = ['wyzywanie', 'grożenie', 'wykluczanie', 'odczłowieczanie', 'poniżanie', 'stygmatyzacja', 'szantaż']
THRESHOLD = 20

COMBINED_PATH = 'data/tweets_sady/main/sady_combined.csv'
DUPLICATED_PATH = 'data/tweets_sady/processed/sady_duplicated.csv'

## Loading dataframe

In [3]:
df_combined = pd.read_csv(COMBINED_PATH)
df_combined[LABELS] = df_combined[LABELS].fillna(.0)
df_combined[LABELS] = df_combined[LABELS].astype('int')
f'{len(df_combined)} total examples.'

'15202 total examples.'

In [4]:
def classes(df, convert_null=False):
    df_c = df[LABELS]
    if convert_null:
        df_c = df_c.notnull().astype('int')

    return df_c

df_classes = classes(df_combined)
df_classes.head(2)

,wyzywanie,grożenie,wykluczanie,odczłowieczanie,poniżanie,stygmatyzacja,szantaż
0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0


## Cardinality analysis

In [5]:
def class_combination_cards(df_c):
    # work on an explicit copy so the helper column is not added to the caller's dataframe
    df = df_c.copy()

    df['cardinality'] = np.ones(len(df_c), dtype=np.int32)
    df_cc = df.groupby(LABELS).count().sort_values(by='cardinality', ascending=False)
    df_cc['%'] = df_cc['cardinality'] / len(df_c) * 100

    return df_cc
df_combination_cards = class_combination_cards(df_classes)
df_combination_cards

wyzywanie,grożenie,wykluczanie,odczłowieczanie,poniżanie,stygmatyzacja,szantaż,cardinality,%
0,0,0,0,0,0,0,13654,89.817129
0,0,0,0,0,1,0,361,2.374688
0,0,0,0,1,0,0,251,1.651099
0,0,0,0,1,1,0,197,1.295882
0,1,0,0,0,0,0,179,1.177477
0,1,0,0,0,1,0,106,0.697277
1,0,0,0,1,0,0,57,0.374951
1,0,0,0,0,0,0,43,0.282858
1,0,0,1,1,1,0,40,0.263123
0,0,0,1,0,0,0,36,0.236811


Many class combinations have fewer than 5 examples, which is too few for 5-fold stratified cross-validation, where each fold should contain at least one example of every combination. These combinations must be duplicated up to a threshold that is a multiple of 5 in order to reduce the class-combination imbalance.
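
As a rough illustration of this check, the sketch below lists the label combinations that have fewer rows than the number of folds. The helper name `combos_below` and the toy columns `x` and `y` are assumptions made for the example, not part of the notebook.

```python
import pandas as pd

def combos_below(df, label_cols, n_folds=5):
    """Hypothetical helper: return the sizes of label combinations with fewer than `n_folds` rows."""
    sizes = df.groupby(label_cols).size()
    return sizes[sizes < n_folds]

# toy data: combination (0, 0) has 6 rows, (1, 0) has 2 rows, (1, 1) has 1 row
toy = pd.DataFrame({
    'x': [0] * 6 + [1] * 3,
    'y': [0] * 6 + [0] * 2 + [1],
})
rare = combos_below(toy, ['x', 'y'], n_folds=5)
```

Here `rare` holds the two combinations that could not place an example in each of 5 folds.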

## Duplicating low-cardinality tweets

In [6]:
def duplicate_under_threshold(df, df_cc, threshold=5):
    combinations = df_cc[df_cc['cardinality'] < threshold].index

    # collect the original data plus all duplicated rows and concatenate once
    # (DataFrame.append was removed in pandas 2.0)
    parts = [df]
    for combination in combinations:

        # reduce dataframe to only relevant examples for a combination of classes (labels)
        df_relev = df
        for label, c in zip(LABELS, combination):
            df_relev = df_relev[df_relev[label] == c]

        # random positions of relevant examples, drawn with replacement (for duplication)
        rand_pos = [0 if len(df_relev) <= 1 else random.randint(0, len(df_relev) - 1)
                    for _ in range(threshold - len(df_relev))]

        parts.append(df_relev.iloc[rand_pos])

    df_dupl = pd.concat(parts)
    df_dupl[LABELS] = df_dupl[LABELS].astype('int')

    return df_dupl

In [7]:
df_duplicated = duplicate_under_threshold(df_combined, df_combination_cards, threshold=THRESHOLD)
f'{len(df_duplicated)} total examples.'

'15791 total examples.'
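
The growth from 15202 to 15791 rows (589 added) should equal the sum of `threshold - cardinality` over all combinations below the threshold. A minimal sketch of that sanity check follows; the helper name `expected_added_rows` and the toy cardinalities are assumptions for illustration only.

```python
import pandas as pd

def expected_added_rows(cardinalities, threshold):
    """Hypothetical helper: number of rows duplication should add for a series of combination sizes."""
    below = cardinalities[cardinalities < threshold]
    # each under-threshold combination is filled up to exactly `threshold`
    return int((threshold - below).sum())

# toy cardinalities: only 12, 3 and 1 are below the threshold of 20
cards = pd.Series([50, 12, 3, 1])
added = expected_added_rows(cards, threshold=20)  # 8 + 17 + 19
```

Applied to the notebook's data, the same computation could be run on the `cardinality` column produced before duplication, with `THRESHOLD` as the threshold.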

In [8]:
df_classes = classes(df_duplicated)
df_combination_cards = class_combination_cards(df_classes)
df_combination_cards

wyzywanie,grożenie,wykluczanie,odczłowieczanie,poniżanie,stygmatyzacja,szantaż,cardinality,%
0,0,0,0,0,0,0,13654,86.466975
0,0,0,0,0,1,0,361,2.286112
0,0,0,0,1,0,0,251,1.589513
0,0,0,0,1,1,0,197,1.247546
0,1,0,0,0,0,0,179,1.133557
0,1,0,0,0,1,0,106,0.671268
1,0,0,0,1,0,0,57,0.360965
1,0,0,0,0,0,0,43,0.272307
1,0,0,1,1,1,0,40,0.253309
0,0,0,1,0,0,0,36,0.227978


The class-combination imbalance has been reduced: every combination now has at least `THRESHOLD` (20) examples, so 5-fold stratified cross-validation is now possible.

## Saving dataset

In [9]:
df_duplicated = df_duplicated.sort_values(by=['date', 'time'])
df_duplicated.to_csv(DUPLICATED_PATH, index=False)