# Hate Speech Detector 2.0
---
**Advanced data analysis**
1. Calculation of **phrase occurrence** in text:
    1. Does the phrase occur fully or partially, and how?
    2. **POC - Phrase Occurrence Coefficient** - get the min, mean and max values.
    3. 1.0 means full hate speech (the phrase occurs in full), 0.0 means no hate speech.
    4. Visualization of POC calculation examples.
2. For each of the 7 hate-speech classes and the vulgar class:
    1. Load the appropriate .txt dictionary file of lemmatized hateful phrases.
    2. For each lemmatized tweet:
        1. Calculate min, mean and max **POC** scores against the appropriate **hateful or vulgar phrases**.
        2. Aggregate over the phrases: the minimum of the mins, the mean of the means and the maximum of the maxes.
    3. Save the results into a .csv file.
3. Polish polyglot sentiment analysis
4. Characters, syllables, words counting.
5. For each of the 7 hate-speech classes and the vulgar class:
    1. Detect N hateful topics of K words each (N and K are assumed values).
    2. Save the **LDA (Latent Dirichlet Allocation)** model.
    3. For each lemmatized tweet:
        1. Calculate **POC** scores of the **topics** (treating them as phrases) and aggregate them over the topics (min, mean, max).
    4. Save the results into a .csv file.
6. For each tweet:
    1. Determine how many words have which type of **polyglot sentiment**.
    2. Count characters, syllables, words and unique words.
    3. Compare polyglot sentiment results with empirical sentiment annotations. Calculate accuracy and F measures.
    4. Save results into .csv file.

In [1]:
import numpy as np
import pandas as pd

import itertools

from combo.predict import COMBO
from polyglot.text import Text
from polyglot.downloader import downloader

import pyphen

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

from tqdm.notebook import tqdm

import os
import csv
import pickle

In [2]:
LABELS = ['wyzywanie', 'grożenie', 'wykluczanie', 'odczłowieczanie', 'poniżanie',
          'stygmatyzacja', 'szantaż', 'wulgaryzm']
LABELS_SMALL = ['wyz', 'groz', 'wyk', 'odcz', 'pon', 'styg', 'szan']
LABELS_V_SMALL = LABELS_SMALL + ['vulg']

DUPLICATED_PATH = 'data/tweets_sady/processed/sady_duplicated.csv'
LEMMAS_PATH = 'data/tweets_sady/processed/lemmas.csv'
POC_SCORES_PATH = 'data/tweets_sady/processed/poc_scores.csv'
TOPIC_POC_SCORES_PATH = 'data/tweets_sady/processed/topic_poc_scores.csv'
OTHER_SCORES_PATH = 'data/tweets_sady/processed/other_scores.csv'

HATEFUL_LEMM_DIR = 'data/hateful/lemm_{}.txt'
VULGARS_LEMM_DIR = 'data/vulgars/lemm_{}.txt'
HATEFUL_EXT_DIR = 'data/hateful/ext_{}.txt'
VULGARS_EXT_DIR = 'data/vulgars/ext_{}.txt'
LDA_MODEL_DIR = 'models/lda/lda_{}.pkl'

TAGGER_MODEL = 'polish-herbert-base'
HYPHENATION_MODEL = 'pl_PL'

In [3]:
pd.set_option('display.max_colwidth', 400)

In [4]:
with open('data/other/polish_stopwords.txt', 'r') as f:
    polish_stopwords = f.read().split('\n')[:-1]
polish_stopwords[:10]

['a', 'aby', 'ach', 'acz', 'aczkolwiek', 'aj', 'albo', 'ale', 'alez', 'ależ']

In [5]:
from polyglot.detect.base import logger as polyglot_logger
polyglot_logger.setLevel("ERROR")

## Phrase occurrence calculation

**How to calculate the phrase occurrence coefficient (POC) in a text?**

1. Split the lemmatized text and the phrase on whitespace into separate words.
2. Delete all stopwords and punctuation symbols from the text and the phrase.
3. Enumerate all words left in the text, starting from 0.
4. For each word in the phrase, list all occurrences (i.e. position numbers) of the word in the text. If the word does not occur in the text, omit it (empty list).
5. Get all possible orders of the phrase words in the examined text, i.e. take the Cartesian product of the position lists.
6. For each possible order:
    1. Form the n positions into n-1 pairs of consecutive elements.
    2. To each pair assign 1 if the first element is smaller than the second (ascending order), else -1.
    3. Sum all assigned values and divide the total by the number of words in the phrase minus 1 (the number of pairs when every phrase word is found).
7. Return the minimum, mean and maximum score.

---
**EXAMPLE 1.**:<br />
**text**: *Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji* <br />
**phrase**: *PiS gwałci żeby nie robić aborcji*<br />
![schema 01](charts/schemes/HSD2.0_scheme01.png)<br />
Results: **MIN=1.0 MEAN=1.0 MAX=1.0**

---
**EXAMPLE 2.**:<br />
**text**: *Faszystowskie sądy ach faszystowskie sądy*<br/>
**phrase** : *Ach faszystowskie sądy fałszywe*<br />
![schema 02](charts/schemes/HSD2.0_scheme02.png)<br />
Results: **MIN=-0.5 MEAN=0.25 MAX=0.5**

---
**EXAMPLE 3.**:<br />
**text**: *Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!*<br/>
**phrase**: *LGBT zniszczą nową rodziny tradycję mężczyzn i kobiet.*<br />
![schema 03](charts/schemes/HSD2.0_scheme03.png)<br />
Results: **MIN=0.17 MEAN=0.33 MAX=0.5**
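
The pair-scoring steps can also be traced by hand without the lemmatizer. The sketch below is a minimal illustration, not the notebook's implementation: `poc_from_tokens` is a hypothetical helper that takes word lists which are assumed to be already lemmatized and stripped of stopwords. The token lists for Example 2 are likewise assumed (with *ach* treated as a stopword).

```python
import itertools
from statistics import mean

def poc_from_tokens(text_words, phrase_words):
    """POC on pre-lemmatized, stopword-free word lists (hypothetical helper)."""
    if len(phrase_words) < 2:
        hit = 1.0 if phrase_words and phrase_words[0] in text_words else 0.0
        return (hit, hit, hit)
    # Positions of each phrase word in the text; words never found are dropped.
    positions = [[i for i, w in enumerate(text_words) if w == p] for p in phrase_words]
    positions = [p for p in positions if p]
    scores = []
    # One candidate order per element of the Cartesian product of positions.
    for order in itertools.product(*positions):
        signs = [1.0 if a < b else -1.0 for a, b in zip(order, order[1:])]
        scores.append(sum(signs) / (len(phrase_words) - 1))
    return (min(scores), mean(scores), max(scores))

# Assumed tokens for Example 2 after lemmatization and stopword removal.
text_words = ['faszystowski', 'sąd', 'faszystowski', 'sąd']
phrase_words = ['faszystowski', 'sąd', 'fałszywy']
example_2 = poc_from_tokens(text_words, phrase_words)
print(example_2)  # (-0.5, 0.25, 0.5)
```

Four orders are possible here, (0, 1), (0, 3), (2, 1) and (2, 3). Three are ascending and one is descending, and each sign is divided by 2 because the phrase keeps three words.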

In [6]:
tagger = COMBO.from_pretrained(TAGGER_MODEL)

In [7]:
def lemmatize_text(text):
    text = text.replace('#', '').replace('[...]', '')
    sentence = tagger(text)
    
    lemmas = [token.lemma.lower() for token in sentence.tokens if token.deprel != 'punct']
    lemm_text = ' '.join(lemmas)
    
    return lemm_text

In [8]:
def POC(text, phrase, lemmatized=False, stopwords=()):
    # Lemmatize both inputs unless the caller has already done so.
    t = text if lemmatized else lemmatize_text(text)
    p = phrase if lemmatized else lemmatize_text(phrase)

    # Drop stopwords from the text and from the phrase.
    stopwords = set(stopwords)
    t_words = [w for w in t.split(' ') if w not in stopwords]
    p_words = [w for w in p.split(' ') if w not in stopwords]

    assert (len(t_words) > 0), 'The examined text must have at least one non-stopword word!'

    if len(p_words) > 1:
        # Positions of every phrase word in the text; words that never occur are skipped.
        occurrences = [[i for i, x in enumerate(t_words) if x == p_w] for p_w in p_words]
        occurrences = [o for o in occurrences if len(o) > 0]

        # Every way of picking one position per matched phrase word.
        orders = list(itertools.product(*occurrences))
        # Consecutive position pairs within each order.
        order_pairs_list = [list(zip(o[:-1], o[1:])) for o in orders]

        # +1 for an ascending pair, -1 otherwise, normalized by (phrase length - 1).
        coeffs = [sum(1. if a < b else -1. for a, b in ops) / (len(p_words) - 1)
                  for ops in order_pairs_list]

        return (np.min(coeffs), np.mean(coeffs), np.max(coeffs))
    elif len(p_words) == 1:
        return (1., 1., 1.) if p_words[0] in t_words else (0., 0., 0.)
    else:
        return (0., 0., 0.)

In [9]:
text = 'Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji'
phrase = 'PiS gwałci żeby nie robić aborcji'

POC(text, phrase, stopwords=polish_stopwords)

(1.0, 1.0, 1.0)

In [10]:
text = 'Faszystowskie sądy ach faszystowskie sądy'
phrase = 'Ach faszystowskie sądy fałszywe'

POC(text, phrase, stopwords=polish_stopwords)

(-0.5, 0.25, 0.5)

In [11]:
text = 'Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!'
phrase = 'LGBT zniszczą nową rodziny tradycję mężczyzn i kobiet.'

POC(text, phrase, stopwords=polish_stopwords)

(0.5, 0.5, 0.5)

## Loading data

In [12]:
def load_lemmatized_tweets():
    df = pd.read_csv(DUPLICATED_PATH)
    df = df[['id', 'tweet']]
    
    if not os.path.exists(LEMMAS_PATH):
        df['lemmatized'] = list([lemmatize_text(tweet) for tweet in tqdm(df['tweet'])])
        df[['id', 'lemmatized']].to_csv(LEMMAS_PATH, index=False)
    else:
        df['lemmatized'] = pd.read_csv(LEMMAS_PATH)['lemmatized']
    
    return df

df = load_lemmatized_tweets()
df.head(2)

| | id | tweet | lemmatized |
|---|---|---|---|
| 0 | 9 | w czwartek muszę poprawić sądy i trybunały | w czwartek musieć poprawić sąd i trybunał |
| 1 | 8 | Żale Nałęcza i riposta Macierewicza: Pan był w kompartii, czy ma prawo wy­gła­szać takie sądy? \| niezalezna.pl | żale nałęcz i riposta macierewicz pan być w kompartia czy mieć prawo wyżgłaćszać taki sąd niezalezna.pl |


In [13]:
def load_lemm_phrases(load_vulg=False):
    aphr = list([])
    for label in LABELS_SMALL:
        with open(HATEFUL_LEMM_DIR.replace('{}', label), 'r') as f:
            aphr.append(np.array(f.read().split(';')))
    if load_vulg:
        with open(VULGARS_LEMM_DIR.replace('{}', LABELS_V_SMALL[-1]), 'r') as f:
            aphr.append(np.array(f.read().split(';')))
    
    # Phrase lists differ in length, so build an explicit object array.
    return np.array(aphr, dtype=object)

lemm_phrases = load_lemm_phrases(load_vulg=True)

In [14]:
def load_ext_phrases(load_vulg=False):
    aphr = list([])
    for label in LABELS_SMALL:
        with open(HATEFUL_EXT_DIR.replace('{}', label), 'r') as f:
            aphr.append(np.array(f.read().split(';')))
    if load_vulg:
        with open(VULGARS_EXT_DIR.replace('{}', LABELS_V_SMALL[-1]), 'r') as f:
            aphr.append(np.array(f.read().split(';')))
    
    # Phrase lists differ in length, so build an explicit object array.
    return np.array(aphr, dtype=object)

ext_phrases = load_ext_phrases(load_vulg=True)

**Calculate POC score for all tweets.**

1. Load relevant data with sanitized tweets and all hateful phrases.
2. For each tweet:
    1. For each hate type (and one vulgar):
        1. Calculate POC scores (min, mean, max) for every phrase that belongs to the given hate type (or vulgar).
        2. Take the minimum of the minimum scores, the mean of the mean scores and the maximum of the maximum scores.
        3. Write the results into a dictionary.
    2. Write the dictionary values of all hate types into a .csv row.
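
As a small illustration of the aggregation step, the sketch below collapses per-phrase score triples into one triple per hate type, mirroring the `np.min` / `np.mean` / `np.max` calls in `analyse_POC`. The helper name `aggregate_poc` and the toy scores are hypothetical and exist only for this example.

```python
def aggregate_poc(per_phrase_scores):
    """Collapse per-phrase (min, mean, max) triples into one label-level triple."""
    mins, means, maxes = zip(*per_phrase_scores)
    return (min(mins), sum(means) / len(means), max(maxes))

# Hypothetical per-phrase scores for one tweet and one hate type.
toy_scores = [(-0.5, 0.25, 0.5), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
label_scores = aggregate_poc(toy_scores)
print(label_scores)
```

Because most phrases do not occur in a given tweet and score 0.0, the mean column tends to stay close to zero while the min and max columns capture the strongest single match.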

In [15]:
def analyse_POC(df, phr):
    csv_labels = list(['id'])
    for label in LABELS_V_SMALL:
        csv_labels.append(f'{label}_POC_min')
        csv_labels.append(f'{label}_POC_mean')
        csv_labels.append(f'{label}_POC_max')
    with open(POC_SCORES_PATH, 'w') as f:
        csv.writer(f).writerow(csv_labels)
    del csv_labels

    for _, tweet in tqdm(df.iterrows(), total=len(df)):
        scores = dict({})

        for label, phrases in zip(LABELS_V_SMALL, phr):
            sc_min, sc_mean, sc_max = list([]), list([]), list([])

            for phrase in phrases:
                mn, mean, mx = POC(tweet['lemmatized'], phrase, lemmatized=True,
                                   stopwords=polish_stopwords)
                sc_min.append(mn)
                sc_mean.append(mean)
                sc_max.append(mx)

            scores[f'{label}_min'] = np.min(sc_min)
            scores[f'{label}_mean'] = np.mean(sc_mean)
            scores[f'{label}_max'] = np.max(sc_max)
            del sc_min, sc_mean, sc_max

        csv_values = list([tweet['id']])
        for label in LABELS_V_SMALL:
            csv_values.append(scores[f'{label}_min'])
            csv_values.append(scores[f'{label}_mean'])
            csv_values.append(scores[f'{label}_max'])
        with open(POC_SCORES_PATH, 'a') as f:
            csv.writer(f).writerow(csv_values)
        del scores, csv_values
    

if not os.path.exists(POC_SCORES_PATH):
    analyse_POC(df, ext_phrases)

In [16]:
df_poc_scores = pd.read_csv(POC_SCORES_PATH)
df_poc_scores.head(2)

| | id | wyz_POC_min | wyz_POC_mean | wyz_POC_max | groz_POC_min | groz_POC_mean | groz_POC_max | wyk_POC_min | wyk_POC_mean | wyk_POC_max | ... | pon_POC_max | styg_POC_min | styg_POC_mean | styg_POC_max | szan_POC_min | szan_POC_mean | szan_POC_max | vulg_POC_min | vulg_POC_mean | vulg_POC_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 0.0 | 0.0 | 0.0 | -0.5 | -0.002193 | 0.5 | 0.0 | 0.0 | 0.0 | ... | 0.5 | -0.5 | 0.00026 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 8 | -0.333333 | 0.004526 | 0.5 | -0.5 | 0.000808 | 0.5 | 0.0 | 0.006219 | 0.333333 | ... | 0.5 | -0.5 | -0.004606 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Hateful phrases topics detection

**Find the top 20 topics, each expressed as a 20-word sentence, for the phrases of each hate type (and the vulgar class).**

1. For each hate type:
    1. Get the relevant extended phrases.
    2. Fit a CountVectorizer and an LDA model.
    3. Save the trained model into a pickle archive.
    4. For each tweet:
        1. Calculate the POC scores of each of the 20 topics.
        2. Save them into a .csv file.
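
To make the "topic as a sentence" idea concrete, the sketch below turns a topic-word weight matrix into one space-separated string of top words per topic. The vocabulary and weights are made up for illustration; in the notebook the matrix comes from the fitted LDA model's `components_` attribute and the vocabulary from the fitted CountVectorizer.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words).
vocab = np.array(['sąd', 'sędzia', 'prawo', 'reforma', 'wyrok'])
components = np.array([
    [0.1, 2.0, 0.3, 1.5, 0.2],
    [1.2, 0.1, 0.9, 0.2, 3.0],
])

def top_words(weights, vocab, n_words):
    """Return the n_words highest-weighted words of one topic, best first."""
    # argsort is ascending, so reverse it and keep the first n_words indices.
    best = weights.argsort()[::-1][:n_words]
    return ' '.join(vocab[best])

topic_sentences = [top_words(row, vocab, n_words=3) for row in components]
print(topic_sentences)  # ['sędzia reforma prawo', 'wyrok sąd prawo']
```

The slice `[::-1][:n_words]` selects the same indices as the more compact `[:-n_words - 1:-1]` form. Each resulting string can then be passed to `POC` as if it were a phrase.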

In [17]:
LDA_N_TOPICS, LDA_N_WORDS = 20, 20

In [18]:
def train_lda_models(phr, n_topics=10):
    for label, phrases, in tqdm(zip(LABELS_V_SMALL, phr), total=len(LABELS_V_SMALL)):

        cv = CountVectorizer(stop_words=polish_stopwords)
        count_data = cv.fit_transform(phrases)

        lda_model = LDA(n_components=n_topics, n_jobs=-1)
        lda_model.fit(count_data)

        with open(LDA_MODEL_DIR.replace('{}', label), 'wb') as f:
            pickle.dump([lda_model, cv], f)

train_lda_models(ext_phrases, n_topics=LDA_N_TOPICS)

In [19]:
def lda_topics(lda_model, lda_cv, n_words=10):
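    # get_feature_names() was removed in scikit-learn 1.2; prefer get_feature_names_out().
    words = (lda_cv.get_feature_names_out() if hasattr(lda_cv, 'get_feature_names_out')
             else lda_cv.get_feature_names())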
    
    topics = list([' '.join([words[i] for i in topic.argsort()[:-n_words - 1:-1]])
                   for topic in lda_model.components_])
    
    return topics

In [20]:
def analyse_topic_POC(df, n_words=10):
    csv_labels = list(['id'])
    for label in LABELS_V_SMALL:
        csv_labels.append(f'{label}_topic_POC_min')
        csv_labels.append(f'{label}_topic_POC_mean')
        csv_labels.append(f'{label}_topic_POC_max')
    with open(TOPIC_POC_SCORES_PATH, 'w') as f:
        csv.writer(f).writerow(csv_labels)
    del csv_labels

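    # Load every LDA model once and cache its topic strings, instead of
    # unpickling all models again for each tweet.
    label_topics = dict({})
    for label in LABELS_V_SMALL:
        with open(LDA_MODEL_DIR.replace('{}', label), 'rb') as f:
            lda_model, cv = pickle.load(f)
        label_topics[label] = lda_topics(lda_model, cv, n_words=n_words)

    for _, tweet in tqdm(df.iterrows(), total=len(df)):
        scores = dict({})

        for label in LABELS_V_SMALL:
            topics = label_topics[label]
            sc_min, sc_mean, sc_max = list([]), list([]), list([])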

            for topic in topics:
                mn, mean, mx = POC(tweet['lemmatized'], topic, lemmatized=True,
                                   stopwords=polish_stopwords)
                sc_min.append(mn)
                sc_mean.append(mean)
                sc_max.append(mx)

            scores[f'{label}_min'] = np.min(sc_min)
            scores[f'{label}_mean'] = np.mean(sc_mean)
            scores[f'{label}_max'] = np.max(sc_max)
            del sc_min, sc_mean, sc_max

        csv_values = list([tweet['id']])
        for label in LABELS_V_SMALL:
            csv_values.append(scores[f'{label}_min'])
            csv_values.append(scores[f'{label}_mean'])
            csv_values.append(scores[f'{label}_max'])
        with open(TOPIC_POC_SCORES_PATH, 'a') as f:
            csv.writer(f).writerow(csv_values)
        del scores, csv_values

if not os.path.exists(TOPIC_POC_SCORES_PATH):
    analyse_topic_POC(df, n_words=LDA_N_WORDS)

In [21]:
df_topic_poc_scores = pd.read_csv(TOPIC_POC_SCORES_PATH)
df_topic_poc_scores.head(2)

| | id | wyz_topic_POC_min | wyz_topic_POC_mean | wyz_topic_POC_max | groz_topic_POC_min | groz_topic_POC_mean | groz_topic_POC_max | wyk_topic_POC_min | wyk_topic_POC_mean | wyk_topic_POC_max | ... | pon_topic_POC_max | styg_topic_POC_min | styg_topic_POC_mean | styg_topic_POC_max | szan_topic_POC_min | szan_topic_POC_mean | szan_topic_POC_max | vulg_topic_POC_min | vulg_topic_POC_mean | vulg_topic_POC_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 0.0 | 0.0 | 0.0 | -0.052632 | 0.0 | 0.052632 | 0.0 | 0.0 | 0.0 | ... | 0.052632 | -0.052632 | 0.002632 | 0.052632 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 8 | 0.0 | 0.005263 | 0.052632 | -0.052632 | -0.010526 | 0.0 | -0.052632 | -0.002632 | 0.052632 | ... | 0.0 | -0.052632 | -0.010526 | 0.052632 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Other text scores

### Polish Polyglot sentiment analysis

In [22]:
def text_sentiment(text):
    
    # Detect characters that polyglot cannot process, then strip them.
    # Each distinct character is probed only once.
    t = text
    invalid = set()
    for ch in set(t):
        try:
            Text(f"Char: {ch}").words
        except Exception:
            invalid.add(ch)
    for ch in invalid:
        t = t.replace(ch, '')
    
    t = Text(t)
    sents = list([])
    for w in t.words:
        try:
            s = w.polarity
        except ValueError:
            s = 0
        sents.append(s)
    sents = np.array(sents)
    
    return np.size(sents[sents==-1]), np.size(sents[sents==0]), np.size(sents[sents==1])

In [23]:
text_sentiment('Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji')

(0, 17, 0)

In [24]:
text_sentiment('Faszystowskie sądy ach faszystowskie sądy')

(0, 5, 0)

In [25]:
text_sentiment('Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!')

(2, 12, 0)

### Characters, syllables, words counting

In [26]:
dic = pyphen.Pyphen(lang=HYPHENATION_MODEL)

In [27]:
def text_numbers(text):
    num_chars = len(text.replace(' ', ''))
    num_syllables = sum([len(dic.inserted(word).split('-')) for word in text.split(' ')])
    num_words = len(text.split(' '))
    num_unique_words = len(set(text.lower().split(' ')))
    
    return num_chars, num_syllables, num_words, num_unique_words

In [28]:
text_numbers('Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji')

(84, 33, 16, 16)

In [29]:
text_numbers('Faszystowskie sądy ach faszystowskie sądy')

(37, 13, 5, 3)

In [30]:
text_numbers('Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!')

(74, 26, 13, 13)

In [31]:
text_numbers(lemmatize_text('Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji'))

(78, 29, 16, 16)

In [32]:
text_numbers(lemmatize_text('Faszystowskie sądy ach faszystowskie sądy'))

(33, 11, 5, 3)

In [33]:
text_numbers(lemmatize_text('Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!'))

(74, 25, 13, 13)

**Calculate the other scores above for all tweets.**

1. Load relevant data with sanitized tweets.
2. For each tweet:
    1. Remove characters that are invalid for polyglot and cause errors.
    2. Determine how many words have which of the three sentiment types.
    3. Count characters, syllables, words and unique words.
    4. Write all values into a .csv row.
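
The polarity-counting step can be pictured with a toy lookup table. This is only a sketch under stated assumptions: `TOY_POLARITY` and `count_polarities` are hypothetical stand-ins for polyglot's per-word polarity lookup, and words missing from the table are treated as neutral, much as `text_sentiment` falls back to 0 on a `ValueError`.

```python
from collections import Counter

# Hypothetical polarity lexicon used only for this illustration.
TOY_POLARITY = {'dobry': 1, 'zły': -1, 'zniszczyć': -1}

def count_polarities(words, lexicon):
    """Return (negative, neutral, positive) word counts; unknown words are neutral."""
    counts = Counter(lexicon.get(w.lower(), 0) for w in words)
    return counts[-1], counts[0], counts[1]

result = count_polarities('To był dobry dzień ale zły wieczór'.split(), TOY_POLARITY)
print(result)  # (1, 5, 1)
```

The three counts correspond to the `s_neg`, `s_neu` and `s_pos` columns written by `analyse_other`.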

In [34]:
def analyse_other(df):
    csv_labels = list([
        'id',
        's_neg', 's_neu', 's_pos',
        'n_chars', 'n_sylls', 'n_words', 'nu_words',
        'nl_chars', 'nl_sylls', 'nl_words', 'nlu_words',
    ])
    with open(OTHER_SCORES_PATH, 'w') as f:
        csv.writer(f).writerow(csv_labels)
    del csv_labels

    for _, tweet in tqdm(df.iterrows(), total=len(df)):
        scores = dict({})

        scores['neg'], scores['neu'], scores['pos'] = text_sentiment(tweet['tweet'])
        scores['chars'], scores['sylls'], scores['words'], scores['u_words'] = text_numbers(tweet['tweet'])
        scores['l_chars'], scores['l_sylls'], scores['l_words'], scores['l_u_words'] = text_numbers(tweet['lemmatized'])

        csv_values = list([
            tweet['id'],
            scores['neg'], scores['neu'], scores['pos'],
            scores['chars'], scores['sylls'], scores['words'], scores['u_words'],
            scores['l_chars'], scores['l_sylls'], scores['l_words'], scores['l_u_words']
        ])
        with open(OTHER_SCORES_PATH, 'a') as f:
            csv.writer(f).writerow(csv_values)
        del scores, csv_values

if not os.path.exists(OTHER_SCORES_PATH):
    analyse_other(df)

In [35]:
df_other_scores = pd.read_csv(OTHER_SCORES_PATH)
df_other_scores.head(2)

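| | id | s_neg | s_neu | s_pos | n_chars | n_sylls | n_words | nu_words | nl_chars | nl_sylls | nl_words | nlu_words |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 0 | 6 | 1 | 36 | 15 | 7 | 7 | 35 | 13 | 7 | 7 |
| 1 | 8 | 1 | 18 | 1 | 94 | 38 | 18 | 18 | 88 | 33 | 16 | 16 |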
