# Hate Speech Detector 2.0
---
**Advanced data analysis**
1. Calculation of **phrase occurrence** in text:
    1. Does the phrase occur fully or partially, and in what word order?
    2. **POC - Phrase Occurrence Coefficient** - get min, mean and max values.
    3. 1.0 means full hate speech; 0.0 means no hate speech.
    4. Visualization of POC calculation examples
2. For each of the 7 hate-speech classes and the vulgar class:
    1. Load the appropriate .txt dictionary file of lemmatized hateful phrases.
    2. For each lemmatized tweet:
        1. Calculate min, mean and max **POC** scores against the corresponding **hateful or vulgar phrases**.
        2. Get average values of mins, means and maxes.
    3. Save the results to a .csv file.
3. Polish polyglot sentiment analysis
4. Characters, syllables, words counting.
5. For each of the 7 hate-speech classes and the vulgar class:
    1. Detect N hateful topics of K words each (N and K are fixed in advance).
    2. Save the **LDA (Latent Dirichlet Allocation)** model.
    3. For each lemmatized tweet:
        1. Calculate **POC** scores of the **topics** (treating them as phrases) and aggregate them by mean over topics.
    4. Save the results to a .csv file.
6. For each tweet:
    1. Count how many words fall into each **polyglot sentiment** type.
    2. Count characters, syllables, words and unique words.
    3. Compare polyglot sentiment results with empirical sentiment annotations. Calculate accuracy and F measures.
    4. Save results into .csv file.

In [1]:
import numpy as np
import pandas as pd

import os

from src.extension.lemm import lemmatize_text
from src.measures import POC
from src.utils.lemm import load_lemmatized_tweets, load_lemm_phrases
from src.utils.ext import load_ext_phrases
from src.analysis.poc import analyse_POC
from src.analysis.lda import train_lda_models
from src.analysis.topic_poc import analyse_topic_POC
from src.analysis.other import analyse_other
from src.utils.texts import text_sentiment, text_numbers
from src.constants import (POC_SCORES_PATH, TOPIC_POC_SCORES_PATH, OTHER_SCORES_PATH,
                           POLISH_STOPWORDS)

In [2]:
pd.set_option('display.max_colwidth', 400)

## Phrase occurrence calculation

**How to calculate the phrase occurrence coefficient (POC) in a text?**

1. Split the lemmatized text and the phrase into separate words by whitespace.
2. Delete all stopwords and punctuation symbols from the text and the phrase.
3. Enumerate all words left in the text, starting from 0.
4. For each word in the phrase, list all occurrences (i.e. position numbers) of the word in the text. If the word does not occur in the text, omit it (empty list).
5. Get all possible orderings of the phrase words in the examined text, i.e. take the Cartesian product of the position lists.
6. For each possible ordering:
    1. Form the list of n positions into n-1 pairs of neighbours.
    2. For each pair, assign (1) if the first element is smaller than the second (ascending order), else (-1).
    3. Sum all assigned values and divide the total by the number of pairs in the full phrase (i.e. words in phrase - 1).
7. Return the minimum, mean and maximum score.
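
The steps above can be sketched in plain Python. This is only an illustration of the listed procedure, not the `POC` function imported from `src.measures`: the names `tokenize` and `poc_sketch`, the tiny stopword sets, and the choice to return zeros when fewer than two phrase words are found are all assumptions made for the example.

```python
from itertools import product
from statistics import mean
import string


def tokenize(text, stopwords):
    """Lowercase, split on whitespace, strip ASCII punctuation, drop stopwords."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w and w not in stopwords]


def poc_sketch(text, phrase, stopwords=frozenset()):
    """Return (min, mean, max) order scores of the phrase words found in text."""
    text_words = tokenize(text, stopwords)
    phrase_words = tokenize(phrase, stopwords)
    if len(phrase_words) < 2:
        return 0.0, 0.0, 0.0  # assumption: no pairs can be formed

    # Positions of every phrase word in the text; unseen words are omitted.
    positions = [
        [i for i, w in enumerate(text_words) if w == pw] for pw in phrase_words
    ]
    positions = [p for p in positions if p]
    if len(positions) < 2:
        return 0.0, 0.0, 0.0  # assumption: fewer than two words found scores 0

    n_pairs = len(phrase_words) - 1
    scores = []
    for order in product(*positions):
        # +1 for each ascending pair of neighbours, -1 otherwise.
        total = sum(1 if a < b else -1 for a, b in zip(order, order[1:]))
        scores.append(total / n_pairs)
    return min(scores), mean(scores), max(scores)


# Assumes "ach" is treated as a stopword (hypothetical one-word stopword set).
print(poc_sketch('Faszystowskie sądy ach faszystowskie sądy',
                 'Ach faszystowskie sądy fałszywe',
                 stopwords={'ach'}))  # (-0.5, 0.25, 0.5)
```

With "ach" removed, the phrase keeps three words (two pairs) and only two of them occur in the text, which gives the four orderings scored 0.5, 0.5, -0.5 and 0.5.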

---
**EXAMPLE 1.**:<br />
**text**: *Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji* <br />
**phrase**: *PiS gwałci żeby nie robić aborcji*<br />
![schema 01](charts/schemes/HSD2.0_scheme01.png)<br />
Results: **MIN=1.0 MEAN=1.0 MAX=1.0**

---
**EXAMPLE 2.**:<br />
**text**: *Faszystowskie sądy ach faszystowskie sądy*<br/>
**phrase** : *Ach faszystowskie sądy fałszywe*<br />
![schema 02](charts/schemes/HSD2.0_scheme02.png)<br />
Results: **MIN=-0.5 MEAN=0.25 MAX=0.5**

---
**EXAMPLE 3.**:<br />
**text**: *Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!*<br/>
**phrase**: *LGBT zniszczą nową rodziny tradycję mężczyzn i kobiet.*<br />
![schema 03](charts/schemes/HSD2.0_scheme03.png)<br />
Results: **MIN=0.5 MEAN=0.5 MAX=0.5**

In [3]:
text = 'Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji'
phrase = 'PiS gwałci żeby nie robić aborcji'

POC(text, phrase, stopwords=POLISH_STOPWORDS)

(1.0, 1.0, 1.0)

In [4]:
text = 'Faszystowskie sądy ach faszystowskie sądy'
phrase = 'Ach faszystowskie sądy fałszywe'

POC(text, phrase, stopwords=POLISH_STOPWORDS)

(-0.5, 0.25, 0.5)

In [5]:
text = 'Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!'
phrase = 'LGBT zniszczą nową rodziny tradycję mężczyzn i kobiet.'

POC(text, phrase, stopwords=POLISH_STOPWORDS)

(0.5, 0.5, 0.5)

## Loading data

In [6]:
df = load_lemmatized_tweets()
df.head(2)

Unnamed: 0,id,tweet,lemmatized
0,9,w czwartek muszę poprawić sądy i trybunały,w czwartek musieć poprawić sąd i trybunał
1,8,"Żale Nałęcza i riposta Macierewicza: Pan był w kompartii, czy ma prawo wy­gła­szać takie sądy? | niezalezna.pl",żale nałęcz i riposta macierewicz pan być w kompartia czy mieć prawo wyżgłaćszać taki sąd niezalezna.pl


In [7]:
lemm_phrases = load_lemm_phrases(load_vulg=True)



In [8]:
ext_phrases = load_ext_phrases(load_vulg=True)



**Calculate POC scores for all tweets.**

1. Load the relevant data: sanitized tweets and all hateful phrases.
2. For each tweet:
    1. For each hate type (and the vulgar class):
        1. Calculate POC scores (min, mean, max) for every phrase belonging to that hate type (or the vulgar class).
        2. Get the means of the minimum, mean and maximum POC scores.
        3. Write the results into a dictionary.
    2. Write the dictionary values of all hate types into one .csv row.
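
A minimal sketch of this loop, following the description literally, might look as follows. It is not the code of `analyse_POC`, which may aggregate or store the values differently; `class_scores`, `write_rows` and `toy_scorer` are hypothetical names, and the scorer is passed in as a parameter so that any function returning a `(min, mean, max)` triple can be plugged in. Only the column naming (`wyz_POC_min` and so on) is taken from the saved file.

```python
import csv
import io
from statistics import mean


def class_scores(tweet, phrases_by_class, scorer):
    """Average the (min, mean, max) scores of all phrases in each class."""
    row = {}
    for cls, phrases in phrases_by_class.items():
        triples = [scorer(tweet, phrase) for phrase in phrases]
        if not triples:
            triples = [(0.0, 0.0, 0.0)]  # assumption: empty class scores 0
        mins, means, maxes = zip(*triples)
        row[f"{cls}_POC_min"] = mean(mins)
        row[f"{cls}_POC_mean"] = mean(means)
        row[f"{cls}_POC_max"] = mean(maxes)
    return row


def write_rows(tweets, phrases_by_class, scorer, out):
    """Write one CSV row per (tweet_id, tweet) pair to the file-like `out`."""
    writer = None
    for tweet_id, tweet in tweets:
        row = {"id": tweet_id, **class_scores(tweet, phrases_by_class, scorer)}
        if writer is None:
            writer = csv.DictWriter(out, fieldnames=list(row))
            writer.writeheader()
        writer.writerow(row)


def toy_scorer(tweet, phrase):
    """Stand-in for a POC-like function: 1.0 on substring match, else 0.0."""
    return (1.0, 1.0, 1.0) if phrase in tweet else (0.0, 0.0, 0.0)


buffer = io.StringIO()
write_rows([(9, "a b c")], {"wyz": ["a b", "x"], "vulg": ["z"]}, toy_scorer, buffer)
print(buffer.getvalue())
```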

In [9]:
if not os.path.exists(POC_SCORES_PATH):
    analyse_POC(df, ext_phrases)

In [10]:
df_poc_scores = pd.read_csv(POC_SCORES_PATH)
df_poc_scores.head(2)

Unnamed: 0,id,wyz_POC_min,wyz_POC_mean,wyz_POC_max,groz_POC_min,groz_POC_mean,groz_POC_max,wyk_POC_min,wyk_POC_mean,wyk_POC_max,...,pon_POC_max,styg_POC_min,styg_POC_mean,styg_POC_max,szan_POC_min,szan_POC_mean,szan_POC_max,vulg_POC_min,vulg_POC_mean,vulg_POC_max
0,9,0.0,0.0,0.0,-0.5,-0.002193,0.5,0.0,0.0,0.0,...,0.5,-0.5,0.00026,0.5,0.0,0.0,0.0,0.0,0.0,0.0
1,8,-0.333333,0.004526,0.5,-0.5,0.000808,0.5,0.0,0.006219,0.333333,...,0.5,-0.5,-0.004606,0.333333,0.0,0.0,0.0,0.0,0.0,0.0


## Hateful phrases topics detection

**Find 20 topics of 20 words each for the phrases of each hate type (and the vulgar class).**

1. For each hate type:
    1. Get the relevant extended phrases.
    2. Fit a CountVectorizer and an LDA model.
    3. Save the trained model into a pickle archive.
    4. For each tweet:
        1. Calculate the POC scores for each of the 20 topics.
        2. Save the results into the .csv file.
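
Treating a topic as a phrase comes down to taking its highest-weighted words and joining them. The sketch below shows only that step on toy data; `topic_phrases`, the vocabulary and the weights are made up for illustration. With scikit-learn, the weight matrix would typically come from the fitted model's `components_` attribute and the vocabulary from `CountVectorizer.get_feature_names_out()`, but how `train_lda_models` and `analyse_topic_POC` actually do this is not shown here.

```python
def topic_phrases(topic_word_weights, vocabulary, n_words):
    """Return one phrase per topic, built from its `n_words` heaviest words."""
    phrases = []
    for weights in topic_word_weights:
        # Word indices sorted by descending weight within this topic.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        phrases.append(" ".join(vocabulary[i] for i in ranked[:n_words]))
    return phrases


# Toy data: 2 topics over a 4-word vocabulary (hypothetical values).
vocabulary = ["sąd", "prawo", "rząd", "partia"]
weights = [
    [0.1, 2.0, 0.5, 1.0],
    [3.0, 0.2, 0.1, 0.4],
]
print(topic_phrases(weights, vocabulary, n_words=2))
# ['prawo partia', 'sąd partia']
```

Each resulting phrase can then be scored against a tweet like any other phrase, and the per-topic scores averaged.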

In [11]:
LDA_N_TOPICS, LDA_N_WORDS = 20, 20

In [12]:
train_lda_models(ext_phrases, n_topics=LDA_N_TOPICS)


In [13]:
if not os.path.exists(TOPIC_POC_SCORES_PATH):
    analyse_topic_POC(df, n_words=LDA_N_WORDS)

In [14]:
df_topic_poc_scores = pd.read_csv(TOPIC_POC_SCORES_PATH)
df_topic_poc_scores.head(2)

Unnamed: 0,id,wyz_topic_POC_min,wyz_topic_POC_mean,wyz_topic_POC_max,groz_topic_POC_min,groz_topic_POC_mean,groz_topic_POC_max,wyk_topic_POC_min,wyk_topic_POC_mean,wyk_topic_POC_max,...,pon_topic_POC_max,styg_topic_POC_min,styg_topic_POC_mean,styg_topic_POC_max,szan_topic_POC_min,szan_topic_POC_mean,szan_topic_POC_max,vulg_topic_POC_min,vulg_topic_POC_mean,vulg_topic_POC_max
0,9,0.0,0.0,0.0,-0.052632,0.0,0.052632,0.0,0.0,0.0,...,0.052632,-0.052632,0.002632,0.052632,0.0,0.0,0.0,0.0,0.0,0.0
1,8,0.0,0.005263,0.052632,-0.052632,-0.010526,0.0,-0.052632,-0.002632,0.052632,...,0.0,-0.052632,-0.010526,0.052632,0.0,0.0,0.0,0.0,0.0,0.0


## Other text scores

### Polish Polyglot sentiment analysis

In [15]:
text_sentiment('Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji')

(0, 17, 0)

In [16]:
text_sentiment('Faszystowskie sądy ach faszystowskie sądy')

(0, 5, 0)

In [17]:
text_sentiment('Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!')

(2, 12, 0)

### Characters, syllables, words counting

In [18]:
text_numbers('Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji')

(84, 33, 16, 16)

In [19]:
text_numbers('Faszystowskie sądy ach faszystowskie sądy')

(37, 13, 5, 3)

In [20]:
text_numbers('Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!')

(74, 26, 13, 13)

In [21]:
text_numbers(lemmatize_text('Wróciły pisowskie trójki sądy doraźne koksowniki i SKOTy, do tego PiS gwałci żeby nie robić aborcji'))

(78, 29, 16, 16)

In [22]:
text_numbers(lemmatize_text('Faszystowskie sądy ach faszystowskie sądy'))

(33, 11, 5, 3)

In [23]:
text_numbers(lemmatize_text('Ci z LGBT chcą zniszczyć pojęcia tradycji rodziny tworzonej przez mężczyznę i kobietę!'))

(74, 25, 13, 13)

**Calculate the above scores for all tweets.**

1. Load the relevant data with sanitized tweets.
2. For each tweet:
    1. Remove characters that are invalid for polyglot and cause errors.
    2. Count how many words fall into each of the three sentiment types.
    3. Count characters, syllables, words and unique words.
    4. Write all values into one .csv row.
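
The counting step can be approximated without external libraries. The sketch below is not the `text_numbers` function from `src.utils.texts`: `count_syllables` and `text_counts` are hypothetical names, and the syllable count is a vowel-based heuristic (an "i" directly before another vowel is not counted, and every word counts as at least one syllable). It reproduces the results shown above for the second and third example sentences, but gives 34 instead of 33 syllables for the first one, so treat the syllable figure as approximate.

```python
POLISH_VOWELS = set("aąeęioóuy")


def count_syllables(word):
    """Heuristic syllable count: one per vowel, skipping a softening 'i'."""
    word = word.lower()
    count = 0
    for i, ch in enumerate(word):
        if ch not in POLISH_VOWELS:
            continue
        # "i" directly before another vowel only softens the consonant.
        if ch == "i" and i + 1 < len(word) and word[i + 1] in POLISH_VOWELS:
            continue
        count += 1
    return max(1, count)  # assumption: at least one syllable per word


def text_counts(text):
    """Return (characters, syllables, words, unique words) for a text."""
    words = text.split()
    n_chars = sum(len(w) for w in words)  # non-whitespace characters
    n_sylls = sum(count_syllables(w) for w in words)
    n_unique = len({w.lower() for w in words})
    return n_chars, n_sylls, len(words), n_unique


print(text_counts('Faszystowskie sądy ach faszystowskie sądy'))
# (37, 13, 5, 3)
```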

In [24]:
if not os.path.exists(OTHER_SCORES_PATH):
    analyse_other(df)

In [25]:
df_other_scores = pd.read_csv(OTHER_SCORES_PATH)
df_other_scores.head(2)

Unnamed: 0,id,s_neg,s_neu,s_pos,n_chars,n_sylls,n_words,nu_words,nl_chars,nl_sylls,nl_words,nlu_words
0,9,0,6,1,36,15,7,7,35,13,7,7
1,8,1,18,1,94,38,18,18,88,33,16,16
