In [1]:
import pandas as pd

from src.data.preprocess_text_helpers import (
    contractions_unpacker,
    punctuation_cleaner,
    remove_stopwords,
    tokenizer,
)

from src.data.preprocess_text_pipelines import (
    clean,
    tokenize,
    normalize,
)



So let's start by reading in the data. The data consists of comments and their labels: 1 if a comment is misogynistic and 0 if not.

In [2]:
df = pd.read_csv("../data/external/hatespeech/hatespeech_data_en.csv")
df['text'] = df['content']
df.head()['text']


0    @SwarajyaMag #GermanProfessor gives meaning to...
1    #MKR Annie's never cooked on a BBQ before. See...
2    RT @asredasmyhair: Feminists, take note. #FemF...
3    @ChrisMMcDougall  He asked for it.  Did you?  ...
4    RT @VILLEGOD I always really wish i had a girl...
Name: text, dtype: object

What we see when we look at social media text is that it is incredibly messy: spelling mistakes, grammar failures and emojis galore. How are we going to tidy up this diamond-in-the-rough dataset so we can better understand the sentiment of the tweets? We need to preprocess the data.

There are many steps you can take for this, and there is no single right answer; it is a case-by-case process. Nevertheless, Opt Out tries to make it easy to get through the boring preprocessing work and into the nitty-gritty text analytics. Let's see what we can do in Opt Out.


We can clean, tokenize or normalize the text. Let's start with tokenize.

In [3]:
tokenize(df).head()['tokenized']



0    @swarajyamag #germanprofessor gives meaning to...
1    #mkr annie ' s never cooked on a bbq before . ...
2    rt @asredasmyhair : feminists , take note . #f...
3    @chrismmcdougall he asked for it . did you ? w...
4    rt @villegod i always really wish i had a girl...
Name: tokenized, dtype: object

So what can you see that's different? Is it lowercase? Is the punctuation still there? Yes and yes. So what's actually involved in the cleaning process? Let's take an average trolly sentence:

In [4]:
troll_speak = "RT @baum_erik: Lol I'm not surprised these 2 accounts blocked me @femfreq #FemiNazi#Gamergate &amp; @MomsAgainstWWE #ParanoidParent "

Messy and gross. How are we going to understand more about this sentence?

In [5]:
contractions_unpacker(troll_speak) 


'RT @baum_erik: Lol I am not surprised these 2 accounts blocked me @femfreq #FemiNazi#Gamergate &amp; @MomsAgainstWWE #ParanoidParent '

See how I'm has become I am? That's called unpacking a contraction, and there are loads of them in the English language.
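
To make the idea concrete, here is a minimal sketch of contraction unpacking. It is not the actual `contractions_unpacker` implementation; the lookup table `CONTRACTIONS` and the function name `unpack_contractions_sketch` are hypothetical and only cover a handful of cases.

```python
import re

# Hypothetical, deliberately tiny lookup table; a real one would be far longer.
CONTRACTIONS = {
    "i'm": "I am",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "you're": "you are",
}

# One alternation of all keys, matched case-insensitively on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def unpack_contractions_sketch(text):
    """Replace each known contraction with its expanded form."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)


print(unpack_contractions_sketch("Lol I'm not surprised"))
# prints: Lol I am not surprised
```

Note that this sketch ignores the original capitalisation of the contraction (for example, "It's" becomes "it is"), which a fuller implementation would probably want to preserve.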

In [6]:
tokenizer(troll_speak)


"RT @baum_erik : Lol I ' m not surprised these 2 accounts blocked me @femfreq #FemiNazi #Gamergate & @MomsAgainstWWE #ParanoidParent"

OK, this doesn't seem particularly interesting, but think about it. How would you normally split text up? By whitespace? Then we'd lose the joined hashtags. Our tokenizer handles that beautifully. Now let's compare the results of tokenize to clean. Can you tell the difference?
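
The difference from a plain whitespace split can be sketched with a single regular expression. This is an illustration under assumptions, not the project's `tokenizer`: the pattern `TOKEN_PATTERN` and the function `tokenize_sketch` are made up for this example.

```python
import html
import re

# Order matters: mentions and hashtags are tried before plain words, so
# "#FemiNazi#Gamergate" yields two hashtag tokens instead of one blob.
TOKEN_PATTERN = re.compile(
    r"@\w+"        # mentions such as @femfreq
    r"|#\w+"       # hashtags, even when glued together
    r"|\w+"        # words and numbers
    r"|[^\w\s]"    # any single punctuation character
)


def tokenize_sketch(text):
    """Decode HTML entities such as &amp;, then split into tokens."""
    return TOKEN_PATTERN.findall(html.unescape(text))


sample = "blocked me @femfreq #FemiNazi#Gamergate &amp; I'm"
print(sample.split())
# the whitespace split keeps '#FemiNazi#Gamergate' as a single token
print(" ".join(tokenize_sketch(sample)))
# prints: blocked me @femfreq #FemiNazi #Gamergate & I ' m
```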

In [7]:
clean(df).head()['cleaned']

0    @swarajyamag #germanprofessor gives meaning te...
1                #mkr annie never cooked bbq see alien
2    rt @asredasmyhair feminists take note #femfree...
3    @chrismmcdougall he asked did ? why find emplo...
4    rt @villegod i always really wish girlfriend h...
Name: cleaned, dtype: object

What's the difference? Yup, punctuation has been removed, along with something called stopwords. Stopwords are very common words, like "and" or "with". They matter for grammar, but they carry little signal for modeling. The extra steps we take to get here are shown below.
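
As a rough sketch of those two extra steps, the snippet below drops punctuation-only tokens and stopwords from an already tokenized sentence. The stopword set `STOPWORDS` and the function `clean_tokens_sketch` are hypothetical; the list Opt Out actually uses is not shown in this notebook and will differ.

```python
import string

# Hypothetical stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "on", "with", "for", "it", "before", "s"}


def clean_tokens_sketch(tokens):
    """Lowercase tokens, then drop pure-punctuation tokens and stopwords."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        if all(char in string.punctuation for char in lowered):
            continue  # token consists only of punctuation (or is empty)
        if lowered in STOPWORDS:
            continue  # token is on the stopword list
        kept.append(lowered)
    return kept


tokens = ["#MKR", "Annie", "'", "s", "never", "cooked", "on", "a", "BBQ", "before", "."]
print(" ".join(clean_tokens_sketch(tokens)))
# prints: #mkr annie never cooked bbq
```

Hashtags survive because only tokens made up entirely of punctuation are dropped, so the leading "#" is left alone.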

In [8]:
punctuation_cleaner(troll_speak)

"RT @baum_erik: Lol I'm not surprised these 2 accounts blocked me @femfreq #FemiNazi#Gamergate &amp; @MomsAgainstWWE #ParanoidParent "


In [None]:
remove_stopwords(troll_speak)


In [9]:
normalize(df).head()['normalized'] # suppress output

Reading english - 1grams ...
You can't omit/backoff and unpack hashtags!
 unpack_hashtags will be set to False




0    <user> <hashtag> gives meaning term feminazi h...
1           <hashtag> annie never cooked bbq see alien
2    rt <user> feminists take note <hashtag> <hasht...
3    <user> he asked did ? why find employer value ...
4    rt <user> i always really wish girlfriend hung...
Name: normalized, dtype: object

This one is a little more involved, but it produces cooler results. I love this method. What it does is normalize the text: we don't really care about the specific URLs, hashtags etc. in the text, we care about how many of them there are per tweet. This method lets us ignore the specifics while still keeping track of their presence.
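
A minimal sketch of that idea: swap every mention, hashtag and URL for a generic placeholder and count how many of each were replaced. The rule list `RULES` and the function `normalize_sketch` are assumptions made for illustration and are far simpler than whatever `normalize` does internally.

```python
import re
from collections import Counter

# Each (placeholder, pattern) pair is applied in order; URLs go first so that
# a "#" fragment inside a link is not mistaken for a hashtag.
RULES = [
    ("<url>", re.compile(r"https?://\S+")),
    ("<user>", re.compile(r"@\w+")),
    ("<hashtag>", re.compile(r"#\w+")),
]


def normalize_sketch(text):
    """Return the lowercased text with placeholders, plus per-type counts."""
    counts = Counter()
    for placeholder, pattern in RULES:
        text, replaced = pattern.subn(placeholder, text)
        counts[placeholder] = replaced
    return text.lower(), counts


normalized, counts = normalize_sketch("RT @baum_erik: blocked me @femfreq #Gamergate")
print(normalized)
# prints: rt <user>: blocked me <user> <hashtag>
print(counts["<user>"], counts["<hashtag>"], counts["<url>"])
# prints: 2 1 0
```

The counts could then be added to the DataFrame as extra feature columns, while the placeholders keep the vocabulary small.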