# Preprocessing Text for Hardcore Text Analytics in Opt Out

#### There are a million ways to skin a cat (poor cat), and preparing text data for analysis is no different. In this notebook, we walk through the steps that help you get as much information as possible out of your text.

Let's start by reading in the data. We try to be consistent at Opt Out. The data consists of comments in the column `text` and their labels in the column `label`. A label of 1 means the comment is misogynistic; 0 means it is not.

In [1]:
import pandas as pd
from src.data.preprocess_text_helpers import (
    contractions_unpacker,
    punctuation_cleaner,
    remove_stopwords,
    tokenizer,
)

from src.data.preprocess_text_pipelines import (
    clean,
    tokenize,
    normalize,
)


from src.data.retrieve_data_from_s3_bucket import download_dataset

download_dataset("../data/processed/stanford.csv")
df = pd.read_csv("../data/processed/stanford.csv")
df.head()


| Unnamed: 0 | text | label |
|---|---|---|
| 0 | The new Doras cute af | 0 |
| 1 | @minniemonikive well | 0 |
| 2 | Rolou o skank | 1 |
| 3 | @AOC https//t.co/lbNOwMK1p2 | 1 |
| 4 | @tangletorn We will be killed by a snake 3 | 0 |


Nice! We like consistency.
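
If you would rather verify that consistency than take our word for it, a check along these lines could help. This is only a minimal sketch using the standard library: `check_schema` and the inline sample are hypothetical and are not part of the Opt Out codebase.

```python
import csv
import io

# Hypothetical sample mirroring the 'text' / 'label' layout described above.
SAMPLE_CSV = """text,label
The new Doras cute af,0
Rolou o skank,1
"""


def check_schema(rows):
    """Return a list of problems found; an empty list means the rows conform."""
    problems = []
    for i, row in enumerate(rows):
        if "text" not in row or "label" not in row:
            problems.append(f"row {i}: missing 'text' or 'label' column")
            continue
        # csv gives us strings, so the allowed labels are "0" and "1".
        if row["label"] not in {"0", "1"}:
            problems.append(f"row {i}: label must be 0 or 1, got {row['label']!r}")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(check_schema(rows))  # an empty list means every row conforms
```

The same idea carries over to a pandas DataFrame by checking the column names and the set of unique label values.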


Sadly, social media data is anything but consistent. For example:

In [2]:
df.loc[13, 'text']

'@ATX_fight_club @AOC JFK was a clandestine austerity democratic. What the fk happened?!'

It's incredibly **messy**. Spelling mistakes, grammatical errors, and emojis, hashtags and URLs all make the content of the text hard to understand.



So how are we going to tidy this up? Can we polish our diamond in the rough to better understand the text of the tweets? 

**Well I'm glad you asked...** 


There are many steps you can take. There is no single right answer; it depends on the case. Nevertheless, Opt Out tries to make it easy to get through the boring preprocessing work and into the nitty-gritty text analytics. Let's see what we can do in Opt Out.


We can either clean, tokenize or normalize the text. Let's start with tokenize. 

### Tokenization
Tokenization is the act of breaking up a string into pieces such as words, keywords, phrases, symbols and other elements, called tokens.

In [3]:
tokenize(df).head()['tokenized']


0                         the new doras cute af
1                          @minniemonikive well
2                                 rolou o skank
3            @aoc https / / t . co / lbnowmk1p2
4    @tangletorn we will be killed by a snake 3
Name: tokenized, dtype: object

So what can you see that's different? Lowercase? Is the punctuation still there? What's going on in this tokenization step? Let's take an average trolly sentence and we'll walk you through each step of our tokenization pipeline.

In [4]:
troll_speak = "RT @baum_erik: Lol I'm not surprised these 2 accounts blocked me @femfreq #FemiNazi#Gamergate &amp; @MomsAgainstWWE #ParanoidParent "

Messy, gross. Let's start by unpacking the contraction `I'm`.

In [5]:
contractions_unpacker(troll_speak) 


'RT @baum_erik: Lol I am not surprised these 2 accounts blocked me @femfreq #FemiNazi#Gamergate &amp; @MomsAgainstWWE #ParanoidParent '
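
One common way to get this kind of result is a lookup table of contractions applied with a regular expression. The sketch below illustrates that idea only; it is not the implementation of `contractions_unpacker`, and the `CONTRACTIONS` table and `unpack_contractions` name are hypothetical.

```python
import re

# Hypothetical, deliberately tiny lookup table; a real list would be much longer.
CONTRACTIONS = {
    "i'm": "I am",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
}

# One alternation over all known contractions, matched on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def unpack_contractions(text):
    """Replace each known contraction with its expanded form.

    The expansion comes straight from the table, so the original
    capitalisation of the contraction is not preserved.
    """
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(unpack_contractions("Lol I'm not surprised"))
```

Anything not in the table is left untouched, so the quality of the result depends entirely on how complete the table is.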

See how `I'm` is now `I am`? That's called unpacking a contraction. Now we break our sentence up into tokens. Watch the `#FemiNazi#Gamergate` hashtags near the end of the sentence.

In [6]:
tokenizer(troll_speak)


"RT @baum_erik : Lol I ' m not surprised these 2 accounts blocked me @femfreq #FemiNazi #Gamergate & @MomsAgainstWWE #ParanoidParent"

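OK, this doesn't seem particularly interesting, but think about it: how would you normally split text up? Splitting on whitespace alone would leave `#FemiNazi#Gamergate` glued together. We use tooling built for social media tokenization, which breaks the text into chunks that are meaningful for social media data, such as hashtags and emojis. (We passed in the original `troll_speak`, not the unpacked version, which is why `I'm` shows up as `I ' m`.)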
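
To make the idea concrete, here is a minimal regex-based sketch of social-media-aware tokenization. It is an illustration under our own assumptions, not the tokenizer Opt Out actually uses; `TOKEN_PATTERN` and `social_tokenize` are hypothetical names.

```python
import html
import re

TOKEN_PATTERN = re.compile(
    r"https?://\S+"   # URLs stay in one piece
    r"|[@#]\w+"       # @mentions and #hashtags stay in one piece
    r"|\w+"           # ordinary words and numbers
    r"|[^\w\s]"       # any other single symbol (punctuation, emoji, ...)
)


def social_tokenize(text):
    """Split text into tokens, keeping mentions, hashtags and URLs intact."""
    # Decode HTML entities first, e.g. '&amp;' becomes '&'.
    text = html.unescape(text)
    return TOKEN_PATTERN.findall(text)


tokens = social_tokenize("blocked me @femfreq #FemiNazi#Gamergate &amp; done")
print(" ".join(tokens))
```

Because a hashtag is matched as `#` followed by word characters, two hashtags written without a space between them come out as two separate tokens, which plain whitespace splitting would miss.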

### Cleaning
So now we can break the text up into pieces, but let's remove some rubbish.


In [7]:
clean(df).head()['cleaned']

0             the new doras cute af
1              @minniemonikive well
2                       rolou skank
3    @aoc https / / co / lbnowmk1p2
4     @tangletorn we killed snake 3
Name: cleaned, dtype: object

What's the difference? Yup, punctuation and the removal of something called stopwords. Stopwords are very common words, such as `and` or `with`. They matter when you read a sentence, but they carry little signal for modeling. The extra steps we take to get here are shown below.

In [8]:
punctuation_cleaner(troll_speak)

"RT @baum_erik: Lol I'm not surprised these 2 accounts blocked me @femfreq #FemiNazi#Gamergate &amp; @MomsAgainstWWE #ParanoidParent "

The cleaner leaves this sentence unchanged.

In [9]:
remove_stopwords(troll_speak)

"RT @baum_erik: Lol I'm surprised 2 accounts blocked @femfreq #FemiNazi#Gamergate &amp; @MomsAgainstWWE #ParanoidParent "

Words like `not`, `these` and `me` are removed, which helps us focus on the interesting words.
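
Stopword removal can be sketched as a simple filter against a word list. The list below is a hypothetical handful chosen for illustration, and `drop_stopwords` is not the `remove_stopwords` helper used above.

```python
# Hypothetical stopword list for illustration only; real lists are much longer.
STOPWORDS = {"a", "and", "me", "not", "the", "these", "with"}


def drop_stopwords(text):
    """Remove whitespace-separated words found in STOPWORDS (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)


print(drop_stopwords("Lol I'm not surprised these 2 accounts blocked me @femfreq"))
```

Note that dropping a negation such as `not` changes what the sentence says, so whether negations belong in the stopword list is a choice that depends on the task.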

### Normalization
Finally, my favourite step. We don't care about each individual user or hashtag; quite often all we care about is how densely they occur.


In [10]:
normalize(df).head()['normalized'] 

Reading english - 1grams ...
You can't omit/backoff and unpack hashtags!
 unpack_hashtags will be set to False



0               the new doras cute af
1                         <user> well
2                         rolou skank
3    <user> https / / co / lbnowmk1p2
4            <user> we killed snake 3
Name: normalized, dtype: object

This one is a little more involved, but it produces cooler results. I love this method. What it does is normalize the text: we don't really care about the specific URLs, hashtags and so on in the text, we care about how many of them there are per tweet. This method lets us ignore the specifics while still keeping track of their presence.
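
The replace-and-count idea can be sketched with a few regular expressions. This is a simplified illustration, not the `normalize` pipeline itself: the `<user>` placeholder matches the output above, while `<url>`, `<hashtag>` and the name `normalize_text` are our own assumptions.

```python
import re
from collections import Counter

# Order matters: URLs are replaced first so that a '#' or '@' inside a URL
# is not mistaken for a hashtag or a mention afterwards.
PLACEHOLDERS = [
    ("<url>", re.compile(r"https?://\S+")),
    ("<user>", re.compile(r"@\w+")),
    ("<hashtag>", re.compile(r"#\w+")),
]


def normalize_text(text):
    """Swap URLs, mentions and hashtags for placeholders and count each kind."""
    counts = Counter()
    for tag, pattern in PLACEHOLDERS:
        # subn returns the new string and the number of replacements made.
        text, n = pattern.subn(tag, text)
        counts[tag] = n
    return text, counts


normalized, counts = normalize_text(
    "@tangletorn we killed snake 3 #scary #run https://t.co/x"
)
print(normalized)
print(dict(counts))
```

The counts give the per-tweet density of each element, while the placeholders keep their position in the text without letting individual usernames or links inflate the vocabulary.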