# Text Cleaning

As in other domains, parsing, cleaning, and deriving reasonable features from the data is usually the most time-consuming task.

In this section we take the manual approach. We will develop hand-crafted rules to clean our text data.

In [1]:
import pandas as pd

The data we are going to use is a "trolling" dataset.

Be warned, some of the data is quite offensive!

In [2]:
trolls = pd.read_csv("data/trolls.csv")
X, y = (trolls["Comment"], trolls["Insult"])

In [3]:
print("%d%% of the examples are positive" % (100 * y.sum()/len(y)))

26% of the examples are positive


Note that the classes are imbalanced: only about a quarter of the comments are labelled as insults.
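
One quick way to inspect the class balance is to look at the normalised label counts, which also give the accuracy of a classifier that always predicts the majority class. The sketch below uses a small hypothetical label vector, `y_demo`, standing in for `y`; its values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical labels standing in for trolls["Insult"] (1 = insult, 0 = not an insult)
y_demo = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="Insult")

# Fraction of examples in each class
balance = y_demo.value_counts(normalize=True)
print(balance)

# Accuracy obtained by always predicting the most common class
majority_baseline = balance.max()
print("Majority-class baseline accuracy: %.2f" % majority_baseline)
```

Any model trained on imbalanced labels should at least beat this baseline, so it is worth keeping the number in mind when evaluating later.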

## Cleaning

Now, let's get on with the cleaning.

There are a few libraries out there to help you work with text. Many of them include built-in tools that will automatically clean and tag text (e.g. `nltk`).

But there's nothing better than writing your own and seeing the results.

Let's create some cleaners.

In [4]:

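def clean(X):
    # Lowercase everything
    X = X.str.lower()

    # Collapse doubled backslashes into a single backslash
    X = X.str.replace(r'\\\\', r'\\', regex=True)

    # TODO: remove all unnecessary punctuation (not implemented yet)

    # Interpret literal escape sequences (e.g. \xa0), skipping malformed ones,
    # then ditch every remaining non-ASCII character
    X = (X.str.encode("ascii", "ignore")
          .str.decode("unicode_escape", errors="ignore")
          .str.encode("ascii", "ignore")
          .str.decode("ascii"))

    # Expand contractions
    X = X.str.replace("won't", "will not", regex=False)

    # Create tokens of interest. regex=True is explicit because recent pandas
    # versions treat the pattern as a literal string by default.
    # "Big" variants run before the shorter patterns they contain, and emails
    # run before the @-mention rule, which would otherwise break them up.
    X = X.str.replace(r"([#%&\*\$]{2,})(\w*)", "_SW", regex=True) # Swearword obfuscations
    X = X.str.replace(r" [8x;:=]-?(?:\)|\}|\]|>){2,}", " _BS", regex=True) # Big smileys
    X = X.str.replace(r" (?:[;:=]-?[\)\}\]d>])|(?:<3)", " _S", regex=True) # Smileys
    X = X.str.replace(r" [x:=]-?(?:\(|\[|\||\\|/|\{|<){2,}", " _BF", regex=True) # Big sad faces
    X = X.str.replace(r" [x:=]-?[\(\[\|\\/\{<]", " _F", regex=True) # Sad faces
    X = X.str.replace(r"\w+:\/\/\S+", "_U", regex=True) # URL
    X = X.str.replace(r"[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}", "_EM", regex=True) # Email
    X = X.str.replace(r"(@[a-z]+)", "_AT", regex=True) # Directed at someone

    return X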

X = clean(X)

In [5]:
print(X[[1, 4, 5, 19, 21, 113, 174, 183]])

1      "i really don't understand your point. it seem...
4      "cc bn xung ng biu tnh 2011 c n ho khng ? \ncc...
5      "_AT ok, but i would hope they'd sign him to a...
19         "your a retard go post your head up your _SW"
21                                                   "_U
113    "political correctness:\n\npolitical correctne...
174    "gallup daily\nmay 24-26, 2012  updates daily ...
183    "_AT \n\nyou are right concerning campbell.  m...
Name: Comment, dtype: object
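
Because each rule is just a regular expression, it can help to try a pattern on a few hand-written strings before running it over the whole column. The sketch below applies the swearword and URL patterns from `clean` with Python's `re.sub`; the sample strings are made up for illustration and do not come from the dataset.

```python
import re

# Patterns copied from the swearword and URL rules in clean()
SWEAR = r"([#%&\*\$]{2,})(\w*)"
URL = r"\w+:\/\/\S+"

# Made-up, already-lowercased sample strings (not from the dataset)
samples = [
    "you are a total $#%& idiot",
    "read this: http://example.com/page now",
    "nothing to replace here",
]

cleaned = []
for text in samples:
    text = re.sub(SWEAR, "_SW", text)  # obfuscated swearwords -> _SW
    text = re.sub(URL, "_U", text)     # URLs -> _U
    cleaned.append(text)
    print(text)
```

Checking a rule in isolation like this makes it easier to spot patterns that match too much or too little before they affect thousands of comments.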


## Tasks

- Add some more cleaners
- Investigate the data further
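
As a possible starting point for the first task, here is a minimal sketch of an extra cleaner that expands several contractions from a lookup table instead of one `replace` call per contraction. The `CONTRACTIONS` table and the `expand_contractions` helper are hypothetical names introduced for this example, and the entries are only a small illustrative selection.

```python
import re

# Hypothetical lookup table; extend it with whatever contractions appear in the data
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
    "i'm": "i am",
}

# One alternation built from the table keys, matched on word boundaries
_pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")

def expand_contractions(text):
    """Replace each known contraction in already-lowercased text."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)

print(expand_contractions("i'm sure you don't know and won't ask"))
```

To run it over the whole column, something like `X.map(expand_contractions)` should work, assuming `X` has already been lowercased and contains no missing values.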