# Feature Extraction

In this notebook we demonstrate how to encode features into a machine-readable representation (i.e. numeric vectors).

We first build vocabularies based on word frequencies, character frequencies and part-of-speech tags. We then apply pre-processing to our dataset and map the examples to vectors.

In [1]:
import numpy as np
import pandas as pd
import spacy
from collections import Counter
# import the pre-processing functions
from processing import text as text_prepro

In [4]:
# read datasets
toxic = pd.read_csv('datasets/toxic_comments.csv', encoding='utf-8')
non_toxic = pd.read_csv('datasets/non_toxic_comments.csv', encoding='utf-8')

In [5]:
# list of raw comments
comments = np.concatenate([non_toxic['comment'].unique(), toxic['comment'].unique()])

In [6]:
comments.shape

(159436,)

In [7]:
comments[0]

u"This: :One can make an analogy in mathematical terms by envisioning the distribution of opinions in a population as a Gaussian curve. We would then say that the consensus would be a statement that represents the range of opinions within perhaps three standard deviations of the mean opinion.  sounds arbitrary and ad hoc.  Does it really belong in n encyclopedia article?  I don't see that it adds anything useful.  The paragraph that follows seems much more useful.  Are there any political theorists out there who can clarify the issues?  It seems to me that this is an issue that Locke, Rousseau, de Toqueville, and others must have debated...  SR "

## Semantic Vector

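Build the word-frequency vocabulary and a function that applies pre-processing to a given text and maps it onto the semantic vocabulary to produce the semantic vector. The vector is binary: an entry is 1 if the corresponding vocabulary word occurs in the text and 0 otherwise.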

In [36]:
# pre-processing pipeline
pipeline = [
    text_prepro.to_lower,
    text_prepro.transliterate,
    text_prepro.remove_tags,
    text_prepro.tokenize_url,
    text_prepro.alphanum
]

In [3]:
text = u'Hello @foobar, VISIT my [site](http://foo.bar)'
print text
for i, pipe in enumerate(pipeline, 1):
    text = pipe(text)
    print u'[{}]: {}'.format(i, text)
print

Hello @foobar, VISIT my [site](http://foo.bar)
[1]: hello @foobar, visit my [site](http://foo.bar)
[2]: hello @foobar, visit my [site](http://foo.bar)
[3]: hello foobar, visit my [site](http://foo.bar)
[4]: hello foobar, visit my [site](__URL__)
[5]: hello foobar visit my site __URL__
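
The `processing.text` module itself is not shown in this notebook. As a rough illustration only, the sketch below defines hypothetical stand-ins for the five steps. The function bodies and regular expressions are assumptions, not the module's actual code; they are chosen so that the example string goes through the same stages as in the trace above.

```python
import re
import unicodedata


def to_lower(text):
    # step 1: lowercase everything
    return text.lower()


def transliterate(text):
    # step 2 (assumed behaviour): strip combining accents, e.g. 'é' -> 'e'
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(c for c in decomposed if not unicodedata.combining(c))


def remove_tags(text):
    # step 3 (assumed behaviour): drop the '@' of user mentions, keep the name
    return re.sub(r'@(\w+)', r'\1', text)


def tokenize_url(text):
    # step 4 (assumed behaviour): replace each URL by one placeholder token
    return re.sub(r'https?://[^\s)\]]+', '__URL__', text)


def alphanum(text):
    # step 5 (assumed behaviour): turn punctuation into spaces, collapse whitespace
    return u' '.join(re.sub(r'[^\w\s]', u' ', text, flags=re.UNICODE).split())


# hypothetical name, to avoid clashing with the notebook's own `pipeline`
sketch_pipeline = [to_lower, transliterate, remove_tags, tokenize_url, alphanum]


def run_pipeline(text, steps=sketch_pipeline):
    # apply each step in order, feeding the output of one into the next
    for step in steps:
        text = step(text)
    return text


result = run_pipeline(u'Hello @foobar, VISIT my [site](http://foo.bar)')
# result -> 'hello foobar visit my site __URL__'
```

The order matters in this sketch: the URL is replaced before punctuation is stripped, otherwise `http://foo.bar` would be split into separate tokens.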



In [8]:
# extract word counts
word_counts = Counter()
for comment in comments:
    for pipe in pipeline:
        comment = pipe(comment)
    word_counts.update(comment.split())

In [9]:
print len(word_counts)

180929


In [10]:
# top 10 words
word_counts.most_common(10)

[(u'the', 498552),
 (u'to', 298646),
 (u'i', 241659),
 (u'of', 225486),
 (u'and', 225296),
 (u'you', 219647),
 (u'a', 216811),
 (u'is', 177325),
 (u'that', 161825),
 (u'it', 149322)]

In [37]:
# bottom 10 words
word_counts.most_common()[-10:]

[(u'branco', 1),
 (u'\u65b0\u64b0\u59d3\u6c0f\u9332', 1),
 (u'accoutns', 1),
 (u'queensborough', 1),
 (u'commagene', 1),
 (u'personal_attacks_', 1),
 (u'psone', 1),
 (u'classsssssssss', 1),
 (u'morihiro', 1),
 (u'downstep', 1)]

In [12]:
# select words with more than 1 occurrence
select = {k: v for k, v in word_counts.iteritems() if v > 1}

In [13]:
len(select)

87334

In [15]:
# assign unique indexes to each word
sorted_words = sorted(select.iteritems(), key=lambda (k, v): (v, k), reverse=True)
word_indexes = {k: i for i, (k, _) in enumerate(sorted_words)}

In [16]:
word_indexes['is']

7

In [17]:
word_indexes['damn']

1173
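
To make the counting, filtering and indexing steps easier to follow, here is a minimal sketch on a toy corpus. The corpus and the `toy_*` names are made up for illustration, and the code avoids `iteritems` and tuple-unpacking lambdas so that it also runs on Python 3.

```python
from collections import Counter

# toy corpus standing in for the pre-processed comments
toy_comments = [
    'the cat sat on the mat',
    'the dog sat',
    'a cat and a dog',
]

# count every whitespace-separated token
toy_counts = Counter()
for comment in toy_comments:
    toy_counts.update(comment.split())

# keep only the words seen more than once
toy_select = {w: n for w, n in toy_counts.items() if n > 1}

# sort by (count, word) in descending order, then number the words
toy_sorted = sorted(toy_select.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
toy_indexes = {w: i for i, (w, _) in enumerate(toy_sorted)}
# toy_indexes -> {'the': 0, 'sat': 1, 'dog': 2, 'cat': 3, 'a': 4}
```

Because the sort key is `(count, word)` with `reverse=True`, frequent words receive small indexes and ties are broken in reverse alphabetical order.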

In [38]:
# semantic vector mapper
def semantic_vector(text):
    for pipe in pipeline:
        text = pipe(text)
    vector = np.zeros((len(word_indexes),), dtype=np.float32)
    for w in text.split():
        ind = word_indexes.get(w)
        if ind is not None:
            vector[ind] = 1.
    return vector

In [44]:
# example text
comments[30]

u" ::::That's reasonable enough; I just saw the conflict on this one page and not on the rest.  I don't know details of it either, and if I had noticed the rest I probably would have been suspicious also (like Mav, below).  Also like mav, though, I probably would have noted why I reverted any changes on the article's talk page.  Best,   "

In [52]:
# generate vector
vector = semantic_vector(comments[30])

In [53]:
vector.shape

(87334,)

In [54]:
vector[word_indexes['enough']]

1.0

In [55]:
vector.sum()

44.0
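
The sum above counts the distinct vocabulary words found in the comment, not the total number of tokens. A small sketch with a hypothetical five-word vocabulary (`toy_vocab` and `binary_vector` are illustrative names, and pre-processing is left out for brevity) shows why:

```python
import numpy as np

# hypothetical toy vocabulary: word -> position in the vector
toy_vocab = {'the': 0, 'sat': 1, 'dog': 2, 'cat': 3, 'a': 4}


def binary_vector(text, vocab):
    # one slot per vocabulary word, set to 1 if the word occurs at least once
    vector = np.zeros((len(vocab),), dtype=np.float32)
    for word in text.split():
        index = vocab.get(word)
        if index is not None:  # out-of-vocabulary words are skipped
            vector[index] = 1.
    return vector


toy_vector = binary_vector('the cat chased the other cat', toy_vocab)
# toy_vector -> [1, 0, 0, 1, 0]; its sum is 2, the number of distinct known words
```

Repeated words ("the", "cat") and unknown words ("chased", "other") do not change the result, which is what makes the representation a presence indicator rather than a count.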

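## Letter Vector

We build a list of character 1-grams from lowercase letters (a-z), digits (0-9), punctuation and the space character, then map the characters of a given text onto this list to produce the letter vector. The mapper does not lowercase its input, so uppercase letters in the text are ignored.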

In [58]:
import string
import itertools

# character 1-grams
chars = list(set((string.letters + string.digits + string.punctuation).lower() + ' '))
char_indexes = {c: i for i, c in enumerate(sorted(chars))}

In [59]:
char_indexes

{' ': 0,
 '!': 1,
 '"': 2,
 '#': 3,
 '$': 4,
 '%': 5,
 '&': 6,
 "'": 7,
 '(': 8,
 ')': 9,
 '*': 10,
 '+': 11,
 ',': 12,
 '-': 13,
 '.': 14,
 '/': 15,
 '0': 16,
 '1': 17,
 '2': 18,
 '3': 19,
 '4': 20,
 '5': 21,
 '6': 22,
 '7': 23,
 '8': 24,
 '9': 25,
 ':': 26,
 ';': 27,
 '<': 28,
 '=': 29,
 '>': 30,
 '?': 31,
 '@': 32,
 '[': 33,
 '\\': 34,
 ']': 35,
 '^': 36,
 '_': 37,
 '`': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64,
 '{': 65,
 '|': 66,
 '}': 67,
 '~': 68}

In [60]:
# character mapper
def letter_vector(text):
    vector = np.zeros((len(char_indexes),), dtype=np.float32)
    for c in text:
        i = char_indexes.get(c)
        if i is not None:
            vector[i] = 1.
    return vector

In [62]:
vector_letter = letter_vector(comments[30])

In [63]:
len(vector_letter)

69

In [64]:
vector_letter

array([ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.], dtype=float32)
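
As a variation on the cells above, the following sketch builds the same 69-character table in a way that also runs on Python 3, and lowercases the text before mapping it. The names `alphabet_indexes` and `letter_vector_lower` are hypothetical, and the lowercasing is an assumption about what may be desired, not what the notebook's `letter_vector` does.

```python
import string

import numpy as np

# string.letters exists only in Python 2; ascii_lowercase is available in both
alphabet = sorted(set(string.ascii_lowercase + string.digits + string.punctuation + ' '))
alphabet_indexes = {c: i for i, c in enumerate(alphabet)}


def letter_vector_lower(text):
    # hypothetical variant: lowercase first so that 'A' and 'a' share a slot
    vector = np.zeros((len(alphabet_indexes),), dtype=np.float32)
    for c in text.lower():
        i = alphabet_indexes.get(c)
        if i is not None:  # characters outside the table are skipped
            vector[i] = 1.
    return vector


example = letter_vector_lower('Hi!')
# len(alphabet_indexes) -> 69 (26 letters + 10 digits + 32 punctuation marks + space)
# example.sum() -> 3.0 ('h', 'i' and '!')
```

Without the call to `lower()`, the `'H'` in the example would be dropped, because only lowercase letters have a slot in the table.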