# Feature Extraction

In this notebook we demonstrate how to encode text comments as machine-readable features (i.e. numeric vectors).

We first build vocabularies based on word frequencies and character bigrams, then pre-process the dataset and map each example onto these vocabularies to produce feature vectors.

In [1]:
import numpy as np
import pandas as pd
from collections import Counter
# import the pre-processing functions
from processing import text as text_prepro

In [2]:
# read the comments dataset
comments = pd.read_csv('datasets/labeled_comments.csv', encoding='utf-8')

In [3]:
comments.shape

(159686, 8)

In [4]:
comments['comment'][0]

u"This:NEWLINE_TOKEN:One can make an analogy in mathematical terms by envisioning the distribution of opinions in a population as a Gaussian curve. We would then say that the consensus would be a statement that represents the range of opinions within perhaps three standard deviations of the mean opinion. NEWLINE_TOKENsounds arbitrary and ad hoc.  Does it really belong in n encyclopedia article?  I don't see that it adds anything useful.NEWLINE_TOKENNEWLINE_TOKENThe paragraph that follows seems much more useful.  Are there any political theorists out there who can clarify the issues?  It seems to me that this is an issue that Locke, Rousseau, de Toqueville, and others must have debated...  SRNEWLINE_TOKEN"

In [5]:
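# replace the special NEWLINE_TOKEN / TAB_TOKEN markers with a space
# (the pattern is a regex; recent pandas versions need regex=True for this)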
comments['comment'] = comments['comment'].str.replace(u'NEWLINE_TOKEN|TAB_TOKEN', u' ')
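
The same substitution can be tried on a single string with the standard `re` module. This is only a minimal sketch; `strip_special_tokens` is a hypothetical helper name and is not part of the notebook.

```python
import re

# Pattern matching either of the two placeholder tokens used in the dataset.
SPECIAL_TOKENS = re.compile(r'NEWLINE_TOKEN|TAB_TOKEN')


def strip_special_tokens(text):
    """Replace each special token with a single space (hypothetical helper)."""
    return SPECIAL_TOKENS.sub(' ', text)


sample = 'This:NEWLINE_TOKEN:One can make an analogy'
cleaned = strip_special_tokens(sample)
print(cleaned)
```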

In [6]:
comments['comment'][0]

u"This: :One can make an analogy in mathematical terms by envisioning the distribution of opinions in a population as a Gaussian curve. We would then say that the consensus would be a statement that represents the range of opinions within perhaps three standard deviations of the mean opinion.  sounds arbitrary and ad hoc.  Does it really belong in n encyclopedia article?  I don't see that it adds anything useful.  The paragraph that follows seems much more useful.  Are there any political theorists out there who can clarify the issues?  It seems to me that this is an issue that Locke, Rousseau, de Toqueville, and others must have debated...  SR "

## Semantic Vector

Build the word-frequency vocabulary and a function that pre-processes a given text and maps it onto the semantic vocabulary to produce the semantic vector.

In [7]:
# pre-processing pipeline
pipeline = [
    # convert all letters to lowercase
    text_prepro.to_lower,
    # transliterate non-english letters
    text_prepro.transliterate,
    # strip tags (@ and #) from words
    text_prepro.remove_tags,
    # tokenize URLs into "__URL__"
    text_prepro.tokenize_url,
    # Keep alphanumeric characters only
    text_prepro.alphanum
]

In [8]:
text = u'Hëllo @foobar, VISIT my [site](http://foo.bar) #thankyou!'
print text
for i, pipe in enumerate(pipeline, 1):
    text = pipe(text)
    print u'[{}]: {}'.format(i, text)
print

Hëllo @foobar, VISIT my [site](http://foo.bar) #thankyou!
[1]: hëllo @foobar, visit my [site](http://foo.bar) #thankyou!
[2]: hello @foobar, visit my [site](http://foo.bar) #thankyou!
[3]: hello foobar, visit my [site](http://foo.bar) thankyou!
[4]: hello foobar, visit my [site](__URL__) thankyou!
[5]: hello foobar visit my site __URL__ thankyou
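
The `processing.text` module is not shown in this notebook. As a rough illustration of what each stage does, the sketch below defines stand-in functions that reproduce the trace above for this one example. The implementations, including every regular expression, are our assumptions and not the actual code behind `text_prepro`.

```python
import re
import unicodedata


# Hypothetical stand-ins for the functions in processing.text.
def to_lower(text):
    return text.lower()


def transliterate(text):
    # Decompose accented letters and drop the combining marks (e.g. ë -> e).
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(c for c in decomposed if not unicodedata.combining(c))


def remove_tags(text):
    # Drop a leading @ or # but keep the word that follows it.
    return re.sub(r'[@#](\w+)', r'\1', text, flags=re.UNICODE)


def tokenize_url(text):
    # Replace anything that looks like an http(s) URL with a placeholder.
    return re.sub(r'https?://[^\s)\]]+', u'__URL__', text)


def alphanum(text):
    # Keep word characters (letters, digits, underscore); collapse whitespace.
    cleaned = re.sub(r'[^\w\s]', u' ', text, flags=re.UNICODE)
    return u' '.join(cleaned.split())


sketch_pipeline = [to_lower, transliterate, remove_tags, tokenize_url, alphanum]

text = u'H\u00ebllo @foobar, VISIT my [site](http://foo.bar) #thankyou!'
steps = []
for pipe in sketch_pipeline:
    text = pipe(text)
    steps.append(text)
print(steps[-1])
```

Stripping only the combining marks, rather than forcing the text to ASCII, is a deliberate choice here: the vocabulary printed later in the notebook still contains non-Latin words, so the real transliteration step evidently does not discard them.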



In [9]:
# define the pre-processing as a function
def semantic_prepro(text):
    for pipe in pipeline:
        text = pipe(text)
    return text

In [10]:
# extract word counts
word_counts = Counter()
for comment in comments['comment']:
    comment = semantic_prepro(comment)
    word_counts.update(comment.split())

In [11]:
print len(word_counts)

180929


In [12]:
# top 50 words
word_counts.most_common(50)

[(u'the', 499009),
 (u'to', 299282),
 (u'i', 241781),
 (u'and', 225667),
 (u'of', 225664),
 (u'you', 220267),
 (u'a', 217139),
 (u'is', 177412),
 (u'that', 161865),
 (u'it', 149423),
 (u'in', 146057),
 (u'for', 103492),
 (u'this', 98202),
 (u'not', 94392),
 (u'on', 90560),
 (u'be', 84128),
 (u'as', 77874),
 (u'are', 72731),
 (u'have', 72649),
 (u's', 72443),
 (u'your', 63192),
 (u'with', 60039),
 (u't', 59801),
 (u'if', 59174),
 (u'article', 57902),
 (u'was', 54862),
 (u'or', 53876),
 (u'but', 51469),
 (u'wikipedia', 46715),
 (u'page', 46352),
 (u'my', 45497),
 (u'an', 45224),
 (u'from', 41809),
 (u'by', 41630),
 (u'do', 40361),
 (u'can', 39789),
 (u'at', 39736),
 (u'about', 37335),
 (u'so', 36840),
 (u'me', 36745),
 (u'what', 35582),
 (u'there', 35494),
 (u'all', 31818),
 (u'has', 31067),
 (u'will', 30812),
 (u'please', 30209),
 (u'he', 29569),
 (u'would', 29547),
 (u'they', 29479),
 (u'no', 29465)]

In [13]:
# bottom 50 words
word_counts.most_common()[-50:]

[(u'leaded', 1),
 (u'shemeet', 1),
 (u'pe\u026al\u0268n', 1),
 (u'hmmpff', 1),
 (u'qoyunli', 1),
 (u'thoroughfare', 1),
 (u'fradaulent', 1),
 (u'shty', 1),
 (u'proberly', 1),
 (u'pocketbook', 1),
 (u'mahakavyas', 1),
 (u'fudd', 1),
 (u'cryokinesis', 1),
 (u'wonk', 1),
 (u'sipopo', 1),
 (u'belembay', 1),
 (u'knisfo', 1),
 (u'onclelosse', 1),
 (u'pertecting', 1),
 (u'antivermins', 1),
 (u'warrig', 1),
 (u'ajna', 1),
 (u'talkapge', 1),
 (u'nepotising', 1),
 (u'rattner2', 1),
 (u'bratwurst', 1),
 (u'publicationthe', 1),
 (u'clarityafflicting', 1),
 (u'ornella', 1),
 (u'cronyn', 1),
 (u'australianist', 1),
 (u'chromate', 1),
 (u'ehlers', 1),
 (u'spanko', 1),
 (u'thurst', 1),
 (u'gnawing', 1),
 (u'bennies', 1),
 (u'spanky', 1),
 (u'as_of', 1),
 (u'branco', 1),
 (u'\u65b0\u64b0\u59d3\u6c0f\u9332', 1),
 (u'accoutns', 1),
 (u'queensborough', 1),
 (u'commagene', 1),
 (u'personal_attacks_', 1),
 (u'psone', 1),
 (u'fapped', 1),
 (u'classsssssssss', 1),
 (u'morihiro', 1),
 (u'downstep', 1)]

In [14]:
# select words with more than 1 occurrence
select = {k: v for k, v in word_counts.iteritems() if v > 1}

In [15]:
len(select)

87348

In [16]:
# assign a unique index to each word (most frequent word gets index 0)
sorted_words = sorted(select.iteritems(), key=lambda (k, v): (v, k), reverse=True)
word_indexes = {k: i for i, (k, _) in enumerate(sorted_words)}

# save vocabulary
import json
with open('datasets/semantic_vocab.json', 'w') as f:
    json.dump(word_indexes, f)

In [17]:
word_indexes['is']

7

In [18]:
word_indexes['damn']

1173
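
The selection and indexing cells above rely on `iteritems` and a tuple-unpacking `lambda`, both of which exist only in Python 2. For readers on Python 3, the sketch below shows the same two steps on a toy corpus. `build_vocab` and `min_count` are hypothetical names introduced for illustration.

```python
from collections import Counter


def build_vocab(counts, min_count=2):
    """Map each word seen at least `min_count` times to a unique index.

    Words are ordered by (count, word) in descending order, mirroring the
    sort key used in the notebook, so the most frequent word gets index 0.
    """
    selected = [(w, c) for w, c in counts.items() if c >= min_count]
    selected.sort(key=lambda kv: (kv[1], kv[0]), reverse=True)
    return {w: i for i, (w, _) in enumerate(selected)}


toy_counts = Counter('the cat sat on the mat the cat'.split())
toy_vocab = build_vocab(toy_counts)
print(toy_vocab)
```

Words that occur only once ("sat", "on", "mat") are dropped, just as the notebook drops every word with a single occurrence.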

In [19]:
# semantic vector mapper
def semantic_vector(text):
    text = semantic_prepro(text)
    vector = np.zeros((len(word_indexes),), dtype=np.float32)
    for w in text.split():
        ind = word_indexes.get(w)
        if ind is not None:
            vector[ind] = 1.
    return vector

In [20]:
# example text
example = comments['comment'][30]
example

u" ::::That's reasonable enough; I just saw the conflict on this one page and not on the rest.  I don't know details of it either, and if I had noticed the rest I probably would have been suspicious also (like Mav, below).  Also like mav, though, I probably would have noted why I reverted any changes on the article's talk page.  Best,   "

In [21]:
# generate vector
vector_words = semantic_vector(example)

In [22]:
vector_words.shape

(87348,)

In [23]:
vector_words[word_indexes['enough']]

1.0

In [24]:
vector_words.sum()

44.0

In [25]:
words = semantic_prepro(example).split()
len(words)

64
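
The vector sums to 44 while the example has 64 tokens because `semantic_vector` records presence, not counts: a repeated word sets the same position to 1 again, and a word that is missing from the vocabulary sets nothing. The toy sketch below illustrates both effects without NumPy; `count_active_features` is a hypothetical helper.

```python
def count_active_features(tokens, vocab):
    """Number of distinct vocabulary positions a token list would set to 1."""
    return len({vocab[t] for t in tokens if t in vocab})


toy_vocab = {'the': 0, 'cat': 1, 'sat': 2}
toy_tokens = 'the cat saw the other cat'.split()

n_tokens = len(toy_tokens)
n_active = count_active_features(toy_tokens, toy_vocab)
print((n_tokens, n_active))
```

Here six tokens activate only two positions: "the" and "cat" are each repeated, and "saw" and "other" are out of vocabulary.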

## Letter vector

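We generate a list of letter bigrams (2-grams) from the alphabetic characters (a-z), then map a given text's characters onto this list to produce the letter vector. Because the list is built from permutations, it contains only ordered pairs of distinct letters (26 × 25 = 650), so doubled letters such as "aa" are not included.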

In [26]:
import string
import itertools

# character 2-grams: ordered pairs of distinct lowercase letters (a-z)
chars = list(u"".join(comb) for comb in set(itertools.permutations(set(string.letters.lower()), 2)))
char_indexes = {c: i for i, c in enumerate(sorted(chars))}

# save letter vocab
with open('datasets/letter_vocab.json', 'w') as f:
    json.dump(char_indexes, f)
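
`string.letters` is only available in Python 2. A sketch of the same vocabulary on Python 3, using `string.ascii_lowercase`, might look as follows; the names `bigrams` and `bigram_indexes` are ours.

```python
import itertools
import string

# Ordered pairs of distinct lowercase letters: 26 * 25 = 650 bigrams.
bigrams = sorted(
    a + b for a, b in itertools.permutations(string.ascii_lowercase, 2)
)
bigram_indexes = {gram: i for i, gram in enumerate(bigrams)}

print(len(bigram_indexes))
```

Using `itertools.product(string.ascii_lowercase, repeat=2)` instead would also include doubled letters and yield 676 entries, but that is a different vocabulary from the one used in this notebook.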

In [27]:
char_indexes

{u'ab': 0,
 u'ac': 1,
 u'ad': 2,
 u'ae': 3,
 u'af': 4,
 u'ag': 5,
 u'ah': 6,
 u'ai': 7,
 u'aj': 8,
 u'ak': 9,
 u'al': 10,
 u'am': 11,
 u'an': 12,
 u'ao': 13,
 u'ap': 14,
 u'aq': 15,
 u'ar': 16,
 u'as': 17,
 u'at': 18,
 u'au': 19,
 u'av': 20,
 u'aw': 21,
 u'ax': 22,
 u'ay': 23,
 u'az': 24,
 u'ba': 25,
 u'bc': 26,
 u'bd': 27,
 u'be': 28,
 u'bf': 29,
 u'bg': 30,
 u'bh': 31,
 u'bi': 32,
 u'bj': 33,
 u'bk': 34,
 u'bl': 35,
 u'bm': 36,
 u'bn': 37,
 u'bo': 38,
 u'bp': 39,
 u'bq': 40,
 u'br': 41,
 u'bs': 42,
 u'bt': 43,
 u'bu': 44,
 u'bv': 45,
 u'bw': 46,
 u'bx': 47,
 u'by': 48,
 u'bz': 49,
 u'ca': 50,
 u'cb': 51,
 u'cd': 52,
 u'ce': 53,
 u'cf': 54,
 u'cg': 55,
 u'ch': 56,
 u'ci': 57,
 u'cj': 58,
 u'ck': 59,
 u'cl': 60,
 u'cm': 61,
 u'cn': 62,
 u'co': 63,
 u'cp': 64,
 u'cq': 65,
 u'cr': 66,
 u'cs': 67,
 u'ct': 68,
 u'cu': 69,
 u'cv': 70,
 u'cw': 71,
 u'cx': 72,
 u'cy': 73,
 u'cz': 74,
 u'da': 75,
 u'db': 76,
 u'dc': 77,
 u'de': 78,
 u'df': 79,
 u'dg': 80,
 u'dh': 81,
 u'di': 82,
 u'dj': 83,
 u

In [28]:
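# character mapper
def letter_vector(text):
    vector = np.zeros((len(char_indexes),), dtype=np.float32)
    # walk the text in non-overlapping 2-character chunks
    for i in xrange(0, len(text), 2):
        c = text[i:i + 2]
        # use a separate name so the loop variable is not shadowed
        ind = char_indexes.get(c)
        if ind is not None:
            vector[ind] = 1.
    return vector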
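
Note that `letter_vector` reads the text in steps of two, so it looks at non-overlapping chunks rather than at every adjacent pair of characters, and it receives the raw text without lower-casing. The sketch below contrasts the two ways of slicing. `sliding_bigrams` is shown for comparison only and is not what the notebook does; both helper names are hypothetical.

```python
def chunked_bigrams(text):
    """Non-overlapping 2-character chunks, as iterated by letter_vector."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def sliding_bigrams(text):
    """Every adjacent character pair (an alternative, for comparison)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


chunks = chunked_bigrams('hello')
pairs = sliding_bigrams('hello')
print(chunks)
print(pairs)
```

With chunking, the bigrams that get picked up depend on whether a word happens to start at an even or an odd offset in the text. The trailing single character and the doubled "ll" have no entry in the letter vocabulary and are simply skipped.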

In [29]:
vector_letters = letter_vector(example)

In [30]:
len(vector_letters)

650

In [31]:
vector_letters

array([ 1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,
        0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,
        1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0

## Concatenate features

Finally, after computing each feature for each example, we concatenate them into a single vector.

In [32]:
features_vector = np.concatenate([vector_words, vector_letters])

In [33]:
features_vector.shape

(87998,)

In [34]:
len(vector_words) + len(vector_letters)

87998
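
Since the word vector comes first, position `j` of the letter vector ends up at position `len(vector_words) + j` of the concatenated vector. A small sketch of that bookkeeping, using the sizes printed above; `feature_position` is a hypothetical helper.

```python
N_WORDS = 87348   # len(word_indexes) in this notebook
N_LETTERS = 650   # len(char_indexes) in this notebook


def feature_position(block, local_index):
    """Position of a feature inside the concatenated vector."""
    if block == 'word':
        if not 0 <= local_index < N_WORDS:
            raise IndexError('word index out of range')
        return local_index
    if block == 'letter':
        if not 0 <= local_index < N_LETTERS:
            raise IndexError('letter index out of range')
        # letter features are shifted past the whole word block
        return N_WORDS + local_index
    raise ValueError('unknown block: {}'.format(block))


first_letter_pos = feature_position('letter', 0)
last_letter_pos = feature_position('letter', N_LETTERS - 1)
print((first_letter_pos, last_letter_pos))
```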

In [35]:
def save_features(data, name):
    # buffer of feature vectors waiting to be written to disk
    features = []
    batch_num = 0
    # number of records seen so far (stays 0 if data is empty)
    i = 0
    for i, comment in enumerate(data, 1):
        vs = semantic_vector(comment)
        vl = letter_vector(comment)
        v = np.concatenate([vs, vl])
        features.append(v)
        # flush every full batch of 100 records to its own .npy file
        if i % 100 == 0:
            batch_num += 1
            np.save('datasets/processed/{}.{:04}.npy'.format(name, batch_num), np.vstack(features))
            del features[:]
    # flush the remaining records, if any, as a final smaller batch
    if features:
        batch_num += 1
        np.save('datasets/processed/{}.{:04}.npy'.format(name, batch_num), np.vstack(features))
        del features[:]
    print 'saved {} {} records ({} batches)'.format(i, name, batch_num)
    
# save positives
save_features(comments.query('label')['comment'], 'positives')
# save negatives
save_features(comments.query('~label')['comment'], 'negatives')

saved 15362 positives records (154 batches)
saved 144324 negatives records (1444 batches)
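
The batch counts reported above follow from writing one file per 100 records plus one final file for the remainder. A quick check of that arithmetic, along with the raw array size of a full batch, is sketched below; `num_batches` and `batch_bytes` are hypothetical helper names.

```python
def num_batches(n_records, batch_size=100):
    """Files written: full batches plus one partial batch if needed."""
    return (n_records + batch_size - 1) // batch_size


def batch_bytes(batch_size=100, n_features=87998, bytes_per_value=4):
    """Raw size of one full batch of float32 vectors (file header excluded)."""
    return batch_size * n_features * bytes_per_value


positives = num_batches(15362)
negatives = num_batches(144324)
print((positives, negatives, batch_bytes()))
```

The two record counts also add up to the 159686 rows reported by `comments.shape`, so every comment ends up in exactly one of the two sets.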
