In [1]:
import sqlite3
from bs4 import BeautifulSoup
from IPython.display import display, HTML
import re
import pickle
import pandas as pd

In [2]:
pd.set_option('display.max_colwidth', -1)

In [3]:
conn = sqlite3.connect('../Data/crossvalidated.db')
# return all the records for answer posts (PostTypeId 2) from the posts table
ques_query = "SELECT * FROM [posts] WHERE PostTypeId==2"

In [4]:
apost_df = pd.read_sql_query(ques_query, conn)

In [5]:
apost_df.drop(['LastEditorDisplayName','CommunityOwnedDate','LastEditorUserId','LastEditDate',
             'LastActivityDate'],axis=1,inplace=True)

In [6]:
display(apost_df.Body[apost_df.Id==133694])

53189    <p>What a great question- it's a chance to show how one would inspect the drawbacks and assumptions of any statistical method.  Namely: make up some data and try the algorithm on it!</p>\n\n<p>We'll consider two of your assumptions, and we'll see what happens to the k-means algorithm when those assumptions are broken. We'll stick to 2-dimensional data since it's easy to visualize. (Thanks to the <a href="http://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a>, adding additional dimensions is likely to make these problems more severe, not less). We'll work with the statistical programming language R: you can find the full code <a href="https://github.com/dgrtwo/dgrtwo.github.com/blob/master/_R/2015-01-16-kmeans-free-lunch.Rmd">here</a> (and the post in blog form <a href="http://varianceexplained.org/r/kmeans-free-lunch/">here</a>).</p>\n\n<h3>Diversion: Anscombe's Quartet</h3>\n\n<p>First, an analogy. Imagine someone argued the following:</p>\n\n<block

#Helper function

In [7]:
def clean_html_text(row):
    soup = BeautifulSoup(row, 'html.parser')
    #denote code
    for tag in soup.find_all('code'):
        tag.replaceWith(' refcode ')
    #denote link
    for tag in soup.find_all('a'):
        content = tag.text
        tag.replaceWith(content +' (reflink) ')
    #denote image
    for tag in soup.find_all('img'):
        tag.replaceWith(' refimage ')
        
    raw = soup.get_text().lower()
    #remove whitespace and /
    raw = re.sub('[\t\n\r\x0b\x0c/]+?', ' ', raw) 
    #denote mention
    # replace twitter @mentions
    mentionFinder = re.compile(r"@[a-z0-9_]{1,15}", re.IGNORECASE)
    raw = mentionFinder.sub("@mention", raw)
    
    #denote email
    # (?:com|org|ch|uk) is an alternation; the original [com|org|ch|uk] was a character class
    raw = re.sub(r'[\w\.-]+@[\w\.-]+\.(?:com|org|ch|uk)\b', " refemail ", raw)
    #denote formula
    reg = r'(\$\$.+?\$\$)|((\\begin\{.+?\})(.+?)(\\end\{(.+?)\}))'
    raw = re.sub(reg, " refformula ", raw, flags=re.IGNORECASE)  
    #denote variable
    raw = re.sub('(\$.+?\$)|([a-z]\d)',' refvariable ', raw)
    #denote number
    raw = re.sub('[-+]?(\d*[.])?\d+',' refnumber ', raw)
    
    return(raw)
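
The text-only steps of the helper can be tried in isolation. The block below is a minimal sketch, not the notebook's function: `replace_placeholders` and the three compiled patterns are hypothetical names, the BeautifulSoup handling is left out, and e-mails are replaced before mentions as a choice of this sketch.

```python
import re

# Hypothetical standalone version of the text-only steps in clean_html_text;
# the HTML handling with BeautifulSoup is deliberately left out.
EMAIL = re.compile(r'[\w.-]+@[\w.-]+\.(?:com|org|ch|uk)\b')
MENTION = re.compile(r'@[a-z0-9_]{1,15}', re.IGNORECASE)
NUMBER = re.compile(r'[-+]?(\d*[.])?\d+')


def replace_placeholders(text):
    """Lowercase text and swap e-mails, mentions and numbers for tokens."""
    text = text.lower()
    text = EMAIL.sub(' refemail ', text)      # e-mails first, so their '@' is consumed
    text = MENTION.sub('@mention', text)      # remaining @handles
    text = NUMBER.sub(' refnumber ', text)    # integers and decimals
    return text


sample = 'Ask @Bob or mail bob@example.org about the 3.5 ratio'
cleaned = replace_placeholders(sample)
print(cleaned)
```

Splitting the result on whitespace gives one token per placeholder, which is the form the later tokenization step works on.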

In [8]:
apost_df['Body_Text'] = apost_df.Body.map(clean_html_text)

In [9]:
bins = [-36, 1, 260]
group_names = ['bad','good']
apost_df['AnsQuality']= pd.cut(apost_df['Score'],bins,labels=group_names)

In [10]:
apost_df.AnsQuality.value_counts()

good    40546
bad     33785
dtype: int64
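
With its default settings `pd.cut` uses right-closed intervals, so the bins above are (-36, 1] for `bad` and (1, 260] for `good`. The pure-Python sketch below mimics that rule without pandas; `label_score` is a hypothetical helper written only for illustration.

```python
def label_score(score, bins=(-36, 1, 260), labels=('bad', 'good')):
    """Return the label of the right-closed interval (lo, hi] containing score."""
    for lo, hi, label in zip(bins[:-1], bins[1:], labels):
        if lo < score <= hi:
            return label
    return None  # outside every interval; pd.cut would give NaN here


for s in (-36, -35, 1, 2, 260, 261):
    print('%d -> %s' % (s, label_score(s)))
```

A score of exactly 1 lands in `bad`, and a score equal to the lowest edge (-36) falls outside every bin.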

In [12]:
import pickle
with open('../Data/ans_clean_forDL.pickle', 'wb') as handle:
    pickle.dump(apost_df[['Id','Body_Text','AnsQuality']], handle)

In [13]:
display(clean_html_text(apost_df.Body[apost_df.Id==4632].iloc[0]))

u"one at a somewhat lower level of mathematical sophistication than wooldridge (less dense, more pictures), but a bit more up to date on some of the fast-moving areas: murray, michael p. econometrics: a modern introduction. addison wesley,  refnumber .  refnumber  pp. isbn  refnumber  (reflink)  seems that it's not available for preview on the web and the publisher is out of stock, but you can view pdfs of  refnumber  web extensions (reflink)  to get an idea of its style. "

#Automatic Summarization

In [14]:
from gensim.summarization import summarize
from gensim.summarization import keywords

In [15]:
text = clean_html_text(apost_df.Body[apost_df.Id==1632].iloc[0])

In [16]:
text

u"it's hard to ignore the wealth of statistical packages available in r cran.  that said, i spend a lot of time in python land and would never dissuade anyone from having as much fun as i do.  :)  here are some libraries links you might find useful for statistical work.    numpy scipy (reflink)  you probably know about these already.  but let me point out the cookbook (reflink)  where you can read about many statistical facilities already available and the example list (reflink)  which is a great reference for functions (including data manipulation and other operations).  another handy reference is john cook's distributions in scipy (reflink) . pandas (reflink)  this is a really nice library for working with statistical data -- tabular data, time series, panel data.  includes many builtin functions for data summaries, grouping aggregation, pivoting.  also has a statistics econometrics library. larry (reflink)   labeled array that plays nice with numpy.  provides statistical functions n

In [17]:
print('Summary:')
print(summarize(text, word_count=30))

Summary:
but let me point out the cookbook (reflink)  where you can read about many statistical facilities already available and the example list (reflink)  which is a great reference for functions (including data manipulation and other operations).


In [18]:
print('Keywords:')
print(keywords(text,pos_filter=['NN'],ratio=0.1,lemmatize=True))

Keywords:
reflink
statistics
data
models
libraries
learning
packages


#Word2vec/Doc2vec prepare: Train and test set

In [19]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [20]:
X_train, X_test, y_train, y_test = train_test_split(apost_df[['Id','Body_Text']], apost_df['AnsQuality'], test_size=0.4, random_state=42)
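
A quick arithmetic check of the split size, as a sketch. It assumes the test count is the fraction rounded up and takes the total of 74331 answers from the counts reported elsewhere in this notebook.

```python
import math

n_docs = 74331        # total answers, taken from the notebook's own output
test_size = 0.4

n_test = int(math.ceil(test_size * n_docs))   # assumed rounding: up
n_train = n_docs - n_test
print('%d train, %d test' % (n_train, n_test))
```

These figures agree with the train and test document counts printed when the tab-separated file is read back.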

In [21]:
X_train['source'] = 'train'
X_test['source'] = 'test'

In [22]:
tmp_df = pd.concat([X_train, X_test],ignore_index=True)

In [23]:
apost_df = apost_df.merge(tmp_df, left_on=apost_df.Id,right_on=tmp_df.Id,suffixes=['_post', '_tmp'])

In [24]:
apost_df.columns

Index([u'Body', u'ViewCount', u'ClosedDate', u'ParentID', u'CommentCount',
       u'AnswerCount', u'AcceptedAnswerId', u'Score', u'OwnerDisplayName',
       u'Title', u'PostTypeId', u'OwnerUserId', u'Tags', u'CreationDate',
       u'FavoriteCount', u'Id_post', u'Body_Text_post', u'AnsQuality',
       u'Id_tmp', u'Body_Text_tmp', u'source'],
      dtype='object')

In [60]:
# one tab-separated record per answer: tag, cleaned text, split, quality label
with open("../Data/allans_forDL.csv", 'wb') as fdata:
    for index,row in apost_df.iterrows():
        ID = row['Id_tmp']
        split = row['source']
        sentiment = row['AnsQuality']
        ans = row["Body_Text_post"].encode("ascii", "ignore")

        # tag each document by its answer Id so it can be looked up as 'AnsID:<Id>' later
        fdata.write("AnsID:%s\t%s\t%s\t%s\n" % (ID,ans,split,sentiment))

In [2]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from gensim.models.doc2vec import LabeledSentence 
from collections import namedtuple
import nltk

In [3]:
alldocs = []  # will hold all docs in original order
train_doc = []
test_doc = []
train_y = []
test_y = []
with open('../Data/allans_forDL.csv') as alldata:
    for line_no, line in enumerate(alldata):
        tags, words_texts, split, sentiment = line.strip().split('\t')
        words = nltk.word_tokenize(words_texts)
        if split=='train':
            train_doc.append(LabeledSentence(words, [tags]))
            train_y.append(1 if sentiment == "good" else 0)
        else:
            test_doc.append(LabeledSentence(words, [tags]))
            # test labels go in test_y, with the same encoding as train_y (good = 1)
            test_y.append(1 if sentiment == "good" else 0)
            
        alldocs.append(LabeledSentence(words, [tags]))

doc_list = alldocs[:]  # for reshuffling per pass

print('%d docs: %d train-sentiment, %d test-sentiment' % (len(doc_list), len(train_doc), len(test_doc)))

74331 docs: 44598 train-sentiment, 29733 test-sentiment
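
The record format and the label encoding can be checked on a couple of hand-written lines. This is only a sketch: `TaggedDoc` is a stand-in for gensim's `LabeledSentence`, `parse_line` is a hypothetical helper, and a plain whitespace split replaces `nltk.word_tokenize`.

```python
from collections import namedtuple

# Stand-in for gensim's LabeledSentence/TaggedDocument (assumption of this sketch).
TaggedDoc = namedtuple('TaggedDoc', ['words', 'tags'])


def parse_line(line):
    """Split one TSV record into (document, split name, 0/1 label)."""
    tag, text, split, quality = line.rstrip('\n').split('\t')
    words = text.split()          # whitespace split instead of nltk.word_tokenize
    label = 1 if quality == 'good' else 0
    return TaggedDoc(words, [tag]), split, label


lines = [
    'AnsID:1\tuse a mixed model\ttrain\tgood\n',
    'AnsID:2\tno idea\ttest\tbad\n',
]
train_docs, train_y, test_docs, test_y = [], [], [], []
for line in lines:
    doc, split, label = parse_line(line)
    if split == 'train':
        train_docs.append(doc)
        train_y.append(label)
    else:
        test_docs.append(doc)
        test_y.append(label)
print(train_y, test_y)
```

Keeping one encoding (good = 1) for both splits means the two label lists can be scored against the same classifier output.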


In [3]:
doc_list[5000]

TaggedDocument(words=['this', 'paper', 'by', 'massey', 'and', 'denton', 'refnumber', '(', 'reflink', ')', 'is', 'a', 'fairly', 'prolific', 'overview', 'of', 'commonly', 'used', 'indices', 'in', 'sociology', 'demography', '.', 'it', 'would', 'also', 'be', 'useful', 'for', 'some', 'other', 'key', 'terms', 'used', 'for', 'searching', 'articles', '.', 'frequently', 'in', 'sociology', 'the', 'indices', 'are', 'labelled', 'with', 'names', 'such', 'as', '``', 'heterogeneity', "''", 'and', '``', 'segregation', "''", 'as', 'well', 'as', '``', 'diversity', "''", '.', 'part', 'of', 'the', 'reason', 'no', 'absolute', 'right', 'answer', 'exists', 'to', 'your', 'question', 'is', 'that', 'people', 'frequently', 'only', 'use', 'epistemic', 'logic', 'to', 'reason', 'why', 'one', 'index', 'is', 'a', 'preferred', 'measurement', '.', 'infrequently', 'are', 'those', 'arguments', 'so', 'strong', 'that', 'one', 'should', 'entirely', 'discount', 'other', 'suggested', 'measures', '.', 'the', 'work', 'of', 'mas

In [37]:
apost_df['Body_Text_post'][5000]

u'this paper by massey and denton  refnumber  (reflink)  is a fairly prolific overview of commonly used indices in sociology demography. it would also be useful for some other key terms used for searching articles. frequently in sociology the indices are labelled with names such as "heterogeneity" and "segregation" as well as "diversity". part of the reason no absolute right answer exists to your question is that people frequently only use epistemic logic to reason why one index is a preferred measurement. infrequently are those arguments so strong that one should entirely discount other suggested measures. the work of massey and denton is useful to highlight what many of these indices theoretically measure and when they differ to a substantively noticeable extent (in large cities in the us). '

#Set-up Doc2Vec Training & Evaluation Models

In [4]:
from gensim.models import Doc2Vec
import gensim.models.doc2vec
from collections import OrderedDict
import multiprocessing

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "this will be painfully slow otherwise"

In [5]:
simple_models = [
    # PV-DM w/concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=8, workers=cores, sample=1e-4),
    # PV-DBOW 
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=10, workers=cores, sample=1e-4),
    # PV-DM w/average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=8, negative=5, hs=0, min_count=8, workers=cores, sample=1e-4),
]

In [6]:
# speed setup by sharing results of 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM/concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

Doc2Vec(dm/c,d100,n5,w5,mc8,s0.0001,t16)
Doc2Vec(dbow,d100,n5,mc10,s0.0001,t16)
Doc2Vec(dm/m,d100,n5,w8,mc8,s0.0001,t16)


In [8]:
simple_models[0].docvecs.count

74331

In [9]:
print(simple_models[0].docvecs['AnsID:7704'])

[  4.61835507e-03  -3.64258396e-03   2.60482798e-03  -5.24145551e-04
   4.56172012e-04  -3.48055479e-03  -2.98021291e-03   1.10461249e-03
  -2.67894939e-03  -3.36333970e-03   3.97843914e-03   2.88273045e-03
  -4.57707047e-03   3.92084755e-03   4.83234064e-04   1.13892404e-03
   4.08172957e-04  -4.75213351e-03   1.20185642e-03  -3.79314972e-03
  -2.92225787e-03  -2.45182705e-03  -3.36044584e-03   3.64192203e-03
   4.86102421e-03  -3.48003511e-03   8.44478272e-05  -4.23227018e-03
  -3.19611561e-03   3.19259381e-03   3.07510607e-03   2.36585550e-03
   1.03524362e-03  -4.14243620e-03  -4.69785463e-03   1.33768586e-03
  -8.67638155e-04   3.78161180e-03  -1.01621414e-03   6.14542223e-04
  -1.27025798e-03  -1.92071078e-03   4.88684839e-03   3.89387016e-03
  -2.09059450e-04   3.63059132e-03   8.61565350e-05  -1.56969717e-03
   1.49597076e-03  -4.91627492e-03   1.18393754e-03   1.48012745e-03
   1.07549748e-03   2.88023963e-03  -9.36067721e-04  -3.77976045e-04
  -3.73104657e-03  -2.57961568e-03

In [7]:
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec

models_by_name = OrderedDict((str(model), model) for model in simple_models)
models_by_name['dbow+dmm'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[2]])
models_by_name['dbow+dmc'] = ConcatenatedDoc2Vec([simple_models[1], simple_models[0]])
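
The paired entries look up the same tag in two models and join the resulting vectors end to end. The toy class below sketches that idea with plain dictionaries. `ConcatenatedVectors` is a hypothetical name and not gensim's implementation, and its no-op `train` is an assumption that would be consistent with the 0.0s timings these entries show in the training log.

```python
class ConcatenatedVectors(object):
    """Hypothetical wrapper: look up a tag in several models and join the vectors."""

    def __init__(self, lookups):
        self.lookups = lookups            # each maps tag -> list of floats

    def __getitem__(self, tag):
        joined = []
        for lookup in self.lookups:
            joined.extend(lookup[tag])
        return joined

    def train(self, *args, **kwargs):
        pass                              # nothing to fit; members are trained separately


dbow = {'AnsID:5': [0.1, 0.2]}
dmm = {'AnsID:5': [0.3, 0.4, 0.5]}
combo = ConcatenatedVectors([dbow, dmm])
print(combo['AnsID:5'])
```

With two 100-dimensional models the joined vector would have 200 dimensions.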

#Predictive Evaluation Methods

In [8]:
import numpy as np
import statsmodels.api as sm
from random import sample

# for timing
from contextlib import contextmanager
from timeit import default_timer
import time 

@contextmanager
def elapsed_timer():
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end-start




#Bulk Training

Training uses the explicit multiple-pass, alpha-reduction approach sketched in the gensim doc2vec blog post, with the corpus reshuffled on each pass.

Evaluation of each model's sentiment-predictive power is repeated after each pass, as an error rate (lower is better), to see the rates-of-relative-improvement. The base numbers reuse the TRAIN and TEST vectors stored in the models for the logistic regression, while the inferred results use newly-inferred TEST vectors.
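
The linear learning-rate schedule can be worked out on its own. The sketch below uses the same starting alpha, floor and number of passes as the training cell, and `schedule` is a name introduced here for illustration.

```python
alpha, min_alpha, passes = 0.025, 0.001, 10
alpha_delta = (alpha - min_alpha) / passes   # 0.0024 per pass

# alpha used on pass 1, 2, ..., passes (the decrement is applied after each pass)
schedule = [alpha - i * alpha_delta for i in range(passes)]
for epoch, rate in enumerate(schedule, start=1):
    print('pass %2d: alpha = %.6f' % (epoch, rate))
```

Because the decrement happens after each pass, the last pass runs at 0.0034 rather than at `min_alpha` itself.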

In [9]:
from collections import defaultdict
best_error = defaultdict(lambda :1.0)  # to selectively-print only best errors achieved

In [10]:
import warnings

In [11]:
from random import shuffle
import datetime

alpha, min_alpha, passes = (0.025, 0.001, 10)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    for epoch in range(passes):
        shuffle(doc_list)  # shuffling gets best results

        for name, train_model in models_by_name.items():
            # train
            duration = 'na'
            train_model.alpha, train_model.min_alpha = alpha, alpha
            with elapsed_timer() as elapsed:
                train_model.train(doc_list)
                duration = '%.1f' % elapsed()
                print("The %i passes : %s %ss" % (epoch + 1, name, duration))

        print('completed pass %i at alpha %f' % (epoch + 1, alpha))
        alpha -= alpha_delta

    print("END %s" % str(datetime.datetime.now()))

START 2016-06-01 19:31:17.272491
The 1 passes : Doc2Vec(dm/c,d100,n5,w5,mc8,s0.0001,t16) 51.6s
The 1 passes : Doc2Vec(dbow,d100,n5,mc10,s0.0001,t16) 39.7s
The 1 passes : Doc2Vec(dm/m,d100,n5,w8,mc8,s0.0001,t16) 54.0s
The 1 passes : dbow+dmm 0.0s
The 1 passes : dbow+dmc 0.0s
completed pass 1 at alpha 0.025000
The 2 passes : Doc2Vec(dm/c,d100,n5,w5,mc8,s0.0001,t16) 51.3s
The 2 passes : Doc2Vec(dbow,d100,n5,mc10,s0.0001,t16) 38.9s
The 2 passes : Doc2Vec(dm/m,d100,n5,w8,mc8,s0.0001,t16) 53.3s
The 2 passes : dbow+dmm 0.0s
The 2 passes : dbow+dmc 0.0s
completed pass 2 at alpha 0.022600
The 3 passes : Doc2Vec(dm/c,d100,n5,w5,mc8,s0.0001,t16) 50.2s
The 3 passes : Doc2Vec(dbow,d100,n5,mc10,s0.0001,t16) 38.1s
The 3 passes : Doc2Vec(dm/m,d100,n5,w8,mc8,s0.0001,t16) 53.6s
The 3 passes : dbow+dmm 0.0s
The 3 passes : dbow+dmc 0.0s
completed pass 3 at alpha 0.020200
The 4 passes : Doc2Vec(dm/c,d100,n5,w5,mc8,s0.0001,t16) 49.8s
The 4 passes : Doc2Vec(dbow,d100,n5,mc10,s0.0001,t16) 38.9s
The 4 passes :

#Achieved Sentiment-Prediction Accuracy

##Are inferred vectors close to the precalculated ones?

In [12]:
doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
for model in simple_models:
    inferred_docvec = model.infer_vector(alldocs[doc_id].words)
    print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))

for doc 16108...
Doc2Vec(dm/c,d100,n5,w5,mc8,s0.0001,t16):
 [('AnsID:29157', 0.8235429525375366), ('AnsID:29223', 0.4447145462036133), ('AnsID:5477', 0.4174382984638214)]
Doc2Vec(dbow,d100,n5,mc10,s0.0001,t16):
 [('AnsID:29157', 0.8550589084625244), ('AnsID:34344', 0.5546094179153442), ('AnsID:113676', 0.5431535243988037)]
Doc2Vec(dm/m,d100,n5,w8,mc8,s0.0001,t16):
 [('AnsID:29157', 0.8124358654022217), ('AnsID:65882', 0.4623816907405853), ('AnsID:32045', 0.46219533681869507)]
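
The scores above are cosine similarities between the inferred vector and the stored document vectors. A pure-Python sketch of that ranking on toy 2-d vectors follows. `cosine`, `most_similar` and `toy_vectors` are illustrative names defined here, not gensim calls.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Rank stored vectors by cosine similarity to the query, best first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


toy_vectors = {
    'AnsID:1': [1.0, 0.0],
    'AnsID:2': [0.0, 1.0],
    'AnsID:3': [-1.0, 0.0],
}
ranking = most_similar([0.9, 0.1], toy_vectors)
print(ranking)
```

If a re-inferred vector ranks its own document first with a clearly higher score than the runners-up, as in the output above, the inference step is recovering something close to the trained vector.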


##Do close documents seem more related than distant ones?

In [30]:
def findidx(listtuple, tagkey):
    # return the position of the first document whose first tag equals tagkey
    for idx, tup in enumerate(listtuple):
        if tup.tags[0] == tagkey:
            print(tup.tags)
            return idx
    return None  # tag not found
    

In [36]:
import random

doc_id = np.random.randint(simple_models[0].docvecs.count)  # pick random doc, re-run cell for more examples
model = random.choice(simple_models)  # and a random model
sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)  # get *all* similar documents
print(u'TARGET (%d): «%s»\n' % (doc_id, ' '.join(alldocs[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
    loc = findidx(alldocs, sims[index][0])
    print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(alldocs[loc].words)))

TARGET (5297): «i believe belsely said that ci over refnumber is indicative of a possible moderate problem , while over refnumber is more severe . in addition , though , you should look at the variance shared by sets of variables in the high condition indices . there is debate ( or was , last time i read this literature ) on whether collinearity that involved one variable and the intercept was problematic or not , and whether centering the offending variable got rid of the problem , or simply moved it elsewhere .»

SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/c,d100,n5,w5,mc8,s0.0001,t16):

['AnsID:103492']
MOST ('AnsID:103492', 0.4484512209892273): «i 'm not sure that this is an appropriate question for this forum since it deals only with the use of a particular software package , but the option refcode specifies that the residuals are independent , but that their variance may differ according to variable refcode . iow , residuals for observations with different values of refcode may 

In [31]:
findidx(alldocs,sims[0][0])

['AnsID:199838']


74209

In [24]:
sims[0][0]

'AnsID:199838'

In [25]:
alldocs[0].tags

['AnsID:5']