<h3>Training and Testing a Classification Model</h3>

In this notebook, I build training, validation, and test sets of sentences from Medium articles. I label each sentence, extract features, and then train and cross-validate a model.

In [304]:
# import required libraries

import math
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import numpy as np
import matplotlib.pyplot as plt
import string
import re
from random import randint
from sklearn import linear_model, metrics

<b>Step One:</b> Separate out training from test sets.

In [2]:
# connect to postgresql db
username = 'kimberly'
dbname = 'medium'

# 'postgresql://' is the dialect name SQLAlchemy expects; newer versions reject 'postgres://'
dbe = create_engine('postgresql://%s@localhost/%s'%(username,dbname))


In [3]:
# get df, drop missing data
df = pd.read_sql('articles', dbe, index_col='postid')
df = df.dropna(axis=0,how='any')


<b>Functions to format rows of the df and to process text</b>

In [4]:
# functions to convert nlikes and ncomments to integer
def convert_K(nstr):
    spl = nstr.split('K')
    if len(spl)==1:
        return int(float(spl[0]))
    else:
        return int(float(spl[0])*1000)
    
def convert_str(nstr):
    nstr = nstr.replace(',','')
    if nstr=='':
        return None
    else:
        return int(nstr)
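
As a quick check of the two converters, the sketch below restates them and runs them on a few sample strings. The sample inputs are made up for illustration, and the `round` call is an addition of mine to guard against floating-point truncation; it is not part of the functions above.

```python
def convert_K(nstr):
    """Convert a like count such as '2.2K' or '336' to an integer."""
    spl = nstr.split('K')
    if len(spl) == 1:
        return int(float(spl[0]))
    # round() avoids truncation when float(x) * 1000 lands just below an integer
    return int(round(float(spl[0]) * 1000))

def convert_str(nstr):
    """Convert a comment count such as '1,204' to an integer; '' becomes None."""
    nstr = nstr.replace(',', '')
    return int(nstr) if nstr else None

print(convert_K('2.2K'), convert_K('336'))
print(convert_str('1,204'), convert_str(''))
```

Returning `None` for an empty string is why `ncomments` ends up as a float column later on: pandas stores the missing values as NaN.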

In [38]:
def process_paragraph(par,swords):
    '''takes one paragraph (string); performs lower, tokenize, remove punctuation/stop words'''
    par = par.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(par)
    nstop_tokens = [t for t in tokens if (t not in swords and t not in string.punctuation)]
    return nstop_tokens                      

def process_text_paragraphs(atext,origdb,swords):
    '''atext is a list or series of text strings (one from each article), 
    origdb is a list or series of corresponding original database IDs (ints from 0-4)'''
    # initial text split
    alist = [initial_text_split(a,int(o)) for a,o in zip(atext,origdb)]
        
    # remove very long articles
    removed_articles = [aix for aix,a in enumerate(alist) if len(a)>=250]
    alist = [a for a in alist if len(a)<250]
    
    # process each paragraph
    alist = [[process_paragraph(p,swords) for p in a] for a in alist]
    
    return [alist,removed_articles]

def process_text_sentences(atext,origdb,swords):
    '''same processing, but does a sentence breakup rather than paragraph'''
    # initial text split
    alist = [initial_text_split(a,int(o)) for a,o in zip(atext,origdb)]
    
    # remove very long articles
    removed_articles = [aix for aix,a in enumerate(alist) if len(a)>=250]
    alist = [a for a in alist if len(a)<250]
    
    # change paragraph splits to sentence splits
    alist = [plist_to_slist(plist) for plist in alist]
    
    # process each sentence
    alist = [[process_paragraph(p,swords) for p in a] for a in alist]
    
    return [alist,removed_articles]

def initial_text_split(article_text,origdb):
    '''takes article text and original db and performs appropriate splitting'''
    if origdb in [1,2,3]:
        # split into paragraphs
        plist = article_text.split('/n')

        # remove \n symbols from within words
        plist = [p.replace('\n','') for p in plist]
    
    else:
        # split into paragraphs
        plist = article_text.split('\n')
        
    return plist

def plist_to_slist(plist):
    '''changes a list of paragraphs to a list of sentences'''
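    # split on sentence-ending punctuation; inside [] these characters need no
    # escaping, and the previous '/' characters also split on literal slashes
    spl = [re.split(r'[.!?]',par) for par in plist]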
    return [s for p in spl for s in p if len(s)>1]
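
To make the processing steps concrete, here is a minimal standard-library sketch of sentence splitting followed by tokenization. It assumes a tiny hypothetical stop-word set (`SWORDS`) in place of the nltk English list, and uses `re.findall(r'\w+', ...)`, which should behave like `RegexpTokenizer(r'\w+')`. The helper names `split_sentences` and `tokenize` are illustrative only.

```python
import re

# Hypothetical stop-word set; the notebook uses nltk's English list instead.
SWORDS = {'the', 'a', 'of', 'is', 'it', 'and', 'what'}

def split_sentences(par):
    """Split a paragraph on '.', '!' and '?', dropping fragments of length <= 1."""
    return [s for s in re.split(r'[.!?]', par) if len(s) > 1]

def tokenize(sentence, swords):
    """Lowercase, keep runs of word characters, and drop stop words."""
    tokens = re.findall(r'\w+', sentence.lower())
    return [t for t in tokens if t not in swords]

par = "A hacker stole the Ether. What happened? It is a long story!"
sentences = [tokenize(s, SWORDS) for s in split_sentences(par)]
print(sentences)
# [['hacker', 'stole', 'ether'], ['happened'], ['long', 'story']]
```

Because `\w+` never matches punctuation, the tokens come out punctuation-free without a separate filtering step.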


In [309]:
df[df.url == 'https://medium.freecodecamp.org/a-hacker-stole-31m-of-ether-how-it-happened-and-what-it-means-for-ethereum-9e5dc29e33ce']

postid,title,popdate,url,userid,username,highlight,nlikes,ncomments,ntags,origdb,tags,text,npar
9e5dc29e33ce,A hacker stole $31M of Ether — how it happened...,2017-07-21,https://medium.freecodecamp.org/a-hacker-stole...,8bc4e5f8b505,Haseeb Qureshi,"Blaming mistakes on individuals is pointless, ...",2.2K,100,5.0,4.0,"Ethereum,Blockchain,Security,Technology,Startup","Yesterday, a hacker pulled off the second bigg...",104.0


<b>Now we will use these functions to format the text</b>

In [187]:
# define stop word corpus and process text
swords = stopwords.words('english')
processing_output = process_text_sentences(df.text,df.origdb,swords)
ptext = processing_output[0]


In [188]:
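# drop too-long articles from the dataframe
# (ptext already excludes them: process_text_sentences filters before returning,
#  so deleting from ptext again would remove the wrong articles)
removed_articles = processing_output[1]
dfDrop = df.drop(df.index[removed_articles])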


In [189]:
# process highlights and titles
htext = [plist_to_slist([hilite]) for hilite in dfDrop.highlight]
htext = [[process_paragraph(hsent,swords) for hsent in art] for art in htext]
ttext = [plist_to_slist([title]) for title in dfDrop.title]
ttext = [[process_paragraph(tsent,swords) for tsent in art] for art in ttext]


<b>Now, we will LABEL each sentence as 1 or 0 (in highlight or not)</b>

In [190]:
# LABEL whether each sentence is in the highlight
plabel = []
for a,art in enumerate(ptext):
    alabel = []
    for s in art:
        alabel.append(any([(s==hs) for hs in htext[a]]))
    plabel.append(alabel)
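
The labeling rule is an exact match between a tokenized sentence and any tokenized highlight sentence from the same article. A minimal sketch on toy data (the `_toy` names are made up for illustration):

```python
# Toy data: one article with three tokenized sentences and one highlight sentence.
ptext_toy = [[['hola', 'desde', 'cuba'],
              ['today', 'air', 'force', 'one', 'touched', 'havana'],
              ['remarkable', 'moment']]]
htext_toy = [[['today', 'air', 'force', 'one', 'touched', 'havana']]]

# A sentence is labeled True when its token list equals one of the highlight's.
plabel_toy = [[sent in highlights for sent in art]
              for art, highlights in zip(ptext_toy, htext_toy)]
print(plabel_toy)  # [[False, True, False]]
```

Since the comparison is on whole token lists, a highlight that covers only part of a sentence, or spans a sentence boundary differently from the article text, will not produce a match.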


In [191]:
# find articles with no highlight in ptext
len(plabel)
h_in_ptext = [any(a) for a in plabel]
print(len(plabel) - sum(h_in_ptext))
print(sum(h_in_ptext))

1457
3183


Note that about 31% of the articles with a highlight (1,457 of 4,640) do NOT contain the highlight in the p-text (i.e., in the text scraped from p tags in the HTML).
<br><br>
These can still be used as negative examples. We will check to see if there is a corresponding positive example, and if not, use the highlight itself.
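
One way this fallback could look is sketched below. The helper `add_missing_highlights` is hypothetical and not part of the notebook; it assumes per-article lists of tokenized sentences, boolean labels, and tokenized highlight sentences.

```python
def add_missing_highlights(sentences, labels, highlights):
    """Return copies of sentences/labels. If no sentence was labeled True,
    append the highlight sentences themselves as positive examples."""
    sentences, labels = list(sentences), list(labels)
    if not any(labels):
        for h in highlights:
            sentences.append(h)
            labels.append(True)
    return sentences, labels

sents = [['always', 'wanted', 'make'], ['grew', 'dreaming']]
labs = [False, False]
hls = [['make', 'makes']]
new_sents, new_labs = add_missing_highlights(sents, labs, hls)
print(new_labs)  # [False, False, True]
```

A caveat with this approach: an appended highlight has no true position in the article, so any position-based feature for it would need a placeholder or a separate treatment.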

Finally, we get the dataframe together...

In [192]:
# format rows in dataframe
dfDrop.nlikes = dfDrop.nlikes.map(convert_K)
dfDrop.ncomments = dfDrop.ncomments.map(convert_str)


In [219]:
# add processed text to df
dfDrop.text = ptext
dfDrop.highlight = htext
dfDrop.title = ttext
dfDrop['label'] = plabel


In [220]:
# add article wcount to df
wcount = [sum([len(par) for par in art]) for art in ptext]
dfDrop['wcount'] = wcount

Training data will include (about) 80% of the articles in the dataframe. 
First, we separate out a random set of these.

In [221]:
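# choose test/train sets
# NOTE: as written, this holds out a single randomly chosen article,
# not the ~20% of articles described above
test_ix = randint(0,999)
dfTest = dfDrop.iloc[[test_ix]]   # iloc is indexed with [], not called with ()
dfTrain = dfDrop.drop(dfDrop.index[test_ix])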
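
An article-level 80/20 split could be sketched as follows, using only the standard library. The helper `split_ids`, the fixed seed, and the stand-in ids are assumptions for illustration.

```python
import random

def split_ids(ids, test_frac=0.2, seed=0):
    """Randomly assign about test_frac of the ids to the test set."""
    ids = list(ids)
    rng = random.Random(seed)            # fixed seed so the split is repeatable
    n_test = round(test_frac * len(ids))
    test_set = set(rng.sample(ids, n_test))
    train_ids = [i for i in ids if i not in test_set]
    test_ids = [i for i in ids if i in test_set]
    return train_ids, test_ids

post_ids = ['id%02d' % k for k in range(10)]   # stand-in for dfDrop.index
train_ids, test_ids = split_ids(post_ids)
print(len(train_ids), len(test_ids))  # 8 2
```

With the dataframe, `dfDrop.loc[test_ids]` and `dfDrop.drop(test_ids)` would then select the two sets. Splitting by article rather than by sentence keeps all sentences from one article on the same side of the split.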

<b>Set up training data</b>

In [222]:
dfTrain.head()

postid,title,popdate,url,userid,username,highlight,nlikes,ncomments,ntags,origdb,tags,text,npar,wcount,label
1015a0f4961d,"[[day, one, president, obama, first, family, l...",2016-03-21,https://medium.com/@ObamaWhiteHouse/day-one-pr...,ca9f8f16893b,The Obama White House,"[[today, air, force, one, touched, havana, fir...",336,15.0,3.0,3.0,"Cuba,Twitter,Cuba Trip","[[hola, desde, cuba], [today, air, force, one,...",20.0,522,"[False, True, False, False, False, False, Fals..."
101a407e8c61,"[[make, makes]]",2016-06-02,https://medium.com/the-mission/you-dont-make-i...,5ce28105ffbc,Jon Westenberg,"[[make, makes]]",549,37.0,3.0,3.0,"Entrepreneurship,Startup,Life","[[always, wanted, make], [grew, dreaming, rock...",21.0,393,"[False, False, False, False, False, False, Fal..."
1030d29376f1,"[[ux, infinite, scrolling, vs], [pagination]]",2016-05-02,https://uxplanet.org/ux-infinite-scrolling-vs-...,bcab753a4d4e,Nick Babich,"[[instances, infinite, scrolling, effective], ...",1910,46.0,4.0,3.0,"UX,Design,User Experience,UX Design","[[use, infinite, scrolling, pagination, conten...",34.0,850,"[False, False, False, False, False, False, Fal..."
10315016b299,"[[lesson, stereotypes]]",2016-08-20,https://medium.com/@mramsburg85/a-lesson-on-st...,d38709ba4e06,Michael Ramsburg,"[[stereotypes, strip, culture, like, mountains...",583,103.0,5.0,3.0,"Stereotypes,Appalachia,Culture,Essay,Opinion","[[stereotypes], [mrs], [mitchell, sixth, grade...",12.0,381,"[False, False, False, False, False, False, Fal..."
10321e751c6d,"[[republican, never, trump, means]]",2016-07-30,https://medium.com/@ccmccain/for-this-republic...,4e965facd5f9,Caroline McCain,"[[trump, statement, view, unforgivable, speaks...",2500,302.0,5.0,3.0,"Hillary Clinton,Donald Trump,Never Trump,2016 ...","[[know, know, woman, fiercely, loyal, friends,...",45.0,993,"[False, False, False, False, False, False, Fal..."


<b>Sentence-wise split:</b> We set up the data by sentence...

In [225]:
# set up a dataframe for sentences...
dfS = pd.DataFrame()

for art in dfTrain.index:
    slist = dfTrain.text[art]
    for nsent in range(len(slist)):
        sent = slist[nsent]
        dfRow = pd.DataFrame([art,sent,len(sent),nsent,len(slist),dfTrain.label[art][nsent]])
        dfRow = dfRow.T
        dfRow.columns = ['postid','sentence','swcount','sposition','alength','slabel']
        dfS = pd.concat([dfS,dfRow])
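
Concatenating one row at a time copies the growing dataframe on every iteration, which gets slow for several hundred thousand sentences. A common alternative is to collect plain rows in a list and build the dataframe once at the end. The sketch below shows only the row-collection part on toy data; `sentence_rows` and `COLUMNS` are hypothetical names.

```python
COLUMNS = ['postid', 'sentence', 'swcount', 'sposition', 'alength', 'slabel']

def sentence_rows(postid, sentences, labels):
    """One tuple per sentence, in the column order used for dfS."""
    alength = len(sentences)
    return [(postid, sent, len(sent), pos, alength, lab)
            for pos, (sent, lab) in enumerate(zip(sentences, labels))]

# Toy article standing in for one row of dfTrain
rows = []
rows.extend(sentence_rows('1015a0f4961d',
                          [['hola', 'desde', 'cuba'], ['today', 'air', 'force']],
                          [False, True]))
print(rows[1])
```

`pd.DataFrame(rows, columns=COLUMNS)` would then build the frame in a single call. Note that the result would have a running integer index and inferred column dtypes, unlike the all-zero index produced by the loop above.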
    

In [227]:
print(dfS.shape)
dfS.head()

(434741, 6)


,postid,sentence,swcount,sposition,alength,slabel
0,1015a0f4961d,"[hola, desde, cuba]",3,0,52,False
0,1015a0f4961d,"[today, air, force, one, touched, havana, firs...",9,1,52,True
0,1015a0f4961d,"[question, remarkable, moment, relationship, u...",9,2,52,False
0,1015a0f4961d,"[also, landmark, progress, made, since, presid...",29,3,52,False
0,1015a0f4961d,"[trip, also, professionally, personally, meani...",19,4,52,False


In [228]:
dfS.to_sql('sentences_train',dbe)

<b>Adding article data:</b> We perform a merge to add article-wise data...

In [296]:
# left merge dfS -> dfTrain on postid

dfX = pd.merge(dfS, dfTrain[['title','nlikes','npar','wcount']], 
               how='left', left_on='postid', left_index=False, right_index=True, sort=False)
print(dfX.shape)
nsamp = dfX.shape[0]
dfX.isnull().any()
#dfX.head()

(434741, 10)


postid       False
sentence     False
swcount      False
sposition    False
alength      False
slabel       False
title        False
nlikes       False
npar         False
wcount       False
dtype: bool

In [297]:
dfX.loc[dfX.slabel].shape[0]

5920

<b>Now, we set up the model itself.</b>

In [298]:
# scikitlearn logistic regression... fit (with 2/3 of X, y from above)

dfY = dfX['slabel']
dfX = dfX[['swcount','sposition','alength','nlikes','wcount']]

In [305]:
spl = math.floor(2*nsamp/3)
ytrain = dfY.iloc[0:spl].astype(int)
Xtrain = dfX.iloc[0:spl]
ytest = dfY.iloc[spl:].astype(int)
Xtest = dfX.iloc[spl:]
print(ytrain.shape)
print(Xtrain.shape)
lrm = linear_model.LogisticRegression()
lrm.fit(Xtrain,ytrain)

(289827,)
(289827, 5)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [307]:
# test on the 2/3 of training set
print(lrm.score(Xtrain,ytrain), 1 - ytrain.mean())

# test on the other 1/3 of X, y
print(lrm.score(Xtest,ytest), 1 - ytest.mean())

0.98661270344 0.98661270344
0.985922685179 0.985922685179


Accuracy on both splits equals the fraction of negative samples, so the model is predicting "no highlight" for every sample. This is what I expected, given how unbalanced the data is (5,920 positive sentences out of 434,741). I will try balancing techniques (just as soon as I get Flask to work...)
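
To see why accuracy alone is misleading here, the following toy sketch compares accuracy and recall for a predictor that always outputs the negative class. The data and helper functions are made up for illustration; scikit-learn's `metrics` module offers equivalents such as `recall_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives that the predictions recover."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)

y_true = [1] * 2 + [0] * 98      # toy labels: 2% positive
y_pred = [0] * 100               # a model that always predicts "no highlight"
print(accuracy(y_true, y_pred), recall(y_true, y_pred))  # 0.98 0.0
```

The accuracy matches the negative-class rate exactly while recall is zero, which is the same pattern as in the scores above.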


In [None]:
# make ROC curve to compare thresholds
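
A minimal pure-Python sketch of what this cell might compute: ROC points swept over score thresholds, plus the area under the curve by the trapezoid rule. The function names and toy scores are assumptions; in practice `sklearn.metrics.roc_curve` applied to `lrm.predict_proba(Xtest)[:, 1]` would be the more likely route.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, sweeping the threshold from high to low.
    Assumes y_true contains at least one positive and one negative."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    order = sorted(zip(scores, y_true), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        thr = order[i][0]
        # consume all samples tied at this score before emitting a point
        while i < len(order) and order[i][0] == thr:
            if order[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))

fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(list(zip(fpr, tpr)), auc(fpr, tpr))
```

Unlike accuracy at the default 0.5 threshold, the curve shows whether the predicted probabilities rank highlighted sentences above the rest, even when no sentence crosses the threshold.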

<b>Adding New Features</b>

In [None]:
# make tf-idf vectors

In [None]:
# compute average tf-idf score for each sentence

In [None]:
# cosine similarity of sentence to article

In [None]:
# cosine similarity of sentence to title
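
The tf-idf and cosine-similarity features could be prototyped without any library, as sketched below. This assumes raw term counts and a smoothed idf of `log((1 + N) / (1 + df)) + 1`, and treats the title as one more document. All names and the toy data are illustrative; scikit-learn's `TfidfVectorizer` would be the usual choice on the full data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per doc,
    using raw term frequency times a smoothed idf."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sents = [['infinite', 'scrolling', 'effective'],
         ['users', 'like', 'pagination'],
         ['hello', 'world']]
title = ['infinite', 'scrolling', 'pagination']

vecs = tfidf_vectors(sents + [title])
sims = [cosine(v, vecs[-1]) for v in vecs[:-1]]   # sentence-to-title similarity
print(sims)
```

The first sentence shares two title words and scores highest, the second shares one, and the third shares none and scores zero.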

In [None]:
# find sentiment of each sentence

In [None]:
# find sentiment of each article

<b>Balancing Data</b>

In [None]:
# Downsample negative sentences so ratio is ~2:1 neg:pos
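
Random downsampling to a fixed negative-to-positive ratio could look like the sketch below. The helper `downsample_negatives`, the seed, and the toy labels are assumptions for illustration.

```python
import random

def downsample_negatives(labels, ratio=2, seed=0):
    """Return sorted row indices that keep every positive and at most
    ratio * n_positive randomly chosen negatives."""
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    rng = random.Random(seed)
    keep_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    return sorted(pos + keep_neg)

labels = [1] * 3 + [0] * 97          # toy labels: 3 positives, 97 negatives
idx = downsample_negatives(labels)
print(len(idx), sum(labels[i] for i in idx))  # 9 3
```

The returned indices could be passed to `dfX.iloc[idx]` and `dfY.iloc[idx]`. Downsampling should be applied to the training portion only, so that the test set keeps the original class balance. Another option worth comparing is the `class_weight='balanced'` setting of `LogisticRegression`, which reweights classes instead of discarding data.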

In [None]:
# Upsample positive sentence with data augmentation

# ideas: do they cluster in tf-idf space? try jitter...
#        shuffle/swap words from other highlights?
#        redistribute among highlights in the same article...