<h3>Training and Testing a Classification Model</h3>

In this notebook, I will build training, test, and validation sets of sentences from Medium articles. I will label each sentence and extract features, then train a model and cross-validate it.

In [18]:
# import required libraries

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2
import numpy as np
import matplotlib.pyplot as plt
import string
import re

<b>Step One:</b> Separate out training from test sets.

In [2]:
# connect to postgresql db
username = 'kimberly'
dbname = 'medium'

# SQLAlchemy 1.4+ only accepts the 'postgresql' dialect name in the URL
dbe = create_engine('postgresql://%s@localhost/%s'%(username,dbname))


In [3]:
# get df, drop missing data
df = pd.read_sql('articles', dbe, index_col='postid')
df = df.dropna(axis=0,how='any')


<b>Functions to format rows of the df and to process text</b>

In [4]:
# functions to convert nlikes and ncomments strings to integers
def convert_K(nstr):
    '''converts a count string such as '87' or '1.2K' to an int'''
    spl = nstr.split('K')
    if len(spl)==1:
        return int(float(spl[0]))
    else:
        return int(float(spl[0])*1000)
    
def convert_str(nstr):
    '''converts a count string such as '1,234' to an int; returns None for an empty string'''
    nstr = nstr.replace(',','')
    if nstr=='':
        return None
    else:
        return int(nstr)
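
A quick check of the two converters on a few inputs. This is a minimal sketch: the sample strings are made up rather than read from the `articles` table, and which column uses the `K` suffix versus comma separators is an assumption.

```python
def convert_K(nstr):
    """Convert a count string such as '87' or '1.2K' to an int."""
    spl = nstr.split('K')
    if len(spl) == 1:
        return int(float(spl[0]))
    return int(float(spl[0]) * 1000)


def convert_str(nstr):
    """Convert a count string such as '1,234' to an int; None if empty."""
    nstr = nstr.replace(',', '')
    if nstr == '':
        return None
    return int(nstr)


# Hypothetical sample values, not taken from the database
sample_likes = ['87', '1.2K', '3K']
sample_comments = ['1,234', '12', '']

likes = [convert_K(s) for s in sample_likes]
comments = [convert_str(s) for s in sample_comments]
print(likes)
print(comments)
```

Note that `convert_str` returns `None` for an empty string, so a column converted this way can still contain missing values.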

In [38]:
def process_paragraph(par,swords):
    '''takes one paragraph or sentence (string); lowercases, tokenizes on word characters, and removes stop words and punctuation'''
    par = par.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(par)
    nstop_tokens = [t for t in tokens if (t not in swords and t not in string.punctuation)]
    return nstop_tokens                      

def process_text_paragraphs(atext,origdb,swords):
    '''atext is a list or series of text strings (one from each article),
    origdb is a list or series of corresponding original database IDs (ints from 0-4)'''
    # initial text split
    alist = [initial_text_split(a,int(o)) for a,o in zip(atext,origdb)]
        
    # remove very long articles
    removed_articles = [aix for aix,a in enumerate(alist) if len(a)>=250]
    alist = [a for a in alist if len(a)<250]
    
    # process each paragraph
    alist = [[process_paragraph(p,swords) for p in a] for a in alist]
    
    return [alist,removed_articles]

def process_text_sentences(atext,origdb,swords):
    '''same processing, but does a sentence breakup rather than paragraph'''
    # initial text split
    alist = [initial_text_split(a,int(o)) for a,o in zip(atext,origdb)]
    
    # remove very long articles
    removed_articles = [aix for aix,a in enumerate(alist) if len(a)>=250]
    alist = [a for a in alist if len(a)<250]
    
    # change paragraph splits to sentence splits
    alist = [plist_to_slist(plist) for plist in alist]
    
    # process each sentence
    alist = [[process_paragraph(p,swords) for p in a] for a in alist]
    
    return [alist,removed_articles]

def initial_text_split(article_text,origdb):
    '''takes article text and original db and performs appropriate splitting'''
    if origdb in [1,2,3]:
        # split into paragraphs
        plist = article_text.split('/n')

        # remove \n symbols from within words
        plist = [p.replace('\n','') for p in plist]
    
    else:
        # split into paragraphs
        plist = article_text.split('\n')
        
    return plist

def plist_to_slist(plist):
    '''changes a list of paragraphs to a list of sentences'''
    # split on sentence-ending punctuation (., !, ?)
    spl = [re.split(r'[.!?]',par) for par in plist]
    return [s for p in spl for s in p if len(s)>1]


[['hola', 'desde', 'cuba'], ['today', 'air', 'force', 'one', 'touched', 'havana', 'first', 'time', 'history'], ['question', 'remarkable', 'moment', 'relationship', 'united', 'states', 'cuba', 'governments', 'people'], ['also', 'landmark', 'progress', 'made', 'since', 'president', 'obama', 'decided', 'reform', 'failed', 'cold', 'war', 'era', 'policies', 'past', 'chart', 'new', 'course', 'would', 'actually', 'advance', 'american', 'interests', 'values', 'help', 'cuban', 'people', 'improve', 'lives'], ['trip', 'also', 'professionally', 'personally', 'meaningful', 'special', 'assistant', 'advisor', 'antoinette', 'rangel', 'cuban', 'american', 'learned', 'country', 'stories', 'abuela', 'maria', 'shared', 'growing'], ['good', 'illustration', 'closely', 'two', 'countries', 'linked', 'trip', 'potential', 'change', 'lives', 'families', 'cuba', 'united', 'states'], ['looking', 'forward', 'learning', 'first', 'hand', 'cuban', 'culture', 'life', 'bringing', 'white', 'house', 'press', 'corps', 'alo
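
The sentence path (split into sentences, lowercase, tokenize, drop stop words) can be traced on one made-up paragraph. The sketch below uses only the standard library: `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`, and `SAMPLE_STOPWORDS` is a small hypothetical stand-in for NLTK's English stop word list, so its output will not match the real pipeline token for token. The function names `split_sentences` and `tokenize` are also illustrative, not part of the notebook.

```python
import re

# Hypothetical, deliberately tiny stand-in for stopwords.words('english')
SAMPLE_STOPWORDS = {"the", "in", "a", "for", "is", "of"}


def split_sentences(paragraphs):
    """Split each paragraph on ., ! or ? and drop fragments of length <= 1."""
    pieces = [re.split(r"[.!?]", par) for par in paragraphs]
    return [s for p in pieces for s in p if len(s) > 1]


def tokenize(sentence, stop_words):
    """Lowercase, keep runs of word characters, and drop stop words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in stop_words]


# Made-up input paragraph
paragraphs = ["Hola desde Cuba! Air Force One touched down in Havana for the first time."]

sentences = split_sentences(paragraphs)
processed = [tokenize(s, SAMPLE_STOPWORDS) for s in sentences]
print(processed)
```

Each inner list is one sentence, which is the shape `process_text_sentences` produces per article.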

<b>Now we will use these functions to format the text</b>

In [50]:
# define stop word corpus and process text
swords = stopwords.words('english')
processing_output = process_text_sentences(df.text,df.origdb,swords)
ptext = processing_output[0]


In [51]:
# drop too-long articles
removed_articles = processing_output[1]
dfDrop = df.drop(df.index[removed_articles])


In [79]:
# process highlights and titles
htext = [plist_to_slist([hilite]) for hilite in dfDrop.highlight]
htext = [[process_paragraph(hsent,swords) for hsent in art] for art in htext]
ttext = [plist_to_slist([title]) for title in dfDrop.title]
ttext = [[process_paragraph(tsent,swords) for tsent in art] for art in ttext]


In [85]:
# LABEL whether each sentence is in the highlight

for art in htext:
    

[['ux', 'infinite', 'scrolling', 'vs'], ['pagination']]
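
One possible way to do the labeling, shown only as a sketch and not as the method this notebook ends up using: mark a sentence 1 when its token list exactly matches the token list of some highlight sentence, and 0 otherwise. The function name `label_sentences` and the toy data are hypothetical.

```python
def label_sentences(article_sentences, highlight_sentences):
    """Return one 0/1 label per article sentence.

    A sentence gets 1 when its token list exactly matches the token list of
    some highlight sentence, otherwise 0. Empty highlight sentences are ignored.
    """
    highlight_set = {tuple(h) for h in highlight_sentences if h}
    return [1 if tuple(s) in highlight_set else 0 for s in article_sentences]


# Toy data in the same shape as one element of ptext and one element of htext
article = [
    ["design", "matters"],
    ["users", "prefer", "simple", "pages"],
    ["thanks", "reading"],
]
highlight = [["users", "prefer", "simple", "pages"]]

labels = label_sentences(article, highlight)
print(labels)
```

Exact matching is the strictest choice. A highlight that covers only part of a sentence would get no positive label under this rule, so an overlap-based criterion may be a better fit depending on how highlights were captured.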