# Stage 2 - extracting structured information from raw data

In this stage of the project, our goals are to 1) develop an extractor that achieves at least precision of 90% or higher and as high recall as possible, but at least 50% in recall, 2) get practice with cross-validation

Our data sources can be found on the main *readme.md* of this GitHub folder. We have gone through and tagged positive and negative examples in our raw data. Our objective is to properly extract movie titles from movie reviews. Our pre-processing steps included extracting the body of the text from the raw html files. From there we developed a unique tag which we used to label positive examples.

We split up our data into three sets: set 1, set2, set 3. We designate set 3 as our final testing set. 

We start with set 1. Based on the positive examples and trends we identified during the labeling of positive examples, we generate features. We then run k-fold CV on set 1 using a selection of classifiers (Decision Tree, Logisitc, SVM, Random Forest, Extra Trees). Here we identify baseline precision, recall, and F1. 

We then test this on set 2 and explore false negative and false positives in set 2. Based on these, we modify our rules/features. We then re-run k-fold CV on set 1. We do this iteratively. Once we see an improvement in set1, we finally test on set 3 and report the results as well as our best-performing classifier.

In [1]:
#Import Necessary Modules
import os
import sys
import re
import time
import shutil
import collections
import pandas as pd
from sklearn import tree
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import scipy
import matplotlib
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import svm
from sklearn.model_selection import KFold
import random

## Importing Data and Feature Generation

### Importing Data

In [2]:
def exampleGen(newpath):
    # max number of words to get before and after the title
    MAXWORDS = 3
    # max number of negtive examples from each document
    MAXNEG = 10
    # max number of words in negtive examples
    MAXNEGWORDS = 3

    # output list of examples for feature generation
    # use this as the input of feature generation
    # format: (title, wordsbefore, wordsafter, class)
    outlist = []
    exid = 0
    ouputfile = open("examples.txt",'w')

    pos_num = 0
    neg_num = 0
    badfile_num = 0
    file_num = 0

    # path to the html folder
    
    files = os.listdir(newpath)
    for file in files:
        with open(newpath+file) as f:

            file_num += 1
            # wrap all lines without new line charactar
            lines=""
            for line in f:
                lines += line.strip()
            words = lines.split(" ")

            title = ""
            within = False
            wordsprocessed = []
            badfile = False
            for i in range(len(words)):
                tmplist = []
                word = words[i]
                if word == '': continue
                
                pos1 = word.find("<myy>")
                pos2 = word.find("</myy>")
                # if no tag found
                if pos1 == -1 and pos2 == -1:
                    wordsprocessed.append((word,within))
                # if both are found
                if pos1 >= 0 and pos2 >= 0:
                    tmplist = word.split("<myy>")
                    if len(tmplist)>2 or pos2 < pos1:
                        badfile = True
                        break
                    if tmplist[0]!='':
                        wordsprocessed.append((tmplist[0],within))
                    within = True
                    tmplist = tmplist[1].split("</myy>")
                    wordsprocessed.append((tmplist[0],within))
                    within = False
                    if tmplist[1]!='':
                        wordsprocessed.append((tmplist[1],within))
                    continue
                # if only start tag
                if pos1 >= 0:
                    tmplist = word.split("<myy>")
                    if len(tmplist)>2:
                        badfile = True
                        break
                    if tmplist[0]!='':
                        wordsprocessed.append((tmplist[0],within))
                    within = True
                    wordsprocessed.append((tmplist[1],within))
                    continue
                # if only end tag
                if pos2 >= 0:
                    tmplist = word.split("</myy>")
                    if len(tmplist)>2:
                        badfile = True
                        break
                    wordsprocessed.append((tmplist[0],within))
                    within = False
                    if tmplist[1]!='':
                        wordsprocessed.append((tmplist[1],within))

            if badfile is True:
                badfile_num += 1
                continue

            # get positive examples from the processed words
            wordsbefore = []
            wordsafter = []
            title = ""
            for i in range(len(wordsprocessed)):
                item = wordsprocessed[i]
                if item[1] is True:
                    # if it is the first word
                    if i == 0:
                        title += item[0]
                        while len(wordsbefore) < MAXWORDS:
                            wordsbefore.append("")
                    # if word before is also title
                    elif wordsprocessed[i-1][1]:
                        title += " " + item[0]
                    else:
                        title += item[0]
                        # get words before the title
                        for j in range(i-1,i-1-MAXWORDS,-1):
                            if j >= 0:
                                wordsbefore.append(wordsprocessed[j][0])
                            else:
                                # add empty element
                                while len(wordsbefore) < MAXWORDS:
                                    wordsbefore.append("")
                                break
                else:
                    # if the word before this is title
                    if i != 0 and wordsprocessed[i-1][1]:
                        # get words after the title
                        for j in range(i,i+MAXWORDS):
                            if j < len(wordsprocessed):
                                wordsafter.append(wordsprocessed[j][0])
                            else:
                                # add empty element
                                while len(wordsafter) < MAXWORDS:
                                    wordsafter.append("")
                                break
                        # add title and words before and after to output list
                        featuretuple = (exid,title, wordsbefore, wordsafter, True)
                        #print(featuretuple)
                        outlist.append(featuretuple)
                        exid += 1

                        # generate output string
                        # this part can be skipped if only want features
                        outstr = title + ", ["
                        for w in wordsbefore:
                            if w is None:
                                outstr += 'None '
                            else:
                                outstr += w + ' '
                        outstr += '], ['
                        for w in wordsafter:
                            if w is None:
                                outstr += 'None '
                            else:
                                outstr += w + ' '
                        outstr += '], T\n'
                        ouputfile.write(outstr)

                        pos_num += 1

                        title = ""
                        wordsbefore = []
                        wordsafter = []

            # get negtive examples
            wordsbefore = []
            wordsafter = []
            title = ""
            count = 0
            poslist = []
            # skip the document if there are not enough words in it
            if len(wordsprocessed) < 100:
                continue
            while count < MAXNEG:
                pos = random.randint(MAXWORDS+1, len(wordsprocessed)-MAXWORDS-MAXNEGWORDS)
                title_len = random.randint(1,MAXNEGWORDS)
                # get neg title
                alltrue = True
                for i in range(title_len):
                    if wordsprocessed[pos+i][1] is False:
                        alltrue = False
                    title += wordsprocessed[pos+i][0] + " "
                # if all words are in title
                if alltrue is True:
                    title = ""
                    continue
                title = title.strip()
                # get words before and after
                for i in range(0,MAXWORDS):
                    wordsbefore.append(wordsprocessed[pos-i-1][0])
                    wordsafter.append(wordsprocessed[pos+title_len+i][0])
                featuretuple = (exid,title, wordsbefore, wordsafter, False)
                #print(featuretuple)
                outlist.append(featuretuple)
                exid += 1

                # generate output string
                # this part can be skipped if only want features
                outstr = title + ", ["
                for w in wordsbefore:
                    if w is None:
                        outstr += 'None '
                    else:
                        outstr += w + ' '
                outstr += '], ['
                for w in wordsafter:
                    if w is None:
                        outstr += 'None '
                    else:
                        outstr += w + ' '
                outstr += '], F\n'
                ouputfile.write(outstr)


                neg_num += 1
                count += 1
                title=""
                wordsbefore = []
                wordsafter = []

    ouputfile.close()
    return outlist


In [3]:
path_set1 = '../../DataforProject/reviews/Yuying_tagged/'
path_set2 = '../../DataforProject/reviews/Maria_tagged/'
path_set3 = '../../DataforProject/reviews/Young_tagged/'

In [5]:
set1 = exampleGen(path_set1)
set2 = exampleGen(path_set2)
set3 = exampleGen(path_set3)

In [6]:
#Explore the set
set1[0:10]

[(0, 'LITTLE WOMEN', ['', '', ''], ['A', 'film', 'review'], True),
 (1,
  'LITTLE WOMEN',
  ['1868', "Alcott's", 'LousiaMay'],
  [',', 'based', 'in'],
  True),
 (2,
  'MY BRILLIANT CAREER',
  ['of', 'Armstrong', 'Gillian'],
  ['and', 'MRS.', 'SOFFEL'],
  True),
 (3,
  'MRS. SOFFEL',
  ['and', 'CAREER', 'BRILLIANT'],
  ['.', 'The', 'Marches'],
  True),
 (4,
  'INTERVIEW WITH THE VAMPIRE',
  ['from', 'child', 'a'],
  ['.', 'Among', 'other'],
  True),
 (5,
  'LITTLE WOMEN',
  ['And', 'HEAVENLYCREATURES.', "Jackson's"],
  ['sorely', 'needs', 'something'],
  True),
 (6, 'LITTLE WOMEN', ['give', 'I', 'warmfeeling.'], ['a', 'high', '+1'], True),
 (7,
  'both senses',
  ['homely---in', 'being', 'approaches'],
  ['of', 'the', 'word--in'],
  False),
 (8,
  'her imagination isseen',
  ['Here', 'adult.', 'an'],
  ['as', 'positive', 'and'],
  False),
 (9,
  'is staying. Thereis',
  ['she', 'where', 'house'],
  ['sad', 'family', 'tragedy,'],
  False)]

### Feature Generation

In [7]:
def isitcapitalized(testcase):
    """Is the label capitalized? Return True or False"""
    return testcase[1].isupper()


def isthereayear(testcase):
    """Is the YEAR (YYYY) format in the surrounding 3 words before and after? Return True or False"""
    combo = testcase[2] + testcase[3]
    joined = "".join(combo)
    yearsearch = re.compile(r"\d\d\d\d")
    if re.search(yearsearch, joined):
        return True
    else:
        return False
    
def hasCueWords(testcase,cueword=None):
    """Are any of the cuewords we determined found in the surrounding text? Return True or False"""
    combo = testcase[2] + testcase[3]
    #combo = combo.split(' ')
    if cueword:
        for word in combo:
            if word == cueword:
                return True
    else:
        for word in combo:
            if word in cuewords:
                return True
    return False

def hasCueSymbols(testcase,cuesymbol=None):
    """Are any of the cuesymbols we determined found in the surrounding text? Return True or False"""
    combo = testcase[2] + testcase[3]
    #combo = combo.split(' ')
    if cuesymbol:
        for symb in combo:
            if symb == cuesymbol:
                return True
    else:
        for symb in cuesymbols:
            return True
    return False


def isPositive(testcase):
    """Is this a positive or negative example? Return True or False"""
    return testcase[-1]

def featureGen(cases):
    """This function brings together all of our feature generation functions and
    exports a pandas dataframe which can be fed into the ML classifiers"""
    #Set Key words and Key symbols
    cuewords = {'in':True, 'made':True, 'review':True, 'film':True, 'recommend':True, "showing":True, 
                'released':True, "The": True}
    cuesymbols = {'*':True, '(':True, ')':True ,'"':True, "'":True, ':':True, 'II':True, 'III':True}
    
    for cueword in cuewords:
        cuewords[cueword] = [hasCueWords(case, cueword) for case in cases]  
        
    for cuesymb in cuesymbols:
        cuesymbols[cuesymb] = [hasCueSymbols(case, cuesymb) for case in cases]   
        
    #Create subdataframes
    cuewords_df = pd.DataFrame(cuewords)
    cuesymbols_df = pd.DataFrame(cuesymbols)
    
    #Put it all together into master dataframe
    features = pd.DataFrame(
        {'hasYear':[isthereayear(case) for case in cases],
         'isCap':[isitcapitalized(case) for case in cases],
         'hasCueWords':[hasCueWords(case, cuewords) for case in cases],    
         'isPositive':[isPositive(case) for case in cases]
        }
    )
    df = pd.concat([cuewords_df, cuesymbols_df, features], axis=1)
    return(df)

In [8]:
df_set1 = featureGen(set1)
df_set1.head(n=10)

Unnamed: 0,The,film,in,made,recommend,released,review,showing,"""",',(,),*,:,II,III,hasCueWords,hasYear,isCap,isPositive
0,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,True
1,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
3,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
6,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
7,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


## ML Models

In [9]:
def returnPrecision(y_truth, y_pred, classifier):
    """This function takes in an array of true observed and an array of predicted observed and outputs three 
    different measures for precision"""
    precision = {}
    precision["p_macro"] = precision_score(y_truth, y_pred, average='macro')
    precision["p_micro"] = precision_score(y_truth, y_pred, average='micro')
    precision["p_weight"] = precision_score(y_truth, y_pred, average='weighted')
    precision["p"] = precision_score(y_truth, y_pred, pos_label=True, average='binary')
    return pd.DataFrame(precision, index=[classifier])

def returnRecall(y_truth, y_pred, classifier):
    """This function takes in an array of true observed and an array of predicted observed and outputs three 
    different measures for precision"""
    recall = {}
    recall["r_macro"] = recall_score(y_truth, y_pred, average='macro')
    recall["r_micro"] = recall_score(y_truth, y_pred, average='micro')
    recall["r_weight"] = recall_score(y_truth, y_pred, average='weighted')
    recall["r"] = recall_score(y_truth, y_pred, pos_label=True, average="binary")
    return pd.DataFrame(recall, index=[classifier])

def returnF1(y_truth, y_pred, classifier):
    """This function takes in an array of true observed and an array of predicted observed and outputs three 
    different measures for F1"""
    F1 = {}
    F1["f1_macro"] = f1_score(y_truth, y_pred, average='macro')
    F1["f1_micro"] = f1_score(y_truth, y_pred, average='micro')
    F1["f1_weight"] = f1_score(y_truth, y_pred, average='weighted')
    F1["f1"] = f1_score(y_truth, y_pred, pos_label=True, average='binary')
    return pd.DataFrame(F1, index=[classifier])

def returnAccuracy(y_truth, y_pred, classifier):
    """This function takes in an array of true observed and an array of predicted observed and outputs accuracy"""
    accuracy = {}
    accuracy["accu"] = accuracy_score(y_truth, y_pred)
    return pd.DataFrame(accuracy, index=[classifier])

def returnMetrics(y_truth, y_pred, classifier):
    """Returns recall, precision, and accuracy as a pandas dataframe for each classifier"""
    rec = returnRecall(y_truth, y_pred, classifier)
    prec = returnPrecision(y_truth, y_pred, classifier)
    accu = returnAccuracy(y_truth, y_pred, classifier)
    f1 = returnF1(y_truth, y_pred, classifier)
    df = pd.concat([rec, prec, f1, accu], axis=1)
    return(df)

def runModels(X, y, kf, names, classifiers):
    """Function that runs over the classifiers and does k-fold cross-validation on the training set - returns metrics as dataframe"""
    df = pd.DataFrame([])
    for name, clf in zip(names, classifiers):
        for train_index, test_index in kf.split(X):
            clf.fit(X[train_index], y[train_index])
            predicted = clf.predict(X[test_index])
            data = returnMetrics(y[test_index], predicted, name)
            myscore = clf.score(X[test_index], y[test_index])
            score = pd.DataFrame({"score": myscore}, index = [name])
            ccat = pd.concat([data, score], axis = 1)
            df = df.append(ccat)
    return df

def runTruth(X, y, names, classifiers):
    """This function will run the ML classifiers on the testing set  - returns metrics as dataframe"""
    df = pd.DataFrame([])
    falses = []
    for name, clf in zip(names, classifiers):
        clf.fit(X, y)
        predicted = clf.predict(X)
        data = returnMetrics(y, predicted, name)
        df = df.append(data)
        for i in range(len(predicted)):
            if (predicted[i] != y[i]):
                outtuple = (i,predicted[i],y[i], name)
                falses.append(outtuple) 
    return (df, falses)
    

def MLmodels(X, Y, nsplits,crossv = True):
    """This function will run classifiers and export accuracy, precision, and recall for each classifier.
    N-fold cross validation has been automated here, where user determines number of folds"""
    ###########################################
    #Define classifiers
    ###########################################
    names = ["Decision Tree", "Logistic", "SVM", "Random Forest", "ExtraTrees"]
    classifiers = [ LogisticRegression(), tree.DecisionTreeClassifier(), svm.SVC(),
                   RandomForestClassifier(), ExtraTreesClassifier()]
    
    ###########################################
    #Create Train/Test Split
    ###########################################
    kf = KFold(n_splits = nsplits)
    
    ###########################################
    #Run Models
    ###########################################
    if(crossv == True): 
        return(runModels(X, y, kf, names, classifiers))
    else:
        return(runTruth(X, y, names, classifiers))
        
    

### Prepare Set 1 and run ML models on Set 1

In [11]:
X = df_set1.ix[:, df_set1.columns != 'isPositive'].values
y = np.ravel(df_set1[['isPositive']].values)

Since this is our first set, we want to use k-fold cross validation. We choose to use 10 folds here:

In [13]:
crossv = True
nfolds = 10

if crossv == True:
    result = MLmodels(X, y, nfolds,crossv)
else:
    (result,idlist) = MLmodels(X, y, nfolds,crossv)

In [15]:
result

Unnamed: 0,r,r_macro,r_micro,r_weight,p,p_macro,p_micro,p_weight,f1,f1_macro,f1_micro,f1_weight,accu,score
Decision Tree,1.0,0.994444,0.994505,0.994505,0.989247,0.994624,0.994505,0.994565,0.994595,0.994504,0.994505,0.994505,0.994505,0.994505
Decision Tree,1.0,0.994318,0.994505,0.994505,0.989474,0.994737,0.994505,0.994563,0.994709,0.994497,0.994505,0.994504,0.994505,0.994505
Decision Tree,1.0,0.995098,0.994505,0.994505,0.987654,0.993827,0.994505,0.994573,0.993789,0.994431,0.994505,0.994509,0.994505,0.994505
Decision Tree,1.0,0.987654,0.989011,0.989011,0.980583,0.990291,0.989011,0.989224,0.990196,0.988848,0.989011,0.988996,0.989011,0.989011
Decision Tree,1.0,0.994382,0.994505,0.994505,0.989362,0.994681,0.994505,0.994564,0.994652,0.994501,0.994505,0.994505,0.994505,0.994505
Decision Tree,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Decision Tree,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Decision Tree,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Decision Tree,1.0,0.994048,0.994475,0.994475,0.989796,0.994898,0.994475,0.994532,0.994872,0.994442,0.994475,0.994473,0.994475,0.994475
Decision Tree,1.0,0.996032,0.994475,0.994475,0.982143,0.991071,0.994475,0.994574,0.990991,0.993503,0.994475,0.994489,0.994475,0.994475


At this point, we are doing pretty well with respect to our precision/recall targets. We move on and test on set 2:

### Prepare Set 2 and run ML models on Set 2

In [16]:
df_set2 = featureGen(set2)
df_set2.head(n=10)

Unnamed: 0,The,film,in,made,recommend,released,review,showing,"""",',(,),*,:,II,III,hasCueWords,hasYear,isCap,isPositive
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
2,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,True,True
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True
6,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [17]:
X = df_set2.ix[:, df_set2.columns != 'isPositive'].values
y = np.ravel(df_set2[['isPositive']].values)

Here, we do not use CV and solely explore what our false positive/false negative misclassifications are:

In [18]:
crossv = False
nfolds = 10

if crossv == True:
    result = MLmodels(X, y, nfolds, crossv)
else:
    (result,idlist) = MLmodels(X, y, nfolds,crossv)

In [20]:
result

Unnamed: 0,r,r_macro,r_micro,r_weight,p,p_macro,p_micro,p_weight,f1,f1_macro,f1_micro,f1_weight,accu
Decision Tree,0.807281,0.886581,0.909643,0.909643,0.928571,0.914889,0.909643,0.910911,0.863688,0.898057,0.909643,0.908051,0.909643
Logistic,0.860814,0.903936,0.916477,0.916477,0.899329,0.912308,0.916477,0.916083,0.87965,0.907848,0.916477,0.916049,0.916477
SVM,0.773019,0.871216,0.899772,0.899772,0.932817,0.909419,0.899772,0.902615,0.845433,0.885638,0.899772,0.89733,0.899772
Random Forest,0.862955,0.904419,0.916477,0.916477,0.89755,0.911909,0.916477,0.916084,0.879913,0.907942,0.916477,0.916094,0.916477
ExtraTrees,0.860814,0.903936,0.916477,0.916477,0.899329,0.912308,0.916477,0.916083,0.87965,0.907848,0.916477,0.916049,0.916477


It looks like both our precision and recall fell a little bit. Let us explore our misclassified labels.

In [21]:
for i in idlist:
    print(set2[i[0]], i[1], i[2], i[3])

(20, '"Alien"', ['the', 'of', 'any'], ['films.Weaver,', 'perhaps', 'inspired'], True) False True Decision Tree
(107, "'s MUCH ADO ABOUT NOTHING", ['KennethBranagh', 'in', 'onscreen'], ['.', 'But', 'if'], True) False True Decision Tree
(123, 'ANNA KARENINAA film review', ['', '', ''], ['by', 'James', 'BerardinelliCopyright'], True) False True Decision Tree
(125, "'s (IMMORTAL BELOVED)", ['Rose', ',Bernard', 'KARENINA'], ['attempt', 'to', 'bring'], True) False True Decision Tree
(135, "'s HAMLET", ['Branagh', 'Kenneth', 'since'], ['.', 'Ballroominteriors,', 'landscapes,'], True) False True Decision Tree
(140, 'TO 10):', ['(0', 'BerardinelliRATING', 'James'], ['5.5Alternative', 'Scale:', '**'], False) True False Decision Tree
(166, 'I', ['though', '--', 'Gertrude'], ['must', "confessI'm", 'still'], False) True False Decision Tree
(181, '"Amadeus"', ['glance,', 'first', 'Forman.At'], ['seems', 'like', 'a'], True) False True Decision Tree
(182, '"Amadeus."', ['of', 'view', 'my'], ['After', 

Based on our error analysis, we add in some more rules:

In [34]:
#Amendments to Feature Generation Based on Error Analysis

def isitcapitalized(testcase):
    """Is the label capitalized? Return True or False"""
    return testcase[1].isupper()

def capitalletter(testcase):
    """Is there a capital letter followed by white space or capital letter +/- followed by white space - movie grades. Return True or False"""
    combo = testcase[2] + testcase[3]
    joined = "".join(combo)
    gradesearch = re.compile(r"[A-Z]\s|[A-Z+]\s|[A-Z-]\s")
    if re.search(gradesearch, joined):
        return True
    else:
        return False
    
def yearbyChar(testcase):
    """Is there a year in parentheses followed by any word character? Return True or False"""
    combo = testcase[2] + testcase[3]
    joined = "".join(combo)
    yearcharsearch = re.compile(r"(\d\d\d\d)\w")
    if re.search(yearcharsearch, joined):
        return True
    else:
        return False
    
    
def isthereayear(testcase):
    """Is the YEAR (YYYY) format in the surrounding 3 words before and after? Return True or False"""
    combo = testcase[2] + testcase[3]
    joined = "".join(combo)
    yearsearch = re.compile(r"\d\d\d\d")
    if re.search(yearsearch, joined):
        return True
    else:
        return False
    
def isthereayear_parenth(testcase):
    """Is the YEAR (YYYY) format in the surrounding 3 words before and after? Return True or False"""
    combo = testcase[2] + testcase[3]
    joined = "".join(combo)
    yearsearch = re.compile(r"(\d\d\d\d)")
    if re.search(yearsearch, joined):
        return True
    else:
        return False    
    
def hasCueWords(testcase,cueword=None):
    """Are any of the cuewords we determined found in the surrounding text? Return True or False"""
    combo = testcase[2] + testcase[3]
    if cueword:
        for word in combo:
            if word == cueword:
                return True
    else:
        for word in combo:
            if word in cuewords:
                return True
    return False

def hasCueSymbols(testcase,cuesymbol=None):
    """Are any of the cuesymbols we determined found in the surrounding text? Return True or False"""
    combo = testcase[2] + testcase[3]
    #combo = combo.split(' ')
    if cuesymbol:
        for symb in combo:
            if symb == cuesymbol:
                return True
    else:
        for symb in cuesymbols:
            return True
    return False


def isPositive(testcase):
    """Is this a positive or negative example? Return True or False"""
    return testcase[-1]

def featureGen(cases):
    """This function brings together all of our feature generation functions and
    exports a pandas dataframe which can be fed into the ML classifiers"""
    #Set Key words and Key symbols
    cuewords = {'in':True, 'made':True, 'review':True, 'film':True, 'recommend':True, "showing":True, 
                'released':True, "The": True, "Alien": True, "onscreen": True, "by": True, "Film": True, "Review":True,
               'Copyright': True, 'Starring':True, 'Sequel': True, '"\'s': True, "Directed": True, "Maria": False, 
               'Recommend':True, 'recommend':True, 'by': True}
    cuesymbols = {'*':True, '(':True, ')':True ,'"':True, "'":True, ':':True, 'II':True, 'III':True, '***': True,
                 '****': True, '@':False}
    
    for cueword in cuewords:
        cuewords[cueword] = [hasCueWords(case, cueword) for case in cases]  
        
    for cuesymb in cuesymbols:
        cuesymbols[cuesymb] = [hasCueSymbols(case, cuesymb) for case in cases]   
        
    #Create subdataframes
    cuewords_df = pd.DataFrame(cuewords)
    cuesymbols_df = pd.DataFrame(cuesymbols)
    
    #Put it all together into master dataframe
    features = pd.DataFrame(
        {'hasYear':[isthereayear(case) for case in cases],
         'hasYear_parenth':[isthereayear_parenth(case) for case in cases],
         'yearbyChar': [yearbyChar(case) for case in cases],
         'isCap':[isitcapitalized(case) for case in cases],
         'hasCueWords':[hasCueWords(case, cuewords) for case in cases],    
         'capitalLetter':[capitalletter(case) for case in cases],
         'isPositive':[isPositive(case) for case in cases]
        }
    )
    df = pd.concat([cuewords_df, cuesymbols_df, features], axis=1)
    return(df)

After modifying the features based on the FP/FN misclassifications from set 2, we go back and run k-fold CV on set 1:

In [36]:
df_set1_new = featureGen(set1)
df_set1_new.head(n=10)

Unnamed: 0,"""'s",Alien,Copyright,Directed,Film,Maria,Recommend,Review,Sequel,Starring,...,@,II,III,capitalLetter,hasCueWords,hasYear,hasYear_parenth,isCap,isPositive,yearbyChar
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,True,True,True
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [40]:
X = df_set1_new.ix[:, df_set1_new.columns != 'isPositive'].values
y = np.ravel(df_set1_new[['isPositive']].values)

In [41]:
crossv = True
nfolds = 10

if crossv == True:
    result = MLmodels(X, y, nfolds, crossv)
else:
    (result,idlist) = MLmodels(X, y, nfolds,crossv)

In [42]:
result

Unnamed: 0,r,r_macro,r_micro,r_weight,p,p_macro,p_micro,p_weight,f1,f1_macro,f1_micro,f1_weight,accu,score
Decision Tree,1.0,0.994444,0.994505,0.994505,0.989247,0.994624,0.994505,0.994565,0.994595,0.994504,0.994505,0.994505,0.994505,0.994505
Decision Tree,1.0,0.994318,0.994505,0.994505,0.989474,0.994737,0.994505,0.994563,0.994709,0.994497,0.994505,0.994504,0.994505,0.994505
Decision Tree,1.0,0.995098,0.994505,0.994505,0.987654,0.993827,0.994505,0.994573,0.993789,0.994431,0.994505,0.994509,0.994505,0.994505
Decision Tree,1.0,0.987654,0.989011,0.989011,0.980583,0.990291,0.989011,0.989224,0.990196,0.988848,0.989011,0.988996,0.989011,0.989011
Decision Tree,1.0,0.994382,0.994505,0.994505,0.989362,0.994681,0.994505,0.994564,0.994652,0.994501,0.994505,0.994505,0.994505,0.994505
Decision Tree,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Decision Tree,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Decision Tree,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Decision Tree,1.0,0.994048,0.994475,0.994475,0.989796,0.994898,0.994475,0.994532,0.994872,0.994442,0.994475,0.994473,0.994475,0.994475
Decision Tree,1.0,0.996032,0.994475,0.994475,0.982143,0.991071,0.994475,0.994574,0.990991,0.993503,0.994475,0.994489,0.994475,0.994475


We see improvement! Let us now finally run this on set 3:

### Prepare Set 3 and run ML Models on Set 3 (final test set)

In [44]:
df_set3 = featureGen(set3)
df_set3.tail(n=10)

Unnamed: 0,"""'s",Alien,Copyright,Directed,Film,Maria,Recommend,Review,Sequel,Starring,...,@,II,III,capitalLetter,hasCueWords,hasYear,hasYear_parenth,isCap,isPositive,yearbyChar
1590,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1591,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1592,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1593,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1594,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1595,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1596,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1597,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1598,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1599,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [45]:
X = df_set3.ix[:, df_set3.columns != 'isPositive'].values
y = np.ravel(df_set3[['isPositive']].values)

In [46]:
crossv = False
nfolds = 10

if crossv == True:
    result = MLmodels(X, y, nfolds, crossv)
else:
    (result,idlist) = MLmodels(X, y, nfolds,crossv)

In [48]:
result

Unnamed: 0,r,r_macro,r_micro,r_weight,p,p_macro,p_micro,p_weight,f1,f1_macro,f1_micro,f1_weight,accu
Decision Tree,0.975385,0.981377,0.9825,0.9825,0.981424,0.982326,0.9825,0.982495,0.978395,0.981845,0.9825,0.982491,0.9825
Logistic,0.981538,0.984453,0.985,0.985,0.981538,0.984453,0.985,0.985,0.981538,0.984453,0.985,0.985,0.985
SVM,0.975385,0.981377,0.9825,0.9825,0.981424,0.982326,0.9825,0.982495,0.978395,0.981845,0.9825,0.982491,0.9825
Random Forest,0.98,0.983684,0.984375,0.984375,0.98151,0.98392,0.984375,0.984372,0.980754,0.983802,0.984375,0.984373,0.984375
ExtraTrees,0.981538,0.984453,0.985,0.985,0.981538,0.984453,0.985,0.985,0.981538,0.984453,0.985,0.985,0.985


On our final testing set, we were able to reach the required precision and recall. We identify our strongest classifier to be the logistic regression classifier, followed by the random forest. Looking above, we find that this holds true across the sets. 