# Model Iteration 2

Negation has repeatedly turned out to be a cause of mis-categorized phrases. To handle these cases, I will prepend "NOT\_" to every word that follows a negation word (such as "not", "but", or "didn't", among others).
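As a rough illustration of the idea on a single sentence, here is a minimal token-based sketch. It is not the implementation used in this notebook: the helper name `mark_negated`, the trigger test, and the choice to end the negated span at punctuation are all assumptions made for the example.

```python
# Hypothetical helper for illustration only.
SCOPE_END = {".", ",", ";", "!", "?"}  # assumed: punctuation closes the negated span


def is_negation(token):
    # Assumed triggers: the word "not" and any token ending in "n't".
    lowered = token.lower()
    return lowered == "not" or lowered.endswith("n't")


def mark_negated(phrase):
    """Prefix each token between a negation trigger and the next punctuation mark with NOT_."""
    marked = []
    in_scope = False
    for token in phrase.split(" "):
        if token in SCOPE_END:
            in_scope = False
            marked.append(token)
        elif is_negation(token):
            in_scope = True
            marked.append(token)
        elif in_scope and token:
            marked.append("NOT_" + token)
        else:
            marked.append(token)
    return " ".join(marked)


print(mark_negated("But it does n't leave you with much ."))
# But it does n't NOT_leave NOT_you NOT_with NOT_much .
```

A phrase without a trigger, such as "You could hate it for the same reason .", passes through unchanged.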

In [36]:
import pandas

# PhraseId (the first column) becomes the index
df = pandas.read_csv('train.tsv', sep='\t', index_col=0)

In [37]:
df.head()

PhraseId,SentenceId,Phrase,Sentiment
1,1,A series of escapades demonstrating the adage ...,1
2,1,A series of escapades demonstrating the adage ...,2
3,1,A series,2
4,1,A,2
5,1,series,2


In [86]:
import re

def prepend_NOT(match):
    """
    Replacement callback for a regular expression substitution. The match
    starts with a negation word (e.g. "n't" or "not"); every word after it
    gets a "NOT_" prefix.
    """
    words = match.group().split(" ")
    negation, rest = words[0], words[1:]
    # Empty strings come from trailing or repeated spaces; leave them
    # unprefixed so the original spacing survives the join.
    new_words = ["NOT_" + word if word else word for word in rest]
    return " ".join([negation] + new_words)


def substitute_negations(phrase):
    """
    Returns the input phrase with "NOT_" prepended to every word that
    follows a negation word (i.e. "didn't" and "not"). This can only
    occur in phrases with more than one word.
    """
    # Negation triggers as regular expressions. The word boundaries keep
    # "not" from matching inside words such as "another" or "note".
    negation_words = [r"\bnot\b", r"n't\b"]

    # Combine the triggers into one alternation.
    negation_pattern = "|".join(negation_words)

    # A trigger followed by a run of letters, spaces and apostrophes. The
    # explicit A-Za-z ranges avoid the punctuation a single A-z range matches.
    negations_re = re.compile(r"(?:" + negation_pattern + r")[A-Za-z ']*")
    return negations_re.sub(prepend_NOT, phrase)


def add_NOT_to_negations(df):
    """
    Adds a "Negations" column holding each phrase with "NOT_" prepended to
    every word that follows a negation word (i.e. "didn't" and "not").
    The "Phrase" column is left as is. The column is added to the input
    dataframe in place, and that same dataframe is returned.
    """
    df["Negations"] = df["Phrase"].apply(substitute_negations)
    return df
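
The substitution above relies on `re.sub` accepting a function as its replacement argument: the function is called once per match with the match object, and whatever string it returns replaces the matched text. A small sketch of that mechanism, separate from the negation logic (the `shout` helper and the pattern are made up for the example):

```python
import re


def shout(match):
    # Called once per match; the returned string replaces the matched text.
    return match.group().upper()


# Upper-case every word that starts with "n".
result = re.sub(r"\bn[a-z]*", shout, "not now, maybe never")
print(result)
# NOT NOW, maybe NEVER
```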

In [87]:
df01 = add_NOT_to_negations(df)

In [88]:
df01.drop_duplicates(['SentenceId']).head(10)

PhraseId,SentenceId,Phrase,Sentiment,Negations
1,1,A series of escapades demonstrating the adage ...,1,A series of escapades demonstrating the adage ...
64,2,"This quiet , introspective and entertaining in...",4,"This quiet , introspective and entertaining in..."
82,3,"Even fans of Ismail Merchant 's work , I suspe...",1,"Even fans of Ismail Merchant 's work , I suspe..."
117,4,A positively thrilling combination of ethnogra...,3,A positively thrilling combination of ethnogra...
157,5,Aggressive self-glorification and a manipulati...,1,Aggressive self-glorification and a manipulati...
167,6,A comedy-drama of nearly epic proportions root...,4,A comedy-drama of nearly epic proportions root...
199,7,"Narratively , Trouble Every Day is a plodding ...",1,"Narratively , Trouble Every Day is a plodding ..."
214,8,"The Importance of Being Earnest , so thick wit...",3,"The Importance of Being Earnest , so thick wit..."
248,9,But it does n't leave you with much .,1,But it does n't NOT_leave NOT_you NOT_with NOT...
260,10,You could hate it for the same reason .,1,You could hate it for the same reason .
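
One way to see why the marked tokens might help a bag-of-words model is to count tokens before and after marking. This is only a sketch under the assumption that features are whitespace-separated tokens; the example sentence is made up and `Counter` stands in for whatever vectorizer is used later.

```python
from collections import Counter

# The same sentence before and after negation marking.
plain_counts = Counter("it is not good".split())
marked_counts = Counter("it is not NOT_good".split())

# Without marking, the negated sentence still counts the feature "good".
print(plain_counts["good"])       # 1
# With marking, "good" is gone and "NOT_good" is a feature of its own.
print(marked_counts["good"])      # 0
print(marked_counts["NOT_good"])  # 1
```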
