# NLP Prediction Challenge

This notebook walks through predicting which corpus a given sentence belongs to, using a bag-of-words (BoW) approach and a TF-IDF approach.

In [35]:
# dependencies
import pandas as pd
import numpy as np
import nltk
import spacy
from nltk.corpus import pros_cons, stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

In [39]:
# download the corpora that we want
# and load the spacy english parser
nltk.download('pros_cons')
nlp = spacy.load('en')
# raise the max length (we're dealing with a large corpus)
nlp.max_length = 2000000

[nltk_data] Downloading package pros_cons to /Users/ryan/nltk_data...
[nltk_data]   Package pros_cons is already up-to-date!


In [3]:
pros_cons.fileids()

['IntegratedCons.txt', 'IntegratedPros.txt']

In [4]:
pro_sents = pros_cons.sents("IntegratedPros.txt")
con_sents = pros_cons.sents("IntegratedCons.txt")

In [6]:
pros_doc = [" ".join(sentence_tokens) for sentence_tokens in pro_sents]
cons_doc = [" ".join(sentence_tokens) for sentence_tokens in con_sents]

## Models

Here we try a few different models and see how high a test score we can get. We then take the best model and run k-fold cross-validation to check that its accuracy holds up.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [14]:
# set some parameters for grid search
grid_search_params = {}

# logistic regression possible parameters
lr_params = {"C": [0.01, 0.1, 1., 10., 100.]}
grid_search_params[LogisticRegression] = lr_params

# random forest possible parameters
rf_params = {
    "n_estimators": [10, 50, 100, 500],
    "criterion": ["gini", "entropy"]
}
grid_search_params[RandomForestClassifier] = rf_params

# ada-boost classifier
ada_boost_params = {
    "n_estimators": [10, 50, 100, 500],
    "learning_rate": [0.01, 0.1, 1., 1.5, 2.0]
}
grid_search_params[AdaBoostClassifier] = ada_boost_params

# gradient boosting classifier
gbc_params = {
    "loss": ["deviance", "exponential"],
    "learning_rate": [0.01, 0.1, 1., 10],
    "n_estimators": [5, 10, 50, 100, 500],
    "subsample": [0.1, 0.3, 0.5, 0.8, 1.]
}
grid_search_params[GradientBoostingClassifier] = gbc_params

# Bernoulli naive Bayes
nb_params = {
    "alpha": [0, 0.1, 1., 2., 10.]
}
grid_search_params[BernoulliNB] = nb_params

# svc params
svc_params = {
    "C": [0.1, 1., 2., 10.],
    "gamma": ["auto", "scale", 0.0001, 0.001, 0.01, 1., 10.]
}
grid_search_params[SVC] = svc_params


In [15]:
def find_best_model(models, X_train, X_test, y_train, y_test):
    
    # for each model, run grid search with the associated parameters
    # and print the best CV, train, and test scores, along with the best parameters
    for model, params in models.items():
        print("Training {}".format(str(model.__name__)))
        grid = GridSearchCV(model(), params, refit=True, verbose=0, iid=True, cv=20, n_jobs=4)
        
        grid.fit(X_train, y_train)
        
        print("Results for {}...".format(str(model.__name__)))
        print("---------------------")
        print("Best Score: {}".format(grid.best_score_))
        print("Train Score: {}".format(grid.score(X_train, y_train)))
        print("Test Score: {}".format(grid.score(X_test, y_test)))
        print("Best Params: {}".format(grid.best_params_))
        print("")
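The three scores printed by `find_best_model` measure different things. A minimal sketch on synthetic data (generated with `make_classification`; the `demo_` and `Xd_`/`yd_` names are made up for illustration and are not the notebook's features) shows where each number comes from:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (a stand-in for the notebook's features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

demo_grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    refit=True,  # refit the best setting on all of Xd_train
)
demo_grid.fit(Xd_train, yd_train)

# best_score_ is the mean validation accuracy across the 5 folds
print("Best CV score:", demo_grid.best_score_)
# score() uses the refit best estimator on whatever data is passed in
print("Train score:", demo_grid.score(Xd_train, yd_train))
print("Test score:", demo_grid.score(Xd_test, yd_test))
print("Best params:", demo_grid.best_params_)
```

The train score is usually the most optimistic of the three, since the refit estimator has already seen every training row.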

## TF-IDF Approach
This run evaluates different models on the TF-IDF matrix using GridSearchCV.
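To make the weighting concrete, here is a small sketch on a made-up three-sentence corpus (the sentences and the `toy_` names are illustrative, not taken from `pros_cons`). It shows that `min_df` counts documents rather than total occurrences, and that `norm="l2"` scales every row to unit length:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical sentences, not taken from pros_cons)
toy_sentences = [
    "great zoom great lens",
    "poor battery life",
    "great battery",
]

# min_df=2 keeps only terms that occur in at least 2 *documents*
toy_vectorizer = TfidfVectorizer(min_df=2, norm="l2", smooth_idf=True)
toy_matrix = toy_vectorizer.fit_transform(toy_sentences)

# "great" and "battery" each appear in 2 sentences; every other term is dropped
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# each row that still contains a term has unit L2 norm
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```

Note that "great" occurs twice in the first sentence, but it still counts as one document toward `min_df`.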

In [7]:
def make_td_idf_matrix(class1_corpus: list, class2_corpus: list):
    """Makes a TF-IDF matrix using the provided class corpora
    and assigns the target_class to the matrix as well.
    
    Note: This would not work as-is in production, because the TF-IDF
    weights depend on corpus-wide statistics (how many documents contain
    each term). LSA makes sense, but is not supervised, which is what we want.
    """
    
    # save the length of the first class
    num_class1_sents = len(class1_corpus)
    
    # merge the two classes
    all_sentences = class1_corpus + class2_corpus
    
    # compute the tf-idf matrix for all sentences
    vectorizer = TfidfVectorizer(
        max_df=0.5, # drop words that occur in more than half of the sentences
        min_df=5, # only use words that appear in at least 5 sentences
        stop_words='english', # remove stop words using scikit-learn's built-in English list
        lowercase=True, # prevent capitalization from ruining any potential matches
        use_idf=True, # use the inverse document frequency in our weighting (duh)
        norm=u'l2', # scales each row to unit length so that long and short sentences get "treated equally"
        smooth_idf=True # adds 1 to all document frequencies (prevents divide-by-zero errors)
    )
    
    # fit the vectorizer and transform the sentences into a TF-IDF matrix
    td_idf_values = vectorizer.fit_transform(all_sentences)
    
    # create a pandas df from this data, using the feature names
    df = pd.DataFrame(td_idf_values.todense(), columns=vectorizer.get_feature_names())
    
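    # now assign the classes
    df["CLASS_LABEL"] = "negative"
    # assign the positive class labels to the first num_class1_sents rows
    # (positional selection; a label slice would also include row num_class1_sents)
    df.iloc[:num_class1_sents, df.columns.get_loc("CLASS_LABEL")] = "positive"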
    
    return df
    

In [8]:
vectorized_data = make_td_idf_matrix(pros_doc, cons_doc)

In [9]:
vectorized_data.head(5)

Unnamed: 0,00,000,10,100,1000,10x,11,11x17,12,120,...,young,yr,zeiss,zero,zone,zones,zoom,zooming,zooms,CLASS_LABEL
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,positive
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,positive
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.283573,0.0,0.0,positive
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,positive
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,positive


In [10]:
vectorized_data.tail(5)

Unnamed: 0,00,000,10,100,1000,10x,11,11x17,12,120,...,young,yr,zeiss,zero,zone,zones,zoom,zooming,zooms,CLASS_LABEL
45870,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,negative
45871,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,negative
45872,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,negative
45873,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,negative
45874,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,negative


In [11]:
features = vectorized_data.drop(columns=["CLASS_LABEL"])
target = vectorized_data["CLASS_LABEL"]

In [12]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

In [19]:
tf_idf_models = grid_search_params.copy()
# remove some slow models for the TF-IDF approach
del tf_idf_models[AdaBoostClassifier]
del tf_idf_models[GradientBoostingClassifier]
del tf_idf_models[RandomForestClassifier]
del tf_idf_models[SVC]

In [20]:
find_best_model(tf_idf_models, X_train, X_test, y_train, y_test)

Training LogisticRegression




Results for LogisticRegression...
---------------------
Best Score: 0.9066392625809666
Train Score: 0.9342301943198804
Test Score: 0.9075056310397442
Best Params: {'C': 10.0}

Training BernoulliNB
Results for BernoulliNB...
---------------------
Best Score: 0.9044905331340309
Train Score: 0.9150473343298455
Test Score: 0.9066337281116036
Best Params: {'alpha': 0.1}



Our highest test score was with LogisticRegression, so let's cross validate with 20-fold CV.

In [27]:
lr = LogisticRegression(C=10., solver="liblinear")

In [28]:
# fit to training data
lr.fit(X_train, y_train)

LogisticRegression(C=10.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [29]:
# get the raw test score
lr.score(X_test, y_test)

0.9075056310397442

In [31]:
# get k=20-fold CV score
cross_val_scores = cross_val_score(lr, X_train, y_train, cv=20)
cross_validation_score = cross_val_scores.mean()
print("K=20 fold cross validation score: {}".format(cross_validation_score))

K=20 fold cross validation score: 0.9066391498713132
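The 20-fold mean above (about 0.9066) is nearly identical to the grid search's best score, which is expected because both run 20 folds over the same training split with the same `C`. A mean alone also hides how much the folds disagree. The sketch below, on synthetic data from `make_classification` (not the notebook's features; the `X_cv`, `y_cv`, `cv_model`, and `fold_scores` names are illustrative), reports the spread as well:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 400 rows, so each of the 20 folds holds 20 rows
X_cv, y_cv = make_classification(n_samples=400, n_features=10, random_state=1)
cv_model = LogisticRegression(C=10.0, solver="liblinear")

# with an integer cv and a classifier, scikit-learn uses stratified folds
fold_scores = cross_val_score(cv_model, X_cv, y_cv, cv=20)

print("folds:", len(fold_scores))
print("mean accuracy: {:.4f}".format(fold_scores.mean()))
print("std across folds: {:.4f}".format(fold_scores.std()))
```

A small standard deviation relative to the gap between two models gives more reason to trust that the gap is real.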


## BoW Approach

This next approach uses a bag-of-words (BoW) representation to model sentiment. We'll follow a similar process to the TF-IDF run, but change a few things, specifically in the data preparation.

In [57]:
import collections
# define a new bag-of-words approach that should get us some additional information:
# the number of words in each sentence and how many verbs, nouns, and adjectives it has,
# plus how often each common word occurs
# (punctuation counts and the lengths of neighbouring sentences are not computed here)
def bow_feature_generator(sentences, common_words, verbose=False):
    # we want to go through each sentence and count the number of occurrences of verbs, nouns, adj, etc.
    # we also want to count how many words are in each text
    rows = []
    for index, row in enumerate(sentences.itertuples()):
        
        sentence = row.text_sentence
        source = row.CLASS_LABEL
        
        info_row = collections.defaultdict(int)
        
        for token in sentence:
            part_of_speech = token.pos_
            
            if part_of_speech == "VERB":
                # if it is a verb, add 1 for the sentence
                info_row["n_verbs"] += 1
            elif part_of_speech == "NOUN":
                # if it is a noun, add 1 for the noun to the sentence
                info_row["n_nouns"] += 1
            
            elif part_of_speech == "ADJ":
                # if it is an adjective, add 1 for the adjective count for this sentence
                info_row["n_adjectives"] += 1
            
            info_row["n_words"] += 1
            
            
            if not token.is_punct and not token.is_stop and token.lemma_ in common_words:
                info_row[token.lemma_] += 1
        
        if verbose:
            if index % 10 == 0:
                print("Operating on row: {}".format(index))
        
        info_row = dict(info_row)
        info_row["text_sentence"] = sentence
        info_row["CLASS_LABEL"] = source
        rows.append(info_row)
    
    
    df = pd.DataFrame(data=rows, columns=["text_sentence", "n_verbs", "n_nouns", "n_adjectives", "n_words"] + list(common_words) + ["CLASS_LABEL"])
    
    df = df.fillna(0)
    
    return df
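The row-building pattern in `bow_feature_generator` can be tried without loading spaCy. This is a rough sketch that uses hand-tagged `(lemma, part-of-speech)` pairs in place of spaCy tokens; the sentences, `toy_vocabulary`, and `pos_columns` are all made up for illustration:

```python
import collections
import pandas as pd

# Hand-tagged (lemma, part-of-speech) pairs standing in for spaCy tokens
tagged_sentences = [
    [("zoom", "NOUN"), ("work", "VERB"), ("great", "ADJ")],
    [("battery", "NOUN"), ("lens", "NOUN")],
]
toy_vocabulary = ["zoom", "battery"]
pos_columns = {"VERB": "n_verbs", "NOUN": "n_nouns", "ADJ": "n_adjectives"}

rows = []
for tagged in tagged_sentences:
    counts = collections.defaultdict(int)
    for lemma, pos in tagged:
        if pos in pos_columns:
            counts[pos_columns[pos]] += 1
        counts["n_words"] += 1
        if lemma in toy_vocabulary:
            counts[lemma] += 1
    rows.append(dict(counts))

# keys missing from a row become NaN in the frame, so fill them with 0
toy_features = pd.DataFrame(
    rows, columns=["n_verbs", "n_nouns", "n_adjectives", "n_words"] + toy_vocabulary
).fillna(0)
print(toy_features)
```

Passing `columns=` fixes the column order and guarantees a column for every vocabulary word, even one that never occurs in any sentence.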
            

In [51]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in collections.Counter(allwords).most_common(2000)]
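The same filter-then-count idea can be illustrated without spaCy. In this sketch the token list, stop-word set, and punctuation set are hypothetical stand-ins for `token.lemma_`, `token.is_stop`, and `token.is_punct`:

```python
import collections

# Hypothetical pre-lemmatised tokens and tiny stop-word / punctuation sets
toy_tokens = ["zoom", "be", "great", ",", "zoom", "lens", "be", "great", "zoom", "!"]
toy_stop_words = {"be"}
toy_punctuation = {",", "!"}

# keep only tokens that are neither stop words nor punctuation
kept = [t for t in toy_tokens if t not in toy_stop_words and t not in toy_punctuation]

# most_common(n) returns (word, count) pairs sorted by descending count
top_words = [word for word, _ in collections.Counter(kept).most_common(2)]
print(top_words)  # ['zoom', 'great']
```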

In [52]:
# convert the sentences into one big corpus
pros_raw = nlp("\n".join(pros_doc))
cons_raw = nlp("\n".join(cons_doc))

pros_sentences = [[sentence, "positive"] for sentence in pros_raw.sents]
cons_sentences = [[sentence, "negative"] for sentence in cons_raw.sents]

sentence_df = pd.DataFrame(pros_sentences + cons_sentences)
sentence_df["text_sentence"] = sentence_df[0]
sentence_df["CLASS_LABEL"] = sentence_df[1]
sentence_df = sentence_df[["text_sentence", "CLASS_LABEL"]]
sentence_df.head()

|  | text_sentence | CLASS_LABEL |
|---|---|---|
| 0 | (Easy, to, use, ,, economical, !, \n) | positive |
| 1 | (Digital, is, where, it, ', s, at, ..., down, ... | positive |
| 2 | (Good, image, quality, ,, 3x, optical, zoom, ,... | positive |
| 3 | (Cust, SVS, 2nd, 2, none, !, \n) | positive |
| 4 | (intuitive, ,, user, friendly, \n, Simple, ,, ... | positive |


In [53]:
pros_common_words = bag_of_words(pros_raw)
cons_common_words = bag_of_words(cons_raw)
common_words = set(pros_common_words + cons_common_words)

In [58]:
bow_features = bow_feature_generator(sentence_df, common_words)

In [59]:
features = bow_features.drop(columns=["text_sentence", "CLASS_LABEL"])
target = bow_features["CLASS_LABEL"]

In [60]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

In [62]:
# train our models (tf_idf_models is just the reduced model set from above: LogisticRegression and BernoulliNB)
find_best_model(tf_idf_models, X_train, X_test, y_train, y_test)

Training LogisticRegression




Results for LogisticRegression...
---------------------
Best Score: 0.8942376619795974
Train Score: 0.9233526330300524
Test Score: 0.8961790814357391
Best Params: {'C': 1.0}

Training BernoulliNB
Results for BernoulliNB...
---------------------
Best Score: 0.8797353184449959
Train Score: 0.8877309070857458
Test Score: 0.8800977743470989
Best Params: {'alpha': 0.1}



In [63]:
# rerun with the best model and cv at 20 folds
lr = LogisticRegression(solver="liblinear", C=1.)

In [64]:
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

In [65]:
cross_validation_scores = cross_val_score(lr, X_train, y_train, cv=20)
mean_cross_val_score = cross_validation_scores.mean()
print("K=20 fold cross val score: {}".format(mean_cross_val_score))

K=20 fold cross val score: 0.8942415545295272


### BoW wins (but not why you might think)

While TF-IDF had the higher cross-validation score (about 0.907 versus 0.894), it didn't win by much. TF-IDF loses here not because of its score, but because of its real-world implementation. To compute the TF-IDF matrix, we need the word counts and the number of documents across the entire corpus. That means that to predict whether a new sentence is positive or negative, we'd need to compute the TF-IDF score for each of its tokens against that corpus, which doesn't really make sense. BoW can be used even if we have no information about the other sentences.

And that's why BoW wins, in the real world.
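As a rough illustration of that last point, count features for a new sentence need only the sentence itself and a word list fixed at training time. The vocabulary, the sentence, and the `count_features` helper below are all hypothetical:

```python
import collections

# Vocabulary fixed at training time (hypothetical words)
fixed_vocabulary = ["zoom", "battery", "great"]

def count_features(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one tokenised sentence."""
    counts = collections.Counter(tokens)
    return [counts[word] for word in vocabulary]

new_sentence = "great zoom but the battery drains and the zoom sticks".split()
print(count_features(new_sentence, fixed_vocabulary))  # [2, 1, 1]
```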