## Problem formulation

Here the input is a dataset with 3 features: an integer ID number, a binary rating, and a text review.

The output is a binary sentiment rating (positive or negative) for each ID in the test set, predicted from the text review alone.

The data mining function required is analyzing the ID, rating, and text triples to establish which of several reviews correspond to the same product. This is challenging because many of the reviews don't specifically describe the item (e.g. 'I like this Brand X bar of soap since...') and may contain OOV (Out-Of-Vocabulary) words that were not in the training set. Potential impacts include improved product recommendations, adding products to or removing them from the online store, and identifying high-value customers. An ideal solution is a nonlinear many-to-one classifier that achieves an F1 score as close as possible to 1.0 on the test set.
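For reference, F1 is the harmonic mean of precision and recall on the positive class. The sketch below computes it by hand; the label lists are made up for illustration and are not taken from the dataset, and the helper name `f1_score_binary` is hypothetical (the notebook itself relies on scikit-learn's `'f1'` scorer).

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (1 = positive review, 0 = negative review)
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
example_f1 = f1_score_binary(y_true, y_pred)  # 3 TP, 1 FP, 1 FN
```

A score of 1.0 requires both no false positives and no false negatives, which is why it is a stricter target than accuracy when the classes are imbalanced.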



In [1]:
#Student Name:    Philippe C. Rivet
#Student Number:  10105954
#Student Email:   13cpr@queensu.ca
#Description:     Submission for CISC 372 A2 with comments added for descriptive purposes

!wget -q https://l1nna.com/372/Assignment/A2-3/train.csv
!wget -q https://l1nna.com/372/Assignment/A2-3/test.csv

'wget' is not recognized as an internal or external command,
operable program or batch file.
'wget' is not recognized as an internal or external command,
operable program or batch file.
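
The output above shows that `wget` is not available in this Windows shell, so that cell did not download the two CSV files. One possible workaround, sketched here with the Python standard library only, is to fetch the files with `urllib.request`. The helper names `download_if_missing` and `fetch_assignment_data` are hypothetical; the URLs are the ones from the cell above.

```python
import os
import urllib.request


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return the local path."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# URLs copied from the wget commands above
DATA_URLS = {
    'train.csv': 'https://l1nna.com/372/Assignment/A2-3/train.csv',
    'test.csv': 'https://l1nna.com/372/Assignment/A2-3/test.csv',
}


def fetch_assignment_data():
    """Fetch both assignment files into the current working directory."""
    for filename, url in DATA_URLS.items():
        download_if_missing(url, filename)
```

Calling `fetch_assignment_data()` before the `pd.read_csv` calls would make the notebook independent of the shell it runs in.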


In [41]:
from sklearn.feature_extraction.text import CountVectorizer #necessary package imports for functionality
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost.sklearn import XGBClassifier

from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

#additional package & class defn. for natural language pre-processing
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer  #snowball is one of the main stemming algos, along with Porter & Lancaster
from nltk.corpus import stopwords

snowball=SnowballStemmer(language='english')

'''
class SnowStemmer:
    def __init__(self):
        self.sbs=snowball
    def __call__(self,doc):
        return [self.sbs.stem(t) for t in word_tokenize(doc)]
        '''

#stop_words = set(stopwords.words('english'))  #save list of stopwords to remove in memory

xy_train = pd.read_csv('train.csv')
x_test  = pd.read_csv('test.csv')


In [46]:
x = xy_train.review  #independent var (model input) is the reviews
y = xy_train.rating  #dependent var (prediction target) is the ratings

#x_stem=[snowball.stem(wd) for wd in word_tokenize(x)]
#word_tokens=word_tokenize(x)
#x_stop=[w for w in word_tokens if not w.lower() in stop_words]   #convert all to lowercase as well


pipeline = Pipeline([                      #set up data pipeline for a sequence of transformations
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', XGBClassifier(
                          objective='multi:softmax', seed=1, num_class=2)),
])


'''
pipeline = Pipeline([                      
    ('vect', HashingVectorizer()),
    ('clf', SVC(class_weight='balanced')),
])
'''

#these will be changed during experimentation, but carefully: an exhaustive grid search fits every combination,
#so running time grows multiplicatively with each added parameter or option
parameters = {
    'vect__max_features': [100, 500, 1000, 5000, 10000, 120000],
    'vect__analyzer': ['char',],
    'vect__ngram_range': ((1, 2),(1, 3)), # n-gram lengths 1-2 or 1-3 (character n-grams, given analyzer='char')
    'tfidf__use_idf': (True, False),
     #'tfidf__norm': ('l1')
#     'clf__max_iter': (20,),
     #'clf__alpha': (0.001, 0.00001, 0.000001),
#     'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

scoring = ['f1', 'accuracy'] #how do we measure the quality of the results?
split = int(len(x) * 0.8)    #first 80% of data used for training, the remaining 20% held out for validation
'''
grid_search = GridSearchCV(
    pipeline, parameters, verbose=3, cv=[(np.arange(0, split), np.arange(split, len(x)))], 
    refit='f1', n_jobs=20, scoring=scoring, return_train_score=True) #param search method can also be changed
    '''

grid_search = RandomizedSearchCV(
    pipeline, parameters, verbose=3, cv=[(np.arange(0, split), np.arange(split, len(x)))], 
    refit='f1', n_jobs=20, scoring=scoring, return_train_score=True)
grid_search.fit(x, y)
#grid_search.fit(x_stem, y)
#grid_search.fit(x_stop, y)

#DESCRIPTION OF MODIFICATIONS 😎🤠🤖

#Tuning 1:          Stemming preprocessing
#Reasoning:         Simplify the input data by shortening common words
#Expected outcome:  Higher performance from reducing overfitting
#Actual outcome:    Small increase in performance, but not huge since the Snowball algo preserves ~90% of information
#see here for details https://towardsdatascience.com/stemming-corpus-with-nltk-7a6a6d02d3e5

#Tuning 2:          Stop word & punctuation preprocessing
#Reasoning:         Simplify the input data by removing unnecessary words
#Expected outcome:  Higher performance from reducing overfitting
#Actual outcome:    Negligible increase in performance as no sentiment information is contained in these words to begin with

#Tuning 3:          Use of HashingVectorizer instead of CountVectorizer
#Reasoning:         Improved classification as this method produces occurrences rather than counts straight away
#Expected outcome:  Higher performance from reduced bias
#Actual outcome:    Slight reduction in performance due to collisions, i.e. two or more tokens mapped to same index

#Tuning 4:          Change analyzer from word-level to character-level
#Reasoning:         Look at single characters over words for the purpose of simplification
#Expected outcome:  Improvement in performance from reduced bias
#Actual outcome:    Slight increase in performance as key correlations are gained when analyzing at this resolution

#Tuning 5:          Change normalization norm for tf-idf
#Reasoning:         Evaluate whether the L1 (Manhattan) and L2 (Euclidean) norms have different effects
#Expected outcome:  Improvement in performance from reduced overfitting with L1
#Actual outcome:    Major increase in performance due to trimming effect

#Tuning 6:          Add iterations with different learning rates
#Reasoning:         Evaluate whether changing the alpha param has an effect
#Expected outcome:  No major change in performance as in the case of NLP this param is less important
#Actual outcome:    As expected

#Tuning 7:          Change classifier to RBF SVM
#Reasoning:         Attempt to establish whether changing the classifier has a major effect
#Expected outcome:  Massive performance increase from non-linear capabilities
#Actual outcome:    Substantial increase on training, substantial decrease on testing

#Tuning 8:          Change hyperparam search to random
#Reasoning:         Less exhaustive search has a chance of yielding lower training time
#Expected outcome:  60% chance of faster execution
#Actual outcome:    As expected

#Tuning 9:          Change classifier to XGBoost
#Reasoning:         Boosting sums the predictions of many decision trees, built with heavily parallelized computation
#Expected outcome:  Tremendous improvement in performance from the boosting effect
#Actual outcome:    As hoped

Fitting 1 folds for each of 10 candidates, totalling 10 fits


KeyboardInterrupt: 
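
The log above reports 10 candidates, which is consistent with `RandomizedSearchCV`'s default of `n_iter=10`: it samples 10 settings rather than trying every combination. As a rough sketch, the size of the full grid that an exhaustive search would fit is the product of the number of options per parameter. The snippet counts this for a copy of the `parameters` dictionary; the names `full_grid` and `n_full` are introduced here for illustration.

```python
from itertools import product

# Copy of the search space defined in the cell above
parameters = {
    'vect__max_features': [100, 500, 1000, 5000, 10000, 120000],
    'vect__analyzer': ['char'],
    'vect__ngram_range': ((1, 2), (1, 3)),
    'tfidf__use_idf': (True, False),
}

# Every combination an exhaustive grid search would fit: 6 * 1 * 2 * 2
full_grid = [dict(zip(parameters, combo))
             for combo in product(*parameters.values())]
n_full = len(full_grid)
```

Adding one more parameter with k options multiplies this count by k, which is why the search space should be widened carefully.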

In [None]:
# let's visualize hyperparameters against performance

from matplotlib import pyplot as plt

selected_parameter = 'vect__max_features'  #vary one hyperparam at a time for testing
#selected_parameter = 'vect__ngram_range'  #hashingVectorizer doesn't have max_features so we vary this instead
results = grid_search.cv_results_

plt.figure()
plt.title("RandomizedSearchCV",
          fontsize=16)

plt.xlabel(selected_parameter)
plt.ylabel("Score")

ax = plt.gca()
ax.set_ylim(0.4, 1.1)


# Get the regular numpy array from the MaskedArray
X_axis = np.array(results['param_'+ selected_parameter].data, dtype=float)

scorer = 'f1'
color ='b'
for sample, style in (('train', '--'), ('test', '-')):
    sample_score_mean = results['mean_%s_%s' % (sample, scorer)]
    sample_score_mean = [x for _,x in sorted(zip(X_axis,sample_score_mean))]
    ax.plot(sorted(X_axis), sample_score_mean, style, color=color,
            alpha=1 if sample == 'test' else 0.7,
            label="%s (%s)" % (scorer, sample if sample == 'train' else 'validation'))

best_index = np.nonzero(results['rank_test_%s' % scorer] == 1)[0][0]
best_score = results['mean_test_%s' % scorer][best_index]

# Plot a dotted vertical line at the best score for that scorer marked by x
ax.plot([X_axis[best_index], ] * 2, [0, best_score],
        linestyle='-.', color=color, marker='x', markeredgewidth=3, ms=8)

# Annotate the best score for that scorer
ax.annotate("%0.2f" % best_score,
            (X_axis[best_index], best_score + 0.005))
    

plt.legend(loc="best")
plt.grid(False)
plt.show()
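
A comment in the plotting cell notes that `HashingVectorizer` has no `max_features`. It maps tokens into a fixed number of buckets instead, which is the source of the collisions mentioned under Tuning 3. The sketch below illustrates the idea in plain Python. MD5 is only a stand-in here, since scikit-learn uses a different hash function, and the helper names `bucket` and `hashed_counts` are hypothetical.

```python
import hashlib


def bucket(token, n_features):
    """Map a token to a column index with a stable hash (illustrative only)."""
    digest = hashlib.md5(token.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_features


def hashed_counts(tokens, n_features):
    """Count tokens per bucket; tokens sharing a bucket become indistinguishable."""
    counts = [0] * n_features
    for tok in tokens:
        counts[bucket(tok, n_features)] += 1
    return counts


tokens = ['good', 'bad', 'great', 'awful', 'fine']
counts = hashed_counts(tokens, n_features=4)
# 5 distinct tokens in 4 buckets: at least one bucket must hold two tokens
has_collision = max(counts) >= 2
```

With a realistic number of buckets collisions are much rarer, but they can never be ruled out because no vocabulary is stored.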

In [None]:
# generate submission

y_predict = np.squeeze(grid_search.predict(x_test.review))

pd.DataFrame(
    {'id': x_test.id, 'rating':y_predict}).to_csv('submit5.csv', index=False)

## ANSWERS TO QUESTIONS

1. Character n-grams and word n-grams differ in the unit being grouped: the former group consecutive characters, whereas the latter group consecutive words. Word n-grams tend to suffer more from the OOV issue because an unseen word matches nothing in the vocabulary, whereas character n-grams can still match familiar character sequences inside an unknown word.
2. Stop word removal deletes words that appear on a pre-specified list of common, low-information words, whereas stemming shortens words by truncating suffixes. Both techniques are language-dependent: the stop word list and the suffix-stripping rules are specific to one language, which is why the notebook requests `stopwords.words('english')` and `SnowballStemmer(language='english')`.
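3. Tokenization techniques are language-dependent because the rules for splitting text into tokens depend on the writing system: English can mostly be split on whitespace and punctuation, whereas languages written without spaces between words (e.g. Chinese) need a different segmentation method.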
4. The CountVectorizer method converts text data to a (typically very large and sparse) matrix of token counts. Tf-idf reweights a count matrix so that tokens appearing in many documents count for less (the inverse document frequency is log-scaled), and then normalizes each row. It is not feasible to use all possible n-grams because the number of distinct n-grams, and with it the running time and storage, grows rapidly with n, so the range should be chosen carefully for the task. In most cases stopping at a reasonably high value (say 4 or 5) works well enough 🤠🤓
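
The OOV point in answer 1 can be illustrated with a small plain-Python sketch. The training sentence and the unseen word are made up for illustration, and the helpers `word_ngrams` and `char_ngrams` are hypothetical simplifications of what a vectorizer does.

```python
def word_ngrams(text, n):
    """Consecutive word groups of length n (whitespace tokenization)."""
    words = text.lower().split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Consecutive character groups of length n."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Vocabulary learned from a made-up training sentence
train_text = 'this soap is amazing'
train_vocab_words = set(word_ngrams(train_text, 1))
train_vocab_chars = set(char_ngrams(train_text, 3))

# 'amazingly' never appears in the training text
unseen = 'amazingly'
word_hits = [g for g in word_ngrams(unseen, 1) if g in train_vocab_words]
char_hits = [g for g in char_ngrams(unseen, 3) if g in train_vocab_chars]
```

Here `word_hits` is empty because the whole word is unknown, while `char_hits` still contains the trigrams shared with "amazing", so a character-level model retains some signal.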