Once we have grouped the reviews into 4 classes and relabeled them as integers in [0, 4), we proceed as with any other ML problem, starting with some preprocessing.

To that aim I am going to use all the usual suspects, imported below.

In [2]:
import numpy as np
import os
import pandas as pd
import multiprocessing
import en_core_web_sm
import pickle
import spacy

from pathlib import Path
from multiprocessing import Pool
from gensim.utils import simple_preprocess
from gensim.models.phrases import Phraser, Phrases
from nltk.stem import WordNetLemmatizer
from spacy.lang.en.stop_words import STOP_WORDS

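cores = multiprocessing.cpu_count()

# spaCy pipeline used by xtra_features below
# (assumes the en_core_web_sm model is installed)
nlp = en_core_web_sm.load()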

All the data processing will be done with the following functions and classes

In [3]:
def simple_tokenizer(doc):
    return [t for t in simple_preprocess(doc, min_len=2) if t not in STOP_WORDS]

class NLTKLemmaTokenizer(object):

    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.lemmatizer.lemmatize(t, pos="v") for t in simple_tokenizer(doc)]


class SpacyLemmaTokenizer(object):

    def __init__(self):
        # A blank pipeline has no parser, tagger or ner, so there is nothing to disable
        self.tok = spacy.blank('en')

    @staticmethod
    def condition(t, min_len=2):
        # Drop punctuation, whitespace, pronouns, short tokens, stop words and digits
        return not (t.is_punct or t.is_space or t.lemma_ == '-PRON-' or
            len(t) <= min_len or t.is_stop or t.is_digit)

    def __call__(self, doc):
        return [t.lemma_.lower() for t in self.tok(doc) if self.condition(t)]


class Bigram(object):

    def __init__(self):
        self.phraser = Phraser

    @staticmethod
    def append_bigram(doc, phrases_model):
        doc += [t for t in phrases_model[doc] if '_' in t]
        return doc

    def __call__(self, docs):
        phrases = Phrases(docs,min_count=10)
        bigram = self.phraser(phrases)
        p = Pool(cores)
        docs = p.starmap(self.append_bigram, zip(docs, [bigram]*len(docs)))
        p.close()
        return docs


# Each count_* function returns the fraction of tokens with the given POS tag
def count_nouns(tokens):
    return sum([t.pos_ == 'NOUN' for t in tokens])/len(tokens)


def count_adjectives(tokens):
    return sum([t.pos_ == 'ADJ' for t in tokens])/len(tokens)


def count_adverbs(tokens):
    return sum([t.pos_ == 'ADV' for t in tokens])/len(tokens)


def count_verbs(tokens):
    return sum([t.pos_ == 'VERB' for t in tokens])/len(tokens)


def sentence_metric(tokens):
    slen = [len(s) for s in tokens.sents]
    metrics = np.array([np.mean(slen), np.median(slen), np.min(slen), np.max(slen)])/len(tokens)
    return metrics


def xtra_features(doc):
    tokens = nlp(doc)
    n_nouns = count_nouns(tokens)
    n_adj   = count_adjectives(tokens)
    n_adv   = count_adverbs(tokens)
    n_verb  = count_verbs(tokens)
    sent_m  = sentence_metric(tokens)
    return [n_nouns, n_adj, n_adv, n_verb] + list(sent_m)

In the `preprocessing.py` script I ran the lemma tokenizers based on `Spacy` and `nltk`, with and without bigrams. Let me illustrate their use here for the `nltk` case:

In [5]:
DATA_PATH = Path("../../datasets/amazon_reviews")

df = pd.read_csv(DATA_PATH/'reviews_Clothing_Shoes_and_Jewelry.csv')
df = df[~df.reviewText.isna()].sample(frac=1, random_state=1).reset_index(drop=True)
reviews = df.reviewText.tolist()

In [6]:
nltk_tok  = NLTKLemmaTokenizer()

# Running the tokenizer in parallel
pool = Pool(cores)
nltk_docs  = pool.map(nltk_tok, reviews)
pool.close()

# Computing the Bigrams
nltk_pdocs  = Bigram()(nltk_docs)

In [14]:
print(reviews[10])

I got these earrings for my 15 year old granddaughter for Christmas.  She told me she liked jewelry that was wings.  They look like something she will like and wear.  The price was reasonable.


In [15]:
print(nltk_docs[10])

['get', 'earrings', 'year', 'old', 'granddaughter', 'christmas', 'tell', 'like', 'jewelry', 'wing', 'look', 'like', 'like', 'wear', 'price', 'reasonable']


In [16]:
print(nltk_pdocs[10])

['get', 'earrings', 'year', 'old', 'granddaughter', 'christmas', 'tell', 'like', 'jewelry', 'wing', 'look', 'like', 'like', 'wear', 'price', 'reasonable', 'year_old', 'granddaughter_christmas', 'price_reasonable']
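
To give some intuition for why pairs such as `year_old` are picked up, here is a minimal pure-Python sketch of the kind of score that, as far as I understand it, `Phrases` uses by default: the bigram count (minus `min_count`) scaled by the vocabulary size and divided by the product of the two unigram counts. The function `score_bigrams` and the toy corpus are hypothetical, and counting only unigrams in the vocabulary size is a simplification of mine, so treat this as an illustration and not as gensim's implementation.

```python
from collections import Counter


def score_bigrams(docs, min_count=1, threshold=1.0):
    """Score adjacent token pairs; keep those whose score exceeds threshold.

    Hypothetical helper: a simplified version of the default scoring idea,
    not gensim's actual code.
    """
    unigrams = Counter(t for doc in docs for t in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: unigrams only

    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[a + "_" + b] = score
    return scores


# Toy corpus made up for illustration
docs = [
    ["year", "old", "granddaughter"],
    ["year", "old", "son"],
    ["year", "old", "daughter"],
    ["price", "reasonable"],
    ["nice", "price"],
    ["nice", "shoes"],
]

scores = score_bigrams(docs, min_count=1, threshold=1.0)
print(scores)
```

In this toy corpus only "year old" occurs often enough, relative to how frequent its two words are, to pass the threshold. Pairs that appear just once score zero because of the `min_count` discount.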


At first sight, better preprocessing seems possible, for example by splitting words like "granddaughter". For now, I will move on. In the `preprocessing.py` file you will find more code related to `Spacy` and to computing what I referred to above as `xtra_features` (the fractions of nouns, adjectives, adverbs and verbs, plus sentence-length statistics).
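
To make the shape of that feature vector concrete without loading a spaCy model, here is a small sketch that mimics `xtra_features` on a hand-tagged document. The `Token` namedtuple, `pos_fraction` and `toy_xtra_features` are hypothetical stand-ins, and the POS tags are written by hand and not produced by any tagger.

```python
from collections import namedtuple
from statistics import mean, median

# Hypothetical stand-in for a spaCy token: only the text and coarse POS tag
Token = namedtuple("Token", ["text", "pos_"])


def pos_fraction(tokens, pos):
    """Fraction of tokens carrying the given coarse POS tag."""
    return sum(t.pos_ == pos for t in tokens) / len(tokens)


def toy_xtra_features(sentences):
    """sentences: list of sentences, each one a list of Token."""
    tokens = [t for s in sentences for t in s]
    slen = [len(s) for s in sentences]
    pos_feats = [pos_fraction(tokens, p) for p in ("NOUN", "ADJ", "ADV", "VERB")]
    # Sentence-length statistics, normalised by the document length
    sent_feats = [f(slen) / len(tokens) for f in (mean, median, min, max)]
    return pos_feats + sent_feats


# Hand-tagged toy document with two sentences of four tokens each
doc = [
    [Token("The", "DET"), Token("price", "NOUN"),
     Token("was", "AUX"), Token("reasonable", "ADJ")],
    [Token("She", "PRON"), Token("really", "ADV"),
     Token("likes", "VERB"), Token("wings", "NOUN")],
]

features = toy_xtra_features(doc)
print(features)
```

Each document ends up as 8 numbers: four POS fractions followed by the mean, median, minimum and maximum sentence length, all divided by the number of tokens in the document.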

All the results are saved to disk and we are ready to extract the features that will be used for text classification.
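
As a rough idea of what saving to disk can look like, here is a minimal sketch using `pickle`. The output directory and the file name `nltk_pdocs.p` are hypothetical (a temporary directory is used so the snippet runs anywhere), and the actual script may organise its outputs differently.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for the tokenized documents with appended bigrams
docs = [
    ["price", "reasonable", "price_reasonable"],
    ["year", "old", "granddaughter", "year_old"],
]

# Hypothetical output location: a temporary directory
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "nltk_pdocs.p"

# Save the processed documents
with open(out_file, "wb") as f:
    pickle.dump(docs, f)

# Load them back, as a later feature-extraction step would do
with open(out_file, "rb") as f:
    restored = pickle.load(f)

print(restored == docs)
```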