In [1]:
import numpy as np
import pandas as pd
import gensim
from gensim import corpora, models, similarities
from collections import defaultdict

data = pd.read_csv("data/service_reviews_15000rows_translated.csv")

queries = ["Customer service", "Delivery service", "Product Quality"]

## Introduction

The following cells contain a naive text classifier based on query similarity. Using LSI (Latent Semantic Indexing), we project each review into a small number of topics and measure its similarity to a set of provided queries. In this case, the queries are 'Customer service', 'Delivery service' and 'Product Quality', the three major categories into which we want to classify the text. Each review is assigned to the query with which it has the highest similarity score.
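
To make the idea concrete, here is a minimal sketch of the similarity step using only NumPy on a toy term-document matrix. The vocabulary, counts and the helper name `lsi_similarities` are invented for illustration; Gensim's `LsiModel` and `MatrixSimilarity` handle weighting and scaling in their own way, so this shows the principle rather than reproducing their output.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
# Terms, in row order: "delivery", "fast", "quality", "glasses".
# All values are invented for illustration.
term_doc = np.array([
    [2.0, 1.0, 0.0, 0.0],   # delivery
    [1.0, 2.0, 0.0, 0.0],   # fast
    [0.0, 0.0, 2.0, 1.0],   # quality
    [0.0, 0.0, 1.0, 2.0],   # glasses
])

def lsi_similarities(term_doc, query_counts, k):
    """Cosine similarity between a query and each document in a k-dimensional
    latent space obtained from a truncated SVD of the term-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]                          # top-k left singular vectors
    doc_vecs = (U_k.T @ term_doc).T         # documents in latent space
    query_vec = U_k.T @ query_counts        # query in the same space
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    return doc_vecs @ query_vec / np.maximum(norms, 1e-12)

# The query "fast delivery" as term counts over the same vocabulary.
query = np.array([1.0, 1.0, 0.0, 0.0])
sims = lsi_similarities(term_doc, query, k=2)
print(np.round(sims, 3))
```

In this toy case the first two documents (about delivery) score close to 1 against the query and the last two (about quality) score close to 0, which is the signal the classifier below relies on.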

This is not a finished product: it is a prototype that still requires refinement, especially in the choice of the number of topics and in evaluation. However, it should serve as a proof of concept. It currently classifies ~15k customer reviews translated into English.

**Requires:** Numpy, Pandas, Gensim

**Approximate time to run:** <5s

In [2]:
def corpus_extractor():
    """Pulls out the ninth column from the dataset in order to get
    the raw corpus which will be used in preprocessing
    """
    data = pd.read_csv("data/service_reviews_15000rows_translated.csv")
    corpus = data.iloc[:, 8]
    return corpus

def preprocess(corpus:pd.core.series.Series, min_len:int = 3, max_len:int = 15) -> list:
    """ Take in a corpus of text in a pandas series and perform
    preprocessing

    corpus: a pandas series containing text

    min_len: minimum word length. No shorter words will be retained

    max_len: maximum word length. No longer words will be retained
    """
    
    if not (min_len <= max_len):
        raise ValueError("make sure your minimum and maximum token lengths are not reversed")

    preprocessed_corpus = []

    # tokenize and lowercase each document, dropping tokens outside the length bounds
    for doc in corpus:
        preprocessed_doc = gensim.utils.simple_preprocess(doc, min_len = min_len, max_len = max_len)
        preprocessed_corpus.append(preprocessed_doc)

    # go document by document, removing common words
    stoplist = set('for a of the and to in'.split(' '))
    texts = [[word for word in document if word not in stoplist]
             for document in preprocessed_corpus]

    # count word frequencies
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1

    # only keep words that appear more than once
    processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

    return processed_corpus

def corpus_maker(processed_corpus:list):
    """ Take in a processed corpus from preprocessing and transform it into
    a tfidf bag of words corpus
    
    """
    # turn this into a dictionary structure
    dictionary = corpora.Dictionary(processed_corpus)

    # create a 'bag of words' corpus using that dictionary
    bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

    # fit the tfidf model

    # tfidf weights each term's frequency in a document by its inverse document frequency,
    # so terms that appear in many documents count for less
    # we use this to create a weighted corpus which other models can work with more easily
    tfidf = models.TfidfModel(bow_corpus)

    corpus_tfidf = tfidf[bow_corpus]

    return corpus_tfidf, dictionary

def similarity_order(corpus_tfidf: gensim.interfaces.TransformedCorpus, dictionary: gensim.corpora.dictionary.Dictionary, queries:list[str], mod, num_topics:int):
    """ Take in a tfidf corpus and dictionary created by corpus_maker,
    as well as a list of queries such as 'Customer service' and a model class.
    Each document in the corpus is then assigned to the query with which it
    has the highest similarity score.
    This remains a naive classifier; more refinement is needed. The LSI model is recommended.

    mod: a gensim model class, such as models.LsiModel or models.LdaModel

    num_topics: number of topics the model is trained with
    """

    model = mod(corpus_tfidf, id2word=dictionary, num_topics=num_topics)

    # index the corpus once in model space; the index does not depend on the query
    index = similarities.MatrixSimilarity(model[corpus_tfidf])

    df = pd.DataFrame(columns = queries)

    for q in queries:
        # note: the query is a raw bag of words, it is not tfidf-weighted
        vec_bow = dictionary.doc2bow(q.lower().split())
        vec_model = model[vec_bow]  # convert the query to the model's topic space

        # similarity of the query against every document in the corpus
        df[q] = index[vec_model]

    # the predicted class is the query with the highest similarity score
    df["class"] = df.idxmax(axis=1)

    return df
    
    

In [3]:
def classification_pipeline(data, queries, mod, num_topics):
    """ Run the full pipeline and return a dict mapping each query to an
    array of the reviews classified under it

    data: currently unused; corpus_extractor re-reads the csv from disk
    """
    corpus = corpus_extractor()
    processed_corpus = preprocess(corpus)
    corpus_tfidf, dictionary = corpus_maker(processed_corpus)

    df = similarity_order(corpus_tfidf, dictionary, queries, mod, num_topics)

    df["text"] = corpus

    # collect the review texts assigned to each query
    classes_dict = {q: df.loc[df["class"] == q, "text"].values for q in queries}

    return classes_dict

In [4]:
%%time

classifications = classification_pipeline(data, queries, models.LsiModel, 10)

CPU times: user 8.28 s, sys: 4.41 s, total: 12.7 s
Wall time: 2.29 s


Below are some of the comments classified as 'Customer service' 

In [5]:
classifications["Customer service"]

array(['Good evening, I received the order and one glass was missing, I have sent through the site and no one has answered me.',
       'Great, super pretty glasses. and good material',
       "I was a Hawkers customer and thought it was a trustworthy brand, but it wasn't. \n\nOn November 19th I placed two repeat orders (my mistake) and then tried to cancel one of them (unsuccessfully). I sent an email and they told me not to accept the order at home as it would be returned (and it was). \n\nTo date, I have sent 3 emails demanding a refund and still nothing! They have the glasses and my money (65).\n\nUntil reasons to the contrary, I do not recommend this brand TO ANYONE!!!\n\nThere's no point in saying to report the problem at the link because I've already done it!\n\n\n\nUpdate: I haven't received my refund yet. I received an email saying that they had refunded me but I still haven't received the money.",
       ...,
       'The price is good but some glasses arrived with a fallen gl

Below are some of the comments classified as 'Delivery service' 

In [6]:
classifications["Delivery service"]

array(['Fast and good delivery!',
       "Everything was fast and correct. I'm not giving it 5 stars because they seem a bit dark to me.\n\nThat's what happens when you buy things online without trying them on first!!",
       'Fast and recommended', ...,
       'This company did not send a confirmation email of my order, then my order went "missing" from the delivery service. Would not recommend.',
       '... top quality.\n\nawesome products, profesionalism and short time for delivery.',
       'Fast delivery, top quality'], dtype=object)

Below are some of the comments classified as 'Product Quality'

In [7]:
classifications["Product Quality"]

array(['I love the quality of the glasses!',
       'I ordered a pair of sunglasses from their wide selection, I was able to benefit from a special advantage and get 2 pairs for the price of one. I find the quality/price ratio very good and the shipping went well, fast reception.\n\n\n\nI will not hesitate to order again.',
       'Correct for summer and the beach', ...,
       'Good value for money. Good designs!',
       'Good glasses, good promotions, good prices and good delivery service.',
       'Everything great quality price and fast shipping, a ten'],
      dtype=object)

## Further Notes

We note that, at present, we classify the reviews only into the three aforementioned categories. Reviews that do not belong to any of the three are not accounted for: each one is still assigned to whichever category scores highest.
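
One possible way to handle such reviews, sketched here under assumptions, is to fall back to a catch-all label when even the best similarity score is low. The function name `classify_with_fallback`, the `'Other'` label, the cutoff of 0.2 and the example scores are all hypothetical; a suitable cutoff would have to be tuned on real data.

```python
import pandas as pd

def classify_with_fallback(scores: pd.DataFrame, min_score: float = 0.2,
                           fallback: str = "Other") -> pd.Series:
    """Pick the highest-scoring column per row, unless that score is below
    min_score, in which case return the fallback label instead."""
    best_label = scores.idxmax(axis=1)
    best_score = scores.max(axis=1)
    # keep the label where the best score is high enough, else use the fallback
    return best_label.where(best_score >= min_score, fallback)

# Invented similarity scores, one row per review and one column per query.
example_scores = pd.DataFrame({
    "Customer service": [0.62, 0.05, 0.10],
    "Delivery service": [0.30, 0.08, 0.71],
    "Product Quality":  [0.11, 0.02, 0.15],
})

fallback_labels = classify_with_fallback(example_scores)
print(fallback_labels.tolist())
```

The input has the same shape as the score columns of the frame returned by `similarity_order`, so the same check could in principle be applied there before the `class` column is computed.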

We further note that many of the reviews, particularly the positive ones, could plausibly be placed in multiple categories. For example, the review 'Good glasses, good promotions, good prices and good delivery service.' was classified as 'Product Quality', but could just as easily have been classified under 'Delivery service'.
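
If multiple labels per review turn out to be acceptable, one option is to keep every category whose score is close to the best one. The sketch below assumes a hypothetical helper `multi_label`, an arbitrary margin of 0.1 and invented scores; it is meant as a starting point, not a tested design.

```python
import pandas as pd

def multi_label(scores: pd.DataFrame, margin: float = 0.1) -> pd.Series:
    """For each row, return the list of columns whose score is within
    `margin` of that row's highest score."""
    best = scores.max(axis=1)
    # boolean frame: True where a score is close enough to the row maximum
    close = scores.ge(best - margin, axis=0)
    labels = [list(scores.columns[mask]) for mask in close.to_numpy()]
    return pd.Series(labels, index=scores.index)

# Invented similarity scores: the first review is ambiguous, the second is not.
ambiguous_scores = pd.DataFrame({
    "Customer service": [0.10, 0.70],
    "Delivery service": [0.48, 0.20],
    "Product Quality":  [0.52, 0.15],
})

label_sets = multi_label(ambiguous_scores)
print(label_sets.tolist())
```

With these numbers the first review receives both 'Delivery service' and 'Product Quality', while the second receives only 'Customer service'.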

Solutions to these issues will depend on how we ultimately plan to implement this. In particular, I assume that classifying positive reviews correctly and responding to them quickly is less important than doing so for negative reviews, so issues like the one above may or may not be important to address. Work continues.