The goal of this project is to predict the topic class (`question_topic`) of each question (`question_text`) in the dataset.


Standard imports

In [8]:
import pandas as pd
import numpy as np
import re

In [2]:
data = pd.read_csv("data/question_topic.csv")

We'll look at 5 random observations from the dataset to get a feel for what it looks like.

In [4]:
data.sample(5)

Unnamed: 0.1,Unnamed: 0,question_text,question_topic
423,423,Do you know if your app always has the same pr...,Omnichannel
1123,1123,What are the actual dimensions of the Peggy Mi...,Product Specifications
3988,3988,I'm looking for a nice pair of silver stud ear...,Product Availability
1873,1873,I'm not sure I want to order the Komplete Meal...,Product Comparison
844,844,What's the weight capacity of the Spencer Wood...,Product Specifications


In [7]:
data['question_topic'].value_counts()

Product Specifications    839
Product Availability      833
Product Comparison        806
Shipping                  799
Returns & Refunds         768
Sales/Promotions          505
Omnichannel               450
Name: question_topic, dtype: int64

The classes are moderately imbalanced: the smallest class (Omnichannel, 450 questions) is roughly half the size of the largest (Product Specifications, 839 questions).
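To put numbers on the imbalance, the short sketch below recomputes each class's share of the dataset and the smallest-to-largest ratio. The counts are copied by hand from the `value_counts()` output above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "Product Specifications": 839,
    "Product Availability": 833,
    "Product Comparison": 806,
    "Shipping": 799,
    "Returns & Refunds": 768,
    "Sales/Promotions": 505,
    "Omnichannel": 450,
}

total = sum(counts.values())
shares = {topic: n / total for topic, n in counts.items()}

smallest = min(counts, key=counts.get)
largest = max(counts, key=counts.get)
ratio = counts[smallest] / counts[largest]

# Print each class's share, then the smallest-to-largest ratio.
for topic, share in shares.items():
    print(f"{topic:<24}{share:6.1%}")
print(f"{smallest} / {largest} = {ratio:.2f}")
```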

We will write a function that preprocesses the question text so that it is better suited as input to a classification algorithm.

In [9]:
def clean_str(string):
    """
    Clean a question string: drop line breaks, digits and quote
    characters, then strip surrounding whitespace and lower-case it.
    No tokenization happens here; the vectorizer does that later.
    """
    # remove newline characters
    string = re.sub(r"\n", "", string)
    # remove carriage return characters
    string = re.sub(r"\r", "", string)
    # remove digits
    string = re.sub(r"[0-9]", "", string)
    # remove single quotes (apostrophes)
    string = re.sub(r"'", "", string)
    # remove double quotes
    string = re.sub(r'"', "", string)
    # strip leading/trailing whitespace and convert to lower case
    return string.strip().lower()
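To see what this cleaning does in practice, here is a small self-contained sketch. `clean_str_compact` is a hypothetical one-regex equivalent of the function above (every step deletes single characters, so the order does not matter), and the sample question is made up for illustration.

```python
import re

# Hypothetical compact equivalent of clean_str: delete newlines, carriage
# returns, digits, and both quote characters in a single pass.
_UNWANTED = re.compile(r"[\n\r0-9'\"]")

def clean_str_compact(string):
    """Remove unwanted characters, strip whitespace, and lower-case."""
    return _UNWANTED.sub("", string).strip().lower()

# Made-up sample question, not taken from the dataset.
sample = "What's the weight capacity of the 2 \"Spencer\" chairs?\r\n"
cleaned = clean_str_compact(sample)
print(repr(cleaned))
```

Two side effects are worth noticing: dropping the apostrophe turns "What's" into "whats", and deleting a digit that stood between two spaces leaves a double space behind.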

Apply the cleaning to the question text column and create a new column for the clean data

In [11]:
data['cleaned_question_text'] = data['question_text'].apply( clean_str )

Before picking out the features and target labels, we check the columns and confirm that none of them has missing values.

In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
Unnamed: 0               5000 non-null int64
question_text            5000 non-null object
question_topic           5000 non-null object
cleaned_question_text    5000 non-null object
dtypes: int64(1), object(3)
memory usage: 156.3+ KB


In [15]:
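# Note: this uses the raw `question_text`; the `cleaned_question_text`
# column created above is not used for the results that follow.
X = data['question_text']
y = data['question_topic']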

In [16]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=42 )

Because our inputs (`question_text`) are text rather than numbers, we need to convert them into numerical features by vectorizing them. There are many ways to do this. Here we use scikit-learn's `CountVectorizer` followed by a TF-IDF weighting (`TfidfTransformer`). We could also choose to include interaction terms, such as word n-grams.

**CountVectorizer** - Transforms a collection of documents (here, questions) into a token count matrix. It first tokenizes the text and then builds a sparse matrix holding the number of occurrences of each token in each document.
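The idea behind a token count matrix can be sketched in plain Python. This is an illustration only, not scikit-learn's implementation: the two documents are made up, the result is a dense list of lists rather than a sparse matrix, and the token pattern (lower-cased runs of two or more word characters) mirrors `CountVectorizer`'s default.

```python
import re
from collections import Counter

# Made-up documents for illustration.
docs = ["Where is my order", "Is my order my order"]

# Tokens are runs of two or more word characters, after lower-casing.
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# The vocabulary is the sorted set of all tokens; each gets one column.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term.
matrix = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```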

**TF-IDF** - Represents the importance of a word to a document in a corpus. The TF-IDF value grows with the frequency of the word in the document and shrinks with the number of documents in the corpus that contain the word. It is computed as the term frequency (TF) times the inverse document frequency (IDF). If a document contains 100 words and 5 of them are the word Awesome, then the term frequency for Awesome is $5/100 = 0.05$. If the corpus contains 1 million documents and Awesome appears in 1000 of them, then the inverse document frequency is $\log_{10}(1000000/1000) = 3$. The TF-IDF value is then $0.05 \times 3 = 0.15$. Note that scikit-learn's `TfidfTransformer` uses raw counts as TF, a smoothed natural-log variant of the IDF, and L2-normalizes each row by default, so its values differ from this textbook calculation.
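The arithmetic of this example can be checked in a few lines. The sketch takes 1000 as the number of documents that contain the word, and for comparison also evaluates the smoothed IDF formula that scikit-learn documents for `TfidfTransformer` with its default `smooth_idf=True`; the variable names are illustrative.

```python
import math

# Numbers from the worked example in the text.
term_count, doc_length = 5, 100
n_docs, doc_freq = 1_000_000, 1_000  # doc_freq: documents containing the word

tf = term_count / doc_length           # term frequency
idf = math.log10(n_docs / doc_freq)    # textbook IDF, base-10 log
tfidf = tf * idf

# Smoothed IDF as documented for scikit-learn (natural log, plus one).
smooth_idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(f"tf={tf:.2f} idf={idf:.2f} tfidf={tfidf:.2f}")
print(f"smoothed idf (scikit-learn style) = {smooth_idf:.3f}")
```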


Now we must choose the multi-class model. Here we'll use a linear SVM in the one-versus-rest (one-versus-all) paradigm: one binary classifier is trained per class, and each learns to identify the presence of its own class. Passing `class_weight="balanced"` weights each class inversely proportional to its frequency in the training data, which counteracts the bias toward the more frequent classes.

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

pipeline = Pipeline( [('vectorizer', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', OneVsRestClassifier( LinearSVC(class_weight="balanced")))
                    ])
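To make `class_weight="balanced"` concrete, the sketch below applies the formula scikit-learn documents for it, `n_samples / (n_classes * count)`. It assumes that `OneVsRestClassifier` fits each `LinearSVC` on a binary class-vs-rest target, so `n_classes` is 2 for every classifier. The counts are the full-dataset counts shown earlier, used only for illustration; the actual fit sees the counts of whatever data it is trained on.

```python
# Full-dataset class counts, copied from the value_counts() output.
counts = {
    "Product Specifications": 839,
    "Product Availability": 833,
    "Product Comparison": 806,
    "Shipping": 799,
    "Returns & Refunds": 768,
    "Sales/Promotions": 505,
    "Omnichannel": 450,
}
n_samples = sum(counts.values())

def balanced_binary_weights(n_positive, n_total):
    """Return (positive, rest) weights via n_total / (2 * class count)."""
    n_negative = n_total - n_positive
    return n_total / (2 * n_positive), n_total / (2 * n_negative)

weights = {
    topic: balanced_binary_weights(n, n_samples) for topic, n in counts.items()
}

for topic, (w_pos, w_rest) in weights.items():
    print(f"{topic:<24} positive={w_pos:.2f}  rest={w_rest:.2f}")
```

Under that assumption, the rarer a topic is, the more heavily its positive examples are weighted relative to the rest.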

Now we need to tune the model by selecting hyperparameters. We pass `GridSearchCV` the pipeline together with a grid specifying the hyperparameter values to try, `n_jobs=-1` to use all cores, and `cv=5` to do 5-fold cross-validation.

In [28]:
from sklearn.model_selection import GridSearchCV
parameters = { 'vectorizer__ngram_range': [(1,1),(1,2),(2,2)],
              'tfidf__use_idf': (True, False) }
svm_clf_grid = GridSearchCV( pipeline, parameters, n_jobs=-1, cv=5 )
svm_clf_grid.fit( X, y )

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ..._class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=None))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)], 'tfidf__use_idf': (True, False)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
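To get a sense of how much work this grid implies, the sketch below enumerates the combinations in plain Python with `itertools.product`. It only counts candidates and fits; it does not reproduce what `GridSearchCV` does internally.

```python
from itertools import product

# Same grid as passed to GridSearchCV above.
parameters = {
    "vectorizer__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__use_idf": (True, False),
}

# Cartesian product of all value lists: one dict per candidate setting.
names = list(parameters)
combinations = [
    dict(zip(names, values)) for values in product(*parameters.values())
]

n_folds = 5
n_fits = len(combinations) * n_folds  # cross-validation fits only

for candidate in combinations:
    print(candidate)
print(f"{len(combinations)} candidates x {n_folds} folds = {n_fits} fits")
```

With `refit=True`, as shown in the output above, one more fit of the best candidate on all the data passed to `fit` follows the cross-validation fits.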

In [29]:
print("Best Score: {}".format(svm_clf_grid.best_score_))
print("Best Set of parameters: {}".format(svm_clf_grid.best_params_))

Best Score: 0.9672
Best Set of parameters: {'vectorizer__ngram_range': (1, 2), 'tfidf__use_idf': True}


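Now we will fit the model with the best parameters on the training data and then evaluate it on the test data to get an idea of its performance. Because the grid search above was fit on the full data set (`X`, `y`), the test set also influenced the hyperparameter choice, so the scores below may be slightly optimistic.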

In [30]:
model = Pipeline( [('vectorizer', CountVectorizer(ngram_range=(1,2))),
                     ('tfidf', TfidfTransformer(use_idf=True)),
                     ('clf', OneVsRestClassifier( LinearSVC(class_weight="balanced")))
                    ])

In [35]:
from sklearn.metrics import classification_report, confusion_matrix
model.fit( X_train, y_train )
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

                        precision    recall  f1-score   support

           Omnichannel       1.00      1.00      1.00       131
  Product Availability       0.99      0.99      0.99       236
    Product Comparison       0.98      0.99      0.99       247
Product Specifications       0.98      0.98      0.98       253
     Returns & Refunds       1.00      1.00      1.00       248
      Sales/Promotions       1.00      0.99      0.99       152
              Shipping       1.00      1.00      1.00       233

             micro avg       0.99      0.99      0.99      1500
             macro avg       0.99      0.99      0.99      1500
          weighted avg       0.99      0.99      0.99      1500

[[131   0   0   0   0   0   0]
 [  0 233   1   2   0   0   0]
 [  0   0 245   2   0   0   0]
 [  0   1   4 248   0   0   0]
 [  0   0   0   0 248   0   0]
 [  0   1   0   1   0 150   0]
 [  0   0   0   0   0   0 233]]


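The report lists precision, recall and f1-score once per class because these metrics are defined per class in a multi-class problem; the `avg` rows summarize them. In the confusion matrix, rows are the true classes and columns are the predicted classes, in the same order as the report. Separately, remember that the one-vs-rest model trains one binary classifier for every class.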
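The per-class numbers in the report can be recovered from the confusion matrix by hand. The sketch below copies the matrix printed above, reading rows as true classes and columns as predicted classes, and recomputes precision, recall, f1-score and overall accuracy in plain Python.

```python
# Class labels in the order used by the report (alphabetical).
labels = [
    "Omnichannel", "Product Availability", "Product Comparison",
    "Product Specifications", "Returns & Refunds", "Sales/Promotions",
    "Shipping",
]

# Confusion matrix copied from the output above (rows: true, cols: predicted).
cm = [
    [131, 0, 0, 0, 0, 0, 0],
    [0, 233, 1, 2, 0, 0, 0],
    [0, 0, 245, 2, 0, 0, 0],
    [0, 1, 4, 248, 0, 0, 0],
    [0, 0, 0, 0, 248, 0, 0],
    [0, 1, 0, 1, 0, 150, 0],
    [0, 0, 0, 0, 0, 0, 233],
]

metrics = {}
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    predicted_as_class = sum(row[i] for row in cm)  # column sum
    actually_class = sum(cm[i])                     # row sum (support)
    precision = true_positives / predicted_as_class
    recall = true_positives / actually_class
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = {"precision": precision, "recall": recall, "f1": f1}

accuracy = sum(cm[i][i] for i in range(len(labels))) / sum(map(sum, cm))

for label, m in metrics.items():
    print(f"{label:<24}{m['precision']:.2f}  {m['recall']:.2f}  {m['f1']:.2f}")
print(f"accuracy = {accuracy:.3f}")
```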

For further steps and better accuracy, we could try the following:

* Eliminate low quality features (words)
* Remove stopwords
* Diversify the training corpus
* Use stemming or lemmatization
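Two of these ideas can be tried directly through `CountVectorizer` options. The sketch below is a minimal illustration on a made-up three-question corpus, not a tuned configuration: `stop_words="english"` removes scikit-learn's built-in English stop words, and `min_df=2` drops terms that appear in fewer than two documents. Stemming and lemmatization are not built into `CountVectorizer` and would need a separate library.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration only.
corpus = [
    "When will the order ship",
    "Can I return the order",
    "Is the order in stock",
]

# Drop English stop words and terms seen in fewer than 2 documents.
vectorizer = CountVectorizer(stop_words="english", min_df=2)
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each kept term to its column index.
vocabulary = vectorizer.vocabulary_
print(vocabulary)
print(counts.toarray())
```

On a corpus this small the filters leave almost nothing, so in practice thresholds such as `min_df` would be chosen by cross-validation, for example by adding them to the parameter grid.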