# A Production ready Multi-Class Text Classifier

## by Sambit Mahapatra

https://towardsdatascience.com/a-production-ready-multi-class-text-classifier-96490408757

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/learning-stack/Colab-ML-Playbook/blob/master/NLP/Production%20ready%20Multi-Class%20Text%20Classifier/question_topic_nlp.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/learning-stack/Colab-ML-Playbook/blob/master/NLP/Production%20ready%20Multi-Class%20Text%20Classifier/question_topic_nlp.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

In [0]:
import pandas as pd
import numpy as np

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

In [4]:
#Load the dataset
!wget https://raw.githubusercontent.com/sambit9238/Machine-Learning/master/question_topic.csv
df = pd.read_csv("question_topic.csv")

--2019-01-19 18:04:59--  https://raw.githubusercontent.com/sambit9238/Machine-Learning/master/question_topic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 923583 (902K) [text/plain]
Saving to: ‘question_topic.csv’


2019-01-19 18:05:04 (14.3 MB/s) - ‘question_topic.csv’ saved [923583/923583]



In [5]:
df.head()

| | Unnamed: 0 | question_text | question_topic |
|---|---|---|---|
| 0 | 0 | Hi! If I sign up for your email list, can I se... | Sales/Promotions |
| 1 | 1 | I'm going to be out of the country for about a... | Shipping |
| 2 | 2 | I was wondering if you'd be able to overnight ... | Shipping |
| 3 | 3 | The Swingline electronic stapler (472555) look... | Shipping |
| 4 | 4 | I think this cosmetic bag would work great for... | Shipping |


In [6]:
df.shape

(5000, 3)

In [7]:
set(df["question_topic"])

{'Omnichannel',
 'Product Availability',
 'Product Comparison',
 'Product Specifications',
 'Returns & Refunds',
 'Sales/Promotions',
 'Shipping'}

In [8]:
from collections import Counter
Counter(df["question_topic"])

Counter({'Omnichannel': 450,
         'Product Availability': 833,
         'Product Comparison': 806,
         'Product Specifications': 839,
         'Returns & Refunds': 768,
         'Sales/Promotions': 505,
         'Shipping': 799})

In [0]:
#pre-processing
import re 
def clean_str(string):
    """
    String cleaning for the dataset: strips newlines, carriage returns
    and quotes, replaces every digit with the token "digit",
    and lower-cases the result.
    """
    string = re.sub(r"\n", "", string)    
    string = re.sub(r"\r", "", string) 
    string = re.sub(r"[0-9]", "digit", string)
    string = re.sub(r"\'", "", string)    
    string = re.sub(r"\"", "", string)    
    return string.strip().lower()

In [10]:
df.columns


Index(['Unnamed: 0', 'question_text', 'question_topic'], dtype='object')

In [0]:
#train test split
from sklearn.model_selection import train_test_split
# clean every question; "question_text" is the column at position 1
X = [clean_str(text) for text in df["question_text"]]
y = np.array(df["question_topic"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

In [0]:
#feature engineering and model selection
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

After the train/test split, we can start designing the classifier. Since the inputs are texts, we first need to convert them into numerical vectors before feeding them to any machine learning algorithm. The features are vectorized using the following two methods.

**CountVectorizer**: It transforms the text into a token count matrix. First it tokenizes the text, and then it builds a sparse matrix from the number of occurrences of each token. Calculation of the CountVectorizer matrix: suppose we have three different documents containing the following sentences.

“Camera is great”.

“Camera is Awful”.

“Camera is fine”.

The generated matrix has size 3*5, because we have 3 documents and 5 distinct features (tokens). The matrix will look like this:

![alt text](https://cdn-images-1.medium.com/max/800/1*A6IYdBpa8VF2jmbZyahGUg.png)
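To make the example above concrete, here is a minimal sketch that runs `CountVectorizer` with its default settings on the same three sentences. The variable names are only for illustration. The columns come out in alphabetical token order, so they may be arranged differently from the figure.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Camera is great", "Camera is Awful", "Camera is fine"]

# Default settings: text is lower-cased and split into word tokens
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Recover the column order from the learned vocabulary (token -> column index)
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)             # the 5 distinct tokens
print(counts.toarray())  # 3 x 5 matrix of token counts
```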

**TF-IDF**: Its value represents the importance of a word to a document in a corpus. The TF-IDF value increases with the frequency of a word in a document and decreases with the number of documents in the corpus that contain the word. Calculation of the TF-IDF value: suppose a movie review contains 100 words, in which the word Awesome appears 5 times. The term frequency (i.e., TF) for Awesome is then 5 / 100 = 0.05. Now suppose there are 1 million reviews in the corpus and the word Awesome appears in 1,000 of them. Then the inverse document frequency (i.e., IDF) is calculated as log10(1,000,000 / 1,000) = 3. Thus, the TF-IDF value is calculated as 0.05 * 3 = 0.15.
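The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the textbook formula, reading the 1,000 as the number of reviews that contain the word. scikit-learn's `TfidfTransformer` uses a smoothed natural-log variant of IDF and L2-normalizes each row by default, so the numbers produced inside the pipeline will differ from this hand calculation.

```python
import math

words_in_review = 100
occurrences_of_term = 5
reviews_in_corpus = 1_000_000
reviews_containing_term = 1_000  # assumption: counted per review, not per occurrence

tf = occurrences_of_term / words_in_review                     # term frequency
idf = math.log10(reviews_in_corpus / reviews_containing_term)  # inverse document frequency
tf_idf = tf * idf

print(tf, idf, round(tf_idf, 2))
```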

Now the numeric vectors are given as input to the support vector machine algorithm. Since the number of features is usually large for text data, a linear kernel generally performs well.

Another challenge here is multi-class classification. For the support vector machine implementation, we can use the OneVsRest classifier concept. The OneVsRest (or one-vs.-all, OvA or OvR, one-against-all, OAA) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.
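The idea can be seen on a tiny example. The sketch below uses made-up 2-D points (not the question dataset) to show that `OneVsRestClassifier` trains one binary `LinearSVC` per class and that the predicted label is the class whose classifier gives the highest confidence score. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Made-up points in three well-separated groups (illustration only)
X_toy = np.array([[0.0, 0.0], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.8],
                  [0.0, 5.0], [0.2, 5.2]])
y_toy = np.array(["a", "a", "b", "b", "c", "c"])

ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X_toy, y_toy)

# One real-valued score per (sample, class); one fitted binary classifier per class
scores = ovr.decision_function(X_toy)
by_argmax = ovr.classes_[scores.argmax(axis=1)]

print(len(ovr.estimators_))                          # number of binary classifiers
print(np.array_equal(by_argmax, ovr.predict(X_toy)))  # argmax of scores vs. predict()
```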

To make the classifier design production ready, we can create a pipeline of all the steps discussed above, which makes it easier to move the model to other systems.

In [0]:
#pipeline of feature engineering and model
model = Pipeline([('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])

For every machine learning algorithm, parameter tuning plays an important role. With suitable parameter values, a model's performance often improves noticeably. We can find suitable parameters in our case using grid search, as shown below:

In [0]:
#parameter selection
from sklearn.model_selection import GridSearchCV
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2),(2,2)],
               'tfidf__use_idf': (True, False)}

In [17]:
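# Note: the search is fitted on the full dataset (X, y), so the held-out
# test split created above also takes part in parameter selection.
gs_clf_svm = GridSearchCV(model, parameters, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X, y)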
print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)



0.9646
{'tfidf__use_idf': True, 'vectorizer__ngram_range': (1, 2)}
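Besides `best_score_` and `best_params_`, the fitted search object keeps the cross-validated score of every parameter combination in `cv_results_`, which helps to see how close the alternatives were. The sketch below shows this on a made-up stand-in corpus with two classes; the texts, labels and the reduced grid are assumptions for illustration, not the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up stand-in corpus (illustration only)
texts = ["when will my order ship", "can you ship overnight",
         "how long does shipping take", "do you ship abroad",
         "how do i return this item", "can i get a refund",
         "what is the return policy", "refund for a damaged item"]
labels = ["Shipping"] * 4 + ["Returns & Refunds"] * 4

pipe = Pipeline([("vectorizer", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", LinearSVC(random_state=0))])
grid = {"vectorizer__ngram_range": [(1, 1), (1, 2)],
        "tfidf__use_idf": [True, False]}

search = GridSearchCV(pipe, grid, cv=2).fit(texts, labels)

# One entry per parameter combination: rank, mean cross-validated accuracy, parameters
rows = zip(search.cv_results_["rank_test_score"],
           search.cv_results_["mean_test_score"],
           search.cv_results_["params"])
for rank, mean_score, params in sorted(rows, key=lambda row: row[0]):
    print(rank, round(mean_score, 3), params)
```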


So now we have the **suitable parameters** from grid search. It's time to prepare the final pipeline using them.

In [0]:
#preparing the final pipeline using the selected parameters
model = Pipeline([('vectorizer', CountVectorizer(ngram_range=(1,2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])

In [19]:
#fit model with training data
model.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
       ..._class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=None))])

In [0]:
#evaluation on test data
pred = model.predict(X_test)

In [21]:
model.classes_

array(['Omnichannel', 'Product Availability', 'Product Comparison',
       'Product Specifications', 'Returns & Refunds', 'Sales/Promotions',
       'Shipping'], dtype='<U22')

With the model fitted on the training data, we can now compare its predictions on the test data against the true labels to get an overview of the overall performance.

In [22]:
from sklearn.metrics import confusion_matrix, accuracy_score
# Arguments are passed as (pred, y_test), so rows are predicted labels and columns are true labels
confusion_matrix(pred, y_test)

array([[128,   0,   0,   0,   0,   0,   0],
       [  0, 252,   0,   5,   0,   5,   0],
       [  0,   0, 223,   2,   0,   0,   0],
       [  0,   1,   6, 254,   0,   1,   0],
       [  0,   0,   0,   0, 230,   1,   0],
       [  0,   0,   0,   0,   0, 146,   0],
       [  2,   0,   0,   0,   0,   0, 244]])

In [23]:
accuracy_score(y_test, pred)

0.9846666666666667
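The accuracy can be cross-checked from the confusion matrix printed above, and the same matrix also gives per-class precision and recall. This is a small sketch using only NumPy, with the matrix copied from the output. Because the call was `confusion_matrix(pred, y_test)`, rows are predicted labels and columns are true labels.

```python
import numpy as np

classes = ["Omnichannel", "Product Availability", "Product Comparison",
           "Product Specifications", "Returns & Refunds", "Sales/Promotions",
           "Shipping"]

# Copied from the notebook output: rows = predicted label, columns = true label
cm = np.array([[128,   0,   0,   0,   0,   0,   0],
               [  0, 252,   0,   5,   0,   5,   0],
               [  0,   0, 223,   2,   0,   0,   0],
               [  0,   1,   6, 254,   0,   1,   0],
               [  0,   0,   0,   0, 230,   1,   0],
               [  0,   0,   0,   0,   0, 146,   0],
               [  2,   0,   0,   0,   0,   0, 244]])

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()
precision = correct / cm.sum(axis=1)  # row sums: how often each class was predicted
recall = correct / cm.sum(axis=0)     # column sums: how many true samples each class has

print(round(accuracy, 4))
for name, p, r in zip(classes, precision, recall):
    print(f"{name:<24} precision={p:.3f} recall={r:.3f}")
```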

In [24]:
#save the model
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
joblib.dump(model, 'model_question_topic.pkl', compress=1)

['model_question_topic.pkl']

# Deployment

In [0]:
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
model = joblib.load('model_question_topic.pkl')

In [30]:
question = input()

Where to pickup my product


In [31]:
model.predict([question])[0]

'Product Comparison'
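Note that the training texts went through `clean_str`, while the question above is passed to the model as typed. One possible way to keep the two consistent is to hand the cleaning function to `CountVectorizer` as its `preprocessor`, so the same cleaning runs inside the pipeline at both training and prediction time. The sketch below illustrates this on a made-up toy corpus; the texts, labels and `toy_model` name are assumptions, and it is not the pipeline trained in this notebook. If such a pipeline is saved with joblib, the `clean_str` function must also be importable wherever the model is loaded.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def clean_str(string):
    """Same cleaning as in the pre-processing cell of the notebook."""
    string = re.sub(r"\n", "", string)
    string = re.sub(r"\r", "", string)
    string = re.sub(r"[0-9]", "digit", string)
    string = re.sub(r"\'", "", string)
    string = re.sub(r"\"", "", string)
    return string.strip().lower()

# Made-up toy corpus (illustration only)
toy_texts = ["Will order 12 ship today", "Can you ship overnight",
             "How do I return this item", "Can I get a refund"]
toy_labels = ["Shipping", "Shipping", "Returns & Refunds", "Returns & Refunds"]

# A custom preprocessor replaces the built-in lower-casing step,
# which is fine here because clean_str lower-cases the text itself.
toy_model = Pipeline([
    ("vectorizer", CountVectorizer(preprocessor=clean_str, ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC(class_weight="balanced", random_state=0)),
]).fit(toy_texts, toy_labels)

# Raw, uncleaned input: the pipeline applies clean_str before tokenizing
predicted = toy_model.predict(["Can you ship order 472555 overnight?"])[0]
print(predicted)
```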