## Classifying Customer Queries 

Using simple machine learning techniques to solve actual business problems

----
## 1. Problem statement
Every company has customers and customers have complaints. <br>
For this project, we have a dataset that contains different queries from customers. <br>
Each query we receive from a customer falls into one of two categories:
* A generic query that can be dealt with automatically (category 0)
* A complex query that requires human intervention (category 1)

---
## 2. Approach to solution:
* Here our objective is to build a model that can <i>predict</i> whether a query requires immediate attention (1) or not (0)


#### <b>2.1 Preparing the dataset</b>
* We are using a publicly available dataset from an e-learning website
* Let us first import the necessary libraries and load the data

In [24]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style="whitegrid")

data=pd.read_csv('query_data.csv')
data.head()

Unnamed: 0,Input Query,Category
0,be or do,0
1,comparatives and superlatives,0
2,how to teach overcoming obstacles concept with...,1
3,Present perfect simple or continuous,0
4,-,0


0 - <i>Low level query </i><br>
1 - <i>Urgent query (needs immediate attention) </i>

* Shuffling the dataset to curb any unnecessary biases

In [5]:
data = data.sample(frac = 1, random_state=11)
data.head()
data.shape

(1084, 2)
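
To make the shuffling step concrete, here is a minimal sketch on a made-up five-row frame (the `toy` data is hypothetical and not part of the dataset). It shows that `sample(frac=1, random_state=...)` keeps every row, only changes the order, and returns the same order for the same seed.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "Input Query": ["a", "b", "c", "d", "e"],
    "Category": [0, 0, 1, 0, 1],
})

# frac=1 samples 100% of the rows without replacement, i.e. a shuffle
shuffled = toy.sample(frac=1, random_state=11)
shuffled_again = toy.sample(frac=1, random_state=11)

# Same seed -> same order on both calls; no row is lost or duplicated
same_order = shuffled.index.equals(shuffled_again.index)
same_rows = sorted(shuffled.index) == sorted(toy.index)

# Optional: drop the old index so labels and positions agree again
shuffled = shuffled.reset_index(drop=True)
print(same_order, same_rows)
```

Resetting the index is optional, but it avoids confusion later when rows are addressed by position.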

#### <b> 2.2 Vectorization of words</b>
Algorithms cannot ingest raw text, so we will have to convert it into numbers. <br>
There are multiple methods to achieve this.

* <b> Counting Word frequency </b> (Bag of words) -
<br> In this approach, we use the tokenized words for each observation and find out the frequency of each token.

In [7]:
count_vec = CountVectorizer()
# astype('U') casts every entry to a string, so missing values do not break the vectorizer
count_example = count_vec.fit_transform(data["Input Query"].values.astype('U')).toarray()
# Label each column with its token
# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead)
vocab_list = list(count_vec.get_feature_names_out())
count_example = pd.DataFrame(count_example, columns=vocab_list)


In [10]:
count_example.shape
#count_example.to_csv("countexample.csv")

(1084, 1182)

As we can see, this method is not efficient: the document-token matrix has 1084 x 1182 cells, most of them zero, and the number of columns grows with the vocabulary.
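
A small sketch of the bag-of-words idea on three made-up queries (the `queries` list is hypothetical). It keeps the matrix in its sparse form and measures what share of the cells are zero, which is the inefficiency described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical queries, only for illustration
queries = ["be or do", "to be or not to be", "present perfect simple"]

count_vec_demo = CountVectorizer()
counts = count_vec_demo.fit_transform(queries)  # scipy sparse matrix

# vocabulary_ maps each token to its column index
vocab = count_vec_demo.vocabulary_
be_col = vocab["be"]

# Row 1 ("to be or not to be") contains the token "be" twice
be_count_row1 = counts[1, be_col]

# Share of cells that are zero in the document-token matrix
n_rows, n_cols = counts.shape
sparsity = 1 - counts.nnz / (n_rows * n_cols)
print(counts.shape, be_count_row1, round(sparsity, 2))
```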

* <b> Hashing Vectorization </b> - <br>
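In this approach, each token is passed through a hash function that maps it to one of a fixed number of columns (here 2**10 = 1024). No vocabulary has to be stored, and the number of features stays the same no matter how many distinct words appear in the text. The trade-off is that different words can collide in the same column.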


In [17]:
vec = HashingVectorizer(n_features = 2**10, norm="l1")
vec_counts = (vec.fit_transform(data["Input Query"].values.astype('U'))).toarray()
train = pd.DataFrame(vec_counts)
train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.027778,...,0.0,0.0,0.0,0.0,0.0,-0.027778,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
vec_counts.shape

(1084, 1024)

The main advantage of this method is that even if we increase the number of observations to 100k, the feature count stays limited to 1024. <br>
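
The fixed feature count can be checked with a short sketch. The corpora below are made up, and `n_features=8` is chosen only to keep the demo small; the notebook itself uses `2**10`.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Small n_features only to keep the demo readable
hasher = HashingVectorizer(n_features=8, norm="l1")

small_corpus = ["be or do"]
large_corpus = [
    "be or do",
    "comparatives and superlatives",
    "present perfect simple or continuous",
]

# No vocabulary is learned, so transform() works without fitting
small_vecs = hasher.transform(small_corpus)
large_vecs = hasher.transform(large_corpus)

# Column count depends only on n_features, not on the corpus
print(small_vecs.shape, large_vecs.shape)

# The same text is always hashed to the same row vector
same_row = np.allclose(small_vecs.toarray()[0], large_vecs.toarray()[0])
print(same_row)
```

The negative entries in the table above are expected: by default `HashingVectorizer` uses a signed hash (`alternate_sign=True`), so a token can contribute a negative value to its column.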

#### <b>2.3 Dimensionality reduction using PCA </b>
We still have a lot of features. As you might notice, there are a lot of zeroes in the dataset, which seems unnecessary and inefficient.
* Principal component analysis is a statistical procedure to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
* The "0.99" means that we keep as many principal components as are needed to explain 99% of the variance in the original data.

In [29]:
train_rows, train_cols = train.shape
pca_compressor = PCA(0.99)
comp_train = pd.DataFrame(pca_compressor.fit_transform(train))
comp_rows, comp_cols = comp_train.shape
print(" # cols before PCA:",train_cols,"\n","# cols after PCA:" ,comp_cols)

 # cols before PCA: 1024 
 # cols after PCA: 371


We've reduced the number of columns by about 64% (from 1024 to 371) while losing only about 1% of the variance in the original data. As you can see, we no longer have sparse, mostly-zero columns; the information has been packed into far fewer columns. This makes the data cheaper to store and faster to train on as more examples are added in the future.
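
The following sketch illustrates what passing a fraction to `PCA` does, using synthetic data (the arrays `factors`, `mixing` and `X` are made up for the demo). It also recomputes the column reduction from the numbers printed above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 rows, 50 columns driven by only 5 underlying factors
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = factors @ mixing + 0.01 * rng.normal(size=(200, 50))

# A float between 0 and 1 asks PCA to keep enough components
# to explain at least that share of the variance
pca_demo = PCA(n_components=0.99)
X_small = pca_demo.fit_transform(X)

explained = pca_demo.explained_variance_ratio_.sum()
print(X.shape[1], "->", X_small.shape[1], "columns, variance kept:", round(explained, 4))

# Column reduction reported in the notebook: 1024 -> 371
reduction = (1024 - 371) / 1024
print(f"{reduction:.1%}")
```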

#### <b> 2.4 Training & Evaluating the model </b>
We already have a labelled dataset. Now we have to train our model to learn the relation between each input text and its category label.

In [39]:
train_ans = data['Category']
train_ans.head()

0    0
1    0
2    1
3    0
4    0
Name: Category, dtype: int64

* <b> Precision, Recall & F1 score: </b> <br>
Before deploying the solution, we must always take the business side into account. <br>
For example, in this scenario:
    - if the model classifies 0 (non-urgent) as 1 (urgent), it does not cause a major problem
    - but if the model classifies 1 (urgent) as 0 (non-urgent), it will lead to customer dissatisfaction <br>

So in this case <i>Recall</i> is a more important evaluation metric than <i>Precision</i>. <br>
We'll assign a heavier weight to the "1" category because we want to miss as few queries of that class as possible.

In [40]:
#adjusting the weights
class_weights = {0:0.13, 1:0.87}
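
The weights above are hand-picked. For comparison, here is a sketch of how scikit-learn's "balanced" heuristic would weight the two classes. It assumes 892 zeros and 192 ones, which are the class totals implied by the confusion matrix reported later in this notebook; it is an alternative starting point, not the method used here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the notebook's own results: 892 zeros, 192 ones
y = np.array([0] * 892 + [1] * 192)

# "balanced" weight for class c is n_samples / (n_classes * count_c)
balanced = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# Rescale so the two weights sum to 1, for easier comparison with {0: 0.13, 1: 0.87}
normalized = balanced / balanced.sum()
weights = {0: round(float(normalized[0]), 2), 1: round(float(normalized[1]), 2)}
print(weights)
```

The hand-picked weights lean somewhat further toward class 1 than this heuristic, which fits the stated preference for Recall.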

* <b> Using linear SVC </b>

In [41]:
#fitting SVC
svc_model = LinearSVC(C=7, dual = True, loss="squared_hinge", penalty = "l2", tol=1e-7, class_weight=class_weights)
svc_model.fit(comp_train, train_ans)

LinearSVC(C=7, class_weight={0: 0.13, 1: 0.87}, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=1e-07,
     verbose=0)

Now that we have fit our learning algorithm, let's check the performance.

In [44]:
# Predict once for all rows instead of calling predict() several times per row
predictions = svc_model.predict(comp_train)
actual = train_ans.values

# Count the four cells of the confusion matrix
TP = int(((predictions == 1) & (actual == 1)).sum())
FP = int(((predictions == 1) & (actual == 0)).sum())
TN = int(((predictions == 0) & (actual == 0)).sum())
FN = int(((predictions == 0) & (actual == 1)).sum())

print("Model", "True Pos - ",TP, "False Pos -",FP,"True Neg - ", TN, "False neg - ",FN)

model_accuracy = (TP+TN)/(len(train_ans))
model_precision = TP/ (TP+FP)
model_recall = TP/(TP+FN)
model_f1 = 2*(model_precision * model_recall) / (model_precision + model_recall)

print("Model's Accuracy:", model_accuracy * 100, '%')
print("Model's Precision:", model_precision *100,'%')
print("Model's Recall:", model_recall *100,'%')
print("Model's F1:", model_f1 *100,'%')


Model True Pos -  172 False Pos - 341 True Neg -  551 False neg -  20
Model's Accuracy: 66.69741697416974 %
Model's Precision: 33.528265107212476 %
Model's Recall: 89.58333333333334 %
Model's F1: 48.79432624113475 %


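As we can see, Recall, the most important metric for us, is fairly good at about 89.6%, while Precision is low at about 33.5%. Keep in mind that these numbers are measured on the same data the model was trained on, so Recall on new, unseen queries will likely be lower.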
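
To estimate how the model behaves on unseen queries, one option is to hold out part of the data before training. The sketch below uses synthetic data from `make_classification` as a stand-in for `comp_train` and `train_ans`, and its `C` and `max_iter` values are placeholders rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for comp_train / train_ans (roughly 18% positives)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.82, 0.18], random_state=0)

# Hold out 20% of the rows; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LinearSVC(C=1.0, class_weight={0: 0.13, 1: 0.87}, max_iter=10000)
model.fit(X_tr, y_tr)

# Evaluate only on rows the model has never seen
pred = model.predict(X_te)

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("TP", tp, "FP", fp, "TN", tn, "FN", fn)
print("precision", precision_score(y_te, pred))
print("recall", recall_score(y_te, pred))
print("f1", f1_score(y_te, pred))
```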

#### <b> 2.5 Prediction </b>
Now let's take our model for a test drive!

In [46]:
def query_classifier(new_input):
  # HashingVectorizer is stateless, so transform() is enough for new text
  new_input_vectorized = pd.DataFrame(vec.transform([new_input]).toarray())

  # Reuse the PCA fitted on the training data; refitting it here would
  # produce components that no longer match what svc_model was trained on
  new_input_compressed = pd.DataFrame(pca_compressor.transform(new_input_vectorized))

  prediction = svc_model.predict(new_input_compressed)
  print(prediction)

  if prediction[0] == 0:
    print("[Not Urgent] - Simple query, will be dealt with by the system.")
  else:
    print("[Urgent] - Complex query, needs human intervention.")


new_input = input("Enter your text: ")
query_classifier(new_input)

Enter your text:  fuck u


[0]
[Not Urgent] - Simple query, will be dealt with by the system.
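
Prediction only makes sense if a new query goes through exactly the same fitted transformations as the training data. One way to guarantee that is to bundle the steps in a scikit-learn `Pipeline`. The sketch below is an assumption-laden illustration: the training texts and labels are made up, `classify` is a hypothetical helper, and `TruncatedSVD` is swapped in for `PCA` because it accepts the sparse output of the vectorizer directly.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training queries and labels, only for illustration
texts = [
    "be or do",
    "comparatives and superlatives",
    "present perfect simple or continuous",
    "the video player crashes every time I open lesson three",
    "my payment failed and the course is locked please help",
    "cannot log in to my account since yesterday need support",
]
labels = [0, 0, 0, 1, 1, 1]

# TruncatedSVD stands in for PCA here because it works on sparse matrices
clf = make_pipeline(
    HashingVectorizer(n_features=2**10, norm="l1"),
    TruncatedSVD(n_components=3, random_state=0),
    LinearSVC(class_weight={0: 0.13, 1: 0.87}),
)
clf.fit(texts, labels)

def classify(query):
    """Return 0 (not urgent) or 1 (urgent) for a single query string."""
    return int(clf.predict([query])[0])

print(classify("present perfect simple"))
```

Because the pipeline is fitted once and then only calls `transform` on new text, the classifier always sees features in the space it was trained on.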


------


## 3. Conclusion

So that's about it. It's amazing how we can apply machine learning to the simplest of things and make them more efficient.
<br> 

Links:

* Portfolio : https://gofornaman.github.io
* LinkedIn : https://www.linkedin.com/in/naman-doshi/

----

