# Using Watson Explorer for Text Analytics and Document Classification

In this notebook, we are using customer reports about the products of a food retail company. Each report needs to be transferred to an appropriate department in the company. 
This notebook requires Watson Explorer add-on package installed on the cluster to execute.
We are using Watson Explorer to extract the features from the text data in the reports. The features are used with other structured data in the reports by Watson Studio Machine Learning Service to classify the reports. Based on the classification, each report can be transferred to an appropriate department.

To extract features by Watson Explorer, we need to create a Watson Explorer collection.

## Table of Contents
- [Prepare emvironment](#Prepare_environment)
- [List Watson Explorer Collections](#List-Watson-Explorer-Collections)
- [Setup data](#Setup_data)
- [Classification](#Classification)
- [Evaluate](#Evaluate)
- [Save model](#Save_model)
- [Summary](#Summary)

<a id='Prepare_environment'></a>
## Prepare environment
Import libraries for Watson Explorer and machine learning.

In [1]:
from ibmwex.ml import FeatureExtractor, OutputColumn, CollectionBrowser

In [2]:
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession, SQLContext
from pyspark.ml.feature import StringIndexer, IndexToString, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

<a id='List-Watson-Explorer-Collections'></a>
## List Watson Explorer Collections

Show all Watson Explorer collections to choose a collection for feature extraction.

In [3]:
browser = CollectionBrowser()
collections = browser.list()
for collection in collections:
    print(collection)

cb29ea70-7665-4e14-0000-016467fcc487


List the available feature path and the number of features for the collection.

In [4]:
termFreqMap = browser.getTermFreqMap('cb29ea70-7665-4e14-0000-016467fcc487')
for k,v in termFreqMap.items():
    print('{0} [{1}]'.format(k, v.size()))

._phrase [420]
._phrase.noun_phrase [330]
._phrase.noun_phrase.adp_noun [158]
._phrase.noun_phrase.mod_noun [92]
._phrase.noun_phrase.nouns [82]
._phrase.pred_phrase [90]
._phrase.pred_phrase.noun_pred [41]
._phrase.pred_phrase.verb_noun [49]
._word [322]
._word.adj [68]
._word.adv [43]
._word.conj [3]
._word.intj [2]
._word.noun [164]
._word.noun.general [148]
._word.noun.others [26]
._word.num [17]
._word.verb [64]


<a id='Setup_data'></a>
## Setup data
Load the data from csv file and create a DataFrame

This data has customer reports that have structured data ('claim_product_line' and 'claim_product') and unstructured data ('body'). 'label' column can be used for training to classify the reports. 

The file is already part of the Tutorial project. We can read the csv and create a DataFrame.

In [5]:
import os
# Add asset from file system
rows = SQLContext(sc).read.csv('../datasets/watson_explorer_sample_voc.csv', header='true', inferSchema = 'true')
rows.toPandas()[0:5]

Unnamed: 0,claim_id,date,claim_product_line,claim_product,client_segment,client_location,client_sex,client_age,body,label
0,0,2016/1/1,Tea,lemon tea,Not Member,Manhattan,Male,20,The straw was peeled off from the juice pack.,package_container
1,1,2016/1/2,Ice cream,vanilla ice cream,Silver Card Member,Queens,Female,20,"I got some ice cream for my children, but ther...",contamination_tampering
2,2,2016/1/2,Jelly,apple jelly,Silver Card Member,Brooklyn,Male,40,I could only find 11 cups in the 12-pack.,amount.shortage
3,3,2016/1/2,Juice,orange juice,Silver Card Member,Bronx,Female,50,There was a stain on the package that seemed t...,package_container.leak
4,4,2016/1/3,Chocolate,milk chocolate,Silver Card Member,Manhattan,Male,30,I love the ads for the new milk chocolate. Cou...,ads


In the model training process, the original dataset will be split into training dataset and testing dataset.

In [6]:
train, test = rows.randomSplit([0.8, 0.2], 12345)

Transform 'body' column into 'words' and 'phrases' columns using the features ('.\_word' and '.\_phrase') in the collection. 

In [7]:
wordCol = OutputColumn("words", "._word")
phraseCol = OutputColumn("phrases", "._phrase")
extractor = FeatureExtractor() \
    .setCollectionId('cb29ea70-7665-4e14-0000-016467fcc487') \
    .setInputCol("body") \
    .setOutputCols([wordCol, phraseCol])

Transform 'claim_product' and 'claim_product_line' to binary vectors.

In [8]:
productIndexer = StringIndexer(inputCol="claim_product", outputCol="claim_product_index")
productEncoder = OneHotEncoder(inputCol=productIndexer.getOutputCol(), outputCol="claim_product_vector")
productLineIndexer = StringIndexer(inputCol="claim_product_line", outputCol="claim_product_line_index")
productLineEncoder = OneHotEncoder(inputCol=productIndexer.getOutputCol(), outputCol="claim_product_line_vector")

Combine 'words', 'phrases', 'claim_product_vector', and 'claim_product_line_vector' columns into 'features'.

In [9]:
assembler = VectorAssembler( \
    inputCols=[productEncoder.getOutputCol(), productLineEncoder.getOutputCol(), wordCol.getName(), phraseCol.getName()],
    outputCol="features"
)

Encode string 'label' column of labels into 'label_index' column of label indices. 

In [10]:
label = StringIndexer(inputCol="label", outputCol="label_index", handleInvalid="skip")

In [11]:
labelModel = label.fit(train)

<a id='Classification'></a>
## Classification
Build a multiclass classification using Naive Bayes as a base classifier. The classifier can be changed to other methods.

In [12]:
classifier = NaiveBayes(labelCol=label.getOutputCol(), featuresCol=assembler.getOutputCol())

In [13]:
labelDecoder = IndexToString(inputCol=classifier.getPredictionCol(), outputCol="prediction_label", labels=labelModel.labels)

Build a pipeline and train the multiclass model.

In [14]:
pipeline = Pipeline(stages=[extractor, label, productIndexer, productEncoder, productLineIndexer, productLineEncoder, assembler, classifier, labelDecoder])
model = pipeline.fit(train)

<a id='Evaluate'></a>
## Evaluate

Show the accuracy and the result.

In [15]:
predicted = model.transform(test)
evaluator = MulticlassClassificationEvaluator(labelCol=classifier.getLabelCol(), predictionCol=classifier.getPredictionCol(), metricName="accuracy")
accuracy = evaluator.evaluate(predicted)
print("Accuracy:%g" % accuracy )

Accuracy:0.936508


In [16]:
predicted.select(label.getInputCol(), labelDecoder.getOutputCol()).toPandas()[0:10]

Unnamed: 0,label,prediction_label
0,ads,ads
1,amount.shortage,amount.shortage
2,package_container,package_container
3,package_container,package_container
4,other,other
5,ingredient.additives,ingredient.additives
6,package_container.leak,package_container.leak
7,contamination_tampering,contamination_tampering
8,contamination_tampering,contamination_tampering
9,contamination_tampering,contamination_tampering


<a id='Save_model'></a>
## Save model
Save the trained model

In [17]:
from dsx_ml.ml import save

save(name = 'DocumentClassificationModel',
     model = model,
     test_data = test,
     algorithm_type = 'Classification',
     source = 'Using+Watson+Explorer+for+Classification.ipynb',
     description = 'Document classification using WEX Feature Extractor'
    )

Using TensorFlow backend.


{'path': '/user-home/999/DSX_Projects/dsx-samples/models/DocumentClassificationModel/1',
 'scoring_endpoint': 'https://dsxl-api/v3/project/score/Python27/spark-2.0/dsx-samples/DocumentClassificationModel/1'}

<a id='Summaryl'></a>
## Summary

In this example, you learnded how to use Watson Explorer to extract the features from the text data. To extract the features from your own text data, you need to create a Watson Explorer collection.