
# Notebook 3 – Natural Language Classifier (NLC)
IBM Watson Natural Language Classifier uses machine learning algorithms to return the top matching predefined classes for short text input. 

You create and train a classifier to connect predefined classes to example texts so that the service can apply those classes to new inputs.

https://www.ibm.com/watson/services/natural-language-classifier/ 
https://www.ibm.com/watson/developercloud/natural-language-classifier/api/v1 


## Install dependencies

In [1]:
# Imports: run this cell each time after restarting the kernel.
# Uncomment the next line to install the Watson SDK if it is missing.
#!pip install watson_developer_cloud
import watson_developer_cloud as watson
import json
from botocore.client import Config
import ibm_boto3
import requests

##  Cloud Object Storage - Add Credentials & Bucket Name
If you have not already set up Cloud Object Storage (COS), please see Step 1.

### Credentials
Credentials are created for you when you create a project. From the service dashboard page, select `Service Credentials` in the left navigation menu and copy the credentials into the cell below:

### Bucket name
Buckets are also created for you when you create a project. From the service dashboard page, select `Buckets` in the left navigation menu and copy your bucket name into the cell below:


In [2]:
# For Cloud Object Storage - populate your own information here from "SERVICES" on this page, or Console Dashboard on ibm.com/cloud

# From service dashboard page select Service Credentials from left navigation menu item
credentials_os = {
  "apikey": "",
  "cos_hmac_keys": {
    "access_key_id": "",
    "secret_access_key": ""
  },
  "endpoints": "https://cos-service.bluemix.net/endpoints",
  "iam_apikey_description": "Auto generated apikey during resource-key operation for Instance",
  "iam_apikey_name": "",
  "iam_role_crn": "",
  "iam_serviceid_crn": "",
  "resource_instance_id": ""
}

# Buckets are created for you when you create a project. From the service dashboard page, select Buckets in the left navigation menu.
credentials_os['BUCKET'] = '<bucket_name>' # copy bucket name from COS
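
Before moving on, it can help to confirm that the placeholders above were actually filled in. The following is only a sketch: `missing_credential_fields` is a hypothetical helper, not part of any IBM SDK, and the list of required keys is an assumption based on the fields this notebook reads later (`apikey`, `endpoints`, `resource_instance_id`, `BUCKET`).

```python
def missing_credential_fields(credentials,
                              required=("apikey", "endpoints",
                                        "resource_instance_id", "BUCKET")):
    """Return the required keys that are absent, empty, or still a <placeholder>."""
    missing = []
    for key in required:
        value = credentials.get(key, "")
        is_placeholder = (isinstance(value, str)
                          and value.startswith("<") and value.endswith(">"))
        if not value or is_placeholder:
            missing.append(key)
    return missing


# Example mirroring the unfilled template above.
example = {
    "apikey": "",
    "endpoints": "https://cos-service.bluemix.net/endpoints",
    "resource_instance_id": "",
    "BUCKET": "<bucket_name>",
}
print(missing_credential_fields(example))
```

An empty list means every field the notebook relies on has a value; it does not prove that the values are valid.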

In [3]:
# The code was removed by DSX for sharing.

### Access the pre-trained Watson Natural Language Classifier (NLC) service for the lab
Note: NLC does not offer a Lite plan, and a classifier takes time to train.

For this lab, to keep things simple, NLC has been preconfigured for you.

IBM Watson™ Natural Language Classifier uses machine learning algorithms to return the top matching predefined classes for short text input. You create and train a classifier to connect predefined classes to example texts so that the service can apply those classes to new inputs.

In short, you can train NLC on a ground truth (example texts labeled with classes) to create your own classification model.

https://www.ibm.com/watson/developercloud/natural-language-classifier/api/v1/curl.html?curl


In [7]:
# Lab credentials: available only until March 23, 2018. After that date, train your own classifier and substitute its credentials.

credentials_nlc = {
    "classifier_id": "f7ea68x308-nlc-917",
    "url": "https://gateway.watsonplatform.net/natural-language-classifier/api",
    "apikey": "280b9633-d8c0-4ed2-9ee6-1b2c139516fb",
}
# Ground truth used - simple tester "call_center_gt_NLC_V2.csv"
# https://github.com/rustyoldrake/call_center_instrumentation_analytics/blob/master/call_center_gt_NLC_V2.csv

### Set up Object Storage Client

In [8]:
endpoints = requests.get(credentials_os['endpoints']).json()

iam_host = (endpoints['identity-endpoints']['iam-token'])
cos_host = (endpoints['service-endpoints']['cross-region']['us']['public']['us-geo'])

auth_endpoint = "https://" + iam_host + "/oidc/token"
service_endpoint = "https://" + cos_host


client = ibm_boto3.client(
    's3',
    ibm_api_key_id = credentials_os['apikey'],
    ibm_service_instance_id = credentials_os['resource_instance_id'],
    ibm_auth_endpoint = auth_endpoint,
    config = Config(signature_version='oauth'),
    endpoint_url = service_endpoint
   )
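
The endpoint lookup in the cell above can be tried offline. This is a minimal sketch, assuming the endpoints document has the nested shape the cell indexes into; `resolve_cos_urls` is a hypothetical helper and the host names in `sample_endpoints` are placeholders, not real IBM hosts.

```python
def resolve_cos_urls(endpoints, region_group="cross-region", region="us",
                     visibility="public", location="us-geo"):
    """Build the IAM token URL and the COS service URL from an endpoints document."""
    iam_host = endpoints["identity-endpoints"]["iam-token"]
    cos_host = endpoints["service-endpoints"][region_group][region][visibility][location]
    return "https://" + iam_host + "/oidc/token", "https://" + cos_host


# Placeholder document with the same nesting as the one fetched above.
sample_endpoints = {
    "identity-endpoints": {"iam-token": "iam.example.test"},
    "service-endpoints": {
        "cross-region": {"us": {"public": {"us-geo": "cos.example.test"}}}
    },
}

auth_endpoint, service_endpoint = resolve_cos_urls(sample_endpoints)
print(auth_endpoint)
print(service_endpoint)
```

If your bucket lives in a different region, the lookup path would need other keys; a `KeyError` here usually means the path does not match the document you fetched.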




### NLC

- `process_text()` walks through the transcript JSON, concatenates the sentence transcripts, and splits the result into chunks of `chunk_size` words
- `classify()` calls the Natural Language Classifier endpoint on each chunk of the transcript and saves the results to COS
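
The chunking step can be illustrated on its own, without any service calls. This is a sketch of the idea only; `chunk_words` is a hypothetical name, and the sample sentence is invented for illustration.

```python
def chunk_words(text, chunk_size):
    """Split text into pieces of at most chunk_size words each."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]


sample = "hello I would like to change my address please"
chunks = chunk_words(sample, 4)
for chunk in chunks:
    print(chunk)
```

Each chunk becomes one short text input for the classifier, which is why the chunk size is kept small.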

In [9]:
#NLC

from watson_developer_cloud import NaturalLanguageClassifierV1

# Authenticate with the lab credentials defined above.
natural_language_classifier = NaturalLanguageClassifierV1(
    url=credentials_nlc['url'],
    iam_apikey=credentials_nlc['apikey'])

# Number of words per chunk when splitting the aggregate transcript into smaller pieces.
chunk_size = 25

def chunk_transcript(transcript, chunk_size):
    # split() with no argument drops the empty strings that repeated spaces would otherwise produce
    words = transcript.split()
    return [words[i:i+chunk_size] for i in range(0, len(words), chunk_size)]  # lists of words
    

def process_text(text):
    transcript=''
    for sentence in json.loads(text)['results']:
        transcript = transcript + sentence['alternatives'][0]['transcript'] # concatenate sentences
    transcript = chunk_transcript(transcript, chunk_size) # chunk the transcript
    return transcript

def classify(file_name):
    streaming_body = client.get_object(Bucket = credentials_os['BUCKET'], Key = file_name.split('.')[0]+'_text.json')['Body']
    transcript=streaming_body.read().decode("utf-8")
    analysis = {}
    for chunk in process_text(transcript):
        chunk = ' '.join(chunk)
        analysis[chunk] = natural_language_classifier.classify(credentials_nlc['classifier_id'], chunk).get_result()
    client.put_object(Bucket = credentials_os['BUCKET'], Key = file_name.split('.')[0]+'_nlc', Body= json.dumps(analysis))
    return analysis


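def classify_transcript(file_name):
    # Only classify when the classifier has finished training and is ready.
    status = natural_language_classifier.get_classifier(credentials_nlc['classifier_id']).get_result()
    if status['status'] != 'Available':
        raise RuntimeError("Classifier is not available; current status: " + str(status['status']))
    return classify(file_name)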
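
The object keys that `classify()` reads and writes are derived from the audio file name. The sketch below reproduces that rule in isolation; `derived_keys` is a hypothetical helper and `call.v2.ogg` is an invented file name used only to show the edge case.

```python
def derived_keys(file_name):
    """Return (input_key, output_key) using the same rule as classify()."""
    # Keep everything before the FIRST dot, exactly as classify() does.
    stem = file_name.split('.')[0]
    return stem + '_text.json', stem + '_nlc'


print(derived_keys('sample1-addresschange-positive.ogg'))
print(derived_keys('call.v2.ogg'))  # hypothetical name containing an extra dot
```

Because the name is cut at the first dot, a file name with additional dots loses everything after that dot, so two such files could map to the same keys.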


In [10]:
file_list = ['sample1-addresschange-positive.ogg',
             'sample2-address-negative.ogg',
             'sample3-shirt-return-weather-chitchat.ogg',
             'sample4-angryblender-sportschitchat-recovery.ogg',
             'sample5-calibration-toneandcontext.ogg',
             'jfk_1961_0525_speech_to_put_man_on_moon.ogg',
             'May 1 1969 Fred Rogers testifies before the Senate Subcommittee on Communications.ogg'
            ]


classify_transcript(file_list[0])

WatsonApiException: Error: Unauthorized: Access is denied due to invalid credentials , Code: 401 , X-dp-watson-tran-id: gateway02-3942423309 , X-global-transaction-id: ffea405d5c0302b5eafc9b0d

In [None]:
for filename in file_list:
    print("\n\nprocessing file: ", filename)
    analysis = classify_transcript(filename)
    print(analysis)
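
The printed `analysis` dictionary can be long. As a sketch, the helper below reduces it to one row per chunk, assuming each classify result carries a `top_class` field and a `classes` list of `class_name`/`confidence` pairs. `summarize` is a hypothetical name, and the class names and confidence values in `mock_analysis` are invented for illustration, not output from the lab classifier.

```python
def summarize(analysis):
    """Return (chunk, top_class, confidence) for every classified chunk."""
    rows = []
    for chunk, result in analysis.items():
        top = result.get("top_class")
        # Look up the confidence that belongs to the top class, if present.
        confidence = next(
            (c["confidence"] for c in result.get("classes", [])
             if c["class_name"] == top),
            None,
        )
        rows.append((chunk, top, confidence))
    return rows


# Invented example shaped like the dictionary classify() returns.
mock_analysis = {
    "i would like to change my address": {
        "top_class": "address_change",
        "classes": [
            {"class_name": "address_change", "confidence": 0.93},
            {"class_name": "return", "confidence": 0.07},
        ],
    }
}

for row in summarize(mock_analysis):
    print(row)
```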