
## Notebook 2 – Natural Language Understanding (NLU)
NLU analyzes text to extract metadata from content, such as concepts, entities, keywords, categories, relations, and semantic roles.
https://www.ibm.com/watson/services/natural-language-understanding/ 
https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/  


## Install dependencies

Python's standard library is extensive and includes built-in modules such as `json`, which reads and writes JSON, a lightweight data-interchange format. https://docs.python.org/2/library/index.html and https://docs.python.org/2/library/json.html

IBM Watson Developer Cloud provides a Python client library for getting started quickly with the Watson service APIs. https://pypi.python.org/pypi/watson-developer-cloud

Using Python with IBM COS: Python support is provided through the Boto 3 library (imported in this notebook as `ibm_boto3`). The library provides complete access to COS and can source credentials. The IBM COS endpoint must be specified when creating a service resource or low-level client, as shown in the documentation: https://ibm-public-cos.github.io/crs-docs/python
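Later in this notebook the COS endpoint is looked up in an endpoints document before the client is created. As a minimal offline sketch of that lookup, the block below uses a mock document with placeholder host names (`iam.example.test` and `cos.example.test` are assumptions, not real IBM hosts) and a hypothetical helper, `resolve_endpoints`, that is not part of any library.

```python
# Mock of the endpoints document. Only the keys read by this notebook are
# included, and the host names are placeholders, not real IBM endpoints.
mock_endpoints = {
    "identity-endpoints": {"iam-token": "iam.example.test"},
    "service-endpoints": {
        "cross-region": {
            "us": {"public": {"us-geo": "cos.example.test"}}
        }
    },
}


def resolve_endpoints(endpoints, region="us", location="us-geo"):
    """Hypothetical helper: return (auth_endpoint, service_endpoint) URLs."""
    iam_host = endpoints["identity-endpoints"]["iam-token"]
    cos_host = endpoints["service-endpoints"]["cross-region"][region]["public"][location]
    return "https://" + iam_host + "/oidc/token", "https://" + cos_host


auth_endpoint, service_endpoint = resolve_endpoints(mock_endpoints)
print(auth_endpoint)
print(service_endpoint)
```

With a real endpoints document, the two returned URLs would play the roles of `ibm_auth_endpoint` and `endpoint_url` when the client is created.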




In [None]:
# Imports: run this cell each time after restarting the kernel.
# Uncomment the next line if the Watson SDK is not installed yet.
#!pip install watson_developer_cloud
import watson_developer_cloud as watson
import json
from botocore.client import Config
import ibm_boto3
import requests


##  Cloud Object Storage - Add Credentials & Bucket Name
If you have not already set up COS, please see Step 1.

### Credentials
Credentials are also created for you when you create a project. From the service dashboard page, select `Service Credentials` in the left navigation menu, then copy and paste the credentials below:

### Bucket name
Buckets are created for you when you create a project. From the service dashboard page, select `Buckets` in the left navigation menu, then copy and paste your bucket name below:


In [None]:
# For Cloud Object Storage - populate your own information here from "SERVICES" on this page, or Console Dashboard on ibm.com/cloud

# From service dashboard page select Service Credentials from left navigation menu item
credentials_os = {
  "apikey": "",
  "cos_hmac_keys": {
    "access_key_id": "",
    "secret_access_key": ""
  },
  "endpoints": "https://cos-service.bluemix.net/endpoints",
  "iam_apikey_description": "Auto generated apikey during resource-key operation for Instance",
  "iam_apikey_name": "",
  "iam_role_crn": "",
  "iam_serviceid_crn": "",
  "resource_instance_id": ""
}

# Buckets are created for you when you create a project. From the service dashboard page, select Buckets in the left navigation menu.
credentials_os['BUCKET'] = '<bucket_name>' # copy bucket name from COS

In [None]:
# The code was removed by DSX for sharing.

### Create Watson Natural Language Understanding (NLU) service

There are two ways to create a new NLU service: (1) click SERVICES above and create or add a new LITE instance of NLU; or (2) create a LITE NLU service from the Console Dashboard at ibm.com/cloud. Then click 'SERVICE CREDENTIALS' to get the credentials.

For more information on creating Watson services, see Notebook 1
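The credential dictionaries in this notebook ship with empty strings that must be filled in by hand. As a small sketch, a check like the one below could flag fields that are still blank before any client is created. The helper name `missing_fields` and the example dictionary are hypothetical and only mirror the shape of `credentials_os`.

```python
def missing_fields(credentials, prefix=""):
    """Return dotted paths of every value that is still an empty string."""
    missing = []
    for key, value in credentials.items():
        path = prefix + key
        if isinstance(value, dict):
            # Recurse into nested sections such as "cos_hmac_keys".
            missing.extend(missing_fields(value, path + "."))
        elif value == "":
            missing.append(path)
    return missing


# Illustrative dictionary; "placeholder" stands in for a populated value.
example_credentials = {
    "apikey": "",
    "cos_hmac_keys": {"access_key_id": "", "secret_access_key": "placeholder"},
    "resource_instance_id": "placeholder",
}

print(missing_fields(example_credentials))
```

An empty list means every field has some value; it does not confirm that the values are valid for the service.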


In [None]:
credentials_nlu = {
    "url": "",
    "apikey": "",
    "version": "2017-02-27"
}

### Set up Object Storage Client

In [None]:
endpoints = requests.get(credentials_os['endpoints']).json()

iam_host = (endpoints['identity-endpoints']['iam-token'])
cos_host = (endpoints['service-endpoints']['cross-region']['us']['public']['us-geo'])

auth_endpoint = "https://" + iam_host + "/oidc/token"
service_endpoint = "https://" + cos_host


client = ibm_boto3.client(
    's3',
    ibm_api_key_id = credentials_os['apikey'],
    ibm_service_instance_id = credentials_os['resource_instance_id'],
    ibm_auth_endpoint = auth_endpoint,
    config = Config(signature_version='oauth'),
    endpoint_url = service_endpoint
   )


### NLU

- `process_text()` goes through the speech-to-text results and concatenates the sentences into a single transcript; `process_text_chunks()` additionally splits that transcript into chunks of `chunk_size` words
- `analyze_transcript()` calls the Natural Language Understanding endpoint to analyze the transcript
- `post_analysis()` processes the response from the NLU endpoint and shows insights based on it
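The concatenation step can be tried without any service call. The sketch below assumes the transcript file has the shape that `process_text()` reads (`results`, then `alternatives`, then `transcript`). The JSON is a hand-written stand-in, and `join_transcript` is a hypothetical variant that, unlike `process_text()`, trims each sentence and joins the pieces with single spaces.

```python
import json

# Hand-written stand-in for a '<name>_text.json' file; real speech-to-text
# output carries more fields than the ones shown here.
mock_stt_json = json.dumps({
    "results": [
        {"alternatives": [{"transcript": "hello I would like to change my address "}]},
        {"alternatives": [{"transcript": "I moved last week "}]},
    ]
})


def join_transcript(text):
    """Concatenate the first alternative of every result into one string."""
    pieces = [
        result["alternatives"][0]["transcript"].strip()
        for result in json.loads(text)["results"]
    ]
    return " ".join(pieces)


transcript = join_transcript(mock_stt_json)
print(transcript)
```

Only the first alternative of each result is used, which matches the indexing in `process_text()`.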

In [None]:
#NLU

from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding.features import (
    v1 as Features)

natural_language_understanding = NaturalLanguageUnderstandingV1(
    version = '2017-02-27',
    iam_apikey = credentials_nlu['apikey'],
)

chunk_size = 25  # number of words per chunk used to split a transcript
# e.g. a 290-word transcript yields 12 chunks: 11 with 25 words and 1 with 15 words. Chunking approximates the 'time domain' for this lab.

def chunk_transcript(transcript, chunk_size):
    words = transcript.split()  # split on any whitespace; avoids empty tokens from trailing or repeated spaces
    return [words[i:i+chunk_size] for i in range(0, len(words), chunk_size)]  # consecutive slices of chunk_size words

def process_text(text):
    transcript=''
    for sentence in json.loads(text)['results']:
        transcript = transcript + sentence['alternatives'][0]['transcript'] # concatenate sentences
    #transcript = chunk_transcript(transcript, chunk_size) # chunk the transcript
    return  transcript


def analyze_transcript(features, file_name):
    streaming_body = client.get_object(Bucket = credentials_os['BUCKET'], Key=file_name.split('.')[0]+'_text.json')['Body']
    transcript = process_text(streaming_body.read().decode("utf-8"))
    nlu_analysis = natural_language_understanding.analyze(features, transcript, return_analyzed_text=True).get_result()
    res=client.put_object(Bucket = credentials_os['BUCKET'], Key=file_name.split('.')[0]+'_NLU.json', Body= json.dumps(nlu_analysis))
    return nlu_analysis

def post_analysis(result):
    print(result['analyzed_text'])
    categories = result['categories']
    for category in categories:
        print('label: ', category['label'], ', score: ', category['score']) #add table instead of prints

        
def process_text_chunks(text):
    transcript=''
    for sentence in json.loads(text)['results']:
        transcript = transcript + sentence['alternatives'][0]['transcript'] # concatenate sentences
    transcript = chunk_transcript(transcript, chunk_size) # chunk the transcript
    return  transcript

def analyze_transcript_chunks(features, file_name):
    streaming_body = client.get_object(Bucket = credentials_os['BUCKET'], Key=file_name.split('.')[0]+'_text.json')['Body']
    transcript=streaming_body.read().decode("utf-8")
    nlu_analysis={}
    for chunk in process_text_chunks(transcript):
        chunk = ' '.join(chunk)
        print('chunk: ', chunk)
        nlu_analysis[chunk] = natural_language_understanding.analyze(features, chunk, return_analyzed_text=True, language='en').get_result()
    outfilename = file_name.split('.')[0]+'_NLUchunks.json'
    print("writing file: ", outfilename, " to cloud object storage" )
    res=client.put_object(Bucket = credentials_os['BUCKET'], Key=outfilename, Body= json.dumps(nlu_analysis))
    return nlu_analysis


def post_analysis_chunks(result):
    for chunk in result.keys():
        categories = result[chunk]['categories']
        print('\nchunk: ', chunk)
        for category in categories:
            print('label: ', category['label'], ', score: ', category['score']) #add table instead of prints

In [None]:
file_list = ['sample1-addresschange-positive.ogg',
             'sample2-address-negative.ogg',
             'sample3-shirt-return-weather-chitchat.ogg',
             'sample4-angryblender-sportschitchat-recovery.ogg',
             'sample5-calibration-toneandcontext.ogg',
             'jfk_1961_0525_speech_to_put_man_on_moon.ogg',
             'May 1 1969 Fred Rogers testifies before the Senate Subcommittee on Communications.ogg']

features = {"concepts":{},"entities":{},"keywords":{},"categories":{},"emotion":{},"sentiment":{},"semantic_roles":{} }

Next, we will run NLU enrichment on the transcripts for all audio files. We show two approaches:
* One NLU call per audio file: In this case, we get aggregated features for the complete audio file.
* One NLU call per chunk of audio file where a chunk is 25 words: In this case, we get more granular NLU features.
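The chunked approach can be sketched without calling NLU. The block below mirrors the slicing idea of `chunk_transcript()` on a synthetic 290-word string; `chunk_words` and the generated words are illustrative only.

```python
def chunk_words(transcript, chunk_size):
    """Split a transcript into consecutive lists of at most chunk_size words."""
    words = transcript.split()
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


# Synthetic transcript: 290 made-up words.
sample = " ".join("word%d" % n for n in range(290))

chunks = chunk_words(sample, 25)
sizes = [len(chunk) for chunk in chunks]
print(len(chunks), sizes[0], sizes[-1])
```

With 25 words per chunk, 290 words give 11 full chunks and a final chunk of 15 words, so the last NLU call in the chunked approach usually sees less text than the others.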

Both approaches are valid. The cells below use the second approach by default, because chunks provide more granular sentiment results. In practice, you can decide which is more relevant to your application. To try the first approach, call `analyze_transcript` and `post_analysis` in the cells below instead of `analyze_transcript_chunks` and `post_analysis_chunks`.
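Per-chunk results can be long when printed pair by pair. As one possible way to condense them, the sketch below keeps only the highest-scoring category for each chunk. The result dictionary is mock data with invented labels and scores, shaped like the value returned by `analyze_transcript_chunks()`, and `top_categories` is a hypothetical helper.

```python
# Mock per-chunk result: keys are chunk texts, values hold a "categories" list.
# Labels and scores are invented for illustration.
mock_result = {
    "first chunk of words": {"categories": [
        {"label": "/business and industrial", "score": 0.61},
        {"label": "/shopping", "score": 0.83},
    ]},
    "second chunk of words": {"categories": []},
}


def top_categories(result):
    """Return one (chunk, label, score) row per chunk, keeping the best score."""
    rows = []
    for chunk, analysis in result.items():
        categories = analysis.get("categories", [])
        if categories:
            best = max(categories, key=lambda c: c["score"])
            rows.append((chunk, best["label"], best["score"]))
        else:
            # Short chunks may come back without any category.
            rows.append((chunk, None, None))
    return rows


rows = top_categories(mock_result)
for chunk, label, score in rows:
    print("{:<25} {:<28} {}".format(chunk, str(label), score))
```

The same rows could be passed to a table library if one is available in the notebook environment.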

In [None]:
result = analyze_transcript_chunks(features, file_list[0])
post_analysis_chunks(result)


In [None]:
## To run NLU per chunk (25 words per chunk) on every audio file, make sure the next lines are uncommented
for filename in file_list:
    print("\n\nprocessing file: ", filename)
    result = analyze_transcript_chunks(features,filename)
    post_analysis_chunks(result)