
## Notebook 2 – Natural Language Understanding (NLU)
NLU analyzes text to extract metadata from content, such as concepts, entities, keywords, categories, relations, and semantic roles.
https://www.ibm.com/watson/services/natural-language-understanding/ 
https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/  


## Install dependencies

Python’s standard library is extensive and offers a wide range of facilities. It includes built-in modules such as `json`, which reads and writes JSON, a lightweight data interchange format. See https://docs.python.org/2/library/index.html and https://docs.python.org/2/library/json.html

IBM Watson Developer Cloud provides a Python client library for getting started quickly with the various Watson API services. https://pypi.python.org/pypi/watson-developer-cloud

Using Python with IBM COS: Python support is provided through `ibm_boto3`, IBM's fork of the Boto 3 library. The library provides complete access to the COS API and can source credentials. The IBM COS endpoint must be specified when creating a service resource or low-level client, as shown in the documentation: https://ibm-public-cos.github.io/crs-docs/python




In [None]:
#imports.... Run this each time after restarting the Kernel
#!pip install watson_developer_cloud
import watson_developer_cloud as watson
import json
from botocore.client import Config
import ibm_boto3
import requests


### Create Watson Natural Language Understanding service

For more information on creating Watson services, see Notebook 1.

### Add Credentials

Copy and paste the following snippet into the next cell, and add your own set of credentials there:

```code
credentials_os = {
  "apikey": "",
  "cos_hmac_keys": {
    "access_key_id": "",
    "secret_access_key": ""
  },
  "endpoints": "",
  "iam_apikey_description": "",
  "iam_apikey_name": "",
  "iam_role_crn": "",
  "iam_serviceid_crn": "",
  "resource_instance_id": ""
}
credentials_os['BUCKET'] = '<bucket_name_from_your_COS_instance>'

credentials_nlu = {
    "url": "",
    "username": "",
    "password": "",
    "version": "2017-02-27"
}

```

In [None]:
# The code was removed by DSX for sharing.

## Set up Object Storage
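
The cell below fetches the endpoints document referenced by `credentials_os['endpoints']` and builds the authentication and service URLs from it. The following is a minimal offline sketch of that lookup. It assumes a hand-written `sample_endpoints` dictionary with placeholder host names; the real document is fetched over HTTP and contains many more entries. The names `sample_endpoints` and `build_urls` are illustrative and not part of the notebook.

```python
# Hypothetical, trimmed-down stand-in for the JSON returned by the endpoints URL.
# The host names are placeholders, not real IBM hosts.
sample_endpoints = {
    "identity-endpoints": {"iam-token": "iam.example.com"},
    "service-endpoints": {
        "cross-region": {"us": {"public": {"us-geo": "cos.us-geo.example.com"}}}
    },
}

def build_urls(endpoints):
    """Return (auth_endpoint, service_endpoint), built the same way as in the notebook cell."""
    iam_host = endpoints["identity-endpoints"]["iam-token"]
    cos_host = endpoints["service-endpoints"]["cross-region"]["us"]["public"]["us-geo"]
    return "https://" + iam_host + "/oidc/token", "https://" + cos_host

auth_endpoint_demo, service_endpoint_demo = build_urls(sample_endpoints)
print(auth_endpoint_demo)
print(service_endpoint_demo)
```

If a bucket lives in a different region, the keys under `service-endpoints` would presumably need to change accordingly; the `us-geo` cross-region path is the one this notebook uses.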

In [None]:
endpoints = requests.get(credentials_os['endpoints']).json()

iam_host = (endpoints['identity-endpoints']['iam-token'])
cos_host = (endpoints['service-endpoints']['cross-region']['us']['public']['us-geo'])

auth_endpoint = "https://" + iam_host + "/oidc/token"
service_endpoint = "https://" + cos_host


client = ibm_boto3.client(
    's3',
    ibm_api_key_id = credentials_os['apikey'],
    ibm_service_instance_id = credentials_os['resource_instance_id'],
    ibm_auth_endpoint = auth_endpoint,
    config = Config(signature_version='oauth'),
    endpoint_url = service_endpoint
   )




### NLU

- `process_text()` goes through the text, fetches the sentences, and concatenates them into one transcript (the chunking step based on chunk size is currently commented out)
- `analyze_transcript()` calls the Natural Language Understanding endpoint and analyzes the transcript
- `post_analysis()` processes the results and shows insights based on the response from the NLU endpoint
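
The concatenation and chunking steps can be tried without any Watson or COS credentials. The sketch below assumes a hypothetical `sample_text` payload containing only the fields that `process_text()` reads (`results`, `alternatives`, `transcript`). The helper names `join_transcript` and `chunk_words` are illustrative. Unlike the notebook's `chunk_transcript()`, this sketch splits on any whitespace, so a trailing space does not produce an empty word.

```python
import json

# Hypothetical transcript payload; only the fields read by process_text() are included.
sample_text = json.dumps({
    "results": [
        {"alternatives": [{"transcript": "hello I would like to "}]},
        {"alternatives": [{"transcript": "change my address please "}]},
    ]
})

def join_transcript(text):
    """Concatenate the first alternative of every result into one string."""
    return "".join(r["alternatives"][0]["transcript"] for r in json.loads(text)["results"])

def chunk_words(transcript, chunk_size):
    """Split a transcript into lists of at most chunk_size words."""
    words = transcript.split()  # split on any whitespace, dropping empty strings
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

transcript_demo = join_transcript(sample_text)
chunks_demo = chunk_words(transcript_demo, 4)
print(transcript_demo)
print(chunks_demo)
```

With nine words and a chunk size of 4, the last chunk holds the single remaining word, which mirrors how a transcript whose length is not a multiple of `chunk_size` ends with a shorter chunk.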

In [None]:
#NLU

from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding.features import (
    v1 as Features)

natural_language_understanding = NaturalLanguageUnderstandingV1(
    version = '2017-02-27',
    username = credentials_nlu['username'],
    password = credentials_nlu['password']
)

chunk_size = 25 # Number of words per chunk when disaggregating a transcript
# e.g. with chunk_size = 25, a 290 word transcript would have 12 chunks - 11 with 25 words and 1 with 15 words - approximates 'time domain' for this lab

def chunk_transcript(transcript, chunk_size):
    transcript = transcript.split(' ')
    return [ transcript[i:i+chunk_size] for i in range(0, len(transcript), chunk_size) ] # chunking data

def process_text(text):
    transcript=''
    for sentence in json.loads(text)['results']:
        transcript = transcript + sentence['alternatives'][0]['transcript'] # concatenate sentences
    #transcript = chunk_transcript(transcript, chunk_size) # chunk the transcript
    return  transcript


def analyze_transcript(features, file_name):
    base_name = file_name.split('.')[0] # file name up to the first dot
    # Read the transcript JSON stored for this audio file in Object Storage
    streaming_body = client.get_object(Bucket = credentials_os['BUCKET'], Key = base_name + '_text.json')['Body']
    transcript = process_text(streaming_body.read().decode("utf-8"))
    nlu_analysis = natural_language_understanding.analyze(features, transcript, return_analyzed_text=True)
    # Store the NLU response in the same bucket
    client.put_object(Bucket = credentials_os['BUCKET'], Key = base_name + '_NLU.json', Body = json.dumps(nlu_analysis))
    return nlu_analysis

def post_analysis(result):
    print(result['analyzed_text'])
    categories = result['categories']
    for category in categories:
        print('label: ', category['label'], ', score: ', category['score']) #add table instead of prints


In [None]:
file_list = ['sample1-addresschange-positive.ogg',
             'sample2-address-negative.ogg',
             'sample3-shirt-return-weather-chitchat.ogg',
             'sample4-angryblender-sportschitchat-recovery.ogg',
             'sample5-calibration-toneandcontext.ogg',
             'jfk_1961_0525_speech_to_put_man_on_moon.ogg',
             'May 1 1969 Fred Rogers testifies before the Senate Subcommittee on Communications.ogg']

features = {"concepts":{},"entities":{},"keywords":{},"categories":{},"emotion":{},"sentiment":{},"semantic_roles":{} }

In [None]:
result = analyze_transcript(features, file_list[0])
post_analysis(result)


In [None]:
results = analyze_transcript(features, file_list[6])

post_analysis(results)
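
The comment in `post_analysis()` suggests showing a table instead of individual prints. One possible approach, sketched here with plain string formatting, is shown below. `sample_result` is a hypothetical response with made-up labels and scores, and `categories_table` is an illustrative helper that is not part of the notebook.

```python
# Hypothetical NLU-style result; labels and scores are made up for illustration.
sample_result = {
    "categories": [
        {"label": "/shopping/retail", "score": 0.85},
        {"label": "/home and garden/appliances", "score": 0.42},
    ]
}

def categories_table(result):
    """Format result['categories'] as a fixed-width text table."""
    rows = [(c["label"], c["score"]) for c in result.get("categories", [])]
    # Column width is the longest label, or the header if there are no rows
    width = max([len("label")] + [len(label) for label, _ in rows])
    lines = ["{:<{w}}  {}".format("label", "score", w=width)]
    for label, score in rows:
        lines.append("{:<{w}}  {:.2f}".format(label, score, w=width))
    return "\n".join(lines)

table_demo = categories_table(sample_result)
print(table_demo)
```

Calling `print(categories_table(result))` on a real response should work the same way, provided the response contains a `categories` list with `label` and `score` fields, as `post_analysis()` already assumes.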
