# Chat Transcript

## 1. Setup

To prepare your environment, you need to install some packages and enter credentials for the Watson services.


## 1.1 Install the necessary packages

You need the latest versions of these packages:

* python-docx: To read from a docx file
* Watson Developer Cloud: a client library for Watson services
* nltk: Leading platform for building Python programs to work with human language data


### Install python-docx

In [1]:
!pip install python-docx

Requirement not upgraded as not directly required: python-docx in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: lxml>=2.3.2 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from python-docx)


### Install Watson Developer Cloud

In [2]:
!pip install watson-developer-cloud

Requirement not upgraded as not directly required: watson-developer-cloud in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: autobahn>=0.10.9 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: python-dateutil>=2.5.3 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: service-identity>=17.0.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: Twisted>=13.2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: pyOpenSSL>=16.2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson-developer-cloud)
Requirement not upgraded as not directly required: requests<3.

### Install nltk package

In [3]:
!pip install nltk --upgrade

Requirement already up-to-date: nltk in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: six in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from nltk)


### Now restart the kernel by choosing Kernel > Restart.

## 1.2 Import packages and libraries

Import the packages and libraries that you'll use:


In [4]:
import docx
from docx import Document

import watson_developer_cloud
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
  import Features, EntitiesOptions, KeywordsOptions, SemanticRolesOptions
from watson_developer_cloud import WatsonException 

from io import BytesIO
import zipfile
from collections import OrderedDict
import nltk
from nltk import word_tokenize,sent_tokenize,ne_chunk
import json

## 2. Configuration

Set the configurable items of the notebook in this section.


## 2.1 Add your service credentials from IBM Cloud for the Watson services

Create a Watson Natural Language Understanding (NLU) service on IBM Cloud. Insert the username and password values for your NLU service in the following cell. Do not change the value of the version field. Run the cell.


In [5]:
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2018-03-23',
    username="9d8eddab-da9f-454c-a1cc-a02ed8c2fe92",
    password="A4JHnZoZ1Wwt")
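Hardcoded credentials are easy to leak when a notebook is shared. As a hedged alternative, the sketch below reads them from environment variables. The variable names `NLU_USERNAME` and `NLU_PASSWORD` and the helper `load_nlu_credentials` are assumptions made for illustration; they are not part of the Watson SDK.

```python
import os

def load_nlu_credentials(env=None):
    """Return (username, password) read from environment variables.

    NLU_USERNAME and NLU_PASSWORD are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("NLU_USERNAME", "NLU_PASSWORD") if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return env["NLU_USERNAME"], env["NLU_PASSWORD"]

# Demonstration with a plain dict standing in for os.environ
demo_env = {"NLU_USERNAME": "example-user", "NLU_PASSWORD": "example-pass"}
username, password = load_nlu_credentials(demo_env)
print(username)
```

The returned values could then be passed as `username=` and `password=` to `NaturalLanguageUnderstandingV1` in place of the literals.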

## 2.2 Add your service credentials for Object Storage

You must create an Object Storage service on IBM Cloud.
* Insert the streaming body credentials for the zip file, and make sure the streaming body variable is named streaming_body_1
* Insert the service credentials for the configuration file, and make sure the variable is named credentials_1


In [6]:

import sys
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_e857a804ea664403af9fdcb1f7be9ae1 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='XmdQPC08VWYEdRE1xqesRcU7XAL47zhVQPe2D0ISvzrq',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

# Your data file was loaded into a botocore.response.StreamingBody object.
# Please read the documentation of ibm_boto3 and pandas to learn more about your possibilities to load the data.
# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
# pandas documentation: http://pandas.pydata.org/
streaming_body_1 = client_e857a804ea664403af9fdcb1f7be9ae1.get_object(Bucket='chattranscript-donotdelete-pr-ycfr2ziwzhjktq', Key='Travel4Cases.zip')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(streaming_body_1, "__iter__"): streaming_body_1.__iter__ = types.MethodType( __iter__, streaming_body_1 ) 



In [7]:

# @hidden_cell
# The following code contains the credentials for a file in your IBM Cloud Object Storage.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    'IBM_API_KEY_ID': 'XmdQPC08VWYEdRE1xqesRcU7XAL47zhVQPe2D0ISvzrq',
    'IAM_SERVICE_ID': 'iam-ServiceId-1a9955e4-1c95-4c04-802e-81aea1e66ef7',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.ng.bluemix.net/oidc/token',
    'BUCKET': 'chattranscript-donotdelete-pr-ycfr2ziwzhjktq',
    'FILE': 'config.txt'
}


## 2.3 Global Variables

file_name: Name of the final output csv file

In [8]:
file_name='train.csv'

## 2.4 Configure and download required NLTK packages

Download the 'punkt' and 'averaged_perceptron_tagger' NLTK packages, which are used for tokenization and POS tagging.


In [9]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/dsxuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/dsxuser/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## 3. Persistence and Storage

## 3.1 Configure Object Storage Client

In [10]:
cos = ibm_boto3.client('s3',
                    ibm_api_key_id=credentials_1['IBM_API_KEY_ID'],
                    ibm_service_instance_id=credentials_1['IAM_SERVICE_ID'],
                    ibm_auth_endpoint=credentials_1['IBM_AUTH_ENDPOINT'],
                    config=Config(signature_version='oauth'),
                    endpoint_url=credentials_1['ENDPOINT'])

def get_file(filename):
    '''Retrieve file from Cloud Object Storage'''
    fileobject = cos.get_object(Bucket=credentials_1['BUCKET'], Key=filename)['Body']
    return fileobject

def load_string(fileobject):
    '''Load the file contents into a Python string'''
    text = fileobject.read()
    return text

## 4. Input Data

## 4.1 Load the Configuration file

In [11]:
file= credentials_1['FILE']
json_str = load_string(get_file(file)).decode('utf-8')
#print(json_str)
jsontext=json.loads(json_str)
jsontext

{'configuration': {'class': {'stages': [{'System': [{'name': 'purpose',
       'purpose': [{'regex': 'Chunk: {<NN> <NN>}',
         'tag': 'purpose',
         'type': 'Chunk'}]},
      {'name': 'policy',
       'policy': [{'split': ',', 'tag': 'policy', 'type': 'split'}]}],
     'name': 'System'}]}}}
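The nesting of the configuration is easier to follow when it is walked explicitly. The sketch below copies the dictionary from the output above and lists each rule; the helper name `list_rules` is an assumption for illustration and does not appear elsewhere in the notebook.

```python
# Configuration copied from the output above
config = {'configuration': {'class': {'stages': [{'System': [
    {'name': 'purpose',
     'purpose': [{'regex': 'Chunk: {<NN> <NN>}', 'tag': 'purpose', 'type': 'Chunk'}]},
    {'name': 'policy',
     'policy': [{'split': ',', 'tag': 'policy', 'type': 'split'}]}],
    'name': 'System'}]}}}

def list_rules(config):
    """Return (speaker, step name, rule type, tag) for every step in every stage."""
    rules = []
    for stage in config['configuration']['class']['stages']:
        speaker = stage['name']            # e.g. 'System'
        for step in stage[speaker]:        # steps are stored under the speaker's name
            rule = step[step['name']][0]   # first rule listed for the step
            rules.append((speaker, step['name'], rule['type'], rule['tag']))
    return rules

for rule in list_rules(config):
    print(rule)
```

Each stage stores its steps under a key equal to the stage's own name, and each step stores its rules under a key equal to the step's name, which is why two lookups by name are needed.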

## 4.2 Extract Chat Transcripts from the zip file

In [12]:
zip_ref = zipfile.ZipFile(BytesIO(streaming_body_1.read()),'r')
paths = zip_ref.namelist()
for path in paths:
    loc=zip_ref.extract(path)
zip_ref.close()
loc

'/home/dsxuser/work/Travel4Cases.docx'
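The cell above keeps only the last extracted path in loc, so it implicitly assumes the archive holds a single document. A minimal sketch that returns every extracted path is shown below; `extract_zip_bytes` is a hypothetical helper, and a small archive built in memory stands in for `streaming_body_1.read()`.

```python
import os
import tempfile
import zipfile
from io import BytesIO

def extract_zip_bytes(data, target_dir):
    """Extract a zip archive held in memory and return the extracted paths."""
    with zipfile.ZipFile(BytesIO(data), 'r') as archive:
        return [archive.extract(name, path=target_dir) for name in archive.namelist()]

# Build a small archive in memory to stand in for the downloaded bytes
buffer = BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('sample.txt', 'Case 1\nEmp. Hello')

with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_zip_bytes(buffer.getvalue(), tmp)
    names = [os.path.basename(p) for p in extracted]
    with open(extracted[0]) as handle:
        content = handle.read()
print(names)
```

Using `ZipFile` as a context manager also closes the archive if extraction raises an error.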

In [13]:
filename=loc
document = docx.Document(filename)
docText = b'\n\n'.join([
    paragraph.text.encode('utf-8') for paragraph in document.paragraphs
])
docText= docText.decode('utf-8')
docText

'Case 1\n\n\n\nEmp. I need to travel to Mumbai on 18th Feb for 2 days \n\n\n\nSystem – Sure Anuj, so it will be Delhi to Mumbai, 18 to 20th Feb 2018.\n\nSystem – What is going to be purpose of this visit \n\n\n\nEmp. I am going for a customer work\n\n\n\nSystem – What is the current policy for customer travel \n\nSystem(Content) – Travel Restrictions in Place, only revenue generating travel with Q1 deal closer allowed. Approvals for less than Rs 20000 must be approved by VP and above that by Country GM. \n\n\n\nSystem – How is the customer, and what is the objective of customer travel\n\nEmp. – Customer is SBI, and they need Lab support to resolve an issue \n\n\n\nSystem – Can you give me PNR number for the support issue you are going. \n\nEmp – OK, let me find that. \n\nEmp. PNR number 11001100\n\n\n\nSystem – Thanks for submitting request, please note down your travel request id A202020. \n\n\n\nSYSTEM – Give me travel cost from Delhi to Mumbai for 2 days in Feb\n\nSYSTEM (ML Model) 

## 5. Form the Dataframe

### 5.1 Split the Chat Transcripts by Cases

Here, each case will form a record in the dataframe

In [14]:
quesandans= docText.strip().splitlines()
quesandans

['Case 1',
 '',
 '',
 '',
 'Emp. I need to travel to Mumbai on 18th Feb for 2 days ',
 '',
 '',
 '',
 'System – Sure Anuj, so it will be Delhi to Mumbai, 18 to 20th Feb 2018.',
 '',
 'System – What is going to be purpose of this visit ',
 '',
 '',
 '',
 'Emp. I am going for a customer work',
 '',
 '',
 '',
 'System – What is the current policy for customer travel ',
 '',
 'System(Content) – Travel Restrictions in Place, only revenue generating travel with Q1 deal closer allowed. Approvals for less than Rs 20000 must be approved by VP and above that by Country GM. ',
 '',
 '',
 '',
 'System – How is the customer, and what is the objective of customer travel',
 '',
 'Emp. – Customer is SBI, and they need Lab support to resolve an issue ',
 '',
 '',
 '',
 'System – Can you give me PNR number for the support issue you are going. ',
 '',
 'Emp – OK, let me find that. ',
 '',
 'Emp. PNR number 11001100',
 '',
 '',
 '',
 'System – Thanks for submitting request, please note down your travel re

In [15]:
word='Case'
positions = [x for x, n in enumerate(quesandans) if word in n]
positions
quesandans[positions[2]]

'Case 3: Conf travel '

In [16]:
case=list()
leng = len(positions)-1
for i in range(len(positions)):
    if i != leng:
        case.append(quesandans[positions[i]:positions[i+1]])
    else:
        case.append(quesandans[positions[i]:len(quesandans)])
case[3]

['Case 4: Customer Travel ',
 '',
 '',
 '',
 '',
 '',
 'Emp. I need to travel to Bangalore on 18th Feb for 2 days ',
 '',
 '',
 '',
 'System – Sure Anuj, so it will be Delhi to Bangalore, 18 to 20th Feb 2018.',
 '',
 'System – What is going to be purpose of this visit ',
 '',
 '',
 '',
 'Emp. I am going for a customer work',
 '',
 '',
 '',
 'System – What is the current policy for customer travel ',
 '',
 'System(Content) – Travel allowed for valid business opportunity, explore alternative to travel or to reduce  cost.  ',
 '',
 '',
 '',
 'System – Who is the customer, and what is the objective of customer travel',
 '',
 'Emp. – Customer is XYZ Infotech, and going to take a session on DSx, which requires ML skills. ',
 '',
 '',
 '',
 'System – Do you have opportunity id?.',
 '',
 'Emp. – No, its my first meeting with customer. ',
 '',
 '',
 '',
 'System – Check from travel system on travel cost on 22, 23 Delhi to Bangalore. ',
 '',
 'System – High',
 '',
 '',
 '',
 'System – Do we have

In [17]:
''' Remove empty strings and Remove Case i
'''
for i in range(len(case)):
    case[i]=list(filter(None,case[i]))
    case[i].pop(0)
case

[['Emp. I need to travel to Mumbai on 18th Feb for 2 days ',
  'System – Sure Anuj, so it will be Delhi to Mumbai, 18 to 20th Feb 2018.',
  'System – What is going to be purpose of this visit ',
  'Emp. I am going for a customer work',
  'System – What is the current policy for customer travel ',
  'System(Content) – Travel Restrictions in Place, only revenue generating travel with Q1 deal closer allowed. Approvals for less than Rs 20000 must be approved by VP and above that by Country GM. ',
  'System – How is the customer, and what is the objective of customer travel',
  'Emp. – Customer is SBI, and they need Lab support to resolve an issue ',
  'System – Can you give me PNR number for the support issue you are going. ',
  'Emp – OK, let me find that. ',
  'Emp. PNR number 11001100',
  'System – Thanks for submitting request, please note down your travel request id A202020. ',
  'SYSTEM – Give me travel cost from Delhi to Mumbai for 2 days in Feb',
  'SYSTEM (ML Model)  – 19000',
  '
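Cells 15 to 17 locate the case titles, slice the transcript between them, then drop blank lines and the title itself. The same steps can be sketched as one function; `split_cases` and the sample lines are illustrative assumptions, not part of the notebook.

```python
def split_cases(lines, marker='Case'):
    """Group transcript lines into cases, dropping blank lines and each case title."""
    starts = [i for i, line in enumerate(lines) if marker in line]
    cases = []
    # Pair each title position with the next one (or the end of the transcript)
    for begin, end in zip(starts, starts[1:] + [len(lines)]):
        body = [line for line in lines[begin + 1:end] if line]
        cases.append(body)
    return cases

sample = ['Case 1', '', 'Emp. I need to travel', '', 'System – Sure',
          'Case 2: Conf travel ', '', 'Emp. I need to attend a conference']
print(split_cases(sample))
```

Because the title test is a substring match, a dialogue line that happens to contain "Case" would also start a new group; `line.startswith(marker)` would be a stricter check.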

In [18]:
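''' Split condition
'''
# case[0][12] is the 'SYSTEM – Give me travel cost ...' line; its character at index 7 is the dash between speaker and text
split=case[0][12][7] # Make Global Variable at the end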
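The expression `case[0][12][7]` picks out the dash that separates the speaker from the utterance, which depends on the exact position of one line in the first case. A less position-dependent sketch is shown below; `split_turn` is a hypothetical helper, and the separator is assumed to be the same dash used in the transcript.

```python
def split_turn(line, separator='–'):
    """Split a transcript line into (speaker, utterance) at the first separator."""
    speaker, found, utterance = line.partition(separator)
    if not found:
        # No separator (e.g. 'Emp. PNR number 11001100'): keep the whole line as text
        return None, line.strip()
    return speaker.strip(), utterance.strip()

line = 'SYSTEM – Give me travel cost from Delhi to Mumbai for 2 days in Feb'
print(line[7] == '–')
print(split_turn(line))
```

Lines without the dash return `None` as the speaker, which makes the missing separator explicit instead of raising an `IndexError` on `split(...)[1]`.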

In [19]:
def analyze_using_NLU(analysistext):
    """ Call Watson Natural Language Understanding service to obtain analysis results.
    """
    response = natural_language_understanding.analyze( 
        text=analysistext,
        features=Features(entities= EntitiesOptions(), keywords=KeywordsOptions(), semantic_roles=SemanticRolesOptions()))
    return response

In [20]:
def find_index(values):
    to_find= 'Thanks for submitting request, please note down your travel request id'
    for i in range(len(values)):
        if to_find in values[i]:
            return i
#find_index(cases_dict[1])

### 5.2 First Strategy

In [21]:
def POS_tagging(text):
    """ Generate Part of speech tagging of the text.
    """
    POSofText = nltk.tag.pos_tag(text)
    return POSofText

def text_extract(reg, tag,text):
    """ Use Chunking to extract text from sentence
    """
    entities = list()
    text_extracted= str()
    chunkParser= nltk.RegexpParser(reg)
    chunked= chunkParser.parse(POS_tagging(text))
    #print(chunked)
    for subtree in chunked.subtrees():
        if subtree.label() == 'Chunk':
            #print(subtree.leaves())
            entities=subtree.leaves()
    #print(entities)
    entities= list(entities)
    # Iterate over every (word, tag) leaf of the chunk, not over the two fields of the first leaf
    for i in range(len(entities)):
        text_extracted= text_extracted + str(entities[i][0])
        text_extracted=text_extracted + ' '
    return text_extracted
def first_strategy(value, start, end):
    """ Work with configuration files to extract using rules
    """
    firstlist=list()
    for i in range(start, end):
        check_system= jsontext['configuration']['class']['stages'][0]['name']
        # Skip the last line: the extracted value is read from the line that follows
        if check_system.strip() in value[i] and i+1 < len(value):
            stages= jsontext['configuration']['class']['stages'][0]
            # Iterate over the steps listed under the speaker key, not over the keys of the stage dict
            for j in range(len(stages[check_system.strip()])):
                if stages[check_system.strip()][j]['name'].strip() in value[i]:
                    step_name = stages[check_system.strip()][j]['name']
                    step= stages[check_system.strip()][j][step_name.strip()][0]
                    compare= step['type'].strip()
                    if compare =='Chunk':
                        words = nltk.word_tokenize(value[i+1])
                        firstlist.append((step['tag'],text_extract(step['regex'],step['tag'],words)))
                    elif compare == 'split':
                        firstlist.append((step['tag'],value[i+1]))
    return firstlist
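The 'purpose' rule in the configuration uses the chunk grammar `Chunk: {<NN> <NN>}`, which matches two consecutive tokens tagged NN. The pure-Python sketch below is a stand-in that illustrates that rule on pre-tagged tokens; it is not `nltk.RegexpParser`, the helper `chunk_noun_pairs` is hypothetical, and the tags are written by hand as an assumption about what a tagger would produce.

```python
def chunk_noun_pairs(tagged_tokens):
    """Return phrases made of two consecutive NN tokens, scanning left to right."""
    phrases = []
    i = 0
    while i < len(tagged_tokens) - 1:
        (word, tag), (next_word, next_tag) = tagged_tokens[i], tagged_tokens[i + 1]
        if tag == 'NN' and next_tag == 'NN':
            phrases.append(word + ' ' + next_word)
            i += 2   # matches do not overlap
        else:
            i += 1
    return phrases

# Hand-tagged tokens (assumed tags; a real tagger may label them differently)
tagged = [('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('for', 'IN'),
          ('a', 'DT'), ('customer', 'NN'), ('work', 'NN')]
print(chunk_noun_pairs(tagged))
```

Note that text_extract keeps only the last matching chunk in a sentence, whereas this sketch returns all of them.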

### 5.3 Second Strategy

In [22]:
def check_type(ent_type):
    """ Check for Type of Entities
    """
    not_req= ['Anatomy', 'Award','Broadcaster','Company','Crime' ,'Drug','EmailAddress','Facility', 'GeographicFeature','HealthCondition','Hashtag','IPAddress','JobTitle','Location','Movie','MusicGroup','NaturalEvent','Organization','Person','PrintMedia','Quantity','Sport','SportingEvent','TelevisionShow','TwitterHandle','Vehicle']
    if ent_type in not_req:
        return 1
    return 0

In [23]:
def second_strategy(case_value, start, end):
    """ Extract feature-value pairs from Chat Transcripts
    """
    finallist= list()
    for i in range(start,end-2,2):
        imp_feature=list()
        try:
            features=list()
            subject= list()
            obj=list()
            
            text = case_value[i].split(split)
            res=analyze_using_NLU(text[1])
            keywords= res['keywords']
            entities= res['entities']
            semantic_roles= res['semantic_roles']
            for k in keywords:
                ''' Extract Keywords for each sentence
                '''
                if len(k)> 0:
                    features.append(k['text'])
            for e in entities:
                '''Remove Entities which have type like Location, Organization, etc
                '''
                if check_type(e['type']):
                    if e['text'] in features:
                        index= features.index(e['text'])
                        del features[index]
                else:
                    if len(e)>0:
                        features.append(e['text'])
            
            
            for sent in semantic_roles:
                ''' Extract subject and object for each sentence
                '''
                subject.append(sent['subject']['text'])
                obj.append(sent['object']['text'])
                
            ''' Remove Features that are not present in subject or object
            '''
            if len(subject)==0 or len(obj)==0:
                imp_feature= imp_feature + features
            else:
                for j in range(len(subject)):
                    if len(features) > 1:
                        for f in features:
                            if f in subject[j]:
                                imp_feature.append(f)
                            if f in obj[j]:
                                if f not in imp_feature:
                                    imp_feature.append(f)
                    else:
                        if features[0] in subject[j]:
                            imp_feature.extend(features)
                        if features[0] in obj[j]:
                            if features[0] not in imp_feature:
                                imp_feature.extend(features)

            ''' Take Only the Feature with highest relevance
            '''
            while len(imp_feature) > 1:
                imp_feature.pop(-1)
            #print(case[1][i])
            #print(case[0][i+1])
            #print('--------------------------------------------------------------')
            #print(imp_feature)
            #print('-')
            
            ''' Get Values and Form a List of Tuples
            '''
            value= case_value[i+1].split(split)
            if len(value) > 1:
                #print(value[1])
                text=value[1]
                #print(text)
                finaltodf = (imp_feature[0],text)
            else:
                finaltodf = (imp_feature[0],value)
                #print(value)
            #print('--------------------------------------------------------------')
            finallist.append(finaltodf)
        except WatsonException as err:
            # Report the failure instead of printing an empty line
            print('NLU analysis failed for line', i, ':', err)
    return finallist
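The feature selection inside second_strategy can be followed without calling the service. The sketch below is a simplified version of that logic: it collects keywords, drops entities of excluded types, and keeps the first feature that appears in a subject or object. `pick_feature` is a hypothetical helper, and the response is hand-written in the shape the cell reads (keywords, entities, semantic_roles); it is not real service output.

```python
def pick_feature(response, excluded_types):
    """Return the first feature found in a subject or object, or None."""
    features = [k['text'] for k in response.get('keywords', [])]
    for entity in response.get('entities', []):
        if entity['type'] in excluded_types:
            if entity['text'] in features:
                features.remove(entity['text'])
        else:
            features.append(entity['text'])

    roles = response.get('semantic_roles', [])
    if not roles:
        # Without semantic roles every remaining feature is a candidate
        return features[0] if features else None

    for role in roles:
        subject = role.get('subject', {}).get('text', '')
        obj = role.get('object', {}).get('text', '')
        for feature in features:
            if feature in subject or feature in obj:
                return feature
    return None

# Hand-written example response (an assumption, not output from the service)
sample_response = {
    'keywords': [{'text': 'SBI'}, {'text': 'strategic customer'}],
    'entities': [{'type': 'Company', 'text': 'SBI'}],
    'semantic_roles': [{'subject': {'text': 'SBI'},
                        'object': {'text': 'a strategic customer'}}],
}
print(pick_feature(sample_response, {'Company', 'Location', 'Person'}))
```

Returning `None` when nothing matches avoids the `IndexError` that `imp_feature[0]` raises on an empty list.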

### 5.4 Collect Obtained Features and Form a Dataframe

In [24]:
def convert_to_dataframe(finallist, index):
    """ Take the result list of (feature, value) tuples and convert it to a one-row Pandas Dataframe
    """
    final= OrderedDict(finallist)
    df = pd.DataFrame(final, columns=final.keys(),index=index)
    return df
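Building an `OrderedDict` from a list of tuples has one consequence worth knowing: when the same tag occurs twice, the later value replaces the earlier one while the column keeps its first position. The sketch below shows this with made-up pairs; `find_duplicate_tags` is a hypothetical helper for spotting such collisions before the row is built.

```python
from collections import OrderedDict

# Made-up pairs in which 'purpose' occurs twice
pairs = [('purpose', 'customer work'), ('cost', '19000'), ('purpose', 'conference')]
record = OrderedDict(pairs)
print(list(record.items()))

def find_duplicate_tags(pairs):
    """Return tags that occur more than once, in order of first repetition."""
    seen, duplicates = set(), []
    for tag, _ in pairs:
        if tag in seen and tag not in duplicates:
            duplicates.append(tag)
        seen.add(tag)
    return duplicates

print(find_duplicate_tags(pairs))
```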

In [25]:
index=[i+1 for i in range(len(case))] # Index row number to be fed into Pandas
cases_dict= dict()
frames= list()
for i in index:
    toenter= {i : case[i-1]}
    cases_dict.update(toenter)
#cases_dict # Dictionary of all the text
for k,v in cases_dict.items():
    firstlist= list()
    secondlist= list()
    #print(k)
    start_second=find_index(v)
    #print(start_second)
    if start_second is not None:
        firstlist=first_strategy(v,0,start_second)
        secondlist=second_strategy(v,start_second+1,len(v))
        #print(secondlist)
    else:
        firstlist=first_strategy(v,0,len(v))
    #print(secondlist)
    recommendation=('Recommendation',(v[len(v)-1]).split(split)[1])
    finallist= firstlist+secondlist+[recommendation]
    df=convert_to_dataframe(finallist,[k])
    frames.append(df)
res= pd.concat(frames)
res= res[frames[0].columns]
res





| | purpose | policy | cost | PNR number | strategic customer | revenue | open opportunities | closure | sentiments | Owner | Mail | Reminder | Navjot Dept Code | safety tips | Recommendation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | customer work | System(Content) – Travel Restrictions in Place... | 19000.0 | Product - Lotus Notes, Sev 2, Open for 10 days | ISA Top 20 Excel sheet | USD100M | 5.0 | 78900000, Deal size - Large | HIGH NEGATIVE | Navjot Bhogal | Approved | Navjot says YES | UIE and | Tips | Your travel request is approved. Pl. follow t... |
| 2 | customer work | System(Content) – Travel Restrictions in Place... | | | | | | | | | | | | | Travel approval mail to Emp. |
| 3 | | System(Content) – Travel allowed for reputed e... | | | | | | | | | | | | | Your travel is approved. |
| 4 | customer work | System(Content) – Travel allowed for valid bus... | | | | | | | | | | | | | Request rejected with options 1. Travel on o... |


### 5.5 Save Results as CSV

In [26]:
csv_file=res.to_csv(file_name, sep='\t', encoding='utf-8')
path='/home/dsxuser/work/'+file_name  # the separator before the file name was missing

In [27]:
def put_file(filename, filecontents):
    '''Write file to Cloud Object Storage'''
    resp = cos.put_object(Bucket=credentials_1['BUCKET'], Key=filename, Body=filecontents)
    return resp
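# Upload the contents of the CSV file rather than a path string
with open(file_name, 'rb') as f:
    resp = put_file(file_name, f.read())
resp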

{'ETag': '"06a07b6695854b7e5cf37d34d67c88df"',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '0',
   'date': 'Fri, 08 Jun 2018 09:02:38 GMT',
   'etag': '"06a07b6695854b7e5cf37d34d67c88df"',
   'server': '3.13.3.49',
   'x-amz-request-id': '867d207f-c26b-42cb-8e73-bf1f3bc0267d',
   'x-clv-request-id': '867d207f-c26b-42cb-8e73-bf1f3bc0267d',
   'x-clv-s3-version': '2.5'},
  'HTTPStatusCode': 200,
  'HostId': '',
  'RequestId': '867d207f-c26b-42cb-8e73-bf1f3bc0267d',
  'RetryAttempts': 0}}
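`put_object` stores whatever is passed as `Body`, so a path string is stored as text rather than as the file it names. A minimal sketch of uploading a file's bytes follows; `upload_file` and `FakeClient` are hypothetical names, and the fake client only mimics the `put_object(Bucket=..., Key=..., Body=...)` call so the example runs offline.

```python
import os
import tempfile

def upload_file(client, bucket, key, path):
    """Read a local file and upload its bytes under the given key."""
    with open(path, 'rb') as handle:
        return client.put_object(Bucket=bucket, Key=key, Body=handle.read())

class FakeClient:
    """Hypothetical stand-in for the storage client so the sketch runs offline."""
    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body
        return {'ResponseMetadata': {'HTTPStatusCode': 200}}

with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, 'train.csv')
    # newline='' keeps the line ending exactly as written on every platform
    with open(local_path, 'w', encoding='utf-8', newline='') as handle:
        handle.write('purpose\tpolicy\n')
    fake = FakeClient()
    response = upload_file(fake, 'demo-bucket', 'train.csv', local_path)
print(fake.objects[('demo-bucket', 'train.csv')])
```

With the real client, the same function could be called as `upload_file(cos, credentials_1['BUCKET'], file_name, file_name)`, assuming the CSV was written to the current working directory.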