# Python and Analytics workshop - Using Natural Language Understanding and Sentiment

In this portion of the workshop, we'll use an instance of [Watson Natural Language Understanding](https://cloud.ibm.com/catalog/services/natural-language-understanding) to gather insights into data.

Watson Natural Language Understanding is a cloud native product that uses deep learning to extract metadata from text such as entities, keywords, categories, sentiment, emotion, relations, and syntax.
There is a rich [API](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python) that we will use along with the [Watson Python SDK](https://github.com/watson-developer-cloud/python-sdk) to analyze our data.

## Contents

- [1.0 Setup - install modules](#setup)
- [2.0 Test NLU APIs](#test)
- [3.0 Import Data and Setup Pandas Dataframe](#pandas)
- [4.0 Clean and Prepare data for NLU scoring](#clean)
- [5.0 Analyze response from NLU](#analyze)
- [6.0 Get sentiment by row](#sentiment-row)
- [7.0 Graph with matplotlib](#graph)



## 1.0 Setup - Install Modules<a name="setup"></a>

We use the [Watson Python SDK](https://github.com/watson-developer-cloud/python-sdk) to access the [NLU APIs](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python) programmatically.

In [1]:
!pip install --upgrade numpy==1.16.4
!pip install --upgrade pandas==1.0.5
!pip install --upgrade ibm-watson==4.7.1

Requirement already up-to-date: numpy==1.16.4 in /usr/local/lib/python3.7/site-packages (1.16.4)
You should consider upgrading via the '/usr/local/opt/python/bin/python3.7 -m pip install --upgrade pip' command.
Requirement already up-to-date: pandas==1.0.5 in /usr/local/lib/python3.7/site-packages (1.0.5)
Requirement already up-to-date: ibm-watson==4.7.1 in /usr/local/lib/python3.7/site-packages (4.7.1)


### Important: Restart the Jupyter kernel now
Restart the kernel by going to the `Kernel` tab above and choosing `Restart`.

Import the Python modules we need from the Watson Python SDK.

In [2]:
import json
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import Features,CategoriesOptions,EmotionOptions,KeywordsOptions

### 1.1 Add NLU credentials
Get the [IAM Authentication Key](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#authentication) and [Service URL](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#service-endpoint) that you obtained when you [Created a Watson NLU instance](https://github.ibm.com/IBMDeveloper/python-and-analytics/tree/addNLU/workshop/natural-language-understanding#create-a-watson-nlu-instance).

Add your [IAM Authentication Key](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#authentication) below.

In [3]:
IAM_KEY = ''

Add your [NLU Service URL](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#service-endpoint) below.

In [4]:
SERVICE_URL = ''
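Pasting credentials into a notebook makes them easy to leak when the notebook is shared. As one alternative, here is a minimal sketch that reads them from environment variables. The variable names `NLU_IAM_KEY` and `NLU_SERVICE_URL` and the helper `load_nlu_credentials` are hypothetical choices for this sketch, not something the Watson SDK defines.

```python
import os


def load_nlu_credentials(env=None):
    """Return (iam_key, service_url) read from environment variables.

    NLU_IAM_KEY and NLU_SERVICE_URL are hypothetical names chosen for this
    sketch; use whatever names you exported in your own shell.
    """
    env = os.environ if env is None else env
    iam_key = env.get("NLU_IAM_KEY", "")
    service_url = env.get("NLU_SERVICE_URL", "")
    if not iam_key or not service_url:
        raise ValueError("Set NLU_IAM_KEY and NLU_SERVICE_URL before running the notebook")
    return iam_key, service_url
```

With those variables exported, `IAM_KEY, SERVICE_URL = load_nlu_credentials()` would fill in the two values used in the rest of the notebook.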

## 2.0 Test NLU APIs <a name="test"></a>
Run a quick check to make sure everything is working. We'll give Watson Natural Language Understanding the URL of a web page, `https://womenintechsummit.net/`, and ask it to extract categories. The call is adapted from [this example](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#categories) in the Watson NLU documentation.

In [7]:
authenticator = IAMAuthenticator(IAM_KEY)
natural_language_understanding = NaturalLanguageUnderstandingV1(version='2020-08-01',authenticator=authenticator)

natural_language_understanding.set_service_url(SERVICE_URL)

response = natural_language_understanding.analyze(
    url='https://womenintechsummit.net/',
    features=Features(categories=CategoriesOptions(limit=3))).get_result()

print(json.dumps(response, indent=2))

{
  "usage": {
    "text_units": 1,
    "text_characters": 1772,
    "features": 1
  },
  "retrieved_url": "https://womenintechsummit.net/",
  "language": "en",
  "categories": [
    {
      "score": 0.895921,
      "label": "/technology and computing/tech news"
    },
    {
      "score": 0.855429,
      "label": "/technology and computing/computer reviews"
    },
    {
      "score": 0.820702,
      "label": "/technology and computing/internet technology"
    }
  ]
}
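The result is a plain Python dict, so the categories can be read with ordinary dict and list operations. The sketch below copies the `categories` list from the output above into a sample dict and picks the highest-scoring entry; `top_category` is a hypothetical helper name.

```python
# Sample copied from the response printed above.
sample_response = {
    "categories": [
        {"score": 0.895921, "label": "/technology and computing/tech news"},
        {"score": 0.855429, "label": "/technology and computing/computer reviews"},
        {"score": 0.820702, "label": "/technology and computing/internet technology"},
    ]
}


def top_category(resp):
    """Return (label, score) of the highest-scoring category, or None."""
    categories = resp.get("categories", [])
    if not categories:
        return None
    best = max(categories, key=lambda c: c["score"])
    return best["label"], best["score"]


print(top_category(sample_response))
```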


## 3.0 Import Data and Setup Pandas Dataframe <a name="pandas"></a>

Read [cfpbciti.csv](https://raw.githubusercontent.com/IBM/python-and-analytics/master/data/cfpbciti.csv), which contains consumer complaint data from the Consumer Financial Protection Bureau (CFPB).

In [8]:
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.read_csv('https://raw.githubusercontent.com/IBM/python-and-analytics/master/data/cfpbciti.csv')
df.head(5)

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,01/24/20,Credit card or prepaid card,General-purpose credit card or charge card,Problem with a purchase shown on your statement,Card was charged for something you did not pur...,,Company has responded to the consumer and the ...,"CITIBANK, N.A.",NJ,07302,,Consent not provided,Web,01/24/20,Closed with monetary relief,Yes,,3508199
1,02/12/20,Credit card or prepaid card,General-purpose credit card or charge card,Getting a credit card,Delay in processing application,,Company has responded to the consumer and the ...,"CITIBANK, N.A.",IL,600XX,,Consent not provided,Web,02/12/20,Closed with monetary relief,Yes,,3529728
2,05/21/20,Credit card or prepaid card,Store credit card,Problem with a purchase shown on your statement,Credit card company isn't resolving a dispute ...,,Company has responded to the consumer and the ...,"CITIBANK, N.A.",FL,33020,,Consent not provided,Web,05/21/20,Closed with monetary relief,Yes,,3661785
3,05/18/20,Debt collection,Credit card debt,Written notification about debt,Didn't receive notice of right to dispute,Company has wrong information on me and thus n...,Company has responded to the consumer and the ...,"CITIBANK, N.A.",CA,935XX,,Consent provided,Web,05/18/20,Closed with explanation,Yes,,3657603
4,05/21/20,Credit card or prepaid card,General-purpose credit card or charge card,"Other features, terms, or problems",Other problem,,Company has responded to the consumer and the ...,"CITIBANK, N.A.",FL,328XX,Older American,Other,Web,05/21/20,Closed with explanation,Yes,,3661714


## 4.0 Clean and Prepare data for NLU scoring <a name="clean"></a>

We are interested in the customer's sentiment about various things, like `Product` or `Sub-product`. The `Consumer complaint narrative` column looks like it contains the text we should analyze, so let's look at it.
The first few rows have a `NaN` value, so we'll look at row 3.

In [9]:
text1 = df.loc[3,"Consumer complaint narrative"]
text1

"Company has wrong information on me and thus not receiving proper notification on debt being sent to collection. Proper notification wasn't sent in advance on debt was going to be sent to collections. No effort was done to try and reach me before hand. Also they called a family member that I have no communication with to try an locate and having the correct phone number to reach me."

Now let's drop all the `NaN` values found in the `Consumer complaint narrative` column using the Pandas method [dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).

In [10]:
df2=df['Consumer complaint narrative'].dropna(how = 'all')
df2.head(5)

3     Company has wrong information on me and thus n...
8     I have closed my credit card account with Citi...
13    I want to raise a complaint against Citibank (...
16    On XX/XX/ 2020, Citibank posted a fee to my ac...
17    I was shocked when I reviewed my credit report...
Name: Consumer complaint narrative, dtype: object
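To see why the output above starts at index 3 rather than 0, here is a small sketch on a toy Series (the strings are made up for illustration). `dropna()` removes the missing entries but leaves the labels of the surviving rows unchanged.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `Consumer complaint narrative` column.
narratives = pd.Series([np.nan, "Late fee charged twice", np.nan, "Card was declined"])

cleaned = narratives.dropna()

# The surviving rows keep their original labels (1 and 3).
print(cleaned.index.tolist())
```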

We'll convert the dataframe column to a string to send to the NLU endpoint for scoring.

In [11]:
df_text = df2.to_string()
df_text



Now we'll send this data to NLU to get [keywords](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#keywords), [sentiment](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#sentiment), and [emotions](https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#emotion).

In [12]:
response = natural_language_understanding.analyze(
    text = df_text,
    features=Features(keywords=KeywordsOptions(sentiment=True,emotion=True,limit=5))).get_result()

print(json.dumps(response, indent=2))

ERROR:root:Error in service call
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 976, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
    ssl_context=context,
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 377, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 423, in wrap_socket
    session=session
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 870, in _create
    self.do_handshake()
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python

ReadTimeout: HTTPSConnectionPool(host='api.us-south.natural-language-understanding.watson.cloud.ibm.com', port=443): Read timed out. (read timeout=60)

Hmm, it looks like we've picked up the 'XX/XX' placeholders that stand in for obscured dates. Since they occur frequently, NLU is tagging them with a high relevance score. Let's drop those 'XX' characters to get a better response from NLU.

In [None]:
df2 =df2.replace(regex=['X+'],value='')
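The pattern `X+` matches any run of capital X characters, so it also strips the X from ordinary words. The sketch below shows that effect on two made-up strings and compares it with a narrower, hypothetical pattern that only removes runs of two or more X characters.

```python
import pandas as pd

toy = pd.Series([
    "On XX/XX/ 2020, Citibank posted a fee",
    "XXXX called me about my Xbox order",
])

# Same call as in the cell above: every run of capital X is removed,
# including the X in ordinary words such as "Xbox".
broad = toy.replace(regex=["X+"], value="")

# Narrower, hypothetical pattern: only runs of two or more X characters.
narrow = toy.str.replace(r"X{2,}", "", regex=True)

print(broad.tolist())
print(narrow.tolist())
```

Which pattern is preferable depends on how the redactions appear in your data; the notebook continues with the broader one.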

In [None]:
df_text = df2.to_string()
df_text

OK, now that those 'XX' characters are gone, let's score again with NLU.

In [None]:
response = natural_language_understanding.analyze(
    text = df_text,
    features=Features(keywords=KeywordsOptions(sentiment=True,emotion=True,limit=5))).get_result()

print(json.dumps(response, indent=2))

That looks like information we can use. We notice that there is a 50,000 character limit. We'll work with that in a bit. For now, let's see if we can analyze some of this data.
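One way to stay within that limit is to group the narratives into chunks before sending them. The following is a minimal sketch under that assumption; `chunk_texts` is a hypothetical helper, and the default of 50,000 simply mirrors the limit mentioned above.

```python
def chunk_texts(texts, max_chars=50_000, sep="\n"):
    """Group strings into chunks whose joined length stays within max_chars.

    A single string longer than max_chars is truncated to fit.
    """
    chunks, current, current_len = [], [], 0
    for text in texts:
        text = text[:max_chars]
        # Account for the separator only when the chunk already has content.
        extra = len(text) + (len(sep) if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(sep.join(current))
            current, current_len = [], 0
            extra = len(text)
        current.append(text)
        current_len += extra
    if current:
        chunks.append(sep.join(current))
    return chunks
```

Each chunk could then be passed as the `text` argument of a separate `analyze` call, at the cost of one API call per chunk.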

## 5.0 Analyze response from NLU <a name="analyze"></a>

We'll create a dataframe from this API response. First, we'll look at the part of the response JSON that is associated with the key `keywords`.

In [None]:
respj = json.dumps(response['keywords'])
respj

OK. That looks good. Now, let's create a Pandas dataframe with that json using the method [read_json()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html).

In [None]:
json_df = pd.read_json(respj)
json_df.head()

That kinda worked. But the `sentiment` column holds JSON with multiple values in a [Python dict](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) form, as does the `emotion` column.
We can use [json_normalize()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) to create a dataframe that splits up `sentiment` and `emotion`. We'll drop a few columns and focus only on the `sentiment.score` and the `emotion.*` features.

In [None]:
norm_df = pd.json_normalize(response['keywords'])
norm_df = norm_df.drop(columns=['relevance', 'count', 'sentiment.label'])
norm_df.head()
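To make the flattening concrete, here is a small sketch using a hand-written record shaped like one entry of `response['keywords']`. The text and numbers are made up for illustration. `json_normalize()` turns each nested key into a dotted column name such as `sentiment.score`.

```python
import pandas as pd

# Hand-written record shaped like one entry of response['keywords'];
# the values are made up for illustration.
keywords = [
    {
        "text": "late fee",
        "relevance": 0.9,
        "count": 2,
        "sentiment": {"score": -0.8, "label": "negative"},
        "emotion": {"anger": 0.6, "sadness": 0.3, "joy": 0.0, "fear": 0.1, "disgust": 0.2},
    }
]

flat = pd.json_normalize(keywords)
print(sorted(flat.columns))
```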

This exploration of the data gives us some tools to work with. We'll continue to analyze the text in the following sections.

## 6.0 Get sentiment by row <a name="sentiment-row"></a>
Now, let's derive some sentiment and emotion information on a per-row basis, to provide more granularity.
The number of API calls that you can make to Watson NLU is [rate limited and dependent on your service plan](https://cloud.ibm.com/catalog/services/natural-language-understanding), so in order to limit the number of API calls to the NLU endpoint we'll start with just 50 rows by setting `num_rows` to 50.

In [None]:
num_rows = 50

In [None]:
df_rows = df.head(num_rows)
df_rows = df_rows.dropna(subset=['Consumer complaint narrative'],how = 'any')
df_rows =df_rows.replace(regex=['X+'],value='')
df_rows.head()

We notice that after we dropped the rows with a `NaN` value for `Consumer complaint narrative`, the indexes are no longer sequential. Let's use [reset_index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) to fix this.

In [None]:
df_rows.reset_index(drop=True, inplace=True)

There are many ways to iterate through the rows of a Pandas dataframe. We'll use [iterrows()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html).

First, we have a date for these entries. Let's put them into [Pandas datetime](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) format. We can use this later to do time series graphs.

In [None]:
for index, row in df_rows.iterrows():
    df_rows.loc[index,'Date received'] = datetime.strptime(row['Date received'], "%m/%d/%y")
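As an alternative to looping, `pd.to_datetime` can parse a whole column in one call. The sketch below uses a toy dataframe with three dates taken from the sample rows shown earlier, and the same format string as the `strptime` call above.

```python
import pandas as pd

toy = pd.DataFrame({"Date received": ["01/24/20", "02/12/20", "05/21/20"]})

# Parse the whole column at once; the column ends up with a datetime64 dtype.
toy["Date received"] = pd.to_datetime(toy["Date received"], format="%m/%d/%y")

print(toy["Date received"].dt.month.tolist())
```

A column with a true datetime dtype also gives access to the `.dt` accessor, which is handy for grouping by month or year.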

Now, let's look for something that we can use with Watson NLU to derive an analysis of the sentiment of the customer feedback.

In [None]:
df_rows.head()

In [None]:
print (df_rows['Consumer complaint narrative'][0])

That looks like what we want. Now, we'll create a list to hold the `responses`, call Watson NLU with the data and then populate the responses list. We'll do the same with a list called `normalize` that we can use along with [json_normalize()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html).

In [None]:
responses = []
normalize = []
for index, row in df_rows.iterrows():
    
    response = natural_language_understanding.analyze(
    text = row['Consumer complaint narrative'],
    features=Features(keywords=KeywordsOptions(sentiment=True,emotion=True,limit=1))).get_result()
    normalize.append(pd.json_normalize(response['keywords']))
    responses.append(response)
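Because each row costs one API call, a single failed request (a timeout, for example) would stop the loop above partway through. Here is a minimal sketch of a more forgiving loop; `analyze_rows` is a hypothetical helper that takes any callable, so it can be tried without network access.

```python
def analyze_rows(texts, analyze_fn):
    """Call analyze_fn on each text; record None for rows that raise."""
    results = []
    for text in texts:
        try:
            results.append(analyze_fn(text))
        except Exception as err:  # broad on purpose: this is only a sketch
            print(f"Skipping one row: {err}")
            results.append(None)
    return results
```

In this notebook, `analyze_fn` could be a small function that wraps `natural_language_understanding.analyze(...).get_result()` with the same `Features` used above. Rows recorded as `None` can be retried or dropped later.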
    

In [None]:
normalize

In [None]:
responses

Add the `responses` list and the `normalize` list to the `df_rows` dataframe. We can continue to use these new data features, but more commonly we'll derive new dataframes for our experiments and change those new dataframes instead.

In [None]:
df_rows['response'] = responses
df_rows.head()

In [None]:
df_rows['normalized'] = normalize
df_rows.head()

Let's create a new dataframe where we can pull out the column for the `emotion` `anger`, then sort by the highest rating of `anger`.

In [None]:
test_df = df_rows.copy()  # copy, so that changes to test_df leave df_rows untouched

In [None]:
for index, row in test_df.iterrows():
    test_df.loc[index,"anger"] = test_df.iloc[index]['response']['keywords'][0]['emotion']['anger']
    test_df.loc[index,"sentiment.score"] = test_df.iloc[index]['response']['keywords'][0]['sentiment']['score']
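The loop above indexes `['keywords'][0]` directly, which would raise an `IndexError` if a response ever came back with an empty `keywords` list. A defensive sketch follows; `keyword_scores` is a hypothetical helper, and whether NLU actually returns an empty list for very short text is an assumption here.

```python
def keyword_scores(response):
    """Return (anger, sentiment_score) for the first keyword, or (None, None)."""
    keywords = response.get("keywords") or []
    if not keywords:
        return None, None
    first = keywords[0]
    anger = first.get("emotion", {}).get("anger")
    score = first.get("sentiment", {}).get("score")
    return anger, score
```

Inside the loop, `anger, score = keyword_scores(row['response'])` would then give values that can be assigned to the `anger` and `sentiment.score` columns.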

In [None]:
test_df.head()

First, we'll look for entries that rank high in `anger`

In [None]:
sorted_df = test_df.sort_values(by='anger', ascending=False)
sorted_df.head()

Let's look at the `Consumer complaint narrative` that scores highest for anger (the one at the top of the sorted list).

In [None]:
sorted_df.iloc[0]['Consumer complaint narrative']

We could do the same for other entries that rank high in anger.

In [None]:
sorted_df.iloc[1]['Consumer complaint narrative']

Now, let's look at those with the highest negative sentiment. Note that for this, we'll sort by "ascending", since the more negative numbers represent a higher degree of negative sentiment.

In [None]:
sorted_df = test_df.sort_values(by='sentiment.score', ascending=True)
sorted_df.head()

In [None]:
sorted_df.iloc[0]['Consumer complaint narrative']

Well, it's not a surprise that the entry with the largest negative `sentiment.score` also has the highest rating for `anger`
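Comparing single rows is suggestive, but a correlation coefficient summarizes the relationship across all rows. The sketch below uses made-up scores, not values from the complaint data, and computes the Pearson correlation with `Series.corr`.

```python
import pandas as pd

# Made-up scores for illustration only; they are not taken from the complaint data.
toy = pd.DataFrame({
    "sentiment.score": [-0.9, -0.6, -0.2, 0.1],
    "anger": [0.8, 0.5, 0.3, 0.1],
})

# Pearson correlation; a value near -1 means anger rises as sentiment falls.
corr = toy["sentiment.score"].corr(toy["anger"])
print(round(corr, 2))
```

On the notebook's data the same call would be `test_df['sentiment.score'].corr(test_df['anger'])`.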

## 7.0 Graph with matplotlib <a name="graph"></a>

Let's create some graphs using [matplotlib](https://matplotlib.org). You may wish to explore more details about the Jupyter notebook [magic functions](https://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained) that you see used with the command `%matplotlib inline`.

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

### 7.1 Time series graphs

We can see if there is anything interesting when we plot data against time.

In [None]:
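# sorted_df is ordered by sentiment.score, so re-sort by date before drawing lines
sorted_df.sort_values(by='Date received').plot(kind='line',x='Date received',y='anger',color='red')

In [None]:
sorted_df.sort_values(by='Date received').plot(kind='line',x='Date received',y='sentiment.score',color='blue')

Now we can plot both `sentiment.score` and `anger` against time to look for correlations.

In [None]:
sorted_df.sort_values(by='Date received').plot(kind='line',x='Date received',y=['sentiment.score','anger'])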

### 7.2 Bar graphs

We can sum up the number of times a given `Product` or `Sub-product` appears using the [Python collections library Counter](https://docs.python.org/2/library/collections.html#collections.Counter)

In [None]:
from collections import Counter

Then we'll create a bar graph to see which `Product` values are referred to in the most customer complaints.

In [None]:
bar_hist = Counter(sorted_df['Product'].replace('\n', ''))

counts = list(bar_hist.values())  # materialize the dict views as lists for plotting
letters = list(bar_hist.keys())

# graph data
bar_x_locations = np.arange(len(counts))
plt.bar(bar_x_locations, counts, align = 'center')
plt.xticks(bar_x_locations, letters, rotation=90)
plt.grid()
plt.show()

We can do the same for `Sub-product`.

In [None]:
bar_hist = Counter(sorted_df['Sub-product'].replace('\n', ''))

counts = list(bar_hist.values())  # materialize the dict views as lists for plotting
letters = list(bar_hist.keys())

# graph data
bar_x_locations = np.arange(len(counts))
plt.bar(bar_x_locations, counts, align = 'center')
plt.xticks(bar_x_locations, letters, rotation=90)
plt.grid()
plt.show()
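The bars above appear in the order the labels were first seen. If a most-to-least ordering would be easier to read, `Counter.most_common()` provides it. A small sketch with made-up product labels:

```python
from collections import Counter

# Toy product labels for illustration; in the notebook this would be sorted_df['Product'].
products = [
    "Credit card or prepaid card", "Debt collection",
    "Credit card or prepaid card", "Mortgage",
    "Credit card or prepaid card", "Debt collection",
]

# most_common() returns (label, count) pairs ordered from most to least frequent.
ordered = Counter(products).most_common()
labels = [label for label, _ in ordered]
counts = [count for _, count in ordered]

print(list(zip(labels, counts)))
```

Passing `counts` to `plt.bar` and `labels` to `plt.xticks`, as in the cells above, would then draw the bars in descending order.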

### 7.3 Scatterplot

Now we'll use some matplotlib and numpy code to generate a 3D scatter plot.

In [None]:
import mpl_toolkits.mplot3d.axes3d as axes3d


Xuniques, X = np.unique(sorted_df['Sub-product'], return_inverse=True)
Yuniques, Y = np.unique(sorted_df['Product'], return_inverse=True)
Z= sorted_df['anger']
fig = plt.figure(figsize= [15,8])
ax = fig.add_subplot(1, 1, 1, projection='3d',autoscale_on=True)
ax.scatter(X, Y, Z, s=10, c='b')
ax.set(xticks=range(len(Xuniques)), xticklabels=Xuniques,
       yticks=range(len(Yuniques)), yticklabels=Yuniques)
plt.xticks(rotation=90)
plt.show()
