# Process Synonyms

This notebook combines Python natural language libraries (NLTK and inflect) with the Google Cloud Natural Language API to expand the chatbot's vocabulary by generating synonyms for the topics created in the previous notebook.

In [1]:
# Only need to do this once...
!pip install inflect

Collecting inflect
  Downloading https://files.pythonhosted.org/packages/e0/3e/9f62aeee7e0963fa63e2373423e4235045db0712b2bf56d39cffa89c0734/inflect-0.3.1-py2.py3-none-any.whl (59kB)
Installing collected packages: inflect
Successfully installed inflect-0.3.1
You are using pip version 9.0.3, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
# Only need to do this once...
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /content/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [3]:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

In [4]:
from google.cloud import datastore

In [5]:
datastore_client = datastore.Client()

In [6]:
# Reuse the Datastore client created above to fetch every Topic entity
query = datastore_client.query(kind='Topic')
results = list(query.fetch())

In [7]:
import inflect
plurals = inflect.engine()

## Extract Synonyms with Python
Split each topic into words and look up synonyms for each word in WordNet (through NLTK), which serves as the thesaurus. Each synonym, along with the word itself and its plural form, is stored in Datastore and linked back to the topic. Stop words (articles, prepositions, and other words that contribute little to the meaning of the topic) are filtered out first.
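
The filtering and linking steps can be sketched without NLTK or Datastore. This is a minimal illustration, not the notebook's implementation: `STOP_WORDS` is a tiny hand-picked stand-in for NLTK's English stop word list, the dict `synonym_index` stands in for the `Synonym` kind in Datastore, the topic name is hypothetical, and `content_words` and `link` are hypothetical helper names.

```python
# Tiny stand-in for stopwords.words('english'); the real NLTK list is much longer.
STOP_WORDS = {'a', 'an', 'the', 'of', 'for', 'and', 'in', 'to'}


def content_words(topic):
    """Return the words of a topic that are not stop words, in order."""
    return [word for word in topic.split() if word not in STOP_WORDS]


def link(index, phrase, topic):
    """Record that `phrase` refers to `topic` (mirrors one Synonym entity)."""
    index[phrase] = topic


synonym_index = {}           # stand-in for the 'Synonym' kind: phrase -> topic
topic = 'leave of absence'   # hypothetical topic name, chosen to contain a stop word

link(synonym_index, topic, topic)
for word in content_words(topic):
    link(synonym_index, word, topic)

print(content_words(topic))  # ['leave', 'absence']
print(synonym_index)
```

Because each phrase is used as the key, a phrase can point to only one topic in this sketch: linking the same phrase again replaces the earlier entry.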

In [8]:
from nltk.corpus import wordnet

kind = 'Synonym'

for result in results:
  topic = result.key.name
  for word in topic.split():

    # Skip stop words; they add nothing to the meaning of the topic
    if word in stop:
      continue

    # Collect the lemma names from every WordNet synset of the word
    synonyms = []
    for syn in wordnet.synsets(word):
      for lemma in syn.lemmas():
        synonyms.append(lemma.name())

    print('%s %s %s %s' % (topic, word, synonyms, plurals.plural(word)))

    # Store the topic name, the word, its WordNet synonyms, and its plural
    # as Synonym entities that all point back to the topic
    candidates = [topic, word] + synonyms + [plurals.plural(word)]
    for name in candidates:
      synonym_key = datastore_client.key(kind, name)

      synonym = datastore.Entity(key=synonym_key)
      synonym['synonym'] = topic

      datastore_client.put(synonym)

annual salary annual [u'annual', u'annual', u'yearly', u'yearbook', u'annual', u'one-year', u'annual', u'yearly'] annuals
annual salary salary [u'wage', u'pay', u'earnings', u'remuneration', u'salary'] salaries
compassionate leave compassionate [u'feel_for', u'pity', u'compassionate', u'condole_with', u'sympathize_with', u'compassionate'] compassionates
compassionate leave leave [u'leave', u'leave_of_absence', u'leave', u'farewell', u'leave', u'leave-taking', u'parting', u'leave', u'go_forth', u'go_away', u'leave', u'leave', u'leave', u'leave_alone', u'leave_behind', u'exit', u'go_out', u'get_out', u'leave', u'leave', u'allow_for', u'allow', u'provide', u'leave', u'result', u'lead', u'leave', u'depart', u'pull_up_stakes', u'entrust', u'leave', u'bequeath', u'will', u'leave', u'leave', u'leave', u'leave_behind', u'impart', u'leave', u'give', u'pass_on', u'forget', u'leave'] leaves
confidential information confidential [u'confidential', u'confidential', u'secret', u'confidential', u'conf
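
The WordNet lists above contain repeated lemmas and multi-word lemmas joined with underscores (for example `leave_of_absence`). One possible cleanup before storing them, shown here as a sketch rather than something the notebook does, is to drop duplicates while keeping the original order and to replace underscores with spaces. `clean_lemmas` is a hypothetical helper name; the input is the first few lemma names printed for `leave` above.

```python
def clean_lemmas(lemmas):
    """Deduplicate lemma names (keeping first-seen order) and turn '_' into ' '."""
    seen = set()
    cleaned = []
    for lemma in lemmas:
        phrase = lemma.replace('_', ' ')
        if phrase not in seen:
            seen.add(phrase)
            cleaned.append(phrase)
    return cleaned


# First few lemma names printed for 'leave' in the output above.
leave_lemmas = ['leave', 'leave_of_absence', 'leave', 'farewell',
                'leave', 'leave-taking', 'parting']

print(clean_lemmas(leave_lemmas))
# ['leave', 'leave of absence', 'farewell', 'leave-taking', 'parting']
```

Since the loop above issues one `put` per list item, deduplicating first would also reduce the number of Datastore writes.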

## Extract Synonyms with Machine Learning
Use the Google Cloud Natural Language API's entity analysis to score the "salience" of each entity in a topic's action text, that is, its relevance to the overall meaning of the passage. Entities whose salience exceeds the threshold are stored as synonyms for the topic; entities of type ORGANIZATION are skipped. Note: the salience threshold is currently set at 0.1. You can adjust this threshold to see the effect on synonym generation.
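
The selection rule can be sketched in isolation. This example does not call the Natural Language API: `Entity` is a hypothetical stand-in for the entities returned by entity analysis, the sample names and salience values are invented for illustration, and `select_synonyms` is a hypothetical helper name.

```python
from collections import namedtuple

# Hypothetical stand-in for an API entity: a name, a type label, and a salience score.
Entity = namedtuple('Entity', ['name', 'type', 'salience'])


def select_synonyms(entities, threshold=0.1):
    """Return lower-cased names of non-organization entities above the threshold."""
    return [
        entity.name.lower()
        for entity in entities
        if entity.type != 'ORGANIZATION' and entity.salience > threshold
    ]


# Invented sample data, not real API output.
sample = [
    Entity('Salaries', 'OTHER', 0.42),
    Entity('Acme Corp', 'ORGANIZATION', 0.30),
    Entity('Thursday', 'OTHER', 0.02),
]

print(select_synonyms(sample))                  # ['salaries']
print(select_synonyms(sample, threshold=0.01))  # ['salaries', 'thursday']
```

Lowering the threshold admits less central entities as synonyms; raising it keeps only the entities that dominate the passage.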

In [9]:
from google.cloud import language
from google.cloud.language import enums
from google.cloud.language import types

In [10]:
# Create the Natural Language client once and reuse it for every topic
client = language.LanguageServiceClient()

# Entity type names, indexed by the integer value of enums.Entity.Type
entity_type = ('UNKNOWN', 'PERSON', 'LOCATION', 'ORGANIZATION',
               'EVENT', 'WORK_OF_ART', 'CONSUMER_GOOD', 'OTHER')

# Minimum salience an entity needs in order to be stored as a synonym
SALIENCE_THRESHOLD = 0.1

for result in results:

  print('%s %s %s' % (result.key.name, '*' * 5, result['action_text']))

  document = types.Document(
      content=result['action_text'],
      type=enums.Document.Type.PLAIN_TEXT)

  entities = client.analyze_entities(document).entities

  for entity in entities:
    # Skip organizations and entities that are not salient enough
    if (entity_type[entity.type] != 'ORGANIZATION') and (entity.salience > SALIENCE_THRESHOLD):

      print('=' * 20)
      print('         name: %s' % (entity.name.lower()))
      print('         type: %s' % (entity_type[entity.type]))
      print('     salience: %s' % (entity.salience))

      # Store the lower-cased entity name as a Synonym pointing back to the topic
      kind = 'Synonym'
      synonym_key = datastore_client.key(kind, entity.name.lower())

      synonym = datastore.Entity(key=synonym_key)
      synonym['synonym'] = result.key.name

      datastore_client.put(synonym)

annual salary ***** Salaries shall be determined by the Executive Director, based on budget considerations and commensurate with the experience of the successful candidate.   The organization shall pay employees on a bi-weekly basis, less the usual and necessary statutory and other deductions payable in accordance with the Employer’s standard payroll practices.  These payroll practices may be changed from time to time at the Employer’s sole discretion.  Currently, payday occurs every second Thursday and covers the pay period ended the previous Saturday.

         name: salaries
         type: OTHER
     salience: 0.224026843905
         name: executive director
         type: PERSON
     salience: 0.224026843905
compassionate leave ***** [THE ORGANIZATION] will grant up to three (3) working days per event on the occasion of a death in the staff member’s immediate family.  Immediate family is defined as: parent(s), step parent(s), foster parent(s), sibling(s), grandparent(s), spouse (in