This notebook reviews all of the modules that make up the variable search and exploration components.

## Table of contents:
1. [wiktiwordnetapi.py](#sec-wwn)
2. [wikipediaapi.py](#sec-wapi)
3. [svoapi.py](#sec-svoapi)
4. [parse_tools.py](#sec-parsetools)
5. [knowledge_graph.py](#sec-kg)

## 1. wiktiwordnetapi.py <a class="anchor" id="sec-wwn"></a>

This module loads the generated WiktiWordNet data and has two interaction functions:
 - check_domain(term) - checks if the selected term refers to a domain
 - get_category(term) - returns a dictionary of {category:definition} pairs for the selected term

The wiktiwordnetapi module can be imported with the following command:

In [1]:
import wiktiwordnetapi as wwnapi

Next, instantiate a WiktiWordNet object with the following command:

In [2]:
wwn = wwnapi.wiktiwordnet()

Test the functionality of the two methods available:

In [3]:
def print_domain_status(term):
    [is_domain, definition] = wwn.check_domain(term)
    
    print('According to WiktiWordNet, {} is {}a domain.'\
          .format(term, '' if is_domain else 'NOT '))
    
for term in ['dogs', 'astronomy', 'geology', 'astrology']:
    print_domain_status(term)


According to WiktiWordNet, dogs is NOT a domain.
According to WiktiWordNet, astronomy is a domain.
According to WiktiWordNet, geology is a domain.
According to WiktiWordNet, astrology is NOT a domain.


In [4]:
def print_term_categories(term):
    categories = wwn.get_category(term)
    
    num_categories = len(categories)
    if num_categories == 0:
        print('WiktiWordnet does not have any categories for the term {}.'\
              .format(term))
    else:
        print('Found the following {} for {}:'\
              .format('categories' if num_categories > 1 else 'category', term))
        print(', '.join(categories))
    
for term in ['dog', 'astronomy', 'butter']:
    print_term_categories(term)

Found the following categories for dog:
Body, Role
Found the following category for astronomy:
Domain
WiktiWordnet does not have any categories for the term butter.
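
The return shapes used in the two cells above can be illustrated with a toy stand-in. This is only a sketch, not the module's implementation: `TOY_DATA`, `toy_get_category`, and `toy_check_domain` are hypothetical names, the definitions are placeholder strings, and returning an empty definition for a non-domain term is an assumption.

```python
# Toy stand-in for the WiktiWordNet lookup; all names and data are hypothetical.
TOY_DATA = {
    "dog": {
        "Body": "<placeholder definition>",
        "Role": "<placeholder definition>",
    },
    "astronomy": {"Domain": "<placeholder definition>"},
}


def toy_get_category(term):
    """Return a {category: definition} dict; empty if the term is unknown."""
    return dict(TOY_DATA.get(term, {}))


def toy_check_domain(term):
    """Return [is_domain, definition], the two-element shape unpacked above."""
    categories = toy_get_category(term)
    if "Domain" in categories:
        return [True, categories["Domain"]]
    # Assumption: an empty definition is returned when the term is not a domain.
    return [False, ""]


for term in ["dog", "astronomy", "butter"]:
    print(term, toy_check_domain(term)[0], list(toy_get_category(term)))
```

A term counts as a domain here only when one of its categories is `Domain`, which matches the `astronomy` result shown above.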


## 2. wikipediaapi.py <a class="anchor" id="sec-wapi"></a>

The functions in this module interact with Wikipedia. They can:
- perform a search and return the top (most relevant) result according to Wikipedia's search algorithm
- get the "bulk" text from a Wikipedia page (discarding panel information)

The main function to be used from this module is:
- get_wikipedia_text(term) : returns the text and metadata information from the most closely related Wikipedia page

There are also two helper functions:
- get_top_wikipedia_entry(term) : returns metadata information about the most closely related Wikipedia page
- parse_wikipedia_page(pageid) : returns the text and disambiguation information about the Wikipedia page
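
The metadata keys used in this section (`title`, `pageid`, `redirecttitle`, `sectiontitle`) match fields returned by the MediaWiki search API. The sketch below shows how a top-result lookup could be built on that API. That the module works this way is an assumption; `build_search_url` and `top_entry` are hypothetical helpers, and `sample_response` is a hand-written stand-in for a real response so that no network call is needed.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(term):
    """Build a MediaWiki search request that asks for the single best hit."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": 1,
        "srprop": "redirecttitle|sectiontitle",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def top_entry(response):
    """Return metadata for the first search hit, or {} when nothing matched."""
    hits = response.get("query", {}).get("search", [])
    if not hits:
        return {}
    keep = ("title", "pageid", "redirecttitle", "sectiontitle")
    return {key: hits[0][key] for key in keep if key in hits[0]}


# Hand-written stand-in for a decoded JSON response (not fetched from Wikipedia).
sample_response = {
    "query": {
        "search": [
            {"ns": 0, "title": "Agriculture", "pageid": 627,
             "redirecttitle": "Crop production"}
        ]
    }
}

print(build_search_url("crop production"))
print(top_entry(sample_response))
```

Returning an empty dictionary for "no hit" mirrors the `results == {}` check used when testing the helper functions.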

Load the module as follows:

In [1]:
import wikipediaapi as wapi

First, test the helper functions.

In [2]:
def get_wikipedia_page_info(term):
    results = wapi.get_top_wikipedia_entry(term)
    
    if results == {}:
        print('Did not find a relevant Wikipedia page for {}.'.format(term))
    else:
        print('Found the following Wikipedia page for {}:'.format(term))
        if 'title' in results.keys():
            print('Title: {}'.format(results['title']))
        if 'pageid' in results.keys():
            print('Page ID: {}'.format(results['pageid']))
        if 'redirecttitle' in results.keys():
            print('Redirect Title: {}'.format(results['redirecttitle']))
        if 'sectiontitle' in results.keys():
            print('Section Title: {}'.format(results['sectiontitle']))

for term in ['dog', 'crop production', 'conductivity', 'hafdkj']:
    get_wikipedia_page_info(term)

Found the following Wikipedia page for dog:
Title: Dog
Page ID: 4269567
Found the following Wikipedia page for crop production:
Title: Agriculture
Page ID: 627
Redirect Title: Crop production
Found the following Wikipedia page for conductivity:
Title: Conductivity
Page ID: 403990
Did not find a relevant Wikipedia page for hafdkj.


In [3]:
def get_wikipedia_text_info(pageid):
    [text, disambig] = wapi.parse_wikipedia_page(pageid)
    
    if disambig:
        print('Page with id {} is a disambiguation page.'.format(pageid))
    if text != []:
        print('Here are the first few lines of page id {}:'.format(pageid) )
        print(text[0])
        
for term in [4269567, 403990, 0]:
    get_wikipedia_text_info(term)

Here are the first few lines of page id 4269567:

Page with id 403990 is a disambiguation page.
Here are the first few lines of page id 403990:
Electrical conductivity
Here are the first few lines of page id 0:



Now, test the main function:

In [4]:
def get_wikipedia_text(term):
    [text, disambig, title, redirecttitle] = wapi.get_wikipedia_text(term)
    
    if title == '':
        print('No page found for term {}.'.format(term))
    else:
        print('Page Title for term {}: {}'.format(term, title))
    if redirecttitle != '':
        print('Redirect title for term {} page is: {}.'.format(term, redirecttitle))
    if disambig:
        print('Page for term {} is a disambiguation page.'.format(term))
    if text != []:
        print('Here is the first paragraph of the page for term {}:'.format(term) )
        print(text[0])

for term in ['dog', 'crop production', 'conductivity', 'hafdkj']:
    get_wikipedia_text(term)
    print('==================================')
    

Page Title for term dog: dog
Here is the first paragraph of the page for term dog:
Canis familiaris Linnaeus, 1758[2][3]
Page Title for term crop production: agriculture
Redirect title for term crop production page is: crop production.
Here is the first paragraph of the page for term crop production:
Agriculture is the science and art of cultivating plants and livestock.[1] Agriculture was the key development in the rise of sedentary human civilization, whereby farming of domesticated species created food surpluses that enabled people to live in cities. The history of agriculture began thousands of years ago. After gathering wild grains beginning at least 105,000 years ago, nascent farmers began to plant them around 11,500 years ago. Pigs, sheep and cattle were domesticated over 10,000 years ago. Plants were independently cultivated in at least 11 regions of the world. Industrial agriculture based on large-scale monoculture in the twentieth century came to dominate agricultural output,

## 3. svoapi.py <a class="anchor" id="sec-svoapi"></a>

This module interacts with the SVO SPARQL endpoint to perform term searches.
 
The main function to be used from this module is:
- rank_search(terms) : returns a pandas dataframe of directly labeled entities and linked entities related to the search term(s), together with a rank (from 0 to 1) for each match

There are three helper functions present in the module:
- search(term) : returns a pandas dataframe of directly labeled entities and linked entities related to the search term(s)
- search_entity_links(entities) : returns a pandas dataframe containing the columns: term, entity, entitylabel, entityclass, linkedentity, linkedentitylabel, linkedentityclass
    - linkedentity (and label, class) will be one of the entities passed in
    - entity (and label, class) will be the entities linked to that entity
- search_label(term) : returns a pandas dataframe containing the columns: term, entity, entitylabel, entityclass

Load the module as follows:

In [1]:
import svoapi

Test the rank search functionality:

In [2]:
def rank_search_test(terms):
    print('Performing search for {}'.format(', '.join(terms)))
    results = svoapi.rank_search(terms)
    print("Here are the top ten search results overall:")
    for _,row in results.head(10).iterrows():
        print('\t{}\t{}'.format(row['entity'].split('/')[-1],row['rank']))
    print()
    print("Here are the top ten search results for variables:")
    for _,row in results.loc[results['entityclass']=='Variable'].head(10).iterrows():
        print('\t{}\t{}'.format(row['entity'].split('/')[-1],row['rank']))

In [3]:
search_terms = [['viscosity'], ['volume viscosity'], ['rainfall', 'precipitation']]
for term in search_terms:
    rank_search_test(term)
    print('++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

Performing search for viscosity
Here are the top ten search results overall:
	property#viscosity	1.0
	property#apparent_viscosity	0.5
	property#dynamic_viscosity	0.5
	property#kinematic_viscosity	0.5
	property#power-law-fluid_viscosity	0.5
	property#viscosity_term	0.5
	property#casson-model_viscosity_coefficient	0.3333333333333333
	property#extensional_dynamic_viscosity	0.3333333333333333
	property#extensional_kinematic_viscosity	0.3333333333333333
	property#log10_of_dynamic_viscosity	0.3333333333333333

Here are the top ten search results for variables:
	variable#air__shear_dynamic_viscosity	0.2
	variable#air__shear_kinematic_viscosity	0.2
	variable#air__volume_dynamic_viscosity	0.2
	variable#air__volume_kinematic_viscosity	0.2
	variable#chocolate%7Eliquid__apparent_viscosity	0.2
	variable#equation%7Enavier-stokes__viscosity_term	0.2
	variable#polymer__extensional_kinematic_viscosity	0.2
	variable#sea%40context%7Ein_%28water_eddy%29__viscosity	0.2
	variable#water__shear_dynamic_viscos
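
The entity names printed above are the last path segment of each entity IRI and are still percent-encoded (for example `%7E` for `~` and `%40` for `@`). A minimal sketch of decoding them for display follows; `short_name` is a hypothetical helper and the IRI prefix is a made-up placeholder, with only the tail copied from the output above.

```python
from urllib.parse import unquote


def short_name(entity_iri):
    """Return the percent-decoded 'class#label' tail of an entity IRI."""
    return unquote(entity_iri.split("/")[-1])


# Hypothetical IRI prefix; the tail is taken from the search results above.
iri = "http://example.org/svo/variable#chocolate%7Eliquid__apparent_viscosity"
print(short_name(iri))
```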

## 4. parse_tools.py <a class="anchor" id="sec-parsetools"></a>

This module contains functionality for parsing text and extracting technical terminology. It uses the Stanford Stanza parser to generate a part-of-speech tree for each sentence, and then extracts relevant information using knowledge of the linguistic patterns of technical text.

This module contains functions to perform the following text parsing steps:
- extract noun groups (technical terminology)
- find existence information about a desired term (e.g., X is defined as Y ...)
- find variations on a noun group (e.g., adjective-modified forms)

Import the module with the following command:

In [1]:
import parse_tools as pt

2020-06-15 12:08:15 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| ner       | ontonotes |

2020-06-15 12:08:15 INFO: Use device: cpu
2020-06-15 12:08:15 INFO: Loading: tokenize
2020-06-15 12:08:15 INFO: Loading: pos
2020-06-15 12:08:16 INFO: Loading: lemma
2020-06-15 12:08:16 INFO: Loading: depparse
2020-06-15 12:08:17 INFO: Loading: ner
2020-06-15 12:08:18 INFO: Done loading processors!


A section of text from a source, e.g., Wikipedia, can be parsed by passing it to ParsedDoc(text), where text is provided as a list of paragraph strings. We can import the Wikipedia API for this test.

In [2]:
import wikipediaapi as wapi

In [3]:
[text, _, _, _] = wapi.get_wikipedia_text('agriculture')

In [4]:
text[0]

'Agriculture is the science and art of cultivating plants and livestock.[1] Agriculture was the key development in the rise of sedentary human civilization, whereby farming of domesticated species created food surpluses that enabled people to live in cities. The history of agriculture began thousands of years ago. After gathering wild grains beginning at least 105,000 years ago, nascent farmers began to plant them around 11,500 years ago. Pigs, sheep and cattle were domesticated over 10,000 years ago. Plants were independently cultivated in at least 11 regions of the world. Industrial agriculture based on large-scale monoculture in the twentieth century came to dominate agricultural output, though about 2 billion people still depended on subsistence agriculture into the twenty-first.'

Parsing a page for its noun groups takes a bit of time. In fact, it is the most time-intensive step in the variable exploration tools package.

In [5]:
parsed_text = pt.ParsedDoc(text)

Examining the first paragraph, first sentence 

`Agriculture is the science and art of cultivating plants and livestock.`

yields:

In [6]:
parsed_text.paragraphs[1].sentences[1].noun_groups.ng

{'livestock': {'pos_seq': ['NOUN'],
  'lemma_seq': ['livestock'],
  'type': 'noun'},
 'plants': {'pos_seq': ['NOUN'], 'lemma_seq': ['plant'], 'type': 'noun'},
 'art': {'pos_seq': ['NOUN'], 'lemma_seq': ['art'], 'type': 'noun'},
 'science': {'pos_seq': ['NOUN'], 'lemma_seq': ['science'], 'type': 'noun'},
 'Agriculture': {'pos_seq': ['NOUN'],
  'lemma_seq': ['agriculture'],
  'type': 'noun'}}

The output is in the following format:

{ par_no : { sentence_no : { noun_group : {'pos_seq' : [...], 'lemma_seq': [...], 'type': ..., *'components': {noun_group: {'pos_seq' : [...], 'lemma_seq': [...], 'type': ...}}* } } } }

where the 'components' key is only present for noun groups that contain adpositions.
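
Because components are nested inside their parent noun group, listing every noun group in a sentence requires a recursive walk. The sketch below shows one way to do that on a plain dictionary in the format above. `iter_noun_groups` is a hypothetical helper, not part of parse_tools, and `sample` is a trimmed copy of two entries from the second sentence of the same paragraph (the `has_type` and `has_attribute` keys are omitted).

```python
def iter_noun_groups(groups):
    """Yield (name, info) for every noun group, descending into 'components'."""
    for name, info in groups.items():
        yield name, info
        # 'components' is only present for noun groups that contain adpositions.
        yield from iter_noun_groups(info.get("components", {}))


sample = {
    "food surpluses": {
        "pos_seq": ["NOUN", "NOUN"],
        "lemma_seq": ["food", "surplus"],
        "type": "noungrp",
    },
    "farming of domesticated species": {
        "pos_seq": ["NOUN", "ADPOSITION", "ADJECTIVE", "NOUN"],
        "lemma_seq": ["farming", "of", "domesticated", "species"],
        "type": "compound",
        "components": {
            "farming": {
                "pos_seq": ["NOUN"],
                "lemma_seq": ["farming"],
                "type": "noun",
            },
            "domesticated species": {
                "pos_seq": ["ADJECTIVE", "NOUN"],
                "lemma_seq": ["domesticated", "species"],
                "type": "modnoun",
            },
        },
    },
}

for name, info in iter_noun_groups(sample):
    print(name, info["type"])
```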

Examining the first paragraph, second sentence 

`Agriculture was the key development in the rise of sedentary human civilization, whereby farming of domesticated species created food surpluses that enabled people to live in cities.`

yields:

In [7]:
parsed_text.paragraphs[1].sentences[2].noun_groups.ng

{'cities': {'pos_seq': ['NOUN'], 'lemma_seq': ['city'], 'type': 'noun'},
 'people': {'pos_seq': ['NOUN'], 'lemma_seq': ['people'], 'type': 'noun'},
 'food surpluses': {'pos_seq': ['NOUN', 'NOUN'],
  'lemma_seq': ['food', 'surplus'],
  'type': 'noungrp'},
 'farming of domesticated species': {'pos_seq': ['NOUN',
   'ADPOSITION',
   'ADJECTIVE',
   'NOUN'],
  'lemma_seq': ['farming', 'of', 'domesticated', 'species'],
  'type': 'compound',
  'components': {'farming': {'pos_seq': ['NOUN'],
    'lemma_seq': ['farming'],
    'type': 'noun'},
   'domesticated species': {'pos_seq': ['ADJECTIVE', 'NOUN'],
    'lemma_seq': ['domesticated', 'species'],
    'type': 'modnoun',
    'has_type': {'species': {'pos_seq': ['NOUN'],
      'lemma_seq': ['species'],
      'type': 'noun'}},
    'has_attribute': {'domesticated': {'pos_seq': ['ADJECTIVE'],
      'lemma_seq': ['domesticated'],
      'type': 'adj'}}}}},
 'rise of sedentary human civilization': {'pos_seq': ['NOUN',
   'ADPOSITION',
   'ADJECTIVE',

To find the 'is' sentences whose subject is a particular term, use the following:

In [8]:
parsed_text.find_is_nsubj('agriculture')

{1: [1]}

The function returns the paragraph and sentence numbers of the matching 'is' sentences; here that is the first sentence of the first paragraph, as expected.

To count the noun groups on a page related to a term and assign them types, use the following:

In [9]:
agriculture_types = parsed_text.get_term_noun_groups('agriculture')

In [10]:
agriculture_types.head(10)

| Unnamed: 0 | noun_group | count | type | modified | aspects |
|---|---|---|---|---|---|
| 68 | agriculture | 44 | simple | False | True |
| 69 | agriculture accounts | 2 | simple | False | True |
| 70 | agriculture occupation | 1 | simple | False | True |
| 71 | agriculture sector | 1 | simple | False | True |
| 72 | agriculture through changes in average tempera... | 1 | multiple | False | True |
| 87 | ancient egyptian agriculture | 1 | adjectival | True | False |
| 121 | assessment of agriculture | 1 | multiple | False | True |
| 252 | conservation agriculture | 1 | simple | True | False |
| 262 | conventional agriculture | 1 | multiple | False | True |
| 267 | cost of agriculture to society | 1 | multiple | False | True |


In [11]:
agriculture_types[agriculture_types['modified']]

| Unnamed: 0 | noun_group | count | type | modified | aspects |
|---|---|---|---|---|---|
| 87 | ancient egyptian agriculture | 1 | adjectival | True | False |
| 252 | conservation agriculture | 1 | simple | True | False |
| 360 | domestic agriculture | 1 | adjectival | True | False |
| 414 | environment agriculture | 1 | simple | True | False |
| 741 | industrial agriculture | 1 | adjectival | True | False |
| 762 | intensive agriculture | 1 | adjectival | True | False |
| 902 | measure agriculture | 1 | simple | True | False |
| 922 | modern agriculture | 1 | adjectival | True | False |
| 1343 | subsistence agriculture | 1 | simple | True | False |
| 1526 | word agriculture | 1 | simple | True | False |


In [12]:
agriculture_types[agriculture_types['aspects']]

| Unnamed: 0 | noun_group | count | type | modified | aspects |
|---|---|---|---|---|---|
| 68 | agriculture | 44 | simple | False | True |
| 69 | agriculture accounts | 2 | simple | False | True |
| 70 | agriculture occupation | 1 | simple | False | True |
| 71 | agriculture sector | 1 | simple | False | True |
| 72 | agriculture through changes in average tempera... | 1 | multiple | False | True |
| 121 | assessment of agriculture | 1 | multiple | False | True |
| 262 | conventional agriculture | 1 | multiple | False | True |
| 267 | cost of agriculture to society | 1 | multiple | False | True |
| 323 | density agriculture | 1 | multiple | False | True |
| 324 | density agriculture in loose rotation | 1 | multiple | False | True |


The modified and aspects columns flag noun groups containing the desired filter term (in this case agriculture). Modified terms are those that modify agriculture with a noun or adjective at the head (to specify a 'type' of agriculture). Aspects terms are those that modify the desired term with a noun at the tail or contain adpositions; these are different 'dimensions' in which the term agriculture may be encountered.
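
The distinction can be approximated with a purely positional rule on the words of a noun group. The sketch below is a simplification and not the module's logic: parse_tools works from the part-of-speech parse, so its flags can differ from this rule (for example, `conventional agriculture` is flagged as an aspect in the table above). `classify` and the `ADPOSITIONS` word list are hypothetical.

```python
# Small illustrative list; the real module relies on part-of-speech tags instead.
ADPOSITIONS = {"of", "in", "to", "through", "on", "for", "with", "by"}


def classify(noun_group, term):
    """Return (modified, aspects) flags for a noun group relative to a term."""
    tokens = noun_group.lower().split()
    if term not in tokens:
        return (False, False)
    has_adposition = any(tok in ADPOSITIONS for tok in tokens)
    term_is_last = tokens[-1] == term
    # "Modified": the term sits at the tail, preceded only by modifiers.
    modified = len(tokens) > 1 and term_is_last and not has_adposition
    # Everything else containing the term is treated as an aspect.
    return (modified, not modified)


for group in ["ancient egyptian agriculture", "agriculture sector",
              "cost of agriculture to society"]:
    print(group, classify(group, "agriculture"))
```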

## 5. knowledge_graph.py <a class="anchor" id="sec-kg"></a>

The knowledge graph module can load an existing knowledge graph (such as the one created for the WM indicators) or create a new knowledge graph from scratch. The graph can be expanded by up to 3 levels around one specified node at a time. A wider range of the features of the generated knowledge graph is demonstrated in the Variable Report notebook. Here are a few of the main functionalities of this module.

In [1]:
import knowledge_graph as kg

2020-06-18 17:46:33 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| ner       | ontonotes |

2020-06-18 17:46:33 INFO: Use device: cpu
2020-06-18 17:46:33 INFO: Loading: tokenize
2020-06-18 17:46:33 INFO: Loading: pos
2020-06-18 17:46:34 INFO: Loading: lemma
2020-06-18 17:46:34 INFO: Loading: depparse
2020-06-18 17:46:35 INFO: Loading: ner
2020-06-18 17:46:35 INFO: Done loading processors!


Load a graph from file (in this case, the graph generated for the WM indicators):

In [2]:
graph = kg.SciVarKG(graphfile = 'resources/world_modelers_indicators_kg.json')

Add a concept to the graph:

In [3]:
graph.add_concept('drought', depth = 2)
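
The `depth` argument limits how far the graph is expanded around the new concept. The idea can be sketched as a depth-limited breadth-first walk. This is a sketch only: it assumes that depth counts link hops from the starting node, `expand` is a hypothetical helper, and `toy_links` is invented data (the module derives its links from parsed text, which is not reproduced here).

```python
from collections import deque


def expand(neighbors, start, depth):
    """Return {node: level} for every node within `depth` links of `start`."""
    levels = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if levels[node] == depth:
            continue  # do not expand past the requested depth
        for nxt in neighbors.get(node, []):
            if nxt not in levels:
                levels[nxt] = levels[node] + 1
                queue.append(nxt)
    return levels


# Invented links, for illustration only.
toy_links = {
    "drought": ["precipitation", "soil moisture"],
    "precipitation": ["rainfall"],
    "rainfall": ["rain gauge"],
}

print(expand(toy_links, "drought", depth=2))
```

With `depth=2` the walk reaches `rainfall` but stops before `rain gauge`, which is three links away from `drought`.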

Perform inference over the nodes in the graph to align SVO variables and WM indicators with all nodes and to determine the most likely SVO category:

In [4]:
graph.graph_inference()

In [5]:
graph.graph['drought']['hasSVOMatch']

{'Attribute': [-7004608295479326044]}

In [6]:
graph.svo_index_map[-7004608295479326044]

{'namespace': 'attribute',
 'entity': 'drought',
 'preflabel': 'drought',
 'class': 'Attribute'}

Write the graph from memory to file (in this case the default, which is resources/scivar_kg.json):

In [7]:
graph.write_graph()