# iKnow Demo Notebook

This Jupyter notebook bundles a handful of simple demos for the [iKnow NLP engine](https://github.com/intersystems/iknow):

1. [The Basics](#The-Basics)
1. [Indexing Text](#Indexing-Text)
1. [Highlighting](#Highlighting)
1. [Feature Engineering](#Feature-Engineering)

### The Basics

The following paragraph just loads the iKnow engine and prints the set of supported languages.

If you haven't already, please run ```pip install iknowpy``` first to retrieve the latest version from PyPI.

In [1]:
import iknowpy

# initialize the engine
iknow = iknowpy.iKnowEngine()

# display supported languages
print(iknow.get_languages_set())

{'en', 'cs', 'ru', 'ja', 'uk', 'sv', 'nl', 'fr', 'pt', 'es', 'de'}


### Indexing Text

The `index()` function is the main entry point into the engine, taking a text string and language code (any of the ones printed in the previous command). An optional third argument helps you to the full trace output in case you are debugging language model work or just very interested :-)

Upon the method returning, it will have its `m_index` array populated with the indexing results for the supplied text string.

In [2]:
# this is the main API function, just taking text and a language code
iknow.index("Belgian chocolate is suprisingly popular for a country that doesn't have any cocoa trees.","en")

# now we can look at the raw output
print(iknow.m_index)

{'sentences': [{'entities': [{'type': 'Concept', 'offset_start': 0, 'offset_stop': 17, 'index': 'belgian chocolate', 'dominance_value': 1000.0, 'entity_id': 1}, {'type': 'Relation', 'offset_start': 18, 'offset_stop': 20, 'index': 'is', 'dominance_value': 333.0, 'entity_id': 2}, {'type': 'Concept', 'offset_start': 21, 'offset_stop': 40, 'index': 'suprisingly popular', 'dominance_value': 1000.0, 'entity_id': 3}, {'type': 'Relation', 'offset_start': 41, 'offset_stop': 44, 'index': 'for', 'dominance_value': 333.0, 'entity_id': 4}, {'type': 'NonRelevant', 'offset_start': 45, 'offset_stop': 46, 'index': 'a', 'dominance_value': 0.0, 'entity_id': 0}, {'type': 'Concept', 'offset_start': 47, 'offset_stop': 54, 'index': 'country', 'dominance_value': 500.0, 'entity_id': 5}, {'type': 'Relation', 'offset_start': 55, 'offset_stop': 72, 'index': "that doesn't have", 'dominance_value': 1000.0, 'entity_id': 6}, {'type': 'NonRelevant', 'offset_start': 73, 'offset_stop': 76, 'index': 'any', 'dominance_val

Other modules such as `pprint` help render this in a slightly more readable way:

In [3]:
import pprint
pp = pprint.PrettyPrinter(indent=2)
pp.pprint(iknow.m_index)

{ 'proximity': [ ((5, 7), 64),
                 ((1, 3), 64),
                 ((3, 5), 64),
                 ((3, 7), 42),
                 ((1, 5), 42),
                 ((1, 7), 32)],
  'sentences': [ { 'entities': [ { 'dominance_value': 1000.0,
                                   'entity_id': 1,
                                   'index': 'belgian chocolate',
                                   'offset_start': 0,
                                   'offset_stop': 17,
                                   'type': 'Concept'},
                                 { 'dominance_value': 333.0,
                                   'entity_id': 2,
                                   'index': 'is',
                                   'offset_start': 18,
                                   'offset_stop': 20,
                                   'type': 'Relation'},
                                 { 'dominance_value': 1000.0,
                                   'entity_id': 3,
                                

We can of course also loop through the output and print the most important parts of the parsing ourselves. In the simple example below, we're printing the normalized *index* value of each sentence part, along with its role:

In [4]:
# print basic parsing output
for s in iknow.m_index['sentences']:
    for e in s['entities']:
        print(e['type']+': '+e['index'])

Concept: belgian chocolate
Relation: is
Concept: suprisingly popular
Relation: for
NonRelevant: a
Concept: country
Relation: that doesn't have
NonRelevant: any
Concept: cocoa trees


### Highlighting

The following snippet pulls in `colorama`, which is a convenient package for highlighting command-line output (that also works in most notebook apps). There are fancier packages with more options, but this one is universal across UNIX and Windows.

This time we'll leverage both the entity role and the *Negation* and *Certainty* attributes iKnow detects in natural language text. We'll create this as a function so we can easily reuse it in further examples.

In [5]:
# now use colorama to make it look nicer
from colorama import Fore, Style

#from colorama import init
#init() # init colorama - only when running outside notebook

def highlight(text, language="en", iknow=iknowpy.iKnowEngine()):
    
    iknow.index(text, language)
    
    for s in iknow.m_index['sentences']:
        
        # first figure out where negation spans are and tag those entities
        for a in s['path_attributes']:
            
            # path attributes are expressed as positions within s['path'],
            # which in turn keys into the s['entities'] array
            for ent in range(s['path'][a['pos']], 
                             s['path'][a['pos']+a['span']-1]+1):
                if a['type']=="Negation":
                    s['entities'][ent]['colour'] = Fore.RED
                if a['type']=="Certainty":
                    s['entities'][ent]['colour'] = Fore.CYAN
                    
        for e in s['entities']:
            colour = Fore.BLACK
            style = Style.NORMAL
            
            if "colour" in e:
                colour = e["colour"]
                
            if (e['type'] == 'Concept'):
                style = Style.BRIGHT
            if (e['type'] == 'NonRelevant') | (e['type'] == 'PathRelevant'):
                style = Style.DIM
                
            print(colour + style + text[e['offset_start']:e['offset_stop']], end=' ')
            
        print("\n")


In [6]:
highlight("The quick brown fox did not manage to jump over the lazy dog, but still managed to catch the curry chicken.")

[30m[2mThe [31m[1mquick brown fox [31m[22mdid not manage to jump over [31m[2mthe [31m[1mlazy dog, [30m[22mbut still managed to catch [30m[2mthe [30m[1mcurry chicken. 



And as a final part, we'll use this function to render a quick grab of an international news feed using the `feedparser` package and quickly spot uncertain or negated phrases.

In [7]:
# now let's look at some real text and pick up an RSS feed
import feedparser

feed = feedparser.parse("http://newsrss.bbc.co.uk/rss/newsonline_world_edition/americas/rss.xml")

# feedparser helps us browse through the entries
for entry in feed.entries:
    print(entry.description)

White House senior adviser Stephen Miller tests positive and several military leaders quarantine.
The co-founder of Van Halen is remembered as a "Guitar God" following his death from cancer aged 65.
The president and Democratic leader Nancy Pelosi trade blame for the collapse of negotiations.
The H-1B visa has mostly been used by Indian and Chinese technology workers to fill skills gaps.
The social network is deleting groups, pages and accounts linked to the conspiracy theory movement.
A grand jury indicts the streaming service for the alleged "lewd exhibition" of under-age children.
The American artist was best known for the 1972 hit I Can See Clearly Now.
A congressional report from House Democrats recommends changes that could lead to breaking up the companies.
The US has updated its guidance to reflect how the virus can linger in the air, sometimes for hours.
The Democratic presidential nominee criticises Donald Trump for downplaying Covid-19.
The US secretary of state met foreign 

In [8]:
# and apply iKnow highlighting to it
for entry in feed.entries:
    highlight(entry.description)

[30m[1mWhite House senior adviser Stephen Miller [30m[22mtests [30m[1mpositive [30m[22mand [30m[1mseveral military leaders quarantine. 

[30m[2mThe [30m[1mco-founder [30m[22mof [30m[1mVan Halen [30m[22mis remembered as [30m[2ma [30m[1m"Guitar God" [30m[22mfollowing [30m[2mhis [30m[1mdeath [30m[22mfrom [30m[1mcancer [30m[22maged [30m[1m65. 

[30m[2mThe [30m[1mpresident [30m[22mand [30m[1mDemocratic leader Nancy Pelosi trade blame [30m[22mfor [30m[2mthe [30m[1mcollapse [30m[22mof [30m[1mnegotiations. 

[30m[2mThe [30m[1mH-1B visa [30m[22mhas mostly been used by [30m[1mIndian [30m[22mand [30m[1mChinese technology workers [30m[22mto fill [30m[1mskills gaps. 

[30m[2mThe [30m[1msocial network [30m[22mis deleting [30m[1mgroups, [30m[1mpages [30m[22mand [30m[1maccounts [30m[22mlinked to [30m[2mthe [30m[1mconspiracy theory movement. 

[30m[2mA [30m[1mgrand jury [30m[22mindicts [30m[2mthe [30m[



## Feature Engineering

The following code sample leverages iKnow to refine text-based Feature Engineering by removing negated parts of a sentence. This is a little blunt as a general tactic and different types of problems (and text) require different approaches, but it makes for a clear demo.

We'll leverage iKnow to scrub the input fed into skicit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class. You'll need to run `pip install sklearn` to load those libraries before running this paragraph.

In [9]:
from sklearn.datasets import fetch_20newsgroups

categories = [
 'rec.autos',
 'misc.forsale',
 'sci.med',
]
example = 12

data_train = fetch_20newsgroups(subset='train', categories=categories, random_state=123)
print(data_train.data[example])

Subject: apology (was Re: Did US drive on the left?)
From: aas7@po.CWRU.Edu (Andrew A. Spencer)
Reply-To: aas7@po.CWRU.Edu (Andrew A. Spencer)
Organization: Case Western Reserve University, Cleveland, OH (USA)
NNTP-Posting-Host: slc5.ins.cwru.edu
Lines: 54


In a previous article, dh3q+@andrew.cmu.edu ("Daniel U. Holbrook") says:

>>i'm guessing, but i believe in the twenties we probably drove mostly down
>>cattle trails and in wagon ruts.  I am fairly sure that placement of the 
>>steering wheel was pretty much arbitrary to the company at that time.....
>
>By the 1920s, there was a very active "good roads" movement, which had
>its origins actually in the 1890s during the bicycle craze, picked up
>steam in the teens (witness the Linclon Highway Association, 1912 or so,
>and the US highway support act (real name: something different) in 1916
>that first pledged federal aid to states and counties to build decent
>roads. Also, the experience of widespread use of trucks for domestic
>trans

The role of the `CountVectorizer` is to transform an array of strings into a document-term matrix with one column for each word and word frequencies as the corresponding values.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

vectorized = CountVectorizer().fit_transform(data_train.data)

print(vectorized.shape)
print("Sample record, after transformation:")
print(vectorized[example])

(1773, 28948)
Sample record, after transformation:
  (0, 12282)	3
  (0, 10575)	5
  (0, 15327)	2
  (0, 25118)	2
  (0, 12068)	5
  (0, 26358)	1
  (0, 12836)	5
  (0, 27352)	1
  (0, 19440)	1
  (0, 27177)	2
  (0, 16428)	1
  (0, 13410)	2
  (0, 19198)	12
  (0, 25977)	8
  (0, 26044)	4
  (0, 15454)	1
  (0, 10114)	2
  (0, 22561)	1
  (0, 19298)	9
  (0, 17324)	1
  (0, 4469)	12
  (0, 25984)	34
  (0, 4347)	2
  (0, 26294)	8
  (0, 11845)	2
  :	:
  (0, 15781)	2
  (0, 4293)	1
  (0, 10123)	2
  (0, 9545)	1
  (0, 23450)	1
  (0, 9619)	1
  (0, 18876)	1
  (0, 9046)	1
  (0, 6936)	1
  (0, 17430)	1
  (0, 4681)	1
  (0, 13717)	2
  (0, 28521)	1
  (0, 25033)	1
  (0, 7014)	2
  (0, 8048)	1
  (0, 21230)	1
  (0, 19513)	1
  (0, 24535)	1
  (0, 25976)	1
  (0, 8519)	1
  (0, 4639)	1
  (0, 13348)	1
  (0, 20740)	1
  (0, 10253)	1


Now we'll use this `CountVectorizer` as part of a pipeline to predict the target field (newsgroup category) based on the text input.

In [11]:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('sgd', SGDClassifier()),
])
pipeline.fit(data_train.data, data_train.target)


# select test set to assess the quality of our model
data_test = fetch_20newsgroups(subset='test', categories=categories)

print(classification_report(pipeline.predict(data_test.data), data_test.target))

              precision    recall  f1-score   support

           0       0.93      0.88      0.90       408
           1       0.88      0.90      0.89       384
           2       0.88      0.90      0.89       390

    accuracy                           0.90      1182
   macro avg       0.90      0.90      0.90      1182
weighted avg       0.90      0.90      0.90      1182



After setting up this base pipeline, let's create an additional transformation step that leverages iKnow to get rid of all negated sentence spans. We'll first create a `strip_negation()` method similar to the `highlight()` method above, and then use it in a Transformer class implementing the appropriate sklearn interface.

In [12]:
from sklearn.base import BaseEstimator, TransformerMixin
import iknowpy


def strip_negation(text, language="en", iknow=iknowpy.iKnowEngine()):
    
    iknow.index(text, language)
    stripped = ""

    for s in iknow.m_index['sentences']:
        
        # first figure out where negation spans are and tag those entities
        for a in s['path_attributes']:
            
            # path attributes are expressed as positions within s['path'],
            # which in turn keys into the s['entities'] array
            if a['type']=="Negation":
                for ent in range(s['path'][a['pos']], 
                                 s['path'][a['pos']+a['span']-1]+1):
                    s['entities'][ent]['neg'] = 1
                    
        for e in s['entities']:
            if "neg" in e:
                continue
            stripped += text[e['offset_start']:e['offset_stop']] + " "

    return stripped


# implement sklearn Transformation interface
class iKnowNegationStripper(BaseEstimator, TransformerMixin):
    
    def __init__(self, language = "en"):
        self.engine = iknowpy.iKnowEngine()
        self.language = language
        
    def fit(self, X, y = None):
        return self
    
    def transform(self, X, y = None):
        X_ = []
        for source_text in X:
            X_.append(strip_negation(source_text, self.language, self.engine))
        return X_

This new Transformer can now be included at the start of the pipeline to build the model anew. For this particular (public) dataset, the difference in accuracy is very small, but in other datasets where negation (or any of the other attributes iKnow detects) is more important with respect to the target field, the uptick in precision can be more substantial. Other approaches included not leaving the attributed entities out, but rather flagging them with a suffix so they end up as separate features after the `CountVectorizer` transfomation.

In [13]:
pipeline2 = Pipeline([
    ('ik', iKnowNegationStripper()),
    ('vect', CountVectorizer()),
    ('sgd', SGDClassifier()),
])

pipeline2.fit(data_train.data, data_train.target)

print(classification_report(pipeline2.predict(data_test.data), data_test.target))

              precision    recall  f1-score   support

           0       0.96      0.91      0.93       412
           1       0.92      0.92      0.92       397
           2       0.89      0.94      0.91       373

    accuracy                           0.92      1182
   macro avg       0.92      0.92      0.92      1182
weighted avg       0.92      0.92      0.92      1182

