# Teach a machine to understand human language
##### WeAreDevelopers World Congress 2019

## Part 2 - Answering Questions Based on Named-Entity Recognition

**Goal**: We want to answer questions about a specific property of a device, like *What kind of CPU does my Samsung Galaxy S7 have?*

**Step 1**: Mark device names and properties by using Named-Entity Recognition (NER)<br>
**Step 2**: Build a query against the Wikidata Ontology

Before building a NER solution, let's have a look at the data.

In [2]:
import pandas as pd
path = "/home/paul/Downloads/classifier_dataset.csv"
df = pd.read_csv(path, encoding="utf-8", header=0, sep=",")

In [3]:
df[df['cat']=='devices'][0:10]

Unnamed: 0,question,cat
0,"Like many AUX, the Lombard has S or Lombards",devices
1,Is there still the Monaco 1?,devices
2,Show me Comfort Pro P 500 devices,devices
3,which connection options does the Ergotel s ha...,devices
4,Show me all the colors for the Actron Card,devices
5,I&#39;m looking for a garnet 1 phone,devices
6,Which SIM card is in the Google Pixel?,devices
7,What is the purchase price for speedort W724 v,devices
8,What weighs a Samsung phone?,devices
9,What is the bottom line between the Google Pix...,devices
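
Row 5 in the output above still contains an HTML entity (`&#39;`). If such entities should be decoded before tagging, Python's standard `html.unescape` can do that. A minimal sketch; the helper name `clean_question` is made up for illustration and is not part of the original notebook:

```python
import html

def clean_question(text):
    """Decode HTML entities such as &#39; and trim surrounding whitespace."""
    return html.unescape(text).strip()

print(clean_question("I&#39;m looking for a garnet 1 phone"))
```

With the DataFrame loaded above, this could be applied as `df['question'].map(clean_question)`.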


### Mark device names and properties by using Named-Entity Recognition (NER)

There are different models for NER:
- Conditional Random Fields (CRF)
- Convolutional Neural Networks (CNN)
- Long Short-Term Memory (LSTM)

**Conditional Random Fields (CRF)** <br>
<img src="file:///home/paul/Documents/wwc-demo/crf.png" width="800">

**Convolutional Neural Networks (CNN)** <br>
<img src="https://user-images.githubusercontent.com/7529838/34460821-5e3542f4-ee5d-11e7-93d4-f8ce81984b89.png">

**Long Short-Term Memory (LSTM)** <br>
<img src="file:///home/paul/Documents/wwc-demo/lstm3.png" width="400">
<img src="file:///home/paul/Documents/wwc-demo/lstm1.png" width="600">

#### Try some pretrained models

The majority of state-of-the-art Named-Entity Recognition models use some kind of LSTM. Training such a model would take more time than we have, so let's try some pretrained ones first. Performant and easy-to-use models are available through modules like flair or allennlp.

Let's try some models from flair, a project by Zalando Research.

In [4]:
from flair.data import Sentence
from flair.models import SequenceTagger

They have trained a BiLSTM-CRF with some special embeddings.

In [5]:
tagger_conll03 = SequenceTagger.load('ner')
tagger_ontonotes = SequenceTagger.load('ner-ontonotes')
tagger_wikiner = SequenceTagger.load('/home/paul/Downloads/resources/taggers/example-ner/best-model.pt')

2019-06-06 14:30:55,632 loading file /home/paul/.flair/models/en-ner-conll03-v0.4.pt
2019-06-06 14:31:17,221 loading file /home/paul/.flair/models/en-ner-ontonotes-v0.3.pt
2019-06-06 14:31:31,296 loading file /home/paul/Downloads/resources/taggers/example-ner/best-model.pt


In [6]:
def print_tags(tagger,text):
    sentence = Sentence(text)
    tagger.predict(sentence)

    for entity in sentence.get_spans('ner'):
        print(entity)

In [7]:
print_tags(tagger_conll03, 'George Washington went to Washington .')

PER-span [1,2]: "George Washington"
LOC-span [5]: "Washington"


In [8]:
print('What kind of CPU has the Samsung Galaxy S7')
print_tags(tagger_conll03, 'What kind of CPU has the Samsung Galaxy S7')

What kind of CPU has the Samsung Galaxy S7
ORG-span [4]: "CPU"
MISC-span [7,8,9]: "Samsung Galaxy S7"


In [9]:
from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.12.18.tar.gz")
res = predictor.predict(
  sentence="What kind of CPU has the Samsung Galaxy S7"
)



In [10]:
# Collect all MISC tokens into one string and print every other tagged token
misc = []
for word, tag in zip(res['words'], res['tags']):
    if tag == 'O':
        continue  # token is not part of any entity
    if tag.endswith('MISC'):
        misc.append(word)
    else:
        print(word, tag)
print(' '.join(misc), 'MISC')

CPU U-ORG
Samsung Galaxy S7 MISC
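
The loop above joins every MISC token into a single string, so two separate MISC entities in one sentence would be merged. A minimal sketch of grouping tokens into separate spans follows, assuming a BIOUL tag scheme (`B-`, `I-`, `L-`, `U-` prefixes) as suggested by the `U-ORG` tag in the output. The function name `bioul_to_spans` and the exact tags for "Samsung Galaxy S7" are assumptions for illustration:

```python
def bioul_to_spans(words, tags):
    """Group tokens into (text, label) spans based on BIOUL tags."""
    spans = []
    current_words, current_label = [], None
    for word, tag in zip(words, tags):
        if tag == 'O':
            # outside any entity: drop an unfinished span, if there is one
            current_words, current_label = [], None
            continue
        prefix, label = tag.split('-', 1)
        if prefix == 'U':
            # single-token entity
            spans.append((word, label))
            current_words, current_label = [], None
        elif prefix == 'B':
            # first token of a multi-token entity
            current_words, current_label = [word], label
        elif prefix in ('I', 'L') and current_label == label:
            current_words.append(word)
            if prefix == 'L':
                # last token closes the entity
                spans.append((' '.join(current_words), label))
                current_words, current_label = [], None
    return spans

words = ['What', 'kind', 'of', 'CPU', 'has', 'the', 'Samsung', 'Galaxy', 'S7']
# Assumed tags; only 'U-ORG' is confirmed by the output above
tags = ['O', 'O', 'O', 'U-ORG', 'O', 'O', 'B-MISC', 'I-MISC', 'L-MISC']
print(bioul_to_spans(words, tags))
```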


So the pretrained models detect the product names quite accurately, but not the properties: "CPU" is tagged as an organization instead of a device property.

#### Training our own model

In [19]:
import spacy
from spacy.util import minibatch, compounding  # used by train_spacy below
import json
import random
import pickle

In [20]:
def train_spacy(data,iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('de')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses)
            print(losses)
    return nlp

Train a new model
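
`train_spacy` expects a list of `(text, {'entities': [(start, end, label), ...]})` pairs, where `start` and `end` are character offsets into the text. A minimal sketch of building such a pair follows. The helper `make_example` is hypothetical, the English sentence is only illustrative (the blank model above is German), and the labels `PRODUKT` and `EIGENSCHAFT` are the ones used later in this notebook:

```python
def make_example(text, entities):
    """Build one (text, annotations) pair from (substring, label) tuples."""
    spans = []
    for substring, label in entities:
        start = text.find(substring)  # first occurrence only
        if start == -1:
            raise ValueError(repr(substring) + ' not found in ' + repr(text))
        spans.append((start, start + len(substring), label))
    return (text, {'entities': spans})

TRAIN_DATA = [
    make_example('What kind of CPU has the Samsung Galaxy S7',
                 [('CPU', 'EIGENSCHAFT'), ('Samsung Galaxy S7', 'PRODUKT')]),
]
print(TRAIN_DATA[0])
```

A model could then be trained with something like `prdnlp = train_spacy(TRAIN_DATA, 20)`; the iteration count here is an arbitrary choice, and a single example is far too little data for a useful model.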

Test your text

In [21]:
import spacy
nlp = spacy.blank('de')
prdnlp = spacy.load('/home/paul/Documents/wwc-demo/telekom-test')
test_text = input("Enter your testing text: ")
doc = prdnlp(test_text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Enter your testing text: 


### Build a query against the Wikidata Ontology
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Datamodel_in_Wikidata.svg/2560px-Datamodel_in_Wikidata.svg.png" width=600>

Now let's use that knowledge and query Wikidata for an answer. You can do so by building SPARQL queries. You can build and try queries on your own using this endpoint: https://query.wikidata.org/ .

In [25]:
import re
import requests

In [26]:
def get_property(property_name):
    url = 'https://query.wikidata.org/sparql'
    query = """
    SELECT ?property ?propertyLabel WHERE {{
        ?property rdfs:label ?propertyLabel. 
        ?property a wikibase:Property .
        FILTER(LANG(?propertyLabel) = "de").
        FILTER(CONTAINS(LCASE(?propertyLabel), "{}"@de)). 
    }} LIMIT 1
    """.replace('\n','').format(property_name.lower())
    query =  re.sub(' +', ' ', query)
    r = requests.get(url, params = {'format': 'json', 'query': query})
    data = r.json()
    name = data["results"]["bindings"][0]["property"]["value"]
    entity_id = name.split('http://www.wikidata.org/entity/')[1]
    return entity_id
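
`get_property` places the user's text directly inside a double-quoted SPARQL string, so a quote or backslash in the input would break the query. A minimal sketch of escaping the input first follows; the helper name `escape_sparql_literal` is made up for illustration and only covers the common escape sequences:

```python
def escape_sparql_literal(value):
    """Escape characters that would end or corrupt a double-quoted SPARQL string."""
    replacements = [
        ('\\', '\\\\'),  # backslash first, so later escapes are not doubled
        ('"', '\\"'),
        ('\n', '\\n'),
        ('\r', '\\r'),
        ('\t', '\\t'),
    ]
    for char, escaped in replacements:
        value = value.replace(char, escaped)
    return value

print(escape_sparql_literal('5" display'))
```

It could be applied as `.format(escape_sparql_literal(property_name.lower()))` when building the query string.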

In [27]:
def get_device(device_name, property_id):
    url = 'https://query.wikidata.org/sparql'
    query = """
    SELECT ?itemLabel ?propertyLabel
    WHERE {{ 
      ?item rdfs:label ?itemLabel. 
      ?item wdt:P279 wd:Q22645.
      FILTER(CONTAINS(LCASE(?itemLabel), "{0}"@de)). 
      ?item wdt:{1} ?property.
      ?property rdfs:label ?propertyLabel. 
      FILTER(LANG(?propertyLabel) = "de"). 
    }} LIMIT 1
    """.replace('\n','').format(device_name.lower(), property_id)
    query =  re.sub(' +', ' ', query)
    r = requests.get(url, params = {'format': 'json', 'query': query})
    data = r.json()
    name = data["results"]['bindings']
    return name

In [28]:
get_device("Galaxy S7", get_property("CPU"))

[{'itemLabel': {'xml:lang': 'de',
   'type': 'literal',
   'value': 'Samsung Galaxy S7'},
  'propertyLabel': {'xml:lang': 'de',
   'type': 'literal',
   'value': 'Samsung Exynos'}}]
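
Each binding in the result wraps its value in a dictionary with type and language metadata. If only the plain values are needed, they can be unpacked as in the sketch below; `flatten_bindings` is a hypothetical helper, and the sample data is copied from the output above:

```python
def flatten_bindings(bindings):
    """Reduce each SPARQL result row to {variable name: plain value}."""
    return [{name: cell['value'] for name, cell in row.items()}
            for row in bindings]

sample = [{'itemLabel': {'xml:lang': 'de',
                         'type': 'literal',
                         'value': 'Samsung Galaxy S7'},
           'propertyLabel': {'xml:lang': 'de',
                             'type': 'literal',
                             'value': 'Samsung Exynos'}}]
print(flatten_bindings(sample))
```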

### Use all that to answer simple questions
Like *What kind of CPU does the Samsung Galaxy S7 have?*. Keep in mind that you can only get results for properties that are defined in the Wikidata entry. Additionally, this query builder is not meant to answer complex questions, because it lacks most of the logic that would require.

In [29]:
test_text = input("Enter your testing text: ")
doc = prdnlp(test_text)
product = [ent.text for ent in doc.ents if ent.label_ == 'PRODUKT'][0]
propert = [ent.text for ent in doc.ents if ent.label_ == 'EIGENSCHAFT'][0]
res = get_device(product, get_property(propert))
print('The ' + product + ' has a ' + res[0]['propertyLabel']['value'] + ' ' + propert)

Enter your testing text: blabbla


IndexError: list index out of range
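
The `IndexError` above occurs because the model found no `PRODUKT` or `EIGENSCHAFT` entity in the input, so indexing the empty list with `[0]` fails. A minimal sketch of a guarded version follows. The names `first_entity`, `build_answer` and `fake_lookup` are made up for illustration, and `fake_lookup` returns canned data in place of the real Wikidata call:

```python
def first_entity(entities, label):
    """Return the text of the first entity carrying `label`, or None."""
    return next((text for text, ent_label in entities if ent_label == label), None)

def build_answer(entities, lookup):
    """entities: list of (text, label) pairs; lookup: callable returning result bindings."""
    product = first_entity(entities, 'PRODUKT')
    prop = first_entity(entities, 'EIGENSCHAFT')
    if product is None or prop is None:
        return 'Sorry, I could not find both a device and a property in your question.'
    results = lookup(product, prop)
    if not results:
        return 'Sorry, I found no ' + prop + ' entry for the ' + product + '.'
    return 'The ' + product + ' has a ' + results[0]['propertyLabel']['value'] + ' ' + prop

# Stand-in for get_device(product, get_property(prop)); returns canned data
def fake_lookup(product, prop):
    return [{'propertyLabel': {'value': 'Samsung Exynos'}}]

print(build_answer([], fake_lookup))
print(build_answer([('Galaxy S7', 'PRODUKT'), ('CPU', 'EIGENSCHAFT')], fake_lookup))
```

With the objects from this notebook, the entities would be `[(ent.text, ent.label_) for ent in doc.ents]` and the lookup would wrap `get_device(product, get_property(prop))`. Note that `get_property` can itself raise an `IndexError` when no Wikidata property matches, so it would need a similar guard.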