# Auxiliary methods

ToC
 
 * Data reconciliation to Wikidata
 * Web scraping with BeautifulSoup
 * NER with spaCy

This tutorial addresses some typical scenarios you may face while integrating data from different sources. 



## Data reconciliation to Wikidata

When integrating two Linked Data sources that **do not** share the same URIs (remember the URIs of art historians in ARTchives *and* Wikidata), you need a workaround. In these cases, you can rely on labels (e.g. people's names), class constraints (e.g. being a person), and other contextual information (occupation, birth dates, birth places, etc.) that can be matched between sources. The method works as follows:

 * we gather the labels of the entities we want to match (in the example below we use only one label, `Federico Zeri`)
 * we query the Wikidata API with the library `requests`, looking up Wikidata (WD) entities that have the label `Federico Zeri`
 * we get a list of results (string-based matches only)
 * we decide how to filter the results, e.g. we want only humans having that name. In Wikidata humans match the pattern `wdt:P31 wd:Q5`
 * we send an `ASK` query to the WD SPARQL endpoint (the API does not support this kind of query) to check whether the first result returned by the API matches the pattern we want (`wdt:P31 wd:Q5`)
 * we return the QID of the entity and whether it matches the pattern
 
NB. You can improve this method:

 * you can also run the `ASK` query on the other results returned by the API
 * you can ask for multiple (OPTIONAL) triple patterns
 * you can take a list of strings as input, not just one
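
As a hedged sketch of the first improvement, several candidates returned by the API could be checked in one query instead of one `ASK` per entity. The helper `build_class_check_query` is a hypothetical name, not part of `qwikidata`. It only builds the SPARQL string, and it assumes the `wd:` and `wdt:` prefixes are predefined by the endpoint, as they are for the `ASK` query used in this tutorial.

```python
import re

# a QID is the letter Q followed by a positive integer
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")

def build_class_check_query(qids, q_class="Q5"):
    """Return a SELECT query listing which of the given QIDs are instances of q_class."""
    for qid in list(qids) + [q_class]:
        # refuse anything that does not look like a QID, since it is pasted into the query
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"not a valid QID: {qid!r}")
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item WHERE { "
        f"VALUES ?item {{ {values} }} "
        f"?item wdt:P31 wd:{q_class} . "
        "}"
    )

# the two QIDs are the first two candidates returned for "Federico Zeri"
candidates_query = build_class_check_query(["Q1089074", "Q23687322"])
print(candidates_query)
```

The resulting string could be passed to `return_sparql_query_results`. For a `SELECT` query the matching entities would be listed under `results` / `bindings` in the response, rather than under `boolean`.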

We need to install `qwikidata`, a library that makes querying the Wikidata SPARQL endpoint easier.

In [1]:
!pip install qwikidata
import pprint
import requests
from qwikidata.sparql  import return_sparql_query_results

pp = pprint.PrettyPrinter(indent=1)

def wikidata_reconciliation(query, q_class=None):
    """ look up a string in the WD API and, if q_class is given, ask the WD endpoint
    whether the first result is an instance of that class; return the QID and a message """
    
    API_WD = "https://www.wikidata.org/w/api.php"
    params = {
        'action': 'wbsearchentities',
        'format': 'json',
        'language': 'en',
        'search': query # the query string
    }
    
    # query wd API    
    r = requests.get(API_WD, params = params).json() 
    pp.pprint(r) # the response
    
    # check the first result only (if there is any)
    if 'search' in r and len(r['search']) >= 1:
        qid = r['search'][0]['title']
        # if specified, double check whether the entity belongs to the class
        if q_class:
            query_string = "ASK {wd:" + qid + " wdt:P31 wd:" + q_class + ". }"
            
            # query the WD SPARQL endpoint this time!
            res = return_sparql_query_results(query_string) 
            print("\n my string:", query, "\n the query to WD endpoint:", query_string, "\n the result:", res)
            
            if res["boolean"]: 
                return [qid, 'the class matches :)']
            else:
                return [qid, 'the class does not match :(']
        else:
            return [qid, 'no class was given']
    else:
        return 'no results matching the query string'



In [2]:
wikidata_reconciliation("Federico Zeri", "Q5")

{'search': [{'concepturi': 'http://www.wikidata.org/entity/Q1089074',
             'description': 'Italian art historian',
             'display': {'description': {'language': 'en',
                                         'value': 'Italian art historian'},
                         'label': {'language': 'en', 'value': 'Federico Zeri'}},
             'id': 'Q1089074',
             'label': 'Federico Zeri',
             'match': {'language': 'en',
                       'text': 'Federico Zeri',
                       'type': 'label'},
             'pageid': 1036932,
             'repository': 'wikidata',
             'title': 'Q1089074',
             'url': '//www.wikidata.org/wiki/Q1089074'},
            {'aliases': ['Federico Zeri Foundation'],
             'concepturi': 'http://www.wikidata.org/entity/Q23687322',
             'description': 'Cultural institution',
             'display': {'description': {'language': 'en',
                                         'value': 'Cultural ins

['Q1089074', 'the class matches :)']
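
The second result above is described as a cultural institution, which shows why candidates other than the first may deserve a check too. Here is a minimal sketch of how all candidates could be pulled out of the API response. `list_candidates` is a hypothetical helper, and `sample_response` is a hand-trimmed copy of the response printed above (only the keys that are visible there), not a live call.

```python
def list_candidates(response):
    """Return (QID, label, description) for every entity in a wbsearchentities response."""
    candidates = []
    for entity in response.get("search", []):
        # 'label' and 'description' may be missing, so fall back to an empty string
        candidates.append(
            (entity["id"], entity.get("label", ""), entity.get("description", ""))
        )
    return candidates

# trimmed copy of the response shown above
sample_response = {
    "search": [
        {"id": "Q1089074", "label": "Federico Zeri", "description": "Italian art historian"},
        {"id": "Q23687322", "description": "Cultural institution"},
    ]
}

for qid, label, description in list_candidates(sample_response):
    print(qid, "|", label, "|", description)
```

Each QID in the list could then be tested against the class, instead of stopping at `r['search'][0]`.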

## Web scraping

How can we integrate Linked Open Data with *non-LOD* data, e.g. data from web pages? For instance, how can we enrich our `Federico Zeri` (now reconciled to Wikidata) with information on the people he was connected to, taken from the [Dictionary of art historians](https://arthistorians.info/)?

Web pages can be parsed with the Python library `BeautifulSoup`:
 * we study the structure of the HTML page to get the path of the elements mentioning people. In XPath this would be `//div[@class='field-name-body']/div[@class='field-items']/div/p/a`
 * we request a web page with `requests`, given its URL. It returns the HTML as a string.
 * we parse the HTML either with an XML/HTML parser like `lxml`, which supports XPath (not covered in this tutorial), or with `BeautifulSoup`, using CSS selectors to get to the elements we want.
 
XPath: `//div[@class='field-name-body']/div[@class='field-items']/div/p/a`

becomes CSS selector: `div.field-name-body div.field-items div p a`
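
To see the selector at work before touching the real page, here is a small sketch on a made-up HTML fragment. The fragment only imitates the structure described above and is an assumption, not a copy of the Dictionary's markup. The variable name `snippet_soup` is chosen so that it does not clash with the `soup` variable used in the cells below.

```python
from bs4 import BeautifulSoup

# hypothetical fragment that imitates the structure described above
html = """
<div class="field-name-body">
  <div class="field-items">
    <div>
      <p>He studied under <a href="https://arthistorians.info/toescap">Pietro Toesca</a>.</p>
    </div>
  </div>
</div>
<p><a href="https://example.org/">A link outside the body field</a></p>
"""

snippet_soup = BeautifulSoup(html, "html.parser")

# a space in a CSS selector means "descendant of", at any depth
links = snippet_soup.select("div.field-name-body div.field-items div p a")
names = [a.get_text() for a in links]
print(names)
```

Only the link nested inside the body field is selected. The link in the last paragraph is ignored because it has no `div.field-name-body` ancestor.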


In [12]:
# !pip install beautifulsoup4

from bs4 import BeautifulSoup

#URL = "https://arthistorians.info/zerif" # the web page of Federico Zeri on the Dictionary of Art historians
#page = requests.get(URL)

#soup = BeautifulSoup(page.content, "html.parser")
with open("zerif.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
pp.pprint(soup)

<!DOCTYPE html>

<!-- saved from url=(0032)https://arthistorians.info/zerif -->
<html class=" js" dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# "><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://arthistorians.info/zerif" rel="canonical"/>
<meta content="Collector and art historian of renaissance Italy. Zeri was born into a wealthy Roman family. He attended Rome University, where he initially studied botany. In 1944 he switched to the department of fine art under Pietro Toesca, a leading scholar of (then) undervalued Italian medieval art. Toesca introduced him to Roberto Longhi. He also made the persona

In [16]:
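# the saved page marks up the body text with the class field__item,
# so we select div.field__item p a (instead of div.field-name-body div.field-items div p a)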
for person in soup.select("div.field__item p a"):
    print("\n HTML element:", person, "\n text:", person.text)


 HTML element: <a href="https://arthistorians.info/toescap">Pietro Toesca</a> 
 text: Pietro Toesca

 HTML element: <a href="https://arthistorians.info/longhir">Roberto Longhi</a> 
 text: Roberto Longhi

 HTML element: <a href="https://arthistorians.info/berensonb">Bernard Berenson</a> 
 text: Bernard Berenson

 HTML element: <a href="https://arthistorians.info/argang">Giulio Carlo Argan</a> 
 text: Giulio Carlo Argan

 HTML element: <a href="https://arthistorians.info/posnerd">Donald Posner</a> 
 text: Donald Posner

 HTML element: <a href="https://arthistorians.info/fredericksenb">Burton Fredericksen</a> 
 text: Burton Fredericksen

 HTML element: <a href="https://www.aaa.si.edu/search/collections?edan_q=Ferree,%20Barr">Archives of American Art</a> 
 text: Archives of American Art

 HTML element: <a href="https://www.inha.fr/fr/ressources/publications/publications-numeriques/dictionnaire-critique-des-historiens-de-l-art/rechercher-dans-le-dictionnaire.html">Dictionnaire critique des
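
The output above also contains links to external resources (e.g. the Archives of American Art), which are not people. One possible way to filter them out is to keep only the links that stay on the Dictionary's own host. This is a sketch under the assumption that entries about people live on `arthistorians.info`, which the output suggests but does not guarantee. `is_dictionary_entry` is a hypothetical helper.

```python
from urllib.parse import urlparse

def is_dictionary_entry(href, host="arthistorians.info"):
    """True if the link points to a page (not the home page) on the given host."""
    parsed = urlparse(href)
    return parsed.netloc == host and parsed.path.strip("/") != ""

# a few of the hrefs printed above
hrefs = [
    "https://arthistorians.info/toescap",
    "https://arthistorians.info/longhir",
    "https://www.aaa.si.edu/search/collections?edan_q=Ferree,%20Barr",
]

entries = [h for h in hrefs if is_dictionary_entry(h)]
print(entries)
```

With the parsed page, the same test could be applied to `person.get("href", "")` inside the loop over `soup.select(...)`.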

For more on web scraping with BeautifulSoup, have a look at this tutorial: [realpython](https://realpython.com/beautiful-soup-web-scraper-python/)

## Named Entity Recognition (NER)

What if the HTML page does not offer such clean data? What if we have to parse the plain text, which may include more references to people?

We can use one of the many Python libraries for NLP and NER, e.g. `spaCy`.

 * we can transform the HTML parent element into plain text
 * we can process it with a spaCy pipeline, whose `ner` component recognises entities, and extract those labelled `PERSON`

In [15]:
# transform a HTML element into text (regardless of children elements)

txt = []
for p in soup.select("div.field__item p"):
    txt.append(p.text)
    
full_txt = " ".join(txt)
print(full_txt)

Collector and art historian of renaissance Italy. Zeri was born into a wealthy Roman family. He attended Rome University, where he initially studied botany. In 1944 he switched to the department of fine art under Pietro Toesca, a leading scholar of (then) undervalued Italian medieval art. Toesca introduced him to Roberto Longhi. He also made the personal acquaintance of influential rival to Longhi, Harvard art historian Bernard Berenson. Zeri described his meeting with Berenson as lasting "from 16:32 to 16:54 precisely". After graduating, Zeri worked for the Italian Ministry for Cultural Heritage in its fine arts committee for six years. In 1952, however, Zeri left, claiming that mismanagement and bureaucratic lethargy in the ministry were destroying the very monuments they were intended to save. Others suggested that Giulio Carlo Argan, the Inspector, actually dismissed Zeri for reasons of conflict of interest with private work. This proved to be a boon for Zeri, who developed his pro

In [7]:
# !pip install spacy
# !python -m spacy download en_core_web_sm
import spacy

# load the small English pipeline (it includes a NER component)
NER = spacy.load("en_core_web_sm")

parsed = NER(full_txt)
for word in parsed.ents:
    print(word.text, word.label_)



In [None]:
# get only the people
people = set()
for word in parsed.ents:
    if word.label_ == 'PERSON':
        people.add(word.text)
        
pp.pprint(people) 

{'Berenson',
 'Bernard Berenson',
 'Bruno Zanardi',
 'Burton Fredericksen',
 "Dietro L'Immagine",
 'Donald Posner',
 'Florence',
 'Galleria Pallavicini',
 'Galleria Spada',
 'Giotto',
 'Giulio Carlo Argan',
 "L'Inchiostro Variopinto",
 'Longhi',
 'Ludovisi Throne',
 'Michelangelo',
 'Modigliani',
 'Roberto Longhi',
 'Roman',
 'Sistine Chapel',
 'Vittorio Cini',
 'Walter Art Gallery',
 'Zeri'}


What a mess! 

 * We got more results than by looking at the HTML links only (e.g. Vittorio Cini, Michelangelo, Modigliani) - GOOD!
 * We got false positives (Florence is not a person) - we can use WD reconciliation to check that these entities exist and are instances of `Human / Q5`
 * We got false negatives (Pietro Toesca is missing, being recognised as a PRODUCT, not as a person) - that's life :(
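
A possible clean-up, sketched under assumptions: first drop single-word names that are just the surname of a longer name in the same set (`Berenson` next to `Bernard Berenson`), then keep only the names that reconcile to a human. Both helpers are hypothetical. `keep_humans` takes the reconciliation function as a parameter and relies on the return values of `wikidata_reconciliation` as defined above (a list `[QID, message]`, or a string when nothing is found).

```python
def drop_partial_names(names):
    """Remove names that are only the last word of a longer name in the same set."""
    surnames = {n.split()[-1] for n in names if " " in n}
    return {n for n in names if n not in surnames}

def keep_humans(names, reconcile):
    """Map each name to its QID if the first match is an instance of Q5."""
    humans = {}
    for name in sorted(names):
        result = reconcile(name, "Q5")
        # a string result means that no entity was found at all
        if isinstance(result, list) and result[1] == 'the class matches :)':
            humans[name] = result[0]
    return humans

# a few of the entities extracted above
sample = {'Berenson', 'Bernard Berenson', 'Longhi', 'Roberto Longhi', 'Florence'}
print(sorted(drop_partial_names(sample)))
```

In the notebook this could be called as `keep_humans(drop_partial_names(people), wikidata_reconciliation)`. Note that it sends one API request and one SPARQL query per name, and that it only checks the first match, so an ambiguous string like `Florence` may still reconcile to a person with that name.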
 
See this nice [tutorial with spaCy](https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/)