# WebSem Project: Constructing and Querying a Knowledge Graph in the Cycling Domain

## Introduction

The goal of this project is to extract information from multilingual textual documents about cycling and create a knowledge graph (KG) using the extracted entities and relations. The KG will be compatible with a cycling ontology and queries will be written in SPARQL to retrieve specific information from the KG. The project will be implemented using Jupyter Notebook and the following steps will be followed:

* Collect multilingual textual documents about cycling.
* Pre-process the documents to get clean text files.
* Run named entity recognition (NER) on the documents to extract named entities of the type Person, Organization and Location using spaCy and LLMs.
* Run co-reference resolution on the input text using CoreNLP.
* Disambiguate the entities with Wikidata using OpenTapioca and LLMs.
* Run relation extraction using Stanford OpenIE and LLMs.
* Map the extracted entity types and relations to the cycling ontology you developed in Assignment 1 in order to create a knowledge graph of the domain, represented in RDF.
* Load your cycling ontology and the knowledge graph built in the previous step into the Corese engine, as you did for Assignment 2, and write some SPARQL queries to retrieve specific information from the KG.
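
The mapping step can be pictured as turning each extracted entity into RDF triples. The sketch below is a minimal illustration that writes N-Triples by hand with the standard library only; in the notebook itself, rdflib (installed in Step 1) would be the more natural choice. The namespace `http://example.org/cycling#`, the class names `Cyclist`, `Team` and `Place`, and the helper name `entity_to_ntriples` are placeholders, not the ontology from Assignment 1.

```python
from urllib.parse import quote

# Hypothetical namespace and class names; replace with your own ontology's IRIs
CYCLING_NS = "http://example.org/cycling#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Assumed mapping from NER labels to ontology classes
LABEL_TO_CLASS = {"PERSON": "Cyclist", "ORG": "Team", "GPE": "Place"}


def entity_to_ntriples(entity):
    """Return N-Triples lines for one {"text", "label"} entity, or [] if unmapped."""
    class_name = LABEL_TO_CLASS.get(entity["label"])
    if class_name is None:
        return []
    # Build a resource IRI from the surface form (spaces become underscores)
    subject = CYCLING_NS + quote(entity["text"].replace(" ", "_"))
    # Escape backslashes and double quotes inside the literal
    literal = entity["text"].replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{subject}> <{RDF_TYPE}> <{CYCLING_NS}{class_name}> .",
        f'<{subject}> <{RDFS_LABEL}> "{literal}" .',
    ]


triples = entity_to_ntriples({"text": "Jonas Vingegaard", "label": "PERSON"})
print("\n".join(triples))
```

Entities whose label has no entry in `LABEL_TO_CLASS` are skipped, which keeps the graph limited to classes the ontology actually defines.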

### Useful resources
* The GitHub repository "Building knowledge graph from input data" at https://github.com/varun196/knowledge_graph_from_unstructured_text can be used as inspiration.

### References
* NLTK: https://www.nltk.org/
* spaCy: https://spacy.io/
* Stanford OpenIE: https://nlp.stanford.edu/software/openie.html
* OpenTapioca: https://opentapioca.org/
* Corese engine: https://project.inria.fr/corese/
* Wikidata: https://www.wikidata.org/

## Step 1: Collect multilingual textual documents about cycling
For this mini project, we will collect multilingual textual documents about cycling from various sources such as news articles, blog posts, and race reports. We will download the documents and save them in a directory called `cycling_docs`.

The documents to download are:

* English:
 - https://en.wikipedia.org/wiki/2022_Tour_de_France
 - https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11
 - https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21
 - https://www.bbc.com/sport/cycling/61940037
 - https://www.bbc.com/sport/cycling/62017114 (stage 1)
 - https://www.bbc.com/sport/cycling/62097721 (stage 7)
 - https://www.bbc.com/sport/cycling/62153759 (stage 11)
 - https://www.bbc.co.uk/sport/cycling/62285420 (stage 21)

* French:
 - https://fr.wikipedia.org/wiki/Tour_de_France_2022
 - https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-epoustouflant-jonas-vingegaard-remporte-la-11e-etape-et-s-empare-du-maillot-jaune-de-tadej-pogacar_5254102.html
 - https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-jonas-vingegaard-vainqueur-de-sa-premiere-grande-boucle-jasper-philipsen-s-offre-au-sprint-la-21e-etape_5275612.html

In [1]:
#
# Feel free to install more dependencies if needed!
#

# Install requests for querying HTTP endpoints
!pip install --quiet requests

# Install jusText for automatically extracting text from web pages
!pip install --quiet jusText

# Install nltk for text processing
!pip install --quiet nltk

# Install spaCy for NER extraction
!pip install --quiet spacy

# Install pycorenlp for Stanford CoreNLP
!pip install --quiet pycorenlp

# Install pandas for data visualization
!pip install --quiet pandas

# Install rdflib for writing RDF
!pip install --quiet rdflib

In [2]:
# Import necessary modules
import requests
import justext
import os
from urllib.parse import urlsplit


# Define a function to get filename from URL
def get_filename_from_url(url):
  urlpath = urlsplit(url).path
  return os.path.basename(urlpath)


# Define a function to download URLs and extract text
def download_urls(urls_list, language):
  # Loop over each URL in the list
  for url in urls_list:
    # Fetch and extract text from the URL using jusText
    response = requests.get(url)
    paragraphs = justext.justext(
      response.content,
      justext.get_stoplist(language.capitalize()),
      no_headings=True,
      max_heading_distance=150,
      length_low=70,
      length_high=140,
      stopwords_low=0.2,
      stopwords_high=0.3,
      max_link_density=0.4
    )
    extracted_text = '\n'.join(list(filter(None, map(
      lambda paragraph: paragraph.text if not paragraph.is_boilerplate else '',
      paragraphs
    ))))

    # Truncate text if it's too long
    extracted_text = extracted_text[0:10000]

    # Create the output directory if it does not exist
    output_dir = os.path.join('cycling_docs', language)
    os.makedirs(output_dir, exist_ok=True)

    # Save extracted text as a .txt file
    filename = get_filename_from_url(url)
    output_path = os.path.join(output_dir, f'{filename}.txt')
    with open(output_path, 'w', encoding='utf-8') as f:
      f.write(extracted_text)

    print(f'Downloaded {url} into {output_path}')


# List of URLs to download
urls_list_english = [
  'https://en.wikipedia.org/wiki/2022_Tour_de_France',
  'https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11',
  'https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21',
  'https://www.bbc.com/sport/cycling/61940037',
  'https://www.bbc.com/sport/cycling/62017114',
  'https://www.bbc.com/sport/cycling/62097721',
  'https://www.bbc.com/sport/cycling/62153759',
  'https://www.bbc.co.uk/sport/cycling/62285420',
]
urls_list_french = [
  'https://fr.wikipedia.org/wiki/Tour_de_France_2022',
  'https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-epoustouflant-jonas-vingegaard-remporte-la-11e-etape-et-s-empare-du-maillot-jaune-de-tadej-pogacar_5254102.html',
  'https://www.francetvinfo.fr/tour-de-france/tour-de-france-2022-jonas-vingegaard-vainqueur-de-sa-premiere-grande-boucle-jasper-philipsen-s-offre-au-sprint-la-21e-etape_5275612.html',
]

# Download the listed URLs
download_urls(urls_list_english, 'english')
download_urls(urls_list_french, 'french')



Downloaded https://en.wikipedia.org/wiki/2022_Tour_de_France into cycling_docs/english/2022_Tour_de_France.txt
Downloaded https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_1_to_Stage_11 into cycling_docs/english/2022_Tour_de_France,_Stage_1_to_Stage_11.txt
Downloaded https://en.wikipedia.org/wiki/2022_Tour_de_France,_Stage_12_to_Stage_21 into cycling_docs/english/2022_Tour_de_France,_Stage_12_to_Stage_21.txt
Downloaded https://www.bbc.com/sport/cycling/61940037 into cycling_docs/english/61940037.txt
Downloaded https://www.bbc.com/sport/cycling/62017114 into cycling_docs/english/62017114.txt
Downloaded https://www.bbc.com/sport/cycling/62097721 into cycling_docs/english/62097721.txt
Downloaded https://www.bbc.com/sport/cycling/62153759 into cycling_docs/english/62153759.txt
Downloaded https://www.bbc.co.uk/sport/cycling/62285420 into cycling_docs/english/62285420.txt
Downloaded https://fr.wikipedia.org/wiki/Tour_de_France_2022 into cycling_docs/french/Tour_de_France_2022.txt
Down

## Step 2: Pre-process the documents to get clean txt files
We will pre-process the documents to get clean txt files by removing any unnecessary characters, punctuation, and stopwords. We will use Python's [re](https://docs.python.org/3/library/re.html) and [nltk](https://www.nltk.org/) libraries for this purpose. We will save the results in a `clean_docs` folder.
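
To make the cleaning steps concrete, here is a small approximation that relies on `re` alone. It is a sketch under assumptions: the tiny `STOPWORDS` set stands in for NLTK's much larger lists, the function name `simple_clean` is hypothetical, and the regex tokenisation will not match `word_tokenize` exactly (for instance, hyphenated words are split rather than dropped).

```python
import re

# Tiny illustrative stopword set; NLTK's `stopwords.words("english")` is far larger
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "by"}


def simple_clean(text, stopwords=STOPWORDS):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    # `[^\W\d_]+` matches runs of Unicode letters (no digits, no underscores)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(token for token in tokens if token not in stopwords)


print(simple_clean("Pogacar extended his lead in the Tour de France by 4 seconds."))
```

Note how capitalisation, digits and punctuation disappear: this is fine for bag-of-words analyses but removes cues that later steps such as NER depend on.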

In [3]:
"""
Document class which holds all the necessary variables for the purpose of this
project.
"""
class Document:
  def __init__(self, text, language = None, raw_text = None, filepath = None):
    self.filepath = filepath    # Path to the document file
    self.language = language    # Language of the document
    self.raw_text = raw_text    # Original text before cleaning
    self.cleaned_text = text    # Text after cleaning (Step 2)
    self.spacy_entities = []    # List of spaCy entities (Step 3a)
    self.llm_entities = []      # List of LLM entities (Step 3b)
    self.resolved_text = None   # Text after resolving co-references (Step 4)
    self.coreferences = None    # CoreNLP coreferences object (Step 4)
    self.wiki_entities = {}     # Dictionary of Wikidata entities extracted with OpenTapioca (Step 5a)
    self.llm_wiki_entities = {} # Dictionary of Wikidata entities extracted with LLMs (Step 5b)
    self.relations = []         # List of OpenIE relations (Step 6a)
    self.llm_relations = []     # List of LLM relations (Step 6b)

In [4]:
import os
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download necessary NLTK data files
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

def clean_text(dirty_text, language):
    # Tokenize the text into words
    words = word_tokenize(dirty_text)
    
    # Convert to lower case
    words = [word.lower() for word in words]
    
    # Remove punctuation and non-alphabetic characters
    words = [re.sub(r'\W+', '', word) for word in words if word.isalpha()]
    
    # Remove stopwords
    stop_words = set(stopwords.words(language))
    cleaned_text = ' '.join([word for word in words if word not in stop_words])
    
    return cleaned_text

[nltk_data] Downloading package punkt to /Users/mymac/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/mymac/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/mymac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Define a function to process a file and write the result to a new file
def process_file(file, language):
  # Open the file in read-only mode and read its whole content. The lines
  # already end with a newline, so joining them with '\n' would double each one.
  with open(file, 'r', encoding='utf-8') as f:
    raw_text = f.read()

  # Clean the text using the `clean_text` function
  cleaned_text = clean_text(raw_text, language)

  # Create a new document and return it
  doc = Document(cleaned_text, language=language, raw_text=raw_text, filepath=os.path.abspath(file))
  return doc


# Create a list to store all our documents
docs = []

# Loop through all the files in the "cycling_docs" folder
folder = 'cycling_docs'
for language in os.listdir(folder):
  for filename in os.listdir(os.path.join(folder, language)):
    # Construct the full path to the file
    file = os.path.join(folder, language, filename)

    # Check if the file is a regular file and has a .txt extension
    if os.path.isfile(file) and file.endswith('.txt'):
      # Process the file and append the new Document to our list
      doc = process_file(file, language)
      docs.append(doc)

In [13]:
# Display the text of the first document
display(docs[0].cleaned_text)

'pogacar left pipped vingegaard extend lead danish rider seconds defending champion tadej pogacar beat jonas vingegaard thrilling finish stage seven extended overall lead tour de france pogacar surged past dane last metres claimed second consecutive stage victory lennard kamna looked set win pogacar vingegaard swept final section punishing climb la super planche des belles filles primoz roglic finished third kamna fourth britain geraint thomas tour conceded seconds finished alongside german take heart well positioned throughout concluding climb culminated gradient gravel stretch line ineos grenadiers rider moves third general classification one minute seconds behind pogacar seconds behind vingegaard despair kamna delight pogacar kamna sole survivor breakaway stage tomblaine appeared capable holding famous victory gc race behind ignited final kilometre rider caught final bend pogacar attacked saw vingegaard accelerate past pogacar response fighting back superbly kick line underlined ins

## Step 3: Run named entity recognition (NER) on the documents

The goal of this step is to extract named entities from the text of our documents. We will attempt to use two methods: spaCy, and LLMs.

### Step 3a: Using spaCy

We will use [spaCy](https://spacy.io)'s pre-trained models to perform NER on the documents and extract the entities of type PER/ORG/LOC. The extracted entities will be saved in a file.

**⚠️ Important Note:** We must use the raw text files (before the cleaning step in Step 2) for NER to ensure we do not lose context or critical information needed for accurate entity recognition.
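
Because the corpus is bilingual, one option is to pick the spaCy pipeline according to the document language instead of using a single English model. The sketch below only shows the lookup; the helper name `spacy_model_for` is hypothetical, and each model has to be downloaded before `spacy.load` can use it. Note that spaCy's French pipelines use the labels `PER` and `LOC` rather than `PERSON` and `GPE`, so the label filter would need adjusting too.

```python
# Model names follow spaCy's published small pipelines; they must be downloaded
# first, e.g. with `python -m spacy download fr_core_news_sm`
SPACY_MODELS = {
    "english": "en_core_web_sm",
    "french": "fr_core_news_sm",
}


def spacy_model_for(language):
    """Return the spaCy pipeline name for a document language."""
    try:
        return SPACY_MODELS[language]
    except KeyError:
        raise ValueError(f"No spaCy model configured for language: {language!r}")


for language in ("english", "french"):
    print(language, "->", spacy_model_for(language))
```

In the notebook, the returned name could be passed to `spacy.load`, ideally caching one loaded pipeline per language so that models are not reloaded for every document.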

In [14]:
# 📝 TODO: Import spaCy and other libraries that might be required for entity
#          extraction
import spacy
spacy.cli.download("en_core_web_sm")

nlp = spacy.load('en_core_web_sm')

def extract_entities(text, language):
  # 📝 TODO: Use spaCy to extract named entities and store them into a list.
  # The format of the end result should look like this:
  # ```
  # entities = [
  #   { "text": "Tour de France", "label": "ORG" },
  #   { "text": "Peter Sagan", "label": "PERSON" },
  # ]
  # ```

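  # Note: only the English model is loaded above, so `language` is not used
  # here and French documents are processed with the English pipeline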
  doc = nlp(text)

  entities = [
    { "text": ent.text, "label": ent.label_ }
    for ent in doc.ents
    if ent.label_ in ['PERSON', 'ORG', 'GPE']
  ]

  # Return extracted entities
  return entities
  #

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 4.4 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [15]:
# Extract entities for each document
for doc in docs:
  doc.spacy_entities = extract_entities(doc.raw_text, doc.language)

Display entities which have been extracted:

In [23]:
# 📝 TODO: Display the extracted entities for the first document
print(docs[0].spacy_entities)

[{'text': 'Pogacar', 'label': 'PERSON'}, {'text': 'Vingegaard', 'label': 'ORG'}, {'text': 'Jonas Vingegaard', 'label': 'PERSON'}, {'text': 'the Tour de France', 'label': 'ORG'}, {'text': 'Pogacar', 'label': 'PERSON'}, {'text': 'Lennard Kamna', 'label': 'PERSON'}, {'text': 'Pogacar', 'label': 'PERSON'}, {'text': 'Vingegaard', 'label': 'ORG'}, {'text': 'La Super Planche des Belles Filles', 'label': 'ORG'}, {'text': 'Primoz Roglic', 'label': 'PERSON'}, {'text': 'Kamna', 'label': 'PERSON'}, {'text': 'Britain', 'label': 'GPE'}, {'text': 'Geraint Thomas', 'label': 'PERSON'}, {'text': 'Pogacar', 'label': 'PERSON'}, {'text': "Jumbo-Visma's Vingegaard", 'label': 'ORG'}, {'text': 'Kamna', 'label': 'PERSON'}, {'text': 'Pogacar', 'label': 'PERSON'}, {'text': 'Kamna', 'label': 'PERSON'}, {'text': 'GC', 'label': 'GPE'}, {'text': 'Pogacar', 'label': 'PERSON'}, {'text': 'Vingegaard', 'label': 'ORG'}, {'text': 'Pogacar', 'label': 'PERSON'}, {'text': 'Jonas [Vingegaard', 'label': 'ORG'}, {'text': 'UAE T

### Step 3b: Using LLMs

**⚠️ Important Note:** We must use the raw text files (before the cleaning step in Step 2) for NER to ensure we do not lose context or critical information needed for accurate entity recognition.

First we create a function `llm_generate` which calls an [Ollama](https://ollama.com/) server hosted at EURECOM. For the purpose of this exercise, you are limited to using a specific model (`mistral-nemo:12b-instruct-2407-fp16`). The full documentation for the API is available at: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion

In [74]:
import json
import requests

# Define a function to call the Ollama WebSem endpoint using a given payload and return the response.
def llm_generate(payload):
  # Define the API endpoint and payload
  url = "https://websem:eurecom@ollama-websem.tools.eurecom.fr/api/generate"
  payload["model"] = "mistral-nemo:12b-instruct-2407-fp16"
  payload["stream"] = False  # The Ollama API expects a JSON boolean here

  # Define the headers
  headers = {
    "Content-Type": "application/json"
  }

  # Send the POST request
  response = requests.post(url, headers=headers, data=json.dumps(payload))

  # Check if the request was successful
  if response.status_code == 200:
    # Parse the JSON response
    response_data = response.json()

    # Return the structured entities
    return response_data
  else:
    # Handle errors
    print(f"Error {response.status_code}: {response.text}")
    return None

Now let's learn how to use it:

In [75]:
# Example with basic prompt
payload = {
  "prompt": """
  Compose a haiku about web semantics, emphasizing the interconnectedness of
  data, the elegance of structured knowledge, and the power of understanding
  through linked information.
  """
}
llm_response = llm_generate(payload)
print(llm_response["response"])

Intricate web weaves,
Data threads interlocked,
Knowledge blooms revealed.


You can even use structured outputs. More information is available at:
* https://ollama.com/blog/structured-outputs
* https://github.com/ollama/ollama/blob/main/docs/api.md#request-structured-outputs

In [76]:
# Example with structured outputs
payload = {
    "prompt": """
    Solve the following math problem and explain the steps clearly:

    Problem:
    What is 2 + 2?

    Provide the solution and explanation in the following structure:
    """,
    "format": {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "type": "object",
        "properties": {
            "solution": {"type": "integer"},
            "explanation": {"type": "string"}
        },
        "required": ["solution", "explanation"]
    }
}
llm_response = llm_generate(payload)
print(llm_response["response"])

{
"solution": 4,
"explanation": "To solve this problem, we simply add the two numbers together:\n\n2 + 2 = 4"
}


In [27]:
def extract_entities_with_llm(text):
    # Define the custom prompt for extracting named entities
    prompt = f"""
    Extract named entities from the following text and format them as a JSON array.
    Each entity should be an object with "text" and "label" fields.
    Text: {text}
    """

    # Define the payload with the custom prompt and JSON schema for structured output
    payload = {
        "prompt": prompt,
        "format": {
                "$schema": "http://json-schema.org/draft-07/schema#",
                "type": "array",
                "items": {
                        "type": "object",
                        "properties": {
                                "text": {"type": "string"},
                                "label": {"type": "string"}
                        },
                        "required": ["text", "label"]
                }
        }
    }

    # Generate the entities using the LLM
    llm_output = llm_generate(payload)

    # Parse the LLM output to extract entities
    try:
        entities = json.loads(llm_output["response"])
    except (TypeError, json.JSONDecodeError):
        # Handle a failed request (`llm_output` is None) or invalid JSON
        entities = []

    return entities

Now let's create a function which can extract entities using `llm_generate` with a custom prompt. There are many ways to achieve this. For example, you could try providing examples, and/or using a JSON schema.
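
As an illustration of the "providing examples" option, the sketch below assembles a few-shot prompt as a plain string. The example sentence, the label set and the helper name `build_ner_prompt` are assumptions chosen for illustration; the resulting string would be sent as `{"prompt": prompt}` through `llm_generate`.

```python
import json

# Hypothetical few-shot examples; choose ones that resemble your documents
FEW_SHOT_EXAMPLES = [
    (
        "Peter Sagan won the stage for Bora-Hansgrohe in Paris.",
        [
            {"text": "Peter Sagan", "label": "PERSON"},
            {"text": "Bora-Hansgrohe", "label": "ORG"},
            {"text": "Paris", "label": "LOC"},
        ],
    ),
]


def build_ner_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Build a few-shot NER prompt that asks for a JSON array of entities."""
    parts = [
        "Extract the named entities of type PERSON, ORG and LOC from the text.",
        'Answer with a JSON array of objects with "text" and "label" fields.',
        "",
    ]
    # Each example shows the input text followed by the expected JSON answer
    for example_text, example_entities in examples:
        parts.append(f"Text: {example_text}")
        parts.append(f"Entities: {json.dumps(example_entities, ensure_ascii=False)}")
        parts.append("")
    # The prompt ends where the model is expected to continue
    parts.append(f"Text: {text}")
    parts.append("Entities:")
    return "\n".join(parts)


prompt = build_ner_prompt("Tadej Pogacar beat Jonas Vingegaard on stage seven.")
print(prompt)
```

Few-shot examples and a JSON schema are complementary: the examples steer which labels the model uses, while the schema constrains the shape of the answer.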

In [66]:
def extract_entities_with_llm(text):
  # 📝 TODO: Use LLM to extract named entities and store them into a list.
  # The format of the end result should look like this:
  # ```
  # entities = [
  #   { "text": "Tour de France", "label": "ORG" },
  #   { "text": "Peter Sagan", "label": "PERSON" },
  # ]
  # ```

    # Define the custom prompt
    prompt = f"""
    Extract named entities from the following text and format them as a JSON array.
    Each entity should be an object with "text" and "label" fields.
    Text: {text}
    """

    # Generate the entities using the LLM (`llm_generate` expects a payload dict)
    llm_output = llm_generate({"prompt": prompt})

    # Parse the LLM output to extract entities
    try:
        entities = json.loads(llm_output["response"])
    except (TypeError, json.JSONDecodeError):
        # Handle a failed request (`llm_output` is None) or invalid JSON
        entities = []

    return entities

In [28]:
# Extract entities for each document
for doc in docs:
  doc.llm_entities = extract_entities_with_llm(doc.raw_text)

Display entities which have been extracted:

In [35]:
# Display entities for the first document
display(docs[0].llm_entities)

[{'text': 'Tadej Pogacar', 'label': 'PERSON'},
 {'text': 'Jonas Vingegaard', 'label': 'PERSON'},
 {'text': 'Primoz Roglic', 'label': 'PERSON'},
 {'text': 'Lennard Kamna', 'label': 'PERSON'},
 {'text': 'Geraint Thomas', 'label': 'PERSON'},
 {'text': 'Adam Yates', 'label': 'PERSON'},
 {'text': 'Tom Pidcock', 'label': 'PERSON'},
 {'text': 'Aleksandr Vlasov', 'label': 'PERSON'},
 {'text': 'Neilson Powless', 'label': 'PERSON'}]

## Step 4: Run co-reference resolution on the input text
We will use CoreNLP to perform [co-reference resolution](https://en.wikipedia.org/wiki/Coreference) on the input text, that is, to link pronouns and other mentions to the entity they refer to.

For this project, we will use a hosted version of CoreNLP at: https://corenlp.tools.eurecom.fr/ (username: `websem`, password: `eurecom`). Feel free to try out the web interface before writing the code.

First, we compute the annotations and store them into the `coreferences` variable of our Document:

In [36]:
import json
from pycorenlp import StanfordCoreNLP


# Set up the CoreNLP client
nlp = StanfordCoreNLP('https://websem:eurecom@corenlp.tools.eurecom.fr')

# Define a function which computes coreferences for a given text and language
def compute_coreferences(text, language):
  props = {
    'timeout': 300000,
    'annotators': 'tokenize,ssplit,coref',
    'pipelineLanguage': language[:2],
    'outputFormat': 'json'
  }

  # Annotate the text for co-reference resolution
  corenlp_output = nlp.annotate(text, properties=props)
  try:
    corenlp_output = json.loads(corenlp_output)
  except Exception as err:
    print(f'Unexpected response: {corenlp_output}')
    raise

  return corenlp_output

In [37]:
# Test co-references computation
example = compute_coreferences("John is a software engineer. He is very talented. Sarah is a designer. She works with him.", language="en")

# Pretty-print them
print(json.dumps(example, indent=2))

{
  "sentences": [
    {
      "index": 0,
      "basicDependencies": [
        {
          "dep": "ROOT",
          "governor": 0,
          "governorGloss": "ROOT",
          "dependent": 5,
          "dependentGloss": "engineer"
        },
        {
          "dep": "nsubj",
          "governor": 5,
          "governorGloss": "engineer",
          "dependent": 1,
          "dependentGloss": "John"
        },
        {
          "dep": "cop",
          "governor": 5,
          "governorGloss": "engineer",
          "dependent": 2,
          "dependentGloss": "is"
        },
        {
          "dep": "det",
          "governor": 5,
          "governorGloss": "engineer",
          "dependent": 3,
          "dependentGloss": "a"
        },
        {
          "dep": "compound",
          "governor": 5,
          "governorGloss": "engineer",
          "dependent": 4,
          "dependentGloss": "software"
        },
        {
          "dep": "punct",
          "governor": 5,
          

In [38]:
# Compute co-references for all documents
for doc in docs:
  if doc.language == "english":  # CoreNLP co-reference resolution only supports English
    doc.coreferences = compute_coreferences(doc.raw_text, doc.language)

The first step is to display the co-reference of each mention in the text.

For example:

> "He" -> "John"
>
> "She" -> "Sarah"
>
> "him" -> "John"

In [39]:
for coref_cluster in example['corefs'].values():
  # 📝 TODO: Print each co-references like so: "He" -> "John"
  # 💡 Each cluster has one representative mention, flagged with `isRepresentativeMention: True`
  for mention in coref_cluster:
    if not mention['isRepresentativeMention']:
      # Find the representative mention
      representative_mention = next(m for m in coref_cluster if m['isRepresentativeMention'])
      print(f'"{mention["text"]}" -> "{representative_mention["text"]}"')

"She" -> "Sarah"
"He" -> "John"
"him" -> "John"


### 🏆 Challenge

Replace values within the text with their resolved co-reference. For example, with the following text:

> **John** is a software engineer. **He** is very talented.

In the second sentence, the pronoun "He" would be replaced with its co-reference, and the final text would become:

> **John** is a software engineer. **John** is very talented.

In [40]:
def resolve_coreferences(corenlp_output):
  # 📝 TODO: Replace values within the text with their resolved co-reference.
  # 💡 You can start by printing the `corenlp_output` object to understand its
  #    structure.
  sentences = corenlp_output['sentences']
  corefs = corenlp_output['corefs']

  # Create a list to hold the resolved sentences
  resolved_sentences = []

  # Iterate through each sentence
  for sentence in sentences:
    tokens = sentence['tokens']
    resolved_sentence = []

    # Iterate through each token in the sentence
    for token in tokens:
      token_text = token['originalText']
      token_index = token['index']

      # Check if the token is part of a coreference
      for coref_cluster in corefs.values():
        for mention in coref_cluster:
          if mention['sentNum'] == sentence['index'] + 1 and mention['startIndex'] <= token_index < mention['endIndex']:
            if not mention['isRepresentativeMention']:
              # Find the representative mention
              representative_mention = next(m for m in coref_cluster if m['isRepresentativeMention'])
              token_text = representative_mention['text']
            break

      resolved_sentence.append(token_text)

    resolved_sentences.append(' '.join(resolved_sentence))

  # Join the resolved sentences to form the final resolved text
  resolved_text = ' '.join(resolved_sentences)

  return resolved_text

In [41]:
# Test resolving co-references
original_text = "John is a software engineer. He is very talented. Sarah is a designer. She works with him."
corefs = compute_coreferences(original_text, language="en")
resolved_text = resolve_coreferences(corefs)
print(original_text)
print(resolved_text)

John is a software engineer. He is very talented. Sarah is a designer. She works with him.
John is a software engineer . John is very talented . Sarah is a designer . Sarah works with John .


In [42]:
# Resolve co-references for all documents
for doc in docs:
  if doc.coreferences is not None:
    doc.resolved_text = resolve_coreferences(doc.coreferences)

In [48]:
# 📝 TODO: Display text with resolved co-references for any document of your choice
docs[4].resolved_text

'Denmark \'s Jonas Vingegaard secured Denmark \'s Jonas Vingegaard first Tour de France victory as Denmark \'s Jonas Vingegaard Denmark \'s Jonas Vingegaard Denmark \'s Jonas Vingegaard Denmark \'s Jonas Vingegaard won the sprint on the final stage in Paris . Denmark \'s Jonas Vingegaard was an easy winner on the Champs-Elysees while Denmark \'s Jonas Vingegaard Denmark \'s Jonas Vingegaard Denmark \'s Jonas Vingegaard , finished alongside Denmark \'s Jonas Vingegaard Jumbo-Visma team-mates after three weeks of racing . Denmark \'s Jonas Vingegaard beat last year \'s champion Tadej Pogacar by two minutes 43 seconds in the general classification . the best team award Welshman Thomas , the 2018 champion and 2019 runner-up the best team award Welshman Thomas , the 2018 champion and 2019 runner-up the best team award Welshman Thomas , the 2018 champion and 2019 runner-up the best team award Welshman Thomas , the 2018 champion and 2019 runner-up the best team award Welshman Thomas , the 201
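
The repeated phrases in the output above appear to arise because a multi-token mention is replaced once per token. A possible variant, sketched below, substitutes the representative text once per mention span. It assumes the CoreNLP JSON fields used above (`sentNum`, `startIndex`, `endIndex`, `isRepresentativeMention`, and the token fields `index` and `originalText`); the function name and the hand-written sample input are illustrative only.

```python
def resolve_coreferences_by_span(corenlp_output):
    """Replace each non-representative mention with its representative text, once."""
    # Map (sentence number, start token index) -> (end index, replacement text)
    replacements = {}
    for cluster in corenlp_output["corefs"].values():
        representative = next(m for m in cluster if m["isRepresentativeMention"])
        for mention in cluster:
            if not mention["isRepresentativeMention"]:
                key = (mention["sentNum"], mention["startIndex"])
                replacements.setdefault(key, (mention["endIndex"], representative["text"]))

    resolved_sentences = []
    for sentence in corenlp_output["sentences"]:
        resolved_tokens = []
        skip_until = 0  # tokens with index < skip_until were already replaced
        for token in sentence["tokens"]:
            if token["index"] < skip_until:
                continue
            key = (sentence["index"] + 1, token["index"])
            if key in replacements:
                # Emit the representative text once for the whole mention span
                skip_until, text = replacements[key]
                resolved_tokens.append(text)
            else:
                resolved_tokens.append(token["originalText"])
        resolved_sentences.append(" ".join(resolved_tokens))
    return " ".join(resolved_sentences)


def _tokens(words):
    # Helper for the hand-written sample: CoreNLP token indices start at 1
    return [{"index": i, "originalText": w} for i, w in enumerate(words, start=1)]


sample = {
    "sentences": [
        {"index": 0, "tokens": _tokens(["Jonas", "Vingegaard", "won", "."])},
        {"index": 1, "tokens": _tokens(["The", "Danish", "rider", "celebrated", "."])},
    ],
    "corefs": {
        "1": [
            {"text": "Jonas Vingegaard", "sentNum": 1, "startIndex": 1,
             "endIndex": 3, "isRepresentativeMention": True},
            {"text": "The Danish rider", "sentNum": 2, "startIndex": 1,
             "endIndex": 4, "isRepresentativeMention": False},
        ]
    },
}
resolved = resolve_coreferences_by_span(sample)
print(resolved)
```

When mentions are nested, this sketch keeps the first mention registered for a given start position; other policies (for example preferring the longest span) are equally defensible.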

## Step 5: Disambiguate the entities

We will reuse the `llm_generate` function from Step 3, but with a different prompt, in order to disambiguate the entities against Wikidata and DBpedia.

In [119]:
def disambiguate_with_llm(text):
  # 📝 TODO: Use LLM to disambiguate entities according to Wikidata
  # Define the custom prompt for disambiguating entities with Wikidata
  prompt_wikidata = f"""
  Disambiguate the entities mentioned in the following text according to Wikidata and format them as a JSON array.
  Each entity should be an object with "text" and "wikidata_id" fields.
  Provide the disambiguated entities in the following format:
    [
      {{ "text": "Tour de France", "wikidata_id": "Q12345" }},
      {{ "text": "Peter Sagan", "wikidata_id": "Q67890" }}
    ]
  Text: {text}

  """

  # Generate the entities using the LLM for Wikidata
  llm_output_wikidata = llm_generate({"prompt": prompt_wikidata})

  # Parse the LLM output to extract Wikidata entities
  try:
    wikidata_entities = json.loads(llm_output_wikidata["response"])
  except json.JSONDecodeError:
     # Handle the case where the LLM output is not valid JSON
     wikidata_entities = []

  # Define the custom prompt for disambiguating entities with DBpedia
  prompt_dbpedia = f"""
  Disambiguate the entities mentioned in the following text according to DBpedia and format them as a JSON array.
  Each entity should be an object with "text" and "dbpedia_uri" fields.
  Provide the disambiguated entities in the following format:
    [
      {{ "text": "Tour de France", "dbpedia_uri": "http://dbpedia.org/resource/Tour_de_France" }},
      {{ "text": "Peter Sagan", "dbpedia_uri": "http://dbpedia.org/resource/Peter_Sagan" }}
    ]
  Text: {text}
  """

  
  # 📝 TODO: Use LLM to disambiguate entities according to DBpedia
  # Generate the entities using the LLM for DBpedia
  llm_output_dbpedia = llm_generate({"prompt": prompt_dbpedia})

  # Parse the LLM output to extract DBpedia entities
  try:
    dbpedia_entities = json.loads(llm_output_dbpedia["response"])
  except json.JSONDecodeError:
    # Handle the case where the LLM output is not valid JSON
    dbpedia_entities = []

  # Combine the disambiguated entities
  disambiguated_entities = {
    "wikidata": wikidata_entities,
    "dbpedia": dbpedia_entities
  }

  return disambiguated_entities

In [120]:
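for doc in docs:
  # Accumulate results across chunks; `|=` would overwrite the "wikidata" and
  # "dbpedia" lists with those of the last chunk only
  doc.wiki_entities = {"wikidata": [], "dbpedia": []}
  for j in range(0, len(doc.raw_text), 4000):
    chunk_entities = disambiguate_with_llm(doc.raw_text[j:j+4000])
    for source in doc.wiki_entities:
      doc.wiki_entities[source].extend(chunk_entities.get(source, []))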

Display the entities disambiguated according to DBpedia and Wikidata

In [135]:
# 📝 TODO: Display the disambiguated entities (Wikidata and DBpedia) for one document
docs[1].wiki_entities

{'wikidata': [{'text': 'Tour de France', 'wikidata_id': 'Q12345'},
  {'text': 'Tadej Pogačar', 'wikidata_id': 'Q67890'},
  {'text': 'Jonas Vingegaard', 'wikidata_id': 'Q102345'},
  {'text': 'Wout van Aert', 'wikidata_id': 'Q26789'},
  {'text': 'Christophe Laporte', 'wikidata_id': 'Q15230213'}],
 'dbpedia': [{'text': 'Tour de France',
   'dbpedia_uri': 'http://dbpedia.org/resource/Tour_de_France'},
  {'text': 'Tadej Pogačar',
   'dbpedia_uri': 'http://dbpedia.org/resource/Tadej_Poga%C4%8Dar'},
  {'text': 'Jonas Vingegaard',
   'dbpedia_uri': 'http://dbpedia.org/resource/Jonas_Vingegaard'},
  {'text': 'Wout van Aert',
   'dbpedia_uri': 'http://dbpedia.org/resource/Wout_van_Aert'},
  {'text': 'Geraint Thomas',
   'dbpedia_uri': 'http://dbpedia.org/resource/Geraint_Thomas_(cyclist)'}]}

In [128]:
# 📝 TODO: Display extracted DBpedia entities for the same document
docs[1].wiki_entities['dbpedia']

[{'text': 'Tour de France',
  'dbpedia_uri': 'http://dbpedia.org/resource/Tour_de_France'},
 {'text': 'Tadej Pogačar',
  'dbpedia_uri': 'http://dbpedia.org/resource/Tadej_Poga%C4%8Dar'},
 {'text': 'Jonas Vingegaard',
  'dbpedia_uri': 'http://dbpedia.org/resource/Jonas_Vingegaard'},
 {'text': 'Wout van Aert',
  'dbpedia_uri': 'http://dbpedia.org/resource/Wout_van_Aert'},
 {'text': 'Geraint Thomas',
  'dbpedia_uri': 'http://dbpedia.org/resource/Geraint_Thomas_(cyclist)'}]
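
Identifiers produced by an LLM are not guaranteed to exist in Wikidata, so it is worth checking them before they reach the graph. The sketch below is only a syntactic filter, assuming a hypothetical helper `keep_well_formed_qids`: it keeps entries whose `wikidata_id` has the shape of a Wikidata item ID (`Q` followed by a positive integer). It does not confirm that the item exists or that it matches the entity text.

```python
import re

# Wikidata item IDs are "Q" followed by a positive integer without leading zeros
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def keep_well_formed_qids(entities):
    """Return only the entries whose 'wikidata_id' looks like a Wikidata item ID."""
    kept = []
    for entity in entities:
        qid = entity.get("wikidata_id") if isinstance(entity, dict) else None
        if isinstance(qid, str) and QID_PATTERN.fullmatch(qid):
            kept.append(entity)
    return kept


# Example with one well-formed and two malformed entries
sample = [
    {"text": "Tour de France", "wikidata_id": "Q12345"},
    {"text": "Some rider", "wikidata_id": "12345"},
    {"text": "No identifier"},
]
print(keep_well_formed_qids(sample))
```

A stricter check would look each ID up on Wikidata and compare its label with the entity text.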

## Step 6: Run relation extraction

### Step 6a: Using OpenIE

We will use [Stanford OpenIE](https://nlp.stanford.edu/software/openie.html) to extract the relations between the entities in the input text.

In [130]:
import json
from pycorenlp import StanfordCoreNLP

# Create a StanfordCoreNLP object
nlp = StanfordCoreNLP('https://websem:eurecom@corenlp.tools.eurecom.fr')

# Define a function to extract relations from input text using Stanford OpenIE
def extract_relations_with_openie(input_text, language):
  output = nlp.annotate(input_text, properties={
    'timeout': 300000,
    'annotators': 'tokenize,ssplit,openie',
    'outputFormat': 'json',
    'pipelineLanguage': language[:2]
  })
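  # `annotate` may return an already-parsed dict or the raw response string,
  # depending on the pycorenlp and Python versions, so only parse strings.
  if isinstance(output, str):
    try:
      output = json.loads(output)
    except json.JSONDecodeError:
      print(f'Unexpected response: {output}')
      raise

  # 📝 TODO: Get relations from the `output` object (subject, relation, object)
  #    and append them to an `extracted_relations` list.
  # 💡 You can start by printing the `output` object to understand its structure.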
  extracted_relations = []
  for sentence in output.get('sentences', []):
    for openie in sentence.get('openie', []):
      relation = {
        'subject': openie.get('subject'),
        'relation': openie.get('relation'),
        'object': openie.get('object')
      }
      extracted_relations.append(relation)

  # Return relations
  return extracted_relations

In [131]:
for doc in docs:
  if doc.language == "english":  # CoreNLP OpenIE only supports English
    doc.relations = extract_relations_with_openie(doc.raw_text, doc.language)
  else:
    doc.relations = []  # keep the attribute defined for the later steps

Display relations which have been extracted:

In [170]:
# 📝 TODO: Display extracted relations for one English document (here the third one, index 2)
docs[2].relations

[{'subject': 'Jonas Vingegaard',
  'relation': 'took lead',
  'object': 'he burst'},
 {'subject': 'clear',
  'relation': 'launched',
  'object': 'attack on final climb'},
 {'subject': 'clear', 'relation': 'attack on', 'object': 'final climb'},
 {'subject': 'clear',
  'relation': 'attack on',
  'object': 'final climb of stage 11'},
 {'subject': 'Jonas Vingegaard', 'relation': 'took', 'object': 'lead'},
 {'subject': 'clear',
  'relation': 'launched',
  'object': 'stunning attack on climb of stage 11'},
 {'subject': 'clear', 'relation': 'launched', 'object': 'attack'},
 {'subject': 'clear', 'relation': 'launched attack', 'object': 'claim'},
 {'subject': 'clear', 'relation': 'attack on', 'object': 'climb of stage 11'},
 {'subject': 'clear',
  'relation': 'launched',
  'object': 'attack on final climb of stage 11'},
 {'subject': 'clear', 'relation': 'launched', 'object': 'stunning attack'},
 {'subject': 'clear',
  'relation': 'launched',
  'object': 'stunning attack on final climb'},
 {'sub
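
OpenIE tends to emit many overlapping variants of the same fact, as the output above shows. One simple heuristic is to keep, for each (subject, relation) pair, only the triple with the longest object. The sketch below assumes a hypothetical helper `keep_longest_objects`; it is one possible pruning rule among others and may discard useful shorter triples.

```python
def keep_longest_objects(relations):
    """For each (subject, relation) pair, keep only the triple with the longest object."""
    best = {}
    for triple in relations:
        key = (triple["subject"], triple["relation"])
        if key not in best or len(triple["object"]) > len(best[key]["object"]):
            best[key] = triple
    # dict preserves insertion order, so triples come back in first-seen order
    return list(best.values())


# Example on a few triples shaped like the OpenIE output
sample = [
    {"subject": "Jonas Vingegaard", "relation": "took", "object": "lead"},
    {"subject": "clear", "relation": "launched", "object": "attack"},
    {"subject": "clear", "relation": "launched", "object": "stunning attack"},
]
print(keep_longest_objects(sample))
```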

### Step 6b: Using LLMs

As an alternative to OpenIE, we will use LLMs to do the same task and compare the results.

In [162]:
def extract_relations_with_llm(text):
    # Define the custom prompt for extracting relations
    prompt = f"""
    Extract relations from the following text and format them as a JSON array.
    Each relation should be an object with "subject", "relation", and "object" fields.
    Example: [{{"subject": "Jonas Vingegaard", "relation": "was crowned Tour France champion for", "object": "time after 109th edition of race ended"}}, {{"subject": "Jonas Vingegaard", "relation": "was crowned Tour France champion for", "object": "first time after edition of race ended on Sunday"}}]
    Text: {text}
    """

    # Generate the relations using the LLM
    llm_output = llm_generate({"prompt": prompt})

    # Parse the LLM output to extract relations
    try:
        extracted_relations = json.loads(llm_output["response"])
    except json.JSONDecodeError:
        # Handle the case where the LLM output is not valid JSON
        extracted_relations = []

    return extracted_relations
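
LLMs often wrap their JSON in explanatory text or a Markdown code fence, in which case `json.loads` fails on the raw response and the function above returns an empty list. The following is a minimal sketch of a more tolerant parser, assuming a hypothetical helper `parse_json_array` that extracts the outermost `[...]` span before parsing:

```python
import json


def parse_json_array(response: str) -> list:
    """Best-effort extraction of a JSON array from free-form LLM output."""
    start = response.find("[")
    end = response.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        return json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        # The bracketed span is still not valid JSON
        return []


# Example: the array is preceded by a sentence the model added on its own
raw = 'Here are the relations: [{"subject": "A", "relation": "beat", "object": "B"}]'
print(parse_json_array(raw))
```

This would replace the bare `json.loads(llm_output["response"])` call; it still returns an empty list when the model produces malformed JSON.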

In [163]:
for doc in docs:
  doc.llm_relations = extract_relations_with_llm(doc.raw_text)

Display relations which have been extracted:

In [169]:
# Display the extracted relations for the third document (index 2)
display(docs[2].llm_relations)

[{'subject': 'Jonas Vingegaard',
  'relation': 'launched a stunning attack on the final climb of stage 11 to claim the Tour de France lead from',
  'object': 'defending champion Tadej Pogacar'},
 {'subject': 'Tadej Pogacar',
  'relation': 'lost the Tour de France lead to',
  'object': 'Jonas Vingegaard on Col du Granon'},
 {'subject': 'Jonas Vingegaard',
  'relation': 'claimed his first Tour stage win atop',
  'object': 'Col du Granon'},
 {'subject': 'Jonas Vingegaard',
  'relation': 'now leads the Tour de France by more than two minutes ahead of',
  'object': 'Tadej Pogacar'},
 {'subject': 'Romain Bardet',
  'relation': 'now sits second in the general classification behind',
  'object': 'Jonas Vingegaard'},
 {'subject': 'Geraint Thomas',
  'relation': 'is fourth in the general classification, just four seconds behind',
  'object': 'Tadej Pogacar'},
 {'subject': 'Tadej Pogacar',
  'relation': 'dropped to third place in the general classification behind',
  'object': 'Romain Bardet and 

## Step 7: Implement some mappings between the entity types and relations returned with a given cycling ontology
We will implement mappings between the entity types and relations returned with the cycling ontology available at https://nextcloud.eurecom.fr/s/yKaMDEnRoSqjNAL.

In [177]:
import rdflib
from rdflib import Graph, Literal, RDF, URIRef, Namespace
import urllib.parse

# Create an RDF graph
g = Graph()

# Define namespaces
CYCLING = Namespace("https://purl.org/websem/cycling#")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
XSD = Namespace("http://www.w3.org/2001/XMLSchema#")

# Bind namespaces
g.bind("cycling", CYCLING)
g.bind("foaf", FOAF)
g.bind("xsd", XSD)

# Function to create URI for an entity
def create_entity_uri(entity):
    return URIRef(f"https://purl.org/websem/cycling/{urllib.parse.quote(entity['text'].replace(' ', '_'))}")

# Add entities to the graph
for doc in docs:
    for entity in doc.spacy_entities:
        entity_uri = create_entity_uri(entity)
        g.add((entity_uri, RDF.type, CYCLING.Person if entity['label'] == 'PERSON' else CYCLING.Organization))
        g.add((entity_uri, FOAF.name, Literal(entity['text'], datatype=XSD.string)))

    for ent in doc.llm_entities:
        entity_uri = create_entity_uri(ent)
        g.add((entity_uri, RDF.type, CYCLING.Person if ent['label'] == 'PERSON' else CYCLING.Organization))
        g.add((entity_uri, FOAF.name, Literal(ent['text'], datatype=XSD.string)))

    for entity in doc.wiki_entities.get('wikidata', []):
        entity_uri = create_entity_uri(entity)
        if 'label' in entity:
            g.add((entity_uri, RDF.type, CYCLING.Person if entity['label'] == 'PERSON' else CYCLING.Organization))
            g.add((entity_uri, FOAF.name, Literal(entity['text'], datatype=XSD.string)))
        # Add the Wikidata ID to the graph when one was returned
        if 'wikidata_id' in entity:
            g.add((entity_uri, CYCLING.wikidata, Literal(entity['wikidata_id'], datatype=XSD.string)))

# Load the cycling ontology once, outside the entity loops
g.parse("cycling.owl", format="xml")

# Add relations to the graph
for doc in docs:
    # Non-English documents may have no OpenIE relations, so fall back to an empty list
    for relations in (getattr(doc, 'relations', []), getattr(doc, 'llm_relations', [])):
        for relation in relations:
            # Skip entries that are not dicts or that lack a subject, relation or object
            if not isinstance(relation, dict):
                continue
            if not all(relation.get(key) for key in ('subject', 'relation', 'object')):
                continue
            subject_uri = create_entity_uri({'text': relation['subject']})
            object_uri = create_entity_uri({'text': relation['object']})
            # Percent-encode the relation so that punctuation cannot produce an invalid URI
            predicate = CYCLING[urllib.parse.quote(relation['relation'].replace(' ', '_'))]
            g.add((subject_uri, predicate, object_uri))
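
The entity loops above type every entity whose label is not `PERSON` as `cycling:Organization`, which also covers locations. A lookup table makes the mapping explicit and lets unmapped labels be skipped. The sketch below uses plain strings so that it runs without rdflib. The class names `Person`, `Organization` and `Location` and the helper `class_iri_for_label` are assumptions to be adapted to the classes actually defined in your cycling ontology.

```python
CYCLING_NS = "https://purl.org/websem/cycling#"

# Assumed mapping from spaCy NER labels to ontology class names (adapt to your ontology)
LABEL_TO_CLASS = {
    "PERSON": "Person",
    "ORG": "Organization",
    "GPE": "Location",
    "LOC": "Location",
}


def class_iri_for_label(label):
    """Return the class IRI for a NER label, or None when the label is not mapped."""
    class_name = LABEL_TO_CLASS.get(label)
    return CYCLING_NS + class_name if class_name else None


# Example: mapped labels yield an IRI, unmapped ones yield None
print(class_iri_for_label("GPE"))
print(class_iri_for_label("DATE"))
```

In the notebook, the returned string could be wrapped in `URIRef(...)` and the `RDF.type` triple added only when the result is not `None`.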

In [178]:
# Save the result into a Turtle file
g.serialize(destination='output.ttl', format='turtle')

<Graph identifier=Nbf93cb36bc204f028e7990e20ca3c044 (<class 'rdflib.graph.Graph'>)>

## Step 8: Load the data in the Corese engine with the ontology and write the SPARQL queries to retrieve specific information from the KG
We will load the data into the [Corese](http://wimmics.inria.fr/doc/tutorial/corese-3.2.3c.jar) engine (the same one you used in Assignment 2) together with the ontology, and write SPARQL queries to retrieve specific information from the KG. We will write the following queries:

* 📝 List the names of the cycling teams:

Team Jumbo, Den/Jumbo-Visma, Alpecin-Deceuninck

* 📝 List the names of the cycling riders:

Chris Froome, Tom Pidcock, Adam Yates

* 📝 Retrieve the name of the winner of the Prologue

Jonas Vingegaard

📝 We will also write the same 3 queries on Wikidata starting from `Q98043180` to compare the results.

- List the names of the cycling teams

`SELECT DISTINCT ?teamName WHERE {
  ?team wdt:P31 wd:Q98043180 .
  ?team rdfs:label ?teamName .
  FILTER(LANG(?teamName) = "en")
}`

- List the names of the cycling riders

`SELECT DISTINCT ?riderName WHERE {
  ?rider wdt:P31 wd:Q5 .
  ?rider wdt:P106 wd:Q2066131 .
  ?rider rdfs:label ?riderName .
  FILTER(LANG(?riderName) = "en")
}`

- Retrieve the name of the winner of the Prologue

`SELECT DISTINCT ?winnerName WHERE {
  ?prologue wdt:P31 wd:Q98043180 .
  ?prologue wdt:P1346 ?winner .
  ?winner rdfs:label ?winnerName .
  FILTER(LANG(?winnerName) = "en")
}`
