# Named Entity Linking (NEL) with DBpedia SPARQL

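This notebook demonstrates how to use the `get_best_match` function to link an organization name extracted from a news article to its corresponding entity in DBpedia. The function queries the public DBpedia SPARQL endpoint for companies whose English label contains the name, then ranks the returned candidates by string similarity. The second half of the notebook combines it with spaCy NER so that organizations are extracted and linked in one pipeline.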

In [1]:
!pip install SPARQLWrapper pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting SPARQLWrapper
  Downloading SPARQLWrapper-2.0.0-py3-none-any.whl.metadata (2.0 kB)
Downloading SPARQLWrapper-2.0.0-py3-none-any.whl (28 kB)
Installing collected packages: SPARQLWrapper
Successfully installed SPARQLWrapper-2.0.0





In [7]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
from difflib import SequenceMatcher

# DBpedia SPARQL Endpoint
DBPEDIA_SPARQL_URL = "http://dbpedia.org/sparql"

def get_best_match(org_name):
    """Fetch the best-matching company entity from DBpedia using SPARQL."""
    
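    # Escape backslashes and double quotes so the name stays a valid SPARQL string literal
    safe_name = org_name.replace("\\", "\\\\").replace('"', '\\"')

    # SPARQL query with the organization name substituted in.
    # The language filter on ?abstract sits inside its OPTIONAL block; as a
    # top-level FILTER it would drop every company without an English abstract.
    sparql_query = f"""
    SELECT ?company ?label ?industry ?country ?abstract ?wikiPage WHERE {{
      ?company rdf:type dbo:Company.
      ?company rdfs:label ?label.

      OPTIONAL {{ ?company dbo:industry ?industry. }}
      OPTIONAL {{ ?company dbo:country ?country. }}
      OPTIONAL {{ ?company dbo:abstract ?abstract. FILTER (lang(?abstract) = 'en') }}
      OPTIONAL {{ ?company foaf:isPrimaryTopicOf ?wikiPage. }}

      FILTER (CONTAINS(LCASE(?label), LCASE("{safe_name}")))
      FILTER (lang(?label) = 'en')
    }}
    LIMIT 5
    """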

    sparql = SPARQLWrapper(DBPEDIA_SPARQL_URL)
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(JSON)
    
    results = sparql.query().convert()
    matches = results["results"]["bindings"]

    if not matches:
        return None

    # Rank results by string similarity between the query name and each label
    def similarity(match):
        label = match["label"]["value"].lower()
        return SequenceMatcher(None, org_name.lower(), label).ratio()

    ranked_matches = sorted(matches, key=similarity, reverse=True)

    # Best match
    best_match = ranked_matches[0]

    return {
        "Matched Entity": best_match["label"]["value"],
        "Industry": best_match["industry"]["value"] if "industry" in best_match else "Unknown",
        "Country": best_match["country"]["value"] if "country" in best_match else "Unknown",
        "Description": best_match["abstract"]["value"] if "abstract" in best_match else "No description available",
        "Wikipedia URL": best_match["wikiPage"]["value"] if "wikiPage" in best_match else "No URL available"
    }

if __name__ == "__main__":
    org_name = input("Enter an organization name: ")
    best_match = get_best_match(org_name)

    if best_match:
        print("Best Matching Entity Found:")
        for key, value in best_match.items():
            print(f"{key}: {value}")
    else:
        print("No matching entity found.")


Best Matching Entity Found:
Matched Entity: Zalando
Industry: http://dbpedia.org/resource/E-commerce
Country: Unknown
Description: Zalando SE is a publicly traded German online retailer of shoes, fashion and beauty. The company was founded in 2008 by David Schneider and Robert Gentz and has more than 50 million active users in 25 European markets. Zalando is active in a variety of business fields - from multi-brand online shopping (including their own brands), the shopping club Zalando Lounge, outlets in 11 German cities, the consultation service Zalon, as well as logistics and marketing offers for retailers. With the program Connected Retail, Zalando connects more than 7,000 brick and mortar businesses to the online fashion platform. In 2021, Zalando generated revenue of 10.35 billion Euro, with roughly 17,000 employees.
Wikipedia URL: http://en.wikipedia.org/wiki/Zalando
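
The `Industry` field above is returned as a DBpedia resource URI rather than a plain name. As a minimal sketch, assuming the last path segment of the URI is an acceptable stand-in for the label, a small helper can make it readable. The name `uri_to_label` is hypothetical and not part of the notebook; a more faithful approach would query the resource's `rdfs:label` instead.

```python
from urllib.parse import unquote, urlparse


def uri_to_label(uri):
    """Turn a DBpedia resource URI into a readable label (hypothetical helper)."""
    # Take the last path segment, e.g. "Consumer_electronics"
    local_name = urlparse(uri).path.rsplit("/", 1)[-1]
    # Decode percent-escapes and replace underscores with spaces
    return unquote(local_name).replace("_", " ")


print(uri_to_label("http://dbpedia.org/resource/E-commerce"))            # E-commerce
print(uri_to_label("http://dbpedia.org/resource/Consumer_electronics"))  # Consumer electronics
```

This only reformats the URI string; it does not contact DBpedia, so it cannot recover labels that differ from the resource name.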


In [3]:
# Test the function with an example entity
org_name = "Apple Inc."  # Replace with an extracted organization name
best_match = get_best_match(org_name)

# Display results
if best_match:
    print("Best Matching Entity Found:")
    for key, value in best_match.items():
        print(f"{key}: {value}")
else:
    print("No matching entity found.")

Best Matching Entity Found:
Matched Entity: Apple Inc.
Industry: http://dbpedia.org/resource/Consumer_electronics
Country: Unknown
Description: Apple Inc. is an American multinational technology company headquartered in Cupertino, California, United States. Apple is the largest technology company by revenue (totaling US$365.8 billion in 2021) and, as of June 2022, is the world's biggest company by market capitalization, the fourth-largest personal computer vendor by unit sales and second-largest mobile phone manufacturer. It is one of the Big Five American information technology companies, alongside Alphabet, Amazon, Meta, and Microsoft. Apple was founded as Apple Computer Company on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne to develop and sell Wozniak's Apple I personal computer. It was incorporated by Jobs and Wozniak as Apple Computer, Inc. in 1977 and the company's next computer, the Apple II, became a best seller and one of the first mass-produced microcomputers

In [8]:
import spacy
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
from difflib import SequenceMatcher

In [9]:
# Load spaCy's English NER model
nlp = spacy.load("en_core_web_sm")

In [10]:
def extract_organizations(text):
    """Extract organizations (ORG) from the input text using spaCy NER."""
    doc = nlp(text)
    orgs = list(set(ent.text for ent in doc.ents if ent.label_ == "ORG"))
    return orgs
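
Because `extract_organizations` deduplicates with `set`, the order of the returned organizations is not tied to their order in the text. If a stable, first-seen order is wanted, one possible alternative is sketched below on plain strings so it runs without spaCy. The helper name `dedupe_preserving_order` is an assumption for illustration.

```python
def dedupe_preserving_order(names):
    """Remove duplicates while keeping first-seen order (hypothetical helper)."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(names))


mentions = ["Apple Inc.", "Microsoft", "Apple Inc.", "Google"]
print(dedupe_preserving_order(mentions))  # ['Apple Inc.', 'Microsoft', 'Google']
```

In the function above, this would amount to passing the `ent.text` values of the `ORG` entities to the helper instead of wrapping them in `set`.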

In [11]:
def get_best_match(org_name):
    """Fetch the best-matching company entity from DBpedia using SPARQL."""
    
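    # Escape backslashes and double quotes so the name stays a valid SPARQL string literal
    safe_name = org_name.replace("\\", "\\\\").replace('"', '\\"')

    # SPARQL query with the organization name substituted in
    sparql_query = f"""
    SELECT ?company ?label ?wikiPage WHERE {{
      ?company rdf:type dbo:Company.
      ?company rdfs:label ?label.
      OPTIONAL {{ ?company foaf:isPrimaryTopicOf ?wikiPage. }}

      FILTER (CONTAINS(LCASE(?label), LCASE("{safe_name}")))
      FILTER (lang(?label) = 'en')
    }}
    LIMIT 5
    """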

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(JSON)
    
    results = sparql.query().convert()
    matches = results["results"]["bindings"]

    if not matches:
        return None

    # Rank results by string similarity between the query name and each label
    def similarity(match):
        label = match["label"]["value"].lower()
        return SequenceMatcher(None, org_name.lower(), label).ratio()

    ranked_matches = sorted(matches, key=similarity, reverse=True)

    # Best match
    best_match = ranked_matches[0]

    return {
        "Matched Entity": best_match["label"]["value"],
        "Wikipedia URL": best_match["wikiPage"]["value"] if "wikiPage" in best_match else "No URL available"
    }


In [12]:
def ner_nel_pipeline(text):
    """Pipeline that extracts ORGs using NER and links them using NEL."""
    orgs = extract_organizations(text)
    results = []

    for org in orgs:
        nel_result = get_best_match(org)
        if nel_result:
            results.append({
                "Extracted ORG": org,
                "Matched Entity": nel_result["Matched Entity"],
                "Wikipedia URL": nel_result["Wikipedia URL"]
            })
    
    return pd.DataFrame(results)

In [13]:
if __name__ == "__main__":
    sample_text = """Apple Inc. and Microsoft are two of the largest tech companies. 
                     Google and Amazon are also major players in the industry."""

    df_results = ner_nel_pipeline(sample_text)
    print(df_results)

  Extracted ORG    Matched Entity  \
0    Apple Inc.        Apple Inc.   
1     Microsoft  Microsoft Amalga   
2        Amazon        Amazon.com   
3        Google            Google   

                                   Wikipedia URL  
0        http://en.wikipedia.org/wiki/Apple_Inc.  
1  http://en.wikipedia.org/wiki/Microsoft_Amalga  
2        http://en.wikipedia.org/wiki/Amazon.com  
3            http://en.wikipedia.org/wiki/Google  
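
In the table above, "Microsoft" is linked to "Microsoft Amalga" rather than to the company itself. One likely contributor is that `LIMIT 5` is applied by the endpoint before the local similarity ranking, so the ranking only sees whichever five substring matches happen to come back. The sketch below does not fix the query; it only illustrates, on hand-written candidate lists, how a minimum similarity threshold could reject a weak top candidate instead of always returning it. The helper `pick_best_label` and the threshold values are assumptions for illustration.

```python
from difflib import SequenceMatcher


def pick_best_label(org_name, labels, min_ratio=0.6):
    """Return the label most similar to org_name, or None if none is close enough.

    Hypothetical helper; min_ratio is an arbitrary illustrative threshold.
    """
    query = org_name.lower()
    best_label, best_ratio = None, 0.0
    for label in labels:
        ratio = SequenceMatcher(None, query, label.lower()).ratio()
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    # Reject weak matches instead of always returning the top candidate
    return best_label if best_ratio >= min_ratio else None


# Hand-written candidate lists, not actual endpoint results
print(pick_best_label("Microsoft", ["Microsoft Amalga", "Microsoft"]))    # Microsoft
print(pick_best_label("Microsoft", ["Microsoft Amalga"], min_ratio=0.8))  # None
```

A threshold trades missed links for fewer wrong ones, so a suitable value would need to be tuned on real extracted names.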
