# Match company names to ISIN codes using Azure AI Language and Wikidata

Entity linking is a natural language processing task that involves identifying and disambiguating entities in a text. Entities are words or phrases that refer to real-world objects, such as people, places, organizations, events, and products. Disambiguating entities means resolving the ambiguity that arises when the same word or phrase can refer to different entities depending on the context. For example, the word "Apple" can refer to the fruit, the company, or the Beatles' record label, depending on the text.

Entity linking is important for many applications, such as information extraction, knowledge graph construction, question answering, text summarization, and sentiment analysis. By linking entities to their unique identifiers in a knowledge base, such as Wikidata, we can enrich the text with additional information and enable semantic search and reasoning.

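This example notebook uses the Azure AI Language service and Wikidata to link company names mentioned in unstructured text (e.g., news articles or transcripts of earnings calls) to company identifiers (ISIN codes). The Language service links each mention to a Wikipedia entry, and the ISIN is then read from the matching Wikidata item (property P946).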

The notebook is divided into the following sections:

- Imports
- Util functions
- Get ISIN codes from a list of Wikipedia pages
- Get linked entities from text using Azure AI Language
- Function to get ISIN codes from text
- Get ISIN codes for each document in the `data` folder

Please refer to this page for more information on the Azure AI Language entity linking capabilities: https://learn.microsoft.com/en-us/azure/ai-services/language-service/entity-linking/quickstart?tabs=windows&pivots=programming-language-python

## Imports

In [None]:
# Install required packages
# %pip install azure-ai-textanalytics==5.2.0
# %pip install nltk

import os
import nltk
import requests

from nltk.tokenize import sent_tokenize
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

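# Tokenizer data for sent_tokenize; NLTK 3.8.2 and later look for 'punkt_tab' instead of 'punkt'
nltk.download('punkt')
nltk.download('punkt_tab')

# Read the Language resource key and endpoint from environment variables;
# a KeyError here means the variable has not been set
language_key = os.environ['LANGUAGE_KEY']
language_endpoint = os.environ['LANGUAGE_ENDPOINT']

def authenticate_client():
    """Create a TextAnalyticsClient using key-based authentication."""
    ta_credential = AzureKeyCredential(language_key)
    text_analytics_client = TextAnalyticsClient(
        endpoint=language_endpoint,
        credential=ta_credential)
    return text_analytics_client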

client = authenticate_client()


## Util functions

In [None]:
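def get_isin(wikipedia_name):
    """Return the ISIN (Wikidata property P946) for an English Wikipedia page title, or None."""
    # Follow Wikipedia redirects first so the title matches the page linked from Wikidata
    redirects = check_redirects(wikipedia_name)
    if redirects is not None:
        wikipedia_name = redirects
    try:
        # Passing params lets requests URL-encode titles that contain spaces, commas, or ampersands
        response = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={"action": "wbgetentities", "sites": "enwiki", "props": "claims",
                    "titles": wikipedia_name, "format": "json"},
            timeout=10,
        )
        data = response.json()

        entity_id = next(iter(data["entities"]))
        # Take the first value listed under the ISIN property (P946)
        return data["entities"][entity_id]["claims"]["P946"][0]["mainsnak"]["datavalue"]["value"]
    except (requests.RequestException, ValueError, KeyError, IndexError, StopIteration):
        # Network error, invalid JSON, unknown page, or no ISIN claim on the item
        return None


def check_redirects(titles):
    """Return the redirect target of a Wikipedia page title, or None if it does not redirect."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "format": "json", "titles": titles, "redirects": 1},
        timeout=10,
    )
    data = response.json()

    redirects = data.get("query", {}).get("redirects")
    if redirects:
        return redirects[0]["to"]
    return None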

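To make the lookup in `get_isin` easier to follow, here is a minimal sketch of the response parsing on its own, run against hand-written dictionaries instead of a live request. The helper name `extract_isin`, the placeholder entity id `Q_EXAMPLE`, and the shape of the "missing page" sample are illustrative assumptions; only the nesting `entities` / `claims` / `P946` / `mainsnak` / `datavalue` / `value` is taken from the code above. Unlike `get_isin`, this sketch checks every returned entity instead of only the first.

```python
def extract_isin(data):
    """Return the first ISIN (property P946) in a wbgetentities-style response, or None."""
    entities = data.get("entities", {})
    for entity in entities.values():
        claims = entity.get("claims", {})
        for claim in claims.get("P946", []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
            if value is not None:
                return value
    return None


# Hand-written samples shaped like the fields that get_isin reads (not real API output)
sample = {
    "entities": {
        "Q_EXAMPLE": {
            "claims": {
                "P946": [{"mainsnak": {"datavalue": {"value": "US0378331005"}}}]
            }
        }
    }
}
missing = {"entities": {"-1": {"missing": ""}}}

print(extract_isin(sample))
print(extract_isin(missing))
```

Keeping the parsing separate from the HTTP call makes it possible to test the lookup logic without network access.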
## Get ISIN codes from a list of Wikipedia pages

In [None]:
companies = ["Apple_Inc.", "Alphabet_Inc.", "ASML_Holding", "Tesla,_Inc.", "Verizon_Communications"]

for company in companies:
    print(get_isin(company))

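The codes returned by `get_isin` can be sanity-checked offline, because the last character of an ISIN is a check digit. A minimal sketch, assuming the standard scheme (letters expanded to numbers with A=10 through Z=35, then the Luhn algorithm over the resulting digit string); `is_valid_isin` is a hypothetical helper and not part of the notebook.

```python
import re


def is_valid_isin(isin):
    """Check the format and the Luhn check digit of an ISIN string."""
    # 2-letter country code, 9 alphanumeric characters, 1 numeric check digit
    if not isinstance(isin, str) or not re.fullmatch(r"[A-Z]{2}[A-Z0-9]{9}[0-9]", isin):
        return False

    # Expand letters to numbers (A=10, ..., Z=35); digits stay as they are
    digits = "".join(str(int(ch, 36)) for ch in isin)

    # Luhn: counting from the rightmost digit, double every second digit
    total = 0
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(is_valid_isin("US0378331005"))  # ISIN of Apple Inc.
print(is_valid_isin("US0378331006"))  # same code with a different last digit
```

A check like this only confirms that a code is well-formed; it does not confirm that the code belongs to the company in question.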
## Get linked entities from text

In [None]:
try:
    documents = ["""
With the upcoming launch of Apple Vision Pro, we're seeing strong excitement in enterprise, leading organizations across many industries, such as Walmart, Nike, Vanguard, Stryker, Bloomberg and SAP started leveraging and investing in Apple Vision Pro as the new platform to bring innovative spatial computing experiences to their customers and employees. 
"""]
    result = client.recognize_linked_entities(documents = documents)[0]

    print("Linked Entities:\n")
    for entity in result.entities:
        print("\tName: ", entity.name, "\tId: ", entity.data_source_entity_id, "\tUrl: ", entity.url)
        print("\tMatches:")
        for match in entity.matches:
            print("\t\tText:", match.text)
            print("\t\tConfidence Score: {0:.2f}".format(match.confidence_score))
        
except Exception as err:
    print("Encountered exception. {}".format(err))

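The cell above prints a confidence score for every match. One possible use of those scores is to drop low-confidence links before looking up ISIN codes. The sketch below assumes entity objects exposing `data_source_entity_id` and `matches` with `text` and `confidence_score`, as printed above. The helper `confident_entity_ids`, the 0.5 threshold, and the `SimpleNamespace` stand-ins with made-up scores are illustrative assumptions so that the snippet runs without calling the service.

```python
from types import SimpleNamespace


def confident_entity_ids(entities, min_confidence=0.5):
    """Return ids of entities whose best match reaches min_confidence, in input order."""
    selected = []
    for entity in entities:
        best = max((m.confidence_score for m in entity.matches), default=0.0)
        if best >= min_confidence and entity.data_source_entity_id not in selected:
            selected.append(entity.data_source_entity_id)
    return selected


# Stand-in objects mimicking the attributes used above (made-up scores, not service output)
fake_entities = [
    SimpleNamespace(
        data_source_entity_id="Walmart",
        matches=[SimpleNamespace(text="Walmart", confidence_score=0.92)],
    ),
    SimpleNamespace(
        data_source_entity_id="Vanguard",
        matches=[SimpleNamespace(text="Vanguard", confidence_score=0.21)],
    ),
]

print(confident_entity_ids(fake_entities))
```

A suitable threshold depends on the texts being processed, so it is worth inspecting the printed scores on a few real documents before fixing a value.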

## Function to get ISIN codes from text

In [None]:
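def get_isin_codes_from_text(client, text):
    """Return a dict mapping the Wikipedia entity ids found in `text` to their ISIN codes."""
    companies = {}

    try:
        # Split the text into sentences and send each one as a separate document
        sentences = sent_tokenize(text)

        for sentence in sentences:
            results = client.recognize_linked_entities(documents=[sentence])

            for result in results:
                # Skip documents the service could not process
                if result.is_error:
                    continue
                for entity in result.entities:
                    # Entities that already have an ISIN are not looked up again
                    if entity.data_source_entity_id in companies:
                        continue
                    isin = get_isin(entity.data_source_entity_id)
                    if isin is not None:
                        companies[entity.data_source_entity_id] = isin
                        print("\tId: ", entity.data_source_entity_id, "\tUrl: ", entity.url, "\tMatches:", [match.text for match in entity.matches], "\tISIN: ", isin)

    except Exception as err:
        print("Encountered exception. {}".format(err))

    # Return whatever was collected, even if a request failed part-way through
    return companies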

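`get_isin_codes_from_text` sends one request per sentence. Because `recognize_linked_entities` accepts a list of documents, sentences could instead be grouped into small batches to reduce the number of calls. The following is a minimal sketch of the grouping step only; `chunk` is a hypothetical helper, and the batch size of 5 is an assumption that should be checked against the service's documented per-request limits.

```python
def chunk(items, size):
    """Yield consecutive slices of `items` containing at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


sentences = [f"Sentence {i}." for i in range(1, 8)]

for batch in chunk(sentences, 5):
    # In the notebook this would become:
    # results = client.recognize_linked_entities(documents=batch)
    print(len(batch), batch[0])
```

Results come back in the same order as the submitted documents, so the inner loop over `results` in the function above would work unchanged with batches.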
## Get ISIN codes for each document in the `data` folder

Make sure to add any news articles or transcripts of earnings calls to a folder called `data` before running the cell below.

In [None]:
folder_path = "data"

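for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)

    # Skip subfolders; only regular files are read
    if not os.path.isfile(file_path):
        continue

    # Assumes the documents are UTF-8 encoded text files
    with open(file_path, "r", encoding="utf-8") as file:
        print(filename)
        text = file.read()
    get_isin_codes_from_text(client, text)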

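The loop above only prints its findings. If the results are needed later, they could be collected per file instead. A minimal sketch, assuming UTF-8 text files; `collect_by_file` and its `extract` argument are hypothetical names, and the demonstration uses a temporary folder with a stand-in extractor that counts words, so it runs without the Language service.

```python
import os
import tempfile


def collect_by_file(folder_path, extract):
    """Apply `extract` to the text of every regular file in a folder; return {filename: result}."""
    results = {}
    for filename in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, filename)
        if not os.path.isfile(file_path):
            continue  # ignore subfolders
        with open(file_path, "r", encoding="utf-8") as file:
            results[filename] = extract(file.read())
    return results


# Demonstration with a temporary folder and a stand-in extractor that counts words
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "call.txt"), "w", encoding="utf-8") as f:
        f.write("Apple reported strong demand")
    demo = collect_by_file(tmp, lambda text: len(text.split()))

print(demo)
```

In this notebook, `extract` would be something like `lambda text: get_isin_codes_from_text(client, text)`, provided that function returns its dictionary of companies.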