<a href="https://colab.research.google.com/github/marinandres/Episode-5/blob/main/Episode_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Finding Relationships with a Knowledge Graph**

---

Simply having access to information isn't enough; understanding the relationships between documents helps you make better decisions. This is where Knowledge Graphs (KGs) come into play. A Knowledge Graph is a graph-based structure that maps relationships between entities, whether they are documents, words, or sentences, by connecting them in a meaningful way.

In this blog, we will show how identifying relationships within financial news can support market analysis with a Knowledge Graph. We are going to use the Hugging Face API, the Marketaux API, and the graph database Neo4j.

**Installing Dependencies and Packages**

---
Since we're using the Neo4j Aura graph database, we need the neo4j package and its GraphDatabase class to connect to it. The google.colab imports are mainly for accessing secret keys. The glob library helps with folder access, and the huggingface_hub package handles the API login. Since we call Mixtral through the Hugging Face API, we also import InferenceClient. I've skipped explaining the other common libraries for simplicity.

In [1]:
!pip install neo4j

Collecting neo4j
  Downloading neo4j-5.23.1-py3-none-any.whl.metadata (5.7 kB)
Downloading neo4j-5.23.1-py3-none-any.whl (293 kB)
Installing collected packages: neo4j
Successfully installed neo4j-5.23.1


In [2]:
from google.colab import userdata
from google.colab import files
from huggingface_hub import login
from huggingface_hub import InferenceClient
from neo4j import GraphDatabase
from time import sleep
from timeit import default_timer as timer
from string import Template
import os
import requests
import glob
import json

*With Colab's Secrets feature, you can store your secret keys securely and read them directly in the notebook. We use apifinance_key to access the Marketaux API, which provides financial news. In this episode, Marketaux is our data source, and we rely on its free tier, which allows 100 news requests per day.*

In [3]:
neo4j_url = userdata.get('neo4j_url')
neo4j_username = userdata.get('neo4j_username')
neo4j_password = userdata.get('neo4j_password')
gdb = GraphDatabase.driver(neo4j_url, auth=(neo4j_username, neo4j_password))

hf_token = userdata.get('hf_token')
login(token=hf_token)

apifinance_key = userdata.get('api_finance_key')


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


**Data Collection from Marketaux API**

---

We run a simple while loop over up to 99 pages. With the free tier, each page contains up to 3 news articles, so this gives us at most 297 articles. The focus of this episode is the company NVIDIA (ticker NVDA). You can find more information here: [Marketaux Documentation](https://www.marketaux.com/documentation). We save all the JSON data in one file at this location: '/content/financial_news.json'.

In [7]:
output_dir = '/content/financial_news_items'
file_path = '/content/financial_news.json'

page = 1
all_news = []
max_pages = 99

while page <= max_pages:
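    # Pass the query parameters separately so requests builds and encodes the URL
    r = requests.get(
        "https://api.marketaux.com/v1/news/all",
        params={
            "symbols": "NVDA",
            "filter_entities": "true",
            "page": page,
            "language": "en",
            "api_token": apifinance_key,
        },
    )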
    financial_news = r.json()

    if 'data' in financial_news and financial_news['data']:
        all_news.extend(financial_news['data'])
        print(f"Page {page} retrieved successfully.")
    else:
        print(f"No more data found at page {page}. Ending loop.")
        break

    page += 1

with open(file_path, 'w') as f:
    json.dump(all_news, f)

print(f"Data saved to {file_path}")
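
The stopping rule of the loop above (keep going until a page comes back empty or the page limit is reached) can be isolated and tried without spending API requests. The following is a minimal sketch, assuming a hypothetical `fetch_page` callable that returns the parsed JSON for one page; `collect_news` and `fake_fetch_page` are illustrative names and not part of the Marketaux API.

```python
def collect_news(fetch_page, max_pages):
    """Collect the 'data' items of pages 1..max_pages, stopping at the first empty page."""
    all_news = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        items = payload.get("data") or []
        if not items:
            break
        all_news.extend(items)
    return all_news


# Hypothetical stand-in for the HTTP call: two pages of three items, then nothing.
def fake_fetch_page(page):
    if page in (1, 2):
        return {"data": [{"uuid": f"p{page}-{i}"} for i in range(3)]}
    return {"data": []}


news = collect_news(fake_fetch_page, max_pages=99)
print(len(news))
```

Swapping `fake_fetch_page` for a function that wraps the real `requests.get` call would reproduce the behaviour of the loop.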

**Quick Data Cleaning**

---

In this scenario, we want to save each financial news article as a separate file. Since we're using [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) to extract information from each article, it's important not to mix them together, especially since we're not handling the uuid field from the JSON file. We'll extract just the title, the description, the publication date (published_at), and the highlights of each entity, which summarize the key points of the news.

In [8]:
def process_financial_news(news_data, output_dir):
  # Create the output folder if it does not exist yet
  os.makedirs(output_dir, exist_ok=True)

  for x, news in enumerate(news_data):
    title = news.get("title", "")
    description = news.get("description", "")
    published_at = news.get("published_at", "")

    # Collect every highlight of every entity; articles without any get an empty string
    highlights = []
    for entity in news.get("entities", []):
      for high in entity.get("highlights", []):
        text = high.get("highlight", "")
        if text:
          highlights.append(text)

    file_path = os.path.join(output_dir, f"financial_news_{x}.md")
    with open(file_path, 'w') as f:
      f.write(f"Title: {title}\n")
      f.write(f"\nDescription: {description}\n")
      f.write(f"\nPublished Date: {published_at}\n")
      f.write(f"\nHighlight: {' '.join(highlights)}\n")

def read_markdown_files(output_dir):
    files_content = {}
    for filename in os.listdir(output_dir):
        file_path = os.path.join(output_dir, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r') as file:
                content = file.read()
                files_content[filename] = content
    return files_content

with open(file_path, 'r') as f:
    financial_news_data = json.load(f)

all_news = financial_news_data
process_financial_news(all_news, output_dir)

markdown_files_content = read_markdown_files(output_dir)
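
To make the file layout concrete, here is a minimal sketch of the text produced for a single article. The sample dictionary is hypothetical; it only mirrors the fields read above (title, description, published_at, and the highlight of each entry under entities and highlights). `render_article` is an illustrative name, and joining all highlights into one line is an assumption of this sketch.

```python
def render_article(news):
    """Build the per-article text from the fields used in this episode."""
    highlights = [
        high.get("highlight", "")
        for entity in news.get("entities", [])
        for high in entity.get("highlights", [])
    ]
    joined = " ".join(h for h in highlights if h)
    return (
        f"Title: {news.get('title', '')}\n"
        f"\nDescription: {news.get('description', '')}\n"
        f"\nPublished Date: {news.get('published_at', '')}\n"
        f"\nHighlight: {joined}\n"
    )


# Hypothetical article shaped like the fields read above (not real Marketaux data)
sample = {
    "title": "Sample headline",
    "description": "Sample description",
    "published_at": "2024-01-01T00:00:00Z",
    "entities": [
        {"highlights": [{"highlight": "first"}, {"highlight": "second"}]},
        {"highlights": []},
    ],
}

text = render_article(sample)
print(text)
```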

**Prompt Engineering**

---

This is a field I'm currently interested in. We need to control the output of LLMs, and here our goal is to obtain a JSON object containing entities and relationships.

Using Few-Shot Prompting, we guide the model by providing an example JSON with entities, labels, and IDs. I did the same for relationships.

For In-Context Learning, I applied it to relationships by explaining what the relationship should be and giving an example. For entities, I instructed the model on how to construct the summary and where to find the necessary information.

The approach I'm using is based on a paper titled: [FinDKG: Dynamic Knowledge Graphs with Large Language Models for Detecting Global Trends in Financial Markets](https://arxiv.org/pdf/2407.10909)

In [9]:
prompt_template = """From the following financial news below, extract the following relationship & entities:
0. ALWAYS FINISH THE OUTPUT. Never send partial results.
1. First, you are going to extract the entities mentioned in the news and generate as comma-separated format similar to entity types.
  'id' property of each entity must be unique and alphanumeric. You will use this to define relationships between entities. Document must be summarized and stored inside News entity under `summary` property. You will have to generate as many entities as needed as per the types below:
  Entity Types:
  label:'News',id:string,name:string;summary:string //Title mention in the text; `id` property is the full name you can find in on 'Title: ', in lowercase, with no capital letters, special characters, spaces or hyphens; Contents of original document must be summarized using 'Description: ', 'Published Date: ' and 'Highlight: ' and saved in 'summary' property.
  label:'ORG', id:string,name:string //Non-governmental and non-regulatory organisations. Example: Imperial College London; `id` property is the name of an ORG that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the organization name.
  label:'REG', id:string,name:string //Regulatory organisations. Example: Bank of England; `id` property is the name of a REG that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the organization name.
  label:'GPE', id:string,name:string //Geopolitical entities like countries or cities. Example: United Kingdom. 'id' property is the name of a GPE that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the location name.
  label:'PERSON', id:string,name:string //Individuals in influential or decision-making roles. Example: Jerome Powell. 'id' property is the name of a PERSON that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the person's name.
  label:'COMP', id:string,name:string //Companies across sectors. Example: Apple Inc.; `id` property is the name of a COMP that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the company name.
  label:'PRODUCT', id:string,name:string //Tangible or intangible products or services. Example: iPhone; `id` property is the name of a PRODUCT that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the product name
  label:'EVENT', id:string,name:string //Natural or man-made events. Example: Brexit; `id` property is the name of an EVENT that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the event name
  label:'SECTOR', id:string,name:string //Sectors or industries in which companies operate. Example: Technology Sector; `id` property is the name of a SECTOR that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the sector
  label:'ECON IND', id:string,name:string //Non
  label:'FIN INST', id:string,name:string //Financial and market instruments. Example: S&P. `id` property is the name of a FIN INST that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the instrument
  label:'CONCEPT', id:string,name:string //Abstract ideas, themes, or financial theories. Example: Artificial Intelligence; `id` property is the name of a CONCEPT that you identify, in lowercase, with no capital letters, special characters, spaces or hyphens. 'name' is the concept name

2. Next, generate each relationship as a triple of head, relationship and tail. To refer to the head and tail entities, use their respective `id` property. Relationship property should be mentioned within brackets as comma-separated.
  Relationship types:
  HAS //Indicates ownership or possession, often of assets or subsidiaries in a financial context. Example:  Google Has Android.
  ANNOUNCE //Refers to the formal public declaration of a financial event, product launch, or strategic move. Example: Apple  Announces iPhone 13.
  OPERATE_IN //Describes the geographical market in which a business entity conducts its operations. Example: Tesla Operates In China.
  INTRODUCE //Denotes the first-time introduction of a financial instrument, product, or policy to the market. Example: Facebook Introduces Facebook Messenger.
  PRODUCE // Specifies the entity responsible for creating a particular product, often in a manufacturing or financial product context. Example: Tesla Produces Electric Cars.
  CONTROL // Implies authority or regulatory power over monetary policy, financial instruments, or market conditions. Example: Bank of England Controls the Pound Sterling.
  PARTICIPATE_IN // Indicates active involvement in an event that has financial or economic implications. Example: Bank of England Participates In Brexit.
  IMPACT //Signifies a notable effect, either positive or negative, on market trends, financial conditions, or economic indicators. Example: Tesla Has Positive Impact On Stocks.
  POSITIVE_IMPACT_ON //Highlights a beneficial effect on financial markets, economic indicators, or business performance. Example: Tesla Has Positive Impact On Stocks.
  NEGATIVE_IMPACT_ON // Denotes a negative impact on financial markets, economic indicators, or business performance. Example: Bank of England Has Negative Impact On Stocks.
  RELATE_TO // Points out a connection or correlation with a financial concept, sector, or market trend. Example: Tesla Relates To Technology Sector.
  IS_MEMBER_OF // Denotes membership in a trade group, economic union, or financial consortium. Example: Germany Is Member Of EU.
  INVEST_IN // Specifies an allocation of capital into a financial instrument, sector, or business entity. Example: Warren Buffett Invests In Apple.
  RAISE // Indicates an increase, often referring to capital, interest rates, or production levels in a financial context. Example: OPEC Raises Oil Production.
  DECREASE // Indicates a reduction, often referring to capital, interest rates, or production levels in a financial context. Example: Federal Reserve Decreases Interest Rates.

    News|HAS|PRODUCT and use the name property
    News|ANNOUNCE|PRODUCT and use the name property
    News|OPERATE_IN|GPE and use the name property
    News|INTRODUCE|ORG and use the name property
    News|PRODUCE|ORG and use the name property

The output should ALWAYS look like this, DO NOT CREATE any other output :
{
    "entities": [{"label":"News","id":string,"name":string,"summary":string}],
    "relationships": ["News|HAS|google"]
}

Case Sheet:
$ctext
"""

**Hugging Face API**

---

First of all, it’s free. While proprietary models like GPT-4 or Claude 3 are more advanced than open-source alternatives, not everyone can afford them. How can we make #AIForEveryone if the costs are too high?

That’s where the Hugging Face API’s Inference Client comes in. This feature allows me to get responses from the Mixtral-8x7B LLM without needing to run it on my computer. In this setup, the file_prompt variable contains the entire financial news article and the prompt template. We also provide additional system content to our model using the system_msg variable, which informs the model that it’s a financial analyst specializing in extracting information from financial news documents.

In [10]:
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

client = InferenceClient(
    token=hf_token,
    model=model_id
)

def process_mixtral(file_prompt, system_msg):
    buffer = ""

    start = timer()
    # Stream the completion and accumulate the text chunks as they arrive
    for message in client.chat_completion(
        max_tokens=15000,
        stream=True,
        temperature=0,
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": file_prompt}
        ]
    ):
        # delta.content may be None on some chunks, so fall back to an empty string
        llm_response = message.choices[0].delta.content
        buffer += llm_response or ""
    end = timer()
    print(f"\n\nTime taken: {end - start:.2f}s")
    return buffer
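
The text returned by process_mixtral is parsed with json.loads in the next step, and that call raises an "Extra data" error whenever the model appends text after the JSON object. One way to tolerate this is the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value and reports where it ended. The following is a minimal sketch; `parse_first_json_object` is a hypothetical helper and is not what the pipeline in this episode does.

```python
import json


def parse_first_json_object(text):
    """Parse the first JSON object found in text and ignore anything after it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # raw_decode returns the parsed value and the index where parsing stopped
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


reply = '{"entities": [], "relationships": []}\n\nNote: this trailing text is ignored.'
print(parse_first_json_object(reply))
```

The trade-off is that anything after the first object is silently dropped, so a reply that was split into several objects would lose data.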

*In the extract_entities_relationships function, we read each news file, send it to the Hugging Face API, and parse the response with json.loads. Files whose response is not valid JSON are reported and skipped.*

In [11]:
def extract_entities_relationships(folder, prompt_template):
    start = timer()
    files = glob.glob(f"/content/{folder}/*")
    system_msg = "You are a financial analyst who is an expert in technology companies and financial market analysis who extract information from documents"
    print(f"Running pipeline for {len(files)} files in {folder} folder")
    results = []
    for i, file in enumerate(files):
        print(f"Extracting entities and relationships for {file}")
        try:
            with open(file, "r") as f:
                text = f.read()
                prompt = Template(prompt_template).substitute(ctext=text)
                result = process_mixtral(prompt, system_msg=system_msg)
                results.append(json.loads(result))
                # print(f"Result Test: {result}")
        except Exception as e:
            print(f"Error processing {file}: {e}")
        sleep(8)
    end = timer()
    print(f"Pipeline completed in {end-start} seconds")
    return results
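
Before the parsed replies are turned into Cypher, each one can be checked against the shape requested in the prompt: an "entities" list whose items carry label, id, and name, and "relationships" strings of the form head|TYPE|tail that refer to known ids. The following is a minimal sketch under that assumption; `normalize_id` and `validate_extraction` are hypothetical helpers, and the sample reply is made up.

```python
import re


def normalize_id(name):
    """Lowercase a name and drop everything that is not an ASCII letter or digit."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def validate_extraction(obj):
    """Return a list of problems found in one parsed reply (empty if none)."""
    problems = []
    ids = set()
    for entity in obj.get("entities", []):
        if not {"label", "id", "name"}.issubset(entity):
            problems.append(f"entity missing keys: {entity}")
            continue
        ids.add(entity["id"])
    for rel in obj.get("relationships", []):
        parts = rel.split("|")
        if len(parts) != 3:
            problems.append(f"malformed relationship: {rel}")
        elif parts[0] not in ids or parts[2] not in ids:
            problems.append(f"unknown entity id in: {rel}")
    return problems


# Made-up reply in the format the prompt asks for
reply = {
    "entities": [
        {"label": "News", "id": "samplenews", "name": "Sample News", "summary": "..."},
        {"label": "COMP", "id": normalize_id("Apple Inc."), "name": "Apple Inc."},
    ],
    "relationships": ["samplenews|RELATE_TO|appleinc", "samplenews|HAS"],
}
print(validate_extraction(reply))
```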

*The generate_cypher function does a quick cleanup of the extracted JSON and turns it into Cypher, the query language used by Neo4j. It produces the MERGE statements that create every entity and relationship, and it also writes them to a text file, cyphers.txt.*

In [15]:
def generate_cypher(json_obj):
    e_statements = []
    r_statements = []

    e_label_map = {}

    # Loop through the JSON object
    for i, obj in enumerate(json_obj):
        print(f"Generating cypher for file {i+1} of {len(json_obj)}")
        print(obj)  # Debug: print the current object being processed

        for entity in obj["entities"]:
            label = entity["label"]
            entity_id = entity["id"].replace("-", "").replace("_", "")
            properties = {k: v for k, v in entity.items() if k not in ["label", "id"]}

            # Backticks allow labels that contain spaces, such as 'ECON IND'
            cypher = f'MERGE (n:`{label}` {{id: "{entity_id}"}})'
            if properties:
                # json.dumps quotes each value and escapes inner quotes and newlines
                props_str = ", ".join(
                    [f'n.{key} = {json.dumps(val)}' for key, val in properties.items()]
                )
                cypher += f" ON CREATE SET {props_str}"
            e_statements.append(cypher)

            # Ensure the entity is mapped correctly
            e_label_map[entity_id] = label
            print(f"Mapping entity ID {entity_id} to label {label}")  # Debug: print the mapping

        for rs in obj["relationships"]:
            # Skip anything that is not a head|TYPE|tail triple
            parts = rs.split("|")
            if len(parts) != 3:
                print(f"Warning: Skipping malformed relationship {rs}")
                continue
            src_id, rs_type, tgt_id = parts
            src_id = src_id.replace("-", "").replace("_", "")
            tgt_id = tgt_id.replace("-", "").replace("_", "")

            # Debug: print the IDs before accessing e_label_map
            print(f"Processing relationship: {src_id} -[{rs_type}]-> {tgt_id}")

            # Access e_label_map safely
            if src_id in e_label_map and tgt_id in e_label_map:
                src_label = e_label_map[src_id]
                tgt_label = e_label_map[tgt_id]

                # Backticks allow labels that contain spaces, such as 'ECON IND'
                cypher = (
                    f'MERGE (a:`{src_label}` {{id: "{src_id}"}}) '
                    f'MERGE (b:`{tgt_label}` {{id: "{tgt_id}"}}) '
                    f'MERGE (a)-[:{rs_type}]->(b)'
                )
                r_statements.append(cypher)
            else:
                print(f"Warning: Missing entity for IDs {src_id} or {tgt_id}")  # Debug: warning

    with open("cyphers.txt", "w") as outfile:
        outfile.write("\n".join(e_statements + r_statements))

    return e_statements + r_statements
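
Building Cypher by string interpolation is fragile when values contain quotes, for example a quoted phrase inside a summary. An alternative to escaping values in the query text is to pass them as query parameters. The following is a minimal sketch of that idea; `entity_to_query` is a hypothetical helper and not part of the pipeline above. Labels cannot be passed as parameters in Cypher, so the label is still interpolated, wrapped in backticks.

```python
def entity_to_query(entity):
    """Return (query, params) for one entity, with all values passed as parameters."""
    # Labels cannot be parameterized, so strip backticks and quote the label instead
    label = entity["label"].replace("`", "")
    props = {k: v for k, v in entity.items() if k not in ("label", "id")}
    query = f"MERGE (n:`{label}` {{id: $id}}) ON CREATE SET n += $props"
    return query, {"id": entity["id"], "props": props}


# Made-up entity whose summary contains double quotes
query, params = entity_to_query(
    {"label": "News", "id": "samplenews", "name": "Sample", "summary": 'He said "buy"'}
)
print(query)
print(params)
```

With the Neo4j Python driver 5.x, such a pair could then be run as `gdb.execute_query(query, parameters_=params)`. The statements would no longer be plain strings, so writing them to cyphers.txt would need a different format.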

**Final Result**

---

Everything our "financial analyst" extracted is stored in the database with the driver's execute_query function. The next step is to extract this information for various purposes. Right now, the focus is on understanding how to store the information analyzed by the Mixtral LLM.

This approach to data ingestion is inspired by this YouTube video: [How to Build Knowledge Graphs With LLMs (python tutorial)](https://www.youtube.com/watch?v=tcHIDCGu6Yw&t=1258s)

In [None]:
def ingestion_pipeline(folders):
    entities_relationships = []
    for folder in folders:
        print(f"Extracting entities and relationships for {folder}")
        # list.extend returns None, so extend in place and keep using the list itself
        entities_relationships.extend(extract_entities_relationships(folder, prompt_template))
    cypher_statements = generate_cypher(entities_relationships)
    for i, stmt in enumerate(cypher_statements):
        print(f"Executing cypher statement {i+1} of {len(cypher_statements)}")
        try:
            gdb.execute_query(stmt)
        except Exception as e:
            # Append so that earlier failures are not overwritten
            with open("failed_statements.txt", "a") as f:
                f.write(f"{stmt} - Exception: {e}\n")

folders = ['financial_news_items']

ingestion_pipeline(folders)

Extracting entities and relationships for financial_news_items
Running pipeline for 180 files in financial_news_items folder
Extracting entities and relationships for /content/financial_news_items/financial_news_45.md


Time taken: 0.19s
Error processing /content/financial_news_items/financial_news_45.md: Extra data: line 39 column 1 (char 1379)
Extracting entities and relationships for /content/financial_news_items/financial_news_162.md


Time taken: 0.08s
Extracting entities and relationships for /content/financial_news_items/financial_news_78.md


Time taken: 20.83s
Extracting entities and relationships for /content/financial_news_items/financial_news_46.md


Time taken: 17.83s
Extracting entities and relationships for /content/financial_news_items/financial_news_14.md


Time taken: 13.36s
Extracting entities and relationships for /content/financial_news_items/financial_news_179.md


Time taken: 12.42s
Extracting entities and relationships for /content/financial_news_items/financial