# Entity Extraction with an LLM
This notebook guides you through the process of extracting data from Form 13F XML files, using Azure OpenAI to identify key entities, and then loading that information into a Neo4j graph database. This hands-on lab demonstrates how to combine these technologies to transform unstructured data into a structured, queryable graph.


## Install Dependencies

First, let's install the necessary Python packages. Run this cell to install the `openai`, `neo4j`, `graphdatascience`, `azure-storage-blob`, and `python-dotenv` libraries.
These libraries allow us to work with OpenAI, Neo4j, Neo4j Graph Data Science, and Azure Blob Storage, and to load credentials from a `.env` file.
In a Jupyter Notebook, you can install packages using pip with a cell like this:


In [None]:
# Used in Labs 5, 6 and 7
%pip install --user openai
%pip install --user neo4j
%pip install --user graphdatascience
%pip install --user azure-storage-blob
# Provides the dotenv module that is imported later in this notebook
%pip install --user python-dotenv

Now let’s restart the kernel. Run the cell below and click OK when the popup appears.

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=True)
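
Once the kernel is back, it can be worth confirming that the packages are actually visible to it. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this illustration and is not part of the lab.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the install cell above.
packages = ["openai", "neo4j", "graphdatascience", "azure-storage-blob"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any package shows as not installed, rerunning the install cell and restarting the kernel again is usually the first thing to try.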

## Import Libraries

Next, we'll import the Python libraries that we'll use in this notebook.
These libraries provide functions for working with XML data, connecting to Neo4j, handling JSON, accessing Azure Blob Storage, and interacting with the OpenAI API.

In [None]:
import os
from xml.etree import ElementTree as ET
from neo4j import GraphDatabase
import json
from azure.storage.blob import BlobServiceClient
import openai
from openai import AzureOpenAI

## Credentials Setup

In the code cell below, you will find a text block assigned to the `env_content` variable. **Carefully replace the `your_...` placeholders within the single quotes with your actual Neo4j and Azure OpenAI credentials.**

**Important Reminders:**

* **Keep the single quotes** around your actual values.
* **Do not add any extra spaces** around the `=` sign.

Once you have replaced all the placeholders, run the code cell. It will create a `.env` file in the parent directory of the notebook (the code writes to `../.env`). You should see a message confirming that the `.env` file has been created.

This `.env` file will be used to securely load your credentials in the subsequent lab notebooks (Lab 5, 6, and 7), so ensure you have entered the correct information.


In [None]:
# Update the values in this cell

import os

def create_dot_env_from_code_cell():
    env_content = """
# ==============================================================================
#  Neo4j Connection Details
# ==============================================================================
# Replace the placeholders below with your actual Neo4j connection details.
# This information was provided when you created your Neo4j instance.

NEO4J_USERNAME='neo4j'
NEO4J_URI='neo4j+s://your_neo4j_server_url'
NEO4J_PASSWORD='your_neo4j_password'

# ==============================================================================
#  Azure OpenAI Configuration
# ==============================================================================
# Replace the placeholders below with your Azure OpenAI credentials.
# These were provided during the sign-up process for this workshop.

API_ENDPOINT='your_api_endpoint'
API_VERSION='your_api_version'
API_KEY='your_api_key'
deployment_name='your_deployment_name'

# ==============================================================================
# Important Notes:
# ==============================================================================
# 1.  Carefully replace the 'your_...' placeholders with your actual values,
#     ensuring they are enclosed in single quotes (').
# 2.  Do not include any extra spaces around the '=' sign.
# ==============================================================================
"""

    print("Please confirm the values below. To modify, update the cell above and run the cell again")
    print("------------------------------------------------------------------------------------------")
    print(env_content)
    print("------------------------------------------------------------------------------------------")

    try:
        with open("../.env", "w") as f:
            f.write(env_content)
        print("\nSuccessfully created the .env file in the parent directory.")
        print("Please ensure you have added '.env' to your .gitignore file.")

    except Exception as e:
        print(f"\nAn error occurred while creating the .env file: {e}")
        print("Please check if the notebook has write permissions to the parent directory.")

# Run the function to start the process
create_dot_env_from_code_cell()
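
A common slip at this step is leaving one of the `your_...` placeholders in place. As a rough sketch (not part of the original lab), the hypothetical helper below scans `.env`-style text and reports which settings still look like placeholders. It is run on a small sample string here because `env_content` is local to the function above.

```python
def find_unreplaced_placeholders(env_text):
    """Return the names of settings whose value still looks like a 'your_...' placeholder."""
    unreplaced = []
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not KEY=value
        key, _, value = line.partition("=")
        value = value.strip().strip("'\"")
        # Also catch placeholders that follow a URI scheme, e.g. neo4j+s://your_...
        if value.startswith("your_") or "://your_" in value:
            unreplaced.append(key.strip())
    return unreplaced


# Sample text for illustration only; the key values are invented.
sample = """
# comment line
NEO4J_USERNAME='neo4j'
NEO4J_URI='neo4j+s://your_neo4j_server_url'
API_KEY='abc123'
"""
print(find_unreplaced_placeholders(sample))  # ['NEO4J_URI']
```

The same function could be pointed at the contents of the generated `.env` file by reading it with `open("../.env").read()`.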

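The following code cell reads the credentials you saved in the `.env` file in the previous step and stores them as variables within this notebook's environment. It then creates the Azure OpenAI `client` and the Neo4j `driver` that the helper functions later in this notebook rely on.

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from the .env file. In a notebook, load_dotenv()
# starts at the working directory and walks up through its parent directories.
loaded = load_dotenv()

NEO4J_USERNAME = os.getenv("NEO4J_USERNAME")
NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
API_ENDPOINT = os.getenv("API_ENDPOINT")
API_VERSION = os.getenv("API_VERSION")
API_KEY = os.getenv("API_KEY")
deployment_name = os.getenv("deployment_name")

if loaded:
    print("✅ Credentials loaded successfully from .env file.")
else:
    print("❌ Error: Could not load the .env file. Please ensure it exists in the correct location.")

# Create the Azure OpenAI client used by extract_entities()
client = AzureOpenAI(
    azure_endpoint=API_ENDPOINT,
    api_key=API_KEY,
    api_version=API_VERSION,
)

# Create the Neo4j driver used by process_files() and the verification cell
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))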

## Azure Blob Storage Setup

Now, we'll set up the connection to Azure Blob Storage. The account URL and container name for the workshop dataset are already filled in below. No credential is passed, so the client connects anonymously.
We'll use this connection to read the Form 13F data files.

In [None]:
account_url = "https://neo4jdataset.blob.core.windows.net/"  
container_name = "form13-raw"
blob_service_client = BlobServiceClient(account_url=account_url)
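
To get a feel for what the container holds, one option is to list a few blob names before downloading anything. The sketch below assumes the container permits anonymous listing, which the notebook does not confirm, and the `raw_` prefix is inferred from the sample file names used later. `first_blob_names` is a hypothetical helper, written against any object that exposes a `list_blobs(name_starts_with=...)` method, as the SDK's container client does.

```python
from itertools import islice


def first_blob_names(container_client, prefix="raw_", limit=5):
    """Return up to `limit` blob names in the container that start with `prefix`."""
    blobs = container_client.list_blobs(name_starts_with=prefix)
    return [blob.name for blob in islice(blobs, limit)]


# In the notebook this could be called with the client created above
# (assumption: anonymous listing is allowed on this container):
# container_client = blob_service_client.get_container_client(container_name)
# print(first_blob_names(container_client))
```

Using `islice` stops the paged listing early instead of enumerating every blob in the container.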

## Helper Functions

We'll define a few helper functions to make the code more organized and reusable.
These functions will handle reading XML data from Azure Blob Storage, extracting entities using OpenAI, and creating nodes in Neo4j.

In [None]:
def read_xml_from_azure(filename):
    """Download a raw filing and return only its embedded XML portion."""
    try:
        blob_client = blob_service_client.get_blob_client(container=container_name, blob=filename)
        content = blob_client.download_blob(max_concurrency=1, encoding='UTF-8').readall()

        # The raw filing wraps the XML in other text, so find the actual XML content
        xml_start = content.find('<?xml version="1.0"')
        xml_end = content.find('</XML>')

        if xml_start != -1 and xml_end != -1:
            # Include the closing </XML> tag
            return content[xml_start:xml_end + len('</XML>')]

        print(f"Could not find XML content in {filename}")
        return None

    except Exception as e:
        print(f"Error reading file {filename}: {e}")
        return None

def extract_entities(xml_content):
    if not xml_content:
        return None
        
    prompt = """
    Extract the following information from the XML content and return it as a JSON object:
    * "managerName": The text content of the <name> tag under <filingManager>
    * "street1": The text content of the <com:street1> tag under <address>
    * "street2": The text content of the <com:street2> tag under <address> (if present)
    * "city": The text content of the <com:city> tag under <address>
    * "stateOrCounty": The text content of the <com:stateOrCountry> tag under <address>
    * "zipCode": The text content of the <com:zipCode> tag under <address>

    Return the JSON object without any markdown formatting or code block indicators.
    
    XML Content:
    {xml_text}
    """.format(xml_text=xml_content)

    result = None  # Keeps the last raw reply available to the error handler
    try:
        response = client.chat.completions.create(
            model=deployment_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300
        )

        result = response.choices[0].message.content
        print("OpenAI Response:", result)  # Debug print

        # Clean up the response
        # Remove markdown code block indicators and any leading/trailing whitespace
        cleaned_result = result.replace('```json', '').replace('```', '').strip()

        return json.loads(cleaned_result)

    except Exception as e:
        print(f"Error extracting entities: {e}")
        print(f"Response content: {result if result is not None else 'No response'}")
        return None


def create_nodes(tx, data):
    if not data:
        return
    
    # Create Manager node
    if data.get("managerName"):
        tx.run("MERGE (m:Manager {name: $name})", 
               name=data["managerName"])

    # Filter out None values from address properties
    address_props = {k: v for k, v in data.items() 
                    if k in ["street1", "street2", "city", "stateOrCounty", "zipCode"] 
                    and v is not None}  # Only include non-None values
    
    if address_props.get("street1"):  # Only create address if at least street1 exists
        # Dynamically build the Cypher query based on available properties
        props_string = ", ".join(f"{k}: ${k}" for k in address_props.keys())
        address_query = f"""
            MERGE (a:Address {{{props_string}}})
        """
        tx.run(address_query, **address_props)

        # Create relationship between Manager and Address
        if data.get("managerName"):
            tx.run("""
                MATCH (m:Manager {name: $name})
                MATCH (a:Address {street1: $street1})
                MERGE (m)-[:HAS_ADDRESS]->(a)
            """, name=data["managerName"], street1=address_props["street1"])


def process_files(file_names):
    for file_name in file_names:
        print(f"Processing {file_name}")
        xml_content = read_xml_from_azure(file_name)
        
        if xml_content:
            entities = extract_entities(xml_content)
            if entities:
                with driver.session() as session:
                    session.execute_write(create_nodes, entities)
                print(f"Processed entities for {file_name}")
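
The cleanup step in `extract_entities` strips code-fence markers before calling `json.loads`. A slightly more forgiving variant, sketched below as an illustration rather than as the lab's method, takes everything between the first `{` and the last `}` so that stray text around the JSON is ignored too. The function name `parse_json_reply` and the sample reply are invented for this example.

```python
import json
import re


def parse_json_reply(reply):
    """Extract a JSON object from a model reply, tolerating fences and extra text."""
    # Greedy match from the first '{' to the last '}' across multiple lines.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


fence = "`" * 3  # three backticks, built here so the example stays readable
reply = fence + 'json\n{"managerName": "Example Capital LLC", "city": "Boston"}\n' + fence

print(parse_json_reply(reply))           # {'managerName': 'Example Capital LLC', 'city': 'Boston'}
print(parse_json_reply("no JSON here"))  # None
```

Returning `None` on failure mirrors how the helpers above signal that a file should be skipped.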

## Main Execution
This is the main part of the notebook. It defines the sample files to process and calls the helper functions to extract the data and load it into Neo4j.

In [None]:
if __name__ == "__main__":
    sample_files = [
        'raw_2023-07-18_archives_edgar_data_1108893_0001108893-23-000005.txt',
        'raw_2023-07-18_archives_edgar_data_1488921_0001085146-23-002736.txt',
        'raw_2023-07-18_archives_edgar_data_1163165_0001104659-23-081874.txt',
        'raw_2023-07-18_archives_edgar_data_1567459_0000950123-23-006124.txt'
    ]
    process_files(sample_files)
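
Because the fields requested in the prompt are plain XML tags, the LLM's answer for a file can be cross-checked with a deterministic parse. The sketch below uses `ElementTree`, which the notebook already imports, and ignores namespaces by comparing local tag names only. The sample document and its namespace URIs are invented for illustration; real filings may differ. Note also that the text returned by `read_xml_from_azure` ends with the wrapper's closing `</XML>` tag, which would need to be trimmed before parsing.

```python
from xml.etree import ElementTree as ET

ADDRESS_TAGS = {"street1", "street2", "city", "stateOrCountry", "zipCode"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def parse_manager_fields(xml_text):
    """Collect the filing manager's name and address fields without an LLM."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        if local_name(element.tag) != "filingManager":
            continue
        for child in element.iter():
            name = local_name(child.tag)
            text = (child.text or "").strip()
            if not text:
                continue
            if name == "name":
                fields["managerName"] = text
            elif name in ADDRESS_TAGS:
                # Keys here follow the XML tag names, so this one is
                # "stateOrCountry" rather than the prompt's "stateOrCounty".
                fields[name] = text
        break  # only the first <filingManager> block is of interest
    return fields


# Hypothetical sample; the namespace URIs are placeholders, not the real ones.
sample_xml = """<edgarSubmission xmlns="http://example.com/filer"
                 xmlns:com="http://example.com/common">
  <formData><coverPage><filingManager>
    <name>Example Capital LLC</name>
    <address>
      <com:street1>1 Main Street</com:street1>
      <com:city>Boston</com:city>
      <com:stateOrCountry>MA</com:stateOrCountry>
      <com:zipCode>02110</com:zipCode>
    </address>
  </filingManager></coverPage></formData>
</edgarSubmission>"""

print(parse_manager_fields(sample_xml))
```

Comparing this dictionary with the one returned by `extract_entities` for the same file gives a quick signal of whether the model dropped or altered a field.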

## Verify
This cell runs a Cypher query that retrieves the extracted data from Neo4j and prints each manager together with its address:
`MATCH (m:Manager)-[r:HAS_ADDRESS]->(a:Address) RETURN m.name, r, a.street1, a.street2, a.city, a.stateOrCounty, a.zipCode`

In [None]:
with driver.session() as session:
    results = session.run("""
        MATCH (m:Manager)-[r:HAS_ADDRESS]->(a:Address)
        RETURN m.name, r, a.street1, a.street2, a.city, a.stateOrCounty, a.zipCode
    """)

    # Print the results in a user-friendly format
    print("Managers and their Addresses:\n")
    for record in results:
        print(f"Manager: {record['m.name']}")
        # Skip missing parts (such as street2) so no empty ", ," gaps appear
        parts = [record['a.street1'], record['a.street2'], record['a.city'],
                 record['a.stateOrCounty'], record['a.zipCode']]
        print(f"  Address: {', '.join(str(p) for p in parts if p)}\n")

# Close the driver only after the session has been released
driver.close()

You can verify the extracted entities by running the following Cypher query in the Neo4j Query Editor:

```cypher
MATCH (m:Manager)-[r:HAS_ADDRESS]->(a:Address) 
RETURN m, r, a;
```

**Note** Congratulations! You have successfully used an LLM to extract entities from an unstructured file and created a small knowledge graph in Neo4j.