# Extract financial information from news using the VectorStoreIndex base

Use double merging to extract financial information from news articles.

https://neo4j.com/labs/genai-ecosystem/llamaindex/


In [1]:
# %pip install python-dotenv
# %pip install llama_index neo4j
# %pip install llama-index-llms-openai
# %pip install llama-index-vector-stores-neo4jvector
# %pip install llama-index-graph-stores-neo4j
# %pip install llama-index-embeddings-openai
# %pip install numpy
# %pip install spacy
# !python3 -m spacy download en_core_web_md

In [2]:
from dotenv import load_dotenv
import os
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

import nest_asyncio
nest_asyncio.apply()

# Load environment variables from the .env file
load_dotenv('../.env')

# Read the OpenAI API key
openai_key = os.getenv("OPENAI_API_KEY")

llm = OpenAI(model="gpt-4o-mini", api_key=openai_key)
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

Settings.llm = llm
Settings.embed_model = embed_model
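
If `OPENAI_API_KEY` is missing from the `.env` file, `os.getenv` returns `None`, and the problem may only surface later, at the first API call. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (it is not part of `python-dotenv` or LlamaIndex):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if unset or empty.

    `require_env` is a hypothetical helper for this notebook, not a library API.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file path."
        )
    return value
```

It could be called right after `load_dotenv('../.env')`, for example `openai_key = require_env("OPENAI_API_KEY")`.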


In [4]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core import PropertyGraphIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

documents = SimpleDirectoryReader(input_files=['../data/bao-chi/nvidia.md']).load_data()
print("length of documents:", len(documents))
print('documents:', documents[0])



length of documents: 1
documents: Doc ID: 6f2f4fc2-30e6-4bb0-b9c0-b2f19f4b2df7
Text: Nvidia to take Intel's spot on Dow Jones Industrial Average  By
Arsheeya Bajwa  (Reuters) -Intel will be replaced by Nvidia
(NASDAQ:NVDA) on the blue-chip Dow Jones Industrial Average index
after a 25-year run, underscoring the shift in the chipmaking market
and marking another setback for the struggling semiconductor firm.
Nvidia will join the...



# Workflow overview

We extract financial information from news articles using the VectorStoreIndex base. The process involves several steps:

1. **Installation of Required Packages**: We install necessary Python packages such as `llama_index`, `neo4j`, `numpy`, and `spacy`.

2. **Environment Setup**: We load environment variables and set up the OpenAI API key for accessing language models and embeddings.

3. **Document Loading**: We load the news article from a specified file (`../data/bao-chi/nvidia.md`) by passing it to `SimpleDirectoryReader` through `input_files`.

4. **Node Parsing and Splitting**: We use `SemanticDoubleMergingSplitterNodeParser` to split the documents into nodes based on different merging and appending thresholds.

5. **Node Display**: Finally, we display the content of the nodes to verify the extracted information.

This workflow splits a news article into semantically coherent chunks, which is the preparatory step for extracting and analyzing financial information with language models and embeddings.
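
To make step 4 more concrete, the following is a toy, pure-Python illustration of the two-pass idea behind double merging. It is not the library's implementation: a bag-of-words cosine stands in for the spaCy similarity, `initial_threshold` and `max_chunk_size` are ignored, and the function names (`first_pass`, `second_pass`, `double_merge`) are hypothetical. The sample sentences are shortened from the Nvidia article.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words counts; a crude stand-in for a spaCy document vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(x, y):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a, b = bow(x), bow(y)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def first_pass(sentences, appending_threshold):
    """Pass 1: append a sentence to the current chunk if it is similar enough."""
    chunks = []
    for sentence in sentences:
        if chunks and similarity(chunks[-1], sentence) >= appending_threshold:
            chunks[-1] = chunks[-1] + " " + sentence
        else:
            chunks.append(sentence)
    return chunks


def second_pass(chunks, merging_threshold):
    """Pass 2: merge a chunk with the next one, or with the one after it
    (absorbing the chunk in between), when similarity clears the threshold."""
    merged = []
    i = 0
    while i < len(chunks):
        current = chunks[i]
        i += 1
        while i < len(chunks):
            if similarity(current, chunks[i]) >= merging_threshold:
                current = current + " " + chunks[i]
                i += 1
            elif (
                i + 1 < len(chunks)
                and similarity(current, chunks[i + 1]) >= merging_threshold
            ):
                # Skip over one unrelated chunk and absorb it as well.
                current = current + " " + chunks[i] + " " + chunks[i + 1]
                i += 2
            else:
                break
        merged.append(current)
    return merged


def double_merge(sentences, appending_threshold, merging_threshold):
    """Run both passes in sequence and return the final chunks."""
    return second_pass(first_pass(sentences, appending_threshold), merging_threshold)


sample = [
    "Nvidia will join the Dow Jones Industrial Average next week.",
    "Nvidia replaces Intel on the Dow Jones Industrial Average.",
    "Intel's shares have declined 54% this year.",
    "Intel has ceded its manufacturing edge to rival TSMC.",
]
for threshold in (0.2, 0.5, 0.8):
    print(threshold, len(double_merge(sample, threshold, threshold)))
```

In this sketch, as in the notebook's results, higher thresholds make merges harder, so they tend to produce more and smaller chunks.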

In [15]:
from llama_index.core.node_parser import (
    SemanticDoubleMergingSplitterNodeParser,
    LanguageConfig,
)
from llama_index.core import SimpleDirectoryReader


config = LanguageConfig(language="english", spacy_model="en_core_web_md")
splitter = SemanticDoubleMergingSplitterNodeParser(
    language_config=config,
    initial_threshold=0.4,
    appending_threshold=0.5,
    merging_threshold=0.5,
    max_chunk_size=5000,
)
# Candidate merging_threshold and appending_threshold values to test
merging_thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
appending_thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Test the splitter with different merging_threshold and appending_threshold values
for merging_threshold in merging_thresholds:
    for appending_threshold in appending_thresholds:
        splitter.merging_threshold = merging_threshold
        splitter.appending_threshold = appending_threshold
        nodes = splitter.get_nodes_from_documents(documents)
        print(f"Merging threshold: {merging_threshold}, Appending threshold: {appending_threshold}, Number of nodes: {len(nodes)}")

Merging threshold: 0.3, Appending threshold: 0.3, Number of nodes: 1
Merging threshold: 0.3, Appending threshold: 0.4, Number of nodes: 1
Merging threshold: 0.3, Appending threshold: 0.5, Number of nodes: 1
Merging threshold: 0.3, Appending threshold: 0.6, Number of nodes: 1
Merging threshold: 0.3, Appending threshold: 0.7, Number of nodes: 1
Merging threshold: 0.3, Appending threshold: 0.8, Number of nodes: 1
Merging threshold: 0.3, Appending threshold: 0.9, Number of nodes: 1
Merging threshold: 0.4, Appending threshold: 0.3, Number of nodes: 2
Merging threshold: 0.4, Appending threshold: 0.4, Number of nodes: 2
Merging threshold: 0.4, Appending threshold: 0.5, Number of nodes: 2
Merging threshold: 0.4, Appending threshold: 0.6, Number of nodes: 2
Merging threshold: 0.4, Appending threshold: 0.7, Number of nodes: 2
Merging threshold: 0.4, Appending threshold: 0.8, Number of nodes: 2
Merging threshold: 0.4, Appending threshold: 0.9, Number of nodes: 2
Merging threshold: 0.5, Appending 
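
In the rows that survive in the truncated output above, the node count changes with the merging threshold but not with the appending threshold. A small sketch of how sweep results could be collected in a dictionary and grouped to make that visible. The `results` values are copied from the visible rows only, and `counts_by_merging_threshold` is a hypothetical helper:

```python
from collections import defaultdict

APPENDING = (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)

# (merging_threshold, appending_threshold) -> number of nodes,
# copied from the rows visible in the truncated output.
results = {(0.3, a): 1 for a in APPENDING}
results.update({(0.4, a): 2 for a in APPENDING})


def counts_by_merging_threshold(results):
    """Group node counts by merging threshold, ignoring the appending threshold."""
    grouped = defaultdict(set)
    for (merging, _appending), count in results.items():
        grouped[merging].add(count)
    return dict(grouped)


summary = counts_by_merging_threshold(results)
for merging, counts in sorted(summary.items()):
    print(f"merging_threshold={merging}: node counts {sorted(counts)}")
```

A merging threshold that maps to a single node count means the appending threshold made no difference for that row group. Inside the notebook loop, the same dictionary could be filled with `results[(merging_threshold, appending_threshold)] = len(nodes)` instead of printing each line.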

In [17]:
splitter.merging_threshold = 0.9
splitter.appending_threshold = 0.8
nodes = splitter.get_nodes_from_documents(documents)
print(f"Number of nodes: {len(nodes)}")

Number of nodes: 12


In [22]:
from llama_index.core.response.notebook_utils import display_source_node

for node in nodes:
    print('-------------------------')
    print(node.get_content())

-------------------------
Nvidia to take Intel's spot on Dow Jones Industrial Average

By Arsheeya Bajwa

(Reuters) -Intel will be replaced by Nvidia (NASDAQ:NVDA) on the blue-chip Dow Jones Industrial Average index after a 25-year run, underscoring the shift in the chipmaking market and marking another setback for the struggling semiconductor firm. Nvidia will join the index next week along with paint-maker Sherwin-Williams (NYSE:SHW) , which will replace Dow, S&P Dow Jones Indices said on Friday.
-------------------------
Once the dominant force in chipmaking, Intel (NASDAQ:INTC) has in recent years ceded its manufacturing edge to rival TSMC and missed out on the generative artificial intelligence boom after missteps including passing on an investment in ChatGPT-owner OpenAI. Intel's shares have declined 54% this year, making the company the worst performer on the index and leaving it with the lowest stock price on the price-weighted Dow.
-------------------------
-------------------