# Unstructured data

## Structured and Unstructured Data

As you have learned in previous lessons, a key challenge in data science is making sense of unstructured data. In this lesson, you will explore a strategy for storing unstructured data in a graph.

Vector indexes and embeddings go some way to allow you to search and query unstructured data, *but they are not a complete solution. You can use the metadata surrounding the unstructured data to help make sense of it.*

Imagine the following use case. You want to analyze customer emails to:

- Understand the customer sentiment (are they happy or unhappy?)

- Identify any products or services

You could represent this data in a graph of `Email`, `Customer`, and `Product` nodes.

<img 
    src="https://graphacademy.neo4j.com/courses/llm-vectors-unstructured/3-unstructured-data/1-structured-unstructured/images/email-graph.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>

An import for this process would have to:

1. Extract the email metadata (date, sender, recipient, subject)

2. Embed the email text

3. Extract the customer sentiment using a vector index

4. Search for references to products or services in the email text
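Step 1 of the list above could be sketched with Python's standard library `email` module. The raw email below is invented for illustration, and the dictionary keys (`sender`, `recipient`, `subject`, `date`, `text`) are assumptions rather than a schema prescribed by this lesson.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw email, invented for this example.
RAW_EMAIL = """\
From: alice@example.com
To: support@example.com
Subject: Problem with my order
Date: Mon, 04 Mar 2024 09:30:00 +0000

The headphones I ordered arrived damaged.
"""


def extract_email_metadata(raw: str) -> dict:
    """Parse the headers and body of a plain-text email into a dictionary."""
    message = message_from_string(raw)
    return {
        "sender": message["From"],
        "recipient": message["To"],
        "subject": message["Subject"],
        # Convert the RFC 2822 date header into an ISO date string
        "date": parsedate_to_datetime(message["Date"]).date().isoformat(),
        # The body is the unstructured text you would later embed
        "text": message.get_payload().strip(),
    }


metadata = extract_email_metadata(RAW_EMAIL)
print(metadata)
```

The metadata could become properties on the `Email` node and relationships to `Customer` nodes, while the `text` value is what you would embed.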

By importing the unstructured data into a graph, you can use the known relationships between the data to help make sense of it.

For example, you could use the graph to answer questions like:

- What products are customers talking about positively in their emails?

- Are there times in the year when customers are more likely to complain?

- What are customers saying about a particular product?

## Course data

During this module, you will use Python and LangChain to import the text of a GraphAcademy course into Neo4j.

GraphAcademy represents courses as a graph of `Course`, `Module`, and `Lesson` nodes. A course has modules, and a module has lessons.

A simplistic view of the graph would look like this:

<img 
    src="https://graphacademy.neo4j.com/courses/llm-vectors-unstructured/3-unstructured-data/1-structured-unstructured/images/graphacademy-lessons.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>

The [GraphAcademy course content](https://github.com/neo4j-graphacademy/courses) is in a public GitHub repository. We write courses in plain text [AsciiDoc](https://asciidoc.org/) that is parsed and displayed on the GraphAcademy website.

The course content is unstructured, but you can make sense of it by using the metadata (the course structure), embeddings, and vector indexes.

View [this lesson’s content on GitHub](https://github.com/neo4j-graphacademy/courses/blob/main/asciidoc/courses/llm-vectors-unstructured/modules/3-unstructured-data/lessons/1-structured-unstructured/lesson.adoc?plain=1) and note the following:

1. The lesson content is written in plain text and is unstructured.

2. The file name is `lesson.adoc`.

3. All lessons have the same file name.

4. The directory structure denotes the course (`llm-vectors-unstructured`), module (`3-unstructured-data`), and lesson (`1-structured-unstructured`).

You will use these files and directory structure to create the graph of the course content.
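As a minimal sketch of how the directory structure can supply metadata, the function below reads the course, module, and lesson names from a file path. The function name `get_course_data` and the returned keys are assumptions made for this example; the path is the one from the GitHub link above.

```python
from pathlib import PurePosixPath


def get_course_data(source: str) -> dict:
    """Derive course, module, and lesson names from a lesson file path."""
    parts = PurePosixPath(source).parts
    # Each name is the directory that follows a fixed parent directory
    return {
        "course": parts[parts.index("courses") + 1],
        "module": parts[parts.index("modules") + 1],
        "lesson": parts[parts.index("lessons") + 1],
    }


path = (
    "asciidoc/courses/llm-vectors-unstructured/modules/"
    "3-unstructured-data/lessons/1-structured-unstructured/lesson.adoc"
)
data = get_course_data(path)
print(data)
```

Because every lesson file is called `lesson.adoc`, the file name carries no information; only the parent directories identify the lesson.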

## Chunking

When dealing with large amounts of data, breaking it into smaller, more manageable parts is helpful. This process is called chunking.

Smaller pieces of data are easier to work with and process. Embedding models also have size (token) limits and can only handle a certain amount of data.

Embedding large amounts of text may also be less valuable. For example, if you are trying to find a document that references a specific topic, the meaning may be lost in the whole document. Instead, you may only need the paragraph or sentence that contains the relevant information. Conversely, small amounts of data may not contain enough context to be useful.

In this lesson, you will explore strategies for chunking and storing data in a graph.

### Strategies

There are countless strategies for splitting data into chunks, and the best approach depends on the data and the problem you are trying to solve.

It may be that the unstructured data you are working with is already in a format that is easy to split. For example, if you were looking to chunk an API’s technical documentation, you could split the data by method, endpoint, or parameter.

Alternatively, you may be working with a collection of unrelated PDF documents, and splitting by section, paragraph, or sentence may be the only choice.

Strategies for chunking data include:

- **Size** - Splitting data into equal-sized chunks.

- **Word, Sentence, Paragraph** - Breaking down text data into individual sections.

- **N-Grams** - Creating chunks of n consecutive words or characters.

- **Topic Segmentation** - Dividing text into sections based on topic changes.

- **Event Detection** - Identifying specific events or activities.

- **Semantic Segmentation** - Dividing data into regions with different semantic meanings (objects, background, etc).

It may also be helpful to combine multiple strategies. For example, you could split a document into paragraphs and then further split each paragraph into topic changes - this would allow you to store and query the data at different levels of granularity.
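A few of the simpler strategies above can be sketched in plain Python. These helpers are illustrative only: the function names are invented for this example, and the sentence splitter is deliberately naive (it breaks after `.`, `!`, or `?` followed by whitespace), so it would mis-handle abbreviations.

```python
import re


def chunk_by_size(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` characters (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_paragraph(text: str) -> list[str]:
    """Split text on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_by_sentence(text: str) -> list[str]:
    """Naively split text after sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_ngrams(text: str, n: int) -> list[str]:
    """Create overlapping chunks of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "Graphs store data. They connect it too.\n\nChunks keep text small."

# Combining strategies: paragraphs first, then sentences within each paragraph
nested = [chunk_by_sentence(p) for p in chunk_by_paragraph(sample)]
print(nested)
```

The nested result keeps both levels of granularity, which is what lets you store and query paragraphs and sentences separately.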

### Storing Chunks

How you store the chunks depends on the data, what the chunks represent, and how you intend to use the data.

It is a good idea to name the nodes and relationships in a way that makes it easy to understand the data and how it is related. For example, if you split a set of documents by paragraph, you could name the nodes `Document` and `Paragraph` with a relationship `CONTAINS`. Alternatively, if you split a document by an arbitrary size value or character, you may simply use the node label `Chunk`.

You can store embeddings for individual chunks and create relationships between chunks to capture context and relationships.

You may also want to store metadata about the chunks, such as the position in the original data, the size, and any other relevant information.

When storing the course content, you will create a node for each `Paragraph` chunk and a relationship `CONTAINS` between the `Lesson` and `Paragraph` nodes.

<img 
    src="https://graphacademy.neo4j.com/courses/llm-vectors-unstructured/3-unstructured-data/2-chunking/images/graphacademy-lessons-paragraph.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>
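One possible way to prepare this structure is sketched below: a function builds one row per paragraph, including position and size metadata, and a Cypher statement turns the rows into `Lesson` and `Paragraph` nodes joined by `CONTAINS`. The property names (`name`, `text`, `position`, `size`), the function name, and the sample text are assumptions for illustration, and embeddings are left out to keep the example short.

```python
# Cypher using the labels and relationship type described above.
# The property names are assumptions made for this sketch.
CREATE_PARAGRAPHS = """
UNWIND $rows AS row
MERGE (l:Lesson {name: row.lesson})
CREATE (p:Paragraph {text: row.text, position: row.position, size: row.size})
CREATE (l)-[:CONTAINS]->(p)
"""


def paragraph_rows(lesson_name: str, text: str) -> list[dict]:
    """Build one parameter row per paragraph, with position and size metadata."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"lesson": lesson_name, "text": p, "position": i, "size": len(p)}
        for i, p in enumerate(paragraphs)
    ]


rows = paragraph_rows("1-structured-unstructured", "First paragraph.\n\nSecond one.")
print(rows)
```

With a connected Neo4j Python driver, the rows could be passed as a query parameter, for example `driver.execute_query(CREATE_PARAGRAPHS, rows=rows)`.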

The course content is organized in the following directory structure:

- `asciidoc` - contains all the course content in AsciiDoc format
    - `courses` - the course content
        - `llm-fundamentals` - the course name
            - `modules` - contains numbered directories for each module
                - `01-name` - the module name
                    - `lessons` - contains numbered directories for each lesson
                        - `01-name` - the lesson name
                            - `lesson.adoc` - the lesson content

### Load the content and chunk it

You can now load the content and chunk it using Python and LangChain.

You will split the lesson content into chunks of text, around 1500 characters long, with each chunk containing one or more paragraphs. Paragraphs in the content are separated by two newline characters (`\n\n`).

In [1]:
#!uv pip install langchain_community

In [2]:
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader

COURSES_PATH = "llm-vectors-unstructured/data/asciidoc"

# Load lesson documents
loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

In [3]:
docs[0]

Document(metadata={'source': 'llm-vectors-unstructured/data/asciidoc/courses/llm-fundamentals/modules/3-intro-to-langchain/lessons/6-retrievers/lesson.adoc'}, page_content='= Retrievers\n:order: 8\n:type: lesson\n:disable-cache: true\n\nlink:https://python.langchain.com/v0.2/docs/integrations/retrievers/[Retrievers^] are Langchain chain components that allow you to retrieve documents using an unstructured query.\n\n    Find a movie plot about a robot that wants to be human.\n\nDocuments are any unstructured text that you want to retrieve. A retriever often uses a vector store as its underlying data structure.\n\nRetrievers are a key component for creating models that can take advantage of Retrieval Augmented Generation (RAG).\n\nPreviously, you loaded embeddings and created a vector index of Movie plots - in this example, the movie plots are the _documents_, and a _retriever_ could be used to give a model context.\n\nIn this lesson, you will create a _retriever_ to retrieve documents f

In [4]:
#!uv pip install langchain==0.3.9

In [5]:
from langchain.text_splitter import CharacterTextSplitter

# Create a text splitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

In [6]:
# Split documents into chunks
chunks = text_splitter.split_documents(docs)

print(len(chunks))

Created a chunk of size 2848, which is longer than the specified 1500


89


In [7]:
chunks[0]

Document(metadata={'source': 'llm-vectors-unstructured/data/asciidoc/courses/llm-fundamentals/modules/3-intro-to-langchain/lessons/6-retrievers/lesson.adoc'}, page_content='= Retrievers\n:order: 8\n:type: lesson\n:disable-cache: true\n\nlink:https://python.langchain.com/v0.2/docs/integrations/retrievers/[Retrievers^] are Langchain chain components that allow you to retrieve documents using an unstructured query.\n\n    Find a movie plot about a robot that wants to be human.\n\nDocuments are any unstructured text that you want to retrieve. A retriever often uses a vector store as its underlying data structure.\n\nRetrievers are a key component for creating models that can take advantage of Retrieval Augmented Generation (RAG).\n\nPreviously, you loaded embeddings and created a vector index of Movie plots - in this example, the movie plots are the _documents_, and a _retriever_ could be used to give a model context.\n\nIn this lesson, you will create a _retriever_ to retrieve documents f

### Splitting

The content isn’t split simply on a character (`\n\n`) or into a fixed number of characters. The process is more involved: each chunk should be as close to the maximum size as possible while only breaking at the separator.

In this example, the `split_documents` method does the following:

1. Splits the documents into paragraphs (using the separator - `\n\n`)

2. Combines the paragraphs into chunks of text that are up to 1500 characters long (`chunk_size`)

    - if a single paragraph is longer than 1500 characters, the method will not split the paragraph but will create a chunk larger than 1500 characters

3. Adds the last paragraph in a chunk to the start of the next chunk to create an overlap between chunks.

    - if the last paragraph in a chunk is more than 200 characters (`chunk_overlap`), it will not be added to the next chunk

This process ensures that:

- Chunks are never too small.

- A paragraph is never split between chunks.

- Chunks are significantly different, and the overlap doesn’t result in a lot of repeated content.

Investigate what happens when you modify the `separator`, `chunk_size` and `chunk_overlap` parameters.
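To make the steps above concrete, here is a simplified pure-Python sketch of the same idea. It is not LangChain's implementation, and the function name `merge_paragraphs` is invented for this example; it only mirrors the behaviour described in this section (split on the separator, pack paragraphs up to `chunk_size`, carry trailing paragraphs forward when they fit within `chunk_overlap`).

```python
def merge_paragraphs(text, separator="\n\n", chunk_size=1500, chunk_overlap=200):
    """Pack paragraphs into chunks of up to chunk_size characters with overlap."""
    paragraphs = [p for p in text.split(separator) if p.strip()]
    chunks = []
    current = []  # paragraphs in the chunk being built

    def length(parts):
        return len(separator.join(parts))

    for paragraph in paragraphs:
        # Close the current chunk if adding this paragraph would exceed the limit
        if current and length(current + [paragraph]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading paragraphs until the remainder fits the overlap
            # budget and leaves room for the new paragraph
            while current and (
                length(current) > chunk_overlap
                or length(current + [paragraph]) > chunk_size
            ):
                current = current[1:]
        # A paragraph is never split, so an oversized one becomes its own chunk
        current.append(paragraph)

    if current:
        chunks.append(separator.join(current))
    return chunks


text = "\n\n".join(["a" * 40, "b" * 40, "c" * 10, "d" * 40])
chunks = merge_paragraphs(text, chunk_size=100, chunk_overlap=20)
for chunk in chunks:
    print(len(chunk), repr(chunk[:12]))
```

In this toy input, the short third paragraph (10 characters) fits within the overlap budget of 20, so it ends the first chunk and also starts the second.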


### Create vector index

Once you have chunked the content, you can use the LangChain [Neo4jVector](https://python.langchain.com/docs/integrations/vectorstores/neo4jvector) and [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.openai.OpenAIEmbeddings.html) classes to create the embeddings, the vector index, and store the chunks in a Neo4j graph database.

The `Neo4jVector.from_documents` method:

1. Creates embeddings for each chunk using the `OpenAIEmbeddings` object.

2. Creates nodes with the label `Chunk` and the properties `text` and `embedding` in the Neo4j database.

3. Creates a vector index called `chunkVector`.

In [8]:
#!uv pip install langchain_neo4j
#!uv pip install langchain_openai

In [9]:
from langchain_neo4j import Neo4jVector
from langchain_openai import OpenAIEmbeddings

neo4j_db = Neo4jVector.from_documents(
    chunks,
    OpenAIEmbeddings(openai_api_key=os.getenv('OPENAI_API_KEY')),
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD'),
    database="neo4j",  
    index_name="chunkVector",
    node_label="Chunk", 
    text_node_property="text", 
    embedding_node_property="embedding",  
)

In [10]:
import textwrap
from neo4j import GraphDatabase
from utils import execute_query


neo4j_uri = os.getenv("NEO4J_URI")
neo4j_user = os.getenv("NEO4J_USERNAME")
neo4j_pass = os.getenv("NEO4J_PASSWORD")
# Named neo4j_database so it does not overwrite the Neo4jVector store held in neo4j_db
neo4j_database = os.getenv("NEO4J_DATABASE")


neo4j_driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_pass))


cypher = textwrap.dedent("""
MATCH (c:Chunk) RETURN c LIMIT 25
""")

res = execute_query(neo4j_driver, cypher)

print(res)

[{'c': {'text': '= Retrievers\n:order: 8\n:type: lesson\n:disable-cache: true\n\nlink:https://python.langchain.com/v0.2/docs/integrations/retrievers/[Retrievers^] are Langchain chain components that allow you to retrieve documents using an unstructured query.\n\n    Find a movie plot about a robot that wants to be human.\n\nDocuments are any unstructured text that you want to retrieve. A retriever often uses a vector store as its underlying data structure.\n\nRetrievers are a key component for creating models that can take advantage of Retrieval Augmented Generation (RAG).\n\nPreviously, you loaded embeddings and created a vector index of Movie plots - in this example, the movie plots are the _documents_, and a _retriever_ could be used to give a model context.\n\nIn this lesson, you will create a _retriever_ to retrieve documents from the movie plots vector index.\n\n== Neo4jVector\n\nThe link:https://python.langchain.com/v0.2/docs/integrations/vectorstores/neo4jvector/[`Neo4jVector`^

You can also query the vector index to find similar chunks. For example, you can find lesson chunks relating to a specific question, "What does Hallucination mean?":

```cypher
WITH genai.vector.encode(
    "What does Hallucination mean?",
    "OpenAI",
    { token: "sk-..." }) AS userEmbedding
CALL db.index.vector.queryNodes('chunkVector', 6, userEmbedding)
YIELD node, score
RETURN node.text, score
```

In [11]:
from utils import create_embedding

# You'll need to generate the embedding in Python first
embedding = create_embedding("What does Hallucination mean?")


cypher = textwrap.dedent(f"""
    WITH {embedding} AS userEmbedding
    CALL db.index.vector.queryNodes('chunkVector', 6, userEmbedding)
    YIELD node, score
    RETURN node.text, score
""")

result = execute_query(neo4j_driver, cypher)

result

[{'node.text': '= Avoiding Hallucination\n:order: 2\n:type: lesson\n\nAs you learned in the previous lesson, LLMs can "make things up".\n\nLLMs are designed to generate human-like text based on the patterns they\'ve identified in vast amounts of data. \n\nDue to their reliance on patterns and the sheer volume of training information, LLMs sometimes **hallucinate** or produce outputs that manifest as generating untrue facts, asserting details with unwarranted confidence, or crafting plausible yet nonsensical explanations.\n\nThese manifestations arise from a mix of _overfitting_, biases in the training data, and the model\'s attempt to generalize from vast amounts of information.\n\n== Common Hallucination Problems\n\nLet\'s take a closer look at some reasons why this may occur.\n\n=== Temperature\n\nLLMs have a _temperature_, corresponding to the amount of randomness the underlying model should use when generating the text.\n\nThe higher the temperature value, the more random the gener
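Conceptually, the `db.index.vector.queryNodes` call above returns the chunks whose embeddings are most similar to the question's embedding. The sketch below shows that idea as a brute-force search using cosine similarity over made-up two-dimensional vectors. It is an illustration only: Neo4j's vector index uses approximate nearest-neighbour search and its own score scaling, so real scores will differ, and the function names and toy data are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, chunks, k):
    """Return the k (text, score) pairs most similar to the query embedding."""
    scored = [
        (text, cosine_similarity(query_embedding, embedding))
        for text, embedding in chunks
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy (text, embedding) pairs, invented for this example
toy_chunks = [
    ("about hallucination", [1.0, 0.0]),
    ("about graphs", [0.0, 1.0]),
    ("mixed", [1.0, 1.0]),
]

results = top_k([1.0, 0.1], toy_chunks, k=2)
for text, score in results:
    print(text, round(score, 3))
```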

You can also query Neo4j with the embedding using LangChain's `Neo4jGraph` class, passing the embedding as a query parameter:

In [12]:
from langchain_neo4j import Neo4jGraph

graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

In [13]:
result = graph.query("""
CALL db.index.vector.queryNodes('chunkVector', 6, $embedding)
YIELD node, score
RETURN node.text, score
""", {"embedding": embedding})

In [14]:
for row in result:
    print(row['node.text'], row['score'])

= Avoiding Hallucination
:order: 2
:type: lesson

As you learned in the previous lesson, LLMs can "make things up".

LLMs are designed to generate human-like text based on the patterns they've identified in vast amounts of data. 

Due to their reliance on patterns and the sheer volume of training information, LLMs sometimes **hallucinate** or produce outputs that manifest as generating untrue facts, asserting details with unwarranted confidence, or crafting plausible yet nonsensical explanations.

These manifestations arise from a mix of _overfitting_, biases in the training data, and the model's attempt to generalize from vast amounts of information.

== Common Hallucination Problems

Let's take a closer look at some reasons why this may occur.

=== Temperature

LLMs have a _temperature_, corresponding to the amount of randomness the underlying model should use when generating the text.

The higher the temperature value, the more random the generated result will become, and the more l