# Unstructured data

## Structured and Unstructured Data

As you have learned in previous lessons, a key challenge in data science is making sense of unstructured data. In this lesson, you will explore a strategy for storing unstructured data in a graph.

Vector indexes and embeddings go some way toward letting you search and query unstructured data, but they are not a complete solution. You can also use the metadata surrounding the unstructured data to help make sense of it.

Imagine the following use case. You want to analyze customer emails to:

- Understand the customer sentiment (are they happy or unhappy?)

- Identify any products or services

You could represent this data in a graph of `Email`, `Customer`, and `Product` nodes.

<img 
    src="https://graphacademy.neo4j.com/courses/llm-vectors-unstructured/3-unstructured-data/1-structured-unstructured/images/email-graph.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>

An import for this process would have to:

1. Extract the email metadata (date, sender, recipient, subject)

2. Embed the email text

3. Extract the customer sentiment using a vector index

4. Search for references to products or services in the email text
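
The first step above can be sketched with Python's standard `email` module. This is a minimal illustration, not the course's import code: the raw email, the `extract_email_metadata` function, and the keys of the returned dictionary are all hypothetical. Embedding the text and extracting sentiment (steps 2 to 4) would need a model and are not shown.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical raw email, used only for illustration.
RAW_EMAIL = """\
From: Jane Doe <jane@example.com>
To: support@example.com
Subject: Problem with my order
Date: Mon, 04 Mar 2024 10:15:00 +0000

The headphones I ordered stopped working after two days.
"""


def extract_email_metadata(raw: str) -> dict:
    """Return the date, sender, recipient, subject, and body of a raw email."""
    message = message_from_string(raw)
    return {
        # Parse the RFC 2822 date header into an ISO 8601 string.
        "date": parsedate_to_datetime(message["Date"]).isoformat(),
        # parseaddr returns (display name, address); keep the address only.
        "sender": parseaddr(message["From"])[1],
        "recipient": parseaddr(message["To"])[1],
        "subject": message["Subject"],
        # For a non-multipart message the payload is the body text.
        "body": message.get_payload().strip(),
    }


metadata = extract_email_metadata(RAW_EMAIL)
print(metadata)
```

Each value could then become a property on an `Email` node, with the sender address identifying the related `Customer` node.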

By importing the unstructured data into a graph, you can use the known relationships between the data to help make sense of it.

For example, you could use the graph to answer questions like:

- What products are customers talking about positively in their emails?

- Are there times in the year when customers are more likely to complain?

- What are customers saying about a particular product?

## Course data

During this module, you will use Python and LangChain to import the text of a GraphAcademy course into Neo4j.

GraphAcademy represents courses as a graph of `Course`, `Module`, and `Lesson` nodes. A course has modules, and a module has lessons.

A simplistic view of the graph would look like this:

<img 
    src="https://graphacademy.neo4j.com/courses/llm-vectors-unstructured/3-unstructured-data/1-structured-unstructured/images/graphacademy-lessons.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>

The [GraphAcademy course content](https://github.com/neo4j-graphacademy/courses) is in a public GitHub repository. We write courses in plain text [AsciiDoc](https://asciidoc.org/) that is parsed and displayed on the GraphAcademy website.

The course content is unstructured, but you can make sense of it by using the metadata (the course structure), embeddings, and vector indexes.

View [this lesson’s content on GitHub](https://github.com/neo4j-graphacademy/courses/blob/main/asciidoc/courses/llm-vectors-unstructured/modules/3-unstructured-data/lessons/1-structured-unstructured/lesson.adoc?plain=1) and note the following:

1. The lesson content is written in plain text and is unstructured.

2. The file name is `lesson.adoc`.

3. All lessons have the same file name.

4. The directory structure denotes the course (`llm-vectors-unstructured`), module (`3-unstructured-data`), and lesson (`1-structured-unstructured`).

You will use these files and the directory structure to create the graph of the course content. The repository is organized as follows:

- `asciidoc` - contains all the course content in AsciiDoc format
    - `courses` - the course content
        - `llm-fundamentals` - the course name
            - `modules` - contains numbered directories for each module
                - `01-name` - the module name
                    - `lessons` - contains numbered directories for each lesson
                        - `01-name` - the lesson name
                            - `lesson.adoc` - the lesson content
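
Because the directory layout above encodes the course structure, the course, module, and lesson names can be read straight from a file path. The following is a minimal sketch, assuming paths always follow the `courses/<course>/modules/<module>/lessons/<lesson>/lesson.adoc` pattern; `parse_lesson_path` is a hypothetical helper, not part of the course code.

```python
from pathlib import PurePosixPath


def parse_lesson_path(path: str) -> dict:
    """Read the course, module, and lesson names from a lesson.adoc path."""
    parts = PurePosixPath(path).parts
    # Each name is the directory that follows its container directory.
    course = parts[parts.index("courses") + 1]
    module = parts[parts.index("modules") + 1]
    lesson = parts[parts.index("lessons") + 1]
    return {"course": course, "module": module, "lesson": lesson}


info = parse_lesson_path(
    "asciidoc/courses/llm-vectors-unstructured/modules/"
    "3-unstructured-data/lessons/1-structured-unstructured/lesson.adoc"
)
print(info)
```

The resulting dictionary gives you the metadata needed to connect a lesson's text to its `Course`, `Module`, and `Lesson` nodes.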



## Chunking

When dealing with large amounts of data, breaking it into smaller, more manageable parts is helpful. This process is called chunking.

Smaller pieces of data are easier to work with and process. Embedding models also have size (token) limits and can only handle a certain amount of data.

Embedding large amounts of text may also be less valuable. For example, if you are trying to find a document that references a specific topic, the meaning may be lost in the whole document. Instead, you may only need the paragraph or sentence that contains the relevant information. Conversely, small amounts of data may not contain enough context to be useful.

In this lesson, you will explore strategies for chunking and storing data in a graph.

### Strategies

There are countless strategies for splitting data into chunks, and the best approach depends on the data and the problem you are trying to solve.

It may be that the unstructured data you are working with is already in a format that is easy to split. For example, if you were looking to chunk an API’s technical documentation, you could split the data by method, endpoint, or parameter.

Alternatively, you may be working with a collection of unrelated PDF documents, and splitting by section, paragraph, or sentence may be the only choice.

Strategies for chunking data include:

- **Size** - Splitting data into equal-sized chunks.

- **Word, Sentence, Paragraph** - Breaking down text data into individual sections.

- **N-Grams** - Creating chunks of n consecutive words or characters.

- **Topic Segmentation** - Dividing text into sections based on topic changes.

- **Event Detection** - Identifying specific events or activities.

- **Semantic Segmentation** - Dividing data into regions with different semantic meanings (objects, background, etc).
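
The simpler strategies in this list need nothing more than string operations. The sketch below shows size, paragraph, and word n-gram chunking; the function names are hypothetical, and it assumes paragraphs are separated by a blank line. Note that with size-based chunking the final chunk may be shorter than the rest.

```python
def chunk_by_size(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_paragraph(text: str) -> list[str]:
    """Split text on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def word_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "Graphs store relationships.\n\nVectors capture meaning."

size_chunks = chunk_by_size("abcdefgh", 3)
paragraph_chunks = chunk_by_paragraph(sample)
bigrams = word_ngrams("the quick brown fox", 2)

print(size_chunks)
print(paragraph_chunks)
print(bigrams)
```

Topic segmentation, event detection, and semantic segmentation generally rely on a model rather than fixed rules, so they are not sketched here.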

It may also be helpful to combine multiple strategies. For example, you could split a document into paragraphs and then further split each paragraph into topic changes - this would allow you to store and query the data at different levels of granularity.
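
As a rough illustration of combining strategies, the sketch below splits text into paragraphs and then splits each paragraph again. Detecting topic changes would require a model, so this example substitutes a naive sentence split for the second level; the function names and the dictionary keys are assumptions made for illustration.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def chunk_hierarchy(text: str) -> list[dict]:
    """Return one entry per paragraph, each holding its own sentences."""
    return [
        {"position": i, "text": p, "sentences": split_sentences(p)}
        for i, p in enumerate(split_paragraphs(text))
    ]


document = (
    "Neo4j is a graph database. It stores nodes and relationships.\n\n"
    "Embeddings are vectors."
)

hierarchy = chunk_hierarchy(document)
for entry in hierarchy:
    print(entry["position"], entry["sentences"])
```

Keeping both levels lets you embed and query either whole paragraphs or individual sentences, depending on how much context a question needs.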

### Storing Chunks

How you store the chunks depends on the data, what the chunks represent, and how you intend to use the data.

It is a good idea to name the nodes and relationships in a way that makes it easy to understand the data and how it is related. For example, if you split a set of documents by paragraph, you could name the nodes `Document` and `Paragraph` with a relationship `CONTAINS`. Alternatively, if you split a document by an arbitrary size value or character, you may simply use the node label `Chunk`.

You can store embeddings for individual chunks and create relationships between chunks to capture context and relationships.

You may also want to store metadata about the chunks, such as the position in the original data, the size, and any other relevant information.

When storing the course content, you will create a node for each `Paragraph` chunk and a relationship `CONTAINS` between the `Lesson` and `Paragraph` nodes.

<img 
    src="https://graphacademy.neo4j.com/courses/llm-vectors-unstructured/3-unstructured-data/2-chunking/images/graphacademy-lessons-paragraph.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>
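
The `Lesson` to `Paragraph` model above could be prepared for import along the following lines. This is a sketch under stated assumptions, not the course's import code: the `build_paragraph_rows` helper, the property names (`name`, `position`, `size`, `text`, `lesson`), and the Cypher statement are all illustrative, and embeddings are left out. The code only builds the query parameters; running the query would require a Neo4j driver and a database.

```python
def build_paragraph_rows(lesson_text: str) -> list[dict]:
    """Return one row per paragraph, with position and size metadata."""
    paragraphs = [p.strip() for p in lesson_text.split("\n\n") if p.strip()]
    return [
        {"position": i, "size": len(p), "text": p}
        for i, p in enumerate(paragraphs)
    ]


# Assumed Cypher: creates the Lesson node, then one Paragraph node per row,
# each connected to the lesson by a CONTAINS relationship.
IMPORT_QUERY = """
MERGE (l:Lesson {name: $lesson})
WITH l
UNWIND $paragraphs AS row
MERGE (p:Paragraph {lesson: $lesson, position: row.position})
SET p.text = row.text, p.size = row.size
MERGE (l)-[:CONTAINS]->(p)
"""

# Hypothetical lesson text, used only for illustration.
lesson_text = "Chunking splits data.\n\nSmaller chunks are easier to embed."

rows = build_paragraph_rows(lesson_text)
params = {"lesson": "1-structured-unstructured", "paragraphs": rows}
print(params)
```

Storing `position` on each `Paragraph` node keeps the original reading order, so surrounding paragraphs can be retrieved for extra context.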