# Extract Topics

In the last lesson, you built a graph using metadata to understand the course content and the relationships between the content and lessons.

In this lesson, you will add topics from the unstructured lesson content to the graph.

## Topics

Topics are a way to categorize and organize content. You can use topics to help users find relevant content, recommend related content, and understand the relationships between different pieces of content. For example, you can find similar lessons based on their topics.
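To make "similar lessons based on their topics" concrete, here is a minimal sketch that ranks lessons by topic overlap using Jaccard similarity. The lesson names and topics are hypothetical and held in memory for illustration only; they are not taken from the course graph, and Jaccard is just one possible similarity measure.

```python
def jaccard(a, b):
    """Return the Jaccard similarity of two topic collections (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(name, lesson_topics):
    """Rank the other lessons by topic overlap with the named lesson."""
    target = lesson_topics[name]
    scores = [
        (other, jaccard(target, topics))
        for other, topics in lesson_topics.items()
        if other != name
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical lessons and topics, for illustration only
lesson_topics = {
    "vector-search": {"semantic search", "embeddings", "vector index"},
    "embeddings": {"embeddings", "openai", "vector index"},
    "cypher-basics": {"cypher", "nodes", "relationships"},
}

print(most_similar("vector-search", lesson_topics))
# [('embeddings', 0.5), ('cypher-basics', 0.0)]
```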

There are many ways to extract topics from unstructured text. You could use an LLM and ask it to summarize the topics from the text. A more straightforward approach is to identify all the nouns in the text and use them as topics.

To hold the topic data, you should extend the data model to include a new node type, `Topic`, and a new relationship, `MENTIONS`.

<img 
    src="https://graphacademy.neo4j.com/courses/llm-vectors-unstructured/3-unstructured-data/6-extract-topics/images/graphacademy-lessons-paragraph-topic.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>

### Extract nouns

The Python natural language processing (NLP) library [TextBlob](https://textblob.readthedocs.io/en/dev/) can extract noun phrases from text. You will use it to extract the topics from the lesson content.

You may find that changing the default [Noun Phrase Chunker](https://textblob.readthedocs.io/en/dev/advanced_usage.html) used by TextBlob improves the results for your data.

```python
#!uv pip install textblob
```

```python
from textblob import TextBlob

phrase = "You can extract topics from phrases using TextBlob"

topics = TextBlob(phrase).noun_phrases

print(topics)
```

```
['extract topics', 'textblob']
```


### Update the Graph

First, update the `get_course_data` function to extract topics from the lesson content. Add the topics to the data dictionary using the `TextBlob.noun_phrases` property:

```python
from textblob import TextBlob

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])
    data['topics'] = TextBlob(data['text']).noun_phrases

    return data
```
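The negative indexes in `get_course_data` depend on the directory layout of the lesson files. The sketch below shows how they resolve for one path. The path is a hypothetical example of a `course/modules/module/lessons/lesson/lesson.adoc` layout, and `parse_lesson_path` is an illustrative helper, not part of the course code.

```python
import os


def parse_lesson_path(source):
    """Pick the course, module, and lesson names out of a lesson file path."""
    parts = source.split(os.path.sep)
    if len(parts) < 6:
        raise ValueError(f"Path is too short to contain course metadata: {source}")
    return {
        "course": parts[-6],  # .../<course>/modules/<module>/lessons/<lesson>/lesson.adoc
        "module": parts[-4],
        "lesson": parts[-2],
    }


# Hypothetical path, built with os.path.join so it splits correctly on any OS
source = os.path.join(
    "data", "asciidoc", "courses", "llm-fundamentals",
    "modules", "1-introduction",
    "lessons", "1-neo4j-and-genai",
    "lesson.adoc",
)

print(parse_lesson_path(source))
# {'course': 'llm-fundamentals', 'module': '1-introduction', 'lesson': '1-neo4j-and-genai'}
```

If your files are nested differently, the indexes need to change to match.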

Next, update the `create_chunk` function to add the topics to the graph:

```python
def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
           
        FOREACH (topic in $topics |
            MERGE (t:Topic {name: topic})
            MERGE (p)-[:MENTIONS]->(t)
        )
        """, 
        data
        )
```
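Because `MERGE (t:Topic {name: topic})` matches on the exact name, variations in case or whitespace would create separate `Topic` nodes. As an optional step, you could normalize and de-duplicate the phrases before writing them. The following is a minimal sketch; `clean_topics` and its `min_length` threshold are assumptions for illustration, not part of the course code.

```python
def clean_topics(phrases, min_length=3):
    """Lowercase, collapse whitespace, drop very short phrases, and de-duplicate.

    The order of first appearance is preserved.
    """
    seen = set()
    cleaned = []
    for phrase in phrases:
        topic = " ".join(str(phrase).lower().split())
        if len(topic) < min_length or topic in seen:
            continue
        seen.add(topic)
        cleaned.append(topic)
    return cleaned


print(clean_topics(["Semantic  Search", "semantic search", "ai", "Neo4j"]))
# ['semantic search', 'neo4j']
```

In `get_course_data`, this could be applied as `data['topics'] = clean_topics(TextBlob(data['text']).noun_phrases)`. The result is a plain Python list of strings.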

Full code:

```python
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from openai import OpenAI
from neo4j import GraphDatabase
from textblob import TextBlob

COURSES_PATH = "llm-vectors-unstructured/data/asciidoc"

loader = DirectoryLoader(COURSES_PATH, glob="**/lesson.adoc", loader_cls=TextLoader)
docs = loader.load()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1500,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(docs)

def get_embedding(llm, text):
    response = llm.embeddings.create(
            input=text,
            model="text-embedding-ada-002"
        )
    return response.data[0].embedding

def get_course_data(llm, chunk):
    data = {}

    path = chunk.metadata['source'].split(os.path.sep)

    data['course'] = path[-6]
    data['module'] = path[-4]
    data['lesson'] = path[-2]
    data['url'] = f"https://graphacademy.neo4j.com/courses/{data['course']}/{data['module']}/{data['lesson']}"
    data['text'] = chunk.page_content
    data['embedding'] = get_embedding(llm, data['text'])
    data['topics'] = TextBlob(data['text']).noun_phrases

    return data

def create_chunk(tx, data):
    tx.run("""
        MERGE (c:Course {name: $course})
        MERGE (c)-[:HAS_MODULE]->(m:Module{name: $module})
        MERGE (m)-[:HAS_LESSON]->(l:Lesson{name: $lesson, url: $url})
        MERGE (l)-[:CONTAINS]->(p:Paragraph{text: $text})
        WITH p
        CALL db.create.setNodeVectorProperty(p, "embedding", $embedding)
           
        FOREACH (topic in $topics |
            MERGE (t:Topic {name: topic})
            MERGE (p)-[:MENTIONS]->(t)
        )
        """, 
        data
        )

llm = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

driver = GraphDatabase.driver(
    os.getenv('NEO4J_URI'),
    auth=(
        os.getenv('NEO4J_USERNAME'),
        os.getenv('NEO4J_PASSWORD')
    )
)
driver.verify_connectivity()

# Reuse one session; each chunk is written in its own transaction
with driver.session(database="neo4j") as session:
    for chunk in chunks:
        session.execute_write(
            create_chunk,
            get_course_data(llm, chunk)
        )

driver.close()
```

### Query topics

You can use the topics to find related lessons. For example, you can find all the lessons that mention the topic "semantic search":

```cypher
MATCH (t:Topic{name:"semantic search"})<-[:MENTIONS]-(p:Paragraph)<-[:CONTAINS]-(l:Lesson)
RETURN DISTINCT l.name, l.url
```
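To run this lookup from Python, you could pass the topic as a query parameter instead of hard-coding it. The sketch below assumes `session` is a Neo4j driver session (any object with a compatible `run(query, **parameters)` method works). The helper name `lessons_for_topic` is hypothetical, and the lowercasing assumes the topics were stored in lowercase, as in the TextBlob output shown earlier.

```python
LESSONS_FOR_TOPIC = """
MATCH (t:Topic {name: $topic})<-[:MENTIONS]-(:Paragraph)<-[:CONTAINS]-(l:Lesson)
RETURN DISTINCT l.name AS name, l.url AS url
"""


def lessons_for_topic(session, topic):
    """Return (name, url) pairs for the lessons that mention the topic."""
    # Assumption: topic names are stored in lowercase
    result = session.run(LESSONS_FOR_TOPIC, topic=topic.strip().lower())
    return [(record["name"], record["url"]) for record in result]


# Usage, assuming the `driver` created in the full code above:
# with driver.session(database="neo4j") as session:
#     print(lessons_for_topic(session, "Semantic Search"))
```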

You can list the topics and the number of lessons that mention them to understand the most popular topics:

```cypher
MATCH (t:Topic)<-[:MENTIONS]-(p:Paragraph)<-[:CONTAINS]-(l:Lesson)
RETURN t.name, COUNT(DISTINCT l) as lessons
ORDER BY lessons DESC
```

By adding topics to the graph, you can use them to find related content.

Topics are also universal and can link related content from different sources. For example, if you added technical documentation to this graph, you could use the topics to find related lessons and documentation.

Combining data from different sources and understanding their relationships is the starting point for creating a knowledge graph.

When you have added topics to the graph, click Complete to finish this lesson.

# Expand the Graph

In this optional challenge, you can extend the graph with additional data.

## All Courses

Currently, the graph contains data from a single course, `llm-fundamentals`. You can download the [lesson files for all the courses](https://data.neo4j.com/llm-vectors-unstructured/courses.zip).

1. Download the content for all the courses: [data.neo4j.com/llm-vectors-unstructured/courses.zip](https://data.neo4j.com/llm-vectors-unstructured/courses.zip)

2. Update the graph with the additional course data

3. Explore the graph and find the connections between the courses

## Additional metadata

While the course content is unstructured, it contains metadata you can extract and include in the graph.

Examples include:

- The course title is the first level 1 heading in the file: `= Course Title`
- Level 2 headings denote section titles: `== Section Title`
- The lessons include parameters in the format `:parameter: value` at the top of the file, such as:
    - `:type:` - the type of lesson (e.g. lesson, challenge, quiz)
    - `:order:` - the order of the lesson in the module
    - `:optional:` - whether the lesson is optional

Explore the course content and see what other data you can extract and include in the graph. You may also find that the content could be split into different nodes, such as sections, which may give you more accurate results.
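As a starting point, the metadata described above could be pulled out with a few string checks and a regular expression. This is a simplified sketch rather than a full AsciiDoc parser: `parse_lesson_metadata` is a hypothetical helper, the sample text is invented, and the function scans every line, so lines inside code listings that happen to look like headings or parameters would also match.

```python
import re

# Matches lines such as ":type: lesson" or ":optional:"
ATTRIBUTE = re.compile(r"^:([\w-]+):\s*(.*)$")


def parse_lesson_metadata(text):
    """Extract the title, section titles, and :parameter: values from AsciiDoc text."""
    metadata = {"title": None, "sections": [], "attributes": {}}
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith("== "):
            # Level 2 heading: section title
            metadata["sections"].append(line[3:].strip())
        elif line.startswith("= ") and metadata["title"] is None:
            # First level 1 heading: title
            metadata["title"] = line[2:].strip()
        else:
            match = ATTRIBUTE.match(line)
            if match:
                metadata["attributes"][match.group(1)] = match.group(2).strip()
    return metadata


# Invented sample content, for illustration only
sample = """= Extract Topics
:type: lesson
:order: 6

== Topics

Some text.

== Extract nouns
"""

print(parse_lesson_metadata(sample))
# {'title': 'Extract Topics', 'sections': ['Topics', 'Extract nouns'],
#  'attributes': {'type': 'lesson', 'order': '6'}}
```

The extracted values could then be added to the `data` dictionary and set as properties on the `Lesson` node, or used to create separate section nodes.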

When you are ready, click Move On to continue.