<img src="https://raw.githubusercontent.com/instill-ai/cookbook/main/images/Logo.png" alt="Instill Logo" width="300"/>

# Semantic Web Insights

In this notebook, we will leverage the **Crawl Website** feature of our [Web Operator](https://www.instill.tech/docs/component/operator/web) component to generate high-quality Markdown from a website. We will then use the **Chunk Text** feature of our [Text Operator](https://www.instill.tech/docs/component/operator/text) to break the generated Markdown into manageable pieces, followed by embedding those chunks using the **[Jina CLIP V1](https://instill.tech/instill-ai/models/jina-clip-v1/playground?version=v0.1.0)** model served via [Instill Model](https://www.instill.tech/docs/model/introduction). Finally, we will perform downstream analysis and visualization of the rich semantic information captured in the embeddings.

### Why This Matters

The ability to extract meaningful insights from web content is crucial for developing AI and ML applications. By converting web pages into structured Markdown format, we can maintain essential formatting while enabling AI models to process and understand the information efficiently. This structured approach allows for improved data handling, analysis, and semantic search capabilities.

### Overview of the Process

Our workflow consists of the following key steps:

1. **Crawl a Website** using the [**website-to-markdown**](https://instill.tech/george_strong/pipelines/website-to-markdown/playground) pipeline to generate high-quality Markdown content.
2. **Chunk the Markdown** using the [**chunk-markdown**](https://instill.tech/george_strong/pipelines/chunk-markdown/playground) pipeline, which accepts a configurable chunking strategy to keep the pieces manageable and semantically coherent.
3. **Embed the Chunks** with Jina CLIP V1 to create semantic representations.
4. **Perform Clustering and Visualization** to analyze the relationships and distributions of the embeddings.


### Setup

To execute all of the code in this notebook, you’ll need to create a free Instill Cloud account and set up an API token. To create your account, please refer to our [quickstart guide](https://www.instill.tech/docs/quickstart). To generate your API token, consult the [API Token Management](https://www.instill.tech/docs/core/token) page.

**This will give you access to 10,000 free credits per month that you can use to make API calls with third-party AI vendors. Please see our [documentation](https://www.instill.tech/docs/cloud/credit) for further details.**

We will now install the Instill Python SDK, import the required libraries, and configure the SDK with a valid API token.

In [1]:
!pip install instill-sdk==0.13.0rc0 --quiet

In [2]:
from google.protobuf.json_format import MessageToDict
from google.protobuf.struct_pb2 import Struct
import numpy as np
import os

from instill.clients.client import init_pipeline_client
pipeline = init_pipeline_client(api_token=os.environ['INSTILL_API_TOKEN'])
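
The cell above raises a bare `KeyError` if `INSTILL_API_TOKEN` is not set. A minimal sketch of a friendlier check follows, assuming the token is stored in that environment variable. `require_env` is a hypothetical helper written for this notebook, not part of the Instill SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Create an API token and export it before running this notebook."
        )
    return value


# Example usage (hypothetical): api_token = require_env("INSTILL_API_TOKEN")
```

The returned value could then be passed as `api_token` to `init_pipeline_client`.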

### **[Crawl Website](https://www.instill.tech/docs/component/operator/web#crawl-website)** to Generate High-quality Markdown

We will start by crawling a specified website to extract the contents of each page and convert it into Markdown format. The following code triggers the [**website-to-markdown**](https://instill.tech/george_strong/pipelines/website-to-markdown/playground) pipeline and collects the Markdown representation of each page in a list.

Feel free to change `url` to crawl a different site, and `max_pages` to control how many of its pages are crawled.

In [3]:
url = "https://www.instill.tech/"
max_pages = 10

response_crawler = pipeline.trigger_namespace_pipeline(
    "george_strong",
    "website-to-markdown",
    [{"max-k": max_pages,
      "url": url}]
)

In [4]:
md_pages = MessageToDict(response_crawler)['outputs'][0]['crawled-content']

print(md_pages[0])

Instill AI

![beam](http://www.instill.tech/images/landing-page/hero/1.svg)

![beam](http://www.instill.tech/images/landing-page/hero/3.svg)![beam](http://www.instill.tech/images/landing-page/hero/4.svg)

![beam](http://www.instill.tech/images/landing-page/hero/2.svg)![beam](http://www.instill.tech/images/landing-page/hero/4.svg)![beam](http://www.instill.tech/images/landing-page/hero/8.svg)![beam](http://www.instill.tech/images/landing-page/hero/10.svg)

![beam](http://www.instill.tech/images/landing-page/hero/9.svg)![beam](http://www.instill.tech/images/landing-page/hero/8.svg)![beam](http://www.instill.tech/images/landing-page/hero/10.svg)

![blurred spot](http://www.instill.tech/images/landing-page/3.svg)![blurred spot](http://www.instill.tech/images/landing-page/4.svg)![beam](http://www.instill.tech/images/landing-page/1.svg)![beam](http://www.instill.tech/images/landing-page/2.svg)![blurred spot](http://www.instill.tech/images/landing-page/5.svg)![blurred spot](http://www.instill
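
The output above is dominated by decorative image references, which contribute little text for a chunk embedding. A minimal sketch of one way to drop them before chunking is shown here. `strip_markdown_images` is a hypothetical helper, not part of either pipeline, and whether removing images helps depends on the site being crawled.

```python
import re

# Matches Markdown image syntax such as ![alt](http://example.com/a.svg)
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_markdown_images(markdown: str) -> str:
    """Remove image references and collapse the blank lines they leave behind."""
    without_images = IMAGE_PATTERN.sub("", markdown)
    # Collapse runs of three or more newlines into a single blank line
    return re.sub(r"\n{3,}", "\n\n", without_images).strip()


# Toy input standing in for one entry of md_pages
sample = "Instill AI\n\n![beam](http://example.com/1.svg)\n\n![beam](http://example.com/2.svg)\n\nSome text"
print(strip_markdown_images(sample))
```

To apply it to the crawl results, something like `md_pages = [strip_markdown_images(page) for page in md_pages]` could be run before the chunking step.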

### Chunk Markdown

After generating the Markdown content for each page, we will chunk it using the **Chunk Text** feature of our [Text Operator](https://www.instill.tech/docs/component/operator/text). This step is crucial for breaking down the content into smaller, more manageable segments, which will facilitate better embeddings and downstream analysis.

The following code loops over the Markdown pages and triggers the [**chunk-markdown**](https://instill.tech/george_strong/pipelines/chunk-markdown/playground) pipeline for each one, populating `chunked_pages`. The result is a list of lists: one list of text chunks per page.

Feel free to try out different chunking strategies ("Markdown", "Recursive" or "Token"), or adjust `max_chunk_length` and `chunk_overlap` to customize the chunking behavior!

In [5]:
strategy = "Markdown"
max_chunk_length = 1200
chunk_overlap = 10

chunked_pages = []

for web_page in md_pages:
    response = pipeline.trigger_namespace_pipeline(
        "george_strong",
        "chunk-markdown",
        [{"md-input": web_page,
          "chunk-strategy": strategy,
          "max-chunk-length": max_chunk_length,
          "chunk-overlap": chunk_overlap}]
    )
    response = MessageToDict(response)
    chunks = [item['text'] for item in response['outputs'][0]['response']]
    chunked_pages.append(chunks)
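
To build intuition for how `max_chunk_length` and `chunk_overlap` interact, here is a simplified character-based sketch. It is not the Text Operator's implementation, which may measure length differently and split on Markdown structure; `sliding_chunks` is a hypothetical helper for illustration only.

```python
def sliding_chunks(text: str, max_chunk_length: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < max_chunk_length:
        raise ValueError("chunk_overlap must be >= 0 and smaller than max_chunk_length")
    step = max_chunk_length - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_length])
        if start + max_chunk_length >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", max_chunk_length=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each window starts `max_chunk_length - chunk_overlap` characters after the previous one, so neighbouring chunks repeat a little context at their boundary.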

In [6]:
chunked_pages[0]

['\nInstill AI',
 '\n![beam](http://www.instill.tech/images/landing-page/hero/1.svg)',
 '\n![beam](http://www.instill.tech/images/landing-page/hero/3.svg)![beam](http://www.instill.tech/images/landing-page/hero/4.svg)',
 '\n![beam](http://www.instill.tech/images/landing-page/hero/2.svg)![beam](http://www.instill.tech/images/landing-page/hero/4.svg)![beam](http://www.instill.tech/images/landing-page/hero/8.svg)![beam](http://www.instill.tech/images/landing-page/hero/10.svg)',
 '\n![beam](http://www.instill.tech/images/landing-page/hero/9.svg)![beam](http://www.instill.tech/images/landing-page/hero/8.svg)![beam](http://www.instill.tech/images/landing-page/hero/10.svg)',
 '\n![blurred spot](http://www.instill.tech/images/landing-page/3.svg)![blurred spot](http://www.instill.tech/images/landing-page/4.svg)![beam](http://www.instill.tech/images/landing-page/1.svg)![beam](http://www.instill.tech/images/landing-page/2.svg)![blurred spot](http://www.instill.tech/images/landing-page/5.svg)![blu

### Embed Chunks

Now that we have our chunks, we will use the **[Jina CLIP V1](https://instill.tech/instill-ai/models/jina-clip-v1/playground?version=v0.1.0)** embedding model to generate semantic representations for each chunk. This step will allow us to analyze the meaning and context of the content captured in the Markdown.

First, we will initialize an Instill Model client.

In [7]:
from instill.clients.client import init_model_client
model = init_model_client(api_token=os.environ['INSTILL_API_TOKEN'])

We will now define a function that takes a list of chunks, e.g. the chunks for a single web page, and generates a corresponding list of embedding vectors using **[Jina CLIP V1](https://instill.tech/instill-ai/models/jina-clip-v1/playground?version=v0.1.0)**.

In [8]:
def embed_chunks(text_chunks):

    # create the input payload
    embeddings = [{"text": text, "type": "text"} for text in text_chunks]
    i = Struct()
    i.update(
        {
            "data": {
                "embeddings": embeddings
            },
            "parameter": {
                "format": "float",
            },
        }
    )

    # trigger the model
    response = model.trigger_latest_model(
        "jina-clip-v1",
        [i],
        namespace_id="instill-ai",
    )

    # extract and return the embedding vectors
    response = MessageToDict(response)
    vectors = [embedding['vector'] for embedding in response['taskOutputs'][0]['data']['embeddings']]

    return vectors

We will now loop over the pages, embedding all the chunks of each page in a single request per page.

In [9]:
embedded_pages = []

for page in chunked_pages:
    embedded_pages.append(embed_chunks(page))
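
Once the embeddings exist, a quick sanity check is to look up which chunks are closest to a given one. The sketch below ranks vectors by cosine similarity using NumPy. `top_k_similar` is a hypothetical helper, the toy vectors stand in for real embeddings, and the code assumes no vector is all zeros.

```python
import numpy as np


def top_k_similar(query_vector, embedding_matrix, k=3):
    """Return indices and scores of the k rows most similar to the query (cosine similarity)."""
    matrix = np.asarray(embedding_matrix, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Normalise so that the dot product equals cosine similarity
    matrix_norm = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = matrix_norm @ query_norm
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order.tolist(), scores[order].tolist()


# Toy vectors standing in for real chunk embeddings
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_k_similar([1.0, 0.1], toy_embeddings, k=2)
print(indices)
# → [0, 2]
```

With the notebook's data, a call such as `top_k_similar(embedded_pages[0][0], embedded_pages[0])` would return indices into `chunked_pages[0]`, with the query chunk itself ranked first.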

### Clustering and Visualization

To gain insights into the relationships between the embedded chunks, we will perform clustering and visualization. This will help us understand the distribution and semantic structure of the content.

In [10]:
!pip install umap-learn --quiet
!pip install bokeh --quiet

In [11]:
import umap
from sklearn.cluster import KMeans

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.palettes import Category10



#### UMAP Dimensionality Reduction

First, we will apply UMAP for dimensionality reduction to visualize the high-dimensional embeddings in a 2D space.

In [12]:
flattened_embeddings = [vector for page in embedded_pages for vector in page]
flattened_chunks = [chunk for chunks in chunked_pages for chunk in chunks]

X = np.array(flattened_embeddings)

umap_model = umap.UMAP(n_components=2, n_neighbors=40, random_state=42)
reduced_embeddings = umap_model.fit_transform(X)
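
A small crawl can produce fewer chunks than the `n_neighbors=40` used above, in which case a neighbourhood of that size is not meaningful. A hedged sketch of clamping the value to the dataset size follows. `cap_n_neighbors` is a hypothetical helper, and the lower bound of 2 reflects the author's understanding of UMAP's validation rather than anything stated in this notebook.

```python
def cap_n_neighbors(requested: int, n_samples: int) -> int:
    """Clamp a requested n_neighbors value to a range that fits n_samples points."""
    if n_samples < 3:
        raise ValueError("Need at least 3 embeddings to build a neighbour graph")
    # Assumption: UMAP needs n_neighbors >= 2, and a point has at most n_samples - 1 neighbours
    return max(2, min(requested, n_samples - 1))


print(cap_n_neighbors(40, 25))
# → 24
```

It could be used as `umap.UMAP(n_components=2, n_neighbors=cap_n_neighbors(40, len(X)), random_state=42)`.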



#### K-Means Clustering

Next, we will use K-Means clustering to identify semantically related groups within the reduced embeddings, which can reveal interesting patterns and insights.

In [13]:
num_clusters = 5

kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(reduced_embeddings)
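
Before plotting, it can help to read a few chunks from each cluster to see what the groups have in common. A minimal sketch, assuming labels and texts are aligned index by index as `cluster_labels` and `flattened_chunks` are here; `preview_clusters` is a hypothetical helper and the toy data is illustrative only.

```python
from collections import defaultdict


def preview_clusters(labels, texts, per_cluster=3, max_chars=80):
    """Group texts by cluster label and keep a short preview of the first few in each."""
    grouped = defaultdict(list)
    for label, text in zip(labels, texts):
        key = int(label)  # also converts NumPy integer labels to plain ints
        if len(grouped[key]) < per_cluster:
            grouped[key].append(text.strip()[:max_chars])
    return dict(sorted(grouped.items()))


# Toy labels and texts standing in for cluster_labels and flattened_chunks
demo = preview_clusters([1, 0, 1, 0], ["pricing page", "docs intro", "pricing FAQ", "docs setup"])
print(demo)
# → {0: ['docs intro', 'docs setup'], 1: ['pricing page', 'pricing FAQ']}
```

In the notebook, `preview_clusters(cluster_labels, flattened_chunks)` would give a per-cluster text preview to compare against the plot.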

#### Interactive Plot using Bokeh

Finally, we will create an interactive plot using Bokeh to visualize the clusters and the corresponding chunks of text. This will provide an intuitive and interactive way of exploring and understanding the semantic structure contained in the website!

In [14]:
source = ColumnDataSource(data=dict(
    x=reduced_embeddings[:, 0],
    y=reduced_embeddings[:, 1],
    text=flattened_chunks,
    cluster=cluster_labels
))

colors = Category10[num_clusters]
source.data['color'] = [colors[label] for label in cluster_labels]

plot = figure(title='Visualize Crawled Website Embeddings',
              tools="pan,wheel_zoom,box_zoom,reset",
              x_axis_label='UMAP 1',
              y_axis_label='UMAP 2',
              width=900,
              height=600)

plot.scatter('x', 'y', source=source, size=8, color='color', alpha=0.6)

hover_tool = HoverTool()
hover_tool.tooltips = """
    <div style="width: 400px; white-space: normal;">
        <div><strong>Text Chunk:</strong></div>
        <div>@text</div>
    </div>
"""

plot.add_tools(hover_tool)

output_notebook()
show(plot)