<img src="https://raw.githubusercontent.com/instill-ai/cookbook/main/images/Logo.png" alt="Instill Logo" width="300"/>

# Semantic Web Insights

In this notebook, we will leverage our [Web Operator](https://www.instill.tech/docs/component/operator/web) component to generate high-quality Markdown from a website. We will then use the **Chunk Text** feature of our [Text Operator](https://www.instill.tech/docs/component/operator/text) to break the generated Markdown into manageable pieces, followed by embedding those chunks using the **[Jina CLIP V1](https://instill.tech/instill-ai/models/jina-clip-v1/playground?version=v0.1.0)** model served via [Instill Model](https://www.instill.tech/docs/model/introduction). Finally, we will perform downstream analysis and visualization of the rich semantic information captured in the embeddings.

### Why This Matters

The ability to extract meaningful insights from web content is crucial for developing AI and ML applications. By converting web pages into structured Markdown format, we can maintain essential formatting while enabling AI models to process and understand the information efficiently. This structured approach allows for improved data handling, analysis, and semantic search capabilities.

### Overview of the Process

Our workflow consists of the following key steps:

1. **Crawl a Website** using the [**website-to-markdown**](https://instill.tech/george_strong/pipelines/website-to-markdown/playground) pipeline to generate high-quality Markdown content.
2. **Chunk the Markdown** using the [**chunk-text-array**](https://instill.tech/george_strong/pipelines/chunk-text-array/playground) pipeline, which supports configurable chunking strategies to keep chunks manageable and semantically coherent.
3. **Embed the Chunks** with Jina CLIP V1 to create semantic representations.
4. **Perform Clustering and Visualization** to analyze the relationships and distributions of the embeddings.


### Setup

To execute all of the code in this notebook, you’ll need to create a free Instill Cloud account and set up an API token. To create your account, please refer to our [quickstart guide](https://www.instill.tech/docs/quickstart). To generate your API token, consult the [API Token Management](https://www.instill.tech/docs/core/token) page.

**This will give you access to 10,000 free credits per month that you can use to make API calls with third-party AI vendors. Please see our [documentation](https://www.instill.tech/docs/cloud/credit) for further details.**

We will now install the Instill Python SDK, import the required libraries, and configure the SDK with a valid API token.

In [1]:
!pip install instill-sdk==0.13.0 --quiet

In [2]:
from IPython.display import IFrame

from google.protobuf.json_format import MessageToDict
import numpy as np
import os

from instill.clients import init_core_client
core = init_core_client(api_token="YOUR_INSTILL_API_TOKEN")
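
Hard-coding the token in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the token could instead be read from an environment variable. The variable name `INSTILL_API_TOKEN` and the helper `load_api_token` are assumptions made for illustration and are not part of the Instill SDK.

```python
import os


def load_api_token(env_var="INSTILL_API_TOKEN"):
    """Return the API token stored in the given environment variable.

    The variable name is an assumption for this sketch, not an SDK convention.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return token
```

The result could then be passed to the client as `init_core_client(api_token=load_api_token())`.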

### **[Crawl Website](https://www.instill.tech/docs/component/operator/web#crawl-website)** to Generate High-quality Markdown

We will start by crawling a specified website to extract the contents of each page and convert it into Markdown format. The following code triggers the [**website-to-markdown**](https://instill.tech/george_strong/pipelines/website-to-markdown/playground) pipeline and retrieves the Markdown representation of each page as a list.

Check out the details of this pipeline in the cell below! You can click on the `README` for a fuller description of what this pipeline does.

In [3]:
IFrame('https://instill.tech/george_strong/pipelines/website-to-markdown/preview', width=1000, height=700)

The current example crawls and scrapes [WebMD](https://www.webmd.com/), but feel free to change the target `url`, `max_pages`, `max_depth` and `include_tags` to customize your response!

In [4]:
# Initialize the pipeline client
pipeline = core.pipeline

# Define the custom web crawling and scraping parameters for the pipeline
url = "https://www.webmd.com/"
max_pages = 100
max_depth = 5
include_tags = ["p", "h1", "h2", "h3"] # Include only these tags in the scraped markdown
timeout = 0

# Trigger the pipeline with the custom parameters
response_crawler = pipeline.trigger(
    namespace_id="george_strong",
    pipeline_id="website-to-markdown",
    data=[{"max-k": max_pages, 
           "max-depth": max_depth,
           "url": url,
           "include-tags": include_tags,
           "timeout": timeout}]
)

# Extract the scraped markdown pages from the response object
md_pages = MessageToDict(response_crawler)['outputs'][0]['scraped']
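
The indexing above assumes the response contains at least one output with a `scraped` field. Below is a minimal defensive sketch that uses a hand-written dictionary mimicking only the shape accessed above (`outputs[0]['scraped']`). The helper name `extract_scraped` and the mock data are hypothetical.

```python
def extract_scraped(response_dict):
    """Return the list of scraped Markdown pages, or [] if none are present."""
    outputs = response_dict.get("outputs") or []
    if not outputs:
        return []
    return outputs[0].get("scraped") or []


# Mock data mirroring only the fields used in this notebook (not a real response)
mock_response = {"outputs": [{"scraped": ["# Page one", "# Page two"]}]}

print(extract_scraped(mock_response))  # ['# Page one', '# Page two']
print(extract_scraped({}))             # []
```

With the real response, this could be called as `extract_scraped(MessageToDict(response_crawler))`, so that an empty crawl yields an empty list rather than a `KeyError`.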

In [5]:
print(md_pages[4])

# Deep Vein Thrombosis Resource Center

### Deep Vein Thrombosis (DVT): Symptoms, Causes, Treatment

When a blood clot forms in a vein deep inside your body, it causes what doctors call deep vein thrombosis (DVT).

### Deep Vein Thrombosis (DVT) Symptoms

DVT can have the same symptoms as many other health problems. But about half the time, this causes no symptoms.

### Conditions Similar to DVT: How to Tell the Difference

You may have DVT if you notice that one limb is swollen, painful, warm, and red. Other things can cause similar symptoms.

### Conditions You Might Have Along With a DVT

If you’ve been diagnosed with DVT, you might be wondering if you’re at a higher risk for other health problems, too.

### Could I Get Deep Vein Thrombosis?

DVT is tough to spot. That’s why it’s a good idea to know what puts you at risk so you can avoid getting it.

### Who Gets DVT? By Sex, Age, Race, and Ethnicity

The American Heart Association says that combined, DVT and PE affect between 300,0

### Chunk Markdown

After generating the Markdown content for each page, we will chunk it using the **Chunk Text** feature of our [Text Operator](https://www.instill.tech/docs/component/operator/text). This step is crucial for breaking down the content into smaller, more manageable segments, which will facilitate better embeddings and downstream analysis.

The following code triggers the [**chunk-text-array**](https://instill.tech/george_strong/pipelines/chunk-text-array/playground) pipeline, which iterates over each page of Markdown text and chunks it according to the specified parameters. The result is `chunked_pages`, a list of lists with one list of text chunks per page.

Feel free to try out different chunking strategies ("Markdown", "Recursive" or "Token"), or edit `max_chunk_length` and `chunk_overlap` to customize the chunking behavior!

In [6]:
# Define the custom chunking parameters for the pipeline
strategy = "Markdown"
max_chunk_length = 1000
chunk_overlap = 1

# Trigger the pipeline with the custom parameters
response = pipeline.trigger(
    namespace_id="george_strong",
    pipeline_id="chunk-text-array",
    data=[{"text-array": md_pages,
           "chunk-strategy": strategy,
           "max-chunk-length": max_chunk_length,
           "chunk-overlap": chunk_overlap}]
)

# Extract the chunked markdown pages from the response object
response = MessageToDict(response)
chunked_pages = [[item['text'] for item in page] for page in response['outputs'][0]['response']]
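
To build intuition for how a maximum chunk length and an overlap interact, here is a simplified word-based sliding-window chunker. It is only an illustration: the Text Operator's actual strategies, and the units in which it measures `max_chunk_length` and `chunk_overlap`, may differ. `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, max_words, overlap):
    """Split text into chunks of at most `max_words` words, where each chunk
    starts with the last `overlap` words of the previous chunk."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the window has reached the end of the text
    return chunks


print(chunk_words("a b c d e f g", max_words=3, overlap=1))
# ['a b c', 'c d e', 'e f g']
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed.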

In [7]:
chunked_pages[0]

['### Simple Habits to Lower Breast Cancer Risk\n\nAre you concerned about breast cancer? Consider implementing these lifestyle habits to reduce your chances.',
 "### When Fixation on a Perfect Night’s Sleep Becomes Harmful\n\nYou could risk undermining your sleep efforts if you place too much emphasis on getting the ideal amount of sleep. Here's what to know.",
 '### Common Foot Problems and Solutions\n\nAre you experiencing foot pain? Discover some common causes for foot issues and tips to ease your discomfort.',
 '### Dermatologists Offer Tips For Skin Care and Skin Concerns\n\nGet guidance from experts regarding skin issues, ranging from eczema and sensitive skin to sunburns.',
 '### Stories of Managing Type 1 Diabetes\n\nJohn Whyte, MD, speaks with patients and advocates about their experiences with constant blood sugar monitoring.',
 '## Free WebMD Newsletters\n### Our Content Is Different Because We Set the Bar Higher\n\nAs a leader in digital health publishing for more than 25 

#### Chunk Filtering

In [8]:
# Filter chunks within each page, removing chunks with fewer than 20 words
filtered_pages = [[chunk for chunk in page if len(chunk.split()) >= 20] for page in chunked_pages]

# Remove empty pages
chunked_pages = [page for page in filtered_pages if page]

# Remove duplicate chunks within each page
chunked_pages = [list(dict.fromkeys(page)) for page in chunked_pages]
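
The three filtering steps can be checked on toy data. The sketch below wraps them in a hypothetical helper, `filter_pages`, and uses a lower word threshold so that the example stays short.

```python
def filter_pages(pages, min_words=20):
    """Drop short chunks, then empty pages, then per-page duplicates (order kept)."""
    kept = [[c for c in page if len(c.split()) >= min_words] for page in pages]
    kept = [page for page in kept if page]
    return [list(dict.fromkeys(page)) for page in kept]


toy_pages = [
    ["one two three", "too short", "one two three"],  # duplicate + short chunk
    ["tiny"],                                         # becomes empty, so dropped
]
print(filter_pages(toy_pages, min_words=3))  # [['one two three']]
```

Note that duplicates are only removed within a page: a chunk repeated across several pages would still appear once per page.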

### Embed Chunks

Now that we have our chunks, we will use the **[Jina CLIP V1](https://instill.tech/instill-ai/models/jina-clip-v1/playground?version=v0.1.0)** embedding model to generate semantic representations for each chunk. This step will allow us to analyze the meaning and context of the content captured in the Markdown.

First, we will initialize an Instill Model client.

In [9]:
model = core.model

We will now define a function that takes a list of chunks, e.g. the chunks for a single web page, and generates a corresponding list of embedding vectors using **[Jina CLIP V1](https://instill.tech/instill-ai/models/jina-clip-v1/playground?version=v0.1.0)**.

In [10]:
def embed_chunks(text_chunks):

    # Create the input payload
    embeddings = [{"text": text, "type": "text"} for text in text_chunks]

    payload = {
        "data": {
            "embeddings": embeddings
        },
        "parameter": {
            "format": "float",
        },
    }

    # Trigger the model
    response = model.trigger(
        namespace_id="instill-ai",
        model_id="jina-clip-v1",
        version="v0.1.0",
        task_inputs=[payload],
    )

    # Extract and return the embedding vectors
    response = MessageToDict(response)
    vectors = [embedding['vector'] for embedding in response['taskOutputs'][0]['data']['embeddings']]

    return vectors

We will now loop over the pages, sending all the chunks of each page to the model as a single batch.

In [11]:
embedded_pages = [embed_chunks(page) for page in chunked_pages]
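
Embedding vectors are commonly compared with cosine similarity, where values close to 1 indicate vectors pointing in nearly the same direction. As a sketch on made-up three-dimensional vectors (not real model outputs), using only the standard library:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]  # same direction as v1, so similarity is approximately 1
v3 = [0.0, 1.0, 0.0]  # orthogonal to v1, so similarity is 0

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

The same function could be applied to any two vectors in `embedded_pages`, for example `cosine_similarity(embedded_pages[0][0], embedded_pages[0][1])`, to compare two chunks from the first page.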

### Clustering and Visualization

To gain insights into the relationships between the embedded chunks, we will perform clustering and visualization. This will help us understand the distribution and semantic structure of the content.

In [12]:
!pip install umap-learn --quiet
!pip install bokeh --quiet

In [13]:
import umap
from sklearn.cluster import AgglomerativeClustering

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.palettes import Category20



#### UMAP Dimensionality Reduction

First, we will apply UMAP for dimensionality reduction to visualize the high-dimensional embeddings in a 2D space.

In [18]:
flattened_embeddings = [vector for page in embedded_pages for vector in page]
flattened_chunks = [chunk for chunks in chunked_pages for chunk in chunks]

X = np.array(flattened_embeddings)

umap_model = umap.UMAP(n_components=2, n_neighbors=40, random_state=123)
reduced_embeddings = umap_model.fit_transform(X)



#### Clustering

Next, we will apply hierarchical (agglomerative) clustering to the 2D UMAP coordinates to identify semantically related groups of chunks, which can reveal interesting patterns and insights.

In [19]:
num_clusters = 20

hierarchical_clustering = AgglomerativeClustering(n_clusters=num_clusters)
cluster_labels = hierarchical_clustering.fit_predict(reduced_embeddings)
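
A quick sanity check after clustering is to count how many chunks fall into each cluster. The sketch below uses a hand-written label list in place of `cluster_labels`; with the real array, `cluster_labels.tolist()` could be passed instead. `cluster_sizes` is a hypothetical helper.

```python
from collections import Counter


def cluster_sizes(labels):
    """Return (label, count) pairs sorted from largest to smallest cluster."""
    return Counter(labels).most_common()


# Toy labels for illustration only
toy_labels = [0, 2, 2, 1, 2, 0]
print(cluster_sizes(toy_labels))  # [(2, 3), (0, 2), (1, 1)]
```

Very small or very large clusters may suggest trying a different `num_clusters` value.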

#### Interactive Plot using Bokeh

Finally, we will create an interactive plot using Bokeh to visualize the clusters and the corresponding chunks of text. This will provide an intuitive and interactive way of exploring and understanding the semantic structure contained in the website!

In [20]:
source = ColumnDataSource(data=dict(
    x=reduced_embeddings[:, 0],
    y=reduced_embeddings[:, 1],
    text=flattened_chunks,
    cluster=cluster_labels
))

colors = Category20[num_clusters]
source.data['color'] = [colors[label] for label in cluster_labels]

plot = figure(title='Visualize Crawled Website Embeddings',
              tools="pan,wheel_zoom,box_zoom,reset",
              x_axis_label='UMAP 1',
              y_axis_label='UMAP 2',
              width=900,
              height=600)

plot.scatter('x', 'y', source=source, size=7, color='color', alpha=0.4)

hover_tool = HoverTool()
hover_tool.tooltips = """
    <div style="width: 500px; white-space: normal; border: 1px solid #ccc; padding: 10px; border-radius: 10px;">
        <div><strong>Text Chunk:</strong></div>
        <div>@text</div>
    </div>
"""

plot.add_tools(hover_tool)

output_notebook()
show(plot)

In [17]:
pipeline.close()
model.close()