# Prepare Data

We extract text from scraped HTML pages and convert it to embedding vectors. Finally, we save the data to Cassandra, which supports vector search.

Similar work of mine can be found at [Convert Text to Vector](https://github.com/linhhlp/Machine-Learning-Applications/tree/main/Text-2-Vect-Vector-Search).

In [6]:
import cohere
import time
import os
import uuid
import html2text


# get free Trial API Key at https://cohere.ai/
from cred import API_key

co = cohere.Client(API_key)

In [9]:
def convert_text_2_vect(co, texts, model="embed-english-light-v2.0"):
    """Convert multiple text strings to vectors.

    Parameters
    ----------
    co : cohere.Client
        co = cohere.Client(API_key)
    texts : list of strings
        texts = [text1, text2, text3, text4, text5, text6...]
    model : str, optional
        Name of the embedding model, by default "embed-english-light-v2.0".
        The model determines the dimension of the output vectors:
        "embed-english-light-v2.0" = 1024 dim
        "embed-english-v2.0" = 4096 dim

    Returns
    -------
    list of vectors
        [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], ... ]
    """
    response = co.embed(model=model, texts=texts)
    # print('Embeddings: {}'.format(response.embeddings))
    return response.embeddings
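The function above sends every text in a single request. For larger collections it may be safer to split the list into batches. The sketch below assumes a per-request cap of 96 texts (an assumption to verify against Cohere's current documentation); `embed_in_batches` and `fake_embed` are hypothetical names introduced here, and the stand-in embedder only exists so the example runs without an API key.

```python
def embed_in_batches(embed_fn, texts, batch_size=96):
    """Embed `texts` in chunks of at most `batch_size`, preserving order.

    embed_fn : callable that takes a list of strings and returns a list of
        vectors, e.g. lambda batch: convert_text_2_vect(co, batch)
    batch_size : assumed per-request limit (96 is an assumption, not a
        value taken from this notebook)
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        # One request per slice; results are appended in input order.
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors


# Stand-in embedder (hypothetical, not Cohere): one-dimensional "vectors".
def fake_embed(batch):
    return [[float(len(t))] for t in batch]


demo = embed_in_batches(fake_embed, ["a", "bb", "ccc"], batch_size=2)
print(demo)
```

With a real client, the call would look like `embed_in_batches(lambda batch: convert_text_2_vect(co, batch), texts)`.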

## Load Data

Originally scraped from: https://www.tesla.com/ownersmanual/model3/en_us/


In [33]:
html_dir = "data/tesla3/html/"
text_dir = "data/tesla3/txt/"

texts = []
htmls = []
guids = []
links = []
for filename in os.listdir(html_dir):
    html_filename = os.path.join(html_dir, filename)
    # checking if it is a file
    if not os.path.isfile(html_filename) or html_filename[-5:] != ".html":
        continue

    with open(html_filename) as f:
        html = f.read()

    text_maker = html2text.HTML2Text()
    text_maker.ignore_links = True
    text_maker.bypass_tables = False
    text = text_maker.handle(html)
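    # Keep only the text before the marker. str.find returns -1 when the
    # marker is missing, which would otherwise cut off the last character.
    marker_pos = text.find("* Model 3 Owner's Manual")
    if marker_pos != -1:
        text = text[:marker_pos]
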
    texts.append(text)
    htmls.append(html)
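    # Drop the first 5 characters and the ".html" suffix; the remainder is
    # parsed as a UUID at insert time.
    guids.append(filename[5:-5])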
    links.append("https://www.tesla.com/ownersmanual/model3/en_us/"+filename)

    text_filename = os.path.join(text_dir, filename[:-4] + "txt")
    with open(text_filename, 'w') as file:
        file.write(text)

Now let's convert the extracted texts to vectors using the API provided by Cohere.

In [34]:
vecs = convert_text_2_vect(co, texts, model="embed-english-v2.0")
print(len(vecs), len(vecs[0]))

90 4096


# Save to Cassandra

For vector search, I use the Cassandra service offered by DataStax, which provides 5 GB of free storage on Google Cloud Platform.

To authenticate, we first need to download the secure connect bundle. The guide is here: https://docs.datastax.com/en/astra-serverless/docs/connect/secure-connect-bundle.html

After that, create an application token: https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html

Save this information to `cred.py` following the template in `cred-template.py`.


In [14]:
from cassandra import ConsistencyLevel
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster

from cred import (ASTRA_CLIENT_ID, ASTRA_CLIENT_SECRET,
                  SECURE_CONNECT_BUNDLE_PATH)

In [18]:
KEYSPACE_NAME = "demo"
TABLE_NAME = "car_manual_vectorized"

cloud_config = {"secure_connect_bundle": SECURE_CONNECT_BUNDLE_PATH}
auth_provider = PlainTextAuthProvider(ASTRA_CLIENT_ID, ASTRA_CLIENT_SECRET)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider, protocol_version=4)
session = cluster.connect()
session.execute(f"USE {KEYSPACE_NAME};")

<cassandra.cluster.ResultSet at 0x29aa05497c0>

## Prepare Table and Indexes

The vector column must be indexed so that ANN (Approximate Nearest Neighbor) search can run quickly.

A query is a point in the same space as the stored vectors, so its dimension must equal the dimension of the vector column. Given that point, Cassandra uses ANN search to find the stored vectors closest to it and returns them as the nearest neighbors.
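To make "nearest neighbors" concrete, here is a small brute-force version of the search that ANN approximates. It scores every stored vector against the query, which is exactly the full scan an ANN index is meant to avoid. Cosine similarity is used as the measure, which is an assumption on my side; the function names and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k=2):
    """Return the indexes of the k vectors most similar to `query`.

    Exact (brute-force) search: every vector is scored, then sorted.
    """
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy 2-dimensional data (made up); real vectors here have 4096 dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top = nearest_neighbors([1.0, 0.0], toy_vectors, k=2)
print(top)
```

An ANN index trades a little accuracy for speed: it may occasionally miss one of the true nearest neighbors, but it does not have to score every row.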

In [36]:
# Columns: id, link, html, text, brand, model, year, html_car_vector_4096
table_create_query = f"""
CREATE TABLE IF NOT EXISTS {TABLE_NAME} (
id uuid,
link text,
html text,
text text,
brand text,
model text,
year int,
html_car_vector_4096 VECTOR<FLOAT, 4096>, 
PRIMARY KEY ((brand, model, year), id)
);
"""
session.execute(table_create_query)

create_index_query = f"""
CREATE CUSTOM INDEX IF NOT EXISTS ann_html_car_vector_4096 ON 
{TABLE_NAME}(html_car_vector_4096) USING 'StorageAttachedIndex';
"""
session.execute(create_index_query)

<cassandra.cluster.ResultSet at 0x29aa05e40d0>
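For orientation, this is roughly the kind of query the index makes possible once rows are in the table. It is a sketch only: `build_ann_query` is a hypothetical helper, the selected columns are my choice, and I am assuming that the `ORDER BY ... ANN OF` syntax can be combined with a restriction on the partition key in your Cassandra version.

```python
def build_ann_query(table_name, vector_column, limit=3):
    """Build a CQL string that returns the `limit` nearest rows.

    The %s placeholders are filled by the driver at execution time with
    (brand, model, year, query_vector), in that order.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return (
        f"SELECT id, link FROM {table_name} "
        f"WHERE brand = %s AND model = %s AND year = %s "
        f"ORDER BY {vector_column} ANN OF %s LIMIT {int(limit)};"
    )


query = build_ann_query("car_manual_vectorized", "html_car_vector_4096", limit=5)
print(query)

# With a live session and a 4096-dim query vector `query_vec` (not run here):
# rows = session.execute(query, ("tesla", "model3", 2023, query_vec))
```

The query vector must come from the same embedding model as the stored vectors, otherwise the distances are meaningless.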

## Insert into Database

In [38]:
brand = "tesla"
model = "model3"
year = 2023
# Build the statement once; the driver fills in the %s placeholders per row.
insert_query = f"""
INSERT INTO {TABLE_NAME}
(id, link, html, text, brand, model, year, html_car_vector_4096)
VALUES
(%s, %s, %s, %s, %s, %s, %s, %s)
"""
for i in range(len(texts)):
    row = (uuid.UUID(guids[i]), links[i], htmls[i], texts[i], brand, model, year, vecs[i])
    session.execute(insert_query, row)

The insertion can be sped up by using batch processing or by uploading a CSV file in CQL. A demonstration is here: https://github.com/linhhlp/Big-Data-and-Machine-Learning
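Another option, not used in this notebook, is to send the inserts concurrently with the driver's `Session.execute_async`, which returns a future whose `result()` blocks until the request finishes. The sketch below sends the rows in small windows so that only a limited number of requests are in flight at once. `insert_rows_async`, the window size of 32, and the fake session classes are all hypothetical; the fakes are only there so the example runs without a database.

```python
def insert_rows_async(session, query, rows, window=32):
    """Send inserts in windows of at most `window` concurrent requests.

    session : object exposing execute_async(query, params), such as the
        Cassandra driver's Session
    Returns the number of rows whose insert completed.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    done = 0
    for start in range(0, len(rows), window):
        # Fire off one window of requests without waiting in between.
        futures = [
            session.execute_async(query, row)
            for row in rows[start:start + window]
        ]
        # Wait for the whole window; result() raises if an insert failed.
        for future in futures:
            future.result()
            done += 1
    return done


# Stand-ins for a dry run (hypothetical, not part of the Cassandra driver).
class FakeFuture:
    def result(self):
        return None


class FakeSession:
    def __init__(self):
        self.calls = []

    def execute_async(self, query, params):
        self.calls.append((query, params))
        return FakeFuture()


fake = FakeSession()
count = insert_rows_async(fake, "INSERT ...", [(1,), (2,), (3,)], window=2)
print(count)
```

With the real session, the call would be `insert_rows_async(session, insert_query, rows)`, where `insert_query` is the `INSERT` statement above and `rows` is a list of the tuples built in the insert loop.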

In [17]:
# Close connection to Cassandra
cluster.shutdown()