# Weaviate Import

This notebook populates the `WeaviateBlogChunk` collection.

You can connect to a Weaviate instance running on localhost, or create a free 14-day sandbox on [WCS](https://console.weaviate.cloud/)!

1. (Option 1) Create a cluster on WCS and grab your cluster URL and auth key (if enabled)

1. (Option 2) Run `docker-compose up -d` with the provided Docker Compose file to start Weaviate locally on `localhost:8080`


2. Make sure the `blog` folder is available at the path used by `main_folder_path` in the Chunk Blogs section (the posts are taken from github.com/weaviate/weaviate-io; feel free to drop in a fresh copy of that folder to update the content).


3. Run this notebook to load the blog chunks into Weaviate (1643 chunks in the run recorded below; the count depends on the contents of the `blog` folder).

## Connect to Client

In [1]:
# Import Weaviate and Connect to Client
import weaviate

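# Connect to a local instance (localhost:8080 by default).
# The text2vec-cohere vectorizer used below needs a Cohere API key; for a local
# instance it can be passed through the same `headers` argument shown for WCS.
client = weaviate.connect_to_local()
# client = weaviate.connect_to_wcs(
#     cluster_url="WCS-url",  # Replace with your WCS URL
#     auth_credentials=weaviate.auth.AuthApiKey("auth-key"),  # Replace with your WCS key
#     headers={
#         "X-Cohere-Api-Key": "API-Key"  # Replace with your Cohere API key
#     }
# )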



## Create Schema

In [13]:
# CAUTION: delete_all() removes EVERY collection in the instance, along with all of its objects

# client.collections.delete_all()

In [2]:
import weaviate.classes.config as wvcc

collection = client.collections.create(
    name="WeaviateBlogChunk",
    # Vectorize objects with Cohere's multilingual embedding model
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_cohere(
        model="embed-multilingual-v3.0"
    ),
    properties=[
        wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        wvcc.Property(name="author", data_type=wvcc.DataType.TEXT),
    ],
)
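
Creating a collection whose name is already taken typically fails, so re-running the cell above on a populated instance needs either the delete cell or a guard. The following is a minimal sketch of such a guard, assuming a connected v4 client; `get_or_create_collection` is a hypothetical helper name, not part of the Weaviate client.

```python
def get_or_create_collection(client, name, **create_kwargs):
    """Return the existing collection, or create it if it is missing.

    `client` is assumed to be a connected Weaviate v4 client. Any
    `create_kwargs` (vectorizer_config, properties, ...) are forwarded
    unchanged to `client.collections.create`.
    """
    if client.collections.exists(name):
        # Reuse the collection instead of failing on a duplicate name
        return client.collections.get(name)
    return client.collections.create(name=name, **create_kwargs)

# Hypothetical usage with the schema from the cell above:
# collection = get_or_create_collection(
#     client,
#     "WeaviateBlogChunk",
#     vectorizer_config=wvcc.Configure.Vectorizer.text2vec_cohere(
#         model="embed-multilingual-v3.0"
#     ),
#     properties=[wvcc.Property(name="content", data_type=wvcc.DataType.TEXT)],
# )
```

Note that the guard does not update an existing collection: if the schema has changed, the old definition is returned as-is.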

## Chunk Blogs

In [4]:
import os
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences using regular expressions."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

def read_and_chunk_index_files(main_folder_path):
    """Read index.mdx files from subfolders, split into sentences, and chunk every 5 sentences."""
    blog_chunks = []
    for folder_name in os.listdir(main_folder_path):
        subfolder_path = os.path.join(main_folder_path, folder_name)
        if os.path.isdir(subfolder_path):
            index_file_path = os.path.join(subfolder_path, 'index.mdx')
            if os.path.isfile(index_file_path):
                with open(index_file_path, 'r', encoding='utf-8') as file:
                    content = file.read()
                    sentences = split_into_sentences(content)
                    sentence_chunks = chunk_list(sentences, 5)
                    sentence_chunks = [' '.join(chunk) for chunk in sentence_chunks]
                    blog_chunks.extend(sentence_chunks)
    return blog_chunks

# Example usage
main_folder_path = './examples/weaviate_setup/blog'
blog_chunks = read_and_chunk_index_files(main_folder_path)
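
To get a feel for how the splitter and chunker behave, here is a small self-contained check on a made-up sample string (the sample text is an assumption for illustration; the two functions are copied from the cell above). It shows that abbreviations such as "e.g." and "Mr." are not treated as sentence ends, and that the last chunk can hold fewer than five sentences.

```python
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split on whitespace that follows '.' or '?', skipping common abbreviations."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

sample = (
    "Weaviate is a vector database. It stores objects and vectors. "
    "Is it open source? Yes, e.g. on GitHub. Mr. Smith uses it daily. "
    "Chunks hold five sentences. This is sentence seven."
)

sentences = split_into_sentences(sample)
chunks = [' '.join(chunk) for chunk in chunk_list(sentences, 5)]

print(len(sentences), len(chunks))  # 7 2
print(chunks[1])  # Chunks hold five sentences. This is sentence seven.
```

The pattern only splits after `.` and `?`, so sentences ending in `!` stay attached to the following sentence.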


In [5]:
len(blog_chunks)

1643

In [6]:
blog_chunks[0]

"---\ntitle: 'Accelerating Vector Search up to +40% with Intel’s latest Xeon CPU - Emerald Rapids'\nslug: intel\nauthors: [zain, asdine, john]\ndate: 2024-03-26\nimage: ./img/hero.png\ntags: ['engineering', 'research']\ndescription: 'Boosting Weaviate using SIMD-AVX512, Loop Unrolling and Compiler Optimizations'\n---\n\n![HERO image](./img/hero.png)\n\n**Overview of Key Sections:**\n- [**Vector Distance Calculations**](#vector-distance-calculations) Different vector distance metrics popularly used in Weaviate. - [**Implementations of Distance Calculations in Weaviate**](#vector-distance-implementations) Improvements under the hood for implementation of Dot product and L2 distance metrics. - [**Intel’s 5th Gen Intel Xeon Processor, Emerald Rapids**](#enter-intel-emerald-rapids)  More on Intel's new 5th Gen Xeon processor. - [**Benchmarking Performance**](#lets-talk-numbers) Performance numbers on microbenchmarks along with simulated real-world usage scenarios. What’s the most important 

## Import Objects
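
The first chunk printed above starts with the post's front matter, and the `author` property defined in the schema is never filled by the import loop. The following is a minimal sketch of how the front matter could be separated from the body and the authors pulled out before chunking. It assumes every post uses the `---` delimiters and the `authors: [a, b]` list form seen in that one sample; `split_front_matter` is a hypothetical helper, not part of the original notebook.

```python
import re

# Front matter: a block delimited by '---' lines at the very start of the file
FRONT_MATTER = re.compile(r'\A---\n(.*?)\n---\n', re.DOTALL)

def split_front_matter(text):
    """Return (authors, body); authors is [] when no front matter is found."""
    match = FRONT_MATTER.match(text)
    if not match:
        return [], text
    header = match.group(1)
    # Assumes the inline list form, e.g. `authors: [zain, asdine, john]`
    authors_match = re.search(r'^authors:\s*\[(.*?)\]\s*$', header, re.MULTILINE)
    authors = []
    if authors_match:
        authors = [a.strip() for a in authors_match.group(1).split(',') if a.strip()]
    return authors, text[match.end():].lstrip()

sample_post = (
    "---\ntitle: 'Example'\nslug: intel\n"
    "authors: [zain, asdine, john]\ndate: 2024-03-26\n---\n\n"
    "![HERO image](./img/hero.png)\n\nBody text."
)
authors, body = split_front_matter(sample_post)
print(authors)  # ['zain', 'asdine', 'john']
```

Because `author` is a single `TEXT` property, the list would need to be joined (for example with `", ".join(authors)`) before being stored alongside each chunk's `content`.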

In [9]:
blogs = client.collections.get("WeaviateBlogChunk")

# Insert each chunk as its own object; Weaviate assigns the UUID and vectorizes the text.
# Only `content` is populated here; the `author` property from the schema is left empty.
for blog_chunk in blog_chunks:
    blogs.data.insert(
        properties={
            "content": blog_chunk
        }
    )
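
The loop above sends one request per chunk. As an alternative, here is a hedged sketch that uses the v4 client's batch context manager and derives a stable UUID from each chunk's text, so that re-running the import should overwrite objects rather than duplicate them. `chunk_uuid` and `batch_import` are hypothetical helper names; `collection` is assumed to behave like the handle returned by `client.collections.get("WeaviateBlogChunk")`.

```python
import uuid

def chunk_uuid(content, namespace=uuid.NAMESPACE_DNS):
    """Derive a deterministic UUID from the chunk text (same text, same ID)."""
    return str(uuid.uuid5(namespace, content))

def batch_import(collection, chunks):
    """Add all chunks through the collection's batch context manager.

    Assumes a Weaviate v4 collection handle exposing `batch.dynamic()`,
    `add_object(...)`, and `batch.failed_objects`.
    """
    with collection.batch.dynamic() as batch:
        for chunk in chunks:
            batch.add_object(
                properties={"content": chunk},
                uuid=chunk_uuid(chunk),
            )
    # Objects that could not be imported, if any, for inspection by the caller
    return collection.batch.failed_objects

# Hypothetical usage against the collection from this notebook:
# failed = batch_import(blogs, blog_chunks)
# print(f"{len(failed)} chunks failed to import")
```

One consequence of content-derived IDs is that two chunks with identical text collapse into a single object, which may or may not be desirable for this dataset.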