Hi! In this notebook we will create a semantic search app that matches on meaning and context rather than keywords.

In [24]:
# !pip install cohere
#!pip install python-dotenv
# !pip install weaviate-client
# !pip install pandas






We start by installing and importing the libraries needed for the search engine: ***Cohere*** as the base LLM and ***Weaviate*** as the search engine and vector database.
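
To make "search by meaning" concrete, here is a minimal sketch of what a vector database does at query time: it ranks stored vectors by their cosine similarity to the query vector. The three-dimensional vectors and the titles are made up for illustration. In this notebook the real embeddings come from Cohere and the ranking is done by Weaviate, so `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not part of either library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, documents):
    """Return document names sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in documents.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored]

# Toy embeddings (hypothetical values, not produced by any real model)
toy_books = {
    "space opera novel": [0.9, 0.1, 0.0],
    "cookbook": [0.0, 0.2, 0.9],
    "astronomy textbook": [0.8, 0.3, 0.1],
}
# Stands in for the embedding of a query such as "stars and planets"
toy_query = [1.0, 0.2, 0.0]

print(rank_by_similarity(toy_query, toy_books))
```

The query shares no keyword with any title, yet the two space-related entries rank above the cookbook because their vectors point in a similar direction.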

In [30]:
import cohere
from dotenv import load_dotenv
import os
import pandas as pd
import weaviate
import weaviate.classes as wvc
import weaviate.classes.config as wc

In [9]:
#Load the API keys and the Cluster URL
load_dotenv()
cohere_api = os.getenv("COHERE_API_KEY")
weaviate_api = os.getenv("WEAVIATE_API_KEY")
clusterUrl = os.getenv("WEAVIATE_CLUSTER_URL")
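
`os.getenv` returns `None` when a variable is missing, and the failure then surfaces later as a confusing authentication error. A small guard can catch this early. The sketch below is one possible approach; `require_env` is a hypothetical helper name, not part of `python-dotenv`.

```python
import os

def require_env(names):
    """Return the requested environment variables as a dict.

    Raises RuntimeError listing every variable that is unset or empty.
    """
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}
```

After `load_dotenv()`, it could be called as `require_env(["COHERE_API_KEY", "WEAVIATE_API_KEY", "WEAVIATE_CLUSTER_URL"])`.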

In [4]:
#Create a Cohere client
co = cohere.Client(cohere_api) 

In [6]:
#Create the Weaviate client and connect to the cluster that hosts the database
authConfig = weaviate.auth.AuthApiKey(weaviate_api)
client = weaviate.connect_to_wcs(
    cluster_url=clusterUrl,
    auth_credentials=authConfig,
    headers={'X-Cohere-Api-Key': cohere_api},
    skip_init_checks=True
)

Now we can verify that our client is connected.

In [17]:
print(client.is_connected())

True


<h2>Vector Database Population</h2>
Before populating the database, we create a collection (class) for Book. To keep the system stable and coherent, we first delete any existing Book collection and then recreate it. We keep only _Title_, _Categories_ and _Description_ as semantic search criteria: these are the only properties that get embedded, and every other property is marked with `skip_vectorization=True`.



In [19]:
client.collections.delete(name='Book')
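
The delete-before-create step can be wrapped in a small helper so the reset is explicit and reports whether anything was removed. This is a sketch that assumes the v4 Python client, where `client.collections.exists` and `client.collections.delete` are available; `reset_collection` is a hypothetical name introduced here.

```python
def reset_collection(client, name):
    """Delete the named collection if it exists.

    Returns True if a collection was deleted, False if there was nothing to delete.
    """
    if client.collections.exists(name):
        client.collections.delete(name)
        return True
    return False
```

Calling `reset_collection(client, "Book")` before `client.collections.create(...)` keeps the cell safe to re-run.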

In [20]:
questions = client.collections.create(
    name="Book",
    vectorizer_config=wc.Configure.Vectorizer.text2vec_cohere(),
    generative_config=wc.Configure.Generative.cohere(),
    properties=[
        wc.Property(name="title", data_type=wc.DataType.TEXT),
        wc.Property(name="isbn10", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="isbn13", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="categories", data_type=wc.DataType.TEXT),
        wc.Property(name="thumbnail", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="description", data_type=wc.DataType.TEXT),
        wc.Property(name="num_pages", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="average_rating", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="published_year", data_type=wc.DataType.TEXT, skip_vectorization=True),
        wc.Property(name="authors", data_type=wc.DataType.TEXT, skip_vectorization=True),
    ],
)

In [32]:
book_collections = client.collections.get('Book')
chunksize = 1000
# Read the CSV lazily, 1000 rows at a time
chunks = pd.read_csv("./books.csv", chunksize=chunksize)
for chunk in chunks:
    # Iterating over a DataFrame yields its column names
    for column in chunk:
        print(column)
    break  # every chunk has the same columns, so inspecting the first is enough


isbn13
isbn10
title
subtitle
authors
categories
thumbnail
description
published_year
average_rating
num_pages
ratings_count
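
The column names printed above include `subtitle` and `ratings_count`, which are not declared on the Book collection, and the schema stores every property as TEXT. One way to bridge the two is to map each row to a properties dict before inserting it. The sketch below is an assumption about how that step could look, not the notebook's actual loading code; `BOOK_PROPERTIES` and `row_to_properties` are hypothetical names, and the sample row uses made-up values.

```python
import math

# Property names declared on the Book collection
BOOK_PROPERTIES = [
    "title", "isbn10", "isbn13", "categories", "thumbnail",
    "description", "num_pages", "average_rating", "published_year", "authors",
]

def row_to_properties(row):
    """Map one CSV row (any object with a dict-style .get) to a Book properties dict.

    Values are cast to str because every property is declared as TEXT;
    missing values (None or NaN) become empty strings. Columns that are
    not part of the schema are dropped.
    """
    properties = {}
    for name in BOOK_PROPERTIES:
        value = row.get(name)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            properties[name] = ""
        else:
            properties[name] = str(value)
    return properties

# Made-up row for illustration only
sample_row = {
    "isbn13": 9780000000001,
    "title": "Example Title",
    "subtitle": float("nan"),
    "description": float("nan"),
    "ratings_count": 10,
}
sample_properties = row_to_properties(sample_row)
print(sample_properties)
```

A pandas row also supports `.get`, so the same function should work on rows taken from a chunk. Numeric columns that pandas reads as floats would be stored as strings such as `"304.0"`, which may or may not be what you want. The resulting dicts could then be passed in batches to something like `book_collections.data.insert_many(...)`.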


