**Before semantic search**, `lexical search`, which relies on word matching, was the common approach.

Lexical search matches the words of the query against the words in the dataset/article to find answers.
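To make the idea concrete, here is a minimal sketch of word-overlap matching. The `tokenize` and `lexical_score` helpers and the two sample documents are my own illustrations, not part of any library, and real lexical engines (for example BM25-based ones) weight terms far more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_score(query, document):
    """Count how many distinct query words also appear in the document."""
    return len(tokenize(query) & tokenize(document))

# Illustrative documents, invented for this sketch
docs = [
    "Computer vision models detect objects in images and videos.",
    "OCR solutions transform scanned documents into digital text.",
]

query = "detect objects in pictures"
scores = [lexical_score(query, d) for d in docs]
best = docs[scores.index(max(scores))]
print(scores, "->", best)
```

Note that "pictures" contributes nothing to the score even though the first document talks about "images": exact word matching cannot see that the two words mean the same thing.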

`Semantic search`, by contrast, is a powerful way to search by context: it considers the actual **meaning** of the sentence.

- Semantic search uses text embeddings to map text (words, sentences, or paragraphs) into numerical representations based on its meaning.
- Texts with similar meanings land at nearby points in the embedding space, enabling meaningful comparisons.
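As a rough illustration of "nearby points", the sketch below compares hand-made vectors with cosine similarity. The three-dimensional vectors are invented for the example (real embeddings have hundreds or thousands of dimensions and come from a model), so only the relative ordering matters here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", made up purely for illustration
toy_embeddings = {
    "image":   [0.9, 0.1, 0.0],
    "picture": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_embeddings["image"], toy_embeddings["picture"])
sim_far = cosine_similarity(toy_embeddings["image"], toy_embeddings["invoice"])
print(f"image vs picture: {sim_close:.3f}")
print(f"image vs invoice: {sim_far:.3f}")
```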

So in this notebook I'll build a semantic search engine for an English article: the AI & DS services page from the [Cyshield website](https://cyshield.com/AIDS).

# The article

In [2]:
article = """
Statistical Modelling and Analysis:
At Cyshield, AI is not just a buzzword for us! We know exactly how to realize the full potential of AI to support your business. Our data scientists will quickly build and fine-tune predictive analytics solutions to provide you with actionable insights to take your business to the next level.
We will gladly take care of all the complicated infrastructure required to build and maintain AI models so that you can focus on what matters.


Big Data:
With an ever-increasing amount of data being created every day, specialized big data platforms, pipelines and analysis technique are required to handle storage and analysis of such immense volumes of data.
Cyshield big data services to efficiently acquire, transform, and analyze data. Our data scientists and engineers will help you make sense out of big data and provide you with valuable insights that can transform your business.


Computer Vision:
With the prevalence of cameras and CCTV, computer vision techniques have become indispensable to extract information from visual data. At Cyshield, we build state-of-the-art deep learning models to detect and identify any relevant information from images and videos.
Our computer vision applications include:
Object detection:
detect and identify objects of interest in images and videos.
Facial Recognition:
Recognize people and facial features in images and videos.
Activity recognition:
Recognize certain activities from video feeds.


Natural Language Processing:
Natural language processing (NLP) is one of the hottest topics in AI right now, and rightly so since it allows AI-enabled products to understand people and interact with them.
Cyshield offers state-of-the-art NLP solutions in various languages for the following applications:
Chatbots:
Deploy automated chatbot that can understand your users and respond to them in a natural way.
Machine translation:
Translate from any language to any other language on the fly.
Sentiment Analysis:
Understand how your users feel so that you can cater to their taste better.
Text summarization and keywork extraction:
Understand the gist of what your users are saying immediately without having to read through everything yourself.


OCR and Document Digitization:
Digitization of document archives is an important step in the digital transformation of modern organizations. Cyshield offers optical character recognition (OCR) solutions for various languages to transform your documents into digital format.
Our OCR solutions can handle:
Noisy scans:
With integrated image preprocessing, our system can clean up noisy and old documents to get accurate results.
Complex fonts and scripts:
Even if documents are written in a complex font, our system will be finetuned to get the best results for your documents.
Sentiment Analysis:
Understand how your users feel so that you can cater to their taste better.
Complex layouts:
No matter how your documents are laid out, our system will read every line and preserve the layout.
Multiple Languages:
Our system will read documents in almost any language, even bilingual documents.
Ambiguous text:
Our system will resolve ambiguous text and typos in your documents with an integrated text postprocessing system.
"""

# Setup

Load needed API keys and relevant Python libraries.

For this task I will use:

- [cohere](https://docs.cohere.com/) for embeddings and answer generation.
- [annoy](https://github.com/spotify/annoy) for building the vector database.

In [3]:
!pip install python-dotenv
!pip install --upgrade cohere
!pip install annoy


In [10]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [5]:
import cohere

import numpy as np
import warnings
warnings.filterwarnings('ignore')



# Chunking

I will chunk the article by paragraphs so that each chunk carries a self-contained piece of information.

In [6]:
# Split into a list of paragraphs
texts = article.split('\n\n')

# Clean up to remove empty spaces and new lines
texts = np.array([t.strip(' \n') for t in texts if t])

In [7]:
texts[:2]

array(['Statistical Modelling and Analysis:\nAt Cyshield, AI is not just a buzzword for us! We know exactly how to realize the full potential of AI to support your business. Our data scientists will quickly build and fine-tune predictive analytics solutions to provide you with actionable insights to take your business to the next level.\nWe will gladly take care of all the complicated infrastructure required to build and maintain AI models so that you can focus on what matters.',
       'Big Data:\nWith an ever-increasing amount of data being created every day, specialized big data platforms, pipelines and analysis technique are required to handle storage and analysis of such immense volumes of data.\nCyshield big data services to efficiently acquire, transform, and analyze data. Our data scientists and engineers will help you make sense out of big data and provide you with valuable insights that can transform your business.'],
      dtype='<U1019')
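If the source text has irregular spacing (for example blank lines that contain stray spaces), a regex-based split can be a little more forgiving than `split('\n\n')`. This is only an alternative sketch: `split_into_paragraphs` and the `sample` string are my own names and data, not what the cells above use.

```python
import re

def split_into_paragraphs(text):
    """Split on runs of blank (or whitespace-only) lines and drop empty chunks."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

# Small made-up sample with the same "Title:\nbody" layout as the article
sample = """
Big Data:
Pipelines for large volumes of data.


Computer Vision:
Models for images and videos.
"""

chunks = split_into_paragraphs(sample)
for chunk in chunks:
    title = chunk.split("\n", 1)[0]
    print(f"{title!r}: {len(chunk.split())} words")
```

Printing the title and word count of each chunk is a quick way to check that no section was merged with its neighbor or split in half.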

# Embeddings

We can use embedding models from `sentence-transformers` or `openai`, but here I will use `Cohere` to [embed](https://docs.cohere.com/reference/embed) each chunk with the `embed-multilingual-v3.0` embedding model.

In [12]:
co = cohere.Client(os.environ['COHERE_API_KEY'])

# Get the embeddings
response = co.embed(
    texts=texts.tolist(),
    model='embed-multilingual-v3.0',
    input_type='search_document'
).embeddings
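This article only has a handful of chunks, so one call is enough. For a larger corpus, embedding APIs usually cap how many texts a single request may carry, so the texts need to be sent in batches. The sketch below shows the batching logic only: `embed_in_batches`, `fake_embed`, and the batch size of 2 are placeholders I made up, and in practice `embed_fn` would wrap the real embedding call with whatever limit the provider documents.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=2):
    """Call embed_fn on each batch and concatenate the returned vectors in order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in embedder for demonstration only: maps each text to [length, word count]
def fake_embed(batch):
    return [[float(len(t)), float(len(t.split()))] for t in batch]

vectors = embed_in_batches(
    ["big data", "ocr", "computer vision models"], fake_embed
)
print(len(vectors), "vectors")
```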


# Build a search index

Now, instead of using an off-the-shelf `vector database`, I will use `Annoy` to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts (other options include [Faiss](https://github.com/facebookresearch/faiss), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), and [PyNNDescent](https://github.com/lmcinnes/pynndescent)).

After building the index, we can use it to retrieve the nearest neighbors either of chunks that are already in the index, or of new queries that we embed.

In [13]:
from annoy import AnnoyIndex

In [14]:
# Convert the embeddings to a NumPy array (one row per chunk)
embeds = np.array(response)

# Create the search index, passing the embedding dimension
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10) # 10 trees
search_index.save('test.ann')

True
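With so few vectors, an exact brute-force search is a handy sanity check for what an approximate index returns. The sketch below is not Annoy itself: `angular_distance` and `brute_force_nns` are my own helpers, written to follow the definition Annoy documents for its `'angular'` metric (the Euclidean distance between normalized vectors, i.e. `sqrt(2 * (1 - cos))`), and the toy vectors are invented.

```python
import math

def angular_distance(u, v):
    """Euclidean distance between the unit-normalized vectors: sqrt(2 * (1 - cos))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp at zero to guard against tiny negative values from rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos)))

def brute_force_nns(vectors, query, n):
    """Return (ids, distances) of the n vectors closest to the query."""
    scored = sorted(
        (angular_distance(vec, query), i) for i, vec in enumerate(vectors)
    )[:n]
    return [i for _, i in scored], [d for d, _ in scored]

# Made-up 2-D vectors standing in for chunk embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
ids, distances = brute_force_nns(toy_vectors, [0.9, 0.1], 2)
print(ids, [round(d, 3) for d in distances])
```

The return shape mirrors the `(ids, distances)` tuple that `get_nns_by_vector(..., include_distances=True)` gives back, so the two can be compared side by side.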

# Searching Articles

In [19]:
def search_cyshield_services(query):
    # Get the query's embedding
    query_embed = co.embed(
        texts=[query],
        model='embed-multilingual-v3.0',
        input_type='search_query'
    ).embeddings

    # Retrieve up to 10 nearest neighbors as an (ids, distances) tuple
    similar_item_ids = search_index.get_nns_by_vector(
        query_embed[0],
        10,
        include_distances=True
    )

    # Look up the matching chunks, ordered from closest to farthest
    search_results = texts[similar_item_ids[0]]

    return search_results

In [20]:
results = search_cyshield_services(
    "what does cyshild do in CV?"
)

print(results[0])

Computer Vision:
With the prevalence of cameras and CCTV, computer vision techniques have become indispensable to extract information from visual data. At Cyshield, we build state-of-the-art deep learning models to detect and identify any relevant information from images and videos.
Our computer vision applications include:
Object detection:
detect and identify objects of interest in images and videos.
Facial Recognition:
Recognize people and facial features in images and videos.
Activity recognition:
Recognize certain activities from video feeds.
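The search function above discards the distances, so even an unrelated query gets back every chunk. One possible refinement, sketched here under my own assumptions, is to convert each angular distance back to a cosine similarity (inverting `d = sqrt(2 * (1 - cos))`) and keep only the neighbors above a threshold. The helper names, the example ids and distances, and the `0.4` cutoff are all hypothetical; a real threshold would have to be tuned on actual queries.

```python
def angular_to_cosine(distance):
    """Invert d = sqrt(2 * (1 - cos)) to recover the cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0

def filter_neighbors(ids, distances, min_similarity):
    """Keep (id, similarity) pairs whose cosine similarity meets the threshold."""
    kept = []
    for item_id, dist in zip(ids, distances):
        similarity = angular_to_cosine(dist)
        if similarity >= min_similarity:
            kept.append((item_id, similarity))
    return kept

# Hypothetical values in the (ids, distances) shape returned with include_distances=True
example_ids = [2, 0, 4]
example_distances = [0.6, 1.0, 1.3]

kept = filter_neighbors(example_ids, example_distances, min_similarity=0.4)
for item_id, similarity in kept:
    print(item_id, round(similarity, 3))
```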


# Generating Answers

Now I will use Cohere's `command-r-plus` model to generate an answer from the retrieved chunk. At the time of writing it was among Cohere's most recent and most actively updated models.

In [36]:
def ask_cyshield_services(question, num_generations=1):

    # Search the text archive
    results = search_cyshield_services(question)

    # Get the top result
    context = results[0]

    # Prepare the prompt
    prompt = f"""
    Excerpt from the article titled "Cyshield AI & DS services":
    {context}
    Question: {question}

    Extract the answer to the question from the text provided.
    The answer must be in the same language as the question.
    If the text doesn't contain the answer,
    reply that the answer is not available."""

    prediction = co.generate(
        prompt=prompt,
        max_tokens=100,
        model="command-r-plus",
        temperature=0.3,
        num_generations=num_generations
    )

    return prediction.generations
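The function above passes only the top chunk to the model. If an answer may span several sections, one option is to build the prompt from the top few chunks instead. The sketch below covers the prompt assembly only; `build_prompt` is a hypothetical helper of mine, the wording is a lightly cleaned-up variant of the prompt above, and the example contexts are placeholders rather than real search results.

```python
def build_prompt(question, contexts, title="Cyshield AI & DS services"):
    """Assemble a grounded prompt from one or more retrieved chunks."""
    # Number the excerpts so the model (and the reader) can tell them apart
    numbered = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(contexts, start=1)
    )
    return (
        f'Excerpts from the article titled "{title}":\n'
        f"{numbered}\n\n"
        f"Question: {question}\n\n"
        "Extract the answer to the question from the excerpts above.\n"
        "Answer in the same language as the question.\n"
        "If the excerpts do not contain the answer, "
        "reply that the answer is not available."
    )

# Placeholder contexts; in the notebook these would be the first few search results
prompt = build_prompt(
    "Which OCR features are offered?",
    ["OCR and Document Digitization: ...", "Natural Language Processing: ..."],
)
print(prompt)
```

Passing more context costs more tokens and can dilute the answer, so the number of chunks is a trade-off worth testing rather than a fixed rule.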

In [30]:
results = ask_cyshield_services(
    "Does Cyshield provide any thing related to the optical character recognition?",
)

print(results[0].text)

Yes, Cyshield offers optical character recognition (OCR) solutions as part of its AI and DS services.


In [37]:
results = ask_cyshield_services(
    "ماذا تفعل سايشيلد في معالجة اللغة الطبيعية؟",
)

print(results[0].text)

تقدم شركة Cyshield حلول NLP المتقدمة في العديد من اللغات للتطبيقات التالية: chatbots، والترجمة الآلية، وتحليل المشاعر، وملخص النص واستخراج الكلمات الرئيسية.


In [32]:
results = ask_cyshield_services(
    "does Cyshield offers some meals ?",
)

print(results[0].text)

The answer is not available.
