To run the following notebooks, if you have not done so already, set your OpenAI key in the .env file as `OPENAI_API_KEY`.


In [None]:
import os
import pandas as pd
import numpy as np
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.getenv("OPENAI_API_KEY","")
assert API_KEY, "ERROR: OpenAI Key is missing"

client = OpenAI(
    api_key=API_KEY
    )

model = 'text-embedding-ada-002'

SIMILARITIES_RESULTS_THRESHOLD = 0.75
DATASET_NAME = "../embedding_index_3m.json"

Next, we load the Embedding Index into a Pandas DataFrame. The Embedding Index is stored in a JSON file called `embedding_index_3m.json`. It contains the embeddings for each of the YouTube transcripts up to late October 2023.


In [None]:
def load_dataset(source: str) -> pd.core.frame.DataFrame:
    # Load the video session index
    pd_vectors = pd.read_json(source)
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")
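
To see what `load_dataset` does to the index, here is a minimal sketch that runs the same function on a tiny in-memory JSON document instead of the real file. The two records and their values are made up for illustration; only the column names (`videoId`, `title`, `speaker`, `text`, `ada_v2`) come from the code in this notebook.

```python
import io

import pandas as pd


def load_dataset(source) -> pd.DataFrame:
    # Same logic as the notebook cell: read JSON, drop the bulky
    # "text" column if present, and replace missing values with "".
    pd_vectors = pd.read_json(source)
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")


# Two hypothetical records; the second one has no speaker.
sample_json = io.StringIO(
    """
[
  {"videoId": "abc123", "title": "Intro", "speaker": "Ada",
   "text": "full transcript", "ada_v2": [0.1, 0.2]},
  {"videoId": "def456", "title": "Deep dive", "speaker": null,
   "text": "full transcript", "ada_v2": [0.3, 0.4]}
]
"""
)

df = load_dataset(sample_json)
print(list(df.columns))        # the "text" column is gone
print(df["speaker"].tolist())  # the missing speaker became an empty string
```

Dropping `text` keeps the DataFrame smaller, and `errors="ignore"` means the function still works if the file has no such column.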

Next, we create a function called `get_videos` that searches the Embedding Index for the query. The function returns the top 5 videos that are most similar to the query. It works as follows:

1. First, a copy of the Embedding Index is created.
2. Next, the Embedding for the query is calculated using the OpenAI Embedding API.
3. Then a new column called `similarity` is added to the Embedding Index. The `similarity` column contains the cosine similarity between the query Embedding and the Embedding for each video segment.
4. Next, the Embedding Index is filtered by the `similarity` column so that it only includes videos with a cosine similarity greater than or equal to 0.75.
5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned.
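
The steps above can be tried on toy data without calling the OpenAI API. This is only a sketch: the three 2-dimensional vectors and the fixed `query_embedding` are made-up stand-ins for real `ada_v2` embeddings, and `THRESHOLD` mirrors the 0.75 value used in this notebook.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.75  # same value as SIMILARITIES_RESULTS_THRESHOLD


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Hypothetical index with tiny 2-D "embeddings".
index = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "ada_v2": [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]],
    }
)

# Stand-in for the vector the embeddings API would return for the query.
query_embedding = np.array([1.0, 0.0])

ranked = index.copy()                                    # step 1
ranked["similarity"] = ranked["ada_v2"].apply(           # step 3
    lambda x: cosine_similarity(query_embedding, np.array(x))
)
ranked = ranked[ranked["similarity"] >= THRESHOLD]       # step 4
ranked = ranked.sort_values(by="similarity", ascending=False).head(5)  # step 5

print(ranked[["title", "similarity"]])
```

Row `C` is orthogonal to the query, so its similarity is 0 and the threshold removes it. Rows `A` and `B` remain, with `A` ranked first.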


In [None]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_videos(
    query: str, dataset: pd.core.frame.DataFrame, rows: int
) -> pd.core.frame.DataFrame:
    # create a copy of the dataset
    video_vectors = dataset.copy()

    # get the embeddings for the query    
    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding

    # create a new column with the calculated similarity for each row
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))
    )

    # filter the videos by similarity
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].copy()

    # sort the videos by similarity, highest first
    video_vectors = video_vectors.sort_values(by="similarity", ascending=False)

    # return the top rows
    return video_vectors.head(rows)
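
Calling `cosine_similarity` once per row through `apply` is easy to read but can be slow on a large index. As a possible alternative, the sketch below computes all similarities in one NumPy operation. It assumes every embedding has the same length; `cosine_similarities` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def cosine_similarities(query_vec, matrix):
    """Cosine similarity between one vector and every row of a 2-D array."""
    query_vec = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Dot product of each row with the query, divided by the product of norms.
    dots = matrix @ query_vec
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms


# Made-up 2-D embeddings, one per row.
embeddings = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
scores = cosine_similarities([1.0, 0.0], embeddings)
print(scores)
```

With the DataFrame from this notebook, the matrix could be built with `np.array(dataset["ada_v2"].tolist())` and the result assigned directly to the `similarity` column.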

This function is very simple: it just prints the results of the search query.


In [None]:
def display_results(videos: pd.core.frame.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        """Build a YouTube URL that starts playback at the given offset in seconds."""
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    for _, row in videos.iterrows():
        youtube_url = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f"   YouTube: {youtube_url}")
        print(f"   Similarity: {row['similarity']}")
        print(f"   Speakers: {row['speaker']}")
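
The two formatting details in `display_results` (the timestamped link and the 15-word summary preview) can be checked in isolation. In this small sketch, `gen_yt_url` mirrors the nested helper above, `preview` is a hypothetical name for the inline summary expression, and the video ID is made up.

```python
def gen_yt_url(video_id: str, seconds: int) -> str:
    # "?t=<seconds>" makes the link open at that offset in the video.
    return f"https://youtu.be/{video_id}?t={seconds}"


def preview(summary: str, max_words: int = 15) -> str:
    # Keep only the first max_words whitespace-separated words.
    return " ".join(summary.split()[:max_words]) + "..."


url = gen_yt_url("abc123", 90)
short = preview("one two three", max_words=2)
print(url)
print(short)
```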

1. First, the Embedding Index is loaded into a Pandas DataFrame.
2. Next, the user is prompted to enter a query.
3. Then the `get_videos` function is called to search the Embedding Index for the query.
4. Finally, the `display_results` function is called to show the results to the user.
5. The user is then prompted to enter another query. This process continues until the user enters `exit`.

![](../../../../translated_images/notebook-search.1e320b9c7fcbb0bc.pcm.png)

You will be prompted to enter a query. Enter a query and press enter. The application returns a list of videos that are relevant to the query. It also returns a link to the place in the video where the answer to the question is located.

Here are some queries to try out:

- What is Azure Machine Learning?
- How do convolutional neural networks work?
- What is a neural network?
- Can I use Jupyter Notebooks with Azure Machine Learning?
- What is ONNX?


In [None]:
pd_vectors = load_dataset(DATASET_NAME)

# get user query from input
while True:
    query = input("Enter a query: ")
    if query == "exit":
        break
    videos = get_videos(query, pd_vectors, 5)
    display_results(videos, query)
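
The loop above compares the input to `exit` exactly, so ` Exit ` or an empty line would be sent to the embeddings API. One possible variation, sketched below, normalizes the input and takes the input, search, and display steps as parameters so it can be exercised without a keyboard or an API key. `run_search_loop` and its parameter names are hypothetical.

```python
def run_search_loop(read_query, search, show):
    """Keep asking for queries until the user types 'exit'."""
    while True:
        query = read_query("Enter a query: ").strip()
        if query.lower() == "exit":
            break
        if not query:
            continue  # skip empty input instead of sending it to the API
        show(search(query), query)


# Scripted stand-ins for input(), get_videos, and display_results.
scripted = iter(["What is ONNX?", "", "exit"])
shown = []
run_search_loop(
    read_query=lambda prompt: next(scripted),
    search=lambda q: [f"result for {q}"],
    show=lambda videos, q: shown.append((q, videos)),
)
print(shown)
```

In the notebook, the same function could be called with `input`, `lambda q: get_videos(q, pd_vectors, 5)`, and `display_results`.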

---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
