To run the notebooks here, if you haven't done so already, you need to set your OpenAI key in a .env file as `OPENAI_API_KEY`.


In [None]:
import os
import pandas as pd
import numpy as np
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.getenv("OPENAI_API_KEY","")
assert API_KEY, "ERROR: OpenAI Key is missing"

client = OpenAI(
    api_key=API_KEY
    )

model = 'text-embedding-ada-002'

SIMILARITIES_RESULTS_THRESHOLD = 0.75
DATASET_NAME = "../embedding_index_3m.json"

Next, we load the Embedding Index into a Pandas DataFrame. The Embedding Index is stored in a JSON file called `embedding_index_3m.json`. It contains the embeddings for each of the YouTube transcripts up to late October 2023.


In [None]:
def load_dataset(source: str) -> pd.core.frame.DataFrame:
    # Load the video session index
    pd_vectors = pd.read_json(source)
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")
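
To see what the `drop` and `fillna` calls in `load_dataset` do, here is a minimal sketch that runs the same two steps on an in-memory JSON document. The two records and their values are hypothetical; only the field names are borrowed from columns the notebook reads later, and the real index file may contain more fields.

```python
import io

import pandas as pd

# Hypothetical two-record index (assumption: the real file is a JSON list of records).
sample_json = io.StringIO(
    '[{"videoId": "abc123", "title": "Intro", "text": "full transcript", "speaker": "Ada"},'
    ' {"videoId": "def456", "title": "Deep dive", "text": "more transcript", "speaker": null}]'
)

frame = pd.read_json(sample_json)

# Same clean-up as load_dataset: drop the bulky "text" column if present,
# then replace missing values with empty strings.
cleaned = frame.drop(columns=["text"], errors="ignore").fillna("")

print(list(cleaned.columns))
print(cleaned["speaker"].tolist())
```

Because of `errors="ignore"`, the same code also works on an index that has no `text` column at all.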

Next, we create a function called `get_videos` that searches the Embedding Index for the query. The function returns the top 5 videos most similar to the query. It works as follows:

1. First, a copy of the Embedding Index is created.
2. Next, the embedding for the query is calculated using the OpenAI Embedding API.
3. Then a new column called `similarity` is added to the Embedding Index. It holds the cosine similarity between the query embedding and the embedding for each video segment.
4. Next, the Embedding Index is filtered by the `similarity` column, keeping only videos with a cosine similarity greater than or equal to 0.75.
5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned.
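
The steps above can be tried without calling the API. The sketch below uses hypothetical 3-dimensional vectors in place of real embeddings, and a hand-written `query_embedding` in place of the API response; `toy_index` and `toy_search` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

THRESHOLD = 0.75  # same cut-off the notebook uses


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Hypothetical toy index: each row has a title and a tiny made-up "embedding".
toy_index = pd.DataFrame(
    {
        "title": ["exact", "near", "borderline", "unrelated"],
        "ada_v2": [
            [1.0, 0.0, 0.0],
            [2.0, 1.0, 0.0],
            [1.0, 1.0, 0.0],
            [0.0, 0.0, 1.0],
        ],
    }
)

# Stands in for the vector the Embedding API would return for the query.
query_embedding = np.array([1.0, 0.0, 0.0])


def toy_search(index, query_vec, rows):
    scored = index.copy()                                  # step 1: copy
    scored["similarity"] = scored["ada_v2"].apply(         # step 3: score each row
        lambda v: cosine_similarity(query_vec, np.array(v))
    )
    scored = scored[scored["similarity"] >= THRESHOLD]     # step 4: filter
    return scored.sort_values(by="similarity", ascending=False).head(rows)  # step 5


results = toy_search(toy_index, query_embedding, rows=5)
print(results[["title", "similarity"]])
```

Only the "exact" and "near" rows pass the filter: their similarities are 1.0 and about 0.894, while "borderline" scores about 0.707 and "unrelated" scores 0.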


In [None]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_videos(
    query: str, dataset: pd.core.frame.DataFrame, rows: int
) -> pd.core.frame.DataFrame:
    # create a copy of the dataset
    video_vectors = dataset.copy()

    # get the embeddings for the query    
    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding

    # create a new column with the calculated similarity for each row
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))
    )

    # filter the videos by similarity
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].copy()

    # sort the videos by similarity (highest first) and return the top rows
    return video_vectors.sort_values(by="similarity", ascending=False).head(rows)
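
`get_videos` scores one row at a time with `apply`. For a larger index, the same cosine similarities could be computed in a single matrix operation. The sketch below is an alternative, not what the notebook does; `cosine_similarities` is a hypothetical helper, and it assumes the embeddings have first been stacked into a 2-D array, for example with `np.array(video_vectors["ada_v2"].tolist())`.

```python
import numpy as np


def cosine_similarities(query_vec, matrix):
    """Cosine similarity between one query vector and every row of a 2-D array."""
    matrix = np.asarray(matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    dots = matrix @ query_vec                                   # one dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms


# Made-up 2-dimensional vectors, purely for illustration.
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
scores = cosine_similarities([1.0, 0.0], embeddings)
print(scores)
```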

The `display_results` function is very simple; it just prints the results of the search query.


In [None]:
def display_results(videos: pd.core.frame.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        """Build a YouTube URL that starts playback at the given offset in seconds."""
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    for _, row in videos.iterrows():
        youtube_url = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f"   YouTube: {youtube_url}")
        print(f"   Similarity: {row['similarity']}")
        print(f"   Speakers: {row['speaker']}")

1. First, the Embedding Index is loaded into a Pandas DataFrame.
2. Next, the user is prompted to enter a query.
3. Then, the `get_videos` function is called to search the Embedding Index for the query.
4. Finally, the `display_results` function is called to show the results to the user.
5. The user is then prompted to enter another query. This process continues until the user types `exit`.

![](../../../../translated_images/notebook-search.1e320b9c7fcbb0bc1436d98ea6ee73b4b54ca47990a1c952b340a2cadf8ac1ca.pcm.png)

You will be prompted to enter a query. Type the query and press Enter. The application returns a list of videos relevant to the query, along with a link to the point in each video where the answer to the question is located.

Try these queries:

- What is Azure Machine Learning?
- How do convolutional neural networks work?
- What is a neural network?
- Can I use Jupyter Notebooks with Azure Machine Learning?
- What is ONNX?


In [None]:
pd_vectors = load_dataset(DATASET_NAME)

# get user query from input
while True:
    query = input("Enter a query: ")
    if query == "exit":
        break
    videos = get_videos(query, pd_vectors, 5)
    display_results(videos, query)

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that machine translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For important information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
