To run the following notebooks, first set your OpenAI API key in the `.env` file as `OPENAI_API_KEY`, if you have not done so already.

In [1]:
import os
import pandas as pd
import numpy as np
from openai import OpenAI
from dotenv import load_dotenv

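# Read OPENAI_API_KEY (and any other settings) from the .env file
load_dotenv()

API_KEY = os.getenv("OPENAI_API_KEY", "")
assert API_KEY, "ERROR: OpenAI Key is missing"

client = OpenAI(api_key=API_KEY)

# Embedding model used for the query; it should match the model used to build the index
model = 'text-embedding-ada-002'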

SIMILARITIES_RESULTS_THRESHOLD = 0.75
DATASET_NAME = "../embedding_index_3m.json"

Next, we load the Embedding Index into a pandas DataFrame. The Embedding Index is stored in a JSON file called `embedding_index_3m.json` and contains the embeddings for each of the YouTube transcripts up until late October 2023.

In [2]:
def load_dataset(source: str) -> pd.core.frame.DataFrame:
    # Load the video session index from the JSON file
    pd_vectors = pd.read_json(source)
    # Drop the `text` column if present and replace missing values with ""
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")
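
As a rough illustration of what `load_dataset` does to each record, the sketch below mirrors the same two clean-up steps in plain Python. The field names come from the columns printed later in this notebook; everything else is an assumption: the list-of-records layout, the presence of a `text` field, and all sample values are placeholders, and the three-element `ada_v2` lists stand in for real 1536-dimensional `text-embedding-ada-002` vectors.

```python
import json

# Hypothetical file content (assumed list-of-records layout, placeholder values)
sample_json = """
[
  {"speaker": "Seth Juarez, Josh Lovejoy, Sarah Bird", "title": "Example title",
   "videoId": "-tJQm4mSh1s", "start": "00:06:13", "seconds": 373,
   "summary": "Example summary.", "text": "Example transcript text.",
   "ada_v2": [0.0029, -0.0124, 0.02]},
  {"speaker": null, "title": "Another example title",
   "videoId": "-tJQm4mSh1s", "start": "00:09:21", "seconds": 561,
   "summary": "Another example summary.", "text": "More transcript text.",
   "ada_v2": [0.0159, 0.0007, 0.0235]}
]
"""

records = json.loads(sample_json)

# Mirror load_dataset: drop the "text" field and replace missing values with ""
cleaned = [
    {key: ("" if value is None else value)
     for key, value in record.items()
     if key != "text"}
    for record in records
]

for record in cleaned:
    print(sorted(record), repr(record["speaker"]))
```

In the notebook itself, pandas does this work in a single chained call, and `errors="ignore"` keeps the call from failing when the file has no `text` column.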

Next, we create a function called `get_videos` that searches the Embedding Index for the query. The function returns the top `rows` video segments that are most similar to the query (this notebook asks for the top 5). The function works as follows:

1. First, a copy of the Embedding Index is created.
2. Next, the Embedding for the query is calculated using the OpenAI Embedding API.
3. Then a new column is created in the Embedding Index called `similarity`. The `similarity` column contains the cosine similarity between the query Embedding and the Embedding for each video segment.
4. Next, the Embedding Index is filtered by the `similarity` column so that it only includes video segments with a cosine similarity greater than or equal to `SIMILARITIES_RESULTS_THRESHOLD` (0.75).
5. Finally, the Embedding Index is sorted by the `similarity` column in descending order and the top `rows` results (5 in this notebook) are returned.
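
Steps 3 to 5 can be tried without an API key on toy data. The sketch below is an illustration only, not the notebook's implementation: the segment names and 2-D vectors are made up (real `text-embedding-ada-002` vectors have 1536 dimensions), and it uses plain Python instead of pandas and NumPy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-D "embeddings" keyed by a made-up segment name
segments = {
    "segment_a": [1.0, 0.0],
    "segment_b": [0.8, 0.6],
    "segment_c": [0.0, 1.0],
}
query_vector = [1.0, 0.0]  # stands in for the embedding returned by the API
THRESHOLD = 0.75
TOP_N = 5

# Step 3: score every segment against the query
scores = {name: cosine_similarity(query_vector, vec) for name, vec in segments.items()}

# Step 4: keep only segments at or above the threshold
kept = {name: score for name, score in scores.items() if score >= THRESHOLD}

# Step 5: sort by similarity, highest first, and keep the top N
top = sorted(kept.items(), key=lambda item: item[1], reverse=True)[:TOP_N]
print(top)
```

Here `segment_a` points in the same direction as the query and scores about 1.0, `segment_b` scores about 0.8, and `segment_c` is orthogonal to the query, scores 0, and is removed by the threshold.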

In [4]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_videos(
    query: str, 
    dataset: pd.core.frame.DataFrame, 
    rows: int
) -> pd.core.frame.DataFrame:
    # create a copy of the dataset
    video_vectors = dataset.copy()

    # get the embedding for the query and convert it to an array once
    query_embedding = np.array(
        client.embeddings.create(input=query, model=model).data[0].embedding
    )

    # create a new column with the calculated similarity for each row
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(query_embedding, np.array(x))
    )

    # filter the videos by similarity
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].copy()

    # sort by similarity (highest first) and return the top rows
    return video_vectors.sort_values(by="similarity", ascending=False).head(rows)

The `display_results` function is very simple: it prints the results of the search query.

In [6]:
def display_results(videos: pd.core.frame.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        """convert time in format 00:00:00 to seconds"""
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    for _, row in videos.iterrows():
        youtube_url = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f"   YouTube: {youtube_url}")
        print(f"   Similarity: {row['similarity']}")
        print(f"   Speakers: {row['speaker']}")

The final cell ties these functions together in a simple search loop:

1. First, the Embedding Index is loaded into a pandas DataFrame.
2. Next, the user is prompted to enter a query.
3. Then the `get_videos` function is called to search the Embedding Index for the query.
4. Finally, the `display_results` function is called to display the results to the user.
5. The user is then prompted to enter another query. This process continues until the user enters `exit`.

![](media/notebook_search.png)

You will be prompted to enter a query. Type a query and press Enter. The application returns a list of videos that are relevant to the query, each with a link to the place in the video where the answer to the question is located.

Here are some queries to try out:

- What is Azure Machine Learning?
- How do convolutional neural networks work?
- What is a neural network?
- Can I use Jupyter Notebooks with Azure Machine Learning?
- What is ONNX?

In [7]:
pd_vectors = load_dataset(DATASET_NAME)

# get user queries from input until the user types "exit"
while True:
    query = input("Enter a query: ")
    if query == "exit":
        break
    videos = get_videos(query, pd_vectors, 5)
    display_results(videos, query)

Enter a query:  How do convolutional neural networks work?



Videos similar to 'How do convolutional neural networks work?':
 - Data Science, Convolutional Neural Networks, and Machine Learning in the Cloud (Part 3 of 4)
   Summary: In this video, Seth Juarez continues his talk on data science, convolutional neural networks, and...
   YouTube: https://youtu.be/0TwbqkQ9pxk?t=0
   Similarity: 0.8581283383298086
   Speakers: Seth Juarez
 - Demystifying AI
   Summary: In this video, the concept of Convolutional Neural Networks (CNNs) in deep learning for computer...
   YouTube: https://youtu.be/k-K3g4FKS_c?t=183
   Similarity: 0.8496187878559393
   Speakers: Micheleen Harris
 - An Intuitive Approach to Machine Learning Models (Part 1 of 4)
   Summary: In this video, the speaker explains the concept of building convolutional neural networks (CNNs) from...
   YouTube: https://youtu.be/lPyK38sRWLI?t=549
   Similarity: 0.8391825853233789
   Speakers: Seth, Seth Juarez
 - Optimization, Machine Learning Models, and TensorFlow (Part 2 of 4)
   Summary: In

Enter a query:  an I use Jupyter Notebooks with Azure Machine Learning?



Videos similar to 'an I use Jupyter Notebooks with Azure Machine Learning?':
 - Edit and run Jupyter notebooks without leaving Azure Machine Learning studio
   Summary: Abe Omorogbe, a Program Manager on the Azure Machine Learning Team at Microsoft, explains the...
   YouTube: https://youtu.be/AAj-Fz0uCNk?t=1
   Similarity: 0.8775329820625901
   Speakers: Abe Omorogbe
 - Experimentation Using Notebooks in Azure ML with Remote Compute [Part 3/4]
   Summary: In this episode of the AI Show, the focus is on using Notebooks in Azure...
   YouTube: https://youtu.be/jey9EWKSBwM?t=0
   Similarity: 0.875548226014721
   Speakers: 
 - Get Started with Azure Machine Learning with Visual Studio Code Tools
   Summary: In this video, the speaker discusses how Azure Machine Learning can be used to track...
   YouTube: https://youtu.be/u5tqeLAWLPU?t=366
   Similarity: 0.872348075721488
   Speakers: Chris
 - Train Machine Learning Models with Azure ML in VS Code
   Summary: The video demonstrates how t

Enter a query:  What is ONNX?



Videos similar to 'What is ONNX?':
 - ONNX Runtime
   Summary: Faith Xu, a Program Manager at Microsoft, introduces ONNX Runtime, an open-source scoring engine for...
   YouTube: https://youtu.be/qy7X2JGLUC4?t=0
   Similarity: 0.8776302429565807
   Speakers: Faith Xu
 - ONNX Runtime
   Summary: Manash Goswami, Principal Program Manager for AI Frameworks at Microsoft, discusses the integration of ONNX...
   YouTube: https://youtu.be/nAyv0n5lpX0?t=0
   Similarity: 0.8719157913342135
   Speakers: Manash Goswami
 - Faster and Lighter Model Inference with ONNX Runtime from Cloud to Client
   Summary: Emma, a Senior Program Manager at Microsoft, introduces the ONNX Runtime, a high-performance inferencing and...
   YouTube: https://youtu.be/WDww8ce12Mc?t=0
   Similarity: 0.8663354368230922
   Speakers: Emma
 - ONNX Runtime speeds up Image Embedding model in Bing Semantic Precise Image Search
   Summary: In this episode of the AI Show, Vinitra Swamy from the ONNX engineering team at...
   You

KeyboardInterrupt: Interrupted by user

In [21]:
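# Column names
print(list(pd_vectors))
# First two row labels of the index
print(pd_vectors.index.tolist()[:2])
# Rows at positions 2 and 3 (positional slice, end exclusive)
print(pd_vectors[2:4])
# .loc slices by label and includes both endpoints, so 2:3 selects the same two rows
print(pd_vectors.loc[2:3, ['start', 'summary']])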

['speaker', 'title', 'videoId', 'start', 'seconds', 'summary', 'ada_v2']
[0, 1]
                                 speaker  \
2  Seth Juarez, Josh Lovejoy, Sarah Bird   
3  Seth Juarez, Josh Lovejoy, Sarah Bird   

                                               title      videoId     start  \
2  You're Not Solving the Problem You Think You'r...  -tJQm4mSh1s  00:06:13   
3  You're Not Solving the Problem You Think You'r...  -tJQm4mSh1s  00:09:21   

   seconds                                            summary  \
2      373  The video discusses the limitations of general...   
3      561  The video discusses the importance of consider...   

                                              ada_v2  
2  [0.00287682027556, -0.012365541420876001, 0.02...  
3  [0.015913352370262, 0.000721095071639, 0.02349...  
      start                                            summary
2  00:06:13  The video discusses the limitations of general...
3  00:09:21  The video discusses the importance of consider...
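
The `start` and `seconds` columns in the output above appear to hold the same offset in two forms: `00:06:13` alongside `373`, and `00:09:21` alongside `561`. The sketch below shows how one could be derived from the other. It is an assumption about how the index was built rather than something this notebook states, and `start_to_seconds` and `youtube_url` are hypothetical helper names.

```python
def start_to_seconds(start: str) -> int:
    """Convert a timestamp in HH:MM:SS format to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in start.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def youtube_url(video_id: str, start: str) -> str:
    """Build a YouTube link that starts playback at the given HH:MM:SS offset."""
    return f"https://youtu.be/{video_id}?t={start_to_seconds(start)}"

print(start_to_seconds("00:06:13"))  # 373
print(start_to_seconds("00:09:21"))  # 561
print(youtube_url("-tJQm4mSh1s", "00:06:13"))
```

Both conversions match the `seconds` values printed for rows 2 and 3, which is consistent with the assumption.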