To run the following notebooks, if you have not already done so, set your OpenAI key in the `.env` file as `OPENAI_API_KEY`.


In [None]:
import os
import pandas as pd
import numpy as np
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.getenv("OPENAI_API_KEY","")
assert API_KEY, "ERROR: OpenAI Key is missing"

client = OpenAI(
    api_key=API_KEY
    )

model = 'text-embedding-ada-002'

SIMILARITIES_RESULTS_THRESHOLD = 0.75
DATASET_NAME = "../embedding_index_3m.json"

Next, we load the Embedding Index into a Pandas DataFrame. The Embedding Index is stored in a JSON file named `embedding_index_3m.json`. It contains the embeddings for each of the YouTube transcripts up to late October 2023.
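
To make the structure concrete, here is a minimal sketch of what one record in the index might look like. The field names are the ones the code in this notebook reads (`ada_v2`, `videoId`, `seconds`, `title`, `summary`, `speaker`, `text`); every value is a placeholder, the list-of-records layout is an assumption, and the 3-element vector stands in for a real `text-embedding-ada-002` vector, which has 1536 dimensions.

```python
import json

# Hypothetical record: field names come from the code in this notebook,
# all values are placeholders.
sample_index = [
    {
        "videoId": "VIDEO_ID",
        "seconds": 120,
        "title": "Example session title",
        "speaker": "Example Speaker",
        "summary": "A short summary of this transcript segment.",
        "text": "Full transcript text of the segment.",
        "ada_v2": [0.12, -0.03, 0.88],  # toy vector; real ones are much longer
    }
]

# Round-trip through JSON, the format the index file uses.
records = json.loads(json.dumps(sample_index))

# Mirror load_dataset: the raw transcript text is not needed for search.
for record in records:
    record.pop("text", None)

print(sorted(records[0].keys()))
```

Dropping `text` keeps the in-memory table smaller; the search itself only needs the vector and the display fields.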


In [None]:
def load_dataset(source: str) -> pd.core.frame.DataFrame:
    # Load the video session index
    pd_vectors = pd.read_json(source)
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")

Next, we create a function called `get_videos` that searches the Embedding Index for the query you provide. The function returns the videos that are most similar to your query (the top 5 in this notebook). It works as follows:

1. First, a copy of the Embedding Index is created.
2. Next, the embedding for your query is calculated using the OpenAI Embedding API.
3. Then a new column named `similarity` is created in the Embedding Index. The `similarity` column holds the cosine similarity between the query embedding and the embedding of each video segment.
4. Next, the Embedding Index is filtered on the `similarity` column so that it includes only videos with a cosine similarity greater than or equal to 0.75.
5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned.
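
The steps above can be tried on toy data before spending any API calls. The sketch below uses made-up 3-dimensional vectors in place of real embeddings and a hypothetical `top_matches` helper; it follows the same score, filter, sort, and truncate sequence using only the standard library.

```python
import math

THRESHOLD = 0.75  # same value as SIMILARITIES_RESULTS_THRESHOLD

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, items, rows, threshold=THRESHOLD):
    """Score every item, keep those at or above the threshold, return the best `rows`."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in items]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:rows]

# Toy "embeddings": made-up titles and vectors, not real data.
items = [
    ("same direction", [2.0, 0.0, 0.0]),
    ("close", [1.0, 0.5, 0.0]),
    ("orthogonal", [0.0, 1.0, 0.0]),
]
results = top_matches([1.0, 0.0, 0.0], items, rows=5)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

Only the direction of a vector matters here, not its length, which is why the first item scores a full match even though it is twice as long as the query vector. The orthogonal item falls below the threshold and is dropped.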


In [None]:
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_videos(
    query: str, dataset: pd.core.frame.DataFrame, rows: int
) -> pd.core.frame.DataFrame:
    # create a copy of the dataset
    video_vectors = dataset.copy()

    # get the embeddings for the query    
    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding

    # create a new column with the calculated similarity for each row
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))
    )

    # filter the videos by similarity
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].copy()

    # sort the videos by similarity, highest first, and return the top rows
    return video_vectors.sort_values(by="similarity", ascending=False).head(rows)

This function is very simple; it just prints the search results.


In [None]:
def display_results(videos: pd.core.frame.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        """Build a YouTube URL that starts playback at the given offset in seconds."""
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    for _, row in videos.iterrows():
        youtube_url = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f"   YouTube: {youtube_url}")
        print(f"   Similarity: {row['similarity']}")
        print(f"   Speakers: {row['speaker']}")

1. First, the Embedding Index is loaded into a Pandas DataFrame.
2. Next, the user is prompted to enter a query.
3. Then the `get_videos` function is called to search the Embedding Index for that query.
4. Finally, the `display_results` function is called to show the results to the user.
5. The user is then prompted to enter another query. This process continues until the user enters `exit`.
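
An interactive prompt can be awkward to rerun or test. As a minimal sketch, the same control flow can be driven from a list of queries. `run_queries`, `search`, and `show` are hypothetical names that are not part of this notebook; `search` and `show` stand in for `get_videos` and `display_results`.

```python
from typing import Callable, Iterable, List

def run_queries(
    queries: Iterable[str],
    search: Callable[[str], list],
    show: Callable[[list, str], None],
) -> List[str]:
    """Process queries in order and stop at 'exit', like the interactive loop."""
    handled = []
    for query in queries:
        if query == "exit":
            break
        show(search(query), query)
        handled.append(query)
    return handled

# Stand-ins for get_videos and display_results so the sketch runs on its own.
def fake_search(query: str) -> list:
    return [f"video about {query}"]

def fake_show(videos: list, query: str) -> None:
    print(f"{query!r} -> {videos}")

handled = run_queries(["What is ONNX?", "exit", "never reached"], fake_search, fake_show)
```

With the real functions, the call might look like `run_queries(my_queries, lambda q: get_videos(q, pd_vectors, 5), display_results)`, where `my_queries` is any list of strings.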

![](../../../../translated_images/notebook-search.1e320b9c7fcbb0bc1436d98ea6ee73b4b54ca47990a1c952b340a2cadf8ac1ca.sw.png)

You will be prompted to enter a query. Type your query and press Enter. The application returns a list of videos relevant to your query, along with a link that jumps directly to the point in each video where the answer can be found.

Here are some queries to try:

- What is Azure Machine Learning?
- How do convolutional neural networks work?
- What is a neural network?
- Can I use Jupyter Notebooks with Azure Machine Learning?
- What is ONNX?


In [None]:
pd_vectors = load_dataset(DATASET_NAME)

# get user query from input
while True:
    query = input("Enter a query: ")
    if query == "exit":
        break
    videos = get_videos(query, pd_vectors, 5)
    display_results(videos, query)


---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
