To run the following notebooks, if you haven't already done so, you need to deploy a model that uses `text-embedding-ada-002` as its base model and set the deployment name in the .env file as `AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT`, which is the variable the setup code reads.
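
Before running the setup cell, it can help to confirm that the expected settings are present. The following is a minimal sketch, not part of the original notebook: the helper `missing_env_vars` is hypothetical, and `AZURE_OPENAI_ENDPOINT` is included on the assumption that the endpoint is supplied through the environment rather than passed to the client explicitly.

```python
import os

# Variables the setup cell relies on. AZURE_OPENAI_ENDPOINT is an assumption:
# it is only needed when the endpoint is not passed to the client directly.
REQUIRED_VARS = [
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

If you load the .env file with `load_dotenv()`, call it before this check so the values are visible in `os.environ`.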


In [None]:
import os
import pandas as pd
import numpy as np
from openai import AzureOpenAI
from dotenv import load_dotenv

load_dotenv()

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],  # this is also the default, it can be omitted
    api_version="2023-05-15",
    # azure_endpoint is not passed here, so the client reads it from AZURE_OPENAI_ENDPOINT
)

model = os.environ['AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT']

SIMILARITIES_RESULTS_THRESHOLD = 0.75
DATASET_NAME = "../embedding_index_3m.json"

Next, we will load the Embedding Index into a Pandas Dataframe. The Embedding Index is stored in a JSON file called `embedding_index_3m.json`. It contains the Embeddings for each of the YouTube transcripts up until late October 2023.


In [None]:
def load_dataset(source: str) -> pd.core.frame.DataFrame:
    # Load the video session index
    pd_vectors = pd.read_json(source)
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")
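
To see what `load_dataset` does without the full index file, here is a small sketch that runs the same logic on a two-record, in-memory JSON document. The record values are invented for illustration; the field names are the ones the notebook's other cells use (`videoId`, `seconds`, `title`, `speaker`, `text`).

```python
import io

import pandas as pd


def load_dataset(source) -> pd.DataFrame:
    # Same logic as the notebook cell: read the JSON, drop the raw
    # transcript text, and replace missing values with empty strings.
    pd_vectors = pd.read_json(source)
    return pd_vectors.drop(columns=["text"], errors="ignore").fillna("")


# Hypothetical sample records, not taken from the real index.
sample_json = io.StringIO(
    """[
    {"videoId": "abc123", "seconds": 0, "title": "Intro",
     "speaker": null, "text": "full transcript of segment one"},
    {"videoId": "def456", "seconds": 90, "title": "Deep dive",
     "speaker": "Jane Doe", "text": "full transcript of segment two"}
    ]"""
)

df = load_dataset(sample_json)
print(df.columns.tolist())
print(df)
```

The `text` column is gone from the result, and the missing speaker becomes an empty string, which keeps later string formatting from failing on null values.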

Next, we will create a function called `get_videos` that searches the Embedding Index for a query. The function returns the top 5 videos that are most similar to the query. It works as follows:

1. First, a copy of the Embedding Index is created.
2. Next, the Embedding for the query is calculated using the OpenAI Embedding API.
3. Then a new column called `similarity` is created in the Embedding Index. The `similarity` column contains the cosine similarity between the query Embedding and the Embedding of each video segment.
4. Next, the Embedding Index is filtered by the `similarity` column so that it only includes videos with a cosine similarity greater than or equal to 0.75.
5. Finally, the Embedding Index is sorted by the `similarity` column and the top 5 videos are returned.
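
Steps 3 to 5 above can be tried in isolation, without calling the embeddings API. The sketch below uses hypothetical 3-dimensional vectors in place of real embeddings and a hard-coded query vector; only the scoring, filtering, and sorting mirror the description.

```python
import numpy as np
import pandas as pd


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy index: invented titles and 3-dimensional stand-ins for embeddings.
toy_index = pd.DataFrame(
    {
        "title": ["video A", "video B", "video C"],
        "ada_v2": [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]],
    }
)
query_vector = np.array([1.0, 0.0, 0.0])  # stands in for the query embedding

# Step 3: score every row against the query.
scored = toy_index.copy()
scored["similarity"] = scored["ada_v2"].apply(
    lambda v: cosine_similarity(query_vector, np.array(v))
)

# Step 4: keep rows at or above the threshold.
filtered = scored[scored["similarity"] >= 0.75]

# Step 5: sort by similarity, highest first, and keep the top rows.
top = filtered.sort_values(by="similarity", ascending=False).head(5)
print(top[["title", "similarity"]])
```

Here "video C" is orthogonal to the query and scores 0, so the threshold removes it, while "video A" and "video B" remain in that order.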


In [None]:
def cosine_similarity(a, b):
    if len(a) > len(b):
        b = np.pad(b, (0, len(a) - len(b)), 'constant')
    elif len(b) > len(a):
        a = np.pad(a, (0, len(b) - len(a)), 'constant')
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_videos(
    query: str, dataset: pd.core.frame.DataFrame, rows: int
) -> pd.core.frame.DataFrame:
    # create a copy of the dataset
    video_vectors = dataset.copy()

    # get the embeddings for the query    
    query_embeddings = client.embeddings.create(input=query, model=model).data[0].embedding

    # create a new column with the calculated similarity for each row
    video_vectors["similarity"] = video_vectors["ada_v2"].apply(
        lambda x: cosine_similarity(np.array(query_embeddings), np.array(x))
    )

    # filter the videos by similarity
    mask = video_vectors["similarity"] >= SIMILARITIES_RESULTS_THRESHOLD
    video_vectors = video_vectors[mask].copy()

    # sort by similarity, highest first, and return the top rows
    return video_vectors.sort_values(by="similarity", ascending=False).head(rows)

This function is very simple: it just prints out the results of the search query.


In [None]:
def display_results(videos: pd.core.frame.DataFrame, query: str):
    def _gen_yt_url(video_id: str, seconds: int) -> str:
        """Build a YouTube URL that starts playback at the given number of seconds."""
        return f"https://youtu.be/{video_id}?t={seconds}"

    print(f"\nVideos similar to '{query}':")
    for _, row in videos.iterrows():
        youtube_url = _gen_yt_url(row["videoId"], row["seconds"])
        print(f" - {row['title']}")
        print(f"   Summary: {' '.join(row['summary'].split()[:15])}...")
        print(f"   YouTube: {youtube_url}")
        print(f"   Similarity: {row['similarity']}")
        print(f"   Speakers: {row['speaker']}")

1. First, the Embedding Index is loaded into a Pandas Dataframe.
2. Next, the user is prompted to enter a query.
3. Then the `get_videos` function is called to search the Embedding Index for that query.
4. Finally, the `display_results` function is called to display the results to the user.
5. The user is then prompted to enter another query. This process continues until the user enters `exit`.

![](../../../../translated_images/notebook-search.1e320b9c7fcbb0bc1436d98ea6ee73b4b54ca47990a1c952b340a2cadf8ac1ca.sw.png)  

You will be prompted to enter a query. Enter a query and press enter. The application will return a list of videos that are relevant to the query. It will also return a link to the place in the video where the answer to the question is located.

Here are some queries to try out:

- What is Azure Machine Learning?
- How do convolutional neural networks work?
- What is a neural network?
- Can I use Jupyter Notebooks with Azure Machine Learning?
- What is ONNX?


In [None]:
pd_vectors = load_dataset(DATASET_NAME)

# get user query from input
while True:
    query = input("Enter a query: ")
    if query == "exit":
        break
    videos = get_videos(query, pd_vectors, 5)
    display_results(videos, query)
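
The loop above can also be written so that it is testable without a keyboard or an API call. This is a sketch of one possible variant, not the notebook's own code: `run_search_loop` and its three callable parameters are hypothetical names, and skipping blank input and accepting `EXIT` in any letter case are additions beyond the original behaviour.

```python
def run_search_loop(read_query, search, show):
    """Read queries until 'exit', passing each one to search and show.

    read_query: callable returning the next query string
    search:     callable taking a query and returning results
    show:       callable taking (results, query) and displaying them
    """
    while True:
        query = read_query().strip()
        if query.lower() == "exit":
            break
        if not query:
            # Ignore empty input instead of sending it to the search.
            continue
        show(search(query), query)


# Scripted run with stand-in callables instead of input() and the API.
scripted = iter(["  ONNX  ", "", "EXIT"])
seen = []
run_search_loop(
    read_query=lambda: next(scripted),
    search=lambda q: [q.upper()],
    show=lambda results, q: seen.append((q, results)),
)
print(seen)
```

In the notebook itself, the equivalent call would pass `lambda: input("Enter a query: ")`, a wrapper around `get_videos`, and `display_results`.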

---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
