# Using NumPy and OpenAI for Retrieval-Augmented Generation

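We'll be using the OpenAI Python client. I recommend reading through the [API reference](https://platform.openai.com/docs/api-reference) on your own. Specifically, we're going to use [Chat Completions](https://platform.openai.com/docs/api-reference/chat) and [Embeddings](https://platform.openai.com/docs/api-reference/embeddings). Note that the cells below use the pre-1.0 interface of the `openai` package (`openai.Embedding.create`, `openai.ChatCompletion.create`), which was removed in version 1.0, so you'll need `openai<1` to run them unchanged.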

In [None]:
# First, set up the OpenAI API key. You may need to create one at
# https://platform.openai.com/account/api-keys

import openai

openai.api_key = input("Please provide your OpenAI API key: ")

In [12]:
# next, let's try embedding some text using the openai API

response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=["atom bomb", "nuclear weapon", "iron atoms", "fork"]
    )
response

<OpenAIObject list at 0x13db160f0> JSON: {
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.05158431455492973,
        -0.029550950974225998,
        -0.009991556406021118,
        -0.021090248599648476,
        -0.02655758522450924,
        0.006680401042103767,
        -0.033487431704998016,
        -0.01796019822359085,
        -0.005029949359595776,
        -0.004025326110422611,
        0.009772863239049911,
        0.0056176879443228245,
        -0.0038271353114396334,
        0.0056791952811181545,
        -0.011023515835404396,
        0.013736682012677193,
        0.030015675351023674,
        0.0014992461074143648,
        -0.020188137888908386,
        -0.01878029853105545,
        0.0006338692619465292,
        -0.007100702729076147,
        0.02091255970299244,
        -0.0036938688717782497,
        0.003704120172187686,
        -0.015759596601128578,
        0.01820622943341732,
        -0.03195657953619

In [13]:
import numpy as np
# now let's unpack those values and put them into a numpy array
embeddings = np.array([data["embedding"] for data in response["data"]])
embeddings.shape

(4, 1536)

In [15]:
# okay, now how do we look at the similarities?

atom_bomb_embedding = embeddings[0, :]

nuclear_weapon_embedding = embeddings[1, :]

iron_atoms_embedding = embeddings[2, :]

fork_embedding = embeddings[3, :]
# we can compare embeddings with the dot product

atom_nuclear_sim = np.dot(atom_bomb_embedding, nuclear_weapon_embedding)

atom_iron_sim = np.dot(atom_bomb_embedding, iron_atoms_embedding)

atom_fork_sim = np.dot(atom_bomb_embedding, fork_embedding)

print(atom_nuclear_sim, atom_iron_sim, atom_fork_sim)

0.8714012539301463 0.8324391483320084 0.7882175741362947
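
A note on why the plain dot product works here: as far as I know, OpenAI returns `text-embedding-ada-002` vectors normalized to unit length, and for unit-length vectors the dot product is the same thing as cosine similarity. The scores above are all fairly high (0.79 to 0.87), so it's the relative ordering that matters, not the absolute value. Here's a minimal sketch of that equivalence using toy vectors (the values are made up, not real embeddings):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors of any length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings (hypothetical values).
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Scale both vectors to unit length.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos = cosine_similarity(a, b)
dot_unit = float(np.dot(a_unit, b_unit))

# For unit-length vectors the two numbers agree.
print(cos, dot_unit)
```

If you ever switch to an embedding model that doesn't normalize its output, `cosine_similarity` is the safer thing to call.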


# Making a news processor with RAG
## Trying out OpenAI without RAG

In [3]:
query = "Give me a summary of all the new alien hearings this week??"
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": query}
  ]
)
response.choices[0].message["content"]

"I'm an AI, unfortunately, I don't have real-time capabilities to pull the latest news updates or specific dates. However, I can provide you with a generic overview of how alien hearings might occur based on documentary talks, presumptions, and science fiction material.\n\nDuring an alien hearing, parties typically discuss evidence, sightings, reported encounters, or potential policies regarding extraterrestrial life. Experts such as astronomers, scientists, ufologists, former military or government officials often take part in these discussions. They may review things like recent astrobiology research or testimonies about unexplained aerial phenomena. Any new breakthroughs in SETI (Search for Extraterrestrial Intelligence) or news on space exploration from agencies such as NASA may also be covered. These hearings can sometimes involve debates about the societal and ethical implications of contact with extraterrestrial intelligence. \n\nTo get specific information about the latest alie

**The problem**: the model isn't retrained in real time. It only knows what was in its training data, so it can't tell us anything about this week's news.

## Step 1: Gathering a dataset

Here we'll use Google News to get a bunch of different articles. I'm using the [GNews](https://github.com/ranahaani/GNews/) library to pull the titles and descriptions of some recent top news stories.

In [4]:
import gnews

articles = gnews.GNews(period='7d', max_results=10).get_top_news()
articles[0]

{'title': 'Hunter Biden indicted on federal gun charges - NBC News',
 'description': 'Hunter Biden indicted on federal gun charges  NBC NewsView Full Coverage on Google News',
 'published date': 'Thu, 14 Sep 2023 19:38:45 GMT',
 'url': 'https://news.google.com/rss/articles/CBMiYmh0dHBzOi8vd3d3Lm5iY25ld3MuY29tL3BvbGl0aWNzL3BvbGl0aWNzLW5ld3MvaHVudGVyLWJpZGVuLWluZGljdGVkLWZlZGVyYWwtZ3VuLWNoYXJnZXMtcmNuYTM5NjIz0gEqaHR0cHM6Ly93d3cubmJjbmV3cy5jb20vbmV3cy9hbXAvcmNuYTM5NjIz?oc=5&hl=en-US&gl=US&ceid=US:en',
 'publisher': {'href': 'https://www.nbcnews.com', 'title': 'NBC News'}}

## Step 2: Embedding the dataset

In [5]:
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class EmbeddedNewsArticle:
    title: str
    description: str
    url: str
    embedding: np.ndarray

def generate_embedded_articles(gnews_response: List[Dict]) -> List[EmbeddedNewsArticle]:
    def embedded_article_factory(article: Dict) -> EmbeddedNewsArticle:
        title = article["title"]
        description = article["description"]
        url = article["url"]
        # For now we embed only the description; might want to do some more here
        chunk_to_embed = description
        embedding_list = openai.Embedding.create(
            model="text-embedding-ada-002",
            input=chunk_to_embed
        )["data"][0]["embedding"]
        embedding = np.array(embedding_list)
        return EmbeddedNewsArticle(title=title, description=description, url=url, embedding=embedding)
    # Materialize into a list: a bare map object can only be iterated once
    return list(map(embedded_article_factory, gnews_response))

embedded_articles = generate_embedded_articles(articles)
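
The function above makes one API call per article. The embeddings endpoint also accepts a list of inputs (that's what the first example in this notebook does), so the calls can be batched. Below is a rough sketch of that idea. `embed_in_batches` and `fake_embed_batch` are names I made up for illustration, and the fake embedder just returns dummy numbers so the cell runs without an API key.

```python
from typing import Callable, List, Sequence

import numpy as np

def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> np.ndarray:
    """Embed texts a batch at a time and stack the results row by row."""
    rows: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        rows.extend(embed_batch(batch))
    return np.array(rows)

def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for the API: one 2-d vector per input text.
    # With the client used in this notebook, the real version would be roughly
    #   [d["embedding"] for d in openai.Embedding.create(
    #       model="text-embedding-ada-002", input=batch)["data"]]
    return [[float(len(text)), 1.0] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors.shape)
```

Passing the embedder in as a function also makes it easy to test the surrounding code without spending API credits.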


## Step 3: Ranking the results based on a query

In [11]:
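def rank(query: str, docs: List[EmbeddedNewsArticle], k: int = 10) -> List[EmbeddedNewsArticle]:
    embedded_query_list = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=query
    )["data"][0]["embedding"]
    embedded_query = np.array(embedded_query_list)
    # Naive ranking: score each article by the dot product of its ada-002
    # embedding with the query embedding
    scored_articles = [(ea.embedding.dot(embedded_query), ea) for ea in docs]
    # Sort on the score only; comparing whole tuples would fall back to comparing
    # the article objects themselves on a tied score and raise an error
    sorted_articles = sorted(scored_articles, key=lambda record: record[0], reverse=True)
    return [ea for _, ea in sorted_articles[:k]]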


search_results = rank(query, embedded_articles, 3)
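
The ranking step can also be done in one shot with NumPy: stack the document embeddings into a matrix, multiply by the query vector, and take the indices of the largest scores. This is a sketch with tiny made-up vectors rather than real embeddings, and `top_k_indices` is a hypothetical helper, not part of the notebook above.

```python
import numpy as np

def top_k_indices(doc_matrix: np.ndarray, query_vec: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the row indices of the k highest dot-product scores, best first."""
    scores = doc_matrix @ query_vec  # one score per document row
    # Negate so that argsort's ascending order becomes descending by score;
    # a stable sort keeps the original order among tied scores.
    order = np.argsort(-scores, kind="stable")
    return order[:k]

# Three toy "documents" and a toy "query" (hypothetical values).
doc_matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
])
query_vec = np.array([0.0, 1.0])

top = top_k_indices(doc_matrix, query_vec, k=2)
print(top)
```

With real data, `doc_matrix` would be something like `np.stack([ea.embedding for ea in embedded_articles])`, and the returned indices would point back into `embedded_articles`.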

## Step 4: Injecting the results into an OpenAI call

In [12]:
def to_prompt(articles: List[EmbeddedNewsArticle]):
    descriptions = map(lambda a: a.description, articles)
    return "\n".join(descriptions)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
    {"role": "system", "content": "Help me understand these news articles based on my question"},
    {"role": "user", "content": to_prompt(search_results)},
    {"role": "user", "content": query}
  ]
)
response.choices[0].message

<OpenAIObject at 0x137430230> JSON: {
  "role": "assistant",
  "content": "The information provided does not contain details about any alien hearings in Mexico this week. However, it does note that NASA has announced the findings of a study on Unidentified Aerial Phenomena (UAP), commonly known as UFOs, and plans to appoint a chief for ongoing UFO research. Currently, no evidence of aliens has been found. This is based on the report's headlines from various sources such as Reuters, BBC, LiveNOW from FOX, and Yahoo News."
}
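
One possible variation on `to_prompt`: right now only the descriptions are passed along, so the model sees publisher names only when they happen to appear in the description text. Including the title and URL, and numbering each article, might make it easier for the model to say which article a claim came from. Here's a sketch of that. `Article` is a simplified stand-in for `EmbeddedNewsArticle` (no embedding field), `to_numbered_prompt` is a name I made up, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Article:
    # Hypothetical stand-in for EmbeddedNewsArticle without the embedding.
    title: str
    description: str
    url: str

def to_numbered_prompt(articles: List[Article]) -> str:
    """Format articles as numbered blocks separated by blank lines."""
    blocks = []
    for i, article in enumerate(articles, start=1):
        blocks.append(
            f"[{i}] {article.title}\n{article.description}\nSource: {article.url}"
        )
    return "\n\n".join(blocks)

# Invented sample data, just to show the layout.
sample = [
    Article("Title A", "Description A", "https://example.com/a"),
    Article("Title B", "Description B", "https://example.com/b"),
]
prompt = to_numbered_prompt(sample)
print(prompt)
```

The resulting string would go into the first `user` message in place of `to_prompt(search_results)`.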