# Using NumPy and OpenAI for Retrieval-Augmented Generation

We'll be using the OpenAI Python client. I recommend reading through the [API reference](https://platform.openai.com/docs/api-reference) on your own. Specifically, we're going to use [Chat Completions](https://platform.openai.com/docs/api-reference/chat) and [Embeddings](https://platform.openai.com/docs/api-reference/embeddings).

In [26]:
# the cells below use the pre-1.0 openai interface (openai.Embedding, openai.ChatCompletion)
!pip install "openai<1" numpy gnews scipy




## Authenticating with OpenAI

In [3]:
# first, we need to set up our OpenAI API key; you may need to create one at
# https://platform.openai.com/account/api-keys

import openai

openai.api_key = input("Please provide your OpenAI API key: ")
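
Typing the key into `input()` echoes it into the notebook. A minimal alternative sketch, assuming the key may already be stored in the conventional `OPENAI_API_KEY` environment variable: read it from there and fall back to a hidden prompt only when it is missing. The helper name `load_api_key` and its parameters are illustrative and not part of the OpenAI client.

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def load_api_key(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
    var_name: str = "OPENAI_API_KEY",
) -> str:
    """Return the API key from the environment, prompting only if it is missing."""
    key = env.get(var_name, "").strip()
    if not key:
        # getpass hides what is typed, unlike input()
        key = prompt("Please provide your OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

In the notebook this would be used as `openai.api_key = load_api_key()`.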

## Embedding Documents

In [4]:
# next, let's try embedding some text using the openai API

response = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=[
            "atom bomb",
            "nuclear weapon",
            "iron atoms",
            "fork"
            ] # each of the items in this list will be treated as a separate document
    )
response

<OpenAIObject list at 0x10f57abd0> JSON: {
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        -0.05161181837320328,
        -0.029523709788918495,
        -0.010012091137468815,
        -0.02111765369772911,
        -0.026544000953435898,
        0.006605246569961309,
        -0.03346020355820656,
        -0.017919251695275307,
        -0.004995794501155615,
        -0.004001419525593519,
        0.009793397039175034,
        0.00559720303863287,
        -0.0037963935174047947,
        0.005686047486960888,
        -0.011009883135557175,
        0.013709388673305511,
        0.02998843416571617,
        0.001507793553173542,
        -0.020133528858423233,
        -0.018753021955490112,
        0.0006317356019280851,
        -0.007128062192350626,
        0.020898958668112755,
        -0.003728051669895649,
        0.0037246346473693848,
        -0.015759646892547607,
        0.018247293308377266,
        -0.0319566801190376

## Accessing the vector representations in the OpenAI response object

Each incoming vector is just a list of Python floats. We'll convert them to [NumPy](https://numpy.org/doc/stable/) arrays so we can do fast linear algebra on them.

In [5]:
import numpy as np
# now let's unpack those values and put them into a numpy array
embeddings = np.array([data["embedding"] for data in response["data"]])
embeddings.shape

(4, 1536)

## Using NumPy to compare similarities 

In [6]:
# okay, now how do we look at the similarities?

atom_bomb_embedding = embeddings[0, :]

nuclear_weapon_embedding = embeddings[1, :]

iron_atoms_embedding = embeddings[2, :]

fork_embedding = embeddings[3, :]
# we can compare embeddings with the dot product
# (for unit-length vectors this is the same as the cosine similarity)

atom_nuclear_sim = np.dot(atom_bomb_embedding, nuclear_weapon_embedding)

atom_iron_sim = np.dot(atom_bomb_embedding, iron_atoms_embedding)

atom_fork_sim = np.dot(atom_bomb_embedding, fork_embedding)

print(atom_nuclear_sim, atom_iron_sim, atom_fork_sim)

0.8714603661503602 0.8324484530975969 0.7882684674847867
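
The dot product only equals cosine similarity when both vectors have unit length, which seems to hold for these embeddings. For vectors of arbitrary length, a small sketch of the normalized version looks like this (the function name `cosine_similarity` and the toy vectors are illustrative):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the two vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# two toy vectors pointing in the same direction but with different lengths
u = np.array([3.0, 4.0])
v = np.array([6.0, 8.0])

print(np.dot(u, v))             # raw dot product grows with vector length
print(cosine_similarity(u, v))  # cosine similarity only depends on direction
```

The raw dot product of `u` and `v` is large simply because the vectors are long, while their cosine similarity is 1 because they point the same way.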


In [11]:
# we can also compute every pairwise similarity at once with NumPy's matrix product
# NumPy can recognize A @ A.T as a symmetric Gram matrix and use a cheaper BLAS routine for it
embeddings @ embeddings.T

array([[1.00000004, 0.87146037, 0.83244845, 0.78826847],
       [0.87146037, 0.99999998, 0.79428434, 0.79248584],
       [0.83244845, 0.79428434, 1.00000008, 0.78612781],
       [0.78826847, 0.79248584, 0.78612781, 1.00000001]])
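
Once the full similarity matrix is available, finding each document's closest neighbour is a matter of masking the diagonal (every document is most similar to itself) and taking an `argmax` per row. A minimal sketch on toy unit vectors; the helper name `nearest_neighbours` is illustrative:

```python
import numpy as np


def nearest_neighbours(embeddings: np.ndarray) -> np.ndarray:
    """For each row, return the index of the most similar *other* row."""
    sims = (embeddings @ embeddings.T).astype(float)
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    return sims.argmax(axis=1)


# three toy unit vectors in 2D
toy = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
])
print(nearest_neighbours(toy))  # [1 0 1]
```

Applied to the 4x4 matrix printed above, the same logic pairs "atom bomb" with "nuclear weapon" and vice versa, while "iron atoms" lands closest to "atom bomb" and "fork" closest to "nuclear weapon".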

## Making OpenAI Model calls

In [13]:
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's hack the north?"}
  ]
)
response

# Making a news processor with RAG
## Trying out OpenAI without RAG

In [None]:
query = "Give me a summary of all the new alien hearings this week"
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": query}
  ]
)
response.choices[0].message["content"]

'I\'m sorry, but as an AI, I\'m unable to provide real-time updates or news briefs. Furthermore, I don\'t have any specific information about recent alien hearings. However, if you\'re talking about science fiction, hearings in legal context, or a governmental context like in the movie "District 9," I could certainly assist with general information about those. Please provide more details so I can assist you better.'

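**The problem**: the model's knowledge stops at its training cutoff and it has no access to real-time information, so it can't answer questions about this week's news.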

## Step 1: Gathering a dataset

Here we'll use Google News to get a bunch of different articles. I'm using the [GNews](https://github.com/ranahaani/GNews/) library to pull the titles and descriptions of recent top news stories.

In [12]:
import gnews

articles = gnews.GNews(period='7d', max_results=100).get_top_news()
articles[-20:-10]

[{'title': 'Regulatory database reveals battery capacities for all four iPhone 15 models - PhoneArena',
  'description': 'Regulatory database reveals battery capacities for all four iPhone 15 models  PhoneArenaApple gives the iPhone 15 a minor battery bump  The VergeiPhone 15 Battery Capacities Revealed in Regulatory Database  MacRumorsAll iPhone 15 models have slightly larger batteries than their predecessors, exact capacities revealed - GSMArena.com news  GSMArena.comiPhone 15 battery capacity slightly higher across all four models  9to5MacView Full Coverage on Google News',
  'published date': 'Fri, 15 Sep 2023 16:36:00 GMT',
  'url': 'https://news.google.com/rss/articles/CBMiRWh0dHBzOi8vd3d3LnBob25lYXJlbmEuY29tL25ld3MvaXBob25lLTE1LWJhdHRlcnktY2FwYWNpdGllc19pZDE1MDY5MNIBAA?oc=5&hl=en-CA&gl=CA&ceid=CA:en',
  'publisher': {'href': 'https://www.phonearena.com', 'title': 'PhoneArena'}},
 {'title': 'iPhone 15 fulfills a vision for photography shared with Steve Jobs over a decade ago - 9t

## Step 2: Embedding the dataset

In [None]:
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class EmbeddedNewsArticle:
    title: str
    description: str
    url: str
    embedding: np.ndarray

def generate_embedded_articles(gnews_response: List[Dict]):
    def embedded_article_factory(article: Dict):
        title = article["title"]
        description = article["description"]
        url = article["url"]
        embedding_list = openai.Embedding.create(
            model="text-embedding-ada-002",
            input=description
        )["data"][0]["embedding"]
        embedding = np.array(embedding_list)
        return EmbeddedNewsArticle(title=title, description=description, url=url, embedding=embedding)
    # note: this makes one API call per article
    # materialize the map as a list; a lazy map object can only be iterated once
    return list(map(embedded_article_factory, gnews_response))

embedded_articles = generate_embedded_articles(articles)
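
The function above sends one request per article. Since the embeddings endpoint accepts a list of inputs (as in the four-document example earlier), the descriptions could instead be sent in batches. Below is a sketch of the batching logic only: `embed_fn` is a hypothetical stand-in for whatever function turns a list of strings into a list of vectors, and `batch_size=32` is an arbitrary choice. A fake embedding function is used so that the example runs without an API key.

```python
from typing import Callable, Iterator, List, Sequence

import numpy as np

EmbedFn = Callable[[List[str]], List[List[float]]]


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(texts: Sequence[str], embed_fn: EmbedFn, batch_size: int = 32) -> np.ndarray:
    """Embed all texts, calling embed_fn once per batch, and stack the rows."""
    rows: List[List[float]] = []
    for batch in chunks(texts, batch_size):
        rows.extend(embed_fn(batch))
    return np.array(rows)


# fake embedding function for demonstration: records the size of each batch it receives
calls: List[int] = []

def fake_embed(batch: List[str]) -> List[List[float]]:
    calls.append(len(batch))
    return [[float(len(text)), 1.0] for text in batch]


matrix = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(matrix.shape, calls)  # (5, 2) [2, 2, 1]
```

In the notebook, `embed_fn` could wrap the same `openai.Embedding.create` call used above, passing the batch as `input` and collecting `d["embedding"]` for each item in the response's `"data"` list.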


## Step 3: Ranking the results based on a query

In [None]:
def rank(query: str, docs: List[EmbeddedNewsArticle], k: int = 10):
    embedded_query_list = openai.Embedding.create(
        model="text-embedding-ada-002",
        input=query
    )["data"][0]["embedding"]
    embedded_query = np.array(embedded_query_list)
    # naive ranking: one dot product per article; this could be batched into a single matrix product
    scored_articles = [(ea.embedding.dot(embedded_query), ea) for ea in docs]
    # sort on the score only, so tied scores never fall back to comparing the dataclass instances
    sorted_articles = sorted(scored_articles, key=lambda record: record[0], reverse=True)
    # we could also do some more filtering after we sort based on embedding scores
    return [article for _, article in sorted_articles[:k]]


search_results  = rank(query, embedded_articles, 5)
search_results

[EmbeddedNewsArticle(title='NASA UFO Report: What the UAP Study Does and Doesn’t Say - The New York Times', description="NASA UFO Report: What the UAP Study Does and Doesn’t Say  The New York TimesNASA releases UFO report on unexplained phenomena  TODAYAlien mummies in Mexico? NASA's UFO study team says don't bet on it  Space.comGravitas | 'Aliens are out there': NASA chief's admission as hunt begins for extraterrestrial life  WIONNASA identifies UFO chief despite threat concerns  The HillView Full Coverage on Google News", url='https://news.google.com/rss/articles/CBMiQ2h0dHBzOi8vd3d3Lm55dGltZXMuY29tLzIwMjMvMDkvMTQvc2NpZW5jZS9uYXNhLXVmby11YXAtcmVwb3J0Lmh0bWzSAQA?oc=5&hl=en-CA&gl=CA&ceid=CA:en', embedding=array([ 0.01093443,  0.00202107,  0.02513188, ..., -0.00773135,
         0.00024951, -0.03209742])),
 EmbeddedNewsArticle(title='Texas Senate deliberations in AG Paxton impeachment trial spill into second day - CNN', description='Texas Senate deliberations in AG Paxton impeachment tri
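
The ranking step can also be done as a single matrix-vector product instead of a Python loop. A minimal sketch, assuming the article embeddings have been stacked into one 2D array with one row per article; the helper name `top_k_indices` and the toy data are illustrative:

```python
from typing import List

import numpy as np


def top_k_indices(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 10) -> List[int]:
    """Return the row indices of the k highest-scoring documents, best first."""
    scores = doc_matrix @ query_vec  # one dot product per row
    k = min(k, len(scores))
    # negate so argsort gives descending order; a stable sort keeps tied rows in input order
    order = np.argsort(-scores, kind="stable")[:k]
    return order.tolist()


# toy unit vectors standing in for article embeddings
docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [0.8, 0.6],
])
q = np.array([1.0, 0.0])
print(top_k_indices(q, docs, k=3))  # [0, 3, 2]
```

With the notebook's data, `doc_matrix` could be built as `np.stack([a.embedding for a in embedded_articles])`, and the returned indices used to pick the matching articles out of `embedded_articles`.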

## Step 4: Injecting the results into an OpenAI call

In [None]:
def to_prompt(articles: List[EmbeddedNewsArticle]):
    descriptions = map(lambda a: a.description, articles)
    return "\n".join(descriptions)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
    {"role": "system", "content": "Help me understand these news articles based on my question"},
    {"role": "user", "content": to_prompt(search_results)},
    {"role": "user", "content": query}
  ]
)
response.choices[0].message

<OpenAIObject at 0x1325ae7b0> JSON: {
  "role": "assistant",
  "content": "The news headlines only mention two main points about the alien hearings. First, NASA has released a report on unidentified aerial phenomena (UAP) or UFOs. The report does not conclusively explain all observed phenomena, suggesting that more study is needed. This indicates that NASA acknowledges the existence of unexplainable phenomena but does not confirm extraterrestrial origin.\n\nSecondly, NASA's UFO study team advised skepticism regarding claims of alien mummies found in Mexico. They suggest that betting on the existence of such entities might not be a prudent move.\n\nFurthermore, the NASA chief admitted that extraterrestrial life could be out there, an acknowledgment that cues the beginning of a new phase in the hunt for alien life. The chief of UFO studies has been identified despite concerns about potential threats linked to the position."
}
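
The prompt above joins raw descriptions with newlines. One possible refinement, sketched here under assumptions, is to number each article, include its title, and stop adding articles once a size budget is reached so the context does not grow without bound. The helper name `build_context` is illustrative, and the `max_chars` budget is an arbitrary stand-in measured in characters rather than tokens.

```python
from typing import Iterable, Tuple


def build_context(articles: Iterable[Tuple[str, str]], max_chars: int = 4000) -> str:
    """Join (title, description) pairs into a numbered context block within a character budget."""
    parts = []
    used = 0
    for i, (title, description) in enumerate(articles, start=1):
        entry = f"[{i}] {title}\n{description}"
        if used + len(entry) > max_chars:
            break  # articles arrive best-first, so drop the lowest-ranked ones
        parts.append(entry)
        used += len(entry) + 2  # account for the blank-line separator
    return "\n\n".join(parts)


print(build_context([("First title", "First description"), ("Second title", "Second description")]))
```

With the notebook's objects, the input could be `[(a.title, a.description) for a in search_results]`, and the result would take the place of `to_prompt(search_results)` in the first user message.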