# Search Augmentation

### with Multiple Query Generation and Semantic Reranking

Searching for information can be challenging. We can leverage the completions API and embeddings to help us sift through the noise. In this notebook, we will use the completions API to generate search queries given a user's question, and then rerank the results using semantic similarity to a hypothetical answer.

We can break down this process into three steps:

**1. Search**

- User asks a question
- Model generates a list of queries
- Search queries are executed in parallel

**2. Re-rank**

- Model generates an ideal answer by hallucination
- Search results are ranked based on semantic similarity to the ideal answer

**3. Answer**

- Given the top search results, the model attempts to answer the user question, including references and links.

Let's dive into it! We will use the [News API](https://newsapi.org/) as an example domain to search over.

## Setup

This notebook needs two API keys: an OpenAI API key and a News API key. Once you have your keys, you can set them as environment variables in a `.env` file in the same directory as this notebook. The `.env` file should look like this:

```
NEWS_API_KEY=your_api_key
OPENAI_API_KEY=your_openai_api_key
```
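
A missing key usually only surfaces later as a confusing API error. The sketch below fails early instead. `require_env` is a hypothetical helper name, not part of the original notebook, and it assumes the variables have already been loaded (for example by `dotenv.load_dotenv()`).

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable: {name}. Add it to your .env file."
        )
    return value


# Possible usage after dotenv.load_dotenv():
# news_api_key = require_env("NEWS_API_KEY")
```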


In [1]:
# Dependencies
import openai
from tqdm import tqdm
import os
import dotenv
import requests
import json
from datetime import date, timedelta
from numpy import dot
from IPython import display


# Load environment variables
dotenv.load_dotenv()

news_api_key = os.getenv("NEWS_API_KEY")


# Helper functions
def json_gpt(prompt):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Output only valid JSON"},
            {"role": "user", "content": prompt},
        ],
        temperature=1,
    )

    text = completion.choices[0].message.content
    parsed = json.loads(text)

    return parsed


def embedding(input):
    response = openai.Embedding.create(
        model="text-embedding-ada-002", input=input)
    return [data.embedding for data in response.data]
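
`json_gpt` calls `json.loads` directly on the model output, which raises if the model wraps its JSON in a Markdown fence or adds a sentence around it. The following is a minimal sketch of a more tolerant parser. `parse_json_loose` is a hypothetical name, and the fallback (taking the span from the first `{` to the last `}`) is an assumption that only suits responses containing a single JSON object.

```python
import json


def parse_json_loose(text: str):
    """Parse JSON from model output, tolerating code fences and surrounding prose."""
    try:
        # Fast path: the output is already valid JSON
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Fallback: keep only the span from the first "{" to the last "}"
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start : end + 1])
```

Inside `json_gpt`, `json.loads(text)` could be swapped for `parse_json_loose(text)` if malformed responses turn out to be a problem in practice.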


## 1. Search

Let's first generate a set of queries given a user question.


In [2]:
# User asks a question
USER_QUESTION = "What has happened recently in San Francisco?"

# Model generates a list of queries
PROMPT = f"""
Generate an array of search queries that are relevant to this question. 
Use a variation of related keywords for the queries, trying to be as general as possible.
Include as many queries as you can think of, including and excluding terms. 
For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2']. 
Be creative. The more queries you include, the more likely you are to find relevant results.

User question: {USER_QUESTION}

Format: {{"queries": ["query_1", "query_2", "query_3"]}}
Maximum 5 queries
"""

queries = json_gpt(PROMPT)["queries"]

queries


['San Francisco recent events',
 'news in San Francisco',
 'San Francisco happenings',
 'recent developments in San Francisco',
 'Happening in San Francisco today']

The queries look good. Let's run the search!


In [3]:
def search_news(query: str):
    # get date 1 week ago
    one_week_ago = (date.today() - timedelta(weeks=1)).strftime("%Y-%m-%d")

    response = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": query,
            "apiKey": news_api_key,
            "pageSize": 50,
            "sortBy": "relevancy",
            "from": one_week_ago,
        },
    )

    return response.json()


articles = []

for query in tqdm(queries):
    articles = articles + search_news(query)["articles"]


100%|██████████| 5/5 [00:02<00:00,  2.30it/s]
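
The overview describes the searches as running in parallel, while the loop above issues them one at a time. One way to run them concurrently is a thread pool, sketched below. `search_all` is a hypothetical helper, and it assumes the search function returns a dict with an `"articles"` list, as `search_news` does for a successful response.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def search_all(
    queries: Iterable[str],
    search_fn: Callable[[str], dict],
    max_workers: int = 5,
) -> list:
    """Run search_fn for every query in a thread pool and merge the articles."""
    queries = list(queries)
    if not queries:
        return []

    # pool.map returns responses in the same order as the input queries
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        responses = list(pool.map(search_fn, queries))

    articles = []
    for response in responses:
        # Responses without an "articles" key (e.g. errors) contribute nothing
        articles.extend(response.get("articles", []))
    return articles


# Possible usage with the notebook's function:
# articles = search_all(queries, search_news)
```

Threads fit here because each request spends most of its time waiting on the network. Rate limits on the API may call for a smaller `max_workers`.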


In [4]:
print("Number of articles:", len(articles))
print("Top 5 articles:", "\n")

for article in articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print()


Number of articles: 185
Top 5 articles: 

Title: Samsung confirms Galaxy Z Flip 5, Fold 5 launch details
Description: Samsung has announced when and where the next Galaxy Unpacked event will take place. It's here where the company will unveil its next foldables.
Content: <ul><li>Samsung has announced that the launch of its next foldable phones will take place in Korea.<...

Title: AI jobs with mind-blowing paychecks of $375K a year
Description: When we’re talking AI and jobs, it’s easy to be nervous that the tech will make yours obsolete. Scan this list to see if AI might change the way you work.
Content: Theres no question that artificial intelligence is changing our lives. A bot that sounds almost huma...

Title: US culture wars come to baseball as MLB celebrates Pride month
Description: The league has made efforts to welcome LGBTQ+ people into ballparks but recent pushback has shown that large parts of the sport remain conservativeWhen the Los Angeles Dodgers arranged their latest a

There is a lot of noise in these results. Let's now use re-ranking and the completions model to synthesize a good final answer out of all these news articles.
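
Because several similar queries hit the same index, the combined list may also contain the same article more than once. A minimal sketch of de-duplication before re-ranking follows. `dedupe_articles` is a hypothetical helper, and using the `url` field (falling back to `title`) as the identity of an article is an assumption.

```python
def dedupe_articles(articles: list) -> list:
    """Drop repeated articles, keeping the first occurrence of each URL."""
    seen = set()
    unique = []
    for article in articles:
        # Identify an article by its URL, or by its title if the URL is missing
        key = article.get("url") or article.get("title")
        if key and key in seen:
            continue
        if key:
            seen.add(key)
        unique.append(article)
    return unique


# Possible usage:
# articles = dedupe_articles(articles)
```

Removing duplicates first also means fewer texts to embed in the next step.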

## 2. Re-rank

Let's first generate a hypothetical answer using the completions API.


In [5]:
HA_PROMPT = f"""
Generate a hypothetical answer to the user's question. This answer will be used to rank search results. Pretend you have all the information you need to answer.

User question: {USER_QUESTION}

Format: {{"hypotheticalAnswer": "hypothetical answer text"}}
"""

hypothetical_answer = json_gpt(HA_PROMPT)["hypotheticalAnswer"]

hypothetical_answer


"Recently in San Francisco, the city has implemented a new COVID-19 vaccine mandate for indoor activities and events. Additionally, the Golden State Warriors have started their NBA season, playing games at the Chase Center arena. There have also been ongoing discussions about the city's affordable housing crisis and the need for more solutions to address it."

Now, let's generate embeddings for the search results and the hypothetical answer.


In [6]:
hypothetical_answer_embedding = embedding(hypothetical_answer)
article_embeddings = embedding(
    [
        f"{article['title']} {article['description']} {article['content'][0:100]}"
        for article in articles
    ]
)

# Calculate cosine similarity (a plain dot product, which assumes unit-length embeddings)
cosine_similarities = []
for article_embedding in article_embeddings:
    cosine_similarities.append(
        dot(hypothetical_answer_embedding, article_embedding)[0])

cosine_similarities[0:10]


[0.7181421473937228,
 0.7047313247008025,
 0.7833773244357916,
 0.7229598385205838,
 0.7561981577536391,
 0.7978369364374425,
 0.765459048492688,
 0.768682248268667,
 0.7719928527938629,
 0.7750573135338864]
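
The cell above uses a plain dot product as the similarity score. That equals cosine similarity only when both vectors have length 1, which is our understanding of how `text-embedding-ada-002` embeddings are returned. If you swap in embeddings that are not normalized, an explicit version such as the sketch below may be safer. `cosine_similarity` here is a hypothetical helper written for illustration, not a function from the original notebook.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two vectors, normalizing both explicitly."""
    # ravel() flattens inputs such as [[...]] (a batch of one) to a single vector
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        # A zero vector has no direction; treat it as unrelated
        return 0.0
    return float(np.dot(a, b) / denominator)


# Possible usage with the notebook's variables:
# cosine_similarities = [
#     cosine_similarity(hypothetical_answer_embedding, e) for e in article_embeddings
# ]
```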

In [7]:
scored_articles = zip(articles, cosine_similarities)

# Sort articles by cosine similarity
sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)

# Print top 5 articles
print("Top 5 articles:", "\n")

for article, score in sorted_articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print("Score:", score)
    print()


Top 5 articles: 

Title: San Francisco mayor says city needs to take action on drug crisis, even if services not accepted - ABC7 News Bay Area
Description: <ol><li>San Francisco mayor says city needs to take action on drug crisis, even if services not accepted  ABC7 News Bay Area
</li><li>London Breed urges Biden to ramp up federal aid in fentanyl crisis  Axios
</li><li>San Francisco Mayor London Breed discuss…
Content: We use cookies and data to<ul><li>Deliver and maintain Google services</li><li>Track outages and pro...
Score: 0.8246179518912831

Title: San Francisco man says he's witnessing 'collapse' of Western civilization a month after Newsom promised aid
Description: San Francisco has 'become a fourth world country within a first world country,' Gen Z activist says One month after California Gov. Gavin Newsom promised to crack down on San Francisco's open-air drug markets, a Gen Z activist says far-left politics have made…
Content: Skip to comments.
San Francisco man says he's w

Awesome! These results look a lot more relevant to our original query. Now, let's use the top 50 results to generate a final answer.

## 3. Answer


In [10]:
formatted_top_results = [
    {
        "title": article["title"],
        "description": article["description"],
        "url": article["url"],
    }
    for article, _score in sorted_articles[0:50]
]

ANSWER_PROMPT = f"""
Generate an answer to the user's question based on the given search results. 
TOP_RESULTS: {formatted_top_results}
USER_QUESTION: {USER_QUESTION}

Include as much information as possible in the answer. Include references to the search results as markdown links.
Format: {{"answer": "answer text"}}
"""

completion = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": ANSWER_PROMPT}],
    temperature=1,
    stream=True,
)

text = ""
for chunk in completion:
    text += chunk.choices[0].delta.get("content", "")
    display.clear_output(wait=True)
    display.display(display.Markdown(text))

Recently, several events have taken place in San Francisco. The city's mayor, London Breed, stated the need for action on the drug crisis, even if services aren't accepted [ABC7 News Bay Area](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC1pHdmR6QkQ0ZnBZmAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1). A few mass shootings have occurred in San Francisco's Mission District, with one resulting in 9 injuries [KTVU FOX 2 San Francisco](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC1NhNDdrRE9GQUs4mAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1). The Park Hotels company has abandoned 9% of San Francisco's hotel rooms due to weak demand in the city [Skift](https://skift.com/2023/06/08/park-hotels-gives-up-on-san-franciscos-doggedly-weak-demand/). In response to the drug crisis, the San Francisco Sheriff has deployed deputies to arrest drug dealers [KTVU FOX 2 San Francisco](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC0VvcXlvaUlZSng4mAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1). A gas leak and water main break led to evacuations in the city [KTVU FOX 2 San Francisco](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC1M2UXotMU1qRTVZmAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1). Additionally, Hilton SF Union Square and Parc 55's owner has stopped payments on loans for these properties [ABC7 News Bay Area](https://consent.google.com/ml?continue=https://news.google.com/rss/articles/CCAiC3dpQk55U3I0SHBNmAEB?oc%3D5&gl=FR&hl=en-US&cm=2&pc=n&src=1).
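
Since the answer cites its sources as markdown links, it can be worth checking that every cited URL really came from the search results rather than being invented by the model. The sketch below is one way to do that. `find_unsupported_links` is a hypothetical helper, and the regular expression assumes URLs that contain no spaces or closing parentheses.

```python
import re

# Matches [label](url) where the URL has no whitespace or ")" in it
MARKDOWN_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def find_unsupported_links(answer: str, results: list) -> list:
    """Return cited URLs that do not appear in the given search results."""
    known_urls = {result["url"] for result in results}
    cited_urls = [url for _label, url in MARKDOWN_LINK.findall(answer)]
    return [url for url in cited_urls if url not in known_urls]


# Possible usage with the notebook's variables:
# unsupported = find_unsupported_links(text, formatted_top_results)
```

An empty list means every link in the answer matches a URL that was passed to the model.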