# Supercharge Blog Posts Automatically with Google Search

## Introduction

These days, AI is changing the copywriting field by serving as a writing assistant. Language models can find spelling or grammatical errors, change the tone, summarize, or even extend content. However, a model may lack the specialized knowledge in a particular field that it needs to provide expert-level suggestions for extending parts of an article.

In this lesson, we will take you step by step through the process of building an application that can expand text sections. The process begins by asking an LLM (ChatGPT) to generate a few search queries based on the text at hand. These queries are then used to search the Internet through the Google Search API, which returns relevant information on the subject. Lastly, only the most relevant results are presented to the model as context so that it can suggest better content.

## Workflow

First, we generate candidate search queries from the selected paragraph that we want to expand. The queries are then used to retrieve relevant documents with a search engine (e.g. Bing or Google Search), which are then split into small chunks. We then compute embeddings of these chunks and save the chunks and embeddings in a Deep Lake dataset. Finally, the chunks most similar to the paragraph that we want to expand are retrieved from Deep Lake and used in a prompt to expand the paragraph with further knowledge.
<br/>
<img src="../../images/blog-post-workflow.png" alt="State of Workflow" style="width: 60%; height: auto;"/>
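
The stages above can be read as a chain of functions. The following is only a rough sketch of that data flow, not the implementation built in the rest of this lesson: `expand_paragraph` and all of its parameters are hypothetical names, and the lambdas are toy stand-ins so that the snippet runs without any API keys.

```python
from typing import Callable, List, Sequence


def expand_paragraph(
    paragraph: str,
    generate_queries: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    split: Callable[[str], List[str]],
    rank: Callable[[str, Sequence[str]], List[str]],
    rewrite: Callable[[str, Sequence[str]], str],
) -> str:
    """Run the workflow stages in order; every stage is injected (hypothetical API)."""
    queries = generate_queries(paragraph)
    pages = [page for query in queries for page in search(query)]
    chunks = [chunk for page in pages for chunk in split(page)]
    context = rank(paragraph, chunks)
    return rewrite(paragraph, context)


# Toy stand-ins: each lambda mimics one stage with plain string operations.
result = expand_paragraph(
    "AI voice cloning",
    generate_queries=lambda p: [p + " examples"],
    search=lambda q: ["page about " + q],
    split=lambda page: page.split(" about "),
    rank=lambda p, chunks: [c for c in chunks if "cloning" in c][:3],
    rewrite=lambda p, ctx: p + " | " + "; ".join(ctx),
)
print(result)
```

Passing each stage in as a callable keeps the sketch independent of any particular LLM, search engine, or vector store.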

## Setup

In [1]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
openai.api_type = os.environ.get("OPENAI_API_TYPE")
openai.api_base = os.environ.get("OPENAI_API_BASE")
openai.api_key = os.environ.get("OPENAI_API_KEY")
openai.api_version = os.environ.get("OPENAI_API_VERSION")

## Building the system

### 1. Generate Search Queries

In [2]:
title = "OpenAI CEO: AI regulation ‘is essential’"

text_all = """Altman highlighted the potential benefits of AI technologies like \
ChatGPT and Dall-E 2 to help address significant challenges such as climate change \
and cancer, but he also stressed the need to mitigate the risks associated with \
increasingly powerful AI models. Altman proposed that governments consider implementing \
licensing and testing requirements for AI models that surpass a certain threshold of \
capabilities. He highlighted OpenAI’s commitment to safety and extensive testing before \
releasing any new systems, emphasising the company’s belief that ensuring the safety of \
AI is crucial. Senators Josh Hawley and Richard Blumenthal expressed their recognition \
of the transformative nature of AI and the need to understand its implications for \
elections, jobs, and security. Blumenthal played an audio introduction using an AI voice \
cloning software trained on his speeches, demonstrating the potential of the technology. \
Blumenthal raised concerns about various risks associated with AI, including deepfakes, \
weaponised disinformation, discrimination, harassment, and impersonation fraud. He also \
emphasised the potential displacement of workers in the face of a new industrial \
revolution driven by AI."""

text_to_change = """Senators Josh Hawley and Richard Blumenthal expressed their \
recognition of the transformative nature of AI and the need to understand its \
implications for elections, jobs, and security. Blumenthal played an audio \
introduction using an AI voice cloning software trained on his speeches, \
demonstrating the potential of the technology."""

In [5]:
from langchain.chat_models import AzureChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)

template = """You are an exceptional copywriter and content creator.

You're reading an article with the following title:
----------------
{title}
----------------

You've just read the following piece of text from that article.
----------------
{text_all}
----------------

Inside that text, there's the following TEXT TO CONSIDER that you want to enrich with new details.
----------------
{text_to_change}
----------------

What are some simple and high-level Google queries that you'd do to search for \
more info to add to that paragraph? Write 3 queries as a bullet point list, \
prepending each line with -.
"""

human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template=template,
        input_variables=["text_to_change", "text_all", "title"],
    )
)
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])

chat = AzureChatOpenAI(deployment_name="gpt4", temperature=0.9)
chain = LLMChain(llm=chat, prompt=chat_prompt_template)

response = chain.run(
    {"text_to_change": text_to_change, "text_all": text_all, "title": title}
)

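# Keep only the bullet lines and drop the leading "-" so that blank lines
# or extra commentary in the response do not end up as queries.
queries = [
    line.strip()[1:].strip()
    for line in response.split("\n")
    if line.strip().startswith("-")
]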
print(queries)

['AI voice cloning software examples and potential applications', 'Impacts of AI on elections, jobs, and security', "Senator Richard Blumenthal's stance on AI regulation and technology"]
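
Model replies do not always follow the requested format exactly: they may add an introductory line, use a different bullet character, or wrap a query in quotes. A slightly more defensive parser could look like the sketch below. `parse_bullet_queries` is a hypothetical helper, not part of LangChain, and the set of accepted bullet characters is an assumption.

```python
import re
from typing import List


def parse_bullet_queries(response: str) -> List[str]:
    """Extract queries from a reply formatted as a bullet list (hypothetical helper)."""
    queries = []
    for line in response.splitlines():
        # Accept "-", "*" or "•" as the bullet; require some non-space text after it.
        match = re.match(r"^\s*[-*•]\s*(\S.*?)\s*$", line)
        if match:
            # Drop double quotes that the model may have wrapped around the query.
            queries.append(match.group(1).strip('"'))
    return queries


sample = (
    "Here are some queries:\n"
    "- AI voice cloning examples\n"
    "\n"
    '-  "Impacts of AI on jobs"\n'
    "- Blumenthal's stance on AI"
)
print(parse_bullet_queries(sample))
```

Lines that do not start with a bullet, including blank lines, are ignored rather than turned into empty queries.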


### 2. Get Search Results

To use the Google Search API, we need an API key and a custom search engine. To get the key, go to the [Google Cloud console](https://console.cloud.google.com/apis/credentials), click the CREATE CREDENTIALS button at the top, and choose API KEY. Then, head to the [Programmable Search Engine](https://programmablesearchengine.google.com/controlpanel/create) dashboard and remember to select the “Search the entire web” option. The search engine ID will be visible in the details. You might also need to enable the “Custom Search API” service under Enable APIs and services (the API will return instructions if this is required). Finally, set the environment variables `GOOGLE_CSE_ID` and `GOOGLE_API_KEY` so that the Google wrapper can connect to the API.
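
Before creating the wrapper, it can help to confirm that both variables are actually visible to the Python process. The check below is a minimal sketch; `missing_env_vars` is a hypothetical helper introduced here for illustration.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


required = ["GOOGLE_CSE_ID", "GOOGLE_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print("Set these variables before creating the wrapper:", ", ".join(missing))
```

The `env` parameter defaults to `os.environ` but accepts any mapping, which makes the helper easy to try out with a plain dictionary.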

In [8]:
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper

# Remember to set the "GOOGLE_CSE_ID" and "GOOGLE_API_KEY" environment variables.
search = GoogleSearchAPIWrapper()
TOP_N_RESULTS = 3


def top_n_results(query):
    return search.results(query, TOP_N_RESULTS)


tool = Tool(
    name="Google Search",
    description="Search Google for recent results.",
    func=top_n_results,
)

all_results = []

for query in queries:
    results = tool.run(query)
    all_results += results
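
Because the queries are related, different queries can return the same page, and a query that finds nothing may yield an entry without a `link` key. One possible cleanup step, sketched below under those assumptions, keeps the first result per URL. `dedupe_by_link` is a hypothetical helper, and the sample data is made up for the demonstration.

```python
from typing import Dict, List


def dedupe_by_link(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first result for each URL, preserving the original order."""
    seen = set()
    unique = []
    for result in results:
        link = result.get("link")
        # Skip entries that carry no URL as well as URLs already collected.
        if link is None or link in seen:
            continue
        seen.add(link)
        unique.append(result)
    return unique


sample_results = [
    {"title": "A", "link": "https://example.com/a"},
    {"title": "B", "link": "https://example.com/b"},
    {"title": "A again", "link": "https://example.com/a"},
    {"Result": "nothing found"},
]
print(dedupe_by_link(sample_results))
```

Deduplicating before scraping avoids loading the same page twice in the next step.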

In [9]:
all_results

[{'title': 'Chatbots, deepfakes, and voice clones: AI deception for sale ...',
  'link': 'https://www.ftc.gov/business-guidance/blog/2023/03/chatbots-deepfakes-voice-clones-ai-deception-sale',
  'snippet': "Mar 20, 2023 ... The FTC Act's prohibition on deceptive or unfair conduct can apply if you make, sell, or use a tool that is effectively designed to deceive –\xa0..."},
 {'title': '12 AI Voice Cloning Tools to Create Seamless Authentic Voiceovers ...',
  'link': 'https://geekflare.com/ai-voice-cloning-tools/',
  'snippet': "4 days ago ... It has the potential to literally clone anybody's voice and then go on to read any ... Why Would You Want to Use an AI Voice Cloning Tool?"},
 {'title': 'AI-based voice cloning',
  'link': 'https://murf.ai/resources/dynamic-capabilities-of-ai-based-voice-cloning/',
  'snippet': "May 6, 2022 ... With Murf's voice cloning software, you can clone the voice of your favorite ... The potential applications of voice cloning are manifold."},
 {'title': 'Bl

### 3. Find the Most Relevant Results

#### 3.1 Scrape the URLs

In [13]:
from langchain.document_loaders import SeleniumURLLoader

urls = [result["link"] for result in all_results]

loader = SeleniumURLLoader(
    urls=urls,
    binary_location=os.environ.get("BROWSER_EXEC_PATH"),
)
docs = loader.load()

print("Number of pages:", len(docs))
print(docs[0])

Number of pages: 9


#### 3.2 Split the Text

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=100)
splitted_docs = text_splitter.split_documents(docs)

print("Number of chunks: ", len(splitted_docs))

Number of chunks:  40
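
To build intuition for what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical function that only illustrates the two parameters.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early so the last window is not fully contained in the previous one.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the role the 100-character overlap plays in the cell above: text that straddles a boundary still appears intact in at least one chunk.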


#### 3.3 Compute Embeddings

In [15]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

# Embed the chunks produced in step 3.2 rather than the full pages.
docs_embeddings = embeddings.embed_documents(
    [doc.page_content for doc in splitted_docs]
)
query_embedding = embeddings.embed_query(text_to_change)

#### 3.4 Perform similarity search to get the most relevant results

In [16]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List


def get_top_k_indices(
    list_of_doc_vectors: List[List[float]], query_vector: List[float], top_k: int
) -> List[int]:
    """
    Returns the indices of the top K vectors in the list of document vectors that
    are most similar to the query vector.

    :param list_of_doc_vectors: a list of document vectors
    :param query_vector: a query vector
    :param top_k: the number of top vectors to retrieve
    :return: a list of indices of the top K vectors in the list of document vectors
    """
    # convert the lists of vectors to numpy arrays
    list_of_doc_vectors = np.array(list_of_doc_vectors)
    query_vector = np.array(query_vector)

    # compute cosine similarities
    similarities = cosine_similarity(
        query_vector.reshape(1, -1), list_of_doc_vectors
    ).flatten()

    # sort the vectors based on cosine similarity (most similar first)
    sorted_indices = np.argsort(similarities)[::-1]

    # retrieve the top K indices from the sorted list
    top_k_indices = sorted_indices[:top_k]

    # return plain Python ints to match the declared return type
    return top_k_indices.tolist()


top_k = 3
best_indexes = get_top_k_indices(docs_embeddings, query_embedding, top_k)
# Look up the chunks in ranked order, most similar first.
best_k_documents = [splitted_docs[i] for i in best_indexes]
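
The ranking step can be checked by hand on toy vectors. The sketch below reimplements the same idea in plain Python, without NumPy or scikit-learn, purely for illustration; `cosine` and `top_k_by_cosine` are hypothetical names and the two-dimensional vectors are made up.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(
    doc_vectors: Sequence[Sequence[float]], query_vector: Sequence[float], top_k: int
) -> List[int]:
    """Return the indices of the top_k vectors most similar to the query."""
    scores = [cosine(vec, query_vector) for vec in doc_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]


doc_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
print(top_k_by_cosine(doc_vectors, query, top_k=2))  # [0, 1]
```

The query points almost along the first axis, so the first vector scores highest, the diagonal one comes second, and the vector pointing in the opposite direction scores lowest.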

### 4. Extend the Sentence

In [17]:
template = """You are an exceptional copywriter and content creator.

You're reading an article with the following title:
----------------
{title}
----------------

You've just read the following piece of text from that article.
----------------
{text_all}
----------------

Inside that text, there's the following TEXT TO CONSIDER that you want to enrich with new details.
----------------
{text_to_change}
----------------

Searching around the web, you've found this ADDITIONAL INFORMATION from distinct articles.
----------------
{doc_1}
----------------
{doc_2}
----------------
{doc_3}
----------------

Modify the previous TEXT TO CONSIDER by enriching it with information from the previous \
ADDITIONAL INFORMATION.
"""

human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template=template,
        input_variables=[
            "text_to_change",
            "text_all",
            "title",
            "doc_1",
            "doc_2",
            "doc_3",
        ],
    )
)
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])

chat = AzureChatOpenAI(deployment_name="gpt4", temperature=0.9)
chain = LLMChain(llm=chat, prompt=chat_prompt_template)

response = chain.run(
    {
        "text_to_change": text_to_change,
        "text_all": text_all,
        "title": title,
        "doc_1": best_k_documents[0].page_content,
        "doc_2": best_k_documents[1].page_content,
        "doc_3": best_k_documents[2].page_content,
    }
)

print("Text to Change: ", text_to_change)
print("Expanded Variation:", response)

Text to Change:  Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology.
Expanded Variation: Senators Josh Hawley and Richard Blumenthal, Chair and Ranking Member of the Senate Judiciary Subcommittee on Privacy, Technology, and the Law, expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. In a hearing titled "Oversight of AI: Rules for Artificial Intelligence," held on May 16, 2023, Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology. The hearing aimed to explore sensible standards and principles to help navigate the uncharted territo
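
The template above has exactly three document slots, so it fails if fewer than three chunks are retrieved. One possible variation, sketched here as an assumption rather than as part of the original notebook, is to join however many chunks are available into a single block and pass it through one template variable. `format_context`, `SEPARATOR`, and the `max_chars` limit are hypothetical.

```python
from typing import Sequence

# Same separator line that the prompt template uses between sections.
SEPARATOR = "----------------"


def format_context(chunks: Sequence[str], max_chars: int = 3000) -> str:
    """Join non-empty chunks with the separator, trimming each to max_chars."""
    trimmed = [chunk.strip()[:max_chars] for chunk in chunks if chunk.strip()]
    return f"\n{SEPARATOR}\n".join(trimmed)


print(format_context(["First chunk. ", "", "Second chunk."]))
```

With this approach the template would need a single placeholder for the joined text instead of `{doc_1}`, `{doc_2}`, and `{doc_3}`.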