## Introduction

Artificial intelligence is currently revolutionising the copywriting industry by acting as a writing assistant. Language models can detect spelling and grammatical problems, change tone, summarise, and even expand content. However, there are situations where the model lacks the specialised knowledge of a topic needed to make expert-level suggestions for improving sections of an article.

In this post, we'll walk you through the process of creating an application that can easily expand text sections of an article.

## Import Libs & Setup

In [None]:
#| include: false
!pip install -q langchain==0.0.208 openai tiktoken newspaper3k


In [None]:
# let's setup the keys

import os

os.environ["GOOGLE_CSE_ID"] = "<GOOGLE_CSE_ID>"
os.environ["GOOGLE_API_KEY"] = "<GOOGLE_API_KEY>"
os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"

## Improve Blog Posts Automatically

We start by asking an LLM (ChatGPT) to generate a few search queries based on the text at hand. These queries are then used to search the Internet for relevant information on the subject through the Google Search API. Finally, the most relevant results are supplied to the model as context so that it can suggest better content.

The variables below hold the article we are working with, which comes from Artificial Intelligence News: title holds its title, text_all holds its body, and text_to_change holds the paragraph we want to expand. They are defined once here and remain unchanged throughout the post.

In [None]:
url = "https://www.artificialintelligence-news.com/2023/05/16/openai-ceo-ai-regulation-is-essential/"

title = "OpenAI CEO: AI regulation ‘is essential’"

text_all = """Altman highlighted the potential benefits of AI technologies like ChatGPT and Dall-E 2 to help address significant challenges such as climate change and cancer, but he also stressed the need to mitigate the risks associated with increasingly powerful AI models.

Altman proposed that governments consider implementing licensing and testing requirements for AI models that surpass a certain threshold of capabilities. He highlighted OpenAI’s commitment to safety and extensive testing before releasing any new systems, emphasising the company’s belief that ensuring the safety of AI is crucial.

Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology.

Blumenthal raised concerns about various risks associated with AI, including deepfakes, weaponised disinformation, discrimination, harassment, and impersonation fraud. He also emphasised the potential displacement of workers in the face of a new industrial revolution driven by AI."""

text_to_change = """Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology."""

The following diagram explains the workflow of this project.

<img src="https://github.com/pranath/blog/raw/master/images/activeloop-supercharge-blog-posts.png" width="800"/>

First, we generate candidate search queries from the paragraph we want to expand. The queries are then used to retrieve relevant content from a search engine (for example, Bing or Google Search), and that content is split into small chunks. We then compute embeddings for these chunks and save them in a Deep Lake dataset. Finally, the chunks most similar to the paragraph we wish to expand are retrieved from Deep Lake and used in a prompt to expand the paragraph with more information. The code in this post keeps the embeddings in memory and ranks them with scikit-learn instead of storing them in Deep Lake.

## Generate Search Queries

The code below takes the article and asks OpenAI's ChatGPT model for three suitable search queries. The prompt asks the model to recommend Google queries that could be used to learn more about the topic. LLMChain ties the ChatOpenAI model to the ChatPromptTemplate to form the chain used to communicate with the model. Finally, to extract the queries, the code splits the response on newlines and drops the first two characters of each line. This works because the prompt asks the model to put each query on its own line beginning with -. (LangChain's output parsers can be used to achieve the same effect.)

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)

template = """You are an exceptional copywriter and content creator.

You're reading an article with the following title:
----------------
{title}
----------------

You've just read the following piece of text from that article.
----------------
{text_all}
----------------

Inside that text, there's the following TEXT TO CONSIDER that you want to enrich with new details.
----------------
{text_to_change}
----------------

What are some simple and high-level Google queries that you'd do to search for more info to add to that paragraph?
Write 3 queries as a bullet point list, prepending each line with -.
"""

human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template=template,
        input_variables=["text_to_change", "text_all", "title"],
    )
)

chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])
chat = ChatOpenAI(temperature=0.9)
chain = LLMChain(llm=chat, prompt=chat_prompt_template)
response = chain.run({
    "text_to_change": text_to_change,
    "text_all": text_all,
    "title": title
})
# keep only non-empty lines and drop the leading "- " from each one
queries = [line[2:] for line in response.split("\n") if line.strip()]
print(queries)

['"AI implications for elections"', '"AI job displacement"', '"AI security risks"']
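Slicing off the first two characters relies on the model following the requested format exactly. As a minimal sketch of a more forgiving alternative, the hypothetical helper below (parse_bullet_queries is not part of LangChain or of the original code) skips any line that is not a bullet and also strips the surrounding double quotes. Stripping the quotes is a design choice, since a quoted query asks Google for an exact phrase, so keep them if that is what you want.

```python
from typing import List


def parse_bullet_queries(response: str) -> List[str]:
    """Extract queries from a response formatted as a '-' bullet list."""
    queries = []
    for line in response.splitlines():
        line = line.strip()
        if not line.startswith("-"):
            continue  # skip blank lines and any preamble the model adds
        # drop the bullet marker, surrounding spaces, and double quotes
        query = line[1:].strip().strip('"')
        if query:
            queries.append(query)
    return queries


# made-up response used only for illustration
sample = (
    "Here are some ideas:\n"
    '- "AI implications for elections"\n'
    "-AI job displacement\n"
    "\n"
    "- AI security risks\n"
)
print(parse_bullet_queries(sample))
# → ['AI implications for elections', 'AI job displacement', 'AI security risks']
```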


## Get Search Results

To use the Google Search API, we must first create an API key and a custom search engine. To obtain the key, go to the [Google Cloud dashboard](https://console.cloud.google.com/apis/credentials) and generate it by clicking the CREATE CREDENTIALS button at the top and selecting API KEY. Then, go to the [Programmable Search Engine dashboard](https://programmablesearchengine.google.com/controlpanel/create) to create the search engine, making sure you check the "Search the entire web" box. The Search engine ID is shown in the engine's details. You may also need to enable the "Custom Search API" service under the APIs and services section; if it is not enabled, the API's error response includes instructions. We can now set the environment variables GOOGLE_CSE_ID and GOOGLE_API_KEY, which allow the Google wrapper to communicate with the API.
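Before creating the search wrapper, it can help to confirm that the keys were actually set. The following is a small sketch, not part of the original notebook: missing_keys is a hypothetical helper, and it assumes that a value still wrapped in angle brackets (like the placeholders in the setup cell) counts as unset.

```python
import os

REQUIRED_KEYS = ["GOOGLE_CSE_ID", "GOOGLE_API_KEY", "OPENAI_API_KEY"]


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the names of required variables that are empty or placeholders."""
    missing = []
    for name in required:
        value = env.get(name, "")
        # treat empty values and "<...>" placeholders as unset (an assumption)
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


missing = missing_keys(os.environ)
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```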

In [None]:
# first, we create a tool that allows us to use Google search.
# we'll use it to retrieve the top 5 results (TOP_N_RESULTS) for each query

from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper

search = GoogleSearchAPIWrapper()
TOP_N_RESULTS = 5

def top_n_results(query):
    return search.results(query, TOP_N_RESULTS)

tool = Tool(
    name = "Google Search",
    description="Search Google for recent results.",
    func=top_n_results
)

In [None]:
# this is how we can use the tool. For each result, we have:
# 1. the result title
# 2. its URL
# 3. and the snippet that we would see if we were on the Google UI

all_results = []

for query in queries:
    results = tool.run(query)
    all_results += results

    if results and "title" in results[0]:  # print the first result as a sample
        print(results[0]["title"])
        print(results[0]["link"])
        print(results[0]["snippet"])
        print("-"*50)

Job Displacement - AI SUPERPOWERS new book by Kai-Fu Lee of ...
https://www.aisuperpowers.com/job-displacement
AI Job Displacement Index. How will your job be affected by AI? Within the next fifteen years, AI and automation will be able to do virtually all basic work ...
--------------------------------------------------
OWASP AI Security and Privacy Guide | OWASP Foundation
https://owasp.org/www-project-ai-security-and-privacy-guide/
Feb 15, 2023 ... Particular AI security risks · Data Security Risks: · AI model attacks, or adversarial machine learning attacks represent important security risks ...
--------------------------------------------------


The all_results variable contains 15 results (3 ChatGPT queries × 5 top Google results each). However, passing all of that information to the model as context is not optimal: there are technical, financial, and contextual factors to consider.

First, an LLM's input length is limited (2K to 4K tokens for the models considered here, depending on the model). Although we can bypass this limitation by using a different chain type, staying within the model's window is more efficient and tends to produce better results. Second, the more text we send to the API, the more it costs, because these models are priced by token count; splitting a prompt across several chain calls is possible, but it does not avoid that cost. Finally, the retrieved search results will not all be equally relevant to the paragraph, so it is a good idea to use only the most relevant ones.
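To make the token budget concrete, here is a rough sketch of keeping only as many texts as fit under a limit. Both select_within_budget and rough_token_count are hypothetical names introduced for illustration, and the word-based counter is a crude stand-in and not a real tokenizer; in practice a real tokenizer such as tiktoken (installed in the setup cell) could be passed in as count_tokens.

```python
def select_within_budget(texts, max_tokens, count_tokens):
    """Greedily keep texts, in order, while the running total fits the budget."""
    selected, used = [], 0
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # does not fit; a shorter later text may still fit
        selected.append(text)
        used += cost
    return selected, used


def rough_token_count(text):
    # crude stand-in: counts whitespace-separated words, NOT model tokens
    return len(text.split())


candidates = ["one two three", "four five six seven eight", "nine ten"]
kept, used = select_within_budget(candidates, max_tokens=5, count_tokens=rough_token_count)
print(kept, used)
# → ['one two three', 'nine ten'] 5
```

Because the selection walks the list in order, passing the texts sorted from most to least relevant means the budget is spent on the best candidates first.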

## Find the Most Relevant Results

As previously stated, Google Search returns only the URL for each source, but we need the content of those pages. The newspaper package can retrieve it: the .download() method of its Article class fetches the page and the .parse() method extracts its text. The code below loops through the results and tries to extract the text of each one.

In [None]:
# let's visit all the URLs from the results and use the newspaper library
# to download their texts. The library won't work on some URLs, e.g.
# if the content is a PDF file or if the website has some anti-bot mechanisms
# adopted.

import newspaper

pages_content = []

for result in all_results:
    try:
        article = newspaper.Article(result["link"])
        article.download()
        article.parse()
        if len(article.text) > 0:
            pages_content.append({ "url": result["link"], "text": article.text })
    except Exception:  # skip pages that cannot be downloaded or parsed
        continue

print("Number of pages: ", len(pages_content))

Number of pages:  10


The output shows that 10 pages were extracted, although all_results held 15 results. The newspaper library cannot extract text in some situations, for example when a search result points to a PDF file or the website blocks web scraping.
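The loop above skips failures silently, so the reason for the shortfall is not visible. As a sketch of one way to surface it, the hypothetical extract_pages helper below records which URLs failed and why, and also skips URLs it has already seen. The fetch_text argument is an assumption: it stands for any function that returns a page's text, such as a thin wrapper around the newspaper calls shown earlier. A fake fetcher is used here so the example runs without network access.

```python
def extract_pages(results, fetch_text):
    """Collect page texts and keep a record of the URLs that failed."""
    pages, failures, seen = [], [], set()
    for result in results:
        url = result["link"]
        if url in seen:
            continue  # the same URL can come back for more than one query
        seen.add(url)
        try:
            text = fetch_text(url)
        except Exception as exc:
            failures.append((url, type(exc).__name__))
            continue
        if text:
            pages.append({"url": url, "text": text})
        else:
            failures.append((url, "empty text"))
    return pages, failures


def fake_fetch(url):
    # stand-in for a real downloader, used only for this demo
    if url.endswith(".pdf"):
        raise ValueError("not an HTML page")
    return "" if "blocked" in url else "Article body for " + url


demo_results = [
    {"link": "https://example.com/a"},
    {"link": "https://example.com/report.pdf"},
    {"link": "https://example.com/blocked"},
    {"link": "https://example.com/a"},
]
pages, failures = extract_pages(demo_results, fake_fetch)
print(len(pages), failures)
```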

To ensure that the articles do not exceed the model's input length, the extracted text must now be divided into smaller chunks. The code below uses RecursiveCharacterTextSplitter, which splits on separators such as newlines and spaces, depending on what fits. It is configured so that each chunk holds at most 3000 characters and consecutive chunks overlap by up to 100 characters.

In [None]:
# we split the article texts into small chunks. While doing so, we keep track of each
# chunk metadata (i.e. the URL where it comes from). Each metadata is a dictionary and
# we need to use the "source" key for the document source so that the chain
# that we'll create later knows where to retrieve the source.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=100)

docs = []
for d in pages_content:
    chunks = text_splitter.split_text(d["text"])
    for chunk in chunks:
        new_doc = Document(page_content=chunk, metadata={ "source": d["url"] })
        docs.append(new_doc)

print("Number of chunks: ", len(docs))

Number of chunks:  26
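To illustrate what chunk_size and chunk_overlap mean, here is a deliberately simplified sketch that cuts a string into fixed-size windows. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to break on separators); split_with_overlap is a hypothetical function that only shows how consecutive chunks share characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
# → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the role the 100-character overlap plays in the real splitter: text near a boundary appears in both neighbouring chunks.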


As you can see, the docs variable holds 26 chunks. It's time to find the most relevant ones and pass them to the language model as context. The OpenAIEmbeddings class calls OpenAI's API to map text into a vector space that captures its meaning. We embed both the document chunks and the paragraph from the main article that was selected for expansion, which is held in the text_to_change variable defined at the start of this post.

In [None]:
# then, we embed both the chunks and the query

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

docs_embeddings = embeddings.embed_documents([doc.page_content for doc in docs])
query_embedding = embeddings.embed_query(text_to_change)

Cosine similarity can be used to compare high-dimensional embedding vectors. It measures the cosine of the angle between two vectors, so it depends on their direction and not on their length, and a higher score means the vectors are more alike. Because the embeddings carry contextual information, vectors that point in similar directions tend to share a meaning. As a result, the chunks with the highest similarity scores are the ones to use.

We use the cosine_similarity function from the sklearn library. It computes the similarity between each chunk and the selected paragraph, and we then take the indices of the top three results.
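To make the computation concrete, here is a dependency-free sketch of the same idea using the standard definition, the dot product divided by the product of the vector lengths. The names cosine_sim and rank_by_similarity are hypothetical, and the two-dimensional vectors are toy values chosen for illustration, not real embeddings.

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(doc_vectors, query_vector, top_k):
    """Return the indices of the top_k vectors, most similar first."""
    scores = [cosine_sim(v, query_vector) for v in doc_vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [2.0, 0.0]
print(rank_by_similarity(doc_vectors, query, top_k=2))
# → [0, 2]
```

The query is twice as long as the first document vector yet scores a perfect 1.0 against it, which shows that only direction matters.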

In [None]:
# next, we compute the cosine similarities between the document vectors and
# the query vectors using numpy and sklearn. We are interested only in the top 3
# chunks for now because we'll later put them in a prompt and the prompt size is
# limited.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_top_k_indices(list_of_doc_vectors, query_vector, top_k):
    # convert the lists of vectors to numpy arrays
    list_of_doc_vectors = np.array(list_of_doc_vectors)
    query_vector = np.array(query_vector)

    # compute cosine similarities
    similarities = cosine_similarity(query_vector.reshape(1, -1), list_of_doc_vectors).flatten()

    # sort the vectors based on cosine similarity
    sorted_indices = np.argsort(similarities)[::-1]

    # retrieve the top K indices from the sorted list
    top_k_indices = sorted_indices[:top_k]

    return top_k_indices

top_k = 3
best_indexes = get_top_k_indices(docs_embeddings, query_embedding, top_k)
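# note: this keeps the selected chunks in their original order in docs,
# not in order of similarity
best_k_documents = [doc for i, doc in enumerate(docs) if i in best_indexes]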

In [None]:
best_k_documents

[Document(page_content='How to Prepare for AI Job Displacement Kai-Fu Lee · Follow · Oct 25, 2018 3 min read -- 1 Listen Share\n\nAs individuals, we should accept that routine jobs are going away. For young people in these routine jobs, start now by finding careers that fit your strengths and that are not easily replaced by AI. For older people, when early retirement is offered to you, consider accepting, with gig economy and volunteering to make some income and live a life you enjoy.\n\nWe should encourage more people to go into service careers, choosing jobs into which they can pour their hearts and souls, spreading their love and experiences.\n\nWe should embrace AI tools, especially for professionals, understanding that they will get better with more data and use. We should use these tools to do parts of our jobs, allowing them to do more of our routine tasks, freeing us to move into areas that are more suitable for humans.\n\nWe should encourage all kinds of creativity beyond the 

## Extend the Sentence

We can now define the prompt using the additional information from Google search. There are six input variables in the template:

- title that holds the main article’s title;
- text_all to present the whole article we are working on;
- text_to_change is the selected part of the article that requires expansion;
- doc_1, doc_2, doc_3 to include the close Google search results as context.

The rest of the code should look familiar, as it follows the same structure used for generating the Google queries. It defines a HumanMessage prompt template to be compatible with the ChatGPT API, and the model is configured with a high temperature value to encourage creativity. The LLMChain class creates a chain that combines the model and the prompt, and calling its .run() method completes the process.

In [None]:
template = """You are an exceptional copywriter and content creator.

You're reading an article with the following title:
----------------
{title}
----------------

You've just read the following piece of text from that article.
----------------
{text_all}
----------------

Inside that text, there's the following TEXT TO CONSIDER that you want to enrich with new details.
----------------
{text_to_change}
----------------

Searching around the web, you've found this ADDITIONAL INFORMATION from distinct articles.
----------------
{doc_1}
----------------
{doc_2}
----------------
{doc_3}
----------------

Modify the previous TEXT TO CONSIDER by enriching it with information from the previous ADDITIONAL INFORMATION.
"""

human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template=template,
        input_variables=["text_to_change", "text_all", "title", "doc_1", "doc_2", "doc_3"],
    )
)
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])

chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)
chain = LLMChain(llm=chat, prompt=chat_prompt_template)

response = chain.run({
    "text_to_change": text_to_change,
    "text_all": text_all,
    "title": title,
    "doc_1": best_k_documents[0].page_content,
    "doc_2": best_k_documents[1].page_content,
    "doc_3": best_k_documents[2].page_content
})

print("Text to Change: ", text_to_change)
print("Expanded Variation:", response)

Text to Change:  Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology.
Expanded Variation: Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal even showcased the potential of AI technology by playing an audio introduction using an AI voice cloning software trained on his speeches. However, as pointed out by Kai-Fu Lee, individuals and organizations must be prepared for AI job displacement and actively seek out careers that cannot easily be replaced by AI. This can include service careers that require emotional intelligence and creativity beyond the sciences. Investment 
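The call above indexes best_k_documents[0] through [2] directly, which would fail if fewer than three chunks were available. As a small sketch of one way to guard against that, the hypothetical build_chain_inputs helper below fills the six template variables from a list of chunk texts and leaves missing slots as empty strings. Padding with empty strings is an assumption about what is acceptable for the prompt.

```python
def build_chain_inputs(title, text_all, text_to_change, doc_texts, n_docs=3):
    """Build the input dict for the template, padding missing documents."""
    inputs = {
        "title": title,
        "text_all": text_all,
        "text_to_change": text_to_change,
    }
    for i in range(n_docs):
        # doc_1, doc_2, ... match the placeholder names used in the template
        inputs[f"doc_{i + 1}"] = doc_texts[i] if i < len(doc_texts) else ""
    return inputs


inputs = build_chain_inputs(
    "A title", "Full text", "One paragraph", ["first chunk", "second chunk"]
)
print(sorted(inputs))
# → ['doc_1', 'doc_2', 'doc_3', 'text_all', 'text_to_change', 'title']
```

With the objects from this post, the result could be passed to the chain as chain.run(build_chain_inputs(title, text_all, text_to_change, [d.page_content for d in best_k_documents])).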

In [None]:
print(text_to_change)

Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology.


In [None]:
import textwrap

print(textwrap.fill(text_to_change, width=150))
print()
print(textwrap.fill(response, width=150))

Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications
for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating
the potential of the technology.

Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications
for elections, jobs, and security. Blumenthal even showcased the potential of AI technology by playing an audio introduction using an AI voice cloning
software trained on his speeches. However, as pointed out by Kai-Fu Lee, individuals and organizations must be prepared for AI job displacement and
actively seek out careers that cannot easily be replaced by AI. This can include service careers that require emotional intelligence and creativity
beyond the sciences. Investment companies can also look into impact 

## Conclusion

In this post, we learned how to use Google search results to enrich the model's prompt with extra information. The example demonstrated the use of embedding vectors to find content with a similar meaning or context, as well as the process of adding relevant information to a prompt to improve the output. Incorporating external information, such as Google search results, is a powerful way to improve a model's output by providing supplemental context in cases where the model's own knowledge is scarce.

## Acknowledgements

I'd like to express my thanks to the wonderful [LangChain & Vector Databases in Production Course](https://learn.activeloop.ai/courses/langchain) by Activeloop, which I completed, and acknowledge the use of some images and other materials from the course in this article.