# Extending Texts (Blog Posts) with LangChain and Google Search 

STEPS:
* [1. Setup](#setup)
* [2. Generating Search Queries](#search)
* [3. Getting Search Results](#results)
* [4. Finding the Most Relevant Results](#relevant)
* [5. Extending the Sentence](#extend)

<hr>
<a class="anchor" id="setup">
    
## 1. Setup
    
</a>

In [1]:
#!pip install langchain==0.0.208 deeplake openai tiktoken

In [2]:
!pip install -q newspaper3k==0.2.8 python-dotenv

In [3]:
import sys, os
sys.path.append('..')
from keys import OPENAI_API_KEY, GOOGLE_CSE_ID, GOOGLE_API_KEY

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["GOOGLE_CSE_ID"] = GOOGLE_CSE_ID
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
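
Since `python-dotenv` is installed above, an alternative to a local `keys` module is to load the same three variables from a `.env` file (for instance with `load_dotenv()`) and fail early if one is missing. The sketch below covers only the check. `missing_env_vars` is a hypothetical helper, not part of the original notebook, and it takes the environment as a plain mapping so it can be tried without real keys.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "GOOGLE_CSE_ID", "GOOGLE_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are absent or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with a fake environment in which only one key is set
fake_env = {"OPENAI_API_KEY": "sk-placeholder"}
print(missing_env_vars(REQUIRED_VARS, fake_env))
# prints ['GOOGLE_CSE_ID', 'GOOGLE_API_KEY']
```

In the notebook itself, calling `missing_env_vars(REQUIRED_VARS)` right after the setup cell would report any key that was not set.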

<hr>
<a class="anchor" id="search">
    
## 2. Generating Search Queries
    
</a>

In [4]:
title = "OpenAI CEO: AI regulation ‘is essential’"

text_all = """ Altman highlighted the potential benefits of AI technologies like ChatGPT and Dall-E 2 to help address significant challenges such as climate change and cancer, but he also stressed the need to mitigate the risks associated with increasingly powerful AI models. Altman proposed that governments consider implementing licensing and testing requirements for AI models that surpass a certain threshold of capabilities. He highlighted OpenAI’s commitment to safety and extensive testing before releasing any new systems, emphasising the company’s belief that ensuring the safety of AI is crucial. Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology. Blumenthal raised concerns about various risks associated with AI, including deepfakes, weaponised disinformation, discrimination, harassment, and impersonation fraud. He also emphasised the potential displacement of workers in the face of a new industrial revolution driven by AI."""

text_to_change = """ Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology."""

In [5]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)

template = """ You are an exceptional copywriter and content creator.

You're reading an article with the following title:
----------------
{title}
----------------

You've just read the following piece of text from that article.
----------------
{text_all}
----------------

Inside that text, there's the following TEXT TO CONSIDER that you want to enrich with new details.
----------------
{text_to_change}
----------------

What are some simple and high-level Google queries that you'd do to search for more info to add to that paragraph?
Write 3 queries as a bullet point list, prepending each line with -.
"""

human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template=template,
        input_variables=["text_to_change", "text_all", "title"],
    )
)
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])
# gpt-3.5-turbo is the default chat model; it is named explicitly here for clarity
chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)
chain = LLMChain(llm=chat, prompt=chat_prompt_template)

In [6]:
response = chain.run({
    "text_to_change": text_to_change,
    "text_all": text_all,
    "title": title
})

In [7]:
response

'- AI voice cloning software applications for speech synthesis\n- Transformative nature of AI in elections, jobs, and security\n- Implications of AI technology on various industries'

In [8]:
queries = [line[2:] for line in response.split("\n")]
print(queries)

['AI voice cloning software applications for speech synthesis', 'Transformative nature of AI in elections, jobs, and security', 'Implications of AI technology on various industries']
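
The slice `line[2:]` works when every line starts with exactly `- `, as in the response above. With `temperature=0.9` the model may occasionally add blank lines, indentation, or a different bullet character, so a more tolerant parser can be useful. The following is a minimal sketch; `parse_bullet_queries` is a hypothetical helper, not part of the original notebook.

```python
import re
from typing import List


def parse_bullet_queries(raw: str) -> List[str]:
    """Extract query strings from a bullet list, tolerating blank lines and odd spacing."""
    queries = []
    for line in raw.splitlines():
        # Drop leading whitespace and one bullet marker ("-", "*" or "•"), if present
        cleaned = re.sub(r"^\s*[-*•]\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return queries


sample = "- first query\n\n-second query\n  - third query  "
print(parse_bullet_queries(sample))
# prints ['first query', 'second query', 'third query']
```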


<hr>
<a class="anchor" id="results">
    
## 3. Getting Search Results
    
</a>

In [9]:
from langchain.tools import Tool
from langchain.utilities import GoogleSearchAPIWrapper
import time

search = GoogleSearchAPIWrapper()
TOP_N_RESULTS = 5

def top_n_results(query):
    return search.results(query, TOP_N_RESULTS)

tool = Tool(
    name = "Google Search",
    description="Search Google for recent results.",
    func=top_n_results
)

In [10]:
all_results = []

for query in queries:
    results = tool.run(query)
    all_results += results

In [11]:
if "title" in results[0]:  # Preview the first result of the last query
    print(results[0]["title"])
    print(results[0]["link"])
    print(results[0]["snippet"])
    print("-"*50)

Economic potential of generative AI | McKinsey
https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier
Jun 14, 2023 ... Generative AI will have a significant impact across all industry sectors. Banking, high tech, and life sciences are among the industries ...
--------------------------------------------------


In [12]:
len(all_results)

15
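
Because the three queries are related, the same page can appear in more than one result list, and a query with no hits may yield an entry without a `link` key (which is presumably why the preview cell checks for `"title"` first). A possible clean-up step, sketched below, keeps the first occurrence of each link. `dedupe_by_link` is a hypothetical helper and the sample URLs are made up for illustration.

```python
def dedupe_by_link(results):
    """Keep the first result for each link and drop entries that have no link."""
    seen = set()
    unique = []
    for result in results:
        link = result.get("link")
        if link is None or link in seen:
            continue
        seen.add(link)
        unique.append(result)
    return unique


# Made-up results in the same shape as the search output (title, link, snippet)
sample_results = [
    {"title": "A", "link": "https://example.com/a", "snippet": "..."},
    {"title": "B", "link": "https://example.com/b", "snippet": "..."},
    {"title": "A again", "link": "https://example.com/a", "snippet": "..."},
    {"Result": "no hits for this query"},
]
print([r["title"] for r in dedupe_by_link(sample_results)])
# prints ['A', 'B']
```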

<hr>
<a class="anchor" id="relevant">
    
## 4. Finding the Most Relevant Results

</a>

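The `all_results` variable holds 15 search results (3 queries generated by ChatGPT × the top 5 Google results for each), and every result links to a web page. Passing the full contents of all these pages to the model as context is not optimal: much of the text is irrelevant, and together the pages could exceed the model's input length. Instead, we download the pages, split them into chunks, and keep only the chunks that are most similar to the text we want to extend.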

In [13]:
import newspaper

pages_content = []

for result in all_results:
    try:
        article = newspaper.Article(result["link"])
        article.download()
        article.parse()
        
        if len(article.text) > 0:
            pages_content.append({ "url": result["link"], "text": article.text })
    except Exception:
        # Skip pages that fail to download or parse, and results without a link
        continue

print("Number of pages: ", len(pages_content))

Number of pages:  8


In [14]:
# Split the saved contents into smaller chunks 
# to ensure the articles do not exceed the model’s input length
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=100)

docs = []
for d in pages_content:
    chunks = text_splitter.split_text(d["text"])
    for chunk in chunks:
        new_doc = Document(page_content=chunk, metadata={ "source": d["url"] })
        docs.append(new_doc)

print("Number of chunks: ", len(docs))

Number of chunks:  70
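
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks, so its chunks will differ). It only illustrates how consecutive chunks share `chunk_overlap` characters. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not just the previous overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which mirrors the role of the 100-character overlap used above: a sentence cut at a chunk boundary still appears intact in at least one chunk.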


In [15]:
# Embedding docs and query
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

docs_embeddings = embeddings.embed_documents([doc.page_content for doc in docs])
query_embedding = embeddings.embed_query(text_to_change)

In [16]:
# Using cosine similarity to get the most relevant docs
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def get_top_k_indices(list_of_doc_vectors, query_vector, top_k):
    # convert the lists of vectors to numpy arrays
    list_of_doc_vectors = np.array(list_of_doc_vectors)
    query_vector = np.array(query_vector)

    # compute cosine similarities
    similarities = cosine_similarity(query_vector.reshape(1, -1), list_of_doc_vectors).flatten()

    # sort the vectors based on cosine similarity
    sorted_indices = np.argsort(similarities)[::-1]

    # retrieve the top K indices from the sorted list
    top_k_indices = sorted_indices[:top_k]

    return top_k_indices


top_k = 3
best_indexes = get_top_k_indices(docs_embeddings, query_embedding, top_k)
# Index docs directly so the documents stay ordered from most to least similar
best_k_documents = [docs[i] for i in best_indexes]
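
For readers who want to see what the similarity ranking computes, the sketch below reproduces the idea in plain Python on toy two-dimensional vectors. The vectors are invented for illustration (real `text-embedding-ada-002` embeddings have many more dimensions), `cosine` and `rank_top_k` are hypothetical names, and the code assumes no vector is all zeros.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_top_k(doc_vectors, query_vector, top_k):
    """Return the indices of the top_k most similar vectors, best match first."""
    scored = [(cosine(vec, query_vector), i) for i, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:top_k]]


toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.2]
print(rank_top_k(toy_docs, toy_query, top_k=2))
# prints [0, 2]
```

Vector 0 points in almost the same direction as the query, so it ranks first; vector 2 is the next closest; vector 3 points the opposite way and ranks last.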

<hr>
<a class="anchor" id="extend">
    
## 5. Extending the Sentence

</a>

We can now define the prompt using the additional information from Google search. There are six input variables in the template:

- `title`: the main article’s title;
- `text_all`: the whole article we are working on;
- `text_to_change`: the selected part of the article that requires expansion;
- `doc_1`, `doc_2`, `doc_3`: the three chunks from the Google search results that are most similar to `text_to_change`, included as context.
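
One way to assemble these six variables is sketched below. `build_prompt_inputs` is a hypothetical helper that takes the chunk texts as plain strings and pads with empty strings when fewer than three chunks are available, so the template can still be filled. Padding is an assumption made here for illustration, not something the original notebook does.

```python
def build_prompt_inputs(title, text_all, text_to_change, doc_texts, n_docs=3):
    """Build the input dict for the template, mapping chunk texts to doc_1..doc_n."""
    # Truncate to n_docs and pad with empty strings if there are too few chunks
    padded = list(doc_texts[:n_docs]) + [""] * max(0, n_docs - len(doc_texts))
    inputs = {
        "title": title,
        "text_all": text_all,
        "text_to_change": text_to_change,
    }
    for position, text in enumerate(padded, start=1):
        inputs[f"doc_{position}"] = text
    return inputs


example = build_prompt_inputs(
    "Some title", "Full text", "Part to extend", ["first chunk", "second chunk"]
)
print(sorted(example))
# prints ['doc_1', 'doc_2', 'doc_3', 'text_all', 'text_to_change', 'title']
```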

In [18]:
template = """You are an exceptional copywriter and content creator.

You're reading an article with the following title:
----------------
{title}
----------------

You've just read the following piece of text from that article.
----------------
{text_all}
----------------

Inside that text, there's the following TEXT TO CONSIDER that you want to enrich with new details.
----------------
{text_to_change}
----------------

Searching around the web, you've found this ADDITIONAL INFORMATION from distinct articles.
----------------
{doc_1}
----------------
{doc_2}
----------------
{doc_3}
----------------

Modify the previous TEXT TO CONSIDER by enriching it with information from the previous ADDITIONAL INFORMATION.
"""

human_message_prompt = HumanMessagePromptTemplate(
    prompt=PromptTemplate(
        template=template,
        input_variables=["text_to_change", "text_all", "title", "doc_1", "doc_2", "doc_3"],
    )
)
chat_prompt_template = ChatPromptTemplate.from_messages([human_message_prompt])

chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)
chain = LLMChain(llm=chat, prompt=chat_prompt_template)

In [21]:
response = chain.run({
    "text_to_change": text_to_change,
    "text_all": text_all,
    "title": title,
    "doc_1": best_k_documents[0].page_content,
    "doc_2": best_k_documents[1].page_content,
    "doc_3": best_k_documents[2].page_content
})

print("Text to Change: ", text_to_change)
print("**********************")
print("Expanded Variation:", response)

Text to Change:   Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology.
**********************
Expanded Variation: Senators Josh Hawley and Richard Blumenthal expressed their recognition of the transformative nature of AI and the need to understand its implications for elections, jobs, and security. Blumenthal played an audio introduction using an AI voice cloning software trained on his speeches, demonstrating the potential of the technology. This demonstration highlights the growing availability and sophistication of AI voice cloning technology. As seen with the controversy surrounding a vocal deepfake of Anthony Bourdain, voice clones are becoming more realistic and can be utilized in various applications. In