# NewsGPT Cookbook with LangChain

1) First load news from various news sites

In [199]:
class NewsCategory:
    ALL = "all"
    BUSINESS = "business"
    POLITICS = "politics"
    SPORTS = "sports"
    TECHNOLOGY = "technology"

class NewsLength:
    SHORT = 15
    MEDIUM = 30
    LONG = 50
    
def news_urls_to_category(category: str):
    _urls = ["https://www.wsj.com/", "https://www.nytimes.com/", "https://www.apnews.com/", "https://www.bbc.com/"]
    post_fixes = {
        NewsCategory.ALL: ["", "", "", "news/"],
        NewsCategory.BUSINESS: ["news/business", "section/business", "hub/business", "news/business"],
        NewsCategory.TECHNOLOGY: ["news/technology", "section/technology", "hub/technology", "news/technology"],
        NewsCategory.POLITICS: ["news/politics", "section/politics", "hub/politics", ""],
        NewsCategory.SPORTS: ["news/sports", "section/sports", "hub/sports", "sport"],
    }
    # Append each site's category-specific suffix to its base URL
    for i in range(len(_urls)):
        _urls[i] += post_fixes[category][i]
    return _urls
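
The helper above pairs each base URL with a per-site suffix by list position. The same idea can be sketched with `zip`, which avoids mutating a list in place and rejects unknown categories early. The names `BASE_URLS`, `SUFFIXES`, and `build_urls` are illustrative and not part of the original notebook, and only two categories are shown.

```python
# Illustrative sketch only: BASE_URLS, SUFFIXES and build_urls are hypothetical names.
BASE_URLS = [
    "https://www.wsj.com/",
    "https://www.nytimes.com/",
    "https://www.apnews.com/",
    "https://www.bbc.com/",
]

# One suffix per site, in the same order as BASE_URLS.
SUFFIXES = {
    "all": ["", "", "", "news/"],
    "sports": ["news/sports", "section/sports", "hub/sports", "sport"],
}

def build_urls(category: str) -> list:
    """Return one URL per news site for the given category."""
    if category not in SUFFIXES:
        raise ValueError(f"unknown category: {category!r}")
    return [base + suffix for base, suffix in zip(BASE_URLS, SUFFIXES[category])]

print(build_urls("sports"))
```

Because `zip` stops at the shorter sequence, a category with too few suffixes would silently drop sites; a length check could be added if the tables are edited often.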

2) Run the web scraper

In [200]:

#######################################################################

from langchain.document_loaders import SeleniumURLLoader

news_category = NewsCategory.SPORTS
urls = news_urls_to_category(news_category)
print(urls)

loader = SeleniumURLLoader(urls=urls)
docs = loader.load()

['https://www.wsj.com/news/sports', 'https://www.nytimes.com/section/sports', 'https://www.apnews.com/hub/sports', 'https://www.bbc.com/sport']


In [201]:
# Docs Filtering
print(len(docs))
print(type(docs[0]))

# Truncate each page's content to at most 3000 characters
MAX_CHARS = 3000
for d in docs:
    n_chars = len(d.page_content)
    print(n_chars)
    if n_chars > MAX_CHARS:
        d.page_content = d.page_content[:MAX_CHARS]

print(" size after truncation of docs")
for d in docs:
    print(len(d.page_content))

# print(docs)

4
<class 'langchain.schema.Document'>
3144
3919
13222
7226
 size after truncation of docs
3000
3000
3000
3000
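
A hard slice at 3000 characters can cut a word in half. As a possible refinement, the sketch below backs up to the last space before the limit. `truncate_text` is a hypothetical helper, not something the notebook defines, and it still counts characters rather than model tokens.

```python
def truncate_text(text: str, max_chars: int = 3000) -> str:
    """Cut text to at most max_chars, backing up to the last space when possible."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    if text[max_chars].isspace():
        # The cut already falls on a word boundary.
        return cut
    # Drop the trailing partial word; keep the hard cut if there is no space at all.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut

print(truncate_text("hello world foo", 13))
```

In the notebook this could be applied as `d.page_content = truncate_text(d.page_content)` inside the loop over `docs`.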


3) Now use an OpenAI LLM. To get an OpenAI API key via Azure, follow [this link](https://learn.microsoft.com/en-gb/azure/cognitive-services/openai/quickstart?tabs=command-line&pivots=programming-language-python)

In [146]:
import os
import openai

openai.api_key = "YOUR_API_KEY"  # replace with your own key; never commit a real key
openai.api_base =  "https://newsgpt.openai.azure.com/" # your endpoint should look like the following https://YOUR_RESOURCE_NAME.openai.azure.com/
openai.api_type = 'azure'
openai.api_version = '2022-12-01' # this may change in the future

os.environ["OPENAI_API_KEY"] = openai.api_key
os.environ["OPENAI_API_TYPE"] = openai.api_type
os.environ["OPENAI_API_BASE"] = openai.api_base
os.environ["OPENAI_API_VERSION"] = openai.api_version
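
Credentials typed into a notebook cell are easy to leak when the notebook is shared. One alternative, sketched below, is to export the variables in the shell before starting Jupyter and read them back at run time. The environment variable names are the ones this notebook already sets; `load_azure_openai_settings` is a hypothetical helper, and the defaults for type and version simply mirror the values used above.

```python
import os

def load_azure_openai_settings(env=os.environ) -> dict:
    """Collect Azure OpenAI settings from environment variables, failing early if any is missing."""
    required = ["OPENAI_API_KEY", "OPENAI_API_BASE"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "api_key": env["OPENAI_API_KEY"],
        "api_base": env["OPENAI_API_BASE"],
        # Optional settings fall back to the values used in this notebook.
        "api_type": env.get("OPENAI_API_TYPE", "azure"),
        "api_version": env.get("OPENAI_API_VERSION", "2022-12-01"),
    }
```

The returned values can then be assigned to `openai.api_key`, `openai.api_base`, and so on, so the key never appears in the notebook itself.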

In [148]:
# Quick test: call the Azure OpenAI API directly
import openai

deployment_name = 'text-davinci-003'  # must match the custom name you chose when you deployed the model

# Send a completion call to generate an answer
print('Sending a test completion job')
start_phrase = "Summarize this article: \n" + docs[0].page_content
response = openai.Completion.create(engine=deployment_name, prompt=start_phrase, max_tokens=100)
text = response['choices'][0]['text'].replace('\n', '').replace(' .', '.').strip()
print(text)

Sending a test completion job
This article covers a range of stories related to the current state of the economy and market pain, many of which involve foreign haps, technological and financial developments, and business deals. From the construction industry's high employment and Fed interest rate bets, to Chinese information restrictions and GM's electric vehicle legacy, to digital purchases and golf lessons, to luxury homebuying and stolent iPhones, the article covers a diverse array of topics related to industry, finance, and technology.


In [149]:
from langchain.llms import AzureOpenAI

# Davinci is 10x more expensive; switch to curie when testing
# MODEL_NAME = "text-curie-001"
MODEL_NAME = "text-davinci-003"

llm = AzureOpenAI(
    deployment_name=MODEL_NAME,
    model_name=MODEL_NAME,
    max_tokens=1000)  # default is 16 in openai API


USING API_BASE: 
https://newsgpt.openai.azure.com/


The URL loader above converts the scraped news into LangChain `Document` objects. To load our own data instead, we can use the following code:

In [72]:
# use template data
data = "Google’s employees were shocked when they learned in March that the South Korean consumer electronics giant Samsung was considering replacing Google with Microsoft’s Bing as the default search engine on its devices. For years, Bing had been a search engine also-ran. But it became a lot more interesting to industry insiders when it recently added new artificial intelligence technology. \
    Google’s reaction to the Samsung threat was “panic,” according to internal messages reviewed by The New York Times. An estimated $3 billion in annual revenue was at stake with the Samsung contract. An additional $20 billion is tied to a similar Apple contract that will be up for renewal this year. \
    A.I. competitors like the new Bing are quickly becoming the most serious threat to Google’s search business in 25 years, and in response, Google is racing to build an all-new search engine powered by the technology. It is also upgrading the existing one with A.I. features, according to internal documents reviewed by The Times. \
    The new features, under the project name Magi, are being created by designers, engineers and executives working in so-called sprint rooms to tweak and test the latest versions. The new search engine would offer users a far more personalized experience than the company’s current service, attempting to anticipate users’ needs. \
    Lara Levin, a Google spokeswoman, said in a statement that “not every brainstorm deck or product idea leads to a launch, but as we’ve said before, we’re excited about bringing new A.I.-powered features to search, and will share more details soon.” \
    Billions of people use Google’s search engine every day for everything from finding restaurants and directions to understanding a medical diagnosis, and that simple white page with the company logo and an empty bar in the middle is one of the most widely used web pages in the world. Changes to it would have a significant impact on the lives of ordinary people, and until recently it was hard to imagine anything challenging it."

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter()
texts = text_splitter.split_text(data)

from langchain.docstore.document import Document

docs2 = [Document(page_content=t) for t in texts[:3]]
print(docs2)

[Document(page_content='Google’s employees were shocked when they learned in March that the South Korean consumer electronics giant Samsung was considering replacing Google with Microsoft’s Bing as the default search engine on its devices. For years, Bing had been a search engine also-ran. But it became a lot more interesting to industry insiders when it recently added new artificial intelligence technology.     Google’s reaction to the Samsung threat was “panic,” according to internal messages reviewed by The New York Times. An estimated $3 billion in annual revenue was at stake with the Samsung contract. An additional $20 billion is tied to a similar Apple contract that will be up for renewal this year.     A.I. competitors like the new Bing are quickly becoming the most serious threat to Google’s search business in 25 years, and in response, Google is racing to build an all-new search engine powered by the technology. It is also upgrading the existing one with A.I. features, accordi

In [None]:
print(docs)

4) Run map-reduce summarization with LangChain
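
Conceptually, map-reduce summarization summarizes each document on its own (the map step) and then condenses the joined partial summaries into one final summary (the reduce step). The sketch below shows only that control flow, with toy functions standing in for the LLM calls. `map_reduce_summarize`, `first_sentence`, and `collapse_whitespace` are hypothetical names and are not LangChain APIs.

```python
from typing import Callable, List

def map_reduce_summarize(
    texts: List[str],
    map_fn: Callable[[str], str],
    reduce_fn: Callable[[str], str],
) -> str:
    """Apply map_fn to every text, then apply reduce_fn to the joined results."""
    partial_summaries = [map_fn(text) for text in texts]
    return reduce_fn("\n\n".join(partial_summaries))

def first_sentence(text: str) -> str:
    """Toy stand-in for the per-document LLM call: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."

def collapse_whitespace(text: str) -> str:
    """Toy stand-in for the combining LLM call: join the partial summaries on one line."""
    return " ".join(text.split())

articles = [
    "Team A won the final. Fans celebrated.",
    "Player B signed a contract. Terms were not disclosed.",
]
print(map_reduce_summarize(articles, first_sentence, collapse_whitespace))
```

In the cell that follows, the map and reduce roles are played by two prompts sent to the same LLM.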

In [202]:
# Refer to: https://python.langchain.com/en/latest/use_cases/summarization.html

from langchain.chains.summarize import load_summarize_chain
from langchain import PromptTemplate

###############################################################################################

news_length = NewsLength.LONG
DEBUG = False

###############################################################################################

map_prompt_template = f"Write a {news_category} news headlines summary of the following:"
map_prompt_template += " \n\n {text} \n\n"
map_prompt_template += f"PROVIDE SUMMARY WITH AROUND {news_length + 10} SENTENCES"

reduce_prompt_template = f"Write a summary of today's {news_category} news headlines from the following news sources:"
reduce_prompt_template += " \n\n {text} \n\n"
reduce_prompt_template += f"PROVIDE SUMMARY WITH AROUND {news_length} SENTENCES"


MAP_PROMPT = PromptTemplate(template=map_prompt_template, input_variables=["text"])
REDUCE_PROMPT = PromptTemplate(template=reduce_prompt_template, input_variables=["text"])

if DEBUG:
    chain = load_summarize_chain(llm,
                                 chain_type="map_reduce",
                                 map_prompt=MAP_PROMPT,
                                 combine_prompt=REDUCE_PROMPT,
                                 return_map_steps=True)
    # Keep the result so the intermediate map outputs can be inspected
    result = chain({"input_documents": docs}, return_only_outputs=True)
    print(result)
else:
    chain = load_summarize_chain(
        llm, chain_type="map_reduce", map_prompt=MAP_PROMPT,
        combine_prompt=REDUCE_PROMPT)
    summary = chain.run(docs)
    print(summary)

    

:

Shohei Ohtani of Major League Baseball is proving to be an elite hitter, being able to replicate the best pitches he faces. China's Ding Liren won the World Chess Championship in a dramatic match, and NFL star Lamar Jackson signed a record-breaking contract. QBs dominated the NFL Draft, while basketball star Brittney Griner expressed hope for her future and concern for Americans still held overseas. Giannis Antetokounmpo's "failure" speech went viral, and the New York Jets and Knicks both took action. The debate over "participation trophies" was tackled, and Washington is preparing for a football apocalypse as the Jets try to acquire Aaron Rodgers. Max Scherzer was suspended for using "sticky stuff," the Oakland A's intend to move to Las Vegas, and the Tampa Bay Rays tied an MLB record with 13 straight wins. The Olympics President proposed a plan for the return of Russian athletes, and the world track and field banned transgender athletes from women's events. German soccer team is o