### newspaper & LangChain

- https://docs.langchain.com/docs/
- https://python.langchain.com/docs/integrations/callbacks/streamlit
- https://python.langchain.com/docs/use_cases/web_scraping/
- https://python.langchain.com/docs/integrations/document_loaders/wikipedia
- https://python.langchain.com/docs/integrations/document_loaders/news
- https://python.langchain.com/docs/integrations/document_loaders/copypaste
- https://python.langchain.com/docs/integrations/retrievers/wikipedia

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

True

#### newspaper
https://github.com/codelucas/newspaper/

In [2]:
import newspaper
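# Alternative source; kept for reference, since the assignment below would override it.
# NEWS_URL = 'http://cnn.com'
NEWS_URL = 'https://prtimes.jp/'
# memoize_articles=False disables the article cache, so every run returns the full list.
news_build = newspaper.build(NEWS_URL, memoize_articles=False)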

for article in news_build.articles[:5]:
    print(article.url)

https://prtimes.jp/main/html/rd/p/000000175.000019201.html
https://prtimes.jp/main/html/rd/p/000003355.000019470.html
https://prtimes.jp/main/html/rd/p/000000874.000019483.html
https://prtimes.jp/main/html/rd/p/000000186.000019787.html
https://prtimes.jp/story/detail/rX5ngQCJvPb
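The URLs printed above mix press-release pages and story pages. As a small sketch, they could be grouped by the first segment of their path using only the standard library. `group_by_section` is a hypothetical helper, not part of newspaper, and the sample list is copied from the output above.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_section(urls):
    """Group URLs by the first segment of their path (hypothetical helper)."""
    groups = defaultdict(list)
    for url in urls:
        # "/main/html/rd/p/x.html" -> ["main", "html", "rd", "p", "x.html"]
        segments = [s for s in urlparse(url).path.split("/") if s]
        section = segments[0] if segments else ""
        groups[section].append(url)
    return dict(groups)


# Sample taken from the output of the cell above.
sample_urls = [
    "https://prtimes.jp/main/html/rd/p/000000175.000019201.html",
    "https://prtimes.jp/main/html/rd/p/000003355.000019470.html",
    "https://prtimes.jp/story/detail/rX5ngQCJvPb",
]

for section, members in group_by_section(sample_urls).items():
    print(section, len(members))
```

In the notebook, `[a.url for a in news_build.articles]` could be passed in place of `sample_urls`.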


In [3]:
for category in news_build.category_urls()[:5]:
    print(category)

https://prtimes.jp/
https://prtimes.jp/technology
https://prtimes.jp/app
https://prtimes.jp/service
https://prtimes.jp/topics


In [4]:
print(news_build.brand)
print(news_build.description)

prtimes
プレスリリースをPR TIMESで配信して、効果的なPRを実現。上場企業の54.5%（2,169/3,978社）が利用する現在シェアNo.1のプレスリリース配信代行サービス。話題のベンチャー～グローバル企業のニュースリリースが集まる生活者も要注目のサイトです。プレスリリースの受信は無料です。


In [5]:
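article = news_build.articles[0]
article.download()  # fetch the raw HTML
article.parse()     # extract authors, publish date, and body text
article.nlp()       # derive keywords and summary; must run after parse()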

In [6]:
print('- html:\n', article.html[:100])
print('- authors:\n', article.authors)
print('- publish_date:\n', article.publish_date)
print('- text:\n', article.text)
print('- keywords:\n', article.keywords)
print('- summary:\n', article.summary)

- html:
 <!doctype html>
<html class="no-js">
<head>
    <script type="text/javascript" src="/common/js/GTMhe
- authors:
 []
- publish_date:
 None
- text:
 
- keywords:
 ['プロフェッショナルのための道具であることをイメージさせるスモーキーカラーを採用した新型ルノー', 'カングー', 'カングーの限定車ルノー', 'ヴァリエテ発売']
- summary:
 
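In the output above, `text` and `summary` are empty even though keywords were returned, so the body extraction did not succeed for this page. A minimal sketch of a guard that reports which fields came back blank, assuming an object that exposes them as string attributes like the article above; `missing_fields` is a hypothetical helper, not part of newspaper. The newspaper documentation also shows a `language` argument to `newspaper.build` for non-English sources, which may be worth trying here (for example `language='ja'`, if the installed version supports it); whether it fixes this page is untested.

```python
from types import SimpleNamespace


def missing_fields(article, fields=("text", "summary")):
    """Return the names of fields that are absent or blank (hypothetical helper)."""
    missing = []
    for name in fields:
        value = getattr(article, name, None)
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing


# Stand-in object mirroring the values printed above; a real article object
# could be passed instead, since only attribute access is used.
stub = SimpleNamespace(text="", summary="", keywords=["カングー"])
print(missing_fields(stub))
```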


#### langchain

In [9]:
import openai

from langchain.agents import AgentExecutor, OpenAIFunctionsAgent, load_tools
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import NewsURLLoader
from langchain.schema import SystemMessage

openai.api_key = os.getenv('OPENAI_API_KEY')
os.environ["SERPAPI_API_KEY"] = os.getenv('SERPAPI_API_KEY')

# LLM and tools
llm = ChatOpenAI(
    model='gpt-3.5-turbo-16k',
    temperature=0,
    openai_api_key=openai.api_key,
    )
tools = load_tools([
    "ddg-search",
    'serpapi',
    # "wikipedia"
    ], llm=llm)

# Prompt
system_message = SystemMessage(content="You are a search assistant.")
prompt = OpenAIFunctionsAgent.create_prompt(system_message=system_message)

# Agent
search_agent = OpenAIFunctionsAgent(
    llm=llm, 
    tools=tools, 
    prompt=prompt,
    max_tokens_limit=4096
    )
agent_executor = AgentExecutor(
    agent=search_agent, 
    tools=tools, 
    verbose=True
    )


# from langchain.agents import initialize_agent
# agent_executor = initialize_agent(
#     tools, 
#     llm, 
#     agent="zero-shot-react-description", 
#     verbose=True,
#     agent_kwargs=dict(suffix='Answer should be in Japanese.' + prompt.SUFFIX)
#     )

# Run
query = r"Do a web search for about 5 companies with black pellets, then do another web search on the company name to get more information, and create a list in markdown format that includes the company details. Let's think step by step."
results = agent_executor.run(query)



> Entering new AgentExecutor chain...

Invoking: `Search` with `companies with black pellets`

Biofuelwatch investigation into attempts fo produce black pellets (thorugh torrefaction, steam explosion, or hydrothermal carbonisation) on a ...

Invoking: `Search` with `Biofuelwatch`
responded: {content}

Please click here. Search. Search for: Donate. Can you support Biofuelwatch's work? You can make a donation here. Mastodon. We are now on Mastodon. © ...

Invoking: `Search` with `Enviva`

Enviva Inc. is the world's largest producer of wood pellets, a renewable alternative to coal. The company has also been the subject of controversy regarding its sustainability with an environmental ...

Invoking: `Search` with `Blackwood Technology`

Blackwood is world leader in biomass torrefaction technology ... Blackwood is a renewable energy technology company

In [14]:
print(results)

Here is the list of companies with black pellets along with some additional information:

1. **Biofuelwatch**: Biofuelwatch is an organization that has conducted investigations into attempts to produce black pellets through torrefaction, steam explosion, or hydrothermal carbonization. [More information](https://www.biofuelwatch.org.uk/)

2. **Enviva**: Enviva is the world's largest producer of wood pellets, including black pellets, which are a renewable alternative to coal. The company has been the subject of sustainability controversies. [More information](https://www.envivabiomass.com/)

3. **Blackwood Technology**: Blackwood Technology is a renewable energy technology company specializing in biomass torrefaction. They are a world leader in biomass torrefaction technology. [More information](https://www.blackwoodtechnology.com/)

4. **FutureMetrics**: FutureMetrics is a leading consultancy in the wood pellet sector. They provide information, market analysis, operations guidance, and 

In [11]:
# NewsURLLoader
urls = [
    "https://www.bbc.com/news/world-us-canada-66388172",
    "https://www.bbc.com/news/entertainment-arts-66384971",
]

loader = NewsURLLoader(urls=urls, nlp=True)
data = loader.load()
print("- First article:\n", data[0])
print('- keywords:\n',data[0].metadata['keywords'])
print('- summary:\n',data[0].metadata['summary'])

- First article:
 page_content='In testimony to the congressional committee examining the 6 January riot, Mrs Powell said she did not review all of the many claims of election fraud she made, telling them that "no reasonable person" would view her claims as fact. Neither she nor her representatives have commented.' metadata={'title': "Who are the six co-conspirators in Trump's latest indictment?", 'link': 'https://www.bbc.com/news/world-us-canada-66388172', 'authors': [], 'language': 'en', 'description': 'Six people accused of helping Mr Trump undermine the election have been described by prosecutors.', 'publish_date': None, 'keywords': ['coconspirators', 'telling', 'powell', 'riot', 'reasonable', 'representatives', 'view', 'review', 'trumps', 'indictment', 'latest', 'testimony', 'claims', 'person'], 'summary': 'In testimony to the congressional committee examining the 6 January riot, Mrs Powell said she did not review all of the many claims of election fraud she made, telling them tha
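Each loaded document carries its metadata as a plain dict, so a compact overview can be built without further LangChain calls. A minimal sketch, assuming objects with a `metadata` dict holding `title`, `link`, and `keywords` keys as in the output above; `StubDoc` and `to_markdown` are hypothetical names used only for illustration, and the sample values are abbreviated from that output.

```python
from dataclasses import dataclass, field


@dataclass
class StubDoc:
    """Hypothetical stand-in for a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_markdown(docs, max_keywords=5):
    """Render one markdown bullet per document (hypothetical helper)."""
    lines = []
    for doc in docs:
        meta = doc.metadata
        title = meta.get("title") or "(untitled)"
        link = meta.get("link", "")
        keywords = ", ".join(meta.get("keywords", [])[:max_keywords])
        bullet = f"- [{title}]({link})" if link else f"- {title}"
        if keywords:
            bullet += f": {keywords}"
        lines.append(bullet)
    return "\n".join(lines)


# Values abbreviated from the first article printed above.
docs = [
    StubDoc(
        page_content="In testimony to the congressional committee examining the 6 January riot",
        metadata={
            "title": "Who are the six co-conspirators in Trump's latest indictment?",
            "link": "https://www.bbc.com/news/world-us-canada-66388172",
            "keywords": ["coconspirators", "telling", "powell"],
        },
    )
]
print(to_markdown(docs))
```

In the notebook, the `data` list returned by `loader.load()` could be passed directly, since only the `metadata` attribute is read.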