# Crawl a web page and extract information

- Enter a web page URL and extract only its text content
- Use an LLM to extract information from the text
- Export the result as JSONL so it can be used in other tasks

---

# 0. Setup


In [1]:
!pip -q install -U boto3 awscli requests beautifulsoup4 trafilatura langchain tiktoken

In [2]:
import json
import textwrap
from pathlib import Path
from datetime import datetime

# web scraping
import requests
import trafilatura
from bs4 import BeautifulSoup

# bedrock
import boto3

# splitter
import tiktoken
from langchain.text_splitter import TokenTextSplitter

In [3]:
urls = [
    'https://www.imdb.com/title/tt0111161/plotsummary/',
#    'https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/acl.html',
#    'https://en.wikipedia.org/wiki/Lee_Byung-hun',
]
len(urls)

1

In [4]:
def fallback_parse(response_content):
    soup = BeautifulSoup(response_content, 'html.parser')
    text = soup.find_all(string=True)
    cleaned_text = ''
    blacklist = [
        '[document]',
        'noscript',
        'header',
        'html',
        'meta',
        'head', 
        'input',
        'script',
        'style',]

    for item in text:
        if item.parent.name not in blacklist:
            cleaned_text += '{} '.format(item)
            
    cleaned_text = cleaned_text.replace('\t', '')
    return cleaned_text.strip()
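
The tag filtering idea behind `fallback_parse` can be tried without any network access. The following is a minimal sketch, assuming the standard library `html.parser` is swapped in for BeautifulSoup; `VisibleTextParser`, `visible_text`, and `SKIPPED_TAGS` are hypothetical names used only for illustration. Unlike `fallback_parse`, which checks only the direct parent of each text node, this variant also drops text nested deeper inside a skipped tag (for example a `<title>` inside `<head>`).

```python
from html.parser import HTMLParser

# Tags whose text is dropped. Void tags such as <meta> and <input> are left out
# on purpose: they have no closing tag, so the depth counter would never reset.
SKIPPED_TAGS = {"script", "style", "noscript", "head", "header"}


class VisibleTextParser(HTMLParser):
    """Collect text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside any skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def visible_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample_html = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>
"""
print(visible_text(sample_html))
```

Tracking a depth counter instead of the direct parent is a design choice of this sketch, not of the notebook; it trades a slightly stricter filter for not needing a parsed tree.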

In [5]:
def crawl(url):    
    print(f'parse url: {url}...')
    downloaded = trafilatura.fetch_url(url)

    contents = trafilatura.extract(
        downloaded, output_format="json",
        include_comments=False, include_links=False, with_metadata=True,
        date_extraction_params={'extensive_search': True, 'original_date': True},
    )
    
    if contents:
        json_output = json.loads(contents)
        return json_output['text']
    else:
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 200:
                return fallback_parse(resp.content)
            else:
                return None
        except Exception as e:
            print(e)
            raise e

# 1. Crawl url and extract texts

- Use trafilatura to extract only the text from each URL


In [6]:
%%time

docs = []
for url in urls:
    doc = crawl(url)
    # check for failure before calling len(), which raises TypeError on None
    if doc is None:
        print(f'failed to parse url: {url}')
        continue
    print(f'len: {len(doc)}\n')
    docs.append(doc)
    
print(f'num docs: {len(docs)}')

parse url: https://www.imdb.com/title/tt0111161/plotsummary/...
len: 22177

num docs: 1
CPU times: user 1.13 s, sys: 49.1 ms, total: 1.18 s
Wall time: 2.5 s


In [7]:
for i, doc in enumerate(docs):
    text = textwrap.shorten(doc, width=70, placeholder=' ...\n')
    print(f'doc {i}:\n{text}')

doc 0:
- Over the course of several years, two convicts form a ...



In [8]:
docs

['- Over the course of several years, two convicts form a friendship, seeking consolation and, eventually, redemption through basic compassion.\n- Chronicles the experiences of a formerly successful banker as a prisoner in the gloomy jailhouse of Shawshank after being found guilty of a crime he did not commit. The film portrays the man\'s unique way of dealing with his new, torturous life; along the way he befriends a number of fellow prisoners, most notably a wise long-term inmate named Red.—J-S-Golden\n- When an innocent male banker is sent to prison accused of murdering his wife, he does everything that he can over the years to break free and escape from prison. While on the inside, he develops a friendship with a fellow inmate that could last for years.—RECB3\n- After the murder of his wife, hotshot banker Andrew Dufresne is sent to Shawshank Prison, where the usual unpleasantness occurs. Over the years, he retains hope and eventually gains the respect of his fellow inmates, especi

# 2. Extract information

- Call the Claude 2 model through Bedrock in the us-east-1 region.
- Each call is billed per use.


In [9]:
profile_name = None
region = 'us-east-1'

In [10]:
session = boto3.Session(
    profile_name=profile_name,
    region_name=region,
)
bedrock = session.client(service_name='bedrock-runtime')

In [26]:
modelId = 'anthropic.claude-v2'
accept = 'application/json'
contentType = 'application/json'

In [27]:
instruction_prompt = """
You are an information extractor. You extract the key information from the user text to help the user build a topic model. \
The user text is enclosed in text tags, <text></text>.

Let's think step by step, and follow these steps to provide constructive feedback to the user. \
Respond only with the JSON string from Step 4, nothing else.
Please make sure the response starts with a JSON string, "{".

####Step 1: List informative keywords that help to understand the text.

####Step 2: If the text contains informative names of entities, list them.

####Step 3: Provide a summary of the text in about 50 words. \
The summary should use as many keywords and entities extracted in the previous steps as possible. \
The information must not contain any code. Do not provide any sample code in the information.

####Step 4: Respond with the result as a valid JSON string with the keys: keywords, entities, summary.
The type of the key "keywords" is a list of strings. \
The type of the key "entities" is a list of strings. \
The type of the key "summary" is a string.
""".strip()

def _build_prompt(chunk):
    user_prompt = f"""
Here is the user text.

<text>{chunk}</text>

Extract the information from the user text. \
Please respond only with a JSON object, nothing else. And do not enclose the JSON object in triple backticks, ```.
    """.strip()
    # Claude text completions expect a single Human turn followed by "\n\nAssistant:"
    return f"\n\nHuman: {instruction_prompt}\n\n{user_prompt}\n\nAssistant:"

def extract_info(chunk):
    body = json.dumps({
        "prompt": _build_prompt(chunk),
        "max_tokens_to_sample": 2048,
        "top_p": 0.9,
        "temperature": 0.2,
    })
    resp = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    resp_text = resp.get('body').read()
    try:
        return json.loads(resp_text).get('completion')
    except Exception as e:
        print('error occurred', e)
        print(resp_text)
        return None
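
`extract_info` returns the raw completion string, which is parsed with `json.loads` later in the notebook. Completions can carry a leading space, a code fence, or trailing commentary, so a more tolerant parser may be useful. The following is a minimal sketch, assuming the first `{` in the completion starts the JSON object; `parse_completion` is a hypothetical helper, not part of the notebook or of boto3.

```python
import json


def parse_completion(completion):
    """Return the first JSON object found in a model completion, or None."""
    if not completion:
        return None
    start = completion.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        obj, _end = json.JSONDecoder().raw_decode(completion[start:])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


sample = ' ```json\n{"keywords": ["hope"], "entities": ["Red"], "summary": "A short summary."}\n```'
info = parse_completion(sample)
print(info["keywords"])
```

Returning `None` for unparsable or truncated output matches how the notebook already treats a failed call, so the caller can skip that chunk instead of raising.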

In [28]:
%%time

# tiktoken ships no Claude tokenizer; the gpt-3.5-turbo encoding only approximates token counts
model_name = 'gpt-3.5-turbo'

doc_infos = []
for doc_id, doc in enumerate(docs):
    # try to split into roughly 4 chunks (plus a short remainder chunk)
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(doc))
    chunk_size = num_tokens//4 if num_tokens > 512 else num_tokens
    print(f'start extracting info from doc_id: {doc_id}, length: {len(doc)}, num_tokens: {num_tokens}, chunk_size: {chunk_size}')
    
    splitter = TokenTextSplitter.from_tiktoken_encoder(
        model_name=model_name,
        chunk_size=chunk_size,
        chunk_overlap=20,
    )
    each_info = []
    for chunk_idx, chunk in enumerate(splitter.split_text(doc)):
        print(f'chunk-{chunk_idx}({len(chunk)}), ', end='')
        info = extract_info(chunk)
        if info is None:
            print('invalid output. ', end='')
        else:
            each_info.append(info)
    print(f'chunks: {len(each_info)}...')
    
    doc_infos.append(each_info)

start extracting info from doc_id: 0, length: 22177, num_tokens: 4679, chunk_size: 1169
chunk-0(5539), chunk-1(5640), chunk-2(5521), chunk-3(5493), chunk-4(375), chunks: 5...
CPU times: user 90.5 ms, sys: 12.1 ms, total: 103 ms
Wall time: 51.3 s
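
The run above yields five chunks even though the chunk size is a quarter of the token count. The following is a minimal sketch of the sliding-window arithmetic, assuming the splitter advances by `chunk_size - chunk_overlap` tokens per step; that stride is an assumption about how `TokenTextSplitter` behaves, and `token_windows` is a hypothetical helper, not a LangChain function.

```python
def token_windows(num_tokens, chunk_size, chunk_overlap):
    """Return (start, end) token index pairs for a sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < num_tokens:
        end = min(start + chunk_size, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            break  # the last window already reaches the end of the text
        start += stride
    return windows


# Numbers taken from the run above: 4679 tokens, chunk_size = 4679 // 4, overlap 20
windows = token_windows(num_tokens=4679, chunk_size=4679 // 4, chunk_overlap=20)
for i, (start, end) in enumerate(windows):
    print(f"chunk-{i}: tokens {start}..{end} ({end - start} tokens)")
```

Under this assumption, integer division and the 20-token overlap leave a remainder, so a fifth window of 83 tokens appears. This is in line with the short 375-character final chunk in the output.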


In [29]:
print(doc_infos[0][0])

 {
  "keywords": [
    "Andy Dufresne", 
    "Shawshank Prison", 
    "Red", 
    "friendship", 
    "hope",
    "redemption",
    "convicts",
    "prisoners",
    "banker",
    "murder",
    "wife",
    "innocent",
    "escape",
    "smuggler"
  ],
  "entities": [
    "Andy Dufresne",
    "Red", 
    "Ellis Boyd Redding",
    "Shawshank Prison",
    "Boggs",
    "Byron Hadley",
    "Rita Hayworth"
  ],
  "summary": "The text describes the story of Andy Dufresne, an innocent banker who is sent to Shawshank Prison for murdering his wife. In prison, Andy befriends another inmate named Red, a smuggler. Despite the harsh conditions, Andy maintains hope and eventually gains the respect of his fellow prisoners. He resists the sexual assaults from the leader Boggs and his gang. Andy uses his banking expertise to gain privileges from the guard Hadley. He also asks Red to get him a rock hammer to pursue his hobby and a poster of actress Rita Hayworth. The text chronicles Andy's experiences and 

# 3. Save to file

- Merge the per-chunk information into a single record and save it
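
The merge step can be sketched on its own, without calling Bedrock. The following is a minimal example, assuming each chunk has already been parsed into a dict with the keys `keywords`, `entities`, and `summary`; `merge_chunk_infos` and the sample data are hypothetical. It uses `dict.fromkeys` for deduplication, which keeps first-seen order, whereas a `set` gives no ordering guarantee.

```python
import json


def merge_chunk_infos(chunk_infos, url):
    """Merge per-chunk dicts into one record, keeping first-seen order."""
    keywords, entities, summaries = [], [], []
    for info in chunk_infos:
        keywords.extend(info.get("keywords", []))
        entities.extend(info.get("entities", []))
        if info.get("summary"):
            summaries.append(info["summary"])
    return {
        # dict.fromkeys drops duplicates while preserving insertion order
        "keywords": list(dict.fromkeys(keywords)),
        "entities": list(dict.fromkeys(entities)),
        "summary": "\n".join(summaries),
        "url": url,
    }


# Made-up chunk results, for illustration only
chunks = [
    {"keywords": ["hope", "prison"], "entities": ["Red"], "summary": "Part one."},
    {"keywords": ["prison", "escape"], "entities": ["Red", "Andy"], "summary": "Part two."},
]
record = merge_chunk_infos(chunks, "https://example.com/page")
print(json.dumps(record))
```

Stable ordering makes the output file reproducible across runs, which helps when diffing datasets.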


In [30]:
with open('./dataset.jsonl', 'w') as fp:
    for idx, doc_info in enumerate(doc_infos):
        keywords = []
        entities = []
        summary = []
        for el in doc_info:
            info = json.loads(el)
            keywords.extend(info['keywords'])
            entities.extend(info['entities'])
            summary.append(info['summary'])
        D = {
            'keywords': list(set(keywords)),
            'entities': list(set(entities)),
            'summary': '\n'.join(summary),
            'url': urls[idx],
        }
        fp.write(f'{json.dumps(D)}\n')
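
A JSONL file holds one JSON object per line, so downstream code can stream it record by record. The following is a minimal sketch of reading such a file back; `read_jsonl` is a hypothetical helper, and the round trip uses a throwaway file with made-up data so that `dataset.jsonl` is not touched.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield json.loads(line)


# Made-up record, for illustration only
records = [
    {"keywords": ["hope"], "entities": ["Red"], "summary": "Part one.", "url": "https://example.com/a"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record, ensure_ascii=False) + "\n")
    loaded = list(read_jsonl(path))

print(loaded == records)
```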
