# Ingest Content

## Notes

Collection of notes to help ingest content from a domain. Includes specific instructions to filter output for some domains (like bradfrost.com).

### bradfrost.com

- Exclude `/blog/link/*`, `/blog/tag/*`, and `/attachment`

### danmall.com

- Exclude `/topics/*`

### primer.style

- Exclude `/octicons/*`, `/design/foundations/icons/*`, and `/view-components/lookbook/*`

### digitaldesign.vattenfall.com

- Exclude `/components/modules/*`

### volkswagen.frontify.com

- Exclude `/builder/groupui/*`

### designstrategy.guide

- Exclude `/tag/*`

### www.designbetter.co

- Exclude `pencils-before-pixels`
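
The per-domain notes above could be captured as data and checked with a small helper. This is only a sketch: `EXCLUDES` and `is_excluded` are hypothetical names that do not exist in the notebook, and the patterns for `/attachment` and `pencils-before-pixels` are treated as "matches anywhere in the path", which is an assumption.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical mapping from host name to excluded path patterns,
# transcribed from the notes above. The leading "*" on two patterns
# is an assumption (match anywhere in the path).
EXCLUDES = {
    "bradfrost.com": ["/blog/link/*", "/blog/tag/*", "*/attachment*"],
    "danmall.com": ["/topics/*"],
    "primer.style": [
        "/octicons/*",
        "/design/foundations/icons/*",
        "/view-components/lookbook/*",
    ],
    "digitaldesign.vattenfall.com": ["/components/modules/*"],
    "volkswagen.frontify.com": ["/builder/groupui/*"],
    "designstrategy.guide": ["/tag/*"],
    "www.designbetter.co": ["*pencils-before-pixels*"],
}


def is_excluded(url, excludes=EXCLUDES):
    """Return True if the URL's path matches an exclude pattern for its host."""
    parsed = urlparse(url)
    patterns = excludes.get(parsed.netloc, [])
    # fnmatchcase treats "*" as "any characters", including "/"
    return any(fnmatchcase(parsed.path, pattern) for pattern in patterns)
```

A crawler could call `is_excluded(link)` before queueing a link; hosts without an entry are never excluded.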

## Dependencies

Install the following dependencies first:

In [None]:
%pip install bs4 jsonlines pytest-playwright

## Install needed browsers

In this case we'll only use Chromium, but `playwright install` also downloads WebKit and Firefox.

In [None]:
!playwright install

## Set the domain to ingest

Set the domain you want to have scanned. The crawl covers all subpages on that domain (and that domain only) and skips links that contain query parameters (`?`) or anchors (`#`).

Also set a name that will be used to create files to persist ingested content.
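
The link rules described above (same domain only, no query parameters, no anchors) can be sketched as a single check. `resolve_internal_link` is a hypothetical helper, not part of the notebook, and it compares parsed URL components where the notebook's crawl cell uses plain substring checks.

```python
from urllib.parse import urljoin, urlparse


def resolve_internal_link(base_url, href):
    """Return the absolute URL if `href` should be crawled, else None.

    A link qualifies when it is an http(s) URL on the same host as
    `base_url` and carries neither a query string nor a fragment.
    """
    absolute = urljoin(base_url, href)
    base, target = urlparse(base_url), urlparse(absolute)
    if target.scheme not in ("http", "https"):
        return None  # e.g. mailto: or javascript: links
    if target.netloc != base.netloc:
        return None  # external domain or a different subdomain
    if target.query or target.fragment:
        return None  # links with "?" parameters or "#" anchors
    return absolute
```

For example, `resolve_internal_link('https://www.designbetter.co/', '/books')` resolves to the absolute URL, while `'/books?page=2'` and `'/books#toc'` are rejected.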

In [2]:
name = 'designbetter_co'
url = 'https://www.designbetter.co/'

Think about ways of including a script / snippet per domain to handle edge cases. Some kind of hook?

## Find all internal URLs

The first step is to crawl the domain for all internal links that lead to HTML content, repeating until no new links are discovered. The crawl starts from the `url` set above, which seeds the queue in `__main__`.

Write the set of discovered URLs from `all_links` to disk, converting them to `jsonl` format for easier processing in the next steps. We'll build upon that `page` dict in the following steps.
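
As a rough illustration of the `jsonl` layout (one JSON object per line), the following sketch writes and re-reads two `page` dicts. It uses only the standard library instead of `jsonlines` so it runs without extra dependencies; the sample records and the temporary file path are made up for the example, and the HTML strings are placeholders.

```python
import json
import os
import tempfile

# Made-up sample records in the shape the crawler builds:
# url, slug, and a nested content dict holding the rendered HTML.
pages = [
    {
        "url": "https://www.designbetter.co/books",
        "slug": "books",
        "content": {"html": "<html><body>placeholder</body></html>"},
    },
    {
        "url": "https://www.designbetter.co/podcast",
        "slug": "podcast",
        "content": {"html": "<html><body>placeholder</body></html>"},
    },
]

path = os.path.join(tempfile.mkdtemp(), "pages-example.jsonl")

# Write: one compact JSON document per line
with open(path, "w", encoding="utf-8") as f:
    for page in pages:
        f.write(json.dumps(page) + "\n")

# Read: each line parses independently, so later steps can stream pages
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Because every line is self-contained, later steps can add keys to each `page` dict and write a new file without loading everything into one JSON array.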

The scroll delay (currently 5 × 0.5 s per page) doubles as a crude throttle for now, so the crawl is less likely to be blocked as a DDoS attempt or as obvious scraping.

TODO:
- Still some duplicate URLs (/ vs non-/)
- Not finding some URLs when behind tabbed content? (e.g. https://atlassian.design/components/button/examples -> Usage tab)
- Add exclude list / filter to processed urls (e.g. to exclude stuff like "blog/tag/*")
- Attempt to handle Cookies / Cookie Consent, try to close cookie banners? (e.g. for screenshots)
- Some types of links still seem to fail (e.g. "https://danmall.com/topics/artificial intelligence", as logged)
- Some links (probably because of weird endless scrolling) seem to always time out, like this one: https://www.designbetter.co/principles-of-product-design/break-black-box
- Subdomains seem to have problems (like here: https://design-system.service.gov.uk/ / https://hds.hel.fi/)
- Sometimes empty URLs
- Timeout on https://primer.style/design/foundations/css-utilities/animations
- Retry-Logic and incremental crawling
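
Several of the items above (trailing-slash duplicates, links containing spaces, empty URLs) might be addressed by normalizing every URL before it is queued or compared. The following is a sketch under that assumption; `normalize_url` is a hypothetical helper and is not used by the crawl cell below.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of `url`, or None for empty input.

    - percent-encodes unsafe characters such as spaces in the path
    - strips trailing slashes (the bare root stays "/")
    - lowercases the host and drops the fragment
    """
    if not url or not url.strip():
        return None
    parts = urlsplit(url.strip())
    # "%" is kept safe so already-encoded sequences are not double-encoded
    path = quote(parts.path, safe="/%")
    path = path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))
```

With this in place, `https://www.designbetter.co/books/` and `https://www.designbetter.co/books` map to the same string, so a `visited` set would treat them as one page.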

In [18]:
import requests
import jsonlines
import pathlib
import time
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright
from slugify import slugify
 
def get_domain(url):
    parsed_uri = urlparse(url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    return domain

def get_path(url):
    parsed_uri = urlparse(url)
    return parsed_uri.path
 
def get_links(url, content):
    soup = BeautifulSoup(content, 'html.parser')
    domain = get_domain(url)
    links = set()
    for link in soup.find_all('a'):
        link_url = link.get('href')
        if link_url:
            absolute_link = urljoin(url, link_url)
            if absolute_link.startswith(domain):
                links.add(absolute_link)
    return links
 
async def playwright(url, slug):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        
        screenshot_dir = 'screenshots/' + name
        pathlib.Path(screenshot_dir).mkdir(parents=True, exist_ok=True)
        await page.screenshot(path=f'{screenshot_dir}/{slug}-{p.chromium.name}.png')
        
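        # Scroll in steps so lazily loaded content renders; the pause also throttles the crawl
        for _ in range(5):
            await page.mouse.wheel(0, 15000)
            # Non-blocking wait; time.sleep() would stall the event loop
            await page.wait_for_timeout(500)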
        
        content = await page.content()
        await browser.close()
        return content
    
if __name__ == '__main__':
    queue = [url]
    visited = set()
    all_links = set()
    pages = []
 
    while queue:
        url = queue.pop(0)
        visited.add(url)
        response = requests.get(url)
        if (response.ok and "text/html" in response.headers.get('Content-Type', '')):
            page = dict()
            print(url)
            if (url.rstrip("/") not in all_links):
                all_links.add(url.rstrip("/"))
                page['url'] = url
                page['slug'] = slugify(get_path(page['url']))
                page['content'] = dict()
                page['content']['html'] = await playwright(page['url'], page['slug'])
                pages.append(page)
    
                links = get_links(page['url'], page['content']['html'])
                for link in links:
                    if link not in visited and link not in queue and '#' not in link and '?' not in link and 'pencils-before-pixels' not in link:
                        queue.append(link)
    
    print()
    print('All done! ' + str(len(all_links)) + ' links discovered.')

    with jsonlines.open('pages-' + name + '.jsonl', 'w') as writer:
        writer.write_all(pages)

https://www.designbetter.co/
https://www.designbetter.co/freehand/pricing
https://www.designbetter.co/design-leadership-handbook
https://www.designbetter.co/design-thinking
https://www.designbetter.co/enterprise-design-sprints
https://www.designbetter.co/business-thinking-for-designers
https://www.designbetter.co/design-engineering-handbook
https://www.designbetter.co/collaborate-better-handbook
https://www.designbetter.co/designops-handbook
https://www.designbetter.co/design-systems-handbook
https://www.designbetter.co/books
https://www.designbetter.co/animation-handbook
https://www.designbetter.co/principles-of-product-design
https://www.designbetter.co/conversations
https://www.designbetter.co/subscribe
https://www.designbetter.co/remotework
https://www.designbetter.co/podcast
https://www.designbetter.co/conversations/andrew-ofstad
https://www.designbetter.co/conversations/benjamin-evans
https://www.designbetter.co/conversations/jason-goodwin
https://www.designbetter.co/conversation

## More dependencies

Install trafilatura, which is used to extract the content from pages, and tiktoken, which provides a first token count for the complete page content.

In [None]:
%pip install trafilatura tiktoken

## Extract content from discovered pages

We'll keep Markdown formatting for now. It is used later to split pages into sections at their headings.

In [16]:
import re
import jsonlines
import tiktoken
from bs4 import BeautifulSoup
from trafilatura import load_html, extract
from markdown import markdown

enc = tiktoken.encoding_for_model("text-davinci-003")

def markdown_to_text(markdown_string):
    html = markdown(markdown_string)

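    # Drop code samples; DOTALL lets the patterns span multi-line blocks
    html = re.sub(r'<pre>(.*?)</pre>', ' ', html, flags=re.DOTALL)
    html = re.sub(r'<code>(.*?)</code>', ' ', html, flags=re.DOTALL)

    soup = BeautifulSoup(html, "html.parser")
    text = ''.join(soup.find_all(string=True))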

    return text

extracted_content = []
with jsonlines.open('pages-' + name + '.jsonl') as pages:
    for page in pages:
        if (page['content']['html']):
            downloaded = load_html(page['content']['html'])
            parsed = BeautifulSoup(page['content']['html'], "html.parser")
            ogTitleTag = parsed.find("meta", property="og:title")
            title = parsed.title.string if (parsed.title and parsed.title.string) else ogTitleTag.get("content") if (ogTitleTag and ogTitleTag.get("content") != None) else page['url']
            ogDescriptionTag = parsed.find("meta", property="og:description")
            metaDescriptionTag = parsed.find("meta", attrs={'name': 'description'})
            metaDescription = metaDescriptionTag['content'] if (metaDescriptionTag and metaDescriptionTag.get("content") != None) else (ogDescriptionTag.get("content") if (ogDescriptionTag and ogDescriptionTag.get("content") != None) else 'No meta description extractable')

            result = extract(downloaded, url=page['url'], include_formatting=True, include_links=True, favor_recall=True)

            if result is None:
                print('couldnt extract:', page['url'])
            else:
                augmented = dict()
                augmented['url'] = page['url']
                augmented['slug'] = page['slug']

                augmented['content'] = page['content']
                augmented['content']['raw'] = markdown_to_text(result)
                augmented['content']['markdown'] = result

                augmented['title'] = title.replace('\n', ' ').strip()
                augmented['description'] = metaDescription.replace('\n', ' ').strip()

                augmented['lines'] = result.splitlines()
                augmented['size'] = len(result)
                augmented['token'] = len(enc.encode(result))

                extracted_content.append(augmented)
                print('extracted:', augmented['url'], augmented['title'], str(augmented['token']) + ' Token,', len(result))  

with jsonlines.open('pages-' + name + '_extracted.jsonl', 'w') as pages:
    pages.write_all(extracted_content)
    
print()
print('wrote extracted content to "pages-' + name + '_extracted.jsonl"')

extracted: https://www.designbetter.co/ Discover the world's best design practices—DesignBetter.Co 591 Token, 2930
extracted: https://www.designbetter.co/freehand/pricing Page Not Found - DesignBetter 527 Token, 2585
extracted: https://www.designbetter.co/design-leadership-handbook Design Leadership Handbook, your guide to becoming a strong design leader — DesignBetter.Co 510 Token, 2573
extracted: https://www.designbetter.co/design-thinking Design Thinking Handbook - Guide to a Design Thinking Process 510 Token, 2573
extracted: https://www.designbetter.co/enterprise-design-sprints Enterprise Design Sprints - DesignBetter 213 Token, 936
extracted: https://www.designbetter.co/business-thinking-for-designers Business Thinking for Designers - DesignBetter 281 Token, 1296
extracted: https://www.designbetter.co/design-engineering-handbook Design Engineering Handbook - DesignBetter 266 Token, 1097
extracted: https://www.designbetter.co/collaborate-better-handbook Collaborate Better Handbook 

## Even more dependencies

Install the BERT extractive summarizer and Sentence Transformers. We'll use these to create summaries as a first step.

In [None]:
%pip install bert-extractive-summarizer sentence-transformers

## Create SBert summaries

We first create SBert summaries by identifying the most central sentences on a page, concatenating those for a rough first summary.
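
To illustrate the idea of picking the "most central" sentences, here is a toy sketch. It is deliberately not the SBert approach: it swaps sentence embeddings for simple word-count vectors, and it scores each sentence by its summed cosine similarity to all others, which may differ from how the summarizer library selects sentences. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def central_sentences(text, num_sentences=2):
    """Join the `num_sentences` most central sentences in document order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = [sentence_vector(s) for s in sentences]
    # Centrality score: total similarity to every other sentence
    scores = [
        sum(cosine(v, other) for j, other in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(top[:num_sentences]))
```

Sentences that share vocabulary with many others score high, while an off-topic sentence scores near zero and is left out of the summary.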

In [None]:
import tiktoken
import jsonlines
from summarizer.sbert import SBertSummarizer
from tqdm.notebook import tqdm

enc = tiktoken.encoding_for_model("text-davinci-003")
model = SBertSummarizer('paraphrase-multilingual-MiniLM-L12-v2')  

with_summaries = []
with jsonlines.open('pages-' + name + '_extracted.jsonl', 'r') as pages:
    for page in pages:
        with_summaries.append(page)
    
print('Creating summaries for ' + str(len(list(with_summaries))) + ' pages.')
print()

for page in with_summaries:
    result = model(page['content']['raw'].replace('\n', ' '), num_sentences=4, min_length=60)
    metaDescription = page.pop('description')
    page['summaries'] = dict()
    page['summaries']['sbert'] = ''.join(result)
    page['summaries']['meta'] = metaDescription
    print(page['url'], page['title'], str(len(enc.encode(page['summaries']['sbert']))) + ' Token,')
        
with jsonlines.open('pages-' + name + '_extracted_summaries.jsonl', 'w') as pages:
    pages.write_all(with_summaries)

print()
print('All summaries created.')

## More dependencies and download

We'll install `nltk` and download the `punkt` set.

In [None]:
%pip install nltk

In [10]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Extract sections from markdown page content

We'll extract sections from our pages by splitting along markdown headlines (# to ######).
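
The core of the splitting step can be sketched in isolation. `split_by_headings` is a hypothetical, simplified helper: unlike the cell below it anchors headings to the start of a line, and it does no token counting or category filtering. It would also misread `#` lines inside fenced code blocks as headings.

```python
import re

# One to six "#" at the start of a line, then spaces or tabs, then the heading text
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)


def split_by_headings(markdown_text):
    """Return a list of (level, heading, body) tuples.

    Text before the first heading is returned with level 0 and heading None.
    """
    sections = []
    matches = list(HEADING.finditer(markdown_text))
    intro_end = matches[0].start() if matches else len(markdown_text)
    intro = markdown_text[:intro_end].strip()
    if intro:
        sections.append((0, None, intro))
    # Each section body runs from the end of its heading to the next heading
    for match, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(markdown_text)
        body = markdown_text[match.end():end].strip()
        sections.append((len(match.group(1)), match.group(2).strip(), body))
    return sections
```

The heading level (number of `#` characters) is kept in each tuple so that nested sections could later be grouped or discarded as a tree.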

In [19]:
import re
import json
import jsonlines
import tiktoken
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from typing import Set
from markdown import markdown
from nltk.tokenize import sent_tokenize

enc = tiktoken.encoding_for_model("text-davinci-003")

def markdown_to_text(markdown_string):
    html = markdown(markdown_string)

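    # Drop code samples; DOTALL lets the patterns span multi-line blocks
    html = re.sub(r'<pre>(.*?)</pre>', ' ', html, flags=re.DOTALL)
    html = re.sub(r'<code>(.*?)</code>', ' ', html, flags=re.DOTALL)

    soup = BeautifulSoup(html, "html.parser")
    text = ''.join(soup.find_all(string=True))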

    return text

def count_tokens(text: str) -> int:
    """count the number of tokens in a string"""
    return len(enc.encode(text))

def reduce_long(
    long_text: str, long_text_tokens: int = 0, max_len: int = 590
) -> str:
    """
    Reduce a long text to a maximum of `max_len` tokens by potentially cutting at a sentence end
    """
    if not long_text_tokens:
        long_text_tokens = count_tokens(long_text)
    if long_text_tokens > max_len:
        sentences = sent_tokenize(long_text.replace("\n", " "))
        ntokens = 0
        for i, sentence in enumerate(sentences):
            ntokens += 1 + count_tokens(sentence)
            if ntokens > max_len:
                return ". ".join(sentences[:i][:-1]) + "."

    return long_text

discard_categories = []

def extract_sections(
    page_text: str,
    title: str,
    max_len: int = 1500,
    discard_categories: Set[str] = discard_categories,
) -> list:
    """
    Extract the sections of a kickstartDS page, discarding the references and other low information sections
    """
    if len(page_text) == 0:
        return []

    # find all headings and the corresponding contents
    headings = re.findall("#+ .*", page_text)
    for heading in headings:
        page_text = page_text.replace(heading, "#+ !!")
    contents = page_text.split("#+ !!")
    contents = [c.strip() for c in contents]
    assert len(headings) == len(contents) - 1

    cont = contents.pop(0).strip()
    outputs = [(title, "Summary", cont, count_tokens(cont)+4)]
    
    # discard the discard categories, accounting for a tree structure
    max_level = 100
    keep_group_level = max_level
    remove_group_level = max_level
    nheadings, ncontents = [], []
    for heading, content in zip(headings, contents):
        # Markdown headings look like "## Some heading", so keep every word after the marker
        plain_heading = " ".join(heading.split(" ")[1:])
        num_equals = len(heading.split(" ")[0])
        if num_equals <= keep_group_level:
            keep_group_level = max_level

        if num_equals > remove_group_level:
            if (
                num_equals <= keep_group_level
            ):
                continue
        keep_group_level = max_level
        if plain_heading in discard_categories:
            remove_group_level = num_equals
            keep_group_level = max_level
            continue
        nheadings.append(heading.replace("#", "").strip())
        ncontents.append(markdown_to_text(content).replace('\n', ' '))
        remove_group_level = max_level

    # count the tokens of each section
    ncontent_ntokens = [
        count_tokens(c)
        + 3
        + count_tokens(" ".join(h.split(" ")[1:-1]))
        - (1 if len(c) == 0 else 0)
        for h, c in zip(nheadings, ncontents)
    ]

    # Create a tuple of (title, section_name, content, number of tokens)
    for h, c, t in zip(nheadings, ncontents, ncontent_ntokens):
        if t >= max_len:
            # Pass max_len by keyword: the second positional parameter is long_text_tokens
            c = reduce_long(c, max_len=max_len)
            t = count_tokens(c)
        outputs.append((title, h, c, t))
    
    return outputs

with_sections = []
with jsonlines.open('pages-' + name + '_extracted_summaries.jsonl') as pages:
    for page in pages:
        outputs = []
        outputs += extract_sections(page["content"]["markdown"], page["title"])
        
        df = pd.DataFrame(outputs, columns=["title", "heading", "content", "tokens"])
        df = df[df.tokens>40]
        df = df.drop_duplicates(['title','heading'])
        df = df.reset_index().drop('index',axis=1) # reset index
        df.head()

        result = df.to_json(orient="records")
        parsed = json.loads(result)
        page['sections'] = parsed
        
        with_sections.append(page)

with jsonlines.open('pages-' + name + '_extracted_sections.jsonl', 'w') as pages:
    pages.write_all(with_sections)
    
print('Extracted sections for ' + str(len(with_sections)) + ' pages.')

Extracted sections for 142 pages.
