# Ingest Content

## Notes

Collection of notes to help ingest content from a domain. Includes specific instructions to filter output for some domains (like bradfrost.com).

### bradfrost.com

- Exclude `/blog/link/*`, `/blog/tag/*`, `/attachment` and `/blog/page/*`

### danmall.com

- Exclude `/topics/*`

### primer.style

- Exclude `/octicons/*`, `/design/foundations/icons/*` and `/view-components/lookbook/*`

### digitaldesign.vattenfall.com

- Exclude `/components/modules/*`.

### volkswagen.frontify.com

- Exclude `/builder/groupui/*`.

### designstrategy.guide

- Exclude `/tag/*`

### www.designbetter.co

- Exclude `pencils-before-pixels`
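
The exclude rules above could be kept as data rather than hard-coded into the crawler. The following is a minimal sketch under some assumptions: `EXCLUDES` and `is_excluded` are hypothetical names that do not exist in this notebook, only a few of the domains are listed, patterns containing `*` are matched as globs against the URL path, and patterns without `*` are treated as plain substrings.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

# Hypothetical mapping of host name -> exclude patterns, copied from the notes above.
EXCLUDES = {
    'bradfrost.com': ['/blog/link/*', '/blog/tag/*', '/attachment', '/blog/page/*'],
    'danmall.com': ['/topics/*'],
    'www.designbetter.co': ['pencils-before-pixels'],
}

def is_excluded(url, excludes=EXCLUDES):
    """Return True if the path of `url` matches an exclude pattern for its host."""
    parsed = urlparse(url)
    for pattern in excludes.get(parsed.netloc, []):
        if '*' in pattern:
            # Glob match; note that `*` also matches `/` in fnmatch.
            if fnmatchcase(parsed.path, pattern):
                return True
        elif pattern in parsed.path:
            # Assumption: patterns without a wildcard are substrings of the path.
            return True
    return False
```

A check like `not is_excluded(link)` could then be added to the condition that decides whether a link is queued.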

## Dependencies

Install the following dependencies first:

In [None]:
%pip install bs4 jsonlines pytest-playwright requests python-slugify

## Install needed browsers

The crawler below only uses Chromium, but `playwright install` also downloads WebKit and Firefox.

In [None]:
!playwright install

## Set the domain to ingest

Set the domain you want to have scanned. The crawl covers all subpages on that domain (and that domain only) and skips links with query parameters (`?`) or anchors (`#`).

Also set a name that will be used to create files to persist ingested content.

In [32]:
name = 'kickstartds_com'
url = 'https://www.kickstartDS.com/'
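
The `name` could also be derived from the `url` instead of being typed by hand. This is only a sketch: `name_from_url` is a hypothetical helper, and dropping a leading `www.` and replacing every other non-alphanumeric run with `_` is an assumption based on the single example above.

```python
import re
from urllib.parse import urlparse

def name_from_url(url):
    """Turn a URL into a file-name-friendly identifier based on its host."""
    host = urlparse(url).netloc.lower()
    if host.startswith('www.'):
        host = host[len('www.'):]
    # Collapse dots, dashes and anything else non-alphanumeric into underscores.
    return re.sub(r'[^a-z0-9]+', '_', host).strip('_')

name_from_url('https://www.kickstartDS.com/')  # 'kickstartds_com'
```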

TODO: consider a per-domain script or snippet (some kind of hook) to handle edge cases.

## Find all internal URLs

The first step is to crawl the domain for all internal links that lead to HTML content, repeating until no new links are discovered. The crawl starts from the `url` set above, which seeds the queue in the `__main__` block.

Write the set of discovered URLs from `all_links` to disk, converting them to `jsonl` format for easier processing in the next steps. We'll build upon that `page` dict in the following steps.
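
To make the record format concrete, here is a small sketch of one `page` dict and its round trip through the JSON Lines format (one JSON object per line). It uses only the standard library and an in-memory buffer instead of `jsonlines` and a file; the slug and HTML values are placeholders, not real crawl output.

```python
import io
import json

# Shape of one record as built by the crawler: url, slug and nested content.
page = {
    'url': 'https://www.kickstartDS.com/about/',
    'slug': 'about',                           # placeholder slug
    'content': {'html': '<html></html>'},      # placeholder HTML
}

# JSON Lines: serialize each record to a single line.
buffer = io.StringIO()
buffer.write(json.dumps(page) + '\n')

# Reading it back yields the same dicts, one per non-empty line.
buffer.seek(0)
restored = [json.loads(line) for line in buffer if line.strip()]
```

Later steps add keys such as `title`, `summaries` and `sections` to the same dict and write it to a new file.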

The scroll delay (currently 5 × 0.5 s per page) doubles as a crude backoff for now, so that the crawl is less likely to be blocked as a DDoS attempt or as obvious scraping.
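
A more explicit backoff could wrap each fetch and wait longer after every failure. The following is a sketch only: `with_backoff` is a hypothetical helper that is not used anywhere in this notebook, and the retry count and delays are arbitrary assumptions.

```python
import time

def with_backoff(fetch, url, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url); after a failure wait base_delay * 2**attempt and retry."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            # Give up after the last attempt and surface the original error.
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

`fetch` can be any callable that takes a URL, and `sleep` is injectable so that the helper can be exercised without actually waiting.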

TODO:
- Still some duplicate URLs (/ vs non-/)
- Not finding some URLs when behind tabbed content? (e.g. https://atlassian.design/components/button/examples -> Usage tab)
- Add exclude list / filter to processed urls (e.g. to exclude stuff like "blog/tag/*")
- Attempt to handle Cookies / Cookie Consent, try to close cookie banners? (e.g. for screenshots)
- Some types of links still seem to fail (e.g. "https://danmall.com/topics/artificial intelligence", as logged)
- Some links (probably because of weird endless scrolling) seem to always time out, like this one: https://www.designbetter.co/principles-of-product-design/break-black-box
- Subdomains seem to have problems (like here: https://design-system.service.gov.uk/ / https://hds.hel.fi/)
- Sometimes empty URLs
- Timeout on https://primer.style/design/foundations/css-utilities/animations
- Retry-Logic and incremental crawling
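
Several of these items (trailing-slash duplicates, links containing spaces, empty URLs) might be addressed by normalizing every link before it is queued. This is a sketch under assumptions: `normalize_url` is a hypothetical helper, and treating `/path/` and `/path` as the same page is a choice that may not hold for every site.

```python
from urllib.parse import quote, urlsplit

def normalize_url(url):
    """Return a canonical form of `url`, or None if it is empty or not HTTP(S)."""
    if not url or not url.strip():
        return None
    parts = urlsplit(url.strip())
    if parts.scheme not in ('http', 'https') or not parts.netloc:
        return None
    # Percent-encode spaces and similar characters; keep existing %XX escapes.
    path = quote(parts.path, safe='/%')
    # Drop trailing slashes so "/docs/" and "/docs" compare equal; keep the root.
    path = path.rstrip('/') or '/'
    return parts._replace(path=path).geturl()
```

Using the normalized form for `visited`, `queue` and `all_links` alike would keep the three collections consistent.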

In [22]:
import requests
import jsonlines
import pathlib
import time
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright
from slugify import slugify
 
def get_domain(url):
    parsed_uri = urlparse(url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    return domain

def get_path(url):
    parsed_uri = urlparse(url)
    return parsed_uri.path
 
def get_links(url, content):
    soup = BeautifulSoup(content, 'html.parser')
    domain = get_domain(url)
    links = set()
    for link in soup.find_all('a'):
        link_url = link.get('href')
        if link_url:
            absolute_link = urljoin(url, link_url)
            if absolute_link.startswith(domain):
                links.add(absolute_link)
    return links
 
async def playwright(url, slug):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        
        screenshot_dir = 'screenshots/' + name
        pathlib.Path(screenshot_dir).mkdir(parents=True, exist_ok=True)
        await page.screenshot(path=f'{screenshot_dir}/{slug}-{p.chromium.name}.png')
        
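        # Scroll down a few times to trigger lazy-loaded content. Waiting via
        # Playwright keeps the event loop free, unlike a blocking time.sleep().
        for _ in range(5):
            await page.mouse.wheel(0, 15000)
            await page.wait_for_timeout(500)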
        
        content = await page.content()
        await browser.close()
        return content
    
if __name__ == '__main__':
    queue = [url]
    visited = set()
    all_links = set()
    pages = []
 
    while queue:
        url = queue.pop(0)
        visited.add(url)
        response = requests.get(url)
        # Use .get() so a response without a Content-Type header is skipped instead of raising.
        if response.ok and "text/html" in response.headers.get('Content-Type', ''):
            page = dict()
            print(url)
            if (url.rstrip("/") not in all_links):
                all_links.add(url.rstrip("/"))
                page['url'] = url
                page['slug'] = slugify(get_path(page['url']))
                page['content'] = dict()
                page['content']['html'] = await playwright(page['url'], page['slug'])
                pages.append(page)
    
                links = get_links(page['url'], page['content']['html'])
                for link in links:
                    if link not in visited and link not in queue and '#' not in link and '?' not in link and 'pencils-before-pixels' not in link:
                        queue.append(link)
    
    print()
    print('All done! ' + str(len(all_links)) + ' links discovered.')

    with jsonlines.open('pages-' + name + '.jsonl', 'w') as writer:
        writer.write_all(pages)

https://www.kickstartDS.com/
https://www.kickstartDS.com/about/
https://www.kickstartDS.com/storybook/
https://www.kickstartDS.com/integrations/
https://www.kickstartDS.com/privacy/
https://www.kickstartDS.com/legal/
https://www.kickstartDS.com/services/
https://www.kickstartDS.com/cookies/
https://www.kickstartDS.com/storybook
https://www.kickstartDS.com/docs/
https://www.kickstartDS.com/blog/great-components/
https://www.kickstartDS.com/blog/
https://www.kickstartDS.com/docs/integration/
https://www.kickstartDS.com/docs/guides/components/
https://www.kickstartDS.com/docs/guides/migrations/upgrade-2.0.0
https://www.kickstartDS.com/docs/foundations/components/component-api
https://www.kickstartDS.com/docs/foundations/layout/sections
https://www.kickstartDS.com/docs/foundations/token/branding-token
https://www.kickstartDS.com/docs/guides
https://www.kickstartDS.com/docs/guides/components/recipe
https://www.kickstartDS.com/docs/foundations/components/styling
https://www.kickstartDS.com/d

## More dependencies

Install trafilatura, which is used to extract the content from pages, and tiktoken, which gives a first relevant token count for the complete page content.

In [None]:
%pip install trafilatura tiktoken markdown

## Extract content from discovered pages

We'll keep the Markdown formatting for now. It will be used later to split pages into sections at their headlines.

In [23]:
import re
import jsonlines
import tiktoken
from bs4 import BeautifulSoup
from trafilatura import load_html, extract
from markdown import markdown

enc = tiktoken.encoding_for_model("text-davinci-003")

def markdown_to_text(markdown_string):
    html = markdown(markdown_string)

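    # Drop code blocks and inline code; DOTALL lets the patterns span several lines.
    html = re.sub(r'<pre>(.*?)</pre>', ' ', html, flags=re.DOTALL)
    html = re.sub(r'<code>(.*?)</code>', ' ', html, flags=re.DOTALL)

    soup = BeautifulSoup(html, "html.parser")
    text = ''.join(soup.find_all(string=True))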

    return text

extracted_content = []
with jsonlines.open('pages-' + name + '.jsonl') as pages:
    for page in pages:
        if (page['content']['html']):
            downloaded = load_html(page['content']['html'])
            parsed = BeautifulSoup(page['content']['html'], "html.parser")
            # Title: <title> first, then og:title, then the URL as a last resort.
            ogTitleTag = parsed.find("meta", property="og:title")
            if parsed.title and parsed.title.string:
                title = parsed.title.string
            elif ogTitleTag and ogTitleTag.get("content") is not None:
                title = ogTitleTag.get("content")
            else:
                title = page['url']

            # Description: meta description first, then og:description, then a fixed fallback.
            ogDescriptionTag = parsed.find("meta", property="og:description")
            metaDescriptionTag = parsed.find("meta", attrs={'name': 'description'})
            if metaDescriptionTag and metaDescriptionTag.get("content") is not None:
                metaDescription = metaDescriptionTag['content']
            elif ogDescriptionTag and ogDescriptionTag.get("content") is not None:
                metaDescription = ogDescriptionTag.get("content")
            else:
                metaDescription = 'No meta description extractable'

            result = extract(downloaded, url=page['url'], include_formatting=True, include_links=True, favor_recall=True)

            if result is None:
                print('could not extract:', page['url'])
            else:
                augmented = dict()
                augmented['url'] = page['url']
                augmented['slug'] = page['slug']

                augmented['content'] = page['content']
                augmented['content']['raw'] = markdown_to_text(result)
                augmented['content']['markdown'] = result

                augmented['title'] = title.replace('\n', ' ').strip()
                augmented['description'] = metaDescription.replace('\n', ' ').strip()

                augmented['lines'] = result.splitlines()
                augmented['size'] = len(result)
                augmented['token'] = len(enc.encode(result))

                extracted_content.append(augmented)
                print('extracted:', augmented['url'], augmented['title'], str(augmented['token']) + ' Token,', len(result))  

with jsonlines.open('pages-' + name + '_extracted.jsonl', 'w') as pages:
    pages.write_all(extracted_content)
    
print()
print('wrote extracted content to "pages-' + name + '_extracted.jsonl"')

extracted: https://www.kickstartDS.com/ kickstartDS – Open Source starter kit and low-code framework for Design Systems // kickstartDS 1335 Token, 5737
extracted: https://www.kickstartDS.com/about/ kickstartDS – about us and the team // kickstartDS 636 Token, 2903
extracted: https://www.kickstartDS.com/storybook/ Welcome - Page ⋅ Storybook 550 Token, 2360
extracted: https://www.kickstartDS.com/integrations/ Integrations - making your interface come alive! // kickstartDS 1538 Token, 6891
extracted: https://www.kickstartDS.com/privacy/ Privacy policy // kickstartDS 5794 Token, 27495
extracted: https://www.kickstartDS.com/legal/ Legal notice // kickstartDS 226 Token, 798
extracted: https://www.kickstartDS.com/services/ kickstartDS – Design System as a Service, trainings, workshops, etc. // kickstartDS 408 Token, 1949
extracted: https://www.kickstartDS.com/cookies/ https://www.kickstartDS.com/cookies/ 529 Token, 2137
extracted: https://www.kickstartDS.com/docs/ Welcome to the kickstartDS d

## Even more dependencies

Install the BERT extractive summarizer and Sentence Transformers. We'll use these to create summaries as a first step.

In [None]:
%pip install bert-extractive-summarizer sentence-transformers

## Create SBert summaries

We first create SBert summaries by identifying the most central sentences on a page, concatenating those for a rough first summary.
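
To illustrate the idea of picking central sentences, here is a toy stand-in that uses plain word overlap instead of sentence embeddings. It is not the method used by `SBertSummarizer`; `central_sentences` and its helpers are hypothetical names, and the naive sentence splitting on `.`, `!` and `?` is an assumption made for brevity.

```python
import math
import re
from collections import Counter

def _vector(sentence):
    """Bag-of-words vector: lower-cased word -> count."""
    return Counter(re.findall(r'\w+', sentence.lower()))

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def central_sentences(text, num_sentences=2):
    """Keep the sentences most similar to all others, in their original order."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    vectors = [_vector(s) for s in sentences]
    # Centrality of a sentence: summed similarity to every other sentence.
    scores = [
        sum(_cosine(v, w) for j, w in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return ' '.join(sentences[i] for i in sorted(top))
```

The real summarizer works on embeddings from the `paraphrase-multilingual-MiniLM-L12-v2` model, so it can relate sentences that share meaning but no words.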

In [None]:
import tiktoken
import jsonlines
from summarizer.sbert import SBertSummarizer
from tqdm.notebook import tqdm

enc = tiktoken.encoding_for_model("text-davinci-003")
model = SBertSummarizer('paraphrase-multilingual-MiniLM-L12-v2')  

with_summaries = []
with jsonlines.open('pages-' + name + '_extracted.jsonl', 'r') as pages:
    for page in pages:
        with_summaries.append(page)
    
print('Creating summaries for ' + str(len(with_summaries)) + ' pages.')
print()

for page in with_summaries:
    result = model(page['content']['raw'].replace('\n', ' '), num_sentences=4, min_length=60)
    metaDescription = page.pop('description')
    page['summaries'] = dict()
    page['summaries']['sbert'] = ''.join(result)
    page['summaries']['meta'] = metaDescription
    print(page['url'], page['title'], str(len(enc.encode(page['summaries']['sbert']))) + ' Token,')
        
with jsonlines.open('pages-' + name + '_extracted_summaries.jsonl', 'w') as pages:
    pages.write_all(with_summaries)

print()
print('All summaries created.')

## Install dependencies for splitting

We'll split along Markdown syntax, using `langchain` to split as intelligently as possible.

In [None]:
%pip install langchain markdownify

## Extract sections from markdown page content

We'll extract sections from our pages by splitting along Markdown headlines (`#` to `######`).
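
Splitting at headlines can be pictured with a small standard-library sketch. This is not what `MarkdownTextSplitter` does internally; `split_by_headlines` is a hypothetical helper, it only recognizes ATX headlines, and it would also match `#` lines inside fenced code blocks.

```python
import re

# An ATX headline: 1 to 6 '#' characters, at least one space or tab, then the title.
HEADLINE = re.compile(r'^(#{1,6})[ \t]+(.*)$', re.MULTILINE)

def split_by_headlines(markdown_text):
    """Split Markdown into sections, each starting at a headline."""
    matches = list(HEADLINE.finditer(markdown_text))
    sections = []
    # Text before the first headline becomes a level-0 section.
    preamble_end = matches[0].start() if matches else len(markdown_text)
    preamble = markdown_text[:preamble_end].strip()
    if preamble:
        sections.append({'level': 0, 'title': '', 'body': preamble})
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        sections.append({
            'level': len(match.group(1)),
            'title': match.group(2).strip(),
            'body': markdown_text[match.end():end].strip(),
        })
    return sections
```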

In [33]:
import jsonlines
import tiktoken
from langchain.text_splitter import MarkdownTextSplitter
from markdownify import MarkdownConverter
from markdown import markdown
from bs4 import BeautifulSoup

enc = tiktoken.encoding_for_model("text-davinci-003")
markdown_splitter = MarkdownTextSplitter.from_tiktoken_encoder(chunk_size=300, chunk_overlap=50)

class IgnorantConverter(MarkdownConverter):
    def convert_script(self, el, text, convert_as_inline):
        return ''
    
    def convert_iframe(self, el, text, convert_as_inline):
        return ''
    
    def convert_style(self, el, text, convert_as_inline):
        return ''

def md(html, **options):
    return IgnorantConverter(**options).convert(html)

pages = []
with jsonlines.open('pages-' + name + '_extracted_summaries.jsonl') as pagesReader:
    pages = [page for page in pagesReader]
    sections = markdown_splitter.create_documents(
        list(map(lambda page: md(
            page["content"]["html"],
            convert=["h1", "h2", "h3", "h4", "h5", "h6", "p", "script", "style", "iframe", "b", "ul", "li", "a", "ol", "quote"],
            heading_style="ATX"
        ), pages)),
        list(map(lambda page: dict((k, page[k]) for k in ['url'] if k in page), pages))
    )

    for page in pages:
        page['sections'] = []

    for section in sections:
        pageIndex = [index for (index, item) in enumerate(pages) if item['url'] == section.metadata['url']][0]
        
        if (not "sections" in pages[pageIndex]):
            pages[pageIndex]["sections"] = []

        pageSection = dict()
        pageSection['content'] = dict()
        pageSection['content']['markdown'] = section.page_content
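        # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
        pageSection['content']['raw'] = ''.join(BeautifulSoup(markdown(section.page_content), "html.parser").find_all(string=True))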
        pageSection['tokens'] = len(enc.encode(section.page_content))

        pages[pageIndex]["sections"].append(pageSection)

with jsonlines.open('pages-' + name + '_extracted_sections.jsonl', 'w') as pagesWriter:
    pagesWriter.write_all(pages)
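
The `chunk_size=300` and `chunk_overlap=50` settings used above mean that consecutive sections share some tokens, so that context is not lost at a section boundary. The sketch below shows the idea as a plain sliding window over a token list. It is a simplification: `chunk_with_overlap` is a hypothetical helper, and the real splitter also tries to break at Markdown separators rather than at fixed offsets.

```python
def chunk_with_overlap(tokens, chunk_size=300, chunk_overlap=50):
    """Split `tokens` into windows of `chunk_size` that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunk_with_overlap(list(range(10)), chunk_size=4, chunk_overlap=1)
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```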