# Recursively Finding the PG Number

<br/>

<a target="_blank" href="https://www.webtranspose.com/">
  <img src="https://www.pgnumber.com/_next/image?url=%2Fbuilt-with-black-new.png&w=384&q=75" alt="Built with Web Transpose" style='max-width:20%'/>
</a>

---

This notebook uses the [Web Transpose AI Web Scraper](https://webtranspose.com/ai-web-scraper) to extract the people Paul Graham thanks in his blog posts.

It then uses the [Web Transpose AI SERP API](https://www.webtranspose.com/) and [Web Transpose Web Crawl](https://www.webtranspose.com/distributed-cloud-web-crawler) to find and recursively crawl these people's blogs, building a graph of how close each person is to Paul Graham.

- [Visualization (PG Number: pgnumber.com)](https://pgnumber.com/)
- [Blog Post](https://www.webtranspose.com/blog/examples/pg-number)

<a target="_blank" href="https://colab.research.google.com/github/mike-gee/webtranspose-tutorials-python/blob/main/scrape-tutorials/pg-number.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
!pip install webtranspose --upgrade

## 0. Setup & Get API Key

Import Web Transpose. You can get a **free Web Transpose API Key** [here](https://app.webtranspose.com/dashboard).

In [None]:
import os

import webtranspose as webt

os.environ["WEBTRANSPOSE_API_KEY"] = "YOUR API KEY"

## 1. Crawling the Web Page

This queues a crawl that downloads the pages of Paul Graham's website.

In [None]:
crawl = webt.Crawl(
    'http://paulgraham.com/',
    max_pages=10,  # 10 pages is enough for testing; raise this to 500 for a full run
    render_js=False,
)
crawl.queue_crawl()
print('Crawl ID:', crawl.crawl_id)

In [None]:
print(crawl)

In [None]:
# (Optional) Recover the Crawl ID if required
crawl = webt.get_crawl("YOUR CRAWL ID")

## 2. Scraping the Web Page

First, define a schema describing what to extract from each page, then build an AI web scraper from it. The same schema-driven approach works on any website.

In [None]:
schema = {
    'page classification': ['blog / essay', 'other type of page'],
    'people thanked': {
        'type': 'array',
        'items': {
            'person name': 'string',
            'reason mentioned': ['thanked for reading drafts of this blog / essay', 'other kind of praise', 'other reason']
        }
    }
}
scraper = webt.Scraper(schema)

Now, we loop through all the pages from the web crawl and extract the data.

In [None]:
essays = []
for i, url in enumerate(crawl.get_visited()):
    if i % 2 == 0:
        print(i)
    page = crawl.get_page(url)
    # Pass in the HTML the crawl has already downloaded.
    # Alternatively, pass only the URL and Web Transpose will fetch the page itself.
    out_data = scraper.scrape(url, html=page['html'])
    if out_data['page classification'] == 'blog / essay':
        essays.append({
            'url': url,
            'people thanked': out_data['people thanked']
        })

## 3. Getting Data

Get all unique people.

In [None]:
people = set()
for essay in essays:
    for person in essay['people thanked']:
        if person['reason mentioned'] == 'thanked for reading drafts of this blog / essay':
            people.add(person['person name'])
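
Because the names come from AI extraction, the same person may appear with slightly different spacing or capitalization, and a plain `set` would count those as different people. The following is a minimal sketch of one way to tidy the names before deduplicating. The helpers `normalize_name` and `unique_people` and the sample names are hypothetical and are not part of Web Transpose.

```python
import re


def normalize_name(name: str) -> str:
    """Collapse runs of whitespace and trim the ends of a name."""
    return re.sub(r"\s+", " ", name).strip()


def unique_people(names):
    """Return one display name per person, comparing names case-insensitively.

    The first spelling seen for each person is the one that is kept.
    """
    seen = {}
    for raw in names:
        cleaned = normalize_name(raw)
        if not cleaned:
            continue  # skip empty strings
        seen.setdefault(cleaned.casefold(), cleaned)
    return sorted(seen.values())


# Hypothetical extracted names, for illustration only.
sample = ["Ada  Example", "ada example", " Bob Sample ", ""]
print(unique_people(sample))
```

This only handles whitespace and case. Nicknames or initials would still need manual review.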

Get everyone's blogs.

In [None]:
blog_dict = {}
for person_name in list(people):
    results = webt.search_filter(f"{person_name}'s blog")
    if len(results['filtered_results']) > 0:
        blog_dict[person_name] = [x['url'] for x in results['filtered_results']]

In [None]:
blog_dict
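
Search results for different people can point at the same site, which would queue the same crawl more than once in the next step. As a rough sketch, assuming `blog_dict` maps each name to a list of URL strings as built above, the URLs could be deduplicated by hostname first. The helper `dedupe_by_host` and the sample data are hypothetical.

```python
from urllib.parse import urlparse


def dedupe_by_host(blog_dict):
    """Keep the first URL seen for each hostname across all people."""
    seen_hosts = set()
    deduped = {}
    for person, urls in blog_dict.items():
        kept = []
        for url in urls:
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]  # treat www.example.com and example.com alike
            if host and host not in seen_hosts:
                seen_hosts.add(host)
                kept.append(url)
        if kept:
            deduped[person] = kept
    return deduped


# Hypothetical search results, for illustration only.
sample = {
    "Person A": ["https://www.example.com/blog", "https://example.com/about"],
    "Person B": ["https://example.com/", "https://example.org/"],
}
print(dedupe_by_host(sample))
```

Keeping only one URL per host is a design choice. It saves crawl budget, but it would merge people who publish on a shared platform, so it may not suit every dataset.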

## 4. Recurse

Now re-run the crawl and scrape steps above on each of these new websites.

It is most efficient to queue all the crawls first so they run in parallel, and then scrape the results.

### Crawl all the new websites

In [None]:
website_crawl_dict = {}

for name, websites in blog_dict.items():
    for url in websites:
        crawl = webt.Crawl(url, max_pages=100)
        crawl.queue_crawl()
        website_crawl_dict[url] = crawl.crawl_id

In [None]:
crawl.max_pages

In [None]:
crawl.get_queued()

### Wait for Crawling to Complete

In [None]:
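from time import sleep

complete = False

while not complete:
    # Assume every crawl has finished until one is found still running.
    complete = True
    for url, crawl_id in website_crawl_dict.items():
        crawl = webt.get_crawl(crawl_id)
        # A crawl is still running if it is under its page limit and has pages queued.
        if len(crawl.get_visited()) < crawl.max_pages and len(crawl.get_queued()) > 0:
            print(f"Crawl for {url} is still running...")
            complete = False
    if not complete:
        print("Waiting for crawls to finish...")
        sleep(30)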
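
The wait loop above has no upper time limit, so a stalled crawl would keep the notebook waiting. A minimal sketch of a bounded polling helper is shown here. `wait_until` is a hypothetical helper, not a Web Transpose function, and the completion check is passed in as a callable so the sketch runs without an API key.

```python
import time


def wait_until(is_done, interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Call is_done() every `interval` seconds until it returns True.

    Returns True if is_done() succeeded, or False if roughly `timeout`
    seconds of waiting elapsed first.
    """
    waited = 0.0
    while True:
        if is_done():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Stand-in check that reports completion on the third poll.
polls = {"count": 0}


def fake_check():
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_until(fake_check, interval=0.01, timeout=1.0))
```

In the notebook, `is_done` could wrap the same `get_visited()`, `get_queued()` and `max_pages` checks used in the loop above.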

### Scrape Crawls

In [None]:
crawl_essay_dict = {}

for crawl_id in website_crawl_dict.values():
    crawl = webt.get_crawl(crawl_id)
    
    crawl_essays = []
    for i, url in enumerate(crawl.get_visited()):
        if i % 2 == 0:
            print(i)
        page = crawl.get_page(url)
        # Pass in the HTML the crawl has already downloaded.
        # Alternatively, pass only the URL and Web Transpose will fetch the page itself.
        out_data = scraper.scrape(url, html=page['html'])
        if out_data['page classification'] == 'blog / essay':
            crawl_essays.append({
                'url': url,
                'people thanked': out_data['people thanked']
            })
            
    crawl_essay_dict[crawl_id] = crawl_essays
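
The scrape loop here repeats the one from section 2, so it could be factored into a single helper. The sketch below assumes only the calls already used in this notebook (`get_visited()`, `get_page(url)` and `scrape(url, html=...)`) and that the scraper returns a dict keyed by the schema fields. `collect_essays`, `FakeCrawl` and `FakeScraper` are hypothetical names, and the fake objects exist only so the example runs offline.

```python
def collect_essays(crawl, scraper, essay_label="blog / essay"):
    """Scrape every visited page of a crawl and keep the ones labeled as essays.

    `crawl` needs get_visited() and get_page(url); `scraper` needs
    scrape(url, html=...). These mirror the calls used in the cells above.
    """
    essays = []
    for url in crawl.get_visited():
        page = crawl.get_page(url)
        out_data = scraper.scrape(url, html=page["html"])
        if out_data.get("page classification") == essay_label:
            essays.append({
                "url": url,
                # Fall back to an empty list if the field is missing or None.
                "people thanked": out_data.get("people thanked") or [],
            })
    return essays


# Stand-in objects so the sketch runs without an API key (hypothetical data).
class FakeCrawl:
    def __init__(self, pages):
        self.pages = pages

    def get_visited(self):
        return list(self.pages)

    def get_page(self, url):
        return {"html": self.pages[url]}


class FakeScraper:
    def scrape(self, url, html=None):
        if "essay" in html:
            return {"page classification": "blog / essay", "people thanked": []}
        return {"page classification": "other type of page"}


fake_crawl = FakeCrawl({
    "https://example.com/a": "<p>essay</p>",
    "https://example.com/b": "<p>index</p>",
})
print(collect_essays(fake_crawl, FakeScraper()))
```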

### Re-Gather People

In [None]:
for crawl_id in website_crawl_dict.values():
    new_essays = crawl_essay_dict[crawl_id]
    for essay in new_essays:
        for person in essay['people thanked']:
            if person['reason mentioned'] == 'thanked for reading drafts of this blog / essay':
                people.add(person['person name'])
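
Once each author is linked to the people they thank, a "PG number" can be read as the shortest-path distance from Paul Graham in that graph. The following is a minimal sketch using breadth-first search. It is an assumption about how closeness might be computed, not necessarily how pgnumber.com does it, and `pg_numbers` and the sample graph are hypothetical.

```python
from collections import deque


def pg_numbers(thanks_graph, root="Paul Graham"):
    """Breadth-first search from `root` over author -> thanked-people edges.

    Returns a dict mapping each reachable person to their distance from root.
    """
    distances = {root: 0}
    queue = deque([root])
    while queue:
        author = queue.popleft()
        for person in thanks_graph.get(author, ()):
            if person not in distances:  # first visit is the shortest path
                distances[person] = distances[author] + 1
                queue.append(person)
    return distances


# Hypothetical graph, for illustration only.
sample_graph = {
    "Paul Graham": ["Person A", "Person B"],
    "Person A": ["Person C", "Person B"],
    "Person C": ["Paul Graham"],
}
print(pg_numbers(sample_graph))
```

Tracking visited people in `distances` also prevents the recursion from revisiting someone who is thanked by several authors.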

## 5. Result

You can view the results of running this recursively at:

- [PG Number & Essay Graph](https://pgnumber.com/) (Visual Interface)
- [GitHub Repo](https://github.com/mike-gee/pg-number) (Raw Data on GitHub)

### Tools Used:

- [Web Transpose](https://webtranspose.com/)