# Crawling tutorial

A web scraper script is analogous to a spider that crawls a website's graph of URLs.
In this tutorial, we'll look at some examples of Python libraries for downloading and understanding website structure.
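
To make the "graph of URLs" idea concrete, here is a minimal sketch of a breadth-first crawl over a toy, in-memory link graph. The `LINKS` dictionary and the `crawl_order` function are hypothetical names made up for this illustration; the paths are borrowed from the test site used in this notebook, but the link structure is assumed, and this is not how `BasicCrawler` is implemented internally.

```python
from collections import deque

# Hypothetical in-memory "website": each path maps to the paths it links to.
LINKS = {
    "/": ["/Math/", "/Science/"],
    "/Math/": ["/", "/Math/Numbers/"],
    "/Science/": ["/", "/Science/Biology/"],
    "/Math/Numbers/": ["/Math/"],
    "/Science/Biology/": ["/Science/"],
}


def crawl_order(start, links, limit=1000):
    """Return pages in breadth-first order, visiting each page at most once."""
    seen = {start}          # pages already queued, so cycles are not followed twice
    queue = deque([start])  # pages waiting to be visited
    order = []
    while queue and len(order) < limit:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(crawl_order("/", LINKS))
```

The `seen` set is what keeps the spider from looping forever on pages that link back to their parent, and `limit` caps the number of pages visited, in the same spirit as the `limit` argument passed to `crawl` in the cells that follow.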

For optimal viewing of this notebook, you should enable the custom "wide format":
`cat wide_notebooks.css >> ../../venv/lib/python3.X/site-packages/notebook/static/custom/custom.css` (replace X with 4/5/6 depending on your version of Python)

In [1]:
import json
from basiccrawler.crawler import BasicCrawler

crawler = BasicCrawler(
        main_source_domain='http://chef-take-home-test.learningequality.org',
        start_page='http://chef-take-home-test.learningequality.org/')
crawler.CRAWLING_STAGE_OUTPUT = 'chefdata/trees/takehome_web_resource_tree.json'

channel_tree = crawler.crawl(limit=1000)
crawler.print_tree(channel_tree, print_depth=1)







################################################################################
# CRAWLER RECOMMENDATIONS BASED ON URLS ENCOUNTERED:
################################################################################

1. These URLs are very common and look like global navigation links:

2. These are common path fragments found in URLs paths, so could correspond to site struture:
  -  7 urls on site start with  /Math
  -  7 urls on site start with  /Science
  -  1 urls on site start with  /


################################################################################



  - path: /  (PageWebResource) 
    children:
     - path: /Math/  (PageWebResource) 
       children counts: {'PageWebResource': 6}
     - path: /Science/  (PageWebResource) 
       children counts: {'PageWebResource': 6}


In [2]:
crawler.print_tree(channel_tree, print_depth=2)


  - path: /  (PageWebResource) 
    children:
     - path: /Math/  (PageWebResource) 
       children:
        - path: /Math/Numbers/  (PageWebResource) 
          children counts: {'PageWebResource': 2}
        - path: /Math/Functions/  (PageWebResource) 
          children counts: {'PageWebResource': 1}
        - path: /Math/Vectors_overview/  (PageWebResource) 
     - path: /Science/  (PageWebResource) 
       children:
        - path: /Science/Biology/  (PageWebResource) 
          children counts: {'PageWebResource': 1}
        - path: /Science/Chemistry/  (PageWebResource) 
          children counts: {'PageWebResource': 1}
        - path: /Science/Physics/  (PageWebResource) 
          children counts: {'PageWebResource': 1}


In [3]:
print(json.dumps(channel_tree, indent=2))

{
  "kind": "PageWebResource",
  "url": "http://chef-take-home-test.learningequality.org/",
  "children": [
    {
      "kind": "PageWebResource",
      "url": "http://chef-take-home-test.learningequality.org/Math/",
      "children": [
        {
          "kind": "PageWebResource",
          "url": "http://chef-take-home-test.learningequality.org/Math/Numbers/",
          "children": [
            {
              "kind": "PageWebResource",
              "url": "http://chef-take-home-test.learningequality.org/Math/Numbers/Numbers_Overview/",
              "children": []
            },
            {
              "kind": "PageWebResource",
              "url": "http://chef-take-home-test.learningequality.org/Math/Numbers/Classifying_numbers/",
              "children": []
            }
          ]
        },
        {
          "kind": "PageWebResource",
          "url": "http://chef-take-home-test.learningequality.org/Math/Functions/",
          "children": [
            {
            

In [1]:
from basiccrawler.crawler import BasicCrawler

skills_crawler = BasicCrawler(
        main_source_domain='https://applieddigitalskills.withgoogle.com',
        start_page='https://applieddigitalskills.withgoogle.com/en/curriculum')
skills_crawler.CRAWLING_STAGE_OUTPUT = 'chefdata/trees/skills_web_resource_tree.json'

# Run the crawl; uncomment the print_tree line below to display the resulting tree
skills_channel_tree = skills_crawler.crawl(limit=100)
# skills_crawler.print_tree(skills_channel_tree, print_depth=2)

Starting new HTTPS connection (1): applieddigitalskills.withgoogle.com
https://applieddigitalskills.withgoogle.com:443 "HEAD /en/curriculum HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/en/curriculum title:Curriculum - Applied Digital Skills from Google
in on_page https://applieddigitalskills.withgoogle.com/en/curriculum
a with no nohref found <a class="mdl-navigation__link openmenu">Menu <i class="material-icons">keyboard_arrow_down</i></a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /apps HTTP/1.1" 302 0
https://applieddigitalskills.withgoogle.com:443 "HEAD /en/apps HTTP/1.1" 200 0
https://applieddigitalskills.withgoogle.com:443 "GET /apps HTTP/1.1" 302 0
Downloaded page https://applieddigitalskills.withgoogle.com/apps title:Free Technology Curriculum from Google - Applied Digital Skills
in on_page https://applieddigitalskills.withgoogle.com/en/apps
a with no nohref found <a class="mdl-navigation__link openmenu">Menu <i class="material-icons">k

https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/research-and-develop-a-topic/overview.html HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/overview.html title:Research and Develop a Topic - Applied Digital Skills from Google
in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/overview.html
a with no nohref found <a class="mdl-navigation__link openmenu">Menu <i class="material-icons">keyboard_arrow_down</i></a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/technology-ethics-and-security/overview.html HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/technology-ethics-and-security/overview.html title:Technology, Ethics, and Security - Applied Digital Skills from Google
in on_page https://applieddigitalskills.withgoogle.com/applied-

https://grow.google:443 "GET / HTTP/1.1" 200 13476
Downloaded page https://applieddigitalskills.withgoogle.com/footer title:Grow with Google - Learn Digital Skills, Prepare for Jobs, Grow Your Business
in on_page https://grow.google/
https://applieddigitalskills.withgoogle.com:443 "HEAD /en/apps HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/en/apps title:Free Technology Curriculum from Google - Applied Digital Skills
in on_page https://applieddigitalskills.withgoogle.com/en/apps
a with no nohref found <a class="mdl-navigation__link openmenu">Menu <i class="material-icons">keyboard_arrow_down</i></a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /en/high-school HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/en/high-school title:Learn Digital Skills to Solve Everyday Tasks - Applied Digital Skills
in on_page https://applieddigitalskills.withgoogle.com/en/high-school
a with no nohref found <a class="mdl-navigation__link op

Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event-college-and-continuing-education/course-introduction/course-introduction.html title:Course Introduction
in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event-college-and-continuing-education/course-introduction/course-introduction.html
a with no nohref found <a class="mdl-navigation__link curriculum-nav">
            
            Curriculum
            
            <i class="material-icons-extended">keyboard_arrow_down</i>
</a>
a with no nohref found <a class="mobile-instructions subheader" id="back-to-units">
<i class="material-icons-extended">navigate_before</i>
      
      Back to Units
      
    </a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/plan-and-budget-college-and-continuing-education/make-a-long-term-spending-decision/budget-to-make-good-financial-decisions.html HTTP/1.1" 200 0
Downloaded page htt

Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/if-then-adventure-stories/create-an-ifthen-adventure/introduction-to-ifthen-adventure-stories.html title:Introduction to If/Then Adventure Stories
in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/if-then-adventure-stories/create-an-ifthen-adventure/introduction-to-ifthen-adventure-stories.html
a with no nohref found <a class="mdl-navigation__link curriculum-nav">
            
            Curriculum
            
            <i class="material-icons-extended">keyboard_arrow_down</i>
</a>
a with no nohref found <a class="mobile-instructions subheader" id="back-to-units">
<i class="material-icons-extended">navigate_before</i>
      
      Back to Units
      
    </a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/if-then-adventure-stories/create-an-ifthen-adventure/set-up-your-first-slides.html HTTP/1.1" 200 0
Downloaded page https://appli

in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/test-for-credibility/select-a-topic-and-write-your-fake-article.html
a with no nohref found <a class="mdl-navigation__link curriculum-nav">
            
            Curriculum
            
            <i class="material-icons-extended">keyboard_arrow_down</i>
</a>
a with no nohref found <a class="mobile-instructions subheader" id="back-to-units">
<i class="material-icons-extended">navigate_before</i>
      
      Back to Units
      
    </a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/research-and-develop-a-topic/test-for-credibility/share-and-evaluate-articles.html HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/test-for-credibility/share-and-evaluate-articles.html title:Share and Evaluate Articles
in on_page https://applieddigitalskills.withgoogle.com/appli

https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/research-and-develop-a-topic/explore-a-topic-with-research-and-collaboration/reflection.html HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/explore-a-topic-with-research-and-collaboration/reflection.html title:Reflection
in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/explore-a-topic-with-research-and-collaboration/reflection.html
a with no nohref found <a class="mdl-navigation__link curriculum-nav">
            
            Curriculum
            
            <i class="material-icons-extended">keyboard_arrow_down</i>
</a>
a with no nohref found <a class="mobile-instructions subheader" id="back-to-units">
<i class="material-icons-extended">navigate_before</i>
      
      Back to Units
      
    </a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied

Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/wrap-up/research-and-develop-a-topic-reflection.html title:Research and Develop a Topic Reflection
in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/wrap-up/research-and-develop-a-topic-reflection.html
a with no nohref found <a class="mdl-navigation__link curriculum-nav">
            
            Curriculum
            
            <i class="material-icons-extended">keyboard_arrow_down</i>
</a>
a with no nohref found <a class="mobile-instructions subheader" id="back-to-units">
<i class="material-icons-extended">navigate_before</i>
      
      Back to Units
      
    </a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/research-and-develop-a-topic/wrap-up/research-and-develop-a-topic-extensions.html HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/applie

https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/plan-an-event/select-and-research-an-event/organize-your-event-plan.html HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/select-and-research-an-event/organize-your-event-plan.html title:Organize Your Event Plan
in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/select-and-research-an-event/organize-your-event-plan.html
a with no nohref found <a class="mdl-navigation__link curriculum-nav">
            
            Curriculum
            
            <i class="material-icons-extended">keyboard_arrow_down</i>
</a>
a with no nohref found <a class="mobile-instructions subheader" id="back-to-units">
<i class="material-icons-extended">navigate_before</i>
      
      Back to Units
      
    </a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/plan-an-event/select-and-re

Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/plan-tasks-for-your-event/activity-reflection-for-plan-tasks-for-you-event.html title:Activity Reflection for Plan Tasks for you Event
in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/plan-tasks-for-your-event/activity-reflection-for-plan-tasks-for-you-event.html
a with no nohref found <a class="mdl-navigation__link curriculum-nav">
            
            Curriculum
            
            <i class="material-icons-extended">keyboard_arrow_down</i>
</a>
a with no nohref found <a class="mobile-instructions subheader" id="back-to-units">
<i class="material-icons-extended">navigate_before</i>
      
      Back to Units
      
    </a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/plan-an-event/communicate-with-gmail-and-calendar/introduction-to-event-communication.html HTTP/1.1" 200 0
Downloaded page https://

in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/create-a-logo/add-and-edit-images.html
a with no nohref found <a class="mdl-navigation__link curriculum-nav">
            
            Curriculum
            
            <i class="material-icons-extended">keyboard_arrow_down</i>
</a>
a with no nohref found <a class="mobile-instructions subheader" id="back-to-units">
<i class="material-icons-extended">navigate_before</i>
      
      Back to Units
      
    </a>
https://applieddigitalskills.withgoogle.com:443 "HEAD /applied-digital-skills/en/plan-an-event/create-a-logo/add-and-format-text.html HTTP/1.1" 200 0
Downloaded page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/create-a-logo/add-and-format-text.html title:Add and Format Text
in on_page https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/create-a-logo/add-and-format-text.html
a with no nohref found <a class="mdl-na

Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/if-then-adventure-stories/create-an-ifthen-adventure/introduction-to-ifthen-adventure-stories.htmladding to global_nav_nodes
Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/if-then-adventure-stories/create-an-ifthen-adventure/set-up-your-first-slides.htmladding to global_nav_nodes
Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/if-then-adventure-stories/create-an-ifthen-adventure/create-choices-and-links.htmladding to global_nav_nodes
Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/if-then-adventure-stories/create-an-ifthen-adventure/connect-and-create.htmladding to global_nav_nodes
Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/if-then-adventure-stories/cre

Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/plan-tasks-for-your-event/introduction-to-spreadsheets.htmladding to global_nav_nodes
Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/plan-tasks-for-your-event/understand-spreadsheet-vocabulary.htmladding to global_nav_nodes
Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/plan-tasks-for-your-event/add-items-to-your-spreadsheet.htmladding to global_nav_nodes
Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/plan-tasks-for-your-event/add-columns-for-task-owners.htmladding to global_nav_nodes
Found candidate for global nav url=https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/plan-tasks-for-your-event/complete-task-list-and-get-fee





################################################################################
# CRAWLER RECOMMENDATIONS BASED ON URLS ENCOUNTERED:
################################################################################

1. These URLs are very common and look like global navigation links:
  -  https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/if-then-adventure-stories/brainstorm-story-ideas/if-then-adventure-stories-introduction.html
  -  https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/research-and-develop-a-topic/test-for-credibility/unit-introduction.html
  -  https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-an-event/select-and-research-an-event/unit-introduction.html
  -  https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/guide-to-an-area/create-an-area-guide/organize-data-to-create-an-area-guide.html
  -  https://applieddigitalskills.withgoogle.com/applied-digital-skills/en/plan-and-budget/make-a-

In [None]:
# le_crawler.print_tree(le_channel_tree, print_depth=5)

In [None]:
# le_crawler.infer_tree_structure(le_channel_tree)


In [None]:
import re
from basiccrawler.crawler import BasicCrawler

ict_crawler = BasicCrawler(
        main_source_domain='https://ict-essentials-for-teachers.moodlecloud.com',
        start_page='https://ict-essentials-for-teachers.moodlecloud.com/')
# Raw strings keep the regex escapes (\? and \.) from being read as string escapes
ict_crawler.IGNORE_URL_PATTERNS.append(re.compile(r'.*[\?&]lang=.*'))
ict_crawler.IGNORE_URL_PATTERNS.append(re.compile(r'.*calendar/set\.php.*'))
ict_crawler.IGNORE_URL_PATTERNS.append(re.compile(r'.*calendar/view\.php.*'))


# ict_channel_tree = ict_crawler.crawl(limit=1000)
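
Before starting a long crawl, it can help to check which URLs the ignore patterns would actually exclude. The sketch below tests the three patterns against a few made-up Moodle-style URLs using only the standard `re` module. The `is_ignored` helper and the example URLs are hypothetical, and the use of `match` here is an assumption about how the crawler applies `IGNORE_URL_PATTERNS`.

```python
import re

# The same three patterns appended to IGNORE_URL_PATTERNS above.
patterns = [
    re.compile(r'.*[\?&]lang=.*'),
    re.compile(r'.*calendar/set\.php.*'),
    re.compile(r'.*calendar/view\.php.*'),
]


def is_ignored(url, patterns):
    """Return True if any pattern matches the URL (assumed matching rule)."""
    return any(p.match(url) for p in patterns)


base = 'https://ict-essentials-for-teachers.moodlecloud.com'
# Hypothetical URLs, chosen only to exercise each pattern.
examples = [
    base + '/course/view.php?id=2',
    base + '/course/view.php?id=2&lang=fr',
    base + '/calendar/view.php?view=month',
]

for url in examples:
    print(is_ignored(url, patterns), url)
```

Because each pattern starts with `.*`, it matches the query string or path fragment wherever it appears in the URL, so language-switcher links and calendar pages are skipped regardless of what precedes them.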


In [None]:
# ict_crawler.print_tree(ict_channel_tree, print_depth=2)

In [None]:
# ict_crawler.infer_tree_structure(ict_channel_tree)
