
MPI-Webcrawlling-Tutorium

(Work in Progress) This Git repository contains all materials for the web crawling tutorial held at the Max Planck Institute for Innovation and Competition.

Prerequisites

To participate in the workshop, please make sure your machine meets all prerequisites.

Workshop Content

  1. Software Setup
    - Why do we use Python and Anaconda?
    - Basics of Python
    - What is a Jupyter Notebook?
  2. Introduction to Web Crawling
    - HTML basics and how websites are linked
    - Types of HTTP requests
  3. Low-Level Web Crawling with requests and Beautiful Soup
    Using a real-world example we will explore:
    - How to fetch a website with Python
    - How to extract information from a website
    - How to automatically walk through a website
  4. High-Level Web Crawling with Scrapy
    Now we will transfer the low-level code from the previous section into a simpler structure using the package Scrapy.
  5. Discussion and Questions

Further Readings

In this section you will find a list of helpful resources for writing more sophisticated web crawlers.

  • General Python Guide Ref
  • Web Crawling Best Practices Ref1, Ref2
  • Multithreading in Python Ref
  • Natural Language Processing Ref
  • Python Data Science Handbook: Essential Tools for working with Data Ref

Contact
In case you have any further topics to discuss, feel free to get in touch with me via LinkedIn.

About

Material for a single-day web crawling workshop in Python
