(Work in Progress) This Git repository contains all materials for the Webcrawling Tutorium held at the Max-Planck-Institute for Innovation and Competition.
To participate in the workshop, please ensure your machine meets all of the following prerequisites.
- Anaconda: You should have Anaconda for Python 3.6 installed.
Installation Instructions: https://conda.io/docs/user-guide/install/windows.html
Download Page: https://www.anaconda.com/download/#windows
- PyCharm: Install PyCharm Community Edition or, as a PhD student, the Professional Edition, which is free for students.
PyCharm Download: https://www.jetbrains.com/pycharm/
Student Registration: https://www.jetbrains.com/student/
- Python Packages: Install the following packages using Anaconda (Instructions): pandas, requests, beautifulsoup4, scrapy
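A quick way to confirm that the packages listed above are available in the active environment is to look them up without importing them. The sketch below is only a convenience check and not part of the official setup; the helper name `check_packages` is made up for illustration. Note that the import name can differ from the install name: `beautifulsoup4` is imported as `bs4`.

```python
import importlib.util

# Maps the install name (as used by conda/pip) to the name used in `import`.
# beautifulsoup4 is the only one of the four whose import name differs.
REQUIRED = {
    "pandas": "pandas",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "scrapy": "scrapy",
}


def check_packages(required=REQUIRED):
    """Return {install_name: True/False} depending on whether it can be found."""
    return {
        install_name: importlib.util.find_spec(import_name) is not None
        for install_name, import_name in required.items()
    }


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name:15s} {'OK' if found else 'MISSING'}")
```

If a package is reported as missing, install it into the same Anaconda environment that PyCharm uses as its interpreter and run the check again.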
- Software Setup
-Why do we use Python and Anaconda?
-Basics of Python
-What is a Jupyter Notebook?
- Introduction to web crawling
-HTML basics and how websites are linked
-Types of HTTP requests
- Low Level Web Crawling with requests and beautifulsoup
Using a real-world example, we will explore:
-How to get a website with Python
-How to extract information from a website
-How to automatically walk through a website
- High Level Web Crawling with Scrapy
Now we will transfer the low-level code from the previous section into a simpler structure using the Scrapy package.
- Discussion and Questions
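The three low-level steps from the agenda above (get a page, extract information, walk through a site) can be sketched with `requests` and `beautifulsoup4` as follows. This is a minimal illustration rather than the workshop's actual code: the function names, the `max_pages` limit, the one-second delay and the example URL are all assumptions. A real crawler should also respect the site's robots.txt and terms of use.

```python
import time
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Step 1: get a website. Raises an exception for HTTP error codes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_title_and_links(html, base_url):
    """Step 2: extract information, here the page title and all link targets."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Resolve relative hrefs such as "/about" against the page's own URL
    # and drop "#fragment" parts so the same page is not listed twice.
    links = [
        urldefrag(urljoin(base_url, a["href"])).url
        for a in soup.find_all("a", href=True)
    ]
    return title, links


def crawl(start_url, fetch=fetch_html, max_pages=10, delay=1.0):
    """Step 3: walk through a website breadth-first, staying on one host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    titles = {}
    while queue and len(titles) < max_pages:
        if titles:
            time.sleep(delay)  # be polite: pause between requests
        url = queue.popleft()
        title, links = extract_title_and_links(fetch(url), url)
        titles[url] = title
        for link in links:
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


if __name__ == "__main__":
    # Placeholder URL; replace it with a site you are allowed to crawl.
    for url, title in crawl("https://example.com/", max_pages=3).items():
        print(url, "->", title)
```

Passing the `fetch` function as a parameter keeps the parsing and link-following logic testable without network access, for example by handing in a dictionary lookup over saved HTML pages.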
This section lists helpful resources for writing more sophisticated web crawlers.
- General Python Guide Ref
- Web Crawling Best Practices Ref1, Ref2
- Multithreading in Python Ref
- Natural Language Processing Ref
- Python Data Science Handbook: Essential Tools for Working with Data Ref
Contact
If you have any further topics to discuss, feel free to get in touch with me via LinkedIn.