MPI-Webcrawlling-Tutorium

(Work in Progress) This Git repository contains all materials for the web crawling tutorial held at the Max Planck Institute for Innovation and Competition.

Prerequisites

To participate in the workshop, please make sure your machine meets all of the following prerequisites.

Anaconda: You should have Anaconda for Python 3.6 installed.
Installation Instructions: https://conda.io/docs/user-guide/install/windows.html
Download Page: https://www.anaconda.com/download/#windows
PyCharm: Install PyCharm Community Edition or, if you are a PhD student, the Professional Edition, which is free for students.
PyCharm Download: https://www.jetbrains.com/pycharm/
Student Registration: https://www.jetbrains.com/student/
Python Packages: Install the following packages using Anaconda: pandas, requests, beautifulsoup4, scrapy
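
To confirm that the packages are available before the workshop, a small check like the following may help. It is only a sketch: the helper name `check_packages` is made up for this example, and the one thing it relies on is that beautifulsoup4 is imported under the module name `bs4`.

```python
from importlib.util import find_spec

# Map each package name from the list above to the module name used in
# `import` statements (beautifulsoup4 is imported as `bs4`).
REQUIRED = {
    "pandas": "pandas",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "scrapy": "scrapy",
}


def check_packages(required=REQUIRED):
    """Return {package name: True/False} depending on whether it can be imported."""
    return {pkg: find_spec(module) is not None for pkg, module in required.items()}


if __name__ == "__main__":
    for pkg, ok in check_packages().items():
        print(f"{pkg:15} {'OK' if ok else 'MISSING'}")
```

Run the script inside the Anaconda environment you plan to use for the workshop; any package reported as MISSING still needs to be installed.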

Workshop Content

Software Setup
-Why do we use Python and Anaconda?
-Python basics
-What is a Jupyter Notebook?
Introduction to web crawling
-HTML basics and how websites are linked
-Types of HTTP requests
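
As a rough illustration of the two request types a crawler meets most often, the sketch below prepares a GET and a POST request with the requests package without sending them. The URLs, parameter names, and helper functions are placeholders chosen for this example, not part of the workshop material.

```python
import requests


def build_get(url, params):
    """Prepare (but do not send) a GET request; parameters go into the URL."""
    return requests.Request("GET", url, params=params).prepare()


def build_post(url, form):
    """Prepare (but do not send) a POST request; form data goes into the body."""
    return requests.Request("POST", url, data=form).prepare()


if __name__ == "__main__":
    get_req = build_get("https://example.com/search", {"q": "patents"})
    print(get_req.method, get_req.url)

    post_req = build_post("https://example.com/login", {"user": "alice"})
    print(post_req.method, post_req.url, post_req.body)
```

The visible difference is where the data travels: a GET request encodes it in the query string of the URL, while a POST request carries it in the request body.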
Low-Level Web Crawling with requests and BeautifulSoup
Using a real-world example, we will explore:
-How to get a website with Python
-How to extract information from a website
-How to automatically walk through a website
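
The three steps above could look roughly like this with requests and BeautifulSoup. This is a minimal sketch, not the workshop's solution: the function names, the `max_pages` limit, and the same-host rule are assumptions made for the example.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Step 1: download a page and return its HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract(html, base_url):
    """Step 2: return the page title and all links as absolute URLs."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return title, links


def crawl(start_url, max_pages=5, fetch_page=fetch):
    """Step 3: walk breadth-first over pages on the same host as start_url."""
    host = urlparse(start_url).netloc
    queue, seen, titles = [start_url], {start_url}, {}
    while queue and len(titles) < max_pages:
        url = queue.pop(0)
        title, links = extract(fetch_page(url), url)
        titles[url] = title
        for link in links:
            # Stay on the starting host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles
```

Calling `crawl("https://example.com/")` returns a dictionary that maps each visited URL to its page title. The `fetch_page` argument exists so the walk can be tried on canned HTML without network access; a real crawl should also pause between requests and respect the site's robots.txt.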
High-Level Web Crawling with Scrapy
Next, we will port the low-level code from the previous section into a simpler structure using the Scrapy package.
Discussion and Questions

Repository Contents

Solutions
img
.gitignore
01_Introduction.ipynb
02_Low_Level_Crawling_Exercise.ipynb
03_High_Level_Crawling_Exercise.ipynb
04_QandA.ipynb
LICENSE
README.md

mpSchrader/MPI-Webcrawlling-Tutorium
