jieren123/Web_Crawler_Scraper

This distributed web scraping project is written in Python 2.7. It is highly flexible for managing crawls and extracting structured data from webpages.

Architecture

(Architecture diagram)

Main Features

  • Automation: Auto-login makes it easier for the spider to crawl websites that require authentication (see the login sketch after this list).

    • Automatically finds the login form fields and handles logins that require a dynamic CSRF token.
    • Provide a single set of account credentials; cookies are persisted for the duration of the session and can be reused in later requests.
    • Builds the form request and its arguments for your own spider to submit, without issuing extra HTTP requests or adding dependencies.
  • Dynamics:

    • Dynamically visits ASP.NET pages and scrapes their content; results selected in AJAX tables can be visualized and monitored.
    • Follows AJAX pagination to the next page automatically until the last page is reached (see the Selenium sketch after this list).
  • Storage:

    • Stores crawled data in a pandas DataFrame, organized by classification, and creates the DataFrame automatically the first time the spider runs (see the DataFrame sketch under Main files).
    • Optionally saves large webpages to disk using a MapReduce-style method.
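
As a rough illustration of the auto-login flow, here is a minimal sketch using only BeautifulSoup and the Python 2.7 standard library: fetch the login page, extract the dynamic CSRF token from the form, and submit the credentials through an opener that persists cookies for the rest of the session. The URL and field names (LOGIN_URL, csrf_token, username, password) are hypothetical placeholders, not identifiers from this repository.

```python
# Hypothetical auto-login sketch (Python 2.7); URLs and field names are made up.
import cookielib
import urllib
import urllib2

from bs4 import BeautifulSoup

LOGIN_URL = "https://example.com/login"  # hypothetical login endpoint


def auto_login(username, password):
    # A CookieJar-backed opener keeps session cookies across requests.
    cookies = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))

    # Load the login form and pull out the per-request CSRF token.
    html = opener.open(LOGIN_URL).read()
    soup = BeautifulSoup(html, "html.parser")
    token = soup.find("input", {"name": "csrf_token"})["value"]

    # Submit the credentials together with the token.
    payload = urllib.urlencode({
        "username": username,
        "password": password,
        "csrf_token": token,
    })
    opener.open(LOGIN_URL, payload)
    return opener  # reuse this opener so later requests stay logged in


opener = auto_login("user", "secret")
page = opener.open("https://example.com/account").read()  # cookies sent automatically
```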

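The AJAX pagination loop can be sketched with Selenium in a similar spirit: scrape the rows the current AJAX update rendered, click the "next" control, and stop once it disappears. The page URL, table selector, and button id below are invented for illustration, and the element-lookup calls use the older Selenium API that matches the project's Python 2.7 era.

```python
# Hypothetical AJAX-pagination sketch; the selectors and ids are made up.
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("https://example.com/report.aspx")  # hypothetical ASP.NET page

rows = []
while True:
    # Collect whatever the current AJAX update rendered into the table.
    for tr in driver.find_elements_by_css_selector("table#results tr"):
        rows.append(tr.text)
    try:
        next_button = driver.find_element_by_id("next-page")  # hypothetical id
    except NoSuchElementException:
        break  # no "next" control left, so this was the last page
    next_button.click()
    time.sleep(2)  # crude wait for the AJAX refresh; an explicit wait is safer

driver.quit()
```
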
Main differences from Scrapy

  • Easier to maintain and extend
  • Fewer memory leaks

Requirements

  • BeautifulSoup4
  • Python 2.7
  • Selenium
  • pandas
  • numpy
  • re (Python standard library)

Main files

  • auto-login.py: logs in and persists session cookies
  • get-querystring.py: gets query-string parameters; handles AJAX pagination
  • url-generator: generates URL addresses
  • single-page.py: scrapes items on a single page and saves them to a local CSV file
  • multiple-files.py: manipulates multiple large database files
  • file-clear.py: deals with missing values; converts unstructured data into structured data (see the sketch below)
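
Taken together, the storage and cleaning steps above (single-page.py, file-clear.py) amount to something like the following sketch: collect scraped records into a pandas DataFrame, coerce and fill missing values, and write a local CSV file. The column names, sample values, and mean-fill strategy are illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical storage/cleaning sketch; columns and values are made up.
import numpy as np
import pandas as pd

records = [
    {"title": "Item A", "price": "12.50"},
    {"title": "Item B", "price": np.nan},  # a missing value from the page
]

# The DataFrame is built the first time the spider runs.
df = pd.DataFrame(records)

# Convert scraped strings to numbers; anything unparseable becomes NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# One simple way to deal with missing values: fill with the column mean.
df["price"] = df["price"].fillna(df["price"].mean())

df.to_csv("items.csv", index=False, encoding="utf-8")
```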
