Mini Crawler

Mini Crawler is a small web crawler written in Python 3. It is easy to use and extend, which makes it well suited for study. It crawls every item on a page, saves the items to a local CSV file, and then moves on to the next page.

Main Features

Saves every item on a page to a CSV file, then moves on to the next page
Can be paused and resumed
Supports most browser kernels (PhantomJS, Firefox, etc.)
Can use Requests instead of a browser for static pages, which is faster
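
The save-then-advance and pause/resume features can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the pickle checkpoint file, and the in-memory `pages` list are all assumptions made for the example.

```python
import csv
import os
import pickle


def load_index(index_path):
    """Return the index of the next page to crawl, or 0 if no checkpoint exists."""
    if not os.path.exists(index_path):
        return 0
    with open(index_path, "rb") as f:
        return pickle.load(f)


def save_index(index_path, page_index):
    """Persist the index of the next page to crawl."""
    with open(index_path, "wb") as f:
        pickle.dump(page_index, f)


def crawl(pages, csv_path, index_path):
    """Append each page's rows to csv_path, checkpointing after every page.

    `pages` is a hypothetical stand-in for real pages: a list of pages,
    each page being a list of rows.
    """
    start = load_index(index_path)
    for page_index in range(start, len(pages)):
        with open(csv_path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(pages[page_index])
        # Checkpoint only after the page is written, so a rerun resumes here.
        save_index(index_path, page_index + 1)
```

Because the checkpoint is written after each page, a crawl that stops partway can be rerun with the same paths and will continue from the first unsaved page.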

Used Packages

requests
selenium
time
csv
os
pickle
datetime
shutil
numpy
pandas
BeautifulSoup4
re
math

Main difference from Scrapy

Easy to maintain
Simple enough to use and extend
Fewer memory leaks

Architecture overview

Main files

minicrawler/spider/basespider.py

crawl
getTotalKeys (implemented by subclasses)
getCurrentPage (implemented by subclasses)
gotoNextPage (implemented by subclasses, optional)
saveCurrentIndex
saveContentToWorkbook
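
The method list above suggests a template-method design: the base class drives the crawl loop and subclasses fill in the page-specific hooks. The sketch below shows one way that could look. Only the method names come from this README; the signatures, return types, the meaning of `getTotalKeys` (taken here to be the CSV column names), and the `ListSpider` subclass are assumptions.

```python
import csv
import os


class BaseSpider:
    """Template-method base class; hook signatures here are assumptions."""

    def __init__(self, output_path):
        self.output_path = output_path
        self.current_index = 0

    def getTotalKeys(self):
        """Return the CSV column names (assumed meaning). Subclass hook."""
        raise NotImplementedError

    def getCurrentPage(self):
        """Return the current page's items as a list of dicts. Subclass hook."""
        raise NotImplementedError

    def gotoNextPage(self):
        """Optional hook: advance to the next page, return False if none."""
        return False

    def saveCurrentIndex(self):
        # Kept in memory here; a real crawler would persist it to disk.
        self.current_index += 1

    def saveContentToWorkbook(self, rows):
        path = self.output_path
        is_new = not os.path.exists(path) or os.path.getsize(path) == 0
        with open(path, "a", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=self.getTotalKeys())
            if is_new:
                writer.writeheader()  # header only once per file
            writer.writerows(rows)

    def crawl(self):
        while True:
            self.saveContentToWorkbook(self.getCurrentPage())
            self.saveCurrentIndex()
            if not self.gotoNextPage():
                break


class ListSpider(BaseSpider):
    """Hypothetical subclass that 'crawls' an in-memory list of pages."""

    def __init__(self, output_path, pages):
        super().__init__(output_path)
        self.pages = pages

    def getTotalKeys(self):
        return ["name", "rank"]

    def getCurrentPage(self):
        return self.pages[self.current_index]

    def gotoNextPage(self):
        return self.current_index < len(self.pages)
```

A single-page spider can leave `gotoNextPage` untouched, which is consistent with the hook being marked optional.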

minicrawler/provider/requestor.py

minicrawler/provider/webbrowser.py

load
getContent (implemented by each kernel)
navigate (implemented by each kernel)
quit
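
The provider interface can be pictured as a small base class with kernel-specific overrides. In this sketch only the four method names are taken from the list above. Their signatures, the assumption that `load` means "navigate, then return the content", and the `InMemoryKernel` class are illustrative guesses rather than the project's real code.

```python
class WebBrowser:
    """Kernel-independent wrapper; subclasses supply navigate/getContent."""

    def load(self, url):
        # Assumed behaviour: go to the URL and hand back the page content.
        self.navigate(url)
        return self.getContent()

    def navigate(self, url):
        raise NotImplementedError

    def getContent(self):
        raise NotImplementedError

    def quit(self):
        # Kernels that hold a browser process would release it here.
        pass


class InMemoryKernel(WebBrowser):
    """Hypothetical kernel serving pages from a dict, useful for tests."""

    def __init__(self, pages):
        self.pages = pages
        self.current_url = None

    def navigate(self, url):
        self.current_url = url

    def getContent(self):
        return self.pages[self.current_url]
```

A Selenium-backed or Requests-backed kernel would override the same two methods, so spiders would not need to know which one is in use.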

Three different examples

A. Wikipedia (static page)

https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_the_United_States_by_endowment
Static page
Only one page

minicrawler/spider/richspider.py
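
For a single static page, the work reduces to pulling table rows out of fetched HTML. The project lists BeautifulSoup4 for this; the sketch below swaps in the standard library's `html.parser` so that it runs without extra packages. The `TableRowParser` class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Placeholder HTML, not real data from the Wikipedia page.
sample_html = (
    "<table>"
    "<tr><th>Institution</th><th>Endowment</th></tr>"
    "<tr><td><a href='#'>Example University</a></td><td> 1.0 </td></tr>"
    "</table>"
)
parser = TableRowParser()
parser.feed(sample_html)
rows = parser.rows
```

Each entry in `rows` is a list of cell strings, ready to be passed to `csv.writer`.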

B. U.S. News (dynamic page with URL pagination)

http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities
Dynamic page
The next page is reached by changing the URL

minicrawler/spider/nuspider.py
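
URL-based pagination usually amounts to incrementing a page number in the query string. The sketch below shows the idea with the standard library. The parameter name `page` and the `example.com` URLs are hypothetical, and the real site may encode the page number differently.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def next_page_url(url, param="page"):
    """Return `url` with its page-number query parameter incremented.

    A URL without the parameter is treated as page 1.
    """
    parts = urlparse(url)
    query = parse_qs(parts.query)
    current = int(query.get(param, ["1"])[0])
    query[param] = [str(current + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

For example, `next_page_url("http://example.com/list?page=2&sort=rank")` returns `"http://example.com/list?page=3&sort=rank"`.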

C. Startclass (dynamic page with button pagination)

http://faculty-salaries.startclass.com/
Dynamic page
The next page is reached by clicking the pagination button

minicrawler/spider/salaryspider.py
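
Button pagination can be sketched as a loop that processes the current page and then clicks "next" until the button disappears or is disabled. The function below is an illustration only: it expects a Selenium-style driver object, the selector passed in is hypothetical, and the string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR` constant, written out so that the sketch needs no imports.

```python
def click_through_pages(driver, next_selector, handle_page, max_pages=100):
    """Call handle_page on each page, clicking the next button in between.

    Returns the number of pages handled. `max_pages` guards against
    a next button that never goes away.
    """
    pages_seen = 0
    while pages_seen < max_pages:
        handle_page(driver.page_source)
        pages_seen += 1
        buttons = driver.find_elements("css selector", next_selector)
        if not buttons or not buttons[0].is_enabled():
            break  # no usable next button: this was the last page
        buttons[0].click()
    return pages_seen
```

On a real dynamic page the loop would also need to wait for the new content to render after each click, for instance with Selenium's `WebDriverWait`, before reading `page_source` again.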


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Mini Crawler

Main Features

Used Packages

Main difference from Scrapy

Architecture overview

Main files

Three different examples

About

Releases

Packages

Languages

License

ibio/mini-crawler

Folders and files

Latest commit

History

Repository files navigation

Mini Crawler

Main Features

Used Packages

Main difference from Scrapy

Architecture overview

Main files

Three different examples

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages