IssuuPortfolioCrawler

Find architecture portfolios on Issuu

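Issuu crawler
* Finds the most-liked architecture portfolios
* Crawls by following the recommended publications that Issuu pages load automatically
* Can download publications

Libraries used: urllib + re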
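
As a rough illustration of the urllib + re approach, the sketch below pulls publication links out of a page's HTML. It is not the code from this repository: it assumes Python 3, assumes publication URLs have the shape `issuu.com/<user>/docs/<name>`, and the names `PUB_LINK`, `fetch_html` and `extract_publication_links` are made up for this example. The markup Issuu actually serves may differ.

```python
import re
import urllib.request

# Assumed URL shape: issuu.com/<user>/docs/<name>
PUB_LINK = re.compile(r'issuu\.com/([\w.-]+)/docs/([\w.-]+)')


def fetch_html(url):
    """Download a page and return its text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", "replace")


def extract_publication_links(html):
    """Return unique publication URLs in order of first appearance."""
    links = []
    for user, name in PUB_LINK.findall(html):
        url = "https://issuu.com/{}/docs/{}".format(user, name)
        if url not in links:
            links.append(url)
    return links
```

Calling `extract_publication_links(fetch_html(some_url))` would then give the candidate portfolios to visit next.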

Folder 4000: results of crawling 4000 portfolios, sorted by likes.
Folder 10000: results of crawling 10000 portfolios, sorted by likes. This crawl ran for 6 hours on my laptop.
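
Sorting by likes can be sketched as below. This assumes the crawl results are held in a dict that maps each portfolio URL to its like count; that layout and the name `sort_by_likes` are assumptions for illustration, not taken from the scripts.

```python
def sort_by_likes(likes_by_url):
    """Return (url, likes) pairs, most-liked first."""
    return sorted(likes_by_url.items(), key=lambda item: item[1], reverse=True)
```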

To use this crawler

1. Start the crawler.

  1. Change directory to this project.
  2. Open crawlForMostLikedPortf.py.
  3. Set your starting portfolios. Some URLs are already added to myqueue; you can add your favorite portfolios to this queue.
  4. Change the saving directory on line 45.
  5. Change the number of portfolios to crawl on line 26.
  6. Run the script.
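
The idea behind the starting queue and the portfolio limit can be sketched as a breadth-first crawl. This is a minimal sketch under assumptions, not the contents of crawlForMostLikedPortf.py: `crawl` and `get_info` are hypothetical names, and `get_info(url)` is assumed to return the like count of a portfolio together with the URLs recommended on its page.

```python
from collections import deque


def crawl(seed_urls, limit, get_info):
    """Visit up to `limit` portfolios, starting from `seed_urls`.

    get_info(url) is a hypothetical callable returning
    (likes, recommended_urls) for one portfolio page.
    """
    queue = deque(seed_urls)
    queued = set(seed_urls)   # everything ever queued, to avoid revisits
    likes_by_url = {}
    while queue and len(likes_by_url) < limit:
        url = queue.popleft()
        likes, recommended = get_info(url)
        likes_by_url[url] = likes
        for next_url in recommended:
            if next_url not in queued:
                queued.add(next_url)
                queue.append(next_url)
    # The leftover queue is what a later run could continue from.
    return likes_by_url, list(queue)
```

Passing `get_info` in as a parameter keeps the traversal separate from the page parsing, so the loop can be tried out on fake data before it touches the network.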

2. Restart the crawler.

If the last crawl finished, the script should have saved two files, dictPub.csv and dictQueue.csv.

  1. Change directory to this project.
  2. Open LoadAndContinueCrawl.py.
  3. Change the saving directory on line 40.
  4. Change the number of portfolios to crawl on line 20.
  5. Run the script.
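
Saving and reloading crawl state through the two CSV files might look like the sketch below. The column layout is an assumption (dictPub.csv holding `url,likes` rows and dictQueue.csv holding one pending URL per row), and `save_state` and `load_state` are hypothetical names; the files written by the real scripts may be laid out differently.

```python
import csv
import os


def save_state(directory, likes_by_url, pending):
    """Write visited portfolios and the pending queue to CSV files."""
    pub_path = os.path.join(directory, "dictPub.csv")
    with open(pub_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(likes_by_url.items())
    queue_path = os.path.join(directory, "dictQueue.csv")
    with open(queue_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([url] for url in pending)


def load_state(directory):
    """Read back what save_state wrote (assumed column layout)."""
    pub_path = os.path.join(directory, "dictPub.csv")
    with open(pub_path, newline="", encoding="utf-8") as f:
        likes_by_url = {row[0]: int(row[1]) for row in csv.reader(f) if row}
    queue_path = os.path.join(directory, "dictQueue.csv")
    with open(queue_path, newline="", encoding="utf-8") as f:
        pending = [row[0] for row in csv.reader(f) if row]
    return likes_by_url, pending
```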

3. Save a portfolio.

  1. Change directory to this project.
  2. Open downloadPublications.py.
  3. Change the saving directory on line 12.
  4. Change the URL of the portfolio you want to download on line 9.
  5. Run the script.
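
The saving step can be sketched as follows. This README does not describe how downloadPublications.py locates the page images of a portfolio, so the sketch assumes a list of page image URLs is already available. `fetch_bytes` and `save_pages` are hypothetical names, and the `page_001.jpg` file naming is an arbitrary choice for the example.

```python
import os
import urllib.request


def fetch_bytes(url):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_pages(page_urls, out_dir, fetch=fetch_bytes):
    """Save each page image into out_dir and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for number, url in enumerate(page_urls, start=1):
        path = os.path.join(out_dir, "page_{:03d}.jpg".format(number))
        with open(path, "wb") as f:
            f.write(fetch(url))
        paths.append(path)
    return paths
```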