GitHub - hks-epod/scrapy-crawl-asp: heavy-duty scraping framework for crawling ASP.net pages with javascript

Heavy-Duty Web Crawling Framework for Scraping Data from ASP.net Pages

Introduction

This project aims to scrape data from all 250+ pages of this webpage. The site is built with ASP.NET, so there is no separate URL for each page. Instead, clicking the next-page link at the bottom triggers a JavaScript __doPostBack function, which takes some visible and some hidden arguments. The goal is to use the Scrapy framework to iterate through each page, scrape the fields I need, and pipe them to a CSV.
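As a rough illustration of the "visible" arguments, a pager link on an ASP.NET page typically carries them in its href as a `__doPostBack('target','argument')` call. The sketch below pulls the two values out with a regular expression. The helper name `parse_postback` and the control name in the sample link are hypothetical and not taken from the target site.

```python
import re

# Matches the two quoted arguments in: javascript:__doPostBack('target','argument')
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack link, or None."""
    match = POSTBACK_RE.search(href)
    return match.groups() if match else None


# Hypothetical pager link; the control name is made up for illustration.
sample_href = "javascript:__doPostBack('ctl00$Main$GridView1','Page$2')"
print(parse_postback(sample_href))
```

The two extracted strings are the values that would later be sent as the event target and event argument of the postback.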

Method

Although Python cannot call the JavaScript function directly, we can pass its arguments as form data using Scrapy's FormRequest class. The goal is to collect the hidden VIEWSTATE and EVENTVALIDATION arguments and pass them, along with the other arguments, to navigate to each subsequent page. The site should then treat the request as coming from a human rather than a bot. If this doesn't work, other options include Selenium or a headless browser.
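The payload-building step can be sketched without Scrapy, using only the standard library. This is a minimal sketch and not the project's actual spider code. It assumes the hidden fields appear as `<input type="hidden">` elements named `__VIEWSTATE` and `__EVENTVALIDATION`, and that the postback expects `__EVENTTARGET` and `__EVENTARGUMENT`. The names `HiddenFieldParser` and `build_postback_formdata`, the sample HTML, and the control name are hypothetical.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            name = attributes.get("name")
            if name:
                self.fields[name] = attributes.get("value") or ""


def build_postback_formdata(html, event_target, event_argument):
    """Merge the page's hidden fields with the postback target and argument."""
    parser = HiddenFieldParser()
    parser.feed(html)
    formdata = dict(parser.fields)
    formdata["__EVENTTARGET"] = event_target
    formdata["__EVENTARGUMENT"] = event_argument
    return formdata


# Made-up page fragment; real VIEWSTATE values are long encoded strings.
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""

formdata = build_postback_formdata(SAMPLE_HTML, "ctl00$Main$GridView1", "Page$2")
print(formdata)
```

In a spider, a dict like this would be passed as the `formdata` argument of `scrapy.FormRequest`. Scrapy's `FormRequest.from_response` can also pre-populate hidden form fields from the response, which may make the manual parsing step unnecessary.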

Files

items.py: Defines an item class that acts as a container for the data fields we need
pipelines.py: Sets up a pipeline that takes item objects and writes them to a CSV
ec_spider.py: Main scraping script. Passes each scraped item, row by row, through the pipeline and into the dataset
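To make the item-to-CSV flow concrete, here is a minimal sketch of a pipeline class that follows Scrapy's pipeline hook names (`open_spider`, `process_item`, `close_spider`). It is not the contents of pipelines.py: the class name `CsvPipeline`, the field names, and the output path are all assumptions, and plain dicts stand in for Scrapy items.

```python
import csv
import os
import tempfile

# Hypothetical field names; the real ones are defined in items.py.
FIELDNAMES = ["name", "location"]


class CsvPipeline:
    """Write each item as one CSV row, mirroring Scrapy's pipeline hooks."""

    def __init__(self, path="output.csv"):
        self.path = path

    def open_spider(self, spider):
        # Open the file once per crawl and write the header row.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=FIELDNAMES)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Missing fields are written as empty strings.
        self.writer.writerow({key: item.get(key, "") for key in FIELDNAMES})
        return item

    def close_spider(self, spider):
        self.file.close()


# Stand-alone demo with a dict in place of a Scrapy item and no real spider.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "items.csv")
    pipeline = CsvPipeline(csv_path)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"name": "Example", "location": "Somewhere"}, spider=None)
    pipeline.close_spider(spider=None)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

print(rows)
```

In an actual Scrapy project, a class like this would be enabled through the `ITEM_PIPELINES` setting, and Scrapy would call the three hooks itself.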

To-Do

~~Git~~
~~Download scrapy~~
~~Set up container, pipeline, and settings~~
~~Test scraping code for first page~~
Generalize to all pages
Figure out the __doPostBack issue

