hakanu/selenium_scraper
Selenium Scraper

A Python scraper that uses Selenium to parse a site after it has fully loaded (AJAX calls, asynchronous Flash loads, etc.).

Fire up an instance

Not so easy...

Prereqs

  • Ubuntu machine (preferably the latest release)
  • Not ARM architecture. I could not get this to run on my Raspberry Pi. If somebody has already managed it, shoot me a mail.
  • sudo easy_install selenium
  • sudo easy_install pyvirtualdisplay
  • sudo apt-get install xvfb

In my case, Firefox and PhantomJS could not show the Flash videos. Chrome was the only browser that worked.

Install Chrome, then check that the binary is present:

ls /usr/bin | grep chrome

  • Get the Chrome driver to be able to use Selenium with Chrome:

wget http://chromedriver.storage.googleapis.com/2.10/chromedriver_linux64.zip

unzip chromedriver_linux64.zip

python selenium_scraper_server.py

Go to localhost:8080/url=%url%&p=%pattern%

E.g. localhost:8080/url=hakanu.net&p=hakan
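
What a request like the one above sets in motion can be sketched roughly as follows. This is not the code in selenium_scraper_server.py, only a minimal illustration under stated assumptions: Python 3, chromedriver reachable on PATH, and two hypothetical helpers, fetch_rendered_source and find_matches, with a hypothetical settle_seconds pause to give AJAX and Flash content time to load.

```python
import re
import time


def fetch_rendered_source(url, settle_seconds=5):
    """Load url in Chrome on a virtual display and return the rendered HTML.

    Hypothetical helper: the name and settle_seconds are assumptions.
    """
    # Imported lazily so find_matches works without these packages installed.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    # Invisible X display backed by xvfb, so Chrome can run on a server.
    display = Display(visible=0, size=(1024, 768))
    display.start()
    try:
        driver = webdriver.Chrome()  # assumes chromedriver is on PATH
        try:
            driver.get(url)
            # Crude wait for asynchronous content to finish loading.
            time.sleep(settle_seconds)
            return driver.page_source
        finally:
            driver.quit()
    finally:
        display.stop()


def find_matches(page_source, pattern):
    """Return every non-overlapping match of the regexp in the page source."""
    return re.findall(pattern, page_source)
```

A call such as find_matches(fetch_rendered_source("http://hakanu.net"), "hakan") would then correspond to the example request, assuming the server applies the pattern as a regular expression to the rendered page.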

Restrictions:

  • Pattern and URL must be percent-encoded, e.g. with http://www.url-encode-decode.com/
  • Pattern should not use +; use * instead, because URL encoding's + (for a space) gets confused with the regexp +.
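
The encoding can also be done locally instead of through a website. The sketch below assumes Python 3 and uses only the standard library; build_request_url and its host default are hypothetical names chosen for illustration, and the path layout is copied from the example above. The last lines show where the + confusion comes from: form-style decoding turns a literal + into a space.

```python
from urllib.parse import quote, unquote_plus


def build_request_url(url, pattern, host="localhost:8080"):
    """Percent-encode url and pattern and place them in the request path.

    Hypothetical helper: the name and the host default are assumptions.
    """
    # safe="" also encodes "/" and ":" so the target URL stays one value.
    return "http://%s/url=%s&p=%s" % (
        host,
        quote(url, safe=""),
        quote(pattern, safe=""),
    )


print(build_request_url("hakanu.net", "hakan"))
print(build_request_url("http://hakanu.net/", "hak*n"))

# A raw "+" is read as a space by form-style decoding,
# which is why a regexp "+" does not survive unencoded.
print(repr(unquote_plus("a+b")))
```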
