Selenium Scraper

A Python scraper built on Selenium, which parses a site only after it has fully loaded (AJAX calls, asynchronous Flash loads, etc.).

Fire up an instance

Not so easy...

Prereqs

Ubuntu machine (preferably the latest release)
Not ARM architecture. I could not get this to run on my Raspberry Pi. If somebody has already managed it, send me an email.
sudo easy_install selenium
sudo easy_install pyvirtualdisplay
sudo apt-get install xvfb

In my case, Firefox and PhantomJS could not display Flash videos. Chrome was the only browser that worked.

Install chrome

http://www.howopensource.com/2011/10/install-google-chrome-in-ubuntu-11-10-11-04-10-10-10-04/
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
sudo apt-get update
sudo apt-get install google-chrome-stable
Make sure Chrome is installed at /usr/bin/google-chrome:

ls /usr/bin | grep chrome

Download ChromeDriver so that Selenium can drive Chrome:

wget http://chromedriver.storage.googleapis.com/2.10/chromedriver_linux64.zip

unzip chromedriver_linux64.zip
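
With everything installed, the pieces fit together roughly as in the sketch below. This is an illustration, not the code in selenium_scraper.py: the function names `find_matches` and `fetch_rendered_html` are hypothetical, the driver path `./chromedriver` assumes the archive was unzipped into the current directory, and the `Service` argument assumes a recent Selenium 4 release.

```python
import re


def find_matches(html, pattern):
    """Return every non-overlapping regexp match of pattern in html."""
    return [m.group(0) for m in re.finditer(pattern, html)]


def fetch_rendered_html(url, driver_path="./chromedriver"):
    """Load url in Chrome on a virtual display and return the rendered HTML."""
    # Imported here so find_matches stays usable without a browser stack.
    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # pyvirtualdisplay starts Xvfb, so Chrome can run on a machine with no screen.
    display = Display(visible=0, size=(1024, 768))
    display.start()
    try:
        # Assumed location of the unzipped chromedriver binary.
        driver = webdriver.Chrome(service=Service(executable_path=driver_path))
        try:
            driver.get(url)
            # page_source is the DOM as the browser currently sees it.
            return driver.page_source
        finally:
            driver.quit()
    finally:
        display.stop()
```

A call such as `find_matches(fetch_rendered_html("http://hakanu.net"), "hakan")` would then return the matches found in the rendered page. Content that loads well after the initial page load may need an explicit wait before `page_source` is read.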

python selenium_scraper_server.py

Go to localhost:8080/url=%url%&p=%pattern%

E.g. localhost:8080/url=hakanu.net&p=hakan
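
A request address in this shape can be assembled with the standard library. The sketch below is only an illustration: `build_request_url` is a hypothetical helper, and the `http://` scheme and the `localhost:8080` default are assumptions based on the example above.

```python
from urllib.parse import quote


def build_request_url(url, pattern, host="localhost:8080"):
    """Percent-encode url and pattern and place them in the request path."""
    # safe="" makes quote() encode "/" and ":" as well.
    encoded_url = quote(url, safe="")
    encoded_pattern = quote(pattern, safe="")
    return "http://{}/url={}&p={}".format(host, encoded_url, encoded_pattern)


print(build_request_url("hakanu.net", "hakan"))
print(build_request_url("http://hakanu.net/a b", "ha.*n"))
```

Plain letters, digits and dots pass through unchanged, while characters such as `:`, `/`, spaces and `*` are replaced by their `%XX` form.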

Restrictions:

The pattern and the URL must be percent-encoded. http://www.url-encode-decode.com/
The pattern should not use +; use * instead, because URL encoding's + (which stands for a space) is easily confused with the regexp +.
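
The + confusion can be reproduced with the standard library alone. The sketch below assumes form-style decoding, as done by `urllib.parse.unquote_plus`; whether the server decodes requests exactly this way is an assumption. It also shows that a pattern written with + can be respelled with *, since "one or more b" is the same as "one b, then zero or more b".

```python
import re
from urllib.parse import unquote_plus

# Form-style decoding turns "+" into a space, which changes the regexp.
decoded = unquote_plus("ab+c")
print(repr(decoded))

# "b+" (one or more) can be written as "bb*" (one, then zero or more).
plus_pattern = re.compile(r"ab+c")
star_pattern = re.compile(r"abb*c")

samples = ["ac", "abc", "abbbc", "ab c"]
plus_hits = [s for s in samples if plus_pattern.fullmatch(s)]
star_hits = [s for s in samples if star_pattern.fullmatch(s)]
print(plus_hits == star_hits)
```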
