# ecommerce_jualo

Crawling the Jualo ecommerce site with Scrapy and sending the results as JSON to Kafka.

## Understand the web structure

You must first understand the structure of the target web pages, i.e. how to address their elements with XPath and CSS selectors.
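Both selector types can address the same element. A minimal, self-contained sketch using toy HTML that mirrors the listing markup used later in this README (the HTML and price value are only placeholders):

from scrapy.selector import Selector

html = '<ul id="frmSaveListing"><li><span class="article-right"><span>Rp 1.000.000</span></span></li></ul>'
sel = Selector(text=html)
# the same text node, once via XPath and once via CSS
print(sel.xpath('//*[@id="frmSaveListing"]/li[1]//*[@class="article-right"]/span/text()').extract_first())
print(sel.css('#frmSaveListing li:nth-child(1) .article-right span::text').extract_first())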

## Install Scrapy on CentOS

sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
yum update -y 
yum install python-pip -y 
yum install python-devel -y 
yum install gcc gcc-devel -y 
yum install libxml2 libxml2-devel -y 
yum install libxslt libxslt-devel -y 
yum install openssl openssl-devel -y 
yum install libffi libffi-devel -y 
CFLAGS="-O0" pip install lxml 
pip install scrapy 

## Install Selenium

The Selenium version must be 2.53.6, so pin it explicitly:

pip install selenium==2.53.6
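A quick check that the pinned version is the one actually importable:

import selenium
print(selenium.__version__)  # should print 2.53.6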

## Install Xvfb and PyVirtualDisplay

These are needed to run the browser as a background process; Python 2.7 must already be installed on the machine:

yum install xorg-x11-server-Xvfb
pip install PyVirtualDisplay

## Browser version

The browser used is Firefox, and its version must be 45.0.2 (or another 45.x.x release).

## Running the browser in a background process

To run the browser as a background process, start a virtual display (provided by Xvfb and PyVirtualDisplay) before creating the WebDriver:

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(800, 600))  # virtual X display backed by Xvfb
display.start()
driver = webdriver.Firefox()  # Firefox now renders into the virtual display
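A short usage sketch building on the block above; the URL is only a placeholder, and the display should be stopped once the crawl is done:

driver.get('https://www.jualo.com')  # placeholder URL
print(driver.title)
driver.quit()    # close the browser
display.stop()   # tear down the virtual display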

## Connect to MySQL using the MySQLdb library

The MySQL configuration must be added to the Scrapy project's settings.py (see the sketch after this block); the connection is then built from those settings, for example in an item pipeline's from_crawler classmethod:

import MySQLdb

class MySQLPipeline(object):

    def __init__(self, conn):
        self.conn = conn

    @classmethod
    def from_crawler(cls, crawler):
        # read the MYSQL_* keys from settings.py
        conn = MySQLdb.connect(
            host=crawler.settings['MYSQL_HOST'],
            port=crawler.settings['MYSQL_PORT'],
            user=crawler.settings['MYSQL_USER'],
            passwd=crawler.settings['MYSQL_PASS'],
            db=crawler.settings['MYSQL_DB'])
        return cls(conn)
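The matching keys in settings.py could look like this (all values below are placeholders, not taken from this repo):

MYSQL_HOST = 'localhost'  # placeholder
MYSQL_PORT = 3306         # MySQLdb expects an integer port
MYSQL_USER = 'crawler'    # placeholder
MYSQL_PASS = 'secret'     # placeholder
MYSQL_DB = 'jualo'        # placeholder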

## Take content

To extract the required content, use an XPath or CSS selector, for example:

response.xpath('//*[contains(@id, "frmSaveListing")]/ul/li[' + str(i) + ']//*[contains(@class, "article-right")]/span/text()').extract_first()

## Click a button

To click a button, its id or XPath must be known first:

driver.find_element_by_id('s_imgBtnSearch').click()
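If only the XPath is known, the equivalent click (same element, assuming the id shown above is unique) is:

driver.find_element_by_xpath('//*[@id="s_imgBtnSearch"]').click()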

## Running the engine

To run the engine automatically, schedule it with crontab:

python2.7 jualo.py
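A crontab entry for hourly runs might look like this; the schedule and the script path are assumptions, not taken from this repo:

0 * * * * /usr/bin/python2.7 /path/to/jualo.py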

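## Send JSON to Kafka

The scraped items are sent to Kafka as JSON. The producer code is not shown in this README, so the following is only a minimal sketch of a Scrapy item pipeline, assuming the kafka-python client and hypothetical KAFKA_HOST and KAFKA_TOPIC keys in settings.py:

import json

from kafka import KafkaProducer

class KafkaPipeline(object):

    def __init__(self, producer, topic):
        self.producer = producer
        self.topic = topic

    @classmethod
    def from_crawler(cls, crawler):
        # KAFKA_HOST and KAFKA_TOPIC are hypothetical settings keys
        producer = KafkaProducer(bootstrap_servers=crawler.settings['KAFKA_HOST'])
        return cls(producer, crawler.settings['KAFKA_TOPIC'])

    def process_item(self, item, spider):
        # serialize the item to JSON and publish it to the topic
        self.producer.send(self.topic, json.dumps(dict(item)).encode('utf-8'))
        return item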