A web data crawler for e-commerce in PHP
To scrape a shop at www.XXX.com, you need to write the script src/Site/Xxx.php. You can copy one of the existing scripts in src/Site/ as a starting point.
Inside Xxx.php you need to implement 3 functions:
fetchCategories() - get all the links to product category pages.
fetchProducts() - get all the links to product pages.
fetchProductData() - scrape information from the product page.
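
As a rough illustration of the three functions, here is a minimal sketch built on PHP's DOM extension. The class name `XxxSketch`, the method signatures (each takes an HTML string), the returned array shapes, and the XPath selectors are all assumptions made for this example; the real scripts in src/Site/ may look different, so copy one of them for the actual interface.

```php
<?php
// Hypothetical sketch: class name, signatures, selectors and return shapes
// are assumptions, not the project's actual interface.

final class XxxSketch
{
    /** Parse an HTML string and return an XPath helper for querying it. */
    private function xpath(string $html): DOMXPath
    {
        $doc = new DOMDocument();
        // Shop markup is rarely valid, so collect parser warnings silently.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        return new DOMXPath($doc);
    }

    /** Collect the href attribute of every element matched by $query. */
    private function links(string $html, string $query): array
    {
        $urls = [];
        foreach ($this->xpath($html)->query($query) as $node) {
            $urls[] = $node->getAttribute('href');
        }

        return $urls;
    }

    /** Links to category pages (assumed to live in <ul class="menu">). */
    public function fetchCategories(string $homeHtml): array
    {
        return $this->links($homeHtml, '//ul[@class="menu"]//a[@href]');
    }

    /** Links to product pages (assumed to be <a class="product">). */
    public function fetchProducts(string $categoryHtml): array
    {
        return $this->links($categoryHtml, '//a[@class="product"][@href]');
    }

    /** Scrape a few fields from a product page. */
    public function fetchProductData(string $productHtml): array
    {
        $xp = $this->xpath($productHtml);
        $priceText = $xp->evaluate('string(//*[@class="price"])');

        return [
            'title' => trim($xp->evaluate('string(//h1)')),
            // Keep digits only; assumes prices have no decimal part.
            'price' => (int) preg_replace('/\D/', '', $priceText),
        ];
    }
}
```

In a real site script the HTML would come from an HTTP request rather than a string argument, and the selectors would be adapted to the shop's markup.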
You also need to add your new script Xxx.php in both src/Command/Fetch.php and src/Site.php.
To run tests and check each of the functions independently, you can use src/Commands/Test.php.
To run the tests, use:
php index.php test --sites xxx
When the functions are done, run the crawler with:
php index.php fetch --sites xxx
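
The naming convention used above (shop www.XXX.com, script src/Site/Xxx.php, option value xxx) can be sketched as a small mapping function. This is only an illustration: `siteScripts()` is a hypothetical helper, and the assumption that `--sites` accepts a comma-separated list is a guess, not something the project documents.

```php
<?php
// Hypothetical helper: the project may resolve --sites differently.

/**
 * Map a --sites value such as "xxx" to script paths under src/Site/.
 * Assumes a comma-separated list; that separator is a guess.
 */
function siteScripts(string $sitesOption): array
{
    $paths = [];
    foreach (explode(',', $sitesOption) as $name) {
        $name = trim($name);
        if ($name === '') {
            continue; // ignore empty entries such as a trailing comma
        }
        // "xxx" becomes "Xxx", matching the src/Site/Xxx.php naming.
        $paths[] = 'src/Site/' . ucfirst(strtolower($name)) . '.php';
    }

    return $paths;
}
```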
CREATE TABLE `products` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `site` varchar(50) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `product_code` varchar(255) DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `description` text,
  `image` text,
  `video` varchar(255) DEFAULT NULL,
  `model` varchar(150) DEFAULT NULL,
  `manufacturer` varchar(150) DEFAULT NULL,
  `warranty` varchar(150) DEFAULT NULL,
  `delivery` varchar(150) DEFAULT NULL,
  `price` int(11) DEFAULT NULL,
  `sale_price` int(11) DEFAULT NULL,
  `ship_price` int(11) DEFAULT NULL,
  `options` json DEFAULT NULL,
  `category` json DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `visited` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `site` (`site`,`url`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `categories` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(255) DEFAULT NULL,
  `url` varchar(255) DEFAULT NULL,
  `site` varchar(50) DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `visited` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `url` (`url`,`site`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
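
Because the products table has a unique key on site and url, a re-crawled product can be written with a single MySQL `INSERT ... ON DUPLICATE KEY UPDATE` statement instead of a lookup followed by an insert or update. The sketch below shows one way this could look. `buildProductUpsert()` is a hypothetical helper, not the project's code, and the choice to set created_at and updated_at with `NOW()` is an assumption. It only builds the statement and its parameters, so it can be tried without a database.

```php
<?php
// Hypothetical helper, not part of the project.

/**
 * Build an upsert for the products table.
 * Returns [sql, params] for PDO::prepare() and PDOStatement::execute().
 */
function buildProductUpsert(array $product): array
{
    // Only accept known columns so array keys never reach the SQL unchecked.
    $allowed = ['site', 'url', 'product_code', 'title', 'description', 'image',
        'video', 'model', 'manufacturer', 'warranty', 'delivery', 'price',
        'sale_price', 'ship_price', 'options', 'category'];
    $product = array_intersect_key($product, array_flip($allowed));

    if (!isset($product['site'], $product['url'])) {
        throw new InvalidArgumentException('site and url are required');
    }

    $columns = [];
    $placeholders = [];
    $updates = [];
    $params = [];
    foreach ($product as $column => $value) {
        // The json columns are sent as JSON text.
        if (($column === 'options' || $column === 'category') && is_array($value)) {
            $value = json_encode($value);
        }
        $columns[] = "`$column`";
        $placeholders[] = ":$column";
        $params[":$column"] = $value;
        // site and url form the unique key, so they are never updated.
        if ($column !== 'site' && $column !== 'url') {
            $updates[] = "`$column` = VALUES(`$column`)";
        }
    }
    $updates[] = '`updated_at` = NOW()';

    $sql = 'INSERT INTO `products` (' . implode(', ', $columns) . ', `created_at`, `updated_at`)'
         . ' VALUES (' . implode(', ', $placeholders) . ', NOW(), NOW())'
         . ' ON DUPLICATE KEY UPDATE ' . implode(', ', $updates);

    return [$sql, $params];
}
```

With an open PDO connection in `$pdo`, the result could be executed as `$pdo->prepare($sql)->execute($params);`. Note that `VALUES()` inside `ON DUPLICATE KEY UPDATE` is deprecated in recent MySQL 8.0 releases in favour of row aliases, although it still works.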