matrix0415/CYCUResearch-SimpleCrawler
==========Opinion Structure==========
Opinion Structure:
    <Entity>
        <Property />
        <Property />
        <Property />
        <Property />
        <Property />
    </Entity>

Opinion Example:
    <entity>
        <title>
        <rating>
        <comment>
        ...
    </entity>

==========Using Procedure==========
1. Install the dependencies with pip using requirement.txt.

2. Run the Django database sync and enter the superuser credentials.
    python manage.py syncdb
    acc = ubuntu
    pwd = root

3. Run the Django program.
    python manage.py runserver 0.0.0.0:8080

4. Open the browser and enter the URL of the admin site.
    url = http://IpAddress:8080/admin
    acc = ubuntu
    pwd = root

5. Crawler->Website, Add Website.
    DatasetSource = "Web"
    Enable        = Click
    CanExtract    = Click

6. Crawler->Crawl propertys, Add Crawl propertys.
    Website     = Choose the website.
    Url         = The URL combined with the parameter.
                  Ex: http://www.hotels.com/ho<<130000~130036/1>>-tr/
    Page404Path = The 404 page.
    NJobs       = Maximum 10.

7. Crawler->Extraction entitys, Add Extraction entitys.
    Name           = The name of the element.
    WebsiteID      = Choose the website.
    EntitySelector = CSS selector. Ex: div.review.clearfix.hreview

8. Crawler->Extraction propertys, Add Extraction propertys.
    Name               = The name of the element.
    ExtractionEntityID = Choose the Extraction Entity.
    PropertySelector   = CSS selector. Ex: blockquote.description

9. Open the URL http://IpAddress:8080/crawl
    Crawls the website automatically.
    The dataset is saved in the dataset folder.

10. Open the URL http://IpAddress:8080/extract/entity
    Extracts the entities automatically.
    The entities can be found in the back end (admin site).

11. Open the URL http://IpAddress:8080/extract/property
    Extracts the properties automatically.
    The properties can be found in the back end (admin site).
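
The Url field of a Crawl property uses a range placeholder such as <<130000~130036/1>>. The sketch below shows one way such a placeholder could be expanded into concrete URLs. It is not the repository's code: the function name expand_url_template is hypothetical, and the reading of the placeholder as <<start~end/step>> with an inclusive end value is an assumption based only on the example URL.

```python
import re

# Assumed placeholder syntax: <<start~end/step>>, e.g. <<130000~130036/1>>.
_RANGE = re.compile(r"<<(\d+)~(\d+)/(\d+)>>")


def expand_url_template(template):
    """Expand the first <<start~end/step>> placeholder into a list of URLs.

    Hypothetical helper for illustration; the project's own
    libs.stringL.fetchUrl may behave differently.
    """
    match = _RANGE.search(template)
    if match is None:
        # No placeholder: the template is already a plain URL.
        return [template]
    start, end, step = (int(group) for group in match.groups())
    if step <= 0:
        raise ValueError("step must be a positive integer")
    prefix = template[:match.start()]
    suffix = template[match.end():]
    # Assumption: the end value is inclusive.
    return [prefix + str(n) + suffix for n in range(start, end + 1, step)]


urls = expand_url_template("http://www.hotels.com/ho<<130000~130036/1>>-tr/")
print(len(urls))   # 37
print(urls[0])     # http://www.hotels.com/ho130000-tr/
print(urls[-1])    # http://www.hotels.com/ho130036-tr/
```

Under these assumptions the example URL expands to 37 pages, which could then be split across up to NJobs workers.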
==========Program Structure==========
Views<->Controller->libs->Models

==========Crawler==========
Models
    class Website(models.Model)
    class CrawlProperty
    class ExtractionProperty(models.Model)
    class Entity(models.Model)
    class Instance(models.Model)

Forms
    class WebsiteForm(ModelForm)
    class CrawlPropertyForm(ModelForm)
    class ExtractionPropertyForm(ModelForm)
    class EntityForm(ModelForm)
    class InstanceForm(ModelForm)

controller
    def importFromWebExportToFileC(siteName, urlList, page404)

libs
    def fetchFromWebL(importPath, page404Path)
    def fetchFromLocalL(importPath)
    def saveByFileL(fname, content)

==========Library==========
<return> library.name(parameters)
<list> libs.stringL.fetchUrl(fakeUrl)

==========Database Schema==========
Website:
    id: Integer; auto increment; primary key
    name: Varchar(30)
    datasetSource: Varchar(10); "Web"/"File"
    canExtract: Boolean; default = False
    enable: Boolean; default = False
    checkDataset: Boolean; default = False
    datasetLocation: Varchar(150); blank; null
    datasetFileNum: Integer; default = 0
    Extracted: Boolean; default = False

CrawlProperty:
    id: Integer; auto increment; primary key
    websiteID: Integer; foreign key(Website)
    url: Varchar(URL)
    page404Path: Varchar(URL)
    nJobs: Integer; default = 1
    -unique(websiteID, url)

ExtractionEntity:
    id: Integer; auto increment; primary key
    websiteID: Integer; foreign key(Website.id)
    entitySelector: Varchar(100); index
    -unique(websiteID, entitySelector)

ExtractionProperty:
    id: Integer; auto increment; primary key
    name: Varchar(30)
    extractionEntityID: Integer; foreign key(ExtractionEntity.id)
    propertySelector: Varchar(100)
    -unique(extractionEntityID, propertySelector)

Entity:
    id: Integer; auto increment; primary key
    websiteID: Integer; foreign key(Website.id)
    extractionEntityID: Integer; foreign key(ExtractionEntity.id)
    datasetPath: Varchar(150)
    content: Text; blank; null
    used: Boolean; default = False
    datetime: Datetime; default = create time

Property:
    id: Integer; auto increment; primary key
    entityID: Integer; foreign key(Entity.id)
    extractionPropertyID: Integer; foreign key(ExtractionProperty.id)
    content: Text; blank; null
    datetime: Datetime; default = create time
    -unique(extractionEntityID, extractionPropertyID)
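
To illustrate how the unique and foreign key rules in the schema behave, here is a minimal sketch of the Website and CrawlProperty tables. It uses Python's built-in sqlite3 module instead of the project's Django models so that it runs on its own. The helper add_crawl_property, the column length of 200 for the URL fields, and the reduced set of Website columns are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE Website (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    name          VARCHAR(30) NOT NULL,
    datasetSource VARCHAR(10) NOT NULL,          -- "Web" or "File"
    canExtract    BOOLEAN NOT NULL DEFAULT 0,
    enable        BOOLEAN NOT NULL DEFAULT 0
);
CREATE TABLE CrawlProperty (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    websiteID   INTEGER NOT NULL REFERENCES Website(id),
    url         VARCHAR(200) NOT NULL,           -- length 200 is an assumption
    page404Path VARCHAR(200) NOT NULL,           -- length 200 is an assumption
    nJobs       INTEGER NOT NULL DEFAULT 1,
    UNIQUE (websiteID, url)
);
""")


def add_crawl_property(conn, website_id, url, page404_path, n_jobs=1):
    """Insert one CrawlProperty row (hypothetical helper).

    Returns True on success and False when the unique or
    foreign key rule rejects the row.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO CrawlProperty (websiteID, url, page404Path, nJobs) "
                "VALUES (?, ?, ?, ?)",
                (website_id, url, page404_path, n_jobs),
            )
        return True
    except sqlite3.IntegrityError:
        return False


with conn:
    cursor = conn.execute(
        "INSERT INTO Website (name, datasetSource, canExtract, enable) "
        "VALUES (?, ?, 1, 1)",
        ("hotels", "Web"),
    )
website_id = cursor.lastrowid

template = "http://www.hotels.com/ho<<130000~130036/1>>-tr/"
page404 = "http://www.hotels.com/404"  # placeholder value for the sketch

first = add_crawl_property(conn, website_id, template, page404, n_jobs=10)
duplicate = add_crawl_property(conn, website_id, template, page404, n_jobs=10)
orphan = add_crawl_property(conn, 999, template, page404)  # no such Website row
print(first, duplicate, orphan)  # True False False
```

The second insert is rejected by unique(websiteID, url), and the third by the foreign key to Website.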