Camel Crawler

//TODO-description

Web Crawler

Write a basic crawler that collects the list of URLs for a single domain
Feed the root URLs to the crawler from a text file
Connect the crawler to a persistent database (MySQL)
Convert the list of domains into a data frame/structure for analysis
Build a database of domains to crawl
Blacklist websites that sit behind CDNs or render their content dynamically
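
The first step above, collecting URLs that belong to a single domain, can be sketched roughly as follows. This is a minimal illustration in plain Java (which runs on the same JVM as the crawler), not the project's actual code. The class name `LinkExtractor`, the method `sameDomainLinks`, and the regex-based parsing are assumptions made for the example; a real crawler would more likely use an HTML parser and fetch pages over the network.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Hypothetical helper pattern: captures the href value of anchor tags.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the links in `html` whose host matches the host of `baseUrl`.
    public static Set<String> sameDomainLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            // Drop the fragment so "page#top" and "page" count as one URL.
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                // Resolve relative links against the page they were found on.
                URI resolved = base.resolve(href);
                if (base.getHost() != null
                        && base.getHost().equalsIgnoreCase(resolved.getHost())) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }
}
```

Calling `LinkExtractor.sameDomainLinks(html, "https://example.com/index.html")` keeps relative and same-host links and discards links to other hosts, which is the filtering a single-domain crawl needs before queueing the next pages.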

Middleware: API for the database (Redis)

Configure MySQL to feed data into Redis
Use a graph data structure in Redis to retrieve the required information
Write endpoints that return the data points needed by the front end
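
To make the graph idea concrete, here is a small in-memory sketch of a backlink graph in Java. It is only a stand-in for whatever structure ends up in Redis: the class `BacklinkGraph` and its methods are hypothetical names, and the mapping to Redis (for instance one set per target domain, filled with `SADD` and counted with `SCARD`) is an assumption rather than the project's confirmed design.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BacklinkGraph {
    // Target domain -> set of source domains that link to it.
    private final Map<String, Set<String>> inbound = new HashMap<>();

    // Records that `source` links to `target`; duplicate edges are ignored.
    public void addLink(String source, String target) {
        inbound.computeIfAbsent(target, k -> new HashSet<>()).add(source);
    }

    // Number of distinct domains linking to `target`.
    public int backlinksFor(String target) {
        return inbound.getOrDefault(target, Collections.emptySet()).size();
    }

    // Total number of distinct (source, target) edges in the graph.
    public int totalBacklinks() {
        return inbound.values().stream().mapToInt(Set::size).sum();
    }
}
```

An endpoint that serves the total backlink count could then simply return `totalBacklinks()`, while a per-domain endpoint would call `backlinksFor(domain)`.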

Front-end (React)

Display the total number of backlinks in the database
Query the API to retrieve total number of backlinks

Test Cases

How to Run It

//TODO

Advanced Features

Work with dynamic websites by implementing a headless browser (HtmlUnit or Selenium)
Rotate proxies to avoid bot checks and circumvent CDNs, e.g. Cloudflare, Distil Networks
Build a jar file of the web crawler so it can run on multiple physical machines (currently blocked by an Akka compile issue)
Memory optimization: use of CountDownLatch
Integrate Kafka and the Kafka API to stream data from the crawler to the database and on to the front end for 'real-time' feedback
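
As an illustration of the latch item above, and assuming it refers to `java.util.concurrent.CountDownLatch`, the sketch below waits for a batch of crawl tasks to finish before collecting their results. The class `LatchedCrawl`, the pool size of 4, and the placeholder "fetch" are all assumptions for the example; the real crawler's task structure may look quite different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LatchedCrawl {
    // Runs one task per URL and blocks until every task has finished.
    public static List<String> crawlAll(List<String> urls) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CountDownLatch done = new CountDownLatch(urls.size());
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    // Placeholder for a real page fetch.
                    results.add("fetched:" + url);
                } finally {
                    // Count down even if the task fails, so await() cannot hang.
                    done.countDown();
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            // Restore the interrupt flag and return what was collected so far.
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(results);
    }
}
```

Processing URLs in bounded batches like this keeps the number of in-flight pages limited, which is one plausible way a latch could help with memory use.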

Team Members

Abdusamed
Ming
Chi
