SpiderBot is a web crawler: it crawls the web (HTTP servers), retrieves content, and performs actions on that content. SpiderBot is an effort to design and develop a truly pipelined, distributed, open source web crawler.


SpiderBot is a web crawler written in C/C++ using Berkeley sockets and the HTTP/1.1 protocol
Copyright (C) 2012  Mustafa Neguib, MN Tech Solutions

This file is part of SpiderBot.

SpiderBot is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

You can contact the developer/company at the following:

Phone: 00923224138957
Website: www.mntechsolutions.net
Email: support@mntechsolutions.net, mustafaneguib@mntechsolutions.net

Current Version: 0.4

Following are the details of the platform on which SpiderBot is being developed and tested:

Brand: Dell Inspiron 1440
Processor: Core 2 Duo
RAM: 2 GB
Operating System: Ubuntu 11.04 Desktop
Programming Language: C/C++
Compiler: g++

Installation & Running

SpiderBot is easy to install and run. Run the searchspider.sh shell script in the terminal as follows:

$ ./searchspider.sh

This will compile and execute the app.

Note that you need a recent version of g++ installed before you can compile and run the app.
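If you prefer to build by hand (or the script does not work on your setup), a plain g++ invocation along the following lines should be equivalent; the assumption that all of the sources sit in the repository root is mine, not the project's:

$ g++ *.cpp -o searchspider
$ ./searchspider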

Following are the features that have been planned for SpiderBot. The status of each feature is marked beside it.

Features List:

(1) Connecting to a server and retrieving the HTML code for parsing (Done; see the first sketch after this list)
(2) Extraction of links (Done)
(3) Queue for visited and not visited links (Done)
(4) Base tag (<base...>) implemented (Done)
(5) Checking the connectivity of the server and gracefully disconnecting from a connected server which has no more content to offer (Done)
(6) Extraction of the data within the div tag (<div...>...</div>) (Done)
(7) Extraction of the data within the title tag (<title>...</title>) (Done; see the second sketch after this list)
(8) Checking for the robots.txt file and making SpiderBot polite
(9) Design of a repository
(10) Optimization of the memory being used
(11) Designing a distributed SpiderBot system (with distributed spiders and repository) that is scalable and fail-proof
(12) Extraction of the data within the p tag (<p...>...</p>) (Done)
(13) Design and development of a pipelining scheme for the spider
(14) Extraction of the data within the b tag (<b...>...</b>) (Done)
(15) Extraction of the data within the i tag (<i...>...</i>) (Done)
(16) Extraction of the data within the u tag (<u...>...</u>) (Done)
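As a rough illustration of what feature (1) involves at the socket level, here is a minimal sketch of fetching a page with an HTTP/1.1 GET request over a Berkeley socket. This is not SpiderBot's actual code; the host name, the port, and the buffer size are assumptions made for the example.

#include <cstdio>
#include <string>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fetch "/" from the given host over HTTP/1.1 and print the raw
   response. Error handling is kept minimal for brevity. */
int main() {
    const char *host = "example.com";   /* assumed host, for illustration */

    addrinfo hints = {};
    hints.ai_family = AF_UNSPEC;        /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;    /* TCP */

    addrinfo *res = 0;
    if (getaddrinfo(host, "80", &hints, &res) != 0) return 1;

    int sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (sock < 0 || connect(sock, res->ai_addr, res->ai_addrlen) != 0) return 1;
    freeaddrinfo(res);

    /* HTTP/1.1 requires a Host header; "Connection: close" lets us read
       until EOF instead of parsing Content-Length or chunked encoding. */
    std::string request = "GET / HTTP/1.1\r\nHost: " + std::string(host) +
                          "\r\nConnection: close\r\n\r\n";
    send(sock, request.c_str(), request.size(), 0);

    char buf[4096];
    ssize_t n;
    while ((n = recv(sock, buf, sizeof(buf), 0)) > 0)
        fwrite(buf, 1, n, stdout);      /* dump headers + HTML */

    close(sock);
    return 0;
}

The "Connection: close" header keeps the sketch simple: the whole response can be read until the server closes the connection, with no need to parse the Content-Length header or chunked transfer encoding.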

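Likewise, the tag extraction features (6), (7), (12) and (14)-(16) all boil down to pulling out the text between an opening and a closing tag. The sketch below shows the basic idea; it is deliberately naive (no handling of nested tags, comments, or malformed HTML), and the function name is mine, not SpiderBot's.

#include <iostream>
#include <string>

/* Return the text between <tag ...> and </tag>, or "" if the tag is
   not found. A real crawler must cope with nesting, attributes and
   broken HTML; this only illustrates the happy path. */
std::string extract_tag(const std::string &html, const std::string &tag) {
    std::string::size_type start = html.find("<" + tag);
    if (start == std::string::npos) return "";
    start = html.find('>', start);      /* skip past any attributes */
    if (start == std::string::npos) return "";
    ++start;

    std::string::size_type end = html.find("</" + tag + ">", start);
    if (end == std::string::npos) return "";
    return html.substr(start, end - start);
}

int main() {
    std::string html = "<html><title>SpiderBot</title></html>";
    std::cout << extract_tag(html, "title") << "\n";   /* prints: SpiderBot */
    return 0;
}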

I will be updating this list as the features are completed, and I will also be adding new items to the list as I come up with more ideas.
Don't forget to fork the project and provide us with some cool new features; we might even add them to the project if they are good enough.


Versioning Scheme:
Every push of the code to the git repository is a new version, e.g. push 1: version 0.2, push 2: version 0.3, and so on.
The major part of the version (the number to the left of the decimal point) will be incremented after the push of 0.9, so the push after 0.9 will be version 1.0, and so on.
