SpiderBot is a web crawler that crawls the web (HTTP servers), retrieves content, and performs actions on that content. SpiderBot is an effort to design and develop a truly pipelined, distributed, open source web crawler.
…now I am getting more than just the links. I have also added trees for the following tags: b, u, i. Version 0.4
SpiderBot is a web crawler written in C/C++ using Berkeley sockets and the HTTP/1.1 protocol.
Copyright (C) 2012 Mustafa Neguib, MN Tech Solutions

This file is part of SpiderBot.

SpiderBot is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

You can contact the developer/company at the following:
Phone: 00923224138957
Website: www.mntechsolutions.net
Email: email@example.com , firstname.lastname@example.org

Current Version: 0.4

Following are the details of the platform on which SpiderBot is being developed and tested:
Brand: Dell Inspiron 1440
Processor: Core 2 Duo
RAM: 2GB
Operating System: Ubuntu 11.04 Desktop
Programming Language: C/C++
Compiler: g++

Installation & Running

SpiderBot is easy to install and run. Run the searchspider.sh shell script in the terminal as follows:

$ ./searchspider.sh

This will compile and execute the app. Note that you need the latest version of g++ installed before you can compile and run the app.

Following are the features that have been planned for SpiderBot. The status of each feature is marked beside it.
Features List:
(1) Connecting to the server and retrieving the HTML code for parsing (Done)
(2) Extraction of links (Done)
(3) Queue for visited and not visited links (Done)
(4) Base tag (<base....>) implemented (Done)
(5) Checking the connectivity of the server and disconnecting gracefully from a connected server which has no more content to offer (Done)
(6) Extraction of the data within the div tag (<div...>....</div>) (Done)
(7) Extraction of the data within the title tag (<title>...</title>) (Done)
(8) Checking for the robots.txt file and making SpiderBot polite
(9) Design of a repository
(10) Optimization of the memory being used
(11) Design of a distributed SpiderBot system (with distributed spiders and repository) that is scalable and failproof
(12) Extraction of the data within the p tag (<p...>....</p>) (Done)
(13) Design and development of a pipelining scheme for the spider
(14) Extraction of the data within the b tag (<b...>....</b>) (Done)
(15) Extraction of the data within the i tag (<i...>....</i>) (Done)
(16) Extraction of the data within the u tag (<u...>....</u>) (Done)

I will be updating this list as the features are completed, and I will also be adding new items to the list as I come up with more ideas. Don't forget to fork the project and provide us with some cool new features; we might even add them to the project if they are good enough.

Versioning Scheme:
Every push of the code to the git repository will be a new version, e.g.
push 1: Version 0.2
push 2: Version 0.3
and so on.
The major part of the version, that is, the number to the left of the decimal point, will be incremented after the push of 0.9. Hence the push after 0.9 will be 1.0, and so on.
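
The versioning scheme above amounts to simple arithmetic on the push count. The following is a minimal sketch of that arithmetic in C++. The function name versionForPush is hypothetical and is not part of SpiderBot, and the sketch assumes that the minor digit keeps rolling over into the major number every ten steps after 1.0 as well.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of SpiderBot): maps the n-th push to its
// version string under the scheme described above.
// push 1 -> "0.2", push 2 -> "0.3", ..., push 8 -> "0.9", push 9 -> "1.0".
std::string versionForPush(int push) {
    assert(push >= 1);        // pushes are counted from 1
    int tenths = push + 1;    // push 1 corresponds to 0.2, i.e. two tenths
    int major = tenths / 10;  // number to the left of the decimal point
    int minor = tenths % 10;  // number to the right of the decimal point
    return std::to_string(major) + "." + std::to_string(minor);
}
```

Under this reading, the current version 0.4 corresponds to the third push.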
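
The "Extraction of links" and "Queue for visited and not visited links" features from the list above can be illustrated with a small C++ sketch. This is not SpiderBot's actual code: the names extractLinks and Frontier are hypothetical, the link extraction is deliberately naive (it only recognises double-quoted href attributes), and relative links and the base tag are not resolved here.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, naive link extraction: collects the value of every
// href="..." attribute (double-quoted only) in the order it appears.
std::vector<std::string> extractLinks(const std::string& html) {
    std::vector<std::string> links;
    const std::string key = "href=\"";
    std::size_t pos = 0;
    while ((pos = html.find(key, pos)) != std::string::npos) {
        std::size_t start = pos + key.size();
        std::size_t end = html.find('"', start);
        if (end == std::string::npos) break;  // unterminated attribute
        links.push_back(html.substr(start, end - start));
        pos = end + 1;
    }
    return links;
}

// Hypothetical crawl frontier: a FIFO queue of links still to be visited,
// plus a set of every link seen so far so that no link is queued twice.
class Frontier {
public:
    // Queues the link unless it was seen before; returns true if queued.
    bool add(const std::string& url) {
        if (!seen_.insert(url).second) return false;
        pending_.push(url);
        return true;
    }

    bool empty() const { return pending_.empty(); }

    // Removes and returns the next link to visit. Precondition: !empty().
    std::string next() {
        assert(!pending_.empty());
        std::string url = pending_.front();
        pending_.pop();
        return url;
    }

private:
    std::queue<std::string> pending_;       // not yet visited
    std::unordered_set<std::string> seen_;  // visited or already queued
};
```

A crawl loop built on this sketch would call next(), fetch the page, pass the HTML to extractLinks, and feed each result back into add() until empty() returns true.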