GoScrapeFlow is a Go-based tool for scraping websites efficiently through their sitemaps, with built-in logging and data export and room for further extension.
🚧 GoScrapeFlow is still under active development. Expect continuous enhancements and potential changes. Documentation will be refined as the project evolves.
- Structure
- Installation
- Proxy Configuration
- Usage
- Future Enhancements and Contributions
- Contributing
- License
- Contact
- Efficient Scraping: Optimized for scraping through sitemaps, ensuring comprehensive and systematic data extraction.
- Modular Architecture: Composed of organized modules for HTTP handling, proxy rotation, sitemap parsing, and scraping tasks.
- Integrated Logging: Utilizes Logrus for detailed logging, offering valuable insights into the scraping process and aiding in troubleshooting.
- Data Export Flexibility: Supports exporting scraped data to Excel format. The output file name can be specified with the -o flag.
- Dynamic Content Extraction: Allows users to specify custom selectors (like class names) for scraping specific content from web pages.
- Single URL Analysis: Allows analysis of a single URL, extracting and listing all identifiable class names, tags, and IDs. Results are presented in an organized Excel file for in-depth analysis.
- JSON Output for Single Page Scraping: Exports scraping results in JSON format. This option applies only to single-page scraping.
These features are geared toward making GoScrapeFlow a versatile tool for web scraping, data analysis, and content aggregation, catering to both specific and broad scraping needs.
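To illustrate the sitemap-driven approach, here is a minimal sketch of pulling page URLs out of a sitemap document with Go's standard encoding/xml package. The URLSet type and the parseSitemap function are hypothetical names chosen for this example; they are not taken from GoScrapeFlow's sitemap package, whose actual implementation may differ.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// URLSet mirrors the <urlset> root element of a standard sitemap.
// Hypothetical type for illustration only.
type URLSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// parseSitemap returns the <loc> value of every <url> entry in data.
func parseSitemap(data []byte) ([]string, error) {
	var set URLSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		locs = append(locs, strings.TrimSpace(u.Loc))
	}
	return locs, nil
}

func main() {
	// A small in-memory sitemap; a real run would fetch this over HTTP.
	sample := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`)

	locs, err := parseSitemap(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, loc := range locs {
		fmt.Println(loc)
	}
}
```

Each returned URL would then be handed to the scraping step. Sitemap index files, which list other sitemaps rather than pages, are not handled in this sketch.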
├── cmd
│   ├── config.go
│   ├── root.go
│   └── start.go
├── helper
│   ├── excel
│   │   └── excel.go
│   └── log
│       └── log.go
├── httpclient
│   ├── http.go
│   └── proxy.go
├── output
│   [your output Excel files will be here]
├── sitemap
│   ├── scrape.go
│   └── sitemap.go
├── main.go
├── Makefile
└── proxies.txt
- Clone the repo:
git clone https://github.com/ngfenglong/go-scrape-flow.git
- Navigate to the directory:
cd go-scrape-flow
- Install the required dependencies (if any):
go get
By default, GoScrapeFlow is configured to use proxy addresses from proxies.txt. Here's how you can manage this:
- Update the Proxy List: Modify the proxies.txt file to include your proxy addresses, one per line.
- Disabling Proxy Usage: If you're not planning to use any proxy (for reasons like having a VPN or a fast non-proxy connection), you can disable it:
  - Open the httpclient/http.go file.
  - Locate the GetRequest function.
  - Comment out or remove the line: refreshClientWithProxy()
To build and set up the project, run:
make all
This command builds the goscrape binary and sets up the necessary folders.
To run the project:
make run
For development testing with a predefined URL:
make dev-test
If you prefer to set up and run without the Makefile:
- Build the project:
go build -o goscrape .
chmod +x goscrape
- Create the output folder:
mkdir ./output
- Run the project:
./goscrape
To clean up the generated files:
- To remove the goscrape binary:
make clean
- To remove the output folder:
make clean-output
- Testing and Error Handling: Rigorous testing and enhanced error handling mechanisms are in the pipeline to make GoScrapeFlow more robust.
- Enhanced Selector Functionality: We are working on improving the selector functionality to handle multiple selectors more effectively, allowing for more granular and comprehensive scraping.
These upcoming features aim to further bolster the capabilities of GoScrapeFlow, making it more versatile and user-friendly. Contributions and suggestions for improvements are always welcome!
GoScrapeFlow is an open-source project, and contributions are warmly welcomed! Whether it's bug fixes, feature additions, or even documentation improvements, all forms of help are appreciated.
- Fork the project
- Create your feature branch (git checkout -b feature/NewFeature)
- Commit your changes (git commit -m 'Add some NewFeature')
- Push to the branch (git push origin feature/NewFeature)
- Open a pull request
Distributed under the MIT License. See LICENSE for more information.
For any inquiries or clarifications related to this project, please contact zell_dev@hotmail.com.