Website-Text-Scraper

Simple API with a single endpoint to scrape a website and other pages in the same domain. The response separates the text from the header, footer and body of each page.

Running

Clone this repo and run:

go get
go run .

This will start the API at port 8080.

Running with docker

Get and run the image from Dockerhub:

docker run -p 8080:8080 mtrentz/website_text_scraper:latest

Usage

There is a single endpoint at /api/scrape/ that takes in a POST with url, depth, and max_requests. If not provided, depth defaults to 2 and max_requests defaults to 200. For unlimited depth or max_requests set them as -1.

Max requests is supposed to be used to limit the amount of scrapes, but it doesn't guarantee that no more than the exact amount of requests will be made.

Example

curl -X POST http://localhost:8080/api/scrape/ -d '{"url":"https://www.bbc.com/", "max_requests":5}' -H "Content-Type: application/json"

Result summary:

{
    "url": "https://www.bbc.com/",
    "page_amount": 5,
    "visited_at": "2022-07-27 19:23:55",
    "pages": [
        {
            "url": "https://www.bbc.com/",
            "header": "Home\nNews\nSport\nWeather\niPlayer\nSounds\nBitesize\nCBeebies\nCBBC\nFood\nHome\nNews\nSport\nReel\nWorklife ...",
            "text": "BBC Homepage\nGas prices soar as Russia cuts German supply\nThe Nord Stream 1 pipeline is now operating at just ...",
            "footer": "\n\n\n\nHome\nNews\nSport\nWeather\niPlayer\nSounds\nBitesize\nCBeebies\nCBBC\nFood\nHome\nNews\nSport\nReel\n ...",
            "visited_at": "2022-07-27 19:23:56"
        },
        ...
    ]
}

Passsing a depth of 2 and unlimited requests

curl -X POST http://localhost:8080/api/scrape/ -d '{"url":"https://www.bbc.com/", "depth":2, "max_requests":-1}' -H "Content-Type: application/json"

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
controllers		controllers
logging		logging
routes		routes
scraper		scraper
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
go.mod		go.mod
go.sum		go.sum
main.go		main.go

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Website-Text-Scraper

Running

Running with docker

Usage

Example

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Website-Text-Scraper

Running

Running with docker

Usage

Example

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages