
Option to store http responses to file #185

Merged · 16 commits · Dec 8, 2022
Conversation

@edoardottt (Contributor) commented Nov 16, 2022

This PR adds the following two options (context: #177):

   -sr, -store-response              store http response to output directory
   -srd, -store-response-dir string  store http response to custom directory

For now the PR creates the output directory (katana_responses by default, configurable with the -srd option), then creates a subfolder for each domain (e.g. katana_responses/www.edoardoottavianelli.it), and finally writes a file named with the SHA-1 hash of the URL.
E.g.:
For the URL https://www.edoardoottavianelli.it/post/post6/post6.html, this is the file created: katana_responses/www.edoardoottavianelli.it/3270ccbc882c5239fcaa7c801503df606e8979be (sha1 hash of URL)

The problem is that the output is built upon the struct Result:

// Result is a result structure for the crawler
type Result struct {
	// Timestamp is the current timestamp
	Timestamp time.Time `json:"timestamp,omitempty"`
	// Method is the method for the result
	Method string `json:"method,omitempty"`
	// Body contains the body for the request
	Body string `json:"body,omitempty"`
	// URL is the URL of the result
	URL string `json:"endpoint,omitempty"`
	// Source is the source for the result
	Source string `json:"source,omitempty"`
	// Tag is the tag for the result
	Tag string `json:"tag,omitempty"`
	// Attribute is the attribute for the result
	Attribute string `json:"attribute,omitempty"`
}

and this struct is not suitable for this type of output. I would like to write something similar to meg's output:

▶ head -n 20 ./out/example.com/45ed6f717d44385c5e9c539b0ad8dc71771780e0
http://example.com/robots.txt

> GET /robots.txt HTTP/1.1
> Host: example.com

< HTTP/1.1 404 Not Found
< Expires: Sat, 06 Jan 2018 01:05:38 GMT
< Server: ECS (lga/13A2)
< Accept-Ranges: bytes
< Cache-Control: max-age=604800
< Content-Type: text/*
< Content-Length: 1270
< Date: Sat, 30 Dec 2017 01:05:38 GMT
< Last-Modified: Sun, 24 Dec 2017 06:53:36 GMT
< X-Cache: 404-HIT

<!doctype html>
<html>
<head>

Do you have any suggestions? @ehsandeep @Mzack9999

This PR closes #177.

@edoardottt edoardottt marked this pull request as draft November 16, 2022 11:35
@ehsandeep (Member) left a comment

Hi @edoardottt, thank you for working on this feature. We can keep this uniform with other PD tools (httpx/proxify/nuclei).

Here is an example format from httpx that we can replicate for katana.

echo example.com | httpx -sr
cat output/example.com.txt


HTTP/1.1 200 OK
Connection: close
Accept-Ranges: bytes
Age: 545788
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Sat, 19 Nov 2022 10:36:59 GMT
Etag: "3147526947"
Expires: Sat, 26 Nov 2022 10:36:59 GMT
Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
Server: ECS (dcb/7EEF)
Vary: Accept-Encoding
X-Cache: HIT

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</html>

@ehsandeep (Member) commented:

@edoardottt the above example was for the response format. For the filenames, we are updating httpx to follow the meg format in projectdiscovery/httpx#848, and that is something to adopt here as well.

@edoardottt marked this pull request as ready for review November 20, 2022 09:58

@edoardottt (Contributor, Author) commented Nov 20, 2022

Hi @ehsandeep, just a little context.
I'm trying to add the feature with as few changes as possible; however, there are some constraints.

  • In headless mode it's currently not possible to store the responses, because the responses provided in headless mode lack some important information (such as the status and the protocol used) and I don't know how to store them. It might be possible to do this, but of course the format of those responses won't be the same. Let me know how you want to handle that.

  • katana prints the results on the command line, but it doesn't perform an HTTP request for every one of them. Using a proxy I see 12 results on the CLI but only 6 requests made by katana, so the files in the responses folder and the results printed on the CLI don't match. This happens because, with a depth level of 2 (the default), katana will see some URLs at depth 3 and print them (since they are in scope) but won't crawl them further, and so won't perform HTTP requests for them.
    For this I've used https://www.edoardoottavianelli.it/ as the test case.
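The depth behavior described above can be sketched as follows: pages up to the maximum depth are fetched and their links are reported, while links discovered beyond that depth are reported but never requested. This is illustrative pseudologic, not katana's real crawler:

```go
package main

import "fmt"

// crawl sketches the reported-vs-fetched mismatch: a URL is reported as soon
// as it is discovered, but only fetched (requested over HTTP) if its depth
// is within maxDepth. Illustrative sketch, not katana's implementation.
func crawl(links map[string][]string, start string, maxDepth int) (reported, fetched []string) {
	type item struct {
		url   string
		depth int
	}
	seen := map[string]bool{start: true}
	reported = append(reported, start)
	queue := []item{{start, 1}}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		if cur.depth > maxDepth {
			continue // in scope and printed, but never requested
		}
		fetched = append(fetched, cur.url)
		for _, next := range links[cur.url] {
			if !seen[next] {
				seen[next] = true
				reported = append(reported, next)
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return reported, fetched
}

func main() {
	links := map[string][]string{"a": {"b"}, "b": {"c"}, "c": {"d"}}
	reported, fetched := crawl(links, "a", 2)
	fmt.Println(len(reported), len(fetched)) // more URLs reported than fetched
}
```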

Test command

echo "https://www.edoardoottavianelli.it/" | ./katana -sr -proxy http://127.0.0.1:8888

Katana CLI output:

...
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://www.edoardoottavianelli.it/blog.html
https://www.edoardoottavianelli.it/cve.html
https://www.edoardoottavianelli.it/aboutme.html
https://www.edoardoottavianelli.it/cv.html
https://www.edoardoottavianelli.it/
https://www.edoardoottavianelli.it/index.html
https://www.edoardoottavianelli.it/CVE-2022-44019/index.html
https://www.edoardoottavianelli.it/CVE-2022-41392/index.html
https://www.edoardoottavianelli.it/post/post7/post7.html
https://www.edoardoottavianelli.it/post/post6/post6.html
https://www.edoardoottavianelli.it/post/post5/post5.html
https://www.edoardoottavianelli.it/post/post1/post1.html

Proxify logs

/tmp/logs> head -n 1 www.edoardoottavianelli.it:443-*
==> www.edoardoottavianelli.it:443-cdsvqfe2g8ihqu0g3pdg.txt <==
GET / HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqfm2g8ihqu0g3pe0.txt <==
GET /cve.html HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqfu2g8ihqu0g3peg.txt <==
GET /blog.html HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqg62g8ihqu0g3pf0.txt <==
GET /aboutme.html HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqge2g8ihqu0g3pfg.txt <==
GET /cv.html HTTP/1.1

==> www.edoardoottavianelli.it:443-cdsvqgm2g8ihqu0g3pg0.txt <==
GET / HTTP/1.1

Demo

$> echo "https://projectdiscovery.io/" | ./katana -sr
$> cat katana_responses/index.txt

katana_responses/projectdiscovery.io/60da1e66fe7802e77cc27eb06d24509a936d2b25.txt https://projectdiscovery.io/ (200 OK)
katana_responses/projectdiscovery.io/ca419688a1b91baf51417038bcf5d170b73220ee.txt https://projectdiscovery.io/app.bundle.css (200 OK)
katana_responses/projectdiscovery.io/73cc72a0568d8943b4cc46ee8e258f538dd4f25d.txt https://projectdiscovery.io/app.js (200 OK)
$> head -n 35 katana_responses/projectdiscovery.io/60da1e66fe7802e77cc27eb06d24509a936d2b25.txt

https://projectdiscovery.io/


GET / HTTP/1.1
Host: projectdiscovery.io
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36


HTTP/1.1 200 OK
X-Timer: S1668940326.392399,VS0,VE114
Content-Type: text/html; charset=utf-8
Cache-Control: max-age=600
Strict-Transport-Security: max-age=0; preload
X-Content-Type-Options: nosniff
Date: Sun, 20 Nov 2022 10:32:06 GMT
Access-Control-Allow-Origin: *
Via: 1.1 varnish
Connection: keep-alive
Age: 0
X-Cache: HIT
X-Cache-Hits: 1
Cf-Ray: 76d0850fdec35a1f-MXP
Nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
Last-Modified: Mon, 07 Nov 2022 14:59:55 GMT
X-Served-By: cache-mxp6929-MXP
Vary: Accept-Encoding
Report-To: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=dZHfumZ2a17gP3V%2BnmXiDRAACRkQEBN6goKGQfmqvhUBR6J0cnBqJyyg80pJn7bFK6%2BgIEw2jw2t7MrJ4zpGRwJtlvOl0YduCrfFRVDMT9%2BdRTcVmBPhr21HQ2UTNJz0rTDgRg4%3D"}],"group":"cf-nel","max_age":604800}
Expires: Sun, 20 Nov 2022 09:16:04 GMT
X-Proxy-Cache: MISS
X-Fastly-Request-Id: 854ff2bf4c9674ae0a90974e6794e694baf0aeca
X-Github-Request-Id: C01A:122BC:245AB82:2582642:6379EDFC
Cf-Cache-Status: DYNAMIC
Server: cloudflare

<!doctype html><html lang="en"><head><meta charset="utf-8"/><meta content="width=device-width" name="viewport"/><title>Projectdiscovery.io</title><link rel="preconnect" href="https://fonts.gstatic.com"><link href="https://fonts.googleapis.com/css2?family=Nunito+Sans&family=Poppins&family=Montserrat&family=Open+Sans&display=swap" rel="stylesheet"><script async src="https://www.googletagmanager.com/gtag/js?id=UA-165996103-1"></script><script>var _iub = _iub || [];

@tarunKoyalwar linked an issue Nov 27, 2022 that may be closed by this pull request
@ehsandeep (Member) left a comment

@edoardottt we can handle headless responses in a separate ticket, as it requires further investigation.

And the CLI behavior you mentioned is expected: the purpose of the project is to get all the endpoints, and depending on the depth it will always print the response URLs in the output without visiting them. This behavior can be controlled when we introduce the -validate option in the future.

A minor bug identified by @wdahlenburg at #177 (comment) that we can fix in this PR: include the POST body when writing a response to disk.

@ehsandeep (Member) left a comment

@edoardottt I noticed the crawl time improved and was reduced by a lot. Is there anything specific you'd like to point out?

Thanks again for working on this.

@ShubhamRasal (Contributor) left a comment

lgtm - suggesting small change

pkg/output/output.go (review comment, resolved)
@ehsandeep merged commit 96210d8 into projectdiscovery:dev Dec 8, 2022
@edoardottt deleted the store-resp branch December 8, 2022 08:46