<font size="6"> <center>Quick Guide to Scraping the Dark Web</center> </font>
<font size="4"> <center>tinjano@github, tinjano.github.io</center> </font>

# Introduction
As with the clear web, we may wish to scrape the dark web for data, though usually of a different kind and for different purposes. While HTML pages on the clear web can be fetched directly with well-known Python libraries, .onion sites are reachable only through the Tor network, so communicating with them requires a bit more work.

Content scraped from the dark web may be disturbing or illegal, so readers should exercise discretion. This guide only aims to provide a simple way to access the HTML content of .onion pages. These pages may be sporadically unavailable, and not much can be done about that. For more complex interaction with the Tor network or its services, additional specialized libraries may be used.

# Starting the Tor service
The Tor service is the daemon that allows us to route traffic through the Tor network. The simplest way to start a session is to launch the [Tor browser](https://www.torproject.org/download/tor/).

# Identifying the Tor ports
Typically, the Tor browser uses port 9150 as the SOCKS port to route traffic, while a standalone Tor service uses port 9050. The corresponding control ports are 9151 (Tor browser) and 9051 (standalone). If there are issues, tools like `netstat` can show which ports are actually listening.

In [55]:
!netstat -tuln | grep 9150

tcp        0      0 127.0.0.1:9150          0.0.0.0:*               LISTEN     
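
The same check can be done from Python, which is handy when `netstat` is not installed. The following is a minimal sketch using only the standard library; the helper name `port_is_listening` is hypothetical, and the port list simply repeats the typical values mentioned above.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is accepting connections on that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Probe the usual SOCKS ports (9050, 9150) and control ports (9051, 9151).
for port in (9050, 9150, 9051, 9151):
    print(port, port_is_listening("127.0.0.1", port))
```

A port that reports `True` only tells us that something is listening there; it does not prove that the listener is Tor.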


# Setting up proxies
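We can make requests to .onion sites with our usual libraries, but we need to set up SOCKS proxies. We will consider `requests` and HTTPX. The former has a flaw that may surface with .onion addresses whose base URL is longer than 64 characters, as the underlying socks module cannot resolve those. HTTPX does not rely on the socks module and also offers additional capabilities such as asynchronous requests. Both libraries need their optional SOCKS support installed (`requests[socks]` and `httpx[socks]`).

The syntax differs slightly for the two options: `requests` keys the dictionary by scheme name and uses the `socks5h` scheme, whose trailing `h` asks the proxy to resolve hostnames (necessary for .onion addresses), whereas HTTPX keys it by URL prefix.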

In [2]:
proxies_requests = {
    'http': 'socks5h://localhost:9150',
    'https': 'socks5h://localhost:9150'
}

proxies_httpx = {
    'http://': 'socks5://localhost:9150',
    'https://': 'socks5://localhost:9150'
}
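
If the SOCKS port differs between setups (9150 for the Tor browser, 9050 for a standalone service), it may be convenient to build both dictionaries from a single port number. The helper below is only a sketch; the name `tor_proxies` and its defaults are assumptions, not part of either library.

```python
def tor_proxies(port: int = 9150, host: str = "localhost") -> tuple:
    """Build the proxy mappings for requests and HTTPX for one SOCKS port."""
    # requests: keys are scheme names; socks5h lets the proxy resolve hostnames.
    proxies_requests = {
        "http": f"socks5h://{host}:{port}",
        "https": f"socks5h://{host}:{port}",
    }
    # HTTPX: keys are URL prefixes, as in the dictionary above.
    proxies_httpx = {
        "http://": f"socks5://{host}:{port}",
        "https://": f"socks5://{host}:{port}",
    }
    return proxies_requests, proxies_httpx


# Example: mappings for a standalone Tor service on port 9050.
standalone_requests, standalone_httpx = tor_proxies(9050)
```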

# Making requests
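This is likely all it takes. Let us send a request to the CIA's .onion website. Note that recent HTTPX releases deprecate the `proxies` argument in favor of a single `proxy` argument, so the HTTPX call may need adjusting depending on the installed version.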

In [3]:
import requests
import httpx
url = 'http://ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion/'

In [4]:
requests.get(url, proxies=proxies_requests).headers

{'Date': 'Mon, 18 Dec 2023 20:03:57 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Akamai-Transformed': '9 270289 0 pmb=mNONE,1', 'Set-Cookie': '_session_=6156A598990E52B861B6AE606FDA9D19; path=/; domain=ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion; secure; HttpOnly', 'ID': '+IqqZPK95qWWBNyt7oTfaYLjaw+z4ouDbVCyqld7vfBFIBEcssI11sq3aLqE7LBN', 'SESSION': 'ZgOOonym+B0EFefFXMSKK4plBsR/LbsRyB7i5Frwu4mXE451saQsLV5nRORZ7jqH8JeTA3pz7ij1f/3BoV2C7A=='}

In [5]:
httpx.get(url, proxies=proxies_httpx).headers

Headers({'date': 'Mon, 18 Dec 2023 20:04:06 GMT', 'content-type': 'text/html', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'x-akamai-transformed': '9 270289 0 pmb=mNONE,1', 'set-cookie': '_session_=4D0513FCC09E64689016548129ED100C; path=/; domain=ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion; secure; HttpOnly', 'id': 'Tk3WyVSBM4Bx+8vUe8wGRPYoJiJA2yPgn+zs6LGn7M76mykpTaoNVzFldLzrIQ5N', 'session': 'G0sLqN4ctIisuE4nfHfS8sbwmImDReSybHybhx9EJCFAFUFTzIvq/F7v0t2xOnF4/HIPWMg9ApPMANGc+Vp23A=='})
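
Since .onion pages may be sporadically unavailable, a request that fails once may succeed a little later. The following is a minimal retry sketch, not a feature of either library: the name `fetch_with_retries` and its parameters are hypothetical, and the wrapped call is passed in as a function so the same helper works with `requests` or HTTPX.

```python
import time


def fetch_with_retries(fetch, attempts=3, delay=5.0, retry_on=(Exception,)):
    """Call fetch() until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on as error:
            last_error = error
            # Wait before the next try, but not after the final failure.
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Possible usage with the objects defined earlier (timeout value is a guess):
# response = fetch_with_retries(
#     lambda: requests.get(url, proxies=proxies_requests, timeout=60),
#     retry_on=(requests.RequestException,),
# )
```

Passing an explicit `timeout` is worth considering, since connections over Tor can otherwise hang for a long time.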

# Resetting your identity
Resetting your identity can be done directly from the command line, although some form of authentication may be required. Here is an example using the control authentication cookie. We will read it with Python's `open`, though it can also be read with a terminal command. After that, we may use specialized libraries or the terminal directly.

In [79]:
with open('/home/tinjano/tor-browser/Browser/TorBrowser/Data/Tor/control_auth_cookie', 'rb') as file:
    control_auth_cookie = file.read().hex()

control_auth_cookie

'9de55840fa1ca47c7fe7d9eddf4453ce1818cfad52c3cbbb0b89dd73820fe4cb'

Alternatively, we can obtain it with the following command.

In [56]:
!cat /home/tinjano/tor-browser/Browser/TorBrowser/Data/Tor/control_auth_cookie | xxd -p

9de55840fa1ca47c7fe7d9eddf4453ce1818cfad52c3cbbb0b89dd73820f
e4cb


Now we can see information about the current circuits. Note that we are using control port 9151.

In [None]:
!echo -e "AUTHENTICATE {control_auth_cookie} \nGETINFO circuit-status \nQUIT" | nc localhost 9151 

To reset identity, we use the command `SIGNAL NEWNYM`. For all commands see [here](https://spec.torproject.org/control-spec/commands.html).

In [77]:
!echo -e "AUTHENTICATE {control_auth_cookie} \nSIGNAL NEWNYM \nQUIT" | nc localhost 9151 

250 OK
250 OK
250 closing connection


In [None]:
!echo -e "AUTHENTICATE {control_auth_cookie} \nGETINFO circuit-status \nQUIT" | nc localhost 9151 
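
The `echo | nc` pipeline used above can also be written in plain Python with the standard `socket` module. This is only a sketch under the same assumptions as the cells above (cookie authentication, control port 9151); the function names `build_control_payload` and `send_control_payload` are hypothetical. It terminates each command with CRLF, as the control protocol specifies.

```python
import socket


def build_control_payload(cookie_hex: str, commands: list) -> bytes:
    """Wrap commands between AUTHENTICATE and QUIT, one per CRLF-terminated line."""
    lines = [f"AUTHENTICATE {cookie_hex}", *commands, "QUIT"]
    return ("\r\n".join(lines) + "\r\n").encode("ascii")


def send_control_payload(payload: bytes, host: str = "127.0.0.1",
                         port: int = 9151, timeout: float = 10.0) -> str:
    """Send the payload to the control port and return everything sent back."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        # After QUIT the server closes the connection, so recv() returns b"".
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


# Possible usage with the cookie read earlier:
# payload = build_control_payload(control_auth_cookie, ["SIGNAL NEWNYM"])
# print(send_control_payload(payload))
```

A successful run should return one `250` line per command, matching the shell output shown earlier.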