# Onion Intel

This is the first version of my tool; it is just a rough sketch.

That's why I chose to use a Jupyter notebook, so I can debug every step.

## Libraries 

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import subprocess
import re
import os
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

## Updating the Database

Verifying if the Ahmia site is online. I chose to start with Ahmia because it provides access to its database of .onion sites.

In [None]:
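ahmia = "https://ahmia.fi/onions/"
# A timeout keeps the notebook from hanging if the site does not answer
response = requests.get(ahmia, timeout=60)

if response.status_code == 200:
    print("Ahmia is on!")
    urls_onion = response.content
else:
    # Any status other than 200 (not only 404) means the list could not be downloaded
    print(f"Ahmia is down or unreachable (HTTP {response.status_code})!")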

Downloading the Ahmia Database stored in the `/onions/` path. 

> The HTML needs to be processed to extract only the .onion site URLs.

In [None]:
soup = BeautifulSoup(response.content, 'html.parser')
# Split the content into lines and filter out empty lines
lines = soup.text.splitlines()
non_empty_lines = [line.strip() for line in lines if line.strip() != ""]
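
The cell above keeps every non-empty line of the page text, which may include lines that are not onion addresses. A minimal sketch of a stricter filter follows, assuming the list contains v3 onion addresses (56 base32 characters followed by `.onion`). The names `ONION_V3_RE` and `extract_onion_hosts` and the sample hostnames are hypothetical and only serve as illustration.

```python
import re

# Assumption: v3 onion hostnames are 56 base32 characters (a-z, 2-7) plus ".onion"
ONION_V3_RE = re.compile(r"\b[a-z2-7]{56}\.onion\b")

def extract_onion_hosts(text):
    """Return the unique v3 .onion hostnames found in text, in order of first appearance."""
    seen = set()
    hosts = []
    for match in ONION_V3_RE.finditer(text.lower()):
        host = match.group(0)
        if host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts

# Hypothetical sample input, not real services
sample = (
    "http://" + "a" * 56 + ".onion/\n"
    "not an address\n"
    "http://" + "b2" * 28 + ".onion/index.html\n"
    "http://" + "a" * 56 + ".onion/\n"
)
print(extract_onion_hosts(sample))
```

The function could be applied to `soup.text` in place of the line-based filter if only hostnames are wanted.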

Saving the cleaned lines to `ahmia.txt`.

In [None]:
# Save the cleaned lines to a file
with open("ahmia.txt", "w") as file:
    for line in non_empty_lines:
        file.write(line + "\n")

Comparing the new entries with the previous ones to ensure checks are performed only on newly discovered sites.

In [None]:
# Read the entries from ahmia.txt
with open("ahmia.txt", "r") as file:
    ahmia_entries = set(line.strip() for line in file if line.strip())

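# Read the entries from ahmia_old.txt (on the first run the file does not exist yet, so start empty)
if os.path.isfile("ahmia_old.txt"):
    with open("ahmia_old.txt", "r") as file:
        ahmia_old_entries = set(line.strip() for line in file if line.strip())
else:
    ahmia_old_entries = set()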

# Find the new entries in ahmia.txt that are not in ahmia_old.txt
new_entries = ahmia_entries - ahmia_old_entries

# Save the new entries to a file (e.g., new_entries.txt)
with open("new_entries.txt", "w") as file:
    for entry in new_entries:
        file.write(entry + "\n")

print(f"Found {len(new_entries)} new entries.")
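
The comparison above is a plain set difference. A minimal sketch of the same idea as a reusable helper follows; it treats a missing file as an empty set, so it also works on a first run. The function names `load_entries` and `find_new_entries` and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path

def load_entries(path):
    """Read a file into a set of non-empty, stripped lines; a missing file yields an empty set."""
    file_path = Path(path)
    if not file_path.is_file():
        return set()
    with file_path.open("r") as handle:
        return {line.strip() for line in handle if line.strip()}

def find_new_entries(current_path, previous_path):
    """Return the entries present in current_path but absent from previous_path."""
    return load_entries(current_path) - load_entries(previous_path)

# Demonstration with throwaway files and made-up hostnames
with tempfile.TemporaryDirectory() as tmp:
    current = Path(tmp) / "current.txt"
    previous = Path(tmp) / "previous.txt"
    current.write_text("one.onion\ntwo.onion\n\nthree.onion\n")
    previous.write_text("one.onion\n")
    # Entries that appear only in the current list
    print(sorted(find_new_entries(current, previous)))
    # With no previous file, every current entry counts as new
    print(len(find_new_entries(current, Path(tmp) / "missing.txt")))
```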


Finally, the file `ahmia_old.txt` is updated.

In [None]:
# Combine old and new entries
combined_entries = ahmia_old_entries.union(new_entries)

# Save the combined entries back to ahmia_old.txt
with open("ahmia_old.txt", "w") as file:
    for entry in sorted(combined_entries):
        file.write(entry + "\n")

print(f"Updated 'ahmia_old.txt' with {len(new_entries)} new entries.")

## TOR

If TOR is already running, this part is not necessary.
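
Whether something is already listening on the proxy port can be checked from Python before starting a new process. This is a minimal sketch, assuming the SOCKS proxy listens on `127.0.0.1:9050` (the address used as the HTTPX proxy in this notebook). The helper name `is_port_open` is hypothetical.

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

# Assumption: the SOCKS proxy listens on 127.0.0.1:9050
if is_port_open("127.0.0.1", 9050):
    print("Something is already listening on 127.0.0.1:9050; the proxy may already be running.")
else:
    print("Nothing is listening on 127.0.0.1:9050; the proxy probably needs to be started.")
```

An open port only shows that some service is listening there, not that the service is the expected proxy.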

In [None]:
# Start Tor using subprocess
tor_process = subprocess.Popen(['tor'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)

In [None]:
!netstat -tupan | grep 9050

In [None]:
!netstat -tupan | grep 9053

In [None]:
# Only for debugging
#tor_process.terminate()

## HTTPX

Kali Linux already includes a different binary named `httpx`. Therefore, `pdtm` (the ProjectDiscovery tool manager) is used to locate the installation path of the ProjectDiscovery `httpx`.

In [None]:
# Run the PDTM command and capture both stdout and stderr
process = subprocess.Popen('pdtm', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()

# Convert the output to string and combine stdout and stderr
output = stdout.decode('utf-8') + stderr.decode('utf-8')

# Find the line containing the path
match = re.search(r'Path to download project binary: (.*)', output)

# Extract the path and store it in a variable
if match:
    pdtm_path = match.group(1)
    print(f"PDTM Path: {pdtm_path}")
else:
    print("Path not found in the output.")
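
Command-line tools sometimes colorize their output with ANSI escape sequences, which would end up inside the captured path. Whether `pdtm` does so in this setup is an assumption. The sketch below shows the same regular expression applied after stripping such sequences; `extract_pdtm_path`, `ANSI_RE`, and the sample text are hypothetical.

```python
import re

# Matches ANSI color/formatting escape sequences such as "\x1b[32m"
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

def extract_pdtm_path(output):
    """Return the binary path reported in the given text, or None if the line is absent."""
    clean = ANSI_RE.sub("", output)
    match = re.search(r"Path to download project binary: (.*)", clean)
    return match.group(1).strip() if match else None

# Hypothetical sample text; the real pdtm output may differ
sample_output = (
    "\x1b[34m[INF]\x1b[0m Path to download project binary: "
    "/home/user/.pdtm/go/bin\x1b[0m\n"
)
print(extract_pdtm_path(sample_output))
```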


In [None]:
# Check if new_entries.txt exists and is readable
if os.path.isfile("new_entries.txt") and os.access("new_entries.txt", os.R_OK):
    print("new_entries.txt exists and is readable.")
else:
    print("new_entries.txt is missing or not readable.")

Jupyter Notebook allows shell commands to be executed with the `!` prefix, which is how HTTPX is run from within the notebook.

HTTPX then writes its results to a CSV file.

In [None]:
# Run the httpx command directly in Jupyter
!{pdtm_path}/httpx -l new_entries.txt --proxy socks5://127.0.0.1:9050 -timeout 50 --title -ss -esb -ehb -silent -follow-redirects -csv -o new_entries_httpx.csv
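
The `!` syntax only works inside IPython or Jupyter. For a plain Python script, the same invocation can be sketched as an argument list for `subprocess.run`. The flags are copied unchanged from the command above; the function name `build_httpx_command` and the sample installation path are hypothetical.

```python
import os
import shlex

def build_httpx_command(pdtm_path, input_file, output_file, proxy="socks5://127.0.0.1:9050"):
    """Build the HTTPX argument list with the same flags as the notebook's shell command."""
    return [
        os.path.join(pdtm_path, "httpx"),
        "-l", input_file,
        "--proxy", proxy,
        "-timeout", "50",
        "--title", "-ss", "-esb", "-ehb",
        "-silent", "-follow-redirects",
        "-csv", "-o", output_file,
    ]

# Hypothetical installation path used only for illustration
command = build_httpx_command("/home/user/.pdtm/go/bin", "new_entries.txt", "new_entries_httpx.csv")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing a list instead of a single string avoids shell quoting problems if a path contains spaces.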

## Creating the DataFrame

The CSV file created by HTTPX is used to create a DataFrame with the pandas library.

In [None]:
# Use pandas to read the CSV file
csv_file = "new_entries_httpx.csv"
try:
    df = pd.read_csv(csv_file, encoding='ISO-8859-1')
    #print(df.head())
except FileNotFoundError:
    print(f"File {csv_file} not found.")

However, the CSV contained numerous columns, many of which were unnecessary. A new DataFrame was created with only the selected columns of interest.

In [None]:
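# .copy() makes df_base an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
df_base = df[['timestamp','port','url','input','title','webserver','content_type','method','host','path','tech','words','lines','status_code','content_length','stored_response_path','screenshot_path_rel']].copy()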

In [None]:
# Create 'stored_response_path_rel' column by extracting the relative path
df_base['stored_response_path_rel'] = df['stored_response_path'].apply(lambda x: x.split('OnionIntel/output/response/')[-1] if pd.notnull(x) else '')
# Drop the old 'stored_response_path' column
df_base = df_base.drop(columns=['stored_response_path'])

In [None]:
df_base

A new column named `Onion_Site` is created because it will serve as the key for merging with the DataFrame built from the `NMAP` results later on.

In [None]:
#  Create the 'Onion_Site' column by extracting the .onion part of the 'url' column
df_base['Onion_Site'] = df_base['url'].apply(lambda url: urlparse(url).netloc if urlparse(url).netloc.endswith(".onion") else None)

# Display the updated DataFrame
print(df_base[['url', 'Onion_Site']].head())  # Displaying only the 'url' and 'Onion_Site' columns
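
One detail of the extraction above: `netloc` keeps an explicit port, so a URL such as `http://host.onion:8080/` would not end with `.onion` and would yield `None`. A minimal sketch of a helper based on `hostname`, which drops the port, is shown below. The name `extract_onion_host` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

def extract_onion_host(url):
    """Return the .onion hostname of url, or None when url is missing or not an onion address."""
    if not isinstance(url, str):
        return None  # e.g. NaN values that pandas uses for empty cells
    host = urlparse(url).hostname  # lowercased, without port or credentials
    if host and host.endswith(".onion"):
        return host
    return None

# Hypothetical sample URLs
print(extract_onion_host("http://exampleaddress.onion:8080/login"))
print(extract_onion_host("https://example.com/"))
print(extract_onion_host(float("nan")))
```

It could be applied with `df_base['url'].apply(extract_onion_host)`.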

## NMAP

Saving all onion sites into a file, so they can be passed to the NMAP scan.

In [None]:
# Open a new file to store the .onion URLs
onion_file = "new_entries_onion_sites.txt"

with open(onion_file, "w") as file:
    # Loop over the DataFrame
    for index, row in df_base.iterrows():
        # Extract the URL
        url = row['url']
        
        # Parse the URL to extract the domain
        parsed_url = urlparse(url)
        onion_domain = parsed_url.netloc  # This gets the domain without http:// or trailing /

        # Check if the domain ends with '.onion'
        if onion_domain.endswith(".onion"):
            # Write only the .onion domain (not the full URL) to the file
            file.write(onion_domain + "\n")

print(f"All .onion URLs have been written to {onion_file}")


In [None]:
!sudo proxychains nmap --top-ports 25 -sT -Pn -v --open -iL new_entries_onion_sites.txt -oX new_entries_nmap.xml

## NMAP XML to Dataframe

Finally, the NMAP output is converted into a DataFrame so it can be merged with the previous DataFrame.

In [None]:
# Path to the Nmap XML file
xml_file = 'new_entries_nmap.xml'

# Parse the XML file
tree = ET.parse(xml_file)
root = tree.getroot()

# List to store data
nmap_data_fixed = []

# Iterate over each host in the Nmap XML output
for host in root.findall('host'):
    # Get the IP address
    ip = host.find('address').get('addr')
    
    # Get the status (up/down)
    status = host.find('status').get('state')
    
    # Initialize a variable to store the onion site (if found)
    onion_site = None
    
    # Iterate through all hostname elements to find .onion domains
    for hostname_elem in host.findall('hostnames/hostname'):
        hostname = hostname_elem.get('name')
        if hostname.endswith(".onion"):
            onion_site = hostname  # Prioritize .onion hostname
    
    # Iterate over ports and extract relevant info
    for port in host.findall('ports/port'):
        port_id = port.get('portid')
        protocol = port.get('protocol')
        state = port.find('state').get('state')
        service_name = port.find('service').get('name') if port.find('service') is not None else "Unknown"
        
        # Append the extracted data to the list, making sure we include the .onion site
        nmap_data_fixed.append({
            'Onion_Site': onion_site if onion_site else "Unknown",
            'Port_Nmap': port_id,
            'Protocol_Nmap': protocol,
            'State_Nmap': state,
            'Service_Nmap': service_name
        })

# Create a DataFrame from the extracted data
df_nmap_fixed = pd.DataFrame(nmap_data_fixed)
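
To make the expected XML layout concrete, the sketch below runs the same kind of extraction on a small inline report. The XML is a hypothetical, heavily trimmed example with a made-up hostname and a documentation IP address, and `parse_nmap_ports` is a hypothetical helper name. Unlike the cell above, it takes the first `.onion` hostname found for a host.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed Nmap XML report for a single host
SAMPLE_XML = """<nmaprun>
  <host>
    <status state="up"/>
    <address addr="192.0.2.1" addrtype="ipv4"/>
    <hostnames>
      <hostname name="exampleaddress.onion" type="user"/>
    </hostnames>
    <ports>
      <port protocol="tcp" portid="80">
        <state state="open"/>
        <service name="http"/>
      </port>
      <port protocol="tcp" portid="443">
        <state state="open"/>
      </port>
    </ports>
  </host>
</nmaprun>"""

def parse_nmap_ports(xml_text):
    """Return one dict per port, tagged with the host's .onion name (or "Unknown")."""
    rows = []
    root = ET.fromstring(xml_text)
    for host in root.findall("host"):
        names = [h.get("name", "") for h in host.findall("hostnames/hostname")]
        onion_site = next((n for n in names if n.endswith(".onion")), "Unknown")
        for port in host.findall("ports/port"):
            service = port.find("service")
            rows.append({
                "Onion_Site": onion_site,
                "Port_Nmap": port.get("portid"),
                "Protocol_Nmap": port.get("protocol"),
                "State_Nmap": port.find("state").get("state"),
                # A port without a <service> element gets a placeholder name
                "Service_Nmap": service.get("name") if service is not None else "Unknown",
            })
    return rows

for row in parse_nmap_ports(SAMPLE_XML):
    print(row)
```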

In [None]:
# Merge df_nmap_fixed and df_base using the 'Onion_Site' column
df_merged = pd.merge(df_base, df_nmap_fixed, on='Onion_Site', how='inner')


# Optionally, save the merged DataFrame to a CSV file
df_merged.to_csv('merged_onion_nmap_results.csv', index=False)
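
An inner merge silently drops every site that has no matching NMAP row. To see which sites would be lost, a left merge with pandas' `indicator=True` option can be used as a check. The sketch below uses two hypothetical miniature DataFrames (`df_sites`, `df_ports`) with made-up values instead of the real data.

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames
df_sites = pd.DataFrame({
    "Onion_Site": ["first.onion", "second.onion"],
    "title": ["First", "Second"],
})
df_ports = pd.DataFrame({
    "Onion_Site": ["first.onion", "first.onion"],
    "Port_Nmap": ["80", "443"],
})

# how="left" keeps every site; indicator=True adds a "_merge" column showing the origin of each row
df_check = pd.merge(df_sites, df_ports, on="Onion_Site", how="left", indicator=True)
print(df_check)

# Sites without any Nmap row; these would be dropped by how="inner"
missing = df_check.loc[df_check["_merge"] == "left_only", "Onion_Site"].tolist()
print(missing)
```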

In [None]:
# Load the new and old merged results
new_file = "merged_onion_nmap_results.csv"
old_file = "merged_onion_nmap_results_old.csv"

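df_new = pd.read_csv(new_file)
# On the first run there is no previous results file yet, so start from an empty DataFrame
df_old = pd.read_csv(old_file) if os.path.isfile(old_file) else pd.DataFrame()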

# Concatenate the old and new DataFrames
df_combined = pd.concat([df_old, df_new], ignore_index=True)

# Drop duplicate entries based on site and port, because the merged data holds one row per open port
df_combined = df_combined.drop_duplicates(subset=['Onion_Site', 'Port_Nmap'], keep='last')

# Save the merged results back to a CSV file for future executions
df_combined.to_csv("merged_onion_nmap_results_old.csv", index=False)

print("Updated results saved as merged_onion_nmap_results_old.csv for future executions.")