# Browser Fingerprinting and Attack on Canvas Fingerprint Defender

Group Members:

1: User one
2: User two

## 1. Introduction

- Browser Fingerprinting:

A tracking technique that collects unique browser/device attributes (user agent, screen resolution, installed fonts, canvas rendering, WebGL, etc.) to create a "fingerprint" for user identification. Canvas fingerprinting specifically exploits subtle rendering differences in HTML5 Canvas to track users across sites.

- Canvas Fingerprint Defender:

Privacy tools like Canvas Defender inject controlled noise into canvas rendering output, generating randomized or altered fingerprints each time to prevent consistent tracking. This breaks the stability required for effective fingerprinting.

- Attack Objective:

This project attempts to reverse-engineer the original fingerprint by analyzing multiple noisy samples from the same user. The hypothesis is that averaging out the noise across samples can reconstruct the true fingerprint, defeating the defender's protection.
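
To make the defender's effect concrete, the sketch below simulates it on a toy pixel buffer. It is only an illustration under stated assumptions: the helper names `canvas_hash` and `add_noise`, the +/-1 noise model, and the choice of SHA-1 (picked because the dataset's hashes are 40 hex characters long) are hypothetical and not taken from any real defender.

```python
import hashlib
import random

def canvas_hash(pixels: bytes) -> str:
    """Hash a raw pixel buffer, standing in for hashing a canvas export."""
    return hashlib.sha1(pixels).hexdigest()

def add_noise(pixels: bytes, rng: random.Random, n_changes: int = 3) -> bytes:
    """Nudge a few randomly chosen bytes by +/-1 (hypothetical noise model)."""
    noisy = bytearray(pixels)
    for index in rng.sample(range(len(noisy)), n_changes):
        noisy[index] = (noisy[index] + rng.choice((-1, 1))) % 256
    return bytes(noisy)

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(64))  # toy 4x4 RGBA "canvas"
true_hash = canvas_hash(original)
noisy_hashes = [canvas_hash(add_noise(original, rng)) for _ in range(5)]

print("true :", true_hash)
for h in noisy_hashes:
    print("noisy:", h)
```

Even though each noisy buffer differs from the original in only three bytes, every printed hash is different from the true one, which is the instability the defender relies on.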

## 2. Loading the FPStalker Dataset
- The FPStalker dataset:

FPStalker is a public dataset containing longitudinal browser fingerprints collected from real users. It includes attributes like canvasJSHashed, userAgentHttp, and fontsFlashHashed.

- Load it into a Pandas DataFrame.

The dataset is stored as an SQL dump split into multiple files. After downloading and merging these files, regular expressions extract the rows from the INSERT statements and load them into a Pandas DataFrame.

- Analyze the dataset structure.

Right after downloading the dataset and loading it into a Pandas DataFrame, we double-checked the resulting file for inconsistencies. As a bonus, we exported a CSV file so that the data can be inspected visually in a spreadsheet application such as Excel.
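
The quote-aware splitting used when parsing the INSERT statements can be illustrated on a toy value list. This is a minimal sketch: the input string and the helper name `parse_values` are made up for illustration, and the approach assumes values contain no escaped single quotes, which would break the quote-parity rule.

```python
import re

# Split on commas that are followed by an even number of single quotes,
# i.e. commas that sit outside any quoted string.
QUOTE_AWARE_COMMA = re.compile(r",(?=(?:[^']*'[^']*')*[^']*$)")

def parse_values(raw):
    """Turn the inside of one SQL value tuple into a list of Python values."""
    fields = []
    for token in QUOTE_AWARE_COMMA.split(raw):
        token = token.strip()
        # An unquoted NULL becomes None; everything else loses its quotes
        fields.append(None if token.upper() == "NULL" else token.strip("'"))
    return fields

raw = "1,'Mozilla/5.0 (X11, Linux)',NULL,'a,b'"  # hypothetical tuple content
fields = parse_values(raw)
print(fields)
```

The commas inside the two quoted strings are preserved, so the toy input yields four fields rather than six.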

In [1]:
# Downloading the FPStalker dataset
!wget https://github.com/Spirals-Team/FPStalker/archive/refs/heads/master.zip

--2025-03-06 11:56:34--  https://github.com/Spirals-Team/FPStalker/archive/refs/heads/master.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/Spirals-Team/FPStalker/zip/refs/heads/master [following]
--2025-03-06 11:56:34--  https://codeload.github.com/Spirals-Team/FPStalker/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [        <=>         ] 136.48M  3.73MB/s    in 46s     

2025-03-06 11:57:21 (2.94 MB/s) - ‘master.zip’ saved [143109792]



In [3]:
# Unzip the dataset
!unzip -q master.zip -d FPStalker_data

# Combine the split data files into one SQL file (as instructed in the README of the GitHub repository)
!tar -xzf FPStalker_data/FPStalker-master/extension1.txt.tar.gz
!tar -xzf FPStalker_data/FPStalker-master/extension2.txt.tar.gz
!cat extension1.txt extension2.txt > FPStalker_data/tableFingerprints.sql

tar: Ignoring unknown extended header keyword 'SCHILY.fflags'
tar: Ignoring unknown extended header keyword 'SCHILY.fflags'


In [3]:
import pandas as pd
import re

# Reading the SQL dump file
file_path = "FPStalker_data/tableFingerprints.sql"
data = []
columns = []

with open(file_path, "r", encoding="utf-8") as file:
    for line in file:
        # Extract column names from the first INSERT statement
        if "INSERT INTO" in line and not columns:
            column_match = re.search(r"INSERT INTO `\w+`\s*\((.*?)\)\s*VALUES", line)
            if column_match:
                columns = [col.strip(" `") for col in column_match.group(1).split(",")]

        # Extract values from INSERT INTO statements
        values = re.findall(r"\((.*?)\),", line)
        for value in values:
            # Use regex to split while handling quoted strings correctly
            row = re.split(r",(?=(?:[^']*'[^']*')*[^']*$)", value)
            row = [val.strip(" '") if val.upper() != "NULL" else None for val in row]  # Clean up values

            # Ensure row matches the number of columns (handle malformed cases)
            if len(row) == len(columns):
                data.append(row)

# Convert to DataFrame using extracted column names
df = pd.DataFrame(data, columns=columns)

# Drop the first row if it merely repeats the column names
if df.iloc[0].tolist() == columns:
    df = df.iloc[1:].reset_index(drop=True)

# Let pandas infer the best available dtype for each column
df = df.convert_dtypes()

# Show first 5 rows
print(df.head())

# Save to CSV for debugging
#df.to_csv("cleaned_data.csv", index=False)
#print("Data cleaned and saved to 'cleaned_data.csv'")


  counter                                    id  \
0       1  f4f25187-82c7-4265-9bf9-71bc8ed1701f   
1       3  4d49cfb2-841d-4a04-81cb-7d1b39a507b4   
2       4  7ec277be-f8fd-4ea9-ab89-b2f7d4dacaf6   
3       6  3ea176b5-6e50-402a-bd8f-adbfb1b30fb6   
4      10  10c196fc-2c76-4e4b-b628-4472949c39c3   

                                addressHttp         creationDate  \
0  968635a6aa99d94b3a44e361d75ff06ce018dfbf  2015-10-15 14:00:00   
1  21e37622bb08c2274bfa286040888a7139dce4d3  2015-10-15 22:00:00   
2  81a4e7916c9c86c687b76ad16b3c56034197f467  2015-10-16 00:00:00   
3  cf05b3719bc99cc643ff2dd9204d1af6c91477ec  2015-10-16 11:00:00   
4  cd7c5f426fe08beea2abdb6c25acf817c61bb7a5  2015-10-16 21:00:00   

            updateDate              endDate  \
0  2015-10-16 14:00:00  2015-10-16 19:00:00   
1  2015-10-16 14:00:00  2015-10-17 04:00:00   
2  2015-10-16 04:00:00  2015-10-16 08:00:00   
3                 NULL  2015-10-19 07:00:00   
4                 NULL  2015-10-19 07:00:00   

 

## 3. Detecting Noise in the Dataset
- Identify inconsistencies in fingerprints:

Inconsistencies are detected by finding browser IDs (the `id` column) that have more than one distinct canvasJSHashed value, which this analysis treats as a sign of noise injection. In this dataset, 247 browsers exhibited varying hashes.

- Check if browserID has multiple different canvasJSHashed values:

The methodology groups fingerprints by `id` and counts the unique canvasJSHashed entries; a count greater than one is taken to signal noise. This variability is consistent with a defender actively injecting noise, and it also provides multiple samples per browser that a noise-averaging attack could try to exploit.

In [4]:
import pandas as pd

# Counting unique canvas hashes per browserID
fingerprint_variability = df.groupby("id")["canvasJSHashed"].nunique().reset_index()
fingerprint_variability.columns = ["id", "unique_canvas_hashes"]

# Show cases where canvas hashes vary for the same browserID
noisy_browsers = fingerprint_variability[fingerprint_variability["unique_canvas_hashes"] > 1]

# Print the noisy browsers (pandas truncates long output automatically)
print(noisy_browsers)


                                        id  unique_canvas_hashes
3     00ce2756-a6dd-4b6b-8ca6-4a7faf307335                     2
8     01d2403b-1115-4498-b12b-c26cfcb3e869                     2
11    0270f61a-2c1c-4603-9dba-950e1c4bc7da                     2
16    02fd491b-74a2-4614-a91b-edd89fd5176c                     2
18    03482a40-dded-4b32-9d4f-a3a802126936                     2
...                                    ...                   ...
1370  f8c53f6a-367e-48e9-99bf-c681ccc1106c                     2
1372  f9508a08-908d-4127-80ff-259fd51304ad                     2
1393  fd026db8-dad8-4f55-8016-d469844b2862                     2
1403  fff6a12d-aab1-4669-9a2e-c369bc300485                     2
1404  fff9db58-c20e-4f2c-b07d-9937b70a1273                     2

[247 rows x 2 columns]


## 4. Implementing the Attack
- Multi-sample averaging to restore original fingerprint:

The multi-sample averaging process involves converting hexadecimal hashes to integers, computing their mean, and then converting the average back to hexadecimal. This approach assumes that the noise introduced around the true fingerprint is symmetrically distributed. 

- Evaluate results:

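The method yielded a 0% success rate. Hex hashes behave as categorical identifiers rather than numerical measurements: the noise is injected into the canvas output, which is then hashed, so a slightly perturbed canvas produces a completely different hash rather than a nearby number. The defender's noise may also be non-linear, for instance random bit flips. As a result, averaging the hash values produced a new value unrelated to the true fingerprint instead of cancelling the noise.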

In [5]:
import numpy as np
import pandas as pd

# Grouping fingerprints by browserID
grouped_fingerprints = df.groupby("id")["canvasJSHashed"].apply(list).to_dict()

# Function to "average" hashes (simplified)
def average_hash(hashes):
    """Simulates noise removal by averaging hash values."""
    # Convert hex hashes to integers, skipping missing values
    hash_nums = [int(h, 16) for h in hashes if h is not None]
    # The mean is a float, so only about 53 significant bits survive;
    # this is why the recovered hashes in the output end in zeros
    avg_hash = int(np.mean(hash_nums)) if hash_nums else 0
    # Convert back to hex (no zero-padding to the original 40 characters)
    return format(avg_hash, 'x')

# Apply averaging to every browser with more than one recorded fingerprint
# (this includes browsers whose hashes never vary)
recovered_fingerprints = {
    browser_id: average_hash(hashes) for browser_id, hashes in grouped_fingerprints.items() if len(hashes) > 1
}

# Convert results into DataFrame
recovered_df = pd.DataFrame(recovered_fingerprints.items(), columns=["id", "recovered_canvasJSHashed"])

# Save the recovered fingerprints DataFrame to a CSV file for inspection
recovered_df.to_csv("recovered_fingerprints.csv", index=False)

# Display the first few rows to verify
print(recovered_df.head())

                                     id  \
0  0062535e-d9ab-4cb0-ae8f-79133b440129   
1  00ce2756-a6dd-4b6b-8ca6-4a7faf307335   
2  0143636f-2f9e-45bc-b1f1-13b94a4245ce   
3  018ab6d2-9c65-459c-b7fe-4c4e641b13ba   
4  01d2403b-1115-4498-b12b-c26cfcb3e869   

                   recovered_canvasJSHashed  
0  7be7a9cb410b0800000000000000000000000000  
1  4e6ce7704287b800000000000000000000000000  
2  7d31761084701000000000000000000000000000  
3  a248bd3cead2f800000000000000000000000000  
4  8f3920c9818ca800000000000000000000000000  
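
The long runs of zeros in the recovered values come from averaging in floating point. The sketch below, which uses two made-up SHA-1 hashes rather than dataset values, contrasts a float mean with an exact integer mean. Neither result is meaningful as a fingerprint; the point is only to show where the zeros come from.

```python
import hashlib

# Two hypothetical observed hashes for one browser (SHA-1 of made-up inputs).
hashes = [hashlib.sha1(s).hexdigest() for s in (b"sample-a", b"sample-b")]
values = [int(h, 16) for h in hashes]

# True division yields a float with roughly 53 significant bits,
# so the low-order bits of the 160-bit value are lost.
float_mean = int(sum(values) / len(values))

# Floor division stays in integer arithmetic and keeps all 160 bits.
exact_mean = sum(values) // len(values)

float_hex = format(float_mean, "040x")  # zero-padded to 40 hex characters
exact_hex = format(exact_mean, "040x")
print("float mean:", float_hex)
print("exact mean:", exact_hex)
```

The exact mean keeps every digit, yet it is still just a number lying between the two inputs and matches neither of them, so fixing the precision loss alone would not rescue the attack.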


## 5. Conclusion
- Summarize findings:

The attack failed to recover the original fingerprints: the averaged value matched none of the recorded canvas hashes (0.00% success rate), which underscores the limitations of averaging hash values.

- Discuss implications for privacy:

For privacy, the result suggests that naive averaging over hashed values does not defeat the defender. Future work should explore alternative noise-removal strategies. For instance, one could take the most frequent hash value per browser as the representative fingerprint, or use entropy analysis to detect stable bits within the noisy hashes. Because hashing scrambles bit-level structure, a stable-bit analysis would likely need the raw canvas data rather than the hashed values. These methods may help isolate the true fingerprint from the intentionally injected noise.
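
The most-frequent-hash idea could be sketched as follows. The browser IDs and hash strings are hypothetical placeholders, and `most_frequent_hash` is an illustrative helper rather than part of this notebook.

```python
from collections import Counter

def most_frequent_hash(hashes):
    """Return the most common non-empty hash, or None if there is none."""
    counts = Counter(h for h in hashes if h)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical observations for two browsers.
observed = {
    "browser-1": ["aa11", "aa11", "bb22"],
    "browser-2": ["cc33", None, "cc33", "dd44"],
}
representative = {
    browser_id: most_frequent_hash(hashes)
    for browser_id, hashes in observed.items()
}
print(representative)
```

This only helps when one hash actually repeats; if a defender emits a fresh hash on every visit, all counts are one and the mode carries no information.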

In [6]:
# Merge recovered fingerprints with original dataset
merged_df = df.merge(recovered_df, on="id", how="left")

# Ensure both columns are strings and handle NaN values
merged_df["recovered_canvasJSHashed"] = merged_df["recovered_canvasJSHashed"].fillna("").astype(str)
merged_df["canvasJSHashed"] = merged_df["canvasJSHashed"].fillna("").astype(str)

# Row-wise check: does the recovered hash equal the hash recorded in this row?
merged_df["match"] = merged_df["recovered_canvasJSHashed"] == merged_df["canvasJSHashed"]

# Calculate success rate
success_rate = merged_df["match"].mean()
print(f"Success rate: {success_rate:.2%} of fingerprints recovered correctly")

Success rate: 0.00% of fingerprints recovered correctly
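
The cell above scores each row separately. A per-browser variant, which counts a browser as recovered when its estimate equals any of its observed hashes, might look like the sketch below. The data and the function name `per_browser_success` are hypothetical, and matching an observed hash is only a proxy, since the comparison is made against recorded hashes rather than a known noise-free fingerprint.

```python
def per_browser_success(observed, recovered):
    """Fraction of browsers whose recovered hash equals any observed hash."""
    if not recovered:
        return 0.0
    hits = sum(
        1
        for browser_id, estimate in recovered.items()
        if estimate in observed.get(browser_id, [])
    )
    return hits / len(recovered)

# Hypothetical data: one estimate coincides with an observed hash, one does not.
observed = {"browser-1": ["aa11", "bb22"], "browser-2": ["cc33", "dd44"]}
recovered = {"browser-1": "aa11", "browser-2": "ffff"}
rate = per_browser_success(observed, recovered)
print(f"Per-browser success rate: {rate:.2%}")
```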
