# ipv6scan

In a dataset of IPv6 scan results that we collected, we noticed that many network devices share identical lower 64 bits in their IPv6 addresses. Considering the most widely used IPv6 assignment mechanisms, such address collisions should be significantly less frequent than what we have observed. We want to find out the reason behind this. 

## Dataset collection

The IPv6 address space is astronomically large: 2^128, or about 3.4 × 10^38, addresses. (Even at 100 million probes per second, scanning all of them would take on the order of 10^23 years!) The difficulty is tremendous compared to scanning the IPv4 address space, which can be done in as little as five minutes with tools such as [ZMap](https://zmap.io/). Another important characteristic of the IPv6 address space is that it is extremely sparsely occupied -- only a tiny fraction of addresses have active devices behind them, and the vast majority remain unused. Notably, addresses are assigned following certain patterns. Therefore, while brute-forcing through the entire IPv6 address space is impractical, it is fortunately also unnecessary. Instead, we can narrow down the search space by targeting the regions that are likely to be occupied.
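As a back-of-the-envelope check on the scale, the sketch below computes how long an exhaustive scan would take, assuming exactly one probe per address and a sustained rate of 100 million probes per second. The function name and constants are illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def scan_time_seconds(address_bits: int, probes_per_second: float) -> float:
    """Time to send one probe to every address in a 2**address_bits space."""
    return 2 ** address_bits / probes_per_second


RATE = 100_000_000  # assumed probe rate: 100 million probes per second

ipv4_seconds = scan_time_seconds(32, RATE)
ipv6_years = scan_time_seconds(128, RATE) / SECONDS_PER_YEAR

print(f"IPv4: {ipv4_seconds:.1f} s")      # IPv4: 42.9 s
print(f"IPv6: {ipv6_years:.2e} years")    # IPv6: 1.08e+23 years
```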

The way we find the target space to scan is by... (TODO: finish this, how did we know about the home network router's addresses again?)

## MAC address assignment and semantics

MAC (Media Access Control) addresses are link-layer addresses that identify network devices (more specifically, the NIC on each device), and each one is intended to be *globally unique*. These addresses are 48 bits long, typically written as 12 hexadecimal digits (e.g., **02:04:7A**:*BB:28:FC*). The first six hexadecimal digits identify the manufacturer of the NIC, and the last six are meant to be a unique number assigned by the manufacturer.

TODO: however, devices can choose to randomly generate a MAC address, and maybe there are other ways to generate a MAC address, so whether the manufacturer-burned-in one is unique or not does not matter anymore. However, the chance of two lower 64 bits colliding is extremely low. (TODO: probably show the math) We still do not care about the addresses generated by SLAAC.
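A first pass at the math, as a sketch: if we assume the lower 64 bits are drawn uniformly and independently at random (an assumption, and exactly the one our observations call into question), the birthday bound gives the probability of at least one collision among `n` devices as approximately `1 - exp(-n(n-1) / 2^65)`. The function name below is illustrative.

```python
import math


def collision_probability(n_devices: int, id_bits: int = 64) -> float:
    """Birthday-bound approximation of P(at least one collision) among
    n_devices identifiers drawn uniformly at random from 2**id_bits values."""
    space = 2.0 ** id_bits
    exponent = n_devices * (n_devices - 1) / (2.0 * space)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


for n in (10**6, 10**8, 10**9):
    print(f"{n:>13,} devices: p = {collision_probability(n):.3g}")
```

Under this assumption, even a billion devices would see a collision only a few percent of the time, so many repeated identifiers in one dataset point to something other than chance.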

## IPv6 address semantics and assignment

IPv6 addresses are 128-bit network-layer addresses, typically written as eight groups of two bytes each in hexadecimal (e.g., **2001:0DB8:AC10:FE01**:*1234:56FF:FE78:0301*). A typical layout is:
- First 64 bits: Network and subnet identification
  - First 48 bits: Network prefix (identifies the overall network)
  - Next 16 bits: Subnet ID (identifies a specific subnet within that network)
- Last 64 bits: Host identification (identifies a specific device on that network)
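The layout above can be sketched with Python's standard `ipaddress` module. The 48/16/64 split is the typical layout described here, not something every network follows, and `split_ipv6` is a helper name made up for this example.

```python
import ipaddress


def split_ipv6(address: str) -> dict:
    """Split an IPv6 address into the 48/16/64-bit fields described above."""
    value = int(ipaddress.IPv6Address(address))
    return {
        "network_prefix": value >> 80,          # first 48 bits
        "subnet_id": (value >> 64) & 0xFFFF,    # next 16 bits
        "interface_id": value & (2**64 - 1),    # last 64 bits
    }


parts = split_ipv6("2001:0DB8:AC10:FE01:1234:56FF:FE78:0301")
print(f"network prefix: {parts['network_prefix']:012x}")  # 20010db8ac10
print(f"subnet id:      {parts['subnet_id']:04x}")        # fe01
print(f"interface id:   {parts['interface_id']:016x}")    # 123456fffe780301
```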

## SLAAC

The SLAAC (Stateless Address Autoconfiguration) process generates a globally unique IPv6 address for a device that does not already have an IPv6 assignment. In its classic form, the device running SLAAC derives an EUI-64 interface identifier from its MAC address and combines it with the FE80::/64 prefix to form a link-local address (TODO: probably should draw it out); it then performs Duplicate Address Detection (DAD) to ensure the address is unique on the local segment. At this stage, the device is still not globally reachable, because a link-local address is valid only on the local link and is never forwarded by routers. To get a global address, the device sends a Router Solicitation and receives a Router Advertisement from its local router that carries the network prefix. It then combines this prefix with the same EUI-64 identifier, performs DAD again, and finally has a globally unique IPv6 address.
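The EUI-64 derivation can be sketched as follows: flip the universal/local bit of the first MAC byte, then insert `FF:FE` between the two halves of the MAC. The function names are illustrative, and the MAC address `10:34:56:78:03:01` is a hypothetical one chosen so that the result matches the example address used earlier in this document.

```python
import ipaddress


def eui64_interface_id(mac: str) -> int:
    """Derive the modified EUI-64 interface identifier from a 48-bit MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"expected 6 octets in MAC address, got {len(octets)}")
    octets[0] ^= 0x02  # flip the universal/local bit
    eui64 = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return int.from_bytes(eui64, "big")


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with the EUI-64 identifier derived from mac."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("SLAAC with EUI-64 expects a /64 prefix")
    return ipaddress.IPv6Address(
        int(network.network_address) | eui64_interface_id(mac)
    )


mac = "10:34:56:78:03:01"  # hypothetical MAC address
print(slaac_address("fe80::/64", mac))                # fe80::1234:56ff:fe78:301
print(slaac_address("2001:db8:ac10:fe01::/64", mac))  # 2001:db8:ac10:fe01:1234:56ff:fe78:301
```

The sketch leaves out DAD and the router exchange; it only shows how the same identifier ends up in both the link-local and the global address.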

When processing the data, we look for `FFFE` in bytes 12 and 13 of the address (counting from 1) to filter out addresses generated by SLAAC, since we do not believe these addresses would produce duplicates.
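A minimal sketch of such a filter, assuming addresses arrive as strings; `looks_like_eui64` and the sample addresses are made up for illustration and are not taken from our dataset. Note that a randomly generated identifier also contains `FFFE` at this position about once in 65,536 cases, so the filter will drop a small number of non-SLAAC addresses.

```python
import ipaddress


def looks_like_eui64(address: str) -> bool:
    """True if bytes 12 and 13 (1-indexed) of the address are FF FE."""
    packed = ipaddress.IPv6Address(address).packed
    return packed[11:13] == b"\xff\xfe"  # zero-indexed bytes 11 and 12


# Illustrative sample input, using the documentation prefix 2001:db8::/32.
addresses = [
    "2001:db8:ac10:fe01:1234:56ff:fe78:301",  # EUI-64 style
    "2001:db8:ac10:fe01::1",                  # no FFFE marker
    "2001:db8:1:2:a1b2:c3d4:e5f6:789",        # no FFFE marker
]

non_slaac = [a for a in addresses if not looks_like_eui64(a)]
print(non_slaac)
```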

## DHCPv6

The DHCPv6 (Dynamic Host Configuration Protocol for IPv6) process, on the other hand, assigns IPv6 addresses to devices in a centralized manner. In a home network (or a local area network in general), the gateway router typically plays the role of DHCPv6 server -- it hands out IPv6 addresses to devices on the network. When a new device joins the network, it generates a link-local address and sends a Solicit message to locate a DHCPv6 server; the server in turn picks an available address from the pool of unused addresses within its delegated prefix. We hypothesize that the gateway router uses the current time as a seed to generate this address, a process that may have gone wrong.

## NTP

Network Time Protocol (NTP) synchronizes the clocks of network devices. When a home router is first configured, it should request the current time from a network time server. (TODO: but is this true? Where can I read about more router startup rules? Should a router be allowed to be on the Internet when its time is off, especially when it needs the time to configure addresses?) Many routers come with default NTP settings provided by the manufacturer, including a list of network time servers that they can contact to synchronize their clocks. Our hypothesis for why the addresses are repeated is that some routers attempt to reach the time servers on their list, but none of those servers are active anymore. As a result, the routers may all fall back to the Unix epoch (midnight, January 1, 1970 UTC) or some other invalid time, and potentially use it as a seed to generate the lower 64 bits of the IPv6 address.
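To make the hypothesis concrete, here is a purely hypothetical sketch of how a time-seeded generator would produce identical identifiers on different routers. We have not confirmed that any router firmware works this way; `interface_id_from_clock` is an invented name, and Python's `random` module stands in for whatever generator a real device might use.

```python
import random


def interface_id_from_clock(clock_seconds: int) -> int:
    """Hypothetical: derive a 64-bit identifier from a PRNG seeded with the clock."""
    rng = random.Random(clock_seconds)
    return rng.getrandbits(64)


# Two routers whose clocks both fell back to the Unix epoch (t = 0).
router_a = interface_id_from_clock(0)
router_b = interface_id_from_clock(0)

# A router that did synchronize its clock (arbitrary example timestamp).
router_c = interface_id_from_clock(1_700_000_000)

print(router_a == router_b)  # True: same seed, same identifier
print(router_a == router_c)  # False: different seed
```

If this is what happens, the repeated identifiers should cluster around a small number of values, one for each fallback time that the affected firmware uses.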

## Questions and todos

- we can check for ipv6 dups within a certain range, right? beyond that range we just rely on the hierarchy also just cuz who's gonna keep a list on a global level. 
- look up gateway routers 
- find out why erik only mentioned SLAAC and if that's the main/only thing people use nowadays.
- todo: can't remember why we need to seed anything exactly. figure that out. is it for dhcpv6?
- There are several ways to assign the last 64 bits:
  - Static (TODO: maybe talk about it?)
  - DHCPv6
  - SLAAC