In [3]:
%%shell
# SETUP OF GOOGLE COLAB ENVIRONMENT
git clone -b colab --single-branch https://github.com/lucadomene/DFB_Lab01_DataCarving /tmp/DFB_lab_init
sh /tmp/DFB_lab_init/initialize.sh

Cloning into '/tmp/DFB_lab_init'...
remote: Enumerating objects: 62, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 62 (delta 7), reused 14 (delta 3), pack-reused 44 (from 1)
Receiving objects: 100% (62/62), 113.71 MiB | 30.34 MiB/s, done.
Resolving deltas: 100% (15/15), done.
Cloning into '/opt/datacarving-lab'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (11/11), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 55 (delta 5), reused 8 (delta 2), pack-reused 44 (from 1)
Receiving objects: 100% (55/55), 113.71 MiB | 32.41 MiB/s, done.
Resolving deltas: 100% (13/13), done.
losetup: cannot find an unused loop device




# Laboratory on Data Carving
_Digital Forensics and Biometrics_, Academic Year 2025/2026

Lecturer: prof. **Simone Milani** (simone.milani@dei.unipd.it)

Teaching Assistants: **Mattia Tamiazzo** (mattia.tamiazzo@studenti.unipd.it); **Luca Domeneghetti** (luca.domeneghetti@studenti.unipd.it)

### Prerequisites
This laboratory requires no prior knowledge of digital forensics, although a basic knowledge of the Unix environment will help in understanding the deletion/recovery process.

In general, the following aspects are expected to be known and will not be covered in detail:
- Unix paradigm of files, directories and devices
- Basic Shell usage
- `Ext4` filesystem layout: inodes, block sectors, journaling
- Calculations with hexadecimal byte offsets

### Contents
The goal of this laboratory is to provide the basic theoretical and practical notions of data carving and file recovery. The details of filesystem structure and I/O mechanisms, although crucial for a successful data recovery, are out of scope and are left to the student as further reading.

At the end of the laboratory the student will have acquired the following skills:
- Perform a forensic disk copy using Unix imaging tools (`dcfldd`)
- Analyze the partition layout of a disk image
- List the files within a filesystem
- **Retrieve deallocated/deleted files**
- **Perform data carving** (manually and using dedicated tools)
___
## Theoretical aspects of filesystem forensics
Before diving into the practice, a brief theoretical detour is essential to understand how a recovery procedure is to be carried out.

Knowing how a specific filesystem works helps to understand how deletion works, how files are allocated logically and physically, and most importantly where to look when searching for deallocated data.

### The Ext filesystem
This laboratory focuses on the Ext family of filesystems, which is widely deployed and well tested under Linux. Its simplicity makes it a suitable starting point for a comprehensive approach to data forensics.

#### Inodes
An **inode** (index node) is a data structure used by Unix filesystems (including the Ext family) to hold the metadata of a file. Each inode stores the following information:

- File type (e.g., regular file, directory, symlink)
- Permissions and ownership (UID, GID)
- Timestamps (created, modified, accessed)
- File size
- Link count
- Pointers to data blocks

Inodes do **not** store the file name or its path; these are kept in directory entries, which map symbolic names to inode numbers. This separation is critical in forensic analysis and data carving, where inodes may remain allocated even if directory structures are damaged or missing (this is the case for _orphan files_).

Within an Ext partition, inodes are located in inode tables. Each inode is marked as allocated or deallocated by using inode bitmaps.
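
The bitmap lookup can be sketched in a few lines of Python. This is only an illustration: it assumes that the bitmap of the block group containing the inode has already been read from the image, and the function name `inode_is_allocated` and the sample bitmap are made up for the example. Inode numbers start at 1, and bits are counted from the least significant bit of each byte.

```python
def inode_is_allocated(bitmap: bytes, inode_number: int, inodes_per_group: int) -> bool:
    """Return True if the group's inode bitmap marks the inode as in use."""
    # Inode numbers start at 1, so inode N corresponds to bit N-1 of its group.
    index = (inode_number - 1) % inodes_per_group
    byte, bit = divmod(index, 8)
    # Within each byte, bit 0 is the least significant bit.
    return bool((bitmap[byte] >> bit) & 1)


# Hypothetical bitmap for a group of 16 inodes:
# inodes 1-11 are in use, inodes 12 and 13 are free, inode 14 is in use.
example_bitmap = bytes([0b11111111, 0b00100111])

for number in (1, 12, 14):
    print(number, inode_is_allocated(example_bitmap, number, 16))
```

A deleted file shows up here as an inode whose bit is cleared while the inode table entry may still hold its old contents.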

[![](https://www.virtualcuriosities.com/wp-content/uploads/2025/03/linux-diagram-hard-links-inodes-20250326.webp)](https://www.virtualcuriosities.com/articles/4507/how-hard-links-and-inodes-work-on-linux)

#### Directories
In Ext filesystems, directories are special files (file type code `0x2`) that store a list of **directory entries**, each mapping a filename to an **inode number**. These entries are stored sequentially in data blocks and are laid out as follows:

| Offset | Size                | Name                   | Description                                               |
|--------|---------------------|------------------------|-----------------------------------------------------------|
| 0x0    | __le32              | inode                  | Number of the inode that this directory entry points to.  |
| 0x4    | __le16              | rec_len                | Length of this directory entry.                           |
| 0x6    | __u8                | name_len               | Length of the file name.                                  |
| 0x7    | __u8                | file_type              | File type code                                            |
| 0x8    | char\[255]          | name                   | File name.                                                |

This structure allows multiple filenames (hard links) to point to the same inode.
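
As a rough illustration of the layout in the table, the following Python sketch walks a directory data block entry by entry. The block is synthetic (built by the hypothetical helper `make_entry`), not taken from the lab image, and the parser ignores HTree index blocks and checksums.

```python
import struct


def parse_dir_entries(block: bytes):
    """Yield (inode, file_type, name) for each used entry in a directory block."""
    offset = 0
    while offset + 8 <= len(block):
        # Little-endian: inode (u32), rec_len (u16), name_len (u8), file_type (u8)
        inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", block, offset)
        if rec_len < 8:
            break  # malformed entry: stop instead of looping forever
        name = block[offset + 8 : offset + 8 + name_len].decode("utf-8", "replace")
        if inode != 0:  # inode 0 marks an unused entry
            yield inode, file_type, name
        offset += rec_len  # rec_len leads to the next entry


def make_entry(inode: int, name: str, file_type: int, rec_len: int) -> bytes:
    """Hypothetical helper that builds one entry padded to rec_len bytes."""
    raw = struct.pack("<IHBB", inode, rec_len, len(name), file_type) + name.encode()
    return raw.ljust(rec_len, b"\0")


# Two names (hard links) that point to the same inode 12; 0x1 = regular file.
block = make_entry(12, "a.txt", 0x1, 16) + make_entry(12, "b.txt", 0x1, 16)
entries = list(parse_dir_entries(block))
print(entries)
```

Because the parser advances by `rec_len` rather than by `name_len`, the bytes of a removed entry that was merged into the previous entry's `rec_len` are skipped over; this slack space is one place where old file names can survive.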

In the original Ext2 design, directory entries are stored in a linear list, which becomes inefficient as directories grow. **HTree indexing**, a hashed B-tree-like structure, was introduced as an optional feature (`dir_index`) in Ext3 and is enabled by default in Ext4 to improve performance in large directories.

### NTFS and FAT

Aside from Ext filesystems, NTFS and FAT are frequently used in Windows environments and on external storage (e.g. USB drives). Although they share some similarities with Ext, understanding the differences is key to a successful file recovery.

- **Metadata Handling**:  
  Ext uses **inodes** to store file metadata separately from directory entries. In contrast, **NTFS** stores metadata in the **Master File Table (MFT)**, with each file represented as a record. **FAT**, being simpler, uses a **File Allocation Table** and directory entries that contain both metadata and file location info.

- **File Deletion Behavior**:  
  In **Ext**, when a file is deleted, its directory entry and inode may persist until overwritten, which aids data recovery. **NTFS** marks MFT entries as deleted but often retains significant metadata. **FAT** replaces the first byte of the file name in the directory entry with the marker `0xE5` and frees the file's cluster chain in the FAT.

- **Journaling**:  
  **Ext3/4** and **NTFS** are journaling filesystems, enhancing data integrity but sometimes complicating recovery due to overwrites. **FAT** lacks journaling, making it more vulnerable to corruption but also leaving raw data more directly accessible for carving.

- **File Name and Path Storage**:  
  Ext separates names (in directory entries) from inodes, while NTFS stores file name attributes directly within MFT records. FAT embeds the file name directly in the directory entry.
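
To make the FAT deletion behaviour concrete, here is a minimal Python sketch that scans a directory region for deleted short (8.3) entries. It relies on the standard 32-byte FAT directory entry, where a first byte of `0xE5` marks a deleted entry and `0x00` marks the end of the directory. The directory bytes are synthetic and the helper `make_entry` is hypothetical; long-file-name entries are simply skipped.

```python
import struct

ENTRY_SIZE = 32
DELETED_MARKER = 0xE5
ATTR_LONG_NAME = 0x0F


def list_deleted_entries(directory: bytes):
    """Return (name, first_cluster, size) for each deleted 8.3 entry."""
    found = []
    for offset in range(0, len(directory) - ENTRY_SIZE + 1, ENTRY_SIZE):
        entry = directory[offset : offset + ENTRY_SIZE]
        if entry[0] == 0x00:
            break  # end of directory
        if entry[11] == ATTR_LONG_NAME:
            continue  # long-file-name entry, ignored in this sketch
        if entry[0] == DELETED_MARKER:
            # The first character of the name is lost; show it as '?'.
            name = entry[1:8].decode("ascii", "replace").rstrip()
            ext = entry[8:11].decode("ascii", "replace").rstrip()
            (cluster_hi,) = struct.unpack_from("<H", entry, 20)
            cluster_lo, size = struct.unpack_from("<HI", entry, 26)
            found.append(("?" + name + "." + ext, (cluster_hi << 16) | cluster_lo, size))
    return found


def make_entry(name, ext, cluster, size, deleted=False):
    """Hypothetical helper that builds one synthetic 8.3 directory entry."""
    raw = bytearray(ENTRY_SIZE)
    raw[0:8] = name.ljust(8).encode("ascii")
    raw[8:11] = ext.ljust(3).encode("ascii")
    raw[11] = 0x20  # archive attribute
    struct.pack_into("<H", raw, 20, cluster >> 16)
    struct.pack_into("<HI", raw, 26, cluster & 0xFFFF, size)
    if deleted:
        raw[0] = DELETED_MARKER
    return bytes(raw)


directory = (
    make_entry("NOTES", "TXT", 5, 120)
    + make_entry("PHOTO", "JPG", 9, 4096, deleted=True)
    + bytes(ENTRY_SIZE)  # all-zero entry: end of directory
)
deleted = list_deleted_entries(directory)
print(deleted)
```

The starting cluster and size survive in the deleted entry, which is why a contiguous file on FAT can often be recovered even though its cluster chain has been freed.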

___
## Exercise 01: perform a forensic disk imaging
When in possession of a digital device to be examined (e.g. a hard disk, USB drive or internal SSD), the first crucial step is to make a copy of it, so as to avoid any accidental modification of the original.

This process is called **disk imaging** and abides by the following principles:
1. the copy is an exact duplicate of the original device
2. the original device remains unaltered by the process

To be able to detect unwanted (or intentional!) modifications introduced during subsequent procedures, it is important to record the **hash signature** of the original device. This step has to be performed during the first acquisition of the digital media.

The target device is located at `/dev/loop0` in the virtual Unix machine. It is **important** to select the root device `loop0` and not any of its partitions (e.g. `loop0p1`) in order to perform a full disk image.
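
The role of the hash signature can be illustrated with a short Python sketch using the standard `hashlib` module. The two temporary files merely stand in for the original device and its image; the function name `digests` and the file names are chosen for the example.

```python
import hashlib
import os
import tempfile


def digests(path, chunk_size=1024 * 1024):
    """Return the (MD5, SHA-256) hex digests of a file, read in chunks."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


# Throwaway files standing in for the device and its image.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "original.bin")
copy = os.path.join(workdir, "copy.bin")
payload = os.urandom(4096)
for path in (original, copy):
    with open(path, "wb") as f:
        f.write(payload)

matches_before = digests(original) == digests(copy)
print("identical copy:", matches_before)

# Flip a single bit in the first byte of the copy.
with open(copy, "r+b") as f:
    f.write(bytes([payload[0] ^ 1]))

matches_after = digests(original) == digests(copy)
print("after one bit flip:", matches_after)
```

Reading in chunks keeps memory usage constant, which matters when the input is a whole disk rather than a 4 KiB test file.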

In [5]:
%%shell
# `dcfldd` is an advanced version of Unix's `dd` tool to perform bit-by-bit copies of files/devices
# if = input file
# of = output file

# Task: perform a disk image of device /dev/loop0 and save it as whatever you prefer.
# How can I make `dcfldd` also compute the MD5 and SHA256 hash of the original disk device?

dcfldd if=/dev/loop0 of=image.dd hash=md5,sha256 hashlog=hash.txt

1536 blocks (48Mb) written.
1600+0 records in
1600+0 records out




In [6]:
%%shell
# Task: compare the MD5 and SHA256 digests of the original device (as logged by `dcfldd`) with those of the disk image
md5sum image.dd
sha256sum image.dd

cat hash.txt

004ad369505907411c07cf0da7a01af6  image.dd
f20dc2134f06a54dda6b02ba834fa62a0da6bc35b99e61f496c168be50e2f44c  image.dd

Total (md5): 004ad369505907411c07cf0da7a01af6

Total (sha256): f20dc2134f06a54dda6b02ba834fa62a0da6bc35b99e61f496c168be50e2f44c




___
## Exercise 02: obtain filesystem info of the disk image
Each filesystem differs in the way data is arranged physically on the device. Even similar filesystems (e.g. Ext3 and Ext4), or two instances of the same filesystem, may use different arrangements depending on the total space available on the device, the options that were selected when the filesystem was created, or the characteristics of the device itself (SSD, HDD...).

The Sleuth Kit (abbreviated as TSK) provides two commands to inspect this arrangement: `mmls` and `fsstat`.

For convenience, it is better to extract the partition from the complete image using `dcfldd` with `if={your_image}.dd` (hint: look at the options `bs`, `skip` and `count`).
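
The arithmetic behind the extraction is simple: `mmls` reports the start and length of each partition in sectors, and multiplying by the sector size gives byte offsets. The Python sketch below shows the same idea as a `dd`-style copy; the function `extract_partition` and the toy image are illustrative only, while the sector values 2048 and 100352 are those listed by `mmls` for the Linux partition of this lab's image.

```python
import os
import tempfile

SECTOR_SIZE = 512


def extract_partition(image_path, out_path, start_sector, sector_count,
                      sector_size=SECTOR_SIZE):
    """Copy sector_count sectors starting at start_sector; return bytes copied."""
    remaining = sector_count * sector_size
    with open(image_path, "rb") as src, open(out_path, "wb") as dst:
        src.seek(start_sector * sector_size)  # plays the role of dd's skip=
        while remaining > 0:
            chunk = src.read(min(remaining, 1024 * 1024))
            if not chunk:
                break  # image shorter than expected
            dst.write(chunk)
            remaining -= len(chunk)
    return sector_count * sector_size - remaining


# Toy image of 8 sectors, each filled with its own index.
workdir = tempfile.mkdtemp()
image = os.path.join(workdir, "toy.dd")
with open(image, "wb") as f:
    for i in range(8):
        f.write(bytes([i]) * SECTOR_SIZE)

part = os.path.join(workdir, "toy_part.dd")
copied = extract_partition(image, part, start_sector=2, sector_count=3)
print("bytes copied:", copied)

# Byte offset and size of the Linux partition of the lab image.
print("offset:", 2048 * SECTOR_SIZE, "size:", 100352 * SECTOR_SIZE)
```

Note that the offset applies to the input side of the copy, which is why the `dd` family uses `skip` for it; `seek` would instead move the write position in the output file.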

In [13]:
%%shell
# `mmls` provides information on the overall layout scheme of the disk image
# Task: list the partition table and extract the Linux partition
mmls image.dd

dcfldd if=image.dd of=linux_part.dd skip=2048 count=100352 bs=512

DOS Partition Table
Offset Sector: 0
Units are in 512-byte sectors

      Slot      Start        End          Length       Description
000:  Meta      0000000000   0000000000   0000000001   Primary Table (#0)
001:  -------   0000000000   0000002047   0000002048   Unallocated
002:  000:000   0000002048   0000102399   0000100352   Linux (0x83)
256 blocks (0Mb) written.512 blocks (0Mb) written.768 blocks (0Mb) written.1024 blocks (0Mb) written.1280 blocks (0Mb) written.1536 blocks (0Mb) written.1792 blocks (0Mb) written.2048 blocks (1Mb) written.2304 blocks (1Mb) written.2560 blocks (1Mb) written.2816 blocks (1Mb) written.3072 blocks (1Mb) written.3328 blocks (1Mb) written.3584 blocks (1Mb) written.3840 blocks (1Mb) written.4096 blocks (2Mb) written.4352 blocks (2Mb) written.4608 blocks (2Mb) written.4864 blocks (2Mb) written.5120 blocks (2Mb) written.5376 blocks (2Mb) written.5632 blocks (2Mb) written.5888 blocks (2Mb) written.6144 blocks (3Mb) written.6400 block



In [None]:
%%shell
# `fsstat` provides information on a single filesystem partition
# Task: obtain information on the Ext4 partition within the disk image
fsstat linux_part.dd

# Question: what is the size of a single inode entry? What is the size of a block?

___
## Exercise 03: recover deallocated files
There are multiple ways in which a file can be deleted. A common technique (especially if the underlying device is an HDD) is simply to **deallocate** the data blocks associated with the file, deallocate its inode entry and add a "deleted" timestamp to it.

Under these circumstances, it is easily possible to recover the original file's data, provided that none of its data blocks have been overwritten in the meantime.

In [None]:
%%shell
# `fls` is the `ls` equivalent when inspecting a disk image
# Task: roam around the filesystem. Then, list all files recursively
fls ???

# Question: what are the numbers displayed on the left-hand side of the output?

Given an inode number, it is easy to display the associated metadata information by using the command `istat`. To retrieve the associated data blocks, multiple commands can be used, namely `icat`, `blkcat` and `blkls`.
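
Conceptually, recovering a file from its inode means reading the listed data blocks in order and truncating the result to the file size. The Python sketch below illustrates only this idea on a tiny in-memory image; it is not how the TSK tools are implemented, and the function name, block size and block numbers are invented for the example.

```python
import io


def read_file_from_blocks(image, block_numbers, block_size, file_size):
    """Concatenate the given blocks of an image and cut the result to file_size."""
    data = bytearray()
    for block in block_numbers:
        image.seek(block * block_size)  # byte offset = block number * block size
        data += image.read(block_size)
    return bytes(data[:file_size])  # drop the slack at the end of the last block


# Toy image: 6 blocks of 16 bytes, block i filled with the letter 'A' + i.
BLOCK_SIZE = 16
toy_image = io.BytesIO(b"".join(bytes([65 + i]) * BLOCK_SIZE for i in range(6)))

# A hypothetical 20-byte file stored in blocks 4 and 2 (not contiguous).
recovered = read_file_from_blocks(toy_image, [4, 2], BLOCK_SIZE, 20)
print(recovered)
```

The same multiplication (block number times block size) is what allows moving between the block numbers shown in inode metadata and byte offsets in a hex dump of the partition.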

In [None]:
%%shell
# `istat` is used to display the metadata contained within an inode
# Task: display the metadata information regarding a regular file and a deleted one. What are the differences?
istat ???


In [None]:
%%shell
# Task: use either `icat`, `blkcat` or `blkls` to display the binary data (in hex form) from a given inode
??? ??? | xxd
# Question: when displaying the same file type, pay attention to the beginning of the hex output


In [None]:
%%shell
# Task: use either `icat`, `blkcat` or `blkls` to recover a file from a given inode
??? ??? > recovered_file.???


___
## Exercise 04: recover orphan files
When a directory entry gets corrupted or deleted, an inode might lose the only hard link through which its file could be reached. The file then becomes inaccessible, even though the original inode metadata and data blocks are still properly allocated in the filesystem.

The `fsck` tool (part of `util-linux`; for Ext filesystems it calls `e2fsck` from `e2fsprogs`) checks all inodes, looking for a mismatch between the link count stored in their metadata and the number of directory entries that actually reference them. Should it find an inode with an actual link count of 0 and a stored link count greater than 0, that inode is considered "orphaned" and gets attached to the `lost+found` directory.
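
The link-count comparison can be sketched as follows. This is a simplified model rather than the actual `e2fsck` algorithm: the inode table is reduced to a dictionary of stored link counts and the directory tree to a flat list of `(name, inode)` pairs, both invented for the example.

```python
from collections import Counter


def find_orphans(stored_link_counts, directory_entries):
    """Return inodes whose stored link count is > 0 but that no entry references."""
    # Actual link count = number of directory entries pointing to the inode.
    actual = Counter(inode for _name, inode in directory_entries)
    return sorted(
        inode
        for inode, stored in stored_link_counts.items()
        if stored > 0 and actual[inode] == 0
    )


# Hypothetical filesystem: inode 13 has two hard links, inode 14 lost its only one.
stored = {12: 1, 13: 2, 14: 1}
entries = [("a.txt", 12), ("b.txt", 13), ("c.txt", 13)]

orphans = find_orphans(stored, entries)
print(orphans)
```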

In [None]:
%%shell
# Before using the `fsck` tool, the disk image has to be attached as a loop block device on the Linux machine
losetup --show -fP ???
# The output is the newly created block device


In [None]:
%%shell
# Task: perform `fsck` on the Ext4 partition
fsck -f ???


In [None]:
%%shell
# Task: recover the newly attached orphan file (see exercise 03)


___
## Exercise 05: perform data carving
When the inode content is deallocated and overwritten, the only hope for a successful file recovery is to look in the data blocks (both allocated and deallocated). Each file type has a characteristic header (and sometimes trailer) signature, which turns the recovery process into a "matching game" against the signatures.

| Filetype | Header                              | Trailer                          |
|----------|--------------------------------------|----------------------------------|
| JPG      | `FF D8`                              | `FF D9`                          |
| PNG      | `89 50 4E 47 0D 0A 1A 0A`             | `49 45 4E 44 AE 42 60 82`        |
| PDF      | `25 50 44 46`                        | multiple                         |
| DOC      | `D0 CF 11 E0 A1 B1 1A E1`             | multiple                         |

Data carving is a process that can be performed on single files when embedded data is present (e.g. DOC, PDF...) or on whole disk images.
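
The "matching game" can be sketched in Python as a search for a header followed by the next trailer. The example uses the PNG signatures from the table on a synthetic byte string; the function name `carve` and the test data are made up, and a real carver would also have to deal with fragmented files and with trailers that happen to occur inside the file body.

```python
PNG_HEADER = bytes.fromhex("89504E470D0A1A0A")
PNG_TRAILER = bytes.fromhex("49454E44AE426082")


def carve(data: bytes, header: bytes, trailer: bytes):
    """Return every byte range that starts with header and ends with the next trailer."""
    carved = []
    start = data.find(header)
    while start != -1:
        end = data.find(trailer, start + len(header))
        if end == -1:
            break  # header without a trailer: nothing more to carve
        end += len(trailer)  # include the trailer in the carved file
        carved.append(data[start:end])
        start = data.find(header, end)
    return carved


# Synthetic "disk" content with two fake PNG files surrounded by other bytes.
blob = (
    b"\x00" * 10
    + PNG_HEADER + b"fake image body" + PNG_TRAILER
    + b"junk"
    + PNG_HEADER + b"second" + PNG_TRAILER
    + b"\xff"
)
files = carve(blob, PNG_HEADER, PNG_TRAILER)
print(len(files), "candidate files found")
```

Long signatures such as the 8-byte PNG header rarely match by chance; short ones such as the 2-byte JPG markers produce many false positives, so carved candidates should always be validated by opening them.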

### Task A: recover an embedded image from a DOC/PDF file
Recover the file `my_dog.docx` using the techniques seen in Exercise 03, and try to open it using Microsoft Word or any other `.docx` opener.

When a **container file** (such as DOC/DOCX, PDF) gets corrupted, it may still be possible to retrieve the embedded files. Knowing that `my_dog.docx` contains a single JPG image, perform data carving on it and extract the image.

In [None]:
%%shell
# Tip 1: use `grep` with a regular expression to isolate the JPG hexadecimal data
# Tip 2: `xxd` can be used to dump the hexadecimal data of a given file, but it can also convert a hex dump back into binary
xxd ???


It is also possible to use dedicated tools that analyze binary data in search of embedded/hidden files. `binwalk` can both list and extract the embedded data of a given file. Use it to recover the image from `my_dog.docx`, and try it on other container files.

In [None]:
%%shell
# Task: use `binwalk` to extract embedded files
binwalk ???


### Task B: recover all files from a disk image
When dealing with an entire disk image, it is *inconvenient* to run `xxd | grep` for each file type. Resorting to dedicated tools is a much more efficient option, especially when the number and diversity of files is considerable.

For this experience, `scalpel` will be used to perform a complete data carving on our disk image.

**Attention**: `scalpel` requires a configuration file as one of its arguments. In the Linux system provided, it is located at `/etc/scalpel.conf`. Please check the layout of your specific OS in case it is different.

In [None]:
%%shell
# Task: use `scalpel` to perform data carving on the disk image
scalpel -c /etc/scalpel.conf ???


___
## Challenge: CTF on a FAT32 disk image
Find the flag within the `disko-3.dd` disk image.

The flag's format is the following: `picoCTF{...}`

*(courtesy of picoCTF)*