In [15]:
%%shell
# SETUP OF GOOGLE COLAB ENVIRONMENT
git clone -b colab --single-branch https://github.com/lucadomene/DFB_Lab01_DataCarving /tmp/DFB_lab_init
sh /tmp/DFB_lab_init/initialize.sh


# Laboratory on Data Carving
_Digital Forensics and Biometrics_, academic year 2025/2026

Lecturer: prof. **Simone Milani** (simone.milani@dei.unipd.it)

Teaching Assistants: **Mattia Tamiazzo** (mattia.tamiazzo@studenti.unipd.it); **Luca Domeneghetti** (luca.domeneghetti@studenti.unipd.it)

### Prerequisites
This laboratory requires no prior knowledge of digital forensics, although basic knowledge of the Unix environment helps in understanding the deletion and recovery process.

In general, the following aspects are expected to be known and will not be covered in detail:
- Unix paradigm of files, directories and devices
- Basic Shell usage
- `Ext4` filesystem layout: inodes, block sectors, journaling
- Calculations with hexadecimal byte offsets

### Contents
The goal of this laboratory is to provide the basic theoretical and practical notions of data carving and file recovery. The details of filesystem structure and I/O mechanisms, although crucial for a successful data recovery, are out of scope and are left to the student as further reading.

At the end of the laboratory the student will have acquired the following skills:
- Perform a forensic disk copy using Unix imaging tools (`dcfldd`)
- Analyze the partition layout of a disk image
- List the files within a filesystem
- **Retrieve deallocated/deleted files**
- **Perform data carving** (manually and using dedicated tools)
___
## Theoretical aspects of filesystem forensics
Before diving into the practice, a brief theoretical detour is essential to understand how a recovery procedure is to be carried out.

Keeping in mind how a specific filesystem works helps to understand how deletion works, how files are allocated logically and physically, and, most importantly, where to look when searching for deallocated data.

### The Ext filesystem
This laboratory focuses on the Ext family of filesystems, which is widely deployed and well tested in Linux environments. Its relative simplicity makes it a suitable starting point for filesystem forensics.

#### Inodes
An **inode** (index node) is a data structure used by Unix filesystems (including the Ext family) to hold the metadata of a file. Each inode stores the following information:

- File type (e.g., regular file, directory, symlink)
- Permissions and ownership (UID, GID)
- Timestamps (created, modified, accessed)
- File size
- Link count
- Pointers to data blocks

Inodes do **not** store the file name or its path: these are kept in directory entries, which map names to inode numbers. This separation matters in forensic analysis and data carving, because inodes may remain allocated even if directory structures are damaged or missing (this is the case for _orphan files_).

Within an Ext partition, inodes are located in inode tables. Each inode is marked as allocated or deallocated by using inode bitmaps.
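
The lookup described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code of any forensic tool: the function names are hypothetical, and the value of 1792 inodes per block group used in the example is an assumption (`fsstat` output does not always show it directly).

```python
def locate_inode(inode_number, inodes_per_group, inode_size):
    """Return (block_group, index_in_group, byte_offset_in_inode_table)."""
    if inode_number < 1:
        raise ValueError("inode numbers start at 1")
    index = inode_number - 1                      # inode 1 is the first entry
    group = index // inodes_per_group             # which inode table holds it
    index_in_group = index % inodes_per_group     # position inside that table
    return group, index_in_group, index_in_group * inode_size


def is_allocated(bitmap, index_in_group):
    """Test one bit of an inode bitmap (least significant bit first in each byte)."""
    byte = bitmap[index_in_group // 8]
    return bool((byte >> (index_in_group % 8)) & 1)


if __name__ == "__main__":
    # Assumed geometry: 1792 inodes per group, 256-byte inodes.
    print(locate_inode(21, inodes_per_group=1792, inode_size=256))
    # Toy bitmap: bits 0 and 2 set, bit 1 clear.
    print(is_allocated(bytes([0b00000101]), 1))
```

The byte offset returned is relative to the start of the inode table of the selected block group; locating that table on disk requires the group descriptors, which are not covered here.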

[![](https://www.virtualcuriosities.com/wp-content/uploads/2025/03/linux-diagram-hard-links-inodes-20250326.webp)](https://www.virtualcuriosities.com/articles/4507/how-hard-links-and-inodes-work-on-linux)

#### Directories
In Ext filesystems, directories are special files (file type code `0x2`) that store a list of **directory entries**, each mapping a filename to an **inode number**. These entries are stored sequentially in data blocks and are laid out as follows:

| Offset | Size                | Name                   | Description                                               |
|--------|---------------------|------------------------|-----------------------------------------------------------|
| 0x0    | __le32              | inode                  | Number of the inode that this directory entry points to.  |
| 0x4    | __le16              | rec_len                | Length of this directory entry.                           |
| 0x6    | __u8                | name_len               | Length of the file name.                                  |
| 0x7    | __u8                | file_type              | File type code                                            |
| 0x8    | char\[255]          | name                   | File name.                                                |

This structure allows multiple filenames (hard links) to point to the same inode.
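
As a rough illustration of the table above, the following Python sketch walks the entries of a single directory block in linear order. The helper `make_entry` is hypothetical and only builds synthetic test data; the inode numbers and names in the example are taken from this laboratory's image purely for illustration.

```python
import struct


def parse_dir_entries(block):
    """Yield (inode, rec_len, file_type, name) for each used entry in one block."""
    offset = 0
    while offset + 8 <= len(block):
        # inode (le32), rec_len (le16), name_len (u8), file_type (u8)
        inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", block, offset)
        if rec_len < 8 or offset + rec_len > len(block):
            break  # malformed record: stop instead of reading garbage
        raw_name = block[offset + 8 : offset + 8 + name_len]
        if inode != 0:  # inode 0 marks an unused entry
            yield inode, rec_len, file_type, raw_name.decode("utf-8", errors="replace")
        offset += rec_len  # rec_len leads to the next entry


def make_entry(inode, name, file_type, rec_len=None):
    """Hypothetical helper: build one synthetic entry, padded to rec_len bytes."""
    raw = name.encode("utf-8")
    minimum = (8 + len(raw) + 3) & ~3  # entries are aligned to 4 bytes
    rec_len = minimum if rec_len is None else rec_len
    header = struct.pack("<IHBB", inode, rec_len, len(raw), file_type)
    return (header + raw).ljust(rec_len, b"\x00")


if __name__ == "__main__":
    # The last entry of a block stretches its rec_len to the end of the block.
    block = make_entry(21, "dog.bmp", 1) + make_entry(22, "f1.jpeg", 1, rec_len=1008)
    for entry in parse_dir_entries(block):
        print(entry)
```

Because `rec_len` may be larger than the space the name needs, the gap at the end of an entry can still contain the remains of a previously deleted entry, which is one reason directory blocks are worth inspecting by hand.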

By default, directory entries are stored in a linear list, which becomes inefficient as directories grow. The `dir_index` feature, available since Ext3 and enabled by default in Ext4, adds **HTree indexing**, a hashed B-tree-like structure that improves lookup performance in large directories.

### NTFS and FAT

Aside from Ext filesystems, NTFS and FAT are frequently found in Windows environments and on external storage (e.g. USB drives). Although they share some similarities with Ext, understanding the differences is key to a successful file recovery.

- **Metadata Handling**:  
  Ext uses **inodes** to store file metadata separately from directory entries. In contrast, **NTFS** stores metadata in the **Master File Table (MFT)**, with each file represented as a record. **FAT**, being simpler, uses a **File Allocation Table** and directory entries that contain both metadata and file location info.

- **File Deletion Behavior**:  
  In **Ext**, when a file is deleted, its directory entry and inode may persist until overwritten, which aids data recovery. **NTFS** marks MFT entries as deleted but often retains significant metadata. **FAT** simply replaces the first character of the filename with the marker byte `0xE5` and frees the file's cluster chain in the FAT.

- **Journaling**:  
  **Ext3/4** and **NTFS** are journaling filesystems, enhancing data integrity but sometimes complicating recovery due to overwrites. **FAT** lacks journaling, making it more vulnerable to corruption but also leaving raw data more directly accessible for carving.

- **File Name and Path Storage**:  
  Ext separates names (in directory entries) from inodes, while NTFS stores file name attributes directly within MFT records. FAT embeds the file name directly in the directory entry.

___
## Exercise 01: perform a forensic disk imaging
When in possession of a device holding digital evidence (e.g. a hard disk, USB drive, internal SSD...), the first crucial step is to make a copy of it, so as to avoid any accidental modification of the original.

This process is called **disk imaging** and abides by the following principles:
1. the copy is an exact duplicate of the original device
2. the original device remains unaltered by the process

To detect unwanted (or intentional!) modifications introduced during subsequent procedures, it is important to record the **hash signature** of the original device. This step has to be performed during the first acquisition of the digital media.
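
To make the idea concrete, here is a minimal Python sketch that computes several digests of a file in a single pass, reading it in chunks so that large images do not need to fit in memory. The function name `hash_file` and the chunk size are illustrative choices, not part of any tool used in this laboratory.

```python
import hashlib


def hash_file(path, algorithms=("md5", "sha256"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for the file at `path`, read once in chunks."""
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as stream:
        while chunk := stream.read(chunk_size):
            for digest in digests.values():
                digest.update(chunk)  # every digest sees the same bytes
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

Calling `hash_file` on the acquired image and comparing the result with the digests recorded at acquisition time is enough to show that the copy has not changed since then.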

The target device is located at `/dev/loop0` in the virtual Unix machine. It is **important** to select the root device `loop0` and not any of its partitions (e.g. `loop0p1`) in order to perform a full disk image.

```
Usage: dcfldd [OPTION]...
Enhanced version of dd for forensics and security.

  bs=BYTES            force ibs=BYTES and obs=BYTES (default=32768)
  count=BLOCKS        copy only BLOCKS input blocks
  if=FILE             read from FILE instead of stdin
  of=FILE             write to FILE instead of stdout
  seek=BLOCKS         skip BLOCKS obs-sized blocks at start of output
  skip=BLOCKS         skip BLOCKS ibs-sized blocks at start of input
  pattern=HEX         use the specified binary pattern as input
  textpattern=TEXT    use repeating TEXT as input
  errlog=FILE         send error messages to FILE as well as stderr
  hash=NAME           do hash calculation (md5, sha1, sha256, sha384 or sha512)
  hashlog=FILE        send hash output to FILE instead of stderr
  hashwindow=BYTES    perform a hash on every BYTES amount of data
  status=[on|off]          display a continual status message on stderr
  statusinterval=N         update the status message every N blocks
  sizeprobe=[if|of|BYTES]  what to use as value to percentage indicator
  vf=FILE                  verify that FILE matches the specified input
  verifylog=FILE           send verify results to FILE instead of stderr
```

In [None]:
%%shell
# Task: perform a disk image of device /dev/loop0 and save it as whatever you prefer.
# How can I make `dcfldd` also compute the MD5 and SHA256 hash of the original disk device?

dcfldd if=/dev/loop0 of=image.dd hash=md5,sha256 hashlog=hash.txt

1536 blocks (48Mb) written.
1600+0 records in
1600+0 records out




In [None]:
%%shell
# Task: compare the MD5 and SHA256 hash digest of the original device (as provided by `dcfldd`) to the one of the disk image
md5sum image.dd
sha256sum image.dd

cat hash.txt

b657578d04bf0934e853116492cbb44b  image.dd
5a6c90218a26f6cd0413355266fb49ed8ae9e0fcdd1bcf4761e1d81dde4de713  image.dd

Total (md5): b657578d04bf0934e853116492cbb44b

Total (sha256): 5a6c90218a26f6cd0413355266fb49ed8ae9e0fcdd1bcf4761e1d81dde4de713




___
## Exercise 02: obtain filesystem info of the disk image
Each filesystem arranges data on the device in its own way. Even similar filesystems (e.g. Ext3 and Ext4), or two instances of the same filesystem, may be laid out differently depending on the total space available, the options selected when the filesystem was created, or the characteristics of the device itself (SSD, HDD...).

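The Sleuth Kit (abbreviated as TSK) provides two commands to obtain this information: `mmls`, which lists the partition layout of a disk image, and `fsstat`, which reports the details of a single filesystem.

Since `fsstat` works on one filesystem at a time, it is convenient to first extract the partition from the complete image using `dcfldd` with `if={your_image}.dd` (hint: look at the options `bs`, `skip` and `count`).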

```
usage: mmls [-i imgtype] [-o imgoffset] image [images]

usage: fsstat [-f fstype] [-i imgtype] [-o imgoffset] image
```
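
For a DOS (MBR) partition table, the layout information can also be read by hand from the first sector of the image. The following Python sketch is a simplified illustration under that assumption: it ignores extended partitions and GPT, and the function name `parse_mbr` is hypothetical.

```python
import struct

SECTOR_SIZE = 512


def parse_mbr(sector0):
    """Return the primary partitions described in the first 512-byte sector."""
    if len(sector0) < 512 or sector0[510:512] != b"\x55\xaa":
        raise ValueError("missing MBR boot signature 55 AA")
    partitions = []
    for slot in range(4):  # four 16-byte entries starting at byte 446
        entry = sector0[446 + 16 * slot : 446 + 16 * (slot + 1)]
        partition_type = entry[4]
        start_sector, sector_count = struct.unpack_from("<II", entry, 8)
        if partition_type != 0:  # type 0 means the slot is empty
            partitions.append(
                {"slot": slot, "type": partition_type,
                 "start": start_sector, "length": sector_count}
            )
    return partitions


if __name__ == "__main__":
    # Synthetic sector with one Linux (0x83) partition: start 2048, length 100352.
    sector = bytearray(512)
    struct.pack_into("<B3sB3sII", sector, 446, 0, b"\0\0\0", 0x83, b"\0\0\0", 2048, 100352)
    sector[510:512] = b"\x55\xaa"
    for part in parse_mbr(bytes(sector)):
        print(part, "byte offset:", part["start"] * SECTOR_SIZE)
```

The start sector and length are expressed in 512-byte sectors, so they can be passed directly as `skip` and `count` to `dcfldd` when `bs=512`.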

In [None]:
%%shell
# `mmls` provides information on the overall layout scheme of the disk image
# Task: list the partition table
mmls image.dd


DOS Partition Table
Offset Sector: 0
Units are in 512-byte sectors

      Slot      Start        End          Length       Description
000:  Meta      0000000000   0000000000   0000000001   Primary Table (#0)
001:  -------   0000000000   0000002047   0000002048   Unallocated
002:  000:000   0000002048   0000102399   0000100352   Linux (0x83)




In [None]:
%%shell
# Task: extract the Linux partition
dcfldd if=image.dd of=linux_part.dd skip=2048 count=100352 bs=512





In [None]:
%%shell
# `fsstat` provides information on a single filesystem partition
# Task: obtain information on the Ext4 partition within the disk image
fsstat linux_part.dd

# Question: what is the size of a single inode entry? What is the size of a block?

FILE SYSTEM INFORMATION
--------------------------------------------
File System Type: Ext4
Volume Name: 
Volume ID: a0ae383d23c34387b84470e486e6c1ec

Last Written at: 2025-10-16 12:01:29 (UTC)
Last Checked at: 2025-10-16 11:58:28 (UTC)

Last Mounted at: 2025-10-16 11:58:47 (UTC)
Unmounted properly
Last mounted on: /home/ldomeneghetti/Documents/DFB_Lab01_DataCarving/mount

Source OS: Linux
Dynamic Structure
Compat Features: Journal, Ext Attributes, Resize Inode, Dir Index
InCompat Features: Filetype, Extents, 64bit, Flexible Block Groups, 
Read Only Compat Features: Sparse Super, Large File, Huge File, Extra Inode Size

Journal ID: 00
Journal Inode: 8

METADATA INFORMATION
--------------------------------------------
Inode Range: 1 - 12545
Root Directory: 2
Free Inodes: 12514
Inode Size: 256

CONTENT INFORMATION
--------------------------------------------
Block Groups Per Flex Group: 16
Block Range: 0 - 50175
Block Size: 1024
Reserved Blocks Before Block Groups: 1
Free Blocks: 32278





___
## Exercise 03: recover deallocated files
A file can be deleted in several ways. A common technique (especially when the underlying device is an HDD) is simply to **deallocate** the data blocks holding the file's data, deallocate its inode entry and add a "deleted" timestamp to it.

Under these circumstances, it is easily possible to recover the original file's data, provided that none of its data blocks have been overwritten in the meantime.

```
usage: fls [-f fstype] [-i imgtype] [-o imgoffset] image [images] [inode]
```

In [None]:
%%shell
# `fls` is the `ls` equivalent when inspecting a disk image
fls -r linux_part.dd

# Question: what are the numbers displayed on the left-hand side of the output?

d/d 11:	lost+found
d/d 12:	Documents
+ r/r 15:	sample3.docx
+ r/r * 16:	my_dog.docx
d/d 1793:	Music
+ r/r * 17:	beepboop.wav
+ r/r 18:	funky.mp3
d/d 1794:	Pictures
+ r/r * 20:	cat.bmp
+ r/r 21:	dog.bmp
+ r/r 22:	f1.jpeg
+ r/r 23:	ferrara.jpg
d/d 27:	.Trash-1000
+ d/d 28:	info
++ r/r 30:	s4a.jpg.trashinfo
++ r/r 31:	mario.png.trashinfo
+ d/d 29:	files
++ r/r 26:	s4a.jpg
++ r/r 24:	mario.png
V/V 12545:	$OrphanFiles




In [None]:
# Task: roam around the filesystem. Then, try to list all files recursively


Given an inode number, it is easy to display the associated metadata information by using the command `istat`. To retrieve the associated data blocks, multiple commands can be used, namely `icat`, `blkcat` and `blkls`.

```
usage: istat [-f fstype] [-i imgtype] [-o imgoffset] image inum

usage: icat [-f fstype] [-i imgtype] [-o imgoffset] image [images] inum[-typ[-id]]

usage: blkcat [-f fstype] [-i imgtype] [-o imgoffset] [-P pooltype] [-k password] image [images] unit_addr [num]

usage: blkls [-f fstype] [-i imgtype] [-o imgoffset] image [images] [start-stop]
```
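
Reading data blocks by address boils down to a multiplication: block number times block size gives the byte offset inside the partition image. The Python sketch below illustrates this under the assumption of a 1024-byte block size (the value reported by `fsstat` for this image); the function names are hypothetical and this is not how the TSK tools are implemented.

```python
def read_blocks(image_path, first_block, count=1, block_size=1024):
    """Return `count` consecutive blocks starting at block address `first_block`."""
    with open(image_path, "rb") as image:
        image.seek(first_block * block_size)  # block address -> byte offset
        return image.read(count * block_size)


def trim_slack(data, file_size):
    """Drop the slack bytes that follow the end of the file in its last block."""
    return data[:file_size]
```

For example, `read_blocks("linux_part.dd", 21358)` would return the first block listed by `istat` for inode 21. Since files rarely end exactly on a block boundary, the file size reported by `istat` is needed to cut the last block at the right position.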

In [None]:
%%shell
# `istat` is used to display the metadata contained within an inode
# Task: display the metadata information regarding a regular file and a deleted one. What are the differences?
istat linux_part.dd 21

inode: 21
Allocated
Group: 0
Generation Id: 1414688395
uid / gid: 1000 / 1000
mode: rrw-r--r--
Flags: Extents, 
size: 750414
num of links: 1

Inode Times:
Accessed:	2025-10-16 11:58:52.388128696 (UTC)
File Modified:	2025-10-16 11:58:28.307791351 (UTC)
Inode Modified:	2025-10-16 11:58:28.307791351 (UTC)
File Created:	2025-10-16 11:58:28.306791337 (UTC)

Direct Blocks:
21358 21359 21360 21361 21362 21363 21364 21365 
21366 21367 21368 21369 21370 21371 21372 21373 
21374 21375 21376 21377 21378 21379 21380 21381 
21382 21383 21384 21385 21386 21387 21388 21389 
21390 21391 21392 21393 21394 21395 21396 21397 
21398 21399 21400 21401 21402 21403 21404 21405 
21406 21407 21408 21409 21410 21411 21412 21413 
21414 21415 21416 21417 21418 21419 21420 21421 
21422 21423 21424 21425 21426 21427 21428 21429 
21430 21431 21432 21433 21434 21435 21436 21437 
21438 21439 21440 21441 21442 21443 21444 21445 
21446 21447 21448 21449 21450 21451 21452 21453 
21454 21455 21456 21457 21458 21459 21460 



```
Usage:
       xxd [options] [infile [outfile]]
    or
       xxd -r [-s [-]offset] [-c cols] [-ps] [infile [outfile]]
Options:
    -a          toggle autoskip: A single '*' replaces nul-lines. Default off.
    -b          binary digit dump (incompatible with -ps). Default hex.
    -C          capitalize variable names in C include file style (-i).
    -c cols     format <cols> octets per line. Default 16 (-i: 12, -ps: 30).
    -E          show characters in EBCDIC. Default ASCII.
    -e          little-endian dump (incompatible with -ps,-i,-r).
    -g bytes    number of octets per group in normal output. Default 2 (-e: 4).
    -h          print this summary.
    -i          output in C include file style.
    -l len      stop after <len> octets.
    -n name     set the variable name used in C include output (-i).
    -o off      add <off> to the displayed file position.
    -ps         output in postscript plain hexdump style.
    -r          reverse operation: convert (or patch) hexdump into binary.
    -r -s off   revert with <off> added to file positions found in hexdump.
    -d          show offset in decimal instead of hex.
    -s [+][-]seek  start at <seek> bytes abs. (or +: rel.) infile offset.
    -u          use upper case hex letters.
    -R when     colorize the output; <when> can be 'always', 'auto' or 'never'. Default: 'auto'.
    -v          show version: "xxd 2025-08-24 by Juergen Weigert et al.".
```

In [None]:
%%shell
icat linux_part.dd 23 | xxd | head

00000000: ffd8 ffe0 0010 4a46 4946 0001 0201 012c  ......JFIF.....,
00000010: 012c 0000 ffed 0d18 5068 6f74 6f73 686f  .,......Photosho
00000020: 7020 332e 3000 3842 494d 0404 0000 0000  p 3.0.8BIM......
00000030: 001f 1c02 0000 0200 021c 0237 0008 3230  ...........7..20
00000040: 3131 3130 3034 1c02 3c00 0630 3634 3534  111004..<..06454
00000050: 3900 3842 494d 03ed 0000 0000 0010 012c  9.8BIM.........,
00000060: 0000 0001 0002 012c 0000 0001 0002 3842  .......,......8B
00000070: 494d 040d 0000 0000 0004 0000 0078 3842  IM...........x8B
00000080: 494d 03f3 0000 0000 0008 0000 0000 0000  IM..............
00000090: 0000 3842 494d 040a 0000 0000 0001 0000  ..8BIM..........




In [None]:
%%shell
# Task: display the content of another file with a different file type. What can you notice at the beginning?


In [None]:
%%shell
mkdir -p retrieved_data
icat linux_part.dd 20 > retrieved_data/cat.bmp
# Using the side menu "Files", you can navigate to /content/retrieved_data and download the recovered file



In [None]:
%%shell
# Task: use either `icat`, `blkcat` or `blkls` to recover a file from a given inode


___
## Exercise 04: recover orphan files
When a directory entry gets corrupted or deleted, an inode might lose the only hard link through which its file could be reached. The file then becomes inaccessible, even if the original inode metadata and data blocks are still properly allocated in the filesystem.

The `fsck` tool (for Ext filesystems, `fsck.ext4`, part of the `e2fsprogs` package) checks every inode for a mismatch between the link count stored in its metadata and the number of directory entries that actually reference it. Should it find an inode with an actual link count of 0 and a stored link count greater than 0, the inode is considered "orphaned" and is attached to the `lost+found` directory.
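
The check itself is simple to express. The Python sketch below is a toy model of the comparison only, with made-up inode numbers; it is not how `fsck` is implemented, and the function name `find_orphans` is hypothetical.

```python
from collections import Counter


def find_orphans(stored_links, directory_entries):
    """Return inodes whose stored link count is positive but that no entry references.

    stored_links:      {inode number: link count stored in the inode}
    directory_entries: iterable of (name, inode number) pairs found in directories
    """
    actual = Counter(inode for _name, inode in directory_entries)
    return sorted(
        inode for inode, links in stored_links.items()
        if links > 0 and actual[inode] == 0  # a Counter returns 0 for missing keys
    )


if __name__ == "__main__":
    stored = {21: 1, 22: 1, 40: 1, 41: 0}        # illustrative values only
    entries = [("dog.bmp", 21), ("f1.jpeg", 22)]
    print(find_orphans(stored, entries))
```

Inode 41 in the example is not reported because its stored link count is already 0, which corresponds to a regularly deleted file rather than an orphan.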

In [None]:
%%shell
# Task: perform `fsck` on the Ext4 partition, USING THE TERMINAL
fsck.ext4 -f linux_part.dd


In [None]:
%%shell
# Task: recover the newly attached orphan file (see exercise 03)


___
## Exercise 05: perform data carving
When the inode content is deallocated and overwritten, the only hope for a successful file recovery is to look in the data blocks (both allocated and deallocated). Each file type has a characteristic header (and sometimes trailer) signature, which turns the recovery process into a "matching game" against the signatures.

| Filetype | Header                              | Trailer                          |
|----------|--------------------------------------|----------------------------------|
| JPG      | `FF D8`                              | `FF D9`                          |
| PNG      | `89 50 4E 47 0D 0A 1A 0A`             | `49 45 4E 44 AE 42 60 82`        |
| PDF      | `25 50 44 46`                        | multiple                         |
| DOC      | `D0 CF 11 E0 A1 B1 1A E1`             | multiple                         |

Data carving is a process that can be performed on single files when embedded data is present (e.g. DOC, PDF...) or on whole disk images.
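
The "matching game" can be sketched in Python using the JPG and PNG signatures from the table above. This is a deliberately naive carver written for illustration: it assumes files are stored contiguously, stops at the first trailer it finds, and the names `SIGNATURES`, `carve` and `max_size` are hypothetical.

```python
SIGNATURES = {
    "jpg": (bytes.fromhex("FFD8"), bytes.fromhex("FFD9")),
    "png": (bytes.fromhex("89504E470D0A1A0A"), bytes.fromhex("49454E44AE426082")),
}


def carve(data, header, trailer, max_size=10 * 1024 * 1024):
    """Return a list of (offset, bytes) for every header...trailer span in `data`."""
    found = []
    start = data.find(header)
    while start != -1:
        # Look for the trailer after the header, within max_size bytes of it.
        end = data.find(trailer, start + len(header), start + max_size)
        if end != -1:
            stop = end + len(trailer)
            found.append((start, data[start:stop]))
            start = data.find(header, stop)       # continue after this file
        else:
            start = data.find(header, start + 1)  # no trailer: try next header
    return found


if __name__ == "__main__":
    header, trailer = SIGNATURES["png"]
    blob = b"\x00" * 10 + header + b"payload" + trailer + b"\xff" * 5
    for offset, content in carve(blob, header, trailer):
        print(offset, len(content))
```

Short signatures such as `FF D8` and `FF D9` also occur by chance or inside embedded thumbnails, so a carver of this kind produces false positives and truncated files; dedicated tools add further checks to reduce them.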

### **Task A: recover an embedded image from a DOC/PDF file**
Recover the file `my_dog.docx` using the techniques seen in Exercise 03, and try to open it using Microsoft Word or any other `.docx` opener.

When a **container file** (such as DOC/DOCX, PDF) gets corrupted, it may still be possible to retrieve the embedded files. Knowing that `my_dog.docx` contains a single ZIP archive (signature `504b0304`) that holds its data structure, perform data carving on it and extract the image.
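
A possible way to approach this in Python is to locate the ZIP signature and hand everything from that offset onward to the standard `zipfile` module. The sketch assumes the archive itself is intact once the leading bytes are skipped; the function name `list_embedded_zip` is hypothetical, and `zipfile.BadZipFile` is raised if the remaining data is not a readable archive.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = bytes.fromhex("504b0304")


def list_embedded_zip(data):
    """Return (offset, member names) of the first ZIP archive found in `data`."""
    start = data.find(ZIP_LOCAL_HEADER)
    if start == -1:
        return None
    # Everything from the signature onward is treated as the archive.
    with zipfile.ZipFile(io.BytesIO(data[start:])) as archive:
        return start, archive.namelist()
```

In a DOCX file, embedded pictures are typically stored under `word/media/`, so the returned member list is a quick way to spot the image before extracting it with `ZipFile.extract` or `unzip`.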

In [None]:
%%shell
# Task: recover the deleted file 'my_dog.docx'
icat linux_part.dd 16 > my_dog.docx



In [None]:
%%shell
# Tip 1: use `grep -o` with a regular expression to keep only the hexadecimal data from the ZIP signature onward
# Tip 2: `xxd` can be used to dump the hexadecimal data of a given file, but also to do the opposite process
xxd -p my_dog.docx | tr -d '\n' | grep -o '504b0304.*' | xxd -r -p - my_dog.zip




In [None]:
%%shell
# Task: unzip the extracted archive
unzip -d my_dog my_dog.zip

It is also possible to use dedicated tools that analyze binary data in search of embedded or hidden files. `binwalk` can both list and extract the data embedded in a given file. Use it to recover the image from `my_dog.docx`, and try it on other container files.

```
Analyzes data for embedded file types

Usage: binwalk [OPTIONS] [FILE_NAME]

Arguments:
  [FILE_NAME]  Path to the file to analyze

Options:
  -L, --list                   List supported signatures and extractors
  -q, --quiet                  Supress output to stdout
  -v, --verbose                During recursive extraction display *all* results
  -e, --extract                Automatically extract known file types
  -M, --matryoshka             Recursively scan extracted files
  -a, --search-all             Search for all signatures at all offsets
  -E, --entropy                Plot the entropy of the specified file
  -l, --log <LOG>              Log JSON results to a file
  -t, --threads <THREADS>      Manually specify the number of threads to use
  -x, --exclude <EXCLUDE>...   Do no scan for these signatures
  -y, --include <INCLUDE>...   Only scan for these signatures
  -C, --directory <DIRECTORY>  Extract files/folders to a custom directory [default: extractions]
  -h, --help                   Print help
  -V, --version                Print version
```

In [18]:
%%shell
binwalk my_dog.docx




In [None]:
%%shell
# Task: use `binwalk` on other container files


### **Task B: recover files from journal**
A **journaling file system** is a file system capable of keeping track of changes in a record-commit fashion. It is a mechanism to prevent data loss in the event of a system crash or power failure.

The **journal** is treated as a Unix file, has its own inode and can be extracted from a disk image partition. Depending on the implementation, a journal may contain both data and its metadata, or just the latter (at the expense of possible data loss).
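
When the journal does contain data, textual content can sometimes be found in it simply by scanning for runs of printable characters, much like the `strings` utility does. A minimal Python sketch of that idea follows; the function name and the default minimum length of 4 are illustrative choices.

```python
import re


def printable_strings(data, min_length=4):
    """Return the runs of at least `min_length` printable ASCII bytes in `data`."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_length).encode() + rb",}")
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    sample = b"\x00\x01password=hunter2\x00ab\x00/mnt/mount\xff"
    for text in printable_strings(sample):
        print(text)
```

Applied to the raw bytes of an extracted journal, the resulting list can then be filtered for keywords of interest.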

In [None]:
%%shell
# Hint: use `fsstat` to retrieve the journal inode number
fsstat linux_part.dd


In [None]:
%%shell
# Task: extract the journal and extract its data
icat linux_part.dd 8 > journal.raw

In [None]:
%%shell
# Hint: sometimes, we are not looking for binary files...
binwalk journal.raw
strings journal.raw | grep 'password'
strings journal.raw | grep 'mount'

### **Task C: recover all files from a disk image**
When dealing with an entire disk image, it is *inconvenient* to `xxd | grep` for each file type. Resorting to dedicated tools is a much more efficient option, especially when the amount and diversity of files is considerable.

For this exercise, `scalpel` will be used to perform a complete data carving on our disk image.

**Attention**: `scalpel` requires a configuration file as one of its arguments. In the Linux system provided, it is located at `/etc/scalpel.conf`. Please check your specific OS layout should it be any different.

In [None]:
%%shell
# Task: use `scalpel` to perform data carving on the disk image
scalpel -c /etc/scalpel.conf -o linux_part_recovery linux_part.dd
