### CT Pre-Processing


It is well known that CT images often contain significant noise and artifacts, which can hinder effective feature extraction, and not all CT regions are equally relevant to the diagnosis. To address these challenges, pre-processing techniques are essential for improving image quality. In this study, a sequential pre-processing pipeline was developed to enhance the overall quality of CT scans, consisting of the following steps:

- Body segmentation
- Homogeneous pixel spacing
- HU windowing
- Normalization
- Filtering

In order to choose the best filtering technique, the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) were evaluated for each method. These quantitative metrics allowed for an objective comparison of image quality after filtering, ensuring that the selected technique provided optimal noise reduction while preserving important anatomical structures.

### Body Segmentation

In an initial approach, body and lung segmentation were performed on CT slices from the LIDC dataset using intensity-based thresholding in the Hounsfield Unit (HU) domain. Body segmentation was achieved by isolating pixels with attenuation values above −500 HU, corresponding approximately to soft tissues and bones, while excluding air and background regions. The resulting binary mask was refined through morphological operations, including small object removal, binary closing, and hole filling, to eliminate noise and produce a continuous body region. The cleaned body mask was then applied to the CT image to exclude non-anatomical areas and improve the definition of the region of interest.
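The body segmentation steps described above can be sketched with `scipy.ndimage`. This is a minimal illustration, not the exact implementation used in the study: the function names, the minimum object size, and the number of closing iterations are assumptions chosen for the example.

```python
import numpy as np
from scipy import ndimage as ndi


def segment_body(hu_slice, threshold=-500.0, min_size=500, closing_iterations=3):
    """Return a boolean body mask for a 2D CT slice expressed in HU.

    min_size and closing_iterations are illustrative values, not the
    parameters used in the study.
    """
    # 1. Threshold: keep soft tissue and bone, drop air and background.
    mask = hu_slice > threshold

    # 2. Small object removal: drop connected components below min_size pixels.
    labels, n_labels = ndi.label(mask)
    if n_labels == 0:
        return mask
    sizes = np.bincount(labels.ravel())  # sizes[0] counts the background
    keep = sizes >= min_size
    keep[0] = False
    mask = keep[labels]

    # 3. Binary closing: bridge small gaps along the body contour.
    mask = ndi.binary_closing(mask, iterations=closing_iterations)

    # 4. Hole filling: include low-density interior regions such as the lungs.
    return ndi.binary_fill_holes(mask)


def apply_mask(hu_slice, mask, fill_value=-1000.0):
    """Replace every pixel outside the mask with an air-equivalent value."""
    return np.where(mask, hu_slice, fill_value)
```

Filling holes is what keeps the lungs inside the body mask even though their HU values fall below the threshold.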

Subsequently, lung segmentation was attempted within the body mask by selecting pixels with intensities between −1000 HU and −400 HU, corresponding to pulmonary parenchyma and air-filled regions. Morphological filters were again applied to reduce artifacts and ensure the connectivity of the lung structures, and the two largest connected components were retained to represent the left and right lungs. However, despite these refinements, the lung segmentation process exhibited instability, as reflected in inconsistent region detection across slices and occasional omission of lung structures, as illustrated in the example below. This instability motivated the use of body segmentation alone for subsequent feature extraction to ensure reproducibility and robustness.

!['Good Segmentation'](segmentacao_boa.png)  
!['Bad Segmentation'](segmentacao_ma.png)  
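For reference, the lung segmentation attempt can be sketched in the same way. The HU range and the choice of the two largest components come from the description above; the function name and the use of a single binary opening as the morphological clean-up are assumptions made for this example.

```python
import numpy as np
from scipy import ndimage as ndi


def segment_lungs(hu_slice, body_mask, low=-1000.0, high=-400.0, n_components=2):
    """Return a boolean lung mask restricted to the body mask (illustrative sketch)."""
    # Candidate lung pixels: low-density values located inside the body.
    candidate = (hu_slice >= low) & (hu_slice <= high) & body_mask

    # Assumed clean-up step: one opening to remove isolated pixels.
    candidate = ndi.binary_opening(candidate, iterations=1)

    # Keep only the largest connected components (left and right lung).
    labels, n_labels = ndi.label(candidate)
    if n_labels == 0:
        return np.zeros(candidate.shape, dtype=bool)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # ignore the background label
    largest = np.argsort(sizes)[::-1][:n_components]
    largest = largest[sizes[largest] > 0]
    return np.isin(labels, largest)
```

A fixed component count is one plausible source of the instability reported here: on slices where the lungs appear merged, split into fragments, or absent, the two largest components no longer correspond to the left and right lungs.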



### Homogeneous Pixel Spacing

This preprocessing technique ensures that each pixel represents the same physical distance in all directions. In medical imaging, scans from different patients or devices may have varying pixel spacing due to differences in acquisition protocols or equipment. If these differences are not corrected, measurements and features extracted from the images can be inconsistent or misleading. By resampling images to a uniform pixel spacing, we enable accurate quantitative analysis, fair comparison between scans, and reliable application of automated algorithms such as segmentation or radiomics. This step is essential for reproducibility and robustness in medical image processing workflows.

As a check, the pixel spacing of CT scans from different patients was compared:

| Patient ID   | Pixel Spacing (mm)   |  
|--------------|----------------------|  
| LIDC-IDRI-0001 | [0.703125, 0.703125] |  
| LIDC-IDRI-0002 | [0.681641, 0.681641] |  
| LIDC-IDRI-0003 | [0.820312, 0.820312] |  
| LIDC-IDRI-0004 | [0.822266, 0.822266] |  
| LIDC-IDRI-0005 | [0.664062, 0.664062] | 

Since pixel spacing differs between patients, a function rescales each 2D image to a common isotropic pixel spacing (equal in both directions) by calculating the required zoom factors and resampling the image by interpolation. The target isotropic spacing is 1 × 1 mm.
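A minimal sketch of such a resampling function is shown below, using `scipy.ndimage.zoom`. The function name, the linear interpolation order, and the border mode are assumptions for illustration; the interpolation settings used in the study are not stated here.

```python
import numpy as np
from scipy import ndimage as ndi


def resample_to_isotropic(image, pixel_spacing, new_spacing=1.0, order=1):
    """Resample a 2D slice so that each pixel covers new_spacing x new_spacing mm.

    pixel_spacing is the (row, column) spacing in mm, as in the table above.
    order=1 (linear interpolation) is an assumed default.
    """
    spacing = np.asarray(pixel_spacing, dtype=float)
    # A spacing of 0.7 mm resampled to 1 mm shrinks the pixel grid by a factor of 0.7.
    zoom_factors = spacing / new_spacing
    return ndi.zoom(image, zoom_factors, order=order, mode="nearest")
```

For example, a 512 × 512 slice from LIDC-IDRI-0001 (0.703125 mm spacing) becomes a 360 × 360 slice at 1 mm spacing, since 512 × 0.703125 = 360.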



### HU Windowing

Hounsfield Unit (HU) windowing is a technique used to enhance the visualization of specific tissue types in CT images by selecting a relevant range of HU values. Each tissue in the body has a characteristic HU value (e.g., air is approximately -1000 HU). By applying an appropriate HU range, irrelevant structures and noise outside the desired interval are suppressed, improving the contrast and clarity of the anatomical regions of interest, such as the lungs. This process facilitates more accurate segmentation and feature extraction in medical image analysis. For this study, an HU range of [-1200, 600] was applied to exclude undesired tissue densities effectively.
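One common way to apply such a window is to clip the HU values to the chosen range. The sketch below assumes clipping is the intended behaviour; the function name is hypothetical.

```python
import numpy as np


def apply_hu_window(hu_image, hu_min=-1200.0, hu_max=600.0):
    """Clip HU values to [hu_min, hu_max]; values outside the window saturate."""
    return np.clip(hu_image, hu_min, hu_max)
```

With clipping, dense structures above 600 HU (for example metal implants) are mapped to the upper bound rather than removed, which limits their influence on the subsequent normalization.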

### Normalization

The normalization function rescales the intensity values of each CT image slice to the [0, 1] range by subtracting the minimum value and dividing by the intensity range (maximum minus minimum). This step ensures that all images have a consistent intensity scale, which is essential for robust quantitative analysis and for the application of machine learning algorithms. By normalizing the data, variations due to acquisition parameters or scanner differences are minimized, facilitating fair comparison between images and improving the stability and convergence of subsequent processing steps.
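The min-max rescaling described above can be written in a few lines. The function name and the handling of constant slices (returning zeros to avoid division by zero) are assumptions of this sketch.

```python
import numpy as np


def min_max_normalize(image):
    """Rescale a slice to [0, 1] using its own minimum and maximum."""
    image = np.asarray(image, dtype=np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # Constant slice: the intensity range is zero, so return zeros.
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)
```

Because the minimum and maximum are taken per slice, applying this after HU windowing means that a slice spanning the full window maps -1200 HU to 0 and 600 HU to 1.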

### Filtering

Filtering is a crucial step in CT preprocessing as it helps to reduce noise and enhance the quality of the images, ensuring that important anatomical structures are preserved. CT scans often contain artifacts and random variations in pixel intensity that can obscure critical details, making it challenging to extract meaningful features. By applying appropriate filtering techniques, such as Gaussian, Median, Gabor, Adaptive Non-Local Means (ANLM), Block-Matching and 3D Filtering (BM3D) and Laplacian of Gaussian (LoG) filters, the signal-to-noise ratio is improved, facilitating more accurate segmentation, feature extraction, and analysis. This step is particularly important in medical imaging, where the clarity and precision of the data directly impact diagnostic accuracy and the reliability of automated algorithms.


| Filter          | Pros                                      | Cons                                      |  
|------------------|------------------------------------------|-------------------------------------------|  
| Gaussian  | Smooths noise effectively, easy to apply | May blur edges and fine details           |  
| Median    | Preserves edges while reducing noise     | Computationally expensive for large images|  
| Gabor     | Good for edge/texture, preserves structures | Computationally intensive, needs tuning   |  
| ANLM | Excellent noise reduction, preserves fine details | Very slow, high memory, needs noise estimate |  
| BM3D            | State-of-the-art denoising, preserves details | Computationally intensive, complex to implement |  
| LoG | Enhances edges, detects blobs effectively | Sensitive to noise, may over-enhance edges       |  
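To illustrate the trade-offs in the table, the sketch below applies the two simplest filters, Gaussian and Median, to a synthetic noisy image using `scipy.ndimage`. The test image, noise level, and filter parameters are assumptions chosen for the example and are not the settings used in the benchmark.

```python
import numpy as np
from scipy import ndimage as ndi

rng = np.random.default_rng(0)

# Synthetic "clean" slice: a bright disk on a dark background, values in [0, 1].
yy, xx = np.mgrid[:128, :128]
clean = ((yy - 64) ** 2 + (xx - 64) ** 2 <= 40 ** 2).astype(float)

# Add zero-mean Gaussian noise (standard deviation chosen for illustration).
noisy = clean + rng.normal(0.0, 0.1, clean.shape)

# Apply the two filters with illustrative parameters.
denoised = {
    "Gaussian": ndi.gaussian_filter(noisy, sigma=1.0),
    "Median": ndi.median_filter(noisy, size=3),
}


def mse(a, b):
    """Mean squared error between two images of the same shape."""
    return float(np.mean((a - b) ** 2))


print(f"noisy     MSE vs clean: {mse(noisy, clean):.5f}")
for name, image in denoised.items():
    print(f"{name:<9} MSE vs clean: {mse(image, clean):.5f}")
```

On a piecewise-constant image like this one, the Gaussian filter pays for its noise reduction with error concentrated along the disk boundary, which is the edge blurring listed in the table.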

The PSNR and SSIM equations are as follows:

**PSNR:**  
$$
PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right)
$$  
Where $MAX_I$ is the maximum possible pixel value of the image and $MSE$ is the Mean Squared Error.  

**SSIM:**  
$$
SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
$$  
Where:  
- $\mu_x, \mu_y$: Mean of $x$ and $y$  
- $\sigma_x^2, \sigma_y^2$: Variance of $x$ and $y$  
- $\sigma_{xy}$: Covariance of $x$ and $y$  
- $C_1, C_2$: Stabilization constants to avoid division by zero.  
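Both formulas translate directly into NumPy. The sketch below evaluates SSIM once over the whole image, exactly as written above; library implementations such as scikit-image's `structural_similarity` instead average SSIM over local windows, so their values will generally differ. The function names and the constants $C_1 = (0.01 \cdot MAX_I)^2$ and $C_2 = (0.03 \cdot MAX_I)^2$ are assumptions following common practice.

```python
import numpy as np


def psnr(reference, test, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB; max_value is MAX_I (1.0 after normalization)."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


def global_ssim(x, y, max_value=1.0, k1=0.01, k2=0.03):
    """SSIM computed from whole-image statistics (single window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * max_value) ** 2  # assumed stabilization constants
    c2 = (k2 * max_value) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

As a worked check, an MSE of 0.01 on images scaled to [0, 1] gives $10 \cdot \log_{10}(1 / 0.01) = 20$ dB.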

These filters were benchmarked on the first 580 patients of the LIDC dataset. The obtained results are as follows:

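```python
import pandas as pd

# Load the per-image benchmark results (one row per DICOM file and filter).
df = pd.read_csv('preprocess_benchmark_results.csv')

# The file path is not needed for the aggregate statistics.
df.drop(columns=['dicom_path'], inplace=True)

# Mean, median and standard deviation of each metric, grouped by filter.
grouped_stats = df.groupby('filter').agg(['mean', 'median', 'std'])
grouped_stats
```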

| Filter   | PSNR mean | PSNR median | PSNR std | SSIM mean | SSIM median | SSIM std |
|----------|-----------|-------------|----------|-----------|-------------|----------|
| ANLM     | 15.913364 | 15.729271   | 2.238481 | 0.797254  | 0.813413    | 0.075295 |
| BM3D     | 16.551346 | 16.41359    | 2.38335  | 0.69156   | 0.704446    | 0.090737 |
| Gabor    | 14.649563 | 12.470679   | 6.472202 | 0.579532  | 0.553077    | 0.149944 |
| Gaussian | 16.025595 | 15.874246   | 2.24666  | 0.801635  | 0.825032    | 0.073429 |
| LoG      | 9.901187  | 9.592167    | 2.837987 | 0.267234  | 0.23665     | 0.075229 |
| Median   | 15.917302 | 15.747121   | 2.248857 | 0.77668   | 0.812073    | 0.107507 |



![Filtering Image Comparison](comparacao_filtros.png)

Among all the tested filters, the Gaussian, Median, and ANLM filters achieved the best overall performance. The Gaussian filter obtained the highest mean SSIM (0.8016), indicating the best structural preservation, while also maintaining a high mean PSNR (16.03). The Median and ANLM filters followed closely with slightly lower PSNR and SSIM values.

The BM3D filter, although commonly effective, showed moderate results in this case: it reached the highest mean PSNR (16.55) but a mean SSIM of only 0.6916. The Gabor filter performed poorly (mean PSNR 14.65, mean SSIM 0.580) and was the least consistent, with a PSNR standard deviation of 6.47. The LoG filter had the lowest mean PSNR (9.90) and SSIM (0.267), indicating significant image distortion.

Overall, the Gaussian and ANLM filters provided the best balance between noise reduction and image structure preservation. However, due to the risk of texture and contour over-smoothing in Gaussian filtering and the superior adaptability of the ANLM filter, the ANLM filter was selected for this study.

### Conclusion

The pre-processing pipeline applied in this study follows a sequential approach to enhance CT image quality and ensure consistency across datasets. The steps are as follows:

1. **Body Segmentation**: Isolates the anatomical region of interest by removing irrelevant areas such as air and background.
2. **Homogeneous Pixel Spacing**: Resamples the images to an isotropic resolution of 1x1 mm, ensuring uniformity in pixel dimensions.
3. **HU Windowing**: Applies a Hounsfield Unit range of [-1200, 600] to focus on relevant tissue densities while suppressing noise and irrelevant structures.
4. **Normalization**: Rescales intensity values to the [0, 1] range, minimizing variations due to acquisition differences.
5. **Filtering**: Utilizes the Adaptive Non-Local Means (ANLM) filter for noise reduction while preserving fine anatomical details.

!['Pre-Processing Results'](comparacao_processamento.png)