Reproduction and Analysis of bloodcells_dataset_detector.ipynb
Dataset and Source

The notebook bloodcells_dataset_detector.ipynb reproduces a convolutional neural network (CNN)–based blood cell classification task using the publicly available Blood Cells Image Dataset hosted on Kaggle:

Dataset: Blood Cells Image Dataset

Source: https://www.kaggle.com/datasets/unclesamulus/blood-cells-image-dataset

The dataset contains labeled microscopic images of peripheral blood cells organized into eight morphological classes. The images are provided in a curated format and are directly consumable by deep learning pipelines without additional restructuring.

Model Architecture

The implemented model is a sequential CNN constructed using TensorFlow/Keras. The architecture consists of three convolutional blocks followed by fully connected layers:

Convolutional layers with 32, 64, and 128 filters respectively

Each convolution uses a 3×3 kernel with ReLU activation

Max pooling layers reduce spatial resolution after each convolutional block

The convolutional backbone is followed by a flattening operation and two dense layers

The final output layer uses a softmax activation for multi-class classification over the eight classes
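The layer list above can be traced numerically to see where the spatial resolution and the parameter count end up. The following is a minimal sketch, not the notebook's code. It assumes a 128×128 RGB input, stride-1 convolutions without padding (the Keras Conv2D default), 2×2 max pooling, and one hidden dense layer of 128 units followed by the 8-unit output. The input size and the hidden width are hypothetical values chosen for illustration; the write-up does not state them.

```python
def conv_out(size, kernel=3):
    """Spatial size after a stride-1 convolution with no padding."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool


def summarize(input_hw=128, in_channels=3, filters=(32, 64, 128),
              dense_units=(128, 8), kernel=3):
    """Return (layer name, output shape, trainable parameters) per layer.

    input_hw and dense_units are assumed values, not taken from the notebook.
    """
    rows = []
    size, channels = input_hw, in_channels
    for f in filters:
        # Each filter has kernel*kernel*channels weights plus one bias.
        params = (kernel * kernel * channels + 1) * f
        size = conv_out(size, kernel)
        rows.append((f"conv{kernel}x{kernel}-{f}", (size, size, f), params))
        size = pool_out(size)
        rows.append(("maxpool2x2", (size, size, f), 0))
        channels = f
    features = size * size * channels
    rows.append(("flatten", (features,), 0))
    for units in dense_units:
        # Fully connected layer: one weight per input feature plus one bias.
        rows.append((f"dense-{units}", (units,), (features + 1) * units))
        features = units
    return rows


if __name__ == "__main__":
    for name, shape, params in summarize():
        print(f"{name:<14} {str(shape):<18} {params:>10,}")
```

Under these assumptions the feature map shrinks from 128 to 14 pixels per side, flattening yields 14 × 14 × 128 = 25,088 features, and the first dense layer holds about 3.2 million of the roughly 3.3 million trainable parameters. A different input size or dense width changes these totals substantially.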

Notably, batch normalization and dropout layers appear in the notebook but are commented out, so they are not active in the reproduced model. The model is trained with the Adam optimizer and categorical cross-entropy loss, with accuracy as the evaluation metric.
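To make the loss concrete, the sketch below computes softmax probabilities and the categorical cross-entropy for a single sample in plain Python. It illustrates the standard definitions rather than the notebook's code; the function names and the example logits are hypothetical. With a one-hot target, the loss reduces to the negative log of the probability assigned to the true class.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_index, eps=1e-12):
    """Loss for one sample with a one-hot target at true_index."""
    return -math.log(max(probs[true_index], eps))


if __name__ == "__main__":
    # Illustrative case: all eight logits equal, as for an uninformed model.
    uniform = softmax([0.0] * 8)
    print(categorical_cross_entropy(uniform, true_index=3))
```

For eight classes, a model that assigns equal probability to every class has a loss of ln 8 ≈ 2.08, which is a useful reference point when reading the first epochs of a training log.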

Classification Targets

The network is trained to classify images into the following eight blood cell categories:
Basophil
Eosinophil
Erythroblast
Immature Granulocyte (IG)
Lymphocyte
Monocyte
Neutrophil
Platelet

These classes represent common leukocyte and related cell morphologies typically encountered in peripheral blood imaging tasks.
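The mapping from class names to the integer indices behind the softmax output can be sketched as follows. This assumes the labels are derived from subdirectory names sorted alphanumerically, which is how the Keras directory-based loaders assign indices; whether the notebook uses such a loader is not stated here. The folder names are hypothetical stand-ins for the eight classes listed above.

```python
# Hypothetical folder names, one per class; the real dataset may differ.
CLASS_NAMES = [
    "basophil", "eosinophil", "erythroblast", "ig",
    "lymphocyte", "monocyte", "neutrophil", "platelet",
]


def build_index(names):
    """Map each class name to an index in alphanumeric order."""
    return {name: i for i, name in enumerate(sorted(names))}


def one_hot(index, num_classes):
    """One-hot target vector, the form categorical cross-entropy expects."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


CLASS_TO_INDEX = build_index(CLASS_NAMES)

if __name__ == "__main__":
    idx = CLASS_TO_INDEX["ig"]
    print(idx, one_hot(idx, len(CLASS_NAMES)))
```

Keeping this mapping explicit matters when reading predictions back: the position of the largest softmax output is only meaningful relative to the same ordering used at training time.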

Training Results

When executed as provided, the notebook reports:

Training accuracy of approximately 96%

Validation accuracy exceeding 90%

These figures are higher than those observed in several alternative baseline notebooks evaluated during this exercise, which suggests that the dataset–architecture combination is well suited to the classification task as defined.

Observations and Limitations

Despite strong quantitative performance, several limitations were identified during reproduction:

Dataset Curation
The dataset appears highly curated, with clean, centered images and minimal background variability. This level of visual uniformity is unlikely to reflect real clinical peripheral blood smears, which often contain overlapping cells, staining artifacts, platelet interference, and variable field composition.

Lack of Explicit Preprocessing
The pipeline does not include explicit preprocessing steps such as stain normalization, artifact handling, deduplication, or stratified balancing. That the model still reaches high accuracy without them suggests that the dataset has already been preprocessed implicitly through selection and curation.
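Two of the missing steps, deduplication and stratification, can be sketched with the standard library alone. This is an illustration under stated assumptions, not part of the notebook: samples are taken to be (image_bytes, label) pairs, duplicates are detected by exact byte content only, and stratification is applied to the train/validation split rather than to class rebalancing. The function names and the 20% validation fraction are hypothetical.

```python
import hashlib
import random
from collections import defaultdict


def deduplicate(samples):
    """Keep the first occurrence of each exact byte-identical image."""
    seen = set()
    unique = []
    for data, label in samples:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((data, label))
    return unique


def stratified_split(samples, val_fraction=0.2, seed=0):
    """Split so each class contributes about val_fraction to validation."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # At least one validation sample per class; a class with a single
        # sample therefore ends up with no training sample.
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val
```

Deduplicating before splitting is the important ordering: otherwise identical images can land on both sides of the split and inflate validation accuracy. Exact hashing does not catch re-encoded or near-duplicate images, which would need a perceptual comparison instead.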

Clinical Generalizability
While accuracy metrics are strong, the experimental setup does not evaluate robustness under realistic distribution shifts, such as class imbalance, mixed-cell fields, or ambiguous morphology. Performance on real clinical data is therefore untested, and it is likely to degrade when the model is applied directly to such data.
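One reason overall accuracy is a weak indicator under class imbalance can be shown with per-class recall and balanced accuracy (the mean of the per-class recalls). The sketch below uses the standard definitions; the function names and the synthetic labels in the example are illustrative and are not results from the notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


if __name__ == "__main__":
    # Synthetic, imbalanced example: 95 neutrophils and 5 basophils,
    # with a degenerate model that always predicts "neutrophil".
    y_true = ["neutrophil"] * 95 + ["basophil"] * 5
    y_pred = ["neutrophil"] * 100
    print(accuracy(y_true, y_pred))           # 0.95
    print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In this synthetic case the model never identifies a basophil, yet plain accuracy is 95%, while balanced accuracy drops to 50%. Reporting per-class recall alongside accuracy would make such failures visible in any evaluation on clinically realistic class mixes.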

Relevance to the Capstone Project

This notebook serves as a useful reference example of a clean, high-performance baseline for blood cell classification under idealized conditions. It demonstrates that standard CNN architectures can achieve strong performance when trained on curated datasets.

However, it also highlights a key gap that motivates the Capstone project: high benchmark accuracy does not necessarily translate to clinically reliable performance. The Capstone explicitly addresses this gap by emphasizing data realism, controlled preprocessing, and robustness to confounding factors such as platelet presence and mixed cellular contexts.

Overall, this reproduced notebook provides a valuable comparison point against which more clinically grounded approaches can be evaluated.

Summary

This notebook reproduces a convolutional neural network–based blood cell classification task using a publicly available Kaggle dataset composed of eight morphologically distinct blood cell classes. The approach reflects a common strategy in the literature: training standard CNN architectures on curated microscopic cell images to achieve high classification accuracy. While effective as a benchmark, this setup highlights a key limitation addressed in the Capstone project, namely that strong performance on curated datasets does not necessarily translate to robustness under clinically realistic conditions such as mixed-cell fields, platelet interference, or staining variability.