# Chapter 9. Using PyTorch to fight cancer

This chapter covers
- Breaking a large problem into smaller, easier ones
- Exploring the constraints of an intricate deep learning problem, and deciding on a structure and approach
- Downloading the training data

## 9.1 Introduction to the use case
Our goal for this part of the book is to give you the tools to deal with situations where
things aren’t working, which is a far more common state of affairs than part 1 might have
led you to believe. We can’t predict every failure case or cover every debugging technique, but hopefully we’ll give you enough to not feel stuck when you encounter a new
roadblock. Similarly, we want to help you avoid situations with your own projects where
you have no idea what you could do next when your projects are under-performing.
Instead, we hope your ideas list will be so long that the challenge will be to prioritize!

In order to present these ideas and techniques, we need a context with some
nuance and a fair bit of heft to it. We’ve chosen automatic detection of malignant
tumors in the lungs using only a CT scan of a patient’s chest as input. We’ll be focusing on the technical challenges rather than the human impact, but make no mistake—even from just an engineering perspective, part 2 will require a more serious,
structured approach than we needed in part 1 in order to have the project succeed.

We chose this problem of lung tumor detection for a few reasons. **The primary reason is that the problem itself is unsolved!** This is important, because we want to make
it clear that you can use PyTorch to tackle cutting-edge projects effectively. We hope
that increases your confidence in PyTorch as a framework, as well as in yourself as a
developer. Another nice aspect of this problem space is that while it’s unsolved, a lot
of teams have been paying attention to it recently and have seen promising results.
That means this challenge is probably right at the edge of our collective ability to
solve; we won’t be wasting our time on a problem that’s actually decades away from reasonable solutions. That attention on the problem has also resulted in a lot of high-quality papers and open source projects, which are a great source of inspiration and
ideas. This will be a huge help once we conclude part 2 of the book, if you are interested in continuing to improve on the solution we create. We’ll provide some links to
additional information in chapter 14.

This part of the book will remain focused on the problem of detecting lung
tumors, but the skills we’ll teach are general. Learning how to investigate, preprocess,
and present your data for training is important no matter what project you’re working
on. While we’ll be covering preprocessing in the specific context of lung tumors, the
general idea is that this is **what you should be prepared to do** for your project to succeed.
Similarly, setting up a training loop, getting the right performance metrics, and tying
the project’s models together into a final application are all general skills that we’ll
employ as we go through chapters 9 through 14.

## 9.2 Preparing for a large-scale project
The main differences between the work we did with convolutional models in chapter 8 and what we’ll do in part 2 are related to how much effort we put into things outside the model itself. In chapter 8, we used a provided, off-the-shelf dataset and did
little data manipulation before feeding the data into a model for classification. Almost
all of our time and attention were spent building the model itself, whereas now we’re
not even going to begin designing the first of our two model architectures until chapter 11. That is a direct consequence of having nonstandard data without prebuilt
libraries ready to hand us training samples suitable to plug into a model. We’ll have to
learn about our data and implement quite a bit ourselves.

Even when that’s done, this will not end up being a case where we convert the CT to
a tensor, feed it into a neural network, and have the answer pop out the other side. As
is common for real-world use cases such as this, a workable approach will be more complicated to account for confounding factors such as limited data availability, finite
computational resources, and limitations on our ability to design effective models. Please
keep that in mind as we build to a high-level explanation of our project architecture.

## 9.3 What is a CT scan, exactly?
Before we get too far into the project, we need to take a moment to explain what a CT
scan is. We will be using data from CT scans extensively as the main data format for
our project, so having a working understanding of the data format’s strengths, weaknesses, and fundamental nature will be crucial to utilizing it well. The key point we
noted earlier is this: CT scans are essentially 3D X-rays, represented as a 3D array of single-channel data. As we might recall from chapter 4, this is like a stacked set of grayscale PNG images.
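
To make the "stack of grayscale images" picture concrete, here is a minimal sketch in plain Python. The nested lists are only a stand-in for the array type we would use in practice, the names `make_volume` and `voxel_count` are ours for illustration, and the 128 × 512 × 512 shape is an assumed, plausible size rather than a property of any particular scan.

```python
def make_volume(n_slices, n_rows, n_cols, fill=0):
    """Build a nested-list stand-in for a single-channel 3D array."""
    return [[[fill] * n_cols for _ in range(n_rows)] for _ in range(n_slices)]


def voxel_count(shape):
    """Number of voxels in a (slices, rows, columns) volume."""
    n_slices, n_rows, n_cols = shape
    return n_slices * n_rows * n_cols


# A tiny volume: 3 slices, each a 4 x 5 grayscale image.
volume = make_volume(3, 4, 5)
volume[1][2][3] = 7    # slice index 1, row 2, column 3
one_slice = volume[1]  # a single 2D image taken from the stack

# Assumed shape for a whole-chest scan (illustrative only).
assumed_shape = (128, 512, 512)
print(voxel_count(assumed_shape))  # 33554432
```

Indexing with three numbers (slice, row, column) is the only real difference from working with a single 2D image.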

> **Voxel** \
A voxel is the 3D equivalent to the familiar two-dimensional pixel. It encloses a volume of space (hence, “volumetric pixel”), rather than an area, and is typically
arranged in a 3D grid to represent a field of data. Each of those dimensions will have
a measurable distance associated with it. Often, voxels are cubic, but for this chapter, we will be dealing with voxels that are rectangular prisms.

In addition to medical data, we can see similar voxel data in fluid simulations, 3D
scene reconstructions from 2D images, light detection and ranging (LIDAR) data for
self-driving cars, and many other problem spaces. Those spaces all have their individual quirks and subtleties, and while the APIs that we’re going to cover here apply generally, we must also be aware of the nature of the data we’re using with those APIs if we
want to be effective.
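
Because our voxels are rectangular prisms rather than cubes, one step along each array axis covers a different physical distance. The following is a minimal sketch of converting between array indexes and millimeters. It assumes the array axes line up with the measurement axes (real scans may also need an orientation correction), and the origin and spacing values are made up for illustration; `index_to_mm` and `mm_to_index` are hypothetical helper names.

```python
def index_to_mm(index, origin_mm, spacing_mm):
    """Convert an (index, row, col) position to millimeters along each axis."""
    return tuple(o + i * s for i, o, s in zip(index, origin_mm, spacing_mm))


def mm_to_index(point_mm, origin_mm, spacing_mm):
    """Convert a position in millimeters to the nearest (index, row, col)."""
    return tuple(round((p - o) / s) for p, o, s in zip(point_mm, origin_mm, spacing_mm))


# Assumed values: slices 2.5 mm apart, in-plane voxels 0.7 mm on a side.
spacing_mm = (2.5, 0.7, 0.7)
origin_mm = (-300.0, -180.0, -180.0)

position_mm = index_to_mm((10, 100, 200), origin_mm, spacing_mm)
print(position_mm)
print(mm_to_index(position_mm, origin_mm, spacing_mm))  # (10, 100, 200)
```

Rounding to the nearest index means the round trip is exact for voxel centers, while any other point snaps to the closest voxel.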

Each voxel of a CT scan has a numeric value that roughly corresponds to the average mass density of the matter contained inside. Most visualizations of that data show
high-density material like bones and metal implants as white, low-density air and lung
tissue as black, and fat and tissue as various shades of gray. Again, this ends up looking
somewhat similar to an X-ray, with some key differences.
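
As a rough illustration of how such a visualization might map density values to shades of gray, here is a minimal sketch. It assumes the values are on the Hounsfield scale commonly used for CT (air near -1000, water at 0, dense bone around +1000 and above), and the window bounds are our choice, not something the data dictates. `to_gray` is a hypothetical helper.

```python
def to_gray(value, low=-1000.0, high=1000.0):
    """Map a density value to a 0..255 gray level, clipping outside [low, high]."""
    clipped = max(low, min(high, value))
    return round(255 * (clipped - low) / (high - low))


# Assumed sample values: air, water, dense bone, metal implant.
for label, value in [("air", -1000), ("water", 0), ("bone", 1000), ("metal", 3000)]:
    print(label, to_gray(value))
```

Anything denser than the upper bound is simply shown as white, which is why bone and metal can look alike in a wide window.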

The primary difference between CT scans and X-rays is that whereas an X-ray is a
projection of 3D intensity (in this case, tissue and bone density) onto a 2D plane, a CT
scan retains the third dimension of the data. This allows us to render the data in a variety of ways: for example, as a grayscale solid, which we can see in figure 9.1.
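
One way to see what the projection throws away is to collapse one axis of a small volume by summing along it. This is only a crude stand-in for how a real X-ray forms, and `empty_volume` and `project_along_rows` are hypothetical names; the point is that two volumes with a dense voxel at different depths produce the same 2D image.

```python
def empty_volume(n_slices, n_rows, n_cols):
    """Nested-list volume of zeros, indexed as [slice][row][col]."""
    return [[[0] * n_cols for _ in range(n_rows)] for _ in range(n_slices)]


def project_along_rows(volume):
    """Collapse the row axis by summing: (slices, rows, cols) -> (slices, cols)."""
    return [
        [sum(plane[r][c] for r in range(len(plane))) for c in range(len(plane[0]))]
        for plane in volume
    ]


near = empty_volume(2, 3, 4)
near[0][0][2] = 5   # dense voxel close to the "front"

far = empty_volume(2, 3, 4)
far[0][2][2] = 5    # same voxel, deeper along the row axis

print(project_along_rows(near) == project_along_rows(far))  # True
print(near == far)                                          # False
```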

![](images/10.1.png)

This 3D representation also allows us to “see inside” the subject by hiding tissue types
we are not interested in. For example, we can render the data in 3D and restrict visibility to only bone and lung tissue, as in figure 9.2.

![](images/10.2.png)

CT scans are much more difficult to acquire than X-rays, because doing so requires a
machine like the one shown in figure 9.3 that typically costs upward of a million dollars new and requires trained staff to operate it. Most hospitals and some well-equipped clinics have a CT scanner, but they aren’t nearly as ubiquitous as X-ray
machines. This, combined with patient privacy regulations, can make it somewhat difficult to get CT scans unless someone has already done the work of gathering and
organizing a collection of them.

Figure 9.3 also shows an example bounding box for the area contained in the CT
scan. The bed the patient is resting on moves back and forth, allowing the scanner to
image multiple slices of the patient and hence fill the bounding box. The scanner’s
darker, central ring is where the actual imaging equipment is located.

A final difference between a CT scan and an X-ray is that the data is a digital-only format.
CT stands for computed tomography (https://en.wikipedia.org/wiki/CT_scan#Process).

## 9.4 The project: An end-to-end detector for lung cancer
Now that we’ve got our heads wrapped around the basics of CT scans, let’s discuss the
structure of our project. Most of the bytes on disk will be devoted to storing the CT
scans’ 3D arrays containing density information, and our models will primarily consume various subslices of those 3D arrays. We’re going to use five main steps to go
from examining a whole-chest CT scan to giving the patient a lung cancer diagnosis.

Our full, end-to-end solution shown in figure 9.4 will load CT data files to produce
a `Ct` instance that contains the full 3D scan, combine that with a module that performs `segmentation` (flagging voxels of interest), and then group the interesting voxels
into small lumps in the search for candidate `nodules`.

![](images/10.3.png)

> **Nodules** \
A mass of tissue made of proliferating cells in the lung is a tumor. A tumor can be benign
or it can be malignant, in which case it is also referred to as cancer. A small tumor in
the lung (just a few millimeters wide) is called a nodule. About 40% of lung nodules turn
out to be malignant—small cancers. It is very important to catch those as early as possible, and this depends on medical imaging of the kind we are looking at here.

The nodule locations are combined back with the CT voxel data to produce nodule candidates, which can then be examined by our nodule classification model to
determine whether they are actually nodules in the first place and, eventually, whether they’re malignant. This latter task is particularly difficult because malignancy might
not be apparent from CT imaging alone, but we’ll see how far we get. Last, each of
those individual, per-nodule classifications can then be combined into a whole-patient
diagnosis.
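
The flow just described can be sketched as a chain of function calls. Every callable below is a hypothetical placeholder, not the code we will write in later chapters, and the final rule (flag the patient if any nodule is judged malignant) is an assumption made to keep the sketch short.

```python
def run_pipeline(scan_id, load, segment, group, is_nodule, is_malignant):
    """Chain the stages of the project; all callables are placeholders."""
    ct = load(scan_id)           # load the full 3D scan
    flagged = segment(ct)        # flag voxels of interest
    candidates = group(flagged)  # group flagged voxels into candidate centers
    # Keep only the candidates that the classifier accepts as real nodules.
    nodules = [c for c in candidates if is_nodule(ct, c)]
    # Judge malignancy per nodule, then combine into one patient-level flag.
    malignant = [c for c in nodules if is_malignant(ct, c)]
    return {"nodules": nodules, "malignant": malignant, "flagged": bool(malignant)}


# Stub stages, just to show how the pieces plug together.
result = run_pipeline(
    "scan-a",
    load=lambda scan_id: {"id": scan_id},
    segment=lambda ct: "mask",
    group=lambda mask: [(10, 20, 30), (40, 50, 60), (70, 80, 90)],
    is_nodule=lambda ct, center: center[0] < 50,
    is_malignant=lambda ct, center: center[0] == 40,
)
print(result["flagged"])  # True
```

Passing each stage in as a callable mirrors the point of the design: any one stage can be developed, tested, or replaced without touching the others.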

In more detail, we will do the following:
1. Load our raw CT scan data into a form that we can use with PyTorch. Putting
raw data into a form usable by PyTorch will be the first step in any project you
face. The process is somewhat less complicated with 2D image data and simpler
still with non-image data.
2. Identify the voxels of potential tumors in the lungs using PyTorch to implement
a technique known as **segmentation**. This is roughly akin to producing a heatmap
of areas that should be fed into our classifier in step 3. This will allow us to focus
on potential tumors inside the lungs and ignore huge swaths of uninteresting
anatomy (a person can’t have lung cancer in the stomach, for example).
3. Group interesting voxels into lumps: that is, candidate nodules (see figure 9.5
for more information on nodules). Here, we will find the rough center of each
hotspot on our heatmap.
4. Classify candidate nodules as actual nodules or non-nodules using 3D convolution.
5. Diagnose the patient using the combined per-nodule classifications.
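
To give a feel for the grouping in step 3, here is a minimal sketch of one common way to do it: find connected blobs of flagged voxels and report the center of each. It works on a small nested-list mask in plain Python, uses face-adjacent (6-connected) neighbors, and is an illustration of the idea rather than the implementation we will build; `group_voxels` is a hypothetical name.

```python
from collections import deque

# Offsets to the six face-adjacent neighbors of a voxel.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def group_voxels(mask):
    """Return the mean (index, row, col) of each connected blob of True voxels."""
    n_i, n_r, n_c = len(mask), len(mask[0]), len(mask[0][0])
    seen = set()
    centers = []
    for i in range(n_i):
        for r in range(n_r):
            for c in range(n_c):
                if not mask[i][r][c] or (i, r, c) in seen:
                    continue
                # Breadth-first search collects every voxel in this blob.
                queue = deque([(i, r, c)])
                seen.add((i, r, c))
                blob = []
                while queue:
                    ci, cr, cc = queue.popleft()
                    blob.append((ci, cr, cc))
                    for di, dr, dc in NEIGHBORS:
                        ni, nr, nc = ci + di, cr + dr, cc + dc
                        inside = 0 <= ni < n_i and 0 <= nr < n_r and 0 <= nc < n_c
                        if inside and mask[ni][nr][nc] and (ni, nr, nc) not in seen:
                            seen.add((ni, nr, nc))
                            queue.append((ni, nr, nc))
                centers.append(tuple(sum(v[k] for v in blob) / len(blob) for k in range(3)))
    return centers


# A 3 x 4 x 4 mask with a two-voxel blob and a separate single voxel.
mask = [[[False] * 4 for _ in range(4)] for _ in range(3)]
mask[0][0][0] = mask[0][0][1] = True
mask[2][3][3] = True
print(group_voxels(mask))  # [(0.0, 0.0, 0.5), (2.0, 3.0, 3.0)]
```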

The data we’ll use for training provides human-annotated output for both steps 3
and 4. This allows us to treat steps 2 and 3 (identifying voxels and grouping them into
nodule candidates) as almost a separate project from step 4 (nodule candidate classification). Human experts have annotated the data with nodule locations, so we can
work on either steps 2 and 3 or step 4 in whatever order we prefer.

### 9.4.1 Why can’t we just throw data at a neural network until it works?
Well, for starters, the majority of a CT scan is fundamentally uninteresting with
regard to answering the question, **“Does this patient have a malignant tumor?”** This
makes intuitive sense, since the vast majority of the patient’s body will consist of
healthy cells. In the cases where there is a malignant tumor, up to 99.9999% of the
voxels in the CT still won’t be cancer. That ratio is equivalent to a two-pixel blob of
incorrectly tinted color somewhere on a high-definition television, or a single misspelled word out of a shelf of novels.
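
The arithmetic behind that comparison is easy to check. The sketch below assumes a 1920 × 1080 display for "high-definition" and a 128 × 512 × 512 scan; both sizes are our assumptions for illustration.

```python
hd_pixels = 1920 * 1080              # assumed HD resolution: 2,073,600 pixels
blob_fraction = 2 / hd_pixels        # a two-pixel blob, just under one in a million

cancer_fraction = 1e-6               # the complement of 99.9999%
assumed_voxels = 128 * 512 * 512     # assumed scan size: 33,554,432 voxels
cancer_voxels = assumed_voxels * cancer_fraction

print(f"two-pixel blob: {blob_fraction:.2e} of the screen")
print(f"cancer voxels at that ratio: about {cancer_voxels:.0f}")
```

Under these assumptions, only a few dozen voxels out of more than 33 million would belong to the tumor.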

Can you identify the white dot in the three views of figure 9.5 that has been flagged as a nodule?

If you need a hint, the index, row, and column values can be used to help find the relevant blob of dense tissue. Do you think you could figure out the relevant properties of tumors given only images (and that means only the images—no index, row, and column information!) like these? What if you were given the entire 3D scan, not just three slices that intersect the interesting part of the scan?

![](images/9.5.png)

You might have seen elsewhere that end-to-end approaches for detection and classification of objects are very successful in general vision tasks. TorchVision includes end-to-end models like Faster R-CNN and Mask R-CNN, but these are typically trained on hundreds of thousands of images, and those datasets aren’t constrained by the number of samples from rare classes. The project architecture we will use has the benefit of working well with a more modest amount of data. So while it’s certainly theoretically possible to just throw an arbitrarily large amount of data at a neural network until it learns the specifics of the proverbial lost needle, as well as how to ignore the hay, it’s going to be practically prohibitive to collect enough data and wait for a long enough time to train the network properly. That won’t be the best approach, since the results are poor and most readers won’t have access to the compute resources to pull it off at all.

Our approach for solving the problem won’t use end-to-end gradient backpropagation to directly optimize for our end goal. Instead, we’ll optimize discrete chunks of the problem individually, since our segmentation model and classification model won’t be trained in tandem with each other. That might limit the top-end effectiveness of our solution, but we feel that this will make for a much better learning experience.

We feel that being able to focus on a single step at a time allows us to zoom in and concentrate on the smaller number of new skills we’re learning. Each of our two models will be focused on performing exactly one task. Similar to a human radiologist as
they review slice after slice of CT, the job gets much easier to train for if the scope is well contained. We also want to provide tools that allow for rich manipulation of the data. Being able to zoom in and focus on the detail of a particular location will have a huge impact on overall productivity while training the model compared to having to look at the entire image at once. Our segmentation model is forced to consume the entire image, but we will structure things so that our classification model gets a zoomed-in view of the areas of interest.

Step 3 (grouping) will produce and step 4 (classification) will consume data similar to the image in figure 9.6 containing sequential transverse slices of a tumor. This image is a close-up view of a (potentially malignant, or at least indeterminate) tumor,
and it is what we’re going to train the step 4 model to identify, and the step 5 model to classify as either benign or malignant. While this lump may seem nondescript to an untrained eye (or untrained convolutional network), identifying the warning signs of malignancy in this sample is at least a far more constrained problem than having to consume the entire CT we saw earlier. Our code for the next chapter will provide routines to produce zoomed-in nodule images like figure 9.6.
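
As a rough preview of the idea only, cutting a zoomed-in chunk out of a larger volume might look like the sketch below. It uses nested lists in plain Python, shifts the window so it stays inside the volume near the edges, and `crop_around` is a hypothetical helper rather than the routine we will actually write.

```python
def crop_around(volume, center, size):
    """Cut a chunk of the given (slices, rows, cols) size centered near `center`.

    Near an edge, the window is shifted inward so the chunk keeps its full size.
    """
    dims = (len(volume), len(volume[0]), len(volume[0][0]))
    bounds = []
    for axis_len, c, s in zip(dims, center, size):
        start = max(0, min(c - s // 2, axis_len - s))
        bounds.append((start, start + s))
    (i0, i1), (r0, r1), (c0, c1) = bounds
    return [[row[c0:c1] for row in plane[r0:r1]] for plane in volume[i0:i1]]


# A 4 x 6 x 6 volume whose values encode their own position (i*100 + r*10 + c).
volume = [[[i * 100 + r * 10 + c for c in range(6)] for r in range(6)] for i in range(4)]

chunk = crop_around(volume, center=(2, 3, 3), size=(2, 2, 2))
print(chunk[0][0][0])  # 122

corner = crop_around(volume, center=(3, 5, 5), size=(2, 2, 2))
print(corner[0][0][0])  # 244
```

Shifting the window instead of padding is one design choice among several; it keeps every chunk the same shape, which is convenient when batching samples for training.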

![](images/9.6.png)

![](images/9.7.png)

### 9.4.2 What is a nodule?
Simply put, a nodule is any of the myriad
lumps and bumps that might appear inside someone’s lungs. Some are problematic
from a health-of-the-patient perspective; some are not. The precise definition limits
the size of a nodule to **3 cm or less**, with a larger lump being a lung mass; but we’re
going to use nodule interchangeably for all such anatomical structures, since it’s a
somewhat arbitrary cutoff and we’re going to deal with lumps on both sides of 3 cm
using the same code paths. A nodule—a small mass in the lung—can turn out to be
benign or a malignant tumor (also referred to as cancer). From a radiological perspective, a nodule is really similar to other lumps that have a wide variety of causes: infection, inflammation, blood-supply issues, malformed blood vessels, and diseases other
than tumors.

The key part is this: the cancers that we are trying to detect will always be nodules, either suspended in the very non-dense tissue of the lung or attached to the lung wall. That means we can limit our classifier to only nodules, rather than have it examine all tissue. Being able to restrict the scope of expected inputs will help our classifier learn the task at hand.

In figure 9.8, we can see a stereotypical example of a malignant nodule. The smallest nodules we’ll be concerned with are only a few millimeters across, though the one in figure 9.8 is larger. As we discussed earlier in the chapter, this makes the smallest nodules approximately a million times smaller than the CT scan as a whole. More than half of the nodules detected in patients are not malignant.

![](images/9.8.png)

### 9.4.3 Our data source: The LUNA Grand Challenge
The CT scans we were just looking at come from the LUNA (**LUng Nodule Analysis**)
Grand Challenge. The LUNA Grand Challenge is the combination of an open dataset
with high-quality labels of patient CT scans (many with lung nodules) and a public
ranking of classifiers against the data. There is something of a culture of publicly sharing medical datasets for research and analysis; open access to such data allows
researchers to use, combine, and perform novel work on this data without having to
enter into formal research agreements between institutions (obviously, some data is
kept private as well). The goal of the LUNA Grand Challenge is to encourage
improvements in nodule detection by making it easy for teams to compete for high
positions on the leader board. A project team can test the efficacy of their detection
methods against standardized criteria (the dataset provided). To be included in the
public ranking, a team must provide a scientific paper describing the project architecture, training methods, and so on. This makes for a great resource to provide further
ideas and inspiration for project improvements.

We will be using the LUNA 2016 dataset. The LUNA site (https://luna16.grand-challenge.org/Description) describes two tracks for the challenge: the first track, “Nodule detection (NDET),” roughly corresponds to our step 2 (segmentation); and the second track,
“False positive reduction (FPRED),” is similar to our step 4 (classification). When the site
discusses “locations of possible nodules,” it is talking about a process similar to what we’ll
cover in chapter 13.

### 9.4.4 Downloading the LUNA data
Before we go any further into the nuts and bolts of our project, we’ll cover how to get
the data we’ll be using. It’s about 60 GB of data compressed, so depending on your
internet connection, it might take a while to download. Once uncompressed, it takes
up about **120 GB of space**; and we’ll need another 100 GB or so of cache space to
store smaller chunks of data so that we can access it more quickly than reading in the
whole CT.
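
Before starting the download, it can be worth checking that the target disk has room. The sketch below adds up the sizes quoted above and compares them with the free space reported by `shutil.disk_usage`. It assumes decimal gigabytes and that the check runs from the directory where the data will live; `required_gb` and `has_room` are hypothetical helpers.

```python
import shutil

GB = 10**9  # decimal gigabytes; an assumption about how the sizes are quoted

# Approximate sizes from the text, in GB.
SIZES_GB = {"archives": 60, "uncompressed": 120, "cache": 100}


def required_gb(keep_archives=True):
    """Total space needed, optionally counting the compressed archives too."""
    total = SIZES_GB["uncompressed"] + SIZES_GB["cache"]
    if keep_archives:
        total += SIZES_GB["archives"]
    return total


def has_room(free_bytes, keep_archives=True):
    """True if the given free space covers the estimated requirement."""
    return free_bytes >= required_gb(keep_archives) * GB


free_bytes = shutil.disk_usage(".").free
print(f"free: {free_bytes / GB:.0f} GB, need about {required_gb()} GB, "
      f"enough: {has_room(free_bytes)}")
```

Deleting each archive after extracting it lowers the requirement from roughly 280 GB to roughly 220 GB.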

The data we will be using comes in 10 subsets, aptly named `subset0` through `subset9`.
Unzip each of them so you have separate subdirectories like `code/data-unversioned/part2/luna/subset0`, and so on. On Linux, you’ll need the 7z decompression utility
(Ubuntu provides this via the p7zip-full package). Windows users can get an
extractor from the 7-Zip website (www.7-zip.org). Some decompression utilities will
not be able to open the archives; make sure you have the full version of the extractor
if you get an error.
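
Once the archives are extracted, a quick check that all ten subdirectories are where we expect them can save confusion later. This is a minimal sketch using only `pathlib`; the root path is the one given above, and `expected_subset_dirs` and `missing_subsets` are hypothetical helper names.

```python
from pathlib import Path


def expected_subset_dirs(root):
    """Paths for subset0 through subset9 under the given root."""
    return [Path(root) / f"subset{i}" for i in range(10)]


def missing_subsets(root):
    """Names of the expected subset directories that do not exist yet."""
    return [d.name for d in expected_subset_dirs(root) if not d.is_dir()]


missing = missing_subsets("code/data-unversioned/part2/luna")
if missing:
    print("still missing:", ", ".join(missing))
else:
    print("all ten subsets are in place")
```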

In addition, you need the `candidates.csv` and `annotations.csv` files. We’ve included
these files on the book’s website and in the GitHub repository for convenience, so
they should already be present in `code/data/part2/luna/*.csv`. They can also be
downloaded from the same location as the data subsets.
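
As a first look at the annotation data, the sketch below parses rows with the standard `csv` module. It assumes the file has a header row with the columns `seriesuid`, `coordX`, `coordY`, `coordZ`, and `diameter_mm`, so check that against your copy. The sample row is made up so that the example runs without the real file, and `read_annotations` is a hypothetical helper.

```python
import csv
import io


def read_annotations(handle):
    """Parse annotation rows into (series_uid, (x, y, z), diameter_mm) tuples."""
    rows = []
    for row in csv.DictReader(handle):
        center = (float(row["coordX"]), float(row["coordY"]), float(row["coordZ"]))
        rows.append((row["seriesuid"], center, float(row["diameter_mm"])))
    return rows


# Made-up sample in the assumed layout, standing in for the real file.
sample = io.StringIO(
    "seriesuid,coordX,coordY,coordZ,diameter_mm\n"
    "scan-a,-100.5,67.25,-231.0,6.5\n"
)
annotations = read_annotations(sample)
print(annotations[0])

# With the real file, the same function takes an open file handle:
# with open("code/data/part2/luna/annotations.csv", newline="") as f:
#     annotations = read_annotations(f)
```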

## 9.5 Conclusion
We’ve made major strides toward finishing our project! You might have the feeling
that we haven’t accomplished much; after all, we haven’t implemented a single line of
code yet. But keep in mind that you’ll need to do research and preparation as we have
here when you tackle projects on your own.

In this chapter, we set out to do two things:

1. Understand the larger context around our lung cancer-detection project
2. Sketch out the direction and structure of our project for part 2

If you still feel that we haven’t made real progress, please recognize that mindset as a
trap—understanding the space your project is working in is crucial, and the design work we’ve done will pay off handsomely as we move forward. We’ll see those dividends shortly, once we start implementing our data-loading routines in chapter 10.

## 9.6 Summary
- Our approach to detecting cancerous nodules will have five rough steps: data loading, segmentation, grouping, classification, and nodule analysis and diagnosis.
- Breaking down our project into smaller, semi-independent subprojects makes
teaching each subproject easier. Other approaches might make more sense for
future projects with different goals than the ones for this book.
- A CT scan is a 3D array of intensity data with approximately 32 million voxels,
which is around a million times larger than the nodules we want to recognize.
Focusing the model on a crop of the CT scan relevant to the task at hand will
make it easier to get reasonable results from training.
- Understanding our data will make it easier to write processing routines for our
data that don’t distort or destroy important aspects of the data. The array of CT
scan data typically will not have cubic voxels; mapping location information in
real-world units to array indexes requires conversion. The intensity of a CT scan
corresponds roughly to mass density but uses unique units.
- Identifying the key concepts of a project and making sure they are well represented in our design can be crucial. Most aspects of our project will revolve
around nodules, which are small masses in the lungs and can be spotted on a
CT along with many other structures that have a similar appearance.
- We are using the LUNA Grand Challenge data to train our model. The LUNA
data contains CT scans, as well as human-annotated outputs for classification and
grouping. Having high-quality data has a major impact on a project’s success.