# Using deep learning to detect early signs of cognitive disease

## Abstract
[TODO: write abstract once all the other sections are finished]

## Introduction
Hand drawings require a varying degree of dexterity and cognitive ability, and have historically been used to assess the state of cognitive diseases (Brantjes and Bouma, 1991). The drawings can also be designed to stress a specific motor or cognitive skill; for example, drawings containing small details will require precision whereas drawings containing 3-dimensional representations will require the ability to perceive and reproduce perspective (Clare, 1983).

Previous research (Rouleau et al., 1992) has shown great potential in diagnosing Alzheimer's Disease (AD) from patient drawings, mainly clock drawings (Wolf-Klein et al., 1989), which require a combination of motor and cognitive skills (Agrell and Dehlin, 1998). More recent work, however, has focused on analysis of data produced by costly medical equipment such as brain imaging, typically magnetic resonance imaging (Khedher et al., 2015) but also magnetic resonance tomography (Almkvist and Winblad, 1999). Less invasive tests using blood biomarkers have been studied, but there is still a need for non-invasive, inexpensive tests (Jellinger et al., 2008).

## Dataset

#### Glossary
- **Drawing template**: model given to subjects used as a reference for the hand-drawn sketch.
- **Subject drawing**: hand-drawn sketch performed by the subjects attempting to replicate a given drawing template.
- **Image scan**: image containing a subject drawing and a drawing template.
- **Drawing category**: instance of a drawing template type, for example a house.

#### Data collection
[TODO: talk about data collection, consent, etc.]

<img src="../images/drawing_templates.png">
<p align="center" style="text-align: center;">Figure 1. Sample of drawing templates.</p>

#### Characteristics
The dataset consists of image scans and cognitive evaluations. During the preprocessing stage, image scans with a missing evaluation, a missing subject drawing, or very poor image quality are discarded. The resulting dataset used for training contains 3960 subject drawings, roughly following a uniform distribution by drawing category:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Count</th>
      <th>Percent</th>
    </tr>
    <tr>
      <th>Drawing Category</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>casa</th>
      <td>416</td>
      <td>10.51%</td>
    </tr>
    <tr>
      <th>circulo</th>
      <td>441</td>
      <td>11.14%</td>
    </tr>
    <tr>
      <th>minimental</th>
      <td>436</td>
      <td>11.01%</td>
    </tr>
    <tr>
      <th>pico</th>
      <td>447</td>
      <td>11.29%</td>
    </tr>
    <tr>
      <th>cruz</th>
      <td>440</td>
      <td>11.11%</td>
    </tr>
    <tr>
      <th>muelle</th>
      <td>449</td>
      <td>11.34%</td>
    </tr>
    <tr>
      <th>cubo</th>
      <td>435</td>
      <td>10.98%</td>
    </tr>
    <tr>
      <th>cuadrado</th>
      <td>447</td>
      <td>11.29%</td>
    </tr>
    <tr>
      <th>triangulo</th>
      <td>449</td>
      <td>11.34%</td>
    </tr>
  </tbody>
</table>
<p align="center" style="text-align: center;">Figure 2. Drawing category breakdown.</p>

Each evaluation results in a coded **diagnosis**. The diagnosis can be one of the following values:
- **SANO**: Subject is deemed to be healthy
- **DCLNA**: [TODO: describe]
- **DCLM**: [TODO: describe]
- **DCLA**: [TODO: describe]
- **BAJA**: Low-impact cognitive disease detected
- **BAJA EA**: [TODO: describe]
- **NO EXISTE**: [TODO: describe]

The diagnosis distribution suffers from class imbalance, as shown in the following summary table:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Count</th>
      <th>Percent</th>
    </tr>
    <tr>
      <th>Diagnosis</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>SANO</th>
      <td>1984</td>
      <td>50.10%</td>
    </tr>
    <tr>
      <th>DCLNA</th>
      <td>1017</td>
      <td>25.68%</td>
    </tr>
    <tr>
      <th>DCLM</th>
      <td>851</td>
      <td>21.49%</td>
    </tr>
    <tr>
      <th>DCLA</th>
      <td>56</td>
      <td>1.41%</td>
    </tr>
    <tr>
      <th>BAJA</th>
      <td>33</td>
      <td>0.83%</td>
    </tr>
    <tr>
      <th>BAJA EA</th>
      <td>9</td>
      <td>0.23%</td>
    </tr>
    <tr>
      <th>NO EXISTE</th>
      <td>10</td>
      <td>0.25%</td>
    </tr>
  </tbody>
</table>
<p align="center" style="text-align: center;">Figure 3. Diagnosis breakdown.</p>

However, for the purpose of the analysis, the diagnosis label is simplified into two categories:
1. **Healthy**: Equivalent to *SANO*
1. **Not healthy**: Any other diagnosis

After that transformation, the resulting dataset contains a roughly 50/50 split for the diagnosis label. In this paper, we build a model able to predict whether a subject will have a *not healthy* label based on the subject drawings alone.
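The label simplification above can be sketched in a few lines. This is a minimal illustration, not the code used for the paper: the function names `simplify_diagnosis` and `binary_balance` are hypothetical, and the counts are copied from the diagnosis breakdown table.

```python
from collections import Counter

# Diagnosis counts copied from the diagnosis breakdown table.
DIAGNOSIS_COUNTS = {
    "SANO": 1984, "DCLNA": 1017, "DCLM": 851, "DCLA": 56,
    "BAJA": 33, "BAJA EA": 9, "NO EXISTE": 10,
}


def simplify_diagnosis(diagnosis: str) -> str:
    """Map a coded diagnosis to the binary label (hypothetical helper)."""
    return "healthy" if diagnosis.strip().upper() == "SANO" else "not healthy"


def binary_balance(counts: dict[str, int]) -> dict[str, float]:
    """Share of each binary label after collapsing the coded diagnoses."""
    totals = Counter()
    for diagnosis, count in counts.items():
        totals[simplify_diagnosis(diagnosis)] += count
    n = sum(totals.values())
    return {label: total / n for label, total in totals.items()}


balance = binary_balance(DIAGNOSIS_COUNTS)
print({label: f"{share:.2%}" for label, share in balance.items()})
# {'healthy': '50.10%', 'not healthy': '49.90%'}
```

Collapsing the six non-*SANO* codes into one label is what turns the heavily imbalanced seven-class problem into a near-balanced binary one.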

## Methodology

#### State of the Art
Deep neural networks with several convolutional layers currently produce the best results for classification tasks related to hand-drawn sketches (Sert and Boyacı, 2019) as well as more realistic paintings (Shen et al., 2019) or even digital sketches (Ha and Eck, 2018). Several specific deep neural network architectures are commonly used; for example, *GoogLeNet* was recently used to perform computer-aided diagnoses from medical scans (Balagourouchetty et al., 2019) and *ResNet34* for assessment of breast tumor cellularity (Rakhlin et al., 2019).

#### Preprocessing
Data preprocessing was a crucial component of the learning pipeline due to the following challenging characteristics of the dataset:
- The dataset is relatively small for the proper application of deep learning techniques (Soekhoe et al., 2016).
- Each image scan contains both the subject drawings and the drawing templates, presenting a non-trivial challenge to separate them during processing. Further, the placement of the subject drawings and drawing templates was inconsistent across image scans.
- The image scans frequently contain considerable noise, and the images were not scanned using consistent parameters, which led to image attributes like white balance being observably different across scans.
- The medium used, pencil and paper, leads to varying degrees of contrast in the subject drawing; some subjects pressed the pencil much harder against the paper than others.
- Some image scans contain multiple subject drawings, which required establishing criteria to consistently choose which one to evaluate.

<img src="../images/preprocessing_issues.png">
<p align="center" style="text-align: center;">Figure 4. Preprocessing issues.</p>

The following steps were taken in order to preprocess the dataset:
1. One image scan for each drawing template was hand-picked and the drawing template was cropped using the open source image editing software *GIMP*.
1. Using purpose-built software, the coordinates of the subject drawing and the template were extracted from the image scans. If more than one subject drawing was present, the subject drawing that most resembled the template was chosen after a visual comparison.
1. Using the coordinates as an approximate reference, the subject drawing was extracted from the image scan, and subsequently denoised and binarized using well-established computer vision techniques (Szeliski, 2011) such as non-local means denoising (Buades et al., 2011) and adaptive thresholding.

<img src="../images/preprocessed_drawing.png">
<p align="center" style="text-align: center;">Figure 5. Preprocessed drawing.</p>
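To illustrate why adaptive thresholding suits scans with inconsistent brightness and pencil pressure, the following is a minimal pure-Python sketch of mean-based adaptive thresholding. It is not the implementation used in the pipeline: the function name, `block_size` and `offset` are assumptions chosen for the example, the denoising step is omitted, and a real pipeline would more likely rely on an optimized computer vision library.

```python
def adaptive_threshold(gray, block_size=3, offset=10):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes ink (0) when it is darker than the mean of its
    block_size x block_size neighbourhood minus `offset`; otherwise it
    becomes paper (255). Neighbourhoods are clipped at the image borders.
    """
    height, width = len(gray), len(gray[0])
    radius = block_size // 2
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        binary.append(row)
    return binary


# The same stroke on a bright scan and on a darker scan (toy 3x3 images).
bright = [[200, 200, 200], [200, 170, 200], [200, 200, 200]]
dark = [[120, 120, 120], [120, 90, 120], [120, 120, 120]]
print(adaptive_threshold(bright)[1][1], adaptive_threshold(dark)[1][1])  # 0 0
```

Because each pixel is compared with its local neighbourhood rather than a single global cutoff, the stroke is recovered in both toy scans even though their overall brightness differs.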

#### Data Preparation
Two independent datasets were used to train different models. For one, drawing-evaluation pairs were used for the input and label, respectively (*ungrouped* dataset). For the other, image drawings were grouped by evaluation, such that each evaluation for a given subject represents a datapoint consisting of N drawings -- one for each template (*grouped* dataset).

For the grouped dataset, datapoints where not all drawings are available were discarded, and the evaluation results were encoded into a binary variable indicating whether the subject displayed any cognitive deficiency, or if it was considered healthy.
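A minimal sketch of how the grouped dataset could be assembled is shown below. The record layout `(evaluation_id, category, image, diagnosis)` and the function name are assumptions made for the example; the category names come from the drawing category table.

```python
from collections import defaultdict

# Drawing categories as listed in the drawing category table.
CATEGORIES = ("casa", "circulo", "minimental", "pico", "cruz",
              "muelle", "cubo", "cuadrado", "triangulo")


def build_grouped_dataset(records):
    """Group drawings by evaluation (hypothetical record layout).

    `records` is an iterable of (evaluation_id, category, image, diagnosis).
    Returns a list of (images, label) where `images` holds one drawing per
    category in a fixed order and `label` is 0 (healthy) or 1 (not healthy).
    """
    by_evaluation = defaultdict(dict)
    diagnoses = {}
    for evaluation_id, category, image, diagnosis in records:
        by_evaluation[evaluation_id][category] = image
        diagnoses[evaluation_id] = diagnosis

    grouped = []
    for evaluation_id, drawings in by_evaluation.items():
        if any(category not in drawings for category in CATEGORIES):
            continue  # discard evaluations with at least one missing drawing
        images = [drawings[category] for category in CATEGORIES]
        label = 0 if diagnoses[evaluation_id] == "SANO" else 1
        grouped.append((images, label))
    return grouped


# Toy input: one complete evaluation and one with a single drawing.
records = [("e1", c, f"e1-{c}.png", "SANO") for c in CATEGORIES]
records.append(("e2", "casa", "e2-casa.png", "DCLM"))
print(len(build_grouped_dataset(records)))  # 1
```

Keeping the categories in a fixed order matters because the model receives all drawings of an evaluation together, so each position in the input should always correspond to the same template.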

The data is divided into the following standard (Reitermanová, 2010) subsets:
1. **Train**: datapoints that are fed into the model to compute the error gradient at each iteration.
1. **Test**: datapoints used to compute a score at each iteration, which is then used to determine if early stopping is necessary.
1. **Validation**: left-out subset used to evaluate the model after training has completed.

Subsets are sampled randomly. If the class balance of the validation subset differs from the class balance of the whole dataset by more than 5%, the subsets are randomly resampled until the difference falls within that threshold. Because of the limited size of the dataset, this prevents cases where a model seemingly performed very well, but in reality the validation subset had a severe problem of class imbalance (Japkowicz and Stephen, 2002).
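The resampling rule can be sketched as follows. This is an illustration under stated assumptions rather than the original code: the 5% threshold is interpreted as an absolute difference in the share of *not healthy* labels, and the function name, `max_attempts` and `seed` are hypothetical.

```python
import random


def positive_rate(labels):
    """Share of labels equal to 1 (not healthy)."""
    return sum(labels) / len(labels)


def split_with_balance_check(labels, test_frac=0.2, val_frac=0.2,
                             tolerance=0.05, max_attempts=1000, seed=0):
    """Return (train, test, validation) lists of indices into `labels`.

    The split is redrawn until the validation class balance is within
    `tolerance` (absolute) of the class balance of the whole dataset.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    n_val = int(len(indices) * val_frac)
    n_test = int(len(indices) * test_frac)
    target = positive_rate(labels)
    for _ in range(max_attempts):
        rng.shuffle(indices)
        val = indices[:n_val]
        test = indices[n_val:n_val + n_test]
        train = indices[n_val + n_test:]
        if abs(positive_rate([labels[i] for i in val]) - target) <= tolerance:
            return train, test, val
    raise RuntimeError("no split met the class balance tolerance")


labels = [0] * 220 + [1] * 220  # toy, perfectly balanced dataset
train, test, val = split_with_balance_check(labels)
print(len(train), len(test), len(val))  # 264 88 88
```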

#### Model Training
All model architectures tried can be categorized as deep neural networks. Some of the models are exact copies of known models such as *GoogLeNet* or *ResNet34* (He et al., 2015), which are known for performing well on the ImageNet dataset and for their adaptability to transfer learning (Talo et al., 2019), including applicability to sketch drawings (Ballester and Araujo, 2016). For those models, the input images were resized to match what the model expects (224x224 in most cases). The models prefixed with *QD-* are heavily inspired by *sketch-rnn*, the architecture that Ha and Eck (2018) applied to the *Quick, Draw!* dataset (see fig. 11). A number of hyperparameters are identified, such as batch size and the ratios of train-to-test and test-to-validation subset size; all combinations of parameter values are explored exhaustively. The candidate values were chosen based on other well-performing models in the computer vision domain (Russakovsky et al., 2015). Most of the parameters had little effect on model performance beyond fine-tuning, although bigger kernel and batch sizes yielded marginally improved accuracy.

<table>
    <thead>
        <tr style="text-align: right;">
            <th>Training Parameter</th>
            <th>Values</th>
        </tr>
    </thead>
    <tbody>
        <tr><td>Kernel Size</td><td>3, 5, 7, 9, 11</td></tr>
        <tr><td>Batch Size</td><td>24, 32, 48, 56, 64</td></tr>
        <tr><td>Test Subset</td><td>20%, 25%</td></tr>
        <tr><td>Validation Subset</td><td>20%, 25%</td></tr>
    </tbody>
</table>
<p align="center" style="text-align: center;">Figure 6. Summary of training parameters.</p>
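The exhaustive exploration of the values in the table above amounts to iterating over their Cartesian product. A minimal sketch follows; the dictionary keys and the function name are hypothetical, while the values are those listed in the table.

```python
from itertools import product

# Candidate values copied from the training parameter table.
PARAMETER_GRID = {
    "kernel_size": [3, 5, 7, 9, 11],
    "batch_size": [24, 32, 48, 56, 64],
    "test_fraction": [0.20, 0.25],
    "validation_fraction": [0.20, 0.25],
}


def parameter_sets(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


all_sets = list(parameter_sets(PARAMETER_GRID))
print(len(all_sets))  # 5 * 5 * 2 * 2 = 100
```

With 100 combinations per model, each repeated over several trials, the cost of the search grows quickly, which is consistent with the computational overhead noted in this section.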

Removing some drawing categories was also attempted, but due to the high computational overhead required, this was not explored exhaustively. In all cases tried, excluding one or more drawing categories resulted in statistically significantly worse model performance (p < .05 for validation subset prediction accuracy).

During training, the images corresponding to each datapoint are altered using the *imgaug* library (Jung et al., 2019) with the intent to compensate for the total size of the dataset. For the grouped dataset, the pixels of all images in each datapoint are grouped into a single tensor which is used as input for the model.

<table>
    <thead>
        <tr style="text-align: right;">
            <th>Image Augmentation Parameter</th>
            <th>Values</th>
        </tr>
    </thead>
    <tbody>
        <tr><td>Cropping</td><td>[0%, 10%]</td></tr>
        <tr><td>Scaling</td><td>[90%, 110%]</td></tr>
        <tr><td>Translation</td><td>[-10%, 10%]</td></tr>
        <tr><td>Rotation</td><td>[-10%, 10%]</td></tr>
        <tr><td>Shearing</td><td>[-5%, 5%]</td></tr>
        <tr><td>Gaussian Blur (Standard Deviation)</td><td>[0, .5]</td></tr>
        <tr><td>Pixel Scalar Multiplication</td><td>[.9, 1.1]</td></tr>
    </tbody>
</table>
<p align="center" style="text-align: center;">Figure 7. Summary of image augmentation parameters.</p>

The following models were trained using nearly identical methodology, other than image grouping for the grouped dataset:
- **QD-Grouped**: model trained using the dataset grouped by evaluation.
- **QD-Ungrouped**: model trained using the dataset consisting of drawing-evaluation pairs.
- **QD-Transfer**: model pretrained using pairs of image-label from the *Quick, Draw!* dataset.
- **GoogLeNet-Transfer**: model pretrained using *GoogLeNet* architecture with the *ImageNet* dataset.
- **ResNet34-Transfer**: model pretrained using *ResNet-34* architecture with the *ImageNet* dataset.

For each parameter set, training is performed for 1000 iterations on batches of data (mini-batches) or until the early stopping criterion is met, that is, when the accuracy score of the test subset has not increased for 10 consecutive mini-batches. This process is repeated 3 times, and the average results are reported. This methodology was considered near-optimal after running 3, 5 and 7 trials for a random set of hyperparameters using the *QD-Grouped* model, which produced similarly stable results for all tested numbers of trials:
<table>
    <thead>
        <tr style="text-align: right;">
            <th></th>
            <th>Accuracy Mean</th>
            <th>Accuracy Variance</th>
        </tr>
        <tr>
            <th>Number of Trials</th>
            <th></th>
            <th></th>
        </tr>
    </thead>
    <tbody>
        <tr><th>3</th><td>56.75%</td><td>0.061</td></tr>
        <tr><th>5</th><td>57.52%</td><td>0.050</td></tr>
        <tr><th>7</th><td>56.59%</td><td>0.046</td></tr>
    </tbody>
</table>
<p align="center" style="text-align: center;">Figure 8. Number of trials comparison.</p>
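The early stopping rule used during training can be sketched as a small loop. This is an illustration only: `step` stands in for whatever trains on one mini-batch and returns the test subset accuracy, and the function name is hypothetical. The defaults mirror the 1000 iterations and the 10 consecutive mini-batches stated above.

```python
def train_with_early_stopping(step, max_iterations=1000, patience=10):
    """Call `step()` once per mini-batch; it returns the test accuracy.

    Stops after `max_iterations`, or as soon as the best accuracy has not
    been beaten for `patience` consecutive mini-batches.
    Returns (iterations_run, best_accuracy).
    """
    best = float("-inf")
    stale = 0  # mini-batches since the last improvement
    for iteration in range(1, max_iterations + 1):
        accuracy = step()
        if accuracy > best:
            best, stale = accuracy, 0
        else:
            stale += 1
            if stale >= patience:
                return iteration, best
    return max_iterations, best


# Toy accuracy curve: improves for 5 mini-batches, then plateaus.
curve = iter([0.50, 0.52, 0.55, 0.56, 0.57] + [0.57] * 995)
result = train_with_early_stopping(lambda: next(curve))
print(result)  # (15, 0.57)
```

In the toy run the last improvement happens at mini-batch 5, so training halts at mini-batch 15, after 10 mini-batches without a better score.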

## Results
Several of the models indicate a correlation between model prediction and presence of cognitive disease. The trained models are evaluated using only the left-out validation subset. Because subsets are sampled pseudo-randomly, and to allow for a fair comparison across models, results are also compared with what we describe as a naive classifier. The naive classifier is a simple model which labels all inputs with the most common class; its accuracy corresponds to the class balance of the subset so, for example, a subset where 45% of subjects are labeled as *healthy* would have a naive classifier accuracy of 55%.

<table>
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Accuracy</th>
      <th>Δ Naive Classifier</th>
      <th>Area Under ROC</th>
    </tr>
    <tr>
      <th>Model</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr><th>QD-Grouped*</th><td><b>67.60%</b></td><td><b>+10.45%</b></td><td><b>0.595</b></td></tr>
    <tr><th>QD-Ungrouped</th><td>58.70%</td><td>+6.62%</td><td>0.571</td></tr>
    <tr><th>QD-Transfer</th><td>50.69%</td><td>-1.39%</td><td>0.495</td></tr>
    <tr><th>GoogLeNet-Transfer</th><td>56.19%</td><td>+4.49%</td><td>0.559</td></tr>
    <tr><th>ResNet34-Transfer</th><td>56.35%</td><td>+4.70%</td><td>0.559</td></tr>
  </tbody>
</table>
<p align="center" style="text-align: center;">Figure 9. Model performance comparison.</p>
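The naive classifier baseline behind the Δ column can be computed directly from the labels of a subset. The sketch below uses hypothetical function names and a made-up model accuracy of 60% purely to show the arithmetic; it does not reproduce any figure from the table.

```python
from collections import Counter


def naive_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def delta_vs_naive(model_accuracy, labels):
    """Difference between a model's accuracy and the naive baseline."""
    return model_accuracy - naive_accuracy(labels)


# Example from the text: 45% of subjects in the subset are healthy.
validation_labels = ["healthy"] * 45 + ["not healthy"] * 55
print(f"{naive_accuracy(validation_labels):.2%}")        # 55.00%
print(f"{delta_vs_naive(0.60, validation_labels):+.2%}")  # +5.00%
```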

#### Model Performance
The model built using the grouped dataset significantly outperformed all other models, achieving an accuracy of 67.60% and an area under ROC of 0.595 using the optimal choice of parameters:

<table>
    <thead>
        <tr style="text-align: right;">
            <th>Parameter Name</th>
            <th>Parameter Value</th>
        </tr>
    </thead>
    <tbody>
        <tr><td>Test Subset</td><td>20%</td></tr>
        <tr><td>Validation Subset</td><td>20%</td></tr>
        <tr><td>Kernel Size</td><td>9</td></tr>
        <tr><td>Batch Size</td><td>48</td></tr>
    </tbody>
</table>
<p align="center" style="text-align: center;">Figure 10. Optimal model training parameters.</p>

The QD models consist of two convolution layers with max pooling, followed by two dense layers before the final softmax layer. Although inspired by the prior work on *sketch-rnn*, this architecture lacks the recurrent component of that model.

<img src="../images/QD_model_architecture.png">
<p align="center" style="text-align: center;">Figure 11. QD model architecture.</p>
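To give a sense of how the kernel size interacts with the two convolution and pooling stages, the following sketch computes feature-map sizes for a stack of that shape. Everything beyond the layer count is an assumption: the 128x128 input size, unpadded convolutions with stride 1, and 2x2 max pooling are hypothetical choices, since the text does not state them.

```python
def conv_output(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_output(size, pool_size=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool_size


def feature_map_sizes(input_size, kernel_size, blocks=2):
    """Sizes after each conv and pool layer for `blocks` conv+pool stages."""
    sizes = [input_size]
    for _ in range(blocks):
        sizes.append(conv_output(sizes[-1], kernel_size))
        sizes.append(pool_output(sizes[-1]))
    return sizes


# Hypothetical 128x128 input with two of the kernel sizes that were explored.
print(feature_map_sizes(128, 9))  # [128, 120, 60, 52, 26]
print(feature_map_sizes(128, 3))  # [128, 126, 63, 61, 30]
```

Under these assumptions, a larger kernel covers more of a stroke per filter and leaves a smaller feature map for the dense layers.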

## Discussion

Several key insights can be derived from our results. Most importantly, early detection of cognitive diseases from subject drawings in a semi-automated way appears to be plausible.

Also, a custom model trained from scratch outperformed known models trained using transfer learning, even though conditions seemed favourable for transfer learning given the small dataset and multi-dimensional features. Further, the models trained using transfer learning barely outperformed the naive classifier, and one of them (QD-Transfer) fell below it with an area under ROC < 0.5 (see fig. 9).

Another surprising result was that excluding any drawing category had a significant negative effect on model performance, especially for the QD-Grouped model, whose accuracy decreased by over 5% with all other parameters left equal. From our understanding of the data, we posit that this result is due to the increased information for each evaluation being available simultaneously to the model in a single datapoint. Alone, some drawing categories may have more relevant information than others, but using all drawing categories consistently yields better-performing models.

This may also explain why the QD-Grouped model vastly outperformed all other models. Even though the much larger ungrouped dataset available during training would be expected to have a significant impact on model performance, grouping the drawings by evaluation allows the model to combine the information from multiple drawings to achieve better accuracy.

#### Future Work
One of the main areas left to explore is the space of parameters and model architectures. A completely exhaustive approach is not feasible due to the large amount of resources required for model training, but given that a relatively small number of model architectures were tried, it is likely that further exploration would yield improved results.

It may also be worth exploring why the transfer learning models did not perform well with this dataset, when other models did. It is possible that the larger number of model parameters requires more than 1000 iterations to converge.

For future experiments, changing the medium from paper to a digital drawing would potentially allow for a fully automated solution, avoiding most of the preprocessing currently necessary when using paper scans.

## References

[1]D. Ha and D. Eck, “A Neural Representation of Sketch Drawings,” in ICLR 2018, 2018.

[2]S. E. O’Bryant et al., “A Serum Protein–Based Algorithm for the Detection of Alzheimer Disease,” Arch Neurol, vol. 67, no. 9, pp. 1077–1081, Sep. 2010.

[3]M. Talo, U. B. Baloglu, Ö. Yıldırım, and U. Rajendra Acharya, “Application of deep transfer learning for automated brain abnormality classification using MR images,” Cognitive Systems Research, vol. 54, pp. 176–188, May 2019.

[4]K. A. Jellinger, B. Janetzky, J. Attems, and E. Kienzl, “Biomarkers for early diagnosis of Alzheimer disease: ‘ALZheimer ASsociated gene’– a new blood biomarker?,” Journal of Cellular and Molecular Medicine, vol. 12, no. 4, pp. 1094–1117, 2008.

[5]A. Rakhlin, A. Tiulpin, A. A. Shvets, A. A. Kalinin, V. I. Iglovikov, and S. Nikolenko, “Breast Tumor Cellularity Assessment Using Deep Neural Networks,” presented at the Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019, pp. 0–0.

[6]Z. Reitermanová, “Data Splitting,” 2010.

[7]K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[8]X. Shen, A. A. Efros, and M. Aubry, “Discovering Visual Patterns in Art Collections with Spatially-consistent Feature Learning,” arXiv:1903.02678 [cs], Mar. 2019.

[9]S. M. Clare, “Drawing Rules: The Importance of the Whole Brain for Learning Realistic Drawing,” Studies in Art Education, vol. 24, no. 2, pp. 126–130, Jan. 1983.

[10]O. Almkvist and B. Winblad, “Early diagnosis of Alzheimer dementia based on clinical and biological factors,” European Archives of Psychiatry and Clinical Neurosciences, vol. 249, no. 3, pp. S3–S9, Dec. 1999.

[11]L. Khedher, J. Ramírez, J. M. Górriz, A. Brahim, and F. Segovia, “Early diagnosis of Alzheimer’s disease based on partial least squares, principal component analysis and support vector machine using segmented MRI images,” Neurocomputing, vol. 151, pp. 139–150, Mar. 2015.

[12]L. Balagourouchetty, J. K. Pragatheeswaran, B. Pottakkat, and R. G, “GoogLeNet based Ensemble FCNet Classifier for Focal Liver Lesion Diagnosis,” IEEE Journal of Biomedical and Health Informatics, pp. 1–1, 2019.

[13]R. Szeliski, “Image processing,” in Computer Vision: Algorithms and Applications, R. Szeliski, Ed. London: Springer London, 2011, pp. 87–180.

[14]O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int J Comput Vis, vol. 115, no. 3, pp. 211–252, Dec. 2015.

[15]A. B. Jung et al., imgaug. 2019.

[16]A. Buades, B. Coll, and J.-M. Morel, “Non-Local Means Denoising,” Image Processing On Line, vol. 1, pp. 208–212, Sep. 2011.

[17]D. Soekhoe, P. van der Putten, and A. Plaat, “On the Impact of Data Set Size in Transfer Learning Using Deep Neural Networks,” in Advances in Intelligent Data Analysis XV, 2016, pp. 50–60.

[18]P. Ballester and R. M. Araujo, “On the Performance of GoogLeNet and AlexNet Applied to Sketches,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[19]M. Brantjes and A. Bouma, “Qualitative analysis of the drawings of alzheimer patients,” Clinical Neuropsychologist, vol. 5, no. 1, pp. 41–52, Jan. 1991.

[20]I. Rouleau, D. P. Salmon, N. Butters, C. Kennedy, and K. McGuire, “Quantitative and qualitative analyses of clock drawings in Alzheimer’s and Huntington’s disease,” Brain and Cognition, vol. 18, no. 1, pp. 70–87, Jan. 1992.

[21]G. P. Wolf‐Klein, F. A. Silverstone, A. P. Levy, M. S. Brod, and J. Breuer, “Screening for Alzheimer’s Disease by Clock Drawing,” Journal of the American Geriatrics Society, vol. 37, no. 8, pp. 730–734, 1989.

[22]M. Sert and E. Boyacı, “Sketch recognition using transfer learning,” Multimed Tools Appl, vol. 78, no. 12, pp. 17095–17112, Jun. 2019.

[23]N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, Jan. 2002.

[24]B. Agrell and O. Dehlin, “The clock-drawing test,” Age and ageing, vol. 27, no. 3, pp. 399–403, 1998.
