# Using deep learning to detect early signs of cognitive disease

## Abstract
[TODO: write abstract once all the other sections are finished]

## Introduction
Hand drawings require a varying degree of dexterity and cognitive ability, and have historically been used to assess the state of cognitive diseases (Brantjes and Bouma, 1991). The drawings can also be designed to stress a specific motor or cognitive skill; for example, drawings containing small details will require precision whereas drawings containing 3-dimensional representations will require the ability to perceive and reproduce perspective (Clare, 1983).

Previous research (Rouleau et al., 1992) has shown great potential for diagnosing Alzheimer's Disease (AD) from patient drawings, mainly clock drawings, which require a combination of motor and cognitive skills. However, more recent work has focused on analyzing data produced by costly medical equipment such as brain imaging, typically MRI (Khedher et al., 2015) but also MRT (Almkvist and Winblad, 1999). Less invasive tests using biomarkers from blood samples have been studied, but there is still a need for non-invasive, inexpensive tests (Jellinger et al., 2008).

## Methodology

#### Dataset
[TODO: introduce dataset]

#### Glossary
- **Drawing template**: model given to subjects used as a reference for the hand-drawn sketch.
- **Subject drawing**: hand-drawn sketch performed by the subjects attempting to replicate a given drawing template.
- **Image scan**: image containing a subject drawing and a drawing template.
- **Drawing category**: instance of a drawing template type, for example a house.


#### Preprocessing
Data preprocessing was a crucial component of the learning pipeline due to the following challenging characteristics of the dataset:
- The dataset contains approximately 7000 drawings, collected across 400 evaluations, which is relatively small for the proper application of deep learning techniques (Soekhoe et al., 2016).
- Each image scan contains both the subject drawings and the drawing templates, presenting a non-trivial challenge to separate them during processing. Further, the placement of the subject drawings and drawing templates was inconsistent across image scans.
- The image scans frequently contain substantial noise. Further, the images were not scanned using consistent parameters, which led to image attributes such as white balance differing observably across scans.
- The medium used, pencil and paper, leads to varying degrees of contrast in the subject drawing; some subjects pressed the pencil considerably harder against the paper than others.
- Some image scans contain multiple subject drawings, which required establishing criteria to consistently choose which one to evaluate.

<img src="../images/preprocessing_issues.png">
<p align="center" style="text-align: center;">Figure 1. Preprocessing issues.</p>

The following steps were taken in order to preprocess the dataset:
1. One image scan for each drawing template was hand-picked and the drawing template was cropped using the open source image editing software *GIMP*.
1. Using purpose-built software, the coordinates of the subject drawing and the template were extracted from the image scans. If more than one subject drawing was present, the subject drawing that most resembled the template was chosen.
1. Using the coordinates as an approximate reference, the subject drawing was extracted from the image scan, and subsequently denoised and binarized using well-established computer vision techniques (Szeliski, 2011).
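
The text does not name the specific denoising and binarization methods used. As one plausible sketch, the snippet below applies a 3x3 median filter followed by Otsu thresholding to an 8-bit grayscale crop, using NumPy only. The choice of these two techniques and all function names are assumptions made for illustration.

```python
import numpy as np

def median_filter3(gray: np.ndarray) -> np.ndarray:
    """3x3 median filter; borders are handled by replicating edge pixels."""
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    # Stack the 9 shifted views of the image and take the per-pixel median.
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.median(np.stack(windows), axis=0).astype(gray.dtype)

def otsu_threshold(gray: np.ndarray) -> int:
    """Return the gray level that maximizes between-class variance (Otsu)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # weight of the darker class
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean intensity
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = np.where(denom > 0, (mu_total * omega - mu) ** 2 / denom, 0.0)
    return int(np.argmax(sigma_b))

def binarize(gray: np.ndarray) -> np.ndarray:
    """Denoise, then map pencil strokes (dark) to 1 and paper (light) to 0."""
    smooth = median_filter3(gray)
    return (smooth <= otsu_threshold(smooth)).astype(np.uint8)
```

Because Otsu's method picks the threshold per image, it adapts to the varying pencil pressure and scan brightness described earlier, which is why it is used in this sketch rather than a fixed cut-off.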

<img src="../images/preprocessed_drawing.png">
<p align="center" style="text-align: center;">Figure 2. Preprocessed drawing.</p>

#### Data Preparation
Two independent datasets were used to train different models. For one, drawing-evaluation pairs were used for the input and label, respectively (*ungrouped* dataset). For the other, image drawings were grouped by evaluation, such that each evaluation for a given subject represents a datapoint consisting of 6 drawings -- one for each template (*grouped* dataset).

For the grouped dataset, datapoints for which not all drawings were available were discarded, and the evaluation results were encoded into a binary variable indicating whether the subject displayed any cognitive deficiency or was considered healthy.
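
A minimal sketch of how the grouped dataset could be assembled is shown below. The record layout `(evaluation_id, template_id, drawing, diagnosis)` and the `"healthy"` diagnosis string are hypothetical, since the dataset format is not described here.

```python
from collections import defaultdict

NUM_TEMPLATES = 6  # one drawing per template, as stated in the text

def build_grouped_dataset(records):
    """Group drawings by evaluation and attach a binary label.

    records: iterable of (evaluation_id, template_id, drawing, diagnosis).
    Returns a list of (drawings, label) pairs, where drawings is ordered by
    template_id and label is 1 for any cognitive deficiency, 0 for healthy.
    """
    by_eval = defaultdict(dict)
    diagnosis_of = {}
    for eval_id, template_id, drawing, diagnosis in records:
        by_eval[eval_id][template_id] = drawing
        diagnosis_of[eval_id] = diagnosis

    dataset = []
    for eval_id, drawings in by_eval.items():
        if len(drawings) != NUM_TEMPLATES:
            continue  # discard evaluations with missing drawings
        ordered = [drawings[t] for t in sorted(drawings)]
        label = 0 if diagnosis_of[eval_id] == "healthy" else 1  # assumed encoding
        dataset.append((ordered, label))
    return dataset
```

Ordering the drawings by template keeps each template in the same position for every datapoint.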

The data is divided into the following standard (Reitermanová, 2010) subsets:
1. **Train**: datapoints that are fed into the model to compute the error gradient at each iteration.
1. **Test**: datapoints used to compute a score at each iteration, which is then used to determine if early stopping is necessary.
1. **Validation**: left-out subset used to evaluate the model after training has completed.

Subsets are chosen randomly. If the class balance of the validation subset differs from that of the whole dataset by more than 5%, the subsets are resampled until this criterion is met. Given the limited size of the dataset, this guards against cases where a model appears to perform very well only because the validation subset is severely class-imbalanced (Japkowicz and Stephen, 2002).
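
The resampling rule can be sketched as a simple rejection loop. The version below assumes binary labels, reads the 5% tolerance as 5 percentage points, and uses hypothetical function names, default subset fractions, and a `max_tries` safeguard that are not part of the described method.

```python
import random

def class_balance(labels):
    """Fraction of positive (label == 1) datapoints."""
    return sum(labels) / len(labels)

def split_with_balance(labels, test_frac=0.2, val_frac=0.2, tolerance=0.05,
                       seed=0, max_tries=1000):
    """Return (train, test, validation) index lists.

    The split is redrawn until the class balance of the validation subset is
    within `tolerance` of the class balance of the whole dataset.
    """
    rng = random.Random(seed)
    n = len(labels)
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    target = class_balance(labels)
    indices = list(range(n))
    for _ in range(max_tries):
        rng.shuffle(indices)
        val = indices[:n_val]
        test = indices[n_val:n_val + n_test]
        train = indices[n_val + n_test:]
        if abs(class_balance([labels[i] for i in val]) - target) <= tolerance:
            return train, test, val
    raise RuntimeError("no split met the class-balance tolerance")
```

A stratified split would satisfy the same constraint by construction; the rejection loop is shown because it mirrors the procedure described above.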

#### Model Training
All model architectures tried can be categorized as deep neural networks. Some are heavily inspired by the architecture that Ha and Eck (2018) applied to the *Quick, Draw!* dataset; others are exact copies of known models such as *GoogLeNet* or *ResNet34* (He et al., 2016), which are known for performing well on the ImageNet dataset and for their adaptability to transfer learning (Talo et al., 2019), including applicability to sketch drawings (Ballester and Araujo, 2016). A number of hyperparameters were identified, such as batch size, the ratios of train-to-test and test-to-validation subset sizes, and which drawing categories (if any) were excluded; all combinations of hyperparameter values were explored exhaustively. Most of the parameters had little effect on model performance beyond fine-tuning, although it is worth noting that in all cases excluding one or more drawing categories resulted in statistically significantly (p < .05) worse model performance, as measured by the accuracy of predictions on the validation subset.

<table>
    <tr><th>Training Parameter</th><th>Values</th></tr>
    <tr><td>Drawing Categories Excluded</td><td>[0, 3]</td></tr>
    <tr><td>Convolution Kernel Size</td><td>8, 16, 32</td></tr>
    <tr><td>Batch Size</td><td>24, 32</td></tr>
    <tr><td>Test Subset</td><td>20%, 25%</td></tr>
    <tr><td>Validation Subset</td><td>20%, 25%</td></tr>
</table>
<p align="center" style="text-align: center;">Figure 3. Summary of training parameters.</p>
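
The exhaustive search over the parameters in the table above can be sketched with `itertools.product`. The dictionary keys are hypothetical names, and the range `[0, 3]` is interpreted here as the number of excluded drawing categories; enumerating which specific categories are excluded would enlarge the grid further.

```python
from itertools import product

# Values taken from the table of training parameters; key names are illustrative.
GRID = {
    "categories_excluded": [0, 1, 2, 3],
    "kernel_size": [8, 16, 32],
    "batch_size": [24, 32],
    "test_fraction": [0.20, 0.25],
    "validation_fraction": [0.20, 0.25],
}

def hyperparameter_sets(grid):
    """Yield one dict per combination of values (full Cartesian product)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

Under this interpretation the grid contains 4 × 3 × 2 × 2 × 2 = 96 hyperparameter sets.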

During training, the images corresponding to each datapoint are augmented using the *imgaug* library to compensate for the small size of the dataset. For the grouped dataset, the pixels of all images in each datapoint are concatenated into a single 2D matrix, which is used as input for the model.

<table>
    <tr><th>Image Augmentation Parameter</th><th>Values</th></tr>
    <tr><td>Cropping</td><td>[0%, 10%]</td></tr>
    <tr><td>Scaling</td><td>[90%, 110%]</td></tr>
    <tr><td>Translation</td><td>[-10%, 10%]</td></tr>
    <tr><td>Rotation</td><td>[-10%, 10%]</td></tr>
    <tr><td>Shearing</td><td>[-5%, 5%]</td></tr>
    <tr><td>Gaussian Blur (Standard Deviation)</td><td>[0, .5]</td></tr>
    <tr><td>Pixel Scalar Multiplication</td><td>[.9, 1.1]</td></tr>
</table>
<p align="center" style="text-align: center;">Figure 4. Summary of image augmentation parameters.</p>
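
For the grouped dataset, the concatenation of the drawings of a datapoint into a single 2D matrix might look like the sketch below. The concatenation axis is not specified in the text, so placing the drawings side by side (`axis=1`) is an assumption, as are the function name and the requirement that all drawings share one shape.

```python
import numpy as np

def concatenate_drawings(drawings, axis=1):
    """Join same-sized 2D drawings into one 2D matrix.

    axis=1 places the drawings side by side; axis=0 stacks them vertically.
    """
    shapes = {d.shape for d in drawings}
    if len(shapes) != 1:
        raise ValueError(f"all drawings must share one shape, got {shapes}")
    return np.concatenate(drawings, axis=axis)
```

With six 64x64 drawings, for example, this yields a 64x384 input matrix.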

The following models were trained using nearly identical methodology, other than image concatenation for the grouped dataset:
- **QD-Grouped**: model trained using the dataset grouped by evaluation.
- **QD-Ungrouped**: model trained using the dataset consisting of drawing-evaluation pairs.
- **QD-Transfer**: model pretrained using image-label pairs from the *Quick, Draw!* dataset.
- **GoogLeNet-Transfer**: model pretrained using *GoogLeNet* architecture with the *ImageNet* dataset.
- **ResNet34-Transfer**: model pretrained using *ResNet-34* architecture with the *ImageNet* dataset.

For each hyperparameter set, training is performed for 200 iterations on batches of data or until the early stopping criterion is met, that is, when the accuracy score on the test subset has stopped increasing for several iterations.
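
The stopping rule can be sketched as a patience counter. The text only says "several iterations", so the `patience` value of 5 is an assumption, and `train_step` and `evaluate` are hypothetical callables standing in for one training iteration and for scoring the test subset.

```python
def train_with_early_stopping(train_step, evaluate, max_iterations=200, patience=5):
    """Train until the score stops improving or max_iterations is reached.

    Returns (best_score, best_iteration, iterations_run).
    """
    best_score = float("-inf")
    best_iteration = 0
    stale = 0  # consecutive iterations without a new best score
    for iteration in range(1, max_iterations + 1):
        train_step()
        score = evaluate()
        if score > best_score:
            best_score, best_iteration, stale = score, iteration, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_score, best_iteration, iteration
```

In practice the model weights from `best_iteration` would typically be restored before the final evaluation; that step is omitted here.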

## Results
The trained models are evaluated using only the left-out validation subset. Because subsets are sampled pseudo-randomly, the class balance of the validation subset varies between runs; to allow for a fair comparison between models, results are also compared against what we describe as a naive classifier, which labels all inputs with the most common class. The accuracy of the naive classifier is determined by the class balance of the subset: for example, a subset where 45% of subjects are labeled as *healthy* would have a naive classifier accuracy of 55%.
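
The naive classifier baseline reduces to the frequency of the most common class. A minimal sketch, with hypothetical function names and label strings:

```python
from collections import Counter

def naive_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def delta_over_naive(model_accuracy, labels):
    """Improvement of a model over the naive classifier on the same subset."""
    return model_accuracy - naive_accuracy(labels)

# Example from the text: 45% healthy, 55% not healthy.
example_labels = ["healthy"] * 45 + ["impaired"] * 55
print(naive_accuracy(example_labels))  # 0.55
```

Reporting the difference rather than raw accuracy keeps models comparable even when their validation subsets have slightly different class balances.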

<table>
    <tr><th>Model</th><th>Accuracy</th><th>Δ Naive Classifier</th><th>Area Under ROC</th></tr>
    <tr><td>QD-Grouped*</td><td><b>76.19%</b></td><td><b>+23.81%</b></td><td><b>0.715</b></td></tr>
    <tr><td>QD-Transfer</td><td>59.74%</td><td>+8.25%</td><td>0.571</td></tr>
    <tr><td>QD-Ungrouped</td><td>56.70%</td><td>+6.19%</td><td>0.558</td></tr>
    <tr><td>GoogLeNet-Transfer</td><td>59.09%</td><td>+6.49%</td><td>0.580</td></tr>
    <tr><td>ResNet34-Transfer</td><td>54.55%</td><td>+1.95%</td><td>0.469</td></tr>
</table>
<p align="center" style="text-align: center;">Figure 5. Model performance comparison.</p>
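
For reference, the area under the ROC curve reported in the table can be computed directly from predicted scores as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise formulation; it is not necessarily how the values above were computed, and the function name is illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (labels are 0 or 1)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to chance-level ranking, so a score below 0.5, as in the ResNet34-Transfer row, means the model ranks examples slightly worse than chance on that validation subset.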

#### Model Performance
The model built using the grouped dataset significantly outperformed all other models, achieving an accuracy of 76.19% and an area under ROC of 0.715 with the best-performing choice of hyperparameters. From our understanding of the data, we posit that this result is due to all the information for an evaluation being available to the model simultaneously in a single datapoint. Although the ungrouped dataset is much larger, which would be expected to benefit model performance, grouping the drawings by evaluation allows the model to combine information from multiple drawings and achieve better accuracy.

<img src="../images/QD_model_architecture.png">
<p align="center" style="text-align: center;">Figure 6. QD-Grouped model architecture.</p>

The improved model performance cannot be attributed to certain drawing categories being more informative than others, since all models were trained with hyperparameter sets that excluded all datapoints for up to 3 drawing categories.

## Discussion

[TODO: Discuss performance and characteristics of inferior models]

## References

[1] D. Ha and D. Eck, “A Neural Representation of Sketch Drawings,” in ICLR 2018, 2018.

[2] S. E. O’Bryant et al., “A Serum Protein–Based Algorithm for the Detection of Alzheimer Disease,” Arch Neurol, vol. 67, no. 9, pp. 1077–1081, Sep. 2010.

[3] M. Talo, U. B. Baloglu, Ö. Yıldırım, and U. Rajendra Acharya, “Application of deep transfer learning for automated brain abnormality classification using MR images,” Cognitive Systems Research, vol. 54, pp. 176–188, May 2019.

[4] K. A. Jellinger, B. Janetzky, J. Attems, and E. Kienzl, “Biomarkers for early diagnosis of Alzheimer disease: ‘ALZheimer ASsociated gene’– a new blood biomarker?,” Journal of Cellular and Molecular Medicine, vol. 12, no. 4, pp. 1094–1117, 2008.

[5] Z. Reitermanová, “Data Splitting,” 2010.

[6] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[7] S. M. Clare, “Drawing Rules: The Importance of the Whole Brain for Learning Realistic Drawing,” Studies in Art Education, vol. 24, no. 2, pp. 126–130, Jan. 1983.

[8] O. Almkvist and B. Winblad, “Early diagnosis of Alzheimer dementia based on clinical and biological factors,” European Archives of Psychiatry and Clinical Neurosciences, vol. 249, no. 3, pp. S3–S9, Dec. 1999.

[9] L. Khedher, J. Ramírez, J. M. Górriz, A. Brahim, and F. Segovia, “Early diagnosis of Alzheimer’s disease based on partial least squares, principal component analysis and support vector machine using segmented MRI images,” Neurocomputing, vol. 151, pp. 139–150, Mar. 2015.

[10] R. Szeliski, “Image processing,” in Computer Vision: Algorithms and Applications, R. Szeliski, Ed. London: Springer London, 2011, pp. 87–180.

[11] D. Soekhoe, P. van der Putten, and A. Plaat, “On the Impact of Data Set Size in Transfer Learning Using Deep Neural Networks,” in Advances in Intelligent Data Analysis XV, 2016, pp. 50–60.

[12] P. Ballester and R. M. Araujo, “On the Performance of GoogLeNet and AlexNet Applied to Sketches,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[13] M. Brantjes and A. Bouma, “Qualitative analysis of the drawings of Alzheimer patients,” Clinical Neuropsychologist, vol. 5, no. 1, pp. 41–52, Jan. 1991.

[14] I. Rouleau, D. P. Salmon, N. Butters, C. Kennedy, and K. McGuire, “Quantitative and qualitative analyses of clock drawings in Alzheimer’s and Huntington’s disease,” Brain and Cognition, vol. 18, no. 1, pp. 70–87, Jan. 1992.

[15] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, Jan. 2002.
