In [2]:
from IPython.display import HTML, Image

---

Some preliminary ideas for starting off:

- Straight up this is supervised learning, since we have the labels of the people in the trial -- this is great, because it makes life a whole lot easier
- Look at the conditional entropy of the sentences; this should show clearly when habitual errors start to creep in, since we are measuring the conditional Shannon entropy of a sequence.
- Build a language model of text and use it to reference against (not sure about this one yet, as the purpose is not to forecast but discriminate PD from HC)
    - That being said, we could build a fairly simple LSTM autoencoder, whose sole purpose is to separate PD from HC, based on their text input.
        - The problem with this, as I see it now, is that we may not have enough data (yet) to build such a model. But we could collect more and it would be eminently doable.
- Sequence modelling (not sure about this yet either; the utility of building such a model is ... unclear)

Fundamentally, are we interested in:

1. Discriminative analysis ('who has PD, who has not?'), or
2. Descriptive analysis ('ok, you have PD, so what in your typing told us that you have PD?')

## Questions for Colin and Tom

1. Would there be any utility in having a generative model capable of generating new instances of HC and PD sentences? I.e. we could create a model that learns to spit out a sentence with a habit-error-type slant to it, but would this be of any use?

# (1) A simple information theoretic entropy approach

Two notions to consider:

1. We can measure the conditional entropy in local sentences
2. We can measure the conditional entropy in `global` paragraphs

Essentially we are asking at what granularity we posit that habitual errors are likely to appear.
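As a rough sketch of what this could look like, the block below estimates the first-order conditional entropy $H(X_t \mid X_{t-1}) = -\sum_{a,b} p(a,b)\log_2 p(b \mid a)$ from character bigram counts. The function name `conditional_entropy` and the example sentences are made up for illustration (they are not trial data), and conditioning on a single previous character is an assumption; a longer context may turn out to be more appropriate.

```python
import math
from collections import Counter


def conditional_entropy(text):
    """Estimate H(X_t | X_{t-1}) in bits from character bigram counts."""
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (prev, _nxt), n in pair_counts.items():
        p_joint = n / total            # p(a, b)
        p_cond = n / prev_counts[prev]  # p(b | a)
        entropy -= p_joint * math.log2(p_cond)
    return entropy


# Hypothetical typed sentences, purely for illustration.
sentences = ["the cat sat on the mat", "teh cat sat on teh mat"]

# "Local" granularity: one estimate per sentence.
local_entropy = [conditional_entropy(s) for s in sentences]

# "Global" granularity: one estimate over the whole paragraph.
global_entropy = conditional_entropy(" ".join(sentences))
```

With short sentences the bigram counts are sparse, so the local estimates will be noisy; that trade-off is part of the granularity question above.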

# (2) Deep-learning: Autoencoder, CNNs, LSTMs, RNNs ...

## Autoencoder

In [3]:
Image(url= "https://lilianweng.github.io/lil-log/assets/images/autoencoder-architecture.png")

## Convolutional neural network

In [5]:
Image(url= "https://raw.githubusercontent.com/brightmart/text_classification/master/images/TextCNN.JPG")

## CNN 2 (this one is a better overview I think)

In [8]:
Image(url= "https://cdn-images-1.medium.com/max/1200/0*jpZmafzzqJAPAOw5.png")

## Idea:

- Using the above image as reference, we would take the "bottleneck" compression and use that in some classifier.
- Indeed we can incorporate the classification into the network as well.
- The input will be:
    - Typed sentences
    - _probably not_ the reference sentence, but I haven't thought enough about that yet
    - All the meta information regarding the typing e.g. IKI
- In short, we will not design any features for some classifier, but leave it to the network to extract its own discriminative information
- What we hope to see is a model capable of transforming the NLP data into some low-dimensional representation (which we may even be able to visualise) which clearly separates PD/HC

**Before any of this, I am going to try naive Bayes -- baselines are important**

Because we have labels for all the patients (PD/non-PD), this is a relatively straightforward binary classification task (as you know, of course). Since we have no SWEDD or GENPD patients, it is much easier to work like this.
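To make the baseline concrete, here is a minimal sketch of a multinomial naive Bayes classifier over character bigrams with Laplace smoothing, written with the standard library only. The class name `NaiveBayes`, the helper `char_ngrams`, the choice of bigram features, and the toy strings are all assumptions for illustration; the strings are invented and are not trial data.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def log_posterior(self, text):
        """Unnormalised log posterior for each class."""
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # log prior
            for feat in char_ngrams(text):
                if feat in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[feat] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_posterior(text)
        return max(scores, key=scores.get)


# Invented toy data, purely to show the interface.
texts = ["the quick brown fox", "the lazy dog sleeps",
         "teh qiuck brwon fox", "teh lazy dgo sleeps"]
labels = ["HC", "HC", "PD", "PD"]
model = NaiveBayes().fit(texts, labels)
```

In practice we would evaluate this with cross-validation split by participant, so that sentences from the same person never appear in both the training and test folds.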

Other:

- Because these DL models have so many parameters, I would like to try an idea that I have been toying with for a while regarding their training: using Bayesian optimisation for model selection (MATLAB, surprisingly, has a good explanation: https://uk.mathworks.com/help/deeplearning/examples/deep-learning-using-bayesian-optimization.html)
- If anyone is interested, I usually code DL models in PyTorch (Facebook's DL library). That being said, I am fairly language agnostic; TensorFlow (Google) works fine too.

### Why is this good?

- It is straightforward
- Fairly easy
- There is lots of precedent:
    1. https://github.com/erickrf/autoencoder
    2. https://arxiv.org/pdf/1408.5882.pdf
    3. Nice high-level review/summary: https://medium.com/jatana/report-on-text-classification-using-cnn-rnn-han-f0e887214d5f
    4. https://github.com/brightmart/text_classification
    5. https://paperswithcode.com/task/sentence-classification
- Embedding words into sentences is easy; we can also embed single characters should we so wish -- but since we are looking for habitual errors, word embeddings might be more appropriate.
- Encoder/Decoder potential models: CNN (probs best for small amount of data)/HAN/RNN/LSTM
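Whichever encoder we pick, the embedding layer needs integer token ids as input. The sketch below shows one way to prepare them at either word or character level; the helper names `build_vocab` and `encode`, the `<pad>`/`<unk>` conventions, and the example sentences are assumptions for illustration only.

```python
def build_vocab(tokens_per_sample):
    """Map each distinct token to an integer id; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in tokens_per_sample:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Invented sentences, purely for illustration.
sentences = ["the cat", "teh cat"]

# Word-level: one id per whitespace-separated word.
word_vocab = build_vocab(s.split() for s in sentences)
word_ids = encode("teh dog".split(), word_vocab, max_len=4)

# Character-level: one id per character, so typos stay in-vocabulary.
char_vocab = build_vocab(list(s) for s in sentences)
char_ids = encode(list("the"), char_vocab, max_len=5)
```

One thing this makes visible: at word level a previously unseen misspelling collapses to `<unk>`, whereas at character level it is preserved, which may matter when the signal we care about is the errors themselves.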

### Why might it be bad?

- DL models necessarily require a lot of data -- though there are novel ways to get around this. I do not yet know if we have enough of it, since I have only started to build the model and have not yet got to the point where I can train it.
- Unfortunately, much of DL is still based on __tricks and experience__ -- often we do not fully know why something works, just that it does. The theory has not quite caught up with practice yet. But it will.

## Questions for Colin and Tom

1. Do we have an equipment budget?
    - If we go down this route, the analysis phase will be _a lot_ faster if I use a GPU (couple of hundred quid, nothing crazy -- can give it to Liverpool/Sheffield after) 
    - My **big machine** (i.e. my useful machine, not my laptop) does not currently have a GPU but it does have all the other bells and whistles
2. Are the datasets that we have all that exists in terms of the research that has been done in this area?
3. Devil's advocate question: what is the point of this? Your AUC is already very good (from the most recent paper)
4. In the English language data, what is the format of the timestamp? What should I convert it to, to make it interpretable?
5. What about taking time-derivatives (i.e. speed and acceleration) of IKI? (see next cell)

Let $x = \textrm{IKI}$. Then we can extract the following meta-data:
$$
x, \frac{dx}{dt}, \frac{d^2x}{dt^2}
$$
Why? Well, we can bolt this onto our observation vector $\mathbf{x}$ that we pass to the DL model, and let it choose whether or not this information has high information content w.r.t. the binary classification it is being asked to undertake.
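A minimal sketch of how these three quantities could be computed from raw keypress timestamps using forward finite differences. The function names, the choice to stamp each IKI with the time of its closing keypress, and the millisecond timestamps are all assumptions (the actual timestamp format is still an open question above).

```python
def finite_difference(values, times):
    """Forward difference d(values)/d(times); one element shorter than input."""
    return [(v1 - v0) / (t1 - t0)
            for v0, v1, t0, t1 in zip(values, values[1:], times, times[1:])]


def iki_features(key_times):
    """Return rows of (x, dx/dt, d2x/dt2), aligned to the shortest series."""
    iki = [t1 - t0 for t0, t1 in zip(key_times, key_times[1:])]
    t_iki = key_times[1:]    # assumption: stamp each IKI with its closing keypress
    speed = finite_difference(iki, t_iki)
    t_speed = t_iki[1:]
    accel = finite_difference(speed, t_speed)
    n = len(accel)
    if n == 0:
        return []
    return list(zip(iki[-n:], speed[-n:], accel))


# Hypothetical keypress times in milliseconds.
rows = iki_features([0, 200, 400, 600, 800])
```

Each row can then be concatenated onto the per-keystroke observation vector. Two keypresses with identical timestamps would cause a division by zero here, so real data would need that case handled first.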

