# Week 5: Machine Learning for Time Series and Sequences

- What is a time-series / sequence problem?
- Windowed approaches (1-D CNN)
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
 

## Examples of a time series prediction problem

<img src="https://raw.githubusercontent.com/dimitreOliveira/MachineLearning/master/Kaggle/Store%20Item%20Demand%20Forecasting%20Challenge/time-series%20graph.png" width=800 alt="example of demand changing over time"></img>

Demand forecasting problem from Kaggle [link to page](https://www.kaggle.com/code/dimitreoliveira/deep-learning-for-time-series-forecasting)

Other examples: Air Quality Prediction, Stock Market Trading, Traffic Flows (cars/people/packet network), ...

### 'Traditional' statistical approaches: (S) ARIMA (X)

<div style="float:right; width:600">
<img src = "https://media.geeksforgeeks.org/wp-content/uploads/20200131170455/Screenshot-2020-01-31-at-5.04.16-PM.png" width=600 ></img><br>
Example from geeks for geeks. <a href="https://www.geeksforgeeks.org/python-arima-model-for-time-series-forecasting">Page link</a>
</div>

- **S** = Seasonal Component  
  repeating pattern that modifies signal
- **AR** = Auto Regression  
  use of previous values to predict next value
- **I** = Integration:   
  absolute values ->  differences between time-steps.  
  Helps account for trends
- **MA** = Moving Average:   
  models the current value using the forecast errors of previous time-steps.  
  you can also think of it as a smoothing effect.
- **X** = exogenous variables:   
  e.g. for predicting NOx levels it's useful to also take into account  
  wind speed, temperature, humidity, traffic flows ...
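
As a minimal sketch of the **I** and **S** ideas above (not a full ARIMA fit), differencing turns a trend into a constant offset, seasonal differencing cancels a repeating pattern, and a cumulative sum maps differences back to absolute values. The synthetic series and the period of 7 are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical series: linear trend plus a period-7 seasonal pattern.
t = np.arange(28)
series = 2.0 * t + 5.0 * np.sin(2 * np.pi * t / 7)

# 'I' (integration): work with differences between consecutive time-steps.
diff1 = np.diff(series)                    # length 27; the trend becomes a constant offset

# Seasonal differencing: subtract the value one period (7 steps) earlier.
seasonal_diff = series[7:] - series[:-7]   # seasonal pattern cancels, leaving 2.0 * 7

# Post-processing: cumulative sums turn differences back into absolute values.
reconstructed = np.concatenate([[series[0]], series[0] + np.cumsum(diff1)])

print(np.allclose(reconstructed, series))  # True
print(np.allclose(seasonal_diff, 14.0))    # True
```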

### Related Problem: Sequence prediction

Typically (but not always) classification rather than regression.

Forward (left to right)

|Token 1|Token 2| Token 3|Token 4|Token 5| Token 6|
|---|---|---|---|---|---|
| The | name's | Bond |, | James| ? |
  
Bidirectional

|Token 1|Token 2| Token 3|Token 4|Token 5|Token 6|
|---|---|---|---|---|---|
| The | name's | ? |, | James| Bond |
  
 Lots more on this (language modelling) in the next few weeks. 

## What's the difference between classification and regression?

- Loss functions used to drive training  
  Mean Squared Error / Mean Absolute Error vs. cross entropy  
  MSE still allows for Maximum Likelihood Estimator analysis (minimising MSE is maximum likelihood under Gaussian noise)
- Activation in final layer if using neural networks.   
  linear rather than softmax / logistic  
- Choice of Performance metrics for model comparison and selection  
  - Classification: ROC curves, confusion matrices
  - Regression: possibly errors for different signal values
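
The loss functions named above can be sketched in a few lines of NumPy. This is an illustrative implementation with made-up numbers, not what any particular library does internally.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: penalises large errors heavily."""
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean Absolute Error: less sensitive to outliers."""
    return float(np.mean(np.abs(y_true - y_pred)))

def softmax(logits):
    """Turn final-layer outputs into class probabilities (numerically stable)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(labels, probs):
    """Mean negative log-probability assigned to the correct class."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

# Regression: linear outputs compared directly with the targets.
y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.5, 2.0, 3.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))

# Classification: softmax outputs compared with integer class labels.
logits = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
print(cross_entropy(labels, softmax(logits)))
```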
  

## Machine Learning Approaches to time series problems

1. Treat as a standard supervised learning problem, but give the system a _window_ of n previous time-steps.  
<div style="float:right; width:600"><img src="simple-window-dataset.jpg"/></div>

- how do you fill in gaps?  
  lose data or _impute_  :  
  a bit like padding images in convolution
- Easy to make in e.g. pandas
- Easy to incorporate 'exogenous variables'.   
  (other features) alongside the one to predict

2. Can use any 'standard' supervised ML algorithm
 - especially 1-D CNN 
   - since time can be regarded as another dimension
   - and we know CNNs cope well with dimensions
 - quite a few recent papers suggest CNNs work better than recurrent networks when:
   - you only need to take into account recent observations
   - or a few patterns regularly recur, so can be learned by convolutional filters
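
A minimal sketch of building a windowed dataset in pandas, assuming a hypothetical `demand` column to predict and a `temperature` column as an exogenous variable. The helper `make_windows` is invented for this example. Rows without a complete history are dropped here; imputing them is the alternative.

```python
import numpy as np
import pandas as pd

def make_windows(df, target, n_lags):
    """Build a supervised table: lagged values of every column -> current target value."""
    cols = {}
    for lag in range(n_lags, 0, -1):
        for name in df.columns:
            cols[f"{name}_t-{lag}"] = df[name].shift(lag)
    X = pd.DataFrame(cols, index=df.index)
    y = df[target].rename("y")
    # The first n_lags rows have no complete window: drop them (or impute instead).
    keep = X.notna().all(axis=1)
    return X[keep], y[keep]

# Hypothetical data: 10 time-steps of demand plus one exogenous variable.
df = pd.DataFrame({
    "demand": np.arange(10, dtype=float),
    "temperature": np.linspace(15.0, 24.0, 10),
})
X, y = make_windows(df, target="demand", n_lags=3)
print(X.shape)   # (7, 6): 7 usable rows, 3 lags x 2 features
print(X.head(2))
```

The resulting `X` and `y` can be passed to any standard supervised learner; for a 1-D CNN the columns would be reshaped to (samples, time-steps, features).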

### For example, the Keras Conv1D layer

The inputs are 128-length vectors with 10 timesteps,  
and the batch size is 4.
```
>>> import tensorflow as tf
>>> input_shape = (4, 10, 128)  # (batch, timesteps, features)
>>> x = tf.random.normal(input_shape)
>>> # 32 filters, each spanning 3 consecutive time-steps
>>> y = tf.keras.layers.Conv1D(
...                     32, 3,
...                     activation='relu')(x)
>>> print(y.shape)
(4, 8, 32)
```
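
The output has 8 time-steps because, without padding, a kernel of width 3 fits into 10 time-steps in 10 - 3 + 1 = 8 positions. The NumPy sketch below reproduces that shape arithmetic with randomly initialised weights; it illustrates what the layer computes and is not the Keras implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10, 128))        # (batch, timesteps, channels)
kernel = rng.normal(size=(3, 128, 32))   # (kernel_size, in_channels, filters)
bias = np.zeros(32)

# All length-3 windows along the time axis: shape (4, 8, 128, 3).
windows = sliding_window_view(x, window_shape=3, axis=1)

# Each filter takes a weighted sum over the window and the input channels, then ReLU.
y = np.maximum(np.einsum("btck,kcf->btf", windows, kernel) + bias, 0.0)
print(y.shape)  # (4, 8, 32)
```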

## Methodological Differences:

1. Does not make sense to use a randomised train/validate/test split:  
   the model would be trained on observations from after the ones it is tested on.  
   So you need a strategy for dealing with trends via:  
   pre-processing data and post-processing predictions

2. N-fold cross-validation is still not a bad idea, **but**  
   ideally you would justify it by a prior analysis of:  
   long term trends,  
   seasonal variations,  
   frequency of extreme events
   
3. What 'naive' models to compare with?  
   classifier: (probabilistically) predict most frequent class.  
   regression: constant prediction **or**    
   'predict-last-value'  -   can often give quite low MSE.  
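
A minimal sketch of a chronological split and the two naive regression baselines, using a synthetic series (trend plus a daily cycle plus noise) that is assumed purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical hourly series: upward trend + 24-step cycle + small noise.
t = np.arange(500)
series = 0.05 * t + np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=500)

# Chronological split: train on the past, validate and test on later data.
n = len(series)
i_val, i_test = n * 70 // 100, n * 85 // 100
train, val, test = series[:i_val], series[i_val:i_test], series[i_test:]

# Naive baseline 1: constant prediction (the training mean).
mse_constant = float(np.mean((test - train.mean()) ** 2))

# Naive baseline 2: predict the last observed value.
last_values = np.concatenate([[val[-1]], test[:-1]])
mse_last_value = float(np.mean((test - last_values) ** 2))

print(f"constant: {mse_constant:.3f}  predict-last-value: {mse_last_value:.3f}")
```

On a trending series like this one the constant baseline is poor, while predict-last-value sets a bar that a learned model has to beat.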
    
    

## Problems with windowed approach?

> The name's Bond, James said when asked the author of the Paddington books


It's not always easy to know how many time-steps the window should include.
- 24 hours ?
- 3 days? (lets you take account of weekends)
- 7 days - lets you take account of things that happen one day a week.
- ...

Also, impractical to have super-long windows
- because it means lots of parameters to learn  
  ==> bigger optimisation problem
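
To make the growth concrete, here is a rough count for the first layer of a dense network fed a flattened window. It assumes hourly data, 5 features per time-step and 64 hidden units, all chosen only for illustration.

```python
def dense_first_layer_params(window, n_features, hidden_units):
    """Weights plus biases for a dense layer fed a flattened window."""
    return window * n_features * hidden_units + hidden_units

# 24 hours, 3 days and 7 days of hourly observations.
for window in (24, 72, 168):
    print(window, dense_first_layer_params(window, n_features=5, hidden_units=64))
# 24 7744
# 72 23104
# 168 53824
```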

## The answer (maybe)? 

<div style="float:centre; color:red;"><h3>Memory</h3></div>

## Recurrent Neural Networks (RNN)
- Neural network for data with a temporal relationship,  
  i.e. sequences of data where the past (or future) data might influence the present.
- RNNs share an internal ‘hidden-state’ across all inputs.   
  Effectively a network with memory of what it has previously seen.  
  **This is what changes over time**
- The key is that the weights and activation functions are the same for each time step

![image.png](rnn.png)

## Simple RNNs (Elman 1991)
Nodes have a hidden state vector **h**

Perceptron with input weight matrix _W_ and input vector _x_:  
output update $$ y = \text{logistic}( W x ) $$

Elman memory node: 
$$ h_{t} = \text{logistic}\left( W_H\begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix} \right) $$
     
$$ y_t = \text{logistic}( W_y h_t ) $$

Elman, J. L. (1991). _Distributed representations, simple recurrent networks, and grammatical structure._ Machine Learning, 7:195–225.
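
The two Elman update equations can be sketched as a NumPy forward pass. The layer sizes and random weights are assumptions for illustration, and biases are omitted to match the equations above.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_forward(xs, W_H, W_y, h0):
    """Run a simple (Elman) RNN over a sequence; the same weights are used at every step."""
    h = h0
    hs, ys = [], []
    for x_t in xs:
        h = logistic(W_H @ np.concatenate([x_t, h]))  # h_t from x_t and h_{t-1}
        hs.append(h)
        ys.append(logistic(W_y @ h))                  # y_t from h_t
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 2, 5                 # illustrative sizes
W_H = rng.normal(scale=0.5, size=(n_hidden, n_in + n_hidden))
W_y = rng.normal(scale=0.5, size=(n_out, n_hidden))
xs = rng.normal(size=(T, n_in))

hs, ys = elman_forward(xs, W_H, W_y, h0=np.zeros(n_hidden))
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

Only the hidden state `h` changes from step to step; `W_H` and `W_y` are reused throughout, which is what makes the network recurrent.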

## Modern Version
![Figure from Deep Learning](bengio-1.png)

## Modern Version 2
![Figure from Deep Learning, Goodfellow, Bengio and Courville](bengio-2.png)

### Learning weights _U, V, W_ 
The key: 
- use stochastic gradient descent,  
  same as for Multi-Layer Perceptrons
- easier to think about for the 'unrolled' version of the RNN
- **Back-Propagation Through Time**  
  Gradients are computed from the loss $L(t)$ at each time-step (total loss is the sum of those at different time steps)
- memory requirements grow rapidly with sequence length  
  because you need to store activations/signals at each time-step.
  
Problems?
 - vanishing and exploding gradients
 - so hard to remember things for long periods of time
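
A minimal sketch of why gradients vanish or explode. It assumes a purely linear recurrence, so back-propagating through one time-step just multiplies the gradient by the transposed recurrent weight matrix; the derivatives of logistic or tanh activations are at most 1 and would only shrink it further. The recurrent weights are a scaled random orthogonal matrix, chosen so the effect is easy to read off.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix

def gradient_norm_after(steps, scale):
    """Norm of a unit gradient pushed back through `steps` linear recurrent steps."""
    W = scale * Q                    # all singular values equal `scale`
    grad = np.ones(8) / np.sqrt(8)   # unit-length gradient at the final time-step
    for _ in range(steps):
        grad = W.T @ grad            # one step of back-propagation through time
    return float(np.linalg.norm(grad))

for scale in (0.9, 1.0, 1.1):
    print(scale, gradient_norm_after(50, scale))
# The norm equals scale ** 50: it shrinks towards 0 for 0.9 and grows past 100 for 1.1.
```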

## Long Short-Term Memory (LSTM) Nodes
- Add 'gates', controlled by learnable weights, that determine how much the hidden state forgets, remembers, or changes in response to current inputs

>A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. 

>The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell. 

>Forget gates decide what information to discard from a previous state by assigning a previous state, compared to a current input, a value between 0 and 1. A (rounded) value of 1 means to keep the information, and a value of 0 means to discard it. 

>Input gates decide which pieces of new information to store in the current state, using the same system as forget gates. 

>Output gates control which pieces of information in the current state to output by assigning a value from 0 to 1 to the information, considering the previous and current states

(quotes from Wikipedia)


Hochreiter, S. and Schmidhuber, J. (1997). _Long short-term memory._ Neural Computation, 9(8):1735–1780.

![Block diagram of LSTM](https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Peephole_Long_Short-Term_Memory.svg/2000px-Peephole_Long_Short-Term_Memory.svg.png)

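circle with x in = elementwise multiplication of two signals (e.g. a signal by a gate value between 0 and 1)  
Gate activations are logistic (sigmoid); the other activation functions are usually tanh.   
$h_t$ is output.  
**Weights at each gate input!**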
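
A minimal NumPy sketch of one step of a standard LSTM cell. It omits the peephole connections shown in the diagram, and the stacking order of the gate weights inside `W` is an assumption of this sketch rather than any library's layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * n_hidden, n_in + n_hidden), with rows
    stacked as [forget, input, candidate, output] (an assumed layout)."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * n:1 * n])   # forget gate: how much of the old cell state to keep
    i = sigmoid(z[1 * n:2 * n])   # input gate: how much new information to store
    g = np.tanh(z[2 * n:3 * n])   # candidate values to write into the cell
    o = sigmoid(z[3 * n:4 * n])   # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g      # elementwise products (the circled x in the diagram)
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hidden, T = 8, 4, 10      # illustrative sizes
W = rng.normal(scale=0.3, size=(4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(T, n_in)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```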

### Example from [keras](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
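````
>>> import tensorflow as tf
>>> inputs = tf.random.normal([32, 10, 8])  # (batch, timesteps, features)
>>> lstm = tf.keras.layers.LSTM(4)          # 4 units
>>> output = lstm(inputs)                   # hidden state after the last time-step
>>> print(output.shape)
(32, 4)
````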
If you want more than one layer (or to do sequence-to-sequence), you need to specify that the layer returns the 'unrolled' sequence of outputs:
````
>>> lstm = tf.keras.layers.LSTM(4, return_sequences=True, return_state=True)
>>> whole_seq_output, final_memory_state, final_carry_state = lstm(inputs)
>>> print(whole_seq_output.shape)    # hidden state at every time-step
(32, 10, 4)
>>> print(final_memory_state.shape)  # final hidden state h
(32, 4)
>>> print(final_carry_state.shape)   # final cell state c
(32, 4)
````

## Impact of LSTMs
Huge leap forward for deep learning applied to:
- time-series problems
  - especially with long-term dependencies to learn
- sequence to label
  - text classification / sentiment analysis / intent recognition / ...
- sequence to sequence 
  - translation between languages, ...
  - typically use an encoder-decoder architecture

Can be used in a wide single layer, or layers can be stacked:
- either way, usually have a 'dense' layer (or two) afterwards (as for CNNs)  
  
More on architectures next week.
  
Many variants: Gated Recurrent Units (GRU) ...
