# Deep Meta-Learning

## Motivation

  - Modern deep learning engineering: train the full architecture on a very
    large dataset, then fine-tune it (the last layer) on your own dataset,
    which has fewer examples.
  - State of the art on datasets with only 80 examples.

![vgg](img/vgg_architecture.png)

  - Hand-crafted features are not as good as those discovered by optimization, e.g. for text and images.

![canny](img/canny.jpeg)

  - The "pre-training" phase regularizes the architecture toward the right
    bias for image understanding, so only a linear combination (the last
    layer) over this subspace of functions is needed to generalize well.
  - From an application standpoint, it learns filters/features that are good for the problem.

![vgg](img/vgg_filters.jpeg)

## Limitations

- Inputs must all have the same size.

#### Padding
- Sensitive to permutations

![padding](img/padding.png)
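
As a minimal sketch of the padding issue, assuming NumPy and made-up toy data: zero-padding lets tables with different numbers of rows share one fixed shape, but a model that flattens the padded input (here a hypothetical random linear layer `W`) produces a different output when the same rows arrive in a different order.

``` python
import numpy as np


def pad_rows(table, max_rows):
    """Zero-pad a (rows, features) array at the bottom up to max_rows rows."""
    table = np.asarray(table, dtype=float)
    n_missing = max_rows - table.shape[0]
    if n_missing < 0:
        raise ValueError("table has more rows than max_rows")
    return np.pad(table, ((0, n_missing), (0, 0)))


rng = np.random.default_rng(0)
small = rng.normal(size=(3, 2))        # toy dataset with 3 rows, 2 features
padded = pad_rows(small, max_rows=5)   # rows 3 and 4 are all zeros

# Hypothetical linear layer applied to the flattened, padded input.
W = rng.normal(size=(5 * 2,))
out_original = padded.reshape(-1) @ W

# Same rows in a different order: the set of rows is unchanged, the output is not.
shuffled = pad_rows(small[[2, 0, 1]], max_rows=5)
out_shuffled = shuffled.reshape(-1) @ W

print(padded.shape)  # (5, 2)
print(np.isclose(out_original, out_shuffled))
```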


#### Permutation Invariance
[https://arxiv.org/pdf/1612.04530.pdf](https://arxiv.org/pdf/1612.04530.pdf)

![permutation invariance](img/permutation_invariance.png)
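
One common way to obtain permutation invariance is to apply the same function to every row and then pool with a symmetric operation such as the mean. The sketch below is a generic illustration under that assumption, not the method of the linked paper; the weights `W` and the toy data are made up.

``` python
import numpy as np


def encode_set(rows, W):
    """Apply the same linear map + ReLU to each row, then mean-pool over rows."""
    hidden = np.maximum(rows @ W, 0.0)  # (n_rows, hidden_dim), computed row by row
    return hidden.mean(axis=0)          # symmetric pooling: row order does not matter


rng = np.random.default_rng(0)
rows = rng.normal(size=(6, 4))   # toy dataset: 6 rows, 4 features
W = rng.normal(size=(4, 8))      # hypothetical shared weights

perm = rng.permutation(len(rows))
same = np.allclose(encode_set(rows, W), encode_set(rows[perm], W))
print(same)  # True
```

Because the pooling runs over the row axis, the same encoder also accepts inputs with different numbers of rows, so no padding is needed.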

#### Solution (?)
![Transformer Architecture](img/seq2seq.png)

![Word Embedding](img/word_embedding.png)

**Example:**


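``` python
from tensorflow import keras
from tensorflow.keras import layers

# TokenAndPositionEmbedding and TransformerBlock are custom layers (this snippet
# matches the Keras example "Text classification with a Transformer"). They are
# assumed to be in scope, along with maxlen, vocab_size, embed_dim, num_heads, ff_dim.
inputs = layers.Input(shape=(maxlen,))  # sequence of token ids
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)  # average over the sequence axis
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(2, activation="softmax")(x)  # two-class probabilities

model = keras.Model(inputs=inputs, outputs=outputs)
```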

## The Transformer Architecture
![Transformer Architecture](img/transformer1.png)

### Self Attention
![Transformer Architecture](img/transformer2.png)

### K, Q, V
![Transformer Architecture](img/transformer3.png)

![Transformer Architecture](img/transformer4.png)

### X encoded as Z
![Transformer Architecture](img/transformer5.png)
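
The Q, K, V steps pictured above can be sketched in NumPy as single-head scaled dot-product self-attention. The projection matrices `W_q`, `W_k`, `W_v` are random stand-ins for learned weights and the dimensions are arbitrary; this is an illustration, not the code behind the figures.

``` python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Encode X (seq_len, d_model) as Z (seq_len, d_k) with one attention head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)       # each row is a distribution over positions
    Z = weights @ V                          # weighted average of the value vectors
    return Z, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 3
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Z, weights = self_attention(X, W_q, W_k, W_v)
print(Z.shape, weights.shape)  # (4, 3) (4, 4)
```

Each row of `weights` sums to 1 and says how much one position attends to every other position.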

### Correlation Matrix

![Transformer Architecture](img/attention1.png)

## Batch Generation

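``` python
import os

import numpy as np
import pandas as pd
from fastprogress.fastprogress import master_bar, progress_bar
from sklearn.model_selection import StratifiedKFold

# Assumptions: `seed` is an arbitrary fixed integer, and `clf1` is a scikit-learn
# estimator (or pipeline) defined elsewhere that accepts string-typed columns.
# `mb` is assumed to be a fastprogress master_bar, judging by how it is used.
seed = 42
path = "datasets/"
mb = master_bar(os.listdir(path))
for f in mb:
    sampler = pd.read_csv(os.path.join(path, f)).dropna()  # open question: drop NaNs or not?
    if sampler.shape[0] < 1000:  # skip datasets with fewer than 1000 rows
        continue
    for i in progress_bar(range(100), parent=mb):
        # Bootstrap a 500-row sample from the dataset.
        df = sampler.sample(500, replace=True, random_state=seed + i)
        X = df.drop(["class"], axis=1).astype(str)
        y = df["class"].astype(str)
        kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        scores = []
        for train_idx, test_idx in kfold.split(X, y):
            X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
            X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]
            clf1.fit(X_train, y_train)
            scores.append(clf1.score(X_test, y_test))
        # The mean cross-validation accuracy is stored in the file name.
        df.to_csv(f"../samples/{f}_{np.mean(scores):.3f}_{i}.csv", index=False)
        mb.child.comment = "Sampler"
    mb.write(f"Finished {f}")
```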
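
Because the loop above stores the mean cross-validation accuracy in each sample's file name, a later step has to recover it. A minimal sketch of that parsing, assuming file names follow the `{f}_{score}_{i}.csv` pattern used above; `parse_sample_name` is a hypothetical helper and `iris.csv` a made-up dataset name, neither is part of the original code.

``` python
def parse_sample_name(filename):
    """Split '<dataset>_<score>_<i>.csv' into (dataset, score, i)."""
    stem = filename[:-len(".csv")] if filename.endswith(".csv") else filename
    # Split from the right so underscores inside the dataset name are preserved.
    dataset, score, index = stem.rsplit("_", 2)
    return dataset, float(score), int(index)


print(parse_sample_name("iris.csv_0.953_7.csv"))  # ('iris.csv', 0.953, 7)
```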

## Beyonder Architecture

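``` python
from tensorflow import keras
from tensorflow.keras import layers

number_of_base_classifiers = 10

# Assumption: the original input shape was left blank. A variable-length
# sequence of embed_dim-sized vectors is used here so the block's residual
# connection has matching shapes. TransformerBlock, embed_dim, num_heads and
# ff_dim are assumed to be defined as in the example above.
inputs = layers.Input(shape=(None, embed_dim))
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(inputs)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
# One linear (regression-style) output per base classifier.
outputs = layers.Dense(number_of_base_classifiers, activation="linear")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
```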

## TabNet

![TabNet architecture](img/tabnet.png)

![TabNet feature mask](img/mask.png)

## Ideas

### Architecture
#### Positional Encoding
- Add an increasing value to rows that are identical

#### Data Augmentation
- Noise injection in data
- Synthetic data
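
Noise injection could look like the following sketch, which assumes purely numeric features and uses zero-mean Gaussian noise scaled by each column's standard deviation. The function name `add_gaussian_noise` and the `scale` default are illustrative choices, not something the notes prescribe.

``` python
import numpy as np


def add_gaussian_noise(X, scale=0.05, seed=None):
    """Return a copy of X with zero-mean Gaussian noise added to every entry.

    The noise standard deviation is `scale` times each column's standard deviation,
    so constant columns are left unchanged.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    col_std = X.std(axis=0, keepdims=True)
    return X + rng.normal(size=X.shape) * scale * col_std


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_aug = add_gaussian_noise(X, scale=0.05, seed=0)
print(X_aug.shape)  # (3, 2)
```

Categorical columns would need a different scheme, for example randomly swapping values, since adding numeric noise to them is not meaningful.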

### Applications
#### Transfer Learning

#### Regression
- Predict the accuracy of classifiers
- Predict the regularization term of a base model

#### Classification
- Predict best classifier
- Predict best kernel for SVM
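
If a meta-model outputs one predicted score per base classifier, as the linear output layer in the Beyonder snippet suggests, choosing the best classifier reduces to an argmax over those scores. A minimal illustration with made-up numbers; the classifier names are hypothetical.

``` python
import numpy as np

classifier_names = ["knn", "svm", "tree"]  # hypothetical base classifiers

# Made-up predicted accuracies: one row per dataset, one column per classifier.
predicted_scores = np.array([
    [0.71, 0.84, 0.79],
    [0.90, 0.88, 0.93],
])

best_idx = predicted_scores.argmax(axis=1)
best = [classifier_names[i] for i in best_idx]
print(best)  # ['svm', 'tree']
```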

#### Data Imputation

#### Transformer
- Encoded meta-features as input for a seq2seq prediction (the encoder informs which hypothesis space to select)

## Theory

Why the learning rate matters for different problems, illustrated with the brain:

[https://www.lesswrong.com/posts/7ny7NLqvzJt7WfeXP/acetylcholine-learning-rate-aka-plasticity](https://www.lesswrong.com/posts/7ny7NLqvzJt7WfeXP/acetylcholine-learning-rate-aka-plasticity)