# Meta-Base Creation

## Considerations on Active Learning

To run the experiments, we first need to build a meta-base that describes the **active learning** (AL) process. Several variables directly influence that process:

- Initial number of labeled instances ($\vert X^0_{labeled}\vert$)
- Number of queries to perform ($q$)
- Query strategy used
- Number of instances returned per query (*batch size*)
- Method used to rank the instances returned by a query
- Size of the test set used to evaluate learning
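To make the role of these variables concrete, here is a minimal sketch of a pool-based AL simulation in which each of them appears as a parameter. It is not the `MetaBaseBuilder` implementation: the function `simulate_active_learning`, its parameter names, the synthetic data, and the choice of `GaussianNB` with least-confidence sampling (top-`batch_size` instances by uncertainty) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def simulate_active_learning(X, y, initial_size=10, n_queries=5,
                             batch_size=1, test_size=0.3, seed=0):
    """Hypothetical helper: run one pool-based AL simulation."""
    rng = np.random.default_rng(seed)

    # Hold out a test set that is only used to evaluate the learner.
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)

    # Randomly pick the initially labeled instances; the rest is the unlabeled pool.
    labeled = rng.choice(len(X_pool), size=initial_size, replace=False)
    unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)

    accuracies = []
    for _ in range(n_queries):
        # Train on what is labeled so far and record test accuracy before querying.
        model = GaussianNB().fit(X_pool[labeled], y_pool[labeled])
        accuracies.append(model.score(X_test, y_test))

        # Query strategy: least confidence, i.e. 1 - max class probability.
        proba = model.predict_proba(X_pool[unlabeled])
        uncertainty = 1.0 - proba.max(axis=1)

        # Ranking method: simply take the batch_size most uncertain instances.
        picked = unlabeled[np.argsort(uncertainty)[-batch_size:]]

        # Simulate the annotator by revealing the true labels of the picked instances.
        labeled = np.concatenate([labeled, picked])
        unlabeled = np.setdiff1d(unlabeled, picked)

    return accuracies, labeled


# Synthetic data stands in for a real dataset here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
accuracies, labeled = simulate_active_learning(
    X, y, initial_size=10, n_queries=5, batch_size=3)
```

Each entry of `accuracies` corresponds to one query iteration, which is the granularity at which a meta-base row could be recorded.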


### Initial Number of Labeled Instances

In principle, the main idea behind AL is to save annotation effort by labeling the instances that best describe a dataset. In most cases, therefore, only a small number of instances is labeled at the start, just enough to begin the active learning process.

> What is the ideal amount of initially labeled data?

Since our goal is to simulate several different active learning configurations for each dataset, we need a way to make this process as reproducible as possible. Some ideas:

1.  $\vert X^0_{labeled} \vert = n$, where $n$ is an arbitrary constant
2.  $\vert X^0_{labeled} \vert = c$, where $c$ is the number of classes in the classification problem
3.  $\vert X^0_{labeled} \vert = \alpha\times\vert X\vert$, where $\alpha \in (0, 1]$ is a proportionality constant
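The three rules can be compared side by side with a small helper. This is only a sketch: the function name `initial_labeled_size`, the rule labels, and the default values for `n` and `alpha` are assumptions, not part of the project code.

```python
import math


def initial_labeled_size(n_instances, n_classes, rule="constant", n=5, alpha=0.01):
    """Hypothetical helper: size of the initial labeled set under each rule."""
    if rule == "constant":
        size = n                                  # rule 1: arbitrary constant
    elif rule == "per_class":
        size = n_classes                          # rule 2: number of classes
    elif rule == "proportional":
        size = math.ceil(alpha * n_instances)     # rule 3: fraction of the dataset
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    # Keep the result within [1, n_instances] so it is always a usable size.
    return max(1, min(size, n_instances))


# Example with 3196 instances and an assumed binary target.
sizes = {rule: initial_labeled_size(3196, 2, rule)
         for rule in ("constant", "per_class", "proportional")}
```

Because each rule is a deterministic function of the dataset, the same configuration can be reproduced across runs once the random seed for sampling the instances is fixed.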

### Number of Queries to Perform

Because the number of instances varies from dataset to dataset, we also need to decide how to choose this parameter. We can use solutions similar to those listed above, but based on the size of the dataset.

### Query Strategy

Strategies based on a classifier's uncertainty:
- [ ] Classification Uncertainty 
- [ ] Classification Margin
- [ ] Classification Entropy
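As a reference for what these three measures compute, the sketch below derives each one from a matrix of predicted class probabilities (one row per instance). The function names are chosen for illustration and the probabilities are made up; this is not taken from the project code.

```python
import numpy as np


def classification_uncertainty(proba):
    # Least confidence: 1 minus the probability of the most likely class.
    return 1.0 - proba.max(axis=1)


def classification_margin(proba):
    # Difference between the two largest class probabilities (small = ambiguous).
    ordered = np.sort(proba, axis=1)
    return ordered[:, -1] - ordered[:, -2]


def classification_entropy(proba):
    # Shannon entropy of the predicted distribution; log(1) = 0 is substituted
    # for zero probabilities so that those terms contribute nothing.
    safe = np.where(proba > 0, proba, 1.0)
    return -(proba * np.log(safe)).sum(axis=1)


# Illustrative probabilities for three instances of a binary problem.
proba = np.array([[0.5, 0.5],
                  [0.9, 0.1],
                  [1.0, 0.0]])

uncertainty = classification_uncertainty(proba)
margin = classification_margin(proba)
entropy = classification_entropy(proba)
```

Uncertainty and entropy are maximized to pick a query, while the margin is minimized; here all three point to the first instance.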


Strategies based on disagreement between models:

- [ ] Vote Entropy
- [ ] Consensus Entropy
- [ ] Max Disagreement
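The disagreement measures can be sketched in the same way for a committee of models. The definitions below follow the usual formulation (entropy of the vote distribution, entropy of the averaged probabilities, and the largest KL divergence between a member and the consensus); the function names, array layout, and example numbers are assumptions for illustration.

```python
import numpy as np


def _entropy(p):
    # Shannon entropy along the last axis, treating 0 * log(0) as 0.
    safe = np.where(p > 0, p, 1.0)
    return -(p * np.log(safe)).sum(axis=-1)


def vote_entropy(votes, n_classes):
    # votes: (n_samples, n_members) array of predicted class indices.
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return _entropy(counts / votes.shape[1])


def consensus_entropy(member_proba):
    # member_proba: (n_members, n_samples, n_classes) predicted probabilities.
    return _entropy(member_proba.mean(axis=0))


def max_disagreement(member_proba):
    # Largest KL divergence between any member's prediction and the consensus.
    consensus = member_proba.mean(axis=0)
    safe_p = np.where(member_proba > 0, member_proba, 1.0)
    safe_c = np.where(consensus > 0, consensus, 1.0)
    kl = (member_proba * (np.log(safe_p) - np.log(safe_c))).sum(axis=-1)
    return kl.max(axis=0)


# Two committee members, two instances, two classes (made-up probabilities).
member_proba = np.array([[[0.9, 0.1], [0.6, 0.4]],
                         [[0.8, 0.2], [0.1, 0.9]]])
votes = member_proba.argmax(axis=2).T  # shape (n_samples, n_members)

ve = vote_entropy(votes, n_classes=2)
ce = consensus_entropy(member_proba)
md = max_disagreement(member_proba)
```

The members agree on the first instance and split on the second, so all three measures rank the second instance as the more informative query.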

### Number of Instances Returned per Query

Originally, query strategies are concerned with selecting the single most interesting instance to label. In practice, however, the annotator is usually unwilling to label just one instance and then wait for the next iteration of the AL process. A lot of time and resources can be saved if each query returns more than one instance.

However, simply returning the $n$ instances with the highest scores may not be a good idea (although it is sometimes a viable alternative). Cardoso et al. therefore formulated a way to rank these instances:

$score(x) = \alpha(1 - \Phi(x, X_{labeled})) + (1 - \alpha) U(x),$ 

where $\alpha = \frac{|X_{unlabeled}|}{|X_{unlabeled}| + |X_{labeled}|}$, $U(x)$ is the uncertainty of the predictions for $x$, and $\Phi$ is a similarity function. The term $1 - \Phi(x, X_{labeled})$ is large when $x$ is unlike every labeled instance, so the score also reflects how well the feature space has been explored near $x$.
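The score can be computed in a few lines once a similarity function is fixed. The sketch below assumes $\Phi(x, X_{labeled}) = 1 / (1 + d)$, where $d$ is the Euclidean distance from $x$ to its nearest labeled instance; that choice, the function name `ranked_scores`, and the toy data are assumptions, since the text does not specify $\Phi$.

```python
import numpy as np


def ranked_scores(X_unlabeled, X_labeled, uncertainty):
    """Hypothetical helper: one scoring pass of the ranking formula."""
    n_u, n_l = len(X_unlabeled), len(X_labeled)
    alpha = n_u / (n_u + n_l)

    # Pairwise Euclidean distances, shape (n_unlabeled, n_labeled).
    dists = np.linalg.norm(
        X_unlabeled[:, None, :] - X_labeled[None, :, :], axis=2)

    # Assumed similarity: 1 / (1 + distance to the closest labeled instance).
    similarity = 1.0 / (1.0 + dists.min(axis=1))

    return alpha * (1.0 - similarity) + (1.0 - alpha) * uncertainty


X_labeled = np.array([[0.0, 0.0]])
X_unlabeled = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
uncertainty = np.array([0.5, 0.5, 0.5])  # equal uncertainty isolates the distance term

scores = ranked_scores(X_unlabeled, X_labeled, uncertainty)
```

With equal uncertainties, the instance farthest from the labeled set gets the highest score. A full batch selector would likely re-score after each pick, moving the chosen instance into the labeled set; this sketch shows a single pass only.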

## Meta-Base Creation

### Scenario Simulation

In [1]:
from metabase_builder import MetaBaseBuilder

#### Classifiers

In [8]:
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

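class SVCLinear(SVC):
    """SVC under a distinct class name, so linear-kernel runs are labeled "SVCLinear"."""


clf_list = [
    SVCLinear(kernel='linear', probability=True),  # probability=True enables predict_proba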
    KNeighborsClassifier(),
    GaussianNB(),
]


#### Query Strategies

In [7]:
from modAL.uncertainty import uncertainty_sampling, margin_sampling, entropy_sampling
from modAL.batch import uncertainty_batch_sampling
from modAL.disagreement import consensus_entropy_sampling, vote_entropy_sampling, max_disagreement_sampling

query_strategies = [ 
    uncertainty_sampling,
    uncertainty_batch_sampling, 
    margin_sampling,
]

### Example Dataset

In [9]:
import warnings

import openml

from config import dataset_ids

with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    dataset = openml.datasets.get_dataset(dataset_ids[0])
    
dataset

OpenML Dataset
Name..........: kr-vs-kp
Version.......: 1
Format........: ARFF
Upload Date...: 2014-04-06 23:19:28
Licence.......: Public
Download URL..: https://api.openml.org/data/v1/download/3/kr-vs-kp.arff
OpenML URL....: https://www.openml.org/d/3
# of features.: 37
# of instances: 3196

#### Generating the Meta-Base for a Dataset

In [10]:
builder = MetaBaseBuilder(estimators=clf_list,
                          query_strategies=query_strategies,
                          n_queries=5,
                          initial_l_size=5)

In [11]:
builder.build(dataset)

2024-03-07 23:01:16,618 - INFO - 3_kr-vs-kp - SVCLinear::uncertainty_sampling::0 - Criando instância...
2024-03-07 23:03:26,939 - INFO - 3_kr-vs-kp - SVCLinear::uncertainty_sampling::0 - Instância criada.
2024-03-07 23:03:26,962 - INFO - 3_kr-vs-kp - SVCLinear::uncertainty_sampling::1 - Criando instância...
2024-03-07 23:05:21,150 - INFO - 3_kr-vs-kp - SVCLinear::uncertainty_sampling::1 - Instância criada.
2024-03-07 23:05:21,165 - INFO - 3_kr-vs-kp - SVCLinear::uncertainty_sampling::2 - Criando instância...



In [13]:
builder.metabase.head(30)

| Unnamed: 0 | estimator | query-strategy | accuracy | f1-micro | f1-macro | f1-weighted | attr_conc.mean | attr_conc.sd | attr_ent.mean | attr_ent.sd | ... | var_importance.sd | vdb | vdu | w_lambda | wg_dist.mean | wg_dist.sd | worst_node.mean | worst_node.mean.relative | worst_node.sd | worst_node.sd.relative |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SVCLinear | uncertainty_sampling | 0.7597 | 0.7597 | 0.759534 | 0.760143 | 0.123004 | 0.212792 | 0.544273 | 0.371938 | ... | 0.046955 | 5.672254 | 1.496031e-07 | 0.356193 | 4.024132 | 0.272462 | 0.512134 | 2.0 | 0.021714 | 5.0 |
| 1 | SVCLinear | uncertainty_sampling | 0.788486 | 0.788486 | 0.788421 | 0.788388 | 0.054304 | 0.14467 | 0.544138 | 0.371819 | ... | 0.046901 | 5.678672 | 1.502381e-07 | 0.356057 | 4.023766 | 0.273027 | 0.5245 | 1.0 | 0.0314 | 6.0 |
| 2 | SVCLinear | uncertainty_sampling | 0.762203 | 0.762203 | 0.762095 | 0.762089 | 0.085863 | 0.184782 | 0.544261 | 0.372012 | ... | 0.04698 | 5.678806 | 1.508655e-07 | 0.356344 | 4.024348 | 0.272331 | 0.515126 | 1.0 | 0.035773 | 6.0 |
| 3 | SVCLinear | uncertainty_sampling | 0.774718 | 0.774718 | 0.774676 | 0.774625 | 0.024687 | 0.12226 | 0.544074 | 0.37198 | ... | 0.047063 | 5.683319 | 1.514855e-07 | 0.356704 | 4.023847 | 0.271654 | 0.512 | 1.0 | 0.023919 | 6.0 |
| 4 | SVCLinear | uncertainty_sampling | 0.806008 | 0.806008 | 0.805988 | 0.805942 | 0.131834 | 0.213937 | 0.544198 | 0.371962 | ... | 0.046642 | 5.681028 | 1.521324e-07 | 0.356785 | 4.024373 | 0.271611 | 0.523629 | 1.0 | 0.02906 | 6.0 |
| 5 | SVCLinear | uncertainty_batch_sampling | 0.690864 | 0.690864 | 0.68671 | 0.696597 | 0.105786 | 0.189285 | 0.544273 | 0.371938 | ... | 0.046625 | 5.672254 | 1.496031e-07 | 0.356193 | 4.024132 | 0.272462 | 0.525941 | 1.0 | 0.031685 | 6.0 |
| 6 | SVCLinear | uncertainty_batch_sampling | 0.745932 | 0.745932 | 0.744767 | 0.747854 | 0.069773 | 0.182314 | 0.543722 | 0.372102 | ... | 0.047099 | 5.66907 | 1.502497e-07 | 0.356826 | 4.02193 | 0.270995 | 0.542894 | 2.0 | 0.063175 | 6.0 |
| 7 | SVCLinear | uncertainty_batch_sampling | 0.717146 | 0.717146 | 0.716164 | 0.71886 | 0.087958 | 0.185271 | 0.543111 | 0.372129 | ... | 0.04657 | 5.678361 | 1.50889e-07 | 0.356039 | 4.019842 | 0.269552 | 0.529832 | 1.0 | 0.027118 | 6.0 |
| 8 | SVCLinear | uncertainty_batch_sampling | 0.783479 | 0.783479 | 0.783478 | 0.783457 | 0.119179 | 0.215806 | 0.542601 | 0.372196 | ... | 0.046651 | 5.661676 | 1.515204e-07 | 0.355866 | 4.017951 | 0.268642 | 0.522921 | 1.0 | 0.030345 | 6.0 |
| 9 | SVCLinear | uncertainty_batch_sampling | 0.789737 | 0.789737 | 0.78971 | 0.78966 | 0.176529 | 0.236073 | 0.5424 | 0.372386 | ... | 0.047046 | 5.659802 | 1.521679e-07 | 0.356273 | 4.0169 | 0.268607 | 0.520253 | 1.0 | 0.035473 | 6.0 |
