# Testing the modAL Framework

## Installing the required packages

In [1]:
!pip install -q modAL-python

In [2]:
!pip install -q openml

## Fetching the datasets

In [3]:
%%time
import openml

from config import dataset_ids

datasets = openml.datasets.get_datasets(dataset_ids)

CPU times: user 4.25 s, sys: 9 s, total: 13.2 s
Wall time: 2.71 s
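The `config` module imported above is not shown in this notebook. A minimal sketch of what it might contain is given here. The file contents are hypothetical; the only ID that can be confirmed from this notebook is 3, which is the OpenML ID in the kr-vs-kp dataset URL printed further down.

```python
# config.py (hypothetical contents, not the original file)
# 3 is the OpenML ID of kr-vs-kp, taken from the dataset URL in this notebook.
# Any further IDs would be appended to this list.
dataset_ids = [3]
```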


## Testing modAL on one of the fetched datasets

This experiment uses the first dataset in the `datasets` list.

In [4]:
dataset = datasets[0]
dataset

OpenML Dataset
Name..........: kr-vs-kp
Version.......: 1
Format........: ARFF
Upload Date...: 2014-04-06 23:19:28
Licence.......: Public
Download URL..: https://api.openml.org/data/v1/download/3/kr-vs-kp.arff
OpenML URL....: https://www.openml.org/d/3
# of features.: 37
# of instances: 3196

First, the dataset must be split into training and test sets.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X, y, categorical_indicator, attribute_names  = dataset.get_data(target=dataset.default_target_attribute)

# one-hot encode the categorical columns:
categorical_preprocessor = OneHotEncoder(sparse_output=False,
                                         handle_unknown='ignore')
preprocessor = ColumnTransformer(
    [('one-hot-encoder', categorical_preprocessor, categorical_indicator)], 
    remainder='passthrough'
)


X_train_raw, X_test, y_train_raw, y_test = train_test_split(preprocessor.fit_transform(X),
                                                    y.to_numpy(),
                                                    stratify=y)

print(X_train_raw.shape, y_train_raw.shape)
print(X_test.shape, y_test.shape)

(2397, 73) (2397,)
(799, 73) (799,)
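The row counts above follow from the default split proportions. The sketch below reproduces the arithmetic, assuming scikit-learn's default `test_size` of 0.25 (used when neither `test_size` nor `train_size` is passed) and that the test count is rounded up.

```python
import math

n_samples = 3196   # number of instances reported for kr-vs-kp
test_size = 0.25   # assumed scikit-learn default when no size is given

# The test set gets the rounded-up share; the training set gets the rest.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
print(n_train, n_test)  # 2397 799
```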


Next, split the training set into $L$ (labeled) and $U$ (unlabeled).

In [6]:
import numpy as np
INITIAL_L_SIZE = 20 # number of initially labeled instances
TRAIN_RAW_SIZE = X_train_raw.shape[0]

# sample without replacement so that L contains no duplicated rows
train_index = np.random.choice(TRAIN_RAW_SIZE, INITIAL_L_SIZE, replace=False)

# labeled set (L)
X_train = X_train_raw[train_index]
y_train = y_train_raw[train_index]

# unlabeled set (U)
X_pool = np.delete(X_train_raw, train_index, axis=0)
y_pool = np.delete(y_train_raw, train_index, axis=0)

print('L:', X_train.shape, y_train.shape)
print('U:', X_pool.shape, y_pool.shape)


L: (20, 73) (20,)
U: (2377, 73) (2377,)
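The sizes of $L$ and $U$ only add up to the training set size when the initial indices are distinct. The sketch below illustrates this with index bookkeeping alone, assuming sampling without replacement; the seed and the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n_train, n_initial = 2397, 20

# Sampling without replacement guarantees 20 distinct row indices.
labeled_idx = rng.choice(n_train, size=n_initial, replace=False)

# Boolean mask: True for rows that remain in the unlabeled pool.
in_pool = np.ones(n_train, dtype=bool)
in_pool[labeled_idx] = False

n_labeled = len(np.unique(labeled_idx))
n_pool = int(in_pool.sum())
print(n_labeled, n_pool)  # 20 2377
```

With the default `replace=True`, `np.random.choice` may return repeated indices, in which case $L$ would hold duplicated rows and $U$ would keep more than 2377 rows.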


Now it is time to set up the model that will be used.

In [7]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

Training and evaluating the model

In [8]:
model = SVC(probability=True)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       nowin       0.54      0.88      0.67       382
         won       0.75      0.31      0.44       417

    accuracy                           0.59       799
   macro avg       0.65      0.60      0.56       799
weighted avg       0.65      0.59      0.55       799
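As a quick consistency check on the report above, the F1 column can be recomputed from the precision and recall columns. This is a small sketch using the rounded values printed in the report, so only agreement to two decimals should be expected.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the classification report.
f1_nowin = f1(0.54, 0.88)
f1_won = f1(0.75, 0.31)
print(round(f1_nowin, 2), round(f1_won, 2))  # 0.67 0.44
```

The low F1 for `won` comes from its recall of 0.31: with only 20 labeled examples, the model assigns most test instances to `nowin`.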



### Implementing Active Learning

Sampling strategy: **Ranked Batch**

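$score = \alpha(1 - \Phi(x, X_{labeled})) + (1 - \alpha) U(x),$

where, following the modAL documentation, $\alpha = \frac{|X_{unlabeled}|}{|X_{unlabeled}| + |X_{labeled}|}$, $X_{labeled}$ is the labeled set, $U(x)$ is the uncertainty of the prediction for $x$, and $\Phi$ is a similarity function.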
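To make the score concrete, here is a minimal NumPy sketch of a single scoring step. It is not modAL's implementation. It assumes that $\alpha$ is the unlabeled share of all instances, that $\Phi$ is $1 / (1 + d)$ with $d$ the Euclidean distance to the nearest labeled row, and that $U(x)$ is one minus the highest predicted class probability. The function name and toy data are illustrative.

```python
import numpy as np

def ranked_batch_scores(X_labeled, X_unlabeled, proba):
    """Score each unlabeled row; a higher score means more worth querying."""
    n_u, n_l = len(X_unlabeled), len(X_labeled)
    alpha = n_u / (n_u + n_l)  # assumed definition of alpha

    # Assumption: similarity is 1 / (1 + d), where d is the Euclidean
    # distance from each unlabeled row to its nearest labeled row.
    diffs = X_unlabeled[:, None, :] - X_labeled[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    similarity = 1.0 / (1.0 + nearest)

    # Assumption: uncertainty is 1 minus the highest class probability.
    uncertainty = 1.0 - proba.max(axis=1)

    return alpha * (1.0 - similarity) + (1.0 - alpha) * uncertainty

# Toy data: one labeled point at the origin and three candidates.
X_labeled = np.array([[0.0, 0.0]])
X_unlabeled = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
proba = np.array([[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]])

scores = ranked_batch_scores(X_labeled, X_unlabeled, proba)
best = int(np.argmax(scores))
print(scores, best)
```

In this toy case the far-away candidate wins even though the model is fairly confident about it, because with few labeled rows $\alpha$ is large and exploration dominates. A full ranked batch would repeat this step, treating each selected row as labeled before scoring the next.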

In [11]:
%pdb 0
from functools import partial

from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from modAL.batch import uncertainty_batch_sampling

BATCH_SIZE = 3
preset_batch = partial(uncertainty_batch_sampling, n_instances=BATCH_SIZE)
learner = ActiveLearner(
    estimator=model,
    query_strategy=preset_batch, 
    X_training=X_train, 
    y_training=y_train
)

N_QUERIES = 20
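for i in range(N_QUERIES):
    query_index, query_instance = learner.query(X_pool)

    learner.teach(X=X_pool[query_index], y=y_pool[query_index])

    # remove the queried rows from the pool so they cannot be selected again
    X_pool = np.delete(X_pool, query_index, axis=0)
    y_pool = np.delete(y_pool, query_index, axis=0)

    accuracy = learner.score(X_test, y_test)
    print(f'Accuracy after query {i}: {accuracy}')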

Automatic pdb calling has been turned OFF
Accuracy after query 0: 0.5994993742177722
Accuracy after query 1: 0.7434292866082604
Accuracy after query 2: 0.7546933667083855
Accuracy after query 3: 0.7509386733416771
Accuracy after query 4: 0.7571964956195244
Accuracy after query 5: 0.7609511889862328
Accuracy after query 6: 0.753441802252816
Accuracy after query 7: 0.753441802252816
Accuracy after query 8: 0.8035043804755945
Accuracy after query 9: 0.7847309136420526
Accuracy after query 10: 0.8010012515644556
Accuracy after query 11: 0.7959949937421777
Accuracy after query 12: 0.7984981226533167
Accuracy after query 13: 0.8122653316645807
Accuracy after query 14: 0.8110137672090113
Accuracy after query 15: 0.8210262828535669
Accuracy after query 16: 0.8285356695869838
Accuracy after query 17: 0.83729662077597
Accuracy after query 18: 0.8410513141426783
Accuracy after query 19: 0.8360450563204005
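To put the accuracies above in terms of labeling effort, the sketch below works out how many labeled rows the learner holds after each query. It assumes that every call to `teach` adds exactly `BATCH_SIZE` rows to the initial 20; the variable names are illustrative.

```python
initial_labeled = 20   # INITIAL_L_SIZE
batch_size = 3         # BATCH_SIZE
n_queries = 20         # N_QUERIES
train_rows = 2397      # rows in the raw training set

# Labeled-set size after query 0, 1, ..., 19.
labeled_after = [initial_labeled + batch_size * (i + 1) for i in range(n_queries)]
final_fraction = labeled_after[-1] / train_rows
print(labeled_after[-1], f"{final_fraction:.1%}")  # 80 3.3%
```

Under that assumption, the final accuracy of about 0.84 is reached with 80 labeled rows, a small fraction of the 2397 training rows.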
