# Introduction

## Loading the dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")




## Tokenizing the dataset

The next step tokenizes the text, which natural language processing models require as input. Specifically, it uses the tokenizer of [BERT](https://es.wikipedia.org/wiki/BERT_(modelo_de_lenguaje)) (`bert-base-uncased`), which lowercases the input and splits it into WordPiece subword tokens.

In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
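
To make the effect of `padding="max_length"` and `truncation=True` concrete, the following is a minimal pure-Python sketch, not the tokenizer's actual implementation. The helper `pad_or_truncate`, the toy limit `MAX_LEN = 8`, and the example token ids are assumptions made for illustration. The real tokenizer pads to its own maximum length (512 for `bert-base-uncased`).

```python
# Toy illustration of padding="max_length" combined with truncation=True.
# PAD_ID = 0 matches the [PAD] id of bert-base-uncased; MAX_LEN = 8 is a toy
# value chosen for readability (hypothetical, not the real limit).
PAD_ID = 0
MAX_LEN = 8

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = token_ids[:max_len]                      # truncation: drop the tail
    n_pad = max_len - len(kept)                     # how many pad slots remain
    input_ids = kept + [pad_id] * n_pad             # right-pad with [PAD]
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask

# A short sequence is padded; a long one is cut to MAX_LEN.
# The ids below are made up for the example.
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102])
long_ids, long_mask = pad_or_truncate(list(range(101, 113)))

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Unlike this sketch, the real tokenizer keeps the closing `[SEP]` token when it truncates. The practical consequence is the same: every example ends up with the same length, which lets the examples be stacked into fixed-size batches.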

## Preparing the dataset

This step takes the already tokenized dataset and keeps only the first 1,500 training examples, so that the model can be trained in a reasonable amount of time.

In [3]:
train_testvalid = tokenized_datasets['train']
train_testvalid = train_testvalid.select(range(1500))
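
One caveat about `select(range(1500))`: it keeps the first rows in storage order. If the split is stored sorted by label (an assumption here, worth checking on the actual data), the subset will contain a single class. The sketch below illustrates the effect with a hypothetical label list standing in for the 25,000-example training split. It does not touch the real dataset.

```python
import random
from collections import Counter

# Hypothetical stand-in for a label-sorted split: 12,500 zeros, then 12,500 ones.
labels = [0] * 12_500 + [1] * 12_500

# Analogous to select(range(1500)) on the unshuffled split.
first_rows = labels[:1500]
print(Counter(first_rows))

# Shuffling with a fixed seed before selecting gives a mixed subset.
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)
mixed_rows = shuffled[:1500]
print(Counter(mixed_rows))
```

With the `datasets` library, the equivalent guard would be `tokenized_datasets['train'].shuffle(seed=42).select(range(1500))`. Running `Counter(train_testvalid['label'])` on the real subset is a quick way to confirm the class balance.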

### Splitting the dataset

This step splits the reduced dataset into a training set and a validation set.

In [None]:
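# Hold out 20% of the examples for validation.
# The split is random; pass seed=... to make it reproducible.
train_testvalid = train_testvalid.train_test_split(test_size=0.2)
train_dataset = train_testvalid['train']
valid_dataset = train_testvalid['test']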

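To handle the training batches efficiently, we use _DataLoaders_. Note that the `Trainer` used later receives the datasets directly and builds its own loaders, so these objects are only needed for a manual training loop: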

In [6]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)
valid_dataloader = DataLoader(valid_dataset, batch_size=8)

# Fine-tuning a pre-trained model

We start from a pre-trained natural language processing model.

In [None]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Model training

The parameters of the pre-trained model are now fine-tuned on the training set defined above:

In [9]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
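    # Evaluate on the validation set at the end of every epoch.
    # Recent transformers releases rename this argument to eval_strategy.
    evaluation_strategy="epoch",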
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
)

trainer.train()



| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | No log | 0.00049 |
| 2 | No log | 0.000233 |
| 3 | No log | 0.000185 |


TrainOutput(global_step=450, training_loss=0.008188242382473416, metrics={'train_runtime': 11400.4645, 'train_samples_per_second': 0.316, 'train_steps_per_second': 0.039, 'total_flos': 947199799296000.0, 'train_loss': 0.008188242382473416, 'epoch': 3.0})
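
The `global_step=450` value and the `No log` entries can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation. It also relies on the `TrainingArguments` default of `logging_steps=500`, which the cell above does not override.

```python
import math

n_selected = 1500      # rows kept by select(range(1500))
test_size = 0.2        # train_test_split(test_size=0.2)
batch_size = 8         # per_device_train_batch_size / per_device_eval_batch_size
epochs = 3             # num_train_epochs

n_valid = round(n_selected * test_size)             # validation examples
n_train = n_selected - n_valid                      # training examples
steps_per_epoch = math.ceil(n_train / batch_size)   # optimizer steps per epoch
total_steps = steps_per_epoch * epochs              # should match global_step
eval_steps = math.ceil(n_valid / batch_size)        # batches per evaluation pass

print(n_train, n_valid, steps_per_epoch, total_steps, eval_steps)

# The training loss is only logged every logging_steps optimizer steps.
default_logging_steps = 500
print(total_steps < default_logging_steps)
```

Because the whole run is shorter than one logging interval, the training loss column shows `No log`. Setting, for example, `logging_steps=50` in `TrainingArguments` would populate it.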

### Model performance after fine-tuning

In [10]:
metrics = trainer.evaluate()
print(metrics)

{'eval_loss': 0.0001852674613473937, 'eval_runtime': 110.5343, 'eval_samples_per_second': 2.714, 'eval_steps_per_second': 0.344, 'epoch': 3.0}


### Predictions

In [11]:
predictions = trainer.predict(valid_dataset)
print(predictions)

PredictionOutput(predictions=array([[ 4.475849 , -4.1330533],
       [ 4.5065064, -4.0472875],
       [ 4.570997 , -4.156846 ],
       [ 4.408584 , -3.9318402],
       [ 4.5121555, -4.178871 ],
       [ 4.5459695, -4.066243 ],
       [ 4.493396 , -4.138924 ],
       [ 4.189075 , -3.9837306],
       [ 4.5796676, -4.1405535],
       [ 4.5119166, -4.0818467],
       [ 4.464783 , -4.139474 ],
       [ 4.5560746, -4.1292024],
       [ 4.5673738, -4.166456 ],
       [ 4.5162463, -4.098231 ],
       [ 4.507188 , -4.1618137],
       [ 4.471827 , -4.115313 ],
       [ 4.5180154, -4.125523 ],
       [ 4.549863 , -4.0967717],
       [ 4.532546 , -4.113684 ],
       [ 4.5362263, -4.091946 ],
       [ 4.5093713, -4.0808544],
       [ 4.403423 , -3.979005 ],
       [ 4.549311 , -4.101613 ],
       [ 4.598023 , -4.1724677],
       [ 4.508365 , -4.1544814],
       [ 4.537312 , -4.1463666],
       [ 4.4584045, -4.1042724],
       [ 4.5637546, -4.1599307],
       [ 4.5086246, -4.1741595],
       [ 4.573
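
The array printed above holds raw logits, one row per review and one column per class. A minimal sketch of turning them into probabilities and predicted classes follows. It uses only the first three rows copied from the output, and the helper `softmax` is written out for illustration.

```python
import math

# First three logit rows copied from the prediction output above.
logits = [
    [4.475849, -4.1330533],
    [4.5065064, -4.0472875],
    [4.570997, -4.156846],
]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = [softmax(row) for row in logits]

# Predicted class = index of the largest logit in each row.
pred_classes = [max(range(len(row)), key=row.__getitem__) for row in logits]
print(pred_classes)  # [0, 0, 0]
```

Every visible row favors class 0 by a wide margin. Together with the near-zero validation loss, this is consistent with a validation set that contains a single class, so it is worth checking the label distribution before reading these numbers as evidence of a well-trained classifier.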

# Results
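The table below summarizes the metrics from the training and test stages. Note that the values in the `train` column coincide with the `trainer.evaluate()` output above, which was computed on the validation split.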


| Metric | train | test |
| --- | --- | --- |
| loss | 0.0001852674613473937 | 0.0001852674613473937 |
| runtime (s) | 110.5343 | 112.2932 |
| samples_per_second | 2.714 | 2.672 |
| steps_per_second | 0.344 | 0.338 |

