<a href="https://colab.research.google.com/github/joevincentgaltie/OODDetection_ENSAE/blob/main/Brouillon_OOD_ENSAE_jvg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [2]:
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import pandas as pd 
import numpy as np 

from tqdm import tqdm
import datasets
from datasets import load_dataset

import nltk
from nltk.tokenize import TreebankWordTokenizer
nltk.download('punkt')
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
nltk.download('stopwords')
from nltk.corpus import stopwords
from gensim.models.phrases import Phrases, Phraser

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW

from transformers import AutoTokenizer, BertForSequenceClassification, get_scheduler

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Loading the datasets

In [4]:
# https://huggingface.co/datasets/sst2
SST2Dataset = load_dataset('glue','sst2')
sst2_df_train = pd.DataFrame(SST2Dataset['train'])
sst2_df_valid = pd.DataFrame(SST2Dataset['validation'])
sst2_df_test = pd.DataFrame(SST2Dataset['test'])
sst2_df_train.head()

Unnamed: 0,sentence,label,idx
0,hide new secretions from the parental units,0,0
1,"contains no wit , only labored gags",0,1
2,that loves its characters and communicates som...,1,2
3,remains utterly satisfied to remain the same t...,0,3
4,on the worst revenge-of-the-nerds clichés the ...,0,4


In [5]:
# https://huggingface.co/datasets/imdb
IMDBDataset = load_dataset("imdb")
imdb_df_train = pd.DataFrame(IMDBDataset['train'])
imdb_df_test = pd.DataFrame(IMDBDataset['test'])
imdb_df_train.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


# Preprocessing the datasets

## Investigation SST2

In [6]:
# Setting index
sst2_df_train.set_index('idx', inplace=True)
sst2_df_valid.set_index('idx', inplace=True)
sst2_df_test.set_index('idx', inplace=True)

# Rename column
sst2_df_train.rename(columns={'sentence':'text'}, inplace=True)
sst2_df_valid.rename(columns={'sentence':'text'}, inplace=True)
sst2_df_test.rename(columns={'sentence':'text'}, inplace=True)

sst2_df_train

Unnamed: 0_level_0,text,label
idx,Unnamed: 1_level_1,Unnamed: 2_level_1
0,hide new secretions from the parental units,0
1,"contains no wit , only labored gags",0
2,that loves its characters and communicates som...,1
3,remains utterly satisfied to remain the same t...,0
4,on the worst revenge-of-the-nerds clichés the ...,0
...,...,...
67344,a delightful comedy,1
67345,"anguish , anger and frustration",0
67346,"at achieving the modest , crowd-pleasing goals...",1
67347,a patient viewer,1


In [7]:
sst2_df_train.loc[0].text

'hide new secretions from the parental units '

## Investigation IMDB

In [8]:
imdb_df_train.loc[0].text

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

We need to remove HTML tags, any URLs, and non-alphanumeric characters; stopwords can be removed as well.
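As a rough illustration of these cleaning steps, the sketch below works on the raw string before any tokenization. The helper name `clean_text` and the tiny `STOPWORDS` set are hypothetical stand-ins (the notebook itself relies on NLTK's English stopword list), and the regular expressions are simple heuristics, not a full HTML parser.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "it", "this", "for"}

def clean_text(text):
    """Strip HTML tags, URLs, non-alphanumeric characters and stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags such as <br />
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)         # keep letters, digits, whitespace
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)

sample = "This film is great!<br /><br />See https://example.com for the trailer."
print(clean_text(sample))  # film great see trailer
```

Cleaning the string first avoids having to recognise tags or URLs after a subword tokenizer has already split them into several pieces.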

## Cleaning and tokenizing the two datasets

Here we use the BERT tokenizer.

In [9]:
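import re

stopwords_en = set(stopwords.words('english'))
# BERT special tokens must survive the cleaning untouched
special_tokens = {'[PAD]', '[CLS]', '[SEP]', '[UNK]', '[MASK]'}

def remove_html(tokens):
  # Drop tokens that are a whole HTML tag, e.g. '<br>'
  return [t for t in tokens if not (t.startswith('<') and t.endswith('>'))]

def remove_url(tokens):
  return [t for t in tokens if 'http' not in t]

def remove_non_alpha(tokens):
  # str.replace treats its pattern literally, so a regex needs re.sub.
  # Special tokens are kept as-is and the WordPiece '##' prefix is preserved.
  cleaned = []
  for t in tokens:
    if t in special_tokens:
      cleaned.append(t)
      continue
    prefix = '##' if t.startswith('##') else ''
    core = re.sub(r'[^a-zA-Z0-9]', '', t[len(prefix):])
    if core:  # drop tokens that were pure punctuation
      cleaned.append(prefix + core)
  return cleaned

def remove_stopwords(tokens):
  return [t for t in tokens if t not in stopwords_en]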

In [10]:
def tokenize_html_url_alpha_stop(corpus, tokenizer):
  tokenized_sentences = []
  for sample in tqdm(corpus):
    for sentence in sent_detector.tokenize(sample):
      tokens = tokenizer.tokenize(sentence, padding="max_length", max_length=512, truncation=True, add_special_tokens=True)
      tokens = remove_url(tokens)
      tokens = remove_html(tokens)
      tokens = remove_non_alpha(tokens)
      tokens = remove_stopwords(tokens)
      tokens = list(map(lambda x: x.lower() if x not in ['[PAD]', '[CLS]', '[SEP]'] else x, tokens))
      # Length affected by removing some tokens
      tokens += ['[PAD]'] * (512 - len(tokens))
      tokenized_sentences.append(tokens)
      # Keep only the first sentence of each sample; otherwise there would be more token sequences than labels
      break
  return tokenized_sentences

In [11]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

In [12]:
# def clean_corpus(corpus, threshold=50):
#   tokenized_sentences = tokenize_html_url_alpha_stop(corpus, tokenizer)
#   phrases = Phrases(tokenized_sentences, threshold=threshold)
#   phraser = Phraser(phrases)
#   # Merging multi-word expressions in the tokenization
#   clean_corpus = []
#   for sentence in tokenized_sentences:
#     clean_corpus.append(phraser[sentence])
#   return clean_corpus

In [13]:
cleaned_sst2_train = tokenize_html_url_alpha_stop(sst2_df_train.text.array, tokenizer)
cleaned_sst2_valid = tokenize_html_url_alpha_stop(sst2_df_valid.text.array, tokenizer)
cleaned_sst2_test = tokenize_html_url_alpha_stop(sst2_df_test.text.array, tokenizer)
# Does not work: it uses up all the RAM
# cleaned_imdb_train = clean_corpus(imdb_df_train.text.array)
# cleaned_imdb_test = clean_corpus(imdb_df_test.text.array)





In [14]:
cleaned_sst2_train[0]

['[CLS]',
 'hide',
 'new',
 'secret',
 '##ions',
 'parental',
 'units',
 '[SEP]',
 '[PAD]',
 ...

# Preparation of the training data (SST2 for BERT)

For each token we need its vocabulary ID and its attention-mask value.
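To make these two quantities concrete, here is a toy sketch in plain Python. The vocabulary `toy_vocab` and the helper `encode` are made up for illustration: only the IDs 0, 100, 101 and 102 for `[PAD]`, `[UNK]`, `[CLS]` and `[SEP]` mirror `bert-base-uncased`, the other IDs are invented. In the notebook the real lookup is done by `tokenizer.convert_tokens_to_ids`.

```python
# Toy vocabulary: special-token IDs follow bert-base-uncased, the rest are invented.
toy_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "hide": 5, "new": 6, "units": 7}

def encode(tokens, max_length):
    """Return (input_ids, attention_mask), both padded or truncated to max_length."""
    tokens = tokens[:max_length]
    tokens = tokens + ["[PAD]"] * (max_length - len(tokens))
    # Unknown tokens fall back to the [UNK] ID
    input_ids = [toy_vocab.get(t, toy_vocab["[UNK]"]) for t in tokens]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [0 if t == "[PAD]" else 1 for t in tokens]
    return input_ids, attention_mask

ids, mask = encode(["[CLS]", "hide", "new", "secret", "units", "[SEP]"], max_length=8)
print(ids)   # [101, 5, 6, 100, 7, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Note that this naive truncation can cut off the closing `[SEP]` when a sequence is longer than `max_length`.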

In [15]:
# ID of each token
ids_train = [tokenizer.convert_tokens_to_ids(t) for t in cleaned_sst2_train]
ids_valid = [tokenizer.convert_tokens_to_ids(t) for t in cleaned_sst2_valid]
ids_test = [tokenizer.convert_tokens_to_ids(t) for t in cleaned_sst2_test]

In [16]:
# Attention mask of each token

def get_attention_mask(tokens):
  # 1 for real tokens, 0 for padding
  return [0 if t == '[PAD]' else 1 for t in tokens]

mask_train = [get_attention_mask(t) for t in cleaned_sst2_train]
mask_valid = [get_attention_mask(t) for t in cleaned_sst2_valid]
mask_test = [get_attention_mask(t) for t in cleaned_sst2_test]

In [17]:
# Label of each sentence
label_train = sst2_df_train.label
label_valid = sst2_df_valid.label
label_test = sst2_df_test.label

In [18]:
# all elements into tensors 
ids_train = torch.tensor(ids_train)
mask_train = torch.tensor(mask_train)
label_train  = torch.tensor(label_train)
ids_valid = torch.tensor(ids_valid)
mask_valid = torch.tensor(mask_valid)
label_valid  = torch.tensor(label_valid)
ids_test = torch.tensor(ids_test)
mask_test = torch.tensor(mask_test)
label_test  = torch.tensor(label_test)

In [19]:
# Pytorch DataLoader creation
train_data = TensorDataset(ids_train, mask_train, label_train)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=8)
valid_data = TensorDataset(ids_valid, mask_valid, label_valid)
valid_dataloader = DataLoader(valid_data, shuffle=True, batch_size=8)

In [20]:
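# Draft attempt, superseded by the next cell: a DatasetDict maps split names to
# Dataset objects, so it is not the right container for raw tensors.
train_data = datasets.DatasetDict({"labels":label_train,"input_ids":ids_train, 'attention_mask':mask_train})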
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=8)

In [21]:
train_data = datasets.Dataset.from_dict({"labels": label_train, "input_ids":ids_train, 'attention_mask':mask_train})
train_data.set_format("torch")

small_train_data = train_data.shuffle(seed=42).select(range(200))


train_dataloader = DataLoader(small_train_data, shuffle=True, batch_size=8)

# Training of BERT model on SST2 Dataset

In [33]:
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2,output_hidden_states = True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [34]:
optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
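For intuition, the learning rate produced by such a schedule can be sketched in plain Python. This assumes the `"linear"` scheduler follows the usual rule of linear warmup followed by linear decay to zero; the function name `linear_schedule_lr` is hypothetical and is not part of `transformers`.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

# 3 epochs x 25 batches (200 samples with a batch size of 8)
total_steps = 3 * 25
for step in (0, 25, 50, 75):
    print(step, linear_schedule_lr(step, 5e-5, total_steps))
```

With no warmup the rate starts at its full value of 5e-5 and reaches zero at the last step, so the final updates of such a short run are very small.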

In [35]:
if torch.cuda.is_available():
  device = 'cuda'
  print('DEVICE = ', torch.cuda.get_device_name(0))
else:
  device = 'cpu'
  print('DEVICE = ', 'CPU')
model = model.to(device)

DEVICE =  Tesla T4


In [36]:
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


100%|██████████| 75/75 [09:30<00:00,  7.61s/it]


Let's try to retrieve the hidden states.

In [63]:
train_dataloader

<torch.utils.data.dataloader.DataLoader at 0x7fabc1baa2b0>

In [68]:
for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        print(batch)
        outputs = model(**batch)

{'labels': tensor([0, 1, 0, 1, 1, 1, 0, 1], device='cuda:0'), 'input_ids': tensor([[  101, 10223,  6508,  ...,     0,     0,     0],
        [  101,  8669,  1010,  ...,     0,     0,     0],
        [  101, 10580,  7143,  ...,     0,     0,     0],
        ...,
        [  101,  1050,  1005,  ...,     0,     0,     0],
        [  101,  2296,  8257,  ...,     0,     0,     0],
        [  101, 18230,  1010,  ...,     0,     0,     0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}
{'labels': tensor([0, 1, 1, 1, 0, 0, 0, 0], device='cuda:0'), 'input_ids': tensor([[  101,  4821,  3249,  ...,     0,     0,     0],
        [  101,  3201,  1011,  ...,     0,     0,     0],
        [  101, 11850, 22249,  ...,     0,     0,     0],
        ...,
        [  101,  2295,  9811,

In [67]:

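# Fails with a TypeError: a DataLoader is not a mapping, so it cannot be
# unpacked with **. The model has to be called on one batch at a time.
output = model(**train_dataloader)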

TypeError: ignored

In [102]:
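def mean_point(output):
  # Average the hidden states over all 13 outputs (embeddings + 12 encoder layers).
  # output.hidden_states is a tuple of [batch, seq_len, hidden] tensors, so
  # stacking avoids hard-coding the batch size and the number of layers.
  return torch.stack(output.hidden_states).mean(dim=0)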

In [103]:
outputs[2][1].size()

torch.Size([8, 512, 768])

In [105]:
mean_point(outputs)[0]

tensor([[-0.3295,  0.3589, -0.0314,  ..., -0.1695, -0.2676,  0.4943],
        [-0.0281, -1.1006,  0.7874,  ..., -0.4092,  0.1913,  0.2183],
        [ 0.3751,  0.2006, -0.0480,  ...,  0.1294, -0.3014, -0.7763],
        ...,
        [-0.2960,  0.1078,  0.5484,  ...,  0.4303, -0.6236, -0.7794],
        [ 0.0281, -0.2533,  0.7786,  ...,  0.4050, -0.2841, -0.7968],
        [-0.0752, -0.0565,  0.6638,  ...,  0.3493, -0.2765, -1.0160]],
       device='cuda:0', grad_fn=<SelectBackward0>)

In [55]:
print(f'there are {len(outputs[2])} hidden layers')
print(f'batches are composed of {len(outputs[2][2])} sentences')
print(f'sentences are composed of max {len(outputs[2][2][2])} tokens ?')
print(f'each token is embedded in {len(outputs[2][2][2][2])} dimensions')

there are 13 hidden layers
batches are composed of 8 sentences
sentences are composed of max 512 tokens ?
each token is embedded in 768 dimensions
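The nesting printed above corresponds to a tuple of 13 arrays (the embedding output plus one per encoder layer of the 12-layer `bert-base-uncased`), each of shape `[batch, tokens, hidden]`. The sketch below uses small random NumPy arrays as stand-ins for the real tensors and shows one possible way, not necessarily the one this notebook ends up using, to reduce them to a single vector per sentence: average over layers, then average over non-padding tokens. The function name `sentence_embeddings` and the reduced sizes are illustrative assumptions.

```python
import numpy as np

def sentence_embeddings(hidden_states, attention_mask):
    """hidden_states: sequence of [batch, tokens, hidden] arrays, one per layer.
    attention_mask: [batch, tokens] array of 0/1. Returns a [batch, hidden] array."""
    layer_mean = np.mean(np.stack(hidden_states), axis=0)        # [batch, tokens, hidden]
    mask = attention_mask[:, :, None].astype(layer_mean.dtype)   # [batch, tokens, 1]
    summed = (layer_mean * mask).sum(axis=1)                     # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1, None)                  # avoid division by zero
    return summed / counts

rng = np.random.default_rng(0)
# Reduced sizes for the demo: 13 layers, 2 sentences, 6 tokens, 4 dimensions.
hidden_states = [rng.normal(size=(2, 6, 4)) for _ in range(13)]
attention_mask = np.array([[1, 1, 1, 0, 0, 0],
                           [1, 1, 1, 1, 1, 0]])
emb = sentence_embeddings(hidden_states, attention_mask)
print(emb.shape)  # (2, 4)
```

Masking matters here because sequences padded to 512 tokens are mostly `[PAD]`; a plain mean over the token axis would be dominated by padding positions.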


In [81]:
outputs[2][12][7].size()

torch.Size([512, 768])

In [56]:
for layer in range(13) : 
  for batch in range(8)

tensor([ 1.8647e-02, -3.3170e-01,  6.1590e-01,  6.9116e-01, -3.9415e-01,
        -3.9074e-01, -7.3373e-01,  3.6809e-01, -5.5088e-01,  8.0500e-01,
         2.6509e-01,  9.8813e-02, -1.4440e-01, -9.7821e-01, -3.3220e-01,
         1.7309e-01, -1.7823e-01, -1.0196e+00,  2.3900e-01,  8.0320e-01,
        -3.7817e-01, -6.3962e-01,  1.5371e+00,  2.3453e-01,  3.3962e-01,
        -1.3728e+00,  3.2890e-01,  6.6166e-01,  7.6018e-01,  1.1654e+00,
         1.0844e+00,  7.5489e-01, -1.5281e-01,  7.9671e-01,  2.6778e-01,
        -2.4298e-01, -9.8542e-01,  9.3552e-01,  4.6008e-01, -1.4079e+00,
        -6.1412e-02,  6.4299e-01,  1.4665e+00,  4.9257e-01, -1.2820e-01,
         7.7193e-01, -2.6388e-01,  4.5351e-01,  1.3911e+00, -1.1833e-01,
        -6.7114e-01,  8.1743e-01,  7.4027e-01, -2.6942e-01, -7.0323e-01,
         5.4971e-01, -5.9814e-02, -9.6069e-01, -1.4142e+00, -2.3543e-01,
        -3.7453e-01, -5.4654e-01, -2.2069e-01, -2.7638e-01, -2.2828e-01,
        -1.3171e-01,  7.9162e-02, -9.7474e-01, -4.4