<a href="https://colab.research.google.com/github/pavaris-pm/sentiment-analysis-GLUE-SST2/blob/main/The_Stanford_Sentiment_Treebank_(glue_sst2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!nvidia-smi # GPU used for training in this task

Sat Dec 24 07:33:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
!pip install transformers
!pip install datasets
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)

In [4]:
import pandas as pd
import numpy as np
import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
import evaluate

In [5]:
raw_datasets = load_dataset('glue', 'sst2')
raw_datasets

Downloading and preparing dataset glue/sst2 to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [6]:
raw_datasets['train'][0] # take a peek at the first training example

{'sentence': 'hide new secretions from the parental units ',
 'label': 0,
 'idx': 0}

In [7]:
# Transformers models take tensors of token IDs as input, so the text must first be converted with a tokenizer
# we pick a checkpoint that has already been fine-tuned on GLUE SST-2
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'

In [8]:
# load the tokenizer that matches the checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer

PreTrainedTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [9]:
tokenizer('I like you. I love you', truncation=True)

{'input_ids': [101, 1045, 2066, 2017, 1012, 1045, 2293, 2017, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
# the call above in two explicit steps: split into tokens, then map tokens to IDs (no special tokens are added here)
test_token = tokenizer.tokenize('I like you. I love you')
print(f'the sentence that is tokenized is : {test_token}')
print(f'the token is encoded into an ids : {tokenizer.convert_tokens_to_ids(test_token)}')

the sentence that is tokenized is : ['i', 'like', 'you', '.', 'i', 'love', 'you']
the token is encoded into an ids : [1045, 2066, 2017, 1012, 1045, 2293, 2017]
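
The IDs printed here match the `input_ids` returned by the direct tokenizer call in cell 9, except that the first and last entries (101 and 102) are missing. A minimal sketch of that last step, assuming 101 and 102 are the `[CLS]` and `[SEP]` IDs of this vocabulary; `wrap_with_special_tokens` is a hypothetical helper, not a library function:

```python
# Hypothetical helper: mimics what the full tokenizer call adds on top of
# tokenize() + convert_tokens_to_ids() for a single sentence.
CLS_ID = 101  # assumed ID of [CLS] in this vocabulary
SEP_ID = 102  # assumed ID of [SEP] in this vocabulary


def wrap_with_special_tokens(token_ids):
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    # every position holds a real token, so the mask is all ones
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


ids = [1045, 2066, 2017, 1012, 1045, 2293, 2017]  # IDs printed in the cell above
encoded = wrap_with_special_tokens(ids)
print(encoded)
```

The result has nine IDs and nine ones in the mask, which is the shape of the output in cell 9.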


In [11]:
# processing the data in batches is much faster than one sentence at a time
# let's try it on a small sample
sample_sentence = raw_datasets['train'][:5] # take the first 5 examples as a sample batch
print(sample_sentence)

{'sentence': ['hide new secretions from the parental units ', 'contains no wit , only labored gags ', 'that loves its characters and communicates something rather beautiful about human nature ', 'remains utterly satisfied to remain the same throughout ', 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up '], 'label': [0, 0, 1, 0, 0], 'idx': [0, 1, 2, 3, 4]}


In [12]:
# with batched=True, Dataset.map passes a batch (a dict of lists) to this function
def tokenize_func(dataset):
  return tokenizer(dataset['sentence'], truncation=True) # tokenize only the 'sentence' column

In [13]:
raw_datasets # we will process this dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [14]:
# the tokenizer can process many sentences at once
# apply the tokenize function to every split in batches before feeding the data to the model
tokenized_datasets = raw_datasets.map(tokenize_func, batched=True)

  0%|          | 0/68 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]
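
The three progress bars above count batches (`ba`), not examples. Assuming `map` runs with its default `batch_size` of 1000, the totals 68, 1 and 2 follow from the split sizes. A quick check:

```python
import math

BATCH_SIZE = 1000  # assumption: default batch_size of Dataset.map with batched=True
split_sizes = {"train": 67349, "validation": 872, "test": 1821}

# number of batches needed to cover each split, counting the last partial batch
num_batches = {name: math.ceil(n / BATCH_SIZE) for name, n in split_sizes.items()}
print(num_batches)
```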

In [15]:
tokenized_datasets # after tokenization, the input_ids and attention_mask columns are added

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [16]:
# inspect a few tokenized samples to see what the data collator will receive
samples = tokenized_datasets['train'][:5]
print(samples.items())
print(samples.keys())

dict_items([('sentence', ['hide new secretions from the parental units ', 'contains no wit , only labored gags ', 'that loves its characters and communicates something rather beautiful about human nature ', 'remains utterly satisfied to remain the same throughout ', 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up ']), ('label', [0, 0, 1, 0, 0]), ('idx', [0, 1, 2, 3, 4]), ('input_ids', [[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102], [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102], [101, 2008, 7459, 2049, 3494, 1998, 10639, 2015, 2242, 2738, 3376, 2055, 2529, 3267, 102], [101, 3464, 12580, 8510, 2000, 3961, 1996, 2168, 2802, 102], [101, 2006, 1996, 5409, 7195, 1011, 1997, 1011, 1996, 1011, 11265, 17811, 18856, 17322, 2015, 1996, 16587, 2071, 2852, 24225, 2039, 102]]), ('attention_mask', [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [17]:
batch = {k: v for k, v in samples.items() if k not in ['sentence', 'idx']}
batch

{'label': [0, 0, 1, 0, 0],
 'input_ids': [[101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
  [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102],
  [101,
   2008,
   7459,
   2049,
   3494,
   1998,
   10639,
   2015,
   2242,
   2738,
   3376,
   2055,
   2529,
   3267,
   102],
  [101, 3464, 12580, 8510, 2000, 3961, 1996, 2168, 2802, 102],
  [101,
   2006,
   1996,
   5409,
   7195,
   1011,
   1997,
   1011,
   1996,
   1011,
   11265,
   17811,
   18856,
   17322,
   2015,
   1996,
   16587,
   2071,
   2852,
   24225,
   2039,
   102]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [18]:
print(f'lengths of input_ids in the batch: {[len(x) for x in samples["input_ids"]]}') # the sequences in a batch have different lengths
print(f'the data collator will pad every sequence to length {max([len(x) for x in samples["input_ids"]])}')

lengths of input_ids in the batch: [10, 11, 15, 10, 22]
the data collator will pad every sequence to length 22
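
To make the padding step concrete, here is a plain-Python sketch of dynamic padding. It is not the `DataCollatorWithPadding` implementation (which also returns tensors); `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption based on the `pad_token_id` in the DistilBERT config.

```python
PAD_ID = 0  # assumed ID of the [PAD] token


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence on the right to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# toy sequences with the same lengths as the sample batch: 10, 11, 15, 10, 22
toy = [[7] * n for n in (10, 11, 15, 10, 22)]
padded = pad_batch(toy)
print([len(x) for x in padded["input_ids"]])
```

Padding per batch rather than to a global maximum keeps short batches short, which is the reason for using a collator instead of padding in the tokenizer.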


In [19]:
# define the metric function the Trainer calls at each evaluation
def compute_metric(prediction):
  metric = evaluate.load('glue', 'sst2')  # the GLUE SST-2 metric reports accuracy
  logits, labels = prediction
  # the predicted class is the index of the largest logit
  preds = np.argmax(logits, axis=-1)
  return metric.compute(predictions = preds, references = labels)
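
What this function computes can be written out without any library. The sketch below is an illustration only, with hypothetical toy logits and labels; it takes the argmax of each row and reports the fraction of correct predictions.

```python
def argmax(row):
    # index of the largest value in a list of logits
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(labels)}


# hypothetical toy values: 4 examples, 2 classes
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
result = accuracy(toy_logits, toy_labels)
print(result)
```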

In [20]:
# the data is tokenized, so we can now load the model and the data collator
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

In [21]:
os.getcwd()

'/content'

In [22]:
os.makedirs('/content/model_sst2_checkpoint', exist_ok=True) # does not fail if the directory already exists

In [23]:
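# NOTE: per_gpu_train_batch_size / per_gpu_eval_batch_size are deprecated (see the warnings below);
# per_device_train_batch_size / per_device_eval_batch_size are the preferred names
training_args = TrainingArguments('/content/model_sst2_checkpoint',
                                  evaluation_strategy='epoch',
                                  num_train_epochs = 3.0,
                                  seed = 1000,
                                  per_gpu_train_batch_size = 16,
                                  per_gpu_eval_batch_size = 16,
                                  learning_rate = 5e-05)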

In [24]:
trainer = Trainer(model,
                  training_args,
                  train_dataset = tokenized_datasets['train'],
                  eval_dataset = tokenized_datasets['validation'],
                  data_collator = data_collator,
                  tokenizer = tokenizer,
                  compute_metrics = compute_metric)

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


In [25]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 67349
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 12630
  Number of trainable parameters = 66955010

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.1315        | 0.357687        | 0.892202 |
| 2     | 0.0811        | 0.403238        | 0.897936 |
| 3     | 0.0417        | 0.485627        | 0.905963 |


Saving model checkpoint to /content/model_sst2_checkpoint/checkpoint-500
Configuration saved in /content/model_sst2_checkpoint/checkpoint-500/config.json
Model weights saved in /content/model_sst2_checkpoint/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /content/model_sst2_checkpoint/checkpoint-500/tokenizer_config.json
Special tokens file saved in /content/model_sst2_checkpoint/checkpoint-500/special_tokens_map.json
Saving model checkpoint to /content/model_sst2_checkpoint/checkpoint-1000
Configuration saved in /content/model_sst2_checkpoint/checkpoint-1000/config.json
Model weights saved in /content/model_sst2_checkpoint/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /content/model_sst2_checkpoint/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /content/model_sst2_checkpoint/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to /content/model_sst2_checkpoint/checkpoint-1500
Configuration saved in /content/model_sst2_

Using deprecated `--per_gpu_eval_batch_size` argument which will be removed in a future version. Using `--per_device_eval_batch_size` is preferred.
Saving model checkpoint to /content/model_sst2_checkpoint/checkpoint-4500
Configuration saved in /content/model_sst2_checkpoint/checkpoint-4500/config.json
Model weights saved in /content/model_sst2_checkpoint/checkpoint-4500/pytorch_model.bin
tokenizer config file saved in /content/model_sst2_checkpoint/checkpoint-4500/tokenizer_config.json
Special tokens file saved in /content/model_sst2_checkpoint/checkpoint-4500/special_tokens_map.json
Saving model checkpoint to /content/model_sst2_checkpoint/checkpoint-5000
Configuration saved in /content/model_sst2_checkpoint/checkpoint-5000/config.json
Model weights saved in /content/model_sst2_checkpoint/checkpoint-5000/pytorch_model.bin
tokenizer config file saved in /content/model_sst2_checkpoint/checkpoint-5000/tokenizer_config.json
Special tokens file saved in /content/model_sst2_checkpoint/chec

TrainOutput(global_step=12630, training_loss=0.08455385893748472, metrics={'train_runtime': 962.813, 'train_samples_per_second': 209.851, 'train_steps_per_second': 13.118, 'total_flos': 1837063059069168.0, 'train_loss': 0.08455385893748472, 'epoch': 3.0})
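
The numbers in the training log and in `TrainOutput` can be checked by hand. The sketch below assumes the effective batch size is 16 (the "Total train batch size" in the log) and that the last partial batch of each epoch counts as one step:

```python
import math

num_examples = 67349     # "Num examples" in the training log
num_epochs = 3
batch_size = 16          # "Total train batch size" in the training log
train_runtime = 962.813  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 3))
```

The step count agrees with `global_step=12630`, and the two rates agree with `train_samples_per_second` and `train_steps_per_second` to within rounding.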

In [32]:
print(tokenized_datasets['train'][0])
print(tokenized_datasets['train'][1])
print(tokenized_datasets['train'][3])
print(tokenized_datasets['test'][0])
print(tokenized_datasets['test'][1])
print(tokenized_datasets['test'][3])

{'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0, 'input_ids': [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'sentence': 'contains no wit , only labored gags ', 'label': 0, 'idx': 1, 'input_ids': [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'sentence': 'remains utterly satisfied to remain the same throughout ', 'label': 0, 'idx': 3, 'input_ids': [101, 3464, 12580, 8510, 2000, 3961, 1996, 2168, 2802, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'sentence': 'uneasy mishmash of styles and genres .', 'label': -1, 'idx': 0, 'input_ids': [101, 15491, 28616, 22444, 4095, 1997, 6782, 1998, 11541, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'sentence': "this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if d

In [33]:
tokenized_datasets['test'].shape

(1821, 5)

In [36]:
# find all distinct labels in a dataset split
def get_unique_label(dataset):
  # dataset['label'] returns the whole label column at once
  return np.unique(dataset['label'])

In [41]:
# test finding label func
print(f'the possible label of train dataset is : {get_unique_label(tokenized_datasets["train"])}')
print(f'the possible label of validation dataset is : {get_unique_label(tokenized_datasets["validation"])}')
print(f'the possible label of test dataset is : {get_unique_label(tokenized_datasets["test"])}')

the possible label of train dataset is : [0 1]
the possible label of validation dataset is : [0 1]
the possible label of test dataset is : [-1]


In [45]:
# try to make predictions
# now that we have a fine-tuned model, the simplest way to use it is the pipeline function
from transformers import pipeline

# note: checkpoint-1000 is an intermediate checkpoint saved at step 1000 of 12630, not the final weights
new_checkpoint = '/content/model_sst2_checkpoint/checkpoint-1000'

classifier = pipeline("text-classification", model=new_checkpoint)

loading configuration file /content/model_sst2_checkpoint/checkpoint-1000/config.json
Model config DistilBertConfig {
  "_name_or_path": "/content/model_sst2_checkpoint/checkpoint-1000",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "vocab_size": 30522
}

loading configuration file /content/model_sst2_checkpoint/checkpoint-100

In [62]:
# sample test data
print(tokenized_datasets['test'][0]['sentence'])
print(tokenized_datasets['test'][1]['sentence'])
print(tokenized_datasets['test'][2]['sentence'])

uneasy mishmash of styles and genres .
this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .
by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .


In [64]:
# the pipeline also accepts a list of sentences
classifier(["uneasy mishmash of styles and genres .",
               "this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .",
               "by the end of no such thing the audience , like beatrice , has a watchful affection for the monster ."])

Disabling tokenizer parallelism, we're using DataLoader multithreading already


[{'label': 'NEGATIVE', 'score': 0.9989732503890991},
 {'label': 'NEGATIVE', 'score': 0.9987654685974121},
 {'label': 'POSITIVE', 'score': 0.9862404465675354}]
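
Each `score` is a probability for the predicted label. A minimal sketch of how such a result can be derived from the model's two logits, assuming the pipeline applies a softmax for this single-label, two-class model; the logits below are hypothetical, and `to_prediction` is an illustrative helper rather than a pipeline function:

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # same mapping as id2label in the model config


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits):
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": ID2LABEL[best], "score": probs[best]}


toy = to_prediction([3.5, -3.4])  # hypothetical logits for one sentence
print(toy)
```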

In [72]:
# the pipeline also accepts a list of sentences
t = classifier(["uneasy mishmash of styles and genres .",
               "this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .",
               "by the end of no such thing the audience , like beatrice , has a watchful affection for the monster ."])

print(len(t))
# read the predicted label of each sentence
print(t[0]['label'])
print(t[1]['label'])
print(t[2]['label'])

3
NEGATIVE
NEGATIVE
POSITIVE


In [71]:
classifier('uneasy mishmash of styles and genres .')

[{'label': 'NEGATIVE', 'score': 0.9989732503890991}]

In [93]:
# combine the sentences into one text first; later, sent_tokenize from the nltk library will split it back into sentences
def join_sentence(sentence_list):
  # join the sentences with a single space
  return " ".join(sentence_list)

# count how many sentences are missing after sentence tokenization
def find_sentence_missing(tokenized_text):
  return abs(len(tokenized_text) - len(tokenized_datasets['test']))


# the classifier can predict many sentences in one call
def make_prediction(sentence_list):
  result = classifier(sentence_list)
  # keep only the predicted label of each sentence
  return [pred['label'] for pred in result]
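
For a long list it can be convenient to feed the classifier in fixed-size chunks, for example to report progress between chunks. The sketch below shows one way to do that; `chunked`, `predict_in_chunks` and the stand-in `fake_classifier` are all hypothetical names, and the stand-in only imitates the list-of-dicts output format of the pipeline.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_chunks(classify, sentences, size=16):
    labels = []
    for chunk in chunked(sentences, size):
        labels.extend(result["label"] for result in classify(chunk))
    return labels


# Stand-in for the real pipeline (hypothetical): same output format, trivial rule.
def fake_classifier(sentences):
    return [{"label": "POSITIVE" if "good" in s else "NEGATIVE", "score": 1.0}
            for s in sentences]


sentences = ["good film"] * 20 + ["dull film"] * 17
labels = predict_in_chunks(fake_classifier, sentences, size=16)
print(len(labels), labels[0], labels[-1])
```

With the real pipeline, `classifier` would be passed in place of `fake_classifier`.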

In [75]:
sample_sentence_list = ["uneasy mishmash of styles and genres .",
               "this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .",
               "by the end of no such thing the audience , like beatrice , has a watchful affection for the monster ."]

In [86]:
sample_text = join_sentence(sample_sentence_list)
sample_text

"uneasy mishmash of styles and genres . this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation . by the end of no such thing the audience , like beatrice , has a watchful affection for the monster ."

In [None]:
# build a list of all test sentences
sentence_list = []
for elem in tokenized_datasets['test']:
    sentence_list.append(elem['sentence'])

print(len(sentence_list))
raw_text = join_sentence(sentence_list)

In [80]:
raw_text

"uneasy mishmash of styles and genres . this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation . by the end of no such thing the audience , like beatrice , has a watchful affection for the monster . director rob marshall went out gunning to make a great one . lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new . a well-made and often lovely depiction of the mysteries of friendship . none of this violates the letter of behan 's book , but missing is its spirit , its ribald , full-throated humor . although it bangs a very cliched drum at times , this crowd-pleaser 's fresh dialogue , energetic music , and good-natured spunk are often infectious . it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another . this is junk food cinema at its greasiest ."

In [51]:
# check length of test dataset
print(tokenized_datasets['test'].shape)
print(len(tokenized_datasets['test']))

(1821, 5)
1821


In [81]:
# split the text back into sentences for easier prediction
# using sent_tokenize from nltk
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [83]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [88]:
print(sent_tokenize(sample_text))
print(len(sent_tokenize(sample_text)))

['uneasy mishmash of styles and genres .', "this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .", 'by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .']
3


In [91]:
# let sent_tokenize split the raw_text built from the test dataset into sentences
test_data = sent_tokenize(raw_text)

In [95]:
print(len(test_data)) # fewer than the 1821 test sentences: some were merged during sentence tokenization
print(test_data[:10])
print(f"Sentences missing after tokenized : {find_sentence_missing(test_data)}")

1761
['uneasy mishmash of styles and genres .', "this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation .", 'by the end of no such thing the audience , like beatrice , has a watchful affection for the monster .', 'director rob marshall went out gunning to make a great one .', 'lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new .', 'a well-made and often lovely depiction of the mysteries of friendship .', "none of this violates the letter of behan 's book , but missing is its spirit , its ribald , full-throated humor .", "although it bangs a very cliched drum at times , this crowd-pleaser 's fresh dialogue , energetic music , and good-natured spunk are often infectious .", 'it is not a mass-market entertainment but an uncompromising attempt by one artist to think about another .', 'this is junk food cinema at its greasiest .']
Sente

## Finding missing sentences after tokenization
- We will find the initial point where sentences start to go missing, then insert them back from the test dataset.
- For simplicity, building a new list directly with a loop is better. However, I aimed to use `sent_tokenize` from nltk in this task.
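
One way to locate that initial point is to walk both lists in parallel and stop at the first index where they differ. This is only a sketch on hypothetical toy sentences; `first_mismatch` is an illustrative helper, not part of nltk or datasets.

```python
def first_mismatch(original, resplit):
    """Return the first index where the two sentence lists differ, or None."""
    for i, (a, b) in enumerate(zip(original, resplit)):
        if a.strip() != b.strip():
            return i
    if len(original) != len(resplit):
        # one list is a strict prefix of the other
        return min(len(original), len(resplit))
    return None


# hypothetical toy data: the splitter merged the last two sentences
original = ["a fine film .", "odd ... but fun .", "not for me ."]
resplit = ["a fine film .", "odd ... but fun . not for me ."]
idx = first_mismatch(original, resplit)
print(idx)
```

Applied to the notebook's data, the arguments would be the list of test sentences and `test_data`.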

In [105]:
# compare the first sentence before and after sentence tokenization
print("Data before tokenized")
print(tokenized_datasets['test'][0]['sentence'])
print(len(tokenized_datasets['test'][0]['sentence']))

print("================================================")

print("Data after tokenized")
print(test_data[0])
print(len(test_data[0]))



Data before tokenized
uneasy mishmash of styles and genres .
38
Data after tokenized
uneasy mishmash of styles and genres .
38


In [111]:
# sent_tokenize does not work here: it returns fewer sentences than the test set contains, not more
# so we go back to building the list directly with a loop
test_data_list = []
for i in range(len(tokenized_datasets['test'])):
  test_data_list.append(tokenized_datasets['test'][i]['sentence'])

In [None]:
print(test_data_list)
print(len(test_data_list))

## Make a prediction
- I aim to show how to make predictions by using our fine-tuned model as a pipeline.
- The labels of the GLUE SST-2 test set on Hugging Face are all -1. This is because the test sets for GLUE are hidden (see https://github.com/huggingface/nlp/issues/245), so the labels are not publicly available. You can read the GLUE paper for more details.
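
Because the test labels are hidden, pipeline predictions can only be scored on a split that has real labels, such as validation. The sketch below maps the pipeline's string labels back to integer IDs and skips any example whose gold label is -1. The mapping mirrors `label2id` in the model config; the helper names and toy values are hypothetical.

```python
LABEL2ID = {"NEGATIVE": 0, "POSITIVE": 1}  # same mapping as label2id in the model config


def labels_to_ids(pred_labels):
    return [LABEL2ID[label] for label in pred_labels]


def accuracy_on_labeled(pred_ids, gold_ids):
    """Accuracy over examples with a real label; None if every label is hidden (-1)."""
    labeled = [(p, g) for p, g in zip(pred_ids, gold_ids) if g != -1]
    if not labeled:
        return None
    return sum(int(p == g) for p, g in labeled) / len(labeled)


# hypothetical toy values
pred_ids = labels_to_ids(["NEGATIVE", "POSITIVE", "POSITIVE", "NEGATIVE"])
print(accuracy_on_labeled(pred_ids, [0, 1, 0, 0]))      # labeled split
print(accuracy_on_labeled(pred_ids, [-1, -1, -1, -1]))  # hidden test labels
```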

In [115]:
%%time
prediction_test = make_prediction(test_data_list) # take around 2-3 mins

CPU times: user 2min 8s, sys: 2.14 s, total: 2min 10s
Wall time: 2min 20s


In [116]:
print(f"the length of prediction test is : {len(prediction_test)}")

the length of prediction test is : 1821


In [117]:
# now we have one prediction for every sentence in the test set
prediction_test[:20]

['NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE']

## Save the model for further usage
- To compress the files into a tar.gz archive, I used https://medium.com/@vkmauryavk/how-to-zip-unzip-files-on-google-colb-or-jupyter-notebook-cb9a6e0aafdd as my reference.
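
The same archive can also be produced from Python with the standard-library `tarfile` module instead of the shell. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it runs in a temporary directory with a hypothetical `demo_model` folder, and `make_tar_gz` / `extract_tar_gz` are illustrative helper names.

```python
import os
import tarfile
import tempfile


def make_tar_gz(source_dir, archive_path):
    """Compress source_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores a relative folder name instead of the absolute path
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    return archive_path


def extract_tar_gz(archive_path, target_dir):
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)


with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "demo_model")  # hypothetical stand-in for the saved model folder
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, "config.json"), "w") as f:
        f.write("{}")
    archive = make_tar_gz(model_dir, os.path.join(tmp, "demo_model.tar.gz"))
    out_dir = os.path.join(tmp, "restored")
    extract_tar_gz(archive, out_dir)
    restored = sorted(os.listdir(os.path.join(out_dir, "demo_model")))

print(restored)
```

After extraction, the restored folder can be passed to `pipeline(..., model=...)` in the same way as a checkpoint directory.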

In [118]:
# check the current working directory
os.getcwd()

'/content'

In [120]:
os.makedirs('/content/sentiment-analysis-sst2_model', exist_ok=True) # does not fail if the directory already exists

In [121]:
os.chdir('/content/sentiment-analysis-sst2_model')

In [122]:
os.getcwd()

'/content/sentiment-analysis-sst2_model'

In [123]:
trainer.save_model('/content/sentiment-analysis-sst2_model')

Saving model checkpoint to /content/sentiment-analysis-sst2_model
Configuration saved in /content/sentiment-analysis-sst2_model/config.json
Model weights saved in /content/sentiment-analysis-sst2_model/pytorch_model.bin
tokenizer config file saved in /content/sentiment-analysis-sst2_model/tokenizer_config.json
Special tokens file saved in /content/sentiment-analysis-sst2_model/special_tokens_map.json


In [125]:
os.chdir('/content')

In [131]:
# compress the saved model into a tar.gz archive for further usage
!tar chvfz sentiment-analysis-sst2_model.tar.gz "/content/sentiment-analysis-sst2_model"

tar: Removing leading `/' from member names
/content/sentiment-analysis-sst2_model/
/content/sentiment-analysis-sst2_model/config.json
/content/sentiment-analysis-sst2_model/vocab.txt
/content/sentiment-analysis-sst2_model/special_tokens_map.json
/content/sentiment-analysis-sst2_model/training_args.bin
/content/sentiment-analysis-sst2_model/tokenizer.json
/content/sentiment-analysis-sst2_model/tokenizer_config.json
/content/sentiment-analysis-sst2_model/pytorch_model.bin


In [132]:
# copy the archive to my Google Drive
!cp -r "/content/sentiment-analysis-sst2_model.tar.gz" '/content/drive/MyDrive/AIML/trained_model'