In [0]:
# Original code licensed by:
# Copyright 2018 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# AlBERTo End to End (Fine-tuning + Predicting) with Cloud TPU

## Overview

**BERT**, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

In particular, we use this notebook to fine-tune **AlBERTo**, the first Italian language-understanding model for Twitter language.

**Note:** You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket for this Colab to run.

Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to create a GCP account and a GCS bucket.

You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.

## Instructions

<h3><a href="https://cloud.google.com/tpu/"><img valign="middle" src="https://raw.githubusercontent.com/GoogleCloudPlatform/tensorflow-without-a-phd/master/tensorflow-rl-pong/images/tpu-hexagon.png" width="50"></a>  &nbsp;&nbsp;Train on TPU</h3>

   1. Create a Cloud Storage bucket for your TensorBoard logs at http://console.cloud.google.com/storage and fill in the BUCKET parameter in the "Parameters" section below.
 
   1. On the main menu, click Runtime and select **Change runtime type**. Set "TPU" as the hardware accelerator.

### Set up your TPU environment

In this section, you perform the following tasks:

*   Set up a Colab TPU running environment
*   Verify that you are connected to a TPU device
*   Upload your credentials to the TPU so that it can access your GCS bucket

In [0]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()
with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.109.168.10:8470
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 14701092589750331724),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 14325857611891808670),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 8508874451840970562),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 14599398136899750328),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 4060877057823524901),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:

### Prepare and import BERT modules
With your environment configured, you can now prepare and import the BERT modules. The following step clones the source code from GitHub and imports the modules from the source.


In [0]:
import sys

!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path += ['bert_repo']

# import python modules defined by BERT
from run_classifier import *
import modeling
import optimization
import tokenization

Cloning into 'bert_repo'...
remote: Enumerating objects: 336, done.
remote: Total 336 (delta 0), reused 0 (delta 0), pack-reused 336
Receiving objects: 100% (336/336), 287.80 KiB | 3.89 MiB/s, done.
Resolving deltas: 100% (184/184), done.



### Prepare for training

This next section of code performs the following tasks:

*  Specify task and download training data.
*  Specify the AlBERTo pretrained model.
*  Specify the GCS bucket and the output directory for model checkpoints and eval results.




In [0]:
TASK = 'HASPEEDE_09_Inc' #@param {type:"string"}

TASK_DATA_DIR = './'

BUCKET = 'alberto_v2_files' #@param {type:"string"}
assert BUCKET, 'Must specify an existing GCS bucket name'


OUTPUT_DIR = 'gs://{}/{}/models/'.format(BUCKET, TASK)
#tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

#CONFIGURE AlBERTo MODEL
BERT_CONFIG_FILE = 'gs://alberto_v2_files/alberto_uncased_L-12_H-768_A-12_italian_ckpt/config.json' #@param {type:"string"}
VOCAB_FILE = 'gs://alberto_v2_files/alberto_uncased_L-12_H-768_A-12_italian_ckpt/vocab.txt' #@param {type:"string"}
INIT_CHECKPOINT = 'gs://albert_files/model_files/alberto_model.ckpt'#@param {type:"string"}



***** Model output directory: gs://alberto_v2_files/HASPEEDE_09_Inc/models/ *****


We also initialize our hyperparameters, prepare the training data, and initialize the TPU config.

In [0]:
#SET THE PARAMETERS
TRAIN_BATCH_SIZE = 512
PREDICT_BATCH_SIZE = 512
EVAL_BATCH_SIZE = 512
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 10.0
MAX_SEQ_LENGTH = 128
# Warmup is a period of time where the learning rate
# is small and gradually increases; this usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
SAVE_SUMMARY_STEPS = 500

# Setup TPU related config
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
NUM_TPU_CORES = 8
ITERATIONS_PER_LOOP = 1000

def get_run_config(output_dir):
  return tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=output_dir,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))


In [0]:
#PREPARE TRAINING SENTENCES
!pip install ekphrasis
!pip install pandas
!pip install numpy

from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
import pandas as pd
import numpy as np

text_processor = TextPreProcessor (
    # terms that will be normalized
    normalize=[ 'url' , 'email', 'user', 'percent', 'money', 'phone', 'time', 'date', 'number'] ,
    # terms that will be annotated
    annotate={"hashtag"} ,
    fix_html=True ,  # fix HTML tokens

    unpack_hashtags=True ,  # perform word segmentation on hashtags

    # select a tokenizer. You can use SocialTokenizer, or pass your own
    # the tokenizer, should take as input a string and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts = [ emoticons ]
)


Collecting ekphrasis
  Downloading https://files.pythonhosted.org/packages/92/e6/37c59d65e78c3a2aaf662df58faca7250eb6b36c559b912a39a7ca204cfb/ekphrasis-0.5.1.tar.gz (80kB)
Collecting colorama
  Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collecting ujson
  Downloading https://files.pythonhosted.org/packages/16/c4/79f3409bc710559015464e5f49b

In [0]:
#SETUP TRAINING AND TEST DATA
import csv  # needed for csv.QUOTE_NONE below

training_data = pd.read_csv('haspeede_TW-train.tsv',delimiter="\t", header=None, names = ["id","message","class"], quoting=csv.QUOTE_NONE, error_bad_lines=False)
sentences = training_data.iloc[:,1]
labels = training_data.iloc[:,2]

#final examples training
examples = []

import re
i = 0
for s in sentences:
    s = s.lower()
    s = str(" ".join(text_processor.pre_process_doc(s)))
    s = re.sub(r"[^a-zA-ZÀ-ú</>!?♥♡\s\U00010000-\U0010ffff]", ' ', s)
    s = re.sub(r"\s+", ' ', s)
    s = re.sub(r'(\w)\1{2,}',r'\1\1', s)
    s = re.sub ( r'^\s' , '' , s )
    s = re.sub ( r'\s$' , '' , s )
    #print(s)
    examples.append([labels[i],s])
    i = i+1


examples = np.array(examples)

#final examples test
test_data = pd.read_csv('haspeede_FB-test.tsv',delimiter="\t", names = ["id","message"], header=None, quoting=csv.QUOTE_NONE, error_bad_lines=False)
print(test_data.shape)
sentences_test = test_data.iloc[:,1]
test_ids = test_data.iloc[:,0]
sentences = test_data.iloc[:,1]

examples_test = []


i = 0
for s in sentences:
    s = s.lower()
    s = str(" ".join(text_processor.pre_process_doc(s)))
    s = re.sub(r"[^a-zA-ZÀ-ú</>!?♥♡\s\U00010000-\U0010ffff]", ' ', s)
    s = re.sub(r"\s+", ' ', s)
    s = re.sub(r'(\w)\1{2,}',r'\1\1', s)
    s = re.sub ( r'^\s' , '' , s )
    s = re.sub ( r'\s$' , '' , s )
    #print(s)
    # the test set is unlabeled; use a placeholder label (f2 below ignores it)
    examples_test.append([0,s])
    i = i+1

examples_test = np.array(examples_test)

'''
We need to transform our data into a format BERT understands. This involves two steps. First, we create InputExample objects using the constructor provided in the BERT library. The second step, converting the examples to features, happens in the training and prediction cells.

    text_a is the text we want to classify, which in this case is the "message" column of our DataFrame.
    text_b is used if we're training a model to understand the relationship between sentences (i.e. is text_b a translation of text_a? Is text_b an answer to the question asked by text_a?). This doesn't apply to our task, so we leave text_b as None.
    label is the label for our example, here 0 or 1.

'''

f = lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                   text_a = x[1], 
                                                                   text_b = None, 
                                                                   label = int(x[0]))

f2 = lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                   text_a = x[1], 
                                                                   text_b = None, 
                                                                   label = 0)

train_examples = map(f,examples)
train_examples = list(train_examples)

test_examples = map(f2,examples_test)
test_examples = list(test_examples)

train_examples = np.array(train_examples)
test_examples = np.array(test_examples)

print(test_examples.shape)

(1000, 2)
(1000,)
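The regular-expression cleanup applied inside both loops above can be factored into one helper. The sketch below is a hypothetical refactoring (the name `clean_text` does not appear in the notebook), and it leaves out the ekphrasis `pre_process_doc` step so that it runs with the standard library alone. Because of that, its output will differ from the notebook's for inputs that ekphrasis would normalize or re-tokenize.

```python
import re

# Characters to keep: Latin letters, accented letters, the tag characters
# < / >, the punctuation ! ?, two heart symbols, whitespace, and code points
# above U+FFFF (which includes most emoji). Everything else becomes a space.
_DISALLOWED = re.compile(r"[^a-zA-ZÀ-ú</>!?♥♡\s\U00010000-\U0010ffff]")
_WHITESPACE = re.compile(r"\s+")
_REPEATED = re.compile(r"(\w)\1{2,}")


def clean_text(text):
    """Apply the notebook's regex cleanup steps, without ekphrasis."""
    text = text.lower()
    text = _DISALLOWED.sub(" ", text)      # drop digits and other symbols
    text = _WHITESPACE.sub(" ", text)      # collapse runs of whitespace
    text = _REPEATED.sub(r"\1\1", text)    # squeeze 3+ repeated chars to 2
    return text.strip()                    # trim leading/trailing space


for sample in ["Ma....anche no!", "Ciaoooo   a TUTTI 123", "basta 😠"]:
    print(repr(clean_text(sample)))
```

Keeping the steps in a single function would let the training and test loops share exactly the same cleanup.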


In [0]:
#Labels used for annotating sentences
label_list = [0, 1]

In [0]:
#Test data just created
for r in test_examples:
  print(r.text_a)

ma anche no !
ma dove vivono ?
le vai a impollinare tu le piante e gli alberi da frutto peppino crusciani ? una catastrofe si sta consumando da più di un anno nel silenzio pressoché totale dei media che troppo spesso sottovalutano le problematiche legate all ambiente le api stanno diminuendo in modo spaventoso in tutto il mondo alcuni penseranno che questo fenomeno potrebbe comportare semplicemente la diminuzione della produzione del miele ma non è esattamente così le api sono le responsabili dell impollinazione di centinaia di specie di piante sia coltivate che selvatiche le conseguenze di una mancata impollinazione si riflettono sull agricoltura e sull intero ecosistema del pianeta in pratica non solo niente miele ma niente frutti meno verdure niente fiori <url>
ma manda li a quel paese questi zingari bugiardi
complimenti a chi sostiene ancora questa politica marcia 👏 👏 👏
troppo poco caro salvini la castrazione
se non prima qualcuno stupri la boldrini fino a saziarla questo scempio n

In [0]:
#initialize parameters
num_train_steps = int(len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)+1
# warmup steps are a fraction of the total training steps, as in run_classifier.py
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
#print(num_train_steps)
#print(num_warmup_steps)

#Initialize the tokenizer
tokenizer = tokenization.FullTokenizer("vocabulary_lower_case_128.txt", do_lower_case=True)
tokenizer.tokenize("appena aprono bocca iniziano a sparare <hashtag> cavolate </hashtag> ! 😠 <url>")

['appena',
 'aprono',
 'bocca',
 'iniziano',
 'a',
 'sparare',
 '<',
 'ha',
 '##shtag',
 '>',
 'cavolate',
 '<',
 '/',
 'ha',
 '##shtag',
 '>',
 '!',
 '[UNK]',
 '<',
 'ur',
 '##l',
 '>']
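The output above shows `<hashtag>` split into `<`, `ha`, `##shtag`, `>` and the emoji mapped to `[UNK]`. This is consistent with greedy longest-match-first WordPiece segmentation applied after punctuation has been split off. The sketch below illustrates that matching rule on a toy vocabulary. The function name `wordpiece` and the vocabulary are made up for illustration and are not AlBERTo's real vocabulary file.

```python
def wordpiece(token, vocab, unk_token="[UNK]"):
    """Split one token into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole token is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption, for illustration only).
toy_vocab = {"ha", "##shtag", "ur", "##l", "cavolate"}

for token in ["hashtag", "url", "cavolate", "😠"]:
    print(token, wordpiece(token, toy_vocab))
```

If tags such as `<hashtag>` and `<url>` were present in the vocabulary as whole tokens, this rule would keep them intact instead of splitting them.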

# Fine-tune and Run Predictions on a pretrained BERT Model

This section demonstrates fine-tuning from the pre-trained AlBERTo checkpoint (`INIT_CHECKPOINT`) and running predictions on the test set.


In [0]:
BERT_CONFIG= modeling.BertConfig.from_json_file(BERT_CONFIG_FILE)


model_fn = model_fn_builder(
  bert_config=BERT_CONFIG,
  num_labels=len(label_list),
  init_checkpoint=INIT_CHECKPOINT,
  learning_rate=LEARNING_RATE,
  num_train_steps=num_train_steps,
  num_warmup_steps=num_warmup_steps,
  use_tpu=True,
  use_one_hot_embeddings=True
)

estimator = tf.contrib.tpu.TPUEstimator(
  use_tpu=True,
  model_fn=model_fn,
  config=get_run_config(OUTPUT_DIR),
  train_batch_size=TRAIN_BATCH_SIZE,
  eval_batch_size=EVAL_BATCH_SIZE,
  predict_batch_size=PREDICT_BATCH_SIZE,
)


W0815 10:12:29.214733 140656217786240 estimator.py:1984] Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7fecc65b7e18>) includes params argument, but params are not passed to Estimator.


In [0]:
train_features = convert_examples_to_features(
      train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)

print("***** Running training *****")
print("  Num examples = %d" % len(train_examples))
print("  Num labels = %d" % len(label_list))
print("  Batch size = %d" % TRAIN_BATCH_SIZE)
print("  Num steps = %d" % num_train_steps)

train_input_fn = input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)

print('***** Started training at {} *****'.format(datetime.datetime.now()))
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print('***** Finished training at {} *****'.format(datetime.datetime.now()))


W0815 10:12:31.526856 140656217786240 deprecation_wrapper.py:119] From bert_repo/run_classifier.py:774: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.



***** Running training *****
  Num examples = 3000
  Num labels = 2
  Batch size = 512
  Num steps = 59
***** Started training at 2019-08-15 10:12:33.212275 *****


W0815 10:12:33.409187 140656217786240 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0815 10:12:35.321691 140656217786240 deprecation_wrapper.py:119] From bert_repo/modeling.py:171: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W0815 10:12:35.326877 140656217786240 deprecation_wrapper.py:119] From bert_repo/modeling.py:409: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

W0815 10:12:35.660507 140656217786240 deprecation_wrapper.py:119] From bert_repo/modeling.py:490: The name tf.assert_less_equal is deprecated. Please use tf.compat.v1.assert_less_equal instead.

W0815 10

***** Finished training at 2019-08-15 10:15:59.922975 *****


In [0]:
# MODEL PREDICTIONS
input_features = convert_examples_to_features(
      test_examples,label_list, MAX_SEQ_LENGTH, tokenizer)
  
print(len(list(input_features)))  
print(len(test_ids))
print(len(sentences_test))
print('***** Started predictions at {} *****'.format(datetime.datetime.now()))

# Eval will be slightly WRONG on the TPU because it will truncate
# the last batch.

predict_input_fn = input_fn_builder(
    features=input_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)


# estimator.predict returns a generator; inference runs lazily when it is consumed below.
predictions = estimator.predict(predict_input_fn)

print('***** Finished evaluation at {} *****'.format(datetime.datetime.now()))


  
#SAVE RESULTS IN THE BUCKET AND PRINT THEM
output_eval_file = os.path.join(OUTPUT_DIR, "haspeede_TW_task3_2_r3_results.tsv")
with tf.gfile.GFile(output_eval_file, "w") as writer:
  print("***** Results *****")
  for example, prediction, id in zip(sentences_test, predictions, test_ids):
    print('\t prediction:%s \t id:%s \t text_a: %s' % ( np.argmax(prediction['probabilities']),str(id),str(example) ) )
    writer.write("%s\t%s\t%s\n" % (str(id),example, np.argmax(prediction['probabilities'])) )


1000
1000
1000
***** Started predictions at 2019-08-15 10:16:00.374683 *****
***** Finished evaluation at 2019-08-15 10:16:00.375935 *****
***** Results *****


W0815 10:16:08.756518 140656217786240 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


	 prediction:0 	 id:1 	 text_a: Ma....anche no!
	 prediction:0 	 id:2 	 text_a: Ma dove vivono ?
	 prediction:0 	 id:3 	 text_a: Le vai a impollinare tu le piante e gli alberi da frutto peppino crusciani? Una catastrofe si sta consumando da più di un anno, nel silenzio pressoché totale dei media che troppo spesso sottovalutano le problematiche legate all’ambiente: le api stanno diminuendo in modo spaventoso in tutto il mondo. Alcuni penseranno che questo fenomeno potrebbe comportare semplicemente la diminuzione della produzione del miele, ma non è esattamente così: le api sono le responsabili dell’impollinazione di centinaia di specie di piante, sia coltivate che selvatiche. Le conseguenze di una mancata impollinazione si riflettono sull’agricoltura, e sull’intero ecosistema del Pianeta. In pratica, non solo niente miele, ma niente frutti, meno verdure, niente fiori.  http://ambiente.tiscali.it/socialnews/articoli/daddario/15756/einstein-le-api-e-la-fine-del-mondo/
	 prediction:1 	 id:
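Each item yielded by `estimator.predict` in the cell above is a dictionary with a `probabilities` entry, and the predicted class is the index of the largest probability. The sketch below reproduces that post-processing with made-up probabilities and an in-memory buffer instead of the GCS file. The names `fake_predictions`, `argmax`, and `write_results` are hypothetical. It also illustrates that a generator does no work until it is iterated, which is likely why the "Started" and "Finished" timestamps above are almost identical.

```python
import io


def fake_predictions(rows, log):
    """Stand-in for estimator.predict: yields one dict per example, lazily."""
    for probs in rows:
        log.append("computed")  # record when the work actually happens
        yield {"probabilities": probs}


def argmax(values):
    """Index of the largest value (pure-Python stand-in for np.argmax)."""
    return max(range(len(values)), key=values.__getitem__)


def write_results(writer, ids, texts, predictions):
    """Write one tab-separated line per example: id, text, predicted class."""
    for example_id, text, prediction in zip(ids, texts, predictions):
        label = argmax(prediction["probabilities"])
        writer.write("%s\t%s\t%s\n" % (example_id, text, label))


log = []
predictions = fake_predictions([[0.9, 0.1], [0.2, 0.8]], log)
calls_before = len(log)  # still 0: creating the generator computes nothing

buffer = io.StringIO()
write_results(buffer, [1, 2], ["ma anche no !", "esempio"], predictions)
print(calls_before, len(log))
print(buffer.getvalue())
```

To time the predictions themselves, the "Finished" timestamp would have to be taken after the loop that consumes the generator.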