<a href="https://colab.research.google.com/github/kpi6research/Bert-as-a-Library/blob/master/examples/Finetune_Bert_Sentiment140_with_BertLibrary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Bert Sentiment140 with BertLibrary

This Colab notebook shows how to use Bert-as-a-Library to finetune Bert base on the Sentiment140 dataset, which you can find on [kaggle](https://colab.research.google.com/drive/1yTiTQ6g-lM7RISeK774DuM8jesuiOVS_#scrollTo=MbApRh4QGd6F&line=3&uniqifier=1). We use the same data split as this [kernel](https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis) to show the boost in performance that Bert can give you without any complex processing.

Download the Bert Base Uncased model and the Sentiment140 dataset. The model can be downloaded directly from the web, but the dataset must be downloaded manually from kaggle and then uploaded to the notebook.

In [0]:
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip

--2019-10-11 09:00:25--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.133.128, 2a00:1450:400c:c0c::80
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.133.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘uncased_L-12_H-768_A-12.zip’


2019-10-11 09:00:35 (89.9 MB/s) - ‘uncased_L-12_H-768_A-12.zip’ saved [407727028/407727028]

Archive:  uncased_L-12_H-768_A-12.zip
   creating: uncased_L-12_H-768_A-12/
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: uncased_L-12_H-768_A-12/vocab.txt  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: uncased_L-12_H-768_A-12/bert_config.json  


In [0]:
from google.colab import files
uploaded = files.upload()

Saving sentiment140.zip to sentiment140.zip


In [0]:
!unzip "sentiment140.zip"

Archive:  sentiment140.zip
  inflating: training.1600000.processed.noemoticon.csv  


### Process Sentiment140 dataset

In [0]:
import pandas as pd

In [0]:
t140 = pd.read_csv('training.1600000.processed.noemoticon.csv', sep=',', header=None, encoding='latin')

In [0]:
t140.head()

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [0]:
label_text = t140[[0, 5]]

In [0]:
# Map the original labels (0 = negative, 4 = positive) to 0 and 1
label_text[0] = label_text[0].apply(lambda x: 0 if x == 0 else 1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [0]:
# Assign proper column names to labels
label_text.columns = ['label', 'text']
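The `SettingWithCopyWarning` printed above appears because `label_text` is a selection taken from `t140`, so pandas cannot tell whether the assignment is also meant to change the original frame. A minimal sketch of one way to avoid it is to take an explicit copy before modifying the selection. The `raw` frame and its three rows are made up for illustration and only mimic the column layout of the real file.

```python
import pandas as pd

# Toy stand-in for t140: column 0 holds the label, column 5 the tweet text.
raw = pd.DataFrame({0: [0, 4, 4], 5: ["so sad", "so happy", "great day"]})

# An explicit copy makes label_text independent of raw,
# so the assignments below do not trigger the warning.
label_text = raw[[0, 5]].copy()

# Map 0 -> 0 and any other value (4 in Sentiment140) -> 1.
label_text[0] = (label_text[0] != 0).astype(int)
label_text.columns = ["label", "text"]

print(label_text)
```

The original frame keeps its 0/4 labels, which is the behaviour the warning is asking you to make explicit.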

In [0]:
label_text.head()

Unnamed: 0,label,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [0]:
import re

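# A hashtag or mention only counts at the start of the text or after whitespace,
# so strings such as name@host are left untouched.
hashtags = re.compile(r"^#\S+|\s#\S+")
mentions = re.compile(r"^@\S+|\s@\S+")
urls = re.compile(r"https?://\S+")

def process_text(text):
    # Replace hashtags and mentions with placeholder words, then normalize
    text = hashtags.sub(' hashtag', text)
    text = mentions.sub(' entity', text)
    return text.strip().lower()

def match_expr(pattern, string):
    # True if the pattern occurs anywhere in the string
    return pattern.search(string) is not None

def get_data_wo_urls(dataset):
    # Keep only the rows whose text contains no URL
    link_with_urls = dataset.text.apply(lambda x: match_expr(urls, x))
    return dataset[~link_with_urls]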
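To make the effect of these patterns concrete, here is a small self-contained sketch that applies the same regular expressions to two made-up tweets (the sample strings are not taken from the dataset). It shows that a mention or hashtag is replaced only when it starts the text or follows whitespace, and that URLs are detected but not rewritten.

```python
import re

# Same patterns as in the notebook cell above.
hashtags = re.compile(r"^#\S+|\s#\S+")
mentions = re.compile(r"^@\S+|\s@\S+")
urls = re.compile(r"https?://\S+")

def process_text(text):
    text = hashtags.sub(' hashtag', text)
    text = mentions.sub(' entity', text)
    return text.strip().lower()

# Made-up tweets, for illustration only.
samples = [
    "@Alice loving the #NLP course http://t.co/xyz",
    "Email me at bob@example.com #fun",
]

processed = [process_text(s) for s in samples]
has_url = [urls.search(s) is not None for s in samples]

for original, cleaned, flag in zip(samples, processed, has_url):
    print(repr(original), "->", repr(cleaned), "| contains URL:", flag)
```

In the second sample the `@` sits inside an e-mail address, so it is not preceded by whitespace and the address is kept as it is.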

In [0]:
link_with_urls = label_text.text.apply(lambda x: match_expr(urls, x))

In [0]:
# Check what fraction of tweets contain URLs, to see if we can afford to drop them
link_with_urls.sum() / len(label_text.text)

0.043819375

In [0]:
from sklearn.model_selection import train_test_split
TRAIN_SIZE = 0.75
VAL_SIZE = 0.05
dataset_count = len(label_text)

df_train_val, df_test = train_test_split(label_text, test_size=1-TRAIN_SIZE-VAL_SIZE, random_state=42)
df_train, df_val = train_test_split(df_train_val, test_size=VAL_SIZE / (VAL_SIZE + TRAIN_SIZE), random_state=42)

print("TRAIN size:", len(df_train))
print("VAL size:", len(df_val))
print("TEST size:", len(df_test))

TRAIN size: 1200000
VAL size: 80000
TEST size: 320000
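The second call to `train_test_split` uses `VAL_SIZE / (VAL_SIZE + TRAIN_SIZE)` rather than `VAL_SIZE`, because the validation set is carved out of the remaining 80% of the data and not out of the full dataset. The sketch below redoes that arithmetic in plain Python. It rounds to the nearest integer, which is an assumption for illustration and not necessarily how scikit-learn rounds in edge cases.

```python
TRAIN_SIZE = 0.75
VAL_SIZE = 0.05
total = 1_600_000  # rows in the Sentiment140 training file

# First split: hold out the test share of the full dataset.
test_fraction = 1 - TRAIN_SIZE - VAL_SIZE
n_test = round(total * test_fraction)
n_train_val = total - n_test

# Second split: rescale the validation share, since it is now taken
# from the remaining train+val portion instead of the full dataset.
val_fraction_of_rest = VAL_SIZE / (VAL_SIZE + TRAIN_SIZE)
n_val = round(n_train_val * val_fraction_of_rest)
n_train = n_train_val - n_val

print("TRAIN size:", n_train)
print("VAL size:", n_val)
print("TEST size:", n_test)
```

This reproduces the three sizes printed by the cell above.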


In [0]:
# Remove tweets containing URLs from the training set only
df_train = get_data_wo_urls(df_train)

In [0]:
df_train.text = df_train.text.apply(process_text)
df_val.text = df_val.text.apply(process_text)
df_test.text = df_test.text.apply(process_text)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [0]:
df_train.head()

Unnamed: 0,label,text
441493,0,entity i can't watch it until next week cos i ...
937763,1,just woke up
1106597,1,"in the garden with soph, fi, craig, and sarah ..."
1321504,1,entity can't wait to see your new hairstyle!!!
71659,0,"entity i miss you, mr. superhero. come back to..."


In [0]:
!mkdir dataset
df_train.sample(frac=1.0).reset_index(drop=True).to_csv('dataset/train.tsv', sep='\t', index=None, header=None)
df_val.to_csv('dataset/dev.tsv', sep='\t', index=None, header=None)
df_test.to_csv('dataset/test.tsv', sep='\t', index=None, header=None)
! cd dataset && ls

dev.tsv  test.tsv  train.tsv
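The three files are written without a header, tab-separated, with the label in the first column and the text in the second. Before training, it can be worth checking that every row really has two fields and that both labels are present. The following is a minimal sketch using only the standard library. It reads from an in-memory string with made-up rows instead of the real files, and `count_labels` is a hypothetical helper, not part of BertLibrary.

```python
import csv
import io
from collections import Counter

# Toy stand-in for dataset/train.tsv: label <TAB> text, no header.
sample_tsv = (
    "0\tentity i can't watch it until next week\n"
    "1\tjust woke up\n"
    "1\tentity can't wait to see your new hairstyle!!!\n"
)

def count_labels(handle):
    """Count rows per label and fail loudly on malformed rows."""
    counts = Counter()
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(row)}")
        label, _text = row
        counts[label] += 1
    return counts

label_counts = count_labels(io.StringIO(sample_tsv))
print(label_counts)
```

To check a real file, pass `open('dataset/train.tsv', newline='', encoding='utf-8')` instead of the `StringIO` object, adjusting the encoding if your files were written with a different one.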


### Finetune on Sentiment140

In [0]:
!pip install BertLibrary

Collecting BertLibrary
  Downloading https://files.pythonhosted.org/packages/a5/f6/62c112afb62265d980e44db418094e11950a47b79ea8d71d14a2a9c6f6d8/BertLibrary-0.0.4.tar.gz (57kB)
     |████████████████████████████████| 61kB 5.3MB/s 
Building wheels for collected packages: BertLibrary
  Building wheel for BertLibrary (setup.py) ... done
  Created wheel for BertLibrary: filename=BertLibrary-0.0.4-cp36-none-any.whl size=75016 sha256=2cfd3d8337819b16fd0a0424af468e11b4aefa3945e6a5de8a3199a313fe422c
  Stored in directory: /root/.cache/pip/wheels/63/3d/ab/990438ec53e97a0203d2be35ad77fcdcb0750bee7057ddf25f
Successfully built BertLibrary
Installing collected packages: BertLibrary
Successfully installed BertLibrary-0.0.4


In [0]:
from BertLibrary import BertFTModel
import numpy as np





In [0]:
!mkdir output
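ft_model = BertFTModel( model_dir='uncased_L-12_H-768_A-12',  # unzipped pretrained model
                        ckpt_name="bert_model.ckpt",
                        labels=['0','1'],          # labels as written in the .tsv files
                        lr=1e-05,
                        num_train_steps=30000,
                        num_warmup_steps=1000,
                        ckpt_output_dir='output',  # checkpoints are saved here
                        save_check_steps=1000,     # save a checkpoint every 1000 steps
                        do_lower_case=False,       # note: text was already lower-cased in process_text
                        max_seq_len=50,            # length of input_ids per example
                        batch_size=32,
                        )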


ft_trainer =  ft_model.get_trainer()





INFO:tensorflow:Using config: {'_model_dir': 'output', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': device_count {
  key: "GPU"
  value: 1
}
gpu_options {
  per_process_gpu_memory_fraction: 0.5
  allow_growth: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f63a15a2f60>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
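The `lr`, `num_warmup_steps` and `num_train_steps` arguments together describe a learning-rate schedule. The sketch below assumes that BertLibrary follows the schedule of the original BERT reference code, that is, a linear warmup from 0 to `lr` followed by a linear decay to 0 at `num_train_steps`. This is an assumption, not something confirmed by the library, and `bert_style_lr` is a hypothetical helper written only for illustration.

```python
def bert_style_lr(step, init_lr=1e-5, num_train_steps=30000, num_warmup_steps=1000):
    """Learning rate at a global step, assuming linear warmup and linear decay."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 towards init_lr.
        return init_lr * step / num_warmup_steps
    # Decay: shrink linearly so that the rate reaches 0 at num_train_steps.
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)

for step in (0, 500, 1000, 15000, 30000):
    print(step, bert_style_lr(step))
```

Under this assumption the rate peaks near step 1000, just below the configured `1e-05`, and is halved by step 15000.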


In [0]:
ft_trainer.train_from_file('dataset', 35000)




INFO:tensorflow:Writing example 0 of 1147432
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: 1
INFO:tensorflow:tokens: [CLS] entity that is so awesome ! i love art . i love to see it and i love to draw mostly . h ##f now , are you so proud ? i will be back , sorry . [SEP]
INFO:tensorflow:input_ids: 101 9178 2008 2003 2061 12476 999 1045 2293 2396 1012 1045 2293 2000 2156 2009 1998 1045 2293 2000 4009 3262 1012 1044 2546 2085 1010 2024 2017 2061 7098 1029 1045 2097 2022 2067 1010 3374 1012 102 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: 2
INFO:tensorflow:tokens: [CLS] gear ##ing up for the graduation parties over the next few weeks . and none of them are for me [
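The logged example shows the three arrays Bert receives for each tweet: `input_ids` (token ids padded with 0 up to `max_seq_len`), `input_mask` (1 for real tokens, 0 for padding) and `segment_ids` (all 0 for single-sentence inputs). Below is a minimal sketch of how such arrays can be built from a token list. The `TOY_VOCAB` dictionary contains only a few ids copied from the log above, and `build_features` is a hypothetical helper, not the library's own code.

```python
# A few token ids copied from the logged example; a real run uses vocab.txt.
TOY_VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "entity": 9178, "that": 2008, "is": 2003,
    "so": 2061, "awesome": 12476, "!": 999,
}

def build_features(tokens, max_seq_len):
    """Return (input_ids, input_mask, segment_ids), each of length max_seq_len."""
    # Leave room for the [CLS] and [SEP] markers.
    tokens = ["[CLS]"] + tokens[: max_seq_len - 2] + ["[SEP]"]
    input_ids = [TOY_VOCAB[token] for token in tokens]
    input_mask = [1] * len(input_ids)

    # Pad both arrays with zeros up to the fixed length.
    pad_len = max_seq_len - len(input_ids)
    input_ids = input_ids + [TOY_VOCAB["[PAD]"]] * pad_len
    input_mask = input_mask + [0] * pad_len

    # Single-sentence classification: every position belongs to segment 0.
    segment_ids = [0] * max_seq_len
    return input_ids, input_mask, segment_ids

ids, mask, segments = build_features(["entity", "that", "is", "so", "awesome", "!"], 10)
print(ids)
print(mask)
print(segments)
```

With `max_seq_len=50`, as configured in this notebook, the same logic yields arrays of 50 entries like the ones in the log.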

The last evaluation step reports an eval_accuracy of 0.8678125 and a loss of 0.32139838.