# BERT model classification

In [7]:
!git clone -b docker https://github.com/yoheikikuta/bert.git

Cloning into 'bert'...
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 234 (delta 1), reused 3 (delta 1), pack-reused 230
Receiving objects: 100% (234/234), 152.83 KiB | 0 bytes/s, done.
Resolving deltas: 100% (133/133), done.
Checking connectivity... done.


In [8]:
!ls bert/

CONTRIBUTING.md		    modeling.py		  run_pretraining.py
Dockerfile		    modeling_test.py	  run_squad.py
LICENSE			    multilingual.md	  sample_text.txt
README.md		    optimization.py	  tokenization.py
__init__.py		    optimization_test.py  tokenization_test.py
create_pretraining_data.py  requirements.txt	  utils
extract_features.py	    run_classifier.py


In [13]:
!pip3 install -r ./bert/requirements.txt

Collecting tensorflow>=1.11.0 (from -r ./bert/requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/b1/ad/48395de38c1e07bab85fc3bbec045e11ae49c02a4db0100463dd96031947/tensorflow-1.12.0-cp35-cp35m-manylinux1_x86_64.whl (83.1MB)
    100% |################################| 83.1MB 14kB/s  eta 0:00:01
Collecting keras-preprocessing>=1.0.5 (from tensorflow>=1.11.0->-r ./bert/requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/fc/94/74e0fa783d3fc07e41715973435dd051ca89c550881b3454233c39c73e69/Keras_Preprocessing-1.0.5-py2.py3-none-any.whl
Collecting tensorboard<1.13.0,>=1.12.0 (from tensorflow>=1.11.0->-r ./bert/requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/e0/d0/65fe48383146199f16dbd5999ef226b87bce63ad5cd73c840cf722637969/tensorboard-1.12.0-py3-none-any.whl (3.0MB)
    100% |################################| 3.1MB 447kB/s eta 0:00:01
Collecting keras-applications>=1.0.6 (from tensorflow>=1.

### Model and data download

We solve the RTE task from the GLUE benchmark; see https://www.nyu.edu/projects/bowman/glue.pdf for details.

In [15]:
import os

In [16]:
os.makedirs("./bert/model", exist_ok=True)
os.makedirs("./bert/data", exist_ok=True)

In [20]:
!wget -O ./bert/model/uncased_L-12_H-768_A-12.zip https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip

--2018-11-18 03:53:02--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.24.144, 2404:6800:4004:81b::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.24.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: './bert/model/uncased_L-12_H-768_A-12.zip'


2018-11-18 03:53:12 (37.8 MB/s) - './bert/model/uncased_L-12_H-768_A-12.zip' saved [407727028/407727028]



In [24]:
!unzip ./bert/model/uncased_L-12_H-768_A-12.zip -d ./bert/model/ && \
  rm ./bert/model/uncased_L-12_H-768_A-12.zip

Archive:  ./bert/model/uncased_L-12_H-768_A-12.zip
   creating: ./bert/model/uncased_L-12_H-768_A-12/
  inflating: ./bert/model/uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: ./bert/model/uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: ./bert/model/uncased_L-12_H-768_A-12/vocab.txt  
  inflating: ./bert/model/uncased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: ./bert/model/uncased_L-12_H-768_A-12/bert_config.json  


In [27]:
!python3 ./bert/utils/download_glue_data.py --data_dir ./bert/data --tasks RTE

Downloading and extracting MNLI...
	Completed!


### Model fine-tuning

Fine-tuning takes about 3 hours on an `n1-standard-4` instance on GCP Compute Engine (CPU only).

In [2]:
%%time

!python3 ./bert/run_classifier.py \
  --task_name=RTE \
  --do_train=true \
  --do_eval=true \
  --data_dir=./bert/data/RTE \
  --vocab_file=./bert/model/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=./bert/model/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=./bert/model/uncased_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=./bert/tmp/rte_output/

INFO:tensorflow:Using config: {'_num_ps_replicas': 0, '_train_distribute': None, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None), '_keep_checkpoint_max': 5, '_is_chief': True, '_model_dir': './bert/tmp/rte_output/', '_save_summary_steps': 100, '_global_id_in_cluster': 0, '_task_id': 0, '_log_step_count_steps': None, '_protocol': None, '_cluster': None, '_num_worker_replicas': 1, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_device_fn': None, '_save_checkpoints_steps': 1000, '_task_type': 'worker', '_master': '', '_tf_random_seed': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb2b93e50f0>, '_save_checkpoints_secs': None, '_eval_distribute': None, '_experimental_distribute': None, '_keep_chec

INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 2490
INFO:tensorflow:  Batch size = 32
INFO:tensorflow:  Num steps = 233
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (32, 128)
INFO:tensorflow:  name = input_mask, shape = (32, 128)
INFO:tensorflow:  name = label_ids, shape = (32,)
INFO:tensorflow:  name = segment_ids, shape = (32, 128)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (30522, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow: 

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2018-11-18 08:07:59.343665: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./bert/tmp/rte_output/model.ckpt.
INFO:tensorflow:global_step/sec: 0.0230017
INFO:tensorflow:examples/sec: 0.736054
INFO:tensorflow:global_step/sec: 0.0230105
INFO:tensorflow:examples/sec: 0.736337
INFO:tensorflow:Saving checkpoints for 233 into ./bert/tmp/rte_output/model.ckpt.
INFO:tensorflow:Loss for final step: 0.31156892.
INFO:tensorflow:training_loop marked as finished
INFO:tensorflow:Writing example 0 of 277
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: dev-0
INFO:tensorflow:tokens: [CLS] dana reeve , the widow of the actor christopher reeve , has died of 

INFO:tensorflow:***** Running evaluation *****
INFO:tensorflow:  Num examples = 277
INFO:tensorflow:  Batch size = 8
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running eval on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = input_ids, shape = (?, 128)
INFO:tensorflow:  name = input_mask, shape = (?, 128)
INFO:tensorflow:  name = label_ids, shape = (?,)
INFO:tensorflow:  name = segment_ids, shape = (?, 128)
INFO:tensorflow:**** Trainable Variables ****
INFO:tensorflow:  name = bert/embeddings/word_embeddings:0, shape = (30522, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/token_type_embeddings:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/position_embeddings:0, shape = (512, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/embeddings/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-11-18-10:57:23
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./bert/tmp/rte_output/model.ckpt-233
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-11-18-10:59:45
INFO:tensorflow:Saving dict for global step 233: eval_accuracy = 0.6931408, eval_loss = 0.71709377, global_step = 233, loss = 0.71939987
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 233: ./bert/tmp/rte_output/model.ckpt-233
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.6931408
INFO:tensorflow:  eval_loss = 0.71709377
INFO:tensorflow:  global_step = 233
INFO:tensorflow:  loss = 0.71939987
CPU times: user 4min 30s, sys: 33.3 s, total: 5min 4s
Wall time: 2h 52min 15s
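
The `Num steps = 233` line in the training log can be reproduced by hand. The sketch below assumes that the step count is the number of examples divided by the batch size, multiplied by the number of epochs and truncated to an integer, which is our understanding of how `run_classifier.py` derives it. The throughput value is rounded from the `global_step/sec` lines in the log.

```python
# Values copied from the command-line flags and the training log above.
num_examples = 2490
train_batch_size = 32
num_train_epochs = 3.0
steps_per_sec = 0.023  # rounded global_step/sec reported in the log

# Assumed formula: examples / batch size * epochs, truncated to an integer.
num_train_steps = int(num_examples / train_batch_size * num_train_epochs)

# Rough training time implied by the logged throughput.
train_hours = num_train_steps / steps_per_sec / 3600

print(num_train_steps)
print(round(train_hours, 1))
```

The implied training time of roughly 2.8 hours is consistent with the measured wall time of 2h 52min, which also covers feature conversion and evaluation.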


## Creating data for our patent data analysis.

In [1]:
import h5py
import pandas as pd
import numpy as np
import pickle

In [2]:
citations_info_target = pd.read_pickle("../data/citations_info_2000.df.gz")
training_app_df = pd.read_pickle("../data/training_app_1000.df.gz")
testset_app_df = pd.read_pickle("../data/testset_app_1000.df.gz")
grants_target_df = pd.read_pickle("../data/grants_for_2000.df.gz")

In [3]:
citations_info_target.head()

Unnamed: 0,app_id,app_fnm,citation_pat_pgpub_id,parsed,ifw_number,action_type,action_subtype,form892,form1449,citation_in_oa,...,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type
0,13371769,/work/data/apps/2012/ipa120607/F_2322.xml,7391316,7391316,H20LX5QGPXXIFW4,103.0,a,1,0,1,...,1,0,1,0,0,0,0,1,2,0
1,13371769,/work/data/apps/2012/ipa120607/F_2322.xml,6992580,6992580,H20LX5QGPXXIFW4,102.0,a,1,1,1,...,1,0,1,0,0,0,0,1,2,0
2,13371769,/work/data/apps/2012/ipa120607/F_2322.xml,6992580,6992580,H20LX5QGPXXIFW4,103.0,a,1,1,1,...,1,0,1,0,0,0,0,1,2,0
3,13371769,/work/data/apps/2012/ipa120607/F_2322.xml,7774833,7774833,H20LX5QGPXXIFW4,103.0,a,1,1,1,...,1,0,1,0,0,0,0,1,2,0
4,12282000,/work/data/apps/2009/ipa090312/F_1385.xml,7411209,7411209,G9LENRJ8PPOPPY5,102.0,a,0,1,1,...,1,0,0,0,0,1,0,1,1,3


In [4]:
training_app_df.head()

Unnamed: 0,app_id,xml
0,14222691,"<us-patent-application lang=""EN"" dtd-version=""..."
1,12515852,"<us-patent-application lang=""EN"" dtd-version=""..."
2,12033424,"<us-patent-application lang=""EN"" dtd-version=""..."
3,12402344,"<us-patent-application lang=""EN"" dtd-version=""..."
4,12155425,"<us-patent-application lang=""EN"" dtd-version=""..."


In [5]:
import re
CLAIM_PAT = re.compile(r'<claims[^>]*>(.*)</claims>',re.MULTILINE|re.DOTALL)
TAG_PAT = re.compile(r"<.*?>")
LB_PAT = re.compile(r'[\t\n\r\f\v][" "]*')

def whole_xml_to_claim_xml(whole):
    mat = CLAIM_PAT.search(whole)
    return mat.group(1)
def whole_xml_to_claim(whole):
    return TAG_PAT.sub(' ', whole_xml_to_claim_xml(whole))

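def remove_linebreak_from_claim(claim):
    '''
    Delete each line break character together with any spaces (or double
    quotes) that directly follow it. Nothing is inserted in its place, so
    the text on both sides of a line break is joined.
    '''
    return LB_PAT.sub('', claim)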
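
To see what these three patterns do in sequence, here is a minimal sketch on a made-up XML fragment. The fragment `toy_xml` is hypothetical and much simpler than a real application file; the patterns are copied from the cell above.

```python
import re

# Same patterns as in the cell above.
CLAIM_PAT = re.compile(r'<claims[^>]*>(.*)</claims>', re.MULTILINE | re.DOTALL)
TAG_PAT = re.compile(r"<.*?>")
LB_PAT = re.compile(r'[\t\n\r\f\v][" "]*')

# Hypothetical, heavily simplified application XML (not real data).
toy_xml = (
    '<us-patent-application><claims id="claims">\n'
    '<claim><claim-text>1. A widget comprising:\n'
    '  <claim-text>a body.</claim-text></claim-text></claim>\n'
    '</claims></us-patent-application>'
)

claim_xml = CLAIM_PAT.search(toy_xml).group(1)  # text between the <claims> tags
claim_text = TAG_PAT.sub(' ', claim_xml)        # every tag becomes one space
claim_flat = LB_PAT.sub('', claim_text)         # line breaks and indentation are dropped

print(repr(claim_flat.strip()))
```

Because line breaks are deleted rather than replaced by a space, `comprising:` and `a body.` end up joined. The same effect is visible in the real claims (for example `comprising:an upper arm`).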

In [6]:
training_app_df["claim"] = training_app_df["xml"].map(whole_xml_to_claim).map(remove_linebreak_from_claim)
testset_app_df["claim"] = testset_app_df["xml"].map(whole_xml_to_claim).map(remove_linebreak_from_claim)
grants_target_df["claim"] = grants_target_df["xml"].map(whole_xml_to_claim).map(remove_linebreak_from_claim)

In [7]:
training_app_df["claim"][0]

'1 . A terminal comprising:an upper arm having a top surface for a mating area; a lower arm paralleled with the upper arm and having a bottom surface soldering area; and a connecting arm connected with the upper arm and the lower arm. 2 . The terminal as recited in  claim 1 , wherein the whole terminal is structured in a folded manner with a tiny gap therebetween in a vertical direction, and one of said upper arm and said lower arm forms a projection in said gap to abut against the other in a vertical direction. 3 . The terminal as recited in  claim 2 , wherein said one of the upper arm and the lower arm forms a recess corresponding to the projection in said vertical direction. 4 . The terminal as recited in  claim 1 , wherein the upper arm defines a convex plate formed on the top surface thereof and having a top surface, and the mating area is the top surface of the convex plate. 5 . The terminal as recited in  claim 4 , wherein the upper arm defines a recess in a bottom surface there

In [8]:
len( str(training_app_df["claim"][0]).split(" ") )

841

In [9]:
citations_info_target.head()

Unnamed: 0,app_id,app_fnm,citation_pat_pgpub_id,parsed,ifw_number,action_type,action_subtype,form892,form1449,citation_in_oa,...,rejection_103,rejection_112,rejection_dp,objection,allowed_claims,cite102_gt1,cite103_gt3,cite103_eq1,cite103_max,signature_type
0,13371769,/work/data/apps/2012/ipa120607/F_2322.xml,7391316,7391316,H20LX5QGPXXIFW4,103.0,a,1,0,1,...,1,0,1,0,0,0,0,1,2,0
1,13371769,/work/data/apps/2012/ipa120607/F_2322.xml,6992580,6992580,H20LX5QGPXXIFW4,102.0,a,1,1,1,...,1,0,1,0,0,0,0,1,2,0
2,13371769,/work/data/apps/2012/ipa120607/F_2322.xml,6992580,6992580,H20LX5QGPXXIFW4,103.0,a,1,1,1,...,1,0,1,0,0,0,0,1,2,0
3,13371769,/work/data/apps/2012/ipa120607/F_2322.xml,7774833,7774833,H20LX5QGPXXIFW4,103.0,a,1,1,1,...,1,0,1,0,0,0,0,1,2,0
4,12282000,/work/data/apps/2009/ipa090312/F_1385.xml,7411209,7411209,G9LENRJ8PPOPPY5,102.0,a,0,1,1,...,1,0,0,0,0,1,0,1,1,3


The dev set in the BERT repository corresponds to the test set in our case.  
The dev set includes label information and is not used in training.  
(The test set in BERT does not include answer labels.)

The data creation procedure is as follows:
- connect each app_id with its cited grant numbers
- keep the columns [app_id, claim, parsed]
- drop duplicates (duplicates can exist because of different action types, etc.)
- label every remaining pair as `cited` (the positive class)
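
The steps above can be sketched in plain Python on a few made-up records. This is only an illustration of why duplicates appear and how they are removed; the record values are hypothetical, and the notebook itself performs these steps with pandas in the cells that follow.

```python
# Hypothetical toy records; the real notebook works on pandas dataframes.
apps = [
    {"app_id": 1, "claim": "1. A widget."},
    {"app_id": 2, "claim": "1. A gadget."},
]
citations = [
    {"app_id": 1, "parsed": 700, "action_type": 102.0},
    {"app_id": 1, "parsed": 700, "action_type": 103.0},  # same pair, other action type
    {"app_id": 1, "parsed": 701, "action_type": 103.0},
    {"app_id": 2, "parsed": 702, "action_type": 102.0},
]

claim_by_app = {a["app_id"]: a["claim"] for a in apps}

# Steps 1-2: connect app_id with the cited grant number, keep (app_id, claim, parsed).
rows = [
    (c["app_id"], claim_by_app[c["app_id"]], c["parsed"])
    for c in citations
    if c["app_id"] in claim_by_app
]

# Step 3: drop duplicates, keeping the first occurrence and the original order.
unique_rows = list(dict.fromkeys(rows))

# Step 4: mark every remaining pair as a positive ("cited") example.
positives = [row + ("cited",) for row in unique_rows]

print(len(rows), len(unique_rows))
```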

In [10]:
train_data_for_bert = pd.merge(training_app_df, citations_info_target, on='app_id')[['app_id', 'claim', 'parsed']]
dev_data_for_bert = pd.merge(testset_app_df, citations_info_target, on='app_id')[['app_id', 'claim', 'parsed']]

In [11]:
train_data_for_bert.head()

Unnamed: 0,app_id,claim,parsed
0,14222691,1 . A terminal comprising:an upper arm having ...,8179692
1,14222691,1 . A terminal comprising:an upper arm having ...,8179692
2,14222691,1 . A terminal comprising:an upper arm having ...,8206188
3,14222691,1 . A terminal comprising:an upper arm having ...,8206188
4,14222691,1 . A terminal comprising:an upper arm having ...,8177561


In [12]:
print( len(train_data_for_bert) )
print( len(dev_data_for_bert) )

2120
2059


In [13]:
train_data_for_bert = train_data_for_bert.drop_duplicates(keep='first').reset_index(drop=True)
dev_data_for_bert = dev_data_for_bert.drop_duplicates(keep='first').reset_index(drop=True)

In [14]:
print( len(train_data_for_bert) )
print( len(dev_data_for_bert) )

1282
1251


In [15]:
train_data_for_bert['label'] = "cited"
dev_data_for_bert['label'] = "cited"

In [16]:
train_data_for_bert.head()

Unnamed: 0,app_id,claim,parsed,label
0,14222691,1 . A terminal comprising:an upper arm having ...,8179692,cited
1,14222691,1 . A terminal comprising:an upper arm having ...,8206188,cited
2,14222691,1 . A terminal comprising:an upper arm having ...,8177561,cited
3,12515852,1 . A method for increasing seed yield in plan...,7235710,cited
4,12033424,"1 . An image forming apparatus, comprising:an ...",6950953,cited


In [17]:
train_data_for_bert = train_data_for_bert.merge(grants_target_df, how='inner', on='parsed')
train_data_for_bert = train_data_for_bert.drop("xml", axis=1)

dev_data_for_bert = dev_data_for_bert.merge(grants_target_df, how='inner', on='parsed')
dev_data_for_bert = dev_data_for_bert.drop("xml", axis=1)

In [18]:
train_data_for_bert.head()

Unnamed: 0,app_id,claim_x,parsed,label,claim_y
0,14222691,1 . A terminal comprising:an upper arm having ...,8179692,cited,"1. A board, comprising:a board body; a first c..."
1,14222691,1 . A terminal comprising:an upper arm having ...,8206188,cited,1. A connector terminal curved from a strip-sh...
2,14222691,1 . A terminal comprising:an upper arm having ...,8177561,cited,1. A socket contact terminal for electrical co...
3,12515852,1 . A method for increasing seed yield in plan...,7235710,cited,1. A method for expressing in a non-monocotyle...
4,12033424,"1 . An image forming apparatus, comprising:an ...",6950953,cited,"1. A multifunctional printer, comprising:a mai..."


In [19]:
def pick_up_unsited_grants(df, app_id, n=1, random_state=23):
    '''
    Randomly pick a grant that is paired with a different app_id, to be used
    as an uncited (negative) sample for the given app_id.
    Only the first sampled row is returned, even when n > 1.
    '''
    n_rows = df[df['app_id'] != app_id].sample(n=n, random_state=random_state)

    return [n_rows['parsed'].values[0], "not_cited", n_rows['claim_y'].values[0]]
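
The same negative-sampling idea can be sketched without pandas. The pairs below are hypothetical, and `pick_uncited_grant` is an illustrative helper, not a function from this notebook.

```python
import random

# Hypothetical positive pairs: (app_id, cited grant number).
positive_pairs = [(1, 700), (1, 701), (2, 702), (3, 703)]


def pick_uncited_grant(pairs, app_id, seed):
    """Return a grant number taken from a pair that belongs to another application."""
    candidates = [grant for other_app, grant in pairs if other_app != app_id]
    return random.Random(seed).choice(candidates)


# One negative per positive pair, with a different seed for each row.
negatives = [
    (app_id, pick_uncited_grant(positive_pairs, app_id, seed=23 + idx), "not_cited")
    for idx, (app_id, _) in enumerate(positive_pairs)
]

print(negatives)
```

Like the notebook's function, this only excludes rows of the same application. If a grant is cited by both the target application and another one, it could still be drawn as a negative; we assume this is rare at the current data size.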

In [20]:
seed = 23

train_non_cited_data = pd.DataFrame([
    [app_id, claimx] + pick_up_unsited_grants(train_data_for_bert, app_id, random_state=seed+idx)
    for idx, (app_id, claimx)
    in enumerate(zip(train_data_for_bert['app_id'], train_data_for_bert['claim_x']))
])

train_non_cited_data.columns = train_data_for_bert.columns

In [21]:
train_non_cited_data.head()

Unnamed: 0,app_id,claim_x,parsed,label,claim_y
0,14222691,1 . A terminal comprising:an upper arm having ...,7137410,not_cited,"1. A mixing valve having an exterior cover, sa..."
1,14222691,1 . A terminal comprising:an upper arm having ...,7419473,not_cited,1. A living body inspection apparatus comprisi...
2,14222691,1 . A terminal comprising:an upper arm having ...,7789044,not_cited,1. A collapsible pet carrier comprising:a tubu...
3,12515852,1 . A method for increasing seed yield in plan...,7702451,not_cited,1. A programmable engines-start system compris...
4,12033424,"1 . An image forming apparatus, comprising:an ...",8133762,not_cited,"1. A method of making a semiconductor device, ..."


In [22]:
seed = 23

dev_non_cited_data = pd.DataFrame([
    [app_id, claimx] + pick_up_unsited_grants(dev_data_for_bert, app_id, random_state=seed+idx)
    for idx, (app_id, claimx)
    in enumerate(zip(dev_data_for_bert['app_id'], dev_data_for_bert['claim_x']))
])

dev_non_cited_data.columns = dev_data_for_bert.columns

In [23]:
dev_non_cited_data.head()

Unnamed: 0,app_id,claim_x,parsed,label,claim_y
0,14307191,"1 . A method to aggregate, filter, and share e...",7729924,not_cited,1. A virtual knowledge management system using...
1,13137006,"1 . A display apparatus, comprising:a position...",8058137,not_cited,1. A method of manufacturing a semiconductor w...
2,12741959,1 - 33 . (canceled) 34 . A compound comprising...,7124864,not_cited,1. A gas assist strut and coupling member for ...
3,12643447,1 . A terminal fitting formed by bending an el...,6979130,not_cited,1. A bearing device for rotatably receiving a ...
4,14200253,1 . A printer for printing a three-dimensional...,6915265,not_cited,1. An integrated health care system for collec...


In [24]:
train_data_for_bert = pd.concat([train_data_for_bert, train_non_cited_data]).reset_index(drop=True)
dev_data_for_bert = pd.concat([dev_data_for_bert, dev_non_cited_data]).reset_index(drop=True)

In [25]:
train_data_for_bert.head()

Unnamed: 0,app_id,claim_x,parsed,label,claim_y
0,14222691,1 . A terminal comprising:an upper arm having ...,8179692,cited,"1. A board, comprising:a board body; a first c..."
1,14222691,1 . A terminal comprising:an upper arm having ...,8206188,cited,1. A connector terminal curved from a strip-sh...
2,14222691,1 . A terminal comprising:an upper arm having ...,8177561,cited,1. A socket contact terminal for electrical co...
3,12515852,1 . A method for increasing seed yield in plan...,7235710,cited,1. A method for expressing in a non-monocotyle...
4,12033424,"1 . An image forming apparatus, comprising:an ...",6950953,cited,"1. A multifunctional printer, comprising:a mai..."


In [26]:
train_data_for_bert['index'] = train_data_for_bert.index
dev_data_for_bert['index'] = dev_data_for_bert.index

In [27]:
train_data_for_bert = train_data_for_bert.drop("app_id", axis=1)
train_data_for_bert = train_data_for_bert.drop("parsed", axis=1)

dev_data_for_bert = dev_data_for_bert.drop("app_id", axis=1)
dev_data_for_bert = dev_data_for_bert.drop("parsed", axis=1)

In [28]:
train_data_for_bert.head()

Unnamed: 0,claim_x,label,claim_y,index
0,1 . A terminal comprising:an upper arm having ...,cited,"1. A board, comprising:a board body; a first c...",0
1,1 . A terminal comprising:an upper arm having ...,cited,1. A connector terminal curved from a strip-sh...,1
2,1 . A terminal comprising:an upper arm having ...,cited,1. A socket contact terminal for electrical co...,2
3,1 . A method for increasing seed yield in plan...,cited,1. A method for expressing in a non-monocotyle...,3
4,"1 . An image forming apparatus, comprising:an ...",cited,"1. A multifunctional printer, comprising:a mai...",4


In [29]:
train_data_for_bert = train_data_for_bert.loc[:, ['index', 'claim_x', 'claim_y', 'label']]
dev_data_for_bert = dev_data_for_bert.loc[:, ['index', 'claim_x', 'claim_y', 'label']]

In [30]:
train_data_for_bert.columns = ['index', 'claim_app', 'claim_cited_grant', 'label']
dev_data_for_bert.columns = ['index', 'claim_app', 'claim_cited_grant', 'label']

In [31]:
train_data_for_bert.head()

Unnamed: 0,index,claim_app,claim_cited_grant,label
0,0,1 . A terminal comprising:an upper arm having ...,"1. A board, comprising:a board body; a first c...",cited
1,1,1 . A terminal comprising:an upper arm having ...,1. A connector terminal curved from a strip-sh...,cited
2,2,1 . A terminal comprising:an upper arm having ...,1. A socket contact terminal for electrical co...,cited
3,3,1 . A method for increasing seed yield in plan...,1. A method for expressing in a non-monocotyle...,cited
4,4,"1 . An image forming apparatus, comprising:an ...","1. A multifunctional printer, comprising:a mai...",cited


In [32]:
dev_data_for_bert.head()

Unnamed: 0,index,claim_app,claim_cited_grant,label
0,0,"1 . A method to aggregate, filter, and share e...",1. A method for detecting moving objects with ...,cited
1,1,"1 . A display apparatus, comprising:a position...",1. A viewpoint position detecting apparatus fo...,cited
2,2,1 - 33 . (canceled) 34 . A compound comprising...,"1. A double-stranded ribonucleic acid (dsRNA),...",cited
3,3,1 . A terminal fitting formed by bending an el...,1. A female terminal fitting comprising:a subs...,cited
4,4,1 . A printer for printing a three-dimensional...,1. A method of generating an object assembled ...,cited


Save the resulting dataframes as tab-separated files.  
Manually upload the datasets to Google Cloud Storage.

Rename the labels to match those of the RTE dataset.

In [33]:
train_data_for_bert['label'] = train_data_for_bert['label'].str.replace("not_cited", "not_entailment")
train_data_for_bert['label'] = train_data_for_bert['label'].str.replace("cited", "entailment")

dev_data_for_bert['label'] = dev_data_for_bert['label'].str.replace("not_cited", "not_entailment")
dev_data_for_bert['label'] = dev_data_for_bert['label'].str.replace("cited", "entailment")
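
As a cross-check of the renaming above, here is a small sketch that uses an explicit mapping instead of substring replacement. The `labels` list is a hypothetical stand-in for the label column.

```python
# Explicit mapping from the notebook's labels to the RTE label names.
LABEL_MAP = {"cited": "entailment", "not_cited": "not_entailment"}

labels = ["cited", "not_cited", "cited"]  # hypothetical label column
rte_labels = [LABEL_MAP[label] for label in labels]

# The chained substring replacements used above give the same result.
chained = [
    label.replace("not_cited", "not_entailment").replace("cited", "entailment")
    for label in labels
]

print(rte_labels)
print(chained == rte_labels)
```

With a dictionary, a misspelled or unexpected label raises a `KeyError` instead of passing through unchanged, which may be preferable when more label values are added later.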

In [34]:
train_data_for_bert = train_data_for_bert.sample(frac=1, random_state=seed).reset_index(drop=True)
dev_data_for_bert = dev_data_for_bert.sample(frac=1, random_state=seed).reset_index(drop=True)

In [37]:
train_data_for_bert['index'] = train_data_for_bert.index
dev_data_for_bert['index'] = dev_data_for_bert.index

In [39]:
train_data_for_bert.to_csv("../data/bert_train_1000.tsv", index=False, sep='\t', header=True)
dev_data_for_bert.to_csv("../data/bert_dev_1000.tsv", index=False, sep='\t', header=True)
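
Before uploading, it can be worth checking that every row splits into exactly four tab-separated fields. The sketch below runs on a hypothetical in-memory sample in the same column layout. It assumes the fine-tuning script splits on tabs without any quote handling, which is our understanding of BERT's TSV reader but is not verified here.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as bert_train_1000.tsv.
tsv_text = (
    "index\tclaim_app\tclaim_cited_grant\tlabel\n"
    "0\t1 . A widget comprising:a body.\t1. A gadget.\tentailment\n"
    "1\t1 . A widget comprising:a body.\t1. A valve.\tnot_entailment\n"
)

# Assumption: the downstream reader splits on tabs and ignores quoting.
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
header, body = rows[0], rows[1:]

# Rows whose field count differs from the header would break the column layout.
bad_rows = [row for row in body if len(row) != len(header)]
labels = {row[-1] for row in body}

print(header)
print(len(bad_rows), sorted(labels))
```

For the real files, `io.StringIO(tsv_text)` would be replaced by an opened file handle.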

## Train a model.

Use Colab because it provides TPU acceleration.

### Train a LightGBM model for comparison.

In [2]:
!pip3 install lightgbm

Collecting lightgbm
  Downloading https://files.pythonhosted.org/packages/4c/3b/4ae113193b4ee01387ed76d5eea32788aec0589df9ae7378a8b7443eaa8b/lightgbm-2.2.2-py2.py3-none-manylinux1_x86_64.whl (1.2MB)
    100% |################################| 1.2MB 1.0MB/s eta 0:00:01
Installing collected packages: lightgbm
Successfully installed lightgbm-2.2.2
You are using pip version 8.1.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [1]:
import pandas as pd
import numpy as np

In [27]:
train_data = pd.read_csv("../data/bert_train_1000.tsv", sep="\t")
test_data = pd.read_csv("../data/bert_dev_1000.tsv", sep="\t")

In [28]:
train_data.head()

Unnamed: 0,index,claim_app,claim_cited_grant,label
0,0,1 . A process comprising the following steps:(...,"1. A liquid supply apparatus, comprising:a wal...",not_entailment
1,1,1 - 10 . (canceled) 11 . A method for open-loo...,"1. A fuel supply apparatus for an engine, comp...",entailment
2,2,1 . A handpiece for treating biological tissue...,1. A method for irradiating tissue having abso...,entailment
3,3,1 . A power cable comprising:a power input com...,1. A temperature regulating system for a vehic...,not_entailment
4,4,1 . A cutting insert having a substantially cu...,1. A toolholder comprising:a) a cutter body ro...,entailment


Create features using TF-IDF vectors.

The raw text for each pair is the simple concatenation [claim_app] + [claim_cited_grant].
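
For reference, the weighting can be written out explicitly. The formula below assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, raw term counts), which is our reading of the library documentation rather than something this notebook states. Here $n$ is the number of training documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^{2}}}
```

Each row of the feature matrix therefore has unit Euclidean length, so long and short claim pairs are put on a comparable scale.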

In [29]:
import lightgbm as lgb
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
import random
random.seed(23)

In [31]:
vectorizer = TfidfVectorizer(stop_words='english', min_df=2, max_df=0.8)

In [32]:
train_claim_text = [
    sentence_1 + sentence_2 
    for sentence_1, sentence_2 
    in zip(train_data['claim_app'], train_data['claim_cited_grant'])
]


test_claim_text = [
    sentence_1 + sentence_2 
    for sentence_1, sentence_2 
    in zip(test_data['claim_app'], test_data['claim_cited_grant'])
]

In [33]:
%%time

train_x = vectorizer.fit_transform(train_claim_text)
train_y = [ 1 if elem == 'entailment' else 0 for elem in train_data['label'] ] 

In [35]:
train_x.shape

(2564, 17208)

In [37]:
%%time

test_x = vectorizer.transform(test_claim_text)
test_y = [ 1 if elem == 'entailment' else 0 for elem in test_data['label'] ] 

CPU times: user 3.7 s, sys: 7.34 ms, total: 3.71 s
Wall time: 3.71 s


In [76]:
test_x.shape

(2502, 17208)

Create a dataset for LightGBM and train a model.

In [39]:
lgb_train = lgb.Dataset(train_x, train_y)

In [111]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 50,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_weight': 2,
    'gamma': 0.2,  # XGBoost-style name; LightGBM's counterpart is min_split_gain, so this entry may be ignored
    'verbose': 0
}

In [127]:
%%time

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=40,
                valid_sets=lgb_train)

[1]	training's binary_logloss: 0.683679
[2]	training's binary_logloss: 0.674582
[3]	training's binary_logloss: 0.666456
[4]	training's binary_logloss: 0.657978
[5]	training's binary_logloss: 0.65061
[6]	training's binary_logloss: 0.640629
[7]	training's binary_logloss: 0.63159
[8]	training's binary_logloss: 0.622584
[9]	training's binary_logloss: 0.614358
[10]	training's binary_logloss: 0.606217
[11]	training's binary_logloss: 0.598234
[12]	training's binary_logloss: 0.590287
[13]	training's binary_logloss: 0.582836
[14]	training's binary_logloss: 0.575774
[15]	training's binary_logloss: 0.56923
[16]	training's binary_logloss: 0.562423
[17]	training's binary_logloss: 0.555749
[18]	training's binary_logloss: 0.549391
[19]	training's binary_logloss: 0.542548
[20]	training's binary_logloss: 0.536594
[21]	training's binary_logloss: 0.531169
[22]	training's binary_logloss: 0.524756
[23]	training's binary_logloss: 0.518757
[24]	training's binary_logloss: 0.513114
[25]	training's binary_loglo

Evaluate the trained model.

In [128]:
predict_prob = gbm.predict(test_x)

In [129]:
predict_label = [ 1 if elem >= 0.5 else 0 for elem in predict_prob]

In [130]:
acc = sum( np.array(predict_label) == np.array(test_y) ) / len(predict_label)

In [131]:
print("accuracy: {}".format(acc))

accuracy: 0.6622701838529177
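
To put this number in context, the sketch below derives the chance-level baseline and the number of correct predictions from the figures reported above. It relies on the fact that one negative pair was generated for every positive pair, so half of the test rows are positives.

```python
# Figures reported by the cells above.
n_test = 2502                      # rows in test_x
accuracy = 0.6622701838529177      # printed accuracy

# One negative pair was generated per positive pair, so the classes are balanced.
n_positive = n_test // 2
n_negative = n_test - n_positive
majority_baseline = max(n_positive, n_negative) / n_test

# Number of correctly classified pairs implied by the reported accuracy.
n_correct = round(accuracy * n_test)

print(majority_baseline)
print(n_correct)
```

The model classifies 1657 of the 2502 pairs correctly, compared with the 1251 that always predicting one class would get right.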


An accuracy of about 0.66 on a balanced test set (chance level: 0.5) suggests that this problem is solvable, although the accuracy is not high.