## Introduction


**BERT** stands for **B**idirectional **E**ncoder **R**epresentations from **T**ransformers.

It builds on a number of clever ideas from earlier work:

1. **Semi-supervised Sequence Learning** (by Andrew Dai and Quoc Le),
2. **ELMo** (by Matthew Peters and researchers from AI2 and UW CSE),
3. **ULMFiT** (by fast.ai founder Jeremy Howard and Sebastian Ruder),
4. The **OpenAI** transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever),
5. The **Transformer** (Vaswani et al.).

### Overall components

![alt text](https://jalammar.github.io/images/transformer-ber-ulmfit-elmo.png)

### High-level learning architecture
![alt text](https://jalammar.github.io/images/bert-transfer-learning.png)

### Example: Sentence Classification

![alt text](https://jalammar.github.io/images/BERT-classification-spam.png)

The paper presents two model sizes:

1. BERT BASE – Comparable in size to the OpenAI Transformer, so that performance can be compared
2. BERT LARGE – A ridiculously huge model that achieved the state-of-the-art results reported in the paper


![alt text](https://jalammar.github.io/images/bert-base-bert-large.png)
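To get a feel for the size gap between the two models, the sketch below estimates the encoder parameter count from the layer count and hidden size. The configuration values are taken from the BERT paper (12 layers with hidden size 768 for BASE, 24 layers with hidden size 1024 for LARGE). The function name `bert_param_count` is hypothetical, and the vocabulary size of 30522 is an assumption that matches the uncased English vocabulary; the cased model used later in this notebook has a smaller vocabulary, so its count differs slightly.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (no task-specific head).

    vocab_size=30522 is an assumption (uncased English vocabulary).
    """
    ffn = 4 * hidden  # feed-forward inner size is 4x the hidden size

    # Token, position and segment embeddings, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: two linear layers with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per encoder layer, each with scale + bias.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT BASE:  {base:,} parameters")   # 109,482,240
print(f"BERT LARGE: {large:,} parameters")  # 335,141,888
```

These figures line up with the roughly 110M and 340M parameters quoted in the paper. The number of attention heads (12 and 16) does not appear in the formula because splitting the hidden dimension across heads does not change the weight count.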

### ELMo: Context Matters

**Embeddings from Language Models (ELMo)**

If we’re using a static **GloVe** representation, then the word “**stick**” is represented by the same vector no matter what the context is.

But “stick” has multiple meanings depending on where it’s used.

Why not give it an embedding based on the context it’s used in, to capture both the word’s meaning in that context and other contextual information? And so, **contextualized word-embeddings** were born.

![alt text](https://jalammar.github.io/images/elmo-embedding-robin-williams.png)

### How is ELMo different from other word embeddings?

Unlike traditional word embeddings such as word2vec and GloVe, the ELMo vector assigned to a token or word is a function of the entire sentence containing that word. Therefore, the same word can have different word vectors in different contexts.

Suppose we have a couple of sentences:

1. I **read** the book yesterday.
2. Can you **read** the letter now?

Take a moment to ponder the difference between these two. The verb “read” in the first sentence is in the past tense, while the same verb is in the present tense in the second sentence. This is treated here as a case of **polysemy**, where one word form can carry multiple meanings or senses.

ELMo word vectors address this issue. ELMo takes the entire input sentence into account when calculating the word embeddings. Hence, the term “read” gets different ELMo vectors in different contexts.
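The difference between a static lookup and a context-dependent embedding can be illustrated with a toy sketch. This is not ELMo (which uses a bidirectional LSTM language model); the vectors in `STATIC` are made-up values and `contextual_embed` is a hypothetical stand-in that simply mixes each word vector with the sentence mean. It only shows the property discussed above: the same word gets one vector under a static table, but different vectors once the sentence is taken into account.

```python
# Toy static table: one fixed vector per word (made-up values, not real GloVe).
STATIC = {
    "i": [0.1, 0.0], "read": [0.5, 0.5], "the": [0.0, 0.1],
    "book": [0.9, 0.2], "yesterday": [0.2, 0.9],
    "can": [0.3, 0.1], "you": [0.1, 0.3], "letter": [0.8, 0.3],
    "now": [0.4, 0.7],
}


def static_embed(tokens):
    """Look up each token independently of its neighbours."""
    return [STATIC[t] for t in tokens]


def contextual_embed(tokens):
    """Toy contextual embedding: blend each word vector with the sentence mean."""
    vectors = static_embed(tokens)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[0.5 * v[d] + 0.5 * mean[d] for d in range(dim)] for v in vectors]


s1 = "i read the book yesterday".split()
s2 = "can you read the letter now".split()

static_1 = static_embed(s1)[s1.index("read")]
static_2 = static_embed(s2)[s2.index("read")]
ctx_1 = contextual_embed(s1)[s1.index("read")]
ctx_2 = contextual_embed(s2)[s2.index("read")]

print("static vectors equal:    ", static_1 == static_2)  # True
print("contextual vectors equal:", ctx_1 == ctx_2)        # False
```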

### ULMFiT: Transfer Learning in NLP

1. ULMFiT introduced methods to make effective use of much of what the model learns during pre-training.

2. It introduced a language model and a process to effectively fine-tune that language model for various tasks.



### Parallels with Convolutional Neural Networks

![alt text](https://jalammar.github.io/images/vgg-net-classifier.png)

## KAGGLE dataset implementation

In [27]:
!git clone https://github.com/kamalkraj/BERT-NER.git

Cloning into 'BERT-NER'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 102 (delta 1), reused 0 (delta 0), pack-reused 96
Receiving objects: 100% (102/102), 987.90 KiB | 1.35 MiB/s, done.
Resolving deltas: 100% (51/51), done.


In [28]:
pwd

'/content'

In [29]:
cd BERT-NER

/content/BERT-NER


In [30]:
!pip install -r requirements.txt



In [31]:
!python run_ner.py --data_dir=data/ --bert_model=bert-base-cased --task_name=ner --output_dir=out --max_seq_length=128 --do_train --num_train_epochs 5 --do_eval --warmup_proportion=0.4

Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
07/02/2019 07:49:19 - INFO - __main__ -   device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
07/02/2019 07:49:20 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /root/.pytorch_pretrained_bert/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
07/02/2019 07:49:21 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz from cache at /root/.pytorch_pretrained_bert/distributed_-1/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c
07/02/2019 07:49:21 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file /ro

**Download the 'out' folder to your local machine and follow the steps below for prediction.**

In [33]:
pwd

'/content/BERT-NER'

In [0]:
from bert import Ner

In [0]:
model = Ner("out/")

In [0]:
output = model.predict("Steve went to Paris")
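The return format of `model.predict` is not shown above, so the following sketch makes an assumption: that the prediction can be reduced to a list of `(word, tag)` pairs with BIO tags. The helper `group_entities` and the sample tags are hypothetical; they only illustrate how per-token BIO tags might be merged into entity spans for downstream use.

```python
def group_entities(tagged_tokens):
    """Merge BIO-tagged (word, tag) pairs into (entity_text, label) spans."""
    entities = []
    words, label = [], None
    for word, tag in tagged_tokens:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close the previous entity, if any, and open a new one.
            if words:
                entities.append((" ".join(words), label))
            words, label = [word], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current entity.
            words.append(word)
        else:
            # "O" tag: close any open entity.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hypothetical tags for the example sentence; not actual model output.
tagged = [("Steve", "B-PER"), ("went", "O"), ("to", "O"), ("Paris", "B-LOC")]
print(group_entities(tagged))  # [('Steve', 'PER'), ('Paris', 'LOC')]
```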

## Comparison with spaCy

**BERT vs. spaCy**

---


**Notation:**
1. spaCy supports both the BIO and BILUO schemes.
2. BERT supports only the BIO scheme.

**BILUO** --> **B**egin **I**n **L**ast **U**nit **O**ut

The **BIO** scheme is more difficult to learn than **BILUO**, because **BILUO** explicitly marks the boundary tokens of each entity.
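The relationship between the two schemes can be made concrete with a small converter. This is a minimal sketch, assuming a well-formed BIO sequence as input; the function name `bio_to_biluo` and the sample tags are illustrative, not part of spaCy or BERT-NER. It shows the extra boundary information BILUO carries: the last token of a multi-token entity becomes `L-`, and a single-token entity becomes `U-`.

```python
def bio_to_biluo(tags):
    """Convert a well-formed BIO tag sequence to BILUO."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            biluo.append(f"B-{label}" if continues else f"U-{label}")
        else:  # prefix == "I"
            biluo.append(f"I-{label}" if continues else f"L-{label}")
    return biluo


# "Steve went to New York City" with illustrative BIO tags.
bio = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"]
print(bio_to_biluo(bio))
# ['U-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC']
```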

**Tagging:**
1. spaCy supports the **Sequence Tagging Task** and other tagging approaches.
2. BERT does **NOT** support this yet; it is still in the development phase. [as per current reading]

**Grammar:**
1. BERT predictions are good when the sentence is grammatically correct.

---


**Coverage and Accuracy**

1. We integrated the BERT model into the current NLP project, replacing spaCy.
2. For comparison, we ran 30 truth-data-specific WITS with spaCy and with BERT in turn.

Component | Coverage | Accuracy | FPs
--- | --- | --- | ---
**spaCy** (ORG) | 51.79 | 71.06 | 0.34
**BERT** (ORG) | 55.55 | 59.84 | 0.38
**BERT** (ORG, LOC, PERSON) | 71.61 | 61.53 | 0.36


**Text.AI Conclusion**

1. Even though the BERT model was not trained on project-specific truth data, it showed competitive numbers against spaCy: higher coverage on ORG (55.55 vs. 51.79), though lower accuracy (59.84 vs. 71.06).
2. These numbers are expected to **increase** once the model is trained on project-specific truth data.
3. With project-specific truth data, BERT on a local CPU takes too much time. So its 

---




