# Twitter NER with Neural Nets

#### Jovana Urosevic & Kenny Lino
##### March 15, 2018

## Overview
#### 1.  Task and Motivation
#### 2.  Data
#### 3.  Baseline
#### 4.  Experiment
#### 5.  Results
#### 6.  Conclusion and Future work

## Motivation

Named entity recognition (NER) is the task of isolating mentions of specific kinds of people, places, organisations and things in unstructured text. Tools such as Stanford CoreNLP do this well for formal, well-edited text such as newspaper articles. Traditionally, most effective NER approaches have been based on machine learning techniques such as conditional random fields (CRFs) and support vector machines (SVMs).

### What is NER?

![pic](pic.png)

However, much of the text we need to process comes from social media, in particular Twitter. Tweets are full of informal language, misspellings, abbreviations, hashtags, @-mentions, URLs, and unreliable capitalization and punctuation. Users can talk about anything and everything on Twitter, and entities that were rarely or never mentioned before may suddenly become popular. Tweets are also short, since each tweet is restricted to 140 characters. All these factors present huge challenges for general-purpose NER systems that were not designed for this type of text.

#### Why NER on Twitter?

![pic1](1.png) 

![pic2](2.png)

#### Shared Task - COLING 2016 Workshop on Noisy User-generated Text (WNUT):

-  Segmentation and categorisation

-  Segmentation only

## Data

Represented in the CoNLL data format: one token per line, followed by its label

|Token      |Label|
|-------------|:---------:|
|It |O|
|sucks |O|
|not|O|
|to|O|
|be|O|
|in|O|
|Disney |**B-facility**|
|world|**I-facility**|
|. |O|
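
A minimal sketch of how data in this layout could be read, assuming each line holds a token and its label separated by whitespace and that a blank line separates tweets. The function name `read_conll` and the inline sample are illustrative, not part of the shared task tooling.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token label' lines into sentences, splitting on blank lines."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current tweet (assumed separator).
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last run of whitespace: token on the left, label on the right.
        token, label = line.rsplit(None, 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


sample = """It\tO
sucks\tO
not\tO
to\tO
be\tO
in\tO
Disney\tB-facility
world\tI-facility
.\tO
"""

sentences = read_conll(sample.splitlines())
print(sentences[0][6:8])
```

Each tweet becomes a list of `(token, label)` pairs, which is a convenient input shape for both a CRF and a biLSTM tagger.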

## Data distribution

The Twitter NER shared task datasets consist of a training set, a development set and a test set. The shared task focuses on finding 10 types of target entities: company, facility, geo-location, movie, music artist, other, person, product, sports team and TV show. It is divided into two sub-tasks: ‘segmentation only’ and ‘segmentation and categorisation’. The former requires only the boundaries of entities, while the latter requires both the boundaries and the correct entity types.

- Annotation: 10 fine-grained NER categories
- 21 tags: a `B-` and an `I-` tag for each category, plus `O`

The number of entities of each label type in each set:

|Label type|Train|Dev|Test|
|-------------|---------|------|--------|
|company|171|39|621|
|facility|104|38|253|
|geo-loc|276|116|882|
|movie|34|15|34|
|music artist|55|41|191|
|other|225|132|584|
|person|449|171|482|
|product|97|37|246|
|sports team|51|70|147|
|tv show|34|2|33|
|Total|1496|1420|3473|
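
The count of 21 tags follows from the BIO scheme: a begin and an inside tag for each of the 10 categories, plus `O`. The sketch below builds that tag set and also shows one way the labels might be collapsed for the 'segmentation only' sub-task. The spelling of the category names and the helper `to_segmentation` are assumptions for illustration.

```python
# Category spellings are assumed; the data files may spell them differently.
ENTITY_TYPES = [
    "company", "facility", "geo-loc", "movie", "musicartist",
    "other", "person", "product", "sportsteam", "tvshow",
]


def bio_tags(types):
    """Return the BIO tag inventory: O plus B-x and I-x for every type x."""
    tags = ["O"]
    for t in types:
        tags.append(f"B-{t}")
        tags.append(f"I-{t}")
    return tags


def to_segmentation(tag):
    """Drop the entity type, keeping only the boundary marker (B, I or O)."""
    return tag.split("-", 1)[0]


TAGS = bio_tags(ENTITY_TYPES)
TAG_TO_ID = {tag: i for i, tag in enumerate(TAGS)}

print(len(TAGS))                      # 10 * 2 + 1 = 21
print(to_segmentation("B-facility"))  # B
```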


## Baseline System

Conditional random fields (CRFs) are one of the most effective approaches for NER, having achieved state-of-the-art performance on several NER tasks. A CRF learns the latent structure of an input sequence using an undirected statistical graphical model. However, its performance depends heavily on hand-crafted features designed for a particular task or domain, and these features are difficult to develop and maintain. One example is orthographic features, which are based on patterns of characters contained in a given word.
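
To make the idea of hand-crafted features concrete, here is a small sketch of a per-token feature function of the kind typically fed to a CRF toolkit. The feature names and the choice of features are illustrative assumptions, not the feature set used by the baseline.

```python
def token_features(tokens, i):
    """Build a dictionary of simple orthographic and context features for tokens[i]."""
    word = tokens[i]
    feats = {
        "lower": word.lower(),
        "is_title": word.istitle(),        # e.g. "Disney"
        "is_upper": word.isupper(),        # e.g. "FOREX"
        "has_digit": any(ch.isdigit() for ch in word),
        "is_hashtag": word.startswith("#"),
        "is_mention": word.startswith("@"),
        "suffix3": word[-3:],
    }
    # Neighbouring words give the model some local context.
    feats["prev_lower"] = tokens[i - 1].lower() if i > 0 else "<s>"
    feats["next_lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"
    return feats


tokens = ["It", "sucks", "not", "to", "be", "in", "Disney", "world", "."]
print(token_features(tokens, 6))
```

Every new cue (emoticons, elongated words, gazetteer lookups) needs another hand-written rule, which is the maintenance cost described above.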

Using Conditional Random Fields (CRFs) to label data

|Label type   |Precision|Recall|F1 score|# of phrases|
|-------------|---------|------|--------|------------|
|company|35.48|28.21|31.43| 31|
|facility|15.79|15.79|15.79|38|
|geo-loc|47.69|53.45|50.41|130|
|movie|0.00|0.00|0.00|8|
|music artist|0.00|0.00|0.00|4|
|other|33.33|22.73|27.03|90|
|person|52.04|59.65|55.59|196|
|product|7.14|2.70|3.92|14|
|sports team|40.00|8.57|14.12|15|
|tv show|0.00|0.00|0.00|6|
|**AVG**|40.98|32.98|36.55|-|

Accuracy:  93.85%
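
The gap between the high accuracy and the much lower F1 is easier to see with a sketch of the two metrics. It assumes, as in CoNLL-style evaluation, that accuracy is counted per token while precision, recall and F1 are counted per entity phrase with exact boundary and type match. The function names are illustrative.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != kind)
        if kind is not None and (tag == "O" or starts_new):
            spans.add((start, i, kind))
            start, kind = None, None
        if starts_new:
            start, kind = i, tag[2:]
    return spans


def prf(gold, pred):
    """Phrase-level precision, recall and F1 with exact span and type match."""
    g, p = extract_spans(gold), extract_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


gold = ["O", "O", "O", "O", "O", "O", "B-facility", "I-facility", "O"]
pred = ["O", "O", "O", "O", "O", "O", "B-facility", "O", "O"]

print(token_accuracy(gold, pred))  # 8 of 9 tokens are correct
print(prf(gold, pred))             # the only predicted span has the wrong boundary
```

Because most tokens are `O`, a tagger can score well on token accuracy while missing most entities, so the phrase-level scores are the more informative numbers.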

## WNUT 2016 - Winning system

![3pic](3.png)

### Results

|Label type  |   Precision | Recall  |  F1       |
| -------- | ----------| ----------- | ------- | 
| company | 69.84 | 48.47 |57.22 |
| facility | 51.70 | 35.97 |42.42 |
| geo-loc |75.21 | 70.18 | 72.61 | 
| movie | 14.29 | 8.82 | 10.91 |
| musicartist | 26.83 | 5.76 | 9.48 |
| other | 49.45 | 23.29 | 31.66 |
| person | 52.06 | 68.05 | 58.99 |
| product | 36.96 | 13.82 | 20.12 |
| sportsteam | 53.15 | 51.70 | 52.41 |
| tvshow | 100.00 | 3.03 | 5.88 |
| **AVG** | 60.77 | 46.07 | 52.41 |

## Experiment 1 - Word-based biLSTM

### Embeddings from training data

* Trained on train + dev, evaluated on test

|Label type  |   Precision | Recall  |  F1       | # of phrases|
| -------- | ----------| ----------- | ------- |------|
|company |13.27|17.53|15.10|844|
|facility|17.37|16.92|17.14|259|
|geo-loc|48.31|35.40|40.86|650|
|movie|0.00|0.00|0.00|123|
|musicartist|16.67|2.55|4.42|30|
|other|31.46|15.22|20.52|302|
|person|25.43|14.74|18.66|291|
|product|5.40|5.28|5.34|278|
|sportsteam|5.42|11.92|7.45|332|
|tvshow|0.00|0.00|0.00|73|
|**AVG**|21.31|18.72|19.93|3182|

Accuracy: 89.90%








### Pre-trained embeddings

* GloVe (2B tweets, 1.2M vocab,  200d vectors)
* Godin et al. (2015) (400M tweets, 3M vocab, 400d vectors)
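
A rough sketch of loading pre-trained vectors, assuming the plain-text layout used by the GloVe releases (a word followed by space-separated floats on each line). Embeddings distributed in other formats would need a different reader. The helper names, the lowercasing in `lookup`, and the zero vector for unknown words are illustrative assumptions.

```python
import io


def load_text_vectors(handle, vocab=None):
    """Read 'word v1 v2 ...' lines into a dict, optionally keeping only words in vocab."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if vocab is not None and word not in vocab:
            continue  # skip words that never occur in our data to save memory
        vectors[word] = [float(v) for v in values]
    return vectors


def lookup(word, vectors, dim):
    """Return the vector for a word, falling back to zeros when it is unknown."""
    return vectors.get(word.lower(), [0.0] * dim)


# A tiny in-memory stand-in for a real embedding file.
sample = io.StringIO("disney 0.1 -0.2 0.3\nworld 0.0 0.5 -0.1\n")
vectors = load_text_vectors(sample, vocab={"disney"})

print(lookup("Disney", vectors, 3))
print(lookup("EXPO", vectors, 3))
```

With a real file, `handle` would be an open text file; restricting the load to the task vocabulary keeps the 1.2M or 3M word tables manageable.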

## Experiment 2 - Character-based biLSTM


![gif2](gif2.gif)

## Results

## Future work

An orthographic sentence generator creates an orthographic sentence, which contains the orthographic pattern of the words in each input sentence. In particular, for a given social media sentence (e.g. ‘14th MENA FOREX EXPO announced!!’), it generates an orthographic sentence (e.g. ‘nncc CCCC CCCCC CCCC cccccccccpp’) using a set of rules in which upper-case characters, lower-case characters, numbers and punctuation marks are replaced with C, c, n and p, respectively. This orthographic sentence allows a bidirectional LSTM to induce and leverage orthographic features automatically.
We focus on orthographic features because they have been shown to be effective and are widely used in NER systems. Importantly, orthographic features were used by the majority of the systems (including the best system) participating in the Twitter NER shared task at the 2015 WNUT workshop.

* Use  orthographic features: <br>

14th MENA FOREX EXPO announced!! <br>

nncc CCCC CCCCC CCCC cccccccccpp <br>
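
The mapping in this example can be sketched in a few lines. This is an illustration of the stated rules rather than the original implementation; treating every character that is not a letter, digit or whitespace as punctuation is an assumption.

```python
def orthographic(sentence: str) -> str:
    """Replace each character with its orthographic class: C, c, n or p."""
    out = []
    for ch in sentence:
        if ch.isupper():
            out.append("C")   # upper-case letter
        elif ch.islower():
            out.append("c")   # lower-case letter
        elif ch.isdigit():
            out.append("n")   # number
        elif ch.isspace():
            out.append(ch)    # keep word boundaries intact
        else:
            out.append("p")   # assumed: everything else counts as punctuation
    return "".join(out)


print(orthographic("14th MENA FOREX EXPO announced!!"))
# nncc CCCC CCCCC CCCC cccccccccpp
```

The orthographic sentence has the same length and word boundaries as the input, so it can be fed to a second embedding layer aligned token by token with the original words.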

### address inconsistencies?

![img](tweet-example1.png)

![img](tweet-example2.png)

![img](tweet-example3.png)

![img](tweet-example4.png)

## Conclusion 
![gif](gif.gif)

## Thank you!

## References

1. Benjamin Strauss, Bethany E. Toma, Alan Ritter, Marie-Catherine de Marneffe and Wei Xu. "Results of the WNUT16 Named Entity Recognition Shared Task." NUT@COLING (2016). <br>
2. Nut Limsopatham and Nigel Collier. "Bidirectional LSTM for Named Entity Recognition in Twitter Messages." NUT@COLING (2016). <br>
3. WNUT Named Entity Recognition in Twitter Shared Task: https://github.com/aritter/twitter_nlp/tree/master/data/annotated/wnut16 <br>