# Embedding Approaches for IE

# Outline
---

Informal overview of a couple of embedding approaches in information extraction (IE)

1. Word embedding approaches - Motivation: NER for Social Media
    * word2vec
    * bag o' character n-grams
    * expensive things
2. Sequence embedding approaches - Motivation: Relation Classification
3. Joint approaches -- End-to-end Relation Extraction

# NER for Social Media -- word2vec
---
Skipgram objective:
$ max \ P(c_j = y | w_i ) \ \ \forall c_j \in Context(w_i) $

$ where \ \ P(c_j = y | w_i ) = \frac{\exp(v(y)^T v(w_i))}{\sum_{y'} \exp(v(y')^T v(w_i))} $

* Every word and every context gets its own vector

* Words with similar contexts end up close together

Some issues: 
   * Large vocabulary sizes
   * Cannot handle new words
   * Rare words will either be omitted or not learned well
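
The skipgram softmax can be sketched in a few lines of plain Python. This is a toy illustration, not a real word2vec implementation: the function name `skipgram_prob` and the 2-d vectors are made up for the example. Note that the denominator sums over every candidate context, which is one reason large vocabularies are costly.

```python
import math

def skipgram_prob(word_vec, context_vecs):
    """P(c = y | w) for every candidate context y: softmax over dot products."""
    scores = [sum(a * b for a, b in zip(cv, word_vec)) for cv in context_vecs]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)  # normalizer: one term per candidate context
    return [e / total for e in exps]

# Toy vectors (invented numbers, purely illustrative).
v_word = [1.0, 0.0]
v_contexts = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
probs = skipgram_prob(v_word, v_contexts)
```

The context whose vector points the same way as the word vector gets the highest probability, and the values in `probs` sum to 1.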

# NER for Social Media -- bag of char-grams
---
Split every word into a bag of character n-grams, say with n = 3.

`banana -> ['__b', '_ba', 'ban', 'ana', 'nan', 'ana', 'na_', 'a__']`

Now $ v(w) = \Sigma_i {v(c_{i-1:i+1})} \ \ \forall c_i \in \ Characters(w)$

* More compact set of parameters
* Can handle new words
* Rare words can share parameters (but rare char-grams won't be learned well)
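
A minimal sketch of the bag-of-char-grams embedding, assuming `_` padding as in the `banana` example. The names `char_ngrams`, `ngram_vectors`, and `word_vector` are hypothetical, and the randomly initialized vectors stand in for parameters that would normally be learned.

```python
import random

DIM = 4                  # embedding size (arbitrary for the example)
rng = random.Random(0)
ngram_vectors = {}       # hypothetical parameter table: n-gram -> vector

def char_ngrams(word, n=3, pad="_"):
    """All character n-grams of the word, padded with n-1 pad symbols per side."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_vector(gram):
    """Look up an n-gram vector, creating a random one on first use."""
    if gram not in ngram_vectors:
        ngram_vectors[gram] = [rng.uniform(-0.1, 0.1) for _ in range(DIM)]
    return ngram_vectors[gram]

def word_vector(word):
    """v(w): the sum of the vectors of the word's character n-grams."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        for k, x in enumerate(ngram_vector(gram)):
            vec[k] += x
    return vec

grams = char_ngrams("banana")
unseen = word_vector("bananas")  # a new word still gets a vector
```

Because `bananas` shares most of its n-grams with `banana`, the two words share most of their parameters, which is what lets new and rare words get usable vectors.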

# NER for Social Media -- fancy stuff
---
Could also think of other more heavily parameterized word embeddings...

1. Character Convolutions ... Gated Convolutions ...
2. Character RNNs... LSTMs ... Bidirectional LSTMs ...

**NOTE:** These will be slower to train, and may need more data

(In practice these representations can be cached for most character sequences)
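
As a rough sketch of the first option, here is a character convolution with max-over-time pooling, plus the caching trick. Everything here is an assumption for illustration: the sizes, the names `char_vector` and `char_cnn_embedding`, and the random weights standing in for trained ones. There is no nonlinearity or gating, and the cache is only valid once the parameters are frozen (i.e. at inference time).

```python
import random
from functools import lru_cache

rng = random.Random(0)
CHAR_DIM, WIDTH, FILTERS = 3, 3, 4   # arbitrary toy sizes
char_vectors = {}                    # hypothetical table: character -> vector
# One weight row per filter, spanning WIDTH consecutive character vectors.
filters = [[rng.uniform(-0.5, 0.5) for _ in range(WIDTH * CHAR_DIM)]
           for _ in range(FILTERS)]

def char_vector(ch):
    """Look up a character vector, creating a random one on first use."""
    if ch not in char_vectors:
        char_vectors[ch] = [rng.uniform(-0.5, 0.5) for _ in range(CHAR_DIM)]
    return char_vectors[ch]

@lru_cache(maxsize=None)             # cache one embedding per character sequence
def char_cnn_embedding(word):
    """Slide each filter over the padded word and keep its maximum response."""
    padded = "_" * (WIDTH - 1) + word + "_" * (WIDTH - 1)
    windows = []
    for i in range(len(padded) - WIDTH + 1):
        window = []
        for ch in padded[i:i + WIDTH]:
            window.extend(char_vector(ch))
        windows.append(window)
    return tuple(
        max(sum(w * x for w, x in zip(f, window)) for window in windows)
        for f in filters
    )

emb = char_cnn_embedding("banana")
```

Calling `char_cnn_embedding("banana")` a second time returns the cached tuple without recomputing the convolution.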

# Relation Classification -- dependency path embeddings
---
Given the subject and object of a potential relation, classify whether a relation holds and, if so, which type.

* Variable-length sequences => sequence model (e.g., RNN)

* Shortest path on the dependency tree (SDP) between subject (x) and object (y) often signals the relation

Let $\ f(w_{1:n}) = RNN( w_1, ..., w_n ).hidden[n] \ \ where \ \ w_{1:n} = SDP(x, y) $

The relation classifier can then be trained (supervised) with

$ max \ P(r(x,y) = r^* \ | \ x, y, w_{1:n}) $, where $r^*$ is the gold relation label

$ where \ P(r(x,y) \ | \ x, y, w_{1:n}) = softmax(\ [v(x);\ v(y);\ f(w_{1:n})]^T W + b\ ) $
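
The forward pass of this classifier can be sketched as follows: a plain tanh RNN run over the path tokens, whose final hidden state is concatenated with the subject and object vectors and fed to a softmax. This is an untrained toy under stated assumptions: the label set, the sizes, and the example path are invented, the weights are random, and the path is written by hand rather than produced by a dependency parser. `W_out` is stored with one row per label, which is equivalent to the $[\cdot]^T W$ form.

```python
import math
import random

rng = random.Random(0)
DIM, HID = 3, 4
RELATIONS = ["no_relation", "works_for", "located_in"]  # hypothetical label set

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = {}                      # token -> vector, random stand-ins
W_in = rand_matrix(HID, DIM)         # input-to-hidden weights
W_rec = rand_matrix(HID, HID)        # hidden-to-hidden weights
W_out = rand_matrix(len(RELATIONS), 2 * DIM + HID)
b_out = [0.0] * len(RELATIONS)

def v(token):
    if token not in embeddings:
        embeddings[token] = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]
    return embeddings[token]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def rnn_final_state(tokens):
    """f(w_1..w_n): the hidden state after reading the last path token."""
    h = [0.0] * HID
    for tok in tokens:
        a, r = matvec(W_in, v(tok)), matvec(W_rec, h)
        h = [math.tanh(ai + ri) for ai, ri in zip(a, r)]
    return h

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relation_probs(subj, obj, sdp_tokens):
    """softmax over [v(x); v(y); f(w_1..w_n)] projected onto the labels."""
    features = v(subj) + v(obj) + rnn_final_state(sdp_tokens)
    scores = [s + b for s, b in zip(matvec(W_out, features), b_out)]
    return dict(zip(RELATIONS, softmax(scores)))

# Hand-written stand-in for SDP(x, y); a parser would supply this in practice.
probs = relation_probs("Alice", "Acme", ["Alice", "works", "at", "Acme"])
```

Training would maximize the probability assigned to the gold label, backpropagating through the softmax, the RNN, and the embeddings.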

# Joint NER and Relation Classification
---
* Would like to learn a system that classifies entities and their relations jointly

* Everything is differentiable and can be trained via backprop

* So the NER approach can be plugged in for use in Relation Extraction

**NOTE:** Errors propagate both ways -- want to use training signals at lower layers

-> Jointly training multiple objectives (multitask) or layer-wise pretraining
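
One common way to write the multitask option, as a sketch: add the per-task losses over shared parameters $\theta$. The symbols $L_{NER}$, $L_{RE}$, and the weight $\lambda$ are notation assumed here, not taken from a specific system.

$ L_{joint}(\theta) = L_{NER}(\theta) + \lambda \ L_{RE}(\theta), \ \ \lambda > 0 $

With this form, gradients from the relation loss reach the shared lower layers alongside the NER gradients.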
   

# Thanks!