---
Relation Extraction
====

![](http://reel.cs.columbia.edu/frames/description_files/image002.png)

---
By the end of this session, you should be able to:
---
- Define relation extraction
- List the common methods of relation extraction
- Describe relation extraction feature engineering
- Identify RDF triples
- Outline a relation extraction system

----
What is Relation Extraction?
----

Define Named Entity Recognition (NER) in Plain English.

![](images/relationExtractionDiagram.jpg)

A relation is a (subject, relationship/action, object) triple expressed within a sentence.

You need to parse sentences:
- Find the subject
- Extract the relationship/action and the object
- Add additional semantic information such as
    - entity extraction
    - keyword extraction
    - sentiment analysis
    - location identification
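
As a rough illustration of the parsing steps above, the sketch below pulls one (subject, relationship, object) triple out of a sentence. It assumes the tokens have already been tagged by hand with coarse part-of-speech labels; in practice a tagger or parser would supply them. The `extract_triple` helper and its noun-verb-noun heuristic are hypothetical and far simpler than a real system.

```python
from typing import List, Optional, Tuple

# Hand-tagged tokens stand in for the output of a real POS tagger or parser.
TaggedToken = Tuple[str, str]  # (word, coarse tag such as "NOUN" or "VERB")


def extract_triple(tagged: List[TaggedToken]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, relation, object) using a naive noun-verb-noun heuristic."""
    verb_index = next((i for i, (_, tag) in enumerate(tagged) if tag == "VERB"), None)
    if verb_index is None:
        return None
    # Subject: the closest noun before the first verb.
    subject = next((w for w, t in reversed(tagged[:verb_index]) if t == "NOUN"), None)
    # Object: the first noun after the verb.
    obj = next((w for w, t in tagged[verb_index + 1:] if t == "NOUN"), None)
    if subject is None or obj is None:
        return None
    return (subject, tagged[verb_index][0], obj)


sentence = [
    ("Marie", "NOUN"), ("discovered", "VERB"), ("radium", "NOUN"),
    ("in", "ADP"), ("Paris", "NOUN"),
]
triple = extract_triple(sentence)
print(triple)
```

The remaining noun ("Paris") is where additional semantic information such as location identification could be attached.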
   

----
Relation Extraction Methods:
----

1. Pattern Matching
1. Bootstrapping methods
1. Supervised methods
1. Distant supervision
1. Unsupervised methods

### Pattern Matching
----

- Rules written by a human
- Manually written regex
- Automatically generated regex, similar to Trifacta
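
A manual regex for a single relation might look like the sketch below. The `born_in` relation name, the `BORN_IN` pattern, and the sample text are illustrative assumptions; the pattern only recognizes capitalized names around the literal phrase "was born in".

```python
import re

# Hypothetical hand-written pattern for a "born_in" relation.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"  # one or more capitalized words
BORN_IN = re.compile(rf"(?P<person>{NAME}) was born in (?P<place>{NAME})")


def find_born_in(text):
    """Return (person, 'born_in', place) triples matched by the pattern."""
    return [(m.group("person"), "born_in", m.group("place"))
            for m in BORN_IN.finditer(text)]


text = "Ada Lovelace was born in London. Alan Turing was born in Maida Vale."
matches = find_born_in(text)
print(matches)
```

Patterns like this tend to be precise but miss paraphrases ("a native of", "hails from"), which is one motivation for the other methods.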

### Bootstrapping methods
-----
 
![](images/bootstrapping_system.png)
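
One common form of bootstrapping alternates between learning patterns from known entity pairs and using those patterns to find new pairs. The sketch below is a minimal version of that loop, not necessarily the system in the figure. The helper names, the toy corpus, and the "single capitalized word" notion of an entity are all assumptions made for illustration.

```python
import re

ENTITY = r"([A-Z][a-z]+)"  # toy assumption: an entity is one capitalized word


def learn_patterns(corpus, pairs):
    """Collect the text found between each known pair as a pattern."""
    patterns = set()
    for sentence in corpus:
        for a, b in pairs:
            m = re.search(re.escape(a) + r"(.+?)" + re.escape(b), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns


def apply_patterns(corpus, patterns):
    """Find new entity pairs connected by any learned pattern."""
    found = set()
    for sentence in corpus:
        for middle in patterns:
            for m in re.finditer(ENTITY + re.escape(middle) + ENTITY, sentence):
                found.add((m.group(1), m.group(2)))
    return found


def bootstrap(corpus, seeds, rounds=2):
    pairs = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(corpus, pairs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs


corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Berlin, the capital city of Germany, is large.",
    "Rome, the capital city of Italy, is old.",
]
pairs = bootstrap(corpus, {("Paris", "France")})
print(sorted(pairs))
```

In this toy run the second round learns a new pattern from the pair found in the first round. Real systems also score patterns and pairs to limit the drift caused by bad matches.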

### Supervised methods
-----

- __Must__ know ground truth
- Now use any (multiclass) classifier you like:
    - Naive Bayes
    - MaxEnt (logistic regression)
    - SVM
    - ....
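
To make the supervised setup concrete, here is a small multinomial Naive Bayes classifier written with only the standard library. It is a sketch under assumptions: each training example is the list of words between two entities plus a ground-truth relation label, and the labels (`born_in`, `employed_by`) and data are invented for illustration. In practice you would more likely use an existing library classifier.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """examples: list of (feature_tokens, label) pairs with ground-truth labels."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, token_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, token_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train = [
    (["was", "born", "in"], "born_in"),
    (["a", "native", "of"], "born_in"),
    (["works", "for"], "employed_by"),
    (["is", "employed", "by"], "employed_by"),
    (["joined"], "employed_by"),
]
model = train_nb(train)
print(predict_nb(model, ["born", "in"]))
print(predict_nb(model, ["works", "at"]))
```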


#### Feature Engineering

- Lightweight features — require little pre-processing
    - Bags of words & bigrams between, before, and after the entities
    - Stemmed versions of the same
    - The types of the entities
    - The distance (number of words) between the entities
- Medium-weight features — require base phrase chunking
    - Base-phrase chunk paths
    - Bags of chunk heads
- Heavyweight features — require full syntactic parsing
    - Dependency-tree paths between the entities
    - Constituent-tree paths between the entities
    - Tree distance between the entities
    - Presence of particular constructions in a constituent
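
The lightweight features are simple enough to sketch directly. The function below assumes the sentence is already tokenized and that entity spans and types are given (for example by an NER step); the function name, the feature-name format, and the two-token context window are illustrative choices. Stemming is left out to keep the example dependency-free.

```python
def lightweight_features(tokens, e1_span, e2_span, e1_type, e2_type, window=2):
    """Build a feature dict for one entity pair.

    Spans are (start, end) token indices with end exclusive; e1 precedes e2.
    """
    before = tokens[max(0, e1_span[0] - window):e1_span[0]]
    between = tokens[e1_span[1]:e2_span[0]]
    after = tokens[e2_span[1]:e2_span[1] + window]

    features = {}
    for name, words in (("before", before), ("between", between), ("after", after)):
        lowered = [w.lower() for w in words]
        for w in lowered:  # bag of words
            features[f"{name}_word={w}"] = 1
        for a, b in zip(lowered, lowered[1:]):  # bag of bigrams
            features[f"{name}_bigram={a}_{b}"] = 1
    features[f"types={e1_type}-{e2_type}"] = 1  # the types of the entities
    features["distance"] = len(between)  # number of words between the entities
    return features


tokens = "Mathematician Ada Lovelace was born in London in 1815".split()
features = lightweight_features(tokens, (1, 3), (6, 7), "PERSON", "LOCATION")
for name in sorted(features):
    print(name, features[name])
```

A dictionary like this can be fed to any of the classifiers listed under supervised methods once it is converted to the classifier's input format.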

### Distant supervision
----

__Hypothesis__: If two entities belong to a certain relation, any sentence containing those two entities is likely to express that relation
 
 
This needs a very large corpus and relatively few entities

Hints at word2vec
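
The distant supervision hypothesis can be turned into an automatic labeling step, sketched below. The tiny knowledge base `KB`, the relation names, and the sentences are made-up examples, and entity matching is plain substring search rather than real entity linking.

```python
# Hypothetical knowledge base of known (entity1, entity2) -> relation facts.
KB = {
    ("Barack Obama", "Honolulu"): "born_in",
    ("Steve Jobs", "Apple"): "founder_of",
}


def label_sentences(sentences, kb):
    """Label every sentence that mentions both entities of a known fact."""
    labeled = []
    for sentence in sentences:
        for (e1, e2), relation in kb.items():
            if e1 in sentence and e2 in sentence:
                labeled.append((sentence, e1, e2, relation))
    return labeled


sentences = [
    "Barack Obama was born in Honolulu in 1961.",
    "Barack Obama visited Honolulu last week.",  # gets a wrong (noisy) label
    "Steve Jobs co-founded Apple in 1976.",
    "Apple released a new phone.",               # only one entity: skipped
]
labeled = label_sentences(sentences, KB)
for row in labeled:
    print(row)
```

The second sentence shows why the hypothesis says "likely": it mentions both entities but does not express `born_in`, so the automatically produced training data is noisy.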

### Unsupervised methods
----

<details><summary>
What is the go to unsupervised method?
</summary>
Clustering
</details>

One of the better methods is Latent Dirichlet allocation (LDA), which we'll cover later in the course

---
Resource Description Framework (RDF), aka Triples
-----

![](http://sett.ociweb.com/sett/settFeb2011_files/triples.gif)

### Graph:

- nodes (subject, object)
- directed edges (relationship / predicate)
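
A minimal way to see triples as a graph is to store them as tuples and index them by subject, as in the sketch below. The example triples and the `match` helper are illustrative; a real triple store offers much richer querying.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object): two nodes and one directed edge.
triples = [
    ("Ada Lovelace", "born_in", "London"),
    ("Ada Lovelace", "worked_with", "Charles Babbage"),
    ("London", "capital_of", "England"),
]

# Adjacency list: subject node -> list of (predicate, object) outgoing edges.
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


print(graph["Ada Lovelace"])
print(match(triples, p="capital_of"))
```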

---
Relation extraction into RDF
---

![](images/rdf_exmaple.png)

---
RDF features
----

- RDF is not a data format
- RDF is a data model for expressing relationships between (arbitrary) data elements
- RDF data can be serialized in multiple formats, such as RDF/XML, Turtle, N-Triples, and JSON-LD
- RDF data can be stored in files and databases (triple stores)
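
To show the difference between the data model and a serialization, the sketch below writes a few triples in N-Triples, a line-based RDF serialization in which each line is a subject, predicate, and object followed by a period. The `http://example.org/` namespace and the resource names are placeholders, and only IRI-valued objects are handled (no literals).

```python
BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def to_ntriples(triples):
    """Serialize (subject, predicate, object) names as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        lines.append(f"<{BASE}{s}> <{BASE}{p}> <{BASE}{o}> .")
    return "\n".join(lines)


triples = [
    ("Ada_Lovelace", "bornIn", "London"),
    ("London", "capitalOf", "England"),
]
document = to_ntriples(triples)
print(document)
```

The same two triples could equally be written as RDF/XML or Turtle; the underlying graph would not change.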

---
Wordnet
----

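WordNet is a lexical database whose words are linked by specific types of predicates (for example "is a kind of" and "is a part of"). A predicate is the part of the sentence that says something about the subject and contains a verb.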

![](images/word_net.png)

---
Summary
----

- Relation Extraction is a valuable (and difficult) technique
- It involves processing text and identifying the (subject, relationship/action, object) relationships.
- Methods vary from Pattern Matching to different types of supervised and unsupervised methods
- RDF triples are a natural data structure for storing relationships

<br>
<br>
----