# Homework and bake-off: Relation extraction using distant supervision

In [70]:
__author__ = "Bill MacCartney and Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2020"

## Contents

1. [Overview](#Overview)
1. [Set-up](#Set-up)
1. [Baselines](#Baselines)
  1. [Hand-build feature functions](#Hand-build-feature-functions)
  1. [Distributed representations](#Distributed-representations)
1. [Homework questions](#Homework-questions)
  1. [Different model factory [1 points]](#Different-model-factory-[1-points])
  1. [Directional unigram features [1.5 points]](#Directional-unigram-features-[1.5-points])
  1. [The part-of-speech tags of the "middle" words [1.5 points]](#The-part-of-speech-tags-of-the-"middle"-words-[1.5-points])
  1. [Bag of Synsets [2 points]](#Bag-of-Synsets-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

This homework and associated bake-off are devoted to developing really effective relation extraction systems using distant supervision. 

As with the previous assignments, this notebook first establishes a baseline system. The initial homework questions ask you to create additional baselines and suggest areas for innovation, and the final homework question asks you to develop an original system for you to enter into the bake-off.

## Set-up

See [the first notebook in this unit](rel_ext_01_task.ipynb#Set-up) for set-up instructions.

In [71]:
import numpy as np
import os
import rel_ext
from sklearn.linear_model import LogisticRegression
import utils


As usual, we unite our corpus and KB into a dataset, and create some splits for experimentation:

In [72]:
rel_ext_data_home = os.path.join('data', 'rel_ext_data')

In [73]:
corpus = rel_ext.Corpus(os.path.join(rel_ext_data_home, 'corpus.tsv.gz'))

In [74]:
kb = rel_ext.KB(os.path.join(rel_ext_data_home, 'kb.tsv.gz'))

In [75]:
dataset = rel_ext.Dataset(corpus, kb)

You are not wedded to this set-up for splits. The bake-off will be conducted on a previously unseen test-set, so all of the data in `dataset` is fair game:

In [76]:
splits = dataset.build_splits(
    split_names=['tiny', 'train', 'dev'],
    split_fracs=[0.01, 0.79, 0.20],
    seed=1)

In [77]:
splits

{'tiny': Corpus with 3,474 examples; KB with 445 triples,
 'train': Corpus with 263,285 examples; KB with 36,191 triples,
 'dev': Corpus with 64,937 examples; KB with 9,248 triples,
 'all': Corpus with 331,696 examples; KB with 45,884 triples}
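
Since all of `dataset` is fair game, you may prefer a different partition, for example one with a larger training portion. The sketch below only illustrates the general idea of cutting a shuffled collection into named fractions. `split_by_fractions` is a hypothetical helper written for this illustration; it is not how `dataset.build_splits` is implemented.

```python
import random

def split_by_fractions(items, split_names, split_fracs, seed=1):
    """Shuffle `items` and cut them into consecutive named chunks.

    Hypothetical helper for illustration only; `rel_ext` has its own
    splitting logic in `Dataset.build_splits`.
    """
    assert len(split_names) == len(split_fracs)
    assert abs(sum(split_fracs) - 1.0) < 1e-9
    items = list(items)
    random.Random(seed).shuffle(items)
    splits, start, cum = {}, 0, 0.0
    for i, (name, frac) in enumerate(zip(split_names, split_fracs)):
        cum += frac
        # The last split takes whatever remains, so nothing is dropped.
        is_last = i == len(split_names) - 1
        end = len(items) if is_last else round(cum * len(items))
        splits[name] = items[start:end]
        start = end
    return splits

toy_splits = split_by_fractions(
    range(1000),
    split_names=['train', 'dev'],
    split_fracs=[0.90, 0.10],
    seed=42)
print({name: len(chunk) for name, chunk in toy_splits.items()})
```

With the real data, the analogous step would be another call to `dataset.build_splits` with different `split_names` and `split_fracs`.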

## Baselines

### Hand-build feature functions

In [9]:
def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word] += 1
    return feature_counter

In [10]:
featurizers = [simple_bag_of_words_featurizer]

In [11]:
model_factory = lambda: LogisticRegression(fit_intercept=True, solver='liblinear')

In [13]:
baseline_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=featurizers,
    model_factory=model_factory,
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.879      0.385      0.700        340       5716
author                    0.828      0.540      0.749        509       5885
capital                   0.471      0.168      0.346         95       5471
contains                  0.795      0.601      0.747       3904       9280
film_performance          0.764      0.565      0.714        766       6142
founders                  0.788      0.382      0.650        380       5756
genre                     0.581      0.147      0.365        170       5546
has_sibling               0.855      0.248      0.575        499       5875
has_spouse                0.839      0.325      0.637        594       5970
is_a                      0.699      0.219      0.486        497       5873
nationality               0.609      0.186      0.419        301       5677
parents     
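
A note on reading the table: the `f-score` column looks like F0.5 (which weights precision more heavily than recall) rather than F1. This is an inference from the numbers printed above, not something stated in this notebook. The sketch below recomputes the score from the rounded precision and recall of three rows; the helper `f_beta` is written for this illustration.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 favors precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (precision, recall, reported f-score) copied from the table above.
rows = {
    'adjoins': (0.879, 0.385, 0.700),
    'author': (0.828, 0.540, 0.749),
    'capital': (0.471, 0.168, 0.346),
}

for rel, (p, r, reported) in rows.items():
    print(f"{rel:10s} recomputed={f_beta(p, r):.3f} reported={reported:.3f}")
```

The recomputed values should agree with the reported ones up to small differences caused by rounding the inputs to three decimals.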

Studying model weights might yield insights:

In [14]:
rel_ext.examine_model_weights(baseline_results)

Highest and lowest feature weights for relation adjoins:

     2.539 Valais
     2.534 Córdoba
     2.325 Taluks
     ..... .....
    -1.258 towers
    -1.299 India
    -1.343 Europe

Highest and lowest feature weights for relation author:

     2.347 wrote
     2.325 writer
     2.269 books
     ..... .....
    -2.429 1997
    -3.011 controversial
    -4.496 infamous

Highest and lowest feature weights for relation capital:

     3.182 capital
     1.765 especially
     1.668 city
     ..... .....
    -1.177 also
    -1.472 ’
    -1.615 includes

Highest and lowest feature weights for relation contains:

     2.661 third-largest
     2.341 bordered
     2.177 districts
     ..... .....
    -2.551 Henley-on-Thames
    -3.147 Lancashire
    -3.371 Midlands

Highest and lowest feature weights for relation film_performance:

     3.836 starring
     3.813 co-starring
     3.334 alongside
     ..... .....
    -1.570 tragedy
    -1.648 or
    -1.792 comedian

Highest and lowest feature weig
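
For a linear model, a report like this amounts to sorting the features by their learned coefficients. Here is a minimal sketch of that idea, using a hypothetical helper `top_and_bottom` (not the `rel_ext.examine_model_weights` implementation) applied to the six `capital` weights printed above:

```python
def top_and_bottom(feature_names, weights, k=3):
    """Return the k highest- and k lowest-weighted (weight, name) pairs."""
    ranked = sorted(zip(weights, feature_names), reverse=True)
    return ranked[:k], ranked[-k:]

# Feature names and weights copied from the `capital` report above.
names = ['capital', 'especially', 'city', 'also', '’', 'includes']
weights = [3.182, 1.765, 1.668, -1.177, -1.472, -1.615]

top, bottom = top_and_bottom(names, weights, k=2)
for weight, name in top + bottom:
    print(f"{weight:10.3f} {name}")
```

In a real run, the names would come from the fitted vectorizer and the weights from the coefficients of the per-relation model.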

### Distributed representations

This simple baseline sums the GloVe vector representations for all of the words in the "middle" span and feeds those representations into the standard `LogisticRegression`-based `model_factory`. The crucial parameter that enables this is `vectorize=False`. This essentially says to `rel_ext.experiment` that your featurizer or your model will do the work of turning examples into vectors; in that case, `rel_ext.experiment` just organizes these representations by relation type.

In [15]:
GLOVE_HOME = os.path.join('data', 'glove.6B')

In [16]:
glove_lookup = utils.glove2dict(
    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))

In [17]:
def glove_middle_featurizer(kbt, corpus, np_func=np.sum):
    reps = []
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split():
            rep = glove_lookup.get(word)
            if rep is not None:
                reps.append(rep)
    # A random representation of the right dimensionality if the
    # example happens not to overlap with GloVe's vocabulary:
    if len(reps) == 0:
        dim = len(next(iter(glove_lookup.values())))                
        return utils.randvec(n=dim)
    else:
        return np_func(reps, axis=0)
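
To make the pooling step concrete, here is a small sketch with a made-up three-dimensional lookup in place of GloVe. The names `toy_lookup` and `combine_middle` are hypothetical and the vectors are invented. Unlike the featurizer above, this sketch handles a single middle span and returns `None` when nothing is in the vocabulary.

```python
import numpy as np

# Invented 3-d vectors standing in for GloVe entries.
toy_lookup = {
    'born': np.array([1.0, 0.0, 2.0]),
    'in': np.array([0.5, 0.5, 0.5]),
}

def combine_middle(middle, lookup, np_func=np.sum):
    """Pool the vectors of the in-vocabulary words of one middle span."""
    reps = [lookup[w] for w in middle.split() if w in lookup]
    if not reps:
        return None  # The featurizer above uses a random vector here.
    return np_func(reps, axis=0)

summed = combine_middle('was born in', toy_lookup)
averaged = combine_middle('was born in', toy_lookup, np_func=np.mean)
print(summed)
print(averaged)
```

Passing `np_func=np.mean` (or `np.max`) changes how the word vectors are pooled without changing the dimensionality of the result, which is what lets the same classifier be reused.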

In [18]:
glove_results = rel_ext.experiment(
    splits,
    train_split='train',
    test_split='dev',
    featurizers=[glove_middle_featurizer],    
    vectorize=False, # Crucial for this featurizer!
    verbose=True)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.863      0.462      0.735        340       5716
author                    0.846      0.422      0.705        509       5885
capital                   0.704      0.200      0.468         95       5471
contains                  0.655      0.407      0.584       3904       9280
film_performance          0.821      0.330      0.633        766       6142
founders                  0.760      0.242      0.532        380       5756
genre                     0.458      0.065      0.207        170       5546
has_sibling               0.861      0.236      0.564        499       5875
has_spouse                0.863      0.350      0.668        594       5970
is_a                      0.694      0.155      0.409        497       5873
nationality               0.652      0.199      0.448        301       5677
parents     

With the same basic code design, one can also use the PyTorch models included in the course repo, or write new ones that are better aligned with the task. For those models, it's likely that the featurizer will just return a list of tokens (or perhaps a list of lists of tokens), and the model will map those into vectors using an embedding.
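
As a rough illustration of that design, the sketch below shows a featurizer that returns a list of token lists, one per example. The `Example`, `KBTriple`, and `ToyCorpus` classes are minimal stand-ins defined here so that the snippet runs on its own; they only mimic the attributes used in this notebook. Whether a given model can consume this output is an assumption that depends on the model you pair it with.

```python
from collections import namedtuple

# Minimal stand-ins for the `rel_ext` objects used in this notebook.
Example = namedtuple('Example', 'middle')
KBTriple = namedtuple('KBTriple', 'rel sbj obj')

class ToyCorpus:
    def __init__(self, examples_by_pair):
        self._examples = examples_by_pair

    def get_examples_for_entities(self, e1, e2):
        return self._examples.get((e1, e2), [])

def middle_tokens_featurizer(kbt, corpus):
    """Return one list of middle tokens per example, in both directions."""
    token_lists = []
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        token_lists.append(ex.middle.split())
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        token_lists.append(ex.middle.split())
    return token_lists

toy_corpus = ToyCorpus({
    ('xkcd', 'Randall_Munroe'): [Example('is a webcomic created by')],
})
toy_kbt = KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
print(middle_tokens_featurizer(toy_kbt, toy_corpus))
```

Like `glove_middle_featurizer`, this takes only `kbt` and `corpus`, which is the signature used with `vectorize=False`.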

## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Different model factory [1 points]

The code in `rel_ext` makes it very easy to experiment with other classifier models: one need only redefine the `model_factory` argument. This question asks you to assess a [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

__To submit:__ A wrapper function `run_svm_model_factory` that does the following: 

1. Uses `rel_ext.experiment` with the model factory set to one based on an `SVC` with `kernel='linear'` and all other arguments left at their default values.
1. Trains on the `train` part of `splits`.
1. Assesses on the `dev` part of `splits`.
1. Uses `featurizers` as defined above. 
1. Returns the return value of `rel_ext.experiment` for this set-up.

The function `test_run_svm_model_factory` will check that your function conforms to these general specifications.

In [19]:
from sklearn.svm import SVC
def run_svm_model_factory():
    
    ##### YOUR CODE HERE

    return rel_ext.experiment(
        splits,
        train_split='train',
        test_split='dev',
        featurizers=featurizers,
        model_factory=lambda : SVC(kernel='linear'))


In [20]:
def test_run_svm_model_factory(run_svm_model_factory):
    results = run_svm_model_factory()
    assert 'featurizers' in results, \
        "The return value of `run_svm_model_factory` seems not to be correct"
    # Check one of the models to make sure it's an SVC:
    assert 'SVC' in results['models']['adjoins'].__class__.__name__, \
        "It looks like the model factory wasn't set to use an SVC."

In [21]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_run_svm_model_factory(run_svm_model_factory)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.748      0.332      0.599        340       5716
author                    0.770      0.599      0.729        509       5885
capital                   0.609      0.295      0.502         95       5471
contains                  0.783      0.602      0.738       3904       9280
film_performance          0.736      0.619      0.709        766       6142
founders                  0.718      0.429      0.633        380       5756
genre                     0.606      0.253      0.474        170       5546
has_sibling               0.756      0.248      0.537        499       5875
has_spouse                0.800      0.350      0.636        594       5970
is_a                      0.595      0.278      0.484        497       5873
nationality               0.538      0.186      0.391        301       5677
parents     

### Directional unigram features [1.5 points]

The current bag-of-words representation makes no distinction between "forward" and "reverse" examples. But, intuitively, there is a big difference between _X and his son Y_ and _Y and his son X_. This question asks you to modify `simple_bag_of_words_featurizer` to capture these differences.
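
The symmetry problem can be seen on a toy example. In the sketch below, `Example`, `KBTriple`, and `ToyCorpus` are minimal stand-ins defined here so that the snippet runs on its own, and `bag_of_words` is a hypothetical helper that counts middle words either with or without a direction marker.

```python
from collections import Counter, namedtuple

# Minimal stand-ins for the `rel_ext` objects used in this notebook.
Example = namedtuple('Example', 'middle')
KBTriple = namedtuple('KBTriple', 'rel sbj obj')

class ToyCorpus:
    def __init__(self, examples_by_pair):
        self._examples = examples_by_pair

    def get_examples_for_entities(self, e1, e2):
        return self._examples.get((e1, e2), [])

def bag_of_words(kbt, corpus, directional=False):
    """Count middle words, optionally marking each with its direction."""
    counts = Counter()
    directions = (('_SO', (kbt.sbj, kbt.obj)), ('_OS', (kbt.obj, kbt.sbj)))
    for suffix, (e1, e2) in directions:
        for ex in corpus.get_examples_for_entities(e1, e2):
            for word in ex.middle.split(' '):
                counts[word + suffix if directional else word] += 1
    return counts

# One sentence of the form "X and his son Y".
toy_corpus = ToyCorpus({('X', 'Y'): [Example('and his son')]})
forward = KBTriple('parents', 'X', 'Y')
reverse = KBTriple('parents', 'Y', 'X')

# Without direction markers, both triples get identical features.
print(bag_of_words(forward, toy_corpus) == bag_of_words(reverse, toy_corpus))
# With markers, the two triples are distinguishable.
print(bag_of_words(forward, toy_corpus, directional=True))
print(bag_of_words(reverse, toy_corpus, directional=True))
```

Because the undirected counts are identical, no classifier trained on them can tell the two candidate triples apart.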

__To submit:__

1. A feature function `directional_bag_of_words_featurizer` that is just like `simple_bag_of_words_featurizer` except that it distinguishes "forward" and "reverse". To do this, you just need to mark each word feature for whether it is derived from a subject–object example or from an object–subject example.  The included function `test_directional_bag_of_words_featurizer` should help verify that you've done this correctly.

2. A call to `rel_ext.experiment` with `directional_bag_of_words_featurizer` as the only featurizer. (Aside from this, use all the default values for `rel_ext.experiment` as exemplified above in this notebook.)

3. `rel_ext.experiment` returns some of the core objects used in the experiment. How many feature names does the `vectorizer` have for the experiment run in the previous step? Include the code needed for getting this value. (Note: we're partly asking you to figure out how to get this value by using the sklearn documentation, so please don't ask how to do it!)

In [22]:
def directional_bag_of_words_featurizer(kbt, corpus, feature_counter): 
    # Append these to the end of the keys you add/access in 
    # `feature_counter` to distinguish the two orders. You'll
    # need to use exactly these strings in order to pass 
    # `test_directional_bag_of_words_featurizer`.
    subject_object_suffix = "_SO"
    object_subject_suffix = "_OS"
    
    ##### YOUR CODE HERE
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split(' '):
            feature_counter[word+subject_object_suffix] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split(' '):
            feature_counter[word+object_subject_suffix] += 1
            
    return feature_counter


# Call to `rel_ext.experiment`:
##### YOUR CODE HERE    
directional_bow_results = rel_ext.experiment(
    splits,
    featurizers=[directional_bag_of_words_featurizer],
)

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.881      0.412      0.717        340       5716
author                    0.880      0.589      0.801        509       5885
capital                   0.735      0.263      0.541         95       5471
contains                  0.812      0.669      0.779       3904       9280
film_performance          0.841      0.661      0.797        766       6142
founders                  0.782      0.405      0.659        380       5756
genre                     0.714      0.265      0.533        170       5546
has_sibling               0.844      0.248      0.570        499       5875
has_spouse                0.865      0.355      0.672        594       5970
is_a                      0.704      0.239      0.507        497       5873
nationality               0.632      0.223      0.462        301       5677
parents     

In [23]:
print("Number of features: ", len(directional_bow_results['vectorizer'].vocabulary_))

Number of features:  40726


In [24]:
def test_directional_bag_of_words_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter['is_OS'] += 5
    feature_counter = directional_bag_of_words_featurizer(kbt, corpus, feature_counter)
    expected = defaultdict(
        int, {'is_OS':6,'a_OS':1,'webcomic_OS':1,'created_OS':1,'by_OS':1})
    assert feature_counter == expected, \
        "Expected:\n{}\nGot:\n{}".format(expected, feature_counter)

In [25]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_directional_bag_of_words_featurizer(corpus)

### The part-of-speech tags of the "middle" words [1.5 points]

Our corpus distribution contains part-of-speech (POS) tagged versions of the core text spans. Let's begin to explore whether there is information in these sequences, focusing on `middle_POS`.

__To submit:__

1. A feature function `middle_bigram_pos_tag_featurizer` that is just like `simple_bag_of_words_featurizer` except that it creates a feature for bigram POS sequences. For example, given 

  `The/DT dog/N napped/V`
  
   we obtain the list of bigram POS sequences
  
   `b = ['<s> DT', 'DT N', 'N V', 'V </s>']`. 
   
   Of course, `middle_bigram_pos_tag_featurizer` should return count dictionaries defined in terms of such bigram POS lists, on the model of `simple_bag_of_words_featurizer`.  Don't forget the start and end tags, to model those environments properly! The included function `test_middle_bigram_pos_tag_featurizer` should help verify that you've done this correctly.

2. A call to `rel_ext.experiment` with `middle_bigram_pos_tag_featurizer` as the only featurizer. (Aside from this, use all the default values for `rel_ext.experiment` as exemplified above in this notebook.)
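
The bigram construction in step 1 can be checked in isolation on the example string. This is a minimal sketch with a hypothetical helper `pos_bigrams`, assuming tokens are separated by whitespace and each token has the form `word/TAG`:

```python
def pos_bigrams(tagged, start_symbol='<s>', end_symbol='</s>'):
    """Return POS bigrams for a string of whitespace-separated word/TAG tokens."""
    # Split on the rightmost '/' so that words containing '/' still parse.
    tags = [token.rsplit('/', 1)[1] for token in tagged.split()]
    padded = [start_symbol] + tags + [end_symbol]
    return [f"{a} {b}" for a, b in zip(padded, padded[1:])]

b = pos_bigrams('The/DT dog/N napped/V')
print(b)
```

Padding the tag sequence once and zipping it against itself shifted by one position yields every adjacent pair, including the two boundary bigrams.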

In [26]:
def middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter):
    
    ##### YOUR CODE HERE
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for pos_bigram in get_tag_bigrams(ex.middle_POS):
            feature_counter[pos_bigram] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for pos_bigram in get_tag_bigrams(ex.middle_POS):
            feature_counter[pos_bigram] += 1
            

    return feature_counter


def get_tag_bigrams(s):
    """Suggested helper method for `middle_bigram_pos_tag_featurizer`.
    This should be defined so that it returns a list of str, where each 
    element is a POS bigram."""
    # The values of `start_symbol` and `end_symbol` are defined
    # here so that you can use `test_middle_bigram_pos_tag_featurizer`.
    start_symbol = "<s>"
    end_symbol = "</s>"
    
    ##### YOUR CODE HERE
    tags = get_tags(s)
    return [a + " " + b for a, b in zip([start_symbol] + tags, tags + [end_symbol])]


    
def get_tags(s): 
    """Given a sequence of word/POS elements (lemmas), this function
    returns a list containing just the POS elements, in order.    
    """
    return [parse_lem(lem)[1] for lem in s.strip().split(' ') if lem]


def parse_lem(lem):
    """Helper method for parsing word/POS elements. It just splits
    on the rightmost / and returns (word, POS) as a tuple of str."""
    return lem.strip().rsplit('/', 1)  

# Call to `rel_ext.experiment`:
##### YOUR CODE HERE

pos_tag_results = rel_ext.experiment(
    splits,
    featurizers=[middle_bigram_pos_tag_featurizer]
)


relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.838      0.365      0.665        340       5716
author                    0.769      0.326      0.605        509       5885
capital                   0.484      0.158      0.342         95       5471
contains                  0.756      0.608      0.721       3904       9280
film_performance          0.721      0.441      0.640        766       6142
founders                  0.610      0.161      0.391        380       5756
genre                     0.667      0.153      0.399        170       5546
has_sibling               0.697      0.170      0.431        499       5875
has_spouse                0.790      0.259      0.560        594       5970
is_a                      0.600      0.175      0.404        497       5873
nationality               0.477      0.070      0.220        301       5677
parents     

In [27]:
def test_middle_bigram_pos_tag_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter['<s> VBZ'] += 5
    feature_counter = middle_bigram_pos_tag_featurizer(kbt, corpus, feature_counter)
    expected = defaultdict(
        int, {'<s> VBZ':6,'VBZ DT':1,'DT JJ':1,'JJ VBN':1,'VBN IN':1,'IN </s>':1})
    assert feature_counter == expected, \
        "Expected:\n{}\nGot:\n{}".format(expected, feature_counter)

In [28]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_middle_bigram_pos_tag_featurizer(corpus)

### Bag of Synsets [2 points]

The following allows you to use NLTK's WordNet API to get the synsets compatible with _dog_ as used as a noun:

```
from nltk.corpus import wordnet as wn
dog = wn.synsets('dog', pos='n')
dog
[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01')]
```

This question asks you to create synset-based features from the word/tag pairs in `middle_POS`.

__To submit:__

1. A feature function `synset_featurizer` that is just like `simple_bag_of_words_featurizer` except that it returns a list of synsets derived from `middle_POS`. Stringify these objects with `str` so that they can be `dict` keys. Use `convert_tag` (included below) to convert tags to `pos` arguments usable by `wn.synsets`. The included function `test_synset_featurizer` should help verify that you've done this correctly.

2. A call to `rel_ext.experiment` with `synset_featurizer` as the only featurizer. (Aside from this, use all the default values for `rel_ext.experiment`.)
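
The tag conversion is the step most easily checked without WordNet installed. The sketch below is one possible table-driven version, using hypothetical names `PENN_TO_WORDNET` and `penn_to_wordnet_pos`; it assumes Penn-Treebank-style tags and keys only on the first letter of each tag.

```python
# First letter of the corpus tag -> WordNet `pos` argument.
PENN_TO_WORDNET = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}

def penn_to_wordnet_pos(tag):
    """Map a POS tag to a WordNet pos string, or None if there is no match."""
    return PENN_TO_WORDNET.get(tag[:1].upper())

tagged = 'is/VBZ a/DT webcomic/JJ created/VBN by/IN'
pairs = [tuple(token.rsplit('/', 1)) for token in tagged.split()]
converted = [(word, penn_to_wordnet_pos(tag)) for word, tag in pairs]
print(converted)
```

Each `(word, pos)` pair would then be passed to `wn.synsets(word, pos=pos)`. When `pos` is `None`, NLTK does not restrict the lookup by part of speech, so function words can still contribute synsets.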

In [29]:
from nltk.corpus import wordnet as wn
import nltk
nltk.download('wordnet')
def synset_featurizer(kbt, corpus, feature_counter):
    
    ##### YOUR CODE HERE
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for synset in get_synsets(ex.middle_POS):
            feature_counter[synset] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for synset in get_synsets(ex.middle_POS):
            feature_counter[synset] += 1

    return feature_counter


def get_synsets(s):
    """Suggested helper method for `synset_featurizer`. This should
    be completed so that it returns a list of stringified Synsets 
    associated with elements of `s`.
    """   
    # Use `parse_lem` from the previous question to get a list of
    # (word, POS) pairs. Remember to convert the POS strings.
    wt = [parse_lem(lem) for lem in s.strip().split(' ') if lem]
    
    ##### YOUR CODE HERE
    synsets = []
    for word, tag in wt:
        for synset in wn.synsets(word, pos=convert_tag(tag)):
            synsets.append(str(synset))
    return synsets
    
def convert_tag(t):
    """Converts tags so that they can be used by WordNet:
    
    | Tag begins with | WordNet tag |
    |-----------------|-------------|
    | `N`             | `n`         |
    | `V`             | `v`         |
    | `J`             | `a`         |
    | `R`             | `r`         |
    | Otherwise       | `None`      |
    """        
    if t[0].lower() in {'n', 'v', 'r'}:
        return t[0].lower()
    elif t[0].lower() == 'j':
        return 'a'
    else:
        return None    


# Call to `rel_ext.experiment`:
##### YOUR CODE HERE    

synsets_results = rel_ext.experiment(
    splits,
    featurizers=[synset_featurizer],
    verbose=True)


[nltk_data] Downloading package wordnet to /home/ganesh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.811      0.353      0.644        340       5716
author                    0.758      0.430      0.658        509       5885
capital                   0.550      0.232      0.431         95       5471
contains                  0.784      0.589      0.736       3904       9280
film_performance          0.759      0.555      0.707        766       6142
founders                  0.733      0.368      0.612        380       5756
genre                     0.521      0.224      0.411        170       5546
has_sibling               0.821      0.220      0.531        499       5875
has_spouse                0.811      0.303      0.607        594       5970
is_a                      0.586      0.225      0.444        497       5873
nationality               0.511      0.153      0.348        301       5677
parents     

In [30]:
def test_synset_featurizer(corpus):
    from collections import defaultdict
    kbt = rel_ext.KBTriple(rel='worked_at', sbj='Randall_Munroe', obj='xkcd')
    feature_counter = defaultdict(int)
    # Make sure `feature_counter` is being updated, not reinitialized:
    feature_counter["Synset('be.v.01')"] += 5
    feature_counter = synset_featurizer(kbt, corpus, feature_counter)
    # The full return values for this tend to be long, so we just
    # test a few examples to avoid cluttering up this notebook.
    test_cases = {
        "Synset('be.v.01')": 6,
        "Synset('embody.v.02')": 1
    }
    for ss, expected in test_cases.items():   
        result = feature_counter[ss]
        assert result == expected, \
            "Incorrect count for {}: Expected {}; Got {}".format(ss, expected, result)

In [31]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_synset_featurizer(corpus)

### Your original system [3 points]

There are many options, and this could easily grow into a project. Here are a few ideas:

- Try out different classifier models, from `sklearn` and elsewhere.
- Add a feature that indicates the length of the middle.
- Augment the bag-of-words representation to include bigrams or trigrams (not just unigrams).
- Introduce features based on the entity mentions themselves. <!-- \[SPOILER: it helps a lot, maybe 4% in F-score. And combines nicely with the directional features.\] -->
- Experiment with features based on the context outside (rather than between) the two entity mentions — that is, the words before the first mention, or after the second.
- Try adding features which capture syntactic information, such as the dependency-path features used by Mintz et al. 2009. The [NLTK](https://www.nltk.org/) toolkit contains a variety of [parsing algorithms](http://www.nltk.org/api/nltk.parse.html) that may help.
- The bag-of-words representation does not permit generalization across word categories such as names of people, places, or companies. Can we do better using word embeddings such as [GloVe](https://nlp.stanford.edu/projects/glove/)?
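
As one illustration of the ideas above, here is a sketch of a featurizer for the context outside the two mentions. It assumes that each example exposes `left` and `right` attributes holding the text before the first mention and after the second; these attribute names are an assumption, since only `middle`, `middle_POS`, `mention_1_POS`, and `mention_2_POS` are used elsewhere in this notebook. The stand-in classes, the toy sentence, and the name `context_window_featurizer` are invented for this example.

```python
from collections import defaultdict, namedtuple

# Minimal stand-ins; `left` and `right` are assumed attribute names.
Example = namedtuple('Example', 'left middle right')
KBTriple = namedtuple('KBTriple', 'rel sbj obj')

class ToyCorpus:
    def __init__(self, examples_by_pair):
        self._examples = examples_by_pair

    def get_examples_for_entities(self, e1, e2):
        return self._examples.get((e1, e2), [])

def context_window_featurizer(kbt, corpus, feature_counter, window=2):
    """Count up to `window` words on each side of the mention pair.

    Assumes `window` >= 1. Features are marked for side and direction.
    """
    directions = (('_SO', (kbt.sbj, kbt.obj)), ('_OS', (kbt.obj, kbt.sbj)))
    for suffix, (e1, e2) in directions:
        for ex in corpus.get_examples_for_entities(e1, e2):
            for word in ex.left.split()[-window:]:
                feature_counter['LEFT_' + word + suffix] += 1
            for word in ex.right.split()[:window]:
                feature_counter['RIGHT_' + word + suffix] += 1
    return feature_counter

toy_corpus = ToyCorpus({
    ('X', 'Y'): [Example(left='Everyone knows that',
                         middle='and his son',
                         right='live in Paris')],
})
toy_kbt = KBTriple(rel='parents', sbj='X', obj='Y')
features = context_window_featurizer(toy_kbt, toy_corpus, defaultdict(int))
print(dict(features))
```

Because it updates and returns `feature_counter`, a function of this shape could be added to the `featurizers` list alongside the others.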

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

# Enter your system description in this cell.


This is my code description attached here as requested

I have used the following features for the model:
1. Directional BOW features as defined above
2. Bigram of POS tags of the middle text as defined above
3. Synsets of the middle words and pos tags as defined above
4. Average length of the middle for both forward and backward relation (defined below)
5. Directional bag of POS tags for both entity mentions

We also tried different classifiers: SVM (linear and RBF), logistic regression, logistic regression with L1 regularization (because the features relevant to any particular relation type should be very sparse within the full feature set), AdaBoost (with logistic regression as the base estimator), and random forest. The best-performing model was the random forest classifier, with a score of 78.9. The logistic regression model with L2 regularization achieved a score of 67.0, and the one with L1 regularization achieved 67.8. We also tried a calibrated version of the random forest so that the class probabilities are better estimated, and found that it still performs well, with a drop of only 0.9 points.

Note: We also experimented with bigram features but did not find any improvement that would justify the added complexity and feature count.

My peak score was: 0.789


In [85]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

def middle_len_featurizer(kbt, corpus, feature_counter):
    # Average character length of the middle span, computed separately
    # for subject-object ("forward") and object-subject ("backward") examples.
    forward = corpus.get_examples_for_entities(kbt.sbj, kbt.obj)
    feature_counter["middle_len_forward"] = 0
    if len(forward) > 0:
        feature_counter["middle_len_forward"] = sum(len(ex.middle) for ex in forward) / len(forward)

    backward = corpus.get_examples_for_entities(kbt.obj, kbt.sbj)
    feature_counter["middle_len_backward"] = 0
    if len(backward) > 0:
        feature_counter["middle_len_backward"] = sum(len(ex.middle) for ex in backward) / len(backward)

    return feature_counter

def entity_pos_featurizer(kbt, corpus, feature_counter):
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for pos in get_tags(ex.mention_1_POS):
            feature_counter[pos+"_1_SO"] += 1
        for pos in get_tags(ex.mention_2_POS):
            feature_counter[pos+"_2_SO"] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for pos in get_tags(ex.mention_1_POS):
            feature_counter[pos+"_1_OS"] += 1
        for pos in get_tags(ex.mention_2_POS):
            feature_counter[pos+"_2_OS"] += 1
    return feature_counter

def directional_bag_of_words_bigrams_featurizer(kbt, corpus, feature_counter): 
    subject_object_suffix = "_SO"
    object_subject_suffix = "_OS"

    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        words = list(map(lambda x: x+subject_object_suffix, ex.middle.split(' ')))
        START_TOKEN = "<START>"
        END_TOKEN = "<END>"
        for bigram in zip([START_TOKEN]+words, words + [END_TOKEN]):
            feature_counter[str(bigram)] += 1
    # Reverse direction: look up examples with the object mentioned first.
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        words = list(map(lambda x: x+object_subject_suffix, ex.middle.split(' ')))
        START_TOKEN = "<START>"
        END_TOKEN = "<END>"
        for bigram in zip([START_TOKEN]+words, words + [END_TOKEN]):
            feature_counter[str(bigram)] += 1
            
    return feature_counter

def find_best_model_factory():
    logistic = lambda: LogisticRegression(fit_intercept=True, solver='liblinear', random_state=42, max_iter=200)
    logistic_l1 = lambda: LogisticRegression(fit_intercept=True, solver='liblinear', random_state=42, max_iter=200, penalty='l1')
    rf = lambda: RandomForestClassifier(n_jobs=-1, random_state=42)
    rf_calibrated = lambda: CalibratedClassifierCV(base_estimator=RandomForestClassifier(n_jobs=-1, random_state=42), method='isotonic', cv=5)
    adaboost_decision = lambda: AdaBoostClassifier(random_state=42)
    adaboost_linear = lambda: AdaBoostClassifier(base_estimator=LogisticRegression(fit_intercept=True, solver='liblinear', random_state=42, max_iter=200), random_state=42)
    svm_linear = lambda: SVC(kernel='linear')
    svm = lambda: SVC()
    models = {}
    featurizers = [synset_featurizer, middle_bigram_pos_tag_featurizer, 
                   directional_bag_of_words_featurizer, entity_pos_featurizer, middle_len_featurizer]

    best_original_system = None
    best_score = 0
    best_model = None
    for model_factory in [logistic, logistic_l1, rf, rf_calibrated, adaboost_decision, adaboost_linear, svm_linear, svm]:
        print(model_factory())
        original_system_results = rel_ext.experiment(
            splits,
            train_split='train',
            test_split='dev',
            featurizers=featurizers,
            model_factory=model_factory,
            verbose=True)
        models[model_factory()] = original_system_results
        score = original_system_results['score']
        if(score > best_score):
            best_score = score
            best_original_system = original_system_results
            best_model = model_factory()
    print(best_score, best_model)
    return best_score, best_model, best_original_system, models

In [86]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    best_score, best_model, best_original_system, models = find_best_model_factory()

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=200,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.819      0.438      0.698        340       5716
author                    0.866      0.695      0.825        509       5885
capital                   0.636      0.295      0.517         95       5471
contains                  0.848      0.741      0.824       3904       9280
film_performance          0.861      0.751      0.836        766       6142
founders                  0.792      0.521      0.717        380       5756
genre                     0.794      0.659      0.763        170

relation              precision     recall    f-score    support       size
------------------    ---------  ---------  ---------  ---------  ---------
adjoins                   0.778      0.432      0.671        340       5716
author                    0.831      0.646      0.786        509       5885
capital                   0.576      0.358      0.514         95       5471
contains                  0.806      0.681      0.778       3904       9280
film_performance          0.842      0.711      0.812        766       6142
founders                  0.753      0.458      0.667        380       5756
genre                     0.696      0.471      0.635        170       5546
has_sibling               0.672      0.317      0.549        499       5875
has_spouse                0.757      0.394      0.639        594       5970
is_a                      0.688      0.390      0.597        497       5873
nationality               0.527      0.296      0.455        301       5677
parents     

In [114]:
# Examine feature importances for RF classifier
def examine_model_weights(train_result, k=3, verbose=True):
    vectorizer = train_result['vectorizer']

    if vectorizer is None:
        print("Model weights can be examined only if the featurizers "
              "are based in dicts (i.e., if `vectorize=True`).")
        return

    feature_names = vectorizer.get_feature_names()
    for rel, model in train_result['models'].items():
        print('Highest and lowest feature weights for relation {}:\n'.format(rel))
        # Only tree-based models (e.g., random forests) expose
        # `feature_importances_`; skip models that do not.
        coefs = getattr(model, 'feature_importances_', None)
        if coefs is None:
            print('     (no feature importances available for this model)\n')
            continue
        sorted_weights = sorted(
            [(wgt, idx) for idx, wgt in enumerate(coefs)], reverse=True)
        for wgt, idx in sorted_weights[:k]:
            print('{:10.3f} {}'.format(wgt, feature_names[idx]))
        print('{:>10s} {}'.format('.....', '.....'))
        for wgt, idx in sorted_weights[-k:]:
            print('{:10.3f} {}'.format(wgt, feature_names[idx]))
        print()

if 'IS_GRADESCOPE_ENV' not in os.environ:
    examine_model_weights(best_original_system)

Highest and lowest feature weights for relation adjoins:

     0.032 NNP_1_SO
     0.029 NNP_2_SO
     0.021 CC </s>
     ..... .....
     0.000 #_1_OS
     0.000 # NNP
     0.000 # NN

Highest and lowest feature weights for relation author:

     0.027 by_SO
     0.024 DT_2_OS
     0.023 Synset('by.r.01')
     ..... .....
     0.000 #_1_OS
     0.000 # NNP
     0.000 # NN

Highest and lowest feature weights for relation capital:

     0.032 NNP_1_OS
     0.028 NNP_2_OS
     0.028 Synset('capital.n.05')
     ..... .....
     0.000 #_1_OS
     0.000 # NNP
     0.000 # NN

Highest and lowest feature weights for relation contains:

     0.042 NNP_2_OS
     0.031 NNP_1_OS
     0.025 middle_len_backward
     ..... .....
     0.000 #_1_OS
     0.000 # NNP
     0.000 # NN

Highest and lowest feature weights for relation film_performance:

     0.017 Synset('movie.n.01')
     0.015 Synset('star.v.02')
     0.015 DT_1_OS
     ..... .....
     0.000 #_1_OS
     0.000 # NNP
     0.000 # NN

Highe
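
A note on reading the output above: impurity-based importances from tree ensembles are non-negative, so the "lowest" entries are simply features the forest never used, not evidence against the relation (as a negative weight in a linear model would be). A minimal sketch of a ranking that drops the zero entries follows; `top_nonzero_features` and the toy values are hypothetical illustrations, not part of `rel_ext`.

```python
def top_nonzero_features(importances, feature_names, k=3):
    """Return up to k (importance, name) pairs with importance > 0, largest first."""
    ranked = sorted(
        zip(importances, feature_names),
        key=lambda pair: pair[0],
        reverse=True)
    return [(wgt, name) for wgt, name in ranked if wgt > 0][:k]


# Toy values: the non-zero entries echo the 'adjoins' output above,
# and the zeros stand in for features the model never split on.
toy_names = ['NNP_1_SO', '#_1_OS', 'NNP_2_SO', '# NN', 'CC </s>']
toy_importances = [0.032, 0.0, 0.029, 0.0, 0.021]

for wgt, name in top_nonzero_features(toy_importances, toy_names, k=3):
    print('{:10.3f} {}'.format(wgt, name))
```

In the notebook, the same filter could be applied to `model.feature_importances_` and the vectorizer's feature names, assuming the selected model is tree-based.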

In [33]:
# Please do not remove this comment.

In [115]:
def find_new_relation_instances(
        splits,
        trained_model,
        train_split='train',
        test_split='dev',
        k=10,
        vectorize=True,
        verbose=True):
    train_result = trained_model
    test_split = splits[test_split]
    neg_o, neg_y = test_split.build_dataset(
        include_positive=False,
        sampling_rate=0.1)
    neg_X, _ = test_split.featurize(
        neg_o,
        featurizers=train_result['featurizers'],
        vectorizer=train_result['vectorizer'],
        vectorize=vectorize)
    # Report highest confidence predictions:
    for rel, model in train_result['models'].items():
        print('Highest probability examples for relation {}:\n'.format(rel))
        probs = model.predict_proba(neg_X[rel])
        probs = [prob[1] for prob in probs] # probability for class True
        sorted_probs = sorted([(p, idx) for idx, p in enumerate(probs)], reverse=True)
        for p, idx in sorted_probs[:k]:
            print('{:10.3f} {}'.format(p, neg_o[rel][idx]))
        print()

if 'IS_GRADESCOPE_ENV' not in os.environ:
    find_new_relation_instances(splits, best_original_system, k=10)

Highest probability examples for relation adjoins:

     0.912 KBTriple(rel='adjoins', sbj='Kolhapur', obj='Sangli')
     0.740 KBTriple(rel='adjoins', sbj='Bandung', obj='Indonesia')
     0.720 KBTriple(rel='adjoins', sbj='Caribbean', obj='South_America')
     0.690 KBTriple(rel='adjoins', sbj='El_Salvador', obj='Central_America')
     0.675 KBTriple(rel='adjoins', sbj='Shanghai', obj='Tianjin')
     0.675 KBTriple(rel='adjoins', sbj='Cambridge', obj='Oxford')
     0.675 KBTriple(rel='adjoins', sbj='Lahore', obj='Karachi')
     0.660 KBTriple(rel='adjoins', sbj='Microsoft_Windows', obj='Linux')
     0.650 KBTriple(rel='adjoins', sbj='Homer', obj='Iliad')
     0.650 KBTriple(rel='adjoins', sbj='Caribbean', obj='North_America')

Highest probability examples for relation author:

     1.000 KBTriple(rel='author', sbj='The_Man_with_the_Golden_Arm', obj='Otto_Preminger')
     0.992 KBTriple(rel='author', sbj='Francis_of_Assisi', obj='Nikos_Kazantzakis')
     0.990 KBTriple(rel='author', sb

     0.870 KBTriple(rel='parents', sbj='Jacob', obj='Sarah')
     0.750 KBTriple(rel='parents', sbj='Carmarthenshire', obj='Rhys_ap_Tewdwr')
     0.730 KBTriple(rel='parents', sbj='Nona_Gaye', obj='Marvin_Gaye')
     0.710 KBTriple(rel='parents', sbj='Copenhagen', obj='New_York')
     0.700 KBTriple(rel='parents', sbj='Germanicus', obj='Agrippina_the_Younger')
     0.680 KBTriple(rel='parents', sbj='Kadapa', obj='Pulivendula')
     0.680 KBTriple(rel='parents', sbj='Caligula', obj='Agrippina_the_Elder')
     0.670 KBTriple(rel='parents', sbj='Valerian_II', obj='Gallienus')
     0.650 KBTriple(rel='parents', sbj='Wilhelm_Friedemann_Bach', obj='Luigi_Boccherini')
     0.600 KBTriple(rel='parents', sbj='Sarah', obj='Esau')

Highest probability examples for relation place_of_birth:

     0.890 KBTriple(rel='place_of_birth', sbj='Screenwriter', obj='Wales')
     0.878 KBTriple(rel='place_of_birth', sbj='Luigi_Luca_Cavalli-Sforza', obj='Stanford_University')
     0.847 KBTriple(rel='place_of

## Bake-off [1 point]

For the bake-off, we will release a test set. The announcement will go out on the discussion forum. You will evaluate your custom model from the previous question on this new test set using the function `rel_ext.bake_off_experiment`. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

The cells below this one constitute your bake-off entry.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

The announcement will include the details on where to submit your entry.
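
The score you report below is a macro-averaged F_0.5, which weights precision more heavily than recall. As a rough illustration of how the per-relation numbers in the dev tables above relate to each other, here is a minimal sketch; `f_beta` and `macro_average` are hypothetical helpers written for this note, and the actual computation inside `rel_ext` may differ in details such as rounding or handling of empty relations.

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta from precision and recall; beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_average(scores):
    """Unweighted mean, so every relation counts equally regardless of size."""
    return sum(scores) / len(scores)


# Precision and recall for two relations, copied from the first dev table above.
rows = {'adjoins': (0.819, 0.438), 'author': (0.866, 0.695)}
per_relation = {rel: f_beta(p, r) for rel, (p, r) in rows.items()}

for rel, score in per_relation.items():
    print('{:10s} {:.3f}'.format(rel, score))
print('macro-average of these two rows: {:.3f}'.format(
    macro_average(list(per_relation.values()))))
```

The two per-relation values round to 0.698 and 0.825, matching the f-score column for `adjoins` and `author` in that table.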

In [None]:
# Enter your bake-off assessment code in this cell. 
# Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your code in the scope of the above conditional.
    ##### YOUR CODE HERE




In [None]:
# On an otherwise blank line in this cell, please enter
# your macro-average f-score (an F_0.5 score) as reported 
# by the code above. Please enter only a number between 
# 0 and 1 inclusive. Please do not remove this comment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    pass
    # Please enter your score in the scope of the above conditional.
    ##### YOUR CODE HERE


