# Phrase extraction

This notebook explains the phrase extraction architecture, which can be used for entity salience detection. Here, phrase extraction is the task of extracting, for a given entity, all non-overlapping phrases (word sequences) from a document. Different entities yield different phrases for the same document. The dataset contains manually labelled phrases per entity per document.
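
To make the task definition concrete, the sketch below represents the phrases for one entity as half-open token spans and checks that they do not overlap. The `PhraseSpan` type, the helper names, and the example sentence and spans are hypothetical illustrations; they are not part of the notebook's `utilities` package or its dataset.

```python
from typing import List, NamedTuple


class PhraseSpan(NamedTuple):
    """Hypothetical half-open token span [start, end) for one phrase."""
    start: int
    end: int


def spans_overlap(spans: List[PhraseSpan]) -> bool:
    """Return True if any two spans share at least one token."""
    ordered = sorted(spans)
    # After sorting by start, any overlap shows up between neighbours.
    return any(prev.end > nxt.start for prev, nxt in zip(ordered, ordered[1:]))


def extract_phrases(tokens: List[str], spans: List[PhraseSpan]) -> List[str]:
    """Join the tokens covered by each span into a phrase string."""
    if spans_overlap(spans):
        raise ValueError("phrases for one entity must not overlap")
    return [" ".join(tokens[s.start:s.end]) for s in sorted(spans)]


# Made-up example: two entities select different phrases from the same text.
tokens = "the envoy said the talks would resume next week".split()
spans_by_entity = {
    "envoy": [PhraseSpan(0, 3)],
    "talks": [PhraseSpan(3, 5), PhraseSpan(5, 9)],
}
phrases_by_entity = {
    entity: extract_phrases(tokens, spans)
    for entity, spans in spans_by_entity.items()
}
```

In this made-up example the entity `envoy` maps to the single phrase "the envoy said", while `talks` maps to "the talks" and "would resume next week".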

In [2]:
import os
import pandas as pd
import sys
import string
import nltk

# Add the directory two levels up to the import path so that `utilities` resolves.
sys.path.append(os.path.join('..', '..'))

from utilities.dataset import (
    load_dataset, generate_embeddings, Embeddings, Tokenizer, TextEncoder,
    LowercaseTransformer, compute_phrase_mask, create_wikiphrase_dataset,
)

## Dataset

In this section, the dataset is loaded and split into train, test and validation sets.

### Load the dataset

The first step is to load the data and the annotations. After loading, the annotations are available in a pandas DataFrame.

In [3]:
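# Load the GloVe 6B 50d vectors; quoting=3 (csv.QUOTE_NONE) keeps quote characters as ordinary tokens.
df_glove = pd.read_csv(os.path.join('..', '..', 'data', 'glove', 'glove.6B.50d.txt'), sep=' ', quoting=3, header=None, index_col=0)
word_embeddings = Embeddings(df_glove)
# Word-level tokenizer: NLTK tokenization followed by lowercasing.
word_tokenizer = Tokenizer(word_embeddings, nltk.word_tokenize, transformers=[LowercaseTransformer()])
# Character-level tokenizer over string.printable (the argument 16 is presumably the embedding size).
char_embeddings = Embeddings(generate_embeddings(list(string.printable), 16))
char_tokenizer = Tokenizer(char_embeddings, lambda token: list(token))
text_encoder = TextEncoder(word_tokenizer, char_tokenizer)
# Load the annotated phrases and encode the text columns with both tokenizers.
df_phrases = load_dataset(os.path.join('..', '..', 'data', 'wikiphrase'))
dataset = create_wikiphrase_dataset(df_phrases, text_encoder)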

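
The id sequences in the sampled row later in this notebook suggest how the character vocabulary is laid out: ids 2 and 3 mark `__START__` and `__END__`, and the characters of `string.printable` follow from id 4 onward. The sketch below reproduces that layout under those assumptions. The roles of ids 0 and 1 (assumed here to be padding and unknown tokens), the names `__PAD__` and `__UNK__`, and the helper functions are guesses; this is not the actual `Tokenizer` or `TextEncoder` implementation.

```python
import string
from typing import Dict, List

# Assumed reserved tokens. Only __START__ (2) and __END__ (3) are visible in
# the notebook output; __PAD__ and __UNK__ are hypothetical placeholders.
RESERVED = ["__PAD__", "__UNK__", "__START__", "__END__"]


def build_char_vocab() -> Dict[str, int]:
    """Map reserved tokens and every printable character to an integer id."""
    return {tok: i for i, tok in enumerate(RESERVED + list(string.printable))}


def encode_chars(token: str, vocab: Dict[str, int]) -> List[int]:
    """Encode one word as character ids wrapped in start and end markers."""
    unk = vocab["__UNK__"]
    body = [vocab.get(ch, unk) for ch in token]
    return [vocab["__START__"], *body, vocab["__END__"]]


char_vocab = build_char_vocab()
china_ids = encode_chars("China", char_vocab)
```

Under these assumptions, `china_ids` is `[2, 42, 21, 22, 27, 14, 3]`, which matches the middle entry of `entity__char_ids` in the sampled row.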


In [4]:
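from sklearn.model_selection import train_test_split

# An integer test_size is an absolute count: 20 rows are held out, not 20% of the data.
df_train, df_test = train_test_split(dataset, test_size=20, random_state=0)
# Show one random training example.
df_train.sample(1)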

| Column | Value |
|---|---|
| Unnamed: 0 | 64 |
| annotator | kevin |
| entity | China |
| entity__char_ids | `[[2, 3], [2, 42, 21, 22, 27, 14, 3], [2, 3]]` |
| entity__char_tokens | `[[__START__, __END__], [__START__, C, h, i, n,...` |
| entity__word_ids | `[2, 136, 3]` |
| entity__word_tokens | `[__START__, china, __END__]` |
| kb | China, officially the People's Republic of Chi... |
| kb__char_ids | `[[2, 3], [2, 42, 21, 22, 27, 14, 3], [2, 77, 3...` |
| kb__char_tokens | `[[__START__, __END__], [__START__, C, h, i, n,...` |
| kb__word_ids | `[2, 136, 5, 2395, 4, 73, 13, 878, 7, 136, 27, ...` |
| kb__word_tokens | `[__START__, china, ,, officially, the, people,...` |
| phrase_mask | `[0, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...` |
| salience | 0.556667 |
| text | North Korea has today agreed to return to the ... |
| text__char_ids | `[[2, 3], [2, 53, 28, 31, 33, 21, 3], [2, 50, 2...` |
| text__char_tokens | `[[__START__, __END__], [__START__, N, o, r, t,...` |
| text__word_ids | `[2, 197, 578, 35, 377, 741, 8, 502, 8, 4, 2577...` |
| text__word_tokens | `[__START__, north, korea, has, today, agreed, ...` |
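
The split above produces a train and a test set. One way to also obtain the validation set mentioned at the start of this section is to split the training portion a second time. The sketch below uses a stand-in list of 100 row indices; the variable names, the dataset size, and the validation size of 20 are assumptions, not values taken from this notebook.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the encoded dataset: 100 hypothetical row indices.
rows = list(range(100))

# Hold out 20 rows for testing, mirroring the call used in the notebook.
train_rows, test_rows = train_test_split(rows, test_size=20, random_state=0)

# Split the remaining 80 rows again to carve out a validation set of 20 rows.
train_rows, val_rows = train_test_split(train_rows, test_size=20, random_state=0)
```

With these sizes the three subsets hold 60, 20 and 20 rows and do not share any row. Fixing `random_state` keeps both splits reproducible across runs.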


In [None]:
from skorch import NeuralNetClassifier