# LAB 3: Automated Terminology Extraction

Extract technical terms from the ACL Anthology.

Objectives:
* part-of-speech tagging with spaCy
* extract phrases that match a part-of-speech pattern
* scale the processing pipeline with Dask
* compute c-values

## Part I: Test c-value function

In [1]:
import pandas as pd
import numpy as np
from cytoolz import *
from tqdm.auto import tqdm
tqdm.pandas()

In [2]:
df = pd.read_parquet('s3://ling583/acl.parquet', storage_options={'anon':True})

In [3]:
df.head()

| | year | tag | title | text | abstract | body |
|---|---|---|---|---|---|---|
| 0 | 1990 | P90-1001 | Polynomial Time Parsing of Combinatory Categor... | Polynomial Time Parsing of Combinatory Categor... | In this paper we present a polynomial time par... | Combinatory Categorial Grammar (CCG) is an ex... |
| 1 | 1990 | P90-1002 | Structure and Intonation in Spoken Language Un... | Structure and Intonation in Spoken Language Un... | The structure imposed upon spoken sentences by... | Halliday observed that this constraint, which... |
| 2 | 1990 | P90-1003 | Prosody, Syntax and Parsing | Prosody, Syntax and Parsing We describe the mo... | We describe the modification of a grammar to t... | Prosodic information can mark lexical stress, ... |
| 3 | 1990 | P90-1004 | Empirical Study of Predictive Powers of Simple... | Empirical Study of Predictive Powers of Simple... | This empirical study attempts to find answers ... | Difficulty in resolving structural ambiguity i... |
| 4 | 1990 | P90-1005 | Structural Disambiguation With Constraint Prop... | Structural Disambiguation With Constraint Prop... | We present a new grammatical formalism called ... | We are interested in an efficient treatment of... |


In [4]:
len(df)

6167

We will work with a random sample of 500 documents from the original DataFrame.

In [5]:
df = df.sample(500, random_state=100)

### spaCy Setup

In [6]:
import spacy

Load a processing pipeline. `en_core_web_sm` is a small English model. We only need part-of-speech tags, so we exclude the pipeline components we will not use.

In [7]:
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner', 'lemmatizer', 'attribute_ruler'])

In [8]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f312b50d4a0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f312b4bdf90>)]

In [10]:
doc = nlp(df['text'].iloc[0])

In [11]:
doc

Resume Information Extraction with Cascaded Hybrid Model This paper presents an effective approach for resume information extraction to support automatic resume management and routing. A cascaded information extraction (IE) framework is designed. In the first pass, a resume is segmented into a consecutive blocks attached with labels indicating the information types. Then in the second pass, the detailed information, such as Name and Address, are identified in certain blocks (e.g. blocks labelled with Personal Information), instead of searching globally in the entire resume. The most appropriate model is selected through experiments for each IE task in different passes. The experimental results show that this cascaded hybrid model achieves better F-score than flat models that do not apply the hierarchical structure of resumes. It also shows that applying different IE models in different passes according to the contextual structure is effective. Big enterprises and head-hunters receive h

In [12]:
doc[0:10]

Resume Information Extraction with Cascaded Hybrid Model This paper presents

In [13]:
doc[200]

The

In [14]:
doc[200].tag_, doc[200].norm_

('DT', 'the')

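Next, we import spaCy's rule-based `Matcher` and define a pattern for candidate terms: an adjective or noun, followed by any number of adjectives, nouns, prepositions, or hyphens, and ending in a noun.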

In [15]:
from spacy.matcher import Matcher

In [16]:
matcher = Matcher(nlp.vocab)
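# The dictionary key 'IN' is the matcher's set-membership operator;
# the string 'IN' inside the list is the preposition tag.
matcher.add('Term', [[{'TAG': {'IN': ['JJ', 'NN']}},  # first token: JJ (adjective) or NN (singular noun)
                      {'TAG': {'IN': ['JJ', 'NN', 'IN', 'HYPH']}, 'OP': '*'},  # zero or more of: JJ, NN, IN (preposition), HYPH (hyphen)
                      {'TAG': 'NN'}]])  # last token: NN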

In [17]:
spans = matcher(doc, as_spans=True)

This is the first candidate in the first document:

In [18]:
tuple(tok.norm_ for tok in spans[0])

('effective', 'approach')
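
To see why one phrase can produce many overlapping candidates, here is a minimal pure-Python sketch of the same tag pattern. It does not use spaCy: the tokens and tags are hand-written for illustration, and `match_terms`, `FIRST`, `MIDDLE`, and `LAST` are hypothetical names, not part of any library.

```python
# Hand-tagged tokens (an assumption for illustration; a real run gets tags from the spaCy tagger)
tagged = [("an", "DT"), ("effective", "JJ"), ("approach", "NN"),
          ("for", "IN"), ("resume", "NN"), ("information", "NN"),
          ("extraction", "NN")]

FIRST = {"JJ", "NN"}                 # allowed tags for the first token
MIDDLE = {"JJ", "NN", "IN", "HYPH"}  # allowed tags for any number of middle tokens
LAST = {"NN"}                        # allowed tags for the final token


def match_terms(tagged):
    """Return every span that fits FIRST MIDDLE* LAST, as tuples of tokens."""
    found = []
    for start, (_, tag) in enumerate(tagged):
        if tag not in FIRST:
            continue
        for end in range(start + 1, len(tagged)):
            end_tag = tagged[end][1]
            if end_tag in LAST:
                # Every token between start and end has already passed the MIDDLE check
                found.append(tuple(tok for tok, _ in tagged[start:end + 1]))
            if end_tag not in MIDDLE:
                break  # this token cannot be a middle token, so no longer span can match
    return found


for span in match_terms(tagged):
    print(span)
```

Every start position that fits the first slot is paired with every later noun it can reach, so a single noun phrase yields all of its qualifying sub-spans. The c-value computation later in the lab relies on this nesting.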

### Extract candidate terms

In [19]:
def get_candidates(text):
    doc = nlp(text)  # tokenize and tag the text
    spans = matcher(doc, as_spans=True)  # find every span that matches the 'Term' pattern
    return [tuple(tok.norm_ for tok in span) for span in spans]  # convert each span to a tuple of normalized strings

In [20]:
get_candidates(df['text'].iloc[0])

[('effective', 'approach'),
 ('approach', 'for', 'resume'),
 ('effective', 'approach', 'for', 'resume'),
 ('resume', 'information'),
 ('approach', 'for', 'resume', 'information'),
 ('effective', 'approach', 'for', 'resume', 'information'),
 ('information', 'extraction'),
 ('resume', 'information', 'extraction'),
 ('approach', 'for', 'resume', 'information', 'extraction'),
 ('effective', 'approach', 'for', 'resume', 'information', 'extraction'),
 ('automatic', 'resume'),
 ('resume', 'management'),
 ('automatic', 'resume', 'management'),
 ('cascaded', 'information'),
 ('information', 'extraction'),
 ('cascaded', 'information', 'extraction'),
 ('first', 'pass'),
 ('second', 'pass'),
 ('detailed', 'information'),
 ('entire', 'resume'),
 ('appropriate', 'model'),
 ('hybrid', 'model'),
 ('f', '-', 'score'),
 ('hierarchical', 'structure'),
 ('contextual', 'structure'),
 ('structured', 'information'),
 ('automatic', 'construction'),
 ('construction', 'of', 'database'),
 ('automatic', 'construc

Now we collect the candidates from every document in the sample.

In [21]:
candidates = list(concat(df['text'].progress_apply(get_candidates)))


### Compute c-values

$$\mbox{C-value}(a)=\begin{cases}\log_2|a|\cdot f(a) & \mbox{if } a \mbox{ is not nested}\\\log_2|a|\left(f(a)-\frac{1}{P(T_a)}\sum_{b\in T_a}f(b)\right) & \mbox{otherwise}\\\end{cases}$$


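Here $a$ is a candidate term, $|a|$ is its length in tokens, $f(a)$ is its frequency in the corpus, $T_a$ is the set of longer candidate terms that contain $a$, and $P(T_a)$ is the number of terms in $T_a$. A nested term is therefore discounted by the average frequency of the longer terms that contain it.

Next, we count the frequency of every candidate and group the counts by candidate length.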

In [22]:
from collections import defaultdict, Counter

In [23]:
freqs = defaultdict(Counter)
for c in candidates:
    freqs[len(c)][c] += 1

In [24]:
freqs.keys()

dict_keys([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32])

In [27]:
freqs[5].most_common(5)

[(('part', '-', 'of', '-', 'speech'), 213),
 (('end', '-', 'to', '-', 'end'), 99),
 (('tree', '-', 'to', '-', 'string'), 57),
 (('sequence', '-', 'to', '-', 'sequence'), 41),
 (('state', '-', 'ofthe', '-', 'art'), 37)]
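
The flatten-then-count step can be tried on a toy input. This is a sketch under assumptions: the per-document candidate lists are invented, and `itertools.chain.from_iterable` stands in for `concat` so that the example needs only the standard library.

```python
from collections import Counter, defaultdict
from itertools import chain

# Invented per-document candidate lists, shaped like the output of get_candidates
per_doc = [
    [("language", "model"), ("neural", "language", "model")],
    [("language", "model"), ("training", "set")],
]

# Flatten the list of lists into one list of candidates
all_candidates = list(chain.from_iterable(per_doc))

# Outer key: candidate length; inner Counter: candidate -> frequency
toy_freqs = defaultdict(Counter)
for cand in all_candidates:
    toy_freqs[len(cand)][cand] += 1

print(sorted(toy_freqs))            # [2, 3]
print(toy_freqs[2].most_common(1))  # [(('language', 'model'), 2)]
```

Grouping by length makes it easy to process the longest candidates first and to check whether a sub-term of a given length is itself a candidate.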

In [28]:
from nltk import ngrams

In [29]:
# For a 5-token term, sub-term lengths run from 5-1 = 4 down to 2 (the stop value 1 is excluded; the step is -1)
list(range(4, 1, -1))

[4, 3, 2]

In [30]:
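def get_subterms(term):
    """Yield every contiguous sub-sequence of term, from length len(term)-1 down to 2."""
    k = len(term)
    for m in range(k-1, 1, -1):  # sub-term lengths, longest first
        yield from ngrams(term, m)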

In [31]:
list(get_subterms(('part', '-', 'of', '-', 'speech')))

[('part', '-', 'of', '-'),
 ('-', 'of', '-', 'speech'),
 ('part', '-', 'of'),
 ('-', 'of', '-'),
 ('of', '-', 'speech'),
 ('part', '-'),
 ('-', 'of'),
 ('of', '-'),
 ('-', 'speech')]
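
For readers without NLTK, the same sub-term enumeration can be sketched with plain slicing. `ngrams_plain` and `subterms_plain` are hypothetical names introduced here. They are meant to mirror `ngrams` and `get_subterms` for tuple input, not to replace them.

```python
def ngrams_plain(seq, n):
    """Return all contiguous length-n slices of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def subterms_plain(term):
    """Yield contiguous sub-sequences of term, from length len(term)-1 down to 2."""
    for m in range(len(term) - 1, 1, -1):
        yield from ngrams_plain(term, m)


subs = list(subterms_plain(('part', '-', 'of', '-', 'speech')))
print(len(subs))  # 9: two 4-grams, three 3-grams, four 2-grams
```

A term of length k has k - m + 1 sub-terms of length m, so the number of sub-terms grows quadratically with term length.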

In [32]:
from math import log2

In [33]:
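def c_value(F, theta):
    """Score candidate terms; F maps term length to a Counter of term frequencies."""
    termhood = Counter()        # accepted terms and their c-values
    longer = defaultdict(list)  # sub-term -> frequencies of accepted longer terms that contain it

    # Process the longest terms first so nesting information is ready for the shorter ones
    for k in sorted(F, reverse=True):
        for term in F[k]:
            if term in longer:
                # Nested term: discount by the mean frequency of the longer terms
                discount = sum(longer[term]) / len(longer[term])
            else:
                discount = 0
            c = log2(k) * (F[k][term] - discount)  # log2(k) gives longer sequences an extra boost
            if c > theta:
                termhood[term] = c
                # Record this term's frequency against each sub-term that is itself a candidate
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood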
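
A toy run can make the discount and the threshold concrete. The sketch below restates `get_subterms` (using slicing instead of NLTK) and `c_value` so that it runs on its own. The frequencies in `toy` are invented for illustration and do not come from the ACL data.

```python
from collections import Counter, defaultdict
from math import log2


def get_subterms(term):
    """Yield contiguous sub-sequences of term, from length len(term)-1 down to 2."""
    for m in range(len(term) - 1, 1, -1):
        for i in range(len(term) - m + 1):
            yield term[i:i + m]


def c_value(F, theta):
    """Same logic as the lab's c_value, restated so the sketch is self-contained."""
    termhood = Counter()
    longer = defaultdict(list)
    for k in sorted(F, reverse=True):
        for term in F[k]:
            if term in longer:
                discount = sum(longer[term]) / len(longer[term])
            else:
                discount = 0
            c = log2(k) * (F[k][term] - discount)
            if c > theta:
                termhood[term] = c
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood


# Invented counts: 'neural language' only ever occurs inside 'neural language model'
toy = defaultdict(Counter)
toy[3][('neural', 'language', 'model')] = 4
toy[2][('language', 'model')] = 10
toy[2][('neural', 'language')] = 4

scores = c_value(toy, theta=1)
for term, score in scores.most_common():
    print(f'{score:6.2f} {" ".join(term)}')
```

With these counts, `neural language model` scores log2(3) * 4, and `language model` keeps 10 - 4 = 6 after the discount. `neural language` drops to 0 and falls below the threshold, because every one of its occurrences is inside the longer term.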

In [43]:
terms = c_value(freqs, theta=75)

In [44]:
for t, c in terms.most_common(20):
    print(f'{c:8.2f} {freqs[len(t)][t]:4d} {" ".join(t)}')

  446.00  446 language model
  420.27  213 part - of - speech
  317.00  458 natural language
  310.00  310 training set
  307.48  194 sentence - level
  298.00  381 machine translation
  271.00  271 other hand
  265.00  265 test set
  261.00  261 previous work
  251.00  306 neural network
  242.50  153 word - level
  238.00  238 word alignment
  229.87   99 end - to - end
  223.48  141 natural language processing
  220.00  220 future work
  209.22  132 n - gram
  209.22  132 large - scale
  198.00  270 co -
  194.95  123 f - measure
  193.37  122 f - score
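
A few of the scores above can be checked by hand with the formula. The sketch assumes that the only accepted longer term containing `part - of - speech` is `part - of - speech tagging` (frequency 32), and that the only one containing `natural language` is `natural language processing` (frequency 141). Both assumptions are consistent with the printed scores but are not verified against the full candidate list. `c_value_single` is a hypothetical helper.

```python
from math import log2


def c_value_single(length, freq, longer_freqs=()):
    """C-value of one term, given the frequencies of the longer terms that contain it."""
    if longer_freqs:
        discount = sum(longer_freqs) / len(longer_freqs)
    else:
        discount = 0
    return log2(length) * (freq - discount)


print(round(c_value_single(2, 446), 2))        # language model: 446.0
print(round(c_value_single(5, 213, [32]), 2))  # part - of - speech: 420.27
print(round(c_value_single(2, 458, [141]), 2)) # natural language: 317.0
print(round(c_value_single(3, 141), 2))        # natural language processing: 223.48
```

Two-token terms that are not nested score exactly their frequency because log2(2) = 1, which is why many rows in the table show identical values in the first two columns.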


In [45]:
# Look at the bottom of the list: the 20 accepted terms with the lowest c-values
for t, c in tail(20, terms.most_common()):
    print(f'{c:8.2f} {freqs[len(t)][t]:4d} {" ".join(t)}')

   84.00   84 target domain
   83.00   83 sentence level
   82.72   32 part - of - speech tagging
   80.83   51 part of speech
   80.00   80 first step
   80.00   80 head word
   80.00   80 sentence pair
   79.00   79 time step
   78.00   78 baseline system
   78.00   78 - word
   77.66   49 t - test
   77.66   49 character - level
   77.00   77 related work
   77.00   77 sentence compression
   76.00   76 mutual information
   76.00   76 re -
   76.00   76 small number
   76.00   76 deep learning
   76.00   76 recent work
   76.00   76 text classification


Next, we scale this up to all of the articles.