# LAB 3: Automated Terminology Extraction

Extract technical terms from the ACL Anthology.

Objectives:
* part-of-speech tagging with spaCy
* extract phrases that match a part-of-speech pattern
* scale the processing pipeline with Dask
* compute c-values

## Part I: Test c-value function

In [1]:
import pandas as pd
import numpy as np
from cytoolz import *
from tqdm.auto import tqdm
tqdm.pandas()

In [4]:
df = pd.read_parquet('s3://ling583/acl.parquet', storage_options={'anon':True})

In [5]:
df.head()

| Unnamed: 0 | year | tag | title | text | abstract | body |
|---|---|---|---|---|---|---|
| 0 | 1990 | P90-1001 | Polynomial Time Parsing of Combinatory Categor... | Polynomial Time Parsing of Combinatory Categor... | In this paper we present a polynomial time par... | Combinatory Categorial Grammar (CCG) is an ex... |
| 1 | 1990 | P90-1002 | Structure and Intonation in Spoken Language Un... | Structure and Intonation in Spoken Language Un... | The structure imposed upon spoken sentences by... | Halliday observed that this constraint, which... |
| 2 | 1990 | P90-1003 | Prosody, Syntax and Parsing | Prosody, Syntax and Parsing We describe the mo... | We describe the modification of a grammar to t... | Prosodic information can mark lexical stress, ... |
| 3 | 1990 | P90-1004 | Empirical Study of Predictive Powers of Simple... | Empirical Study of Predictive Powers of Simple... | This empirical study attempts to find answers ... | Difficulty in resolving structural ambiguity i... |
| 4 | 1990 | P90-1005 | Structural Disambiguation With Constraint Prop... | Structural Disambiguation With Constraint Prop... | We present a new grammatical formalism called ... | We are interested in an efficient treatment of... |


In [6]:
# There are 6167 articles. This is more than we need for this part
len(df)

6167

In [7]:
# We cut the dataframe down to a smaller size
# Note: Each time this is run, if the random state is the same, 
# we will receive the same "random" samples
df = df.sample(500, random_state=100)

In [8]:
len(df)

500

### Set up spaCy

In [9]:
import spacy

In [10]:
# en_core_web_sm is an English model trained on web text
# "sm" means small; larger models are available, but we don't need them here
## Excluded components:
# parser finds the syntactic structure of sentences
# ner (named entity recognizer) pulls out names of people, places, and other entities
# lemmatizer strips inflection from words to find their base form
# attribute_ruler applies rule-based mappings to token attributes (for example, tag-to-POS mappings)
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner', 'lemmatizer', 'attribute_ruler'])

In [11]:
# These are the components we are left with after the exclusions
# tok2vec (token to vector) turns each token into a vector that the tagger uses as input
# tagger assigns the part-of-speech tag
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f37efac4360>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f37efae22c0>)]

In [12]:
# Run our text through the pipeline
doc = nlp(df['text'].iloc[0])

In [None]:
# take a look at what the results are
doc

In [14]:
# can look at the data token by token
doc[0]

Resume

In [15]:
# can also look at the data in spans
doc[0:10]

Resume Information Extraction with Cascaded Hybrid Model This paper presents

In [17]:
# .tag_ part of speech label
# .norm_ normalized form
doc[0].tag_, doc[0].norm_

('NNP', 'resume')

In [18]:
from spacy.matcher import Matcher

In [20]:
# Create a matcher and link it to our vocabulary established above
matcher = Matcher(nlp.vocab)

# Add a rule that defines candidate terms
# 'IN' as a dictionary key is the matcher's set-membership operator: the tag must be in the list
# Tags used below:
# JJ = adjective
# NN = noun (singular or mass)
# IN = preposition (the tag, not to be confused with the 'IN' operator)
# HYPH = hyphen
# OP = operator that sets how often a token pattern may match, like a regular expression quantifier (* = zero or more times)
matcher.add('Term', [[{'TAG': {'IN': ['JJ', 'NN']}},
                      {'TAG': {'IN': ['JJ', 'NN', 'IN', 'HYPH']}, 'OP': '*'},
                      {'TAG': 'NN'}]])
# This amounts to: an adjective or noun, followed by any number of adjectives, nouns, prepositions, or hyphens, ending with a noun

In [21]:
spans = matcher(doc, as_spans=True)

In [22]:
tuple(tok.norm_ for tok in spans[0])

('effective', 'approach')

### Extract candidate terms

In [23]:
def get_candidates(text):
    doc = nlp(text) # tokenize and tag
    spans = matcher(doc, as_spans=True) # find all of the spans that satisfy the rules above
    return [tuple(tok.norm_ for tok in span) for span in spans] # return a list of all of the spans converted to tuples of normalized strings

In [None]:
# Check the function on the first article in the sample
get_candidates(df['text'].iloc[0])

In [25]:
# Run the process on all 500 sampled articles
candidates = list(concat(df['text'].progress_apply(get_candidates)))


### Compute c-values

$$\mbox{C-value}(a)=\begin{cases}\log_2|a|\cdot f(a) & \mbox{if } a \mbox{ is not nested}\\\log_2|a|\left(f(a)-\frac{1}{P(T_a)}\sum_{b\in T_a}f(b)\right) & \mbox{otherwise}\\\end{cases}$$
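
In this formula, $|a|$ is the length of candidate $a$ in tokens, $f(a)$ is its frequency, $T_a$ is the set of longer candidate terms that contain $a$, and $P(T_a)$ is the number of such terms. As a rough worked example with made-up counts, the two cases can be evaluated for a single candidate. The function name `c_value_single` and its arguments are hypothetical and exist only for this illustration.

```python
from math import log2

def c_value_single(length, freq, nesting_freqs):
    """C-value of one candidate.

    length        -- number of tokens in the candidate, |a|
    freq          -- frequency of the candidate, f(a)
    nesting_freqs -- frequencies f(b) of the longer terms b that contain it
    """
    if not nesting_freqs:
        # Not nested: log2|a| * f(a)
        return log2(length) * freq
    # Nested: subtract the mean frequency of the containing terms
    return log2(length) * (freq - sum(nesting_freqs) / len(nesting_freqs))

# Made-up counts: a two-token candidate seen 10 times.
print(c_value_single(2, 10, []))       # not nested
print(c_value_single(2, 10, [3, 1]))   # nested in two longer terms
```

With these counts the un-nested case gives `10.0` and the nested case gives `8.0`, because the mean frequency of the containing terms, `(3 + 1) / 2`, is subtracted before the length weighting.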


In [26]:
from collections import defaultdict, Counter

In [27]:
# Create a dict
# Keys = sequence lengths
# Values = counter of sequences of that length
freqs = defaultdict(Counter)
for c in candidates:
    freqs[len(c)][c] += 1
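
As a small illustration of the shape of this structure, the same loop can be run on a handful of made-up candidates (the `toy_` names and the counts are assumptions, not data from the corpus):

```python
from collections import Counter, defaultdict

# Made-up candidate tuples, as get_candidates would return them.
toy_candidates = [
    ('language', 'model'),
    ('language', 'model'),
    ('neural', 'language', 'model'),
    ('test', 'set'),
]

# Outer key: length in tokens. Inner Counter: candidate -> frequency.
toy_freqs = defaultdict(Counter)
for c in toy_candidates:
    toy_freqs[len(c)][c] += 1

print(sorted(toy_freqs))                      # lengths that occur
print(toy_freqs[2][('language', 'model')])    # frequency of one candidate
```

Grouping by length first makes it cheap to walk candidates from longest to shortest, which is the order the C-value computation needs.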

In [29]:
# the longest sequence is 32 tokens long
freqs.keys()

dict_keys([2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 12, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32])

In [31]:
# This one is an equation that the PDF-to-text converter garbled
freqs[32]

Counter({('p',
          'conj',
          'p',
          '#',
          'w',
          'p',
          'p',
          'exp',
          'c',
          'p',
          'conj',
          'p',
          '#',
          'w',
          'p',
          'p',
          'exp',
          'h',
          'conj',
          'p',
          '#',
          'c',
          'p',
          '#',
          'w',
          'p',
          's#h',
          'exp',
          'p',
          'p',
          'top',
          'c'): 1})

In [33]:
freqs[5].most_common(5)

[(('part', '-', 'of', '-', 'speech'), 213),
 (('end', '-', 'to', '-', 'end'), 99),
 (('tree', '-', 'to', '-', 'string'), 57),
 (('sequence', '-', 'to', '-', 'sequence'), 41),
 (('state', '-', 'ofthe', '-', 'art'), 37)]

In [34]:
from nltk import ngrams

In [35]:
# To list the subsequences of a term of length 5,
# start at length N-1 (5-1 = 4) and count down to 2
list(range(4, 1, -1))

[4, 3, 2]

In [36]:
# Yield every contiguous subsequence of the term, from length k-1 down to length 2
def get_subterms(term):
    k = len(term)
    for m in range(k-1, 1, -1):
        yield from ngrams(term, m)

In [37]:
# Test on one of the previously found candidates
list(get_subterms(('part', '-', 'of', '-', 'speech')))

[('part', '-', 'of', '-'),
 ('-', 'of', '-', 'speech'),
 ('part', '-', 'of'),
 ('-', 'of', '-'),
 ('of', '-', 'speech'),
 ('part', '-'),
 ('-', 'of'),
 ('of', '-'),
 ('-', 'speech')]
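
For readers without NLTK installed, the same enumeration can be sketched with tuple slicing. The names `ngrams_plain` and `subterms_plain` are hypothetical stand-ins, not part of the lab code.

```python
def ngrams_plain(seq, n):
    """All contiguous length-n slices of seq, left to right."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def subterms_plain(term):
    """Contiguous subsequences of term, from length len(term)-1 down to 2."""
    for m in range(len(term) - 1, 1, -1):
        yield from ngrams_plain(term, m)

term = ('part', '-', 'of', '-', 'speech')
subs = list(subterms_plain(term))
print(len(subs))
print(subs[0], subs[-1])
```

A term of length k has k(k-1)/2 - 1 such subterms, so the five-token example has 9, the same number of entries as the output above.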

In [38]:
from math import log2

In [39]:
# F = frequency data structure defined above, keyed by sequence length
# theta = threshold, the C-value above which we consider candidates to be terms
def c_value(F, theta):
    
    # Keep track of terms as we identify them
    termhood = Counter()
    
    # For each shorter sequence, collect the frequencies of the accepted longer terms that contain it
    longer = defaultdict(list)
    
    # k is the sequence length, starting with the longest
    for k in sorted(F, reverse=True):
        for term in F[k]:
            # if the term is a subsequence of a longer one that we have seen already
            if term in longer:
                discount = sum(longer[term]) / len(longer[term])
            # if there are no longer sequences of it, there is no discount
            else:
                discount = 0
            c = log2(k) * (F[k][term] - discount)
            if c > theta:
                termhood[term] = c
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood
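
To trace how the discount propagates from longer terms to shorter ones, here is a self-contained restatement of the same loop run on made-up counts. The names `c_value_toy`, `subterms`, and `toy_F`, as well as the frequencies and the threshold, are assumptions chosen so the arithmetic is easy to follow by hand.

```python
from collections import Counter, defaultdict
from math import log2

def subterms(term):
    """Contiguous subsequences of term, from length len(term)-1 down to 2."""
    for m in range(len(term) - 1, 1, -1):
        for i in range(len(term) - m + 1):
            yield term[i:i + m]

def c_value_toy(F, theta):
    termhood = Counter()
    longer = defaultdict(list)  # subterm -> frequencies of accepted longer terms
    for k in sorted(F, reverse=True):      # longest candidates first
        for term in F[k]:
            seen = longer.get(term, [])
            discount = sum(seen) / len(seen) if seen else 0
            c = log2(k) * (F[k][term] - discount)
            if c > theta:
                termhood[term] = c
                # Only accepted terms pass their frequency down to subterms.
                for sub in subterms(term):
                    if sub in F.get(len(sub), ()):
                        longer[sub].append(F[k][term])
    return termhood

# Made-up frequencies.
toy_F = {
    3: Counter({('natural', 'language', 'processing'): 4}),
    2: Counter({('natural', 'language'): 10,
                ('language', 'processing'): 4,
                ('test', 'set'): 3}),
}

toy_terms = c_value_toy(toy_F, theta=2)
for t, c in toy_terms.most_common():
    print(f'{c:6.2f} {" ".join(t)}')
```

In this toy run `natural language` keeps a score of 10 - 4 = 6, while `language processing` drops to 4 - 4 = 0 and is filtered out, since it never occurs outside the three-token term.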

In [40]:
terms = c_value(freqs, theta=75)

In [41]:
for t, c in terms.most_common(20):
    print(f'{c:8.2f} {freqs[len(t)][t]:4d} {" ".join(t)}')

  446.00  446 language model
  420.27  213 part - of - speech
  317.00  458 natural language
  310.00  310 training set
  307.48  194 sentence - level
  298.00  381 machine translation
  271.00  271 other hand
  265.00  265 test set
  261.00  261 previous work
  251.00  306 neural network
  242.50  153 word - level
  238.00  238 word alignment
  229.87   99 end - to - end
  223.48  141 natural language processing
  220.00  220 future work
  209.22  132 n - gram
  209.22  132 large - scale
  198.00  270 co -
  194.95  123 f - measure
  193.37  122 f - score
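
A few rows of this output can be checked by hand against the formula, using only the frequencies printed in the two listings of this section. The sketch assumes each of the two discounted terms is nested in exactly one accepted longer term, which is consistent with the printed scores.

```python
from math import log2

# 'sentence - level': 3 tokens, frequency 194, not nested
sentence_level = log2(3) * 194

# 'part - of - speech': 5 tokens, frequency 213,
# discounted by 'part - of - speech tagging' (frequency 32)
part_of_speech = log2(5) * (213 - 32)

# 'natural language': 2 tokens, frequency 458,
# discounted by 'natural language processing' (frequency 141)
natural_language = log2(2) * (458 - 141)

print(f'{sentence_level:.2f} {part_of_speech:.2f} {natural_language:.2f}')
```

The three values round to 307.48, 420.27, and 317.00, matching the printed scores. This also shows why a two-token term's score equals its (discounted) frequency: log2(2) is 1.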


In [42]:
for t, c in tail(20, terms.most_common()):
    print(f'{c:8.2f} {freqs[len(t)][t]:4d} {" ".join(t)}')

   84.00   84 target domain
   83.00   83 sentence level
   82.72   32 part - of - speech tagging
   80.83   51 part of speech
   80.00   80 first step
   80.00   80 head word
   80.00   80 sentence pair
   79.00   79 time step
   78.00   78 baseline system
   78.00   78 - word
   77.66   49 t - test
   77.66   49 character - level
   77.00   77 related work
   77.00   77 sentence compression
   76.00   76 mutual information
   76.00   76 re -
   76.00   76 small number
   76.00   76 deep learning
   76.00   76 recent work
   76.00   76 text classification
