__Keyphrase extraction__
- Can be supervised or unsupervised;
- Can be used for terminology extraction, bulding of domain-specific dictionaries, clustering, search, summarization etc.

- Two main steps:
    - Selection of keyphrase-candidates
    - Ranking of selected candidates
    
- Datasets
    - https://github.com/snkim/AutomaticKeyphraseExtraction
    - https://scienceie.github.io/resources.html
        
- Resources 
    - [Automatic Keyphrase Extraction: A Survey of the State of the Art](http://www.hlt.utdallas.edu/~saidul/acl14.pdf)
    - [Conundrums in Unsupervised Keyphrase Extraction:
Making Sense of the State-of-the-Art](http://www.hlt.utdallas.edu/~vince/papers/coling10-keyphrase.pdf)

# Selection of keyphrase-candidates
- Not all words and phrases in a document are equally likely to convey its content;
- __Reduces the computational time__ for next steps;
- Approach: heuristic rules for __POS-based__ candidate keyphrase chunking, removal of stopwords, etc.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [8]:
import nltk
nltk.download('averaged_perceptron_tagger')

import string
import itertools

stopwords = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\didimitrov\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [17]:
def _trees_to_keyphrases(parsed_chunks):
    """
    Helper method to extract keyphrases as space-separated text from the trees of parsed chunks.
    """
    # Convert extracted to flat (CONLL - Begining, Inside, Outside of chunk) format
    chunks2conll = [nltk.chunk.tree2conlltags(_chunk) for _chunk in parsed_chunks]
    
    chunks2groups = [(key, list(group)) for _sent in chunks2conll 
                     for key, group in itertools.groupby(_sent, lambda x : x[2] != 'O')]
    print(chunks2groups)
    
    # Get only the keyphrases:
    keyphrases = [" ".join(x[0] for x in group) for key, group in chunks2groups if key]
    keyphrases = [_kp for _kp in keyphrases if all(_s not in string.punctuation for _s in _kp)]
    
    return keyphrases
    
    
def select_by_pos_tag(sentences, regexp, verbose=False):
    # POS-tag sentences 
    pos_tagged_sentences = [nltk.pos_tag(_sentence) for _sentence in sentences]
    if verbose: print("1. Pos-tagged sentences: ", pos_tagged_sentences[0])
    
    # Extract chunks matching the regexp from the sentences
    chunker = nltk.chunk.regexp.RegexpParser(regexp)
    sentence_chunks = [chunker.parse(_sentence) for _sentence in pos_tagged_sentences]
    if verbose: print("2. Extracted chunks: ", sentence_chunks[0])
    
    # Extract the keyphrases from the tree format
    candidates = _trees_to_keyphrases(sentence_chunks)
    if verbose: print("3. Extracted keyphrases:", candidates[:10])
    
    return set(candidates)

In [18]:
corn_sentences = nltk.corpus.reuters.sents(categories='corn')
print("Example input: ", corn_sentences[0])

grammar = r'KT: {(<JJ> <NN>)}'
candidates = select_by_pos_tag(corn_sentences, grammar, verbose=True)

Example input:  ['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', 'QUARTER', 'Thailand', "'", 's', 'trade', 'deficit', 'widened', 'to', '4', '.', '5', 'billion', 'baht', 'in', 'the', 'first', 'quarter', 'of', '1987', 'from', '2', '.', '1', 'billion', 'a', 'year', 'ago', ',', 'the', 'Business', 'Economics', 'Department', 'said', '.']
1. Pos-tagged sentences:  [('THAI', 'NNP'), ('TRADE', 'NNP'), ('DEFICIT', 'NNP'), ('WIDENS', 'NNP'), ('IN', 'NNP'), ('FIRST', 'NNP'), ('QUARTER', 'NNP'), ('Thailand', 'NNP'), ("'", 'POS'), ('s', 'NN'), ('trade', 'NN'), ('deficit', 'NN'), ('widened', 'VBD'), ('to', 'TO'), ('4', 'CD'), ('.', '.'), ('5', 'CD'), ('billion', 'CD'), ('baht', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('first', 'JJ'), ('quarter', 'NN'), ('of', 'IN'), ('1987', 'CD'), ('from', 'IN'), ('2', 'CD'), ('.', '.'), ('1', 'CD'), ('billion', 'CD'), ('a', 'DT'), ('year', 'NN'), ('ago', 'RB'), (',', ','), ('the', 'DT'), ('Business', 'NNP'), ('Economics', 'NNP'), ('Department', 'NNP'), ('said', '

-  Check __TreeBank POS tags__ here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- __JJ__ - Adjective; __NN__ - Noun, singular; __IN__ - Preposition/Conjunction
- $(<JJ> <NN>)$  - adjective followed up by a noun in a singular form

In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\didimitrov\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping help\tagsets.zip.


True

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

__Exercise__: modify the regular expression to observe keyphrases that :
- Include Nouns in __Plural form__
- Multiple adjectives before a Noun
- Multiple adjectives followed by one or more Nouns
- A group of the latter joined with one more such group by a preposition

In [20]:
# Any number of adjectives followed by one of more nouns and (optionally) 
# joined by a preposition to any number of the same group  
grammar = r'KT: {(<JJ>* <NN.>+ <IN>)? <JJ>* <NN.>+}'
candidates = select_by_pos_tag(corn_sentences, grammar, verbose=True)

1. Pos-tagged sentences:  [('THAI', 'NNP'), ('TRADE', 'NNP'), ('DEFICIT', 'NNP'), ('WIDENS', 'NNP'), ('IN', 'NNP'), ('FIRST', 'NNP'), ('QUARTER', 'NNP'), ('Thailand', 'NNP'), ("'", 'POS'), ('s', 'NN'), ('trade', 'NN'), ('deficit', 'NN'), ('widened', 'VBD'), ('to', 'TO'), ('4', 'CD'), ('.', '.'), ('5', 'CD'), ('billion', 'CD'), ('baht', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('first', 'JJ'), ('quarter', 'NN'), ('of', 'IN'), ('1987', 'CD'), ('from', 'IN'), ('2', 'CD'), ('.', '.'), ('1', 'CD'), ('billion', 'CD'), ('a', 'DT'), ('year', 'NN'), ('ago', 'RB'), (',', ','), ('the', 'DT'), ('Business', 'NNP'), ('Economics', 'NNP'), ('Department', 'NNP'), ('said', 'VBD'), ('.', '.')]
2. Extracted chunks:  (S
  (KT
    THAI/NNP
    TRADE/NNP
    DEFICIT/NNP
    WIDENS/NNP
    IN/NNP
    FIRST/NNP
    QUARTER/NNP
    Thailand/NNP)
  '/POS
  s/NN
  trade/NN
  deficit/NN
  widened/VBD
  to/TO
  4/CD
  ./.
  5/CD
  billion/CD
  (KT baht/NNS)
  in/IN
  the/DT
  first/JJ
  quarter/NN
  of/IN
  1987/CD
  f

__Exercise__: Further preprocessing of candidates:
- Add a parameter to limit by number of words in a keyphrase;
- Remove candidates containing punctuation or any other inappropriate occurrences;
- Remove candidates containing more than a number of stopwords.

In [21]:
print("Initial: ", len(candidates))
candidates = [_candidate for _candidate in candidates if not any(_p in _candidate for _p in string.punctuation)]
print("Keyphrases without punctuation: ", len(candidates))
candidates = list(set([_candidate.lower() for _candidate in candidates]))
print("Keyphrases with lowered case: ", len(candidates))

Initial:  2300
Keyphrases without punctuation:  2300
Keyphrases with lowered case:  2202


# Ranking of selected candidates 
- TF-IDF weighting;
- PMI scoring;
- Text ranking:
    - [TextRank: Bringing Order into Texts](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf)
    - [PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents](http://aclweb.org/anthology/P/P17/P17-1102.pdf)
    - [Topical Word Importance for Fast Keyphrase Extraction](https://core.ac.uk/download/pdf/55828317.pdf)
    
$ S(v_i) = (1 - d) + d * \sum_{j \in In(v_i)}{\frac{1}{|Out(v_j)|}}S(v_j) $

    - where v_i is a given word, In(v_i) are the words that co-occur with this word

## TF-IDF

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict, Counter

def score_keyphrases_by_tfidf(texts, candidates, squashing_func=sum):
    vectorizer = TfidfVectorizer(ngram_range=(1,3), vocabulary=candidates)
    train_corpus = vectorizer.fit_transform(texts)
    
    kp_tfidfs = defaultdict(lambda: [])
    for kp in candidates:
        kp_tfidfs[kp] += [train_corpus[0].toarray()[0][vectorizer.vocabulary_[kp]]]
    
    squashed_results = {_kp: squashing_func(_v) for _kp, _v in kp_tfidfs.items()}
    return squashed_results

In [23]:
kp_tfidfs = score_keyphrases_by_tfidf([" ".join(corn_sent) for corn_sent in corn_sentences], candidates)
Counter(kp_tfidfs).most_common(10)

[('first', 0.5357851385354688),
 ('trade', 0.4297143834766522),
 ('business economics department', 0.4232240421791144),
 ('baht', 0.38560698559230255),
 ('thailand', 0.35523666980362056),
 ('department', 0.19924470264230493),
 ('from', 0.16097509381272476),
 ('to', 0.09308046416727608),
 ('s european community', 0.0),
 ('officials at talks', 0.0)]

## Pointwise Mutual Information 

- how likely is it to encounter a __word in a specific category__?

$ \operatorname{PMI}(word, category) = \log\frac{p(word, category)}{p(word)p(category)} $ <br>

In [30]:
# Which part can be removed from the formula in this setting?
# Try to optimize the computations!
import math
def pmi(category):
    category_files = nltk.corpus.reuters.fileids(category)
    category_files_words = [_word for _fid in category_files for _word in nltk.corpus.reuters.words(_fid)]
    total_number_of_words_in_category = len(category_files_words)
    total_number_of_words = len(nltk.corpus.reuters.words())
    
    word_frequencies = Counter(category_files_words)
    word_frequencies_in_category = Counter(nltk.corpus.reuters.words())
    
    word_pmis = {}
    for word in set(category_files_words):
        if word in string.punctuation or word in stopwords or len(word)<3:
            continue
        p_wc =  word_frequencies_in_category[word] / total_number_of_words_in_category
        p_w = word_frequencies[word] / total_number_of_words
        p_c = len(nltk.corpus.reuters.fileids(category)) / len(nltk.corpus.reuters.fileids())
        word_pmis[word] = math.log10(p_wc / (p_w * p_c))
    return word_pmis

In [31]:
pmi_tea = pmi('tea')
print(Counter(pmi_tea).most_common(20))

pmi_corn = pmi('corn')
print(Counter(pmi_corn).most_common(20))

[('Bank', 8.713448898039305), ('January', 8.686060105933043), ('week', 8.665382074009495), ('Inc', 8.620032268662332), ('dlrs', 8.54772375065961), ('April', 8.47362203014126), ('six', 8.458883630820182), ('earlier', 8.45201536197134), ('agreed', 8.433632276086023), ('International', 8.393297556084132), ('mln', 8.39216736300149), ('rate', 8.387981767785258), ('world', 8.378246240971178), ('statement', 8.37274180122852), ('offer', 8.364069610379845), ('debt', 8.361519097119771), ('reported', 8.359810395314655), ('FOR', 8.352907556243549), ('gain', 8.348243946523409), ('lower', 8.341153266862056)]
[('Net', 6.63485357206849), ('stock', 6.4761364090822315), ('NET', 6.298375854639266), ('bank', 6.273716211304201), ('shares', 6.20703978809342), ('net', 6.115684035014881), ('loss', 6.062744815146917), ('yen', 6.026974451940459), ('shareholders', 6.001783133661785), ('company', 5.97101111289594), ('Group', 5.921386149626988), ('Ltd', 5.887204706499512), ('economy', 5.857856849122159), ('Avg', 5

__Exercise__: Make more preprocessing steps, add bi-grams and normalization to make the extracted words more comprehensible.

### Other interpretations of PMI:
- Used to build __sentiment lexicons__ - we can introduce the concept of __polarity__:
    - Polarity(phrase) = PMI(phrase,"positive") − PMI(phrase,"negative"), 
    - Where positive and negative can be a __list of preselected words__
    - We can make __several iterations__ to expand the lexicon, each time adding the newly discovered words  
- We can make the PMI score __"oriented"__ (in case we have two categories and we want to have __one number__ for both)
    - From the PMI of a phrase in a positive context, we __discard__ the PMI of a phrase in a negative context
    - But difference of two logarithms is:
$ PMI(phrase, "positive") = log2(\frac{hits(phrase NEAR "positive")hits("negative")}{hits(phrase NEAR "negative")hits("positive")})$
- Can be used to evaluate features
- Can be used to find keyphrases - how likely is to have two words that co-occur.

__Exercise__: Using the Friends' transcripts corpus, which we already have been exploring, find:
- Which are the most used words by Monica and which by Joey? - Try with all the methods we explored!
- Can we say which are the words, which are most likely for Joey, but not for Monica, using the idea for "Polarity"?