# Keyword Phrase Extraction
## This notebook outlines the concepts involved in extracting keyword phrases from text

Problem: **Identify keywords in a piece of text**

Identify **words that are very important** in a piece of text

### Possible Solutions:
- TF-IDF (already seen)
- Noun Chunks
- Specialized Keyword Extraction algorithms
    - TextRank
    - SGRank

Textacy is a library built on spaCy that provides a range of information extraction functions. Many of them rely on regular expression patterns and heuristics to pull out specific expressions such as acronyms and quotations. It can also match custom patterns, including POS tag patterns, find statements involving an entity, and extract subject-verb-object tuples.
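
To make the "regular expression patterns and heuristics" idea concrete, here is a minimal sketch of an acronym finder in plain Python. It is not Textacy's implementation: the function name `find_acronyms`, the pattern, and the sample sentence are all assumptions made up for illustration.

```python
import re

# Hypothetical pattern: 2 to 6 capital letters inside parentheses, e.g. "(OCR)".
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z]{2,6})\)")


def find_acronyms(text):
    """Return {acronym: candidate definition} using an initial-letter heuristic."""
    found = {}
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        # Take as many words before the parenthesis as the acronym has letters.
        words_before = text[:match.start()].split()
        candidate = words_before[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        # Keep the match only if the initials spell out the acronym.
        if initials == acronym:
            found[acronym] = " ".join(candidate)
    return found


# Illustrative sentence, not taken from kpe_sample_text.txt.
sample = "Optical character recognition (OCR) converts an image of printed text into text."
print(find_acronyms(sample))
```

A heuristic this simple misses acronyms whose definitions contain stop words or appear after the parenthesis, which is the kind of case a library-grade extractor has to handle.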

We will use Textacy to extract keywords from documents.

Documentation: https://textacy.readthedocs.io/en/stable/

### Install Textacy

In [1]:
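# The textacy.extract.keyterms functions used below need textacy 0.11 or later
# (older releases such as 0.9.1 expose them under textacy.ke instead).
# !pip install "textacy>=0.11"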
# !python -m spacy download en_core_web_sm

### Import the libraries

In [2]:
import spacy
import textacy
import textacy.extract  # explicit import instead of a wildcard import



### Load a spaCy model, which will be used for all further processing

In [3]:
en = textacy.load_spacy_lang("en_core_web_sm")

### Load a sample text to find keywords
- kpe_sample_text.txt

In [4]:
# Use a context manager so the file is closed after reading
with open('kpe_sample_text.txt', encoding='utf-8') as f:
    mytext = f.read()

### Convert the text into a spaCy document

In [5]:
doc = textacy.make_spacy_doc(mytext, lang=en)

## Find keywords

## 1. Noun Chunks

In [6]:
print(list(textacy.extract.noun_chunks(doc)))

[Common NLP Tasks, following, list, some, most commonly researched tasks, natural language processing, Some, tasks, direct real-world applications, others, subtasks, that, larger tasks, natural language processing tasks, they, categories, convenience, coarse division, Text and speech processing
Optical character recognition, OCR, image, printed text, corresponding text, Speech recognition, sound clip, person, people, textual representation, speech, This, opposite, text, extremely difficult problems, natural speech, pauses, successive words, speech segmentation, necessary subtask, speech recognition, most spoken languages, sounds, successive letters, process, coarticulation, conversion, analog signal, discrete characters, very difficult process, words, same language, people, different accents, speech recognition software, wide variety, input, terms, its textual equivalent, Speech segmentation, sound clip, person, people, it, words, subtask, speech recognition, it, speech, text, units, s

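### Issues

- Many chunks are pronouns or generic single words (it, they, that, some, others) that say nothing about the topic
- The chunks are unranked and repeated (people, sound clip, speech recognition), so nothing separates important phrases from incidental ones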
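
One possible way to tame the noun-chunk output above is to drop pronoun-like chunks and count how often the remaining ones repeat. The sketch below works on plain strings; the function name `filter_chunks`, the stop list, and the `min_words` threshold are assumptions chosen for illustration, not part of Textacy.

```python
from collections import Counter

# A tiny hand-picked stop list for illustration; a real one would be larger.
UNINFORMATIVE = {"it", "they", "that", "this", "some", "others"}


def filter_chunks(chunks, min_words=2):
    """Drop pronoun-like or short chunks and count how often the rest occur."""
    counts = Counter()
    for chunk in chunks:
        normalized = chunk.lower().strip()
        if normalized in UNINFORMATIVE:
            continue
        if len(normalized.split()) < min_words:
            continue
        counts[normalized] += 1
    # most_common() orders the chunks by count, highest first.
    return counts.most_common()


# Strings taken from the noun-chunk output above (repeats included for illustration).
chunk_texts = ["speech recognition", "it", "people", "sound clip", "they",
               "Speech recognition", "sound clip", "natural language processing",
               "speech recognition"]
print(filter_chunks(chunk_texts))
```

With the notebook's `doc`, the input could be built as `[chunk.text for chunk in textacy.extract.noun_chunks(doc)]`. Raw frequency is a crude ranking signal, which is the motivation for the graph-based algorithms that follow.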

## 2. TextRank

In [7]:
textacy.extract.keyterms.textrank(doc, topn=5)

[('natural language semantic', 0.018146838505547314),
 ('natural language processing task', 0.018060171377632987),
 ('language text segmentation', 0.017039167515137954),
 ('possible word form', 0.011931118622641946),
 ('multiple possible semantic', 0.011782065218284243)]
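
For intuition about where these scores come from, here is a minimal sketch of the idea behind TextRank: link words that occur near each other, then run PageRank over that graph. It ranks single words only and is not Textacy's implementation, which also builds multi-word candidates and offers more options. The function name `textrank_words`, the window size, the damping factor, and the toy token list are all assumptions.

```python
from collections import defaultdict


def textrank_words(words, window=2, damping=0.85, iterations=50):
    """Score words with PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window + 1]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {word: 1.0 / len(neighbors) for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / len(neighbors)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input, assumed to be already tokenized and filtered to content words.
tokens = ["natural", "language", "processing", "task", "speech", "recognition",
          "natural", "language", "generation", "natural", "language", "understanding"]
for word, score in textrank_words(tokens)[:3]:
    print(word, round(score, 3))
```

Words that co-occur with many different neighbors collect the highest scores, which is consistent with "natural" and "language" dominating the phrases in the output above.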

### Get more keywords

In [8]:
textacy.extract.keyterms.textrank(doc, topn=20)

[('natural language semantic', 0.018146838505547314),
 ('natural language processing task', 0.018060171377632987),
 ('language text segmentation', 0.017039167515137954),
 ('possible word form', 0.011931118622641946),
 ('multiple possible semantic', 0.011782065218284243),
 ('natural language expression', 0.011774393399104688),
 ('natural language generation', 0.011688724696224265),
 ('readable human language', 0.011669293765417333),
 ('powerful neural language model', 0.011598017813175949),
 ('natural language understanding', 0.01139476710406764),
 ('language question', 0.011350302842731216),
 ('natural language concept', 0.011169986552497629),
 ('Implicit semantic Role labelling', 0.010617673020212303),
 ('semantic role labelling', 0.010558625103431842),
 ('explicit semantic role', 0.010363135996829321),
 ('word sense disambiguation', 0.00964158598757856),
 ('semantic representation', 0.009522574211103165),
 ('spoken language', 0.009435681059633324),
 ('indian language', 0.009205679489

### Keywords using TextRank algorithm

In [9]:
[kps for kps, weights in textacy.extract.keyterms.textrank(doc, normalize="lemma", topn=10)]

['natural language semantic',
 'natural language processing task',
 'language text segmentation',
 'possible word form',
 'multiple possible semantic',
 'natural language expression',
 'natural language generation',
 'readable human language',
 'powerful neural language model',
 'natural language understanding']

### Keywords using SGRank algorithm

In [10]:
[kps for kps, weights in textacy.extract.keyterms.sgrank(doc, topn=10)]

['natural language',
 'speech recognition',
 'word',
 'text',
 'task',
 'separate word',
 'sentence',
 'sentence boundary',
 'semantic role',
 'segmentation']

Issue: **Overlapping key phrases**

Solution: **aggregate_term_variants**
- Choosing one term from each group gives a list of non-overlapping key phrases

In [11]:
terms = {term for term, weight in textacy.extract.keyterms.sgrank(doc)}
print(textacy.extract.utils.aggregate_term_variants(terms))

[{'speech recognition'}, {'sentence boundary'}, {'natural language'}, {'separate word'}, {'semantic role'}, {'segmentation'}, {'sentence'}, {'task'}, {'word'}, {'text'}]
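
Picking one term per group can be sketched in plain Python as below. The first helper does exactly that. The second is an extra heuristic, not part of Textacy, that drops a term when it appears as a whole-word sequence inside a longer kept term. Both function names are hypothetical, and the `groups` list is copied from the output above.

```python
def pick_representatives(groups):
    """Choose the longest term from each group (ties go to the alphabetically first)."""
    return [max(sorted(group), key=len) for group in groups]


def drop_contained(terms):
    """Remove terms that occur as a contiguous word sequence inside a longer kept term."""
    kept = []
    # Visit phrases with more words first so shorter ones are compared against them.
    for term in sorted(terms, key=lambda t: (-len(t.split()), t)):
        padded = f" {term} "
        if not any(padded in f" {other} " for other in kept):
            kept.append(term)
    return kept


groups = [{'speech recognition'}, {'sentence boundary'}, {'natural language'},
          {'separate word'}, {'semantic role'}, {'segmentation'}, {'sentence'},
          {'task'}, {'word'}, {'text'}]
representatives = pick_representatives(groups)
print(drop_contained(representatives))
```

On this input the second step removes 'sentence' and 'word', because 'sentence boundary' and 'separate word' already cover them.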
