# Topic Extraction

This notebook shows a linguistic approach to extracting keywords from a sentence. It is not always accurate, since some words (for example, specific named entities) are unknown to the model, but the same is true of a machine learning approach. The advantage of this approach is that it needs no training set, which supervised machine learning requires.

In [1]:
import spacy
import pandas as pd
# show full cell text; None replaces -1, which is deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)

Let's load a model (https://spacy.io/models) maintained by the spaCy team. In this case it is the small English model (https://spacy.io/models/en, 35 MB), which offers sufficient Part-of-Speech tagging accuracy (97.04%) and Named Entity Recognition accuracy (~85%).

In [2]:
nlp = spacy.load('en_core_web_sm')
df = pd.read_csv('questions_or_not.txt', sep = '\t', names = ['question_expected','text'])

Extraction of the keywords is done in two steps:
- Named Entity Recognition (NER, see https://spacy.io/api/annotation#section-named-entities)
- Part-of-Speech Tags (PoS, see https://spacy.io/api/annotation#section-pos-tagging) combined with dependency labels, specifically:
    - Syntactic Dependency Tags, given by the `dep_` attribute of the token (kept)
        - dobj = direct object
        - pobj = object of preposition
        - conj = conjunct
        - compound = compound
    - The Universal Part-of-speech Tags, given by the `pos_` attribute of the token (kept)
        - NOUN = noun
    - The English Part-of-speech Tags, given by the `tag_` attribute of the token (excluded)
        - WP = wh-pronoun, personal
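
The token-level rule above can be sketched as a standalone predicate. This is only an illustration: `Tok` is a hypothetical stand-in for a spaCy token that carries just the attributes the rule reads, `is_keyword` and `KEYWORD_DEPS` are names invented here, and the annotations are written by hand rather than produced by the model.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the attributes the rule reads.
Tok = namedtuple("Tok", ["text", "dep_", "pos_", "tag_"])

# Dependency labels accepted as keyword positions.
KEYWORD_DEPS = {"dobj", "pobj", "conj", "compound"}


def is_keyword(token):
    """Return True for nouns in one of the selected dependency roles."""
    return (
        token.dep_ in KEYWORD_DEPS
        and token.pos_ == "NOUN"
        and token.tag_ != "WP"
    )


# Hand-written annotations for "Do we have any material on internal fraud?"
# (illustrative values, not actual model output).
tokens = [
    Tok("Do", "aux", "AUX", "VBP"),
    Tok("we", "nsubj", "PRON", "PRP"),
    Tok("have", "ROOT", "VERB", "VB"),
    Tok("any", "det", "DET", "DT"),
    Tok("material", "dobj", "NOUN", "NN"),
    Tok("on", "prep", "ADP", "IN"),
    Tok("internal", "amod", "ADJ", "JJ"),
    Tok("fraud", "pobj", "NOUN", "NN"),
]

selected = [t.text for t in tokens if is_keyword(t)]
print(selected)  # ['material', 'fraud']
```

The sketch compares `pos_` and `tag_` with `==` and `!=` rather than `in ('NOUN')`: without a trailing comma, `('NOUN')` is just the string `'NOUN'`, so `in` would perform a substring test instead of a membership test. Because the predicate only reads attributes, it should accept real spaCy tokens as well.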

In [3]:
keywords = []
for index, row in df.iterrows():
    kw = []
    parsed_text = nlp(row['text'])
    # taking all named entities as keywords
    for entity in parsed_text.ents:
        #kw.append("%s(%s)" % (entity.text, entity.label_))
        kw.append(entity.text)
    for pt in parsed_text:
        # keep nouns in the selected dependency roles, excluding wh-pronouns;
        # compare with == / != because ('NOUN') is a plain string, not a tuple
        if pt.dep_ in ('pobj', 'dobj', 'conj', 'compound') and pt.pos_ == 'NOUN' and pt.tag_ != 'WP':
            #kw.append("%s(%s,%s)"%(pt, pt.pos_, pt.tag_))
            kw.append(pt.text)
    # remove duplicates, since named entities and noun tokens can overlap
    kw = list(set(kw))
    keywords.append(', '.join(str(k) for k in kw))
df = df.assign(keywords=keywords)
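
Deduplicating through `set` does not keep the keywords in the order they appear in the sentence. If that order matters, a possible alternative is `dict.fromkeys`, which keeps the first occurrence of each item. The following is a minimal sketch; `unique_in_order` and the sample list are hypothetical and not part of the notebook.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict preserves insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(items))


# Hypothetical keyword list: entities first, then noun tokens, with overlap.
kw = ["Coke", "Lego", "list", "Coke", "list"]
print(", ".join(unique_in_order(kw)))  # Coke, Lego, list
```

In the loop, this would amount to replacing `list(set(kw))` with `list(dict.fromkeys(kw))`.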

In [4]:
spacy.explain("WP")

'wh-pronoun, personal'

The result is then tabulated: the original text alongside the extracted keywords.

In [5]:
df[["text","keywords"]]

| | text | keywords |
|---|---|---|
| 0 | Anyone knows of a list of Kylo users that we can show as references? | Kylo, list, users, references |
| 1 | I know of Coke and Lego, but a largest list will help. Thanks! | Coke, Lego |
| 2 | Do we have any use cases where we have utilised 3D visualisation technology? | use, technology, visualisation, cases |
| 3 | Does someone have more information, decks of the new TD which was presented in partners? | partners, information, TD, decks |
| 4 | Do we have any material on internal fraud on banking? | material, banking, fraud |
| 5 | Any assessment or implementation on this front? | implementation, front |
| 6 | I am looking for pre-sale information for Data Science Lab. | sale, information, Data Science Lab, Lab |
| 7 | Is there someone who has experience within Projects with the Next-Gen Data Platform or Converged Data Platform from MapR? | experience, Converged Data Platform |
| 8 | I've been preparing for a TD AppCenter demo for a customer and created a set of demo videos for backup. | backup, set, videos, customer, TD |
| 9 | These may be useful for ones who want to learn about it. | ones |
