This notebook illustrates a graph-based keyphrase extraction method, SingleRank, on an OpenShift 4 dataset.

Outline
- Load the dataset
- Initialize SingleRank
- Extract keyphrases
- Dump the results

In [4]:
import pke
import pandas as pd

In [3]:
#!python -m nltk.downloader stopwords
#!python -m nltk.downloader universal_tagset
#!python -m spacy download en # download the english model

In [47]:
# suppress non-critical log messages (e.g. warnings emitted by the pke methods)
import logging

logging.basicConfig(level=logging.CRITICAL)

In [57]:
def keyphrases(text):
    """Return up to 10 keyphrases extracted from `text` with SingleRank."""
    # define the set of valid part-of-speech tags
    pos = {'NOUN', 'PROPN', 'ADJ'}

    # create a SingleRank extractor
    singleRank_extractor = pke.unsupervised.SingleRank()

    # load the content of the document
    singleRank_extractor.load_document(input=text, language='en', normalization=None)

    # candidate selection: keep the longest sequences of nouns, proper nouns and adjectives
    singleRank_extractor.candidate_selection(pos)

    # candidate weighting: each candidate phrase is scored by the sum of its
    # words' scores, computed with a random walk over a word graph. Nodes are
    # words with a valid part of speech (nouns, proper nouns, adjectives), and
    # two words are connected if they co-occur within a window of 10 words.
    singleRank_extractor.candidate_weighting(window=10, pos=pos)

    # rank the candidates and keep the 10 highest-scoring keyphrases
    keyphrases_with_scores = singleRank_extractor.get_n_best(n=10)
    phrases = [keyphrase for keyphrase, score in keyphrases_with_scores]

    return phrases

In [1]:
df=pd.read_csv('../data/openshift4_demo.csv')

In [69]:
df.head()

Unnamed: 0,allTitle,view_uri
0,oc command line tool is throwing invalid chara...,https://access.redhat.com/solutions/4034641
1,Debugging OpenShift 4.x,https://access.redhat.com/articles/3780981
2,Getting Journal Logs from the OpenShift 4.x ku...,https://access.redhat.com/solutions/3802181
3,How to connect to Openshift Container Platform...,https://access.redhat.com/solutions/4073041
4,What are the credentials for OpenShift 4 Route...,https://access.redhat.com/solutions/4064271


In [70]:
text_content = df['allTitle'][3]

## SingleRank initialization

In [71]:
# define the set of valid part-of-speech (POS) tags
pos = {'NOUN', 'PROPN', 'ADJ'}

In [72]:
# create a SingleRank extractor
singleRank_extractor = pke.unsupervised.SingleRank()

In [73]:
# load the content of the document
singleRank_extractor.load_document(input=text_content, language='en', normalization=None)

## Keyphrase Extraction

In [74]:
# candidate selection
singleRank_extractor.candidate_selection(pos)
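
To make "longest sequences" concrete, here is a minimal sketch of the idea in plain Python. It is not pke's implementation: the helper name `longest_pos_sequences` is hypothetical, and the POS tags for the title below are assigned by hand rather than produced by a tagger.

```python
from itertools import groupby

VALID_POS = {"NOUN", "PROPN", "ADJ"}


def longest_pos_sequences(tagged_tokens, valid_pos=VALID_POS):
    """Return maximal runs of consecutive tokens whose POS tag is in valid_pos."""
    candidates = []
    # groupby splits the token stream into alternating valid / invalid runs
    for is_valid, run in groupby(tagged_tokens, key=lambda tok: tok[1] in valid_pos):
        if is_valid:
            candidates.append(" ".join(word.lower() for word, _ in run))
    return candidates


# One title from the dataset; the tags are hand-assigned (an assumption).
tagged_title = [
    ("Authentication", "NOUN"),
    ("operator", "NOUN"),
    ("fails", "VERB"),
    ("to", "PART"),
    ("upgrade", "VERB"),
]
selected = longest_pos_sequences(tagged_title)
print(selected)
```

With these tags the only maximal run of valid tokens is "authentication operator", which is also the single keyphrase the notebook reports for that title.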

In [75]:
# candidate weighting using the default weighting scheme
singleRank_extractor.candidate_weighting(window=10, pos=pos)
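
The weighting step can be sketched without pke to show what happens under the hood: build a word co-occurrence graph, run a weighted random walk (PageRank) over it, and sum word scores per candidate. The function names, the damping factor of 0.85, the fixed 50 iterations and the hand-assigned POS tags are all assumptions made for illustration; pke's own implementation may differ in detail.

```python
from collections import defaultdict

VALID_POS = {"NOUN", "PROPN", "ADJ"}


def build_word_graph(tagged_tokens, window=10):
    """Undirected graph over valid-POS words; edge weight = co-occurrence count."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag not in VALID_POS:
            continue
        graph[word]  # register the node even if it ends up with no edges
        for other, other_tag in tagged_tokens[i + 1:i + window]:
            if other_tag in VALID_POS and other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def weighted_pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; mass from isolated nodes is simply dropped in this sketch."""
    n = len(graph)
    scores = {word: 1.0 / n for word in graph}
    out_weight = {word: sum(graph[word].values()) for word in graph}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / n
            + damping * sum(scores[nb] * w / out_weight[nb] for nb, w in graph[word].items())
            for word in graph
        }
    return scores


def score_candidates(candidates, word_scores):
    """Score each candidate phrase as the sum of its words' scores, best first."""
    scored = {c: sum(word_scores[w] for w in c.split()) for c in candidates}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# One title from the dataset, lower-cased, with hand-assigned tags (an assumption).
tagged_title = [
    ("who-can", "X"), ("command", "NOUN"), ("gives", "VERB"),
    ("incorrect", "ADJ"), ("output", "NOUN"),
]
word_scores = weighted_pagerank(build_word_graph(tagged_title))
ranking = score_candidates(["command", "incorrect output"], word_scores)
print(ranking)
```

Under these assumptions the three graph words are fully connected and share the score equally, so the two-word phrase "incorrect output" ranks ahead of "command", the same order the notebook reports for this title.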

In [76]:
keyphrases_with_scores = singleRank_extractor.get_n_best(n=10)
keyphrases_with_scores



[('ssh bastion pod', 0.3750001099999999),
 ('openshift container platform', 0.37500003999999987),
 ('cluster nodes', 0.25000007999999996)]
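
One way to read these numbers: since a phrase's score is the sum of its words' scores, dividing by the phrase length gives the average score per word. The quick check below uses only the three scores printed above. Explaining the near-identical per-word values by a nearly symmetric word graph for such a short title is an assumption, not something the output itself proves.

```python
# Scores copied from the output above.
scores = {
    "ssh bastion pod": 0.3750001099999999,
    "openshift container platform": 0.37500003999999987,
    "cluster nodes": 0.25000007999999996,
}

# Average score per word in each phrase.
per_word = {phrase: score / len(phrase.split()) for phrase, score in scores.items()}
total = sum(scores.values())

for phrase, value in per_word.items():
    print(f"{phrase}: {value:.6f} per word")
print(f"total: {total:.6f}")
```

Each phrase averages roughly 0.125 per word and the three scores add up to roughly 1, so here the ranking mostly reflects phrase length rather than any one word standing out.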

In [77]:
phrases = [keyphrase for keyphrase, score in keyphrases_with_scores]

In [78]:
phrases

['ssh bastion pod', 'openshift container platform', 'cluster nodes']

## Keyphrase extraction from titles

In [79]:
df['allTitle_kpe'] = df['allTitle'].apply(keyphrases)



In [82]:
df.head(20)

Unnamed: 0,allTitle,view_uri,allTitle_kpe
0,oc command line tool is throwing invalid chara...,https://access.redhat.com/solutions/4034641,"[invalid character error, command line tool]"
1,Debugging OpenShift 4.x,https://access.redhat.com/articles/3780981,[openshift]
2,Getting Journal Logs from the OpenShift 4.x ku...,https://access.redhat.com/solutions/3802181,"[journal logs, kubelet, openshift]"
3,How to connect to Openshift Container Platform...,https://access.redhat.com/solutions/4073041,"[ssh bastion pod, openshift container platform..."
4,What are the credentials for OpenShift 4 Route...,https://access.redhat.com/solutions/4064271,"[router metrics, openshift, credentials]"
5,What nameserver does a Pod use to resolve a ho...,https://access.redhat.com/solutions/4064211,"[pod use, openshift, hostname]"
6,What is the default OpenShift CNI Plugin in Op...,https://access.redhat.com/solutions/4064171,"[default openshift cni plugin, openshift]"
7,Authentication operator fails to upgrade,https://access.redhat.com/solutions/4059611,[authentication operator]
8,who-can command gives incorrect output,https://access.redhat.com/solutions/4058371,"[incorrect output, command]"
9,How to upgrade the Openshift 4 cluster?,https://access.redhat.com/solutions/4044181,"[cluster, openshift]"


In [83]:
results = df[['allTitle', 'allTitle_kpe']]

In [84]:
results.to_csv('results.csv')
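
Note that the `allTitle_kpe` column holds Python lists, and a CSV file stores each list as its string representation, so reading the file back yields strings rather than lists. The sketch below mimics one row of such a file with the standard library only (it does not read the real `results.csv`) and shows one way to recover the list with `ast.literal_eval`.

```python
import ast
import csv
import io

# Mimic one row of the results file: the keyphrase list is written as a string.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["allTitle", "allTitle_kpe"])
writer.writerow(
    ["who-can command gives incorrect output", str(["incorrect output", "command"])]
)

# Read the row back: the list column comes back as a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["allTitle_kpe"]

# literal_eval safely parses the string back into a Python list.
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__, parsed)
```

The same idea carries over to pandas by applying `ast.literal_eval` to the column after `pd.read_csv`.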

In [85]:
import jovian
jovian.commit()

[jovian] Saving notebook..


[jovian] Creating a new notebook on https://jvn.io
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jvn.io/manisnesan/7a7f29d7c3e844979b0b1aa6d5a0ee55
