## Key Phrase Extraction 

This notebook is based on the paper "TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction" by Bougouin et al., with slight modifications of my own.

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# CNN Article on Urban Meyer and Domestic Abuse by an Assistant Coach

In [50]:
# Read the article; the context manager closes the file afterwards
with open('c:/users/myles.akin/desktop/ohiostate.txt') as f:
    text = f.read()
text

'Ohio State University placed its head football coach, Urban Meyer, on paid administrative leave on Wednesday as it investigates whether he was aware of domestic violence allegations against fired assistant coach Zach Smith.\n"The university is conducting an investigation into these allegations," Ohio State said. Ryan Day, who has been the team\'s offensive coordinator, will be acting head football coach while Meyer is on leave. \n"We are focused on supporting our players and on getting to the truth as expeditiously as possible," the university said. \nMeyer said in a statement that he and Gene Smith, Ohio State\'s athletic director, "agree that being on leave during this inquiry will facilitate its completion. This allows the team to conduct training camp with minimal distraction. I eagerly look forward to the reAt issue is whether Meyer knew about domestic violence allegations against Zach Smith made by his ex-wife, Courtney Smith. Zach Smith was the team\'s wide receivers\' coach. \

## Candidate Key Phrase Extraction

In [51]:
import itertools, re, string  # re and string are used when building the graph
from nltk.stem.porter import PorterStemmer

candidates = []
stemmer = PorterStemmer()
# stem() treats the whole article as a single token: it lowercases the text
# and can only alter the final suffix of the string.
text = stemmer.stem(text)

# Noun-phrase pattern: an optional "adjectives + nouns + preposition" prefix,
# followed by any adjectives and at least one noun
grammar = r'KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'
chunker = nltk.chunk.regexp.RegexpParser(grammar)
tagged_sents = nltk.pos_tag_sents(nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(text))

for sent in tagged_sents:
    # IOB-tag every token, then join each run of non-'O' tokens into one phrase
    all_chunks = nltk.chunk.tree2conlltags(chunker.parse(sent))
    candid = [' '.join(tok[0] for tok in group)
              for is_chunk, group in itertools.groupby(all_chunks, lambda x: x[2] != 'O')
              if is_chunk]
    candidates.extend(candid)

In [52]:
candidates = list(set(candidates))
candidates

['zach smith',
 'ex-wife',
 'investigation',
 'training camp with minimal distraction',
 'coach',
 'assistant coach zach smith',
 'ryan day',
 'july',
 'big ten conference football media day',
 'wife',
 'police',
 'ohio state',
 'stadium',
 'reat issue',
 'calls',
 'team',
 'players',
 'order',
 'facebook post',
 'little bit',
 'side of events',
 'decision',
 'attorney',
 'smith',
 'i',
 'inquiry',
 'series of domestic violence allegations',
 'today',
 'matter',
 'brad koffel',
 'head football coach while meyer',
 'domestic violence',
 'honest',
 'family',
 'media',
 'domestic violence allegations against zach smith',
 'sports network',
 'statement',
 'chance',
 'head football coach',
 'urban meyer',
 'offensive coordinator',
 'group effort',
 'feet',
 'cnn',
 'courtney smith',
 'civil protection order on behalf',
 'wide receivers',
 'incident',
 'zach',
 'athletic director',
 'meyer',
 'leave',
 'allegations',
 'university',
 'domestic violence allegations',
 'shelley meyer',
 'colleg

## Create Graph

Vertices are the candidate key phrases.

Edges are weighted by the inverse of the character-offset distance between two key phrases, plus an additional term that ranks a phrase higher if it first appears close to the beginning of the article.
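
To make the weighting concrete, here is a minimal sketch on a toy sentence. It is a simplified reading of the idea (inverse of the smallest gap between two phrases, plus an inverse of the first offset of the source phrase), not the notebook's exact code. The helper names `occurrences` and `edge_weight` and the toy text are assumptions made for illustration.

```python
import re

def occurrences(phrase, text):
    """Return (start, end) character offsets of every whole-word match of `phrase`."""
    pattern = r'\b' + re.escape(phrase) + r'\b'
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]

def edge_weight(phrase, other, text):
    """Weight = 1 / (smallest gap between the phrases) + 1 / (first offset of `phrase` + 1)."""
    a = occurrences(phrase, text)
    b = occurrences(other, text)
    # Gap between two spans: start of the later span minus end of the earlier one.
    # The value is negative when the spans overlap or are identical.
    gap = min(max(sb - ea, sa - eb) for sa, ea in a for sb, eb in b)
    if gap <= 0:
        return 0.0  # overlapping or identical spans get no edge
    first_offset = min(start for start, _ in a)
    return 1.0 / gap + 1.0 / (first_offset + 1)

text = "meyer leads the team and the team backs meyer"
w_meyer_team = edge_weight("meyer", "team", text)
w_team_meyer = edge_weight("team", "meyer", text)
w_self = edge_weight("meyer", "meyer", text)
```

Because the second term depends only on where the source phrase first appears, the weights are not symmetric: `w_meyer_team` is larger than `w_team_meyer`, since "meyer" opens the sentence.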

In [53]:
import re
import numpy as np

cand = np.array(candidates)
A = np.zeros([len(cand), len(cand)])

# Re-tokenize so the article uses the same single-space token separation as the candidates
text = ' '.join(nltk.word_tokenize(text))

def match_spans(phrase):
    # Start and end character offsets of every occurrence of `phrase` followed by a space
    matches = list(re.finditer(re.escape(phrase) + ' ', text))
    return np.array([m.start() for m in matches]), np.array([m.end() for m in matches])

spans = [match_spans(phrase) for phrase in cand]

for i, (start_d, end_d) in enumerate(spans):
    d_in = start_d.min()  # offset of the first occurrence of phrase i
    for j, (_, end_o) in enumerate(spans):
        # Both distances are measured to the end of the first occurrence of phrase j,
        # so a phrase paired with itself gets distance 0 and therefore no self-loop.
        d_tot = min(np.abs(start_d - end_o[0]).min(), np.abs(end_d - end_o[0]).min())
        # Zero distance means no edge; checking first also avoids dividing by zero
        A[i, j] = 0 if d_tot == 0 else 1 / d_tot + 1 / (d_in + 1)
            



## Key Phrase Rank

This is based on the Google PageRank algorithm; see Newman, 'Networks: An Introduction', for more information on this calculation.
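
As a rough illustration of the calculation, the sketch below ranks a three-node path graph. It assumes the closed form x = D (D - αA)⁻¹ 1 with α = 0.85, and it reaches the same fixed point by iterating x ← α A D⁻¹ x + 1 in plain Python instead of inverting a matrix. The function name `rank_nodes` and the toy graph are hypothetical, and the toy matrix is symmetric, which is what makes the iteration safe to rely on here.

```python
def rank_nodes(A, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate x = alpha * A * D^-1 * x + 1 until the scores stop changing."""
    n = len(A)
    # Row sums act as node degrees; fall back to 1 for isolated nodes
    degree = [sum(row) or 1.0 for row in A]
    x = [1.0] * n
    for _ in range(max_iter):
        new_x = [alpha * sum(A[i][j] * x[j] / degree[j] for j in range(n)) + 1.0
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x

# Path graph 0 - 1 - 2: the middle node touches both of the others
A_toy = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
scores = rank_nodes(A_toy)
```

The middle node ends up with the highest score and the two end nodes tie, which matches the intuition that a phrase connected to many other phrases should rank higher.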

In [54]:
# D holds the row sums of A on its diagonal
D = np.diag(A.sum(axis=1))
one = np.ones(A.shape[0])

# Closed-form PageRank with damping factor 0.85: x = D (D - 0.85 A)^-1 1
x = np.dot(np.dot(D, np.linalg.inv(D - 0.85*A)), one)
print(x)

[  29.69418558   11.88589657    9.08697664    8.83516602   50.90333715
   13.14805427    7.89153684    7.66689084    7.98419469   14.31025611
    4.04846744 1099.99364988    4.83926086    5.66880487    5.09047401
   13.00695461    6.30372888    8.18561064    4.78289633    7.63149782
    5.0424583     6.09716428    5.39671196   34.04269483   11.86439899
    5.70706447    5.66056386    4.05186665    5.85102888   13.48071081
    7.42002493   22.22768711    8.18079936    6.13096662   10.69677982
    7.01080602    8.30369641   12.43847424    4.05700332   42.30951321
   36.28302386   11.63453281    4.62594079    4.45287595    6.68521512
   32.4987442     7.93892608    5.87897067    7.18268949   37.58190307
    6.4225296    48.78728046   23.49650247   17.70511634  100.15202509
   17.89531255    4.15328986   12.72707794    4.5579471     4.27064047
    8.48951459 1074.31356035    6.69122145    7.82965594    6.22309125
    4.26127155    5.21140439    4.27869782    5.58445912    5.84964389
   22.

## Top 20 Ranked Phrases

Not bad. At some point I would like to find a way to combine overlapping phrases such as "ohio state" and "ohio state university".

In [55]:
# Indices of the 20 highest scores, best first
for j in np.argsort(x)[::-1][:20]:
    print(cand[[j]])  # one-element array, e.g. ['ohio state']

['ohio state']
['ohio state university']
['university']
['coach']
['meyer']
['head football coach']
['zach']
['urban meyer']
['smith']
['courtney smith']
['zach smith']
['leave']
['domestic violence']
['administrative leave on wednesday']
['domestic violence allegations']
['allegations']
['wife']
['brad koffel']
['assistant coach zach smith']
['team']
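
One possible way to combine overlapping phrases such as 'ohio state' and 'ohio state university' is a simple word-containment heuristic. The sketch below is only an assumption about how this could be done, not something the notebook implements, and `group_phrases` is a hypothetical helper. It walks the phrases in rank order and attaches each one to the first group whose top phrase contains all of its words, or is contained in it.

```python
def group_phrases(ranked_phrases):
    """Greedily group phrases by word containment against each group's top phrase."""
    groups = []  # each group is a list of phrases, best-ranked first
    for phrase in ranked_phrases:
        words = set(phrase.split())
        for group in groups:
            top_words = set(group[0].split())
            # Containment in either direction, e.g. 'ohio state' vs 'ohio state university'
            if words <= top_words or top_words <= words:
                group.append(phrase)
                break
        else:
            # No existing group matched, so this phrase starts a new one
            groups.append([phrase])
    return groups

ranked = ['ohio state', 'ohio state university', 'meyer', 'urban meyer', 'zach smith']
groups = group_phrases(ranked)
```

Comparing only against the top phrase of each group keeps unrelated phrases from chaining together through a shared word such as 'coach' or 'smith'.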
