### Activity 2.01: Extracting Top Keywords from the News Article
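In this activity, you will extract the most frequently occurring keywords from a sample news article: lowercase the text, tokenize it on whitespace, remove stop words, stem the remaining tokens, and count the stems.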

In [1]:
import nltk

In [2]:
# Read the article and convert it to lowercase.
with open('news-article.txt') as f:
    text = f.read()
text = text.lower()
print(text)

ever since the populist law and justice (pis) party took power in 2015, adam bodnar, poland’s
courts. to illustrate the danger, he uses an expression from communist times: lex telefonica.
in the polish people’s republic, verdicts were routinely dictated by a phone call from an
apparatchik at party headquarters. today’s government has more subtle techniques,
but the goal is the same, mr bodnar says: “if a judge has a case on his desk with some
political importance, he should be afraid.”

the european commission is worried, too. it accuses pis of violating poland’s commitments
to the rule of law under the european union’s founding treaty. in 2017 the commission took
poland to the european court of justice (ecj) over laws that gave politicians control over
appointing judges. (for example, they lowered judges’ retirement age while letting the justice
minister pick whom to exempt.) the ecj ruled against the poles, who had in the meantime
scrapped some of the measures.


In [3]:
# Split the string into tokens using a white space tokenizer.
from nltk.tokenize import WhitespaceTokenizer
def tokenize_with_wst(text):
    wh_tokenizer = WhitespaceTokenizer()
    return wh_tokenizer.tokenize(text)
word_tokens = tokenize_with_wst(text)

In [4]:
# Download the NLTK stop word lists.
from nltk import download
download('stopwords')
from nltk.corpus import stopwords

# English stop words (NLTK also ships lists for Arabic, Danish, Dutch, etc.)
stop_words = stopwords.words('english')

# Remove stop words and rejoin the remaining tokens into a single string.
def remove_stop_words(sentence_words, stop_words):
    return ' '.join([word for word in sentence_words if word not in stop_words])
words_no_stop = remove_stop_words(word_tokens, stop_words)
print(words_no_stop)



[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/LNonyane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Stemming using PorterStemmer
from nltk import stem

# The input is a single space-joined string, so re-tokenize it before stemming.
def get_stems(text, stemmer):
    return [stemmer.stem(word) for word in tokenize_with_wst(text)]
porterStem = stem.PorterStemmer()
token_stems = get_stems(words_no_stop, porterStem)
print(token_stems)

['ever', 'sinc', 'populist', 'law', 'justic', '(pis)', 'parti', 'took', 'power', '2015,', 'adam', 'bodnar,', 'poland’', 'human-right', 'ombudsman,', 'warn', 'relentless', 'effort', 'get', 'control', 'courts.', 'illustr', 'danger,', 'use', 'express', 'communist', 'times:', 'lex', 'telefonica.', 'polish', 'people’', 'republic,', 'verdict', 'routin', 'dictat', 'phone', 'call', 'apparatchik', 'parti', 'headquarters.', 'today’', 'govern', 'subtl', 'techniques,', 'goal', 'same,', 'mr', 'bodnar', 'says:', '“if', 'judg', 'case', 'desk', 'polit', 'importance,', 'afraid.”', 'european', 'commiss', 'worried,', 'too.', 'accus', 'pi', 'violat', 'poland’', 'commit', 'rule', 'law', 'european', 'union’', 'found', 'treaty.', '2017', 'commiss', 'took', 'poland', 'european', 'court', 'justic', '(ecj)', 'law', 'gave', 'politician', 'control', 'appoint', 'judges.', '(for', 'example,', 'lower', 'judges’', 'retir', 'age', 'let', 'justic', 'minist', 'pick', 'exempt.)', 'ecj', 'rule', 'poles,', 'meantim', 'scra

In [7]:
# Frequency of each stem after stemming.
# This function builds a dict that maps each stem to the number of times it occurs in the list.
def get_freq(stems):
    freq_dict = {}
    for t in stems:
        freq_dict[t.strip()] = freq_dict.get(t.strip(), 0) + 1
    return freq_dict

# This function sorts the dictionary items by count, highest first, and
# returns the top n keys.
import operator
def get_top_n_words(freq_dict, n):
    sorted_dict = sorted(freq_dict.items(), key=operator.itemgetter(1), reverse=True)
    return [x[0] for x in sorted_dict][:n]
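
The two helpers above can also be sketched with `collections.Counter` from the standard library. This is an alternative shown for comparison, not the activity's own solution; the function name `get_top_n_words_counter` and the `sample_stems` list are made up for illustration.

```python
from collections import Counter

def get_top_n_words_counter(stems, n):
    """Return the n most frequent stems, most frequent first."""
    # Strip surrounding whitespace, as get_freq does, then count.
    counts = Counter(t.strip() for t in stems)
    return [word for word, _ in counts.most_common(n)]

# Hypothetical sample, not taken from the article output.
sample_stems = ['law', 'justic', 'parti', 'law', 'justic', 'law']
print(get_top_n_words_counter(sample_stems, 2))  # ['law', 'justic']
```

`most_common(n)` sorts by count in descending order, so for this input it should give the same ranking as sorting the dictionary items with `operator.itemgetter(1)`.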

In [9]:
freq_dict = get_freq(token_stems)
top_keywords = get_top_n_words(freq_dict, 6)
top_keywords

['law', 'justic', 'european', 'parti', 'took', 'poland’']
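
Because the whitespace tokenizer splits only on spaces, punctuation stays attached to the tokens, which is why `'poland’'` and `'poland'` are counted as different stems in the output above. One possible refinement, which is not part of the original activity, is to trim punctuation from the edges of each stem before counting. The sketch below uses only the standard library; `EDGE_CHARS`, `strip_edge_punctuation`, and the `sample` list are hypothetical names introduced for illustration.

```python
import string

# Characters to trim from token edges: ASCII punctuation plus the
# curly quotes that appear in the article text (an assumption about the input).
EDGE_CHARS = string.punctuation + '‘’“”'

def strip_edge_punctuation(tokens):
    """Trim punctuation from both ends of each token and drop tokens left empty."""
    cleaned = (token.strip(EDGE_CHARS) for token in tokens)
    return [token for token in cleaned if token]

# A few stems copied from the output above.
sample = ['poland’', 'poland', '(pis)', 'courts.', '“if', 'human-right']
print(strip_edge_punctuation(sample))
# ['poland', 'poland', 'pis', 'courts', 'if', 'human-right']
```

Passing the cleaned list to `get_freq` would merge variants such as `'poland’'` and `'poland'` into a single count. Punctuation inside a token, such as the hyphen in `'human-right'`, is left alone.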