# Dr. Jennifer Sleeman
# Exploring Term Extraction Methods
# jennsleeman@gmail.com

In [1]:
# Install the required packages. Comment these lines out if the packages are already installed.
!pip3 install phrasemachine
!pip3 install nltk
!pip3 install rake_nltk


Collecting phrasemachine
  Downloading phrasemachine-1.0.7.tar.gz (2.7 MB)
Building wheels for collected packages: phrasemachine
  Building wheel for phrasemachine (setup.py) ... done
  Created wheel for phrasemachine: filename=phrasemachine-1.0.7-py3-none-any.whl size=2694879 sha256=8a481a8f7e2439013d06855dd63e7b22e119b456b3d963d281d3dee68f296317
  Stored in directory: /Users/kerry/Library/Caches/pip/wheels/49/d9/e6/e8948b0664fc1e5135444099525ae3f67cebc0bcddf0a7b453
Successfully built phrasemachine
Installing collected packages: phrasemachine
Successfully installed phrasemachine-1.0.7
You should consider upgrading via the '/opt/anaconda3/bin/python -m pip install --upgrade pip' command.
Collecting rake_nltk
  Downloading rake_n

In [2]:
import phrasemachine
import nltk
from rake_nltk import Rake
import re
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import ngrams, FreqDist

In [3]:
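# Run this once; NLTK caches the downloaded data locally.
# Note: word_tokenize and sent_tokenize, used below, also need the 'punkt' tokenizer models
# ('punkt_tab' in newer NLTK releases); download them the same way if they are missing.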
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /Users/kerry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/kerry/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
# Examples
document1 = "Ruth Bader Ginsburg was the second woman appointed to the United States Supreme Court, but she’s probably the first justice to become a full-fledged pop-cultural phenomenon. “RBG,” a loving and informative documentary portrait of Justice Ginsburg during her 85th year on earth and her 25th on the bench, is both evidence of this status and a partial explanation of how it came about.Directed by Betsy West and Julie Cohen, the film is a jaunty assemblage of interviews, public appearances and archival material, organized to illuminate its subject’s temperament and her accomplishments so far. Though it begins with audio snippets of Justice Ginsburg’s right-wing detractors — who see her as a “demon,” a “devil” and a threat to America — “RBG” takes a pointedly high road through recent political controversies. Its celebration of Justice Ginsburg’s record of progressive activism and jurisprudence is partisan but not especially polemical. The filmmakers share her convictions and assume that the audience will, too. Which might be true, and not only because much of the audience is likely to consist of liberals. Before she was named to the federal bench by Jimmy Carter in 1980, the future justice had argued a handful of important sex-discrimination cases in front of the Supreme Court. What linked these cases — she won five out of six — was the theory that the equal protection clause of the 14th Amendment should apply to women and could be used to remedy discrepancies in hiring, business practices and public policy.The idea that women are equal citizens — that barring them from certain jobs and educational opportunities and treating them as the social inferiors of men are unfair — may not seem especially controversial now. “RBG” uses Justice Ginsburg’s own experiences to emphasize how different things were not so long ago. 
At Harvard Law School, she was one of nine women in a class of hundreds, and was asked by the dean (as all the women were) why she thought she deserved to take what should have been a man’s place.The biographical part of “RBG” tells a story that is both typical and exceptional. It’s a reminder that the upward striving of first- and second-generation Jewish immigrants in the middle decades of the 20th century was accompanied by fervent political idealism. Justice Ginsburg’s career was marked by intense intellectual ambition and also by a determination to use the law as an instrument of change.The film also chronicles her marriage to Martin Ginsburg. They met as undergraduates at Cornell, and for the next 63 years, Mr. Ginsburg (who died in 2010) was his wife’s tireless supporter and champion, a man whose commitment to domestic egalitarianism was extraordinary in his time and far from common today. As their friends and children explain — and as Mr. Ginsburg, a New York tax lawyer, often said himself — he was responsible for cooking meals and cracking jokes while she was making history. He also, when Byron White retired from the Supreme Court, made sure that her name was high on President Clinton’s list of candidates.It would be fascinating to learn more about that campaign, and also to have a finer-grained sense of the institutional and interpersonal dynamics of the court over the past quarter-century. But “RBG” reasonably chooses to focus on Justice Ginsburg herself, and relishes every moment of her company. It also shows why she has become such an inspiration for younger feminists, like Irin Carmon and Shana Knizhnik, whose 2015 book “Notorious RBG: The Life and Times of Ruth Bader Ginsburg” helped created the contemporary image of a fierce, uncompromising and gracious champion of women’s rights.That those rights are in a new phase of embattlement goes without saying. 
The movie’s touch is light and its spirit buoyant, but there is no mistaking its seriousness or its passion. Those qualities resonate powerfully in the dissents that may prove to be Justice Ginsburg’s most enduring legacy, and “RBG” is, above all, a tribute to her voice."
#Add document variables here or parse your documents from a directory (your choice)

In [5]:
# Create a list of stop words from nltk
stop_words = set(stopwords.words("english"))


In [6]:
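# Pre-process dataset to remove punctuation
def remove_punctuation(in_text):
    # Replace every character that is not an ASCII letter with a space.
    # This strips punctuation, but also digits and non-ASCII characters.
    text = re.sub('[^a-zA-Z]', ' ', str(in_text))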
    return text

In [7]:
# Pre-process dataset to lower case it
def lower_case(in_text):
    # Convert to lowercase
    text = in_text.lower()    
    return text

In [8]:
# Pre-process dataset to remove tags
def remove_tags(in_text):    
    # Remove tags
    text = re.sub("</?.*?>", " <> ", in_text)
    return text

In [9]:
# Pre-process dataset to remove special characters and digits
def remove_special_chars_and_digits(in_text):
    # Remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",in_text)
    return text
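
To see what the three regex-based cleaners above actually keep, here is a minimal sketch that applies the same patterns to a short made-up sentence. The `sample` string is an assumption for illustration only, and the tag pattern is written with literal angle brackets.

```python
import re

# Hypothetical sample text (not from the notebook's document)
sample = "Justice Ginsburg’s 25th year <b>on</b> the bench."

# Same pattern as remove_punctuation: every non-ASCII-letter becomes a space
letters_only = re.sub('[^a-zA-Z]', ' ', sample)

# Same idea as remove_tags: each tag is replaced by the placeholder " <> "
no_tags = re.sub("</?.*?>", " <> ", sample)

# Same pattern as remove_special_chars_and_digits: runs of digits and
# non-word characters collapse into a single space
no_special = re.sub("(\\d|\\W)+", " ", sample)

print(letters_only.split())
print(no_tags.split())
print(no_special.split())
```

Note that the letters-only and special-character patterns both split "Ginsburg’s" into "Ginsburg" and "s" and reduce "25th" to "th", which may or may not be what a term extractor should see.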


In [10]:
# Pre-process dataset to apply Stemming
def apply_stemming(in_text):
    stemmer=PorterStemmer()
    word_list = nltk.word_tokenize(in_text)
    output = ' '.join([stemmer.stem(w) for w in word_list])
    return output

In [11]:
# Pre-process dataset to apply Lemmatization
def apply_lemmatization(in_text):
    # Lemmatization
    lem = WordNetLemmatizer()
    word_list = nltk.word_tokenize(in_text)
    output = ' '.join([lem.lemmatize(w) for w in word_list])
    return output

In [12]:
# Remove stop words
def remove_stop_words(in_text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(in_text)
    # Keep only the tokens that are not stop words. The comparison is case-sensitive,
    # so capitalized stop words such as "The" are kept unless the text is lower-cased first.
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return filtered_sentence
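
NLTK's English stop word list is lower-case, so a case-sensitive membership test like the one above keeps capitalized stop words such as "The". A minimal sketch of the difference, assuming a tiny hand-picked stop word set (not the NLTK list) and plain whitespace splitting instead of `word_tokenize`:

```python
# Tiny illustrative stand-in for the NLTK stop word list (an assumption)
demo_stop_words = {"the", "was", "to", "a"}

# Whitespace splitting keeps the sketch free of NLTK data downloads
tokens = "The film was a tribute to the court".split()

# Case-sensitive filter: "The" is not in the set, so it is kept
case_sensitive = [w for w in tokens if w not in demo_stop_words]

# Case-insensitive filter: compare the lower-cased token instead
case_insensitive = [w for w in tokens if w.lower() not in demo_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lower-casing only for the comparison keeps the original casing of the remaining tokens, which can matter for proper nouns.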

In [13]:
# Run Phrase Machine
def run_phrase_machine(in_text):
    phrases=phrasemachine.get_phrases(in_text)
    return phrases

In [14]:
#Run Rake Keyword Extractor
def run_rake(in_text):
    r = Rake()
    r.extract_keywords_from_text(in_text)
    rake_phrases= r.get_ranked_phrases()
    return rake_phrases
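
For intuition about how the ranked phrases are produced, here is a simplified pure-Python sketch of the RAKE idea: split the text into candidate phrases at stop words and punctuation, score each word by degree divided by frequency, and sum the word scores per phrase. This is not the `rake_nltk` implementation; the function name `rake_sketch`, the tokenizing regex, and the tiny stop word set are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def rake_sketch(text, stop_words):
    """Simplified RAKE-style ranking (illustrative, not rake_nltk's code)."""
    # Tokens are runs of ASCII letters or single non-space characters
    tokens = re.findall(r"[a-z]+|\S", text.lower())

    # Stop words and non-letter tokens act as phrase delimiters
    phrases, current = [], []
    for tok in tokens:
        is_word = tok.isascii() and tok.isalpha()
        if not is_word or tok in stop_words:
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(tuple(current))

    # freq: how often a word occurs; degree: total length of the phrases it occurs in
    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # Phrase score is the sum of degree/frequency over its words
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

demo_text = ("Justice Ginsburg argued cases before the Supreme Court. "
             "The Supreme Court agreed.")
ranked = rake_sketch(demo_text, stop_words={"the", "before"})
for phrase, score in ranked:
    print(" ".join(phrase), score)
```

Longer phrases tend to score higher under this metric because every word in them receives a larger degree.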

In [15]:
# Run NLTK Tokenizer
def run_nltk_tokenizer(in_text):
    tokens=nltk.word_tokenize(in_text)
    return tokens

In [16]:
# Run NLTK Sentence Tokenizer
def run_nltk_sent_tokenizer(in_corpus):
    sents = nltk.sent_tokenize(in_corpus)
    return sents

In [17]:
#Run word-ngram Tokenizer
def run_nltk_tokenizer_word_ngrams(in_text, ngram_size):
    n_grams = ngrams(nltk.word_tokenize(in_text), ngram_size)
    return [ ' '.join(grams) for grams in n_grams]
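
The word n-grams above are sliding windows over the token list. A minimal sketch of the same idea without NLTK, assuming the tokens are already available as a list (the helper name `word_ngrams` is hypothetical):

```python
def word_ngrams(tokens, n):
    # Slide a window of n tokens across the list and join each window with spaces
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

demo_tokens = ['Ruth', 'Bader', 'Ginsburg', 'was']
bigrams = word_ngrams(demo_tokens, 2)
trigrams = word_ngrams(demo_tokens, 3)
too_long = word_ngrams(demo_tokens, 5)  # window larger than the input

print(bigrams)
print(trigrams)
print(too_long)
```

A list of k tokens yields k - n + 1 n-grams, and none at all when n exceeds k.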

In [18]:
# Get the frequency distribution of the terms
def get_freq_dist(terms):
    all_counts = FreqDist(terms)
    return all_counts
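
`FreqDist` is a subclass of `collections.Counter`, so the usual counter operations apply to the result. A minimal sketch using a plain `Counter` as a stand-in, with a made-up term list (an assumption for illustration):

```python
from collections import Counter

# Hypothetical extracted terms
demo_terms = ['supreme court', 'justice ginsburg', 'supreme court', 'women']

# Count how often each term occurs
counts = Counter(demo_terms)

# The most frequent term and its count
top_term = counts.most_common(1)

# Terms that occur more than once
repeated = [term for term, n in counts.items() if n > 1]

print(top_term)
print(repeated)
```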

In [19]:
#Run this first to get sentences from text.
sentences=run_nltk_sent_tokenizer(document1)

In [21]:
# Explore different extractors and different preprocessing techniques
for sentence in sentences:
    print(sentence)
    print("===================NLTK Tokenizer===================")
    print(run_nltk_tokenizer(sentence))
    print("===================NLTK Word NGRAM Tokenizer 2 words===================")
    print(run_nltk_tokenizer_word_ngrams(sentence,2))
    print("===================NLTK Word NGRAM Tokenizer 3 words===================")
    print(run_nltk_tokenizer_word_ngrams(sentence,3))
    print("===================Phrase Machine===================")
    phrases=run_phrase_machine(sentence)
    for term in phrases["counts"].keys():
        print(term)
    print("===================Rake===================")
    print(run_rake(sentence))
    print("===================NLTK Tokenizer===================")
    print(run_nltk_tokenizer(sentence))
    print("===================NLTK Tokenizer LOWER CASE===================")
    print(run_nltk_tokenizer(lower_case(sentence)))
    print("===================NLTK Tokenizer REMOVE STOP WORDS===================")
    print(remove_stop_words(sentence))   
    print("===================NLTK Tokenizer REMOVED PUNCTUATION===================")
    print(run_nltk_tokenizer(remove_punctuation(sentence)))
    print("===================NLTK Tokenizer REMOVED TAGS===================")
    print(run_nltk_tokenizer(remove_tags(sentence)))
    print("===================NLTK Tokenizer REMOVED CHARS AND DIGITS===================")
    print(run_nltk_tokenizer(remove_special_chars_and_digits(sentence)))
    print("===================NLTK Tokenizer STEMMING APPLIED===================")
    print(run_nltk_tokenizer(apply_stemming(sentence)))
    print("===================NLTK Tokenizer LEMMATIZATION APPLIED===================")
    print(run_nltk_tokenizer(apply_lemmatization(sentence)))
    #break

Ruth Bader Ginsburg was the second woman appointed to the United States Supreme Court, but she’s probably the first justice to become a full-fledged pop-cultural phenomenon.
['Ruth', 'Bader', 'Ginsburg', 'was', 'the', 'second', 'woman', 'appointed', 'to', 'the', 'United', 'States', 'Supreme', 'Court', ',', 'but', 'she', '’', 's', 'probably', 'the', 'first', 'justice', 'to', 'become', 'a', 'full-fledged', 'pop-cultural', 'phenomenon', '.']
['Ruth Bader', 'Bader Ginsburg', 'Ginsburg was', 'was the', 'the second', 'second woman', 'woman appointed', 'appointed to', 'to the', 'the United', 'United States', 'States Supreme', 'Supreme Court', 'Court ,', ', but', 'but she', 'she ’', '’ s', 's probably', 'probably the', 'the first', 'first justice', 'justice to', 'to become', 'become a', 'a full-fledged', 'full-fledged pop-cultural', 'pop-cultural phenomenon', 'phenomenon .']
['Ruth Bader Ginsburg', 'Bader Ginsburg was', 'Ginsburg was the', 'was the second', 'the second woman', 'second woman ap

In [22]:
# Collect the terms from every sentence and compute their frequency distribution
all_terms = []
for sentence in sentences:
    print(sentence)
    # Pick your favorite term extractor
    all_terms.extend(run_rake(sentence))
# Get the frequency distribution across the terms
fd = get_freq_dist(all_terms)
fd


Ruth Bader Ginsburg was the second woman appointed to the United States Supreme Court, but she’s probably the first justice to become a full-fledged pop-cultural phenomenon.
“RBG,” a loving and informative documentary portrait of Justice Ginsburg during her 85th year on earth and her 25th on the bench, is both evidence of this status and a partial explanation of how it came about.Directed by Betsy West and Julie Cohen, the film is a jaunty assemblage of interviews, public appearances and archival material, organized to illuminate its subject’s temperament and her accomplishments so far.
Though it begins with audio snippets of Justice Ginsburg’s right-wing detractors — who see her as a “demon,” a “devil” and a threat to America — “RBG” takes a pointedly high road through recent political controversies.
Its celebration of Justice Ginsburg’s record of progressive activism and jurisprudence is partisan but not especially polemical.
The filmmakers share her convictions and assume that the a

FreqDist({'justice ginsburg ’': 4, '’': 2, 'become': 2, 'justice ginsburg': 2, 'far': 2, 'audience': 2, 'supreme court': 2, 'women': 2, 'also': 2, 'mr': 2, ...})
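
The distribution above counts 'justice ginsburg ’' and 'justice ginsburg' as separate terms, and the curly apostrophe even appears as a term of its own. One possible remedy, sketched here under the assumption that punctuation carries no meaning for the terms of interest, is to normalize each term before merging the counts. The helper `normalize_term` is hypothetical, and `raw_counts` copies only a few entries from the output above.

```python
import re
from collections import Counter

def normalize_term(term):
    # Replace characters that are neither word characters nor whitespace,
    # then collapse repeated whitespace
    cleaned = re.sub(r"[^\w\s]", " ", term)
    return " ".join(cleaned.split())

# A few entries taken from the FreqDist output above
raw_counts = {'justice ginsburg ’': 4, '’': 2, 'justice ginsburg': 2, 'supreme court': 2}

merged = Counter()
for term, count in raw_counts.items():
    key = normalize_term(term)
    if key:  # skip terms that were nothing but punctuation
        merged[key] += count

print(merged.most_common())
```

This normalization also splits hyphenated terms such as 'full-fledged' into two words, so it is a trade-off rather than a universal fix.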