# <center>Other NLP Packages: spaCy, Gensim, and Stanza (Stanford NLP)</center>

References: 
- https://nlpforhackers.io/complete-guide-to-spacy/
- https://radimrehurek.com/gensim/models/phrases.html
- https://stanfordnlp.github.io/stanza/

## 1. spaCy
- spaCy is a relatively new framework in the Python Natural Language Processing ecosystem, and it is quickly gaining popularity
- Provides models for Part Of Speech tagging, Named Entity Recognition and Dependency Parsing
<img src='https://spacy.io/pipeline-7a14d4edd18f3edfee8f34393bff2992.svg' width = "70%">
- Supports 8 languages out of the box
- Provides easy and beautiful visualizations
- Provides pretrained word vectors
- installation:
  1. `pip install spacy`
  2. `python -m spacy download en_core_web_sm` (the `en` shortcut works only in spaCy 2.x; spaCy 3 requires the full model name)

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# Installation

# ! pip install spacy --upgrade
# ! python -m spacy download en_core_web_sm

In [3]:
# Exercise 1.1. Load package and language library

import spacy
nlp = spacy.load('en_core_web_sm')

# if you downloaded en_core_web_sm use the following:
#import en_core_web_sm 
#nlp = en_core_web_sm.load()

In [4]:
# Exercise 1.2. Get POS tags, lemmas, and other token attributes in a single pass

doc = nlp("Next week I'll be in Madrid.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}".format(
        token.text,         # original text
        token.lemma_,       # lemma
        token.is_punct,     # is the token punctuation?
        token.is_space,     # is the token whitespace?
        token.pos_,         # The simple part-of-speech tag.
        token.tag_          # The detailed part-of-speech tag
    ))

Next	next	False	False	ADJ	JJ
week	week	False	False	NOUN	NN
I	I	False	False	PRON	PRP
'll	'll	False	False	AUX	MD
be	be	False	False	VERB	VB
in	in	False	False	ADP	IN
Madrid	Madrid	False	False	PROPN	NNP
.	.	True	False	PUNCT	.


In [5]:
# Exercise 1.3. Segment by sentences

doc = nlp("These are apples. These are oranges.")
 
for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.


In [6]:
# Exercise 1.4. Entity Recognition

doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, "\t\t", ent.label_)

2 		 CARDINAL
9 a.m. 		 TIME
30% 		 PERCENT
just 2 days 		 DATE
WSJ 		 ORG


In [7]:
# Exercise 1.5. Visualize named entities

from spacy import displacy
 
doc = nlp('I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)


In [8]:
# Exercise 1.6. Visualize the dependency graph

from spacy import displacy
 
doc = nlp("I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
 

## 2. Textacy

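Textacy is a Python library for performing a variety of NLP tasks, built on the high-performance spaCy library. With the fundamentals (tokenization, part-of-speech tagging, dependency parsing, etc.) delegated to spaCy, textacy focuses on the tasks that come before and after, such as text preprocessing and information extraction.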

For details, check https://textacy.readthedocs.io/en/latest/index.html

In [21]:
#Installation
! pip install textacy

from textacy import preprocessing



In [22]:
text = (
     "Since the so-called \"statistical revolution\" in the late 1980s and mid 1990s, "
     "much Natural Language Processing research has relied heavily on machine learning. "
     "Formerly, many language-processing tasks typically involved the direct hand coding "
     "of rules, which is not in general robust to natural language variation. "
     "The machine-learning paradigm calls instead for using statistical inference "
     "to automatically learn such rules through the analysis of large corpora "
     "of typical real-world examples."
 )

In [23]:
# remove punctuation

from textacy import preprocessing

text_remove_punct = preprocessing.remove.punctuation(text)
print(text_remove_punct)

# normalize whitespace (collapse repeated spaces and strip both ends)
preprocessing.normalize.whitespace(text_remove_punct)

Since the so called  statistical revolution  in the late 1980s and mid 1990s  much Natural Language Processing research has relied heavily on machine learning  Formerly  many language processing tasks typically involved the direct hand coding of rules  which is not in general robust to natural language variation  The machine learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real world examples 


'Since the so called statistical revolution in the late 1980s and mid 1990s much Natural Language Processing research has relied heavily on machine learning Formerly many language processing tasks typically involved the direct hand coding of rules which is not in general robust to natural language variation The machine learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real world examples'
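
To make the two preprocessing steps above more concrete, here is a rough standard-library approximation. It is only a sketch: the function names `remove_punct_sketch` and `normalize_ws_sketch` are hypothetical, it handles ASCII punctuation only (textacy also covers Unicode punctuation), and it collapses every whitespace run into one space, which may differ from how textacy treats line breaks.

```python
import re
import string

# Map every ASCII punctuation character to a space (assumption: ASCII only).
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})

def remove_punct_sketch(text: str) -> str:
    """Replace each ASCII punctuation mark with a single space."""
    return text.translate(_PUNCT_TO_SPACE)

def normalize_ws_sketch(text: str) -> str:
    """Collapse runs of whitespace into one space and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()

sample = 'the so-called "statistical revolution", in short.'
no_punct = remove_punct_sketch(sample)
cleaned = normalize_ws_sketch(no_punct)
print(repr(no_punct))
print(repr(cleaned))
```

Replacing punctuation with a space, rather than deleting it, keeps hyphenated words such as "so-called" from fusing into one token; that is why the whitespace pass is needed afterwards.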

In [24]:
# make spacy doc

import textacy

# load English language model
en = textacy.load_spacy_lang("en_core_web_sm")

doc = textacy.make_spacy_doc(text, lang=en)

In [25]:
# extract bigrams and trigrams
import textacy.extract

list(textacy.extract.ngrams(doc, n=(2, 3), filter_stops=True,
                            filter_punct=True, filter_nums=False))
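
The idea behind n-gram extraction with stop-word filtering can be sketched in plain Python. This is an illustration rather than textacy's implementation: `ngrams_sketch` and the tiny `STOP_WORDS` set are hypothetical, and the filter assumes the documented behavior of `filter_stops=True`, which drops n-grams that start or end with a stop word.

```python
from typing import Iterable, List, Tuple

# Tiny illustrative stop-word list (an assumption; spaCy ships a much larger one).
STOP_WORDS = {"the", "of", "in", "on", "to", "a", "is", "has", "and"}

def ngrams_sketch(tokens: List[str], sizes: Iterable[int]) -> List[Tuple[str, ...]]:
    """Return all n-grams of the given sizes, skipping those with a stop word at either edge."""
    found = []
    for n in sizes:
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[0].lower() in STOP_WORDS or gram[-1].lower() in STOP_WORDS:
                continue
            found.append(gram)
    return found

tokens = "research has relied heavily on machine learning".split()
result = ngrams_sketch(tokens, sizes=(2, 3))
for gram in result:
    print(" ".join(gram))
```

Note that a stop word in the middle of a trigram is allowed, so "heavily on machine" survives while "on machine learning" does not.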

In [None]:
# Extract key terms with TextRank, restricted to nouns, proper nouns, and adjectives

from textacy.extract import keyterms as kt
kt.textrank(doc, normalize="lemma", 
            include_pos =('NOUN', 'PROPN', 'ADJ'),
            window_size = 3,
            topn=10)

In [None]:
# Extract key terms with TextRank using the default candidate settings

from textacy.extract import keyterms as kt
kt.textrank(doc, normalize="lemma", topn=10)

## 3. gensim
- Gensim is an open source Python library for NLP, with a focus on topic modeling.
- It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of tools for unsupervised text modeling, including 
  - Word2Vec word embedding 
  - Topic modeling
  - Text preprocessing like **phrase extraction**
  
- Gensim Phrase Model: 
    - `gensim.models.phrases.Phrases(sentences, min_count, threshold, max_vocab_size, delimiter, scoring, ...)`
        - `sentences`: An iterable of tokenized sentences, each a list of tokens. A whole document can also be passed as a single "sentence".
        - `min_count`: Ignore all words and bigrams with total collected count lower than this value.
        - `threshold`: The score threshold for forming phrases (higher means fewer phrases). A phrase of words $a$ followed by $b$ is accepted if its score is greater than the threshold. The appropriate value depends heavily on the scoring function.
        - `max_vocab_size`: Maximum size (number of tokens) of the vocabulary. 
        - `delimiter`: Glue character used to join collocation tokens, e.g. `'_'`. Gensim 3.x expects a byte string (`b'_'`), while Gensim 4.x expects a plain string.
        - `scoring`: Specify how potential phrases are scored. 
           - `default` - original_scorer(), by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf)
           - `npmi` - npmi_scorer().
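
The two scoring options can be written out as small functions. The sketch below follows the formulas as I understand them from the Gensim documentation and Mikolov et al. (2013); the function names and all the counts are hypothetical, chosen only to show why `threshold` lives on very different scales for the two scorers.

```python
import math

def original_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Default score: (count(ab) - min_count) * |V| / (count(a) * count(b)); unbounded above."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

def npmi_score(worda_count, wordb_count, bigram_count, corpus_word_count):
    """Normalized pointwise mutual information: ln(p(ab) / (p(a) p(b))) / -ln(p(ab)), in [-1, 1]."""
    pa = worda_count / corpus_word_count
    pb = wordb_count / corpus_word_count
    pab = bigram_count / corpus_word_count
    return math.log(pab / (pa * pb)) / -math.log(pab)

# Hypothetical counts, made up for illustration only.
counts = dict(worda_count=40, wordb_count=50, bigram_count=30)
default_value = original_score(**counts, len_vocab=2000, min_count=5)
npmi_value = npmi_score(**counts, corpus_word_count=10000)
print(default_value)
print(npmi_value)
```

Because the default score grows with the vocabulary size, thresholds in the tens are common, whereas an NPMI threshold has to sit between -1 and 1.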

In [None]:
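# Read an online text file (Apple's annual disclosure)
import urllib.request

url = "https://www.sec.gov/Archives/edgar/data/320193/000091205700053623/a2032880z10-k.txt"

# Note: sec.gov may reject requests that lack a descriptive User-Agent header
with urllib.request.urlopen(url) as file:
    text = file.read().decode('utf-8')
print(text[:1000])  # preview only; the full filing is very long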

In [None]:
# Exercise 2.1. Find bigrams using gensim
import gensim
import nltk

from gensim.models.phrases import Phrases, Phraser


# Tokenize the lowercased text; each token has at least two characters
# and may contain internal apostrophes, commas, or hyphens
pattern = r'\w[\w\',-]*\w'
words = nltk.regexp_tokenize(text.lower(), pattern)

# Train phrase model to find phrases using the default original_scorer
phrases = Phrases([words], min_count=5, threshold=50)

# Get the unique set of phrases, sorted by score in descending order.
# Note: this call follows the Gensim 3.x API; in Gensim 4.x, export_phrases()
# takes no arguments and returns a {phrase: score} dict.
items = sorted(set(phrases.export_phrases([words])), key=lambda item: -item[1])

# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

In [None]:
# Exercise 2.2. Find bigrams by NPMI

# find phrases using NPMI

phrases = Phrases([words], min_count=5, threshold=0.5, \
                  scoring='npmi')

# get unique set of phrases and sorted by score in descending order
items = sorted(set(phrases.export_phrases([words])), key=lambda item: -item[1])

# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

In [None]:
# Exercise 2.3. Tokenize by unigrams and bigrams

# Initialize phrase tokenizer
bigram = Phraser(phrases)

sent="Improved profitability was driven by the 30% increase in net sales, stable overall gross margins in 2000 as compared to 1999, and a relatively modest increase in operating expenses before special charges of 18%."
print(bigram[nltk.word_tokenize(sent.lower())])
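
Conceptually, applying the phrase tokenizer is a left-to-right scan that glues together adjacent tokens forming an accepted phrase. The following is a simplified sketch of that idea, not Gensim's implementation: `merge_bigrams_sketch` and the `known` phrase set are hypothetical, and a trained model would decide membership from its scores and threshold instead of a fixed set.

```python
def merge_bigrams_sketch(tokens, known_phrases, delimiter="_"):
    """Scan left to right and join adjacent tokens that form a known phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in known_phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both tokens were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical phrase set, for illustration only.
known = {("net", "sales"), ("gross", "margins"), ("operating", "expenses")}
tokens = "the increase in net sales and stable gross margins".split()
merged_tokens = merge_bigrams_sketch(tokens, known)
print(merged_tokens)
```

The output mixes unigrams and joined bigrams, which is the same shape of result that `bigram[...]` returns in Exercise 2.3.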