# Kieft Jupyter Notebook:

# About: 
We created this notebook to prototype and document our research over the Fall 2021 Semester at Monmouth College.
We are currently implementing spaCy pipelines, including custom pipes, to improve our web-based text analysis, and this notebook documents that process. The spaCy pipelines will completely replace the analysis process within the dashboard code. This lets the user customize the analysis with little change to the code base. We will also begin to expose spaCy's advanced search features. The students currently working on this dashboard are Cal Bigham, Class of 2023, and Shay Hafner, Class of 2023.

# Adding/Deleting Pipes from the Pipeline
Before we get into the details of each pipe (or node) of the spaCy pipeline that we create, let's consider the spaCy methods for adding and removing pipes. spaCy provides the following methods on the nlp object:

remove_pipe: Removes a pipe from the pipeline entirely (it must be added again before it can be used).
disable_pipe: Disables a pipe (the pipe stays loaded but is not run); it can be turned back on with enable_pipe.
add_pipe: Adds a pipe to the pipeline, in addition to the defaults.
Using the before, after, first, and last keyword arguments of add_pipe, you can control where a new pipe is placed in the pipeline.
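
The three methods and the placement keywords can be tried without downloading a model. The following is a minimal sketch that assumes spaCy v3; the component name demo_noop is hypothetical and exists only to show where pipes end up.

```python
import spacy
from spacy.language import Language


# Hypothetical no-op component, used only to illustrate placement.
@Language.component("demo_noop")
def demo_noop(doc):
    return doc


nlp = spacy.blank("en")                 # empty English pipeline, no model needed
nlp.add_pipe("sentencizer")             # appended at the end by default
nlp.add_pipe("demo_noop", first=True)   # placed at the start of the pipeline
nlp.add_pipe("demo_noop", name="demo_noop_2", after="sentencizer")

order_after_add = list(nlp.pipe_names)
print(order_after_add)       # ['demo_noop', 'sentencizer', 'demo_noop_2']

nlp.disable_pipe("demo_noop")           # still loaded, but skipped when nlp runs
print(nlp.pipe_names)        # ['sentencizer', 'demo_noop_2']
print(nlp.disabled)          # ['demo_noop']

nlp.remove_pipe("demo_noop_2")          # gone until it is added again
print(nlp.pipe_names)        # ['sentencizer']
```

Note that pipe_names lists only the active pipes, while component_names also includes disabled ones.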

For our purposes, we will let the user define what they want removed from the pipeline. If the user does not want polarity scores, we would run the code block below:


In [None]:
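# The value must match a pipe name in nlp.pipe_names
# (this assumes the sentiment pipe was added under the name "polarity").
user_remove_options = "polarity"
nlp.disable_pipe(user_remove_options)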

This code takes a string from a form on the dashboard and uses it to disable that pipe in the default pipeline. Because disable_pipe accepts one pipe name per call, a list of strings needs one call per name. Notice that we only use disable here, as it is all we need in every scenario. Adding a pipe and then completely removing it would get messy (because order matters), so we will only use disable.
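
Since the dashboard form may supply either one name or several, a small helper can normalize the input and call disable_pipe once per name. This is a sketch only: the helper name disable_requested is hypothetical, and it assumes spaCy v3 and that unknown names should be skipped rather than raise an error.

```python
import spacy


def disable_requested(nlp, requested):
    """Disable each requested pipe that is currently active.

    `requested` may be a single name or a list of names.
    Returns the names that were actually disabled.
    """
    names = [requested] if isinstance(requested, str) else list(requested)
    disabled = []
    for name in names:
        if name in nlp.pipe_names:   # skip unknown or already-disabled names
            nlp.disable_pipe(name)
            disabled.append(name)
    return disabled


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

result = disable_requested(nlp, ["sentencizer", "not_a_pipe"])
print(result)           # ['sentencizer']
print(nlp.pipe_names)   # []
```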

# The Dependency Parser
The dependency parser is going to replace our sentencizer completely. The parser uses a trained model, loaded with the spaCy pipeline, to determine the sentence boundaries in the document. The sentences are accessible through doc.sents, an iterable of all the sentences in the document.

We can use the dependency parser as follows:

In [10]:
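!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_trf

import spacy

# Load the transformer pipeline; its dependency parser sets the sentence boundaries.
nlp = spacy.load("en_core_web_trf")
doc = nlp("The quick brown fox jumps over the lazy dog. Cal and Shay work in the networking lab. Monmouth College has a nice campus.")

# doc.sents yields one Span per sentence. (The original cell used `yield`,
# which is a SyntaxError outside a function.)
for sent in doc.sents:
    print(sent.text)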

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/models.py", line 382, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "/usr/lib/python3/dist-packages/urllib3/util/url.py", line 392, in parse_url
    return six.raise_from(LocationParseError(source_url), None)
  File "<string>", line 3, in raise_from
urllib3.exceptions.LocationParseError: Failed to parse: https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ac

OSError: [E050] Can't find model 'en_core_web_trf'. It doesn't seem to be a Python package or a valid path to a data directory.

# The Phrase Matcher (for filter and analysis terms)
We are able to search for phrases as we did in the original deployment of this web app, but now using spaCy's PhraseMatcher. Below is a code example:

In [None]:
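from spacy.matcher import PhraseMatcher

phrase_matcher = PhraseMatcher(nlp.vocab)
phrases = ['quick dog', 'cleared']
# make_doc only tokenizes, which is all the matcher needs for its patterns
patterns = [nlp.make_doc(phrase) for phrase in phrases]
# spaCy v3 signature: add(key, list_of_patterns)
phrase_matcher.add('AI', patterns)

# Sample text (an assumption; the original cell left `text` undefined)
text = "The quick dog cleared the fence. The lazy dog slept."
doc = nlp(text)

for sent in doc.sents:
    # The matcher accepts a Span, so the sentence is not parsed a second time
    for match_id, start, end in phrase_matcher(sent):
        if nlp.vocab.strings[match_id] in ["AI"]:
            print(sent.text)
            break  # print each matching sentence only once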

This code example uses the phrase matcher to search for the phrases in the document, then prints each sentence that contains a matching phrase. The beauty of this is that we no longer need regexes to search for multi-word phrases. We can take whatever the user enters in the filter/analysis term fields and pass it to the phrase matcher as a list of phrases. To chunk, which we will explore next, we just need to take the 2 sentences before and the 2 sentences after the match.

# The Chunking Behavior using spaCy
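A chunk, as described in the previous section, is the matching sentence plus the 2 sentences before it and the 2 after it. The following is a minimal sketch of that windowing over a plain list of sentences; the function name chunk_around and the window parameter are hypothetical, and with spaCy the list could come from list(doc.sents).

```python
def chunk_around(sentences, match_index, window=2):
    """Return the matching sentence with up to `window` sentences on each side.

    The slice is clamped at both ends, so matches near the start or end
    of the document simply produce a shorter chunk.
    """
    start = max(0, match_index - window)
    end = min(len(sentences), match_index + window + 1)
    return sentences[start:end]


# Placeholder sentences standing in for the text of each sentence in a document.
sentences = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5."]

print(chunk_around(sentences, 3))   # ['S1.', 'S2.', 'S3.', 'S4.', 'S5.']
print(chunk_around(sentences, 0))   # ['S0.', 'S1.', 'S2.']
```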


# Adding NLTK VADER Functionality to spaCy
In order to allow for maximum user customizability, as well as the best experience (in our eyes), we wanted to create a custom spaCy "pipe" that provides NLTK VADER functionality. This allows the user to get the polarity of the text.

You will see a @Language.factory decorator above our code. This tells spaCy the name of the custom pipe (e.g. spacyVader), which lets the typical spaCy functionality of remove_pipe and add_pipe work with it. This limits the amount of code required to add and remove processes from the default pipeline.

In [8]:
from spacy.tokens import Doc, Span, Token
from spacy.language import Language

# Requires the VADER lexicon: run nltk.download("vader_lexicon") once beforehand.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

@Language.factory("spacyVader")
class SpacyVader:
    """A spaCy pipeline component for NLTK VADER sentiment analysis."""

    def __init__(self, nlp, name):
        # Build the analyzer once instead of on every call.
        self.analyzer = SentimentIntensityAnalyzer()
        extensions = ["polarity"]
        getters = [self.get_polarity]
        for ext, get in zip(extensions, getters):
            # Doc polarity is stored by __call__; Span and Token polarity
            # are computed on access through the getter.
            if not Doc.has_extension(ext):
                Doc.set_extension(ext, default=None)
            if not Span.has_extension(ext):
                Span.set_extension(ext, getter=get)
            if not Token.has_extension(ext):
                Token.set_extension(ext, getter=get)

    def __call__(self, doc):
        # Doc-level sentiment: store VADER's compound score in doc._.polarity
        sentiment = self.get_sentiment(doc)
        doc._.set("polarity", sentiment['compound'])

        return doc

    def get_sentiment(self, obj):
        # obj may be a Doc, Span, or Token; all of them expose .text
        return self.analyzer.polarity_scores(obj.text)

    def get_polarity(self, obj):
        # polarity_scores returns a dict, so index it instead of using .polarity
        return self.get_sentiment(obj)['compound']

TypeError: <class '__main__.SpacyVader'> is a built-in class
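
The TypeError shown above has the wording that Python's inspect module uses when it cannot find a source file for a class, which can happen for classes defined inside a notebook. That this is the cause here is not confirmed. One workaround worth trying, sketched under that assumption, is to register a plain factory function that returns the component instead of decorating the class itself. The names DemoPolarity, create_demo_polarity, and demo_polarity are hypothetical stand-ins, and the score is a placeholder so the sketch runs without NLTK.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


class DemoPolarity:
    """Hypothetical stand-in for SpacyVader; it stores a placeholder score."""

    def __init__(self, nlp, name):
        if not Doc.has_extension("demo_polarity"):
            Doc.set_extension("demo_polarity", default=None)

    def __call__(self, doc):
        doc._.demo_polarity = 0.0   # placeholder, not a real sentiment score
        return doc


# Register a function as the factory; spaCy calls it with (nlp, name)
# whenever the pipe is added, and it returns the component instance.
@Language.factory("demo_polarity")
def create_demo_polarity(nlp, name):
    return DemoPolarity(nlp, name)


nlp = spacy.blank("en")
nlp.add_pipe("demo_polarity")
doc = nlp("Monmouth College has a nice campus.")

print(nlp.pipe_names)         # ['demo_polarity']
print(doc._.demo_polarity)    # 0.0
```

The pipe can still be added, disabled, and removed by its registered name, so the rest of the pipeline code does not need to change.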