#### INSPECTING THE PIPELINE

In [1]:
reg_d = '''Under the federal securities laws, any offer or sale of a security must either be registered with the SEC or meet an exemption. 
Regulation D under the Securities Act provides a number of exemptions from the registration requirements, allowing some companies 
to offer and sell their securities without having to register the offering with the SEC. For more information about these exemptions, 
see our Fast Answers on Rules 504 and 506 of Regulation D.Companies that comply with the requirements of Regulation D do not have to 
register their offering of securities with the SEC, but they must file what’s known as a "Form D" electronically with the SEC after 
they first sell their securities. Form D is a brief notice that includes the names and addresses of the company’s promoters, executive 
officers and directors, and some details about the offering, but contains little other information about the company. You can access the 
SEC’s EDGAR database to determine whether the company has filed a Form D. Even if a company takes advantage of an exemption from registration, 
a company should take care to provide sufficient information to investors to avoid violating the antifraud provisions of the securities laws. 
This means that any information a company provides to investors must be free from false or misleading statements. Similarly, a company should not 
exclude any information if the omission makes what is provided to investors false or misleading. You should always check with your state securities 
regulator to see if they have more information about the company and the people behind it. Be sure to ask whether your state regulator has received 
notice of the offering or, in the case of a Rule 504 offering, cleared the offering for sale in your state. You can get the address and telephone number 
for your state securities regulator by calling the North American Securities Administrators Association at (202) 737-0900 or by visiting its website.'''

In [2]:
import spacy 
nlp = spacy.load("en_core_web_sm")

In [6]:
print(nlp.pipe_names)
print(nlp.pipeline)

['tagger', 'parser', 'ner']
[('tagger', <spacy.pipeline.pipes.Tagger object at 0x000001E7D476F438>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x000001E7D5D84108>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x000001E7D5D84168>)]


#### Adding Custom Components

In [10]:
#example
def length_component(doc):
    doc_len = len(doc)
    print("doc is {} tokens long".format(doc_len))
    return doc

In [11]:
nlp = spacy.load("en_core_web_sm")
#add the component to the pipeline
nlp.add_pipe(length_component, first=True)  # alternatives: last=True, before="name", after="name"

In [12]:
doc = nlp("This is a sentence.")

doc is 5 tokens long


##### Complex Components

In [20]:
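from spacy.matcher import PhraseMatcher
from spacy.tokens import Span  # needed by animal_component below
matcher = PhraseMatcher(nlp.vocab)
# NOTE: no patterns are added here, so the matcher finds nothing until matcher.add(...) is called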

In [21]:
def animal_component(doc):
    # apply the matcher to the doc
    matches = matcher(doc)
    # create a Span with the label 'ANIMALS' for each match
    spans = [Span(doc, start, end, label='ANIMALS') for match_id, start, end in matches]

    # overwrite doc.ents with the matched spans
    doc.ents = spans
    return doc

In [22]:
# add the component to the pipeline (re-running this cell without reloading nlp raises E007, as in the output below)
nlp.add_pipe(animal_component, after='ner')

doc = nlp("I have a cat and a Golden Retriever")
print([(ent.text, ent.label_) for ent in doc.ents])

ValueError: [E007] 'animal_component' already exists in pipeline. Existing names: ['length_component', 'tagger', 'parser', 'ner', 'animal_component']
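
The placement options (`first`, `before`, `after`) and the duplicate-name error above can be illustrated with a toy model. This is only a sketch in plain Python, not spaCy's implementation: `ToyPipeline` and the stand-in components are hypothetical names, and a "doc" here is just a list that records which components touched it.

```python
class ToyPipeline:
    """Toy stand-in for a pipeline: an ordered list of (name, component) pairs."""

    def __init__(self):
        self.components = []

    @property
    def pipe_names(self):
        return [name for name, _ in self.components]

    def add_pipe(self, component, first=False, before=None, after=None):
        name = component.__name__
        # Refuse duplicates, mirroring the kind of error shown above
        if name in self.pipe_names:
            raise ValueError("'{}' already exists in pipeline".format(name))
        if first:
            index = 0
        elif before is not None:
            index = self.pipe_names.index(before)
        elif after is not None:
            index = self.pipe_names.index(after) + 1
        else:
            index = len(self.components)  # default: append at the end
        self.components.insert(index, (name, component))

    def __call__(self, doc):
        # Each component receives the doc, may modify it, and returns it
        for _, component in self.components:
            doc = component(doc)
        return doc


# Hypothetical stand-in components: each appends its own name to the "doc"
def tagger(doc):
    return doc + ["tagger"]

def ner(doc):
    return doc + ["ner"]

def length_component(doc):
    return doc + ["length_component"]


toy = ToyPipeline()
toy.add_pipe(tagger)
toy.add_pipe(ner)
toy.add_pipe(length_component, first=True)
print(toy.pipe_names)  # ['length_component', 'tagger', 'ner']
print(toy([]))         # components run in pipeline order

try:
    toy.add_pipe(length_component)  # adding the same name twice fails
except ValueError as err:
    print("ValueError:", err)
```

In the notebook itself, reloading the model with `spacy.load` before calling `add_pipe` again avoids the duplicate, since the reloaded pipeline starts without the custom components.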

#### SETTING EXTENSION ATTRIBUTES TO DOCS | SPANS | TOKENS

In [26]:
# register the Token extension attribute "is_country" with a default value of False
from spacy.tokens import Doc, Span, Token
nlp = spacy.load("en_core_web_sm")

Token.set_extension("is_country", default=False)


In [29]:
# process the text and set "is_country" to True for the token "Spain"
doc = nlp("I live in Spain.")
# doc[3] is the token "Spain"
doc[3]._.is_country = True
print([(token.text, token._.is_country) for token in doc] )

[('I', False), ('live', False), ('in', False), ('Spain', True), ('.', False)]


In [30]:
# define getter function taking token and returning the reversed text
def get_reversed(token):
    return token.text[::-1]

In [31]:
#register token property extension with the getter function
Token.set_extension("reversed", getter=get_reversed)

#process text and print reversed
doc = nlp("All generalizations are false, including this one.")
for token in doc:
    print('reversed: ', token._.reversed)

reversed:  llA
reversed:  snoitazilareneg
reversed:  era
reversed:  eslaf
reversed:  ,
reversed:  gnidulcni
reversed:  siht
reversed:  eno
reversed:  .


##### Complex Attributes

In [32]:
# define the getter function
def get_has_number(doc):
    # return True if any token in the doc looks like a number
    return any(token.like_num for token in doc)

# register the Doc property extension "has_number" with the getter get_has_number
Doc.set_extension("has_number", getter=get_has_number)

# process the text and check the custom attribute
doc = nlp("The museum closed for five years in 2012.")
print("has_number:", doc._.has_number)

has_number: True


In [34]:
# define a method that wraps a span in an HTML tag
def to_html(span, tag):
    # wrap the span text in the given HTML tag and return it
    return '<{tag}>{text}</{tag}>'.format(tag=tag, text=span.text)

# register the Span extension "to_html" as a method (instead of a getter)
Span.set_extension("to_html", method=to_html)

#process text and check for tags
doc = nlp("Hello World, this is a sentence.")
span = doc[0:2]
print(span._.to_html("strong"))

<strong>Hello World</strong>


In [35]:
def get_wikipedia_url(span):
    # Get a Wikipedia URL if the span has one of the labels
    if span.label_ in ('PERSON', 'ORG', 'GPE', 'LOC'):
        entity_text = span.text.replace(' ', '_')
        return "https://en.wikipedia.org/w/index.php?search=" + entity_text

# Set the Span extension wikipedia_url using the getter get_wikipedia_url
Span.set_extension('wikipedia_url', getter=get_wikipedia_url)

doc = nlp("In over fifty years from his very first recordings right through to his last album, David Bowie was at the vanguard of contemporary culture.")
for ent in doc.ents:
    # Print the text and Wikipedia URL of the entity
    print(ent.text, ent._.wikipedia_url)

over fifty years None
first None
David Bowie https://en.wikipedia.org/w/index.php?search=David_Bowie
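
The getter above builds the URL by replacing spaces with underscores, which leaves characters such as `&` unescaped. As a hedged alternative, the query string could be built with the standard library's `urlencode`. `build_search_url` is a hypothetical helper that works on plain strings and is not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://en.wikipedia.org/w/index.php"

def build_search_url(entity_text):
    # urlencode escapes spaces, ampersands and other reserved characters
    return BASE_URL + "?" + urlencode({"search": entity_text})

print(build_search_url("David Bowie"))
print(build_search_url("AT&T"))
```

Inside a getter, the same helper could be called with `span.text`; note that it encodes spaces as `+` rather than `_`, so the URLs differ in form from the ones printed above.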


##### Components with extensions

In [36]:
def countries_component(doc):
    # Create an entity Span with the label 'GPE' for all matches
    doc.ents = [Span(doc, start, end, label="GPE")
                for match_id, start, end in matcher(doc)]
    return doc

# Add the component to the pipeline
nlp.add_pipe(countries_component)
print(nlp.pipe_names)

['tagger', 'parser', 'ner', 'countries_component']


In [39]:
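# Register the Span extension 'capital' with a getter that looks up the span text in `capitals`.
# NOTE: `capitals` and the matcher patterns are not defined in this notebook; no entities are found, so the getter never runs.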
Span.set_extension('capital', getter=lambda span: capitals.get(span.text), force=True)

# Process the text and print the entity text, label and capital attributes
doc = nlp("Czech Republic may help Slovakia protect its airspace")
print([(ent.text, ent.label_, ent._.capital) for ent in doc.ents])

[]
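
The notebook never defines the `capitals` lookup. A minimal sketch of what it might look like follows; the dictionary contents and the `get_capital` helper are assumptions added for illustration, and the lookup is shown on plain strings rather than spans.

```python
# Hypothetical lookup table mapping country names to capitals
capitals = {
    "Czech Republic": "Prague",
    "Slovakia": "Bratislava",
}

def get_capital(text):
    # dict.get returns None instead of raising KeyError for unknown names
    return capitals.get(text)

print(get_capital("Slovakia"))
print(get_capital("Spain"))
```

Using `.get` means an entity that is not in the table yields `None` instead of an exception, which suits a getter that runs on every span.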


#### PROCESSING STREAMS

In [44]:
TEXTS = ['McDonalds is my favorite restaurant.',
 'Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..',
 'People really still eat McDonalds :(',
 'The McDonalds in Spain has chicken wings. My heart is so happy ',
 '@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P',
 'please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D',
 'This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it']

# BAD EXAMPLE: call nlp on each text separately
#   docs = [nlp(text) for text in LOTS_OF_TEXTS]

# GOOD EXAMPLE: stream the texts through nlp.pipe
#   docs = list(nlp.pipe(LOTS_OF_TEXTS))
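
The reason streaming tends to be faster is that texts are handled in batches rather than one call per text. The sketch below illustrates only the batching idea in plain Python; it is not spaCy's implementation, and `batched`, `toy_pipe`, and `count_tokens` are hypothetical names.

```python
from itertools import islice

def batched(texts, batch_size):
    # Yield successive lists of up to batch_size items from any iterable
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def toy_pipe(texts, process_batch, batch_size=3):
    # Process the stream batch by batch, yielding results one at a time
    for batch in batched(texts, batch_size):
        for result in process_batch(batch):
            yield result

calls = []  # records the size of each batch passed to the processing function

def count_tokens(batch):
    calls.append(len(batch))
    return [len(text.split()) for text in batch]

sample = ["one", "two words", "three little words", "four", "five five"]
results = list(toy_pipe(sample, count_tokens, batch_size=2))
print(results)  # whitespace token counts, in input order
print(calls)    # the processing function ran once per batch
```

Because `toy_pipe` is a generator, results are produced lazily, just as `nlp.pipe` yields `Doc` objects that can be consumed in a loop or collected with `list(...)`.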

In [42]:
for doc in nlp.pipe(TEXTS):
    print([token.text for token in doc if token.pos_ == 'ADJ'])

['favorite']
['sick']
[]
['happy']
['delicious', 'fast']
[]
['terrible', 'gettin', 'payin']


In [48]:
# Process the texts and print the entities
docs = list(nlp.pipe(TEXTS))
entities = [doc.ents for doc in docs]
print(*entities)

() () () () () () ()


[McDonalds is my favorite restaurant.,
 Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..,
 People really still eat McDonalds :(,
 The McDonalds in Spain has chicken wings. My heart is so happy ,
 @McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P,
 please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D,
 This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it]

In [51]:
people = ['David Bowie', 'Angela Merkel', 'Lady Gaga']

# Create a list of patterns for the PhraseMatcher
patterns = list(nlp.pipe(people))
patterns

[David Bowie, Angela Merkel, Lady Gaga]

In [52]:
# Import the Doc class and register the extensions 'author' and 'book'
from spacy.tokens import Doc
Doc.set_extension('book', default=None)
Doc.set_extension('author', default=None)

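# NOTE: DATA should be a list of (text, context) tuples; it is never defined in this notebook, hence the NameError below
for doc, context in nlp.pipe(DATA, as_tuples=True):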
    # Set the doc._.book and doc._.author attributes from the context
    doc._.book = context['book']
    doc._.author = context['author']
    
    # Print the text and custom attribute data
    print(doc.text, '\n', "— '{}' by {}".format(doc._.book, doc._.author), '\n')

NameError: name 'DATA' is not defined
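
The `NameError` arises because `DATA` was never created. With `as_tuples=True`, the input is a sequence of `(text, context)` pairs and the context is passed through alongside each processed text. The sketch below shows that shape with placeholder data and a plain-Python stand-in for the pipeline; the sample entries and `toy_pipe_as_tuples` are hypothetical.

```python
# Placeholder data: each item is (text, context dict)
DATA = [
    ("First example sentence.", {"book": "Hypothetical Book A", "author": "Author A"}),
    ("Second example sentence.", {"book": "Hypothetical Book B", "author": "Author B"}),
]

def toy_pipe_as_tuples(data, process=str.upper):
    # Process only the text and hand the context back unchanged
    for text, context in data:
        yield process(text), context

for doc, context in toy_pipe_as_tuples(DATA):
    print(doc, "|", context["book"], "by", context["author"])
```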

In [17]:
def stop_words(doc):
    # get spaCy's default stop word list (312 words)
    spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
    from string import punctuation
    punct = set(punctuation)
    ALL_STOPS = spacy_stopwords.union(punct)
    # add a custom stop word
    #ALL_STOPS.add("")
    # flag every stop word and punctuation character in the vocab
    for stopword in ALL_STOPS:
        lexeme = nlp.vocab[stopword]
        lexeme.is_stop = True
    return doc

In [6]:
nlp.add_pipe(stop_words, after='tagger')
print(nlp.pipe_names)

['tagger', 'stop_words', 'parser', 'ner']


In [7]:
doc = nlp(reg_d)

In [9]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
len(spacy_stopwords)

312

In [13]:
from string import punctuation


In [14]:
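# NOTE: add() inserts the whole punctuation string as one entry; update(punctuation) would add each character
spacy_stopwords.add(punctuation)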
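
The difference between `set.add` and `set.update` matters here, as this small check with a two-word stand-in set shows. The set contents are illustrative only, not spaCy's list.

```python
from string import punctuation

stops_add = {"a", "the"}
stops_add.add(punctuation)        # adds ONE element: the whole punctuation string

stops_update = {"a", "the"}
stops_update.update(punctuation)  # adds each punctuation character separately

print(len(stops_add), len(stops_update))
print("," in stops_add, "," in stops_update)
```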

In [16]:
spacy_stopwords

{'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~',
 "'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'former

In [7]:
nlp.vocab["message"].is_stop

False

In [5]:
from string import punctuation
def stop_words(doc):
    # get spaCy's default stop word list (312 words)
    spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
    punct = set(punctuation)
    ALL_STOPS = spacy_stopwords.union(punct)
    # flag every stop word and punctuation character in the vocab
    for stopword in ALL_STOPS:
        lexeme = nlp.vocab[stopword]
        lexeme.is_stop = True
    return doc
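
The component above only sets the `is_stop` flags and returns the doc unchanged; it does not filter anything. A sketch of the filtering step on plain strings is given below, assuming a small hand-picked stop word subset rather than spaCy's full list. `STOPWORDS_SUBSET` and `remove_stops` are hypothetical names.

```python
from string import punctuation

# Small illustrative subset, NOT spaCy's full stop word list
STOPWORDS_SUBSET = {"the", "a", "of", "with", "or", "be", "must", "any", "an"}
# union() with a string adds each punctuation character individually
ALL_STOPS = STOPWORDS_SUBSET.union(punctuation)

def remove_stops(tokens):
    # Keep tokens whose lowercase form is neither a stop word nor punctuation
    return [tok for tok in tokens if tok.lower() not in ALL_STOPS]

tokens = ["Any", "offer", "or", "sale", "of", "a", "security",
          "must", "be", "registered", "."]
print(remove_stops(tokens))
```

With a processed spaCy `Doc`, the equivalent filter would be a comprehension over tokens that checks `token.is_stop` and `token.is_punct`.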

In [18]:
stop_words(doc)

Under the federal securities laws, any offer or sale of a security must either be registered with the SEC or meet an exemption. 
Regulation D under the Securities Act provides a number of exemptions from the registration requirements, allowing some companies 
to offer and sell their securities without having to register the offering with the SEC. For more information about these exemptions, 
see our Fast Answers on Rules 504 and 506 of Regulation D.Companies that comply with the requirements of Regulation D do not have to 
register their offering of securities with the SEC, but they must file what’s known as a "Form D" electronically with the SEC after 
they first sell their securities. Form D is a brief notice that includes the names and addresses of the company’s promoters, executive 
officers and directors, and some details about the offering, but contains little other information about the company. You can access the 
SEC’s EDGAR database to determine whether the company has filed 