GETTING STARTED WITH SPACY

Finding words, phrases, names and concepts

In [4]:
text = """Under the federal securities laws, any offer or sale of a security must either be registered with the SEC or meet an exemption. 
Regulation D under the Securities Act provides a number of exemptions from the registration requirements, allowing some companies 
to offer and sell their securities without having to register the offering with the SEC. For more information about these exemptions, 
see our Fast Answers on Rules 504 and 506 of Regulation D.Companies that comply with the requirements of Regulation D do not have to 
register their offering of securities with the SEC, but they must file what’s known as a "Form D" electronically with the SEC after 
they first sell their securities. Form D is a brief notice that includes the names and addresses of the company’s promoters, executive 
officers and directors, and some details about the offering, but contains little other information about the company. You can access the 
SEC’s EDGAR database to determine whether the company has filed a Form D. Even if a company takes advantage of an exemption from registration, 
a company should take care to provide sufficient information to investors to avoid violating the antifraud provisions of the securities laws. 
This means that any information a company provides to investors must be free from false or misleading statements. Similarly, a company should not 
exclude any information if the omission makes what is provided to investors false or misleading. You should always check with your state securities 
regulator to see if they have more information about the company and the people behind it. Be sure to ask whether your state regulator has received 
notice of the offering or, in the case of a Rule 504 offering, cleared the offering for sale in your state. You can get the address and telephone number 
for your state securities regulator by calling the North American Securities Administrators Association at (202) 737-0900 or by visiting its website."""

Loading Models

In [5]:
#import the language class
from spacy.lang.en import English

In [6]:
#create NLP object
nlp = English()

#process a text 
doc = nlp(text)

In [7]:
doc.text

'Under the federal securities laws, any offer or sale of a security must either be registered with the SEC or meet an exemption. \nRegulation D under the Securities Act provides a number of exemptions from the registration requirements, allowing some companies \nto offer and sell their securities without having to register the offering with the SEC. For more information about these exemptions, \nsee our Fast Answers on Rules 504 and 506 of Regulation D.Companies that comply with the requirements of Regulation D do not have to \nregister their offering of securities with the SEC, but they must file what’s known as a "Form D" electronically with the SEC after \nthey first sell their securities. Form D is a brief notice that includes the names and addresses of the company’s promoters, executive \nofficers and directors, and some details about the offering, but contains little other information about the company. You can access the \nSEC’s EDGAR database to determine whether the company ha

Different languages

In [1]:
from spacy.lang.de import German

In [3]:
NLPg = German()
document = NLPg("Liebe Grüße!")
document

Liebe Grüße!
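The language classes used so far build a pipeline that contains only a tokenizer. The sketch below shows the same thing with spacy.blank, which creates an empty pipeline from a language code, and checks that nothing beyond tokenization is available. The names nlp_de and doc_de are illustrative and not part of the notebook above.

```python
import spacy

# spacy.blank("de") builds an empty German pipeline, comparable to German()
nlp_de = spacy.blank("de")
doc_de = nlp_de("Liebe Grüße!")

# A blank pipeline has no trained components, only the tokenizer
print(nlp_de.pipe_names)
print([token.text for token in doc_de])

# Without an entity recognizer no entities are predicted
print(doc_de.ents)
```

This is why the later sections load a trained model: part-of-speech tags, dependencies and entities need statistical components.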

DOCUMENTS | SPANS | TOKENS

In [8]:
#select the first token in the doc
first_token = doc[0]

In [9]:
first_token.text

'Under'

In [10]:
#slice of doc for "federal securities"
fed_sec = doc[2:4]
fed_sec

federal securities

In [12]:
#slice of doc for "federal securities laws."
fed_sec_laws = doc[2:5]
fed_sec_laws

federal securities laws
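A slice of a Doc is a Span, which is a view onto the document rather than a copy, so its tokens keep their positions in the parent document. A minimal sketch, assuming a blank English pipeline and a shortened version of the first sentence; nlp_blank and doc_example are illustrative names.

```python
import spacy

nlp_blank = spacy.blank("en")
doc_example = nlp_blank("Under the federal securities laws, any offer or sale must be registered.")

# Slicing uses token indices; the end index is exclusive
span = doc_example[2:5]
print(span.text)

# A span remembers where it starts and ends in the parent doc
print(span.start, span.end)

# token.i is the index in the doc, not the position inside the span
print([(token.i, token.text) for token in span])
```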

LEXICAL ATTRIBUTES

In [13]:
#process the text
lex_doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are.")

In [15]:
#iterate over the tokens in the doc 
for token in lex_doc:
    #check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(lex_doc):
        #get the next token in the document
        next_token = lex_doc[token.i + 1]
        #check if the next token text is %
        if next_token.text == '%':
            print('Percentage found: ', token.text)

Percentage found:  60
Percentage found:  4
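like_num is one of several lexical attributes that depend only on the token text, so they work without a trained model. The sketch below lists a few of them side by side for the second sentence; the selection of attributes and the variable names are illustrative.

```python
from spacy.lang.en import English

nlp_en = English()
lex_example = nlp_en("Now less than 4% are.")

# Lexical attributes are computed from the token text alone
for token in lex_example:
    print("{:<6}{:<8}{:<8}{:<8}".format(
        token.text, str(token.is_alpha), str(token.like_num), str(token.is_punct)))

# Collect the tokens that look like numbers or consist of letters only
numbers = [token.text for token in lex_example if token.like_num]
alphabetic = [token.text for token in lex_example if token.is_alpha]
print(numbers, alphabetic)
```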


STATISTICAL MODELS

In [6]:
#English core_web models come in three sizes: sm, md, lg

In [6]:
import spacy 
NLP = spacy.load("en_core_web_sm")

In [8]:
words = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

In [9]:
doc = NLP(words)
doc.text

'It’s official: Apple is the first U.S. public company to reach a $1 trillion market value'

In [2]:
#NEWS ARTICLE MODEL
NLPnews = spacy.load("de_core_news_sm")

In [3]:
article = "Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht"


In [4]:
#use a separate name so the English doc above is not overwritten
doc_de = NLPnews(article)
doc_de.text

'Als erstes Unternehmen der Börsengeschichte hat Apple einen Marktwert von einer Billion US-Dollar erreicht'

PREDICTING LINGUISTIC ANNOTATIONS

In [12]:
for token in doc:
    #get token text, part-of-speech tag, dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_

    #format the results in aligned columns
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))
    #unformatted alternative:
    #print(token_text, token_pos, token_dep)

It          PRON      nsubj     
’s          PROPN     ROOT      
official    NOUN      acomp     
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          VERB      ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


In [27]:
#look up the meaning of a part-of-speech tag
spacy.explain('SYM')

'symbol'

In [28]:
#check the meaning of 'amod' for dependency
spacy.explain('amod')

'adjectival modifier'

In [30]:
#NER example
#iterate over the predicted entities
for ent in doc.ents:
    #print the ent text and label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


In [13]:
from spacy import displacy
displacy.render(doc, style='ent')
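Outside a notebook, displacy.render returns the markup as a string instead of displaying it. The sketch below uses displaCy's manual mode, where the entities are supplied as character offsets in a dictionary rather than predicted by a model; the two entities are copied from the output above and the variable names are illustrative.

```python
from spacy import displacy

sentence = "Apple is the first U.S. public company to reach a $1 trillion market value"

def char_span(text, phrase, label):
    # Build one manual-mode entity from the first occurrence of a phrase
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

example = {
    "text": sentence,
    "ents": [char_span(sentence, "Apple", "ORG"), char_span(sentence, "U.S.", "GPE")],
    "title": None,
}

# manual=True skips the Doc and renders the dictionary directly;
# jupyter=False makes render return the HTML string
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html[:80])
```

The returned string can be written to an .html file and opened in a browser.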

PREDICTING NAMED ENTITIES IN CONTEXT 

In [31]:
#what to do when the model is not correct
sent = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

In [32]:
doc = NLP(sent)

In [36]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG


In [14]:
#manually get the span for "iPhone X"
iphone_X = doc[1:3]
print("Missing entity: ", iphone_X.text)

Missing entity:  iPhone X
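Once the missing span is located, it can be given a label and added to the document's entities. A minimal sketch, assuming a blank English pipeline so that no model is needed; the label "PRODUCT" and the names nlp_blank, doc_example and iphone_x are illustrative choices, not output of the notebook above.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc_example = nlp_blank("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

# Create a labelled span over tokens 1 and 2; "PRODUCT" is an assumed label
iphone_x = Span(doc_example, 1, 3, label="PRODUCT")

# doc.ents is a tuple, so build a new list and assign it back
doc_example.ents = list(doc_example.ents) + [iphone_x]
print([(ent.text, ent.label_) for ent in doc_example.ents])
```

With a trained model, the predicted entities such as Apple would already be in doc.ents and the new span would be appended to them.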


USING THE RULE BASED MATCHER

In [18]:
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")

In [21]:
#what to do when the model is not correct
sent = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"
doc = nlp(sent)

In [23]:
#initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

#add the pattern to the matcher
#(spaCy v2 signature; spaCy v3 expects matcher.add("IPHONE_X_PATTERN", [pattern]))
matcher.add("IPHONE_X_PATTERN", None, pattern)

#use the matcher on the doc 
matches = matcher(doc)

#show results
print("Matches", [doc[start:end].text for match_id, start, end in matches])

Matches ['iPhone X']
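The call matcher.add(name, None, pattern) follows the spaCy v2 signature. The sketch below uses the list form, matcher.add(name, [pattern]), which spaCy v3 requires and which, to my knowledge, later v2 releases also accept. Because the pattern only looks at token text, a blank pipeline is enough here; nlp_blank and the other names are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
doc_example = nlp_blank("New iPhone X release date leaked as Apple reveals pre-orders by mistake")

matcher_v3 = Matcher(nlp_blank.vocab)
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# The second argument is a list of patterns, so one name can hold several
matcher_v3.add("IPHONE_X_PATTERN", [pattern])

matched_texts = [doc_example[start:end].text for match_id, start, end in matcher_v3(doc_example)]
print("Matches", matched_texts)
```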


WRITING MATCH PATTERNS

In [25]:
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")


In [26]:
#write a pattern for full iOS versions ("iOS" followed by a number)
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]

#add the pattern to matcher and apply matcher to the doc 
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found: ", len(matches))

Total matches found:  3


In [46]:
#iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found: ", doc[start:end].text)

Match found:  iOS 7
Match found:  iOS 11
Match found:  iOS 10
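The same matcher now holds more than one pattern, so the span text alone does not say which pattern produced a match. The match_id is a hash that can be turned back into the pattern name through the vocab's string store. A minimal sketch with a blank pipeline, the list form of matcher.add and a made-up example sentence; all variable names are illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher_named = Matcher(nlp_blank.vocab)
matcher_named.add("IPHONE_X_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
matcher_named.add("IOS_VERSION_PATTERN", [[{"TEXT": "iOS"}, {"IS_DIGIT": True}]])

doc_example = nlp_blank("The iPhone X shipped with iOS 11 and later got iOS 12 as an update")

# Sort by start position so the results follow the order of the text
named_matches = []
for match_id, start, end in sorted(matcher_named(doc_example), key=lambda m: m[1]):
    # Look up the pattern name that belongs to the hash value
    pattern_name = nlp_blank.vocab.strings[match_id]
    named_matches.append((pattern_name, doc_example[start:end].text))

for pattern_name, text in named_matches:
    print("{:<22}{}".format(pattern_name, text))
```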


In [28]:
#matching forms of a specific word followed by a proper noun (PROPN)
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?"
)

In [29]:
# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))

Total matches found: 3


In [30]:
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)

Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip


In [31]:
#matching an adjective followed by a noun
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

In [33]:
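#write a pattern for adjective + noun + one optional token of any kind
#(the bare {'OP': '?'} matches any token, so punctuation and "and" can end up in a match)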
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'OP': '?'}]
matcher.add('ADJ_NOUN_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found: ', len(matches))

Total matches found:  4


In [34]:
#iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found: ', doc[start:end].text)

Match found:  beautiful design,
Match found:  smart search,
Match found:  automatic labels and
Match found:  optional voice responses
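The trailing comma and "and" in these matches come from the unrestricted optional token. One possible variant restricts the optional third token to a noun and keeps only the longest match at each position. The sketch assumes spaCy v3, where a Doc can be built from words with hand-assigned part-of-speech tags; the tags below are written by hand for illustration and are not model predictions, and the shortened sentence and variable names are illustrative.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans
from spacy.vocab import Vocab

# Hand-tagged example so that no trained model is required (spaCy v3 constructor)
words = ["a", "beautiful", "design", ",", "smart", "search", "and",
         "optional", "voice", "responses", "."]
pos = ["DET", "ADJ", "NOUN", "PUNCT", "ADJ", "NOUN", "CCONJ",
       "ADJ", "NOUN", "NOUN", "PUNCT"]
doc_tagged = Doc(Vocab(), words=words, pos=pos)

# Adjective + noun, optionally followed by a second noun
matcher_adj = Matcher(doc_tagged.vocab)
matcher_adj.add("ADJ_NOUN_PATTERN",
                [[{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]])

# The optional token can yield overlapping matches; keep the longest ones
spans = [doc_tagged[start:end] for match_id, start, end in matcher_adj(doc_tagged)]
longest_texts = [span.text for span in filter_spans(spans)]
print(longest_texts)
```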
