# <center> Introduction to spaCy

#### The nlp object
- At the center of spaCy is the object containing the processing pipeline.
    - The variable is usually called `nlp`.
- It includes language-specific rules for tokenization etc.
- spaCy supports a variety of languages, which are available in `spacy.lang`.
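
As a minimal sketch of the points above, a blank English pipeline can be created and inspected. It assumes spaCy is installed and uses `spacy.blank`, which does not appear elsewhere in this notebook; a blank pipeline contains only the language-specific tokenizer and no trained components.

```python
import spacy
from spacy.lang.en import English

# Create a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank("en")

# The nlp object is an instance of the English language class
print(type(nlp).__name__, isinstance(nlp, English))

# Language code and the (empty) list of pipeline components
print(nlp.lang, nlp.pipe_names)
```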

#### The Doc object
- When you process a text with the <b> nlp object </b>, spaCy creates a Doc object (short for "document").
- It lets you access information about the text in a structured way, and no information is lost.
- It behaves like a normal Python sequence.
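
The sequence behaviour and the "no information is lost" claim can be checked with a short sketch. It assumes the blank `English` pipeline used later in this notebook and relies on the `text_with_ws` token attribute, which the notebook itself does not mention.

```python
from spacy.lang.en import English

nlp = English()
text = "It cost $5."
doc = nlp(text)

# A Doc has a length, can be indexed and can be iterated over
print(len(doc))
print(doc[1].text)
print([token.text for token in doc])

# No information is lost: joining every token with its trailing
# whitespace reproduces the original string exactly
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```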

#### The Token object

<img src="datasets/imgs/Doc_token.png" width="400" height="400">    

- Token objects represent the tokens in a document (for example, a word or a punctuation character).
- To get a token at a specific position, index into the doc.
- Tokens also provide various attributes that let you access more information about them.
    - Example: the `.text` attribute returns the verbatim token text.

#### The Span object
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQtB4WQhIJU6AvtF6LjoFrH6VRM9Se2lOmGvX4DcXb7kbX1ENSn&usqp=CAU" width="400" height="400">    

- A Span is a slice of the document consisting of one or more tokens.
    - It's only a view of the Doc and doesn't contain any data itself.
- To create a Span, use Python's slice notation on the Doc.
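
The "view" idea can be made concrete with a small sketch. It assumes a blank `English` pipeline and uses the `start`, `end` and `doc` attributes of the Span, which are not shown elsewhere in this notebook.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# Slice the Doc to get a Span covering tokens 2 and 3
span = doc[2:4]
print(span.text)

# The Span only records where it starts and ends in the parent Doc
print(span.start, span.end, len(span))

# Tokens reached through the Span are the Doc's own tokens,
# so their index still refers to the position in the Doc
print(span[0].text, span[0].i)
print(span.doc is doc)
```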

#### Lexical attributes
- They refer to the entry in the vocabulary and don't depend on the token's context.
- Examples:

In [2]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()
doc = nlp("It cost $5.")

print('Index:  ', [token.i for token in doc])
print('Text:  ', [token.text for token in doc])
# These lexical attributes return boolean values
print('is_alpha:  ', [token.is_alpha for token in doc])
print('is_punct:  ', [token.is_punct for token in doc])
print('like_num:  ', [token.like_num for token in doc])

Index:   [0, 1, 2, 3, 4]
Text:   ['It', 'cost', '$', '5', '.']
is_alpha:   [True, True, False, False, False]
is_punct:   [False, False, False, False, True]
like_num:   [False, False, False, True, False]


#### Some introductory examples:
- Trying some of the 45+ available languages

In [3]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


In [4]:
# Import the German language class
from spacy.lang.de import German

# Create the nlp object
nlp = German()

# Process a text (this is German for: "Kind regards!")
doc = nlp("Liebe Grüße!")

# Print the document text
print(doc.text)

Liebe Grüße!


In [5]:
# Import the Spanish language class
from spacy.lang.es import Spanish

# Create the nlp object
nlp = Spanish()

# Process a text (this is Spanish for: "How are you?")
doc = nlp("¿Cómo estás?")

# Print the document text
print(doc.text)

¿Cómo estás?


#### Documents, spans and tokens
When you call nlp on a string, spaCy first tokenizes the text and creates a Doc object.

In [6]:
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()

# Process the text and instantiate a Doc object in the variable doc
doc = nlp("I like tree kangaroos and narwhals.")

# Select the first token
first_token = doc[0]

# Print the first token's text
print(first_token.text)

I


In [7]:
##SPANS

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals


#### Lexical attributes
Example: use `like_num` to find the numbers that are directly followed by a percent sign.

In [8]:
# Process the text
doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")

# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number
    if token.like_num:
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == '%':
            print('Percentage found:', token.text)

Percentage found: 60
Percentage found: 4
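
The loop above looks up `doc[token.i + 1]`, which would raise an `IndexError` if the last token of a text resembled a number. A hedged variant that guards against this is sketched below; `find_percentages` is a hypothetical helper name introduced only for illustration.

```python
from spacy.lang.en import English

nlp = English()

def find_percentages(doc):
    """Return the text of every number-like token directly followed by '%'."""
    found = []
    for token in doc:
        # Only look ahead when a next token actually exists
        if token.like_num and token.i + 1 < len(doc):
            if doc[token.i + 1].text == "%":
                found.append(token.text)
    return found

doc = nlp("In 1990, more than 60% of people in East Asia were in extreme poverty. Now less than 4% are.")
print(find_percentages(doc))

# A text that ends in a number is handled without an IndexError
print(find_percentages(nlp("The total was 12")))
```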


# <center> spaCy's Statistical Models
    
#### What are statistical models?
- Enable spaCy to predict linguistic attributes in <b> context </b>
    - Part-of-speech tags
    - Syntactic dependencies
    - Named entities
- These models are trained on large datasets of labeled example texts.
- They can be updated with more examples to fine-tune their predictions.
    

 
#### Model Packages
- spaCy provides a number of pre-trained model packages for download.
- Example: the "en_core_web_sm" package
    - It is a small English model that supports all core capabilities and is trained on web text.
    - The package provides:
        - The binary weights
        - Vocabulary
        - Meta information (language, pipeline)
        - All of this enables spaCy to make predictions.

#### Predicting Part-of-speech Tags
- This means predicting the word types in context.
- The `pos_` attribute returns the predicted word type label. Notation:
> token.pos_

TIP: In spaCy, attributes that return strings usually end with an underscore; attributes without the underscore return an integer ID.
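
The underscore convention can be tried without a statistical model. The sketch below is an assumption-laden stand-in: it uses the lexical attribute pair `lower` / `lower_` instead of `pos` / `pos_`, because a blank `English` pipeline does not predict part-of-speech tags.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like Tree Kangaroos")
token = doc[2]

# With the underscore: the lowercase form as a string
print(token.lower_)

# Without the underscore: the integer ID (hash) of that string
print(type(token.lower).__name__)

# The vocabulary's string store maps the ID back to the string
print(nlp.vocab.strings[token.lower] == token.lower_)
```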
    
#### Predicting Syntactic Dependencies
- This means predicting how the words are related.
    - Example: whether a word is the subject of the sentence or an object.
- The `dep_` attribute returns the predicted dependency label. Notation:
    > token.dep_
- The `head` attribute returns the syntactic head token (the parent token this word is attached to). Notation:
    > token.head.text
- To describe syntactic dependencies, spaCy uses a standardized label scheme.
    
    
<img src="datasets/imgs/Syntactic_depen.png" width="400" height="400">    
    
#### Predicting Named Entities
- Named entities are "real world objects" that are assigned a name.
    - Example: a person, an organization or a country.
    
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSGkC4_OVOWUkqbYsB9yFsJlLCRujhnB6AFCTZiJag08AbfkFDw&usqp=CAU" width="400" height="400">    
    
#### Tip: the explain method
- Use `spacy.explain` to get quick definitions of the most common tags and labels.
    >spacy.explain('GPE')
    Countries, cities, states
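
A short sketch of the same lookup for several labels at once; it only needs spaCy itself, not a downloaded model. The choice of labels is just an illustration taken from the entity output in this notebook.

```python
import spacy

# Look up short descriptions for a few entity labels used in this notebook
for label in ["GPE", "ORG", "MONEY"]:
    print(label, "->", spacy.explain(label))
```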

#### Loading models

In [16]:
import spacy
# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

# Process the text
doc = nlp(text)

# Print the document text
print(doc.text)

It’s official: Apple is the first U.S. public company to reach a $1 trillion market value


#### Predicting linguistic annotations
- Part-of-speech tags and dependency labels

In [17]:
for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print('{:<12}{:<10}{:<10}'.format(token_text, token_pos, token_dep))

It          PRON      nsubj     
’s          VERB      punct     
official    NOUN      ROOT      
:           PUNCT     punct     
Apple       PROPN     nsubj     
is          AUX       ROOT      
the         DET       det       
first       ADJ       amod      
U.S.        PROPN     nmod      
public      ADJ       amod      
company     NOUN      attr      
to          PART      aux       
reach       VERB      relcl     
a           DET       det       
$           SYM       quantmod  
1           NUM       compound  
trillion    NUM       nummod    
market      NOUN      compound  
value       NOUN      dobj      


- Named entities

In [18]:
# Iterate over the predicted entities
for ent in doc.ents:
    # print the entity text and its label
    print(ent.text, ent.label_)

Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY


#### Predicting named entities in context
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text being processed.

In [20]:
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"

# Process the text
doc = nlp(text)

# Iterate over the entities
for ent in doc.ents:
    # print the entity text and label
    print(ent.text, ent.label_)
    
# Get the span for "iPhone X"
iphone_x = doc[1:3]

# Print the span text
print('Missing entity:', iphone_x.text)

Apple ORG
Missing entity: iPhone X


# <center> Rule-based Matching
- Lets you write rules to find words and phrases in text

#### Why not just regular expressions?
- Match on Doc objects, not just strings
- Match on tokens and token attributes
- Use the model's predictions
    - Example: "duck" (verb) vs. "duck" (noun)

#### Match patterns
- List of dictionaries, one per token
- Match exact token texts
> [{'ORTH': 'iPhone'}, {'ORTH':'X'}]    
- Match lexical attributes. Example: looking for tokens whose lowercase forms equal "iphone" and "x"
> [{'LOWER': 'iphone'}, {'LOWER':'x'}]    
- Match any token attributes. Example: any form of "buy" followed by a noun
> [{'LEMMA': 'buy'}, {'POS':'NOUN'}] 
    
    > Example matches: buying milk, bought flowers
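
The `LOWER` pattern above can be tried on a blank pipeline, since lexical attributes do not need a statistical model. This is a sketch under assumptions: the example sentence and the pattern name `IPHONE_X_LOWER` are made up, and the patterns are passed to `matcher.add` as a list, which is the signature accepted by spaCy v2.2.2+ and v3 rather than the older form used in the cells below.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# One dictionary per token: lowercase form "iphone", then lowercase form "x"
pattern = [{"LOWER": "iphone"}, {"LOWER": "x"}]

# The patterns are passed as a list (spaCy v2.2.2+ and v3 signature)
matcher.add("IPHONE_X_LOWER", [pattern])

doc = nlp("I bought an iPhone X and she sold her IPHONE x yesterday.")
found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```

Because the pattern compares lowercase forms, both spellings are found regardless of capitalization.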
    
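#### Using Matcher example code: 
- matcher.add()
    - first argument: unique ID to identify which pattern was matched
    - second argument: optional callback
    - third argument: the pattern
    - Note: this is the spaCy v2 signature used in this notebook. In spaCy v3 the patterns are passed as a list in the second argument, e.g. matcher.add('IPHONE_X_PATTERN', [pattern]).
- Calling the matcher on a doc returns a list of tuples, one (match_id, start, end) per match
    - match_id: hash value of the pattern name
    - start: start index of the matched span
    - end: end index of the matched span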
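
To make the three values of a match tuple concrete, here is a minimal sketch on a blank pipeline. It assumes the list-based `matcher.add` signature (spaCy v2.2.2+ and v3) and a shortened example sentence; it is an illustration, not the notebook's own cell.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_X_PATTERN", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])

doc = nlp("New iPhone X release date leaked")

# Calling the matcher returns a list of (match_id, start, end) tuples
matches = matcher(doc)
match_id, start, end = matches[0]

# The match_id is a hash; the string store resolves it to the pattern name
print(nlp.vocab.strings[match_id])

# start and end are token indices, so they can slice the Doc into a Span
print(start, end, doc[start:end].text)
```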
    

In [22]:
# Import the Matcher and initialize it with the shared vocabulary
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

# Add the pattern to the matcher
matcher.add('IPHONE_X_PATTERN', None, pattern)

# Use the matcher on the doc
matches = matcher(doc)
print('Matches:', [doc[start:end].text for match_id, start, end in matches])

Matches: ['iPhone X']


#### Matching lexical attributes
- complex pattern using lexical attributes

#### Matching other token attributes

In [24]:
## Write one pattern that only matches mentions of the full iOS versions
##########################################################################
doc = nlp("After making the iOS update you won't notice a radical system-wide redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of iOS 11's furniture remains the same as in iOS 10. But you will discover some tweaks once you delve a little deeper.")

# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{'TEXT': 'iOS'}, {'IS_DIGIT': True}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('IOS_VERSION_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10


In [25]:
## Write one pattern that only matches forms of "download" (tokens with the lemma "download"), 
#followed by a token with the part-of-speech tag 'PROPN' (proper noun).
##########################################################################

doc = nlp("i downloaded Fortnite on my laptop and can't open the game at all. Help? so when I was downloading Minecraft, I got the Windows version where it is the '.zip' folder and I used the default program to unpack it... do I also need to download Winzip?")

# Write a pattern that matches a form of "download" plus proper noun
pattern = [{'LEMMA': 'download'}, {'POS': 'PROPN'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('DOWNLOAD_THINGS_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 2
Match found: downloaded Fortnite
Match found: downloading Minecraft


#### Using operators and quantifiers
- Operators can make patterns a lot more powerful, but they also add more complexity.

<img src="datasets/imgs/operators.png" width="400" height="400">    

In [26]:
## Write one pattern that matches adjectives ('ADJ') followed by one or two 'NOUN's (one noun and one optional noun).
##########################################################################

doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")

# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', None, pattern)
matches = matcher(doc)
print('Total matches found:', len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

Total matches found: 5
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice
Match found: optional voice responses
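
The optional operator `'?'` can also be explored without a statistical model by combining it with lexical attributes. The sketch below is an assumption-based illustration: the sentence and the pattern name `IOS_OPTIONAL_VERSION` are invented, and the list-based `matcher.add` signature (spaCy v2.2.2+ and v3) is used.

```python
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)

# "iOS" followed by an optional digit token ('?' = match 0 or 1 times)
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True, "OP": "?"}]
matcher.add("IOS_OPTIONAL_VERSION", [pattern])

doc = nlp("iOS 11 and iOS are both mentioned")

# Collect the distinct matched texts; an optional token can yield
# overlapping matches, as with "optional voice (responses)" above
found = {doc[start:end].text for match_id, start, end in matcher(doc)}
print(sorted(found))
```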
