# [Chapter 1 (Finding Words, Phrases, Names and Concepts)](https://course.spacy.io/en/chapter1)
These are my notes for the first chapter of the advanced NLP [course](https://course.spacy.io/en/) provided by spaCy. 

In [1]:
import spacy

This chapter contains:
- Basics of text processing with spaCy
- Data Structures
- How to work with trained pipelines
- How to use them to predict linguistic features in text

### 1.1: Finding Words

#### NLP

At the center of spaCy is the object containing the processing pipeline, conventionally called `nlp`. You can use this object like a function to analyze text. It contains all the different components in the pipeline _(what does this mean exactly?)_. It can be created for different languages, and the tokenization rules it contains depend on the language. Over 60 languages are available.

In [2]:
nlp = spacy.blank("en") # spacy.blank method used when creating a blank pipeline 
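
One way to see what "components in the pipeline" means is to list them. This is a minimal sketch, assuming a recent spaCy install: a blank pipeline only has a tokenizer, which is part of the language class and is not listed as a component, so `pipe_names` comes back empty.

```python
import spacy

nlp = spacy.blank("en")

# Components that run after the tokenizer; a blank pipeline has none
print(nlp.pipe_names)

# The tokenizer still works, because it belongs to the language class itself
doc = nlp("Hello World!")
print([token.text for token in doc])
```

A trained pipeline loaded with `spacy.load` would list its components here instead.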

#### Doc

When you process text with the `nlp` object, spaCy creates a `Doc` object. It lets you access information about the text in a structured way, with no loss of information about the text. It behaves like a normal Python sequence and lets you iterate over its tokens.

In [4]:
doc = nlp("Hello World!")

for token in doc:
    print(token.text) # print(token) will achieve the same result, but note that token itself isn't type str

Hello
World
!
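
To make the "Python sequence" and "no loss of information" claims concrete, here is a small sketch on the same sentence. `text_with_ws` is a token attribute holding the token text plus any trailing whitespace; joining those pieces should give back the original string.

```python
import spacy

nlp = spacy.blank("en")
text = "Hello World!"
doc = nlp(text)

print(len(doc))      # number of tokens
print(doc[0].text)   # indexing returns a Token
print(doc[-1].text)  # negative indices work too

# Each token remembers its trailing whitespace, so the text can be rebuilt
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```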


#### Token

Token objects represent the tokens in a document, such as a word or a punctuation character. You can index `doc` to get a specific token. Token objects have attributes that let you access more information about a token, like `.text`.

In [9]:
token = doc[1]

print(token.text)

dir(token)[28:35] # Some attribute examples

World


['ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_']

#### Span

A Span object is a slice of a document, consisting of one or more tokens. It is only a view of the `Doc` and doesn't contain any data itself _(data about what? Seems to contain data based on dir())_. To create one, you can use Python's slice notation.

In [10]:
span = doc[1:3]

print(span.text)

World!


In [12]:
# Can I loop through it like a doc? Yes
for token in span:
    print(token)

World
!
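
A sketch that may help with the "view" question above: the span keeps its start and end positions plus a reference to the parent `Doc`, and the tokens you reach through it still carry their positions in that `Doc`. My reading is that the attributes listed by `dir()` are looked up in the parent document rather than copied into the span.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello World!")
span = doc[1:3]

# A Span stores where it starts and ends in the parent Doc
print(span.start, span.end)

# Tokens reached through the span keep their Doc-level indices
print([token.i for token in span])

# The parent Doc is available through .doc
print(span.doc.text)
```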


#### Some Lexical Attributes

In [41]:
doc = nlp("It costs $50.")

for token in doc:
    print("Index of current token: ", token.i)
    print("Text: ", token.text)
    print("Does it consist of alphabetic characters? ", token.is_alpha)
    print("Is it a punctuation character? ", token.is_punct)
    print("Does it resemble a number? ", token.like_num)
    print("\n")
    # Resembling a number means it is written either in digits (10) or as a number word (ten)

Index of current token:  0
Text:  It
Does it consist of alphabetic characters?  True
Is it a punctuation character?  False
Does it resemble a number?  False


Index of current token:  1
Text:  costs
Does it consist of alphabetic characters?  True
Is it a punctuation character?  False
Does it resemble a number?  False


Index of current token:  2
Text:  $
Does it consist of alphabetic characters?  False
Is it a punctuation character?  False
Does it resemble a number?  False


Index of current token:  3
Text:  50
Does it consist of alphabetic characters?  False
Is it a punctuation character?  False
Does it resemble a number?  True


Index of current token:  4
Text:  .
Does it consist of alphabetic characters?  False
Is it a punctuation character?  True
Does it resemble a number?  False




Notice that 50 is considered a single token!

In [17]:
doc = nlp("Ten dollars")

token = doc[0]
print(token.like_num)

True
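
Lexical attributes can be combined with token indices to find simple patterns by hand. The sketch below looks for number-like tokens that directly follow a `$` sign; the variable names are my own and the sentence is the one used above.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("It costs $50.")

amounts = []
for token in doc:
    # Only consider number-like tokens that have a token before them
    if token.like_num and token.i > 0:
        previous_token = doc[token.i - 1]
        if previous_token.text == "$":
            amounts.append(token.text)

print(amounts)
```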


### 1.5: Trained Pipelines

Trained pipeline components have statistical models that enable spaCy to make predictions in context. For example, it can predict whether a word is a verb or a person's name. These pipelines are trained on large datasets of labeled example texts. They can also be updated with more examples to fine-tune their predictions so they perform better on your data.

A pipeline can be downloaded via the command prompt using the command: `python -m spacy download pipeline_name`. Example: `python -m spacy download en_core_web_sm`. 

The `en_core_web_sm` package is a small English pipeline that supports all core capabilities. The package provides binary weights that enable spaCy to make predictions. It also includes the vocabulary, meta information, and the configuration file used to train the pipeline. The package tells spaCy which language class to use and how to configure the processing pipeline.

In [19]:
nlp = spacy.load("en_core_web_sm") # notice the use of the spacy.load method instead of spacy.blank

Let's try predicting part-of-speech tags, i.e. word types in context. After we load the pipeline and process the text, we can print the text and the predicted part-of-speech tag (`pos_`) for each token. In spaCy, attributes that return strings usually end with an underscore; those without an underscore return an integer ID.

In [20]:
doc = nlp("She ate the pizza")

for token in doc:
    print(token.text, token.pos_)

She PRON
ate VERB
the DET
pizza NOUN
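
The underscore convention can be checked without a trained pipeline. This sketch swaps in the lexical attribute `lower` / `lower_` instead of `pos` / `pos_` (my substitution, so that a blank pipeline is enough) and uses the shared string store to map the integer ID back to its string.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("She ate the pizza")
token = doc[0]

# Without underscore: an integer ID (a hash of the string)
print(type(token.lower).__name__)

# With underscore: the readable string
print(token.lower_)

# The shared string store maps the ID back to the string
print(nlp.vocab.strings[token.lower] == token.lower_)
```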


We can also predict how words are related AKA syntactic dependencies. 

In [21]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

She PRON nsubj ate
ate VERB ROOT ate
the DET det pizza
pizza NOUN dobj ate


`dep_` returns the predicted dependency label. The `.head` attribute returns the syntactic head token; think of it as the parent token this word is attached to.
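
The parent/child picture can be explored with `.head` and `.children`. To keep the sketch independent of a downloaded pipeline, the `Doc` below is built by hand (assuming the spaCy v3 `Doc` constructor) with the heads and labels copied from the output above, so nothing here is predicted.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["She", "ate", "the", "pizza"]
# Annotations copied from the output above; heads are absolute token indices
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

for token in doc:
    children = [child.text for child in token.children]
    print(token.text, "<-", token.head.text, "| children:", children)

# The root is the token that is its own head
root = [token for token in doc if token.head.i == token.i][0]
print(root.text)
```

With a trained pipeline, the same loop works on `nlp("She ate the pizza")` directly.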

Named entities are "real world objects", e.g. a person, an organization, or a country.

In [42]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents: # .ents returns a tuple of Span objects, one per entity in the document
    print(ent.text, ent.label_) # .label_ returns the entity label

print(doc.text)

Apple ORG
U.K. GPE
$1 billion MONEY
Apple is looking at buying U.K. startup for $1 billion
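
Since each entity is a `Span`, it also knows where it sits in the document. The sketch below sets the entities by hand on a manually built `Doc`, mirroring the predictions shown above, so it does not need a trained pipeline; the token boundaries are my assumption of how the sentence is split.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

words = ["Apple", "is", "looking", "at", "buying", "U.K.",
         "startup", "for", "$", "1", "billion"]
# Whether each token is followed by a space in the original text
spaces = [True] * 8 + [False, True, False]
doc = Doc(Vocab(), words=words, spaces=spaces)

# Hand-set entities mirroring the predictions shown above
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 5, 6, label="GPE"),
    Span(doc, 8, 11, label="MONEY"),
]

print(type(doc.ents).__name__)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)
```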


To understand what a tag/label means, use `spacy.explain`. 

In [28]:
print(spacy.explain("GPE"))
print(spacy.explain("det"))
print(spacy.explain("NNP"))
print(spacy.explain("AUX"))

Countries, cities, states
determiner
noun, proper singular
auxiliary
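
`spacy.explain` accepts labels from different schemes (entity labels, dependency labels, fine-grained tags, coarse part-of-speech tags), so a small lookup table can be handy while reading output. A minimal sketch with the labels used above:

```python
import spacy

labels = ["GPE", "det", "NNP", "AUX"]

# Build a label -> description lookup
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(f"{label:>4}: {description}")
```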


### 1.10: Rule-Based Matching

Why not just use regular expressions? With spaCy, you can match on `Doc` objects and not just strings, match on tokens and token attributes, and use a model's predictions, for example to differentiate between duck (noun) and duck (verb).

Match patterns are lists of dictionaries, with each dictionary describing a single token. So `len(pattern)` indicates how many tokens we are looking for. The keys are the names of token attributes and the values are the expected values, AKA what we want. Notice that we can have multiple keys in a single dictionary, meaning we want a token to match multiple criteria. Consider the example below, where we are looking for two tokens with the text "iPhone" and "X".

In [29]:
[{"TEXT": "iPhone"}, {"TEXT": "X"}]

[{'TEXT': 'iPhone'}, {'TEXT': 'X'}]

You can also match on other token attributes. We can look for two tokens whose lowercase forms equal "iphone" and "x".

In [30]:
[{"LOWER": "iphone"}, {"LOWER": "x"}]

[{'LOWER': 'iphone'}, {'LOWER': 'x'}]

You can also write patterns based on the attributes predicted by the model. The pattern below would match a token with the lemma "buy" followed by a noun. Since the lemma is the base form, this would match phrases like "buying milk" or "bought flowers".

In [31]:
[{"LEMMA": "buy"}, {"POS": "NOUN"}]

[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]

A pattern describing a token with multiple criteria:

In [40]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"}, # a verb with the base lemma love
    {"POS": "NOUN"}
]
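
To see the effect of putting two keys in one dictionary, here is a sketch that runs this pattern with the `Matcher` (covered a bit further down in these notes). The sentence is a made-up example, and its lemmas and part-of-speech tags are set by hand as a stand-in for a trained pipeline's predictions, assuming the spaCy v3 `Doc` constructor.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-annotated stand-in for a model's output (hypothetical sentence)
doc = Doc(
    vocab,
    words=["I", "loved", "dogs", "and", "love", "songs"],
    lemmas=["I", "love", "dog", "and", "love", "song"],
    pos=["PRON", "VERB", "NOUN", "CCONJ", "NOUN", "NOUN"],
)

pattern = [
    {"LEMMA": "love", "POS": "VERB"},  # both conditions must hold for the same token
    {"POS": "NOUN"},
]
matcher = Matcher(vocab)
matcher.add("LOVE_PATTERN", [pattern])

matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

"love songs" is skipped because "love" is tagged as a noun there, so only the first dictionary's lemma condition holds, not both.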

Here's how we actually use a matcher:

In [34]:
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab) # Matcher is initialized with the shared vocab. More on this later - just remember to always pass it

pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]

# add a pattern to the Matcher. The first argument is a unique ID to identify which pattern was matched. The second arg is a list
# of patterns
matcher.add("IPHONE_PATTERN", [pattern])

doc = nlp("Upcoming iPhone X release date leaked")

# Get a list of matches on a particular doc
matches = matcher(doc)

In [38]:
# Iterate over the matches:
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(match_id) # kinda weird that it returns the hash value of the pattern name, instead of the name itself, but whatever
    print(matched_span.text)

9528407286733565721
iPhone X
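
The hash can be turned back into the pattern name by looking it up in `nlp.vocab.strings`. The sketch below registers two patterns under IDs I made up (`IPHONE_EXACT` and `IPHONE_LOWER`) to show why the ID is useful: it tells you which pattern produced each match.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_EXACT", [[{"TEXT": "iPhone"}, {"TEXT": "X"}]])
matcher.add("IPHONE_LOWER", [[{"LOWER": "iphone"}, {"LOWER": "x"}]])

doc = nlp("iPhone X and iphone x")

results = []
for match_id, start, end in matcher(doc):
    pattern_name = nlp.vocab.strings[match_id]  # hash -> string
    results.append((pattern_name, doc[start:end].text))

for pattern_name, text in sorted(results):
    print(pattern_name, "->", text)
```

The exact-case span is reported twice, once per pattern, while the lowercase span only matches the `LOWER` pattern.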


We can add quantifiers to our patterns. Consider the example below:

In [None]:
pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"}
]

In the case above, the `?` makes the determiner optional. It will match both: "bought a smartphone" and "buying apps". To add quantifiers, use the "OP" keyword. 
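
A sketch of the quantifier in action. As an assumption to avoid needing a trained pipeline, the `Doc` is built by hand (spaCy v3 constructor) with lemmas and part-of-speech tags that a model would normally predict; the sentence itself is made up.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-annotated stand-in for a model's output (hypothetical sentence)
doc = Doc(
    vocab,
    words=["I", "bought", "a", "smartphone", "and", "kept", "buying", "apps"],
    lemmas=["I", "buy", "a", "smartphone", "and", "keep", "buy", "app"],
    pos=["PRON", "VERB", "DET", "NOUN", "CCONJ", "VERB", "VERB", "NOUN"],
)

pattern = [
    {"LEMMA": "buy"},
    {"POS": "DET", "OP": "?"},  # optional: match 0 or 1 times
    {"POS": "NOUN"},
]
matcher = Matcher(vocab)
matcher.add("BUY_PATTERN", [pattern])

matched_texts = sorted(doc[start:end].text for _, start, end in matcher(doc))
print(matched_texts)
```

Other values for `"OP"` documented by spaCy include `"!"` (negate), `"+"` (one or more) and `"*"` (zero or more).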