# Intro to spaCy

Generally, attributes that end with an underscore (such as `orth_`) return strings. The same attributes without the underscore (such as `orth`) return an integer ID.
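As a minimal sketch of the underscore convention, the snippet below compares `orth` and `orth_` on one token. The choice of `orth` is only an illustration; other attribute pairs follow the same pattern. It assumes a blank `English` pipeline, as used in the rest of these notes.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")
token = doc[0]

# Without the underscore: an integer ID (a hash of the string)
print(type(token.orth).__name__)

# With the underscore: the string itself
print(token.orth_)

# The ID can be resolved back to the string through the vocab's string store
print(nlp.vocab.strings[token.orth] == token.orth_)
```

The integer form is cheaper to store and compare, while the string form is easier to read when inspecting results.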

In [1]:
from spacy.lang.en import English

In [2]:
# Create the nlp object
nlp = English()

In [10]:
# Hello world example
doc = nlp("Hello world!")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)

Hello
world
!


## Token Slicing & Subsetting

**Tokens**

In [11]:
# Token slicing/subsets

# Index into the Doc to get a single Token
token = doc[1]

# Get the token text via the .text attribute
print(token.text)

world


**Spans**

The example below selects the tokens from position 1 up to, but not including, position 4. The Doc has only three tokens (positions 0 to 2), so the slice stops at the end of the Doc.

In [12]:
# A slice from the Doc is a Span object
# format is doc[start:end]
span = doc[1:4]

# Get the span text via the .text attribute
print(span.text)

world!
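To see why `doc[1:4]` prints only `world!`, the following sketch inspects the span's boundaries. It rebuilds the same three-token Doc so that it runs on its own.

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("Hello world!")

# The requested end (4) is past the last token
span = doc[1:4]

# Number of tokens in the Doc
print(len(doc))

# Token offsets actually covered by the span (end is exclusive)
print(span.start, span.end)

# The tokens inside the span
print([token.text for token in span])
```

As with Python lists, an out-of-range slice bound does not raise an error here, whereas indexing a single token past the end (for example `doc[4]`) does.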


## Token attributes

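- `i` - the token's index within the Doc
- `text` - the token's text
- `is_alpha` - boolean; whether the token consists of alphabetic characters
- `is_punct` - boolean; whether the token is punctuation
- `like_num` - boolean; whether the token resembles a number, written as digits ("10") or as a word ("ten")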

In [6]:
doc = nlp("It costs $5.")

print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])

print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])

Index:    [0, 1, 2, 3, 4]
Text:     ['It', 'costs', '$', '5', '.']
is_alpha: [True, True, False, False, False]
is_punct: [False, False, False, False, True]
like_num: [False, False, False, True, False]
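The attribute list above notes that `like_num` covers spelled-out numbers as well as digits. A small sketch to check this, using a made-up sentence (`"ten 10 apples"` is only an illustration):

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("ten 10 apples")

# Pair each token's text with its like_num flag
flags = [(token.text, token.like_num) for token in doc]
print(flags)
```

Both `ten` and `10` are flagged as number-like, while `apples` is not.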


## Example: Find all percentages in the doc

In [7]:
from spacy.lang.en import English

nlp = English()

# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)

In [8]:
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)

Percentage found: 60
Percentage found: 4
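The same loop can be wrapped in a small helper that returns the matches instead of printing them. This is a sketch, and `find_percentages` is a hypothetical name rather than part of spaCy. It includes a bounds check, because `doc[token.i + 1]` raises an `IndexError` when a number-like token is the last token in the Doc.

```python
from spacy.lang.en import English

nlp = English()


def find_percentages(doc):
    """Return the text of every number-like token directly followed by '%'."""
    found = []
    for token in doc:
        # Skip tokens that are not number-like, and the final token,
        # which has no following token to inspect
        if not token.like_num or token.i + 1 >= len(doc):
            continue
        if doc[token.i + 1].text == "%":
            found.append(token.text)
    return found


poverty = find_percentages(nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
))
print(poverty)

# A text that ends in a number: the trailing "5" is skipped safely
trailing = find_percentages(nlp("Rates rose from 4% to 5"))
print(trailing)
```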


## More Examples

**Print the Doc text**

In [16]:
# Import the English language class
from spacy.lang.en import English

# Create the nlp object
nlp = English()

# Process a text
doc = nlp("This is a sentence.")

# Print the document text
print(doc.text)

This is a sentence.


**More span/slicing examples**

In [17]:
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")

# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)

tree kangaroos
tree kangaroos and narwhals
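Doc slices also accept negative offsets, which count from the end of the Doc. As a sketch, the second slice above can be written without counting tokens up to position 6:

```python
from spacy.lang.en import English

nlp = English()
doc = nlp("I like tree kangaroos and narwhals.")

# -1 refers to the last token, here the final "."
last_token = doc[-1]
print(last_token.text)

# Everything from "tree" up to, but not including, the final "."
without_period = doc[2:-1]
print(without_period.text)
```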
