# spaCy Tutorial Part 1: Introduction

spaCy is a free, open-source library for Natural Language Processing (NLP) in Python. It provides ready-made tools for processing language data, from splitting a text into words to analyzing sentence structure.

### What is spaCy good for?

spaCy is great for quickly analyzing natural language data, especially (standard) written language data, using pretrained models. Common tasks such as tokenization, POS-tagging, dependency parsing, Named Entity Recognition, and word embeddings/vector-space semantic modelling can be done incredibly quickly using the pre-trained models.  

### What is spaCy **not** for?

Strictly speaking, spaCy is not designed for actual NLP research, unlike other libraries such as [Natural Language Toolkit (NLTK)](https://github.com/nltk/nltk). It is designed more for speedy application and production use than for the development and evaluation of optimal NLP algorithms. The aim with spaCy is simply to get things done.

Fortunately, spaCy is **very** good at getting things done, which makes it very useful for corpus linguists. You can train your own models, or update the pretrained ones, with your own data, but I will not go into that in these tutorials. At the moment, getting a good model (with the current methods) usually requires *lots* of data, and training on even small datasets is more than a typical personal laptop or desktop computer can handle. You can find more information about training spaCy models [here](https://spacy.io/usage/training).

## Installing spaCy

More coming soon. You can find information about installing spaCy at [https://spacy.io/usage](https://spacy.io/usage).

## Adding language models

spaCy comes with pretrained statistical models for a number of languages. There are models of different sizes for about a dozen or so languages, mostly the major European ones, though models for Chinese (Mandarin?) and Japanese are also available. All these models are deep neural networks trained on different kinds of data. You can find a list of spaCy's language models [here](https://spacy.io/usage/models#languages). 

For English there are three pretrained models of different sizes: small (`sm`), medium (`md`), and large (`lg`). All the pretrained English models were trained on [OntoNotes](https://catalog.ldc.upenn.edu/LDC2013T19), and the larger two models also include word vectors (20k and 685k vectors, respectively).

For most of the tutorials we'll be using the small English model, `en_core_web_sm`. We will occasionally use the medium-sized model too, so we'll download that as well. In most cases the small model is good enough, but we'll examine a case where using a larger model makes a big difference.

***Note:*** *The pretrained models may not perform well on data that is very different from the kind(s) of data the model was trained on. For instance, models trained on edited written language (newspaper articles, novels, academic texts, etc.) will generally not do well on other kinds of language, e.g. casual spoken conversation or text messages. It's always important to be aware of the limitations of these models.*

spaCy's models can be downloaded from the command line like so (the `$` is the shell prompt, not part of the command):

In [None]:
$ python -m spacy download en_core_web_sm
$ python -m spacy download en_core_web_md

You can load the model with the `spacy.load()` function.

In [2]:
import spacy

nlp_sm = spacy.load("en_core_web_sm")
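
If a model has not been downloaded yet, `spacy.load()` raises an `OSError`. As a minimal sketch, you could wrap the call so that the error message points back to the download command. Note that `load_model` is a hypothetical helper name made up for this tutorial, not part of spaCy.

```python
import spacy


def load_model(name):
    """Load a spaCy model, adding a download hint if it is missing.

    `load_model` is a hypothetical helper for this tutorial, not a spaCy function.
    """
    try:
        return spacy.load(name)
    except OSError as err:
        # spacy.load() raises OSError when the model package cannot be found
        raise OSError(
            f"Model '{name}' is not installed. "
            f"Try: python -m spacy download {name}"
        ) from err
```

With this in place, `nlp_sm = load_model("en_core_web_sm")` behaves like `spacy.load()` when the model is installed, and otherwise fails with a message that says how to fix it.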

## Working with `Doc` objects

spaCy works with objects of the `Doc` class, which contain all kinds of information about a piece of text. To create a `Doc` from a text, we simply run our model over a string like so.

In [4]:
hobbit_doc1 = nlp_sm("In a hole in the ground there lived a hobbit.")

print(hobbit_doc1) # you can also simply type 'hobbit_doc1'

In a hole in the ground there lived a hobbit.


Now we can start doing things with `hobbit_doc1`. 

### Getting tokens

One of the most basic things we do when analyzing texts is **tokenization**. Tokenization refers to the splitting of a text into individual units for further analysis. In most cases, we use it informally to refer to splitting a text into words. 
The nice thing is that spaCy does this automatically for us when it creates a `Doc` object. A `Doc` object is essentially a sequence of `Token` objects.
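
To see that tokenization is more than splitting on whitespace, here is a small sketch comparing Python's `str.split()` with spaCy's tokenizer. It assumes a blank English pipeline from `spacy.blank("en")`, which includes the tokenizer but no trained components, so no model download is needed. The example sentence is made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no tagger or parser (an assumption
# made here so the example runs without a downloaded model)
nlp_blank = spacy.blank("en")

text = "Bilbo didn't go."

whitespace_tokens = text.split()
spacy_tokens = [token.text for token in nlp_blank(text)]

print(whitespace_tokens)  # ['Bilbo', "didn't", 'go.']
print(spacy_tokens)       # ['Bilbo', 'did', "n't", 'go', '.']
```

The spaCy tokenizer separates the final period from the word and splits the contraction into two tokens, which is usually what you want for further linguistic analysis.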

We can see the tokens in `hobbit_doc1` by simply iterating over its elements. The individual tokens are themselves spaCy objects of class `Token`, and therefore have lots of attributes which contain information about them. Here we use `.text` to get the plain text of each token and print it.

In [4]:
for token in hobbit_doc1:
    print(token.text)

In
a
hole
in
the
ground
there
lived
a
hobbit
.


In [5]:
# list comprehension style
[t.text for t in hobbit_doc1]

['In',
 'a',
 'hole',
 'in',
 'the',
 'ground',
 'there',
 'lived',
 'a',
 'hobbit',
 '.']
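
Because a `Doc` behaves like a sequence, it also supports `len()`, indexing, and slicing. The sketch below again assumes a blank English pipeline (`spacy.blank("en")`), since only the tokenizer is needed here; with `nlp_sm` the results should be the same.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only; assumed here to avoid a model download
doc = nlp_blank("In a hole in the ground there lived a hobbit.")

print(len(doc))       # number of tokens: 11
print(doc[9].text)    # a single Token: hobbit
print(doc[1:3].text)  # a slice is a Span: a hole
print(doc[-1].text)   # negative indices work too: .
```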

You can find the full list of token attributes [here](https://spacy.io/api/token#attributes). Below are some attributes that I think are particularly useful.

- `.i`: The index (position) of the token in the text. Starts at 0.
- `.text`: The original word text.
- `.lemma_`: The base form of the word.
- `.pos_`: The simple [Universal Part-Of-Speech tag](https://universaldependencies.org/docs/u/pos/).
- `.tag_`: The detailed part-of-speech tag. For the English models, these follow the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tag set (in the version used by OntoNotes 5).
- `.dep_`: Syntactic dependency, i.e. the relation between tokens.
- `.shape_`: The word shape – capitalization, punctuation, digits.
- `.is_alpha`: Does the token consist only of alphabetic characters?
- `.is_punct`: Is the token a punctuation marker?
- `.is_stop`: Is the token part of a stop list, i.e. the most common words of the language?
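
Several of these attributes (`.i`, `.text`, `.shape_`, `.is_alpha`, `.is_punct`, `.is_stop`) depend only on the form of the token, so they can be tried out without a trained model. Here is a minimal sketch, assuming a blank English pipeline and a made-up sentence. Attributes such as `.pos_`, `.tag_`, and `.dep_` need the pretrained model and are not shown here.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer and lexical attributes only (assumed setup)
doc = nlp_blank("Bilbo met the 13 dwarves.")

for token in doc:
    # index, text, word shape, and three boolean flags for each token
    print(token.i, token.text, token.shape_,
          token.is_alpha, token.is_punct, token.is_stop)
```

In the shape column, `X` stands for an uppercase letter, `x` for a lowercase letter, and `d` for a digit, so `Bilbo` becomes `Xxxxx` and `13` becomes `dd`.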

You can of course use spaCy together with other Python modules. For example, I'll use the `pandas` module to create a `DataFrame` object, which is automatically displayed as a neat table in Jupyter.

***Python Note:*** *If you plan to use Python for data analysis, I highly recommend checking out the [`pandas` library](https://pandas.pydata.org/).*

In [7]:
import pandas as pd

hobbit_tab = [[token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_alpha, token.i] for token in hobbit_doc1]

pd.DataFrame(hobbit_tab, columns = ["text", "lemma_", "pos_", "tag_", "dep_", "is_alpha", "INDEX"])

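| | text | lemma_ | pos_ | tag_ | dep_ | is_alpha | INDEX |
|---|---|---|---|---|---|---|---|
| 0 | In | in | ADP | IN | prep | True | 0 |
| 1 | a | a | DET | DT | det | True | 1 |
| 2 | hole | hole | NOUN | NN | pobj | True | 2 |
| 3 | in | in | ADP | IN | prep | True | 3 |
| 4 | the | the | DET | DT | det | True | 4 |
| 5 | ground | ground | NOUN | NN | pobj | True | 5 |
| 6 | there | there | PRON | EX | expl | True | 6 |
| 7 | lived | live | VERB | VBD | ROOT | True | 7 |
| 8 | a | a | DET | DT | det | True | 8 |
| 9 | hobbit | hobbit | NOUN | NN | dobj | True | 9 |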


Note that some of these attributes end in an underscore `_`. The underscored version returns a readable string label, while the version without the `_` returns the integer ID that corresponds to that label (e.g. 90 = 'DET'). spaCy uses these integers internally, but they are not much use to you unless you know what each number refers to, hence the underscored versions above.

In [5]:
for token in hobbit_doc1:
    print(token.pos, token.pos_, token.text) # see the integer and label values for 'pos'

85 ADP In
90 DET a
92 NOUN hole
85 ADP in
90 DET the
92 NOUN ground
95 PRON there
100 VERB lived
90 DET a
92 NOUN hobbit
97 PUNCT .
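
The integer and string forms are linked through the vocabulary's string store, `nlp.vocab.strings`, which maps in both directions. Here is a small sketch, assuming a blank English pipeline. It uses `.orth` and `.orth_` (the verbatim token text) because that pair is available without a trained model.

```python
import spacy
from spacy.symbols import DET

nlp_blank = spacy.blank("en")  # assumed setup; no model download needed
doc = nlp_blank("a hobbit")
token = doc[1]

# Integer ID and string form of the same attribute
print(token.orth, token.orth_)

# The string store converts an ID back to its string...
print(nlp_blank.vocab.strings[token.orth])  # hobbit

# ...and a string to its ID; part-of-speech labels work the same way
det_id = nlp_blank.vocab.strings["DET"]
print(det_id == DET, nlp_blank.vocab.strings[det_id])  # True DET
```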


**more about custom tokenizers?**


### Splitting a text into sentences

You can also easily segment a text into sentences, which can be further analyzed. 

In [6]:
hobbit_doc2 = nlp_sm("""This hobbit was a very well-to-do hobbit, and his name was Baggins. The Bagginses had lived in the neighbourhood of The Hill for time out of mind, and people considered them very respectable, not only because most of them were rich, but also because they never had any adventures or did anything unexpected: you could tell what a Baggins would say on any question without the bother of asking him. This is a story of how a Baggins had an adventure, and found himself doing and saying things altogether unexpected. He may have lost the neighbours' respect, but he gained—well, you will see whether he gained anything in the end.""")

hobbit_doc2 # The text is printed as an unbroken string

This hobbit was a very well-to-do hobbit, and his name was Baggins. The Bagginses had lived in the neighbourhood of The Hill for time out of mind, and people considered them very respectable, not only because most of them were rich, but also because they never had any adventures or did anything unexpected: you could tell what a Baggins would say on any question without the bother of asking him. This is a story of how a Baggins had an adventure, and found himself doing and saying things altogether unexpected. He may have lost the neighbours' respect, but he gained—well, you will see whether he gained anything in the end.

Every `Doc` produced by the pretrained models has a `.sents` attribute, even if there is only one sentence. As with tokenization, sentence segmentation is done automatically when the `Doc` object is created.

In [7]:
for s in hobbit_doc1.sents:
    print(s.text)

In a hole in the ground there lived a hobbit.


In [8]:
for s in hobbit_doc2.sents:
    print(s.text)

This hobbit was a very well-to-do hobbit, and his name was Baggins.
The Bagginses had lived in the neighbourhood of The Hill for time out of mind, and people considered them very respectable, not only because most of them were rich, but also because they never had any adventures or did anything unexpected: you could tell what a Baggins would say on any question without the bother of asking him.
This is a story of how a Baggins had an adventure, and found himself doing and saying things altogether unexpected.
He may have lost the neighbours' respect, but he gained—
well, you will see whether he gained anything in the end.
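
In the output above, the pretrained model splits the last sentence at the dash, which shows that the segmentation depends on what the model predicts. Each sentence is a `Span`, i.e. a slice of the `Doc`, with `.start` and `.end` token indices. As a minimal sketch of a simpler alternative, the block below uses spaCy's rule-based `sentencizer` component on a blank English pipeline. It assumes spaCy v3, where built-in components are added by name, and a shortened version of the text.

```python
import spacy

nlp_rules = spacy.blank("en")
# spaCy v3 syntax (assumed); the sentencizer splits on sentence-final
# punctuation such as ., ! and ?
nlp_rules.add_pipe("sentencizer")

doc = nlp_rules("This hobbit was a very well-to-do hobbit. His name was Baggins.")

for sent in doc.sents:
    # .start and .end are token indices into the Doc
    print(sent.start, sent.end, sent.text)
```

A rule-based splitter like this is fast and predictable, but it has no notion of syntax, so it may behave differently from the pretrained model on abbreviations or unusual punctuation.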


**More about sentences and spans...**