# spaCy Tutorial Part 2: Part-of-Speech tagging and custom tokenization

In this session we'll look at tokenization, part-of-speech tags and other token attributes, and splitting a text into sentences.


In [2]:
import spacy
nlp_sm = spacy.load("en_core_web_sm")

## Tokenizing

Tokenizing refers to the splitting of a text into individual units for further analysis. In most cases, we use it informally to refer to splitting a text into words. The nice thing is that spaCy does this automatically for us when we pass a text to the model object, which returns a tokenized `Doc` object.

In [30]:
gregor = nlp_sm("One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.")
gregor.text

'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.'

We can see the tokens by simply iterating over the elements in `gregor`. Remember that these elements are spaCy objects themselves, and therefore have lots of attributes, e.g. `.text`, that we can access.

In [31]:
for token in gregor:
    print(token.text)

One
morning
,
when
Gregor
Samsa
woke
from
troubled
dreams
,
he
found
himself
transformed
in
his
bed
into
a
horrible
vermin
.
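
For comparison, here is what a naive whitespace split does with the same sentence. This is only a sketch in plain Python (no spaCy involved; the variable name `naive_tokens` is made up for illustration), meant to show why a dedicated tokenizer is useful: punctuation stays glued to the neighbouring word.

```python
sentence = ("One morning, when Gregor Samsa woke from troubled dreams, "
            "he found himself transformed in his bed into a horrible vermin.")

# str.split() only cuts at whitespace, so commas and the final period
# remain attached to the preceding word.
naive_tokens = sentence.split()

print(naive_tokens[:3])   # ['One', 'morning,', 'when']
print(naive_tokens[-1])   # vermin.
print(len(naive_tokens))  # 20, versus the 23 tokens spaCy produced above
```

The three extra spaCy tokens are the two commas and the final period, each of which becomes a token of its own.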


You can find the full list of token attributes [here](https://spacy.io/api/token#attributes). Below are some attributes that I think are particularly useful.

- `i`: The index (position) of the token in the text. Starts at 0.
- `text`: The original word text.
- `lemma_`: The base form of the word.
- `pos_`: The simple [UPOS](https://universaldependencies.org/docs/u/pos/) part-of-speech tag.
- `tag_`: The detailed part-of-speech tag. For the English models, these follow the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tag set (in the version used for the OntoNotes 5 corpus).
- `dep_`: Syntactic dependency, i.e. the relation between tokens.
- `shape_`: The word shape – capitalization, punctuation, digits.
- `is_alpha`: Does the token consist of alphabetic characters?
- `is_punct`: Is the token a punctuation marker?
- `is_stop`: Is the token part of a stop list, i.e. the most common words of the language?
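
To give a feel for what a "word shape" is, here is a rough approximation in plain Python. The helper `rough_shape` is hypothetical and is not spaCy's own function; it assumes that letters map to `x`/`X`, digits map to `d`, other characters are kept, and long runs of the same shape character are cut short. Edge cases may well differ from what `shape_` returns.

```python
def rough_shape(text):
    """Approximate a word shape: letters -> x/X, digits -> d, others kept."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        # Track how long the current run of identical shape characters is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        # Keep at most four repeats so that long words share a shape
        if run < 4:
            shape.append(shape_char)
    return "".join(shape)

for word in ["Gregor", "vermin", "1984", "U.K."]:
    print(word, rough_shape(word))
# Gregor Xxxxx
# vermin xxxx
# 1984 dddd
# U.K. X.X.
```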

You can of course use spaCy with other Python modules to do what you want to do. For example, I'll use the `pandas` module to create a `DataFrame` object, which Jupyter automatically prints as a neat table.

***Python Note***: If you plan to use Python for data analysis, I highly recommend checking out [`pandas`](https://pandas.pydata.org/).


In [33]:
import pandas as pd

gregor_tab = [[token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_alpha, token.i] for token in gregor]

pd.DataFrame(gregor_tab, columns = ["text", "lemma_", "pos_", "tag_", "dep_", "is_alpha", "INDEX"])

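| | text | lemma_ | pos_ | tag_ | dep_ | is_alpha | INDEX |
|---|---|---|---|---|---|---|---|
| 0 | One | one | NUM | CD | nummod | True | 0 |
| 1 | morning | morning | NOUN | NN | npadvmod | True | 1 |
| 2 | , | , | PUNCT | , | punct | False | 2 |
| 3 | when | when | ADV | WRB | advmod | True | 3 |
| 4 | Gregor | Gregor | PROPN | NNP | nsubj | True | 4 |
| 5 | Samsa | Samsa | PROPN | NNP | nsubj | True | 5 |
| 6 | woke | wake | VERB | VBD | advcl | True | 6 |
| 7 | from | from | ADP | IN | prep | True | 7 |
| 8 | troubled | troubled | ADJ | JJ | amod | True | 8 |
| 9 | dreams | dream | NOUN | NNS | pobj | True | 9 |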


Note that some of these attribute names end in an underscore `_`. The underscore versions return the human-readable string label. If you leave off the `_`, you get the integer ID that spaCy uses internally for that label.

In [34]:
for token in gregor:
    print(token.text, token.pos_, token.pos)

One NUM 93
morning NOUN 92
, PUNCT 97
when ADV 86
Gregor PROPN 96
Samsa PROPN 96
woke VERB 100
from ADP 85
troubled ADJ 84
dreams NOUN 92
, PUNCT 97
he PRON 95
found VERB 100
himself PRON 95
transformed VERB 100
in ADP 85
his DET 90
bed NOUN 92
into ADP 85
a DET 90
horrible ADJ 84
vermin NOUN 92
. PUNCT 97
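
One way to picture the relationship between `pos_` and `pos` is as a two-way lookup table between labels and integer IDs. The sketch below uses plain dictionaries with the pairs copied from the output above; the names `label_to_id` and `id_to_label` are made up for illustration, this is not how spaCy stores them internally, and you should not rely on the IDs being the same in other spaCy versions.

```python
# (label, ID) pairs copied from the printed output above
pairs = [("ADJ", 84), ("ADP", 85), ("ADV", 86), ("DET", 90), ("NOUN", 92),
         ("NUM", 93), ("PRON", 95), ("PROPN", 96), ("PUNCT", 97), ("VERB", 100)]

label_to_id = dict(pairs)
id_to_label = {pos_id: label for label, pos_id in pairs}

print(id_to_label[92])      # NOUN
print(label_to_id["VERB"])  # 100
```

In everyday analysis the string labels are usually what you want, since they are readable without a lookup.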


## Working with POS tags and other attributes

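Now that we have seen how to access token attributes, let's use them to describe a longer passage: the opening sentence of Charles Dickens's *A Tale of Two Cities*.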

In [35]:
dickens_doc1 = nlp_sm("""It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way— in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.""")
dickens_doc1.text

'It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way— in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.'

The first step might be to get some basic frequency statistics. 

- The total number of tokens
- The number of word tokens, not including punctuation
- The number of unique word types

In [36]:
# Total N tokens
len(dickens_doc1)

138

In [37]:
# Total N words
n_words = 0
for word in dickens_doc1:
    if not word.is_punct:
        n_words = n_words + 1
n_words

119

***Exercise:*** Find another way to calculate the total number of words using the `len()` function and list comprehension. 

In [38]:
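# Total N word types
word_list = []
for word in dickens_doc1:
    if not word.is_punct:
        word_list.append(word)
# set() keeps only the unique elements of a list. Caution: word_list holds Token
# objects, which are distinct per position, so repeated words are NOT merged here.
set(word_list)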

{It,
 was,
 the,
 best,
 of,
 times,
 it,
 was,
 the,
 worst,
 of,
 times,
 it,
 was,
 the,
 age,
 of,
 wisdom,
 it,
 was,
 the,
 age,
 of,
 foolishness,
 it,
 was,
 the,
 epoch,
 of,
 belief,
 it,
 was,
 the,
 epoch,
 of,
 incredulity,
 it,
 was,
 the,
 season,
 of,
 Light,
 it,
 was,
 the,
 season,
 of,
 Darkness,
 it,
 was,
 the,
 spring,
 of,
 hope,
 it,
 was,
 the,
 winter,
 of,
 despair,
 we,
 had,
 everything,
 before,
 us,
 we,
 had,
 nothing,
 before,
 us,
 we,
 were,
 all,
 going,
 direct,
 to,
 Heaven,
 we,
 were,
 all,
 going,
 direct,
 the,
 other,
 way,
 in,
 short,
 the,
 period,
 was,
 so,
 far,
 like,
 the,
 present,
 period,
 that,
 some,
 of,
 its,
 noisiest,
 authorities,
 insisted,
 on,
 its,
 being,
 received,
 for,
 good,
 or,
 for,
 evil,
 in,
 the,
 superlative,
 degree,
 of,
 comparison,
 only}

In [39]:
len(set(word_list)) 

119
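
One caveat about the result above: it equals the total number of words, because `word_list` holds `Token` objects rather than strings, and two tokens at different positions are never treated as the same element, even when they spell the same word. To count word *types*, the set has to be built from strings. Here is a minimal sketch in plain Python, using the first twelve words of the passage as a hard-coded list so that it runs without a model (with spaCy you would collect `word.text` or `word.lower_` instead):

```python
words = ["It", "was", "the", "best", "of", "times",
         "it", "was", "the", "worst", "of", "times"]

n_tokens = len(words)                            # every occurrence counts
n_types = len(set(words))                        # unique strings, case-sensitive
n_types_lower = len({w.lower() for w in words})  # "It" and "it" merged

print(n_tokens, n_types, n_types_lower)  # 12 8 7
```

Whether "It" and "it" should count as one type or two is an analysis decision, so it is worth stating which one you chose.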

***Exercise:*** Calculate the total number of *lemma* types. 

***Exercise:*** Calculate the total number of word types using list comprehension and the `set()` and `len()` functions.

Now let's create a list of words by frequency. This is where using other modules comes in very handy: we can use the `Counter()` function in the `collections` module to count how many times each item occurs in a list.

In [40]:
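from collections import Counter

# Caution: word_list holds Token objects, which are distinct per position,
# so every count below is 1. Count strings (e.g. word.text) for real frequencies.
Counter(word_list)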

Counter({It: 1,
         was: 1,
         the: 1,
         best: 1,
         of: 1,
         times: 1,
         it: 1,
         was: 1,
         the: 1,
         worst: 1,
         of: 1,
         times: 1,
         it: 1,
         was: 1,
         the: 1,
         age: 1,
         of: 1,
         wisdom: 1,
         it: 1,
         was: 1,
         the: 1,
         age: 1,
         of: 1,
         foolishness: 1,
         it: 1,
         was: 1,
         the: 1,
         epoch: 1,
         of: 1,
         belief: 1,
         it: 1,
         was: 1,
         the: 1,
         epoch: 1,
         of: 1,
         incredulity: 1,
         it: 1,
         was: 1,
         the: 1,
         season: 1,
         of: 1,
         Light: 1,
         it: 1,
         was: 1,
         the: 1,
         season: 1,
         of: 1,
         Darkness: 1,
         it: 1,
         was: 1,
         the: 1,
         spring: 1,
         of: 1,
         hope: 1,
         it: 1,
         was: 1,
         the: 1,
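
Every count above is 1 because `word_list` contains `Token` objects, and each token is a separate key even when the spelling is the same. Counting strings gives real frequencies. Here is a minimal sketch in plain Python on the first twelve words of the passage (a hard-coded list so that it runs without a model; with spaCy the equivalent would be something like `Counter(w.lower_ for w in dickens_doc1 if not w.is_punct)`):

```python
from collections import Counter

words = ["It", "was", "the", "best", "of", "times",
         "it", "was", "the", "worst", "of", "times"]

# Lowercase first so that "It" and "it" are counted together
freqs = Counter(w.lower() for w in words)

print(freqs["it"])           # 2
print(freqs["worst"])        # 1
print(freqs.most_common(3))  # [('it', 2), ('was', 2), ('the', 2)]
```

`most_common()` sorts by count, and items with equal counts keep the order in which they were first seen.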


We can do the same with POS tags and see how many of each POS we have in the text. We'll use the same `Counter()` function.

In [46]:
pos_list = [w.pos_ for w in dickens_doc1 if not w.is_punct]
pos_counts = Counter(pos_list)
pos_counts

Counter({'PRON': 18,
         'AUX': 16,
         'DET': 17,
         'ADJ': 9,
         'ADP': 20,
         'NOUN': 25,
         'PROPN': 2,
         'ADV': 5,
         'VERB': 4,
         'SCONJ': 2,
         'CCONJ': 1})

In the above example, `pos_counts` is an instance of the `Counter` class, which is a subclass of `dict`, so it has all the methods of the `dict` class, including `.keys()` (the POS labels) and `.values()` (their counts).
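
A short sketch of what that means in practice, with the counts copied from the output above so that it runs without a model. Besides the `dict` methods, `Counter` adds a few conveniences of its own, such as `most_common()` and returning 0 for keys it has never seen.

```python
from collections import Counter

# Counts copied from the pos_counts output above
pos_counts = Counter({'PRON': 18, 'AUX': 16, 'DET': 17, 'ADJ': 9, 'ADP': 20,
                      'NOUN': 25, 'PROPN': 2, 'ADV': 5, 'VERB': 4,
                      'SCONJ': 2, 'CCONJ': 1})

print(isinstance(pos_counts, dict))  # True
print(pos_counts["NOUN"])            # 25
print(pos_counts["INTJ"])            # 0 (missing keys do not raise KeyError)
print(pos_counts.most_common(3))     # [('NOUN', 25), ('ADP', 20), ('PRON', 18)]
print(sum(pos_counts.values()))      # 119, the number of words counted earlier
```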

In [47]:
list(pos_counts.keys()) # list() converts it to a list object

['PRON',
 'AUX',
 'DET',
 'ADJ',
 'ADP',
 'NOUN',
 'PROPN',
 'ADV',
 'VERB',
 'SCONJ',
 'CCONJ']

In [48]:
list(pos_counts.values())

[18, 16, 17, 9, 20, 25, 2, 5, 4, 2, 1]

Now we can combine them into a `DataFrame`.

In [50]:
pos_df = pd.DataFrame(list(zip(pos_counts.keys(), pos_counts.values())), columns = ["POS", "Freq"])
pos_df

| | POS | Freq |
|---|---|---|
| 0 | PRON | 18 |
| 1 | AUX | 16 |
| 2 | DET | 17 |
| 3 | ADJ | 9 |
| 4 | ADP | 20 |
| 5 | NOUN | 25 |
| 6 | PROPN | 2 |
| 7 | ADV | 5 |
| 8 | VERB | 4 |
| 9 | SCONJ | 2 |


In [52]:
pos_df.sort_values("Freq", ascending = False)

| | POS | Freq |
|---|---|---|
| 5 | NOUN | 25 |
| 4 | ADP | 20 |
| 0 | PRON | 18 |
| 2 | DET | 17 |
| 1 | AUX | 16 |
| 3 | ADJ | 9 |
| 7 | ADV | 5 |
| 8 | VERB | 4 |
| 6 | PROPN | 2 |
| 9 | SCONJ | 2 |
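
Raw counts are hard to compare across texts of different lengths, so a natural next step is to turn them into percentages. Here is a minimal sketch in plain Python, with the counts copied from the `pos_counts` output earlier in this section (the name `rel_freq` is made up for illustration):

```python
from collections import Counter

# Counts copied from the pos_counts output earlier in this section
pos_counts = Counter({'PRON': 18, 'AUX': 16, 'DET': 17, 'ADJ': 9, 'ADP': 20,
                      'NOUN': 25, 'PROPN': 2, 'ADV': 5, 'VERB': 4,
                      'SCONJ': 2, 'CCONJ': 1})

total = sum(pos_counts.values())

# Percentage of all words per POS, most frequent first, one decimal place
rel_freq = {pos: round(100 * n / total, 1) for pos, n in pos_counts.most_common()}

for pos, pct in rel_freq.items():
    print(pos, pct)
```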


## Splitting a text into sentences

Below is the text from the first two paragraphs of *Harry Potter and the Philosopher's Stone* (or *Harry Potter and the Sorcerer's Stone* for Americans).

In [9]:
hp_text1 = """Mr and Mrs Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense. Mr Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large moustache. Mrs Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere."""
hp_text1

'Mr and Mrs Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense. Mr Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large moustache. Mrs Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.'

We can make a doc object out of larger bits of text as well as individual sentences. We can get all of the text back with the `.text` attribute.

In [11]:
hp_doc1 = nlp_sm(hp_text1)
hp_doc1.text

'Mr and Mrs Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense. Mr Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large moustache. Mrs Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.'

In [18]:
for s in hp_doc1.sents:
    print(s.text)

Mr and Mrs Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
They were the last people you’d expect to be involved in anything strange or mysterious, because they just didn’t hold with such nonsense.
Mr Dursley was the director of a firm called Grunnings, which made drills.
He was a big, beefy man with hardly any neck, although he did have a very large moustache.
Mrs Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours.
The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.


In [15]:
hp_text2 = """It was on the corner of the street that he noticed the first sign of something peculiar – a cat reading a map. For a second, Mr Dursley didn’t realise what he had seen – then he jerked his head around to look again."""
hp_text2

'It was on the corner of the street that he noticed the first sign of something peculiar – a cat reading a map. For a second, Mr Dursley didn’t realise what he had seen – then he jerked his head around to look again.'

In [17]:
hp_doc2 = nlp_sm(hp_text2)

for s in hp_doc2.sents:
    print(s.text)

It was on the corner of the street that he noticed the first sign of something peculiar – a cat reading a map.
For a second, Mr Dursley didn’t realise what he had seen – then he jerked his head around to look again.


***Python Note:*** As with individual words, it's often useful to organise your sentences into a `list` object for use down the line. [List comprehensions](https://www.pythonforbeginners.com/basics/list-comprehensions-in-python) are a very efficient way to do this.

### Customizing your sentence segmenter

It's worth noting that the default spaCy sentence splitting process relies on dependency parses to determine sentence boundaries. This differs from a simpler, though less accurate, rule-based method, e.g. splitting sentences wherever you find a `.`, `?`, or `!`. It means that you need a statistical model to parse the text before the segmenting can work, which is why we load the model before we do anything else.
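
To make the contrast concrete, here is a minimal sketch of the rule-based approach in plain Python, using only the standard `re` module. The example text and the name `naive_split` are made up for illustration. It splits after every `.`, `?` or `!` that is followed by whitespace, and so it wrongly breaks after the abbreviation.

```python
import re

def naive_split(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

text = "Dr. Watson arrived. Was he late? No!"
sentences = naive_split(text)
print(sentences)
# ['Dr.', 'Watson arrived.', 'Was he late?', 'No!']
```

Collecting the sentences into a list with a list comprehension, as done here, also works with spaCy's sentence spans.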

However, a model is only as good as the data it was trained on. If your data differs substantially from the training data, e.g. conversational data or text from instant messaging, chats or social media, the default segmentation may perform poorly. Fortunately, you can customize the segmentation process by adding your own rules to tell spaCy what to do in special cases. Which rules you add will depend on the nature of your data.

So consider the toy example below.

In [31]:
nlp = spacy.load("en_core_web_sm") # create a new model object so I don't alter the main one

text = "this is a sentence...and another...and another sentence."
doc = nlp(text)

print("Before adding rule:", [sent.text for sent in doc.sents])

Before adding rule: ['this is a sentence...and another...and another sentence.']


Now we want to tell spaCy to add a sentence boundary whenever it sees a "..." in the text. We do this by defining a function that goes through a `Doc` object token by token and wherever it finds a "...", it adds a flag to the following token indicating that it is the beginning of a new sentence. 

In [32]:
def set_custom_boundaries(doc):
    for token in doc[:-1]: # search from the first to the second to last token
        if token.text == "...":
            # if current token == "...", add sent_start flag to the next token (token.i + 1)
            doc[token.i + 1].is_sent_start = True 
    return doc

# Now add the new function to the pipeline (spaCy v2 syntax; spaCy v3 expects the name of a registered component instead)
nlp.add_pipe(set_custom_boundaries, before="parser")

doc = nlp(text)
print("After adding rule:", [sent.text for sent in doc.sents])

After adding rule: ['this is a sentence...', 'and another...', 'and another sentence.']
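
The `add_pipe` call above passes the function itself, which is the spaCy v2 style. In spaCy v3, as far as I know, the function first has to be registered with the `@Language.component` decorator and is then added by name. Here is a minimal sketch, assuming spaCy v3 is installed. It uses `spacy.blank("en")` (tokenizer only, no parser) so that it runs without downloading a model, which is a choice made for this example rather than what the tutorial does.

```python
import spacy
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after every "..." as the start of a new sentence
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

# Blank English pipeline: tokenizer only (an assumption made for this sketch)
nlp_v3 = spacy.blank("en")
nlp_v3.add_pipe("set_custom_boundaries")  # v3: pass the registered name

doc_v3 = nlp_v3("this is a sentence...and another...and another sentence.")
print([sent.text for sent in doc_v3.sents])
# ['this is a sentence...', 'and another...', 'and another sentence.']
```

With a full model such as `en_core_web_sm`, the corresponding call would be `nlp.add_pipe("set_custom_boundaries", before="parser")`, so that the parser respects the boundaries set by the rule.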
