# Chapter 2: Large-scale data analysis with spaCy

In this chapter, you'll use your new skills to extract specific information from large volumes of text. You'll learn how to make the most of spaCy's data structures, and how to effectively combine statistical and rule-based approaches for text analysis.



## Data Structures (1): Vocab, Lexemes and StringStore

Welcome back! Now that you've had some real experience using spaCy's objects, it's time for you to learn more about what's actually going on under spaCy's hood.

In this lesson, we'll take a look at the shared vocabulary and how spaCy deals with strings.

### Shared vocab and string store (1)

- `Vocab`: stores data shared across multiple documents
- To save memory, spaCy encodes all strings to hash values
- Strings are only stored once in the `StringStore` via `nlp.vocab.strings`
- String store: lookup table in both directions

In [1]:
import spacy

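# "en" was a spaCy v2 shortcut link for this model; the full package name also works in v3
nlp = spacy.load("en_core_web_sm")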

In [2]:
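# The first line only computes the hash; it does not add "coffee" to the string store.
# The second line therefore raises a KeyError: the store has no string for that hash yet.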
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

KeyError: "[E018] Can't retrieve string for hash '3197928453018144401'. This usually refers to an issue with the `Vocab` or `StringStore`."

In [4]:
# Processing a text that contains the word adds it to the string store,
# so the reverse lookup (hash to string) now succeeds
nlp("This coffee is really good")

coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

coffee_hash, coffee_string

(3197928453018144401, 'coffee')

spaCy stores all shared data in a vocabulary, the `Vocab`.

This includes words, but also the label schemes for tags and entities.

To save memory, all strings are encoded to hash IDs. If a word occurs more than once, we don't need to save it every time.

Instead, spaCy uses a hash function to generate an ID and stores the string only once in the string store. The string store is available as nlp.vocab.strings.

It's a lookup table that works in both directions. You can look up a string and get its hash, and look up a hash to get its string value. Internally, spaCy only communicates in hash IDs.

Hash IDs can't be reversed, though. If a word is not in the vocabulary, there's no way to get its string. That's why we always need to pass around the shared vocab.
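To make the one-way nature of hashing concrete, here is a minimal pure-Python sketch of a two-way lookup table. It is a toy model, not spaCy's implementation: `ToyStringStore` and `toy_hash` are hypothetical names, and `hashlib.sha256` merely stands in for whatever hash function spaCy uses internally.

```python
import hashlib


def toy_hash(string):
    """Map a string to a 64-bit integer (a stand-in for spaCy's real hash function)."""
    digest = hashlib.sha256(string.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        # Store the string once, keyed by its hash.
        key = toy_hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash can always be computed, even for unseen strings.
            return toy_hash(key)
        # Hash -> string only works if the string was stored earlier.
        return self._by_hash[key]

    def __len__(self):
        return len(self._by_hash)


store = ToyStringStore()
coffee_hash = store["coffee"]  # works, although nothing is stored yet

try:
    store[coffee_hash]
    found_before = True
except KeyError:
    found_before = False  # the hash alone cannot be turned back into a string

store.add("coffee")
store.add("coffee")  # adding the same string twice still stores it only once

print(found_before, store[coffee_hash], len(store))
```

The failed lookup before `add` mirrors the `KeyError` shown earlier: a hash is just a number, so the table that maps it back to a string has to travel with the data.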

In [5]:
doc = nlp("I love coffee")
print("hash value:", nlp.vocab.strings["coffee"])
print("string value:", nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee


In [6]:
doc = nlp("I love coffee")
print("hash value:", doc.vocab.strings["coffee"])

hash value: 3197928453018144401


In [8]:
len(doc.vocab.strings)

1148

In [9]:
len(nlp.vocab.strings)

1148

In [10]:
doc = nlp("I love yxxxxcoffee")
len(doc.vocab.strings), len(nlp.vocab.strings)

(1149, 1149)

To get the hash for a string, we can look it up in `nlp.vocab.strings`.

To get the string representation of a hash, we can look up the hash.

A `Doc` object also exposes its vocab and strings.

In [14]:
type(doc.vocab)

spacy.vocab.Vocab

In [16]:
type(doc.vocab["coffee"])

spacy.lexeme.Lexeme

In [17]:
type(doc.vocab.strings)

spacy.strings.StringStore

In [18]:
doc.vocab.strings["coffee"]

3197928453018144401

## Lexemes: entries in the vocabulary

A `Lexeme` object is an entry in the vocabulary.

In [20]:
doc = nlp("I love coffee")
lexeme = nlp.vocab["coffee"]

# Print lexical attributes
lexeme.text, lexeme.orth, lexeme.is_alpha

('coffee', 3197928453018144401, True)

In [21]:
nlp.vocab.strings[3197928453018144401]

'coffee'

- Contains the context-independent information about a word
    - Word text: `lexeme.text` and `lexeme.orth` (the hash)
    - Lexical attributes like `lexeme.is_alpha`
    - Not context-dependent part-of-speech tags, dependencies or entity labels

Lexemes are context-independent entries in the vocabulary.

You can get a lexeme by looking up a string or a hash ID in the vocab.

Lexemes expose attributes, just like tokens.

They hold context-independent information about a word, like the text, or whether the word consists of alphabetic characters.

Lexemes don't have part-of-speech tags, dependencies or entity labels. Those depend on the context.
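The difference between a lexeme and a token can be sketched with two documents that share one vocab. This example assumes a blank English pipeline from `spacy.blank("en")` (tokenizer and vocab only, no trained model), which is enough for lexical attributes; the example sentences are made up.

```python
import spacy

# A blank English pipeline: tokenizer and vocab only, no trained model needed.
nlp = spacy.blank("en")

doc_a = nlp("I love coffee")
doc_b = nlp("Coffee shops sell coffee")

# The context-independent vocabulary entry for the word
lexeme = nlp.vocab["coffee"]

# The two "coffee" tokens appear in different contexts...
token_a = doc_a[2]
token_b = doc_b[3]

# ...but both resolve to the same hash as the lexeme.
print(token_a.orth == token_b.orth == lexeme.orth)

# Lexical attributes are the same on the lexeme and on the tokens.
print(lexeme.text, lexeme.is_alpha, token_a.is_alpha)

# The position in the document is context, so it lives on the token only.
print(token_a.i, token_b.i)

# Both docs point at the same shared vocab object.
print(doc_a.vocab is doc_b.vocab)
```

Anything that depends on the surrounding sentence, such as the token index here, belongs to the `Token`; anything that is true of the word in every context belongs to the `Lexeme`.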

### Vocab, hashes and lexemes
![](https://course.spacy.io/vocab_stringstore.png)

## Data Structures (2)

Now that you know all about the vocabulary and string store, we can take a look at the most important data structure: the `Doc`, and its views `Token` and `Span`.

In [23]:
from spacy.lang.en import English

In [24]:
English

spacy.lang.en.English

In [25]:
# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ["Hello", "world", "!"]
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

In [26]:
doc

Hello world!

The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the `nlp` object. But you can also instantiate the class manually.

After creating the nlp object, we can import the `Doc` class from `spacy.tokens`.

Here we're creating a doc from three words. The spaces are a list of boolean values indicating whether the word is followed by a space. Every token includes that information – even the last one!

The Doc class takes three arguments: the shared vocab, the words and the spaces.
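To see why the `spaces` list matters, the following pure-Python sketch rebuilds the original text from the words and spaces. The helper `join_tokens` is a hypothetical name used only for illustration; it is not part of spaCy, which performs the equivalent reconstruction internally.

```python
words = ["Hello", "world", "!"]
spaces = [True, False, False]


def join_tokens(words, spaces):
    """Rebuild the text: each word is followed by a space only if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


text = join_tokens(words, spaces)
naive = " ".join(words)  # ignores the flags and inserts a space everywhere

print(repr(text))
print(repr(naive))
```

Without the per-token flags, the exclamation mark would be separated from "world" and the original text could not be recovered exactly.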

In [29]:
type(nlp("Hello world!"))

spacy.tokens.doc.Doc

## The Span object (1)

![](https://course.spacy.io/span_indices.png)

A `Span` is a slice of a doc consisting of one or more tokens. The `Span` takes at least three arguments: the doc it refers to, and the start and end index of the span. Remember that the end index is exclusive!

In [32]:
from spacy.tokens import Doc, Span

words = ["Hello", "world", "!"]
spaces = [True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to doc.ents
doc.ents = [span_with_label]

In [33]:
doc.ents

(Hello world,)

To create a `Span` manually, we can also import the class from `spacy.tokens`. We can then instantiate it with the doc and the span's start and end index, and an optional label argument.

`doc.ents` is writable, so we can add entities manually by overwriting it with a list of spans.
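As a quick check of the exclusive end index, this sketch inspects the span we just described. It assumes a blank English pipeline from `spacy.blank("en")` so that no trained model is required; the `GREETING` label is the same illustrative label used above.

```python
import spacy
from spacy.tokens import Doc, Span

# Blank pipeline: provides the shared vocab without loading a trained model.
nlp = spacy.blank("en")

doc = Doc(nlp.vocab, words=["Hello", "world", "!"], spaces=[True, False, False])

# Tokens 0 and 1 are included; the end index 2 ("!") is excluded.
span = Span(doc, 0, 2)
print(span.text, span.start, span.end, len(span))

# Slicing the doc produces a span over the same tokens.
sliced = doc[0:2]
print(sliced.text == span.text)

# A labeled span can be written to doc.ents and read back with its label.
doc.ents = [Span(doc, 0, 2, label="GREETING")]
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The span covers `end - start` tokens, the same convention as Python list slicing.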

### Best practices
- Doc and Span are very powerful and hold references and relationships of words and sentences
    - Convert result to strings as late as possible
    - Use token attributes if available – for example, token.i for the token index
- Don't forget to pass in the shared vocab

A few tips and tricks before we get started:

The `Doc` and `Span` are very powerful and optimized for performance. They give you access to all references and relationships of the words and sentences.

If your application needs to output strings, make sure to convert the doc as late as possible. If you do it too early, you'll lose all relationships between the tokens.

To keep things consistent, try to use built-in token attributes wherever possible. For example, `token.i` for the token index.

Also, don't forget to always pass in the shared vocab!
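The advice about converting to strings late can be illustrated with a small sketch. It assumes a blank English pipeline from `spacy.blank("en")`, and the example sentence is made up.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee and I love tea")

# Converting to strings early throws away each token's position:
# list.index() can only ever find the first "love".
words = [token.text for token in doc]
early = [words.index(word) for word in words if word == "love"]

# Staying with Token objects keeps the position available as token.i.
late = [token.i for token in doc if token.text == "love"]

print(early)
print(late)
```

With plain strings, both occurrences of "love" look identical; with tokens, each one still knows where it sits in the document.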

## Word vectors and semantic similarities

In this lesson, you'll learn how to use spaCy to predict how similar documents, spans or tokens are to each other.

You'll also learn about how to use word vectors and how to take advantage of them in your NLP application.

### Comparing semantic similarity
- spaCy can compare two objects and predict similarity
- `Doc.similarity()`, `Span.similarity()` and `Token.similarity()`
- Take another object and return a similarity score (0 to 1)
- Important: needs a model that has word vectors included, for example:
    - ✅ en_core_web_md (medium model)
    - ✅ en_core_web_lg (large model)
    - 🚫 NOT en_core_web_sm (small model)
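
To give a feel for what a similarity score measures, here is a pure-Python sketch of cosine similarity, which is the default measure spaCy applies to word vectors. The three-dimensional vectors below are made up for illustration; real models use vectors with hundreds of dimensions, and `cosine_similarity` is a hypothetical helper, not a spaCy function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare for an all-zero vector
    return dot / (norm_a * norm_b)


# Made-up toy vectors: "coffee" and "tea" point in similar directions, "car" does not.
vectors = {
    "coffee": [0.9, 0.1, 0.3],
    "tea": [0.8, 0.2, 0.4],
    "car": [0.1, 0.9, 0.2],
}

coffee_tea = cosine_similarity(vectors["coffee"], vectors["tea"])
coffee_car = cosine_similarity(vectors["coffee"], vectors["car"])

print(round(coffee_tea, 2), round(coffee_car, 2))
```

Vectors that point in a similar direction score close to 1, which is why a model without word vectors cannot produce meaningful scores.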
