# [Chapter 2 (Large-scale Data Analysis with spaCy)](https://course.spacy.io/en/chapter2)
These are my notes for the second chapter of the advanced NLP [course](https://course.spacy.io/en/) provided by spaCy. 

In [2]:
import spacy

This chapter contains:
- More about data structures
- How to effectively combine statistical and rule-based approaches for text analysis
- Under the hood stuff

### 2.1: Data Structures: Vocab, Lexemes and StringStore

#### Vocab

The `Vocab` stores data that is shared across multiple documents. This includes words, but also the label schemes for tags and entities. To save memory, all strings are encoded as hash IDs: if a word occurs more than once, spaCy doesn't need to store it every time. Instead, a hash function generates an ID for each string, and the string itself is stored only once in the string store, available as `nlp.vocab.strings`.

The string store is a lookup table that works in both directions: you can look up a string to get its hash, and look up a hash to get its string. Internally, spaCy communicates only in hash IDs. Hashes can't be reversed, so if a string isn't in the string store, there's no way to recover it from its hash. That's why we always need to pass around the shared vocab.

In [14]:
nlp = spacy.load("en_core_web_sm")

nlp.vocab.strings.add("coffee")
coffee_hash = nlp.vocab.strings["coffee"]
coffee_string = nlp.vocab.strings[coffee_hash]

print(coffee_hash)
print(coffee_string)

3197928453018144401
coffee


In [13]:
# This hash was never added to the string store, so the reverse lookup raises an error
string = nlp.vocab.strings[319792845301814440]

KeyError: "[E018] Can't retrieve string for hash '319792845301814440'. This usually refers to an issue with the `Vocab` or `StringStore`."

More about the above error: any string can be converted to a hash, so looking up the hash of a string never fails, even for a string spaCy has never seen (as seen below). The hash is simply computed from the string. The reverse direction is different: since hashes can't be reversed, a hash can only be turned back into a string if that string is already in the string store. So before you use a hash to index anything, make sure its string exists in the vocab, for example by adding it with `nlp.vocab.strings.add` as above or by processing a text that contains it.

In [27]:
nlp.vocab.strings["siodjfisd"]

12030218847533677157

We can access the vocab and the string store through a `doc` object as well:

In [17]:
doc = nlp("I love coffee")

doc.vocab.strings["love"]

3702023516439754181

In [19]:
# What if the doc doesn't have a particular word? We can still access a word. Vocab is shared across ALL documents
doc.vocab.strings["pizza"]

13450285337346246112

In [20]:
nlp.vocab.strings["pizza"]

13450285337346246112
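
To make the two-way lookup concrete, here is a minimal toy sketch of the idea in plain Python. It is not spaCy's implementation: `ToyStringStore` is a hypothetical class written for these notes, and SHA-256 is only a stand-in for the hash function spaCy actually uses. It shows why string-to-hash always works, while hash-to-string only works for strings that were stored.

```python
import hashlib


class ToyStringStore:
    """Toy two-way lookup table (hypothetical, not spaCy's StringStore)."""

    def __init__(self):
        # Maps hash ID -> original string, for stored strings only
        self._strings_by_hash = {}

    @staticmethod
    def hash_string(string):
        # Stand-in hash: first 8 bytes of SHA-256 as a 64-bit integer
        digest = hashlib.sha256(string.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, string):
        # Store the string once and return its hash ID
        key = self.hash_string(string)
        self._strings_by_hash[key] = string
        return key

    def __getitem__(self, key):
        if isinstance(key, str):
            # String -> hash: always computable, nothing is stored
            return self.hash_string(key)
        # Hash -> string: only possible if the string was added before
        return self._strings_by_hash[key]


store = ToyStringStore()
coffee_hash = store.add("coffee")
print(store[coffee_hash])              # coffee
print(store["coffee"] == coffee_hash)  # True

tea_hash = store["tea"]  # computing the hash works without adding "tea"
try:
    store[tea_hash]
except KeyError:
    print("'tea' was never added, so its hash can't be resolved")
```

The hash alone carries no information about the original string, which is why the table has to travel with the data.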

#### Lexeme

A `Lexeme` object is a context-independent entry in the vocabulary. You can get one by looking up a string or a hash ID in the vocab. Lexemes expose attributes, just like tokens do, but they only hold context-independent information about a word, such as its text or whether it consists of alphabetic characters. As such, they don't have part-of-speech tags, dependencies, or entity labels, since those depend on the context.

In [22]:
lexeme = nlp.vocab["coffee"]

print(lexeme.text, lexeme.orth, lexeme.is_alpha, lexeme.like_num)

coffee 3197928453018144401 True False


### 2.4: Doc, Span, and Token

#### Doc

The `Doc` is one of the central objects in spaCy. It is created automatically when processing a text using `nlp`, but can also be instantiated manually. 

In [29]:
from spacy.tokens import Doc

nlp = spacy.blank("en")

words = ["Hello", "world", "!"] # indicates the words in the document
spaces = [True, False, False] # indicates whether there's a space after the word in the corresponding index of words list

doc = Doc(nlp.vocab, words=words, spaces=spaces) # three args: shared vocab, words, and spaces
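
As a quick check of how `words` and `spaces` fit together, the sketch below builds a `Doc` by hand and prints its text. The sentence is an arbitrary example chosen for these notes; it assumes spaCy v3 is installed and uses a blank English pipeline, so no trained model is needed.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# One entry in `spaces` per word: True if the word is followed by a space
words = ["Go", ",", "get", "started", "!"]
spaces = [False, True, True, False, False]

doc = Doc(nlp.vocab, words=words, spaces=spaces)

print(doc.text)  # Go, get started!
print(len(doc))  # 5
```

Punctuation is a token of its own, which is why "Go" has no trailing space but the comma does.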

#### Span

A `Span` is a slice of a document; it should contain one or more tokens. It takes at least three arguments: the doc it refers to, and the start and end indices of the span. The end index is exclusive, as with Python slices.

In [31]:
from spacy.tokens import Span

span = Span(doc, 0, 2)

span_with_label = Span(doc, 0, 2, label="GREETING") # optionally, can pass the label argument

doc.ents = [span_with_label] # .ents is a writable attribute, so we can add entities manually by overwriting it or extending it

for entity in doc.ents:
    print(entity.text, entity.label_)

Hello world GREETING


#### Best Practices
Both `Doc` and `Span` are powerful and contain references and relationships of words and sentences. As such:
- Convert the results to strings as late as possible
- Use token attributes if available, like `token.i` for the token index
- Don't forget to pass in the shared `vocab`
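
A small sketch of the first two points: instead of converting the doc to a list of strings and juggling indices by hand, stay with the `Token` objects and use `token.i` to reach neighbouring tokens. The sentence and the "title-cased word" rule are made-up illustrations, and a blank English pipeline is assumed.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Berlin looks like a nice city")

pairs = []
for token in doc:
    # Use token attributes (is_title, i) rather than string comparisons
    if token.is_title and token.i + 1 < len(doc):
        next_token = doc[token.i + 1]
        # Convert to strings only at the very end
        pairs.append((token.text, next_token.text))

print(pairs)  # [('Berlin', 'looks')]
```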

### 2.8: Word Vectors and Semantic Similarity

spaCy can compare two objects and predict their similarity. `Doc`, `Span`, and `Token` objects all have a `.similarity` method, which takes another object and returns a float, generally between 0 and 1, indicating how similar the two objects are (higher means more similar). To use this method, you need a larger spaCy pipeline that includes word vectors: a medium or large English pipeline works, but a small one does not. In other words, pick a pipeline whose name ends in `md` or `lg`.

In [32]:
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
doc1.similarity(doc2)

0.8627203210548107

In [37]:
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))
# Interesting that spaCy gives such a high score when comparing 'pizza' and 'pasta'

0.73695457


In [36]:
# Can I compare an object with an object of another type? Yes!
doc2.similarity(token1)

0.7497223717646503

spaCy predicts similarity using word vectors, which are multi-dimensional representations of the meanings of words. These word vectors can be generated from large amounts of text using algorithms like [Word2Vec](https://en.wikipedia.org/wiki/Word2vec), and they can be added to spaCy's pipelines. By default, the similarity returned is the cosine similarity between two vectors, although this can be changed. The vectors of `Span` and `Doc` objects default to the average of their token vectors; this means that short phrases tend to give more meaningful vectors than long documents with many irrelevant words.
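
The two ideas in the paragraph above, cosine similarity and averaging token vectors, can be sketched in a few lines of plain Python. The 2-dimensional vectors below are invented for illustration (real word vectors have hundreds of dimensions), and `cosine_similarity` and `mean_vector` are helper names made up for these notes, not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average, mimicking how a Doc vector is derived from tokens."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Invented toy "word vectors"
pizza = [1.0, 0.0]
pasta = [0.8, 0.6]

token_similarity = cosine_similarity(pizza, pasta)
print(token_similarity)  # approximately 0.8

# A toy "document vector" is the average of its token vectors
doc_vector = mean_vector([pizza, pasta])
print(doc_vector)
print(cosine_similarity(doc_vector, pizza))
```

Because the document vector is an average, every extra unrelated word pulls it away from the words that matter.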

In [39]:
doc = nlp("I have a banana")
print(doc[3].vector)
print(len(doc[3].vector)) # 300 dimensions

[ 2.0228e-01 -7.6618e-02  3.7032e-01  3.2845e-02 -4.1957e-01  7.2069e-02
 -3.7476e-01  5.7460e-02 -1.2401e-02  5.2949e-01 -5.2380e-01 -1.9771e-01
 -3.4147e-01  5.3317e-01 -2.5331e-02  1.7380e-01  1.6772e-01  8.3984e-01
  5.5107e-02  1.0547e-01  3.7872e-01  2.4275e-01  1.4745e-02  5.5951e-01
  1.2521e-01 -6.7596e-01  3.5842e-01 -4.0028e-02  9.5949e-02 -5.0690e-01
 -8.5318e-02  1.7980e-01  3.3867e-01  1.3230e-01  3.1021e-01  2.1878e-01
  1.6853e-01  1.9874e-01 -5.7385e-01 -1.0649e-01  2.6669e-01  1.2838e-01
 -1.2803e-01 -1.3284e-01  1.2657e-01  8.6723e-01  9.6721e-02  4.8306e-01
  2.1271e-01 -5.4990e-02 -8.2425e-02  2.2408e-01  2.3975e-01 -6.2260e-02
  6.2194e-01 -5.9900e-01  4.3201e-01  2.8143e-01  3.3842e-02 -4.8815e-01
 -2.1359e-01  2.7401e-01  2.4095e-01  4.5950e-01 -1.8605e-01 -1.0497e+00
 -9.7305e-02 -1.8908e-01 -7.0929e-01  4.0195e-01 -1.8768e-01  5.1687e-01
  1.2520e-01  8.4150e-01  1.2097e-01  8.8239e-02 -2.9196e-02  1.2151e-03
  5.6825e-02 -2.7421e-01  2.5564e-01  6.9793e-02 -2

Similarity can be useful for many applications, like recommendation systems or duplicate-flagging systems, but the exact definition of 'similarity' will depend on the context and the application. Consider the example below:

In [40]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")

print(doc1.similarity(doc2))

0.9501448304982777


Although the documents describe completely different opinions about cats, they both express sentiment about cats. In certain applications, these would be considered similar; in others, these two statements would be direct opposites of each other. As such, when you start developing NLP applications, you might want to train vectors on your own data or tweak the similarity algorithm.

### 2.11: Combining Predictions and Rules
#### What is the difference between statistical models and rule-based systems?
Statistical models are useful if your application needs to generalize based on a few examples. It would be inefficient to provide a model with all of the names that have ever existed, so instead we train a model to recognize whether a span is an entity name. In spaCy, the statistical components include the entity recognizer, the dependency parser, and the part-of-speech tagger.

On the other hand, a rule-based system is useful if there's a more or less finite number of instances you want to find. In spaCy, you can achieve this with custom tokenization rules, as well as the matcher and phrase matcher.
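
As a minimal sketch of the rule-based side, the token-based `Matcher` below finds a fixed phrase regardless of capitalization. It assumes spaCy v3 (where `matcher.add` takes a list of patterns) and a blank English pipeline; the sentence and the `DOG` pattern name are arbitrary examples.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each dict describes one token; LOWER compares the lowercased token text
pattern = [{"LOWER": "golden"}, {"LOWER": "retriever"}]
matcher.add("DOG", [pattern])

doc = nlp("I have a golden retriever and a Golden Retriever")

# Each match is (match_id, start, end); slice the doc to get the Span
matched_texts = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(matched_texts)
```

Both spellings should be found, since the rule only looks at the lowercase form of each token.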


#### Phrase Matcher
The Phrase Matcher is another useful tool for finding sequences of words in the data. It performs a keyword search on the document, but instead of only finding strings, it gives you direct access to the tokens in context. It takes `Doc` objects as patterns. Because it is fast, it's great for matching large dictionaries and word lists against large volumes of text.

In [45]:
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

pattern = nlp("Golden Retriever")
matcher.add("DOG", [pattern])
doc = nlp("I have a Golden Retriever")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched Span:", span.text)

Matched Span: Golden Retriever


Cool example when matching countries:

In [None]:
import json

# Load the list of country names shipped with the course exercises
with open("exercises/en/countries.json", encoding="utf8") as f:
    COUNTRIES = json.load(f)

nlp = spacy.blank("en")
doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)

# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
patterns = list(nlp.pipe(COUNTRIES)) # list of documents, each document denoting a country
matcher.add("COUNTRY", patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])
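
One way to combine the matcher with the manual `Span` creation from section 2.4 is to turn each match into a labelled entity. The sketch below is self-contained, so it replaces the JSON file with a tiny hypothetical stand-in list of two countries; the `GPE` label is a choice made for these notes, and spaCy v3 with a blank English pipeline is assumed.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank("en")

# Hypothetical stand-in for the full list loaded from countries.json
COUNTRIES = ["Czech Republic", "Slovakia"]

matcher = PhraseMatcher(nlp.vocab)
matcher.add("COUNTRY", list(nlp.pipe(COUNTRIES)))

doc = nlp("Czech Republic may help Slovakia protect its airspace")

# Wrap every match in a labelled Span and overwrite the doc's entities
doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matcher(doc)]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)  # [('Czech Republic', 'GPE'), ('Slovakia', 'GPE')]
```

With a trained pipeline, the same spans would also give access to statistical predictions such as `span.root.head`, which is where rules and predictions meet.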