# <center> Data Structures: Vocab, Lexemes and StringStore
    
#### Shared vocabulary and string store in spaCy
- Vocab: stores data shared across multiple documents (words, plus the label schemes for tags and entities)
- To save memory, spaCy encodes all strings to <b>hash values</b>
- Each string is stored only once, in the "StringStore", available via "nlp.vocab.strings"
    - It is a lookup table in both directions:
        - Look up a string and get its hash
        - Look up a hash and get its string
- Internally, spaCy only communicates in hash IDs
- Hashes can't be reversed - that's why the shared vocab always has to be provided
- Look up the string and hash in "nlp.vocab.strings"
- The doc object also exposes the vocab and strings
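
To make the two-way lookup concrete, here is a minimal sketch in plain Python. It is not spaCy's implementation: `ToyStringStore` and `toy_hash` are hypothetical names, and a truncated SHA-256 digest stands in for whatever hash function spaCy uses internally. The point it illustrates is that a hash can always be computed from a string, while going from a hash back to a string only works if the store has seen that string.

```python
import hashlib


def toy_hash(text: str) -> int:
    """Map a string to a 64-bit integer (assumed stand-in for spaCy's real hash)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyStringStore:
    """Hypothetical two-way lookup table: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text: str) -> int:
        key = toy_hash(text)
        self._by_hash[key] = text  # each string is stored only once
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # The hash can always be computed from the string itself
            return toy_hash(item)
        # The reverse direction only works for strings that were added
        return self._by_hash[item]


store = ToyStringStore()
cat_hash = store.add("cat")
print(store["cat"] == cat_hash)  # string -> hash
print(store[cat_hash])           # hash -> string

# A store that has never seen "cat" cannot turn the hash back into text
empty_store = ToyStringStore()
try:
    empty_store[cat_hash]
except KeyError:
    print("hash not found in this store")
```

The failed lookup at the end is the reason objects that only carry hash IDs need access to the same shared store.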
    
#### Lexemes: entries in the vocabulary
- A lexeme object is an entry in the vocabulary
- Get a lexeme by looking up a string or a hash ID in the vocab
- It exposes attributes, just like a token
- Contains the context-independent information about a word
    - Word text: lexeme.text and lexeme.orth (the hash)
    - Lexical attributes like lexeme.is_alpha
    - They don't have part-of-speech tags, dependencies or entity labels, because those depend on context
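
As a rough illustration of "context-independent", the sketch below models a vocabulary entry in plain Python. `ToyLexeme`, `toy_vocab` and `get_lexeme` are hypothetical names, and a CRC32 checksum is used as an assumed stand-in for the real hash. Every attribute is derived from the word's text alone, so the same entry can be shared by every occurrence of the word.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLexeme:
    """Context-independent facts about a word (hypothetical stand-in)."""
    text: str
    orth: int  # integer ID of the text

    @property
    def is_alpha(self) -> bool:
        return self.text.isalpha()

    @property
    def is_digit(self) -> bool:
        return self.text.isdigit()


toy_vocab = {}


def get_lexeme(text: str) -> ToyLexeme:
    """Return the single shared entry for this word, creating it if needed."""
    if text not in toy_vocab:
        toy_vocab[text] = ToyLexeme(text, zlib.crc32(text.encode("utf-8")))
    return toy_vocab[text]


first = get_lexeme("cat")
second = get_lexeme("cat")
print(first is second)  # both lookups return the same entry
print(first.text, first.is_alpha, first.is_digit)
```

Nothing in the entry says whether "cat" is a subject or part of an entity; that kind of information belongs to a token in a specific document.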
    
<img src="https://d33wubrfki0l68.cloudfront.net/7331aff857c4ef5869ac47dbc80fb4c8b9e1b883/96c2e/vocab_stringstore-1d1c9ccd7a1cf4d168bfe4ca791e6eed.svg" width="400" height="400">    

In [1]:
# Import 
from spacy.lang.en import English
nlp=English()
doc=nlp("That cat is big.")

In [2]:
# Look up the hash for the word "cat"
cat_hash = nlp.vocab.strings["cat"]
print(cat_hash)

# Look up the cat_hash to get the string
cat_string = nlp.vocab.strings[5439657043933447811]
print(cat_string)

5439657043933447811
cat


In [3]:
# Look up the hash for the string label "PERSON"
person_hash = nlp.vocab.strings["PERSON"]
print(person_hash)

# Look up the person_hash to get the string
person_string = nlp.vocab.strings[380]
print(person_string)

380
PERSON


# <center> Data Structures: Doc, Span and Token

#### The Doc object
- Created automatically when a text is processed with the nlp object, but the class can also be instantiated manually.

In [4]:
##AUTO
# Create an nlp object
from spacy.lang.en import English
nlp = English()

## MANUALLY
# Import the Doc class
from spacy.tokens import Doc
# The words and spaces to create the doc from
words = ['Hello','world','!']
spaces = [True, False, False]
# Create a doc manually from three arguments: the shared vocab, the words and the spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc)

Hello world!


#### The Span object
- A slice of a Doc consisting of one or more tokens.
- Takes at least three arguments: the doc, plus the start and end token indices of the span (the end index is exclusive).
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQtB4WQhIJU6AvtF6LjoFrH6VRM9Se2lOmGvX4DcXb7kbX1ENSn&usqp=CAU" width="400" height="400">    

Creating a span manually:

In [5]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a span manually from previous doc
span = Span(doc, 0, 2)
# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")
print(doc)
# doc.ents is writable, so entities can be added manually by overwriting it with a list of spans
doc.ents = [span_with_label]
print(doc.ents)

Hello world!
(Hello world,)


#### Best practices
- Doc and Span are very powerful and hold references and relationships of words and sentences
    - Convert results to strings as late as possible
    - Use token attributes if available – for example, token.i for the token index
- Don't forget to pass in the shared vocab


Some examples of creating a Doc:

In [6]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "spaCy is cool!"
words = ['spaCy', 'is', 'cool', '!']
spaces = [True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

spaCy is cool!


In [7]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Go, get started!"
words = ['Go', ',', 'get', 'started', '!']
spaces = [False, True, True, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Go, get started!


In [8]:
# Import the Doc class
from spacy.tokens import Doc

# Desired text: "Oh, really?!"
words = ['Oh', ',', 'really', '?','!']
spaces = [False, True, False, False, False]

# Create a Doc from the words and spaces
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)

Oh, really?!
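
The words/spaces convention used in the examples above can be sketched in plain Python: each word is followed by a single space exactly when its flag is True. `join_words` is a hypothetical helper written for illustration, not part of spaCy.

```python
def join_words(words, spaces):
    """Rebuild the text: each word is followed by a space if its flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )


print(join_words(["Hello", "world", "!"], [True, False, False]))
print(join_words(["Oh", ",", "really", "?", "!"], [False, True, False, False, False]))
```

Working out the flags this way before building a Doc is a quick check that the tokens will reproduce the desired text.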


#### Docs, spans and entities from scratch
Create the Doc and Span objects manually and update the named entities, just like spaCy does behind the scenes.

In [9]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# Create a doc from the words and spaces
doc = Doc(nlp.vocab, words=['I', 'like', 'David', 'Bowie'], spaces=[True, True, True, False])

# Create a span for "David Bowie" from the doc and assign it the label "PERSON"
span = Span(doc, 2, 4, label='PERSON')

# Add the span to the doc's entities
doc.ents = [span]

# Print entities' text and labels
print([(ent.text, ent.label_) for ent in doc.ents])

[('David Bowie', 'PERSON')]


#### Data structures best practices 

In [10]:
# Import spaCy and load the 'en_core_web_sm' model
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("Berlin is a nice city")

for token in doc:
    # Print each token's text and part-of-speech tag
    print(token.text, token.pos_)

Berlin PROPN
is AUX
a DET
nice ADJ
city NOUN


In [11]:
# For this example, treat 'is' as a verb: overwrite its predicted tag with "VERB"
doc[1].pos_ = 'VERB'
doc[1].pos_

'VERB'

In [12]:
for token in doc:
    # Check if the current token is a proper noun
    if token.pos_ == 'PROPN':
        # Check that a next token exists and that it is a verb
        if token.i + 1 < len(doc) and doc[token.i + 1].pos_ == 'VERB':
            print('Found a verb after a proper noun!')

Found a verb after a proper noun!


# <center> Word vectors and semantic similarity

#### Comparing semantic similarity
- spaCy can compare two objects and predict similarity
- Methods: Doc.similarity(), Span.similarity() and Token.similarity()
- Take another object and return a similarity score (typically between 0 and 1)
- Important: needs a model that has word vectors included, for example:
    - YES: en_core_web_md (medium model)
    - YES: en_core_web_lg (large model)
    - NO: en_core_web_sm (small model)
    
#### How does spaCy predict similarity?
- Similarity is determined using word vectors
    - Multi-dimensional meaning representations of words
- Generated using an algorithm like Word2Vec and lots of text
- Vectors can be added to spaCy's statistical models
- Default: cosine similarity, but can be adjusted
- Doc and Span vectors default to the average of their token vectors
- Short phrases are better than long documents with many irrelevant words
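
The two defaults listed above, averaging token vectors and comparing with cosine similarity, can be sketched in plain Python. The three-dimensional `toy_vectors` are invented for illustration (real word vectors have hundreds of dimensions), and `cosine` and `average` are hypothetical helpers, not spaCy functions.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Made-up vectors, for illustration only
toy_vectors = {
    "I": [0.1, 0.9, 0.0],
    "like": [0.8, 0.2, 0.1],
    "hate": [0.7, 0.1, 0.6],
    "cats": [0.2, 0.3, 0.9],
}

doc1 = average([toy_vectors[w] for w in ["I", "like", "cats"]])
doc2 = average([toy_vectors[w] for w in ["I", "hate", "cats"]])
print(round(cosine(doc1, doc2), 3))
```

Because the two phrases share two of their three tokens, their averaged vectors point in nearly the same direction and the score comes out high, even though the phrases express opposite opinions.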
    
    
#### Similarity depends on the application context
- Useful for many applications: recommendation systems, flagging duplicates etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do
    
>doc1 = nlp("I like cats")
    
>doc2 = nlp("I hate cats")
    
>print(doc1.similarity(doc2))
    
>0.950144
    
Examples comparing similarities:

In [13]:
# Load the en_core_web_md model
nlp = spacy.load('en_core_web_md')

# Process a text
doc = nlp("Two bananas in pyjamas")

# Get the vector for the token "bananas"
bananas_vector = doc[1].vector
print(bananas_vector[:20])

[-0.22009  -0.030322 -0.079859 -0.46279  -0.386     0.36962  -0.77178
 -0.11529   0.033601  0.56573  -0.24001   0.41833   0.15049   0.35621
 -0.21508  -0.42743   0.0814    0.33916   0.21637   0.14792 ]


Using spaCy's similarity methods to compare Doc, Token and Span objects and get similarity scores.

In [14]:
doc1 = nlp("It's a warm summer day")
doc2 = nlp("It's sunny outside")

# Get the similarity of doc1 and doc2
similarity = doc1.similarity(doc2)
print(similarity)

0.8789265574516525


In [15]:
doc = nlp("TV and books")
token1, token2 = doc[0], doc[2]

# Get the similarity of the tokens "TV" and "books" 
similarity = token1.similarity(token2)
print(similarity)

0.22325331


In [16]:
doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]


# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.7517392


# <center> Combining models and rules

Combining statistical models with rule-based systems is one of the most powerful tricks in an NLP toolbox.
    
#### Statistical predictions vs. rules

| - | Statistical models | Rule-based systems |
| --- | --- | --- |
|Use cases|application needs to generalize based on examples|dictionary with a finite number of examples|
|Real-world examples|product names, person names, subject/object relationships|countries of the world, cities, drug names, dog breeds|
|spaCy features|entity recognizer, dependency parser, part-of-speech tagger|tokenizer, Matcher, PhraseMatcher|

#### Adding statistical predictions
- span.root.text : the text of the token that decides the category of the phrase (relevant when the span has 2+ tokens)
- span.root.head.text : This is the syntactic 'parent' that governs the phrase
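
A toy dependency tree helps show what "root" and "head" mean here. In the sketch below, the `heads` list is hand-written for illustration (it is not parser output), and `span_root` is a hypothetical helper that uses a simplified rule: the root of a span is the token whose syntactic parent lies outside the span.

```python
words = ["I", "like", "David", "Bowie"]
# Index of each token's syntactic parent; the sentence root points to itself
heads = [1, 1, 3, 1]


def span_root(heads, start, end):
    """Return the index of the first token in [start, end) whose head is outside the span."""
    for i in range(start, end):
        if heads[i] == i or not (start <= heads[i] < end):
            return i
    raise ValueError("span is empty or the head indices contain a cycle")


root = span_root(heads, 2, 4)  # the span "David Bowie"
print(words[root])             # root of the span
print(words[heads[root]])      # the parent that governs the span
```

For the span "David Bowie", "David" attaches to "Bowie" inside the span, while "Bowie" attaches to "like" outside it, so "Bowie" is the root and "like" is its head.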
    
#### Efficient phrase matching 
- PhraseMatcher: like regular expressions or keyword search, but with access to the tokens!
- Takes Doc objects as patterns
- More efficient and faster than the Matcher
- Great for matching large word lists  
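
One way to see why phrase matching scales to large word lists is the plain-Python sketch below. It is an assumption-laden toy, not how PhraseMatcher is implemented: `find_phrases` is a hypothetical helper, tokens come from a whitespace split, and the phrases are stored as a set of token tuples so that each candidate slice is checked with a single set lookup.

```python
def find_phrases(tokens, phrases):
    """Return (start, end) token offsets of every phrase occurrence."""
    patterns = {tuple(phrase.split()) for phrase in phrases}
    lengths = sorted({len(pattern) for pattern in patterns})
    matches = []
    for start in range(len(tokens)):
        for length in lengths:
            end = start + length
            # One set lookup per candidate slice, however many phrases there are
            if end <= len(tokens) and tuple(tokens[start:end]) in patterns:
                matches.append((start, end))
    return matches


tokens = "Czech Republic may help Slovakia protect its airspace".split()
matches = find_phrases(tokens, ["Czech Republic", "Slovakia", "South Africa"])
print([" ".join(tokens[start:end]) for start, end in matches])
```

The work per token depends on the number of distinct phrase lengths rather than on the number of phrases, and the matches come back as token offsets that can be turned into spans.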

#### Using matcher:

In [17]:
docu='Twitch Prime, the perks program for Amazon Prime members offering free loot, games and other benefits, is ditching one of its best features: ad-free viewing. According to an email sent out to Amazon Prime members today, ad-free viewing will no longer be included as a part of Twitch Prime for new members, beginning on September 14. However, members with existing annual subscriptions will be able to continue to enjoy ad-free viewing until their subscription comes up for renewal. Those with monthly subscriptions will have access to ad-free viewing until October 15.'

In [18]:
from spacy.matcher import Matcher
# Process the text
doc = nlp(docu)

# Create the match patterns
pattern1 = [{'LOWER': 'amazon'}, {'IS_TITLE': True, 'POS': 'PROPN'}]
pattern2 = [{'LOWER': 'ad'},{'LOWER': '-'},{'LOWER': 'free'}, {'POS': 'NOUN'}]

# Initialize the Matcher and add the patterns
# (spaCy v2 signature; spaCy v3 expects matcher.add('PATTERN1', [pattern1]))
matcher = Matcher(nlp.vocab)
matcher.add('PATTERN1', None, pattern1)
matcher.add('PATTERN2', None, pattern2)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Print pattern string name and text of matched span
    print(doc.vocab.strings[match_id], doc[start:end].text)

PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN1 Amazon Prime
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing
PATTERN2 ad-free viewing


#### Phrase matching

In [19]:
doc=nlp('Czech Republic may help Slovakia protect its airspace')
COUNTRIES=['Afghanistan', 'Åland Islands', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia (Plurinational State of)', 'Bonaire, Sint Eustatius and Saba', 'Bosnia and Herzegovina', 'Botswana', 'Bouvet Island', 'Brazil', 'British Indian Ocean Territory', 'United States Minor Outlying Islands', 'Virgin Islands (British)', 'Virgin Islands (U.S.)', 'Brunei Darussalam', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cabo Verde', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Christmas Island', 'Cocos (Keeling) Islands', 'Colombia', 'Comoros', 'Congo', 'Congo (Democratic Republic of the)', 'Cook Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands (Malvinas)', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'French Southern Territories', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Holy See', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', "Côte d'Ivoire", 'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kuwait', 'Kyrgyzstan', "Lao People's Democratic Republic", 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao', 'Macedonia (the former Yugoslav Republic of)', 'Madagascar', 'Malawi', 
'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Martinique', 'Mauritania', 'Mauritius', 'Mayotte', 'Mexico', 'Micronesia (Federated States of)', 'Moldova (Republic of)', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', "Korea (Democratic People's Republic of)", 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Palestine, State of', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Kosovo', 'Réunion', 'Romania', 'Russian Federation', 'Rwanda', 'Saint Barthélemy', 'Saint Helena, Ascension and Tristan da Cunha', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin (French part)', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone', 'Singapore', 'Sint Maarten (Dutch part)', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South Georgia and the South Sandwich Islands', 'Korea (Republic of)', 'South Sudan', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Svalbard and Jan Mayen', 'Swaziland', 'Sweden', 'Switzerland', 'Syrian Arab Republic', 'Taiwan', 'Tajikistan', 'Tanzania, United Republic of', 'Thailand', 'Timor-Leste', 'Togo', 'Tokelau', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom of Great Britain and Northern Ireland', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela (Bolivarian Republic of)', 'Viet Nam', 'Wallis and Futuna', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe']

In [20]:
# Import the PhraseMatcher and initialize it
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

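# Create pattern Doc objects and add them to the matcher
# This is the faster version of: [nlp(country) for country in COUNTRIES]
# (the add() call below uses the spaCy v2 signature; spaCy v3 expects matcher.add('COUNTRY', patterns))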
patterns = list(nlp.pipe(COUNTRIES))
matcher.add('COUNTRY', None, *patterns)

# Call the matcher on the test document and print the result
matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])

[Czech Republic, Slovakia]


Extracting countries and relationships

In [21]:
# Read a text that was saved to a .txt file
with open('datasets/text.txt','r',encoding='UTF-8') as file:
    text=file.read()
text[:500]

'After the Cold War, the UN saw a radical expansion in its peacekeeping duties, taking on more missions in ten years than it had in the previous four decades.Between 1988 and 2000, the number of adopted Security Council resolutions more than doubled, and the peacekeeping budget increased more than tenfold. The UN negotiated an end to the Salvadoran Civil War, launched a successful peacekeeping mission in Namibia, and oversaw democratic elections in post-apartheid South Africa and post-Khmer Rouge'

In [22]:
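# Create a doc and find matches in it
# Note: a blank English pipeline has no dependency parser, so each token is its own head
# and span.root.head prints a token from the span itself in the output below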
nlp=English()
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "GPE"
    span = Span(doc, start, end, label='GPE')
    #print(span)
    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]

    # Get the span's root head token
    span_root_head = span.root.head 
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, '-->', span.text)

Namibia --> Namibia
South --> South Africa
Cambodia --> Cambodia
Kuwait --> Kuwait
Somalia --> Somalia
Haiti --> Haiti
Mozambique --> Mozambique
Somalia --> Somalia
Rwanda --> Rwanda
Singapore --> Singapore
Sierra --> Sierra Leone
Afghanistan --> Afghanistan
Iraq --> Iraq
Sudan --> Sudan
Congo --> Congo
Haiti --> Haiti


In [23]:
# Print the entities in the document
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == 'GPE'])

#print(text)

[('Namibia', 'GPE'), ('South Africa', 'GPE'), ('Cambodia', 'GPE'), ('Kuwait', 'GPE'), ('Somalia', 'GPE'), ('Haiti', 'GPE'), ('Mozambique', 'GPE'), ('Somalia', 'GPE'), ('Rwanda', 'GPE'), ('Singapore', 'GPE'), ('Sierra Leone', 'GPE'), ('Afghanistan', 'GPE'), ('Iraq', 'GPE'), ('Sudan', 'GPE'), ('Congo', 'GPE'), ('Haiti', 'GPE')]
