In [1]:
import spacy

## creating a doc container

In [2]:
# load the English language model
nlp = spacy.load("en_core_web_sm")

In [3]:
with open('data/wiki_us.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [5]:
# create the Doc object by running the pipeline on the text
doc = nlp(text)

In [6]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [7]:
print(len(doc))
print(len(text))

652
3525


In [8]:
# show the first 10 characters of the raw text
for token in text[0:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


In [9]:
# show the first 10 tokens of the doc (tokens are words and punctuation marks)
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [10]:
# str.split() does not separate punctuation, such as parentheses, from the words
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known
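
The difference between the two tokenizations above can be reproduced without a trained model. This is a minimal sketch, assuming that `spacy.blank("en")` (tokenizer only) is enough for the comparison; the `sample` string is just the start of the text used above.

```python
import spacy

# Blank English pipeline: rule-based tokenizer only, no trained components.
nlp = spacy.blank("en")
sample = "The United States of America (U.S.A. or USA), commonly known"

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# The spaCy tokenizer splits off parentheses and the comma as separate tokens.
spacy_tokens = [token.text for token in nlp(sample)]

print(whitespace_tokens)
print(spacy_tokens)
```

This is also why `len(doc)` counts tokens while `len(text)` counts characters.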


## sentence boundary detection

In [11]:
# sentences
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [12]:
# doc.sents is a generator: convert it to a list and take the first sentence
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


## token attributes

In [13]:
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [14]:
token2 = sentence1[2]
print(token2)

States


In [15]:
# the token's text: use this when you need the plain string rather than the Token object
token2.text

'States'

In [16]:
# leftmost token of this token's syntactic subtree (here, the first token of the sentence)
token2.left_edge

The

In [17]:
token2.right_edge

,

In [18]:
token2.ent_type # entity type as an integer ID

384

In [19]:
token2.ent_type_ # entity type as a string label

'GPE'

In [20]:
token2.ent_iob_ # IOB tag: beginning (B), inside (I), or outside (O) of an entity

'I'
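
To see the three entity attributes side by side, here is a small sketch that sets an entity by hand on a blank pipeline instead of relying on the trained NER. The sentence and the hand-made `GPE` span are assumptions for illustration only.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; the entity is annotated manually below
doc = nlp("The United States is large")

# Mark tokens 0..2 ("The United States") as a single GPE entity.
doc.ents = [Span(doc, 0, 3, label="GPE")]

# First token of the entity gets "B", the following ones get "I".
iob_tags = [(token.text, token.ent_iob_, token.ent_type_) for token in doc]
for row in iob_tags:
    print(row)
```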

In [21]:
token2.lemma_

'States'

In [22]:
sentence1[12].lemma_ # lemma: the base (uninflected) form of the word

'know'

In [23]:
print(sentence1[12])

known


In [24]:
token2.morph

Number=Sing

In [25]:
sentence1[12].morph # morphological information

Aspect=Perf|Tense=Past|VerbForm=Part

In [26]:
token2.pos_ # part of speech

'PROPN'

In [27]:
token2.dep_ # syntactic dependency

'nsubj'

In [28]:
token2.lang_ # language

'en'

In [29]:
text = 'mike enjoys playing football.'
doc2 = nlp(text)
print(doc2)

mike enjoys playing football.


In [30]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
. PUNCT punct


In [31]:
from spacy import displacy

displacy.render(doc2, style='dep')

In [32]:
for ent in doc.ents:
    print(ent.text, ent.label_) # entity text and entity label

The United States of America GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
The Spanish–American War and World War I EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVENT
the Sov

In [33]:
displacy.render(doc, style='ent')

## word vectors and spacy

In [34]:
import spacy

In [35]:
nlp = spacy.load("en_core_web_md")

In [36]:
with open('data/wiki_us.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [37]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]

In [38]:
import numpy as np
your_word = 'country'

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)

words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words) # the most similar words


['anti-poverty', 'SLUMS', 'inner-city', 'Socioeconomic', 'INTERSECT', 'Divides', 'handicaps', 'dropout', 'drop-out', 'Crime-Ridden']


In [39]:
doc1 = nlp('I like fast food.')
doc2 = nlp('I like pizza.')

In [40]:
print(doc1, '<->', doc2, doc1.similarity(doc2)) # similarity between the two texts

I like fast food. <-> I like pizza. 0.8959503928431998


In [41]:
doc3 = nlp('the country is beautiful')
print(doc1, '<->', doc3, doc1.similarity(doc3))

I like fast food. <-> the country is beautiful 0.5887682831850998


In [42]:
doc4 = nlp('i enjoy oranges')
doc5 = nlp('i enjoy apples')
print(doc4, '<->', doc5, doc4.similarity(doc5))

i enjoy oranges <-> i enjoy apples 0.9219616215250876


In [43]:
doc6 = nlp('i enjoy peanut butter')
print(doc4, '<->', doc6, doc4.similarity(doc6)) # depends on SEMANTIC similarity of the words, not on context

i enjoy oranges <-> i enjoy peanut butter 0.80249044256641
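
By default, `Doc.similarity` is the cosine similarity between the averaged word vectors of the two docs. The sketch below redoes that arithmetic in plain Python. The 3-dimensional `toy_vectors` are invented for illustration and are not taken from `en_core_web_md`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average, mirroring how a doc vector is built from token vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]

# Hypothetical word vectors (made up for this sketch).
toy_vectors = {
    "i": [0.1, 0.0, 0.2],
    "enjoy": [0.3, 0.1, 0.0],
    "oranges": [0.0, 0.9, 0.1],
    "apples": [0.1, 0.8, 0.2],
    "cars": [0.9, 0.0, 0.1],
}

def doc_vector(sentence):
    return mean_vector([toy_vectors[word] for word in sentence.split()])

sim_fruit = cosine_similarity(doc_vector("i enjoy oranges"), doc_vector("i enjoy apples"))
sim_cars = cosine_similarity(doc_vector("i enjoy oranges"), doc_vector("i enjoy cars"))
print(sim_fruit, sim_cars)
```

Because the shared words "i enjoy" contribute to both averages, short sentences that differ in one word still score fairly high, which matches the pattern in the outputs above.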


## spacy's pipelines

In [44]:
nlp = spacy.blank('en')

In [45]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x270e4da7990>

In [46]:
nlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}
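
A quick check that the sentencizer alone is enough to populate `doc.sents`. This is a minimal sketch on a blank pipeline; the example sentence is an arbitrary choice.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitting on punctuation

doc = nlp("Britain is a place. Mary is a doctor.")
sentences = [sent.text for sent in doc.sents]

print(nlp.pipe_names)
print(sentences)
```

The sentencizer only looks at punctuation, so it is much lighter than the parser-based sentence boundaries of a full model, at the cost of being less robust on text with abbreviations.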

In [47]:
nlp2 = spacy.load('en_core_web_sm')

In [48]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

## spacy entityruler

In [50]:
import spacy

In [51]:
nlp = spacy.load('en_core_web_sm')
text = 'West Chestertenfieldville was referenced in Mr. Deeds.'
doc = nlp(text)

In [None]:
# the result in the video was different; here it already worked
for ent in doc.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville LOC
Deeds PERSON


In [53]:
ruler = nlp.add_pipe('entity_ruler')

In [54]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [55]:
patterns = [
    {'label': 'GPE', 'pattern': 'West Chestertenfieldville'}
]

In [56]:
ruler.add_patterns(patterns)

In [58]:
# still unchanged: the ruler runs after ner, which has already labeled the span
doc2 = nlp(text)
for ent in doc2.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville LOC
Deeds PERSON


In [59]:
nlp2 = spacy.load('en_core_web_sm')


In [65]:
# nlp2.remove_pipe("entity_ruler")
ruler = nlp2.add_pipe("entity_ruler", before = "ner")
ruler.add_patterns(patterns)

In [66]:
doc = nlp2(text)

In [67]:
for ent in doc.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON
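
Placing the ruler before `ner` is one option. Another is to leave it at the end and let it overwrite existing entities through the `overwrite_ents` setting of `entity_ruler`. The sketch below simulates this on a blank pipeline: `fake_ner_sketch` is a hypothetical stand-in for the statistical NER and simply labels the first two tokens as `LOC`.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

@Language.component("fake_ner_sketch")  # hypothetical stand-in for a trained NER
def fake_ner_sketch(doc):
    # Pretend the model labeled the first two tokens as a location.
    doc.ents = [Span(doc, 0, 2, label="LOC")]
    return doc

patterns = [{"label": "GPE", "pattern": "West Chestertenfieldville"}]
text = "West Chestertenfieldville was referenced in Mr. Deeds."

def run(overwrite):
    nlp = spacy.blank("en")
    nlp.add_pipe("fake_ner_sketch")
    # The ruler runs last; overwrite_ents decides who wins on overlapping spans.
    ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": overwrite})
    ruler.add_patterns(patterns)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

kept = run(overwrite=False)
replaced = run(overwrite=True)
print(kept)
print(replaced)
```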


In [68]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [70]:
# add a new FILM pattern for 'Mr. Deeds'
nlp3 = spacy.load('en_core_web_sm')
ruler = nlp3.add_pipe('entity_ruler', before='ner')
patterns = [
    {'label': 'GPE', 'pattern': 'West Chestertenfieldville'},
    {'label': 'FILM', 'pattern': 'Mr. Deeds'}
]
ruler.add_patterns(patterns)

In [None]:
doc = nlp3(text)

# now both entities are labeled correctly
for ent in doc.ents:
    print(ent.text, ent.label_)

West Chestertenfieldville GPE
Mr. Deeds FILM


## spacy matcher

In [73]:
import spacy
from spacy.matcher import Matcher

In [74]:
nlp = spacy.load('en_core_web_sm')

In [75]:
matcher = Matcher(nlp.vocab)
pattern = [
    {'LIKE_EMAIL': True}
]
matcher.add('EMAIL_ADDRESS', [pattern])

In [76]:
doc = nlp('this is an email address: wmattingly@aol.com')
matches = matcher(doc)
print(matches)

[(16571425990740197027, 6, 7)]


In [79]:
print(nlp.vocab[matches[0][0]].text)

EMAIL_ADDRESS
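
Each match is a `(match_id, start, end)` tuple: the ID is the hash of the rule name and `start`/`end` are token indices. A small sketch that unpacks the tuples into readable pairs, using a blank pipeline since `LIKE_EMAIL` needs no trained model:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

doc = nlp("this is an email address: wmattingly@aol.com")

# Resolve the hash back to the rule name and slice the doc to get the matched span.
results = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(results)
```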


In [80]:
with open('data/wiki_mlk.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [81]:
print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.

King participated in and led marches for blacks' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his famou

In [None]:
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
pattern = [
    {'POS': 'PROPN',
     'OP': '+'},
    {'POS': 'VERB'}
]
matcher.add('PROPER_NOUN', [pattern])
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])

# show the first 10 proper-noun sequences that are immediately followed by a verb
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(451313080118390996, 50, 52) King advanced
(451313080118390996, 90, 92) King participated
(451313080118390996, 114, 116) King led
(451313080118390996, 168, 170) King helped
(451313080118390996, 248, 253) Director J. Edgar Hoover considered
(451313080118390996, 249, 253) J. Edgar Hoover considered
(451313080118390996, 250, 253) Edgar Hoover considered
(451313080118390996, 251, 253) Hoover considered
(451313080118390996, 323, 325) King won
(451313080118390996, 486, 489) United States beginning
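
The overlapping matches for "Director J. Edgar Hoover considered" appear because `OP: '+'` yields one match per possible start token. Passing `greedy='LONGEST'` to `matcher.add` keeps only the longest one. The sketch below builds a `Doc` by hand with assumed POS tags, so it runs without a trained tagger.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built doc: the POS tags are supplied manually (an assumption for this sketch).
words = ["Director", "J.", "Edgar", "Hoover", "considered", "him"]
pos = ["PROPN", "PROPN", "PROPN", "PROPN", "VERB", "PRON"]
doc = Doc(nlp.vocab, words=words, pos=pos)

pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]

# Default behaviour: every valid start position produces its own match.
all_matcher = Matcher(nlp.vocab)
all_matcher.add("PROPER_NOUN", [pattern])
all_spans = sorted((start, end) for _, start, end in all_matcher(doc))

# greedy="LONGEST": overlapping matches are reduced to the longest one.
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
longest_spans = [(start, end) for _, start, end in longest_matcher(doc)]

print(all_spans)
print(longest_spans)
```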


In [92]:
import json
with open('data/alice.json', 'r') as f:
    data = json.load(f)

In [93]:
text = data[0][2][0]
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'


In [94]:
text = text.replace('`', "'")
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [None]:
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
pattern = [
    {'ORTH': "'"},
    {"IS_ALPHA": True, 'OP': '+'},
    {'IS_PUNCT': True, 'OP': '*'}, # may or may not be present
    {'ORTH': "'"}
]
matcher.add('PROPER_NOUNS', [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])

for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(3232560085755078826, 47, 58) 'and what is the use of a book,'
(3232560085755078826, 60, 67) 'without pictures or conversation?'


In [98]:
speak_lemmas = ['think', 'say']
matcher = Matcher(nlp.vocab)
pattern = [
    {'ORTH': "'"},
    {"IS_ALPHA": True, 'OP': '+'},
    {'IS_PUNCT': True, 'OP': '*'}, # may or may not be present
    {'ORTH': "'"},
    {'POS': 'VERB', 'LEMMA': {'IN': speak_lemmas}},
    {'POS': 'PROPN', 'OP': '+'}, # may span several tokens, hence OP '+'
    {'ORTH': "'"},
    {"IS_ALPHA": True, 'OP': '+'},
    {'IS_PUNCT': True, 'OP': '*'}, # may or may not be present
    {'ORTH': "'"}
]

matcher.add('PROPER_NOUNS', [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])

for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [106]:
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0


In [None]:
speak_lemmas = ["think", "say"]
text = data[0][2][0].replace( "`", "'")
matcher = Matcher(nlp.vocab)
pattern1 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
pattern2 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}]
pattern3 = [{"POS": "PROPN", "OP": "+"},{"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy='LONGEST')

for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0


## custom components in spacy

In [108]:
nlp = spacy.load('en_core_web_sm')
doc = nlp('Britain is a place. Mary is a doctor.')

In [109]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Britain GPE
Mary PERSON


In [110]:
# goal: remove all GPE entities (this requires registering a custom component)
from spacy.language import Language

In [None]:
# remove all GPE entities
@Language.component('remove_gpe')
def remove_gpe(doc):
    # build a new list instead of removing items while iterating,
    # which would skip the entity right after each removed one
    doc.ents = [ent for ent in doc.ents if ent.label_ != 'GPE']

    return doc

In [113]:
nlp.add_pipe('remove_gpe')

<function __main__.remove_gpe(doc)>

In [114]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'remove_gpe': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [None]:
doc = nlp('Britain is a place. Mary is a doctor.')
for ent in doc.ents:
    print(ent.text, ent.label_)

Mary PERSON
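
A note on filtering entities inside a component: calling `list.remove` while looping over the same list can skip items, which goes unnoticed with a single `GPE` but fails with two in a row. A plain-Python sketch, using `(text, label)` tuples as hypothetical stand-ins for entity spans:

```python
# Stand-ins for doc.ents: two consecutive GPE entities followed by a PERSON.
ents = [("Britain", "GPE"), ("France", "GPE"), ("Mary", "PERSON")]

# Removing while iterating: after the first removal the list shifts left,
# so the loop never visits ("France", "GPE").
buggy = list(ents)
for ent in buggy:
    if ent[1] == "GPE":
        buggy.remove(ent)

# Building a new list visits every entity exactly once.
safe = [ent for ent in ents if ent[1] != "GPE"]

print(buggy)
print(safe)
```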


In [None]:
nlp.to_disk('data/new_en_core_web_sm')
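
A saved pipeline can be loaded again with `spacy.load` pointed at the directory. As far as I understand, the saved config only records the registered name of a custom component, so the decorated function has to be defined (or imported) before loading. A minimal round-trip sketch with a temporary directory and a hypothetical do-nothing component `noop_sketch`:

```python
import tempfile

import spacy
from spacy.language import Language

@Language.component("noop_sketch")  # hypothetical component that changes nothing
def noop_sketch(doc):
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("noop_sketch")

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp.to_disk(tmp_dir)
    # Works because "noop_sketch" is already registered in this process.
    reloaded = spacy.load(tmp_dir)

reloaded_pipes = reloaded.pipe_names
print(reloaded_pipes)
```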

## regex in spacy

In [117]:
import re

In [118]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Pau is quite common."

In [119]:
pattern = r"Paul [A-Z]\w+"

In [120]:
matches = re.finditer(pattern, text)
for match in matches:
    print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


In [121]:
from spacy.tokens import Span

In [None]:
nlp = spacy.blank('en')
doc = nlp(text)
original_ents = list(doc.ents)
mwt_ents = []

for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
        mwt_ents.append((span.start, span.end, span.text))

# iterate over the matches and create a Span object for each one
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label='PERSON') # create a Span object
    original_ents.append(per_ent)
doc.ents = original_ents

for ent in doc.ents:
    print(ent.text, ent.label_)

Paul Newman PERSON
Paul Hollywood PERSON


In [127]:
print(mwt_ents)

[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]
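
The `if span is not None` check matters because `doc.char_span` returns `None` when the character offsets do not line up with token boundaries. A small sketch of that behaviour, including the `alignment_mode="expand"` option that snaps to the enclosing tokens; the offsets `(0, 8)` are chosen on purpose to cut "Newman" in half.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Paul Newman was an American actor.")

# Offsets 0..11 cover exactly the tokens "Paul" and "Newman".
aligned = doc.char_span(0, 11, label="PERSON")

# Offsets 0..8 end in the middle of "Newman": no valid span by default.
misaligned = doc.char_span(0, 8)

# "expand" grows the span to include every token it touches.
expanded = doc.char_span(0, 8, alignment_mode="expand")

print(aligned.text, aligned.start, aligned.end, aligned.label_)
print(misaligned)
print(expanded.text)
```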


In [132]:
@Language.component('paul_ner')
def paul_ner(doc):
    pattern = r"Paul [A-Z]\w+"
    original_ents = list(doc.ents)
    mwt_ents = []

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))

    # iterate over the matches and create a Span object for each one
    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label='PERSON') # create a Span object
        original_ents.append(per_ent)
    doc.ents = original_ents

    return doc

In [133]:
nlp2 = spacy.blank('en')
nlp2.add_pipe('paul_ner')

<function __main__.paul_ner(doc)>

In [135]:
doc2 = nlp2(text)
print(doc2.ents)

(Paul Newman, Paul Hollywood)


In [143]:
from spacy.util import filter_spans

@Language.component('cinema_ner')
def cinema_ner(doc):
    pattern = r"Hollywood"
    original_ents = list(doc.ents)
    mwt_ents = []

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))

    # iterate over the matches and create a Span object for each one
    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label='CINEMA') # create a Span object
        original_ents.append(per_ent)
    # keep only the longest of any overlapping spans; doc.ents rejects overlaps
    filtered = filter_spans(original_ents)
    doc.ents = filtered

    return doc

In [144]:
nlp3 = spacy.load('en_core_web_sm')
nlp3.add_pipe('cinema_ner')

<function __main__.cinema_ner(doc)>

In [145]:
doc3 = nlp3(text)
for ent in doc3.ents:
    print(ent.text, ent.label_)
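
`doc.ents` does not accept overlapping spans, which is the case here: "Hollywood" as `CINEMA` sits inside "Paul Hollywood" as `PERSON`. `filter_spans` resolves this by preferring the longest span. A minimal sketch with both spans created by hand on a blank pipeline:

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("Paul Hollywood is a British TV host.")

person = Span(doc, 0, 2, label="PERSON")  # "Paul Hollywood"
cinema = Span(doc, 1, 2, label="CINEMA")  # "Hollywood", overlaps the span above

# The longer span wins; the shorter overlapping one is dropped.
filtered = filter_spans([person, cinema])
doc.ents = filtered

result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```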

## financial spacy

In [146]:
import pandas as pd

In [148]:
df = pd.read_csv('data/stocks.tsv', sep='\t')
df

Unnamed: 0,Symbol,CompanyName,Industry,MarketCap
0,A,Agilent Technologies,Life Sciences Tools & Services,53.65B
1,AA,Alcoa,Metals & Mining,9.25B
2,AAC,Ares Acquisition,Shell Companies,1.22B
3,AACG,ATA Creativity Global,Diversified Consumer Services,90.35M
4,AADI,Aadi Bioscience,Pharmaceuticals,104.85M
...,...,...,...,...
5874,ZWRK,Z-Work Acquisition,Shell Companies,278.88M
5875,ZY,Zymergen,Chemicals,1.31B
5876,ZYME,Zymeworks,Biotechnology,1.50B
5877,ZYNE,Zynerba Pharmaceuticals,Pharmaceuticals,184.39M


In [149]:
symbols = df.Symbol.tolist()

In [150]:
companies = df.CompanyName.tolist()

In [151]:
print(symbols[0])
print(companies[0])

A
Agilent Technologies


In [153]:
df2 = pd.read_csv("data/indexes.tsv", sep="\t")
df2

Unnamed: 0,IndexName,IndexSymbol
0,Dow Jones Industrial Average,DJIA
1,Dow Jones Transportation Average,DJT
2,Dow Jones Utility Average Index,DJU
3,NASDAQ 100 Index (NASDAQ Calculation),NDX
4,NASDAQ Composite Index,COMP
5,NYSE Composite Index,NYA
6,S&P 500 Index,SPX
7,S&P 400 Mid Cap Index,MID
8,S&P 100 Index,OEX
9,NASDAQ Computer Index,IXCO


In [154]:
indexes = df2.IndexName.tolist()
index_symbols = df2.IndexSymbol.tolist()

In [155]:
df3 = pd.read_csv("data/stock_exchanges.tsv", sep="\t")
df3

Unnamed: 0,BloombergExchangeCode,BloombergCompositeCode,Country,Description,ISOMIC,Google Prefix,EODcode,NumStocks
0,AF,AR,Argentina,Bolsa de Comercio de Buenos Aires,XBUE,,BA,12
1,AO,AU,Australia,National Stock Exchange of Australia,XNEC,,,1
2,AT,AU,Australia,Asx - All Markets,XASX,ASX,AU,875
3,AV,,Austria,Wiener Boerse Ag,XWBO,VIE,VI,38
4,BI,,Bahrain,Bahrain Bourse,XBAH,,,4
...,...,...,...,...,...,...,...,...
97,UR,US,USA,NASDAQ Capital Market,XNCM,NASDAQ,US,2209
98,UV,US,USA,OTC markets,OOTC,OTCMKTS,US,2433
99,UW,US,USA,NASDAQ Global Select,XNGS,NASDAQ,US,1768
100,VH,VN,Vietnam,Hanoi Stock Exchange,HSTC,,,4


In [156]:
exchanges = df3.ISOMIC.tolist()+df3["Google Prefix"].tolist()
descriptions = df3.Description.tolist()

In [None]:
stops = ["two"]
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = []
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

for symbol in symbols:
    patterns.append({"label": "STOCK", "pattern": symbol})
    for l in letters:
        patterns.append({"label": "STOCK", "pattern": symbol+f".{l}"})

for company in companies:
    if company not in stops:
        patterns.append({"label": "COMPANY", "pattern": company})
        words = company.split()
        if len(words) > 1:
            new = " ".join(words[:2])
            patterns.append({"label": "COMPANY", "pattern": new})

for index in indexes:
    patterns.append({"label": "INDEX", "pattern": index})
    versions = []
    words = index.split()
    caps = []
    for word in words:
        word = word.lower().capitalize()
        caps.append(word)
    versions.append(" ".join(caps))
    versions.append(words[0])
    versions.append(caps[0])
    versions.append(" ".join(caps[:2]))
    versions.append(" ".join(words[:2]))
    for version in versions:
        if version != "NYSE":
            patterns.append({"label": "INDEX", "pattern": version})

for symbol in index_symbols:
    patterns.append({"label": "INDEX", "pattern": symbol})

for d in descriptions:
    patterns.append({"label": "STOCK_EXCHANGE", "pattern": d})
for e in exchanges:
    patterns.append({"label": "STOCK_EXCHANGE", "pattern": e})

ruler.add_patterns(patterns)

print (len(patterns))

169694
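
Since every ticker symbol becomes a pattern, short symbols that are also everyday tokens can produce false positives. One possible mitigation, sketched below, is a stop list for symbols, analogous to the `stops` list used for company names. The `symbol_stops` set, the tiny `symbols`/`companies` lists, the single `.O` suffix, and the sample sentence are all assumptions for illustration.

```python
import spacy

symbols = ["AAPL", "ET", "TV"]   # tiny stand-in for the full symbol list
companies = ["Apple"]
symbol_stops = {"ET", "TV"}      # hypothetical stop list of ambiguous tickers

def build_patterns(symbols, companies, symbol_stops=frozenset()):
    patterns = []
    for symbol in symbols:
        if symbol in symbol_stops:
            continue  # skip tickers that collide with ordinary words
        patterns.append({"label": "STOCK", "pattern": symbol})
        # Simplification: only the ".O" suffix instead of all 26 letters.
        patterns.append({"label": "STOCK", "pattern": f"{symbol}.O"})
    for company in companies:
        patterns.append({"label": "COMPANY", "pattern": company})
    return patterns

def extract(text, patterns):
    nlp = spacy.blank("en")
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

sample = "At 1 p.m. ET, Apple Inc (AAPL.O) fell while Apple TV sales rose."
unfiltered = extract(sample, build_patterns(symbols, companies))
filtered = extract(sample, build_patterns(symbols, companies, symbol_stops))

print(unfiltered)
print(filtered)
```

The trade-off is recall: a stop-listed ticker is never found, even when the text really does mean the stock.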


In [158]:
text = '''
Sept 10 (Reuters) - Wall Street's main indexes were subdued on Friday as signs of higher inflation and a drop in Apple shares following an unfavorable court ruling offset expectations of an easing in U.S.-China tensions.

Data earlier in the day showed U.S. producer prices rose solidly in August, leading to the biggest annual gain in nearly 11 years and indicating that high inflation was likely to persist as the pandemic pressures supply chains. read more .

"Today's data on wholesale prices should be eye-opening for the Federal Reserve, as inflation pressures still don't appear to be easing and will likely continue to be felt by the consumer in the coming months," said Charlie Ripley, senior investment strategist for Allianz Investment Management.

Apple Inc (AAPL.O) fell 2.7% following a U.S. court ruling in "Fortnite" creator Epic Games' antitrust lawsuit that stroke down some of the iPhone maker's restrictions on how developers can collect payments in apps.


Sponsored by Advertising Partner
Sponsored Video
Watch to learn more
Report ad
Apple shares were set for their worst single-day fall since May this year, weighing on the Nasdaq (.IXIC) and the S&P 500 technology sub-index (.SPLRCT), which fell 0.1%.

Sentiment also took a hit from Cleveland Federal Reserve Bank President Loretta Mester's comments that she would still like the central bank to begin tapering asset purchases this year despite the weak August jobs report. read more

Investors have paid keen attention to the labor market and data hinting towards higher inflation recently for hints on a timeline for the Federal Reserve to begin tapering its massive bond-buying program.

The S&P 500 has risen around 19% so far this year on support from dovish central bank policies and re-opening optimism, but concerns over rising coronavirus infections and accelerating inflation have lately stalled its advance.


Report ad
The three main U.S. indexes got some support on Friday from news of a phone call between U.S. President Joe Biden and Chinese leader Xi Jinping that was taken as a positive sign which could bring a thaw in ties between the world's two most important trading partners.

At 1:01 p.m. ET, the Dow Jones Industrial Average (.DJI) was up 12.24 points, or 0.04%, at 34,891.62, the S&P 500 (.SPX) was up 2.83 points, or 0.06%, at 4,496.11, and the Nasdaq Composite (.IXIC) was up 12.85 points, or 0.08%, at 15,261.11.

Six of the eleven S&P 500 sub-indexes gained, with energy (.SPNY), materials (.SPLRCM) and consumer discretionary stocks (.SPLRCD) rising the most.

U.S.-listed Chinese e-commerce companies Alibaba and JD.com , music streaming company Tencent Music (TME.N) and electric car maker Nio Inc (NIO.N) all gained between 0.7% and 1.4%


Report ad
Grocer Kroger Co (KR.N) dropped 7.1% after it said global supply chain disruptions, freight costs, discounts and wastage would hit its profit margins.

Advancing issues outnumbered decliners by a 1.12-to-1 ratio on the NYSE and by a 1.02-to-1 ratio on the Nasdaq.

The S&P index recorded 14 new 52-week highs and three new lows, while the Nasdaq recorded 49 new highs and 38 new lows.
'''

In [159]:
doc = nlp(text)

In [160]:
for ent in doc.ents:
    print (ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
AAPL.O STOCK
Apple COMPANY
Nasdaq COMPANY
S&P 500 INDEX
S&P 500 INDEX
ET STOCK
Dow Jones Industrial Average INDEX
S&P 500 INDEX
Nasdaq Composite INDEX
S&P 500 INDEX
JD.com COMPANY
Tencent Music COMPANY
TME.N STOCK
NIO.N STOCK
Kroger COMPANY
KR.N STOCK
NYSE STOCK_EXCHANGE
Nasdaq INDEX
S&P INDEX
Nasdaq INDEX


In [161]:
text2 = '''
Apple Inc. designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, and sells a variety of related services. The Company’s products include iPhone, Mac, iPad, and Wearables, Home and Accessories. iPhone is the Company’s line of smartphones based on its iOS operating system. Mac is the Company’s line of personal computers based on its macOS operating system. iPad is the Company’s line of multi-purpose tablets based on its iPadOS operating system. Wearables, Home and Accessories includes AirPods, Apple TV, Apple Watch, Beats products, HomePod, iPod touch and other Apple-branded and third-party accessories. AirPods are the Company’s wireless headphones that interact with Siri. Apple Watch is the Company’s line of smart watches. Its services include Advertising, AppleCare, Cloud Services, Digital Content and Payment Services. Its customers are primarily in the consumer, small and mid-sized business, education, enterprise and government markets.
'''

In [162]:
doc2 = nlp(text2)

for ent in doc2.ents:
    print (ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
TV STOCK
Apple COMPANY
Apple COMPANY
Apple COMPANY
