<a href="https://colab.research.google.com/github/jianzhiw/spaCy/blob/master/spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [spaCy](https://spacy.io/) #

spaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython. The library is published under the MIT license and currently offers statistical neural network models for English, German, Spanish, Portuguese, French, Italian, Dutch and multi-language NER, as well as tokenization for various other languages.

Future work: Explore [NLTK](https://www.nltk.org/), a similar library for language processing.

In [2]:
# Install spaCy
!pip install spacy

# Download large English model for spaCy
!python -m spacy download en_core_web_lg

# Install textacy which will also be useful
!pip install textacy

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [0]:
import spacy

In [4]:
# Load the large English NLP model
# If you get an error here, go to Runtime -> Restart runtime and rerun this cell
nlp = spacy.load('en_core_web_lg')

# The text we want to examine
text = """London is the capital and most populous city of England and 
the United Kingdom.  Standing on the River Thames in the south east 
of the island of Great Britain, London has been a major settlement 
for two millennia. It was founded by the Romans, who named it Londinium.
"""

# Parse the text with spaCy. This runs the entire pipeline.
doc = nlp(text)

# 'doc' now contains a parsed version of text. We can use it to do anything we want!
# For example, this will print out all the named entities that were detected:
for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")

London (GPE)
England (GPE)
the United Kingdom (GPE)
the River Thames (LOC)
Great Britain (GPE)
London (GPE)
two millennia (DATE)
Romans (NORP)
Londinium (LOC)


View the [annotations](https://spacy.io/api/annotation#named-entities) here.
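
With nine entities the flat list is easy to scan, but on longer texts it can help to group entities by label. The sketch below does this with the standard library only. The `(text, label)` pairs are copied by hand from the output above; with spaCy they could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The name `by_label` is just an illustrative choice.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above; with spaCy they would be
# built as [(ent.text, ent.label_) for ent in doc.ents].
entities = [
    ("London", "GPE"), ("England", "GPE"), ("the United Kingdom", "GPE"),
    ("the River Thames", "LOC"), ("Great Britain", "GPE"), ("London", "GPE"),
    ("two millennia", "DATE"), ("Romans", "NORP"), ("Londinium", "LOC"),
]

# Group entity texts by label, keeping the first occurrence and dropping repeats
by_label = defaultdict(list)
for text, label in entities:
    if text not in by_label[label]:
        by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label}: {', '.join(texts)}")
```

"London" appears twice in the text but only once in the grouped view.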

# Redacting Detected Names #

In [5]:
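# Replace a token with "[REDACTED]" if it is part of a person's name
def replace_name_with_placeholder(token):
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    else:
        # text_with_ws keeps trailing whitespace (token.string was removed in spaCy v3)
        return token.text_with_ws

# Merge each entity into a single token, then redact the tokens that are names
def scrub(text):
    doc = nlp(text)
    # doc.retokenize() replaces ent.merge(), which was removed in spaCy v3
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_name_with_placeholder, doc)
    return "".join(tokens)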

s = """
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". In 1957, Noam Chomsky’s 
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
"""

print(scrub(s))


In 1950, [REDACTED] published his famous article "Computing Machinery and Intelligence". In 1957, [REDACTED] 
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
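
A different way to get a similar result, sketched here as an alternative rather than as what the cell above does, is to skip token merging and cut the text by character offsets. The helper `redact_spans` is a hypothetical name, and the offsets in `spans` are written by hand. With spaCy they could come from `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.

```python
def redact_spans(text, spans, label="PERSON", placeholder="[REDACTED]"):
    """Replace every (start, end, label) span whose label matches with a placeholder."""
    pieces = []
    cursor = 0
    for start, end, span_label in sorted(spans):
        if span_label != label:
            continue
        pieces.append(text[cursor:start])  # text before the span, unchanged
        pieces.append(placeholder)
        cursor = end                       # skip over the redacted span
    pieces.append(text[cursor:])           # remainder after the last span
    return "".join(pieces)

sample = "In 1950, Alan Turing published his famous article."
# Hand-written offsets for illustration: "1950" is [3, 7), "Alan Turing" is [9, 20)
spans = [(3, 7, "DATE"), (9, 20, "PERSON")]
print(redact_spans(sample, spans))
```

Because only the characters inside each span are replaced, the original whitespace around a name is left as it was.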



# Extracting Facts #

What you can do with spaCy right out of the box is pretty amazing. But you can also use the parsed output from spaCy as the input to more complex data extraction algorithms. There’s a Python library called textacy that implements several common data extraction algorithms on top of spaCy. It’s a great starting point.

One of the algorithms it implements is called [Semi-structured Statement Extraction](https://chartbeat-labs.github.io/textacy/api_reference/information_extraction.html?highlight=semistructured#textacy.extract.semistructured_statements). We can use it to search the parse tree for simple statements where the subject is “London” and the verb is a form of “be”. That should help us find facts about London.

In [6]:
import textacy.extract

# The text we want to examine
text = """London is the capital and most populous city of England and  the United Kingdom.  
Standing on the River Thames in the south east of the island of Great Britain, 
London has been a major settlement  for two millennia.  It was founded by the Romans, 
who named it Londinium.
"""

# Parse the document with spaCy
doc = nlp(text)

# Extract semi-structured statements whose subject is "London" and whose verb is a form of "be"
statements = textacy.extract.semistructured_statements(doc, entity="London", cue="be")

# Print the results
print("Here are the things I know about London:")

for statement in statements:
    subject, verb, fact = statement
    print(f" - {fact}")

Here are the things I know about London:
 - the capital and most populous city of England and  the United Kingdom.  

 - a major settlement  for two millennia.  
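
To make the idea of "subject is London, verb is a form of be" more concrete, here is a deliberately simplified sketch over hand-annotated toy parses. It is not textacy's implementation and it does not call spaCy: `Tok` and `be_statements` are hypothetical names, and the dependency labels are typed in by hand to resemble what a parser might produce.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    text: str   # surface form
    lemma: str  # base form, e.g. "be" for "is"
    dep: str    # dependency label, e.g. "nsubj"
    head: int   # index of this token's syntactic head

def be_statements(tokens, entity):
    """Yield (subject, verb, fact) triples where `entity` is the subject of a form of "be"."""
    for tok in tokens:
        if tok.dep != "nsubj" or tok.text.lower() != entity.lower():
            continue
        verb = tokens[tok.head]
        if verb.lemma != "be":
            continue
        # Simplification: everything after the verb, minus punctuation, is the fact.
        fact = " ".join(t.text for t in tokens[tok.head + 1:] if t.dep != "punct")
        yield tok.text, verb.text, fact

# Hand-annotated toy parses (not spaCy output).
sentence_1 = [
    Tok("London", "London", "nsubj", 1),
    Tok("is", "be", "ROOT", 1),
    Tok("the", "the", "det", 3),
    Tok("capital", "capital", "attr", 1),
    Tok("of", "of", "prep", 3),
    Tok("England", "England", "pobj", 4),
    Tok(".", ".", "punct", 1),
]
sentence_2 = [
    Tok("It", "it", "nsubjpass", 2),
    Tok("was", "be", "auxpass", 2),
    Tok("founded", "found", "ROOT", 2),
    Tok("by", "by", "agent", 2),
    Tok("the", "the", "det", 5),
    Tok("Romans", "Romans", "pobj", 3),
    Tok(".", ".", "punct", 2),
]

for sentence in (sentence_1, sentence_2):
    for subject, verb, fact in be_statements(sentence, "London"):
        print(f"{subject} | {verb} | {fact}")
```

Only the first sentence produces a statement. The second is skipped because its subject is "It", not "London", and its main verb is "founded".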


For extra credit, try installing the [neuralcoref](https://github.com/huggingface/neuralcoref) library and adding coreference resolution to your pipeline (I had some problems getting it to work in Colab). That will get you a few more facts, since it will catch sentences that talk about “it” instead of mentioning “London” directly.



[Source](https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e)