References

1. https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_11_01_spacy.ipynb
2. https://derwen.ai/s/d5c7#1
3. [SpaCy Universe](https://spacy.io/universe) (ref: [derwen.ai](https://derwen.ai/s/d5c7#1))
 - [Legal: Blackstone](https://spacy.io/universe/project/blackstone)
 - [Biomedical: Kindred](https://spacy.io/universe/project/kindred)
 - [Geographic: mordecai](https://spacy.io/universe/project/mordecai)
 - [Label: Prodigy](https://spacy.io/universe/project/prodigy)
 - [Edge: spacy-raspberry](https://spacy.io/universe/project/spacy-raspberry)
 - [Voice: Rasa NLU](https://spacy.io/universe/project/rasa) 
  - [Transformers: spacy-transformers](https://explosion.ai/blog/spacy-pytorch-transformers) 
  - [Conference: spaCy IRL 2019](https://irl.spacy.io/2019/)

In [13]:
try:
    %tensorflow_version 2.x
    COLAB = True
    !pip install spacy
    !python -m spacy download en_core_web_lg
    !pip install spacy-transformers
except:
    COLAB = False

print(f'\033[00mUsing Google CoLab = \033[93m{COLAB}')
if (COLAB): print("Dependencies installed")

[00mUsing Google CoLab = [93mFalse


# Spacy: Getting started

As discussed in the lecture portion, Python has two main libraries to help with NLP tasks: 

* [NLTK](https://www.nltk.org/)
* [Spacy](https://spacy.io/)

SpaCy launched in 2015 and has rapidly become an industry standard, and is a focus of our training. SpaCy provides an industrial grade project that is both open-source and contains community driven integrations (see SpaCy Universe).

SpaCy requires you to download language resources (such as models). For the english language, you can use `python -m spacy download en_core_web_sm`. The suffix `_sm` indicates "small" model, while `_md` and `_lg` indicate medium and large, respectively and provide more advanced features (we won't need in this tutorial).


In [19]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Use if needed:
#spacy.util.get_data_path()

### Tokenization

For each word in that sentence _spaCy_ generates a [token](https://spacy.io/api/token) for each word in the sentence. The token fields show the raw text, the root of the word (lemma), the Part of Speech (POS), whether or not its a stop word, and many other things. 

In [40]:
import spacy
text = "this is a beautiful day"
doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

this this DET DT nsubj xxxx True True
is be AUX VBZ ROOT xx True True
a a DET DT det x True True
beautiful beautiful ADJ JJ amod xxxx True False
day day NOUN NN attr xxx True False


### Numeric representation

Let's print the last token and see its _numeric_ representation:

In [41]:
print(f'The token is from the raw text: \033[92m{token.text}\033[0m\nNumeric representation:\n')
print(token.vector)
print(f'\nThe length of the vector is {token.vector.shape}') # 96 length vector

The token is from the raw text: [92mday[0m
Numeric representation:

[ 1.639962   -1.5621606   0.05948496  0.01268986  2.1984892  -1.8145177
 -0.745441   -1.5280969  -2.7714853   6.007323   -2.0809193  -1.961708
  2.1664617   2.3318393  -3.8029075  -2.745814    1.7596581   2.5324426
 -0.5090674   2.0728378   3.501279   -0.88496184  1.6712112  -0.8527437
  0.81122905  3.8929913  -2.6595979  -1.4807723   0.9421574   1.870143
 -0.76680666 -0.9048741   0.51840436 -1.8099762  -3.7449381   1.1266654
 -1.5931005   0.6592519   2.1718125   1.0615923   1.2269886  -2.0375106
 -2.7071342  -0.96021605  2.1439214  -2.8734689   0.4292348  -2.465563
 -1.6698704  -0.94421875 -1.5220733  -0.22063437 -0.77889663 -2.4767165
  1.944675    2.2797525   0.55317724 -2.6973386  -0.9994705  -1.3853178
 -0.9034357  -2.038024    0.46580553 -1.2795513   1.4021541   3.738821
  3.2633476  -1.2171834   2.8708591   4.098246   -2.5814586   0.7266145
  1.4873066  -0.0491671  -0.8378353   2.0663633   2.8921773   0.638961

### Display

Note: Run the following as `display.serve` outside of Jupyter

In [47]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)
displacy.render(doc, style="ent")
# day is shown as a recognized "DATE"

### Exercise:

Explore different parts of speech & sentence structures. 
* Show PERSON 
* Show location

Some examples:
* "They met at a cafe in London last year"
* "Peter went to see his uncle in Brooklyn"
* "The chicken crossed the road because it was hungry"
* "The chicken crossed the road because it was narrow"

## Similarity of two sentences

Let's do the same as above, but mix with two similar sentences

In [80]:
sentence_list = ["this is a beautiful day", "today is bright and sunny"]

In [83]:
#doc_list = list(map(nlp, sentence_list))
doc_list = list(nlp.pipe(sentence_list))

In [84]:
## Python program to understand the usage of tabulate function for printing tables in a tabular format
from tabulate import tabulate
import pandas as pd

column_names = ['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop']
df = pd.DataFrame(columns = column_names)
for doc in doc_list:
    print(f'\n\033[92mPrinting tokens for \033[91m"{doc}"\033[0m')
    for token in doc:
        token_list = [token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                      token.shape_, token.is_alpha, token.is_stop]
        token_series = pd.Series(token_list, index = df.columns)
        df = df.append(token_series, ignore_index=True)
    print(tabulate(df, headers=column_names))


[92mPrinting tokens for [91m"this is a beautiful day"[0m
    text             lemma            pos    tag    dep    shape       is_alpha    is_stop
--  ---------------  ---------------  -----  -----  -----  ----------  ----------  ---------
 0  this             this             DET    DT     nsubj  xxxx        True        True
 1  is               be               AUX    VBZ    ROOT   xx          True        True
 2  a beautiful day  a beautiful day  NOUN   NN     attr   x xxxx xxx  False       False

[92mPrinting tokens for [91m"today is bright and sunny"[0m
    text             lemma            pos    tag    dep    shape       is_alpha    is_stop
--  ---------------  ---------------  -----  -----  -----  ----------  ----------  ---------
 0  this             this             DET    DT     nsubj  xxxx        True        True
 1  is               be               AUX    VBZ    ROOT   xx          True        True
 2  a beautiful day  a beautiful day  NOUN   NN     attr   x xxxx x

### Showing similarity between two sentences

1. "this is a beautiful day"
2. "this day is bright and sunny"

Note: If you have loaded the small (sm) dataset, you will get the following warning:
> UserWarning: [W007] The model you're using has no word vectors loaded, so the result of the Token.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.

Try: 
* `python -m spacy download en_core_web_md`
* or: `python -m spacy download en_core_web_lg`

In [63]:
import warnings

# choose action = 'ignore' to ignore the small dataset warning
warnings.filterwarnings(action = "ignore") # "default"

In [64]:
doc_list[0].similarity(doc_list[1])

0.5232043828276217

In [25]:
nlp_md = spacy.load("en_core_web_md")

In [26]:
# try again
doc_md_list = list(map(nlp_md, sentence_list))
doc_md_list[0].similarity(doc_md_list[1])

0.7740792555658819

## Paragraph

How do you deal with multiple sentences?

In [51]:
text = """When we went out for ice-cream last summer, the place was 
packed. This year, however, things are eerily different. You can see that 
the stores are nearly desserted and roads empty like never before. It's a 
reality that we are all getting used to, albeit very slowly and reluctantly.
"""

doc = nlp(text)

for sent in doc.sents:
    print(">", sent)

> When we went out for ice-cream last summer, the place was 
packed.
> This year, however, things are eerily different.
> You can see that 
the stores are nearly desserted and roads empty like never before.
> It's a 
reality that we are all getting used to, albeit very slowly and reluctantly.



In [78]:
?nlp

### Scattertext

Credit: derwen.ai 

In [76]:
if False: # install if not already run
    !pip install scattertext

In [77]:
import scattertext as st

if "merge_entities" not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe("merge_entities"))

if "merge_noun_chunks" not in nlp.pipe_names:
    nlp.add_pipe(nlp.create_pipe("merge_noun_chunks"))

convention_df = st.SampleCorpora.ConventionData2012.get_data() 
corpus = st.CorpusFromPandas(convention_df,
                             category_col="party",
                             text_col="text",
                             nlp=nlp).build()

Generate interactive visualization once the corpus is ready:

In [67]:
html = st.produce_scattertext_explorer(
    corpus,
    category="democrat",
    category_name="Democratic",
    not_category_name="Republican",
    width_in_pixels=1000,
    metadata=convention_df["speaker"]
)

Render the visualization:

In [69]:
from IPython.display import IFrame
from IPython.core.display import display, HTML
import sys

IN_COLAB = "google.colab" in sys.modules
print(IN_COLAB)

False


**Use in Google Colab**

In [72]:
if IN_COLAB:
    display(HTML("<style>.container { width:98% !important; }</style>"))
    display(HTML(html))

**Use in Jupyter**

In [73]:
file_name = "foo.html"

with open(file_name, "wb") as f:
    f.write(html.encode("utf-8"))

IFrame(src=file_name, width = 1200, height=700)