<a href="https://colab.research.google.com/github/junkyuhufs/Practice/blob/main/SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## spaCy practice

https://colab.research.google.com/github/DerwenAI/spaCy_tuTorial/blob/master/spaCy_tuTorial.ipynb#scrollTo=qRCCr_LmW_1-

Import spaCy and load a model (models come in small, medium, and large sizes: sm, md, lg)

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")

Create a text, run it through the nlp pipeline, and store the result in the variable doc

In [None]:
text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)
print(doc)
type(doc)

In [None]:
text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

Convert the doc variable into a data frame using pandas

In [None]:
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
    
df
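
Once the tokens are in a data frame, ordinary pandas filtering applies. The sketch below is an illustration, not part of the original tutorial: it assumes a blank English pipeline (`spacy.blank("en")`, which provides the tokenizer and lexical attributes only, with no trained model), and the names `blank_nlp`, `df_demo`, and `content_words` are made up here. It drops stop words and punctuation to keep the content words.

```python
import pandas as pd
import spacy

# Assumption for this sketch: a blank pipeline (tokenizer + lexical attributes).
# POS tags and lemmas would need a trained model such as en_core_web_sm.
blank_nlp = spacy.blank("en")
doc_demo = blank_nlp("The rain in Spain falls mainly on the plain.")

rows_demo = [(t.text, t.is_stop, t.is_punct) for t in doc_demo]
df_demo = pd.DataFrame(rows_demo, columns=["text", "stopword", "punct"])

# Boolean masks keep the rows that are neither stop words nor punctuation
mask = ~df_demo["stopword"] & ~df_demo["punct"]
content_words = df_demo[mask]["text"].tolist()
print(content_words)
```

The same mask pattern should work on the `df` built above, using its `stopword` column.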

Visualize the sentence structure with displaCy

In [None]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)

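What if there are two or more sentences? spaCy has features for sentence boundary detection (SBD), also known as sentence segmentation. In the trained pipelines the boundaries are derived from the dependency parse by default, and the detected sentences are available through .sents: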

In [None]:
text2 = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit. I fell in. Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket. The gorillas just went wild."

doc2 = nlp(text2)

for sent in doc2.sents:
    print(">", sent)

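Start and end position of each sentence. Note that sent.start and sent.end are token indices, not character counts (character offsets, spaces included, are in sent.start_char and sent.end_char)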

In [None]:
for sent in doc2.sents:
    print(">", sent.start, sent.end)
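
`sent.start` and `sent.end` count tokens, while `start_char` and `end_char` give character positions in the original string. A minimal sketch of the difference follows. It assumes a blank pipeline with the rule-based `sentencizer` component in place of the trained model, so that it runs without a model download; `sbd_nlp` and `sample` are names invented for this example.

```python
import spacy

# Assumption: blank pipeline + rule-based sentencizer instead of en_core_web_sm
sbd_nlp = spacy.blank("en")
sbd_nlp.add_pipe("sentencizer")

sample = "I fell in. The gorillas just went wild."
sample_doc = sbd_nlp(sample)

for sent in sample_doc.sents:
    n_tokens = sent.end - sent.start            # token count (end is exclusive)
    n_chars = sent.end_char - sent.start_char   # character count, spaces included
    print(sent.text, "| tokens:", n_tokens, "| chars:", n_chars)
    # Character offsets index directly into the original string
    assert sample[sent.start_char:sent.end_char] == sent.text
```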

To pull out a single sentence, slice the doc with the token indices printed above

In [None]:
doc2[48:54]
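
Hard-coded indices stop matching as soon as the text changes. One alternative, sketched here under the same assumption of a blank pipeline plus `sentencizer` (the names `seg_nlp` and `story` are hypothetical), is to collect the sentences in a list and pick one by position.

```python
import spacy

# Assumption: rule-based sentence splitting on a blank pipeline
seg_nlp = spacy.blank("en")
seg_nlp.add_pipe("sentencizer")

story = "We were all out at the zoo one day. I fell in. The gorillas just went wild."
sentences = list(seg_nlp(story).sents)

last = sentences[-1]          # pick a sentence by position, no token math needed
print(last.text)

# A sentence Span carries its own token boundaries, so the slice can be rebuilt
same_slice = last.doc[last.start:last.end]
print(same_slice.text == last.text)
```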

# Natural Language Understanding

Now let's dive into some of the spaCy features for NLU. Given that we have a parse of a document, from a purely grammatical standpoint we can pull the noun chunks, i.e., each of the noun phrases ( .noun_chunks ):

In [None]:
text3 = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977, in Cupertino, California."
doc3 = nlp(text3)

for chunk in doc3.noun_chunks:
    print(chunk.text)

Next, identify the named entities within the text, i.e., the proper nouns ( .ents ):

In [None]:
for ent in doc3.ents:
    print(ent.text, ent.label_)
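
Entity labels are short codes, and `spacy.explain` (already used above for POS tags) can describe them too. A small sketch: the four labels listed are entries in spaCy's glossary chosen for illustration, and which of them a given model assigns to this sentence is not assumed here.

```python
import spacy

# A few labels from spaCy's built-in glossary (illustrative selection)
labels = ["PERSON", "ORG", "GPE", "DATE"]
label_help = {label: spacy.explain(label) for label in labels}

for label, description in label_help.items():
    print(f"{label:7} {description}")
```

In the loop above, `spacy.explain(ent.label_)` would give the description for each entity found.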

Visualize the named entities

In [None]:
displacy.render(doc3, style="ent", jupyter=True)
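
Outside a notebook, displaCy can return the markup as a string instead of displaying it. The sketch below uses displaCy's manual mode with hand-labeled spans, which is an assumption made so that it runs without a trained model; the spans are not model output, and the helper `make_span` and the file name `entities.html` are hypothetical.

```python
from spacy import displacy

sentence = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977."

def make_span(text, phrase, label):
    """Build a displaCy entity dict from the first occurrence of phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

# Hand-labeled spans for illustration only (not predicted by a model)
manual_doc = {
    "text": sentence,
    "ents": [
        make_span(sentence, "Steve Jobs", "PERSON"),
        make_span(sentence, "Steve Wozniak", "PERSON"),
        make_span(sentence, "Apple Computer", "ORG"),
    ],
}

# jupyter=False forces a string return; page=True wraps it in a full HTML page
ent_html = displacy.render(manual_doc, style="ent", manual=True, page=True, jupyter=False)

with open("entities.html", "w", encoding="utf-8") as f:
    f.write(ent_html)
```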

There is a spaCy integration for WordNet called spacy-wordnet, by Daniel Vila Suero.
First, download WordNet through nltk

In [30]:
import nltk

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Using the pipeline:
we'll add the WordnetAnnotator from the spacy-wordnet project

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install spacy-wordnet

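In [38]:
# importing the module registers the "spacy_wordnet" component factory
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

print("before", nlp.pipe_names)

# spaCy v3 takes the registered factory name (a string), not a component instance,
# and the pipe appears in pipe_names as "spacy_wordnet"
if "spacy_wordnet" not in nlp.pipe_names:
    nlp.add_pipe("spacy_wordnet", after="tagger")

print("after", nlp.pipe_names)

before ['tok2vec', 'tagger', 'spacy_wordnet', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']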
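
In spaCy v3, `nlp.add_pipe` takes the string name of a registered component factory, and `before=` / `after=` control where the component lands. A minimal sketch on a blank pipeline follows, using the built-in `sentencizer` and `entity_ruler` factories as stand-ins; `pipe_demo` is a name invented for this example.

```python
import spacy

pipe_demo = spacy.blank("en")   # a blank pipeline starts with no components
print("before", pipe_demo.pipe_names)

# The guard keeps the cell safe to re-run: adding the same name twice raises an error
if "sentencizer" not in pipe_demo.pipe_names:
    pipe_demo.add_pipe("sentencizer")

# before= / after= place a component relative to one already in the pipeline
if "entity_ruler" not in pipe_demo.pipe_names:
    pipe_demo.add_pipe("entity_ruler", before="sentencizer")

print("after", pipe_demo.pipe_names)
```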

English words are often polysemous (e.g., withdraw)

In [None]:
token = nlp("withdraw")[0]
token._.wordnet.synsets()

Extract the semantic domains associated with the different senses of withdraw
(these could eventually be visualized; that is still a difficult area)

In [None]:
token._.wordnet.wordnet_domains()

Alternatively, we can specify in advance the context in which withdraw is used (e.g., finance, banking)

In [None]:
domains = ["finance", "banking"]
sentence = nlp("I want to withdraw 5,000 euros.")

enriched_sent = []

for token in sentence:
    # get synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(domains)
    
    if synsets:
        lemmas_for_synset = []

        for s in synsets:
            # collect the lemma variants of every matching synset
            lemmas_for_synset.extend(s.lemma_names())

        # append once per token, after all synsets have been collected
        enriched_sent.append("({})".format("|".join(set(lemmas_for_synset))))
    else:
        enriched_sent.append(token.text)

print(" ".join(enriched_sent))
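
The substitution step can be looked at in isolation, without WordNet. In the sketch below the variant list is a hand-written stand-in for what a domain-filtered synset lookup might return; it is a hypothetical input, not actual library output, and `enrich` is a helper invented here. Sorting the set makes the printed order reproducible.

```python
def enrich(tokens, variants_by_token):
    """Replace each token that has variants with a "(a|b|c)" alternation."""
    enriched = []
    for tok in tokens:
        variants = variants_by_token.get(tok)
        if variants:
            # de-duplicate, then sort so the output order is stable
            enriched.append("({})".format("|".join(sorted(set(variants)))))
        else:
            enriched.append(tok)
    return " ".join(enriched)

tokens_demo = ["I", "want", "to", "withdraw", "5,000", "euros", "."]

# Hypothetical variants (duplicates included on purpose to show de-duplication)
hypothetical_variants = {"withdraw": ["withdraw", "draw", "take_out", "draw_off", "draw"]}

result = enrich(tokens_demo, hypothetical_variants)
print(result)  # I want to (draw|draw_off|take_out|withdraw) 5,000 euros .
```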

Visualization example
Let's analyze text data from the party conventions during the 2012 US Presidential elections. It may take a minute or two to run, but the results of all that number crunching are worth the wait.

(scattertext is an interactive visualization for understanding texts, a product of the genius of Jason Kessler.)

In [None]:
!pip install scattertext

In [None]:
import scattertext as st

# spaCy v3: built-in components are added by their registered string name
if "merge_entities" not in nlp.pipe_names:
    nlp.add_pipe("merge_entities")

if "merge_noun_chunks" not in nlp.pipe_names:
    nlp.add_pipe("merge_noun_chunks")

convention_df = st.SampleCorpora.ConventionData2012.get_data() 
corpus = st.CorpusFromPandas(convention_df,
                             category_col="party",
                             text_col="text",
                             nlp=nlp).build()

In [None]:
html = st.produce_scattertext_explorer(
    corpus,
    category="democrat",
    category_name="Democratic",
    not_category_name="Republican",
    width_in_pixels=1000,
    metadata=convention_df["speaker"]
)

In [None]:
# display and HTML are imported from IPython.display (IPython.core.display is deprecated for this)
from IPython.display import IFrame, display, HTML
import sys

IN_COLAB = "google.colab" in sys.modules
print(IN_COLAB)

True


In [None]:
if IN_COLAB:
    display(HTML("<style>.container { width:98% !important; }</style>"))
    display(HTML(html))
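
Outside Colab the cell above displays nothing. Since `html` is a plain string, one option is to write it to a file and open that in a browser. A minimal sketch follows, with a placeholder string standing in for the real scattertext output; `html_demo` and the file name are hypothetical.

```python
from pathlib import Path

# Placeholder standing in for the string returned by produce_scattertext_explorer
html_demo = "<html><body><p>scattertext output goes here</p></body></html>"

out_path = Path("convention_visualization.html")   # hypothetical file name
out_path.write_text(html_demo, encoding="utf-8")
print("wrote", out_path.resolve())
```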