In [32]:
import spacy

# Loading the language model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "Natural language processing, including sentiment analysis and topic modelling. SpaCy is fast."

# Run the pipeline; tokenisation is its first step
doc = nlp(text)

# Print each token (words and punctuation marks)
for token in doc:
    print(token.text)

Natural
language
processing
,
including
sentiment
analysis
and
topic
modelling
.
SpaCy
is
fast
.


In [33]:
# Process the text; the parser also marks sentence boundaries
doc = nlp(text)

# Print each sentence
for sentence in doc.sents:
    print(sentence.text)

Natural language processing, including sentiment analysis and topic modelling.
SpaCy is fast.


In [34]:
# Process the text with the full pipeline, including the part-of-speech tagger
doc = nlp(text)

# Print each token with its coarse-grained part-of-speech tag
for token in doc:
    print(token.text, token.pos_)

Natural ADJ
language NOUN
processing NOUN
, PUNCT
including VERB
sentiment NOUN
analysis NOUN
and CCONJ
topic NOUN
modelling NOUN
. PUNCT
SpaCy PROPN
is AUX
fast ADJ
. PUNCT


Bringing words back to their base form

Lemmatisation is the process of reducing a word to its base (dictionary) form, the lemma: for example, "including" becomes "include" and "is" becomes "be". This harmonises the different inflected forms of a word so that they can be counted and compared as a single item, which improves the accuracy of later analysis.


In [35]:
doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_)

Natural natural
language language
processing processing
, ,
including include
sentiment sentiment
analysis analysis
and and
topic topic
modelling modelling
. .
SpaCy SpaCy
is be
fast fast
. .
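
To see which tokens lemmatisation actually changed, the pairs printed above can be compared directly. The following is a minimal sketch in plain Python: the pairs list is copied from the output above, and the helper name changed_by_lemmatisation is hypothetical, introduced only for this illustration.

```python
# (token text, lemma) pairs copied from the output above
pairs = [
    ("Natural", "natural"), ("language", "language"), ("processing", "processing"),
    (",", ","), ("including", "include"), ("sentiment", "sentiment"),
    ("analysis", "analysis"), ("and", "and"), ("topic", "topic"),
    ("modelling", "modelling"), (".", "."), ("SpaCy", "SpaCy"),
    ("is", "be"), ("fast", "fast"), (".", "."),
]


def changed_by_lemmatisation(pairs):
    """Return the pairs whose lemma differs from the token, ignoring case."""
    return [(text, lemma) for text, lemma in pairs if text.lower() != lemma.lower()]


changed = changed_by_lemmatisation(pairs)
print(changed)
```

In this sentence only "including" and "is" are rewritten; the other tokens are already in their base form, apart from capitalisation.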


Detecting and classifying Named Entities

Named entities are real-world objects that can be identified by name, such as people, places, dates and organisations. Extracting and classifying them is an important task in NLP. In spaCy, the entities recognised by the pipeline are available through doc.ents; each entity has its text and a label such as ORG, GPE or DATE.

In [36]:
text = "Apple is going to build a new office in London in 2023."

doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
London GPE
2023 DATE


In [37]:
from gensim.models import Word2Vec

# Loading the language model
nlp = spacy.load("en_core_web_sm")

# Preparing the text corpus
corpus = ["I like cats.", "Dogs are friendly.", "Cats and dogs are pets."]

# Tokenise and lemmatise the text
processed_corpus = []
for doc in nlp.pipe(corpus):
    processed_corpus.append([token.lemma_ for token in doc])

# Train the Word2Vec model (sg=0 selects CBOW; min_count=1 keeps every token)
model = Word2Vec(processed_corpus, vector_size=100, window=5, min_count=1, sg=0)

# Save the model
model.save("custom_word_vectors.model")
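
Trained word vectors are usually compared with cosine similarity. The sketch below implements it in plain Python on made-up 3-dimensional vectors; these are hypothetical values chosen for illustration, not vectors from the model above. With a trained gensim model, model.wv.similarity("cat", "dog") computes the same quantity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen only for illustration
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]    # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # close to 0: orthogonal
```

With a corpus of only three short sentences, the similarities produced by the model above should not be expected to be meaningful.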

In [38]:
from sklearn.svm import SVC

# Load the saved Word2Vec model
custom_model = Word2Vec.load("custom_word_vectors.model")

# Look up the vectors for the lemmas "cat" and "dog"
vector_cat = custom_model.wv["cat"]
vector_dog = custom_model.wv["dog"]

# Prepare a toy training set: one vector per class (the labels are arbitrary placeholders)
X = [vector_cat, vector_dog]
y = ["animal", "animal1"]

# Train the classifier
classifier = SVC()
classifier.fit(X, y)

# Predict on a vector the classifier was trained on; this only checks that the
# pipeline runs and says nothing about how well it generalises
test_vector = custom_model.wv["dog"]
predicted_label = classifier.predict([test_vector])[0]

print("Predicted label for 'dog':", predicted_label)

Predicted label for 'dog': animal1


Syntactic Analysis

Dependency Tree

Dependency Tree Structure

Syntactic analysis, here in the form of dependency parsing, is an important part of natural language processing. It identifies the syntactic relationships between the words of a sentence. These relationships are usually visualised as a dependency tree, a graphical representation of the sentence structure in which each word is attached to the word it depends on (its head).

Using a dependency tree to analyse the relationships between words

A dependency tree shows which words are heads and which are dependents, as well as which syntactic relation connects each pair. In spaCy, every token carries its dependency label in token.dep_ and a reference to its head in token.head, so the tree can be read off token by token; the displacy module can also render it graphically.

In [39]:
text = "Natural language processing, including sentiment analysis and topic modelling. SpaCy is fast."

doc = nlp(text)

# Print each token, its dependency label and its head
for token in doc:
    print(token.text, token.dep_, token.head.text)

Natural amod language
language compound processing
processing ROOT processing
, punct processing
including prep processing
sentiment compound analysis
analysis pobj including
and cc analysis
topic compound modelling
modelling conj analysis
. punct processing
SpaCy nsubj is
is ROOT is
fast acomp is
. punct is
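
The flat listing above can be turned back into a tree by grouping tokens under their heads. The following is a minimal sketch in plain Python for the first sentence only. The token texts and labels are copied from the output above; the head indices are inferred from the printed head texts (each is unique within this sentence), and the helper names build_children and render are hypothetical.

```python
from collections import defaultdict

# (text, dependency label, index of head) for the first sentence.
# Head indices are inferred from the printed output; the ROOT token points at itself.
tokens = [
    ("Natural", "amod", 1),
    ("language", "compound", 2),
    ("processing", "ROOT", 2),
    (",", "punct", 2),
    ("including", "prep", 2),
    ("sentiment", "compound", 6),
    ("analysis", "pobj", 4),
    ("and", "cc", 6),
    ("topic", "compound", 9),
    ("modelling", "conj", 6),
    (".", "punct", 2),
]


def build_children(tokens):
    """Return the root index and a mapping from head index to child indices."""
    children = defaultdict(list)
    root = None
    for i, (_, dep, head) in enumerate(tokens):
        if dep == "ROOT":
            root = i
        else:
            children[head].append(i)
    return root, children


def render(tokens, root, children):
    """Return the tree as indented lines, one token per line, depth-first."""
    lines = []

    def walk(i, depth):
        text, dep, _ = tokens[i]
        lines.append("  " * depth + f"{text} ({dep})")
        for child in children[i]:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


root, children = build_children(tokens)
tree_lines = render(tokens, root, children)
print("\n".join(tree_lines))
```

With a live spaCy Doc, the same traversal can start from a sentence's root token and follow token.children instead of the hand-built index.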


Grammatical relations

The concept of syntactic relations (subject, object, etc.)

Syntactic relations define how the words of a sentence are related to each other. Key relations include the subject, the direct object and the indirect object. Identifying them helps to work out who does what to whom in a sentence. In spaCy's English models these relations appear as dependency labels such as nsubj (nominal subject), dobj (direct object) and prep (preposition).

In [40]:
text = "Natural language processing, including sentiment analysis and topic modelling. SpaCy is fast."

doc = nlp(text)

# Report subjects, direct objects and prepositions by their dependency label
for token in doc:
    if token.dep_ == "nsubj":
        print(f"Subject: {token.text}")
    elif token.dep_ == "dobj":
        print(f"Direct Object: {token.text}")
    elif token.dep_ == "prep":
        print(f"Preposition: {token.text}")

Preposition: including
Subject: SpaCy


Sentiment analysis

Sentiment analysis estimates the tone of a text, for example whether an opinion is positive or negative. spaCy is not a dedicated sentiment library, but it is a useful tool for pre-processing and analysing texts before specialised techniques are applied. The example below uses TextBlob for the scoring.


In [41]:

from textblob import TextBlob

nlp = spacy.load("en_core_web_sm")

text = "Natural language processing, including sentiment analysis and topic modelling. SpaCy is fast. I love this product. It's amazing!"

doc = nlp(text)

# Use TextBlob to score the sentiment of the text
analysis = TextBlob(text)

# Polarity is a float in [-1.0, 1.0]; negative values indicate negative sentiment
sentiment = analysis.sentiment.polarity

if sentiment > 0:
    sentiment_label = "positive"
elif sentiment < 0:
    sentiment_label = "negative"
else:
    sentiment_label = "neutral"

print(f"Sentiment: {sentiment_label}")

Sentiment: positive
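
The cell above labels any non-zero polarity as positive or negative. Scores very close to zero are often better treated as neutral. The sketch below shows one way to do that in plain Python; the function name label_polarity and the threshold of 0.05 are assumptions made for illustration, not values recommended by TextBlob.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1.0, 1.0] to a label, treating small scores as neutral."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores, chosen only to exercise each branch
for score in (0.6, 0.02, -0.4):
    print(score, label_polarity(score))
```

The width of the neutral band is a design choice and is best tuned on labelled examples from the target domain.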


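Extracting keywords from a text condenses the information and highlights the most important aspects of the content. spaCy has no dedicated keyword extractor, but its token attributes (is_stop, is_alpha, vector_norm) make it easy to filter candidate keywords, which can then be ranked, for example by frequency.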


In [30]:
nlp = spacy.load("en_core_web_sm")

text = "Natural language processing is a field of study focused on making sense of text data. Natural language processing, including sentiment analysis and topic modelling. SpaCy is fast."

doc = nlp(text)

# Candidate keywords for a frequency count: alphabetic tokens that are not stop words
keywords_freq = [token.text for token in doc if not token.is_stop and token.is_alpha]
# Candidate keywords with a non-zero vector norm; punctuation is not filtered out here
keywords_semantic = [token.text for token in doc if not token.is_stop and token.vector_norm > 0]

print("Keywords based on frequency:", keywords_freq)
print("Keywords based on semantics:", keywords_semantic)

Keywords based on frequency: ['Natural', 'language', 'processing', 'field', 'study', 'focused', 'making', 'sense', 'text', 'data', 'Natural', 'language', 'processing', 'including', 'sentiment', 'analysis', 'topic', 'modelling', 'SpaCy', 'fast']
Keywords based on semantics: ['Natural', 'language', 'processing', 'field', 'study', 'focused', 'making', 'sense', 'text', 'data', '.', 'Natural', 'language', 'processing', ',', 'including', 'sentiment', 'analysis', 'topic', 'modelling', '.', 'SpaCy', 'fast', '.']
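
The first list above contains every non-stop-word token, repeats included, so it still has to be counted before it says anything about frequency. A minimal sketch of that step in plain Python follows; the candidates list is copied from the output above, and the helper name top_keywords is hypothetical.

```python
from collections import Counter

# Candidate list copied from the "Keywords based on frequency" output above
candidates = [
    'Natural', 'language', 'processing', 'field', 'study', 'focused', 'making',
    'sense', 'text', 'data', 'Natural', 'language', 'processing', 'including',
    'sentiment', 'analysis', 'topic', 'modelling', 'SpaCy', 'fast',
]


def top_keywords(words, n=3):
    """Return the n most frequent words (case-insensitive) with their counts."""
    counts = Counter(word.lower() for word in words)
    return counts.most_common(n)


top = top_keywords(candidates)
print(top)
```

Counting lemmas (token.lemma_) instead of surface forms would additionally merge inflected variants of the same word.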


# New section