# spaCy vs. NLTK

* Functionality: NLTK is a general-purpose NLP toolkit that provides a wide range of tools and algorithms for text processing, including tokenization, POS tagging, stemming, and sentiment analysis, among others. spaCy, on the other hand, is a more opinionated library built around pretrained pipelines, with a focus on tasks such as named entity recognition, dependency parsing, and text classification.

* Performance: spaCy is known for its speed and efficiency, largely because its core is written in Cython, a superset of Python that compiles to C. NLTK is implemented mostly in pure Python and may be slower in some cases, especially when dealing with large datasets or complex text processing tasks.

# Tokenization

In [3]:
import nltk
# nltk.download("punkt")  # one-time tokenizer data download ("punkt_tab" on recent NLTK releases)

Tokenization is the process of breaking down text into individual words or phrases, known as tokens. Tokenization is a crucial step in natural language processing (NLP) because it is the first step in preparing text for analysis.
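
To make the idea concrete before turning to the libraries, here is a minimal sketch of a tokenizer built on a regular expression. The function name simple_tokenize is made up for this example, and the rule it uses (runs of word characters, plus single punctuation marks) is far cruder than what NLTK or spaCy do.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("This is a sample sentence. And another sentence."))
# ['This', 'is', 'a', 'sample', 'sentence', '.', 'And', 'another', 'sentence', '.']
```

A rule this simple breaks on contractions: simple_tokenize("don't") returns ['don', "'", 't'], which is one reason to prefer a library tokenizer for real text.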

With NLTK

In [5]:
text = "This is a sample sentence. And another sentence."
tokens = nltk.word_tokenize(text)
tokens

['This',
 'is',
 'a',
 'sample',
 'sentence',
 '.',
 'And',
 'another',
 'sentence',
 '.']

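With scikit-learn

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# The outputs below correspond to this text (the NLTK cell used a different second sentence)
text = "This is a sample sentence. Another sentence here."

# CountVectorizer lowercases and tokenizes the text, then counts each token (bag of words)
vectorizer = CountVectorizer()
tokens = vectorizer.fit_transform([text])

print(tokens)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())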


  (0, 5)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	2
  (0, 0)	1
  (0, 1)	1
['another', 'here', 'is', 'sample', 'sentence', 'this']
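
The output above contains neither "a" nor the periods. With its default settings, CountVectorizer lowercases the text and keeps only tokens of two or more word characters, and each (row, column) pair indexes the alphabetically sorted vocabulary. The sketch below imitates that behaviour with the standard library; bag_of_words is a made-up helper, not part of scikit-learn, and it ignores every other CountVectorizer option.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase tokens made of two or more word characters."""
    # Same shape as CountVectorizer's default token pattern
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    vocabulary = sorted(counts)  # column order: alphabetical
    return vocabulary, [counts[term] for term in vocabulary]

vocabulary, row = bag_of_words("This is a sample sentence. Another sentence here.")
print(vocabulary)  # ['another', 'here', 'is', 'sample', 'sentence', 'this']
print(row)         # [1, 1, 1, 1, 2, 1]
```

Column 4 ("sentence") has a count of 2, which matches the (0, 4) 2 entry in the sparse matrix printed above.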


With spaCy

In [11]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

tokens = [token.text for token in doc]

print(tokens)

['This', 'is', 'a', 'sample', 'sentence', '.', 'Another', 'sentence', 'here', '.']


# Stemming

Stemming is the process of reducing a word to a root form, known as the stem, by stripping affixes with heuristic rules. The English stemmers shown here remove suffixes only, and the resulting stem need not be a dictionary word. Stemming is a common preprocessing step in natural language processing (NLP) because it maps inflected variants of a word to a single token, which shrinks the vocabulary and therefore the dimensionality of the text data.
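
As a rough illustration of rule-based suffix stripping, here is a toy stemmer. The rule list SUFFIXES and the min_stem_length guard are invented for this sketch, and real stemmers use far more rules and conditions, so treat it only as a picture of the general idea.

```python
# Hypothetical rule list for illustration; real stemmers use many more rules
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word

for word in ["play", "playing", "played", "plays", "sing"]:
    print(f"{word} -> {toy_stem(word)}")
```

The length guard keeps "sing" intact, but the toy rules still turn "running" into "runn", which shows how a stem can fail to be a real word.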

Porter stemming is one of the most widely used stemming algorithms in NLP. It applies a fixed sequence of heuristic suffix-rewriting rules in several steps, with each rule guarded by a condition on the stem that would remain.

In [12]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["play", "playing", "played"]
for word in words:
    stem = stemmer.stem(word)
    print(f"{word} -> {stem}")

play -> play
playing -> play
played -> play


The Snowball stemmer for English (also known as the Porter2 stemmer) is a revised version of the Porter stemmer that refines a number of its rules.

In [13]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["jumping", "jumps", "jumped"]
for word in words:
    stem = stemmer.stem(word)
    print(f"{word} -> {stem}")


jumping -> jump
jumps -> jump
jumped -> jump


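The Lancaster stemmer is the most aggressive of the three algorithms. It can produce very short stems that sometimes lose the meaning of the original word, although for the simple words below it agrees with the other two stemmers.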

In [14]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
words = ["jumping", "jumps", "jumped"]
for word in words:
    stem = stemmer.stem(word)
    print(f"{word} -> {stem}")


jumping -> jump
jumps -> jump
jumped -> jump


# Lemmatization

Lemmatization is the process of reducing words to their dictionary form, known as the lemma, based on their morphological features and their part of speech (POS) in the sentence. The main difference from stemming is that stemming strips suffixes by rule and may yield a non-word, whereas lemmatization returns valid words that are present in the dictionary.

Lemmatization therefore tends to produce more meaningful results than stemming. For example, consider the word "better". A suffix-stripping stemmer has no rule that relates it to "good"; at best it leaves the word unchanged or truncates it. A lemmatizer that is told the word is an adjective can reduce it to "good", which is a valid word and preserves the meaning of the original word.
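
The "better" example can be pictured as a dictionary lookup keyed by both the word and its part of speech. The sketch below is a toy stand-in for a real lexicon such as WordNet: LEMMA_TABLE and toy_lemmatize are invented for this example, and the table holds only a handful of hand-picked entries.

```python
# Hypothetical miniature lexicon: (word, part of speech) -> lemma
LEMMA_TABLE = {
    ("better", "a"): "good",    # adjective: comparative of "good"
    ("better", "v"): "better",  # verb: "to better oneself"
    ("played", "v"): "play",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a word/POS pair; fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("better", "v"))  # better
print(toy_lemmatize("mice", "n"))    # mouse
```

The same surface form maps to different lemmas depending on the POS, which is why lemmatizers take a pos argument.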

In [19]:
import nltk  # nltk.download("wordnet") is needed once for the lemmatizer data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["playing", "plays", "played"]
for word in words:
    lemma = lemmatizer.lemmatize(word, pos="v")
    print(f"{word} -> {lemma}")


playing -> play
plays -> play
played -> play


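# POS Tagging

POS tagging, or part-of-speech tagging, is the process of assigning each word in a text a part-of-speech tag based on its definition and context. The traditional parts of speech include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections. spaCy's token.pos_ attribute returns coarse-grained Universal POS tags, which add categories such as PROPN (proper noun), PART (particle), and PUNCT (punctuation).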

In [26]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the pre-trained model

sentence = "John likes to watch movies. He prefers action movies."

# Process the sentence and obtain the POS tags for each token
doc = nlp(sentence)
pos_tags = [(token.text, token.pos_) for token in doc]

# Print the POS tags
print(pos_tags)


[('John', 'PROPN'), ('likes', 'VERB'), ('to', 'PART'), ('watch', 'VERB'), ('movies', 'NOUN'), ('.', 'PUNCT'), ('He', 'PRON'), ('prefers', 'VERB'), ('action', 'NOUN'), ('movies', 'NOUN'), ('.', 'PUNCT')]
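
Once the tags are available as (token, tag) pairs, ordinary Python is enough to work with them. This small sketch copies the pairs from the output above rather than calling spaCy again; it counts the tags and pulls out the nouns and proper nouns.

```python
from collections import Counter

# (token, tag) pairs copied from the output above
pos_tags = [("John", "PROPN"), ("likes", "VERB"), ("to", "PART"),
            ("watch", "VERB"), ("movies", "NOUN"), (".", "PUNCT"),
            ("He", "PRON"), ("prefers", "VERB"), ("action", "NOUN"),
            ("movies", "NOUN"), (".", "PUNCT")]

# How often each tag occurs
tag_counts = Counter(tag for _, tag in pos_tags)

# Keep only the tokens tagged as common or proper nouns
nouns = [token for token, tag in pos_tags if tag in {"NOUN", "PROPN"}]

print(tag_counts)
print(nouns)
```

Filtering by tag like this is a common first step for tasks such as keyword extraction.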


# Named Entity Recognition (NER)

Named entity recognition (NER) is a common NLP task that involves identifying named entities in text and classifying them into categories such as people, organizations, and locations.

The pretrained English spaCy pipelines use the following entity labels:

* PERSON: People, including fictional.
* NORP: Nationalities or religious or political groups.
* FAC: Buildings, airports, highways, bridges, etc.
* ORG: Companies, agencies, institutions, etc.
* GPE: Countries, cities, states.
* LOC: Non-GPE locations, mountain ranges, bodies of water.
* PRODUCT: Objects, vehicles, foods, etc. (Not services.)
* EVENT: Named hurricanes, battles, wars, sports events, etc.
* WORK_OF_ART: Titles of books, songs, etc.
* LAW: Named documents made into laws.
* LANGUAGE: Any named language.
* DATE: Absolute or relative dates or periods.
* TIME: Times smaller than a day.
* PERCENT: Percentage, including "%".
* MONEY: Monetary values, including unit.
* QUANTITY: Measurements, as of weight or distance.
* ORDINAL: "first", "second", etc.
* CARDINAL: Numerals that do not fall under another type.

In [29]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the pre-trained model

text = "I live in New York City and work at Google."

# Process the text and obtain the named entities
doc = nlp(text)
entities = [(entity.text, entity.label_) for entity in doc.ents]

# Print the named entities
print(entities)


[('New York City', 'GPE'), ('Google', 'ORG')]
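
A list of (text, label) pairs is often easier to use once it is grouped by label. The sketch below does that with the standard library, using the pairs copied from the output above instead of running spaCy again.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above
entities = [("New York City", "GPE"), ("Google", "ORG")]

# Collect the entity texts under their label
by_label = defaultdict(list)
for entity_text, label in entities:
    by_label[label].append(entity_text)

print(dict(by_label))
# {'GPE': ['New York City'], 'ORG': ['Google']}
```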


# Sentiment Analysis

Sentiment analysis is the task of analyzing a piece of text to determine whether the author's attitude towards a particular topic or subject is positive, negative, or neutral.

Sentiment analysis can be useful for a variety of applications, such as social media monitoring, customer feedback analysis, brand reputation management, and market research.
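
One simple family of approaches scores text against a polarity lexicon. The sketch below is a toy version of that idea and is not how any particular library is implemented: LEXICON and its scores are invented for this example, and toy_polarity just averages the scores of the words it recognizes.

```python
import re

# Hypothetical polarity lexicon with scores in [-1, 1]
LEXICON = {"great": 0.8, "good": 0.7, "bad": -0.7, "horrible": -1.0}

def toy_polarity(text):
    """Average the lexicon scores of the words found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[word] for word in words if word in LEXICON]
    # Text with no known words is treated as neutral
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("The screen is great but the battery is horrible."))
print(toy_polarity("It arrived on Tuesday."))
```

The first sentence mixes a positive and a negative word, so its average lands slightly below zero; the second contains no lexicon words and scores 0.0.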

In [37]:
from textblob import TextBlob

text = "I like this product! It's horrible."

# Create a TextBlob object from the text
blob = TextBlob(text)

# Obtain the sentiment polarity (a value between -1 and 1)
sentiment = blob.sentiment.polarity

# Print the sentiment polarity
print(sentiment)
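threshold = 0

# A polarity exactly at the threshold is reported as neutral rather than negative
if sentiment > threshold:
    print("Positive sentiment!")
elif sentiment < threshold:
    print("Negative sentiment!")
else:
    print("Neutral sentiment!")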

-1.0
Negative sentiment!

