# spaCy vs. NLTK

* Functionality: NLTK is a general-purpose NLP library that provides a wide range of tools and algorithms for text processing, including tokenization, POS tagging, stemming, and sentiment analysis. spaCy, on the other hand, is a more specialized NLP library that focuses on advanced text processing tasks, such as named entity recognition, dependency parsing, and text classification.

* Performance: spaCy is known for its speed and efficiency, largely because its core is written in Cython, a superset of Python that compiles to C. NLTK, which is written mostly in pure Python, may be slower in some cases, especially when dealing with large datasets or complex text processing tasks.

# Tokenization

In [1]:
import nltk

Tokenization is the process of breaking down text into individual words or phrases, known as tokens. Tokenization is a crucial step in natural language processing (NLP) because it is the first step in preparing text for analysis.

With NLTK

In [2]:
text = "This is a sample sentence. And another sentence."
tokens = nltk.word_tokenize(text)
tokens

['This',
 'is',
 'a',
 'sample',
 'sentence',
 '.',
 'And',
 'another',
 'sentence',
 '.']
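
For intuition, the same split can be approximated with a single regular expression. This is only a sketch: `simple_tokenize` is a hypothetical helper, not part of NLTK, and `nltk.word_tokenize` applies many more rules (contractions, abbreviations, quotes) than this pattern does.

```python
import re

def simple_tokenize(text):
    # A run of word characters becomes one token; every other
    # non-space character (e.g. punctuation) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

text = "This is a sample sentence. And another sentence."
tokens = simple_tokenize(text)
print(tokens)
```

On this sentence the pattern yields the same ten tokens as the NLTK output above. The two approaches diverge on harder input such as "don't" or "U.S.", which is where a dedicated tokenizer earns its keep.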

With scikit-learn

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
tokens = vectorizer.fit_transform([text])

print(tokens)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())


  (0, 5)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	2
  (0, 0)	1
  (0, 1)	1
['and', 'another', 'is', 'sample', 'sentence', 'this']
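
Note that `CountVectorizer` does not return a list of tokens but a sparse document-term matrix: each `(row, column)  count` entry says how often the word in that vocabulary column occurs in that document. The word "a" and the periods are missing because, with its default settings, the vectorizer lowercases the text and keeps only tokens of two or more word characters. The sketch below reproduces the counts with the standard library only, assuming those defaults (`lowercase=True`, `token_pattern=r"(?u)\b\w\w+\b"`); it is an illustration, not scikit-learn's actual implementation.

```python
import re
from collections import Counter

text = "This is a sample sentence. And another sentence."

# Assumed to mirror CountVectorizer's defaults: lowercase the text,
# then keep only tokens made of two or more word characters.
tokens = re.findall(r"\b\w\w+\b", text.lower())
counts = Counter(tokens)

# The vocabulary is sorted alphabetically, so a word's position here
# corresponds to its column index in the sparse matrix.
vocabulary = sorted(counts)
for column, word in enumerate(vocabulary):
    print(column, word, counts[word])
```

Column 4 ("sentence") has a count of 2, matching the `(0, 4)	2` entry in the output above.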




With spaCy

In [4]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

tokens = [token.text for token in doc]

print(tokens)

['This', 'is', 'a', 'sample', 'sentence', '.', 'And', 'another', 'sentence', '.']


# Stemming

Stemming is the process of reducing a word to its root or base form, known as the stem, typically by stripping suffixes (and, in some algorithms, prefixes) with fixed rules. It is a common preprocessing step in natural language processing (NLP): by mapping inflected forms such as "playing" and "played" to the same stem, it shrinks the vocabulary and reduces the dimensionality of text data, which can help downstream text analysis.

Porter stemming is one of the most widely used stemming algorithms in NLP. It applies a fixed sequence of heuristic suffix-rewriting rules, organized in steps; each step may strip or replace a suffix depending on the shape of the remaining stem.

In [5]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["play", "playing", "played"]
for word in words:
    stem = stemmer.stem(word)
    print(f"{word} -> {stem}")

play -> play
playing -> play
played -> play
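
To see why rule-based suffix stripping is only a heuristic, here is a deliberately naive sketch. `toy_stem` and its suffix list are hypothetical and far simpler than the Porter algorithm; the point is only to show the mechanism and where it breaks.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word is left to act as a stem
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for word in ["play", "playing", "played", "running", "class"]:
    print(f"{word} -> {toy_stem(word)}")
```

The three forms of "play" collapse to the same stem, but "running" becomes "runn" and "class" becomes "clas". Real stemmers add extra rules and conditions to handle cases like these, and they still produce non-words at times.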


The Snowball stemmer for English (also known as the Porter2 stemmer) is a revised version of the Porter stemmer. It is generally considered an improvement and is sometimes described as slightly more aggressive in removing suffixes.

In [6]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["jumping", "jumps", "jumped"]
for word in words:
    stem = stemmer.stem(word)
    print(f"{word} -> {stem}")


jumping -> jump
jumps -> jump
jumped -> jump


The Lancaster stemmer is the most aggressive of the three algorithms: it can produce very short stems that sometimes lose the meaning of the original word.

In [7]:
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()
words = ["jumping", "jumps", "jumped"]
for word in words:
    stem = stemmer.stem(word)
    print(f"{word} -> {stem}")


jumping -> jump
jumps -> jump
jumped -> jump


# Lemmatization
Lemmatization is the process of reducing words to their dictionary form, known as the lemma, based on their morphological features and their part of speech (POS) in the sentence. The main difference between stemming and lemmatization is that stemming strips suffixes with fixed rules and may produce non-words, whereas lemmatization returns valid words that are present in the dictionary.

Compared to stemming, lemmatization produces more accurate and meaningful results. For example, consider the word "better". A stemmer can only strip or rewrite suffixes, so it either leaves the word unchanged or truncates it to a fragment; it can never relate it to "good". Lemmatization, when told the word is an adjective, reduces it to "good", which is a valid word and preserves the meaning of the original word.

In [8]:
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["playing", "plays", "played"]
for word in words:
    lemma = lemmatizer.lemmatize(word, pos="v")
    print(f"{word} -> {lemma}")


playing -> play
plays -> play
played -> play
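
The `pos` argument matters because the same surface form can have different lemmas depending on its part of speech. The sketch below illustrates the idea with a tiny hand-written lookup table; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical, and a real lemmatizer relies on a full dictionary such as WordNet rather than four entries.

```python
# Hypothetical lemma table keyed by (word, part of speech).
# POS codes follow the WordNet style: a = adjective, r = adverb,
# v = verb, n = noun.
LEMMA_TABLE = {
    ("better", "a"): "good",   # adjective: "a better plan"
    ("better", "r"): "well",   # adverb: "she sings better"
    ("was", "v"): "be",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the (word, pos) pair is unknown
    return LEMMA_TABLE.get((word.lower(), pos), word)

for word, pos in [("better", "a"), ("better", "r"), ("better", "n"), ("mice", "n")]:
    print(f"{word} ({pos}) -> {toy_lemmatize(word, pos)}")
```

With the wrong part of speech (here "better" as a noun) the lookup misses and the word comes back unchanged, which is why lemmatization is usually paired with POS tagging.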


# POS Tagging



POS tagging, or Part-of-Speech tagging, is the process of assigning each word in a text a particular part-of-speech tag based on its definition and context. The parts of speech include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections.

In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the pre-trained model

sentence = "John likes to watch movies. He prefers action movies."

# Process the sentence and obtain the POS tags for each token
doc = nlp(sentence)
pos_tags = [(token.text, token.pos_) for token in doc]

# Print the POS tags
print(pos_tags)


[('John', 'PROPN'), ('likes', 'VERB'), ('to', 'PART'), ('watch', 'VERB'), ('movies', 'NOUN'), ('.', 'PUNCT'), ('He', 'PRON'), ('prefers', 'VERB'), ('action', 'NOUN'), ('movies', 'NOUN'), ('.', 'PUNCT')]


# Named Entity Recognition (NER)

Named entity recognition (NER) is a common task in NLP that involves identifying and classifying named entities (e.g., people, organizations, locations) in text.

Entity labels used by spaCy's English models:

* PERSON: People, including fictional.
* NORP: Nationalities or religious or political groups.
* FAC: Buildings, airports, highways, bridges, etc.
* ORG: Companies, agencies, institutions, etc.
* GPE: Countries, cities, states.
* LOC: Non-GPE locations, mountain ranges, bodies of water.
* PRODUCT: Objects, vehicles, foods, etc. (Not services.)
* EVENT: Named hurricanes, battles, wars, sports events, etc.
* WORK_OF_ART: Titles of books, songs, etc.
* LAW: Named documents made into laws.
* LANGUAGE: Any named language.
* DATE: Absolute or relative dates or periods.
* TIME: Times smaller than a day.
* PERCENT: Percentage, including "%".
* MONEY: Monetary values, including unit.
* QUANTITY: Measurements, as of weight or distance.
* ORDINAL: "first", "second", etc.
* CARDINAL: Numerals that do not fall under another type.

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the pre-trained model

text = "I live in New York City and work at Google."

# Process the text and obtain the named entities
doc = nlp(text)
entities = [(entity.text, entity.label_) for entity in doc.ents]

# Print the named entities
print(entities)


[('New York City', 'GPE'), ('Google', 'ORG')]
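
Once the entities are available as `(text, label)` pairs, a common next step is to group them by label. The sketch below does this with the standard library, using the pairs from the output above and two descriptions copied from the label list earlier in this section; `LABEL_DESCRIPTIONS` and `by_label` are illustrative names. spaCy also provides `spacy.explain(label)` for looking up a label's description.

```python
from collections import defaultdict

# Entities copied from the output above, as (text, label) pairs
entities = [("New York City", "GPE"), ("Google", "ORG")]

# Two descriptions taken from the label list earlier in this section
LABEL_DESCRIPTIONS = {
    "GPE": "Countries, cities, states",
    "ORG": "Companies, agencies, institutions, etc.",
}

# Collect the entity texts under their label
by_label = defaultdict(list)
for entity_text, label in entities:
    by_label[label].append(entity_text)

for label, texts in by_label.items():
    description = LABEL_DESCRIPTIONS.get(label, "unknown label")
    print(f"{label} ({description}): {texts}")
```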


# Sentiment Analysis

Sentiment analysis is the task of analyzing a piece of text to determine whether the author's attitude towards a particular topic or subject is positive, negative, or neutral.

Sentiment analysis can be useful for a variety of applications, such as social media monitoring, customer feedback analysis, brand reputation management, and market research.

In [11]:
from textblob import TextBlob

text = "I like this product! It's horrible."

# Create a TextBlob object from the text
blob = TextBlob(text)

# Obtain the sentiment polarity (a value between -1 and 1)
sentiment = blob.sentiment.polarity

# Print the sentiment polarity
print(sentiment)
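threshold = 0

# Polarity above the threshold is positive, below it is negative, and equal to it is neutral
if sentiment > threshold:
    print("Positive sentiment!")
elif sentiment < threshold:
    print("Negative sentiment!")
else:
    print("Neutral sentiment!")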

-1.0
Negative sentiment!


In [12]:
#!pip install pycaret[full]

# Text classification with scikit-learn: predicting gender from the text

In [13]:
from pycaret.datasets import get_data

In [16]:
df = get_data('kiva')

Unnamed: 0,country,en,gender,loan_amount,nonpayment,sector,status
0,Dominican Republic,"""Banco Esperanza"" is a group of 10 women looki...",F,1225,partner,Retail,0
1,Dominican Republic,"""Caminemos Hacia Adelante"" or ""Walking Forward...",F,1975,lender,Clothing,0
2,Dominican Republic,"""Creciendo Por La Union"" is a group of 10 peop...",F,2175,partner,Clothing,0
3,Dominican Republic,"""Cristo Vive"" (""Christ lives"" is a group of 10...",F,1425,partner,Clothing,0
4,Dominican Republic,"""Cristo Vive"" is a large group of 35 people, 2...",F,4025,partner,Food,0


In [18]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
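import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# Use a separate name so the imported `stopwords` module is not overwritten
# (overwriting it makes this cell fail when it is run a second time)
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase, tokenize, and keep only alphabetic tokens that are not stop words
    tokens = nltk.word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(tokens)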

df['processed_text'] = df['en'].apply(preprocess_text)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lfroes\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [23]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['processed_text'])
y = df['gender']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [24]:
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

MultinomialNB()

In [26]:
y_pred = nb_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print('Accuracy:', accuracy)
print('Confusion Matrix:\n', confusion_mat)
print('Classification Report:\n', class_report)

Accuracy: 0.9017595307917888
Confusion Matrix:
 [[1021   15]
 [ 119  209]]
Classification Report:
               precision    recall  f1-score   support

           F       0.90      0.99      0.94      1036
           M       0.93      0.64      0.76       328

    accuracy                           0.90      1364
   macro avg       0.91      0.81      0.85      1364
weighted avg       0.90      0.90      0.89      1364
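
The figures in the report can be re-derived by hand from the confusion matrix, which helps when interpreting them. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with the labels in sorted order (F, M); the row sums 1036 and 328 match the support column, which is consistent with that reading.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, both in the order F, M.
confusion = [[1021, 15],
             [119, 209]]
labels = ["F", "M"]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total
print(f"accuracy = {accuracy:.4f}")

metrics = {}
for i, label in enumerate(labels):
    true_positive = confusion[i][i]
    predicted = sum(row[i] for row in confusion)  # column sum
    actual = sum(confusion[i])                    # row sum (support)
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The recall of 0.64 for class M means that 119 of the 328 texts labeled M (about 36%) were predicted as F. The overall accuracy of 0.90 is driven largely by the majority class F, so the per-class numbers give a more honest picture than accuracy alone.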



# Topic modeling with PyCaret (LDA), with frequency and sentiment plots

In [28]:
# Note: the pycaret.nlp module is part of PyCaret 2.x; it was removed in PyCaret 3.0
from pycaret.nlp import *

models()

ID,Name,Reference
lda,Latent Dirichlet Allocation,gensim/models/ldamodel
lsi,Latent Semantic Indexing,gensim/models/lsimodel
hdp,Hierarchical Dirichlet Process,gensim/models/hdpmodel
rp,Random Projections,gensim/models/rpmodel
nmf,Non-Negative Matrix Factorization,sklearn.decomposition.NMF


In [30]:
s = setup(df, target = 'en')

lda = create_model('lda')

plot_model(lda, plot = 'frequency')

In [31]:
plot_model(lda, plot = 'sentiment')

In [32]:
evaluate_model(lda)


In [33]:
lda_results = assign_model(lda)
lda_results.head()

Unnamed: 0,country,en,gender,loan_amount,nonpayment,sector,status,processed_text,Topic_0,Topic_1,Topic_2,Topic_3,Dominant_Topic,Perc_Dominant_Topic
0,Dominican Republic,group woman look receive small loan take small...,F,1225,partner,Retail,0,banco esperanza group women looking receive sm...,0.001049,0.05648,0.3685,0.57397,Topic 3,0.57
1,Dominican Republic,walk forward group entrepreneur seek second lo...,F,1975,lender,Clothing,0,caminemos hacia adelante walking forward group...,0.000981,0.071564,0.237809,0.689645,Topic 3,0.69
2,Dominican Republic,group people hope start business group look re...,F,2175,partner,Clothing,0,creciendo por la union group people hoping sta...,0.001305,0.044101,0.313093,0.641501,Topic 3,0.64
3,Dominican Republic,live group woman look receive first loan young...,F,1425,partner,Clothing,0,cristo vive christ lives group women looking r...,0.001079,0.138274,0.464483,0.396164,Topic 2,0.46
4,Dominican Republic,cristo vive large group people hope take loan ...,F,4025,partner,Food,0,cristo vive large group people hoping take loa...,0.001037,0.081933,0.202928,0.714103,Topic 3,0.71


In [None]:
#save_model(lda, 'my_lda_model')

# Rule-Based Sentiment Analysis

In [34]:
import string

def rule_based_sentiment(text):
    # Define rules for positive and negative sentiment
    positive_words = {'happy', 'joyful', 'amazing'}
    negative_words = {'sad', 'angry', 'frustrated'}

    # Tokenize on whitespace and strip surrounding punctuation,
    # so that "sad." or "happy!" still match the word lists
    words = [word.strip(string.punctuation) for word in text.lower().split()]

    # Count the number of positive and negative words
    positive_count = sum(word in positive_words for word in words)
    negative_count = sum(word in negative_words for word in words)

    # Determine the overall sentiment based on the number of positive and negative words
    if positive_count > negative_count:
        return 'positive'
    elif negative_count > positive_count:
        return 'negative'
    else:
        return 'neutral'


In [35]:
rule_based_sentiment('hello im sad')

'negative'