Ten ways to do text analysis
1. Vectorizer
2. TFIDF Vectorizer
3. Sentiment Intensity Analyzer
4. Data pre-processing: stopwords, word_tokenize
5. PorterStemmer
6. WordNetLemmatizer
7. Word_tokenize
8. FreqDist
9. WordCloud
10. Topic Modeling: LatentDirichletAllocation

1. Vectorizer
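Raw text is a sequence of symbols of variable length, so it cannot be fed directly to most machine learning algorithms, which expect numerical feature vectors of a fixed size. To address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely: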
1) tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
2) counting the occurrences of tokens in each document.
3) normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.

In this scheme, features and samples are defined as follows:
1) each individual token occurrence frequency (normalized or not) is treated as a feature.
2) the vector of all the token frequencies for a given document is considered a multivariate sample.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
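The three steps can be sketched in plain Python to show what a bag-of-words vectorizer produces. This is a minimal illustration, not scikit-learn's implementation: the helper names simple_tokenize and bag_of_words are hypothetical, and the regular expression is assumed to mirror the library's default token pattern (runs of two or more word characters).

```python
import re
from collections import Counter

# Assumed to mirror scikit-learn's default token pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def simple_tokenize(text):
    """Lowercase the text and extract tokens of at least two characters."""
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words(documents):
    """Return (vocabulary, matrix) with one row of token counts per document."""
    tokenized = [simple_tokenize(doc) for doc in documents]
    # Step 1: give every distinct token an integer id (its column index).
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocabulary)}
    # Step 2: count the occurrences of each token in each document.
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[index[token]] = count
        matrix.append(row)
    return vocabulary, matrix

docs = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Because the vocabulary is built once from all four sentences, every row has the same nine columns, and the first and last sentences get identical rows: word order is ignored.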

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

#Use bag of words for vectorization
vectorizer = CountVectorizer()
def vectorize_text(text):
    X = vectorizer.fit_transform([text])
    return X.toarray()

In [5]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
]
for sen in corpus:
    T = vectorize_text(sen)
    print(T)


[[1 1 1 1 1]]
[[1 1 2 1 1]]
[[1 1 1 3]]
[[1 1 1 1 1]]


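Because vectorize_text refits the vectorizer on every call, each row above uses its own alphabetically sorted vocabulary, so the columns are not comparable across sentences. Fit once on the whole corpus to get a shared vocabulary.

The default configuration lowercases the text and tokenizes it by extracting words of at least 2 letters, so the one-letter word "a" is dropped in the check below.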

In [6]:
analyze = vectorizer.build_analyzer()
analyze("This is a text document to analyze.") == (
    ['this', 'is', 'text', 'document', 'to', 'analyze'])

True

2. TFIDF Vectorizer

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Use TF-IDF for vectorization
tfidf_vectorizer = TfidfVectorizer()
def tfidf_vectorize_text(text):
    X = tfidf_vectorizer.fit_transform([text])
    return X.toarray()

In [19]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
    'Star Star Star Twinkle Star',
]
for sen in corpus:
    T = tfidf_vectorize_text(sen)
    print(T)


[[0.4472136 0.4472136 0.4472136 0.4472136 0.4472136]]
[[0.35355339 0.35355339 0.70710678 0.35355339 0.35355339]]
[[0.28867513 0.28867513 0.28867513 0.8660254 ]]
[[0.4472136 0.4472136 0.4472136 0.4472136 0.4472136]]
[[0.9701425  0.24253563]]
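The rows above can be reproduced by hand. Each call fits on a single sentence, so every term gets the same idf and what remains is the count vector scaled to unit length. The sketch below assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which to my understanding are TfidfVectorizer's documented defaults. The function name tfidf is hypothetical and the code is an illustration, not the library's implementation.

```python
import math
import re
from collections import Counter

def tfidf(documents):
    """Plain-Python TF-IDF sketch: smoothed idf, then L2-normalized rows."""
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# One document: every idf equals 1, so only the normalized counts remain.
vocab_single, rows_single = tfidf(['Star Star Star Twinkle Star'])
print(vocab_single, [round(w, 4) for w in rows_single[0]])

# Several documents: "the" occurs in all of them, "first" in only two,
# so "first" receives the larger weight in the first sentence.
vocab_all, rows_all = tfidf([
    'This is the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
])
first_row = dict(zip(vocab_all, rows_all[0]))
print(round(first_row['first'], 4), round(first_row['the'], 4))
```

The idf term only has an effect when the vectorizer is fitted on more than one document, which is the usual way to use it.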


3. Sentiment Intensity Analyzer

NLTK (Natural Language Toolkit) is a comprehensive Python library widely used for various Natural Language Processing (NLP) tasks. Here are some of the primary tasks and use cases for which NLTK is commonly applied:

Tokenization:
Breaking down text into words, sentences, or other meaningful units.
Example: Splitting a paragraph into individual sentences or words.

Text Normalization:
Converting text to a standard format by processes like lowercasing, stemming, or lemmatization.
Example: Reducing words to their base forms (e.g., "running" → "run").

Stopwords Removal:
Filtering out common words (such as "and," "the," "is") that may not add significant meaning to the analysis.

Part-of-Speech (POS) Tagging:
Labeling words with their corresponding parts of speech (nouns, verbs, adjectives, etc.), which is useful for syntactic and semantic analysis.

Named Entity Recognition (NER):
Identifying and classifying named entities (like persons, organizations, locations) within text.

Parsing and Chunking:
Analyzing the grammatical structure of sentences using techniques such as constituency parsing or dependency parsing.
Example: Extracting noun or verb phrases from sentences.

Corpus Management:
Accessing and managing a variety of text corpora and lexical resources (e.g., the Brown Corpus, Gutenberg Corpus, or WordNet) for language research and model training.

Text Classification:
Building classifiers to categorize text into predefined classes, such as spam detection or sentiment analysis.

Sentiment Analysis:
Analyzing the sentiment or emotional tone of a piece of text. NLTK even includes tools like VADER for sentiment analysis in social media text.

Language Modeling and N-grams:
Creating statistical models based on n-grams (sequences of n words) to predict the next word in a sequence or analyze language patterns.

Educational Purposes:
NLTK is also heavily used in academia for teaching and research in computational linguistics and NLP because of its ease of use and comprehensive documentation.
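Of the tasks listed above, n-grams are easy to illustrate without any library. The sketch below builds bigrams from a token list and counts them. The helper name ngrams and the sample sentence are chosen for illustration; NLTK ships its own n-gram utilities, which are not used here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample_tokens = "this is the second second document".split()
bigram_counts = Counter(ngrams(sample_tokens, 2))

# Each bigram records which word follows which, the raw material
# for a simple next-word model.
for bigram, count in bigram_counts.items():
    print(bigram, count)
```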

In [12]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk_data...


True

Example 1: Tokenization and POS Tagging

Common POS Tags and Their Meanings
NLTK uses the Penn Treebank tag set by default. Some of the most common tags include:

CC: Coordinating conjunction (e.g., and, but, or)
CD: Cardinal number (e.g., one, two, 3, 100)
DT: Determiner (e.g., the, a, an)
EX: Existential there (e.g., there is, there are)
FW: Foreign word (e.g., non-English words in an English text)
IN: Preposition or subordinating conjunction (e.g., in, on, at, because)
JJ: Adjective (e.g., blue, quick, large)
JJR: Adjective, comparative (e.g., better, larger)
JJS: Adjective, superlative (e.g., best, largest)
LS: List item marker (e.g., numbered items or bullet points)
MD: Modal (e.g., can, might, should)
NN: Noun, singular or mass (e.g., dog, car, music)
NNS: Noun, plural (e.g., dogs, cars)
NNP: Proper noun, singular (e.g., London, Alice)
NNPS: Proper noun, plural (e.g., Americans, Beatles)
PDT: Predeterminer (e.g., all, both, half)
POS: Possessive ending (e.g., ’s in “John’s”)
PRP: Personal pronoun (e.g., I, you, he, she)
PRP$: Possessive pronoun (e.g., my, your, his, her)
RB: Adverb (e.g., quickly, silently)
RBR: Adverb, comparative (e.g., faster)
RBS: Adverb, superlative (e.g., fastest)
RP: Particle (e.g., up, off, out when used with verbs such as “pick up”)
TO: The word “to” (e.g., in the infinitive “to go”)
UH: Interjection (e.g., oh, uh, wow)
VB: Verb, base form (e.g., run, eat)
VBD: Verb, past tense (e.g., ran, ate)
VBG: Verb, gerund or present participle (e.g., running, eating)
VBN: Verb, past participle (e.g., run, eaten)
VBP: Verb, non-3rd person singular present (e.g., run, eat)
VBZ: Verb, 3rd person singular present (e.g., runs, eats)
WDT: Wh-determiner (e.g., which, that)
WP: Wh-pronoun (e.g., who, what)
WP$: Possessive wh-pronoun (e.g., whose)
WRB: Wh-adverb (e.g., how, where, when)

In [35]:
import nltk
nltk.download('punkt')  # For tokenizers
nltk.download('averaged_perceptron_tagger')  # For POS tagging
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\micha\AppData\Roaming\nltk

True

In [None]:

# Sample text
text = "NLTK is a powerful library for natural language processing."

# Tokenize the text into words
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)

# Perform POS tagging on the tokens
pos_tags = nltk.pos_tag(tokens)
print("POS Tags:", pos_tags)

In [14]:
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

def sentiment_analysis(text):
    return sia.polarity_scores(text)

In [17]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
    'I am very excited about this opportunity',
    'This could be better if the timing was right',
]
for sen in corpus:
    T = sentiment_analysis(sen)
    print(T)

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
{'neg': 0.0, 'neu': 0.421, 'pos': 0.579, 'compound': 0.6697}
{'neg': 0.0, 'neu': 0.734, 'pos': 0.266, 'compound': 0.4404}
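In each result, neg, neu and pos are proportions of the text, and compound is a single normalized score between -1 and 1. A common convention, treated here as an assumption rather than a rule, is to call a text positive when compound is at least 0.05 and negative when it is at most -0.05. The helper label_sentiment below is hypothetical and is applied to three of the score dictionaries printed above.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative' or 'neutral'."""
    compound = scores['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Score dictionaries copied from the output above.
examples = [
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
    {'neg': 0.0, 'neu': 0.421, 'pos': 0.579, 'compound': 0.6697},
    {'neg': 0.0, 'neu': 0.734, 'pos': 0.266, 'compound': 0.4404},
]
labels = [label_sentiment(scores) for scores in examples]
for scores, label in zip(examples, labels):
    print(scores['compound'], label)
```

The last sentence, "This could be better if the timing was right", is labeled positive even though it reads as mild criticism, presumably because the lexicon scores the word "better" on its own. Lexicon-based scores are best treated as a rough signal.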


4. Data pre-processing: stopwords, word_tokenize

In [26]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Data preprocessing to remove punctuation and stopwords
def preprocess_text(text):
    # Strip punctuation, then lowercase and tokenize
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    # Tokens are already lowercase, so they can be compared directly
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

In [27]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
    'I am very excited about this opportunity',
    'This could be better if the timing was right',
]
for sen in corpus:
    T = preprocess_text(sen)
    print(T)

first document
second second document
third one third third
first document
excited opportunity
could better timing right


5. PorterStemmer

In [28]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(words):
    return [stemmer.stem(word) for word in words]

In [39]:
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
    'I am very excited about this opportunity',
    'This could be better if the timing was right',
]

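# stem_words expects a list of tokens, e.g. ["running", "runs", "ran", "easily", "fairly"].
# Passing a raw string would stem it one character at a time, so split each sentence first.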

for sen in corpus:
    T = stem_words(sen.split())
    print(T)

['thi', 'is', 'the', 'first', 'document.']
['thi', 'is', 'the', 'second', 'second', 'document.']
['and', 'the', 'third', 'one', 'third', 'third.']
['is', 'thi', 'the', 'first', 'document?']
['i', 'am', 'veri', 'excit', 'about', 'thi', 'opportun']
['thi', 'could', 'be', 'better', 'if', 'the', 'time', 'wa', 'right']


6. WordNetLemmatizer

In [46]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def lemmatize_words(words):
    return [lemmatizer.lemmatize(word) for word in words]

In [44]:
corpus = [
    'This was the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
    'I am very excited about this opportunity',
    'This could be better if the timing was right',
]
for sen in corpus:
    T = lemmatize_words(sen.split())
    print(T)

['This', 'wa', 'the', 'first', 'document.']
['This', 'is', 'the', 'second', 'second', 'document.']
['And', 'the', 'third', 'one', 'third', 'third.']
['Is', 'this', 'the', 'first', 'document?']
['I', 'am', 'very', 'excited', 'about', 'this', 'opportunity']
['This', 'could', 'be', 'better', 'if', 'the', 'timing', 'wa', 'right']


In [49]:
words = ["running", "ran", "easily", "fairly", "better"]

# Without a pos argument, lemmatize() treats every word as a noun
# (hence 'was' -> 'wa' above), so pass the part of speech explicitly.
lemmatized_words = [
    lemmatizer.lemmatize("running", pos=wordnet.VERB),
    lemmatizer.lemmatize("ran", pos=wordnet.VERB),
    lemmatizer.lemmatize("easily", pos=wordnet.ADV),
    lemmatizer.lemmatize("fairly", pos=wordnet.ADV),
    lemmatizer.lemmatize("better", pos=wordnet.ADJ)
]

# Print the original words and their lemmatized forms
for word, lemma in zip(words, lemmatized_words):
    print(f"Original: {word} -> Lemma: {lemma}")

Original: running -> Lemma: run
Original: ran -> Lemma: run
Original: easily -> Lemma: easily
Original: fairly -> Lemma: fairly
Original: better -> Lemma: good
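Typing the part of speech by hand does not scale. One common approach, sketched here under stated assumptions, is to translate the Penn Treebank tags listed in section 3 into the single-letter codes that WordNet uses. The helper penn_to_wordnet is hypothetical, and the letters 'a', 'v', 'r' and 'n' are assumed to equal wordnet.ADJ, wordnet.VERB, wordnet.ADV and wordnet.NOUN.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith('JJ'):
        return 'a'  # adjective (JJ, JJR, JJS)
    if tag.startswith('VB'):
        return 'v'  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith('RB'):
        return 'r'  # adverb (RB, RBR, RBS)
    return 'n'      # everything else falls back to noun

# Hand-written (word, tag) pairs standing in for the output of a POS tagger.
tagged = [('was', 'VBD'), ('better', 'JJR'), ('quickly', 'RB'), ('documents', 'NNS')]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

With NLTK loaded, the pairs would come from nltk.pos_tag(tokens), and each word could then be passed to lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)).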


7. Word_tokenize

In [50]:
from nltk.tokenize import word_tokenize

def tokenize(text):
    return word_tokenize(text)

In [52]:
corpus = [
    'This was the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
    'I am very excited about this opportunity',
    'This could be better if the timing was right',
]
for sen in corpus:
    T = tokenize(sen)
    print(T)

['This', 'was', 'the', 'first', 'document', '.']
['This', 'is', 'the', 'second', 'second', 'document', '.']
['And', 'the', 'third', 'one', 'third', 'third', '.']
['Is', 'this', 'the', 'first', 'document', '?']
['I', 'am', 'very', 'excited', 'about', 'this', 'opportunity']
['This', 'could', 'be', 'better', 'if', 'the', 'timing', 'was', 'right']


8. FreqDist

In [53]:
from nltk.probability import FreqDist

def calculate_frequency_distribution(words):
    freq_dist = FreqDist(words)
    return freq_dist

In [55]:
corpus = [
    'This was the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
    'I am very excited about this opportunity',
    'This could be better if the timing was right',
]
for sen in corpus:
    T = calculate_frequency_distribution(sen.split())
    print(T)

<FreqDist with 5 samples and 5 outcomes>
<FreqDist with 5 samples and 6 outcomes>
<FreqDist with 5 samples and 6 outcomes>
<FreqDist with 5 samples and 5 outcomes>
<FreqDist with 7 samples and 7 outcomes>
<FreqDist with 9 samples and 9 outcomes>
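Printing a FreqDist shows only its summary. To see the counts themselves, ask for the most common items. To my knowledge FreqDist is a subclass of collections.Counter, so the sketch below uses Counter directly and the same most_common call should work on a FreqDist. It pools all six sentences into one table and strips the trailing punctuation that sen.split() leaves attached. The variable names are chosen for illustration.

```python
from collections import Counter

docs = [
    'This was the first document.',
    'This is the second second document.',
    'And the third one third third.',
    'Is this the first document?',
    'I am very excited about this opportunity',
    'This could be better if the timing was right',
]

# Pool the tokens of every sentence into a single frequency table.
counts = Counter(
    word.lower().strip('.?')
    for sen in docs
    for word in sen.split()
)
print(counts.most_common(5))

# A plain dict of word -> count, the shape a word cloud expects.
frequencies = dict(counts)
```

A mapping like frequencies is the kind of input that the generate_wordcloud function in the next section takes.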


9. WordCloud

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

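def generate_wordcloud(frequencies):
    # frequencies: mapping of word -> count, e.g. a FreqDist or a plain dict
    wordcloud = WordCloud(width=800, height=400).generate_from_frequencies(frequencies)
    # Show the cloud as an image with the axes hidden
    plt.figure(figsize=(10,5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()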

10. Topic Modeling: LatentDirichletAllocation

In [None]:
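from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=5)

def topic_modeling(doc_term_matrix):
    # LDA expects a document-term count matrix (for example the output of
    # CountVectorizer.fit_transform on the whole corpus), not raw strings.
    # Returns one row of topic proportions per document.
    return lda.fit_transform(doc_term_matrix)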