## Natural Language Processing (NLP)

- scikit-learn: Excels at transforming text data into numerical representations that machine learning algorithms can process, through modules like sklearn.feature_extraction.text.

- NLTK (Natural Language Toolkit): A comprehensive suite for symbolic and statistical NLP, offering tools for tokenization, stemming, tagging, parsing, and semantic reasoning, ideal for educational and research purposes.

- spaCy: A fast and efficient library designed for production-level NLP tasks, providing robust features like tokenization, part-of-speech tagging, named entity recognition, and dependency parsing with pre-trained models.

- Gensim: Focused on topic modeling and document similarity, Gensim excels at tasks like Latent Dirichlet Allocation (LDA) and word embeddings, making it suitable for analyzing large text corpora.

### CountVectorizer
CountVectorizer converts a collection of text documents into a matrix of token counts, a representation often called the "bag-of-words" model.
It covers several common NLP steps through its parameters and outputs:

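- Tokenization: controlled by token_pattern (a regex; the default keeps tokens of two or more word characters).
- Normalization: lowercase=True by default; punctuation is dropped implicitly because the default token_pattern matches only word characters.
- Stop words: removed via the stop_words parameter.
- N-grams: set with ngram_range=(min_n, max_n).
- Vocabulary: the unique tokens learned during fit, available from get_feature_names_out().
- Document-term matrix: a sparse matrix of per-document token counts returned by fit_transform.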

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()


array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [15]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
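
The matrix above can be reproduced by hand, which makes the bag-of-words steps concrete. The following is a minimal sketch, not scikit-learn's actual implementation: it assumes the documented default `token_pattern` (`(?u)\b\w\w+\b`) together with lowercasing, and the helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    """Lowercase, then keep runs of two or more word characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Vocabulary: the sorted set of unique tokens across all documents
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# Document-term matrix: one row per document, one column per vocabulary term
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row should line up with the corresponding row of `X.toarray()`, for example the second document counts `document` twice.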


In [16]:
# stop words removal
vectorizer2 = CountVectorizer(stop_words='english')
X2 = vectorizer2.fit_transform(corpus)
vectorizer2.get_feature_names_out()

array(['document', 'second'], dtype=object)

In [17]:
print(X2.toarray())

[[1 0]
 [2 1]
 [0 0]
 [1 0]]
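
Only `document` and `second` survive because scikit-learn's built-in English list is fairly aggressive: it also treats `first`, `third`, and `one` as stop words. A quick check, assuming the `ENGLISH_STOP_WORDS` set exported by `sklearn.feature_extraction.text`:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

for word in ['first', 'second', 'third', 'one', 'document']:
    # True means CountVectorizer(stop_words='english') would drop the word
    print(word, word in ENGLISH_STOP_WORDS)
```

If the built-in list removes too much for your corpus, `stop_words` also accepts your own list of strings.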


In [18]:
# Extract bigrams only (an n-gram is a contiguous sequence of n tokens)
vectorizer3 = CountVectorizer(ngram_range=(2, 2))  # ngram_range=(min_n, max_n)
X3 = vectorizer3.fit_transform(corpus)
vectorizer3.get_feature_names_out()

array(['and this', 'document is', 'first document', 'is the', 'is this',
       'second document', 'the first', 'the second', 'the third',
       'third one', 'this document', 'this is', 'this the'], dtype=object)

In [19]:
print(X3.toarray())

[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]
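
To see where the bigram features come from, here is a small sketch that slides a window of size n over a token list. The helper name `ngrams` is hypothetical, and the tokens are those of the first document after lowercasing and punctuation removal.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['this', 'is', 'the', 'first', 'document']
bigrams = ngrams(tokens, 2)
print(bigrams)
# ['this is', 'is the', 'the first', 'first document']
```

These four bigrams are exactly the columns set to 1 in the first row of the matrix above. Setting ngram_range=(1, 2) instead would keep the single words as features alongside the bigrams.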


### TfidfVectorizer

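TfidfVectorizer converts text into TF-IDF (term frequency-inverse document frequency) weights: a term scores high when it is frequent within a document but rare across the corpus. In the output below, words that occur in every document, such as "is", "the", and "this", receive lower weights than rarer words such as "first" or "second".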

Use Cases:

- Text classification
- Information retrieval
- Sentiment analysis
- Document clustering
- Keyword extraction
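
As an illustration of the information retrieval use case, the sketch below ranks documents by cosine similarity of their TF-IDF vectors. It assumes scikit-learn's `cosine_similarity` from `sklearn.metrics.pairwise`; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer().fit_transform(corpus)

# Pairwise cosine similarity between all documents (a 4 x 4 matrix)
similarity = cosine_similarity(tfidf)

# Rank the other documents by similarity to the first one
query_index = 0
ranking = sorted(
    (i for i in range(len(corpus)) if i != query_index),
    key=lambda i: similarity[query_index, i],
    reverse=True,
)
print(ranking)
```

The fourth document ranks first because it contains the same words as the query document, and word order is ignored in a bag-of-words representation.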

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer() # stop_words = 'english' if needed
X4 = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()


array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [21]:
print(X4.toarray())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
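
The first row of this matrix can be recomputed by hand. The sketch below assumes TfidfVectorizer's default settings, namely the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row; the document frequencies are counted from the four-sentence corpus.

```python
import math

n_docs = 4

# Terms of 'This is the first document.' with their counts in that document
term_counts = {'document': 1, 'first': 1, 'is': 1, 'the': 1, 'this': 1}

# Number of corpus documents that contain each term
doc_freq = {'document': 3, 'first': 2, 'is': 4, 'the': 4, 'this': 4}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Raw tf-idf weights, then L2 normalization of the row
raw = {t: term_counts[t] * idf[t] for t in term_counts}
norm = math.sqrt(sum(w * w for w in raw.values()))
tfidf_row = {t: w / norm for t, w in raw.items()}

for term, weight in tfidf_row.items():
    print(f"{term}: {weight:.4f}")
```

The printed weights should agree with the non-zero entries of the first row above: roughly 0.4698 for "document", 0.5803 for "first", and 0.3841 for each of "is", "the", and "this".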


### NLTK

NLTK offers a vast collection of algorithms and tools for various NLP tasks, including:
- Tokenization (splitting text into words or sentences)
- Stemming and lemmatization (reducing words to their root form)
- Part-of-speech tagging (labeling words with their grammatical roles)
- Parsing (analyzing sentence structure)
- Named entity recognition (identifying people, places, organizations)
- Sentiment analysis
- Text classification
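
As a small taste of one of these tasks, the sketch below applies stemming with NLTK's `PorterStemmer`. It is rule-based, so it should not need any of the downloaded resources used later in this section; the word list is an arbitrary illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['documents', 'running', 'played', 'cats']

# Stemming strips suffixes by rule; the result is not always a dictionary word
stems = [stemmer.stem(word) for word in words]
print(list(zip(words, stems)))
```

Lemmatization differs from stemming in that it maps words to dictionary forms, which usually requires extra language resources.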

In [22]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import string

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Download necessary NLTK resources (run once)
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kabir\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\kabir\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kabir\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
def preprocess_text(text):
    """Preprocesses a single text document using a regex and NLTK."""
    # Lowercase
    text = text.lower()

    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text) #keep word characters and whitespace

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(filtered_tokens)

# Preprocess the corpus
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]

# Vectorization using TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)

# Print the TF-IDF matrix (sparse matrix)
print("TF-IDF Matrix:")
print(tfidf_matrix)

# Print the feature names (words)
print("\nFeature Names:")
print(vectorizer.get_feature_names_out())

# If you want to see the dense array representation, use toarray()
print("\nTF-IDF Dense Array:")
print(tfidf_matrix.toarray())

TF-IDF Matrix:
  (0, 1)	0.7772211620785797
  (0, 0)	0.6292275146695526
  (1, 0)	0.78722297610404
  (1, 3)	0.6166684570284895
  (2, 4)	0.7071067811865476
  (2, 2)	0.7071067811865476
  (3, 1)	0.7772211620785797
  (3, 0)	0.6292275146695526

Feature Names:
['document' 'first' 'one' 'second' 'third']

TF-IDF Dense Array:
[[0.62922751 0.77722116 0.         0.         0.        ]
 [0.78722298 0.         0.         0.61666846 0.        ]
 [0.         0.         0.70710678 0.         0.70710678]
 [0.62922751 0.77722116 0.         0.         0.        ]]


### spaCy



In [24]:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")  # small English pipeline; its token vectors have 96 dimensions


In [25]:
def preprocess_text(text):
    """Preprocesses a single text document using spaCy."""
    doc = nlp(text)

    # Lowercase, remove punctuation, remove stopwords, and lemmatization.
    tokens = [token.lemma_.lower() for token in doc if not token.is_punct and not token.is_stop]

    return " ".join(tokens)

# Preprocess the corpus
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]

# Vectorization using TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)

# Print the TF-IDF matrix (sparse matrix)
print("TF-IDF Matrix:")
print(tfidf_matrix)

# Print the feature names (words)
print("\nFeature Names:")
print(vectorizer.get_feature_names_out())

# If you want to see the dense array representation, use toarray()
print("\nTF-IDF Dense Array:")
print(tfidf_matrix.toarray())

TF-IDF Matrix:
  (0, 0)	1.0
  (1, 0)	0.78722297610404
  (1, 1)	0.6166684570284895
  (3, 0)	1.0

Feature Names:
['document' 'second']

TF-IDF Dense Array:
[[1.         0.        ]
 [0.78722298 0.61666846]
 [0.         0.        ]
 [1.         0.        ]]
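
The third document's row is all zeros because every token in "And this is the third one." is either punctuation or a spaCy stop word. The per-token flags behind that filtering can be inspected directly. The sketch below is an assumption-laden illustration: it uses `spacy.blank("en")`, a tokenizer-only English pipeline that needs no downloaded model, so lemmas are not available and the name `nlp_blank` is chosen only to avoid overwriting `nlp`.

```python
import spacy

# Blank English pipeline: tokenizer plus lexical attributes, no trained components
nlp_blank = spacy.blank("en")

doc = nlp_blank("this document is the second document.")
for token in doc:
    # is_stop and is_punct are the flags used by preprocess_text above
    print(token.text, token.is_stop, token.is_punct)

kept = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(kept)
```

Note that spaCy's stop-word list differs from NLTK's and scikit-learn's, so the same corpus can yield different vocabularies depending on which library does the filtering.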


In [26]:
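# Inspect spaCy token vectors.
# Note: en_core_web_sm ships without static word vectors, so token.vector
# returns context-sensitive tensors from the pipeline rather than word2vec-style
# embeddings. Use en_core_web_md or en_core_web_lg for pretrained word vectors.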

text = "I enjoy coding in Python."
doc = nlp(text)

for token in doc:
    print(token.text, token.vector[:5])  # Print the first 5 vector dimensions

print("\nDocument vector:", doc.vector[:5]) # print the first 5 vector dimensions.

I [-1.3939121  -0.38389102  0.11240871  0.20698646  0.7719366 ]
enjoy [-1.0712639  -0.8785212  -0.6848391  -0.03444037 -0.6544076 ]
coding [ 0.37274474  1.2042749  -0.09712669  0.84556305  0.15096729]
in [ 0.7516444  -0.34233165  0.51338893 -1.3308781  -0.5497204 ]
Python [-0.77589226 -0.38897672  0.23942587  0.5341584   0.44315192]
. [-0.7804837  -0.6215247  -0.8710715  -0.92849153 -0.25889164]

Document vector: [-0.48286048 -0.23516172 -0.1313023  -0.11785036 -0.01616064]


### Topic Modeling



In [27]:
# Match the version between numpy and gensim
# Skip this if you don't have the issue
#!pip install --upgrade --force-reinstall numpy gensim
#!pip install --upgrade --force-reinstall scipy

In [28]:
import gensim
from gensim import corpora
from gensim.models import LdaModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import string

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

# Example corpus
corpus = [
    "The cat sat on the mat. The dog played in the garden.",
    "Cats are known for their independent nature.",
    "Dogs are loyal companions to humans.",
    "Gardens are beautiful places with flowers and trees.",
    "Cats and dogs can coexist peacefully.",
    "The cat chased a mouse in the garden.",
    "Humans love their pets, both cats and dogs.",
    "Flowers bloom in spring and summer."
]

def preprocess_text(text):
    """Preprocesses a single text document."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens

# Preprocess the corpus
processed_corpus = [preprocess_text(doc) for doc in corpus]

# Create a dictionary from the processed corpus
dictionary = corpora.Dictionary(processed_corpus)

# Create a bag-of-words representation of the corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_corpus]

# Train the LDA model
lda_model = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=15)

# Print the topics
for idx, topic in enumerate(lda_model.print_topics(num_topics=-1)):
    print('Topic {}: {}'.format(idx, topic))

# Get topic distribution for a document.
test_doc = "My cat loves to play in the garden with flowers"
test_doc_processed = preprocess_text(test_doc)
test_doc_bow = dictionary.doc2bow(test_doc_processed)
topic_distribution = lda_model.get_document_topics(test_doc_bow)
print("\nTopic Distribution for test document:")
print(topic_distribution)


Topic 0: (0, '0.066*"garden" + 0.066*"cat" + 0.066*"flowers" + 0.065*"mat" + 0.065*"sat" + 0.065*"dog" + 0.065*"played" + 0.065*"trees" + 0.065*"gardens" + 0.065*"beautiful"')
Topic 1: (1, '0.089*"humans" + 0.089*"dogs" + 0.051*"loyal" + 0.051*"companions" + 0.051*"spring" + 0.051*"bloom" + 0.051*"summer" + 0.051*"love" + 0.051*"pets" + 0.051*"mouse"')
Topic 2: (2, '0.137*"cats" + 0.077*"dogs" + 0.077*"nature" + 0.077*"known" + 0.077*"independent" + 0.077*"peacefully" + 0.077*"coexist" + 0.019*"flowers" + 0.019*"garden" + 0.019*"cat"')

Topic Distribution for test document:
[(0, 0.8182691), (1, 0.097704984), (2, 0.084025994)]
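
The result is a list of (topic_id, probability) pairs, so picking the dominant topic is a one-liner. A minimal sketch using the values printed above:

```python
# Topic distribution copied from the output above: (topic_id, probability) pairs
topic_distribution = [(0, 0.8182691), (1, 0.097704984), (2, 0.084025994)]

# The dominant topic is the pair with the highest probability
dominant_topic, probability = max(topic_distribution, key=lambda pair: pair[1])
print(f"Dominant topic: {dominant_topic} (p = {probability:.2f})")
# Dominant topic: 0 (p = 0.82)
```

Because LdaModel is trained here without a fixed `random_state`, the topics and probabilities will vary from run to run.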




In [29]:
%pip install pyLDAvis

Note: you may need to restart the kernel to use updated packages.




In [30]:
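import pyLDAvis
# Recent pyLDAvis releases expose the gensim helpers as pyLDAvis.gensim_models;
# older releases used pyLDAvis.gensim instead.
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()  # Render the visualization inline in the notebook

# Prepare visualization
lda_vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)

# Display the visualization
pyLDAvis.display(lda_vis)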

## Summary

For basic text preprocessing in Python, such as lowercasing and punctuation removal, built-in string methods and regular expressions are sufficient. NLTK suits more involved linguistic preprocessing, offering tools for tokenization, stop word removal, stemming, and lemmatization. spaCy provides fast pre-trained pipelines for tokenization, lemmatization, stop word removal, and more advanced tasks such as named entity recognition. Gensim is particularly useful for topic modeling and word embeddings. As the examples in this notebook show, these libraries combine well: text preprocessed with NLTK or spaCy can be passed to scikit-learn's CountVectorizer or TfidfVectorizer, or to a Gensim dictionary for LDA.
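
A minimal sketch of the built-in approach, using only the standard library. The helper name `basic_clean` is hypothetical, and `string.punctuation` covers ASCII punctuation only.

```python
import re
import string

def basic_clean(text):
    """Lowercase, strip ASCII punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()

print(basic_clean("Humans love their pets,  both cats and dogs."))
# humans love their pets both cats and dogs
```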




## Your Homework

Practice the sklearn CountVectorizer() and TfidfVectorizer() with your self-defined corpus. Submit your notebook to BrightSpace by 4/13 11:59 pm.

In [39]:
corpus = [
    "Artificial intelligence is transforming industries worldwide.",
    "Machine learning and deep learning are subsets of AI.",
    "Natural language processing enables machines to understand human language.",
    "AI applications include image recognition, speech processing, and autonomous vehicles.",
    "Ethical considerations are crucial in the development of AI technologies.",
    "Data is the backbone of machine learning models.",
    "Supervised learning requires labeled data for training.",
    "Unsupervised learning helps in discovering hidden patterns in data.",
    "Reinforcement learning is inspired by behavioral psychology.",
    "AI is revolutionizing healthcare, finance, and education."
]
# Corpus text generated by ChatGPT

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()

array(['ai', 'and', 'applications', 'are', 'artificial', 'autonomous',
       'backbone', 'behavioral', 'by', 'considerations', 'crucial',
       'data', 'deep', 'development', 'discovering', 'education',
       'enables', 'ethical', 'finance', 'for', 'healthcare', 'helps',
       'hidden', 'human', 'image', 'in', 'include', 'industries',
       'inspired', 'intelligence', 'is', 'labeled', 'language',
       'learning', 'machine', 'machines', 'models', 'natural', 'of',
       'patterns', 'processing', 'psychology', 'recognition',
       'reinforcement', 'requires', 'revolutionizing', 'speech',
       'subsets', 'supervised', 'technologies', 'the', 'to', 'training',
       'transforming', 'understand', 'unsupervised', 'vehicles',
       'worldwide'], dtype=object)

In [41]:
def preprocess_text(text):
    """Preprocesses a single text document using a regex and NLTK."""
    # Lowercase
    text = text.lower()

    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', text) #keep word characters and whitespace

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return ' '.join(filtered_tokens)

# Preprocess the corpus
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]

# Vectorization using TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)

# Print the TF-IDF matrix (sparse matrix)
print("TF-IDF Matrix:")
print(tfidf_matrix)

# Print the feature names (words)
print("\nFeature Names:")
print(vectorizer.get_feature_names_out())

# If you want to see the dense array representation, use toarray()
print("\nTF-IDF Dense Array:")
print(tfidf_matrix.toarray())

TF-IDF Matrix:
  (0, 2)	0.4472135954999579
  (0, 24)	0.4472135954999579
  (0, 44)	0.4472135954999579
  (0, 22)	0.4472135954999579
  (0, 48)	0.4472135954999579
  (1, 28)	0.39763979707168545
  (1, 27)	0.5555327633956749
  (1, 9)	0.4677612498987575
  (1, 40)	0.4677612498987575
  (1, 0)	0.3092972142859651
  (2, 31)	0.3207063470844535
  (2, 26)	0.641412694168907
  (2, 33)	0.2726296947467652
  (2, 13)	0.3207063470844535
  (2, 29)	0.3207063470844535
  (2, 45)	0.3207063470844535
  (2, 19)	0.3207063470844535
  (3, 0)	0.2314781013881342
  (3, 33)	0.2975937092579673
  (3, 1)	0.3500726195658407
  (3, 21)	0.3500726195658407
  (3, 20)	0.3500726195658407
  (3, 35)	0.3500726195658407
  (3, 39)	0.3500726195658407
  (3, 3)	0.3500726195658407
  :	:
  (5, 4)	0.5249787187068014
  (5, 30)	0.5249787187068014
  (6, 27)	0.26810347016766056
  (6, 8)	0.3357855442949281
  (6, 41)	0.45148881423757814
  (6, 37)	0.45148881423757814
  (6, 25)	0.45148881423757814
  (6, 43)	0.45148881423757814
  (7, 27)	0.2443529929235

In [42]:
def preprocess_text(text):
    """Preprocesses a single text document."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens

# Preprocess the corpus
processed_corpus = [preprocess_text(doc) for doc in corpus]

# Create a dictionary from the processed corpus
dictionary = corpora.Dictionary(processed_corpus)

# Create a bag-of-words representation of the corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_corpus]

# Train the LDA model
lda_model = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=15)

# Print the topics
for idx, topic in enumerate(lda_model.print_topics(num_topics=-1)):
    print('Topic {}: {}'.format(idx, topic))

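# Get topic distribution for a document.
# Note: this sentence shares no content words with the AI corpus, so its
# bag-of-words is empty and the distribution printed below is uniform.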
test_doc = "My cat loves to play in the garden with flowers"
test_doc_processed = preprocess_text(test_doc)
test_doc_bow = dictionary.doc2bow(test_doc_processed)
topic_distribution = lda_model.get_document_topics(test_doc_bow)
print("\nTopic Distribution for test document:")
print(topic_distribution)

Topic 0: (0, '0.091*"ai" + 0.062*"learning" + 0.036*"machine" + 0.036*"applications" + 0.036*"vehicles" + 0.036*"autonomous" + 0.036*"include" + 0.036*"recognition" + 0.036*"image" + 0.036*"speech"')
Topic 1: (1, '0.088*"learning" + 0.068*"data" + 0.027*"hidden" + 0.027*"unsupervised" + 0.027*"helps" + 0.027*"discovering" + 0.027*"patterns" + 0.027*"supervised" + 0.027*"requires" + 0.027*"training"')
Topic 2: (2, '0.096*"language" + 0.055*"processing" + 0.055*"human" + 0.055*"machines" + 0.055*"natural" + 0.055*"enables" + 0.055*"understand" + 0.014*"machine" + 0.014*"ai" + 0.014*"learning"')

Topic Distribution for test document:
[(0, 0.33333334), (1, 0.33333334), (2, 0.33333334)]
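
The uniform result (one third per topic) is what you would expect when the test sentence has no words in common with the training dictionary: words unknown to the dictionary are ignored, leaving the model nothing to work with. The sketch below checks the overlap with plain Python. The tiny `stop_words` set is a hypothetical stand-in for NLTK's English list, and `content_words` is an illustrative helper.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list, kept tiny on purpose
stop_words = {'my', 'to', 'in', 'the', 'with', 'is', 'of', 'and', 'are', 'for', 'by'}

def content_words(text):
    """Lowercase, tokenize on word characters, and drop the stop words."""
    return {w for w in re.findall(r'\w+', text.lower()) if w not in stop_words}

corpus = [
    "Artificial intelligence is transforming industries worldwide.",
    "Machine learning and deep learning are subsets of AI.",
    "Natural language processing enables machines to understand human language.",
    "AI applications include image recognition, speech processing, and autonomous vehicles.",
    "Ethical considerations are crucial in the development of AI technologies.",
    "Data is the backbone of machine learning models.",
    "Supervised learning requires labeled data for training.",
    "Unsupervised learning helps in discovering hidden patterns in data.",
    "Reinforcement learning is inspired by behavioral psychology.",
    "AI is revolutionizing healthcare, finance, and education.",
]

# All content words the model could have seen during training
vocabulary = set().union(*(content_words(doc) for doc in corpus))

test_words = content_words("My cat loves to play in the garden with flowers")
overlap = test_words & vocabulary

print(sorted(test_words))
print(overlap)  # an empty set means the model sees an empty document
```

A test sentence drawn from the same domain as the corpus, for example one mentioning "learning" or "data", would give a more informative topic distribution.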


In [43]:
processed_corpus = [preprocess_text(doc) for doc in corpus]

# Create a dictionary from the processed corpus
dictionary = corpora.Dictionary(processed_corpus)

# Create a bag-of-words representation of the corpus
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_corpus]

# Train the LDA model again (no random_state is set, so the topics differ from the previous run)
lda_model = LdaModel(bow_corpus, num_topics=3, id2word=dictionary, passes=15)

# Print the topics
for idx, topic in enumerate(lda_model.print_topics(num_topics=-1)):
    print('Topic {}: {}'.format(idx, topic))

# Get topic distribution for a document.
test_doc = "My cat loves to play in the garden with flowers"
test_doc_processed = preprocess_text(test_doc)
test_doc_bow = dictionary.doc2bow(test_doc_processed)
topic_distribution = lda_model.get_document_topics(test_doc_bow)
print("\nTopic Distribution for test document:")
print(topic_distribution)

Topic 0: (0, '0.040*"learning" + 0.040*"crucial" + 0.040*"development" + 0.040*"ethical" + 0.040*"considerations" + 0.040*"technologies" + 0.040*"transforming" + 0.040*"industries" + 0.040*"worldwide" + 0.040*"artificial"')
Topic 1: (1, '0.135*"learning" + 0.084*"data" + 0.058*"machine" + 0.034*"helps" + 0.034*"discovering" + 0.034*"hidden" + 0.034*"unsupervised" + 0.034*"patterns" + 0.034*"requires" + 0.034*"training"')
Topic 2: (2, '0.061*"ai" + 0.061*"language" + 0.061*"processing" + 0.035*"enables" + 0.035*"natural" + 0.035*"understand" + 0.035*"machines" + 0.035*"human" + 0.035*"recognition" + 0.035*"autonomous"')

Topic Distribution for test document:
[(0, 0.33333334), (1, 0.33333334), (2, 0.33333334)]
