# Feature Extraction Techniques – NLP

The sklearn.feature_extraction module can be used to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

Feature extraction is a process of dimensionality reduction by which an initial set of raw data is reduced to more manageable groups for processing. Large data sets typically have a large number of variables that require a lot of computing resources to process. Feature extraction is the name for methods that select and/or combine variables into features, effectively reducing the amount of data that must be processed while still accurately and completely describing the original data set.

Feature extraction essentially is the process of converting raw data into numerical features that can be processed while preserving the information in the original data set. It yields better results than if you directly apply machine learning techniques on the raw data. 

Natural Language Processing (NLP) is a branch of computer science and machine learning that deals with training computers to process large amounts of human (natural) language data. Briefly, NLP is the ability of computers to understand human language.

**Need for feature extraction techniques**: Machine learning algorithms learn from a pre-defined set of features in the training data to produce output for the test data. The main problem in language processing is that machine learning algorithms cannot work on raw text directly, so we need feature extraction techniques to convert text into a matrix (or vector) of features.

# What is NLP (Natural Language Processing)?

NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.

For example, we can use NLP to create systems like speech recognition, document summarization, machine translation, spam detection, named entity recognition, question answering, autocomplete, predictive typing and so on.

Nowadays, most of us have smartphones with speech recognition. These smartphones use NLP to understand what is said. Also, many people use laptops whose operating system has built-in speech recognition.

NLP can be divided into two overlapping subfields: natural language understanding (NLU), which focuses on semantic analysis or determining the intended meaning of text, and natural language generation (NLG), which focuses on text generation by a machine. NLP is separate from, but often used in conjunction with, speech recognition, which seeks to parse spoken language into words, turning sound into text and vice versa.

# How Does Natural Language Processing (NLP) Work?

 NLP architectures use various methods for data preprocessing, feature extraction, and modeling. Some of these processes are: 
 
**Data preprocessing**: Before a model processes text for a specific task, the text often needs to be preprocessed to improve model performance or to turn words and characters into a format the model can understand. Data-centric AI is a growing movement that prioritizes data preprocessing. Various techniques may be used in this data preprocessing:

**1. Stemming and lemmatization**: Stemming is an informal process of converting words to their base forms using heuristic rules. For example, “university,” “universities,” and “university’s” might all be mapped to the base univers. (One limitation of this approach is that “universe” may also be mapped to univers, even though universe and university don’t have a close semantic relationship.) Lemmatization is a more formal way to find roots by analyzing a word’s morphology using vocabulary from a dictionary. Lemmatization is provided by libraries like spaCy and NLTK, and NLTK also provides several stemmers.

Stemming is the process of reducing a word to its word stem by stripping affixes (prefixes and suffixes). The stem does not have to be a dictionary word, unlike the lemma that lemmatization returns. Stemming is important in natural language understanding (NLU) and natural language processing (NLP). When a new word is found, it can present new research opportunities. For example:

![2023-07-28 21_53_57-Class 15.pdf and 2 more pages - Profile 1 - Microsoft​ Edge.png](attachment:a789b95a-f404-4fd9-a011-5e92a385e7d5.png)

* Porter Stemmer: The Porter stemming algorithm (or “Porter stemmer”) removes the more common morphological endings from words in English.
* Lovins Stemmer
* Dawson Stemmer
* Krovetz Stemmer
* Xerox Stemmer
* N-Gram Stemmer
* Snowball Stemmer
* Lancaster Stemmer

**Lemmatization** tries to do it the proper way. It doesn’t just chop things off; it transforms words to their actual root. For example, the word “better” would map to “good”. It may use a dictionary such as WordNet for mappings, or some special rule-based approaches.

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

* WordNet Lemmatizer
* spaCy Lemmatizer
* TextBlob
* Gensim Lemmatizer
* TreeTagger
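
The difference between the two approaches can be sketched with a toy example. This is not the Porter algorithm or the WordNet lemmatizer: `SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names invented for illustration, and the rule list and mini-dictionary are deliberately tiny.

```python
# Toy illustration only: real stemmers use many more rules, and real
# lemmatizers consult a full dictionary such as WordNet.

SUFFIXES = ["ities", "ity", "ing", "es", "ed", "e", "s"]  # hypothetical rule list

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
LEMMA_TABLE = {"better": "good", "universities": "university", "changed": "change"}


def toy_stem(word):
    """Strip the first matching suffix (longer suffixes are listed first)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the mini-dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["university", "universities", "universe", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer collapses “university”, “universities”, and “universe” to the same string, while only the dictionary lookup can map “better” to “good”.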

![2023-07-28 22_21_14-Class 15.pdf and 2 more pages - Profile 1 - Microsoft​ Edge.png](attachment:8a296d70-0c92-40fe-8771-eca693185fcb.png)

![2023-07-28 22_23_17-Class 15.pdf and 2 more pages - Profile 1 - Microsoft​ Edge.png](attachment:598a0a68-8472-409d-b6f9-e6b237ebf140.png)

**2. Sentence segmentation**: Sentence segmentation is the process of deciding where sentences start and end; in other words, it divides a paragraph into its sentences.

This may seem obvious in languages like English, where the end of a sentence is marked by a period, but it is still not trivial. A period can be used to mark an abbreviation as well as to terminate a sentence, and in that case the period should be part of the abbreviation token itself. The process becomes even more complex in languages, such as ancient Chinese, that don’t have a delimiter that marks the end of a sentence.
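
The abbreviation problem is easy to reproduce with a naive splitter. The sketch below is an assumption-laden illustration, not how NLTK or spaCy segment text: `naive_sentence_split` is a hypothetical helper that simply splits after `.`, `!`, or `?` followed by whitespace.

```python
import re


def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


# Works when every period ends a sentence.
print(naive_sentence_split("God is Great! I won a lottery."))
# ['God is Great!', 'I won a lottery.']

# Fails on an abbreviation: "Dr." is treated as a sentence of its own.
print(naive_sentence_split("Dr. Smith arrived. He was late."))
# ['Dr.', 'Smith arrived.', 'He was late.']
```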

**3. Stop word removal**: Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise. That’s why we want to remove these irrelevant words.

Stop words usually refer to the most common words in a language, such as “and”, “the”, and “a”, but there is no single universal list of stop words; the list can change depending on your application. Stop word removal aims to remove the most commonly occurring words that don’t add much information to the text.

NLTK has a predefined list of stop words covering the most common words. If you are using it for the first time, you need to download the list with `nltk.download("stopwords")`. Once the download completes, we can import `stopwords` from `nltk.corpus` and use it to load the stop words.



In [6]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
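
Once a stop-word list is available, removal is a simple filter. The sketch below uses a small hand-written set, `STOP_WORDS`, as a hypothetical stand-in for the NLTK list so that it runs without any downloads; `remove_stop_words` is likewise a name chosen for illustration.

```python
# Hypothetical hand-written subset of an English stop-word list.
STOP_WORDS = {"i", "a", "an", "the", "and", "to", "is", "of", "in"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "I want to change the world".split()
print(remove_stop_words(tokens))  # ['want', 'change', 'world']
```

With NLTK installed, `set(stopwords.words("english"))` could be passed as `stop_words` instead.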

# Tokenization
Tokenization is a common task in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods like CountVectorizer and advanced deep-learning-based architectures like Transformers.

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.

For example, consider the sentence: “Never give up”.

The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.

**Types of Tokens**
1. Word tokens
2. Character tokens
3. Sentence tokens
4. Named entity tokens
5. Part-of-speech (POS) tags
6. Sub-word tokens

* **Word Tokenization**
Tokenization breaks raw text into small chunks, such as words or sentences, called tokens. These tokens help in understanding the context or in developing a model for NLP. Tokenization helps in interpreting the meaning of the text by analyzing the sequence of words.

For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’

Tokenization can be done to either separate words or sentences.

If the text is split into words, it is called word tokenization; the same separation done for sentences is called sentence tokenization.

In [7]:
# Simple Example of word tokenization
from nltk.tokenize import word_tokenize

#input a text
text="Hello everyone. Welcome to my Notebook"
word_tokenize(text)#print tokenized words

['Hello', 'everyone', '.', 'Welcome', 'to', 'my', 'Notebook']

* **Character Tokenization**

Character tokenization splits a piece of text into a sequence of characters. Character tokenizers handle out-of-vocabulary (OOV) words coherently by preserving the information in the word: an OOV word is broken down into characters and represented in terms of those characters. This also limits the size of the vocabulary, because the vocabulary is just the set of unique characters (for example, 26 for the lowercase English letters).

For example, consider the sentence: "Hello, how are you?"

With character tokenization, this sentence might be tokenized into individual characters:
["H", "e", "l", "l", "o", ",", " ", "h", "o", "w", " ", "a", "r", "e", " ", "y", "o", "u", "?"]

Character tokenization can be particularly useful when working with languages that don't have clear word boundaries or when dealing with tasks like transliteration, where characters need to be preserved in their original form.
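
In Python, character tokenization needs no library at all. A minimal sketch for the sentence above (the variable names are illustrative):

```python
sentence = "Hello, how are you?"

# A string is already a sequence of characters.
char_tokens = list(sentence)

# The vocabulary is the set of distinct characters seen.
vocab = sorted(set(char_tokens))

print(char_tokens)
print(len(char_tokens), "tokens,", len(vocab), "distinct characters")
```

The sentence has only four words but yields 19 character tokens, which illustrates how quickly sequence length grows with this scheme.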

**Drawbacks of Character Tokenization**

Character tokens solve the OOV problem, but the length of the input and output sentences increases rapidly because we represent a sentence as a sequence of characters. As a result, it becomes challenging to learn the relationships between characters that form meaningful words. This brings us to another approach, subword tokenization, which sits between word and character tokenization.

* **Sentence Tokenization**

Why is sentence tokenization needed when we have the option of word tokenization? Imagine you need to compute the average number of words per sentence. To accomplish such a task, you need both the NLTK sentence tokenizer and the NLTK word tokenizer to calculate the ratio. Such output serves as an important feature for machine learning, as the answer is numeric.

To tokenize sentences with the Natural Language Toolkit (NLTK), follow the steps below.

1. Import “sent_tokenize” from “nltk.tokenize”.
2. Load the text for sentence tokenization into a variable.
3. Use “sent_tokenize” on that variable.
4. Print the output.

Below, you can see an example of NLTK tokenization for sentences.

In [8]:
from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))

['God is Great!', 'I won a lottery.']
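
The average-words-per-sentence feature mentioned above can be sketched as follows. To keep the example runnable without NLTK data, it uses simple regular expressions as stand-ins for `sent_tokenize` and `word_tokenize`; `average_words_per_sentence` is a hypothetical helper, and the regex tokenizers are an assumption that will not match NLTK on harder text (NLTK's word tokenizer, for instance, also emits punctuation tokens).

```python
import re


def average_words_per_sentence(text):
    """Return the mean number of words per sentence in text."""
    # Stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Stand-in for word_tokenize: runs of word characters and apostrophes.
    word_counts = [len(re.findall(r"[\w']+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)


print(average_words_per_sentence("God is Great! I won a lottery."))  # 3.5
```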


* **Sub-word tokens**

The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.
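
The fallback behaviour can be sketched with a greedy longest-match-first split, in the spirit of WordPiece-style tokenizers. This is a simplified illustration under stated assumptions: `VOCAB` is a hypothetical hand-written vocabulary (real tokenizers learn theirs from a corpus), the `##` prefix marks a piece that continues a word, and because the toy vocabulary contains no single characters, a word that cannot be covered becomes `[UNK]`.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from a corpus.
VOCAB = {"learning", "ai", "##quest", "nl", "##p", "[UNK]"}


def subword_tokenize(word, vocab=VOCAB):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece of the vocabulary fits here
        pieces.append(match)
        start = end
    return pieces


for w in ["learning", "aiquest", "nlp", "xyz"]:
    print(w, subword_tokenize(w))
```

A common word such as “learning” keeps its own slot, while “aiquest” is split into `ai` and `##quest`.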


# Vectorizer

In Python, a vectorizer is a term often used in the context of natural language processing (NLP) and machine learning tasks to transform textual data into numerical representations (vectors). These numerical representations are necessary for many machine learning algorithms, as most of them require numerical inputs. Vectorizers help convert raw text data, such as sentences or documents, into feature vectors that can be used for training and analyzing machine learning models.

**Types of Vectorizer**:

* Bag of Words: Count Vectorizer
* TF-IDF Vectorizer
* Word2Vec
* Global Vectors for Word Representation (GloVe)


**CountVectorizer** is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. This is helpful when we have multiple such texts, and we wish to convert each word in each text into vectors (for using in further text analysis). 

CountVectorizer creates a matrix in which each unique word is represented by a column of the matrix, and each text sample from the document is a row in the matrix. The value of each cell is nothing but the count of the word in that particular text sample.  This can be visualized as follows –

![2023-07-30 21_29_35-Class 15.pdf and 2 more pages - Profile 1 - Microsoft​ Edge.png](attachment:8d5682f7-5a85-4d90-beda-d7146e7dbf31.png)
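
The matrix described above can be built by hand in a few lines. This is a plain-Python sketch of the idea, not scikit-learn's implementation: the three documents are made up for illustration, and tokenization here is a simple lowercase whitespace split (scikit-learn's default tokenizer differs, for example by ignoring single-character tokens such as “i”).

```python
from collections import Counter

# Hypothetical example documents.
docs = ["I love NLP", "I love machine learning", "NLP is fun"]

# Lowercase and split on whitespace.
tokenized = [d.lower().split() for d in docs]

# One column per unique word, in alphabetical order.
vocab = sorted({word for doc in tokenized for word in doc})

# One row per document; each cell is the count of that word in the document.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```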

**TF-IDF:** In Bag-of-Words, we count the occurrence of each word or n-gram in a document. In contrast, with TF-IDF, we weight each word by its importance. To evaluate a word’s significance, we consider two things:

**Term Frequency: How important is the word in the document?**

Suppose we have a set of English text documents and wish to rank which document is most relevant to the query “Data Science is awesome!” A simple way to start is by eliminating documents that do not contain all four words “Data”, “Science”, “is”, and “awesome”, but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document; the number of times a term occurs in a document is called its term frequency.

The weight of a term that occurs in a document is simply proportional to the term frequency.

Formula:
tf(t, d) = (count of t in d) / (number of words in d)

**Inverse Document Frequency: How important is the term in the whole corpus?**

**IDF(word in a corpus)=log(number of documents in the corpus / number of documents that include the word)**

A word is important if it occurs many times in a document. But that creates a problem. Words like “a” and “the” appear often. And as such, their TF score will always be high. We resolve this issue by using Inverse Document Frequency, which is high if the word is rare and low if the word is common across the corpus. The TF-IDF score of a term is the product of TF and IDF. 

**Term Frequency (TF)**: Term frequency specifies how frequently a term appears in a document. It can be thought of as the probability of finding a word within the document. It is the number of times a word w_i occurs in a review r_j, relative to the total number of words in the review r_j. It is formulated as:

![2023-07-30 22_31_52-Feature Extraction Techniques - NLP - GeeksforGeeks.png](attachment:2c3ee3c1-63b4-4fa2-812c-bf99d923964d.png)

**Inverse Document Frequency (IDF)**: While computing TF, all terms are considered equally important. However, certain terms, such as “is”, “of”, and “that”, may appear many times but have little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones. IDF does this by incorporating an inverse document frequency factor, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

IDF is the inverse of the document frequency, and it measures the informativeness of term t. IDF is very low for the most frequently occurring words such as stop words (because a stop word such as “is” is present in almost all of the documents, N/df gives a very low value for that word). This finally gives what we want: a relative weighting.

idf(t) = N/df

There are a few other problems with this IDF. With a large corpus, say 100,000,000 documents, the IDF value explodes; to dampen the effect, we take the log of the ratio.

At query time, when a word that is not in the vocabulary occurs, df will be 0. As we cannot divide by 0, we smooth the value by adding 1 to the denominator.

That gives the final formula:

Formula:
idf(t) = log(N/(df + 1))

TF-IDF is now the right measure to evaluate how important a word is to a document in a collection or corpus.

Formula:
tf-idf(t, d) = tf(t, d) * log(N/(df + 1))

The logarithmic factor is the inverse document frequency: the log of the number of documents divided by the number of documents that contain the term (plus one for smoothing). It determines the weight of rare words across all documents in the corpus.

Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (tf) for cat is then 3 / 100 = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Using a base-10 logarithm and the unsmoothed ratio N/df, the inverse document frequency (idf) is log(10,000,000 / 1,000) = 4. Thus, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.
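
The arithmetic of this example can be checked directly. The sketch assumes a base-10 logarithm, which is what makes the idf come out to 4; the last lines show how little the `df + 1` smoothing changes the value at this scale.

```python
import math

tf = 3 / 100                           # "cat" appears 3 times in 100 words
idf = math.log10(10_000_000 / 1_000)   # 10 million documents, 1,000 contain "cat"
tf_idf = tf * idf

print(tf, idf, round(tf_idf, 2))

# Same example with the smoothed denominator (df + 1).
smoothed_idf = math.log10(10_000_000 / (1_000 + 1))
print(round(smoothed_idf, 4))
```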

![2023-07-30 23_54_35-Class 15.pdf and 2 more pages - Profile 1 - Microsoft​ Edge.png](attachment:d495a4f6-5209-40bf-b61d-f86a80d889d2.png)

# Implementing TF-IDF in Python From Scratch :

To build TF-IDF from scratch in Python, let’s imagine these two sentences come from different documents:

first_sentence : “Data Science is the sexiest job of the 21st century”.

second_sentence : “machine learning is the key for data science”.

As a first step, we create a TF function that calculates the word frequencies for each document. The code is shown below.

First, as usual, we import the necessary libraries:

In [9]:
import pandas as pd
import sklearn as sk
import math 

# Load the sentences and combine their words into a single set:

first_sentence = "Data Science is the sexiest job of the 21st century"
second_sentence = "machine learning is the key for data science"
# Split so that each word becomes its own string
first_sentence = first_sentence.split(" ")
second_sentence = second_sentence.split(" ")
# Take the union of both word lists to remove duplicate words
total = set(first_sentence).union(set(second_sentence))
print(total)

{'21st', 'sexiest', 'the', 'of', 'Science', 'machine', 'job', 'key', 'data', 'learning', 'is', 'Data', 'science', 'for', 'century'}


In [10]:
#Now lets add a way to count the words using a dictionary key-value pairing for both sentences :

wordDictA = dict.fromkeys(total, 0) 
wordDictB = dict.fromkeys(total, 0)
for word in first_sentence:
    wordDictA[word]+=1
    
for word in second_sentence:
    wordDictB[word]+=1
    
#Now we put them in a dataframe and then view the result:

pd.DataFrame([wordDictA, wordDictB])

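| | 21st | sexiest | the | of | Science | machine | job | key | data | learning | is | Data | science | for | century |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |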


In [11]:
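def computeTF(wordDict, doc):
    """Return each word's count divided by the number of words in doc."""
    tfDict = {}
    wordCount = len(doc)  # number of words in this document
    for word, count in wordDict.items():
        tfDict[word] = count / float(wordCount)
    return tfDict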
#running our sentences through the tf function:
tfFirst = computeTF(wordDictA, first_sentence)
tfSecond = computeTF(wordDictB, second_sentence)
#Converting to dataframe for visualization
tf = pd.DataFrame([tfFirst, tfSecond])

tf

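| | 21st | sexiest | the | of | Science | machine | job | key | data | learning | is | Data | science | for | century |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.1 | 0.0 | 0.0 | 0.1 |
| 1 | 0.0 | 0.0 | 0.125 | 0.0 | 0.0 | 0.125 | 0.0 | 0.125 | 0.125 | 0.125 | 0.125 | 0.0 | 0.125 | 0.125 | 0.0 |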
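
The remaining IDF and TF-IDF steps could be completed along the following lines. This is a self-contained sketch rather than a continuation of the notebook's exact code: the helper names `compute_tf`, `compute_idf`, and `compute_tfidf` are hypothetical, and it assumes the unsmoothed formula log(N/df) with a base-10 logarithm, which is safe here because every vocabulary word occurs in at least one document. As in the cells above, “Data” and “data” are treated as different words.

```python
import math

first_sentence = "Data Science is the sexiest job of the 21st century".split(" ")
second_sentence = "machine learning is the key for data science".split(" ")
docs = [first_sentence, second_sentence]
vocab = set(first_sentence) | set(second_sentence)


def compute_tf(doc, vocab):
    """Term frequency: count of each word divided by the document length."""
    return {word: doc.count(word) / len(doc) for word in vocab}


def compute_idf(docs, vocab):
    """Unsmoothed inverse document frequency: log10(N / df)."""
    n_docs = len(docs)
    idf = {}
    for word in vocab:
        df = sum(1 for doc in docs if word in doc)  # documents containing word
        idf[word] = math.log10(n_docs / df)
    return idf


def compute_tfidf(tf, idf):
    """Multiply each word's TF by its IDF."""
    return {word: tf[word] * idf[word] for word in tf}


idf = compute_idf(docs, vocab)
tfidf_first = compute_tfidf(compute_tf(first_sentence, vocab), idf)
tfidf_second = compute_tfidf(compute_tf(second_sentence, vocab), idf)

# "the" occurs in both documents, so its score is 0; "sexiest" is unique to one.
print(tfidf_first["the"], round(tfidf_first["sexiest"], 4))
```

With only two documents, any word shared by both gets an idf of log(1) = 0, so only the words unique to one sentence receive a non-zero score.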


# TF-IDF Vs. Count

TF-IDF is often a better choice than CountVectorizer because it considers not only the frequency of the words present in the corpus but also their importance. The term "df" stands for document frequency: the number of documents in the corpus in which a given word (for example, “subfield”) is present.

**Can TF-IDF be negative?**

With the unsmoothed formula idf(t) = log(N/df), no. Since df can never exceed N, the idf is at least 0, and term frequency is never negative, so the lowest possible value is 0. With the smoothed form log(N/(df + 1)), however, a term that appears in every document gets a slightly negative idf, because N/(N + 1) is less than 1.

In AI inference and machine learning, sparsity refers to a matrix of numbers that includes many zeros or values that will not significantly impact a calculation.

# Word2Vec

**Why are word embeddings needed?**

Before we get into word2vec, let’s establish an understanding of what word embeddings are. This is important to know because the overall result and output of word2vec will be embeddings associated with each unique word passed through the algorithm.

Word embeddings are a technique where individual words are transformed into a numerical representation of the word (a vector). Each word is mapped to one vector, and this vector is learned in a way that resembles training a neural network. The vectors try to capture various characteristics of that word with regard to the overall text. These characteristics can include the semantic relationship of the word, definitions, context, etc. With these numerical representations, you can do many things, like identify similarities or dissimilarities between words.

These are integral inputs to various aspects of machine learning. A machine cannot process text in its raw form; thus, converting the text into an embedding will allow users to feed the embedding to classic machine learning models. The simplest embedding would be a one-hot encoding of text data where each vector would be mapped to a category.

**Let us consider the two sentences** “You can scale your business.” and “You can grow your business.” These two sentences have nearly the same meaning. If we build a vocabulary from these two sentences, it will consist of these words: {You, can, scale, grow, your, business}.

A one-hot encoding of these words would create a vector of length 6. The encodings for each of the words would look like this:

You: [1,0,0,0,0,0], Can: [0,1,0,0,0,0], Scale: [0,0,1,0,0,0], Grow: [0,0,0,1,0,0],

Your: [0,0,0,0,1,0], Business: [0,0,0,0,0,1]

In a 6-dimensional space, each word would occupy one of the dimensions, meaning that none of these words has any similarity with each other – irrespective of their literal meanings.
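
A short sketch makes the point concrete: the dot product of any two different one-hot vectors is 0, so “scale” and “grow” look exactly as unrelated as “scale” and “business”. The helper names `one_hot` and `dot` are chosen for illustration.

```python
vocab = ["you", "can", "scale", "grow", "your", "business"]


def one_hot(word):
    """Vector with a 1 in the word's position and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("scale"))                           # [0, 0, 1, 0, 0, 0]
print(dot(one_hot("scale"), one_hot("grow")))     # 0
print(dot(one_hot("scale"), one_hot("business"))) # 0
```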

**What is Word2Vec?**

The effectiveness of Word2Vec comes from its ability to group together vectors of similar words. Given a large enough dataset, Word2Vec can make strong estimates about a word’s meaning based on its occurrences in the text. These estimates yield word associations with other words in the corpus. For example, words like “King” and “Queen” would be very similar to one another. Algebraic operations on word embeddings can also approximate relationships between words. For example, taking the 2-dimensional embedding vector of "king", subtracting the vector of "man", and adding the vector of "woman" yields a vector that is very close to the embedding vector of "queen". Note that the values below were chosen arbitrarily.

![2023-07-31 21_48_24-Word2Vec Explained. Explaining the Intuition of Word2Vec &… _ by Vatsal _ Toward.png](attachment:94987e83-5fb1-40a5-8475-fc0891d08498.png) 

You can see that the words King and Queen are close to each other in position. 
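
The same arithmetic can be tried with hand-picked numbers. The 2-dimensional vectors below are arbitrary values invented for this sketch (they are not the values from the figure and were not produced by a trained model); `cosine` is a small helper for cosine similarity.

```python
import math

# Arbitrary hand-picked 2-D vectors, for illustration only.
emb = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# king - man + woman, computed element by element.
result = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Find the word whose vector points in the most similar direction.
nearest = max(emb, key=lambda word: cosine(result, emb[word]))
print(result, nearest)
```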

**Two main architectures underlie the success of word2vec: skip-gram and CBOW.**



# CBOW (Continuous Bag of Words)

This architecture is very similar to a feed-forward neural network. This model architecture essentially tries to predict a target word from a list of context words. The intuition behind this model is quite simple: given a phrase "Have a great day" , we will choose our target word to be “a” and our context words to be [“have”, “great”, “day”]. What this model will do is take the distributed representations of the context words to try and predict the target word.

![cbow-1.png](attachment:42a3fb0f-7b1b-4701-99bc-d1e1f68abe71.png)
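
The training examples that CBOW learns from can be sketched as (context words, target word) pairs. `cbow_pairs` is a hypothetical helper, and the `window` parameter (how many words to take on each side of the target) is an assumption of this sketch; it shows only how the pairs are formed, not the network itself.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for every position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


tokens = "have a great day".split()
for context, target in cbow_pairs(tokens, window=3):
    print(context, "->", target)
```

For the target “a”, the context is `['have', 'great', 'day']`, matching the example in the text.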

# Continuous Skip-Gram Model

The skip-gram model is a simple neural network with one hidden layer trained to predict the probability of a given word being present when an input word is present. Intuitively, the skip-gram model is the opposite of the CBOW model. In this architecture, it takes the current word as input and tries to accurately predict the words before and after this current word. This model essentially tries to learn and predict the context words around the specified input word. Based on experiments assessing the accuracy of this model, it was found that the quality of the resulting word vectors improves as the range of context words grows. However, a larger range also increases the computational complexity. The process can be described visually, as seen below.

![2023-07-31 22_39_57-Word2Vec Explained. Explaining the Intuition of Word2Vec &… _ by Vatsal _ Toward.png](attachment:f64c7999-e21b-4bfc-a32c-d6f70d95936b.png)

As seen above, given some corpus of text, a target word is selected over some rolling window. The training data for the neural network consists of pairwise combinations of that target word and all other words in the window. Once the model is trained, it can yield the probability of a word being a context word for a given target. The image below represents the architecture of the neural network for the skip-gram model.

![2023-07-31 22_41_06-Word2Vec Explained. Explaining the Intuition of Word2Vec &… _ by Vatsal _ Toward.png](attachment:ee708921-19f9-4b59-ab07-710f4de2e885.png)

Each word can be represented as a vector of size N, where N is the number of unique words in the corpus and each element corresponds to one word. During the training process, we have a pair of target and context words. The input array has 0 in all elements except for the element of the target word, which is equal to 1. The hidden layer learns the embedding representation of each word, yielding a d-dimensional embedding space. The output layer is a dense layer with a softmax activation function, and it yields a vector of the same size as the input. Each element in this vector is a probability, which indicates how strongly the associated word in the corpus is related to the target word.
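
Both the pair generation and a single forward pass can be sketched in plain Python. This is an untrained toy under stated assumptions: the weights `W_in` and `W_out` are random, the embedding size `d = 3` and the `window` size are arbitrary choices, and `skipgram_pairs` and `forward` are hypothetical helper names. It shows only the shapes involved, not a working word2vec.

```python
import math
import random

tokens = "have a great day".split()
vocab = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocab)}


def skipgram_pairs(tokens, window=1):
    """Pair each target word with every other word inside its window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


random.seed(0)
N, d = len(vocab), 3  # vocabulary size and (arbitrary) embedding size
W_in = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(N)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(d)]


def forward(target):
    """One-hot input -> d-dimensional hidden layer -> softmax over N words."""
    # Multiplying a one-hot vector by W_in just selects the target's row.
    hidden = W_in[index[target]]
    scores = [sum(hidden[k] * W_out[k][j] for k in range(d)) for j in range(N)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(skipgram_pairs(tokens))
probs = forward("great")
print(len(probs), round(sum(probs), 6))
```

After training, the rows of `W_in` would serve as the learned word embeddings.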

# Stemming in NLP

In [12]:
!pip install nltk



In [13]:
import nltk
nltk.download('punkt')  # Download the required resource (tokenizer models) 

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
word=['change','changing','changes','changed']

In [15]:
word

['change', 'changing', 'changes', 'changed']

In [16]:
from nltk.stem import PorterStemmer

In [17]:
p=PorterStemmer()

In [18]:
for w in word:
    print(p.stem(w))

chang
chang
chang
chang


In [19]:
for w in word:
    print(w, p.stem(w))

change chang
changing chang
changes chang
changed chang


In [20]:
sen = 'I want to change the world if world changed my career by changing abcd'

In [21]:
from nltk.tokenize import word_tokenize
toke = word_tokenize(sen)
toke

['I',
 'want',
 'to',
 'change',
 'the',
 'world',
 'if',
 'world',
 'changed',
 'my',
 'career',
 'by',
 'changing',
 'abcd']

In [22]:
for w in toke:
    print(w, p.stem(w))

I I
want want
to to
change chang
the the
world world
if if
world world
changed chang
my my
career career
by by
changing chang
abcd abcd


In [23]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Lemmatization in NLP

In [24]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

In [25]:
le = WordNetLemmatizer()
sent = 'This is a foo bar sentence'
pos_tag(word_tokenize(sent))

[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('foo', 'JJ'),
 ('bar', 'NN'),
 ('sentence', 'NN')]

# Tokenization in NLP

In Python, there are several libraries and tools available for performing tokenization and other NLP tasks. Here are a few examples using popular libraries.

# NLTK

NLTK (Natural Language Toolkit) is a widely used library for NLP tasks. To perform tokenization using NLTK, you need to install it first by running `pip install nltk`. Here's an example of tokenizing a sentence using NLTK:

In [26]:
from nltk.tokenize import word_tokenize, sent_tokenize

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
word_tokens = word_tokenize(sentence)
sentence_tokens = sent_tokenize(sentence)

print(word_tokens)
print(sentence_tokens)

['I', "'m", 'from', 'aiQuest', 'Intelligence', '.', 'I', 'am', 'learning', 'NLP', '.', 'It', 'is', 'fascinating', '!']
["I'm from aiQuest Intelligence.", 'I am learning NLP.', 'It is fascinating!']


# spaCy

spaCy is another powerful library for NLP. To install spaCy, you can run `pip install spacy` and then download the appropriate language model. Here's an example of tokenization using spaCy:

In [27]:
!pip install spacy



In [28]:
import spacy

nlp = spacy.load('en_core_web_sm')  # Load the English language model

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
doc = nlp(sentence)

word_tokens = [token.text for token in doc]

print(word_tokens)


['I', "'m", 'from', 'aiQuest', 'Intelligence', '.', 'I', 'am', 'learning', 'NLP', '.', 'It', 'is', 'fascinating', '!']


# Transformers

Transformers is a library built by Hugging Face that provides state-of-the-art pre-trained models for NLP. It offers various functionalities, including tokenization. To install Transformers, run `pip install transformers`. Here's an example of tokenization using Transformers:

In [29]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"
tokens = tokenizer.tokenize(sentence)

print(tokens)


['i', "'", 'm', 'from', 'ai', '##quest', 'intelligence', '.', 'i', 'am', 'learning', 'nl', '##p', '.', 'it', 'is', 'fascinating', '!']


# Named Entity Tokenization using NLTK

To perform named entity tokenization using NLTK (Natural Language Toolkit), you can use its named entity recognition (NER) functionality. Here's an example of how to extract named entity tokens from a sentence using NLTK:

In [30]:
nltk.download('averaged_perceptron_tagger')  # POS tagger model required by pos_tag

In [31]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [32]:
import nltk
nltk.download('maxent_ne_chunker')  # Download the required resource (NER models)
nltk.download('words')  # Download the required resource (word corpus) 

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "I'm from aiQuest Intelligence. I am learning NLP. It is fascinating!"

# Tokenize the sentence into words
tokens = word_tokenize(sentence)

# Perform part-of-speech tagging
pos_tags = pos_tag(tokens)

# Perform named entity recognition
ner_tags = ne_chunk(pos_tags) 

# Extract named entity tokens
named_entity_tokens = []

for chunk in ner_tags:
    # Named-entity chunks are subtrees with a label; plain tokens are (word, tag) tuples
    if hasattr(chunk, 'label'):
        named_entity_tokens.append(' '.join(c[0] for c in chunk))

print(named_entity_tokens)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /usr/share/nltk_data...
[nltk_data]   Package words is already up-to-date!
['aiQuest Intelligence', 'NLP']


# Text Vectorizer

In [33]:
import pandas as pd
df=pd.read_csv("../input/data-nlp/data_nlp.csv")
df

Unnamed: 0,test,class
0,I love Bangladesh,1
1,Could you give me an iphone?,0
2,Hello how are you?,1
3,I want to talk you.,1


# CountVectorizer

In [34]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

In [35]:
cv=CountVectorizer()

In [36]:
cv_x=cv.fit_transform(df['test'])
cv_x

<4x14 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [37]:
cv_x.toarray()

array([[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])

In [38]:
cv.get_feature_names_out()

array(['an', 'are', 'bangladesh', 'could', 'give', 'hello', 'how',
       'iphone', 'love', 'me', 'talk', 'to', 'want', 'you'], dtype=object)

In [39]:
cv_df = pd.DataFrame(cv_x.toarray(), columns=cv.get_feature_names_out(), index=df['test'])

In [40]:
cv_df

test,an,are,bangladesh,could,give,hello,how,iphone,love,me,talk,to,want,you
I love Bangladesh,0,0,1,0,0,0,0,0,1,0,0,0,0,0
Could you give me an iphone?,1,0,0,1,1,0,0,1,0,1,0,0,0,1
Hello how are you?,0,1,0,0,0,1,1,0,0,0,0,0,0,1
I want to talk you.,0,0,0,0,0,0,0,0,0,0,1,1,1,1


In [41]:
cv_df = pd.DataFrame(cv_x.toarray(), columns=cv.get_feature_names_out())

In [42]:
cv_df

Unnamed: 0,an,are,bangladesh,could,give,hello,how,iphone,love,me,talk,to,want,you
0,0,0,1,0,0,0,0,0,1,0,0,0,0,0
1,1,0,0,1,1,0,0,1,0,1,0,0,0,1
2,0,1,0,0,0,1,1,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,1,1,1,1
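
To make the matrix above less of a black box, here is a minimal pure-Python sketch of what CountVectorizer does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count. It is an illustration under those assumptions, not scikit-learn's implementation; the `tokenize` helper is hypothetical, and the four sentences are copied from the `test` column.

```python
import re
from collections import Counter

docs = [
    "I love Bangladesh",
    "Could you give me an iphone?",
    "Hello how are you?",
    "I want to talk you.",
]

# Tokens are runs of two or more word characters, mirroring CountVectorizer's
# default token_pattern, so the one-letter word "I" is dropped.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN.findall(text.lower())


# The vocabulary is the sorted set of all tokens; each word becomes one column.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# Each row counts how often every vocabulary word occurs in one document.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

This explains two details of the output: there are 14 columns rather than 15 because "I" is filtered out, and punctuation never appears as a feature.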


# TfidfVectorizer

In [43]:
tf = TfidfVectorizer()
tf_z = tf.fit_transform(df['test'])
tf_z

<4x14 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [44]:
tf_df = pd.DataFrame(tf_z.toarray(), columns=tf.get_feature_names_out(), index=df['test'])

In [45]:
tf_df

test,an,are,bangladesh,could,give,hello,how,iphone,love,me,talk,to,want,you
I love Bangladesh,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0
Could you give me an iphone?,0.430037,0.0,0.0,0.430037,0.430037,0.0,0.0,0.430037,0.0,0.430037,0.0,0.0,0.0,0.274487
Hello how are you?,0.0,0.541736,0.0,0.0,0.0,0.541736,0.541736,0.0,0.0,0.0,0.0,0.0,0.0,0.345783
I want to talk you.,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.541736,0.541736,0.541736,0.345783
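
The weights in the table can be reproduced by hand. With its default settings, TfidfVectorizer multiplies each raw count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales every row to unit L2 length. The sketch below recomputes the values in pure Python under those assumptions; the `tokenize` and `tfidf_row` helpers are hypothetical names, and the sentences are copied from the `test` column.

```python
import math
import re
from collections import Counter

docs = [
    "I love Bangladesh",
    "Could you give me an iphone?",
    "Hello how are you?",
    "I want to talk you.",
]


def tokenize(text):
    # Same default tokenization as CountVectorizer: lowercase, 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})
n_docs = len(docs)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency (the smooth_idf=True default).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]


rows = [tfidf_row(toks) for toks in tokenized]
second = dict(zip(vocab, rows[1]))
print(round(second["iphone"], 4), round(second["you"], 4))
```

"you" appears in three of the four sentences, so its idf is lower than that of a word such as "iphone", which appears in only one. That is why "you" gets the smallest weight in every row where it occurs.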


# Word2Vec

In [46]:
!pip install gensim



In [47]:
from gensim.models import Word2Vec, KeyedVectors

text_vector = [nltk.word_tokenize(test) for test in df['test']]
text_vector

[['I', 'love', 'Bangladesh'],
 ['Could', 'you', 'give', 'me', 'an', 'iphone', '?'],
 ['Hello', 'how', 'are', 'you', '?'],
 ['I', 'want', 'to', 'talk', 'you', '.']]

In [48]:
model = Word2Vec(text_vector, min_count=1)  # min_count=1 keeps words that appear only once

In [49]:
model.wv.most_similar('want')

[('an', 0.17826786637306213),
 ('I', 0.16072483360767365),
 ('give', 0.10560770332813263),
 ('how', 0.09215974807739258),
 ('iphone', 0.048910051584243774),
 ('are', 0.02700837142765522),
 ('Could', 0.007729300297796726),
 ('you', -0.03771638125181198),
 ('.', -0.04552280902862549),
 ('talk', -0.0464920699596405)]
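
The scores returned by most_similar are cosine similarities between word vectors: 1 means the vectors point in the same direction, 0 means they are unrelated, and -1 means they point in opposite directions. The sketch below shows the calculation on hypothetical 3-dimensional vectors. The words and numbers are made up for illustration and are not taken from the trained model, whose vectors have 100 dimensions by default.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen so the three cases are easy to see.
vectors = {
    "want": [1.0, 2.0, 0.0],
    "need": [2.0, 4.0, 0.0],      # same direction as "want"
    "iphone": [0.0, 0.0, 3.0],    # orthogonal to "want"
    "refuse": [-1.0, -2.0, 0.0],  # opposite direction to "want"
}

# Rank every other word by its similarity to "want", highest first.
query = vectors["want"]
ranked = sorted(
    ((word, cosine_similarity(query, vec)) for word, vec in vectors.items() if word != "want"),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(word, round(score, 6))
```

Every score in the model output above is close to zero, which is most likely because four short sentences are far too little text for Word2Vec to learn meaningful vectors; the ranking should not be read as a real measure of similarity.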