In [1]:
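import nltk
import numpy as np
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# One-time downloads of the NLTK data used below (uncomment on first run).
# Resource names can differ between NLTK versions, so treat these as a starting point.
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')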


In [2]:
text_data = "Feature Extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features). These new reduced set of features should then be able to summarize most of the information contained in the original set of features!!!. In this way, a summarised version of the original features can be created from a combination of the original set!!!Another commonly used technique to reduce the number of feature in a dataset is Feature Selection! The difference between Feature Selection and/or Feature Extraction is that feature selection aims instead to $ rank the importance of the existing features in the dataset and discard less important ones (no new features are created)?!. If you are interested in finding out more about Feature Selection, you can find more information about it in my previous article.In this article, I will walk you through how to apply Feature Extraction techniques using the Kaggle Mushroom Classification Dataset as an example??? Our objective will be to try to predict if a Mushroom is poisonous or not by looking at the given features. All the code used in this post (and more!) is available on Kaggle and on my GitHub Account."

####  ***Remove stopwords and punctuation***

In [3]:
tokenlist = word_tokenize(text_data.lower())
stopwordsList = stopwords.words('english')
finalList = []

# Remove stopwords from the token list
for x in tokenlist:
    if x not in stopwordsList:
        finalList.append(x)

stringWithoutStopwords = " ".join(finalList)
print('Original String: ',stringWithoutStopwords)

# Remove punctuation characters from the string
punc = "!()-[]{};:'\"\\,<>./?@#$%^&*_~"
for ele in punc:
    stringWithoutStopwords = stringWithoutStopwords.replace(ele, "")

# Collapse the whitespace left behind by the removed characters
stringWithoutStopwords = " ".join(stringWithoutStopwords.split())
print('Modified String: ',stringWithoutStopwords)


Original String:  feature extraction aims reduce number features dataset creating new features existing ones ( discarding original features ) . new reduced set features able summarize information contained original set features ! ! ! . way , summarised version original features created combination original set ! ! ! another commonly used technique reduce number feature dataset feature selection ! difference feature selection and/or feature extraction feature selection aims instead $ rank importance existing features dataset discard less important ones ( new features created ) ? ! . interested finding feature selection , find information previous article.in article , walk apply feature extraction techniques using kaggle mushroom classification dataset example ? ? ? objective try predict mushroom poisonous looking given features . code used post ( ! ) available kaggle github account .
Modified String:  feature extraction aims reduce number features dataset creating new features existing 
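
The same punctuation-stripping step can also be written with `str.translate`, which deletes all listed characters in one pass. This is only a sketch: the `sample` string is a hypothetical excerpt, and `string.punctuation` covers a few more characters (such as `+`, `=` and `|`) than the `punc` string used in the cell above.

```python
import string

# Hypothetical sample, shaped like the stopword-free string printed above.
sample = "feature selection and/or feature extraction ! ! ! ( no new features ) ? ! ."

# Map every ASCII punctuation character to None so translate() deletes it.
remove_punct = str.maketrans("", "", string.punctuation)
cleaned = sample.translate(remove_punct)

# Collapse the runs of whitespace left behind by the deleted characters.
cleaned = " ".join(cleaned.split())
print(cleaned)
```

As in the cell above, `and/or` collapses to `andor`, so replacing punctuation with a space instead of deleting it may be preferable when tokens are glued together by a slash or a period.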

####  ***Stemming & lemmatization***

In [4]:
# Stem each word with the Porter stemmer
porter = PorterStemmer()
StemmedString = []
for x in stringWithoutStopwords.split(' '):
    y = porter.stem(x)
    StemmedString.append(y)

StemmedString = ' '.join(StemmedString)
print('Original String: ',stringWithoutStopwords)
print('Modified String: ',StemmedString)

Original String:  feature extraction aims reduce number features dataset creating new features existing ones discarding original features new reduced set features able summarize information contained original set features way summarised version original features created combination original set another commonly used technique reduce number feature dataset feature selection difference feature selection andor feature extraction feature selection aims instead rank importance existing features dataset discard less important ones new features created interested finding feature selection find information previous articlein article walk apply feature extraction techniques using kaggle mushroom classification dataset example objective try predict mushroom poisonous looking given features code used post available kaggle github account
Modified String:  featur extract aim reduc number featur dataset creat new featur exist one discard origin featur new reduc set featur abl summar inform contain o

In [5]:
# Lemmatize each word with the WordNet lemmatizer.
# pos='v' treats every token as a verb, so noun-only forms may be left unchanged.
lem = WordNetLemmatizer()
LemmatizedString = []
for x in stringWithoutStopwords.split(' '):
    y = lem.lemmatize(x, pos='v')
    LemmatizedString.append(y)
LemmatizedString = ' '.join(LemmatizedString)
print('Original String: ',stringWithoutStopwords)
print('Modified String: ',LemmatizedString)

Original String:  feature extraction aims reduce number features dataset creating new features existing ones discarding original features new reduced set features able summarize information contained original set features way summarised version original features created combination original set another commonly used technique reduce number feature dataset feature selection difference feature selection andor feature extraction feature selection aims instead rank importance existing features dataset discard less important ones new features created interested finding feature selection find information previous articlein article walk apply feature extraction techniques using kaggle mushroom classification dataset example objective try predict mushroom poisonous looking given features code used post available kaggle github account
Modified String:  feature extraction aim reduce number feature dataset create new feature exist ones discard original feature new reduce set feature able summariz

####  ***POS tagging***

In [6]:
# Tag each token with its part of speech (POS)
print('Parts of speech :', nltk.pos_tag(tokenlist) )

Parts of speech : [('feature', 'NN'), ('extraction', 'NN'), ('aims', 'VBZ'), ('to', 'TO'), ('reduce', 'VB'), ('the', 'DT'), ('number', 'NN'), ('of', 'IN'), ('features', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('dataset', 'NN'), ('by', 'IN'), ('creating', 'VBG'), ('new', 'JJ'), ('features', 'NNS'), ('from', 'IN'), ('the', 'DT'), ('existing', 'VBG'), ('ones', 'NNS'), ('(', '('), ('and', 'CC'), ('then', 'RB'), ('discarding', 'VBG'), ('the', 'DT'), ('original', 'JJ'), ('features', 'NNS'), (')', ')'), ('.', '.'), ('these', 'DT'), ('new', 'JJ'), ('reduced', 'VBD'), ('set', 'NN'), ('of', 'IN'), ('features', 'NNS'), ('should', 'MD'), ('then', 'RB'), ('be', 'VB'), ('able', 'JJ'), ('to', 'TO'), ('summarize', 'VB'), ('most', 'JJS'), ('of', 'IN'), ('the', 'DT'), ('information', 'NN'), ('contained', 'VBN'), ('in', 'IN'), ('the', 'DT'), ('original', 'JJ'), ('set', 'NN'), ('of', 'IN'), ('features', 'NNS'), ('!', '.'), ('!', '.'), ('!', '.'), ('.', '.'), ('in', 'IN'), ('this', 'DT'), ('way', 'NN'), (',', '
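
Because `nltk.pos_tag` returns a plain list of `(token, tag)` tuples, the result can be post-processed with ordinary Python. The sketch below hard-codes the first few pairs from the output above instead of calling the tagger, and the names `tagged`, `tag_counts` and `nouns` are illustrative only.

```python
from collections import Counter

# First few (token, tag) pairs copied from the output above.
tagged = [
    ("feature", "NN"), ("extraction", "NN"), ("aims", "VBZ"),
    ("to", "TO"), ("reduce", "VB"), ("the", "DT"),
    ("number", "NN"), ("of", "IN"), ("features", "NNS"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep only the nouns: the noun tags NN and NNS both start with "NN".
nouns = [token for token, tag in tagged if tag.startswith("NN")]

print(tag_counts.most_common(1))
print(nouns)
```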

####  ***Bag of Words***

In [7]:
text_data = np.array(["Feature Extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features). These new reduced set of features should then be able to summarize most of the information contained in the original set of features!!!. In this way, a summarised version of the original features can be created from a combination of the original set!!!",
                      "Another commonly used technique to reduce the number of feature in a dataset is Feature Selection! The difference between Feature Selection and/or Feature Extraction is that feature selection aims instead to $ rank the importance of the existing features in the dataset and discard less important ones (no new features are created)?!. If you are interested in finding out more about Feature Selection, you can find more information about it in my previous article.",
                      "In this article, I will walk you through how to apply Feature Extraction techniques using the Kaggle Mushroom Classification Dataset as an example??? Our objective will be to try to predict if a Mushroom is poisonous or not by looking at the given features. All the code used in this post (and more!) is available on Kaggle and on my GitHub Account."])

In [8]:
# Bag of Words 
Count = CountVectorizer()
bag_of_words = Count.fit_transform(text_data)
pd.DataFrame(bag_of_words.toarray(), columns = Count.get_feature_names_out())


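|   | able | about | account | aims | all | an | and | another | apply | are | ... | through | to | try | used | using | version | walk | way | will | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 1 | 0 | 2 | ... | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 0 | 0 | 1 | 0 | 1 | 1 | 2 | 0 | 1 | 0 | ... | 1 | 3 | 1 | 1 | 1 | 0 | 1 | 0 | 2 | 1 |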
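
To make the table above less of a black box, here is a rough pure-Python sketch of what a bag-of-words count amounts to. It assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, columns in alphabetical order); the two-document `docs` corpus and the `tokenize` helper are hypothetical.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, shortened from the text above.
docs = [
    "Feature Extraction aims to reduce the number of features.",
    "Feature Selection aims to rank the features in a dataset.",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all tokens; each becomes one column.
vocabulary = sorted(set().union(*counts))

# One row per document, one count per vocabulary term.
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Single-character tokens such as "a" never reach the vocabulary under this pattern, which is why no column for "a" or "i" appears in the table either.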


####  ***TF-IDF***

In [9]:
# TFIDF vectorization
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
pd.DataFrame(feature_matrix.toarray(), columns = tfidf.get_feature_names_out())

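|   | able | about | account | aims | all | an | and | another | apply | are | ... | through | to | try | used | using | version | walk | way | will | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.09415 | 0.0 | 0.0 | 0.071603 | 0.0 | 0.0 | 0.055606 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.111213 | 0.0 | 0.0 | 0.0 | 0.09415 | 0.0 | 0.09415 | 0.0 | 0.0 |
| 1 | 0.0 | 0.207232 | 0.0 | 0.078803 | 0.0 | 0.0 | 0.122395 | 0.103616 | 0.0 | 0.207232 | ... | 0.0 | 0.122395 | 0.0 | 0.078803 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.157606 |
| 2 | 0.0 | 0.0 | 0.127726 | 0.0 | 0.127726 | 0.127726 | 0.150874 | 0.0 | 0.127726 | 0.0 | ... | 0.127726 | 0.226311 | 0.127726 | 0.097139 | 0.127726 | 0.0 | 0.127726 | 0.0 | 0.255451 | 0.097139 |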
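
The weights in this table can be partly reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings (raw term counts, smoothed IDF, then L2 normalisation of each row) and only covers the step before normalisation; the `smooth_idf` helper is illustrative, and the counts are taken from the "able", "aims" and "and" columns of the bag-of-words table.

```python
import math

# Term counts per document for three columns of the bag-of-words table.
counts = {
    "able": [1, 0, 0],
    "aims": [1, 1, 0],
    "and":  [1, 2, 2],
}
n_docs = 3

def smooth_idf(doc_freq, n_docs):
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed variant
    # that TfidfVectorizer is assumed to apply by default.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf = {}
for term, tf in counts.items():
    doc_freq = sum(1 for c in tf if c > 0)
    idf[term] = smooth_idf(doc_freq, n_docs)

# Unnormalised tf-idf weights for document 0.
weights = {term: counts[term][0] * idf[term] for term in counts}
print({term: round(w, 4) for term, w in weights.items()})
```

The absolute values differ from the table because the vectorizer then divides each row by its Euclidean norm over the full vocabulary, but the ratios survive: in row 0, 0.09415 / 0.055606 ≈ 1.693 for "able" versus "and", and 0.071603 / 0.055606 ≈ 1.288 for "aims" versus "and", which should match the ratios of the weights computed here.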
