# Introduction 

Unstructured text data, like the contents of a book or a tweet, is both one of the most
interesting sources of features and one of the most complex to handle. In this chapter,
we will cover strategies for transforming text into information-rich features. This is
not to say that the recipes covered here are comprehensive. There exist entire academic
disciplines focused on handling this and similar types of data, and the contents
of all their techniques would fill a small library. Despite this, there are some commonly
used techniques, and a knowledge of these will add valuable tools to our preprocessing
toolbox.

## 5.1 Cleaning Text

Most basic text cleaning can be done with Python’s core string operations, in particular **strip, replace, and split**:

In [None]:
#Create text
text_data = [" Interrobang. By Aishwarya Henriette ",
            "Parking And Going. By Karl Gautier",
            " Today Is The night. By Jarek Prakash "]

In [None]:
# Strip whitespaces
strip_whitespace = [string.strip() for string in text_data]
# Show text
strip_whitespace


In [None]:
# Remove periods
remove_periods = [string.replace(".", "") for string in strip_whitespace]
# Show text
remove_periods

We also create and apply a custom transformation function:

In [None]:
# Create function
def capitalizer(string: str) -> str:
    return string.upper()
# Apply function
[capitalizer(string) for string in remove_periods]

Finally, we can use regular expressions to perform powerful string operations:

In [None]:
# Import library
import re
# Create function
def replace_letters_with_X(string: str) -> str:
    return re.sub(r"[a-zA-Z]", "X", string)
# Apply function
[replace_letters_with_X(string) for string in remove_periods]
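
The cells above use strip and replace; split is the third core operation named at the start of this section. A minimal sketch, assuming every string follows the "Title. By Author" pattern of the sample data (the names `records`, `title`, and `author` are illustrative and not part of the original recipe):

```python
# Same sample data as in the cells above
text_data = [" Interrobang. By Aishwarya Henriette ",
             "Parking And Going. By Karl Gautier",
             " Today Is The night. By Jarek Prakash "]

records = []
for string in text_data:
    # Strip outer whitespace, then split once on the assumed ". By " separator
    title, author = string.strip().split(". By ", 1)
    records.append((title, author))

print(records)
```

Passing 1 as the second argument to split keeps any later ". By " inside an author name intact, at the cost of raising an error for strings that lack the separator entirely.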

## 5.2 Parsing and Cleaning HTML

You have text data with HTML elements and want to extract just the text. Use **Beautiful Soup’s** extensive set of options to parse and extract from HTML:


In [None]:
# Load library
from bs4 import BeautifulSoup
# Create some HTML code
html = """
       <div class='full_name'><span style='font-weight:bold'>
        Masego</span> Azra</div>
        """
# Parse html
soup = BeautifulSoup(html, "lxml")
# Find the div with the class "full_name", show text
soup.find("div", { "class" : "full_name" }).text

Despite the strange name, Beautiful Soup is a powerful Python library designed for
scraping HTML. Typically Beautiful Soup is used to scrape live websites, but we can just
as easily use it to extract text data embedded in HTML. The full range of Beautiful
Soup operations is beyond the scope of this book, but even the few methods used in
our solution show how easily we can parse HTML code to extract the data we want.
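
When Beautiful Soup is not installed, the standard library's html.parser module can cover very simple cases. The sketch below is an alternative technique, not the Beautiful Soup approach used above: it collects every text node and does not filter by class. `TextExtractor` and `extract_text` are hypothetical helper names introduced for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for every run of text between tags
        self.chunks.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces
    return " ".join("".join(parser.chunks).split())


html = """
       <div class='full_name'><span style='font-weight:bold'>
        Masego</span> Azra</div>
       """
full_name = extract_text(html)
print(full_name)
```

For anything beyond flat fragments like this one, a real parser such as Beautiful Soup is likely the better choice, since it can select elements by tag, class, or attribute.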

## 5.3 Removing Punctuation

Define a function that uses **translate** with a dictionary of punctuation characters:

In [None]:
# Load libraries
import unicodedata
import sys
# Create text
text_data = ['Hi!!!! I. Love. This. Song....',
            '10000% Agree!!!! #LoveIT',
            'Right?!?!']
# Create a dictionary of punctuation characters
punctuation = dict.fromkeys(i for i in range(sys.maxunicode)
                            if unicodedata.category(chr(i)).startswith('P'))


In [None]:
# For each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]
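
If the text is known to contain only ASCII punctuation, a lighter translation table can be built from the standard library's string.punctuation. This is a sketch of an alternative, not the Unicode-category approach above; note that string.punctuation also includes symbols such as `$`, `+`, and `^`, which fall in Unicode symbol categories rather than punctuation categories, so the two tables are not identical. The names `ascii_table` and `cleaned` are illustrative.

```python
import string

text_data = ['Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

# Map every ASCII punctuation character to None, which deletes it
ascii_table = str.maketrans('', '', string.punctuation)

cleaned = [s.translate(ascii_table) for s in text_data]
print(cleaned)
```

Building the table once and reusing it across strings is the reason translate tends to be preferred over repeated calls to replace.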

## 5.4 Tokenizing Text

Natural Language Toolkit for Python (NLTK) has a powerful set of text manipulation
operations, including word tokenizing:

In [49]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [50]:
# Load library
from nltk.tokenize import word_tokenize
# Create text
string = "The science of today is the technology of tomorrow"
# Tokenize words
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

We can also tokenize into sentences:

In [51]:
# Load library
from nltk.tokenize import sent_tokenize
# Create text
string = "The science of today is the technology of tomorrow. Tomorrow is today."
# Tokenize sentences
sent_tokenize(string)

['The science of today is the technology of tomorrow.', 'Tomorrow is today.']

Tokenization, especially word tokenization, is a common task after cleaning text data
because it is the first step in the process of turning the text into data we will use to
construct useful features.
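
To make the idea of a token concrete, here is a rough regular-expression tokenizer built only on the standard library. It is a simplified stand-in and does not reproduce what NLTK's word_tokenize does (for example, it has no special handling of contractions); `simple_tokenize` is a hypothetical name.

```python
import re


def simple_tokenize(text: str) -> list:
    # A token is either a run of word characters or a single
    # non-space, non-word character (i.e., a punctuation mark)
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("The science of today is the technology of tomorrow.")
print(tokens)
```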

## 5.5 Removing Stop Words

Given tokenized text data, you want to remove extremely common words (e.g., a, is,
of, on) that contain little informational value.

In [52]:
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [54]:
# Load library
from nltk.corpus import stopwords
# You will have to download the set of stop words the first time
# import nltk
# nltk.download('stopwords')
# Create word tokens
tokenized_words = ['i',
                   'am',
                   'going',
                   'to',
                   'go',
                   'to',
                   'the',
                   'store',
                   'and',
                   'park']
# Load stop words
stop_words = stopwords.words('english')
# Remove stop words
[word for word in tokenized_words if word not in stop_words]

['going', 'go', 'store', 'park']

While “stop words” can refer to any set of words we want to remove before processing,
frequently the term refers to extremely common words that themselves contain
little information value. NLTK has a list of common stop words that we can use to
find and remove stop words in our tokenized words.

In [56]:
# Show stop words
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

Note that **NLTK’s** stopwords assumes the tokenized words are all lowercased.
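
The lowercase assumption is easy to trip over. The sketch below uses a tiny hand-written subset of the stop word list (an assumption made so the example runs without downloading NLTK data) to show that capitalized tokens slip through unless they are lowercased first; `naive` and `normalized` are illustrative names.

```python
# Hypothetical subset of the stop word list, for illustration only
stop_words = {"i", "am", "to", "the", "and"}

tokens = ["I", "Am", "Going", "To", "The", "Store"]

# Without lowercasing, no capitalized token matches the lowercase list
naive = [word for word in tokens if word not in stop_words]

# Lowercase each token before the membership test
normalized = [word for word in (t.lower() for t in tokens)
              if word not in stop_words]

print(naive)
print(normalized)
```

Storing the stop words in a set instead of a list also makes each membership test a constant-time lookup, which matters on large corpora.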

## 5.6 Stemming Words

You have tokenized words and want to convert them into their root forms.

Use NLTK’s **PorterStemmer**:

In [57]:
# Load library
from nltk.stem.porter import PorterStemmer
# Create word tokens
tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']
# Create stemmer
porter = PorterStemmer()
# Apply stemmer
[porter.stem(word) for word in tokenized_words]

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

Stemming reduces a word to its stem by identifying and removing affixes (e.g., gerunds)
while keeping the root meaning of the word. For example, both “tradition” and
“traditional” have “tradit” as their stem, indicating that while they are different words
they represent the same general concept. By stemming our text data, we transform it
to something less readable, but closer to its base meaning and thus more suitable for
comparison across observations. NLTK’s PorterStemmer implements the widely used
Porter stemming algorithm to remove or replace common suffixes to produce the
word stem.

## 5.7 Tagging Parts of Speech

In [59]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [60]:
# Load libraries
from nltk import pos_tag
from nltk import word_tokenize
# Create text
text_data = "Chris loved outdoor running"
# Use pre-trained part of speech tagger
text_tagged = pos_tag(word_tokenize(text_data))
# Show parts  of speech
text_tagged

[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

In [61]:
# Filter words
[word for word, tag in text_tagged if tag in ['NN','NNS','NNP','NNPS'] ]

['Chris']

A more realistic situation would be that we have data where every observation contains
a tweet and we want to convert those sentences into features for individual parts
of speech (e.g., a feature with 1 if a proper noun is present, and 0 otherwise):

In [62]:
# Create text
tweets = ["I am eating a burrito for breakfast",
        "Political science is an amazing field",
        "San Francisco is an awesome city"]
# Create list
tagged_tweets = []
# Tag each word and each tweet
for tweet in tweets:
    tweet_tag = nltk.pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])


In [63]:
tagged_tweets

[['PRP', 'VBP', 'VBG', 'DT', 'NN', 'IN', 'NN'],
 ['JJ', 'NN', 'VBZ', 'DT', 'JJ', 'NN'],
 ['NNP', 'NNP', 'VBZ', 'DT', 'JJ', 'NN']]

In [64]:
# Use one-hot encoding to convert the tags into features
from sklearn.preprocessing import MultiLabelBinarizer
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [65]:
# Show feature names
one_hot_multi.classes_

array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)
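
To see what the one-hot step produces without scikit-learn, the same matrix can be rebuilt by hand from the tag lists printed above. This is a sketch under that assumption, not MultiLabelBinarizer's implementation; `classes`, `matrix`, and `has_proper_noun` are illustrative names.

```python
# Tag lists copied from the tagged_tweets output above
tagged_tweets = [['PRP', 'VBP', 'VBG', 'DT', 'NN', 'IN', 'NN'],
                 ['JJ', 'NN', 'VBZ', 'DT', 'JJ', 'NN'],
                 ['NNP', 'NNP', 'VBZ', 'DT', 'JJ', 'NN']]

# One column per distinct tag, in sorted order
classes = sorted({tag for tags in tagged_tweets for tag in tags})

# 1 if the tag occurs anywhere in the tweet, 0 otherwise
matrix = [[int(tag in tags) for tag in classes] for tags in tagged_tweets]

# A single feature: does the tweet contain a proper noun?
has_proper_noun = [int(any(tag in ('NNP', 'NNPS') for tag in tags))
                   for tags in tagged_tweets]

print(classes)
print(matrix)
print(has_proper_noun)
```

Note that the encoding records only presence, so the two NNP tags in the third tweet still yield a single 1.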

If our text is English and not on a specialized topic (e.g., medicine), the simplest solution
is to use NLTK’s pre-trained parts-of-speech tagger. However, if pos_tag is not
very accurate, NLTK also gives us the ability to train our own tagger. The major
downside of training a tagger is that we need a large corpus of text where the tag of
each word is known. Constructing this tagged corpus is labor intensive and
is probably going to be a last resort.

All that said, if we had a tagged corpus and wanted to train a tagger, the following is
an example of how we could do it. The corpus we are using is the Brown Corpus, one
of the most popular sources of tagged text. Here we use a backoff n-gram tagger,
where n sets how much context we take into account when predicting a
word’s part-of-speech tag: the word itself plus the n − 1 words before it. First we take
into account the previous two words using TrigramTagger; if that context was not seen
in training, we “back off” and take into account the
previous one word using BigramTagger, and finally if that fails we only look
at the word itself using UnigramTagger. To examine the accuracy of our tagger, we
split our text data into two parts, train our tagger on one part, and test how well it
predicts the tags of the second part:

In [66]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [67]:
# Load library
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
# Get some text from the Brown Corpus, broken into sentences
sentences = brown.tagged_sents(categories='news')


In [68]:
sentences

[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')], [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlant

In [69]:
# Split into 4000 sentences for training and 623 for testing
train = sentences[:4000]
test = sentences[4000:]
# Create backoff tagger
unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)
# Show accuracy (recent NLTK releases deprecate evaluate() in favor of accuracy())
trigram.evaluate(test)

0.8174734002697437

## 5.8 Encoding Text as a Bag of Words

In [2]:
# Load library
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Create text
text_data = np.array(['I love Brazil. Brazil!',
                    'Sweden is best',
                    'Germany beats both'])
# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
# Show feature matrix
bag_of_words

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

This output is a sparse array, which is often necessary when we have a large amount
of text. However, in our toy example we can use toarray to view a matrix of word
counts for each observation:

In [3]:
bag_of_words.toarray()

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

We can use the **get_feature_names** method to view the word associated with each feature:

In [4]:
# Show feature names
# (get_feature_names was deprecated in scikit-learn 1.0 in favor of get_feature_names_out)
count.get_feature_names()

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']

One of the most common methods of transforming text into features is by using a
bag-of-words model. Bag-of-words models output a feature for every unique word in
text data, with each feature containing a count of occurrences in observations. For
example, in our solution the sentence "I love Brazil. Brazil!" has a value of 2 in
the “brazil” feature because the word brazil appears two times.

The text data in our solution was purposely small. In the real world, a single observation
of text data could be the contents of an entire book! Since our bag-of-words
model creates a feature for every unique word in the data, the resulting matrix can
contain thousands of features. This means that the size of the matrix can sometimes
become very large in memory. Luckily, we can exploit a common characteristic
of bag-of-words feature matrices to reduce the amount of data we need to store.
Most words likely do not occur in most observations, and therefore bag-of-words feature
matrices will contain mostly 0s as values. We call these types of matrices “sparse.”
Instead of storing all values of the matrix, we can store only the nonzero values and then
assume all other values are 0. This will save us memory when we have large feature
matrices. One of the nice features of CountVectorizer is that the output is a sparse
matrix by default.

CountVectorizer comes with a number of useful parameters to make creating
bag-of-words feature matrices easy. First, while by default every feature is a word, that
does not have to be the case. Instead we can set every feature to be the combination of
two words (called a 2-gram) or even three words (3-gram). ngram_range sets the
minimum and maximum size of our n-grams. For example, (2,3) will return all
2-grams and 3-grams. Second, we can easily remove low-information filler words using
stop_words, either with a built-in list or a custom list. Finally, we can restrict the
words or phrases we want to consider to a certain list of words using vocabulary. For
example, we could create a bag-of-words feature matrix for only occurrences of country
names:

In [5]:
# Create feature matrix with arguments
count_2gram = CountVectorizer(ngram_range=(1,2),
                              stop_words="english",
                              vocabulary=['brazil'])
bag = count_2gram.fit_transform(text_data)
# View feature matrix
bag.toarray()

# View the 1-grams and 2-grams
count_2gram.vocabulary_

{'brazil': 0}
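
To make the counting and the sparsity concrete, the first feature matrix in this section can be rebuilt with only the standard library. This is a sketch, not CountVectorizer's implementation; it assumes CountVectorizer's default behavior of lowercasing and keeping tokens of two or more word characters. The names `tokenized`, `vocabulary`, `dense`, and `sparse` are illustrative.

```python
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!',
             'Sweden is best',
             'Germany beats both']

# Lowercase, then keep tokens of at least two word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in text_data]

# One feature per unique word, in sorted order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Dense matrix: one row per document, one count per vocabulary word
counts = [Counter(tokens) for tokens in tokenized]
dense = [[c[word] for word in vocabulary] for c in counts]

# Sparse form: store only the nonzero entries as {(row, column): count}
sparse = {(i, j): value
          for i, row in enumerate(dense)
          for j, value in enumerate(row) if value != 0}

print(vocabulary)
print(dense)
print(len(sparse), "stored elements")
```

Only 8 of the 24 cells are nonzero here; with a realistic vocabulary the proportion of stored values is typically far smaller, which is where the memory savings come from.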

## Generate the N-grams for the given sentence

In [1]:
import nltk
from nltk.util import ngrams
 
# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [ ' '.join(grams) for grams in n_grams]
 
data = 'A class is a blueprint for the object.'
 
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object', '.']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object', 'object .']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object', 'the object .']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object', 'for the object .']


## 5.9 Weighting Word Importance

You want a bag of words, but with words weighted by their importance to an observation.

Compare the frequency of the word in a document (a tweet, movie review, speech
transcript, etc.) with the frequency of the word in all other documents using term
frequency-inverse document frequency (tf-idf). scikit-learn makes this easy with
**TfidfVectorizer:**

In [6]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# Create text
text_data = np.array([ 'I love Brazil. Brazil!',
                        'Sweden is best',
                        'Germany beats both'])
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
# Show tf-idf feature matrix
feature_matrix

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

The output is a sparse matrix. However, if we want to view the
output as a dense matrix, we can use .toarray:

In [7]:
# Show tf-idf feature matrix as dense matrix
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

**vocabulary_** shows us the word associated with each feature index:

In [8]:
# Show feature names
tfidf.vocabulary_

{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'germany': 4,
 'beats': 0,
 'both': 2}

The more a word appears in a document, the more likely it is important to that document.
For example, if the word economy appears frequently, it is evidence that the
document might be about economics. We call this term frequency (tf).

In contrast, if a word appears in many documents, it is likely less important to any
individual document. For example, if every document in some text data contains the
word after then it is probably an unimportant word. We call this document frequency
(df).

By combining these two statistics, we can assign a score to every word representing
how important that word is in a document. Specifically, we multiply tf by the inverse
of document frequency (idf):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

where t is a word and d is a document. There are a number of variations in how tf and
idf are calculated. In scikit-learn, tf is simply the number of times a word appears in
the document and idf is calculated (with the default smooth_idf=True) as:

$$\text{idf}(t) = \ln \frac{1 + n_d}{1 + \text{df}(d, t)} + 1$$

where $n_d$ is the number of documents and df(d,t) is term t’s document frequency (i.e., the
number of documents where the term appears).

By default, scikit-learn then normalizes the tf-idf vectors using the Euclidean norm
(L2 norm). The higher the resulting value, the more important the word is to a document.
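
The calculation can be checked by hand on the three example documents. The sketch below assumes scikit-learn's default smoothed idf, ln((1 + n_d) / (1 + df)) + 1, raw counts for tf, L2 normalization, and the default tokenization (lowercased tokens of two or more word characters). It is an illustration, not TfidfVectorizer's implementation; `tfidf_vector` and `weights` are hypothetical names.

```python
import math
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!',
             'Sweden is best',
             'Germany beats both']

tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in text_data]
n_d = len(text_data)

# Document frequency: number of documents in which each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    tf = Counter(tokens)
    # tf times smoothed idf (assumed scikit-learn default)
    raw = {t: tf[t] * (math.log((1 + n_d) / (1 + df[t])) + 1) for t in tf}
    # Scale the vector to unit Euclidean (L2) length
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {t: value / norm for t, value in raw.items()}


weights = [tfidf_vector(tokens) for tokens in tokenized]
print(weights[0])
```

Every term in this toy corpus appears in exactly one document, so all terms share the same idf and the normalized weights depend only on the counts: brazil (count 2) gets about 0.894 and love (count 1) about 0.447, matching the dense matrix shown earlier.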