# Introduction:
> - Unstructured text data, like the contents of a book or a tweet, is both one of the most
interesting sources of features and one of the most complex to handle.
> - For transforming text into information-rich features we use
some out-of-the-box features (termed embeddings) that have become increasingly
ubiquitous in tasks that involve natural language processing **(NLP)**.

# Cleaning Text
> - Some text data will need to be cleaned before we can use it to build features, or
be preprocessed in some way prior to being fed into an algorithm.
> - Most basic text cleaning can be completed using Python’s standard string operations.
> - In the real world, we will most likely define a custom cleaning function (e.g., capitalizer)
combining some cleaning tasks and apply that to the text data.
>> -  Although cleaning
strings can remove some information, it makes the data much easier to work with.

In [1]:
# If you have some unstructured text data and want to complete some basic cleaning.
# In the following example, we look at the text for three books and clean it by using
# Python’s core string operations, in particular strip and replace:
# Create text
text_data = [" Interrobang. By Aishwarya Henriette ",
 "Parking And Going. By Karl Gautier",
" Today Is The night. By Jarek Prakash "]
# Strip whitespaces
strip_whitespace = [string.strip() for string in text_data]
# Show text
strip_whitespace

# Remove periods
remove_periods = [string.replace(".", "") for string in strip_whitespace]
# Show text
remove_periods

['Interrobang By Aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [2]:
# We also create and apply a custom transformation function:
# Create function
def capitalizer(string: str) -> str:
 return string.upper()
# Apply function
[capitalizer(string) for string in remove_periods]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']
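
The separate steps above can be folded into one reusable cleaning function. The following is a minimal sketch: the name `clean_title` and the extra whitespace-collapsing step are illustrative assumptions, not part of the original recipe.

```python
import re

def clean_title(string: str) -> str:
    """Apply several basic cleaning steps in a fixed order."""
    string = string.strip()                # remove leading/trailing whitespace
    string = string.replace(".", "")       # remove periods
    string = re.sub(r"\s+", " ", string)   # collapse whitespace runs (assumed extra step)
    return string.upper()                  # normalize case

text_data = [" Interrobang. By Aishwarya Henriette ",
             "Parking And Going. By Karl Gautier",
             " Today Is The night. By Jarek Prakash "]

cleaned = [clean_title(string) for string in text_data]
print(cleaned)
```

Keeping the steps in one function makes it easier to apply exactly the same cleaning to training data and to new text later on.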

In [3]:
# Finally, we can use regular expressions to make powerful string operations:
# Import library
import re
# Create function
def replace_letters_with_X(string: str) -> str:
 return re.sub(r"[a-zA-Z]", "X", string)
# Apply function
[replace_letters_with_X(string) for string in remove_periods]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']

In [5]:
# Define a string
s = "machine learning in python cookbook"
# Find the first index of the letter "n"
find_n = s.find("n")
print(find_n)
# Whether or not the string starts with "m"
starts_with_m = s.startswith("m")
print(starts_with_m)
# Whether or not the string ends with "python"
ends_with_python = s.endswith("python")
print(ends_with_python)
# Is the string alphanumeric
is_alnum = s.isalnum()
print(is_alnum)
# Is it composed of only alphabetical characters (not including spaces)
is_alpha = s.isalpha()
print(is_alpha)
# Encode as utf-8
encode_as_utf8 = s.encode("utf-8")
print(encode_as_utf8)
# Decode the same utf-8
decode = encode_as_utf8.decode("utf-8")
print(decode)

5
True
False
False
False
b'machine learning in python cookbook'
machine learning in python cookbook


# Parsing and Cleaning HTML
> - Despite the strange name, `Beautiful Soup` is a powerful Python library designed for
scraping HTML.
>> - Typically Beautiful Soup is used to process HTML during live web
scraping, but we can just as easily use it to extract text data embedded in static
HTML.
>>> - The example below shows how easy it can be to parse HTML
and extract information from specific tags using find().

In [6]:
# If you have text data with HTML elements and want to extract just the text,
# use Beautiful Soup’s extensive set of options to parse and extract from HTML:
# Load library
from bs4 import BeautifulSoup
# Create some HTML code
html = "<div class='full_name'>"\
 "<span style='font-weight:bold'>Masego"\
 "</span> Azra</div>"
# Parse html
soup = BeautifulSoup(html, "lxml")
# Find the div with the class "full_name", show text
soup.find("div", { "class" : "full_name" }).text
# 'Masego Azra'

'Masego Azra'

# Removing Punctuation
> - The Python translate method is popular due to its speed.
>> 1. First
we create a dictionary, punctuation, with the code points of all punctuation characters according to
Unicode as its keys and None as its values.
>> 2. Next we translate all characters in the
string that are in punctuation into None, effectively removing them. There are more
readable ways to remove punctuation, but this somewhat hacky solution has the
advantage of being far faster than alternatives.

> **Note:** Punctuation carries information (e.g.,
“Right?” versus “Right!”). Depending on the downstream task we’re trying
to accomplish, we might want to keep some of it
(e.g., using a “?” to classify if some text contains a question).
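
When some punctuation matters for the downstream task, the same `translate` approach can skip the characters we want to keep. This is a minimal sketch; the set `KEEP` is a hypothetical choice made only for illustration.

```python
import sys
import unicodedata

# Hypothetical set of punctuation marks we want to preserve
KEEP = {"?", "!"}

# Map every Unicode punctuation code point to None, except the ones in KEEP
punctuation_to_remove = {
    i: None
    for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P") and chr(i) not in KEEP
}

text_data = ['Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

cleaned = [string.translate(punctuation_to_remove) for string in text_data]
print(cleaned)
```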

In [7]:
# If you have a feature of text data and want to remove punctuation.
# Define a function that uses translate with a dictionary of punctuation characters:
# Load libraries
import unicodedata
import sys
# Create text
text_data = ['Hi!!!! I. Love. This. Song....',
 '10000% Agree!!!! #LoveIT',
 'Right?!?!']
# Create a dictionary of punctuation characters
punctuation = dict.fromkeys(
 (i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P')), None)
# For each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']

# Tokenizing Text
> Tokenization, especially word tokenization, is a common task after cleaning text data
because it is the first step in the process of turning the text into data we will use
to construct useful features.
> > Some pretrained NLP models (such as `Google’s BERT`)
utilize model-specific tokenization techniques; however, word-level tokenization is
still a fairly common tokenization approach before getting features from individual
words.
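
To make the idea of word-level tokenization concrete, here is a rough regular-expression sketch. It is not the algorithm NLTK uses and it ignores many edge cases (contractions, abbreviations, URLs); the function name `simple_word_tokenize` is made up for this example.

```python
import re

def simple_word_tokenize(text: str) -> list:
    # \w+ matches a run of word characters (a word or number);
    # [^\w\s] matches a single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("The science of today is the technology of tomorrow.")
print(tokens)
```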

In [8]:
# If you have text and want to break it up into individual words.
# Natural Language Toolkit for Python (NLTK) has a powerful set of text manipulation
# operations, including word tokenizing:
# Load library
from nltk.tokenize import word_tokenize
# Create text
string = "The science of today is the technology of tomorrow"
# Tokenize words
word_tokenize(string)
# ['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [9]:
# We can also tokenize into sentences:
# Load library
from nltk.tokenize import sent_tokenize
# Create text
string = "The science of today is the technology of tomorrow. Tomorrow is today."
# Tokenize sentences
sent_tokenize(string)
# ['The science of today is the technology of tomorrow.', 'Tomorrow is today.']

['The science of today is the technology of tomorrow.', 'Tomorrow is today.']

# Removing Stop Words
> While “stop words” can refer to any set of words we want to remove before processing, frequently the term refers to extremely common words that themselves contain
little information value.
> > Whether or not you choose to remove stop words will
depend on your individual use case.
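
Because "stop words" can be any set of words we choose, a custom list works the same way as a prebuilt one. The sketch below uses a small, hypothetical stop-word set (stored as a `set` so membership checks stay fast) rather than any library's list.

```python
# Hypothetical, hand-picked stop words for this example only
custom_stop_words = {"i", "am", "to", "the", "and"}

tokenized_words = ['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']

# Keep only the tokens that are not in the stop-word set
filtered_words = [word for word in tokenized_words
                  if word.lower() not in custom_stop_words]
print(filtered_words)
```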

In [10]:
# Given tokenized text data, if you want to remove extremely common words (e.g., a, is, of, on)
# that contain little informational value, use NLTK’s stopwords:
# Load library
from nltk.corpus import stopwords
# You will have to download the set of stop words the first time
# import nltk
# nltk.download('stopwords')
# Create word tokens
tokenized_words = ['i',
 'am',
 'going',
 'to',
 'go',
 'to',
 'the',
 'store',
 'and',
 'park']
# Load stop words
stop_words = stopwords.words('english')
# Remove stop words
[word for word in tokenized_words if word not in stop_words]

['going', 'go', 'store', 'park']

In [11]:
# NLTK has a list of common stop words that we can use to find and remove stop words in our tokenized words:
# Show stop words
stop_words[:5]

['i', 'me', 'my', 'myself', 'we']

# Stemming Words
> Stemming reduces a word to its stem by identifying and removing affixes (e.g., gerunds)
while keeping the root meaning of the word. For example, both “tradition” and
“traditional” have “tradit” as their stem, indicating that while they are different words,
they represent the same general concept.
> >By stemming our text data, we transform
it to something less readable but closer to its base meaning and thus more suitable
for comparison across observations.
> >> - NLTK’s `PorterStemmer` implements the widely
used Porter stemming algorithm to remove or replace common suffixes to produce
the word stem.
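
The core idea of suffix stripping can be illustrated with a deliberately naive sketch. This is not the Porter algorithm, which applies many ordered rules with extra conditions; the tiny `SUFFIXES` rule set and the `naive_stem` helper are assumptions made purely for illustration.

```python
# Hypothetical, very small rule set; real stemmers use far more rules
SUFFIXES = ("ional", "ion", "ing", "ed", "s")

def naive_stem(word: str) -> str:
    # Try longer suffixes first and keep at least three characters of the word
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(word) for word in ["tradition", "traditional", "meeting", "humbled"]]
print(stems)
```

Both "tradition" and "traditional" collapse to the same stem here, which is what makes stemmed text easier to compare across observations.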

In [12]:
# If you have tokenized words and want to convert them into their root forms, use NLTK’s PorterStemmer:
# Load library
from nltk.stem.porter import PorterStemmer
# Create word tokens
tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']
# Create stemmer
porter = PorterStemmer()
# Apply stemmer
[porter.stem(word) for word in tokenized_words]
# ['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

# Tagging Parts of Speech
> If our text is English and not on a specialized topic (e.g., medicine) the simplest
solution is to use NLTK’s pretrained parts-of-speech tagger.
> - However, if pos_tag is not very accurate, NLTK also gives us the ability to train our own tagger.
>> - The major
downside of training a tagger is that we need a large corpus of text where the tag of
each word is known.
>>> Constructing this tagged corpus is obviously labor intensive and
is probably going to be a last resort.

In [14]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\kaveh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [15]:
# If you have text data and want to tag each word or character with its part of speech,
# use NLTK’s pretrained parts-of-speech tagger:
# Load libraries
from nltk import pos_tag
from nltk import word_tokenize
# Create text
text_data = "Chris loved outdoor running"
# Use pretrained part of speech tagger
text_tagged = pos_tag(word_tokenize(text_data))
# Show parts of speech
text_tagged
# [('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

> - The output is a list of tuples with the word and the tag of the part of speech.
> - NLTK uses the Penn Treebank parts-of-speech tags.
>> - Some examples of the Penn Treebank tags are:

>> | Tag | Part of speech |
>> | --- | --- |
>> | NNP | Proper noun, singular |
>> | NN | Noun, singular or mass |
>> | RB | Adverb |
>> | VBD | Verb, past tense |
>> | VBG | Verb, gerund or present participle |
>> | JJ | Adjective |
>> | PRP | Personal pronoun |

In [16]:
# Once the text has been tagged, we can use the tags to find certain parts of speech. 
# For example, here are all nouns:
# Filter words
[word for word, tag in text_tagged if tag in ['NN','NNS','NNP','NNPS'] ]
# ['Chris']

['Chris']

In [17]:
# A more realistic situation would be to have data where every observation contains
# a tweet, and we want to convert those sentences into features for individual parts of
# speech (e.g., a feature with 1 if a proper noun is present, and 0 otherwise):

# Import libraries
from sklearn.preprocessing import MultiLabelBinarizer
# Create text
tweets = ["I am eating a burrito for breakfast",
 "Political science is an amazing field",
 "San Francisco is an awesome city"]
# Create list
tagged_tweets = []
# Tag each word and each tweet
for tweet in tweets:
 tweet_tag = nltk.pos_tag(word_tokenize(tweet))
 tagged_tweets.append([tag for word, tag in tweet_tag])

# Use one-hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)
# array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
#  [1, 0, 1, 1, 0, 0, 0, 0, 1],
#  [1, 0, 1, 1, 1, 0, 0, 0, 1]])

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [19]:
# Using classes_ we can see that each feature is a part-of-speech tag:
# Show feature names
one_hot_multi.classes_
# array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'], dtype=object)

array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)
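
What the one-hot step does to the tags can be sketched in plain Python. The tag lists below are written out by hand so that they agree with the feature matrix above; they are an assumption, and a real tagger's output may differ between NLTK versions.

```python
# Hand-written tags for the three tweets (assumed, chosen to match the matrix above)
tagged_tweets = [
    ["PRP", "VBP", "VBG", "DT", "NN", "IN", "NN"],   # I am eating a burrito for breakfast
    ["JJ", "NN", "VBZ", "DT", "JJ", "NN"],           # Political science is an amazing field
    ["NNP", "NNP", "VBZ", "DT", "JJ", "NN"],         # San Francisco is an awesome city
]

# One column per distinct tag, in sorted order
classes = sorted({tag for tags in tagged_tweets for tag in tags})

# 1 if the tag occurs anywhere in the tweet, 0 otherwise
matrix = [[int(tag in tags) for tag in classes] for tags in tagged_tweets]

# A single feature: does the tweet contain a proper noun?
has_proper_noun = [row[classes.index("NNP")] for row in matrix]
print(classes)
print(matrix)
print(has_proper_noun)
```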

# Performing Named-Entity Recognition
> Named-entity recognition is the process of recognizing specific entities from text.
>> Tools like `spaCy` offer preconfigured pipelines, and even pretrained or fine-tuned
machine learning models that can easily identify these entities.
>>> In this case, we
use spaCy to identify a person (“Elon Musk”), organization (“Twitter”), and money
value (“21B”) from the raw text.
>>> - Using this information, we can extract structured
information from the unstructured textual data. This information can then be used in
downstream machine learning models or data analysis.

In [1]:
# If you want to perform named-entity recognition in freeform text (finding entities such as “Person,” “State,” etc.),
# use spaCy’s default named-entity recognition pipeline and models to extract entities from text:
# Import libraries
import spacy

In [3]:
spacy.prefer_gpu()
# Load the spaCy model and use it to parse the text
# (make sure you have run "python -m spacy download en_core_web_trf")
nlp = spacy.load("en_core_web_trf")
doc = nlp("Elon Musk offered to buy Twitter using $21B of his own money.")
# Print each entity
print(doc.ents)
# For each entity print the text and the entity label
for entity in doc.ents:
    print(entity.text, entity.label_, sep=",")
# (Elon Musk, Twitter, 21B)
# Elon Musk, PERSON
# Twitter, ORG
# 21B, MONEY

(Elon Musk, Twitter, 21B)
Elon Musk,PERSON
Twitter,ORG
21B,MONEY
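
Once the entities are available as (text, label) pairs, turning them into a structured record is straightforward. The sketch below hard-codes the pairs printed above instead of calling spaCy, so it runs without the model; the variable names are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
entities = [("Elon Musk", "PERSON"), ("Twitter", "ORG"), ("21B", "MONEY")]

# Group entity texts by their label
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

record = dict(by_label)
print(record)
```

A record like this can be stored as columns in a table or used to build features such as "mentions an organization".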


# Encoding Text as a Bag of Words
> - One of the most common methods of transforming text into features is using a
`bag-of-words` model.
>> - Bag-of-words models output a feature for every unique word
in text data, with each feature containing a count of occurrences in observations.
>>> For example, in the following code examples, the sentence “I love Brazil. Brazil!” has a value of 2 in the
“brazil” feature because the word brazil appears two times.
>_______________________________________________

> Most words likely do not occur in most observations, and therefore bag-of-words
feature matrices will contain mostly 0s as values. We call these types of matrices sparse:
> > - Instead of storing all values of the matrix, we can store only nonzero values
and then assume all other values are 0.
>>- This will save memory when we have large
feature matrices.
>>> One of the nice features of `CountVectorizer` is that the output is a
sparse matrix by `default`.
> ____________________________________________________
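
The counting and the sparse storage described above can be sketched with the standard library alone. This is an illustration of the idea, not how `CountVectorizer` is implemented; the dictionary-of-nonzeros layout is an assumed stand-in for a real sparse matrix format.

```python
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both']

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Word counts per observation
counts = [Counter(tokenize(doc)) for doc in text_data]

# One feature per unique word, in sorted order
vocabulary = sorted(set().union(*counts))

# Dense matrix: every value stored, including the zeros
dense = [[c[word] for word in vocabulary] for c in counts]

# Sparse alternative: store only the nonzero values, keyed by (row, column)
sparse = {(i, j): value
          for i, row in enumerate(dense)
          for j, value in enumerate(row) if value}

print(vocabulary)
print(dense)
print(len(sparse), "stored elements")
```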

> `CountVectorizer` comes with a number of useful parameters to make it easy to create
bag-of-words feature matrices.
> 1. By default every feature is a word, but that
does not have to be the case. Instead we can set every feature to be the combination
of two words (called a `2-gram`) or even three words (`3-gram`).
> 2. `ngram_range` sets the
minimum and maximum size of our n-grams. For example, `(2,3)` will return all
2-grams and 3-grams.
> 3. We can easily remove low-information filler words by
using `stop_words`, either with a built-in list or a custom list.
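
How a range of n-gram sizes expands a token list can be sketched as follows. The helper `word_ngrams` is a hypothetical function written for this example; it mirrors the meaning of `ngram_range` but is not scikit-learn code.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all n-grams with min_n <= n <= max_n, joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["sweden", "is", "best"]
grams_2_3 = word_ngrams(tokens, ngram_range=(2, 3))
print(grams_2_3)
```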

In [8]:
# If you have text data and want to create a set of features 
# indicating the number of times an observation’s text contains a particular word,
# use scikit-learn’s CountVectorizer:
# Load library
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Create text
text_data = np.array(['I love Brazil. Brazil!',
 'Sweden is best',
'Germany beats both'])
# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
# Show feature matrix
bag_of_words
# <3x8 sparse matrix of type '<class 'numpy.int64'>'
#  with 8 stored elements in Compressed Sparse Row format>

# This output is a sparse array, which is often necessary when we have a large amount of text.

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [9]:
# However, in our toy example we can use toarray to view a matrix of word
# counts for each observation:
bag_of_words.toarray()
# array([[0, 0, 0, 2, 0, 0, 1, 0],
#  [0, 1, 0, 0, 0, 1, 0, 1],
#  [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

In [10]:
# We can use the get_feature_names_out method to view the word associated with each feature:
# Show feature names
count.get_feature_names_out()
# array(['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love',
#  'sweden'], dtype=object)

array(['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love',
       'sweden'], dtype=object)

- **Note:** The `I` from “I love Brazil” is not considered a token because the default
`token_pattern` only considers tokens of two or more alphanumeric characters.

In [11]:
# Finally, we can restrict the words or phrases we want to consider to a certain list of words using vocabulary
# For example, we could create a bag-of-words feature matrix only for occurrences of country names:
# Create feature matrix with arguments
count_2gram = CountVectorizer(ngram_range=(1,2), stop_words="english", vocabulary=['brazil'])
bag = count_2gram.fit_transform(text_data)
# View feature matrix
bag.toarray()
# array([[2],
#  [0],
#  [0]])

array([[2],
       [0],
       [0]], dtype=int64)

In [13]:
# View the 1-grams and 2-grams
count_2gram.vocabulary_
# {'brazil': 0}

{'brazil': 0}

# Weighting Word Importance
> - The more a word appears in a document, the more likely it is that the word is
important to that document.
>> - If the word economy appears frequently, it
is evidence that the document might be about economics. We call this `term frequency (tf)`.
>___________________________
> - In contrast, if a word appears in many documents, it is likely less important to any
individual document.
>> - If every document in some text data contains the
word “after”, then it is probably an unimportant word. We call this `document frequency
(df)`.
>___________________________________________________
> By combining these two statistics, we can assign a score to every word representing
how important that word is in a document.
>> - Specifically, we multiply tf by the inverse
of document frequency (idf):<br>
$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$
>>> where $t$ is a word (term) and $d$ is a document. There are a number of variations in
how tf and idf are calculated. In scikit-learn, `tf` is simply the number of times a word
appears in the document, and `idf` is calculated (with the default `smooth_idf=True`) as:<br>
$\text{idf}(t) = \log\dfrac{1 + n_d}{1 + \text{df}(d, t)} + 1$
>>>> where $n_d$ is the number of documents, $\text{df}(d, t)$ is term $t$’s document frequency
(i.e., the number of documents where the term appears), and $\log$ is the natural logarithm.

>> - **Note:** By default, scikit-learn then normalizes the tf‐idf vectors using the Euclidean norm
(L2 norm). The higher the resulting value, the more important the word is to a
document.
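
The formula and the normalization step can be worked through by hand on a three-document toy corpus. The sketch below uses only the standard library and assumes the smoothed idf with a natural logarithm plus L2 normalization; it is an illustration, not scikit-learn's implementation.

```python
import math
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both']

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(doc)) for doc in text_data]   # tf: raw counts per document
n_docs = len(docs)
vocabulary = sorted(set().union(*docs))

# df: number of documents containing each term
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# idf(t) = ln((1 + n_d) / (1 + df(d, t))) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    raw = [doc[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))   # Euclidean (L2) norm
    return [value / norm for value in raw]

matrix = [tfidf_row(doc) for doc in docs]
for row in matrix:
    print([round(value, 4) for value in row])
```

In the first document "brazil" appears twice and "love" once, so after normalization the "brazil" weight is exactly twice the "love" weight.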

In [14]:
# If you want a bag of words with words weighted by their importance to an observation.
# Compare the frequency of the word in a document (a tweet, movie review, speech
# transcript, etc.) with the frequency of the word in all other documents using term
# frequency-inverse document frequency (tf‐idf). 
# scikit-learn makes this easy with TfidfVectorizer:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# Create text
text_data = np.array(['I love Brazil. Brazil!',
 'Sweden is best',
'Germany beats both'])
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
# Show tf-idf feature matrix
feature_matrix

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [15]:
# The output is a sparse matrix. If we want to view the
# output as a dense matrix, we can use toarray:
# Show tf-idf feature matrix as dense matrix
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

In [17]:
# vocabulary_ shows us the word of each feature:
# Show feature names
tfidf.vocabulary_
# {'love': 6,
#  'brazil': 3,
#  'sweden': 7,
#  'is': 5,
#  'best': 1,
#  'germany': 4,
#  'beats': 0,
#  'both': 2}

{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'germany': 4,
 'beats': 0,
 'both': 2}

# Using Text Vectors to Calculate Text Similarity in a Search Query
> Text vectors are incredibly useful for NLP use cases such as search engines.
> > After
calculating the `tf‐idf vectors` of a set of sentences or documents, we can use the same
tfidf object to vectorize future sets of text.<br>
> > Then, we can compute cosine similarity
between our input vector and the matrix of other vectors and sort by the most
relevant documents.
>_________________________________
> Because tf‐idf values are never negative, these `cosine similarities` fall in the range `[0, 1.0]`, with `0 being least similar and 1 being
most similar`.
> > Since we’re using tf‐idf vectors to compute the similarity between vectors,
the frequency of a word’s occurrence is also taken into account.
> > > - However, with a small
corpus (set of documents) even “frequent” words may not appear frequently.
>>> - Since the query “Brazil is the best” mentions Brazil, we might expect “I love Brazil. Brazil!” to be
the most relevant; however, “Sweden is best” is the most similar due to the words “is”
and “best”.
>> - **Note:** As the number of documents we add to our corpus increases, less important
words will be weighted less and have less effect on our cosine similarity calculation.
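
The ranking described above can be reproduced by hand. The following sketch assumes smoothed idf and L2-normalized vectors, so the cosine similarity reduces to a plain dot product; `vectorize` and `dot` are illustrative helpers, not library functions.

```python
import math
import re
from collections import Counter

corpus = ['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both']

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*docs))
idf = {term: math.log((1 + len(docs)) / (1 + sum(term in doc for doc in docs))) + 1
       for term in vocabulary}

def vectorize(text):
    # Words outside the fitted vocabulary (e.g., "the") are simply ignored
    counts = Counter(tokenize(text))
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw)) or 1.0
    return [value / norm for value in raw]

def dot(u, v):
    # For unit-length vectors the dot product equals the cosine similarity
    return sum(a * b for a, b in zip(u, v))

query = vectorize("Brazil is the best")
scores = [dot(query, vectorize(doc)) for doc in corpus]

# Most similar documents first
ranked = sorted(zip(corpus, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(round(score, 4), doc)
```

The query shares two of its three in-vocabulary words with "Sweden is best" but only one with "I love Brazil. Brazil!", which explains the ordering.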

In [18]:
# If you want to use tf‐idf vectors to implement a text search function in Python.
# Calculate the cosine similarity between tf‐idf vectors using scikit-learn:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
# Create searchable text data
text_data = np.array(['I love Brazil. Brazil!',
 'Sweden is best',
'Germany beats both'])
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
# Create a search query and transform it into a tf-idf vector
text = "Brazil is the best"
vector = tfidf.transform([text])
# Calculate the cosine similarities between the input vector and all other vectors
# (TfidfVectorizer L2-normalizes its output by default, so the dot product
# computed by linear_kernel equals the cosine similarity)
cosine_similarities = linear_kernel(vector, feature_matrix).flatten()
# Get the indices of the most relevant items in order (at most the top 9)
related_doc_indices = cosine_similarities.argsort()[:-10:-1]
# Print the most similar texts to the search query along with the cosine similarity
print([(text_data[i], cosine_similarities[i]) for i in related_doc_indices])
# [('Sweden is best', 0.6666666666666666),
#  ('I love Brazil. Brazil!', 0.5163977794943222),
#  ('Germany beats both', 0.0)]

[('Sweden is best', 0.6666666666666666), ('I love Brazil. Brazil!', 0.5163977794943222), ('Germany beats both', 0.0)]


# Using a Sentiment Analysis Classifier
> The `transformers` library is an extremely popular library for NLP tasks and contains
a number of easy-to-use APIs for training models or using `pretrained` ones. 

In [22]:
# If you want to classify the sentiment of some text to use as a feature or in downstream data analysis,
# use the transformers library’s sentiment classifier.
# Import libraries
from transformers import pipeline
# Create an NLP pipeline that runs sentiment analysis
classifier = pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
# Classify some text
# (this may download some data and models the first time you run it)
sentiment_1 = classifier("I hate machine learning! It's the absolute worst.")
sentiment_2 = classifier(
 "Machine learning is the absolute "
 "bees knees I love it so much!"
)
# Print sentiment output
print(sentiment_1, sentiment_2)
# [{'label': 'NEGATIVE', 'score': 0.9998020529747009}]
# [{'label': 'POSITIVE', 'score': 0.9995730519294739}]
# (exact scores may vary with the model and library versions)

[{'label': 'NEGATIVE', 'score': 0.9998020529747009}] [{'label': 'POSITIVE', 'score': 0.9995730519294739}]
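
To use the classifier's result as a numeric feature, one option is to fold the label and score into a single signed value. This is only a sketch: `signed_sentiment` is a hypothetical helper, and the inputs are the outputs printed above, hard-coded so the example runs without downloading a model.

```python
def signed_sentiment(result):
    """Map [{'label': ..., 'score': ...}] to a value in [-1, 1]."""
    top = result[0]
    sign = 1.0 if top["label"] == "POSITIVE" else -1.0
    return sign * top["score"]

# Outputs copied from the cell above
sentiment_1 = [{'label': 'NEGATIVE', 'score': 0.9998020529747009}]
sentiment_2 = [{'label': 'POSITIVE', 'score': 0.9995730519294739}]

features = [signed_sentiment(s) for s in (sentiment_1, sentiment_2)]
print(features)
```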


# END of Chapter 6 --> Handling Text data