# * Cleaning Text
Most basic text cleaning needs nothing more than Python’s built-in string methods,
such as strip, replace, and upper:

In [1]:
text_data = [" Interrobang. By HK Henriette ",
                "Parking And Going. By Karl Gautier",
                " Today Is The night. By Jarek Prakash "]

In [2]:
string_without_whitespace = [string.strip() for string in text_data]
string_without_whitespace

['Interrobang. By HK Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [3]:
string_without_period = [string.replace(".", "") for string in string_without_whitespace]
string_without_period

['Interrobang By HK Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [4]:
capital_text = [string.upper() for string in string_without_period]
capital_text

['INTERROBANG BY HK HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']
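
The same built-in string methods can also pull fields apart. Here is a minimal sketch using str.split to separate title from author. It assumes every string follows the "Title. By Author" pattern seen above, which real data may not; the variable names are made up for illustration.

```python
# Hypothetical sample strings in the same "Title. By Author" shape as above
raw_titles = [" Interrobang. By HK Henriette ",
              "Parking And Going. By Karl Gautier"]

# Strip outer whitespace, then split once on the ". By " separator
title_author_pairs = [tuple(s.strip().split(". By ", 1)) for s in raw_titles]
print(title_author_pairs)
```

Passing 1 as the second argument stops after the first match, so a title that itself contains ". By " later on would not be split again.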

# * Parsing and cleaning HTML
## Problem:
You have text data with HTML elements and want to extract just the text.

## Solution:
Use Beautiful Soup’s extensive set of options to parse and extract from HTML:

In [5]:
from bs4 import BeautifulSoup
import nltk

In [6]:
# Create sample HTML content
html = """
<div class='full_name'><span style='font-weight:bold'>
Masego</span> Azra</div>
"""

# Parse html
soup = BeautifulSoup(html, "lxml")

# Find the div with class name 'full_name' and read its text
soup.find('div', {'class':'full_name'}).text

'\nMasego Azra'

Despite the strange name, Beautiful Soup is a powerful Python library designed for
scraping HTML. Typically Beautiful Soup is used to scrape live websites, but we can just
as easily use it to extract text data embedded in HTML. The full range of Beautiful
Soup operations is beyond the scope of this blog, but even the few methods used in
our solution show how easily we can parse HTML code to extract the data we want.
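
The text returned above still carries a leading newline. As a small sketch (one way to tidy it, not the only one), get_text can strip each text fragment and join the pieces with a separator of our choice. The example uses Python's built-in "html.parser" instead of lxml so that nothing beyond Beautiful Soup itself is needed; that swap is my assumption, not part of the solution above.

```python
from bs4 import BeautifulSoup

html = "<div class='full_name'><span style='font-weight:bold'>\nMasego</span> Azra</div>"

# "html.parser" is Python's built-in parser, so lxml is not required here
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment; " " joins the fragments back together
full_name = soup.find("div", {"class": "full_name"}).get_text(" ", strip=True)
print(full_name)
```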

# * Removing punctuation
## Problem:
You have a feature of text data and want to remove punctuation.

## Solution:
Loop over each string and keep only the characters that are not in string.punctuation:

In [7]:
import string
text_data = ['Hi!!!! I. Love. This. Song....',
            '10000% Agree!!!! #LoveIT',
            'Right?!?!']
res_list = []
for string_tmp in text_data:
    res = ''
    for c in string_tmp:
        if c not in string.punctuation:
            res = res+c
    res_list.append(res)
res_list

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']

It is important to be conscious of the fact that punctuation contains information (e.g.,
“Right?” versus “Right!”). Removing punctuation is often a necessary evil to create
features; however, if the punctuation is important we should make sure to take that
into account.
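
Python's str.translate offers a more compact way to do the same deletion. The sketch below builds a translation table with str.maketrans that maps every character in string.punctuation to "delete". Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotes would survive this step.

```python
import string

text_data = ['Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

# Third argument of maketrans lists the characters to delete
punctuation_table = str.maketrans('', '', string.punctuation)

cleaned = [text.translate(punctuation_table) for text in text_data]
print(cleaned)
```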

# * Tokenizing text
## Problem:
You have text and want to break it up into individual words.

## Solution:
The Natural Language Toolkit for Python (NLTK) has a powerful set of text-processing tools, including word tokenization:

In [8]:
from nltk.tokenize import word_tokenize
# Named `sentence` so it does not shadow the `string` module imported earlier
sentence = "The science of today is the technology of tomorrow. Thus the end"

word_tokenize(sentence)

['The',
 'science',
 'of',
 'today',
 'is',
 'the',
 'technology',
 'of',
 'tomorrow',
 '.',
 'Thus',
 'the',
 'end']

Tokenization, especially word tokenization, is a common task after cleaning text data
because it is the first step in the process of turning the text into data we will use to
construct useful features.
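
To get a feel for what a word tokenizer does, here is a rough stand-in built on Python's re module. It is not NLTK's algorithm, only a sketch that splits text into runs of word characters and single punctuation marks; NLTK's tokenizer handles many cases (contractions, abbreviations) that this pattern does not.

```python
import re

sentence = "The science of today is the technology of tomorrow. Thus the end"

# \w+ grabs runs of word characters; [^\w\s] grabs one punctuation mark at a time
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```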

# * Removing stop words
## Problem:
Given tokenized words, remove common words.

## Solution:
Use NLTK's stopwords corpus:

In [9]:
from nltk.corpus import stopwords

In [10]:
text = "I am going to market today I had enjoyed your ride"
text = text.lower()
list_of_words = word_tokenize(text)

tmp = stopwords.words('english')
text_without_stopwords = [word for word in list_of_words if word not in tmp]

text_without_stopwords

['going', 'market', 'today', 'enjoyed', 'ride']

While “stop words” can refer to any set of words we want to remove before processing,
frequently the term refers to extremely common words that themselves contain
little information value. NLTK has a list of common stop words that we can use to
find and remove stop words in our tokenized words:

In [11]:
tmp[:10]    # tmp is the list of stop words assigned in the cell above

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

##### Note that NLTK’s stopwords assumes the tokenized words are all lowercased.
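
The effect of that lowercase assumption is easy to see with a tiny example. The stop-word set below is a made-up handful of words standing in for NLTK's much longer list, and the tokens are invented for illustration.

```python
# A tiny hypothetical stop-word set; NLTK's real English list is much longer
stop_words = {"i", "am", "to", "the", "had", "your"}

tokens = ["The", "market", "had", "the", "best", "ride"]

# Case-sensitive comparison misses the capitalized "The"
kept_as_is = [t for t in tokens if t not in stop_words]

# Lowercasing first catches it
kept_lowercased = [t.lower() for t in tokens if t.lower() not in stop_words]

print(kept_as_is)
print(kept_lowercased)
```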

# * Stemming words
## Problem:
You have tokenized words and want to convert them into their root forms.

## Solution:
Use NLTK’s PorterStemmer:

In [12]:
from nltk.stem.porter import PorterStemmer

# Reuse the tokenized words from the output above (text_without_stopwords)

# Create the stemmer
porter = PorterStemmer()

root_words = [porter.stem(word) for word in text_without_stopwords]
root_words

['go', 'market', 'today', 'enjoy', 'ride']

Stemming reduces a word to its stem while keeping the word's core meaning. For example, both “tradition” and
“traditional” have “tradit” as their stem, indicating that while they are different words
they represent the same general concept. The Porter stemmer works by stripping common suffixes, so the resulting stem is not always a dictionary word.
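
The “tradition” example can be checked directly. This short sketch runs NLTK's PorterStemmer on both words; the stemmer needs no extra data downloads.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Both words should reduce to the shared stem mentioned above
stems = [porter.stem(word) for word in ["tradition", "traditional"]]
print(stems)
```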

# * Tagging Parts of Speech
## Problem:
You have text data and want to tag each word or character with its part of speech.

## Solution:
Use NLTK’s pre-trained parts-of-speech tagger:

In [13]:
from nltk import pos_tag
from nltk import word_tokenize

In [14]:
text_data = "Chris loved outdoor running"

In [15]:
# Use pre-trained part of speech tagger
text_tagged = pos_tag(word_tokenize(text_data))
text_tagged

[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

The output is a list of tuples with the word and the tag of the part of speech.


Here are some common tags and their meanings:

In [16]:
'''
NNP: Proper noun, singular
NN: Noun, singular or mass
RB: Adverb
RP: Particle
VBD: Verb, past tense
VBG: Verb, gerund or present participle
JJ: Adjective
PRP: Personal pronoun
'''
print()




Once we have tags for each word, we can use them to find certain parts of speech.

In [17]:
# Example to get all nouns
nouns = [word for word, tag in text_tagged if tag in {'NN','NNS','NNP','NNPS'}]
nouns

['Chris']

## A realistic use of part-of-speech tags:
Suppose we have data where every observation is a tweet, and we want to convert those sentences into features for individual parts of speech (e.g., a feature with 1 if a proper noun is present, and 0 otherwise).

In [18]:
tweets = ["I am eating a burrito for breakfast",
        "Political science is an amazing field",
        "San Francisco is an awesome city"]

# Create list
tagged_tweets = []

In [19]:
# Tag each word in each tweet and keep only the tags
for tweet in tweets:
    tweet_tag = nltk.pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])
    
tagged_tweets

[['PRP', 'VBP', 'VBG', 'DT', 'NN', 'IN', 'NN'],
 ['JJ', 'NN', 'VBZ', 'DT', 'JJ', 'NN'],
 ['NNP', 'NNP', 'VBZ', 'DT', 'JJ', 'NN']]

In [20]:
# Import the multi-label one-hot encoder
from sklearn.preprocessing import MultiLabelBinarizer

# Use one-hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [21]:
# Using classes_ we can see that each feature is part-of-speech-tag
one_hot_multi.classes_

array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)
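
For the single feature described in the case study (1 if a proper noun is present, 0 otherwise), plain Python is enough. This sketch copies the tag lists from the tagged_tweets output above so that it runs without NLTK; treating both 'NNP' and 'NNPS' as proper-noun tags follows the noun filter used earlier.

```python
# Tag lists copied from the tagged_tweets output above
tagged_tweets = [['PRP', 'VBP', 'VBG', 'DT', 'NN', 'IN', 'NN'],
                 ['JJ', 'NN', 'VBZ', 'DT', 'JJ', 'NN'],
                 ['NNP', 'NNP', 'VBZ', 'DT', 'JJ', 'NN']]

proper_noun_tags = {'NNP', 'NNPS'}

# 1 if any tag in the tweet is a proper-noun tag, else 0
has_proper_noun = [int(any(tag in proper_noun_tags for tag in tags))
                   for tags in tagged_tweets]
print(has_proper_noun)
```

Only the third tweet ("San Francisco ...") gets a 1, which matches the NNP column of the one-hot matrix.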

# * Encoding Text as a Bag of Words
## Problem:
You have text data and want to create a set of features indicating the number of times
an observation’s text contains a particular word.

## Solution:
Use scikit-learn’s CountVectorizer:

In [22]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Create text
text_data = np.array(['I love India. India!',
                        'Pune is the best',
                        'xyz beats both'])

# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# Show feature matrix
bag_of_words

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

This output is a sparse array, which is often necessary when we have a large amount
of text. However, in our toy example we can use toarray to view a matrix of word
counts for each observation:

bag_of_words.toarray()
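
A dense matrix of counts is hard to read without column labels. As a sketch, the fitted vectorizer's vocabulary_ attribute (a dict from word to column index) can be sorted by index to recover the column order; the variable name columns is just for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['I love India. India!',
             'Pune is the best',
             'xyz beats both']

count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# vocabulary_ maps each word to its column index; sort words by that index
columns = sorted(count.vocabulary_, key=count.vocabulary_.get)
print(columns)

# Pair each row of counts with the column labels
for row in bag_of_words.toarray():
    print(dict(zip(columns, row.tolist())))
```

Notice that "I" never becomes a feature: by default CountVectorizer lowercases the text and only keeps tokens of two or more characters.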

One of the most common methods of transforming text into features is by using a bag-of-words model. Bag-of-words models output a feature for every unique word in text data, with each feature containing a count of occurrences in observations.

The text data in our solution was purposely small. In the real world, a single observation of text data could be the contents of an entire book! Since our bag-of-words model creates a feature for every unique word in the data, the resulting matrix can contain thousands of features. This means that the size of the matrix can sometimes become very large in memory. However, luckily we can exploit a common characteristic of bag-of-words feature matrices to reduce the amount of data we need to store.

Most words likely do not occur in most observations, and therefore bag-of-words feature matrices will contain mostly 0s as values. We call these types of matrices “sparse.” Instead of storing all values of the matrix, we can only store nonzero values and then assume all other values are 0. This will save us memory when we have large feature matrices. One of the nice features of CountVectorizer is that the output is a sparse matrix by default.

CountVectorizer comes with a number of useful parameters to make creating bag-of-words feature matrices easy. First, while by default every feature is a word, that does not have to be the case. Instead we can set every feature to be the combination of two words (called a 2-gram) or even three words (3-gram). ngram_range sets the minimum and maximum size of our n-grams. For example, (2,3) will return all 2-grams and 3-grams. Second, we can easily remove low-information filler words using stop_words either with a built-in list or a custom list.

In [23]:
# Create feature matrix with 1-grams and 2-grams, dropping English stop words
count_2gram = CountVectorizer(ngram_range=(1,2),
    stop_words="english")

bag = count_2gram.fit_transform(text_data)
# View feature matrix
bag.toarray()

array([[0, 0, 2, 1, 1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=int64)

In [24]:
count_2gram.vocabulary_

{'love': 4,
 'india': 2,
 'love india': 5,
 'india india': 3,
 'pune': 6,
 'best': 1,
 'pune best': 7,
 'xyz': 8,
 'beats': 0,
 'xyz beats': 9}

# AND THE MOST IMPORTANT CONCEPT OF TEXT HANDLING ->
# * Weighting word importance
## Problem:
You want a bag of words, but with words weighted by their importance to an observation.

## Solution:
Compare the frequency of the word in a document (a tweet, movie review, speech
transcript, etc.) with the frequency of the word in all other documents using term
frequency-inverse document frequency (tf-idf). scikit-learn makes this easy with
TfidfVectorizer:

In [25]:
# import requirements:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
# Create sample data
text_data = np.array(['I love India. India!',
                        'Japan is the best',
                        'NY beats both'])

In [27]:
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        , 0.        ],
       [0.        , 0.5       , 0.        , 0.        , 0.5       ,
        0.5       , 0.        , 0.        , 0.5       ],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.        ,
        0.        , 0.        , 0.57735027, 0.        ]])

The more a word appears in a document, the more likely it is to be important to that document.
For example, if the word economy appears frequently, it is evidence that the
document might be about economics. We call this term frequency (tf).

In contrast, if a word appears in many documents, it is likely less important to any
individual document. For example, if every document in some text data contains the
word 'after', then it is probably an unimportant word. We call this document frequency
(df).

By combining these two statistics, we can assign a score to every word representing
how important that word is in a document. Specifically, we multiply tf by the inverse
of document frequency (idf):

#### tf_idf(t, d) = tf(t,d) × idf(t)
where t is a word and d is a document.

### The higher the resulting value, the more important the word is to a document.
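
To see where numbers like 0.894 and 0.447 come from, the arithmetic can be redone by hand. With its default settings, scikit-learn's TfidfVectorizer uses a smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents, and then scales each row to unit Euclidean length. The sketch below uses only the standard library; the tokens are written out manually (lowercased, with the one-letter word "I" dropped, as the default tokenizer does), and the helper names are made up for illustration.

```python
import math
from collections import Counter

# Hand-tokenized versions of the three sample documents
docs = [["love", "india", "india"],
        ["japan", "is", "the", "best"],
        ["ny", "beats", "both"]]

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed idf, matching scikit-learn's default (smooth_idf=True)
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(doc):
    # Raw term counts times idf, then L2-normalize the row
    weights = {term: count * idf(term) for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_row = tfidf_row(docs[0])
print(first_row)
```

Because every word here appears in exactly one document, all idf values are equal, so after normalization the first row depends only on the counts: "india" (count 2) ends up with twice the weight of "love" (count 1).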

# That's it, folks... These are the basics of handling text data in the data mining process. You can get more information from this post at AnalyticsVidhya:
https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/

# See you next week!