# Text analysis II

In this notebook, we will:

- Preprocess text.
- Extract features.
    - Word counts
    - Term frequency (tf-idf)
    - Word and document vectors

In [1]:
from textblob import TextBlob, Word

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import pandas as pd

import spacy

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jkiley/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Preprocess text

There are a number of common ways to preprocess text for use in machine learning and other text analysis models.
These steps are often helpful but not always, so feel free to experiment with them on your own models and text corpora.

We will look at some of this functionality in textblob because it makes it easy to see how each step works.
In practice, as we will see, we often use scikit-learn's tools for these tasks.
The steps we will cover are:

- lower case
- punctuation removed
- POS tagging
- lemmatization
- n-grams
- stop words removed

In [2]:
example_text_1 = ('Ultimately, we want to turn our text into a matrix that '
                 'gives the algorithm information to categorize text. That '
                 'is more difficult if we miss the same words due to case, '
                 'punctuation, or common words that don\'t help predict. '
                 'So, we can clean our text to potentially make our '
                 'predictions better.')
example_text_1

"Ultimately, we want to turn our text into a matrix that gives the algorithm information to categorize text. That is more difficult if we miss the same words due to case, punctuation, or common words that don't help predict. So, we can clean our text to potentially make our predictions better."

In [3]:
e_blob_1 = TextBlob(example_text_1)
e_blob_1.word_counts

defaultdict(int,
            {'ultimately': 1,
             'we': 3,
             'want': 1,
             'to': 4,
             'turn': 1,
             'our': 3,
             'text': 3,
             'into': 1,
             'a': 1,
             'matrix': 1,
             'that': 3,
             'gives': 1,
             'the': 2,
             'algorithm': 1,
             'information': 1,
             'categorize': 1,
             'is': 1,
             'more': 1,
             'difficult': 1,
             'if': 1,
             'miss': 1,
             'same': 1,
             'words': 2,
             'due': 1,
             'case': 1,
             'punctuation': 1,
             'or': 1,
             'common': 1,
             'do': 1,
             "n't": 1,
             'help': 1,
             'predict': 1,
             'so': 1,
             'can': 1,
             'clean': 1,
             'potentially': 1,
             'make': 1,
             'predictions': 1,
             'better': 1})

Notice a few things about the dictionary above.

1. This text has been **tokenized**, meaning that it has been split into tokens that have meaning (words in this case).
1. textblob makes the words lowercase before counting them. The word "that" appears in the original text both capitalized and in lower case, and both are counted under `'that'`. Lowercasing is perhaps the most common transformation of all, so it is not surprising that textblob does it for us automatically.
1. The punctuation has been removed. That's not always something we will want, but it is quite helpful in most cases.
1. The word "don't" was split into `'do'` and ``"n't"``. The tokenizer is smart enough to separate it so that the negation is captured separately.
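To make the first three observations concrete, here is a minimal sketch that uses only the Python standard library. It is not textblob's tokenizer (it keeps contractions whole, for instance), and both the regular expression and the `simple_word_counts` name are assumptions chosen for illustration.

```python
import re
from collections import Counter


def simple_word_counts(text):
    """Lowercase the text, split it into word tokens, and count them."""
    # Keep runs of letters and apostrophes; other punctuation is dropped.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = simple_word_counts("That is text. We clean that text, don't we?")
print(counts["that"])   # "That" and "that" are counted together
print(counts["don't"])  # unlike textblob, the contraction stays whole
```

A dedicated tokenizer handles many more cases than this pattern does, which is one reason to rely on a library in practice.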

Many times, we would like to consider parts of speech, and there are quite good models for finding this information for words.
textblob has this functionality built in.
For some tasks, it can be helpful to treat words used as different parts of speech as different words.

In [4]:
# Use slicing to look at the first ten.
e_blob_1.tags[:10]

[('Ultimately', 'RB'),
 ('we', 'PRP'),
 ('want', 'VBP'),
 ('to', 'TO'),
 ('turn', 'VB'),
 ('our', 'PRP$'),
 ('text', 'NN'),
 ('into', 'IN'),
 ('a', 'DT'),
 ('matrix', 'NN')]

Similarly, we may want to reduce words to their base or **lemmatized** form in order to construct better counts.

In [5]:
Word('learning')

'learning'

In [6]:
# We tell the lemmatize method the part of speech.
Word('learning').lemmatize('v')

'learn'

Another common transformation is using more than one word at a time to capture context.
These multi-word groups are called **n-grams**.
We do have to be careful here, as the dimensionality (and, thus, computational intensity) grows very quickly.

**Note:** we would typically add the n-grams to the single words as features.

In [7]:
print(f'Length of words alone:  {len(e_blob_1.word_counts)}')
print(f'Length of n-grams of 2: {len(e_blob_1.ngrams(2))}')

Length of words alone:  39
Length of n-grams of 2: 51
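Note that the two numbers above measure different things: 39 is the number of distinct words, while 51 is the number of bigram positions in the text (52 word tokens minus one). The sketch below shows how n-grams can be generated from a token list; the `ngrams` helper is a hypothetical stand-in for illustration, not textblob's implementation.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "we can clean our text".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# A sequence of k tokens yields k - n + 1 n-grams.
print(len(tokens), len(bigrams))
```

Because most bigrams occur only once in a corpus, adding them tends to grow the number of distinct features much faster than the number of distinct words.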


In [8]:
e_blob_1.words.lower()

WordList(['ultimately', 'we', 'want', 'to', 'turn', 'our', 'text', 'into', 'a', 'matrix', 'that', 'gives', 'the', 'algorithm', 'information', 'to', 'categorize', 'text', 'that', 'is', 'more', 'difficult', 'if', 'we', 'miss', 'the', 'same', 'words', 'due', 'to', 'case', 'punctuation', 'or', 'common', 'words', 'that', 'do', "n't", 'help', 'predict', 'so', 'we', 'can', 'clean', 'our', 'text', 'to', 'potentially', 'make', 'our', 'predictions', 'better'])

In [9]:
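# Build the stop word set once instead of reloading the list for every word.
english_stops = set(stopwords.words('english'))
e_blob_1_stop = [w for w in e_blob_1.words.lower()
                 if w not in english_stops]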
e_blob_1_stop

['ultimately',
 'want',
 'turn',
 'text',
 'matrix',
 'gives',
 'algorithm',
 'information',
 'categorize',
 'text',
 'difficult',
 'miss',
 'words',
 'due',
 'case',
 'punctuation',
 'common',
 'words',
 "n't",
 'help',
 'predict',
 'clean',
 'text',
 'potentially',
 'make',
 'predictions',
 'better']

# Word count features

We're going to use a couple of short sentences as an example to see how the transformations work, though the patterns we will see are generally quite common.

In [10]:
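# A nice description comes with the 20 newsgroups dataset.
# You can uncomment and run this yourself if you like, after loading the data
# (for example, news_test = fetch_20newsgroups(subset='test'); that name is assumed).
# print(news_test['DESCR'])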

`sklearn`'s text utilities do a lot of feature extraction for us relatively easily.
We will look at them in a few examples.

In [11]:
# Let's look at the defaults.
test_cv = CountVectorizer()
test_cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

Note a few things:

1. By default, `lowercase=True`. As we discussed before, this is a transform that is nearly universal.
1. It has a default of `ngram_range=(1, 1)` (single words only), but we can widen the range to include n-grams.
1. It can filter stop words, but that is off by default. As the [documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#stop-words) notes, there are reasons to worry about stop words.
1. If we want to override the built-in behavior, it allows us to pass in our own functions for the `preprocessor` and `tokenizer` arguments.
1. Note that we do not have POS tagging built-in, but we could preprocess the text ourselves to feed in data with tags.

Let's see some output.

In [12]:
test_sentences = ['If we want to override the built-in behavior, '
                  'it allows us to pass in our own functions for the '
                  ' preprocessor and tokenizer arguments.',
                  'Note that we do not have POS tagging built-in, '
                  'but we could preprocess the '
                  'text ourselves to feed in data with tags.']
test_sent_vec = test_cv.fit_transform(test_sentences)
# Note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there.
print(test_cv.get_feature_names())
print(test_sent_vec.toarray())

['allows', 'and', 'arguments', 'behavior', 'built', 'but', 'could', 'data', 'do', 'feed', 'for', 'functions', 'have', 'if', 'in', 'it', 'not', 'note', 'our', 'ourselves', 'override', 'own', 'pass', 'pos', 'preprocess', 'preprocessor', 'tagging', 'tags', 'text', 'that', 'the', 'to', 'tokenizer', 'us', 'want', 'we', 'with']
[[1 1 1 1 1 0 0 0 0 0 1 1 0 1 2 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 2 2 1 1 1 1
  0]
 [0 0 0 0 1 1 1 1 1 1 0 0 1 0 2 0 1 1 0 1 0 0 0 1 1 0 1 1 1 1 1 1 0 0 0 2
  1]]
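To see what is happening inside the count matrix, here is a minimal standard-library sketch of the same idea: build a sorted vocabulary, then count each term per document. It reuses the default `token_pattern` shown above, but `count_matrix` is a hypothetical helper for illustration and not how scikit-learn is implemented (scikit-learn returns a sparse matrix, for one).

```python
import re
from collections import Counter

# Same token pattern as the CountVectorizer default shown above:
# words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_matrix(docs):
    """Return a sorted vocabulary and one row of counts per document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


vocabulary, rows = count_matrix(["We clean text.", "Clean text is a good idea."])
print(vocabulary)
print(rows)
```

Each column corresponds to one vocabulary term and each row to one document. The single-character word "a" is dropped because the pattern requires at least two word characters.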


In [13]:
# Let's see what happens with n-grams of 2.
test_cv_2 = CountVectorizer(ngram_range=(1, 2))
test_sent_vec_2 = test_cv_2.fit_transform(test_sentences)
# Note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there.
print(test_cv_2.get_feature_names())
print(test_sent_vec_2.toarray())

['allows', 'allows us', 'and', 'and tokenizer', 'arguments', 'behavior', 'behavior it', 'built', 'built in', 'but', 'but we', 'could', 'could preprocess', 'data', 'data with', 'do', 'do not', 'feed', 'feed in', 'for', 'for the', 'functions', 'functions for', 'have', 'have pos', 'if', 'if we', 'in', 'in behavior', 'in but', 'in data', 'in our', 'it', 'it allows', 'not', 'not have', 'note', 'note that', 'our', 'our own', 'ourselves', 'ourselves to', 'override', 'override the', 'own', 'own functions', 'pass', 'pass in', 'pos', 'pos tagging', 'preprocess', 'preprocess the', 'preprocessor', 'preprocessor and', 'tagging', 'tagging built', 'tags', 'text', 'text ourselves', 'that', 'that we', 'the', 'the built', 'the preprocessor', 'the text', 'to', 'to feed', 'to override', 'to pass', 'tokenizer', 'tokenizer arguments', 'us', 'us to', 'want', 'want to', 'we', 'we could', 'we do', 'we want', 'with', 'with tags']
[[1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1 2 1 0 0 1 1 1 0 0
  0 0 1 

One issue that may seem obvious from our discussion of stop words earlier is that some words don't do a lot for us in terms of prediction.
Another strategy for dealing with that issue is weighting terms so that those appearing in fewer documents receive a higher weight, and those appearing in many documents receive a lower one.
We call this **term frequency times inverse document frequency**, or tf-idf.

Another issue you may have thought of is that we're using raw counts above.
Longer documents will naturally have higher counts, so we can normalize those values if we choose (like the example below).
Normalization is not that important for our examples, but some models are sensitive to document length.

In [14]:
# Again, let's look at it.
test_tt_1 = TfidfTransformer()
test_tt_1

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

We can see that, by default, it both normalizes and uses idf, but we can change those arguments if we choose.
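As a rough illustration of what those defaults do, the sketch below weights raw counts by a smoothed idf and then scales each row to unit length. It assumes the smoothed formula described in the scikit-learn documentation, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The `tfidf_rows` helper and the toy counts are hypothetical.

```python
import math


def tfidf_rows(count_rows):
    """Weight raw counts by smoothed idf, then scale each row to unit length."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: the number of documents containing each term.
    doc_freq = [sum(1 for row in count_rows if row[j] > 0)
                for j in range(n_terms)]
    # Smoothed idf (assumed formula): ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    result = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        # l2 norm of the row; fall back to 1.0 for an all-zero row.
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result


# Two documents, three terms; only the first term appears in both documents.
weights = tfidf_rows([[1, 1, 0], [1, 0, 2]])
print(weights[0])
```

With two documents, a term that appears in only one of them gets an idf of ln(3/2) + 1, about 1.405, while a term in both gets exactly 1. That ratio is the same one separating the values 0.14809752 and 0.1053726 in the output below.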

In [15]:
test_sent_tfidf_1 = test_tt_1.fit_transform(test_sent_vec_2.toarray())
print(test_sent_tfidf_1.toarray())

[[0.14809752 0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.14809752 0.1053726  0.1053726  0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.14809752 0.14809752 0.14809752 0.14809752 0.
  0.         0.14809752 0.14809752 0.2107452  0.14809752 0.
  0.         0.14809752 0.14809752 0.14809752 0.         0.
  0.         0.         0.14809752 0.14809752 0.         0.
  0.14809752 0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.         0.         0.         0.         0.14809752 0.14809752
  0.         0.         0.         0.         0.         0.
  0.         0.2107452  0.14809752 0.14809752 0.         0.2107452
  0.         0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.14809752 0.14809752 0.14809752 0.1053726  0.         0.
  0.14809752 0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.10840958 0.10840958 0.15236588 0.15236588 0.15236588
  0.15236588 0.15

While we looked at a number of the intermediate states, these tasks are common enough that the `TfidfVectorizer` class bundles together both `CountVectorizer` and `TfidfTransformer` into one step.

In [16]:
# Notice that we get the same result in one step.
test_tv_1 = TfidfVectorizer(ngram_range=(1, 2))
test_sent_tv_1 = test_tv_1.fit_transform(test_sentences)

In [17]:
print(test_sent_tv_1.toarray())

[[0.14809752 0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.14809752 0.1053726  0.1053726  0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.14809752 0.14809752 0.14809752 0.14809752 0.
  0.         0.14809752 0.14809752 0.2107452  0.14809752 0.
  0.         0.14809752 0.14809752 0.14809752 0.         0.
  0.         0.         0.14809752 0.14809752 0.         0.
  0.14809752 0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.         0.         0.         0.         0.14809752 0.14809752
  0.         0.         0.         0.         0.         0.
  0.         0.2107452  0.14809752 0.14809752 0.         0.2107452
  0.         0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.14809752 0.14809752 0.14809752 0.1053726  0.         0.
  0.14809752 0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.10840958 0.10840958 0.15236588 0.15236588 0.15236588
  0.15236588 0.15

# Word vector features

As we discussed earlier, word vectors represent words as vectors in a vector space.
The embeddings below are from Stanford's [GloVe](https://github.com/stanfordnlp/GloVe) project, specifically the 100-dimensional version of the Wikipedia 2014 + Gigaword 5 data.

Here, I have extracted a small subset of the embeddings for the words in the `e_blob_1_stop` list above.
The full data is quite large.

In [18]:
glove = pd.read_csv('../data/glove.csv', index_col=0)
glove.head()

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ultimately | 0.45483 | -0.10229 | 0.35265 | 0.16507 | 0.33391 | -0.033643 | 0.000439 | -0.09316 | 0.60292 | -0.26628 | ... | -0.35214 | -0.47388 | -0.19256 | 0.2763 | -0.053913 | 0.34036 | 0.23723 | 0.054277 | 0.10977 | 0.18678 |
| want | -0.17124 | 0.56447 | 0.34667 | -0.56711 | -0.65675 | 0.12081 | -0.76863 | 0.072832 | 0.42237 | -0.10464 | ... | -0.013218 | -0.20853 | 0.052186 | -0.86911 | -0.85816 | -0.23443 | 0.057799 | 0.03115 | 0.48789 | 0.69311 |
| turn | -0.08525 | 0.021085 | 0.30965 | -0.17603 | 0.088869 | 0.28505 | -0.34456 | 0.11396 | 0.29212 | -0.24502 | ... | -0.22737 | 0.11364 | -0.23705 | -0.31441 | -0.43231 | -0.10549 | 0.14866 | -0.24256 | 0.47029 | 0.2342 |
| text | -0.49705 | 0.71642 | 0.40119 | -0.05761 | 0.83614 | 0.8256 | 0.08963 | -0.53492 | 0.34335 | -0.27079 | ... | 0.040066 | 0.60803 | -0.027058 | 0.15273 | -0.16887 | -0.47664 | -0.61775 | -0.98735 | 0.23776 | 0.39952 |
| matrix | -0.26638 | 0.44491 | 0.32743 | 0.43459 | 0.10528 | 0.31703 | -0.34503 | 0.18147 | -0.14878 | 0.84897 | ... | -1.1066 | 0.35388 | -0.26355 | 0.59609 | 1.1334 | -1.1025 | 0.77682 | -0.17267 | -0.53726 | 0.158 |


## Using spacy to prepare text

The [spacy](https://spacy.io/usage/vectors-similarity) package can do much of the preparation we have described before we pass text to machine learning models.
It can also do some other interesting things using the representation it creates.

In [19]:
nlp = spacy.load('en_core_web_lg')
example_doc = nlp(example_text_1)

In [20]:
example_doc

Ultimately, we want to turn our text into a matrix that gives the algorithm information to categorize text. That is more difficult if we miss the same words due to case, punctuation, or common words that don't help predict. So, we can clean our text to potentially make our predictions better.

In [21]:
# Look at each item in the text.
[i for i in example_doc]

[Ultimately,
 ,,
 we,
 want,
 to,
 turn,
 our,
 text,
 into,
 a,
 matrix,
 that,
 gives,
 the,
 algorithm,
 information,
 to,
 categorize,
 text,
 .,
 That,
 is,
 more,
 difficult,
 if,
 we,
 miss,
 the,
 same,
 words,
 due,
 to,
 case,
 ,,
 punctuation,
 ,,
 or,
 common,
 words,
 that,
 do,
 n't,
 help,
 predict,
 .,
 So,
 ,,
 we,
 can,
 clean,
 our,
 text,
 to,
 potentially,
 make,
 our,
 predictions,
 better,
 .]

In [22]:
# We can see here that each element has a vector representation.
# Also, each vector is in 300 dimensions.
print(len([i.vector for i in example_doc]))
print(len([i.vector for i in example_doc][0]))

59
300


In [23]:
# There is a vector for the document, which is the average of the element vectors.
len(example_doc.vector)

300
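The averaging itself is simple. As a minimal sketch with made-up two-dimensional vectors (the `mean_vector` helper is hypothetical, and spacy's own computation works on arrays rather than Python lists):

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors element by element."""
    n = len(vectors)
    # zip(*vectors) groups the first elements together, then the second, etc.
    return [sum(values) / n for values in zip(*vectors)]


# Three made-up 2-dimensional "token vectors".
token_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
doc_vector = mean_vector(token_vectors)
print(doc_vector)
```

The result has the same number of dimensions as each token vector, which is why the document vector above also has length 300.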

In [24]:
# Looking at just the first 40 elements of the vector.
example_doc.vector[0:40]

array([-0.04557627,  0.22923034, -0.24733946,  0.01531771,  0.06508073,
        0.0169166 , -0.03391241,  0.00259539, -0.00466278,  2.2064903 ,
       -0.17149216,  0.06919597,  0.17887476, -0.00224403, -0.18808614,
       -0.05013261, -0.09594608,  1.2901341 , -0.2074607 , -0.04749772,
       -0.01562062,  0.03571492, -0.01421114, -0.01695105,  0.0155637 ,
        0.1137066 , -0.03556239, -0.02754606,  0.11727446, -0.17732875,
       -0.02521809,  0.06882533, -0.02938912,  0.02645194,  0.10027047,
       -0.05578014,  0.11198515,  0.01748868, -0.07824575, -0.10478983],
      dtype=float32)

In [25]:
# We can calculate similarity between documents, too.
example_2 = nlp('Cleaning text is a good idea.')
example_doc.similarity(example_2)

0.8728022575670232

In [26]:
# If we calculate the cosine similarity of the two vectors,
# we can see that it is the method used above.
cosine_similarity([example_doc.vector], [example_2.vector])

array([[0.8728023]], dtype=float32)
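Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores their magnitudes. Here is a minimal standard-library sketch; the `cosine` helper and the toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different lengths
```

Because only the direction matters, a short document and a long document on the same topic can still score as highly similar.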

# Breakout Exercises

Let's do an exercise to reinforce the concepts we learned above.

## EX1: similarity

Let's use spacy to help us compute the similarity of two strings.

1. Create two spacy documents, named `breakout1` and `breakout2`, by passing strings to the `nlp` object (the loaded spacy pipeline) that we created earlier.
1. Use the similarity method on the first document to compare it to the second.
1. Try changing up the strings (and feel free to Google longer passages to test).

In [27]:
# 1-1 code


In [28]:
# 1-2 code
