# Vectorizing Text

Most machine learning techniques rely on numerical input data.  Thus, the first step in any natural language processing exercise is to convert text data into numbers, in particular vectors/arrays of numbers.

In this chapter we consider three ways of vectorizing text data:

1. Word/Token counts
2. TF-IDF weightings
3. Word embeddings

## Importing Packages

Let's begin by importing the packages that we will need.

In [None]:
import numpy as np
import pandas as pd

## Reading-In Data

Next we read in some labeled news data.  In this data set, each financial headline is associated with a sentiment, positive or negative.  We can think of the headlines as the features and the sentiment as the label.  In this chapter, we won't be concerned with the labels, but rather with how to turn our raw text features into meaningful numeric feature vectors.

In [None]:
df_headline = pd.read_csv('LabelledNewsData.csv', encoding='unicode_escape')
df_headline

Unnamed: 0,datetime,headline,ticker,sentiment
0,1/16/2020 5:25,$MMM fell on hard times but could be set to re...,MMM,0
1,1/11/2020 6:43,Wolfe Research Upgrades 3M $MMM to ¡§Peer Perf...,MMM,1
2,1/9/2020 9:37,3M $MMM Upgraded to ¡§Peer Perform¡¨ by Wolfe ...,MMM,1
3,1/8/2020 17:01,$MMM #insideday follow up as it also opened up...,MMM,1
4,1/8/2020 7:44,$MMM is best #dividend #stock out there and do...,MMM,0
...,...,...,...,...
9465,4/11/2019 1:24,$WMT - Walmart shifts to remodeling vs. new st...,WMT,1
9466,4/10/2019 6:05,Walmart INC $WMT Holder Texas Permanent School...,WMT,0
9467,4/9/2019 4:38,$WMT $GILD:3 Dividend Stocks Perfect for Retir...,WMT,1
9468,4/9/2019 4:30,Walmart expanding use of #robots to scan shelv...,WMT,1


## Cleaning up Headlines

Let's use this user-defined function to clean our data.  In particular, we will lowercase all the words and remove punctuation.

In [None]:
import re
import string

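def process_text(text):
    # Lowercase everything so that 'Google' and 'google' become the same token
    text = str(text).lower()
    # Replace each ASCII punctuation character with a space
    text = re.sub(
        f"[{re.escape(string.punctuation)}]", " ", text
    )
    # Collapse runs of whitespace into single spaces and trim the ends
    text = " ".join(text.split())
    return text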
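To check what the function does, here is a small self-contained sketch that repeats `process_text` and applies it to two sample strings (the sample strings are only illustrative).  Note that `string.punctuation` covers ASCII punctuation only, so characters such as `¡§` are left in place.

```python
import re
import string

def process_text(text):
    # Same steps as the function above: lowercase, replace ASCII
    # punctuation with spaces, then collapse repeated whitespace.
    text = str(text).lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return " ".join(text.split())

# '$', '#' and '!' are ASCII punctuation, so they are removed.
cleaned = process_text("$MMM is best #dividend #stock out there!")

# The non-ASCII characters are not in string.punctuation, so they survive.
non_ascii = process_text("Upgraded to ¡§Peer Perform¡¨")

print(cleaned)
print(non_ascii)
```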

Now we can use the `.apply()` method to clean all the headlines in a single line of code.

In [None]:
df_headline['clean_headline'] = df_headline['headline'].apply(process_text)
df_headline

Unnamed: 0,datetime,headline,ticker,sentiment,clean_headline
0,1/16/2020 5:25,$MMM fell on hard times but could be set to re...,MMM,0,mmm fell on hard times but could be set to reb...
1,1/11/2020 6:43,Wolfe Research Upgrades 3M $MMM to ¡§Peer Perf...,MMM,1,wolfe research upgrades 3m mmm to ¡§peer perfo...
2,1/9/2020 9:37,3M $MMM Upgraded to ¡§Peer Perform¡¨ by Wolfe ...,MMM,1,3m mmm upgraded to ¡§peer perform¡¨ by wolfe r...
3,1/8/2020 17:01,$MMM #insideday follow up as it also opened up...,MMM,1,mmm insideday follow up as it also opened up w...
4,1/8/2020 7:44,$MMM is best #dividend #stock out there and do...,MMM,0,mmm is best dividend stock out there and down ...
...,...,...,...,...,...
9465,4/11/2019 1:24,$WMT - Walmart shifts to remodeling vs. new st...,WMT,1,wmt walmart shifts to remodeling vs new stores
9466,4/10/2019 6:05,Walmart INC $WMT Holder Texas Permanent School...,WMT,0,walmart inc wmt holder texas permanent school ...
9467,4/9/2019 4:38,$WMT $GILD:3 Dividend Stocks Perfect for Retir...,WMT,1,wmt gild 3 dividend stocks perfect for retirees
9468,4/9/2019 4:30,Walmart expanding use of #robots to scan shelv...,WMT,1,walmart expanding use of robots to scan shelve...


## Word Counts - Simple Example

The simplest way of vectorizing text is by creating a vocabulary of all the unique words in the corpus (the collection of all headlines in our case) and then counting how many times each word is used in each headline.

Let's start with a simple example, a corpus of two headlines.

In [None]:
sentences = [
    'The stock price of google jumps on the earning data today',
    'Google plunge on China Data!'
]

We can use the `CountVectorizer` in **sklearn** to perform this kind of vectorization.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

Using the `.fit_transform` method of our `CountVectorizer` instance returns a sparse matrix with the word counts in it.

In [None]:
type(vectorizer.fit_transform(sentences))

scipy.sparse._csr.csr_matrix

However, we can view this as a regular (dense) matrix as well.  Notice that each headline is represented by a row, and each column represents a particular word.

In [None]:
vectorizer.fit_transform(sentences).todense()

matrix([[0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 2, 1],
        [1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]])

The entries in the returned matrix are associated with the `.vocabulary_` `dict` that is constructed when `.fit_transform()` is run.  Our vocabulary consists of 12 words.  The index associated with each word in the vocabulary corresponds to its column number in the matrix representation above.  Notice that the word `'the'` appears twice in the first headline, so the entry in row 0, column 10 is 2.

In [None]:
vectorizer.vocabulary_

{'the': 10,
 'stock': 9,
 'price': 8,
 'of': 5,
 'google': 3,
 'jumps': 4,
 'on': 6,
 'earning': 2,
 'data': 1,
 'today': 11,
 'plunge': 7,
 'china': 0}

## Word Counts - Full Data Set

Let's now vectorize our full headline data set with `CountVectorizer`.  We begin by instantiating a new `CountVectorizer`; notice that we explicitly set the `ngram_range` argument of the constructor.  With `ngram_range=(1, 1)`, only single words (1-grams) are counted.  If we were instead to set `ngram_range=(1, 2)`, our `vectorizer` would produce counts for all 1-grams (words) and 2-grams (two-word sequences).

In [None]:
vectorizer = CountVectorizer(ngram_range=(1, 1))

Now we run the `.fit_transform()` method on all of our headlines.

In [None]:
features = vectorizer.fit_transform(df_headline['clean_headline'])

We can see that our vocabulary consists of 9464 unique words.

In [None]:
len(vectorizer.vocabulary_)

9464

Here is a list of the first 100 words in our `vocabulary_`.  The ordering is not important.

In [None]:
print(list(vectorizer.vocabulary_.keys())[:100])

['mmm', 'fell', 'on', 'hard', 'times', 'but', 'could', 'be', 'set', 'to', 'rebound', 'soon', 'wolfe', 'research', 'upgrades', '3m', 'peer', 'perform', 'upgraded', 'by', 'stocks', 'insideday', 'follow', 'up', 'as', 'it', 'also', 'opened', 'with', 'nice', 'candle', 'that', 'closed', 'just', 'over', 'the', 'prior', 'day', 'high', 'and', 'th', 'is', 'best', 'dividend', 'stock', 'out', 'there', 'down', '40', 'in', '2019', 'xli', 'go', 'please', 'fallen', 'king', 'will', 'back', 'read', 'more', 'sign', 'for', 'updates', 'trading', 'economy', 'investing', 'mmmcelebrates', 'new', 'year', 'month', 'close', 'volume', 'above', 'long', 'term', 'support', 'resistance', 'off', 'flag', '180', 'baby', 'going', 'higher', 'mmmhasn', 'really', 'done', 'much', 'this', 'looks', 'like', 'series', 'of', 'highs', 'forming', 'recent', 'ab', 'rating', 'increased', 'neutral', 'at']


Our full set of features is a matrix with 9470 rows, one for each of the headlines, and 9464 columns, one for each of the tokens in our vocabulary.

In [None]:
features.shape

(9470, 9464)

You can see that the feature matrix is sparse, i.e. mostly zeros.

In [None]:
features.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
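Rather than eyeballing the dense matrix, we can measure sparsity directly from the sparse matrix, which also avoids materializing the dense version (for the full data set that is roughly 90 million entries).  The sketch below uses the small two-headline corpus so that it is self-contained; the same two lines should apply to `features`.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    'The stock price of google jumps on the earning data today',
    'Google plunge on China Data!'
]

counts = CountVectorizer().fit_transform(sentences)

# .nnz is the number of explicitly stored (non-zero) entries.
n_rows, n_cols = counts.shape
density = counts.nnz / (n_rows * n_cols)

print(f"{counts.nnz} non-zero entries out of {n_rows * n_cols}")
print(f"density = {density:.3f}")
```

A tiny corpus like this one is fairly dense; with thousands of headlines and thousands of vocabulary words, the density typically drops to a small fraction of a percent.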

## TF-IDF - Simple Example

TF-IDF (term frequency-inverse document frequency) is a word frequency score that tries to highlight words that are *interesting*.  In particular, it tries to identify words that are frequent in a given document (for us, a headline) but not across documents (the set of all headlines).  Specifically:

\begin{align*}
TF &= \frac{\text{number of times term appears in the document}}{\text{total number of terms in the document}} \\[12pt]
IDF &= \ln \bigg( \frac{\text{number of documents}}{\text{number of documents containing the term}} \bigg) \\[12pt]
TFIDF &= TF * IDF
\end{align*}

Notice that, under this definition, a term that appears in every document gets a TF-IDF score of zero, because its IDF is ln(1) = 0.
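The formulas above can be sketched in a few lines of plain Python.  This is a minimal illustration of the textbook definition, not of the **sklearn** implementation; the tokenizer and the helper names `tf`, `idf`, and `tfidf` are assumptions made for the example.

```python
import math
import re

sentences = [
    'The stock price of google jumps on the earning data today',
    'Google plunge on China Data!'
]

# Simple tokenizer for illustration: lowercase runs of letters.
docs = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

def tf(term, doc):
    # Share of the document's terms that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing `term`).
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# 'google' appears in both headlines, so its IDF (and score) is zero.
score_google = tfidf("google", docs[0], docs)

# 'the' appears twice in the first headline and not at all in the second.
score_the = tfidf("the", docs[0], docs)

print(score_google, score_the)
```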

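The specific implementation in **sklearn** is a bit more nuanced (for example, by default it smooths the IDF term and rescales each document vector to unit length), so its scores will not match the formulas above exactly.  The details can be found in the [documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting).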
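To make that nuance concrete, the sketch below reproduces the `TfidfVectorizer` output by hand for the two-headline corpus.  It assumes the default settings as we read them in the documentation (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): raw counts are used as TF, the IDF is ln((1 + n) / (1 + df)) + 1, and each row is scaled to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    'The stock price of google jumps on the earning data today',
    'Google plunge on China Data!'
]

# Raw term counts, one row per headline.
counts = CountVectorizer().fit_transform(sentences).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of headlines containing each word

# Smoothed IDF: the "+ 1" terms keep words that occur in every document
# from being zeroed out.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then normalize each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_result = TfidfVectorizer().fit_transform(sentences).toarray()
print(np.allclose(manual, sklearn_result))
```

This is why a word such as `'data'`, which appears in both headlines, still receives a non-zero weight from `TfidfVectorizer`.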

Let's explore TF-IDF weighting by way of our simple two-headline corpus.

In [None]:
sentences = [
    'The stock price of google jumps on the earning data today',
    'Google plunge on China Data!'
]

As before, we simply import the `TfidfVectorizer` class from **sklearn**, instantiate it, and then use the `.fit_transform()` method.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(sentences)

We can view the full vocabulary with the `.get_feature_names_out()` method.

In [None]:
vectorizer.get_feature_names_out()

array(['china', 'data', 'earning', 'google', 'jumps', 'of', 'on',
       'plunge', 'price', 'stock', 'the', 'today'], dtype=object)

As above, the resulting `features` is a sparse matrix.

In [None]:
features

<2x12 sparse matrix of type '<class 'numpy.float64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [None]:
features.shape

(2, 12)

But we can also view it as a dense matrix.

In [None]:
features.todense()

matrix([[0.        , 0.20964166, 0.29464404, 0.20964166, 0.29464404,
         0.29464404, 0.20964166, 0.        , 0.29464404, 0.29464404,
         0.58928809, 0.29464404],
        [0.53309782, 0.37930349, 0.        , 0.37930349, 0.        ,
         0.        , 0.37930349, 0.53309782, 0.        , 0.        ,
         0.        , 0.        ]])

## TF-IDF - Full Data Set

Let's now create TF-IDF scores for all the headlines in our data set.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(df_headline['clean_headline'])

We have a total of 9464 words/tokens in our data set.

In [None]:
len(vectorizer.get_feature_names_out())

9464

Let's view our features matrix.  As you can see it is mostly zeros.

In [None]:
features.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

## Word Embeddings

When using both `CountVectorizer` and `TfidfVectorizer`, the resulting document matrix was sparse: each headline was represented by a row of length 9464, with most of the entries being zero.  And we are dealing with a very small corpus; the problem worsens as you deal with more and more documents.  Large, sparse matrices can cause computational strain and instability with many machine learning techniques.

Another issue with the above vectorizers is that we are making an implicit assumption that all words are independent of one another, which clearly isn't true when it comes to words in natural language.

A tool that can help with both of these shortcomings of count-based vectorization is *word embeddings*.  Word embeddings are meaningful vector representations of words such that similar/related words will have similar vector representations.  There are a variety of ways of training word embeddings, using deep learning or more standard statistical techniques (e.g. singular value decomposition).  In practice, data scientists often use pre-trained word embeddings; popular ones that you will see are **word2vec** and **GloVe**.
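One common way to quantify how "similar" two vectors are is cosine similarity.  The sketch below uses made-up 3-dimensional vectors purely for illustration (real embeddings have hundreds of dimensions, and these numbers do not come from any trained model); `cosine_similarity` is a helper written for this example.

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between u and v: near 1 means the vectors point
    # in almost the same direction, near 0 means they are almost orthogonal.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, NOT real embeddings.
toy_vectors = {
    "stock": np.array([0.9, 0.8, 0.1]),
    "share": np.array([0.8, 0.9, 0.2]),
    "banana": np.array([0.1, 0.0, 0.9]),
}

sim_related = cosine_similarity(toy_vectors["stock"], toy_vectors["share"])
sim_unrelated = cosine_similarity(toy_vectors["stock"], toy_vectors["banana"])

print(f"stock vs share:  {sim_related:.3f}")
print(f"stock vs banana: {sim_unrelated:.3f}")
```

With a good set of embeddings, we would expect related words to score noticeably higher than unrelated ones, as in this toy example.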

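For our purposes, we will use the word embeddings built into the **spaCy** library; its `en_core_web_lg` model represents each word in a large vocabulary (several hundred thousand words) by a 300-dimensional vector.  To use it, let's first import **spaCy** and load the model (which must be downloaded beforehand, e.g. with `python -m spacy download en_core_web_lg`).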

In [None]:
import spacy
nlp = spacy.load('en_core_web_lg')

Let's see the vector representation of the single word `'stock'`.

In [None]:
doc = nlp('stock')

In [None]:
for token in doc:
    print(token.vector)

[-1.6763e+00 -3.4523e+00 -5.6915e+00  8.3472e+00  4.6024e+00  2.2767e+00
  2.1329e+00  5.3314e+00 -6.0596e-01 -1.7710e+00  3.3695e+00  2.7222e-01
 -7.9550e+00 -2.9420e+00 -4.6784e+00 -3.8186e-01  7.3475e+00 -6.1266e-01
 -2.7009e-01 -3.9311e+00  2.2934e-03  3.7566e+00 -2.2156e+00 -1.7855e+00
  1.3459e-01  1.8638e+00 -1.5185e+00 -5.1967e+00 -9.7408e-01 -5.9038e-01
  3.5877e+00 -1.8775e+00 -6.5891e+00  1.9367e+00 -2.6755e+00 -1.4335e+00
  4.5114e+00  5.2133e+00  1.8360e+00  3.1565e+00 -6.6352e-01  4.3622e+00
  3.9877e+00  2.5044e-02 -6.4742e-01  4.4283e+00  2.9162e+00 -8.2397e-01
  3.9111e+00  1.8230e+00  1.5662e+00 -2.8878e+00  5.9252e-01 -4.4401e+00
 -2.8798e+00  2.1201e+00 -1.9458e+00  9.7731e-01  2.2704e+00  4.8463e-01
  4.6493e+00 -1.6039e+00  2.9300e+00  6.6515e-01  2.1048e-01 -2.3328e-02
 -3.1912e-01  1.7723e-01  4.1515e-01  6.2709e+00 -3.2902e+00 -2.3934e+00
 -6.9308e-01 -5.1802e-02  6.8780e-02  6.7266e+00 -1.6526e+00  1.5962e+00
  1.7972e+00 -1.4440e+00  6.3026e-01  4.6175e+00  3

## Word Embedding - Simple Example

Now let's calculate the word embedding for each of the words in the first sentence of our simple two-headline corpus.

In [None]:
sentences = [
    'The stock price of google jumps on the earning data today',
    'Google plunge on China Data!'
]

First, we create a `doc` object.

In [None]:
doc = nlp(sentences[0])

Next, let's print out the vector representation of each of the words.

In [None]:
for token in doc:
    print(token.vector)

[-7.2681e+00 -8.5717e-01  5.8105e+00  1.9771e+00  8.8147e+00 -5.8579e+00
  3.7143e+00  3.5850e+00  4.7987e+00 -4.4251e+00  1.7461e+00 -3.7296e+00
 -5.1407e+00 -1.0792e+00 -2.5555e+00  3.0755e+00  5.0141e+00  5.8525e+00
  7.3378e+00 -2.7689e+00 -5.1641e+00 -1.9879e+00  2.9782e+00  2.1024e+00
  4.4306e+00  8.4355e-01 -6.8742e+00 -4.2949e+00 -1.7294e-01  3.6074e+00
  8.4379e-01  3.3419e-01 -4.8147e+00  3.5683e-02 -1.3721e+01 -4.6528e+00
 -1.4021e+00  4.8342e-01  1.2549e+00 -4.0644e+00  3.3278e+00 -2.1590e-01
 -5.1786e+00  3.5360e+00 -3.1575e+00 -3.5273e+00 -3.6753e+00  1.5863e+00
 -8.1594e+00 -3.4657e+00  1.5262e+00  4.8135e+00 -3.8428e+00 -3.9082e+00
  6.7549e-01 -3.5787e-01 -1.7806e+00  3.5284e+00 -5.1114e-02 -9.7150e-01
 -9.0553e-01 -1.5570e+00  1.2038e+00  4.7708e+00  9.8561e-01 -2.3186e+00
 -7.4899e+00 -9.5389e+00  8.5572e+00  2.7420e+00 -3.6270e+00  2.7456e+00
 -6.9574e+00 -1.7190e+00 -2.9145e+00  1.1838e+00  3.7864e+00  2.0413e+00
 -3.5808e+00  1.4319e+00  2.0528e-01 -7.0640e-01 -5

We can use the following code to put all the word vectors into a document `array`.  Notice that there are 11 total words, and each word is represented by a 300-dimensional vector.

In [None]:
np.array([token.vector for token in doc]).shape

(11, 300)

Now we have a matrix representing a headline, but we would like a single vector to represent a headline (like we get with a count-based vectorization).  A simple way to do this is to take the mean of all the word vectors.

In [None]:
np.array([token.vector for token in doc]).mean(axis=0)

array([-1.67427254e+00, -3.95649701e-01, -3.52179974e-01,  2.67473698e+00,
        6.60974836e+00, -5.14187276e-01,  8.85469437e-01,  5.00070000e+00,
        1.94902003e+00, -2.36047196e+00,  5.78760433e+00,  3.53501201e+00,
       -5.29743910e+00,  8.60979080e-01,  1.62974551e-01,  1.90474904e+00,
        5.13615561e+00,  7.06369221e-01, -8.54807645e-02, -3.06247258e+00,
        3.30119342e-01, -2.01464462e+00, -1.07013643e+00,  5.12255311e-01,
        7.67499208e-01, -1.41396359e-01, -8.70485485e-01, -2.18154192e+00,
       -7.69820929e-01,  3.40282440e+00,  1.39036822e+00, -6.79862797e-01,
       -2.90976644e+00, -3.73797679e+00, -2.55181849e-01, -2.72300005e+00,
       -2.77262658e-01,  1.46237183e+00,  2.09034085e+00,  2.50623512e+00,
        9.90345255e-02,  1.39819002e+00,  3.58493626e-01,  6.11728609e-01,
       -1.93930364e+00,  8.43867302e-01,  2.41608524e+00,  1.39169157e-01,
        3.17911834e-01,  7.89771795e-01, -1.64458126e-01,  2.41038370e+00,
       -3.48086071e+00, -

## Word Embedding - Full Data Set

We can use the following nested `list comprehension` to calculate all the vector representations of all of our headlines at once. (This code takes about 45 seconds to run.)

In [None]:
%%time
all_vectors = np.array([
    # Mean of the token vectors of one headline -> a single 300-dim vector
    np.array([token.vector for token in nlp(s)]).mean(axis=0)
    for s in df_headline['clean_headline']
])

CPU times: user 46.1 s, sys: 11.2 ms, total: 46.2 s
Wall time: 46.2 s


Notice that we now have an `array` that represents our 9470 headlines, each with a 300-dimensional vector.  And each 300-dimensional vector is the mean of the vector representations of each of the words in the headline. 

In [None]:
all_vectors.shape

(9470, 300)
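The list comprehension above assumes that every cleaned headline contains at least one token; a headline with no tokens would leave nothing to average.  A minimal sketch of a more defensive averaging step is shown below.  The helper `mean_pool` is hypothetical, and random numbers stand in for real token vectors so that the example runs without **spaCy**.

```python
import numpy as np

def mean_pool(token_vectors, dim=300):
    """Average a list of token vectors; return zeros if the list is empty."""
    if len(token_vectors) == 0:
        return np.zeros(dim, dtype=np.float32)
    # Stack into shape (n_tokens, dim) and average over the token axis.
    return np.asarray(token_vectors).mean(axis=0)

# Stand-in token vectors (random numbers, not real embeddings):
# 11 "tokens", each with 300 dimensions.
rng = np.random.default_rng(0)
fake_headline = [rng.normal(size=300) for _ in range(11)]

pooled = mean_pool(fake_headline)
empty = mean_pool([])

print(pooled.shape, empty.shape)
```

With **spaCy**, the helper would be called as `mean_pool([token.vector for token in nlp(s)])` for each headline `s`.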