# Dimensionality Reduction

This notebook covers a few of the foundational techniques for dimensionality reduction in traditional NLP approaches. English words sit inside an enormous space. Merriam-Webster records 470,000 words in their unabridged dictionary. The corpus of English words (the words used by the entire set of English speakers) at any given point in recent history is about 170,000 words. The average English speaker uses about 30,000 words in their personal vocabulary. And this only represents our starting dataset. We are actually interested in documents, which are large collections of English words. Even with a minimal vocabulary, there are more grammatically correct English sentences than there are atoms in the universe.

Figuring out how to encode that much information into a numerical representation consumable by a computer has been the work of AI and NLP researchers since the 1950s. The techniques that dominated that space until the introduction of the word embedding are explored (lightly) in this notebook. The goal isn't to learn these techniques, but to gain some understanding of the problem that vector embeddings will solve.

In [1]:
import numpy as np

In [2]:
text = '''NLP has used several different techniques for dimensionality reduction. The two most common are bag of words and TI/IDF (which are closely related). We also use other techniques from impoveing the signal, like removing stop words (the, and, to, etc), tokenization, stemming/lemmatization (running -> run, construction/constructs -> construct). The goal of all of these techniques is to create a numerical representation of the words in our corpus for consumption by the computer.'''

# Bag of Words

Bag of words is one of the oldest techniques. The basic approach is to convert a document into a vector of word counts. If a word appears once in your document, then that word will have a value of one in the vector. Note that this actually encodes two pieces of information: the index that the word is mapped to, and the value stored at that index.
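As a minimal sketch of the idea, assuming a toy two-sentence corpus (the sentences and the names `vocabulary` and `vectorize` are illustrative and not part of this notebook's pipeline), each distinct word gets a fixed index and each document becomes a vector of counts at those indices:

```python
# Toy corpus (hypothetical, for illustration only)
docs = ["the cat sat on the mat", "the dog sat"]

# Assign every distinct word a fixed position in the vector
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)


def vectorize(doc, vocabulary):
    """Return the count vector for one document."""
    vector = [0] * len(vocabulary)
    for word in doc.split():
        vector[vocabulary[word]] += 1
    return vector


vectors = [vectorize(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

The index carries the identity of the word and the value carries its frequency; word order is discarded entirely.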

To get started, we will take the text above and parse it using basic Python data structures.

In [3]:
import re

# First we split the text into words on single spaces (duplicates are kept)
words = text.split(' ')

Unfortunately, we already have a problem. Splitting a sentence into words is not a straightforward exercise. We used the direct approach above of splitting on spaces, but that doesn't adequately address punctuation, which you can see below. Additionally, we need to account for differences in capitalization. Should "NLP" be capitalized or lowercase? There are no hard and fast answers here. Our strategy for parsing this sentence depends on our goals.

In [4]:
words

['NLP',
 'has',
 'used',
 'several',
 'different',
 'techniques',
 'for',
 'dimensionality',
 'reduction.',
 'The',
 'two',
 'most',
 'common',
 'are',
 'bag',
 'of',
 'words',
 'and',
 'TI/IDF',
 '(which',
 'are',
 'closely',
 'related).',
 'We',
 'also',
 'use',
 'other',
 'techniques',
 'from',
 'impoveing',
 'the',
 'signal,',
 'like',
 'removing',
 'stop',
 'words',
 '(the,',
 'and,',
 'to,',
 'etc),',
 'tokenization,',
 'stemming/lemmatization',
 '(running',
 '->',
 'run,',
 'construction/constructs',
 '->',
 'construct).',
 'The',
 'goal',
 'of',
 'all',
 'of',
 'these',
 'techniques',
 'is',
 'to',
 'create',
 'a',
 'numerical',
 'representation',
 'of',
 'the',
 'words',
 'in',
 'our',
 'corpus',
 'for',
 'consumption',
 'by',
 'the',
 'computer.']

Our goal is to demonstrate the process so we can appreciate how great vector embeddings are; consequently, we'll take the straightforward approach and remove all punctuation and convert everything to lowercase.

In [5]:
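# Remove every character that is neither a word character nor whitespace
words_clean = [re.sub(r"[^\w\s]", "", word) for word in words]
# Lowercase what remains and drop tokens that were pure punctuation (e.g. '->')
words_clean = [x.lower() for x in words_clean if x != '']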

In [6]:
words_clean

['nlp',
 'has',
 'used',
 'several',
 'different',
 'techniques',
 'for',
 'dimensionality',
 'reduction',
 'the',
 'two',
 'most',
 'common',
 'are',
 'bag',
 'of',
 'words',
 'and',
 'tiidf',
 'which',
 'are',
 'closely',
 'related',
 'we',
 'also',
 'use',
 'other',
 'techniques',
 'from',
 'impoveing',
 'the',
 'signal',
 'like',
 'removing',
 'stop',
 'words',
 'the',
 'and',
 'to',
 'etc',
 'tokenization',
 'stemminglemmatization',
 'running',
 'run',
 'constructionconstructs',
 'construct',
 'the',
 'goal',
 'of',
 'all',
 'of',
 'these',
 'techniques',
 'is',
 'to',
 'create',
 'a',
 'numerical',
 'representation',
 'of',
 'the',
 'words',
 'in',
 'our',
 'corpus',
 'for',
 'consumption',
 'by',
 'the',
 'computer']
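One side effect of deleting punctuation after splitting is visible above: 'stemming/lemmatization' collapses into 'stemminglemmatization' and 'TI/IDF' into 'tiidf'. A possible alternative, sketched here on a hypothetical `sample` string rather than the notebook's `text`, is to pull out runs of word characters directly so that slashes and arrows act as separators:

```python
import re

# Hypothetical sample echoing the trouble spots in the text above
sample = "stemming/lemmatization (running -> run), TI/IDF."

# Treat every run of letters, digits, or underscores as one token;
# anything else (spaces, slashes, arrows, brackets) acts as a separator.
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
```

Whether splitting on slashes is desirable depends on the goal; it separates 'stemming' from 'lemmatization' but also breaks a term like 'TI/IDF' into two pieces.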

To make the Bag of Words vector, we will use Python's defaultdict from the collections module, which starts every missing key at zero.

In [7]:
from collections import defaultdict

In [50]:
bag_of_words = defaultdict(int)
for word in words_clean:
    bag_of_words[word] += 1

Looking at the output, we can see that the count is dominated by words of very low informational value like "the" and "of". These are known as stop words and are frequently removed when utilizing an encoding like Bag of Words.

In [51]:
a = np.array([[k, v] for k, v in bag_of_words.items()])
a[a[:, 1].argsort()]

array([['nlp', '1'],
       ['signal', '1'],
       ['like', '1'],
       ['removing', '1'],
       ['stop', '1'],
       ['etc', '1'],
       ['tokenization', '1'],
       ['stemminglemmatization', '1'],
       ['running', '1'],
       ['run', '1'],
       ['constructionconstructs', '1'],
       ['construct', '1'],
       ['goal', '1'],
       ['all', '1'],
       ['these', '1'],
       ['is', '1'],
       ['create', '1'],
       ['a', '1'],
       ['numerical', '1'],
       ['representation', '1'],
       ['in', '1'],
       ['our', '1'],
       ['corpus', '1'],
       ['consumption', '1'],
       ['impoveing', '1'],
       ['by', '1'],
       ['from', '1'],
       ['use', '1'],
       ['has', '1'],
       ['used', '1'],
       ['several', '1'],
       ['different', '1'],
       ['dimensionality', '1'],
       ['reduction', '1'],
       ['two', '1'],
       ['most', '1'],
       ['other', '1'],
       ['bag', '1'],
       ['common', '1'],
       ['tiidf', '1'],
       ['which', '1'],
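Note the quoted '1' values in the output above: packing words and counts into one NumPy array converts the counts to strings, so `argsort` compares them as text (a count of '10' would sort before '2'). That is harmless while every count is a single digit, but a sketch that keeps the counts as integers, using `collections.Counter` on a hypothetical token list standing in for `words_clean`, might look like this:

```python
from collections import Counter

# Hypothetical token list standing in for words_clean
tokens = ["the", "of", "the", "words", "of", "the", "bag"]

counts = Counter(tokens)

# most_common() orders entries by their integer count, highest first
for word, count in counts.most_common(3):
    print(word, count)
```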

To remove the stop words, we will use one of the Python NLP libraries. There is a lot of great functionality to be found in spaCy, but for our purposes we will initialize a simple NLP pipeline (the pipeline handles the myriad parsing activities), process the text, then use the richly annotated output to identify stop words.
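The underlying idea needs no library at all. As a rough sketch, assuming a tiny hand-picked stop list (the `STOP_WORDS` set and the `tokens` list below are illustrative assumptions; library stop lists are far longer), filtering is a set membership test:

```python
from collections import Counter

# A tiny hand-picked stop list (an assumption for illustration only)
STOP_WORDS = {"the", "of", "and", "to", "a", "is", "in", "for", "by"}

# Hypothetical tokens, already lowercased and stripped of punctuation
tokens = ["the", "goal", "of", "all", "of", "these", "techniques",
          "is", "to", "create", "a", "numerical", "representation"]

# Keep only tokens that are not in the stop list, then count them
content_tokens = [t for t in tokens if t not in STOP_WORDS]
counts = Counter(content_tokens)
print(content_tokens)
```

A library pipeline adds value on top of this by supplying a curated list and by tokenizing and normalizing the text first.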

In [8]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [55]:
b = nlp(text)

In [57]:
help(b[0])

Help on Token object:

class Token(builtins.object)
 |  An individual token – i.e. a word, punctuation symbol, whitespace,
 |  etc.
 |  
 |  DOCS: https://spacy.io/api/token
 |  
 |  Methods defined here:
 |  
 |  __bytes__(...)
 |      Token.__bytes__(self)
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |      Return hash(self).
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(...)
 |      The number of unicode characters in the token, i.e. `token.text`.
 |      
 |      RETURNS (int): The number of unicode characters in the token.
 |      
 |      DOCS: https://spacy.io/api/token#len
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  
 |  __reduce__(...)
 |      Token.__reduce__(self)
 |  
 |  __repr__(self, /)
 |      Return re

In [58]:
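bag_of_words = defaultdict(int)
# Skip stop words and key each remaining token by its lemma string,
# so that repeated words share one counter instead of one entry per Token object
for token in b:
    if not token.is_stop:
        bag_of_words[token.lemma_] += 1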

In [59]:
a = np.array([[k, v] for k, v in bag_of_words.items()])
a[a[:, 1].argsort()]

array([['NLP', '1'],
       ['stop', '1'],
       ['etc', '1'],
       ['tokenization', '1'],
       ['stemming', '1'],
       ['lemmatization', '1'],
       ['running', '1'],
       ['run', '1'],
       ['construction', '1'],
       ['constructs', '1'],
       ['construct', '1'],
       ['goal', '1'],
       ['create', '1'],
       ['numerical', '1'],
       ['representation', '1'],
       ['corpus', '1'],
       ['remove', '1'],
       ['consumption', '1'],
       ['like', '1'],
       ['computer', '1'],
       ['impovee', '1'],
       ['use', '1'],
       ['different', '1'],
       ['related', '1'],
       ['closely', '1'],
       ['dimensionality', '1'],
       ['IDF', '1'],
       ['reduction', '1'],
       ['common', '1'],
       ['signal', '1'],
       ['TI', '1'],
       ['bag', '1'],
       ['>', '2'],
       ['-', '2'],
       ['/', '3'],
       ['(', '3'],
       ['technique', '3'],
       [')', '3'],
       ['word', '3'],
       ['.', '4'],
       [',', '7']], dtype='<U21')

Note how the tokens now include a lot of punctuation. We could (and should) remove these tokens, but we are now hitting the point of diminishing returns. Vector embeddings will solve nearly all of our problems. 
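Dropping the punctuation entries can be sketched without any library support, assuming the counts are held in a plain dict keyed by token text (the helper name `drop_punctuation` is hypothetical). spaCy tokens also expose an `is_punct` flag that could serve the same purpose inside the pipeline.

```python
def drop_punctuation(counts):
    """Keep only entries whose token contains at least one letter or digit."""
    return {tok: n for tok, n in counts.items()
            if any(ch.isalnum() for ch in tok)}


# A few entries copied from the output above
counts = {"technique": 3, "word": 3, ",": 7, ".": 4, "(": 3, "-": 2, ">": 2}
filtered = drop_punctuation(counts)
print(filtered)
```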