# Lab 02: Introduction to Text Preprocessing & the Spacy Toolkit

### Objectives:
1. Get familiar with basic text preprocessing pipelines
2. Get familiar with regular expressions, and the `re` package in Python
3. Evaluate the lexical diversity of the data in each category within the 20 Newsgroups dataset
4. Use normalized BOW features to evaluate text similarity using the KL-divergence

### Required Reading:

1. https://universaldependencies.org/u/pos/
2. https://spacy.io/api/annotation#pos-tagging
3. https://spacy.io/api/annotation#dependency-parsing

# Part I: Introduction to Spacy

### Download Spacy's base English language *pipeline* components

``$ python -m spacy download en_core_web_sm``

What is a Spacy *pipeline*? A Spacy pipeline is an extensible tool that streamlines many of the common tasks in NLP, such as tokenization, part-of-speech tagging, named entity recognition, lemmatization, and dependency parsing. It also has custom pipeline components specifically for transformers. It is built for production use; much thought and care has gone into its API and implementation. You can actually configure Spacy to use some of the statistical models that we will discover in this class; for now we're just going to cover some of the basics.

In [1]:
import os
import sys
os.path.dirname(sys.executable)

'C:\\Users\\sangi\\anaconda3'

In [2]:
import spacy

pipeline = spacy.load('en_core_web_sm')

#import en_core_web_sm

#pipeline = en_core_web_sm.load()

### Download the 20 Newsgroups dataset using the sklearn package

This data consists of Usenet newsgroup posts from 20 different categories.

In [3]:
from sklearn.datasets import fetch_20newsgroups
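import ssl
# Workaround for certificate errors during the download: this disables HTTPS
# certificate verification for the whole process, so use it with care
ssl._create_default_https_context = ssl._create_unverified_context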

ng_train = fetch_20newsgroups(subset='train')
ng_test = fetch_20newsgroups(subset='test')
ng_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Get the number of training & test examples

In [4]:
len(ng_train.data), len(ng_test.data)

(11314, 7532)

### Take a peek at the first document and its label

In [5]:
ng_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [6]:
label_idx = ng_train.target[0]
ng_train.target_names[label_idx]

'rec.autos'

### Inspect Spacy's token annotations (lemma, POS, dependency)

In [7]:
from pprint import pprint

doc = pipeline(ng_train.data[0])
for i, token in enumerate(doc):
    pprint({"text": token.text,
            "lemma": token.lemma_,
            "POS": token.pos_,
            "tag": token.tag_,
            "dep": token.dep_,
            "shape": token.shape_,
            "is_alpha": token.is_alpha,
            "is_stop": token.is_stop})
    if i == 3:
        break

{'POS': 'ADP',
 'dep': 'ROOT',
 'is_alpha': True,
 'is_stop': True,
 'lemma': 'from',
 'shape': 'Xxxx',
 'tag': 'IN',
 'text': 'From'}
{'POS': 'PUNCT',
 'dep': 'punct',
 'is_alpha': False,
 'is_stop': False,
 'lemma': ':',
 'shape': ':',
 'tag': ':',
 'text': ':'}
{'POS': 'PROPN',
 'dep': 'pobj',
 'is_alpha': False,
 'is_stop': False,
 'lemma': 'lerxst@wam.umd.edu',
 'shape': 'xxxx@xxx.xxx.xxx',
 'tag': 'NNP',
 'text': 'lerxst@wam.umd.edu'}
{'POS': 'PUNCT',
 'dep': 'punct',
 'is_alpha': False,
 'is_stop': False,
 'lemma': '(',
 'shape': '(',
 'tag': '-LRB-',
 'text': '('}


### Visualize Spacy's dependency parse

In [8]:
from spacy import displacy

displacy.render(doc, style='dep')

### Let's define a preprocessing function that cleans our data

You'll notice that even the lemmatized text contains meaningless tokens. In the real world you're never going to get around having to do some feature engineering. In NLP this often means writing some regexes to transform text into a usable format. This has become less important in the deep learning era, but applying domain-specific knowledge is always beneficial. In the case of this dataset, we have text that originated in Usenet newsgroups, some of which is messy. There are email addresses and URLs, grammatical errors, and a lot of punctuation and uninformative characters (e.g., the newline character `\n`). Below is some very basic regex (regular expression) matching to strip out emails, HTML links, punctuation, and other junk, followed by a pipeline component that removes stop words and lemmatizes.

In [9]:
import re
from spacy.language import Language


# http://emailregex.com/
email_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

# replace = [ (pattern-to-replace, replacement),  ...]
replace = [
    (r"<a[^>]*>(.*?)</a>", r"\1"),  # Replace HTML anchor tags with their link text
    (email_re, "email"),            # Replace emails (the pattern is lowercase-only)
    (r"(?<=\d),(?=\d)", ""),        # Remove commas in numbers
    (r"\d+", "numbr"),              # Map digit runs to the special token "numbr"
    (r"[\t\n\r\*\.\@\,\-\/]", " "),   # Punctuation and other junk
    (r"\s+", " ")                   # Strips extra whitespace
]

train_text = ng_train.data
test_text = ng_test.data
for repl in replace:
    train_text = [re.sub(repl[0], repl[1], text) for text in train_text]
    test_text = [re.sub(repl[0], repl[1], text) for text in test_text]

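@Language.component("ng20")
def ng20_preprocess(doc):
    # Drop stop words and punctuation, then lemmatize and lowercase what is left
    tokens = [token for token in doc
              if not any((token.is_stop, token.is_punct))]
    tokens = [token.lemma_.lower().strip() for token in tokens]
    tokens = [token for token in tokens if token]
    # Note: this returns a plain string rather than a Doc, so it only works
    # as the last component in the pipeline
    return " ".join(tokens)

pipeline.add_pipe("ng20");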
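
To see what the substitution rules do, and why their order matters, here is a minimal sketch that applies them to a short made-up string. It uses only the standard library. `simple_email_re`, `rules`, `clean`, and `sample` are illustrative names, and the short email pattern is a simplified stand-in for the long one above, not an equivalent of it.

```python
import re

# Simplified stand-in for the long email pattern (an assumption, for brevity)
simple_email_re = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

# Same order as the lab's list: emails first, then numbers, then punctuation
rules = [
    (simple_email_re, "email"),           # emails -> "email"
    (r"(?<=\d),(?=\d)", ""),              # 1,200 -> 1200
    (r"\d+", "numbr"),                    # digit runs -> "numbr"
    (r"[\t\n\r\*\.\@\,\-\/]", " "),       # punctuation and other junk -> space
    (r"\s+", " "),                        # collapse runs of whitespace
]

def clean(text):
    """Apply every (pattern, replacement) pair in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

sample = "From: lerxst@wam.umd.edu\nLines: 15\nPrice: 1,200 for a 2-door car."
cleaned = clean(sample)
print(cleaned)
```

The email rule has to run before the punctuation rule; otherwise the `@` and `.` characters would already have been replaced by spaces and the address would no longer match.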

#### Peek at our processing pipeline

In [10]:
pipeline.analyze_pipes(pretty=True)

#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False      
                                                                                     
1   tagger            token.tag                        tag_acc            False      
                                                                                     
2   parser            token.dep                        dep_uas            False      
                      token.head                       dep_las                       
                      token.is_sent_start              dep_las_per_type              
                      doc.sents                        sents_p                       
                                                       sents_r                       
                                                

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ng20': {'assigns': [], 'requires': [], 'scores': [], 'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
 

### Now pass the first 500 training and test documents through the pipeline

In [11]:
docs_train = [pipeline(doc) for doc in train_text[:500]]
docs_test = [pipeline(doc) for doc in test_text[:500]]

### Let's look at that first document following this transformation and compare it to the original text

In [12]:
ng_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [13]:
docs_train[0]

'email thing subject car nntp posting host racnumbr wam umd edu organization university maryland college park lines numbr wonder enlighten car see day numbr door sport car look late numbrs early numbrs call bricklin door small addition bumper separate rest body know tellme model engine spec year production car history info funky look car e mail thanks il bring neighborhood lerxst'

# Part II: Lexical diversity

Sometimes it's useful to understand how diverse the language in some body of text is. One simple heuristic to evaluate diversity is as follows: 

$$ lexical\_diversity = \frac{ len(set(all\_words\_in\_doc)) }{ len(doc) }$$

Find the set of all words observed in the document, and divide its size by the total number of words in the document. Let's use this to evaluate the diversity of each category in the 20NG dataset.
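
As a small worked example of the heuristic, the sketch below computes it for a toy string. The function name `lexical_diversity` and the sample text are made up for illustration, and words are taken to be whitespace-separated tokens.

```python
def lexical_diversity(text):
    """Number of unique words divided by the total number of words."""
    words = text.split()
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(words)) / len(words)

sample = "car door car engine car door"
score = lexical_diversity(sample)  # 3 unique words out of 6
print(score)
```

A score of 1.0 means no word is ever repeated, and the score shrinks as words recur.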

### (5 pts) Task I: 
In the cell below, compute the diversity of each category in the 20NG dataset using the above heuristic

In [14]:
# Your code goes here

print('==== Test Single Doc =====')
words = docs_train[0].split(' ')
# Total number of tokens in the doc
print(len(words))
# Number of unique tokens in the doc
print(len(set(words)))
# Lexical diversity
print(len(set(words)) / len(words))
print('===== End Test =====\n\n')

# Concatenate the processed training docs of each category into one string.
# Only the first 500 training docs were processed, so zip stops there.
cat_docs = {cat: '' for cat in ng_train.target_names}
for doc, label_idx in zip(docs_train, ng_train.target):
    cat_docs[ng_train.target_names[label_idx]] += ' ' + doc

for cat, doc in cat_docs.items():
    # Each string starts with ' ', so split(' ') also yields one empty token;
    # doc.split() would drop it
    words = doc.split(' ')
    print('===== Category: ' + cat + ' =====')
    print('Total: ' + str(len(words)))
    print('Unique: ' + str(len(set(words))))
    print('Diversity: ' + str(len(set(words)) / len(words)))
    print('==========')

==== Test Single Doc =====
61
53
0.8688524590163934
===== End Test =====


===== Category: alt.atheism =====
Total: 5935
Unique: 1809
Diversity: 0.30480202190395955
===== Category: comp.graphics =====
Total: 3054
Unique: 1165
Diversity: 0.38146692861820564
===== Category: comp.os.ms-windows.misc =====
Total: 18592
Unique: 8640
Diversity: 0.46471600688468157
===== Category: comp.sys.ibm.pc.hardware =====
Total: 2060
Unique: 832
Diversity: 0.40388349514563104
===== Category: comp.sys.mac.hardware =====
Total: 3553
Unique: 1237
Diversity: 0.3481564874753729
===== Category: comp.windows.x =====
Total: 2180
Unique: 791
Diversity: 0.3628440366972477
===== Category: misc.forsale =====
Total: 2799
Unique: 1119
Diversity: 0.39978563772775993
===== Category: rec.autos =====
Total: 4242
Unique: 1232
Diversity: 0.29042904290429045
===== Category: rec.motorcycles =====
Total: 3651
Unique: 1495
Diversity: 0.4094768556559847
===== Category: rec.sport.baseball =====
Total: 4345
Unique: 1227
Diversity:

### Explain these scores: 

1. Is this result real or an artifact of some underlying problem with our data? 
The result is "real" in the sense that it is a reflection of our data. However, although the pipeline lemmatizes the text, nothing corrects misspellings or removes leftover junk tokens, and each of these counts as a new unique word, which may artificially inflate the diversity scores.
2. What might you do to better evaluate lexical diversity on this data using this scoring function?
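I would first suggest additional preprocessing, such as spell checking and filtering out non-word tokens. I would also compare categories on samples of equal length, because the ratio of unique to total words tends to fall as a text gets longer.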
3. Is this heuristic a good metric for lexical diversity in general?
This scoring function is very simple to implement and understand, which is useful as a data exploration tool before using something more powerful (like a language model). However, it is very simplistic, and should not be taken as ground truth on its own

### Entropy 
Entropy is another, perhaps more principled, way to evaluate how diverse, or varied, a piece of text is. Recall the definition of entropy, $H(P(x))$:

$$ H(P(x)) = \sum_{i=1}^{N} -P(x_{i}) \log P(x_{i}) $$

In the Bag-of-Words (BOW) feature representation of a document, each document is represented by a word count vector, ${x}_{i} \in \mathbb{R}^{N}$ where $N$ is the cardinality of the set of words in the document.
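
The step from word counts to an entropy can be sketched with the standard library alone. This is a toy illustration under simple assumptions, not the lab solution: `word_entropy` is a made-up name, words are whitespace-separated tokens, and the natural log is used, so the result is in nats.

```python
import math
from collections import Counter

def word_entropy(text):
    """Entropy (in nats) of the word distribution of a text."""
    counts = Counter(text.split())
    total = sum(counts.values())
    # Normalize counts so they sum to 1 and form a probability distribution
    probs = [count / total for count in counts.values()]
    return -sum(p * math.log(p) for p in probs)

uniform = word_entropy("a b c d")  # four equally likely words: log(4)
skewed = word_entropy("a a a b")   # one dominant word: lower entropy
print(uniform, skewed)
```

Because every word that appears has a count of at least 1, no probability is zero here and no smoothing is needed for the logarithm.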

### (5 pts) Task II:
In order to compute an entropy from this representation, you'll first need to convert those count vectors into probability distributions. Then compute the entropy of the word distributions for each news category.

In [36]:
# Your code goes here
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

cats_entropy = {}

for idx, doc in cat_docs.items():
    print("====== Category: " + idx + " =======")
    # Fit a vocabulary on this category alone and count its words
    count_vectorizer = CountVectorizer()
    count_vector = count_vectorizer.fit_transform([doc])
    doc_array = count_vector.toarray()
    print("Raw CountVectorized Category")
    print(doc_array)
    doc_length = sum(doc_array[0])
    # The small epsilon guards against log(0); every count here is at least 1,
    # so it barely changes the probabilities
    prob_array = doc_array / doc_length + 0.000000001
    print("Array of Probabilities for Each Word in the Category")
    print(prob_array)
    print("Entropy of the Category")
    # Natural log, so the entropy is measured in nats
    cats_entropy[idx] = sum(-1 * prob_array[0] * np.log(prob_array[0]))
    print(cats_entropy[idx])
    print("======================")

Raw CountVectorized Category
[[4 2 1 ... 1 1 1]]
Array of Probabilities for Each Word in the Category
[[0.00077535 0.00038767 0.00019384 ... 0.00019384 0.00019384 0.00019384]]
Entropy of the Category
6.869456078195318
Raw CountVectorized Category
[[ 4  1 10 ...  1  2  2]]
Array of Probabilities for Each Word in the Category
[[0.00156924 0.00039231 0.00392311 ... 0.00039231 0.00078462 0.00078462]]
Entropy of the Category
6.393763296369005
Raw CountVectorized Category
[[4 1 1 ... 4 1 1]]
Array of Probabilities for Each Word in the Category
[[1.62900613e-04 4.07259033e-05 4.07259033e-05 ... 1.62900613e-04
  4.07259033e-05 4.07259033e-05]]
Entropy of the Category
5.8767679160164255
Raw CountVectorized Category
[[  1   1   1   2   1   1   1   1   1   4   1   1   1   2   3   1   1   1
    1   1   1   1   2   2   1   1   1   1   1   1   1   3   2   1   1   1
    1   1   1   1   1   1   3   1   1   4   1   2   1   1   1   3   1   1
    1   1   5   1   2   2   1   1   3   1   1   2   1   1   1 

[[5 1 4 ... 1 2 1]]
Array of Probabilities for Each Word in the Category
[[0.00153799 0.0003076  0.00123039 ... 0.0003076  0.0006152  0.0003076 ]]
Entropy of the Category
6.5268719073619
Raw CountVectorized Category
[[1 1 1 ... 1 1 1]]
Array of Probabilities for Each Word in the Category
[[0.00037994 0.00037994 0.00037994 ... 0.00037994 0.00037994 0.00037994]]
Entropy of the Category
6.443489076270042
Raw CountVectorized Category
[[1 1 1 ... 2 1 4]]
Array of Probabilities for Each Word in the Category
[[0.00034376 0.00034376 0.00034376 ... 0.00068752 0.00034376 0.00137504]]
Entropy of the Category
6.43738449750586
Raw CountVectorized Category
[[1 1 2 ... 1 1 3]]
Array of Probabilities for Each Word in the Category
[[0.00028019 0.00028019 0.00056038 ... 0.00028019 0.00028019 0.00084057]]
Entropy of the Category
6.751579009113472
Raw CountVectorized Category
[[2 1 2 ... 1 1 9]]
Array of Probabilities for Each Word in the Category
[[0.0003649  0.00018245 0.0003649  ... 0.00018245 0.000182

In [37]:
cats_entropy

{'alt.atheism': 6.869456078195318,
 'comp.graphics': 6.393763296369005,
 'comp.os.ms-windows.misc': 5.8767679160164255,
 'comp.sys.ibm.pc.hardware': 5.938334087445443,
 'comp.sys.mac.hardware': 6.33100341828723,
 'comp.windows.x': 6.053528235530873,
 'misc.forsale': 5.938933518382941,
 'rec.autos': 6.305344391172238,
 'rec.motorcycles': 6.630437757846492,
 'rec.sport.baseball': 4.718657675587231,
 'rec.sport.hockey': 6.5268719073619,
 'sci.crypt': 6.443489076270042,
 'sci.electronics': 6.43738449750586,
 'sci.med': 6.751579009113472,
 'sci.space': 6.728032481958939,
 'soc.religion.christian': 6.853443909418548,
 'talk.politics.guns': 6.555393945367377,
 'talk.politics.mideast': 6.749731586576314,
 'talk.politics.misc': 6.640369894798911,
 'talk.religion.misc': 6.013214862449288}

### Explain this result

1. What does it mean for a distribution to have high or low entropy?
A distribution with high entropy has high "chaos": in our case, the probability mass is spread across many different words, so word usage is more diverse. Put another way, the higher the entropy, the closer the word probabilities are to a uniform distribution; a low-entropy distribution is dominated by a few very frequent words.
2. Do these scores make intuitive sense? Any more or less so than the heuristic from Task I?
The scores make sense relatively speaking, but there isn't a shared sense of "range": entropy runs from 0 up to $\log N$, and $N$ (the vocabulary size) differs per category, so it's not as intuitive to compare two values and decide whether they are very similar or very different in terms of entropy.
3. Is entropy a good metric for evaluating lexical diversity in general?
In general, it seems like a step up from our simple heuristic, but the score is relatively meaningless in a vacuum.
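
One possible way to make entropy scores easier to compare is to divide by the maximum entropy, $\log N$, that a vocabulary of $N$ distinct words can reach, which puts every score between 0 and 1. The sketch below is an illustration only and was not part of the lab; `normalized_entropy` is a made-up name.

```python
import math
from collections import Counter

def normalized_entropy(text):
    """Entropy of the word distribution divided by its maximum, log(N)."""
    counts = Counter(text.split())
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # log(1) = 0, so the ratio is undefined; treat as 0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

even = normalized_entropy("a b c d")            # uniform over 4 words
skewed = normalized_entropy("a a a a a b c d")  # dominated by "a"
print(even, skewed)
```

A value near 1 means the words are used almost uniformly, regardless of how large the vocabulary is.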

# Part III: Document Similarity

Throughout this course we will discuss the notion of *similarity* between texts and explore ways to measure it. This is a critical component of search and recommender systems. One such approach involves measuring how *close* two word distributions are using the notion of divergence, which we discussed in the first lecture.

### (10 pts) Task III

Using the definition below, compute the KL-divergence, $D_{KL}$, between the word distributions in each category. This will result in a $K \times K$ matrix of divergence values.

$$ D_{KL}(P||Q) = \sum_{i=1}^{N} P(x_{i}) \log \frac{P(x_{i})}{Q(x_{i})} $$
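
A toy illustration of the formula, using only the standard library. The names `distribution` and `kl_divergence`, the two sample strings, and the smoothing constant `eps` are assumptions made for this sketch. It builds both distributions over the union of the two vocabularies, which is one possible choice among several for handling words that appear in only one text.

```python
import math
from collections import Counter

def distribution(text, vocab, eps=1e-9):
    """Word probabilities over a fixed vocabulary, with a small epsilon
    added so that no entry is exactly zero."""
    counts = Counter(text.split())
    total = sum(counts[word] for word in vocab)
    return [counts[word] / total + eps for word in vocab]

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) * log(P(x_i) / Q(x_i))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

doc_p = "car engine car door"
doc_q = "car game team game"
vocab = sorted(set(doc_p.split()) | set(doc_q.split()))

p = distribution(doc_p, vocab)
q = distribution(doc_q, vocab)
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
print(d_pq, d_qp)
```

The two directions give different values, and most of each value comes from words that one text uses and the other does not, so the size of `eps` has a visible effect on the result.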

In [38]:
# Your code goes here

cats_divergence = []

for idx, doc in cat_docs.items():
    divergences = []
    for idx_2, doc_2 in cat_docs.items():
        count_vectorizer = CountVectorizer()
        count_vectorizer.fit([doc]) # Definitely a better way to do this, but works for testing for now
        count_vector = count_vectorizer.transform([doc])
        count_vector_2 = count_vectorizer.transform([doc_2])
        doc_array = count_vector.toarray()
        doc_length = sum(doc_array[0])
        prob_array = doc_array / doc_length + 0.000000001 # Cheating here to avoid divide by 0 errors
        doc_array = count_vector_2.toarray()
        doc_length = sum(doc_array[0])
        prob_array_2 = doc_array / doc_length + 0.000000001 # Cheating here to avoid divide by 0 errors
        print("KL Divergence between the categories " + idx + " and " + idx_2)
        divergences.append(sum(prob_array[0] * np.log(prob_array[0] / prob_array_2[0])))
        print(sum(prob_array[0] * np.log(prob_array[0] / prob_array_2[0])))
    cats_divergence.append(divergences)

KL Divergence between the categories alt.atheism and alt.atheism
0.0
KL Divergence between the categories alt.atheism and comp.graphics
8.565015344422772
KL Divergence between the categories alt.atheism and comp.os.ms-windows.misc
9.346158268758707
KL Divergence between the categories alt.atheism and comp.sys.ibm.pc.hardware
9.420578682721715
KL Divergence between the categories alt.atheism and comp.sys.mac.hardware
8.546375889017499
KL Divergence between the categories alt.atheism and comp.windows.x
9.501084514672755
KL Divergence between the categories alt.atheism and misc.forsale
9.768199409465113
KL Divergence between the categories alt.atheism and rec.autos
8.511268390165545
KL Divergence between the categories alt.atheism and rec.motorcycles
8.08499505310392
KL Divergence between the categories alt.atheism and rec.sport.baseball
9.175502720823403
KL Divergence between the categories alt.atheism and rec.sport.hockey
8.5306180903885
KL Divergence between the categories alt.atheism 

6.91927477656122
KL Divergence between the categories comp.sys.mac.hardware and sci.electronics
5.734646638454139
KL Divergence between the categories comp.sys.mac.hardware and sci.med
6.591144345266142
KL Divergence between the categories comp.sys.mac.hardware and sci.space
5.533694487444402
KL Divergence between the categories comp.sys.mac.hardware and soc.religion.christian
6.906519507358602
KL Divergence between the categories comp.sys.mac.hardware and talk.politics.guns
6.983727083947898
KL Divergence between the categories comp.sys.mac.hardware and talk.politics.mideast
7.003225351484783
KL Divergence between the categories comp.sys.mac.hardware and talk.politics.misc
7.476854023846409
KL Divergence between the categories comp.sys.mac.hardware and talk.religion.misc
8.474059614895241
KL Divergence between the categories comp.windows.x and alt.atheism
7.408140654181437
KL Divergence between the categories comp.windows.x and comp.graphics
6.4985954933572465
KL Divergence between th

5.33754501189306
KL Divergence between the categories rec.sport.baseball and rec.autos
5.014479671320577
KL Divergence between the categories rec.sport.baseball and rec.motorcycles
5.0620267928517215
KL Divergence between the categories rec.sport.baseball and rec.sport.baseball
0.0
KL Divergence between the categories rec.sport.baseball and rec.sport.hockey
4.526072103099929
KL Divergence between the categories rec.sport.baseball and sci.crypt
5.481249782054187
KL Divergence between the categories rec.sport.baseball and sci.electronics
5.052286748703527
KL Divergence between the categories rec.sport.baseball and sci.med
5.060863789141827
KL Divergence between the categories rec.sport.baseball and sci.space
4.5776431970237565
KL Divergence between the categories rec.sport.baseball and soc.religion.christian
5.112327006066595
KL Divergence between the categories rec.sport.baseball and talk.politics.guns
5.1920816143534845
KL Divergence between the categories rec.sport.baseball and talk.p

KL Divergence between the categories sci.space and comp.os.ms-windows.misc
7.9909860865975455
KL Divergence between the categories sci.space and comp.sys.ibm.pc.hardware
8.338032506980337
KL Divergence between the categories sci.space and comp.sys.mac.hardware
7.4859770463895785
KL Divergence between the categories sci.space and comp.windows.x
8.339055725789212
KL Divergence between the categories sci.space and misc.forsale
8.38726432408292
KL Divergence between the categories sci.space and rec.autos
7.701525526202069
KL Divergence between the categories sci.space and rec.motorcycles
7.382167660569247
KL Divergence between the categories sci.space and rec.sport.baseball
8.304697485302675
KL Divergence between the categories sci.space and rec.sport.hockey
7.7027023162352
KL Divergence between the categories sci.space and sci.crypt
7.8315401853122655
KL Divergence between the categories sci.space and sci.electronics
7.146145456853608
KL Divergence between the categories sci.space and sci

7.684062167975557
KL Divergence between the categories talk.politics.misc and sci.space
7.076019830068062
KL Divergence between the categories talk.politics.misc and soc.religion.christian
6.686945629281
KL Divergence between the categories talk.politics.misc and talk.politics.guns
7.602428067834621
KL Divergence between the categories talk.politics.misc and talk.politics.mideast
7.073214682964968
KL Divergence between the categories talk.politics.misc and talk.politics.misc
0.0
KL Divergence between the categories talk.politics.misc and talk.religion.misc
9.410463126514856
KL Divergence between the categories talk.religion.misc and alt.atheism
5.534509904544593
KL Divergence between the categories talk.religion.misc and comp.graphics
8.41651944046314
KL Divergence between the categories talk.religion.misc and comp.os.ms-windows.misc
8.881968135048943
KL Divergence between the categories talk.religion.misc and comp.sys.ibm.pc.hardware
8.931967598996355
KL Divergence between the categor

In [39]:
cats_divergence

[[0.0,
  8.565015344422772,
  9.346158268758707,
  9.420578682721715,
  8.546375889017499,
  9.501084514672755,
  9.768199409465113,
  8.511268390165545,
  8.08499505310392,
  9.175502720823403,
  8.5306180903885,
  8.32611479250726,
  8.580233878324023,
  7.543344116027011,
  7.544549458339563,
  5.548570627320111,
  7.206009312504898,
  6.7921658746207765,
  8.052667551657745,
  8.191787531726746],
 [7.000811619995107,
  0.0,
  6.677246371507498,
  7.4040559887785955,
  6.6645298267997255,
  7.405926200051156,
  7.747993862988402,
  7.662857433470146,
  6.936924211068724,
  7.980475330525258,
  7.600399419443155,
  7.3288933074804525,
  6.883965504075618,
  7.08701679945874,
  6.22147971306508,
  6.904695720147689,
  7.408548216832387,
  7.159798646271147,
  7.65583659897246,
  8.729390581922248],
 [11.29986128757868,
  10.823705245007067,
  0.0,
  10.563411305540095,
  10.53042861132393,
  11.003236852927474,
  10.702682024530358,
  10.928732499513838,
  10.736720640388056,
  10.960

### Explain this result

1. How did you handle any differences in the support for P and Q? What about when Q(x) = 0?
To handle differences between P and Q, I fit my CountVectorizer on P and transformed Q with the same model, so both distributions are defined over P's vocabulary. To handle Q(x) = 0, I cheated a little bit and added an epsilon of $10^{-9}$ to every probability. The epsilon barely changes the nonzero probabilities, but each word of P that is missing from Q contributes a term $P(x) \log(P(x) / 10^{-9})$, so the choice of epsilon does influence the size of the divergences.
2. What does it mean for two distributions to have high or low divergence?
Low divergence means the documents use similar words in similar proportions; 0 means the distributions are exactly the same. High divergence means P puts substantial probability on words that are rare or absent in Q.
3. Do these similarity scores make sense intuitively?
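Yes, this is a very intuitive comparison matrix! For example, alt.atheism is closest to soc.religion.christian (about 5.55) and farthest from misc.forsale (about 9.77).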
4. Is the resultant $K \times K$ matrix symmetric? Why is this the case?
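No. KL-divergence is not symmetric in general: $D_{KL}(P||Q)$ weights each log ratio by $P$, while $D_{KL}(Q||P)$ weights it by $Q$. My implementation adds to the asymmetry, because each row fits the vocabulary on P only, so the upper and lower triangles are computed over different vocabularies.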
5. Is $D_{KL}$ a good measure of the similarity in this context?   
Yes, I believe so. It is a very intuitive comparison measure (in the absence of more advanced techniques that we'll learn later) that can be relatively easily implemented.