# Problem set 4: Textual distances (solution)

## Description

**The goal of this problem set is to calculate the distance between several texts and to examine the effects of different distance metrics and input preprocessing steps.**

You'll use the techniques developed in this problem set in the next several problem sets as well.

Pay attention to the `import` statements in the next cell. We're going to use `scikit-learn` for both vectorization and distance calculations. We could do these steps by hand (indeed, we'll calculate by hand the Euclidean distance between two trivial documents), but `scikit-learn` has several advantages:

* It implements configurable preprocessing steps.
* It's well-vetted, so it is unlikely to contain arithmetic errors.
* It integrates with standard Python machine learning workflows.

OK, to work!

## Imports and setup

In [1]:
import numpy as np
import os
from   sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from   sklearn.metrics.pairwise import euclidean_distances, cosine_distances

dev_text = '''\
My cat likes water. She likes water so much that she often falls in the sink. But she hates to be wet.
The dog eats food all day. The dog is named spot, but has no spots. The dog is furry.
The dog and the cat play together. When they are tired, they eat and drink and sleep.
A dog and a cat meet another dog and cat. They play, but only for a short while. They are not friends.
The bird and the snake run in the woods. They do nothing like what the others do. But their function words do.'''

dev_docs = [doc for doc in dev_text.split('\n')] # Documents are one-per-line

# Print docs for reference
for i, text in enumerate(dev_docs):
    print(i, text)

0 My cat likes water. She likes water so much that she often falls in the sink. But she hates to be wet.
1 The dog eats food all day. The dog is named spot, but has no spots. The dog is furry.
2 The dog and the cat play together. When they are tired, they eat and drink and sleep.
3 A dog and a cat meet another dog and cat. They play, but only for a short while. They are not friends.
4 The bird and the snake run in the woods. They do nothing like what the others do. But their function words do.


## Task: Judging similarity (5 points)

Consider the very short documents in `dev_docs`. Which do you judge to be most similar? Least similar? Why? Answer these questions in a couple of sentences.

**Your answers here**

## Task: Vectorization and Euclidean distance by hand

Consider two sentences (or "sentences"):

* sent_1: "apple red"
* sent_2: "orange orange"

How far apart are these documents in three-dimensional Euclidean space?

First, record the vector representation of each sentence, where the count of the word "apple" is dimension one, "red" is dimension two, and "orange" is dimension three. Your vectors should look like `sent_1 = [x, y, z]` where x, y, and z are integers.

### Vector answer (5 points)

Write the two vectors in Markdown in the cell below:

**Your vector representations here**

* `sent_1 = [1,1,0]`
* `sent_2 = [0,0,2]`
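
One way to check vectors like these is to count each word against a fixed vocabulary order. This is only a minimal sketch: the `vocab` list and the `vectorize` helper are illustrative names, not part of the problem set, and splitting on whitespace is an assumption that works for these two toy sentences.

```python
from collections import Counter

# Dimension order from the task: "apple", "red", "orange"
vocab = ["apple", "red", "orange"]

def vectorize(sentence, vocab):
    """Return word counts for `sentence`, ordered by `vocab` (hypothetical helper)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

sent_1 = vectorize("apple red", vocab)
sent_2 = vectorize("orange orange", vocab)
print(sent_1, sent_2)
```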

Recall that the Euclidean distance between two points is the square root of the sum of the squared differences between the points in each dimension. As an equation:

$dist = \sqrt{a^2 + b^2 + c^2 + ... + n^2}$

where *a*, *b*, *c*, ..., *n* are the differences between the points in each of the *n* dimensions of the vector space.

### Euclidean distance answer (5 points)

Write the Euclidean distance between `sent_1` and `sent_2` in the Markdown cell below:

**Your hand-calculated Euclidean distance answer here**

$dist = \sqrt{1^2 + 1^2 + 2^2} = \sqrt{6} = 2.449...$ (2.4 and variations are fine)
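
The same arithmetic can be checked with `numpy`. This is a small sketch that hard-codes the two vectors from above; `np.linalg.norm` is used only as a cross-check on the step-by-step calculation.

```python
import numpy as np

sent_1 = np.array([1, 1, 0])
sent_2 = np.array([0, 0, 2])

diffs = sent_1 - sent_2            # per-dimension differences: [1, 1, -2]
dist = np.sqrt(np.sum(diffs**2))   # square root of the sum of squared differences
print(dist)

# np.linalg.norm of the difference vector computes the same quantity in one call
assert np.isclose(dist, np.linalg.norm(sent_1 - sent_2))
```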

## Example: Vectorize and calculate distances with `scikit-learn`

As you can see, hand vectorization becomes cumbersome in a hurry and is impossible for most real-world corpora. Let's see how to do it with `scikit-learn`.

We'll begin with `CountVectorizer`, which transforms input texts into vectors of word counts.

A couple of things to note:

* The workflow here is:
  1. Initialize a vectorizer object, then
  1. Use the initialized vectorizer's `fit_transform` method to transform your input texts into vectorized output.
* The vectorizer has many options that control what it does. I've included some of the more useful ones. Make sure you understand what each one does.
  * See also the `CountVectorizer` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
* I've also included some `print` statements that allow us to examine selected properties of the vector matrix and of the vectorizer itself. Make sure you understand what these are and how they are produced.

### Vectorize

In [2]:
# This is a freebie to show how it's done

# Set up the vectorizer object
count_vectorizer = CountVectorizer(
    encoding='utf-8',
    strip_accents='unicode', # or 'ascii' (faster but less robust)
    lowercase=True,
    stop_words=None, # or 'english'
    min_df=1, # include words that occur in as few as a single document
    max_df=1.0, # include words that occur in as many as all documents
    binary=False # True = return 1 if word is present in document, else 0
)

# Perform vectorization
count_matrix = count_vectorizer.fit_transform(dev_docs)

# Print some useful info about our data
print("Matrix shape:", count_matrix.shape)
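# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("\nFeature labels:", count_vectorizer.get_feature_names_out().tolist())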
print("\nStopwords used:", count_vectorizer.get_stop_words())
print("\nDocument 0 vector:", count_matrix[0].toarray())
print("\nDocument 0 words:", count_vectorizer.inverse_transform(count_matrix[0]))

Matrix shape: (5, 60)

Feature labels: ['all', 'and', 'another', 'are', 'be', 'bird', 'but', 'cat', 'day', 'do', 'dog', 'drink', 'eat', 'eats', 'falls', 'food', 'for', 'friends', 'function', 'furry', 'has', 'hates', 'in', 'is', 'like', 'likes', 'meet', 'much', 'my', 'named', 'no', 'not', 'nothing', 'often', 'only', 'others', 'play', 'run', 'she', 'short', 'sink', 'sleep', 'snake', 'so', 'spot', 'spots', 'that', 'the', 'their', 'they', 'tired', 'to', 'together', 'water', 'wet', 'what', 'when', 'while', 'woods', 'words']

Stopwords used: None

Document 0 vector: [[0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 2 0 1 1 0 0 0 0 1 0 0
  0 0 3 0 1 0 0 1 0 0 1 1 0 0 0 1 0 2 1 0 0 0 0 0]]

Document 0 words: [array(['my', 'cat', 'likes', 'water', 'she', 'so', 'much', 'that',
       'often', 'falls', 'in', 'the', 'sink', 'but', 'hates', 'to', 'be',
       'wet'], dtype='<U8')]


### Calculate the pairwise distances between the documents

Once we have our feature matrix from the previous step, it's a one-liner to calculate the pairwise distances between the objects. Note that the `euclidean_distances` and `cosine_distances` functions were imported from `sklearn` at the top of the notebook.

We're operating naïvely here. We haven't normalized length or removed stopwords. And we think that Euclidean distances might not be very useful anyway. But it's a start.

#### Naïve Euclidean distances

In [3]:
# Freebie for illustration purposes
euc = euclidean_distances(count_matrix)
euc

array([[0.        , 7.54983444, 7.28010989, 7.28010989, 7.74596669],
       [7.54983444, 0.        , 6.4807407 , 6.78232998, 6.8556546 ],
       [7.28010989, 6.4807407 , 0.        , 4.69041576, 6.40312424],
       [7.28010989, 6.78232998, 4.69041576, 0.        , 7.54983444],
       [7.74596669, 6.8556546 , 6.40312424, 7.54983444, 0.        ]])

This output array is ordered by sentence number in both columns and rows. In other words, to find the distance between sentence 0 and sentence 1, you can count zero rows down and 1 over, or 1 down and zero over. Note that the distance values in those two matrix positions are identical (7.5498...).
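
To make the indexing concrete, here is a minimal sketch that works on a copy of the matrix printed above (the values are pasted in so the snippet runs on its own). It checks the symmetry just described and then looks up the closest and farthest pairs. Masking the diagonal with `np.inf` is one possible way to skip each document's zero distance to itself, not the only one.

```python
import numpy as np

# Euclidean distance matrix copied from the output above
euc = np.array([
    [0.        , 7.54983444, 7.28010989, 7.28010989, 7.74596669],
    [7.54983444, 0.        , 6.4807407 , 6.78232998, 6.8556546 ],
    [7.28010989, 6.4807407 , 0.        , 4.69041576, 6.40312424],
    [7.28010989, 6.78232998, 4.69041576, 0.        , 7.54983444],
    [7.74596669, 6.8556546 , 6.40312424, 7.54983444, 0.        ],
])

# The matrix is symmetric: distance(i, j) == distance(j, i)
assert np.allclose(euc, euc.T)
print("Distance between 0 and 1:", euc[0, 1], euc[1, 0])

# Ignore the zero diagonal when searching for the smallest distance
masked = euc.copy()
np.fill_diagonal(masked, np.inf)
closest = np.unravel_index(np.argmin(masked), masked.shape)
farthest = np.unravel_index(np.argmax(euc), euc.shape)
print("Closest pair:", closest)
print("Farthest pair:", farthest)
```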

#### Naïve cosine distances

In [4]:
# Freebie, ditto
cos = cosine_distances(count_matrix)
cos

array([[0.        , 0.87690851, 0.89793793, 0.89793793, 0.8322949 ],
       [0.87690851, 0.        , 0.69848866, 0.76549118, 0.64218678],
       [0.89793793, 0.69848866, 0.        , 0.40740741, 0.6044226 ],
       [0.89793793, 0.76549118, 0.40740741, 0.        , 0.84785485],
       [0.8322949 , 0.64218678, 0.6044226 , 0.84785485, 0.        ]])
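
For comparison with the hand-calculated Euclidean distance earlier, cosine distance can be sketched by hand as well: it is 1 minus the cosine of the angle between two vectors. The toy vectors and the `cosine_distance` helper below are illustrative and are not taken from `dev_docs`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Toy count vectors (illustrative values)
a = np.array([2, 1, 0, 1])
b = np.array([4, 2, 0, 2])   # same direction as `a`, twice the length
c = np.array([0, 0, 3, 0])   # shares no nonzero dimension with `a`

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v (hypothetical helper)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance(a, b))   # near 0: length differs, direction does not
print(cosine_distance(a, c))   # 1: the vectors are orthogonal

# Agrees with scikit-learn's pairwise implementation
assert np.isclose(cosine_distance(a, c), cosine_distances([a], [c])[0, 0])
```

Unlike Euclidean distance, the result does not change when one vector is scaled, which is why cosine distance is less sensitive to document length.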

### Task: Assess naïve distance results (10 points)

Offer a few brief reflections on the distance metrics we just calculated. Which sentences are closest to one another in each case? Which are furthest apart? Do these results make any kind of sense? If you're surprised, where do you think the method (or your intuition) goes wrong?

**Your answer here**

## Explore the impact of preprocessing and normalization

**In the cell below, set up a new vectorizer that removes stopwords.** Keep the print statements from above. Then, in subsequent cells, calculate Euclidean and cosine distances on the new feature matrix. Finally, compare briefly the distance results without stopwords to those that retain stopwords.

### Vectorize the dev documents, removing stopwords (5 points)

In [5]:
stopword_vectorizer = CountVectorizer(
    encoding='utf-8',
    strip_accents='unicode',
    lowercase=True,
    stop_words='english',
    min_df=1, # include words that occur in as few as a single document
    max_df=1.0, # include words that occur in as many as all documents
    binary=False # True = return 1 if word is present in document, else 0
)
count_matrix_stopsremoved = stopword_vectorizer.fit_transform(dev_docs)
print("Matrix shape:", count_matrix_stopsremoved.shape)
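# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("\nFeature labels:", stopword_vectorizer.get_feature_names_out().tolist())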
print("\nStopwords used:", sorted(stopword_vectorizer.get_stop_words()))
print("\nDocument 0 vector:", count_matrix_stopsremoved[0].toarray())
print("\nDocument 0 words:", stopword_vectorizer.inverse_transform(count_matrix_stopsremoved[0]))

Matrix shape: (5, 30)

Feature labels: ['bird', 'cat', 'day', 'dog', 'drink', 'eat', 'eats', 'falls', 'food', 'friends', 'function', 'furry', 'hates', 'like', 'likes', 'meet', 'named', 'play', 'run', 'short', 'sink', 'sleep', 'snake', 'spot', 'spots', 'tired', 'water', 'wet', 'woods', 'words']

Stopwords used: ['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', '

### Distances (5 points total)

**Calculate (and display) Euclidean distances between your newly vectorized documents.**

In [6]:
# Calculate Euclidean distances for the new, stopword-removed feature matrix
euclidean_distances(count_matrix_stopsremoved)

array([[0.        , 5.38516481, 4.24264069, 4.58257569, 4.47213595],
       [5.38516481, 0.        , 4.12310563, 4.        , 4.79583152],
       [4.24264069, 4.12310563, 0.        , 3.        , 3.74165739],
       [4.58257569, 4.        , 3.        , 0.        , 4.35889894],
       [4.47213595, 4.79583152, 3.74165739, 4.35889894, 0.        ]])

**Calculate (and display) cosine distances on the same matrix**

In [7]:
# Calculate cosine distances for the new, stopword-removed feature matrix
cosine_distances(count_matrix_stopsremoved)

array([[0.        , 1.        , 0.89517152, 0.83987185, 1.        ],
       [1.        , 0.        , 0.71652665, 0.5669873 , 1.        ],
       [0.89517152, 0.71652665, 0.        , 0.45445527, 1.        ],
       [0.83987185, 0.5669873 , 0.45445527, 0.        , 1.        ],
       [1.        , 1.        , 1.        , 1.        , 0.        ]])

### Task: Compare these results to the previous ones (5 points)

How do the distances calculated after removing stopwords compare to those obtained when retaining stopwords?

**Your answer here**

## Normalize and try out TF-IDF weighting

As noted in class, Euclidean distances are strongly influenced by document length. One way to minimize the impact of length is to *normalize* our vectors.

To normalize our vectors, we need to switch to `TfidfVectorizer` (or else be prepared to do some math on our own). `TfidfVectorizer` has two helpful features that are not present in `CountVectorizer`:

1. `TfidfVectorizer` applies selectable normalization to the calculated vectors. This prevents long documents from being far away from short documents simply because the documents contain different numbers of words.
  1. There are two built-in normalization methods. `'l1'` produces vectors the elements of which sum to 1. `'l2'` produces vectors whose *squared* elements sum to 1.
  1. One sees `l2` normalization used more often than `l1` for distance calculations. With `l2`, every document vector has unit length, so Euclidean distance depends only on the angle between vectors and ranks document pairs exactly as cosine distance does. With `l1`, each element is the word's share of the document's total weight, which is easy to read as a relative frequency. Note that neither method changes which elements are zero, so normalization alone does not make feature vectors sparser (the association of `l1` with sparsity comes from *regularization* in model fitting, which is a different operation).
1. `TfidfVectorizer` allows us to apply [TF-IDF weighting](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), if desired. This is a method of downweighting words that appear in many documents (on the theory that words shared by many documents are less likely to tell us much about the distinctive content of any one document than are words that are not so widely shared). TF-IDF is a classic preprocessing step in many information retrieval tasks, though it's unclear how much it helps when assessing document similarity. Since it's a selectable flag in `TfidfVectorizer`, we can try it out and see what difference it makes.
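
The two normalization methods can be sketched on a toy matrix with `sklearn.preprocessing.normalize`, a standalone function that performs the same kind of row scaling. The count values here are made up for illustration: the second row is simply the first row doubled, standing in for a document that is twice as long but has the same word proportions.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Two toy count vectors with the same proportions but different lengths
counts = np.array([[1.0, 2.0, 2.0],
                   [2.0, 4.0, 4.0]])

l1 = normalize(counts, norm='l1')  # each row's elements sum to 1
l2 = normalize(counts, norm='l2')  # each row's squared elements sum to 1

print(l1)
print(l2)
print("Rows identical after l2 normalization:", np.allclose(l2[0], l2[1]))
```

After either normalization the two rows are identical, so the distance between the "short" and "long" documents drops to zero.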

You can consult the `TfidfVectorizer` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) if you want details and options.
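
To see what the weighting does, here is a sketch that reproduces `TfidfVectorizer` output by hand on three toy documents (invented for illustration). It assumes scikit-learn's documented default of `smooth_idf=True`, under which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the bird flew"]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF (scikit-learn default)
weighted = counts * idf                     # term frequency times IDF
by_hand = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # l2 normalization

from_sklearn = TfidfVectorizer(norm='l2', use_idf=True).fit_transform(docs).toarray()
print(np.allclose(by_hand, from_sklearn))
```

A word such as "the", which appears in every document, gets the smallest IDF value (exactly 1 under this formula), so its relative weight shrinks after normalization.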

### Vectorize (5 points)

Vectorize the `dev_docs` with `TfidfVectorizer`, using `norm='l2'` and `use_idf=False` (that is, without TF-IDF weighting).

In [8]:
# Vectorize, norm on, idf off
normalizing_vectorizer = TfidfVectorizer(
    encoding='utf-8',
    strip_accents='unicode',
    lowercase=True,
    stop_words='english',
    min_df=1, # 1 = a single document
    max_df=1.0, # 1.0 = 100% = all documents,
    binary=False, # True = return 1 if word is present in document, else 0
    norm='l2',
    use_idf=False
)
norm_matrix = normalizing_vectorizer.fit_transform(dev_docs)
print("Matrix shape:", norm_matrix.shape)
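# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("\nFeature labels:", normalizing_vectorizer.get_feature_names_out().tolist())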
print("\nDocument 0 vector:", norm_matrix[0].toarray())
print("\nDocument 0 words:", normalizing_vectorizer.inverse_transform(norm_matrix[0]))

Matrix shape: (5, 30)

Feature labels: ['bird', 'cat', 'day', 'dog', 'drink', 'eat', 'eats', 'falls', 'food', 'friends', 'function', 'furry', 'hates', 'like', 'likes', 'meet', 'named', 'play', 'run', 'short', 'sink', 'sleep', 'snake', 'spot', 'spots', 'tired', 'water', 'wet', 'woods', 'words']

Document 0 vector: [[0.        0.2773501 0.        0.        0.        0.        0.
  0.2773501 0.        0.        0.        0.        0.2773501 0.
  0.5547002 0.        0.        0.        0.        0.        0.2773501
  0.        0.        0.        0.        0.        0.5547002 0.2773501
  0.        0.       ]]

Document 0 words: [array(['cat', 'likes', 'water', 'falls', 'sink', 'hates', 'wet'],
      dtype='<U8')]


Verify that we're L2-normed. Sum of squared feature weights in each document should be 1.0.

In [9]:
# Verify l2-norm, freebie
# You will need to change the input variable name below to match whatever you used in the previous cell
for i, vec in enumerate(norm_matrix.toarray()):
    squared_features = [term**2 for term in vec]
    print(f"Squared feature weights in document {i} sum to 1:", np.isclose(sum(squared_features), 1.0))

Squared feature weights in document 0 sum to 1: True
Squared feature weights in document 1 sum to 1: True
Squared feature weights in document 2 sum to 1: True
Squared feature weights in document 3 sum to 1: True
Squared feature weights in document 4 sum to 1: True


### Task: Calculate distances for the normalized vectors and discuss the results (10 points total; 5 for distances, 5 for discussion)

In [10]:
# Euclidean distances on l2-normed features with stopword removal
euclidean_distances(norm_matrix)

array([[0.        , 1.41421356, 1.33803701, 1.29604926, 1.41421356],
       [1.41421356, 0.        , 1.19710204, 1.06488243, 1.41421356],
       [1.33803701, 1.19710204, 0.        , 0.953368  , 1.41421356],
       [1.29604926, 1.06488243, 0.953368  , 0.        , 1.41421356],
       [1.41421356, 1.41421356, 1.41421356, 1.41421356, 0.        ]])

In [11]:
# Cosine distances on l2-normed features with stopword removal
cosine_distances(norm_matrix)

array([[0.        , 1.        , 0.89517152, 0.83987185, 1.        ],
       [1.        , 0.        , 0.71652665, 0.5669873 , 1.        ],
       [0.89517152, 0.71652665, 0.        , 0.45445527, 1.        ],
       [0.83987185, 0.5669873 , 0.45445527, 0.        , 1.        ],
       [1.        , 1.        , 1.        , 1.        , 0.        ]])
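
One relationship worth checking when comparing these two matrices: for unit-length vectors, squared Euclidean distance equals twice the cosine distance, because `||u - v||^2 = 2 - 2 * cos(u, v)`. That is why the largest Euclidean value above, 1.41421356 (the square root of 2), lines up with a cosine distance of 1. The sketch below verifies the identity on random vectors, which are an assumption made for illustration and stand in for `norm_matrix`.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, cosine_distances
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Random nonnegative rows scaled to unit length
vectors = normalize(rng.random((4, 6)), norm='l2')

euc = euclidean_distances(vectors)
cos = cosine_distances(vectors)

# For unit-length vectors: euclidean^2 == 2 * cosine distance
print(np.allclose(euc**2, 2 * cos))
```

Because one distance is a monotonic function of the other, the two metrics rank document pairs identically once the vectors are l2-normalized.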

**Your discussion of the impact of normalization here.** Write a few sentences.

### Task: Try TF-IDF (15 points total: 5 vectorization/5 distances/5 discussion)

Set up a new vectorizer with normalization *and* TF-IDF weighting. Then calculate the Euclidean and cosine distance matrices and compare the results to the normalized but non-TF-IDF results immediately above.

In [12]:
# Vectorize, norm on, idf on
tfidf_vectorizer = TfidfVectorizer(
    encoding='utf-8',
    strip_accents='unicode',
    lowercase=True,
    stop_words='english',
    min_df=1, # 1 = a single document
    max_df=1.0, # 1.0 = 100% = all documents,
    binary=False, # True = return 1 if word is present in document, else 0
    norm='l2',
    use_idf=True
)
tfidf_matrix = tfidf_vectorizer.fit_transform(dev_docs)
print("Matrix shape:", tfidf_matrix.shape)
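# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("\nFeature labels:", tfidf_vectorizer.get_feature_names_out().tolist())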
print("\nDocument 0 vector:", tfidf_matrix[0].toarray())
print("\nDocument 0 words:", tfidf_vectorizer.inverse_transform(tfidf_matrix[0]))

Matrix shape: (5, 30)

Feature labels: ['bird', 'cat', 'day', 'dog', 'drink', 'eat', 'eats', 'falls', 'food', 'friends', 'function', 'furry', 'hates', 'like', 'likes', 'meet', 'named', 'play', 'run', 'short', 'sink', 'sleep', 'snake', 'spot', 'spots', 'tired', 'water', 'wet', 'woods', 'words']

Document 0 vector: [[0.         0.18981438 0.         0.         0.         0.
  0.         0.28342702 0.         0.         0.         0.
  0.28342702 0.         0.56685404 0.         0.         0.
  0.         0.         0.28342702 0.         0.         0.
  0.         0.         0.56685404 0.28342702 0.         0.        ]]

Document 0 words: [array(['wet', 'hates', 'sink', 'falls', 'water', 'likes', 'cat'],
      dtype='<U8')]


In [13]:
# Euclidean distances
euclidean_distances(tfidf_matrix)

array([[0.        , 1.41421356, 1.37552185, 1.34573803, 1.41421356],
       [1.41421356, 0.        , 1.28689221, 1.18231053, 1.41421356],
       [1.37552185, 1.28689221, 0.        , 1.10832776, 1.41421356],
       [1.34573803, 1.18231053, 1.10832776, 0.        , 1.41421356],
       [1.41421356, 1.41421356, 1.41421356, 1.41421356, 0.        ]])

In [14]:
# Cosine distances
cosine_distances(tfidf_matrix)

array([[0.        , 1.        , 0.94603018, 0.90550542, 1.        ],
       [1.        , 0.        , 0.82804578, 0.69892909, 1.        ],
       [0.94603018, 0.82804578, 0.        , 0.61419521, 1.        ],
       [0.90550542, 0.69892909, 0.61419521, 0.        , 1.        ],
       [1.        , 1.        , 1.        , 1.        , 0.        ]])

**Your discussion of the impact of TF-IDF here.** Again, a couple of sentences.

## Finally: Try a few novels (30 points total)

Points breakdown:

* 10 points for thoughtful vectorization settings
* 5 points for calculating distances
* 15 points for thoughtful discussion of your results and their generalizability

In the cell below is a list of five novels. They are:

* Stephen Crane's *Maggie: A Girl of the Streets* (1893)
* Theodore Dreiser's *Sister Carrie* (1900)
* Charlotte Perkins Gilman's *Herland* (1915)
* Jane Austen's *Pride and Prejudice* (1813)
* George Eliot's *Middlemarch* (1869)

There's reason to believe that some of these novels resemble one another, while others are quite dissimilar. 

* *Maggie*, *Sister Carrie*, and *Herland* are American novels published around the start of the twentieth century; Austen's and Eliot's novels are British and were published decades earlier.
* The first two novels were written by men; the last three by women.
* The first two books are Naturalist novels about the (mis)fortunes of young women in predatory cities. *Herland* is an early feminist utopia. The last two books are classic Romantic/Realist novels about British country life, with special emphasis on the aristocracy.

Do the distance metrics we've seen so far reflect the differences we would expect to see between these texts?

### Task: Calculate distances between these five texts. Discuss your results.

* You may use any of the vectorization approaches and distance metrics explored above. 
* You must justify your decisions concerning parameters and techniques.
* If you remove stopwords (you should, but there are different ways to do this, not all of which involve a fixed list of words), you must justify your choices. Unreflective reliance on a stock list of stopwords is not good enough.
* Conclude your work with a discussion of your results.
  * Did your results match your expectations? In what ways?
  * What else might you try so as to improve your results?
  * Can you make any more general observations about the approaches to document similarity that you've employed?
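
As one example of stopword removal that does not rely on a fixed list, the `max_df` parameter drops words that occur in more than a given fraction of documents. This is a sketch on invented toy documents, and the 0.7 threshold is an arbitrary choice for illustration; a sensible value for the novels needs its own justification.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat and the dog play",
    "the bird and the snake run",
    "the dog and the bird sleep",
    "the cat sleeps in the sun",
]

# Vocabulary with no frequency filtering
full = set(CountVectorizer().fit(docs).vocabulary_)

# Vocabulary after dropping words found in more than 70% of documents
pruned = set(CountVectorizer(max_df=0.7).fit(docs).vocabulary_)

dropped = sorted(full - pruned)
print("Dropped as corpus-specific stopwords:", dropped)
```

Here "the" (4 of 4 documents) and "and" (3 of 4) exceed the threshold, while content words such as "cat" and "dog" (2 of 4 each) survive.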

In [15]:
# Novel list
novels = [
    'A-Crane-Maggie-1893-M.txt',
    'A-Dreiser-Sister_Carrie-1900-M.txt',
    'A-Gilman-Herland-1915-F.txt',
    'B-Austen-Pride_Prejudice-1813-F.txt',
    'B-Eliot-Middlemarch-1869-F.txt'
]

# File location
novel_path = os.path.join('..','..','data','texts')

# Create list of full file paths
files = [os.path.join(novel_path, novel) for novel in novels]

In [16]:
# Your code here
stopwords = [
     'a',
     'above',
     'am',
     'an',
     'and',
     'are',
     'at',
     'be',
     'been',
     'being',
     'but',
     'had',
     'has',
     'have',
     'in',
     'is',
     'of',
     'on',
     'out',
     'said',
     'saying',
     'says',
     'the',
     'under',
     'was',
     'were',
     'with'
]

novel_vectorizer = TfidfVectorizer(
    input='filename', # Take input as file paths, not raw text
    encoding='utf-8',
    strip_accents='unicode',
    lowercase=True,
    stop_words=stopwords,
    min_df=2, # keep only words that occur in at least 2 documents
    max_df=0.9, # drop words that occur in more than 90% of documents
    binary=False, # True = return 1 if word is present in document, else 0
    norm='l2',
    use_idf=False,
    max_features=5000
)
novel_matrix = novel_vectorizer.fit_transform(files)
print("Matrix shape:", novel_matrix.shape)
print("\nEuclidean:\n", euclidean_distances(novel_matrix))
print("\nCosine:\n", cosine_distances(novel_matrix))

Matrix shape: (5, 5000)

Euclidean:
 [[0.         1.28782107 1.36427803 1.38998101 1.3380117 ]
 [1.28782107 0.         1.11337328 1.14719942 0.96370704]
 [1.36427803 1.11337328 0.         1.25870711 1.11528058]
 [1.38998101 1.14719942 1.25870711 0.         0.97385512]
 [1.3380117  0.96370704 1.11528058 0.97385512 0.        ]]

Cosine:
 [[0.         0.82924156 0.93062727 0.9660236  0.89513766]
 [0.82924156 0.         0.61980003 0.65803325 0.46436563]
 [0.93062727 0.61980003 0.         0.7921718  0.62192538]
 [0.9660236  0.65803325 0.7921718  0.         0.4741969 ]
 [0.89513766 0.46436563 0.62192538 0.4741969  0.        ]]
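
A 5x5 matrix is easier to discuss as a ranked list of pairs. The sketch below pastes in the cosine matrix printed above and sorts the ten unique pairs from closest to farthest. The short titles in `titles` are labels added here for readability and follow the order of the `novels` list.

```python
import numpy as np
from itertools import combinations

titles = ['Maggie', 'Sister Carrie', 'Herland', 'Pride and Prejudice', 'Middlemarch']

# Cosine distance matrix copied from the output above
cos = np.array([
    [0.        , 0.82924156, 0.93062727, 0.9660236 , 0.89513766],
    [0.82924156, 0.        , 0.61980003, 0.65803325, 0.46436563],
    [0.93062727, 0.61980003, 0.        , 0.7921718 , 0.62192538],
    [0.9660236 , 0.65803325, 0.7921718 , 0.        , 0.4741969 ],
    [0.89513766, 0.46436563, 0.62192538, 0.4741969 , 0.        ],
])

# Each unique pair (i, j) with i < j, sorted by distance
pairs = sorted(combinations(range(len(titles)), 2), key=lambda p: cos[p])
for i, j in pairs:
    print(f"{cos[i, j]:.3f}  {titles[i]} / {titles[j]}")
```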


**Your discussion here.** Probably a couple of paragraphs.

* How did you choose to vectorize?
* How did you approach stopword removal?
* How did you measure distances?
* What did you find?
* How might you improve your results?
* Are you prepared to offer any generalizations about textual similarity in novels?