Working with text
=================

This is a shortened version of the tutorial "Working with text" in the form of an interactive IPython notebook. This document is "live"; any code example can be edited and executed in the browser. To see this in action, change some part of the code in the *cell* below and then click on the play button above.

In [1]:
list_of_strings = ['working', 'with', 'text']
for s in list_of_strings:
    print(s)

working
with
text


Creating a document-term matrix
-------------------------------

Word frequencies and document-term matrices are typical units of
analysis when working with text collections. It may come as a surprise
that reducing a book to a list of word frequencies retains useful
information, but practice has shown this to be the case. Treating texts
as a list of word frequencies (a vector) also makes available a range of
mathematical tools developed for [studying and manipulating
vectors](http://en.wikipedia.org/wiki/Euclidean_vector#History).

> **Note**: Turning texts into unordered lists (or "bags") of words is easy in
> Python. [Python Programming for the
> Humanities](http://fbkarsdorp.github.io/python-course/) includes a
> chapter entitled [Text
> Processing](http://nbviewer.ipython.org/urls/raw.github.com/fbkarsdorp/python-course/master/Chapter%203%20-%20Text%20Preprocessing.ipynb)
> that describes the steps in detail.

This document assumes some prior exposure to text analysis so we will
gather word frequencies (or term frequencies) derived from the lists of
words appearing in texts into a document-term matrix using the
[CountVectorizer](http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
class from the [scikit-learn](http://scikit-learn.sourceforge.net/)
package. (For those familiar with R and the
[tm](http://cran.r-project.org/web/packages/tm/) package, this function
performs the same operation as `DocumentTermMatrix` and takes
recognizably similar arguments.)

First we need to import the functions and classes we intend to use,
along with our customary abbreviation for functions in the `numpy`
package.

In [2]:
import numpy as np  # a conventional alias
from sklearn.feature_extraction.text import CountVectorizer

Now we use the
[CountVectorizer](http://scikit-learn.sourceforge.net/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
class to create a document-term matrix. `CountVectorizer` is highly
customizable. For example, a list of "stop words" can be specified with
the ``stop_words`` parameter. Other important parameters include:

-   `lowercase` (default `True`) convert all text to lowercase before
    tokenizing
-   `min_df` (default `1`) remove from the vocabulary terms that occur
    in fewer than `min_df` documents–with a large corpus this may be set
    to `5` to eliminate rare words
-   `vocabulary` ignore words that do not appear in the list (or
    iterable) assigned to parameter `vocabulary`
-   `strip_accents` remove accents
-   `token_pattern` (default `u'(?u)\b\w\w+\b'`) regular expression
    identifying tokens–by default words that consist of a single
    character (e.g., 'a', '2') are ignored, setting `token_pattern` to
    `'(?u)\b\w+\b'` will include these tokens
-   `tokenizer` (default unused) use a custom function for tokenizing

For this example we will use texts by Jane Austen and Charlotte Brontë. These
texts are available in the *Datasets* section of the collected tutorials.


In [23]:
filenames = ['../data/austen-brontë/Austen_Emma.txt',           
             '../data/austen-brontë/Austen_Pride.txt',          
             '../data/austen-brontë/Austen_Sense.txt',          
             '../data/austen-brontë/CBronte_Jane.txt',          
             '../data/austen-brontë/CBronte_Professor.txt',     
             '../data/austen-brontë/CBronte_Villette.txt']      
                                                             
vectorizer = CountVectorizer(input='filename', min_df=0.15)               
dtm = vectorizer.fit_transform(filenames)  # a sparse matrix 
vocab = vectorizer.get_feature_names()

Now we have a document-term matrix and a vocabulary list. Before we can
query the matrix and find out, for example, how many times the word
'house' occurs in *Emma* (the first text in `filenames`), we need to
convert this matrix from its current format, a [sparse
matrix](http://docs.scipy.org/doc/scipy/reference/sparse.html), into a
normal NumPy array. We will also convert `vocab`, a list of vocabulary,
to an array of strings, as an array supports a greater variety of
operations.


In [18]:
# for reference, note the current class of `dtm`  
type(dtm)                                         
dtm = dtm.toarray()  # convert to a regular array 
vocab_list = np.array(vocab)

> **Note:** A sparse matrix is used to store matrices that contain a significant
> number of entries that are zero. Typically, a sparse matrix only
> records non-zero entries. To understand why this matters so much
> that `CountVectorizer` returns a sparse matrix by default,
> consider a 4000 by 50000 matrix that is 60% zeros. In Python an
> integer takes up 4 bytes, so using a sparse matrix saves almost
> 500M, which is a significant amount of computer memory. (Remember
> that arrays are usually stored in memory, not on disk).

Querying the document-term matrix and the vocabulary is straightforward.
For example, here are two ways of finding how many times the word
'house' occurs in the first text, *Emma*:


In [19]:
# the first file, indexed by 0 in Python, is *Emma*                 
filenames[0] == 'data/austen-brontë/Austen_Emma.txt'                
                                                                    
# use the standard Python list method index(...)                    
house_idx = vocab.index('house')                               
dtm[0, house_idx]                                                   
                                                                    
# alternatively, use NumPy indexing                                 
# in R this would be essentially the same, dtm[1, vocab == 'house'] 
dtm[0, vocab == 'house']                                         




2

In [20]:
# verify that this is the result we anticipated
vocab[house_idx]

u'house'

Sandbox
=======
Feel free to experiment with the document-term matrix `dtm` in the code cells below.

In [21]:
print(dtm.shape)
for fn in filenames:
    print(fn)

(6, 22854)
../data/austen-brontë/Austen_Emma.txt
../data/austen-brontë/Austen_Pride.txt
../data/austen-brontë/Austen_Sense.txt
../data/austen-brontë/CBronte_Jane.txt
../data/austen-brontë/CBronte_Professor.txt
../data/austen-brontë/CBronte_Villette.txt


In [22]:
print(len(vocab))
vocab[500:550]  # look at some of the vocabulary

22854


[u'abuse',
 u'abused',
 u'abuses',
 u'abusing',
 u'abusive',
 u'abyss',
 u'acacia',
 u'acacias',
 u'academician',
 u'academicians',
 u'accede',
 u'acceded',
 u'acceding',
 u'accelerate',
 u'accelerated',
 u'accent',
 u'accented',
 u'accents',
 u'accentuated',
 u'accept',
 u'acceptable',
 u'acceptably',
 u'acceptance',
 u'accepted',
 u'accepting',
 u'accepts',
 u'access',
 u'accessible',
 u'accession',
 u'accessory',
 u'accident',
 u'accidental',
 u'accidentally',
 u'accidently',
 u'accidents',
 u'accommodate',
 u'accommodated',
 u'accommodating',
 u'accommodation',
 u'accommodations',
 u'accompanied',
 u'accompanies',
 u'accompaniment',
 u'accompaniments',
 u'accompany',
 u'accompanying',
 u'accompli',
 u'accomplices',
 u'accomplish',
 u'accomplished']

In [25]:
from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(dtm)
np.round(dist, 1)

array([[    0. ,  3856.3,  4182.8,  5119.7,  7113.3,  5280.2],
       [ 3856.3,     0. ,  1922.6,  6313.1,  4126.2,  6381.2],
       [ 4182.8,  1922.6,     0. ,  6657.4,  4045.3,  6650.3],
       [ 5119.7,  6313.1,  6657.4,     0. ,  8363.8,  2591.5],
       [ 7113.3,  4126.2,  4045.3,  8363.8,     0. ,  8484.1],
       [ 5280.2,  6381.2,  6650.3,  2591.5,  8484.1,     0. ]])

In [26]:
# *Pride and Prejudice* is index 1 and *Jane Eyre* is index 3

In [28]:
filenames[1] == '../data/austen-brontë/Austen_Pride.txt'

True

In [29]:
filenames[3] == '../data/austen-brontë/CBronte_Jane.txt'

True

In [30]:
# the distance between *Pride and Prejudice* and *Jane Eyre*
dist[1, 3]

6313.0833987838305

In [31]:
# which is greater than the distance between *Jane Eyre* and *Villette* (index 5)
dist[1, 3] > dist[3, 5]

True

And if we want to use a measure of distance that takes into consideration the length of the novels (an excellent idea), we can calculate the cosine similarity by importing sklearn.metrics.pairwise.cosine_similarity and use it in place of euclidean_distances.


In [32]:
from sklearn.metrics.pairwise import cosine_similarity

In [33]:
dist = 1 - cosine_similarity(dtm)

In [34]:
np.round(dist, 2)

array([[-0.  ,  0.02,  0.03,  0.05,  0.06,  0.05],
       [ 0.02,  0.  ,  0.02,  0.05,  0.04,  0.04],
       [ 0.03,  0.02,  0.  ,  0.06,  0.05,  0.05],
       [ 0.05,  0.05,  0.06,  0.  ,  0.02,  0.01],
       [ 0.06,  0.04,  0.05,  0.02, -0.  ,  0.01],
       [ 0.05,  0.04,  0.05,  0.01,  0.01, -0.  ]])

In [36]:
# the distance between *Pride and Prejudice* (index 1)
# and *Jane Eyre* (index 3) is
# 0.047026234323162663

In [37]:
# which is greater than the distance between *Jane Eyre* and
# *Villette* (index 5)
dist[1, 3] > dist[3, 5]

True

<h2>Visualizing distances using MDS</h2>

In [42]:
import os  # for os.path.basename
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
# we will also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)  # shape (n_components, n_samples)

<h2>The following Does not run on remote server</h2>

In [44]:
xs, ys = pos[:, 0], pos[:, 1]

# short versions of filenames:
# convert 'data/austen-brontë/Austen_Emma.txt' to 'Austen_Emma'
names = [os.path.basename(fn).replace('.txt', '') for fn in filenames]

# color-blind-friendly palette
for x, y, name in zip(xs, ys, names):
#     color = 'orange' if "Austen" in name else 'skyblue'
    color = 'orange'
    plt.scatter(x, y, c=color)
    plt.text(x, y, name)
    
plt.show()

RuntimeError: Invalid DISPLAY variable

In [61]:
# après Jeremy M. Stober, Tim Vieira
# https://github.com/timvieira/viz/blob/master/mds.py
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=1)

pos = mds.fit_transform(dist)

In [63]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')

ax.scatter(pos[:, 0], pos[:, 1], pos[:, 2])
# Out[48]: <mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x2b96c03c1470>

for x, y, z, s in zip(pos[:, 0], pos[:, 1], pos[:, 2], names):
    ax.text(x, y, z, s)
    
plt.show()

RuntimeError: Invalid DISPLAY variable