In [1]:
# preamble to be able to run notebooks in Jupyter and Colab
try:
    from google.colab import drive
    import sys
    
    drive.mount('/content/drive')
    notes_home = "/content/drive/Shared drives/CSC310/notes/"
    user_home = "/content/drive/My Drive/"
    
    sys.path.insert(1,notes_home) # let the notebook access the notes folder

except ModuleNotFoundError:
    notes_home = "" # running native Jupyter environment -- notes home is the same as the notebook
    user_home = ""  # under Jupyter we assume the user directory is the same as the notebook

# Natural Language Processing (NLP)

Some of the most important data in our society is represented as unstructured text:

* Medical records
* Court cases
* Insurance documents

Other data is perhaps not as fundamental but provides interesting insights into trends and mindsets:

* Twitter and other online blogs
* News feeds


In all of these cases we want to extract meaning from the unstructured text:

* Perhaps we want to do classification (medical records - high risk/low risk)
* Perhaps we want to do a topic analysis of Twitter feeds
* Perhaps we would like to construct a recommendation engine for news feeds

Regardless of the task, we need to convert the unstructured text into something that we, and perhaps most importantly our models, can work with.

☞ The **Vector Model** of text (sometimes called the **Bag-of-Words model**)


## The Vector Model

The vector model converts a document with unstructured text into a **point in an n-dimensional coordinate system** where the coordinate system is defined by the words contained in the text.

Consider: the quick brown fox jumps over the lazy dog

This text can be represented as a tuple of its distinct words arranged in alphabetical order,
```
(brown,dog,fox,jumps,lazy,over,quick,the)
```

Let’s consider the fact that we have multiple documents and represent them as tuples,

* Doc 1: the quick brown fox jumps over the lazy dog &rarr; `(brown,dog,fox,jumps,lazy,over,quick,the)`
* Doc 2: rudi is a lazy brown dog &rarr; `(a,brown,dog,is,lazy,rudi)`

In order to compare the two documents we create a tuple of the **union** of the words appearing in the 
two sentence tuples,
```
(a,brown,dog,fox,is,jumps,lazy,over,quick,rudi,the)
```
and represent each document as bit vectors with the same length as the tuple above and with 1's and 0's
indicating if the document contains a word at a particular tuple position or not,

* Doc 1: (0,1,1,1,0,1,1,1,1,0,1)
* Doc 2: (1,1,1,0,1,0,1,0,0,1,0)

Notice that our word tuple now has become our coordinate system, in this case with 11 dimensions, and each document is now a point in this 11-dimensional space.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Rectangular_coordinates.svg/1280px-Rectangular_coordinates.svg.png" width="350" height="300">

The nice thing about this vector model representation is that we can do mathematics on the documents!

Consider adding another document to our collection

* Doc 3: princess jumps over the lazy dog &rarr; `(dog,jumps,lazy,over,princess,the)`

Here we have the new word `princess`, so we need to extend our coordinate system to 12 dimensions by adding `princess`,
```
(a,brown,dog,fox,is,jumps,lazy,over,princess,quick,rudi,the)
```
Our three documents become vectors/points in this coordinate system,

* Doc 1: the quick brown fox jumps over the lazy dog &rarr; `(0,1,1,1,0,1,1,1,0,1,0,1)`
* Doc 2: rudi is a lazy brown dog &rarr; `(1,1,1,0,1,0,1,0,0,0,1,0)`
* Doc 3: princess jumps over the lazy dog &rarr; `(0,0,1,0,0,1,1,1,1,0,0,1)`
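
The construction above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how sklearn implements it; the helper name `to_bit_vector` is made up for this example.

```python
# Three toy documents; doc 3 includes "lazy", matching its vector above.
docs = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "rudi is a lazy brown dog",
    "doc3": "princess jumps over the lazy dog",
}

# The coordinate system is the sorted union of all words in the collection.
vocabulary = sorted(set(word for text in docs.values() for word in text.split()))

def to_bit_vector(text, vocabulary):
    """Return a tuple with 1 where the document contains the word, else 0."""
    words = set(text.split())
    return tuple(1 if term in words else 0 for term in vocabulary)

vectors = {name: to_bit_vector(text, vocabulary) for name, text in docs.items()}

print(vocabulary)
for name, vec in vectors.items():
    print(name, vec)
```

Adding a document with a previously unseen word simply lengthens `vocabulary`, and every existing vector gains a 0 in the new position.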


Given our vector model of the three docs we can ask questions like this, 

> Is doc2 or doc3 more similar to doc1?

Since all three documents are considered points in our coordinate system we can use Euclidean distances in that coordinate system to answer that question. More specifically, we compare the Euclidean distances doc1 &harr; doc2 and doc1 &harr; doc3; the smaller distance indicates the more similar document.

The Euclidean distance d in n-dimensional space between two points $p$ and $q$ is defined as:

$d(p,q) = \sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\ldots+(p_n-q_n)^2}$

In our case the points $p$ and $q$ are document vectors and $p_i$ and $q_i$ are the components of the respective vectors.

In order to answer our question we have to perform the following computations,

* $d(doc1, doc2) = \sqrt{(0-1)^2+(1-1)^2+(1-1)^2+(1-0)^2+(0-1)^2+(1-0)^2+(1-1)^2+(1-0)^2+(0-0)^2+(1-0)^2+(0-1)^2+(1-0)^2} = \sqrt{1+0+0+1+1+1+0+1+0+1+1+1} = \sqrt{8} \approx 2.83$

* $d(doc1,doc3) = \sqrt{(0-0)^2+(1-0)^2+(1-1)^2+(1-0)^2+(0-0)^2+(1-1)^2+(1-1)^2+(1-1)^2+(0-1)^2+(1-0)^2+(0-0)^2+(1-1)^2} = \sqrt{0+1+0+1+0+0+0+0+1+1+0+0} = \sqrt{4} = 2.0$

> So, doc3 is more similar to doc1 than doc2 is!
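
The hand computation can be double-checked with a small sketch that uses only the standard library. The function name `euclidean` is chosen for this example; the vectors are the three document vectors from above.

```python
import math

# Document vectors in the 12-dimensional coordinate system
# (a, brown, dog, fox, is, jumps, lazy, over, princess, quick, rudi, the).
doc1 = (0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1)
doc2 = (1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0)
doc3 = (0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1)

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

d12 = euclidean(doc1, doc2)
d13 = euclidean(doc1, doc3)
print(round(d12, 3), round(d13, 3))
```

For bit vectors each squared difference is either 0 or 1, so the squared distance simply counts the words that occur in exactly one of the two documents: 8 for doc1 and doc2, 4 for doc1 and doc3.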


## The Vector Model in Sklearn

Let's try the above in sklearn.

In [2]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# set up our documents
doc_names = ["doc1", "doc2", "doc3"]
docs = ["the quick brown fox jumps over the lazy dog",
        "rudi is a lazy brown dog",
        "princess jumps over the lazy dog"]

# process documents
vectorizer = CountVectorizer(analyzer = "word", binary = True)
docarray = vectorizer.fit_transform(docs).toarray()

# print out the coordinate system
# NOTE: sklearn filters out single character words -- it drops 'a'
print("Coordinates:")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
coords = vectorizer.get_feature_names_out().tolist()
print(coords)

# print out how each document is represented in this coordinate system
# NOTE: traditionally this mapping is called the 'docterm' matrix - the mapping
#       of each document into the set of terms/words.
print("\nDocterm:")
docterm = pandas.DataFrame(data=docarray,index=doc_names,columns=coords)
print(docterm)

# print pairwise distances between documents
distances = euclidean_distances(docterm)
distances_df = pandas.DataFrame(data=distances, index=doc_names, columns=doc_names)
print("\nPairwise Distances:")
print(distances_df)

Coordinates:
['brown', 'dog', 'fox', 'is', 'jumps', 'lazy', 'over', 'princess', 'quick', 'rudi', 'the']

Docterm:
      brown  dog  fox  is  jumps  lazy  over  princess  quick  rudi  the
doc1      1    1    1   0      1     1     1         0      1     0    1
doc2      1    1    0   1      0     1     0         0      0     1    0
doc3      0    1    0   0      1     1     1         1      0     0    1

Pairwise Distances:
          doc1      doc2      doc3
doc1  0.000000  2.645751  2.000000
doc2  2.645751  0.000000  2.645751
doc3  2.000000  2.645751  0.000000


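> Just as we computed by hand - doc3 is more similar to doc1 than doc2. The doc1-doc2 distance is $\sqrt{7} \approx 2.65$ here rather than $\sqrt{8}$ because sklearn dropped the coordinate `a`.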

## Real World Data

`from sklearn.datasets import fetch_20newsgroups`

“The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.”

```
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
misc.forsale
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
soc.religion.christian
```

Each news item has two fields: 
* Data - the actual text
* Target - index of the category the news item belongs to


In [3]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import fetch_20newsgroups

# the categories we want to use
cats = ['talk.politics.misc', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

# print some meta-info about the data
print("Number of training samples: {}".format(len(newsgroups_train.data)))
print("Target labels: {}".format(list(newsgroups_train.target_names)))
print("Number of labels: {}".format(newsgroups_train.target.shape))
print("Print 5th training instance:")
print(newsgroups_train.data[5])
print("\nLabel of 5th:")
print(newsgroups_train.target_names[newsgroups_train.target[5]])


Number of training samples: 1058
Target labels: ['sci.space', 'talk.politics.misc']
Number of labels: (1058,)
Print 5th training instance:
From: nickh@CS.CMU.EDU (Nick Haines)
Subject: Re: Vandalizing the sky.
In-Reply-To: todd@phad.la.locus.com's message of Wed, 21 Apr 93 16:28:00 GMT
Originator: nickh@SNOW.FOX.CS.CMU.EDU
Nntp-Posting-Host: snow.fox.cs.cmu.edu
Organization: School of Computer Science, Carnegie Mellon University
	<1993Apr21.162800.168967@locus.com>
Lines: 33

In article <1993Apr21.162800.168967@locus.com> todd@phad.la.locus.com (Todd Johnson) writes:

   As for advertising -- sure, why not?  A NASA friend and I spent one
   drunken night figuring out just exactly how much gold mylar we'd need
   to put the golden arches of a certain American fast food organization
   on the face of the Moon.  Fortunately, we sobered up in the morning.

Hmmm. It actually isn't all that much, is it? Like about 2 million
km^2 (if you think that sounds like a lot, it's only a few tens of m



## Let us compute the docterm matrix for the news articles


In [4]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import fetch_20newsgroups

cats = ['talk.politics.misc', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

# process documents                                                                                               
vectorizer = CountVectorizer(analyzer = "word", token_pattern = "[a-zA-Z]+", binary = True)
docarray = vectorizer.fit_transform(newsgroups_train.data).toarray()
print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10].tolist()))

docarray shape: (1058, 20590)
first 10 coords: ['a', 'aa', 'aaa', 'aaaaaaaaaaaa', 'aammmaaaazzzzzziinnnnggggg', 'aams', 'aan', 'aangegeven', 'aantal', 'aao']


Looking at the shape of the docarray, we see that we have 20,000+ different features.  When we look at the features it is clear that there are many "nonsense" features.  We need more filtering.

## Let us do more filtering

From this it is clear that we want to do some additional filtering:
* Minimum doc frequency = 2 -- that is, any word has to appear in at least two documents of the collection
* Delete anything that is not a word - get rid of things like ‘000’ etc.; we use the token pattern arg for that.
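
To make the two filters concrete, here is a minimal pure-Python sketch of what they do. The toy documents and the helper `tokenize` are made up for this illustration; it mimics the effect of `token_pattern` and `min_df` and is not sklearn's implementation.

```python
import re
from collections import Counter

# Hypothetical toy collection, used only to illustrate the two filters.
toy_docs = [
    "Shuttle launch delayed 000 times",
    "The shuttle launch was a success",
    "Budget vote delayed again, aaargh",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters, like token_pattern='[a-zA-Z]+'."""
    return re.findall(r"[a-zA-Z]+", text.lower())

# Document frequency: in how many documents does each token occur?
doc_freq = Counter()
for text in toy_docs:
    doc_freq.update(set(tokenize(text)))

# Keep only tokens that occur in at least min_df documents.
min_df = 2
kept = sorted(token for token, df in doc_freq.items() if df >= min_df)
print(kept)
```

Note that the frequency is counted per document: a token repeated many times inside a single document still has a document frequency of 1 and is dropped.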


In [5]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import fetch_20newsgroups
from re import sub

cats = ['talk.politics.misc', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

# process documents                                                                                               
vectorizer = CountVectorizer(analyzer = "word", 
                             token_pattern = "[a-zA-Z]+",
                             binary = True, 
                             min_df=2)
docarray = vectorizer.fit_transform(newsgroups_train.data).toarray()
                                                                                                 
print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10].tolist()))

docarray shape: (1058, 11862)
first 10 coords: ['a', 'aa', 'aammmaaaazzzzzziinnnnggggg', 'aaron', 'aas', 'ab', 'abandon', 'abandoned', 'abandonment', 'abbey']


Notice that we cut the number of features nearly in half (from 20,590 to 11,862) and the features look more like words.

## Stemming

The first few coordinates are now:
['a', 'aa', 'aammmaaaazzzzzziinnnnggggg', 'aaron', 'aas', 'ab', **'abandon'**, **'abandoned'**, **'abandonment'**, 'abbey']

Here we see one more issue: three different forms of the same root word, in this case *abandon*, each take up their own coordinate.

> Solution: Stemming!


In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem or root form.

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stems", "stemmer", "stemming", "stemmed" as based on "stem". 

A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word, "fish".
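
The idea can be illustrated with a deliberately crude suffix stripper. This sketch is not the Porter algorithm and not what NLTK does; the suffix list and the name `toy_stem` are hand-picked assumptions that only show how different word forms can collapse onto one stem.

```python
# Crude suffix stripping for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ment", "ing", "ed", "er", "s")  # hypothetical, hand-picked list

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "fished", "fisher", "abandon", "abandoned", "abandonment", "cats"]:
    print(w, "->", toy_stem(w))
```

Real stemmers apply ordered rule sets with conditions on the remaining stem, so their output can differ from this sketch on many words.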

The most popular stemming algorithm:

> The [Porter Stemmer](https://en.wikipedia.org/wiki/Stemming)


In [6]:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import fetch_20newsgroups
from nltk.stem import PorterStemmer

cats = ['talk.politics.misc', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)

# build the stemmer object
stemmer = PorterStemmer()
# get a text analyzer from CountVectorizer that tokenizes on runs of letters
analyzer = CountVectorizer(analyzer = "word", token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to create the words to be stemmed
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

vectorizer = CountVectorizer(analyzer=stemmed_words,
                                 binary=True,
                                 min_df=2)
docarray = vectorizer.fit_transform(newsgroups_train.data).toarray()

print("docarray shape: {}".format(docarray.shape))
print("first 10 coords: {}".format(vectorizer.get_feature_names_out()[:10].tolist()))

docarray shape: (1058, 8657)
first 10 coords: ['a', 'aa', 'aammmaaaazzzzzziinnnnggggg', 'aaron', 'ab', 'abandon', 'abbey', 'abc', 'abdkw', 'abett']


## We can now look at the distances in 8000+ dimensional space

In [7]:
distances = euclidean_distances(docarray)
distances_df = pandas.DataFrame(data=distances)
distances_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1048,1049,1050,1051,1052,1053,1054,1055,1056,1057
0,0.000000,13.228757,14.899664,15.524175,12.884099,15.937377,17.146428,13.038405,13.964240,12.609520,...,16.613248,12.884099,15.394804,28.407745,14.000000,13.601471,17.972201,14.247807,12.609520,15.524175
1,13.228757,0.000000,14.035669,15.362291,12.529964,15.329710,16.278821,12.288206,12.409674,12.165525,...,15.459625,11.874342,14.212670,28.319605,13.453624,13.038405,17.262677,13.190906,12.083046,14.212670
2,14.899664,14.035669,0.000000,15.779734,13.638182,16.309506,16.370706,13.564660,14.525839,13.674794,...,16.248077,13.564660,15.198684,28.053520,14.696938,14.035669,17.916473,14.662878,13.747727,15.394804
3,15.524175,15.362291,15.779734,0.000000,14.933185,16.822604,17.521415,14.933185,15.556349,15.033296,...,17.349352,14.798649,16.673332,28.460499,15.329710,14.764823,18.275667,15.811388,15.033296,16.492423
4,12.884099,12.529964,13.638182,14.933185,0.000000,15.297059,16.248077,11.575837,12.767145,11.874342,...,15.811388,11.313708,14.730920,28.124722,12.961481,12.449900,17.521415,13.152946,11.874342,14.317821
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1053,13.601471,13.038405,14.035669,14.764823,12.449900,15.394804,16.155494,12.041595,13.416408,12.489996,...,15.652476,12.288206,14.352700,28.284271,13.453624,0.000000,17.262677,13.341664,12.409674,14.832397
1054,17.972201,17.262677,17.916473,18.275667,17.521415,18.574176,19.364917,17.000000,17.549929,17.029386,...,19.157244,16.822604,17.888544,29.086079,17.464249,17.262677,0.000000,17.832555,17.262677,18.384776
1055,14.247807,13.190906,14.662878,15.811388,13.152946,15.524175,16.401219,12.845233,13.711309,12.727922,...,11.618950,12.369317,14.899664,28.425341,13.892444,13.341664,17.832555,0.000000,13.190906,14.696938
1056,12.609520,12.083046,13.747727,15.033296,11.874342,15.394804,16.401219,11.789826,12.649111,11.575837,...,15.968719,11.180340,14.071247,28.035692,12.529964,12.409674,17.262677,13.190906,0.000000,14.071247


## Find out which stories are most similar

In [8]:
import sys

# replace every 0.0 (the self-distances on the main diagonal) with FLOAT_MAX so that min() ignores them
new_df = distances_df.apply(lambda c: c.apply(lambda x: sys.float_info.max if x == 0.0 else x))

In [9]:
# find the column with the minimal value
new_df.min().idxmin()

930

In [10]:
# find the row with the minimal value
new_df.iloc[:,930].idxmin()

1036

In [11]:
# these two news stories are most similar
new_df.iloc[1036, 930]

1.0
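
The closest-pair search above can also be sketched without masking the diagonal, by scanning only the entries above it. The 4x4 matrix below is made-up data and `closest_pair` is a hypothetical helper; on the real data one would pass in the rows of the distance matrix instead.

```python
# Hypothetical 4x4 symmetric distance matrix with zeros on the diagonal.
distances = [
    [0.0, 3.0, 2.5, 4.0],
    [3.0, 0.0, 1.0, 3.5],
    [2.5, 1.0, 0.0, 2.0],
    [4.0, 3.5, 2.0, 0.0],
]

def closest_pair(matrix):
    """Return (distance, i, j) for the smallest entry with i < j."""
    n = len(matrix)
    return min((matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n))

best_distance, row, col = closest_pair(distances)
print(best_distance, row, col)
```

Because the diagonal is never visited, a distance of exactly 0.0 between two distinct documents (an exact duplicate) would still be reported rather than masked out.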

In [12]:
print(newsgroups_train.target_names[newsgroups_train.target[1036]])
print(newsgroups_train.target_names[newsgroups_train.target[930]])

sci.space
sci.space


In [13]:
print(newsgroups_train.data[1036])

Subject: <None>
From: bioccnt@otago.ac.nz
Organization: University of Otago, Dunedin, New Zealand
Nntp-Posting-Host: thorin.otago.ac.nz
Lines: 12


Can someone please remind me who said a well known quotation? 

He was sitting atop a rocket awaiting liftoff and afterwards, in answer to
the question what he had been thinking about, said (approximately) "half a
million components, each has to work perfectly, each supplied by the lowest
bidder....." 

Attribution and correction of the quote would be much appreciated. 

Clive Trotman




In [14]:
print(newsgroups_train.data[930])

Subject: Quotation? Lowest bidder...
From: bioccnt@otago.ac.nz
Organization: University of Otago, Dunedin, New Zealand
Nntp-Posting-Host: thorin.otago.ac.nz
Lines: 12


Can someone please remind me who said a well known quotation? 

He was sitting atop a rocket awaiting liftoff and afterwards, in answer to
the question what he had been thinking about, said (approximately) "half a
million components, each has to work perfectly, each supplied by the lowest
bidder....." 

Attribution and correction of the quote would be much appreciated. 

Clive Trotman




> It is a reposting where just the subject of the message changed!