# CIS600 - Social Media & Data Mining
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

# NLP Cont'd

###  March 27, 2018

# Other Data

### Let's repeat last week's analysis using data from a different source. You can find the files below in the BB content folder for last week.

In [32]:
# Plotting
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show, output_notebook

# We'll need numpy, as usual
import numpy as np

# Import natural language toolkit
import nltk

# Read training data from a text file
with open('training.txt','r') as f:
    training_lines = f.readlines()

# Read test data from a text file
with open('testdata.txt','r') as f:
    test_lines = f.readlines()

### What are we working with here?

In [33]:
len(training_lines),len(test_lines)

(7086, 33052)

In [34]:
# Looking at the size
import sys
sys.getsizeof(training_lines),sys.getsizeof(test_lines)

(61440, 285392)

### Let's look at some examples...

In [5]:
print(training_lines[0],training_lines[-1])

('1\tThe Da Vinci Code book is just awesome.\n', '0\tOh, and Brokeback Mountain was a terrible movie.\n')


In [6]:
print(test_lines[0],test_lines[-1])

('" I don\'t care what anyone says, I like Hillary Clinton.\n', 'I was rejected by the stupid San Francisco literary agency that I sent my manuscript to.\n')


### Nothing is stopping us from looking at this in a text editor or word processor, but let's poke around some more.

### How long are these lines, statistically speaking?

In [7]:
# Calculating lengths
trn_lens, test_lens = [len(x) for x in training_lines], [len(x) for x in test_lines]

# Summary stats
print((np.mean(trn_lens),np.std(trn_lens)),(np.mean(test_lens),np.std(test_lens)))

((63.158340389500424, 38.01212045454934), (61.519575214813024, 43.626223511950705))


### It looks like the test samples are slightly shorter on average, but with greater variance. (Aside: how could you find out whether that difference is 'real'?) We can visualize this:

In [8]:
# First plot
p1 = figure(title="Training Lengths",
            background_fill_color="#E8DDCB", x_range=(0,300))
hist, edges = np.histogram(trn_lens, density=True, bins=50) # Get histogram stuff from numpy
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="#036564", line_color="#033649") # Add quadrilaterals to the figure

# Second plot
p2 = figure(title="Test Lengths",
            background_fill_color="#E8DDCB", x_range=(0,300))
hist, edges = np.histogram(test_lens, density=True, bins=200) # Get histogram stuff from numpy
p2.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="#036564", line_color="#033649") # Add quadrilaterals to the figure

# Combine plots and show
output_notebook()
show(gridplot([p1, p2], ncols=2))
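### Returning to the aside about whether the difference in mean length is 'real': one option is a permutation test, sketched below with only the standard library. The function name `permutation_test` and the toy lists are made up for illustration, and the code assumes Python 3 division. In the notebook you would pass `trn_lens` and `test_lens` instead.

```python
import random

def permutation_test(a, b, n_rounds=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabelings whose mean difference
    is at least as large as the observed one (an estimated p-value).
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed:
            hits += 1
    return hits / n_rounds

# Hypothetical toy lengths
short = [40, 42, 45, 39, 41, 44, 43, 40]
long_ = [80, 85, 79, 90, 88, 84, 86, 83]

print(permutation_test(short, long_))  # clearly different groups: close to 0
print(permutation_test(short, short))  # identical groups: 1.0
```

### A small value says that a difference this large rarely arises when the group labels are assigned at random.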

### It looks like the lines of `training_lines` have `0` or `1` prepended - is this the case?

In [None]:
np.all([x[0] == '0' or x[0] == '1' for x in training_lines])

### Yes.

### But not the test lines? What good is that? This data is from a [*Kaggle competition*](https://inclass.kaggle.com/c/si650winter11).

### We won't bother with the test data. But let's process the data in *training.txt* as we did the corpus data. The first step is to collect all the words in the corpus.

In [11]:
documents = [(x.split()[1:],x[0]) for x in training_lines]
for x,y in documents[:5]:
    print('"' + ' '.join(x) + '"' + ' is in category ' + y)

"The Da Vinci Code book is just awesome." is in category 1
"this was the first clive cussler i've ever read, but even books like Relic, and Da Vinci code were more plausible than this." is in category 1
"i liked the Da Vinci Code a lot." is in category 1
"i liked the Da Vinci Code a lot." is in category 1
"I liked the Da Vinci Code but it ultimatly didn't seem to hold it's own." is in category 1


### Recall that we had to shuffle the documents before splitting them up...

In [13]:
np.random.shuffle(documents)

### Next, we want to filter the short words out of our documents

In [14]:
filtered = []

for (words,sentiment) in documents:
    words_filtered = [w.lower() for w in words if len(w) >= 3]
    filtered.append((words_filtered,sentiment))

### Great, what does this look like?

In [15]:
print(filtered[:5])

[(['love', 'kirsten', 'leah', 'kate', 'escapades', 'and', 'mission', 'impossible', 'tom', 'well...'], '1'), (['hate', 'harry', 'potter..'], '0'), (['and', "i'm", 'not', 'even', 'thinking', 'getting', 'freshman', 'cuz', 'itz', 'just', 'mission', 'impossible', 'lol', 'simply', 'suck', 'too', 'much', 'altogether....'], '0'), (['vinci', 'code', 'up,', 'up,', 'down,', 'down,', 'left,', 'right,', 'left,', 'right,', 'suck!'], '0'), (['vinci', 'code', 'was', 'awesome', 'movie...'], '1')]


### Looks like we could still clean it up a little more, for example taking out the punctuation.

In [16]:
# Taking out (some of) the punctuation
# - why not put it back into 'documents'?

documents = []

for (words, sent) in filtered:
    words_stripped = [w.strip('.,;!:-') for w in words]
    documents.append((words_stripped,sent))
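### The two cleaning passes can also be folded into one helper. This is a sketch rather than a drop-in replacement: the name `clean_tokens` is made up, it strips every ASCII punctuation character from both ends of a token (not just `.,;!:-`), and it strips *before* checking the length, so a token like `'up,'` is dropped here but kept as `'up'` by the cells above.

```python
import string

def clean_tokens(words, min_len=3):
    """Lowercase, strip leading/trailing punctuation, drop short tokens."""
    cleaned = []
    for w in words:
        w = w.lower().strip(string.punctuation)
        if len(w) >= min_len:
            cleaned.append(w)
    return cleaned

print(clean_tokens("I HATE Harry Potter..".split()))
# ['hate', 'harry', 'potter']
```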

In [17]:
print(documents[:5])

[(['love', 'kirsten', 'leah', 'kate', 'escapades', 'and', 'mission', 'impossible', 'tom', 'well'], '1'), (['hate', 'harry', 'potter'], '0'), (['and', "i'm", 'not', 'even', 'thinking', 'getting', 'freshman', 'cuz', 'itz', 'just', 'mission', 'impossible', 'lol', 'simply', 'suck', 'too', 'much', 'altogether'], '0'), (['vinci', 'code', 'up', 'up', 'down', 'down', 'left', 'right', 'left', 'right', 'suck'], '0'), (['vinci', 'code', 'was', 'awesome', 'movie'], '1')]


### That's a lot cleaner. Now on to the features...

In [18]:
# Get all the words
def get_words_in_docs(docs):
    all_words = []
    for (words, sentiment) in docs:
        all_words.extend(words)
    return all_words

# Extract the most frequent words
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = [w for (w, c) in wordlist.most_common(1000)]
    return word_features

### And how are we to apply these functions?

In [36]:
# The output of the first is the input to the second
word_features = get_word_features(get_words_in_docs(documents))

In [37]:
print(word_features[:10])

['the', 'and', 'potter', 'harry', 'vinci', 'brokeback', 'mountain', 'code', 'love', 'was']


### On to the dictionaries...

In [38]:
# Unchanged from last time
def extract_features(document): 
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

# Encoding each document
train_set = [(extract_features(d), c) for (d,c) in documents[:4000]]
test_set = [(extract_features(d), c) for (d,c) in documents[4000:]]
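### To see what a single encoded document looks like, here is the same idea on a toy vocabulary. The names `toy_features` and `extract_toy_features` and the sample document are made up for illustration; the real `word_features` list has 1000 entries.

```python
# Hypothetical three-word vocabulary standing in for word_features
toy_features = ['love', 'hate', 'potter']

def extract_toy_features(document, vocabulary):
    """Map a tokenized document to {'contains(word)': bool} over the vocabulary."""
    document_words = set(document)
    return {'contains(%s)' % word: (word in document_words)
            for word in vocabulary}

encoded = extract_toy_features(['hate', 'harry', 'potter'], toy_features)
print(encoded)
# {'contains(love)': False, 'contains(hate)': True, 'contains(potter)': True}
```

### Note that 'harry' leaves no trace in the encoding, because it is not in the vocabulary.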

### Training a classifier

In [39]:
clf = nltk.classify.NaiveBayesClassifier.train(train_set)

### Looking at informative features

In [40]:
clf.show_most_informative_features(20)

Most Informative Features
          contains(hate) = True                0 : 1      =     89.7 : 1.0
          contains(love) = True                1 : 0      =     89.0 : 1.0
            contains(oh) = True                0 : 1      =     79.7 : 1.0
       contains(fucking) = True                0 : 1      =     55.0 : 1.0
         contains(these) = True                0 : 1      =     46.1 : 1.0
           contains(his) = True                0 : 1      =     46.1 : 1.0
            contains(me) = True                0 : 1      =     45.2 : 1.0
        contains(around) = True                0 : 1      =     44.4 : 1.0
          contains(does) = True                0 : 1      =     42.6 : 1.0
           contains(gay) = True                0 : 1      =     41.8 : 1.0
       contains(opinion) = True                0 : 1      =     41.8 : 1.0
        contains(awards) = True                0 : 1      =     40.9 : 1.0
         contains(begin) = True                0 : 1      =     40.9 : 1.0

### OK, that's enough of that routine. You know enough Python at this point to use any reasonable data source.

In [None]:
# The documents were shuffled, so compare against documents (not training_lines)
print( clf.classify(test_set[0][0]), '\n',
      documents[4000])

In [None]:
print( clf.classify(test_set[1][0]), '\n',
      documents[4001])

In [None]:
documents[4000:4010]

In [None]:
print( clf.classify(test_set[3][0]), '\n',
      documents[4003])

### (Aside: we didn't specify the target or categories at any point. The `nltk` implementations read it off the form of our data, which is why we've prepared the data in this particular way. Bear that in mind if you're using another package for classification.)
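### As a sketch of what "another package" might want instead: many libraries expect a table of feature rows plus a separate list of targets. The toy `labeled` list below is made up, and the variable names `columns`, `X` and `y` are just a common convention, not something a particular library requires.

```python
# Hypothetical labeled examples in the (features, label) form used above
labeled = [
    ({'contains(love)': True,  'contains(hate)': False}, '1'),
    ({'contains(love)': False, 'contains(hate)': True},  '0'),
]

# Fix a column order, then build parallel lists: feature rows and targets
columns = sorted(labeled[0][0])
X = [[int(features[c]) for c in columns] for features, _ in labeled]
y = [label for _, label in labeled]

print(columns)  # ['contains(hate)', 'contains(love)']
print(X)        # [[0, 1], [1, 0]]
print(y)        # ['1', '0']
```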

# Other Features & Embeddings

### Let's use the `nltk` corpora to look at some other NLP concepts. Our approach thus far has been to treat each document as a *bag of words*. That lets a lot of meaning slip away (in different ways), and there is much more usable information we can pull out of text documents.

## TF-IDF

### This is a badly named but important quantity. Our models so far have been so stupid that we have not seen the need for this. 

### In particular, we used a binary encoding - true or false - of the most frequent words in the corpus. What if we were to use not just those boolean values, but the frequencies of certain words? 

### That's not a bad idea, but many meaningless words have high frequency. 

### Therefore, *what about words that are frequent within certain documents, but not so frequent in the whole corpus*? 

### Those should tell us something.

### First, let's define the *term frequency* $TF(t,d)$, where $t$ is a word (loosely) and $d$ is a document:

## $TF(t,d) = \frac{\text{occurrences of }t\text{ in }d}{\text{number of words in }d}$

### Second, let's define the *inverse document frequency* $IDF(t)$:

## $IDF(t) = \log\big(\frac{\text{size of corpus}}{\text{number of documents with }t}\big)$

### Finally, the $TF$-$IDF$ is the *product* of these two.
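### Here is a worked example of these three definitions on a tiny corpus, using only the standard library. The corpus and the `toy_` function names are made up, and the code assumes Python 3 division. It follows the formulas exactly as written, so `toy_idf` fails for a term that appears in no document.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens
toy_docs = [
    ['the', 'car', 'is', 'fast'],
    ['the', 'insurance', 'is', 'cheap'],
    ['the', 'car', 'insurance', 'covers', 'the', 'car'],
]

def toy_tf(t, d):
    """Occurrences of t in d, divided by the number of words in d."""
    return d.count(t) / len(d)

def toy_idf(t, docs):
    """Log of corpus size over the number of documents containing t."""
    n_containing = sum(1 for d in docs if t in d)
    return math.log(len(docs) / n_containing)

def toy_tfidf(t, d, docs):
    return toy_tf(t, d) * toy_idf(t, docs)

print(toy_tfidf('the', toy_docs[2], toy_docs))  # 0.0, 'the' is in every document
print(toy_tfidf('car', toy_docs[2], toy_docs))  # about 0.135
```

### The word 'the' is the most frequent term in the third document, yet it scores zero because it tells us nothing about which document we are looking at.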

In [56]:
# Importing nltk stuff
import nltk
from nltk.corpus import movie_reviews

# Our documents, in convenient form
documents = [list(movie_reviews.words(fileid))
                for category in movie_reviews.categories()
                    for fileid in movie_reviews.fileids(category)]
# Total documents
N = len(documents)

# Doing idf first
def idf(t):
    # t is a string
    
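    # How many documents contain t?
    docs = [d for d in documents if t in d]
    # The 1 in the denominator is a smoothing term (not in the formula above)
    # that avoids division by zero; float() guards against integer division
    frac = float(N) / (1 + len(docs))
    return np.log(frac)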


def tfidf(t,d):
    # t is a string
    # d is an int (index for "documents")
    
    # Term frequency: occurrences of t over document length
    # (float() guards against integer division under Python 2)
    tf = documents[d].count(t) / float(len(documents[d]))
    
    # Multiply, calling the other function
    prod = tf*idf(t)
    return prod

In [None]:
documents[13].count('the')

In [None]:
len(documents[13])

In [None]:
nltk.FreqDist(documents[13])

In [None]:
tfidf('khan',13)
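### The `idf` function above rescans every document on each call, which gets slow if you want scores for many terms. One possible speed-up, sketched here on a made-up toy corpus, is to count document frequencies once up front. The names `toy_corpus`, `doc_freq` and `fast_idf` are hypothetical; the smoothing matches the `1 +` used in `idf`.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the movie review documents
toy_corpus = [['a', 'good', 'film'], ['a', 'bad', 'film'], ['a', 'good', 'cast']]

# Count each term once per document (set() removes repeats within a document)
doc_freq = Counter(term for doc in toy_corpus for term in set(doc))

def fast_idf(t):
    """Smoothed IDF using the precomputed document frequencies."""
    return math.log(len(toy_corpus) / (1 + doc_freq[t]))

print(doc_freq['good'])   # 2
print(fast_idf('cast'))   # log(3 / 2), about 0.405
```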

### In document retrieval, you specify query terms, or otherwise supply information to be matched in a corpus, and appropriate documents are ranked and returned. This is closely related to classification. We can use TF-IDF to build what's sometimes called the *vector space model*. Here is a 2D example:

### Let's say we want to retrieve documents pertaining to *car insurance*. That's two terms, *car* and *insurance*. Define a *query vector* $q$:

In [2]:
q = np.array([.71,.71])

### (Why these values?)

In [None]:
np.linalg.norm(q)

### Suppose we have three documents $d_1,d_2,d_3$. Suppose that for each of the two terms *car* and *insurance*, we have computed the TF-IDF for these three documents. By abuse of notation, let's use the document names for their representing vectors:

In [4]:
d1,d2,d3 = np.array([[.13,.99],[.8,.6],[.99,.13]])

### (How about these values?)

In [None]:
np.linalg.norm([d1,d2,d3], axis=1)

### OK, let's take a look.

In [16]:
output_notebook()
p = figure(width=400, height=400)
p.segment(x0=[0,0,0,0], y0=[0,0,0,0], x1=[d1[0], d2[0], d3[0], q[0]],
          y1=[d1[1], d2[1], d3[1], q[1]], color="#F4A582", line_width=3)

show(p)

### Which of these is closest?

In [None]:
np.matmul([d1,d2,d3],q)

### What is going on here? Recall how the *dot/inner product* works:

## $q \cdot d = q_1d_1 + \ldots + q_nd_n$

### It is related to the *smallest angle between the two vectors* through the equation

## $\cos(q,d) = \frac{q \cdot d}{\|q\|\|d\|}$

### When the vectors are already normalized, as is the case here, we need only compute the dot products and compare the results.

### For unit vectors, this is equivalent to comparing *distances from the query vector*, in the sense that

## $\cos(q,d_1) > \cos(q,d_2) \iff \|q-d_1\| < \|q-d_2\|$.
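### A quick numerical check of this equivalence, using the same values as the 2D example but rescaled to exact unit length. The helper names are made up, and only the standard library is used. The reason it works is that for unit vectors $\|q-d\|^2 = 2 - 2\,q \cdot d$, so a larger dot product always means a smaller distance.

```python
import math

def normalize(v):
    """Rescale v to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Same values as the 2D example, rescaled to exact unit length
query = normalize([.71, .71])
docs = [normalize(d) for d in ([.13, .99], [.8, .6], [.99, .13])]

best_by_cosine = max(range(len(docs)), key=lambda i: dot(query, docs[i]))
best_by_distance = min(range(len(docs)), key=lambda i: distance(query, docs[i]))
print(best_by_cosine, best_by_distance)  # both are 1, i.e. d2
```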

### An accompanying metric sometimes used is the *collection frequency*, $CF$ - the total number of occurrences of a term in the corpus.

### There are many variants of these quantities, and you can find all of them in use. One example is the so-called *augmented* term frequency, which rescales $TF$ by the largest term frequency in the document:

## $0.5 + \frac{0.5\times tf_{t,d}}{\max_t tf_{t,d}}$
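### A minimal sketch of this augmented formula, with a made-up document and function name. It uses raw counts rather than normalized frequencies, which gives the same ratio because numerator and denominator come from the same document.

```python
from collections import Counter

def augmented_tf(t, doc):
    """0.5 + 0.5 * count(t) / (count of the most frequent term in doc)."""
    counts = Counter(doc)
    max_count = max(counts.values())
    return 0.5 + 0.5 * counts[t] / max_count

# Hypothetical tokenized document
doc = ['the', 'car', 'insurance', 'covers', 'the', 'car', 'the']
print(augmented_tf('the', doc))    # 1.0, the most frequent term
print(augmented_tf('car', doc))    # 0.5 + 0.5 * 2/3, about 0.833
print(augmented_tf('boat', doc))   # 0.5, the term is absent
```

### Every value lands between 0.5 and 1, which damps the advantage that long documents get from sheer repetition.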

## Term Distribution Models

### General idea: let's try to characterize how informative a word is. From *Manning & Schütze*:

> One could cast the problem as one of distinguishing content words from non-content (or function) words, but most models have a graded notion of how informative a word is.

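### Example: Zipf's Law. If $f$ is the frequency of a word and $r$ is its rank in the frequency-ordered word list, then for some constant $k$: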

## $f \cdot r = k$
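### To check this on a corpus of your own, you need a rank-frequency table. The sketch below builds one with the standard library; the function name and the toy text are made up, and a text this small is far too short to show the law. On a large corpus, the last column should stay roughly constant over the middle ranks.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Hypothetical toy text
tokens = 'the cat and the dog and the bird'.split()
for row in rank_frequency(tokens):
    print(row)
```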

### Example: Poisson Distribution

## $p(k;\lambda_i) = e^{-\lambda_i}\frac{\lambda_i^k}{k!}$

### Here, $\lambda_i$ is the average number of occurrences of term $w_i$ per document.
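### The formula is easy to evaluate directly. In this sketch the function name and the rate of 0.5 occurrences per document are made up for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k occurrences when the mean rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical term that averages 0.5 occurrences per document
lam = 0.5
for k in range(4):
    print(k, round(poisson_pmf(k, lam), 4))
```

### With a small $\lambda_i$, most of the probability sits at $k = 0$: the model expects most documents not to contain the term at all.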

<img src="notebook-images/poiss.png" style="width: 800px;"/>

### The Poisson is not a great fit for term distributions. But...

> We can exploit term distribution models in information retrieval by using the parameters of the model fit for a particular term as indicators of relevance.

### Another model, the *K mixture* model of Katz, has a larger $\beta$ parameter for content words than for non-content words.

### (see *Manning & Schütze* for more)

## Collocation

### CAUTION: do not confuse collocations with $n$-grams. It is not actually confusing, but the definition of *collocation* is just not terribly precise. From *Manning & Schütze*,

>A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things.

### Examples include "stiff breeze", "broad daylight" and "international best practice".

### Collocations are not necessarily idioms, but idioms are (the extreme of) collocations.

### Let's look for some. How about `reuters` this time...

In [None]:
from nltk.corpus import reuters

from nltk.collocations import *

# For bigrams
bigram_measures = nltk.collocations.BigramAssocMeasures()

# For trigrams
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Score every bigram in the corpus by PMI and list the top 100
finder = BigramCollocationFinder.from_words(reuters.words())
finder.nbest(bigram_measures.pmi, 100)


### Notice that in computing 'collocations', we are forced to take a stab at it using some mathematical simplification of the general idea - here we used *pointwise mutual information*.

In [None]:
help(bigram_measures.pmi)

### This compares the actual joint probability of two events with the product of their individual probabilities. The higher it is, the stronger their association.
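### To make that concrete, here is PMI computed by hand from counts, as a sketch. The function and the counts are made up, it needs Python 3 for `math.log2`, and the exact counting conventions inside `nltk` may differ from this simplified version.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """log2 of P(x, y) / (P(x) * P(y)), with probabilities estimated from counts.

    Written as count_xy * total / (count_x * count_y), which is the same
    ratio after cancelling the common factors of total.
    """
    return math.log2(count_xy * total / (count_x * count_y))

# Hypothetical counts in a corpus of 10,000 bigrams
print(pmi(count_xy=8, count_x=10, count_y=10, total=10000))    # about 9.64
print(pmi(count_xy=1, count_x=100, count_y=100, total=10000))  # 0.0
```

### The first pair occurs together far more often than chance predicts. The second occurs exactly as often as independence predicts, so its PMI is zero. Rare words that only ever appear together get very high scores, which is why a frequency filter helps.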

### Notice that a lot of the pairs look pretty useless. Some are not even recognizable parts of natural language, and many appear to be the names of particular humans. We can do some filtering.

In [None]:
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 100)

### Still looking mostly like names, but there is some other good stuff now.

## Synonyms

### You have all heard of synonyms. `WordNet` has a big collection of them.

### `WordNet` is a semantically oriented dictionary of English, similar to a traditional thesaurus but with richer structure. It is the most popular such thing among NLP types.

### It contains hundreds of thousands of words and *synonym sets*. For example, take sentences $A$ and $B$

>$A)$ Benz is credited with the invention of the $\color{red}{\text{motorcar}}$.

>$B)$ Benz is credited with the invention of the $\color{red}{\text{automobile}}$.

### The sentences $A$ and $B$ have pretty much the same meaning. Let's see what `WordNet` has to offer.

In [None]:
from nltk.corpus import wordnet as wn
motorcar_syn = wn.synsets('motorcar')[0]
motorcar_syn

### This is a collection of synonymous words or *lemmas*. Remember *lemmatization*? We looked at some lemmatizers before, and we said that lemmatization is a harder task than stemming (and makes use of dictionaries).

In [None]:
motorcar_syn.lemma_names()

### Each word of a synset can have several meanings, e.g. "car" can signify "train carriage", "gondola" or "elevator car".

In [None]:
motorcar_syn.definition()

### `WordNet` gives us examples, too.

In [None]:
motorcar_syn.examples()

### The word "motorcar" is pretty precise, so let's look at just "car":

In [None]:
car_syns = wn.synsets('car')
for synset in car_syns:
    print(synset.lemma_names())

### Some lexical relationships hold between lemmas, e.g. *antonymy*:

In [None]:
print(wn.lemma('supply.n.02.supply').antonyms(),'\n',
wn.lemma('rush.v.01.rush').antonyms(),'\n',
wn.lemma('horizontal.a.01.horizontal').antonyms(),'\n',
wn.lemma('staccato.r.01.staccato').antonyms())

## $n$-grams

### We used *bigrams* above and we touched on $n$-grams before. The `nltk` package implements $n$-grams in general.

In [None]:
# Let's have the n-grams tools in here by name
from nltk import bigrams, trigrams, ngrams

# A tweet
t = '''#qcpoli enjoyed a hearty laugh today with #plq debate audience for @jflisee 
        #notrehome tune was that the intended reaction?'''

# A tweet tokenizer (the constructor takes options, not the text)
tt = nltk.TweetTokenizer()

# Tokens from our tweet
tokens = tt.tokenize(t)

# N-Grams from our tweet (loop variable named so it doesn't shadow the tweet t)
for gram in bigrams(tokens): 
    print(gram)
for gram in trigrams(tokens): 
    print(gram)
for gram in ngrams(tokens, 4): 
    print(gram)
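### There is no magic in these functions. An $n$-gram is just a run of $n$ consecutive tokens, which a few lines of plain Python can produce. The name `simple_ngrams` and the token list are made up for this sketch.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest of the n shifted copies of the token list
    return list(zip(*[tokens[i:] for i in range(n)]))

words = ['enjoyed', 'a', 'hearty', 'laugh', 'today']
print(simple_ngrams(words, 2))
# [('enjoyed', 'a'), ('a', 'hearty'), ('hearty', 'laugh'), ('laugh', 'today')]
print(simple_ngrams(words, 3))
# [('enjoyed', 'a', 'hearty'), ('a', 'hearty', 'laugh'), ('hearty', 'laugh', 'today')]
```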

### Recall how we modeled the documents before, recording the presence or absence of certain (frequent) words. Those were *unigrams*. We can do the same thing with these $n$-grams as features! It just takes some more computation.
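### As a sketch of what that could look like for bigrams, mirroring the `contains(...)` dictionaries from earlier: the function name, the vocabulary and the sample document below are all made up for illustration.

```python
def bigram_features(tokens, bigram_vocabulary):
    """Map a tokenized document to {'contains(w1 w2)': bool} over chosen bigrams."""
    present = set(zip(tokens, tokens[1:]))
    return {'contains(%s %s)' % pair: (pair in present)
            for pair in bigram_vocabulary}

# Hypothetical vocabulary of frequent bigrams
vocab = [('harry', 'potter'), ('da', 'vinci')]
print(bigram_features(['hate', 'harry', 'potter'], vocab))
# {'contains(harry potter)': True, 'contains(da vinci)': False}
```

### The extra computation comes from the vocabulary: there are many more distinct bigrams than distinct words, so the feature dictionaries grow quickly.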

### But $n$-grams actually lead us to a more powerful embedding of words into vector space. More on that Thursday.