## INFO 3350/6350

## Lecture 10: Feature expansion and natural language processing (NLP)

## What we want from our features

* Sometimes we want to move from specific words to higher-level categories in order to achieve **better classification or similarity results**
    * Here, we can be pragmatic. What features produce the best results?
    * This is an engineering problem
    * But we may still care about explainability
* Sometimes we move from words to concepts because we care about the **concepts themselves**
    * In this case, explainability is really important
    * We need to understand how our features capture (or fail to capture) the concepts we care about
* And sometimes, we want to incorporate **paratextual features**, that is, features that are *about* a book, but aren't *in* the book, strictly speaking
    * Author identity, gender, publication date, national origin, genre, market success, etc.

**Today: look (a little bit) into non-unigram feature types that may help us in each of these cases**

## There's more to life than token unigrams

* Most of our work so far has used counts of **token unigrams** (that is, individual words) to characterize texts
* This is non-crazy!
    * The words in a book tell us a lot about that book
* But we often care less about words than we do about the classes of things those words represent
    * Words are often just the most straightforward way to capture ideas, subject matter, types of action, characterization, etc.
* There are additional ways to capture these higher-level concerns
    * ***n*-grams** and **noun phrases** capture multi-word sequences like "best friend" and "New York"
    * **Lemmatization** collapses specific word forms into a single root ("running" -> "run", "cats" -> "cat")
    * **Part of speech** tagging collapses words into their linguistic functions (noun, verb, adjective, preposition, etc.)
    * **Named entity recognition** identifies named entities (people, organizations, places, etc.)
    * Several different approaches to identifying **subject matter**
        * Topic models
        * Latent Semantic Analysis (LSA), which we've already seen, in a different context, as SVD
        * Coming up: word embeddings
* Can think of **all** these methods as potential *dimension reduction* techniques: we use far more words than we have truly distinct concepts
    
## *n*-grams

* ***n*-grams** are sequences of some number of words that occur one after another in a text
* '*n*' represents the number of consecutive words.
    * *n*=1 is called a unigram. We've used these extensively already.
    * *n*=2 is a bigram, *n*=3 is a trigram, etc.

Note a useful tool: [Google Books Ngram Viewer](https://books.google.com/ngrams). Also, [see a sample](https://books.google.com/ngrams/interactive_chart?content=Great+War%2CWorld+War+I%2CWorld+War+II&year_start=1800&year_end=2019&corpus=27&smoothing=3&direct_url=t1%3B%2CGreat+War%3B%2Cc0%3B.t1%3B%2CWorld+War+I%3B%2Cc0%3B.t1%3B%2CWorld+War+II%3B%2Cc0) that demonstrates an interesting historical issue.

Of course, **be careful when considering historical use of words and phrases**. Consider semantic drift (changes in a word's meaning over time), orthographic issues like the (archaic) long 's', and the general sense in which published books do not represent all of a society in real time. But these are issues we always face!

We can count *n*-grams in our own texts, too. As ever, we could count sequences of words by hand, but `sklearn`'s vectorizers make it easy to add *n*-gram features:

In [1]:
import pandas as pd
from   sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I grew up in New York",
    "Ada Lovelace was from the United Kingdom",
    "Cats and dogs can be friends"
]

vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(2,2) # <- retain bigrams (alone) after stopword removal
)

features = vectorizer.fit_transform(sentences).toarray() # dense NumPy array

# Create a dataframe for easy display
df = pd.DataFrame(features, columns=vectorizer.get_feature_names_out())
display(df)

| | ada lovelace | cats dogs | dogs friends | grew new | lovelace united | new york | united kingdom |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |


* Note that we've removed stopwords and retained bigrams (2-grams) only
* We could also retain any range of *n*-grams, including unigrams
    * In fact, `CountVectorizer` by default uses `ngram_range=(1,1)`
* Bigrams can feel like they capture people, places, and other named entities or noun phrases, but there are better ways to accomplish those specific tasks
* Still, *n*-grams can be useful features
    * Unlike the NLP-derived features discussed below -- which generally reduce the dimensionality of our data -- using *n*-gram features tends to increase dimensionality
    * Beware rapidly expanding feature matrices, especially with *n*>2
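
To make the dimensionality point concrete, here is a minimal by-hand sketch of *n*-gram counting. The `ngrams` helper and the toy sentence are illustrative assumptions, not part of `sklearn`, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy text, split naively on whitespace (an assumption for brevity)
tokens = "the cat sat on the mat and the cat slept on the mat".split()

# Compare the number of distinct features at each value of n
for n in (1, 2, 3):
    counts = Counter(ngrams(tokens, n))
    print(f"n={n}: {len(counts)} distinct features from {sum(counts.values())} n-grams")

# The most frequent bigrams in the toy text
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```

Even in this thirteen-word text, the number of distinct features grows as *n* increases, while each individual feature becomes rarer. In a real corpus the growth is far steeper.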


## Classic NLP tasks with `spaCy`

* **Lemmatization, part of speech tagging, named entity recognition, noun phrase detection, dependency parsing**, etc.
* Pretty much all of these involve using text sequence information and hand-labeled training data to learn how to predict the class to which a token belongs
    * By "text sequence information," we mean that for each token in order as we move through a text, we look at one or more tokens before or after that token in order to infer things about it
    * In general, we are trying to produce the label with the maximum likelihood, given what we know about the labels of the tokens around our target token
* NLP is a big subject
    * In the near term, see [Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/), especially [ch. 17](https://web.stanford.edu/~jurafsky/slp3/17.pdf) as (optionally) assigned for today
    * In the longer run, consider taking a class with another of the [Cornell NLP faculty members](https://nlp.cornell.edu/people/)
* Several packages to accomplish many NLP tasks. Two widely used ones for Python:
    * NLTK (Natural Language ToolKit) is a classic in Python
        * We used its sentence splitting and word tokenization features earlier in the course
        * Pros: easy, pythonic
        * Cons: slow, not state-of-the-art performance
    * spaCy
        * Newer, neural-network based
        * Good speed and performance
        * Pretty easy to use
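
To illustrate the idea of choosing the most likely label given hand-labeled data and neighboring labels, here is a toy sketch. It is *not* how spaCy's neural models work: it is a greedy, count-based tagger in the spirit of a hidden Markov model, and the training sentences, the `score` and `greedy_tag` helpers, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-labeled training set (invented for illustration)
training = [
    [("the", "DET"), ("book", "NOUN"), ("is", "AUX"), ("good", "ADJ")],
    [("i", "PRON"), ("book", "VERB"), ("flights", "NOUN")],
    [("they", "PRON"), ("read", "VERB"), ("the", "DET"), ("book", "NOUN")],
    [("we", "PRON"), ("book", "VERB"), ("the", "DET"), ("flights", "NOUN")],
]

transitions = defaultdict(Counter)  # previous tag -> counts of the next tag
emissions = defaultdict(Counter)    # tag -> counts of words seen with that tag
for sentence in training:
    prev = "<START>"
    for word, tag in sentence:
        transitions[prev][tag] += 1
        emissions[tag][word] += 1
        prev = tag

tags = sorted(emissions)
vocab = {word for sentence in training for word, _ in sentence}

def score(prev, tag, word):
    """Add-one smoothed estimate of P(tag | previous tag) * P(word | tag)."""
    p_tag = (transitions[prev][tag] + 1) / (sum(transitions[prev].values()) + len(tags))
    p_word = (emissions[tag][word] + 1) / (sum(emissions[tag].values()) + len(vocab))
    return p_tag * p_word

def greedy_tag(words):
    """Label tokens left to right, keeping the highest-scoring tag at each step."""
    prev, labels = "<START>", []
    for word in words:
        prev = max(tags, key=lambda tag: score(prev, tag, word))
        labels.append(prev)
    return labels

print(greedy_tag("they book the flights".split()))
print(greedy_tag("the book is good".split()))
```

The word "book" receives a different label in each sentence because the label of the preceding token changes which tag is most likely. Real taggers use far richer context and much more training data, but the logic of picking the most probable label in context is similar.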
      
### Install `spaCy` and associated data

Only need to do this once for your installation, not every (subsequent) time you use the library. You should have it already if you installed the [course package set](https://github.com/wilkens-teaching/info3350-f24/tree/main/setup). If not, you'll need to run:

```
conda install -c conda-forge spacy -y
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg
```

### Some NLP basics

In [2]:
# Imports
import spacy

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

# Sample sentence
cornell = 'Cornell is a private, Ivy League university and the land-grant university for New York state.'

# Process the document
doc = nlp(cornell)

# Examine the processed document
print(doc)
print(type(doc))
print([token.text for token in doc])

Cornell is a private, Ivy League university and the land-grant university for New York state.
<class 'spacy.tokens.doc.Doc'>
['Cornell', 'is', 'a', 'private', ',', 'Ivy', 'League', 'university', 'and', 'the', 'land', '-', 'grant', 'university', 'for', 'New', 'York', 'state', '.']


#### Tokens, POS tags, dependency

Note that spaCy's model gives us tokens (among other things), each of which has properties attached to it. So, if we want to know what a token is (that is, what its text is), we refer to `token.text` (assuming our token is stored in a variable named `token`). If we want its part of speech, that's `token.pos_`. Its lemma: `token.lemma_`.

Note that these POS tags are a reduced set (called UPOS). If you want the full Penn Treebank tagset, see the example below.

In [3]:
# Print tokens, POS tags, and dependency info
fmt = "{:>12}"*3 # three, right-justified, 12-character-wide columns
for token in doc:
    print(fmt.format(token.text, token.pos_, token.dep_))

     Cornell       PROPN       nsubj
          is         AUX        ROOT
           a         DET         det
     private         ADJ        amod
           ,       PUNCT       punct
         Ivy       PROPN    compound
      League       PROPN    compound
  university        NOUN        attr
         and       CCONJ          cc
         the         DET         det
        land        NOUN    compound
           -       PUNCT       punct
       grant        NOUN    compound
  university        NOUN        conj
         for         ADP        prep
         New       PROPN    compound
        York       PROPN    compound
       state        NOUN        pobj
           .       PUNCT       punct


In [4]:
# get the root, subject, and direct object of the example sentence
for sent in doc.sents:
    root = sent.root
    subject = None
    dir_obj = None
    for child in root.children:
        if child.dep_ == 'nsubj':
            subject = child
        if child.dep_ == 'dobj':
            dir_obj = child
    print("Root:", root)
    if subject:
        print("Subject:", subject)
    if dir_obj:
        print("Direct object:", dir_obj)

Root: is
Subject: Cornell


In [5]:
# explain a tag
spacy.explain('pobj')

'object of preposition'

Think about why you might want to navigate a dependency tree like this ...

#### Lemmas and POS tags

In [6]:
# Gets lemmas and POS tags
for token in doc:
    print(fmt.format(token.text, token.tag_, token.lemma_))

     Cornell         NNP     Cornell
          is         VBZ          be
           a          DT           a
     private          JJ     private
           ,           ,           ,
         Ivy         NNP         Ivy
      League         NNP      League
  university          NN  university
         and          CC         and
         the          DT         the
        land          NN        land
           -        HYPH           -
       grant          NN       grant
  university          NN  university
         for          IN         for
         New         NNP         New
        York         NNP        York
       state          NN       state
           .           .           .


Note that lemmatization is a useful preprocessing step during vectorization. Think about how you would write a preprocessing function using `spaCy` that would integrate with an `sklearn` vectorizer via the `preprocessor` option.

#### Named entities

Entities aren't strictly token-level properties, so we don't retrieve them by calls to the properties of individual tokens. Instead, we iterate over `doc.ents`.

In [7]:
# Entities
for ent in doc.ents:
    print(fmt.format(ent.text, ent.label_, ''))

     Cornell         ORG            
    New York         GPE            


You can see the full set of [named entity types](https://spacy.io/models/en) in spaCy's documentation. Briefly, in addition to `ORG`s and `GPE`s, there are also `LOC`ations, `PERSON`s, `DATE`s and `TIME`s, `MONEY`, and a few more.

In [8]:
# Visualize entities in context
from spacy import displacy
displacy.render(doc, style='ent') # Entities

#### Noun chunks

A noun chunk (also called a "noun phrase") is one or more words in sequence that collectively behave like a noun. Typically, they contain a noun plus one or more adjectives and determiners.

In [9]:
# Noun chunks
for chunk in doc.noun_chunks:
    print(chunk)

Cornell
a private, Ivy League university
the land-grant university
New York state


## Create a feature matrix

Let's use these new features as, well, *features* like the token counts produced by `CountVectorizer`.

From just three (very) short stories, we're going to collect counts of all the distinct named entities and parts of speech, as well as the total length of each document. Then, we'll turn that data into a document-feature matrix.

Here's what one of the stories looks like:

> From Austin, we headed west. For two weeks, my best friend and I were on a journey of self-exploration, an adventure through the vast American landscape that would find us in a multicolored haze, and an event that would bond us for life. All of this came about because of one thing: a burrito so good we had to drive to California to get it. If it weren’t for that burrito we wouldn’t have found ourselves in a purple sunset, sitting on the edge of the Grand Canyon with a question that would change our lives forever. She said yes.

In [10]:
# Three tiny love stories from the New York Times
# See https://www.nytimes.com/column/modern-love

from collections import defaultdict

stories = {
    'burrito':"""From Austin, we headed west. For two weeks, my best friend and I were on a journey of self-exploration, an adventure through the vast American landscape that would find us in a multicolored haze, and an event that would bond us for life. All of this came about because of one thing: a burrito so good we had to drive to California to get it. If it weren’t for that burrito we wouldn’t have found ourselves in a purple sunset, sitting on the edge of the Grand Canyon with a question that would change our lives forever. She said yes.""",
    'tripod':"""On the eve of the new millennium, I fell in love with Andrew, a dashing English ad executive. Inconveniently, I didn’t fall out of love with Scott, an American architectural photographer and my longtime partner. Our dilemma resulted in an unexpected and enduring romance: a V-shaped love triangle sans vows and offspring. Born English, now a naturalized American, I am the hinge in our harmonious household of three: I sleep with both men, they each sleep only with me. We share everything else: home, finances, friends, vacations, life-threatening calamities. As Scott says, our tripod is more stable than a bipod.""",
    'skating':"""I flew to Idaho over winter break to see Sumner’s hometown. Our first night, we went skating on a frozen pond, surrounded by snow. I was nervous. I didn’t play sports growing up, and I hadn’t ice skated since I was a child. He circled the pond, not showing off, simply enjoying the movement. I’ll never forget the stars piercing the darkness and the shadowy outline of the towering mountains. Fifteen minutes later, I realized I had forgotten that I was supposed to learn how to skate; I had just been watching him the whole time."""
}

story_data = []

for story in stories:
    counts = defaultdict(int)
    doc = nlp(stories[story])
    counts['wordcount'] = len(doc) # total tokens, including punctuation
    for entity in doc.ents:
        counts[entity.text+'__'+entity.label_] += 1
    for token in doc:
        counts[token.pos_] += 1
    story_data.append(counts)

display(story_data)

[defaultdict(int,
             {'wordcount': 115,
              'Austin__GPE': 1,
              'two weeks__DATE': 1,
              'American__NORP': 1,
              'one__CARDINAL': 1,
              'California__GPE': 1,
              'the Grand Canyon__LOC': 1,
              'ADP': 16,
              'PROPN': 4,
              'PUNCT': 12,
              'PRON': 17,
              'VERB': 12,
              'NOUN': 18,
              'NUM': 2,
              'ADJ': 5,
              'CCONJ': 2,
              'AUX': 7,
              'DET': 11,
              'SCONJ': 2,
              'ADV': 2,
              'PART': 4,
              'INTJ': 1}),
 defaultdict(int,
             {'wordcount': 126,
              'the eve__DATE': 1,
              'Andrew__PERSON': 1,
              'English__LANGUAGE': 2,
              'Scott__ORG': 1,
              'American__NORP': 2,
              'triangle__ORG': 1,
              'three__CARDINAL': 1,
              'Scott__PERSON': 1,
              'ADP': 13,
  

In [11]:
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(story_data)
display(X)
display(vectorizer.get_feature_names_out())

array([[  5.,  16.,   2.,   7.,   1.,   0.,   1.,   2.,   1.,  11.,   0.,
          0.,   1.,   0.,  18.,   2.,   4.,  17.,   4.,  12.,   2.,   0.,
          0.,   0.,  12.,   0.,   1.,   1.,   0.,   0.,   0.,   1.,   0.,
        115.],
       [  7.,  13.,   5.,   3.,   2.,   1.,   0.,   3.,   0.,  10.,   2.,
          0.,   0.,   0.,  27.,   1.,   1.,  13.,   5.,  23.,   1.,   1.,
          1.,   0.,  14.,   0.,   0.,   0.,   1.,   1.,   1.,   0.,   0.,
        126.],
       [  5.,   7.,   4.,   9.,   0.,   0.,   0.,   2.,   0.,   9.,   0.,
          1.,   0.,   1.,  17.,   1.,   7.,  14.,   2.,  14.,   3.,   0.,
          0.,   1.,  20.,   1.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,
        114.]])

array(['ADJ', 'ADP', 'ADV', 'AUX', 'American__NORP', 'Andrew__PERSON',
       'Austin__GPE', 'CCONJ', 'California__GPE', 'DET',
       'English__LANGUAGE', 'Fifteen minutes later__TIME', 'INTJ',
       'Idaho__GPE', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT',
       'SCONJ', 'Scott__ORG', 'Scott__PERSON', 'Sumner__PERSON', 'VERB',
       'first night__TIME', 'one__CARDINAL', 'the Grand Canyon__LOC',
       'the eve__DATE', 'three__CARDINAL', 'triangle__ORG',
       'two weeks__DATE', 'winter__DATE', 'wordcount'], dtype=object)

In [12]:
X.shape

(3, 34)

In [13]:
# Select just the GPEs and LOCs, as a sub-example
vocab = vectorizer.vocabulary_ # mapping features to index positions

# Find indices of places in feature matrix
# Iterate over feature names, saving index positions for features that have GPE or LOC in their name
idx = [vocab[feature] for feature in vectorizer.feature_names_ if '__GPE' in feature or '__LOC' in feature]
# Get feature names in the desired index positions
place_names = [vectorizer.get_feature_names_out()[i] for i in idx]
# Restrict feature matrix to desired columns
X_place = X[:,idx] # <- all rows, columns from list
display(X_place)
display(place_names)

array([[1., 1., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 1., 0.]])

['Austin__GPE', 'California__GPE', 'Idaho__GPE', 'the Grand Canyon__LOC']

## Mixed features

* If we care about these individual features, we can proceed to analyze them (see Soni et al. and Wilkens et al., for example)
* If we want to use any of these features (*n*-grams, POS counts, entities, whatever) *instead* of token unigrams as part of another workflow (classification, say), we can do that.
    * And we can compare the resulting accuracy to that achieved with unigram counts
    * Note that we'll still want to scale features, reduce dimensions, select most-informative features, examine feature importances, and so on
    * The point is that these *are features* just like unigram counts
        * They may (or may not) behave differently in practice (there are relatively few POS types, for instance, so their counts are often higher than the counts of individual words), but *as features*, we compute with them in just the same way
* Maybe best of all, we can use any of these features, or even non-textual features, *alongside* unigram counts
    * We can do this as part of feature engineering for classification
        * Task is to find the best mix of features for our classification problem
    * Or we can specify the feature mix in advance for unsupervised problems
        * As always, for unsupervised tasks, we have to specify in advance the set of maximally relevant features
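
Because mixed features sit on very different scales, scaling matters even more than usual. The sketch below standardizes three columns copied from the story feature matrix above (`NOUN`, `Austin__GPE`, and `wordcount`). It uses plain NumPy z-scores and assumes no column is constant. `sklearn`'s `StandardScaler` is one alternative.

```python
import numpy as np

# Three columns copied from the story feature matrix:
# NOUN count, Austin__GPE count, and wordcount for each story
names = ['NOUN', 'Austin__GPE', 'wordcount']
X_mixed = np.array([
    [18., 1., 115.],
    [27., 0., 126.],
    [17., 0., 114.],
])

# z-score each column: subtract its mean, then divide by its standard deviation
# (assumes every column varies; a constant column would divide by zero)
X_scaled = (X_mixed - X_mixed.mean(axis=0)) / X_mixed.std(axis=0)

print(names)
print(np.round(X_scaled, 2))
```

After scaling, a single mention of Austin and a swing of a dozen tokens in document length are expressed in comparable units, so neither dominates a distance calculation simply because of its raw magnitude.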

### Side note: stacking arrays

To join features from different matrices, you can use `numpy`'s `hstack` function, like so:

In [14]:
import numpy as np

# Synthetic data
a = np.zeros((3, 2))
print('a\n',a)

b = np.ones((3, 2))
print('\nb\n', b)

c = np.hstack([a,b])
print('\nc (stacked)\n', c)

a
 [[0. 0.]
 [0. 0.]
 [0. 0.]]

b
 [[1. 1.]
 [1. 1.]
 [1. 1.]]

c (stacked)
 [[0. 0. 1. 1.]
 [0. 0. 1. 1.]
 [0. 0. 1. 1.]]


In [15]:
# Add synthetic data columns to right side of feature array
X_stacked = np.hstack([X,a])
print(X_stacked)

[[  5.  16.   2.   7.   1.   0.   1.   2.   1.  11.   0.   0.   1.   0.
   18.   2.   4.  17.   4.  12.   2.   0.   0.   0.  12.   0.   1.   1.
    0.   0.   0.   1.   0. 115.   0.   0.]
 [  7.  13.   5.   3.   2.   1.   0.   3.   0.  10.   2.   0.   0.   0.
   27.   1.   1.  13.   5.  23.   1.   1.   1.   0.  14.   0.   0.   0.
    1.   1.   1.   0.   0. 126.   0.   0.]
 [  5.   7.   4.   9.   0.   0.   0.   2.   0.   9.   0.   1.   0.   1.
   17.   1.   7.  14.   2.  14.   3.   0.   0.   1.  20.   1.   0.   0.
    0.   0.   0.   0.   1. 114.   0.   0.]]


Note that you need to keep track of your feature names when you do this. The vectorizer object will still give you the names of the columns that it produced, but you might now have multiple vectorizers, each of which is responsible for part of your stacked feature matrix. This isn't a problem, but be aware that you'll need to deal with it. 

Of course, you need to have ordered your documents in the same way through each of the different vectorization steps if you're going to stack them as we've done here (so that you aren't mixing features from different documents within each row of the feature matrix). If you're using Pandas (rather than NumPy) to store your features, you could use `join` or `merge` to make sure that your feature data is properly aligned by object identifier.
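
As a minimal sketch of index-based alignment, the two toy tables below hold features for the three stories but list them in different row orders. The `NOUN` and `VERB` counts are copied from the feature matrix above. The word-count columns are illustrative.

```python
import pandas as pd

# Feature tables indexed by story identifier, in different row orders
words = pd.DataFrame(
    {'burrito': [2, 0, 0], 'love': [0, 3, 0]},
    index=['burrito', 'tripod', 'skating'],
)
pos = pd.DataFrame(
    {'NOUN': [17, 18, 27], 'VERB': [20, 12, 14]},
    index=['skating', 'burrito', 'tripod'],
)

# join() matches rows on the index labels, not on row position
features = words.join(pos)
print(features)
```

Stacking these two tables with `np.hstack` would silently pair the wrong rows. Joining on the index keeps each story's features together.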