# Modern Practical Natural Language Processing

This course covers how you can use NLP to get practical things done with text.

We envision four videos.
1. Overview and Converting Text to Vectors (this one)
  * For finding similar documents
  * "I have this document or text, what others talk about the same stuff?"
2. Learning with Vectors and Classification
  * For classifying documents
  * "I need to put these documents into buckets."
3. Sequence Generation 
  * For translation and document summarization
  * "I need to create quick summaries of these documents, maybe in Urdu."
4. Extracting Pieces of Information from Text
  * For pulling out sentences and documents that talk about specific things
  * "I need every mention of a street address or business in Garland, Texas."

# Additional Details

The idea is we make short videos that focus on the aspects of NLP that currently work well and are useful.

Speech-to-text now works pretty well, so these methods will also be useful for the audio portions of videos.

All code will be available on GitHub here https://github.com/jmugan/modern_practical_nlp

# About Me, Jonathan Mugan
* PhD in Computer Science in 2010 from UT Austin
* Thesis work was about how a robot could wake up in the world and figure out what is going on
* Work at [DeUmbra](https://deumbra.com/) where we build AI for the DoD
  * We also work in healthcare, which I can talk about. A future video (not in this series) will cover how we use graph neural networks to identify who is at risk for opioid overdose
* Wrote *The Curiosity Cycle: Preparing Your Child for the Ongoing Technological Explosion* http://www.jonathanmugan.com/CuriosityCycle/
* Also do independent consulting work
* Can find me here jonathanwilliammugan@gmail.com or on Twitter at [@jmugan](https://twitter.com/jmugan)

# The Limits of NLP
## Computers can't read
* Reading requires mapping language to internal concepts grounded in behaving in the same general environment as the writer.
  * Computers don’t have those concepts.
  * Example: “I pulled the wagon.” Computers don’t know that wagons can carry things or that pulling exerts a gentle tension on the arm and leg muscles as one walks.
  
## Computers can't write
* Writing requires mapping internal concepts to language, where those concepts are grounded in behaving in the same general environment as the expected reader.
  * Computers don’t have those concepts.

# Additional Information on NLP, AI, and Their Limits and Promise
* [Generating Natural-Language Text with Neural Networks](https://medium.com/@jmugan/generating-natural-language-text-with-neural-networks-e983bb48caad)
* [Why Is There Life? and What Does It Have to Do with AI?](https://towardsdatascience.com/why-is-there-life-and-what-does-it-have-to-do-with-ai-2195ac91532f)
* [Chatbots: Theory and Practice](https://medium.com/intuitionmachine/chatbots-theory-and-practice-3274f7e6d648)
* [You and Your Bot: A New Kind of Lifelong Relationship](https://chatbotsmagazine.com/you-and-your-bot-a-new-kind-of-lifelong-relationship-6a9649feeb71)
* [Computers Could Understand Natural Language Using Simulated Physics](https://chatbotslife.com/computers-could-understand-natural-language-using-simulated-physics-26e9706013da)
* [The Two Paths from Natural Language Processing to Artificial Intelligence](https://medium.com/intuitionmachine/the-two-paths-from-natural-language-processing-to-artificial-intelligence-d5384ddbfc18)
* [DeepGrammar: Grammar Checking Using Deep Learning](https://www.linkedin.com/pulse/deep-grammar-checking-using-learning-jonathan-mugan)
* [Deep Learning for Natural Language Processing](https://www.linkedin.com/pulse/deep-learning-natural-language-processing-jonathan-mugan)
* [What Deep Learning Really Means](https://www.linkedin.com/pulse/20141114065942-42285562-what-deep-learning-really-means)

## NLP Works Around Computers Not Having the Experience or Conceptual Framework to Read and Write
* NLP is about how to make natural language amenable to computation even though computers can’t read or write.
* Representing text as *vectors* has transformed NLP in the last 10 years.
* There are also *symbolic methods* that are practically useful; we will cover those too.

# Now the Meat of Video 1: Generating Vectors from Text
* By converting a sentence, paragraph, or document to a vector, we can measure its distance to the vectors of other texts.
  * We can find similar texts this way.
* This is similar to how computer vision works. We can say there is a cat in an image, and we can say that one image is similar to another, but we don't really have a human-understandable representation of what exactly is going on.
* But it is still useful.
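
To make "distance between vectors" concrete, here is a minimal sketch of cosine distance on made-up vectors. The three-dimensional `doc_*` vectors are hypothetical stand-ins for real text vectors, which have hundreds of dimensions.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "text vectors", made up for illustration.
doc_a = np.array([1.0, 2.0, 0.0])
doc_b = np.array([2.0, 4.0, 0.0])  # same direction as doc_a, twice the length
doc_c = np.array([0.0, 0.0, 3.0])  # orthogonal to doc_a

print(cosine_distance(doc_a, doc_b))  # close to 0: same direction
print(cosine_distance(doc_a, doc_c))  # close to 1: nothing in common
print(np.linalg.norm(doc_a - doc_b))  # Euclidean distance is not 0 here
```

Cosine distance ignores vector length and only compares direction, which is one reason it is a common choice for comparing text vectors.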

# Tools to make vectors from text
* PyTorch https://pytorch.org/
  * Like TensorFlow, does automatic differentiation
* HuggingFace Transformers library https://huggingface.co/transformers/index.html
  * Built on top of PyTorch
* spaCy https://spacy.io/
  * Most common tool for NLP practitioners

Get some Python going. If you have Anaconda, you can run the following from the command line:

`conda create -n nlp_videos python=3.8`

`conda activate nlp_videos`




Install PyTorch from the command line

`pip install torch torchvision`

Install HuggingFace Transformers from the command line

`pip install transformers`

In [1]:
# import a bunch of stuff
from typing import List, Tuple
import scipy.spatial.distance  # import the submodule explicitly; used later for cosine distance
import numpy as np
import os.path
from pathlib import Path
import pickle

import torch  # PyTorch

from transformers import BertModel, BertTokenizer, BertConfig

# Import the model

In [2]:
# https://huggingface.co/transformers/pretrained_models.html
model_name = 'bert-base-uncased'

# Need to use the same tokenizer that was used to train the model so that it breaks 
# up words into tokens the same way.
tokenizer = BertTokenizer.from_pretrained(model_name)

# This model is huge: roughly 110 million parameters.
model = BertModel.from_pretrained(model_name)

# Parameters used by the pre-trained model
config = BertConfig.from_pretrained(model_name)

## Define a function to convert text to tokens

In [3]:
def get_tokens(text: str,
               tokenizer: BertTokenizer,
               config: BertConfig) -> List[str]:
    
    tokens = tokenizer.tokenize(text)
    
    # make sure it isn't too long
    max_length = config.max_position_embeddings
    tokens = tokens[:max_length-1] # Will add special begin token
    
    # cls token to hold vector 
    # https://huggingface.co/transformers/main_classes/tokenizer.html
    tokens = [tokenizer.cls_token] + tokens
    
    return tokens

In [4]:
text = "I went to the store."
tokens = get_tokens(text, tokenizer, config)
print(tokens)

['[CLS]', 'i', 'went', 'to', 'the', 'store', '.']
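
For words that are not in its vocabulary as a whole, the BERT tokenizer breaks them into subword pieces using a greedy longest-match-first strategy (WordPiece). Here is a minimal sketch of that idea for a single lowercase word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical; the real tokenizer also handles lowercasing, punctuation, and other details.

```python
from typing import List, Set

def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real one has roughly 30,000 entries.
toy_vocab = {"store", "wrestl", "##ing", "##s", "cat"}

print(wordpiece_tokenize("store", toy_vocab))
print(wordpiece_tokenize("wrestling", toy_vocab))
print(wordpiece_tokenize("cats", toy_vocab))
print(wordpiece_tokenize("zzz", toy_vocab))
```

This is why you need the tokenizer that matches the model: a different vocabulary would split the same text into different pieces.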


# Convert tokens into integer ids

In [5]:
token_ids: List[int] = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[101, 1045, 2253, 2000, 1996, 3573, 1012]


# Convert the input into a Torch tensor

In [6]:
token_ids_tensor = torch.tensor(token_ids)
print(token_ids_tensor.shape, token_ids_tensor)

torch.Size([7]) tensor([ 101, 1045, 2253, 2000, 1996, 3573, 1012])


# Make it shape `(1 x num_tokens)`

In real applications, we would send a bunch of sentences in at once.

In [7]:
token_ids_tensor = torch.unsqueeze(token_ids_tensor, 0)
print(token_ids_tensor.shape, token_ids_tensor)

torch.Size([1, 7]) tensor([[ 101, 1045, 2253, 2000, 1996, 3573, 1012]])
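
To send several sentences in at once, they all need the same length, so shorter ones get padded and a mask records which positions are real. Here is a minimal sketch in NumPy. The helper `pad_batch` is hypothetical, and `pad_id=0` assumes the `[PAD]` id of `bert-base-uncased` (check `tokenizer.pad_token_id` for your model).

```python
import numpy as np
from typing import List, Tuple

def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Tuple[np.ndarray, np.ndarray]:
    """Right-pad id sequences to a common length and build a matching mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids = np.full((len(batch), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(batch), max_len), dtype=np.int64)
    for row, ids in enumerate(batch):
        input_ids[row, :len(ids)] = ids
        attention_mask[row, :len(ids)] = 1  # 1 = real token, 0 = padding
    return input_ids, attention_mask

batch = [
    [101, 1045, 2253, 2000, 1996, 3573, 1012],  # the ids from the cell above
    [101, 1045, 2253, 1012],                    # a shorter sequence built from the same ids
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids.shape)
print(attention_mask)
```

In practice the HuggingFace tokenizer can do the padding for you; the sketch just shows what the resulting arrays look like.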


# Let's convert it into a vector

In [8]:
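# Index into the output so this works whether the installed transformers version
# returns a plain tuple or a ModelOutput object.
outputs = model(token_ids_tensor)
last_hidden_state, pooler_output = outputs[0], outputs[1]

# The pooler output is the last-layer hidden state of the first token ([CLS])
# after a linear layer and tanh.
# Since this uses attention, it takes the whole sequence into account.
vector = pooler_output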

print(type(vector), vector)

<class 'torch.Tensor'> tensor([[-6.8463e-01, -2.8959e-01, -2.9154e-01,  6.7127e-01,  3.9507e-01,
         -2.1678e-01,  6.2011e-01,  1.9677e-01, -2.8793e-02, -9.4984e-01,
          6.5669e-02,  4.6028e-01,  9.6631e-01, -2.3616e-01,  9.4689e-01,
         -6.0630e-01, -4.7309e-01, -4.3481e-01,  2.0222e-01, -3.4150e-01,
          8.3457e-01,  9.8431e-01,  2.2365e-01,  3.2114e-01,  4.1334e-01,
          2.7076e-01, -3.6277e-01,  9.4006e-01,  8.4959e-01,  6.9592e-01,
         -6.7077e-01, -7.6625e-02, -9.8442e-01, -1.7111e-01, -7.3174e-01,
         -9.3164e-01,  7.5510e-02, -3.6217e-01, -1.8402e-02,  2.4065e-01,
         -8.9202e-01,  2.1222e-01,  9.8343e-01, -7.1324e-01,  6.2176e-01,
         -1.2014e-01, -9.9812e-01,  7.7176e-02, -8.9065e-01, -4.3335e-01,
          2.4937e-01,  1.9725e-01, -1.0021e-02,  2.5571e-01,  2.0711e-01,
         -4.4047e-01, -1.4055e-01,  1.4745e-01, -4.4538e-02, -2.2215e-01,
         -5.2094e-01,  3.4585e-01, -5.1174e-01, -7.0657e-01, -1.8190e-01,
         -1.150

# Let's convert it into Numpy and make it one-dimensional

In [9]:
np_vector = vector.detach().numpy()

# make it one-dimensional
np_vector = np_vector.squeeze()

print(type(np_vector), np_vector)

<class 'numpy.ndarray'> [-6.84628487e-01 -2.89591521e-01 -2.91541070e-01  6.71265423e-01
  3.95071477e-01 -2.16784209e-01  6.20105147e-01  1.96766943e-01
 -2.87933350e-02 -9.49840426e-01  6.56687692e-02  4.60277617e-01
  9.66312706e-01 -2.36155078e-01  9.46891129e-01 -6.06303632e-01
 -4.73092347e-01 -4.34805721e-01  2.02217102e-01 -3.41503441e-01
  8.34570289e-01  9.84307647e-01  2.23645478e-01  3.21142077e-01
  4.13339913e-01  2.70759493e-01 -3.62768382e-01  9.40062046e-01
  8.49591494e-01  6.95917845e-01 -6.70773387e-01 -7.66250417e-02
 -9.84417677e-01 -1.71114579e-01 -7.31735408e-01 -9.31641102e-01
  7.55103827e-02 -3.62168103e-01 -1.84017215e-02  2.40652755e-01
 -8.92024696e-01  2.12224603e-01  9.83429790e-01 -7.13237345e-01
  6.21758878e-01 -1.20137155e-01 -9.98118401e-01  7.71760195e-02
 -8.90646696e-01 -4.33352202e-01  2.49371767e-01  1.97246686e-01
 -1.00208428e-02  2.55708367e-01  2.07114801e-01 -4.40470636e-01
 -1.40554667e-01  1.47446528e-01 -4.45383005e-02 -2.22154856e-01
 

# Let's make it a nice function

In [10]:
def make_vector(text: str) -> np.ndarray:
    tokens = get_tokens(text, tokenizer, config)
    token_ids: List[int] = tokenizer.convert_tokens_to_ids(tokens)
    token_ids_tensor = torch.tensor(token_ids)
    token_ids_tensor = torch.unsqueeze(token_ids_tensor, 0)
    outputs = model(token_ids_tensor)  # index instead of unpacking: works for a tuple or a ModelOutput
    last_hidden_state, pooler_output = outputs[0], outputs[1]
    vector = pooler_output
    #vector = last_hidden_state.mean(dim=1)  # Can do this too
    np_vector = vector.detach().numpy()
    np_vector = np_vector.squeeze()
    return np_vector
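
The commented-out `last_hidden_state.mean(dim=1)` alternative averages over every token position. With a single unpadded sentence that is fine, but with a padded batch the padding positions should be left out of the average. Here is a minimal sketch of that masked average in NumPy; the function name `masked_mean_pool` and the numbers are made up for illustration.

```python
import numpy as np

def masked_mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """
    hidden_states: (batch, num_tokens, hidden_size)
    attention_mask: (batch, num_tokens), 1 for real tokens and 0 for padding
    Returns one vector per sequence, shape (batch, hidden_size).
    """
    mask = attention_mask[:, :, np.newaxis].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid dividing by zero
    return summed / counts

# Made-up numbers: 1 sequence, 3 token slots (the last is padding), hidden size 2.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # the padded slot is ignored
```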

# This uses the transformer model and takes the sequence of words into account

* The transformer model can be viewed as a graph neural network with a lot of layers, where every token is connected to every other token. One way it is trained is that, for a hidden (masked) token position, it looks at all of the surrounding tokens to predict what that token should be.
  * Great explanation of transformers http://nlp.seas.harvard.edu/2018/04/03/attention.html
  * BERT builds on transformers https://arxiv.org/pdf/1810.04805.pdf
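
The "looking around at the other tokens" step is attention. Here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The random weights and sizes are assumptions for illustration; a real transformer layer adds multiple heads, learned weights, residual connections, and more.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (num_tokens, d_model). Returns new token vectors and the attention weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # how much each token attends to each other token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
num_tokens, d_model = 4, 8                   # made-up sizes
x = rng.normal(size=(num_tokens, d_model))   # stand-in for token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)
print(weights.sum(axis=1))
```

Each output row is a weighted mix of all token vectors, which is how the vector at one position can take the whole sequence into account.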

# Using word order is the hard way
* The easy way is to learn a vector for each word and then make the vector for the sentence the average of the word vectors.
* Let's try that with spaCy.
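
Before bringing in spaCy, here is a toy illustration of what averaging gives up: two sentences with the same words in a different order get exactly the same vector. The two-dimensional `toy_vectors` are hypothetical and made up for illustration.

```python
import numpy as np
from typing import Dict

# Hypothetical 2-dimensional word vectors, made up for illustration.
toy_vectors: Dict[str, np.ndarray] = {
    "dog": np.array([1.0, 0.0]),
    "bites": np.array([0.0, 1.0]),
    "man": np.array([1.0, 1.0]),
}

def average_vector(text: str) -> np.ndarray:
    """Average the vectors of the whitespace-separated words in the text."""
    words = text.lower().split()
    return np.mean([toy_vectors[w] for w in words], axis=0)

a = average_vector("dog bites man")
b = average_vector("man bites dog")
print(a, b)
print(np.allclose(a, b))  # word order is lost
```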

# Install spaCy

`pip install -U spacy`

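The large model has good word vectors. Note that `en_vectors_web_lg` is a spaCy v2 package; in spaCy v3 and later it is no longer available, and `en_core_web_lg` is the large English model that includes word vectors.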

`python -m spacy download en_vectors_web_lg`

In [11]:
import spacy
nlp = spacy.load("en_vectors_web_lg")
tokens = nlp("dog cat banana afskfsd")
for token in tokens:
    print(token, token.vector)

dog [-5.71195185e-02  5.26851341e-02  3.02558858e-03 -4.85166162e-02
  7.04297796e-03  4.18558009e-02 -2.47040205e-02 -3.97829153e-02
  9.61403828e-03  3.08416426e-01 -8.91298205e-02  4.13809419e-02
 -9.56399292e-02  3.31533775e-02 -4.87142392e-02  2.60333400e-02
  7.14079365e-02  1.51968971e-01  2.08966229e-02 -6.43049553e-02
 -5.94667979e-02 -2.27007996e-02  3.80284972e-02 -6.94757923e-02
  5.18392026e-02 -6.17074501e-03 -3.47954780e-02 -5.93601651e-02
  1.26659293e-02 -3.63281034e-02 -7.91833773e-02  1.74062699e-02
 -1.18751619e-02  7.83303455e-02  5.17652743e-02  2.18392313e-02
  7.92445242e-02 -1.28953964e-01 -6.98042149e-03  5.48504330e-02
  5.40258288e-02  2.05084905e-02 -3.87009755e-02 -5.26268445e-02
 -1.83460359e-02 -2.14468315e-02 -5.41338809e-02  7.04937521e-03
  1.81341972e-02 -1.17702372e-02  2.03862209e-02  4.62589078e-02
  3.87080871e-02  6.20330237e-02 -4.51670177e-02  1.12892650e-01
  3.77171375e-02  1.44092571e-02 -4.73138280e-02  6.13008291e-02
  2.37244479e-02  1.5

# Let's make a function to get the average vector

In [12]:
def get_average_vector_for_text(text: str) -> np.ndarray:
    tokens = nlp(text)
    vectors = []
    for token in tokens:
        vectors.append(token.vector)
    all_vecs = np.array(vectors)
    #print("shape: ", all_vecs.shape)
    vec = np.mean(all_vecs, axis=0)
    return vec

# Let's compare the average vector method with the transformer method
* We will compare them on the task of finding similar tweets.

# Load Tweets
The tweets can be found with the code at https://github.com/jmugan/modern_practical_nlp

In [13]:
all_tweets: List[str] = []
with open('jmugan_tweets.txt', 'r') as f:
    for tweet in f:
        tweet = tweet.strip()  # remove newline
        all_tweets.append(tweet)

# Generate the Vectors
* Save them after that because it is a bit slow

In [14]:
home = str(Path.home())
data_dir = os.path.join(home, 'temp')
os.makedirs(data_dir, exist_ok=True)  # make sure the directory exists before saving pickles
print("Data directory: ", data_dir)

transformer_vec_pickle_file = os.path.join(data_dir, 'transformer_vecs.pkl')
average_vec_pickle_file = os.path.join(data_dir, 'average_vecs.pkl')

if os.path.isfile(transformer_vec_pickle_file):
    print("Loading saved transformer vecs.")
    with open(transformer_vec_pickle_file, 'rb') as f:
        all_transformer_vecs = pickle.load(f)
else:
    print("Generating transformer vecs.")
    all_transformer_vecs: List[np.ndarray] = []
    for tweet in all_tweets:
        vector = make_vector(tweet)
        all_transformer_vecs.append(vector)
    with open(transformer_vec_pickle_file, 'wb') as f:
        pickle.dump(all_transformer_vecs, f)

if os.path.isfile(average_vec_pickle_file):
    print("Loading saved average vecs.")
    with open(average_vec_pickle_file, 'rb') as f:
        all_average_vecs = pickle.load(f)
else:
    print("Generating average vecs.")
    all_average_vecs: List[np.ndarray] = []
    for tweet in all_tweets:
        vector = get_average_vector_for_text(tweet)
        all_average_vecs.append(vector)
    with open(average_vec_pickle_file, 'wb') as f:
        pickle.dump(all_average_vecs, f)

Data directory:  /Users/jmugan/temp
Loading saved transformer vecs.
Generating average vecs.
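
The two load-or-generate blocks above follow the same pattern, so they could be folded into one helper. Here is a minimal sketch; `load_or_compute` is a hypothetical name, not part of the original code.

```python
import os
import pickle
import tempfile
from typing import Any, Callable

def load_or_compute(pickle_file: str, compute: Callable[[], Any]) -> Any:
    """Load a pickled result if it exists; otherwise compute it and save it."""
    if os.path.isfile(pickle_file):
        with open(pickle_file, 'rb') as f:
            return pickle.load(f)
    result = compute()
    os.makedirs(os.path.dirname(pickle_file) or '.', exist_ok=True)
    with open(pickle_file, 'wb') as f:
        pickle.dump(result, f)
    return result

# Small self-contained demo with a temporary file.
demo_file = os.path.join(tempfile.mkdtemp(), 'squares.pkl')
first = load_or_compute(demo_file, lambda: [n * n for n in range(5)])
second = load_or_compute(demo_file, lambda: [])  # loaded from disk, lambda not called
print(first, second)

# Usage with the names from the cell above (requires the loaded model):
# all_transformer_vecs = load_or_compute(
#     transformer_vec_pickle_file,
#     lambda: [make_vector(tweet) for tweet in all_tweets])
```

Remember to delete the pickle file if the tweets or the model change, since the cached vectors would otherwise be stale.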


# Code to Find Most Similar

In [15]:
def get_most_similar_vector(text_vec: np.ndarray, vectors: List[np.ndarray]
                           ) -> Tuple[float, str]:
    """
    This is simple and slow. In reality you want to use something like 
    Faiss to find the closest vector
    https://engineering.fb.com/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
    """
    closest_index = None
    smallest_distance = float('inf')  # any real distance will be smaller
    for i, vector in enumerate(vectors):
        #dist = np.linalg.norm(vector - text_vec)  # fancy people use cosine distance
        dist = scipy.spatial.distance.cosine(vector, text_vec)
        if dist < smallest_distance:
            smallest_distance = dist
            closest_index = i
    return smallest_distance, all_tweets[closest_index]


def get_most_similar_tweet_transformer(text: str) -> Tuple[float, str]:
    text_vec = make_vector(text)
    return get_most_similar_vector(text_vec, all_transformer_vecs)

def get_most_similar_tweet_average(text: str) -> Tuple[float, str]:
    text_vec = get_average_vector_for_text(text)
    return get_most_similar_vector(text_vec, all_average_vecs)
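
Short of a dedicated library, the Python loop above can also be replaced with a few NumPy operations. Here is a minimal sketch that returns the index of the closest vector by cosine distance. The name `most_similar_index` and the toy vectors are hypothetical, and the sketch assumes no vector is all zeros (an all-zero vector would cause a division by zero).

```python
import numpy as np
from typing import List, Tuple

def most_similar_index(query: np.ndarray, vectors: List[np.ndarray]) -> Tuple[float, int]:
    """Return (cosine distance, index) of the stored vector closest to the query."""
    matrix = np.stack(vectors)                                  # (num_vectors, dim)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    distances = 1.0 - matrix_unit @ query_unit                  # all distances at once
    best = int(np.argmin(distances))
    return float(distances[best]), best

# Made-up 2-dimensional vectors for illustration.
toy_vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
dist, idx = most_similar_index(np.array([2.0, 2.1]), toy_vectors)
print(idx, round(dist, 4))
```

This is still a brute-force search over every stored vector; it is just faster than the loop because the work happens inside NumPy.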

# Try it Out

In [16]:
orig = "Cats don't like to wrestle."
tweet_1 = ("This transformer vector approach doesn't work well on tweets because" +
          " it was trained on other data.")
tweet_2 = "I like to watch movies."  
tweet_3 = "My children like to climb trees, drink water, and read about aardvarks."

new_tweets = [orig, tweet_1, tweet_2, tweet_3]

for tweet in new_tweets:
    print("orig: ",tweet)
    dist, closest = get_most_similar_tweet_transformer(tweet)
    print(f"Transformer ({dist:.2f}): {closest}")
    dist, closest = get_most_similar_tweet_average(tweet)
    print(f"Average ({dist:.2f}): {closest}")
    print()


orig:  Cats don't like to wrestle.
Transformer (0.00): Cats don't like to wrestle.
Average (0.00): Cats don't like to wrestle.

orig:  This transformer vector approach doesn't work well on tweets because it was trained on other data.
Transformer (0.04): Children used to build either glue or snap-on models. Now, Legos are so detailed that kids build with those instead.
Average (0.07): Surprisingly, the search feature on Google docs doesn't work very well. If only they had access to search experts.

orig:  I like to watch movies.
Transformer (0.03): I just watched 127 Hours with the special lady. Felt kind of bad sitting there with my microwave popcorn and big glass of cold water.
Average (0.10): I like a lot of kids movies, but I just can't get into these Ice Age films.

orig:  My children like to climb trees, drink water, and read about aardvarks.
Transformer (0.02): It's funny how, in a group, the result of hoarding is more hoarding.
Average (0.11): Amazing how kids practice actions. 

# Tweets are Weird
* BERT would work better if we used it for data other than tweets. Tweets are weird.
* You can fine-tune BERT on your data for finding similar documents, but it is a little more involved. Here's a place to get you started: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb


# Conclusion
* We have seen that we can convert natural language into vectors, which is a representation amenable to computation and automation.
* In the next video, we'll look at fine-tuning these vectors for classification so that we can put documents into buckets.