# Topic 39+: Deeper NLP

1. Word vectors
    - Word vectors with Gensim
    - Word vectors with SpaCy
2. Topic Modeling

In [None]:
import numpy as np
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from sklearn.decomposition import PCA

import gensim.downloader as api
from gensim.test.utils import datapath
from gensim.models import KeyedVectors

# Word Embedding

Embedding vectors are very different from the TF-IDF or count vectors we learned about previously.  TF-IDF and count vectors record only how often each word appears in a document; they carry no information about the words' meanings.  

Embedding vectors capture the semantic meaning of words.  Think about that for a second.  How can you turn the MEANING of a word into a vector of numbers?

Count and TF-IDF vectorizers transform a document into a sparse vector similar to a one-hot encoding (but with values not limited to 0 and 1).  Embedding models transform each word (there are ways to transform a sentence or document into one vector, but we'll talk about that later) into a vector in a high-dimensional vector space.  In this case the vector is dense rather than sparse: it describes the position of the word along every dimension of that space.  This is just like how (.5, .3, .7) describes the position of a point in 3-dimensional space (x, y, z). Word vector spaces can have 50, 100, 500, or more dimensions.
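
To make the sparse versus dense contrast concrete, here is a minimal sketch. The vocabulary, the document, and the embedding values are all invented for illustration; real embeddings would come from a trained model.

```python
import numpy as np

# Toy vocabulary and document (both made up for this sketch).
vocab = ["coffee", "tea", "hot", "cold", "like", "my"]
doc = "i like my coffee hot hot".split()

# Count vector: one slot per vocabulary word. With a realistic vocabulary
# of tens of thousands of words, almost every slot would be zero.
count_vector = np.array([doc.count(word) for word in vocab])

# Dense embeddings: each WORD gets a short vector of real numbers.
# These values are invented; they only illustrate the shape of the data.
embeddings = {
    "coffee": np.array([0.5, 0.3, 0.7]),
    "tea": np.array([0.4, 0.3, 0.8]),
}

print(count_vector)
print(embeddings["coffee"])
```

The count vector grows with the vocabulary, while the embedding length is fixed by the model no matter how many words it knows.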

**What does a dimension represent?** 

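A dimension (more precisely, a direction) in this space represents a relationship between words.  For instance, one direction may capture gender, and another may capture social status.  In practice the learned axes rarely line up neatly with a single human-readable concept, so treat these labels as intuition rather than something you can read directly off dimension x or y.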

**How are these vectors determined?** 

Embeddings are learned by an unsupervised model, somewhat like PCA.  The model trains on a corpus to determine how words are related to each other in the texts.

The embeddings can be learned from your corpus of documents through models like Word2Vec or can be downloaded from pretrained embedding models.
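
As a rough illustration of "how words are related to each other in the texts", the sketch below counts which words appear next to each other in a tiny made-up corpus. This is only the counting step, not a training algorithm. GloVe builds on co-occurrence statistics of this kind, while Word2Vec learns by predicting nearby words. The corpus and the `window` size are assumptions for this example.

```python
from collections import Counter

# Tiny invented corpus, already tokenized.
corpus = [
    "i drink hot coffee".split(),
    "i drink hot tea".split(),
]
window = 1  # hypothetical context size: one word on each side

# Count how often each (word, neighbor) pair occurs within the window.
cooccurrence = Counter()
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooccurrence[(word, sentence[j])] += 1

print(cooccurrence[("hot", "coffee")])
print(cooccurrence[("hot", "drink")])
```

Notice that 'coffee' and 'tea' never appear together, yet they share the same neighbor ('hot'). Shared context of this kind is what pushes their embeddings close to each other.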


**Bias Alert**

The dimensions in an embedding model can and do represent bias inherent in language.  Dimensions can represent semantic relationships such as race, gender, ability, sexuality, etc.  'Doctor' and 'Custodian' may occupy different positions along a racial or gender dimension!  There are ways to reduce this bias by collapsing a dimension, that is, projecting the vectors onto a lower-dimensional space that leaves that dimension out.  A math-heavy and comprehensive paper released by Stanford researchers is available [here](http://cs229.stanford.edu/proj2016/report/BadieChakrabortyRudder-ReducingGenderBiasInWordEmbeddings-report.pdf).

If you want to test bias in your embedding model, try an analogy like: "'man' is to 'doctor' as woman is to 'X'". You'll learn below how to ask your model to complete these analogies.
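
The "collapse a dimension" idea can be sketched as removing a vector's component along a chosen bias direction. This is a simplified illustration, not the full method from the paper linked above. The helper `remove_direction` and all vector values are hypothetical.

```python
import numpy as np

def remove_direction(vector, direction):
    """Return `vector` with its component along `direction` removed."""
    unit = direction / np.linalg.norm(direction)
    return vector - np.dot(vector, unit) * unit

# Invented 3-d vectors; real ones would come from a model such as GloVe.
man = np.array([1.0, 0.2, 0.1])
woman = np.array([-1.0, 0.2, 0.1])
doctor = np.array([0.4, 0.9, 0.3])

# A crude estimate of a "gender direction": the difference of a word pair.
gender_direction = man - woman

neutral_doctor = remove_direction(doctor, gender_direction)
print(neutral_doctor)
```

After the projection, `neutral_doctor` has no component along `gender_direction`, so it sits equally close to 'man' and 'woman' along that axis.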

### Gensim Documentation

* Pretrained vectors: https://github.com/RaRe-Technologies/gensim-data
* Vector methods: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format

**GloVe Vectors**

GloVe vectors are a set of word embedding models pre-trained at Stanford and available for free.  There is a collection of different models available.  The one we use below projects words into a 100-dimensional space and is trained on the full corpus of Wikipedia plus the Gigaword 5 collection gathered from various news sources.  Documentation can be found [here](https://nlp.stanford.edu/projects/glove/).

In [None]:
word_vectors = api.load("glove-wiki-gigaword-100")

## Vector Lookup

In [None]:
word_vectors['coffee']

## Word similarity 

In [None]:
word_vectors.most_similar('coffee')

In [None]:
word_vectors.most_similar('hilton')

In [None]:
result = word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))
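
`most_similar` ranks words by cosine similarity, which measures the angle between two vectors and ignores their lengths. The sketch below reproduces that ranking on a tiny invented vocabulary. The `toy` vectors and the helper names are assumptions for illustration, not part of gensim.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 2-d vectors standing in for real embeddings.
toy = {
    "coffee": np.array([0.9, 0.1]),
    "tea": np.array([0.8, 0.2]),
    "wolf": np.array([0.1, 0.9]),
}

def toy_most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(toy_most_similar("coffee", toy))
```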

## Analogies

In [None]:
def analogy(x1, x2, y1):
    # "x1 is to x2 as y1 is to ?": look for words near y1 + x2 - x1
    result = word_vectors.most_similar(positive=[y1, x2], negative=[x1])
    return result

In [None]:
analogy('japan', 'japanese', 'australia')

In [None]:
analogy('australia', 'beer', 'france')

In [None]:
analogy('obama', 'clinton', 'reagan')

In [None]:
analogy('tall', 'tallest', 'long')

In [None]:
analogy('particular', 'fussy', 'subservient')
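
Under the hood, an analogy query is vector arithmetic followed by a similarity search. The sketch below mirrors the `analogy(x1, x2, y1)` signature on invented 2-d vectors. It is a simplification: gensim's implementation differs in details such as how the vectors are normalized. The `toy` vocabulary and `toy_analogy` are hypothetical.

```python
import numpy as np

# Invented vectors: first axis loosely "royalty", second loosely "gender".
toy = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-1.0, 2.0]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_analogy(x1, x2, y1, vectors):
    """x1 is to x2 as y1 is to ? Returns the best-matching word."""
    target = vectors[y1] + vectors[x2] - vectors[x1]
    # Skip the query words themselves, as most_similar does.
    candidates = [w for w in vectors if w not in (x1, x2, y1)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

print(toy_analogy("man", "king", "woman", toy))
```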

## Investigating Bias

What are your thoughts about the results below?  Does this word embedding model contain bias?  Feel free to try some more analogies to investigate further.

In [None]:
analogy('white','doctor','black')

## Odd One Out?

In [None]:
word_vectors.doesnt_match("england france germany russia".split())
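
One way to think about the odd-one-out query: average the word vectors, then pick the word that is least similar to that average. The sketch below does this on invented vectors. `odd_one_out` and `toy` are hypothetical names, and this only approximates what gensim does internally.

```python
import numpy as np

# Invented 2-d vectors standing in for real embeddings.
toy = {
    "coffee": np.array([0.9, 0.1]),
    "tea": np.array([0.8, 0.2]),
    "beer": np.array([0.85, 0.15]),
    "wolf": np.array([0.1, 0.9]),
}

def odd_one_out(words, vectors):
    """Return the word whose vector is least similar to the group mean."""
    # Scale every vector to unit length so only direction matters.
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    centre = unit.mean(axis=0)
    similarities = unit @ centre
    return words[int(np.argmin(similarities))]

print(odd_one_out(["coffee", "tea", "beer", "wolf"], toy))
```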

## Embedding Sentences/Documents

In [None]:
sentence = 'I like my coffee hot'

In [None]:
vectors = []
# the GloVe vocabulary is lowercase, so lowercase first or 'I' would be skipped
for w in sentence.lower().split():
    try:
        vectors.append(word_vectors[w])
    except KeyError:
        pass

In [None]:
#Sum the vectors to create an embedding vector that represents the entire sentence.
sum(vectors)
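
Summing works, but the result grows with sentence length. Averaging is a common alternative that keeps sentences of different lengths on a comparable scale. A minimal sketch on invented vectors follows. `embed_sentence` and `toy` are hypothetical, and lowercasing the input is an assumption that matches this toy vocabulary.

```python
import numpy as np

# Invented 2-d vectors standing in for real embeddings.
toy = {
    "like": np.array([1.0, 1.0]),
    "coffee": np.array([1.0, 0.0]),
    "hot": np.array([0.0, 1.0]),
}

def embed_sentence(sentence, vectors, dim):
    """Average the vectors of the known words; zeros if none are known."""
    found = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

print(embed_sentence("I like coffee", toy, dim=2))
print(embed_sentence("unknown words only", toy, dim=2))
```

Returning a zero vector for sentences with no known words is one design choice among several. Depending on the task, dropping such sentences may be safer.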

## Graphical Representation

In [None]:
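def display_pca_scatterplot(model, words=None, sample=0):
    if words is None:
        # gensim >= 4 exposes the vocabulary as key_to_index; older versions used vocab
        vocab = list(model.key_to_index) if hasattr(model, 'key_to_index') else list(model.vocab)
        if sample > 0:
            # sample without replacement so no word is plotted twice
            words = np.random.choice(vocab, sample, replace=False)
        else:
            words = vocab
        
    # named `vectors` to avoid shadowing the global word_vectors model
    vectors = np.array([model[w] for w in words])

    # project the vectors onto their first two principal components
    twodim = PCA(n_components=2).fit_transform(vectors)
    
    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)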

In [None]:
display_pca_scatterplot(word_vectors, 
                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute'])

## SpaCy

SpaCy is a very powerful NLP library that can be used for many of the functions that the NLTK package provides (NLTK is still often used for its list of stopwords), plus word embedding models, and MORE!  If you are interested in NLP, I SERIOUSLY recommend you check out what SpaCy can do.

* Available SpaCy models: https://spacy.io/usage/models
* Documentation: https://spacy.io/usage/processing-pipelines



In [None]:
import spacy
import pandas as pd
import numpy as np
from tqdm import tqdm

tqdm.pandas()

In [None]:
raw = pd.read_csv("../resources/nlp_classification.csv")
raw.head()

In [None]:
raw.shape

### SpaCy Objects

The first step in unlocking the power of SpaCy is to convert your documents into SpaCy objects.  This is done by downloading a pretrained pipeline, such as en_core_web_md (English, core, trained on web text, medium version), and calling it on each of your documents, which returns a SpaCy `Doc` object for each one.  We use the medium model below because the small model (en_core_web_sm) does not ship with word vectors.

In [None]:
### This installs spacy if you need it
# !pip install spacy

### This downloads the pretrained pipeline, which includes the word vectors

# !python -m spacy download en_core_web_md

In [None]:
nlp = spacy.load('en_core_web_md')

# df.progress_apply() applies a function to your dataframe and shows a progress bar

raw['spacy'] = raw.body.progress_apply(lambda x: nlp(x))

## SpaCy Vectors

In [None]:
# now each element under "spacy" is its own object!
first_spacy = raw.spacy[0]
print(type(first_spacy))
print(type(first_spacy[0]))

* https://spacy.io/api/token
* https://spacy.io/api/doc

In [None]:
print(len(first_spacy.vector))
first_spacy.vector

In [None]:
print(len(first_spacy[0].vector))
first_spacy[0].vector

## SpaCy Parts of Speech (pos)

In [None]:
[w.pos_ for w in first_spacy]

In [None]:
df = pd.DataFrame(np.vstack([x.vector for x in raw.spacy]))

In [None]:
df

# Topic Modeling

Topic modeling is an unsupervised technique that groups words that tend to occur together across a corpus into topics, which shows what the documents commonly discuss.  This can be useful for assigning classes to texts algorithmically when the topics are not known in advance.  It's also useful for data exploration, to better understand your corpus.

If you want a more comprehensive guide and explanation of topic modeling, I'll refer you to [this article](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/) by Selva Prabhakaran.

In [None]:
import gensim
from nltk.corpus import stopwords
import gensim.corpora as corpora

### Functionalize

It's always a good idea to functionalize your text processing pipeline so you can reuse it easily.

In [None]:
def process_words(texts, stop_words=None, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    # Build the stop word set at call time (needs nltk.download('stopwords') once).
    # Membership tests on a set are much faster than on a list.
    if stop_words is None:
        stop_words = stopwords.words("english")
    stop_words = set(stop_words)
    
    #use nested list comprehensions to drop stop words from each document
    texts = [[word for word in doc.split() if word not in stop_words] for doc in texts]
    texts_out = []
    
    #load your SpaCy model (the parser and NER components are not needed here)
    nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
    for sent in texts:
        doc = nlp(" ".join(sent))
        #keep the lemma of each token whose part of speech is in allowed_postags
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    # remove stopwords once more after lemmatization
    texts_out = [[word for word in doc if word not in stop_words] for doc in texts_out]    
    return texts_out

data_ready = process_words(raw.body) 

In [None]:
data_ready

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_ready)

# Create Corpus: bag-of-words (token id, count) pairs for each document
corpus = [id2word.doc2bow(text) for text in data_ready]

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=4, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=10,
                                           passes=10,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)
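
The dictionary and bag-of-words steps above can be pictured with plain Python. The sketch below assigns each distinct token an integer id, then turns each document into (token id, count) pairs. The documents are invented, and gensim's own id assignment may differ from the first-seen order used here.

```python
from collections import Counter

# Invented, already-processed documents (lists of tokens).
docs = [
    ["coffee", "hot", "coffee"],
    ["tea", "hot"],
]

# Assign ids in first-seen order (an assumption of this sketch).
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Convert a token list into sorted (token id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

Word order is discarded at this step: LDA only sees which words occur in each document and how often.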

In [None]:
print(lda_model.print_topics())