# Lesson 4 - Lesson 1- Natural Language Processing NOTES

[good link to explain all of NLP](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72)

Two packages are used in these notes:
 1. NLTK - Lots of options, but its models are dated.
 2. SpaCy - The opposite: state-of-the-art models that are much faster, but you lose choices to work with.

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [17]:
import nltk

nltk.__version__

'3.4'

In [1]:
from nltk.corpus import gutenberg, stopwords

### Playing Around with NLTK

We want to:
1. Do an example of importing a text corpus.
2. Perform text pre-processing.

In [None]:
# from the NLTK corpus - going to use Jane Austen's "Persuasion" and Lewis Carroll's "Alice in Wonderland"

print(gutenberg.fileids())

persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# Print the first 100 characters of Alice in Wonderland.
print('\nRaw:\n', alice[0:100])

In [None]:
# Now, we want to use regular expressions to take out:
# Title, Chapter heading, and unnecessary line breaks & white space

import re 

# This pattern matches all text between square brackets.
pattern = r"\[.*?\]"  # raw string, so the backslashes reach the regex engine unchanged
persuasion = re.sub(pattern, "", persuasion)
alice = re.sub(pattern, "", alice)

# Now we'll match and remove chapter headings.
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)

# Remove newlines and other extra whitespace by splitting and rejoining.
persuasion = ' '.join(persuasion.split())
alice = ' '.join(alice.split())

# Final output check
print('Extra whitespace removed:\n', alice[0:100])
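
The same three cleaning steps can be tried on a short string before running them on a whole novel. This is a minimal sketch: the `sample` text is made up for illustration and only imitates the layout of the Gutenberg files.

```python
import re

# Hypothetical sample that imitates the layout of a Gutenberg text file.
sample = (
    "[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\n"
    "CHAPTER I. Down the Rabbit-Hole\n\n"
    "Alice was beginning\n   to get very tired"
)

# 1. Remove anything between square brackets (the title line).
cleaned = re.sub(r"\[.*?\]", "", sample)

# 2. Remove chapter headings; '.' stops at the end of the line.
cleaned = re.sub(r"CHAPTER .*", "", cleaned)

# 3. Collapse newlines and repeated spaces into single spaces.
cleaned = " ".join(cleaned.split())

print(cleaned)
```

Running the steps in this order matters: the chapter pattern relies on the line breaks still being there, so whitespace is collapsed last.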

In [2]:
# Using "stopwords" to identify (and later filter out) very common words that carry little meaning
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
# Using spaCy for analysis

import spacy
nlp = spacy.load('en')  # spaCy 3+ removed the 'en' shortcut; use spacy.load('en_core_web_sm') there

# All the processing work is done here, so it may take a while.
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

# Let's explore the objects we've built.
print("The alice_doc object is a {} object.".format(type(alice_doc)))
print("It is {} tokens long".format(len(alice_doc)))
print("The first three tokens are '{}'".format(alice_doc[:3]))
print("The type of each token is {}".format(type(alice_doc[0])))

### Applications of SpaCy

1. Counting words - good for organization and categorization.
    - This counts for _words_ __and__ _lemmas_ (groupings of the same root word, like thought, think, thinking, etc.)

...and that's it!
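
To make the difference between counting words and counting lemmas concrete, here is a small sketch using only `collections.Counter`. The `toy_lemmas` lookup is a hypothetical stand-in for what spaCy provides through `token.lemma_`.

```python
from collections import Counter

# Hypothetical lemma lookup; in these notes spaCy supplies this via token.lemma_.
toy_lemmas = {"thought": "think", "thinks": "think", "thinking": "think"}

tokens = ["alice", "thought", "she", "thinks", "while", "thinking"]

# Word counts treat every surface form as a separate item.
word_counts = Counter(tokens)

# Lemma counts group the different forms of the same root word.
lemma_counts = Counter(toy_lemmas.get(tok, tok) for tok in tokens)

print(word_counts["thinking"])  # counted as its own word
print(lemma_counts["think"])    # all three forms share one lemma
```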

In [None]:
from collections import Counter # good to know that it exists

# Utility function to calculate how frequently words appear in the text.
def word_frequencies(text, include_stop=True):
    
    # Build a list of words.
    # Strip out punctuation and, optionally, stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text)
            
    # Build and return a Counter object containing word counts.
    return Counter(words)
    
# The most frequent words:
alice_freq = word_frequencies(alice_doc).most_common(10)
persuasion_freq = word_frequencies(persuasion_doc).most_common(10)
print('Alice:', alice_freq)
print('Persuasion:', persuasion_freq)

In [None]:
# this spaCy organizational bonus can be very effective, if done creatively
# for example - finding top ten words in each book that are NOT in the other book

# Pull out just the text from our frequency lists.
alice_common = [pair[0] for pair in alice_freq]
persuasion_common = [pair[0] for pair in persuasion_freq]

# Use sets to find the unique values in each top ten.
print('Unique to Alice:', set(alice_common) - set(persuasion_common))
print('Unique to Persuasion:', set(persuasion_common) - set(alice_common))

In [None]:
# working with lemmas

# Utility function to calculate how frequently lemmas appear in the text.
def lemma_frequencies(text, include_stop=True):
    
    # Build a list of lemmas.
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)
            
    # Build and return a Counter object containing lemma counts.
    return Counter(lemmas)

# Instantiate our list of most common lemmas.
alice_lemma_freq = lemma_frequencies(alice_doc, include_stop=False).most_common(10)
persuasion_lemma_freq = lemma_frequencies(persuasion_doc, include_stop=False).most_common(10)
print('\nAlice:', alice_lemma_freq)
print('Persuasion:', persuasion_lemma_freq)

# Again, identify the lemmas common to one text but not the other.
alice_lemma_common = [pair[0] for pair in alice_lemma_freq]
persuasion_lemma_common = [pair[0] for pair in persuasion_lemma_freq]
print('Unique to Alice:', set(alice_lemma_common) - set(persuasion_lemma_common))
print('Unique to Persuasion:', set(persuasion_lemma_common) - set(alice_lemma_common))

In [None]:
################# can also separate SENTENCES (...and paragraphs)!!!!!

# Initial exploration of sentences.
sentences = list(alice_doc.sents)
print("Alice in Wonderland has {} sentences.".format(len(sentences)))

example_sentence = sentences[2]
print("Here is an example: \n{}\n".format(example_sentence))

# Look at some metrics around this sentence.
example_words = [token for token in example_sentence if not token.is_punct]
unique_words = set([token.text for token in example_words])

print(("There are {} words in this sentence, and {} of them are"
       " unique.").format(len(example_words), len(unique_words)))

In [None]:
################## can also identify PARTS OF SPEECH!!!!!
print(nlp("I need a break")[3].pos_)
print(nlp("I need to break the glass")[3].pos_)

# View the part of speech for some tokens in our sentence.
print('\nParts of speech:')
for token in example_sentence[:9]:
    print(token.orth_, token.pos_)

In [None]:
# even better - identify parts of speech IN CONTEXT! - "DEPENDENCIES"!
# View the dependencies for some tokens.
print('\nDependencies:')
for token in example_sentence[:9]:
    print(token.orth_, token.dep_, token.head.orth_)
    
# Extract the first ten entities.
entities = list(alice_doc.ents)[0:10]
for entity in entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))

In [None]:
# can even search for ''''type'''' of word

people = [entity.text for entity in list(alice_doc.ents) if entity.label_ == "PERSON"]
print(set(people))

In [None]:
# DATAFRAME SETTING UP - USE THIS!!!!!
import pandas as pd
pd.set_option('display.max_colwidth', 150)

# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

# Lesson 2 - Supervised NLP

### Go to bottom of note set for pretty sweet summary!

In [None]:
# ALWAYS ALWAYS clean your data first!
# That is what is happening within this cell.

# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    text = re.sub(r"\[.*?\]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])

In [None]:
# Parse the cleaned novels. This can take a bit.
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

In [None]:
# BAG OF WORDS WORK!

# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words out of the top 2000
    return [item[0] for item in Counter(allwords).most_common(2000)]
    

# Creates a data frame with features for ****each**** word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
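    # Scaffold the data frame and initialize counts to zero.
    # Newer pandas versions may reject a set as a column indexer, so use a sorted list.
    columns = sorted(common_words)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0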
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords)

In [None]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

#### Now that we have the dataframe set up via the bag-of-words function, we can perform machine learning on it!

In [None]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split
import numpy as np

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

#### As you can see...

The model is overfit, which is a common problem for bag-of-words models: with so many features relative to the number of sentences, the model ends up fitting noise. Random forests with fully grown trees also tend to fit the training data almost perfectly, so that's why we are where we are with this.

Trying the bag-of-words again with (1) logistic regression and (2) gradient boosting ensemble classifier.

In [None]:
# (1) logistic regression

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

In [None]:
# (2) GradientBoosting

gradboost = ensemble.GradientBoostingClassifier()
train = gradboost.fit(X_train, y_train)

print('Training set score:', gradboost.score(X_train, y_train))
print('\nTest set score:', gradboost.score(X_test, y_test))

In [None]:
# seeing how the above models hold up on a different text by the same author - Jane Austen's "Emma"
# for consistency purposes, need to load and clean it the same way we did with the previous two texts

import re 

emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma[:int(len(emma)/60)]) # Emma is long, so only using the first 1/60th of it
print(emma[:100])

In [None]:
# parse, group, and bag:

emma_doc = nlp(emma)                                                   # parse

persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents] # group persuasion sentences
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]             # group emma sentences

emma_sentences = pd.DataFrame(emma_sents)                              # bag-of-word counts into emma dataframe
emma_bow = bow_features(emma_sentences, common_words)

print('done')

In [None]:
# Now we can model it!
# Let's use logistic regression again.

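# Combine the Emma sentence data with the Alice data from the test set.
# X_test is a NumPy array, so select rows with a boolean mask rather than
# with the pandas index labels of y_test.
carroll_mask = (y_test == 'Carroll').values
X_Emma_test = np.concatenate((
    X_test[carroll_mask],
    emma_bow.drop(['text_sentence','text_source'], axis=1)
), axis=0)
y_Emma_test = pd.concat([y_test[carroll_mask],
                         pd.Series(['Austen'] * emma_bow.shape[0])],
                        ignore_index=True)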

# Model.
print('\nTest set score:', lr.score(X_Emma_test, y_Emma_test))
lr_Emma_predicted = lr.predict(X_Emma_test)
pd.crosstab(y_Emma_test, lr_Emma_predicted)

# Lesson 3 - Unsupervised NLP

The purpose of unsupervised NLP is to categorize sentences by common characteristics (i.e., an unsupervised problem). How we do that in NLP is to weight certain words that are representative of a category, and de-weight words that are common across categories.

In order to make this work, we need to "vectorize" the words into numbers. 

To penalize common words, we use __Inverse Document Frequency__:

$$\mathrm{idf}_t = \log_2\left(\frac{N}{\mathrm{df}_t}\right)$$

where $t$ is the term, $N$ is the total number of documents, and $\mathrm{df}_t$ is the raw count of documents that contain $t$ (log in base 2!).
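
The formula above can be worked through by hand on a tiny corpus. This is a minimal sketch with made-up documents; the helper name `idf` is hypothetical, and it assumes the term occurs in at least one document (otherwise the division fails). Library implementations such as scikit-learn use a smoothed, natural-log variant, so their numbers will differ.

```python
import math

# Toy corpus: each document is a list of tokens (made up for illustration).
docs = [
    ["emma", "is", "clever"],
    ["emma", "is", "rich"],
    ["harriet", "is", "pretty"],
    ["emma", "is", "kind"],
]

def idf(term, documents):
    """Unsmoothed inverse document frequency with log base 2."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log2(n_docs / doc_freq)

for term in ["is", "emma", "harriet"]:
    print(term, round(idf(term, docs), 3))
```

A word that appears in every document ("is") gets a weight of 0, while a word that appears in only one document ("harriet") gets the largest weight.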

Vectorizing will create a large number of features, and so it's important to cut the dimensional space down by PCA, or in the NLP context - __Latent Semantic Analysis__.

In [2]:
import nltk
from nltk.corpus import gutenberg
nltk.download('punkt')
nltk.download('gutenberg')
import re
from sklearn.model_selection import train_test_split

#reading in the data, this time in the form of paragraphs
emma=gutenberg.paras('austen-emma.txt')
#processing
emma_paras=[]
for paragraph in emma:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    emma_paras.append(' '.join(para))

print(emma_paras[0:4])

[nltk_data] Downloading package punkt to C:\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to C:\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


['[ Emma by Jane Austen 1816 ]', 'VOLUME I', 'CHAPTER I', 'Emma Woodhouse , handsome , clever , and rich , with a comfortable home and happy disposition , seemed to unite some of the best blessings of existence ; and had lived nearly twenty - one years in the world with very little to distress or vex her .']


In [3]:
# TF-IDF VECTORIZER - SKlearn makes it so so easy to figure this out for us!

from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test = train_test_split(emma_paras, test_size=0.4, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear at least twice
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case (since Alice in Wonderland has the HABIT of CAPITALIZING WORDS for EMPHASIS)
                             use_idf=True,#we definitely want to use ***inverse document frequencies*** in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )


#Applying the vectorizer
emma_paras_tfidf=vectorizer.fit_transform(emma_paras)
print("Number of features: %d" % emma_paras_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(emma_paras_tfidf, test_size=0.4, random_state=0)


#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]

#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]

#List of features
terms = vectorizer.get_feature_names()  # scikit-learn 1.0+: use get_feature_names_out() instead

#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

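#Keep in mind that scikit-learn's smoothed idf is ln((1 + N) / (1 + df)) + 1, so it is never 0. Words that are absent from a paragraph (or dropped by max_df, min_df, or the stop word list) simply do not show up in its dictionary.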
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_bypara[5])

Number of features: 1948
Original sentence: A very few minutes more , however , completed the present trial .
Tf_idf vector: {'minutes': 0.7127450310382584, 'present': 0.701423210857947}


The above created vectors for the words - at a rate of one vector per paragraph.

Now we need to cut down the n-dimensional space. Instead of PCA, the NLP world typically uses truncated __Singular Value Decomposition (SVD)__; applied to a tf-idf matrix, this is known as __Latent Semantic Analysis (LSA)__.

In [4]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

#Our SVD data reducer.  We are going to reduce the feature space from 1948 to 130.
svd= TruncatedSVD(130)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
X_train_lsa = lsa.fit_transform(X_train_tfidf) ###### LSA IS THE NAME OF THE MODEL USED HERE!

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

#Looking at what sorts of paragraphs our solution considers similar, for the first ***FIVE*** identified topics
paras_by_component=pd.DataFrame(X_train_lsa,index=X_train)
for i in range(5):
    print('Component {}:'.format(i))
    print(paras_by_component.loc[:,i].sort_values(ascending=False)[0:10])


Percent variance captured by all components: 45.19391444969529
Component 0:
" Oh !     0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !"    0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !     0.99929
" Oh !"    0.99929
Name: 0, dtype: float64
Component 1:
" You have made her too tall , Emma ," said Mr . Knightley .                                                                                                                0.634539
" You get upon delicate subjects , Emma ," said Mrs . Weston smiling ; " remember that I am here . Mr .                                                                     0.576886
" I do not know what your opinion may be , Mrs . Weston ," said Mr . Knightley , " of this great intimacy between Emma and Harriet Smith , but I think it a bad thing ."    0.571176
" You are right , Mrs . Weston ," said Mr . Knightley warmly , " Miss Fairfax is as capable as any of us of forming a just opinion of Mrs . Elton .       

Again, the above has _five_ components:
 - 0 focused on "Oh!"
 - 1 focused on Emma-related qualities
 - 2 focused on Chapter headings
 - 3 focused on the exclamation "Ah!"
 - 4 focused on actions by or directly related to Emma

### How is LSA applied?

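LSA takes those reduced vectors and compares them by the angle between them, using cosine similarity. For vectors with no negative entries (such as raw tf-idf vectors) the cosine runs from 0 (nothing in common) to 1 (same direction); after SVD the components can be negative, so the full range is -1 to 1. Paragraphs whose vectors point in nearly the same direction are treated as similar.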
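
Cosine similarity itself is only a few lines of arithmetic. The sketch below uses plain Python and three made-up paragraph vectors (`para_a`, `para_b`, `para_c` are hypothetical, standing in for rows of an LSA-reduced matrix).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical paragraph vectors.
para_a = [0.9, 0.1, 0.0]
para_b = [0.8, 0.2, 0.0]
para_c = [0.0, 0.1, 0.9]

print(cosine_similarity(para_a, para_b))  # close to 1: similar direction
print(cosine_similarity(para_a, para_c))  # close to 0: nearly orthogonal
```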

Comparing all of the vectors based on _every_ feature would be tough, and so dimensionality reduction through SVD is absolutely critical for LSA application. SVD is used here and not PCA because PCA first centers each feature on its mean, which turns the many zeros of a sparse term matrix into non-zero values. Truncated SVD skips the centering step, so it can work on the sparse matrix directly - a good thing for NLP.
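
The effect of centering on sparsity is easy to see on a toy matrix. This is a sketch with NumPy on made-up numbers, not the scikit-learn implementation: it counts non-zero entries before and after PCA-style centering, then reduces the uncentered matrix with a plain SVD.

```python
import numpy as np

# Toy document-term matrix: mostly zeros, like real tf-idf output.
X = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
    [1.0, 0.0, 0.0, 0.0],
])

# PCA-style preprocessing: subtract each column's mean.
X_centered = X - X.mean(axis=0)

print("non-zeros before centering:", np.count_nonzero(X))
print("non-zeros after centering: ", np.count_nonzero(X_centered))

# SVD without centering: keep the top k singular directions.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]  # each row is a document in k dimensions
print(X_reduced.shape)
```

On a real corpus with thousands of features, that jump in non-zero entries is what makes centering impractical for sparse text data.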

# Lesson 4 - Word2vec - Unsupervised Neural Network


Word2vec is similar to LSA. The difference is: LSA creates vector representations of _sentences based on words in them_ and Word2vec creates vector representations of _individual words based on words around them._


### How does Word2vec do this?
Two methods:
1. __Continuous Bag of Words:__ The identity of the word is predicted using the words near it in a sentence.
2. __Skip-gram:__ The identities of the words are predicted from the word that they surround (essentially the inverse of CBOW).
    - Skip-gram seems to work better for larger corpuses. Note to self.
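
The two methods differ only in which side of the (word, neighbors) relationship is treated as the input. The sketch below builds the training examples each method would see for one sentence; `context_windows` is a hypothetical helper and the window size of 2 is an arbitrary choice.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

sentence = ["emma", "woodhouse", "was", "handsome", "and", "clever"]

# CBOW: the context words are the input, the target word is the label.
cbow_examples = [(context, target) for target, context in context_windows(sentence)]

# Skip-gram: the target word is the input, each context word is a label.
skipgram_examples = [
    (target, ctx)
    for target, context in context_windows(sentence)
    for ctx in context
]

print(cbow_examples[2])        # (['emma', 'woodhouse', 'handsome', 'and'], 'was')
print(skipgram_examples[:3])   # [('emma', 'woodhouse'), ('emma', 'was'), ('woodhouse', 'emma')]
```

Skip-gram produces one example per (word, neighbor) pair, so it generates more training examples from the same text than CBOW does.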
    
Word2vec is essentially a vector "clustering" algorithm. Word2vec uses one of two methods of clustering/penalization to arrive at its convergence:
1. __Negative Sampling:__ Every time a word is pulled toward some neighbors, other random vectors are pushed away.
2. __Hierarchical Softmax:__ Neighboring words are pulled or pushed from a subset of words that were determined from a decision tree of possibilities.
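
As a rough illustration of the "pull toward neighbors, push away from random words" idea behind negative sampling, here is a single gradient step written with NumPy. It is a sketch under simplifying assumptions, not gensim's implementation: the vectors are random, only the input word's vector is updated, and the learning rate of 0.5 is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, positive, negatives):
    """Low when center.positive is large and each center.negative is small."""
    pos_term = -np.log(sigmoid(positive @ center))
    neg_term = -np.sum(np.log(sigmoid(-(negatives @ center))))
    return pos_term + neg_term

rng = np.random.default_rng(0)
dim = 8
center = rng.normal(scale=0.1, size=dim)          # vector of the input word
positive = rng.normal(scale=0.1, size=dim)        # vector of a true neighbor
negatives = rng.normal(scale=0.1, size=(3, dim))  # vectors of 3 random words

loss_before = neg_sampling_loss(center, positive, negatives)

# Gradient of the loss with respect to the center vector.
grad = (
    (sigmoid(positive @ center) - 1.0) * positive
    + (sigmoid(negatives @ center)[:, None] * negatives).sum(axis=0)
)

# One gradient-descent step: moves toward the neighbor, away from the negatives.
center = center - 0.5 * grad

loss_after = neg_sampling_loss(center, positive, negatives)
print(loss_before, loss_after)
```

A full trainer would repeat this for every (word, neighbor) pair and would also update the neighbor and negative vectors.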


Word2vec generally works better with larger corpuses (like...several _billion_ words long). The proximity of words can only be learned reliably if the pattern shows up consistently across many sentences.

An example is below, via the Word2vec implementation __gensim__.


In [5]:
import scipy
import sklearn
import spacy

In [6]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets.
    text = re.sub(r"\[.*?\]", "", text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    return text[0:900000]


# Import all the Austen in the Project Gutenberg corpus.
austen = ""
for novel in ['persuasion','emma','sense']:
    work = gutenberg.raw('austen-' + novel + '.txt')
    austen = austen + work

# Clean the data.
austen_clean = text_cleaner(austen)

In [7]:
# Parse the data. This can take some time.
nlp = spacy.load('en')
austen_doc = nlp(austen_clean)

In [8]:
# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.

sentences = []

for sentence in austen_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)


print(sentences[20])
print('We have {} sentences and {} characters of cleaned text.'.format(len(sentences), len(austen_clean)))

['lady', 'russell', 'steady', 'age', 'character', 'extremely', 'provide', 'thought', 'second', 'marriage', 'need', 'apology', 'public', 'apt', 'unreasonably', 'discontent', 'woman', 'marry', 'sir', 'walter', 'continue', 'singleness', 'require', 'explanation']
We have 9298 sentences and 900000 characters of cleaned text.


In [9]:
# USING WORD2VEC (gensim)!!!!

import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length (renamed vector_size in gensim 4+).
    hs=1           # Use hierarchical softmax.
)

print('done!')

done!


In [16]:
# List of words in gensim model.
vocab = model.wv.vocab.keys()  # gensim 4+: model.wv.key_to_index.keys()

print("\n", model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is calculated using the cosine, so again 1 is total
# similarity and 0 is no similarity.
print("\nModel similarity between mr and mrs is", model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
print("\nThe highest dissimilarity between the words breakfast, marriage, dinner, and lunch is the word", 
      model.wv.doesnt_match("breakfast marriage dinner lunch".split()))


 [('harville', 0.9532824158668518), ('musgrove', 0.9524492025375366), ('goddard', 0.9467301368713379), ('wentworth', 0.9353393316268921), ('benwick', 0.9299865961074829), ('clay', 0.9209820032119751), ('charles', 0.8827372789382935), ('croft', 0.8577557802200317), ('weston', 0.8440591096878052), ('smith', 0.8287604451179504)]

Model similarity between mr and mrs is 0.9210352


The highest dissimilarity between the words breakfast, marriage, dinner, and lunch is the word marriage


### Things You Can Do with Word2vec:

1. Find all unique words within your dataset via `list(model.wv.vocab.keys())`


2. Obtain a list of the words most similar to a given word, each with a cosine-similarity score showing how close it is to the queried word, via `model.wv.most_similar('word')` or, for several words at once, `model.wv.most_similar(positive=['word1', 'word2', 'word3'])`
    - You can find what word is most similar to an __arithmetic description of your word.__ For example, if you are looking for words similar to and including queen, you can say that `queen = (king - man) + woman`. To query that, you would do `model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=number_of_results)`
    - You can also find the word that doesn't match the others. See above for an example of `model.wv.doesnt_match(words)`
    
    
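3. Obtain the vector for a given word (its length is whatever you designate in the hyperparameter 'size' when instantiating the word2vec model) via `model.wv['word']`. Word2vec only stores vectors for individual vocabulary words, so a vector for a sentence, paragraph, or article has to be built from its word vectors.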


4. Assessing a cluster of most similar words (as calculated by similarity of their vectors) via:

```
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = list(model.wv.vocab)  # gensim 4+: list(model.wv.index_to_key)
X = model.wv[words] #may need to slice if your unique word list is super long

#dimensionality reduce via PCA 
#(although IMHO LSA should be used because PCA requires a central tendency...which text data does not have...)

pca = PCA(n_components=2)
result = pca.fit_transform(X)

#create a scatter plot of the projection
plt.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    plt.annotate(word, xy=(result[i, 0], result[i, 1]))
plt.show()
```
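
Since Word2vec stores one vector per vocabulary word, one common way to get a vector for a whole sentence is to average the vectors of its words. The sketch below shows that idea with a made-up dictionary of 3-dimensional vectors (`toy_vectors` and `sentence_vector` are hypothetical names); with a trained model, the lookups would go through `model.wv` instead.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained model would supply these.
toy_vectors = {
    "lady":    np.array([0.9, 0.1, 0.0]),
    "russell": np.array([0.7, 0.3, 0.1]),
    "walks":   np.array([0.0, 0.2, 0.8]),
}

def sentence_vector(tokens, vectors):
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        raise ValueError("no token in the sentence has a vector")
    return np.mean(known, axis=0)

# "slowly" has no vector, so it is skipped.
vec = sentence_vector(["lady", "russell", "walks", "slowly"], toy_vectors)
print(vec.shape)
```

Averaging throws away word order, so it is a baseline rather than a full sentence model.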



# Taking a Break - Summary

After text cleaning, we use NLP algorithms (word2vec or bag-of-words) to tokenize and then vectorize. These vectors are what are used in a machine learning model (such as, for example, logistic regression).

### Tokenizing

Tokenizing is the process of breaking up the corpus into a designated size. A token can be sentences, phrases, individual words ("1-grams"), pairs of words ("bi-grams"), or even individual symbols - that is up to you. Within the tokenization part of NLP is identifying - and ignoring - __stopwords.__
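
Word-level n-grams can be built with a single list comprehension. This is a minimal sketch; `ngrams` is a hypothetical helper, and it assumes the text has already been split into words.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()

print(ngrams(tokens, 1))  # [('the',), ('quick',), ('brown',), ('fox',)]
print(ngrams(tokens, 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```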

__Example tokenization function:__

In [None]:
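import re

def word_extraction(sentence):
    ignore = ['a', "the", "is"]
    # Replace every non-word character with a space, then split on whitespace.
    words = re.sub(r"[^\w]", " ", sentence).split()
    # Lowercase before filtering so that "The" is ignored as well as "the".
    cleaned_text = [w.lower() for w in words if w.lower() not in ignore]
    return cleaned_text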

def tokenize(sentences):
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
        
    words = sorted(list(set(words)))
    return words

Instead of manually entering in the ignored stopwords, you can also use nltk's pre-defined set of stopwords:

In [None]:
import nltk
from nltk.corpus import stopwords
set(stopwords.words('english'))

### Vectorization

Bag-of-words vectorizes essentially by counting the words. Take the below example, where the words have already been tokenized and vectorized:

John likes to watch movies. Mary likes movies too. <br>
\[1, 2, 1, 1, 2, 1, 1, 0, 0, 0\]

John also likes to watch football games. <br>
\[1, 1, 1, 1, 0, 0, 0, 1, 1, 1\]

Each position in the vector corresponds to one word in the vocabulary (here: John, likes, to, watch, movies, Mary, too, also, football, games), and the number is how many times that word appears in _that_ document. If the word does not appear in the document, its entry is zero. In reality, the vocabulary would be built from the entire corpus. __The length of the vector always equals the size of the vocabulary.__
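
The two vectors above can be reproduced with `collections.Counter`. This sketch assumes the vocabulary is ordered by first appearance across the two sentences, which is the ordering that matches the numbers shown; `bow_vector` is a hypothetical helper.

```python
from collections import Counter

# Assumed vocabulary order: first appearance across both sentences.
vocab = ["John", "likes", "to", "watch", "movies",
         "Mary", "too", "also", "football", "games"]

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(text.replace(".", "").split())
    return [counts[word] for word in vocabulary]

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games."

print(bow_vector(doc1, vocab))  # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
print(bow_vector(doc2, vocab))  # [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```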

__Example Vectorization Function:__

In [None]:
import numpy as np

def generate_bow(allsentences):
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    for sentence in allsentences:
        words = word_extraction(sentence)
        # One slot per vocabulary word, all counts start at zero.
        bag_vector = np.zeros(len(vocab))
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1

        print("{0}\n{1}\n".format(sentence, bag_vector))

#### Limitations of Bag-of-Words Vectorization

- It does not care about meaning, context, word order - it just counts frequency.
- Depending on size of corpus, counting and vectorizing every token may take a LONG time.

Now that you understand how bag-of-words works, you can skip all of the handmade functions and just use scikit-learn.

In [None]:
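from sklearn.feature_extraction.text import CountVectorizer

# Example corpus (the two sentences from the vectorization section above).
allsentences = ["John likes to watch movies. Mary likes movies too.",
                "John also likes to watch football games."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(allsentences)

print(X.toarray())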

__The `.toarray()` method converts the sparse matrix returned by the vectorizer into a dense NumPy array, which is easy to print and inspect.__ Most scikit-learn models also accept the sparse matrix directly. No target variable is involved at this point: the vectorizer merely transforms tokens into vectors, so this step is feature extraction and fits equally well in front of supervised or unsupervised models.

### Bag-of-Words Count Vectorizer

In [None]:
# Another Example of the Bag-of-Words Count Vectorizer

from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

### Inverse Document Frequency (TF-IDF) Vectorizer

In [26]:
# How to use TfidfVectorizer
# This Vectorizer uses INVERSE DOCUMENT FREQUENCY WEIGHTINGS! MORE SOPHISTICATED THAN BAG OF WORDS!

from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# encode document
vector = vectorizer.transform([text[0]]) # only vectorizing the first sentence in the text

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
(1, 8)
[[0.36388646 0.27674503 0.27674503 0.36388646 0.36388646 0.36388646
  0.36388646 0.42983441]]
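
The `idf_` values and the encoded vector printed above are consistent with scikit-learn's smoothed formula, idf = ln((1 + N) / (1 + df)) + 1, followed by L2 normalization. The sketch below redoes that arithmetic by hand with only the `math` module; the document frequencies are read off the three example sentences.

```python
import math

n_docs = 3
# Number of example sentences that contain each word.
doc_freq = {"brown": 1, "dog": 2, "fox": 2, "jumped": 1,
            "lazy": 1, "over": 1, "quick": 1, "the": 3}

# Smoothed idf: ln((1 + N) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Term counts in the first sentence ("the" occurs twice, everything else once).
counts = {t: 1 for t in doc_freq}
counts["the"] = 2

# tf * idf, then divide by the vector's L2 norm.
raw = {t: counts[t] * idf[t] for t in doc_freq}
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: raw[t] / norm for t in sorted(raw)}

print({t: round(v, 8) for t, v in idf.items()})
print({t: round(v, 8) for t, v in tfidf.items()})
```

Note that "the" has the lowest idf (it appears in every sentence) but still ends up with the largest tf-idf score in the first sentence, because it occurs there twice.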


### A Hashing Vectorizer

In [27]:
# Using a HashingVectorizer
# hashing skips the vocabulary step of the previous two models
# hashing therefore saves on memory: each token is "hashed" (converted) straight to a column index
# but you are unable to map the columns back to words if you desired

from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = HashingVectorizer(n_features=20)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())


(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
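
The hashing trick can be sketched in plain Python. This is an illustration only: it uses `hashlib.md5` as a stand-in hash function, whereas scikit-learn uses a signed MurmurHash3 and then normalizes each row, which is why the output above contains negative and fractional values. `hashed_bow` is a hypothetical helper.

```python
import hashlib

def hashed_bow(text, n_features=20):
    """Map each token to a column by hashing it; no vocabulary is stored."""
    vector = [0] * n_features
    for token in text.lower().replace(".", "").split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # column chosen by the hash
        vector[index] += 1
    return vector

vec = hashed_bow("The quick brown fox jumped over the lazy dog.")
print(len(vec), sum(vec))
```

Two different words can land in the same column (a collision), which is the price paid for not storing a vocabulary.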


# Final Note - Additional Illustrative Resources

Use [this Kaggle article](https://www.kaggle.com/miguelniblock/predict-the-author-unsupervised-nlp-lsa-and-bow) for an explanation of how to do an NLP project. It is in-depth but very explanatory.

[This SKLearn documentation](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) is not as in-depth but nonetheless provides an example outline of what to do as well as an explanation for the steps within that outline.

### Another Summary

In [None]:
from sklearn.feature_extraction.text import CountVectorizer  # or TfidfVectorizer, HashingVectorizer

vec = CountVectorizer()
X_train_dtm = vec.fit_transform(X_train) # learns the vocabulary, then builds a sparse "Document Term Matrix"
X_test_dtm = vec.transform(X_test) # same as above but for testing data, without refitting
                                   # (main difference is that it will drop tokens that it hasn't seen before in the fit)

# cross-validating the vectorizer by itself can't be done - a vectorizer is feature extraction ONLY
# you have to build an sklearn Pipeline: vec first, then the predictive model, and cross-validate that
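
A pipeline of that kind might look like the sketch below. The toy sentences and labels are made up for illustration, and `CountVectorizer` with `LogisticRegression` is just one possible pairing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy data: four sentences per "author".
texts = [
    "the rabbit ran down the hole", "the queen shouted at the rabbit",
    "the cat grinned at alice", "alice followed the white rabbit",
    "the captain called on the family", "her family admired the captain",
    "the lady wrote a long letter", "a letter arrived for the lady",
]
labels = ["Carroll"] * 4 + ["Austen"] * 4

# The vectorizer is refit inside every fold, so no test vocabulary leaks in.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
scores = cross_val_score(pipeline, texts, labels, cv=2)
print(scores)
```

Passing raw text to `cross_val_score` and letting the pipeline vectorize it keeps the vocabulary of each held-out fold out of training.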


In [None]:
#machinelearningmastery.com
#toastmasters.com

In [None]:
# https://monkeylearn.com/blog/beginners-guide-text-vectorization/
# https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/