# Building NLP Tools for SEO

##### Nikhil Almeida
##### Sr. Manager Data Science
##### Consumer Track

# Agenda


# SEO

### What is SEO?
SEO stands for “search engine optimization.” It is the process of getting traffic from the “free,” “organic,” “editorial” or “natural” search results on search engines.


### SEO Success Factors

|On Page|Off Page   |
|---|---|
|Quality   |Authority   |
|Content Research|Engage   |
|Words   |Link Quality   |
|Fresh|Link Text|
|Answers|Social Reputation|
|Thin|Number of Ads|
|Relevant Titles|Site History|



<img src="./2017-SEO_Periodic_Table.png" />
https://searchengineland.com/seotable

## SEO Data & Sources

* **Keywords**
* **Search Volume**
* **Keyword Potential** 
* **Keyword Opportunity**
* **SERP Ranks**
* **Topic Authority**



# Tools Built
* Related Content Module
* Content De-Duplication
* Question Detection
* Title / Brief Creation
* Topic Authority
* Topic Detection
* Keyword Classification into Topics
* Keyword Potential
* Text Summarization

# TextBlob

TextBlob aims to provide access to common text-processing operations through a familiar interface. You can treat TextBlob objects as if they were Python strings that learned how to do Natural Language Processing.


In [93]:
from textblob import TextBlob
blob = TextBlob("A mad boxer sent a quick, gloved jab to the jaw of his dizzy opponent. The quick brown fox jumps over the lazy dog. The five boxing wizards jump quickly.")
blob.sentences


[Sentence("A mad boxer sent a quick, gloved jab to the jaw of his dizzy opponent."),
 Sentence("The quick brown fox jumps over the lazy dog."),
 Sentence("The five boxing wizards jump quickly.")]

In [95]:
blob.sentences[0].words

WordList(['A', 'mad', 'boxer', 'sent', 'a', 'quick', 'gloved', 'jab', 'to', 'the', 'jaw', 'of', 'his', 'dizzy', 'opponent'])

### Part-of-Speech Tagging

In [105]:
pos_mapper = {"CC": "Coordinating conjunction",
"CD": "Cardinal number",
"DT": "Determiner",
"EX": "Existential there",
"FW": "Foreign word",
"IN": "Preposition or subordinating conjunction",
"JJ": "Adjective",
"JJR": "Adjective, comparative",
"JJS": "Adjective, superlative",
"LS": "List item marker",
"MD": "Modal",
"NN": "Noun, singular or mass",
"NNS": "Noun, plural",
"NNP": "Proper noun, singular",
"NNPS": "Proper noun, plural",
"PDT": "Predeterminer",
"POS": "Possessive ending",
"PRP": "Personal pronoun",
"PRP$": "Possessive pronoun",
"RB": "Adverb",
"RBR": "Adverb, comparative",
"RBS": "Adverb, superlative",
"RP": "Particle",
"SYM": "Symbol",
"TO": "to",
"UH": "Interjection",
"VB": "Verb, base form",
"VBD": "Verb, past tense",
"VBG": "Verb, gerund or present participle",
"VBN": "Verb, past participle",
"VBP": "Verb, non-3rd person singular present",
"VBZ": "Verb, 3rd person singular present",
"WDT": "Wh-determiner",
"WP": "Wh-pronoun",
"WP$": "Possessive wh-pronoun",
"WRB": "Wh-adverb"}


##### References
* https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
* http://language.worldofcomputing.net/pos-tagging/parts-of-speech-tagging.html

In [106]:
sentence = blob.sentences[1]
print(sentence, "\n\n", "Tags: ", sentence.tags, '\n')
for tag in sentence.tags:
    print("{}---{}---{}".format(tag[0], tag[1], pos_mapper[tag[1]]))



The quick brown fox jumps over the lazy dog. 

 Tags:  [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')] 

The---DT---Determiner
quick---JJ---Adjective
brown---NN---Noun, singular or mass
fox---NN---Noun, singular or mass
jumps---VBZ---Verb, 3rd person singular present
over---IN---Preposition or subordinating conjunction
the---DT---Determiner
lazy---JJ---Adjective
dog---NN---Noun, singular or mass


### Noun Phrases
A noun phrase includes a noun (a person, place, or thing) and the modifiers that distinguish it.


In [102]:

blob.sentences[0].noun_phrases

WordList(['mad boxer', 'dizzy opponent'])

### Applications of POS Tagging

* Word Sense Disambiguation (Easy Search Improvements)
* Named Entity Resolution
* Sentiment Analysis
* Question Answering

#### Question Detection

In [111]:
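import json
import os

import spacy
from IPython.display import display, Markdown, Latex
from nltk import Tree

# `from spacy.en import English` only exists in spaCy 1.x. On spaCy 2+ a
# pretrained pipeline is loaded instead; this assumes the en_core_web_sm
# model is installed, and its parses may differ from the outputs shown here.
nlp = spacy.load("en_core_web_sm")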


In [112]:
# Tree Visualizer
def tok_format(tok):
    return "_".join([tok.orth_, tok.tag_, str(tok.dep_)])


def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)

def print_tree(sent):
    doc = nlp(sent)
    print(sent)
    for parsed_sentence in doc.sents:
        to_nltk_tree(parsed_sentence.root).pretty_print()

In [116]:
print_tree("How was the moon formed?")


How was the moon formed?
               formed_VBN_ROOT                        
       _______________|________________________        
      |               |            |     moon_NN_nsubj
      |               |            |           |       
How_WRB_advmod was_VBD_auxpass ?_._punct   the_DT_det 



In [117]:
print_tree("Is California the best state in the union?")

Is California the best state in the union?
                     Is_VBZ_ROOT                               
     _____________________|_____________                        
    |                            California_NNP_n              
    |                                  subj                    
    |                                   |                       
    |                             state_NN_appos               
    |          _________________________|_______________        
    |         |                         |           in_IN_prep 
    |         |                         |               |       
    |         |                         |         union_NN_pobj
    |         |                         |               |       
?_._punct the_DT_det              best_JJS_amod     the_DT_det 



<img src="./spacy dependency visualizer.png" />

In [123]:
def is_question(_sentence):
    # Only the first sentence of the input is examined.
    sentence = list(nlp(_sentence).sents)[0]
    # Skip long sentences (more than 15 tokens).
    if len(sentence) > 15:
        return False
    root = sentence.root
    VERBS = ['VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'MD', 'TO', 'VB']
    WH_QUESTIONS = ['WP', 'WP$', 'WRB', 'WDT']

    # Root has nothing to its left: a question if the root itself is a verb.
    if root.n_lefts == 0:
        return root.tag_ in VERBS

    for child in root.lefts:
        # Skip leading wh-words (who, what, how, ...).
        if child.tag_ in WH_QUESTIONS:
            continue
        # The first non-wh child decides: question only if it is an auxiliary verb.
        return child.tag_ in VERBS and child.dep_ == 'aux'
    # Every left child of the root is a wh-word.
    return True


In [126]:
is_question("Is California the best state in the union")

True

In [127]:
is_question("Did the brown fox jump over the lazy dog.")

True

In [128]:
is_question("Will Anthony go to Tokyo")

True

### Sentiment Analysis

In [133]:
print(TextBlob("I think california is great").sentiment)

Sentiment(polarity=0.8, subjectivity=0.75)


In [135]:
print(TextBlob("california is not great").sentiment)

Sentiment(polarity=-0.4, subjectivity=0.75)


In [143]:
print(TextBlob("california has a population of 39.4M people").sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)


### Named Entity Detection

### Word Distances


In [153]:
import Levenshtein


##### References
* https://rawgit.com/ztane/python-Levenshtein/master/docs/Levenshtein.html

####  Levenshtein Distance
The Levenshtein distance is the minimum number of single-character deletions, insertions, or substitutions required to transform string A into string B.

In [149]:
A = "SoCal Python"
B = "So Cal Python"
Levenshtein.distance(A, B)

1
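
To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming approach to edit distance. The function name `levenshtein_distance` is made up for illustration and is not part of the `Levenshtein` package; for real workloads the library call above is the faster choice.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn `a` into `b`."""
    # previous[j] is the distance between the prefix of `a` processed so far
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("SoCal Python", "So Cal Python"))  # 1
print(levenshtein_distance("kitten", "sitting"))              # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of `b` rather than with the product of both lengths.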

#### Hamming Distance
The Hamming distance is the number of positions at which two strings differ. It is defined only for strings of equal length.

In [152]:
A = "Southern California Python"
B = "Northern California Python"
Levenshtein.hamming(A, B)

2
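
The same count can be written in a few lines of plain Python. This is only an illustrative sketch: `hamming_distance` is a hypothetical helper name, and raising `ValueError` on unequal lengths is a choice made here, not a statement about how the library behaves.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    # Compare the strings position by position and count the mismatches.
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


print(hamming_distance("Southern California Python",
                       "Northern California Python"))  # 2
```

Here the two mismatches are `S`/`N` and `u`/`r` in the first word.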

#### Jaro
The Jaro string similarity metric is intended for short strings like personal last names. It is 0 for completely different strings and 1 for identical strings.
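
As a rough illustration of how such a score can be computed, the sketch below follows the commonly cited textbook formulation of Jaro similarity: count characters that match within a sliding window, count how many of those matches are out of order, and average three ratios. The name `jaro_similarity` is hypothetical, and details such as the window size are assumptions that may not match the `Levenshtein` package exactly.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity in [0, 1], using the common textbook formulation."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only if they are at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, char_a in enumerate(a):
        start = max(0, i - window)
        end = min(len(b), i + window + 1)
        for j in range(start, end):
            if not b_matched[j] and b[j] == char_a:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as half a transposition each.
    a_seq = [c for c, matched in zip(a, a_matched) if matched]
    b_seq = [c for c, matched in zip(b, b_matched) if matched]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 3))
print(round(jaro_similarity("DIXON", "DICKSONX"), 3))
```

For `MARTHA` and `MARHTA` all six characters match and one pair (`T`, `H`) is swapped, which gives (1 + 1 + 5/6) / 3.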

#### Jaro-Winkler
The Jaro-Winkler string similarity metric is a modification of the Jaro metric that gives more weight to a common prefix, as spelling mistakes are more likely to occur near the ends of words.

The prefix weight is the inverse of the common prefix length sufficient to consider the strings *identical*. If no prefix weight is specified, 1/10 is used.
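
The prefix adjustment itself is a one-line formula: `jaro + prefix_length * prefix_weight * (1 - jaro)`. The sketch below applies it to a Jaro score obtained by any means (for example `Levenshtein.jaro`). The helper name `winkler_adjust` is hypothetical, and the cap of four prefix characters is an assumption taken from the commonly cited formulation; the library's own handling of long prefixes may differ.

```python
def winkler_adjust(jaro_score: float, a: str, b: str,
                   prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Boost a Jaro score according to the length of the shared prefix."""
    prefix_length = 0
    # Count leading characters the two strings share, up to max_prefix (assumed cap).
    for char_a, char_b in zip(a[:max_prefix], b[:max_prefix]):
        if char_a != char_b:
            break
        prefix_length += 1
    # Move the score toward 1.0 in proportion to the shared prefix.
    return jaro_score + prefix_length * prefix_weight * (1.0 - jaro_score)


# The Jaro similarity of "MARTHA" and "MARHTA" is 17/18 (about 0.944);
# the shared prefix "MAR" lifts it to about 0.961.
print(round(winkler_adjust(17 / 18, "MARTHA", "MARHTA"), 3))  # 0.961
```

With the default weight of 1/10 and a four-character cap, the adjustment can close at most 40% of the gap between the Jaro score and 1.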

# Word Vectors

### What are word vectors?

### Types of word vectors
1. One hot encoding
2. Count Vectorizer
3. TF-IDF
4. Word2vec & GloVe vectors
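
The first three representations in the list can be sketched in plain Python on a toy corpus. Everything here is illustrative: the corpus, the helper names, and the unsmoothed `log(N / df)` form of IDF are assumptions (libraries such as scikit-learn use smoothed variants). Word2vec and GloVe are learned from large corpora and are not covered by this sketch.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration), tokenized by whitespace.
documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for doc in tokenized for word in doc})
index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Vector with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector


def count_vector(tokens):
    """How many times each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def tf_idf_vector(tokens):
    """Term frequency scaled by unsmoothed inverse document frequency."""
    counts = Counter(tokens)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)
        df = sum(1 for doc in tokenized if word in doc)
        idf = math.log(len(tokenized) / df)
        vector.append(tf * idf)
    return vector


print(vocabulary)
print(one_hot("dog"))
print(count_vector(tokenized[1]))
print([round(x, 3) for x in tf_idf_vector(tokenized[1])])
```

Because "the" appears in every document, its IDF is `log(1) = 0`, so it contributes nothing to the TF-IDF vector, while the rarer word "lazy" gets the largest weight in the second document.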

### Visualizations of word vectors