## Distance Measures
### Author: Kevin Okiah

#### 03/17/2019

### 1.	Evaluate text similarity of Amazon book search results by doing the following:

> a.	Do a book search on Amazon. Manually copy the full book title (including subtitle) of each of the top 24 books listed in the first two pages of search results. 

> b.	In Python, run one of the text-similarity measures covered in this course, e.g., cosine similarity. Compare each of the book titles, pairwise, to every other one. 

> c.	Which two titles are the most similar to each other? Which are the most dissimilar? Where do they rank, among the first 24 results?


In [1]:
import numpy as np
import pandas as pd
import selenium
from lxml import html
import urllib3
from bs4 import BeautifulSoup
import lxml
import urllib 
import nltk
import string
from urllib3 import request
from string import punctuation
from TextCleaningToolkit import *
import TextCleaningToolkit
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

http = urllib3.PoolManager()

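def get_url_Bs(url):
    # Parse an HTML string with BeautifulSoup; despite the name, `url` must be markup, not a URL.
    tree = BeautifulSoup(url, 'lxml')
    return tree

def get_url_Sel(url):
    # Parse an HTML string with lxml (no Selenium involved); `url` must be markup, not a URL.
    tree = html.document_fromstring(url)
    return tree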

In [2]:
with open('AmazonBooks.p', 'rb') as f:
     bookTitles = pickle.load(f)

len(bookTitles)

30

In [3]:
bookTitles

[u'Deep Learning with Python',
 u'Deep Learning (Adaptive Computation and Machine Learning series)',
 u'Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems',
 u'Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow, 2nd Edition',
 u'Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning',
 u'Deep Learning Cookbook: Practical Recipes to Get Started Quickly',
 u'Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn and Tensorflow: Step-by-Step Tutorial For Beginners.',
 u'Python Machine Learning: A Deep Dive Into Python Machine Learning and Deep Learning, Using Tensor Flow And Keras: From Beginner To Advance',
 u'Deep Learning with R',
 u"Deep Learning: A Practitioner's Approach",
 u'Python Deep Learning Projects: 9 projects demystifying neural network and deep learning models for building intelligent 

In [4]:
# Leveraging Sarkar's code
from normalization import normalize_corpus 
from utils import build_feature_matrix 
import numpy as np

In [5]:
# normalize the Amazon book titles (30 were loaded above) and extract TF-IDF features
norm_book_corpus = normalize_corpus(bookTitles, lemmatize=True) 
tfidf_vectorizer, tfidf_features = build_feature_matrix(norm_book_corpus,                                                         
                                                        feature_type='tfidf', 
                                                        ngram_range=(1, 1), 
                                                        min_df=0.0, 
                                                        max_df=1.0)
query_docs_tfidf = tfidf_vectorizer.transform(norm_book_corpus)

In [6]:
norm_book_corpus

[u'deep learning python',
 u'deep learning adaptive computation machine learning series',
 u'hands machine learning scikit learn tensorflow concept tool technique build intelligent system',
 u'python machine learning machine learning deep learning python scikit learn tensorflow 2nd edition',
 u'machine learning python cookbook practical solution preprocessing deep learning',
 u'deep learning cookbook practical recipe start quickly',
 u'python machine learning machine learning deep learning python scikit learn tensorflow step step tutorial beginner',
 u'python machine learning deep dive python machine learning deep learning use tensor flow kera beginner advance',
 u'deep learning r',
 u'deep learning practitioner approach',
 u'python deep learning project 9 project demystify neural network deep learning model build intelligent system',
 u'deep learning python illustrated guide beginner intermediate learn approach future kera tensorflow end',
 u'deep learning python natural language proc

In [7]:
tfidf_vectorizer

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.float64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.0,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
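
The settings printed above (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) determine how each title becomes a vector. The following is a minimal sketch of that weighting in plain Python, not the code inside `build_feature_matrix`. The function name `tfidf_vectors` and the sample titles are hypothetical, and the whitespace tokenizer is an assumption: the `token_pattern` shown above only keeps tokens of two or more word characters, so the real vectorizer would drop a single-letter token such as `r`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document.

    Uses the smoothed IDF formula idf = ln((1 + n) / (1 + df)) + 1.
    Tokenization is a plain whitespace split (an assumption for this sketch).
    """
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0
           for term in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[term] * idf[term] for term in vocabulary]
        # Scale each row to unit length so dot products are cosine similarities.
        norm = math.sqrt(sum(weight * weight for weight in row))
        rows.append([weight / norm if norm else 0.0 for weight in row])
    return vocabulary, rows


# Hypothetical sample input, shaped like the normalized titles above.
sample_titles = ['deep learning python', 'deep learning r', 'neural network']
vocabulary, rows = tfidf_vectors(sample_titles)
```

Terms that occur in fewer titles receive a larger weight, so in the first sample title `python` outweighs `deep` and `learning`.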

In [8]:
def compute_cosine_similarity(doc_features, corpus_features,
                              top_n=3):
    # get document vectors as dense arrays
    doc_features = doc_features.toarray()[0]
    corpus_features = corpus_features.toarray()
    # compute similarities: the TF-IDF rows are L2-normalized (norm='l2'),
    # so the dot product with each corpus row is the cosine similarity
    similarity = np.dot(doc_features,
                        corpus_features.T)
    # get docs with highest similarity scores
    # (if the query document is part of the corpus, it matches itself with score 1.0)
    top_docs = similarity.argsort()[::-1][:top_n]
    top_docs_with_score = [(index, round(similarity[index], 3))
                           for index in top_docs]
    # get docs with lowest similarity scores
    bottom_docs = similarity.argsort()[:top_n]
    bottom_docs_with_score = [(index, round(similarity[index], 3))
                              for index in bottom_docs]
    return top_docs_with_score, bottom_docs_with_score
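
Question 1c asks for the most and least similar pair of titles, which means comparing each title with every other title but not with itself. The sketch below shows one way to do that under stated assumptions: the input rows are already L2-normalized (as the TF-IDF rows are), so a dot product is the cosine similarity. The function name `pairwise_extremes` and the toy vectors are hypothetical and are not taken from the notebook's data.

```python
from itertools import combinations


def pairwise_extremes(vectors):
    """Return the most and least similar pair as (score, i, j) tuples with i < j.

    Assumes every row has unit length, so the dot product is the cosine similarity.
    Iterating over combinations skips the trivial self-match (score 1.0).
    """
    scored = []
    for i, j in combinations(range(len(vectors)), 2):
        score = sum(a * b for a, b in zip(vectors[i], vectors[j]))
        scored.append((round(score, 3), i, j))
    # Ties are broken by the smaller (i, j) for the minimum and the larger for the maximum.
    return max(scored), min(scored)


# Toy unit vectors standing in for TF-IDF rows (hypothetical data).
toy_rows = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
    [0.8, 0.6, 0.0],
]
most_similar, least_similar = pairwise_extremes(toy_rows)
```

Here `most_similar` identifies rows 1 and 3 and `least_similar` identifies a pair with no shared dimension. With real TF-IDF rows, the indices map back to positions in the title list.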

In [9]:
print 'Document Similarity Analysis using Cosine Similarity'     
print '='*100     
for index, doc in enumerate(norm_book_corpus):
    try:
        doc_tfidf = query_docs_tfidf[index]
        top_similar_docs, bottom_similar_docs = compute_cosine_similarity(doc_tfidf,
                                                     tfidf_features,
                                                     top_n=1)
        print 'Document',index+1,':',norm_book_corpus[index]
        print '='*100
        print 'Most similar doc:'
        print '-'*18
        # NOTE: each title is also a row of tfidf_features, so with top_n=1 the top hit
        # is the title itself (score 1.0); the line below then prints the next title in the list.
        for doc_index, sim_score in top_similar_docs:
            print 'Doc num: {} Similarity Score: {}\nDoc: {}'.format(doc_index+2, sim_score, norm_book_corpus[doc_index+1])

        print '-'*18
        print 'Most dissimilar doc:'
        print '-'*18
        # doc_index is printed as-is here, i.e. as a 0-based position in the title list
        for doc_index, sim_score in bottom_similar_docs:
            print 'Doc num: {} Similarity Score: {}\nDoc: {}'.format(doc_index, sim_score, norm_book_corpus[doc_index])
        print '='*100
    except:
        # e.g. doc_index+1 runs past the end of the list for the last title
        print('Query Failed...')

Document Similarity Analysis using Cosine Similarity
Document 1 : deep learning python
Most similar doc:
------------------
Doc num: 2 Similarity Score: 1.0
Doc: deep learning adaptive computation machine learning series
------------------
Most dissimilar doc:
------------------
Doc num: 25 Similarity Score: 0.0
Doc: neural network
Document 2 : deep learning adaptive computation machine learning series
Most similar doc:
------------------
Doc num: 3 Similarity Score: 1.0
Doc: hands machine learning scikit learn tensorflow concept tool technique build intelligent system
------------------
Most dissimilar doc:
------------------
Doc num: 25 Similarity Score: 0.0
Doc: neural network
Document 3 : hands machine learning scikit learn tensorflow concept tool technique build intelligent system
Most similar doc:
------------------
Doc num: 4 Similarity Score: 1.0
Doc: python machine learning machine learning deep learning python scikit learn tensorflow 2nd edition
------------------
Most dissim

### 2.	Now evaluate using a major search engine.

>a.	Enter one of the book titles from question 1a into Google, Bing, or Yahoo!. Copy the capsule of the first organic result and the 20th organic result. Take web results only (i.e., not video results), and skip sponsored results. 

>b.	Run the same text similarity calculation that you used for question 1b on each of these capsules in comparison to the original query (book title). 

>c.	Which one has the highest similarity measure? 


In [10]:
#Google Search Results
book_title = ['Deep Learning with Python by Francois Chollet']
Capsule1 = ["Deep Learning with Python: Francois Chollet: 9781617294433 ...\
            https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438\
            Deep Learning with Python [Francois Chollet] on Amazon.com. *FREE* shipping \
            on qualifying offers. Summary Deep Learning with Python introduces the field ..."]
Capsule20 = ["Deep Learning with Python : Francois Chollet : 9781617294433\
            https://www.bookdepository.com/Deep-Learning-with-Python-Francois-Chollet/9781...\
            Dec 22, 2017 - Deep Learning with Python by Francois Chollet, 9781617294433,\
            available at Book Depository with free delivery worldwide."
           ]
Capsule51 = ["Deep Learning with Python by Francois Chollet (9781617294433)\
              https://www.allbookstores.com/Deep-Learning-Python-Francois-Chollet/9781617294...\
             Deep Learning with Python by Francois Chollet. Click here for the lowest price! \
             Paperback, 9781617294433, 1617294438."]

merged = book_title+Capsule1+Capsule20+Capsule51

In [11]:
merged

['Deep Learning with Python by Francois Chollet',
 'Deep Learning with Python: Francois Chollet: 9781617294433 ...            https://www.amazon.com/Deep-Learning-Python-Francois-Chollet/dp/1617294438            Deep Learning with Python [Francois Chollet] on Amazon.com. *FREE* shipping             on qualifying offers. Summary Deep Learning with Python introduces the field ...',
 'Deep Learning with Python : Francois Chollet : 9781617294433            https://www.bookdepository.com/Deep-Learning-with-Python-Francois-Chollet/9781...            Dec 22, 2017 - Deep Learning with Python by Francois Chollet, 9781617294433,            available at Book Depository with free delivery worldwide.',
 'Deep Learning with Python by Francois Chollet (9781617294433)              https://www.allbookstores.com/Deep-Learning-Python-Francois-Chollet/9781617294...             Deep Learning with Python by Francois Chollet. Click here for the lowest price!              Paperback, 9781617294433, 1617294438.

In [12]:
# normalize and extract TF-IDF features from the query title and the three search-result capsules
norm_book_corpus = normalize_corpus(merged, lemmatize=True) 
title_book_corpus = normalize_corpus(book_title, lemmatize=True) 
tfidf_vectorizer, tfidf_features = build_feature_matrix(norm_book_corpus,                                                         
                                                        feature_type='tfidf', 
                                                        ngram_range=(1, 1), 
                                                        min_df=0.0, 
                                                        max_df=1.0)
query_docs_tfidf = tfidf_vectorizer.transform(norm_book_corpus)

In [13]:
print 'Document Similarity Analysis using Cosine Similarity'     
print '='*100     
index = 0  # row 0 of the merged corpus is the query title
doc_tfidf = query_docs_tfidf[index]
top_similar_docs, bottom_similar_docs = compute_cosine_similarity(doc_tfidf,
                                             tfidf_features,
                                             top_n=1)
print 'Search Title',index+1,':',norm_book_corpus[index]
print '='*100
print 'Most similar doc:'
print '-'*18
# NOTE: the query is row 0 of tfidf_features, so the top hit is the query itself
# (score 1.0); doc_index+1 below then selects the first capsule.
for doc_index, sim_score in top_similar_docs:
    print 'Search Result: {} Similarity Score: {}\nDoc: {}'.format(doc_index+1, sim_score, norm_book_corpus[doc_index+1])

print '-'*18
print 'Most dissimilar doc:'
print '-'*18
# doc_index+18 maps row 2 of the merged corpus (Capsule20) to its search rank of 20
for doc_index, sim_score in bottom_similar_docs:
    print 'Search Result: {} Similarity Score: {}\nDoc: {}'.format(doc_index+18, sim_score, norm_book_corpus[doc_index])
print '='*100

Document Similarity Analysis using Cosine Similarity
Search Title 1 : deep learning python francois chollet
Most similar doc:
------------------
Search Result: 1 Similarity Score: 1.0
Doc: deep learning python francois chollet 9781617294433 https www amazon com deep learning python francois chollet dp 1617294438 deep learning python francois chollet amazon com free ship qualify offer summary deep learning python introduce field
------------------
Most dissimilar doc:
------------------
Search Result: 20 Similarity Score: 0.69
Doc: deep learning python francois chollet 9781617294433 https www bookdepository com deep learning python francois chollet 9781 dec 22 2017 deep learning python francois chollet 9781617294433 available book depository free delivery worldwide
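
When a query is compared against search-result capsules, it helps to keep the query out of the candidate set so that it cannot match itself. The following is a minimal sketch of that idea, assuming whitespace tokenization and raw term counts instead of the TF-IDF weights used in this notebook (a deliberate simplification). The names `cosine` and `rank_candidates` and the sample strings are hypothetical.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query, candidates):
    """Return (score, label) pairs sorted from most to least similar to the query.

    The query is scored against the candidates only, never against itself.
    """
    query_tokens = query.split()
    scored = [(cosine(query_tokens, text.split()), label)
              for label, text in candidates]
    return sorted(scored, reverse=True)


# Hypothetical, already-normalized strings (not the capsules captured above).
query = 'deep learning python'
candidates = [
    ('result A', 'deep learning python book'),
    ('result B', 'python cookbook recipe'),
    ('result C', 'gardening guide'),
]
ranking = rank_candidates(query, candidates)
```

The first entry of `ranking` is the candidate closest to the query, and the last entry is the most dissimilar one.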


Search Result 1 has the highest cosine similarity measure.

**Summary:**

Search results near the top of the list on both Amazon and Google have a high cosine similarity to the actual search query.
This suggests that both Amazon and Google use some similarity measure when ranking their results.
 