# Problem Setting:

Effective marketing and advertising depend on knowing what potential customers need. People search the internet with a wide variety of query terms, and those queries reveal the questions they most want answered about a business. This experiment finds what people are asking on Google, with the aim of surfacing questions relevant to the client's business.
    
    
# Experimental Setting (Workflow):

1. The experiment takes a URL (the client's "about us" page) as input.
2. EmbedRank, an unsupervised algorithm, extracts important keyphrases.
3. IBM Watson Natural Language Understanding (NLU) extracts important keywords, entities and concepts.
4. The keyphrases from EmbedRank and IBM NLU are merged and ranked by cosine similarity with the content of the client's "about us" page. Pre-trained embeddings (word2vec, sent2vec) are used to embed words and documents in a 700-dimensional vector space.
5. For the top-ranked keyphrases and concepts, Google is queried for search suggestions to retrieve the questions people are asking.
6. The resulting questions are ranked by semantic similarity to the page content, as in step 4.
7. Output: a list of the top 10 questions that potential customers are asking Google.
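
The ranking used in steps 4 and 6 can be sketched in plain Python. This is a minimal illustration only: the 3-dimensional vectors below are invented toy values, whereas the notebook uses pre-trained embeddings, and `rank_by_similarity` is a hypothetical helper name, not part of any library used here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(doc_vec, phrase_vecs):
    """Return (phrase, score) pairs sorted by descending similarity to doc_vec."""
    scored = [(phrase, cosine(doc_vec, vec)) for phrase, vec in phrase_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration (not real embeddings).
doc_vec = [1.0, 1.0, 0.0]
phrase_vecs = {
    "home loans": [1.0, 0.8, 0.1],
    "better way": [0.0, 0.2, 1.0],
    "unknown phrase": [0.0, 0.0, 0.0],
}
ranked = rank_by_similarity(doc_vec, phrase_vecs)
```

Phrases whose vectors point in roughly the same direction as the document vector rank first, regardless of vector length.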
  
  
# APIs used along the way:

1. gspread - A Python API for reading and writing the Google Sheets stored in your Google Drive.
2. EmbedRank - Uses Stanford's part-of-speech tagger, sent2vec and NLTK to extract the most important keyphrases with the Maximal Marginal Relevance (MMR) algorithm.
   More Info: https://github.com/swisscom/ai-research-keyphrase-extraction
3. Url2text - Retrieves the text content of a given URL.
4. IBM Watson Natural Language Understanding - An IBM Watson API that extracts important keywords, entities and concepts from a given URL or piece of text.
   More Info: https://github.com/watson-developer-cloud/natural-language-understanding-nodejs
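
To make the Maximal Marginal Relevance idea concrete, here is a small sketch of the generic greedy MMR criterion: at each step it picks the candidate that is most similar to the document while being least similar to the phrases already chosen. It is not EmbedRank's actual implementation (which may normalise similarities differently); the function name `mmr`, the parameter `lam` and the 2-dimensional vectors are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(doc_vec, candidates, top_n=2, lam=0.5):
    """Greedy MMR: trade off relevance to the document against redundancy."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(name):
            relevance = cosine(remaining[name], doc_vec)
            # Redundancy is the highest similarity to any phrase already picked.
            redundancy = max(
                (cosine(remaining[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy vectors, invented for illustration (not real embeddings).
doc_vec = [1.0, 1.0]
candidates = {
    "home loans": [1.0, 0.9],
    "home loan": [1.0, 0.85],           # near-duplicate of "home loans"
    "youth homelessness": [0.3, 1.0],   # less relevant but more diverse
}
picked = mmr(doc_vec, candidates, top_n=2, lam=0.5)
```

With `lam=0.5` the near-duplicate "home loan" is skipped in favour of the more diverse phrase; setting `lam=1.0` would reduce the selection to pure relevance ranking.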

## 1. Input

#### The data is stored in a Google Sheet on my Google Drive. This script reads the sheet and returns a list of (client name, "about us" URL) pairs, one per business.

In [1]:
from os import getcwd
from os.path import join, abspath
import gspread
import pandas as pd    # needed by sheets_get_dataframe below
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']

def sheets_client():
    credentials_file = abspath(join(getcwd(), 'credentials.json'))
    credentials = ServiceAccountCredentials.from_json_keyfile_name(credentials_file, scope)
    return gspread.authorize(credentials)

def sheets_get_all_websites():
    wks = sheets_client().open('Business websites').sheet1

    return [ 
        (row[0], row[2])      # Just client name and "about-us" urls needed.
        for row in wks.get_all_values()[1:]
    ]

def sheets_get_dataframe():
    wks = sheets_client().open('Business websites').sheet1

    return pd.DataFrame(wks.get_all_records())

In [2]:
client_data = sheets_get_all_websites()

In [3]:
client_data[0:24]

[('Red Zed', 'http://redzed.com/about-us/'),
 ('Next Wave', 'http://nextwave.org.au/about/'),
 ('Tidyme', 'https://www.tidyme.com.au/'),
 ('World Expeditions',
  'https://worldexpeditions.com/About-Us/Why-Travel-With-Us'),
 ('Agency Select', 'https://www.agentselect.com.au/about-us.php'),
 ('Duckfeet', 'https://www.duckfeet.com.au/pages/duckfeet-shoes-about-us'),
 ('All the kings men',
  'https://allthekingsmen.com.au/pages/all-the-kings-men-about'),
 ('Lifx ', 'https://www.lifx.com/pages/about'),
 ('Balloon Man', 'https://balloonman.com.au/balloonman/'),
 ('Oxfam', 'https://www.oxfam.org.au/what-we-do/about-us/'),
 ('Red Cross', 'https://www.redcross.org.au/prepare'),
 ('Xero', 'https://www.xero.com/au/about/'),
 ('McGrath Foundation', 'http://www.mcgrathfoundation.com.au/AboutUs.aspx'),
 ('Helfie', 'http://www.helfie.io/'),
 ('Circus Oz', 'https://www.circusoz.com/circus-oz/about-circus-oz.html'),
 ('Lorne Hotel', 'http://lornehotel.com.au/'),
 ('Deliciou', 'https://au.deliciou.com/p

## 2. EmbedRank API

#### EmbedRank is an unsupervised keyphrase extraction algorithm that takes a piece of text as input and returns a list of the most important keyphrases in it.

In [4]:
import rocky_config    # This is an in-house environment which enables jean packages and modules
from jean.nlp2.embedrank import embedrank
from jean.data.web import url2text
import pandas as pd

def get_embedrank_phrases(url):
    
    content = url2text(url)
    phrases = embedrank(content)
    
    data = [[i.text, i.score, i.aliases] for i in phrases]
    pd.set_option('display.max_colwidth', 0)
    df = pd.DataFrame.from_records(data, columns = ['Keyphrase', 'Score', 'Aliases'])
    
    return df

In [5]:
get_embedrank_phrases(client_data[0][1])

| | Keyphrase | Score | Aliases |
|---|---|---|---|
| 0 | redzed lending solutions pty | 1.0 | [redzed lending solutions, need redzed lending solutions] |
| 1 | redzed lending solutions’ financial support | 0.99452 | [] |
| 2 | home loans | 0.959947 | [loans] |
| 3 | loan experts | 0.950052 | [] |
| 4 | mortgage partner | 0.902969 | [] |
| 5 | lighthouse foundation | 0.863978 | [] |


## 3. IBM Watson Natural Language Understanding API

#### IBM Watson Natural Language Understanding is an API built on IBM's Watson technology that takes a URL as input and returns the most important keywords, concepts and entities identified in the page.

In [6]:
import json
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
import Features, ConceptsOptions, KeywordsOptions, EntitiesOptions

In [7]:
import os

# Credentials are read from environment variables instead of being hardcoded in the notebook.
# WATSON_NLU_USERNAME and WATSON_NLU_PASSWORD are placeholder variable names.
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2018-03-16',
    username=os.environ['WATSON_NLU_USERNAME'],
    password=os.environ['WATSON_NLU_PASSWORD'],
    url="https://gateway.watsonplatform.net/natural-language-understanding/api"
)

In [8]:
def get_ibm_keyphrases(url):
    # On failure, return an empty result so that callers indexing ['keywords'] still work.
    try:
        response_keyphrases = natural_language_understanding.analyze(
          url=url,
          features=Features(
            keywords=KeywordsOptions(
              sentiment=True,
              emotion=True,
              limit=10)))
    except Exception:
        response_keyphrases = {'keywords': []}

    return response_keyphrases


def get_ibm_concepts(url):
    # On failure, return an empty result so that callers indexing ['concepts'] still work.
    try:
        response_concepts = natural_language_understanding.analyze(
          url=url,
          features=Features(
            concepts=ConceptsOptions(
              limit=10)))
    except Exception:
        response_concepts = {'concepts': []}

    return response_concepts


def get_ibm_entities(url):
    # On failure, return an empty result so that callers indexing ['entities'] still work.
    try:
        response_entities = natural_language_understanding.analyze(
          url=url,
          features=Features(
            entities=EntitiesOptions(
              sentiment=True,
              emotion=True,
              limit=10)))
    except Exception:
        response_entities = {'entities': []}

    return response_entities

#### Keyphrases

In [9]:
#keyphrases
ibm_kp = get_ibm_keyphrases(client_data[0][1])['keywords']
data = [[i['text'], i['relevance']] for i in ibm_kp]
pd.DataFrame.from_records(data, columns = ['keyphrase', 'relevance'])

| | keyphrase | relevance |
|---|---|---|
| 0 | RedZed Lending Solutions | 0.988012 |
| 1 | common sense approach | 0.853351 |
| 2 | mortgage partner look | 0.836247 |
| 3 | better way | 0.725279 |
| 4 | credit decisions | 0.702969 |
| 5 | Lighthouse Foundation | 0.70281 |
| 6 | rightful place | 0.701346 |
| 7 | youth homelessness | 0.699385 |
| 8 | loan expert | 0.694931 |
| 9 | vulnerable children | 0.693298 |


#### Entities

In [10]:
ibm_en = get_ibm_entities(client_data[0][1])['entities']
data = [[i['text'], i['relevance'], i['type'], i['count']] for i in ibm_en]
pd.DataFrame.from_records(data, columns = ['entity', 'relevance', 'entity type', 'frequency'])

| | entity | relevance | entity type | frequency |
|---|---|---|---|---|
| 0 | RedZed Lending Solutions | 0.845405 | Company | 2 |
| 1 | Lighthouse Foundation | 0.460908 | Organization | 1 |
| 2 | Lighthouse | 0.410309 | Company | 1 |
| 3 | partner | 0.354174 | JobTitle | 1 |
| 4 | Australia | 0.334016 | Location | 1 |


#### Concepts

In [11]:
ibm_con = get_ibm_concepts(client_data[0][1])['concepts']
data = [[i['text'], i['relevance'], i['dbpedia_resource']] for i in ibm_con]
pd.DataFrame.from_records(data, columns = ['concept', 'relevance', 'dbpedia link'])

| | concept | relevance | dbpedia link |
|---|---|---|---|
| 0 | Debt | 0.919186 | http://dbpedia.org/resource/Debt |
| 1 | Credit | 0.795683 | http://dbpedia.org/resource/Credit |
| 2 | Interest | 0.724839 | http://dbpedia.org/resource/Interest |
| 3 | Predatory lending | 0.672455 | http://dbpedia.org/resource/Predatory_lending |
| 4 | Secured loan | 0.67214 | http://dbpedia.org/resource/Secured_loan |
| 5 | English-language films | 0.662739 | http://dbpedia.org/resource/English-language_films |
| 6 | Youth | 0.63804 | http://dbpedia.org/resource/Youth |
| 7 | Bond | 0.588756 | http://dbpedia.org/resource/Bond_(finance) |
| 8 | Loan | 0.559353 | http://dbpedia.org/resource/Loan |
| 9 | Credit | 0.558895 | http://dbpedia.org/resource/Credit_(finance) |


## 4. Rank the words according to cosine similarity with main content

#### 4a. Merge the phrases from EmbedRank and IBM NLU modules

In [12]:
def merged_embedrank_nlu(url):
    #important phrases as identified by embedrank
    er_phrases = get_embedrank_phrases(url)

    #important phrases as identified by IBM NLU
    ibm_kp = get_ibm_keyphrases(url)['keywords']
    data = [[i['text'], i['relevance']] for i in ibm_kp]
    ibm_phrases = pd.DataFrame.from_records(data, columns = ['keyphrase', 'relevance'])

    # Comparing both results side by side (NOTE: Score is NOT same as relevance)
    df = pd.concat([er_phrases[['Keyphrase', 'Score']], ibm_phrases[['keyphrase', 'relevance']]], axis = 1)

    combined_phrases = list(er_phrases['Keyphrase']) + list(ibm_phrases['keyphrase'])
    
    return ([df, combined_phrases])

In [13]:
[df_merged, combined_phrases] = merged_embedrank_nlu(client_data[0][1])

In [14]:
df_merged

| | Keyphrase | Score | keyphrase | relevance |
|---|---|---|---|---|
| 0 | redzed lending solutions pty | 1.0 | RedZed Lending Solutions | 0.988012 |
| 1 | redzed lending solutions’ financial support | 0.99452 | common sense approach | 0.853351 |
| 2 | home loans | 0.959947 | mortgage partner look | 0.836247 |
| 3 | loan experts | 0.950052 | better way | 0.725279 |
| 4 | mortgage partner | 0.902969 | credit decisions | 0.702969 |
| 5 | lighthouse foundation | 0.863978 | Lighthouse Foundation | 0.70281 |
| 6 | | | rightful place | 0.701346 |
| 7 | | | youth homelessness | 0.699385 |
| 8 | | | loan expert | 0.694931 |
| 9 | | | vulnerable children | 0.693298 |


In [15]:
combined_phrases

['redzed lending solutions pty',
 'redzed lending solutions’ financial support',
 'home loans',
 'loan experts',
 'mortgage partner',
 'lighthouse foundation',
 'RedZed Lending Solutions',
 'common sense approach',
 'mortgage partner look',
 'better way',
 'credit decisions',
 'Lighthouse Foundation',
 'rightful place',
 'youth homelessness',
 'loan expert',
 'vulnerable children']

#### 4b. Get rid of organization and company names (using IBM's entities) to retain possibly interesting words

In [16]:
data = [[i['text'], i['relevance'], i['type'], i['count']] for i in ibm_en]
ibm_entities = pd.DataFrame.from_records(data, columns = ['entity', 'relevance', 'entity type', 'frequency'])
ibm_entities_list = list(ibm_entities['entity'])

In [17]:
combined_phrases = [i.lower() for i in combined_phrases]
ibm_entities_list = [i.lower() for i in ibm_entities_list]

In [18]:
filtered_phrases = list(set(combined_phrases).difference(set(ibm_entities_list)))
filtered_phrases

['common sense approach',
 'redzed lending solutions’ financial support',
 'home loans',
 'better way',
 'youth homelessness',
 'loan expert',
 'rightful place',
 'vulnerable children',
 'redzed lending solutions pty',
 'mortgage partner look',
 'credit decisions',
 'loan experts',
 'mortgage partner']
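
The set difference above removes only phrases that match an entity exactly, which is why 'redzed lending solutions pty' is still in the list. A possible alternative, sketched below under stated assumptions, drops any phrase that contains an entity name and keeps the original order. `drop_entity_phrases` is a hypothetical helper, and the entity list is restricted by hand to the two multi-word organisation names from the entities table.

```python
def drop_entity_phrases(phrases, entities):
    """Drop phrases containing any entity name; keep order, remove duplicates."""
    entities = [e.lower().strip() for e in entities if e.strip()]
    seen = set()
    kept = []
    for phrase in phrases:
        key = phrase.lower().strip()
        if key in seen:
            continue
        seen.add(key)
        # Substring match, so "redzed lending solutions pty" is caught too.
        if any(entity in key for entity in entities):
            continue
        kept.append(key)
    return kept

phrases = ['redzed lending solutions pty', 'home loans', 'RedZed Lending Solutions',
           'mortgage partner', 'Lighthouse Foundation', 'loan expert']
# Only organisation names are passed in: a short entity such as 'partner'
# would also remove 'mortgage partner', which may not be wanted.
entities = ['RedZed Lending Solutions', 'Lighthouse Foundation']
filtered = drop_entity_phrases(phrases, entities)
```

Substring matching is more aggressive than the exact match used in the notebook, so filtering the entity list by type (for example Company and Organization only) before calling the helper is probably the safer choice.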

#### 4c. Rank this list of phrases by semantic closeness to the website's content

In [27]:
from jean.nlp2.embeddings import embed_document, embed_text
from sklearn.metrics.pairwise import cosine_similarity

content = url2text(client_data[0][1])
doc_emb = embed_document(content)

emb_phrases = embed_text(filtered_phrases)

sims = [float(cosine_similarity(doc_emb, i.reshape(1,-1))) for i in emb_phrases]

data = [[filtered_phrases[i], sims[i]] for i in range(len(sims))]
df_content = pd.DataFrame.from_records(data, columns = ['Keyphrase', 'Relevance']).sort_values(by = ['Relevance'], ascending = False)
df_content

| | Keyphrase | Relevance |
|---|---|---|
| 8 | redzed lending solutions pty | 0.506748 |
| 1 | redzed lending solutions’ financial support | 0.503098 |
| 2 | home loans | 0.485826 |
| 11 | loan experts | 0.478262 |
| 12 | mortgage partner | 0.454813 |
| 9 | mortgage partner look | 0.444824 |
| 5 | loan expert | 0.435256 |
| 10 | credit decisions | 0.407393 |
| 0 | common sense approach | 0.282452 |
| 7 | vulnerable children | 0.22827 |


#### 4d. Rank IBM's concepts by semantic closeness to the website's content

In [28]:
data = [[i['text'], i['relevance'], i['dbpedia_resource']] for i in ibm_con]
ibm_concepts = pd.DataFrame.from_records(data, columns = ['concept', 'relevance', 'dbpedia link'])
ibm_concepts_list = list(set(ibm_concepts['concept']))

emb_concepts = embed_text(ibm_concepts_list)

sims = [float(cosine_similarity(doc_emb, i.reshape(1,-1))) for i in emb_concepts]

data = [[ibm_concepts_list[i], sims[i]] for i in range(len(sims))]
df_con = pd.DataFrame.from_records(data, columns = ['Concept', 'Relevance']).sort_values(by = ['Relevance'], ascending = False)
df_con

| | Concept | Relevance |
|---|---|---|
| 8 | Secured loan | 0.434788 |
| 2 | Predatory lending | 0.325641 |
| 7 | English-language films | 0.138148 |
| 0 | Youth | 0.0 |
| 1 | Debt | 0.0 |
| 3 | Bond | 0.0 |
| 4 | Interest | 0.0 |
| 5 | Credit | 0.0 |
| 6 | Loan | 0.0 |


## 5. Google Search Volumes Code

In [21]:
import requests
import re
import html
from functional import pseq, seq
import time
import random
import numpy as np

In [22]:
urlBase = "https://clients1.google.com/complete/search"
resRegex = r"^window\.google\.ac\.h\((.*)\)$"    # raw string so the backslashes reach the regex engine unchanged
#userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36";
userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
reqId = 0

In [23]:
def suggest(keyword):
    opts={"client": "heirloom-hp", "hl": "en", "levels": 0, "cp": len(keyword)}
    return do_suggest(keyword, opts)


def do_suggest(keyword, opts):
    global reqId
    qs = {
        "client": opts["client"],
        "hl": opts["hl"],
        "gs_rn": 0,
        "gs_ri": opts["client"],
        "cp": opts["cp"],
        "gs_id": reqId,
        "q": keyword
    }
    reqId += 1
    
    # throttling delay
    delay = random.randint(100,2000)
    time.sleep(delay/1000)
    
    response = requests.get(urlBase, params=qs, headers={"User-Agent": userAgent})
    response.raise_for_status()
    result = re.search(resRegex, response.text)
    if result is None:
        # The response was not in the expected window.google.ac.h(...) form
        return []
    body = json.loads(result.group(1))
    return [strip_tags(s[0]) for s in body[1]]
    

def strip_tags(text):
    return re.sub('<.*?>', '', html.unescape(text))
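
The parsing done by `do_suggest` and `strip_tags` can be tried offline. The sketch below applies the same regex, JSON decoding and tag stripping to a hand-written sample string. The payload is an assumption about the response shape implied by the code above (a JSONP wrapper whose second element is a list of suggestion entries), not a captured response, and `parse_suggestions` is a hypothetical helper name.

```python
import html
import json
import re

RES_REGEX = r"^window\.google\.ac\.h\((.*)\)$"

def parse_suggestions(raw):
    """Extract plain-text suggestions from a JSONP-wrapped response body."""
    match = re.search(RES_REGEX, raw.strip())
    if match is None:
        return []
    body = json.loads(match.group(1))
    # body[1] holds the suggestion entries; the text is the first item of each.
    return [re.sub(r"<.*?>", "", html.unescape(item[0])) for item in body[1]]

# Hand-written sample, assumed to mirror the wrapper the notebook expects.
sample = ('window.google.ac.h(["how home loans ",'
          '[["how home loans <b>work</b>",0],'
          '["how home loans &amp; mortgages differ",0]],{}])')
suggestions = parse_suggestions(sample)
```

Returning an empty list for an unrecognised body keeps a single malformed response from aborting a long batch of queries.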
    

In [24]:
def get_questions(term):
    questions = [
        "why",
        "when",
        "are",
        "which",
        "where",
        "what",
        "who",
        "will",
        "can",
        "how",
        "does",
        "do"
    ]
    
    return seq(questions).map(lambda q: q + " " + term + " ") \
        .flat_map(lambda s: suggest(s)) \
        .group_by(lambda s: s.split(" ")[0]) \
        .filter(lambda q: q[0] in questions)


def get_all_questions(phrase):
    # Keep only suggestions that add at least one word beyond "<question word> <phrase>"
    return get_questions(phrase) \
        .flat_map(lambda q: q[1]) \
        .filter(lambda q: len(q.split(" ")) > len(phrase.split(" ")) + 1) \
        .list()
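
The filtering logic in `get_questions` and `get_all_questions` can be exercised without network access or PyFunctional. The sketch below is a plain-Python approximation: it takes the suggestion function as a parameter and uses a stub with canned answers. `expand_questions` and `fake_suggest` are hypothetical names, the canned suggestions are invented, and results come back in query order rather than grouped by first word as in the notebook.

```python
QUESTION_WORDS = ["why", "when", "are", "which", "where", "what",
                  "who", "will", "can", "how", "does", "do"]

def expand_questions(term, suggest_fn):
    """Query '<question word> <term> ' for each word and keep real questions."""
    term_len = len(term.split(" "))
    results = []
    for word in QUESTION_WORDS:
        for suggestion in suggest_fn(word + " " + term + " "):
            words = suggestion.split(" ")
            starts_with_question = words[0] in QUESTION_WORDS
            # Must be longer than "<question word> <term>" to carry new information.
            adds_detail = len(words) > term_len + 1
            if starts_with_question and adds_detail:
                results.append(suggestion)
    return results

def fake_suggest(prefix):
    """Stub standing in for the network call; the canned data is invented."""
    canned = {
        "how home loans ": ["how home loans work", "how home loans"],
        "do home loans ": ["do home loans cover renovations", "home loans calculator"],
    }
    return canned.get(prefix, [])

questions = expand_questions("home loans", fake_suggest)
```

Passing the suggestion function in as an argument also makes it straightforward to swap in a cached or rate-limited version later.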

## 6. Rank the questions according to cosine similarity with the URL's content

In [25]:
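def func(ranked_phrases, ranked_concepts):
    # Fetch questions for every phrase and concept, then rank them by cosine
    # similarity to doc_emb (the page embedding computed in step 4c).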
    questions_phrases = seq(ranked_phrases).flat_map(lambda t: get_all_questions(t)).list()
    questions_concepts = seq(ranked_concepts).flat_map(lambda t: get_all_questions(t)).list()

    ques_phrase_emb = embed_text(questions_phrases)
    ques_concept_emb = embed_text(questions_concepts)

    sims_phrase = [float(cosine_similarity(doc_emb, i.reshape(1,-1))) for i in ques_phrase_emb]
    sims_concepts = [float(cosine_similarity(doc_emb, i.reshape(1,-1))) for i in ques_concept_emb]

    data_phrases = [[questions_phrases[i], sims_phrase[i]] for i in range(len(questions_phrases))]
    data_concepts = [[questions_concepts[i], sims_concepts[i]] for i in range(len(questions_concepts))]

    df_phrases = pd.DataFrame.from_records(
        data_phrases, columns = ['question', 'relevance']).sort_values(by = ['relevance'], ascending = False)

    df_concepts = pd.DataFrame.from_records(
        data_concepts, columns = ['question', 'relevance']).sort_values(by = ['relevance'], ascending = False)

    return ([df_phrases, df_concepts])

In [29]:
[df1, df2] = func(list(df_content['Keyphrase']), list(df_con['Concept']))

## 7. Output

#### 7a. Top 10 questions using EmbedRank + IBM NLU keyphrases 

In [30]:
df1[:10]

| | question | relevance |
|---|---|---|
| 72 | how home equity loans work | 0.541461 |
| 88 | do home loans cover renovations | 0.51221 |
| 89 | do home loans cover down payment | 0.508308 |
| 59 | can home loans include renovation costs | 0.501253 |
| 77 | how do construction home loans work | 0.498682 |
| 68 | how home loans work | 0.494426 |
| 61 | can home loan be taken jointly | 0.489771 |
| 74 | how do va home loans work | 0.48909 |
| 3 | why home equity loan | 0.486487 |
| 7 | when home equity loan | 0.480856 |


#### 7b. Top 10 questions using IBM NLU concepts

In [31]:
df2[:10]

| | question | relevance |
|---|---|---|
| 747 | will loan affect mortgage application | 0.536005 |
| 700 | are loan establishment fees deductible | 0.51138 |
| 785 | do loan companies verify employment | 0.509889 |
| 704 | are loan payments deductible | 0.504111 |
| 65 | are student loans predatory lending | 0.502125 |
| 778 | does loan consolidation work | 0.500435 |
| 784 | do loan companies contact your employer | 0.487079 |
| 731 | what loan companies use transunion | 0.485238 |
| 319 | do debt consolidation companies work | 0.484866 |
| 38 | will secured loan help credit | 0.484381 |
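
The notebook reports the two top-10 lists separately. If a single combined list is wanted, one possible approach is to merge both rankings, keep the best score per question and take the highest-scoring entries. The sketch below assumes the two relevance columns are directly comparable (both are cosine similarities against the same page embedding); `top_questions` is a hypothetical helper, and the sample rows are the first three of each table above.

```python
import heapq

def top_questions(*ranked_lists, n=10):
    """Merge (question, score) lists, keep the best score per question, return top n."""
    best = {}
    for ranked in ranked_lists:
        for question, score in ranked:
            if score > best.get(question, float("-inf")):
                best[question] = score
    return heapq.nlargest(n, best.items(), key=lambda item: item[1])

phrase_questions = [
    ("how home equity loans work", 0.541461),
    ("do home loans cover renovations", 0.51221),
    ("do home loans cover down payment", 0.508308),
]
concept_questions = [
    ("will loan affect mortgage application", 0.536005),
    ("are loan establishment fees deductible", 0.51138),
    ("do loan companies verify employment", 0.509889),
]
top3 = top_questions(phrase_questions, concept_questions, n=3)
```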
