# Challenge: Semantic Search Algorithm  
Design and implement a semantic search algorithm that is able to score and rank a  set of keywords (trends) by how strongly associated they are to a given query term.  The algorithmic approach could borrow techniques from association rule mining to  analyze the co-occurrence of terms within a corpora of tweets and reddit posts, and should take into consideration the uniqueness of the trend and the recency of the  association. For example, the algorithm should be able to determine that the query  ‘iPhone’ is more strongly associated to trends like ‘MagSafe’, ‘5G’, and ‘pacific blue'  then it is to “Biden” or “perfume”.  

## Details:  
- The expected input to the method should be a query term, and the output should  be an ordered set of trends. The method should be implemented in Python (v3.7).  
- You can explore the dataset via the GCP BigQuery WebIDE and you can connect to  the database from python using the provided JSON key.  
- The sample twitter and reddit datasets can found in the tables `nwo-sample.graph.tweets` and `nwo-sample.graph.reddit` respectively   

In [15]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

from google.cloud import bigquery
from google.oauth2 import service_account

import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

from nltk.corpus import wordnet

In [16]:
# json access key
credentials = service_account.Credentials.from_service_account_file('nwo-sample-5f8915fdc5ec.json')

project_id = 'nwo-sample'
client = bigquery.Client(credentials= credentials,project=project_id)

In [17]:
# query the table
tweet_query = client.query("""
   SELECT `tweet` FROM nwo-sample.graph.tweets 
   LIMIT 10000""")
tweet_results = tweet_query.result() 

reddit_query = client.query("""
   SELECT `body` FROM nwo-sample.graph.reddit 
   LIMIT 10000""")
reddit_results = reddit_query.result() 

type_results = [tweet_results, reddit_results]

In [18]:
terms = []
all_text = []

wn = nltk.stem.wordnet.WordNetLemmatizer()
tokenizer = TweetTokenizer()

# go through all results, keep track of text
# normalize text (tokenize, remove stopwords, small words, and numbers/symbols)
for tresult in type_results:

    for r in tresult:
        tokens = []

        tokenized = tokenizer.tokenize(r[0])
        words = [word.lower() for word in tokenized if word.isalpha() and len(word)>2 and word not in stopwords.words('english')]
        tags = {'N': wordnet.NOUN, 'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}
       
        pos_tags = nltk.pos_tag(words)
        for word in pos_tags:
            if word[1][0] in tags:
                new_pos = tags[word[1][0]]
                lemma = wn.lemmatize(word[0], new_pos)
                tokens.append(lemma)
                terms.append(lemma)
        all_text.append(list(set(tokens)))


In [20]:
# association rule mining
# first, convert to one-hot encoding
encoding = []
term_set = set(terms)
for curr_text in all_text:
    labels = {}
    zeros = list(term_set - set(curr_text))
    ones = list(term_set.intersection(set(curr_text)))
    for missing_word in zeros:
        labels[missing_word] = 0
    for present_word in ones:
        labels[present_word] = 1
    encoding.append(labels)
encoded_df = pd.DataFrame(encoding)

In [22]:
freq_items = apriori(encoded_df, min_support=0.005, use_colnames=True, verbose=1)
freq_items.head()

Processing 11 combinations | Sampling itemset size 11032


Unnamed: 0,support,itemsets
0,0.00525,(local)
1,0.0072,(american)
2,0.0124,(number)
3,0.00535,(imagine)
4,0.00555,(compose)


In [23]:
keywords = {}
for i, row in freq_items.iterrows():
    items = list(row['itemsets'])
    if len(items)>1:
        if items[0] not in keywords:
            keywords[items[0]] = {}
        if items[1] not in keywords:
            keywords[items[1]] = {}
        keywords[items[0]][items[1]] = row['support']
        keywords[items[1]][items[0]] = row['support']

for word in keywords:
    keywords[word] = sorted(keywords[word].items(), key=lambda x: x[1], reverse=True)

In [24]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.00001)
display(rules)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(perform),(compose),0.00650,0.00555,0.00510,0.784615,141.372141,0.005064,4.617089
1,(compose),(perform),0.00555,0.00650,0.00510,0.918919,141.372141,0.005064,12.253167
2,(message),(compose),0.01115,0.00555,0.00545,0.488789,88.070133,0.005388,1.945284
3,(compose),(message),0.00555,0.01115,0.00545,0.981982,88.070133,0.005388,54.881175
4,(question),(compose),0.02060,0.00555,0.00530,0.257282,46.357037,0.005186,1.338933
...,...,...,...,...,...,...,...,...,...
173585,(subreddit),"(perform, please, message, concern, automatica...",0.00660,0.00510,0.00510,0.772727,151.515152,0.005066,4.377560
173586,(contact),"(perform, please, message, concern, automatica...",0.01245,0.00510,0.00510,0.409639,80.321285,0.005037,1.685239
173587,(moderator),"(perform, please, message, concern, automatica...",0.00550,0.00510,0.00510,0.927273,181.818182,0.005072,13.679875
173588,(compose),"(perform, please, message, concern, automatica...",0.00555,0.00510,0.00510,0.918919,180.180180,0.005072,12.270433


In [28]:

search_term = input("Enter a search term: ")

while search_term not in keywords:
    print("Could not find any related trends for the search term '"+search_term +"'") 
    search_term = input("Enter a search term: ")

print("The top trends for the search term '"+search_term +"' are: ")
for term in keywords[search_term]:
    print(term[0])
        

Enter a search term: compose
The top trends for the search term 'compose' are: 
perform
message
question
concern
moderator
contact
automatically
action
subreddit
please
