# StackOverflow Search Optimization
Click [here](https://colab.research.google.com/drive/1Fbyg6qPFc-sJoK9bc35IO-XpDIt_6zZt) to run this code in Colaboratory. It only takes a minute!!

#### Installing the required libraries (the following lines work only when this code is run in a Jupyter notebook or a similar environment)
* To run this in other environments, install these libraries manually.
* TensorFlow v1.13.1 is recommended, but lower versions should also work.
* Not compatible with TensorFlow v2.0 out of the box; the code would first need to be converted to the TensorFlow v2.0 format.
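
Because the code depends on TensorFlow 1.x APIs, it can help to fail early with a clear message when a 2.x release is installed. The sketch below is one possible guard; `require_tf1` is a hypothetical helper that is not part of this project, and it only inspects a version string such as `tf.__version__`.

```python
def require_tf1(version):
    """Raise if `version` (e.g. tf.__version__) is not a 1.x release."""
    major = int(version.split(".")[0])
    if major != 1:
        raise RuntimeError(
            f"TensorFlow {version} detected; this notebook expects 1.x "
            "(1.13.1 is the recommended version)."
        )
    return version

# Typical use after `import tensorflow as tf`:
#   require_tf1(tf.__version__)
print(require_tf1("1.13.1"))
```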

In [0]:
!pip install --quiet tensorflow==1.13.1
!pip install --quiet tensorflow-hub
!pip install --quiet numpy
!pip install --quiet nltk
!pip install --quiet requests

#### Importing the required libraries
1. `tensorflow`: the main Machine Learning library for this project.
2. `tensorflow_hub`: provides the Universal Sentence Encoder needed to calculate the embeddings.
3. `numpy`: for matrix operations.
4. `nltk`: provides the English stopword list.
5. `requests`: to make API calls.
6. `urllib.parse`: we use the `quote` function to URL-encode tags.
7. `itertools`: we use the `combinations` function to calculate the combinations of tags.

In [0]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import nltk
import requests
from urllib.parse import quote
from urllib.parse import unquote
from itertools import combinations

#### Processing the input query to get tags
1. The query is split on spaces into tokens.
2. English stopwords from NLTK are then removed from the tokens to get `crude_tags`.
3. An API call is made for each of the `crude_tags` to get the most popular matching tag from StackOverflow.
4. At most 5 tags are returned as `api_tags`, since that is the limit for the API.
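
The first steps can be tried offline. The following is a minimal sketch, not the notebook's function: `STOP_WORDS` is a small hand-picked stand-in for NLTK's English stopword list, and `crude_tags` and `tag_search_url` are hypothetical helper names.

```python
from urllib.parse import quote

# Assumption: a tiny stand-in for NLTK's English stopword list
STOP_WORDS = {"how", "to", "in", "a", "the", "is"}

def crude_tags(query, stop_words=STOP_WORDS):
    """Split the query on whitespace and drop stopwords, keeping word order."""
    kept = [w for w in query.split() if w.lower() not in stop_words]
    return list(dict.fromkeys(kept))  # de-duplicate, order preserved

def tag_search_url(tag):
    """Build the tag-lookup URL; quote() percent-encodes characters such as '+'."""
    return ("https://api.stackexchange.com/2.2/tags"
            f"?order=desc&sort=popular&inname={quote(tag)}&site=stackoverflow")

tags = crude_tags("how to print string in c++")
print(tags)                      # ['print', 'string', 'c++']
print(tag_search_url(tags[-1]))  # the URL contains inname=c%2B%2B
```

Percent-encoding matters for tags such as `c++`, because an unencoded `+` in a query string is commonly interpreted as a space.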

In [0]:
# Cleaning the input to get tags
def get_tags(query):
  # Downloading the stopword corpus required by nltk
  nltk.download('stopwords', quiet=True)
  from nltk.corpus import stopwords

  stop_words = set(stopwords.words("english"))
  # split() without arguments also discards empty tokens caused by repeated spaces;
  # the comparison is lowercased because the stopword list is lowercase
  crude_tags = [w for w in query.split() if w.lower() not in stop_words]
  # Removing duplicates while keeping the original word order
  tags = list(dict.fromkeys(crude_tags))

  # Searching the API for tags using the obtained crude tags
  api_tags = []
  for tag in tags:
    URL = f'https://api.stackexchange.com/2.2/tags?order=desc&sort=popular&inname={quote(tag)}&site=stackoverflow'
    r = requests.get(url=URL)
    data = r.json()
    # 'items' is absent when the API answers with an error object
    items = data.get('items', [])
    if items:
      api_tags.append(items[0]['name'])

  # At most 5 tags are returned
  return api_tags[:5]
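
The response handling can also be exercised without network access. In the sketch below, `sample_responses` holds hand-written dictionaries shaped like the API's JSON (they are illustrative, not real responses), and `first_tag_names` is a hypothetical helper. It uses `dict.get` so that a response without an `items` key is skipped rather than raising `KeyError`.

```python
# Assumption: hand-written dictionaries shaped like the API's JSON output
sample_responses = [
    {"items": [{"name": "printing"}, {"name": "pretty-print"}]},
    {"items": []},                          # no tag matched
    {"error_name": "throttle_violation"},   # error object, no 'items' key
    {"items": [{"name": "c++"}]},
]

def first_tag_names(responses, limit=5):
    """Collect the first tag name of each response, skipping empty or error responses."""
    names = []
    for data in responses:
        items = data.get("items", [])
        if items:
            names.append(items[0]["name"])
    return names[:limit]

print(first_tag_names(sample_responses))  # ['printing', 'c++']
```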

#### Requesting the StackExchange API for questions using the tags obtained
1. A list of all the combinations of tags is created to request the API.
2. The combinations are tried from the largest to the smallest until one returns more than 2 questions, which maximizes the chance of getting questions that include at least one of the tags.
3. All the question titles are stored in another list.
4. The titles are returned along with the API response items.
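
To see which tag sets are requested and in what order, the enumeration can be run on its own. This is a minimal sketch with a hypothetical helper name (`tag_combinations`) and made-up example tags; it builds the combinations in ascending size and reverses the list so that the largest combination comes first.

```python
from itertools import combinations
from urllib.parse import quote

def tag_combinations(tags):
    """All non-empty combinations of tags, largest combination first."""
    combos = []
    for size in range(1, len(tags) + 1):
        combos.extend(combinations(tags, size))
    return combos[::-1]

tags = ["c++", "string", "printing"]  # example tags, chosen for illustration
for combo in tag_combinations(tags):
    # The `tagged` parameter takes a semicolon-delimited list of tags
    print(quote(";".join(combo)))
```

With n tags there are 2^n - 1 non-empty combinations, so the 5-tag limit keeps the worst case at 31 API requests.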

In [0]:
# Requesting the StackExchange API for questions using the tags obtained
def get_questions(tags):
  # Creating a list of all the possible combinations of tags
  tag_sets = []
  for i in range(1, len(tags) + 1):
    tag_sets.extend(combinations(tags, i))

  # Making API calls, starting with the largest combination
  desc = []
  questions = []
  for tag_set in reversed(tag_sets):
    tagged = ';'.join(tag_set)
    URL = f'https://api.stackexchange.com/2.2/questions?order=asc&sort=activity&tagged={quote(tagged)}&site=stackoverflow'
    r = requests.get(url=URL)
    data = r.json()
    # 'items' is absent when the API answers with an error object
    items = data.get('items', [])
    # Stop at the first combination that yields more than 2 questions
    if len(items) > 2:
      for item in items:
        desc.append(item)
        questions.append(item['title'])
      break

  return [questions, desc]


#### Calculating the similarities
1. The `query` string is appended to the list of question titles.
2. TensorFlow Hub's `universal-sentence-encoder-large/3` is used to calculate embeddings for all the questions.
3. Then the inner product between each question embedding and the `query` embedding is calculated to get the similarity.
4. The similarity score is stored under the `probability` key, next to the question title, in one dictionary per question.
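
The inner-product step can be illustrated without downloading the encoder. The sketch below uses toy 2-D unit vectors as stand-ins for real embeddings (an assumption made purely for illustration), with the query placed in the last row.

```python
import numpy as np

# Assumption: toy unit vectors standing in for the encoder's output
embeddings = np.array([
    [1.0, 0.0],   # question 0
    [0.6, 0.8],   # question 1
    [0.0, 1.0],   # the query, appended last
])

# Inner product of every row with the last row (the query)
similarity = np.inner(embeddings, embeddings[-1:])
print(similarity.shape)   # (3, 1)

# The last entry compares the query with itself, so it is skipped
for i in range(len(similarity) - 1):
    print(i, similarity.item(i))
```

When the vectors have unit length, the inner product equals the cosine similarity, so a larger value means a closer match. It is a similarity score rather than a calibrated probability.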

In [0]:
# Converting sentences to embeddings and computing the inner product to calculate similarity
def get_similarity(questions, query):
  # Embed the query together with the questions, leaving the caller's list unchanged
  sentences = questions + [query]
  with tf.Graph().as_default():
    # Downloading the pre-trained "Universal Sentence Encoder" from TensorFlow Hub
    url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
    embed = hub.Module(url)
    question_encodings = embed(sentences)
    with tf.Session() as session:
      session.run(tf.global_variables_initializer())
      session.run(tf.tables_initializer())
      embeddings = session.run(question_encodings)
  # Inner product of every embedding with the query embedding (the last row)
  similarity = np.inner(embeddings, embeddings[-1:])

  # Adding probability values to the questions (the last entry, the query itself, is skipped)
  dictItems = []
  for i in range(len(similarity) - 1):
    dictItems.append({"probability": similarity.item(i), "title": questions[i]})
  return dictItems

### Note:  
Formatting and presentation of the results are handled by the frontend (built with React.js).

## Example usage of the code
(See README.md for instructions on using the full app.)

In [0]:
#Reduce logging
tf.logging.set_verbosity(tf.logging.ERROR)

#user query string
query = "how to print string in c++"

#getting the tags from the query
tags = get_tags(query)
print("tags: ",tags)

#getting the questions for the obtained tags
questions = get_questions(tags)
#the function returns a list; its 0th element contains the list of question titles
print("list of questions:\n", questions[0])

#getting the similarity between the questions and query
similarity = get_similarity(questions[0], query)
print("Similarities:\n", similarity)
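
The returned list is not sorted. As a possible follow-up step, the entries can be ranked by score; in the sketch below the scores and titles are made up to match the shape returned by `get_similarity`, and `top_matches` is a hypothetical helper.

```python
# Assumption: made-up scores and titles in the shape returned by get_similarity
results = [
    {"probability": 0.41, "title": "Example title A"},
    {"probability": 0.87, "title": "Example title B"},
    {"probability": 0.63, "title": "Example title C"},
]

def top_matches(results, k=2):
    """Return the k entries with the highest similarity score."""
    return sorted(results, key=lambda r: r["probability"], reverse=True)[:k]

for match in top_matches(results):
    print(f'{match["probability"]:.2f}  {match["title"]}')
```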