# Stack Overflow Search Engine

**Stack Overflow** is a question-and-answer platform used by many professional and enthusiast programmers. The objective of this project is to fine-tune search so that the most relevant results are returned to the user.

The Google BigQuery dataset is updated quarterly and includes an archive of Stack Overflow content, including posts, votes, tags, and badges. It is updated to mirror the Stack Overflow content on the Internet Archive and is also available through the Stack Exchange Data Explorer. More information about the dataset is available at: https://www.kaggle.com/stackoverflow/stackoverflow

We used bq_helper, which simplifies common read-only tasks in BigQuery by dealing with object references and unpacking result objects into pandas DataFrames.

### Importing the libraries

In [2]:
import bq_helper,os,spacy,warnings
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
EN = spacy.load('en_core_web_sm')

We retrieved the required data with the BigQueryHelper class of bq_helper. A `SELECT` statement picks the required columns and performs an inner join between the `posts_questions` and `posts_answers` tables, so each returned row is one question paired with one of its answers. The query retrieves questions whose tags contain "python", and we limited it to 500,000 rows for faster processing. We then stored the query result in a CSV file.

In [19]:
from bq_helper import BigQueryHelper
temp = BigQueryHelper("bigquery-public-data", "stackoverflow")
QUERY = "SELECT q.id, q.title, q.body, q.tags, a.body as answers, a.score FROM `bigquery-public-data.stackoverflow.posts_questions` AS q INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a ON q.id = a.parent_id WHERE q.tags LIKE '%python%' LIMIT 500000"
data = temp.query_to_pandas(QUERY)
data.to_csv('data/Original_data.csv')
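
One detail of the query is worth illustrating: `LIKE '%python%'` is a substring match on the pipe-delimited `tags` string, so it also selects questions whose only Python-related tag is, for example, `ipython` or `python-3.x`. The sketch below reproduces that behaviour on a small made-up sample with pandas and contrasts it with an exact match on the `python` tag. The sample rows and variable names are illustrative assumptions, not data from the notebook.

```python
import pandas as pd

# Hypothetical sample of pipe-delimited tag strings (not real rows).
sample = pd.DataFrame({
    "tags": ["python|django", "ipython|jupyter", "python-3.x|pandas", "java|spring"]
})

# Substring match, analogous to SQL `LIKE '%python%'`.
substring_match = sample["tags"].str.contains("python", regex=False)

# Exact match on the individual tag `python`.
exact_match = sample["tags"].apply(lambda s: "python" in s.split("|"))

print(substring_match.tolist())  # [True, True, True, False]
print(exact_match.tolist())      # [True, False, False, False]
```

Whether the broader substring match is desirable depends on the use case; for a Python-focused search engine, including `python-3.x` questions is probably intended.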

## Load Data

Let's load the data collected using BigQueryHelper.

In [3]:
data = pd.read_csv('data/Original_data.csv')
data.head()

Unnamed: 0,id,title,body,tags,answers,score
0,2345151,how to save/read class wholly in Python,<pre><code>som = SOM_CLASS() # includes many b...,python|class|autosave,"<p>You can (de)serialize with <a href=""http://...",17
1,15288891,How can I serve files with UTF-8 encoding usin...,<p>I often use the following to quickly fire u...,python|webserver,"<p>Had the same problem, the following code wo...",20
2,5762446,Python: Find a best fit function for a list of...,<p>I am aware of many probabilistic functions ...,python|equation,"<p>Look at <a href=""http://docs.scipy.org/doc/...",12
3,1103487,Can I detect if my code is running on cPython ...,<p>I'm working on a small django project that ...,python|django|jython,<p>if you're running Jython </p>\n\n<pre><cod...,17
4,4479710,Sane way to define default variable values fro...,<p>I'd like to set default values for variable...,python|jinja2,"<p>The <a href=""http://jinja.pocoo.org/templat...",14


In [4]:
data[data.id==2345151]

Unnamed: 0,id,title,body,tags,answers,score
0,2345151,how to save/read class wholly in Python,<pre><code>som = SOM_CLASS() # includes many b...,python|class|autosave,"<p>You can (de)serialize with <a href=""http://...",17
90599,2345151,how to save/read class wholly in Python,<pre><code>som = SOM_CLASS() # includes many b...,python|class|autosave,"<p>Take a look at Python's <a href=""http://doc...",3
147378,2345151,how to save/read class wholly in Python,<pre><code>som = SOM_CLASS() # includes many b...,python|class|autosave,<p>I use this code:</p>\n\n<pre><code>import c...,6


In [5]:
s=set()
for i in data["tags"].apply(lambda x:x.split('|')):
    for j in i:
        s.add(j)
print(f'Total number of tags:{len(s)}')

Total number of tags:14384
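
The same count can be obtained without an explicit loop. The following is a minimal sketch on a made-up three-row sample, assuming a pandas version that provides `Series.explode` (0.25 or later); `sample_tags` is a hypothetical stand-in for `data["tags"]`.

```python
import pandas as pd

# Hypothetical stand-in for data["tags"].
sample_tags = pd.Series([
    "python|class|autosave",
    "python|webserver",
    "python|django|jython",
])

# Split each string into a list of tags, then put every tag on its own row.
all_tags = sample_tags.apply(lambda s: s.split("|")).explode()

n_unique = all_tags.nunique()
print(f"Total number of tags:{n_unique}")  # Total number of tags:6

# value_counts() additionally shows how often each tag occurs.
tag_counts = all_tags.value_counts()
print(tag_counts["python"])  # 3
```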


## Missing Values

We can observe that there are no missing values in any of the columns.

In [22]:
data.isna().sum()

id         0
title      0
body       0
tags       0
answers    0
score      0
dtype: int64

In [23]:
print('Dataframe shape:' + str(data.shape))

Dataframe shape:(500000, 6)


## Data Preprocessing

Here, we grouped the rows by question (id, title, body, and tags) and concatenated all the answers that belong to the same question. We also summed the scores of the individual answers to obtain an aggregate score for each question.

In [24]:
# combining answers
warnings.filterwarnings("ignore")
dict1 = {'answers':{'combined_answers': lambda x: "\n".join(x)},
    'score':{'combined_score': 'sum'}}
combined_data = pd.DataFrame(data.groupby(['id','title', 'body','tags'],as_index=False).agg(dict1))
combined_data.head()

Unnamed: 0_level_0,id,title,body,tags,answers,score
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,combined_answers,combined_score
0,502,Get a preview JPEG of a PDF on Windows?,<p>I have a cross-platform (Python) applicatio...,python|windows|image|pdf,<p>ImageMagick delegates the PDF->bitmap conve...,55
1,1829,How do I make a menu that does not require the...,<p>I've got a menu in Python. That part was ea...,python,<p><strong>On Linux:</strong></p>\n\n<ul>\n<li...,22
2,2311,File size differences after copying a file to ...,<p>I have created a PHP-script to update a web...,php|python|ftp|webserver|ftplib,<p>Well if you go under the properties of your...,18
3,3061,Calling a function of a module by using its na...,<p>What is the best way to go about calling a ...,python|object,<p>Patrick's solution is probably the cleanest...,3323
4,4942,How to sell Python to a client/boss/person,<p>When asked to create system XYZ and you ask...,php|python|ruby-on-rails|ruby,<p>Focus on the shorter time needed for develo...,33
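
The nested-dictionary form of `agg` used above was deprecated in older pandas releases and raises an error from pandas 1.0 onward. A minimal sketch of the same grouping with named aggregation is given below, on a made-up sample; the frame `sample` and its values are assumptions for illustration. Note that this variant produces flat column names (`combined_answers`, `combined_score`) rather than the two-level header shown above, so code that relies on the two-level header would need adjusting.

```python
import pandas as pd

# Hypothetical miniature of the question/answer join (one row per answer).
sample = pd.DataFrame({
    "id": [1, 1, 2],
    "title": ["t1", "t1", "t2"],
    "body": ["b1", "b1", "b2"],
    "tags": ["python", "python", "python|django"],
    "answers": ["<p>a</p>", "<p>b</p>", "<p>c</p>"],
    "score": [17, 3, 5],
})

# Named aggregation: output_column=(input_column, aggregation).
combined = sample.groupby(["id", "title", "body", "tags"], as_index=False).agg(
    combined_answers=("answers", lambda s: "\n".join(s)),
    combined_score=("score", "sum"),
)

print(combined[["id", "combined_score"]])
```

In this sample, question 1 ends up with a combined score of 20 and its two answers joined by a newline.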


In [25]:
combined_data[combined_data.id==2345151]

Unnamed: 0_level_0,id,title,body,tags,answers,score
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,combined_answers,combined_score
4233,2345151,how to save/read class wholly in Python,<pre><code>som = SOM_CLASS() # includes many b...,python|class|autosave,"<p>You can (de)serialize with <a href=""http://...",26


Let's check the result of combining answers and their scores

In [27]:
print(f"Max score before {np.max(data.score.values)} and Max score after:{np.max(combined_data.score.values)}")

Max score before 6805 and Max score after:10978


Let's define the functions used for text preprocessing.

- **tokenize_data** converts the raw text into lowercase tokens and drops whitespace tokens.
- **to_lower** converts the tokens to lower case.
- **eliminate_punc** strips punctuation characters and drops tokens that become empty.
- **eliminate_stopword** removes English stopwords.

In [28]:
import re
import nltk
from nltk.corpus import stopwords

def tokenize_data(text):
    t = EN.tokenizer(text)
    return [token.text.lower() for token in t if not token.is_space]

def to_lower(w):
    list1 = []
    for i in w:
        x = i.lower()
        list1.append(x)
    return list1

def eliminate_punc(w):
    list1 = []
    for i in w:
        x = re.sub(r'[^\w\s]', '', i)
        if x != '':
            list1.append(x)
    return list1

# Build the stopword set once; calling stopwords.words() per token is very slow.
STOPWORDS = set(stopwords.words('english'))

def eliminate_stopword(w):
    return [i for i in w if i not in STOPWORDS]

In [34]:
def normalize(w):
    w = to_lower(w)
    w = eliminate_punc(w)
    w = eliminate_stopword(w)
    return w
def preprocess_text(text):
    return ' '.join(normalize(tokenize_data(text)))
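
To make the effect of the pipeline concrete, here is a dependency-free sketch of the same steps (tokenize, lowercase, strip punctuation, drop stopwords). It is only an approximation: a regular expression stands in for the spaCy tokenizer, and `DEMO_STOPWORDS` is a tiny hypothetical subset of NLTK's English stopword list, so results on real posts may differ from those of `preprocess_text`.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
DEMO_STOPWORDS = {"how", "to", "in", "a", "the", "is", "of"}

def demo_preprocess(text):
    # Rough stand-in for the spaCy tokenizer: words and single punctuation marks.
    tokens = [t.lower() for t in re.findall(r"\w+|[^\w\s]", text)]
    # Strip punctuation and drop tokens that become empty.
    tokens = [re.sub(r"[^\w\s]", "", t) for t in tokens]
    tokens = [t for t in tokens if t]
    # Remove stopwords.
    tokens = [t for t in tokens if t not in DEMO_STOPWORDS]
    return " ".join(tokens)

result = demo_preprocess("How to save/read class wholly in Python?")
print(result)  # save read class wholly python
```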

Text from Stack Overflow also contains HTML markup, such as p tags and h1-h6 tags, in addition to the question and answer text. We stripped this markup and added a new feature, 'post_corpus', which combines the question title, the question body, and the answers. A 'question_url' column is also created by appending the question id to the URL 'https://stackoverflow.com/questions/'.

In [30]:
list2,c_list,u_list,comment_list,score_list,tag_list,corpus_list = [],[],[],[],[],[],[] 

for i, row in combined_data.iterrows():
    list2.append(row.title.values[0])    
    tag_list.append(row.tags.values[0])     
    
    # Questions
    content = row.body.values[0]
    soup = BeautifulSoup(content, 'lxml')
    if soup.code: soup.code.decompose()     
    tag_p = soup.p
    tag_pre = soup.pre
    text = ''
    if tag_p: text = text + tag_p.get_text()
    if tag_pre: text = text + tag_pre.get_text()
        
    c_list.append(str(row.title.values[0]) + ' ' + str(text))
    u_list.append('https://stackoverflow.com/questions/' + str(row.id.values[0]))
    
    # Answers
    content = row.answers.values[0]
    soup = BeautifulSoup(content, 'lxml')
    if soup.code: soup.code.decompose()
    tag_p = soup.p
    tag_pre = soup.pre
    text = ''
    if tag_p: text = text + tag_p.get_text()
    if tag_pre: text = text + tag_pre.get_text()
    comment_list.append(text)
    
    score_list.append(row.score.values[0])       
    
    corpus_list.append(c_list[-1] + ' ' + comment_list[-1])

modified_df = pd.DataFrame({'original_title': list2, 'post_corpus': corpus_list, 'question_content': c_list, 'question_url': u_list, 'tags': tag_list, 'overall_scores':score_list,'answers_content': comment_list})

In [31]:
modified_df.head()

Unnamed: 0,original_title,post_corpus,question_content,question_url,tags,overall_scores,answers_content
0,Get a preview JPEG of a PDF on Windows?,Get a preview JPEG of a PDF on Windows? I have...,Get a preview JPEG of a PDF on Windows? I have...,https://stackoverflow.com/questions/502,python|windows|image|pdf,55,ImageMagick delegates the PDF->bitmap conversi...
1,How do I make a menu that does not require the...,How do I make a menu that does not require the...,How do I make a menu that does not require the...,https://stackoverflow.com/questions/1829,python,22,On Linux:\nimport sys\nimport select\nimport t...
2,File size differences after copying a file to ...,File size differences after copying a file to ...,File size differences after copying a file to ...,https://stackoverflow.com/questions/2311,php|python|ftp|webserver|ftplib,18,Well if you go under the properties of your fi...
3,Calling a function of a module by using its na...,Calling a function of a module by using its na...,Calling a function of a module by using its na...,https://stackoverflow.com/questions/3061,python|object,3323,Patrick's solution is probably the cleanest.\n...
4,How to sell Python to a client/boss/person,How to sell Python to a client/boss/person Whe...,How to sell Python to a client/boss/person Whe...,https://stackoverflow.com/questions/4942,php|python|ruby-on-rails|ruby,33,Focus on the shorter time needed for developme...
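
It may help to see exactly what the extraction step keeps. `soup.code`, `soup.p`, and `soup.pre` each refer to the first matching tag only, so the loop above removes the first `<code>` element and then keeps the text of the first `<p>` and the first `<pre>`; later paragraphs are not included. The sketch below demonstrates this on a made-up HTML snippet. The helper name `first_p_and_pre_text` is hypothetical, and the built-in `html.parser` is used instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

def first_p_and_pre_text(html):
    # Same steps as the loop above, wrapped in a function for illustration.
    soup = BeautifulSoup(html, "html.parser")
    if soup.code:
        soup.code.decompose()  # removes only the first <code> element
    text = ""
    if soup.p:
        text += soup.p.get_text()    # first <p> only
    if soup.pre:
        text += soup.pre.get_text()  # first <pre> only
    return text

html = (
    "<p>Use <code>dump()</code> here.</p>"
    "<pre><code>import pickle</code></pre>"
    "<p>Second paragraph.</p>"
)
result = first_p_and_pre_text(html)
print(repr(result))
```

If the full text of every paragraph were wanted instead, `soup.get_text()` on the whole document would be one alternative, at the cost of also including code snippets.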


## Data Normalization

Here, we normalized the 'question_content' and 'post_corpus' columns and added a 'processed_title' column, so that the original title is preserved in 'original_title'.

In [35]:
# Preprocess text for 'question_content', 'post_corpus' and a new column 'processed_title'
warnings.filterwarnings("ignore")
modified_df.question_content = modified_df.question_content.apply(lambda x: preprocess_text(x))
modified_df.post_corpus = modified_df.post_corpus.apply(lambda x: preprocess_text(x))
modified_df['processed_title'] = modified_df.original_title.apply(lambda x: preprocess_text(x))

In [36]:
# Save the data
modified_df.to_csv('data/Preprocessed_data1.csv', index=False)