I built a search engine on a 10 GB StackOverflow SQL Database
- Flask
- jQuery
- SQL Server Express
- Pyodbc
- HTML/CSS/JS
- A copy of StackOverflow's SQL database up to 2010 was used in this project.
- `get_top_questions` returns the top matching questions for the given keywords when called in the `main.py` script.
- The `SearchEngine` module contains the definition of the `get_top_questions` method, as well as the helper functions that make up the logic for extracting matching questions from the SQL database.
- When a search query string is received from the frontend, it is parsed and the resulting keywords are passed to the `SearchEngine` module.
- The `SearchEngine` module extracts a list of matching question IDs for each keyword. For each keyword, it updates a hash map that tracks how many of the keywords each matching question ID contains.
- After all matching IDs have been compiled, the top 10 IDs with the most matches are extracted using a max heap (to avoid sorting every match) and returned.
- While working on the project, I realized the SQL Server response was really slow. To reduce search time, I limited the maximum number of keywords searched to 6.
- Also, while parsing the search query string, words with fewer than 3 characters are filtered out, since they are less likely to affect the search results and can slow down the search request if included.
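
The parsing and ranking steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `parse_query`, `top_question_ids` and the `find_matching_ids` callback are hypothetical names, the callback stands in for the SQL lookup, and lowercasing and de-duplicating the keywords are assumptions the text does not state. The max heap is emulated with Python's `heapq` (a min heap) by negating the counts.

```python
import heapq
from collections import Counter

MIN_WORD_LENGTH = 3  # words shorter than this are dropped
MAX_KEYWORDS = 6     # cap on the number of keywords searched
TOP_N = 10           # number of question IDs returned


def parse_query(query):
    """Turn a raw query string into at most MAX_KEYWORDS keywords."""
    keywords = []
    for word in query.lower().split():  # lowercasing is an assumption
        # Skip short words and repeats (de-duplication is an assumption).
        if len(word) >= MIN_WORD_LENGTH and word not in keywords:
            keywords.append(word)
    return keywords[:MAX_KEYWORDS]


def top_question_ids(keywords, find_matching_ids, n=TOP_N):
    """Return up to n question IDs that match the most keywords.

    find_matching_ids is a hypothetical callback: keyword -> iterable of
    question IDs. In the real project this would be a database query.
    """
    match_counts = Counter()  # question ID -> number of keywords matched
    for keyword in keywords:
        # set() so a question counts at most once per keyword.
        match_counts.update(set(find_matching_ids(keyword)))

    # heapq is a min heap, so store negated counts to pop the largest first.
    heap = [(-count, question_id) for question_id, count in match_counts.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(n, len(heap)))]


if __name__ == "__main__":
    # Tiny in-memory stand-in for the database, for illustration only.
    fake_index = {"python": [1, 2, 3], "flask": [2, 3], "server": [3]}
    keywords = parse_query("A Python Flask server is up")
    print(keywords)  # ['python', 'flask', 'server']
    print(top_question_ids(keywords, lambda kw: fake_index.get(kw, [])))  # [3, 2, 1]
```

Building the heap is linear in the number of matched IDs and each of the 10 pops is logarithmic, so this avoids fully sorting every match. Ties are broken here by the smaller question ID, which is also an assumption.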