html in question/answer content #10
Comments
Hm, here are my thoughts:

Option 1: Instead of relying on MySQL's (possibly awful?) full-text search, integrate a real search indexer. Sphinx is the one I've seen most often; Lucene/Solr are apparently also popular choices. A dedicated indexer offers exactly what we want (e.g. Sphinx's html_strip: http://sphinxsearch.com/docs/2.2.4/conf-html-strip.html) and seems like the best option in terms of both capability and performance. It may, however, take a bit of time to implement (Sphinx, at least, seems fairly simple... but then, famous last words).

Option 2: Store the HTML and the sanitized text in separate fields, so that the sanitized field is easy to drop if a search indexer is implemented later.

Also, the default stopwords for full-text search are incredibly awful and we should disable them.
Right, so option 2 is the first thing that occurred to me, but thanks for finding Sphinx. I'll take a look and see how easy it is to integrate into the system.
Yeah, sorry - I meant option 2 to be what you said.
TinyMCE is giving me a ton of problems by inserting byte order marks and spans and all kinds of awful garbage into the HTML. We should find a different editor...
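Until the editor is replaced, a cleanup pass on save could remove the worst of this. The following is only a rough sketch in Python, not the project's actual code: the function name `clean_editor_html` is hypothetical, and it assumes the "garbage" is limited to U+FEFF byte order marks and empty `<span>` elements. A regex is a crude tool for HTML, so anything more involved would call for a real parser.

```python
import re

# U+FEFF is the byte order mark character; inside text it is invisible noise.
BOM = "\ufeff"

# Matches a <span ...> that contains nothing but whitespace (assumed to be junk).
EMPTY_SPAN = re.compile(r"<span\b[^>]*>\s*</span>", re.IGNORECASE)


def clean_editor_html(html: str) -> str:
    """Remove byte order marks and empty spans from editor output (hypothetical helper)."""
    html = html.replace(BOM, "")
    # Repeat until stable so that nested empty spans are removed as well.
    previous = None
    while previous != html:
        previous = html
        html = EMPTY_SPAN.sub("", html)
    return html
```

Spans that wrap real text are left alone, since removing them safely would need to keep their contents.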
I've largely removed the HTML from question/answer content. Right now we just have parentheses, tildes, underscores and custom markers. There's a separate bug to strip those characters out so that search works properly on them.
Right now, we sanitize the questions and answers before they're saved into the db. We do this because, I suspect, having HTML in question text will screw up the full-text search that we will need to have. Maybe this is crazy, but I'm thinking that perhaps we can save each text field twice: once as a sanitized "search" version without any HTML, and once as the "display" version with HTML. Obviously this doubles the per-question storage overhead and so is not ideal; for now we won't do this, but we'll see whether it causes an issue once full-text indexing is enabled.
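To make the "save each field twice" idea concrete, here is a minimal sketch in Python. It is an illustration only, not the project's code: the names `make_search_text`, `prepare_fields`, `display_text` and `search_text` are all hypothetical, and it assumes the search version is simply the text content of the HTML with whitespace collapsed. It uses only the standard library's `html.parser`.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &amp;
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join chunks with spaces so words in adjacent elements do not merge,
        # then collapse runs of whitespace into single spaces.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def make_search_text(display_html: str) -> str:
    """Return a tag-free version of the HTML for full-text indexing (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(display_html)
    parser.close()
    return parser.text()


def prepare_fields(display_html: str) -> dict:
    """Build both versions of a text field; the column names are assumptions."""
    return {
        "display_text": display_html,
        "search_text": make_search_text(display_html),
    }
```

Joining chunks with a space means a word split by inline markup (for example `Sph<b>in</b>x`) would be indexed as separate tokens, and the contents of `<script>` or `<style>` elements are not filtered out. Both would need handling before relying on something like this.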