html in question/answer content #10

grapesmoker · 2014-09-24T03:40:37Z

Right now, we sanitize the questions and answers before they're saved into the db. We do this because, I suspect, having HTML in the question text will screw up the fulltext search that we'll need. Maybe this is crazy, but perhaps we can save each text field twice: once as a sanitized "search" version without any HTML, and once as the "display" version with HTML. Obviously this doubles per-question storage overhead, so it's not ideal. For now we won't do this, but we'll see if it causes an issue when fulltext indexing is enabled.
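A minimal sketch of the two-version idea, using only the Python standard library. The names `strip_html` and `make_text_fields` and the `display`/`search` keys are hypothetical and not taken from the codebase; how the two values would map onto actual db columns is left open.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text):
    """Return html_text with tags removed and runs of whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def make_text_fields(html_text):
    """Hypothetical helper: build both versions of one text field."""
    return {
        "display": html_text,             # kept as entered, with HTML
        "search": strip_html(html_text),  # plain text for fulltext indexing
    }
```

For example, `make_text_fields("<b>Tom</b> &amp; <i>Jerry</i>")` keeps the original markup under `display` and stores `Tom & Jerry` under `search`. Text from adjacent tags is joined without a separator so that a partially bolded word stays one word; block-level tags would need extra handling.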

Theta91 · 2014-09-24T15:59:49Z

Hm, here are my thoughts:

Option 1: Instead of relying on MySQL's (possibly awful?) full-text search, integrate a real search indexer (Sphinx is the one I've seen most often; apparently Lucene/Solr are also popular choices). This offers exactly what we want (i.e. html_strip: http://sphinxsearch.com/docs/2.2.4/conf-html-strip.html) and seems like the best option in terms of both capability and performance. It may, however, take a bit of time to implement (Sphinx, at least, seems fairly simple... but then, famous last words).

Option 2: Store the HTML and the sanitized text in separate fields, so that the sanitized field will be easy to drop if a search indexer is implemented.

Also, MySQL's default stopword list for full-text search is incredibly awful, and we should disable it.

grapesmoker · 2014-09-24T16:20:36Z

Right, so option 2 is the first thing that occurred to me, but thanks for finding Sphinx. I'll take a look and see how easy it is to integrate into the system.

Theta91 · 2014-09-24T16:34:06Z

Yeah, sorry - I meant option 2 to be what you said.

grapesmoker · 2014-10-03T21:11:27Z

TinyMCE is giving me a ton of problems by inserting byte order marks and spans and all kinds of awful garbage into the HTML. We should find a different editor...

mbentley00 · 2014-11-29T18:21:14Z

I've largely removed the HTML from question/answer content. Right now we just have parentheses, tildes and underscores and custom markers. There's a separate bug for stripping those characters out so that search works properly on that content.
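A rough sketch of what stripping those characters for search might look like. The function name `strip_markers` is hypothetical, the custom markers are not handled because their syntax isn't described here, and deleting the characters outright (rather than replacing them with spaces) is an assumption.

```python
import re

# Assumption: only these three kinds of characters are stripped.
# The custom markers mentioned above are not covered by this sketch.
_MARKER_CHARS = re.compile(r"[()~_]")


def strip_markers(text):
    """Remove parentheses, tildes and underscores, then collapse whitespace."""
    without_markers = _MARKER_CHARS.sub("", text)
    return " ".join(without_markers.split())
```

With this, `strip_markers("James _Joyce_ (~Ulysses~)")` gives `James Joyce Ulysses`, which is the kind of plain text a search index could work with.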

grapesmoker mentioned this issue Sep 24, 2014

search #11

Closed

Theta91 added enhancement question and removed enhancement labels Nov 6, 2014

mbentley00 closed this as completed Nov 29, 2014
