html in question/answer content #10

Closed
grapesmoker opened this issue Sep 24, 2014 · 5 comments
@grapesmoker
Owner

Right now, we sanitize the questions and answers before they're saved into the db. We do this because, I suspect, having HTML in the question text will screw up the full-text search that we will need to have. Maybe this is crazy, but I'm thinking that perhaps we can save each text field twice: once as a sanitized "search" version without any HTML, and once as the "display" version with HTML. Obviously this doubles per-question overhead, so it's not ideal. For now we won't do this, but we'll see if it causes an issue when full-text indexing is enabled.
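
For illustration, here is a minimal sketch of the "save each field twice" idea using only Python's standard library. The field names `text_display` and `text_search` are hypothetical, since the actual schema isn't shown in this thread, and this is not the project's real sanitizer.

```python
from html.parser import HTMLParser

# Tags treated as word boundaries, so "<p>one</p><p>two</p>" does not become "onetwo".
_BLOCK_TAGS = {"p", "br", "div", "li"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def to_search_text(display_html):
    """Return a tag-free, whitespace-normalized copy suitable for indexing."""
    parser = _TextExtractor()
    parser.feed(display_html)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


def make_question_fields(display_html):
    """Build both versions of one text field (hypothetical field names)."""
    return {
        "text_display": display_html,
        "text_search": to_search_text(display_html),
    }
```

The "search" copy is derived entirely from the "display" copy, so it could be dropped or regenerated later without losing anything.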

@grapesmoker grapesmoker mentioned this issue Sep 24, 2014
@Theta91
Collaborator

Theta91 commented Sep 24, 2014

Hm, here are my thoughts:

option 1: Instead of relying on MySQL's (possibly awful?) full-text search, integrate a real search indexer. Sphinx is the one I've seen most often; apparently Lucene/Solr are also popular choices. Sphinx offers exactly what we want (i.e. html_strip: http://sphinxsearch.com/docs/2.2.4/conf-html-strip.html) and seems like the best option, both in terms of capability and performance. It may, however, take a bit of time to implement (Sphinx, at least, seems fairly simple... but then, famous last words).

option 2: Store the HTML and sanitized info in different fields such that the sanitized info will be easy to drop if a search indexer is implemented.

Also, all the default stopwords for full-text search are incredibly awful and we should disable them.
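
As a rough sketch of the stopword point, assuming we stay on MySQL's built-in full-text search: these are the two server settings that control stopword filtering. Which one applies depends on the storage engine, and the thread doesn't say which engine the tables use, so treat this as a starting point rather than the project's actual config.

```ini
[mysqld]
# MyISAM full-text indexes: an empty value disables stopword filtering.
ft_stopword_file = ''
# InnoDB full-text indexes (MySQL 5.6+): turn the stopword list off.
innodb_ft_enable_stopword = OFF
```

Existing FULLTEXT indexes would need to be rebuilt after changing either setting before the new behavior shows up in search results.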

@grapesmoker
Owner Author

Right, so option 2 is the first thing that occurred to me, but thanks for finding Sphinx. I'll take a look and see how easy it is to integrate into the system.

@Theta91
Collaborator

Theta91 commented Sep 24, 2014

Yeah, sorry - I meant option 2 to be what you said.

@grapesmoker
Owner Author

TinyMCE is giving me a ton of problems by inserting byte order marks and spans and all kinds of awful garbage into the HTML. We should find a different editor...

@mbentley00
Collaborator

I've largely removed the HTML from question/answer content. Right now we just have parentheses, tildes, underscores, and custom markers. There's a separate bug to strip those characters out so that search works properly on that content.
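
A minimal sketch of what that stripping step might look like, assuming it only needs to drop parentheses, tildes, and underscores. The custom markers aren't described in this thread, so they are left out, and `strip_markers` is a hypothetical helper name.

```python
# Delete the formatting characters named above: parentheses, tildes, underscores.
_MARKER_TABLE = str.maketrans("", "", "()~_")


def strip_markers(text):
    """Return text with marker characters removed and whitespace collapsed."""
    return " ".join(text.translate(_MARKER_TABLE).split())
```

Note that this deletes the characters outright, so an underscore in the middle of a word joins the two halves. If that turns out to matter for search, replacing the characters with spaces would be the alternative.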
