Commit
This has to do with an error in the blog, where the body and tease text are passed through the inline parser before being sent to Markdown (if the user chooses Markdown as their markup language). If the text contained characters like curly quotes or em dashes, Markdown would throw an error, complaining that the input was neither ASCII nor Unicode. Strangely, this occurred only with text that had already been put through the inline parser.

At first I thought BeautifulSoup might be doing something odd to the string encoding, but eventually I tracked it down to Django's mark_safe function. The inline parser passes its return value through mark_safe right before returning it. I don't really understand why, but for some reason mark_safe returned non-Unicode content. When I pass the string through Python's unicode constructor before passing it to mark_safe, the string that mark_safe returns is proper Unicode. Markdown can process it fine, even with curly quotes.

I don't know if there's any disadvantage to adding the unicode constructor to the inline parser. It hasn't broken anything in my testing. But what if you pass Kanji or some other funky characters into it? I don't know. If you do know, please let me know!

My own belief is that blog posts should be written with plain old straight quotes, double hyphens (--) for dashes, etc. Then, for the actual display, pass the post through SmartyPants and turn that stuff into curly quotes, em dashes, ellipses, or whatever. That way your output will look pretty and will be proper HTML, but the data stored in your database will be good old, portable, use-it-anywhere ASCII.

But I know a lot of people like to write their blog posts in something like Microsoft Word, which turns straight quotes curly, and then copy/paste that into the blog. This commit should make their lives easier. (Same for folks using the WordPress importer.) Word.
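
For anyone reading this later, here is a minimal sketch of the idea behind the fix, written for Python 3, where the old unicode constructor no longer exists and the text type is str. The helper name ensure_text and the UTF-8 default are assumptions made for illustration; they are not part of this commit. In the real inline parser the result would then be handed to Django's mark_safe, which is left out here so the snippet runs without Django.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a text string (str), decoding bytes if needed.

    Hypothetical helper: the name and the UTF-8 default are assumptions.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Anything else (already str, or an object that renders itself) goes
    # through str().
    return str(value)


# Byte-string output, as a parser or serializer might hand it back
# (assumed to be UTF-8): curly quotes, an em dash, and two Kanji.
rendered = "\u201cHello\u201d \u2014 \u6f22\u5b57".encode("utf-8")

text = ensure_text(rendered)
```

With this, text is a proper text string holding the curly quotes, the dash, and the Kanji unchanged. That suggests non-Latin characters should be fine as long as the bytes really are in the assumed encoding; if they are not, decode will raise UnicodeDecodeError rather than pass bad data along quietly.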