Merge pull request #2 from phihag/fix-write-to-binary-files
Write bytes to a binary stream, instead of a string one
kumar303 committed Nov 15, 2011
2 parents fc001d4 + d509ea1 commit 2d2c0fd
Showing 2 changed files with 31 additions and 27 deletions.
56 changes: 30 additions & 26 deletions index.html
@@ -3,15 +3,15 @@
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.7: http://docutils.sourceforge.net/" />
<meta name="generator" content="Docutils 0.8.1: http://docutils.sourceforge.net/" />
<meta name="version" content="S5 1.1" />
<title>Unicode In Python, Completely Demystified</title>
<meta name="author" content="Kumar McMillan" />
<style type="text/css">

/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 6253 2010-03-02 00:24:53Z milde $
:Id: $Id: html4css1.css 7056 2011-06-17 10:50:48Z milde $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.
@@ -49,6 +49,10 @@
dl.docutils dd {
margin-bottom: 0.5em }

object[type="image/svg+xml"], object[type="application/x-shockwave-flash"] {
overflow: hidden;
}

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
font-weight: bold }
@@ -187,7 +191,7 @@

/* reset inner alignment in figures */
div.align-right {
text-align: left }
text-align: inherit }

/* div.align-center * { */
/* text-align: left } */
@@ -247,7 +251,7 @@
margin-top: 0 ;
font: inherit }

pre.literal-block, pre.doctest-block {
pre.literal-block, pre.doctest-block, pre.math {
margin-left: 2em ;
margin-right: 2em }

@@ -439,7 +443,7 @@ <h1>Command line script</h1>
<h1>Let's open a UTF-8 file</h1>
<div class="section" id="ivan-krstic">
<h2>Ivan Krstić</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> <span style="color: #008000">open</span>(<span style="color: #BA2121">&#39;/tmp/ivan_utf8.txt&#39;</span>, <span style="color: #BA2121">&#39;r&#39;</span>)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> <span style="color: #008000">open</span>(<span style="color: #BA2121">&#39;/tmp/ivan_utf8.txt&#39;</span>, <span style="color: #BA2121">&#39;r&#39;</span>)
<span style="color: #666666">&gt;&gt;&gt;</span> ivan_utf8 <span style="color: #666666">=</span> f<span style="color: #666666">.</span>read()
<span style="color: #666666">&gt;&gt;&gt;</span> ivan_utf8
<span style="color: #BA2121">&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\xc4\x87</span><span style="color: #BA2121">&#39;</span>
@@ -454,7 +458,7 @@ <h2>Ivan Krstić</h2>
</div>
<div class="slide" id="what-is-it">
<h1>What is it?</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ivan_utf8
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ivan_utf8
<span style="color: #BA2121">&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\xc4\x87</span><span style="color: #BA2121">&#39;</span>
<span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000">type</span>(ivan_utf8)
<span style="color: #666666">&lt;</span><span style="color: #008000">type</span> <span style="color: #BA2121">&#39;str&#39;</span><span style="color: #666666">&gt;</span>
@@ -469,7 +473,7 @@ <h1>What is it?</h1>
<h1>Text is encoded</h1>
<div class="section" id="id1">
<h2>Ivan Krstić</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #BA2121">&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\xc4\x87</span><span style="color: #BA2121">&#39;</span>
<div class="highlight"><pre><span style="color: #BA2121">&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\xc4\x87</span><span style="color: #BA2121">&#39;</span>
</pre></div>
<ul class="incremental simple">
<li>This string is encoded in UTF-8 format</li>
@@ -597,7 +601,7 @@ <h1>The problem</h1>
<p>Can't my Python text remain encoded?</p>
<div class="section" id="id3">
<h2>Ivan Krstić</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ivan_utf8
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ivan_utf8
<span style="color: #BA2121">&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\xc4\x87</span><span style="color: #BA2121">&#39;</span>
<span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000">len</span>(ivan_utf8)
<span style="color: #666666">12</span>
@@ -618,7 +622,7 @@ <h2>Ivan Krstić</h2>
<h1>Unicode is more accurate</h1>
<div class="section" id="id4">
<h2>Ivan Krstić</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ivan_utf8
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ivan_utf8
<span style="color: #BA2121">&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\xc4\x87</span><span style="color: #BA2121">&#39;</span>
<span style="color: #666666">&gt;&gt;&gt;</span> ivan_uni <span style="color: #666666">=</span> ivan_utf8<span style="color: #666666">.</span>decode(<span style="color: #BA2121">&#39;utf-8&#39;</span>)
<span style="color: #666666">&gt;&gt;&gt;</span> ivan_uni
@@ -632,7 +636,7 @@ <h2>Ivan Krstić</h2>
<h1>Unicode is more accurate</h1>
<div class="section" id="id6">
<h2>Ivan Krstić</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ivan_uni
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ivan_uni
<span style="color: #BA2121">u&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\u0107</span><span style="color: #BA2121">&#39;</span>
<span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000">len</span>(ivan_uni)
<span style="color: #666666">11</span>
@@ -643,7 +647,7 @@ <h2>Ivan Krstić</h2>
</div>
<div class="slide" id="unicode-what-is-it">
<h1>Unicode, what is it?</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #BA2121">u&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\u0107</span><span style="color: #BA2121">&#39;</span>
<div class="highlight"><pre><span style="color: #BA2121">u&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\u0107</span><span style="color: #BA2121">&#39;</span>
</pre></div>
<ul class="incremental simple">
<li>a way to represent text without bytes</li>
@@ -715,11 +719,11 @@ <h1>Unicode is a concept</h1>
</div>
<div class="slide" id="unicode-transformation-format">
<h1>Unicode Transformation Format</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ab <span style="color: #666666">=</span> <span style="color: #008000">unicode</span>(<span style="color: #BA2121">&#39;AB&#39;</span>)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ab <span style="color: #666666">=</span> <span style="color: #008000">unicode</span>(<span style="color: #BA2121">&#39;AB&#39;</span>)
</pre></div>
<div class="section" id="utf-8">
<h2>UTF-8</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ab<span style="color: #666666">.</span>encode(<span style="color: #BA2121">&#39;utf-8&#39;</span>)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ab<span style="color: #666666">.</span>encode(<span style="color: #BA2121">&#39;utf-8&#39;</span>)
<span style="color: #BA2121">&#39;AB&#39;</span>
</pre></div>
<ul class="incremental simple">
@@ -731,11 +735,11 @@ <h2>UTF-8</h2>
</div>
<div class="slide" id="id7">
<h1>Unicode Transformation Format</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ab <span style="color: #666666">=</span> <span style="color: #008000">unicode</span>(<span style="color: #BA2121">&#39;AB&#39;</span>)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ab <span style="color: #666666">=</span> <span style="color: #008000">unicode</span>(<span style="color: #BA2121">&#39;AB&#39;</span>)
</pre></div>
<div class="section" id="utf-16">
<h2>UTF-16</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ab<span style="color: #666666">.</span>encode(<span style="color: #BA2121">&#39;utf-16&#39;</span>)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ab<span style="color: #666666">.</span>encode(<span style="color: #BA2121">&#39;utf-16&#39;</span>)
<span style="color: #BA2121">&#39;</span><span style="color: #BB6622; font-weight: bold">\xff\xfe</span><span style="color: #BA2121">A</span><span style="color: #BB6622; font-weight: bold">\x00</span><span style="color: #BA2121">B</span><span style="color: #BB6622; font-weight: bold">\x00</span><span style="color: #BA2121">&#39;</span>
</pre></div>
<ul class="incremental simple">
@@ -791,7 +795,7 @@ <h1>Decoding text into Unicode</h1>
</div>
<div class="slide" id="python-magic">
<h1>Python magic</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ivan_uni
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ivan_uni
<span style="color: #BA2121">u&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\u0107</span><span style="color: #BA2121">&#39;</span>
<span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> <span style="color: #008000">open</span>(<span style="color: #BA2121">&#39;/tmp/ivan.txt&#39;</span>, <span style="color: #BA2121">&#39;w&#39;</span>)
<span style="color: #666666">&gt;&gt;&gt;</span> f<span style="color: #666666">.</span>write(ivan_uni)
@@ -802,7 +806,7 @@ <h1>Python magic</h1>
</div>
<div class="slide" id="python-magic-revealed">
<h1>Python magic, revealed</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ivan_uni
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ivan_uni
<span style="color: #BA2121">u&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\u0107</span><span style="color: #BA2121">&#39;</span>
<span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> <span style="color: #008000">open</span>(<span style="color: #BA2121">&#39;/tmp/ivan.txt&#39;</span>, <span style="color: #BA2121">&#39;w&#39;</span>)
<span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">sys</span>
@@ -823,7 +827,7 @@ <h1>Gasp!</h1>
</div>
<div class="slide" id="just-reset-it">
<h1>Just reset it?!</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%">sys<span style="color: #666666">.</span>setdefaultencoding(<span style="color: #BA2121">&#39;utf-8&#39;</span>)
<div class="highlight"><pre>sys<span style="color: #666666">.</span>setdefaultencoding(<span style="color: #BA2121">&#39;utf-8&#39;</span>)
</pre></div>
<ul class="incremental simple">
<li>can't I just put this in <tt class="docutils literal">sitecustomize.py</tt>?</li>
@@ -843,7 +847,7 @@ <h1>Solution</h1>
<div class="slide" id="decode-early">
<h1>1. Decode early</h1>
<p>Decode to <tt class="docutils literal">&lt;type 'unicode'&gt;</tt> ASAP</p>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">to_unicode_or_bust</span>(
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">def</span> <span style="color: #0000FF">to_unicode_or_bust</span>(
<span style="color: #666666">...</span> obj, encoding<span style="color: #666666">=</span><span style="color: #BA2121">&#39;utf-8&#39;</span>):
<span style="color: #666666">...</span> <span style="color: #008000; font-weight: bold">if</span> <span style="color: #008000">isinstance</span>(obj, <span style="color: #008000">basestring</span>):
<span style="color: #666666">...</span> <span style="color: #008000; font-weight: bold">if</span> <span style="color: #AA22FF; font-weight: bold">not</span> <span style="color: #008000">isinstance</span>(obj, <span style="color: #008000">unicode</span>):
@@ -856,7 +860,7 @@ <h1>1. Decode early</h1>
</div>
<div class="slide" id="unicode-everywhere">
<h1>2. Unicode everywhere</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> to_unicode_or_bust(ivan_uni)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> to_unicode_or_bust(ivan_uni)
<span style="color: #BA2121">u&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\u0107</span><span style="color: #BA2121">&#39;</span>
<span style="color: #666666">&gt;&gt;&gt;</span> to_unicode_or_bust(ivan_utf8)
<span style="color: #BA2121">u&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\u0107</span><span style="color: #BA2121">&#39;</span>
@@ -867,7 +871,7 @@ <h1>2. Unicode everywhere</h1>
<div class="slide" id="encode-late">
<h1>3. Encode late</h1>
<p>Encode to <tt class="docutils literal">&lt;type 'str'&gt;</tt> when you write to disk or <strong>print</strong></p>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> <span style="color: #008000">open</span>(<span style="color: #BA2121">&#39;/tmp/ivan_out.txt&#39;</span>,<span style="color: #BA2121">&#39;w&#39;</span>)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> <span style="color: #008000">open</span>(<span style="color: #BA2121">&#39;/tmp/ivan_out.txt&#39;</span>,<span style="color: #BA2121">&#39;wb&#39;</span>)
<span style="color: #666666">&gt;&gt;&gt;</span> f<span style="color: #666666">.</span>write(ivan_uni<span style="color: #666666">.</span>encode(<span style="color: #BA2121">&#39;utf-8&#39;</span>))
<span style="color: #666666">&gt;&gt;&gt;</span> f<span style="color: #666666">.</span>close()
</pre></div>
@@ -876,7 +880,7 @@ <h1>3. Encode late</h1>
<h1>Shortcuts</h1>
<div class="section" id="codecs-open">
<h2>codecs.open()</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">codecs</span>
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">codecs</span>
<span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> codecs<span style="color: #666666">.</span>open(<span style="color: #BA2121">&#39;/tmp/ivan_utf8.txt&#39;</span>, <span style="color: #BA2121">&#39;r&#39;</span>,
<span style="color: #666666">...</span> encoding<span style="color: #666666">=</span><span style="color: #BA2121">&#39;utf-8&#39;</span>)
<span style="color: #666666">...</span>
@@ -890,7 +894,7 @@ <h2>codecs.open()</h2>
<h1>Shortcuts</h1>
<div class="section" id="id11">
<h2>codecs.open()</h2>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">codecs</span>
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">codecs</span>
<span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> codecs<span style="color: #666666">.</span>open(<span style="color: #BA2121">&#39;/tmp/ivan_utf8.txt&#39;</span>, <span style="color: #BA2121">&#39;w&#39;</span>,
<span style="color: #666666">...</span> encoding<span style="color: #666666">=</span><span style="color: #BA2121">&#39;utf-8&#39;</span>)
<span style="color: #666666">...</span>
@@ -918,7 +922,7 @@ <h1>Python 2 Unicode workarounds</h1>
<li>momentarily encode as UTF-8, then decode immediately</li>
<li>csv documentation shows you how to do this</li>
</ul>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> ivan_bytes <span style="color: #666666">=</span> ivan_uni<span style="color: #666666">.</span>encode(<span style="color: #BA2121">&#39;utf-8&#39;</span>)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> ivan_bytes <span style="color: #666666">=</span> ivan_uni<span style="color: #666666">.</span>encode(<span style="color: #BA2121">&#39;utf-8&#39;</span>)
<span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #408080; font-style: italic"># do stuff</span>
<span style="color: #666666">&gt;&gt;&gt;</span> ivan_bytes<span style="color: #666666">.</span>decode(<span style="color: #BA2121">&#39;utf-8&#39;</span>)
<span style="color: #BA2121">u&#39;Ivan Krsti</span><span style="color: #BB6622; font-weight: bold">\u0107</span><span style="color: #BA2121">&#39;</span>
@@ -942,7 +946,7 @@ <h1>The BOM</h1>
</div>
<div class="slide" id="detecting-the-bom">
<h1>Detecting the BOM</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> <span style="color: #008000">open</span>(<span style="color: #BA2121">&#39;/tmp/ivan_utf16.txt&#39;</span>,<span style="color: #BA2121">&#39;r&#39;</span>)
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> f <span style="color: #666666">=</span> <span style="color: #008000">open</span>(<span style="color: #BA2121">&#39;/tmp/ivan_utf16.txt&#39;</span>,<span style="color: #BA2121">&#39;r&#39;</span>)
<span style="color: #666666">&gt;&gt;&gt;</span> sample <span style="color: #666666">=</span> f<span style="color: #666666">.</span>read(<span style="color: #666666">4</span>)
<span style="color: #666666">&gt;&gt;&gt;</span> sample
<span style="color: #BA2121">&#39;</span><span style="color: #BB6622; font-weight: bold">\xff\xfe</span><span style="color: #BA2121">I</span><span style="color: #BB6622; font-weight: bold">\x00</span><span style="color: #BA2121">&#39;</span>
@@ -953,7 +957,7 @@ <h1>Detecting the BOM</h1>
</div>
<div class="slide" id="id12">
<h1>Detecting the BOM</h1>
<div class="highlight" style="background: #f8f8f8"><pre style="line-height: 125%"><span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">codecs</span>
<div class="highlight"><pre><span style="color: #666666">&gt;&gt;&gt;</span> <span style="color: #008000; font-weight: bold">import</span> <span style="color: #0000FF; font-weight: bold">codecs</span>
<span style="color: #666666">&gt;&gt;&gt;</span> (sample<span style="color: #666666">.</span>startswith(codecs<span style="color: #666666">.</span>BOM_UTF16_LE) <span style="color: #AA22FF; font-weight: bold">or</span>
<span style="color: #666666">...</span> sample<span style="color: #666666">.</span>startswith(codecs<span style="color: #666666">.</span>BOM_UTF16_BE))
<span style="color: #666666">...</span>
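
Reviewer note on the "3. Encode late" hunk: the functional change in this commit is opening the output file with 'wb' instead of 'w', so that the already-encoded UTF-8 bytes go to a binary stream. The snippet below is a minimal sketch of that pattern, not code from the repository. It assumes a temporary directory in place of the /tmp/ivan_out.txt path used in the slides, and the names out_path and written are hypothetical.

```python
import os
import tempfile

# Unicode text, as in the slides (the u'' prefix is also valid on Python 3.3+).
ivan_uni = u'Ivan Krsti\u0107'

# Hypothetical scratch location; the slides write to /tmp/ivan_out.txt.
out_path = os.path.join(tempfile.mkdtemp(), 'ivan_out.txt')

# 'wb' opens a binary stream, so the encoded bytes are written unchanged.
with open(out_path, 'wb') as f:
    f.write(ivan_uni.encode('utf-8'))

# Read the raw bytes back to check that they round-trip.
with open(out_path, 'rb') as f:
    written = f.read()

print(repr(written))
print(written.decode('utf-8') == ivan_uni)
```

With a text-mode 'w' stream the same write can misbehave: Python 3 rejects bytes with a TypeError, and Python 2 on Windows translates newline bytes on the way out. Binary mode avoids both, which appears to be the motivation behind the commit title.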