Merge commit 'c3748819'

ploeh · Feb 19, 2024 · e9453cb · e9453cb
2 parents 387b03e + c374881
commit e9453cb
Show file tree

Hide file tree

Showing 2 changed files with 304 additions and 0 deletions.
diff --git a/_posts/2024-02-02-extracting-data-from-a-small-csv-file-with-haskell.html b/_posts/2024-02-02-extracting-data-from-a-small-csv-file-with-haskell.html
@@ -0,0 +1,304 @@
+---
+layout: post
+title: "Extracting data from a small CSV file with Haskell"
+description: "Statically typed languages are also good for ad-hoc scripting."
+date: 2024-02-02 12:49 UTC
+tags: [Haskell]
+image: "/content/binary/difference-pmf-plot.png"
+image_alt: "Bar chart of the differences PMF."
+---
+{% include JB/setup %}
+
+<div id="post">
+    <p>
+        <em>{{ page.description }}</em>
+    </p>
+    <p>
+        This article is part of a <a href="">short series of articles</a> that compares ad-hoc scripting in <a href="https://www.haskell.org/">Haskell</a> with solving the same problem in <a href="https://www.python.org/">Python</a>. The <a href="">introductory article</a> describes the problem to be solved, so here I'll jump straight into the Haskell code. In the next article I'll give a similar walkthrough of my Python script.
+    </p>
+    <h3 id="0a705367eb2f4080ac168eb1bbe9b2ec">
+        Getting started <a href="#0a705367eb2f4080ac168eb1bbe9b2ec">#</a>
+    </h3>
+    <p>
+        When working with Haskell for more than a few true one-off expressions that I can type into GHCi (the Haskell <a href="https://en.wikipedia.org/wiki/Read%E2%80%93eval%E2%80%93print_loop">REPL</a>), I usually create a module file. Since I'd been asked to crunch some data, and I wasn't feeling very imaginative that day, I just named the module (and the file) <code>Crunch</code>. After some iterative exploration of the problem, I also arrived at a set of imports:
+    </p>
+    <p>
+        <pre><span style="color:blue;">module</span>&nbsp;Crunch&nbsp;<span style="color:blue;">where</span>
+
+<span style="color:blue;">import</span>&nbsp;Data.List&nbsp;(<span style="color:#2b91af;">sort</span>)
+<span style="color:blue;">import</span>&nbsp;<span style="color:blue;">qualified</span>&nbsp;Data.List.NonEmpty&nbsp;<span style="color:blue;">as</span>&nbsp;NE
+<span style="color:blue;">import</span>&nbsp;Data.List.Split
+<span style="color:blue;">import</span>&nbsp;Control.Applicative
+<span style="color:blue;">import</span>&nbsp;Control.Monad
+<span style="color:blue;">import</span>&nbsp;Data.Foldable</pre>
+    </p>
+    <p>
+        As we go along, you'll see where some of these fit in.
+    </p>
+    <p>
+        Reading the actual data file, however, can be done with just the Haskell <code>Prelude</code>:
+    </p>
+    <p>
+        <pre>inputLines&nbsp;=&nbsp;<span style="color:blue;">words</span>&nbsp;&lt;$&gt;&nbsp;<span style="color:blue;">readFile</span>&nbsp;<span style="color:#a31515;">&quot;survey_data.csv&quot;</span></pre>
+    </p>
+    <p>
+        Already now, it's possible to load the module in GHCi and start examining the data:
+    </p>
+    <p>
+        <pre>ghci&gt; :l Crunch.hs
+[1 of 1] Compiling Crunch           ( Crunch.hs, interpreted )
+Ok, one module loaded.
+ghci&gt; length &lt;$&gt; inputLines
+38</pre>
+    </p>
+    <p>
+        Looks good, but reading a text file is hardly the difficult part. The first obstacle, surprisingly, is to split comma-separated values into individual parts. For some reason that I've never understood, the Haskell base library doesn't even include something as basic as <a href="https://learn.microsoft.com/dotnet/api/system.string.split">String.Split</a> from .NET. I could probably hack together a function that does that, but on the other hand, it's available in the <a href="https://hackage.haskell.org/package/split/docs/Data-List-Split.html">split</a> package; that explains the <code>Data.List.Split</code> import. It's just such a bit of a bother that one has to pull in another package just to do that.
+    </p>
+    <h3 id="a70030690c1645a2b0923ad354fa665b">
+        Grades <a href="#a70030690c1645a2b0923ad354fa665b">#</a>
+    </h3>
+    <p>
+        Extracting all the grades are now relatively easy. This function extracts and parses a grade from a single line:
+    </p>
+    <p>
+        <pre><span style="color:#2b91af;">grade</span>&nbsp;<span style="color:blue;">::</span>&nbsp;<span style="color:blue;">Read</span>&nbsp;a&nbsp;<span style="color:blue;">=&gt;</span>&nbsp;<span style="color:#2b91af;">String</span>&nbsp;<span style="color:blue;">-&gt;</span>&nbsp;a
+grade&nbsp;line&nbsp;=&nbsp;<span style="color:blue;">read</span>&nbsp;$&nbsp;splitOn&nbsp;<span style="color:#a31515;">&quot;,&quot;</span>&nbsp;line&nbsp;!!&nbsp;2</pre>
+    </p>
+    <p>
+        It splits the line on commas, picks the third element (zero-indexed, of course, so element <code>2</code>), and finally parses it.
+    </p>
+    <p>
+        One may experiment with it in GHCi to get an impression that it works:
+    </p>
+    <p>
+        <pre>ghci&gt; fmap grade &lt;$&gt; inputLines :: IO [Int]
+[2,2,12,10,4,12,2,7,2,2,2,7,2,7,2,4,2,7,4,7,0,4,0,7,2,2,2,2,2,2,4,4,2,7,4,0,7,2]</pre>
+    </p>
+    <p>
+        This lists all 38 expected grades found in the data file.
+    </p>
+    <p>
+        In the <a href="">introduction article</a> I spent some time explaining how languages with strong type inference don't need type declarations. This makes iterative development easier, because you can fiddle with an expression until it does what you'd like it to do. When you change an expression, often the inferred type changes as well, but there's no programmer overhead involved with that. The compiler figures that out for you.
+    </p>
+    <p>
+        Even so, the above <code>grade</code> function does have a type annotation. How does that gel with what I just wrote?
+    </p>
+    <p>
+        It doesn't, on the surface, but when I was fiddling with the code, there was no type annotation. The Haskell compiler is perfectly happy to infer the type of an expression like
+    </p>
+    <p>
+        <pre>grade&nbsp;line&nbsp;=&nbsp;<span style="color:blue;">read</span>&nbsp;$&nbsp;splitOn&nbsp;<span style="color:#a31515;">&quot;,&quot;</span>&nbsp;line&nbsp;!!&nbsp;2</pre>
+    </p>
+    <p>
+        The human reader, however, is not so clever (I'm not, at least), so once a particular expression settles, and I'm fairly sure that it's not going to change further, I sometimes add the type annotation to aid myself.
+    </p>
+    <p>
+        When writing this, I was debating the didactics of showing the function <em>with</em> the type annotation, against showing it without it. Eventually I decided to include it, because it's more understandable that way. That decision, however, prompted this explanation.
+    </p>
+    <h3 id="754ea5fced264c439f8784705176851b">
+        Binomial choice <a href="#754ea5fced264c439f8784705176851b">#</a>
+    </h3>
+    <p>
+        The next thing I needed to do was to list all pairs from the data file. Usually, <a href="/2022/01/17/enumerate-wordle-combinations-with-an-applicative-functor">when I run into a problem related to combinations, I reach for applicative composition</a>. For example, to list all possible combinations of the first three primes, I might do this:
+    </p>
+    <p>
+        <pre>ghci&gt; liftA2 (,) [2,3,5] [2,3,5]
+[(2,2),(2,3),(2,5),(3,2),(3,3),(3,5),(5,2),(5,3),(5,5)]</pre>
+    </p>
+    <p>
+        You may now protest that this is sampling with replacement, whereas the task is to pick two <em>different</em> rows from the data file. Usually, when I run into that requirement, I just remove the ones that pick the same value twice:
+    </p>
+    <p>
+        <pre>ghci&gt; filter (uncurry (/=)) $ liftA2 (,) [2,3,5] [2,3,5]
+[(2,3),(2,5),(3,2),(3,5),(5,2),(5,3)]</pre>
+    </p>
+    <p>
+        That works great as long as the values are unique, but what if that's not the case?
+    </p>
+    <p>
+        <pre>ghci&gt; liftA2 (,) "foo" "foo"
+[('f','f'),('f','o'),('f','o'),('o','f'),('o','o'),('o','o'),('o','f'),('o','o'),('o','o')]
+ghci&gt; filter (uncurry (/=)) $ liftA2 (,) "foo" "foo"
+[('f','o'),('f','o'),('o','f'),('o','f')]</pre>
+    </p>
+    <p>
+        This removes too many values! We don't want the combinations where the first <code>o</code> is paired with itself, or when the second <code>o</code> is paired with itself, but we <em>do</em> want the combination where the first <code>o</code> is paired with the second, and vice versa.
+    </p>
+    <p>
+        This is relevant because the data set turns out to contain identical rows. Thus, I needed something that would deal with that issue.
+    </p>
+    <p>
+        Now, bear with me, because it's quite possible that what i did do isn't the simplest solution to the problem. On the other hand, I'm reporting what I did, and how I used Haskell to solve a one-off problem. If you have a simpler solution, please <a href="https://github.com/ploeh/ploeh.github.com?tab=readme-ov-file#comments">leave a comment</a>.
+    </p>
+    <p>
+        You often reach for the tool that you already know, so I used a variation of the above. Instead of combining values, I decided to combine row indices instead. This meant that I needed a function that would produce the indices for a particular list:
+    </p>
+    <p>
+        <pre><span style="color:#2b91af;">indices</span>&nbsp;<span style="color:blue;">::</span>&nbsp;<span style="color:blue;">Foldable</span>&nbsp;t&nbsp;<span style="color:blue;">=&gt;</span>&nbsp;t&nbsp;a&nbsp;<span style="color:blue;">-&gt;</span>&nbsp;[<span style="color:#2b91af;">Int</span>]
+indices&nbsp;f&nbsp;=&nbsp;[0&nbsp;..&nbsp;<span style="color:blue;">length</span>&nbsp;f&nbsp;-&nbsp;1]</pre>
+    </p>
+    <p>
+        Again, the type annotation came later. This just produces sequential numbers, starting from zero:
+    </p>
+    <p>
+        <pre>ghci&gt; indices &lt;$&gt; inputLines
+[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
+ 21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37]</pre>
+    </p>
+    <p>
+        Such a function hovers just around the <a href="https://wiki.haskell.org/Fairbairn_threshold">Fairbairn threshold</a>; some experienced Haskellers would probably just inline it.
+    </p>
+    <p>
+        Since row numbers (indices) are unique, the above approach to binomial choice works, so I also added a function for that:
+    </p>
+    <p>
+        <pre><span style="color:#2b91af;">choices</span>&nbsp;<span style="color:blue;">::</span>&nbsp;<span style="color:blue;">Eq</span>&nbsp;a&nbsp;<span style="color:blue;">=&gt;</span>&nbsp;[a]&nbsp;<span style="color:blue;">-&gt;</span>&nbsp;[(a,&nbsp;a)]
+choices&nbsp;=&nbsp;<span style="color:blue;">filter</span>&nbsp;(<span style="color:blue;">uncurry</span>&nbsp;<span style="color:#2b91af;">(/=)</span>)&nbsp;.&nbsp;join&nbsp;(liftA2&nbsp;<span style="color:#2b91af;">(,)</span>)</pre>
+    </p>
+    <p>
+        Combined with <code>indices</code> I can now enumerate all combinations of two rows in the data set:
+    </p>
+    <p>
+        <pre>ghci&gt; choices . indices &lt;$&gt; inputLines
+[(0,1),(0,2),(0,3),(0,4),(0,5),(0,6),(0,7),(0,8),(0,9),...</pre>
+    </p>
+    <p>
+        I'm only showing the first ten results here, because in reality, there are <em>1406</em> such pairs.
+    </p>
+    <p>
+        Perhaps you think that all of this seems quite elaborate, but so far it's only four lines of code. The reason it looks like more is because I've gone to some lengths to explain what the code does.
+    </p>
+    <h3 id="bf2d1d52fb3c4a29b6987840b2d46530">
+        Sum of grades <a href="#bf2d1d52fb3c4a29b6987840b2d46530">#</a>
+    </h3>
+    <p>
+        The above combinations are pairs of <em>indices</em>, not values. What I need is to use each index to look up the row, from the row get the grade, and then sum the two grades. The first parts of that I can accomplish with the <code>grade</code> function, but I need to do if for every row, and for both elements of each pair.
+    </p>
+    <p>
+        While tuples are <code>Functor</code> instances, they only map over the second element, and that's not what I need:
+    </p>
+    <p>
+        <pre>ghci&gt; rows = ["foo", "bar", "baz"]
+ghci&gt; fmap (rows!!) &lt;$&gt; [(0,1),(0,2)]
+[(0,"bar"),(0,"baz")]</pre>
+    </p>
+    <p>
+        While this is just a simple example that maps over the two pairs <code>(0,1)</code> and <code>(0,2)</code>, it illustrates the problem: It only finds the row for each tuple's second element, but I need it for both.
+    </p>
+    <p>
+        On the other hand, a type like <code>(a, a)</code> gives rise to a <a href="/2018/03/22/functors">functor</a>, and while a wrapper type like that is not readily available in the <em>base</em> library, defining one is a one-liner:
+    </p>
+    <p>
+        <pre><span style="color:blue;">newtype</span>&nbsp;Pair&nbsp;a&nbsp;=&nbsp;Pair&nbsp;{&nbsp;unPair&nbsp;::&nbsp;(a,&nbsp;a)&nbsp;}&nbsp;<span style="color:blue;">deriving</span>&nbsp;(<span style="color:#2b91af;">Eq</span>,&nbsp;<span style="color:#2b91af;">Show</span>,&nbsp;<span style="color:#2b91af;">Functor</span>)</pre>
+    </p>
+    <p>
+        This enables me to map over pairs in one go:
+    </p>
+    <p>
+        <pre>ghci&gt; unPair &lt;$&gt; fmap (rows!!) &lt;$&gt; Pair <$&gt; [(0,1),(0,2)]
+[("foo","bar"),("foo","baz")]</pre>
+    </p>
+    <p>
+        This makes things a little easier. What remains is to use the <code>grade</code> function to look up the grade value for each row, then add the two numbers together, and finally count how many occurrences there are of each:
+    </p>
+    <p>
+        <pre>sumGrades&nbsp;ls&nbsp;=
+&nbsp;&nbsp;liftA2&nbsp;<span style="color:#2b91af;">(,)</span>&nbsp;NE.<span style="color:blue;">head</span>&nbsp;<span style="color:blue;">length</span>&nbsp;&lt;$&gt;&nbsp;NE.group
+&nbsp;&nbsp;&nbsp;&nbsp;(sort&nbsp;(<span style="color:blue;">uncurry</span>&nbsp;<span style="color:#2b91af;">(+)</span>&nbsp;.&nbsp;unPair&nbsp;.&nbsp;<span style="color:blue;">fmap</span>&nbsp;(grade&nbsp;.&nbsp;(ls&nbsp;!!))&nbsp;.&nbsp;Pair&nbsp;&lt;$&gt;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;choices&nbsp;(indices&nbsp;ls)))</pre>
+    </p>
+    <p>
+        You'll notice that this function doesn't have a type annotation, but we can ask GHCi if we're curious:
+    </p>
+    <p>
+        <pre>ghci&gt; :t sumGrades
+sumGrades :: (Ord a, Num a, Read a) =&gt; [String] -&gt; [(a, Int)]</pre>
+    </p>
+    <p>
+        This enabled me to get a count of each sum of grades:
+    </p>
+    <p>
+        <pre>ghci&gt; sumGrades &lt;$&gt; inputLines
+[(0,6),(2,102),(4,314),(6,238),(7,48),(8,42),(9,272),(10,6),
+ (11,112),(12,46),(14,138),(16,28),(17,16),(19,32),(22,4),(24,2)]</pre>
+    </p>
+    <p>
+        The way to read this is that the sum <em>0</em> occurs six times, <em>2</em> appears <em>102</em> times, etc.
+    </p>
+    <p>
+        There's one remaining task to accomplish before we can produce a PMF of the sum of grades: We need to enumerate the range, because, as it turns out, there are sums that are possible, but that don't appear in the data set. Can you spot which ones?
+    </p>
+    <p>
+        Using tools already covered, it's easy to enumerate all possible sums:
+    </p>
+    <p>
+        <pre>ghci&gt; import Data.List
+ghci&gt; sort $ nub $ (uncurry (+)) &lt;$&gt; join (liftA2 (,)) [-3,0,2,4,7,10,12]
+[-6,-3,-1,0,1,2,4,6,7,8,9,10,11,12,14,16,17,19,20,22,24]</pre>
+    </p>
+    <p>
+        The sums <em>-6</em>, <em>-3</em>, <em>-1</em>, and more, are possible, but don't appear in the data set. Thus, in the PMF for two randomly picked grades, the probability that the sum is <em>-6</em> is <em>0</em>. On the other hand, the probability that the sum is <em>0</em> is <em>6/1406 ~ 0.004267</em>, and so on.
+    </p>
+    <h3 id="ddcac27fb3ff468cb9312f0fcc333865">
+        Difference of experience levels <a href="#ddcac27fb3ff468cb9312f0fcc333865">#</a>
+    </h3>
+    <p>
+        The other question posed in the assignment was to produce the PMF for the absolute difference between two randomly selected students' experience levels.
+    </p>
+    <p>
+        Answering that question follows the same mould as above. First, extract experience level from each data row, instead of the grade:
+    </p>
+    <p>
+        <pre><span style="color:#2b91af;">experience</span>&nbsp;<span style="color:blue;">::</span>&nbsp;<span style="color:blue;">Read</span>&nbsp;a&nbsp;<span style="color:blue;">=&gt;</span>&nbsp;<span style="color:#2b91af;">String</span>&nbsp;<span style="color:blue;">-&gt;</span>&nbsp;a
+experience&nbsp;line&nbsp;=&nbsp;<span style="color:blue;">read</span>&nbsp;$&nbsp;splitOn&nbsp;<span style="color:#a31515;">&quot;,&quot;</span>&nbsp;line&nbsp;!!&nbsp;3</pre>
+    </p>
+    <p>
+        Since I was doing an ad-hoc script, I just copied the <code>grade</code> function and changed the index from <code>2</code> to <code>3</code>. Enumerating the experience differences were also a close copy of <code>sumGrades</code>:
+    </p>
+    <p>
+        <pre>diffExp&nbsp;ls&nbsp;=
+&nbsp;&nbsp;liftA2&nbsp;<span style="color:#2b91af;">(,)</span>&nbsp;NE.<span style="color:blue;">head</span>&nbsp;<span style="color:blue;">length</span>&nbsp;&lt;$&gt;&nbsp;NE.group
+&nbsp;&nbsp;&nbsp;&nbsp;(sort&nbsp;(<span style="color:blue;">abs</span>&nbsp;.&nbsp;<span style="color:blue;">uncurry</span>&nbsp;<span style="color:#2b91af;">(-)</span>&nbsp;.&nbsp;unPair&nbsp;.&nbsp;<span style="color:blue;">fmap</span>&nbsp;(experience&nbsp;.&nbsp;(ls&nbsp;!!))&nbsp;.&nbsp;Pair&nbsp;&lt;$&gt;
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;choices&nbsp;(indices&nbsp;ls)))</pre>
+    </p>
+    <p>
+        Running it in the REPL produces some other numbers, to be interpreted the same way as above:
+    </p>
+    <p>
+        <pre>ghci&gt diffExp &lt;$&gt inputLines
+[(0,246),(1,472),(2,352),(3,224),(4,82),(5,24),(6,6)]</pre>
+    </p>
+    <p>
+        This means that the difference <em>0</em> occurs <em>246</em> times, <em>1</em> appears <em>472</em> times, and so on. From those numbers, it's fairly simple to set up the PMF.
+    </p>
+    <h3 id="948660097f7748f2844ebfd91371b2a2">
+        Figures <a href="#948660097f7748f2844ebfd91371b2a2">#</a>
+    </h3>
+    <p>
+        Another part of the assignment was to produce plots of both PMFs. I don't know how to produce figures with Haskell, and since the final results are just a handful of numbers each, I just copied them into a text editor to align them, and then pasted them into Excel to produce the figures there.
+    </p>
+    <p>
+        Here's the PMF for the differences:
+    </p>
+    <p>
+        <img src="/content/binary/difference-pmf-plot.png" alt="Bar chart of the differences PMF.">
+    </p>
+    <p>
+        I originally created the figure with Danish labels. I'm sure that you can guess what <em>differens</em> means, and <em>sandsynlighed</em> means <em>probability</em>.
+    </p>
+    <h3 id="9b225242b63e4616b25954ad9141e273">
+        Conclusion <a href="#9b225242b63e4616b25954ad9141e273">#</a>
+    </h3>
+    <p>
+        In this article you've seen the artefacts of an ad-hoc script to extract and analyze a small data set. While I've spent quite a few words to explain what's going on, the entire <code>Crunch</code> module is only 34 lines of code. Add to that a few ephemeral queries done directly in GHCi, but never saved to a file. It's been some months since I wrote the code, but as far as I recall, it took me a few hours all in all.
+    </p>
+    <p>
+        If you do stuff like this every day, you probably find that appalling, but data crunching isn't really my main thing.
+    </p>
+    <p>
+        Is it quicker to do it in Python? Not for me, it turns out. It also took me a couple of hours to repeat the exercise in Python.
+    </p>
+    <p>
+        <strong>Next:</strong> Extracting data from a small CSV file with Python.
+    </p>
+</div>
diff --git a/content/binary/difference-pmf-plot.png b/content/binary/difference-pmf-plot.png