# Swedes have the best Haskell: Jumping to conclusions with barroom data science 

_This is a Jupyter notebook._

_**Disclaimer**:..._

A few weeks ago I ordered a copy of [Beautiful Visualization](http://shop.oreilly.com/product/0636920000617.do), which gathers a few stories from data visualization experts. Not only is it a great read, but it prompted me to give data visualization a try. Of course, you first need data, and I couldn't really find a way to display existing data in a way that hadn't been done a thousand times before.

One evening, I was having a drink with a fellow Haskeller, and as we enumerated the names of other Haskellers living in Zurich, something struck me as odd: Zurich is a small city, in a small country. That sounded like too many Haskellers. Are there actually many more people writing Haskell than I thought, or is there an unusually high concentration in Switzerland? So I had a question, and this is a short summary of my journey to get the answer.

## Getting ready

There are several topics discussed in this post. First, we'll play a bit with the tools and services I used to get the data. Then, we'll create a few helpers to make it more convenient to handle the data in Haskell. Next, we'll pull the data and process it a bit. Finally, we'll display the results.

### The setup
Below is a list of the tools and services used throughout this post. I'll introduce them in more depth as we go:

* [stack](https://docs.haskellstack.org/en/stable/README/) and its [nix integration](https://docs.haskellstack.org/en/stable/nix_integration/#nix-integration) to handle and build the Haskell and system dependencies
* [ihaskell](https://github.com/gibiansky/IHaskell) for running the code with the convenience of a [Jupyter](http://jupyter.org/) notebook
* [curl](https://curl.haxx.se/) and [jq](https://stedolan.github.io/jq/) for discovering the APIs

Web services to get the data from:

* [github](https://developer.github.com/v3/) to get some data about Haskell users and their location
* [geonames](http://www.geonames.org/) for country data
* [amcharts](https://www.amcharts.com/) for getting the maps shown in the final infographic

On the Haskell side we'll use

* [wreq](https://hackage.haskell.org/package/wreq) for making the HTTP calls
* [aeson](https://hackage.haskell.org/package/aeson) for decoding the data
* [timeit](https://hackage.haskell.org/package/timeit) for timing a thing or two
* [conduit](https://hackage.haskell.org/package/conduit) for convenience when handling the data
* [HaskellR](https://tweag.github.io/HaskellR/) and its suite of libraries for displaying the data

## Getting a feel for the APIs

Now that that's out of the way, let's get started! We'll need some raw material: a list of GitHub users. The [GitHub API](https://developer.github.com/v3/) provides us with three endpoints that will come in handy: a [repository search](https://developer.github.com/v3/search/#search-repositories) endpoint, a [repository contributors](https://developer.github.com/v3/repos/#list-contributors) endpoint and a [users](https://developer.github.com/v3/users/) endpoint. We'll get a list of Haskell repositories sorted by stars and source our users from there. Hopefully this will give us a pool of people who are somewhat involved in the community. We'll try to get a big enough sample so that we don't _only_ consider people who contributed to "star" repos.

### The GitHub API

Here's how the repository [search API](https://developer.github.com/v3/search/#search-repositories) presents itself:

```
GET /search/repositories
```

with three parameters: `q`, the search string, and `sort` and `order`, which specify the order in which the results should appear (have a look at the API documentation if you're curious about the different options). We only care about the Haskell repositories, so we'll format the search string like this: `q=language:haskell`. As mentioned, we want the repositories to be sorted by stars, so we'll use `sort=stars` (and hope GitHub provides a sensible default ordering). Let's get `curl` out:

``` shell
$ curl -s "https://api.github.com/search/repositories?q=language:haskell&sort=stars" | head

{
  "total_count": 57197,
  "incomplete_results": false,
  "items": [
    {
      "id": 571770,
      "name": "pandoc",
      "full_name": "jgm/pandoc",
      "owner": {
        "login": "jgm",
        
```

Well, it looks like it worked! We don't really want to go back and forth between the GitHub API description page and our terminal, so we'll just try to infer what the format is. We just need some bits of information; we're not here to write a complete GitHub API client in Haskell. So what do we have? This looks like a JSON object. Sure. It has an `items` field, whose value is a JSON array. The elements seem to be... well, each one looks like a repository to me! This suggests that [`pandoc`](http://pandoc.org/) is the most-starred Haskell repository, which I'm inclined to believe. Another observation: the owner's `login` (i.e. username) and repo `name` are both provided, but we're mostly interested in the combination of the two: the repo's `full_name`.

Let's bring out [`jq`](https://stedolan.github.io/jq/manual/), a tool that I discovered much too late, and that I can only recommend you start using right away. `jq` is _the_ Swiss Army knife when it comes to playing with JSON from your terminal. Its simplest use is to pretty-print JSON, but it also allows you to filter and update JSON. Here's how to list the names of the first five repositories:

``` shell
$ curl -s "https://api.github.com/search/repositories?q=language:haskell&sort=stars" | jq '.items | .[0:5] | .[] | .full_name'

"jgm/pandoc"
"begriffs/postgrest"
"koalaman/shellcheck"
"elm-lang/elm-compiler"
"purescript/purescript"
```

Looks like we got it right! We first index into the JSON object with the `items` field, and take five elements from the array (from `0` inclusive to `5` exclusive). Then we traverse the elements (`.[]`), keeping only the `full_name` of each. So far, so good; the GitHub API seems well crafted.

Next thing on our list: getting a list of the people who contribute to a particular repo. Here's what the [GitHub API documentation](https://developer.github.com/v3/repos/#list-contributors) recommends:

```
GET /repos/:owner/:repo/contributors
```

Let's try with pandoc:

``` shell
$ curl -s 'https://api.github.com/repos/jgm/pandoc/contributors' | head

[
  {
    "login": "jgm",
    "id": 3044,
    "avatar_url": "https://avatars.githubusercontent.com/u/3044?v=3",
    "gravatar_id": "",
    "url": "https://api.github.com/users/jgm",
    "html_url": "https://github.com/jgm",
    "followers_url": "https://api.github.com/users/jgm/followers",
    "following_url": "https://api.github.com/users/jgm/following{/other_user}",
```

Okay, it appears that this time we're getting a JSON array rather than an object. This is just making our life easier:

``` shell
$ curl -s 'https://api.github.com/repos/jgm/pandoc/contributors' | jq '.[0:5] | .[] | .login'

"jgm"
"jkr"
"tarleb"
"mpickering"
"lierdakil"
```

Nice! However, we're still missing a piece of information: where do those users live? Unfortunately this is not something that the contributors API can tell us. We'll use a third GitHub API endpoint, namely the one that you use every time you browse someone's profile:


``` shell
$ curl -s 'https://api.github.com/users/jgm'   

{
  "login": "jgm",
  "id": 3044,
...
  "location": "Berkeley, CA",
...
}

$ curl -s 'https://api.github.com/users/jgm' | jq '.location'

"Berkeley, CA"
```

Now is a good time to think about what "it's just metadata" means. Ok, let's continue.

### The geonames API

One problem with the data we gathered from GitHub is that we don't get the user's country directly. For instance, in which country is "Berkeley, CA" located? Well, "CA" is probably California, and California is in the US. Ok. But we don't really want to have to do that for every single user. Rather, we'll let someone else do that for us: enter [geonames.org](http://geonames.org). What is GeoNames?

> The GeoNames geographical database covers all countries and contains over eleven million placenames that are available for download free of charge.

Sounds good. But how's that going to help, you ask? Well, they too provide a [search API](http://www.geonames.org/export/geonames-search.html). It looks something like this:

```
GET api.geonames.org/search?
```

to which we can pass a `q` parameter. Let's try something simple:

``` shell
$ curl -s 'api.geonames.org/searchJSON?q=France&username=demo' | jq '.' | head

{
  "totalResultsCount": 144895,
  "geonames": [
    {
      "adminCode1": "00",
      "lng": "2",
      "geonameId": 3017382,
      "toponymName": "Republic of France",
      "countryId": "3017382",
      "fcl": "A",
```

The JSON object returned is not pretty-printed, as opposed to GitHub's API. That's no problem, `jq` saves the day again. There's another subtlety involved: you have to pass in a username. Thankfully GeoNames provides the `demo` user for ... demo purposes. Anyhow, looks like we can get some information about France, and more importantly we can guess the format: it's a JSON object that contains a field `geonames`, which contains an array. Let's investigate a bit further and see what kind of elements are contained in the array by inspecting the first element:

``` shell
$ curl -s 'api.geonames.org/searchJSON?q=France&username=demo' | jq ' .geonames | .[0]'

{
  "adminCode1": "00",
  "lng": "2",
  "geonameId": 3017382,
  "toponymName": "Republic of France",
  "countryId": "3017382",
  "fcl": "A",
  "population": 64768389,
  "countryCode": "FR",
  "name": "France",
  "fclName": "country, state, region,...",
  "countryName": "France",
  "fcodeName": "independent political entity",
  "adminName1": "",
  "lat": "46",
  "fcode": "PCLI"
}
```

So there are a few interesting fields. Let's see what comes up when we perform a search on "Berkeley, CA":

``` shell
$ curl -s 'api.geonames.org/searchJSON?q=Berkeley,%20CA&username=demo' | jq ' .geonames | .[0]'

{
  "adminCode1": "CA",
  "lng": "-122.27275",
  "geonameId": 5327684,
  "toponymName": "Berkeley",
  "countryId": "6252001",
  "fcl": "P",
  "population": 112580,
  "countryCode": "US",
  "name": "Berkeley",
  "fclName": "city, village,...",
  "countryName": "United States",
  "fcodeName": "populated place",
  "adminName1": "California",
  "lat": "37.87159",
  "fcode": "PPL"
}
```

You guessed it: we're going to throw our location strings at geonames and grab the country name!

``` shell
$ curl -s 'api.geonames.org/searchJSON?q=Berkeley,%20CA&username=demo' | jq ' .geonames | .[0] | .countryName'

"United States"
```

This is by no means bulletproof, but it's quite enough for our needs. And it's actually pretty good; I encourage you to try out a few locations:

``` shell
$ curl -s 'api.geonames.org/searchJSON?q=Zurich&username=demo' | jq ' .geonames | .[0] | .countryName'

"Switzerland"

$ curl -s 'api.geonames.org/searchJSON?q=Manchester&username=demo' | jq ' .geonames | .[0] | .countryName'

"United Kingdom"

$ curl -s 'api.geonames.org/searchJSON?q=Brno&username=demo' | jq ' .geonames | .[0] | .countryName'

"Czechia"
```

And believe it or not, when you search for a country, you actually get the population as well:

``` shell
$ curl -s 'api.geonames.org/searchJSON?q=Germany&username=demo' | jq ' .geonames | .[0] | .population'

81802257
```

Quick recap: we are able to get a list of Haskell repositories from GitHub. From those repositories, we're able to draw a pool of users. And once you have a user, you can easily get their location, which might be a city, state, or country. Once we have such a general location, we can query GeoNames to get the country name. And with the country name, we can get the country population. Looks like we have everything we need, let's write some Haskell!

## Querying the APIs with Haskell

As you'll see, this will be a pretty straightforward step. We only have to translate the `curl` and `jq` commands into Haskell code. We'll leverage [`wreq`](https://hackage.haskell.org/package/wreq) and [`lens-aeson`](https://hackage.haskell.org/package/lens-aeson), the package that provides the `Data.Aeson.Lens` module, and it'll feel very natural.

There's one thing I want to mention: we'll be as lazy as possible when it comes to imports and language extensions. We'll take a very call-by-need approach; we're not expected to know right from the beginning what we'll need down the line. The same goes for class instances of the datatypes we'll introduce: we'll rely on `StandaloneDeriving` when we actually need an instance. Hopefully it will also make it clearer where and why we introduce a new import/library/instance (there's one caveat though: if a package is missing, you have to `stack install` it and restart the kernel).

This is going to be `wreq`-heavy; fortunately the API is very straightforward. You basically call `get` on a URL. Then you get a few lenses to play with the result:

In [2]:
{-# LANGUAGE OverloadedStrings #-}

import Network.Wreq
import Data.Aeson.Lens
import Control.Lens

r <- get "https://api.github.com/search/repositories?q=language:haskell&sort=stars"
mapM_ print $ take 5 $ r ^.. responseBody . key "items" . values . key "full_name" . _String

"jgm/pandoc"
"begriffs/postgrest"
"koalaman/shellcheck"
"purescript/purescript"
"elm-lang/elm-compiler"

Okay, what just happened? The URL should look familiar: we're querying GitHub's search API and asking it to give us the most-starred repos. `r` is the response we get from performing the request. We could look at its body by `view`ing `responseBody` with `r ^. responseBody` and then decode it with the usual `aeson` functions; however we can do more. The funny `^..` operator comes from the [lens](https://www.stackage.org/haddock/lts-7.16/lens-4.14/Control-Lens-Fold.html#v:-94-..) library; we'll come back to that in a sec. Let's get back our `jq` filter from earlier, for reference:

``` shell
$ jq '.items | .[0:5] | .[] | .full_name'
```
It's a bit different, but not that much. In Haskell we request the `key` `"items"`, and use `values` to traverse its elements, very much like you would tell `jq` to filter `.items | .[]`. Since we're traversing, we're operating on one element at a time, of which we want the `key` `"full_name"`. Finally, since `lens` is very much typed, we have to tell it that we're expecting a string (actually a `Text`) by asking for the `_String` of the element. The funky `(^..)` operator is very much like `(^.)`, but for when you're expecting a list of results, which is the case here because of `values`. The `take 5` plays the role of `jq`'s `.[0:5]`.

Okay, let's list `pandoc`'s contributors:

In [3]:
r <- get "https://api.github.com/repos/jgm/pandoc/contributors"
mapM_ print $ take 5 $ r ^.. responseBody . values . key "login" . _String

"jgm"
"jkr"
"tarleb"
"mpickering"
"labdsf"

No surprise, that's still working. What about the location of a particular user?

In [4]:
r <- get "https://api.github.com/users/jgm"
r ^?! responseBody . key "location" . _String

"Berkeley, CA"

And what about GeoNames?

In [6]:
r <- get "http://api.geonames.org/searchJSON?q=Berkeley,%20CA&username=demo"
r ^?! responseBody . key "geonames" . nth 0 . key "countryName" . _String

"United States"

In [9]:
r <- get "http://api.geonames.org/searchJSON?q=United%20States&username=demo"
print $ r ^?! responseBody . key "geonames" . nth 0 . key "countryName" . _String
print $ r ^?! responseBody . key "geonames" . nth 0 . key "population" . _Integer

"United States"

310232863

Oh yeah. Notice that this time we're using `nth 0` instead of `values`, because we only care about the first element. The `(^?!)` just says "There _might_ be such an element, I don't care. Just crash if you can't find such an element". If you would rather get a `Maybe` back, use `(^?)`. Here we don't really care, crashing is fine. We'll just tweak our query and rerun the cell. That's the great power of the notebook (and GHCi, of course).

Ok, so we've convinced ourselves that we can express those various APIs using Haskell (which shouldn't really come as a surprise). Now let's type up. We'll define a few basic datatypes, but we'll once again be as lazy as possible and leave most of the typeclass instances for later. Let's see what we've got:

In [10]:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Data.String (IsString)
import qualified Data.Text as T

newtype GithubRepo        = GithubRepo        T.Text deriving (IsString, Show) 
newtype GithubUser        = GithubUser        T.Text deriving (IsString, Show)
newtype CountryName       = CountryName       T.Text deriving (IsString, Show)
newtype CountryPopulation = CountryPopulation Int    deriving Show
data Country = Country { 
          countryName :: CountryName
        , countryPopulation :: CountryPopulation 
        } deriving Show

This should be pretty self-explanatory. The only instance that we derive for all the datatypes is `Show`, because we'll constantly be printing stuff out. We're also deriving an `IsString` instance for some of the types; those will allow us to go straight from a stringy thing (`"nmattia"`) to the typed data (`GithubUser "nmattia"`) without having to remember what the constructor's name is. Next, we'll pull the API endpoints as top-level definitions:

In [11]:
import Data.Monoid

githubApiSearchRepos :: T.Text
githubApiSearchRepos = "https://api.github.com/search/repositories"

githubApiRepos :: GithubRepo -> T.Text
githubApiRepos (GithubRepo repo) = "https://api.github.com/repos/" <> repo

githubApiUsers :: GithubUser -> T.Text
githubApiUsers (GithubUser login) = "https://api.github.com/users/" <> login

githubApiRepoContributors :: GithubRepo -> T.Text
githubApiRepoContributors repo = githubApiRepos repo <> "/contributors"

geonamesApiSearchJSON :: T.Text
geonamesApiSearchJSON = "http://api.geonames.org/searchJSON"

We've used the datatypes defined above, which will prevent us from requesting the top repositories of a repository, for instance.
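To make that claim concrete, here is a minimal, dependency-free sketch of the same idea. It uses `String` instead of `Text`, and the names `RepoName`, `UserName`, `contributorsUrl` and `pandocContributors` are hypothetical stand-ins rather than the definitions used in the rest of this post:

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString)

-- Hypothetical String-based stand-ins for the Text-based newtypes above.
newtype RepoName = RepoName String deriving (IsString, Show)
newtype UserName = UserName String deriving (IsString, Show)

-- Only a RepoName can be turned into a contributors URL.
contributorsUrl :: RepoName -> String
contributorsUrl (RepoName repo) =
    "https://api.github.com/repos/" ++ repo ++ "/contributors"

-- The string literal becomes a RepoName through its IsString instance.
pandocContributors :: String
pandocContributors = contributorsUrl "jgm/pandoc"

-- Rejected by the type checker, since a UserName is not a RepoName:
-- bad = contributorsUrl (UserName "jgm")
```

Uncommenting the last line should produce a type error at compile time, which is exactly the kind of mix-up the newtypes are there to catch.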

Now we'll write some function wrappers to access the APIs. There are some requests that we will potentially send quite a few times, and as such the default limitations of Github and Geonames are not going to be sufficient. With Geonames you can [create a free user](http://www.geonames.org/login), which will allow you to perform up to 30,000 requests a day. The only thing you then need to perform the requests is the username, as we did with `username=demo` above. I keep mine secretly in a file called `.geonames-username`:

In [13]:
import qualified Data.Text.IO as T

-- T.strip guards against a trailing newline in the file
geonamesUsername <- T.strip <$> T.readFile ".geonames-username"

Next is the `findCountryName` function, which finds a country given a location. It looks very much like the one we drafted earlier. One important thing to notice is that we're now using wreq's mechanisms for encoding query parameters using the `param` lens:

In [14]:
findCountryName :: T.Text -> IO (Maybe CountryName)
findCountryName place = do
    let opts = defaults & param "q"        .~ [place]
                        & param "username" .~ [geonamesUsername]
    r <- getWith opts $ T.unpack geonamesApiSearchJSON
    return $ CountryName <$> (r ^? responseBody . key "geonames" . nth 0 . key "countryName" . _String)

I first implemented a function that returned a `CountryName` rather than a `Maybe CountryName`, but it turns out that some Github users live in very strange countries (looking at you, people of "Where do you want me to be?"). Anyway, let's try it out:

In [15]:
findCountryName "Zurich"

All good. Now we'll define `countryByCountryName`, which, given a `CountryName`, basically fetches the population and creates a `Country` object (if everything goes well):

In [16]:
countryByCountryName :: CountryName -> IO (Maybe Country)
countryByCountryName (CountryName n) = do
    let opts = defaults & param "q"        .~ [n]
                        & param "username" .~ [geonamesUsername]
    r <- getWith opts $ T.unpack geonamesApiSearchJSON
    return $ do name       <- r ^? responseBody . key "geonames" . nth 0 . key "countryName" . _String
                population <- r ^? responseBody . key "geonames" . nth 0 . key "population" . _Integer
                return Country { countryName       = CountryName name
                               , countryPopulation = CountryPopulation $ fromIntegral population }

In [17]:
countryByCountryName "Switzerland"

Sweet! Did you know that there are about a million people more in London than in the whole of Switzerland?

Geonames is taken care of, let's move to Github. Same here: I created an API [token](https://help.github.com/articles/creating-an-access-token-for-command-line-use/), which bumps the request limits a bit. We'll use that to feed wreq's authentication mechanisms ([oauth2Token](https://www.stackage.org/haddock/lts-7.19/wreq-0.4.1.0/Network-Wreq.html#v:oauth2Token) in this case). As it turns out, those prefer ByteStrings over Text. Oh well.

In [26]:
import qualified Data.ByteString as BS

githubAuth <- oauth2Token <$> BS.readFile ".github-api-token"

In [27]:
githubUserCountryName :: GithubUser -> IO (Maybe CountryName)
githubUserCountryName user = do
    r <- getWith opts $ T.unpack $ githubApiUsers user
    maybe (return Nothing) 
          (findCountryName)
          (r ^? responseBody . key "location" . _String)
    where
        opts = defaults & auth ?~ githubAuth

Same as with Geonames, we're basically wrapping what we did earlier in more type-safe functions. The wreq library handles Github's oauth authentication out of the box, which is great. Let's try:

In [28]:
githubUserCountryName "nmattia"

The `IsString` instance that we derived for `GithubUser` is coming in really handy here (have a try with your Github nick). Next, we'll want to also get the user's country's population:

In [29]:
{-# LANGUAGE LambdaCase #-}

githubUserCountry :: GithubUser -> IO (Maybe Country)
githubUserCountry user = githubUserCountryName user >>= \case
    Just cname -> countryByCountryName cname
    Nothing    -> return Nothing

In [30]:
githubUserCountry "nmattia"

Cool cool. You might now be thinking:

> Well, it works and all, but it'll get costly if we have to perform requests over and over.

And you'd be right! Let's build us a little cache for great good:

In [31]:
import Control.Concurrent.MVar
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HMS

-- Note: some results might be fetched twice if accessed concurrently
cacheForever :: (Eq a, Hashable a) => (a -> IO b) -> IO (a -> IO b)
cacheForever f = do
    mvar <- newMVar HMS.empty
    return $ \k -> do mv <- HMS.lookup k <$> readMVar mvar
                      case mv of
                          Just v -> return v
                          Nothing -> do v <- f k
                                        modifyMVar_ mvar (return . HMS.insert k v)
                                        return v

I'm not going to go into the implementation, but it's pretty straightforward. When you pass it a function `(a -> IO b)` it spits out another one with the same signature. However that new function will store the results in a hash map, and will perform a lookup before running the function on a new value. One very important thing to note: we never expire the values. Do. Not. Use. This. In. Production. Instead go see Jasper Van der Jeugt's post on [Writing an LRU cache in Haskell](https://jaspervdj.be/posts/2015-02-24-lru-cache.html) (or simply cap the HashMap's size, I don't care, but don't use this particular implementation in production).
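To check the caching behaviour without hitting the network, here is a small sketch under a few assumptions: it swaps the `HashMap` for a `Data.Map` (so it only needs `containers`), and `cacheForeverMap`, `slowLength` and `countCalls` are hypothetical names made up for this illustration. It counts how many times the wrapped action really runs:

```haskell
import Control.Concurrent.MVar
import Data.IORef
import qualified Data.Map.Strict as M

-- Same idea as cacheForever, but keyed by Ord instead of Hashable.
cacheForeverMap :: Ord a => (a -> IO b) -> IO (a -> IO b)
cacheForeverMap f = do
    mvar <- newMVar M.empty
    return $ \k -> do
        cached <- M.lookup k <$> readMVar mvar
        case cached of
            Just v  -> return v
            Nothing -> do
                v <- f k
                modifyMVar_ mvar (return . M.insert k v)
                return v

-- Call the cached function three times with two distinct keys and
-- report how often the underlying action was actually executed.
countCalls :: IO Int
countCalls = do
    counter <- newIORef (0 :: Int)
    let slowLength :: String -> IO Int
        slowLength s = modifyIORef' counter (+ 1) >> return (length s)
    cached <- cacheForeverMap slowLength
    _ <- cached "nmattia"
    _ <- cached "nmattia"   -- served from the cache
    _ <- cached "jgm"
    readIORef counter
```

With two distinct keys, `countCalls` should report two executions of the underlying action, even though the cached function was called three times.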

How would we use `cacheForever`, you ask? Simple:

In [32]:
{-# LANGUAGE StandaloneDeriving #-}

deriving instance Eq       GithubUser
deriving instance Hashable GithubUser

githubUserCountry' <- cacheForever githubUserCountry

Using the very handy [timeit](https://www.stackage.org/lts-7.19/package/timeit-1.0.0.0) package, we can (more or less) convince ourselves that it does what we expect. On the first request:

In [33]:
import System.TimeIt

timeIt $ githubUserCountry' "nmattia"

About `100ms` for the request, ok, and on the second one:

In [24]:
timeIt $ githubUserCountry' "nmattia"

CPU time:   0.00s
Just (Country {countryName = CountryName "Switzerland", countryPopulation = CountryPopulation 7581000})

Hurray! Zero time. _(Note: it looks like `timeIt` calculates the CPU time, while we'd be more interested in the wall clock time. Whatever, it's just to show you that it at least kind of works.)_
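If wall clock time is what you're after, one possible approach is to read a monotonic clock before and after the action. The sketch below assumes a `base` recent enough to ship `GHC.Clock` (4.11 or later); `timeWall` and `napDemo` are hypothetical helpers, not part of the `timeit` package:

```haskell
import Control.Concurrent (threadDelay)
import GHC.Clock (getMonotonicTime)

-- Run an action and return the elapsed wall clock time (in seconds)
-- together with the action's result.
timeWall :: IO a -> IO (Double, a)
timeWall action = do
    start  <- getMonotonicTime
    result <- action
    end    <- getMonotonicTime
    return (end - start, result)

-- Sleeping uses almost no CPU time, but it does take wall clock time.
napDemo :: IO (Double, ())
napDemo = timeWall (threadDelay 50000)  -- sleep for 50 ms
```

Something like `timeWall (githubUserCountry' "nmattia")` would then include the time spent waiting for the network, which CPU time does not account for.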

That was it for the one-off requests, but now we have a bigger challenge on our hands: listing many contributors. Why challenge, you say? Github won't allow you to read _all_ of the contributors of a project with a single request (and you probably wouldn't want that anyway). Instead they use a pagination system, which is described [here](https://developer.github.com/guides/traversing-with-pagination/). As it turns out, wreq once again comes with support for Github's pagination:

In [25]:
r <- get "https://api.github.com/repos/jgm/pandoc/contributors"
r ^? responseLink "rel" "next"

Just (Link {linkURL = "https://api.github.com/repositories/571770/contributors?page=2", linkParams = [("rel","next")]})

Basically, you can check whether there is a link to a next page (`Just Link`) or if you're at the end (`Nothing`). Cool. How should we use that, then? Well, let's get a bit crazy. We could simply write a function

``` haskell
topRepos :: Int -> IO [GithubRepo]
topRepos = ...
```

that returns a given number of repos. But let's say (as we'll do later) that we want a specific number of _users_, not _repos_. Then you'd have to make sure that you requested enough repos to source your users from. Not ideal. Alternatively, here's another thing we could do:

``` haskell
topRepos :: IO [GithubRepo]
topRepos = ...
```

You just throw an `unsafeInterleaveIO` somewhere in there and it works, right? To be honest I'm not too sure. Let's go a bit crazy, but not too crazy. Let's use Conduits!

In [26]:
{-# LANGUAGE RankNTypes #-}

import Control.Monad.IO.Class (liftIO)
import Data.ByteString.Lens
import Data.Conduit
import qualified Data.Conduit.Combinators as C

topRepos :: T.Text -> Producer IO GithubRepo
topRepos language = go (getWith opts $ T.unpack $ githubApiSearchRepos)
    where 
        go req = do
            r <- liftIO req
            C.yieldMany $ GithubRepo <$> r ^.. responseBody . key "items" . values . key "full_name" . _String
            case r ^? responseLink "rel" "next" of
                Just link -> go (getWith opts $ link ^. linkURL . unpackedChars)
                Nothing -> return ()
        opts = defaults & param "q"        .~ ["language:" <> language]
                        & param "sort"     .~ ["stars"]
                        & param "per_page" .~ ["100"]
                        & auth             ?~ githubAuth

`topRepos` takes a language name (like, say, `"haskell"`) and reads the top projects from Github. Easy peasy: we perform a request, and try to `yield` the project names that we received. If downstream doesn't want any more values, well, we block there forever and nothing happens. But if it does, we'll look up the link to the next page and repeat. And if it turns out that Github has no more values, we return, and too bad for downstream. Let's see how it works in practice:

In [27]:
topRepos "haskell" $$ C.take 5 =$ C.mapM_ print

GithubRepo "jgm/pandoc"
GithubRepo "begriffs/postgrest"
GithubRepo "koalaman/shellcheck"
GithubRepo "elm-lang/elm-compiler"
GithubRepo "purescript/purescript"

Here's the control flow: the "downstream" functions (the ones on the right) will ask for results from upstream (the ones on the left). The rightmost part is `C.mapM_ print`, which will print any value it manages to receive. And it's very greedy, it'll keep asking for values from upstream. And the values it receives come from `C.take 5`. That one is quite greedy as well, but doesn't have as much appetite: after having read (and forwarded) five values, it'll give up and block everybody. And those five values are provided by `topRepos`, which we just wrote, and which will happily provide values as long as it can. 
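The "follow the next link until there is none" loop can also be modelled without any network access or conduit at all. The sketch below is a pure, list-based analogy (not the conduit implementation above): `fetchPage` is a hypothetical in-memory stand-in for the API, and lazy evaluation plays the role of downstream demand:

```haskell
-- A hypothetical in-memory "API": a page number maps to that page's items
-- and, if there is one, the number of the next page (like rel="next").
fetchPage :: Int -> ([String], Maybe Int)
fetchPage 1 = (["jgm/pandoc", "begriffs/postgrest"], Just 2)
fetchPage 2 = (["koalaman/shellcheck", "elm-lang/elm-compiler"], Just 3)
fetchPage 3 = (["purescript/purescript"], Nothing)
fetchPage _ = ([], Nothing)

-- Emit the items of a page, then follow the "next" link if there is one.
allItems :: Int -> [String]
allItems page =
    let (items, next) = fetchPage page
    in  items ++ maybe [] allItems next

-- Only as many pages as needed are looked at: taking three items
-- forces the first two pages and leaves the third one alone.
firstThree :: [String]
firstThree = take 3 (allItems 1)
```

In the real thing each page costs an HTTP request, which is why the conduit version only moves on to the next page once downstream asks for more values.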

In [28]:
repoContributors :: GithubRepo -> Producer IO GithubUser
repoContributors repo = go (getWith opts $ T.unpack $ githubApiRepoContributors repo)
    where
        go req = do
            r <- liftIO req
            C.yieldMany $ GithubUser <$> r ^.. responseBody . values . key "login" . _String
            case r ^? responseLink "rel" "next" of
                Just link -> go (getWith opts $ link ^. linkURL . unpackedChars)
                Nothing -> return ()
        opts = defaults & param "per_page" .~ ["100"]
                        & auth ?~ githubAuth

In the same vein, we now source github users from a repo. In practice:

In [29]:
repoContributors "jgm/pandoc" $$ C.take 5 =$= C.mapM_ print

GithubUser "jgm"
GithubUser "jkr"
GithubUser "tarleb"
GithubUser "mpickering"
GithubUser "lierdakil"

And now, ta da:

In [30]:
topRepos "haskell" $$ awaitForever repoContributors =$= C.take 5 =$ C.mapM_ print

GithubUser "jgm"
GithubUser "jkr"
GithubUser "tarleb"
GithubUser "mpickering"
GithubUser "lierdakil"

This uses `awaitForever`, which passes each value it gets from upstream (a repo) as an argument to some function (here, the one that lists a repo's contributors).

Well that's kind of nice, and we could stop there. However, we can't know for sure that a Haskeller will only ever contribute to one repo. And we want to count the number of programmers per country, not the number of project contributions per country. We'll add a last step to our pipeline (_err_, conduit) which will accumulate the users until we have enough _unique_ values (we don't really care about running in constant memory, so that's fine, but you probably don't want to use something like that when reading your petabyte files).

Let's see what it looks like:

In [31]:
import Control.Monad
import Control.Monad.ST
import Data.HashMap.Strict (HashMap)
import Data.Vector (Vector)

import qualified Data.Vector.Mutable as MV
import qualified Data.Vector as V

accumulateUniques :: (Eq a, Hashable a) => Int -> Sink a IO (Vector a)
accumulateUniques n = go HMS.empty
    where
        go :: (Eq a, Hashable a) => HashMap a Int -> Sink a IO (Vector a)
        go m | HMS.size m >= n = return $ toVector m
             | otherwise       = await >>= \case
                                Just v -> go (HMS.insertWith (\_ old -> old) 
                                                             v 
                                                             (HMS.size m) 
                                                             m )
                                Nothing -> return $ toVector m
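        toVector :: HashMap a Int -> Vector a
        -- Size the vector by the number of users actually collected: if upstream
        -- runs dry before we reach n, allocating n slots would leave some unset.
        toVector m = runST $ do vec <- MV.new (HMS.size m)
                                forM_ (HMS.toList m) (\(k, v) -> MV.write vec v k)
                                V.freeze vec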

We keep all the users we've seen so far in a hash map. Whenever we see a new user, we insert it as a key. The value we use is the size of the hash map right before insertion. Here's what it would look like, step-by-step:

```
new user    | hashmap size | hashmap contents | action
--------------------------------------------------------------
"jgm"       | 0            | []               | insert "jgm" 0
"jkr"       | 1            | [jgm]            | insert "jkr" 1
"jgm"       | 2            | [jgm, jkr]       | nothing
"mpickering"| 2            | [jgm, jkr]       | insert "mpickering" 2
```

And finally, when the hash map contains enough elements (`n`), we stop. So we build a map where all the keys are (different) users, and the values range from `0` to `n-1`. We can then create a vector of users, where the hash map's values are the indices, and the hash map's keys are the elements, _i.e._ the users.
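The same bookkeeping can be tried out on plain lists. This is only a sketch of the idea, assuming an ordered `Data.Map` in place of the hash map and a sort in place of the mutable vector; `firstUniques` is a hypothetical name:

```haskell
import Data.List (sortOn)
import qualified Data.Map.Strict as M

-- Collect the first n distinct elements of a list, in order of first
-- appearance. Each new element is stored with the size of the map at
-- insertion time, which is its position in the final result.
firstUniques :: Ord a => Int -> [a] -> [a]
firstUniques n = go M.empty
  where
    go seen xs
        | M.size seen >= n = inOrder seen
        | otherwise = case xs of
            []       -> inOrder seen
            (y : ys) -> go (M.insertWith (\_ old -> old) y (M.size seen) seen) ys
    -- Recover insertion order by sorting on the stored index.
    inOrder m = map fst (sortOn snd (M.toList m))
```

For instance, `firstUniques 3 ["foo", "foo", "bar", "foo", "baz"]` evaluates to `["foo","bar","baz"]`, and asking for more elements than the input contains simply returns the distinct elements that were seen.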

In practice:

In [32]:
C.yieldMany ["foo", "foo", "bar", "foo", "baz"] $$ accumulateUniques 3 

["foo","bar","baz"]

In [33]:
cs <- topRepos "haskell" $$ awaitForever repoContributors =$ accumulateUniques 2000



In [34]:
mapM_ print $ V.slice 1000 10 cs

GithubUser "rekahsoft"
GithubUser "dacto"
GithubUser "sargon"
GithubUser "dpwright"
GithubUser "dsferruzza"
GithubUser "vtduncan"
GithubUser "ericrasmussen"
GithubUser "edom"
GithubUser "favonia"
GithubUser "wferi"

In [35]:
import Data.Maybe (catMaybes)

--vecMapMaybe :: (a -> Maybe b) -> Vector a -> Vector b
--vecMapMaybe f = V.fromList . mapMaybe f . V.toList 

vecCatMaybes :: Vector (Maybe a) -> Vector a
vecCatMaybes = V.fromList . catMaybes . V.toList

In [36]:
countryName' :: Country -> String
countryName' country = let CountryName str = countryName country in T.unpack str

In [38]:
ccs <- vecCatMaybes <$> mapM githubUserCountry' cs
mapM_ print $ countryName <$> V.take 5 ccs

CountryName "United States"
CountryName "Germany"
CountryName "United Kingdom"
CountryName "Russia"
CountryName "Switzerland"

In [39]:
import qualified Data.HashSet as HS
import Data.Hashable (Hashable(..))

scanUniqueCount :: (Eq a, Hashable a) => Vector a -> Vector (a,Int)
scanUniqueCount vec = V.zip vec 
                    . V.map HS.size 
                    . V.postscanl' (flip HS.insert) HS.empty
                    $ vec

In [40]:
scanUniqueCount $ V.fromList ["foo", "bar", "foo", "baz", "qux", "bar"]

[("foo",1),("bar",2),("foo",2),("baz",3),("qux",4),("bar",4)]
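
To make the role of the scan more explicit, here is a rough list-based sketch of the same idea. The name `runningDistinct` is hypothetical, and `Data.Set` is used instead of `HashSet` only to keep the snippet free of extra dependencies:

```haskell
import qualified Data.Set as Set

-- | Hypothetical list analogue of 'scanUniqueCount': pair each element with
-- the number of distinct elements seen up to and including it.
-- Assumption: 'Data.Set' replaces the notebook's 'HashSet'.
runningDistinct :: Ord a => [a] -> [(a, Int)]
runningDistinct xs = zip xs (map Set.size seen)
  where
    -- scanl emits the initial empty set first; dropping it leaves, for each
    -- element, the set *after* that element was inserted (a "postscan")
    seen = drop 1 (scanl (flip Set.insert) Set.empty xs)
```

The count only grows when an element is new, which is why repeated entries such as the second `"foo"` leave it unchanged.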

In [42]:
import Data.Function (on)

--deriving instance Eq CountryPopulation
deriving instance Eq CountryName

instance Eq Country where
    (==)    = (==) `on` countryName
instance Hashable Country where
    hashWithSalt n c = let CountryName cname = countryName c in hashWithSalt n cname

In [44]:
{-# LANGUAGE QuasiQuotes #-}

import Data.Int (Int32)

let xs = [0,1,2,3] :: [Int32]
    ys = [1, 4, 60, 40] :: [Int32]

[rgraph|plot(x = xs_hs, y = ys_hs)|]

In [45]:
{-# LANGUAGE QuasiQuotes #-}

let countryCountEvolution = snd <$> scanUniqueCount ccs
    indices       = V.toList $ V.enumFromN 0 (V.length countryCountEvolution) :: [Int32]
    countryCounts = V.toList $ fromIntegral <$> countryCountEvolution :: [Int32]

-- x: index of the located contributor; y: distinct countries seen up to that contributor
[rgraph|plot(x = indices_hs, y = countryCounts_hs)|]

In [46]:
pieData = [10,20,30] :: [Double]
[rgraph|pie(c(pieData_hs))|]

In [47]:
import qualified Data.HashMap.Strict as HMS

let countries = HMS.toList 
              $ HMS.filter (>15) 
              $ V.foldl' (\m c -> HMS.insertWith (+) c 1 m) HMS.empty ccs :: [(Country, Int)]
    labels = (countryName' . fst) <$> countries :: [String]
    occs   = (fromIntegral . snd) <$> countries :: [Int32]
    
[rgraph|pie(occs_hs, labels = labels_hs)|]
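
The tally-then-threshold step in this cell can be tried in isolation. The sketch below is only an illustration: `tallyAbove` is a hypothetical name, it works on a plain list rather than on the `ccs` vector, and it uses `Data.Map` instead of `HashMap`:

```haskell
import qualified Data.Map.Strict as Map

-- | Hypothetical helper: count how often each element occurs and keep only
-- the elements seen strictly more than @minCount@ times.
-- Assumption: 'Data.Map' replaces the notebook's 'HashMap'.
tallyAbove :: Ord a => Int -> [a] -> [(a, Int)]
tallyAbove minCount xs =
  Map.toList . Map.filter (> minCount) $
    Map.fromListWith (+) [ (x, 1) | x <- xs ]
```

With a threshold of 15, as in the cell above, countries with only a handful of located contributors are left out of the pie chart.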

In [49]:
countryRatio :: Country -> Int -> Double
countryRatio c n = (fromIntegral n) / (fromIntegral population)
    where (CountryPopulation population)  = countryPopulation c

In [51]:
{-# LANGUAGE TupleSections #-}

let countriesRel = (\(c,n) -> (c,) $ countryRatio c n) <$> countries
    labels = (countryName' . fst) <$> countriesRel :: [String]
    occs   = snd <$> countriesRel :: [Double]
[rgraph|pie(occs_hs, labels = labels_hs)|]

In [52]:
import Data.List (sortBy)
import Control.Arrow (first)

let countriesByRatio = sortBy (flip compare `on` snd) countriesRel
mapM_ (print . first countryName') countriesByRatio

("Sweden",3.7645028745031746e-6)
("Switzerland",3.4296266983247594e-6)
("Australia",2.9280870193998313e-6)
("Netherlands",1.9825773505557223e-6)
("United Kingdom",1.539733619988963e-6)
("Germany",1.4302783846171872e-6)
("United States",1.2990242107265083e-6)
("Canada",1.247067905816681e-6)
("France",4.94068178845702e-7)
("Poland",4.4155844155844157e-7)
("Russia",3.2693209762476723e-7)
("Japan",3.2210420463830053e-7)
("China",2.4059354427372327e-8)
("India",1.7901164835444847e-8)

In [53]:
import Control.Arrow (second)

let baseline = snd $ head countriesByRatio
    countriesToBaseline = second (/ baseline) <$> countriesByRatio
    
mapM_ (print . first countryName') countriesToBaseline

("Sweden",1.0)
("Switzerland",0.9110437188276524)
("Australia",0.7778150573961959)
("Netherlands",0.5266505078223311)
("United Kingdom",0.4090137984533141)
("Germany",0.37993818368539567)
("United States",0.3450719136183284)
("Canada",0.33127027588769326)
("France",0.1312439371987217)
("Poland",0.11729528606528607)
("Russia",8.68460215129772e-2)
("Japan",8.556354328214205e-2)
("China",6.391110653766626e-3)
("India",4.7552533315059235e-3)
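
The normalisation can also be written without `head`, which fails on an empty list. The following is a sketch under that assumption. `relativeToTop` is a hypothetical name, and it takes plain `(String, Double)` pairs instead of `Country` values:

```haskell
import Data.List (sortBy)
import Data.Ord (comparing, Down (..))

-- | Hypothetical helper: sort by ratio, largest first, and express every
-- ratio as a fraction of the largest one.
relativeToTop :: [(String, Double)] -> [(String, Double)]
relativeToTop rows =
  case sortBy (comparing (Down . snd)) rows of
    []                    -> []   -- nothing to normalise
    sorted@((_, top) : _) -> [ (name, r / top) | (name, r) <- sorted ]
```

Fed with the ratios printed earlier for Sweden (`3.7645028745031746e-6`) and Switzerland (`3.4296266983247594e-6`), it puts Sweden at `1.0` and Switzerland at roughly `0.911`.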
