Kaggle's Titanic Dataset Predictions in Haskell
===================================================

This notebook shows a simple example of an analysis of the Titanic disaster in Haskell using the existing dataHaskell utilities as of November 2025. It is primarily aimed at those who already have a working knowledge of data analytics and data science in Python/R, and who are looking to switch to Haskell!

What This Program Does
----------------------

1. Loads the canonical [Titanic dataset](https://www.kaggle.com/datasets/yasserh/titanic-dataset)
2. Describes the columns within the Titanic dataset
3. Prepares the data for Hasktorch
4. Builds a logistic regression with Hasktorch
5. Evaluates the performance of the logistic regression

## Imports, etc

1. We set `-XOverloadedStrings` so that string literals (sequences of characters between double quotes) can be used wherever Haskell's `T.Text` type is expected, not just as regular `String`s: `"this is both a T.Text and a String!"`
2. The `pandas` equivalent in Haskell is `DataFrame`.
3. We import the pipe operator `|>`, similar to `%>%` in `dplyr`
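
To make the first point concrete, here is a minimal sketch of how `OverloadedStrings` works, using only `base`. The `ColumnName` type is a hypothetical stand-in invented for this illustration; it is not part of the `DataFrame` library. `T.Text` gets the same treatment because it also has an `IsString` instance.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString (..))

-- A hypothetical stand-in for a column name (not a DataFrame type).
newtype ColumnName = ColumnName String
  deriving (Show, Eq)

-- With OverloadedStrings, the literal "Age" desugars to `fromString "Age"`,
-- so any type with an IsString instance can be written as a plain literal.
instance IsString ColumnName where
  fromString = ColumnName

-- The literal is interpreted as a ColumnName here...
ageColumn :: ColumnName
ageColumn = "Age"

-- ...and as an ordinary String here, because the signature asks for one.
ageString :: String
ageString = "Age"
```

The type signature, not the literal, decides which type you get. This is why `"Survived"` can be passed directly to functions that expect `T.Text`.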

In [None]:
:set -XOverloadedStrings

import qualified DataFrame as D
import qualified DataFrame.Functions as F
import           DataFrame.Operators

import qualified Data.Text as T

## Reading in our dataset

1. `DataFrame` is smart enough to infer the type of each column in the underlying CSV.

In [None]:
-- Since we have imported qualified DataFrame as D, we must refer to 
-- functions inside the DataFrame library with the prefix `D.`

df <- D.readCsv "../data/titanic.csv"

-- by the way, you can extract the first N rows using D.take
df |> D.take 5

-- and you can get summary stats using D.summarize

df |> D.summarize

  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
| PassengerId<br>Int | Survived<br>Int | Pclass<br>Int |                    Name<br>Text                     | Sex<br>Text | Age<br>Maybe Double | SibSp<br>Int | Parch<br>Int |  Ticket<br>Text  | Fare<br>Double | Cabin<br>Maybe Text | Embarked<br>Maybe Text |
| -------------------|-----------------|---------------|-----------------------------------------------------|-------------|---------------------|--------------|--------------|------------------|----------------|---------------------|----------------------- |
| 1                  | 0               | 3             | Braund, Mr. Owen Harris                             | male        | Just 22.0           | 1            | 0            | A/5 21171        | 7.25           | Nothing             | Just "S"               |
| 2                  | 1               | 1             | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female      | Just 38.0           | 1            | 0            | PC 17599         | 71.2833        | Just "C85"          | Just "C"               |
| 3                  | 1               | 3             | Heikkinen, Miss. Laina                              | female      | Just 26.0           | 0            | 0            | STON/O2. 3101282 | 7.925          | Nothing             | Just "S"               |
| 4                  | 1               | 1             | Futrelle, Mrs. Jacques Heath (Lily May Peel)        | female      | Just 35.0           | 1            | 0            | 113803           | 53.1           | Just "C123"         | Just "S"               |
| 5                  | 0               | 3             | Allen, Mr. William Henry                            | male        | Just 35.0           | 0            | 0            | 373450           | 8.05           | Nothing             | Just "S"               |


  ------------------------------------------------------------------------------------------------------------------------------------------------------  
| Statistic<br>Text | PassengerId<br>Double | Survived<br>Double | Pclass<br>Double | Age<br>Double | SibSp<br>Double | Parch<br>Double | Fare<br>Double |
| ------------------|-----------------------|--------------------|------------------|---------------|-----------------|-----------------|--------------- |
| Count             | 891.0                 | 891.0              | 891.0            | 714.0         | 891.0           | 891.0           | 891.0          |
| Mean              | 446.0                 | 0.38               | 2.31             | 29.7          | 0.52            | 0.38            | 32.2           |
| Minimum           | 1.0                   | 0.0                | 1.0              | 0.42          | 0.0             | 0.0             | 0.0            |
| 25%               | 223.5                 | 0.0                | 2.0              | 20.12         | 0.0             | 0.0             | 7.91           |
| Median            | 446.0                 | 0.0                | 3.0              | 28.0          | 0.0             | 0.0             | 14.45          |
| 75%               | 668.5                 | 1.0                | 3.0              | 38.0          | 1.0             | 0.0             | 31.0           |
| Max               | 891.0                 | 1.0                | 3.0              | 80.0          | 8.0             | 6.0             | 512.33         |
| StdDev            | 257.35                | 0.49               | 0.84             | 14.53         | 1.1             | 0.81            | 49.69          |
| IQR               | 445.0                 | 1.0                | 1.0              | 17.88         | 1.0             | 0.0             | 23.09          |
| Skewness          | 0.0                   | 0.48               | -0.63            | 0.39          | 3.69            | 2.74            | 4.78           |


## Prepping our data for Hasktorch

Here are some things that we can observe about our dataset:

1. `Embarked` represents the port where the person boarded. There's no reason why a person boarding in France would have a better chance than a person boarding in England, so we will drop this column.
2. `Cabin` represents the room number the passenger was assigned. Not all passengers were assigned a cabin. Since this is correlated with `Pclass`, we'll drop this column.
3. `Fare` is the amount the passenger paid for their ticket. This is also correlated with `Pclass`. In order to reduce the amount of variance in the dataset, we'll use `Pclass` and drop this column.
4. `Parch` refers to the number of parents or children the individual had riding with them. We'll ignore this column for the time being and leave the inclusion of this as an exercise.
5. `SibSp` refers to the number of siblings or spouses the individual had riding with them. We'll ignore this column for the time being and leave the inclusion of this as an exercise.
6. `Age` - self explanatory. We will impute the average age in order to handle missing values.
7. `Sex` - self explanatory.
8. `Pclass` refers to the class of the ticket: first, second or third. Though this is read in as an `Int`, it shouldn't be treated as a number; otherwise our model would assume that third class is worth three times as much as first class, which in general is not true (first class tickets are not 1/3 the price of third class tickets, for example). Therefore, we will convert this into two binary variables, `isFirstClass` and `isSecondClass`. A passenger is assumed to belong to third class unless one of these variables is set to 1.
9. `Survived` is our outcome variable of interest.
10. `PassengerId` has to be dropped as it is an ID.
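
The encoding described for `Pclass` can be sketched in plain Haskell, independent of any dataframe library. The function name `encodeClass` is made up for this illustration.

```haskell
-- Encode a passenger class (1, 2 or 3) as the pair
-- (isFirstClass, isSecondClass). Third class is the baseline: (0, 0).
encodeClass :: Int -> (Int, Int)
encodeClass pclass =
  ( if pclass == 1 then 1 else 0
  , if pclass == 2 then 1 else 0
  )

-- Apply the encoding to a whole column of class values.
encodeClasses :: [Int] -> [(Int, Int)]
encodeClasses = map encodeClass
```

For example, `encodeClasses [3, 1, 2]` evaluates to `[(0,0),(1,0),(0,1)]`. Two columns are enough for three classes because the third is implied when both are 0.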


The outcome of interest is the column called `Survived`. Let's compare between two logistic regression models:

1. _Minimal model_: The only independent variable is `Sex`.
2. _Bigger model_: The independent variables are `Sex`, class (represented by the binary variables `isFirstClass` and `isSecondClass`) and `Age`.

Let's start by prepping the data. 

1. We'll binarize the `Sex` and `Pclass` columns
2. We'll impute missing `Age` values
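
If mean imputation is new to you, here is a small sketch of the idea on a plain list of `Maybe Double` values, using only `base`. The names `meanOfPresent` and `imputeMean` are invented for this illustration and are not `DataFrame` functions.

```haskell
import Data.Maybe (catMaybes, fromMaybe)

-- Mean of the observed (Just) values; Nothing when every value is missing.
meanOfPresent :: [Maybe Double] -> Maybe Double
meanOfPresent xs =
  case catMaybes xs of
    [] -> Nothing
    present -> Just (sum present / fromIntegral (length present))

-- Replace each missing value with the mean of the observed ones.
-- Returns Nothing when there are no observed values to average.
imputeMean :: [Maybe Double] -> Maybe [Double]
imputeMean xs = fmap (\m -> map (fromMaybe m) xs) (meanOfPresent xs)
```

For example, `imputeMean [Just 22, Nothing, Just 26]` evaluates to `Just [22.0,24.0,26.0]`.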



In [60]:
-- let's compute the mean age. We can exclude
-- the null rows using the function `filterJust`.
-- (Similarly, you could exclude the nonnull rows
-- using the function `filterNothing`)


df |> D.filterJust "Age" |> D.mean (F.col @Double "Age")

29.69911764705882

In [61]:
cleanedDf = df |> D.select ["Sex", "Pclass", "Age", "Survived"]
               |> D.derive "isFemale" (F.ifThenElse (F.col "Sex" .== "female") 1 0)
               |> D.derive "isFirstClass" (F.ifThenElse (F.col "Pclass" .== 1) 1 0)
               |> D.derive "isSecondClass" (F.ifThenElse (F.col "Pclass" .== 2) 1 0)
               |> D.impute (F.col "Age") 29.69911764705882
               |> D.exclude ["Sex", "Pclass"]

cleanedDf |> D.take 5

  ----------------------------------------------------------------------------------------------  
| Age<br>Double | Survived<br>Int | isFemale<br>Int | isFirstClass<br>Int | isSecondClass<br>Int |
| --------------|-----------------|-----------------|---------------------|--------------------- |
| 22.0          | 0               | 0               | 0                   | 0                    |
| 38.0          | 1               | 1               | 1                   | 0                    |
| 26.0          | 1               | 1               | 0                   | 0                    |
| 35.0          | 1               | 1               | 1                   | 0                    |
| 35.0          | 0               | 0               | 0                   | 0                    |


## Logistic Regression in Hasktorch

Let's take a brief detour and explore how we would build a logistic regression using the functionality given to us in Torch.

- A logistic regression is used whenever the output is a binary variable. It is a generalized linear regression model that uses the logit link.
- This can be expressed in torch as a linear model with no hidden layers. The input features pass through a single linear layer, and the sigmoid function (the inverse of the logit link) is applied to the result.
- There is one output, which can be interpreted as the probability that the output (a binary variable) is "successful".
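
Before moving to tensors, it may help to see the same three ideas on plain lists. This is a sketch in pure Haskell rather than Hasktorch. The function names are invented for this illustration, and the loss assumes every predicted probability lies strictly between 0 and 1.

```haskell
-- The logistic (sigmoid) function, the inverse of the logit link.
sigmoid :: Double -> Double
sigmoid z = 1 / (1 + exp (negate z))

-- Predicted probability for one observation: sigmoid (w . x + b).
predictProb :: [Double] -> Double -> [Double] -> Double
predictProb weights bias features =
  sigmoid (sum (zipWith (*) weights features) + bias)

-- Mean binary cross-entropy between labels (0 or 1) and predicted
-- probabilities. Assumes each probability is strictly between 0 and 1.
binaryCrossEntropy :: [Double] -> [Double] -> Double
binaryCrossEntropy labels probs =
  negate (sum (zipWith term labels probs)) / fromIntegral (length labels)
  where
    term y p = y * log p + (1 - y) * log (1 - p)
```

With all weights and the bias at zero, `predictProb` returns 0.5 for every observation, and the loss equals `log 2` regardless of the labels.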

Let's first define this architecture using Hasktorch.

In [62]:
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE RecordWildCards #-}

import GHC.Generics (Generic)
import qualified Torch as HT
import Control.Monad (when)

-- Forward pass
-- similar to the following code in PyTorch
-- def forward(self, x):
--        out = self.linear(x) # Pass the input through the linear layer
--        out = self.sigmoid(out) # Apply the sigmoid activation function
--        return out

logReg :: HT.Linear -> HT.Tensor -> HT.Tensor
logReg wtsbs input = HT.squeezeAll $ HT.sigmoid $ HT.linear wtsbs input

-- here is a helper function that runs a training loop.
-- similar to `for epoch in range(1000)...` in PyTorch.

trainLoop :: 
    Int ->
    (HT.Tensor, HT.Tensor) ->
    HT.Linear ->
    IO HT.Linear
trainLoop 
    n
    (features, labels)
    initialM =
        HT.foldLoop initialM n $ \state i -> do
            -- Forward pass: compute predictions
            let predicted = logReg state features
            
            -- Compute loss (how wrong our predictions are)
            let loss = HT.binaryCrossEntropyLoss' labels predicted
            
            -- Every 500 iterations, print progress
            when (i `mod` 500 == 0) $ do
                putStrLn $
                    "Iteration :"
                        ++ show i
                        ++ " | Loss: "
                        ++ show (HT.asValue loss :: Float)
            
            -- Backward pass: update weights using gradient descent
            -- HT.GD is the optimizer, 1e-2 is the learning rate
            (state', _) <- HT.runStep state HT.GD loss 1e-2
            pure state'
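
To see what a gradient descent step does, here is a sketch of the update for logistic regression written by hand on plain lists. This is not how Hasktorch implements `HT.runStep` (which relies on automatic differentiation). The names `Params`, `gdStep` and `train` are invented for this illustration, and the code assumes a non-empty sample list whose feature vectors all have the same length as the weight vector.

```haskell
import Data.List (foldl', transpose)

-- Model parameters: one weight per feature plus a bias term.
data Params = Params
  { weights :: [Double]
  , bias :: Double
  }
  deriving (Show, Eq)

sigmoid :: Double -> Double
sigmoid z = 1 / (1 + exp (negate z))

predictProb :: Params -> [Double] -> Double
predictProb params xs =
  sigmoid (sum (zipWith (*) (weights params) xs) + bias params)

-- One step of batch gradient descent on the mean binary cross-entropy.
-- The gradient for weight j is mean ((p_i - y_i) * x_ij), and the
-- gradient for the bias is mean (p_i - y_i).
gdStep :: Double -> [([Double], Double)] -> Params -> Params
gdStep learningRate samples params =
  Params
    { weights = zipWith (\w g -> w - learningRate * g) (weights params) weightGrads
    , bias = bias params - learningRate * biasGrad
    }
  where
    n = fromIntegral (length samples)
    errors = [predictProb params xs - y | (xs, y) <- samples]
    biasGrad = sum errors / n
    weightGrads =
      [ sum (zipWith (*) errors column) / n
      | column <- transpose (map fst samples)
      ]

-- Repeat the update a fixed number of times.
train :: Int -> Double -> [([Double], Double)] -> Params -> Params
train steps learningRate samples initial =
  foldl' (\params _ -> gdStep learningRate samples params) initial [1 .. steps]
```

The structure mirrors `trainLoop`: compute predictions, measure how far off they are, then nudge each parameter against its gradient by the learning rate.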

## Converting Dataframes to Tensors

Hasktorch works with tensors, so we need to convert the data to tensors. We can use the Hasktorch utilities in the `DataFrame` library to do this.

In [63]:
import qualified DataFrame.Hasktorch as DHT

labelsTensor = cleanedDf |> D.select ["Survived"] 
                         |> DHT.toTensor

featuresTensor = cleanedDf |> D.exclude ["Survived"]
                           |> DHT.toTensor

## Running the model

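We're now ready to train the model:

1. First, initialize the model randomly. `HT.LinearSpec 4 1` describes a linear layer with four inputs (`Age`, `isFemale`, `isFirstClass` and `isSecondClass`) and one output.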
2. Then, using the `trainLoop` function we wrote above, train the model for a certain number of epochs.

To make things easy, we'll just use 25,000 epochs. But you can modify `trainLoop` to do fancier things, such as stopping once the average error over the last 100 epochs is no longer improving.
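
As a rough sketch of that stopping rule, the check below compares the mean loss over the latest window of epochs with the mean over the window before it. It is plain Haskell and is not wired into `trainLoop`. The name `shouldStop`, the `minDelta` tolerance and the newest-first ordering of the history are all assumptions made for this illustration.

```haskell
-- Decide whether to stop, given the loss history with the most recent
-- epoch first. Training stops when the mean loss over the latest `window`
-- epochs has improved by less than `minDelta` compared with the mean over
-- the `window` epochs before that.
shouldStop :: Int -> Double -> [Double] -> Bool
shouldStop window minDelta lossesNewestFirst
  | window <= 0 = False
  | length previous < window = False -- not enough history yet
  | otherwise = mean previous - mean recent < minDelta
  where
    (recent, rest) = splitAt window lossesNewestFirst
    previous = take window rest
    mean xs = sum xs / fromIntegral (length xs)
```

To use something like this, the loop would need to carry the loss history alongside the model state and return early once `shouldStop 100 1e-4 history` becomes `True`. The values 100 and `1e-4` are illustrative, not tuned.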

In [41]:
initialModel <- HT.sample $ HT.LinearSpec 4 1

trainedModel <- trainLoop 25000 (featuresTensor, labelsTensor) initialModel

Iteration :500 | Loss: 0.6940311
Iteration :1000 | Loss: 0.62007225
Iteration :1500 | Loss: 0.5619386
Iteration :2000 | Loss: 0.5133588
Iteration :2500 | Loss: 0.49150816
Iteration :3000 | Loss: 0.48291054
Iteration :3500 | Loss: 0.47661462
Iteration :4000 | Loss: 0.47188118
Iteration :4500 | Loss: 0.46824375
Iteration :5000 | Loss: 0.46539697
Iteration :5500 | Loss: 0.46313477
Iteration :6000 | Loss: 0.46131372
Iteration :6500 | Loss: 0.45983192
Iteration :7000 | Loss: 0.45861503
Iteration :7500 | Loss: 0.45760784
Iteration :8000 | Loss: 0.4567686
Iteration :8500 | Loss: 0.4560653
Iteration :9000 | Loss: 0.45547295
Iteration :9500 | Loss: 0.4549719
Iteration :10000 | Loss: 0.45454657
Iteration :10500 | Loss: 0.4541842
Iteration :11000 | Loss: 0.45387465
Iteration :11500 | Loss: 0.45360953
Iteration :12000 | Loss: 0.45338193
Iteration :12500 | Loss: 0.45318606
Iteration :13000 | Loss: 0.45301735
Iteration :13500 | Loss: 0.45287165
Iteration :14000 | Loss: 0.4527457
Iteration :14500 | L

## Using the model to generate predictions

We now have a model that we can use to generate predictions on whether a given passenger survived. But note that our predictions are values between 0 and 1! We interpret this as the probability that a given passenger would survive. 

As a first step, we can just have a threshold of 0.5: if the probability is more than 50%, then we predict the passenger to survive, otherwise, we predict them to not survive.

In [83]:
import Data.Either (fromRight)
import qualified Data.Vector.Unboxed as VU

getValues :: HT.Tensor -> [Float]
getValues tsr = HT.asValue tsr :: [Float]

assignSurvival :: Float -> [Float] -> [Int]
assignSurvival threshold = fmap (\ p -> if p > threshold then 1 else 0)

computeAccuracy :: [Int] -> [Int] -> Float
computeAccuracy actual predicted = fromIntegral numCorrect / fromIntegral numObs
    where correct = zipWith (\ a b -> if a == b then 1 else 0) actual predicted
          numCorrect = sum correct
          numObs = length actual

predictedProbsTensor = logReg trainedModel featuresTensor
predictedSurvival = assignSurvival 0.5 $ getValues predictedProbsTensor
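-- the labels actually observed in the dataset
actualSurvival = VU.toList $ fromRight undefined (df |> D.columnAsIntVector "Survived")
-- computeAccuracy takes the actual labels first, then the predictions
accuracy = computeAccuracy actualSurvival predictedSurvival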
accuracy

0.8013468

## Conclusions and Next Steps

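Great! We now have a preliminary model with about 80% accuracy on the data it was trained on. Since we haven't evaluated it on held-out data, this figure likely overstates how well the model would do on unseen passengers.

Here are a few ideas on what you can do next with Haskell's dataframes: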

1. This model is a linear model, so you'll need to introduce nonlinearities manually -- for example, it's plausible that age squared might be better correlated with the outcome, as the very young and the very old might have a lower probability of survival than those in their 20s/30s. Can you add a new column, `ageSq`?
2. Can you compute other metrics such as precision and recall? The `iris` example notebook has code snippets that you can use.
3. Download the test set from Kaggle. Generate predictions and submit them to Kaggle. What's the accuracy that you get?
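
As a starting point for the second idea, here is one possible sketch of precision and recall on plain `[Int]` lists, the same shape that `computeAccuracy` works with. It is not the code from the `iris` notebook, and the function names are invented for this illustration. Class 1 (survived) is treated as the positive class.

```haskell
-- Counts of true positives, false positives and false negatives,
-- treating 1 as the positive class.
confusionCounts :: [Int] -> [Int] -> (Int, Int, Int)
confusionCounts actual predicted = (count 1 1, count 0 1, count 1 0)
  where
    pairs = zip actual predicted
    count a p = length (filter (== (a, p)) pairs)

-- Precision = TP / (TP + FP); Nothing when nothing was predicted positive.
precision :: [Int] -> [Int] -> Maybe Double
precision actual predicted
  | tp + fp == 0 = Nothing
  | otherwise = Just (fromIntegral tp / fromIntegral (tp + fp))
  where
    (tp, fp, _) = confusionCounts actual predicted

-- Recall = TP / (TP + FN); Nothing when there are no actual positives.
recall :: [Int] -> [Int] -> Maybe Double
recall actual predicted
  | tp + fn == 0 = Nothing
  | otherwise = Just (fromIntegral tp / fromIntegral (tp + fn))
  where
    (tp, _, fn) = confusionCounts actual predicted
```

Both functions return a `Maybe` so that the undefined cases (no positive predictions, or no positive labels) are explicit rather than producing `NaN`.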