<center><img src="https://raw.githubusercontent.com/mateuszk098/kaggle_notebooks/master/playground_series_s3e22/undraw_Working_late_re_0c3y.png" width=400px></center>

# <p style="font-family: 'JetBrains Mono'; font-weight: bold; font-size: 125%; color: #4A4B52; text-align: center">Playground Series S3E22 - Horse Survival Classification</p>


In [1]:
# %load ../general_settings.py
import glob
import os
import shutil
import subprocess
import sys
import warnings
from array import array
from collections import defaultdict, namedtuple
from copy import copy
from functools import partial, singledispatch
from itertools import chain, combinations, product
from pathlib import Path
from time import strftime

import joblib
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import scipy.stats as stats
import seaborn as sns
import shap
from colorama import Fore, Style
from IPython.display import HTML, display_html
from plotly.subplots import make_subplots
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

ON_KAGGLE = os.getenv("KAGGLE_KERNEL_RUN_TYPE") is not None

# Colorama settings.
CLR = (Style.BRIGHT + Fore.BLACK) if ON_KAGGLE else (Style.BRIGHT + Fore.WHITE)
RED = Style.BRIGHT + Fore.RED
BLUE = Style.BRIGHT + Fore.BLUE
CYAN = Style.BRIGHT + Fore.CYAN
RESET = Style.RESET_ALL

# Data Frame color theme.
FONT_COLOR = "#4A4B52"
BACKGROUND_COLOR = "#FFFCFA"

CELL_HOVER = {  # for row hover use <tr> instead of <td>
    "selector": "td:hover",
    "props": "background-color: #FFFCFA",
}
TEXT_HIGHLIGHT = {
    "selector": "td",
    "props": "color: #4A4B52; font-weight: bold",
}
INDEX_NAMES = {
    "selector": ".index_name",
    "props": "font-weight: normal; background-color: #FFFCFA; color: #4A4B52;",
}
HEADERS = {
    "selector": "th:not(.index_name)",
    "props": "font-weight: normal; background-color: #FFFCFA; color: #4A4B52;",
}
DF_STYLE = (INDEX_NAMES, HEADERS, TEXT_HIGHLIGHT)
DF_CMAP = sns.light_palette("#BAB8B8", as_cmap=True)

# Utility functions.
def download_from_kaggle(expr: list[str], directory: Path | None = None, /) -> None:
    if directory is None:
        directory = Path("data")
    if not isinstance(directory, Path):
        raise TypeError("The `directory` argument must be `Path` instance!")
    match expr:
        case ["kaggle", _, "download", *args] if args:
            directory.parent.mkdir(parents=True, exist_ok=True)
            filename = args[-1].split("/")[-1] + ".zip"
            if not (directory / filename).is_file():
                subprocess.run(expr)
                shutil.unpack_archive(filename, directory)
                shutil.move(filename, directory)
        case _:
            raise SyntaxError("Invalid expression!")


def get_interpolated_colors(color1: str, color2: str, /, num_colors: int = 2) -> list[str]:
    """Return `num_colors` colors interpolated between `color1` and `color2`,
    plus both endpoints. Arguments need to be HEX strings, e.g. `#4A4B52`."""

    def interpolate_color(color1, color2, t) -> str:
        r1, g1, b1 = int(color1[1:3], 16), int(color1[3:5], 16), int(color1[5:7], 16)
        r2, g2, b2 = int(color2[1:3], 16), int(color2[3:5], 16), int(color2[5:7], 16)
        r = int(r1 + (r2 - r1) * t)
        g = int(g1 + (g2 - g1) * t)
        b = int(b1 + (b2 - b1) * t)
        return f"#{r:02X}{g:02X}{b:02X}"

    num_colors = num_colors + 2
    return [interpolate_color(color1, color2, i / (num_colors - 1)) for i in range(num_colors)]


# Html highlight. Must be included at the end of all imports!
HTML(
    """
<style>
code {
    background: rgba(42, 53, 125, 0.10) !important;
    border-radius: 4px !important;
}
a {
    color: rgba(123, 171, 237, 1.0) !important;
}
ol.numbered-list {
  counter-reset: item;
}
ol.numbered-list li {
  display: block;
}
ol.numbered-list li:before {
  content: counters(item, '.') '. ';
  counter-increment: item;
}
</style>
"""
)


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
">
    <b>Competition Description</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The dataset for this competition is synthetic and based on the <a href="https://www.kaggle.com/datasets/yasserh/horse-survival-dataset"><b>Horse Survival Dataset</b></a>. According to the original dataset description, the main task is to understand the data and assess whether a horse can survive given its medical condition. The original dataset contains many missing values, which its author describes as a real problem. Moreover, all indicators were converted into words so that it is easy to see what they represent.
</p>
    
<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
">
    <b>Task</b> 💡
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    This is a multiclass classification problem, where the main task is to predict the <code>Outcome</code> feature ($3$ classes), which describes the horse's health status. The competition evaluation metric is the <a href="https://en.wikipedia.org/wiki/F-score"><b>Micro-Averaged F1-Score</b></a> (a harmonic mean of precision and recall):
    \[F_1 =\frac{2}{\textrm{Precision}^{-1} + \textrm{Recall}^{-1}},\]
    where <i>Micro-Averaged</i> means that precision and recall are calculated globally, by counting the total true positives, false negatives and false positives over all classes.
</p>
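
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    To make the metric concrete, here is a minimal sketch of the micro-averaged F1-score written in plain Python. The helper name <code>micro_f1</code> and the label lists are my own illustration, not part of the competition code. It assumes that every sample has exactly one label and that at least one prediction is correct (otherwise the division fails).
</p>

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes first."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # the predicted class gains a false positive
            fn[true] += 1  # the true class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    precision = total_tp / (total_tp + total_fp)
    recall = total_tp / (total_tp + total_fn)
    return 2 / (1 / precision + 1 / recall)


# Hypothetical labels, for illustration only.
y_true = ["lived", "died", "lived", "euthanized", "died", "lived"]
y_pred = ["lived", "lived", "lived", "euthanized", "died", "died"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = micro_f1(y_true, y_pred)
print(f"micro-F1 = {score:.3f}, accuracy = {accuracy:.3f}")
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    In this toy example four of six predictions are correct, so both numbers equal $4/6$. In single-label multiclass problems every misclassification counts once as a false positive and once as a false negative, which is why the micro-averaged F1-score coincides with plain accuracy.
</p>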

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
">
    <b>Table of Contents</b> 📔
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The table of contents makes it easy to navigate through the whole notebook: you can jump to any section and return to the TOC. If you want to quickly find out something about the dataset, just read the first section, i.e. <b>Quick Overview</b>.
</p>

<blockquote class="anchor" id="top" style="
    margin-right: auto; 
    margin-left: auto;
    padding: 10px;
    background-color: #3A5A81;
    border-radius: 2px;
    border: 1px solid #3A5A81;
">
<ol class="numbered-list" style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #F2F2F0;
    margin-top: 15px;
    margin-bottom: 15px;
">
    <li><a href="#quick_overview"><span style="color: #F2F2F0">Quick Overview</span></a>
    <ol class="numbered-list" style="
        font-size: 16px;
        font-family: 'JetBrains Mono';
        color: #F2F2F0;
    ">
        <li><a href="#data_reading_and_features_description"><span style="color: #F2F2F0">Data Reading &amp; Features Description</span></a></li>
        <li><a href="#basic_numerical_properties_summaries"><span style="color: #F2F2F0">Basic Numerical Properties &amp; Summaries</span></a></li>
    </ol>
    </li>
</ol>
</blockquote>


# <b> <span style="font-family: 'JetBrains Mono'; color: #4A4B52">1</span> <span style='color: #3A5A81'>|</span> <span style="font-family: 'JetBrains Mono'; color: #4A4B52">Quick Overview</span></b><a class="anchor" id="quick_overview"></a> [↑](#top)

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>About Section</b> 💡
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    In this section, I provide a quick overview of the dataset. More detailed analysis will be done in subsequent sections.
</p>

## <b> <span style="font-family: 'JetBrains Mono'; color: #4A4B52">1.1</span> <span style='color: #3A5A81'>|</span> <span style="font-family: 'JetBrains Mono'; color: #4A4B52">Data Reading &amp; Features Description</span></b><a class="anchor" id="data_reading_and_features_description"></a> [↑](#top)

In [2]:
competition = "playground-series-s3e22"
expr = f"kaggle competitions download -c {competition}".split()

if not ON_KAGGLE:
    download_from_kaggle(expr)
    train_path = "data/train.csv"
    test_path = "data/test.csv"
else:
    train_path = f"/kaggle/input/{competition}/train.csv"
    test_path = f"/kaggle/input/{competition}/test.csv"

train = pd.read_csv(train_path, index_col="id", engine="pyarrow").rename(columns=str.title)
test = pd.read_csv(test_path, index_col="id", engine="pyarrow").rename(columns=str.title)


In [3]:
train.head().style.set_table_styles(DF_STYLE).format(precision=1)


id,Surgery,Age,Hospital_Number,Rectal_Temp,Pulse,Respiratory_Rate,Temp_Of_Extremities,Peripheral_Pulse,Mucous_Membrane,Capillary_Refill_Time,Pain,Peristalsis,Abdominal_Distention,Nasogastric_Tube,Nasogastric_Reflux,Nasogastric_Reflux_Ph,Rectal_Exam_Feces,Abdomen,Packed_Cell_Volume,Total_Protein,Abdomo_Appearance,Abdomo_Protein,Surgical_Lesion,Lesion_1,Lesion_2,Lesion_3,Cp_Data,Outcome
0,yes,adult,530001,38.1,132.0,24.0,cool,reduced,dark_cyanotic,more_3_sec,depressed,absent,slight,slight,less_1_liter,6.5,decreased,distend_small,57.0,8.5,serosanguious,3.4,yes,2209,0,0,no,died
1,yes,adult,533836,37.5,88.0,12.0,cool,normal,pale_cyanotic,more_3_sec,mild_pain,absent,moderate,none,more_1_liter,2.0,absent,distend_small,33.0,64.0,serosanguious,2.0,yes,2208,0,0,no,euthanized
2,yes,adult,529812,38.3,120.0,28.0,cool,reduced,pale_pink,less_3_sec,extreme_pain,hypomotile,moderate,slight,none,3.5,,distend_large,37.0,6.4,serosanguious,3.4,yes,5124,0,0,no,lived
3,yes,adult,5262541,37.1,72.0,30.0,cold,reduced,pale_pink,more_3_sec,mild_pain,hypomotile,moderate,slight,more_1_liter,2.0,decreased,distend_small,53.0,7.0,cloudy,3.9,yes,2208,0,0,yes,lived
4,no,adult,5299629,38.0,52.0,48.0,normal,normal,normal_pink,less_3_sec,alert,hypomotile,none,slight,less_1_liter,7.0,normal,normal,47.0,7.3,cloudy,2.6,no,0,0,0,yes,lived


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Insight</b> 💡
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    I prefer to capitalize data frame column names because lowercase names can coincide with <code>DataFrame</code> method names. Since I like to refer to columns as attributes (with a dot), capitalized names are the safer choice.
</p>
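
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    A small sketch of the name clash mentioned above. The frame and its column names (<code>count</code>, <code>mean</code>) are hypothetical and chosen only because they match existing <code>DataFrame</code> methods; the competition data may not contain such columns at all.
</p>

```python
import pandas as pd

# Hypothetical frame whose column names collide with DataFrame methods.
frame = pd.DataFrame({"count": [1, 2, 3], "mean": [0.5, 1.5, 2.5]})

# Attribute access resolves to the method, not to the column.
print(callable(frame.count))

# After title-casing, the same columns are reachable with a dot.
titled = frame.rename(columns=str.title)
print(titled.Count.tolist())
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    Bracket access such as <code>frame["count"]</code> always returns the column, so the renaming matters only for the dot notation.
</p>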

In [4]:
train.info(verbose=False)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1235 entries, 0 to 1234
Columns: 28 entries, Surgery to Outcome
dtypes: float64(7), int64(4), object(17)
memory usage: 279.8+ KB


In [5]:
test.info(verbose=False)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 824 entries, 1235 to 2058
Columns: 27 entries, Surgery to Cp_Data
dtypes: float64(7), int64(4), object(16)
memory usage: 180.2+ KB


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Features Description</b> 📔
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The dataset is divided into $27$ main features and the target. As we can read in the original dataset description, these are:
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    <li><code>Surgery</code> - Whether the horse was treated with surgery or not.</li>
    <li><code>Age</code> - Age of the horse (young means $< 6$ months).</li>
    <li><code>Hospital_Number</code> - The case number assigned to the horse (may not be unique if the horse is treated more than once).</li>
    <li><code>Rectal_Temp</code> - The normal temperature is $37.8$ (degrees Celsius) and will usually change as the problem progresses. It may start out normal, then become elevated because of the lesion, passing back through the normal range as the horse goes into shock.</li>
    <li><code>Pulse</code> - Heart rate in beats per minute. This reflects the heart condition: $30$ - $40$ is normal for adult horses. It's rare to have a lower than normal rate, although athletic horses may have a rate of $20$ - $25$. Animals with painful lesions or suffering from circulatory shock may have an elevated heart rate.</li>
    <li><code>Respiratory_Rate</code> - The normal rate is $8$ to $10$ and usefulness is doubtful due to the great fluctuations.</li>
    <li><code>Temp_Of_Extremities</code> - A subjective indication of peripheral circulation. Cool to cold extremities indicate possible shock, whereas hot extremities should correlate with an elevated rectal temperature.</li>
    <li><code>Peripheral_Pulse</code> - Subjective. Normal or increased values are indicative of adequate circulation, while reduced or absent indicate poor perfusion.</li>
    <li><code>Mucous_Membrane</code> - A subjective measurement of colour. Normal pink and bright pink probably indicate a normal or slightly increased circulation. Pale pink may occur in early shock. Pale cyanotic and dark cyanotic are indicative of serious circulatory compromise. Bright red is more indicative of septicemia.</li>
    <li><code>Capillary_Refill_Time</code> - A clinical judgement. The longer the refill, the poorer the circulation.</li>
    <li><code>Pain</code> - A subjective judgement of the horse's pain level. In general, the more painful, the more likely it is to require surgery, but should not be treated as an ordered or discrete variable.</li>
    <li><code>Peristalsis</code> - An indication of the activity in the horse's gut. As the gut becomes more distended or the horse becomes more toxic, the activity decreases.</li>
    <li><code>Abdominal_Distention</code> - According to the author, this is an important parameter. An animal with abdominal distension is likely to be painful and have reduced gut motility. A horse with severe abdominal distension is likely to require surgery just to relieve the pressure.</li>
    <li><code>Nasogastric_Tube</code> - Refers to any gas coming out of the tube. A large gas cap in the stomach is likely to give the horse discomfort.</li>
    <li><code>Nasogastric_Reflux</code> - The greater the amount of reflux, the more likelihood that there is some serious obstruction to the fluid passage from the rest of the intestine.</li>
    <li><code>Nasogastric_Reflux_Ph</code> - Scale is from $0$ to $14$ with $7$ being neutral, and normal values are in the $3$ to $4$ range.</li>
    <li><code>Rectal_Exam_Feces</code> - Absent feces probably indicates an obstruction.</li>
    <li><code>Abdomen</code> - Firm is probably an obstruction caused by a mechanical impaction and is normally treated medically. A distended small intestine and a distended large intestine indicate a surgical lesion.</li>
    <li><code>Packed_Cell_Volume</code> - Normal range is $30$ to $50$. The level rises as the circulation becomes compromised or as the animal becomes dehydrated.</li>
    <li><code>Total_Protein</code> - Normal values lie in the $6$ - $7.5$ (gms/dL) range, and the higher the value the greater the dehydration.</li>
    <li><code>Abdomo_Appearance</code> - Normal fluid is clear while cloudy or serosanguinous indicates a compromised gut.</li>
    <li><code>Abdomo_Protein</code> - The higher the level of protein the more likely it is to have a compromised gut. Values are in gms/dL.</li>
    <li><code>Surgical_Lesion</code> - Retrospectively, was the problem (lesion) surgical?</li>
    <li><code>Lesion_1</code>, <code>Lesion_2</code>, <code>Lesion_3</code> - Type of lesion. First number is site of lesion. Second number is type. Third number is subtype. Fourth number is specific code. Unfortunately, these indicators can be two-digit ones, so it's hard to interpret what is what.</li>
    <li><code>Cp_Data</code> - Is pathology data present for this case? This variable is of no significance since pathology data is not included or collected for these cases.</li>
    <li><code>Outcome</code> - <b>Horse health status. Variable to predict.</b></li>
</ul>
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    Reading the original dataset description, we can get a little intuition about which features may be important from the task perspective and which are not. <b>For example, the author says right up front that some features like</b> <code>Respiratory_Rate</code> <b>and</b> <code>Cp_Data</code> <b>are probably useless.</b> <b>On the other hand,</b> <code>Abdominal_Distention</code> <b>is perhaps the most important.</b> Also, <code>Hospital_Number</code> may be doubtful, since it is a case number assigned to the horse and is not directly related to its health status. Anyway, we have $27$ features and $1235$ training samples, which gives us around $46$ samples per dimension. This is rather low, so we should perform a feature selection analysis.
</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Insight</b> 💡
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    In the original dataset, a <code>Hospital_Number</code> should refer to exactly one horse, although that horse can appear more than once. In the synthetic training dataset, on the other hand, the same <code>Hospital_Number</code> value is associated with different horses. See below.
</p>

In [157]:
train[train.Hospital_Number == 528800].style.set_table_styles(DF_STYLE).format(precision=1)


id,Surgery,Age,Hospital_Number,Rectal_Temp,Pulse,Respiratory_Rate,Temp_Of_Extremities,Peripheral_Pulse,Mucous_Membrane,Capillary_Refill_Time,Pain,Peristalsis,Abdominal_Distention,Nasogastric_Tube,Nasogastric_Reflux,Nasogastric_Reflux_Ph,Rectal_Exam_Feces,Abdomen,Packed_Cell_Volume,Total_Protein,Abdomo_Appearance,Abdomo_Protein,Surgical_Lesion,Lesion_1,Lesion_2,Lesion_3,Cp_Data,Outcome
186,yes,adult,528800,39.3,66.0,48.0,warm,reduced,bright_red,less_3_sec,mild_pain,hypomotile,moderate,slight,less_1_liter,4.5,absent,distend_small,54.0,7.5,cloudy,2.9,yes,2205,0,0,no,lived
397,yes,adult,528800,37.8,48.0,12.0,cool,normal,pale_cyanotic,less_3_sec,mild_pain,normal,slight,slight,more_1_liter,2.0,decreased,firm,50.0,7.0,serosanguious,5.0,yes,2205,0,0,yes,euthanized
924,yes,adult,528800,38.0,48.0,16.0,cool,reduced,pale_cyanotic,less_3_sec,mild_pain,hypomotile,slight,slight,none,2.0,,,46.0,6.1,serosanguious,4.5,yes,1400,0,0,no,euthanized
944,yes,adult,528800,39.1,96.0,40.0,cold,absent,bright_red,more_3_sec,depressed,absent,moderate,slight,more_1_liter,1.0,absent,distend_small,74.0,6.6,cloudy,1.4,yes,2209,0,0,no,died
1080,no,adult,528800,37.6,48.0,20.0,cool,,pale_cyanotic,less_3_sec,depressed,absent,slight,slight,none,4.5,decreased,firm,45.0,7.5,cloudy,2.3,no,400,0,0,yes,lived


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Hospital Number Feature</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    Here, for example, the number $528800$ is attached to several different observations (different horses), which makes no sense. <b>That's probably the result of the synthetic generation process. I believe we can reject this feature at once.</b>
</p>
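
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    One possible way to check how widespread this reuse is, sketched on a hypothetical miniature frame rather than on the real <code>train</code> data: group by the case number and count both the rows and the distinct outcomes. The column names <code>Rows</code> and <code>DistinctOutcomes</code> are my own.
</p>

```python
import pandas as pd

# Hypothetical miniature of the training frame, for illustration only.
toy = pd.DataFrame(
    {
        "Hospital_Number": [528800, 528800, 528800, 530001, 533836],
        "Outcome": ["lived", "euthanized", "died", "died", "lived"],
    }
)

# For every case number: how many rows share it and how many distinct outcomes they carry.
per_number = toy.groupby("Hospital_Number")["Outcome"].agg(
    Rows="size", DistinctOutcomes="nunique"
)

# Case numbers that are reused with conflicting outcomes.
conflicting = per_number[per_number["DistinctOutcomes"] > 1]
print(conflicting)
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    A single horse cannot both live and die, so any case number that survives this filter must cover more than one animal.
</p>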

## <b> <span style="font-family: 'JetBrains Mono'; color: #4A4B52">1.2</span> <span style='color: #3A5A81'>|</span> <span style="font-family: 'JetBrains Mono'; color: #4A4B52">Basic Numerical Properties &amp; Summaries</span></b><a class="anchor" id="basic_numerical_properties_summaries"></a> [↑](#top)

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    Now we're going to look more closely at feature properties and dependencies. Let's have a look at the target variable distribution first.
</p>

In [6]:
fig = px.histogram(
    y=train.Outcome,
    histnorm="percent",
    text_auto=".1f",
    title="Horse Health Status - Target to Predict<br>"
    "<span style='font-size: 75%; font-weight: bold;'>"
    "Three different outcomes, which makes this a multiclass classification problem</span>",
    color_discrete_sequence=["#4A4B52"],
    opacity=0.5,
    height=340,
    width=840,
)
fig.update_xaxes(title="Percent Count", range=(-2, 100))
fig.update_yaxes(title="", categoryorder="total ascending")
fig.update_traces(textposition="outside")
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    bargap=0.4,
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Target Distribution</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The <code>Outcome</code> feature has three distinct classes that describe the horse's health status. As we can see, almost half ($46.5$%) of the horses lived, whereas $53.5$% died or were euthanized. Let's move on to the basic numerical summaries of the dataset. First, let's look at the diversity of the dataset, i.e. missing values, unique values and their frequency.
</p>

In [18]:
def missing_unique_vals_summary(frame):
    missing_vals = frame.isna().sum()

    unique_vals = frame.apply(lambda col: len(col.unique()))
    most_freq_count = frame.apply(lambda col: col.value_counts().iloc[0])
    most_freq_val = frame.mode().iloc[:1].T.squeeze()

    unique_ratio = unique_vals / len(frame)
    freq_count_ratio = most_freq_count / len(frame)

    return pd.DataFrame(
        {
            "MissingValues": missing_vals,
            "UniqueValues": unique_vals,
            "UniqueValuesRatio": unique_ratio,
            "MostFreqValueCount": most_freq_count,
            "MostFreqValueCountRatio": freq_count_ratio,
            "MostFreqValue": most_freq_val,
            "Dtype": frame.dtypes,
        }
    )


train_summary = missing_unique_vals_summary(train)
test_summary = missing_unique_vals_summary(test)


In [19]:
print(CLR + "Training Dataset:")
train_summary.style.set_table_styles(DF_STYLE).background_gradient(DF_CMAP).format(
    {"UniqueValuesRatio": "{:.1%}", "MostFreqValueCountRatio": "{:.1%}"}, precision=1
)


Training Dataset:


Unnamed: 0,MissingValues,UniqueValues,UniqueValuesRatio,MostFreqValueCount,MostFreqValueCountRatio,MostFreqValue,Dtype
Surgery,0,2,0.2%,887,71.8%,yes,object
Age,0,2,0.2%,1160,93.9%,adult,object
Hospital_Number,0,255,20.6%,46,3.7%,529461,int64
Rectal_Temp,0,43,3.5%,120,9.7%,38.0,float64
Pulse,0,50,4.0%,106,8.6%,48.0,float64
Respiratory_Rate,0,37,3.0%,163,13.2%,24.0,float64
Temp_Of_Extremities,0,5,0.4%,700,56.7%,cool,object
Peripheral_Pulse,0,5,0.4%,724,58.6%,reduced,object
Mucous_Membrane,0,7,0.6%,284,23.0%,pale_pink,object
Capillary_Refill_Time,0,4,0.3%,834,67.5%,less_3_sec,object


In [20]:
print(CLR + "Test Dataset:")
test_summary.style.set_table_styles(DF_STYLE).background_gradient(DF_CMAP).format(
    {"UniqueValuesRatio": "{:.1%}", "MostFreqValueCountRatio": "{:.1%}"}, precision=1
)


Test Dataset:


Unnamed: 0,MissingValues,UniqueValues,UniqueValuesRatio,MostFreqValueCount,MostFreqValueCountRatio,MostFreqValue,Dtype
Surgery,0,2,0.2%,589,71.5%,yes,object
Age,0,2,0.2%,782,94.9%,adult,object
Hospital_Number,0,210,25.5%,35,4.2%,529461.0,int64
Rectal_Temp,0,34,4.1%,75,9.1%,38.0,float64
Pulse,0,49,5.9%,74,9.0%,52.0,float64
Respiratory_Rate,0,38,4.6%,109,13.2%,24.0,float64
Temp_Of_Extremities,0,5,0.6%,472,57.3%,cool,object
Peripheral_Pulse,0,5,0.6%,478,58.0%,reduced,object
Mucous_Membrane,0,7,0.8%,212,25.7%,pale_cyanotic,object
Capillary_Refill_Time,0,4,0.5%,524,63.6%,less_3_sec,object


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Unique &amp; Most Frequent Values</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The dataset consists mainly of object-type (categorical) columns: there are $17$ of them, whereas the numerical ones comprise $11$ variables ($7$ floats and $4$ integers). <b>As we can see, the summaries report no missing values in either the training or the test dataset. Moreover, the two datasets are very similar regarding diversity, showing only small differences.</b> Two variables can be rejected right away, i.e., <code>Lesion_2</code> and <code>Lesion_3</code>, since a single value covers almost the whole dataset in each of them. Let's see the numerical summaries now.
</p>
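
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The rejection rule above can be sketched as a simple filter on the share of the most frequent value. The frame below is hypothetical, and the cut-off <code>THRESHOLD</code> is an assumption of mine, not a value used elsewhere in this notebook.
</p>

```python
import pandas as pd

# Hypothetical frame: `Lesion_3` is zero in all but one row.
toy = pd.DataFrame(
    {
        "Pulse": [48.0, 52.0, 66.0, 72.0, 88.0, 96.0, 100.0, 120.0, 132.0, 48.0],
        "Lesion_3": [0, 0, 0, 0, 0, 0, 0, 0, 0, 2209],
    }
)

THRESHOLD = 0.8  # assumed cut-off for "almost constant"

# Share of rows taken by the most frequent value of each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Columns dominated by a single value are candidates for removal.
near_constant = top_share[top_share >= THRESHOLD].index.to_list()
print(near_constant)
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    This is the same quantity as <code>MostFreqValueCountRatio</code> in the summary tables, so on the real data the filter could be applied to that column directly.
</p>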

In [10]:
def numeric_descr(frame):
    return (
        frame.describe(percentiles=[0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
        .T.drop("count", axis=1)
        .rename(columns=str.title)
    )


train_num_descr = numeric_descr(train)
test_num_descr = numeric_descr(test)


In [11]:
train_num_descr.style.set_table_styles(DF_STYLE).format(precision=1)


Unnamed: 0,Mean,Std,Min,1%,5%,25%,50%,75%,95%,99%,Max
Hospital_Number,954500.4,1356403.1,521399.0,527365.0,527563.0,528800.0,529777.0,534145.0,5290409.0,5299603.0,5305129.0
Rectal_Temp,38.2,0.8,35.4,36.1,37.1,37.8,38.2,38.6,39.5,40.3,40.8
Pulse,79.6,29.1,30.0,36.0,42.0,53.0,76.0,100.0,129.0,164.0,184.0
Respiratory_Rate,30.1,16.5,8.0,10.0,12.0,18.0,28.0,36.0,60.0,90.0,96.0
Nasogastric_Reflux_Ph,4.4,1.9,1.0,1.0,2.0,2.0,4.5,6.0,7.0,7.5,7.5
Packed_Cell_Volume,49.6,10.5,23.0,30.0,34.7,43.0,48.0,57.0,69.0,75.0,75.0
Total_Protein,21.4,26.7,3.5,4.5,5.9,6.6,7.5,9.1,80.0,82.0,89.0
Abdomo_Protein,3.3,1.6,0.1,1.0,1.0,2.0,3.0,4.3,6.5,8.0,10.1
Lesion_1,3832.5,5436.7,0.0,0.0,0.0,2205.0,2209.0,3205.0,8700.0,31110.0,41110.0
Lesion_2,14.6,193.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3112.0


In [12]:
test_num_descr.style.set_table_styles(DF_STYLE).format(precision=1)


Unnamed: 0,Mean,Std,Min,1%,5%,25%,50%,75%,95%,99%,Max
Hospital_Number,1108357.2,1555626.9,521399.0,527365.0,527680.2,528743.0,529808.5,534644.0,5291243.5,5299603.0,5305129.0
Rectal_Temp,38.2,0.8,36.0,36.4,37.1,37.8,38.2,38.6,39.5,40.3,40.8
Pulse,80.2,29.2,36.0,40.0,44.0,54.0,76.0,100.0,132.0,164.0,184.0
Respiratory_Rate,30.7,17.4,9.0,12.0,12.0,18.0,28.0,36.0,68.0,90.0,96.0
Nasogastric_Reflux_Ph,4.5,1.9,1.0,1.0,2.0,3.0,4.5,6.5,7.0,7.5,7.5
Packed_Cell_Volume,49.1,10.5,23.0,30.0,33.0,43.0,48.0,55.0,69.0,75.0,75.0
Total_Protein,20.8,26.4,3.9,4.5,5.9,6.6,7.5,8.9,81.0,82.0,89.0
Abdomo_Protein,3.3,1.5,0.1,1.0,1.0,2.0,3.3,4.3,5.6,8.0,10.1
Lesion_1,3709.8,5112.9,0.0,0.0,0.0,2205.0,2209.0,3205.0,8400.0,31110.0,31110.0
Lesion_2,12.4,197.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4300.0


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Numerical Summary</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The percentiles of the training and test datasets overlap almost perfectly. <b>This means that both datasets most probably derive from the same distribution.</b> The situation is slightly different for the <code>Lesion</code> variables, since these are not truly numeric values but codes, so it's hard to say whether they may be handy. Now that we have seen the tables above, let's have a look at the distribution plots.
</p>
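
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    Comparing percentiles by eye is informal. One possible complement, which is not used in this notebook, is the two-sample Kolmogorov-Smirnov test from <code>scipy.stats</code>. The sketch below runs it on synthetic stand-ins for a single numeric feature; the means and standard deviations are only loosely inspired by <code>Rectal_Temp</code>.
</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins for one numeric feature, not the competition data.
train_like = rng.normal(loc=38.2, scale=0.8, size=1000)
test_like = rng.normal(loc=38.2, scale=0.8, size=700)
shifted = rng.normal(loc=39.0, scale=0.8, size=700)

# Same underlying distribution: the KS statistic D should be small.
same = stats.ks_2samp(train_like, test_like)
# Shifted distribution: D grows and the p-value collapses.
diff = stats.ks_2samp(train_like, shifted)

print(f"same distribution:    D = {same.statistic:.3f}, p = {same.pvalue:.3f}")
print(f"shifted distribution: D = {diff.statistic:.3f}, p = {diff.pvalue:.3g}")
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    Keep in mind that the test assumes continuous data. The real features contain many repeated values, so its p-values should be read as a rough guide only.
</p>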

In [367]:
def get_n_rows_axes(n_features, n_cols):
    n_rows = int(np.ceil(n_features / n_cols))
    current_col = range(1, n_cols + 1)
    current_row = range(1, n_rows + 1)
    return n_rows, list(product(current_row, current_col))


def draw_dist_for_group(group, vars_to_plot, n_cols=3, height=640):
    n_rows, axes = get_n_rows_axes(len(vars_to_plot), n_cols)
    fig = make_subplots(
        rows=n_rows,
        cols=n_cols,
        y_title="Probability Density",
        horizontal_spacing=0.1,
        vertical_spacing=0.1,
    )
    fig.update_annotations(font_size=14)
    for frame, color, frame_name in zip((train, test), ("#4A4B52", "#3A5A81"), ("Train", "Test")):
        for k, (var, (row, col)) in enumerate(zip(vars_to_plot, axes), start=1):
            # density, bins = np.histogram(frame[var].dropna(), density=True)
            fig.add_histogram(
                # x=bins,
                x=frame[var],
                histnorm="probability density",
                marker_color=color,
                marker_line_width=0,
                # marker_line_color=color,
                opacity=0.75,
                name=frame_name,
                legendgroup=frame_name,
                showlegend=k == 1,
                row=row,
                col=col,
            )
            fig.update_xaxes(
                tickfont_size=8,
                showgrid=False,
                title_text=var,
                # `title_font_*` replaces the deprecated `titlefont_*` spelling.
                title_font_size=8,
                title_font_family="Arial Black",
                row=row,
                col=col,
            )
            fig.update_yaxes(tickfont_size=8, showgrid=False, row=row, col=col)

    fig.update_layout(
        width=840,
        height=height,
        title=f"{group} Features - Distributions<br>"
        "<span style='font-size: 75%; font-weight: bold;'>"
        "Training and test datasets overlap almost perfectly</span>",
        font_color=FONT_COLOR,
        title_font_size=18,
        plot_bgcolor=BACKGROUND_COLOR,
        paper_bgcolor=BACKGROUND_COLOR,
        bargap=0.1,
        bargroupgap=0.1,
        legend=dict(yanchor="bottom", xanchor="right", y=1, x=1, orientation="h", title=""),
    )
    return fig


cat_features = train.drop("Outcome", axis=1).select_dtypes("object").columns.to_list()
fig = draw_dist_for_group("Categorical", cat_features, height=940)
fig.update_xaxes(categoryorder="total ascending")
fig.update_layout(bargap=0.2, bargroupgap=0.2)
fig.show()


In [368]:
num_features = train.select_dtypes("number").columns.to_list()
fig = draw_dist_for_group("Numeric", num_features, height=740)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Categorical &amp; Numerical Distributions</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The training and test datasets overlap almost perfectly in all features. This means that the values derive from the same distribution, as we thought earlier. <b>This gives us hope that the results obtained from machine learning models on the training and test datasets should follow each other.</b> For the categorical features, one problem arises: since there are a lot of categories, one-hot encoding (OHE) would introduce too many features here, so we should probably use ordinal encoding. Let's have a look at the correlation matrix now.
</p>
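
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    To illustrate the difference in dimensionality, here is a minimal sketch on a hypothetical two-column slice. The category names are borrowed from the tables above, but the rows are made up. It uses <code>pd.get_dummies</code> for one-hot encoding and pandas category codes as a simple stand-in for an ordinal encoder.
</p>

```python
import pandas as pd

# Hypothetical slice, for illustration only.
toy = pd.DataFrame(
    {
        "Pain": ["depressed", "mild_pain", "extreme_pain", "alert", "mild_pain"],
        "Abdomen": ["distend_small", "distend_large", "firm", "normal", "distend_small"],
    }
)

# One-hot encoding: one column per (feature, category) pair.
one_hot = pd.get_dummies(toy)

# Integer encoding: one column per feature, codes follow the sorted category order.
ordinal = toy.apply(lambda col: col.astype("category").cat.codes)

print(one_hot.shape[1], ordinal.shape[1])
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The one-hot frame grows with the total number of categories, while the integer-encoded frame keeps one column per feature. The price is that the codes impose an arbitrary order on variables such as <code>Pain</code>, which the original description warns against treating as ordered. Tree-based models usually tolerate this better than linear ones.
</p>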

In [148]:
pearson_corr = train.corr(method="pearson", numeric_only=True).round(2)
lower_triangular_corr = (
    pearson_corr.mask(np.triu(np.ones_like(pearson_corr, dtype=bool)))
    .dropna(axis="index", how="all")
    .dropna(axis="columns", how="all")
)

color_map = [[0.0, "#FCFCFC"], [0.5, "#4A4B52"], [1.0, "#3A5A81"]]

heatmap = go.Heatmap(
    z=lower_triangular_corr,
    x=lower_triangular_corr.columns,
    y=lower_triangular_corr.index,
    text=lower_triangular_corr.fillna(""),
    texttemplate="%{text}",
    opacity=0.85,
    xgap=4,
    ygap=4,
    showscale=True,
    colorscale=color_map,
    colorbar_len=1.02,
    hoverinfo="none",
)
fig = go.Figure(heatmap)
fig.add_annotation(
    x=3.5,
    y=1,
    align="left",
    xanchor="left",
    text="<b>Lesion_2 and Lesion_3 are the most correlated pair (0.64),<br>"
    "but this is misleading, since these features are<br>"
    "zero in almost all observations.</b>",
    showarrow=False,
)
fig.update_yaxes(tickfont_size=10)
fig.update_xaxes(tickfont_size=10)
fig.update_layout(
    font_color=FONT_COLOR,
    title="Lower Triangular of Correlation Matrix (Pearson)<br>"
    "<span style='font-size: 75%; font-weight: bold;'>"
    "There are only 6 somewhat linearly correlated numerical pairs</span>",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    width=840,
    height=840,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    yaxis_autorange="reversed",
)
fig.show()


In [66]:
highest_abs_corr = (
    lower_triangular_corr.abs()
    .unstack()
    .sort_values(ascending=False)  # type: ignore
    .rename("Absolute Pearson Correlation")
    .to_frame()
    .reset_index(names=["Feature 1", "Feature 2"])
)

top_corr = highest_abs_corr[highest_abs_corr["Absolute Pearson Correlation"] >= 0.4]
top_corr.style.set_table_styles(DF_STYLE).format(precision=2)


,Feature 1,Feature 2,Absolute Pearson Correlation
0,Lesion_2,Lesion_3,0.64
1,Nasogastric_Reflux_Ph,Total_Protein,0.58
2,Total_Protein,Abdomo_Protein,0.47
3,Pulse,Packed_Cell_Volume,0.44
4,Nasogastric_Reflux_Ph,Abdomo_Protein,0.43
5,Pulse,Respiratory_Rate,0.40


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Correlation Matrix</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    As we can see, there are only six somewhat correlated pairs; however, <code>Lesion_2</code> and <code>Lesion_3</code> consist almost entirely of zeros, so their correlation is hard to interpret. <b>I also checked the Spearman correlation, but the picture is similar to the Pearson one, i.e. weakly correlated features. As the training and test datasets follow each other, the situation is similar in both of them.</b><br><br>
    Since we have the correlation matrix, we could use it for another visualisation: hierarchical clustering. I checked that, however, and there are no clear clusters that might be useful in PCA or another dimensionality reduction algorithm. Let's have a look at pair plots of the numerical variables.
</p>
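
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The Spearman and clustering checks mentioned above are not shown in the notebook. A minimal sketch of how such a check could look is given below, using a synthetic frame instead of the real data. The column names, the <code>1 - |correlation|</code> distance, the average linkage and the <code>0.5</code> cut-off are all assumptions chosen for illustration.
</p>

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic numeric frame standing in for the training data (assumption).
rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy = pd.DataFrame(
    {
        "a": base,
        "b": base + rng.normal(scale=0.1, size=300),  # strongly tied to "a"
        "c": rng.normal(size=300),  # independent noise
        "d": rng.normal(size=300),  # independent noise
    }
)

# Rank-based (Spearman) correlation between all column pairs.
spearman_corr = toy.corr(method="spearman")

# Turn correlation into a distance: strongly correlated features are close.
dist = 1 - spearman_corr.abs().to_numpy()
dist = (dist + dist.T) / 2  # guard against tiny asymmetries
np.fill_diagonal(dist, 0.0)

# Agglomerative clustering on the condensed distance vector.
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="average")

# Cut the tree: features closer than 0.5 end up in the same cluster.
labels = fcluster(Z, t=0.5, criterion="distance")
clusters = dict(zip(toy.columns, labels))
print(clusters)
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    When every feature lands in its own cluster at a reasonable cut-off, there are no groups of redundant features to merge, which matches the "no clear clusters" observation.
</p>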

In [165]:
dims = np.setdiff1d(num_features, ("Lesion_2", "Lesion_3", "Hospital_Number"))

fig = px.scatter_matrix(
    train,
    dimensions=dims,
    color="Outcome",
    color_discrete_sequence=["#815B3A", "#4A4B52", "#3A5A81"],
    symbol="Outcome",
    symbol_sequence=["cross", "circle", "diamond"],
    opacity=0.75,
    title="Numerical Features - Scatter Pair Plots",
    width=840,
    height=840,
)
fig.update_yaxes(showgrid=False)
fig.update_xaxes(showgrid=False)
fig.update_traces(diagonal_visible=False, showupperhalf=False, marker_size=1.5)
fig.update_layout(
    font_color=FONT_COLOR,
    font_size=7,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    showlegend=True,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        xanchor="right",
        y=1,
        x=1,
        itemsizing="constant",
        font_size=12,
        title="",
    ),
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Pair Plots</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    In the pair plots above, I excluded the <code>Lesion_2</code>, <code>Lesion_3</code> and <code>Hospital_Number</code> features, since they are useless in this visualization. We can see that, regardless of the outcome, the data points overlap, so there are no distinguishable clusters where, for example, <code>Outcome</code> is <i>lived</i> or <i>died</i>. Let's have a look at the categorical features once more, this time with respect to <code>Outcome</code>.
</p>

In [363]:
n_cols = 3
n_rows, axes = get_n_rows_axes(len(cat_features), n_cols)
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    y_title="Sum of Count (Normalized as Percent)",
    horizontal_spacing=0.1,
    vertical_spacing=0.1,
)
fig.update_annotations(font_size=14)

for k, (var, (row, col)) in enumerate(zip(cat_features, axes), start=1):
    for outcome, color in zip(("died", "euthanized", "lived"), ("#815B3A", "#4A4B52", "#3A5A81")):
        fig.add_histogram(
            x=train.query(f"Outcome == '{outcome}'")[var],
            marker_color=color,
            opacity=0.75,
            name=outcome,
            legendgroup=outcome,
            showlegend=k == 1,
            row=row,
            col=col,
        )
    # Axis styling is per subplot, so apply it once per feature, not once per trace.
    fig.update_xaxes(
        tickfont_size=8,
        showgrid=False,
        title_text=var,
        title_font_size=8,
        title_font_family="Arial Black",
        row=row,
        col=col,
    )
    fig.update_yaxes(tickfont_size=8, showgrid=False, row=row, col=col)

fig.update_layout(
    font_color=FONT_COLOR,
    title="Categorical Features vs Horse Health Status",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    width=840,
    height=1040,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    barnorm="percent",
    barmode="relative",
    bargap=0.3,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        xanchor="right",
        y=1.02,
        x=1,
        itemsizing="constant",
        title="",
    ),
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #4A4B52;
    border-bottom: 2px solid #3A5A81;
">
    <b>Categories vs Horse Health Status</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    Here we have bar plots related to the horse health status. Bars are normalized to percentages, so we can easily compare a specific category with the <code>Outcome</code> feature. This way we can see that:
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    <li>If the horse was treated with surgery, it had a higher chance of dying.</li>
    <li>Young horses have a higher chance of dying.</li>
    <li>If <code>Temp_Of_Extremities</code> is <i>warm</i> or <i>normal</i>, the horse has a higher chance of living.</li>
    <li>If <code>Peripheral_Pulse</code> is <i>normal</i> or <i>increased</i>, the horse has a higher chance of living. On the other hand, <i>absent</i> means a higher chance of being euthanized.</li>
    <li>If <code>Mucous_Membrane</code> is <i>pale_pink</i>, <i>bright_pink</i> or <i>normal_pink</i>, the horse has a higher chance of living.</li>
    <li>If <code>Capillary_Refill_Time</code> is <i>less_3_sec</i>, the horse has a higher chance of living.</li>
    <li><i>mild_pain</i> or <i>alert</i> in <code>Pain</code> gives the horse a higher chance of surviving.</li>
    <li><i>normal</i>, <i>hypermotile</i> or <i>hypomotile</i> in <code>Peristalsis</code> gives the horse a higher chance of surviving.</li>
    <li><i>slight</i> or <i>none</i> in <code>Abdominal_Distention</code> gives the horse a higher chance of surviving.</li>
    <li>When <code>Nasogastric_Tube</code> is <i>slight</i>, there is more than a $50$% chance that the horse lived. In the remaining categories the situation is mixed.</li>
    <li>When <code>Nasogastric_Reflux</code> is <i>slight</i> or <i>none</i>, there is a high chance that the horse lived.</li>
    <li>When <code>Rectal_Exam_Feces</code> is <i>absent</i>, there is a higher chance that the horse died.</li>
    <li>When <code>Abdomen</code> is <i>firm</i> or <i>other</i>, there is a higher chance that the horse lived.</li>
    <li>In the case of <code>Abdomo_Appearance</code>, <i>none</i> means that the horse has a higher chance of being euthanized. On the other hand, <i>clear</i> gives it a higher chance of living.</li>
    <li>If the problem (lesion) was surgical, the horse has a higher chance of dying.</li>
    <li>If pathology data is present, the horse has a higher chance of being euthanized, at the expense of surviving.</li>
</ul>
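
<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    The percentages read from the bar plots can also be obtained as a table. Below is a minimal sketch using <code>pd.crosstab</code> with row-wise normalization. The toy frame and its values are made up for illustration and do not reproduce the real dataset.
</p>

```python
import pandas as pd

# Toy data standing in for one categorical feature and the target (assumption).
toy = pd.DataFrame(
    {
        "Surgery": ["yes", "yes", "yes", "no", "no", "no", "yes", "no"],
        "Outcome": [
            "died", "lived", "died", "lived",
            "lived", "euthanized", "euthanized", "lived",
        ],
    }
)

# Share of each outcome within every category, as a percentage.
# normalize="index" makes each row sum to 1 before scaling to 100.
outcome_share = (
    pd.crosstab(toy["Surgery"], toy["Outcome"], normalize="index")
    .mul(100)
    .round(1)
)
print(outcome_share)
```

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
">
    Each row corresponds to one normalized bar in the plots above, so the same call could be repeated for every name in <code>cat_features</code>.
</p>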


Continue...