<a href="https://www.kaggle.com/code/oscarm524/ps-s3-ep22-eda-modeling-submission?scriptVersionId=142774449" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<a id="table"></a>
<h1 style="background-color:lightgray;font-family:newtimeroman;font-size:350%;text-align:center;border-radius: 15px 50px;">Table of Contents</h1>

[1. Notebook Versions](#1)

[2. Loading Libraries](#2)

[3. Reading Data Files](#3)

[4. Data Exploration](#4)


<a id="1"></a>
# <h1 style="background-color:lightgray;font-family:newtimeroman;font-size:350%;text-align:center;border-radius: 15px 50px;">Notebook Versions</h1>

- Version 1 (09/11/2023)
    * EDA 
    
    
- Version 2 (09/12/2023)
    * EDA updated
    
<a id="2"></a>
# <h1 style="background-color:lightgray;font-family:newtimeroman;font-size:350%;text-align:center;border-radius: 15px 50px;">Loading Libraries</h1>    

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd; pd.set_option('display.max_columns', 100)
import numpy as np

from tqdm.notebook import tqdm

import re

from functools import partial
import scipy as sp

import matplotlib.pyplot as plt; plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay, cohen_kappa_score, log_loss
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import RFE, RFECV
from sklearn.isotonic import IsotonicRegression
from sklearn.calibration import CalibrationDisplay
from sklearn.inspection import PartialDependenceDisplay
from sklearn.linear_model import LogisticRegression
from collections import Counter
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

<a id="3"></a>
# <h1 style="background-color:lightgray;font-family:newtimeroman;font-size:350%;text-align:center;border-radius: 15px 50px;">Reading Data Files</h1> 

In [None]:
train = pd.read_csv('../input/playground-series-s3e22/train.csv')
test = pd.read_csv('../input/playground-series-s3e22/test.csv')
original = pd.read_csv('../input/horse-survival-dataset/horse.csv')

print('The dimension of the train dataset is:', train.shape)
print('The dimension of the test dataset is:', test.shape)
print('The dimension of the original train dataset is:', original.shape)

In [None]:
train.head()

<a id="4"></a>
# <h1 style="background-color:lightgray;font-family:newtimeroman;font-size:350%;text-align:center;border-radius: 15px 50px;">Data Exploration</h1>

First, we explore the competition dataset. We start by visualizing `outcome`, the variable of interest.

In [None]:
sns.countplot(data = train, x = 'outcome')
plt.ylabel('Frequency');

From the count plot, the most frequent label is `lived` and the least frequent is `euthanized`. Next, we explore the relationships between the categorical features and `outcome`.
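The label frequencies read off the count plot can also be expressed as proportions before moving on to the categorical features. The snippet below is a minimal sketch of that idea; `toy_train` is a hypothetical stand-in with made-up rows, not the competition data, so its numbers carry no meaning.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train` (made-up rows, for illustration only).
toy_train = pd.DataFrame(
    {'outcome': ['lived', 'lived', 'lived', 'died', 'died', 'euthanized']}
)

# Share of each label instead of raw counts.
outcome_share = toy_train['outcome'].value_counts(normalize=True)
print(outcome_share)
```

On the real data, the same call would be `train['outcome'].value_counts(normalize = True)`.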

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (25, 17))

cmap = sns.diverging_palette(100, 7, s = 75, l = 40, n = 5, center = 'light', as_cmap = True)

sns.heatmap(data = pd.crosstab(train['surgery'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[0, 0])
sns.heatmap(data = pd.crosstab(train['age'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[0, 1])
sns.heatmap(data = pd.crosstab(train['temp_of_extremities'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[1, 0])
sns.heatmap(data = pd.crosstab(train['peripheral_pulse'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[1, 1]);

From the above heatmaps, these are some observations:

- `young` horses are more likely to die.
- Horses with `temp_of_extremities = normal` are more likely to live.
- Horses with `peripheral_pulse = normal` are more likely to live.
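Statements such as "more likely to live" are easier to check with row proportions than with raw counts, because the categories have very different sizes. The sketch below row-normalizes a crosstab with `pd.crosstab(..., normalize = 'index')`; `toy_train` is a hypothetical frame with made-up rows, used only so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train` (made-up rows, for illustration only).
toy_train = pd.DataFrame({
    'age': ['adult', 'adult', 'adult', 'adult', 'young', 'young'],
    'outcome': ['lived', 'lived', 'died', 'euthanized', 'died', 'lived'],
})

# Each row sums to 1: the share of every outcome within one level of the feature.
row_share = pd.crosstab(toy_train['age'], toy_train['outcome'], normalize='index')
print(row_share.round(2))
```

The resulting table can be passed to `sns.heatmap` in the same way as the count tables, with `fmt = '.2f'` in place of `fmt = '.0f'`.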

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (25, 17))

sns.heatmap(data = pd.crosstab(train['mucous_membrane'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[0, 0])
sns.heatmap(data = pd.crosstab(train['capillary_refill_time'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[0, 1])
sns.heatmap(data = pd.crosstab(train['pain'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[1, 0])
sns.heatmap(data = pd.crosstab(train['peristalsis'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[1, 1]);

From the above heatmaps, these are some observations:

- Horses with `mucous_membrane = normal_pink` are more likely to live.
- There are only two observations with `capillary_refill_time = 3`.
- Horses with `pain = mild_pain` are more likely to live.
- There is only one observation with `pain = slight`.
- There is only one observation with `peristalsis = distend_small`.
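Levels with one or two observations, like those noted above, can also be listed programmatically rather than read off each heatmap. The helper below is a minimal sketch; `rare_levels` and its `min_count` threshold are hypothetical names introduced here, and `toy_train` contains made-up rows rather than the competition data.

```python
import pandas as pd

def rare_levels(df, columns, min_count=5):
    """Return {column: {level: count}} for levels seen fewer than `min_count` times."""
    rare = {}
    for col in columns:
        counts = df[col].value_counts()
        low = counts[counts < min_count]
        if not low.empty:
            rare[col] = low.to_dict()
    return rare

# Hypothetical toy frame standing in for `train` (made-up rows, for illustration only).
toy_train = pd.DataFrame({
    'pain': ['mild_pain'] * 6 + ['depressed'] * 5 + ['slight'],
    'peristalsis': ['hypomotile'] * 11 + ['distend_small'],
})

rare = rare_levels(toy_train, ['pain', 'peristalsis'], min_count=5)
print(rare)
```

How such levels are then treated (merged, recoded, or left alone) is a modeling choice that this sketch does not make.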

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (25, 17))

sns.heatmap(data = pd.crosstab(train['abdominal_distention'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[0, 0])
sns.heatmap(data = pd.crosstab(train['nasogastric_tube'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[0, 1])
sns.heatmap(data = pd.crosstab(train['nasogastric_reflux'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[1, 0])
sns.heatmap(data = pd.crosstab(train['rectal_exam_feces'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[1, 1]);

From the above heatmaps, these are some observations:

- Horses with `abdominal_distention = slight` are more likely to live.
- There is only one observation with `nasogastric_reflux = slight`.
- There is only one observation with `rectal_exam_feces = serosanguious`.

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (25, 17))

sns.heatmap(data = pd.crosstab(train['abdomen'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[0, 0])
sns.heatmap(data = pd.crosstab(train['abdomo_appearance'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[0, 1])
sns.heatmap(data = pd.crosstab(train['surgical_lesion'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[1, 0])
sns.heatmap(data = pd.crosstab(train['cp_data'], train['outcome']), annot = True, cmap = cmap, fmt = '.0f', ax = axes[1, 1]);

From the above heatmaps, these are some observations:

- Horses with `abdomo_appearance = clear` are more likely to live.
- Horses with `surgical_lesion = no` are more likely to live.

Next, we explore potential relationships between the numeric input features and `outcome`.

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (25, 17))

sns.boxplot(ax = axes[0, 0], data = train, x = 'outcome', y = 'rectal_temp');
sns.boxplot(ax = axes[0, 1], data = train, x = 'outcome', y = 'pulse');
sns.boxplot(ax = axes[1, 0], data = train, x = 'outcome', y = 'respiratory_rate');
sns.boxplot(ax = axes[1, 1], data = train, x = 'outcome', y = 'nasogastric_reflux_ph');

From the above boxplots, these are some observations:

- `rectal_temp` distributions are similar across the three different labels of `outcome`.
- On average, horses that lived have a lower `pulse` than horses that died or were euthanized.
- On average, `respiratory_rate` is slightly lower for horses that lived than for horses that died.
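The "on average" comparisons above can be backed by a grouped summary alongside the boxplots. The snippet below is a minimal sketch using `groupby` and `agg`; `toy_train` is a hypothetical frame with made-up values, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train` (made-up values, for illustration only).
toy_train = pd.DataFrame({
    'outcome': ['lived', 'lived', 'died', 'died', 'euthanized', 'euthanized'],
    'pulse': [40.0, 50.0, 80.0, 100.0, 70.0, 90.0],
    'respiratory_rate': [20.0, 24.0, 30.0, 40.0, 28.0, 32.0],
})

# Mean and median of each numeric feature within every outcome label.
summary = toy_train.groupby('outcome')[['pulse', 'respiratory_rate']].agg(['mean', 'median'])
print(summary)
```

Reporting the median next to the mean is useful here because the boxplots draw the median, not the mean, as the center line.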

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (25, 17))

sns.boxplot(ax = axes[0, 0], data = train, x = 'outcome', y = 'packed_cell_volume');
sns.boxplot(ax = axes[0, 1], data = train, x = 'outcome', y = 'total_protein');
sns.boxplot(ax = axes[1, 0], data = train, x = 'outcome', y = 'abdomo_protein');
sns.boxplot(ax = axes[1, 1], data = train, x = 'outcome', y = 'lesion_1');

From the above boxplots, these are some observations:

- On average, `euthanized` horses have a higher `packed_cell_volume`.
- On average, `euthanized` horses have a higher `total_protein`.