<a href="https://www.kaggle.com/code/sjagkoo7/predict-health-outcomes-of-horses-s3-ep22?scriptVersionId=144262971" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

<div style = "color: White; display: fill;
              border-radius: 5px;
              background-color: #2C7FB8;
              font-size: 100%;
              font-family: Verdana">
    
Predict whether or not a horse can survive based upon its past medical conditions. I have also used the original dataset, **horse-survival-dataset**, for reference. **Outcome** is the target variable. This is a **multi-class classification** challenge to predict horse survival using the provided features. We shall explore multi-class classification (not multi-label classification) because the competition description mentions that submissions are evaluated on the **micro-averaged F1-Score** between predicted and actual values. The micro-averaged F1-Score is applicable to multi-class classification.
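
For single-label multi-class predictions, micro-averaging pools the true positives, false positives and false negatives of all classes before computing F1, so the score comes out equal to plain accuracy. The sketch below illustrates this on made-up labels. `micro_f1` is a hypothetical helper written only for this illustration; it is not the competition's scoring code.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

# Made-up labels, for illustration only
y_true = ["lived", "died", "euthanized", "lived", "died", "lived"]
y_pred = ["lived", "lived", "euthanized", "lived", "died", "died"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = micro_f1(y_true, y_pred)
print(score, accuracy)
```

Every wrong prediction adds one false positive (for the predicted class) and one false negative (for the true class), which is why the two numbers agree.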

# Importing Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # visualization like pie plot
import seaborn as sns # visualization like scatter plot 
pd.set_option('display.max_columns', 50) # display up to 50 columns
pd.set_option('display.max_rows', 50) # display 50 rows by default

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Loading the Dataset

In [None]:
original=pd.read_csv('/kaggle/input/horse-survival-dataset/horse.csv')
train=pd.read_csv('/kaggle/input/playground-series-s3e22/train.csv')
test=pd.read_csv('/kaggle/input/playground-series-s3e22/test.csv')
submission=pd.read_csv('/kaggle/input/playground-series-s3e22/sample_submission.csv')

In [None]:
# dropping the id column: it is just a row number and carries no information about horse survival
train=train.drop('id',axis=1)
test_ids=test['id'] # kept aside under a name that does not shadow the built-in id()
test=test.drop('id',axis=1)

In [None]:
# first three rows of original dataset
original.head(3)

In [None]:
# first three rows of train dataset
train.head(3)

In [None]:
# first three rows of test dataset
test.head(3)

In [None]:
# first three rows of submission dataset
submission.head(3)

# Exploring the Dataset

In [None]:
# Summary of Datasets
def summary(df):
    data=pd.DataFrame(index=df.columns)
    data['dtypes']=df.dtypes
    data['count']=df.count()
    data['#unique']=df.nunique()
    data['#missing']=df.isna().sum()
    data['missing%']=df.isna().sum()/len(df)*100
    data=pd.concat([data,df.describe().T.drop('count',axis=1)],axis=1)
    return data

In [None]:
# Summary of Training Dataset
summary(train).style.background_gradient(cmap='YlGnBu')

In [None]:
# Summary of test Dataset
summary(test).style.background_gradient(cmap='YlOrBr')

<div style = "color: White; display: fill;
              border-radius: 5px;
              background-color: #2C7FB8;
              font-size: 100%;
              font-family: Verdana">
    
<b>Insight:</b>
* We have numerical, categorical and object columns.
* Missing values: the dataset contains a significant number of NA values, so data imputation will be required.
* The hospital number column can be treated as a categorical variable because it is an identifier rather than a measurement.
* Outcome: the target variable is `outcome`. Possible values: lived, died, euthanized.
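
As a rough illustration of the imputation mentioned above, here is a minimal sketch of one simple option: median for numeric columns and most frequent value for the rest, learned on the training data and reused for the test data. This is an assumption for illustration, not necessarily the approach used later. `fit_fill_values` and the toy frames (with made-up values) are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_fill_values(df):
    """Learn one fill value per column: median for numeric, mode for the rest."""
    fill = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fill[col] = df[col].median()
        else:
            fill[col] = df[col].mode(dropna=True).iloc[0]
    return fill

# Toy frames standing in for train/test (values are made up)
toy_train = pd.DataFrame({
    "pulse": [40.0, np.nan, 88.0, 60.0],
    "pain": ["mild_pain", None, "mild_pain", "depressed"],
})
toy_test = pd.DataFrame({"pulse": [np.nan, 72.0], "pain": [None, "depressed"]})

fill_values = fit_fill_values(toy_train)          # statistics come from train only
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)    # same values reused for test
print(toy_test_filled)
```

Learning the fill values on the training data only keeps test-set statistics out of the preprocessing.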

In [None]:
# train dataset - display duplicated rows, if any
train_duplicated_rows=train[train.duplicated()]
train_duplicated_rows

In [None]:
# test dataset - display duplicated rows, if any
test_duplicated_rows=test[test.duplicated()]
test_duplicated_rows

<div style = "color: White; display: fill;
              border-radius: 5px;
              background-color:  #2C7FB8;
              font-size: 100%;
              font-family: Verdana">
    
* There are no duplicated rows in the train and test datasets.

In [None]:
# train dataset - display rows that contain at least one missing value
train[train.isna().any(axis=1)]

In [None]:
# test dataset - display rows that contain at least one missing value
test[test.isna().any(axis=1)]

In [None]:
# Dataset Attributes Description
train.columns

<div style="border-radius:10px; border:#DEB887 solid; padding: 15px; background-color: #2C7FB8; font-size:100%; text-align:left">

<h3 align="left"><font color='#DEB887'>💡 Dataset Attributes Description:</font></h3>

<table border="1" cellpadding="5" cellspacing="0">
    <thead>
        <tr>
            <th>Attribute</th>
            <th>Description</th>
            <th>Values</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>surgery?</td>
            <td>Whether the horse had surgery</td>
            <td>1 = Yes, 2 = No</td>
        </tr>
        <tr>
            <td>Age</td>
            <td>Age category of the horse</td>
            <td>1 = Adult, 2 = Young (&lt; 6 months)</td>
        </tr>
        <tr>
            <td>Hospital Number</td>
            <td>Case number assigned to the horse</td>
            <td>Numeric ID</td>
        </tr>
        <tr>
            <td>rectal temperature</td>
            <td>Temperature in degrees Celsius</td>
            <td>Linear</td>
        </tr>
        <tr>
            <td>pulse</td>
            <td>Heart rate in beats per minute</td>
            <td>Linear</td>
        </tr>
        <tr>
            <td>respiratory rate</td>
            <td>Rate of respiration</td>
            <td>Linear</td>
        </tr>
        <tr>
            <td>temperature of extremities</td>
            <td>Indication of peripheral circulation</td>
            <td>1 = Normal, 2 = Warm, 3 = Cool, 4 = Cold</td>
        </tr>
        <tr>
            <td>peripheral pulse</td>
            <td>Subjective assessment of peripheral pulse</td>
            <td>1 = Normal, 2 = Increased, 3 = Reduced, 4 = Absent</td>
        </tr>
        <tr>
            <td>mucous membranes</td>
            <td>Measurement of color of mucous membranes</td>
            <td>1-6 as described in the given data</td>
        </tr>
        <tr>
            <td>capillary refill time</td>
            <td>Clinical judgment of capillary refill time</td>
            <td>1 = &lt; 3 seconds, 2 = &gt;= 3 seconds</td>
        </tr>
        <tr>
            <td>pain</td>
            <td>Level of pain</td>
            <td>1 = no pain, 2 = depressed, 3 = intermittent mild pain, 4 = intermittent severe pain, 5 = continuous severe pain </td>
        </tr>
        <tr>
            <td>peristalsis</td>
            <td>An indication of the activity in the horse's gut.</td>
            <td>absent,hypomotile,hypermotile,normal </td>
        </tr>
        <tr>
            <td>abdominal distention</td>
            <td>an animal with abdominal</td>
            <td>1 = none, 2 = slight, 3 = moderate, 4 = severe </td>
        </tr>
        <tr>
            <td>nasogastric tube</td>
            <td> any gas coming out of the tube</td>
            <td>1 = none, 2 = slight, 3 = significant </td>
        </tr>
        <tr>
            <td>nasogastric_reflux</td>
            <td> the greater amount of reflux, the more likelihood that there is some serious</td>
            <td>1 = none, 2 = &gt; 1 liter, 3 = &lt; 1 liter</td>
        </tr>
        <tr>
            <td>nasogastric_reflux_ph</td>
            <td> scale is from 0 to 14 with 7 being neutral - normal values are in the 3 to 4 range</td>
            <td>linear </td>
        </tr>
        <tr>
            <td>rectal_exam_feces</td>
            <td> indicates an obstruction</td>
            <td>1 = normal, 2 = increased, 3 = decreased, 4 = absent  </td>
        </tr>
        <tr>
            <td>abdomen</td>
            <td> intestine size</td>
            <td>1 = normal, 2 = other, 3 = large intestine, 4 = small intestine, 5 = distended</td>
        </tr>
        <tr>
            <td>packed_cell_volume</td>
            <td> red cells by volume in the blood - normal range is 30 to 50.</td>
            <td>linear</td>
        </tr>
        <tr>
            <td>total_protein</td>
            <td> normal values lie in the 6-7.5 (gms/dL) range - the higher the value the greater the dehydration</td>
            <td>linear</td>
        </tr>  
        <tr>
            <td>abdominocentesis appearance</td>
            <td>Appearance of fluid from abdominocentesis</td>
            <td>1 = Clear, 2 = Cloudy, 3 = Serosanguinous</td>
        </tr>
        <tr>
            <td>abdominocentesis total protein</td>
            <td>Total protein from abdominocentesis</td>
            <td>Linear (gms/dL)</td>
        </tr>
        <tr>
            <td>outcome</td>
            <td>Whether the horse survived</td>
            <td>1 = Lived, 2 = Died, 3 = Euthanized</td>
        </tr>
        <tr>
            <td>surgical lesion?</td>
            <td>If the lesion was surgical</td>
            <td>1 = Yes, 2 = No</td>
        </tr>
        <tr>
            <td>type of lesion {lesion_1,lesion_2,lesion_3}</td>
            <td>Type of lesion identified</td>
            <td>Comprehensive description given (Multiple layers)</td>
        </tr>
        <tr>
            <td>cp_data</td>
            <td>Presence of pathology data for the case</td>
            <td>1 = Yes, 2 = No</td>
        </tr>
    </tbody>
</table>

</div>
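
The table lists integer codes for most categorical attributes. If a copy of the data stores those codes rather than text labels, they can be decoded with a dictionary per column. A minimal sketch, assuming the column names `surgery` and `outcome` and using a made-up toy frame; whether a given file holds codes or labels should be checked first (for example with `train.dtypes`).

```python
import pandas as pd

# Code-to-label dictionaries copied from the table above
code_maps = {
    "surgery": {1: "yes", 2: "no"},
    "outcome": {1: "lived", 2: "died", 3: "euthanized"},
}

# Toy frame of integer codes (made-up rows, assumed column names)
coded = pd.DataFrame({"surgery": [1, 2, 1], "outcome": [1, 3, 2]})

decoded = coded.copy()
for col, mapping in code_maps.items():
    # Series.map replaces each code with its label; unknown codes become NaN
    decoded[col] = coded[col].map(mapping)
print(decoded)
```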

# Visualization

In [None]:
# Target variable distribution
fig, ax = plt.subplots(1,2,figsize=(12,5))

# with plt.subplots(1,2), ax is a 1-D array holding two axes
# ax[0] is the left subplot (pie chart) and ax[1] is the right subplot (bar chart)

ax[0].pie(x=train.outcome.value_counts(),
          explode= [0.0, 0.2, 0.2],startangle= 45,
          shadow = True,colors = ['#3377ff', '#66ffff','#809fff'],
          autopct='%.1f%%',labels=train.outcome.value_counts().index,
          textprops={'fontsize': 12, 'weight': 'bold'})
# explode -- offsets each slice from the center; the array gives the offset of each slice
# startangle -- rotates the start of the pie by the given angle in degrees
# shadow -- draws a shadow beneath the pie

sns.barplot(x=train.outcome.value_counts(),y=train.outcome.value_counts().index,ax=ax[1], palette='YlGnBu')

plt.setp(ax[1].get_yticklabels(), fontweight="bold") # fetch the y tick labels and make them bold
plt.setp(ax[1].get_xticklabels(), fontweight="bold") # fetch the x tick labels and make them bold
ax[1].set_xlabel('count',fontweight="bold") # set x label
ax[1].set_ylabel('outcome',fontweight="bold") # set y label

ax[1].spines['top'].set_visible(False) # remove the top boundary line
ax[1].spines['right'].set_visible(False) # remove the right boundary line

# it will remove the x-axis tick and label
ax[1].tick_params(
        axis='x',         
        which='both',      
        bottom=False,      
        labelbottom=False
    )

val_count=train.outcome.value_counts()
for i,v in enumerate(val_count):
    ax[1].text(v,i+0.1,str(v), fontdict={'fontsize':8,'fontweight':'bold'})
# text -- is a function that adds text to the graph
# v, i+0.1 -- the x and y coordinates where the text is placed. v is the count taken from val_count, and i+0.1 adds a small vertical offset from the center of the corresponding bar.
# str(v) -- is the text to display

fig.suptitle('Target Variable(Outcome) Distribution')
plt.tight_layout()
plt.show()

In [None]:
# splitting categorical and continuous variables

# unique value counts for each column
unique_count=train.nunique()

# threshold on the unique count used to distinguish categorical from continuous
max_unique=10

cat_cols=unique_count[unique_count<=max_unique].index.to_list()
cont_cols=unique_count[unique_count>max_unique].index.to_list()

# removing 'outcome' from the categorical variables as it is the target variable
if 'outcome' in cat_cols:
    cat_cols.remove('outcome')
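
One thing to keep in mind: a pure cardinality threshold also puts identifier-like columns (such as the hospital number) into the continuous group, since they have many distinct values. A minimal sketch of a variant that lets such columns be excluded explicitly; `split_columns`, its `id_like` parameter, the column name `hospital_number` and the toy data are assumptions for illustration.

```python
import pandas as pd

def split_columns(df, target, max_unique=10, id_like=()):
    """Split columns into categorical/continuous by number of unique values."""
    counts = df.drop(columns=[target, *id_like]).nunique()
    cat = counts[counts <= max_unique].index.to_list()
    cont = counts[counts > max_unique].index.to_list()
    return cat, cont

# Toy frame with made-up values
toy = pd.DataFrame({
    "hospital_number": range(1000, 1012),
    "pulse": [40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84],
    "surgery": ["yes", "no"] * 6,
    "outcome": ["lived", "died", "euthanized"] * 4,
})

cat_cols_toy, cont_cols_toy = split_columns(
    toy, target="outcome", id_like=["hospital_number"]
)
print(cat_cols_toy, cont_cols_toy)
```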


In [None]:
# Categorical Variable Distribution
def cat_distribution(df,columns,n_cols,hue):
    '''
    Plot a countplot for each categorical variable, split by hue.
    df: train dataset
    columns: categorical variables
    n_cols: number of subplot columns
    hue: column used to color the bars (here the target)
    '''
    n_rows=-(-len(columns)//n_cols)  # ceiling division, so every variable gets its own axis
    fig,ax=plt.subplots(n_rows,n_cols,figsize=(18,4*n_rows))
    ax=np.atleast_1d(ax).flatten()  # flatten the grid of axes into a 1-D array so it can be indexed with a single counter
    for i,column in enumerate(columns):
        sns.countplot(data=df,x=column,hue=hue,ax=ax[i],palette='viridis')
        ax[i].set_title(f'{column} Counts',fontsize=12)
        ax[i].tick_params(axis='x',rotation=10)
        
        # write the count above each bar
        for p in ax[i].patches: # patches holds the individual bar elements of the plot
            height = p.get_height()
            if np.isnan(height):
                continue  # skip bars that have no height
            ax[i].annotate(f'{height:.0f}', (p.get_x() + p.get_width() / 2, height),
                           ha='center', va='bottom', fontsize=9)
        
    # hide the axes left over when len(columns) is not a multiple of n_cols
    for j in range(len(columns),len(ax)):
        ax[j].set_visible(False)
        
    plt.tight_layout()
    plt.show()
    
cat_distribution(train,cat_cols,3,'outcome')

#### Observations :
* `lesion_2` and `lesion_3` appear to have similar distributions. When they are not 0, the horse has a high probability of not dying.
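
An observation like this can be checked numerically with a row-normalised cross-tabulation. A minimal sketch on a made-up toy frame (the numbers do not come from the competition data); on the real data, `toy` would be replaced by `train`.

```python
import pandas as pd

# Toy data standing in for train (values are made up)
toy = pd.DataFrame({
    "lesion_2": [0, 0, 0, 0, 1400, 3111],
    "outcome": ["lived", "died", "euthanized", "lived", "lived", "lived"],
})

# Label each row by whether lesion_2 is zero or non-zero
lesion_flag = toy["lesion_2"].ne(0).map({True: "non-zero", False: "zero"})

# Share of each outcome within the zero and non-zero groups (each row sums to 1)
shares = pd.crosstab(lesion_flag, toy["outcome"], normalize="index")
print(shares)
```

It is also worth looking at the raw counts, since the non-zero group may be small.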

In [None]:
# Continuous Variable Distribution
plt.figure(figsize=(14,len(cont_cols)*2.5))

for idx,column in enumerate(cont_cols):
    plt.subplot(len(cont_cols),1,idx+1)  # one full-width plot per row, so no half of the figure is left empty
    sns.histplot(data=train,x=column,hue='outcome',bins=30,kde=True,palette='Set1')
    plt.title(f'{column} distribution for outcome')
    
plt.tight_layout()
plt.show()

In [None]:
# Continuous Variable Distribution with Outlier Check
df=pd.concat([train[cont_cols].assign(Source='train'),
             test[cont_cols].assign(Source='test'),
             original[cont_cols].assign(Source='original')],axis=0,ignore_index=True)
df.head(2)

In [None]:
# for each continuous variable: a KDE plot comparing the three sources, then one box plot per source
fig,ax=plt.subplots(len(cont_cols), 4 ,figsize = (16, len(cont_cols) * 4.2))

for i,column in enumerate(cont_cols):
    #plotting kde plot
    sns.kdeplot(data=df[[column,'Source']],x=column,hue='Source',ax=ax[i,0])
    ax[i,0].grid(visible=True, which = 'both', linestyle = '--', color='lightgrey', linewidth = 0.75)
    ax[i,0].set(xlabel='',ylabel='')
    ax[i,0].set_title(f'{column}',fontdict={'fontweight':'bold','fontsize':9})
    
    #plotting box plot to check outliers
    #train dataset
    sns.boxplot(data=df.loc[df.Source=='train',[column]],y=column,ax=ax[i,1],color='#037d97')
    ax[i,1].set(xlabel='',ylabel='')
    ax[i,1].set_title('train',fontdict={'fontweight':'bold','fontsize':9})
    
    #test dataset
    sns.boxplot(data=df.loc[df.Source=='test',[column]],y=column,ax=ax[i,2],color='#E4591E')
    ax[i,2].set(xlabel='',ylabel='')
    ax[i,2].set_title('test',fontdict={'fontweight':'bold','fontsize':9})
    
    #original dataset
    sns.boxplot(data=df.loc[df.Source=='original',[column]],y=column,ax=ax[i,3],color='#33ccff')
    ax[i,3].set(xlabel='',ylabel='')
    ax[i,3].set_title('original',fontdict={'fontweight':'bold','fontsize':9})

plt.tight_layout()
plt.show()

#### Observations :
* There are outliers in the features. Some outliers appear in the train and test datasets but are not present in the original dataset, for example in `total_protein`.
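
To put numbers on what the box plots show, outliers can be counted with the same 1.5 × IQR rule that box-plot whiskers use by default. A minimal sketch; `iqr_outlier_mask` is a hypothetical helper and the series holds made-up values, not real `total_protein` measurements.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one obvious outlier
toy_protein = pd.Series([6.0, 6.5, 7.0, 7.5, 8.0, 65.0], name="total_protein")

mask = iqr_outlier_mask(toy_protein)
print(toy_protein[mask])
```

Applying the helper separately to the train, test and original data would show which sources contribute the flagged values.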

# References
* https://www.kaggle.com/code/kimtaehun/eda-and-baseline-with-multiple-models
* https://www.kaggle.com/code/ravi20076/playgrounds3e22-eda-baseline
* https://www.kaggle.com/code/yaaangzhou/playground-s3-e22-eda-modeling#Predict-Health-Outcomes-of-Horses

Thank You :)