# Horse colic
## Project goal
The goal of this project is to train a model that predicts horse colic complications, based on the [Horse Colic Data Set](https://archive.ics.uci.edu/ml/datasets/Horse+Colic).
This is a multiclass classification problem: the resulting model should predict whether a horse with colic will survive, die or be euthanized.

## Data acquisition, analysis and pre-processing
### Data downloading
Dataset is split into 3 parts:
- file *horse-colic.data* with training data - 300 records
- file *horse-colic.test* with test data - 68 records
- file *horse-colic.names* with dataset description and column names

The data is downloaded programmatically and saved into the *data* subfolder of the current working directory. However, *horse-colic.names* needs to be parsed before pandas can use it as a column name definition. The cell below does this and saves the parsed column names into the *horse-colic.names.parsed* file.

In [1]:
URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic'
TRAIN_DATA_FILE_NAME = 'horse-colic.data'
TEST_DATA_FILE_NAME = 'horse-colic.test'
NAMES_FILE_NAME = 'horse-colic.names'
DATA_DIRECTORY_PATH = './data'
COLUMN_NAMES = []

import os
import re
import requests

#function downloads the 3 dataset files and saves them in the given path. it also creates a 4th file which contains the parsed, dataframe-ready column names
def download_data(url=URL, train_file_name=TRAIN_DATA_FILE_NAME, test_file_name=TEST_DATA_FILE_NAME,
                  names_file_name=NAMES_FILE_NAME, path=DATA_DIRECTORY_PATH):
    try:
        os.mkdir(path)
    except FileExistsError:
        print(f"Directory {path} already exists")
    for filename in [train_file_name, test_file_name, names_file_name]:
        download_save_file(url, filename, path=path)
    parse_names(names_file_name, f"{names_file_name}.parsed", path=path) #pass the path on, otherwise a custom path is ignored

#function downloads a single file and saves it in the given path
def download_save_file(url, filename, path=DATA_DIRECTORY_PATH):
    response = requests.get(f"{url}/{filename}")
    response.raise_for_status() #fail early instead of saving an HTTP error page as data
    with open(f"{path}/{filename}", 'w') as f:
        f.write(response.content.decode('utf-8').replace(' \n', '\n')) #trailing spaces at line endings lead to problems when parsing with sep=' '

#function extracts column names from the names file (the file also contains column descriptions)
def parse_names(filename_in, filename_out, path=DATA_DIRECTORY_PATH):
    pattern = r'^\d+:\s*' #columns are generally listed like '{id}:'
    names = []
    with open(f"{path}/{filename_in}") as f:
        for line in (raw.strip() for raw in f):
            if re.match(pattern, line):
                names.append(re.sub(pattern, '', line))
    last_name = names.pop()
    names.extend(['lesion type 1', 'lesion type 2', 'lesion type 3']) #the lesion columns are listed in a different way, so they are added manually before the last column
    names.append(last_name)
    names[names.index("pain - a subjective judgement of the horse's pain level")] = "pain" #this column name is listed together with its description
    with open(f"{path}/{filename_out}", "w") as f:
        f.write(','.join(names))
    COLUMN_NAMES[:] = names #save it in a variable for later use (replaced, not extended, so re-running the cell does not duplicate names)

download_data()

Directory ./data already exists
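
As a quick illustration of how the `^\d+:\s*` pattern used in `parse_names` behaves, the sketch below applies it to a few made-up lines written in the style of *horse-colic.names*. The sample lines and the `extract_names` helper are assumptions for illustration only, not the actual file content. It shows that value descriptions are skipped, and that an entry listed under a group of ids is skipped as well, which is why such names have to be added by hand.

```python
import re

PATTERN = r'^\d+:\s*'  # a single id followed by a colon, e.g. "2:  Age"

def extract_names(lines):
    """Return the names of all lines that look like '{id}: {name}'."""
    names = []
    for line in (raw.strip() for raw in lines):
        if re.match(PATTERN, line):
            names.append(re.sub(PATTERN, '', line))
    return names

# Hypothetical lines, formatted like the dataset description
sample = [
    "  1:  surgery?",
    "          1 = Yes, it had surgery",
    "  2:  Age",
    " 25, 26, 27: type of lesion",
]
parsed = extract_names(sample)
print(parsed)
```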


### Data analysis
#### Missing values
Many fields contain null values: according to the dataset description, about 30% of the values are missing. Only 7 of the 28 columns have a value in every row, and only 6 of the 300 rows have a value in every column. Handling these missing values will be quite challenging.
Rows with a missing value in the outcome column should definitely be dropped, because outcome is the value that we want to predict.

In [2]:
import pandas as pd

#prevent information loss by reading lesion codes as strings - they are encoded with leading 0s
dtype = {'lesion type 1': str, 'lesion type 2': str, 'lesion type 3': str }

horse_colic_train = pd.read_csv("./data/horse-colic.data", sep=' ', na_values='?', header=None, names=COLUMN_NAMES, dtype=dtype)
horse_colic_test = pd.read_csv("./data/horse-colic.test", sep=' ', na_values='?', header=None, names=COLUMN_NAMES, dtype=dtype)
horse_colic_train.info() #columns without nulls counted manually
print(f"not null rows count: {horse_colic_train[horse_colic_train.notnull().all(1)].shape[0]}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 28 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   surgery?                     299 non-null    float64
 1   Age                          300 non-null    int64  
 2   Hospital Number              300 non-null    int64  
 3   rectal temperature           240 non-null    float64
 4   pulse                        276 non-null    float64
 5   respiratory rate             242 non-null    float64
 6   temperature of extremities   244 non-null    float64
 7   peripheral pulse             231 non-null    float64
 8   mucous membranes             253 non-null    float64
 9   capillary refill time        268 non-null    float64
 10  pain                         245 non-null    float64
 11  peristalsis                  256 non-null    float64
 12  abdominal distension         244 non-null    float64
 13  nasogastric tube    
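
The number of complete columns was read off the `info()` output by hand. A minimal sketch of how the same summary could be computed directly is shown below on a small made-up frame (`toy` and its values are illustrative assumptions, not the real data).

```python
import numpy as np
import pandas as pd

# Small illustrative frame with two missing cells
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, 5.0, 6.0],
    "c": [7.0, np.nan, 9.0],
})

complete_columns = int(toy.notna().all(axis=0).sum())   # columns with a value in every row
complete_rows = int(toy.notna().all(axis=1).sum())      # rows with a value in every column
missing_share = float(toy.isna().to_numpy().mean())     # fraction of all cells that are missing

print(complete_columns, complete_rows, round(missing_share, 3))
```

Applied to `horse_colic_train`, the same three expressions should reproduce the figures quoted in the text above.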

#### Column types
All 27 columns are numerical, but only because every categorical column is encoded as numbers. We will need to deal with that.
Continuous variables are fine as they are (for now we leave out data scaling etc.).
Categorical variables on an ordinal scale will stay encoded as numbers, although some modifications may be required, e.g. when the encoding does not represent the natural order well.
Categorical variables on a nominal scale will be decoded back to strings. These variables will be dummy encoded in later steps, so strings are the better choice because they improve readability.

In [3]:
#it can be binary encoded with '0' and '1'
horse_colic_train['surgery?'].value_counts()

1.0    180
2.0    119
Name: surgery?, dtype: int64

In [4]:
#2 possible age values - according to the description '1' means an adult horse (> 6 months) and '2' a young horse (< 6 months)
#however the actual column values are different. we assume that '9' means that the horse is young
#it's also better to rename this column to 'adult?' and encode it as binary
horse_colic_train['Age'].value_counts()

1    276
9     24
Name: Age, dtype: int64

In [5]:
#this column is the horse's unique identifier. some values are repeated because one horse may have more than one row in the dataset
#this column may be useful for building an interesting time series, but there is not enough data, so we will probably delete it.
horse_colic_train['Hospital Number']

0       530101
1       534817
2       530334
3      5290409
4       530255
        ...   
295     533886
296     527702
297     529386
298     530612
299     534618
Name: Hospital Number, Length: 300, dtype: int64

In [6]:
#the next 3 columns (rectal temperature, pulse, respiratory rate) are continuous. They will probably be very relevant because they describe the horse's basic vital signs.
#pulse and respiratory rate have a max value much bigger than the 3rd quartile, but as we see these are not isolated cases, so we don't treat them as outliers
print(horse_colic_train.iloc[:, 3:6].describe())
print(horse_colic_train['pulse'].sort_values(ascending=False).head())
print(horse_colic_train['respiratory rate'].sort_values(ascending=False).head())

       rectal temperature       pulse  respiratory rate
count          240.000000  276.000000        242.000000
mean            38.167917   71.913043         30.417355
std              0.732289   28.630557         17.642231
min             35.400000   30.000000          8.000000
25%             37.800000   48.000000         18.500000
50%             38.200000   64.000000         24.500000
75%             38.500000   88.000000         36.000000
max             40.800000  184.000000         96.000000
255    184.0
3      164.0
55     160.0
275    150.0
41     150.0
Name: pulse, dtype: float64
39     96.0
106    96.0
269    90.0
186    90.0
244    88.0
Name: respiratory rate, dtype: float64


In [7]:
#this column is in ordinal scale, so we will keep it encoded. however we will need to change encoding because
#current one doesn't represent natural order (1 = Normal, 2 = Warm, 3 = Cool, 4 = Cold)
#in case of regression model we will probably delete this column because it's strongly correlated with rectal temperature
horse_colic_train['temperature of extremities'].value_counts()

3.0    109
1.0     78
2.0     30
4.0     27
Name: temperature of extremities, dtype: int64

In [8]:
#this column's encoding should also be fixed, because for now it doesn't represent the natural order (1 = normal, 2 = increased, 3 = reduced, 4 = absent)
#we will consider removing this column because its values are missing in 69 rows and it's very subjective
horse_colic_train['peripheral pulse'].value_counts()

1.0    115
3.0    103
4.0      8
2.0      5
Name: peripheral pulse, dtype: int64

In [9]:
#this column ordering is also wrong, but this time we only need to swap '5' with '6'
#encoding meaning - 1 = normal pink, 2 = bright pink, 3 = pale pink, 4 = pale cyanotic, 5 = bright red / injected, 6 = dark cyanotic
#dataset description specifies 4 groups for these 6 values, so we would be able to reduce value count, however we will not do it
#because we could lose some information and value count is not critical (ordinal scale = not dummies = no extra dimensions)
horse_colic_train['mucous membranes'].value_counts()

1.0    79
3.0    58
4.0    41
2.0    30
5.0    25
6.0    20
Name: mucous membranes, dtype: int64
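
The swap of codes '5' and '6' described above can be sketched with a plain `Series.map`. This is a hedged illustration on made-up values (the `membranes` series is an assumption, not the real column); codes that are absent from the mapping are kept by filling the gaps with the original values.

```python
import numpy as np
import pandas as pd

# Illustrative values only; NaN stands for a missing observation
membranes = pd.Series([1.0, 5.0, 6.0, np.nan, 3.0], name='mucous membranes')

swap = {5.0: 6.0, 6.0: 5.0}
# map() returns NaN for codes that are not in the mapping,
# so fillna() restores the original value for those rows
reordered = membranes.map(swap).fillna(membranes)
print(reordered.tolist())
```

Missing values stay missing, because both the mapped and the original value are NaN for those rows.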

In [10]:
#this column can be binary encoded. according to the dataset description there are only 2 possible values - '1' and '2', so we have to clear the value in rows with '3' (meaningless)
#The capillary refill time (CRT) test is a rapid test used for assessing the blood flow through peripheral tissues.
print(horse_colic_train['capillary refill time'].value_counts())

horse_colic_train.loc[horse_colic_train['capillary refill time'] == 3, 'capillary refill time'] = None
print(horse_colic_train['capillary refill time'].value_counts())

1.0    188
2.0     78
3.0      2
Name: capillary refill time, dtype: int64
1.0    188
2.0     78
Name: capillary refill time, dtype: int64


In [11]:
#this column looks like it is on an ordinal scale, but according to the dataset description we should not treat it as an ordered variable.
#we will use dummy variables, but before that we will decode it back to its text representation (improved readability)
horse_colic_train['pain'].value_counts()

3.0    67
2.0    59
5.0    42
4.0    39
1.0    38
Name: pain, dtype: int64
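
A minimal sketch of the decode-then-dummy-encode idea for a nominal column such as `pain` follows. The text labels are the ones used for this column later in the notebook; the `codes` series is made up for illustration, and `pd.get_dummies` stands in for whichever encoder is eventually used.

```python
import pandas as pd

pain_labels = {1: 'alert, no pain', 2: 'depressed', 3: 'intermittent mild pain',
               4: 'intermittent severe pain', 5: 'continuous severe pain'}

# Illustrative codes only
codes = pd.Series([3, 1, 5, 3], name='pain')

decoded = codes.map(pain_labels)                    # numbers -> readable labels
dummies = pd.get_dummies(decoded, prefix='pain')    # one indicator column per label seen
print(dummies.columns.tolist())
```

Each row ends up with exactly one active indicator, and the column names stay readable because the labels were decoded first.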

In [None]:
#Peristalsis is the automatic wave-like movement of the muscles that line the gastrointestinal tract.
#this column will be treated as a nominal variable because the meaning of its values is not clear
horse_colic_train['peristalsis'].value_counts()

In [None]:
#this column is an ordinal variable and this time it's encoded correctly (1 = none, 2 = slight, 3 = moderate, 4 = severe) - however we could subtract one from each value to start the encoding from 0
#moreover this variable is marked as an IMPORTANT parameter in the dataset description
horse_colic_train['abdominal distension'].value_counts()

In [None]:
#according to dataset description this column refers to any gas coming out of the tube
#encoding is correct with a good order (1 = none, 2 = slight, 3 = significant)
horse_colic_train['nasogastric tube'].value_counts()

In [None]:
#column order is not correct (1 = none, 2 = > 1 liter, 3 = < 1 liter)
horse_colic_train['nasogastric reflux'].value_counts()

In [None]:
#this column has values in only 53 rows! we will probably delete it
horse_colic_train['nasogastric reflux PH'].describe()

In [None]:
#these columns will be encoded with dummy variables
print(horse_colic_train['rectal examination - feces'].value_counts())
print(horse_colic_train['abdomen'].value_counts())

In [None]:
#this column is a continuous variable, values look good - probably no outliers
print(horse_colic_train['packed cell volume'].describe())
print(horse_colic_train['packed cell volume'].sort_values(ascending=False).head())

In [None]:
#this column is a continuous variable, values also look good - probably no outliers
print(horse_colic_train['total protein'].describe())
print(horse_colic_train['total protein'].sort_values(ascending=False).head())

In [None]:
#we will remove these columns because they have many missing values
print(horse_colic_train['abdominocentesis appearance'].value_counts())
print(horse_colic_train['abdomcentesis total protein'].describe())

In [None]:
#we want to predict the value of this column - it's missing in one row of the training set, so this row will be removed
horse_colic_train['outcome'].value_counts()

In [12]:
#this column indicates whether the horse had a lesion. it will be removed because we are more interested in the total number of lesions - see below
print(horse_colic_train['surgical lesion?'].value_counts())

1    191
2    109
Name: surgical lesion?, dtype: int64


In [19]:
#the next 3 columns describe the horse's lesions. at most 3 lesions are listed per row, so a horse can have 0, 1, 2 or 3 lesions. the statistics are computed and shown below
_lesions = horse_colic_train.loc[:, ['lesion type 1', 'lesion type 2', 'lesion type 3']]
_has_lesion = _lesions != '00000' #True where a lesion slot is filled
_lesions_counts = [
    (~_has_lesion.iloc[:, 0]).sum(),
    (_has_lesion.iloc[:, 0] & ~_has_lesion.iloc[:, 1]).sum(),
    (_has_lesion.iloc[:, 0] & _has_lesion.iloc[:, 1] & ~_has_lesion.iloc[:, 2]).sum(),
    (_has_lesion.iloc[:, 0] & _has_lesion.iloc[:, 1] & _has_lesion.iloc[:, 2]).sum()
]

for i, count in enumerate(_lesions_counts):
    print(f"{count} horse(s) with {i} lesion(s)")
print(f"-------------------------\ntotal={sum(_lesions_counts)}")

56 horse(s) with 0 lesion(s)
237 horse(s) with 1 lesion(s)
6 horse(s) with 2 lesion(s)
1 horse(s) with 3 lesion(s)
-------------------------
total=300
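
The same per-horse lesion count can also be obtained in one step by counting the non-empty lesion slots in each row. The sketch below uses a made-up frame (`toy_lesions` and its codes are assumptions for illustration) and treats `'00000'` as "no lesion", as in the cell above.

```python
import pandas as pd

# Illustrative lesion codes only
toy_lesions = pd.DataFrame({
    'lesion type 1': ['11300', '02208', '00000', '03111'],
    'lesion type 2': ['00000', '03205', '00000', '00000'],
    'lesion type 3': ['00000', '00000', '00000', '00000'],
})

# A slot counts as a lesion when it is not the all-zero code
lesions_per_horse = (toy_lesions != '00000').sum(axis=1)
counts = lesions_per_horse.value_counts().sort_index()
print(counts)
```

Unlike the chained conditions, this does not rely on the slots being filled from left to right.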


In [20]:
#moreover each lesion is composed of 4 parts: site of lesion (11 possible values),
# type (4 possible values), subtype (3 possible values) and "specific code" (11 possible values).
#in total this gives (11 + 4 + 3 + 11) * 3 = 87 dummy variables!
#that's definitely too many, so our approach will use the details of only one lesion (the newest one) plus the total number of lesions.
_lesions.head()

Unnamed: 0,lesion type 1,lesion type 2,lesion type 3
0,11300,0,0
1,2208,0,0
2,0,0,0
3,2208,0,0
4,4300,0,0


In [26]:
#lesion parts are encoded positionally but the code length is not fixed. the 1st part can have 1 or 2 digits (range from 00 to 11), the 2nd and 3rd are fixed to 1 digit, the 4th part can have 1 or 2 digits (range from 0 to 10).
#As we see, we practically always have 5 digits. In this case, if the first 2 digits are within the range of the 1st part, we will
#always interpret them as the 1st part. Otherwise we will interpret the last 2 digits as the 4th part.
print(_lesions['lesion type 1'].str.len().value_counts())
print(_lesions['lesion type 2'].str.len().value_counts())
print(_lesions['lesion type 3'].str.len().value_counts())

5    300
Name: lesion type 1, dtype: int64
5    300
Name: lesion type 2, dtype: int64
5    299
6      1
Name: lesion type 3, dtype: int64
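
The splitting rule described in the comment above can be sketched as a small standalone function. `split_lesion` is a hypothetical helper written for illustration, and it assumes the code is a string of 5 or 6 digits.

```python
def split_lesion(code):
    """Split a lesion code into (site, type, subtype, specific code).

    Assumes `code` is a string of 5 or 6 digits.
    """
    if len(code) == 6 or int(code[:2]) <= 11:
        # 2-digit site, then type, subtype and the remaining specific code
        parts = (code[:2], code[2], code[3], code[4:])
    else:
        # 1-digit site, so the last 2 digits form the specific code
        parts = (code[0], code[1], code[2], code[3:])
    return tuple(int(p) for p in parts)

print(split_lesion('11300'))   # site 11, type 3
print(split_lesion('02208'))   # site 2, type 2, specific code 8
print(split_lesion('31110'))   # first two digits exceed 11, so the site is 3
```

The rule is a heuristic: a 5-digit code whose first two digits happen to be at most 11 is always read with a 2-digit site, even if a 1-digit site with a 2-digit specific code was meant.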


In [None]:
#meaning of this column is not clear, so it will be deleted
horse_colic_train['cp_data'].value_counts()

#### Test set quick inspection
Now we will look at the test set. We will check whether every column that has no NA values in the train set is also fully filled in the test set. If it is not and we don't handle this now, it could lead to problems during model evaluation. We won't modify the test set itself! If such columns exist, we will adjust our pipeline in the next section.

In [None]:
horse_colic_test.info()

In [None]:
#we had an invalid value in this column in the train set, but in the test set it's fine
horse_colic_test['capillary refill time'].value_counts()

In [None]:
#columns with NA in test AND without NA in train (we don't care about the opposite direction)
horse_colic_test.isna().any() & ~horse_colic_train.isna().any()
#all columns are False, so our test set is fine (we will handle NA in the outcome column later)
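
To make the logic of this check easier to follow, here is a minimal sketch on two made-up frames (`train`, `test` and their values are assumptions for illustration). A column is flagged only when it has missing values in the test frame and none in the train frame.

```python
import numpy as np
import pandas as pd

# Illustrative frames only
train = pd.DataFrame({"x": [1.0, 2.0], "y": [np.nan, 1.0], "z": [1.0, 2.0]})
test = pd.DataFrame({"x": [np.nan, 2.0], "y": [np.nan, 1.0], "z": [3.0, 4.0]})

# True where test has NA but train has none
mask = test.isna().any() & ~train.isna().any()
problem_columns = mask[mask].index.tolist()
print(problem_columns)
```

Here `x` is flagged, `y` is not (train already contains NA, so an imputer fitted on train knows the column) and `z` is complete in both.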

### Data pre-processing
Now, based on the last section, we will prepare our data for machine learning algorithms. We will use an sklearn Pipeline so that these steps are reusable (and can be applied to the test set in the evaluation phase).
We start by dropping columns. Then we drop the rows with a missing value in the columns that have at most one missing value - in our case 'surgery?' and 'outcome'.

In [135]:
#https://stackoverflow.com/questions/68402691/adding-dropping-column-instance-into-a-pipeline
class ColumnDropperTransformer:
    def __init__(self,columns):
        self.columns=columns

    def transform(self,X,y=None):
        return X.drop(self.columns,axis=1)

    def fit(self, X, y=None):
        return self

columnDropper = ColumnDropperTransformer(['Hospital Number','temperature of extremities',
                                          'peripheral pulse', 'nasogastric reflux PH',
                                          'abdominocentesis appearance', 'abdomcentesis total protein',
                                          'surgical lesion?', 'cp_data'
                                          ])

horse_colic_train_cp = horse_colic_train.copy()
horse_colic_train_cp = columnDropper.transform(horse_colic_train_cp)
horse_colic_train_cp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   surgery?                    299 non-null    float64
 1   Age                         300 non-null    int64  
 2   rectal temperature          240 non-null    float64
 3   pulse                       276 non-null    float64
 4   respiratory rate            242 non-null    float64
 5   mucous membranes            253 non-null    float64
 6   capillary refill time       266 non-null    float64
 7   pain                        245 non-null    float64
 8   peristalsis                 256 non-null    float64
 9   abdominal distension        244 non-null    float64
 10  nasogastric tube            196 non-null    float64
 11  nasogastric reflux          194 non-null    float64
 12  rectal examination - feces  198 non-null    float64
 13  abdomen                     182 non

In [136]:
class RowDropperTransformer:
    def __init__(self,columns):
        self.columns=columns

    def transform(self,X,y=None):
        return X.dropna(subset = self.columns)

    def fit(self, X, y=None):
        return self

rowDropper = RowDropperTransformer(['surgery?', 'outcome'])
horse_colic_train_cp = rowDropper.transform(horse_colic_train_cp)
horse_colic_train_cp.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 299 entries, 0 to 299
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   surgery?                    299 non-null    float64
 1   Age                         299 non-null    int64  
 2   rectal temperature          239 non-null    float64
 3   pulse                       275 non-null    float64
 4   respiratory rate            241 non-null    float64
 5   mucous membranes            252 non-null    float64
 6   capillary refill time       265 non-null    float64
 7   pain                        244 non-null    float64
 8   peristalsis                 255 non-null    float64
 9   abdominal distension        243 non-null    float64
 10  nasogastric tube            195 non-null    float64
 11  nasogastric reflux          193 non-null    float64
 12  rectal examination - feces  197 non-null    float64
 13  abdomen                     181 non
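
The two helper classes above are called directly via `transform`. As a hedged alternative, the sketch below shows how the same two steps could inherit from sklearn's `BaseEstimator` and `TransformerMixin`, which provide `get_params` and `fit_transform` without extra code, and how they could be chained in a `Pipeline`. The class names `ColumnDropper` and `RowDropper` and the `toy` frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ColumnDropper(BaseEstimator, TransformerMixin):
    """Drop the given columns from a DataFrame."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)

class RowDropper(BaseEstimator, TransformerMixin):
    """Drop rows that have a missing value in any of the given columns."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.dropna(subset=self.columns)

# Illustrative data only
toy = pd.DataFrame({'id': [1, 2, 3],
                    'surgery?': [1.0, np.nan, 2.0],
                    'pulse': [66.0, 88.0, np.nan]})

drop_pipe = Pipeline([('drop_columns', ColumnDropper(['id'])),
                      ('drop_rows', RowDropper(['surgery?']))])
cleaned = drop_pipe.fit_transform(toy)
print(cleaned)
```

Note that dropping rows inside a pipeline only works while the target column is still part of the frame, as it is here. A separately passed `y` would not be shortened.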

In [137]:
horse_colic_train_cp.head()

Unnamed: 0,surgery?,Age,rectal temperature,pulse,respiratory rate,mucous membranes,capillary refill time,pain,peristalsis,abdominal distension,nasogastric tube,nasogastric reflux,rectal examination - feces,abdomen,packed cell volume,total protein,outcome,lesion type 1,lesion type 2,lesion type 3
0,2.0,1,38.5,66.0,28.0,,2.0,5.0,4.0,4.0,,,3.0,5.0,45.0,8.4,2.0,11300,0,0
1,1.0,1,39.2,88.0,20.0,4.0,1.0,3.0,4.0,2.0,,,4.0,2.0,50.0,85.0,3.0,2208,0,0
2,2.0,1,38.3,40.0,24.0,3.0,1.0,3.0,3.0,1.0,,,1.0,1.0,33.0,6.7,1.0,0,0,0
3,1.0,9,39.1,164.0,84.0,6.0,2.0,2.0,4.0,4.0,1.0,2.0,3.0,,48.0,7.2,2.0,2208,0,0
4,2.0,1,37.3,104.0,35.0,6.0,2.0,,,,,,,,74.0,7.4,2.0,4300,0,0


In [138]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer

#these columns will be imputed with the mean
continuousColumns = ['rectal temperature', 'pulse', 'respiratory rate', 'packed cell volume', 'total protein', ]
#these columns will be imputed with the most frequent value (mode), which is always a valid category code
discreteColumns = ['mucous membranes', 'capillary refill time', 'pain', 'peristalsis', 'nasogastric tube', 'nasogastric reflux', 'rectal examination - feces', 'abdomen', ]
#this column will be imputed with KNN because it's marked as an important parameter
knnColumns = ['abdominal distension']

fillNaTransformer = ColumnTransformer([
    ('fillWithMean', SimpleImputer(strategy='mean'), continuousColumns),
    ('fillWithMostFrequent', SimpleImputer(strategy="most_frequent"), discreteColumns),
    ('fillWithKnn', KNNImputer(n_neighbors=3), knnColumns)
], remainder="passthrough")

result = fillNaTransformer.fit_transform(horse_colic_train_cp)
column_names = [name[name.index('__')+2:] for name in fillNaTransformer.get_feature_names_out()]

horse_colic_train_cp = pd.DataFrame(result, columns = column_names)
horse_colic_train_cp.head()

Unnamed: 0,rectal temperature,pulse,respiratory rate,packed cell volume,total protein,mucous membranes,capillary refill time,pain,peristalsis,nasogastric tube,nasogastric reflux,rectal examination - feces,abdomen,abdominal distension,surgery?,Age,outcome,lesion type 1,lesion type 2,lesion type 3
0,38.5,66.0,28.0,45.0,8.4,1.0,2.0,5.0,4.0,2.0,1.0,3.0,5.0,4.0,2.0,1,2.0,11300,0,0
1,39.2,88.0,20.0,50.0,85.0,4.0,1.0,3.0,4.0,2.0,1.0,4.0,2.0,2.0,1.0,1,3.0,2208,0,0
2,38.3,40.0,24.0,33.0,6.7,3.0,1.0,3.0,3.0,2.0,1.0,1.0,1.0,1.0,2.0,1,1.0,0,0,0
3,39.1,164.0,84.0,48.0,7.2,6.0,2.0,2.0,4.0,1.0,2.0,3.0,5.0,4.0,1.0,9,2.0,2208,0,0
4,37.3,104.0,35.0,74.0,7.4,6.0,2.0,3.0,3.0,2.0,1.0,4.0,5.0,2.271605,2.0,1,2.0,4300,0,0
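
One detail of the KNN step may be worth checking. A `ColumnTransformer` hands each imputer only its own columns, so a `KNNImputer` that receives a single column has no other features to measure distances with. In that situation it falls back to the column mean, which would be consistent with the non-integer value 2.271605 in the output above. The sketch below illustrates this on made-up arrays (all values are assumptions) and shows the effect of passing an additional helper column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Single column: no other feature is available to find neighbours,
# so the missing value is filled with the column mean (1 + 2 + 4) / 3
single = np.array([[1.0], [2.0], [np.nan], [4.0]])
single_filled = KNNImputer(n_neighbors=3).fit_transform(single)

# Two columns: the first acts as a helper feature for the neighbour search,
# the second is the one with a missing value in the last row
with_helper = np.array([[1.0, 1.0],
                        [1.1, 1.0],
                        [9.0, 4.0],
                        [9.1, 4.0],
                        [1.05, np.nan]])
helper_filled = KNNImputer(n_neighbors=2).fit_transform(with_helper)

print(single_filled[2, 0], helper_filled[4, 1])
```

If neighbour-based imputation is the intent, one option would be to list a few complete or already imputed columns next to 'abdominal distension' in `knnColumns`. Those columns would then appear in that block of the output instead of the passthrough part.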


In [139]:
class DiscreteEncodingTransformer:
    def __init__(self, column, mapping, target_name=None, target_type='uint8'):
        self.column = column
        self.mapping = mapping
        self.target_name = target_name
        self.target_type = target_type

    def transform(self,X,y=None):
        X = X.copy() #do not modify the caller's DataFrame
        #values that are missing from the mapping keep their original value
        X[self.column] = X[self.column].map(self.mapping).fillna(X[self.column])
        if self.target_type:
            X[self.column] = X[self.column].astype(self.target_type)
        if self.target_name:
            X = X.rename(columns={self.column: self.target_name})
        return X

    def fit(self, X, y=None):
        return self

    def get_params(self, deep=True):
        return {"column": self.column, "mapping": self.mapping,
                "target_name": self.target_name, "target_type": self.target_type}

In [140]:
from sklearn.pipeline import Pipeline

discreteEncodingTransformersPipe = Pipeline([
    ('det_1', DiscreteEncodingTransformer('surgery?', {2: 0})),
    ('det_2', DiscreteEncodingTransformer('Age', {9:0}, target_name='adult?')),
    ('det_3', DiscreteEncodingTransformer('mucous membranes', {5: 6, 6: 5})),
    ('det_4', DiscreteEncodingTransformer('mucous membranes', lambda x: x-1)),
    ('det_5', DiscreteEncodingTransformer('capillary refill time', lambda x: x-1, target_name='capillary refill time >= 3s?')),
    ('det_6', DiscreteEncodingTransformer('pain', {1: 'alert, no pain', 2: 'depressed',
                                                   3: 'intermittent mild pain',
                                                   4: 'intermittent severe pain',
                                                   5: 'continuous severe pain'},
                                          target_type='category')),
    ('det_7', DiscreteEncodingTransformer('peristalsis', {1: 'hypermotile', 2: 'normal',
                                                          3: 'hypomotile', 4: 'absent'},
                                          target_type='category')),
    ('det_8', DiscreteEncodingTransformer('abdominal distension', lambda x: round(x-1))),
    ('det_9', DiscreteEncodingTransformer('nasogastric tube', lambda x: x-1)),
    ('det_10', DiscreteEncodingTransformer('nasogastric reflux', {1: 0, 3: 1})),
    ('det_11', DiscreteEncodingTransformer('rectal examination - feces', {1: 'normal', 2:'increased',
                                                                          3: 'decreased', 4: 'absent'},
                                           target_type='category')),
    ('det_12', DiscreteEncodingTransformer('abdomen', {1: 'normal', 2:'other', 3: 'firm feces in the large intestine',
                                                       4: 'distended small intestine', 5: 'distended large intestine'},
                                           target_type='category')),
    ('det_13', DiscreteEncodingTransformer('outcome', {1: 'lived', 2: 'died', 3: 'was euthanized'},
                                           target_type='category'))
])

horse_colic_train_cp = discreteEncodingTransformersPipe.fit_transform(horse_colic_train_cp)
horse_colic_train_cp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   rectal temperature            299 non-null    object  
 1   pulse                         299 non-null    object  
 2   respiratory rate              299 non-null    object  
 3   packed cell volume            299 non-null    object  
 4   total protein                 299 non-null    object  
 5   mucous membranes              299 non-null    uint8   
 6   capillary refill time >= 3s?  299 non-null    uint8   
 7   pain                          299 non-null    category
 8   peristalsis                   299 non-null    category
 9   nasogastric tube              299 non-null    uint8   
 10  nasogastric reflux            299 non-null    uint8   
 11  rectal examination - feces    299 non-null    category
 12  abdomen                       299 non-null    cate

In [141]:
horse_colic_train_cp.head()

Unnamed: 0,rectal temperature,pulse,respiratory rate,packed cell volume,total protein,mucous membranes,capillary refill time >= 3s?,pain,peristalsis,nasogastric tube,nasogastric reflux,rectal examination - feces,abdomen,abdominal distension,surgery?,adult?,outcome,lesion type 1,lesion type 2,lesion type 3
0,38.5,66.0,28.0,45.0,8.4,0,1,continuous severe pain,absent,1,0,decreased,distended large intestine,3,0,1,died,11300,0,0
1,39.2,88.0,20.0,50.0,85.0,3,0,intermittent mild pain,absent,1,0,absent,other,1,1,1,was euthanized,2208,0,0
2,38.3,40.0,24.0,33.0,6.7,2,0,intermittent mild pain,hypomotile,1,0,normal,normal,0,0,1,lived,0,0,0
3,39.1,164.0,84.0,48.0,7.2,4,1,depressed,absent,0,2,decreased,distended large intestine,3,1,0,died,2208,0,0
4,37.3,104.0,35.0,74.0,7.4,4,1,intermittent mild pain,hypomotile,1,0,absent,distended large intestine,1,0,1,died,4300,0,0


In [142]:
class LesionTransformer:
    def transform(self,X,y=None):
        lesion_type_names = ['lesion type 1', 'lesion type 2', 'lesion type 3']
        lesion_part_names = ['lesion site', 'lesion type', 'lesion subtype', 'lesion specific code']
        X['number of lesions'] = (X.loc[:, lesion_type_names] != '00000').sum(axis=1)
        X['lesion'] = X.apply(LesionTransformer._get_latest_lesion, axis=1)
        for index, name in zip(range(4), lesion_part_names):
            X[name] = X.apply(lambda x: LesionTransformer._split_lesion(x, index), axis=1)
        X.drop([*lesion_type_names, 'lesion'], inplace=True, axis=1)
        return X

    def fit(self, X, y=None):
        return self

    @staticmethod
    def _get_latest_lesion(row):
        num_of_lesions = row['number of lesions']
        if num_of_lesions != 0:
            return row[f'lesion type {num_of_lesions}']
        else:
            return '00000'

    @staticmethod
    def _split_lesion(row, part):
        lesion = row['lesion']
        #6-digit codes and codes whose first 2 digits are a valid site (<= 11) start with a 2-digit site
        if len(lesion) == 6 or int(lesion[:2]) <= 11:
            pieces = (lesion[:2], lesion[2], lesion[3], lesion[4:])
        else:
            #otherwise the site has 1 digit and the last 2 digits form the specific code
            pieces = (lesion[0], lesion[1], lesion[2], lesion[3:])
        return int(pieces[part])

In [143]:
lesion_site_values_map = {1: 'gastric', 2: 'sm intestine', 3: 'lg colo',
                          4: 'lg colon and cecum', 5: 'cecum', 6: 'transverse colon',
                          7: 'retum/descending colon', 8: 'uterus', 9: 'bladder',
                          11: 'all intestinal sites'}
lesion_type_values_map = {1: 'simple', 2: 'strangulation', 3: 'inflammation'}
lesion_subtype_values_map = {1: 'mechanical', 2: 'paralytic'}
lesion_spec_code_values_map = {1: 'obturation', 2: 'intrinsic', 3: 'extrinsic',
                               4: 'adynamic', 5: 'volvulus/torsion', 6: 'intussuption',
                               7: 'thromboembolic', 8: 'hernia', 9: 'lipoma/slenic incarceration',
                               10: 'displacement'}
lesionPipe = Pipeline([
    ('lesion_transformer', LesionTransformer()),
    ('det_1', DiscreteEncodingTransformer('lesion site', lesion_site_values_map, target_type='category')),
    ('det_2', DiscreteEncodingTransformer('lesion type', lesion_type_values_map, target_type='category')),
    ('det_3', DiscreteEncodingTransformer('lesion subtype', lesion_subtype_values_map, target_type='category')),
    ('det_4', DiscreteEncodingTransformer('lesion specific code', lesion_spec_code_values_map, target_type='category')),
])

#lesion
horse_colic_train_cp = lesionPipe.fit_transform(horse_colic_train_cp)
horse_colic_train_cp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   rectal temperature            299 non-null    object  
 1   pulse                         299 non-null    object  
 2   respiratory rate              299 non-null    object  
 3   packed cell volume            299 non-null    object  
 4   total protein                 299 non-null    object  
 5   mucous membranes              299 non-null    uint8   
 6   capillary refill time >= 3s?  299 non-null    uint8   
 7   pain                          299 non-null    category
 8   peristalsis                   299 non-null    category
 9   nasogastric tube              299 non-null    uint8   
 10  nasogastric reflux            299 non-null    uint8   
 11  rectal examination - feces    299 non-null    category
 12  abdomen                       299 non-null    cate

In [144]:
#values that are not encoded (some of them are invalid according to dataset description) will be treated as
#unknown in OneHotEncoder
print(horse_colic_train_cp['lesion site'].value_counts())
print(horse_colic_train_cp['lesion type'].value_counts())
print(horse_colic_train_cp['lesion subtype'].value_counts())
print(horse_colic_train_cp['lesion specific code'].value_counts())

lg colo                   85
sm intestine              79
0                         63
lg colon and cecum        21
retum/descending colon    13
cecum                     12
gastric                   12
all intestinal sites       4
transverse colon           4
bladder                    3
uterus                     3
Name: lesion site, dtype: int64
strangulation    104
simple            97
0                 61
4                 25
inflammation      12
Name: lesion type, dtype: int64
0             197
mechanical     77
paralytic      24
3               1
Name: lesion subtype, dtype: int64
0                              93
volvulus/torsion               56
obturation                     48
hernia                         21
lipoma/slenic incarceration    19
adynamic                       18
intrinsic                      14
displacement                    9
extrinsic                       9
intussuption                    7
thromboembolic                  5
Name: lesion specific code, dty
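
The raw codes left in the counts above (such as `0` and `4`) are not in the value maps, so an encoder built with explicit `categories` and `handle_unknown='ignore'` should turn them into all-zero rows. The sketch below illustrates this on a toy column; the category list and the sample values are hypothetical stand-ins, not the notebook's actual maps.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical subset of known lesion types (stand-in for a values map)
known_types = ['simple', 'strangulation', 'inflammation']

# Explicit categories plus handle_unknown='ignore': unseen values do not raise,
# they are encoded as a row of zeros
encoder = OneHotEncoder(categories=[known_types], handle_unknown='ignore')

# '0' and '4' stand in for raw codes that have no mapping
samples = np.array([['simple'], ['0'], ['inflammation'], ['4']], dtype=object)

encoded = encoder.fit_transform(samples).toarray()
print(encoded)
# rows for '0' and '4' contain only zeros
```

A side effect of this choice is that every unmapped code becomes indistinguishable from every other unmapped code after encoding.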

In [145]:
horse_colic_train_cp.head()

Unnamed: 0,rectal temperature,pulse,respiratory rate,packed cell volume,total protein,mucous membranes,capillary refill time >= 3s?,pain,peristalsis,nasogastric tube,...,abdomen,abdominal distension,surgery?,adult?,outcome,number of lesions,lesion site,lesion type,lesion subtype,lesion specific code
0,38.5,66.0,28.0,45.0,8.4,0,1,continuous severe pain,absent,1,...,distended large intestine,3,0,1,died,1,all intestinal sites,inflammation,0,0
1,39.2,88.0,20.0,50.0,85.0,3,0,intermittent mild pain,absent,1,...,other,1,1,1,was euthanized,1,sm intestine,strangulation,0,hernia
2,38.3,40.0,24.0,33.0,6.7,2,0,intermittent mild pain,hypomotile,1,...,normal,0,0,1,lived,0,0,0,0,0
3,39.1,164.0,84.0,48.0,7.2,4,1,depressed,absent,0,...,distended large intestine,3,1,0,died,1,sm intestine,strangulation,0,hernia
4,37.3,104.0,35.0,74.0,7.4,4,1,intermittent mild pain,hypomotile,1,...,distended large intestine,1,0,1,died,1,lg colon and cecum,inflammation,0,0


In [146]:
from sklearn.preprocessing import OneHotEncoder

oneHotEncodersTransformer = ColumnTransformer([
    ('encodeLesionSite', OneHotEncoder(categories=[list(lesion_site_values_map.values())], handle_unknown='ignore'), ['lesion site']),
    ('encodeLesionType', OneHotEncoder(categories=[list(lesion_type_values_map.values())], handle_unknown='ignore'), ['lesion type']),
    ('encodeLesionSubtype', OneHotEncoder(categories=[list(lesion_subtype_values_map.values())], handle_unknown='ignore'), ['lesion subtype']),
    ('encodeLesionSpecCode', OneHotEncoder(categories=[list(lesion_spec_code_values_map.values())], handle_unknown='ignore'), ['lesion specific code']),
    ('encodeOther', OneHotEncoder(), ['pain', 'peristalsis', 'rectal examination - feces', 'abdomen']),
], remainder="passthrough")

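result = oneHotEncodersTransformer.fit_transform(horse_colic_train_cp)
# get_feature_names_out() returns names such as '<transformer>__<column>_<category>';
# drop the transformer prefix and show the column/category separator as ' -> '
column_names = [name[name.index('__')+2:].replace('_', ' -> ') for name in oneHotEncodersTransformer.get_feature_names_out()]

# rebuild a DataFrame from the transformed array using the readable column names
horse_colic_train_cp = pd.DataFrame(result, columns = column_names)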
horse_colic_train_cp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 57 columns):
 #   Column                                               Non-Null Count  Dtype 
---  ------                                               --------------  ----- 
 0   lesion site -> gastric                               299 non-null    object
 1   lesion site -> sm intestine                          299 non-null    object
 2   lesion site -> lg colo                               299 non-null    object
 3   lesion site -> lg colon and cecum                    299 non-null    object
 4   lesion site -> cecum                                 299 non-null    object
 5   lesion site -> transverse colon                      299 non-null    object
 6   lesion site -> retum/descending colon                299 non-null    object
 7   lesion site -> uterus                                299 non-null    object
 8   lesion site -> bladder                               299 non-null    object
 9  
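
The column names shown above come from a small string transformation of the names produced by `get_feature_names_out()`. As a minimal sketch, the helper below (a hypothetical name, not part of the notebook) applies the same steps to a few hand-written example names in the `<transformer>__<column>_<category>` form.

```python
def clean_feature_name(raw_name):
    """Turn a ColumnTransformer feature name into a readable column name."""
    # Drop the '<transformer>__' prefix added by ColumnTransformer
    without_prefix = raw_name[raw_name.index('__') + 2:]
    # OneHotEncoder joins column and category with '_'; show it as ' -> '
    return without_prefix.replace('_', ' -> ')

# Hand-written examples of raw names (assumed format)
raw_names = [
    'encodeLesionSite__lesion site_gastric',
    'encodeOther__pain_depressed',
    'remainder__pulse',
]

cleaned = [clean_feature_name(name) for name in raw_names]
print(cleaned)
```

Because every underscore is replaced, this approach only works while neither the column names nor the category values contain an underscore themselves, which holds for the names used here.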

In [147]:
horse_colic_train_cp.head()

Unnamed: 0,lesion site -> gastric,lesion site -> sm intestine,lesion site -> lg colo,lesion site -> lg colon and cecum,lesion site -> cecum,lesion site -> transverse colon,lesion site -> retum/descending colon,lesion site -> uterus,lesion site -> bladder,lesion site -> all intestinal sites,...,total protein,mucous membranes,capillary refill time >= 3s?,nasogastric tube,nasogastric reflux,abdominal distension,surgery?,adult?,outcome,number of lesions
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,8.4,0,1,1,0,3,0,1,died,1
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,85.0,3,0,1,0,1,1,1,was euthanized,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.7,2,0,1,0,0,0,1,lived,0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.2,4,1,0,2,3,1,0,died,1
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7.4,4,1,1,0,1,0,1,died,1


In [148]:
from sklearn.preprocessing import MinMaxScaler

minMaxScalerTransformer = ColumnTransformer([
    ('scaler', MinMaxScaler(), horse_colic_train_cp.columns.drop('outcome'))
], remainder="passthrough")

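# scale every column except the target; 'outcome' is passed through and ends up as the last column
result = minMaxScalerTransformer.fit_transform(horse_colic_train_cp)
# drop the '<transformer>__' prefix from the generated feature names
column_names = [name[name.index('__')+2:] for name in minMaxScalerTransformer.get_feature_names_out()]

horse_colic_train_cp = pd.DataFrame(result, columns = column_names)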
horse_colic_train_cp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 57 columns):
 #   Column                                               Non-Null Count  Dtype 
---  ------                                               --------------  ----- 
 0   lesion site -> gastric                               299 non-null    object
 1   lesion site -> sm intestine                          299 non-null    object
 2   lesion site -> lg colo                               299 non-null    object
 3   lesion site -> lg colon and cecum                    299 non-null    object
 4   lesion site -> cecum                                 299 non-null    object
 5   lesion site -> transverse colon                      299 non-null    object
 6   lesion site -> retum/descending colon                299 non-null    object
 7   lesion site -> uterus                                299 non-null    object
 8   lesion site -> bladder                               299 non-null    object
 9  
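
Every column is reported as `object` because the transformed array mixes numeric features with the string `outcome` column. If numeric dtypes are needed later, one possible approach is to cast the feature columns back explicitly. This is only a sketch on a toy frame whose column names are borrowed from the dataset for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a transformer output: numeric features plus a string target
mixed = np.array([[0.0, 0.5, 'died'], [1.0, 0.25, 'lived']], dtype=object)
frame = pd.DataFrame(mixed, columns=['surgery?', 'pulse', 'outcome'])

# Cast every column except the target back to float64
feature_columns = [col for col in frame.columns if col != 'outcome']
numeric_frame = frame.astype({col: 'float64' for col in feature_columns})
print(numeric_frame.dtypes)
```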

In [149]:
horse_colic_train_cp.head()

Unnamed: 0,lesion site -> gastric,lesion site -> sm intestine,lesion site -> lg colo,lesion site -> lg colon and cecum,lesion site -> cecum,lesion site -> transverse colon,lesion site -> retum/descending colon,lesion site -> uterus,lesion site -> bladder,lesion site -> all intestinal sites,...,total protein,mucous membranes,capillary refill time >= 3s?,nasogastric tube,nasogastric reflux,abdominal distension,surgery?,adult?,number of lesions,outcome
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.05951,0.0,1.0,0.5,0.0,1.0,0.0,1.0,0.333333,died
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.953326,0.6,0.0,0.5,0.0,0.333333,1.0,1.0,0.333333,was euthanized
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.039673,0.4,0.0,0.5,0.0,0.0,0.0,1.0,0.0,lived
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.045508,0.8,1.0,0.0,1.0,1.0,1.0,0.0,0.333333,died
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.047841,0.8,1.0,0.5,0.0,0.333333,0.0,1.0,0.333333,died
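
The scaled values above follow the min-max formula `(x - min) / (max - min)`, computed per column. The sketch below checks this on a toy column, assuming the codes range from 0 to 5 (which would explain a raw 3 becoming 0.6), and shows how a scaler fitted on training data might be reused on unseen rows. The arrays are made-up examples, not dataset values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training column with assumed range 0..5
train = np.array([[0.0], [3.0], [2.0], [4.0], [5.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(train)

# Same result computed by hand with the min-max formula
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (train - col_min) / (col_max - col_min)

# Reuse the fitted scaler on unseen rows: transform only, no refitting.
# Values outside the training range fall outside [0, 1] (clip=False by default).
unseen = np.array([[1.0], [6.0]])
scaled_unseen = scaler.transform(unseen)

print(scaled.ravel())
print(scaled_unseen.ravel())
```

Fitting only on the training data keeps the test set from influencing the scaling parameters.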


## Modeling