# Horse colic
## Project goal
The goal of this project is to build and train a model that predicts colic complications in horses, based on the [Horse Colic Data Set](https://archive.ics.uci.edu/ml/datasets/Horse+Colic).
The task is a multinomial classification problem: the resulting model should predict whether a horse with colic will die, survive, or be euthanized.

## Data acquisition, analysis and pre-processing
### Data downloading
The dataset is split into 3 files:
- file *horse-colic.data* with training data - 300 records
- file *horse-colic.test* with test data - 68 records
- file *horse-colic.names* with dataset description and column names

The data is downloaded programmatically and saved into the *data* subfolder of the current working directory. However, *horse-colic.names* needs to be parsed before pandas can use it as the column name definition. The cell below does this and saves the parsed column names into the *horse-colic.names.parsed* file.

In [3]:
URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic'
TRAIN_DATA_FILE_NAME = 'horse-colic.data'
TEST_DATA_FILE_NAME = 'horse-colic.test'
NAMES_FILE_NAME = 'horse-colic.names'
DATA_DIRECTORY_PATH = './data'
COLUMN_NAMES = []

import os
import re
import requests

#downloads the 3 files and saves them in the given path; also creates a 4th file with the parsed, dataframe-ready column names
def download_data(url=URL, train_file_name=TRAIN_DATA_FILE_NAME, test_file_name=TEST_DATA_FILE_NAME,
                  names_file_name=NAMES_FILE_NAME, path=DATA_DIRECTORY_PATH):
    try:
        os.mkdir(path)
    except FileExistsError:
        print(f"Directory {path} already exists")
    for filename in [train_file_name, test_file_name, names_file_name]:
        download_save_file(url, filename, path=path)
    parse_names(names_file_name, f"{names_file_name}.parsed", path=path) #pass path through so a non-default directory works

#function downloads single file and saves it in given path
def download_save_file(url, filename, path=DATA_DIRECTORY_PATH):
    response = requests.get(f"{url}/{filename}")
    response.raise_for_status() #fail early instead of saving an HTTP error page
    with open(f"{path}/{filename}", 'w') as f:
        f.write(response.content.decode('utf-8').replace(' \n', '\n')) #trailing spaces at line endings cause parsing problems

#function extracts column names from names file (file also contains column descriptions)
def parse_names(filename_in, filename_out, path=DATA_DIRECTORY_PATH):
    pattern = r'^\d+:\s*' #columns are generally listed like '{id}:'; raw string avoids invalid escape warnings
    names = []
    with open(f"{path}/{filename_in}") as f:
        for line in [line.strip() for line in f.readlines()]:
            if re.match(pattern, line):
                names.append(re.sub(pattern, '', line))
    last_name = names.pop()
    names.extend(['lesion type 1', 'lesion type 2', 'lesion type 3']) #some columns are listed in different way
    names.append(last_name)
    with open(f"{path}/{filename_out}", "w") as f:
        f.write(','.join(names))
    COLUMN_NAMES.extend(names) #save it in variable for later usage

download_data()

Directory ./data already exists
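
The name-extraction step in `parse_names` can be tried in isolation, without downloading anything. The sketch below applies the same pattern to a short, illustrative excerpt; the excerpt text and the helper name `extract_column_names` are assumptions, and the real *horse-colic.names* file may be laid out differently.

```python
import re

# Illustrative excerpt only; the real horse-colic.names file may be formatted differently.
SAMPLE_NAMES_TEXT = """\
  1:  surgery?
          1 = Yes, it had surgery
          2 = It was treated without surgery
  2:  Age
          1 = Adult horse
          2 = Young (< 6 months)
"""

COLUMN_PATTERN = re.compile(r'^\d+:\s*')  # same pattern as in parse_names


def extract_column_names(text):
    """Return the text that follows an '{id}:' prefix on each matching line."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if COLUMN_PATTERN.match(line):
            names.append(COLUMN_PATTERN.sub('', line))
    return names


sample_names = extract_column_names(SAMPLE_NAMES_TEXT)
print(sample_names)
```

Lines such as `1 = Yes, it had surgery` are skipped because the pattern requires a colon directly after the number.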


### Data analysis
#### Missing values
Many fields contain null values: according to the dataset description, about 30% of the values are missing. Only 7 of the 28 columns have a value in every row, and only 6 of the 300 rows have a value in every column, so fixing the missing values will be challenging.
Rows with a missing value in the outcome column should definitely be dropped, because the outcome is the value we predict.

In [4]:
import pandas as pd

horse_colic_train = pd.read_csv("./data/horse-colic.data", sep=' ', na_values='?', header=None, names=COLUMN_NAMES)
horse_colic_test = pd.read_csv("./data/horse-colic.test", sep=' ', na_values='?', header=None, names=COLUMN_NAMES)
horse_colic_train.info() #columns without nulls counted manually
print(f"not null rows count: {horse_colic_train.notnull().all(axis=1).sum()}") #rows with a value in every column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 28 columns):
 #   Column                                                   Non-Null Count  Dtype  
---  ------                                                   --------------  -----  
 0   surgery?                                                 299 non-null    float64
 1   Age                                                      300 non-null    int64  
 2   Hospital Number                                          300 non-null    int64  
 3   rectal temperature                                       240 non-null    float64
 4   pulse                                                    276 non-null    float64
 5   respiratory rate                                         242 non-null    float64
 6   temperature of extremities                               244 non-null    float64
 7   peripheral pulse                                         231 non-null    float64
 8   mucous membranes              
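
The counts of complete columns and complete rows do not have to be read off the `info()` output by hand. A minimal sketch on a made-up toy frame is shown below; the column name `outcome` is an assumption about how the target column is named after parsing.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for horse_colic_train; the values are made up.
toy = pd.DataFrame({
    'Age': [1, 1, 9, 1],
    'pulse': [66.0, np.nan, 40.0, 104.0],
    'outcome': [2.0, 3.0, np.nan, 1.0],  # assumed name of the target column
})

complete_columns = int(toy.notna().all(axis=0).sum())  # columns with a value in every row
complete_rows = int(toy.notna().all(axis=1).sum())     # rows with a value in every column
missing_share = float(toy.isna().to_numpy().mean())    # fraction of missing cells

# Rows without a target value cannot be used for supervised training.
toy_labeled = toy.dropna(subset=['outcome'])
```

The same three expressions should work unchanged on `horse_colic_train`.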

#### Column types
All 28 columns are numerical, but only because every categorical column is encoded as numbers. We will need to deal with that.
Continuous variables are fine as they are (for now, let's leave data scaling etc. aside).
Categorical variables on an ordinal scale will stay encoded as numbers, although some modifications may be required, e.g. when the encoding does not represent the natural order well.
Categorical variables on a nominal scale will be decoded back to strings. These variables will be dummy encoded in later steps, so using strings is the better approach: it improves readability.
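
As a rough illustration of the plan for nominal variables, the sketch below decodes a numeric column to strings and then dummy encodes it with `pd.get_dummies`. The column name `feature` and the labels are hypothetical placeholders, not columns of the data set.

```python
import pandas as pd

# Hypothetical nominal column and code-to-label mapping, for illustration only.
toy = pd.DataFrame({'feature': [1.0, 2.0, 1.0, 3.0]})
labels = {1.0: 'label_a', 2.0: 'label_b', 3.0: 'label_c'}

toy['feature'] = toy['feature'].map(labels)         # decode numbers back to strings
dummies = pd.get_dummies(toy, columns=['feature'])  # one indicator column per label
print(list(dummies.columns))
```

With string labels, the generated column names stay readable instead of becoming `feature_1.0`, `feature_2.0` and so on.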

In [6]:
#it can be binary encoded with '0' and '1'
horse_colic_train['surgery?'].value_counts()

1.0    180
2.0    119
Name: surgery?, dtype: int64
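
One possible binary encoding is sketched below on made-up values. It assumes the coding 1 = had surgery and 2 = treated without surgery; this should be checked against *horse-colic.names* before use.

```python
import numpy as np
import pandas as pd

surgery = pd.Series([1.0, 2.0, np.nan, 1.0], name='surgery?')

# Assumed coding: 1 = had surgery, 2 = treated without surgery.
# Values missing from the mapping (including NaN) stay NaN.
surgery_binary = surgery.map({1.0: 1, 2.0: 0})
```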

In [19]:
#2 possible age values - according to the description, '1' means an adult horse (>6 months) and '2' a young horse (<6 months)
#however, the actual column values differ: we assume that '9' means the horse is young
#it's also better to rename this column to 'adult?' and encode it as binary
horse_colic_train['Age'].value_counts()

1    276
9     24
Name: Age, dtype: int64
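
The rename and binary encoding could look like the sketch below, shown on made-up values. It relies on the assumption stated above that every value other than '1' (in practice '9') marks a young horse.

```python
import pandas as pd

toy = pd.DataFrame({'Age': [1, 9, 1, 1]})

# 1 = adult; anything else (here 9) is assumed to mean a young horse.
toy['adult?'] = (toy['Age'] == 1).astype(int)
toy = toy.drop(columns='Age')
```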

In [12]:
#this column is the horse's unique identifier; some values repeat because one horse may have more than one row in the dataset
#this column might be useful for building a time series, but there is not enough data, so we will probably delete it
horse_colic_train['Hospital Number']

0       530101
1       534817
2       530334
3      5290409
4       530255
        ...   
295     533886
296     527702
297     529386
298     530612
299     534618
Name: Hospital Number, Length: 300, dtype: int64
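
To see how many horses actually appear more than once, the identifiers can be counted. In the sketch below the identifiers are taken from the output above, but the repetition is made up for illustration.

```python
import pandas as pd

# Identifiers copied from the output above; the duplicate entry is made up.
hospital_number = pd.Series([530101, 534817, 530101, 5290409], name='Hospital Number')

rows_per_horse = hospital_number.value_counts()
repeated = rows_per_horse[rows_per_horse > 1]  # horses with more than one row
```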

In [17]:
#the next 3 columns (rectal temperature, pulse, respiratory rate) are continuous, so we don't analyze them for now,
#although they will probably be very relevant because they describe the horse's basic vital signs
#pulse and respiratory rate have a max value much bigger than the 3rd quartile - we will verify that later
horse_colic_train.iloc[:, 3:6].describe()

| | rectal temperature | pulse | respiratory rate |
|---|---|---|---|
| count | 240.0 | 276.0 | 242.0 |
| mean | 38.167917 | 71.913043 | 30.417355 |
| std | 0.732289 | 28.630557 | 17.642231 |
| min | 35.4 | 30.0 | 8.0 |
| 25% | 37.8 | 48.0 | 18.5 |
| 50% | 38.2 | 64.0 | 24.5 |
| 75% | 38.5 | 88.0 | 36.0 |
| max | 40.8 | 184.0 | 96.0 |
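
One common way to check whether a maximum is suspiciously far from the bulk of the data is Tukey's upper fence, Q3 + 1.5 * IQR. This is a swapped-in rule of thumb, not necessarily the verification the notebook performs later. The sketch below applies it to the quartiles and maxima from the `describe()` output above.

```python
# Quartiles and maxima copied from the describe() output above.
stats = {
    'pulse': {'q1': 48.0, 'q3': 88.0, 'max': 184.0},
    'respiratory rate': {'q1': 18.5, 'q3': 36.0, 'max': 96.0},
}


def upper_fence(q1, q3, k=1.5):
    """Tukey's upper fence: Q3 + k * IQR, where IQR = Q3 - Q1."""
    return q3 + k * (q3 - q1)


fences = {name: upper_fence(s['q1'], s['q3']) for name, s in stats.items()}
exceeds = {name: s['max'] > fences[name] for name, s in stats.items()}
print(fences, exceeds)
```

Both maxima lie above their fences (148.0 for pulse and 62.25 for respiratory rate), which supports taking a closer look at the largest values.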


In [20]:
#this column is on an ordinal scale, so we will keep it encoded. however, we will need to change the encoding because
#the current one doesn't represent the natural order (1 = Normal, 2 = Warm, 3 = Cool, 4 = Cold)
#for a regression model we will probably delete this column because it's strongly correlated with rectal temperature
horse_colic_train['temperature of extremities'].value_counts()

3.0    109
1.0     78
2.0     30
4.0     27
Name: temperature of extremities, dtype: int64
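
Re-encoding to a natural order can be done with a plain mapping. The target order in the sketch below (Cold < Cool < Normal < Warm) is one possible choice, not something the data set prescribes, and the sample values are made up.

```python
import numpy as np
import pandas as pd

extremities = pd.Series([3.0, 1.0, 2.0, 4.0, np.nan], name='temperature of extremities')

# Original codes: 1 = Normal, 2 = Warm, 3 = Cool, 4 = Cold.
# Assumed target order: 1 = Cold, 2 = Cool, 3 = Normal, 4 = Warm.
reorder = {4.0: 1, 3.0: 2, 1.0: 3, 2.0: 4}
extremities_ordered = extremities.map(reorder)  # NaN stays NaN
```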

In [28]:
#this column's encoding should also be fixed, because it doesn't represent the natural order (1 = normal, 2 = increased, 3 = reduced, 4 = absent)
#we will consider removing this column because its values are missing in 69 rows and it's very subjective
horse_colic_train['peripheral pulse'].value_counts()

1.0    115
3.0    103
4.0      8
2.0      5
Name: peripheral pulse, dtype: int64

In [29]:
#this column's ordering is also wrong, but this time we only need to swap '5' with '6'
#encoding meaning - 1 = normal pink, 2 = bright pink, 3 = pale pink, 4 = pale cyanotic, 5 = bright red / injected, 6 = dark cyanotic
#the dataset description specifies 4 groups for these 6 values, so we could reduce the number of values. however, we will not do it
#because we could lose some information and the number of values is not critical (ordinal scale = no dummies = no extra dimensions)
horse_colic_train['mucous membranes'].value_counts()

1.0    79
3.0    58
4.0    41
2.0    30
5.0    25
6.0    20
Name: mucous membranes, dtype: int64
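
Swapping two codes needs some care so that the second replacement does not undo the first. The sketch below maps only the two affected codes and falls back to the original value everywhere else; the sample values are made up.

```python
import pandas as pd

mucous = pd.Series([1.0, 5.0, 6.0, 3.0], name='mucous membranes')

# Map only the two codes to swap; every other value (and NaN) is restored by fillna.
mucous_swapped = mucous.map({5.0: 6.0, 6.0: 5.0}).fillna(mucous)
```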

In [30]:
#this column can be binary encoded. according to the dataset description there are only 2 possible values - '1' and '2', so we have to delete rows with '3' (meaningless)
horse_colic_train['capillary refill time'].value_counts()

1.0    188
2.0     78
3.0      2
Name: capillary refill time, dtype: int64
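
A minimal sketch of that cleanup on made-up values is shown below. It drops the rows with the undocumented code '3', keeps rows with missing values for later handling, and encodes the rest as 0/1; which original code becomes 0 and which becomes 1 is an arbitrary choice here.

```python
import numpy as np
import pandas as pd

col = 'capillary refill time'
toy = pd.DataFrame({col: [1.0, 2.0, 3.0, np.nan, 1.0]})

# NaN != 3 evaluates to True, so rows with missing values are kept.
valid = toy[toy[col] != 3].copy()
valid[col] = valid[col].map({1.0: 0, 2.0: 1})  # arbitrary 0/1 assignment
```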