This notebook aims to fulfil the first milestone in the LiveProject "Anomaly Detection using scikit-learn" (https://www.manning.com/liveproject/using-scikit-learn).

The milestone requires downloading the UC Irvine thyroid disease dataset, available at http://odds.cs.stonybrook.edu/thyroid-disease-dataset/, and performing some basic exploratory analysis of it. The original problem is to determine whether a patient referred to the clinic is hypothyroid, with three classes: normal (not hypothyroid), hyperfunction, and subnormal functioning. For outlier detection, the 3772 training instances are used, keeping only the 6 real-valued attributes. The hyperfunction class is treated as the outlier class and the other two classes as inliers, because hyperfunction is a clear minority class.

As the mean of the label column in the df.describe() output below shows, the proportion of outliers is low: 0.024655, or 2.466%.
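Because the label is binary (1 for an outlier, 0 for an inlier), the outlier proportion is simply the mean of the label vector. The following is a minimal sketch on a stand-in array rather than the real data. The count of 93 outliers is an inference from 0.024655 × 3772 ≈ 93, not a figure stated in this notebook, and `outlier_fraction` is a hypothetical helper name.

```python
import numpy as np


def outlier_fraction(labels):
    """Return the share of entries equal to 1 in a binary label array."""
    labels = np.asarray(labels).ravel()  # accept (n,) or (n, 1) shapes
    return float(labels.mean())


# Stand-in labels: 93 outliers among 3772 rows (assumed count, see the text above).
y_demo = np.zeros((3772, 1))
y_demo[:93] = 1.0

frac = outlier_fraction(y_demo)
print(f"{frac:.6f} or {frac:.3%}")
```

The same helper could be applied to the `y` array loaded from the file, provided the labels really are coded as 0 and 1.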

In [3]:
import numpy as np
import pandas as pd
import scipy.io  # loadmat reads MATLAB .mat files

In [4]:
mat_contents = scipy.io.loadmat("thyroid.mat")

In [5]:
print(mat_contents.keys())

dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])


In [6]:
mat_contents

{'__header__': b'MATLAB 5.0 MAT-file, written by Octave 3.8.0, 2014-12-05 13:11:25 UTC',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[7.74193548e-01, 1.13207547e-03, 1.37571157e-01, 2.75700935e-01,
         2.95774648e-01, 2.36065574e-01],
        [2.47311828e-01, 4.71698113e-04, 2.79886148e-01, 3.29439252e-01,
         5.35211268e-01, 1.73770492e-01],
        [4.94623656e-01, 3.58490566e-03, 2.22960152e-01, 2.33644860e-01,
         5.25821596e-01, 1.24590164e-01],
        ...,
        [9.35483871e-01, 2.45283019e-02, 1.60341556e-01, 2.82710280e-01,
         3.75586854e-01, 2.00000000e-01],
        [6.77419355e-01, 1.47169811e-03, 1.90702087e-01, 2.42990654e-01,
         3.23943662e-01, 1.95081967e-01],
        [4.83870968e-01, 3.56603774e-03, 1.90702087e-01, 2.12616822e-01,
         3.38028169e-01, 1.63934426e-01]]),
 'y': array([[0.],
        [0.],
        [0.],
        ...,
        [0.],
        [0.],
        [0.]])}
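The dictionary above holds `X` as a 2-D array and `y` as a column vector of shape (n, 1). A possible way to wrap the loading step, sketched here under the assumption that the file uses exactly the keys `X` and `y`, is a small helper that flattens the labels and checks that the row counts agree. `load_xy` is a hypothetical name, and the demonstration writes a tiny synthetic file instead of reading `thyroid.mat`.

```python
import os
import tempfile

import numpy as np
import scipy.io


def load_xy(path):
    """Load 'X' and 'y' from a MATLAB file and flatten y to one dimension."""
    contents = scipy.io.loadmat(path)
    X = contents["X"]
    y = contents["y"].ravel()  # (n, 1) column vector -> (n,)
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y have different numbers of rows")
    return X, y


# Demonstration on a small synthetic file, not the real thyroid.mat.
rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.mat")
    scipy.io.savemat(demo_path, {"X": rng.random((5, 6)), "y": np.zeros((5, 1))})
    X_demo, y_demo = load_xy(demo_path)

print(X_demo.shape, y_demo.shape)
```

Flattening the labels early makes them easier to pass to estimators that expect a 1-D target.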

In [7]:
# Index the dict directly so a missing key raises KeyError instead of passing None on
X = pd.DataFrame(mat_contents['X'])

In [8]:
X.head()

          0         1         2         3         4         5
0  0.774194  0.001132  0.137571  0.275701  0.295775  0.236066
1  0.247312  0.000472  0.279886  0.329439  0.535211  0.173770
2  0.494624  0.003585  0.222960  0.233645  0.525822  0.124590
3  0.677419  0.001698  0.156546  0.175234  0.333333  0.136066
4  0.236559  0.000472  0.241935  0.320093  0.333333  0.247541


In [9]:
X.tail()

             0         1         2         3         4         5
3767  0.817204  0.000113  0.190702  0.287383  0.413146  0.188525
3768  0.430108  0.002453  0.232448  0.287383  0.446009  0.175410
3769  0.935484  0.024528  0.160342  0.282710  0.375587  0.200000
3770  0.677419  0.001472  0.190702  0.242991  0.323944  0.195082
3771  0.483871  0.003566  0.190702  0.212617  0.338028  0.163934


In [10]:
# Labels: 1.0 marks an outlier (hyperfunction), 0.0 an inlier
y = pd.DataFrame(mat_contents['y'])

In [11]:
y.head()

     0
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0


In [12]:
y.tail()

        0
3767  0.0
3768  0.0
3769  0.0
3770  0.0
3771  0.0


In [13]:
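# Join features and label side by side; both frames use default integer
# column names, so the label column is also called 0 until it is renamed
df = pd.concat([X, y], axis=1)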

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3772 entries, 0 to 3771
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       3772 non-null   float64
 1   1       3772 non-null   float64
 2   2       3772 non-null   float64
 3   3       3772 non-null   float64
 4   4       3772 non-null   float64
 5   5       3772 non-null   float64
 6   0       3772 non-null   float64
dtypes: float64(7)
memory usage: 206.4 KB
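The listing above shows two columns labelled 0, because `X` and `y` each carry default integer column names. One way to avoid that ambiguity, sketched here on stand-in arrays, is to name the columns when the frames are built. The names `V0` to `V5` and `y` follow the ones this notebook assigns later; `X_arr`, `y_arr`, and `df_named` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Stand-in arrays with the same layout as the loaded data: 6 features, 1 label.
X_arr = np.random.default_rng(0).random((4, 6))
y_arr = np.zeros((4, 1))

# Name the feature columns up front and store the label as a named Series.
feature_names = [f"V{i}" for i in range(X_arr.shape[1])]
X_named = pd.DataFrame(X_arr, columns=feature_names)
y_named = pd.Series(y_arr.ravel(), name="y")

df_named = pd.concat([X_named, y_named], axis=1)
print(df_named.columns.is_unique)
print(list(df_named.columns))
```

With unique labels, a selection such as `df_named["y"]` returns a single column rather than every column that shares a name.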


In [15]:
df.describe()

                 0            1            2            3            4            5            0
count  3772.000000  3772.000000  3772.000000  3772.000000  3772.000000  3772.000000  3772.000000
mean      0.543121     0.008983     0.186826     0.248332     0.376941     0.177301     0.024655
std       0.203790     0.043978     0.070405     0.080579     0.087382     0.054907     0.155093
min       0.000000     0.000000     0.000000     0.000000     0.000000     0.000000     0.000000
25%       0.376344     0.001132     0.156546     0.203271     0.328638     0.149180     0.000000
50%       0.569892     0.003019     0.190702     0.241822     0.375587     0.173770     0.000000
75%       0.709677     0.004528     0.213472     0.282710     0.413146     0.196721     0.000000
max       1.000000     1.000000     1.000000     1.000000     1.000000     1.000000     1.000000


In [16]:
df.columns = ['V0', 'V1', 'V2', 'V3', 'V4', 'V5', 'y']

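From df.info() we can see that no column contains missing values: every column has 3772 non-null entries. All variables are numeric (float64), so there is no need for a separate check for blanks or placeholder strings. From df.describe(), every feature spans the full range from 0 to 1 and has a non-zero standard deviation (between about 0.044 and 0.204), so none is constant. V1 (column 1 in the output above) is strongly right-skewed, however: its 75th percentile is 0.004528 against a maximum of 1.0.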
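The checks described above can also be made explicit rather than read off the printed summaries. The sketch below, run on a toy frame instead of the thyroid data, collects the missing-value count, standard deviation, and skewness per column; `quality_summary` is a hypothetical helper, and adding skewness is a suggestion that goes beyond what the notebook itself computes.

```python
import numpy as np
import pandas as pd


def quality_summary(frame):
    """Per-column missing count, standard deviation, and skewness."""
    return pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "std": frame.std(),
        "skew": frame.skew(),
    })


# Toy frame: 'a' is symmetric, 'b' has one large value and one missing entry.
toy = pd.DataFrame({
    "a": [0.1, 0.2, 0.3, 0.4, 0.5],
    "b": [0.0, 0.0, 0.01, np.nan, 1.0],
})
summary = quality_summary(toy)
print(summary)
```

A column with zero standard deviation would carry no information, and a large positive skew flags a feature whose mass sits near zero with a long right tail.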

In [17]:
df.to_csv('thyroid.csv', index=False)