## Fixing Missing Values for the Horse Colic Dataset

In [1]:
import pandas as pd

In [2]:
file_url = ('https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/'
            'master/Chapter11/dataset/horse-colic.data')

In [3]:
# Raw string avoids an invalid-escape warning; note that prefix= requires pandas < 2.0
df = pd.read_csv(file_url, header=None, sep=r'\s+', prefix='X')
df.head()

,X0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X18,X19,X20,X21,X22,X23,X24,X25,X26,X27
0,2,1,530101,38.5,66,28,3,3,?,2,...,45.0,8.4,?,?,2,2,11300,0,0,2
1,1,1,534817,39.2,88,20,?,?,4,1,...,50.0,85.0,2,2,3,2,2208,0,0,2
2,2,1,530334,38.3,40,24,1,1,3,1,...,33.0,6.7,?,?,1,2,0,0,0,1
3,1,9,5290409,39.1,164,84,4,1,6,2,...,48.0,7.2,3,5.30,2,1,2208,0,0,1
4,2,1,530255,37.3,104,35,?,?,6,2,...,74.0,7.4,?,?,2,2,4300,0,0,2
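
The `prefix` argument of `read_csv` was deprecated in pandas 1.4 and removed in 2.0, so the cell above only runs on older releases. A minimal sketch of an equivalent that does not depend on `prefix` follows. It reads a small made-up sample through `io.StringIO` rather than the real URL (the sample rows and the names `sample` and `df_sample` are assumptions for illustration), then renames the default integer column labels.

```python
import io
import pandas as pd

# Hypothetical whitespace-separated sample standing in for horse-colic.data
sample = "2 1 38.5 ?\n1 1 39.2 4\n2 9 ? 3\n"

# header=None gives integer column labels 0, 1, 2, ...
df_sample = pd.read_csv(io.StringIO(sample), header=None, sep=r'\s+')

# Rebuild the X0, X1, ... names that prefix='X' used to produce
df_sample.columns = [f'X{i}' for i in df_sample.columns]
print(df_sample.columns.tolist())
```

The same renaming line should work on the full DataFrame once `io.StringIO(sample)` is swapped for `file_url`.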


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 28 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   X0      300 non-null    object
 1   X1      300 non-null    int64 
 2   X2      300 non-null    int64 
 3   X3      300 non-null    object
 4   X4      300 non-null    object
 5   X5      300 non-null    object
 6   X6      300 non-null    object
 7   X7      300 non-null    object
 8   X8      300 non-null    object
 9   X9      300 non-null    object
 10  X10     300 non-null    object
 11  X11     300 non-null    object
 12  X12     300 non-null    object
 13  X13     300 non-null    object
 14  X14     300 non-null    object
 15  X15     300 non-null    object
 16  X16     300 non-null    object
 17  X17     300 non-null    object
 18  X18     300 non-null    object
 19  X19     300 non-null    object
 20  X20     300 non-null    object
 21  X21     300 non-null    object
 22  X22     300 non-null    object

In [5]:
# transform the ? character into a missing value (NaN)
df = pd.read_csv(file_url, header=None, sep=r'\s+', prefix='X', na_values='?')
df.head()

,X0,X1,X2,X3,X4,X5,X6,X7,X8,X9,...,X18,X19,X20,X21,X22,X23,X24,X25,X26,X27
0,2.0,1,530101,38.5,66.0,28.0,3.0,3.0,,2.0,...,45.0,8.4,,,2.0,2,11300,0,0,2
1,1.0,1,534817,39.2,88.0,20.0,,,4.0,1.0,...,50.0,85.0,2.0,2.0,3.0,2,2208,0,0,2
2,2.0,1,530334,38.3,40.0,24.0,1.0,1.0,3.0,1.0,...,33.0,6.7,,,1.0,2,0,0,0,1
3,1.0,9,5290409,39.1,164.0,84.0,4.0,1.0,6.0,2.0,...,48.0,7.2,3.0,5.3,2.0,1,2208,0,0,1
4,2.0,1,530255,37.3,104.0,35.0,,,6.0,2.0,...,74.0,7.4,,,2.0,2,4300,0,0,2
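
To see why `na_values='?'` matters, the sketch below parses the same made-up two-column sample twice (the values and the names `raw` and `clean` are assumptions, not rows from the real file). Without `na_values`, a single `?` forces the whole column to be read as text; with it, the column is parsed as floats and the `?` becomes `NaN`.

```python
import io
import pandas as pd

# Made-up sample: each column contains one '?' placeholder
sample = "38.5 66\n? 88\n38.3 ?\n"

# '?' is kept as a string, so both columns are read as text
raw = pd.read_csv(io.StringIO(sample), header=None, sep=r'\s+')

# '?' is treated as missing, so both columns become float64
clean = pd.read_csv(io.StringIO(sample), header=None, sep=r'\s+', na_values='?')

print(raw.dtypes)
print(clean.dtypes)
print(clean.isna().sum())
```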


In [7]:
df.dtypes

X0     float64
X1       int64
X2       int64
X3     float64
X4     float64
X5     float64
X6     float64
X7     float64
X8     float64
X9     float64
X10    float64
X11    float64
X12    float64
X13    float64
X14    float64
X15    float64
X16    float64
X17    float64
X18    float64
X19    float64
X20    float64
X21    float64
X22    float64
X23      int64
X24      int64
X25      int64
X26      int64
X27      int64
dtype: object

In [8]:
df.isna().sum()

X0       1
X1       0
X2       0
X3      60
X4      24
X5      58
X6      56
X7      69
X8      47
X9      32
X10     55
X11     44
X12     56
X13    104
X14    106
X15    247
X16    102
X17    118
X18     29
X19     33
X20    165
X21    198
X22      1
X23      0
X24      0
X25      0
X26      0
X27      0
dtype: int64
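
Raw counts such as the 247 missing values in X15 are easier to judge as a share of the 300 rows. One way to get that share, sketched here on a small made-up frame (`toy` and `missing_share` are illustrative names), is to take the mean of the Boolean mask instead of its sum.

```python
import numpy as np
import pandas as pd

# Made-up data: X0 has 1 of 4 values missing, X1 has 3 of 4
toy = pd.DataFrame({'X0': [1.0, np.nan, 2.0, 1.0],
                    'X1': [np.nan, np.nan, np.nan, 5.0]})

# The mean of a True/False mask is the fraction of True values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```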

Create a condition mask called x0_mask so that you can find the missing values in the X0 column using the .isna() method:

In [9]:
x0_mask = df['X0'].isna()
x0_mask.sum()

1
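
Beyond counting, a condition mask can also be used to look at the affected rows. The sketch below uses a made-up frame (`toy` and `toy_mask` are illustrative names) to show both selections.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in X0
toy = pd.DataFrame({'X0': [2.0, np.nan, 1.0], 'X1': [1, 9, 1]})

toy_mask = toy['X0'].isna()

# Rows where X0 is missing
print(toy[toy_mask])

# Rows where X0 is present (~ inverts the mask)
print(toy[~toy_mask])
```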

Extract the median of X0 using the .median() method and store it in a new variable called x0_median.

In [10]:
x0_median = df['X0'].median()
print(x0_median)

1.0
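
One common reason to impute with the median rather than the mean is that it is less sensitive to extreme values. The sketch below, on a made-up Series `s`, compares the two; both methods skip `NaN` by default, so neither needs the missing values removed first.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry and one outlier (40.0)
s = pd.Series([1.0, 2.0, np.nan, 1.0, 40.0])

# NaN is ignored by default (skipna=True) in both calls
print(s.median())  # middle of the sorted non-missing values
print(s.mean())    # pulled upward by the outlier
```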


Replace all the missing values in the X0 variable with their median using the .fillna() method, along with the inplace=True parameter:

In [11]:
df['X0'].fillna(x0_median, inplace=True)
df['X0'].isna().sum()

0
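
Recent pandas releases warn when an in-place method is called on a column selected with `df['X0']`, because that selection may be a temporary copy. A version-independent sketch is to assign the result of `.fillna()` back to the column, shown here on a made-up frame (`toy` and `toy_median` are illustrative names).

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value
toy = pd.DataFrame({'X0': [2.0, np.nan, 1.0, 1.0]})

toy_median = toy['X0'].median()

# Assign the filled column back instead of relying on inplace=True
toy['X0'] = toy['X0'].fillna(toy_median)
print(toy['X0'].isna().sum())
```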

Create a for loop that iterates through all the columns of the DataFrame. Inside the loop, calculate each column's median and store it in a variable called col_median, then impute that column's missing values with this median using the .fillna() method along with the inplace=True parameter:

In [13]:
for col_name in df.columns:
    col_median = df[col_name].median()
    df[col_name].fillna(col_median, inplace=True)

In [14]:
df.isna().sum()

X0     0
X1     0
X2     0
X3     0
X4     0
X5     0
X6     0
X7     0
X8     0
X9     0
X10    0
X11    0
X12    0
X13    0
X14    0
X15    0
X16    0
X17    0
X18    0
X19    0
X20    0
X21    0
X22    0
X23    0
X24    0
X25    0
X26    0
X27    0
dtype: int64
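
When every column is numeric, as here, the loop can also be written as a single call: `DataFrame.fillna()` accepts a Series of per-column fill values, and `DataFrame.median()` returns exactly such a Series. This is a sketch of an alternative on a made-up frame (`toy` and `filled` are illustrative names), not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column
toy = pd.DataFrame({'X0': [2.0, np.nan, 1.0],
                    'X1': [np.nan, 4.0, 8.0]})

# toy.median() holds one median per column; fillna matches them by column name
filled = toy.fillna(toy.median())
print(filled)
```

Because this form returns a new DataFrame, the original is left unchanged until the result is assigned back.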

You have successfully fixed the missing values for all the numerical variables using the methods provided by the pandas package: .isna() and .fillna().