<h1>Handling Missing Values</h1>

<p>This demo will cover these topics:</p>

<ul>
    <li>Deleting missing values</li>
    <li>Replacing missing values</li>
</ul>

<strong>First, install pandas:</strong>

In [3]:
!pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-2.3.3-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.8 MB)
Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Collecting numpy>=1.22.4
  Downloading numpy-2.2.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: tzdata, numpy, pandas
Successfully installed numpy-2.2.6 pandas-2.3.3 tzdata-2025.2


<strong>Next, import pandas into the notebook:</strong>

In [2]:
import pandas as pd

Let us now look at the dataset for this task. It is used to predict whether a female patient has diabetes, based on measurements such as glucose and insulin levels.

<strong>Read the dataset into a pandas DataFrame named df1:</strong>

In [3]:
df1 = pd.read_csv('csv-files/diabetes.csv')

<strong>Use .head() to see the first few rows of the dataset:</strong>

In [4]:
df1.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


<strong>Use .info() to see more details about the data, such as the number of rows and columns:</strong>

In [5]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


As we can see, there are 9 columns and 768 rows. There appear to be no null values in the data, but let's look more closely at the dataset.

<strong>Use .describe() to check whether the dataset really has no null values:</strong> <br>
<em>Remember: pandas only treats NaN or None as null, not placeholder values such as 0 or -</em>

In [6]:
df1.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


<p>Looking at the minimum value for each column, we can observe that the following columns have a minimum value of 0:</p>
<ul>
    <li>Glucose</li>
    <li>BloodPressure</li>
    <li>SkinThickness</li>
    <li>Insulin</li>
    <li>BMI</li>
</ul>
<p>This does not make sense, because none of these measurements can be 0 for a living person. It suggests that missing values are represented by 0 in this dataset. (Pregnancies and Outcome also have a minimum of 0, but 0 is a valid value for both.)</p>
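The check on column minimums can also be done in code. The following is a minimal sketch on a small hypothetical DataFrame named toy (not the diabetes data) that lists every column whose minimum is 0. The result is only a list of candidates: deciding which zeros are placeholders still takes domain knowledge.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],    # 0 is a valid value here
    "Glucose": [148, 0, 183],    # 0 is a suspicious value here
    "BMI": [33.6, 26.6, 0.0],    # 0 is a suspicious value here
    "Age": [50, 31, 32],
})

# Minimum of each column, then keep only the columns whose minimum is 0
col_min = toy.min()
zero_min_cols = col_min[col_min == 0].index.tolist()
print(zero_min_cols)
```

In this sketch, Pregnancies is flagged along with Glucose and BMI, which is why the flagged columns need to be reviewed by hand before any zeros are treated as missing.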

<strong>Use .sum() to find how many 0 values are in each of these columns:</strong> <br>
<em>First, create a variable to store the identified columns.</em>

In [7]:
data_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

In [11]:
(df1[data_cols] == 0).sum()  # df1[data_cols] == 0 marks each zero cell as True; .sum() counts the True values per column

Glucose            5
BloodPressure     35
SkinThickness    227
Insulin          374
BMI               11
dtype: int64

The counts of 0 values are listed above. Such large counts, especially for SkinThickness and Insulin, support the conclusion that 0 represents missing values. Now, find the count of null values in each column.

<strong>Use .isnull().sum() to find the count of null values:</strong>

In [12]:
df1.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

It shows 0 null values for every column. This happens because the missing values are not stored in a form that pandas recognizes as null, such as NaN or None. Since they are stored as 0, pandas cannot identify them. For null detection to work, replace these 0 values with NaN.
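To illustrate what .isnull() does and does not detect, here is a small sketch on a hypothetical Series named s. It contains a 0 placeholder alongside real nulls written as NaN and None.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 0 is a placeholder, NaN and None are real nulls
s = pd.Series([94, 0, np.nan, None, 168])

null_mask = s.isnull().tolist()   # True only where pandas sees a null
null_count = int(s.isnull().sum())

print(null_mask)
print(null_count)
```

Only the NaN and None entries are flagged. The 0 is treated as an ordinary number.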

<strong>First import nan from numpy:</strong>

In [13]:
from numpy import nan

<strong>Use .replace() to replace 0 with NaN:</strong>

In [16]:
df1[data_cols] = df1[data_cols].replace(0, nan)

Check the null values again in the dataset.

In [17]:
df1.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

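Null values are now detected: every 0 in the selected columns has been converted to NaN. View the first 20 rows of the dataset to see some of the NaN values. Note that the output shown here was captured after the later cells in this demo had been run (its cell number is In [31]), so it shows the cleaned data, with the formerly missing values filled in and rows 7 and 9 dropped, rather than NaN, and it lists only ten rows.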

In [31]:
df1.head(20)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,156.056122,33.6,0.627,50,1
1,1,85.0,66.0,29.0,156.056122,26.6,0.351,31,0
2,8,183.0,64.0,26.0,156.056122,23.3,0.672,32,1
3,1,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
5,5,116.0,74.0,33.5,156.056122,25.6,0.201,30,0
6,3,78.0,50.0,32.0,88.0,31.0,0.248,26,1
8,2,197.0,70.0,45.0,543.0,30.5,0.158,53,1
10,4,110.0,92.0,39.5,156.056122,37.6,0.191,30,0
11,10,168.0,74.0,34.0,156.056122,38.0,0.537,34,1
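The decimal points that now appear in columns such as Glucose are a side effect of the replacement: NaN is a floating-point value, so an integer column that receives a NaN is converted to float64. The sketch below shows this on a hypothetical two-column DataFrame named toy, using np.nan (the same object as nan imported from numpy).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; Glucose uses 0 as a placeholder for "missing"
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "Age": [50, 31, 32],
})
cols = ["Glucose"]

dtype_before = str(toy["Glucose"].dtype)

# Replace the 0 placeholder with NaN in the selected columns only
toy[cols] = toy[cols].replace(0, np.nan)

dtype_after = str(toy["Glucose"].dtype)
print(dtype_before, "->", dtype_after)
print(toy)
```

Columns that are not in the list, such as Age here, keep their integer type.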


<h1>Deleting Missing Values</h1>

From the count of null values, it can be seen that the columns 'Glucose', 'BloodPressure', and 'BMI' have very few null values (5, 35, and 11, respectively). So, deleting these observations would not be detrimental to the dataset.
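To put these counts in proportion, the sketch below converts the null counts reported earlier into percentages of the 768 rows. The names null_counts, n_rows, and null_pct are introduced here for illustration only.

```python
import pandas as pd

# Null counts per column, copied from the .isnull().sum() output above
null_counts = pd.Series({
    "Glucose": 5,
    "BloodPressure": 35,
    "SkinThickness": 227,
    "Insulin": 374,
    "BMI": 11,
})
n_rows = 768  # number of rows reported by .info()

# Share of rows with a missing value, as a percentage
null_pct = (null_counts / n_rows * 100).round(1)
print(null_pct)
```

Glucose, BloodPressure, and BMI are each missing in fewer than 5% of rows, while SkinThickness and Insulin are missing in roughly 30% and 49% of rows.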

<strong>Use .dropna() to drop these missing values:</strong>

In [20]:
# subset limits the null check to these columns; a row is dropped if any of them is NaN
df1 = df1.dropna(subset=['Glucose', 'BloodPressure', 'BMI'])

# Show the new count of null values
df1.isnull().sum()

Pregnancies                   0
Glucose                       0
BloodPressure                 0
SkinThickness               192
Insulin                     332
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Null values in the Glucose, BloodPressure, and BMI columns have been removed. Deleting these rows also removed some of the null values in SkinThickness (227 down to 192) and Insulin (374 down to 332), because some of the dropped rows were missing values in those columns as well.
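The side effect on other columns can be seen in a small sketch. The DataFrame toy below is hypothetical. Dropping the one row with a missing Glucose value also removes one of the missing Insulin values, because both occur in the same row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is missing both Glucose and Insulin
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, np.nan],
})

nulls_before = toy.isnull().sum()

# Drop only the rows where Glucose is null; nulls in Insulin alone are kept
trimmed = toy.dropna(subset=["Glucose"])

nulls_after = trimmed.isnull().sum()
print(nulls_before)
print(nulls_after)
print(len(trimmed))
```

Rows that are missing only Insulin stay in the data, which is why that column still has nulls afterwards.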

<h1>Replacing Missing Values</h1>

The Insulin column still has 332 missing values. Dropping that many rows would discard a large share of the data, so replace the missing values instead. Use the mean of the Insulin column as the replacement value.

<strong>Use .mean() to find the mean of a column:</strong>

In [22]:
mean_val = df1['Insulin'].mean()
print(mean_val)

156.05612244897958


<strong>Use .fillna() to replace all the NaN values in the column Insulin with the column's mean:</strong>

In [24]:
# Pass a {column: value} dict so that fillna operates on df1 itself.
# The chained form df1['Insulin'].fillna(mean_val, inplace=True) acts on an
# intermediate copy; pandas warns about it, and it will stop working in pandas 3.0.
df1.fillna({'Insulin': mean_val}, inplace=True)

Check the missing values of the dataset again.

In [26]:
df1.isnull().sum()

Pregnancies                   0
Glucose                       0
BloodPressure                 0
SkinThickness               192
Insulin                       0
BMI                           0
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
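The following sketch repeats the mean replacement on a hypothetical DataFrame named toy, to show two details: .mean() skips NaN values when computing the average, and the dict form of .fillna() leaves other columns untouched. It also compares the standard deviation before and after filling, since replacing many values with a single number tends to reduce the spread of a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy = pd.DataFrame({
    "Insulin": [94.0, np.nan, 168.0, np.nan],
    "SkinThickness": [23.0, np.nan, 35.0, 32.0],
})

# NaN values are skipped: the mean is (94 + 168) / 2
mean_val = toy["Insulin"].mean()
std_before = toy["Insulin"].std()

# Fill only the Insulin column; SkinThickness keeps its NaN
toy = toy.fillna({"Insulin": mean_val})
std_after = toy["Insulin"].std()

print(mean_val)
print(toy)
print(std_before, std_after)
```

The column mean is unchanged by the fill, but the standard deviation drops, which is worth keeping in mind when as many values are replaced as in the Insulin column.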

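Now, use interpolation to replace the remaining missing values in the SkinThickness column. By default, .interpolate() fills each NaN linearly from the nearest non-null values above and below it in row order. <br>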

<strong>Use .interpolate() to replace missing values with interpolation:</strong>

In [29]:
df1['SkinThickness'] = df1['SkinThickness'].interpolate()

Check the missing values of the dataset again.

In [30]:
df1.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
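As a sketch of what the default linear interpolation does, the hypothetical Series s below has gaps at the start, in the middle, and at the end. For example, in the df1.head(20) output shown earlier, row 5 has a SkinThickness of 33.5, which is the midpoint of its neighbours 35.0 and 32.0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with leading, interior, and trailing gaps
s = pd.Series([np.nan, 35.0, np.nan, np.nan, 23.0, np.nan])

# Default method is linear, filling in the forward direction
filled = s.interpolate()
print(filled)
```

Interior gaps are filled with evenly spaced values between their neighbours, and a trailing gap repeats the last known value. A leading gap is left as NaN with the default settings, so the approach used in this demo clears every null only if the first row of the column has a value, as it does for SkinThickness here.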

<em>
    <strong>The dataset is now free of all the null values.</strong>
</em>