## INFS3081 Predictive Analytics

### Practical Activity: Data Manipulation and Feature Selection

This notebook is an exercise for performing data preprocessing and manipulation, including the following tasks:
- Handling missing values
- Performing feature selection and feature filtering

We apply the concepts discussed in Data Exploration and Data Preprocessing.

We will use the following Python libraries for this practical:
- numpy https://numpy.org/
- pandas https://pandas.pydata.org/
- scikit-learn https://scikit-learn.org/stable/

### Diabetes Dataset

Our aim is to build a classification model to predict diabetes. We use the Pima Indians diabetes dataset, which contains 768 observations and the following 9 variables:
- **Pregnancies**: Number of times pregnant.
- **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- **BloodPressure**: Diastolic blood pressure (mm Hg).
- **SkinThickness**: Triceps skinfold thickness (mm).
- **Insulin**: 2-hour serum insulin (mu U/ml).
- **BMI**: Body mass index (weight in kg/(height in m)^2).
- **DiabetesPedigreeFunction**: Diabetes pedigree function.
- **Age**: Age in years.
- **Outcome**: "1" represents the presence of diabetes, while "0" represents absence.

The dataset was downloaded from https://www.kaggle.com/uciml/pima-indians-diabetes-database

### Task 1. Handling missing values

Missing values are one of the main obstacles in building predictive models. It is important to explore various approaches to imputing missing values and understand the reasons for selecting a specific method. By doing so, you can ensure your data is as complete and accurate as possible before training and evaluating your predictive models.

### Step 1 - Loading the required libraries and modules

In [None]:
#
# * import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer

In [3]:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
# magic command that renders figures inline in the notebook
%matplotlib inline

In [None]:
#
# * Load the dataset
df = pd.read_csv("./diabetes.csv")

### Step 2 - Describing and Summarising the dataset

In [None]:
#
# * return the number of rows and columns in the dataframe
df.shape

(768, 9)

In [None]:
#
# * return the first 5 rows of the dataframe
df.head()

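| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |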


In [None]:
#
# * return a concise summary of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [None]:
#
# * return descriptive statistics of the dataset
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


### Step 3 - Handling missing values
In this step, we explore three common techniques for dealing with missing values in datasets. Proper handling of missing data is essential to ensure the reliability and accuracy of models.

#### 1. Removing Rows with Missing Values
- This method involves deleting rows that contain missing values.
- It is simple but can lead to loss of valuable data, especially if missing values are widespread.

#### 2. Imputing Missing Values with a Summary Statistic (Mean, Median, or Mode)
Instead of removing missing values, they can be replaced with a representative value:
- **Mean**: Suitable for normally distributed data.
- **Median**: More robust for skewed data or when outliers are present.
- **Mode**: Works well for categorical data by filling in the most frequent value.
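
The difference between these statistics can be seen in a minimal sketch. The series below are hypothetical toy data (not taken from the diabetes dataset), chosen so that one large outlier pulls the mean away from the typical values while the median stays put.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: mostly small values, one large outlier, one gap.
toy_numeric = pd.Series([10.0, 12.0, 11.0, np.nan, 13.0, 500.0])

mean_value = toy_numeric.mean()      # pulled upwards by the outlier
median_value = toy_numeric.median()  # unaffected by the outlier

filled_with_mean = toy_numeric.fillna(mean_value)
filled_with_median = toy_numeric.fillna(median_value)

# Hypothetical categorical column: the mode is the most frequent category.
toy_category = pd.Series(["a", "b", None, "a"])
mode_value = toy_category.mode()[0]
filled_with_mode = toy_category.fillna(mode_value)

print(f"mean = {mean_value:.1f}, median = {median_value:.1f}, mode = {mode_value}")
```

Here the mean-imputed value is far larger than any typical observation, which is why the median is usually the safer choice for skewed columns.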

#### 3. Imputing Missing Values Using an Estimator
- A more advanced method where a predictive model (e.g., regression, kNN, or decision trees) is used to estimate and fill missing values based on existing data patterns.
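
As one illustration of estimator-based imputation, the sketch below uses scikit-learn's `KNNImputer` on a small hypothetical array. The choice of kNN and of `n_neighbors=2` is an assumption made for the example and is not necessarily the estimator used later in this practical.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the second feature is roughly twice the first,
# and one value in the second feature is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])

# Each missing entry is filled with the average of that feature over the
# nearest rows, where distance is computed on the features both rows have.
knn_imputer = KNNImputer(n_neighbors=2)
X_filled = knn_imputer.fit_transform(X)

print(X_filled)
```

Unlike a column mean, the filled value depends on which rows are most similar to the incomplete row, so it can follow relationships between features.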

The descriptive statistics show a minimum of 0 for the following columns: Glucose, BloodPressure, SkinThickness, Insulin, BMI. A value of zero is not physiologically plausible for these measurements, so these are **invalid zero values** that should be treated as missing data and addressed using one of the methods above to improve the quality of the dataset.
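
Before replacing anything, it can help to count how many zeros each suspect column contains. The sketch below does this on a small hypothetical DataFrame (`toy_df` and its values are made up for illustration); the same expression could be applied to `df` with the five columns listed above.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real dataset.
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "Insulin": [0, 0, 94],
    "Pregnancies": [6, 0, 8],  # zero is a valid value here
})

suspect_columns = ["Glucose", "Insulin"]

# Comparing with 0 gives a boolean frame; summing counts the True entries,
# and taking the mean gives the share of zeros per column.
zero_counts = (toy_df[suspect_columns] == 0).sum()
zero_share = (toy_df[suspect_columns] == 0).mean()

print(zero_counts)
print(zero_share)
```

Note that Pregnancies is deliberately left out of `suspect_columns`, since zero pregnancies is a legitimate observation rather than a missing measurement.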


In [9]:
#
# * Columns where zeros should be replaced with NaN
columns_to_replace = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# * Replace zeros with NaN in the specified columns
df[columns_to_replace] = df[columns_to_replace].replace(0, np.nan)

#### Approach 1: Removing rows with missing values

In [10]:
#
# * make a copy of the dataset
df_dropna = df.copy()

In [11]:
#
# * drop the rows with NaN values
df_dropna.dropna(inplace=True)

In [12]:
#
# * return the number of rows and columns in the dataframe after dropping NaN values
df_dropna.shape

(392, 9)
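
Dropping every incomplete row reduced the dataset from 768 to 392 rows. The sketch below works out the share of rows lost from those two figures and then shows, on a hypothetical toy DataFrame, how the `subset` argument of `dropna()` can restrict the check to selected columns so that fewer rows are discarded.

```python
import numpy as np
import pandas as pd

# Row counts taken from the two df.shape outputs above.
rows_before, rows_after = 768, 392
rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before
print(f"{rows_dropped} rows dropped ({share_dropped:.1%} of the data)")

# Hypothetical toy data: Insulin is missing far more often than Glucose.
toy_df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "Insulin": [np.nan, np.nan, 94.0, np.nan],
})

complete_rows = toy_df.dropna()                        # any NaN drops the row
glucose_present = toy_df.dropna(subset=["Glucose"])    # only Glucose is checked

print(len(complete_rows), len(glucose_present))
```

Losing roughly half of the observations is a strong argument for considering imputation instead of row deletion.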

In [13]:
df_dropna.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 |
| mean | 3.30102 | 122.627551 | 70.663265 | 29.145408 | 156.056122 | 33.086224 | 0.523046 | 30.864796 | 0.331633 |
| std | 3.211424 | 30.860781 | 12.496092 | 10.516424 | 118.84169 | 7.027659 | 0.345488 | 10.200777 | 0.471401 |
| min | 0.0 | 56.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.085 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 21.0 | 76.75 | 28.4 | 0.26975 | 23.0 | 0.0 |
| 50% | 2.0 | 119.0 | 70.0 | 29.0 | 125.5 | 33.2 | 0.4495 | 27.0 | 0.0 |
| 75% | 5.0 | 143.0 | 78.0 | 37.0 | 190.0 | 37.1 | 0.687 | 36.0 | 1.0 |
| max | 17.0 | 198.0 | 110.0 | 63.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [14]:
#
# * check if there are still missing values in the dataset
df_dropna.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

#### Approach 2: Imputing missing values with the mean
We can impute missing values with the mean in two ways. The first is to use the **pandas** `fillna()` method.

In [15]:
#
# * make two copies of the dataset
df_mean_a1 = df.copy()
df_mean_a2 = df.copy()

In [16]:
#
# * fill missing values with the mean of the column
df_mean_a1.fillna(df_mean_a1.mean(), inplace=True)

In [17]:
df_mean_a1.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 121.686763 | 72.405184 | 29.15342 | 155.548223 | 32.457464 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 30.435949 | 12.096346 | 8.790942 | 85.021108 | 6.875151 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.75 | 64.0 | 25.0 | 121.5 | 27.5 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.202592 | 29.15342 | 155.548223 | 32.4 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 155.548223 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [18]:
#
# * check if there are still missing values in the dataset
df_mean_a1.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

The other way is to use scikit-learn's `SimpleImputer` class, which provides straightforward strategies for handling missing values in a dataset. Missing values can be replaced with a specified constant or with a statistical measure such as the **mean**, **median**, or **most frequent** value of each column containing missing data. For comparison purposes, we use the **mean** in this example.

Keep in mind that `SimpleImputer` supports different encodings for missing values through its `missing_values` parameter, making it flexible for various datasets and data cleaning scenarios.
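
As a minimal sketch of a different missing-value encoding, the example below sets `missing_values=0` so that zeros are imputed directly, without first converting them to NaN. The array is hypothetical toy data, not the diabetes dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data in which 0 encodes a missing measurement.
X = np.array([
    [148.0, 0.0],
    [85.0, 94.0],
    [0.0, 168.0],
])

# Treat every 0 as missing and replace it with the mean of the
# remaining (non-zero) values in the same column.
zero_imputer = SimpleImputer(missing_values=0, strategy="mean")
X_filled = zero_imputer.fit_transform(X)

print(zero_imputer.statistics_)  # per-column means used for filling
print(X_filled)
```

This treats every zero in the input as missing, so on the diabetes data it would only be appropriate for the columns where zero is invalid, not for columns such as Pregnancies or Outcome.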

In [19]:
#
# * retrieve the underlying numpy array; SimpleImputer returns a
# * numpy array by default, so we work with arrays from here on
values_mean = df_mean_a2.values
# * initialise the simple imputer and specify the replacing value
# * as the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# * fit the imputer on the data and transform it in one step
transformed_mean = imputer.fit_transform(values_mean)

In [20]:
#
# * count the number of missing values in the dataset
print("Missing: %d" % np.isnan(transformed_mean).sum())

Missing: 0


In [21]:
transformed_mean

array([[  6.   , 148.   ,  72.   , ...,   0.627,  50.   ,   1.   ],
       [  1.   ,  85.   ,  66.   , ...,   0.351,  31.   ,   0.   ],
       [  8.   , 183.   ,  64.   , ...,   0.672,  32.   ,   1.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,   0.245,  30.   ,   0.   ],
       [  1.   , 126.   ,  60.   , ...,   0.349,  47.   ,   1.   ],
       [  1.   ,  93.   ,  70.   , ...,   0.315,  23.   ,   0.   ]],
      shape=(768, 9))

In [22]:
transformed_mean.mean(axis=0)

array([  3.84505208, 121.68676278,  72.40518417,  29.15341959,
       155.54822335,  32.45746367,   0.4718763 ,  33.24088542,
         0.34895833])
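
These column means match the ones reported for the `fillna()` approach. Because the imputer returns a plain numpy array, the column names are lost; the sketch below shows one way to restore them and to confirm that both approaches give the same result. It uses a hypothetical toy DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with two of the dataset's column names.
toy_df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0],
    "BMI": [33.6, 26.6, np.nan],
})

# Approach A: pandas fillna with the column means.
filled_df = toy_df.fillna(toy_df.mean())

# Approach B: SimpleImputer, then wrap the array back into a DataFrame
# that reuses the original column names and index.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_array = imputer.fit_transform(toy_df)
imputed_df = pd.DataFrame(imputed_array, columns=toy_df.columns, index=toy_df.index)

print(np.allclose(imputed_df.values, filled_df.values))
```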

___
If you're curious to learn more about how to handle missing data effectively, check out this additional resource:

:point_right: [Statistical Imputation for Missing Values in Machine Learning](https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/)