**Objective:**

The diabetes dataset provides a quantitative measure of disease progression for patients in a study by the Statistics Department at Stanford University.

Our goal here is to build a data preprocessing pipeline and an ML model that predicts a patient's diabetes progression from the following attributes:
- age : age in years
- sex : gender
- bmi : body mass index
- bp : average blood pressure
- s1 : tc, total serum cholesterol
- s2 : ldl, low-density lipoproteins
- s3 : hdl, high-density lipoproteins
- s4 : tch, total cholesterol / HDL
- s5 : ltg, possibly log of serum triglycerides level
- s6 : glu, blood sugar level
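
As a rough preview of that goal, the sketch below chains one preprocessing step and one regressor into a single scikit-learn pipeline. The choice of `StandardScaler` and `LinearRegression`, the 80/20 split, and the random seed are all illustrative assumptions, not necessarily the steps this notebook settles on.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the 10 features (X) and the progression target (y)
X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the patients for evaluation (split size and seed are arbitrary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed, illustrative steps: scale the features, then fit a linear model
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Coefficient of determination (R^2) on the held-out patients
r2 = model.score(X_test, y_test)
print(f'Test R^2: {r2:.3f}')
```

Wrapping both steps in one pipeline means the scaler is fitted on the training split only, so no information from the held-out patients leaks into preprocessing.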

Let's load the essential packages and check which versions are in use:

In [1]:
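# report the interpreter version at runtime instead of hard-coding it
import platform
print('Python Version :', platform.python_version())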

# for data manipulation
import pandas as pd
print('Pandas Version : ', pd.__version__)

# for data visualization
import matplotlib.pyplot as plt 
import matplotlib as mpl
print('Matplotlib Version : ', mpl.__version__)

# for machine learning
import sklearn as skl
print('SciKit-Learn Version : ', skl.__version__)

Python Version : 3.9.18
Pandas Version :  2.0.3
Matplotlib Version :  3.7.2
SciKit-Learn Version :  1.3.0


## Obtaining Dataset

We load the data from scikit-learn's built-in datasets, check its shape and feature names, and then print its description.

In [6]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(diabetes.data.shape, diabetes.target.shape)
print(diabetes.feature_names, ['dpi']) # target is named 'diabetes progression indicator (dpi)'

(442, 10) (442,)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'] ['dpi']
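
Since pandas is already imported for data manipulation, it can be convenient to hold the features and target in one DataFrame. The following is a minimal sketch that assumes a scikit-learn version supporting the `as_frame` option; the variable names `diabetes_frame` and `df` are illustrative, and the target column is renamed to `dpi` to match the name used above.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of NumPy arrays
diabetes_frame = load_diabetes(as_frame=True)

# .frame combines the 10 feature columns with the target column;
# rename 'target' to 'dpi' (diabetes progression indicator)
df = diabetes_frame.frame.rename(columns={'target': 'dpi'})

print(df.shape)
print(df.columns.tolist())
```

With the data in this form, `df.head()` and `df.describe()` give a quick look at the individual columns.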


In [7]:
print(diabetes.DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1