The Diabetes Dataset is a well-known machine learning dataset, bundled with scikit-learn and commonly used for regression tasks. It contains records for 442 diabetes patients: ten baseline predictors (age, sex, BMI, average blood pressure, and six blood serum measurements) and a quantitative measure of disease progression one year after baseline.

**Problem Statement:**
The goal is to predict how far diabetes will progress in a patient from that patient's medical attributes. This is a regression problem because the target variable (disease progression) is a continuous quantity rather than a class label.

**How Linear Regression Can Help:**
Linear regression models a dependent variable as a weighted sum of one or more independent variables plus an intercept, with the weights chosen to minimize the squared prediction error on the training data. By applying linear regression to the Diabetes Dataset, we can:
- Learn the relationship between patient features and disease progression.
- Predict the future progression of diabetes for new patients based on their medical data.
- Identify which features are most influential in predicting disease progression.

In summary, linear regression provides a simple yet effective way to model and predict diabetes progression using the available patient data.
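
As a minimal sketch of what fitting a linear regression involves, the example below estimates coefficients and an intercept by least squares with NumPy. The data is synthetic and the names (`X_demo`, `true_coef`, `true_intercept`) are illustrative assumptions, not part of the Diabetes Dataset.

```python
import numpy as np

# Synthetic data for illustration only (not the Diabetes Dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y_demo = X_demo @ true_coef + true_intercept + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept is estimated with the coefficients.
A = np.column_stack([X_demo, np.ones(len(X_demo))])

# Solve min ||A w - y||^2 with NumPy's least-squares solver.
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
coef_hat, intercept_hat = w[:-1], w[-1]

print("estimated coefficients:", np.round(coef_hat, 2))
print("estimated intercept:", round(float(intercept_hat), 2))
```

With little noise and 200 samples, the estimates should land close to the values used to generate the data, which is the same idea scikit-learn's `LinearRegression` applies to the patient features.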

## Steps to Solve the Diabetes Progression Prediction Problem

1. **Load the Dataset**  
    Import the Diabetes Dataset, which contains medical attributes and the target variable (disease progression).

2. **Explore and Understand the Data**  
    Examine the dataset to understand the features, check for missing values, and get basic statistics.

3. **Preprocess the Data**  
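    Prepare the data for modeling by handling missing values, encoding categorical variables (if any), and scaling features if necessary. The features returned by scikit-learn's loader are already numeric, mean-centered, and scaled, so this dataset needs little preprocessing.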

4. **Split the Data**  
    Divide the dataset into training and testing sets so that performance is measured on data the model has not seen during training.

5. **Build and Train the Linear Regression Model**  
    Use linear regression to model the relationship between the features and the target variable.

6. **Evaluate the Model**  
    Assess the model's performance using appropriate regression metrics such as Mean Squared Error (MSE) or R² score.

7. **Interpret the Results**  
    Examine the fitted coefficients to see how each feature contributes to the prediction, keeping in mind that correlated features can distort individual coefficients.

8. **Make Predictions**  
    Use the trained model to predict diabetes progression for new or unseen data.

In [4]:
## 1. Load the Dataset

from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = pd.Series(diabetes.target, name='disease_progression')

In [5]:
## 2. Explore and Understand the Data

# Display the first few rows of the features and target
print(X.head())
print(y.head())

# Check for missing values and basic statistics
X.info()  # info() prints its report directly and returns None, so no print() is needed
print(X.describe())

        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  
0 -0.002592  0.019907 -0.017646  
1 -0.039493 -0.068332 -0.092204  
2 -0.002592  0.002861 -0.025930  
3  0.034309  0.022688 -0.009362  
4 -0.002592 -0.031988 -0.046641  
0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: disease_progression, dtype: float64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64

In [6]:
## 3. Preprocess the Data

from sklearn.preprocessing import StandardScaler

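# Standardize the features to zero mean and unit variance.
# Note: the scaler is fit on the full dataset before the split, so test-set
# statistics leak into preprocessing. For ordinary least squares this changes
# only the scale of the coefficients, not the predictions.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)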

In [7]:

## 4. Split the Data

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
# (test_size is left at its default, which holds out 25% of the rows)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=42)

In [8]:

## 5. Build and Train the Linear Regression Model

from sklearn.linear_model import LinearRegression

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
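
The notebook scales the full dataset before splitting it. A leak-free variant of steps 3 to 5 can be sketched with a scikit-learn pipeline, which fits the scaler on the training rows only. The names `X_raw`, `X_tr`, `X_te`, and `pipe` are illustrative and chosen so they do not clash with the notebook's variables.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the features and target as NumPy arrays.
X_raw, y_raw = load_diabetes(return_X_y=True)

# Split first, so the test rows play no part in preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, random_state=42)

# fit() fits the scaler on the training rows only, then fits the regression.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# score() applies the stored scaling to the test rows and returns R².
test_r2 = pipe.score(X_te, y_te)
print(f"Test R²: {test_r2:.2f}")
```

Because rescaling features does not change ordinary least squares predictions, this variant is expected to score the same as the notebook's model. The benefit shows up with models that are sensitive to feature scale, such as regularized regression.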

In [9]:

## 6. Evaluate the Model

from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

Mean Squared Error: 2848.31
R² Score: 0.48
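
To make these numbers easier to read: the square root of the MSE (RMSE) is in the same units as the target, and for the MSE reported above it is about 53.4. R² measures the improvement over always predicting the mean. The sketch below shows how the metrics relate, using hand-written helpers on a small made-up example; the function names and arrays are illustrative assumptions.

```python
import numpy as np

def mse(y_true, y_hat):
    """Mean of the squared residuals."""
    return float(np.mean((y_true - y_hat) ** 2))

def r2(y_true, y_hat):
    """1 - SS_res / SS_tot: improvement over always predicting the mean."""
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

print(f"MSE:  {mse(y_true, y_hat):.1f}")           # 100.0
print(f"RMSE: {np.sqrt(mse(y_true, y_hat)):.1f}")  # 10.0
print(f"R²:   {r2(y_true, y_hat):.3f}")            # 0.968

# A constant prediction at the mean scores an R² of 0.
baseline = np.full_like(y_true, y_true.mean())
print(f"Baseline R²: {r2(y_true, baseline):.3f}")  # 0.000
```

Read this way, an R² of 0.48 means the model's squared error on the test set is roughly half that of a mean-only baseline.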


In [10]:


## 7. Interpret the Results

# Display the coefficients for each feature
coefficients = pd.Series(model.coef_, index=diabetes.feature_names)
print(coefficients.sort_values(ascending=False))

s5     33.096217
bmi    25.303283
s2     24.175357
bp     18.149095
s4     12.818442
s3      5.562752
age     2.271221
s6      1.252133
sex   -11.510334
s1    -43.688728
dtype: float64
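
Large coefficients with opposite signs, such as `s1` and `s2` here, can reflect correlated predictors rather than genuinely opposing effects. One quick check, sketched below, is to inspect the pairwise correlations between features. The 0.7 threshold and the names `features`, `corr`, and `pairs` are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()
features = pd.DataFrame(data.data, columns=data.feature_names)

# Pairwise Pearson correlations between the ten features.
corr = features.corr()

# Correlation between the two serum measurements with the largest opposing coefficients.
print(f"corr(s1, s2) = {corr.loc['s1', 's2']:.2f}")

# List feature pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.7  # illustrative cutoff, not a rule
pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```

When two features are strongly correlated, the model can trade weight between them with little change in fit, so their individual coefficients should be interpreted with caution.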


In [11]:

## 8. Make Predictions

# Predict disease progression for new data (example: first 5 samples from test set)
new_predictions = model.predict(X_test[:5])
print(new_predictions)

[137.94908878 182.533354   129.85295373 292.5630923  124.86788221]
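
To sanity-check individual predictions, it can help to place them next to the observed values. The sketch below rebuilds the notebook's model so that it runs on its own and tabulates predictions, actual values, and residuals for the same five test rows. The names `X_all`, `reg`, and `comparison` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the notebook's model (same scaling and split) so this block is self-contained.
X_all, y_all = load_diabetes(return_X_y=True)
X_all = StandardScaler().fit_transform(X_all)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=42)
reg = LinearRegression().fit(X_tr, y_tr)

# Compare predictions with observed values for the first five test rows.
comparison = pd.DataFrame({
    "predicted": reg.predict(X_te[:5]),
    "actual": y_te[:5],
})
comparison["residual"] = comparison["actual"] - comparison["predicted"]
print(comparison.round(1))
```

Large residuals on individual rows are expected with a test MSE of this size, so a table like this is best read as a spot check rather than as an evaluation.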
