<a href="https://colab.research.google.com/github/jtao/dswebinar/blob/master/intro_to_ds/case0/FirstPeek.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A sample data science project with the diabetes dataset

[Jian Tao](https://orcid.org/0000-0003-4228-6089), Texas A&M University

May 1, 2021

### The goal of this project is to build a model to predict disease progression.

Given a dataset, we will
1. explore the diabetes data set,
2. build a multilinear regression model with the 3 features most strongly correlated with the target,
3. create a Deep Neural Network with 2 hidden layers, and finally,
4. compare the models and discuss the results.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import seaborn as sns
import os

### 1. First of all, load and explore the data

In [None]:
# Load the diabetes data set distributed with scikit-learn.
diabetes = datasets.load_diabetes()

# Put the data into a pandas DataFrame for exploratory data analysis (EDA).
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

In [None]:
print(diabetes.DESCR)

In [None]:
df.describe().T

In [None]:
df.info()

In [None]:
df

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(), annot=True);

In [None]:
g = sns.pairplot(df[["target", "bmi", "bp", "s4", "s5"]])
g.map_lower(sns.kdeplot, levels=4, color=".2")

### 2. Build a multilinear regression model with top 3 correlated features
The 3 features most strongly correlated with the target are bmi (0.59), s5 (0.57), and bp (0.44).
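
The ranking can be read off the heatmap above or computed directly. The following is a minimal sketch, assuming the same `load_diabetes` data; the names `corr_with_target` and `top3` are illustrative and do not appear in the original notebook.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame so this cell runs on its own.
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

# Pearson correlation of every feature with the target.
corr_with_target = df.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not overlooked.
top3 = corr_with_target.abs().sort_values(ascending=False).head(3)
print(top3.round(2))
```

Ranking by absolute value is a design choice here: a feature with a strong negative correlation is as useful to a linear model as one with a strong positive correlation.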

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = df[["bmi", "bp", "s5"]]
y = df[["target"]]

multi_reg = LinearRegression()
multi_reg.fit(X, y)

# Predict on the same rows the model was fit on, so the scores below are in-sample.
y_pred = multi_reg.predict(X)

print('Coefficients:', multi_reg.coef_)
print('MSE:', mean_squared_error(y, y_pred))
print('R-sq:', r2_score(y, y_pred))

In [None]:
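plt.scatter(y, y_pred)
plt.plot(y, y)  # diagonal: points on this line are perfect predictions
plt.xlabel("observed target")
plt.ylabel("predicted target")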

### 3. Build a Deep Neural Network with 2 hidden Dense layers using all the features

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import datasets

In [None]:
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

In [None]:
X = df.drop("target", axis = 1)
y = df["target"]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state = 101)
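
To see how many rows end up on each side of the split, the same call can be applied to a plain index array. This is a small sketch, assuming the 442 rows of the diabetes data set; `idx_train` and `idx_test` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 442  # number of rows in the diabetes data set

# Same test_size and random_state as the split above, applied to row indices.
idx_train, idx_test = train_test_split(
    np.arange(n_samples), test_size=0.3, random_state=101
)
print(len(idx_train), len(idx_test))  # 309 133
```

The network is therefore trained on at most 309 rows, and the `validation_split=0.2` argument passed to `fit` later sets aside roughly a fifth of those for validation.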

In [None]:
X_train

In [None]:
# Neural network: two hidden ReLU layers (12 and 10 units) and one linear output unit
model = Sequential()
model.add(Dense(12, input_dim=10, activation="relu"))
model.add(Dense(10, activation="relu"))
model.add(Dense(1))

# compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# train the model (set verbose=1 to print per-epoch progress)
model.fit(X_train, y_train, validation_split=0.2, epochs=400, verbose=0)
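
The size of this network can be worked out by hand: each Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below does the arithmetic in plain Python (the names `layer_sizes` and `params_per_layer` are illustrative); the totals should agree with what `model.summary()` reports.

```python
# Layer widths: 10 inputs -> 12 -> 10 -> 1 output, as in the model above.
layer_sizes = [10, 12, 10, 1]

# Each Dense layer has (inputs x units) weights plus one bias per unit.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)  # [132, 130, 11] 273
```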

In [None]:
y_pred = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print('MSE:', mean_squared_error(y_test, y_pred) )

print('R-sq:', r2_score(y_test, y_pred) )
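
For reference, both metrics are simple to compute by hand: MSE is the mean of the squared residuals, and R-sq is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses toy values rather than the notebook's predictions; `mse_and_r2` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R-squared) for two equally sized sequences."""
    # ravel() flattens column vectors such as the (n, 1) output of a Keras model.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return ss_res / y_true.size, 1.0 - ss_res / ss_tot

mse, r2 = mse_and_r2([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(mse, round(r2, 3))  # 0.375 0.949
```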

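### 4. Comparing models
1. R2 - Multilinear (3 features, scored on the data it was fit on): 0.48008281990946056
2. R2 - Deep Learning (all 10 features, scored on the held-out test set): 0.4773411304310343

The R2 score of the multilinear model is comparable to that of the Deep Learning regressor. Note that the two scores are not measured on the same data: the multilinear score is in-sample, while the Deep Learning score comes from the 30% test split, so the comparison is indicative rather than like-for-like. The multilinear model requires selecting the features by hand. For this dataset, the 3 selected features give a relatively good result, comparable to that of the Deep Learning model, which uses all 10 features.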
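
For a like-for-like comparison, the multilinear model can be refit on the training split used for the network and scored on the same test rows. This is a minimal sketch under that assumption; `linear_holdout` and `y_pred_holdout` are illustrative names, and no result is claimed here.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Rebuild the DataFrame so this cell runs on its own.
diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

# Same 3 features as the multilinear model, same split as the network.
X = df[["bmi", "bp", "s5"]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Fit on the training rows only, then score on the held-out rows.
linear_holdout = LinearRegression().fit(X_train, y_train)
y_pred_holdout = linear_holdout.predict(X_test)

print("MSE (test):", mean_squared_error(y_test, y_pred_holdout))
print("R-sq (test):", r2_score(y_test, y_pred_holdout))
```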