# Linear Regression, Correlation, Coefficient of Determination

We will fit a model that predicts diabetes disease progression using the 'diabetes' dataset in sklearn.
We will use Linear Regression to fit the best-fit line (a hyperplane, when there are several features) relating the input features to the output response. The output response is a measure of disease progression one year after baseline measurements were taken.
We will then calculate the regression coefficients, the mean squared error, and the coefficient of determination.
We will also standardize the features so that the regression coefficients are on a comparable scale.


We only briefly touch on how the best fit between the input features and the output response is determined. A more thorough explanation of related techniques can be found in the Least Angle Regression paper:
http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf
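
One common way to obtain a best fit is ordinary least squares, which chooses the coefficients that minimize the sum of squared differences between the predictions and the response. The sketch below illustrates the idea with NumPy on synthetic data; the data, the "true" coefficients, and all variable names are assumptions made for illustration and are not taken from the diabetes dataset.

```python
import numpy as np

# Synthetic data (hypothetical): y = 3 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_intercept, true_coefs = 3.0, np.array([2.0, -1.0])
y = true_intercept + X @ true_coefs + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is estimated along with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: find beta minimizing ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

print('intercept:', intercept)
print('coefficients:', coefs)
```

The recovered intercept and coefficients should land close to the values used to generate the data, since the noise is small.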

In [1]:
from sklearn import datasets
from sklearn import preprocessing
import pandas as pd

diabetes_data = datasets.load_diabetes()
df_diabetes = pd.DataFrame(diabetes_data.data,columns=diabetes_data.feature_names)
df_diabetes['response'] = pd.Series(diabetes_data.target)
df_diabetes.head(10)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | response |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |
| 5 | -0.092695 | -0.044642 | -0.040696 | -0.019442 | -0.068991 | -0.079288 | 0.041277 | -0.076395 | -0.04118 | -0.096346 | 97.0 |
| 6 | -0.045472 | 0.05068 | -0.047163 | -0.015999 | -0.040096 | -0.0248 | 0.000779 | -0.039493 | -0.062913 | -0.038357 | 138.0 |
| 7 | 0.063504 | 0.05068 | -0.001895 | 0.06663 | 0.09062 | 0.108914 | 0.022869 | 0.017703 | -0.035817 | 0.003064 | 63.0 |
| 8 | 0.041708 | 0.05068 | 0.061696 | -0.040099 | -0.013953 | 0.006202 | -0.028674 | -0.002592 | -0.014956 | 0.011349 | 110.0 |
| 9 | -0.0709 | -0.044642 | 0.039062 | -0.033214 | -0.012577 | -0.034508 | -0.024993 | -0.002592 | 0.067736 | -0.013504 | 310.0 |


In [None]:
list(df_diabetes.columns)  # let's just look at the column names


In [None]:
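# predictors is a list of the input feature names used to fit the model
# (the numbers in trailing comments appear to be scores noted from earlier runs)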

#predictors = ['age']
#predictors = ['bmi'] # 0.40
#predictors = ['bp'] # 0.19
#predictors = ['s5'] # 0.31
#predictors = ['s4'] # 0.17
predictors = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'] # 0.46, 3201
#predictors = ['bmi', 'bp', 's4', 's5'] # 0.45, 2923
#predictors = [ 'bmi', 's5'] # 0.364
#predictors = [ 'bmi', 's5','s6' ]
#predictors = [ 'bmi', 's5','s6', 'bp' ]

In [None]:
# response is a list containing the response
# response is a measure of disease progression one year after baseline measurements were taken

response = ['response']

In [None]:
x = df_diabetes[predictors]
# Standardize each feature to zero mean and unit variance (returns a NumPy array)
x = preprocessing.scale(x)
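
To make the standardization step concrete, here is a minimal NumPy sketch of what column-wise scaling to zero mean and unit variance does. The small feature matrix is hypothetical, and the use of the population standard deviation (`ddof=0`) is chosen to mirror the default behaviour of `preprocessing.scale`.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Column-wise standardization: subtract the column mean, then divide by
# the column standard deviation (population std, ddof=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns are on the same footing, which is what makes the fitted regression coefficients comparable across features.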

In [None]:
y = df_diabetes[response]

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Default split: 75% training, 25% test; pass random_state=<int> for a reproducible split
x_train, x_test, y_train, y_test = train_test_split(x, y)

In [None]:
linrgr = LinearRegression()
linrgr.fit(x_train, y_train)

In [None]:
# score() returns the coefficient of determination (R^2) on the given data
linrgr.score(x_test, y_test)

In [None]:
# Predict the response for the test inputs
y_pred = linrgr.predict(x_test)

# Regression coefficients, one per input feature
print('Regression Coeffs:\n', linrgr.coef_)

# Coefficient of determination (R^2): the fraction of the variance in the response
# that the model explains. It describes the model as a whole, not individual features.
# It equals the squared correlation coefficient only in special cases, such as a
# one-feature least-squares fit scored on its own training data.
print('Coeff of Determination:\n', r2_score(y_test, y_pred))

# Mean squared error between the model predictions and the actual
# response values in the test set
print('MSE: \n', mean_squared_error(y_test, y_pred))
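
The two metrics above can also be computed by hand, which may help clarify what they measure. The sketch below uses a few hypothetical actual and predicted values (not output from the model in this notebook) and the standard definitions: MSE is the mean of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Hypothetical actual and predicted values, used only to illustrate the formulas.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # mean squared error
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print('MSE:', mse)
print('R^2:', r2)
```

For these values the MSE works out to 1.5 / 4 = 0.375 and R^2 to 1 - 1.5 / 20 = 0.925, up to floating-point rounding.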


In [None]:
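import seaborn as sns

# A fitted line can only be drawn in 2-D when the model has a single predictor.
# Plot the standardized test feature so the points and the line share one scale.
if x_test.shape[1] == 1:
    sns.scatterplot(x=x_test[:, 0], y=y_test.values.ravel())
    plt.plot(x_test[:, 0], y_pred.ravel(), color='red', linewidth=2)
else:
    # With several predictors, compare actual and predicted responses instead.
    sns.scatterplot(x=y_test.values.ravel(), y=y_pred.ravel())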

In [None]:
# y was passed as a one-column DataFrame, so coef_ has shape (1, n_features);
# flatten it to a plain list with one coefficient per feature
feature_importance = linrgr.coef_.ravel().tolist()
print(feature_importance)

In [None]:
import matplotlib.pyplot as plt; plt.rcdefaults()
import numpy as np

# Label the bars with the predictors actually used to fit the model
features = predictors
x_pos = np.arange(len(features))

# The features were standardized, so the coefficient magnitudes are comparable
plt.bar(x_pos, feature_importance, align='center', alpha=0.5)
plt.xticks(x_pos, features)
plt.ylabel('Regression Coefficient')
plt.title('Feature Importance')
plt.show()
print(feature_importance)
