# Linear Regression Project

For this project you will use the diabetes dataset that ships with scikit-learn.

To simplify things, you will use only one feature of the dataset (the body mass index, column index 2), so that the data points can be shown in a two-dimensional plot.

The plot must also show a straight line: linear regression fits the line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.

## Your work to do

Your work is to implement two functions:
1. `costFunction`: a function that, for a linear model, computes the cost for the coefficients theta and the gradient of that cost
2. `predict`: a function that computes the hypothesis (the predicted values) of your linear model
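
For reference, the quantities involved in the two functions can be written out as follows. This is a sketch under an assumption: the notebook does not fix the scaling of the cost, so the common $\frac{1}{2m}$ convention is used here. Any positive scaling has the same minimizer. The symbols $m$ (number of training samples), $x^{(i)}$, $y^{(i)}$ and $x_0^{(i)} = 1$ (the column of ones) are notation introduced for this illustration.

$$
h_\theta(x) = \theta_0 + \theta_1 x
$$

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

$$
\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad j = 0, 1
$$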

Also, you should use `scipy.optimize` to optimize your model, i.e., to fit the model to your data. 
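
To see how `scipy.optimize.minimize` expects a function that returns both a value and a gradient, here is a minimal sketch on a toy quadratic. The function `toy_objective` and its minimum at (3, -1) are made up for this illustration and are not part of the assignment. Passing `jac=True` tells `minimize` that the objective returns the pair `(value, gradient)`.

```python
import numpy as np
from scipy.optimize import minimize


def toy_objective(w):
    """Return the value and gradient of f(w) = (w0 - 3)^2 + (w1 + 1)^2."""
    value = (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2
    grad = np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])
    return value, grad


# jac=True: the objective returns (value, gradient) as a tuple
result = minimize(toy_objective, np.zeros(2), jac=True)

print(result.x)    # location of the minimum found by the optimizer
print(result.fun)  # objective value at that location
```

The result object can be read either by attribute (`result.x`) or by key (`result['x']`).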

Finally, after analyzing the comparison with scikit-learn, you have to answer the questions at the end of this notebook.

## Comparison

The coefficients, the mean squared error (the residual sum of squares divided by the number of samples) and the coefficient of determination are computed so that you can compare your results with those of the scikit-learn linear model.
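
As a small illustration of how these metrics relate, the following sketch computes them by hand with NumPy. The arrays `y_true` and `y_pred` are made-up values for this example only, not results from the diabetes data.

```python
import numpy as np

# Made-up observations and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# Residual sum of squares: squared differences between observed and predicted
rss = np.sum((y_true - y_pred) ** 2)

# Mean squared error: the residual sum of squares averaged over the samples
mse = rss / y_true.size

# Total sum of squares: squared deviations of the observations from their mean
tss = np.sum((y_true - y_true.mean()) ** 2)

# Coefficient of determination: 1 means a perfect prediction
r2 = 1.0 - rss / tss

print(rss, mse, r2)
```

These are the same definitions that `mean_squared_error` and `r2_score` from `sklearn.metrics` use for a single target.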

The plot could look like this:
![linear plot](https://scikit-learn.org/stable/_images/sphx_glr_plot_ols_001.png)

## Diabetes dataset
For reference on the dataset, see the [scikit-learn toy datasets documentation](https://scikit-learn.org/stable/datasets/toy_dataset.html).

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

    Number of Instances  |  442
    Number of Attributes |  First 10 columns are numeric predictive values
    Target               |  Column 11 is a quantitative measure of disease progression one year after baseline
    Attribute Information|  age in years
                            sex
                            bmi body mass index
                            bp average blood pressure
                            s1 tc, total serum cholesterol
                            s2 ldl, low-density lipoproteins
                            s3 hdl, high-density lipoproteins
                            s4 tch, total cholesterol / HDL
                            s5 ltg, possibly log of serum triglycerides level
                            s6 glu, blood sugar level
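
To check the table above against the data itself, a short sketch like the following lists the feature names and locates the column used in this project. It assumes scikit-learn is installed. The variable names `data` and `bmi_index` are chosen for this illustration.

```python
from sklearn import datasets

# Load the dataset as a Bunch object to get access to its metadata
data = datasets.load_diabetes()

print(data.data.shape)      # (number of instances, number of attributes)
print(data.feature_names)   # attribute names, in column order

# Position of the body mass index column among the features
bmi_index = data.feature_names.index('bmi')
print(bmi_index)
```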




In [None]:
!pip install scikit-learn
!pip install scipy

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model, datasets
from sklearn.metrics import mean_squared_error, r2_score
from scipy.optimize import minimize

In [None]:
# Load the diabetes dataset
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature (column index 2, the body mass index)
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Number of samples (m) and number of features (n)
m, n = diabetes_X.shape

# Prepend a column of ones to X for the intercept term
diabetes_X = np.concatenate((np.ones((m, 1)), diabetes_X), axis=1)

# Initialize fitting parameters
initial_theta = np.zeros(n + 1)

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

In [None]:
# Linear model 
# Cost function and Gradient
def costFunction(theta, X, y):

    cost = np.zeros(1)
    grad = np.zeros(theta.shape)

    # You have to complete this function: compute the cost (a scalar array)
    # and its gradient (same shape as theta)



    return cost.item(), grad

In [None]:
# Train the model using the training sets

# Minimize the cost function with scipy.optimize.minimize;
# jac=True means costFunction returns both the cost and its gradient
options = {'maxiter': 400, 'gtol': 1e-8, 'disp': True}
solution = minimize(costFunction, initial_theta,
                    args=(diabetes_X_train, diabetes_y_train),
                    jac=True, options=options)
cost = solution['fun']
theta = solution['x']
# The cost at the optimum
print('Cost at theta found by minimize function: ', cost)
# The coefficients
print('Coefficients (theta): \n', theta)

In [None]:
def predict(theta, X):
    '''Predict the result value using learned linear
    regression parameters '''

    m, n = X.shape
    p = np.zeros((m, 1))

    # You have to complete this function: compute the hypothesis for every row of X

    return p

In [None]:
# Make predictions using the testing set
diabetes_y_pred = predict(theta, diabetes_X_test)

# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(diabetes_y_test, diabetes_y_pred))

In [None]:
# Plot outputs
plt.scatter(diabetes_X_test[:,1:], diabetes_y_test,  color='black')
plt.plot(diabetes_X_test[:,1:], diabetes_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

In [None]:
# Now compare your result with that from Scikit-Learn

# Create linear regression object
regr = linear_model.LinearRegression(fit_intercept=True)

# Split the data into training/testing sets, dropping the column of ones
# (scikit-learn fits the intercept itself when fit_intercept=True)
diabetes_X_train = diabetes_X[:-20,1:]
diabetes_X_test = diabetes_X[-20:,1:]

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.intercept_, regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(diabetes_y_test, diabetes_y_pred))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()



## Questions

1. Did you perform any data preprocessing (standardization, scaling, normalization, categorical encoding or imputation)?
2. Are the model parameters theta the same as the intercept and coefficients from scikit-learn? Why do you think this happens? Do they represent (almost) the same line?
3. Do you have to modify the two functions that you implemented in order to fit a model that uses more than one feature of the dataset?

