## Visual project

<h4>
    I will construct hypothetical outcome plots (HOPs) and spaghetti plots for a fitted regression line, and will then apply the same approach to the diabetes dataset.
    </h4>

In [1]:
import time

import altair as alt
import ipywidgets as widgets
import numpy as np
import pandas as pd
from ipywidgets import interact
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures


In [2]:
salary_df = pd.read_csv("Salary_Data.csv")
salary_df.head()

Unnamed: 0,YearsExperience,Salary
0,1.1,39343.0
1,1.3,46205.0
2,1.5,37731.0
3,2.0,43525.0
4,2.2,39891.0


<h5> Scatter plot of data </h5>

In [3]:
def get_salary_points_chart():
    # scatter plot of salary against years of experience
    graph = alt.Chart(salary_df).mark_circle(size=60).encode(
        x='YearsExperience',
        y='Salary'
    )

    return graph


get_salary_points_chart()

<h5> Bootstrap one linear regression fit </h5>

We will need a function that returns one bootstrap sample of the regression fit. That is, it resamples the dataset with replacement, then fits a linear regression to the resampled data.

<br>

The function below returns a `sklearn.linear_model.LinearRegression` model fitted to a bootstrap-resampled version of `salary_df`.

In [4]:
def get_one_bootstrap_salary_fit():
    #resample the data with replacement (replace=True) to a data frame with 
    #the same number of data points (frac=1.0)
    resampled_df = salary_df.sample(frac=1.0, replace=True)

    X = resampled_df[['YearsExperience']] 
    y = resampled_df['Salary']           
    
    model = LinearRegression()
    model.fit(X,y)
    
    return model
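
To make the effect of resampling with replacement concrete, here is a minimal sketch on a toy data frame (`toy_df` and its `row_id` column are hypothetical stand-ins, not part of the salary data). It uses the same `sample(frac=1.0, replace=True)` call as above, with a fixed `random_state` only so the run is repeatable. The resample has the same number of rows as the original, but some rows appear more than once and others are left out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame standing in for salary_df
toy_df = pd.DataFrame({"row_id": np.arange(30)})

# Same call as in get_one_bootstrap_salary_fit, seeded for repeatability
resampled = toy_df.sample(frac=1.0, replace=True, random_state=0)

n_rows = len(resampled)                    # same size as the original
n_unique = resampled["row_id"].nunique()   # distinct original rows that were drawn
print(n_rows, n_unique)

# Expected share of distinct rows in a bootstrap resample of size n
n = len(toy_df)
expected_unique_fraction = 1 - (1 - 1 / n) ** n
print(round(expected_unique_fraction, 3))
```

For moderately large `n` the expected share of distinct rows approaches about 63%, which is why every bootstrap fit sees a slightly different dataset.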

In [5]:
np.random.seed(1234)
fit = get_one_bootstrap_salary_fit()

We can use this function to get a single sample from the bootstrap sampling distribution of the fit (e.g., its slope and intercept). Each time you run the following cell you should get slightly different values:

In [6]:
salary_reg = get_one_bootstrap_salary_fit()
print("Bootstrapped intercept: ", salary_reg.intercept_)
print("Bootstrapped slope:     ", salary_reg.coef_[0])

Bootstrapped intercept:  23608.86257018456
Bootstrapped slope:      9759.092501075673
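
Repeating this many times gives an approximation of the bootstrap sampling distribution of the slope. The sketch below is only an illustration under stated assumptions: the data are synthetic (not `Salary_Data.csv`), `np.polyfit` stands in for the `LinearRegression` fit, and the 95% percentile interval is one common way to summarize the distribution, not something the notebook itself computes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 points scattered around a line
x = rng.uniform(1, 10, size=30)
y = 25000 + 9000 * x + rng.normal(0, 5000, size=30)

B = 1000
slopes = np.empty(B)
for b in range(B):
    # resample row indices with replacement, same size as the data
    idx = rng.integers(0, len(x), size=len(x))
    # np.polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes[b] = slope

lower, upper = np.percentile(slopes, [2.5, 97.5])
print(f"95% percentile interval for the slope: [{lower:.0f}, {upper:.0f}]")
```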


<h5> Construct an Altair chart of one regression fit </h5>
To construct a chart of a fit line or fit curve, we first need a *prediction grid*: a set of x values we want to use to make predictions. This should be in the same form as the input to the regression function (i.e., a DataFrame). 

For this example, we will use evenly spaced values of `"YearsExperience"`, the x value in our charts. Because the fit is linear, strictly speaking we only need 2 values, but we will use more (101) so that the same approach carries over to curved fits. When you plot non-linear relationships (as we will in the polynomial regression section below), you need a large number of points in your prediction grid so that the curve is smooth.

In [7]:
# construct a prediction grid for the salary dataset with 101 
# evenly-spaced values from the minimum to maximum number of years of experience
salary_pred_grid = pd.DataFrame({'YearsExperience': np.linspace(
    salary_df['YearsExperience'].min(), 
    salary_df['YearsExperience'].max(), 
    num=101
)})
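
As a quick check of the two claims above, the following sketch uses hypothetical values (`lo`, `hi`, `slope`, and `intercept` are made up for illustration). It shows that `np.linspace` with `num=101` includes both endpoints and splits the range into 100 equal steps, and that for a straight line the two endpoint predictions already determine every other point on the grid.

```python
import numpy as np

lo, hi = 1.1, 10.5   # hypothetical minimum and maximum of YearsExperience
grid = np.linspace(lo, hi, num=101)

# 101 points means 100 equal intervals between the endpoints
step = grid[1] - grid[0]
print(len(grid), grid[0], grid[-1], step)

# Hypothetical fitted line
slope, intercept = 9000.0, 25000.0
line_on_grid = intercept + slope * grid

# Linear interpolation between the two endpoint predictions
line_from_endpoints = np.interp(
    grid, [lo, hi], [intercept + slope * lo, intercept + slope * hi]
)
same_line = np.allclose(line_on_grid, line_from_endpoints)
print(same_line)
```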

The function below displays a single fit line for the linear regression model passed to it.

In [8]:
def get_salary_linear_fit_chart(salary_reg, opacity=0.5):
    pred_df = pd.DataFrame({
        'YearsExperience': salary_pred_grid['YearsExperience'],
        'Salary': salary_reg.predict(salary_pred_grid)
    })
    
    graph = alt.Chart(pred_df).mark_line(opacity=opacity).encode(
        x="YearsExperience",
        y="Salary"
    )
    
    return graph

get_salary_linear_fit_chart(salary_reg)

<h5> Constructing a HOP </h5>

In [9]:
points_chart = get_salary_points_chart()
salary_reg = get_one_bootstrap_salary_fit()
line_chart = get_salary_linear_fit_chart(salary_reg)
line_chart + points_chart

In [10]:
def get_one_frame(i):
    time.sleep(.2)
    points_chart = get_salary_points_chart()
    salary_reg = get_one_bootstrap_salary_fit()
    line_chart = get_salary_linear_fit_chart(salary_reg)
    return line_chart + points_chart

interact(get_one_frame, i = widgets.Play(
    value=0,
    min=0,
    max=100,
    step=1,
    description="Press play",
    disabled=False))

interactive(children=(Play(value=0, description='Press play'), Output()), _dom_classes=('widget-interact',))

<function __main__.get_one_frame(i)>
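
In `get_one_frame`, the argument `i` supplied by the `Play` widget is not used, so each frame draws a fresh random resample and a given frame cannot be revisited. One possible variation, sketched here on hypothetical data, seeds the resample with the frame index through the `random_state` argument of `DataFrame.sample`. The names `demo_df` and `bootstrap_slope_for_frame` are illustrative, and `np.polyfit` stands in for the `LinearRegression` fit.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for salary_df
rng = np.random.default_rng(1)
years = rng.uniform(1, 10, size=30)
demo_df = pd.DataFrame({
    "YearsExperience": years,
    "Salary": 25000 + 9000 * years + rng.normal(0, 5000, size=30),
})

def bootstrap_slope_for_frame(i):
    # seeding the resample with the frame index makes frame i repeatable
    resampled = demo_df.sample(frac=1.0, replace=True, random_state=i)
    slope, intercept = np.polyfit(
        resampled["YearsExperience"], resampled["Salary"], deg=1
    )
    return slope

# the same frame index always yields the same fit
repeatable = bootstrap_slope_for_frame(3) == bootstrap_slope_for_frame(3)
print(repeatable)
```

Whether this is desirable depends on the goal: unseeded frames emphasize that the fit is random, while seeded frames let a viewer step back and forth through the same sequence.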


<h5> Spaghetti Plot </h5>

In [11]:
B = 50

# get `B` bootstrapped fit line charts
# Note: opacity=0.1 makes each line faint so that overlapping lines are easier to see
line_charts = [get_salary_linear_fit_chart(get_one_bootstrap_salary_fit(), opacity=0.1) for _ in range(B)]

alt.layer(*line_charts) + get_salary_points_chart()
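
Layering `B` separate charts works, but the same picture can also be built from one long-form data frame in which each row carries the index of the bootstrap fit it belongs to. The sketch below only builds that data frame, using synthetic data and `np.polyfit` in place of `LinearRegression`; the column name `sample` and the 51-point grid are assumptions made for illustration. The smaller grid keeps the table below Altair's default limit of 5000 rows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical stand-in for salary_df
x = rng.uniform(1, 10, size=30)
y = 25000 + 9000 * x + rng.normal(0, 5000, size=30)
grid = np.linspace(x.min(), x.max(), num=51)

B = 50
frames = []
for b in range(B):
    # one bootstrap resample and its fitted line
    idx = rng.integers(0, len(x), size=len(x))
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    frames.append(pd.DataFrame({
        "YearsExperience": grid,
        "Salary": intercept + slope * grid,
        "sample": b,   # identifies which bootstrap fit a row belongs to
    }))

# one row per (bootstrap fit, grid point)
lines_df = pd.concat(frames, ignore_index=True)
print(lines_df.shape)
```

With Altair, a frame like this can be drawn as a single chart by putting `sample` on the `detail` channel, for example `alt.Chart(lines_df).mark_line(opacity=0.1).encode(x="YearsExperience", y="Salary", detail="sample:N")`.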

<h4> Spaghetti plots for Polynomial Regression </h4>

In [13]:
#prepare dataset
np.random.seed(42)
n = 25

original_x = 5 - 4 * np.random.normal(0, 1, n)
original_y = -2 + 3*original_x - 5*(original_x ** 2) + 7*(original_x ** 3) + np.random.normal(0, 1000, n)

poly_df = pd.DataFrame({'x': original_x, 'y': original_y})

In [14]:
#helper function
def get_poly_points_chart():

    return alt.Chart(poly_df).mark_circle(color="black").encode(
        x='x',
        y='y'
    )

get_poly_points_chart()

In [15]:
#prediction grid
poly_pred_grid = pd.DataFrame({
    "x": np.linspace(poly_df['x'].min(), poly_df['x'].max(), num=101)
})

def get_one_bootstrap_poly_fit():
    #resample the data with replacement (replace=True) to a data frame with 
    #the same number of data points (frac=1.0)
    resampled_df = poly_df.sample(frac=1.0, replace=True)

    #fit model to resampled data
    X = resampled_df[['x']] #[[ ]] subsets so X remains a DataFrame
    y = resampled_df['y']   #y should be an array, so we use [ ]
    
    #x must be transformed into polynomials (e.g. x, x^2, x^3 ... up to the value of `degree`)
    polynomial_features = PolynomialFeatures(degree=2)
    X_poly = polynomial_features.fit_transform(X)
    poly_reg = linear_model.LinearRegression()
    poly_reg.fit(X_poly, y)
    
    return poly_reg


def get_poly_fit_chart(poly_reg, opacity=0.5):
    #use the model to predict y at each x position
    polynomial_features = PolynomialFeatures(degree=2)
    pred_df = pd.DataFrame({
        'x': poly_pred_grid['x'],
        'y': poly_reg.predict(polynomial_features.fit_transform(poly_pred_grid))
    })

    #return an Altair chart showing the fit line
    return alt.Chart(pred_df).mark_line(
        opacity=opacity,
        color='red'
    ).encode(
        x='x',
        y='y'
    )

poly_reg = get_one_bootstrap_poly_fit()
get_poly_fit_chart(poly_reg)
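
For a single input column, `PolynomialFeatures(degree=2)` with its default settings expands `x` into the columns `1`, `x`, and `x^2`, and the linear regression is then fitted on those columns. The following sketch reproduces that expansion with `np.vander` and solves the least-squares problem with `np.linalg.lstsq`; both are stand-ins chosen to keep the example dependency-free, and the three data points are made up.

```python
import numpy as np

x = np.array([2.0, 3.0, -1.0])

# columns are x^0, x^1, x^2
design = np.vander(x, N=3, increasing=True)
print(design)

# a known quadratic: y = 1 + 2x + 0.5x^2
y = 1.0 + 2.0 * x + 0.5 * x ** 2

# least squares on the expanded columns recovers the quadratic's coefficients
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coefs)
```

The model stays linear in its coefficients; only the inputs are non-linear in `x`. This is why the prediction grid has to go through the same expansion before calling `predict`.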

<h4> Spaghetti plot for polynomial regression </h4>

In [16]:
B = 50
poly_charts = [get_poly_fit_chart(get_one_bootstrap_poly_fit(), opacity=0.1) for _ in range(B)]
alt.layer(*poly_charts) + get_poly_points_chart()
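
Note that the synthetic data were generated from a cubic, while the fits above use `degree=2`. As a rough check of how much that matters, the sketch below regenerates the same data and compares the training residual sum of squares for degree 2 and degree 3. It uses `np.polyfit` rather than the scikit-learn pipeline, and `residual_sum_of_squares` is a helper introduced here for illustration.

```python
import numpy as np

# same data-generating code as in the cell above
np.random.seed(42)
n = 25
x = 5 - 4 * np.random.normal(0, 1, n)
y = -2 + 3 * x - 5 * (x ** 2) + 7 * (x ** 3) + np.random.normal(0, 1000, n)

def residual_sum_of_squares(degree):
    # least-squares polynomial fit of the given degree, evaluated on the training data
    coefs = np.polyfit(x, y, deg=degree)
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

rss_quadratic = residual_sum_of_squares(2)
rss_cubic = residual_sum_of_squares(3)
print(rss_quadratic, rss_cubic)
```

The training error of the cubic fit cannot exceed that of the quadratic fit, because the quadratic is a special case of the cubic. A lower training error alone does not show that the higher degree generalizes better.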

<h4> Loading and visualizing diabetes dataset </h4>
We now apply the approach above to the diabetes dataset, using `hdl` as the predictor of `disease_progression`.

In [17]:
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)

diabetes_X = pd.DataFrame(X, columns=["age","sex","bmi","bp", "tc", "ldl", "hdl","tch", "ltg", "glu"])
diabetes_y = pd.DataFrame(y, columns=["disease_progression"])


diabetes_df = pd.concat([diabetes_y, diabetes_X], axis=1)


alt.Chart(diabetes_df).mark_point().encode(
    x="hdl",
    y="disease_progression"
)

In [18]:
def get_diabetes_points_chart():
    
    graph = alt.Chart(diabetes_df).mark_circle(size=60).encode(
        x='hdl',
        y='disease_progression'
    )
    
    return graph
    
    
get_diabetes_points_chart()

In [19]:
diabetes_pred_grid = pd.DataFrame({'hdl': np.linspace(
    diabetes_df['hdl'].min(), 
    diabetes_df['hdl'].max(), 
    num=101
)})

def get_one_bootstrap_diabetes_fit():
    
    resampled_df = diabetes_df.sample(frac=1.0, replace=True)

    X = resampled_df[['hdl']] 
    y = resampled_df['disease_progression']            
    
    polynomial_features = PolynomialFeatures(degree=2)
    X_poly = polynomial_features.fit_transform(X)
    poly_reg = linear_model.LinearRegression()
    poly_reg.fit(X_poly, y)
    
    return poly_reg

In [20]:
def get_diabetes_fit_chart(diabetes_reg, opacity=0.5):
    
    polynomial_features = PolynomialFeatures(degree=2)
    
    pred_df = pd.DataFrame({
        'hdl': diabetes_pred_grid['hdl'],
        'disease_progression': diabetes_reg.predict(polynomial_features.fit_transform(diabetes_pred_grid))
    })
    
    graph = alt.Chart(pred_df).mark_line(opacity=opacity, color='red').encode(
        x="hdl",
        y="disease_progression"
    )
    
    return graph

diabetes_reg = get_one_bootstrap_diabetes_fit()
get_diabetes_fit_chart(diabetes_reg)

In [21]:
B = 50

line_charts = [get_diabetes_fit_chart(get_one_bootstrap_diabetes_fit(), opacity=0.1) for _ in range(B)]
alt.layer(*line_charts) + get_diabetes_points_chart()
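
A spaghetti plot can also be condensed into a pointwise band by taking percentiles across the bootstrap curves at each grid value. This is a minimal sketch of that idea, not something the cells above compute. The data are synthetic stand-ins for the `hdl` and `disease_progression` columns, and `np.polyfit` with `deg=2` plays the role of the quadratic fit.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data standing in for hdl and disease_progression
x = rng.normal(0, 0.05, size=200)
y = 150 - 600 * x + rng.normal(0, 70, size=200)
grid = np.linspace(x.min(), x.max(), num=101)

B = 200
curves = np.empty((B, grid.size))
for b in range(B):
    # resample with replacement and fit a quadratic to the resample
    idx = rng.integers(0, len(x), size=len(x))
    coefs = np.polyfit(x[idx], y[idx], deg=2)
    curves[b] = np.polyval(coefs, grid)

# pointwise 2.5th and 97.5th percentiles across the bootstrap curves
lower, upper = np.percentile(curves, [2.5, 97.5], axis=0)
print(lower.shape, upper.shape)
```

The band is typically widest near the ends of the grid, where few observations constrain the curve. This matches the fanning-out that the overlaid lines show.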

In [22]:

points_chart = get_diabetes_points_chart()
diabetes_reg = get_one_bootstrap_diabetes_fit()
line_chart = get_diabetes_fit_chart(diabetes_reg)
line_chart + points_chart