# Homework 06: Spark and Least Squares Linear Regression

## Introduction

In this assignment, you will implement distributed least squares linear regression using Apache Spark. As with Lab09, we will use a service called Databricks to develop and run code. Databricks simplifies the setup of Apache Spark and the cloud, and it provides limited free cloud computing. Outside the context of this assignment, you can always run Apache Spark code on your own computer or in the cloud without Databricks.


In [1]:
# Run this cell to set up your notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib notebook

from client.api.notebook import Notebook
ok = Notebook('hw6.ok')

Assignment: hw6
OK, version v1.13.9



In [2]:
ok.auth(force=False) # Change False to True if you are getting errors authenticating

Successfully logged in as sungbin.andy.kang@berkeley.edu


## Question 1. Understanding Least Squares Regression


In the first part of this homework, we explore some properties of multiple regression.  In particular, the goals are to

* Interpret the parameters in simple and multiple linear regression
* Understand how the correlation of the explanatory variables can impact the coefficients
* Observe how the correlation between explanatory variables can impact the standard error of the coefficients


We will also introduce the tools in scikit-learn for fitting linear models. Note that these tools are not used in the second part of this assignment, where you implement linear least squares using Spark.

In [3]:
# Run this cell to set up your notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from IPython.display import display, Latex, Markdown

%matplotlib notebook

### Creating the Data

We generate two sets of data, a response vector `Y` and a two-column design matrix `X`. 

* In the first data set, the columns of `X` are correlated with each other as well as being correlated with `Y`.  
* In the second data set, the columns of `X` are uncorrelated with each other and both columns are correlated with `Y`.   

The following code creates the first data set. 

In [8]:
n = 100
p = 2

mean = [0, 0, 0]
cov = [[1, 0.7, 0.7], [0.7, 1, 0.9], [0.7, 0.9, 1]]

np.random.seed(1141)
v, u, Y = np.random.multivariate_normal(mean, cov, n).T
X = np.array([u, v]).T
Y

array([ 0.74899194, -0.17473052,  1.45498779,  0.5680849 , -0.78233717,
        0.50588611, -0.17921599, -0.84240784,  0.42868872,  1.66802024,
       -0.73853543, -0.03793409,  1.27539948,  0.03221326, -0.22099399,
        0.07320065,  0.29806637,  0.4326946 ,  1.04974713,  1.44846328,
        0.11438585,  0.19163337, -1.62283671, -1.20550085, -1.30741061,
        0.56149772,  1.81792486,  1.62522877, -1.03095653, -1.42660051,
       -1.85179084, -0.4107067 , -0.18868151, -1.73025432,  0.93648722,
       -1.07318002, -1.88492722, -0.5821217 ,  0.98069282, -0.15237997,
       -0.39497925,  0.54270843,  1.35407224,  0.37923428, -0.40183448,
        1.22946135, -0.11270924,  0.09040211,  0.88268444, -0.24432088,
       -1.1622403 ,  0.88038541, -1.07097092,  0.89267793, -0.72919906,
       -0.39957604, -0.1861604 , -0.73227711, -1.65127913, -0.22214699,
       -0.97697303, -0.60651912,  0.68714598,  0.65598933, -0.70830423,
       -1.16353735, -0.68316188, -0.94100075,  1.3494668 , -1.08

#### Question1a 
Find the mean and standard deviation of `Y`

In [9]:
mean_Y = np.mean(Y)
sd_Y = np.std(Y)
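A small side note on the standard deviation: `np.std` divides by `n` by default (`ddof=0`), while passing `ddof=1` divides by `n - 1`. The sketch below uses a made-up array (not the homework data) purely to contrast the two conventions.

```python
import numpy as np

# Hypothetical sample, used only to contrast the two conventions.
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_sd = np.std(sample)       # divides by n (ddof=0, the default)
sample_sd = np.std(sample, ddof=1)   # divides by n - 1

print(population_sd, sample_sd)
```

For `n = 100` observations the two values differ only slightly, so either convention gives a similar picture of the spread of `Y`.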

In [10]:
_ = ok.grade('q01a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/mwOQ9R
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



#### Three-dimensional plot

Create a 3D plot of `Y` and `X`. 
Take the following plot for a spin (literally).  Drag across the plot to spin it. Notice that we added the origin in red.

In [12]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], Y)
# Added the origin
ax.scatter([0], [0], [0], marker="o", color='red')
ax.set_xlabel(r"$X_0$ axis")
ax.set_ylabel(r'$X_1$ axis')
ax.set_zlabel('Y axis')

Text(0.5,0,'Y axis')

#### Question1b
Spin the plot to examine the range of $X_0$, $X_1$ and $Y$. State whether each statement is true or false.

1. The ranges of $X_0$ and $X_1$ are both from about -2 to 2
1. Together $X_0$ and $X_1$ nearly fill their respective plane.
1. The response $Y$ appears correlated with both $X_0$ and $X_1$

In [13]:
Q1b_answer = '''

1. True
1. True
1. True

'''

display(Markdown(Q1b_answer))



1. True
1. True
1. True



#### Question1c 
In addition to the 3D plot, examine the three pairwise scatter plots:

* `Y` and the first column of `X`
* `Y` and the second column of `X`
*  the two columns of `X`

Arrange your 3 plots in a 2 by 2 grid (with one empty facet).

Label your axes so that you can tell which plot is which.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], Y)
# Added the origin
ax.scatter([0], [0], [0], marker="o", color='red')
ax.set_xlabel(r"$X_0$ axis")
ax.set_ylabel(r'$X_1$ axis')
ax.set_zlabel('Y axis')

In [19]:
plt.figure(figsize=(8,9))
# Y goes on the vertical axis so that the data match the axis labels
YX0 = plt.subplot(2,2,1)
YX0.scatter(X[:, 0], Y)
YX0.set_xlabel(r"$X_0$ axis")
YX0.set_ylabel('Y axis')
YX1 = plt.subplot(2,2,2)
YX1.scatter(X[:, 1], Y)
YX1.set_xlabel(r"$X_1$ axis")
YX1.set_ylabel("Y axis")
X0X1 = plt.subplot(2,2,3)
X0X1.scatter(X[:, 0], X[:, 1])
X0X1.set_xlabel(r"$X_0$ axis")
X0X1.set_ylabel(r"$X_1$ axis")

Text(0,0.5,'$X_1$ axis')

Note that it is difficult to see how $Y$ depends on both $X_0$ and $X_1$ together in the pairwise plots.  

#### Question1d 
Use `np.corrcoef` to find the matrix of all pairwise correlation
coefficients between $Y$, $X_0$ and $X_1$.

In [24]:
corr = np.corrcoef(X.T, Y)
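To read the result, it helps to know that `np.corrcoef` treats each row as one variable, so passing `X.T` and `Y` stacks three rows in the order $X_0$, $X_1$, $Y$. The sketch below regenerates the first data set with the same seed and covariance as above and picks out the three pairwise coefficients; the variable names `corr_X0_X1`, `corr_X0_Y` and `corr_X1_Y` are illustrative only.

```python
import numpy as np

# Same construction as the first data set above.
np.random.seed(1141)
cov = [[1, 0.7, 0.7], [0.7, 1, 0.9], [0.7, 0.9, 1]]
v, u, Y = np.random.multivariate_normal([0, 0, 0], cov, 100).T
X = np.array([u, v]).T

corr = np.corrcoef(X.T, Y)   # 3 x 3 matrix; rows and columns ordered X_0, X_1, Y

corr_X0_X1 = corr[0, 1]
corr_X0_Y = corr[0, 2]
corr_X1_Y = corr[1, 2]
print(corr_X0_X1, corr_X0_Y, corr_X1_Y)
```

The matrix is symmetric with ones on the diagonal, so these three off-diagonal entries carry all of the information.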

In [25]:
_ = ok.grade('q01d')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/oYVZVk
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Fitting a least squares linear model 

Let's compare the coefficients of the least squares fit for the following models

* $Y$ as a linear function of $X_0$
* $Y$ as a linear function of $X_1$
* $Y$ as a linear function of $X_0$ and $X_1$


#### Question1e
Use `linear_model` in scikit-learn to fit the models and examine the coefficients.
Do not fit an intercept term in any of the three models.

In [45]:
reshaped = X[:, 0].reshape(1, -1)



In [46]:
# Fit Y to the first column of X
model_1 = linear_model.LinearRegression(fit_intercept=False).fit(X[:, 0].reshape(-1, 1), Y)


# Fit Y to the second column of X
model_2 = linear_model.LinearRegression(fit_intercept=False).fit(X[:, 1].reshape(-1, 1), Y)

# Fit Y to X
model_3 = linear_model.LinearRegression(fit_intercept=False).fit(X, Y)
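As a cross-check on what a no-intercept fit computes, here is a minimal NumPy-only sketch. It assumes that `LinearRegression(fit_intercept=False)` returns the ordinary least squares solution, so its coefficients should agree with the values below up to floating point error. The names `beta_x0`, `beta_x1` and `beta_both` are illustrative.

```python
import numpy as np

# Same construction as the first data set above.
np.random.seed(1141)
cov = [[1, 0.7, 0.7], [0.7, 1, 0.9], [0.7, 0.9, 1]]
v, u, Y = np.random.multivariate_normal([0, 0, 0], cov, 100).T
X = np.array([u, v]).T

# One predictor, no intercept: beta = sum(x * y) / sum(x * x)
beta_x0 = X[:, 0].dot(Y) / X[:, 0].dot(X[:, 0])
beta_x1 = X[:, 1].dot(Y) / X[:, 1].dot(X[:, 1])

# Two predictors, no intercept: least squares solution of X beta = Y
beta_both = np.linalg.lstsq(X, Y, rcond=None)[0]

print(beta_x0, beta_x1, beta_both)
```

Comparing `beta_x1` with `beta_both[1]` shows how much the coefficient of $X_1$ moves once $X_0$ enters the model.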


In [47]:
_ = ok.grade('q01e')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/vgl8or
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Co-plots

Compare the coefficients from the simple linear fit to the coefficients in the two-variable fit. Notice that the coefficient for $X_1$ has changed quite a bit. It is $0.71$ in the single variable model and only $0.21$ in the two-variable model.

The coefficients in the two-variable model depend on the presence of the other explanatory variables in the model. 

In this case, $X_0$ is in the model and is very highly correlated with $Y$, so $X_1$ does not explain much additional variation in $Y$. That is, given $X_0$, the relationship between $Y$ and $X_1$ is not as strong as the relationship between $Y$ and $X_1$ without knowledge of $X_0$.

We can see this when we plot $Y$ on $X_1$ for subgroups of the data where $X_0$ is roughly constant. 
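The same effect can be seen at the population level, without any sampling noise. The sketch below takes the covariances used to generate the first data set, reordered as ($X_0$, $X_1$, $Y$), and computes the slopes that the single-variable and two-variable fits are estimating. Because all means are zero and all variances are one, the no-intercept slopes reduce to these expressions; the names `cov_xx`, `cov_xy`, `single` and `joint` are illustrative.

```python
import numpy as np

# Covariances implied by the first data set's `cov`, reordered as (X_0, X_1, Y).
cov_xx = np.array([[1.0, 0.7],
                   [0.7, 1.0]])      # covariance matrix of (X_0, X_1)
cov_xy = np.array([0.9, 0.7])        # Cov(X_0, Y), Cov(X_1, Y)

# Single-variable slopes: Cov(X_j, Y) / Var(X_j)
single = cov_xy / np.diag(cov_xx)

# Two-variable slopes: solve cov_xx . beta = cov_xy
joint = np.linalg.solve(cov_xx, cov_xy)

print(single)   # slopes for X_0 alone and for X_1 alone
print(joint)    # slopes when both variables are in the model
```

The population slope for $X_1$ drops from 0.7 on its own to roughly 0.14 once $X_0$ is included, which is the same pattern as the sample coefficients quoted above; the sample values differ somewhat because they are estimated from 100 draws.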

#### Question1f

Create four scatter plots of the relationship between $Y$ and $X_1$, conditioned on $X_0$. 
To do this, bin $X_0$ into the following categories: -4 to -1, -1 to 0, 0 to 1, and 1 to 4.
For each subset of records, make a scatter plot of $Y$ against $X_1$. In your plot, be sure to

* Keep the $Y$ limits the same on all 4 plots
* Keep the $X_1$ limits the same on all 4 plots
* Provide a title that indicates which subgroup of records is being plotted

In [52]:
X[X[:, 0] > 0][:, 1]
Y[X[:, 0] > 0]

array([ 0.73142381,  1.12174425, -0.15435425,  0.15474015,  1.83685499,
        2.59892446, -0.82341123,  0.6420166 ,  0.84740442, -0.29889544,
        0.73301654,  1.6046932 ,  1.19486298,  0.22103881,  1.00901057,
        1.59399096,  2.47806775,  0.14055014,  1.58288888,  0.44158457,
        0.68653436,  0.66682316,  0.38778183,  0.92230565,  1.27932217,
       -1.47846321,  0.57163009,  1.66466785,  0.17036067,  1.4233232 ,
        1.28490244,  0.97307684,  1.50572531,  2.03277655,  1.02179076,
        0.80996321,  1.37762727,  0.07321266,  0.69112074,  2.44299106,
       -0.05890106,  1.99662869,  1.43527422, -0.5928819 ,  1.0807337 ])

In [55]:
bins = [-4, -1, 0, 1, 4]

# One panel per X_0 bin; shared axes keep the X_1 and Y limits identical
fig, axes = plt.subplots(2, 2, figsize=(8, 9), sharex=True, sharey=True)
for ax, lo, hi in zip(axes.ravel(), bins[:-1], bins[1:]):
    in_bin = (lo < X[:, 0]) & (X[:, 0] < hi)
    ax.scatter(X[in_bin, 1], Y[in_bin])
    ax.set_title("{} < X_0 < {}".format(lo, hi))

#### Question1g
How does the relationship between $Y$ and $X_1$ change from the plot made in Q1c to these plots? State whether each statement is true or false.

1. There is a stronger linear relationship between $Y$ and $X_1$ in the plot in Q1c than in the group of 4 plots
1. Each of the above 4 plots shows a similar strength of relationship between $Y$ and $X_1$
1. The average level of $Y$ is about the same in all 4 plots

In [56]:
Q1g_answer = '''

1. False
1. True
1. False


'''

display(Markdown(Q1g_answer))



1. False
1. True
1. False




#### Question1h

Lastly, we examine the multiple correlation coefficient from the regression.

The multiple $R^2$ (the square of the multiple correlation coefficient) is the ratio of the explained variation in $Y$ (i.e., the variation in $Y$ that has been explained by the linear fit, or the variation in $\hat{Y}$) to the total variation in $Y$. It is similar in spirit to the correlation coefficient from lab, but is useful for the multiple regression case. 

Compute the multiple $R^2$ for the 2-variable regression. To do this, 

* Compute the predicted values, $\hat{Y}$
* Compute the ratio of the explained variation $||\hat{Y} - \bar{Y}||^2$ to the total variation $||Y - \bar{Y}||^2$ using `r2_score`

In [68]:
np.array([mean_Y] * 10)

array([-0.02781062, -0.02781062, -0.02781062, -0.02781062, -0.02781062,
       -0.02781062, -0.02781062, -0.02781062, -0.02781062, -0.02781062])

In [74]:
Y_hat = model_3.predict(X)
multiple_R2 = r2_score(Y, Y_hat)
multiple_R2

0.86225295102790045
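A note on the two ways of writing $R^2$: scikit-learn's `r2_score` is defined as $1 - ||Y - \hat{Y}||^2 / ||Y - \bar{Y}||^2$, while the description above uses the ratio $||\hat{Y} - \bar{Y}||^2 / ||Y - \bar{Y}||^2$. The two coincide when the fit includes an intercept; for a no-intercept fit like this one they can differ slightly. The NumPy-only sketch below computes both on the first data set, assuming the least squares solution matches the fitted `model_3`; the names `r2_residual_form` and `r2_ratio_form` are illustrative.

```python
import numpy as np

# Same construction as the first data set above.
np.random.seed(1141)
cov = [[1, 0.7, 0.7], [0.7, 1, 0.9], [0.7, 0.9, 1]]
v, u, Y = np.random.multivariate_normal([0, 0, 0], cov, 100).T
X = np.array([u, v]).T

# No-intercept least squares fit and its predictions
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
Y_hat = X.dot(beta)

ss_total = np.sum((Y - Y.mean()) ** 2)        # total variation
ss_residual = np.sum((Y - Y_hat) ** 2)        # unexplained variation
ss_explained = np.sum((Y_hat - Y.mean()) ** 2)

r2_residual_form = 1 - ss_residual / ss_total   # the form r2_score uses
r2_ratio_form = ss_explained / ss_total         # the ratio described above

print(r2_residual_form, r2_ratio_form)
```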

In [75]:
_ = ok.grade('q01h')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/58QE2q
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



### Uncorrelated explanatory variables

Now repeat the investigation that you have done above with a different data set. Compare the plots for these data to the plots that you made with the first set of data.

First, run the following code chunk to create the data set.

In [81]:
np.random.seed(21141)
mean = [0, 0, 0]
cov = [[1, 0.7, 0.7], [0.7, 1, 0.], [0.7, 0., 1]]

Y, u, v = np.random.multivariate_normal(mean, cov, n).T
X = np.array([u, v]).T

#### Make the 3D plot of $Y$ and $X$.

In [82]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], Y)
# Added the origin
ax.scatter([0], [0], [0], marker="o", color='red')
ax.set_xlabel(r"$X_0$ axis")
ax.set_ylabel(r'$X_1$ axis')
ax.set_zlabel('Y axis')

Text(0.5,0,'Y axis')

#### Make three pairwise plots

* `Y` and the first column of `X`
* `Y` and the second column of `X`
*  the two columns of `X`

Arrange your 3 plots in a 2 by 2 grid (with one empty facet).

Label your axes so that you can tell which plot is which.

In [83]:
plt.figure(figsize=(8,9))
# Y goes on the vertical axis so that the data match the axis labels
YX0 = plt.subplot(2,2,1)
YX0.scatter(X[:, 0], Y)
YX0.set_xlabel(r"$X_0$ axis")
YX0.set_ylabel('Y axis')
YX1 = plt.subplot(2,2,2)
YX1.scatter(X[:, 1], Y)
YX1.set_xlabel(r"$X_1$ axis")
YX1.set_ylabel("Y axis")
X0X1 = plt.subplot(2,2,3)
X0X1.scatter(X[:, 0], X[:, 1])
X0X1.set_xlabel(r"$X_0$ axis")
X0X1.set_ylabel(r"$X_1$ axis")

Text(0,0.5,'$X_1$ axis')

#### Compute the pairwise correlation coefficients

In [84]:
corrs = np.corrcoef(X.T, Y)

#### Co-plots

Create scatter plots of the relationship between $Y$ and $X_1$, conditioned on $X_0$. Bin $X_0$ into the following categories: -4 to -1, -1 to 0, 0 to 1, and 1 to 4. 

In [85]:
bins = [-4, -1, 0, 1, 4]

# One panel per X_0 bin; shared axes keep the X_1 and Y limits identical
fig, axes = plt.subplots(2, 2, figsize=(8, 9), sharex=True, sharey=True)
for ax, lo, hi in zip(axes.ravel(), bins[:-1], bins[1:]):
    in_bin = (lo < X[:, 0]) & (X[:, 0] < hi)
    ax.scatter(X[in_bin, 1], Y[in_bin])
    ax.set_title("{} < X_0 < {}".format(lo, hi))

#### Fitting the least squares linear models 

As before, fit the following models and compare the coefficients

* $Y$ as a linear function of $X_0$
* $Y$ as a linear function of $X_1$
* $Y$ as a linear function of $X_0$ and $X_1$

Do not fit an intercept term in any of the three models.

In [86]:
# Fit Y to the first column of X
model_1_second = linear_model.LinearRegression(fit_intercept=False).fit(X[:, 0].reshape(-1, 1), Y)


# Fit Y to the second column of X
model_2_second = linear_model.LinearRegression(fit_intercept=False).fit(X[:, 1].reshape(-1, 1), Y)

# Fit Y to X
model_3_second = linear_model.LinearRegression(fit_intercept=False).fit(X, Y)


### Find the multiple correlation coefficient for the 2-variable model

In [87]:
Y_hat_second = model_3_second.predict(X)
# Score the predictions for the second data set, not Y_hat from the first
multiple_R2_second = r2_score(Y, Y_hat_second)

#### Question1i
Now it's time to compare your findings of the two data sets.

Answer the following questions.

1. In the 3D plot, consider the spread of points in the $X_0$, $X_1$ plane. Do the two sets of data fill this plane similarly?
1. Compare the pairwise scatter plots of ($X_0$, $Y$) and ($X_1$, $Y$), and ($X_0$, $X_1$). Two of the pairs should look roughly the same for the different data sets and one should look different. Which one is different across the two data sets? How is it different? 
1. Examine the 4 co-plots for the second set of data. Is the slope of the linear relationship for these plots roughly the same? Is the strength of the relationship roughly the same? How does the linear relationship in these 4 plots compare to the relationship observed between $X_1$ and $Y$ without conditioning on $X_0$?
1. Compare the 4 co-plots for the two sets of data. How are they different? How are they the same?
1. Consider how the single variable and two-variable coefficients change in the regressions for the second data set. How is this change different than the change observed for the first data set?
1. Compare the multiple $R^2$ of the two-variable regression for the two data sets. Do you think this $R^2$ gives any indication of whether the two variable regression would have different coefficients for the explanatory variables than the one variable regression?


In [88]:
Q1i_answer = '''

1. Yes
1. (X0, X1) is different from the other two. There is no apparent correlation.
1. Yes, but a stronger linear relationship is observed since each spread is far smaller than the relationship without conditioning on X0.
1. Overall slope stayed the same, but since X0 and X1 weren't correlated in the second data set, the co-plots were roughly all in the same range.
1. It increases with two-variables unlike the first data set.
1. Yes

'''

display(Markdown(Q1i_answer))



1. Yes
1. (X0, X1) is different from the other two. There is no apparent correlation.
1. Yes, but a stronger linear relationship is observed since each spread is far smaller than the relationship without conditioning on X0.
1. Overall slope stayed the same, but since X0 and X1 weren't correlated in the second data set, the co-plots were roughly all in the same range.
1. It increases with two-variables unlike the first data set.
1. Yes



# Question 2

In this question we will use Apache Spark to compute the statistics needed to solve the ordinary least squares linear regression problem.

**Note: Apache Spark already provides estimators for a wide range of models, including linear regression.  However, we will be doing this by hand (for practice).**
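To make "the statistics needed" concrete: the least squares solution depends on the data only through $X^T X$ and $X^T y$, and both are sums of per-row terms, so they can be accumulated with a map step followed by a reduce step. The sketch below is a plain Python stand-in for that pattern, not Spark code. The data, `row_statistics` and `add_statistics` are hypothetical names made up for illustration; in Spark, functions of this shape would typically be passed to an RDD's `map` and `reduce`.

```python
from functools import reduce

import numpy as np

# Hypothetical synthetic data standing in for a distributed data set.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X.dot(true_theta) + 0.01 * rng.normal(size=200)

rows = list(zip(X, y))   # one (features, label) record per row


def row_statistics(record):
    """Map step: this row's contribution to X^T X and X^T y."""
    x, label = record
    return np.outer(x, x), x * label


def add_statistics(a, b):
    """Reduce step: add two partial (X^T X, X^T y) pairs."""
    return a[0] + b[0], a[1] + b[1]


XtX, Xty = reduce(add_statistics, map(row_statistics, rows))

# Solve the normal equations (X^T X) theta = X^T y
theta = np.linalg.solve(XtX, Xty)
print(theta)
```

Only the small $p \times p$ matrix and the length-$p$ vector need to be brought back to a single machine, regardless of how many rows the data has.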


## Setup

Step 1 is to create a Databricks account.  Go [here](https://accounts.cloud.databricks.com/registration.html#signup/community) to sign up.  Use your @berkeley email address. If you have already signed up before (in lab), go to [this](https://community.cloud.databricks.com/) page to log in directly.

After you sign up, sign in to your Databricks account, then click Workspace -> Users -> `<your-username>@berkeley.edu`.  Click on the arrow pointing down beside your email address and select **`Import`**.  Import the `hw06.dbc` file from the folder containing this notebook.

![Importing](https://github.com/DS-100/sp17-materials/blob/master/sp17/hw/hw7/importing_notebooks.png?raw=true)

This will create a Databricks notebook file.  Open it.

The rest of this assignment is primarily conducted in the Databricks notebook.  However, this notebook contains the OK tests you can use to check your work, and it contains the invocations to submit your assignment when you're done.  Follow the instructions in the Databricks notebook to download your results in a form that the tests here will understand.

**Issue:**
1. Databricks Cloud runs Python 2.7, so you won't be able to use the `@` operator (as in `X.T @ Y`).  Instead, you can use `X.T.dot(Y)`.
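For reference, a minimal sketch of the `.dot` spelling on a tiny made-up matrix (the arrays here are illustrative, not the homework data):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
Y = np.array([1.0, 0.0, -1.0])

XtY = X.T.dot(Y)            # method form; works on Python 2.7 and Python 3
XtX = X.T.dot(X)
XtY_func = np.dot(X.T, Y)   # function form of the same product

print(XtY)
print(XtX)
```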

## Question 2a

Complete question 2a and paste your answer below:

In [89]:
size_of_diamonds = 3192560

In [90]:
_ = ok.grade('q02a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/KO1vEM
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 2b
Complete question 2b and paste your answer below:

In [91]:
number_of_rows = 53940

In [92]:
_ = ok.grade('q02b')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/El8m4l
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 2c

The size of the training data after constructing a 90/10 train test split:

In [99]:
number_of_rows_in_training = 48556
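The training size above is close to, but not exactly, 90% of the 53940 rows. That is what one would expect if the split assigns each row to the training set independently with probability 0.9, which is an assumption about how the split in the Databricks notebook works. A NumPy analogue of such a split, with a hypothetical seed:

```python
import numpy as np

rng = np.random.RandomState(42)   # hypothetical seed, for reproducibility only
n_rows = 53940

# Each row independently lands in the training set with probability 0.9
is_train = rng.uniform(size=n_rows) < 0.9

n_train = int(is_train.sum())
n_test = n_rows - n_train
print(n_train, n_test, n_train / float(n_rows))
```

The exact count depends on the seed, so the number reported by a different split will generally not match this one.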

In [100]:
_ = ok.grade('q02c')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/g5Z9xY
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 2d

The average price of diamonds in the training data:

In [105]:
avg_price_of_diamonds_in_training = 3921

In [106]:
_ = ok.grade('q02d')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/wjVqAz
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 3a
The value of $\theta$

In [107]:
theta = [7865.41870378,   -147.65515738,   -101.57023038,  12608.52774277]

In [108]:
_ = ok.grade('q03a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/r083qW
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 3b
It seems like the weight of `carat` is way bigger than the other two. Could we say it is the dominating feature?

In [110]:
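# Not from the size of the weight alone: the features are on different scales,
# so the raw coefficients are not directly comparable.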
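One caveat when comparing the weights in `theta`: the size of a least squares coefficient depends on the units of its feature. The sketch below uses made-up data (not the diamonds data) and a hypothetical helper `slope_no_intercept` to show that rescaling a feature rescales its coefficient by the inverse factor while the fit itself is unchanged.

```python
import numpy as np

# Hypothetical data: y depends on a single feature x.
rng = np.random.RandomState(0)
x = rng.normal(size=500)
y = 3.0 * x + 0.1 * rng.normal(size=500)


def slope_no_intercept(feature, target):
    """Least squares slope for a single feature with no intercept."""
    return feature.dot(target) / feature.dot(feature)


slope_original = slope_no_intercept(x, y)
# Same feature expressed in units 100 times smaller (values 100 times larger)
slope_rescaled = slope_no_intercept(100.0 * x, y)

print(slope_original, slope_rescaled)
```

Standardizing the features before fitting is one common way to put the coefficients on a comparable footing.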

## Question 3c
Compute the RMSE for $\theta$ estimated using carat, depth, table.

In [115]:
rmse = 1522.3
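For reference, the root mean squared error is the square root of the average squared difference between the observed and predicted values. A minimal NumPy sketch with made-up numbers; the function name `rmse_of` is illustrative and is not taken from the Databricks notebook:

```python
import numpy as np


def rmse_of(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


# Errors are 1, 0 and -2, so the mean squared error is 5 / 3
example_rmse = rmse_of([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(example_rmse)
```

Because the mean squared error is a sum of per-row terms divided by a count, it can be accumulated across partitions in the same map and reduce style as the other statistics in this assignment.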

In [116]:
_ = ok.grade('q03c')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/pg800p
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 3d
Compute the improved RMSE using more features.

In [117]:
rmse_improved = 1499

In [118]:
_ = ok.grade('q03d')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/g5Zgg3
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Question 3e
Compute the improved test RMSE using additional one-hot features.

In [119]:
test_rmse = 1345.72691757
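As a reminder of what one-hot features look like, the sketch below encodes a small categorical column with NumPy. The column values are hypothetical examples, and the encoding used in the Databricks notebook may be built differently.

```python
import numpy as np

# Hypothetical categorical column
cut = np.array(["Ideal", "Premium", "Good", "Ideal", "Good"])

categories = np.unique(cut)   # sorted unique labels: Good, Ideal, Premium

# One column per category; entry is 1.0 where the row has that category
one_hot = (cut[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
```

Each row has exactly one 1, so the one-hot columns sum to a column of ones. If the model also includes an intercept, dropping one of the columns avoids exact collinearity.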

In [120]:
_ = ok.grade('q03e')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

Saving notebook... Saved 'hw6.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/hw6/backups/Y6WGk0
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



# Submitting your assignment
Congratulations, you're done with this homework!

Run the next cell to run all the tests at once.

In [121]:
_ = ok.grade_all()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------

Now, run the cell below to submit your assignment to OkPy. The autograder should email you shortly with your autograded score. The autograder will only run once every 30 minutes.

**If you're failing tests on the autograder but pass them locally**, you should simulate the autograder by doing the following:

1. In the top menu, click Kernel -> Restart and Run all.
2. Run the cell above to run each OkPy test.

**You must make sure that you pass all the tests when running steps 1 and 2 in order.** If you are still failing autograder tests, you should double check your results.

In [None]:
_ = ok.submit()