# In-sample RMSE for linear regression on diamonds

As you saw in the video, the course includes the diamonds dataset, a classic dataset from the ggplot2 package. It contains the physical attributes of diamonds along with the price each one sold for. One interesting modeling challenge is predicting a diamond's price from its attributes with a model such as linear regression.

Recall that to fit a linear regression, you use the lm() function in the following format:

mod <- lm(y ~ x, my_data)

To make predictions using mod on the original data, you call the predict() function:

pred <- predict(mod, my_data)
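As a minimal sketch of this fit-then-predict pattern, the example below uses R's built-in mtcars data rather than diamonds (an assumption made so that it runs without extra packages). The names mod and pred mirror the template above.

```r
# Illustrative only: fit a linear regression of fuel economy (mpg)
# on weight (wt) using the built-in mtcars data frame
mod <- lm(mpg ~ wt, data = mtcars)

# Predict on the same data the model was fit to (in-sample predictions)
pred <- predict(mod, mtcars)

# There is one prediction per row of the original data
length(pred) == nrow(mtcars)
```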

Instructions
Fit a linear model on the diamonds dataset predicting price using all other variables as predictors (i.e. price ~ .). Save the result to model.
Make predictions using model on the full original dataset and save the result to p.
Compute errors using the formula $errors = predicted - actual$. Save the result to error.
Compute RMSE using the formula you learned in the video and print it to the console.
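
For reference, the usual definition of root mean squared error can be written with the same error convention (predicted minus actual). The symbols $\hat{y}_i$ (predicted value), $y_i$ (actual value), and $n$ (number of rows) are notation assumed here, not taken from the course video:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}
```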

In [1]:
library(ggplot2)
head(diamonds)

carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75
0.24,Very Good,J,VVS2,62.8,57,336,3.94,3.96,2.48


In [2]:
# Fit lm model: model
model <- lm(price~., diamonds)

# Predict on full data: p
p <- predict(model, diamonds, type="response")

# Compute errors: error
error <- p - diamonds$price

# Calculate RMSE
print(sqrt(mean(error^2)))

[1] 1129.843


In [3]:
rm(model)

In [9]:
# Fit lm model: model
model <- lm(price~., diamonds)

# Predict on full data with the price column (column 7) dropped: p
p <- predict(model, diamonds[, -7], type = "response")

# Compute errors: error
error <- p - diamonds$price

# Calculate RMSE
print(sqrt(mean(error^2)))

[1] 1129.843
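
The RMSE is unchanged because predict() on an lm model needs only the predictor columns of the new data, so removing the response column makes no difference. A minimal sketch of that check, using the built-in mtcars data as a stand-in for diamonds (an assumption, so that no extra packages are needed):

```r
# Illustrative only: fit a model that uses every other column as a predictor
mod <- lm(mpg ~ ., data = mtcars)

# Predictions with the response column (mpg) present in the data...
p_with <- predict(mod, mtcars)

# ...and with the response column removed by name
p_without <- predict(mod, mtcars[, names(mtcars) != "mpg"])

# The two sets of predictions agree
all.equal(p_with, p_without)
```

Dropping the column by name, as here, is less fragile than dropping it by position if the column order ever changes.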


In [11]:
head(subset(diamonds, select = c(-price)))

carat,cut,color,clarity,depth,table,x,y,z
0.23,Ideal,E,SI2,61.5,55,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58,4.34,4.35,2.75
0.24,Very Good,J,VVS2,62.8,57,3.94,3.96,2.48


# Randomly order the data frame

One way to take a train/test split of a dataset is to order the dataset randomly, then divide it into the two sets. This ensures that the training set and test set are both random samples and that any biases in the ordering of the dataset (e.g. if it had originally been ordered by price or size) are not carried into the samples you take for training and testing your models. You can think of this like shuffling a brand new deck of playing cards before dealing hands.

First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:

set.seed(42)

Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset.

rows <- sample(nrow(diamonds))

Finally, you can use this random vector to reorder the diamonds dataset:

diamonds <- diamonds[rows, ]
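
The three steps above can be sketched on a small built-in data frame to see what they do. This example uses mtcars instead of diamonds, and the names rows_a, rows_b, and shuffled are hypothetical, chosen only for illustration:

```r
# Illustrative only: shuffle the built-in mtcars data frame
set.seed(42)
rows_a <- sample(nrow(mtcars))   # a random permutation of 1..nrow(mtcars)

# Resetting the same seed reproduces the same permutation
set.seed(42)
rows_b <- sample(nrow(mtcars))
identical(rows_a, rows_b)

# Reordering keeps every row exactly once, just in a new order
shuffled <- mtcars[rows_a, ]
setequal(rownames(shuffled), rownames(mtcars))
```

Calling sample() with a single number n returns a permutation of 1 to n, which is why indexing with the result reorders the rows without dropping or duplicating any.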

Instructions
Set the random seed to 42.
Make a vector of row indices called rows.
Randomly reorder the diamonds data frame.

In [None]:
# Set seed
set.seed(42)

# Shuffle row indices: rows
rows <- sample(nrow(diamonds))

# Randomly order data
diamonds <- diamonds[rows,]