## Building Regression Models with the UCI Automobile Dataset

This week, we'll build simple regression models on the UCI Automobile dataset.

### Step 0: Defining the Question

We have seen some evidence of a linear relationship between engine-size and price, and between horsepower and price. Based on this, we hypothesize that a linear model predicting price from these features will have non-zero weights (coefficients).

### Step 1: Environment Setup

In [1]:
# Run this cell to install the required packages:
# You might have to replace '!' with '%', depending on your environment.
# You might also need to restart your kernel after installation.
!pip install numpy pandas matplotlib scikit-learn --quiet

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

### Step 2: Read and Clean Data

In [None]:
# load the dataset from csv file


In [None]:
# Display the columns of the data


Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

In [None]:
# Drop all rows with missing values
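
For reference, here is a minimal sketch of loading and cleaning on a tiny in-memory table. The rows are invented stand-ins, not the real data, and the sketch assumes missing entries are marked with `?`, as in the raw UCI file. If your CSV already uses empty cells for missing values, `na_values` may not be needed.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV; these rows are invented for illustration.
csv_text = """make,engine-size,horsepower,price
toyota,92,62,5348
bmw,164,121,?
audi,136,?,15250
honda,110,86,8845
"""

# Assumption: "?" marks a missing value. Declaring it here lets pandas parse
# the numeric columns as numbers instead of strings.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(df.columns)

# Drop every row that contains at least one missing value.
df_clean = df.dropna().reset_index(drop=True)
print(len(df), "rows before,", len(df_clean), "rows after")
```

With the real file, you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper.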


### Step 2: Linear Regression

First, we'll fit a linear regression that predicts price from engine-size.

In [66]:
# First let's extract the features we want to use for our regression model.


In [67]:
# Now let's build our target variable and call it y


In [68]:
# Now split the data into training and testing sets with a test size of 20%. Use a random state of 42 for reproducibility.
# This can be done using train_test_split from sklearn.model_selection


In [69]:
# Create a linear regression model using scikit-learn's LinearRegression class


In [70]:
# Now to fit the model to our training data using the fit method


In [71]:
# Now we can make predictions on our test data using the predict method


In [72]:
# Let's evaluate the model. Use mean_squared_error and r2_score from sklearn.metrics


In [73]:
# Get the coefficients of the model using the coef_ attribute

# Get the intercept of the model using the intercept_ attribute
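
The cells above can be sketched end to end as follows. This is a hedged illustration on synthetic data, not a solution for the real dataset: the slope, intercept, and noise level used to generate `price` are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed values): price grows roughly linearly with engine size.
rng = np.random.default_rng(0)
engine_size = rng.uniform(60, 330, size=200)
price = 150.0 * engine_size - 5000.0 + rng.normal(0, 2000, size=200)

X = engine_size.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix
y = price

# Hold out 20% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.1f}  R^2: {r2:.3f}")
print("slope:", model.coef_[0], "intercept:", model.intercept_)
```

With a DataFrame, `df[['engine-size']]` (double brackets) gives the 2D feature matrix directly, so no `reshape` is needed.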


In [74]:
# Now plot a line using the slope and intercept, and the original data points


### Step 3: Multivariate Linear Regression

Let's see if we can improve prediction accuracy with multiple features.

In [75]:
# Let's now try a multivariate linear regression with both engine-size and horsepower
# Define X and y below


In [76]:
# Now split the data into training and testing sets with a test size of 20% (same as above).


In [77]:
# Create another model


In [78]:
# Train the multivariate model


In [79]:
# Now let's evaluate the multivariate model


In [80]:
# The coefficients hint at how much each feature contributes, but they are only
# directly comparable when the features are on similar scales
# Print the coefficients of the multivariate model below
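
A minimal sketch of the multivariate steps, again on synthetic data. The generating coefficients and the correlation between the two features are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed values): horsepower is correlated with engine size.
rng = np.random.default_rng(0)
n = 200
engine_size = rng.uniform(60, 330, size=n)
horsepower = 0.7 * engine_size + rng.normal(0, 15, size=n)
price = 100.0 * engine_size + 60.0 * horsepower - 4000.0 + rng.normal(0, 2000, size=n)

df = pd.DataFrame(
    {"engine-size": engine_size, "horsepower": horsepower, "price": price}
)

features = ["engine-size", "horsepower"]
X = df[features]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

multi_model = LinearRegression().fit(X_train, y_train)
y_pred = multi_model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))

# Pair each coefficient with its feature name for readability.
for name, coef in zip(features, multi_model.coef_):
    print(f"{name}: {coef:.2f}")
```

Because the two features are correlated, the individual coefficients can vary noticeably from split to split even when the overall fit is stable.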


#### Does our model perform better with multiple features? What about even more features?

#### On your own: try the above multivariate but with more features.

In [81]:
# Add even more features to see if our model performs better
features = []

You may find that test-set performance (for example, R²) does not always improve with more features. Sometimes it gets worse!
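
One way to see why is to add columns of pure noise as extra features. In the sketch below (synthetic data, assumed values throughout), the training R² cannot go down when features are added to an ordinary least-squares fit, while the test R² is free to move in either direction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed values): one informative feature plus 20 noise columns.
rng = np.random.default_rng(1)
n = 60
x = rng.uniform(60, 330, size=(n, 1))
y = 150.0 * x[:, 0] + rng.normal(0, 4000, size=n)
noise_cols = rng.normal(size=(n, 20))

X_small = x                           # informative feature only
X_big = np.hstack([x, noise_cols])    # informative feature + irrelevant ones

# Splitting all arrays in one call keeps the same rows in train and test for both.
Xs_train, Xs_test, Xb_train, Xb_test, y_train, y_test = train_test_split(
    X_small, X_big, y, test_size=0.2, random_state=42
)

small = LinearRegression().fit(Xs_train, y_train)
big = LinearRegression().fit(Xb_train, y_train)

# LinearRegression.score returns R^2 on the given data.
train_small, test_small = small.score(Xs_train, y_train), small.score(Xs_test, y_test)
train_big, test_big = big.score(Xb_train, y_train), big.score(Xb_test, y_test)

print(f" 1 feature : train R^2 = {train_small:.3f}, test R^2 = {test_small:.3f}")
print(f"21 features: train R^2 = {train_big:.3f}, test R^2 = {test_big:.3f}")
```

Comparing the train and test columns side by side is a quick check for overfitting when you add features to your own model.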

### Step 4: Nonlinear Regression

#### On your own: First build a linear model that predicts highway-mpg from engine-size.

In [82]:
# Build model below and print the evaluation metrics.


#### Then transform your engine-size feature to be 1/engine-size, and use that as a feature. 

In [83]:
# Now let's try a nonlinear regression model for highway MPG using 1/X as the feature
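
A hedged sketch of the comparison on synthetic data. Here `highway_mpg` is generated to follow a 1/x curve on purpose, so this only illustrates the mechanics of the transform; whether the real data behaves the same way is for you to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed values): mpg falls off like 1 / engine size.
rng = np.random.default_rng(0)
engine_size = rng.uniform(60, 330, size=200)
highway_mpg = 10.0 + 2500.0 / engine_size + rng.normal(0, 1.0, size=200)

X_raw = engine_size.reshape(-1, 1)
X_inv = (1.0 / engine_size).reshape(-1, 1)  # transformed feature: 1 / engine-size

# Split both feature versions together so they share the same train/test rows.
Xr_train, Xr_test, Xi_train, Xi_test, y_train, y_test = train_test_split(
    X_raw, X_inv, highway_mpg, test_size=0.2, random_state=42
)

lin_model = LinearRegression().fit(Xr_train, y_train)
inv_model = LinearRegression().fit(Xi_train, y_train)

r2_lin = r2_score(y_test, lin_model.predict(Xr_test))
r2_inv = r2_score(y_test, inv_model.predict(Xi_test))
print(f"R^2 with engine-size:   {r2_lin:.3f}")
print(f"R^2 with 1/engine-size: {r2_inv:.3f}")
```

Note that the model is still linear in its parameters. Only the input feature has been transformed.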


#### Which works better, the original feature or transformed one?

### Step 5: Polynomial Regression

In [84]:
# Now try polynomial regression for highway MPG using PolynomialFeatures degree=2
poly = PolynomialFeatures(degree=2)  # You can change the degree for more complexity


In [85]:
# Get R^2 values for polynomial degrees [1, 2, 5, 10, 20, 30]
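
A possible shape for the degree sweep, sketched on synthetic data with assumed values. The feature is rescaled to roughly [0, 1] before expansion, which is a choice made for this sketch so that high powers stay numerically manageable. Raising raw engine sizes to the 30th power produces enormous values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumed values), not the real dataset.
rng = np.random.default_rng(0)
engine_size = rng.uniform(60, 330, size=150)
highway_mpg = 10.0 + 2500.0 / engine_size + rng.normal(0, 1.5, size=150)

# Rescale using the known range of the synthetic feature (60 to 330).
x = ((engine_size - 60.0) / 270.0).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    x, highway_mpg, test_size=0.2, random_state=42
)

results = {}
for degree in [1, 2, 5, 10, 20, 30]:
    poly = PolynomialFeatures(degree=degree)
    Xp_train = poly.fit_transform(X_train)  # fit the expansion on training data only
    Xp_test = poly.transform(X_test)

    model = LinearRegression().fit(Xp_train, y_train)
    train_r2 = r2_score(y_train, model.predict(Xp_train))
    test_r2 = r2_score(y_test, model.predict(Xp_test))
    results[degree] = (train_r2, test_r2)
    print(f"degree {degree:>2}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Recording both train and test R² for each degree makes it easier to spot the point where extra flexibility stops helping on held-out data.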


#### What do you notice about the regression evaluation metrics?