## Building a linear model to predict a car's mileage (mpg) from its other attributes

### Data Description

The dataset has 6 variables: the car's mileage (mpg) and attributes such as horsepower and weight. Missing values in the data are marked by a series of question marks.

A detailed description of the variables is given below.

1. mpg: miles per gallon
2. cylinders: number of cylinders
3. displacement: engine displacement in cubic inches
4. horsepower: horsepower of the car
5. weight: weight of the car in pounds
6. acceleration: time taken, in seconds, to accelerate from 0 to 60 mph


### Importing Libraries

In [None]:
import pandas as pd
import numpy as np

# for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns

# For randomized data splitting
from sklearn.model_selection import train_test_split

# To build the linear regression model
from sklearn.linear_model import LinearRegression

# To check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

### Load the data

In [None]:
df = pd.read_csv('/content/AUTOMPG.csv')
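
Because the data description says missing values are marked by question marks, it may be worth telling pandas to treat that marker as missing at load time. The sketch below uses a small made-up CSV rather than the real file, and it assumes the marker is a single `?`; if the file uses a longer run of question marks, that exact string would need to go into `na_values` instead.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV; the second row is illustrative only
sample_csv = io.StringIO(
    "mpg,cylinders,displacement,horsepower,weight,acceleration\n"
    "18.0,8,307.0,130,3504,12.0\n"
    "25.0,4,98.0,?,2046,19.0\n"
)

# Treat "?" as missing so horsepower is parsed as a numeric column
demo_df = pd.read_csv(sample_csv, na_values="?")

# Count missing values per column
missing_counts = demo_df.isnull().sum()
print(missing_counts)
```

Any rows flagged this way would still need to be imputed or dropped before fitting, since `LinearRegression` does not accept missing values.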

### Displaying the first few rows of the dataset

In [None]:
df.head()

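| | mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 |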


### Checking the shape of the dataset

In [None]:
df.shape

(398, 6)

* There are 398 rows and 6 columns in the data

### Data Preparation for modeling

In [None]:
# independent variables
X = df.drop(["mpg"], axis=1)
# dependent variable
y = df["mpg"]

**We will now split X and y into train and test sets in a 70:30 ratio.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
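
To see how many rows a 70:30 split produces, the sketch below applies the same `train_test_split` call to placeholder arrays. The arrays `X_demo` and `y_demo` are hypothetical stand-ins with the same dimensions as the data (398 rows, 5 predictors), not the real values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows (398) and predictors (5) as the dataset
X_demo = np.zeros((398, 5))
y_demo = np.zeros(398)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1
)

# Row counts of the two partitions
n_train, n_test = X_tr.shape[0], X_te.shape[0]
print(n_train, n_test)
```

With 398 rows, scikit-learn rounds the test partition up, so this should give 278 training rows and 120 test rows.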

## Model Building - Linear Regression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()
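
Once a linear model is fitted, its coefficients and intercept can be inspected through the `coef_` and `intercept_` attributes. The sketch below shows one way to do this on synthetic, noise-free data; the columns `a` and `b` and the names `demo_lr` and `coef_table` are hypothetical and unrelated to the car dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 2*a - 3*b + 5 (hypothetical columns "a" and "b")
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 2)), columns=["a", "b"])
y_demo = 2 * X_demo["a"] - 3 * X_demo["b"] + 5

demo_lr = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its column name, then append the intercept
coef_table = pd.Series(demo_lr.coef_, index=X_demo.columns)
coef_table["Intercept"] = demo_lr.intercept_
print(coef_table)
```

The same pattern should work for `lr` and `X_train`, where each coefficient is the expected change in mpg for a one-unit change in that attribute with the others held fixed.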

### Model Performance Check

**Let's check the performance of the model using different metrics.**

* We will be using metric functions defined in sklearn for RMSE, MAE, and $R^2$.
* We will define a function to calculate MAPE and adjusted $R^2$.    
* We will create a function which will print out all the above metrics in one go.
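
For reference, these metrics are commonly defined as follows, where $y_i$ is the actual value, $\hat{y}_i$ the prediction, $n$ the number of observations, and $k$ the number of predictors. The MAPE form shown assumes all actual values are positive, which holds for mpg.

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\text{Adj. } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}, \qquad \text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{y_i}$$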

In [None]:
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf
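
As a sanity check on the MAPE formula used in `mape_score`, the sketch below compares it with scikit-learn's `mean_absolute_percentage_error` (available in scikit-learn 0.24 and later) on a few made-up values. The numbers are illustrative only and do not come from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Illustrative values only, not taken from the dataset
targets = np.array([20.0, 25.0, 40.0])
predictions = np.array([18.0, 30.0, 40.0])

# Same formula as mape_score: mean of |error| / target, as a percentage
manual_mape = np.mean(np.abs(targets - predictions) / targets) * 100

# scikit-learn returns a fraction, so scale by 100 to compare
sklearn_mape = mean_absolute_percentage_error(targets, predictions) * 100

print(manual_mape, sklearn_mape)
```

Both values should agree whenever the targets are positive, as they are for mpg.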

In [None]:
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
model_train_perf = model_performance_regression(lr, X_train, y_train)
model_train_perf

Training Performance



| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 4.402643 | 3.344707 | 0.687252 | 0.681503 | 14.635446 |


In [None]:
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
model_test_perf = model_performance_regression(lr, X_test, y_test)
model_test_perf

Test Performance



| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 3.834676 | 2.988926 | 0.748495 | 0.737464 | 12.914085 |
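
The reported adjusted $R^2$ values can be reproduced by hand from the reported $R^2$ values. The sketch below assumes the 70:30 split of 398 rows yields 278 training and 120 test rows, and that the model uses 5 predictors; the helper `adjusted_r2` is a hypothetical name that applies the same formula as `adj_r2_score` to an already computed $R^2$.

```python
# Reported R-squared values for the train and test sets
r2_train, r2_test = 0.687252, 0.748495

# Assumed sizes: a 70:30 split of 398 rows (278 train, 120 test) and 5 predictors
n_train, n_test, k = 278, 120, 5


def adjusted_r2(r2, n, k):
    # Same formula as adj_r2_score, applied to an already computed R-squared
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


adj_train = adjusted_r2(r2_train, n_train, k)
adj_test = adjusted_r2(r2_test, n_test, k)
print(round(adj_train, 6), round(adj_test, 6))
```

The penalty is larger on the test set because it has fewer rows for the same number of predictors.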


**Observations**

- MAE indicates that the current model predicts mpg with a mean absolute error of ~2.99 on the test data.
- The RMSE values are higher than the MAE values because squaring the residuals penalizes larger prediction errors more heavily.
- MAPE of ~12.91 on the test data indicates that, on average, the model's predictions are within ~12.91% of the actual mpg.
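
The gap between RMSE and MAE can be illustrated with a small made-up set of residuals in which a single large error dominates. The values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical residuals: three small errors and one large one
residuals = np.array([1.0, -1.0, 1.0, 9.0])

mae = np.mean(np.abs(residuals))         # (1 + 1 + 1 + 9) / 4 = 3.0
rmse = np.sqrt(np.mean(residuals ** 2))  # sqrt((1 + 1 + 1 + 81) / 4) = sqrt(21)

print(mae, rmse)
```

Here the one large residual contributes 81 of the 84 squared units, so RMSE rises well above MAE; the two metrics coincide only when all residuals have the same magnitude.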