# Phase 1: Data Gathering

## a. Import required libraries

In [None]:
# These are the latest versions of said modules upon the creation of this project

# Uncomment the line below to install dependencies for the required libraries
#!pip install -r requirements.txt

# cross check with the cloud based version on the course

**Pandas** is a widely-used library in Python for data analysis and manipulation. It provides a flexible data structure called a DataFrame, which is similar to a spreadsheet or a database table. With Pandas, you can load, clean, and transform data, compute descriptive statistics, visualize patterns, select specific subsets of data based on conditions, group data based on variables, and merge data from multiple sources. This library is an essential tool for data scientists and can greatly simplify the process of analyzing large datasets like the 1985 Auto Imports dataset.

In [None]:
# pandas 1.5.3
import pandas as pd

**NumPy** is a fundamental library in Python for scientific computing and data analysis. It provides support for arrays and matrices of numerical data, and allows for the efficient execution of mathematical operations on these arrays. NumPy can be used for a wide range of applications in data science, including numerical computation, array manipulation, and statistical analysis. 

In the context of the 1985 Auto Imports dataset, NumPy can be used to perform numerical operations on the dataset, such as computing the mean, median, or standard deviation of the car prices, or performing more complex operations like matrix multiplication. The efficiency and versatility of NumPy make it a valuable tool for data scientists working with large datasets.
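
As a minimal sketch of the summary statistics mentioned above, the snippet below computes the mean, median, and sample standard deviation of a small array. The `prices` values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical car prices (illustrative values, not from the dataset)
prices = np.array([10000.0, 12000.0, 15000.0, 18000.0, 20000.0])

mean_price = prices.mean()        # arithmetic mean
median_price = np.median(prices)  # middle value of the sorted array
std_price = prices.std(ddof=1)    # sample standard deviation (divides by n - 1)

print("mean:", mean_price, "median:", median_price, "std:", std_price)
```

Passing `ddof=1` gives the sample standard deviation, which matches the default used by pandas; NumPy's own default is `ddof=0`.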

In [None]:
# numpy 1.24.1
import numpy as np

**Matplotlib** is a plotting library in Python that provides an interface for creating a variety of visualizations including line plots, bar charts, scatter plots, histograms, and more. It can be used to explore relationships between variables in a dataset, such as the 1985 Auto Imports dataset.

**Pylab** and **Pyplot** are modules within Matplotlib that provide a convenient interface for creating plots and visualizations. Pylab provides a set of convenience functions that allow for simple and quick creation of plots, while Pyplot provides an interface for creating plots with a more explicit structure. 

In the context of the 1985 Auto Imports dataset, Pylab and Pyplot could be used to create visualizations such as histograms to visualize the distribution of car prices, or scatter plots to visualize the relationship between car weight and fuel efficiency. These modules make it easy to create high-quality visualizations, allowing data scientists to gain insights into the data and explore relationships between variables.

In [None]:
# matplotlib 3.6.3
import matplotlib.pylab as plt_lab
import matplotlib.pyplot as plt_plot

**Seaborn** is a library in Python for data visualization that is built on top of Matplotlib. It provides a high-level interface for creating advanced visualizations, including statistical plots, such as density plots, box plots, and violin plots, and relational plots, such as scatter plots and line plots. Seaborn also provides convenient functions for customizing the appearance of the visualizations, such as color palette choices, style themes, and annotating the plots with text and labels. 

In terms of the 1985 Auto Imports dataset, Seaborn could be used to create visualizations that highlight the distribution of car prices, engine sizes, or other variables, or to create scatter plots to visualize relationships between variables. The advanced features and ease of use of Seaborn make it a valuable tool for data scientists to gain insights and understanding about the 1985 Auto Imports dataset.

In [None]:
# seaborn 0.12.2
import seaborn as sns

**SciPy** is a library in Python for scientific computing and data analysis. Its stats module provides a wide range of statistical functions and tools for data analysis. In the context of the 1985 Auto Imports dataset, the SciPy stats library can be used to analyze the data in various ways such as:
- Calculating summary statistics like mean, median and standard deviation of car prices or other variables
- Testing hypotheses such as if there's a significant difference in car prices between different countries
- Fitting statistical models, such as linear regression, to understand the relationship between variables and make predictions
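
The three uses listed above could look roughly like the sketch below. All arrays are hypothetical toy values, and the two groups merely stand in for cars from two different countries.

```python
import numpy as np
from scipy import stats

# Hypothetical engine sizes and prices (illustrative values only)
engine_size = np.array([90, 110, 130, 150, 180, 200], dtype=float)
price = np.array([7000, 9500, 12000, 15500, 21000, 24000], dtype=float)

# Summary of the linear association: Pearson correlation and its p-value
r, p_value = stats.pearsonr(engine_size, price)

# Simple linear regression of price on engine size
fit = stats.linregress(engine_size, price)

# One-way ANOVA: do the two hypothetical groups differ in mean price?
group_a = [7000, 7500, 8200]
group_b = [15000, 16500, 14800]
f_stat, anova_p = stats.f_oneway(group_a, group_b)

print("r =", r, "slope =", fit.slope, "ANOVA p =", anova_p)
```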

In [None]:
# scipy 1.2.1
from scipy import stats

**Scikit-learn** is a machine learning library in Python that provides a range of algorithms for classification, regression, clustering, and dimensionality reduction. 

In the context of the 1985 Auto Imports dataset, scikit-learn could be used to build predictive models for car prices based on the features of the cars, such as engine size and country of origin. For example, a linear regression model could be fit to the data to predict the car price based on the engine size, or a k-means clustering algorithm could be used to group similar cars based on their features. By providing a wide range of machine learning algorithms, scikit-learn is a powerful tool for exploring and understanding the relationships in the 1985 Auto Imports dataset.





In [None]:
# scikit-learn 1.2.1

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

**IPyWidgets** is a library in Python for creating interactive user interfaces within Jupyter notebooks. It provides a variety of interactive widgets, such as sliders, buttons, and dropdown menus, that allow users to control and explore the data in real-time. 

In the context of the 1985 Auto Imports dataset, IPyWidgets could be used to create interactive visualizations that allow the user to dynamically explore the relationships between variables. For example, a user could use a slider widget to control the range of car prices displayed in a scatter plot, or a dropdown menu to select different engine sizes to plot. By allowing the user to interact with the data in real-time, IPyWidgets provides a powerful tool for gaining insights into the 1985 Auto Imports dataset.





In [None]:
# ipywidgets 8.0.4
from ipywidgets import interact, interactive, fixed, interact_manual

**tqdm** is a library in Python for creating progress bars to visualize the progress of long-running operations. It provides a simple way to add a progress bar to loops and iterators, allowing the user to see how far the operation has progressed and how much time remains. 

In the context of the 1985 Auto Imports dataset, tqdm could be used to create a progress bar for any time-intensive operations, such as loading or processing the data. By providing visual feedback on the progress of the operation, tqdm helps to make the user experience more seamless and less frustrating, particularly for long-running operations.

In [None]:
# tqdm 4.64.1
from tqdm import tqdm

## b. 1985 Auto Imports Database
This dataset was donated by Jeffrey C. Schlimmer on May 19, 1987. Its sources are:
1. 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook
2. Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3. Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

### Attributes
The attributes of this dataset represent:
- specification of an auto in terms of various characteristics
- assigned insurance risk rating (symboling)
- normalized losses in use as compared to other cars

    *This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc...), and represents the average loss per car per year.*

1. **symboling** - the degree to which the auto is more risky than its price indicates (3 = very risky, -3 = very safe)
2. **normalized-losses** - normalized loss value for a particular car; may represent the financial loss or cost associated with a car
3. **make** - make of the car (e.g., Honda, Toyota, etc.)
4. **fuel-type** - type of fuel used by the car (gas or diesel)
5. **aspiration** - type of engine aspiration (standard or turbocharged)
6. **num-of-doors** - number of doors on the car 
7. **body-style** - style of the car's body (e.g., sedan, hatchback, etc.)
8. **drive-wheels** - configuration of the car's drive wheels (front, rear, or four-wheel)
9. **engine-location** - location of the car's engine (front or rear)
10. **wheel-base** - distance between the centers of the front and rear wheels 
11. **length** - length of the car 
12. **width** - width of the car 
13. **height** - height of the car
14. **curb-weight** - weight of the car without any passengers or cargo
15. **engine-type** - type of engine in the car (e.g., dohc, ohc, etc.)
16. **num-of-cylinders** - number of cylinders in the car's engine
17. **engine-size** - size of the car's engine
18. **fuel-system** - fuel system used by the car (e.g., mpfi, 2bbl, etc.)
19. **bore** - diameter of the car's cylinders
20. **stroke** - distance the piston travels inside the cylinder during a single stroke (from the top to the bottom of its travel)
21. **compression-ratio** - ratio of the volume of the combustion chamber at the bottom of the piston stroke to the volume at the top
22. **horsepower** - power of the car's engine
23. **peak-rpm** - engine speed, in revolutions per minute, at which the engine produces its peak power
24. **city-mpg** - number of miles per gallon the car can get in the city
25. **highway-mpg** - number of miles per gallon the car can get on the highway
26. **price** - price of the car

### Replace dataset headers with attribute names

In [None]:
# rename the columns to their proper attribute labels
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

### Read and save the dataset


In [None]:
dataset = "./auto.csv"

df = pd.read_csv(dataset, header=None, names=headers)

df.head()

## c. Basic insight on the dataset

In [None]:
print("The dataset has", df.shape[0] , "rows and", df.shape[1], "columns")

### Data Types

In [None]:
print("The types of the columns are as follows:\n", df.dtypes)

### Overview of the Dataset
This provides a statistical summary of all columns, including object-typed attributes.

*Note: Some values in the table below show as "NaN" because their respective columns contain missing data.*

In [None]:
df.describe(include = "all")

In [None]:
df.info()

# Phase 2: Cleansing Data

## a. Identify and handle missing values

In [None]:
# Convert "?" to NaN
df.replace("?", np.nan, inplace = True)

# Identify missing values
print("The number of missing values per column are:\n", df.isnull().sum())

#### There are several ways to deal with missing data:
1. **Drop data** - a data point (row) is removed if it contains a missing value
2. **Replace data** - a missing value is replaced by the mean (numerical variables) or by the most frequent value (categorical variables)
3. **Drop attribute** - the entire column is removed if it contains enough missing values
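
The three strategies above can be sketched on a toy frame as follows. The `toy` frame, its values, and the `threshold` cut-off are all hypothetical and serve only to illustrate the pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "num-of-doors": ["two", "four", np.nan, "four"],
    "price": [13000.0, 16000.0, np.nan, 14000.0],
})

# 1. Drop data: remove rows that lack the target value
dropped = toy.dropna(subset=["price"]).reset_index(drop=True)

# 2. Replace data: mean for numerical columns, most frequent value for categorical ones
filled = toy.copy()
filled["horsepower"] = filled["horsepower"].fillna(filled["horsepower"].mean())
filled["num-of-doors"] = filled["num-of-doors"].fillna(filled["num-of-doors"].mode()[0])

# 3. Drop attribute: keep only columns whose share of missing values is acceptable
threshold = 0.5  # hypothetical cut-off, chosen for illustration
reduced = toy.loc[:, toy.isnull().mean() <= threshold]

print(dropped, filled, reduced, sep="\n\n")
```

In this toy frame every column is 25% missing, so the third strategy keeps all columns; a column would only be removed once more than half of its values were missing.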


### Normalized Losses
The normalized losses attribute has 41 missing values, about 20% of the dataset. That share is not large enough to justify dropping the attribute. Since it is a numerical variable and its distribution is only slightly asymmetric, replacing the missing values with the mean is recommended.

In [None]:
# visualize the distribution of normalized losses (missing values are dropped explicitly)
plt_lab.hist(df["normalized-losses"].astype("float").dropna(), bins = 10)

In [None]:
# calculate the mean value for normalized losses
mean_normalized_losses = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized losses:", mean_normalized_losses)

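# replace NaN with mean value in "normalized-losses" column
# (assigning the result back avoids relying on an in-place change to a column selection)
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean_normalized_losses)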

### Num of Doors
The num of doors attribute has only 2 missing values. It is a categorical variable, so replacing the missing values with the most frequent value is recommended.

In [None]:
# calculate the frequency of each value in the "num-of-doors" column
print("The frequency of each value in the num-of-doors column is:\n", df["num-of-doors"].value_counts())

# replace the missing 'num-of-doors' values with the most frequent value
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")


### Bore
The bore attribute has only 4 missing values. It is a numerical variable, so replacing the missing values with the mean is recommended.

In [None]:
# calculate the mean value for bore
mean_bore = df["bore"].astype("float").mean(axis=0)
print("Average of bore:", mean_bore)

# replace NaN with mean value in "bore" column
df["bore"] = df["bore"].replace(np.nan, mean_bore)

### Stroke
The stroke attribute has only 4 missing values. It is a numerical variable, so replacing the missing values with the mean is recommended.

In [None]:
# calculate the mean value for stroke
mean_stroke = df["stroke"].astype("float").mean(axis=0)
print("Average of stroke:", mean_stroke)

# replace NaN with mean value in "stroke" column
df["stroke"] = df["stroke"].replace(np.nan, mean_stroke)

### Horsepower
The horsepower attribute has only 2 missing values. It is a numerical variable, so replacing the missing values with the mean is recommended.


In [None]:
# calculate the mean value for horsepower
mean_horsepower = df["horsepower"].astype("float").mean(axis=0)
print("Average horsepower:", mean_horsepower)

# replace NaN with mean value in "horsepower" column
df["horsepower"] = df["horsepower"].replace(np.nan, mean_horsepower)

### Peak RPM
The peak rpm attribute has only 2 missing values. It is a numerical variable, so replacing the missing values with the mean is recommended.

In [None]:
# calculate the mean value for peak-rpm
mean_peak_rpm = df["peak-rpm"].astype("float").mean(axis=0)
print("Average peak rpm:", mean_peak_rpm)

# replace NaN with mean value in "peak-rpm" column
df["peak-rpm"] = df["peak-rpm"].replace(np.nan, mean_peak_rpm)

### Price
The price attribute has only 4 missing values. However, it is the target variable. As such, the rows with missing values are dropped.

In [None]:
# drop all rows that do not have price data
df.dropna(subset=["price"], axis=0, inplace=True)

# reset the index, because rows were dropped
df.reset_index(drop=True, inplace=True)

## b. Data standardization
In the context of the auto dataset, data standardization could be applied to make the variables comparable and interpretable. For example, "normalized-losses", "wheel-base", "length", "width", and "height" are measured in different units and on different scales, which makes them difficult to compare directly. Standardizing these variables transforms them into a common format, such as z-scores.

Standardizing can also make relationships between the independent variables and the dependent variable easier to identify. Subtracting the mean and dividing by the standard deviation puts every variable on a common scale, so patterns that are hard to see in the original units become easier to compare across variables.
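
A minimal sketch of the z-score transformation described above, assuming a small hypothetical frame `toy` with two numerical columns (the values are illustrative only):

```python
import pandas as pd

# Hypothetical values on very different scales
toy = pd.DataFrame({
    "wheel-base": [88.6, 94.5, 99.8, 105.8],
    "curb-weight": [2548.0, 2823.0, 2337.0, 2844.0],
})

# z-score: subtract the column mean and divide by the column standard deviation
z_scores = (toy - toy.mean()) / toy.std()

print(z_scores)
```

After the transformation each column has a mean of 0 and a sample standard deviation of 1, so the two columns can be compared on the same scale.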

### Correcting the data format of the columns
It has been observed that some columns in the dataset have incorrect data types. To fix this issue, the "astype()" method will be used to convert the data types of each column to the appropriate format.

In [None]:
# categorical are symboling, make, fuel-type, aspiration, num-of-doors, body-style, drive-wheels, engine-location, engine-type, num-of-cylinders, fuel-system
df[["symboling", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "engine-type", "num-of-cylinders", "fuel-system"]] = df[["symboling", "make", "fuel-type", "aspiration", "num-of-doors", "body-style", "drive-wheels", "engine-location", "engine-type", "num-of-cylinders", "fuel-system"]].astype("category")

# normalized-losses as an int
df[['normalized-losses']] = df[['normalized-losses']].astype("int")

# bore, stroke, horsepower, peak-rpm, price as a float
df[['bore', 'stroke', 'horsepower', 'peak-rpm', 'price']] = df[['bore', 'stroke', 'horsepower', 'peak-rpm', 'price']].astype("float")

# the remaining columns are continuous and keep the numeric types inferred when the file was read

df.dtypes

## c. Data normalization
In the context of the auto dataset, normalization is useful when comparing variables that have different scales and units of measurement. For example, "normalized-losses", "wheel-base", "length", "width", and "height" are measured in different units and on different scales, so bringing them into similar ranges allows a more meaningful comparison.
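
As a rough illustration, the sketch below compares simple feature scaling (division by the maximum) with min-max scaling, a common alternative. The `lengths` values are hypothetical.

```python
import pandas as pd

# Hypothetical car lengths (illustrative values only)
lengths = pd.Series([150.0, 170.0, 200.0])

# Simple feature scaling: the largest value maps to 1
simple_scaled = lengths / lengths.max()

# Min-max scaling: the smallest value maps to 0 and the largest to 1
min_max_scaled = (lengths - lengths.min()) / (lengths.max() - lengths.min())

print(simple_scaled.tolist(), min_max_scaled.tolist())
```

Simple scaling preserves ratios between values, while min-max scaling uses the full range from 0 to 1.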

In [None]:
# replace original value by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()

# Phase 3: Exploratory Data Analysis

## Overview of the Dataset


In [None]:
# describe the characteristics of the continuous variables in the dataset
df.describe()

In [None]:
# describe the characteristics of the categorical variables in the dataset
df.describe(include=['category'])

## Continuous numerical variables

In [None]:
width = 12
height = 10

### Normalized Losses vs Price
The plot shows a *very weak positive* linear relationship between normalized losses and price. 
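
The strength labels used in this section ("very weak", "moderate", "strong", "very strong") are based on the Pearson correlation coefficient. The sketch below shows how that coefficient is computed and how it might be mapped to a label. Both helper functions and the cut-off values are assumptions made for illustration; they are not a standard and are not defined elsewhere in this notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered @ y_centered) / np.sqrt(
        (x_centered @ x_centered) * (y_centered @ y_centered)
    )

def describe_strength(r):
    """Map a correlation to a label using hypothetical cut-offs."""
    magnitude = abs(r)
    if magnitude < 0.2:
        label = "very weak"
    elif magnitude < 0.4:
        label = "weak"
    elif magnitude < 0.6:
        label = "moderate"
    elif magnitude < 0.8:
        label = "strong"
    else:
        label = "very strong"
    direction = "positive" if r >= 0 else "negative"
    return f"{label} {direction}"

# Perfectly linear toy data gives a correlation of 1
print(describe_strength(pearson_r([1, 2, 3, 4], [2, 4, 6, 8])))
```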

In [None]:
# regression plot of normalized losses vs price
plt_plot.figure(figsize=(width, height))
sns.regplot(x="normalized-losses", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between normalized losses and price
df[["normalized-losses", "price"]].corr()

### Wheel Base vs Price
The plot shows a *moderate positive* linear relationship between wheel base and price. This indicates that as wheel base increases, price also increases. This makes sense, as cars with longer wheel bases are generally more expensive.

In [None]:
# regression plot of wheel base vs price
plt_plot.figure(figsize=(width, height))
sns.regplot(x="wheel-base", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between wheel base and price
df[["wheel-base", "price"]].corr()

### Length vs Price
The plot shows a *strong positive* linear relationship between length and price. This indicates that as length increases, price also increases. This makes sense, as longer cars are generally more expensive.

In [None]:
# regression plot of length vs price
plt_plot.figure(figsize=(width, height))
sns.regplot(x="length", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between length and price
df[["length", "price"]].corr()

### Width vs Price
The plot shows a *strong positive* linear relationship between width and price. This indicates that as width increases, price also increases. This makes sense, as wider cars are generally more expensive.

In [None]:
# regression plot of width vs price
plt_plot.figure(figsize=(width, height))
sns.regplot(x="width", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between width and price
df[["width", "price"]].corr()

### Height vs Price
The plot shows a *very weak positive* linear relationship between height and price. 

In [None]:
# regression plot of height vs price
plt_plot.figure(figsize=(width, height))
sns.regplot(x="height", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between height and price
df[["height", "price"]].corr()

### Curb Weight vs Price
The plot shows a *very strong positive* linear relationship between curb weight and price. This indicates that as curb weight increases, price also increases. This makes sense, as heavier cars are generally more expensive.

In [None]:
# visualize the relationship between curb-weight and price
plt_plot.figure(figsize=(width, height))
sns.regplot(x="curb-weight", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between curb-weight and price
df[["curb-weight", "price"]].corr()

### Engine size vs price

The plot shows a *very strong positive* correlation between engine size and price, meaning as engine size increases, so does the price. This makes sense as larger engines are typically found in more expensive vehicles.



In [None]:
# regression plot of "engine-size" and "price"
plt_plot.figure(figsize=(width, height))
sns.regplot(x="engine-size", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between engine-size and price
df[["engine-size", "price"]].corr()

### Bore vs Price
The plot shows a *moderate positive* linear relationship between bore and price. This indicates that as bore increases, price also increases.

In [None]:
# regression plot of "bore" and "price"
plt_plot.figure(figsize=(width, height))
sns.regplot(x="bore", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between bore and price
df[["bore", "price"]].corr()

### Stroke vs price
The plot shows a *very weak positive* linear relationship between stroke and price. 

In [None]:
# regression plot of "stroke" and "price"
plt_plot.figure(figsize=(width, height))
sns.regplot(x="stroke", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between stroke and price
df[["stroke","price"]].corr()

### Compression Ratio vs Price
The plot shows a *very weak positive* linear relationship between compression ratio and price. 

In [None]:
# regression plot of "compression-ratio" and "price"
plt_plot.figure(figsize=(width, height))
sns.regplot(x="compression-ratio", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between compression-ratio and price
df[["compression-ratio","price"]].corr()

### Horsepower vs Price
The plot shows a *very strong positive* linear relationship between horsepower and price. This indicates that as horsepower increases, price also increases. This makes sense, as more powerful cars are generally more expensive.

In [None]:
# regression plot of "horsepower" and "price"
plt_plot.figure(figsize=(width, height))
sns.regplot(x="horsepower", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between horsepower and price
df[["horsepower","price"]].corr()

### Peak rpm vs price
The plot shows a *very weak negative* linear relationship between peak rpm and price.

In [None]:
# regression plot of "peak-rpm" and "price"
plt_plot.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between peak-rpm and price
df[["peak-rpm","price"]].corr()

### Highway MPG vs Price
The plot shows a *strong negative* linear relationship between highway mpg and price. This indicates that as highway mpg increases, price decreases. This makes sense, as cars with better gas mileage tend to be smaller and less powerful, and therefore less expensive.

In [None]:
# regression plot of highway-mpg and price
plt_plot.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between highway-mpg and price
df[["highway-mpg", "price"]].corr()

### City MPG vs Price


In [None]:
# regression plot of city-mpg and price
plt_plot.figure(figsize=(width, height))
sns.regplot(x="city-mpg", y="price", data=df)
plt_plot.ylim(0,)

In [None]:
# examine the correlation between city-mpg and price
df[["city-mpg","price"]].corr()

## Categorical variables

### Symboling vs price

In [None]:
# box plot of symboling vs price
plt_plot.figure(figsize=(width, height))
sns.boxplot(x="symboling", y="price", data=df)

In [None]:
# value counts of symboling
df['symboling'].value_counts()

### Make 

In [None]:
# unique values of make
print("Number of brands: ", len(df['make'].unique()))

In [None]:
# number of cars per brand
df['make'].value_counts()

### Num of Doors vs Price

In [None]:
# box plot of num-of-doors vs price
plt_plot.figure(figsize=(width, height))
sns.boxplot(x="num-of-doors", y="price", data=df)

In [None]:
# value counts of num-of-doors
df['num-of-doors'].value_counts()

### Body style vs price

In [None]:
# box plot of "body-style" and "price"
plt_plot.figure(figsize=(width, height))
sns.boxplot(x="body-style", y="price", data=df)

In [None]:
# value counts of body-style
df['body-style'].value_counts()

### Drive wheels vs price

In [None]:
# box plot of "drive-wheels" and "price"
plt_plot.figure(figsize=(width, height))
sns.boxplot(x="drive-wheels", y="price", data=df)

In [None]:
# value counts on drive-wheels
df["drive-wheels"].value_counts()

### Engine location vs price

In [None]:
# box plot of "engine-location" and "price"
plt_plot.figure(figsize=(width, height))
sns.boxplot(x="engine-location", y="price", data=df)

In [None]:
# value counts of engine-location
df['engine-location'].value_counts()

### Engine Type vs price

In [None]:
# box plot of "engine-type" and "price"
plt_plot.figure(figsize=(width, height))
sns.boxplot(x="engine-type", y="price", data=df)

In [None]:
# value counts of engine-type
df['engine-type'].value_counts()

### Num of Cylinders vs price

In [None]:
# box plot of "num-of-cylinders" and "price"
plt_plot.figure(figsize=(width, height))
sns.boxplot(x="num-of-cylinders", y="price", data=df)

In [None]:
# value counts of num-of-cylinders
df['num-of-cylinders'].value_counts()

### Fuel System vs price

In [None]:
# box plot of "fuel-system" and "price"
plt_plot.figure(figsize=(width, height))
sns.boxplot(x="fuel-system", y="price", data=df)

In [None]:
# value counts of fuel-system
df['fuel-system'].value_counts()

## General Problem Statement:
The problem statement for the analysis of the auto dataset is to determine the factors that influence the price of a car and to develop a predictive model that can accurately estimate the price of a car based on its characteristics. The objective is to identify the key variables that contribute to the price and to evaluate how well a multiple linear regression model predicts it. Additionally, the study aims to explore patterns and relationships between the car characteristics and the price, and to draw conclusions about the best approach for predicting the price of a car.

### Data Science Questions:
1. What are the key variables that contribute to the price of a car?
2. How well can a multiple linear regression model predict the price of a car?
3. What are the patterns and relationships between the car characteristics and the price?

# Phase 4: Data Modeling

## Feature Extraction
Feature extraction is a crucial step in data analysis: it identifies the features that have a significant impact on the outcome being studied. Strictly speaking, choosing among existing columns, as done here, is feature selection. For the 1985 Auto Imports dataset, stepwise regression and residual plots are used for this purpose.

Stepwise regression is a feature selection method that iteratively adds or removes variables based on statistical criteria, while residual plots are used to visualize the difference between the predicted values and the actual values of the target variable. By combining these two methods, it is possible to identify the most significant predictors that contribute to the variability of the target variable and extract them for further analysis and modeling. This will provide a more parsimonious and interpretable model, as well as improve the model's predictive performance.
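
One variant of stepwise regression is backward elimination: fit the model, drop the predictor with the largest p-value, and repeat until every remaining p-value is at or below a chosen significance level. The sketch below illustrates the idea with NumPy and SciPy on synthetic data. The helper names, the column names, and the decision to always keep an intercept are assumptions made for this example, not the notebook's own implementation.

```python
import numpy as np
from scipy import stats

def ols_p_values(design, y):
    """Two-sided p-values of OLS coefficients; `design` already contains any intercept column."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k)          # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)   # coefficient covariance matrix
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), df=n - k)

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the least significant predictor until all p-values are <= alpha."""
    cols = list(range(X.shape[1]))
    while cols:
        design = np.column_stack([np.ones(len(y)), X[:, cols]])
        p = ols_p_values(design, y)[1:]  # skip the intercept, which is always kept
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        del cols[worst]
    return [names[i] for i in cols]

# Synthetic data: two informative predictors and one pure-noise predictor
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected = backward_eliminate(X, y, ["signal_a", "signal_b", "noise"])
print(selected)
```

The two informative predictors are retained. The noise column is usually dropped, although at a 5% significance level it can occasionally survive by chance.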

### Stepwise Regression

In [None]:
## getting column names
x_columns = ["normalized-losses", "wheel-base", "length", "width", "height", "curb-weight", "engine-size", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg"]
y = df["price"]

In [None]:
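# statsmodels provides OLS with a full statistical summary (coefficients, p-values, R-squared)
import statsmodels.api as sm

def get_stats():
    """Fit an OLS model of price on the current x_columns and print the summary."""
    x = df[x_columns]
    # note: no constant column is added, so this model is fit without an intercept
    results = sm.OLS(y, x).fit()
    print(results.summary())

get_stats()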

In [None]:
# backward elimination: refit and drop the variable with the largest p-value until all p-values are <= 0.05
while True:
    x = df[x_columns]
    results = sm.OLS(y, x).fit()
    p_values = results.pvalues
    max_p_value = p_values.max()
    if max_p_value > 0.05:
        max_p_value_index = p_values.idxmax()
        x_columns.remove(max_p_value_index)
    else:
        break

print(results.summary())

In [None]:
# list the remaining variables, ordered from the smallest to the largest p-value
significant = results.pvalues.sort_values(ascending=True)
print("The most significant variables are:")
# print only the variable names, without the p-values
print(significant.index.values)

### Residual Plot

In [None]:
width = 12
height = 10

In [None]:
# residual plot of engine-size and price
plt_plot.figure(figsize=(width, height))
sns.residplot(x = df['engine-size'], y = df['price'])
plt_plot.show()

In [None]:
# residual plot of compression-ratio and price
plt_plot.figure(figsize=(width, height))
sns.residplot(x = df['compression-ratio'], y = df['price'])
plt_plot.show()

In [None]:
# residual plot of stroke and price
plt_plot.figure(figsize=(width, height))
sns.residplot(x = df['stroke'], y = df['price'])
plt_plot.show()

In [None]:
# residual plot of city-mpg and price
plt_plot.figure(figsize=(width, height))
sns.residplot(x = df['city-mpg'], y = df['price'])
plt_plot.show()

In [None]:
# residual plot of bore and price
plt_plot.figure(figsize=(width, height))
sns.residplot(x = df['bore'], y = df['price'])
plt_plot.show()

In [None]:
# residual plot of peak-rpm and price
plt_plot.figure(figsize=(width, height))
sns.residplot(x = df['peak-rpm'], y = df['price'])
plt_plot.show()

In [None]:
# residual plot of horsepower and price
plt_plot.figure(figsize=(width, height))
sns.residplot(x = df['horsepower'], y = df['price'])
plt_plot.show()

In [None]:
# residual plot of wheel-base and price
plt_plot.figure(figsize=(width, height))
sns.residplot(x = df['wheel-base'], y = df['price'])
plt_plot.show()

## Multiple Linear Regression Model Development

In [None]:
# create the linear regression object
lm = LinearRegression()

# create a multiple linear regression model with the updated x_columns
x = df[x_columns]
y = df["price"]
lm.fit(x, y)

## Model Testing and Evaluation

In [None]:
# make a prediction
yhat = lm.predict(x)

# create a distribution plot for the multiple linear regression model
# (kdeplot is used because distplot has been deprecated since seaborn 0.11)
plt_plot.figure(figsize=(width, height))
ax1 = sns.kdeplot(x=df['price'], color="r", label="Actual Value")
sns.kdeplot(x=yhat, color="b", label="Fitted Values", ax=ax1)

plt_plot.title('Actual vs Fitted Values for Price')
plt_plot.xlabel('Price (in dollars)')
plt_plot.ylabel('Proportion of Cars')
plt_plot.legend()

plt_plot.show()
plt_plot.close()

In [None]:
# evaluate the model with the R-square
print("The R-square is: ", lm.score(x, y))


In [None]:
# evaluate the model with the MSE
mse = mean_squared_error(df['price'], yhat)
print("The mean square error of price and predicted value is: ", mse)

# Data Evaluation

// explain model results in non technical terms

    // actionable insights

### Question 1: Does curb weight have an impact on the price of a vehicle?

### Question 2: Does the size of the engine influence the price of a vehicle?

### Question 3 : What is the relationship between the horsepower and price of a vehicle?

// visualize findings and relate it to our problem statement and questions

// recommendations