 <h1 align=center><font size = 6>Data Analysis with Python</font></h1>

# (Project-Automobile Dataset) 


## 1. Introduction


<h4>Can we estimate the price of a used car based on its characteristics?</h4>
    
To answer this question, we are going to use various Python packages to perform data cleaning, exploratory data analysis, model development and model evaluation.

## Table of contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
1. Introduction<br>
    1.1 Data Acquisition<br>
    1.2 Basic Insight of Dataset<br>
2. Data Wrangling<br>
    2.1 Identify and handle missing values<br>
        2.1.1 Identify missing values<br>
        2.1.2 Deal with missing values<br>
        2.1.3 Correct data format<br>
    2.2 Data standardization<br>
    2.3 Data Normalization (centring/scaling)<br>
    2.4 Binning<br>
    2.5 Indicator variable<br>
3. Exploratory Data Analysis<br>
    3.1 Analyzing Individual Feature Patterns using Visualization<br>
    3.2 Categorical variables<br>
    3.3 Descriptive Statistical Analysis<br>
    3.4 Basic of Grouping<br>
    3.5 Correlation and Causation<br>
    3.6 ANOVA<br>
4. Model Development<br>
    4.1 Linear Regression and Multiple Linear Regression<br>
    4.2 Model Evaluation using Visualization<br>
        4.2.1 Regression Plot<br>
        4.2.2 Residual Plot<br>
        4.2.3 Multiple Linear Regression<br>
    4.3 Polynomial Regression and Pipelines<br>
        4.3.1 Pipeline<br>
    4.4 Measures for In-Sample Evaluation<br>
        4.4.1 R-squared<br>
        4.4.2 Mean Squared Error (MSE)<br>
    4.5 Prediction and Decision Making<br>
5. Model Evaluation and Refinement<br>
    5.1 Training and Testing<br>
    5.2 Overfitting, Underfitting and Model Selection<br>
    5.3 Ridge regression<br>
    5.4 Grid Search<br>
</div>

<a id="ref1"></a>
## 1.1 Data Acquisition
Datasets come in various formats, such as .csv, .json and .xlsx, and can be stored in different places: on your local machine or online.
In this section, you will learn how to load a dataset into a Jupyter notebook.
In our case, the Automobile Dataset is an online source, and it is in CSV (comma separated value) format. Let's use this dataset as an example to practice data reading.
data source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
data type: csv
The Pandas Library is a useful tool that enables us to read various datasets into a data frame; our Jupyter notebook platforms have a built-in **Pandas Library** so that all we need to do is import Pandas without installing.


In [None]:
# import pandas library
import pandas as pd

### Read Data
We use the **"pandas.read_csv()"** function to read the csv file. Inside the parentheses, we put the file path in quotation marks, so that pandas will read the file into a data frame from that address. The file path can be either a URL or a local file address.
Because the data does not include headers, we can add the argument **"header = None"** inside the **"read_csv()"** call, so that pandas will not automatically set the first row as a header.
You can also assign the dataset to any variable you create.

In [None]:
# import pandas library
import pandas as pd
# read the online file from the URL provided above, and assign it to variable "df"
path="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

df = pd.read_csv(path,header=None)
print("Done")
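
The effect of `header=None` can be seen without downloading anything. The sketch below is only an illustration: it reads a hypothetical three-line sample (made up for this example, in the same header-less layout as the real file) from memory, once with the default settings and once with `header=None`.

```python
import io
import pandas as pd

# Hypothetical three-line sample in the same header-less layout as the real file
sample = "3,?,alfa-romero\n1,?,alfa-romero\n2,164,audi\n"

# Default behaviour: the first data row is consumed as the column names
with_default = pd.read_csv(io.StringIO(sample))

# header=None keeps every row as data and labels the columns 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(sample), header=None)

print(with_default.shape)            # one row fewer, because it became the header
print(without_header.shape)          # all three rows are kept
print(list(without_header.columns))  # integer column labels
```

With the default settings one car would silently disappear into the header, which is why the notebook passes `header=None` for this file.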

After reading the dataset, we can use the **`dataframe.head(n)`** method to check the top n rows of the dataframe; where n is an integer. Contrary to **`dataframe.head(n)`**, **`dataframe.tail(n)`** will show you the bottom n rows of the dataframe.


In [None]:
# show the first 5 rows using dataframe.head() method
df.head(5)


### Question 1: check the bottom 10 rows of data frame "df".


In [None]:
df.tail(10)


### Add Headers
Take a look at our dataset; pandas automatically set the headers to integers starting from 0. 
<div>
To better describe our data we can introduce a header; this information is available at:  https://archive.ics.uci.edu/ml/datasets/Automobile</div>
<p></p>
<div>Thus, we have to add headers manually.</div>
<div>First, we create a list "headers" that includes all column names in order.</div>
<div>Then, we use **`dataframe.columns = headers`** to replace the headers with the list we created.</div>

In [None]:
# create headers list
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
headers

 We replace headers and recheck our data frame

In [None]:
df.columns = headers
df.head(10)
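
As an alternative to assigning `df.columns` afterwards, `read_csv` also accepts the column names directly through its `names` parameter. The following is a minimal sketch under stated assumptions: the two-line sample and the shortened list `short_headers` are hypothetical and stand in for the real file and the full "headers" list.

```python
import io
import pandas as pd

# Hypothetical two-line sample with only three of the 26 columns
sample = "3,?,alfa-romero\n2,164,audi\n"
short_headers = ["symboling", "normalized-losses", "make"]

# names= supplies the column labels while the file is being read
named = pd.read_csv(io.StringIO(sample), header=None, names=short_headers)

print(named.columns.tolist())
print(named)
```

Both routes end with the same labelled data frame; the two-step version used in this notebook simply makes the before/after difference visible.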

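We can drop rows with missing values in the column "price" as follows. Note that `dropna` returns a new data frame and leaves "df" unchanged unless the result is assigned or `inplace=True` is passed; also, at this stage the missing prices are still stored as "?" rather than NaN, so no rows are actually removed yet.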

In [None]:
df.dropna(subset=["price"], axis=0)

Now, we have successfully read the raw dataset and added the correct headers to the data frame.


### Question 2: Find the names of the columns of the dataframe


In [None]:
df.columns

### Save Dataset
Correspondingly, Pandas enables us to save the dataset to csv by using the **`dataframe.to_csv()`** method; pass the file path and name, in quotation marks, inside the parentheses.

For example, to save the dataframe "df" as "automobile.csv" on your local machine, you may use the syntax below:
~~~~
df.to_csv("automobile.csv")
~~~~
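
One detail worth knowing when saving: by default `to_csv()` also writes the row index as an extra first column. The sketch below uses a small hypothetical data frame and writes to an in-memory string (what `to_csv()` returns when no path is given) to show the difference that `index=False` makes when the file is read back.

```python
import io
import pandas as pd

# Small hypothetical data frame standing in for "df"
small = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, 16430.0]})

# Without a path, to_csv() returns the csv text instead of writing a file
with_index = small.to_csv()
without_index = small.to_csv(index=False)

# Reading the first version back produces an extra, unnamed index column
reloaded = pd.read_csv(io.StringIO(with_index))
clean = pd.read_csv(io.StringIO(without_index))

print(list(reloaded.columns))
print(list(clean.columns))
```

Whether to keep the index is a design choice; for a data frame with a plain 0, 1, 2, ... index it usually carries no information.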



 We can also read and save other file formats using functions similar to **`pd.read_csv()`** and **`df.to_csv()`**; they are listed in the following table:


### Read/Save Other Data Formats



| Data Format   | Read           | Save             |
| ------------- |:--------------:| ----------------:|
| csv           | `pd.read_csv()`  |`df.to_csv()`     |
| json          | `pd.read_json()` |`df.to_json()`    |
| excel         | `pd.read_excel()`|`df.to_excel()`   |
| hdf           | `pd.read_hdf()`  |`df.to_hdf()`     |
| sql           | `pd.read_sql()`  |`df.to_sql()`     |
| ...           |   ...          |       ...        |

<a id="ref2"></a>
## 1.2 Basic Insight of Dataset
After reading data into Pandas dataframe, it is time for us to explore the dataset.
There are several ways to obtain essential insights of the data to help us better understand our dataset.

### Data Types
Data has a variety of types.
The main types stored in Pandas dataframes are `object`, `float`, `int`, `bool` and `datetime64`. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas:
~~~~
dataframe.dtypes
~~~~
returns a Series with the data type of each column.

In [None]:
# check the data type of data frame "df" by .dtypes
df.dtypes

As a result, as shown above, it is clear to see that the data type of "symboling" and "curb-weight" are `int64`, "normalized-losses" is `object`, and "wheel-base" is `float64`, etc.
These data types can be changed; we will learn how to accomplish this in a later module. 

### Describe
If we would like to get a statistical summary of each column, such as the count, mean and standard deviation, we use the describe method:
~~~~
dataframe.describe()
~~~~
This method will provide various summary statistics, excluding `NaN` (Not a Number) values.

In [None]:
df.describe()

This shows the statistical summary of all numeric-typed (int, float) columns.
For example, the attribute "symboling" has 205 counts, the mean value of this column is 0.83, the standard deviation is 1.25, the minimum value is -2, 25th percentile is 0, 50th percentile is 1, 75th percentile is 2, and the maximum value is 3.

However, what if we would also like to check all the columns, including those of type object?


You can add the argument `include = "all"` inside the parentheses. Let's try it again.

In [None]:
# describe all the columns in "df" 
df.describe(include = "all")

Now, it provides the statistical summary of all the columns, including object-typed attributes.
For the object-typed columns, we can now see the number of unique values, the most frequent value (top) and its frequency (freq).
Some values in the table above show as "NaN"; this is because those statistics are not defined for that column's type.


### Question 3: Apply the ".describe()" method to the columns 'length' and 'compression-ratio'.

In [None]:
df[['length', 'compression-ratio']].describe()

### Info
Another method you can use to check your dataset is:
~~~~
dataframe.info()
~~~~
It provides a concise summary of your DataFrame.

In [None]:
# look at the info of "df"
df.info()

Here we can see the index range, each column's name, its count of non-null values and its data type, along with the memory usage.

It also shows that the whole data frame has 205 rows and 26 columns in total.

## 2. Data Wrangling

**What is the purpose of Data Wrangling?**

Data Wrangling is the process of converting data from the initial format to a format that may be better for analysis.

 *(As we can see, several question marks appeared in the dataframe; those are missing values which may hinder our further analysis).*

### How do we identify missing values and deal with them?


Steps for working with missing data:
1. identify missing data
2. deal with missing data
3. correct data format

<a id="ref1"></a>
## 2.1.1 Identify missing values


<a id="ref2"></a>
#### Convert "?" to NaN
In the car dataset, missing data is marked with the question mark "?".
We replace "?" with NaN (Not a Number), which is the default missing value marker in pandas, for reasons of computational speed and convenience. Here we use the function: 
 <pre>.replace(A, B, inplace = True) </pre>
to replace A with B.

In [None]:
import numpy as np

# replace "?" with NaN
df.replace("?", np.nan, inplace = True)
df.head(5)
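
The same conversion can also happen while the file is being read, through the `na_values` parameter of `read_csv`. This is a sketch of an alternative, not what the notebook does: the two-line sample is hypothetical, and with this approach the affected columns are parsed as numbers right away, so the later type conversions would look different.

```python
import io
import pandas as pd

# Hypothetical sample in which "?" marks a missing entry
sample = "3,?,13495\n1,164,?\n"

# Read as-is: "?" is kept as ordinary text, so nothing counts as missing
raw = pd.read_csv(io.StringIO(sample), header=None)

# na_values="?" tells the parser to treat "?" as NaN
parsed = pd.read_csv(io.StringIO(sample), header=None, na_values="?")

print(raw.isnull().sum().sum())     # no missing values detected
print(parsed.isnull().sum().sum())  # both "?" entries are now NaN
print(parsed.dtypes)
```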

#### Evaluating for Missing Data

The missing values are now stored as NaN. We use pandas methods to identify these missing values. There are two methods to detect missing data:
1.  **.isnull()**
2.  **.notnull()**

The output is a data frame of boolean values indicating whether each entry is in fact missing.

In [None]:
missing_data = df.isnull()
missing_data.head(5)

"True" stands for missing value, while "False" stands for not missing value.

#### Count missing values in each column
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value, while "False" means the value is present in the dataset. In the body of the for loop, the method ".value_counts()" counts the number of "True" and "False" values in each column. 

In [None]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    
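
A more compact way to get the same counts is to sum the boolean frame, since `True` counts as 1 and `False` as 0. The sketch below runs on a small hypothetical data frame named `toy`, not on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with one missing value in two of its columns
toy = pd.DataFrame({
    "bore": [3.47, np.nan, 3.19],
    "price": [13495.0, 16500.0, np.nan],
    "make": ["audi", "bmw", "audi"],
})

# isnull() gives True/False per entry; sum() adds up the True values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Applied to "df", the expression `df.isnull().sum()` would list the missing-value count of all 26 columns in a single table.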

Based on the summary above, each column has 205 rows of data, and seven columns contain missing data:

1. "normalized-losses": 41 missing data
2. "num-of-doors": 2 missing data
3. "bore": 4 missing data
4. "stroke" : 4 missing data
5. "horsepower": 2 missing data
6. "peak-rpm": 2 missing data
7. "price": 4 missing data

<a id="ref3"></a>
## 2.1.2 Deal with missing data
**How to deal with missing data?**

    
    1. drop data 
        a. drop the whole row
        b. drop the whole column
    2. replace data
        a. replace it by mean
        b. replace it by frequency
        c. replace it based on other functions

Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely.
We have some freedom in choosing how to replace data; however, some methods may seem more reasonable than others. We will apply different methods to different columns:

**Replace by mean:**

    "normalized-losses": 41 missing data, replace them with mean
    "stroke": 4 missing data, replace them with mean
    "bore": 4 missing data, replace them with mean
    "horsepower": 2 missing data, replace them with mean
    "peak-rpm": 2 missing data, replace them with mean
    
**Replace by frequency:**

    "num-of-doors": 2 missing data, replace them with "four". 
        * Reason: 84% of sedans have four doors. Since four doors is the most frequent value, it is the most likely one for the missing entries.
    

**Drop the whole row:**

    "price": 4 missing data, simply delete the whole row
        * Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore they are not useful to us

#### Calculate the average of the column 

In [None]:
avg_1 = df["normalized-losses"].astype("float").mean(axis = 0)

#### Replace "NaN" by mean value in "normalized-losses" column

In [None]:
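# assign the result back; calling replace(..., inplace=True) on a selected column may not update "df" in recent pandas versions
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, avg_1)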

#### Calculate the mean value for 'bore' column

In [None]:
avg_2=df['bore'].astype('float').mean(axis=0)

#### Replace NaN by mean value

In [None]:
df['bore'] = df['bore'].replace(np.nan, avg_2)

### Question 4: According to the example above, replace NaN in "stroke" column by mean.


In [None]:
# calculate the mean value for "stroke" column
avg_3 = df["stroke"].astype('float').mean(axis=0)

# replace NaN by mean value in "stroke" column
df["stroke"] = df["stroke"].replace(np.nan, avg_3)

#### Calculate the mean value for the  'horsepower' column:

In [None]:
avg_4=df['horsepower'].astype('float').mean(axis=0)

#### Replace "NaN" by mean value :

In [None]:
df['horsepower'] = df['horsepower'].replace(np.nan, avg_4)

#### Calculate the mean value for 'peak-rpm' column:

In [None]:
avg_5=df['peak-rpm'].astype('float').mean(axis=0)

#### Replace NaN by mean value:

In [None]:
df['peak-rpm'] = df['peak-rpm'].replace(np.nan, avg_5)
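
The five mean replacements above all follow the same pattern, so they can also be written as one loop. The following is a minimal sketch on a hypothetical two-column data frame `toy`, using `fillna` in place of `replace(np.nan, ...)`; the numbers are stored as text, as they are in "df" at this point.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame: numbers stored as text, with one NaN per column
toy = pd.DataFrame({
    "bore": ["3.47", np.nan, "3.19"],
    "peak-rpm": ["5000", "5500", np.nan],
})

for column in ["bore", "peak-rpm"]:
    # convert to float first so that the mean can be computed
    numeric = toy[column].astype("float")
    # fill the gaps with the column mean and store the result back
    toy[column] = numeric.fillna(numeric.mean())

print(toy)
```

Note that this variant also leaves the columns as floats, whereas the notebook keeps the conversion as a separate step in section 2.1.3.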

To see which values are present in a particular column, we can use the ".value_counts()" method:

In [None]:
df['num-of-doors'].value_counts()

We can see that four doors is the most common type. We can also use the ".idxmax()" method to find the most common type automatically:

In [None]:
df['num-of-doors'].value_counts().idxmax()

The replacement procedure is very similar to what we have seen previously:

In [None]:
# replace the missing 'num-of-doors' values by the most frequent 
df["num-of-doors"] = df["num-of-doors"].replace(np.nan, "four")
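
Instead of typing "four" by hand, the most frequent value can be looked up and used in one go. The sketch below assumes a small hypothetical Series `doors` and uses `mode()`, which returns the most frequent value(s) and ignores NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry
doors = pd.Series(["four", "two", "four", np.nan, "four"], name="num-of-doors")

# mode() returns a Series of the most frequent values; take the first one
most_common = doors.mode()[0]

# fill the missing entry with that value
filled = doors.fillna(most_common)
print(most_common)
print(filled.tolist())
```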

Finally, let's drop all rows that do not have price data:

In [None]:
# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"], axis=0, inplace = True)

# reset index, because we dropped four rows
df.reset_index(drop = True, inplace = True)

In [None]:
df.head()

**Good!** Now, we have a dataset with no missing values.

<a id="ref4"></a>
## 2.1.3 Correct  data format
**We are almost there!**
<div>The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).</div>

In Pandas, we use 
<div>**.dtypes** to check the data type</div>
<div>**.astype()** to change the data type</div>

#### Let's list the data types for each column

In [None]:
df.dtypes

As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the "astype()" method.  

#### Convert data types to proper format

In [None]:

df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
print("Done")
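
`astype("float")` only works once every entry can be parsed as a number; a leftover "?" would raise an error. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` converts what it can and turns everything else into NaN. The Series `raw` below is a hypothetical example, not a column of "df".

```python
import pandas as pd

# Hypothetical text column that still contains a "?" placeholder
raw = pd.Series(["3.47", "?", "2.68"], name="bore")

# errors="coerce" turns unparseable entries into NaN instead of raising
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print(converted.dtype)
```

The trade-off is that coercion hides bad entries silently, so it is worth checking `converted.isnull().sum()` afterwards.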

#### Let us list the columns after the conversion  

In [None]:
df.dtypes

**Wonderful!**

Now, we finally obtain the cleaned dataset with no missing values and all data in its proper format.

<a id="ref5"></a>
## 2.2 Data Standardization
Data is usually collected from different agencies with different formats.
(Data Standardization is also a term for a particular type of data normalization, where we subtract the mean and divide by the standard deviation)

**What is Standardization?**
<div>Standardization is the process of transforming data into a common format, which allows the researcher to make meaningful comparisons.
</div>

**Example**
<div>Transform mpg to L/100km:</div>
<div>In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are expressed in mpg (miles per gallon). Assume we are developing an application in a country that reports fuel consumption in L/100km.</div>
<div>We will need to apply a **data transformation** to convert mpg into L/100km.</div>


The formula for unit conversion is
L/100km = 235 / mpg
<div>We can do many mathematical operations directly in Pandas.</div>

In [None]:
df.head()

In [None]:
# transform mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df["city-mpg"]

# check your transformed data 
df.head()
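
The constant 235 in the conversion formula comes from the unit definitions. The sketch below works it out, assuming US gallons (an assumption made here; the dataset description does not spell out which gallon is meant). The function name `mpg_to_l_per_100km` is introduced only for this illustration.

```python
# Unit definitions (assumption: US liquid gallon)
LITERS_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

def mpg_to_l_per_100km(mpg):
    """Convert miles per gallon to litres per 100 km."""
    # distance in km covered with one litre of fuel
    km_per_liter = mpg * KM_PER_MILE / LITERS_PER_US_GALLON
    # litres needed to cover 100 km
    return 100.0 / km_per_liter

# The conversion factor that the notebook rounds to 235
factor = 100.0 * LITERS_PER_US_GALLON / KM_PER_MILE
print(round(factor, 2))  # 235.21
```

Because the relationship is a reciprocal, a high mpg value corresponds to a low L/100km value and vice versa.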

### Question 5: According to the example above, transform mpg to L/100km for the column "highway-mpg", and store the result in a new column named "highway-L/100km".


In [None]:
# transform mpg to L/100km by mathematical operation (235 divided by mpg)
# the result goes into a new column so that "highway-mpg" stays available for the later analysis
df["highway-L/100km"] = 235/df["highway-mpg"]

# check your transformed data 
df.head()

<a id="ref6"></a>
## 2.3 Data Normalization 

**Why normalization?**
<div>Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variable variance is 1, or scaling the variable so the variable values range from 0 to 1.
 </div>

**Example**
<div>To demonstrate normalization, let's say we want to scale the columns "length", "width" and "height".</div>
<div>**Target:** we would like to normalize those variables so their values range from 0 to 1.</div>
<div>**Approach:** replace the original value by (original value)/(maximum value)</div>

In [None]:
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()

### Question 6: According to the example above, normalize the column "height".


In [None]:
df['height'] = df['height']/df['height'].max() 

# show the scaled columns
df[["length","width","height"]].head()

Here we can see, we've normalized "length", "width" and "height" in the range of [0,1].
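
The notebook uses the simplest scaling, division by the maximum. For comparison, the sketch below applies that and the two other common normalizations mentioned earlier (min-max and z-score) to a hypothetical Series `lengths`; the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical car lengths
lengths = pd.Series([168.8, 171.2, 176.6, 192.7], name="length")

# Simple feature scaling: largest value becomes 1
simple = lengths / lengths.max()

# Min-max scaling: smallest value becomes 0, largest becomes 1
min_max = (lengths - lengths.min()) / (lengths.max() - lengths.min())

# Z-score: mean becomes 0 and standard deviation becomes 1
z_score = (lengths - lengths.mean()) / lengths.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

Note that simple feature scaling keeps all values inside [0, 1] but does not map the smallest value to 0; only min-max scaling uses the full interval.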

##  2.4 Binning
**Why binning?** 
<div>Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis.
 </div>

**Example:** 
<div>In our dataset, "horsepower" is a real-valued variable ranging from 48 to 288 with 57 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and low horsepower (3 types)? Can we rearrange them into three 'bins' to simplify analysis? </div>

<div>We will use the Pandas function 'cut' to segment the 'horsepower' column into 3 bins. </div>

### Example of Binning Data In Pandas

 Convert data to correct format 

In [None]:
df["horsepower"]=df["horsepower"].astype(float, copy=True)

We would like three bins of equal width, so we need four evenly spaced bin edges; the fourth edge is the rightmost boundary, which the function "cut" includes in the last bin:

In [None]:
binwidth = (max(df["horsepower"])-min(df["horsepower"]))/3

We build the array of bin edges from the minimum value to the maximum value, with consecutive edges separated by the bin width calculated above. These edges determine where one bin ends and the next begins.

In [None]:
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins

 We set group  names:

In [None]:
group_names = ['Low', 'Medium', 'High']

 We apply the function "cut" to determine which bin each value of "df['horsepower']" belongs to. 

In [None]:
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names,include_lowest=True )
df[['horsepower','horsepower-binned']].head(20)
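
When equal-width bins are all that is needed, `pd.cut` can also compute the edges itself if it is given the number of bins instead of an edge array. This is a sketch on a hypothetical Series `hp`, shown as an alternative to building the edges by hand.

```python
import pandas as pd

# Hypothetical horsepower values
hp = pd.Series([48, 70, 111, 154, 200, 262], name="horsepower")

# bins=3 asks pandas for three equal-width bins spanning min to max
binned = pd.cut(hp, bins=3, labels=["Low", "Medium", "High"])

print(pd.DataFrame({"horsepower": hp, "binned": binned}))
print(binned.value_counts())
```

Passing explicit edges, as the notebook does, is still useful when the boundaries should be fixed in advance or reused on other data.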

Check the dataframe above carefully; you will find the last column provides the bins for "horsepower" with 3 categories ("Low", "Medium" and "High"). 
<div>We successfully reduced 57 distinct values to 3 categories!</div>

### Bins visualization 
Normally, a histogram is used to visualize the distribution of bins we created above. 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# draw histogram of attribute "horsepower" with bins = 3
plt.hist(df["horsepower"], bins = 3)

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

The plot above shows the binning result for attribute "horsepower". 

## 2.5 Indicator variable (or dummy variable)
**What is an indicator variable?**
<div>An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. </div>

**Why do we use indicator variables?**
<div>So we can use categorical variables for regression analysis in the later modules.</div>

**Example**
<div>We see the column "fuel-type" has two unique values, "gas" or "diesel". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-type" into indicator variables.</div>

<div>We will use the pandas function 'get_dummies' to assign numerical values to the different categories of fuel type. </div>

In [None]:
df.columns

 Get the indicator variables and assign them to the data frame "dummy_variable_1":

In [None]:
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.head()

change column names for clarity 

In [None]:
dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_variable_1.head()

Each row now has a 1 (shown as True in recent pandas versions) in the column matching its fuel type and a 0 (False) in the other column. We will now insert these columns back into our original dataset. 

In [None]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)

In [None]:
df.head()


The last two columns are now the indicator variable representation of the fuel-type variable. It's all 0s and 1s now.
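
`get_dummies` has a few parameters that can replace the manual renaming. The sketch below, on a hypothetical three-row data frame `cars`, uses `prefix` to name the columns, `dtype=int` to force 0/1 output, and `drop_first=True` to keep a single column, which is enough for a two-category variable.

```python
import pandas as pd

# Hypothetical data frame with a two-category column
cars = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})

# prefix= builds the column names; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(cars["fuel-type"], prefix="fuel-type", dtype=int)

# drop_first=True drops the first category, leaving one indicator column
single = pd.get_dummies(cars["fuel-type"], prefix="fuel-type",
                        drop_first=True, dtype=int)

print(dummies)
print(single)
```

Note that `prefix` joins the names with an underscore, so the columns here are called "fuel-type_diesel" and "fuel-type_gas" rather than the hyphenated names used elsewhere in this notebook.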

### Question 7: As above, create indicator variables for the column "aspiration": "std" to 0, while "turbo" to 1.


In [None]:
# get indicator variables of aspiration and assign it to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df['aspiration'])

# change column names for clarity
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)

# show first 5 instances of data frame "dummy_variable_2"
dummy_variable_2.head()

### Question 8: Merge the new dataframe to the original dataframe then drop the column 'aspiration'


In [None]:
# merge the new dataframe to the original dataframe
df = pd.concat([df, dummy_variable_2], axis=1)

# drop original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)

 Save the cleaned data frame to a new csv file:

In [None]:
df.to_csv('clean_df.csv')

## 3. Exploratory Data Analysis

**Question**: What are the main characteristics which have the most impact on the car price?
    

## 3.1 Analyzing Individual Feature Patterns using Visualization

 Import visualization packages "Matplotlib" and "Seaborn", don't forget about "%matplotlib inline" to plot in a Jupyter notebook.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

### How to choose the right visualization method?
When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualisation method for that variable.


In [None]:
# list the data types for each column
df.dtypes

### Question 9: What is the data type of the column "peak-rpm"?

For example, we can calculate the correlation between variables of type "int64" or "float64" using the method "corr":

In [None]:
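# restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
df.corr(numeric_only=True)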

The diagonal elements are always one; we will study correlation, more precisely the Pearson correlation, in depth at the end of the notebook.
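
To make the numbers in the correlation table less of a black box, the sketch below computes the Pearson coefficient by hand and compares it with what pandas returns. The five data points in `toy` are hypothetical and chosen only to show a strong positive relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical engine sizes and prices
toy = pd.DataFrame({
    "engine-size": [97.0, 109.0, 130.0, 152.0, 209.0],
    "price": [7000.0, 9500.0, 13500.0, 16500.0, 31000.0],
})

# Deviations of each variable from its mean
x = toy["engine-size"] - toy["engine-size"].mean()
y = toy["price"] - toy["price"].mean()

# Pearson r: sum of products of deviations over the product of their norms
manual_r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same quantity as computed by pandas (Pearson is the default method)
pandas_r = toy["engine-size"].corr(toy["price"])

print(manual_r, pandas_r)
```

A value near 1 means the points lie close to an upward-sloping line, a value near -1 a downward-sloping line, and a value near 0 no linear relationship.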

### Question 10: Find the correlation between the following columns: bore, stroke, compression-ratio, and horsepower.



In [None]:
df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()

### Continuous numerical variables: 

Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines. 

To start understanding the (linear) relationship between an individual variable and the price, we can use "regplot", which plots the scatterplot plus the fitted regression line for the data.

 Let's see several examples of different linear relationships:

### Positive linear relationship

Let's find the scatterplot of "engine-size" and "price" 

In [None]:
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)

As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the points lie close to the upward-sloping regression line.

 We can examine the correlation between 'engine-size' and 'price' and see it's approximately  0.87

In [None]:
df[["engine-size", "price"]].corr()

### Negative linear relationship

Highway mpg is a potential predictor variable of price 

In [None]:
sns.regplot(x="highway-mpg", y="price", data=df)

As the highway-mpg goes up, the price goes down: this indicates an inverse/ negative relationship between these two variables. Highway mpg could potentially be a predictor of price.


We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately  -0.704

In [None]:
df[['highway-mpg', 'price']].corr()

### Weak Linear Relationship

Let's see if "peak-rpm" is a predictor variable of "price".

In [None]:
sns.regplot(x="peak-rpm", y="price", data=df)

Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it is not a reliable variable.


 We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616.

In [None]:
df[['peak-rpm','price']].corr()

### Question 11(a): Find the correlation  between x="stroke", y="price".

In [None]:
# The correlation is approximately 0.082: see the off-diagonal elements of the table.
df[["stroke","price"]].corr()

### Question 11(b): Given the correlation results between "price" and "stroke"  do you expect a linear relationship? Verify your results using the function "regplot()".

In [None]:
# There is a weak correlation between the variables 'stroke' and 'price', so a linear regression will not work well. We can use "regplot" to demonstrate this.
sns.regplot(x="stroke", y="price", data=df)

## 3.2 Categorical variables

These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.

Let's look at the relationship between "body-style" and "price".

In [None]:
sns.boxplot(x="body-style", y="price", data=df)

We see that the distributions of price between the different body-style categories have a significant overlap, and so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":

In [None]:
sns.boxplot(x="engine-location", y="price", data=df)

Here we see that the distributions of price between these two engine-location categories, front and rear, are distinct enough to take engine-location as a potential good predictor of price. 

 Let's examine "drive-wheels" and "price".

In [None]:
# drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)

Here we see that the distribution of price between the different drive-wheels categories differs; as such drive-wheels could potentially be a predictor of price.

## 3.3 Descriptive Statistical Analysis

Let's first take a look at the variables by utilising a description method.

The **describe** function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.

This will show:
- the count of that variable
- the mean
- the standard deviation (std) 
- the minimum value
- the quartiles (25%, 50% and 75%), from which the IQR (Interquartile Range) can be computed
- the maximum value



 We can apply the method "describe" as follows:

In [None]:
df.describe()
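
The 25%, 50% and 75% rows of the summary are quantiles, and the interquartile range is the distance between the first and the third. The sketch below shows this on a hypothetical Series `prices`, using `quantile` and comparing with the values reported by `describe`.

```python
import pandas as pd

# Hypothetical prices
prices = pd.Series([5000.0, 7000.0, 9000.0, 11000.0, 13000.0], name="price")

# First and third quartile
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)

# Interquartile range: the spread of the middle half of the data
iqr = q3 - q1

# describe() reports the same quartiles under the labels "25%" and "75%"
summary = prices.describe()
print(q1, q3, iqr)
print(summary[["25%", "50%", "75%"]])
```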

 The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:

In [None]:
df.describe(include=['object'])

### Value Counts

Value-counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column 'drive-wheels'. Here we call "value_counts" on a Pandas Series, so we use one pair of brackets, "df['drive-wheels']", rather than two, "df[['drive-wheels']]", which would select a DataFrame.


In [None]:
df['drive-wheels'].value_counts()

We can convert the series to a Dataframe as follows :

In [None]:
df['drive-wheels'].value_counts().to_frame()

 Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and rename its single column to 'value_counts'.

In [None]:
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
# the default column name differs between pandas versions, so set it directly
drive_wheels_counts.columns = ['value_counts']
drive_wheels_counts

 Now let's rename the index to 'drive-wheels':

In [None]:
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts

We can repeat the above process for the variable 'engine-location'.

In [None]:
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
# the default column name differs between pandas versions, so set it directly
engine_loc_counts.columns = ['value_counts']
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)

After examining the value counts, we see that engine location would not be a good predictor variable for the price. This is because we only have three cars with a rear engine and 198 with an engine in the front, so the result is skewed. Thus, we are not able to draw any conclusions about the engine location.

## 3.4 Basic of Grouping

 The "groupby" method groups data by different categories. The data is grouped based on one or several variables and analysis is performed on the individual groups.

 For example, let's group by the variable "drive-wheels". We see that there are 3 different categories of drive wheels.

In [None]:
df['drive-wheels'].unique()

If we want to know, on average, which type of drive wheel is most valuable, we can group "drive-wheels" and then average them.

 We can select the columns 'drive-wheels', 'body-style' and 'price', then assign them to the variable "df_group_one".

In [None]:
df_group_one=df[['drive-wheels','body-style','price']]

We can then calculate the average price for each of the different categories of data.

In [None]:
# grouping results

# average only 'price'; the text column 'body-style' cannot be averaged
df_group_one = df_group_one.groupby(['drive-wheels'], as_index=False)['price'].mean()
df_group_one

From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.

You can also group by multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combinations of 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test1'.


In [None]:
# grouping results
df_gptest=df[['drive-wheels','body-style','price']]
grouped_test1=df_gptest.groupby(['drive-wheels','body-style'],as_index= False).mean()
grouped_test1

This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the dataframe to a pivot table using the method "pivot".

In this case, we will leave the drive-wheel variable as the rows of the table, and pivot body-style to become the columns of the table:

In [None]:
grouped_pivot=grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot

Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.

In [None]:
grouped_pivot=grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot
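To make the group, pivot and fill steps easier to follow, here is a minimal sketch on a tiny made-up table. The dataframe `toy` and its prices are invented for illustration and are not taken from the automobile dataset.

```python
import pandas as pd

# Hypothetical toy data: five cars, two categorical columns and a price
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "4wd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "sedan"],
    "price": [10000.0, 8000.0, 20000.0, 24000.0, 12000.0],
})

# Step 1: average price for every (drive-wheels, body-style) combination
grouped = toy.groupby(["drive-wheels", "body-style"], as_index=False)["price"].mean()

# Step 2: drive-wheels become the rows, body-style becomes the columns
pivot = grouped.pivot(index="drive-wheels", columns="body-style", values="price")

# Step 3: combinations that never occur are NaN; here we fill them with 0
pivot_filled = pivot.fillna(0)
print(pivot_filled)
```

Passing `values="price"` gives a table with plain column names. The notebook's own call omits it and therefore gets a two-level column index.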

### Question 12: Use the "groupby" function to find the average "price" of each car based on "body-style".

In [None]:
df_group_two=df[['body-style','price']]
df_group_two=df_group_two.groupby(['body-style'],as_index= False).mean()
df_group_two

If you haven't imported "pyplot" yet, import it now.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline 

#### Variables: Drive Wheels and Body Style vs Price

Let's use a heat map to visualize the relationship between Drive Wheels, Body Style and Price.

In [None]:
#use the grouped results
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

The heatmap plots the target variable (price) as colour, with 'drive-wheels' on the vertical axis and 'body-style' on the horizontal axis. This allows us to visualize how the price is related to 'drive-wheels' and 'body-style'.

The default labels convey no useful information to us. Let's change that:

In [None]:
import numpy as np

fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

# label names: body styles run along the x-axis, drive wheels along the y-axis
col_labels = grouped_pivot.columns.levels[1]
row_labels = grouped_pivot.index
# move ticks and labels to the center of each cell
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
# insert labels
ax.set_xticklabels(col_labels, minor=False)
ax.set_yticklabels(row_labels, minor=False)
# rotate the x labels so long names do not overlap
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python Visualizations course.

The main question we want to answer in this module is "What are the main characteristics which have the most impact on the car price?".

To get a better measure of the important characteristics, we look at the correlation of these variables with the car price, in other words: how is the car price dependent on this variable?

## 3.5 Correlation and Causation

 **Correlation**: a measure of the extent of interdependence between variables.

**Causation**: the relationship between cause and effect between two variables.

It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.

### Pearson Correlation
The Pearson Correlation measures the linear dependence between two variables X and Y.
The resulting coefficient is a value between -1 and 1 inclusive, where:
- **1**: total positive linear correlation,
- **0**: no linear correlation (the two variables may still be related in a non-linear way),
- **-1**: total negative linear correlation.
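The three reference values above can be checked with a short sketch that computes the Pearson coefficient directly from its definition. The helper name `pearson_r` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_pos = pearson_r(x, 2 * x + 1)      # perfectly increasing line
r_neg = pearson_r(x, -3 * x + 7)     # perfectly decreasing line
r_sym = pearson_r(x, (x - 3) ** 2)   # exact but non-linear (U-shaped) dependence

print(r_pos, r_neg, r_sym)
```

The last case is worth noting: y is completely determined by x, yet the coefficient is zero because the relationship is not linear.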


Pearson Correlation is the default method of the function "corr". Like before, we can calculate the Pearson correlation of the 'int64' or 'float64' variables.

In [None]:
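# numeric_only=True skips text columns, which recent pandas versions no longer drop automatically
df.corr(numeric_only=True)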

Sometimes we would like to know the significance of the correlation estimate.

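**P-value**: 
What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one we calculated if the two variables were in fact uncorrelated. A small P-value therefore suggests that the correlation is unlikely to be due to chance alone. Normally, we choose a significance level of 0.05: if the P-value falls below it, we call the correlation statistically significant, accepting a 5% chance of doing so when no correlation actually exists.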

By convention, when the p-value is:
- < 0.001, we say there is strong evidence that the correlation is significant;
- < 0.05, there is moderate evidence that the correlation is significant;
- < 0.1, there is weak evidence that the correlation is significant;
- > 0.1, there is no evidence that the correlation is significant.
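One way to build intuition for a p-value is a permutation test: shuffle one variable many times and count how often the shuffled data produce a correlation as strong as the observed one. This is a rough sketch on synthetic data. It is a different technique from the analytic p-value that `scipy.stats.pearsonr` reports, and the seed, sample size and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a strong linear trend plus noise
x = np.arange(30, dtype=float)
y = 2.0 * x + rng.normal(scale=5.0, size=x.size)

r_observed = np.corrcoef(x, y)[0, 1]

# Shuffling y destroys any real association, so the shuffled correlations
# show what "no correlation" looks like for this sample size
n_permutations = 2000
count_extreme = 0
for _ in range(n_permutations):
    r_shuffled = np.corrcoef(x, rng.permutation(y))[0, 1]
    if abs(r_shuffled) >= abs(r_observed):
        count_extreme += 1

# The +1 terms keep the estimate from being exactly zero
p_value = (count_extreme + 1) / (n_permutations + 1)
print(r_observed, p_value)
```

If almost no shuffle reaches the observed correlation, the estimated p-value is tiny, which corresponds to the "strong evidence" end of the convention above.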

We can obtain this information using the "stats" module in the "scipy" library.

In [None]:
from scipy import stats

### Wheel-base vs Price

 Let's calculate the  Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'. 

In [None]:
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

##### Conclusion: 
Since the p-value is < 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship is only moderate (~0.585).

### Horsepower vs Price

 Let's calculate the  Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.

In [None]:
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

##### Conclusion:

Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1)

### Length vs Price

 Let's calculate the  Pearson Correlation Coefficient and P-value of 'length' and 'price'.

In [None]:
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

##### Conclusion:
Since the p-value is < 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).

### Width vs Price

 Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price':

In [None]:
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value ) 

##### Conclusion:

Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).

### Curb-weight vs Price

 Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price':

In [None]:
pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

##### Conclusion:
Since the p-value is < 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).

### Engine-size vs Price

 Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price':

In [None]:
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 

##### Conclusion:
Since the p-value is < 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).

### Bore vs Price

 Let's calculate the  Pearson Correlation Coefficient and P-value of 'bore' and 'price':

In [None]:
pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value ) 

##### Conclusion:
Since the p-value is < 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).

We can repeat the process for 'city-mpg' and 'highway-mpg':

### City-mpg vs Price

In [None]:
pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

##### Conclusion:
Since the p-value is < 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of ~ -0.687 shows that the relationship is negative and moderately strong.

### Highway-mpg vs Price

In [None]:
pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value ) 

##### Conclusion:
Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of ~ -0.705 shows that the relationship is negative and moderately strong.

## 3.6 ANOVA

### ANOVA: Analysis of Variance
The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

**F-test score**: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.

**P-value**: The P-value tells us how statistically significant the calculated F-test score is.

If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.
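To see where the F-test score comes from, here is a minimal sketch that computes the one-way ANOVA statistic by hand: the variation between group means divided by the variation within groups. The function name `one_way_f` and the three small groups are made up for illustration. `scipy.stats.f_oneway` reports this same statistic together with a p-value.

```python
import numpy as np

def one_way_f(*groups):
    """F statistic of a one-way ANOVA for two or more groups of values."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = all_values.size      # total number of observations

    # variation of the group means around the overall mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_different = one_way_f([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0])
f_identical = one_way_f([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(f_different, f_identical)
```

Groups whose means are far apart relative to their internal spread give a large score. Groups with identical means give a score of zero.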

### Drive Wheels

Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average beforehand.

To see whether different types of 'drive-wheels' impact 'price', we group the data.

In [None]:
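# group by a single column name (not a one-element list) so that get_group('4wd') accepts a plain string
grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby('drive-wheels')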
grouped_test2.head(2)

We can obtain the values of a single group using the method "get_group".

In [None]:
grouped_test2.get_group('4wd')['price']

We can use the function 'f_oneway' in the module 'stats' to obtain the **F-test score** and **P-value**.

In [None]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val)   

This is a great result, with a large F-test score showing a strong correlation and a P-value of almost 0 implying almost certain statistical significance. But does this mean that all three groups differ from one another this strongly? Let's test them in pairs.

#### Separately: fwd and rwd

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val )

 Let's examine the other groups 

#### 4wd and rwd

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])  
   
print( "ANOVA results: F=", f_val, ", P =", p_val)   

#### 4wd and fwd

In [None]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])  
 
print("ANOVA results: F=", f_val, ", P =", p_val)   

### Important Variables

We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:

Continuous numerical variables:
- Length
- Width
- Curb-weight
- Engine-size
- Horsepower
- City-mpg
- Highway-mpg
- Wheel-base
- Bore

Categorical variables:
- Drive-wheels

As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.

# 4. Model Development

In this section, we will develop several models that will predict the price of the car using the variables or features. This is just an estimate but should give us an objective idea of how much the car should cost.

<h3>Questions:</h3> <h4>How do I know if the dealer is offering fair value for my trade-in and how do I know if I put a fair value on my car?</h4>
    
In Data Analytics, we often use **Model Development** to help us predict future observations from the data we have. 

A Model will help us understand the exact relationship between different variables and how these variables are used to predict the result.

## 4.1 Linear Regression and Multiple Linear Regression 

###  4.1.1 Linear Regression


One example of a data model that we will be using is **Simple Linear Regression**.
Simple Linear Regression is a method to help us understand the relationship between two variables:
- The predictor/independent variable (X)
- The response/dependent variable (that we want to predict)(Y)


The result of Linear Regression is a **linear function** that predicts the response (dependent) variable as a function of the predictor (independent) variable. 



\begin{equation*}
 Y: Response \ Variable\\
 X :Predictor\ Variables
\end{equation*}


  ### Linear function:
\begin{equation*}
Yhat = a + b  X
\end{equation*}


- a refers to the **intercept** of the regression, in other words: the value of Y when X is 0 
- b refers to the **slope** of the regression line, in other words: the amount by which Y changes when X increases by 1 unit.
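As a sketch of how a and b are obtained, the least-squares slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the point of means. The numbers below are made up so that the true line is known in advance. They are not values from the dataset.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 38000 - 800 * x
x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y = 38000.0 - 800.0 * x

x_centered = x - x.mean()
y_centered = y - y.mean()

# slope: how much y changes per unit of x
b = (x_centered * y_centered).sum() / (x_centered ** 2).sum()
# intercept: value of the fitted line at x = 0
a = y.mean() - b * x.mean()

print(a, b)
```

Because the toy data contain no noise, the recovered slope and intercept match the line used to generate them.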





#### Let's load the modules for linear regression

In [None]:
from sklearn.linear_model import LinearRegression

#### Create the linear regression object

In [None]:
lm = LinearRegression()
lm

### How could Highway-mpg help us predict car price?

For this example, we want to look at how highway-mpg can help us predict car price.
Using simple linear regression, we will create a linear function with "highway-mpg" as the predictor variable and the "price" as the response variable.

In [None]:
X = df[['highway-mpg']]
Y = df['price']

Fit the linear model using highway-mpg.

In [None]:
lm.fit(X,Y)

We can output a prediction:

In [None]:
Yhat=lm.predict(X)
Yhat[0:5]   

#### What is the value of the intercept (a) ?

In [None]:
lm.intercept_

#### What is the value of the Slope (b) ?

In [None]:
lm.coef_

### What is the final estimated linear model we get?

As we saw above, we should get a final linear model with the structure:

 \begin{equation*}
Yhat = a + b  X
\end{equation*}

Plugging in the actual values we get:

**price** = 38423.31 - 821.73 x  **highway-mpg**
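The fitted equation can be used directly to estimate a price. This small sketch copies the rounded intercept and slope from the line above. The function name `predict_price` and the example value of 30 mpg are assumptions made for illustration.

```python
# Rounded coefficients taken from the fitted model above
intercept = 38423.31
slope = -821.73

def predict_price(highway_mpg):
    """Estimated price for a given highway-mpg under the simple linear model."""
    return intercept + slope * highway_mpg

price_30 = predict_price(30)
price_31 = predict_price(31)
print(price_30, price_31 - price_30)
```

The difference between the two predictions equals the slope: each extra mile per gallon lowers the estimated price by about 821.73.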

### Question 13(a): Create a linear regression object.


In [None]:
lm1 = LinearRegression()
lm1 

### Question 13(b): Train the model using 'engine-size' as the independent variable and 'price' as the dependent variable.

In [None]:
X = df[['engine-size']]
Y = df['price']
lm1.fit(X,Y)

### Question 13(c): Find the slope and intercept of the model.


#### Slope 

In [None]:
lm1.coef_

#### Intercept

In [None]:
lm1.intercept_


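### Question 13(d): What is the equation of the predicted line? You can use X and Yhat, or 'engine-size' and 'price'.


Yhat = lm1.intercept_ + lm1.coef_[0] * X

Price = lm1.intercept_ + lm1.coef_[0] * engine-size

Substitute the intercept and slope obtained in Question 13(c). The slope is positive here, because engine-size and price are positively correlated (~0.872).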

### 4.1.2 Multiple Linear Regression

 What if we want to predict car price using more than one variable? 

If we want to use more variables in our model to predict car price, we can use **Multiple Linear Regression**.
Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and *two or more* predictor (independent) variables.
Most real-world regression models involve multiple predictors. We illustrate the structure using four predictor variables, but the results generalize to any number of predictors:


 \begin{equation*}
Y: Response \ Variable\\
X_1 :Predictor\ Variable \ 1\\
X_2: Predictor\ Variable \ 2\\
X_3: Predictor\ Variable \ 3\\
X_4: Predictor\ Variable \ 4\\
\end{equation*}


 \begin{equation*}
a: intercept\\
b_1 :coefficients \ of\ Variable \ 1\\
b_2: coefficients \ of\ Variable \ 2\\
b_3: coefficients \ of\ Variable \ 3\\
b_4: coefficients \ of\ Variable \ 4\\
\end{equation*}


 The equation is given by 

 \begin{equation*}
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
\end{equation*}
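The equation above can be fitted with ordinary least squares. The following sketch uses plain NumPy on synthetic data whose true intercept and coefficients are known, so we can check that they are recovered. The variable names and the random data are assumptions made for illustration and do not reproduce scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 50 samples, 4 predictor variables X_1 .. X_4
X = rng.normal(size=(50, 4))
true_a = 2.0
true_b = np.array([1.0, -2.0, 0.5, 3.0])
y = true_a + X @ true_b   # noise-free, so the fit should be exact

# Prepend a column of ones so the intercept a is estimated with the b's
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

a_hat, b_hat = coef[0], coef[1:]
print(a_hat, b_hat)
```

With real, noisy data the estimates would only approximate the underlying relationship.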



From the previous section  we know that other good predictors of price could be: 
- Horsepower
- Curb-weight
- Engine-size
- Highway-mpg

Let's develop a model using these variables as the predictor variables.

In [None]:
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]

Fit the linear model using the four above-mentioned variables.

In [None]:
lm.fit(Z, df['price'])


 What is the value of the intercept(a)?

In [None]:
lm.intercept_

 What are the values of the coefficients (b1, b2, b3, b4) ?

In [None]:
lm.coef_

 What is the final estimated linear model that we get?

As we saw above, we should get a final linear function with the structure:

 \begin{equation*}
Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
\end{equation*}

What is the linear function we get in this example?

**Price** = -15678.742628061467 + 52.65851272 x **horsepower** + 4.69878948 x **curb-weight** + 81.95906216 x **engine-size** + 33.58258185 x **highway-mpg**

### Question 14(a): Create and train a Multiple Linear Regression model "lm2" where the response variable is 'price' and the predictor variables are 'normalized-losses' and 'highway-mpg'.

In [None]:
from sklearn.linear_model import LinearRegression
lm2 = LinearRegression()
lm2.fit(df[['normalized-losses' , 'highway-mpg']],df['price'])
lm2.coef_

### Question 14(b): Find the coefficients of the model.

In [None]:
lm2.coef_

## 4.2  Model Evaluation using Visualization

Now that we've developed some models, how do we evaluate our models and how do we choose the best one? One way to do this is by using visualization.

Import the visualization package seaborn:

In [None]:
# import the visualization package: seaborn
import seaborn as sns
%matplotlib inline 

### 4.2.1 Regression Plot

When it comes to simple linear regression, an excellent way to visualise the fit of our model is by using **regression plots**.

This plot shows a combination of scattered data points (a **scatterplot**) and the fitted **linear regression** line going through the data. This gives us a reasonable estimate of the relationship between the two variables, the strength of the correlation, and the direction (positive or negative correlation).

Let's visualize highway-mpg as a potential predictor variable of price:

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)

We can see from this plot that price is negatively correlated to highway-mpg, since the regression slope is negative.
One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data, and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data. Let's compare this plot to the regression plot of "peak-rpm".

In [None]:
plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)

Comparing the regression plots of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" are much closer to the generated line and, on average, decrease. The points for "peak-rpm" have more spread around the predicted line, and it is much harder to determine whether the points are decreasing or increasing as "peak-rpm" increases.

### Question 15: Given the regression plots above, is "peak-rpm" or "highway-mpg" more strongly correlated with "price"? Use the method ".corr()" to verify your answer.


In [None]:
df[["peak-rpm","highway-mpg","price"]].corr()

### 4.2.2 Residual Plot

A good way to visualize the variance of the data is to use a residual plot.

What is a **residual**?

The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.

So what is a **residual plot**?

A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.

What do we pay attention to when looking at a residual plot?

We look at the spread of the residuals:

- If the points in a residual plot are **randomly spread out around the x-axis**, then a **linear model is appropriate** for the data. Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))
# recent seaborn versions require x and y to be passed as keyword arguments
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()

*What is this plot telling us?*

We can see from this residual plot that the residuals are not randomly spread around the x-axis, which leads us to believe that maybe a non-linear model is more appropriate for this data.
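A small numeric sketch shows what "not randomly spread" means. Here a straight line is fitted to made-up data that actually follow a curve, and the residuals are printed instead of plotted. The arrays are invented for illustration.

```python
import numpy as np

# Hypothetical data following a parabola, which a straight line cannot capture
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = x ** 2

# Fit a first-order polynomial (a straight line) and evaluate it at x
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# residual = observed value minus predicted value
residuals = y - y_hat
print(residuals)
```

The residuals are positive at both ends and negative in the middle. A systematic U-shape like this, rather than random scatter around zero, is the sign that a non-linear model may fit better.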

###  4.2.3 Multiple Linear Regression

How do we visualise a model for Multiple Linear Regression? This gets a bit more complicated because you can't visualise it with a regression or residual plot.

One way to look at the fit of the model is by looking at the **distribution plot**: We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.

First, let's make a prediction:

In [None]:
Y_hat = lm.predict(Z)

In [None]:
plt.figure(figsize=(width, height))


# kdeplot replaces the deprecated distplot(..., hist=False); Y_hat holds the multiple-regression predictions
ax1 = sns.kdeplot(df['price'], color="r", label="Actual Value")
sns.kdeplot(Y_hat, color="b", label="Fitted Values", ax=ax1)
plt.legend()


plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')

plt.show()
plt.close()

We can see that the fitted values are reasonably close to the actual values, since the two distributions overlap a bit. However, there is definitely some room for improvement.

## 4.3 Polynomial Regression and Pipelines 

**Polynomial regression** is a particular case of the general linear regression model or multiple linear regression models. 
We model non-linear relationships by adding squared or higher-order terms of the predictor variables.

There are different orders of polynomial regression:

<center>**Quadratic - 2nd order**</center>


 \begin{equation*}
Yhat = a + b_1 X + b_2 X^2
\\
\end{equation*}


 <center>**Cubic - 3rd order**</center>
 
 
 \begin{equation*}
Yhat = a + b_1 X + b_2 X^2 + b_3 X^3\\
\end{equation*}

<center> **Higher order**:</center>


 \begin{equation*}
Yhat = a + b_1 X + b_2 X^2 + b_3 X^3 + \ldots\\
\end{equation*}
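As a quick illustration of these formulas, the sketch below generates points from a known cubic and recovers its coefficients with `np.polyfit`. The coefficients 1, 2, -3 and 0.5 are arbitrary choices made for this example.

```python
import numpy as np

# Hypothetical cubic: Yhat = 1 + 2 X - 3 X^2 + 0.5 X^3
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.5 * x ** 3

# polyfit returns the coefficients from the highest power down to the constant
coeffs = np.polyfit(x, y, 3)

# poly1d wraps the coefficients in a callable polynomial
p = np.poly1d(coeffs)

print(coeffs)
print(p(1.0))
```

Note the ordering: the first returned coefficient belongs to X^3 and the last one is the intercept a.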

We saw earlier that a linear model did not provide the best fit while using highway-mpg as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.

 We will use the following function to plot the data:

In [None]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # evaluate the fitted polynomial on an evenly spaced grid of x values
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)

    # data points as dots, fitted polynomial as a line
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ ' + str(Name))
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')

    plt.show()
    plt.close()

print("done")

Let's get the variables:

In [None]:
x = df['highway-mpg']
y = df['price']
print("done")

Let's fit the polynomial using the function **polyfit**, then use the function **poly1d** to display the polynomial function.

In [None]:
# Here we use a polynomial of the 3rd order (cubic) 
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

 Let's plot the function 

In [None]:
PlotPolly(p, x, y, 'highway-mpg')

In [None]:
np.polyfit(x, y, 3)

We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function  "hits" more of the data points.

### Question 16: Create an 11th-order polynomial model with the variables x and y from above.

In [None]:
# calculate polynomial
# Here we use a polynomial of the 11th order
f1 = np.polyfit(x, y, 11)
p1 = np.poly1d(f1)
print(p1)
PlotPolly(p1, x, y, 'highway-mpg')

The analytical expression for a multivariate polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:

\begin{equation*}
Yhat = a + b_1 X_1 +b_2 X_2 +b_3 X_1 X_2+b_4 X_1^2+b_5 X_2^2
\end{equation*}

 We can perform a polynomial transform on multiple features. First, we import the  module:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

We create a **PolynomialFeatures** object of degree 2: 

In [None]:
pr=PolynomialFeatures(degree=2)
pr

In [None]:
Z_pr=pr.fit_transform(Z)

The original data has 201 samples and 4 features:

In [None]:
Z.shape

After the transformation, there are 201 samples and 15 features:

In [None]:
Z_pr.shape
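Where do the 15 features come from? A degree-2 transform of 4 variables produces one constant term, the 4 original variables, and 10 products of two variables (including squares). The sketch below enumerates these terms with the standard library. The helper name `polynomial_terms` is made up for illustration and is not part of scikit-learn.

```python
from itertools import combinations_with_replacement
from math import comb

def polynomial_terms(n_features, degree):
    """List every monomial up to the given degree as a tuple of feature indices."""
    terms = []
    for d in range(degree + 1):
        # d = 0 gives the constant term (), d = 1 the features, d = 2 the pairwise products
        terms.extend(combinations_with_replacement(range(n_features), d))
    return terms

terms_4 = polynomial_terms(4, 2)
terms_2 = polynomial_terms(2, 2)

print(len(terms_4), comb(4 + 2, 2))
print(terms_2)
```

For two variables this gives 6 terms, matching the constant a and the five coefficients b_1 to b_5 in the two-variable equation above.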

### 4.3.1 Pipeline 

Data Pipelines simplify the steps of processing the data. We use the module  **Pipeline** to create a pipeline. We also use **StandardScaler** as a step in our pipeline.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

We create the pipeline, by creating a list of tuples including the name of the model or estimator and its corresponding constructor. 

In [None]:
Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]

We input the list as an argument to the pipeline constructor:

In [None]:
pipe=Pipeline(Input)
pipe

We can normalize the data,  perform a transform and fit the model simultaneously. 

In [None]:
pipe.fit(Z,y)

 Similarly,  we can normalize the data, perform a transform and produce a prediction  simultaneously

In [None]:
ypipe=pipe.predict(Z)
ypipe[0:4]
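To make clear what a pipeline does behind the scenes, here is a simplified sketch with only two steps, scaling and a linear model, written in plain NumPy. The class `TinyPipeline` and the toy arrays are hypothetical stand-ins, not scikit-learn's implementation. The key idea is that `fit` learns the scaling parameters from the training data and `predict` reuses them.

```python
import numpy as np

class TinyPipeline:
    """Hypothetical two-step pipeline: standardize the features, then fit a linear model."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        # step 1: learn the scaling parameters from the training data
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        # step 2: least-squares fit on the scaled features
        design = self._design(X)
        self.coef_, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
        return self

    def _design(self, X):
        scaled = (np.asarray(X, dtype=float) - self.mean_) / self.std_
        # a column of ones lets the model learn an intercept
        return np.column_stack([np.ones(len(scaled)), scaled])

    def predict(self, X):
        # new data is scaled with the parameters learned in fit, then passed to the model
        return self._design(X) @ self.coef_

# Made-up data that follow y = 5 + 2 * x1 + 0.1 * x2 exactly
X_toy = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0], [4.0, 50.0], [5.0, 40.0]])
y_toy = 5.0 + 2.0 * X_toy[:, 0] + 0.1 * X_toy[:, 1]

pipe_toy = TinyPipeline().fit(X_toy, y_toy)
pred_train = pipe_toy.predict(X_toy)
pred_new = pipe_toy.predict(np.array([[6.0, 60.0]]))
print(pred_train, pred_new)
```

Bundling the steps this way means the same transformation is always applied before prediction, which is easy to forget when the steps are run by hand.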

### Question 17: Create a pipeline that standardizes the data, then performs prediction with a linear regression model using the features Z and targets y.

In [None]:
Input=[('scale',StandardScaler()),('model',LinearRegression())]

pipe=Pipeline(Input)

pipe.fit(Z,y)

ypipe=pipe.predict(Z)
ypipe[0:10]

## 4.4 Measures for In-Sample Evaluation

When evaluating our models, not only do we want to visualise the results, but we also want a quantitative measure to determine how accurate the model is.

Two very important measures that are often used in Statistics to determine the accuracy of a model are:

- R^2 / R-squared
- Mean Squared Error (MSE)

### 4.4.1 R-squared

R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.
The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.



### 4.4.2 Mean Squared Error (MSE)

The Mean Squared Error measures the average of the squares of errors, that is, the difference between actual value (y) and the estimated value (ŷ).
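Both measures are easy to compute by hand, which helps in understanding what the library functions report. This is a minimal sketch on four made-up values. The function names `mse` and `r_squared` are chosen for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return ((y_true - y_pred) ** 2).mean()

def r_squared(y_true, y_pred):
    """1 minus the ratio of unexplained variation to total variation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_residual = ((y_true - y_pred) ** 2).sum()
    ss_total = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_residual / ss_total

# Hypothetical actual values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print(mse(y_true, y_pred), r_squared(y_true, y_pred))
```

Here the squared errors sum to 2 out of a total variation of 20, so the predictions explain 90% of the variation.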

### Model 1: Simple Linear Regression

Let's calculate the R^2

In [None]:
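#highway_mpg_fit
# X was reassigned to 'engine-size' in Question 13, so select 'highway-mpg' again
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
# Find the R^2
lm.score(X, Y)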

We can say that ~ 49.659% of the variation of the price is explained by this simple linear model "highway_mpg_fit".

Let's calculate the MSE

We can predict the output i.e., "yhat" using the predict method, where X is the input variable:

In [None]:
Yhat=lm.predict(X)
Yhat[0:4]

Let's import the function **mean_squared_error** from the module **metrics**:

In [None]:
from sklearn.metrics import mean_squared_error

We compare the predicted results with the actual results:

In [None]:
#mean_squared_error(Y_true, Y_predict)
mean_squared_error(df['price'], Yhat)

### Model 2: Multiple Linear Regression

Let's calculate the R^2

In [None]:
# fit the model 
lm.fit(Z, df['price'])
# Find the R^2
lm.score(Z, df['price'])

We can say that ~ 80.896 % of the variation of price is explained by this multiple linear regression "multi_fit".

Let's calculate the MSE

 we produce a prediction 

In [None]:
Y_predict_multifit = lm.predict(Z)


 we compare the predicted results with the actual results 

In [None]:
mean_squared_error(df['price'], Y_predict_multifit)

### Model 3: Polynomial Fit

Let's calculate the R^2

Let's import the function **r2_score** from the module **metrics**, as we are using a different function:


In [None]:
from sklearn.metrics import r2_score

We apply the function to get the value of R^2:

In [None]:
r_squared = r2_score(y, p(x))
r_squared

We can say that ~ 67.419 % of the variation of price is explained by this polynomial fit

### MSE

 We can also calculate the MSE:  

In [None]:
mean_squared_error(df['price'], p(x))

##  4.5 Prediction and Decision Making
### Prediction

In the previous section, we trained the model using the method **fit**. Now we will use the method **predict** to produce a prediction. Let's import **pyplot** for plotting; we will also be using some functions from numpy.
 


In [None]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline 

Create a  new input 

In [None]:
new_input=np.arange(1,100,1).reshape(-1,1)

 Fit the model 

In [None]:
lm.fit(X, Y)
lm

Produce a prediction 

In [None]:
yhat=lm.predict(new_input)
yhat[0:5]

We can plot the data:

In [None]:
plt.plot(new_input,yhat)
plt.show()

### Decision Making: Determining a Good Model Fit

Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?

- *What is a good R-squared value?*

When comparing models, **the model with the higher R-squared value is a better fit** for the data.


- *What is a good MSE?*

When comparing models, **the model with the smallest MSE value is a better fit** for the data.

#### Let's take a look at the values for the different models.
Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.
- R-squared: 0.49659118843391759
- MSE: 3.16 x10^7

Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.
- R-squared: 0.80896354913783497
- MSE: 1.2 x10^7

Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.
- R-squared: 0.6741946663906514
- MSE: 2.05 x 10^7
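The comparison rule (higher R-squared, lower MSE) can also be applied programmatically. This sketch copies the rounded values listed above into a dictionary. The dictionary layout and variable names are assumptions made for illustration.

```python
# Rounded R-squared and MSE values copied from the list above
results = {
    "Simple Linear Regression": {"r2": 0.4966, "mse": 3.16e7},
    "Multiple Linear Regression": {"r2": 0.8090, "mse": 1.2e7},
    "Polynomial Fit": {"r2": 0.6742, "mse": 2.05e7},
}

# The better fit has the highest R-squared and the lowest MSE
best_by_r2 = max(results, key=lambda name: results[name]["r2"])
best_by_mse = min(results, key=lambda name: results[name]["mse"])

print(best_by_r2, best_by_mse)
```

When both criteria point to the same model, as they do here, the choice is straightforward. If they disagree, the trade-off needs to be examined more closely.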

### Simple Linear Regression model (SLR) vs Multiple Linear Regression model (MLR)

Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful or may even act as noise. As a result, you should always check the MSE and R^2.

So to be able to compare the results of the MLR vs SLR models, we look at a combination of both the R-squared and MSE to make the best conclusion about the fit of the model.
 

- **MSE**: 
The MSE of SLR is 3.16 x 10^7, while MLR has an MSE of 1.2 x 10^7. The MSE of MLR is much smaller.


- **R-squared**: 
In this case, we can also see that there is a big difference between the R-squared of the SLR and the R-squared of the MLR. The R-squared for the SLR (~0.497) is very small compared to the R-squared for the MLR (~0.809). 

This R-squared, in combination with the MSE, shows that MLR is the better model fit in this case, compared to SLR.

### Simple Linear Model (SLR) vs Polynomial Fit

- **MSE**: We can see that Polynomial Fit brought down the MSE, since this MSE is smaller than the one from the SLR. 

- **R-squared**: The R-squared for the Polyfit is larger than the R-squared for the SLR, so the Polynomial Fit also brought up the R-squared quite a bit.

Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that this was a better fit model than the simple linear regression for predicting Price with Highway-mpg as a predictor variable.

### Multiple Linear Regression (MLR) vs Polynomial Fit

- **MSE**: The MSE for the MLR is smaller than the MSE for the Polynomial Fit.
- **R-squared**: The R-squared for the MLR is also much larger than for the Polynomial Fit.

### Conclusion

Comparing these three models, we conclude that the MLR model is the best model for predicting price from our dataset. This result makes sense, since we have 27 variables in total, and we know that more than one of those variables is a potential predictor of the final car price. 

# 5. Model Evaluation and Refinement 

We have built models and made predictions of vehicle prices. Now we will determine how accurate these predictions are. 




First, let's use only the numeric data:

In [None]:
df=df._get_numeric_data()

Libraries for plotting and interactive widgets:

In [None]:
import seaborn as sns
from IPython.display import display
# IPython.html is a deprecated alias; the widgets now live in the ipywidgets package
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
print("done")

## Functions for plotting 

In [None]:
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    # Overlay the density curves of two sets of values, e.g. actual vs predicted prices
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    # sns.distplot is deprecated; kdeplot draws the same density curve without a histogram
    ax1 = sns.kdeplot(RedFunction, color="r", label=RedName)
    ax2 = sns.kdeplot(BlueFunction, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')
    plt.legend()

    plt.show()
    plt.close()
    

In [None]:
def PollyPlot(xtrain, xtest, y_train, y_test, lr, poly_transform):
    # xtrain, xtest: single-feature DataFrames with the training and testing inputs
    # y_train, y_test: training and testing targets
    # lr: fitted linear regression object
    # poly_transform: fitted polynomial transformation object
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    # Evaluate the model on an evenly spaced grid covering the full range of the feature
    xmax = max([xtrain.values.max(), xtest.values.max()])
    xmin = min([xtrain.values.min(), xtest.values.min()])
    x = np.arange(xmin, xmax, 0.1)

    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    # Use transform (not fit_transform) so the already-fitted transformation is reused
    plt.plot(x, lr.predict(poly_transform.transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()

<a id="ref1"></a>

## 5.1 Training and Testing

An important step in testing your model is to split your data into training and testing data. We will place the target data **price** in a separate variable **y_data**:

In [None]:
y_data=df['price']

Drop the price column to obtain the feature data **x_data**:

In [None]:
x_data=df.drop('price',axis=1)

Now we randomly split our data into training and testing data using the function **train_test_split**:

In [None]:
from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])


The **test_size** parameter sets the proportion of data that is split into the testing set. In the above, the testing set is set to 15% of the total dataset. 
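To make the effect of **test_size** concrete, here is a small sketch on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins for `x_data` and `y_data`. With 200 rows and `test_size=0.15`, 30 rows go to the testing set and 170 to the training set, and repeating the call with the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for x_data / y_data: 200 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1
)
print("number of test samples :", X_te.shape[0])
print("number of training samples:", X_tr.shape[0])

# The same random_state gives the same split on a second call
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=1
)
same_split = np.array_equal(X_te, X_te2)
print("same split:", same_split)
```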

### Question 18: Use the function "train_test_split" to split up the data set such that 40% of the data samples will be utilized for testing, set the parameter "random_state" equal to zero. The output of the function should be the following:  "x_train_1" , "x_test_1", "y_train_1" and  "y_test_1".

In [None]:
# Use the output names requested in the question so the earlier split is not overwritten
x_train_1, x_test_1, y_train_1, y_test_1 = train_test_split(x_data, y_data, test_size=0.40, random_state=0)

 Let's import **LinearRegression** from the module **linear_model**

In [None]:
from sklearn.linear_model import LinearRegression

 We create a Linear Regression object:

In [None]:
lre=LinearRegression()

We fit the model using the feature horsepower:

In [None]:
lre.fit(x_train[['horsepower']],y_train)

Let's calculate the R^2 on the test data:

In [None]:
lre.score(x_test[['horsepower']],y_test)

We can see the R^2 is much smaller on the test data than on the training data:

In [None]:
lre.score(x_train[['horsepower']],y_train)

### Question 19: Find the R^2  on the test data using 90% of the data for training data

In [None]:
# test_size=0.1 leaves 90% of the data for training
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size= 0.1, random_state = 0)
lre.fit(x_train1[["horsepower"]], y_train1)
lre.score(x_test1[["horsepower"]], y_test1)

Sometimes you do not have sufficient testing data; as a result, you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation. 

## Cross-validation Score 

Let's import **cross_val_score** from the module **model_selection**:

In [None]:
from sklearn.model_selection import cross_val_score
print("done")

We input the object, the feature (in this case 'horsepower'), and the target data (y_data). The parameter 'cv' determines the number of folds; in this case 4. 

In [None]:
Rcross=cross_val_score(lre,x_data[['horsepower']], y_data,cv=4)

The default scoring is R^2; each element in the array is the R^2 value for one fold:

In [None]:
Rcross

 We can calculate the average and standard deviation of our estimate:

In [None]:
print("The mean of the folds is", Rcross.mean(),"and the standard deviation is" ,Rcross.std())

We can use negative mean squared error as a score by setting the parameter 'scoring' to 'neg_mean_squared_error'. Multiplying the result by -1 gives the MSE of each fold:

In [None]:
-1*cross_val_score(lre,x_data[['horsepower']], y_data,cv=4,scoring='neg_mean_squared_error')
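The following sketch illustrates both scoring options on synthetic data. The arrays `hp` and `price` are hypothetical stand-ins for the horsepower feature and the price target, generated here only for illustration. scikit-learn scorers follow a "higher is better" convention, which is why the MSE is reported with a negative sign and has to be flipped back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for horsepower and price (not the automobile data)
rng = np.random.default_rng(0)
hp = rng.uniform(50, 250, size=(120, 1))
price = 150.0 * hp[:, 0] + rng.normal(0, 3000, size=120)

model = LinearRegression()

# One R^2 value per fold (R^2 is the default score for regressors)
r2_folds = cross_val_score(model, hp, price, cv=4)

# One negative MSE per fold; multiply by -1 to recover the MSE
neg_mse_folds = cross_val_score(model, hp, price, cv=4,
                                scoring="neg_mean_squared_error")
mse_folds = -neg_mse_folds

print("R^2 per fold:", r2_folds)
print("MSE per fold:", mse_folds)
```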

### Question 20: Calculate the R^2 using two folds, then find the R^2 for the second fold, using horsepower as the feature:


In [None]:
Rcross2 = cross_val_score(lre,x_data[['horsepower']], y_data, cv= 2)
Rcross2[1]

You can also use the function 'cross_val_predict' to predict the output. The function splits the data into the specified number of folds; each fold in turn is held out to get predictions, while the remaining folds are used as training data. First import the function:

In [None]:
from sklearn.model_selection import cross_val_predict

 We input the object, the feature in this case **'horsepower'** , the target data **y_data**. The parameter 'cv' determines the number of folds; in this case 4.  We can produce an output:

In [None]:
yhat=cross_val_predict(lre,x_data[['horsepower']], y_data,cv=4)
yhat[0:5]
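To show what 'cross_val_predict' does internally, the sketch below reproduces its output with an explicit loop over the folds. It uses synthetic data (`X_demo` and `y_demo` are hypothetical names) and relies on the fact that, for a regressor, an integer `cv` corresponds to an unshuffled `KFold` split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic single-feature data (not the automobile data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(50, 250, size=(40, 1))
y_demo = 150.0 * X_demo[:, 0] + rng.normal(0, 3000, size=40)

yhat_cv = cross_val_predict(LinearRegression(), X_demo, y_demo, cv=4)

# Manual version: train on three folds, predict the held-out fold
yhat_manual = np.empty_like(y_demo)
for train_idx, test_idx in KFold(n_splits=4).split(X_demo):
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    yhat_manual[test_idx] = fold_model.predict(X_demo[test_idx])

print("identical predictions:", np.allclose(yhat_cv, yhat_manual))
```

Every sample therefore receives a prediction from a model that never saw it during training.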


<a id="ref2"></a>

##  5.2 Overfitting, Underfitting and Model Selection 

It turns out that the test data, sometimes referred to as the out-of-sample data, is a much better measure of how well your model performs in the real world. One reason for this is overfitting; let's go over some examples. These differences are more apparent in Multiple Linear Regression and Polynomial Regression, so we will explore overfitting in that context.

Let's create Multiple linear regression objects and train the model using **'horsepower'**, **'curb-weight'**, **'engine-size'** and **'highway-mpg'** as features.

In [None]:
lr=LinearRegression()
lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']],y_train)

Prediction using training data:

In [None]:
yhat_train=lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
yhat_train[0:5]

 Prediction using test data: 

In [None]:
yhat_test=lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
yhat_test[0:5]

Let's perform some model evaluation using our training and testing data separately. First we import the seaborn and matplotlib libraries for plotting.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Let's examine the distribution of the predicted values of the training data.

In [None]:
Title='Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution '
DistributionPlot(y_train,yhat_train,"Actual Values (Train)","Predicted Values (Train)",Title)

Figure 1: Plot of predicted values using the training data compared to the training data. 

So far the model seems to be doing well in learning from the training dataset. But what happens when the model encounters new data from the testing dataset? When the model generates new values from the test data, we see the distribution of the predicted values is much different from the actual target values. 

In [None]:
Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test,"Actual Values (Test)","Predicted Values (Test)",Title)

Figure 2: Plot of predicted values using the test data compared to the test data. 

Comparing Figure 1 and Figure 2, it is evident that the predicted distribution in Figure 1 (training data) fits the actual values much better than the one in Figure 2 (test data). The difference in Figure 2 is most apparent in the range from 5,000 to 15,000, where the shapes of the two distributions differ markedly. Let's see if polynomial regression also exhibits a drop in prediction accuracy when analysing the test dataset. 

In [None]:
from sklearn.preprocessing import PolynomialFeatures
print("done")

####  Overfitting 
Overfitting occurs when the model fits the noise, not the underlying process. As a result, the model does not perform as well on the test set, because it is modelling noise rather than the process that generated the relationship. Let's create a degree 5 polynomial model.

Let's use 55 percent of the data for training and the rest for testing:

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)
print("done")

We will perform a degree 5 polynomial transformation on the feature **'horsepower'**. 

In [None]:
pr=PolynomialFeatures(degree=5)
x_train_pr=pr.fit_transform(x_train[['horsepower']])
# Reuse the transformation fitted on the training data for the test data
x_test_pr=pr.transform(x_test[['horsepower']])
pr

Now let's create a linear regression model "poly" and train it.

In [None]:
poly=LinearRegression()
poly.fit(x_train_pr,y_train)

We can see the output of our model using the method "predict", then assign the values to "yhat".

In [None]:
yhat=poly.predict(x_test_pr )
yhat[0:5]

Let's take the first four predicted values and compare them to the actual targets. 

In [None]:
print("Predicted values:", yhat[0:4])
print("True values:",y_test[0:4].values)

We will use the function "PollyPlot" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function.

In [None]:
PollyPlot(x_train[['horsepower']],x_test[['horsepower']],y_train,y_test,poly,pr)

Figure 4: A polynomial regression model. Red dots represent training data, green dots represent test data, and the blue line represents the model prediction. 

We see that the estimated function appears to track the data but around 200 horsepower, the function begins to diverge from the data points. 

 R^2 of the training data:

In [None]:
poly.score(x_train_pr, y_train)

 R^2 of the test data:

In [None]:
poly.score(x_test_pr, y_test)

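We see the R^2 for the training data is 0.5567, while the R^2 on the test data is -29.87. The lower the R^2, the worse the model; a negative R^2 means the model predicts the test data worse than a constant prediction equal to the mean of the test targets, which here is a sign of overfitting.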
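A small numeric sketch can show how R^2 becomes negative. Since R^2 = 1 - SS_res / SS_tot, any prediction whose squared error exceeds that of the constant mean prediction gives a value below zero. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Baseline: always predict the mean of the targets, which gives R^2 = 0
mean_only = np.full_like(y_true, y_true.mean())

# A prediction that is perfect except for one extreme value
wild = np.array([10.0, 12.0, 14.0, 16.0, 40.0])

r2_mean = r2_score(y_true, mean_only)
r2_wild = r2_score(y_true, wild)

print("R^2 of the mean-only prediction:", r2_mean)
print("R^2 with one extreme prediction:", r2_wild)
```

Here SS_tot is 40 and the single extreme prediction contributes a squared error of 484, so R^2 = 1 - 484/40 = -11.1. A high-degree polynomial that swings far away from the data outside the training range produces the same effect.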

Let's see how the R^2 changes on the test data for different order polynomials and plot the results:

In [None]:
Rsqu_test=[]

order=[1,2,3,4]
for n in order:
    pr=PolynomialFeatures(degree=n)
    
    # Fit the transformation on the training data, then reuse it for the test data
    x_train_pr=pr.fit_transform(x_train[['horsepower']])
    
    x_test_pr=pr.transform(x_test[['horsepower']])    
    
    lr.fit(x_train_pr,y_train)
    
    # Record the test R^2 for this polynomial order
    Rsqu_test.append(lr.score(x_test_pr,y_test))

plt.plot(order,Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.text(3, 0.75, 'Maximum R^2 ')    

 We see the R^2 gradually increases until an order three polynomial is used. Then the  R^2 dramatically decreases at four.
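The same experiment can be sketched on synthetic data where the true relationship is known. The data below is generated from a quadratic with added noise, which is an assumption made purely for illustration. The training R^2 can only go up as the polynomial order increases, because each model contains the previous one, while the test R^2 need not follow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a quadratic relationship plus noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(-1, 1, size=(60, 1))
y_demo = (1.0 + 2.0 * x_demo[:, 0] - 3.0 * x_demo[:, 0] ** 2
          + rng.normal(0, 0.3, size=60))

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.45, random_state=0
)

degrees = [1, 2, 3, 4, 5, 6]
train_r2, test_r2 = [], []
for d in degrees:
    pf = PolynomialFeatures(degree=d)
    x_tr_pf = pf.fit_transform(x_tr)  # fit the expansion on the training data
    x_te_pf = pf.transform(x_te)      # reuse it unchanged on the test data
    m = LinearRegression().fit(x_tr_pf, y_tr)
    train_r2.append(m.score(x_tr_pf, y_tr))
    test_r2.append(m.score(x_te_pf, y_te))

for d, tr, te in zip(degrees, train_r2, test_r2):
    print(f"degree {d}: train R^2 = {tr:.3f}, test R^2 = {te:.3f}")
```

Comparing the two columns is a simple way to pick an order: choose the one where the test R^2 peaks rather than the one with the best training score.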

 The following function will be used in the next section; please run the cell.

In [None]:
def f(order,test_data):
    # Split the data, fit a polynomial model of the given order, and plot it against the data
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_data, random_state=0)
    pr=PolynomialFeatures(degree=order)
    x_train_pr=pr.fit_transform(x_train[['horsepower']])
    x_test_pr=pr.transform(x_test[['horsepower']])
    poly=LinearRegression()
    poly.fit(x_train_pr,y_train)
    PollyPlot(x_train[['horsepower']],x_test[['horsepower']],y_train,y_test,poly,pr)


The following interface allows you to experiment with different polynomial orders and different amounts of data. 

In [None]:
interact(f, order=(0,6,1),test_data=(0.05,0.95,0.05))

### Question 21(a): We can perform polynomial transformations with more than one feature. Create a "PolynomialFeatures" object "pr1" of degree two.


In [None]:
pr1=PolynomialFeatures(degree=2)

### Question 21(b): Transform the training and testing samples for the features 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg'. Hint: use the method "fit_transform" 


In [None]:
# Use the degree-two object pr1 created in Question 21(a)
x_train_pr1=pr1.fit_transform(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
x_test_pr1=pr1.transform(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])

### Question 21(c): How many dimensions does the new feature have? Hint: use the attribute "shape"
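As a hedged sketch of how to check this, the example below applies a degree-two transformation to a small made-up array with four features (`X_demo` is a hypothetical stand-in for the four selected columns). A degree-two expansion of four features contains every monomial of total degree at most two, including the constant term, which gives C(4+2, 2) = 15 columns.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: 3 samples, 4 features
X_demo = np.arange(12, dtype=float).reshape(3, 4)

pr_demo = PolynomialFeatures(degree=2)
X_demo_pr = pr_demo.fit_transform(X_demo)

print("shape before:", X_demo.shape)
print("shape after :", X_demo_pr.shape)

# Number of monomials of total degree <= 2 in 4 variables
n_expected = comb(4 + 2, 2)
print("expected number of columns:", n_expected)
```

The number of rows is unchanged; only the number of columns grows.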

### Question  21(d): Create a linear regression model "poly1" and train the object using the method "fit" with the polynomial features.



In [None]:
from sklearn import linear_model
poly1=linear_model.LinearRegression().fit(x_train_pr1,y_train)

### Question  21(e): Use the method "predict" to predict an output on the polynomial features, then use the function "DistributionPlot" to display the distribution of the predicted output vs the test data.

In [None]:
# Predict on the transformed test features so the result is comparable with y_test
yhat_test1=poly1.predict(x_test_pr1)
Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test1,"Actual Values (Test)","Predicted Values (Test)",Title)

### Question 21(f): Use the distribution plot to determine the two regions where the predicted prices are less accurate than the actual prices.

<img src="https://ibm.box.com/shared/static/c35ipv9zeanu7ynsnppb8gjo2re5ugeg.png" width="700" align="center" />



<a id="ref3"></a>

## 5.3 Ridge regression 

In this section, we will review Ridge Regression and see how the parameter alpha changes the model. Note that our test data will be used as validation data.

 Let's perform a degree two polynomial transformation on our data. 

In [None]:
pr=PolynomialFeatures(degree=2)
x_train_pr=pr.fit_transform(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])
# Reuse the transformation fitted on the training data for the test data
x_test_pr=pr.transform(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])

Let's import **Ridge** from the module **linear_model**.

In [None]:
from sklearn.linear_model import Ridge

Let's create a Ridge regression object, setting the regularization parameter to 0.1 

In [None]:
RigeModel=Ridge(alpha=0.1)

 Like regular regression, you can fit the model using the method **fit**.

In [None]:
RigeModel.fit(x_train_pr,y_train)

 Similarly, you can obtain a prediction: 

In [None]:
yhat=RigeModel.predict(x_test_pr)

Let's compare the first four predicted samples to our test set:

In [None]:
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)

We select the value of alpha that minimizes the test error; for example, we can use a for loop:

In [None]:
Rsqu_test=[]
Rsqu_train=[]
# Candidate alpha values: 0, 5000, 10000, ...
ALFA=5000*np.array(range(0,10000))
for alfa in ALFA:
    RigeModel=Ridge(alpha=alfa) 
    RigeModel.fit(x_train_pr,y_train)
    # Record the R^2 on the validation (test) data and on the training data
    Rsqu_test.append(RigeModel.score(x_test_pr,y_test))
    Rsqu_train.append(RigeModel.score(x_train_pr,y_train))

We can plot out the value of R^2 for different Alphas 

In [None]:
width = 12
height = 10
plt.figure(figsize=(width, height))

plt.plot(ALFA,Rsqu_test,label='validation data  ')
plt.plot(ALFA,Rsqu_train,'r',label='training Data ')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()


Figure 6: The blue line represents the R^2 of the test data (used here as validation data), and the red line represents the R^2 of the training data. The x-axis represents the different values of alpha.

The red line in Figure 6 represents the R^2 of the training data: as alpha increases, the R^2 decreases, so the model performs worse on the training data. The blue line represents the R^2 on the validation data: as the value of alpha increases, the R^2 decreases.
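The reason the training R^2 falls as alpha grows is that the penalty shrinks the coefficients toward zero. The sketch below shows this shrinkage on synthetic data; `X_demo`, `y_demo` and `true_coef` are hypothetical and generated only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(0, 0.5, size=50)

alphas = [0.1, 10.0, 1000.0]
coef_norms = []
train_scores = []
for a in alphas:
    model = Ridge(alpha=a).fit(X_demo, y_demo)
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    train_scores.append(model.score(X_demo, y_demo))

for a, norm, score in zip(alphas, coef_norms, train_scores):
    print(f"alpha={a}: coefficient norm = {norm:.3f}, train R^2 = {score:.3f}")
```

A larger alpha gives a smaller coefficient norm and a lower training R^2; whether the validation R^2 improves depends on how much the unpenalized model was overfitting.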

### Question 22: Perform Ridge regression and calculate the R^2 using the polynomial features, use the training data to train the model and test data to test the model. The parameter alpha should be set to  10.


In [None]:
RigeModel=Ridge(alpha=10) 
RigeModel.fit(x_train_pr,y_train)
RigeModel.score(x_test_pr, y_test)

<a id="ref4"></a>

## 5.4 Grid Search

The term alpha is a hyperparameter. sklearn has the class **GridSearchCV** to make the process of finding the best hyperparameter simpler.

 Let's import **GridSearchCV** from  the module **model_selection**

In [None]:
from sklearn.model_selection import GridSearchCV
print("done")

We create a dictionary of parameter values:

In [None]:
parameters1= [{'alpha': [0.001,0.1,1, 10, 100, 1000,10000,100000]}]
parameters1

Create a ridge regression object:

In [None]:
RR=Ridge()
RR

Create a ridge grid search object 

In [None]:
Grid1 = GridSearchCV(RR, parameters1,cv=4)

Fit the model 

In [None]:
Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']],y_data)

The object finds the best parameter values on the validation data. We can obtain the estimator with the best parameters and assign it to the variable BestRR as follows:

In [None]:
BestRR=Grid1.best_estimator_
BestRR
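To make clear what the grid search does, the sketch below compares it with a manual loop that cross-validates each candidate alpha and keeps the one with the highest mean R^2. It uses synthetic data, and the names `X_demo`, `y_demo` and `alphas` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression problem (not the automobile data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 1.0, size=80)

alphas = [0.01, 1, 100, 10000]

# Grid search: cross-validate every alpha, then refit the best on all the data
grid = GridSearchCV(Ridge(), {"alpha": alphas}, cv=4)
grid.fit(X_demo, y_demo)

# Manual equivalent: mean cross-validated R^2 for each alpha
mean_scores = [
    cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=4).mean()
    for a in alphas
]
best_manual = alphas[int(np.argmax(mean_scores))]

print("grid search best alpha:", grid.best_params_["alpha"])
print("manual loop best alpha:", best_manual)
```

After fitting, `best_estimator_` is a model with the winning alpha that has been refit on all the data passed to `fit`.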

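We now test our model on the test data. Note that the grid search above was fitted on all of x_data, which includes the test rows, so this score is not a true out-of-sample estimate: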

In [None]:
BestRR.score(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']],y_test)

### Question 23: Perform a grid search for the alpha parameter and the normalization parameter, then find the best values of the parameters

In [None]:
# Note: the 'normalize' parameter of Ridge was removed in scikit-learn 1.2,
# so this cell requires an older version of scikit-learn
parameters2= [{'alpha': [0.001,0.1,1, 10, 100, 1000,10000,100000],'normalize':[True,False]} ]
Grid2 = GridSearchCV(Ridge(), parameters2,cv=4)
Grid2.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']],y_data)
Grid2.best_estimator_
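On recent scikit-learn versions, where `Ridge` no longer accepts `normalize`, a similar search can be sketched with a `Pipeline` whose scaling step is itself part of the grid. This is an alternative technique rather than the original answer: `StandardScaler` standardizes each feature to zero mean and unit variance, which is not identical to what the old `normalize=True` did. The data and the step names `scale` and `ridge` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the automobile data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4)) * np.array([1.0, 100.0, 10.0, 0.1])
y_demo = X_demo @ np.array([3.0, 0.02, -0.5, 10.0]) + rng.normal(0, 1.0, size=80)

pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])

param_grid = {
    # Search with and without the scaling step
    "scale": [StandardScaler(), "passthrough"],
    # Parameters of a pipeline step are addressed as <step>__<parameter>
    "ridge__alpha": [0.001, 0.1, 1, 10, 100],
}

grid_demo = GridSearchCV(pipe, param_grid, cv=4)
grid_demo.fit(X_demo, y_demo)

print("best parameters:", grid_demo.best_params_)
print("best mean cross-validated R^2:", grid_demo.best_score_)
```

Putting the scaler inside the pipeline also means it is refit on the training folds only, so no information from the validation fold leaks into the scaling.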

# About the Authors:  

This notebook was written by [Joseph Santarcangelo PhD](https://www.linkedin.com/in/joseph-s-50398b136/), [Mahdi Noorian PhD](https://www.linkedin.com/in/mahdi-noorian-58219234/), Bahare Talayian, Eric Xiao, Steven Dong, Parizad, Hima Vsudevan, [Fiorella Wenver](https://www.linkedin.com/in/fiorellawever/) and edited by [Gokul R S](https://www.linkedin.com/in/gokulrs/). 

#### Recommended: Python for Data Science. Click to start the course:
  
  <a href="http://cocl.us/DA0101ENtoPY0101EN">
    <img src="https://ibm.box.com/shared/static/jmtb4pgle2dsdlzfmyrgv755cnqw95wk.png" width="100" align="left" /></a>

Copyright &copy; 2017 [cognitiveclass.ai](cognitiveclass.ai?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).