<a href="https://colab.research.google.com/github/nalinis07/APT_Class_Copy_Links/blob/MASTER/AT_Lesson_69_Class_Copy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 69: Car Price Prediction - Interpreting p-value

### Teacher-Student Activities

In the previous class, you learned feature encoding using the one-hot encoding and dummy coding processes. You also learned to calculate the adjusted R-squared value to evaluate a linear regression model.

In this class, you will learn the concept of the p-value, which helps you determine which features are statistically significant and which are not, so that you can build your model using only the features that contribute significantly to the prediction.

Let's quickly run the code covered in the previous classes and begin this session from the **Activity 1: Understanding Hypothesis Testing** section.

---

### Problem Statement

Build a linear regression model to predict the prices of cars based on their technical specifications such as car manufacturer, engine capacity, fuel efficiency, body type, etc.

**Dataset Description:**

The dataset contains 205 rows and 26 columns. Each column represents an attribute of a car as described in the table below.

|Sr No.|Attribute|Attribute Information|
|-|-|-|
|1|Car_ID|Unique id of each car (Integer)|
|2|Symboling|Assigned insurance risk rating; a value of +3 indicates that the car is risky; -3 suggests that it is probably a safe car (Categorical)|
|3|carCompany|Name of car company (Categorical)|
|4|fueltype| fuel-type i.e. petrol or diesel (Categorical)|
|5|aspiration|Aspiration used in a car (Categorical)|
|6|doornumber|Number of doors in a car (Categorical)|
|7|carbody|Body-type of a car (Categorical)|
|8|drivewheel|Type of drive wheel (Categorical)|
|9|enginelocation|Location of car engine (Categorical)|
|10|wheelbase|Wheelbase of car (Numeric)|
|11|carlength|Length of car (Numeric)|
|12|carwidth|Width of car (Numeric)|
|13|carheight|Height of car (Numeric)|
|14|curbweight|The weight of a car without occupants or baggage (Numeric)|
|15|enginetype|Type of engine (Categorical)|
|16|cylindernumber|Number of cylinders placed in the car engine (Categorical)|
|17|enginesize|Capacity of an engine (Numeric)|
|18|fuelsystem|Fuel system of a car (Categorical)|
|19|boreratio|Bore ratio of car (Numeric)|
|20|stroke|Stroke or volume inside the engine (Numeric)|
|21|compressionratio|Compression ratio of an engine (Numeric)|
|22|horsepower|Power output of an engine (Numeric)|
|23|peakrpm|Peak revolutions per minute (Numeric)|
|24|citympg|Mileage in city (Numeric)|
|25|highwaympg|Mileage on highway (Numeric)|
|26|price(Dependent variable)|Price of a car (Numeric)|

This data set consists of three types of entities:

- the specification of an auto in terms of various characteristics,

- its assigned insurance risk rating,

- its normalised losses in use as compared to other cars.

The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process **symboling**. A value of $+3$ indicates that the auto is risky, $-3$ that it is probably pretty safe.

The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality etc.), and represents the average loss per car per year.

**Note:** Several of the attributes in the database could be used as a "class" attribute.

**Dataset source:** https://archive.ics.uci.edu/ml/datasets/Automobile


The above dataset consists of data taken from 1985 Ward's Automotive Yearbook. Here's the list of original sources of the data:

1. 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.

2. Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038

3. Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037






---

#### Recap

https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/car-prices.csv

In [None]:
# Import the modules, read the dataset and create a Pandas DataFrame.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the dataset
cars_df = pd.read_csv("https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/car-prices.csv")

# Data Cleaning
# Extract the names of the manufacturers from the car names.
car_companies = pd.Series([car.split(" ")[0] for car in cars_df['CarName']], index = cars_df.index)

# Create a new column named 'car_company'. It should store the company names of the cars.
cars_df['car_company'] = car_companies

# Replace the misspelled 'car_company' names with their correct names.
# volkswagen
cars_df.loc[(cars_df['car_company'] == "vw") | (cars_df['car_company'] == "vokswagen"), 'car_company'] = 'volkswagen'

# porsche
cars_df.loc[cars_df['car_company'] == "porcshce", 'car_company'] = 'porsche'

# toyota
cars_df.loc[cars_df['car_company'] == "toyouta", 'car_company'] = 'toyota'

# nissan
cars_df.loc[cars_df['car_company'] == "Nissan", 'car_company'] = 'nissan'

# mazda
cars_df.loc[cars_df['car_company'] == "maxda", 'car_company'] = 'mazda'

# Drop 'CarName' column from the 'cars_df' DataFrame.
cars_df.drop(columns= ['CarName'], axis = 1, inplace = True)

# Data Preparation
# Extract all the numeric (float and int type) columns from the dataset.
cars_numeric_df = cars_df.select_dtypes(include = ['int64', 'float64'])

# Drop the 'car_ID' column from the 'cars_numeric_df' DataFrame.
cars_numeric_df.drop(columns = ['car_ID'], axis = 1, inplace = True)

# Mapping Categorical Values
# Map the values of the 'doornumber' and 'cylindernumber' columns to their corresponding numeric values.
words_dict = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "eight": 8, "twelve": 12}
def num_map(series):
    return series.map(words_dict)

# Applying the function to the two columns
cars_df[['cylindernumber', 'doornumber']] = cars_df[['cylindernumber', 'doornumber']].apply(num_map, axis = 1)

# Feature Encoding
# Create dummy variables for the 'carbody' columns.
car_body_dummies = pd.get_dummies(cars_df['carbody'], dtype = int)

# Create dummy variables for the 'carbody' columns with 1 column less.
car_body_new_dummies = pd.get_dummies(cars_df['carbody'], drop_first = True, dtype = int)

# Create a DataFrame containing all the non-numeric type features.
cars_categorical_df = cars_df.select_dtypes(include = ['object'])

# Get dummy variables for all the categorical type columns using the dummy coding process.
cars_dummies_df = pd.get_dummies(cars_categorical_df, drop_first = True, dtype = int)

# Drop the categorical type columns from the 'cars_df' DataFrame.
cars_df.drop(list(cars_categorical_df.columns), axis = 1, inplace = True)

# Concatenate the 'cars_df' and 'cars_dummies_df' DataFrames.
cars_df = pd.concat([cars_df, cars_dummies_df], axis = 1)

# Drop the 'car_ID' column
cars_df.drop('car_ID', axis = 1, inplace = True)

# Test-Train Split
# Split the 'cars_df' Dataframe into the train and test sets.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(cars_df, test_size = 0.3, random_state = 42)

# Create separate data-frames for the feature and target variables for both the train and test sets.
features = list(cars_df.columns)
features.remove('price')

X_train = train_df[features]
y_train = train_df['price']
X_test = test_df[features]
y_test = test_df['price']

# Feature Scaling
# Normalise only the original numeric columns (the first 16), not the dummy variables created during feature encoding.
def standard_norm(series):
  new_series = (series - series.mean()) / series.std()
  return new_series

# Normalising the features in the train and test sets.
X_train[X_train.columns[:16]] = X_train[X_train.columns[:16]].apply(standard_norm, axis = 0)
X_test[X_train.columns[:16]] = X_test[X_train.columns[:16]].apply(standard_norm, axis = 0)

# Model Building
# Build a linear regression model using all the features to predict car prices.
import statsmodels.api as sm

X_train_sm = sm.add_constant(X_train)
lin_reg = sm.OLS(y_train, X_train_sm).fit()

# Print the summary of the linear regression report.
print(lin_reg.summary())

---


#### Activity 1: Understanding Hypothesis Testing

From the summary report of the linear regression, you may observe that each feature variable has a **p-value** `(P>|t|)` associated with it. The p-value is an important statistic that can be used to eliminate features that are not statistically significant in our model. Before understanding the p-value concept, let us first explore the concept of hypothesis testing.

**Hypothesis Testing**

Hypothesis testing means testing an assumption that we make about a parameter. This assumption may or may not be true. For example, "students having an affluent background are more likely to do well in academics in higher education" is one such hypothesis.

The steps followed in hypothesis testing are:

1. An initial assumption or hypothesis is made.
2. The validity of that hypothesis is tested.
3. If the evidence contradicts the hypothesis strongly enough, it is rejected; otherwise, we fail to reject it.

There are two types of hypothesis:

1. **Null hypothesis:** denoted by $H_0$, it is a general statement or an initial assumption which we make about a parameter.
2. **Alternative hypothesis:** denoted by $H_1$ or $H_a$, it is contrary to the null hypothesis. It is the hypothesis we would accept if our null hypothesis is rejected.

In hypothesis testing, we need to gather enough evidence to decide whether to reject our null hypothesis. There are two types of hypothesis tests that can be used for multiple linear regression:
- **F-test:** This test measures the overall significance of all the coefficients.
- **T-test:** This test measures the significance of each individual coefficient.

Let us first determine the overall significance of our model using the F-test.

---

#### Activity 2: F-Test

The F-test is used to assess all the coefficients collectively. It validates whether any of the independent variables are significant. Let us apply F-test to the car price prediction model.

The regression equation for the car price prediction model can be given as

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_{59} x_{59} + \epsilon$$

where,

 - $x_1$ is `symboling`
 - $x_2$ is `doornumber`
 - $x_3$ is `wheelbase`

 $\vdots$   

 - $x_{59}$ is the last feature column (a dummy variable created during feature encoding) and
 - $Y$ is the `price`


**Step 1: Define null and alternative hypothesis**

$H_0: \beta_1 = \beta_2 = \dots = \beta_{59} = 0$ i.e. all the regression coefficients are equal to zero.

$H_1: \beta_i \neq 0$ for at least one $i$, i.e. at least one of the coefficients is not zero.

- $H_0$ means that none of the features (independent variables) has a significant relationship with our target variable `price` and our model has no predictive capability.

- $H_1$ means that at least one feature variable has a significant relationship with our target variable `price`.

**Step 2: Calculate the test statistic value** (in case of F-test it is F-statistic value)

It is calculated as

$$F^* = \frac{\textrm{explained variance}}{\textrm{unexplained variance}} = \frac{\text{MSM}}{\text{MSE}}$$

where,

- MSM is the Mean of Squares for Model
- MSE is Mean of Squared Errors (or Residuals)

Further, MSM  is calculated as

$$\text{MSM} = \frac{\text{SSM}}{\text{DFM}}=\frac{\sum(y_{\text{pred}} - \bar{y})^2}{ p - 1}$$

where,
- SSM is the Sum of Squares for Model
- DFM is Degrees of Freedom for Model
- $p$ is the number of regression coefficients including the intercept $\beta_0$, i.e. the number of independent variables plus one

Similarly, MSE is calculated as:

$$\text{MSE} = \frac{\text{SSE}}{\text{DFE}}=\frac{\sum(y - y_{\text{pred}})^2}{ N - p}$$

where,
- SSE is the Sum of Squares for Errors
- DFE is Degrees of Freedom for Errors
- $N$ is number of instances (or rows) in the dataset

Let's create `mean_sq_model()` and `mean_sq_error()` functions to calculate the MSM and MSE values using the above formulae respectively.

**Note:** You can also obtain the MSM and MSE values from the `mse_model` and `mse_resid` attributes, respectively, of the fitted results object (`lin_reg`) returned by `sm.OLS(...).fit()`.
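The formulae above can be sketched on a small synthetic dataset as shown below. This is only an illustration under assumptions: the data is randomly generated rather than taken from the car dataset, the model is fitted with `numpy.linalg.lstsq`, and `p` counts the intercept along with the independent variables so that `p - 1` and `N - p` are the degrees of freedom.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the car dataset).
rng = np.random.default_rng(0)
N, k = 100, 3                      # N rows, k independent variables
X = rng.normal(size=(N, k))
y = 5.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=N)

# Fit ordinary least squares with an intercept column.
X_design = np.column_stack([np.ones(N), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_pred = X_design @ beta

p = X_design.shape[1]              # number of coefficients, intercept included

def mean_sq_model(y_actual, y_predicted, p):
    """MSM = SSM / (p - 1)."""
    ssm = np.sum((y_predicted - np.mean(y_actual)) ** 2)
    return ssm / (p - 1)

def mean_sq_error(y_actual, y_predicted, n, p):
    """MSE = SSE / (n - p)."""
    sse = np.sum((y_actual - y_predicted) ** 2)
    return sse / (n - p)

msm = mean_sq_model(y, y_pred, p)
mse = mean_sq_error(y, y_pred, N, p)
f_stat = msm / mse
print(f"MSM = {msm:.3f}, MSE = {mse:.3f}, F = {f_stat:.3f}")
```

For the car price model, the same two functions would be called with `y_train` and the predictions of `lin_reg` on the train set.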

In [None]:
# S2.1: Calculate N and p values


In [None]:
# S2.2: Create functions to calculate MSM and MSE values respectively.


In [None]:
# S2.3: Calculate the MSM and MSE on the train sets


Now let us calculate the F-statistic value using the formula

$$F^* = \frac{\text{MSM}}{\text{MSE}}$$

In [None]:
# S2.4: Calculate the F-statistic using the above formula.


**Step 3: Determine the p-value or probability value for the F-statistic**

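We can manually calculate the p-value for a test statistic from the cumulative distribution function (cdf) of its distribution. The F-test is right-tailed, so
$$\textrm{p value} = 1 - \textrm{cdf}(F^*)$$

where cdf is that of the F-distribution with DFM and DFE degrees of freedom. For a two-sided test with a symmetric distribution, such as the t-test used later in this class, the formula is $\textrm{p value} = 2 \times (1 - \textrm{cdf}(|ts|))$, where $|ts|$ is the absolute value of the test statistic.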
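As a minimal sketch, the p-value of an F-statistic can be computed with `scipy.stats.f`. Because the F-distribution takes only non-negative values, the p-value is conventionally the right-tail area $1 - \textrm{cdf}(F^*)$, without the factor of 2. The helper name `f_test_p_value` and the input values below are hypothetical and chosen only for illustration.

```python
from scipy import stats

def f_test_p_value(f_stat, df_model, df_error):
    """Right-tail probability of the F-distribution: 1 - cdf(F*)."""
    # sf is the survival function, i.e. 1 - cdf, computed more accurately for tiny tails.
    return stats.f.sf(f_stat, df_model, df_error)

# Hypothetical values for illustration only (not the car price model).
p_val = f_test_p_value(f_stat=4.0, df_model=3, df_error=96)
print(f"p-value = {p_val:.4f}")
```

For the car price model, `df_model` and `df_error` would be DFM and DFE from Step 2.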



In [None]:
# S2.5: Calculate p-value for F-statistic.


We can also directly obtain the p-value for the F-statistic from the `f_pvalue` attribute of the fitted results object (`lin_reg`).

In [None]:
# S2.6: Calculate p-value using f_pvalue attribute


Thus, the F-statistic value is 61.81 and its p-value is 0.0. These may differ slightly from the `F-statistic` and `Prob (F-statistic)` values in the summary table, for instance if the degrees of freedom used in the manual calculation differ from those used by `statsmodels`. You can also read the F-statistic and its p-value directly from the summary table.

**Step 4: Accept or reject null hypothesis based on the p-value**

After determining the p-value, we either reject or fail to reject our null hypothesis.

If the p-value is below 0.05, the null hypothesis is rejected. Let's determine whether the p-value is below 0.05 or not.
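The decision rule can be written as a small helper. The function name `decide_null_hypothesis` and its `alpha` parameter are hypothetical names used for this sketch.

```python
def decide_null_hypothesis(p_value, alpha=0.05):
    """Return the test decision for a given p-value and significance level."""
    if p_value < alpha:
        return "Reject the null hypothesis"
    return "Fail to reject the null hypothesis"

print(decide_null_hypothesis(0.001))   # Reject the null hypothesis
print(decide_null_hypothesis(0.443))   # Fail to reject the null hypothesis
```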



In [None]:
# S2.7: Create a function to accept or reject null hypothesis


The p-value that we obtained from the F-test is approximately 0.00, which is below 0.05, so we can reject our null hypothesis and conclude that at least one of the independent variables has a linear relationship with our target variable `price`. But what is a p-value?

**What is meant by p-value?**

The p-value is the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. The p-value for each feature tests the null hypothesis that the feature's coefficient is zero, i.e. that the feature has no linear relationship with the target variable once the other features are accounted for. The smaller the p-value, the stronger the evidence against the null hypothesis. A p-value less than 0.05 is conventionally treated as statistically significant: if the null hypothesis were true, a result this extreme would occur less than 5% of the time. In that case, we reject the null hypothesis in favour of the alternative hypothesis. A p-value greater than 0.05 indicates weak evidence against the null hypothesis, and we fail to reject it. Note that the p-value is not the probability that the null hypothesis is correct.
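One way to build intuition for the 0.05 threshold is a small simulation: when the null hypothesis really is true, roughly 5% of tests still produce a p-value below 0.05. The sketch below assumes simple one-feature regressions on random data and uses `scipy.stats.linregress`; the sample sizes are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_simulations, n_rows = 2000, 50
p_values = np.empty(n_simulations)

for i in range(n_simulations):
    x = rng.normal(size=n_rows)
    y = rng.normal(size=n_rows)      # y is generated independently of x, so H0 is true
    p_values[i] = stats.linregress(x, y).pvalue

false_positive_rate = np.mean(p_values < 0.05)
print(f"Share of p-values below 0.05 when H0 is true: {false_positive_rate:.3f}")
```

The printed share should be close to 0.05, which is why a significance level of 0.05 corresponds to accepting about a 5% chance of rejecting a true null hypothesis.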

The F-test for our model rejected the null hypothesis, so we conclude that at least one feature variable is significant and that our model has some predictive capability. Now, we will perform a **t-test** to determine which variables are significant in predicting the price of a car and which are not.

---

#### Activity 3: T-Test

After concluding from the F-test that at least one feature variable is significant, now we may want to know which variables are significant. For this, we can do a **t-test** to find out which independent variable is making a useful contribution in the prediction of the dependent variable.

Remember, the regression equation for our model is:



$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_{59} x_{59} + \epsilon$$

where,
 - $x_1$ is `symboling`
 - $x_2$ is `doornumber`
 - $x_3$ is `wheelbase`

 $\vdots$   

 - $x_{59}$ is the last feature column (a dummy variable created during feature encoding) and
 - $Y$ is the `price`

For example, let us determine whether feature `symboling` is contributing significantly in the prediction of dependent variable `price`. We will follow the same steps as that of F-test.

**Step 1:  Define the null and alternative hypothesis**

$H_0:   \beta_1 = 0$ i.e. `symboling` and `price` are not linearly related

$H_1:   \beta_1 \neq 0$ i.e. `symboling` and `price` are linearly related

**Step 2: Calculate the test statistic value** (in case of t-test, it is t-statistic value)

The t-statistic is calculated as:

$$t^* = \frac{\textrm{coefficient} - \textrm{hypothesized value}}{\textrm{standard error of coefficient}}$$

As the hypothesized value is usually 0,
$$t^* = \frac{\textrm{coefficient}}{\textrm{standard error of coefficient}}$$

For our example above, the t-statistic is:

$$t^* = \frac{\beta_1}{SE(\beta_1)}$$

The **standard error of coefficient (SE)** is an estimate of the standard deviation of the coefficient, that is, how much the estimated coefficient would vary from sample to sample. Computing it by hand involves matrix algebra.

However, we can obtain the standard error of every coefficient from the `bse` attribute of the fitted results object (`lin_reg`). The `b` in `bse` stands for the coefficient $\beta$ and `se` for standard errors.
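For reference, the textbook ordinary least squares formula for the standard error is $SE(\beta_j) = \sqrt{\text{MSE} \cdot [(X^T X)^{-1}]_{jj}}$, where $X$ is the feature matrix including the constant column. The sketch below applies it with NumPy to randomly generated data (an assumption for illustration, not the car dataset) and then forms the t-statistics.

```python
import numpy as np

# Synthetic data: the second feature has no effect on y.
rng = np.random.default_rng(2)
N = 60
X = rng.normal(size=(N, 2))
y = 4.0 + 1.5 * X[:, 0] + rng.normal(size=N)

X_design = np.column_stack([np.ones(N), X])   # add the constant column
p = X_design.shape[1]                         # coefficients, intercept included

xtx_inv = np.linalg.inv(X_design.T @ X_design)
beta = xtx_inv @ X_design.T @ y               # OLS coefficient estimates
residuals = y - X_design @ beta
mse = residuals @ residuals / (N - p)

std_errors = np.sqrt(mse * np.diag(xtx_inv))  # SE of each coefficient
t_stats = beta / std_errors                   # t* = coefficient / SE

for name, b, se, t in zip(["const", "x1", "x2"], beta, std_errors, t_stats):
    print(f"{name}: coef = {b:.3f}, SE = {se:.3f}, t = {t:.3f}")
```

On a fitted `statsmodels` model, the same quantities are available as the `params`, `bse` and `tvalues` attributes.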


In [None]:
# T3.1: Calculate the SE(beta_1) value.


In [None]:
# T3.2: Calculate t-statistic for beta_1 using the above formula.


**Step 3:  Determine the p-value or probability value for the t-statistic**

After obtaining the t-statistic for $\beta_1$, let's validate the null hypothesis by calculating the p-value.
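A minimal sketch of this step with `scipy.stats.t` is shown below. The t-test is two-sided, so the p-value is $2 \times (1 - \textrm{cdf}(|t^*|))$ under a t-distribution with DFE degrees of freedom. The helper name `t_test_p_value` and the input values are hypothetical.

```python
from scipy import stats

def t_test_p_value(t_stat, df_error):
    """Two-sided p-value: 2 * (1 - cdf(|t*|)) for a t-distribution."""
    return 2 * stats.t.sf(abs(t_stat), df_error)

# Hypothetical values for illustration only (not the car price model).
p_val = t_test_p_value(t_stat=-2.5, df_error=57)
print(f"p-value = {p_val:.4f}")
```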


In [None]:
# T3.3: Calculate p-value based on t-statistic.


Thus the t-statistic value for $\beta_1$ is -0.766 and its p-value is 0.443. You can also derive these values directly from the summary table.

**Step 4: Accept or reject null hypothesis based on the p-value**

After determining the p-value, we either reject or fail to reject our null hypothesis.



In [None]:
# S3.1: Accept or reject null hypothesis


Since the p-value is above 0.05, we fail to reject the null hypothesis. This means there is not enough evidence that `symboling` and `price` are linearly related once the other features are in the model, so `symboling` does not appear to make a useful contribution to predicting the target variable `price`. Hence, we can remove this feature from our model.

Similarly, let's perform a t-test for the second feature `doornumber` to determine whether it is significant or not. For this, our null hypothesis and alternative hypothesis would be:

$H_0:   \beta_2 = 0$ i.e. `doornumber` and `price` are not linearly related

$H_1:   \beta_2 \neq 0$ i.e. `doornumber` and `price` are  linearly related



In [None]:
# S3.2: Calculate the SE(beta_2) value.


In [None]:
# S3.3: Calculate t-statistic for beta_2 using formula


In [None]:
# S3.4: Calculate p-value based on t-statistic


In [None]:
# S3.5: Accept or reject null hypothesis


Since the p-value is above 0.05, we fail to reject the null hypothesis. This means there is not enough evidence that the feature `doornumber` makes a useful contribution to predicting the target variable `price`. Hence, we can remove this feature from our model.

Similarly, you can perform a t-test for each independent variable and determine which variables actually contribute to predicting the price of a car.

You can obtain the p-values for all features at once either from the summary of the linear regression report or by using the `pvalues` attribute of the fitted linear regression object.

In [None]:
# S3.6: Obtain p-values for all features


Let us obtain those features whose p-value is less than 0.05 and perform linear regression using the reduced features.



In [None]:
# S3.7: Create a dataframe with Features and their corresponding p-values


In [None]:
# S3.8: Drop those features whose p-value is greater than 0.05


As you can see, we created a dataframe with only significant features. Now let us again perform linear regression using the reduced features.

In [None]:
# S3.9: Build a linear regression model again with reduced features


In [None]:
# S3.10: Print the summary table for the above linear regression model.


We built the linear regression model again after removing all the features having p-values above 0.05 in one go, and we still have a few features with high p-values. This is not the right approach to tackle the high p-values issue. Ideally, in each iteration, we should remove only the one feature having the highest p-value, rebuild the model, check the p-values again, remove the next feature having the highest p-value, and so on.

These iterations can become very long. We can reduce the number of iterations by finding the most relevant features in the first go. In the next class, we will learn how to select such relevant features to reduce the number of iterations needed to build the most accurate linear regression model.

---

### **Project**
You can now attempt the **Applied Tech Project 69 - Car price prediction - Interpreting p-values** on your own.

**Applied Tech Project 69 - Car price prediction - Interpreting p-values**: https://colab.research.google.com/drive/1fkUA8fMoDcZnia1kOeDgP0KnOoRUoU_6?usp=sharing

---