
### Instructions

---

#### Goal of the Project

This project lets you practise the concepts covered in the following lessons:

 1. Multiple linear regression - Introduction
 2. Multicollinearity
 3. Variance Inflation Factor



---

### Problem Statement

Implement multiple linear regression to build a model that predicts the Body Mass Index category (`Index`) from the Height, Weight and Gender of a person. The dataset contains 500 instances. Use the Variance Inflation Factor (VIF) to find out whether there is multicollinearity in the dataset.






---

### List of Activities

**Activity 1:** Analysing the Dataset

**Activity 2:** Data Manipulation

**Activity 3:** Train-Test Split

**Activity 4:** Model Training using `statsmodels.api`

**Activity 5:** Calculate VIF using `variance_inflation_factor`

**Activity 6:** Calculate VIF using formula



---


#### Activity 1:  Analysing the Dataset

- Create a Pandas DataFrame for the **500_Person_Gender_Height_Weight_Index** dataset using the link below. This dataset consists of the following columns:

|Columns|Description|
|--|--|
|Gender|Male/Female|
|Height|Height in cm|
|Weight|Weight in kg|
|Index|Body Mass Index. Values: 0 - Extremely Weak, 1 - Weak, 2 - Normal, 3 - Overweight, 4 - Obesity, 5 - Extreme Obesity|



   **Dataset Link:**  https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/Gender_Height_Weight_Index.csv

- Print the first five rows of the dataset. Check for null values and treat them accordingly.

In [None]:
# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df=pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/Gender_Height_Weight_Index.csv')
# Print first five rows using head() function
df.head()

Unnamed: 0,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4
3,Female,195,104,3
4,Male,149,61,3


In [None]:
# Check if there are any null values. If any column has null values, treat them accordingly
df.isnull().sum()

Gender    0
Height    0
Weight    0
Index     0
dtype: int64
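
This dataset has no missing values, so nothing needs treating here. As a rough sketch of what "treating them accordingly" could look like, the example below uses a small made-up DataFrame (`toy`, not the project data) and shows two common options: dropping incomplete rows, or filling a numeric column with its median. Which option is appropriate depends on the data; this is an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not the project dataset)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female"],
    "Height": [174.0, np.nan, 185.0, 195.0],
    "Weight": [96, 87, 110, 104],
})

# Count the missing values in each column
null_counts = toy.isnull().sum()

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill the gap in a numeric column with that column's median
filled = toy.copy()
filled["Height"] = filled["Height"].fillna(filled["Height"].median())

print(null_counts)
print(len(dropped), filled["Height"].tolist())
```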

---

#### Activity 2: Data Manipulation

The dataset contains a categorical column, `Gender`. Linear regression, however, needs all variables to be numerical. To convert the categorical data to numerical data, use the `get_dummies()` function of the `pandas` module. This function converts a categorical variable into dummy (indicator) variables, one column per category.

**Syntax:** `pd.get_dummies(data)`



In [None]:
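# Get dummy values for the 'Gender' column and keep the 'Female' indicator (Female = 1, Male = 0).
# get_dummies() returns one column per category, so select a single column before assigning it back.
df['Gender'] = pd.get_dummies(df['Gender'])['Female'].astype(int)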

In [None]:
# Again print first five rows using head() function
df.head()

Unnamed: 0,Gender,Height,Weight,Index
0,0,174,96,4
1,0,189,87,2
2,1,185,110,4
3,1,195,104,3
4,0,149,61,3
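
To see what `get_dummies()` returns before a single column is kept, here is a minimal sketch on a made-up `toy` DataFrame (not the project data). It uses the `drop_first` and `dtype` parameters of `pd.get_dummies`. Note that `drop_first=True` removes the first category alphabetically (`Female`), so the surviving column is a `Male` indicator, which is the opposite orientation to the `Gender` column shown above.

```python
import pandas as pd

# Hypothetical toy column mirroring the first five rows of 'Gender'
toy = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Female", "Male"]})

# One indicator column per category: 'Female' and 'Male'
both = pd.get_dummies(toy["Gender"], dtype=int)

# drop_first=True keeps k - 1 columns for k categories,
# here a single 'Male' indicator
single = pd.get_dummies(toy["Gender"], drop_first=True, dtype=int)

print(list(both.columns))
print(single["Male"].tolist())
```

Keeping only one of the two indicator columns avoids feeding the model two columns that always sum to 1.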


------

#### Activity 3: Train-Test Split

We need to predict the value of `Index` variable, using other variables. Thus, `Index` is the target or dependent variable and other columns except `Index` are the features or the independent variables.

Split the dataset into training set and test set such that the training set contains 70% of the instances and the remaining instances will become the test set.

In [None]:
# Split the DataFrame into the training and test sets.
from sklearn.model_selection import train_test_split

X = df.drop('Index', axis=1)  # feature variables
y = df['Index']  # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42)
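
A quick way to confirm the 70/30 split is to check the shapes of the resulting sets. The sketch below uses randomly generated stand-in data with 500 rows (the `toy` DataFrame and its values are assumptions, not the project dataset) and the same `test_size` and `random_state` as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with 500 rows, like the project dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Height": rng.integers(140, 200, size=500),
    "Weight": rng.integers(50, 160, size=500),
    "Index": rng.integers(0, 6, size=500),
})

X = toy.drop("Index", axis=1)  # feature variables
y = toy["Index"]               # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# 70% of 500 rows go to training, the remaining 30% to testing
print(X_train.shape, X_test.shape)
```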

---

#### Activity 4: Model Training using `statsmodels.api`

Perform the following tasks:
- Implement multiple linear regression using `statsmodels.api` module and find the values of all the regression coefficients using this module.
- Print the statistical summary of the regression model.


In [None]:
# Build a linear regression model using the 'statsmodels.api' module.
import statsmodels.api as sm

# Add a constant to feature variables
X_train_sm = sm.add_constant(X_train)

# Fit the regression line using 'OLS'
stats_lr = sm.OLS(y_train, X_train_sm).fit()

# Print the parameters, i.e. the intercept and the coefficient of each feature in the fitted model
stats_lr.params

const     6.447978
Gender   -0.013118
Height   -0.037427
Weight    0.034593
dtype: float64

In [None]:
# Print statistical summary of the model
print(stats_lr.summary())

                            OLS Regression Results                            
Dep. Variable:                  Index   R-squared:                       0.842
Model:                            OLS   Adj. R-squared:                  0.840
Method:                 Least Squares   F-statistic:                     613.9
Date:                Fri, 23 Oct 2020   Prob (F-statistic):          3.78e-138
Time:                        14:37:29   Log-Likelihood:                -289.64
No. Observations:                 350   AIC:                             587.3
Df Residuals:                     346   BIC:                             602.7
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.4480      0.320     20.131      0.0

**Q:** What is the $R^2$ (R-squared) value for this model?

**A:** 0.842, i.e. the model explains about 84.2% of the variance in `Index` on the training set.
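
For a model with an intercept, the R-squared value in the summary is $1 - \frac{SS_{res}}{SS_{tot}}$. The sketch below computes it by hand with NumPy on synthetic data (the data, coefficients and variable names are assumptions made for illustration, not the project dataset), using ordinary least squares via `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data: y depends linearly on two features plus noise
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
residuals = y - A @ beta
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(beta)
print(r_squared)
```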


-----

#### Activity 5: Calculate VIF using `variance_inflation_factor`

Calculate the VIF value for each independent variable using the `variance_inflation_factor` function of the `statsmodels.stats.outliers_influence` module.


In [None]:
# Calculate the VIF values for 'Gender','Height', 'Weight' independent variables using the 'variance_inflation_factor' function.
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train_sm.columns
vif['VIF'] = [variance_inflation_factor(X_train_sm.values, i) for i in range(X_train_sm.values.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Unnamed: 0,Features,VIF
0,const,115.84
1,Gender,1.0
2,Height,1.0
3,Weight,1.0
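
The `const` row in the table above belongs to the intercept column added by `sm.add_constant`, not to a predictor, so it is usually left out when judging multicollinearity. As a small sketch, the code below rebuilds the table from the values printed above, drops the `const` row, and flags any feature above a cut-off. The cut-off of 5 is an assumed rule of thumb, not a value taken from this project.

```python
import pandas as pd

# VIF values copied from the output above
vif = pd.DataFrame({
    "Features": ["const", "Gender", "Height", "Weight"],
    "VIF": [115.84, 1.0, 1.0, 1.0],
})

# Keep only the real predictors; the intercept is not a feature
feature_vif = vif[vif["Features"] != "const"]

# Assumed rule-of-thumb threshold for "high" multicollinearity
threshold = 5
flagged = feature_vif[feature_vif["VIF"] > threshold]

print(feature_vif)
print("Any feature above threshold:", not flagged.empty)
```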


-------

#### Activity 6: Calculate VIF using formula

Calculate the VIF value for each independent variable using the  $\frac{1}{1 - R^2}$  formula, where $R^2$ comes from regressing that variable on the remaining independent variables. For this, perform the following tasks:

- Build a linear regression model again taking `Weight` as the dependent variable and `Height` and `Gender` as the independent variables. Then calculate the $R^2$ value for this model.

- Calculate the VIF values using the $\frac{1}{1 - R^2}$ formula.


In [None]:
# Build a linear regression model taking 'Weight' as the target and 'Height' and 'Gender' as the independent variables.
w_X_train = X_train[['Height', 'Gender']]
w_y_train = X_train['Weight']

w_X_train_sm = sm.add_constant(w_X_train)
w_lin_reg = sm.OLS(w_y_train, w_X_train_sm).fit()

print(w_lin_reg.summary())

                            OLS Regression Results                            
Dep. Variable:                 Weight   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.2499
Date:                Fri, 23 Oct 2020   Prob (F-statistic):              0.779
Time:                        14:37:29   Log-Likelihood:                -1717.3
No. Observations:                 350   AIC:                             3441.
Df Residuals:                     347   BIC:                             3452.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         97.3884     18.163      5.362      0.0

In [None]:
# Calculate the VIF value for Weight.

weight_vif = 1 / (1 - 0.001)
weight_vif

1.001001001001001

Repeat the same with `Height` as the dependent variable.

In [None]:
# Build a linear regression model taking 'Height' as the target and 'Weight' and 'Gender' as the independent variables.
height_X_train = X_train[['Weight','Gender']]
height_y_train = X_train['Height']

height_X_train_sm = sm.add_constant(height_X_train)
height_lin_reg = sm.OLS(height_y_train, height_X_train_sm).fit()

print(height_lin_reg.summary())

                            OLS Regression Results                            
Dep. Variable:                 Height   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                 -0.003
Method:                 Least Squares   F-statistic:                    0.5493
Date:                Fri, 23 Oct 2020   Prob (F-statistic):              0.578
Time:                        14:37:30   Log-Likelihood:                -1479.4
No. Observations:                 350   AIC:                             2965.
Df Residuals:                     347   BIC:                             2976.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        168.3137      3.174     53.036      0.0

In [None]:
# Calculate the VIF value for Height.

height_vif = 1 / (1 - 0.003)
height_vif

1.0030090270812437

Repeat the same with `Gender` as the dependent variable.

In [None]:
# Build a linear regression model taking 'Gender' as the target and 'Weight' and 'Height' as the independent variables.
gen_X_train = X_train[['Weight','Height']]
gen_y_train = X_train['Gender']

gen_X_train_sm = sm.add_constant(gen_X_train)
gen_lin_reg = sm.OLS(gen_y_train, gen_X_train_sm).fit()

print(gen_lin_reg.summary())

                            OLS Regression Results                            
Dep. Variable:                 Gender   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                 -0.003
Method:                 Least Squares   F-statistic:                    0.5306
Date:                Fri, 23 Oct 2020   Prob (F-statistic):              0.589
Time:                        14:37:30   Log-Likelihood:                -253.40
No. Observations:                 350   AIC:                             512.8
Df Residuals:                     347   BIC:                             524.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3015      0.288      1.047      0.2

In [None]:
# Calculate the VIF value for Gender.

gender_vif = 1 / (1 - 0.003)
gender_vif

1.0030090270812437

**Q:** Are the VIF values calculated using the $\frac{1}{1 - R^2}$ formula the same as those returned by the `variance_inflation_factor` function of the `statsmodels.stats.outliers_influence` module for the independent variables?

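**A:** Yes. Both approaches give a VIF of about 1.0 for `Gender`, `Height` and `Weight`, so there is no sign of multicollinearity among the features.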
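
The three manual regressions above can be folded into one loop. The following is a minimal sketch using only NumPy, not the notebook's `statsmodels` workflow: the helper name `vif_from_formula` and the synthetic columns `a`, `b` and `c` are hypothetical. Each column is regressed on the others plus an intercept, and $\frac{1}{1 - R^2}$ is returned for it.

```python
import numpy as np

def vif_from_formula(X):
    """Return 1 / (1 - R^2) for each column of X, where R^2 comes from
    regressing that column on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        ss_res = np.sum((target - A @ beta) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1 - ss_res / ss_tot
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Synthetic columns: a and b are independent, c is nearly a copy of a
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)

independent = vif_from_formula(np.column_stack([a, b]))
collinear = vif_from_formula(np.column_stack([a, b, c]))

print(independent)  # close to 1 for both columns
print(collinear)    # large for a and c, close to 1 for b
```

With unrelated columns the VIF stays near 1, as it does for `Gender`, `Height` and `Weight` in this project. Adding a near-duplicate column pushes the VIF of both correlated columns well above 1.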



---