<a href="https://colab.research.google.com/github/nikshargithub/ML_PROJECTS/blob/main/2023_08_07_NIKHIL_SHARMA_PROJECT_66.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Instructions

---

#### Goal of the Project

This project is designed for you to practice and solve the activities that are based on the concepts covered in the following lessons:

 1. Multiple linear regression - Introduction
 2. Multicollinearity
 3. Variance Inflation Factor



---

#### Getting Started:

1. Click on this link to open the Colab file for this project.

  Link on Panel

2. Create a duplicate copy of the Colab file as described below.

  - Click on the **File menu**. A new drop-down list will appear.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/0_file_menu.png' width=500>

  - Click on the **Save a copy in Drive** option. A duplicate copy will be created and will open in a new tab of your web browser.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/1_create_colab_duplicate_copy.png' width=500>

3. After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_Project66** format.

4. Now, write your code in the prescribed code cells.


---

### Problem Statement

Implement multiple linear regression to create a predictive model capable of predicting the Body Mass Index using the Height, Weight and Gender of a person. The dataset contains 500 instances. Use the Variance Inflation Factor (VIF) to find out whether there is multicollinearity in the dataset.






---

### List of Activities

**Activity 1:** Analysing the Dataset

**Activity 2:** Data Manipulation

**Activity 3:** Train-Test Split

**Activity 4:** Model Training using `statsmodels.api`

**Activity 5:** Calculate VIF using `variance_inflation_factor`

**Activity 6:** Calculate VIF using formula



---


#### Activity 1:  Analysing the Dataset

- Create a Pandas DataFrame for the **500_Person_Gender_Height_Weight_Index** dataset using the link below. The dataset consists of the following columns:

|Columns|Description|
|--|--|
|Gender|Male/Female|
|Height|Height in cm|
|Weight|Weight in kg|
|Index|Body Mass Index. Values: 0 - Extremely Weak, 1 - Weak, 2 - Normal, 3 - Overweight, 4 - Obesity, 5 - Extreme Obesity|



   **Dataset Link:**  https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/Gender_Height_Weight_Index.csv

- Print the first five rows of the dataset. Check for null values and treat them accordingly.

In [None]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Load the dataset
df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/Gender_Height_Weight_Index.csv')
# Print first five rows using head() function
df.head()

Unnamed: 0,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4
3,Female,195,104,3
4,Male,149,61,3


In [None]:
# Check if there are any null values. If any column has null values, treat them accordingly
df.isnull().sum()

Gender    0
Height    0
Weight    0
Index     0
dtype: int64
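
This dataset has no missing values, so no treatment is needed here. For a dataset that does have gaps, one common option is sketched below on a small hypothetical frame (`toy` and its values are invented for illustration and are not part of the project data): numeric gaps are filled with the column median, and rows with a missing category are dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; not taken from the project dataset.
toy = pd.DataFrame({
    'Gender': ['Male', None, 'Female', 'Female'],
    'Height': [174.0, 189.0, np.nan, 195.0],
    'Weight': [96.0, 87.0, 110.0, np.nan],
})

# Count the missing values per column before treatment.
print(toy.isnull().sum())

# Fill numeric gaps with the median of each column.
numeric_cols = ['Height', 'Weight']
cleaned = toy.copy()
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

# Drop rows whose categorical value is missing.
cleaned = cleaned.dropna(subset=['Gender'])
print(cleaned)
```

Whether to fill or drop depends on how much data is missing; the median is only one reasonable choice for numeric columns.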

---

#### Activity 2: Data Manipulation

The dataset contains a categorical column, `Gender`. Linear regression requires all variables to be numerical, so convert the categorical data to numerical data using the `get_dummies()` function of the `pandas` module. This function converts a categorical variable into dummy variables.

**Syntax:** `pd.get_dummies(data)`
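
As a minimal sketch of how `get_dummies()` behaves, the example below encodes a small hypothetical frame (`toy` is invented for illustration). It assumes that `drop_first=True` is wanted, so that only one 0/1 column is kept for the two-category `Gender` variable; keeping both dummy columns alongside an intercept would make them perfectly collinear.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Height': [174, 189, 185],
})

# Encode 'Gender' as dummy variables and drop the first category ('Female'),
# leaving a single 'Gender_Male' column of 0/1 integers.
encoded = pd.get_dummies(toy, columns=['Gender'], drop_first=True, dtype=int)
print(encoded)
```

Note that this produces a column named `Gender_Male` (1 for Male), which is the opposite coding to mapping Male to 0 and Female to 1; either coding carries the same information.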



In [None]:
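# Encode the 'Gender' column numerically: Male -> 0, Female -> 1.
# A manual mapping is used here in place of get_dummies(); for a two-category column it yields a single 0/1 dummy variable.
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})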

In [None]:
# Again print first five rows using head() function
df.head()

Unnamed: 0,Gender,Height,Weight,Index
0,0,174,96,4
1,0,189,87,2
2,1,185,110,4
3,1,195,104,3
4,0,149,61,3


------

#### Activity 3: Train-Test Split

We need to predict the value of the `Index` variable using the other variables. Thus, `Index` is the target (dependent) variable, and all the other columns are the features (independent variables).

Split the dataset into a training set and a test set such that the training set contains 70% of the instances and the test set contains the remaining 30%.
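
A quick way to confirm the 70/30 proportions is to inspect the shapes of the resulting sets. The sketch below uses a hypothetical random frame with the same column names and 500 rows as a stand-in for the project data (the values themselves are invented).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the project DataFrame: 500 rows, same column names.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Gender': rng.integers(0, 2, size=500),
    'Height': rng.integers(140, 200, size=500),
    'Weight': rng.integers(50, 160, size=500),
    'Index': rng.integers(0, 6, size=500),
})

features = toy.drop(columns='Index')
target = toy['Index']

# test_size=0.30 keeps 70% of the rows for training; random_state makes the split repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(features, target, test_size=0.30, random_state=42)
print(x_tr.shape, x_te.shape)  # (350, 3) (150, 3)
```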

In [None]:
# Split the DataFrame into the training and test sets.
x=df[['Gender','Height','Weight']]
y=df['Index']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state=42)

---

#### Activity 4: Model Training using `statsmodels.api`

Perform the following tasks:
- Implement multiple linear regression using `statsmodels.api` module and find the values of all the regression coefficients using this module.
- Print the statistical summary of the regression model.
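
`sm.OLS` does not add an intercept on its own, which is why the features are passed through `sm.add_constant` before fitting. The NumPy sketch below illustrates the same idea on synthetic data (the data and the names `X`, `X_const` and `beta` are illustrative assumptions): a column of ones is prepended to the feature matrix, and the least-squares solution then contains the intercept as its first entry.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

# Synthetic, noise-free target with a known intercept (4.0) and known slopes (2.0, -3.0).
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones, mirroring the 'const' column that sm.add_constant creates.
X_const = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_const @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)
print(beta)
```

Because the target is noise-free, the recovered `beta` matches the intercept and slopes used to build `y` up to floating-point error.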


In [None]:
# Build a linear regression model using the 'statsmodels.api' module.
# Add a constant to feature variables
const = sm.add_constant(x_train)
# Fit the regression line using 'OLS'
obj = sm.OLS(y_train,const).fit()
# Print the parameters, i.e. the intercept and the regression coefficients of the fitted model
print(obj.params)

const     6.447978
Gender   -0.013118
Height   -0.037427
Weight    0.034593
dtype: float64


In [None]:
# Print statistical summary of the model
print(obj.summary())

                            OLS Regression Results                            
Dep. Variable:                  Index   R-squared:                       0.842
Model:                            OLS   Adj. R-squared:                  0.840
Method:                 Least Squares   F-statistic:                     613.9
Date:                Mon, 07 Aug 2023   Prob (F-statistic):          3.78e-138
Time:                        04:20:05   Log-Likelihood:                -289.64
No. Observations:                 350   AIC:                             587.3
Df Residuals:                     346   BIC:                             602.7
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.4480      0.320     20.131      0.0

**Q:** What is the $R^2$ (R-squared) value for this model?

**A:** 0.842


-----

#### Activity 5: Calculate VIF using `variance_inflation_factor`

Calculate the VIF value for each independent variable using the `variance_inflation_factor` function of the `statsmodels.stats.outliers_influence` module.


In [None]:
const.columns

Index(['const', 'Gender', 'Height', 'Weight'], dtype='object')

In [None]:
# Calculate the VIF values for 'Gender','Height', 'Weight' independent variables using the 'variance_inflation_factor' function.
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
df2 = pd.DataFrame()
df2['features'] = const.columns
df2['VIF'] = [variance_inflation_factor(const.values,i) for i in range(const.values.shape[1])]
df2 = df2.sort_values(by='VIF',ascending=False)
df2

Unnamed: 0,features,VIF
0,const,115.837795
2,Height,1.003166
1,Gender,1.003058
3,Weight,1.00144


-------

#### Activity 6: Calculate VIF using formula

Calculate the VIF value for each independent variable using the formula $\frac{1}{1 - R^2}$. For this, perform the following tasks:

- Build a linear regression model again taking `Weight` as the dependent variable and `Height` and `Gender` as the independent variables. Then calculate the $R^2$ value for this model.

- Calculate the VIF values using the $\frac{1}{1 - R^2}$ formula.
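
The two steps above can be sketched in plain NumPy. The helper names `r_squared` and `vif`, and the synthetic data, are assumptions made for illustration; the point is only to show that regressing one feature on the others and applying $\frac{1}{1 - R^2}$ yields a large value when a feature is nearly a copy of another and a value near 1 when it is not.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on X, with an intercept column added."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def vif(X, j):
    """VIF of column j: regress it on the remaining columns, then apply 1 / (1 - R^2)."""
    others = np.delete(X, j, axis=1)
    return 1.0 / (1.0 - r_squared(others, X[:, j]))

# Synthetic features: c is nearly a copy of a, while b is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)
```

With statsmodels, the unrounded $R^2$ of a fitted model is available as the `rsquared` attribute of the result, which avoids copying a rounded value from the printed summary.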


In [None]:
# Build a linear regression model taking 'Weight' as the target and 'Height' and 'Gender' as the independent variables.
x2 = df[['Height','Gender']]
y2 = df['Weight']
x2_train,x2_test,y2_train,y2_test = train_test_split(x2,y2,test_size=0.30,random_state=42)
c2 = sm.add_constant(x2_train)
obj2 = sm.OLS(y2_train,c2).fit()
# Print summary
print(obj2.summary())

                            OLS Regression Results                            
Dep. Variable:                 Weight   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.004
Method:                 Least Squares   F-statistic:                    0.2499
Date:                Mon, 07 Aug 2023   Prob (F-statistic):              0.779
Time:                        04:20:05   Log-Likelihood:                -1717.3
No. Observations:                 350   AIC:                             3441.
Df Residuals:                     347   BIC:                             3452.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         97.3884     18.163      5.362      0.0

In [None]:
# Calculate the VIF value for Weight.
# 0.001 is the R-squared from the summary above, rounded to three decimals.
vif_weight = 1/(1-0.001)
vif_weight

1.001001001001001

Repeat the same for `Height` as dependent variable.

In [None]:
# Build a linear regression model taking 'Height' as the target and 'Weight' and 'Gender' as the independent variables.
x3 = df[['Weight','Gender']]
y3 = df['Height']
x3_train,x3_test,y3_train,y3_test = train_test_split(x3,y3,test_size=0.30,random_state=42)
c3 = sm.add_constant(x3_train)
obj3 = sm.OLS(y3_train,c3).fit()
# Print summary
print(obj3.summary())

                            OLS Regression Results                            
Dep. Variable:                 Height   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                 -0.003
Method:                 Least Squares   F-statistic:                    0.5493
Date:                Mon, 07 Aug 2023   Prob (F-statistic):              0.578
Time:                        04:20:05   Log-Likelihood:                -1479.4
No. Observations:                 350   AIC:                             2965.
Df Residuals:                     347   BIC:                             2976.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        168.3137      3.174     53.036      0.0

In [None]:
# Calculate the VIF value for Height.
# 0.003 is the R-squared from the summary above, rounded to three decimals.
vif_height = 1/(1-0.003)
vif_height

1.0030090270812437

Repeat the same for `Gender` as dependent variable.

In [None]:
# Build a linear regression model taking 'Gender' as the target and 'Weight' and 'Height' as the independent variables.
x4 = df[['Weight','Height']]
y4 = df['Gender']
x4_train,x4_test,y4_train,y4_test = train_test_split(x4,y4,test_size=0.30,random_state=42)
c4 = sm.add_constant(x4_train)
obj4 = sm.OLS(y4_train,c4).fit()
# Print summary
print(obj4.summary())

                            OLS Regression Results                            
Dep. Variable:                 Gender   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                 -0.003
Method:                 Least Squares   F-statistic:                    0.5306
Date:                Mon, 07 Aug 2023   Prob (F-statistic):              0.589
Time:                        04:20:05   Log-Likelihood:                -253.40
No. Observations:                 350   AIC:                             512.8
Df Residuals:                     347   BIC:                             524.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3015      0.288      1.047      0.2

In [None]:
# Calculate the VIF value for Gender.
# 0.003 is the R-squared from the summary above, rounded to three decimals.
vif_gender = 1/(1-0.003)
vif_gender

1.0030090270812437

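**Q:** Are the VIF values calculated using the formula $\frac{1}{1 - R^2}$ the same as those returned by the `variance_inflation_factor` function of the `statsmodels.stats.outliers_influence` module for the independent variables?

**A:** Yes, approximately. The formula gives about 1.001 for `Weight` and 1.003 for both `Height` and `Gender`, against 1.00144, 1.003166 and 1.003058 from `variance_inflation_factor`. The small differences arise because the $R^2$ values used in the formula were rounded to three decimal places.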
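
The agreement can be checked in the other direction as well. The sketch below inverts the formula to recover the $R^2$ implied by each VIF value reported in Activity 5 (the dictionary names are illustrative); rounding each result to three decimals gives the $R^2$ shown in the corresponding summary.

```python
# VIF values reported by variance_inflation_factor in Activity 5.
vif_reported = {'Weight': 1.00144, 'Height': 1.003166, 'Gender': 1.003058}

# Invert VIF = 1 / (1 - R^2) to recover the R^2 that each value implies.
implied_r2 = {name: 1 - 1 / value for name, value in vif_reported.items()}

for name, r2 in implied_r2.items():
    # The three-decimal rounding matches the R-squared printed in each summary.
    print(name, round(r2, 6), round(r2, 3))
```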



---

### Submitting the Project:

1. After finishing the project, click on the **Share** button on the top right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, make sure that the '**Anyone on the Internet with this link can view**' option is selected and then click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>

3. The link to the duplicate copy of the notebook (named **YYYY-MM-DD_StudentName_Project66**) will be copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.
   
   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named as **YYYY-MM-DD_StudentName_Project66** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>

---