<a href="https://colab.research.google.com/github/kyook17/UIUC_BADM/blob/main/BADM576_DS/Explaining_Variance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

import numpy as np

import statsmodels.formula.api as smf  # we will use the "ols" function to fit an explanatory regression model

# Dataset on Car Prices

In [None]:
cars_df = pd.read_csv("car_price_variance.csv")

In [None]:
cars_df

Unnamed: 0,Brand,Model,Trim,Price
0,Toyota,Camry,LE,25000
1,Toyota,Camry,XLE,30000
2,Toyota,Corolla,LE,20000
3,Toyota,Corolla,XSE,25500
4,Toyota,RAV4,XLE,28000
5,Toyota,RAV4,Adventure,33000
6,Toyota,Highlander,LE,35000
7,Toyota,Highlander,XLE,40000
8,Honda,Accord,LX,24000
9,Honda,Accord,Touring,36000


In [None]:
cars_df.columns

Index(['Brand', 'Model', 'Trim', ' Price '], dtype='object')

The `Price` column name has extra leading and trailing spaces, which must be stripped.

In [None]:
cars_df.columns = cars_df.columns.str.strip()

In [None]:
cars_df.columns

Index(['Brand', 'Model', 'Trim', 'Price'], dtype='object')

In [None]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Brand   27 non-null     object
 1   Model   27 non-null     object
 2   Trim    27 non-null     object
 3   Price   27 non-null     object
dtypes: object(4)
memory usage: 992.0+ bytes


We will convert the column dtypes: `Price` to a float and the other columns to categories.

In [None]:
cars_df["Price"] = cars_df["Price"].str.replace(',', '').astype("float64")  # Don't run this twice: after the first run Price is no longer a string column, so .str raises an error. If needed, read the data again.

In [None]:
cars_df["Price"] = cars_df["Price"]/1000  # Scale Price down to thousands so that the variance (Mean Squared Deviation or Error) values are not very large.

# Caution: each additional run of this cell divides Price by 1000 again, so run it only once.
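Both cautions above come from modifying `cars_df` in place. One way around them, shown here only as a sketch, is to put the cleaning steps in a function that always starts from the raw data. The helper name `clean_price` and the two raw rows are hypothetical, and the raw price format (`"25,000"`) is an assumption inferred from the comma replacement above.

```python
import pandas as pd

def clean_price(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: stripped column names, Price as float in thousands."""
    out = raw_df.copy()                      # never modify the raw data
    out.columns = out.columns.str.strip()    # ' Price ' -> 'Price'
    out["Price"] = out["Price"].str.replace(",", "").astype("float64") / 1000
    return out

# Hypothetical raw rows for illustration (assumed format, not the real CSV).
raw = pd.DataFrame({"Brand": ["Toyota", "Honda"], " Price ": ["25,000", "36,000"]})

clean = clean_price(raw)
clean_again = clean_price(raw)  # safe to repeat: raw is untouched
print(clean)
```

Because the function works on a copy, calling it any number of times gives the same result.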

In [None]:
cars_df['Brand'] = cars_df['Brand'].astype('category')
cars_df['Model'] = cars_df['Model'].astype('category')
cars_df['Trim'] = cars_df['Trim'].astype('category')

In [None]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Brand   27 non-null     category
 1   Model   27 non-null     category
 2   Trim    27 non-null     category
 3   Price   27 non-null     float64 
dtypes: category(3), float64(1)
memory usage: 2.0 KB


In [None]:
cars_df

Unnamed: 0,Brand,Model,Trim,Price
0,Toyota,Camry,LE,25.0
1,Toyota,Camry,XLE,30.0
2,Toyota,Corolla,LE,20.0
3,Toyota,Corolla,XSE,25.5
4,Toyota,RAV4,XLE,28.0
5,Toyota,RAV4,Adventure,33.0
6,Toyota,Highlander,LE,35.0
7,Toyota,Highlander,XLE,40.0
8,Honda,Accord,LX,24.0
9,Honda,Accord,Touring,36.0


# Studying Variance in Price and Explaining it through Variance in other Variables.

In [None]:
np.var(cars_df["Price"]).round() # This is the overall variance in the car prices. This is also equal to MSD (Mean Squared Deviation).

11099.0

### This quantity represents the total uncertainty we have about Car Prices. If we try to make a prediction, the level of uncertainty will be reflected in the poor quality of our predictions.
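To make "variance equals mean squared deviation" concrete, here is a small sketch that computes it by hand. It uses only the first four prices shown in the table above (in thousands), so the numbers differ from the full-data result.

```python
import numpy as np
import pandas as pd

# First four prices from the table above, in thousands (illustration only).
prices = np.array([25.0, 30.0, 20.0, 25.5])

deviations = prices - prices.mean()      # distance of each price from the mean
msd = np.mean(deviations ** 2)           # mean squared deviation

population_var = np.var(prices)          # numpy default: ddof=0, divide by n
sample_var = pd.Series(prices).var()     # pandas default: ddof=1, divide by n - 1

print(msd, population_var, sample_var)
```

The hand-computed value agrees with `np.var`. The pandas default divides by `n - 1`, which is why the grouped calculations in this notebook pass `ddof=0` explicitly.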

### You can probably see that this variance in Price arises for a few explainable reasons.

What are those reasons?

## Reasons

#### Brand as a reason

If Brand helps explain the variance in Price, then the variance in Price within each Brand should be lower than the overall variance.

In [None]:
# Group by 'Brand'
brand_groups = cars_df.groupby('Brand')['Price'].agg(var=lambda x: x.var(ddof=0), count='count').round()  # The lambda lets us call var with ddof=0 (the pandas default is ddof=1).
# ddof=0 gives the population variance (divide by n) rather than the sample variance (divide by n - 1). With a large enough sample, the difference is negligible.
brand_groups


Brand,var,count
BMW,276.0,4
Honda,48.0,8
Mercedes,769.0,4
Rolls Royce,3787.0,3
Toyota,35.0,8


In [None]:
# Calculating the combined variance (weighted average of variances)

combined_variance = np.sum(brand_groups["var"]* brand_groups["count"])/ np.sum(brand_groups["count"])

combined_variance.round() # This shows the left-over variance in Car Price after we use Variance in Car Brand as an explanation.

600.0
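The weighted average above is the "within-group" part of a standard identity, the decomposition of total variance. As a sketch in notation that is not in the original notebook: let $N$ be the number of cars, $n_g$ the number of cars in brand $g$, $\bar{y}$ the overall mean price, $\bar{y}_g$ the mean price in brand $g$, and $\sigma_g^2$ the population variance (ddof=0) within brand $g$. Then

$$
\underbrace{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\bar{y}\right)^2}_{\text{total variance}}
=
\underbrace{\sum_{g}\frac{n_g}{N}\,\sigma_g^2}_{\text{within-brand (left-over)}}
+
\underbrace{\sum_{g}\frac{n_g}{N}\left(\bar{y}_g-\bar{y}\right)^2}_{\text{between-brand (explained)}}
$$

The combined variance computed above is the first term on the right. Whatever remains of the total is the part attributable to differences between brand means.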

How much of the variance in Car Price could we explain away using Car Brand?

## Using Both Brand and Model as explanations of variance in Car Prices.

This is for you to do.

In [None]:
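# Group by 'Brand' and 'Model'
brand_model_groups = cars_df.groupby(['Brand', 'Model'])['Price'].agg(var=lambda x: x.var(ddof=0), count='count').round()  # The lambda lets us call var with ddof=0 (the pandas default is ddof=1).
# ddof=0 gives the population variance (divide by n) rather than the sample variance (divide by n - 1). With a large enough sample, the difference is negligible.
# Brand and Model are categorical, so every Brand-Model combination is listed; combinations that do not occur have count 0 and NaN variance.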
brand_model_groups.reset_index()

Unnamed: 0,Brand,Model,var,count
0,BMW,3 Series,0.0,1
1,BMW,5 Series,0.0,1
2,BMW,7 Series,0.0,1
3,BMW,Accord,,0
4,BMW,C-Class,,0
...,...,...,...,...
90,Toyota,Passport,,0
91,Toyota,Phantom,,0
92,Toyota,RAV4,6.0,2
93,Toyota,S-Class,,0


In [None]:
# Calculating the combined variance (weighted average of variances)

combined_variance = np.sum(brand_model_groups["var"]*brand_model_groups["count"])/ np.sum(brand_model_groups["count"])

combined_variance.round() # This shows the left-over variance in Car Price after we use both Car Brand and Car Model as explanations.

9.0
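The grouped table above lists Brand-Model pairs that never occur (count 0, NaN variance) because both columns are categorical. A possible variation, sketched here on four rows from the table above rather than the full dataset, is to pass `observed=True` to `groupby` so that only combinations present in the data are kept.

```python
import pandas as pd

# Four rows from the table above, prices in thousands (illustration only).
df = pd.DataFrame({
    "Brand": ["Toyota", "Toyota", "Honda", "Honda"],
    "Model": ["Camry", "Camry", "Accord", "Accord"],
    "Price": [25.0, 30.0, 24.0, 36.0],
})
df["Brand"] = df["Brand"].astype("category")
df["Model"] = df["Model"].astype("category")

# observed=True keeps only Brand-Model pairs that actually appear.
observed_groups = (
    df.groupby(["Brand", "Model"], observed=True)["Price"]
      .agg(var=lambda x: x.var(ddof=0), count="count")
)

# Weighted average of the within-group variances.
leftover = (observed_groups["var"] * observed_groups["count"]).sum() / observed_groups["count"].sum()
print(observed_groups)
print(leftover)
```

The weighted average is unchanged either way, since rows with NaN variance are skipped by the sum; the option only removes the empty rows from the table.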

# Explaining Variance - Using a different approach (modeling based)

### **Stupid Model (M1)**

This model has no information about a car, so it predicts the overall mean price for every car. You can expect it to have great difficulty making predictions.

In [None]:
M1_pred = cars_df.Price.mean().round()

In [None]:

def mse(y, pred_y):

  error = y - pred_y # Actual - prediction

  sqrd_error = error * error # Square it up

  mean_sqrd_error = np.mean(sqrd_error) # calculate mean of squared errors

  return mean_sqrd_error


In [None]:
# Variance or Mean Square Deviation or Mean Squared Error
mse(y = cars_df["Price"], pred_y = M1_pred).round() # Does this match with what we had as the overall or total variance above?

11099.0
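Why use the mean as the no-information prediction? Among all constant predictions, the mean gives the smallest mean squared error, and that smallest error is the variance. The sketch below checks this numerically on the first ten prices shown earlier (in thousands); the grid of candidate constants is an arbitrary choice for illustration.

```python
import numpy as np

# First ten prices from the table above, in thousands (illustration only).
prices = np.array([25.0, 30.0, 20.0, 25.5, 28.0, 33.0, 35.0, 40.0, 24.0, 36.0])

def mse(y, pred_y):
    """Mean of squared differences between actual and predicted values."""
    return np.mean((y - pred_y) ** 2)

# Try many constant predictions between the lowest and highest price.
candidates = np.linspace(prices.min(), prices.max(), 201)
errors = np.array([mse(prices, c) for c in candidates])

mse_at_mean = mse(prices, prices.mean())
print(mse_at_mean, errors.min(), np.var(prices))
```

No candidate on the grid beats the mean, and the error at the mean equals `np.var(prices)`.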

### **Model that incorporates information about the car Brand (M2)**

In [None]:
M2_pred = cars_df.groupby('Brand')['Price'].transform('mean')

In [None]:
cars_df["Brand_means"] = cars_df.groupby('Brand')['Price'].transform('mean').round()

In [None]:
cars_df

Unnamed: 0,Brand,Model,Trim,Price,Brand_means
0,Toyota,Camry,LE,25.0,30.0
1,Toyota,Camry,XLE,30.0,30.0
2,Toyota,Corolla,LE,20.0,30.0
3,Toyota,Corolla,XSE,25.5,30.0
4,Toyota,RAV4,XLE,28.0,30.0
5,Toyota,RAV4,Adventure,33.0,30.0
6,Toyota,Highlander,LE,35.0,30.0
7,Toyota,Highlander,XLE,40.0,30.0
8,Honda,Accord,LX,24.0,31.0
9,Honda,Accord,Touring,36.0,31.0


In [None]:
mse(y = cars_df["Price"], pred_y = M2_pred).round() # Does this match what we calculated above as the combined variance within each brand?

600.0
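The two routes agree because predicting each car's brand mean leaves exactly the within-brand deviations as errors. A minimal check on the ten rows shown in the table above (only two of the Honda rows appear there, so the numbers are illustrative):

```python
import numpy as np
import pandas as pd

# Ten rows from the table above, prices in thousands (illustration only).
df = pd.DataFrame({
    "Brand": ["Toyota"] * 8 + ["Honda"] * 2,
    "Price": [25.0, 30.0, 20.0, 25.5, 28.0, 33.0, 35.0, 40.0, 24.0, 36.0],
})

# Route 1: predict each car's brand mean, then compute the MSE.
group_mean_pred = df.groupby("Brand")["Price"].transform("mean")
mse_grouped = np.mean((df["Price"] - group_mean_pred) ** 2)

# Route 2: weighted average of the within-brand population variances.
stats = df.groupby("Brand")["Price"].agg(var=lambda x: x.var(ddof=0), count="count")
weighted_within = (stats["var"] * stats["count"]).sum() / stats["count"].sum()

print(mse_grouped, weighted_within)
```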

In [None]:
1 - (600/ 11099)

0.9459410757725921
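The ratio above is the share of total variance explained by Brand, the same quantity that regression software reports as R-squared. A small sketch that wraps the arithmetic in a function (the name `explained_share` is hypothetical) and reuses the two rounded values from this notebook:

```python
def explained_share(total_variance: float, leftover_variance: float) -> float:
    """Fraction of the total variance that is explained (1 - leftover / total)."""
    return 1 - leftover_variance / total_variance

# Rounded values computed earlier in this notebook.
share_brand = explained_share(total_variance=11099, leftover_variance=600)
print(round(share_brand, 3))
```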

In [None]:
import statsmodels.formula.api as smf # we will use the "ols" function  to fit an explanatory regression model

# fit the explanatory MLR model

M2_model = smf.ols(formula='Price ~ (Brand)', data = cars_df).fit()


print(M2_model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.946
Model:                            OLS   Adj. R-squared:                  0.936
Method:                 Least Squares   F-statistic:                     96.23
Date:                Tue, 26 Mar 2024   Prob (F-statistic):           1.32e-13
Time:                        17:48:23   Log-Likelihood:                -124.67
No. Observations:                  27   AIC:                             259.3
Df Residuals:                      22   BIC:                             265.8
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               60.9875 
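A regression on a single categorical variable reproduces the group means: the intercept is the mean of the reference level and each other coefficient is a difference from it. The sketch below illustrates this with `np.linalg.lstsq` swapped in for `smf.ols`, on the ten rows shown in the earlier table. The dummy coding is written by hand with Honda as the reference level, which is an assumption made for this example.

```python
import numpy as np

# Ten rows from the table above, prices in thousands (illustration only).
brands = np.array(["Toyota"] * 8 + ["Honda"] * 2)
prices = np.array([25.0, 30.0, 20.0, 25.5, 28.0, 33.0, 35.0, 40.0, 24.0, 36.0])

# Design matrix: intercept column plus a 0/1 dummy for Toyota (Honda is the reference).
X = np.column_stack([np.ones(len(prices)), (brands == "Toyota").astype(float)])
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
fitted = X @ coef

honda_mean = prices[brands == "Honda"].mean()
toyota_mean = prices[brands == "Toyota"].mean()

# R-squared from the same "1 - leftover / total" arithmetic used above.
r_squared = 1 - np.mean((prices - fitted) ** 2) / np.var(prices)
print(coef, honda_mean, toyota_mean, r_squared)
```

The intercept equals the Honda mean, and the intercept plus the Toyota coefficient equals the Toyota mean, so the fitted values are the brand means.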

### **Model that incorporates information about the car Brand and Model (M3)**

In [None]:
# Using the grouped-mean approach.

In [None]:
M3_pred = cars_df.groupby(['Brand', "Model"])['Price'].transform('mean')

In [None]:
mse(y = cars_df["Price"], pred_y = M3_pred).round()

9.0

In [None]:
1 - (9/11099)

0.9991891161365889

In [None]:
# Using the linear model approach.

In [None]:
import statsmodels.formula.api as smf # we will use the "ols" function  to fit an explanatory regression model

# fit the explanatory MLR model

M3_model = smf.ols(formula='Price ~ Brand + Model', data = cars_df).fit()


print(M3_model.summary())

                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.997
Method:                 Least Squares   F-statistic:                     533.9
Date:                Tue, 26 Mar 2024   Prob (F-statistic):           2.36e-10
Time:                        17:52:48   Log-Likelihood:                -68.317
No. Observations:                  27   AIC:                             174.6
Df Residuals:                       8   BIC:                             199.3
Df Model:                          18                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               41.2500 
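Both summaries also report an adjusted R-squared, which penalizes a model for the number of coefficients it uses. As a sketch, the standard formula for a model with an intercept can be applied to the rounded ratios computed in this notebook, together with the observation count and `Df Model` values from the two summaries. The function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2: float, n_obs: int, df_model: int) -> float:
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - df_model - 1)

# R-squared values from the rounded ratios above; n and Df Model from the summaries.
adj_m2 = adjusted_r2(1 - 600 / 11099, n_obs=27, df_model=4)    # Brand only
adj_m3 = adjusted_r2(1 - 9 / 11099, n_obs=27, df_model=18)     # Brand + Model
print(round(adj_m2, 3), round(adj_m3, 3))
```

M3 spends 18 degrees of freedom on 27 observations, leaving only 8 for the residuals, so its adjusted value is pulled down a little more than M2's.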