
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Linear Regression Coefficients and P-values

**Objective**: *Demonstrate feature importance within linear regression.*

In this demo, we will work through a series of steps to show how to examine the coefficients and p-values from a linear regression model.

In [0]:
%run "../../Includes/Classroom-Setup"

Out[2]: DataFrame[]

## Prepare data

### Aggregate our user-level table

Remember that one of our project objectives is to predict a customer's BMI based on their recorded metrics. Therefore, we are interested in a user-level dataset. To prepare it, we'll aggregate our **`adsda.ht_daily_metrics`** table at the user level.

In [0]:
%sql
CREATE OR REPLACE TABLE adsda.ht_user_metrics_lifestyle
USING DELTA LOCATION "/adsda/ht-user-metrics-lifestyle" AS (
  SELECT avg(resting_heartrate) AS avg_resting_heartrate,
         avg(active_heartrate) AS avg_active_heartrate,
         avg(bmi) AS bmi,
         avg(vo2) AS avg_vo2,
         avg(workout_minutes) AS avg_workout_minutes,
         avg(steps) AS steps,
         first(lifestyle) AS lifestyle
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)

num_affected_rows,num_inserted_rows


In [0]:
%sql
SELECT * FROM adsda.ht_user_metrics_lifestyle LIMIT 10

avg_resting_heartrate,avg_active_heartrate,bmi,avg_vo2,avg_workout_minutes,steps,lifestyle
82.68379727873081,139.43487473206162,22.398063650890798,20.99401157735923,5.5026324666656405,5171.495890410959,Sedentary
77.73294228506452,127.05715346661702,25.150812654086295,25.52747526955064,37.2167018100805,7115.591780821917,Weight Trainer
86.51162895591307,147.31573126952208,19.14825600046248,19.448406520026342,45.00008651086257,7257.693150684931,Weight Trainer
77.55054135762612,129.5770039396946,24.2403757288568,21.40130178285617,37.886068725488464,7129.690410958904,Weight Trainer
68.93310580458204,136.50268661405897,30.726595797380472,28.85523016925364,32.24198398599063,6958.378082191781,Weight Trainer
69.31244794850774,167.18585016710105,27.1326690342849,30.939205114246853,5.119426899323105,5128.024657534246,Sedentary
64.64397544858174,152.9654977304546,29.17716498363452,28.92795344089978,5.015081852287961,5167.789041095891,Sedentary
81.33282756113321,137.57131998347788,20.850071485672636,22.56400630458249,42.37552145726232,7281.586301369863,Weight Trainer
64.79507042723496,139.39836367080545,31.386431213715436,29.096510773429188,33.3298371336183,7029.608219178082,Weight Trainer
89.51117796589962,126.57048164605168,19.83075371640161,19.750462151303648,43.30528046136424,7362.769863013698,Weight Trainer
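For readers more comfortable in pandas, the same user-level aggregation can be sketched as a group-by. This is only an illustration: the toy `daily` DataFrame and its values are made up, and it carries just a few of the columns from the real table.

```python
import pandas as pd

# Hypothetical daily records for two devices (values invented for illustration).
daily = pd.DataFrame({
    "device_id": ["a", "a", "b", "b"],
    "resting_heartrate": [60.0, 62.0, 80.0, 84.0],
    "bmi": [22.0, 22.4, 27.0, 27.2],
    "lifestyle": ["Athlete", "Athlete", "Sedentary", "Sedentary"],
})

# One row per device: average the numeric metrics, keep the first lifestyle label.
user_level = (
    daily.groupby("device_id")
    .agg(
        avg_resting_heartrate=("resting_heartrate", "mean"),
        bmi=("bmi", "mean"),
        lifestyle=("lifestyle", "first"),
    )
    .reset_index(drop=True)  # the SQL above does not keep device_id either
)
print(user_level)
```

Each device collapses to a single row, which is the shape the regression below expects.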


### Convert Spark DataFrame to Pandas

We will use this Pandas DataFrame in this demo.

In [0]:
ht_lifestyle_pd_df = spark.table("adsda.ht_user_metrics_lifestyle").toPandas()

View the data

In [0]:
ht_lifestyle_pd_df.head()

Unnamed: 0,avg_resting_heartrate,avg_active_heartrate,bmi,avg_vo2,avg_workout_minutes,steps,lifestyle
0,82.683797,139.434875,22.398064,20.994012,5.502632,5171.49589,Sedentary
1,77.732942,127.057153,25.150813,25.527475,37.216702,7115.591781,Weight Trainer
2,86.511629,147.315731,19.148256,19.448407,45.000087,7257.693151,Weight Trainer
3,77.550541,129.577004,24.240376,21.401302,37.886069,7129.690411,Weight Trainer
4,68.933106,136.502687,30.726596,28.85523,32.241984,6958.378082,Weight Trainer


## Fitting a Linear Regression Model and Examining Coefficients and P Values

This process has a few steps so we'll string everything together and explain step by step.

#### Step 1 - Feature Engineering
Pandas has a built-in function for one-hot encoding called `get_dummies()`. We'll use it here to transform the categorical `lifestyle` feature into a set of numeric indicator columns, one per lifestyle value.

In [0]:
import pandas as pd



`get_dummies()` returns a new DataFrame, so we'll save it to a variable.

In [0]:
lifestyle_dummies_df = pd.get_dummies(ht_lifestyle_pd_df['lifestyle'])

Then we'll join this back onto our original DataFrame.

In [0]:
ht_lifestyle_pd_df = ht_lifestyle_pd_df.join(lifestyle_dummies_df)

Finally, we'll drop the original `lifestyle` column because it is now unnecessary.

In [0]:
ht_lifestyle_pd_df.drop('lifestyle', axis=1, inplace=True)
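The encode, join, and drop steps above can also be collapsed into a single call. The sketch below uses a small made-up `toy` DataFrame rather than the course data. It passes `dtype=float` so the indicator columns are numeric regardless of the pandas version in use. Note that this form prefixes the new column names with `lifestyle_`, unlike the columns produced above.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the user-level table.
toy = pd.DataFrame({
    "steps": [5000.0, 7200.0, 9100.0],
    "lifestyle": ["Sedentary", "Weight Trainer", "Athlete"],
})

# Passing columns=[...] encodes that column in place and drops the original.
encoded = pd.get_dummies(toy, columns=["lifestyle"], dtype=float)
print(encoded)

# Every row has exactly one indicator set to 1.
indicator_cols = [c for c in encoded.columns if c.startswith("lifestyle_")]
row_sums = encoded[indicator_cols].sum(axis=1)
```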

#### Step 2 - Create feature matrix and target
Now we need to create our X and y from our features and target. Recall that our target is the thing we are trying to predict, BMI, given some features about an observation.

In [0]:
X = ht_lifestyle_pd_df.drop('bmi', axis=1)
y = ht_lifestyle_pd_df['bmi']

#### Step 3 - Fit our model
Import statsmodels.

🎯 Note that the statsmodels API refers to our target variable as the endogenous (dependent) variable and to our features as the exogenous (independent) variables.

In [0]:
import statsmodels.api as sm

In [0]:
model = sm.OLS(endog=y, exog=X)
bmi_ols_results = model.fit()

#### Step 4 - Examine results
Once our model is fit, we have helper methods and attributes available. Of interest to us first are the coefficients. The most comprehensive way to view them, along with other fit statistics, is the `.summary()` method.

In [0]:
bmi_ols_results.summary()

0,1,2,3
Dep. Variable:,bmi,R-squared:,0.884
Model:,OLS,Adj. R-squared:,0.884
Method:,Least Squares,F-statistic:,2858.0
Date:,"Sun, 25 Jun 2023",Prob (F-statistic):,0.0
Time:,00:33:59,Log-Likelihood:,-5528.5
No. Observations:,3000,AIC:,11070.0
Df Residuals:,2991,BIC:,11130.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
avg_resting_heartrate,-0.2623,0.008,-33.651,0.000,-0.278,-0.247
avg_active_heartrate,0.0219,0.003,6.522,0.000,0.015,0.028
avg_vo2,-0.1684,0.013,-13.317,0.000,-0.193,-0.144
avg_workout_minutes,-0.5190,0.010,-50.909,0.000,-0.539,-0.499
steps,-0.0013,8.97e-05,-14.517,0.000,-0.001,-0.001
Athlete,79.2708,1.296,61.163,0.000,76.730,81.812
Cardio Enthusiast,74.0344,1.477,50.133,0.000,71.139,76.930
Sedentary,54.9573,1.182,46.489,0.000,52.639,57.275
Weight Trainer,75.0686,1.152,65.157,0.000,72.810,77.328

0,1,2,3
Omnibus:,787.279,Durbin-Watson:,1.95
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4488.49
Skew:,1.122,Prob(JB):,0.0
Kurtosis:,8.557,Cond. No.,965000.0


The coefficients of our model are in the middle of the output. The largest coefficient belongs to our `Athlete` column, which indicates whether or not a person is an athlete. Intuitively, this makes sense: people who are athletes are likely to have a different BMI than people who are not. We also see some negative coefficients. This does not imply that these features are unwanted or unhelpful; it just means that the feature and the target move in opposite directions. Intuitively, this also makes sense: holding the other features fixed, the more average workout minutes someone has, the lower their predicted BMI. Put another way: as workout minutes go 👆🏽, BMI goes 👇🏽.
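One detail helps when reading the lifestyle coefficients: the model was fit with all four indicator columns and no separate constant, so each lifestyle coefficient acts like that group's own intercept. The sketch below uses NumPy least squares on made-up data with three hypothetical groups. It shows that this layout gives the same fit as the more common "constant plus all but one indicator" layout, only parameterized differently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
group = rng.integers(0, 3, size=n)       # three hypothetical lifestyle groups
x = rng.normal(size=n)                   # one numeric feature
y = np.array([50.0, 70.0, 75.0])[group] - 0.5 * x + rng.normal(scale=0.1, size=n)

dummies = np.eye(3)[group]               # full one-hot encoding, one column per group

# Design A: feature + all three indicators, no constant (the layout used in this notebook).
A = np.column_stack([x, dummies])
coef_a, *_ = np.linalg.lstsq(A, y, rcond=None)

# Design B: feature + constant + two indicators (group 0 is the baseline).
B = np.column_stack([x, np.ones(n), dummies[:, 1:]])
coef_b, *_ = np.linalg.lstsq(B, y, rcond=None)

print(coef_a)   # slope, then one intercept per group
print(coef_b)   # slope, baseline intercept, then offsets from the baseline
```

In design A the indicator coefficients are group-level baselines, which is why they can look large next to the slopes of the numeric features.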

Let's examine the p-values.

In [0]:
bmi_ols_results.pvalues

Out[18]:
avg_resting_heartrate    8.135620e-211
avg_active_heartrate      8.127692e-11
avg_vo2                   2.374802e-39
avg_workout_minutes       0.000000e+00
steps                     3.406990e-46
Athlete                   0.000000e+00
Cardio Enthusiast         0.000000e+00
Sedentary                 0.000000e+00
Weight Trainer            0.000000e+00
dtype: float64

It looks like all of our features have very low p-values. A p-value here answers the question: if the true coefficient for this feature were zero (no effect), how likely would we be to see an estimate at least as far from zero as the one we got? Given our p-values, estimates this extreme would be very unlikely if any of these features truly had `no effect`, so we can reject that hypothesis for each of them.
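To connect the summary table to the p-values, here is a minimal sketch that recomputes one p-value from the reported t statistic, assuming the usual two-sided t-test with the residual degrees of freedom shown in the summary. It uses `scipy.stats`, which is not imported elsewhere in this notebook. The t value is the rounded figure from the table, so the result should be close to, but not exactly, the value in `.pvalues`.

```python
from scipy import stats

t_stat = 6.522      # t for avg_active_heartrate, copied from the summary table
df_resid = 2991     # "Df Residuals" from the summary table

# Two-sided p-value: probability of a |t| at least this large if the true coefficient were 0.
p_value = 2 * stats.t.sf(abs(t_stat), df=df_resid)
print(p_value)
```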

We can interpret these results in the context of `feature importance`, starting with the coefficients. We can treat the size of a coefficient as a rough measure of importance in our model, keeping in mind that its magnitude also depends on the units of the feature. Here, our larger coefficients - `Athlete`, for example - can be thought of as being important.
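When reading coefficient size as importance, it helps to remember that the magnitude changes with the units of the feature. The sketch below uses made-up step counts and a hypothetical `slope` helper. It fits the same relationship twice, once with steps and once with thousands of steps. The fit is identical, but the coefficient is 1000 times larger in the second case.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
steps = rng.normal(7000.0, 1000.0, size=n)                   # hypothetical daily step counts
y = 40.0 - 0.002 * steps + rng.normal(scale=0.5, size=n)     # synthetic target

def slope(feature):
    """Least-squares slope of y on a single feature, with an intercept."""
    design = np.column_stack([np.ones(n), feature])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

slope_steps = slope(steps)             # change in y per step
slope_ksteps = slope(steps / 1000.0)   # change in y per 1000 steps
print(slope_steps, slope_ksteps)
```

This is one reason a small coefficient such as the one on `steps` does not by itself mean the feature is unimportant.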

With p-values, the smaller the value, the stronger the evidence that our feature has a nonzero effect on the target! We can think of this as a feature whose effect is statistically distinguishable from zero.

## Nicely done!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>