
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Logistic Regression Lab 1

**Objectives**:
1. Develop a single-variable logistic regression model.
1. Develop a multi-variable logistic regression model.

In [0]:
%run ../../Includes/Classroom-Setup

## Setup

### Load the Data

The `Includes/Classroom-Setup` notebook has made an aggregate table of data
available to us via the Metastore associated with our workspace. We can load
the data as a pandas DataFrame using the cell below.

The first command loads the table using the Metastore reference. The `.toPandas()`
method then converts the Spark DataFrame to a pandas DataFrame. We will use the
pandas DataFrame with scikit-learn throughout this module.

In [0]:
ht_agg_spark_df = spark.read.table("ht_agg")
ht_agg_pandas_df = ht_agg_spark_df.toPandas()

### Framing a Business Problem

Over the next few labs, we will use supervised machine learning
to answer a new business question:

> Given a user's fitness profile, can we predict the lifestyle of that user?

Like the regression problem we previously solved,
our **inputs** will be fitness profile information. This is, however, a classification
problem and will have a different **output**, lifestyle.

### The Scikit-Learn `estimator` API

Once more, we will use the sklearn **estimator** API.

The good news is that we use the exact same pattern for classification as we did
for regression.

```
estimator.fit(features, target)
estimator.score(features, target)
```
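As a minimal sketch of this pattern applied to classification, the cell below fits a classifier on a tiny made-up dataset. The arrays `features_demo` and `target_demo` are hypothetical toy values, not the lab data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature column, two well-separated classes.
features_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
target_demo = np.array([0, 0, 0, 1, 1, 1])

estimator = LogisticRegression()

# Same two calls as for regression: fit, then score.
estimator.fit(features_demo, target_demo)
accuracy = estimator.score(features_demo, target_demo)

print("accuracy:", accuracy)
```

The only visible difference from the regression labs is the estimator class and the meaning of the score.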

## Demonstration

### Single-Variable Logistic Regression


First, we'll import our estimator of choice, a predictor called Logistic Regression.

In [0]:
from sklearn.linear_model import LogisticRegression

Then, we'll instantiate our estimator, that is, create an instance of it.

In [0]:
lr = LogisticRegression(max_iter=10000)

### Create Feature Vectors

🧐 sklearn wants the shape of our data to be a matrix for our feature(s)
and the shape of our target to be a vector. This is why you will see two square
brackets around our feature - a matrix - and a single set of square brackets
around our target - a vector.

In [0]:
X = ht_agg_pandas_df[['mean_bmi']]
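To see the difference between double and single brackets, the sketch below checks the shapes on a small made-up DataFrame. The values and lifestyle labels in `df_demo` are hypothetical and only stand in for the lab table.

```python
import pandas as pd

# Hypothetical stand-in for the lab table.
df_demo = pd.DataFrame({
    "mean_bmi": [21.5, 24.0, 30.2],
    "lifestyle": ["active", "active", "sedentary"],
})

X_demo = df_demo[["mean_bmi"]]  # double brackets: a 2-D DataFrame (matrix)
y_demo = df_demo["lifestyle"]   # single brackets: a 1-D Series (vector)

print(type(X_demo).__name__, X_demo.shape)  # DataFrame (3, 1)
print(type(y_demo).__name__, y_demo.shape)  # Series (3,)
```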

### Create Target Vector

An additional step, not required when performing linear regression,
is necessary to encode our target vector when performing a logistic regression.

This has to do with the way the lifestyle labels are stored.

In [0]:
ht_agg_pandas_df["lifestyle"].unique()

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Each lifestyle is recorded as a string value.

sklearn estimators expect numerical feature values, and encoding the target
numerically keeps our data consistent. For this reason, we will numerically
encode our lifestyle values.

We will use an sklearn transformer to do this encoding.

An sklearn transformer is like an sklearn estimator, except that rather than
using it to `.predict()` or `.score()`, we will use it to `.transform()` data.

```
transformer.fit(data)
transformer.transform(data)
```

In [0]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
lifestyle = ht_agg_pandas_df['lifestyle']
le.fit(lifestyle)
y = le.transform(lifestyle)
y
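To check which integer stands for which label, a fitted `LabelEncoder` exposes the learned labels in `classes_` and can map integers back with `inverse_transform()`. The sketch below uses hypothetical labels, not the actual values in the lab table.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real lifestyle values may differ.
labels_demo = ["sedentary", "active", "sedentary", "athlete"]

le_demo = LabelEncoder()
encoded_demo = le_demo.fit_transform(labels_demo)  # fit and transform in one call

# classes_ holds the sorted unique labels; a label's position is its integer code.
print(list(le_demo.classes_))   # ['active', 'athlete', 'sedentary']
print(list(encoded_demo))       # [2, 0, 2, 1]

# inverse_transform maps the integer codes back to the original strings.
decoded_demo = le_demo.inverse_transform(encoded_demo)
print(list(decoded_demo))
```

Running `le.classes_` on the encoder fitted above would show the same mapping for the real lifestyle labels.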

### Fit the Model

Next, fit our model, using the same `.fit(feature, target)` pattern we learned earlier.

The model will learn the relationship between the features and the target.
This step is known as "training" or "fitting" the model.

In [0]:
lr.fit(X, y)

### Evaluate the model

Finally, use the `.score()` method to evaluate the single-variable model.

Note that a classifier estimator in sklearn uses accuracy for scoring
by default.

In [0]:
lr.score(X, y)
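Accuracy is the fraction of rows whose predicted class matches the true class. The sketch below computes it by hand on hypothetical toy data and compares it with `.score()`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, not the lab table.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf_toy = LogisticRegression(max_iter=10000).fit(X_toy, y_toy)

# Accuracy by hand: the mean of a boolean array is the share of True values.
predictions = clf_toy.predict(X_toy)
manual_accuracy = np.mean(predictions == y_toy)

# .score() reports the same quantity for a classifier.
print(manual_accuracy, clf_toy.score(X_toy, y_toy))
```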

## Your Turn

### Exercise 1: Single-Variable Logistic Regression

Fit a single-variable logistic model for each of the following features.
1. prepare a feature matrix for each of these features:
 - `mean_bmi`
 - `mean_active_heartrate`
 - `mean_resting_heartrate`
 - `mean_vo2`
1. fit a single-variable logistic model for each of these features
1. evaluate each of these models using `.score()` and print the result

In [0]:
# TODO
X_bmi = ht_agg_pandas_df[['mean_bmi']]
X_active_heartrate = ht_agg_pandas_df[['mean_active_heartrate']]
X_resting_heartrate = ht_agg_pandas_df[['mean_resting_heartrate']]
X_vo2 = ht_agg_pandas_df[['mean_vo2']]

lr_bmi = LogisticRegression(max_iter=10000)
lr_active_heartrate = LogisticRegression(max_iter=10000)
lr_resting_heartrate = LogisticRegression(max_iter=10000)
lr_vo2 = LogisticRegression(max_iter=10000)

lr_bmi.fit(X_bmi, y)
lr_active_heartrate.fit(X_active_heartrate, y)
lr_resting_heartrate.fit(X_resting_heartrate, y)
lr_vo2.fit(X_vo2, y)

print("bmi: ", lr_bmi.score(X_bmi, y))
print("active_heartrate: ", lr_active_heartrate.score(X_active_heartrate, y))
print("resting_heartrate: ", lr_resting_heartrate.score(X_resting_heartrate, y))
print("vo2: ", lr_vo2.score(X_vo2, y))
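The same steps can also be written as a loop, which avoids repeating the code for each feature. This is one possible sketch, run on randomly generated stand-in data; `df_loop` and `y_loop` are hypothetical and the scores it prints say nothing about the lab data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: three classes, two noisy features.
rng = np.random.default_rng(0)
n_rows = 60
y_loop = np.repeat([0, 1, 2], n_rows // 3)
df_loop = pd.DataFrame({
    "mean_bmi": 30 - 3 * y_loop + rng.normal(0, 1, n_rows),
    "mean_vo2": 25 + 8 * y_loop + rng.normal(0, 2, n_rows),
})

# Fit and score one single-variable model per column.
single_scores = {}
for column in df_loop.columns:
    X_col = df_loop[[column]]  # double brackets keep the matrix shape
    model = LogisticRegression(max_iter=10000).fit(X_col, y_loop)
    single_scores[column] = model.score(X_col, y_loop)

for column, score in single_scores.items():
    print(f"{column}: {score:.3f}")
```

With the lab data, the loop would iterate over the four feature column names and use the encoded `y` as the target.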

**Question**: Which of these single-variable models is the best at predicting lifestyle?

## Demonstration
### Multiple-Variable Logistic Regression

Our next set of models will use more than one feature but still have
a single target.

### Display results from previous models

Before we train this new model, let's display the results from the previous models
for comparison.

In [0]:
print("bmi:               ", lr_bmi.score(X_bmi, y))
print("active_heartrate:  ", lr_active_heartrate.score(X_active_heartrate, y))
print("resting_heartrate: ", lr_resting_heartrate.score(X_resting_heartrate, y))
print("vo2:               ", lr_vo2.score(X_vo2, y))

In [0]:
X_bmi_act_hr = ht_agg_pandas_df[['mean_bmi', 'mean_active_heartrate']]
lr_bmi_act_hr = LogisticRegression(max_iter=10000)
lr_bmi_act_hr.fit(X_bmi_act_hr, y)
print("bmi_act_hr: ", lr_bmi_act_hr.score(X_bmi_act_hr, y))

## Your Turn

### Exercise 2: Multi-Variable Logistic Regression
😎 Note that the two-feature model from the demonstration above performs better than any of the single-feature models.

Fit four multiple-variable logistic models.
1. prepare a feature matrix for each model
1. fit a logistic model for each feature matrix
1. evaluate each model using `.score()` and print the result

👨🏼‍🎤 Did you try any models with more than two features? Multiple-variable
logistic regression models can use any or all of the features.

In [0]:
# TODO
X_1 = ht_agg_pandas_df[['mean_bmi','mean_active_heartrate','mean_resting_heartrate','mean_vo2']]
X_2 = ht_agg_pandas_df[['mean_bmi','mean_active_heartrate']]
X_3 = ht_agg_pandas_df[['mean_bmi','mean_active_heartrate','mean_resting_heartrate']]
X_4 = ht_agg_pandas_df[['mean_active_heartrate','mean_vo2']]

lr_1 = LogisticRegression(max_iter=10000)
lr_2 = LogisticRegression(max_iter=10000)
lr_3 = LogisticRegression(max_iter=10000)
lr_4 = LogisticRegression(max_iter=10000)

lr_1.fit(X_1,y)
lr_2.fit(X_2,y)
lr_3.fit(X_3,y)
lr_4.fit(X_4,y)

print("model 1: ", lr_1.score(X_1,y))
print("model 2: ", lr_2.score(X_2,y))
print("model 3: ", lr_3.score(X_3,y))
print("model 4: ", lr_4.score(X_4,y))
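To try every combination of features rather than four hand-picked ones, `itertools.combinations` can enumerate the subsets. The sketch below is one way to do this on randomly generated stand-in data; `df_combo`, `y_combo`, and the resulting scores are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in data: three classes, three noisy features.
rng = np.random.default_rng(0)
n_rows = 60
y_combo = np.repeat([0, 1, 2], n_rows // 3)
df_combo = pd.DataFrame({
    "mean_bmi": 30 - 3 * y_combo + rng.normal(0, 1, n_rows),
    "mean_active_heartrate": 140 - 5 * y_combo + rng.normal(0, 4, n_rows),
    "mean_vo2": 25 + 8 * y_combo + rng.normal(0, 2, n_rows),
})

feature_names = list(df_combo.columns)
combo_scores = {}

# Every subset of two or more features gets its own model.
for size in range(2, len(feature_names) + 1):
    for subset in combinations(feature_names, size):
        X_subset = df_combo[list(subset)]
        model = LogisticRegression(max_iter=10000).fit(X_subset, y_combo)
        combo_scores[subset] = model.score(X_subset, y_combo)

for subset, score in combo_scores.items():
    print(", ".join(subset), "->", round(score, 3))
```

Three features give four subsets of size two or more. The four features in the lab table would give eleven.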

**Question**: Which of these models is the best at predicting lifestyle?

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>