# Data Science Rapid Start: Preparing a Machine Learning Model

#### Step 1: Configure your Databricks notebook to perform your Moovio task.

**Exercise**: Define the `username` variable below to be your preferred
username.

In [0]:
# TODO
username = 'kwadud'
dbutils.widgets.text("username", username)
spark.sql(f"CREATE DATABASE IF NOT EXISTS dbacademy_{username}")
spark.sql(f"USE dbacademy_{username}")
dbfs_path = f"/dbacademy/{username}/datasciencerapidstart/"
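
The cell above interpolates `username` directly into a database name and a DBFS path, so a value containing characters such as `@` or `.` could produce an invalid identifier. A minimal sketch of one way to normalize the value first, assuming a hypothetical helper `sanitize_username` that is not part of the original notebook:

```python
import re

def sanitize_username(raw: str) -> str:
    """Hypothetical helper: keep only characters that are safe in a database name and path."""
    # Lowercase, then replace each run of characters other than a-z, 0-9, and _ with one underscore.
    cleaned = re.sub(r"[^a-z0-9_]+", "_", raw.strip().lower())
    # Drop leading or trailing underscores left over from replaced characters.
    return cleaned.strip("_")

username = sanitize_username("K.Wadud@example.com")
database_name = f"dbacademy_{username}"
dbfs_path = f"/dbacademy/{username}/datasciencerapidstart/"
print(database_name)
print(dbfs_path)
```

A simple username such as `kwadud` passes through unchanged, so the rest of the notebook behaves the same.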

#### Step 2: Download the data

This data is hosted on a Databricks server.

In [0]:
from urllib.request import urlretrieve

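def retrieve_file(file_name: str) -> bool:
  """Download file_name to the driver, then move it into DBFS."""
  URI = "https://files.training.databricks.com/static/data/health-tracker/" + file_name
  DRIVER_PATH = "file:/databricks/driver/" + file_name

  # urlretrieve saves the file to the driver's current working directory.
  urlretrieve(URI, file_name)
  # Move the file from the driver's local disk into the user's DBFS folder.
  dbutils.fs.mv(DRIVER_PATH, dbfs_path + file_name)
  return True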
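
`retrieve_file` works in two stages: `urlretrieve` saves the file on the driver, and `dbutils.fs.mv` then moves it into DBFS. The sketch below isolates just the path construction so it can be checked outside Databricks. `build_transfer_paths` and `TransferPaths` are hypothetical names, and the sketch assumes, as the original function does, that the driver's working directory is `/databricks/driver`.

```python
from typing import NamedTuple

BASE_URI = "https://files.training.databricks.com/static/data/health-tracker/"
DRIVER_DIR = "file:/databricks/driver/"  # assumed driver working directory

class TransferPaths(NamedTuple):
    source_uri: str
    driver_path: str
    dbfs_target: str

def build_transfer_paths(file_name: str, dbfs_path: str) -> TransferPaths:
    """Hypothetical helper: compute the three locations a file passes through."""
    if not file_name or "/" in file_name:
        raise ValueError(f"expected a bare file name, got {file_name!r}")
    return TransferPaths(
        source_uri=BASE_URI + file_name,
        driver_path=DRIVER_DIR + file_name,
        dbfs_target=dbfs_path + file_name,
    )

paths = build_transfer_paths("agg_data.csv", "/dbacademy/kwadud/datasciencerapidstart/")
print(paths.source_uri)
print(paths.driver_path)
print(paths.dbfs_target)
```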

## Retrieve Data

**Exercise**: Use the function `retrieve_file` to download the following files from
cloud object storage.

- "agg_data.csv"
- "user_data.csv"

In [0]:
# TODO
retrieve_file("agg_data.csv")
retrieve_file("user_data.csv")

### Display the Downloaded Data

The command below shows you where the data is saved.

The `%fs` magic is a "file system" magic that provides a CLI-like interface
to the Databricks File System in your Workspace.

**Exercise**: Replace the value `<FILL_THIS_IN>` with the `username`
you defined above.

In [0]:
%fs ls /dbacademy/kwadud/datasciencerapidstart/

#### Step 3: Load data into a Spark DataFrame

DataFrames are the primary data structure for working with tabular data.

In [0]:
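# inferSchema lets Spark detect numeric columns instead of reading every value as a string.
aggregateDataDF = (
  spark.read
  .option("header", True)
  .option("inferSchema", True)
  .csv(dbfs_path + "agg_data.csv")
)
userDataDF = (
  spark.read
  .option("header", True)
  .option("inferSchema", True)
  .csv(dbfs_path + "user_data.csv")
)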
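
A CSV file carries no type information, so a reader that is given only the `header` option treats the first row as column names and returns every value as a string unless types are inferred or cast. The sketch below shows the same effect with Python's standard `csv` module rather than Spark. The two rows are made-up values for illustration, not taken from `agg_data.csv`.

```python
import csv
import io

# Made-up sample rows for illustration only; the column names follow the notebook.
raw_csv = io.StringIO(
    "_id,avg(BMI),avg(resting_heartrate)\n"
    "u001,23.4,61.0\n"
    "u002,27.9,72.5\n"
)

rows = list(csv.DictReader(raw_csv))  # the first row becomes the column names
print(type(rows[0]["avg(BMI)"]))      # values are still strings at this point

# Cast the measurement columns to float before doing arithmetic on them.
numeric_columns = ["avg(BMI)", "avg(resting_heartrate)"]
typed_rows = [
    {**row, **{col: float(row[col]) for col in numeric_columns}}
    for row in rows
]
print(typed_rows[0])
```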

#### Step 4: Display the data.

First, we'll verify that both data files loaded correctly.

In the Spark DataFrame `aggregateDataDF`, we see average measurements per user over a series of months. `userDataDF` holds self-identified lifestyle data, where each user has one of three attributes: weight trainer, sedentary, or cardio trainer.

**Exercise:** Display the `aggregateDataDF` below.

In [0]:
# TODO
display(aggregateDataDF)

#### Step 5: Exploratory data analysis

First, we take an average for each feature (active heart rate, VO2 max,
resting heart rate and average BMI) across all users. This average gives
us a better sense of the data we are working with.

**Exercise:** Display the `aggregateDataDF` below. We do this a second time
so that you can practice making a Databricks visualization.

In [0]:
# TODO
display(aggregateDataDF)

✏️ **Note**: Before moving forward, try experimenting with Databricks' built-in
visualizations. In the above cell, click the Bar Chart icon in the bottom
left-hand corner. Then, click "Plot Options". Drag and drop the
following variables from "All fields" to "Values": `avg(active_heartrate)`,
`avg(VO2_max)`, `avg(resting_heartrate)`, `avg(BMI)`. Then, click Apply.

#### Step 6: Data manipulation

Our ultimate goal in our Moovio task is to see if we can use the device-reported biometric data to predict the self-identified classification data. In order to do that, we need to join the biometric data with the user data. From there, we will identify the data to be used in our classification model.

**Exercise**: Create the `joinedDF` by joining `aggregateDataDF` and `userDataDF`
on the `_id` column.

🏋🏽‍♀️ You can use this syntax to perform your join:

```
df1.join(df2, "join_column")
```

In [0]:
# TODO
joinedDF = aggregateDataDF.join(userDataDF, "_id")
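
When `join` is given a single column name, it performs an inner join by default: only `_id` values present in both DataFrames survive, and the join column appears once in the result. A plain-Python sketch of that behavior, using made-up rows rather than the real data (`inner_join` is a hypothetical helper):

```python
# Made-up rows for illustration; only the column names come from the notebook.
aggregate_rows = [
    {"_id": "u001", "avg(BMI)": 23.4},
    {"_id": "u002", "avg(BMI)": 27.9},
    {"_id": "u003", "avg(BMI)": 21.1},  # no matching user row
]
user_rows = [
    {"_id": "u001", "lifestyle": "cardio trainer"},
    {"_id": "u002", "lifestyle": "sedentary"},
]

def inner_join(left, right, key):
    """Keep only rows whose key appears on both sides, merging their columns."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in right_by_key.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined

joined_rows = inner_join(aggregate_rows, user_rows, "_id")
for row in joined_rows:
    print(row)
```

The row for `u003` is dropped because it has no counterpart in the user data, which is worth keeping in mind when comparing row counts before and after the join.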

#### Step 7: Exploratory data analysis

Now we explore the joined data to see whether the three lifestyle groups
differ noticeably. A clear difference between the groups suggests that
these features can be used to build a classification model.

In [0]:
display(joinedDF)

Note: This time, visualize this data as a bar chart by configuring it
as follows. Keys: `lifestyle`; Values: `avg(active_heartrate)`,
`avg(VO2_max)`, `avg(resting_heartrate)`, `avg(BMI)`
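
The bar chart configured this way groups rows by `lifestyle` and averages each measurement within the group. A minimal sketch of the same aggregation in plain Python, with made-up values standing in for the real measurements (`mean_by_group` is a hypothetical helper):

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only.
rows = [
    {"lifestyle": "sedentary", "avg(resting_heartrate)": 80.0},
    {"lifestyle": "sedentary", "avg(resting_heartrate)": 76.0},
    {"lifestyle": "cardio trainer", "avg(resting_heartrate)": 58.0},
    {"lifestyle": "cardio trainer", "avg(resting_heartrate)": 62.0},
    {"lifestyle": "weight trainer", "avg(resting_heartrate)": 70.0},
]

def mean_by_group(rows, key, value):
    """Average the `value` column separately for each distinct `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}

group_means = mean_by_group(rows, "lifestyle", "avg(resting_heartrate)")
print(group_means)
```

Gaps between the group means that are large relative to the spread within each group are what suggest the features can separate the classes.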

#### Step 8: Convert joined data to a Pandas DataFrame.

Because this dataset is small, it is easier to train our model using
scikit-learn, which works with Pandas DataFrames rather than Spark DataFrames.
This is why we convert the joined data to a Pandas DataFrame.
Then, we display the data to verify that it was converted correctly.

**Exercise:** Convert `joinedDF` to a Pandas DataFrame.

🐼 You can do this with the `.toPandas()` DataFrame method.

In [0]:
# TODO
joined_pd_df = joinedDF.toPandas()

##### Display the `.head()` of the Pandas DataFrame

In [0]:
joined_pd_df.head()

#### Step 9: Preparing features and target

To prepare our model, we need a feature matrix and a target vector. Here, we
prepare both from the Pandas DataFrame.

🤠 **Note:** the remaining cells in this notebook are advanced and are
included in this introductory notebook for demonstration purposes.

In this cell, we select just the numerical features and then numerically
encode the target vector.

In [0]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
features = joined_pd_df[["avg(BMI)", "avg(active_heartrate)", "avg(resting_heartrate)", "avg(VO2_max)"]]
target = joined_pd_df.lifestyle
target_numerical = le.fit_transform(target)
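
`LabelEncoder` maps each distinct class label to an integer, assigning the integers in sorted order of the labels. The sketch below reproduces that mapping in plain Python to make the encoding explicit. It illustrates the idea and is not scikit-learn's implementation; `encode_labels` is a hypothetical helper.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    return classes, [index_of[label] for label in labels]

labels = ["weight trainer", "sedentary", "cardio trainer", "sedentary"]
classes, encoded = encode_labels(labels)
print(classes)
print(encoded)
```

In the notebook, `le.classes_` holds the same sorted label list, and `le.inverse_transform` turns numeric predictions back into lifestyle names.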

#### Step 10: Train a machine learning model

In the next few steps, we use MLflow to prepare a machine learning model.

First, we split the data into training and testing datasets.

In [0]:
from sklearn.model_selection import train_test_split

(X_train, X_test, y_train, y_test) = train_test_split(features, target_numerical)
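
Called without a size argument, `train_test_split` shuffles the rows and holds out 25% of them for testing. A plain-Python sketch of that idea follows. `simple_split` and its `seed` parameter are hypothetical; fixing the seed, the analogue of scikit-learn's `random_state`, makes the split repeatable.

```python
import math
import random

def simple_split(rows, test_fraction=0.25, seed=0):
    """Hypothetical helper: shuffle row indices, then hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(8))
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))
```

Without a fixed seed, each execution produces a different split, so the accuracies logged in later cells will vary from run to run.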

Next, we import the machine learning models that we will use.

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

Finally, we train each of these models and log its training and test accuracy using MLflow.

In [0]:
import mlflow

with mlflow.start_run() as run:
  model = LogisticRegression(max_iter=100_000)  # max_iter must be an integer
  model.fit(X_train, y_train)

  train_accuracy = model.score(X_train, y_train)
  test_accuracy = model.score(X_test, y_test)
  mlflow.log_param("model", model.__class__)
  mlflow.log_metric("train_accuracy", train_accuracy)
  mlflow.log_metric("test_accuracy", test_accuracy)

with mlflow.start_run() as run:
  model = DecisionTreeClassifier()
  model.fit(X_train, y_train)

  train_accuracy = model.score(X_train, y_train)
  test_accuracy = model.score(X_test, y_test)
  mlflow.log_param("model", model.__class__)
  mlflow.log_metric("train_accuracy", train_accuracy)
  mlflow.log_metric("test_accuracy", test_accuracy)

with mlflow.start_run() as run:
  model = KNeighborsClassifier()
  model.fit(X_train, y_train)

  train_accuracy = model.score(X_train, y_train)
  test_accuracy = model.score(X_test, y_test)
  mlflow.log_param("model", model.__class__)
  mlflow.log_metric("train_accuracy", train_accuracy)
  mlflow.log_metric("test_accuracy", test_accuracy)
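
For a classifier, `model.score` returns accuracy: the fraction of predictions that match the true labels. Comparing that number on the training and test sets is what reveals overfitting. A minimal sketch with made-up labels and predictions (`accuracy` here is a hypothetical stand-in, not the scikit-learn function):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels and predictions for illustration only.
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 2, 0, 0, 2, 1, 1]
print(accuracy(y_true, y_pred))
```

A training accuracy far above the test accuracy suggests the model has memorized the training rows rather than learned a pattern that generalizes.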


#### Step 11: View model training results

In the upper right-hand corner of this notebook, click "Experiment".
A list of all of your runs will appear. At the bottom of that list,
click "Experiment UI" to get more information about each run.

#### Step 12: Make a decision

Do you have enough information to make a decision about what type of
model to use? Why or why not? How would you move forward?