# Part 1: Data Prep

This part of the tutorial covers the following steps:
- Visualize the data using Seaborn and Matplotlib.
- Construct a new binary outcome variable labeling higher quality wines.

This notebook is designed to focus on data preparation as a set of concerns distinct from other parts of the model development workflow.

In [0]:
# Multiple people may be running this workshop at the same time.  We want each
# participant to have their own set of files.  To create your own file storage area,
# put your name below:

your_name = ""

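try:
  # When the notebook is run as a job, the caller supplies the name via a widget.
  run_name = dbutils.widgets.get("run_name")
except Exception:
  # No "run_name" widget is defined: fall back to the name entered above.
  run_name = your_name.strip()
run_name = "no_name" if run_name == "" else run_name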
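The fallback order above (widget value, then the typed name, then `no_name`) can be checked outside Databricks by factoring it into a plain function. This is only a sketch: `resolve_run_name` and its parameters are hypothetical names, and it assumes the caller passes `None` when no widget is defined.

```python
def resolve_run_name(widget_value, your_name, default="no_name"):
    """Pick the run name: widget value first, then the typed name, then a default."""
    # Assumption: widget_value is None when no "run_name" widget exists.
    candidate = widget_value if widget_value is not None else your_name.strip()
    # An empty result from either source falls through to the default.
    return candidate if candidate != "" else default


print(resolve_run_name(None, "  alice "))
print(resolve_run_name("ci-run", "alice"))
```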

## Load data

This notebook assumes that a set of data is available for iterative training and validation. That dataset may be the output of some other ETL or data engineering process. The data science team will do some additional work to prepare the data for model training, which could include data quality checks, feature engineering, and target variable engineering.

In [0]:
import pandas as pd

data = pd.read_csv("/dbfs/tutorials/wine-data/wine-quality-all-raw.csv")
data = data.drop(["Unnamed: 0"], axis=1)
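A column named `Unnamed: 0` is what pandas produces when a CSV was written with its row index and then read back without `index_col`. Whether the raw file here was produced that way is an assumption. The made-up two-row frame below illustrates the mechanism:

```python
import io

import pandas as pd

# Made-up values standing in for the wine data.
toy = pd.DataFrame({"alcohol": [9.4, 12.1], "quality": [5, 7]})

# to_csv writes the row index by default, as a first column with an empty header.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# read_csv names that headerless column "Unnamed: 0".
round_trip = pd.read_csv(buffer)
print(list(round_trip.columns))

# Dropping it recovers the original columns.
cleaned = round_trip.drop(["Unnamed: 0"], axis=1)
print(list(cleaned.columns))
```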

## Visualize data

Before training a model, explore the dataset using Seaborn and Matplotlib.
This notebook will eventually be run by an automated process: an Azure Pipeline triggered by a version-controlled code change, running "headless" and without human review. While we're developing the data prep code, however, we'll typically plot variables to inform our decisions about what data preparation is necessary before handing off to model training.

Plot a histogram of the dependent variable, quality.

In [0]:
import seaborn as sns
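# distplot is deprecated since seaborn 0.11; histplot is its replacement for histograms.
# discrete=True centers one bar on each integer quality score.
sns.histplot(data.quality, discrete=True)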

Quality scores look roughly normally distributed, with values ranging from 3 to 9.
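Because the scores are integers, a frequency table is a quick way to cross-check what the histogram shows. The sketch below uses made-up scores. In the notebook the same calls would be applied to `data.quality`.

```python
import pandas as pd

# Made-up scores for illustration only; not taken from the wine dataset.
scores = pd.Series([5, 6, 6, 7, 5, 6, 3, 9, 6, 5], name="quality")

# Count how many wines received each score, ordered by score.
counts = scores.value_counts().sort_index()
print(counts)

# Range and most common score summarize the shape at a glance.
summary = {
    "min": int(scores.min()),
    "max": int(scores.max()),
    "mode": int(scores.mode().iloc[0]),
}
print(summary)
```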

Define a wine as high quality if it has quality >= 7.

Again, this notebook focuses on preprocessing the data to prepare it for modeling. In this example data prep notebook, the only substantive work done is replacing the numeric `quality` score with this new binary indicator, which keeps the column name `quality`.

In [0]:
high_quality = (data.quality >= 7).astype(int)
data.quality = high_quality
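After thresholding, it can be worth checking how many wines land in each class, since the share of positives affects how later metrics should be read. A minimal sketch on made-up scores (not the real data):

```python
import pandas as pd

# Made-up scores for illustration only.
scores = pd.Series([5, 6, 6, 7, 5, 6, 3, 9, 6, 5], name="quality")

# Same rule as the cell above: a score of 7 or more counts as high quality.
label = (scores >= 7).astype(int)

# The mean of a 0/1 label is the fraction of positives.
positive_share = label.mean()
print(label.value_counts().sort_index())
print(positive_share)
```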

Box plots are useful for spotting relationships between features and a binary label.

In [0]:
import matplotlib.pyplot as plt

dims = (3, 4)

f, axes = plt.subplots(dims[0], dims[1], figsize=(25, 15))
axis_i, axis_j = 0, 0
for col in data.columns:
  if col == 'is_red' or col == 'quality':
    continue # Box plots cannot be used on indicator variables
  sns.boxplot(x=high_quality, y=data[col], ax=axes[axis_i, axis_j])
  axis_j += 1
  if axis_j == dims[1]:
    axis_i += 1
    axis_j = 0

In the above box plots, a few variables stand out as good univariate predictors of quality. 

- In the alcohol box plot, the median alcohol content of high quality wines is greater than even the 75th percentile of low quality wines. Alcohol content is positively correlated with quality.
- In the density box plot, low quality wines tend to have a greater density than high quality wines. Density is inversely correlated with quality.
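Observations read off a box plot can also be computed directly, which is useful once the notebook runs without anyone looking at the figures. The sketch below uses a made-up frame, so the numbers say nothing about the real dataset. It compares the median of one group with the 75th percentile of the other and computes the Pearson correlation between the feature and the 0/1 label.

```python
import pandas as pd

# Made-up values for illustration; quality is already the binary label.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 10.5, 11.0, 12.0, 12.5, 13.0],
    "quality": [0, 0, 0, 0, 0, 1, 1, 1],
})

# Median of the high quality group versus the upper quartile of the low quality group.
median_high = toy.loc[toy.quality == 1, "alcohol"].median()
q75_low = toy.loc[toy.quality == 0, "alcohol"].quantile(0.75)

# Pearson correlation between a numeric feature and the binary label.
corr_with_label = toy["alcohol"].corr(toy["quality"])

print(median_high, q75_low, corr_with_label)
```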

## Preprocess data
Prior to training a model, check for missing values. This notebook does not split the data into training and validation sets.

In [0]:
data.isna().any()

There are no missing values.
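When the notebook runs headless, nobody reads the output of `data.isna().any()`, so the check could be turned into something a pipeline can act on. A minimal sketch, with `columns_with_missing` as a hypothetical helper name and a made-up frame:

```python
import pandas as pd


def columns_with_missing(df):
    """Return the names of columns that contain at least one missing value."""
    has_missing = df.isna().any()
    return has_missing[has_missing].index.tolist()


# Made-up frame with one missing alcohol value.
toy = pd.DataFrame({"alcohol": [9.4, None], "density": [0.99, 1.0]})
bad_columns = columns_with_missing(toy)
print(bad_columns)
```

In the notebook, a line such as `assert not columns_with_missing(data)` would stop the run if missing values ever appear. Whether failing the run is the right response is a design choice for the team.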

## Save prepped and checked data

In [0]:
dbutils.fs.mkdirs(f"/tutorials/wine-data/{run_name}")
data.to_csv(f"/dbfs/tutorials/wine-data/{run_name}/wine-quality-all-prepped.csv")