# Part 0: Create Datasets for Training and Evaluation, and to mimic Production Data

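This notebook creates the datasets used in the workshop. It:

- Downloads the wine quality data from the web into the Databricks File System (DBFS)
- Merges the red and white wine data into a single dataset and saves it for training and validation
- Saves a bootstrap resample of the data to a Delta table to mimic production batch inference data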

In this example, you build a model to predict the quality of Portuguese "Vinho Verde" wine based on the wine's physicochemical properties.

The example uses a dataset from the UCI Machine Learning Repository, presented in [*Modeling wine preferences by data mining from physicochemical properties*](https://www.sciencedirect.com/science/article/pii/S0167923609001377?via%3Dihub) [Cortez et al., 2009].

## Requirements
This tutorial requires Databricks Runtime for Machine Learning.

## Import data
  
In this section, you download a dataset from the web and save it to Databricks File System (DBFS).
For this tutorial, we will use a public dataset which can be found at: https://archive.ics.uci.edu/dataset/186/wine+quality

Run the shell commands below to create a new directory in DBFS, download a `.zip` file with the data, and extract its contents into that directory.

In [0]:
%sh
mkdir -p /dbfs/tutorials/wine-data
wget https://archive.ics.uci.edu/static/public/186/wine+quality.zip -O /dbfs/tutorials/wine-data/wine-quality.zip
unzip -o /dbfs/tutorials/wine-data/wine-quality.zip -d /dbfs/tutorials/wine-data/
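
Before reading the files, it can help to confirm that the archive was extracted as expected. The sketch below is one way to do that from Python. `missing_files` is a hypothetical helper introduced here for illustration, and the two file names are the ones read in the next section.

```python
from pathlib import Path


def missing_files(directory, expected_names):
    """Return the names in expected_names that are not files under directory."""
    base = Path(directory)
    return [name for name in expected_names if not (base / name).is_file()]


# File names taken from the read step later in this notebook.
expected = ["winequality-white.csv", "winequality-red.csv"]

missing = missing_files("/dbfs/tutorials/wine-data", expected)
if missing:
    print("Missing files:", missing)
else:
    print("All expected files are present.")
```

If the list is not empty, rerunning the download cell is usually the simplest fix.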

## Read the Data

Now that the data is downloaded, we can read the files with standard pandas functions.

In [0]:
import pandas as pd

white_wine = pd.read_csv("/dbfs/tutorials/wine-data/winequality-white.csv", sep=";")
red_wine = pd.read_csv("/dbfs/tutorials/wine-data/winequality-red.csv", sep=";")

In [0]:
# Take a peek at the data to make sure everything was read as expected...
display(white_wine)
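
The `sep=";"` argument matters because the wine files are semicolon-delimited rather than comma-delimited. The sketch below illustrates the difference on a tiny made-up sample (the values and the reduced column set are assumptions for illustration, not rows from the real files).

```python
import io

import pandas as pd

# A tiny hypothetical sample using the same semicolon-delimited layout.
sample = "fixed acidity;pH;quality\n7.0;3.0;6\n6.3;3.3;6\n"

# With the default comma separator, each line is read as a single field.
wrong = pd.read_csv(io.StringIO(sample))

# With sep=";", the three columns are split correctly.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

A DataFrame with a single, oddly named column is a common sign that the separator was not set.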

Merge the two DataFrames into a single dataset, with a new binary feature "is_red" that indicates whether the wine is red or white.

In [0]:
red_wine['is_red'] = 1
white_wine['is_red'] = 0

data = pd.concat([red_wine, white_wine], axis=0)

# Remove spaces from column names
data.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)

In [0]:
data.head()
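
A few quick checks can confirm that the merge behaved as intended. The sketch below repeats the same steps on two small made-up DataFrames (the values are assumptions for illustration) and shows one detail worth knowing: `pd.concat` keeps each frame's original row labels, so the combined index contains duplicates unless `ignore_index=True` is passed.

```python
import pandas as pd

# Toy stand-ins for the red and white wine DataFrames.
red = pd.DataFrame({"fixed acidity": [7.4, 7.8], "quality": [5, 5]})
white = pd.DataFrame({"fixed acidity": [7.0, 6.3, 8.1], "quality": [6, 6, 6]})

red["is_red"] = 1
white["is_red"] = 0

combined = pd.concat([red, white], axis=0)
combined.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)

# Row count should equal the sum of the two inputs.
assert len(combined) == len(red) + len(white)

# Count of rows per wine type.
counts = combined["is_red"].value_counts().to_dict()
print(counts)

# concat preserves the original row labels, so they repeat across the two frames.
print(combined.index.is_unique)
```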

## Save the data for training and validation

We will save the combined dataset to a new file so we can use it in later steps for training and validation.

In [0]:
data.to_csv("/dbfs/tutorials/wine-data/wine-quality-all-raw.csv")
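
Note that `to_csv` writes the DataFrame index as the first column by default. The sketch below shows, on a small made-up DataFrame and a temporary file (both assumptions for illustration), how that column appears when the file is read back, and how `index_col=0` restores it as the index.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"quality": [5, 6], "is_red": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name
    df.to_csv(path)  # default behaviour: the index is written as well

    # Read without extra arguments: the saved index shows up as a data column.
    naive = pd.read_csv(path)

    # Read with index_col=0: the first column becomes the index again.
    restored = pd.read_csv(path, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Whichever way the file is read later, it is worth checking that no unintended index column ends up among the model features.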

## Save data to mimic production batch inference data
There are many scenarios where you might want to evaluate a model on a corpus of new data. For example, you may have a fresh batch of data, or may need to compare the performance of two models on the same corpus of data.

To simulate a new corpus of data, save a bootstrap resample of the full dataset to a Delta table. In the real world, this would be a new batch of data.

In [0]:
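# The split used in the training notebook is kept below (commented out) for reference.
# This notebook resamples the full dataset instead, so train_test_split is not called here.
from sklearn.model_selection import train_test_split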

# X = data.drop(["quality"], axis=1)
# y = data.quality

# Split out the training data
# X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.6, random_state=123)

# Split the remaining data equally into validation and test
# X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=123)

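# Convert the numeric quality score into a binary label: 1 if quality >= 7, otherwise 0
high_quality = (data.quality >= 7).astype(int)
data.quality = high_quality

# X_new_batch = X_train.sample(frac=1.0, replace=True, random_state=123)

# Bootstrap resample: draw as many rows as the dataset has, with replacement
X_new_batch = data.sample(frac=1.0, replace=True, random_state=123)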
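
The call `sample(frac=1.0, replace=True)` draws as many rows as the original has, but with replacement, so some rows appear more than once and others not at all. The sketch below illustrates this on a made-up DataFrame of 1,000 rows (the `row_id` column is an assumption for illustration). In expectation, roughly 63% of the original rows (1 - 1/e) appear at least once in a bootstrap resample.

```python
import pandas as pd

# Toy DataFrame with one identifier per row.
toy = pd.DataFrame({"row_id": range(1000)})

# Same call pattern as in the cell above.
resampled = toy.sample(frac=1.0, replace=True, random_state=123)

# Number of distinct original rows that made it into the resample.
n_unique = resampled["row_id"].nunique()

print(len(resampled))        # same row count as the original
print(n_unique / len(toy))   # fraction of original rows drawn at least once
```

The resample therefore has the same size as the source data but a slightly different empirical distribution, which is what makes it a reasonable stand-in for a fresh batch.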

In [0]:
spark_df = spark.createDataFrame(X_new_batch)
table_path = "dbfs:/tutorials/wine-data/delta"

# Delete the contents of this path in case this cell has already been run
dbutils.fs.rm(table_path, True)
spark_df.write.format("delta").save(table_path)