## C3 Keras on Diamond Price Prediction
This notebook uses Keras to train a deep learning regression model that predicts diamond prices.

The data is Kaggle's public "Diamonds" dataset: https://www.kaggle.com/shivam2503/diamonds. The notebook itself downloads `diamonds.csv` from the ggplot2 GitHub repository.

#### About this file:
A data frame with 53940 rows and 10 variables:

price -- price in US dollars (\$326--\$18,823)

carat -- weight of the diamond (0.2--5.01)

cut -- quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color -- diamond colour, from J (worst) to D (best)

clarity -- a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x -- length in mm (0--10.74)

y -- width in mm (0--58.9)

z -- depth in mm (0--31.8)

depth -- total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

table -- width of top of diamond relative to widest point (43--95)
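
As a quick check of the depth formula, the sketch below recomputes the depth percentage from x, y, and z. The helper name `depth_percent` is made up for illustration, the measurements are those of the first row displayed in Step 1, and the factor of 100 is an assumption made here to express the ratio as a percentage.

```python
def depth_percent(x, y, z):
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)

# Measurements (in mm) of the first diamond in the file.
computed = depth_percent(x=3.95, y=3.98, z=2.43)
print(round(computed, 1))  # the recorded depth for this row is 61.5
```

Small differences from the recorded `depth` value are to be expected, for instance from rounding in the stored measurements.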

This notebook must be run with the `tensorflow_plot_3_0_0` kernel.

## Step 1: Download and Preprocess Data

In [1]:
import numpy as np
import pandas as pd

import random

In [2]:
## the model needs numeric inputs, so encode each string column as integer codes ##
def translate_string_into_int(df, col):
    # Codes follow the order in which each value first appears in the column.
    str_int_dict = {key: i for i, key in enumerate(df[col].unique())}
    df[col] = df[col].map(str_int_dict)
    return df
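
The helper above numbers categories in order of first appearance, so the codes carry no quality ordering. As an alternative, here is a minimal sketch that encodes each column by the worst-to-best order given in the file description. The names `QUALITY_ORDER` and `encode_ordinal` are hypothetical, and the color order assumes the letters run alphabetically between J (worst) and D (best).

```python
import pandas as pd

# Worst-to-best orderings taken from the column descriptions.
QUALITY_ORDER = {
    "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    "color": ["J", "I", "H", "G", "F", "E", "D"],  # assumed alphabetical between J and D
    "clarity": ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
}

def encode_ordinal(df, col):
    """Return a copy of df with `col` replaced by its rank (0 = worst)."""
    ranks = {label: rank for rank, label in enumerate(QUALITY_ORDER[col])}
    out = df.copy()
    out[col] = out[col].map(ranks)
    return out

# Tiny hand-made sample, not the real dataset.
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "J"],
    "clarity": ["SI2", "SI1", "VS1"],
})
encoded = sample
for column in QUALITY_ORDER:
    encoded = encode_ordinal(encoded, column)
print(encoded)
```

With ranks like these, a larger code always means a better grade, which an arbitrary first-appearance numbering does not guarantee.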

In [3]:
url = "https://github.com/tidyverse/ggplot2/blob/main/data-raw/diamonds.csv?raw=true"


In [4]:
## read diamonds.csv into a pandas DataFrame ##
# This copy of the file has no id column, so index_col=0 turns "carat" into the index;
# reset_index() then moves it back to an ordinary column.
diamond = pd.read_csv(url, index_col=0)
diamond = diamond.reset_index()
diamond.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [5]:
## modify ##
diamond = translate_string_into_int(diamond, 'cut')
diamond = translate_string_into_int(diamond, 'color')
diamond = translate_string_into_int(diamond, 'clarity')
diamond.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,0,0,0,61.5,55.0,326,3.95,3.98,2.43
1,0.21,1,0,1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,2,0,2,56.9,65.0,327,4.05,4.07,2.31
3,0.29,1,1,3,62.4,58.0,334,4.2,4.23,2.63
4,0.31,2,2,0,63.3,58.0,335,4.34,4.35,2.75


In [6]:
## set random seed ##
# The mask below is drawn with np.random, so NumPy's generator is the one to seed.
np.random.seed(1014)

## create a train and test split using np ##
msk = np.random.rand(len(diamond)) < 0.8
train = diamond[msk]
test = diamond[~msk]

## examine the data ##
train.shape, test.shape, diamond.columns

((43047, 10),
 (10893, 10),
 Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
        'z'],
       dtype='object'))
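
For a split that comes out the same on every run, the random mask can be drawn from a seeded NumPy generator. The sketch below uses `np.random.default_rng` on a small stand-in DataFrame; the helper name `split_by_mask` and the stand-in data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_by_mask(df, train_fraction=0.8, seed=1014):
    """Split rows into train/test with a boolean mask from a seeded generator."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < train_fraction
    return df[mask], df[~mask]

# Stand-in data: 1,000 rows with a single column.
frame = pd.DataFrame({"value": np.arange(1000)})
train_a, test_a = split_by_mask(frame)
train_b, test_b = split_by_mask(frame)

# Same seed, same split; every row lands in exactly one of the two sets.
print(len(train_a), len(test_a), train_a.index.equals(train_b.index))
```

Because each row is assigned independently, the split is only approximately 80/20, which is consistent with the 43,047 and 10,893 row counts above.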

In [7]:
## now split train and test into features (x) and target (y) ##
train_x, train_y = train.loc[:, train.columns != "price"], train[["price"]]
test_x, test_y = test.loc[:, test.columns != "price"], test[["price"]]

## normalize the X values ##
def normalize(train, test):
    # Use the training statistics for both sets so no test information leaks in.
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Normalize the X values for better results
train_x, test_x = normalize(train_x, test_x)

## examine the data ##
train_x.shape, train_y.shape, test_x.shape, test_y.shape

((43047, 9), (43047, 1), (10893, 9), (10893, 1))
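
To see what `normalize` does, the sketch below applies the same function to a small made-up DataFrame. It checks that the training columns end up with mean 0 and standard deviation 1, while the test columns are scaled with the training statistics. The toy values are assumptions for illustration; note that pandas' `std` is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

def normalize(train, test):
    # Statistics come from the training set only.
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Made-up values, not rows from the diamonds file.
toy_train = pd.DataFrame({"carat": [0.2, 0.4, 0.6], "table": [55.0, 57.0, 59.0]})
toy_test = pd.DataFrame({"carat": [0.4, 0.8], "table": [57.0, 61.0]})

norm_train, norm_test = normalize(toy_train, toy_test)
print(norm_train.mean().round(6).tolist())  # per-column mean of the scaled training data
print(norm_train.std().round(6).tolist())   # per-column std of the scaled training data
print(norm_test)
```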

In [8]:
train_x.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,x,y,z
1,-1.238628,-0.183029,-1.549944,-0.713325,-1.351897,1.584494,-1.640317,-1.647411,-1.732812
3,-1.069901,-0.183029,-1.061722,0.416941,0.454121,0.240862,-1.363909,-1.308061,-1.281176
4,-1.02772,0.605962,-0.5735,-1.278457,1.079281,0.240862,-1.23908,-1.203646,-1.111812
5,-1.175356,1.394953,-0.5735,0.982073,0.73197,-0.207015,-1.595735,-1.542996,-1.49288
6,-1.175356,1.394953,-1.061722,1.547206,0.384659,-0.207015,-1.586818,-1.525593,-1.506994


In [9]:
## convert the pandas DataFrames into C3 Datasets ##
c3_train_x = c3.Dataset.fromPython(train_x)
c3_train_y = c3.Dataset.fromPython(train_y)
c3_test_x = c3.Dataset.fromPython(test_x)
c3_test_y = c3.Dataset.fromPython(test_y)


## Step 2: Create the Keras Regressor Model

In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras import metrics
from tensorflow.keras import losses
from tensorflow.keras import optimizers

In [11]:
# Define keras regressor model
regressor_model = Sequential()
regressor_model.add(Dense(64, activation='relu', input_shape=(9,)))
regressor_model.add(Dense(32, activation='relu'))
regressor_model.add(Dense(3, activation='relu'))
regressor_model.add(Dense(1))
regressor_model.compile(loss=losses.MeanSquaredError(), 
                        optimizer=optimizers.RMSprop(0.001), 
                        metrics=[metrics.MeanAbsoluteError()])

## The native Keras model is upserted into the pipe in the next cell.
## Custom classes and custom loss functions are not supported in KerasPipes.
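
As a sanity check on the size of this network, the number of trainable parameters can be worked out by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper name `dense_params` below is made up for this sketch.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Layer widths of the model above: 9 inputs -> 64 -> 32 -> 3 -> 1 output.
widths = [9, 64, 32, 3, 1]
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)
print(per_layer, total)  # [640, 2080, 99, 4] 2823
```

If the model is built as shown, `regressor_model.summary()` should report the same total.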

In [12]:
# Define the C3 KerasPipe with its KerasTechnique and attach the Keras model definition
regressor_pipe = c3.KerasPipe(name="diamond_regressor",
                              technique = c3.KerasTechnique(numEpochs=10, 
                                                            stepsPerEpoch=50, 
                                                            batchSize=128)).upsertNativeModel(regressor_model, upsertModelDefOnly=True)
regressor_pipe.id

'0b18ab85-ac10-4fcf-8693-4a30370cd6a1'
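
To get a feel for how much training these settings allow, the arithmetic below estimates how many samples the model sees. It assumes that `numEpochs`, `stepsPerEpoch`, and `batchSize` carry their usual Keras meaning (one batch per step), which this notebook does not confirm for C3's `KerasTechnique`.

```python
num_epochs = 10
steps_per_epoch = 50
batch_size = 128
train_rows = 43047  # size of the training split reported in Step 1

# Total samples drawn during training, and how many passes over the data that is.
samples_seen = num_epochs * steps_per_epoch * batch_size
passes_over_data = samples_seen / train_rows
print(samples_seen, round(passes_over_data, 2))
```

Under that assumption the model sees roughly one and a half passes over the training set, which may be too little for it to converge; raising `numEpochs` or `stepsPerEpoch` would be one thing to try.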

In [16]:
regressor_pipe

c3.KerasPipe(
 id='0b18ab85-ac10-4fcf-8693-4a30370cd6a1',
 name='diamond_regressor',
 meta=c3.Meta(
        tenantTagId=152,
        tenant='dev',
        tag='tc03d',
        created=datetime.datetime(2021, 10, 28, 20, 50, 27, tzinfo=datetime.timezone.utc),
        createdBy='zhang303@illinois.edu',
        updated=datetime.datetime(2021, 10, 28, 20, 50, 27, tzinfo=datetime.timezone.utc),
        updatedBy='zhang303@illinois.edu',
        timestamp=datetime.datetime(2021, 10, 28, 20, 50, 27, tzinfo=datetime.timezone.utc),
        fetchInclude='[]',
        fetchType='KerasPipe'),
 version=1,
 typeIdent='PIPE:LF:DLP:KRS',
 noTrainScore=False,
 persistedModelCategory='unidentified',
 untrainableOverride=False,
 technique=c3.KerasTechnique(
             modelDef='{"class_name": "Sequential", "config": {"name": '
                       '"sequential", "layers": [{"class_name": "Dense", '
                       '"config": {"name": "dense", "trainable": true, '
                       '"batch

In [13]:
# Train the pipe on the C3 training datasets
trained_regressor_pipe = regressor_pipe.train(c3_train_x, c3_train_y)


In [14]:
## predict prices for the test set ##
prediction = trained_regressor_pipe.process(input=c3_test_x)

In [15]:
# Compare with ground truth
pred_df = c3.Dataset.toPandas(dataset=prediction)
#ground_df = c3.Dataset.toPandas(dataset=c3_test_y)

# combine predictions and ground truth in one DataFrame #
combined = pred_df
combined['ground_truth'] = test_y.reset_index()['price']
combined.columns = ["predicted", "ground_truth"]
combined

Unnamed: 0,predicted,ground_truth
0,25.996393,326
1,249.927994,327
2,39.580635,339
3,21.847240,345
4,25.112823,352
...,...,...
10888,423.380310,2753
10889,193.092224,2756
10890,283.162476,2757
10891,214.260971,2757
