# C3 TF on Diamond Price Prediction #

This notebook uses TensorFlow (TF) and a built-in TF Estimator to train a deep learning regression model that predicts diamond prices.

The data comes from Kaggle's public "Diamonds" dataset:
https://www.kaggle.com/shivam2503/diamonds

### About this file ###

A data frame with 53940 rows and 10 variables:

price -- price in US dollars (\$326--\$18,823)

carat -- weight of the diamond (0.2--5.01)

cut -- quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color -- diamond colour, from J (worst) to D (best)

clarity -- a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x -- length in mm (0--10.74)

y -- width in mm (0--58.9)

z -- depth in mm (0--31.8)

depth -- total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

table -- width of top of diamond relative to widest point (43--95)
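
The depth formula can be checked against any row. Below is a minimal sketch using the first row of the data preview (x = 3.95, y = 3.98, z = 2.43, recorded depth 61.5). The helper name `depth_percentage` is made up for illustration, and the factor of 100 is an assumption inferred from the 43--79 range.

```python
def depth_percentage(x, y, z):
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100.0 * 2.0 * z / (x + y)

# First row of the preview: x=3.95, y=3.98, z=2.43, recorded depth 61.5
computed = depth_percentage(3.95, 3.98, 2.43)
print(round(computed, 2))
```

The computed value is close to the recorded depth but not identical, presumably because the dimensions are stored rounded to two decimals.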

In [1]:
import numpy as np
import pandas as pd

import random

import tensorflow as tf

## Step 1: Read and Split into Train and Test Data ##

In [3]:
## read diamonds.csv into a pandas DataFrame (loaded here from the copy in the ggplot2 repository) ##
url = "https://github.com/tidyverse/ggplot2/blob/main/data-raw/diamonds.csv?raw=true"
diamond = pd.read_csv(url, index_col=0)
#diamond.columns = ["id"] + diamond.columns.tolist()[1:]
#diamond = diamond.drop(["id"], axis=1)
diamond = diamond.reset_index()
diamond.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [None]:
## the string columns must be converted to numeric columns ##
def translate_string_into_int(df, col):
    # Map each distinct string to an integer code, in order of first appearance.
    # Note: this order does not follow the quality ranking of the category.
    codes = {key: i for i, key in enumerate(df[col].unique())}
    df[col] = df[col].map(codes)
    return df

In [None]:
diamond = translate_string_into_int(diamond, 'cut')
diamond = translate_string_into_int(diamond, 'color')
diamond = translate_string_into_int(diamond, 'clarity')
diamond.head()
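
Codes assigned in order of first appearance carry no meaning about quality. Since the dataset description gives a worst-to-best order for `cut`, `color`, and `clarity`, an alternative is to encode each level by its rank. This is a sketch of that alternative, not the encoding used in the rest of this notebook. The names `ORDERINGS` and `encode_ordinal` are hypothetical, and the intermediate colour letters are assumed to run alphabetically between J and D.

```python
import pandas as pd

# Quality orderings taken from the dataset description, worst to best.
ORDERINGS = {
    "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    "color": ["J", "I", "H", "G", "F", "E", "D"],
    "clarity": ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
}

def encode_ordinal(df, orderings=ORDERINGS):
    """Return a copy of df with each listed column replaced by its rank (0 = worst)."""
    out = df.copy()
    for col, levels in orderings.items():
        rank = {level: i for i, level in enumerate(levels)}
        out[col] = out[col].map(rank)
    return out

# Tiny illustrative frame built from the first three rows of the preview
sample = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "E"],
    "clarity": ["SI2", "SI1", "VS1"],
})
encoded = encode_ordinal(sample)
print(encoded)
```

With this encoding a larger number always means a better grade, which gives the network a consistent direction to learn from.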

In [None]:
## set random seeds (the split below draws from NumPy's generator) ##
random.seed(1014)
np.random.seed(1014)

In [None]:
## create a train and test split using np ##
msk = np.random.rand(len(diamond)) < 0.8
train = diamond[msk]
test = diamond[~msk]
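
The mask-based split is repeatable only if NumPy's random generator is seeded; `random.seed` affects Python's built-in `random` module and nothing else. Below is a minimal sketch of the same 80/20 mask drawn from a seeded `np.random.default_rng`. The row count is a stand-in for `len(diamond)`.

```python
import numpy as np

n_rows = 1000  # stand-in for len(diamond)

# A dedicated, seeded NumPy generator makes the mask reproducible.
rng = np.random.default_rng(1014)
msk = rng.random(n_rows) < 0.8

# Re-creating the generator with the same seed reproduces the same mask.
msk_again = np.random.default_rng(1014).random(n_rows) < 0.8

print(msk.sum(), (~msk).sum())
print(bool((msk == msk_again).all()))
```

Because each row is assigned independently, the split is only approximately 80/20; the exact counts vary with the seed.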

In [None]:
## examine the data ##
train.shape, test.shape, diamond.columns

In [None]:
## now split train and test into features (x) and target (y) ##
train_x, train_y = train.loc[:, train.columns != "price"], train[["price"]]
test_x, test_y = test.loc[:, test.columns != "price"], test[["price"]]

In [None]:
## examine the data ##
train_x.shape, train_y.shape, test_x.shape, test_y.shape

In [None]:
## normalize the X values using training-set statistics only ## TODO for testing
def normalize(train, test):
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std

# Normalize the X values for better results
train_x, test_x = normalize(train_x, test_x)
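
The point of computing the mean and standard deviation on the training set alone is that the test set is scaled with the same numbers and never influences them. Below is a small NumPy sketch with made-up values. The zero-variance guard is an added assumption, and NumPy's `std` uses the population formula (`ddof=0`) whereas pandas defaults to the sample formula (`ddof=1`), so results on DataFrames differ slightly.

```python
import numpy as np

def normalize(train, test):
    """Scale both sets with the mean and standard deviation of the training set."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # assumption: leave constant columns unscaled
    return (train - mean) / std, (test - mean) / std

# Made-up numbers: two features on very different scales
train = np.array([[0.2, 55.0], [0.3, 58.0], [0.4, 61.0]])
test = np.array([[0.3, 70.0]])

train_n, test_n = normalize(train, test)
print(train_n.mean(axis=0), train_n.std(axis=0))
print(test_n)
```

After scaling, each training column has mean 0 and standard deviation 1, while test values can fall well outside that range when they lie far from the training distribution.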

#### If the data must keep its string columns, use the following instead ####

def convert_df_into_nparray(df):
    names = df.columns
    arrays = [ df[col].values for col in names ]
    formats = [ array.dtype if array.dtype != 'O' else array.astype(str).dtype.str.replace('<U','S') for array in arrays ]
    rec_array = np.rec.fromarrays( arrays, dtype={'names': names, 'formats': formats} )
    return rec_array
    
train_x = convert_df_into_nparray(train_x)
train_y = convert_df_into_nparray(train_y)
test_x = convert_df_into_nparray(test_x)
test_y = convert_df_into_nparray(test_y)

c3_train_X = c3.FileSourceSpec.createFromNumpy(npArray=train_x)
c3_train_Y = c3.FileSourceSpec.createFromNumpy(npArray=train_y)
c3_test_X = c3.FileSourceSpec.createFromNumpy(npArray=test_x)
c3_test_Y = c3.FileSourceSpec.createFromNumpy(npArray=test_y)
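
A record array stores one named, typed field per column, which is how numeric and string columns can sit side by side in a single NumPy object. Below is a minimal sketch on made-up columns. Unlike the function above, it lets NumPy infer the field formats and keeps strings as Unicode rather than converting them to byte strings.

```python
import numpy as np

# Made-up columns standing in for a DataFrame with one string column
columns = {
    "carat": np.array([0.23, 0.21, 0.23]),
    "cut": np.array(["Ideal", "Premium", "Good"]),
}

# Field formats are inferred from each array's dtype when only names are given.
rec = np.rec.fromarrays(list(columns.values()), names=list(columns.keys()))

print(rec.dtype.names)
print(rec.cut[1])
print(rec[0].carat)
```

Fields can be read by attribute (`rec.cut`) or by key (`rec["cut"]`), and individual rows come back as records with the same field names.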


In [None]:
## convert the data into C3's tensor type (c3.Dataset) ##

c3_train_x = c3.Dataset.fromPython(train_x)
c3_train_y = c3.Dataset.fromPython(train_y)
c3_test_x = c3.Dataset.fromPython(test_x)
c3_test_y = c3.Dataset.fromPython(test_y)


## Step 2: Create a Tensorflow Estimator ##

Rather than defining a network by hand, we use one of TensorFlow's pre-built estimators, `DNNRegressor`.

In [None]:
## assign all the feature columns ##
feature_columns_diamond = []
for key in train_x.keys():
    feature_columns_diamond.append(tf.feature_column.numeric_column(key=key))
    
## fully connected network definition ##
Regressor = tf.estimator.DNNRegressor(
    feature_columns=feature_columns_diamond,
    hidden_units=[10, 10])

Regressor.model_dir
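
`hidden_units=[10, 10]` describes two fully connected hidden layers of 10 units each, followed by a single linear output for the regression target. As a rough illustration of the forward pass only (not TensorFlow's implementation), here is a NumPy sketch with random stand-in weights and ReLU activations, the default activation of `DNNRegressor`. The feature count of 9 is an assumption based on the nine non-price variables in the description.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 9                      # assumption: the nine non-price variables
layer_sizes = [n_features, 10, 10, 1]

# One (weights, bias) pair per layer; random values stand in for trained parameters.
params = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(x, params):
    """Apply ReLU after each hidden layer and leave the output layer linear."""
    h = x
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

batch = rng.normal(size=(4, n_features))
pred = forward(batch, params)
n_params = sum(w.size + b.size for w, b in params)
print(pred.shape, n_params)
```

With these sizes the network has only 221 trainable parameters, so it is a very small model for a dataset of this size.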

In [None]:
serialized_tf_regressor = c3.PythonSerialization.serialize(obj=Regressor)

In [None]:
ml_pipe = c3.TensorflowRegressor(
    name = "DiamondRegressor",
    inputFormat="feature-map",
    technique=c3.DeepLearningTechnique(
        name="DL",
        modelDef=serialized_tf_regressor,
        hyperParameters={"numSteps":1000, "numEpochs":100, "batchSize":128}
    )
)
ml_pipe
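
How C3 combines `numSteps` and `numEpochs` is not stated here. Under the common convention that one step processes one batch and one epoch is one full pass over the training set, the two settings describe very different amounts of training. The arithmetic below assumes a training set of exactly 80% of the 53940 rows; the actual mask gives a slightly different count.

```python
import math

n_rows = 53940        # rows in the full dataset
batch_size = 128

n_train = n_rows * 4 // 5                       # assumed 80% training share
steps_per_epoch = math.ceil(n_train / batch_size)

epochs_in_1000_steps = 1000 / steps_per_epoch
steps_for_100_epochs = 100 * steps_per_epoch

print(n_train, steps_per_epoch)
print(round(epochs_in_1000_steps, 2), steps_for_100_epochs)
```

Under these assumptions 1000 steps cover about three epochs, whereas 100 epochs would need 33800 steps, so it matters which of the two limits takes effect.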

## Step 3: Train the Model and Predict ##

In [None]:
## start training ##
trained_ml_pipe = ml_pipe.train(
    input=c3_train_x,
    targetOutput=c3_train_y
)
trained_ml_pipe.trainingStats

In [None]:
## predicting and results ##
prediction = trained_ml_pipe.process(input=c3_test_x)
# Compare with ground truth
combined = pd.concat([c3.Dataset.toPandas(dataset=prediction), 
                      c3.Dataset.toPandas(dataset=c3_test_y)], 
                     axis=1)
combined.columns = ["predicted", "ground_truth"]
combined
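
A side-by-side table is hard to judge by eye, so a summary error is a natural next step. Below is a minimal sketch of mean absolute error (MAE) and root mean squared error (RMSE). The helper `regression_errors` is hypothetical and the numbers are made up; in this notebook the inputs would be `combined["predicted"]` and `combined["ground_truth"]`.

```python
import numpy as np

def regression_errors(predicted, ground_truth):
    """Return MAE and RMSE, both in the units of the target (US dollars here)."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    err = predicted - ground_truth
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
    }

# Made-up predictions against three made-up prices
metrics = regression_errors([300.0, 350.0, 500.0], [326.0, 334.0, 335.0])
print(metrics)
```

RMSE penalizes large misses more heavily than MAE, so a wide gap between the two points to a few badly mispredicted diamonds.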