# Diamond Price Prediction

Remember to download the dataset! It is linked in the slideshow (found in Google Classroom). Once you have downloaded it, move it into the same folder as the main notebook. You don't need to worry about the fancy headers I've added to this notebook. Just focus on the code cells.

In [2]:
# Don't worry if these imports do not work. We will get them sorted next class.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

### Load and Preview data

In [None]:
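# This code will load the dataset into our program. Below, we are just printing a few things about it, such as its size.
# The path below is the Kaggle location. If you are running the notebook on your own computer, change it to wherever you saved the file.
df = pd.read_csv("/kaggle/input/diamonds/diamonds.csv", index_col=0)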
print("dataset size:", len(df))
print(df.head(10))
print(df.shape)

### Preparing Data

In [None]:
# These three things are called "maps" (in Python, they are dictionaries). You give them an input, and they give you the matching output.
# Notice how some of the diamond qualities are described using words. We can convert them into numbers based on their rank, from worst (0) to best.
cut_enc = {
    "Fair": 0,
    "Good": 1,
    "Very Good": 2,
    "Premium": 3,
    "Ideal": 4
}

clarity_enc = {
    "I1": 0, # Worst Clarity
    "SI2": 1,
    "SI1": 2,
    "VS2": 3,
    "VS1": 4,
    "VVS2": 5,
    "VVS1": 6,
    "IF": 7, # Best Clarity
}

color_enc = {
    "J": 0, # Worst Color
    "I": 1,
    "H": 2,
    "G": 3,
    "F": 4,
    "E": 5,
    "D": 6, # Best Color
}

# These lines will take the entire dataset, and replace the words using our maps that we made.
df["cut"] = df["cut"].map(cut_enc)
df["clarity"] = df["clarity"].map(clarity_enc)
df["color"] = df["color"].map(color_enc)

print(df.head())
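
If you want to see what `.map` does on its own, here is a tiny sketch. The three labels are made up for illustration, and "Excellent" is deliberately not in the dictionary so you can see what happens to a label the map does not know.

```python
import pandas as pd

# Same idea as cut_enc above, repeated here so this cell runs on its own.
cut_enc = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}

# Hypothetical mini-column. "Excellent" is NOT a key in the dictionary.
cuts = pd.Series(["Ideal", "Fair", "Excellent"])

encoded = cuts.map(cut_enc)
print(encoded)               # known labels become numbers
print(encoded.isna().sum())  # unknown labels become NaN (missing values)
```

A quick `df.isna().sum()` after mapping is a cheap way to catch a misspelled key in one of the dictionaries.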


### Visualizing Data

In [None]:
# This is a correlation heatmap. It should look like a grid with different colors.
# Each cell shows the correlation between the feature in the cell's column and the feature in the cell's row.
# Values close to 1 or -1 mean the two features are closely related. Values close to 0 mean they are not.
sns.heatmap(df.corr(), annot=True)

# Note that x, y, and z have a high correlation with carat (makes sense, since a diamond's weight depends on its dimensions).
# As a result, we are able to remove them without losing much information.
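
To get a feel for what a correlation number means, here is a small sketch with made-up data (not the diamonds dataset). `length` drives `weight`, while `noise` has nothing to do with either.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: weight grows with the cube of length, plus a little randomness.
length = rng.uniform(4.0, 8.0, size=500)
weight = 0.01 * length**3 + rng.normal(0.0, 0.05, size=500)
noise = rng.normal(0.0, 1.0, size=500)  # unrelated to length

# np.corrcoef returns a 2x2 grid; [0, 1] is the correlation between the two inputs.
r_related = np.corrcoef(length, weight)[0, 1]
r_unrelated = np.corrcoef(length, noise)[0, 1]

print("length vs weight:", round(r_related, 2))
print("length vs noise: ", round(r_unrelated, 2))
```

The first number should come out close to 1 and the second close to 0, which is the same pattern the heatmap shows for carat against x, y, and z.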

### Data preparation (Cont)

In [None]:
# Shuffling the dataset
df = df.sample(frac=1)

# Split into input and output data. These are our "test questions" used to train the neural network.
# X will store the "test questions", and y will store the "answer key"
X = df.drop(["price", "x", "y", "z"], axis=1)
y = df["price"].values
y = np.reshape(y, (-1, 1)) # Reshapes y from a flat array of shape (n,) into a column of shape (n, 1), which the scaler we use later expects

# Some visualization stuff
print(X.head())
print(X.shape)
print("---")
print(y)
print(y.shape)
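
Here is the reshape step by itself on three made-up prices, so you can see the shape change.

```python
import numpy as np

prices = np.array([326, 327, 334])    # hypothetical prices, a flat array
column = np.reshape(prices, (-1, 1))  # -1 tells NumPy to work out the row count itself

print(prices.shape)  # (3,)
print(column.shape)  # (3, 1)
print(column)
```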

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Let's split our test questions into two parts.
# The first part, X_train and y_train, will be used to train our model.
# The second part, X_test and y_test, will act as the "new data" our model has never seen before.
# We test the model on the second part but never let it learn from it, so it might as well be "new data". This way, we can detect overfitting!
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Scaling the data into the range 0...1 for better performance in neural networks
X_scaler = MinMaxScaler()
X_train = X_scaler.fit_transform(X_train)
X_test = X_scaler.transform(X_test)

y_scaler = MinMaxScaler()
y_train = y_scaler.fit_transform(y_train)
y_test = y_scaler.transform(y_test)

print(X_train)
print("---")
print(y_train)
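
Min-max scaling is simple enough to do by hand. The sketch below uses plain NumPy and made-up numbers to show the same arithmetic: the minimum and maximum are learned from the training data only and then reused on the test data.

```python
import numpy as np

train = np.array([[0.5], [1.0], [2.5]])  # hypothetical training values
test = np.array([[0.2], [3.0]])          # hypothetical test values outside the training range

# "fit": learn the minimum and maximum from the training data only
lo, hi = train.min(axis=0), train.max(axis=0)

# "transform": apply the same formula to both sets
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled.ravel())  # always between 0 and 1
print(test_scaled.ravel())   # can fall below 0 or above 1
```

This is why the notebook calls `fit_transform` on the training set but only `transform` on the test set: the test data must not influence the scaling.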

### Building the model

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# In TensorFlow, you build neural networks layer by layer. Very simple.
# Here, we add two layers of 64 neurons each. Our output layer contains a single neuron, which represents the predicted price of the input diamond.
model = Sequential()
model.add(Dense(64, activation="relu", input_dim=6))
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="relu"))

model.summary()

# Finishing up our model. There is nothing much below this that you need to take note of.
from tensorflow.keras.losses import mae
model.compile(loss=mae, optimizer="Adam", metrics=["mse"])
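
If you are curious where the parameter counts printed by `model.summary()` come from, you can work them out by hand. This is a small sketch in plain Python. `dense_params` is a helper made up for this example, not part of TensorFlow.

```python
def dense_params(n_inputs, n_units):
    # Each neuron has one weight per input, plus one bias.
    return n_inputs * n_units + n_units

# 6 input features, then the three Dense layers from the model above.
layer_sizes = [6, 64, 64, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # parameters in each layer
print(sum(per_layer))  # total trainable parameters
```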

### Training!!!

In [None]:
# This will begin the training process!
# "epochs" represents the number of times we will go through all of our test questions in "X_train".
# As you can see, we will train our neural network on these questions 64 times.
model_stats = model.fit(
    X_train,
    y_train,
    epochs=64,
    validation_data=(X_test, y_test)
)
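
Within each epoch, Keras does not look at every row at once. It works through the training data in small batches (32 rows at a time when `fit` is not given a `batch_size`). Here is a rough sketch of the arithmetic, using a made-up number of training rows.

```python
import math

n_train = 800    # hypothetical number of training rows
batch_size = 32  # the Keras default when fit() is not given a batch_size
epochs = 64

steps_per_epoch = math.ceil(n_train / batch_size)  # batches needed to see every row once
total_updates = steps_per_epoch * epochs           # weight updates over the whole training run

print(steps_per_epoch, total_updates)
```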

### Visualizing results

In [None]:
# We can plot the model loss as a graph to see how well the training went.
# Our goal is to see the loss go down. Loss measures how badly a neural network performs, so lower is better.
plt.plot(model_stats.history["loss"])
plt.plot(model_stats.history["val_loss"])
plt.legend(["train loss", "test loss"], loc="upper left")
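
The numbers on this graph are mean absolute error (MAE), the loss we chose when compiling the model. Here is a sketch of how MAE and MSE (the extra metric we asked for) are calculated, using three made-up predictions.

```python
import numpy as np

y_true = np.array([0.10, 0.40, 0.80])  # hypothetical scaled prices (the answer key)
y_pred = np.array([0.15, 0.30, 0.80])  # hypothetical scaled predictions

mae = np.mean(np.abs(y_true - y_pred))  # average size of the mistakes
mse = np.mean((y_true - y_pred) ** 2)   # average squared mistake, punishes big errors more

print("MAE:", mae)
print("MSE:", mse)
```

Keep in mind that these losses are measured on the scaled prices (0 to 1), not in dollars.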

### Testing our own data

In [None]:
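# We can now supply our own data to see what the neural network gives us.
# Remember, we need to preprocess it in the same way we did with the training data!
# The values must be in the same order as the columns of X: carat, cut, color, clarity, depth, table.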
X_pred = [1, "Ideal", "I", "VS1", 63.6, 57.0]

X_pred[1] = cut_enc[X_pred[1]]
X_pred[2] = color_enc[X_pred[2]]
X_pred[3] = clarity_enc[X_pred[3]]

X_pred = np.reshape(X_pred, (1, -1))
X_pred = X_scaler.transform(X_pred)

print(X_pred)

y_pred = model.predict(X_pred)
y_pred = y_scaler.inverse_transform(y_pred)

print("Predicted price:", y_pred)
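
The network outputs a scaled price between 0 and 1, which is why we need `inverse_transform` at the end. Here is the same arithmetic done by hand. The price range and the network output below are made-up numbers for illustration.

```python
import numpy as np

price_min, price_max = 300.0, 19000.0   # hypothetical price range seen during training
scaled_prediction = np.array([[0.25]])  # hypothetical output from the network

# Undo min-max scaling: stretch back to the original range, then shift by the minimum.
price = scaled_prediction * (price_max - price_min) + price_min

print("Predicted price:", price)
```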