<a href="https://colab.research.google.com/github/ibrahimgh25/CutterKit/blob/master/diamonds_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this file I try out different ML algorithms provided by sklearn on a diamond database provided by Kaggle. I initially tried solving this problem with a loosely designed Keras model (the first deep learning model I ever coded), but the results were really bad. After a while, I read part of the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" (April 2017) by Aurélien Géron, so I was curious to test the efficiency of the algorithms the author talked about (RandomForest, SVM, and RandomTree).

In [1]:
# Classic imports
import pandas as pd
import numpy as np

The diamond database contains about 54,000 samples of diamonds. Each sample (row) has 10 columns: carat (the diamond's weight), cut quality, color, clarity, depth (total depth percentage), table (width of the top of the diamond relative to its widest point), x (length), y (width), z (depth), and price, which will be treated as our target.
For more information about the database, please refer to the Kaggle page it was obtained from (https://www.kaggle.com/shivam2503/diamonds).
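
As a quick sanity check on how the columns relate: the Kaggle description defines `depth` as the total depth percentage, that is, z divided by the mean of x and y. The sketch below recomputes it for two rows of this dataset (the values are copied by hand from the training rows printed further down). The function name `depth_percentage` is my own and not part of the dataset or of pandas.

```python
def depth_percentage(x, y, z):
    """Total depth percentage: z relative to the mean of x and y, times 100."""
    return 100 * z / ((x + y) / 2)

# (x, y, z, depth as listed in the dataset) for two rows
rows = [
    (4.27, 4.30, 2.67, 62.3),
    (6.00, 6.06, 3.71, 61.5),
]

for x, y, z, listed_depth in rows:
    computed = depth_percentage(x, y, z)
    print(f"computed={computed:.1f}  listed={listed_depth}")
```

For both rows the computed value rounds to the listed depth, so `depth` is largely redundant with x, y, and z.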

In [6]:
# The original database has an unnamed column for indexing, we'll just delete that
diamonds = pd.read_csv('diamonds.csv').drop('Unnamed: 0', axis=1)
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


sklearn is a great tool for machine learning and we'll import a lot of stuff from there.

In [18]:
# For splitting the database
from sklearn.model_selection import train_test_split
# For creating a custom class (DataFrameSelector)
from sklearn.base import BaseEstimator, TransformerMixin
# For encoding the categorical attributes and standardizing the data
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# For creating a pipeline for data preparation
from sklearn.pipeline import Pipeline, FeatureUnion
# The machine learning algorithms
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVC

from sklearn.metrics import mean_squared_error

We'll split the database into 85/15 train/dev sets. I know a lot of people prefer to name the sets train/test, but I once heard Dr. Andrew Ng (from deeplearning.ai) argue that train/dev is a better convention, and I was convinced.

In [26]:
diamonds_train, diamonds_dev = train_test_split(diamonds, random_state=42, test_size=0.15)
print(diamonds_train.head())

       carat        cut color clarity  depth  table  price     x     y     z
13713   0.30      Ideal     E     VS2   62.3   56.0    603  4.27  4.30  2.67
3481    0.81      Ideal     G     VS2   61.5   55.0   3397  6.00  6.06  3.71
343     0.71  Very Good     E     VS2   64.0   57.0   2804  5.66  5.68  3.63
22822   1.55  Very Good     E     SI1   62.4   58.0  10851  7.36  7.42  4.61
51658   0.30      Ideal     G     VS2   61.2   55.0    545  4.35  4.38  2.67
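
To confirm what fraction `test_size` actually holds out, here is a minimal sketch on a synthetic 100-row frame. The frame `toy` and its values are made up for illustration; only the `train_test_split` call mirrors the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diamonds frame: 100 rows, made-up values
toy = pd.DataFrame({"carat": [0.3] * 100, "price": range(100)})

toy_train, toy_dev = train_test_split(toy, random_state=42, test_size=0.15)

# test_size=0.15 holds out 15% of the rows for the dev set
print(len(toy_train), len(toy_dev))
```

The two parts never share a row, so the dev error is measured on samples the model has not seen during training.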


In [8]:
# Define the numerical and categorical attributes for the database.
diamonds_features = diamonds.drop('price', axis=1)
num_attribs = diamonds_features.drop(['cut', 'color', 'clarity'], axis=1).columns
cat_attribs = diamonds_features.drop(num_attribs, axis=1).columns

print(num_attribs)
print(cat_attribs)

Index(['carat', 'depth', 'table', 'x', 'y', 'z'], dtype='object')
Index(['cut', 'color', 'clarity'], dtype='object')


In [10]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
  ''' This class will be used in the pipeline to select columns from the 
  database. The class inherits from the BaseEstimator and TransformerMixin classes,
  so it should have a fit method and a transform method.
  Args:
    attribute_names (:obj:'list' of :obj:'str'): the list of columns
     (attributes) to be selected by the class object.
  '''
  def __init__(self, attribute_names):
    self.attribute_names = attribute_names
  def fit(self, X, y=None):
    # Just return self, this one doesn't need fitting.
    return self
  def transform(self, X):
    # Just return the values in the specified columns in X.
    return X[self.attribute_names].values
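
To see what the selector hands to the next pipeline step, here is a small sketch on a two-row frame. The class is repeated so the snippet runs on its own, and the frame `toy` is a hand-built stand-in, not the real dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Same selector as above, repeated so this snippet is self-contained."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# Two rows with a mix of numerical and categorical columns
toy = pd.DataFrame({
    "carat": [0.30, 0.81],
    "cut": ["Ideal", "Ideal"],
    "depth": [62.3, 61.5],
})

selected = DataFrameSelector(["carat", "depth"]).fit_transform(toy)
print(type(selected).__name__, selected.shape)
```

The result is a plain NumPy array holding only the requested columns, which is the input format StandardScaler and OneHotEncoder expect here.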

The next cell defines the pipelines to be used for data preparation, which are:
1. num_pipeline: used to prepare the numerical attributes; it selects them, then applies a StandardScaler.
2. cat_pipeline: used to prepare the categorical attributes; it selects them, then applies a OneHotEncoder.
3. full_pipeline: uses FeatureUnion to combine the two pipelines into one.

In [11]:
num_pipeline = Pipeline([
 ('selector', DataFrameSelector(num_attribs)),
 ('std_scaler', StandardScaler()),
 ])

cat_pipeline = Pipeline([
 ('selector', DataFrameSelector(cat_attribs)),
 ('1hot_encoder', OneHotEncoder()),
 ])

full_pipeline = FeatureUnion(transformer_list=[
 ("num_pipeline", num_pipeline),
 ("cat_pipeline", cat_pipeline),
 ]) 
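
The selector-plus-FeatureUnion pattern above comes from the book. Recent scikit-learn versions also offer `ColumnTransformer`, which picks columns by name and so removes the need for a custom selector class. The following is a sketch of that alternative on a tiny hand-built frame, not the pipeline this notebook actually runs. The names `toy` and `preprocess` and the `handle_unknown="ignore"` setting are my own choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A few rows with the same kinds of columns as the diamonds data
toy = pd.DataFrame({
    "carat": [0.30, 0.81, 0.71, 1.55],
    "depth": [62.3, 61.5, 64.0, 62.4],
    "cut": ["Ideal", "Ideal", "Very Good", "Very Good"],
    "color": ["E", "G", "E", "E"],
})

preprocess = ColumnTransformer([
    # Scale the numerical columns
    ("num", StandardScaler(), ["carat", "depth"]),
    # One-hot encode the categorical columns; unseen categories become all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cut", "color"]),
])

X_toy = preprocess.fit_transform(toy)
# 2 scaled numeric columns + 2 one-hot columns for cut + 2 for color
print(X_toy.shape)
```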

In [27]:
# Let's create our Xs and ys
X_train = full_pipeline.fit_transform(diamonds_train)
y_train = diamonds_train['price']
X_dev = full_pipeline.transform(diamonds_dev)
y_dev = diamonds_dev['price']
print(type(X_train), type(y_train))

<class 'scipy.sparse.csr.csr_matrix'> <class 'pandas.core.series.Series'>


The next cell contains a helper function to be used for training and evaluating the different algorithms.

In [21]:
def eval_model(X_train, y_train, X_dev, y_dev, model):
  '''Train a model and evaluate it, to make it easy to try multiple algorithms.
  Parameters:
    X_train: the features to be used for training
    y_train: the targets to be used for training
    X_dev: the features to be used to evaluate the model (from the dev set)
    y_dev: the targets to be used to evaluate the model (from the dev set)
    model: the model to be trained and evaluated
  Returns the trained model, the training error, and the dev error (both as mean squared error).
  '''
  model.fit(X_train, y_train)
  y_predict = model.predict(X_train)
  training_error = mean_squared_error(y_train, y_predict)
  y_predict = model.predict(X_dev)
  dev_error = mean_squared_error(y_dev, y_predict)
  return model, training_error, dev_error
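
The helper reports mean squared error, which is in squared price units and hard to read directly. Taking the square root gives the RMSE, which is on the same scale as the price itself. A minimal sketch follows. The `rmse` function and the predictions are hypothetical; only the three true prices come from the training sample printed earlier.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Three prices from the training sample and made-up predictions for them
y_true = [603, 3397, 2804]
y_pred = [653, 3297, 2804]  # hypothetical model output

print(rmse(y_true, y_pred))
```

The same conversion applies to the values returned by `eval_model`: `np.sqrt(dev_error)` gives the dev RMSE.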

In [28]:
forest_reg, training_error, dev_error = eval_model(X_train, y_train, X_dev, y_dev, RandomForestRegressor())

In [32]:
results = []
results.append(('forest_reg', training_error, dev_error))
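
Once several models have been appended, the list of tuples can be turned into a small table for comparison. The sketch below assumes the same (name, training MSE, dev MSE) layout. The entries in `example_results` are invented placeholders, not measured results from this notebook.

```python
import numpy as np
import pandas as pd

# Placeholder entries in the (name, training MSE, dev MSE) layout;
# the numbers are invented for illustration only
example_results = [
    ("model_a", 100.0, 400.0),
    ("model_b", 900.0, 1600.0),
]

summary = pd.DataFrame(example_results, columns=["model", "train_mse", "dev_mse"])
# Square roots put the errors back on the price scale
summary["train_rmse"] = np.sqrt(summary["train_mse"])
summary["dev_rmse"] = np.sqrt(summary["dev_mse"])
print(summary.sort_values("dev_rmse"))
```

A large gap between the training and dev columns for a model is a hint that it is overfitting.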

In [None]:
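# LinearSVC is a classifier; price is a continuous target, so use the regressor LinearSVR
from sklearn.svm import LinearSVR
linearsvr_reg = LinearSVR(C=1.0, loss='squared_epsilon_insensitive', dual=False)
linearsvr_reg, training_error, dev_error = eval_model(X_train, y_train, X_dev, y_dev, linearsvr_reg)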