# Arthena Data Science Challenge - Part 1.2 & Part 1.3

Author: Manksh Gupta

We would like to train a model to predict the `hammer_price` of lots by Picasso at upcoming auctions (some upcoming lots are included in your dataset). Note that this means that you can't use future data to predict the past. You may use as many or as few features as you like except for `buyers_premium` (it's based on the sale price. See [SCHEMA.md](SCHEMA.md) for more details). Did you perform any data cleaning, filtering or transformations to improve the model fit? Why did you choose this model? What loss function did you use? Why did you pick this loss function?

## Imports

In [1]:
from utils import preprocess_individual
from utils import percentage_diff
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor

## Preprocessing

This is the first stage of the problem and arguably the most important one. Models depend on their data: if the data is garbage, the model will give us garbage. It is therefore important to preprocess the data and extract useful features before training models.

In [2]:
picasso = pd.read_csv('artists/picasso.csv')

The Preprocessing techniques that I am using are:

1) Dealing with Missing/Incorrect Values - The data has a lot of missing values denoted by '-1'. Wherever it makes sense, I replace these with the mean value; where that doesn't make sense, I either remove the entries or try other missing-data techniques. These are described in the comments.

2) Adjusting Hammer Price - Hammer prices are in different currencies, so I convert everything to USD using the given conversion rates.

3) Trying to integrate Time - Since there is a time component to the model, I try various techniques to incorporate it. First, I observed a monthly seasonality, so I include the auction month as a feature. I also take the difference in years between auction and death, auction and work_execution, etc. to see what effect this time has on the model.

4) Categorical features - Many features, such as name, type of work, auction dept, painting vs sculpture, etc., are converted to one-hot-encoded features in the model.

5) Size/Area of the artwork is one feature that I would have loved to include. However, the measurement units ('cm', 'mm', 'in', 'm') are missing for a lot of the data, so it is very hard to include this feature.

Finally, I drop the columns that I don't use and move on to the next steps.
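
The steps above can be sketched on a toy frame. This is only an illustration of the ideas, not the contents of `preprocess_individual`; every column name and value below (`usd_rate`, `auction_date`, `year_of_execution`, `category`) is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Toy data; every column name here is hypothetical and only illustrates the steps.
toy = pd.DataFrame({
    "hammer_price": [1000.0, 2000.0, 500.0],
    "usd_rate": [1.0, 1.5, 2.0],            # assumed multiplier from local currency to USD
    "auction_date": pd.to_datetime(["2015-05-12", "2016-11-03", "2017-02-20"]),
    "year_of_execution": [1932, -1, 1950],  # -1 marks a missing value
    "category": ["painting", "sculpture", "painting"],
})

# 1) Replace the -1 sentinel with the mean of the observed values.
exec_year = toy["year_of_execution"].replace(-1, np.nan)
toy["year_of_execution"] = exec_year.fillna(exec_year.mean())

# 2) Convert the hammer price to USD.
toy["adjusted_hammer_price"] = toy["hammer_price"] * toy["usd_rate"]

# 3) Time features: auction month, auction year, and years since execution.
toy["auction_month"] = toy["auction_date"].dt.month
toy["auction_year"] = toy["auction_date"].dt.year
toy["years_since_execution"] = toy["auction_year"] - toy["year_of_execution"]

# 4) One-hot encode the categorical column.
toy = pd.get_dummies(toy, columns=["category"])

# Drop the columns that are no longer needed.
toy = toy.drop(columns=["hammer_price", "usd_rate", "auction_date"])
```

Whether the mean is a sensible fill value depends on the column; the sketch applies it to a single numeric column only.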

______________________________________________________________________________________________

In [3]:
df = preprocess_individual(picasso)

Now, I do the following:

1) Separate the data where the hammer price is '-1'. This happens in two cases: when the lot didn't sell and when the auction is in the future. The lots that didn't sell are later removed, and the lots in the future are saved as a different dataset to predict on later.

2) I then separate the data into labels and features for building the model.

3) I split the data into train and test. All artwork auctioned in 2017 or later is in the test set and everything before that is in the train set. This is different from how one would evaluate a model without any time component; here we are essentially using data up to 2017 to predict the future.

4) I finally create the 'future' data: the data for future auctions that we want to predict with the best model that we find. I assume that November 2018 and after is the future.

In [4]:
final_test = df[df['adjusted_hammer_price'] < 0 ]
df = df[df['adjusted_hammer_price'] > 0 ]
label = df['adjusted_hammer_price']
df = df.drop(columns = ['adjusted_hammer_price'])

In [5]:
X_train = df[df.auction_year < 2017]
X_test = df[df.auction_year >= 2017]
y_train = label[X_train.index]
y_test = label[X_test.index]

In [6]:
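# Future auctions: November 2018 and later (the month filter applies only to 2018)
is_future = (final_test.auction_year > 2018) | (
    (final_test.auction_year == 2018) & (final_test.auction_month >= 11))
future = final_test[is_future]
future = future.drop(columns = ['adjusted_hammer_price'])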

## Modeling

I try different models below, grid-search their hyperparameters, and then decide on the model that works best for this problem. For all the models, I optimize mean squared error (MSE). I also tried optimizing mean absolute error (MAE), since it is less sensitive to outliers than MSE; however, the results were comparable and training was significantly slower. This is probably because MSE has nicer mathematical properties (it is differentiable).
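
One detail worth making explicit: `GridSearchCV` with `scoring=None` ranks candidates with the estimator's own `score` method, which is R-squared for scikit-learn regressors, while each estimator is still fitted with its own loss. The sketch below shows how an error-based ranking could be requested explicitly. The synthetic data and the small grid are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, used only to make the example runnable.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

grid = {"alpha": [0.01, 0.1, 1.0]}

# Rank candidates by (negated) mean squared error instead of the default R-squared.
search_mse = GridSearchCV(Lasso(max_iter=10000), grid, cv=5,
                          scoring="neg_mean_squared_error")
search_mse.fit(X_demo, y_demo)

# Rank candidates by (negated) mean absolute error, which is less sensitive to outliers.
search_mae = GridSearchCV(Lasso(max_iter=10000), grid, cv=5,
                          scoring="neg_mean_absolute_error")
search_mae.fit(X_demo, y_demo)

print(search_mse.best_params_, search_mae.best_params_)
```

The scores are negated because `GridSearchCV` always maximizes its scoring function.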

For each of the following models, I evaluate results using two main metrics: R-squared and the percentage difference between predicted and actual values. The first is a standard and popular metric for evaluating regressions. The second is less common: for each observation, it measures the percentage difference between the actual and the predicted value. Ideally, we want to choose the model where most observations have a small percentage difference.
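
The implementation of `percentage_diff` lives in `utils` and is not shown here. As a rough sketch of the idea, the hypothetical helper below computes the absolute percentage difference for each observation and counts how many fall into each band. The function name, band edges and labels are assumptions and need not match what `percentage_diff` does.

```python
import numpy as np

def bucket_percentage_errors(actual, predicted, edges=(10, 25, 50, 100)):
    """Count observations by absolute percentage difference (hypothetical helper)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Absolute difference relative to the actual value, in percent.
    pct = np.abs(predicted - actual) / np.abs(actual) * 100

    counts = {}
    lower = 0.0
    for upper in edges:
        # Each band is [lower, upper); the first band starts at 0.
        counts[f"{lower:g}% to <{upper:g}%"] = int(((pct >= lower) & (pct < upper)).sum())
        lower = float(upper)
    # Everything at or above the last edge goes into a final open-ended band.
    counts[f">={lower:g}%"] = int((pct >= lower).sum())
    return counts

demo_counts = bucket_percentage_errors(
    actual=[100, 200, 400, 1000],
    predicted=[105, 260, 380, 2500],
)
print(demo_counts)
```

Dividing by the actual value means the helper is only meaningful for strictly positive prices, which holds here because unsold lots are removed.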

Due to the time series nature of the data, the training and testing of the model is done a little differently. Normally, one randomly shuffles the data, trains on one part and tests on the other. Since we want to predict the future, it makes sense to train only on past data and test on later data. I split the data into train and test based on time (before 2017, and 2017 onward), trained on the former and tested on the latter. During cross-validation, however, the folds within the training set are formed without regard to time order. This is fine because, at a given point in time, we have all the past data, so it makes sense to cross-validate across it.
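
If one wanted the cross-validation folds to respect time order as well, scikit-learn's `TimeSeriesSplit` is one option. This is not what the notebook does. It is a sketch of an alternative, and it assumes the rows are sorted from oldest to newest before fitting. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic rows, assumed to be already sorted from oldest to newest.
rng = np.random.default_rng(0)
X_sorted = rng.normal(size=(120, 4))
y_sorted = X_sorted[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

# Each split trains on an initial block of rows and validates on the block that follows it.
time_cv = TimeSeriesSplit(n_splits=5)
for train_idx, valid_idx in time_cv.split(X_sorted):
    assert train_idx.max() < valid_idx.min()  # validation rows always come later

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    {"max_depth": [3, 5]},
    cv=time_cv,
)
search.fit(X_sorted, y_sorted)
print(search.best_params_)
```

The trade-off is that early folds train on fewer rows, so their scores tend to be noisier.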

Mean squared error is also an important metric to track in regression problems; however, I am not tracking it, because MSE is extremely sensitive to outliers. For example, if a painting sold for 5 million and we predict 4.5 million, the values are actually not that far apart, but the squared error is extremely high (a difference of 500 thousand, squared). Thus, MSE may not be a good metric to track in this particular problem.
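
The arithmetic behind that example, using the same two numbers, can be written out as follows:

```python
actual = 5_000_000.0
predicted = 4_500_000.0

# Squared error for this single lot: (500,000)^2 = 2.5e11.
squared_error = (actual - predicted) ** 2

# The same miss expressed relative to the actual price: 10 percent.
percentage_error = abs(actual - predicted) / actual * 100

print(squared_error, percentage_error)
```

A single expensive lot can therefore dominate an average of squared errors, while its percentage error stays modest.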

### Linear Regression - Lasso Penalty 

In [7]:
parameters = {'alpha':[0.6,0.7,0.8,0.9,1,1.5,2,5,10,20,50,100,200,300,1000,2000,3000]}
regressor = linear_model.Lasso(fit_intercept=True, max_iter=10000, tol = .1, random_state = 42)
reg = GridSearchCV(regressor, parameters, cv=5)

In [8]:
reg.fit(X_train,y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=10000,
   normalize=False, positive=False, precompute=False, random_state=42,
   selection='cyclic', tol=0.1, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'alpha': [0.6, 0.7, 0.8, 0.9, 1, 1.5, 2, 5, 10, 20, 50, 100, 200, 300, 1000, 2000, 3000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [9]:
print('Best Parameters for Linear Regression are:', reg.best_params_)
print('Linear Regression - Train R-squared', reg.score(X_train, y_train))
print('Linear Regression - Test R-squared', reg.score(X_test, y_test))

Best Parameters for Linear Regression are: {'alpha': 2000}
Linear Regression - Train R-squared 0.9170384011915328
Linear Regression - Test R-squared 0.8793459122235444


In [10]:
percentage_diff(y_train, reg.predict(X_train))

{'10% or lesser Difference': 6825,
 '25% or lesser Difference': 151,
 '50% or lesser Difference': 26,
 '75% or lesser Difference': 5,
 'Greater than or Equal to 100%': 4}

In [11]:
percentage_diff(y_test, reg.predict(X_test))

{'10% or lesser Difference': 974,
 '25% or lesser Difference': 8,
 '50% or lesser Difference': 2,
 '75% or lesser Difference': 1,
 'Greater than or Equal to 100%': 3}

### Decision Tree

In [12]:
parameters = {'max_depth':[5,6,7,8,10,12,15,20],'min_samples_split':[5,10,15,20,30,50]  }
regressor = DecisionTreeRegressor(random_state=0, criterion='mse')
reg = GridSearchCV(regressor, parameters, cv=5)

In [13]:
reg.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=0, splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [5, 6, 7, 8, 10, 12, 15, 20], 'min_samples_split': [5, 10, 15, 20, 30, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [14]:
print('Best Parameters for Decision tree are:', reg.best_params_)
print('Decision Tree - Train R-squared', reg.score(X_train, y_train))
print('Decision Tree - Test R-squared', reg.score(X_test, y_test))

Best Parameters for Decision tree are: {'max_depth': 6, 'min_samples_split': 10}
Decision Tree - Train R-squared 0.9603418424143736
Decision Tree - Test R-squared 0.8467637959802707


In [15]:
percentage_diff(y_train, reg.predict(X_train))

{'10% or lesser Difference': 6994,
 '25% or lesser Difference': 15,
 '50% or lesser Difference': 2,
 '75% or lesser Difference': 0,
 'Greater than or Equal to 100%': 0}

In [16]:
percentage_diff(y_test, reg.predict(X_test))

{'10% or lesser Difference': 985,
 '25% or lesser Difference': 3,
 '50% or lesser Difference': 0,
 '75% or lesser Difference': 0,
 'Greater than or Equal to 100%': 0}

### Random Forest

In [17]:
parameters = {'max_depth':[5,6,7,8,10,12],'min_samples_split':[5,10,15,20,30] }
regressor = RandomForestRegressor(random_state=0, n_estimators= 150, n_jobs=-1)
reg = GridSearchCV(regressor, parameters, cv=5)

In [18]:
reg.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=-1,
           oob_score=False, random_state=0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [5, 6, 7, 8, 10, 12], 'min_samples_split': [5, 10, 15, 20, 30]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [19]:
print('Best Parameters for Random Forest are:', reg.best_params_)
print('Random Forest - Train R-squared', reg.score(X_train, y_train))
print('Random Forest - Test R-squared', reg.score(X_test, y_test))

Best Parameters for Random Forest are: {'max_depth': 5, 'min_samples_split': 5}
Random Forest - Train R-squared 0.9538001964306466
Random Forest - Test R-squared 0.8664418478729164


In [20]:
percentage_diff(y_train, reg.predict(X_train))

{'10% or lesser Difference': 6966,
 '25% or lesser Difference': 40,
 '50% or lesser Difference': 3,
 '75% or lesser Difference': 2,
 'Greater than or Equal to 100%': 0}

In [21]:
percentage_diff(y_test, reg.predict(X_test))

{'10% or lesser Difference': 984,
 '25% or lesser Difference': 4,
 '50% or lesser Difference': 0,
 '75% or lesser Difference': 0,
 'Greater than or Equal to 100%': 0}

### Adaboost

In [22]:
parameters = {'learning_rate':[0.001,0.01,0.1,.3,.5],
              'n_estimators':[50,100], 'loss': ['square', 'linear', 'exponential'],
             'base_estimator__max_depth': [1,2,3],
             'base_estimator__min_samples_split': [3,5,10,20]}
regressor = AdaBoostRegressor(DecisionTreeRegressor( random_state=0, min_samples_split= 5))
reg = GridSearchCV(regressor, parameters, cv=5)

In [23]:
reg.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=AdaBoostRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           presort=False, random_state=0, splitter='best'),
         learning_rate=1.0, loss='linear', n_estimators=50,
         random_state=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'learning_rate': [0.001, 0.01, 0.1, 0.3, 0.5], 'n_estimators': [50, 100], 'loss': ['square', 'linear', 'exponential'], 'base_estimator__max_depth': [1, 2, 3], 'base_estimator__min_samples_split': [3, 5, 10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [24]:
print('Best Parameters for Adaboost are:', reg.best_params_)
print('AdaBoost - Train R-squared', reg.score(X_train, y_train))
print('AdaBoost - Test R-squared', reg.score(X_test, y_test))

Best Parameters for Adaboost are: {'base_estimator__max_depth': 3, 'base_estimator__min_samples_split': 10, 'learning_rate': 0.5, 'loss': 'exponential', 'n_estimators': 50}
AdaBoost - Train R-squared 0.9388106536209241
AdaBoost - Test R-squared 0.8590044003407089


In [25]:
percentage_diff(y_train, reg.predict(X_train))

{'10% or lesser Difference': 3331,
 '25% or lesser Difference': 2016,
 '50% or lesser Difference': 1189,
 '75% or lesser Difference': 287,
 'Greater than or Equal to 100%': 188}

In [26]:
percentage_diff(y_test, reg.predict(X_test))

{'10% or lesser Difference': 648,
 '25% or lesser Difference': 222,
 '50% or lesser Difference': 86,
 '75% or lesser Difference': 18,
 'Greater than or Equal to 100%': 14}

## Final Thoughts

From the above results, the Random Forest Regressor seems to have performed the best of the lot. I observe a fairly high training R-squared (0.95) and a reasonably high test R-squared (0.87). Since the best parameters are chosen by grid search with 5-fold cross-validation, we can be reasonably confident that the model is not badly overfitting.

The linear regression with a lasso penalty also performs fairly well on paper. However, upon observing the predictions, I see that the model actually predicts negative values for hammer price. This is probably because the model fits the data linearly and tries to fit the larger values, which steepens the line significantly and leads to negative predictions for lower-valued items.
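
The effect can be reproduced on synthetic, right-skewed data. The sketch below counts negative predictions from a plain Lasso fit and then shows one commonly used remedy, fitting the logarithm of the price through `TransformedTargetRegressor`. The log transform is an assumption added for illustration; the notebook does not use it.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Lasso

# Synthetic, heavily right-skewed prices (illustrative only).
x = np.linspace(0, 10, 200).reshape(-1, 1)
prices = np.exp(x).ravel()

# A straight line fitted to skewed prices dips below zero for the cheapest items.
linear = Lasso(alpha=1.0, max_iter=10000).fit(x, prices)
n_negative = int((linear.predict(x) < 0).sum())

# Fitting log(price) and exponentiating the prediction keeps every prediction positive.
log_model = TransformedTargetRegressor(
    regressor=Lasso(alpha=0.01, max_iter=10000),
    func=np.log,
    inverse_func=np.exp,
).fit(x, prices)
all_positive = bool((log_model.predict(x) > 0).all())

print(n_negative, all_positive)
```

Note that a model fitted on log prices minimizes error in log space, so its metrics are not directly comparable to the R-squared values reported above.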

Along with this, the other metric (percentage difference) shows that, for the random forest model, most predictions are within 10% of the actual values, which is very valuable in determining our confidence in the model.

Now, I run the Random Forest model again on the 'Future' data and write the 'Future' predictions to a CSV file along with the original features that were given. It's not possible to evaluate how good or bad these predictions are, as we do not have access to the actual values for this data.

I also tried the different models with different features. Initially I had some features that did not add predictive power to the model. The feature transformations that I did also helped a lot, giving a large boost in metrics. To keep the document short and legible, I do not show all the different combinations of features that I tried.
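
One way to check which features carry predictive power is to inspect a fitted forest's `feature_importances_`. The notebook does not show how features were screened, so this is only a sketch on synthetic data with hypothetical feature names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical features: one that drives the target and one that is pure noise.
X_demo = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = 3.0 * X_demo["informative"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances sum to 1; a larger value means the feature's splits
# reduced the squared error more.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Features with importance close to zero are candidates for removal, although correlated features can share importance between them.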

In [27]:
#Regressor using best parameters found in gridsearch
regressor = RandomForestRegressor(random_state=0, n_estimators= 150, 
                                  n_jobs=-1,max_depth = 5, min_samples_split = 5)

In [28]:
regressor.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=5,
           min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=-1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

In [29]:
print('Random Forest - Train R-squared', regressor.score(X_train, y_train))
print('Random Forest - Test R-squared', regressor.score(X_test, y_test))

Random Forest - Train R-squared 0.9538001964306466
Random Forest - Test R-squared 0.8664418478729164


In [30]:
percentage_diff(y_train, regressor.predict(X_train))

{'10% or lesser Difference': 6966,
 '25% or lesser Difference': 40,
 '50% or lesser Difference': 3,
 '75% or lesser Difference': 2,
 'Greater than or Equal to 100%': 0}

In [31]:
percentage_diff(y_test, regressor.predict(X_test))

{'10% or lesser Difference': 984,
 '25% or lesser Difference': 4,
 '50% or lesser Difference': 0,
 '75% or lesser Difference': 0,
 'Greater than or Equal to 100%': 0}

Below, I show the correlation between the actual and the predicted values. The correlation is very high. On its own this is probably not the best metric to track, but together with the other metrics it makes a good case.
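
A small synthetic example illustrates why correlation alone is not enough: Pearson correlation ignores scale and offset, so predictions can be perfectly correlated with the actual values and still be far off.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Predictions that are perfectly correlated with the truth but badly scaled and shifted.
y_pred = 2.0 * y_true + 100.0

corr = np.corrcoef(y_true, y_pred)[0, 1]  # correlation is (numerically) 1
r2 = r2_score(y_true, y_pred)             # R-squared is strongly negative
print(corr, r2)
```

This is why the correlation is read alongside R-squared and the percentage differences rather than instead of them.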

In [36]:
pd.DataFrame(np.corrcoef(y_train,regressor.predict(X_train) ))

|   | 0 | 1 |
|---|---|---|
| 0 | 1.0 | 0.977686 |
| 1 | 0.977686 | 1.0 |


In [37]:
pd.DataFrame(np.corrcoef(y_test,regressor.predict(X_test) ))

|   | 0 | 1 |
|---|---|---|
| 0 | 1.0 | 0.932686 |
| 1 | 0.932686 | 1.0 |


In [34]:
#Predicting Values for future auctions
predicted_hammer_price = regressor.predict(future)

In [35]:
#Merging with original data and writing as csv
future_data = picasso.loc[future.index]
predictions = pd.DataFrame(predicted_hammer_price).set_index(future_data.index) 
predictions.columns = ['predicted_hammer_price']
pd.concat([future_data, predictions], axis = 1).to_csv('artists/picasso_future_predictions.csv')

There is one interesting thing that I notice in the future predictions: for artwork with a smaller estimated/real sale value, the model tends to predict a higher price, i.e. the minimum price the model predicts is higher than the actual minimum price. Also, below a threshold, the model predicts the same minimum value for different data points. This is because the model tries harder to fit larger values (in the hundreds of thousands and millions) than lower ones, which causes the random forest to make similar decisions for all values below a certain threshold.
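
This behaviour can be illustrated with a single shallow tree on synthetic, right-skewed data. The sketch is a simplified stand-in for the forest, not a reproduction of the notebook's model: with squared error, the splits concentrate on the expensive end, and all cheap items land in one leaf whose prediction is the leaf mean.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, right-skewed prices driven by one feature (illustrative only).
x = np.linspace(0, 10, 200).reshape(-1, 1)
prices = np.exp(x).ravel()

# A tree of depth 2 has at most 4 leaves, so at most 4 distinct predictions.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, prices)
pred = tree.predict(x)

n_distinct = len(np.unique(pred))
# Number of items that all receive the lowest predicted price.
n_sharing_min = int((pred == pred.min()).sum())
print(n_distinct, n_sharing_min, pred.min(), prices.min())
```

A random forest averages many such trees, so the same floor can appear in its predictions in a smoothed form.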