# Arthena Data Science Challenge - Part 2.1

Author: Manksh Gupta

Merge all three csvs together to train a single pooled model. Validate your model against only the Picasso data (in the same way you did for Part 1.3) and report the final validation errors. How does this model and its error compare to the results from the Picasso-only model that you trained in Part 1? Describe the tradeoffs of using pooled models vs using models for individual artists.

## Imports

In [20]:
from utils import preprocess_all
from utils import percentage_diff
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor

## Preprocessing

This is the first stage of the problem and arguably the most important one. Models are data dependent: if the data is garbage, the model will give us garbage. Thus, it's important to preprocess the data and extract useful features before training models on it.

I am essentially doing the same preprocessing as in Part 1.2, except that I now impute missing values of the work execution year using the median lifespan of each artist. I also made slight changes to the preprocessing function so that it works with any number of artists rather than a single artist at a time.
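As an illustration of the imputation step, here is a minimal sketch under one possible reading of "median lifespan": a missing execution year is filled with the midpoint of the artist's birth and death years. The column names `artist_birth_year` and `artist_death_year`, the toy rows, and the helper `impute_execution_year` are assumptions made for this example; the actual logic lives in `preprocess_all` in `utils` and may differ.

```python
import numpy as np
import pandas as pd

def impute_execution_year(frame):
    """Fill missing work_execution_year with the midpoint of the artist's life (assumed rule)."""
    out = frame.copy()
    # Integer midpoint of birth and death year, computed row by row.
    midpoint = (out['artist_birth_year'] + out['artist_death_year']) // 2
    out['work_execution_year'] = out['work_execution_year'].fillna(midpoint)
    return out

# Toy rows (hypothetical), two of them with a missing execution year.
toy = pd.DataFrame({
    'artist_name': ['Pablo Picasso', 'Pablo Picasso', 'Andy Warhol'],
    'artist_birth_year': [1881, 1881, 1928],
    'artist_death_year': [1973, 1973, 1987],
    'work_execution_year': [1932.0, np.nan, np.nan],
})
imputed = impute_execution_year(toy)
```

Because the fill value is computed per row from that artist's own dates, the same function works for a pooled frame with any number of artists.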

In [21]:
picasso = pd.read_csv('artists/picasso.csv')
lewitt = pd.read_csv('artists/lewitt.csv')
warhol = pd.read_csv('artists/warhol.csv')
dataset = pd.concat([picasso,lewitt,warhol])

______________________________________________________________________________________________

In [22]:
df = preprocess_all(dataset)#loaded from utils
df = df.reset_index()
df = df.drop(columns = 'index')

Now, I do the following:

1) Separate the rows where the hammer price is '-1'. This happens in two cases: when the lot didn't sell, and when the auction is in the future. The unsold lots are later removed, and the future lots are saved as a separate dataset to predict on later.

2) I then separate the remaining data into features and labels for building the model.

3) I split the data into train and test. Picasso sales from 2017 onward form the test set, and everything else (earlier Picasso sales and all LeWitt and Warhol sales) forms the train set. This is different from how one would evaluate a model without any time component: here we are essentially using the data up to 2017 to predict for the future.

4) I finally create the 'future' data: the future Picasso auctions that we want to predict with the best model that we find. I assume that November 2018 and after is the future.
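The four steps can be sketched on a toy frame as follows. The rows are made up for illustration, and only the column names are taken from the notebook; this is a sketch of the splitting logic, not the notebook's own cells.

```python
import pandas as pd

# Toy rows (hypothetical): a price of -1 marks an unsold or future lot.
toy = pd.DataFrame({
    'artist_name_Pablo Picasso': [1, 1, 1, 0, 0, 1],
    'auction_year': [2015, 2017, 2018, 2018, 2016, 2018],
    'auction_month': [5, 3, 11, 6, 2, 12],
    'adjusted_hammer_price': [100.0, 250.0, -1.0, 80.0, -1.0, -1.0],
})

# Steps 1-2: set aside rows without a realised price; the rest supply the labels.
no_price = toy[toy['adjusted_hammer_price'] < 0]
sold = toy[toy['adjusted_hammer_price'] > 0]
labels = sold['adjusted_hammer_price']
features = sold.drop(columns=['adjusted_hammer_price'])

# Step 3: hold out Picasso sales from 2017 onward; everything else is training data.
is_test = (features['artist_name_Pablo Picasso'] == 1) & (features['auction_year'] >= 2017)
X_train, X_test = features[~is_test], features[is_test]
y_train, y_test = labels[~is_test], labels[is_test]

# Step 4: Picasso lots without a price, dated November 2018 or later, form the 'future' set.
is_future = (
    (no_price['artist_name_Pablo Picasso'] == 1)
    & ((no_price['auction_year'] > 2018)
       | ((no_price['auction_year'] == 2018) & (no_price['auction_month'] >= 11)))
)
future = no_price[is_future].drop(columns=['adjusted_hammer_price'])
```

Note that the non-Picasso sale from 2018 stays in the training set: only Picasso rows are ever held out, since validation is against Picasso data alone.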

In [23]:
final_test = df[df['adjusted_hammer_price'] < 0 ]
df = df[df['adjusted_hammer_price'] > 0 ]
label = df['adjusted_hammer_price']
df = df.drop(columns = ['adjusted_hammer_price'])

In [24]:
#testing data mask - Picasso sales from 2017 onward
mask = (df['artist_name_Pablo Picasso'] == 1) & (df['auction_year'] >= 2017)
X_train = df[~mask]
X_test = df[mask]
y_train = label[~mask]
y_test = label[mask]

In [25]:
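#future = auctions in November 2018 or later (this also keeps the early months of later years)
future = final_test[(final_test.auction_year > 2018) |
                    ((final_test.auction_year == 2018) & (final_test.auction_month >= 11))]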
future = future[future['artist_name_Pablo Picasso'] == 1]
future = future.drop(columns = ['adjusted_hammer_price'])

## Modeling

I try different models below, grid search their hyperparameters, and then decide on the model that works best for this problem. For all the models, I am trying to optimize mean_squared_error. I also tried optimizing mean_absolute_error (since it's less sensitive to outliers than MSE); however, the results were comparable and training was significantly slower. For the tree-based models, this is probably because the squared-error criterion can be updated cheaply as a candidate split moves, while the absolute-error criterion needs medians to be recomputed.
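One detail worth making explicit: when `scoring` is left unset, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is R-squared for scikit-learn regressors. A minimal sketch of selecting by cross-validated MSE instead is shown here on synthetic data; `X_demo`, `y_demo` and the small grid are placeholders, not the notebook's data or settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (placeholder for the auction features).
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(200, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# 'neg_mean_squared_error' makes the search pick the candidate with the lowest CV MSE.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {'max_depth': [2, 4, 6]},
    cv=5,
    scoring='neg_mean_squared_error',
)
search.fit(X_demo, y_demo)

# Scores are negated so that "higher is better"; flip the sign to read the MSE.
best_cv_mse = -search.best_score_
```

The two selection rules often agree on the best candidate, since both reward small squared residuals, but they are averaged over folds differently and need not coincide.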

#### Metrics:

For each of the following models, I evaluate results using 2 main metrics: R-squared and the percentage difference between the predictions and the actual values. The first metric is fairly standard in industry and is one of the most popular metrics for evaluating regressions. The second metric is less standard. It counts how many observations fall into each band of percentage difference between actual and predicted values. Ideally, we want to choose the model where most observations have a small percentage difference.
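To make the second metric concrete, here is a minimal sketch of such a banded count. It is not the `percentage_diff` function from `utils`, whose exact band edges are not shown in this notebook; the name `percentage_diff_sketch` and the bands below are assumptions chosen for illustration.

```python
import numpy as np

def percentage_diff_sketch(actual, predicted):
    """Count observations per band of absolute percentage error (assumed band edges)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Absolute error expressed as a percentage of the actual value.
    pct = np.abs(predicted - actual) * 100 / actual
    return {
        '<= 10%': int((pct <= 10).sum()),
        '10% - 25%': int(((pct > 10) & (pct <= 25)).sum()),
        '25% - 50%': int(((pct > 25) & (pct <= 50)).sum()),
        '50% - 100%': int(((pct > 50) & (pct < 100)).sum()),
        '>= 100%': int((pct >= 100).sum()),
    }

# Errors of 5%, 20%, 37.5% and 140% land in four different bands.
counts = percentage_diff_sketch([100, 200, 400, 50], [105, 240, 250, 120])
```

A model is preferable on this metric when most of its observations land in the first band.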

Mean Squared Error is also an important metric in regression problems. I optimize for it but do not report it, because its large values can be misleading: MSE is extremely sensitive to outliers. For example, if a painting sold for 5 million and we predict 4.5 million, the prediction is off by only 10%, but the error of 500 thousand contributes 2.5 × 10^11 to the squared error. The data has a number of outliers, and they are very large values; these inflate the MSE and paint a bad picture. Thus, MSE may not be a good metric to track in this particular problem.
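The arithmetic behind this example can be checked directly. The second lot below is a made-up contrast case, added only to show how a much worse relative miss on a cheap lot barely registers in the squared error.

```python
def squared_and_pct_error(actual, predicted):
    """Return (squared error, absolute percentage error) for one prediction."""
    diff = actual - predicted
    return diff ** 2, abs(diff) * 100 / actual

# The example from the text: a 10% miss on a 5 million lot.
expensive = squared_and_pct_error(5_000_000, 4_500_000)

# Hypothetical contrast: a 50% miss on a 10 thousand lot.
cheap = squared_and_pct_error(10_000, 5_000)

# How many times larger the expensive lot's squared error is.
ratio = expensive[0] / cheap[0]
```

The expensive lot's squared error is 10,000 times larger even though its percentage error is five times smaller, which is why a few high-value lots can dominate the MSE.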

______________________________________________________________________________________________

Due to the time series nature of the data, the training and testing of the model is done a little differently. Normally, one randomly shuffles the data, trains on one part, and tests on the other. Since we want to predict for the future, it makes sense to train on past data and test on later data. I therefore split by time, holding out the Picasso sales from 2017 onward as the test set and training on the rest. The cross-validation inside the grid search, however, is not split by time. This is fine, as at a given point in time we have all the past data, so it makes sense to cross-validate across it without regard to order.
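A caveat on the cross-validation step: in scikit-learn, an integer `cv` for a regressor uses `KFold` without shuffling, so each fold is a contiguous block of rows. A minimal sketch of the difference, on ten placeholder row indices:

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # placeholder for the training rows, in file order

# Default KFold: test folds are contiguous blocks of rows.
plain = [test for _, test in KFold(n_splits=5).split(rows)]

# Shuffled KFold: rows are permuted before being cut into folds.
shuffled = [test for _, test in KFold(n_splits=5, shuffle=True, random_state=0).split(rows)]

first_plain_fold = plain[0].tolist()
```

Passing the shuffled splitter as `cv=KFold(n_splits=5, shuffle=True, random_state=0)` to `GridSearchCV` gives randomly mixed folds. This matters for a pooled frame built by concatenating artists one after another, where contiguous folds would otherwise tend to follow artist boundaries.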

### Decision Tree

In [7]:
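parameters = {'max_depth':[5,6,7,8,10,12,15,20], 'min_samples_split':[5,10,15,20,30,50]}
#criterion='mse' is the squared-error split criterion (named 'squared_error' from scikit-learn 1.0)
regressor = DecisionTreeRegressor(random_state=0, criterion='mse')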
reg = GridSearchCV(regressor, parameters, cv=5)

In [8]:
reg.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=0, splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [5, 6, 7, 8, 10, 12, 15, 20], 'min_samples_split': [5, 10, 15, 20, 30, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [9]:
print('Best Parameters for Decision tree are:', reg.best_params_)
print('Decision Tree - Train R-squared', reg.score(X_train, y_train))
print('Decision Tree - Test R-squared', reg.score(X_test, y_test))

Best Parameters for Decision tree are: {'max_depth': 8, 'min_samples_split': 30}
Decision Tree - Train R-squared 0.8964104717938358
Decision Tree - Test R-squared 0.8119711803305818


In [10]:
percentage_diff(y_train, reg.predict(X_train))

{'10% or lesser Difference': 13614,
 '25% - 10% Difference': 11,
 '50% - 25% Difference': 2,
 '75% - 50% Difference': 0,
 'Greater than or Equal to 100%': 0}

In [11]:
percentage_diff(y_test, reg.predict(X_test))

{'10% or lesser Difference': 986,
 '25% - 10% Difference': 2,
 '50% - 25% Difference': 0,
 '75% - 50% Difference': 0,
 'Greater than or Equal to 100%': 0}

### Random Forest

In [12]:
parameters = {'max_depth':[2,3,4,5,6,7,8],'min_samples_split':[2,3,5,7,10,15] }
regressor = RandomForestRegressor(random_state=0, n_estimators= 150, n_jobs=-1)
reg = GridSearchCV(regressor, parameters, cv=5)

In [13]:
reg.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=-1,
           oob_score=False, random_state=0, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7, 8], 'min_samples_split': [2, 3, 5, 7, 10, 15]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [14]:
print('Best Parameters for Random Forest are:', reg.best_params_)
print('Random Forest - Train R-squared', reg.score(X_train, y_train))
print('Random Forest - Test R-squared', reg.score(X_test, y_test))

Best Parameters for Random Forest are: {'max_depth': 5, 'min_samples_split': 3}
Random Forest - Train R-squared 0.9412006866303314
Random Forest - Test R-squared 0.8491726442279472


In [15]:
percentage_diff(y_train, reg.predict(X_train))

{'10% or lesser Difference': 12954,
 '25% - 10% Difference': 596,
 '50% - 25% Difference': 66,
 '75% - 50% Difference': 8,
 'Greater than or Equal to 100%': 3}

In [16]:
percentage_diff(y_test, reg.predict(X_test))

{'10% or lesser Difference': 956,
 '25% - 10% Difference': 27,
 '50% - 25% Difference': 3,
 '75% - 50% Difference': 2,
 'Greater than or Equal to 100%': 0}

### AdaBoost

In [22]:
parameters = {'learning_rate':[0.001,0.01,0.1,0.3,0.5],
              'n_estimators':[50,100], 'loss': ['square', 'linear', 'exponential'],
             'base_estimator__max_depth': [1,2,3],
             'base_estimator__min_samples_split': [3,5,10,20]}
regressor = AdaBoostRegressor(DecisionTreeRegressor( random_state=0, min_samples_split= 5))
reg = GridSearchCV(regressor, parameters, cv=5)

In [23]:
reg.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=AdaBoostRegressor(base_estimator=DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           presort=False, random_state=0, splitter='best'),
         learning_rate=1.0, loss='linear', n_estimators=50,
         random_state=None),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'learning_rate': [0.001, 0.01, 0.1, 0.3, 0.5], 'n_estimators': [50, 100], 'loss': ['square', 'linear', 'exponential'], 'base_estimator__max_depth': [1, 2, 3], 'base_estimator__min_samples_split': [3, 5, 10, 20]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [24]:
print('Best Parameters for Adaboost are:', reg.best_params_)
print('AdaBoost - Train R-squared', reg.score(X_train, y_train))
print('AdaBoost - Test R-squared', reg.score(X_test, y_test))

Best Parameters for Adaboost are: {'base_estimator__max_depth': 3, 'base_estimator__min_samples_split': 3, 'learning_rate': 0.001, 'loss': 'exponential', 'n_estimators': 100}
AdaBoost - Train R-squared 0.8952665595429634
AdaBoost - Test R-squared 0.8091049173474916


In [25]:
percentage_diff(y_train, reg.predict(X_train))

{'10% or lesser Difference': 9520,
 '25% - 10% Difference': 2870,
 '50% - 25% Difference': 954,
 '75% - 50% Difference': 186,
 'Greater than or Equal to 100%': 97}

In [26]:
percentage_diff(y_test, reg.predict(X_test))

{'10% or lesser Difference': 740,
 '25% - 10% Difference': 189,
 '50% - 25% Difference': 45,
 '75% - 50% Difference': 9,
 'Greater than or Equal to 100%': 5}

## Final Thoughts

From the above results, the Random Forest Regressor has again performed the best of the lot on R-squared: its test R-squared is 0.849, against 0.812 for the decision tree and 0.809 for AdaBoost. On the percentage difference metric, the decision tree places slightly more test observations within 10% (986 against 956), so the choice rests mainly on R-squared.

I did not show linear regression again due to the unstable predictions noticed in Part 1. I did try it, but the behaviour persisted, so I left it out for succinctness.

Now, I run the Random Forest model again and use it to predict on the 'future' data, writing the predictions to a CSV file along with the original features that were given. It's not possible to evaluate how good or bad these predictions are, as we do not have access to the actual values for this data.

I also tried the different models with different features. Initially I had some features that did not add predictive power to the model. The feature transformations that I did helped a lot, giving me a huge boost in metrics. I do not show all the different combinations of features that I tried, to keep the document short and legible.

In [7]:
regressor = RandomForestRegressor(random_state=0, n_estimators= 150, 
                                  n_jobs=-1,max_depth = 5, min_samples_split = 3)

In [8]:
regressor.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=5,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=3,
           min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=-1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

In [9]:
print('Random Forest - Train R-squared', regressor.score(X_train, y_train))
print('Random Forest - Test R-squared', regressor.score(X_test, y_test))

Random Forest - Train R-squared 0.9412006866303314
Random Forest - Test R-squared 0.8491726442279472


In [10]:
percentage_diff(y_train, regressor.predict(X_train))

{'10% or lesser Difference': 12954,
 '25% - 10% Difference': 596,
 '50% - 25% Difference': 66,
 '75% - 50% Difference': 8,
 'Greater than or Equal to 100%': 3}

In [11]:
percentage_diff(y_test, regressor.predict(X_test))

{'10% or lesser Difference': 956,
 '25% - 10% Difference': 27,
 '50% - 25% Difference': 3,
 '75% - 50% Difference': 2,
 'Greater than or Equal to 100%': 0}

In [12]:
#Top features for Random Forest regression
X_train.columns[np.argsort(regressor.feature_importances_)[::-1]][:15]

Index(['estimate_high', 'high_low_diff', 'death_auction_diff',
       'execution_birth_diff', 'execution_auction_diff',
       'auction_house_Phillips', 'auction_year',
       'auction_department_Regional Specific', 'work_execution_year',
       'auction_month', 'work_medium_drawing', 'artist_name_Andy Warhol',
       'artist_name_Pablo Picasso', 'auction_house_Christies',
       'auction_department_Impressionist & Modern'],
      dtype='object')
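The ranking above lists names only. A minimal sketch of pairing each name with its importance value is shown here on synthetic data; the column names `signal`, `noise_1` and `noise_2` are placeholders, not features from the auction data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on 'signal' only.
rng = np.random.RandomState(0)
X_demo = pd.DataFrame(rng.uniform(size=(300, 3)), columns=['signal', 'noise_1', 'noise_2'])
y_demo = 5 * X_demo['signal'] + rng.normal(scale=0.05, size=300)

forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# Label each importance with its column and sort from most to least important.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
```

Applied to the fitted model above, `pd.Series(regressor.feature_importances_, index=X_train.columns)` would show how much of the total importance `estimate_high` carries relative to the artist indicator columns.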

In [26]:
#Predicting Values for future auctions
predicted_hammer_price = regressor.predict(future)

In [28]:
#Merging with original data and writing as csv
import warnings 
warnings.filterwarnings('ignore')
future_data = picasso.loc[future.index]
predictions = pd.DataFrame(predicted_hammer_price).set_index(future_data.index) 
predictions.columns = ['predicted_hammer_price']
pd.concat([future_data, predictions], axis = 1).to_csv('artists/picasso_pooled_future_predictions.csv')

The random forest model here seems to perform a bit worse than the initial individual model, but in my view the performance is still comparable. I feel the small loss in R-squared and in the percentage difference metric is acceptable if we look at the other side: the pooled model is more robust, and it may be a better predictor for new artists, especially lesser-known ones. Eventually, it depends on what we want to achieve. Building a separate model for each artist is still a viable option, but it comes with problems (we need enough training data for each artist), so a pooled model could be really useful in cases where we don't have enough data for an artist.
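For a like-for-like comparison of the two approaches, both models should be scored on the same held-out rows. A minimal sketch on synthetic data follows; the frame `toy`, its columns, and the helper `fit_and_score` are assumptions for illustration, and the resulting scores say nothing about the real auction data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic pooled data with an indicator for the artist we validate on.
rng = np.random.RandomState(0)
n = 600
toy = pd.DataFrame({
    'is_target_artist': rng.randint(0, 2, size=n),
    'estimate_high': rng.uniform(1, 100, size=n),
    'auction_year': rng.randint(2010, 2019, size=n),
})
toy['price'] = toy['estimate_high'] * (1 + 0.2 * toy['is_target_artist']) + rng.normal(scale=2, size=n)

features = ['is_target_artist', 'estimate_high', 'auction_year']

# Same held-out set for both models: the target artist's sales from 2017 onward.
test_mask = (toy['is_target_artist'] == 1) & (toy['auction_year'] >= 2017)
test = toy[test_mask]
pooled_train = toy[~test_mask]
single_train = pooled_train[pooled_train['is_target_artist'] == 1]

def fit_and_score(train):
    """Fit a forest on `train` and return R-squared on the shared test set."""
    model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=0)
    model.fit(train[features], train['price'])
    return model.score(test[features], test['price'])

pooled_r2 = fit_and_score(pooled_train)
single_r2 = fit_and_score(single_train)
```

Shrinking `single_train` (for example with `single_train.sample(20, random_state=0)`) is one way to probe the case where an artist has little data of their own, which is where the pooled model is expected to help most.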