A pipeline that extracts data from the opensea.io REST API and models Cryptovoxels virtual home prices using XGBoost or similar frameworks.
This is an excerpt from an apt summary of the Cryptovoxel community: https://www.buildblockchain.tech/newsletter/issues/no-101-the-weird-world-of-cryptovoxels
TLDR:
"The browser-based game allows you to explore a sprawling three dimensional world built with chunky, pixel-like blocks commonly referred to as "voxels." Minecraft is the most famous voxel game, and its success popularized a whole genre of blocky free-play worlds. Like Minecraft, Cryptovoxels lets you explore and build without any strict goal or directive. Where Cryptovoxels differs from a game like Minecraft is in its integration of— you guessed it— crypto! While anyone can explore the Cryptovoxel world freely, to build in the world you have to own property. Ownership is tracked through NFTs on the Ethereum network."
Upon initial analysis of the target (sale price), we notice that the distribution is right-skewed:
Modeling will be performed on the original target and also tested on a normalized target. Below is the distribution after normalization:
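The normalization method is not stated here, so the following is only a minimal sketch assuming a log transform (`log1p`), a common choice for right-skewed prices. The sale prices and the `skewness` helper are hypothetical and exist only for illustration.

```python
import math

def skewness(values):
    """Population skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5

# Hypothetical sale prices in ETH; a few large sales create the right tail.
prices_eth = [0.8, 1.1, 1.3, 1.5, 2.0, 2.4, 3.0, 4.5, 12.0, 60.0]

# Assumption: log1p(x) = log(1 + x) is used as the normalizing transform.
log_prices = [math.log1p(p) for p in prices_eth]

# expm1 inverts log1p, so predictions can be converted back to ETH.
restored = [math.expm1(v) for v in log_prices]

print(f"skew before: {skewness(prices_eth):.2f}")
print(f"skew after:  {skewness(log_prices):.2f}")
```

If the model is trained on the transformed target, its predictions need the inverse transform before errors are reported in ETH.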
Pearson correlations show that none of the initial predictors are highly correlated with one another. Looking at the same predictors against the target (Sale Price), building height and plot size have the strongest correlations.
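As a rough illustration of how such correlations can be computed, here is a dependency-free sketch of the Pearson coefficient applied to each predictor against the target. The parcel values and the `sale_price_eth` column name are made up for the example; only the `cv_*` field names come from this project.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical parcels (values invented for illustration).
parcels = {
    "cv_plotSize_m_sq": [60, 120, 200, 340, 500],
    "cv_OCdistance_m": [150, 900, 400, 1200, 300],
    "cv_buildHeight_m": [8, 10, 14, 20, 26],
    "sale_price_eth": [1.2, 1.5, 2.8, 3.9, 6.5],
}

target = "sale_price_eth"
corr_with_target = {
    name: pearson(values, parcels[target])
    for name, values in parcels.items()
    if name != target
}

# Rank predictors by absolute correlation with the target.
for name, r in sorted(corr_with_target.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name}: {r:+.3f}")
```

The same `pearson` function can be called on pairs of predictors to check them against each other.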
Below are the features extracted from the raw JSON data:
- 'cv_plotSize_m_sq': Plot size in square meters
- 'cv_OCdistance_m': Distance from Origin City in meters
- 'cv_buildHeight_m': Build height in meters
- 'cv_floor_elev_m': Base floor elevation in meters
- 'neighborhood': Text neighborhood name
- 'near_to': Sub neighborhoods that location is near
'Neighborhood' and 'Near To' are categorical fields and must be encoded for XGBoost consumption. The following encoding methods are explored:
- One Hot Encoding (a new binary field for every distinct value of the field)
- Categorical Encoding (each distinct value of the field is converted to a unique numeric code)
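The two encodings listed above can be sketched in plain Python as follows. The neighborhood names are placeholders, and `one_hot` and `ordinal_codes` are hypothetical helpers written for this example, not functions from the pipeline.

```python
def one_hot(values, prefix):
    """One binary field per distinct value; exactly one is 1 in each row."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]

def ordinal_codes(values):
    """Map each distinct value to a unique integer code."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping

# Placeholder neighborhood names (not real data).
neighborhoods = ["Alpha", "Beta", "Alpha", "Gamma"]

one_hot_rows = one_hot(neighborhoods, prefix="neighborhood")
codes, code_map = ordinal_codes(neighborhoods)

print(one_hot_rows[0])
print(codes, code_map)
```

One design consideration: integer codes impose an arbitrary ordering on the categories, while one-hot fields avoid that at the cost of a wider feature matrix.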
Here's a fun Tableau viz of training data: https://public.tableau.com/profile/matt.wheeler#!/vizhome/CryptovoxelData/CryptovoxelVirtualHomes
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. I am using an XGBoost regressor to predict Cryptovoxels home sale prices using features such as Plot Size and Distance from Origin City.
Initial tuning jobs yield the following results, comparing the Neighborhood field as One Hot Encoded and as Categorical Encoded:
One hot encoding seems to perform slightly better.
To reduce over-prediction, we can remove outliers with prices greater than 50 ETH and implement early stopping. These are the resulting predictions:
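Both steps can be illustrated without any library. The sketch below filters out sales above the 50 ETH threshold and applies a generic early-stopping rule to a validation curve; the sales, the RMSE values, and the `best_round` helper are all invented for illustration. XGBoost has its own `early_stopping_rounds` option, so this only shows the idea behind the rule.

```python
MAX_PRICE_ETH = 50.0  # threshold taken from the text above

# Hypothetical sales records.
sales = [
    {"id": 1, "price_eth": 2.5},
    {"id": 2, "price_eth": 75.0},
    {"id": 3, "price_eth": 50.0},
    {"id": 4, "price_eth": 120.0},
]

# Keep only sales at or below the threshold.
kept = [sale for sale in sales if sale["price_eth"] <= MAX_PRICE_ETH]

def best_round(val_rmse, patience):
    """Return the index of the best round, stopping once `patience`
    consecutive rounds pass without a new minimum validation RMSE."""
    best_idx = 0
    for i, score in enumerate(val_rmse):
        if score < val_rmse[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx

# Hypothetical validation RMSE per boosting round.
val_rmse = [3.0, 2.4, 2.1, 2.0, 2.05, 2.02, 2.1, 1.9]
stop_at = best_round(val_rmse, patience=3)

print([sale["id"] for sale in kept])
print(stop_at)
```

Note that the rule stops before ever seeing the last value in `val_rmse`; a larger `patience` trades longer training for a lower chance of stopping too early.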
The RMSE/Mean finally snuck under 1.0! Diving deeper into the model results, we can see that across different property sizes the RMSE/Mean is scattered and is not consistently worse for any particular size bucket (this is good). We can also see the worst-performing neighborhoods; it may be possible to remedy this with other features.
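The metric is not defined explicitly here, so the sketch below assumes RMSE/Mean means the root mean squared error divided by the mean of the actual sale prices, and shows how it could be broken down by neighborhood. The rows and group names are hypothetical.

```python
import math
from collections import defaultdict

def rmse_over_mean(actual, predicted):
    """RMSE divided by the mean of the actual values (assumed definition)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(squared_errors) / len(actual))
    return rmse / (sum(actual) / len(actual))

# Hypothetical rows: (neighborhood, actual price in ETH, predicted price in ETH).
rows = [
    ("Alpha", 2.0, 2.5),
    ("Alpha", 4.0, 3.5),
    ("Beta", 1.0, 3.0),
    ("Beta", 3.0, 1.0),
]

# Collect actual and predicted values per group.
grouped = defaultdict(lambda: ([], []))
for group, actual, predicted in rows:
    grouped[group][0].append(actual)
    grouped[group][1].append(predicted)

scores = {group: rmse_over_mean(a, p) for group, (a, p) in grouped.items()}

# Worst-performing groups first.
for group, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {score:.3f}")
```

The same grouping works for property size buckets by replacing the neighborhood label with a bucket label.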
This is quite reasonable given the limited features, but further scope analysis and feature exploration should produce an even more efficient model.
More comparative analysis should be performed, and additional data sources should be added for the Crypto model. Scope analysis in terms of outlier removal, variable transformation, and target smoothing may also enhance results.