# Arthena Data Science Challenge - Part 1.1

Author: Manksh Gupta

______________________________________________________________________________________________

*Review the column headers in the csv files. Please list some assumptions about which features you think will be the most and least important (please remember to commit this **before** proceeding to Part 1.2). Some things to consider: How do you plan to handle categorical data? Different currencies? Can you assume this dataset is i.i.d. given that there is time-variance? How will you structure your work to take advantage of the non i.i.d. properties of this dataset?*

______________________________________________________________________________________________

## Schema

Field Name | Description
-----------|-------------
artist_name | Represents the artist name.
artist_nationality | Represents the artist's country of origin.
artist_birth_year | Represents the artist's year of birth.
artist_death_year	| Represents the artist's year of death. `-1` if still living.
auction_house	| Name of auction house.
auction_sale_id	| A unique id associated with the auction given by the auction house (ex. `PF1846`).
auction_department	| The type of department an auction belongs to. Note: These are synthetic departments. We map auction house departments across auction houses into a common taxonomy (ex. `Post-War & Contemporary`, `-1` if unknown).
auction_location | Represents the city where the auction takes place.
auction_date | Represents the datetime when the auction started. Auction houses may report the date and time of the auction or only the date (ex. `2017-03-01 00:00:00` or `2016-11-18 15:00:00.489`. Ignore trailing milliseconds).
auction_currency | ISO currency code representing denomination used in the auction (ex. `USD`).
exchange_rate_to_usd | Exchange rate from original currency to USD on auction date.
auction_lot_count	| Represents the number of lots in the auction (ex. `65`, `-1` if unknown).
lot_id | Unique lot id given by the auction house. These are alphanumeric (ex. `601` or `32T`)
lot_place_in_auction | The order in which the lot was offered during the auction.
lot_description	| A json, html, or natural language description of the lot from the auction house website. These have not been cleaned by us.
lot_link | Link to auction lot listing. These are occasionally null.
work_title | Title of the work. These are occasionally messy and may contain dates. Many works are untitled.
work_medium | Medium of the work, such as `painting`, `drawing`, or `photograph`.
work_execution_year | Year when the work was completed (ex. `1982`, `-1` if unknown).
work_dimensions | Unstructured text representing work dimensions (ex. `11 1/4 x 11 1/4 in. (28.6 x 28.6 cm.)`, `-1` if unknown).
work_height | Approximate height of the lot (`-1` if unknown).
work_width | Approximate width of the lot (`-1` if unknown).
work_depth | Approximate depth of the lot (`-1` if 2D or unknown).
unit | Unit the measurements are in (ex. `cm`, `-1` if unknown).
hammer_price | The price at which the lot sold at auction. If the lot was not sold, `-1` is recorded.
buyers_premium | The fee the auction house charges the buyer on top of the hammer price. Note that this is calculated after the lot is sold and should not be a feature of your model. `0` if the lot is unsold.
estimate_low | A low estimate of the hammer price provided by the auction house prior to auction.
estimate_high | A high estimate of the hammer price provided by the auction house prior to auction.

The schema provides a lot of information about the artworks in question. We have over 20 features describing the artwork, the prices, the timestamps, etc. Good-quality features make or break any model, so it makes sense to spend enough time engineering predictive features; otherwise it is simply garbage in, garbage out. Feature engineering here means first removing or imputing unknown values, then dealing with outliers, and finally hypothesizing about the predictive power of each feature.

Below, I describe the features that I think will be most and least important after inspecting the data files, and I state the assumptions I will follow in further analysis. Please note that these are, after all, a priori assumptions, and things may change during the actual analysis depending on what works better in practice.

______________________________________________________________________________________________

Everything that follows assumes the data for all three artists is concatenated together. When predicting for a single artist, some features such as name and lifespan may not be relevant, since they would be constant across that artist's dataset.

#### 1) 

First, as described in the problem, we cannot use future data to predict the past, so we need to define what the present is. This is needed to validate the model at later stages and to split the data into train/validation/test sets in a way that keeps the model robust. It is not clear from the data what the present date is, so I train on data points before 2017 and test on points after this date. For auctions in the future, I can still give predictions, but it is not possible to evaluate them because we don't know what the lots will actually sell for.
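A minimal sketch of the date-based split described above, assuming the data is loaded into a pandas DataFrame. The helper names `parse_auction_date` and `time_split`, and the exact cutoff `2017-01-01`, are illustrative assumptions rather than part of the challenge. Timestamps are truncated to 19 characters so that trailing milliseconds are ignored, as the schema suggests.

```python
import pandas as pd


def parse_auction_date(raw: pd.Series) -> pd.Series:
    """Parse auction_date strings, ignoring trailing milliseconds."""
    # Keep only "YYYY-MM-DD HH:MM:SS"; anything that does not match becomes NaT.
    trimmed = raw.astype(str).str.slice(0, 19)
    return pd.to_datetime(trimmed, format="%Y-%m-%d %H:%M:%S", errors="coerce")


def time_split(df: pd.DataFrame, cutoff: str = "2017-01-01"):
    """Split lots chronologically: train strictly before the cutoff, test from it onward."""
    dates = parse_auction_date(df["auction_date"])
    cutoff_ts = pd.Timestamp(cutoff)
    # Rows whose date could not be parsed fall into neither split.
    train = df[dates < cutoff_ts]
    test = df[dates >= cutoff_ts]
    return train, test
```

Splitting on the date rather than shuffling keeps every test lot later than every training lot, which is the property the paragraph above relies on.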

Also note that the hammer price will first be converted to USD for consistency using the exchange rate (`exchange_rate_to_usd`). The final column that we predict is then `adjusted_hammer_price`. If I have enough time, I would also like to adjust this hammer price for inflation, which requires inflation data from outside sources.
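The currency conversion could look roughly like the sketch below. It assumes that `exchange_rate_to_usd` is a multiplicative factor (USD per one unit of the auction currency), which the schema does not state explicitly, and it treats the `-1` sentinel for unsold lots as missing. The function name `add_usd_hammer_price` is hypothetical, and inflation adjustment is left out because it needs external data.

```python
import pandas as pd


def add_usd_hammer_price(df: pd.DataFrame) -> pd.DataFrame:
    """Add adjusted_hammer_price: the hammer price expressed in USD."""
    out = df.copy()
    # -1 marks an unsold lot, so mask it instead of converting it like a price.
    sold_price = out["hammer_price"].where(out["hammer_price"] >= 0)
    # Assumption: the rate multiplies the local-currency amount to give USD.
    out["adjusted_hammer_price"] = sold_price * out["exchange_rate_to_usd"]
    return out
```

Unsold lots end up with a missing target, so they can be dropped or modeled separately before training.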

#### 2) Price Based

From inspection and basic statistics, `estimate_high` appears to have a very high correlation with the hammer price. `estimate_low` also has a high correlation, but that may be due to its correlation with `estimate_high`. Correlation does not imply causation; however, the high estimate is based on analyses by industry experts, so its high correlation with our target variable is a good start. The high estimate will also be inflation-adjusted in the same way as the hammer price.
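One way to check these correlations is sketched below, under the assumption that unsold lots (hammer price `-1`) are excluded first. The function name `estimate_correlations` is hypothetical, and the sketch works on the raw columns, so it is only meaningful once prices share a currency.

```python
import pandas as pd


def estimate_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix of the estimates and the hammer price, sold lots only."""
    # Unsold lots carry hammer_price == -1 and would distort the correlation.
    sold = df[df["hammer_price"] > 0]
    return sold[["estimate_low", "estimate_high", "hammer_price"]].corr()
```

Reading the `estimate_low` / `estimate_high` cell of the result shows how strongly the two estimates move together, which is the concern raised above.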

The most important features that I would include in the model are `estimate_high` and the difference between `estimate_high` and `estimate_low`. The difference gives a good picture of how confident the auction house is in the price range it has published. This will be called `high_low_diff`.
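As a small sketch, the spread feature is a single subtraction. The function name `add_high_low_diff` is hypothetical, and the sketch works in whatever currency the estimates are recorded in; converting both estimates to USD first would be a separate step.

```python
import pandas as pd


def add_high_low_diff(df: pd.DataFrame) -> pd.DataFrame:
    """Add high_low_diff: the width of the auction house's estimate range."""
    out = df.copy()
    out["high_low_diff"] = out["estimate_high"] - out["estimate_low"]
    return out
```

A narrow range relative to the high estimate would suggest a confident estimate, and a wide one the opposite.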

#### 3) Artist Based

The birth and death years of the artist are very important quantities to look at; however, our dataset has only 3 different artists, i.e. at most 6 distinct years. We can, however, get a little clever with this feature. I would first include `alive_at_auction` (relative to when the auction took place) as a feature. If this feature is false, I think the number of days between the artist's death and the auction date would be important; this transformed feature would be called `death_auction_diff` and could be quite telling. There is a small problem with this feature: if the artist is not dead, we cannot compute it. In that case, I would fill in a placeholder value such as 0.

In the data that we have, all three artists are dead, so this works. This feature also helps with the time component of the data, since it expresses the auction date relative to a fixed reference year for each artist.
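The two artist features could be derived roughly as follows. Because the schema only provides the year of death, this sketch measures `death_auction_diff` in years rather than days, which is a simplification of the description above. The function name `add_artist_features` is hypothetical.

```python
import pandas as pd


def add_artist_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add alive_at_auction and death_auction_diff (in years)."""
    out = df.copy()
    # The first four characters of auction_date are the year.
    auction_year = pd.to_numeric(
        out["auction_date"].astype(str).str.slice(0, 4), errors="coerce"
    )
    death_year = out["artist_death_year"]
    # -1 means still living; an auction held before the death year also counts as alive.
    alive = (death_year == -1) | (death_year > auction_year)
    out["alive_at_auction"] = alive
    # Placeholder 0 when the artist was alive at the auction, as proposed above.
    out["death_auction_diff"] = (auction_year - death_year).where(~alive, 0)
    return out
```

Note that the placeholder 0 is indistinguishable from an auction in the year of death, which is one reason to keep `alive_at_auction` alongside it.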

`artist_name` is an important categorical feature. It takes 3 values and must be one-hot encoded to be included in the model. Whether including it gives better predictions remains to be seen and will be checked experimentally in later sections.

#### 4) Auction Based

`auction_department` is a categorical feature that should be included in the model as a one-hot-encoded variable. The assumption is that different departments attract different audiences and thus may affect the price.

Along with this, `auction_lot_count` and `lot_place_in_auction` also seem important. I would combine them into a single feature, the ratio of `lot_place_in_auction` to `auction_lot_count`, called `relative_lot_position`. This feature may or may not turn out to be useful, but a priori it seems like a good addition to the model. One hypothesis is that a piece sold at the end of an auction is the 'masterpiece of the auction' and may fetch a higher price. Another is that when an auction has many lots and the piece comes first, people are willing to spend less because they are waiting for later pieces. Either way, this feature can add value to the model.
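A sketch of the ratio, assuming the `-1` sentinel for an unknown lot count should produce a missing value rather than a negative ratio. The function name `add_relative_lot_position` is hypothetical.

```python
import pandas as pd


def add_relative_lot_position(df: pd.DataFrame) -> pd.DataFrame:
    """Add relative_lot_position = lot_place_in_auction / auction_lot_count."""
    out = df.copy()
    # auction_lot_count is -1 when unknown; mask it so the ratio becomes NaN.
    lot_count = out["auction_lot_count"].where(out["auction_lot_count"] > 0)
    out["relative_lot_position"] = out["lot_place_in_auction"] / lot_count
    return out
```

Values near 0 correspond to lots offered early in the sale and values near 1 to lots offered at the end.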

`auction_date` is not included explicitly; however, it is included relative to when the artist died.

There may be some seasonality in auction prices, so I include `auction_month` as a feature in the model. This also helps account for time variance.
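Extracting the month could be sketched as below. The timestamps are truncated to 19 characters before parsing so that trailing milliseconds are ignored; the function name `add_auction_month` is hypothetical.

```python
import pandas as pd


def add_auction_month(df: pd.DataFrame) -> pd.DataFrame:
    """Add auction_month (1-12) parsed from auction_date."""
    out = df.copy()
    # Keep "YYYY-MM-DD HH:MM:SS" only; values that do not match become NaT.
    trimmed = out["auction_date"].astype(str).str.slice(0, 19)
    stamps = pd.to_datetime(trimmed, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    out["auction_month"] = stamps.dt.month
    return out
```

Whether the month is then used as a number or as a categorical variable is a modeling choice left open here.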



#### 5) Work Based

I would first have a feature `is_2d`, which would differentiate between paintings and sculptures. I would then calculate the area of the artwork as `work_area` to include in the model. For 3D art, it would be the volume of the piece (height * width * depth) instead, as it would be very hard to calculate the surface area of sculptures given their many different shapes. Unknown values are replaced with the average. This is, however, very hard to do: the unit of measurement is missing for a majority of lots, and sizes recorded in different units cannot be compared directly. I also try calculating the `aspect_ratio` of a 2D artwork.
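The size features might be sketched as follows, with several assumptions made loudly: a missing or `-1` depth is read as "2D" (the schema uses `-1` for both 2D and unknown), `aspect_ratio` is taken as width divided by height, the average is computed separately for 2D and 3D works, and no unit conversion is attempted. The function name `add_size_features` is hypothetical.

```python
import pandas as pd


def add_size_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add is_2d, work_area (area for 2D, volume for 3D) and aspect_ratio."""
    out = df.copy()
    # -1 marks unknown dimensions; mask them so they do not enter the products.
    height = out["work_height"].where(out["work_height"] > 0)
    width = out["work_width"].where(out["work_width"] > 0)
    depth = out["work_depth"].where(out["work_depth"] > 0)

    # Assumption: no usable depth means the work is treated as 2D.
    out["is_2d"] = depth.isna()
    # Area for 2D works, volume for 3D works.
    out["work_area"] = (height * width).where(out["is_2d"], height * width * depth)
    # Assumption: aspect ratio is width / height, defined for 2D works only.
    out["aspect_ratio"] = (width / height).where(out["is_2d"])

    # Replace unknown sizes with the mean of the same group (2D or 3D).
    group_mean = out.groupby("is_2d")["work_area"].transform("mean")
    out["work_area"] = out["work_area"].fillna(group_mean)
    return out
```

Averaging within the 2D and 3D groups avoids mixing areas with volumes, but it does not solve the mixed-unit problem noted above.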

`work_medium` would be included as a one-hot-encoded feature in the model. 
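One-hot encoding of the three categorical columns discussed so far (`artist_name`, `auction_department`, `work_medium`) can be sketched with `pandas.get_dummies`. Mapping the `-1` sentinel to an explicit `unknown` label is an assumption of this sketch, and the function name `one_hot_encode` is hypothetical.

```python
import pandas as pd

CATEGORICAL_COLUMNS = ["artist_name", "auction_department", "work_medium"]


def one_hot_encode(df: pd.DataFrame, columns=CATEGORICAL_COLUMNS) -> pd.DataFrame:
    """Replace each categorical column with one indicator column per category."""
    out = df.copy()
    for col in columns:
        # -1 marks unknown values; give it a readable category of its own.
        out[col] = out[col].astype(str).replace("-1", "unknown")
    # Indicator columns are named "<column>_<category>".
    return pd.get_dummies(out, columns=list(columns))
```

With a time-based split, a category may appear only in the test period, so the encoded test columns would need to be aligned with the training columns before prediction.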

Finally, `work_execution_year` will be included as the number of years between the work's completion and its auction, as `execution_auction_diff`. I will also look into including it as the number of years after the artist's birth, as a proxy for where the artist was in their career when the work was completed. This would be `execution_birth_diff`.
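Both year differences can be sketched in a few lines, assuming the `-1` sentinel for an unknown execution year should yield missing values. The function name `add_execution_features` is hypothetical.

```python
import pandas as pd


def add_execution_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add execution_auction_diff and execution_birth_diff, both in years."""
    out = df.copy()
    # The first four characters of auction_date are the year.
    auction_year = pd.to_numeric(
        out["auction_date"].astype(str).str.slice(0, 4), errors="coerce"
    )
    # work_execution_year is -1 when unknown; mask it so both differences become NaN.
    execution_year = out["work_execution_year"].where(out["work_execution_year"] > 0)
    out["execution_auction_diff"] = auction_year - execution_year
    out["execution_birth_diff"] = execution_year - out["artist_birth_year"]
    return out
```

For example, a work executed in 1982 by an artist born in 1928 and auctioned in 2016 gets `execution_auction_diff = 34` and `execution_birth_diff = 54`.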