## Details 

The file data.csv contains the data needed to build your models and has the following columns:

- price: in US dollars [TARGET]
- carat: weight of the diamond
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm
- y: width in mm
- z: depth in mm
- depth: total depth percentage = 100 * z / mean(x, y) = 100 * 2 * z / (x + y)
- table: width of top of diamond relative to widest point
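Since `cut`, `color` and `clarity` are described as ordered scales, one option is to map them to integer ranks before modelling. The following is a minimal sketch, not a prescribed preprocessing step: the helper name `encode_ordinal` and the constant names are hypothetical, the intermediate colour letters are assumed to run alphabetically between J and D, and the `cut` list is assumed to be written from worst to best.

```python
import pandas as pd

# Orderings taken from the column descriptions, written worst to best.
# Assumption: the cut list is ordered worst to best, and colours between
# J and D follow reverse alphabetical order.
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def encode_ordinal(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with cut, color and clarity mapped to integer ranks (0 = worst)."""
    out = df.copy()
    for column, order in [
        ("cut", CUT_ORDER),
        ("color", COLOR_ORDER),
        ("clarity", CLARITY_ORDER),
    ]:
        ranks = {label: rank for rank, label in enumerate(order)}
        out[column] = out[column].map(ranks)
    return out


# Two rows with categorical values as they appear in data.csv, for illustration.
sample = pd.DataFrame(
    {"cut": ["Ideal", "Good"], "color": ["G", "F"], "clarity": ["VVS2", "SI1"]},
    index=[0, 1],
)
encoded = encode_ordinal(sample)
```

Labels missing from the orderings would become `NaN`, which makes unexpected categories easy to spot. Whether integer ranks or one-hot columns work better depends on the model, so it is worth trying both.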

The file **predict.csv** contains the same columns, except for the `price` column, which it is your task to predict. The file sample_submission.csv contains an example of the format your submission must follow.

Attention! The index values in the submission must be the same as those in `predict.csv`, and every element must be present. In addition to the index, the submission must contain a `price` column with the predictions.
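As a rough illustration of these requirements, the sketch below pairs a list of predictions with the index of the frame to predict. The helper `build_submission`, the toy frame and the output file name are assumptions for the example. The exact header layout should be checked against sample_submission.csv.

```python
import numpy as np
import pandas as pd


def build_submission(predict_df: pd.DataFrame, predictions) -> pd.DataFrame:
    """Pair predictions with the index of predict_df in a single `price` column."""
    predictions = np.asarray(predictions, dtype=float)
    if len(predictions) != len(predict_df):
        raise ValueError("one prediction per row of predict.csv is required")
    return pd.DataFrame({"price": predictions}, index=predict_df.index)


# Toy stand-in for predict.csv; real code would pass the loaded frame instead.
toy_predict = pd.DataFrame(
    {"carat": [0.3, 1.0, 0.7]},
    index=pd.Index([0, 1, 2], name="index"),
)
submission = build_submission(toy_predict, [500.0, 4800.0, 2700.0])

# Uncomment to write the index plus the price column to disk
# (hypothetical file name):
# submission.to_csv("submission.csv")
```

Reusing the index of the input frame, rather than rebuilding it, avoids accidental reordering or dropped rows.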


## Tools

You can, and should, try different models, parameters, and data preparation steps. The sklearn documentation will be your best friend:

- [Pre Processing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- [Supervised Learning](https://scikit-learn.org/stable/supervised_learning.html)
- [Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

Note: the metric used in this competition is RMSE.
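RMSE (root mean squared error) is the square root of the mean of the squared differences between true and predicted prices. A minimal NumPy sketch follows. The function name `rmse` and the predicted values are made up for illustration. The true prices are three values from data.csv.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# True prices from the data and hypothetical predictions.
score = rmse([516, 622, 2762], [500, 700, 2700])
```

Because the errors are squared before averaging, a few badly mispredicted expensive diamonds can dominate the score, which is one reason to inspect outliers in the data.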

References:

- [IGS - Measurements](https://www.gemsociety.org/article/diamond-measurements/)
- [The Diamond Pro - Clarity](https://www.diamonds.pro/education/clarity/)
- [The Diamond Pro - Proportions](https://www.diamonds.pro/guides/diamond-proportion/)
- [Loose Diamond - Cuts](https://www.loosediamondsreviews.com/diamondcut.html)
- [Beyond - Colors](https://beyond4cs.com/color/)

In [1]:
# Data handling and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import patches

# Models, preprocessing, and evaluation
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge, SGDRegressor
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

# Training data (with price) and the rows to predict (without price)
data = pd.read_csv("data/data.csv", index_col=0)
data_predict = pd.read_csv("data/predict.csv", index_col=0)

In [2]:
data

index,carat,cut,color,clarity,depth,table,x,y,z,price
0,1.01,Ideal,G,VVS2,60.6,57.0,6.54,6.50,3.95,7167
1,0.31,Good,F,SI1,63.5,56.0,4.30,4.33,2.74,516
2,1.02,Premium,D,SI2,59.5,62.0,6.56,6.52,3.89,4912
3,0.27,Ideal,E,VVS2,62.0,55.0,4.12,4.14,2.56,622
4,0.70,Very Good,F,VS2,61.7,63.0,5.64,5.61,3.47,2762
...,...,...,...,...,...,...,...,...,...,...
37753,1.51,Very Good,E,VS2,63.2,56.0,7.28,7.22,4.58,13757
37754,0.41,Premium,J,VS1,62.0,55.0,4.77,4.74,2.95,830
37755,0.32,Very Good,E,VVS2,61.6,54.0,4.43,4.46,2.74,816
37756,0.38,Good,G,VS2,58.8,62.0,4.68,4.71,2.76,771


In [10]:
data[data['depth'] > 71]

index,carat,cut,color,clarity,depth,table,x,y,z,price
4028,0.99,Fair,J,I1,73.6,60.0,6.01,5.8,4.35,1789
4151,0.5,Fair,E,VS2,79.0,73.0,5.21,5.18,4.09,2579
19352,0.85,Fair,H,I1,71.2,54.0,5.77,5.65,4.07,1274
28144,0.5,Fair,E,VS2,79.0,73.0,5.21,5.18,4.09,2579
29710,1.03,Fair,E,I1,78.2,54.0,5.72,5.59,4.42,1262


In [11]:
data['cut'].unique()

array(['Ideal', 'Good', 'Premium', 'Very Good', 'Fair'], dtype=object)

In [12]:
data['clarity'].unique()

array(['VVS2', 'SI1', 'SI2', 'VS2', 'VS1', 'IF', 'VVS1', 'I1'],
      dtype=object)

In [13]:
data['color'].unique()

array(['G', 'F', 'D', 'E', 'I', 'J', 'H'], dtype=object)

In [15]:
data[data['table'] < 50]

index,carat,cut,color,clarity,depth,table,x,y,z,price
11046,0.3,Fair,E,SI1,64.5,49.0,4.28,4.25,2.75,630
19529,1.0,Fair,I,VS1,64.0,49.0,6.43,6.39,4.1,3951
25639,0.29,Very Good,E,VS1,62.8,44.0,4.2,4.24,2.65,474


In [23]:
data[data['carat'] > 5.3]

index,carat,cut,color,clarity,depth,table,x,y,z,price


In [25]:
data[data['x'] > 10]

index,carat,cut,color,clarity,depth,table,x,y,z,price
13757,4.01,Premium,I,I1,61.0,61.0,10.14,10.1,6.17,15223
18753,4.01,Premium,J,I1,62.5,62.0,10.02,9.94,6.24,15223
22358,5.01,Fair,J,I1,65.5,59.0,10.74,10.54,6.98,18018
29253,4.0,Very Good,I,I1,63.3,58.0,10.01,9.94,6.31,15984


In [26]:
data[data['y'] > 10]

index,carat,cut,color,clarity,depth,table,x,y,z,price
13757,4.01,Premium,I,I1,61.0,61.0,10.14,10.1,6.17,15223
17000,2.0,Premium,H,SI2,58.9,57.0,8.09,58.9,8.06,12210
22358,5.01,Fair,J,I1,65.5,59.0,10.74,10.54,6.98,18018
31002,0.51,Ideal,E,VS1,61.8,55.0,5.15,31.8,5.12,2075


In [27]:
data[data['z'] > 10]

index,carat,cut,color,clarity,depth,table,x,y,z,price
7427,0.51,Very Good,E,VS1,61.8,54.7,5.12,5.15,31.8,1970


In [36]:
# Recompute depth for index 7427 (z = 31.8); the recorded depth is 61.8
100 * (2 * 31.8) / (5.12 + 5.15)

619.2794547224927

In [35]:
# Recompute depth for index 22358; the recorded depth is 65.5
100 * (2 * 6.98) / (10.54 + 10.74)

65.6015037593985

In [37]:
# Recompute depth for index 17000 (y = 58.9); the recorded depth is 58.9
100 * (2 * 8.06) / (8.09 + 58.90)

24.06329302881027
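The three manual checks above can be generalized to every row by recomputing depth from `x`, `y` and `z` and comparing it with the recorded value. This is only a sketch: the helper name `flag_depth_mismatch` and the tolerance of 1.0 percentage points are assumptions, and the toy frame holds just the three rows checked above.

```python
import pandas as pd


def flag_depth_mismatch(df: pd.DataFrame, tolerance: float = 1.0) -> pd.DataFrame:
    """Return rows whose recorded depth differs from 100 * 2z / (x + y) by more than tolerance."""
    computed = 100 * 2 * df["z"] / (df["x"] + df["y"])
    mismatch = (computed - df["depth"]).abs() > tolerance
    return df.loc[mismatch].assign(depth_computed=computed[mismatch])


# The three rows inspected manually above (indices 7427, 22358, 17000).
rows = pd.DataFrame(
    {
        "depth": [61.8, 65.5, 58.9],
        "x": [5.12, 10.74, 8.09],
        "y": [5.15, 10.54, 58.9],
        "z": [31.8, 6.98, 8.06],
    },
    index=[7427, 22358, 17000],
)
suspicious = flag_depth_mismatch(rows)
```

On these rows the check flags 7427 and 17000, while 22358 passes because its recomputed depth is within about 0.1 of the recorded value. Rows where `x + y` is zero would need separate handling, since the ratio is undefined there.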