## Details 

The file `data.csv` contains the data needed to build your models. It has the following columns:

- price: in US dollars [TARGET]
- carat: weight of the diamond
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm
- y: width in mm
- z: depth in mm
- depth: total depth percentage = 100 * z / mean(x, y) = 200 * z / (x + y)
- table: width of top of diamond relative to widest point

The file **predict.csv** contains the same columns except `price`, which is the column you must predict. The file `sample_submission.csv` shows the format your submission must follow.

Note: the index values in your submission must match those in `predict.csv`, and every row must be present. Besides the index, the submission must contain a `price` column with the predictions.
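
A minimal sketch of assembling a submission that meets these requirements. The helper name `build_submission` and the toy index are hypothetical; in practice the index would come from `predict.csv` and the values from your fitted model.

```python
import numpy as np
import pandas as pd


def build_submission(pred_index: pd.Index, predictions) -> pd.DataFrame:
    """Return a frame with one `price` row per entry of `pred_index`."""
    predictions = np.asarray(predictions, dtype=float)
    if len(predictions) != len(pred_index):
        raise ValueError("one prediction is required for every row of predict.csv")
    return pd.DataFrame({"price": predictions}, index=pred_index)


# Toy stand-in for the index of predict.csv (hypothetical values).
toy_index = pd.Index([0, 1, 2], name="index")
submission = build_submission(toy_index, [512.0, 4310.5, 980.0])

# Writing the file keeps the index as the first column, followed by `price`:
# submission.to_csv("submission.csv")
```

The length check fails early if a row is missing, which is easier to debug than a rejected upload.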


## Tools

You can, and should, try different models, parameters, and data preparation steps. The sklearn documentation will be your best friend:

- [Pre Processing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
- [Supervised Learning](https://scikit-learn.org/stable/supervised_learning.html)
- [Model Selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

Note: the metric used in this competition is RMSE.
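
RMSE is the square root of the mean squared difference between true and predicted prices. A small sketch of the computation with NumPy follows; the function name `rmse` and the sample values are illustrative only.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up prices, only to show the call.
example = rmse([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```

Because the errors are squared before averaging, a few large misses on expensive diamonds weigh more than many small ones.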

References:

- [IGS - Measurements](https://www.gemsociety.org/article/diamond-measurements/)
- [The Diamond Pro - Clarity](https://www.diamonds.pro/education/clarity/)
- [The Diamond Pro - Proportions](https://www.diamonds.pro/guides/diamond-proportion/)
- [Loose Diamond - Cuts](https://www.loosediamondsreviews.com/diamondcut.html)
- [Beyond - Colors](https://beyond4cs.com/color/)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import patches
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold

df = pd.read_csv("data/data.csv", index_col=0)
df_pred = pd.read_csv("data/predict.csv", index_col=0)

In [2]:
# Rows with x or y equal to 0, or any dimension above 11 mm
rows = df[(df['x'] == 0) | (df['y'] == 0) | (df['y'] > 11) | (df['z'] > 11) | (df['x'] > 11)].index
rows

Int64Index([47, 1839, 1872, 2353, 7427, 17000, 17917, 24489, 31002, 36330], dtype='int64', name='index')

In [3]:
df = df.drop(index=rows)

In [4]:
df[df['z'] == 0]

index,carat,cut,color,clarity,depth,table,x,y,z,price
1695,2.2,Premium,H,SI1,61.2,59.0,8.42,8.37,0.0,17265
11975,1.0,Premium,G,SI2,59.1,59.0,6.55,6.48,0.0,3142
12662,1.5,Good,G,I1,64.0,61.0,7.15,7.04,0.0,4731
14004,1.12,Premium,G,I1,60.4,59.0,6.71,6.67,0.0,2383
20146,1.01,Premium,H,I1,58.1,59.0,6.66,6.6,0.0,3167
21775,2.18,Premium,H,SI2,59.4,61.0,8.49,8.45,0.0,12631
23421,1.1,Premium,G,SI2,63.0,59.0,6.5,6.47,0.0,3696
26426,1.15,Ideal,G,VS2,59.2,56.0,6.88,6.83,0.0,5564
27653,1.01,Premium,F,SI2,59.2,58.0,6.5,6.47,0.0,3837


In [5]:
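def fill_z(row):
    """Estimate a missing z (recorded as 0) from depth, x and y."""
    if row['z'] == 0:
        # depth is a percentage of mean(x, y), so z = depth/100 * (x + y)/2
        return (row['depth'] / 100) * (row['x'] + row['y']) / 2
    return row['z']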

In [6]:
df['z'] =  df.apply(fill_z, axis=1)
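
A vectorized alternative to the row-wise `apply`, shown as a sketch. It assumes the same `x`, `y`, `z` and `depth` columns; the helper name `fill_missing_z` and the two-row toy frame are hypothetical.

```python
import pandas as pd


def fill_missing_z(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where z == 0 is replaced by depth/100 * mean(x, y)."""
    out = frame.copy()
    missing = out["z"] == 0
    estimate = out["depth"] / 100 * (out["x"] + out["y"]) / 2
    # Only overwrite the rows whose z was recorded as 0.
    out.loc[missing, "z"] = estimate[missing]
    return out


# Toy frame: the first row has a missing z, the second does not.
toy = pd.DataFrame({
    "x": [6.55, 4.30],
    "y": [6.48, 4.33],
    "z": [0.0, 2.74],
    "depth": [59.1, 63.5],
})
filled = fill_missing_z(toy)
```

Working on whole columns avoids calling a Python function once per row, which tends to matter on larger frames.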

In [7]:
df[df['z'] == 0]

index,carat,cut,color,clarity,depth,table,x,y,z,price


In [8]:
df['cut'].unique()

array(['Ideal', 'Good', 'Premium', 'Very Good', 'Fair'], dtype=object)

In [9]:
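# Ordinal encoding, worst (1) to best (5)
cut_map = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}

# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which does not update the frame under pandas copy-on-write
df['cut'] = df['cut'].map(cut_map)

# predict data
df_pred['cut'] = df_pred['cut'].map(cut_map)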

In [10]:
df['color'].unique()

array(['G', 'F', 'D', 'E', 'I', 'J', 'H'], dtype=object)

In [11]:
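# Ordinal encoding, J (worst, 1) to D (best, 7)
color_map = {'J': 1, 'I': 2, 'H': 3, 'G': 4, 'F': 5, 'E': 6, 'D': 7}

# Assign back rather than using inplace=True on a column selection
df['color'] = df['color'].map(color_map)

# predict data
df_pred['color'] = df_pred['color'].map(color_map)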

In [12]:
df['clarity'].unique()

array(['VVS2', 'SI1', 'SI2', 'VS2', 'VS1', 'IF', 'VVS1', 'I1'],
      dtype=object)

In [13]:
# Ordinal encoding, I1 (worst, 1) to IF (best, 8)
clarity_map = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4,
               'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}

# Assign back rather than using inplace=True on a column selection
df['clarity'] = df['clarity'].map(clarity_map)

# predict data
df_pred['clarity'] = df_pred['clarity'].map(clarity_map)
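
The three encodings above repeat the same pattern for the training and prediction frames. One possible consolidation is sketched below; `ORDINAL_LEVELS` and `encode_ordinals` are hypothetical names, and the level orderings are taken from the column descriptions at the top of this notebook.

```python
import pandas as pd

# Levels listed from worst to best, as described in the Details section.
ORDINAL_LEVELS = {
    "cut": ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    "color": ["J", "I", "H", "G", "F", "E", "D"],
    "clarity": ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"],
}


def encode_ordinals(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with each categorical column mapped to ranks 1..n."""
    out = frame.copy()
    for column, levels in ORDINAL_LEVELS.items():
        ranks = {label: rank for rank, label in enumerate(levels, start=1)}
        unknown = set(out[column].dropna()) - set(ranks)
        if unknown:
            raise ValueError(f"unexpected {column} labels: {sorted(unknown)}")
        out[column] = out[column].map(ranks)
    return out


# Toy frame with two diamonds (made-up rows).
toy = pd.DataFrame({
    "cut": ["Ideal", "Fair"],
    "color": ["D", "J"],
    "clarity": ["IF", "SI2"],
})
encoded = encode_ordinals(toy)
```

Applying one function to both `df` and `df_pred` keeps the two encodings from drifting apart, and the explicit check surfaces unexpected labels instead of silently producing missing values.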

In [14]:
df.head()

index,carat,cut,color,clarity,depth,table,x,y,z,price
0,1.01,5,4,6,60.6,57.0,6.54,6.5,3.95,7167
1,0.31,2,5,3,63.5,56.0,4.3,4.33,2.74,516
2,1.02,4,7,2,59.5,62.0,6.56,6.52,3.89,4912
3,0.27,5,6,6,62.0,55.0,4.12,4.14,2.56,622
4,0.7,3,5,4,61.7,63.0,5.64,5.61,3.47,2762


In [None]:
# One colour per cut level, used for the hue
shade = ["#835656", "#baa0a0", "#ffc7c8",
         "#a9a799", "#65634a"]
grid = sns.pairplot(df, hue="cut", palette=shade)

In [None]:
# First feature set: drop the target plus x and z
X = df.drop(columns=['price', 'x', 'z'])
y = df['price']

In [None]:
X

In [None]:
# Split the data into training and testing sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Random Forest (bagging of multiple decision trees)
# The default criterion is mean squared error; criterion='mse' was removed in scikit-learn 1.2
RegModel = RandomForestRegressor(max_depth=5, n_estimators=100)
# Suggested search range: max_depth 2-10, n_estimators 100-1000

# Print the model's parameters
print(RegModel)

# Fit on the training data and predict on the held-out split
RF = RegModel.fit(X_train, y_train)
prediction = RF.predict(X_test)



In [None]:
# RMSE
rmse_train = mse(y_train, RF.predict(X_train))**.5
rmse_test = mse(y_test, RF.predict(X_test))**.5

In [None]:
pd.DataFrame({
    "error_train":[rmse_train],
    "error_test":[rmse_test]
})

In [None]:
# Second feature set: keep every column except the target
y = df['price']
X = df.drop(columns=['price'])



In [None]:
X

In [None]:
### Standardization of data ###
#from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Choose either standardization or min-max normalization
# On this data, min-max normalization produced better results

#PredictorScaler=StandardScaler()
#PredictorScaler=MinMaxScaler()

# Storing the fit object for later reference
#PredictorScalerFit=PredictorScaler.fit(X)

# Generating the scaled values of X
#X=PredictorScalerFit.transform(X)

# Split the data into training and testing sets (70/30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
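
If scaling is switched back on, one option is to put the scaler inside a `Pipeline` so that it is fitted on the training split only, rather than on all of `X` before the split. The sketch below uses synthetic data in place of the diamonds frame; the toy arrays and step names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the features and price (not the real data).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# The scaler learns its min/max from X_tr only when the pipeline is fitted.
model = Pipeline([
    ("scale", MinMaxScaler()),
    ("forest", RandomForestRegressor(max_depth=5, n_estimators=20, random_state=0)),
])
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
```

Tree-based models such as random forests split on thresholds, so they are largely insensitive to this kind of rescaling; the pipeline is more likely to matter for the linear models and `SVR` imported in the first cell.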

In [None]:
# The default criterion is mean squared error; criterion='mse' was removed in scikit-learn 1.2
RegModel = RandomForestRegressor(max_depth=5, n_estimators=100)
# Suggested search range: max_depth 2-10, n_estimators 100-1000

# Print the model's parameters
print(RegModel)

# Fit on the training data and predict on the held-out split
RF = RegModel.fit(X_train, y_train)
prediction = RF.predict(X_test)

In [None]:
# RMSE
rmse_train = mse(y_train, RF.predict(X_train))**.5
rmse_test = mse(y_test, RF.predict(X_test))**.5

In [None]:
pd.DataFrame({
    "error_train":[rmse_train],
    "error_test":[rmse_test]
})
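
A single 70/30 split gives one estimate of the error. The `KFold` import in the first cell suggests cross-validation as a possible next step; the sketch below shows one way to get a per-fold RMSE, using synthetic data and hypothetical variable names rather than the diamonds frame.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the features and price (not the real data).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(150, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 0.1, size=150)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
forest = RandomForestRegressor(max_depth=5, n_estimators=20, random_state=0)

# scikit-learn maximizes scores, so RMSE is reported with a negative sign.
scores = cross_val_score(
    forest, X_toy, y_toy, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores
mean_rmse = rmse_per_fold.mean()
```

Comparing `mean_rmse` across feature sets or hyperparameters is usually more stable than comparing single-split test errors.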