## Features
- price: price in USD
- carat: weight of the diamond
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm
- y: width in mm
- z: depth in mm
- depth: total depth percentage
- table: width of top of diamond relative to widest point
- city: city where the diamonds is reported to be sold.
- id: only for test & sample submission files, id for prediction sample identification

## Info:
- D: incoloro
- E: incoloro
- F: Incoloro
- G: Casi incoloro
- H: Casi incolor 
- I: Casi inicoloro
- J: Casi incoloro
## Cut:calidad de cortye
x longitud
y altura
z profundidad interna


## Diamond examples:
### Diamond 1:
- Depth = 61'8
- Table = 58'0
- Price = 8497 Dollars
### Diamond 2:
- Depth = 67,8
- Table = Nan
- Price = 4809 Dollars


### Diamond 3:
- Depth = Nan
- Table = 66%
- Price = 4879 Dollars
    

Diamonds with the depth & table between 55 and 65 are more brilliant  

In [1]:
# imports

import sklearn as scikit_learn
#from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np
#from scipy.stats import trim_mean   # conda install scipy
#from statsmodels import robust      # conda install -c conda-forge statsmodels 
#import wquantiles                   # pip install wquantiles
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Ridge, Lasso

import seaborn as sns
import matplotlib.pylab as plt

# Data-extract

In [2]:
diamonds_train_df = pd.read_csv("..\data\diamonds_train.csv")
diamonds_test_df = pd.read_csv("..\data\diamonds_test.csv")

In [3]:
diamonds_test_df

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,city
0,0,0.79,Very Good,F,SI1,62.7,60.0,5.82,5.89,3.67,Amsterdam
1,1,1.20,Ideal,J,VS1,61.0,57.0,6.81,6.89,4.18,Surat
2,2,1.57,Premium,H,SI1,62.2,61.0,7.38,7.32,4.57,Kimberly
3,3,0.90,Very Good,F,SI1,63.8,54.0,6.09,6.13,3.90,Kimberly
4,4,0.50,Very Good,F,VS1,62.9,58.0,5.05,5.09,3.19,Amsterdam
...,...,...,...,...,...,...,...,...,...,...,...
13480,13480,0.57,Ideal,E,SI1,61.9,56.0,5.35,5.32,3.30,Amsterdam
13481,13481,0.71,Ideal,I,VS2,62.2,55.0,5.71,5.73,3.56,New York City
13482,13482,0.70,Ideal,F,VS1,61.6,55.0,5.75,5.71,3.53,Tel Aviv
13483,13483,0.70,Very Good,F,SI2,58.8,57.0,5.85,5.89,3.45,Surat


In [4]:
diamonds_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   price    40455 non-null  int64  
 1   carat    40455 non-null  float64
 2   city     40455 non-null  object 
 3   depth    40455 non-null  float64
 4   table    40455 non-null  float64
 5   x        40455 non-null  float64
 6   y        40455 non-null  float64
 7   z        40455 non-null  float64
 8   cut      40455 non-null  object 
 9   color    40455 non-null  object 
 10  clarity  40455 non-null  object 
dtypes: float64(6), int64(1), object(4)
memory usage: 3.4+ MB


In [5]:
diamonds_train_df

Unnamed: 0,price,carat,city,depth,table,x,y,z,cut,color,clarity
0,4268,1.21,Dubai,62.4,58.0,6.83,6.79,4.25,Premium,J,VS2
1,505,0.32,Kimberly,63.0,57.0,4.35,4.38,2.75,Very Good,H,VS2
2,2686,0.71,Las Vegas,65.5,55.0,5.62,5.53,3.65,Fair,G,VS1
3,738,0.41,Kimberly,63.8,56.0,4.68,4.72,3.00,Good,D,SI1
4,4882,1.02,Dubai,60.5,59.0,6.55,6.51,3.95,Ideal,G,SI1
...,...,...,...,...,...,...,...,...,...,...,...
40450,10070,1.34,Antwerp,62.7,57.0,7.10,7.04,4.43,Ideal,G,VS1
40451,12615,2.02,Madrid,57.1,60.0,8.31,8.25,4.73,Good,F,SI2
40452,5457,1.01,Kimberly,62.7,56.0,6.37,6.42,4.01,Ideal,H,SI1
40453,456,0.33,Kimberly,61.9,54.3,4.45,4.47,2.76,Ideal,J,VS1


In [6]:
# Checking nulls
nulls = pd.isnull(diamonds_train_df).sum()

In [7]:
nulls

price      0
carat      0
city       0
depth      0
table      0
x          0
y          0
z          0
cut        0
color      0
clarity    0
dtype: int64

In [8]:
diamonds_train_df_cols = list(diamonds_train_df)

In [9]:
print(diamonds_train_df_cols)



['price', 'carat', 'city', 'depth', 'table', 'x', 'y', 'z', 'cut', 'color', 'clarity']


In [10]:
print(diamonds_train_df['city'].unique())
print(diamonds_train_df['color'].unique())
print(diamonds_train_df['clarity'].unique())
print(diamonds_train_df['cut'].unique())


['Dubai' 'Kimberly' 'Las Vegas' 'Tel Aviv' 'Amsterdam' 'Zurich' 'Antwerp'
 'Madrid' 'Paris' 'Surat' 'Luxembourg' 'London' 'New York City']
['J' 'H' 'G' 'D' 'F' 'E' 'I']
['VS2' 'VS1' 'SI1' 'SI2' 'IF' 'VVS1' 'VVS2' 'I1']
['Premium' 'Very Good' 'Fair' 'Good' 'Ideal']


In [11]:
diamonds_train_color_encoded = pd.get_dummies(diamonds_train_df['color'], prefix='Color', dummy_na=False, dtype=int)
diamonds_test_color_encoded = pd.get_dummies(diamonds_test_df['color'], prefix='Color', dummy_na=False, dtype=int)

In [12]:
diamonds_train_color_encoded

Unnamed: 0,Color_D,Color_E,Color_F,Color_G,Color_H,Color_I,Color_J
0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0
2,0,0,0,1,0,0,0
3,1,0,0,0,0,0,0
4,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...
40450,0,0,0,1,0,0,0
40451,0,0,1,0,0,0,0
40452,0,0,0,0,1,0,0
40453,0,0,0,0,0,0,1


In [13]:
diamonds_train_color_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Color_D  40455 non-null  int32
 1   Color_E  40455 non-null  int32
 2   Color_F  40455 non-null  int32
 3   Color_G  40455 non-null  int32
 4   Color_H  40455 non-null  int32
 5   Color_I  40455 non-null  int32
 6   Color_J  40455 non-null  int32
dtypes: int32(7)
memory usage: 1.1 MB


In [14]:
diamonds_train_cut_encoded = pd.get_dummies(diamonds_train_df['cut'], prefix='Cut', dummy_na=False, dtype=int)
diamonds_test_cut_encoded = pd.get_dummies(diamonds_test_df['cut'], prefix='Cut', dummy_na=False, dtype=int)

In [15]:
diamonds_train_cut_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Cut_Fair       40455 non-null  int32
 1   Cut_Good       40455 non-null  int32
 2   Cut_Ideal      40455 non-null  int32
 3   Cut_Premium    40455 non-null  int32
 4   Cut_Very Good  40455 non-null  int32
dtypes: int32(5)
memory usage: 790.3 KB


In [16]:
diamonds_train_cut_encoded

Unnamed: 0,Cut_Fair,Cut_Good,Cut_Ideal,Cut_Premium,Cut_Very Good
0,0,0,0,1,0
1,0,0,0,0,1
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
...,...,...,...,...,...
40450,0,0,1,0,0
40451,0,1,0,0,0
40452,0,0,1,0,0
40453,0,0,1,0,0


In [17]:
diamonds_train_cut_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Cut_Fair       40455 non-null  int32
 1   Cut_Good       40455 non-null  int32
 2   Cut_Ideal      40455 non-null  int32
 3   Cut_Premium    40455 non-null  int32
 4   Cut_Very Good  40455 non-null  int32
dtypes: int32(5)
memory usage: 790.3 KB


In [18]:
diamonds_train_clarity_encoded = pd.get_dummies(diamonds_train_df['clarity'], prefix='Clarity', dummy_na=False, dtype=int)
diamonds_test_clarity_encoded = pd.get_dummies(diamonds_test_df['clarity'], prefix='Clarity', dummy_na=False, dtype=int)

In [19]:
diamonds_train_clarity_encoded

Unnamed: 0,Clarity_I1,Clarity_IF,Clarity_SI1,Clarity_SI2,Clarity_VS1,Clarity_VS2,Clarity_VVS1,Clarity_VVS2
0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,1,0,0
2,0,0,0,0,1,0,0,0
3,0,0,1,0,0,0,0,0
4,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...
40450,0,0,0,0,1,0,0,0
40451,0,0,0,1,0,0,0,0
40452,0,0,1,0,0,0,0,0
40453,0,0,0,0,1,0,0,0


In [20]:
diamonds_train_clarity_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Clarity_I1    40455 non-null  int32
 1   Clarity_IF    40455 non-null  int32
 2   Clarity_SI1   40455 non-null  int32
 3   Clarity_SI2   40455 non-null  int32
 4   Clarity_VS1   40455 non-null  int32
 5   Clarity_VS2   40455 non-null  int32
 6   Clarity_VVS1  40455 non-null  int32
 7   Clarity_VVS2  40455 non-null  int32
dtypes: int32(8)
memory usage: 1.2 MB


In [21]:
diamonds_train_city_encoded = pd.get_dummies(diamonds_train_df['city'], prefix='City', dummy_na=False, dtype=int)
diamonds_test_city_encoded = pd.get_dummies(diamonds_test_df['city'], prefix='City', dummy_na=False, dtype=int)

In [22]:
diamonds_train_city_encoded

Unnamed: 0,City_Amsterdam,City_Antwerp,City_Dubai,City_Kimberly,City_Las Vegas,City_London,City_Luxembourg,City_Madrid,City_New York City,City_Paris,City_Surat,City_Tel Aviv,City_Zurich
0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
40450,0,1,0,0,0,0,0,0,0,0,0,0,0
40451,0,0,0,0,0,0,0,1,0,0,0,0,0
40452,0,0,0,1,0,0,0,0,0,0,0,0,0
40453,0,0,0,1,0,0,0,0,0,0,0,0,0


In [23]:
diamonds_train_city_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   City_Amsterdam      40455 non-null  int32
 1   City_Antwerp        40455 non-null  int32
 2   City_Dubai          40455 non-null  int32
 3   City_Kimberly       40455 non-null  int32
 4   City_Las Vegas      40455 non-null  int32
 5   City_London         40455 non-null  int32
 6   City_Luxembourg     40455 non-null  int32
 7   City_Madrid         40455 non-null  int32
 8   City_New York City  40455 non-null  int32
 9   City_Paris          40455 non-null  int32
 10  City_Surat          40455 non-null  int32
 11  City_Tel Aviv       40455 non-null  int32
 12  City_Zurich         40455 non-null  int32
dtypes: int32(13)
memory usage: 2.0 MB


In [24]:

# Train
dataframes = [diamonds_train_color_encoded, diamonds_train_cut_encoded, diamonds_train_clarity_encoded, diamonds_train_city_encoded ]
df_train_diamods_complete = pd.concat(dataframes, axis=1)
df_train_diamods_complete

# Test
dataframes_test = [diamonds_test_color_encoded, diamonds_test_cut_encoded, diamonds_test_clarity_encoded, diamonds_test_city_encoded ]
df_test_diamods_complete = pd.concat(dataframes_test, axis=1)
df_test_diamods_complete

Unnamed: 0,Color_D,Color_E,Color_F,Color_G,Color_H,Color_I,Color_J,Cut_Fair,Cut_Good,Cut_Ideal,...,City_Kimberly,City_Las Vegas,City_London,City_Luxembourg,City_Madrid,City_New York City,City_Paris,City_Surat,City_Tel Aviv,City_Zurich
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13480,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
13481,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
13482,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
13483,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [25]:
df_categoric_nulls = df_train_diamods_complete.isnull().sum()

In [26]:
df_categoric_nulls

Color_D               0
Color_E               0
Color_F               0
Color_G               0
Color_H               0
Color_I               0
Color_J               0
Cut_Fair              0
Cut_Good              0
Cut_Ideal             0
Cut_Premium           0
Cut_Very Good         0
Clarity_I1            0
Clarity_IF            0
Clarity_SI1           0
Clarity_SI2           0
Clarity_VS1           0
Clarity_VS2           0
Clarity_VVS1          0
Clarity_VVS2          0
City_Amsterdam        0
City_Antwerp          0
City_Dubai            0
City_Kimberly         0
City_Las Vegas        0
City_London           0
City_Luxembourg       0
City_Madrid           0
City_New York City    0
City_Paris            0
City_Surat            0
City_Tel Aviv         0
City_Zurich           0
dtype: int64

In [27]:
#diamonds_train_df["Volume"] = df_train_diamods_complete["x"] *  df_train_diamods_complete["y"] * df_train_diamods_complete["z"]

#df_test_diamods_complete["Volume"] = df_train_diamods_complete["x"] *  df_train_diamods_complete["y"] * df_train_diamods_complete["z"]
diamonds_train_df["Volume"] = diamonds_train_df["x"] *  diamonds_train_df["y"] * diamonds_train_df["z"]
diamonds_test_df["Volume"] = diamonds_train_df["x"] *  diamonds_train_df["y"] * diamonds_train_df["z"]


In [28]:
diamonds_train_df

Unnamed: 0,price,carat,city,depth,table,x,y,z,cut,color,clarity,Volume
0,4268,1.21,Dubai,62.4,58.0,6.83,6.79,4.25,Premium,J,VS2,197.096725
1,505,0.32,Kimberly,63.0,57.0,4.35,4.38,2.75,Very Good,H,VS2,52.395750
2,2686,0.71,Las Vegas,65.5,55.0,5.62,5.53,3.65,Fair,G,VS1,113.436890
3,738,0.41,Kimberly,63.8,56.0,4.68,4.72,3.00,Good,D,SI1,66.268800
4,4882,1.02,Dubai,60.5,59.0,6.55,6.51,3.95,Ideal,G,SI1,168.429975
...,...,...,...,...,...,...,...,...,...,...,...,...
40450,10070,1.34,Antwerp,62.7,57.0,7.10,7.04,4.43,Ideal,G,VS1,221.429120
40451,12615,2.02,Madrid,57.1,60.0,8.31,8.25,4.73,Good,F,SI2,324.276975
40452,5457,1.01,Kimberly,62.7,56.0,6.37,6.42,4.01,Ideal,H,SI1,163.990554
40453,456,0.33,Kimberly,61.9,54.3,4.45,4.47,2.76,Ideal,J,VS1,54.900540


In [29]:
diamonds_test_df

Unnamed: 0,id,carat,cut,color,clarity,depth,table,x,y,z,city,Volume
0,0,0.79,Very Good,F,SI1,62.7,60.0,5.82,5.89,3.67,Amsterdam,197.096725
1,1,1.20,Ideal,J,VS1,61.0,57.0,6.81,6.89,4.18,Surat,52.395750
2,2,1.57,Premium,H,SI1,62.2,61.0,7.38,7.32,4.57,Kimberly,113.436890
3,3,0.90,Very Good,F,SI1,63.8,54.0,6.09,6.13,3.90,Kimberly,66.268800
4,4,0.50,Very Good,F,VS1,62.9,58.0,5.05,5.09,3.19,Amsterdam,168.429975
...,...,...,...,...,...,...,...,...,...,...,...,...
13480,13480,0.57,Ideal,E,SI1,61.9,56.0,5.35,5.32,3.30,Amsterdam,150.885792
13481,13481,0.71,Ideal,I,VS2,62.2,55.0,5.71,5.73,3.56,New York City,87.254475
13482,13482,0.70,Ideal,F,VS1,61.6,55.0,5.75,5.71,3.53,Tel Aviv,240.919920
13483,13483,0.70,Very Good,F,SI2,58.8,57.0,5.85,5.89,3.45,Surat,174.982752


In [30]:
# Train
df_train_diamods_complete = pd.concat([df_train_diamods_complete, diamonds_train_df], axis=1)
df_train_diamods_complete = df_train_diamods_complete.drop(['city', 'cut', 'clarity', 'color'],axis=1)

# Test
df_test_diamods_complete = pd.concat([df_test_diamods_complete, diamonds_test_df], axis=1)
df_test_diamods_complete = df_test_diamods_complete.drop(['city', 'cut', 'clarity', 'color'],axis=1)


In [31]:
df_train_diamods_complete

Unnamed: 0,Color_D,Color_E,Color_F,Color_G,Color_H,Color_I,Color_J,Cut_Fair,Cut_Good,Cut_Ideal,...,City_Tel Aviv,City_Zurich,price,carat,depth,table,x,y,z,Volume
0,0,0,0,0,0,0,1,0,0,0,...,0,0,4268,1.21,62.4,58.0,6.83,6.79,4.25,197.096725
1,0,0,0,0,1,0,0,0,0,0,...,0,0,505,0.32,63.0,57.0,4.35,4.38,2.75,52.395750
2,0,0,0,1,0,0,0,1,0,0,...,0,0,2686,0.71,65.5,55.0,5.62,5.53,3.65,113.436890
3,1,0,0,0,0,0,0,0,1,0,...,0,0,738,0.41,63.8,56.0,4.68,4.72,3.00,66.268800
4,0,0,0,1,0,0,0,0,0,1,...,0,0,4882,1.02,60.5,59.0,6.55,6.51,3.95,168.429975
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40450,0,0,0,1,0,0,0,0,0,1,...,0,0,10070,1.34,62.7,57.0,7.10,7.04,4.43,221.429120
40451,0,0,1,0,0,0,0,0,1,0,...,0,0,12615,2.02,57.1,60.0,8.31,8.25,4.73,324.276975
40452,0,0,0,0,1,0,0,0,0,1,...,0,0,5457,1.01,62.7,56.0,6.37,6.42,4.01,163.990554
40453,0,0,0,0,0,0,1,0,0,1,...,0,0,456,0.33,61.9,54.3,4.45,4.47,2.76,54.900540


In [32]:
df_test_diamods_complete

Unnamed: 0,Color_D,Color_E,Color_F,Color_G,Color_H,Color_I,Color_J,Cut_Fair,Cut_Good,Cut_Ideal,...,City_Tel Aviv,City_Zurich,id,carat,depth,table,x,y,z,Volume
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0.79,62.7,60.0,5.82,5.89,3.67,197.096725
1,0,0,0,0,0,0,1,0,0,1,...,0,0,1,1.20,61.0,57.0,6.81,6.89,4.18,52.395750
2,0,0,0,0,1,0,0,0,0,0,...,0,0,2,1.57,62.2,61.0,7.38,7.32,4.57,113.436890
3,0,0,1,0,0,0,0,0,0,0,...,0,0,3,0.90,63.8,54.0,6.09,6.13,3.90,66.268800
4,0,0,1,0,0,0,0,0,0,0,...,0,0,4,0.50,62.9,58.0,5.05,5.09,3.19,168.429975
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13480,0,1,0,0,0,0,0,0,0,1,...,0,0,13480,0.57,61.9,56.0,5.35,5.32,3.30,150.885792
13481,0,0,0,0,0,1,0,0,0,1,...,0,0,13481,0.71,62.2,55.0,5.71,5.73,3.56,87.254475
13482,0,0,1,0,0,0,0,0,0,1,...,1,0,13482,0.70,61.6,55.0,5.75,5.71,3.53,240.919920
13483,0,0,1,0,0,0,0,0,0,0,...,0,0,13483,0.70,58.8,57.0,5.85,5.89,3.45,174.982752


# ENTRENAMIENTO DEL DATAFRAME

In [33]:
# Setting features and target
X = df_train_diamods_complete.drop("price", axis=1)
y = df_train_diamods_complete["price"] 



In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [35]:
my_model_vr = RandomForestRegressor(n_estimators=100, random_state=42)

In [36]:
my_model_vr.fit(X_train, y_train)

In [37]:
y_pred = my_model_vr.predict(X_test)

In [38]:
y_pred

array([2787.51, 2298.83,  861.6 , ..., 2913.71, 3807.58, 7554.42])

In [39]:
# Cross Validation
score = cross_val_score(my_model2,
                        X,
                        y,
                        scoring="neg_mean_squared_error",
                        cv=5,
                       n_jobs=-1) 

In [40]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 313137.6180216769
R-squared: 0.9807711669661275


In [41]:
# Ridge Regression
ridge_model = Ridge(alpha=1.0)  
ridge_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)








In [42]:
ridge_mse = mean_squared_error(y_test, y_pred_ridge)
ridge_r2 = r2_score(y_test, y_pred_ridge)
print(f'Ridge Regression - Mean Squared Error: {ridge_mse}')
print(f'Ridge Regression - R-squared: {ridge_r2}')

Ridge Regression - Mean Squared Error: 1254616.008836446
Ridge Regression - R-squared: 0.9229578295065479


In [43]:
# Lasso Regression
lasso_model = Lasso(alpha=0.1)  # Puedes ajustar alpha
lasso_model.fit(X_train, y_train)
y_pred_lasso = lasso_model.predict(X_test)

  model = cd_fast.enet_coordinate_descent(


In [44]:
lasso_mse = mean_squared_error(y_test, y_pred_lasso)
lasso_r2 = r2_score(y_test, y_pred_lasso)
print(f'Lasso Regression - Mean Squared Error: {lasso_mse}')
print(f'Lasso Regression - R-squared: {lasso_r2}')

Lasso Regression - Mean Squared Error: 1255134.4762680375
Lasso Regression - R-squared: 0.9229259919913411


# ENTRENAMIENTO DEL DATAFRAME TEST

In [45]:
df_test_diamods_complete = df_test_diamods_complete.drop("id", axis=1)


In [46]:
df_test_diamods_complete

Unnamed: 0,Color_D,Color_E,Color_F,Color_G,Color_H,Color_I,Color_J,Cut_Fair,Cut_Good,Cut_Ideal,...,City_Surat,City_Tel Aviv,City_Zurich,carat,depth,table,x,y,z,Volume
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0.79,62.7,60.0,5.82,5.89,3.67,197.096725
1,0,0,0,0,0,0,1,0,0,1,...,1,0,0,1.20,61.0,57.0,6.81,6.89,4.18,52.395750
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1.57,62.2,61.0,7.38,7.32,4.57,113.436890
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0.90,63.8,54.0,6.09,6.13,3.90,66.268800
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0.50,62.9,58.0,5.05,5.09,3.19,168.429975
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13480,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0.57,61.9,56.0,5.35,5.32,3.30,150.885792
13481,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0.71,62.2,55.0,5.71,5.73,3.56,87.254475
13482,0,0,1,0,0,0,0,0,0,1,...,0,1,0,0.70,61.6,55.0,5.75,5.71,3.53,240.919920
13483,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0.70,58.8,57.0,5.85,5.89,3.45,174.982752


In [47]:
y_pred_test = my_model_vr.predict(df_test_diamods_complete)

In [48]:
df_y_pred = pd.DataFrame(y_pred_test, columns=["price"])

In [49]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_train, y_train)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')

Mean Squared Error: 313137.6180216769
R-squared: 1.0


In [50]:
df_test_diamods_complete

Unnamed: 0,Color_D,Color_E,Color_F,Color_G,Color_H,Color_I,Color_J,Cut_Fair,Cut_Good,Cut_Ideal,...,City_Surat,City_Tel Aviv,City_Zurich,carat,depth,table,x,y,z,Volume
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0.79,62.7,60.0,5.82,5.89,3.67,197.096725
1,0,0,0,0,0,0,1,0,0,1,...,1,0,0,1.20,61.0,57.0,6.81,6.89,4.18,52.395750
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1.57,62.2,61.0,7.38,7.32,4.57,113.436890
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0.90,63.8,54.0,6.09,6.13,3.90,66.268800
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0.50,62.9,58.0,5.05,5.09,3.19,168.429975
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13480,0,1,0,0,0,0,0,0,0,1,...,0,0,0,0.57,61.9,56.0,5.35,5.32,3.30,150.885792
13481,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0.71,62.2,55.0,5.71,5.73,3.56,87.254475
13482,0,0,1,0,0,0,0,0,0,1,...,0,1,0,0.70,61.6,55.0,5.75,5.71,3.53,240.919920
13483,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0.70,58.8,57.0,5.85,5.89,3.45,174.982752


In [51]:
df_y_pred['id'] = df_y_pred.index

In [52]:
df_y_pred 

Unnamed: 0,price,id
0,4945.70,0
1,4791.17,1
2,7617.99,2
3,3387.09,3
4,4660.61,4
...,...,...
13480,1943.83,13480
13481,2346.09,13481
13482,7397.61,13482
13483,3693.02,13483


In [53]:
df_y_pred.to_csv('..\data\my_submission.csv',index=False)