# Diamonds Price Prediction Project

### Importing relevant libraries

In [1]:
import pandas as pd

### Relevant information

#### Files

- data.csv: training set
- test.csv: test set
- sample_submission.csv: sample submission

#### Features

- id: only for test & sample submission files, id for prediction sample identification
- price: price in USD
- carat: weight of the diamond
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm
- y: width in mm
- z: depth in mm
- depth: total depth percentage = 100 * z / mean(x, y) = 200 * z / (x + y) (43--79)
- table: width of top of diamond relative to widest point (43--95)
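As a quick sanity check on the `depth` definition, the sketch below recomputes it from `x`, `y` and `z`. The sample rows are the five shown by `training_df.head()` later in this notebook; the names `rows`, `depth_check` and `max_gap` are illustrative only. Because `depth` is a percentage, the ratio is multiplied by 100.

```python
import pandas as pd

# Sample rows copied from the training data preview in this notebook
rows = pd.DataFrame({
    "x": [5.93, 4.37, 4.30, 6.54, 5.58],
    "y": [5.98, 4.32, 4.34, 6.46, 5.62],
    "z": [3.66, 2.64, 2.69, 4.03, 3.44],
    "depth": [61.5, 60.8, 62.3, 62.0, 61.4],
})

# depth (%) = 100 * z / mean(x, y) = 200 * z / (x + y)
rows["depth_check"] = 200 * rows["z"] / (rows["x"] + rows["y"])

# Largest absolute gap between the recorded and recomputed depth
max_gap = (rows["depth_check"] - rows["depth"]).abs().max()
print(rows[["depth", "depth_check"]].round(2))
```

For these rows the recomputed values agree with the recorded `depth` to within rounding.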

### Importing the training csv

In [2]:
#Importing the csv to Jupyter Notebook
training_df = pd.read_csv("../input/diamonds-datamad0120/diamonds_train.csv")
training_df.head()

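|   | id | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|----|-------|-----|-------|---------|-------|-------|---|---|---|-------|
| 0 | 0 | 0.78 | Premium | F | VS1 | 61.5 | 58.0 | 5.93 | 5.98 | 3.66 | 3446 |
| 1 | 1 | 0.31 | Ideal | D | SI1 | 60.8 | 56.0 | 4.37 | 4.32 | 2.64 | 732 |
| 2 | 2 | 0.30 | Ideal | F | SI1 | 62.3 | 54.0 | 4.30 | 4.34 | 2.69 | 475 |
| 3 | 3 | 1.04 | Ideal | E | VVS2 | 62.0 | 58.0 | 6.54 | 6.46 | 4.03 | 9552 |
| 4 | 4 | 0.65 | Ideal | J | SI1 | 61.4 | 55.0 | 5.58 | 5.62 | 3.44 | 1276 |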


In [3]:
#Checking the shape of the dataframe
training_df.shape

(40345, 11)

In [4]:
#Checking if there are missing values
training_df.isnull().sum()

id         0
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
price      0
dtype: int64

In [5]:
#Removing the id column
training_df = training_df.drop(columns=["id"])

In [6]:
training_df.head()

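|   | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|-------|-----|-------|---------|-------|-------|---|---|---|-------|
| 0 | 0.78 | Premium | F | VS1 | 61.5 | 58.0 | 5.93 | 5.98 | 3.66 | 3446 |
| 1 | 0.31 | Ideal | D | SI1 | 60.8 | 56.0 | 4.37 | 4.32 | 2.64 | 732 |
| 2 | 0.30 | Ideal | F | SI1 | 62.3 | 54.0 | 4.30 | 4.34 | 2.69 | 475 |
| 3 | 1.04 | Ideal | E | VVS2 | 62.0 | 58.0 | 6.54 | 6.46 | 4.03 | 9552 |
| 4 | 0.65 | Ideal | J | SI1 | 61.4 | 55.0 | 5.58 | 5.62 | 3.44 | 1276 |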


In [7]:
#Checking the unique values of each column
for col in training_df.columns:
    print(f"**** {col} **** --> {training_df[col].unique()}")

**** carat **** --> [0.78 0.31 0.3  1.04 0.65 0.9  0.71 2.05 1.1  1.19 0.33 1.3  1.29 0.69
 0.28 1.4  1.01 0.5  1.5  1.03 1.51 0.76 1.21 0.74 0.32 0.59 1.02 2.01
 0.91 0.43 0.23 0.52 0.34 2.   1.7  0.35 0.8  0.4  1.32 0.54 0.42 1.
 0.41 0.51 0.26 0.93 1.07 0.7  0.55 0.82 2.28 1.56 0.79 1.24 0.57 2.02
 0.63 0.72 1.09 1.06 0.36 0.61 0.25 1.45 1.52 1.6  0.56 1.05 1.2  1.31
 1.11 0.77 0.38 1.13 1.53 2.31 1.61 1.75 1.18 0.64 1.12 1.22 1.76 1.25
 1.17 1.71 1.14 0.37 1.27 0.48 1.44 0.73 1.23 0.53 1.33 0.27 1.54 0.62
 1.41 1.43 1.35 0.29 1.84 3.51 2.21 0.46 1.59 2.11 1.26 2.08 2.23 2.12
 2.09 0.39 0.92 1.16 1.64 0.83 2.35 0.44 0.49 1.63 0.58 2.32 1.57 1.74
 1.15 1.82 1.34 2.15 2.52 1.72 2.03 2.18 1.95 0.24 0.75 0.45 2.25 1.66
 0.85 1.62 0.47 0.81 2.2  3.   1.58 0.89 2.24 1.65 1.28 1.38 0.6  1.83
 1.73 0.66 0.96 1.79 2.04 0.95 2.1  0.84 1.08 1.55 0.99 0.94 2.06 0.87
 2.19 2.51 2.68 0.67 2.44 0.97 2.26 0.98 1.46 2.07 3.65 0.22 0.21 2.53
 2.29 1.86 1.37 2.13 2.14 1.9  3.01 2.39 2.27 1.39 0.2  2.5

### Conclusions

After inspecting the unique values of each column, we will proceed as follows:

- Column *'cut'*: The values are ranked by quality, so they will be replaced with numbers.

- Column *'color'*: A priori, no colour is treated as more important than another, so the `get_dummies` function will be used to give every value the same weight.

- Column *'clarity'*: The values are ranked by quality, so they will be replaced with numbers.

#### Column *'cut'*
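A minimal sketch of the numeric replacement, assuming the quality order listed in the feature description (Fair, Good, Very Good, Premium, Ideal). The codes 1 to 5, the name `cut_order` and the small `cut_sample` frame standing in for `training_df` are assumptions for illustration, not the notebook's original code.

```python
import pandas as pd

# Hypothetical sample standing in for training_df
cut_sample = pd.DataFrame({"cut": ["Premium", "Ideal", "Fair", "Very Good", "Good"]})

# Assumed codes: 1 (worst) to 5 (best), following the order in the feature list
cut_order = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}

# Replace each label with its numeric rank; labels missing from the mapping become NaN
cut_sample["cut"] = cut_sample["cut"].map(cut_order)
print(cut_sample)
```

Using `map` rather than `replace` turns any unexpected label into `NaN`, which makes typos in the data easy to spot with `isnull().sum()`.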

#### Column *'color'*
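A minimal sketch of one-hot encoding with `pd.get_dummies`, which is the function named in the conclusions. The `color_sample` frame is a hypothetical stand-in for `training_df`, and the exact call the notebook used is not shown in the source.

```python
import pandas as pd

# Hypothetical sample standing in for training_df
color_sample = pd.DataFrame({
    "carat": [0.78, 0.31, 0.65],
    "color": ["F", "D", "J"],
})

# One indicator column per colour; the original 'color' column is dropped
color_encoded = pd.get_dummies(color_sample, columns=["color"])
print(color_encoded.columns.tolist())
```

Each colour present in the data gets its own `color_<letter>` column, so no ordering between colours is imposed on the model.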

#### Column *'clarity'*
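The same ordinal approach can be sketched for clarity, assuming the order given in the feature description, from I1 (worst) to IF (best). The codes 1 to 8, the name `clarity_order` and the `clarity_sample` frame are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample standing in for training_df
clarity_sample = pd.DataFrame({"clarity": ["VS1", "SI1", "VVS2", "IF", "I1"]})

# Assumed codes: 1 (worst) to 8 (best), following the order in the feature list
clarity_levels = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
clarity_order = {level: rank for rank, level in enumerate(clarity_levels, start=1)}

# Replace each label with its numeric rank
clarity_sample["clarity"] = clarity_sample["clarity"].map(clarity_order)
print(clarity_sample)
```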