<img src="https://raw.githubusercontent.com/Ironhack-Data-Madrid-Marzo-2021/W7-Kaggle_competition/master/images/PORTADA.jpg">

# Import libraries

In [1]:
import numpy as np
import pandas as pd
import src.limpieza as lm

# Download Kaggle data

In [2]:
lm.download_kaggle()

Kaggle file downloaded.
Kaggle file unzipped.
zip file deleted.
Files moved to data folder.


"DataFrames downloaded correctly as 'test' and 'train'."

# Read datasets

In [3]:
test = pd.read_csv("data/test.csv")
train = pd.read_csv("data/train.csv")


In [4]:
train.sample(5)

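| | id | carat | cut | color | clarity | depth | table | x | y | z | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10846 | 10846 | 1.09 | Ideal | I | SI2 | 60.3 | 57.0 | 6.64 | 6.7 | 4.02 | 8.37 |
| 32597 | 32597 | 0.51 | Ideal | D | SI1 | 62.6 | 57.0 | 5.11 | 5.08 | 3.19 | 7.39 |
| 24590 | 24590 | 1.18 | Very Good | I | VS1 | 62.2 | 57.0 | 6.72 | 6.76 | 4.19 | 8.604 |
| 18239 | 18239 | 1.31 | Very Good | I | SI2 | 60.8 | 57.0 | 7.06 | 7.12 | 4.31 | 8.627 |
| 33130 | 33130 | 0.35 | Ideal | H | VS2 | 61.2 | 56.0 | 4.54 | 4.55 | 2.78 | 6.347 |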


In [5]:
test.head()

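| | id | carat | cut | color | clarity | depth | table | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2.01 | Ideal | H | SI1 | 61.9 | 57.0 | 8.14 | 8.05 | 5.01 |
| 1 | 1 | 0.49 | Good | D | VS1 | 57.5 | 60.0 | 5.18 | 5.25 | 3.0 |
| 2 | 2 | 1.03 | Premium | F | SI1 | 58.6 | 62.0 | 6.65 | 6.6 | 3.88 |
| 3 | 3 | 0.9 | Very Good | E | SI1 | 63.0 | 56.0 | 6.11 | 6.15 | 3.86 |
| 4 | 4 | 0.59 | Ideal | D | SI1 | 62.5 | 55.0 | 5.35 | 5.4 | 3.36 |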


# Exploring Data
Once the data is downloaded, we explore and analyse it. We first look at the size of each DataFrame with `.shape`, then at the column types with the `.dtypes` attribute, and finally check whether there are any null values with `.isnull()`.

In [6]:
train.shape

(40455, 11)

In [7]:
test.shape

(13485, 10)

In [8]:
test.dtypes

id           int64
carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
x          float64
y          float64
z          float64
dtype: object

As we can see, three columns (`cut`, `color` and `clarity`) contain `categorical values`. The next step is to see how many distinct values each of these columns holds and whether we can replace them with `numerical values`.
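
One quick way to count the distinct values per categorical column is sketched below. This is only an illustration: the toy DataFrame `df` stands in for `test`, and its values are made up for the example.

```python
import pandas as pd

# Toy frame standing in for `test`; the values are illustrative only.
df = pd.DataFrame({
    "carat": [0.49, 1.03, 0.90, 0.59],
    "cut": ["Good", "Premium", "Very Good", "Ideal"],
    "color": ["D", "F", "E", "D"],
    "clarity": ["VS1", "SI1", "SI1", "SI1"],
})

# Keep only the non-numeric columns and count the distinct values in each.
categorical = df.select_dtypes(exclude="number")
n_unique = categorical.nunique()
print(n_unique)
```

A small number of distinct values per column suggests that a hand-written mapping dictionary is practical.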

In [9]:
test.isnull().sum()

id         0
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
dtype: int64

In [10]:
train.isnull().sum()

id         0
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
x          0
y          0
z          0
price      0
dtype: int64

# Categorical to numerical (ordinal encoding)
The string columns hold ordered grades, so instead of creating dummy columns we can map each grade to a number that reflects its rank.
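
To make the difference between the two encodings concrete, here is a small sketch on a toy Series. The values and the name `order` are illustrative and not taken from the dataset. `.map()` with a ranking dictionary yields a single ordered column, whereas `pd.get_dummies` creates one indicator column per category and carries no ranking.

```python
import pandas as pd

cut = pd.Series(["Fair", "Ideal", "Good"], name="cut")

# Ordinal encoding: a single column whose integers preserve the ranking.
order = {"Fair": 1, "Good": 2, "Ideal": 3}
ordinal = cut.map(order)

# One-hot (dummy) encoding: one indicator column per category, no ranking implied.
dummies = pd.get_dummies(cut)

print(ordinal.tolist())
print(list(dummies.columns))
```

Because the grades of the 4Cs are ranked, the ordinal form keeps information that dummy columns would discard.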

## Clarity
Clarity is a measure of a `diamond's purity` and rarity, graded by how visible its inclusions (internal imperfections) and blemishes (external imperfections) are under 10x magnification. A stone is classified as flawless if, under 10x magnification, it shows neither inclusions nor blemishes.

* **FL** (flawless): FL diamonds are flawless.
* **IF** (internally flawless): IF diamonds are internally flawless.
* **VVS1 - VVS2** (very very slightly included): VVS diamonds (1 and 2) have very very light inclusions.
* **VS1 - VS2** (very slightly included): VS diamonds (1 and 2) have very light inclusions.
* **SI1 - SI2** (slightly included): SI diamonds (1 and 2) have light inclusions.
* **I1 - I2 - I3** (imperfect): I diamonds (1, 2 and 3) are flawed.

In [11]:
test.clarity.value_counts()

SI1     3306
VS2     3059
SI2     2282
VS1     2003
VVS2    1255
VVS1     941
IF       461
I1       178
Name: clarity, dtype: int64

In [12]:
claridad = {
    "I1":1,
    "SI1":2,
    "SI2":2.5,
    "VS1":3,
    "VS2":3.5,
    "VVS1":4,
    "VVS2":4.5,
    "IF":5
    }

In [13]:
test.clarity = test.clarity.map(claridad)
test.sample(3)

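| | id | carat | cut | color | clarity | depth | table | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 3 | 0.9 | Very Good | E | 2.0 | 63.0 | 56.0 | 6.11 | 6.15 | 3.86 |
| 7321 | 7321 | 0.51 | Good | G | 4.5 | 63.3 | 58.0 | 5.04 | 5.07 | 3.2 |
| 10471 | 10471 | 1.05 | Premium | H | 1.0 | 62.0 | 59.0 | 6.5 | 6.47 | 4.02 |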
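
One point worth checking: on the GIA scale the "1" grade within each pair is the clearer stone (VS1 ranks above VS2), while the `claridad` dictionary scores the "2" grades higher. Below is a sketch of an alternative mapping that follows the GIA order. The name `claridad_gia` is hypothetical, and this is not the mapping that produced the outputs shown in this notebook.

```python
# Hypothetical alternative to `claridad`: within each pair the "1" grade
# is the clearer stone, so it receives the higher score.
claridad_gia = {
    "I1": 1,
    "SI2": 2,
    "SI1": 2.5,
    "VS2": 3,
    "VS1": 3.5,
    "VVS2": 4,
    "VVS1": 4.5,
    "IF": 5,
}

# Sorting the grades by score should reproduce the GIA order, worst to best.
ranking = sorted(claridad_gia, key=claridad_gia.get)
print(ranking)
```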


## Color
Colour is one of the `most important characteristics` of a diamond: the more colourless (transparent) the stone, the more beautiful, scarce and valuable it is. Colour is graded on a scale that runs from D (colourless) to Z. This scale was established by the GIA (Gemological Institute of America) and is internationally accepted.


In [14]:
test.color.value_counts()

G    2830
E    2489
F    2329
H    2103
D    1765
I    1288
J     681
Name: color, dtype: int64

In [15]:
list(test.color.value_counts().keys())

['G', 'E', 'F', 'H', 'D', 'I', 'J']

In [16]:
# Ordinal scores for colour: D (most colourless) gets the highest value.
clr = {
    "D": 7,
    "E": 6,
    "F": 5,
    "G": 4,
    "H": 3,
    "I": 2,
    "J": 1,
}

In [17]:
test.color = test.color.map(clr)
test.sample(3)

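| | id | carat | cut | color | clarity | depth | table | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 1665 | 1665 | 0.41 | Ideal | 3 | 3.0 | 61.1 | 56.0 | 4.83 | 4.86 | 2.96 |
| 9420 | 9420 | 0.34 | Ideal | 4 | 5.0 | 61.7 | 55.0 | 4.5 | 4.48 | 2.77 |
| 8750 | 8750 | 3.0 | Good | 6 | 1.0 | 64.2 | 65.0 | 9.08 | 8.96 | 5.79 |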


## Cut
The cut is the element that `reveals the brilliance of the diamond`. It is the only criterion among the 4Cs that depends on human expertise. The cut refers to the proportions of the stone, and a diamond sparkles and shines according to its cut: if its proportions are not right, it will sparkle less because the light inside it will not reflect properly.


In [18]:
test.cut.value_counts()

Ideal        5334
Premium      3452
Very Good    3068
Good         1238
Fair          393
Name: cut, dtype: int64

In [19]:
# Ordinal scores for cut: Fair (worst) to Ideal (best).
ct = {
    "Fair": 1,
    "Good": 2,
    "Very Good": 3,
    "Premium": 4,
    "Ideal": 5,
}

In [20]:
test.cut = test.cut.map(ct)
test.sample(3)

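| | id | carat | cut | color | clarity | depth | table | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 7595 | 7595 | 0.34 | 5 | 5 | 3.0 | 62.2 | 56.0 | 4.44 | 4.47 | 2.77 |
| 716 | 716 | 0.51 | 5 | 7 | 3.0 | 61.9 | 57.0 | 5.17 | 5.14 | 3.19 |
| 3725 | 3725 | 0.27 | 2 | 6 | 4.0 | 63.9 | 57.0 | 4.07 | 4.1 | 2.61 |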
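
The cells above encode `test` only. Assuming `train` needs the same encoding, the mappings could be applied through one helper, sketched below. The function name `encode_ordinal` and the toy DataFrame are hypothetical. The helper also guards against a known pitfall: `.map()` silently returns `NaN` for any value missing from the dictionary.

```python
import pandas as pd


def encode_ordinal(df, mappings):
    """Return a copy of `df` with each listed column mapped to its scores."""
    out = df.copy()
    for column, mapping in mappings.items():
        unknown = set(out[column].unique()) - set(mapping)
        if unknown:
            # .map() would silently turn these values into NaN.
            raise ValueError(f"Unmapped values in {column!r}: {unknown}")
        out[column] = out[column].map(mapping)
    return out


# Same scores as the `ct` and `clr` dictionaries defined in this notebook.
ct = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
clr = {"D": 7, "E": 6, "F": 5, "G": 4, "H": 3, "I": 2, "J": 1}
mappings = {"cut": ct, "color": clr}

# Toy frame for illustration; in the notebook this would be `train` or `test`.
toy = pd.DataFrame({"cut": ["Ideal", "Fair"], "color": ["D", "J"]})
encoded = encode_ordinal(toy, mappings)
print(encoded)
```

Calling the helper once per DataFrame keeps `train` and `test` on identical scales and leaves the original frames untouched.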


# Correlation

Once we have changed the categorical values to numerical values, we check the collinearity of the data.
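
A minimal sketch of such a check, assuming Pearson correlation via `DataFrame.corr()` and a hypothetical cut-off of 0.9. The toy values below are made up and only stand in for the encoded `test` DataFrame.

```python
import pandas as pd

# Toy numeric frame standing in for the encoded data; values are illustrative.
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 1.0, 1.5, 2.0],
    "x": [4.3, 5.1, 6.4, 7.3, 8.1],
    "cut": [5, 3, 4, 2, 1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()

# Column pairs whose absolute correlation exceeds the (hypothetical) threshold
# are candidates for collinearity.
threshold = 0.9
pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(corr)
print(pairs)
```

Strongly correlated features, such as a stone's weight and its dimensions, carry overlapping information, so one of them may be dropped before modelling.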