# Data Analysis 4 - ML
---
Data set [Diamonds](https://www.kaggle.com/shivam2503/diamonds)<br>
Main ideas from [pythonprogramming.net](https://pythonprogramming.net/machine-learning-python3-pandas-data-analysis/)

In [11]:
import pandas as pd

df = pd.read_csv("./archive/diamonds.csv", index_col=0)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


To apply ML, we need to convert every categorical feature we pass to the model into numeric values.

In [12]:
df["cut"].unique()

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

In [5]:
df["cut"].astype("category").cat.codes     # Shown only to illustrate the method. In our case the
                                           # order matters and carries meaning (Premium is better
                                           # than Fair), and these codes are assigned alphabetically.

1        2
2        3
3        1
4        3
5        1
        ..
53936    2
53937    1
53938    4
53939    3
53940    2
Length: 53940, dtype: int8
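If category codes are still preferred over a hand-written dictionary, one option is to declare the category order explicitly. The sketch below is only an illustration: `cuts` is a hypothetical five-row sample standing in for `df["cut"]`, and `quality_order` is an assumed worst-to-best ordering.

```python
import pandas as pd

# Hypothetical sample of cut grades; in the notebook this would be df["cut"].
cuts = pd.Series(["Ideal", "Premium", "Good", "Very Good", "Fair"])

# Default category codes follow alphabetical order, not quality.
alphabetical = cuts.astype("category").cat.codes

# Declaring the categories explicitly (assumed worst to best) gives quality-ordered codes.
quality_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
ordered_dtype = pd.CategoricalDtype(categories=quality_order, ordered=True)
ordered = cuts.astype(ordered_dtype).cat.codes

print(alphabetical.tolist())  # [2, 3, 1, 4, 0]
print(ordered.tolist())       # [4, 3, 1, 2, 0]
```

Note that these codes start at 0, whereas a hand-written mapping can start at any value.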

In [6]:
# Alternative: an explicit mapping from worst (1) to best (5)
cut_class_dict = {"Fair": 1,
                  "Good": 2,
                  "Very Good": 3,
                  "Premium": 4,
                  "Ideal": 5,}

In the same way, we build a dictionary for the diamond's clarity and another for its color. The data set description lists the grades in order from best to worst.

In [7]:
clarity_dict = {"I3": 1,
                "I2": 2,
                "I1": 3,
                "SI2": 4,
                "SI1": 5,
                "VS2": 6,
                "VS1": 7,
                "VVS2": 8,
                "VVS1": 9,
                "IF": 10,
                "FL": 11}


color_dict = {"J": 1, "I": 2, "H": 3, "G": 4, "F": 5, "E": 6, "D": 7}

In [13]:
df["cut"] = df["cut"].map(cut_class_dict)
df["clarity"] = df["clarity"].map(clarity_dict)
df["color"] = df["color"].map(color_dict)

df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,5,6,4,61.5,55.0,326,3.95,3.98,2.43
2,0.21,4,6,5,59.8,61.0,326,3.89,3.84,2.31
3,0.23,2,6,7,56.9,65.0,327,4.05,4.07,2.31
4,0.29,4,2,6,62.4,58.0,334,4.2,4.23,2.63
5,0.31,2,1,4,63.3,58.0,335,4.34,4.35,2.75
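One thing worth checking after a dictionary mapping is whether every value was covered, since `Series.map` returns `NaN` for any value that is not a key. A minimal sketch, using a hypothetical sample in which `"Excellent"` is deliberately absent from the dictionary:

```python
import pandas as pd

# Hypothetical mapping and sample; "Excellent" is deliberately not a key.
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
sample = pd.Series(["Ideal", "Good", "Excellent"])

mapped = sample.map(cut_class_dict)

# Values missing from the dictionary come back as NaN, so list them explicitly.
unmapped = sample[mapped.isna()].unique().tolist()
print(unmapped)  # ['Excellent']
```

An empty list would mean every category was mapped.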


Our data set appears to be sorted by price, so we shuffle the rows into random order. Because the train/test split below is taken by position, this keeps both sets representative and lets the model train properly.

In [14]:
from sklearn import svm
from sklearn.utils import shuffle

df = shuffle(df)
df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
15814,1.29,5,1,5,61.6,57.0,6327,6.98,7.01,4.32
19684,0.27,5,3,8,62.1,57.0,623,4.15,4.1,2.56
30403,0.32,5,4,8,62.1,55.0,730,4.35,4.38,2.71
44471,0.54,5,6,5,61.5,56.0,1594,5.31,5.26,3.25
48271,0.53,5,7,6,60.6,57.0,1956,5.28,5.24,3.19
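The shuffle above gives a different order on every run, so the scores further down will vary slightly. As a sketch of one way to make it repeatable, the example below shuffles a hypothetical toy frame with pandas' `sample` and a fixed `random_state`; `sklearn.utils.shuffle` accepts a `random_state` argument as well.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds frame, sorted by price.
toy = pd.DataFrame({"carat": [0.2, 0.3, 0.5, 0.9, 1.2],
                    "price": [300, 400, 900, 3000, 6000]})

# frac=1 draws every row exactly once (a full shuffle); random_state fixes the order.
shuffled_a = toy.sample(frac=1, random_state=42)
shuffled_b = toy.sample(frac=1, random_state=42)

print(shuffled_a.equals(shuffled_b))                 # True: same seed, same order
print(sorted(shuffled_a.index) == list(toy.index))   # True: no rows lost or repeated
```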


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 15814 to 23929
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  int64  
 2   color    53940 non-null  int64  
 3   clarity  53940 non-null  int64  
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(4)
memory usage: 4.5 MB


In [20]:
X = df.drop("price", axis=1).values
y = df['price'].values

test_size = int(len(df) * 0.2)  # Hold out 20% of the rows for testing; the model trains on the other 80%

X_train = X[:-test_size]
y_train = y[:-test_size]

X_test = X[-test_size:]
y_test = y[-test_size:]
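The manual slicing above can also be expressed with scikit-learn's `train_test_split`, which shuffles and splits in one step. This is a sketch on hypothetical random arrays (`X_demo`, `y_demo`) rather than the diamonds data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays playing the roles of X and y above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 9))
y_demo = rng.normal(size=100)

# test_size=0.2 holds out 20% of the rows; shuffling is on by default.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (80, 9) (20, 9)
```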

In [21]:
clf = svm.SVR(kernel = "linear")
clf.fit(X_train, y_train)

SVR(kernel='linear')

In [22]:
clf.score(X_test, y_test)

0.8057693945232232
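For a regressor such as `SVR`, `score` returns the coefficient of determination R², where 1.0 is a perfect fit. The sketch below computes it by hand on hypothetical numbers to show what the figure measures; the helper name `r2` is made up for this example.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical prices and predictions, for illustration only.
actual = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3200, 3800]
print(r2(actual, predicted))
```

Here the result is roughly 0.98, because the model's squared error is 2% of the error made by always predicting the mean.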

In [24]:
# Use separate loop names so the X and y arrays are not overwritten
for features, actual in zip(X_test[:30], y_test[:30]):
    print(f"Model: {clf.predict([features])[0]}, Actual: {actual}")

Model: 1057.1465542287588, Actual: 1087
Model: 2778.821736820708, Actual: 1969
Model: 6520.0775894996805, Actual: 7549
Model: 9844.468732909281, Actual: 16731
Model: 3963.105287516333, Actual: 3286
Model: 5366.153130563249, Actual: 6600
Model: 5009.086492361781, Actual: 4537
Model: 1721.3205487706982, Actual: 1298
Model: 2920.3607305515106, Actual: 2495
Model: 4088.1114029751698, Actual: 2932
Model: 7405.427672707901, Actual: 9072
Model: 4821.931924062195, Actual: 3750
Model: 456.768865033755, Actual: 636
Model: 4426.792450539582, Actual: 4788
Model: 3628.769917694717, Actual: 2617
Model: 10657.173371614062, Actual: 11579
Model: 3794.3443649549954, Actual: 1954
Model: 572.8100028204117, Actual: 1069
Model: 1693.7160140930464, Actual: 1323
Model: 4516.123511651607, Actual: 4870
Model: 2518.0777737668104, Actual: 1940
Model: 6593.312085934038, Actual: 8588
Model: 980.9965267711486, Actual: 694
Model: 4755.275341377936, Actual: 3692
Model: 5296.128072079777, Actual: 6618
Model: 4639.94698
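The loop above calls `predict` once per row. Since `predict` accepts a whole 2-D array, the same comparison can be produced with a single call. A minimal sketch on hypothetical toy data (`X_demo`, `y_demo`), not the diamonds set:

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data; in the notebook these would be the train and test arrays.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(60, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + 5.0

model = svm.SVR(kernel="linear").fit(X_demo[:50], y_demo[:50])

# One call predicts every held-out row at once.
predictions = model.predict(X_demo[50:])
for predicted, actual in zip(predictions[:5], y_demo[50:55]):
    print(f"Model: {predicted:.2f}, Actual: {actual:.2f}")
```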

It is interesting that in some cases the model suggests we pay someone to take the diamond, i.e. it predicts a negative price, xD.
Let's compare by training an SVR with an RBF kernel instead of a linear one.

In [25]:
clf = svm.SVR(kernel = "rbf")
clf.fit(X_train, y_train)

SVR()

In [26]:
clf.score(X_test, y_test)

-0.1284447183422448
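A negative score means the model does worse than a constant prediction of the mean. The sketch below illustrates this with scikit-learn's `DummyRegressor` on a small hypothetical set of right-skewed prices; the numbers are invented for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical right-skewed "prices": many cheap items and a few expensive ones.
prices = np.array([500, 600, 700, 800, 900, 1000, 5000, 12000], dtype=float)
features = np.zeros((len(prices), 1))  # Dummy models ignore the features.

mean_model = DummyRegressor(strategy="mean").fit(features, prices)
median_model = DummyRegressor(strategy="median").fit(features, prices)

# Predicting the mean scores R^2 = 0 on the data it was fitted to;
# any other constant, such as the median of skewed data, scores below 0.
mean_score = mean_model.score(features, prices)
median_score = median_model.score(features, prices)
print(mean_score, median_score)
```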

In [27]:
# Use separate loop names so the X and y arrays are not overwritten
for features, actual in zip(X_test[:30], y_test[:30]):
    print(f"Model: {clf.predict([features])[0]}, Actual: {actual}")

Model: 2333.5816802080817, Actual: 1087
Model: 2347.5625859427014, Actual: 1969
Model: 2473.7303685723114, Actual: 7549
Model: 2538.2940666207514, Actual: 16731
Model: 2444.725505960763, Actual: 3286
Model: 2426.772966362022, Actual: 6600
Model: 2461.2203702906513, Actual: 4537
Model: 2342.8845953253817, Actual: 1298
Model: 2420.540616910389, Actual: 2495
Model: 2395.555799860845, Actual: 2932
Model: 2480.1861345239595, Actual: 9072
Model: 2472.12770712327, Actual: 3750
Model: 2335.478897042499, Actual: 636
Model: 2458.4198493248537, Actual: 4788
Model: 2412.8527968365697, Actual: 2617
Model: 2540.4074488431074, Actual: 11579
Model: 2451.9587823278343, Actual: 1954
Model: 2370.735961950878, Actual: 1069
Model: 2338.7297255531876, Actual: 1323
Model: 2431.856143441774, Actual: 4870
Model: 2432.9752453657443, Actual: 1940
Model: 2411.96564606147, Actual: 8588
Model: 2312.1796441447163, Actual: 694
Model: 2435.497232319539, Actual: 3692
Model: 2421.506837289361, Actual: 6618
Model: 2448.1

Overall, the linear model did better (a score of about 0.81 versus about -0.13 for RBF), despite producing some negative predictions.
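One possible way to rule out negative prices, which this notebook does not use, is to fit the model on the logarithm of the price and exponentiate the predictions. The sketch below wraps a linear `SVR` in scikit-learn's `TransformedTargetRegressor` on hypothetical toy data; whether it would improve the score on the diamonds data is untested here.

```python
import numpy as np
from sklearn import svm
from sklearn.compose import TransformedTargetRegressor

# Hypothetical toy data with strictly positive targets.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = np.exp(X_demo @ np.array([2.0, 1.0, 0.5]) + 5.0)

# The regressor is fitted on log(y); predictions are mapped back with exp,
# so they are always positive.
model = TransformedTargetRegressor(
    regressor=svm.SVR(kernel="linear"),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X_demo, y_demo)

predictions = model.predict(X_demo)
print((predictions > 0).all())  # True
```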

Another thing we can do is normalize the data and see how that affects our models.

In [30]:
from sklearn import preprocessing

X = df.drop("price", axis=1).values
X = preprocessing.scale(X)
y = df["price"].values

X_train = X[:-test_size]
y_train = y[:-test_size]

X_test = X[-test_size:]
y_test = y[-test_size:]
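`preprocessing.scale` standardizes each column to zero mean and unit variance using statistics from all rows, including the ones later used for testing. A common alternative is to learn those statistics from the training rows only and reuse them on the test rows. A minimal sketch with `StandardScaler` on hypothetical random data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the unscaled feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
X_tr, X_te = X_demo[:80], X_demo[80:]

# Learn the per-column mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# Reuse those same statistics for the test rows.
X_te_scaled = scaler.transform(X_te)

print(np.allclose(X_tr_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_tr_scaled.std(axis=0), 1.0))   # True
```

The scaled test columns will generally not have exactly zero mean, which is expected because the test rows played no part in fitting the scaler.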

In [31]:
clf = svm.SVR(kernel = "linear")
clf.fit(X_train, y_train)

print(f"Score: {clf.score(X_test, y_test) :.6g}")

# Use separate loop names so the X and y arrays are not overwritten
for features, actual in zip(X_test[:30], y_test[:30]):
    print(f"Model: {clf.predict([features])[0]}, Actual: {actual}")

Score: 0.862896
Model: 1015.9282596610756, Actual: 1087
Model: 2417.73038799347, Actual: 1969
Model: 7043.032881205988, Actual: 7549
Model: 11433.822854705271, Actual: 16731
Model: 3843.6943188617615, Actual: 3286
Model: 5242.935893930722, Actual: 6600
Model: 4994.434103727581, Actual: 4537
Model: 1612.9284060154714, Actual: 1298
Model: 2652.9767547045435, Actual: 2495
Model: 3594.675632580835, Actual: 2932
Model: 8090.877459118174, Actual: 9072
Model: 4731.709130767178, Actual: 3750
Model: 631.6344514739653, Actual: 636
Model: 4348.9992927453695, Actual: 4788
Model: 3347.9093130234196, Actual: 2617
Model: 12612.712761014565, Actual: 11579
Model: 3847.560978437749, Actual: 1954
Model: 613.0373360227336, Actual: 1069
Model: 1608.6855413760231, Actual: 1323
Model: 4513.164939034804, Actual: 4870
Model: 2244.6184996147817, Actual: 1940
Model: 6665.041099522412, Actual: 8588
Model: 991.634756709685, Actual: 694
Model: 4760.472488951949, Actual: 3692
Model: 5127.53447135253, Actual: 6618
Mo