# **Kaggle – DataTops®**
Your TA has decided on a change of scenery and has bought a laptop store. Their only specialty is Data Science, though, so they have decided to build an ML model to set the best prices.

Could you help your teacher improve that model?

## Key points
- Final submission:
    - Morning group: February 17 at 5pm
    - Afternoon group: February 19 at 5pm
- **Competition link**: https://www.kaggle.com/t/c5cc87b50c4b4770bdc8f5acbe15577d
- **Requirement**: Be registered on [Kaggle](https://www.kaggle.com/)

## Metric
The root mean squared error (RMSE) measures the typical size of the residuals (prediction errors); when the residuals average zero, it equals their standard deviation. The residuals are the differences between the observed values and the values predicted by the model. The RMSE tells you how spread out those errors are: the lower the RMSE, the closer the predictions are to the real values. In other words, the RMSE measures how well the model's predictions fit the data.


$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{(d_i - f_i)^2}}$$

where $d_i$ is the observed price, $f_i$ is the predicted price and $n$ is the number of laptops.
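To make the formula concrete, here is a minimal sketch that computes the RMSE by hand with NumPy. The helper name `rmse` and the three prices are made up for illustration; they are not part of the competition data.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# Hypothetical prices in euros, made up for illustration only
observed = [500.0, 1000.0, 1500.0]
predicted = [400.0, 1000.0, 1700.0]
print(rmse(observed, predicted))  # errors of 100, 0 and -200 give roughly 129.1
```

Because the errors are squared before averaging, one large miss (the 200 euro one here) weighs more than several small ones.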


## 1. Libraries

In [1]:
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import urllib.request

## 2. Data

In [2]:
# For this to work you need to download the data files from Kaggle
df = pd.read_csv("./data/train.csv")

### 2.1 Data exploration

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 912 entries, 0 to 911
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   laptop_ID         912 non-null    int64  
 1   Company           912 non-null    object 
 2   Product           912 non-null    object 
 3   TypeName          912 non-null    object 
 4   Inches            912 non-null    float64
 5   ScreenResolution  912 non-null    object 
 6   Cpu               912 non-null    object 
 7   Ram               912 non-null    object 
 8   Memory            912 non-null    object 
 9   Gpu               912 non-null    object 
 10  OpSys             912 non-null    object 
 11  Weight            912 non-null    object 
 12  Price_in_euros    912 non-null    float64
dtypes: float64(2), int64(1), object(10)
memory usage: 92.8+ KB


In [4]:
df.head()

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_in_euros
0,755,HP,250 G6,Notebook,15.6,Full HD 1920x1080,Intel Core i3 6006U 2GHz,8GB,256GB SSD,Intel HD Graphics 520,Windows 10,1.86kg,539.0
1,618,Dell,Inspiron 7559,Gaming,15.6,Full HD 1920x1080,Intel Core i7 6700HQ 2.6GHz,16GB,1TB HDD,Nvidia GeForce GTX 960<U+039C>,Windows 10,2.59kg,879.01
2,909,HP,ProBook 450,Notebook,15.6,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,8GB,1TB HDD,Nvidia GeForce 930MX,Windows 10,2.04kg,900.0
3,2,Apple,Macbook Air,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,898.94
4,286,Dell,Inspiron 3567,Notebook,15.6,Full HD 1920x1080,Intel Core i3 6006U 2.0GHz,4GB,1TB HDD,AMD Radeon R5 M430,Linux,2.25kg,428.0


In [5]:
df.tail()

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price_in_euros
907,28,Dell,Inspiron 5570,Notebook,15.6,Full HD 1920x1080,Intel Core i5 8250U 1.6GHz,8GB,256GB SSD,AMD Radeon 530,Windows 10,2.2kg,800.0
908,1160,HP,Spectre Pro,2 in 1 Convertible,13.3,Full HD / Touchscreen 1920x1080,Intel Core i5 6300U 2.4GHz,8GB,256GB SSD,Intel HD Graphics 520,Windows 10,1.48kg,1629.0
909,78,Lenovo,IdeaPad 320-15IKBN,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,2TB HDD,Intel HD Graphics 620,No OS,2.2kg,519.0
910,23,HP,255 G6,Notebook,15.6,1366x768,AMD E-Series E2-9000e 1.5GHz,4GB,500GB HDD,AMD Radeon R2,No OS,1.86kg,258.0
911,229,Dell,Alienware 17,Gaming,17.3,IPS Panel Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,256GB SSD + 1TB HDD,Nvidia GeForce GTX 1060,Windows 10,4.42kg,2456.34


In [6]:
df.describe()

Unnamed: 0,laptop_ID,Inches,Price_in_euros
count,912.0,912.0,912.0
mean,650.3125,14.981579,1111.72409
std,382.727748,1.436719,687.959172
min,2.0,10.1,174.0
25%,324.75,14.0,589.0
50%,636.5,15.6,978.0
75%,982.25,15.6,1483.9425
max,1320.0,18.4,6099.0


### 2.3 Define X and y

In [7]:
X = df.drop(['Price_in_euros'], axis=1)
y = df['Price_in_euros'].copy()
X.shape

(912, 12)

In [8]:
y.shape

(912,)

### 2.4 Split into X_train, X_test, y_train, y_test

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

In [10]:
X_train

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight
25,1118,HP,ZBook 17,Workstation,17.3,IPS Panel Full HD 1920x1080,Intel Core i7 6700HQ 2.6GHz,8GB,1TB HDD,AMD FirePro W6150M,Windows 7,3.0kg
84,153,Dell,Inspiron 5577,Gaming,15.6,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,512GB SSD,Nvidia GeForce GTX 1050,Windows 10,2.56kg
10,275,Apple,MacBook Pro,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.9GHz,8GB,512GB SSD,Intel Iris Graphics 550,macOS,1.37kg
342,1100,HP,EliteBook 840,Notebook,14.0,Full HD 1920x1080,Intel Core i5 6200U 2.3GHz,4GB,500GB HDD,Intel HD Graphics 520,Windows 7,1.54kg
890,131,Dell,Inspiron 5770,Notebook,17.3,Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,16GB,256GB SSD + 2TB HDD,AMD Radeon 530,Windows 10,2.8kg
...,...,...,...,...,...,...,...,...,...,...,...,...
106,578,HP,14-am079na (N3710/8GB/2TB/W10),Notebook,14.0,1366x768,Intel Pentium Quad Core N3710 1.6GHz,8GB,2TB HDD,Intel HD Graphics 405,Windows 10,1.94kg
270,996,Lenovo,IdeaPad 320-15ABR,Notebook,15.6,Full HD 1920x1080,AMD A12-Series 9720P 3.6GHz,6GB,256GB SSD,AMD Radeon 530,Windows 10,2.2kg
860,770,Dell,Latitude 7280,Ultrabook,12.5,Full HD 1920x1080,Intel Core i7 7600U 2.8GHz,16GB,256GB SSD,Intel HD Graphics 620,Windows 10,1.18kg
435,407,Lenovo,IdeaPad 320-15IAP,Notebook,15.6,1366x768,Intel Celeron Dual Core N3350 1.1GHz,4GB,1TB HDD,Intel HD Graphics 500,Windows 10,2.2kg


In [11]:
y_train

25     2899.00
84     1249.26
10     1958.90
342    1030.99
890    1396.00
        ...   
106     389.00
270     549.00
860    1859.00
435     306.00
102    1943.00
Name: Price_in_euros, Length: 729, dtype: float64

## 3. Data processing

Our target is the `Price_in_euros` column.

In [12]:
X = X.drop(['laptop_ID'], axis=1)  # carries no predictive information


In [13]:
# columns stored as text that should be numeric

X['Ram'] = X['Ram'].str.replace('GB', '').astype(int)
X['Weight'] = X['Weight'].str.replace('kg', '').astype(float)


In [14]:
# one-hot encode the categorical columns

X = pd.get_dummies(X, drop_first=True)


In [15]:
# redo the split on the processed features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train

Unnamed: 0,Inches,Ram,Weight,Company_Apple,Company_Asus,Company_Chuwi,Company_Dell,Company_Fujitsu,Company_Google,Company_HP,...,Gpu_Nvidia Quadro M620,Gpu_Nvidia Quadro M620M,OpSys_Chrome OS,OpSys_Linux,OpSys_Mac OS X,OpSys_No OS,OpSys_Windows 10,OpSys_Windows 10 S,OpSys_Windows 7,OpSys_macOS
25,17.3,8,3.00,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,True,False
84,15.6,16,2.56,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
10,13.3,8,1.37,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
342,14.0,4,1.54,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,True,False
890,17.3,16,2.80,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,14.0,8,1.94,False,False,False,False,False,False,True,...,False,False,False,False,False,False,True,False,False,False
270,15.6,6,2.20,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
860,12.5,16,1.18,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
435,15.6,4,2.20,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


-----------------------------------------------------------------------------------------------------------------

## 4. Modeling

### 4.1 Baseline models


In [16]:
# random forest
rf = RandomForestRegressor(random_state=42)

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

try:
    rmse_rf = root_mean_squared_error(y_test, y_pred_rf)
except Exception:
    # Fallback: compute the RMSE by hand
    rmse_rf = np.sqrt(((y_test - y_pred_rf) ** 2).mean())



In [17]:
svr = SVR()

svr.fit(X_train, y_train)

y_pred_svr = svr.predict(X_test)

try:
    rmse_svr = root_mean_squared_error(y_test, y_pred_svr)
except Exception:
    # Fallback: compute the RMSE by hand
    rmse_svr = np.sqrt(((y_test - y_pred_svr) ** 2).mean())



### 4.2 Compute metrics and assess the models

Remember that the competition is scored with the ``RMSE`` metric.

In [18]:
print("RMSE RandomForest:", rmse_rf)
print("RMSE SVR:", rmse_svr)


RMSE RandomForest: 348.0980439972959
RMSE SVR: 738.7418829584842


- The best model is RandomForest
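A single 80/20 split leaves fewer than 200 rows for evaluation, so the comparison can shift with the random seed. A hedged sketch of a steadier estimate is k-fold cross-validation with `cross_val_score`. The data below is synthetic stand-in data from `make_regression` (an assumption for the sake of a runnable example); in the notebook you would pass `X` and `y` instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); in the notebook these would be X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximizes scores, so the RMSE is exposed as a negative number
scores = cross_val_score(model, X_demo, y_demo, cv=5,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```

The spread between folds gives a rough idea of how much of a difference between two models is just noise.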

### 4.3 Optimization (up to you 🫰🏻)

In [19]:
results = []

for n in [100, 200, 300]:
    for depth in [None, 10, 20, 30]:
        rf = RandomForestRegressor(
            n_estimators=n,
            max_depth=depth,
            random_state=42
        )
        
        rf.fit(X_train, y_train)
        preds = rf.predict(X_test)
        
        try:
            rmse = root_mean_squared_error(y_test, preds)
        except Exception:
            # Fallback: compute the RMSE by hand
            rmse = np.sqrt(((y_test - preds) ** 2).mean())
        
        results.append((n, depth, rmse))

results_df = pd.DataFrame(results, columns=["n_estimators", "max_depth", "RMSE"])
results_df.sort_values("RMSE")


Unnamed: 0,n_estimators,max_depth,RMSE
8,300,,340.532455
11,300,30.0,340.962188
4,200,,342.426161
10,300,20.0,342.595118
7,200,30.0,343.004794
6,200,20.0,345.971687
3,100,30.0,347.627995
0,100,,348.098044
2,100,20.0,350.041774
9,300,10.0,352.300141
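The loop above picks hyperparameters using the same `X_test` that is later used to report the score, which tends to make the reported RMSE optimistic. One possible alternative, sketched here on synthetic stand-in data (an assumption), is `GridSearchCV`, which scores each combination by cross-validation on the training data only. The small grid is illustrative, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); in the notebook use X_train and y_train
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=42)

# Illustrative grid, kept small so the example runs quickly
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best cross-validated RMSE:", -search.best_score_)
```

With this setup the held-out `X_test` is touched only once, at the end, to report the final score.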


In [20]:
results = []

for n in [300]:
    for depth in [None, 20, 30]:
        for min_split in [2, 5, 10]:
            for min_leaf in [1, 2, 4]:
                rf = RandomForestRegressor(
                    n_estimators=n,
                    max_depth=depth,
                    min_samples_split=min_split,
                    min_samples_leaf=min_leaf,
                    max_features='sqrt',
                    random_state=42
                )
                
                rf.fit(X_train, y_train)
                preds = rf.predict(X_test)
                
                try:
                    rmse = root_mean_squared_error(y_test, preds)
                except Exception:
                    # Fallback: compute the RMSE by hand
                    rmse = np.sqrt(((y_test - preds) ** 2).mean())
                
                results.append((depth, min_split, min_leaf, rmse))

results_df = pd.DataFrame(
    results,
    columns=["max_depth", "min_samples_split", "min_samples_leaf", "RMSE"]
)

results_df.sort_values("RMSE").head(10)


Unnamed: 0,max_depth,min_samples_split,min_samples_leaf,RMSE
0,,2,1,327.526762
3,,5,1,333.321727
18,30.0,2,1,333.531155
21,30.0,5,1,341.339355
6,,10,1,348.494759
24,30.0,10,1,351.340838
9,20.0,2,1,361.447359
12,20.0,5,1,363.905451
15,20.0,10,1,373.112702
1,,2,2,402.544411


In [21]:
rf_test = RandomForestRegressor(
    n_estimators=600,
    max_depth=None,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features=None,
    random_state=42
)

rf_test.fit(X_train, y_train)

preds_test = rf_test.predict(X_test)

try:
    rmse_test = root_mean_squared_error(y_test, preds_test)
except Exception:
    # Fallback: compute the RMSE by hand
    rmse_test = np.sqrt(((y_test - preds_test) ** 2).mean())

print("New RMSE:", rmse_test)


New RMSE: 340.15852228229966


In [22]:
product_cols = [col for col in X.columns if col.startswith("Product_")]

X_no_product = X.drop(product_cols, axis=1)

X_train_np, X_test_np, y_train_np, y_test_np = train_test_split(
    X_no_product, y, test_size=0.20, random_state=42
)

rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=42
)

rf.fit(X_train_np, y_train_np)
preds_np = rf.predict(X_test_np)

try:
    rmse_np = root_mean_squared_error(y_test_np, preds_np)
except Exception:
    # Fallback: compute the RMSE by hand
    rmse_np = np.sqrt(((y_test_np - preds_np) ** 2).mean())

print("RMSE without Product:", rmse_np)



RMSE without Product: 377.54872376658034


In [23]:
# add the screen resolution in pixels
res = df['ScreenResolution'].str.extract(r'(\d+)x(\d+)', expand=True)

df['ResX'] = pd.to_numeric(res[0], errors='coerce')
df['ResY'] = pd.to_numeric(res[1], errors='coerce')
df['Pixels'] = df['ResX'] * df['ResY']
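Since the same extraction has to be repeated later on `test.csv`, it can help to wrap it in a function. The sketch below uses the same regular expression as the cell above; the function name `add_resolution_features` and the toy rows are hypothetical.

```python
import pandas as pd

def add_resolution_features(frame, column="ScreenResolution"):
    """Return a copy of frame with ResX, ResY and Pixels columns added."""
    out = frame.copy()
    res = out[column].str.extract(r"(\d+)x(\d+)", expand=True)
    out["ResX"] = pd.to_numeric(res[0], errors="coerce")
    out["ResY"] = pd.to_numeric(res[1], errors="coerce")
    out["Pixels"] = out["ResX"] * out["ResY"]
    return out

# Toy rows using resolution strings in the same style as the dataset
toy = pd.DataFrame({"ScreenResolution": [
    "Full HD 1920x1080",
    "1366x768",
    "IPS Panel Retina Display 2560x1600",
]})
toy_features = add_resolution_features(toy)
print(toy_features[["ResX", "ResY", "Pixels"]])
```

Returning a copy leaves the original frame untouched, so the function can be called on the train and test frames without side effects.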


In [24]:
# add the total storage in GB
mem_nums = df['Memory'].str.findall(r'(\d+\.?\d*)\s*(TB|GB)')

total_gb = []
for row in mem_nums:
    s = 0.0
    for val, unit in row:
        v = float(val)
        if unit.upper() == 'TB':
            v *= 1024.0
        s += v
    total_gb.append(s if len(row) > 0 else 0)

df['Memory_GB'] = total_gb
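The same parsing logic can be written as a small function and checked on a couple of strings before applying it to the whole column. This is a sketch with a hypothetical name, `memory_to_gb`; it uses the same pattern as the cell above and 1 TB = 1024 GB.

```python
import re

def memory_to_gb(text):
    """Sum every '<number>GB' or '<number>TB' found in text, expressed in GB."""
    total = 0.0
    for value, unit in re.findall(r"(\d+\.?\d*)\s*(TB|GB)", text):
        total += float(value) * (1024.0 if unit == "TB" else 1.0)
    return total

print(memory_to_gb("256GB SSD + 1TB HDD"))   # 256 + 1024 = 1280.0
print(memory_to_gb("128GB Flash Storage"))   # 128.0
```

It could then be applied with `df['Memory'].apply(memory_to_gb)`. Note that adding SSD and HDD capacities into one number discards the storage type, which may matter for price.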


In [25]:
df = df.drop(['ScreenResolution', 'Memory'], axis=1)


In [26]:
X = df.drop(['Price_in_euros'], axis=1)
y = df['Price_in_euros'].copy()

X = X.drop(['laptop_ID'], axis=1)

X['Ram'] = X['Ram'].str.replace('GB', '').astype(int)
X['Weight'] = X['Weight'].str.replace('kg', '').astype(float)

X = pd.get_dummies(X, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)



In [27]:
rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=42
)

rf.fit(X_train, y_train)
preds = rf.predict(X_test)

try:
    rmse = root_mean_squared_error(y_test, preds)
except Exception:
    # Fallback: compute the RMSE by hand
    rmse = np.sqrt(((y_test - preds) ** 2).mean())

print("RMSE with new features:", rmse)


RMSE with new features: 410.6001763855704


In [28]:
rf_final = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features='sqrt',
    random_state=42
)

rf_final.fit(X, y)


RandomForestRegressor(max_features='sqrt', n_estimators=300, random_state=42)


-----------------------------------------------------------------

## Once the model is ready, it is time to predict ``test.csv``

**REMEMBER: APPLY TO `test.csv` EVERY TRANSFORMATION YOU APPLIED TO `train.csv`.**


For example:
- Standardization/normalization
- Outlier removal
- Dropping columns
- Creating new columns
- Handling missing values
- And a long list of other techniques that you, as a Data Scientist, judged best for your dataset.
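One way to keep train and test in sync is to put the cleaning steps in a single function and call it on both files. The sketch below covers only the `Ram` and `Weight` cleaning; the function name `preprocess` and the toy rows are hypothetical.

```python
import pandas as pd

def preprocess(frame):
    """Apply the same cleaning steps to any frame with the raw laptop columns."""
    # errors="ignore" lets the function run even if laptop_ID was already dropped
    out = frame.drop(columns=["laptop_ID"], errors="ignore").copy()
    out["Ram"] = out["Ram"].str.replace("GB", "", regex=False).astype(int)
    out["Weight"] = out["Weight"].str.replace("kg", "", regex=False).astype(float)
    return out

# Toy rows made up for illustration
train_toy = pd.DataFrame({"laptop_ID": [1, 2], "Ram": ["8GB", "16GB"],
                          "Weight": ["1.86kg", "2.59kg"]})
test_toy = pd.DataFrame({"laptop_ID": [3], "Ram": ["4GB"], "Weight": ["2.4kg"]})

train_clean = preprocess(train_toy)
test_clean = preprocess(test_toy)

print(train_clean.dtypes)
print(test_clean)
```

Because both frames go through the same code path, a change to the cleaning logic cannot be applied to one file and forgotten on the other.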

## 1. Load the `test.csv` data to predict.


In [29]:
X_pred = pd.read_csv("./data/test.csv")
X_pred.head()

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight
0,209,Lenovo,Legion Y520-15IKBN,Gaming,15.6,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,512GB SSD,Nvidia GeForce GTX 1060,No OS,2.4kg
1,1281,Acer,Aspire ES1-531,Notebook,15.6,1366x768,Intel Celeron Dual Core N3060 1.6GHz,4GB,500GB HDD,Intel HD Graphics 400,Linux,2.4kg
2,1168,Lenovo,V110-15ISK (i3-6006U/4GB/1TB/No,Notebook,15.6,1366x768,Intel Core i3 6006U 2.0GHz,4GB,1TB HDD,Intel HD Graphics 520,No OS,1.9kg
3,1231,Dell,Inspiron 7579,2 in 1 Convertible,15.6,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,Windows 10,2.191kg
4,1020,HP,ProBook 640,Notebook,14.0,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,4GB,256GB SSD,Intel HD Graphics 620,Windows 10,1.95kg


In [30]:
X_pred.tail()

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight
386,820,MSI,GE72MVR 7RG,Gaming,17.3,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,512GB SSD + 1TB HDD,Nvidia GeForce GTX 1070,Windows 10,2.9kg
387,948,Toshiba,Tecra Z40-C-12X,Notebook,14.0,IPS Panel Full HD 1920x1080,Intel Core i5 6200U 2.3GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.47kg
388,483,Dell,Precision M5520,Workstation,15.6,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,8GB,256GB SSD,Nvidia Quadro M1200,Windows 10,1.78kg
389,1017,HP,Probook 440,Notebook,14.0,1366x768,Intel Core i5 7200U 2.5GHz,4GB,500GB HDD,Intel HD Graphics 620,Windows 10,1.64kg
390,421,Asus,ZenBook Flip,2 in 1 Convertible,13.3,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,Windows 10,1.27kg


In [31]:
X_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 391 entries, 0 to 390
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   laptop_ID         391 non-null    int64  
 1   Company           391 non-null    object 
 2   Product           391 non-null    object 
 3   TypeName          391 non-null    object 
 4   Inches            391 non-null    float64
 5   ScreenResolution  391 non-null    object 
 6   Cpu               391 non-null    object 
 7   Ram               391 non-null    object 
 8   Memory            391 non-null    object 
 9   Gpu               391 non-null    object 
 10  OpSys             391 non-null    object 
 11  Weight            391 non-null    object 
dtypes: float64(1), int64(1), object(10)
memory usage: 36.8+ KB


## 2. Replicate the processing for ``test.csv``

In [32]:
X_pred

Unnamed: 0,laptop_ID,Company,Product,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight
0,209,Lenovo,Legion Y520-15IKBN,Gaming,15.6,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,512GB SSD,Nvidia GeForce GTX 1060,No OS,2.4kg
1,1281,Acer,Aspire ES1-531,Notebook,15.6,1366x768,Intel Celeron Dual Core N3060 1.6GHz,4GB,500GB HDD,Intel HD Graphics 400,Linux,2.4kg
2,1168,Lenovo,V110-15ISK (i3-6006U/4GB/1TB/No,Notebook,15.6,1366x768,Intel Core i3 6006U 2.0GHz,4GB,1TB HDD,Intel HD Graphics 520,No OS,1.9kg
3,1231,Dell,Inspiron 7579,2 in 1 Convertible,15.6,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,Windows 10,2.191kg
4,1020,HP,ProBook 640,Notebook,14.0,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,4GB,256GB SSD,Intel HD Graphics 620,Windows 10,1.95kg
...,...,...,...,...,...,...,...,...,...,...,...,...
386,820,MSI,GE72MVR 7RG,Gaming,17.3,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,16GB,512GB SSD + 1TB HDD,Nvidia GeForce GTX 1070,Windows 10,2.9kg
387,948,Toshiba,Tecra Z40-C-12X,Notebook,14.0,IPS Panel Full HD 1920x1080,Intel Core i5 6200U 2.3GHz,4GB,128GB SSD,Intel HD Graphics 520,Windows 10,1.47kg
388,483,Dell,Precision M5520,Workstation,15.6,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,8GB,256GB SSD,Nvidia Quadro M1200,Windows 10,1.78kg
389,1017,HP,Probook 440,Notebook,14.0,1366x768,Intel Core i5 7200U 2.5GHz,4GB,500GB HDD,Intel HD Graphics 620,Windows 10,1.64kg


In [33]:
if 'ScreenResolution' in X_pred.columns:
    res = X_pred['ScreenResolution'].astype(str).str.extract(r'(\d+)x(\d+)', expand=True)
    X_pred['ResX'] = pd.to_numeric(res[0], errors='coerce')
    X_pred['ResY'] = pd.to_numeric(res[1], errors='coerce')
    X_pred['Pixels'] = (X_pred['ResX'] * X_pred['ResY']).fillna(0)
else:
    X_pred['Pixels'] = 0
    X_pred['ResX'] = 0
    X_pred['ResY'] = 0

if 'Memory' in X_pred.columns:
    mem_nums = X_pred['Memory'].astype(str).str.findall(r'(\d+\.?\d*)\s*(TB|GB)')
    total_gb = []
    for row in mem_nums:
        s = 0.0
        for val, unit in row:
            v = float(val)
            if unit.upper() == 'TB':
                v *= 1024.0
            s += v
        total_gb.append(s if len(row) > 0 else 0.0)
    X_pred['Memory_GB'] = total_gb
else:
    X_pred['Memory_GB'] = 0

cols_to_drop = [c for c in ['ScreenResolution', 'Memory', 'laptop_ID'] if c in X_pred.columns]
if len(cols_to_drop) > 0:
    X_pred = X_pred.drop(cols_to_drop, axis=1)

if 'Ram' in X_pred.columns:
    X_pred['Ram'] = X_pred['Ram'].astype(str).str.replace('GB', '', regex=False).astype(int)

if 'Weight' in X_pred.columns:
    X_pred['Weight'] = X_pred['Weight'].astype(str).str.replace('kg', '', regex=False).astype(float)

X_pred = pd.get_dummies(X_pred, drop_first=True)
X_pred = X_pred.reindex(columns=X.columns, fill_value=0)
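One detail in the cell above is worth double-checking: `drop_first=True` drops the first category *present in the frame being encoded*. If that category differs between train and test, the test frame loses a column the model expects, and `reindex` silently refills it with zeros. The toy example below (made-up rows) shows the effect and a possible workaround: encode every test category and let `reindex` select the training columns.

```python
import pandas as pd

train_toy = pd.DataFrame({"OpSys": ["Linux", "Windows 10", "macOS"]})
test_toy = pd.DataFrame({"OpSys": ["Windows 10", "macOS"]})  # no "Linux" rows

# On the training data, drop_first removes OpSys_Linux
train_cols = pd.get_dummies(train_toy, drop_first=True).columns

# On the test data, drop_first removes OpSys_Windows 10, which the model expects
lossy = pd.get_dummies(test_toy, drop_first=True).reindex(columns=train_cols, fill_value=0)

# Encoding every category and letting reindex pick the columns keeps the information
safe = pd.get_dummies(test_toy).reindex(columns=train_cols, fill_value=0)

print(lossy)
print(safe)
```

In `lossy`, the Windows 10 row ends up with zeros everywhere and becomes indistinguishable from a Linux laptop; in `safe` it keeps its flag. Whether this affects the real `test.csv` depends on which categories it contains, so treat it as something to verify rather than a confirmed bug.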



In [34]:
print(X.shape)
print(X_pred.shape)
print((X.columns != X_pred.columns).sum())


(912, 715)
(391, 715)
0


In [35]:
predictions_submit = rf_final.predict(X_pred)
predictions_submit

array([1280.93418889,  298.42593333,  431.85986667, 1029.01853333,
        911.54153333,  577.76476667,  806.88733333,  938.32987278,
       1024.74074444,  458.09443333, 2135.96523333, 1410.7941    ,
        555.54236667, 1378.59506667,  797.7192    ,  796.41766667,
       1797.96503333, 1323.44245   , 1747.76013333,  611.176     ,
       1527.35486667,  526.09213333,  655.69746667, 1303.54636667,
        464.8165    ,  751.10677222,  549.8878    , 1140.83475556,
       2929.9722    , 1064.78473333, 2268.83006667,  454.80036667,
        710.9221    , 2740.7869    , 1882.2215    , 1550.49713333,
        645.51673333, 1509.3657    ,  910.32776222, 1539.21743333,
        675.72283333,  830.14483333,  559.5676    , 1238.95417143,
       1231.2383    , 1106.22803333, 1016.2352    ,  673.2304    ,
        664.6936    ,  447.28476667, 1563.85946667,  853.81431111,
       1105.23783333,  552.23463333, 1902.861     , 1697.82706667,
        684.0118    , 1052.07554444, 1025.44411667,  800.18503

**WATCH OUT! Why am I getting an error?**

IMPORTANT:

- IF THE ARRAY YOU CALLED `.fit()` ON HAD 4 COLUMNS, THE ARRAY YOU PASS TO `.predict()` MUST HAVE THOSE SAME COLUMNS
- IF YOU NORMALIZED THE ARRAY YOU CALLED `.fit()` ON, YOU MUST ALSO NORMALIZE THE ONE YOU PASS TO `.predict()`
- EVERYTHING THE SAME EXCEPT **DELETING ROWS**: THE NUMBER OF ROWS IN THIS SET MUST NOT CHANGE, BECAUSE THE PREDICTION MUST HAVE **391 ROWS**, NO MATTER WHAT

**So, if you used `index_col=0` when loading ``train.csv``, do you have to do the same for `test.csv`?**

In [36]:
# What do you think?
# Yes or no?
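To think about the question, it helps to see what `index_col=0` actually changes. The sketch below reads a tiny in-memory CSV (made up for illustration, with an unnamed leading index column) both ways and prints the resulting columns.

```python
import io
import pandas as pd

# A tiny CSV with an unnamed leading index column, made up for illustration
csv_text = ",laptop_ID,Ram\n0,209,16GB\n1,1281,4GB\n"

plain = pd.read_csv(io.StringIO(csv_text))
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))    # ['Unnamed: 0', 'laptop_ID', 'Ram']
print(list(indexed.columns))  # ['laptop_ID', 'Ram']
```

Loading one file with `index_col=0` and the other without it would leave the two frames with different columns, which is exactly the mismatch the warnings above describe.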

## 3. **What will you upload to Kaggle?**

**To be uploaded to Kaggle, the prediction must have a specific shape.**

In this case, the **SAME** shape as `sample_submission.csv`.

In [37]:
sample = pd.read_csv("data/sample_submission.csv")

In [38]:
sample.head()

Unnamed: 0,laptop_ID,Price_in_euros
0,209,1949.1
1,1281,805.0
2,1168,1101.0
3,1231,1293.8
4,1020,1832.6


In [39]:
sample.shape

(391, 2)

## 4. Put your predictions in a dataframe called ``submission``.

In [40]:
# How do we build the submission?
submission = sample.copy()

submission['Price_in_euros'] = predictions_submit
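Copying `sample` works as long as its rows are in the same order as `test.csv` (the first five ids shown above do match). A sketch of an alternative that does not depend on that is to keep the ids from `test.csv` before dropping `laptop_ID` and build the frame directly. The ids and rounded predictions below are stand-ins for illustration, and `submission_demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Stand-ins for the real objects (assumptions): ids saved from test.csv and model output
test_ids = pd.Series([209, 1281, 1168], name="laptop_ID")
predictions = np.array([1280.93, 298.43, 431.86])

submission_demo = pd.DataFrame({
    "laptop_ID": test_ids.values,
    "Price_in_euros": predictions,
})

# One row per laptop, two columns, no missing predictions
assert submission_demo.shape == (len(test_ids), 2)
assert submission_demo["Price_in_euros"].notna().all()
print(submission_demo)
```

Either way, the result should still pass the shape and id checks in the next section.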



In [41]:
submission.head()

Unnamed: 0,laptop_ID,Price_in_euros
0,209,1280.934189
1,1281,298.425933
2,1168,431.859867
3,1231,1029.018533
4,1020,911.541533


In [42]:
submission.shape

(391, 2)

## 5. Run it through the CHECKER (`chequeador`) to confirm it is actually ready to upload to Kaggle.

In [43]:
def chequeador(df_to_submit):
    """
    This function makes sure your submission has the shape Kaggle requires.

    If it does, the dataframe is saved to a `csv` and is ready to upload to Kaggle.

    If it does not, READ THE MESSAGE AND DO WHAT IT SAYS.

    If it still fails:
    - turn off your computer,
    - go for a walk,
    - turn it back on,
    - open this notebook and
    - read it all again.
    We all deserve a second chance. So do you.
    """
    if df_to_submit.shape == sample.shape:
        # Compare element-wise first, then reduce with .all()
        if (df_to_submit.columns == sample.columns).all():
            if (df_to_submit.laptop_ID.values == sample.laptop_ID.values).all():
                print("You're ready to submit!")
                df_to_submit.to_csv("submission.csv", index = False)  # index = False is very important
                urllib.request.urlretrieve("https://www.mihaileric.com/static/evaluation-meme-e0a350f278a36346e6d46b139b1d0da0-ed51e.jpg", "gfg.png")
                img = Image.open("gfg.png")
                img.show()
            else:
                print("Check the ids and try again")
        else:
            print("Check the names of the columns and try again")
    else:
        print("Check the number of rows and/or columns and try again")
        print("\nSecret message from the TA: I can't believe that after this whole notebook you changed the rows of `test.csv`. I'm crying.")

In [44]:
chequeador(submission)

You're ready to submit!
