In [1]:
import warnings
warnings.simplefilter("ignore")

## Feature Transformation: Scaling the Data

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Many algorithms are sensitive to the scale of each feature. **Rescaling** the features can bring significant performance improvements.

There are several strategies for scaling your features, but the most common is standardization: each variable is shifted and rescaled so that it has mean 0 and standard deviation 1. Note that this changes only the location and spread of the distribution, not its shape, so a non-Gaussian feature remains non-Gaussian after standardization.
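As a minimal sketch of what standardization computes, the following compares the formula z = (x - mean) / std applied by hand against `StandardScaler`. The array `X_toy` is made-up illustrative data, not the movie dataset used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0],
                  [4.0, 7000.0]])

# Standardize by hand, column by column.
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses
X_manual = (X_toy - mean) / std

# Same transformation through scikit-learn.
X_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```

After fitting, the per-column mean and standard deviation are stored in the `mean_` and `scale_` attributes of the scaler.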



In [4]:
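from sklearn.model_selection import train_test_split

# Load the features and separate the target column
X = pd.read_csv('data/X.csv')
y = X['worldwide_gross']
X = X.drop('worldwide_gross',axis=1)
# Random split (no random_state is set, so results vary between runs)
X_train, X_test, y_train, y_test = train_test_split(X,y)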

In [5]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

In [6]:
scaler.mean_

array([3.31228606e+07, 2.00233646e+03, 2.12779861e+00, 1.08542573e+02,
       1.03420903e+04, 4.12511404e+07, 6.44476933e+00])

In [7]:
scaler.scale_

array([4.06390314e+07, 1.17760798e+01, 7.19803302e-01, 2.31936799e+01,
       1.96129644e+04, 2.28225577e+08, 1.06154515e+00])

In [8]:
X.values

array([[4.25000000e+08, 2.00900000e+03, 1.78000000e+00, ...,
        4.83400000e+03, 2.37000000e+08, 7.90000000e+00],
       [3.06000000e+08, 2.00213073e+03, 2.12697615e+00, ...,
        1.43000000e+02, 4.04553863e+07, 7.10000000e+00],
       [3.00000000e+08, 2.00700000e+03, 2.35000000e+00, ...,
        4.83500000e+04, 3.00000000e+08, 7.10000000e+00],
       ...,
       [7.00000000e+03, 2.00500000e+03, 2.12697615e+00, ...,
        9.30000000e+01, 3.25000000e+03, 7.80000000e+00],
       [3.96700000e+03, 2.01200000e+03, 2.35000000e+00, ...,
        2.38600000e+03, 4.04553863e+07, 6.30000000e+00],
       [1.10000000e+03, 2.00400000e+03, 1.85000000e+00, ...,
        1.63000000e+02, 1.10000000e+03, 6.60000000e+00]])

In [9]:
scaler.transform(X_train)

array([[ 1.44618096e-01,  1.41264582e-01,  3.08697378e-01, ...,
        -3.75776459e-01, -9.86366383e-03, -1.07839910e+00],
       [-6.18195360e-01,  3.11100387e-01,  3.08697378e-01, ...,
        -1.58828123e-01, -1.45694189e-01,  1.37086084e+00],
       [ 3.41473181e-01,  3.11100387e-01, -3.85936834e-01, ...,
        -1.77387275e-01,  2.51893749e-02, -1.83201754e+00],
       ...,
       [-8.07668380e-01,  4.80936192e-01, -3.85936834e-01, ...,
        -5.03855007e-01, -1.79870902e-01,  6.17242396e-01],
       [ 9.07431552e-01,  9.05525704e-01,  3.08697378e-01, ...,
         2.73589936e-01,  3.83342644e-02,  2.40433174e-01],
       [-8.08898724e-01, -2.83324931e-01, -1.14260870e-03, ...,
        -4.25998342e-01, -1.79651820e-01, -3.24780658e-01]])

In [10]:
X_train_scaled, X_test_scaled = (scaler.transform(X_train), scaler.transform(X_test))
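The scaler is fitted on the training set only and then applied to both splits, so no information from the test set leaks into the preprocessing. One way to make this harder to get wrong is to chain the scaler and the model in a pipeline. The sketch below uses synthetic data (`X_demo`, `y_demo`) and an arbitrary `alpha=0.01`; both are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): three features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo @ np.array([2.0, 0.05, 0.0003]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te before predicting.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
pipe.fit(X_tr, y_tr)
r2 = pipe.score(X_te, y_te)
print(r2)
```

The pipeline behaves like a single estimator, so it can be passed directly to utilities such as cross-validation without refitting the scaler by hand on each fold.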

In [11]:
from sklearn.linear_model import Lasso

model = Lasso()
model_scaled = Lasso()

model.fit(X_train,y_train)
model_scaled.fit(X_train_scaled,y_train)

Lasso()

In [12]:
print(model.score(X_test,y_test))
print(model_scaled.score(X_test_scaled,y_test))

0.5903459484718894
0.5903459490552403


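In this experiment, scaling makes practically no difference to the score. Ordinary least squares is insensitive to feature scale, and with the default `alpha=1.0` the Lasso penalty is likely negligible relative to the magnitude of the target, so the model behaves almost like an unregularized regression. This does not hold in general: models based on distances (such as k-nearest neighbors), on gradient descent, or on stronger regularization can be strongly affected by feature scale, whether they are used for regression or classification.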
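To illustrate a model that is sensitive to scale, here is a rough sketch with a k-nearest neighbors classifier on synthetic data. The dataset is invented for this example: one informative feature with unit scale and one pure-noise feature with a scale of about 10,000, so the noise dominates the distance computation unless the features are standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): the label depends only on the small-scale feature.
rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=n)
noise = rng.normal(size=n) * 1e4
X_syn = np.column_stack([informative, noise])
y_syn = (informative > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Without scaling, Euclidean distances are dominated by the noise feature.
acc_raw = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# With standardization, both features contribute comparably to the distance.
syn_scaler = StandardScaler().fit(X_tr)
knn_scaled = KNeighborsClassifier().fit(syn_scaler.transform(X_tr), y_tr)
acc_scaled = knn_scaled.score(syn_scaler.transform(X_te), y_te)

print(acc_raw, acc_scaled)
```

On data like this, the unscaled classifier should perform close to chance, while the scaled one recovers most of the signal.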