# Multiple linear regression with Scikit-learn's LinearRegression class

We carry out the analysis in the same way as in the previous case, this time using the LinearRegression class from the Scikit-learn library.

First, we load the modules and the dataset as before.

In [1]:
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
import pandas as pd
import time
import sys

In [2]:
datos = pd.read_csv("./insurance.csv")
datos.head()

   age     sex     bmi  children  smoker     region      charges
0   19  female  27.900         0     yes  southwest  16884.92400
1   18    male  33.770         1      no  southeast   1725.55230
2   28    male  33.000         3      no  southeast   4449.46200
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male  28.880         0      no  northwest   3866.85520


In [3]:

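# Encode the binary categorical columns as 0/1. Assigning the result back
# avoids the chained in-place replace, which newer pandas versions warn about.
datos['sex'] = datos['sex'].map({'female': 0, 'male': 1})
datos['smoker'] = datos['smoker'].map({'no': 0, 'yes': 1})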

datos.head()

   age  sex     bmi  children  smoker     region      charges
0   19    0  27.900         0       1  southwest  16884.92400
1   18    1  33.770         1       0  southeast   1725.55230
2   28    1  33.000         3       0  southeast   4449.46200
3   33    1  22.705         0       0  northwest  21984.47061
4   32    1  28.880         0       0  northwest   3866.85520


In [4]:
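# One-hot encode the remaining categorical column (region).
# dtype=int keeps the 0/1 output shown below; pandas 2.x returns booleans by default.
datos = pd.get_dummies(datos, dtype=int)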
datos.head()

   age  sex     bmi  children  smoker      charges  region_northeast  region_northwest  region_southeast  region_southwest
0   19    0  27.900         0       1  16884.92400                 0                 0                 0                 1
1   18    1  33.770         1       0   1725.55230                 0                 0                 1                 0
2   28    1  33.000         3       0   4449.46200                 0                 0                 1                 0
3   33    1  22.705         0       0  21984.47061                 0                 1                 0                 0
4   32    1  28.880         0       0   3866.85520                 0                 1                 0                 0


In [5]:
# Drop the target and one region dummy (region_northeast acts as the reference category)
X = datos.drop(['charges', 'region_northeast'], axis=1)
X.head()

   age  sex     bmi  children  smoker  region_northwest  region_southeast  region_southwest
0   19    0  27.900         0       1                 0                 0                 1
1   18    1  33.770         1       0                 0                 1                 0
2   28    1  33.000         3       0                 0                 1                 0
3   33    1  22.705         0       0                 1                 0                 0
4   32    1  28.880         0       0                 1                 0                 0
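
Dropping region_northeast leaves three region dummies, so the four region columns can no longer sum to the intercept column (perfect collinearity). As a minimal sketch, assuming a small hypothetical frame named toy (not the insurance data), the same encoding can be obtained in one step with the drop_first option of pd.get_dummies:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration (not the insurance data).
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "region": ["southwest", "southeast", "northeast", "northwest"],
})

# drop_first=True drops the first category in sorted order (here "northeast"),
# which then becomes the reference level; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With this option the reference category is always the first one in sorted order; dropping a column by name, as done above, lets you choose it explicitly.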


In [6]:
Y = datos['charges']
Y.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

In this case we extract the underlying values from the Series Y and the DataFrame X (LinearRegression takes array-like inputs; here we pass plain NumPy arrays).

In [7]:
Y = np.array(Y)
X = X.values

In [8]:
type(Y), type(X)

(numpy.ndarray, numpy.ndarray)

In [9]:
t1=time.time()

Finally, we create the model.

In [10]:
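# The normalize argument was removed in scikit-learn 1.2; for ordinary
# least squares it does not change the fitted coefficients, so it is omitted.
rl = LinearRegression()     # create the model
modelo_rl = rl.fit(X, Y)    # fit the model (fit() returns the estimator itself)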
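
If feature scaling is wanted in a recent scikit-learn release, where LinearRegression no longer accepts normalize, one commonly suggested route is a pipeline with StandardScaler. The sketch below uses synthetic data (the names X_demo and y_demo are assumptions, not part of the notebook) and checks that, for ordinary least squares, scaling leaves the predictions unchanged:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))
y_demo = 3.0 + X_demo @ np.array([2.0, -0.5, 0.01]) + rng.normal(size=200)

# Plain OLS versus OLS on standardized features.
plain = LinearRegression().fit(X_demo, y_demo)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Both models span the same set of linear functions, so predictions agree.
same = np.allclose(plain.predict(X_demo), scaled.predict(X_demo))
print(same)
```

Scaling matters once a penalty is added (for example Ridge or Lasso); for an unpenalized fit it mostly affects how the coefficients are read.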

In [11]:
# Execution time study

t2 = time.time()
tiempo = float(t2 - t1)

print('Linear regression using Scikit-learn')
print("Execution time: {} seconds".format(tiempo))

Linear regression using Scikit-learn
Execution time: 0.05768013000488281 seconds
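
The figure above covers everything that ran between the two time.time() calls, spread over several cells. As a sketch under stated assumptions (synthetic data of arbitrary size, hypothetical names X_demo, y_demo and model_demo), the fit alone can be timed with time.perf_counter(), which is intended for measuring short durations:

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: 1000 rows, 8 features; not the insurance data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(size=1000)

# Time only the call to fit(), in a single cell.
start = time.perf_counter()
model_demo = LinearRegression().fit(X_demo, y_demo)
elapsed = time.perf_counter() - start

print(f"fit() took {elapsed:.6f} seconds")
```

A single run is noisy; repeating the measurement several times (for instance with the timeit module) gives a more stable estimate.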


In [12]:
# Shallow size in bytes of the model object (the arrays it references are not counted)
print(sys.getsizeof(modelo_rl), 'bytes')

56 bytes
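
sys.getsizeof reports only the size of the estimator object itself, not of the fitted arrays it holds, which is why the value is so small. A rough alternative, sketched here on synthetic data with hypothetical names, is to use the length of the pickled model as a proxy for its serialized size:

```python
import pickle
import sys

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the insurance data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo.sum(axis=1)
model_demo = LinearRegression().fit(X_demo, y_demo)

shallow = sys.getsizeof(model_demo)          # the estimator object only
serialized = len(pickle.dumps(model_demo))   # includes coef_, intercept_ and parameters
print(shallow, "bytes (shallow) vs", serialized, "bytes (pickled)")
```

The pickled length is not the in-memory footprint either, but it does grow with the number of coefficients, so it is a fairer basis for comparing models.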


To get the value of the constant term (the intercept) we use the intercept_ attribute.

In [13]:
intercept = rl.intercept_
intercept

-11938.538576167215

To get the value of R^2 we use the score() method, evaluated here on the same data used for fitting.

In [14]:
score = rl.score(X, Y)
score

0.7509130345985208
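
For a regressor, score() returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below, on synthetic data with hypothetical names, recomputes it by hand and compares it with score():

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the insurance data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.7, 3.0]) + rng.normal(scale=0.5, size=300)

model_demo = LinearRegression().fit(X_demo, y_demo)
y_hat = model_demo.predict(X_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# The two values should agree up to floating-point error.
print(r2_manual, model_demo.score(X_demo, y_demo))
```

Because the score is computed on the training data, it tends to be optimistic; a held-out test set gives a better idea of how the model generalizes.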

Now we obtain the coefficients, one per column of X and in the same order.


In [15]:
coef = rl.coef_
coef

array([  256.85635254,  -131.3143594 ,   339.19345361,   475.50054515,
       23848.53454191,  -352.96389942, -1035.02204939,  -960.0509913 ])
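
The fitted model is the intercept plus the dot product of a row of X with the coefficients. As a worked check, the snippet below copies the intercept, the coefficients and the first row of X from the outputs in this notebook and evaluates that expression by hand (the variable names x0 and y0_hat are introduced here for illustration):

```python
import numpy as np

# Values copied from the outputs shown in this notebook.
intercept = -11938.538576167215
coef = np.array([256.85635254, -131.3143594, 339.19345361, 475.50054515,
                 23848.53454191, -352.96389942, -1035.02204939, -960.0509913])

# First row of X: age, sex, bmi, children, smoker,
# region_northwest, region_southeast, region_southwest
x0 = np.array([19, 0, 27.9, 0, 1, 0, 0, 1])

# Linear model: y_hat = intercept + sum_j coef[j] * x[j]
y0_hat = intercept + x0 @ coef
print(round(float(y0_hat), 2))  # about 25293.71
```

This is the value that predict() returns for the first row further down in the notebook. The smoker coefficient dominates: being a smoker adds roughly 23,849 to the predicted charges, all else equal.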

In [16]:
# Execution time study (elapsed since t1, so it includes the cells run after the fit)

t2 = time.time()
tiempo = float(t2 - t1)

print('Linear regression using Scikit-learn')
print("Execution time: {} seconds".format(tiempo))

Linear regression using Scikit-learn
Execution time: 0.0954580307006836 seconds


We predict the values with the predict() method.

In [17]:
new_data = datos.drop('region_northeast', axis = 1)
new_data.head()

   age  sex     bmi  children  smoker      charges  region_northwest  region_southeast  region_southwest
0   19    0  27.900         0       1  16884.92400                 0                 0                 1
1   18    1  33.770         1       0   1725.55230                 0                 1                 0
2   28    1  33.000         3       0   4449.46200                 0                 1                 0
3   33    1  22.705         0       0  21984.47061                 1                 0                 0
4   32    1  28.880         0       0   3866.85520                 1                 0                 0


In [18]:
charges_real= new_data.pop('charges')

In [19]:
charges_real.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

In [20]:
# Make the predictions
predicciones_charges = rl.predict(new_data.values)
predicciones_df = pd.DataFrame(predicciones_charges, columns=['Pred_Charges'])
predicciones_df.head()  # predictions for the first 5 rows


   Pred_Charges
0  25293.713028
1   3448.602834
2   6706.988491
3   3754.830163
4   5592.493386
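
To see how far these predictions are from the observed charges, the residuals for the first five rows can be computed directly. The sketch below copies the five observed and predicted values from the outputs above; the names residuals and mae_first5 are introduced here for illustration, and five rows are not representative of the whole dataset:

```python
import numpy as np

# First five observed and predicted charges, copied from the outputs above.
charges_real = np.array([16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552])
charges_pred = np.array([25293.713028, 3448.602834, 6706.988491, 3754.830163, 5592.493386])

residuals = charges_real - charges_pred      # positive means the model under-predicts
mae_first5 = np.mean(np.abs(residuals))      # mean absolute error on these 5 rows only

print(np.round(residuals, 2))
print(round(float(mae_first5), 2))
```

The largest gap is in row 3, a non-smoker with high observed charges, which the model under-predicts by a wide margin.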
