<a href="https://colab.research.google.com/github/julianchaux/MachineLearning/blob/master/2%20-%20Regresi%C3%B3n/2_2%20-%20Regresi%C3%B3n%20Lineal%20M%C3%BAltiple/Regresion_Lineal_Multiple_Manual_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MULTIPLE LINEAR REGRESSION

## Uses the Backward Elimination optimization method STEP BY STEP (manually)

# Clone the repository to obtain the datasets

In [1]:
!git clone https://github.com/julianchaux/MachineLearning.git

Cloning into 'MachineLearning'...
remote: Enumerating objects: 98, done.
remote: Counting objects: 100% (98/98), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 98 (delta 39), reused 46 (delta 11), pack-reused 0
Unpacking objects: 100% (98/98), done.


# Explore the folder

In [2]:
!ls '/content/MachineLearning'

'1 - PreProcesamiento de Datos'  '2 - Regresión'   README.md


# Import the libraries

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
print(f'The numpy version is {np.__version__}')
print(f'The pandas version is {pd.__version__}')

The numpy version is 1.19.5
The pandas version is 1.1.5


# Import the dataset

In [4]:
dataset = pd.read_csv('/content/MachineLearning/2 - Regresión/2_2 - Regresión Lineal Múltiple/50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

# Encode categorical data

In [5]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Column 3 holds the categorical variable: label-encode it, then one-hot encode it
labelencoder_X = LabelEncoder()
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
ct = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(categories='auto'), [3])],
    remainder='passthrough'
)
# np.float was deprecated in NumPy 1.20 and removed in 1.24; the builtin float is equivalent
X = np.array(ct.fit_transform(X), dtype=float)
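To make the effect of this cell concrete, here is a minimal sketch of one-hot encoding written with plain NumPy. The category names are made up for illustration, and this is not how `OneHotEncoder` is implemented; it only reproduces the kind of matrix it yields.

```python
import numpy as np

# Hypothetical categorical column (names are illustrative only)
states = np.array(['A', 'B', 'C', 'A'])

# np.unique returns the sorted categories and, for each row, its category index
categories, codes = np.unique(states, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]
print(categories)
print(one_hot)
```

Each row contains a single 1 in the column of its category, so rows 0 and 3 (both `'A'`) are identical.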

# Avoid the dummy variable trap

In [6]:
X = X[:, 1:]
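Why drop one dummy column? A minimal NumPy sketch, using a made-up three-category example: when every dummy column is kept alongside an intercept column of ones, the dummy columns add up to the intercept column, so the design matrix loses full column rank. Dropping the first dummy, as `X[:, 1:]` does, restores it.

```python
import numpy as np

# Hypothetical category index for six rows, three categories
codes = np.array([0, 1, 2, 0, 1, 2])
dummies = np.eye(3)[codes]          # one dummy column per category
ones = np.ones((6, 1))              # intercept column

full = np.hstack([ones, dummies])            # intercept + all 3 dummies
reduced = np.hstack([ones, dummies[:, 1:]])  # intercept + 2 dummies

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(rank_full, 'of', full.shape[1], 'columns are independent')
print(rank_reduced, 'of', reduced.shape[1], 'columns are independent')
```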

# Split the dataset into a training set and a test set

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
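With 50 rows and `test_size = 0.2`, this call holds out 10 rows for testing and keeps 40 for training. The sketch below imitates that idea with a NumPy permutation on placeholder data. It illustrates a shuffled hold-out split in general, it is not scikit-learn's implementation, and it will not select the same rows as `random_state = 0` does there.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, test_size = 50, 0.2

# Placeholder data standing in for X and y
X_demo = rng.normal(size=(n_rows, 5))
y_demo = rng.normal(size=n_rows)

# Shuffle the row indices, then cut them into a test part and a training part
indices = rng.permutation(n_rows)
n_test = int(round(test_size * n_rows))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]
print(X_train_demo.shape, X_test_demo.shape)
```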

# Fit the Multiple Linear Regression model to the training set

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [10]:
regression = LinearRegression()
regression.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### Find the coefficients of the regression model

In [11]:
print(f'The coefficients of the independent variables are {regression.coef_}')

The coefficients of the independent variables are [-9.59284160e+02  6.99369053e+02  7.73467193e-01  3.28845975e-02
  3.66100259e-02]


In [12]:
print(f'The intercept of the model with the Y axis is {regression.intercept_}')

The intercept of the model with the Y axis is 42554.167617767


### The resulting multiple regression model is:

$$\tilde{y} = -9.59284160\times 10^{2}\cdot x_{1} + 6.99369053\times 10^{2}\cdot x_{2} + 7.73467193\times 10^{-1}\cdot x_{3} + 3.28845975\times 10^{-2}\cdot x_{4} + 3.66100259\times 10^{-2}\cdot x_{5} + 42554.167617767 $$
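
The equation can be evaluated directly as a dot product plus the intercept. The sketch below copies the coefficients and intercept printed above; the input row `x_new` is invented purely to show the arithmetic and does not come from the dataset.

```python
import numpy as np

# Coefficients and intercept as printed by the fitted model above
coef = np.array([-9.59284160e+02, 6.99369053e+02, 7.73467193e-01,
                 3.28845975e-02, 3.66100259e-02])
intercept = 42554.167617767

# Hypothetical input: two dummy columns followed by three numeric features
x_new = np.array([0.0, 1.0, 100000.0, 120000.0, 200000.0])

# y = b1*x1 + ... + b5*x5 + b0
y_hat = coef @ x_new + intercept
print(y_hat)
```

For a fitted model, `regression.predict` performs this same computation for every row passed to it.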

### Mean squared error (MSE) of *y_train* relative to the *y* predicted by the model

$$MSE = \frac{\sum_{i=1}^{n}(y_i - \tilde{y}_i)^2}{n}$$

In [13]:
mean_squared_error(y_train, regression.predict(X_train))

81571001.80077371
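
The MSE formula translates almost literally into NumPy. A minimal sketch on made-up values (not the startup data), which should agree with `mean_squared_error` for the same inputs:

```python
import numpy as np

# Made-up observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_model = np.array([2.5, 5.5, 7.0, 10.0])

# Sum of squared differences divided by the number of observations
mse = np.sum((y_true - y_model) ** 2) / len(y_true)
print(mse)  # 0.375
```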

### Coefficient of determination ($R^2$) of *y_train* relative to the *y* predicted by the model

In [14]:
r2_score(y_train, regression.predict(X_train))

0.9501847627493607
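
For reference, the coefficient of determination compares the residual sum of squares with the total sum of squares around the mean, $R^2 = 1 - SS_{res}/SS_{tot}$. A small sketch on made-up values (not the startup data):

```python
import numpy as np

# Made-up observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_model = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_model) ** 2)        # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot
print(r2)
```

A value close to 1 means the model leaves little of the variation in *y* unexplained.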

# Predict the data (Multiple Linear Regression) with the test set

In [15]:
y_pred = regression.predict(X_test)

### Mean squared error (MSE) of *y_test* relative to the *y* predicted by the model

In [16]:
mean_squared_error(y_test, y_pred)

83502864.03250548

### Coefficient of determination ($R^2$) of *y_test* relative to the *y* predicted by the model

In [17]:
r2_score(y_test, y_pred)

0.9347068473282987

# --------------------------------------------------------------------------------------------------------------

# Build the optimal Multiple Linear Regression model using Backward Elimination STEP BY STEP

![picture](https://github.com/julianchaux/MachineLearning/blob/master/2%20-%20Regresi%C3%B3n/2_2%20-%20Regresi%C3%B3n%20Lineal%20M%C3%BAltiple/Eliminacion%20hacia%20atras.png?raw=true)

In [18]:
import statsmodels.api as sm



### Add a column of ones (1) to account for the intercept term b0; it is prepended to X

In [19]:
X = np.append(arr = np.ones((50, 1)).astype(int), values = X, axis = 1)
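
The hard-coded `50` ties this cell to the size of this particular dataset. A slightly more general sketch, using a placeholder matrix `X_demo` in place of the real `X`, reads the row count from the array itself:

```python
import numpy as np

# Placeholder feature matrix standing in for X
X_demo = np.arange(12, dtype=float).reshape(4, 3)

# One row of ones per observation, placed in front of the existing columns
ones = np.ones((X_demo.shape[0], 1))
X_with_const = np.hstack([ones, X_demo])
print(X_with_const)
```

statsmodels also offers `sm.add_constant(X)`, which prepends a constant column by default.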

### Step 1: Set the significance level SL

In [20]:
SL = 0.05

### Step 2: Start the optimization with all the independent variables

In [21]:
X_opt = X[:, [0, 1, 2, 3, 4, 5]].tolist()

### Create the OLS (Ordinary Least Squares) model with all the predictor variables

In [22]:
regression_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regression_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Thu, 15 Jul 2021 | Prob (F-statistic): | 1.34e-27 |
| Time: | 04:52:55 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 44 | BIC: | 1074.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.013e+04 | 6884.820 | 7.281 | 0.000 | 3.62e+04 | 6.4e+04 |
| x1 | 198.7888 | 3371.007 | 0.059 | 0.953 | -6595.030 | 6992.607 |
| x2 | -41.8870 | 3256.039 | -0.013 | 0.990 | -6604.003 | 6520.229 |
| x3 | 0.8060 | 0.046 | 17.369 | 0.000 | 0.712 | 0.900 |
| x4 | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| x5 | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.782 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.266 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 1450000.0 |
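
Reading the `P>|t|` column of this summary by eye is exactly what the manual procedure asks for. The same decision can be sketched in code; the p-values below are copied from the summary above, and the list `names` exists only for illustration. The intercept is left out of the search because it is never a candidate for removal here.

```python
import numpy as np

SL = 0.05
names = ['const', 'x1', 'x2', 'x3', 'x4', 'x5']
p_values = np.array([0.000, 0.953, 0.990, 0.000, 0.608, 0.123])

# Position of the largest p-value among the predictors (skip the intercept at 0)
worst = 1 + int(np.argmax(p_values[1:]))

if p_values[worst] > SL:
    decision = f'remove {names[worst]} (p = {p_values[worst]:.3f})'
else:
    decision = 'stop: every remaining variable is significant'
print(decision)  # remove x2 (p = 0.990)
```

With a fitted statsmodels result, the same array of p-values is available as `regression_OLS.pvalues`.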


### Steps 3 and 4: Remove the independent variable x2, which has the largest p-value (0.990, greater than SL), and refit the model

In [23]:
X_opt = X[:, [0, 1, 3, 4, 5]].tolist()
regression_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regression_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.946 |
| Method: | Least Squares | F-statistic: | 217.2 |
| Date: | Thu, 15 Jul 2021 | Prob (F-statistic): | 8.49e-29 |
| Time: | 04:53:04 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1061.0 |
| Df Residuals: | 45 | BIC: | 1070.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.011e+04 | 6647.870 | 7.537 | 0.000 | 3.67e+04 | 6.35e+04 |
| x1 | 220.1585 | 2900.536 | 0.076 | 0.940 | -5621.821 | 6062.138 |
| x2 | 0.8060 | 0.046 | 17.606 | 0.000 | 0.714 | 0.898 |
| x3 | -0.0270 | 0.052 | -0.523 | 0.604 | -0.131 | 0.077 |
| x4 | 0.0270 | 0.017 | 1.592 | 0.118 | -0.007 | 0.061 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.758 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.172 |
| Skew: | -0.948 | Prob(JB): | 2.53e-05 |
| Kurtosis: | 5.563 | Cond. No. | 1400000.0 |


### Steps 3 and 4: Remove the independent variable x1, which has the largest p-value (0.940, greater than SL), and refit the model

In [24]:
X_opt = X[:, [0, 3, 4, 5]].tolist()
regression_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regression_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 296.0 |
| Date: | Thu, 15 Jul 2021 | Prob (F-statistic): | 4.53e-30 |
| Time: | 04:53:19 | Log-Likelihood: | -525.39 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 46 | BIC: | 1066.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.012e+04 | 6572.353 | 7.626 | 0.000 | 3.69e+04 | 6.34e+04 |
| x1 | 0.8057 | 0.045 | 17.846 | 0.000 | 0.715 | 0.897 |
| x2 | -0.0268 | 0.051 | -0.526 | 0.602 | -0.130 | 0.076 |
| x3 | 0.0272 | 0.016 | 1.655 | 0.105 | -0.006 | 0.060 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.838 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.442 |
| Skew: | -0.949 | Prob(JB): | 2.21e-05 |
| Kurtosis: | 5.586 | Cond. No. | 1400000.0 |


### Steps 3 and 4: Remove the independent variable x4 (column 4 of X, listed as x2 in the previous summary), which has the largest p-value (0.602, greater than SL), and refit the model

In [25]:
X_opt = X[:, [0, 3, 5]].tolist()
regression_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regression_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.950 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 450.8 |
| Date: | Thu, 15 Jul 2021 | Prob (F-statistic): | 2.16e-31 |
| Time: | 04:54:09 | Log-Likelihood: | -525.54 |
| No. Observations: | 50 | AIC: | 1057.0 |
| Df Residuals: | 47 | BIC: | 1063.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.698e+04 | 2689.933 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| x1 | 0.7966 | 0.041 | 19.266 | 0.000 | 0.713 | 0.880 |
| x2 | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.677 | Durbin-Watson: | 1.257 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.161 |
| Skew: | -0.939 | Prob(JB): | 2.54e-05 |
| Kurtosis: | 5.575 | Cond. No. | 532000.0 |


### Steps 3 and 4: Remove the independent variable x5 (column 5 of X, listed as x2 in the previous summary), which has the largest p-value (0.060, greater than SL), and refit the model

In [26]:
X_opt = X[:, [0, 3]].tolist()
regression_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regression_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 849.8 |
| Date: | Thu, 15 Jul 2021 | Prob (F-statistic): | 3.50e-32 |
| Time: | 04:54:34 | Log-Likelihood: | -527.44 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 48 | BIC: | 1063.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.903e+04 | 2537.897 | 19.320 | 0.000 | 4.39e+04 | 5.41e+04 |
| x1 | 0.8543 | 0.029 | 29.151 | 0.000 | 0.795 | 0.913 |

| | | | |
|---|---|---|---|
| Omnibus: | 13.727 | Durbin-Watson: | 1.116 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 18.536 |
| Skew: | -0.911 | Prob(JB): | 9.44e-05 |
| Kurtosis: | 5.361 | Cond. No. | 165000.0 |


# CONCLUSION: In this case, R&D is the most statistically significant variable and the only predictor that remains after backward elimination with SL = 0.05