# T05 - Motor Trend Car Road Tests

|                |   |
|:---------------|---|
| **Name**       | Jesús Emmanuel Flores Cortés  |
| **Date**       | 18/09/2025  |
| **Student ID** | 751571  |

## Libraries and data

In [32]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import r2_score
import statsmodels.api as sm

In [33]:
df = pd.read_excel("Motor Trend Car Road Tests.xlsx")
df.head()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


## 1.1

In [34]:
df_num = df.drop(columns=["model"]).copy()  
X = df_num.drop(columns=["mpg"])
y = df_num["mpg"]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [35]:
X_sm = sm.add_constant(X_scaled)
ols_mpg = sm.OLS(y.values, X_sm).fit()
print(ols_mpg.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.869
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     13.93
Date:                Thu, 18 Sep 2025   Prob (F-statistic):           3.79e-07
Time:                        23:07:18   Log-Likelihood:                -69.855
No. Observations:                  32   AIC:                             161.7
Df Residuals:                      21   BIC:                             177.8
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         20.0906      0.468     42.884      0.0
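
As a quick sanity check on the summary above, the adjusted R² can be recomputed from the reported R², the number of observations, and `Df Model`. This is a minimal sketch using the standard formula for a model with an intercept; the helper name `adjusted_r2` is hypothetical, and the inputs are the rounded values printed in the summary.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 for a linear model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read off the summary: R^2 = 0.869, 32 observations, Df Model = 10
adj = adjusted_r2(0.869, n_obs=32, n_predictors=10)
print(round(adj, 3))  # 0.807, matching the reported Adj. R-squared
```

The penalty grows with the number of predictors, which is why the adjusted value sits noticeably below the raw R² when 10 predictors are fitted on only 32 observations.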

## 1.1.2

We train on 40% of the data and hold out the remaining 60% for testing.

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y.values, train_size=0.4, random_state=137)

In [37]:
lm = LinearRegression().fit(X_train, y_train)
y_train_pred = lm.predict(X_train)
y_test_pred = lm.predict(X_test)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)
print(f"R2 entrenamiento (lineal): {r2_train:.4f}")
print(f"R2 prueba (lineal):       {r2_test:.4f}")

R2 entrenamiento (lineal): 0.9901
R2 prueba (lineal):       -54.0718
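
A negative test R² such as the one above is possible because R² compares the model against always predicting the mean of the observed values: it is zero when the model does no better than that baseline and negative when it does worse. The toy arrays below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_pred = np.full_like(y_true, y_true.mean())  # baseline: always predict the mean
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])        # predictions in reversed order

r2_mean = r2_score(y_true, mean_pred)  # 0.0: no better than the baseline
r2_bad = r2_score(y_true, bad_pred)    # -3.0: residual sum of squares is 4x the total

# Same quantity computed by hand: 1 - SS_res / SS_tot
ss_res = np.sum((y_true - bad_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot
print(r2_mean, r2_bad, r2_manual)
```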


## 1.1.3

We add L2 regularization (ridge regression).

In [38]:
alphas = [0.1, 1, 10, 100]
r2_results = []

In [39]:
for a in alphas:
    model_ridge = Ridge(alpha=a).fit(X_train, y_train)
    r2_tr = r2_score(y_train, model_ridge.predict(X_train))
    r2_te = r2_score(y_test, model_ridge.predict(X_test))
    r2_results.append({"alpha": a, "r2_train": r2_tr, "r2_test": r2_te})

r2_df_mpg_num = pd.DataFrame(r2_results)
print(r2_df_mpg_num)

   alpha  r2_train   r2_test
0    0.1  0.980674  0.233276
1    1.0  0.936103  0.619246
2   10.0  0.849823  0.779531
3  100.0  0.474303  0.270727


With alpha = 1 and alpha = 10 the train and test R² are closer together; alpha = 10 gives the smallest gap (0.850 vs. 0.780) and also the highest test R².
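
The comparison can be made explicit by adding a gap column to the results table. The sketch below retypes the values printed above into a new DataFrame (the name `ridge_scores` is introduced here for illustration) so that it runs on its own.

```python
import pandas as pd

# R^2 values copied from the ridge results printed above
ridge_scores = pd.DataFrame({
    "alpha": [0.1, 1.0, 10.0, 100.0],
    "r2_train": [0.980674, 0.936103, 0.849823, 0.474303],
    "r2_test": [0.233276, 0.619246, 0.779531, 0.270727],
})

# Train/test gap: a smaller gap suggests less overfitting
ridge_scores["gap"] = ridge_scores["r2_train"] - ridge_scores["r2_test"]

best_by_gap = ridge_scores.loc[ridge_scores["gap"].idxmin(), "alpha"]
best_by_test = ridge_scores.loc[ridge_scores["r2_test"].idxmax(), "alpha"]
print(ridge_scores)
print(best_by_gap, best_by_test)  # both criteria select alpha = 10.0
```

A small gap alone is not sufficient (alpha = 100 has a moderate gap but poor scores overall), so it helps to look at the test R² alongside it.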

## 1.2

Now with `qsec` as the target.

In [40]:
y_q = df_num["qsec"]
X_q = X.copy()  # unused below; note that X (and therefore X_scaled) still contains the qsec column
X_sm_q = sm.add_constant(X_scaled)
ols_qsec = sm.OLS(y_q.values, X_sm_q).fit()
print(ols_qsec.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.518e+29
Date:                Thu, 18 Sep 2025   Prob (F-statistic):          1.12e-300
Time:                        23:07:19   Log-Likelihood:                 999.73
No. Observations:                  32   AIC:                            -1977.
Df Residuals:                      21   BIC:                            -1961.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.8488   1.43e-15   1.25e+16      0.0

In [41]:
Xq_train, Xq_test, yq_train, yq_test = train_test_split(X_scaled, y_q.values, train_size=0.4, random_state=42)

lm_q = LinearRegression().fit(Xq_train, yq_train)
r2_q_train = r2_score(yq_train, lm_q.predict(Xq_train))
r2_q_test = r2_score(yq_test, lm_q.predict(Xq_test))
print(f"R2 entrenamiento (qsec) : {r2_q_train:.4f}")
print(f"R2 prueba (qsec)       : {r2_q_test:.4f}")

R2 entrenamiento (qsec) : 1.0000
R2 prueba (qsec)       : 1.0000


In [42]:
r2_results_q = []
for a in alphas:
    model_ridge_q = Ridge(alpha=a).fit(Xq_train, yq_train)
    r2_tr_q = r2_score(yq_train, model_ridge_q.predict(Xq_train))
    r2_te_q = r2_score(yq_test, model_ridge_q.predict(Xq_test))
    r2_results_q.append({"alpha": a, "r2_train": r2_tr_q, "r2_test": r2_te_q})

r2_df_qsec_num = pd.DataFrame(r2_results_q)
print(r2_df_qsec_num)

   alpha  r2_train   r2_test
0    0.1  0.999025  0.988051
1    1.0  0.987910  0.948937
2   10.0  0.900947  0.826914
3  100.0  0.451942  0.382023


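The results change drastically: every R² is much higher than for `mpg`, and the smallest train/test gap is now at alpha = 0.1. These near-perfect scores should be read with caution, because `X_scaled` was built from `X`, which still contains the `qsec` column, so the target is among the predictors.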
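
To illustrate why an R² of exactly 1.0 deserves a second look, the sketch below fits a linear model on synthetic noise, once with the target left among the predictors and once with it removed. The DataFrame `toy` and its column names are hypothetical and unrelated to the car dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Three independent noise columns; "target" plays the role of the response
toy = pd.DataFrame(rng.normal(size=(32, 3)), columns=["a", "b", "target"])

y = toy["target"]
X_leaky = toy[["a", "b", "target"]]     # target is still among the predictors
X_clean = toy.drop(columns=["target"])  # target removed before fitting

r2_leaky = LinearRegression().fit(X_leaky, y).score(X_leaky, y)
r2_clean = LinearRegression().fit(X_clean, y).score(X_clean, y)
print(f"with target in X:    {r2_leaky:.4f}")
print(f"without target in X: {r2_clean:.4f}")
```

For the notebook's data, the analogous change would be to drop both `mpg` and `qsec` when building the predictor matrix for the `qsec` model; the resulting scores would have to be recomputed.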

## 2.1

We convert `cyl`, `gear`, and `carb` to dummy variables.

In [43]:
df_dum = df.drop(columns=["model"]).copy()
df_dum = pd.get_dummies(df_dum, columns=["cyl", "gear", "carb"], drop_first=True)
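
For reference, this is what `pd.get_dummies` with `drop_first=True` does on a small made-up frame: each listed column is replaced by one indicator column per category, except the first category, which becomes the baseline. The frame `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"cyl": [4, 6, 8, 6], "hp": [93, 110, 175, 105]})

# cyl has categories 4, 6, 8; drop_first=True drops the indicator for 4
encoded = pd.get_dummies(toy, columns=["cyl"], drop_first=True)
print(list(encoded.columns))  # ['hp', 'cyl_6', 'cyl_8']
```

Dropping the first level avoids perfect collinearity between the full set of indicators and the intercept.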


In [44]:
X_dum = df_dum.drop(columns=["mpg"])
y_dum_mpg = df_dum["mpg"]

In [45]:
scaler2 = StandardScaler()
X_dum_scaled = scaler2.fit_transform(X_dum)

In [46]:
X_dum_sm = sm.add_constant(X_dum_scaled)
ols_mpg_dum = sm.OLS(y_dum_mpg.values, X_dum_sm).fit()
print(ols_mpg_dum.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.893
Model:                            OLS   Adj. R-squared:                  0.779
Method:                 Least Squares   F-statistic:                     7.830
Date:                Thu, 18 Sep 2025   Prob (F-statistic):           0.000124
Time:                        23:07:19   Log-Likelihood:                -66.608
No. Observations:                  32   AIC:                             167.2
Df Residuals:                      15   BIC:                             192.1
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         20.0906      0.501     40.114      0.0

We train the model.

In [47]:
Xdm_train, Xdm_test, ydm_train, ydm_test = train_test_split(X_dum_scaled, y_dum_mpg.values, train_size=0.4, random_state=137)

lm_dum = LinearRegression().fit(Xdm_train, ydm_train)
r2_dum_tr = r2_score(ydm_train, lm_dum.predict(Xdm_train))
r2_dum_te = r2_score(ydm_test, lm_dum.predict(Xdm_test))
print(f"R2 train (mpg con dummies): {r2_dum_tr:.4f}")
print(f"R2 test  (mpg con dummies): {r2_dum_te:.4f}")

R2 train (mpg con dummies): 1.0000
R2 test  (mpg con dummies): 0.1097


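The model is very poor: it fits the training set perfectly (R² = 1.0000) but reaches only R² = 0.1097 on the test set, a clear sign of overfitting.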
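
A perfect training fit is expected whenever there are at least as many predictors as training rows. The sketch below reproduces this with pure noise, assuming 16 predictors (the `Df Model` reported in the summary) and 12 training rows (an assumption: 40% of 32, rounded down).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_train, n_features = 12, 16  # assumed sizes, see the note above

# Predictors and response are independent noise: there is nothing to learn
X_noise = rng.normal(size=(n_train, n_features))
y_noise = rng.normal(size=n_train)

model = LinearRegression().fit(X_noise, y_noise)
r2_noise_train = r2_score(y_noise, model.predict(X_noise))
print(f"train R2 on pure noise: {r2_noise_train:.4f}")
```

Because the model can interpolate the training points exactly, the training R² carries no information about generalization in this regime.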

## 2.2

Now with `qsec`.

In [48]:
y_dum_q = df_dum["qsec"]

X_dum_sm_q = sm.add_constant(X_dum_scaled)
ols_q_dum = sm.OLS(y_dum_q.values, X_dum_sm_q).fit()
print(ols_q_dum.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.173e+28
Date:                Thu, 18 Sep 2025   Prob (F-statistic):          5.06e-211
Time:                        23:07:19   Log-Likelihood:                 987.59
No. Observations:                  32   AIC:                            -1941.
Df Residuals:                      15   BIC:                            -1916.
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.8488   2.47e-15   7.23e+15      0.0

We train the model.

In [49]:
Xdq_train, Xdq_test, ydq_train, ydq_test = train_test_split(X_dum_scaled, y_dum_q.values, train_size=0.4, random_state=137)

lm_q_dum = LinearRegression().fit(Xdq_train, ydq_train)
r2_q_dum_tr = r2_score(ydq_train, lm_q_dum.predict(Xdq_train))
r2_q_dum_te = r2_score(ydq_test, lm_q_dum.predict(Xdq_test))
print(f"R2 train (qsec con dummies): {r2_q_dum_tr:.4f}")
print(f"R2 test  (qsec con dummies): {r2_q_dum_te:.4f}")

R2 train (qsec con dummies): 1.0000
R2 test  (qsec con dummies): 0.9964


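The model appears to be very good, since the train and test R² barely differ. Note, however, that `qsec` is still one of the columns of `X_dum`, so the target is among the predictors.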

## 3.1

In [50]:
summary_cmp = pd.DataFrame({
    "modelo": ["mpg_num_sin_dummies", "mpg_con_dummies"],
    "r2_train": [r2_train, r2_dum_tr]
})
summary_cmp

Unnamed: 0,modelo,r2_train
0,mpg_num_sin_dummies,0.990081
1,mpg_con_dummies,1.0


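Judging by training R² alone, the models with and without dummies look practically the same (0.990 vs. 1.000). Their test R² values differ widely, however (-54.07 vs. 0.1097), so training R² is not enough to compare them.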

## 3.2

In [51]:
summary_cmp_q = pd.DataFrame({
    "modelo": ["qsec_num_sin_dummies", "qsec_con_dummies"],
    "r2_train": [r2_q_train, r2_q_dum_tr]
})
summary_cmp_q

Unnamed: 0,modelo,r2_train
0,qsec_num_sin_dummies,1.0
1,qsec_con_dummies,1.0


Here they are exactly equal: both models reach a training R² of 1.0.