<img src="https://i.imgur.com/JDsOpVN.png" style="float: left; margin: 20px; height: 290px">

# Model Tuning

---
Predicting House Prices with Linear Regression

**Author**: Miriam Sosa

1. [Model Selection](#Model-Selection)
    - [Linear Regression](#Linear-Regression)
    - [Ridge](#Ridge)
    - [Logistic Regression](#Logistic-Regression)
    - [Elastic Net](#Elastic-Net)

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd           
import seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge, ElasticNetCV, LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler 

In [2]:
train = pd.read_csv('../data/train_clean.csv', keep_default_na=False, na_values=['']) 
test = pd.read_csv('../data/test_clean.csv', keep_default_na=False, na_values=[''])

In [3]:
train.shape, test.shape

((2051, 93), (878, 91))

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 93 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2051 non-null   int64  
 1   PID              2051 non-null   int64  
 2   MS SubClass      2051 non-null   int64  
 3   MS Zoning        2051 non-null   object 
 4   LotFrontage      2051 non-null   float64
 5   Lot Area         2051 non-null   int64  
 6   Street           2051 non-null   object 
 7   Alley            2051 non-null   object 
 8   LotShape         2051 non-null   object 
 9   Land Contour     2051 non-null   object 
 10  Utilities        2051 non-null   object 
 11  LotConfig        2051 non-null   object 
 12  Land Slope       2051 non-null   object 
 13  Neighborhood     2051 non-null   object 
 14  Condition 1      2051 non-null   object 
 15  Condition 2      2051 non-null   object 
 16  Bldg Type        2051 non-null   object 
 17  House Style   

# Model Selection

## Linear Regression

In [5]:
features = ['foundations', 
            'central_air', 
            'garage_index', 
            'pool', 
            'cond_norm', 
            'cond_pos', 
            'LotFrontage', 
            'Year Built', 
            'BsmtFin SF 1', 
            'SF', 
            'outdoorSF']

In [6]:
X = train[features]
y = train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [7]:
lr_1 = LinearRegression()
lr_1.fit(X_train, y_train)
lr_1.score(X_train, y_train), lr_1.score(X_test, y_test)

(0.7214982452068421, 0.7732696059334563)

# Regularization

In [8]:
X = train[features]
y = train['SalePrice']

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)

X_overfit = poly.fit_transform(X)

In [9]:
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
list(poly.get_feature_names_out(X.columns))

['foundations',
 'central_air',
 'garage_index',
 'pool',
 'cond_norm',
 'cond_pos',
 'LotFrontage',
 'Year Built',
 'BsmtFin SF 1',
 'SF',
 'outdoorSF',
 'foundations^2',
 'foundations central_air',
 'foundations garage_index',
 'foundations pool',
 'foundations cond_norm',
 'foundations cond_pos',
 'foundations LotFrontage',
 'foundations Year Built',
 'foundations BsmtFin SF 1',
 'foundations SF',
 'foundations outdoorSF',
 'central_air^2',
 'central_air garage_index',
 'central_air pool',
 'central_air cond_norm',
 'central_air cond_pos',
 'central_air LotFrontage',
 'central_air Year Built',
 'central_air BsmtFin SF 1',
 'central_air SF',
 'central_air outdoorSF',
 'garage_index^2',
 'garage_index pool',
 'garage_index cond_norm',
 'garage_index cond_pos',
 'garage_index LotFrontage',
 'garage_index Year Built',
 'garage_index BsmtFin SF 1',
 'garage_index SF',
 'garage_index outdoorSF',
 'pool^2',
 'pool cond_norm',
 'pool cond_pos',
 'pool LotFrontage',
 'pool Year Built',
 'poo

In [10]:
X_overfit.shape

(2051, 77)
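The 77 columns follow from a simple count. With 11 input features and no bias column, a degree-2 expansion keeps the 11 original terms, adds 11 squares, and adds 55 pairwise interactions. The sketch below checks that arithmetic with the standard library only; the helper name `n_poly_features` is illustrative and not part of the notebook.

```python
from math import comb


def n_poly_features(n_features: int, degree: int = 2) -> int:
    """Count polynomial terms of total degree 1..degree (no bias column)."""
    # Monomials of total degree <= degree in n variables number C(n + degree, degree);
    # subtracting 1 drops the constant (bias) term.
    return comb(n_features + degree, degree) - 1


# 11 originals + 11 squares + C(11, 2) = 55 interactions
print(n_poly_features(11))
```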

In [11]:
# test_size=0.7 leaves only 30% of the rows (615) to train the 77-feature model
X_train, X_test, y_train, y_test = train_test_split(X_overfit, 
                                                    y,
                                                    test_size=0.7,
                                                    random_state=42)

In [12]:
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [13]:
print(f'Z_train shape is: {Z_train.shape}')
print(f'y_train shape is: {y_train.shape}')
print(f'Z_test shape is: {Z_test.shape}')
print(f'y_test shape is: {y_test.shape}')

Z_train shape is: (615, 77)
y_train shape is: (615,)
Z_test shape is: (1436, 77)
y_test shape is: (1436,)


In [14]:
ols = LinearRegression()
ols.fit(Z_train, y_train)

LinearRegression()

In [15]:
print(ols.score(Z_train, y_train))
print(ols.score(Z_test, y_test))

0.8416907481721667
-1.7795170765833418e+23


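The test R2 is hugely negative: the unregularized model fits the training split well (R2 of about 0.84) but predicts the held-out data far worse than simply predicting the mean sale price, a clear sign of overfitting on the 77 polynomial features.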
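A negative score is possible because R2 is defined as 1 - SS_res / SS_tot, so it drops below zero as soon as the model's squared error exceeds that of a constant prediction at the mean. The sketch below illustrates this with toy numbers (not taken from the housing data); `r2_manual` is an illustrative helper that follows the same definition scikit-learn regressors use for `score`.

```python
def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, computed in plain Python."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # error of predicting the mean
    return 1 - ss_res / ss_tot


y_true = [100.0, 200.0, 300.0]        # toy targets (assumption, for illustration)
mean_pred = [200.0, 200.0, 200.0]     # always predict the mean
wild_pred = [1000.0, -500.0, 900.0]   # predictions far off the targets

r2_mean = r2_manual(y_true, mean_pred)
r2_wild = r2_manual(y_true, wild_pred)
print(r2_mean, r2_wild)
```

Predicting the mean gives an R2 of exactly zero, and anything worse than that goes negative with no lower bound.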

## Ridge

In [16]:
ridge_model = Ridge(alpha=10)

ridge_model.fit(Z_train, y_train)

print(ridge_model.score(Z_train, y_train))
print(ridge_model.score(Z_test, y_test))

0.816809119665017
0.7800208993364244


In [None]:
# Scores are slightly better than those of my multiple linear regression model that included neighborhood
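The ridge model above uses a hand-picked `alpha=10`. One way to choose the penalty from the data instead is `RidgeCV`, which scores a grid of candidate alphas by cross-validation. The sketch below is a minimal illustration under stated assumptions: it uses synthetic `make_regression` data as a stand-in for the housing features, and the names `X_demo`, `y_demo`, and `ridge_cv` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): not the housing data set.
X_demo, y_demo = make_regression(n_samples=300, n_features=40,
                                 n_informative=10, noise=20.0,
                                 random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Search alphas on a log scale; the pipeline scales inside each fit
# so the test split never influences the scaler.
alphas = np.logspace(-2, 3, 30)
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
ridge_cv.fit(X_tr, y_tr)

best_alpha = ridge_cv.named_steps["ridgecv"].alpha_
print(best_alpha)
print(ridge_cv.score(X_tr, y_tr), ridge_cv.score(X_te, y_te))
```

If the selected alpha lands on either end of the grid, widening the grid in that direction is usually worth a try.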

## Logistic Regression

In [18]:
# Unregularized logistic regression (scikit-learn >= 1.2 spells this penalty=None)
LogisticRegression(penalty='none')

LogisticRegression(penalty='none')

In [19]:
from sklearn.datasets import make_classification

In [20]:
X, y = make_classification(
    n_samples=1000,
    n_features=200,
    n_informative=15,
    random_state=123
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

In [21]:
# C is the inverse of regularization strength, so C=1e9 is effectively unregularized
logreg = LogisticRegression(C=1e9, solver='lbfgs')
logreg.fit(X_train_sc, y_train)

# Overfit model (low on test score, high on train)
print(logreg.score(X_train_sc, y_train))
print(logreg.score(X_test_sc, y_test))

0.9306666666666666
0.636


In [22]:
logreg_cv = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear")
logreg_cv.fit(X_train_sc, y_train)

# Much better overall: train and test accuracy are nearly equal, comparable to the production model
print(logreg_cv.score(X_train_sc, y_train))
print(logreg_cv.score(X_test_sc, y_test))

0.8146666666666667
0.812


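### These scores are good, but they are classification accuracies on synthetic `make_classification` data, so they are not directly comparable to the R2 of the multiple linear regression model with a dummy variable for neighborhood.

### Will use the multiple linear regression model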

In [23]:
logreg_cv.C_

array([0.04641589])
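The selected `C` is small, which means a strong L1 penalty, and an L1 penalty tends to drive many coefficients to exactly zero. The sketch below rebuilds the same synthetic data and counts non-zero coefficients at a few values of `C`. It is an illustration only: the `C` values and the names `X_demo`, `nonzero_counts`, and so on are assumptions chosen for the example, not results from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same synthetic setup as the cells above: 200 features, 15 informative.
X_demo, y_demo = make_classification(n_samples=1000, n_features=200,
                                     n_informative=15, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=123)

sc_demo = StandardScaler()
X_tr_sc = sc_demo.fit_transform(X_tr)
X_te_sc = sc_demo.transform(X_te)

# Smaller C means a stronger L1 penalty and fewer surviving coefficients.
nonzero_counts = {}
for C in (0.01, 0.05, 1.0):
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_tr_sc, y_tr)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))
    print(C, nonzero_counts[C], model.score(X_te_sc, y_te))
```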

## Elastic Net

In [24]:
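# Note: X_train and y_train here are still the unscaled synthetic
# classification data from In [20], not the housing data.
enet_alphas = np.linspace(0.5, 1.0, 100)

enet_ratio = 0.5

enet_model = ElasticNetCV(alphas=enet_alphas, l1_ratio=enet_ratio, cv=5)

enet_model = enet_model.fit(X_train, y_train)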

enet_model_preds = enet_model.predict(X_test)
enet_model_preds_train = enet_model.predict(X_train)

print(enet_model.score(X_train, y_train))
print(enet_model.score(X_test, y_test))

0.21246113948141965
0.21368840628470098


### Very low R2 scores (about 0.21 on both train and test) compared to the other approaches investigated here and MLS

In [25]:
enet_model.alpha_

0.5
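The selected alpha of 0.5 is the smallest value in the grid `np.linspace(0.5, 1.0, 100)`, which suggests the search range may not extend low enough. The sketch below shows one possible variation: a wider log-spaced grid, scaling inside a pipeline, and a check for whether the chosen alpha sits on the edge of the grid. It runs on synthetic `make_regression` data as a stand-in (an assumption, not the housing data), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in regression data (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=40,
                                 n_informative=10, noise=20.0,
                                 random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Log-spaced grid spanning several orders of magnitude.
alphas = np.logspace(-3, 1, 50)
enet_cv = make_pipeline(
    StandardScaler(),
    ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5, max_iter=10000),
)
enet_cv.fit(X_tr, y_tr)

enet = enet_cv.named_steps["elasticnetcv"]
# An alpha on the edge of the grid hints that the grid should be widened.
on_boundary = bool(np.isclose(enet.alpha_, alphas.min())
                   or np.isclose(enet.alpha_, alphas.max()))
print(enet.alpha_, on_boundary)
print(enet_cv.score(X_tr, y_tr), enet_cv.score(X_te, y_te))
```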