# Linear Regression
This notebook compares two linear regression models fit without regularization. A model that includes all numeric features performs better on the holdout set than a model limited to the five most positively correlated and the five most negatively correlated features.

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv('/Users/omarcarr/Desktop/Notebooks/DSI-US-5/Projects/project-2/train_numbers_clean.csv')

In [3]:
df.head()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,...,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice
0,109,533352170,60,0.0,13517,6,8,1976,2005,289.0,...,0,44,0,0,0,0,0,3,2010,130500
1,544,531379050,60,43.0,11492,7,5,1996,1997,132.0,...,0,74,0,0,0,0,0,4,2009,220000
2,153,535304180,20,68.0,7922,5,7,1953,2007,0.0,...,0,52,0,0,0,0,0,1,2010,109000
3,318,916386060,60,73.0,9802,5,5,2006,2007,0.0,...,100,0,0,0,0,0,0,4,2010,174000
4,255,906425045,50,82.0,14235,6,8,1900,1993,0.0,...,0,59,0,0,0,0,0,3,2010,138500


In [4]:
# Confirm that the cleaned data contains no null values:

df.isnull().sum().sum()

0

In [5]:
# Drop the identifier columns, which are not used as features:

df = df.drop(['Id', 'PID'], axis=1)

In [6]:
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

# Train-test split:

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [8]:
df.shape

(2051, 37)
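
Because `train_test_split` is called above without a `test_size`, scikit-learn falls back to its default of holding out 25% of the rows. The sketch below checks this on a placeholder frame with the same number of rows as `df`. The column `feature` and all values are made up for illustration and are not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data: 2051 rows to mirror df.shape[0]; the values are arbitrary
demo = pd.DataFrame({
    "feature": np.arange(2051),
    "SalePrice": np.arange(2051) * 100,
})

X_demo = demo.drop("SalePrice", axis=1)
y_demo = demo["SalePrice"]

# No test_size is given, so the default 25% holdout applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
print(round(len(X_te) / len(demo), 3))
```

Passing `test_size` explicitly would make the holdout fraction visible in the notebook itself.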

# Linear Regression Model with all numeric features

 - A linear regression model fit on all numeric features gets an R-squared score of 0.83 on the training set and 0.86 on the holdout test set.

In [9]:
lr = LinearRegression()

lr.fit(X_train, y_train)

print(lr.score(X_train, y_train))

print(lr.score(X_test, y_test))

0.8304108345862737
0.8643115427071087
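
The two numbers printed by `lr.score` are R-squared values, defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below reproduces that calculation by hand. It uses synthetic data from `make_regression` as a stand-in for the housing data, so the printed values will not match the scores above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (an assumption for illustration)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)

# R-squared = 1 - (residual sum of squares / total sum of squares)
residuals = y_te - model.predict(X_te)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual value should agree with the built-in score
print(r2_manual, model.score(X_te, y_te))
```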


# Linear Regression Model with top 5 positively and top 5 negatively correlated variables

- The five most positively and five most negatively correlated explanatory variables (by correlation with `SalePrice` on the training set) are selected and used in a second linear regression model. It scores an R-squared of 0.83 on the test set, slightly below the 0.86 of the model with all numeric variables included.

In [10]:
# Correlation of each feature with SalePrice, computed on the training rows only
corr = pd.concat([X_train, y_train], axis=1).corr()['SalePrice']

# Five most negatively correlated features
five_lowest = list(corr.sort_values().index[:5])
# Five most positively correlated features; index 0 is SalePrice itself, so skip it
five_largest = list(corr.sort_values(ascending=False).index[1:6])

In [11]:
print('Five most negatively:\n', five_lowest)
print()
print('Five most positively:\n', five_largest)

Five most negatively:
 ['Enclosed Porch', 'Kitchen AbvGr', 'Overall Cond', 'MS SubClass', 'Bsmt Half Bath']

Five most positively:
 ['Overall Qual', 'Gr Liv Area', 'Garage Cars', 'Garage Area', 'Total Bsmt SF']


In [12]:
features = five_lowest + five_largest
features

['Enclosed Porch',
 'Kitchen AbvGr',
 'Overall Cond',
 'MS SubClass',
 'Bsmt Half Bath',
 'Overall Qual',
 'Gr Liv Area',
 'Garage Cars',
 'Garage Area',
 'Total Bsmt SF']
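
The selection above slices the descending list with `[1:6]` because the target always ranks first against itself. A small toy example makes this visible. The columns `a`, `b`, and `c` and their values are invented for illustration only.

```python
import pandas as pd

# Toy data: 'a' rises with the target, 'b' falls with it, 'c' is weakly related
toy = pd.DataFrame({
    "a": [1, 2, 2, 4, 5, 7],
    "b": [6, 5, 5, 3, 2, 2],
    "c": [2, 9, 1, 8, 3, 7],
    "SalePrice": [10, 20, 30, 40, 50, 60],
})

corr_toy = toy.corr()["SalePrice"].sort_values(ascending=False)

# The target correlates perfectly with itself, so it tops the descending list
first_label = corr_toy.index[0]
print(first_label)

# Skip position 0 to get the strongest positive feature; take the last entry
# for the strongest negative feature
top_positive = list(corr_toy.index[1:2])
top_negative = list(corr_toy.index[-1:])
print(top_positive, top_negative)
```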

- Note: this Kaggle problem is scored on RMSE, so RMSE should also be computed for these models.

In [13]:
lr = LinearRegression()

lr.fit(X_train[features], y_train)

print(lr.score(X_train[features], y_train))

print(lr.score(X_test[features], y_test))

0.7695832701211398
0.8320849926780919


In [14]:
from sklearn.metrics import mean_squared_error

In [15]:
mean_squared_error(y_test, lr.predict(X_test[features])) ** 0.5

32109.2244930776
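
The RMSE above is computed for the reduced model only. One way to compare both models on the same metric is a small helper, sketched below. The function name `rmse`, the synthetic data, and the choice of `subset_cols` are assumptions for illustration, so the printed values are unrelated to the housing results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(model, X, y):
    """Root mean squared error of the model's predictions, in the units of y."""
    return mean_squared_error(y, model.predict(X)) ** 0.5


# Synthetic stand-in for the housing data (an assumption for illustration)
X_syn, y_syn = make_regression(
    n_samples=300, n_features=8, n_informative=8, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# Model using every column
full_model = LinearRegression().fit(X_tr, y_tr)

# Model using an arbitrary subset of columns, standing in for the ten selected features
subset_cols = [0, 1, 2]
subset_model = LinearRegression().fit(X_tr[:, subset_cols], y_tr)

full_rmse = rmse(full_model, X_te, y_te)
subset_rmse = rmse(subset_model, X_te[:, subset_cols], y_te)
print(full_rmse)
print(subset_rmse)
```

In the notebook, the same helper could be called with a model fit on `X_train` and one fit on `X_train[features]`, each with its matching test columns.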

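- Limiting the model to these ten numeric features lowers the test R-squared from 0.86 to 0.83 and gives a test RMSE of about 32,109 (in the units of `SalePrice`).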