<a href="https://colab.research.google.com/github/phmorris610/Regression/blob/main/RegressionHousingData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## In this notebook I will do a multiple linear regression using housing price data.

In [None]:
import numpy as np
import pandas as pd


class regModel:
    def __init__(self, X, y):
        self.X = X
        self.y = y

First I define the model as a regModel class that stores the explanatory variables X and the response y. Its methods follow in the next cell.

In [None]:
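    def reg(self):
        import matplotlib.pyplot as plt
        import numpy as np
        import statsmodels.api as sm
        from sklearn.linear_model import LinearRegression

        # scikit-learn fit, used for the predictions and the error metrics
        linreg_model = LinearRegression()
        linreg_model.fit(self.X, self.y)
        y_pred = linreg_model.predict(self.X)

        # statsmodels fit, used for the summary table and the residuals
        # (sm.OLS adds no intercept unless X already has a constant column)
        ols_model = sm.OLS(self.y, self.X).fit()

        # and a residual plot against the second column of the design matrix
        plt.title('Residuals')
        plt.scatter(ols_model.model.exog[:, 1], ols_model.resid)
        plt.show()

        # MSE is the mean of the squared residuals (y - y_pred)
        mse_linreg = ((self.y - y_pred) ** 2).mean()
        print("MSE = ", mse_linreg)
        print("rMSE = ", np.sqrt(mse_linreg))
        return ols_model.summary()  # finally return a regression summary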

    def chiSquareTest(self, k):
        from sklearn.feature_selection import SelectKBest
        from sklearn.feature_selection import chi2
        bestfeatures = SelectKBest(score_func=chi2, k=k)
        fit = bestfeatures.fit(self.X, self.y)
        dfscores = pd.DataFrame(fit.scores_)
        dfcolumns = pd.DataFrame(self.X.columns)
        featureScores = pd.concat([dfcolumns, dfscores], axis=1)
        featureScores.columns = ['Specs', 'Score']
        return featureScores.nlargest(k, 'Score')

    def extraTree(self, n):
        from sklearn.ensemble import ExtraTreesClassifier
        import matplotlib.pyplot as plt
        model = ExtraTreesClassifier()
        model.fit(self.X, self.y)
        feat_importances = pd.Series(model.feature_importances_, index=self.X.columns)
        feat_importances.nlargest(n).plot(kind='barh')
        plt.show()
        feat = []
        feat.append(feat_importances.nlargest(n))
        return feat

    def heatmap(self):
        import seaborn as sns
        import matplotlib.pyplot as plt
        # correlation matrix of the explanatory variables held by this instance
        corrmat = self.X.corr()
        plt.figure(figsize=(20, 20))
        sns.heatmap(corrmat, annot=True, cmap='RdYlGn')
        plt.show()

    def trainTestSplit(self, t_size, state):
        from sklearn.model_selection import train_test_split
        X_train, X_val, y_train, y_val = train_test_split(self.X, self.y, train_size=t_size, random_state=state)
        return X_train, X_val, y_train, y_val

    def normal(self, feature):
        import matplotlib.pyplot as plt
        plt.hist(self.X[feature], bins=20, edgecolor='black')
        plt.show()  # TODO: I would like to somehow spit out a display of all the hists at once of all of my features

    def tranform(self, selections, trans):
        import numpy as np
        # log(x + 1) keeps zero-valued entries finite; trans is not used yet
        X_tlog = self.X[selections].applymap(lambda x: np.log(x+1))
        y_tlog = np.log(self.y)
        return X_tlog, y_tlog

    def describe(self):
        import matplotlib.pyplot as plt
        plt.scatter(self.X, self.y, alpha=0.3)  #TODO: I would like to display all X's vs the y's
        plt.show()

As seen above, the regModel class covers most of what we need to produce a regression model: fitting, feature selection, and the finer points of splitting and transforming the data. Note that some methods are still in progress (see the TODO comments).
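The reg method reports MSE and RMSE, which are conventionally computed from the residuals (observed minus predicted). Below is a minimal sketch of that calculation on synthetic data. The arrays X_demo and y_demo and their coefficients are assumptions for illustration only, not the housing set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration, not the housing set)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# Residuals are observed minus predicted values
residuals = y_demo - y_pred
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

print("MSE =", mse)
print("rMSE =", rmse)
```

The hand-computed value should agree with sklearn.metrics.mean_squared_error(y_demo, y_pred), which makes a handy cross-check.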

In [None]:
data = pd.read_csv('C:/Users/Paul Morris/Desktop/DeepLearn/Housing.csv')
train_data = data
null_entries = data.isnull().sum()
cols_w_nulls = null_entries[null_entries > 0].index
X = train_data.drop(columns=cols_w_nulls).copy()
y = train_data.pop('SalePrice')
for colname in X.select_dtypes('object').columns:
    X[colname], _ = X[colname].factorize()

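The first step is loading the data from a CSV file, dropping every column that contains null entries, then splitting the data into X (the observations, or explanatory variables) and y (the response). Because X is copied before SalePrice is popped from train_data, X still contains the SalePrice column at this point. Finally, the text columns are converted to integer codes with factorize.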
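Here is a minimal sketch of the same preparation on a made-up four-row frame, with one change: SalePrice is popped before X is built, so the response stays out of the predictors. The toy values and the names toy, X_toy and y_toy are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Housing.csv (values are made up for illustration)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor"],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],  # contains nulls, will be dropped
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Separate the response first so it cannot end up among the predictors
y_toy = toy.pop("SalePrice")

# Drop every column that has at least one null entry
null_counts = toy.isnull().sum()
cols_w_nulls = null_counts[null_counts > 0].index
X_toy = toy.drop(columns=cols_w_nulls).copy()

# Replace non-numeric columns with integer codes
for colname in X_toy.select_dtypes(exclude="number").columns:
    X_toy[colname], _ = X_toy[colname].factorize()

print(X_toy)
```

Note that factorize assigns codes in order of first appearance, so the codes carry no ordinal meaning.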

In [None]:
features = regModel(X, y)
xTree = regModel(X, y)
features.chiSquareTest(30)
xTree.extraTree(30)

Now it's important to see which variables are pertinent in explaining the response (y), which is SalePrice. So I'll produce a list of the top 30 explanatory variables using a chi-squared test and the Extra Trees classifier. Both produce the same results, but the Extra Trees classifier gives a more pleasant display. The vector of the top 30 is below.
[SalePrice       0.053516
Id              0.035669
GrLivArea       0.034066
1stFlrSF        0.032866
GarageArea      0.032635
MoSold          0.032203
BsmtUnfSF       0.032188
TotalBsmtSF     0.032173
LotArea         0.032163
YearBuilt       0.031221
YrSold          0.030410
YearRemodAdd    0.030267
BsmtFinSF1      0.029967
OpenPorchSF     0.026133
TotRmsAbvGrd    0.026111
WoodDeckSF      0.025732
Neighborhood    0.025510
OverallQual     0.023704
Exterior2nd     0.022308
Exterior1st     0.021462
2ndFlrSF        0.019214
OverallCond     0.019014
HeatingQC       0.018181
BedroomAbvGr    0.018124
LotConfig       0.018107
MSSubClass      0.016914
Fireplaces      0.016889
LotShape        0.016859
BsmtFullBath    0.016400
GarageCars      0.014455
dtype: float64]
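Since SalePrice is a continuous response, one possible alternative (not what this notebook runs) is to rank features with regression-oriented tools: f_regression in place of chi2, and ExtraTreesRegressor in place of ExtraTreesClassifier. The sketch below uses synthetic data. The column names signal, noise_a and noise_b are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumed for illustration): only "signal" drives the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y_demo = 5.0 * X_demo["signal"] + rng.normal(scale=0.5, size=300)

# Univariate F-test scores, similar in spirit to the chi-squared ranking
selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)
f_scores = pd.Series(selector.scores_, index=X_demo.columns)

# Tree-based importances, similar in spirit to the Extra Trees ranking
forest = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)

print(f_scores.nlargest(2))
print(importances.nlargest(2))
```

Both rankings are returned as pandas Series indexed by column name, so nlargest works on them the same way it does in the class methods.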

In [None]:
featured_selections = ['GrLivArea', 'BsmtFinSF1', 'LotArea', 'GarageArea', '1stFlrSF',
                       'YearRemodAdd', 'Neighborhood', 'WoodDeckSF',
                       'OverallQual', 'Exterior1st']
X_linreg = X[featured_selections].copy()

Obviously SalePrice is correlated with itself, but its score gives us a good baseline for what a top score looks like.
Maybe we want to go with the variables scoring less than 0.2. From those I build featured_selections (also leaving out Id and YrSold).
Feature selection is done, so let's see the model.

In [None]:
r = regModel(X_linreg, y)
r.reg()

Judging from the p-values, we can get rid of a few variables: Exterior2nd, TotRmsAbvGrd, OpenPorchSF, YearBuilt and MoSold (I should have dropped that one earlier, how embarrassing).
Neighborhood is borderline, which is interesting.
After removing the explanatory variables above, the model improved tremendously. Also, the residual plot doesn't show any glaring signs of skewness.