# In this notebook, I will:
- Generate polynomial features without the dummy columns
- Find the polynomial columns that are highly correlated with `SalePrice`
- Standardize those columns (this creates a new dataframe)
- Append the three dummy columns shown before, plus `SalePrice`, to the standardized dataframe

In [28]:
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
import pandas as pd

In [29]:
train = pd.read_csv('./datasets/df_with_dummy.csv')
train.head()

Unnamed: 0.1,Unnamed: 0,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,...,Misc Val,Mo Sold,Yr Sold,SalePrice,homeage,baths,garageage,Foundation_PConc,Bsmt Qual_Ex,Kitchen Qual_Ex
0,0,60,68.0,13517,6,8,1976,2005,289.0,533.0,...,0,3,2010,130500,34,2.5,34.0,0,0,0
1,1,60,43.0,11492,7,5,1996,1997,132.0,637.0,...,0,4,2009,220000,13,2.5,12.0,1,0,0
2,2,20,68.0,7922,5,7,1953,2007,0.0,731.0,...,0,1,2010,109000,57,1.0,57.0,0,0,0
3,3,60,73.0,9802,5,5,2006,2007,0.0,0.0,...,0,4,2010,174000,4,2.5,3.0,1,0,0
4,4,50,82.0,14235,6,8,1900,1993,0.0,0.0,...,0,3,2010,138500,110,2.0,53.0,1,0,0


In [30]:
train.shape

(2051, 44)

In [31]:
# Keep only the non-dummy columns (dummy column names contain '_') so I can create polynomial features
numeric_col = [col for col in train.columns if '_' not in col]

In [32]:
len(numeric_col)

41

In [33]:
#removing irrelevant columns 
numeric_col.remove('Unnamed: 0')

In [34]:
# SalePrice is the target, so it should not be used to build features
numeric_col.remove('SalePrice')

In [35]:
len(numeric_col)

39
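The `train.head()` output above also lists an `Unnamed: 0.1` column. If that is a real column in the CSV, its name has no underscore and it is not removed above, so it would still be in `numeric_col`. Below is a minimal sketch of a stricter filter, assuming both `Unnamed` columns are leftover row indices from earlier `to_csv` calls. The toy frame and the `is_feature` helper are made up for illustration.

```python
import pandas as pd

# Toy frame that mimics the column layout of df_with_dummy.csv (values are made up).
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "Overall Qual": [6, 7, 5],
    "Gr Liv Area": [1479, 2122, 1057],
    "SalePrice": [130500, 220000, 109000],
    "Foundation_PConc": [0, 1, 0],
})

def is_feature(col: str) -> bool:
    """Hypothetical helper: True for columns that should be expanded into polynomial features."""
    is_dummy = "_" in col                      # dummy columns carry an underscore
    is_old_index = col.startswith("Unnamed")   # assumed to be leftover index columns
    is_target = col == "SalePrice"
    return not (is_dummy or is_old_index or is_target)

feature_cols = [col for col in toy.columns if is_feature(col)]
print(feature_cols)
```

Applying this to `train` would change the feature count, so the shapes shown in the rest of this notebook would no longer match.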

# Polynomial Features

In [36]:
X = train[numeric_col]

In [38]:
# degree defaults to 2: original columns, squares, and pairwise products
poly = PolynomialFeatures(include_bias=False)
X_poly = poly.fit_transform(X)

In [39]:
# Creating a dataframe of our polynomial features
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
poly_df = pd.DataFrame(X_poly, columns=poly.get_feature_names_out(numeric_col))

In [40]:
poly_df.shape

(2051, 819)
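The 819 columns follow from the default degree of 2: with 39 input columns there are 39 originals, 39 squares, and 741 pairwise products. Here is a small sketch that checks this count against scikit-learn on a random matrix. The function name `n_degree2_features` is my own, not part of any library.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_degree2_features(n: int) -> int:
    """Columns produced by degree-2 PolynomialFeatures with include_bias=False."""
    return n + n + comb(n, 2)  # originals + squares + pairwise products

# Check the formula against scikit-learn on a small random matrix with 4 columns.
rng = np.random.default_rng(0)
X_small = rng.normal(size=(5, 4))
n_out = PolynomialFeatures(include_bias=False).fit_transform(X_small).shape[1]

print(n_out, n_degree2_features(4))
print(n_degree2_features(39))  # 819, the width of poly_df above
```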

# Finding Polynomial Columns with High Correlation

In [14]:
train['SalePrice'].shape

(2051,)

#### This time, I am setting the correlation magnitude at 0.50

In [15]:
# Keep the polynomial columns whose correlation with SalePrice exceeds 0.5 in magnitude
corr_col = []
for col in list(poly_df):
    corr = train['SalePrice'].corr(poly_df[col])
    if abs(corr) > 0.5:
        corr_col.append(col)
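The loop above can also be written without an explicit loop using `DataFrame.corrwith`. This is a sketch on made-up toy data, not the notebook's real columns, showing that both strongly positive and strongly negative columns pass the magnitude filter.

```python
import pandas as pd

# Toy data (made up): one feature tracks the target, one opposes it, one is unrelated.
target = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="SalePrice")
features = pd.DataFrame({
    "up": [1.1, 1.9, 3.2, 3.9, 5.1],
    "down": [5.0, 4.1, 2.9, 2.2, 0.8],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
})

threshold = 0.5
corr = features.corrwith(target)  # Pearson correlation of each column with the target
selected = corr[corr.abs() > threshold].index.tolist()
print(selected)
```

On the real data, the equivalent call would be `poly_df.corrwith(train['SalePrice'])`, which relies on both objects sharing the same row index.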

In [16]:
corr_col

['Overall Qual',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'Total Bsmt SF',
 '1st Flr SF',
 'Gr Liv Area',
 'Full Bath',
 'TotRms AbvGrd',
 'Garage Yr Blt',
 'Garage Cars',
 'Garage Area',
 'homeage',
 'baths',
 'garageage',
 'Lot Frontage Overall Qual',
 'Lot Frontage Mas Vnr Area',
 'Lot Frontage Gr Liv Area',
 'Lot Frontage Full Bath',
 'Lot Frontage Garage Cars',
 'Lot Frontage Garage Area',
 'Lot Frontage baths',
 'Lot Area Overall Qual',
 'Lot Area Garage Cars',
 'Overall Qual^2',
 'Overall Qual Overall Cond',
 'Overall Qual Year Built',
 'Overall Qual Year Remod/Add',
 'Overall Qual Mas Vnr Area',
 'Overall Qual BsmtFin SF 1',
 'Overall Qual Total Bsmt SF',
 'Overall Qual 1st Flr SF',
 'Overall Qual Gr Liv Area',
 'Overall Qual Full Bath',
 'Overall Qual Bedroom AbvGr',
 'Overall Qual Kitchen AbvGr',
 'Overall Qual TotRms AbvGrd',
 'Overall Qual Fireplaces',
 'Overall Qual Garage Yr Blt',
 'Overall Qual Garage Cars',
 'Overall Qual Garage Area',
 'Overall Qual Yr Sold'

In [17]:
len(corr_col)

164

In [18]:
poly_df[corr_col].head()

Unnamed: 0,Overall Qual,Year Built,Year Remod/Add,Mas Vnr Area,Total Bsmt SF,1st Flr SF,Gr Liv Area,Full Bath,TotRms AbvGrd,Garage Yr Blt,...,Garage Cars Garage Area,Garage Cars Yr Sold,Garage Cars baths,Garage Area^2,Garage Area Yr Sold,Garage Area baths,Yr Sold homeage,Yr Sold baths,Yr Sold garageage,baths^2
0,6.0,1976.0,2005.0,289.0,725.0,725.0,1479.0,2.0,6.0,1976.0,...,950.0,4020.0,5.0,225625.0,954750.0,1187.5,68340.0,5025.0,68340.0,6.25
1,7.0,1996.0,1997.0,132.0,913.0,913.0,2122.0,2.0,8.0,1997.0,...,1118.0,4018.0,5.0,312481.0,1123031.0,1397.5,26117.0,5022.5,24108.0,6.25
2,5.0,1953.0,2007.0,0.0,1057.0,1057.0,1057.0,1.0,5.0,1953.0,...,246.0,2010.0,1.0,60516.0,494460.0,246.0,114570.0,2010.0,114570.0,1.0
3,5.0,2006.0,2007.0,0.0,384.0,744.0,1444.0,2.0,7.0,2007.0,...,800.0,4020.0,5.0,160000.0,804000.0,1000.0,8040.0,5025.0,6030.0,6.25
4,6.0,1900.0,1993.0,0.0,676.0,831.0,1445.0,2.0,6.0,1957.0,...,968.0,4020.0,4.0,234256.0,972840.0,968.0,221100.0,4020.0,106530.0,4.0


# At this point, `poly_df[corr_col]` has only the polynomial features that are highly correlated with `SalePrice`. Now let's standardize it!

In [19]:
#I want to standardize my new polynomial features 
X = poly_df[corr_col]

In [20]:
ss = StandardScaler()
ss.fit(X)  # learn the mean and standard deviation of each column
poly_scaled = ss.transform(X)
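To make the standardization step concrete, here is a short sketch on a made-up two-column frame. It compares `StandardScaler` with a manual z-score. The one detail to watch is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (values are made up).
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

scaled = StandardScaler().fit_transform(toy)

# Manual z-scores using the population standard deviation, as StandardScaler does.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```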

In [21]:
# Note: this rebinds `poly`, which until now held the PolynomialFeatures transformer
poly = pd.DataFrame(poly_scaled, columns=corr_col)

In [22]:
poly.head()

Unnamed: 0,Overall Qual,Year Built,Year Remod/Add,Mas Vnr Area,Total Bsmt SF,1st Flr SF,Gr Liv Area,Full Bath,TotRms AbvGrd,Garage Yr Blt,...,Garage Cars Garage Area,Garage Cars Yr Sold,Garage Cars baths,Garage Area^2,Garage Area Yr Sold,Garage Area baths,Yr Sold homeage,Yr Sold baths,Yr Sold garageage,baths^2
0,-0.078644,0.142227,0.989479,1.092329,-0.741232,-1.108838,-0.040634,0.769779,-0.279441,-0.112447,...,-0.052622,0.295128,0.737673,-0.199323,0.008581,0.496281,-0.06721,1.160521,0.203566,1.138397
1,0.622656,0.805126,0.60909,0.191491,-0.322705,-0.63451,1.244529,0.769779,1.002738,0.73709,...,0.174714,0.293824,0.737673,0.182637,0.396922,0.848706,-0.763445,1.158568,-0.68613,1.138397
2,-0.779944,-0.620106,1.084576,-0.565901,-0.00213,-0.271195,-0.884084,-1.051232,-0.920531,-1.042893,...,-1.00527,-1.014939,-1.07535,-0.925409,-1.053629,-1.083758,0.6951,-1.194594,1.133449,-1.046041
3,-0.779944,1.136575,1.084576,-0.565901,-1.50037,-1.0609,-0.110588,0.769779,0.361648,1.141632,...,-0.255601,0.295128,0.737673,-0.487916,-0.339305,0.181616,-1.061526,1.160521,-1.049756,1.138397
4,-0.078644,-2.376787,0.418896,-0.565901,-0.850317,-0.841397,-0.108589,0.769779,-0.279441,-0.881076,...,-0.028265,0.295128,0.284417,-0.161367,0.050327,0.127913,2.451726,0.375483,0.971731,0.202209


In [23]:
# Adding the three dummy columns and SalePrice
append_col = ['Foundation_PConc', 'Bsmt Qual_Ex', 'Kitchen Qual_Ex', 'SalePrice']

In [24]:
train[append_col].head()

Unnamed: 0,Foundation_PConc,Bsmt Qual_Ex,Kitchen Qual_Ex,SalePrice
0,0,0,0,130500
1,1,0,0,220000
2,0,0,0,109000
3,1,0,0,174000
4,1,0,0,138500


In [25]:
mother = pd.concat(objs=[poly, train[append_col]], axis='columns')

In [26]:
mother.shape

(2051, 168)
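The row count staying at 2051 suggests the two frames lined up as intended. `pd.concat` with `axis='columns'` aligns rows by index label rather than by position, so frames with different indexes produce extra rows filled with `NaN`. Here is a sketch of that behavior on made-up frames, with `reset_index(drop=True)` as one possible fix.

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0, 3.0]})                 # index 0, 1, 2
right = pd.DataFrame({"y": [10, 20, 30]}, index=[1, 2, 3])  # shifted index

# Rows become the union of both indexes, with NaN where a label is missing.
misaligned = pd.concat([left, right], axis="columns")
print(misaligned.shape, int(misaligned.isna().sum().sum()))

# Resetting the index makes the concat positional again.
aligned = pd.concat([left, right.reset_index(drop=True)], axis="columns")
print(aligned.shape, int(aligned.isna().sum().sum()))
```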

# We finally made the mother dataframe! These are the features I picked. The preprocessing workflow is as follows:
- Generate polynomial features without the dummy columns
- Find the polynomial features that are highly correlated with `SalePrice`
- Standardize those specific features
- Append the three dummies and `SalePrice`

**This Medium blog post helped me streamline my preprocessing workflow.**
- https://medium.com/@samchaaa/preprocessing-why-you-should-generate-polynomial-features-first-before-standardizing-892b4326a91d

In [27]:
mother.to_csv('./datasets/mother.csv')
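By default, `to_csv` also writes the row index as an unnamed first column. That is likely where the `Unnamed: 0` column removed earlier in this notebook came from. Here is a sketch of the round trip on a made-up frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

df = pd.DataFrame({"SalePrice": [130500, 220000]})

# Default: the index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf).columns.tolist()
print(with_index)

# index=False skips the index, so the columns survive the round trip unchanged.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf).columns.tolist()
print(without_index)
```

Reading the file back with `pd.read_csv(path, index_col=0)` is another option if the file was already written with its index.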

# This mother data contains:
- Non-standardized `SalePrice` column
- Three non-standardized dummy columns
- 164 polynomial and standardized columns

Besides the `SalePrice` column, these 167 columns are my features.