# House prices: dimensionality reduction

Let's check out the power of PCA on tabular data. Download the dataset here 👉👉 <a href="https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/PCA/house_prices.csv" target="_blank">House Prices</a>

1. Import `numpy` and `pandas`

In [1]:
import numpy as np
import pandas as pd

2. Import dataset and remove column `Id`

In [2]:
data = pd.read_csv("https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Machine+Learning+non+Supervis%C3%A9/PCA/house_prices.csv")
# Remove the first column, which holds the Id
data = data.iloc[:, 1:]
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
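
Dropping a column by position works as long as `Id` stays the first column. A minimal sketch of the same step done by name, on a hypothetical three-column toy frame (not the real dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the house prices data
toy = pd.DataFrame({
    "Id": [1, 2],
    "MSSubClass": [60, 20],
    "SalePrice": [208500, 181500],
})

# Drop the column by name instead of by position
toy_no_id = toy.drop(columns=["Id"])
print(toy_no_id.columns.tolist())
```

Dropping by name fails loudly with a `KeyError` if the column is missing, which can be easier to debug than silently removing the wrong column.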


3. Remove every column that contains NaN values

In [3]:
data = data.dropna(axis=1)
data.shape

(1460, 61)
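
Note that `dropna(axis=1)` removes whole columns, not rows. The sketch below illustrates this on a hypothetical toy frame (the values are made up for illustration) and counts the missing values per column first, which is a useful check before discarding anything:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the house prices data
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": [np.nan, np.nan, "Grvl"],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=1 drops every column that contains at least one NaN
toy_clean = toy.dropna(axis=1)
print(toy_clean.columns.tolist())
```

Only `LotArea` survives here, because a single NaN is enough for a column to be dropped.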

4. Split the data into `X` and `y`, where `X` contains all columns except `SalePrice`

In [4]:
# X, y split 
X = data.loc[:, data.columns != "SalePrice"]
y = data.loc[:, "SalePrice"]
X.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,...,61,0,0,0,0,0,2,2008,WD,Normal
1,20,RL,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,...,0,0,0,0,0,0,5,2007,WD,Normal
2,60,RL,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,...,42,0,0,0,0,0,9,2008,WD,Normal
3,70,RL,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,...,35,272,0,0,0,0,2,2006,WD,Abnorml
4,60,RL,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,...,84,0,0,0,0,0,12,2008,WD,Normal


5. Using <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html" target="_blank">`select_dtypes`</a> from Pandas, filter out all categorical variables. 

In [5]:
X = X.select_dtypes(exclude=["object"])
X.head()

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,60,8450,7,5,2003,2003,706,0,150,856,...,548,0,61,0,0,0,0,0,2,2008
1,20,9600,6,8,1976,1976,978,0,284,1262,...,460,298,0,0,0,0,0,0,5,2007
2,60,11250,7,5,2001,2002,486,0,434,920,...,608,0,42,0,0,0,0,0,9,2008
3,70,9550,7,5,1915,1970,216,0,540,756,...,642,0,35,272,0,0,0,0,2,2006
4,60,14260,8,5,2000,2000,655,0,490,1145,...,836,192,84,0,0,0,0,0,12,2008
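
As an alternative to excluding `object` columns, you can ask for numeric columns directly. This is a sketch on a hypothetical toy frame; whether `include="number"` selects exactly the same columns as `exclude=["object"]` on the real dataset depends on its dtypes, so treat the equivalence as an assumption.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and two numeric columns
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM"],
    "LotArea": [8450, 9600],
    "OverallQual": [7, 6],
})

# Keep only numeric dtypes (integers and floats)
numeric_only = toy.select_dtypes(include="number")
print(numeric_only.columns.tolist())
```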


6. Split `X` and `y` into a train set and a test set.
> NB: specify `random_state=0` and `test_size=0.2`

In [6]:
# Train_test_split 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)

X_train.head()

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
618,20,11694,9,5,2007,2007,48,0,1774,1822,...,774,0,108,0,0,260,0,0,7,2007
870,20,6600,5,5,1962,1962,0,0,894,894,...,308,0,0,0,0,0,0,0,8,2009
92,30,13360,5,7,1921,2006,713,0,163,876,...,432,0,0,44,0,0,0,0,8,2009
817,20,13265,8,5,2002,2002,1218,0,350,1568,...,857,150,59,0,0,0,0,0,7,2008
302,20,13704,7,5,2001,2002,0,0,1541,1541,...,843,468,81,0,0,0,0,0,1,2006


7. Import and fit `RandomForestRegressor` on your train set, then check your score on the test set (`X_test`, `y_test`).
> NB: specify `random_state=0` and `min_samples_split=15`

In [7]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate RandomForestRegressor
rf = RandomForestRegressor(random_state=0, min_samples_split=15)
rf.fit(X_train, y_train)

# Get score on test set
rf.score(X_test, y_test)

0.8366009574575247

8. Check for overfitting by comparing with the score on the train set

In [8]:
# Get score on train set
rf.score(X_train, y_train)

0.945322157767256
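
In the run above, the train score (about 0.945) is roughly 0.11 higher than the test score (about 0.837), which suggests some overfitting. The sketch below reproduces the same comparison on synthetic data from `make_regression`; the data, the `n_estimators=50` setting and the variable names are assumptions made for illustration, not part of the exercise.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house prices data
X_toy, y_toy = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0, test_size=0.2)

model = RandomForestRegressor(n_estimators=50, random_state=0, min_samples_split=15)
model.fit(X_tr, y_tr)

# R2 on data the model has seen versus data it has not seen
train_r2 = model.score(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
gap = train_r2 - test_r2
print("train R2: {:.3f}, test R2: {:.3f}, gap: {:.3f}".format(train_r2, test_r2, gap))
```

A large positive gap means the model fits the train set much better than unseen data.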

9. Use cross-validation to check your model's overall performance
> NB: you can use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html?highlight=cross%20val%20score#sklearn.model_selection.cross_val_score" target="_blank" >`cross_val_score`</a> from sklearn

In [9]:
from sklearn.model_selection import cross_val_score
# Cross val score with cv = 25
scores = cross_val_score(rf, X_train, y_train, cv=25)

# Display all scores
scores

array([0.84408484, 0.92121978, 0.87159194, 0.81829412, 0.95281324,
       0.69528887, 0.88008512, 0.25312327, 0.82353659, 0.93022963,
       0.89541079, 0.89355802, 0.89833128, 0.84354082, 0.89330607,
       0.82659114, 0.82030364, 0.86116028, 0.91333381, 0.80960393,
       0.88973339, 0.87521317, 0.91229801, 0.81887341, 0.89465573])

In [10]:
# Get the average score
scores.mean()

0.8414472358819682

In [11]:
scores.std()

0.13084777590580488
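
The standard deviation is fairly high because one fold scores far below the others. As a quick check, the sketch below copies the 25 fold scores printed above and compares the mean with the median and with the mean computed without the worst fold. The variable names are illustrative.

```python
import numpy as np

# The 25 fold scores printed above, copied as-is
scores = np.array([
    0.84408484, 0.92121978, 0.87159194, 0.81829412, 0.95281324,
    0.69528887, 0.88008512, 0.25312327, 0.82353659, 0.93022963,
    0.89541079, 0.89355802, 0.89833128, 0.84354082, 0.89330607,
    0.82659114, 0.82030364, 0.86116028, 0.91333381, 0.80960393,
    0.88973339, 0.87521317, 0.91229801, 0.81887341, 0.89465573,
])

# Locate the weakest fold
worst_fold = int(scores.argmin())
print("worst fold: {} (R2 = {:.4f})".format(worst_fold, scores[worst_fold]))

# The median is less sensitive to a single bad fold than the mean
print("mean: {:.4f}, median: {:.4f}".format(scores.mean(), np.median(scores)))

# Mean over the 24 remaining folds
mean_without_worst = np.delete(scores, worst_fold).mean()
print("mean without worst fold: {:.4f}".format(mean_without_worst))
```

With `cv=25` each validation fold is small, so a handful of unusual houses in one fold can pull its score down sharply.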

10. Standardize (normalize) your train set with `StandardScaler`, then apply the same transformation to your test set

In [12]:
from sklearn.preprocessing import StandardScaler
# Instantiate standard scaler
std = StandardScaler()

# Fit and transform X_train
X_std_train = std.fit_transform(X_train)

# Apply on X_test
X_std_test = std.transform(X_test)

# Visualize X_std_train
X_std_train

array([[-0.86836547,  0.10321202,  2.13150648, ..., -0.09258767,
         0.25639595, -0.61356151],
       [-0.86836547, -0.37288066, -0.79485211, ..., -0.09258767,
         0.62339407,  0.88411842],
       [-0.63114155,  0.25891881, -0.79485211, ..., -0.09258767,
         0.62339407,  0.88411842],
       ...,
       [ 0.79220197, -0.15511546, -0.06326246, ..., -0.09258767,
        -0.84459842,  1.63295838],
       [ 1.50387373, -0.69102346,  0.66832719, ..., -0.09258767,
         1.35739032, -1.36240148],
       [ 0.08053021,  0.57762239,  0.66832719, ..., -0.09258767,
        -0.11060217,  1.63295838]])
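
The scaler learns its means and standard deviations from the train set only, so the train columns end up with mean 0 and standard deviation 1 while the test columns are only approximately so. A minimal sketch on random data (the arrays and names are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data with a non-zero mean and non-unit scale
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
toy_train_std = scaler.fit_transform(toy_train)  # learn mean/std on train, then scale
toy_test_std = scaler.transform(toy_test)        # reuse the train statistics

print("train mean:", toy_train_std.mean(axis=0).round(3))
print("train std: ", toy_train_std.std(axis=0).round(3))
print("test mean: ", toy_test_std.mean(axis=0).round(3))
print("test std:  ", toy_test_std.std(axis=0).round(3))
```

Calling `fit_transform` on the test set instead would leak information about the test data into the preprocessing.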

11. Apply PCA on your normalized train set.
> NB: specify `n_components=3`

In [13]:
from sklearn.decomposition import PCA

# Instantiate PCA with 3 components
pca = PCA(n_components=3)

# Fit transform X_std_train
X_opt_train = pca.fit_transform(X_std_train)

# Apply on X_std_test
X_opt_test = pca.transform(X_std_test)

12. Visualize your principal components

In [14]:
# Visualize as a DataFrame to make it look nicer
pd.DataFrame(X_opt_train).head()

Unnamed: 0,0,1,2
0,4.32885,-0.581272,-0.607753
1,-3.230328,-0.38708,-1.064587
2,-2.374479,-1.772113,0.328578
3,3.683356,-2.856229,0.459145
4,2.951254,-1.151214,-1.29765


13. Check the explained variance ratio from your principal components

In [15]:
# Use pca.explained_variance_ratio_
print("Explained variance ratio per PC: {}".format(pca.explained_variance_ratio_))

# The ratios are fractions of 1, so format the total as a percentage
print("Total explained variance ratio: {:.2%}".format(pca.explained_variance_ratio_.sum()))

Explained variance ratio per PC: [0.19200008 0.09828489 0.06323194]
Total explained variance ratio: 35.35%
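
Three components keep only about 35% of the variance here. If you would rather pick the number of components from a variance target, one common approach is to fit a full PCA and look at the cumulative explained variance. The sketch below does this on synthetic correlated data; the data, the 90% threshold and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 correlated features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

X_toy_std = StandardScaler().fit_transform(X_toy)

# Fit PCA with all components, then accumulate the explained variance ratios
pca_full = PCA().fit(X_toy_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.90
n_needed = int(np.argmax(cumulative >= target)) + 1
print("components needed for {:.0%} of the variance: {}".format(target, n_needed))
```

On the house prices data, the same computation applied to `X_std_train` would tell you how many components a given variance target requires.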


14. Fit your `RandomForestRegressor` on your train set where you applied PCA, check out your score on your test set. 

In [16]:
rf.fit(X_opt_train, y_train)
rf.score(X_opt_test, y_test)

0.7889533957667985

15. Use `cross_val_score` again to see your model's overall performance. What can you say about it?

In [17]:
scores = cross_val_score(rf, X_opt_train, y_train, cv=25)
scores.mean()
# Our model is almost as performant with 3 columns as with 33!

0.8299550987928164
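
One caveat: in the cell above, the scaler and the PCA were fit on the whole train set before cross-validation, so each validation fold has already influenced the transformation. A possible way to avoid this is to wrap the three steps in a scikit-learn pipeline so they are refit inside every fold. This is a sketch on synthetic data; the data, `cv=5` and `n_estimators=50` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric house prices features
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Scaling, PCA and the model are refit on the training part of each fold
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    RandomForestRegressor(n_estimators=50, random_state=0, min_samples_split=15),
)

pipe_scores = cross_val_score(pipeline, X_toy, y_toy, cv=5)
print("mean R2: {:.3f}, std: {:.3f}".format(pipe_scores.mean(), pipe_scores.std()))
```

With the real data you would pass the unscaled `X_train` and `y_train` to `cross_val_score` and let the pipeline handle the preprocessing.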

16. Finally, use `scatter_3d` from `plotly.express` and `go.Scatter3d` from `plotly.graph_objects` to draw a 3D graph showing the three principal components for both the train set and the test set.

In [18]:
# Import plotly.express and plotly.graph_objects
import plotly.express as px 
import plotly.graph_objects as go

# Use plotly express to plot train data
fig = px.scatter_3d(X_opt_train, 
                    x=0, 
                    y=1, 
                    z=2)

# Add trace with test data 
fig.add_trace(go.Scatter3d(x=X_opt_test[:, 0], 
                            y=X_opt_test[:, 1], 
                            z=X_opt_test[:, 2],
                            mode="markers",
                            name="test"
                          ))

# Render on notebook
fig.show(renderer="iframe_connected")

In [19]:
# Example: train set colored by SalePrice
df_plot = pd.DataFrame(X_opt_train, columns=['PC1', 'PC2', 'PC3'])
# Use the raw values so that y_train's shuffled index does not cause misalignment
df_plot['SalePrice'] = y_train.to_numpy()

fig = px.scatter_3d(df_plot, x='PC1', y='PC2', z='PC3', color='SalePrice')
fig.show(renderer="iframe_connected")