## Codio Activity 6.4: Adjusting Parameters for Variance

**Expected Time: 60 Minutes**

**Total Points: 20 Points**

This activity focuses on using the $\Sigma$ matrix to limit the number of principal components based on how much variance should be kept.  In the last activity, a scree plot was used to see where the gain in variance explained levels off.  Here, you will determine how many components are required to explain a given proportion of the variance.  The dataset is a larger housing dataset describing individual houses and their features in Ames, Iowa.  For our purposes, only the numeric columns without missing values are selected.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)

In [1]:
import numpy as np
from scipy.linalg import svd
from sklearn.datasets import fetch_openml

In [2]:
# fetching the data
housing = fetch_openml(name="house_prices", as_frame=True, data_home="data")

In [3]:
# examine the dataframe
housing.frame

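| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | | Reg | Lvl | AllPub | ... | 0 | | MnPrv | | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | | Reg | Lvl | AllPub | ... | 0 | | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 142125 |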


In [4]:
# select numeric columns and drop any column that contains missing values
df = housing.frame.select_dtypes(["float", "int"]).dropna(axis=1)

In [5]:
df.shape

(1460, 35)
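
The selection step above can be illustrated on a small frame. This is a minimal sketch, not the activity's data: the `toy` frame and its column names are made up for illustration, and `select_dtypes(include="number")` is used here as an assumption in place of the `["float", "int"]` list in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one text column, one complete numeric column,
# and one numeric column with a missing value.
toy = pd.DataFrame(
    {
        "zoning": ["RL", "RM", "RL"],
        "lot_area": [8450, 9600, 11250],
        "lot_frontage": [65.0, np.nan, 68.0],
    }
)

# Keep numeric columns only, then drop every column (axis=1) that has a NaN.
numeric_complete = toy.select_dtypes(include="number").dropna(axis=1)
print(list(numeric_complete.columns))
```

Only `lot_area` survives: `zoning` is not numeric, and `lot_frontage` is dropped as a whole column because `axis=1` removes columns rather than rows.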

[Back to top](#Index:) 

## Problem 1

### Scale the data

**5 Points**

After selecting the numeric data, scale it so that it is ready for SVD: subtract each column's mean and divide by its standard deviation.  Assign the scaled data to `df_scaled` below.  Your answer should be a DataFrame.

In [10]:
df_scaled = (df - df.mean()) / df.std()
print(type(df_scaled))
df_scaled.head()

<class 'pandas.core.frame.DataFrame'>


| | Id | MSSubClass | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.730272 | 0.07335 | -0.207071 | 0.651256 | -0.517023 | 1.050634 | 0.878367 | 0.575228 | -0.288554 | -0.944267 | ... | -0.751918 | 0.216429 | -0.359202 | -0.116299 | -0.270116 | -0.068668 | -0.087658 | -1.598563 | 0.13873 | 0.347154 |
| 1 | -1.7279 | -0.872264 | -0.091855 | -0.071812 | 2.178881 | 0.15668 | -0.42943 | 1.171591 | -0.288554 | -0.641008 | ... | 1.625638 | -0.704242 | -0.359202 | -0.116299 | -0.270116 | -0.068668 | -0.087658 | -0.488943 | -0.614228 | 0.007286 |
| 2 | -1.725528 | 0.07335 | 0.073455 | 0.651256 | -0.517023 | 0.984415 | 0.82993 | 0.092875 | -0.288554 | -0.30154 | ... | -0.751918 | -0.070337 | -0.359202 | -0.116299 | -0.270116 | -0.068668 | -0.087658 | 0.990552 | 0.13873 | 0.53597 |
| 3 | -1.723156 | 0.309753 | -0.096864 | 0.651256 | -0.517023 | -1.862993 | -0.720051 | -0.499103 | -0.288554 | -0.061648 | ... | -0.751918 | -0.175988 | 4.091122 | -0.116299 | -0.270116 | -0.068668 | -0.087658 | -1.598563 | -1.367186 | -0.515105 |
| 4 | -1.720785 | 0.07335 | 0.37502 | 1.374324 | -0.517023 | 0.951306 | 0.733056 | 0.46341 | -0.288554 | -0.174805 | ... | 0.77993 | 0.563567 | -0.359202 | -0.116299 | -0.270116 | -0.068668 | -0.087658 | 2.100173 | 0.13873 | 0.869545 |
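
One way to sanity-check this kind of scaling is to confirm that every column ends up with mean 0 and standard deviation 1. The sketch below does this on synthetic data; the `toy` frame, its column names, and the random seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data with columns on very different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(loc=[10.0, -3.0, 500.0], scale=[2.0, 0.5, 100.0], size=(50, 3)),
    columns=["a", "b", "c"],
)

# Same column-wise standardization as in the cell above.
toy_scaled = (toy - toy.mean()) / toy.std()

col_means = toy_scaled.mean()  # each entry should be ~0
col_stds = toy_scaled.std()    # each entry should be ~1
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```

Note that pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, so results differ slightly from tools that divide by the population standard deviation.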


[Back to top](#Index:) 

## Problem 2

### Extracting $\Sigma$

**5 Points**

Using the scaled data, extract the singular values with the `svd` function from `scipy.linalg`.  Assign your results to `sigma` below. 

In [11]:
(U, sigma, VT) = svd(df_scaled, full_matrices=False)

print(type(sigma))
print(sigma.shape)

<class 'numpy.ndarray'>
(35,)
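
To see what `full_matrices=False` returns, here is a minimal sketch on a small random matrix (the matrix `A` and the seed are made up for illustration). It checks the shapes of the three factors and that they multiply back to the original matrix.

```python
import numpy as np
from scipy.linalg import svd

# Hypothetical 6 x 3 matrix standing in for a scaled data matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))

# Reduced SVD: U is 6 x 3, sigma holds 3 singular values, VT is 3 x 3.
U, sigma, VT = svd(A, full_matrices=False)
print(U.shape, sigma.shape, VT.shape)  # (6, 3) (3,) (3, 3)

# sigma is a 1-D array, so it must be placed on a diagonal to rebuild A.
A_rebuilt = U @ np.diag(sigma) @ VT
print(np.allclose(A, A_rebuilt))
```

The singular values come back as a 1-D array sorted in descending order, which is why `sigma.shape` above is `(35,)` for the 35-column housing data.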


[Back to top](#Index:) 

## Problem 3

### Percent Variance Explained

**5 Points**

To compute the percent variance explained, divide each singular value by the sum of the singular values.  Assign the resulting proportions as an array to `percent_variance_explained` below.  Because of floating-point rounding, the proportions may not sum to exactly 1.  

In [12]:
percent_variance_explained = sigma / sigma.sum()
print(percent_variance_explained.shape)
print(percent_variance_explained.sum())

(35,)
1.0
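
This activity defines the share of variance using the raw singular values. A common PCA convention instead uses the squared singular values, because for centered data with $n$ rows the variance along each component equals $\sigma_i^2 / (n - 1)$. The sketch below compares the two on synthetic data; the matrix `X`, the seed, and the variable names are assumptions for illustration, and it is not a replacement for the answer this activity expects.

```python
import numpy as np
from scipy.linalg import svd

# Hypothetical data: 200 rows, 4 columns with different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)  # center each column

_, sigma, _ = svd(Xc, full_matrices=False)
n = Xc.shape[0]

# Variance along each component, compared with covariance eigenvalues.
component_variances = sigma**2 / (n - 1)
cov_eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
print(np.allclose(component_variances, cov_eigenvalues))

# The two ways of turning singular values into shares.
ratio_activity = sigma / sigma.sum()           # definition used in this activity
ratio_squared = sigma**2 / (sigma**2).sum()    # squared-singular-value convention
print(ratio_activity.round(3), ratio_squared.round(3))
```

Both arrays sum to 1, but the squared version assigns a larger share to the leading component, so the number of components needed to reach a given threshold can differ between the two definitions.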


[Back to top](#Index:) 

## Problem 4

### Cumulative Variance Explained

**5 Points**

Using the solution to Problem 3, how many principal components are needed to retain up to 80% of the explained variance when they are considered in descending order?  Assign your response to `ans4` below as an integer. 

**HINT**: explore the `np.cumsum` function.

In [13]:
csum_percent_variance_explained = np.cumsum(percent_variance_explained)
# np.argmin on a boolean array returns the index of the first False, i.e. the
# first cumulative value above 0.8; that index equals the number of components
# whose cumulative share is still at or below 80%.
ans4 = int(np.argmin(csum_percent_variance_explained <= 0.8))
# cumulative shares just before, at, and just after the threshold crossing
csum_percent_variance_explained[ans4 - 1 : ans4 + 2]

array([0.79116773, 0.81691206, 0.8419786 ])
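
The counting logic can be checked by hand on a short array. The sketch below uses made-up shares (not values from the housing data) and shows three ways to count components against an 80% threshold; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical shares of variance, already in descending order.
shares = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
cumulative = np.cumsum(shares)  # roughly 0.45, 0.70, 0.85, 0.95, 1.00

# Components whose cumulative share stays at or below 80%.
count_up_to = int((cumulative <= 0.8).sum())

# Same number via argmin: index of the first False in the boolean mask.
first_above = int(np.argmin(cumulative <= 0.8))

# Components needed to reach at least 80% (one more in this example).
needed_to_reach = int(np.searchsorted(cumulative, 0.8, side="left")) + 1

print(count_up_to, first_above, needed_to_reach)  # 2 2 3
```

Counting with `.sum()` is slightly more robust than `np.argmin`: if every cumulative value were at or below the threshold, the mask would be all `True` and `np.argmin` would return 0 instead of the full count.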