# Problem Set 3

## Problem 1: Conceptual Machine Learning (20 points)
######  Answer each question with 3-4 sentences. Make sure you fully explain your answer and back it with evidence.

#### a. Does Gradient Descent always converge to an optimum?

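Not always.  Gradient descent minimizes the cost function by repeatedly updating the parameters in the direction of the negative gradient.  For a convex cost function and a suitably small step size, it converges to the global minimum.  For a non-convex cost function, it can settle in a local minimum or stall near a saddle point, so where it lands depends on the starting values.  The step size matters as well: if it is too large, the updates can overshoot the minimum and oscillate or diverge instead of converging.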
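
As a small illustration of the dependence on the starting point, the sketch below runs plain gradient descent on a made-up non-convex function (the function, learning rate, and step count are assumptions chosen for the example, not part of the assignment).

```python
def f(x):
    # Hypothetical non-convex cost with two separate minima
    return x**4 - 3 * x**2 + x

def grad(x):
    # Derivative of f
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    # Repeatedly step against the gradient, starting from x0
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = gradient_descent(-2.0)   # ends in the minimum on the left
right = gradient_descent(2.0)   # ends in the minimum on the right

print(left, f(left))
print(right, f(right))
```

Both runs converge to a point where the gradient is essentially zero, but only the run started on the left reaches the lower of the two minima.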

#### b. What Happens when learning rate is small and large in Gradient Descent?

When the learning rate is too small, the algorithm needs many iterations to converge, so training takes a long time.  Conversely, when the learning rate is too large, each update can overshoot the minimum, and the parameter values may grow larger and larger until the algorithm diverges.  When this happens, it fails to find a good solution.
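
A minimal sketch of these three regimes, assuming the simple cost f(x) = x^2 (gradient 2x), a starting point of 10, and 50 updates; the specific learning rates are picked only for illustration.

```python
def run(lr, x0=10.0, steps=50):
    # Gradient descent on f(x) = x**2, whose gradient is 2*x
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

too_small = run(0.001)  # barely moves away from the starting point
reasonable = run(0.1)   # gets very close to the minimum at 0
too_large = run(1.1)    # each step overshoots and the iterate grows

print(too_small, reasonable, too_large)
```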

#### c. How Can you relate KNN algorithm to bias variance tradeoff?

In a KNN algorithm, K controls the complexity of the model.  A small K (for example, K = 1) gives a very flexible model that follows the training data closely, so it has low bias but high variance.  Increasing K averages over more neighbors, which smooths the predictions: variance decreases, but bias increases.  Choosing K is therefore a direct way of trading bias against variance, and it is usually done with cross-validation.
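
The two extremes of K can be seen in a tiny hand-rolled KNN regressor. This is a sketch on made-up one-dimensional data, not a library implementation; `knn_predict` and `train_mse` are hypothetical helper names.

```python
def knn_predict(x_train, y_train, x_query, k):
    # Average the targets of the k training points closest to x_query
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_query))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k

def train_mse(x_train, y_train, k):
    # Mean squared error of the model on its own training points
    errors = [(knn_predict(x_train, y_train, x, k) - y) ** 2
              for x, y in zip(x_train, y_train)]
    return sum(errors) / len(errors)

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0]

mse_k1 = train_mse(x, y, 1)  # each point is its own nearest neighbor
mse_k8 = train_mse(x, y, 8)  # every prediction is the overall mean

print(mse_k1, mse_k8)
```

With K = 1 the model reproduces the training data exactly, which is the high-variance end. With K equal to the sample size it predicts the same value everywhere, which is the high-bias end.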

#### d. What do you do if you have High Variance Problem?


A high variance problem occurs when the model fits the training data very well but performs poorly on new data (the test set), which is also called "over-fitting".  It is typically caused by a model that is too complex, for example one that uses too many features or higher-order polynomial terms that create curves unrelated to the underlying pattern in the data.  It can be addressed by removing unnecessary features, by collecting more training data, or through regularization.  Regularization keeps all of the features in the model but shrinks the magnitude of their coefficients.
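
To make the shrinkage concrete, here is a minimal sketch of ridge regression for a single feature with no intercept, where the closed-form solution is w = sum(x*y) / (sum(x^2) + alpha). The data and the helper name `ridge_slope` are assumptions made for the example.

```python
def ridge_slope(x, y, alpha):
    # Minimizes sum((y - w*x)**2) + alpha * w**2 for one feature, no intercept
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# alpha = 0 is ordinary least squares; larger alpha shrinks the coefficient
slopes = {alpha: ridge_slope(x, y, alpha) for alpha in (0, 1, 10, 100)}
print(slopes)
```

The feature stays in the model for every value of alpha, but its coefficient moves toward zero as the penalty grows.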

#### e. What do you do if you have High Bias Problem?

High bias occurs when the model fits even the training data poorly, which is also called "underfitting".  The problem is mostly caused by a model that is too simple or has too few features.  We can address it by adding more features to the training data.  If new features are not available, we can create them by combining two or more existing features or by adding polynomial terms.  Adding too many features, however, will lead to a high variance problem.
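
As a sketch of how an extra feature reduces bias, the example below fits a straight line and then a quadratic to synthetic data with a curved trend. The data-generating formula is an assumption chosen for the illustration.

```python
import numpy as np

# Synthetic data with a clearly curved trend (assumed for this example)
x = np.linspace(-3, 3, 50)
y = x**2 + 0.5 * x + 1.0 + 0.1 * np.sin(5 * x)

def train_mse(degree):
    # Fit a polynomial of the given degree and return its training MSE
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

mse_linear = train_mse(1)     # too simple: cannot follow the curve
mse_quadratic = train_mse(2)  # adding the x**2 feature removes most of the error

print(mse_linear, mse_quadratic)
```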

## Problem 2: Supervised Learning with scikit-learn (30 points).

###### Complete the Supervised Learning with scikit-learn Assignment on Datacamp platform.


## Problem 3: House Prices - Advanced Regression Techniques (50 points).

###### We will continue the "Lab 6 Housing Prices.ipynb Exercise" that we started in the class with a few adjustments.  I will ask you to manipulate the data in certain ways, run ridge and lasso regression algorithms, and evaluate the model’s performance. Specifically, you will:

In [1]:
# import common packages
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt


# import dataset
url_test = 'https://raw.githubusercontent.com/assamidanov/econ_590/main/datasets/test.csv'
url_train = 'https://raw.githubusercontent.com/assamidanov/econ_590/main/datasets/train.csv'

# create dataframe
train = pd.read_csv(url_train, index_col=[0])
test = pd.read_csv(url_test, index_col=[0])

#### a. Check for missing values and data types.

In [2]:
# check data types for training set
train.info() # total_bedrooms has missing values, ocean_proximity is 'object'

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 0 to 16511
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           16512 non-null  float64
 1   latitude            16512 non-null  float64
 2   housing_median_age  16512 non-null  float64
 3   total_rooms         16512 non-null  float64
 4   total_bedrooms      16362 non-null  float64
 5   population          16512 non-null  float64
 6   households          16512 non-null  float64
 7   median_income       16512 non-null  float64
 8   ocean_proximity     16512 non-null  object 
 9   median_house_value  16512 non-null  float64
dtypes: float64(9), object(1)
memory usage: 1.4+ MB


In [3]:
# check for missing values for the training set
train.isnull().sum() # total bedrooms NaN is less than 1% of the dataset

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        150
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64

In [4]:
# check data types for the test set
test.info() # total_bedrooms has missing values, ocean_proximity is 'object'

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 0 to 4127
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   float64
 3   total_rooms         4128 non-null   float64
 4   total_bedrooms      4071 non-null   float64
 5   population          4128 non-null   float64
 6   households          4128 non-null   float64
 7   median_income       4128 non-null   float64
 8   ocean_proximity     4128 non-null   object 
dtypes: float64(8), object(1)
memory usage: 322.5+ KB


In [5]:
# check for missing values for the test set
test.isna().sum() # total_bedrooms NaN is about 1% of the test set

longitude              0
latitude               0
housing_median_age     0
total_rooms            0
total_bedrooms        57
population             0
households             0
median_income          0
ocean_proximity        0
dtype: int64

#### b. Analyze the missing values, and decide whether you want to drop them or impute mean, median, or mode.

I ran the regressions multiple ways:  
   1.  Dropping NaN values from the total_bedrooms column
   2.  Replacing the NaN values with the median
   3.  Replacing NaN values with the mode
   4.  Replacing NaN values with the mean

Each approach produced slightly different results.  The best results came from replacing the NaN values with the mean of the total_bedrooms column.  Dropping the rows instead discards the information in their other columns, which appears to hurt the model.
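
The four options can be set side by side on a toy column. This is only a sketch with made-up values; it shows what each strategy does to the data, not how each one affected the regressions.

```python
import pandas as pd

# Toy column with one missing value (values are made up)
df = pd.DataFrame({"total_bedrooms": [100.0, 200.0, 200.0, None, 900.0]})
col = df["total_bedrooms"]

candidates = {
    "drop": col.dropna(),
    "mean": col.fillna(col.mean()),
    "median": col.fillna(col.median()),
    "mode": col.fillna(col.mode().iloc[0]),
}

# For each strategy: number of rows kept and the resulting column mean
summary = {name: (len(s), round(float(s.mean()), 1)) for name, s in candidates.items()}
print(summary)
```

Dropping shortens the data set, while the three imputation strategies keep every row and differ only in the value they fill in.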

In [6]:
# Replace NaN values in the training set with the mean of total_bedrooms
mean_bedroom = train["total_bedrooms"].mean()
# assign the result back instead of using inplace=True on a column selection
train["total_bedrooms"] = train["total_bedrooms"].fillna(mean_bedroom)

In [7]:
train.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
median_house_value    0
dtype: int64

In [8]:
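# replace the NaN values in the test set with the test-set mean of total_bedrooms
mean_bedroom = test["total_bedrooms"].mean()
# assign the result back instead of using inplace=True on a column selection
test["total_bedrooms"] = test["total_bedrooms"].fillna(mean_bedroom)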

In [9]:
test.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
dtype: int64

#### c. Convert categorical variables into dummy variables.

In [10]:
# convert ocean_proximity to dummy variables for the training set
train = pd.get_dummies(train, columns = ['ocean_proximity'],
                     prefix = '', prefix_sep = '')

# view the results
train

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-119.25,35.77,35.0,1618.0,378.0,1449.0,398.0,1.6786,56500.0,0,1,0,0,0
1,-119.65,36.35,21.0,1745.0,266.0,837.0,292.0,4.3911,107900.0,0,1,0,0,0
2,-120.02,39.24,32.0,1347.0,444.0,825.0,303.0,1.8269,225000.0,0,1,0,0,0
3,-118.25,33.79,32.0,1205.0,340.0,1799.0,370.0,2.3750,128000.0,0,0,0,0,1
4,-117.58,33.92,16.0,4157.0,586.0,2036.0,594.0,6.1550,246400.0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16507,-122.11,37.64,31.0,1487.0,280.0,854.0,301.0,5.2312,197600.0,0,0,0,1,0
16508,-118.18,33.98,38.0,1477.0,374.0,1514.0,408.0,2.5703,178600.0,1,0,0,0,0
16509,-118.12,33.79,41.0,1762.0,314.0,738.0,300.0,4.1687,240700.0,1,0,0,0,0
16510,-122.29,37.51,35.0,3040.0,520.0,1374.0,518.0,6.1004,426400.0,0,0,0,0,1


In [11]:
# convert ocean_proximity to dummy for test set
test = pd.get_dummies(test, columns = ['ocean_proximity'],
                     prefix = '', prefix_sep = '')

# view the results
test

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
0,-117.67,33.61,23.0,3588.0,577.0,1695.0,569.0,6.1401,1,0,0,0,0
1,-122.46,37.75,52.0,1590.0,236.0,622.0,232.0,5.8151,0,0,0,1,0
2,-121.97,37.79,17.0,5688.0,824.0,2111.0,773.0,6.6131,1,0,0,0,0
3,-117.16,34.06,17.0,2285.0,554.0,1412.0,541.0,1.8152,0,1,0,0,0
4,-118.42,34.09,40.0,3552.0,392.0,1024.0,370.0,15.0001,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4123,-122.13,37.74,41.0,4400.0,666.0,1476.0,648.0,5.0000,0,0,0,1,0
4124,-118.18,33.93,35.0,952.0,271.0,949.0,261.0,2.4297,1,0,0,0,0
4125,-118.29,34.16,35.0,1257.0,318.0,764.0,319.0,3.2083,1,0,0,0,0
4126,-122.20,37.76,37.0,2680.0,736.0,1925.0,667.0,1.4097,0,0,0,1,0
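
Here both sets happen to contain all five categories, so the dummy columns line up. If a category were missing from one set, the columns would differ and the fitted model could not be applied directly. A minimal sketch of one way to guard against that, using small made-up frames (`train_demo` and `test_demo` are hypothetical names):

```python
import pandas as pd

train_demo = pd.DataFrame({"ocean_proximity": ["INLAND", "ISLAND", "NEAR BAY"]})
test_demo = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY"]})  # no ISLAND rows

train_d = pd.get_dummies(train_demo, columns=["ocean_proximity"],
                         prefix="", prefix_sep="")
test_d = pd.get_dummies(test_demo, columns=["ocean_proximity"],
                        prefix="", prefix_sep="")

# Give the test frame exactly the training columns, in the same order,
# filling any category that never appears in the test data with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```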


#### d. Leverage data and create new variables that aren’t in the training set. Produce new features with the goal of enhancing model accuracy.


Housing prices probably depend most heavily on location, but other factors can influence them as well.  I do not believe that the 'total_bedrooms', 'households', and 'total_rooms' columns tell us much in their current form.  The total number of rooms in a district matters less than the number of rooms per house in that district.  For this reason, I created the following:

1.  total_bedrooms per household:  the number of bedrooms in a house can be very important, especially for families.  A buyer cares about how many bedrooms a house has, not how many a region has.

2.  total_bedrooms per population:  I may drop this one, as it may not be a good indicator.  Couples typically need only one bedroom, and children can share rooms.

3.  population per household:  a high population per household may indicate a housing shortage, which would drive prices up, or it may simply mean that the district has many families.  (I may not want to include this one.)

4.  total_rooms per household:  this is similar to bedrooms per household.  Again, we want house-level attributes, not region-level ones.
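
Because the same four ratios are needed for both the training and the test set, they can also be built by one helper so the two sets cannot drift apart. The sketch below uses a hypothetical function name, `add_ratio_features`, and checks it against the first row of the training data.

```python
import pandas as pd

def add_ratio_features(df):
    # Return a copy of df with the four per-household / per-person ratios added
    out = df.copy()
    out["beds_pop"] = out["total_bedrooms"] / out["population"]
    out["rooms_house"] = out["total_rooms"] / out["households"]
    out["bed_house"] = out["total_bedrooms"] / out["households"]
    out["pop_house"] = out["population"] / out["households"]
    return out

# First row of the training set, used here as a quick check
demo = pd.DataFrame({
    "total_rooms": [1618.0],
    "total_bedrooms": [378.0],
    "population": [1449.0],
    "households": [398.0],
})
features = add_ratio_features(demo)
print(features.iloc[0])
```

The same call would then be applied to each frame, for example `train = add_ratio_features(train)`.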

In [12]:
# create bedrooms per population variable 
# for the training set
train['beds_pop'] = (train['total_bedrooms'])/(train['population'])

# for the test set
test['beds_pop'] = (test['total_bedrooms'])/(test['population'])

In [13]:
# total rooms per household 

# for training set
train['rooms_house'] = (train['total_rooms'])/(train['households'])

# for test set
test['rooms_house'] = (test['total_rooms']/(test['households']))

In [14]:
# bedrooms per household 

# for training set
train['bed_house'] = (train['total_bedrooms'])/(train['households'])

# for test set
test['bed_house'] = (test['total_bedrooms'])/(test['households'])

In [15]:
# persons per household 

# for training set
train['pop_house'] = (train['population'])/(train['households'])

# for test set
test['pop_house'] = (test['population'])/(test['households'])

In [16]:
# squared housing age for the training set (lets a linear model capture a non-linear age effect)
train['age_2'] = (train['housing_median_age'])**2

In [17]:
# squared housing age for the test set
test['age_2'] = (test['housing_median_age'])**2

In [18]:
# show the results for the training set
train

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN,beds_pop,rooms_house,bed_house,pop_house,age_2
0,-119.25,35.77,35.0,1618.0,378.0,1449.0,398.0,1.6786,56500.0,0,1,0,0,0,0.260870,4.065327,0.949749,3.640704,1225.0
1,-119.65,36.35,21.0,1745.0,266.0,837.0,292.0,4.3911,107900.0,0,1,0,0,0,0.317802,5.976027,0.910959,2.866438,441.0
2,-120.02,39.24,32.0,1347.0,444.0,825.0,303.0,1.8269,225000.0,0,1,0,0,0,0.538182,4.445545,1.465347,2.722772,1024.0
3,-118.25,33.79,32.0,1205.0,340.0,1799.0,370.0,2.3750,128000.0,0,0,0,0,1,0.188994,3.256757,0.918919,4.862162,1024.0
4,-117.58,33.92,16.0,4157.0,586.0,2036.0,594.0,6.1550,246400.0,0,1,0,0,0,0.287819,6.998316,0.986532,3.427609,256.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16507,-122.11,37.64,31.0,1487.0,280.0,854.0,301.0,5.2312,197600.0,0,0,0,1,0,0.327869,4.940199,0.930233,2.837209,961.0
16508,-118.18,33.98,38.0,1477.0,374.0,1514.0,408.0,2.5703,178600.0,1,0,0,0,0,0.247028,3.620098,0.916667,3.710784,1444.0
16509,-118.12,33.79,41.0,1762.0,314.0,738.0,300.0,4.1687,240700.0,1,0,0,0,0,0.425474,5.873333,1.046667,2.460000,1681.0
16510,-122.29,37.51,35.0,3040.0,520.0,1374.0,518.0,6.1004,426400.0,0,0,0,0,1,0.378457,5.868726,1.003861,2.652510,1225.0


In [19]:
# show the results for the test set
test

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN,beds_pop,rooms_house,bed_house,pop_house,age_2
0,-117.67,33.61,23.0,3588.0,577.0,1695.0,569.0,6.1401,1,0,0,0,0,0.340413,6.305800,1.014060,2.978910,529.0
1,-122.46,37.75,52.0,1590.0,236.0,622.0,232.0,5.8151,0,0,0,1,0,0.379421,6.853448,1.017241,2.681034,2704.0
2,-121.97,37.79,17.0,5688.0,824.0,2111.0,773.0,6.6131,1,0,0,0,0,0.390336,7.358344,1.065977,2.730918,289.0
3,-117.16,34.06,17.0,2285.0,554.0,1412.0,541.0,1.8152,0,1,0,0,0,0.392351,4.223660,1.024030,2.609982,289.0
4,-118.42,34.09,40.0,3552.0,392.0,1024.0,370.0,15.0001,1,0,0,0,0,0.382812,9.600000,1.059459,2.767568,1600.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4123,-122.13,37.74,41.0,4400.0,666.0,1476.0,648.0,5.0000,0,0,0,1,0,0.451220,6.790123,1.027778,2.277778,1681.0
4124,-118.18,33.93,35.0,952.0,271.0,949.0,261.0,2.4297,1,0,0,0,0,0.285564,3.647510,1.038314,3.636015,1225.0
4125,-118.29,34.16,35.0,1257.0,318.0,764.0,319.0,3.2083,1,0,0,0,0,0.416230,3.940439,0.996865,2.394984,1225.0
4126,-122.20,37.76,37.0,2680.0,736.0,1925.0,667.0,1.4097,0,0,0,1,0,0.382338,4.017991,1.103448,2.886057,1369.0


In [20]:
# drop the district-level totals, which are now replaced by the per-household ratios

# drop from training set
train = train.drop(['households', 'total_bedrooms', 'total_rooms'], axis = 1)

In [21]:
# show the results for the training set
train

Unnamed: 0,longitude,latitude,housing_median_age,population,median_income,median_house_value,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN,beds_pop,rooms_house,bed_house,pop_house,age_2
0,-119.25,35.77,35.0,1449.0,1.6786,56500.0,0,1,0,0,0,0.260870,4.065327,0.949749,3.640704,1225.0
1,-119.65,36.35,21.0,837.0,4.3911,107900.0,0,1,0,0,0,0.317802,5.976027,0.910959,2.866438,441.0
2,-120.02,39.24,32.0,825.0,1.8269,225000.0,0,1,0,0,0,0.538182,4.445545,1.465347,2.722772,1024.0
3,-118.25,33.79,32.0,1799.0,2.3750,128000.0,0,0,0,0,1,0.188994,3.256757,0.918919,4.862162,1024.0
4,-117.58,33.92,16.0,2036.0,6.1550,246400.0,0,1,0,0,0,0.287819,6.998316,0.986532,3.427609,256.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16507,-122.11,37.64,31.0,854.0,5.2312,197600.0,0,0,0,1,0,0.327869,4.940199,0.930233,2.837209,961.0
16508,-118.18,33.98,38.0,1514.0,2.5703,178600.0,1,0,0,0,0,0.247028,3.620098,0.916667,3.710784,1444.0
16509,-118.12,33.79,41.0,738.0,4.1687,240700.0,1,0,0,0,0,0.425474,5.873333,1.046667,2.460000,1681.0
16510,-122.29,37.51,35.0,1374.0,6.1004,426400.0,0,0,0,0,1,0.378457,5.868726,1.003861,2.652510,1225.0


In [22]:
# drop from test set
test = test.drop(['households', 'total_bedrooms', 'total_rooms'], axis = 1)

In [23]:
# show the results for the test set
test

Unnamed: 0,longitude,latitude,housing_median_age,population,median_income,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN,beds_pop,rooms_house,bed_house,pop_house,age_2
0,-117.67,33.61,23.0,1695.0,6.1401,1,0,0,0,0,0.340413,6.305800,1.014060,2.978910,529.0
1,-122.46,37.75,52.0,622.0,5.8151,0,0,0,1,0,0.379421,6.853448,1.017241,2.681034,2704.0
2,-121.97,37.79,17.0,2111.0,6.6131,1,0,0,0,0,0.390336,7.358344,1.065977,2.730918,289.0
3,-117.16,34.06,17.0,1412.0,1.8152,0,1,0,0,0,0.392351,4.223660,1.024030,2.609982,289.0
4,-118.42,34.09,40.0,1024.0,15.0001,1,0,0,0,0,0.382812,9.600000,1.059459,2.767568,1600.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4123,-122.13,37.74,41.0,1476.0,5.0000,0,0,0,1,0,0.451220,6.790123,1.027778,2.277778,1681.0
4124,-118.18,33.93,35.0,949.0,2.4297,1,0,0,0,0,0.285564,3.647510,1.038314,3.636015,1225.0
4125,-118.29,34.16,35.0,764.0,3.2083,1,0,0,0,0,0.416230,3.940439,0.996865,2.394984,1225.0
4126,-122.20,37.76,37.0,1925.0,1.4097,0,0,0,1,0,0.382338,4.017991,1.103448,2.886057,1369.0


In [24]:
# set up training features and label
train_features = train.drop('median_house_value', axis = 1)
train_label = train['median_house_value']

# import packages
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train_features, 
                                                    train_label, 
                                                    test_size = 0.2, 
                                                    shuffle = True)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)                                                   

(13209, 15) (3303, 15) (13209,) (3303,)


#### f.  Find out the best parameters.


In [25]:
# import packages
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from math import sqrt

In [26]:
#### LASSO REGRESSION ####
lasso_reg = Lasso()
lasso_reg.fit(x_train, y_train)



In [27]:
# print training and testing RMSE
print("Training Error:")
print(sqrt(mean_squared_error(y_train, lasso_reg.predict(x_train))))
testing_predictions_linear = lasso_reg.predict(x_test)
print("Testing Error:")
print(sqrt(mean_squared_error(y_test, testing_predictions_linear)))

Training Error:
67358.09407458916
Testing Error:
68490.71968980173


In [28]:
# establish parameters
# Finetune with GridSearch
params = {'alpha': [0.01, 0.1, 1, 5, 10]}
lasso_gs = GridSearchCV(lasso_reg,
                       params,
                       cv=5,
                       scoring='neg_root_mean_squared_error',
                       n_jobs=-1)

In [29]:
# fit to our model
lasso_gs.fit(train_features, train_label)



In [30]:
# show best parameters for this model
lasso_gs.best_params_

{'alpha': 0.01}

In [31]:
#### RIDGE REGRESSION ####
ridge = Ridge()
ridge.fit(x_train, y_train)

In [32]:
# print training and testing RMSE
print("Training Error:")
print(sqrt(mean_squared_error(y_train, ridge.predict(x_train))))
testing_predictions_linear = ridge.predict(x_test)
print("Testing Error:")
print(sqrt(mean_squared_error(y_test, testing_predictions_linear)))

Training Error:
67359.76418663852
Testing Error:
68505.03910745578


In [33]:
# establish parameters
# Finetune with GridSearch
params = {'alpha':[0.01, 0.1, 1, 5, 10]}
ridge_gs = GridSearchCV(ridge,
                       params,
                       scoring = 'neg_root_mean_squared_error',
                       n_jobs = -1, cv = 5)

In [34]:
# fit to the model
ridge_gs.fit(train_features, train_label)

In [35]:
# show best parameter for this model
ridge_gs.best_params_

{'alpha': 10}
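
Both selected values sit at an edge of the grid (the smallest alpha for lasso, the largest for ridge), so the true optimum may lie outside the range that was searched. The penalties are also sensitive to the scale of the features. A sketch of one way to address both, assuming standardized features and a wider logarithmic grid; it runs on synthetic data from `make_regression`, not on the housing data, and is shown for ridge only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Scale inside the pipeline so each CV fold is standardized on its own training part
pipe = make_pipeline(StandardScaler(), Ridge())

# Wider, log-spaced grid: 0.001, 0.01, ..., 1000
param_grid = {"ridge__alpha": np.logspace(-3, 3, 7)}

search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
print(best_alpha, -search.best_score_)
```

If the best value still lands on an edge of the wider grid, the grid can be extended further in that direction.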

#### g.  Create predictions results for the test dataset leveraging parameters from grid search.


In [36]:
#### LASSO REGRESSION ####

# show RMSE (negated by scikit-learn) for each alpha level
means = lasso_gs.cv_results_['mean_test_score']
params = lasso_gs.cv_results_['params']

for mean, param in zip(means, params):
    print("%f with %r" % (mean, param))

-68020.677664 with {'alpha': 0.01}
-68020.897319 with {'alpha': 0.1}
-68023.251777 with {'alpha': 1}
-68026.361559 with {'alpha': 5}
-68022.556514 with {'alpha': 10}


In [37]:
lasso_best = Lasso(alpha = 0.01)

In [38]:
# fit to the model using best parameter
lasso_best.fit(train_features, train_label)



In [39]:
# predict housing values for the test set
lasso_best.predict(test)

array([293095.35343419, 333356.25036948, 318721.89646336, ...,
       214886.77951813, 140401.95736578, 368706.788166  ])

In [40]:
lasso_best.predict(x_test)

array([ 13855.87075703, 123435.69108296, 183274.32937766, ...,
       270982.95574245, 253863.66999125, 112415.64985603])

In [41]:
#### RIDGE REGRESSION ####

# show RMSE (negated by scikit-learn) for each alpha level
means = ridge_gs.cv_results_['mean_test_score']
params = ridge_gs.cv_results_['params']

for mean, param in zip(means, params):
    print("%f with %r" % (mean, param))

-68020.598256 with {'alpha': 0.01}
-68020.097512 with {'alpha': 0.1}
-68015.071473 with {'alpha': 1}
-67996.994140 with {'alpha': 5}
-67984.425562 with {'alpha': 10}


In [42]:
ridge_best = Ridge(alpha = 10)

In [43]:
# fit to the model using best param
ridge_best.fit(train_features, train_label)

In [44]:
# predict housing values for the test set
ridge_best.predict(test)

array([293003.84141566, 333773.28247287, 318409.48585452, ...,
       214226.26751952, 140926.62218152, 368679.12695738])

In [45]:
ridge_best.predict(x_test)

array([ 15973.70658017, 124213.95229682, 183713.39357803, ...,
       270072.34763403, 252035.47204579, 111566.12067475])

Both models perform almost identically.  Ridge regression produces slightly better results: in the grid search output above, its best mean cross-validated RMSE is about 67,984 (alpha = 10), compared to about 68,021 for lasso regression (alpha = 0.01).
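
To put a number on how close the two models are, the short sketch below reuses the mean cross-validated scores printed by the two grid searches. The scores are negated RMSE values, so the best alpha is the one with the largest score; the helper name `best` is hypothetical.

```python
# Mean CV scores copied from the grid search output (negated RMSE)
lasso_cv = {0.01: -68020.677664, 0.1: -68020.897319, 1: -68023.251777,
            5: -68026.361559, 10: -68022.556514}
ridge_cv = {0.01: -68020.598256, 0.1: -68020.097512, 1: -68015.071473,
            5: -67996.994140, 10: -67984.425562}

def best(scores):
    # Largest (least negative) score wins; flip the sign to report RMSE
    alpha = max(scores, key=scores.get)
    return alpha, -scores[alpha]

lasso_alpha, lasso_rmse = best(lasso_cv)
ridge_alpha, ridge_rmse = best(ridge_cv)

# Relative improvement of ridge over lasso
relative_gap = (lasso_rmse - ridge_rmse) / lasso_rmse
print(lasso_alpha, lasso_rmse, ridge_alpha, ridge_rmse, relative_gap)
```

The gap is well under a tenth of a percent of the RMSE, which supports treating the two models as practically equivalent on this data.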