# Regression Models using Refined Parameters (Ames, Iowa Housing)

### Loading, Splitting and Scaling the data

In [1]:
cd ..

/home/jovyan/04-Final


In [2]:
%run __init__.py

In [3]:
%run src/load_data.py

In [4]:
import warnings
warnings.filterwarnings('ignore')

In [5]:
from sklearn import linear_model
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

from tqdm import tqdm

In [6]:
train_data =  data['ames']['train']['engineered']
train_labels = data['ames']['train']['labels']
test_data =  data['ames']['test']['engineered']
test_labels = data['ames']['test']['labels']

### Models and parameter ranges used for fine-tuning

In [7]:
models = {
    'Ridge' : linear_model.Ridge(),
    'Lasso' : linear_model.Lasso(),
    'K Nearest Neighbors' : KNeighborsRegressor(),
    'Decision Tree' : DecisionTreeRegressor(),
    'Support Vector Machines - Linear' : SVR(kernel ='linear'),
}

In [8]:
models_params = {
    'Ridge' : {'alpha': range(1,50)},
    'Lasso' : {'alpha': range(1,50)},
    'K Nearest Neighbors' : {'n_neighbors': range(1,20)},
    'Decision Tree' : {'max_depth': range(1,200)},
    'Support Vector Machines - Linear' : {'C': range(1,250,10)},
}

In [9]:
def run_model_grid_search(model_name, X_train, y_train, X_test, y_test):
    model = models[model_name]
    reg_params = models_params[model_name]
    model_gs = GridSearchCV(model, 
                      param_grid= reg_params,
                      cv=5,
                      return_train_score=True)
    model_gs.fit(X_train, y_train)
    return {
        'model_name' : model_name,
        'model_best_params' : model_gs.best_params_,
        # best_score_ is the mean 5-fold cross-validated R^2 of the best candidate
        'model_train_score' : model_gs.best_score_,
        'model_test_score' : model_gs.score(X_test, y_test)
    }

In [10]:
results = []
for model_name in tqdm(models.keys()):
    results.append(run_model_grid_search(model_name, train_data, train_labels, test_data, test_labels))

100%|██████████| 5/5 [09:31<00:00, 114.21s/it]


### $R^2$ scores with best parameter values to use

In [11]:
import pandas as pd  # may already be defined by src/load_data.py; harmless to repeat
results_df = pd.DataFrame(results)
results_df

|   | model_best_params | model_name | model_test_score | model_train_score |
|---|---|---|---|---|
| 0 | {'alpha': 27} | Ridge | 0.871117 | 0.859448 |
| 1 | {'alpha': 23} | Lasso | 0.84838 | 0.860783 |
| 2 | {'n_neighbors': 14} | K Nearest Neighbors | 0.780543 | 0.830535 |
| 3 | {'max_depth': 8} | Decision Tree | 0.685095 | 0.792498 |
| 4 | {'C': 241} | Support Vector Machines - Linear | 0.874041 | 0.850832 |


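As the results show, for Ridge and the linear-kernel SVM the test score is slightly higher than the train score. Note that `model_train_score` is taken from `best_score_`, the mean $R^2$ over the five cross-validation folds, where each model is fit on only four fifths of the training data, while the test score comes from the best model refit on the full training set. With scores this close, the small gap is plausibly due to sampling variation between the splits. Once again the linear models outperformed the other models, which further confirms our earlier observation.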
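To make the distinction between the scores concrete, here is a minimal sketch on synthetic stand-in data (an assumption: `make_regression` replaces the Ames set, so the numbers are not comparable to the table). It reads the cross-validated score, the score on the fitted folds, and the held-out test score from the same `GridSearchCV` object.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption; not the Ames housing set).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gs = GridSearchCV(Ridge(), param_grid={'alpha': list(range(1, 50))},
                  cv=5, return_train_score=True)
gs.fit(X_train, y_train)

best = gs.best_index_
# Mean R^2 on the held-out fold of each split (this is what best_score_ reports).
cv_score = gs.cv_results_['mean_test_score'][best]
# Mean R^2 on the folds each model was actually fit on.
fold_train_score = gs.cv_results_['mean_train_score'][best]
# R^2 of the refit best model on data never seen during the search.
test_score = gs.score(X_test, y_test)

print(f"cv: {cv_score:.4f}  fitted folds: {fold_train_score:.4f}  test: {test_score:.4f}")
```

Reporting `mean_train_score` alongside `best_score_` would separate "fit on the training folds" from "generalization within the training set", which the single `model_train_score` column does not do.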

Although all models improved after applying this step, the SVM model with the linear kernel improved considerably once the C parameter was tuned, going from the lowest $R^2$ score to the highest.
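One detail worth checking in the table: the best `C` of 241 is the largest value in the searched grid `range(1, 250, 10)`, so the true optimum may lie beyond the range that was tried. The helper below is a hypothetical utility (not part of the original notebook) that flags parameters whose best value sits on the edge of their grid.

```python
def best_on_grid_edge(best_params, param_grid):
    """Return names of parameters whose best value is the min or max searched."""
    edge = []
    for name, values in param_grid.items():
        values = list(values)
        if best_params[name] in (min(values), max(values)):
            edge.append(name)
    return edge

# Values taken from the results table and the parameter grids above.
svm_edge = best_on_grid_edge({'C': 241}, {'C': range(1, 250, 10)})
ridge_edge = best_on_grid_edge({'alpha': 27}, {'alpha': range(1, 50)})
print(svm_edge)    # ['C']
print(ridge_edge)  # []
```

A flagged parameter suggests widening the grid in that direction before trusting the reported optimum.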

We should also note that for this project we have a limited amount of computing resources (namely our AWS server), and it is not feasible to experiment with a large number of parameters.
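When compute is the constraint, one common alternative to an exhaustive grid is a randomized search, which evaluates only a fixed number of sampled candidates. The sketch below is an illustration on synthetic data, not the approach used in this notebook; the data, `n_iter=20`, and the `random_state` values are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption; not the Ames housing set).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions={'max_depth': list(range(1, 200))},  # same range as the grid above
    n_iter=20,        # 20 sampled depths instead of all 199
    cv=5,
    random_state=0,
)
search.fit(X, y)

n_candidates = len(search.cv_results_['params'])
print(n_candidates, search.best_params_)
```

With 5-fold cross-validation this fits 100 models during the search instead of the 995 an exhaustive search over the same range would need, at the cost of possibly missing the best depth.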