<center><h2> Energy Efficiency Regression </h2></center>
<center><h4> by Nickhil Tekwani </h4></center>

An introduction to machine-learning regression using the Energy Efficiency dataset from the UCI Machine Learning Repository. The following regressors are compared:
Linear Regression, Ridge, Lasso, kNN, and LinearSVR.

### Data

In [1]:
import pandas as pd

def load_data(excel_link):
    return pd.read_excel(excel_link)

energy_df = load_data('https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx')
# Use Y2 (cooling load) as the regression target; Y1 (heating load) is dropped
energy_df["target"] = energy_df["Y2"]
energy_df = energy_df.drop(["Y1", "Y2"], axis=1)

energy_df.head()

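|   | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 2 | 0.0 | 0 | 21.33 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3 | 0.0 | 0 | 21.33 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 4 | 0.0 | 0 | 21.33 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5 | 0.0 | 0 | 21.33 |
| 4 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 2 | 0.0 | 0 | 28.28 |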
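
The column codes `X1` to `X8` are hard to read on their own. As a minimal sketch, the mapping below renames them using the attribute names from the UCI dataset description. The snake_case spellings and the names `COLUMN_NAMES`, `demo_df`, and `readable_df` are assumptions made for illustration, and the single demo row is copied from the first row of the output above rather than re-downloaded.

```python
import pandas as pd

# Attribute meanings follow the UCI dataset description;
# the snake_case spellings are an illustrative choice.
COLUMN_NAMES = {
    "X1": "relative_compactness",
    "X2": "surface_area",
    "X3": "wall_area",
    "X4": "roof_area",
    "X5": "overall_height",
    "X6": "orientation",
    "X7": "glazing_area",
    "X8": "glazing_area_distribution",
}

# Stand-in for energy_df: the first row shown in the notebook output.
demo_df = pd.DataFrame(
    [[0.98, 514.5, 294.0, 110.25, 7.0, 2, 0.0, 0, 21.33]],
    columns=["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "target"],
)

# rename() leaves columns that are not in the mapping (here "target") unchanged.
readable_df = demo_df.rename(columns=COLUMN_NAMES)
print(readable_df.columns.tolist())
```

The rest of the notebook keeps the original `X1` to `X8` codes, so this rename is only a reading aid.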


In [2]:
# Get features and target
def features_and_target(df):
    features = df.drop(["target"], axis=1)
    target = df["target"]
    return (features, target)
features, target = features_and_target(energy_df)

### Application and Evaluation of Regression Estimators
Linear Regression, Ridge, Lasso, kNN, and LinearSVR

In [3]:
# split data
from sklearn.model_selection import train_test_split
def split_the_dataset():
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)
    return (X_train, X_test, y_train, y_test)
X_train, X_test, y_train, y_test = split_the_dataset()

In [4]:
# estimators
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR

estimators = {"Linear Regression": LinearRegression(),
              "Ridge": Ridge(), 
              "Lasso": Lasso(),
              "k-Nearest Neighbor": KNeighborsRegressor(), 
              "Support Vector Machine": LinearSVR() 
             }

In [5]:
# fit regression estimators using percentage-split approach
from sklearn.metrics import r2_score
def regressors_percentage_split():
    for name, reg in estimators.items():
        model = reg.fit(X=X_train, y=y_train)
        r_train = r2_score(y_train, model.predict(X_train))
        r_test = r2_score(y_test, model.predict(X_test))
        
        print(name + ": \n \t R-squared value for training set: " + str(r_train)
              + "\n \t R-squared value for testing set: " + str(r_test) + "\n")
regressors_percentage_split()

Linear Regression: 
 	 R-squared value for training set: 0.8904498825817151
 	 R-squared value for testing set: 0.8784000590776224

Ridge: 
 	 R-squared value for training set: 0.8857523325352672
 	 R-squared value for testing set: 0.8696356299456395

Lasso: 
 	 R-squared value for training set: 0.7909719493377705
 	 R-squared value for testing set: 0.7415012433952806

k-Nearest Neighbor: 
 	 R-squared value for training set: 0.9729795093926208
 	 R-squared value for testing set: 0.9547529979241635

Support Vector Machine: 
 	 R-squared value for training set: 0.7664603158481333
 	 R-squared value for testing set: 0.7356963899946077
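
For reference, the R-squared values printed above follow the usual coefficient-of-determination formula, 1 - SS_res / SS_tot. The sketch below recomputes it by hand on a small made-up example and compares the result with `r2_score`. The helper `r_squared` and the demo arrays are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up values, chosen only so the arithmetic is easy to follow.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]

manual_r2 = r_squared(y_true_demo, y_pred_demo)
sklearn_r2 = r2_score(y_true_demo, y_pred_demo)
print(manual_r2, sklearn_r2)
```

Here SS_res is 0.75 and SS_tot is 20, so both calls give 1 - 0.75 / 20 = 0.9625. A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.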





In [6]:
# fit estimators, but this time min-max scale X_train and X_test first
from sklearn.preprocessing import MinMaxScaler

def preprocessed_regression():
    scaler = MinMaxScaler()
    scaler.fit(X_train) 
    X_train_scaled = scaler.transform(X_train) 
    X_test_scaled = scaler.transform(X_test) 
    
    for name, reg in estimators.items():
        model = reg.fit(X=X_train_scaled, y=y_train)
        r_train = r2_score(y_train, model.predict(X_train_scaled))
        r_test = r2_score(y_test, model.predict(X_test_scaled))
        
        print(name + ": \n \t R-squared value for training set: " + str(r_train)
              + "\n \t R-squared value for testing set: " + str(r_test) + "\n")
        
    return (X_train_scaled, X_test_scaled)
X_train_scaled, X_test_scaled = preprocessed_regression()

Linear Regression: 
 	 R-squared value for training set: 0.8904498825817151
 	 R-squared value for testing set: 0.8784000590776223

Ridge: 
 	 R-squared value for training set: 0.8880375941362452
 	 R-squared value for testing set: 0.8711496370323977

Lasso: 
 	 R-squared value for training set: 0.7603527845294313
 	 R-squared value for testing set: 0.7226204084888882

k-Nearest Neighbor: 
 	 R-squared value for training set: 0.9260441752256717
 	 R-squared value for testing set: 0.8870455496740268

Support Vector Machine: 
 	 R-squared value for training set: 0.876804566492899
 	 R-squared value for testing set: 0.8490341981993731
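
In the output above, the Linear Regression scores are unchanged by scaling while the kNN and LinearSVR scores move. Ordinary least squares with an intercept is invariant to rescaling the features, whereas distance-based and regularized models are not. The sketch below illustrates this on synthetic data and also shows how `make_pipeline` can bundle the scaler with an estimator. The data, the helper `holdout_r2`, and all `*_demo` names are assumptions for illustration only, so the printed numbers say nothing about the energy dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales (assumed data, not ENB2012).
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = 10.0 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

def holdout_r2(estimator, scale):
    """Fit on the training split, optionally behind a MinMaxScaler, and
    return R-squared on the test split."""
    model = make_pipeline(MinMaxScaler(), estimator) if scale else estimator
    return model.fit(Xtr_demo, ytr_demo).score(Xte_demo, yte_demo)

lin_raw = holdout_r2(LinearRegression(), scale=False)
lin_scaled = holdout_r2(LinearRegression(), scale=True)
knn_raw = holdout_r2(KNeighborsRegressor(), scale=False)
knn_scaled = holdout_r2(KNeighborsRegressor(), scale=True)

print("Linear Regression:", lin_raw, lin_scaled)  # equal up to rounding
print("kNN:", knn_raw, knn_scaled)                # generally different
```

Because the pipeline fits the scaler only on the data passed to `fit`, it reproduces the fit-on-train, transform-on-test pattern used in the cell above without managing the scaled arrays by hand.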



In [10]:
# RFE for feature selection
from sklearn.feature_selection import RFE
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def RFE_feature_selection():
    select = RFE(DecisionTreeRegressor(random_state = 3000), n_features_to_select = 3)
    select.fit(X_train_scaled, y_train)
    
    X_train_selected = select.transform(X_train_scaled)
    X_test_selected = select.transform(X_test_scaled)
    
    # Pair each column name with the boolean mask of features kept by RFE
    flist = list(features.columns.values)
    supp = select.get_support()
    
    print("Selected features after RFE:")
    for fname, kept in zip(flist, supp):
        if kept:
            print("\t " + fname)
    
    model = KNeighborsRegressor().fit(X=X_train_selected, y=y_train)
    
    print("kNN Regression performance with selected features:")
    print("\tR-squared value for training set: ", r2_score(y_train, model.predict(X_train_selected)))
    print("\tR-squared value for testing set: ", r2_score(y_test, model.predict(X_test_selected)))
    
    return X_train_selected, X_test_selected

X_train_selected, X_test_selected = RFE_feature_selection()

Selected features after RFE:
	 X1
	 X3
	 X7
kNN Regression performance with selected features:
	R-squared value for training set:  0.961705245742336
	R-squared value for testing set:  0.9497826076129472
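
RFE works by repeatedly fitting the wrapped estimator and dropping the least important feature until the requested number remains. Besides `get_support()`, the fitted selector exposes a `ranking_` attribute that records the elimination order. The sketch below shows how it might be inspected. The synthetic data and the names `X_rfe_demo`, `y_rfe_demo`, `demo_names`, and `demo_selector` are assumptions for illustration, not the notebook's variables.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed): five features, only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X_rfe_demo = rng.uniform(size=(300, 5))
y_rfe_demo = 5.0 * X_rfe_demo[:, 0] + 3.0 * X_rfe_demo[:, 3]
demo_names = ["f0", "f1", "f2", "f3", "f4"]

demo_selector = RFE(DecisionTreeRegressor(random_state=0), n_features_to_select=2)
demo_selector.fit(X_rfe_demo, y_rfe_demo)

# ranking_ is 1 for every kept feature; larger numbers were eliminated earlier.
for fname, rank, kept in zip(demo_names, demo_selector.ranking_, demo_selector.support_):
    print(f"{fname}: rank={rank}, kept={kept}")
```

With the default `step=1`, one feature is removed per round, so the three discarded features receive ranks 2, 3, and 4.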


In [11]:
# Grid search to tune the kNN regression algorithm
# Note: with the default p=2, "minkowski" is the same distance as "euclidean"
param_grid = {"n_neighbors":[1, 5, 10], "metric": ["euclidean", "manhattan", "minkowski"]}

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

def grid_search_kNN():
    gs = GridSearchCV(KNeighborsRegressor(), param_grid)
    gs.fit(X=X_train_selected, y=y_train)
    print("Best parameters: ", gs.best_params_)
    # best_score_ is the mean cross-validated R-squared on the training data, not a plain training-set score
    print("Mean cross-validated score with best parameters: ", gs.best_score_)
    print("Test set score with best parameters: ", gs.score(X_test_selected, y_test))
    
grid_search_kNN()

Best parameters:  {'metric': 'euclidean', 'n_neighbors': 10}
Mean cross-validated score with best parameters:  0.9612669818748619
Test set score with best parameters:  0.9636318595177941
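
`GridSearchCV.best_score_` is the mean cross-validated score of the best parameter combination, computed on held-out folds of the training data. As a rough check, the sketch below reruns cross-validation by hand with the winning parameters and compares the two numbers. It uses synthetic data and a reduced grid. The names `X_gs_demo`, `y_gs_demo`, `demo_grid`, `demo_search`, and `manual_cv_mean` are assumptions for illustration, and the comparison relies on both calls using the same default 5-fold split.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (assumed, not the energy dataset).
rng = np.random.default_rng(0)
X_gs_demo = rng.uniform(size=(150, 3))
y_gs_demo = X_gs_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.05, size=150)

demo_grid = {"n_neighbors": [1, 5, 10]}

# Defaults: 5-fold cross-validation, scored with the estimator's R-squared.
demo_search = GridSearchCV(KNeighborsRegressor(), demo_grid)
demo_search.fit(X_gs_demo, y_gs_demo)

# Repeat the cross-validation manually for the winning parameters.
manual_cv_mean = cross_val_score(
    KNeighborsRegressor(**demo_search.best_params_), X_gs_demo, y_gs_demo
).mean()

print("best_params_:", demo_search.best_params_)
print("best_score_: ", demo_search.best_score_)
print("manual mean: ", manual_cv_mean)
```

Since the cross-validated score averages over folds that the model did not train on, it can sit below or above the score on a separate test set, as it does in the output above.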
