# <font color="blue">Lesson 6 - Feature Engineering and Selection</font>

## LASSO 

For the lasso and ridge regression tutorials, we'll use sklearn's Boston dataset, which contains data on housing prices in Boston. 

In [None]:
# load dataset
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()

# pull out features and targets
X, y = boston.data, boston.target
names = boston.feature_names

Remember that lasso and ridge are regularized regression techniques that shrink the coefficient estimates towards zero. We do this to prevent overfitting. 
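
For reference, the objectives that sklearn's `Lasso` and `Ridge` estimators minimize can be written as below. The symbols are notation introduced here, not taken from the lesson: $w$ is the coefficient vector, $n$ the number of samples, and $\alpha$ the regularization strength.

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1
$$

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
$$

The only difference in the penalty is the norm: lasso penalizes the sum of absolute coefficient values, while ridge penalizes the sum of squared values.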

### Pre-Process Data Set
Before we can get meaningful results with regularization, we need to make sure all of our features are on the same scale, because the penalty is applied to the size of each coefficient. In this dataset, they are not. 

In [None]:
# import a scaler from sklearn 
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
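
As a quick sanity check on what `StandardScaler` does, the sketch below scales a small synthetic array and confirms that each column ends up with a mean of about 0 and a standard deviation of about 1. The data and the names `X_demo`, `col_means`, and `col_stds` are made up for illustration and are not part of the Boston dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Boston data):
# two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(loc=0.0, scale=1.0, size=200),
    rng.normal(loc=50.0, scale=1000.0, size=200),
])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# After scaling, each column should have mean ~0 and standard deviation ~1
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```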

### Fit Model
Because lasso is a regression technique, we import the `linear_model` module from sklearn and use its `Lasso` class. 

The alpha term is used to control the regularization strength. When alpha is too high, you risk underfitting your model. When alpha is too low, you risk overfitting your model. So choose alpha carefully. 

In [None]:
# let's fit our model

from sklearn import linear_model

alpha = 0.5 # Increasing alpha can shrink more variable coefficients to 0
clf = linear_model.Lasso(alpha=alpha)
clf.fit(X_scaled, y)  # fit on the scaled features from the pre-processing step

# let's look at the coefficients for each feature
lasso_df = pd.DataFrame({"Feature": names,
                         "Coefficients": clf.coef_})
lasso_df
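
Rather than picking alpha by hand, one common option is to let cross-validation choose it. The sketch below uses sklearn's `LassoCV` on synthetic data from `make_regression`; the data, the 5-fold setting, and the variable names are assumptions for illustration, not part of the lesson.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 10 features, only 3 informative
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, noise=10.0,
                                 random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# LassoCV tries a grid of alphas and keeps the one with the best
# cross-validated error
lasso_cv = LassoCV(cv=5, random_state=0)
lasso_cv.fit(X_demo_scaled, y_demo)

best_alpha = lasso_cv.alpha_
print("alpha chosen by cross-validation:", best_alpha)
```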

Try out the following code to compare the effect of different levels of alpha: 

In [None]:
# Create a function called compare_lasso,
def compare_lasso(alphas):
    '''
    Takes in a list of alphas. Outputs a dataframe containing the coefficients of lasso regressions from each alpha.
    '''
    # Create an empty data frame
    df = pd.DataFrame()
    
    # Create a column of feature names
    df['Feature Name'] = names
    
    # For each alpha value in the list of alpha values,
    for alpha in alphas:
        # Create a lasso regression with that alpha value,
        lasso = linear_model.Lasso(alpha=alpha)
        
        # Fit the lasso regression on the scaled features
        lasso.fit(X_scaled, y)
        
        # Create a column name for that alpha value
        column_name = 'Alpha = %f' % alpha

        # Create a column of coefficient values
        df[column_name] = lasso.coef_
        
    # Return the dataframe
    return df

In [None]:
# Run the function called compare_lasso
compare_lasso([.0001, .5, 10])

## Consider this
Notice that as the alpha value increases, more features have a coefficient of 0. What effect does this have on the model, and on which features it effectively keeps? 
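
To make the pattern concrete, the sketch below counts how many coefficients are exactly zero at each alpha. It runs on synthetic data from `make_regression` (an assumption, used so the cell works on its own), and the names `demo_alphas` and `zero_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 10 features, only 3 informative
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, noise=10.0,
                                 random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

demo_alphas = [0.0001, 0.5, 10, 1000]
zero_counts = []
for a in demo_alphas:
    model = Lasso(alpha=a).fit(X_demo_scaled, y_demo)
    # Count the coefficients that lasso set to exactly zero
    zero_counts.append(int(np.sum(model.coef_ == 0)))

print(dict(zip(demo_alphas, zero_counts)))
```

A feature whose coefficient is zero no longer contributes to the prediction, which is why lasso is often used as a feature selection step.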

## Ridge
We'll use the same dataset for ridge regression so we can compare results. 

In [None]:
# Ridge Regression
from sklearn import linear_model
alpha = 10 
clf = linear_model.Ridge(alpha=alpha)
clf.fit(X_scaled, y)  # fit on the scaled features from the pre-processing step

# let's look at the coefficients for each feature
ridge_df = pd.DataFrame({"Feature": names,
                         "Coefficients": clf.coef_})
ridge_df

Now we'll compare different levels of alpha for ridge regression. 

In [None]:
# Create a function called compare_ridge,
def compare_ridge(alphas):
    '''
    Takes in a list of alphas. Outputs a dataframe containing the coefficients of ridge regressions from each alpha.
    '''
    # Create an empty data frame
    df = pd.DataFrame()
    
    # Create a column of feature names
    df['Feature Name'] = names
    
    # For each alpha value in the list of alpha values,
    for alpha in alphas:
        # Create a ridge regression with that alpha value,
        ridge = linear_model.Ridge(alpha=alpha)
        
        # Fit the ridge regression on the scaled features
        ridge.fit(X_scaled, y)
        
        # Create a column name for that alpha value
        column_name = 'Alpha = %f' % alpha

        # Create a column of coefficient values
        df[column_name] = ridge.coef_
        
    # Return the dataframe
    return df

In [None]:
# Run the function called compare_ridge
alphas = [.0001, .5, 10]
compare_ridge(alphas)

## Consider this
How do the coefficient values compare between LASSO and Ridge regression as alpha increases?
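
One way to explore this question is to fit both models at the same alpha and put the coefficients side by side. The sketch below does this on synthetic data from `make_regression` (an assumption, so the cell runs on its own); the names `comparison`, `lasso_zeros`, and `ridge_zeros` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 10 features, only 3 informative
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, noise=10.0,
                                 random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit both models with the same regularization strength
lasso = Lasso(alpha=10).fit(X_demo_scaled, y_demo)
ridge = Ridge(alpha=10).fit(X_demo_scaled, y_demo)

comparison = pd.DataFrame({"Lasso": lasso.coef_, "Ridge": ridge.coef_})

# Count coefficients that are exactly zero in each model
lasso_zeros = int((comparison["Lasso"] == 0).sum())
ridge_zeros = int((comparison["Ridge"] == 0).sum())
print(comparison)
print("zero coefficients -> lasso:", lasso_zeros, "ridge:", ridge_zeros)
```

In this sketch, ridge shrinks every coefficient but leaves them nonzero, while lasso sets some of them to exactly zero.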