# Ridge and Lasso Lab

### Introduction

In this lesson, we'll practice working with both ridge and lasso regression.  As we'll see, ridge regression shrinks the *size* of our coefficients, while lasso regression tends to zero out some coefficients entirely, and thus performs a degree of feature selection for us.
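
To make that contrast concrete before touching the real data, here is a minimal sketch on synthetic data. The data, the variable names, and the `alpha` values are all assumptions chosen for illustration; only the first two of the five features influence the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 5))
# Only the first two features actually influence the target.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n_samples)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=100.0).fit(X_demo, y_demo)   # illustrative penalty strength
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)     # illustrative penalty strength

# Ridge shrinks every coefficient toward zero; lasso sets the weak ones to exactly zero.
print("ols  :", ols.coef_.round(2))
print("ridge:", ridge.coef_.round(2))
print("lasso:", lasso.coef_.round(2))
```

The ridge coefficients should all be smaller in magnitude than the unpenalized ones but still nonzero, while the lasso coefficients for the three irrelevant features should be exactly zero.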

### Loading the Data

For this lesson, we'll be working with data from an iPhone app company.  As we'll see, the company produces different applications and wants to predict the lifetime value of customers given a stream of purchases.  Because the lifetime value of a customer technically has no end date, we will instead use the total revenue collected from a customer in the first 30 days after downloading an application.

Let's get started by loading up the data.

> You can find the data by downloading it [here](https://drive.google.com/file/d/1XwEWgvPj31fflN94tGeCIFLZjV-CDNPE/view).

Then read it in as a csv with something like the following.

In [6]:
import pandas as pd

df = pd.read_csv('./ltv_prediction_demo_data.csv.gz', index_col = 0)



In [108]:
df[:2]

# 	idApp	dt	ltvDay1	ltvDay2	ltvDay3	ltvDay4	ltvDay5	ltvDay6	ltvDay7	ltvDay8	ltvDay30
# 0	279	2018-02-18	0	0	0	2190000	2190000	2190000	5780000	5780000	5780000
# 1	279	2018-04-20	0	0	0	0	0	0	0	0	0



Take a look at the shape of the data.

In [109]:
df.shape

# (3324150, 11)


We can see that we have a lot of data here.  Each row represents a different user, and the `dt` marks the date that the app was downloaded.

Let's narrow our data by trying to make predictions for just a single app.  Use `value_counts` to get a sense of the number of downloads for each of the applications.

In [9]:
df['idApp'].value_counts()

# 5620     1077387
# 279       532735
# 5619      484202
# 278       282209
# 313       221160
#           ...   
# 50357          1
# 5570           1
# 50458          1
# 5494           1
# 5548           1
# Name: idApp, Length: 130, dtype: int64


We can see that one of the top apps is the app with id `279`.  Let's select just the records for that app.

In [114]:
df_app = df[df['idApp'] == 279]

In [115]:
df_app.shape
# (532735, 11)


Next, because our data has a time component to it, let's sort our data by the `dt` column.

In [17]:
df_app_sorted = df_app.sort_values('dt')

In [18]:
df_app_sorted[:2]

# 	idApp	dt	ltvDay1	ltvDay2	ltvDay3	ltvDay4	ltvDay5	ltvDay6	ltvDay7	ltvDay8	ltvDay30
# 959172	279	2017-06-18	0	0	3872692	3872692	3872692	3872692	3872692	3872692	3872692
# 3276944	279	2017-06-18	0	3872692	3872692	3872692	3872692	3872692	11902781	11902781	153992377



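Next, we'll define the `add_datepart` function to perform some basic feature engineering.  It expands a date column into numeric and boolean features such as the year, month, and day of the week.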

In [64]:
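import numpy as np
import re

def add_datepart(df, fldname, drop=True, time=False, errors="raise"):
    """Expand the date column `fldname` into date-part features, modifying `df` in place."""
    fld = df[fldname]
    # Parse strings (or any other non-datetime values) into datetimes first.
    if not pd.api.types.is_datetime64_any_dtype(fld):
        df[fldname] = fld = pd.to_datetime(fld, errors=errors)
    # A trailing 'date'/'Date' is stripped from the column name to form the prefix.
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
            'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
    if time: attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr:
        if n == 'Week':
            # Series.dt.week was removed in pandas 2.0; isocalendar().week replaces it.
            # The integer cast assumes the column contains no missing dates.
            df[targ_pre + n] = fld.dt.isocalendar().week.astype(np.int64)
        else:
            df[targ_pre + n] = getattr(fld.dt, n.lower())
    # Seconds since the Unix epoch (assumes nanosecond-resolution datetimes).
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop: df.drop(fldname, axis=1, inplace=True)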

In [None]:
add_datepart(df_app_sorted, 'dt')

After calling the `add_datepart` function, our data should look like the following:

In [117]:
df_app_sorted.columns

# Index(['idApp', 'ltvDay1', 'ltvDay2', 'ltvDay3', 'ltvDay4', 'ltvDay5',
#        'ltvDay6', 'ltvDay7', 'ltvDay8', 'ltvDay30', 'dtYear', 'dtMonth',
#        'dtWeek', 'dtDay', 'dtDayofweek', 'dtDayofyear', 'dtIs_month_end',
#        'dtIs_month_start', 'dtIs_quarter_end', 'dtIs_quarter_start',
#        'dtIs_year_end', 'dtIs_year_start', 'dtElapsed'],
#       dtype='object')

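
To see what a few of these date-part columns contain, here is a small sketch that computes them directly with the pandas `.dt` accessor on three hand-picked dates. It only mirrors a subset of the columns; the `dates` and `parts` names are introduced here for illustration.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2018-02-18", "2018-04-20", "2018-12-31"]))

# Build a handful of the same features that add_datepart produces.
parts = pd.DataFrame({
    "dtYear": dates.dt.year,
    "dtMonth": dates.dt.month,
    "dtDayofweek": dates.dt.dayofweek,      # Monday is 0, Sunday is 6
    "dtIs_year_end": dates.dt.is_year_end,  # True only for December 31
})
print(parts)
```

Each original date becomes a row of plain numbers and booleans, which is the form a linear model can consume.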

Ok, now let's separate our data into X and y.  Our variable X should have every column except `idApp` and `ltvDay30` and the variable `y` should just be the column `ltvDay30`.

In [118]:
X = df_app_sorted.drop(columns = ['idApp', 'ltvDay30'])
y = df_app_sorted['ltvDay30']

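Next let's separate our data into training, validation and test data.  We can allocate 10 percent of our data to validation and another 10 percent to testing.  Because the rows are sorted by date, we split without shuffling, so the validation and test sets come from later dates than the training set.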

In [122]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, shuffle = False)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, test_size = .5, shuffle = False)
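
The two-step split above can be checked on a toy array. This is a sketch in which `rows` stands in for 100 records that are already sorted by date; the names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # stand-in for 100 rows already sorted by date

# First carve off the last 20 percent, then cut that piece in half.
train, rest = train_test_split(rows, test_size=.2, shuffle=False)
validate, test = train_test_split(rest, test_size=.5, shuffle=False)

print(len(train), len(validate), len(test))
# With shuffle=False the pieces stay in chronological order.
print(train.max() < validate.min() <= validate.max() < test.min())
```

With `shuffle = False` the training rows all precede the validation rows, which in turn precede the test rows, so the model is never scored on data older than what it was trained on.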

### Fitting a model

Let's begin by using a normal linear regression model.  Fit the linear regression model and score it on the validation set.

In [123]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_validate, y_validate)

0.86370730288396

Next let's use the `RidgeCV` model.  As we know, ridge regression is designed to reduce the variance in our model by reducing the size of our model's coefficients.  Let's begin by defining a list of alphas, 20 of them evenly spaced from `.001` to `3`.

In [134]:
alphas = np.linspace(.001, 3, 20)
alphas.round(2)

# array([0.  , 0.16, 0.32, 0.47, 0.63, 0.79, 0.95, 1.11, 1.26, 1.42, 1.58,
#        1.74, 1.9 , 2.05, 2.21, 2.37, 2.53, 2.68, 2.84, 3.  ])


Then, we can use the `RidgeCV` model to find the value of alpha that performs the best.  To ensure that we are performing cross validation properly on time-ordered data, let's use a `TimeSeriesSplit`.  Make sure to set `normalize = True`.

In [136]:
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

ridge_model = RidgeCV(alphas = alphas, normalize = True, cv = TimeSeriesSplit())

ridge_model.fit(X_train, y_train)

ridge_model.score(X_validate, y_validate)
# 0.8636962814979466

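
To see why `TimeSeriesSplit` suits time-ordered data, the sketch below prints the folds it generates for twelve rows. It uses `n_splits=3` to keep the output short, which is an assumption for illustration (calling `TimeSeriesSplit()` with no arguments produces five splits).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rows = np.arange(12)  # stand-in for 12 rows sorted by date

folds = list(TimeSeriesSplit(n_splits=3).split(rows))
for train_idx, test_idx in folds:
    # Every fold trains on an initial stretch and validates on the rows right after it.
    print("train:", train_idx, "validate:", test_idx)
```

In every fold the validation rows come strictly after the training rows, so no fold uses future data to predict the past.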

Let's look at the alpha value that was used to optimize the score.

In [137]:
ridge_model.alpha_

0.001

We see that it was the lowest alpha value, so apparently we were unable to improve our score using ridge regression.  Still, let's initialize a series where we look at the coefficients of our ridge regression model along with the corresponding features.

In [139]:
ridge_coef = pd.Series(ridge_model.coef_, X_validate.columns)
ridge_coef.round(2)

# ltvDay1                    -0.03
# ltvDay2                    -0.02
# ltvDay3                    -0.06
# ltvDay4                     0.03
# ltvDay5                    -0.23
# ltvDay6                     0.05
# ltvDay7                     0.02
# ltvDay8                     1.24
# dtYear                 672033.55
# dtMonth                109449.66
# dtWeek                  -8820.19
# dtDay                   15602.74
# dtDayofweek            -47902.52
# dtDayofyear             -1272.18
# dtIs_month_end        -879266.70
# dtIs_month_start      1570141.45
# dtIs_quarter_end      1824947.16
# dtIs_quarter_start   -2835222.54
# dtIs_year_end        -2763302.47
# dtIs_year_start       6920864.75
# dtElapsed                   0.10
# dtype: float64

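
The coefficients above span many orders of magnitude largely because the features themselves are on very different scales. Note also that the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells in this lab need an older release to run as written. A minimal sketch of one common substitute on newer versions is a pipeline with `StandardScaler`. This is not the same scaling that `normalize = True` applied (that option divided each centered column by its L2 norm), so the alpha values are not directly comparable. The data and the column names `ltv_like` and `flag_like` are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: one revenue-like column on a large scale, one 0/1 flag.
X_demo = pd.DataFrame({
    "ltv_like": rng.gamma(shape=2.0, scale=1e6, size=500),
    "flag_like": rng.integers(0, 2, size=500).astype(float),
})
y_demo = 1.2 * X_demo["ltv_like"] + 5e5 * X_demo["flag_like"] + rng.normal(scale=1e5, size=500)

# Standardize the features, then let RidgeCV choose alpha with time-ordered folds.
pipeline = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.linspace(.001, 3, 20), cv=TimeSeriesSplit()),
)
pipeline.fit(X_demo, y_demo)

# These coefficients are per standard deviation of each feature, so they are comparable.
scaled_coef = pd.Series(pipeline.named_steps["ridgecv"].coef_, X_demo.columns)
print(scaled_coef.round(0))
```

Because the scaler lives inside the pipeline, it is fit on the training rows only and the same transformation is reused whenever the pipeline scores or predicts.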

### Lasso Regression

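Now let's move to lasso regression.  We defined a set of alpha values for you.  Make sure to use `TimeSeriesSplit` for our cross validation by passing `cv = TimeSeriesSplit()`, as we did with `RidgeCV`; the run shown below left `cv` at its default.  In addition to `normalize = True`, set the maximum number of iterations to 5000.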

In [158]:
alphas = np.linspace(.01, 5, 10)

In [154]:
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

lasso_model = LassoCV(alphas = alphas, normalize = True, max_iter = 5000)

lasso_model.fit(X_train, y_train)

# LassoCV(alphas=array([0.01      , 0.56444444, 1.11888889, 1.67333333, 2.22777778,
#        2.78222222, 3.33666667, 3.89111111, 4.44555556, 5.        ]),
#         max_iter=5000, normalize=True)


In [155]:
lasso_model.score(X_validate, y_validate)

# 0.8637231479045426

0.8637338145608716

Here we performed essentially as well as our linear regression model.

Now let's again look at the value of alpha that was used.

In [156]:
lasso_model.alpha_
# 5.0


And notice that our model did manage to zero out certain features.

In [159]:
lasso_coef = pd.Series(lasso_model.coef_, X_validate.columns)

In [157]:
np.abs(lasso_coef).sort_values(ascending = True)

# dtWeek                0.000000e+00
# dtDayofyear           0.000000e+00
# dtMonth               0.000000e+00
# ltvDay2               1.367119e-02
# ltvDay1               2.546847e-02
# ltvDay4               3.998085e-02
# ltvDay3               5.769050e-02
# ltvDay6               6.592506e-02
# ltvDay7               8.225642e-02
# dtElapsed             1.152870e-01
# ltvDay5               2.511533e-01
# ltvDay8               1.335798e+00
# dtDay                 1.150842e+04
# dtDayofweek           4.859411e+04
# dtYear                2.801050e+05
# dtIs_month_end        8.144355e+05
# dtIs_month_start      1.555152e+06
# dtIs_quarter_end      1.700133e+06
# dtIs_year_end         2.608323e+06
# dtIs_quarter_start    2.727256e+06
# dtIs_year_start       6.766724e+06
# dtype: float64

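
The zeroed coefficients can be turned into an explicit feature subset. The sketch below does this on synthetic data: it keeps only the columns with nonzero lasso coefficients, refits a plain linear regression on them, and compares holdout scores. The feature names, the `alpha` value, and the 200/100 split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
features = ["signal_a", "signal_b", "noise_a", "noise_b"]
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=features)
y_demo = 4.0 * X_demo["signal_a"] - 3.0 * X_demo["signal_b"] + rng.normal(scale=0.5, size=300)

# Hold out the last 100 rows, without shuffling.
X_fit, X_hold = X_demo.iloc[:200], X_demo.iloc[200:]
y_fit, y_hold = y_demo.iloc[:200], y_demo.iloc[200:]

lasso = Lasso(alpha=0.5).fit(X_fit, y_fit)
coef = pd.Series(lasso.coef_, index=features)
kept = coef[coef != 0].index.tolist()  # features the lasso did not zero out

full = LinearRegression().fit(X_fit, y_fit)
reduced = LinearRegression().fit(X_fit[kept], y_fit)

print("kept:", kept)
print("all features :", full.score(X_hold, y_hold))
print("kept features:", reduced.score(X_hold[kept], y_hold))
```

If the dropped features carry little signal, the reduced model should score about the same as the full one while using fewer columns.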

### Summary

In this lesson, we saw how we can use both ridge and lasso regression to reduce the amount of variance in our model.  We saw that because decreasing variance comes at the cost of added bias, we did not necessarily improve upon our linear regression model when working with this data.  Still, we saw that we could use lasso regression to achieve similar performance to our linear regression model while reducing the number of features that we use.