### Introduction 

Often you'll have hundreds or thousands of features after various encodings and feature generation. This can lead to two problems. First, the more features you have, the more likely you are to overfit to the training and validation sets. This will cause your model to perform worse at generalizing to new data.

Second, the more features you have, the longer it will take to train your model and optimize hyperparameters. Also, when building user-facing products, you'll want to make inference as fast as possible. Using fewer features can speed up inference, possibly at some cost in predictive performance.

To help with these issues, you'll want to use feature selection techniques to keep the most informative features for your model.

In [23]:
%matplotlib inline
import itertools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

In [4]:
ks = pd.read_csv('ks-projects-201801.csv',parse_dates = ['deadline','launched'])

In [5]:
ks

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.00
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.00
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.00
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378656,999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,USD,2014-10-17,50000.0,2014-09-17 02:35:30,25.0,canceled,1,US,25.0,25.0,50000.00
378657,999977640,The Tribe,Narrative Film,Film & Video,USD,2011-07-19,1500.0,2011-06-22 03:35:14,155.0,failed,5,US,155.0,155.0,1500.00
378658,999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,USD,2010-08-16,15000.0,2010-07-01 19:40:30,20.0,failed,1,US,20.0,20.0,15000.00
378659,999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13,15000.0,2016-01-13 18:13:53,200.0,failed,6,US,200.0,200.0,15000.00


In [6]:
#drop live projects
ks = ks.query('state != "live"')

In [7]:
ks

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.00
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.00
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.00
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378656,999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,USD,2014-10-17,50000.0,2014-09-17 02:35:30,25.0,canceled,1,US,25.0,25.0,50000.00
378657,999977640,The Tribe,Narrative Film,Film & Video,USD,2011-07-19,1500.0,2011-06-22 03:35:14,155.0,failed,5,US,155.0,155.0,1500.00
378658,999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,USD,2010-08-16,15000.0,2010-07-01 19:40:30,20.0,failed,1,US,20.0,20.0,15000.00
378659,999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13,15000.0,2016-01-13 18:13:53,200.0,failed,6,US,200.0,200.0,15000.00


In [8]:
#add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome =(ks['state']=='successful').astype(int))

In [9]:
ks

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real,outcome
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95,0
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.00,0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.00,0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.00,0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378656,999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,USD,2014-10-17,50000.0,2014-09-17 02:35:30,25.0,canceled,1,US,25.0,25.0,50000.00,0
378657,999977640,The Tribe,Narrative Film,Film & Video,USD,2011-07-19,1500.0,2011-06-22 03:35:14,155.0,failed,5,US,155.0,155.0,1500.00,0
378658,999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,USD,2010-08-16,15000.0,2010-07-01 19:40:30,20.0,failed,1,US,20.0,20.0,15000.00,0
378659,999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13,15000.0,2016-01-13 18:13:53,200.0,failed,6,US,200.0,200.0,15000.00,0


In [17]:
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)

In [18]:
#label Encoding
encoder = LabelEncoder()
cat_features = ['category','currency','country']
encoded = ks[cat_features].apply(encoder.fit_transform)

In [19]:
encoded

Unnamed: 0,category,currency,country
0,108,5,9
1,93,13,22
2,93,13,22
3,90,13,22
4,55,13,22
...,...,...,...
378656,39,13,22
378657,93,13,22
378658,93,13,22
378659,138,13,22


In [20]:
data_cols = ['goal', 'hour', 'day', 'month', 'year', 'outcome']
baseline = ks[data_cols].join(encoded)
baseline

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country
0,1000.0,12,11,8,2015,0,108,5,9
1,30000.0,4,2,9,2017,0,93,13,22
2,45000.0,0,12,1,2013,0,93,13,22
3,5000.0,3,17,3,2012,0,90,13,22
4,19500.0,8,4,7,2015,0,55,13,22
...,...,...,...,...,...,...,...,...,...
378656,50000.0,2,17,9,2014,0,39,13,22
378657,1500.0,3,22,6,2011,0,93,13,22
378658,15000.0,19,1,7,2010,0,93,13,22
378659,15000.0,18,13,1,2016,0,138,13,22


In [22]:
cat_features = ['category','currency','country']
interactions  = pd.DataFrame(index=ks.index)
interactions

0
1
2
3
4
...
378656
378657
378658
378659
378660


In [25]:
for col1,col2 in itertools.combinations(cat_features,2):
    new_col_name = "_".join([col1,col2])
    #convert the string and combine
    new_values = ks[col1].map(str)+"_"+ks[col2].map(str)
    label_enc = LabelEncoder()
    interactions[new_col_name]= label_enc.fit_transform(new_values)
    

In [26]:
interactions

Unnamed: 0,category_currency,category_country,currency_country
0,1215,1900,18
1,1047,1630,31
2,1047,1630,31
3,1024,1595,31
4,630,979,31
...,...,...,...
378656,447,699,31
378657,1047,1630,31
378658,1047,1630,31
378659,1546,2412,31


In [32]:
baseline_data = baseline.join(interactions)
baseline_data

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country,category_currency,category_country,currency_country
0,1000.0,12,11,8,2015,0,108,5,9,1215,1900,18
1,30000.0,4,2,9,2017,0,93,13,22,1047,1630,31
2,45000.0,0,12,1,2013,0,93,13,22,1047,1630,31
3,5000.0,3,17,3,2012,0,90,13,22,1024,1595,31
4,19500.0,8,4,7,2015,0,55,13,22,630,979,31
...,...,...,...,...,...,...,...,...,...,...,...,...
378656,50000.0,2,17,9,2014,0,39,13,22,447,699,31
378657,1500.0,3,22,6,2011,0,93,13,22,1047,1630,31
378658,15000.0,19,1,7,2010,0,93,13,22,1047,1630,31
378659,15000.0,18,13,1,2016,0,138,13,22,1546,2412,31


In [33]:
launched =pd.Series(ks.index,index=ks.launched,name = "count_7_days").sort_index()

In [36]:
launched

launched
1970-01-01 01:00:00     94579
1970-01-01 01:00:00    319002
1970-01-01 01:00:00    247913
1970-01-01 01:00:00     48147
1970-01-01 01:00:00     75397
                        ...  
2017-12-29 03:22:32    339929
2017-12-29 21:06:11     62039
2017-12-31 13:53:53     11463
2018-01-01 00:54:41    167940
2018-01-02 03:05:10     15604
Name: count_7_days, Length: 375862, dtype: int64

In [39]:
count_7_days = launched.rolling('7d').count()-1
count_7_days.index = launched.values
count_7_days =count_7_days.reindex(ks.index)
count_7_days

0         1409.0
1          957.0
2          739.0
3          907.0
4         1429.0
           ...  
378656    1482.0
378657     505.0
378658     238.0
378659    1100.0
378660     542.0
Name: count_7_days, Length: 375862, dtype: float64

In [40]:
baseline_data = baseline_data.join(count_7_days)
baseline_data

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country,category_currency,category_country,currency_country,count_7_days
0,1000.0,12,11,8,2015,0,108,5,9,1215,1900,18,1409.0
1,30000.0,4,2,9,2017,0,93,13,22,1047,1630,31,957.0
2,45000.0,0,12,1,2013,0,93,13,22,1047,1630,31,739.0
3,5000.0,3,17,3,2012,0,90,13,22,1024,1595,31,907.0
4,19500.0,8,4,7,2015,0,55,13,22,630,979,31,1429.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
378656,50000.0,2,17,9,2014,0,39,13,22,447,699,31,1482.0
378657,1500.0,3,22,6,2011,0,93,13,22,1047,1630,31,505.0
378658,15000.0,19,1,7,2010,0,93,13,22,1047,1630,31,238.0
378659,15000.0,18,13,1,2016,0,138,13,22,1546,2412,31,1100.0


In [49]:
def time_since_last_project(series):
    return series.diff().dt.total_seconds()/3600.

In [45]:
df = ks[['category','launched']].sort_values('launched')
df

Unnamed: 0,category,launched
94579,Theater,1970-01-01 01:00:00
319002,Publishing,1970-01-01 01:00:00
247913,Music,1970-01-01 01:00:00
48147,Art,1970-01-01 01:00:00
75397,Film & Video,1970-01-01 01:00:00
...,...,...
339929,Hip-Hop,2017-12-29 03:22:32
62039,Web,2017-12-29 21:06:11
11463,Tabletop Games,2017-12-31 13:53:53
167940,Comic Books,2018-01-01 00:54:41


In [50]:
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas

Unnamed: 0,launched
94579,
319002,
247913,
48147,
75397,
...,...
339929,409.380833
62039,220.661944
11463,84.081944
167940,208.305278


In [51]:
timedeltas = timedeltas.fillna(timedeltas.max())


In [52]:
timedeltas

Unnamed: 0,launched
94579,347972.515000
319002,347972.515000
247913,347972.515000
48147,347972.515000
75397,347972.515000
...,...
339929,409.380833
62039,220.661944
11463,84.081944
167940,208.305278


In [53]:
baseline_data = baseline_data.join(timedeltas.rename({'launched':'time_since_last_project'},axis=1))


In [54]:
baseline_data

Unnamed: 0,goal,hour,day,month,year,outcome,category,currency,country,category_currency,category_country,currency_country,count_7_days,time_since_last_project
0,1000.0,12,11,8,2015,0,108,5,9,1215,1900,18,1409.0,18.606111
1,30000.0,4,2,9,2017,0,93,13,22,1047,1630,31,957.0,5.592778
2,45000.0,0,12,1,2013,0,93,13,22,1047,1630,31,739.0,1.313611
3,5000.0,3,17,3,2012,0,90,13,22,1024,1595,31,907.0,0.635000
4,19500.0,8,4,7,2015,0,55,13,22,630,979,31,1429.0,16.661389
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
378656,50000.0,2,17,9,2014,0,39,13,22,447,699,31,1482.0,2.659167
378657,1500.0,3,22,6,2011,0,93,13,22,1047,1630,31,505.0,3.835833
378658,15000.0,19,1,7,2010,0,93,13,22,1047,1630,31,238.0,20.575278
378659,15000.0,18,13,1,2016,0,138,13,22,1546,2412,31,1100.0,0.614444


In [58]:
def get_data_splits(dataframe, valid_fraction=0.1):
    # valid_fraction is the share of rows used for each of the validation and test sets
    valid_size = int(len(dataframe) * valid_fraction)

    train = dataframe[:-valid_size * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_size * 2:-valid_size]
    test = dataframe[-valid_size:]
    
    return train, valid, test

def train_model(train, valid):
    feature_cols = train.columns.drop('outcome')

    dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    print("Training model!")
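    # Newer LightGBM releases take early stopping as a callback instead of keyword arguments
    bst = lgb.train(param, dtrain, num_boost_round=1000, valid_sets=[dvalid],
                    callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)])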
    valid_pred = bst.predict(valid[feature_cols])
        
    valid_score = metrics.roc_auc_score(valid['outcome'], valid_pred)
    print(f"Validation AUC score: {valid_score:.4f}")
    return bst

### Univariate Feature Selection

The simplest and fastest methods are based on univariate statistical tests. For each feature, measure how strongly the target depends on the feature using a statistical test like χ² or ANOVA.

From the scikit-learn feature selection module, feature_selection.SelectKBest returns the K best features given some scoring function. For our classification problem, the module provides three different scoring functions: χ², ANOVA F-value, and the mutual information score. The F-value measures the linear dependency between the feature variable and the target. This means the score might underestimate the relation between a feature and the target if the relationship is nonlinear. The mutual information score is nonparametric and so can capture nonlinear relationships.
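To make the difference between the F-value and the mutual information score concrete, here is a minimal sketch on made-up data (the names X_toy, y_toy, x_nonlinear and x_noise are hypothetical and not part of the Kickstarter dataset). The target depends on one feature only through its magnitude, so a linear test should see very little, while mutual information should still pick it up.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n_samples = 2000

# Hypothetical features: one drives the target nonlinearly, one is pure noise
x_nonlinear = rng.uniform(-1, 1, size=n_samples)
x_noise = rng.normal(size=n_samples)
X_toy = np.column_stack([x_nonlinear, x_noise])

# The target depends only on the magnitude of x_nonlinear, so both classes
# have (almost) the same mean and a linear test finds little to work with
y_toy = (np.abs(x_nonlinear) > 0.5).astype(int)

f_scores, p_values = f_classif(X_toy, y_toy)
mi_scores = mutual_info_classif(X_toy, y_toy, random_state=0)

print("F-values:          ", f_scores)
print("Mutual information:", mi_scores)
```

On data like this, the F-value of the informative feature should look about as unremarkable as that of the noise feature, while its mutual information score should be clearly larger. Mutual information is slower to compute, so whether the trade-off is worth it depends on the size of your data.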

With SelectKBest, we define the number of features to keep, based on the score from the scoring function. Using .fit_transform(features, target) we get back an array with only the selected features.

In [62]:
from sklearn.feature_selection import SelectKBest,f_classif

In [63]:
feature_cols = baseline_data.columns.drop('outcome')

In [64]:
feature_cols

Index(['goal', 'hour', 'day', 'month', 'year', 'category', 'currency',
       'country', 'category_currency', 'category_country', 'currency_country',
       'count_7_days', 'time_since_last_project'],
      dtype='object')

In [67]:
#keep 5 features
selector = SelectKBest(f_classif,k=5)
X_new = selector.fit_transform(baseline_data[feature_cols],baseline_data['outcome'])
X_new

array([[2015.,    5.,    9.,   18., 1409.],
       [2017.,   13.,   22.,   31.,  957.],
       [2013.,   13.,   22.,   31.,  739.],
       ...,
       [2010.,   13.,   22.,   31.,  238.],
       [2016.,   13.,   22.,   31., 1100.],
       [2011.,   13.,   22.,   31.,  542.]])

However, I've done something wrong here. The statistical tests are calculated using all of the data, so information from the validation and test sets could influence which features we keep. That is a source of leakage. To avoid it, we should select features using only the training set.

In [68]:
feature_cols = baseline_data.columns.drop('outcome')
train,valid,_ = get_data_splits(baseline_data)
#keep 5 features
selector = SelectKBest(f_classif, k=5)

X_new = selector.fit_transform(train[feature_cols], train['outcome'])
X_new

array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
       [2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
       ...,
       [2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
       [2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
       [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])

Selecting on the training set alone can pick different features than selecting on the entire dataset, although in this run the same five columns (year, currency, country, currency_country and count_7_days) are kept. Now we have our selected features, but it's only the feature values for the training set. To drop the rejected features from the validation and test sets, we need to figure out which columns in the dataset were kept with SelectKBest. To do this, we can use .inverse_transform to get back an array with the shape of the original data.

In [69]:
# Get back the features we've kept, zero out all other features
select_features = pd.DataFrame(selector.inverse_transform(X_new),index=train.index, columns=feature_cols)
select_features

Unnamed: 0,goal,hour,day,month,year,category,currency,country,category_currency,category_country,currency_country,count_7_days,time_since_last_project
0,0.0,0.0,0.0,0.0,2015.0,0.0,5.0,9.0,0.0,0.0,18.0,1409.0,0.0
1,0.0,0.0,0.0,0.0,2017.0,0.0,13.0,22.0,0.0,0.0,31.0,957.0,0.0
2,0.0,0.0,0.0,0.0,2013.0,0.0,13.0,22.0,0.0,0.0,31.0,739.0,0.0
3,0.0,0.0,0.0,0.0,2012.0,0.0,13.0,22.0,0.0,0.0,31.0,907.0,0.0
4,0.0,0.0,0.0,0.0,2015.0,0.0,13.0,22.0,0.0,0.0,31.0,1429.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
302891,0.0,0.0,0.0,0.0,2015.0,0.0,13.0,22.0,0.0,0.0,31.0,1498.0,0.0
302892,0.0,0.0,0.0,0.0,2014.0,0.0,13.0,22.0,0.0,0.0,31.0,1369.0,0.0
302893,0.0,0.0,0.0,0.0,2011.0,0.0,13.0,22.0,0.0,0.0,31.0,515.0,0.0
302894,0.0,0.0,0.0,0.0,2015.0,0.0,1.0,3.0,0.0,0.0,2.0,1306.0,0.0


This returns a DataFrame with the same index and columns as the training set, but all the dropped columns are filled with zeros. We can find the selected columns by choosing features where the variance is non-zero.

In [70]:
# Dropped columns have values of all 0s, so var is 0, drop them
selected_columns = select_features.columns[select_features.var()!=0]
#get the dataset with the selected features
valid[selected_columns].head()


Unnamed: 0,year,currency,country,currency_country,count_7_days
302896,2015,13,22,31,1534.0
302897,2013,13,22,31,625.0
302898,2014,5,9,18,851.0
302899,2014,13,22,31,1973.0
302900,2014,5,9,18,2163.0
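The variance check above would miss a kept column that happens to be constant in the training set. As an alternative, a fitted selector exposes a boolean mask of the kept columns through .get_support(). Here is a minimal sketch on a made-up DataFrame (toy, signal_a, signal_b, noise_a and noise_b are hypothetical names, not columns of the Kickstarter data):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_samples = 500

# Hypothetical data: two informative columns and two noise columns
toy = pd.DataFrame({
    'signal_a': rng.normal(size=n_samples),
    'signal_b': rng.normal(size=n_samples),
    'noise_a': rng.normal(size=n_samples),
    'noise_b': rng.normal(size=n_samples),
})
toy['outcome'] = ((toy['signal_a'] + toy['signal_b']) > 0).astype(int)

# Fit the selector on the training rows only to avoid leakage
toy_train, toy_valid = toy.iloc[:400], toy.iloc[400:]
toy_feature_cols = toy.columns.drop('outcome')
toy_selector = SelectKBest(f_classif, k=2)
toy_selector.fit(toy_train[toy_feature_cols], toy_train['outcome'])

# get_support() returns a boolean mask aligned with the input columns
toy_selected_columns = toy_feature_cols[toy_selector.get_support()]

# Apply the same column choice to the validation rows
toy_valid_selected = toy_valid[toy_selected_columns]
print(list(toy_selected_columns))
```

The same mask can be applied to the test set, so all three splits end up with identical columns.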


### L1 Regularization

Univariate methods consider only one feature at a time when making a selection decision. Instead, we can make our selection using all of the features by including them in a linear model with L1 regularization. This type of regularization (sometimes called Lasso) penalizes the absolute magnitude of the coefficients, as compared to L2 (Ridge) regression which penalizes the square of the coefficients.

As the strength of regularization is increased, the coefficients of features which are less important for predicting the target are set to 0. This allows us to perform feature selection by adjusting the regularization parameter. We choose the parameter by finding the best performance on a hold-out set, or decide ahead of time how many features to keep.

For regression problems you can use sklearn.linear_model.Lasso, or sklearn.linear_model.LogisticRegression for classification. These can be used along with sklearn.feature_selection.SelectFromModel to select the non-zero coefficients. Otherwise, the code is similar to the univariate tests.
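To see how the regularization strength controls the number of surviving features, here is a minimal sketch on made-up data (toy_X, toy_y and the column names are hypothetical). Note that in scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so smaller values of C mean a stronger penalty.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples = 1000

# Hypothetical data: the target depends on two of the six columns
toy_columns = ['signal_a', 'signal_b', 'noise_a', 'noise_b', 'noise_c', 'noise_d']
toy_X = pd.DataFrame(rng.normal(size=(n_samples, 6)), columns=toy_columns)
toy_y = ((toy_X['signal_a'] + toy_X['signal_b']) > 0).astype(int)

# Fit the same L1-penalized model with progressively stronger regularization
nonzero_by_C = {}
for C in [1.0, 0.01, 0.0001]:
    clf = LogisticRegression(C=C, penalty='l1', solver='liblinear').fit(toy_X, toy_y)
    kept = toy_X.columns[clf.coef_[0] != 0]
    nonzero_by_C[C] = list(kept)
    print(f"C={C}: {len(kept)} non-zero coefficients -> {list(kept)}")
```

With a weak penalty the informative columns keep non-zero coefficients, and with a strong enough penalty every coefficient is driven to zero. The useful values of C sit somewhere in between and depend on the data.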

In [71]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

In [79]:
train,valid,_  = get_data_splits(baseline_data)
X,y = train[train.columns.drop('outcome')],train['outcome']
# C is the inverse of the regularization strength (smaller C = stronger penalty); here C=1
logistic = LogisticRegression(C=1, penalty='l1', solver='liblinear').fit(X, y)
model = SelectFromModel(logistic,prefit=True)
X_new = model.transform(X)
X_new

array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01,
        1.409e+03],
       [3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01,
        9.570e+02],
       [4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01,
        7.390e+02],
       ...,
       [2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01,
        5.150e+02],
       [2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00,
        1.306e+03],
       [2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01,
        1.084e+03]])

Similar to the univariate tests, we get back an array with the selected features. Again, we will want to convert these to a DataFrame so we can get the selected columns.

In [80]:
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new), 
                                 index=X.index,
                                 columns=X.columns)

# Dropped columns have values of all 0s, keep other columns 
selected_columns = selected_features.columns[selected_features.var() != 0]

In [82]:
# Show the columns kept by the L1 model (selected_columns, built from selected_features above)
selected_columns


In this case with the L1 parameter C=1, we're dropping the time_since_last_project column.

In general, feature selection with L1 regularization is more powerful than the univariate tests, but it can also be very slow when you have a lot of data and a lot of features. Univariate tests will be much faster on large datasets, but will also likely perform worse.
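Earlier we said the regularization parameter can be chosen by its performance on a hold-out set. Here is a minimal sketch of that loop on made-up data (toy_X, toy_y, the candidate values of C and the 800/200 split are all assumptions for illustration, not settings tuned for the Kickstarter data):

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples = 1000

# Hypothetical data: a noisy target driven by two of the six columns
toy_columns = ['signal_a', 'signal_b', 'noise_a', 'noise_b', 'noise_c', 'noise_d']
toy_X = pd.DataFrame(rng.normal(size=(n_samples, 6)), columns=toy_columns)
noise = rng.normal(scale=0.5, size=n_samples)
toy_y = ((toy_X['signal_a'] + toy_X['signal_b'] + noise) > 0).astype(int)

# Hold out the last 200 rows for validation
X_tr, X_va = toy_X.iloc[:800], toy_X.iloc[800:]
y_tr, y_va = toy_y.iloc[:800], toy_y.iloc[800:]

results = []
for C in [0.001, 0.01, 0.1, 1.0]:
    clf = LogisticRegression(C=C, penalty='l1', solver='liblinear').fit(X_tr, y_tr)
    kept = X_tr.columns[SelectFromModel(clf, prefit=True).get_support()]
    if len(kept) == 0:
        continue  # the penalty zeroed out every feature, nothing to evaluate
    valid_pred = clf.predict_proba(X_va)[:, 1]
    auc = metrics.roc_auc_score(y_va, valid_pred)
    results.append((C, list(kept), auc))
    print(f"C={C}: kept {list(kept)}, validation AUC {auc:.4f}")

# Keep the setting with the best validation score
best_C, best_columns, best_auc = max(results, key=lambda r: r[2])
```

In practice you would score the downstream model (here, the LightGBM model) on the selected columns rather than the logistic regression itself, and only touch the test set once the choice is final.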