<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-data" data-toc-modified-id="Import-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import data</a></span></li><li><span><a href="#Tidying-data" data-toc-modified-id="Tidying-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Tidying data</a></span><ul class="toc-item"><li><span><a href="#Data-inspection" data-toc-modified-id="Data-inspection-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Data inspection</a></span></li></ul></li></ul></div>

**Applied Statistics**<br/>
Prof. Dr. Jan Kirenz <br/>
Hochschule der Medien Stuttgart

In [28]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


#import glmnet as gln
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale 
from sklearn import model_selection
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error

%matplotlib inline
plt.style.use('seaborn-white')

# Assignment: Lasso (least absolute shrinkage and selection operator)


Lasso performs L1 regularization, i.e. it adds a penalty proportional to the sum of the absolute values of the coefficients.
Minimization objective = RSS + α * (sum of absolute values of coefficients).

α (alpha) controls the trade-off between minimizing the RSS and shrinking the magnitude of the coefficients. α can take various values:
  
  - α = 0: same coefficients as ordinary least squares linear regression
  - α = ∞: all coefficients are zero (the penalty dominates the RSS)
  - 0 < α < ∞: coefficients are shrunk towards zero compared with linear regression, and some can become exactly zero
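The three cases above can be checked on a small example. The following is a minimal sketch on synthetic data; the names `X_demo`, `y_demo` and the chosen alpha values are assumptions for illustration and are not part of the assignment. Note that scikit-learn's `Lasso` minimizes RSS / (2 * n_samples) + α * (sum of absolute values of coefficients), so its α is on a different scale than in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption): two informative features, one pure noise feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Reference: ordinary least squares (the alpha = 0 case)
ols_coef = LinearRegression().fit(X_demo, y_demo).coef_

# Very small alpha: coefficients stay close to the least squares solution
small_alpha_coef = Lasso(alpha=1e-4, max_iter=10000).fit(X_demo, y_demo).coef_

# Very large alpha: the penalty dominates and every coefficient is set to zero
large_alpha_coef = Lasso(alpha=100.0, max_iter=10000).fit(X_demo, y_demo).coef_

print(ols_coef)
print(small_alpha_coef)
print(large_alpha_coef)
```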

## Import data

In [10]:
# Load the csv data files into pandas dataframes
PATH = '/Users/jankirenz/Dropbox/Data/' 
df = pd.read_csv(PATH + 'auto.csv')

## Tidying data

### Data inspection

First of all, let's take a look at the variables (columns) in the data set.

In [11]:
# show all variables in the data set
df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'year', 'origin', 'name'],
      dtype='object')

In [12]:
# show the first 5 rows (i.e. head of the DataFrame)
df.head(5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg             397 non-null float64
cylinders       397 non-null int64
displacement    397 non-null float64
horsepower      397 non-null object
weight          397 non-null int64
acceleration    397 non-null float64
year            397 non-null int64
origin          397 non-null int64
name            397 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
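The `info()` output lists `horsepower` as `object` rather than a numeric type, which usually means that some entries are not numbers. One possible way to handle this is sketched below on a tiny hypothetical frame (`sample` is made up for illustration, and the `'?'` placeholder is an assumption about how missing values are encoded): convert the column with `pd.to_numeric` and drop the rows that could not be parsed.

```python
import pandas as pd

# Hypothetical mini data set with a '?' placeholder for a missing value
sample = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130", "?", "95"],
})

# errors="coerce" turns every entry that cannot be parsed into NaN
sample["horsepower"] = pd.to_numeric(sample["horsepower"], errors="coerce")

# Drop the rows where horsepower is missing and rebuild a clean index
clean = sample.dropna(subset=["horsepower"]).reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

Dropping rows is only one option; imputing the missing values would keep all observations, at the cost of an extra modelling assumption.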


In [25]:
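import statsmodels.formula.api as smf  # smf was never imported
# patsy has no 'mpg ~ .' shorthand; list predictors explicitly (horsepower omitted: it is dtype object)
lm = smf.ols(formula='mpg ~ cylinders + displacement + weight + acceleration + year + origin', data=df).fit()
lm.summary()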

In [6]:
#dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
#dummies.info()
#print(dummies.head())


In [29]:
y = df.mpg
# Drop the response variable (mpg) and the car name, which is a text label rather than a numeric predictor
X = df.drop(['mpg', 'name'], axis=1)



#X = df.drop(['mpg'], axis=1).astype('float64')
# Define the feature set X.
#X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
#X.info()

In [34]:
X.head(5)

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,year,origin
0,8,307.0,130,3504,12.0,70,1
1,8,350.0,165,3693,11.5,70,1
2,8,318.0,150,3436,11.0,70,1
3,8,304.0,150,3433,12.0,70,1
4,8,302.0,140,3449,10.5,70,1


In [45]:
# Split the data set into a training set (70 %) and a test set (30 %)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(f'Training set size: {len(X_train)}')
print(f'Test set size: {len(X_test)}')

Training set size: 277
Test set size: 120


# Lasso

In [24]:
# define alphas
alphas =  [1e-10, 1e-5, 1e-3, 1, 5, 10]


[1e-10, 1e-05, 0.001, 1, 5, 10]

In [48]:
alphas = 10**np.linspace(10,-2,100)*0.5

lasso = Lasso(max_iter=10000)
coefs = []

for a in alphas*2:
    lasso.set_params(alpha=a)
    lasso.fit(scale(X_train), y_train)
    coefs.append(lasso.coef_)


ValueError: could not convert string to float: '?'

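The loop fails, most likely because `horsepower` contains the placeholder string '?', so no coefficient paths are computed and there is no plot to inspect. With purely numeric predictors, a plot of `coefs` against `alphas` would show the Lasso coefficients growing in magnitude as alpha decreases.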
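A coefficient path of this kind can be reproduced on synthetic numeric data. The sketch below is only an illustration under stated assumptions: `X_demo`, `y_demo`, `demo_alphas` and the true coefficients are made up, and the alpha grid is narrower than the one used in the notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (assumption): one strong, one weak and one noise feature
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 4.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Standardize the features so the penalty treats them equally
X_scaled = StandardScaler().fit_transform(X_demo)

# Log-spaced grid from large to small alpha
demo_alphas = np.logspace(2, -2, 30)

lasso_demo = Lasso(max_iter=10000)
coef_path = []
for a in demo_alphas:
    lasso_demo.set_params(alpha=a)
    lasso_demo.fit(X_scaled, y_demo)
    coef_path.append(lasso_demo.coef_.copy())
coef_path = np.array(coef_path)  # shape: (number of alphas, number of features)

# Number of non-zero coefficients for each alpha
n_nonzero = (coef_path != 0).sum(axis=1)
print(n_nonzero)
```

Passing `demo_alphas` and `coef_path` to `plt.semilogx` would draw one line per feature.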

In [50]:
lassocv = LassoCV(alphas=None, cv=10, max_iter=10000)
lassocv.fit(X_train, y_train.values.ravel())

ValueError: could not convert string to float: '?'

In [51]:
lassocv.alpha_

AttributeError: 'LassoCV' object has no attribute 'alpha_'
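The `AttributeError` follows from the failed `fit` call: `alpha_` is only set once fitting has succeeded. A minimal sketch on synthetic numeric data (all names and values here are assumptions for illustration) shows the intended workflow of cross-validated alpha selection.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (assumption): two of five features carry signal
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(150, 5))
y_demo = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 3] + rng.normal(scale=0.5, size=150)

X_scaled = StandardScaler().fit_transform(X_demo)

lassocv_demo = LassoCV(cv=10, max_iter=10000)

# Before fitting, the estimator has no alpha_ attribute
has_alpha_before_fit = hasattr(lassocv_demo, "alpha_")

# 10-fold cross-validation over an automatically generated alpha grid
lassocv_demo.fit(X_scaled, y_demo)
best_alpha = lassocv_demo.alpha_

print(has_alpha_before_fit, best_alpha)
```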

In [13]:
lasso.set_params(alpha=lassocv.alpha_)
lasso.fit(scale(X_train), y_train)
mean_squared_error(y_test, lasso.predict(scale(X_test)))

99832.53772263639
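Calling `scale` separately on the training and the test data standardizes each set with its own mean and standard deviation. A common alternative, sketched here on synthetic data with made-up names and an arbitrary `alpha=0.1`, is a pipeline that learns the scaling on the training set only and reuses it for the test set.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (assumption), deliberately not centred at zero
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(300, 4))
y_demo = 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 2] + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The scaler is fitted on the training data inside the pipeline;
# predict() applies the same training mean and standard deviation to X_te
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10000))
model.fit(X_tr, y_tr)

test_mse = mean_squared_error(y_te, model.predict(X_te))
print(test_mse)
```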

In [14]:
# Some of the coefficients are now reduced to exactly zero.
pd.Series(lasso.coef_, index=X.columns)

AtBat         -308.252956
Hits           234.922245
HmRun           -0.000000
Runs            29.777512
RBI             84.187270
Walks           97.125243
Years          -58.147623
CAtBat          -0.000000
CHits          159.949342
CHmRun          54.104078
CRuns          108.887617
CRBI            72.229754
CWalks        -150.193730
PutOuts         87.468140
Assists         38.289197
Errors          -7.028023
League_N        24.385367
Division_W     -60.186400
NewLeague_N      0.000000
dtype: float64